Why treat LLM inference as batched kernels that round-trip through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type that encodes the tiling and iteration order of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of GPU baselines and energy efficiency up to 1.99× higher.

What does StreamTensor do?
StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design in which intermediate tiles largely avoid off-chip DRAM round-trips: results are forwarded through on-chip FIFOs to downstream kernels via streaming and fusion, and DMAs are inserted only when required. The compiler's central abstraction, iterative tensors (itensors), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so that streams neither stall nor deadlock while minimizing on-chip memory.
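To make the itensor idea concrete, here is a minimal Python sketch (hypothetical, not the paper's implementation; the ITensor class, its field names, and the matmul example below are illustrative assumptions): a descriptor that records tiling and streaming order is enough for a compiler to decide whether a producer and consumer can be wired with a plain FIFO or need a generated layout converter.

```python
# Minimal sketch (not the paper's implementation): a hypothetical "itensor"
# descriptor that records how a kernel streams a tensor -- its tile shape and
# the order in which tiles are produced -- so a compiler can decide whether
# two kernels can be connected by a FIFO directly or need a layout converter.
from dataclasses import dataclass


@dataclass(frozen=True)
class ITensor:
    shape: tuple        # logical tensor shape, e.g. (rows, cols)
    tile: tuple         # tile size streamed per transaction
    loop_order: tuple   # order in which tile indices are iterated, e.g. ("i", "j")

    def stream_compatible(self, other: "ITensor") -> bool:
        """A producer and consumer can share a plain FIFO only if they agree
        on tensor shape, tiling, and the order tiles appear on the stream."""
        return (self.shape == other.shape
                and self.tile == other.tile
                and self.loop_order == other.loop_order)


# Example: a matmul producer streams 64x64 tiles row-major, but the next
# kernel expects them column-major -> the compiler must insert a converter.
producer = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=("i", "j"))
consumer = ITensor(shape=(1024, 1024), tile=(64, 64), loop_order=("j", "i"))
print("insert layout converter:", not producer.stream_compatible(consumer))  # True
```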


What's actually new?
- Hierarchical DSE. The compiler explores three design spaces: (i) tiling/unroll/vectorization/permutation at the Linalg level, (ii) fusion under memory/resource constraints, and (iii) resource allocation/stream widths, optimizing for sustained throughput under bandwidth limits.
- End-to-end PyTorch → device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue, with no manual RTL assembly.
- Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers and consumers disagree.
- Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation to avoid stalls/deadlocks while minimizing on-chip memory usage (BRAM/URAM); see the sketch after this list.
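To illustrate the FIFO-sizing step, here is a minimal sketch under assumed constraints (the edge names, stream widths, minimum depths, and 300-cycle skew are invented for illustration, and the paper's actual formulation may differ): a small linear program minimizes total buffer bits subject to per-edge depth lower bounds plus a joint constraint that keeps a reconvergent join from deadlocking.

```python
# Minimal sketch (assumed formulation, not the paper's exact model): choose
# FIFO depths with a linear program that minimizes total on-chip buffer bits.
# Each edge needs a small depth to decouple its producer and consumer, and the
# short branch of a reconvergent pair must absorb the long branch's pipeline
# latency, otherwise the join kernel can deadlock.
import numpy as np
from scipy.optimize import linprog

# Hypothetical dataflow edges: (name, stream width in bits, per-edge min depth).
edges = [
    ("attn->residual_short", 512, 8),   # short path straight to the residual add
    ("attn->mlp", 512, 8),              # long path, through the MLP block
    ("mlp->residual_long", 512, 8),
]
widths = np.array([w for _, w, _ in edges], dtype=float)
dmin = np.array([d for _, _, d in edges], dtype=float)

# Joint constraint: depth on the short path must cover the long path's assumed
# 300-cycle latency skew, expressed as  -d0 <= -300  for linprog's A_ub form.
A_ub = np.array([[-1.0, 0.0, 0.0]])
b_ub = np.array([-300.0])

c = widths                               # cost: bits contributed per unit depth
bounds = [(d, None) for d in dmin]       # per-edge lower bounds, no upper bound
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

for (name, w, _), depth in zip(edges, res.x):
    print(f"{name}: depth = {int(round(depth))} words ({int(round(depth * w))} bits)")
```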
Results
- Latency: as low as 0.76× that of prior FPGA LLM accelerators and 0.64× that of a GPU baseline on GPT-2.
- Energy efficiency: up to 1.99× vs. an A100 on emerging LLMs (model-dependent).
- Platform context: Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2× QSFP28).


The useful contribution here is a PyTorch → Torch-MLIR → dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD's Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team shows geometric-mean latency as low as 0.64× vs. a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: the Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.
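For the front end, a rough sketch of how a PyTorch module enters such a flow might look as follows (hedged: this uses the torch_mlir.compile entry point, whose availability depends on the torch-mlir version; TinyMLP is an illustrative module, and the StreamTensor-specific dataflow lowering that would follow is not shown here):

```python
# Minimal front-end sketch: export a small PyTorch module to MLIR's Linalg
# dialect via Torch-MLIR, the level at which the tiling/fusion decisions
# described above would apply. The downstream dataflow/stream scheduling is
# StreamTensor-specific and not reproduced here.
import torch
import torch_mlir  # assumes a torch-mlir version exposing torch_mlir.compile


class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(64, 256)
        self.fc2 = torch.nn.Linear(256, 64)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))


example = torch.randn(1, 64)
module = torch_mlir.compile(TinyMLP(), example, output_type="linalg-on-tensors")
print(module)  # MLIR module in the Linalg-on-tensors form
```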

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

