Why treat LLM inference as batched kernels that round-trip through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD's Alveo U55C FPGA. The system introduces an iterative tensor ("itensor") type to encode the tiling and streaming order of tensors, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× that of a GPU baseline and up to 1.99× higher energy efficiency.

What does StreamTensor do?
StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design so that intermediate tiles largely avoid off-chip DRAM round-trips: results stay on chip through streaming and fusion, are forwarded through on-chip FIFOs to downstream kernels, and DMAs are inserted only when required. The compiler's central abstraction, iterative tensors (itensors), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so that stalls and deadlock are avoided while on-chip memory is minimized.
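To make the streaming idea concrete, here is a minimal, hypothetical Python sketch (ITensorSpec and needs_converter are illustrative names, not StreamTensor's actual API) of how an itensor-style type can turn stream compatibility into an explicit, checkable property:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ITensorSpec:
    """Illustrative stand-in for an iterative-tensor type: it records how a
    logical tensor is tiled and in what order tiles are streamed between kernels."""
    shape: tuple[int, ...]        # logical tensor shape
    tile: tuple[int, ...]         # tile (block) shape streamed per step
    loop_order: tuple[int, ...]   # traversal order over tile indices, outer -> inner

def needs_converter(producer: ITensorSpec, consumer: ITensorSpec) -> bool:
    """A re-tiling/reordering converter is only required when the producer's
    stream layout disagrees with what the consumer expects."""
    return producer.tile != consumer.tile or producer.loop_order != consumer.loop_order

# Example: a matmul streams 64x64 tiles row-major, but the next kernel
# traverses tiles column-major, so a layout converter must be generated.
a = ITensorSpec(shape=(1024, 1024), tile=(64, 64), loop_order=(0, 1))
b = ITensorSpec(shape=(1024, 1024), tile=(64, 64), loop_order=(1, 0))
print(needs_converter(a, b))  # True
```

When the check fails, the compiler would materialize a small converter buffer between the two kernels instead of spilling the whole intermediate tensor to DRAM, which is the behavior the article describes.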


What's actually new?
- Hierarchical DSE. The compiler explores three design spaces: (i) tiling/unroll/vectorization/permutation at the Linalg level, (ii) fusion under memory/resource constraints, and (iii) resource allocation/stream widths, optimizing for sustained throughput under bandwidth limits.
- End-to-end PyTorch → device flow. Models enter via Torch-MLIR, are transformed to MLIR Linalg, and then into a dataflow IR whose nodes become hardware kernels with explicit streams and host/runtime glue, with no manual RTL assembly.
- Iterative tensor (itensor) typing system. A first-class tensor type expresses iteration order, tiling, and affine maps. This makes stream order explicit, allows safe kernel fusion, and lets the compiler synthesize minimal buffer/format converters when producers and consumers disagree.
- Formal FIFO sizing. Inter-kernel buffering is solved with a linear-programming formulation that avoids stalls and deadlocks while minimizing on-chip memory usage (BRAM/URAM); a toy version of such a formulation is sketched after this list.
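To illustrate the flavor of the FIFO-sizing step, the following toy linear program (assumed constants and a scipy solver, not the paper's actual formulation) minimizes total buffer bytes while keeping a fork-join bypass FIFO deep enough to avoid deadlock:

```python
# Toy LP in the spirit of LP-based FIFO sizing: minimize on-chip buffer cost
# subject to lower bounds that prevent stalls/deadlock. All constraint values
# below are made up for illustration; the real formulation is derived from the
# dataflow graph and kernel latencies.
import numpy as np
from scipy.optimize import linprog

# Three FIFOs: f0 and f1 sit on the long path of a fork-join, f2 is the bypass.
widths = np.array([64, 64, 32])   # bytes per stream element (cost weights)
min_depth = np.array([2, 2, 2])   # minimal depth for correct handshaking

# Deadlock avoidance: the bypass FIFO must absorb the latency gap of the long
# path, i.e. depth2 >= latency_gap, written as -depth2 <= -latency_gap for
# linprog's A_ub @ x <= b_ub form.
latency_gap = 96
A_ub = np.array([[0.0, 0.0, -1.0]])
b_ub = np.array([-latency_gap])

# Objective: minimize total buffer bytes = sum(width_i * depth_i).
res = linprog(c=widths, A_ub=A_ub, b_ub=b_ub,
              bounds=[(d, None) for d in min_depth], method="highs")
print(res.x)  # -> [ 2.  2. 96.]: only the bypass FIFO is deepened
```

The design point this captures is that buffer depth is spent only where the schedule actually needs slack, rather than uniformly over-provisioning every stream.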
Results
- Latency: as low as 0.76× that of prior FPGA LLM accelerators and 0.64× that of a GPU baseline on GPT-2.
- Energy efficiency: up to 1.99× vs. an A100 on emerging LLMs (model-dependent).
- Platform context: Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2× QSFP28).


The useful contribution here is a PyTorch → Torch-MLIR → dataflow compiler that emits stream-scheduled kernels and a host/runtime for AMD's Alveo U55C; the iterative tensor type plus linear-programming-based FIFO sizing enables safe inter-kernel streaming rather than DRAM round-trips. On reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team shows geometric-mean latency as low as 0.64× that of a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: the Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.
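For readers who want to see where such a flow begins, a minimal front-end sketch is shown below; it stops at Linalg-on-tensors, and everything StreamTensor does after that (dataflow IR, kernel generation, FIFO sizing) is not a public API. The torch_mlir.compile entry point and the "linalg-on-tensors" output type follow older torch-mlir releases and may differ in current ones.

```python
# Sketch of the front half of the flow: PyTorch module -> Torch dialect ->
# Linalg-on-tensors MLIR. Assumes an older torch-mlir release that exposes
# torch_mlir.compile; newer releases organize this entry point differently.
import torch
import torch_mlir

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.nn.functional.gelu(self.fc(x))

module = torch_mlir.compile(
    TinyMLP().eval(),
    torch.randn(1, 128),
    output_type="linalg-on-tensors",  # the Linalg level where tiling/fusion DSE would apply
)
print(module)  # MLIR text of the Linalg-on-tensors IR
```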

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

