Machine Learning FAQ
What is FlashAttention, and why did it matter so much for LLM training speed?
FlashAttention is a memory-efficient way to compute exact attention. It mattered so much because standard attention spends much of its time moving large intermediate tensors in and out of GPU memory, and that traffic becomes a major bottleneck in LLM workloads.
The key point is that FlashAttention does not change what attention means mathematically; it computes the same result. What changes is how the computation is organized: the work is tiled so that the large intermediate matrices never have to be fully materialized and moved through high-bandwidth memory.
That is a big deal because modern GPUs are often limited less by raw arithmetic and more by memory traffic.
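A minimal sketch of that equivalence, assuming PyTorch 2.x and small illustrative tensor shapes: the naive formulation builds the full score matrix explicitly, while the fused path returns the same output without exposing that intermediate (whether FlashAttention specifically is used depends on the backend and hardware).

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 128, 64          # batch, heads, sequence length, head dim (illustrative)
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# Naive attention: explicitly materializes the (B, H, T, T) score matrix.
scores = (q @ k.transpose(-2, -1)) / (D ** 0.5)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: same math, but the backend can tile the computation
# (FlashAttention on supported GPUs) instead of writing the score matrix out.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-5))  # True, up to small numerical differences
```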

The repo’s training-speed material shows this clearly. Replacing the from-scratch attention implementation with PyTorch’s optimized attention path using FlashAttention produced one of the largest speed jumps and one of the largest memory drops in the whole optimization sequence.
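As a rough illustration of what such a swap can look like (a hypothetical sketch, not the repo's actual code; the module name and layout here are assumptions), the manual score/softmax/matmul sequence is replaced by a single call to PyTorch's fused scaled_dot_product_attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))

        # Before: manual attention that materializes a (B, n_head, T, T) matrix
        # att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # att = att.masked_fill(causal_mask[:T, :T] == 0, float("-inf"))
        # y = torch.softmax(att, dim=-1) @ v

        # After: PyTorch's fused kernel (FlashAttention where supported),
        # which handles the causal mask internally.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

Note that the fused call also applies the causal mask internally, so the explicit (T, T) mask tensor and the masked_fill disappear along with the score matrix, which is part of the memory saving.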
Why was the impact so large?
- attention is one of the most expensive parts of the model
- its intermediate score matrix grows quadratically with sequence length, so long sequences make it especially costly (a rough estimate follows this list)
- fusing the whole attention computation into one kernel avoids repeated round trips through high-bandwidth memory
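
A back-of-the-envelope estimate of that quadratic growth (the batch, head, and sequence-length values are illustrative assumptions, not measurements from the repo):

```python
def score_matrix_gib(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> float:
    """Size of one (batch, heads, T, T) attention score tensor in GiB, assuming fp16/bf16."""
    return batch * heads * seq_len * seq_len * bytes_per_el / 2**30

for T in (1024, 4096, 16384):
    print(f"T={T:>6}: {score_matrix_gib(batch=8, heads=16, seq_len=T):8.2f} GiB")
# T=  1024:     0.25 GiB
# T=  4096:     4.00 GiB
# T= 16384:    64.00 GiB
```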

So FlashAttention mattered not because it changed the transformer architecture, but because it improved the efficiency of one of its most expensive inner loops.
This is a good example of a broader lesson in modern LLM engineering: once the model architecture is reasonably strong, systems-level improvements can unlock huge practical gains.
In short, FlashAttention is a memory-efficient implementation of attention that reduces expensive intermediate memory traffic, and it mattered so much because attention is a central bottleneck in LLM training and inference.