Publication record · 18.cifr/2022.dao.flash-attention
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce HBM reads/writes. FlashAttention trains Transformers 15% faster on BERT-large and 3× faster on GPT-2 compared to existing baselines.
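A minimal sketch of the tiling idea, assuming a NumPy setting: exact attention is computed one key/value block at a time while carrying running softmax statistics (per-row max and normalizer), so the full N×N score matrix is never materialized. The function name and block size are illustrative choices, and only K/V are tiled here; the paper's actual method fuses the whole computation into a single GPU kernel and tiles Q as well.

```python
# Illustrative sketch of tiled exact attention with an online softmax.
# Not the paper's fused CUDA kernel; names and block size are assumptions.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact softmax attention computed one key/value block at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of logits per query row
    row_sum = np.zeros(N)           # running softmax normalizer per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]        # (B, d) key block
        Vb = V[start:start + block_size]        # (B, d) value block
        S = (Q @ Kb.T) * scale                  # (N, B) logits for this block

        new_max = np.maximum(row_max, S.max(axis=1))
        # Rescale the previously accumulated output and normalizer
        # to the new running max before folding in this block.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])        # block softmax numerator
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against a naive quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

Because the running statistics are corrected after every block, the result matches standard attention exactly; the memory saving comes from only ever holding an N×B slice of the score matrix.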
The authors flag compiler-based generation of IO-aware attention implementations as a way to replace hand-written CUDA kernels. Extending tiling to multi-query attention, rotary embeddings, and other quadratic pairwise operations (cross-attention, kernel methods) is an implied next step.