Publication record · 18.cifr/2022.dao.flash-attention
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce HBM reads/writes. FlashAttention trains Transformers 15% faster on BERT-large and 3× faster on GPT-2 compared to existing baselines.
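A minimal sketch of the tiling idea, assuming a NumPy setting: exact attention is computed one key/value block at a time while carrying running softmax statistics (per-row max and normalizer), so the full N×N score matrix is never materialized. The function name and block size are illustrative choices, and only K/V are tiled here; the paper's actual method fuses the whole computation into a single GPU kernel and tiles Q as well.

```python
# Illustrative sketch of tiled exact attention with an online softmax.
# Not the paper's fused CUDA kernel; names and block size are assumptions.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact softmax attention computed one key/value block at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, d))
    row_max = np.full(N, -np.inf)   # running max of logits per query row
    row_sum = np.zeros(N)           # running softmax normalizer per row

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]        # (B, d) key block
        Vb = V[start:start + block_size]        # (B, d) value block
        S = (Q @ Kb.T) * scale                  # (N, B) logits for this block

        new_max = np.maximum(row_max, S.max(axis=1))
        # Rescale the previously accumulated output and normalizer
        # to the new running max before folding in this block.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])        # block softmax numerator
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against a naive quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

Because the running statistics are corrected after every block, the result matches standard attention exactly; the memory saving comes from only ever holding an N×B slice of the score matrix.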
The authors flag compiler-based generation of IO-aware attention implementations as a way to replace hand-written CUDA kernels. Extending tiling to multi-query attention, rotary embeddings, and other quadratic pairwise operations (cross-attention, kernel methods) is an implied next step.