Publication record · 18.cifr/2017.vaswani.transformer-attention

Attention Is All You Need

v1.0.0

Ashish Vaswani (Google Brain), Noam Shazeer (Google Brain), Niki Parmar (Google Research), Jakob Uszkoreit (Google Research), Llion Jones (Google Research), Aidan N. Gomez (University of Toronto), Lukasz Kaiser (Google Brain), Illia Polosukhin (Independent)

RAI: 18.cifr/2017.vaswani.transformer-attention

Advances in Neural Information Processing Systems (NeurIPS) · 2017 · doi:10.48550/arXiv.1706.03762

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.

cs.CL · cs.LG · transformer · attention · sequence transduction · machine translation

✦ Research context

What this agent contributes to the literature.

Problem solved

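Recurrent encoder-decoder models process tokens sequentially, which prevents parallelization across time steps within a training example and makes training slow and expensive. In both recurrent and convolutional models, the path a signal must travel between distant positions grows with their separation, which makes long-range dependencies harder to learn. The Transformer addresses both problems while achieving higher BLEU scores on the WMT 2014 English-to-German and English-to-French tasks.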

Novelty

Vaswani et al. introduced the Transformer, the first sequence transduction model based entirely on attention mechanisms, with no recurrence or convolutions. Multi-head self-attention and encoder-decoder (cross-) attention allow computation to run in parallel across all positions and connect any two positions through a constant, O(1), number of sequential operations, which prior RNN/CNN encoder-decoder architectures did not achieve.
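This record does not reproduce the paper's equations, so the following is only a minimal sketch of the core operation, assuming the scaled dot-product form softmax(QK^T / sqrt(d_k)) V. It shows a single head with no masking and no learned projections; the function names `softmax` and `scaled_dot_product_attention` are illustrative and are not taken from the agent's `main.py`.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of row vectors.

    Assumption: plain Python lists stand in for the Q, K, V matrices;
    a real implementation would use batched tensor operations.
    Returns (outputs, weights).
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in queries:
        # Each query is scored against every key independently of the
        # other queries, which is what allows parallelism over positions.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of the value rows.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(d_v)]
        )
    return outputs, weights
```

Every query reads every key in one step, so the distance between two positions in the sequence does not change how many operations separate them.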

Canvas contract: 1-in / 1-out · unpacked into sentences, model_params legacy ports

Image digest

sha256:c9153cb414571627ccf9898d1fa31994388e1fef35d3e3678e7edcbb81174d25

Invoke command

python main.py

Inputs

input: application/json

Outputs

output: application/json

Invoke

CPU compute only

input · application/json · optional

Unified canvas input containing sentences and model_params keys

Leave empty to run the paper's canonical scenario.
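The record names only the two top-level keys, `sentences` and `model_params`, and does not publish their schema. The sketch below therefore assumes that `sentences` is a list of strings and `model_params` is a flat object; the helper `build_canvas_input` and the `num_heads` parameter name are hypothetical and may not match what the agent accepts.

```python
import json


def build_canvas_input(sentences, model_params=None):
    """Serialize a payload with the two keys the record names.

    Assumption: `sentences` is a list of strings and `model_params`
    is a plain mapping. The real schema is not documented here.
    """
    if not all(isinstance(s, str) for s in sentences):
        raise TypeError("sentences is assumed to be a list of strings")
    payload = {
        "sentences": list(sentences),
        "model_params": dict(model_params or {}),
    }
    return json.dumps(payload, ensure_ascii=False)


# Hypothetical counterfactual input; "num_heads" is an illustrative name only.
example = build_canvas_input(["The cat sat on the mat."], {"num_heads": 8})
```

How the resulting JSON is passed to `python main.py` is not stated in this record, so the sketch stops at serialization.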
