Publication record · 18.cifr/2024.jiang.mixtral-sparse-moe
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.
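The routing scheme described above can be summarized in a short sketch. The following is a minimal PyTorch-style illustration of a top-2 sparse MoE feedforward layer, not the authors' implementation; the names (`SparseMoELayer`, `n_experts`, `top_k`) are illustrative, and the experts are plain MLPs rather than the SwiGLU blocks used in Mistral 7B.

```python
# Minimal sketch of a top-2 sparse MoE feedforward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear layer producing a logit per expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Experts: independent feedforward blocks (plain MLPs here for brevity).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                        # (n_tokens, n_experts)
        weights, selected = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen experts
        out = torch.zeros_like(tokens)
        # Weighted sum of the two selected experts' outputs for each token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)
```

A quick usage example with illustrative dimensions: `layer = SparseMoELayer(d_model=512, d_ff=2048)` followed by `y = layer(torch.randn(2, 16, 512))`. Only the two selected experts run per token, which is what keeps the active parameter count far below the total.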
Expert specialization patterns across languages and domains deserve deeper analysis. Extending SMoE to attention layers (not just FFN) is a natural next step. Load balancing across experts remains a known training challenge. Variable-k routing and dynamic expert counts per layer are promising directions noted by the authors.
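On the load-balancing point: a common auxiliary objective (in the style of Shazeer et al., 2017 and Fedus et al., 2021) penalizes mismatch between the fraction of tokens dispatched to each expert and the mean router probability per expert. The Mixtral paper does not spell out its balancing objective, so the sketch below illustrates that standard loss, not the authors' training code.

```python
# Hedged sketch of a standard load-balancing auxiliary loss (illustrative only).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts) raw router scores for one layer."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)                  # full softmax over experts
    # f_i: fraction of tokens for which expert i is among the top-k choices.
    top_k_idx = router_logits.topk(top_k, dim=-1).indices     # (n_tokens, top_k)
    dispatch = F.one_hot(top_k_idx, n_experts).sum(dim=1).float()
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # Encourages uniform f and p; equals 1 at perfect balance, larger when skewed.
    return n_experts * torch.sum(f * p)
```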