Publication record · 18.cifr/2024.jiang.mixtral-sparse-moe
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.
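The routing scheme described above can be summarized in a short sketch. The following is a minimal PyTorch-style illustration of a top-2 sparse MoE feedforward layer, not the authors' implementation; the names (`SparseMoELayer`, `n_experts`, `top_k`) are illustrative, and the experts are plain MLPs rather than the SwiGLU blocks used in Mistral 7B.

```python
# Minimal sketch of a top-2 sparse MoE feedforward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear layer producing a logit per expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Experts: independent feedforward blocks (plain MLPs here for brevity).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                        # (n_tokens, n_experts)
        weights, selected = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                # normalize over the chosen experts
        out = torch.zeros_like(tokens)
        # Weighted sum of the two selected experts' outputs for each token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)
```

A quick usage example with illustrative dimensions: `layer = SparseMoELayer(d_model=512, d_ff=2048)` followed by `y = layer(torch.randn(2, 16, 512))`. Only the two selected experts run per token, which is what keeps the active parameter count far below the total.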
Expert specialization patterns across languages and domains deserve deeper analysis. Extending SMoE to attention layers (not just FFN) is a natural next step. Load balancing across experts remains a known training challenge. Variable-k routing and dynamic expert counts per layer are promising directions noted by the authors.
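On the load-balancing point: a common auxiliary objective (in the style of Shazeer et al., 2017 and Fedus et al., 2021) penalizes mismatch between the fraction of tokens dispatched to each expert and the mean router probability per expert. The Mixtral paper does not spell out its balancing objective, so the sketch below illustrates that standard loss, not the authors' training code.

```python
# Hedged sketch of a standard load-balancing auxiliary loss (illustrative only).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: (n_tokens, n_experts) raw router scores for one layer."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)                  # full softmax over experts
    # f_i: fraction of tokens for which expert i is among the top-k choices.
    top_k_idx = router_logits.topk(top_k, dim=-1).indices     # (n_tokens, top_k)
    dispatch = F.one_hot(top_k_idx, n_experts).sum(dim=1).float()
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability assigned to expert i.
    p = probs.mean(dim=0)
    # Encourages uniform f and p; equals 1 at perfect balance, larger when skewed.
    return n_experts * torch.sum(f * p)
```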