Mixture of experts

A mixture of experts (MoE) is a machine learning technique that integrates multiple specialized sub-models, referred to as experts, each designed to handle distinct subsets of the input space, coordinated by a gating network that dynamically routes inputs to the most appropriate experts or computes a weighted combination of their outputs. This approach enables efficient scaling of model capacity by activating only a subset of experts per input, thereby reducing computational overhead while maintaining high performance across diverse tasks.

Originally proposed in the early 1990s, MoE emerged as a method for modular learning in neural networks, where experts—typically simple feedforward networks—specialize in local regions of the input space, and the gating mechanism, often a softmax-based classifier, learns to partition the data adaptively during training via gradient-based optimization. The framework addressed limitations of monolithic models by promoting division of labor among components, leading to improved generalization and robustness, as demonstrated in applications like phoneme recognition and vowel classification. Subsequent refinements, such as hierarchical mixtures, extended this to deeper structures for handling complex, multi-level data distributions.

In the era of deep learning, MoE has seen a resurgence, particularly within transformer architectures, where sparsely-gated variants activate only the top-k experts (e.g., k=1 or 2) for each token to enable training of models with trillions of parameters without proportional increases in inference cost. This innovation, introduced in 2017, incorporates auxiliary losses to balance expert utilization and prevent collapse to a few dominant experts. Modern implementations, such as those in large language models (LLMs), leverage MoE layers—often replacing standard feedforward networks post-attention—to achieve state-of-the-art results in natural language processing, with examples including Mixtral 8x7B (2023), which uses 8 experts per layer for efficient multilingual capabilities, and Grok-1 (2024), a 314 billion parameter model employing eight experts (two active per token) for enhanced reasoning. Beyond language modeling, MoE has been adapted for computer vision (e.g., V-MoE for image classification), recommender systems, and multimodal tasks, underscoring its versatility in scaling AI systems.

Introduction

Definition and principles

A mixture of experts (MoE) is a machine learning architecture that builds upon ensemble methods, which combine multiple models to achieve improved predictive performance over individual learners by leveraging their collective strengths and reducing variance or bias. In this framework, the experts are typically specialized neural networks designed to handle distinct subsets of the input space, allowing for modular decomposition of complex modeling tasks into simpler, localized components. At its core, an MoE operates as a probabilistic ensemble in which a gating network dynamically assigns input samples to one or more expert networks, producing an output as a weighted combination of the experts' predictions. The gating network computes probabilities that determine the contribution of each expert, effectively routing inputs based on their features to promote specialization. Mathematically, for an input x, the output y is given by y = \sum_{i=1}^{N} g_i(x) \cdot e_i(x), where e_i(x) is the output of the i-th expert network, g_i(x) is the gating weight (a non-negative probability with \sum_{i=1}^{N} g_i(x) = 1), and N is the number of experts; the weights g_i(x) are often derived via a softmax function over gating scores. The key principles of MoE emphasize modularity, enabling each expert to specialize in a particular region of the input domain through a divide-and-conquer strategy that simplifies learning for complex functions. Conditional computation further enhances efficiency by activating only relevant experts per input, reducing the effective model size during inference while maintaining capacity. Routing can be implemented in soft variants, where multiple experts contribute with overlapping probabilities for smooth transitions, or hard variants, where a single expert (or a small subset) dominates for clearer specialization, depending on the softmax temperature or selection mechanism.
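
To make the weighted-combination formula concrete, the following sketch (an illustrative PyTorch implementation with arbitrary sizes, not taken from any particular published system) builds N small feed-forward experts and a linear gate whose softmax output supplies the weights g_i(x).

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal soft mixture of experts: y = sum_i g_i(x) * e_i(x)."""
    def __init__(self, d_in: int, d_out: int, n_experts: int):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, 4 * d_in), nn.ReLU(), nn.Linear(4 * d_in, d_out))
            for _ in range(n_experts)
        )
        # The gating network produces one logit per expert.
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.softmax(self.gate(x), dim=-1)                         # (batch, n_experts), rows sum to 1
        e = torch.stack([expert(x) for expert in self.experts], dim=1)  # (batch, n_experts, d_out)
        return torch.einsum("bn,bnd->bd", g, e)                         # weighted combination of expert outputs

moe = SoftMoE(d_in=16, d_out=8, n_experts=4)
y = moe(torch.randn(32, 16))   # output shape: (32, 8)
```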

Historical overview

The concept of mixture of experts (MoE) originated in the early 1990s through research on modular neural architectures, in which softmax-based gating mechanisms were explored as a way to enable division of labor across specialized sub-networks. This foundational work emphasized dividing complex tasks among multiple "experts" coordinated by a gating network that softly weights their contributions based on input features, addressing limitations in monolithic neural networks. A seminal advancement came in 1991 with the introduction of adaptive mixtures of local experts by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton, which formalized the framework using gradient-based learning to train both the gating network and expert models. This approach allowed the system to dynamically partition the input space, improving generalization on diverse datasets compared to single-expert models. In the mid-1990s, developments extended to hierarchical structures, notably through Michael I. Jordan and Robert A. Jacobs' work on hierarchical mixtures of experts, which supported modular learning by organizing experts in tree-like topologies for handling nested decision boundaries. These innovations, published in 1994, enhanced scalability for supervised tasks by enabling tree-structured decomposition and EM-based parameter estimation.

MoE experienced a revival in deep learning around 2017, when Noam Shazeer and colleagues at Google proposed sparsely-gated MoE layers to scale neural networks efficiently by activating only a subset of experts per input, as demonstrated in large-scale language models. This conditional computation approach reduced computational overhead while expanding model capacity. Building on this, Dmitry Lepikhin et al. introduced GShard in 2020, applying sparsely-gated MoE to multilingual transformers with over 600 billion parameters, achieving state-of-the-art performance through automatic sharding. The transition to transformer architectures culminated in 2021 with William Fedus et al.'s Switch Transformers, a sparse MoE variant that routed tokens to single experts, enabling trillion-parameter models trained 4-7 times faster than dense counterparts on benchmarks like C4.

Core Components

Gating networks

In mixture of experts (MoE) architectures, the gating network serves as the decision-making component that dynamically assigns input data to appropriate expert models based on learned input-dependent weights. Typically implemented as a small neural network, such as a multi-layer perceptron (MLP), the gating network processes the input vector to produce an N-dimensional vector of logits, one for each expert, which are then normalized to form assignment probabilities. This design allows the gating mechanism to partition the input space adaptively, enabling specialized handling of different data regions without fixed boundaries. Gating functions primarily operate in two modes: soft gating and hard gating. In soft gating, the logits are passed through a softmax function to generate probabilistic weights, ensuring a smooth, weighted combination of expert outputs where all experts contribute to some degree, though dominant ones receive higher weights. The softmax operation is defined as g_i = \frac{\exp(s_i)}{\sum_{j=1}^N \exp(s_j)}, where s_i is the logit for the i-th expert and N is the number of experts; this approach promotes stable training by avoiding abrupt switches between experts. Conversely, hard gating selects a subset of experts, often via top-k selection, where only the k highest-scoring experts (based on logits) are activated with equal or weighted contributions, while others receive zero weight; this sparsity enhances computational efficiency in large-scale models. Early implementations in the 1990s relied on MLP-based gating networks for soft probabilistic assignment, as demonstrated in foundational models where the gating network was a simple structure trained to output mixing proportions directly influenced by input features. Training of the gating network occurs jointly with the expert models through backpropagation, allowing end-to-end optimization of the entire system under a standard loss function, such as cross-entropy for classification tasks. To prevent imbalances where certain experts are over- or under-utilized, modern training incorporates auxiliary losses that encourage even distribution of assignments across experts, though these are tuned sparingly to avoid interfering with primary task performance.
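
The two gating modes can be contrasted in a few lines of code. The sketch below is a minimal illustration (assumed shapes and hyperparameters, not a specific library API): soft gating keeps every expert active with softmax weights, while hard top-k gating renormalizes only the k highest-scoring experts and zeroes out the rest.

```python
import torch

def soft_gate(logits: torch.Tensor) -> torch.Tensor:
    """Soft gating: every expert receives a nonzero weight via softmax."""
    return torch.softmax(logits, dim=-1)

def topk_gate(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Hard (top-k) gating: only the k highest-scoring experts get nonzero,
    renormalized weights; all other experts are masked to zero."""
    topk_vals, topk_idx = torch.topk(logits, k, dim=-1)
    weights = torch.softmax(topk_vals, dim=-1)          # normalize over the selected experts only
    gates = torch.zeros_like(logits)
    return gates.scatter(-1, topk_idx, weights)

logits = torch.randn(4, 8)                    # 4 inputs, 8 experts
print(soft_gate(logits).sum(-1))              # each row sums to 1; all experts contribute
print((topk_gate(logits, k=2) > 0).sum(-1))   # exactly 2 active experts per input
```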

Expert models

In mixture of experts (MoE) architectures, expert models serve as specialized sub-networks responsible for processing assigned portions of the input space. These experts are typically implemented as structurally identical feed-forward neural networks, such as multi-layer perceptrons (MLPs), designed to operate effectively within distinct subspaces of the input domain. This design allows each expert to focus on modeling local patterns or features, contributing to the overall system's ability to handle complex, high-dimensional distributions. In some implementations, experts may take the form of convolutional neural networks (CNNs) for tasks involving spatial structure, but the core principle remains the partitioning of computational responsibility across multiple similar structures. The specialization of individual experts arises dynamically during the joint training of the MoE system, where the gating network routes inputs to appropriate experts based on learned affinities. This process encourages each expert to refine its parameters toward proficiency in specific regions of the input space, effectively enabling a divide-and-conquer strategy for approximating non-linear functions that would be challenging for a single monolithic network. Over training iterations, this routing-driven specialization leads to an emergent division of labor among experts, with minimal overlap in their effective coverage areas, enhancing the model's representational capacity without proportional increases in active computation. Regarding parameterization, experts are generally maintained as separate networks to maximize specialization, though variants incorporate shared experts across multiple gating decisions or tasks to reduce redundancy and promote knowledge transfer. With N separate experts each comprising M parameters, the total model capacity reaches N × M parameters, far exceeding that of a dense equivalent while enabling sparse utilization—typically activating only a small fraction (e.g., the top-2 experts per input)—to balance scalability and efficiency. In classical examples, such as the adaptive mixtures proposed by Jacobs, Jordan, Nowlan, and Hinton, experts are often simple local linear models, each approximating the target function within a confined region of the input space so as to collectively cover the global mapping.
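
To make the N × M capacity argument concrete, the following sketch (hypothetical sizes chosen for illustration) builds a bank of structurally identical MLP experts and compares the total parameter count with the parameters actually exercised when only the top-2 experts are active per input.

```python
import torch.nn as nn

def make_expert(d_model: int, d_hidden: int) -> nn.Module:
    # A single feed-forward expert, identical in structure to its siblings.
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

d_model, d_hidden, n_experts, k = 512, 2048, 16, 2
experts = nn.ModuleList(make_expert(d_model, d_hidden) for _ in range(n_experts))

params_per_expert = sum(p.numel() for p in experts[0].parameters())
total_params = n_experts * params_per_expert        # capacity grows with N
active_params = k * params_per_expert               # per-input compute grows only with k
print(f"total: {total_params:,}  active per input: {active_params:,}")
```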

Classical Formulations

Adaptive mixtures of local experts

The adaptive mixture of local experts, introduced in 1991, frames the mixture of experts as a probabilistic model in which a gating network partitions the input space into regions, assigning each input to specialized expert models that approximate local functions, such as linear regressions, within those regions. The overall system output is a weighted sum of the experts' predictions, with weights determined by the gating network's soft assignments, enabling the model to handle complex, nonlinear mappings by combining simple local approximations. Training maximizes the likelihood of the training data, commonly via an expectation-maximization (EM) algorithm that iteratively refines the parameters of both the gating network and the experts. In the E-step, the algorithm computes the posterior responsibilities g_i(x) = P(\text{expert } i \mid x) for each expert given the input x, representing the probability that expert i is responsible for the prediction. These responsibilities are derived from the gating function, typically implemented as a softmax over linear projections: g_i(x) = \frac{\exp(w_i^T x)}{\sum_j \exp(w_j^T x)}, where w_i are the parameters of the gating network for expert i. In the M-step, the experts are updated by minimizing weighted errors using these responsibilities as weights (e.g., weighted least squares for regression experts), while the gating parameters w_i are updated via a weighted multinomial logistic regression to better align assignments with observed outputs. This EM procedure promotes specialization, as experts focus on subsets of the data where the gating assigns high responsibility, reducing interference during learning. Early applications demonstrated the model's efficacy in supervised tasks, including on synthetic two-dimensional data, where mixtures of linear experts outperformed single global models in capturing piecewise-linear functions, and on vowel discrimination datasets, achieving error rates comparable to multilayer perceptrons but with faster convergence due to localized training.
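
A compact NumPy sketch of one way to implement this EM-style procedure is shown below, for a mixture of two linear experts fit to synthetic piecewise-linear data; the noise variance, learning rate, and iteration counts are arbitrary illustrative choices rather than values from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data with two linear regimes (a piecewise-linear target).
X = rng.uniform(-1, 1, size=(500, 1))
y = np.where(X[:, 0] < 0, -2.0 * X[:, 0], 3.0 * X[:, 0]) + 0.05 * rng.standard_normal(500)

Xb = np.hstack([X, np.ones((len(X), 1))])      # add a bias column
n, d = Xb.shape
K, sigma2 = 2, 0.05                             # number of experts, assumed noise variance
W_gate = rng.standard_normal((d, K)) * 0.1      # gating parameters (one column per expert)
W_exp = rng.standard_normal((d, K)) * 0.1       # one linear expert per column

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for it in range(50):
    # E-step: posterior responsibility of each expert for each point.
    g = softmax(Xb @ W_gate)                                   # gate (prior) probabilities
    preds = Xb @ W_exp                                         # (n, K) expert predictions
    lik = np.exp(-(y[:, None] - preds) ** 2 / (2 * sigma2))    # Gaussian likelihood per expert
    r = g * lik + 1e-12
    r /= r.sum(axis=1, keepdims=True)

    # M-step (experts): weighted least squares, one expert at a time.
    for k in range(K):
        Wk = np.diag(r[:, k])
        W_exp[:, k] = np.linalg.solve(Xb.T @ Wk @ Xb + 1e-6 * np.eye(d), Xb.T @ Wk @ y)

    # M-step (gate): a few gradient steps on the responsibility-weighted log-likelihood.
    for _ in range(10):
        g = softmax(Xb @ W_gate)
        W_gate += 0.5 / n * Xb.T @ (r - g)

final_pred = (softmax(Xb @ W_gate) * (Xb @ W_exp)).sum(axis=1)
print("mean squared error:", np.mean((final_pred - y) ** 2))
```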

Hierarchical mixtures

Hierarchical mixtures of experts extend the flat architecture by organizing the experts into a tree structure, enabling multi-level partitioning of the input space. At the root, a top-level gating network divides the input into coarse clusters by computing soft probabilities over sub-mixtures, each of which contains its own gating network and experts. This process recurses down the tree, with gating networks at internal nodes selecting paths to more specialized subtrees or directly to terminal expert models, which produce the final outputs. The overall prediction is a weighted combination of the active experts, determined by the product of gating probabilities along the path from root to leaf.

This tree-based gating provides several advantages over flat mixtures, particularly in handling complex data distributions with varying levels of granularity. Higher-level gates can capture global patterns, while lower-level experts specialize in local regions, allowing the model to adapt representations based on data characteristics rather than fixed assumptions. Computationally, the hierarchy avoids the exponential growth in the number of experts required for fine-grained partitioning in flat models; instead, it scales more efficiently by reusing shared structures across branches, reducing both parameter count and computational cost for deep trees.

Training involves optimizing the gating and expert parameters to maximize the likelihood of the data, with backpropagation extended through the tree to compute gradients for all levels. Error signals propagate backwards from the output, assigning "responsibilities" to experts and gates via the gating probabilities, which helps mitigate credit assignment challenges by localizing updates to relevant branches. However, deep hierarchies can amplify vanishing gradients or uneven credit flow across levels, requiring careful initialization or regularization. An alternative approach uses the expectation-maximization (EM) algorithm, where the E-step computes posterior responsibilities for paths in the tree, and the M-step updates parameters independently for gates and experts, often converging faster than pure gradient methods.

A seminal example is the hierarchical model proposed by Jordan and Jacobs in 1994, applied to modular divide-and-conquer tasks such as learning robot arm dynamics from data. In this setup, the hierarchy partitioned the input space (joint positions, velocities, and torques) into regions corresponding to distinct dynamic regimes, with experts modeling local forward dynamics (mapping joint states and torques to accelerations); the model achieved low error rates on held-out data, outperforming flat mixtures by exploiting the tree structure for scalable specialization.
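
The path-product weighting can be illustrated with a tiny two-level hierarchy. The sketch below is a simplified, hypothetical configuration (linear experts, two branches, three leaves) in which each leaf's effective weight is the product of the top-level and branch-level gating probabilities along its path.

```python
import torch
import torch.nn as nn

class TwoLevelHME(nn.Module):
    """Two-level hierarchical mixture: a leaf expert's weight is the product of the
    top-level and branch-level gating probabilities along its path."""
    def __init__(self, d_in: int, d_out: int, n_branches: int = 2, n_leaves: int = 3):
        super().__init__()
        self.top_gate = nn.Linear(d_in, n_branches)
        self.branch_gates = nn.ModuleList(nn.Linear(d_in, n_leaves) for _ in range(n_branches))
        self.experts = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_leaves))
            for _ in range(n_branches)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g_top = torch.softmax(self.top_gate(x), dim=-1)              # (batch, n_branches)
        out = 0.0
        for b, (gate, leaves) in enumerate(zip(self.branch_gates, self.experts)):
            g_leaf = torch.softmax(gate(x), dim=-1)                  # (batch, n_leaves)
            for l, expert in enumerate(leaves):
                path_weight = g_top[:, b] * g_leaf[:, l]             # product along the path
                out = out + path_weight.unsqueeze(-1) * expert(x)
        return out

hme = TwoLevelHME(d_in=8, d_out=4)
print(hme(torch.randn(5, 8)).shape)   # torch.Size([5, 4])
```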

Other early variants

Meta-pi networks, developed by Hampshire and Waibel in 1992, extend the mixture of experts framework by replacing the standard softmax-based gating with a product-of-sums mechanism. This structure computes the gating function as the product across input modalities of sums over expert contributions within each modality, enabling nonlinear interactions and more flexible partitioning of the input space for robust multisource pattern recognition tasks, such as speech processing in noisy conditions. The design promotes distributed representations across multiple pi-network experts, where each expert specializes in subsets of features, improving overall system resilience to input variations.

Bayesian mixtures of experts, as proposed by Waterhouse et al. in the mid-1990s, integrate probabilistic priors over gating network parameters and expert models to explicitly handle uncertainty in expert allocation and predictions. By applying approximate Bayesian inference techniques, such as the evidence approximation or sampling, the model estimates posterior distributions that account for parameter variability, enabling better generalization on limited data and automatic model selection, including the determination of the optimal number of experts. This approach was particularly valuable for applications requiring confidence estimates, where it outperformed maximum likelihood methods in handling limited or noisy data.

These early variants, while innovative, were constrained by scalability challenges in the pre-deep-learning era, as training multiple interconnected components via iterative procedures such as EM or gradient descent often demanded substantial computational resources unavailable at the time. Issues such as sensitivity to initialization, difficulty in optimizing non-convex objectives, and limited applicability to high-dimensional inputs restricted their adoption beyond toy problems or small-scale supervised tasks.

Modern Deep Learning Implementations

Sparsely-gated MoE layers

Sparsely-gated MoE layers represent an adaptation of the MoE framework to deep neural networks, where traditional dense feed-forward layers are replaced by MoE layers that activate only a small subset of experts for each input. In this setup, an MoE layer consists of a gating network that selects the top-k experts out of N total experts (with k \ll N) to process the input, enabling conditional computation that scales model capacity without proportionally increasing computational cost. This approach allows for models with billions of parameters while keeping active computations manageable.

The core architecture, introduced by Shazeer et al. (2017), features N feed-forward sub-networks and a trainable gating network that computes a sparse gating vector G(x) over the experts for an input x. The layer's output is given by y = \sum_{i=1}^{N} G(x)_i E_i(x), where E_i(x) is the output of the i-th expert, and G(x) is determined via a noisy top-k gating mechanism to promote expert diversity and balanced utilization. Specifically, the gating scores H(x) incorporate additive noise scaled by a learnable factor: H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{\text{noise}})_i), followed by a softmax applied only to the top-k scores after thresholding the rest to zero. This noisy top-k gating with load-balancing noise helps prevent collapse to a few dominant experts during training.

A key efficiency gain arises from the sparse activation: floating-point operations (FLOPs) during training and inference scale with the number of active experts k rather than the total number of experts N, allowing models to hold vastly larger parameter counts—such as 137 billion parameters in the largest MoE layers of Shazeer et al.'s experiments—while maintaining throughput comparable to smaller dense networks. This conditional computation decouples parameter scaling from runtime costs, making sparsely-gated layers suitable for resource-constrained environments.

In practice, these layers are integrated by stacking multiple MoE layers in place of dense feed-forward blocks within deep architectures, such as between recurrent layers in sequence models or within transformer blocks in later adaptations. For instance, Shazeer et al. (2017) applied them convolutionally between stacked LSTM layers for language modeling and machine translation tasks, demonstrating effective scaling of network depth and width.
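
The noisy top-k gating equations translate fairly directly into code. The PyTorch sketch below follows the published formulas but omits details such as the auxiliary load-balancing losses, so it should be read as an approximation rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Parameter(torch.zeros(d_model, n_experts))
        self.w_noise = nn.Parameter(torch.zeros(d_model, n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean = x @ self.w_gate
        noise_std = F.softplus(x @ self.w_noise)
        # H(x)_i = (x . W_g)_i + StandardNormal() * Softplus((x . W_noise)_i)
        h = (clean + torch.randn_like(clean) * noise_std) if self.training else clean
        topk_vals, topk_idx = torch.topk(h, self.k, dim=-1)
        gates = torch.zeros_like(h)
        # Softmax over the retained top-k scores; all other experts stay at zero.
        gates.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
        return gates

class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = NoisyTopKGate(d_model, n_experts, k)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = self.gate(x)                      # (batch, n_experts), mostly zeros
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = gates[:, i] > 0
            if mask.any():                        # only run experts that were selected
                out[mask] += gates[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer(d_model=32, d_hidden=128, n_experts=8, k=2)
y = layer(torch.randn(16, 32))
```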

Routing mechanisms

In modern mixture-of-experts (MoE) architectures, particularly those integrated into transformer-based models, routing mechanisms determine how input tokens are assigned to specialized expert sub-networks to enable sparse computation and efficient scaling. These mechanisms operate at the token level within each MoE layer, allowing the model to dynamically select a subset of experts for computation while keeping the overall parameter count high without proportional increases in active parameters.

Token-choice routing, a prevalent approach in large-scale models, involves computing affinity scores between each input token and the available experts via a lightweight gating network, followed by selecting the top-k experts with the highest scores for that token. This per-token selection ensures sparsity by activating only a small fraction of the total experts per input, such as top-1 or top-2 routing, which has been shown to maintain performance comparable to dense models while reducing computational cost. For instance, the Switch Transformer model employs top-1 routing, where each token is routed to exactly one expert, enabling models with over a trillion parameters to train efficiently on language tasks by minimizing computation per token.

In contrast, expert-choice routing inverts the selection process by having experts actively "pull" tokens rather than tokens "pushing" themselves to experts, which addresses imbalances in expert load and improves training stability. Under this paradigm, each expert selects a fixed-capacity set of the top-scoring tokens based on shared affinity scores, allowing variable numbers of experts per token while ensuring uniform expert utilization. This method, introduced in heterogeneous MoE layers, has demonstrated over 2x faster training convergence compared to traditional token-choice routing in large models, as it mitigates issues like expert underutilization during early training phases.

Routing strategies can be categorized as soft or hard based on their differentiability for gradient-based optimization. Soft routing uses a full softmax over all experts to compute weighted contributions, providing smooth gradients but incurring higher computational overhead due to dense expert evaluation. Hard routing, such as top-k selection, enforces sparsity by assigning tokens to a fixed number of experts without evaluating the rest, but requires approximations like the straight-through estimator to propagate gradients through non-differentiable operations during training. This estimator treats the hard selection as the identity in the forward pass while using soft gradients in the backward pass, enabling effective training of sparse models without significant performance degradation.

For scalability to billions or trillions of parameters, routing mechanisms incorporate distributed implementations across multiple devices, where experts are sharded over GPUs or TPUs to parallelize computation. In such setups, all-to-all communication operations exchange tokens according to router decisions, and expert parallelism ensures that only the selected experts are activated per token, achieving near-linear scaling in model size with roughly constant inference latency. Models like GLaM leverage this distributed token routing to handle 1.2 trillion parameters efficiently on language benchmarks, demonstrating that careful partitioning of routing logic can support training on clusters with hundreds of accelerators.
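
To contrast the two routing families, the sketch below (illustrative shapes and capacity, not a specific library API) implements expert-choice selection: each expert picks its own fixed-capacity set of tokens from the shared token-expert affinity matrix, instead of each token picking its top-k experts.

```python
import torch

def expert_choice_routing(scores: torch.Tensor, capacity: int):
    """Expert-choice routing: each expert selects its `capacity` highest-affinity tokens.

    scores: (n_tokens, n_experts) token-expert affinities (e.g., router logits).
    Returns (selected token indices, gating weights), each of shape (n_experts, capacity).
    """
    probs = torch.softmax(scores, dim=-1)             # normalize affinities per token
    # Transpose so each expert ranks all tokens and keeps its top `capacity`.
    weights, token_idx = torch.topk(probs.t(), capacity, dim=-1)
    return token_idx, weights

n_tokens, n_experts, capacity = 16, 4, 6
scores = torch.randn(n_tokens, n_experts)
token_idx, weights = expert_choice_routing(scores, capacity)

# Every expert processes exactly `capacity` tokens (uniform utilization by construction);
# individual tokens may be picked by zero, one, or several experts.
counts = torch.bincount(token_idx.flatten(), minlength=n_tokens)
print(counts)
```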

Load balancing and capacity management

In large-scale mixture of experts (MoE) models, load balancing ensures that computational resources are distributed evenly across experts to prevent underutilization of some experts and overload of others, which could lead to inefficient training and inference. Uneven routing can cause certain experts to process disproportionate numbers of tokens, resulting in memory bottlenecks or dropped tokens during training. To address this, an auxiliary load balancing loss is incorporated into the training objective, formulated as J_{\text{bal}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i, where N is the number of experts, f_i is the fraction of input tokens routed to expert i, P_i is the average gating probability assigned to expert i over the batch, and \alpha is a small scaling hyperparameter (typically around 10^{-2}). This term encourages uniform token distribution by penalizing imbalances between routing fractions and probabilities, promoting balanced expert utilization without significantly impacting the primary task loss.

Capacity management complements load balancing by defining the maximum number of tokens each expert can process per batch to avoid overflow and dropped computations. Each expert's capacity is commonly set as \text{capacity} = C \cdot \frac{k \cdot \text{tokens per batch}}{\text{number of experts}}, where k is the number of experts selected per token and the capacity factor C is typically set greater than 1 to provide a buffer for routing variability. For instance, in top-k setups, this ensures that experts have sufficient headroom (e.g., C = 1.25) to handle surges in assigned tokens, minimizing the rate of dropped tokens to below 1% while controlling computational overhead.

For stable training in MoE architectures, an importance loss is also employed, which weights expert contributions by their router probabilities to mitigate imbalance and encourage consistent specialization. This auxiliary term helps maintain balanced gradients across experts, particularly in sparsely activated layers, by scaling the loss based on the aggregated router probabilities.

In distributed environments, expert parallelism relies on all-to-all communication primitives to efficiently dispatch tokens to specialized experts across multiple devices, ensuring scalability for models with thousands of experts. This involves operations in which tokens are shuffled between accelerators, with the communication volume proportional to the active computations, enabling linear scaling in model capacity without full replication of all parameters.
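
The auxiliary balancing loss and the capacity rule can be expressed in a few lines. The sketch below approximates the commonly used Switch-style formulation, with f_i computed from hard top-1 routing decisions and P_i from mean router probabilities; the value of α and the capacity factor are illustrative defaults.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor,
                        alpha: float = 1e-2) -> torch.Tensor:
    """J_bal = alpha * N * sum_i f_i * P_i.

    router_logits: (n_tokens, n_experts) raw router scores.
    expert_index:  (n_tokens,) index of the expert each token was routed to (top-1).
    """
    n_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i (hard routing decision).
    f = torch.bincount(expert_index, minlength=n_experts).float() / expert_index.numel()
    # P_i: average router probability assigned to expert i over the batch.
    p = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)

def expert_capacity(tokens_per_batch: int, n_experts: int, k: int = 1,
                    capacity_factor: float = 1.25) -> int:
    # Buffer above the perfectly balanced share of tokens; tokens beyond it are dropped.
    return int(capacity_factor * k * tokens_per_batch / n_experts)

logits = torch.randn(1024, 8)
top1 = logits.argmax(dim=-1)
print(load_balancing_loss(logits, top1), expert_capacity(1024, 8))
```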

Applications

Transformer integrations

In transformer architectures, mixture of experts (MoE) layers are commonly integrated by replacing the dense feed-forward networks (FFNs) in each block with sparsely gated MoE modules, positioned after the multi-head self-attention sublayer. This placement enables conditional activation of experts per input token, allowing the model to leverage specialized subnetworks while maintaining the overall transformer structure for sequence processing tasks. Such integration promotes efficiency by activating only a fraction of the total parameters during training and inference, contrasting with fully dense transformers that compute all parameters uniformly.

A key benefit of this approach is the ability to scale to massive parameter counts without linearly increasing computational demands. For instance, the Switch Transformers model incorporates MoE layers to achieve 1.6 trillion parameters overall, yet activates only approximately 7 billion parameters per forward pass, resulting in 4 times faster pre-training compared to T5-XXL (11 billion parameters), or up to 7 times faster than the smaller T5-Base (220 million parameters). Similarly, the GLaM model employs MoE in its transformer blocks to reach 1.2 trillion parameters—roughly seven times larger than GPT-3—while using one-third the energy for training and half the FLOPs for inference, yielding superior performance on 29 NLP tasks in zero-shot and one-shot settings. These examples demonstrate how MoE integration facilitates parameter-efficient scaling, where model capacity grows independently of active compute.

Variants of MoE integration in transformers address challenges like training stability and downstream adaptability. The ST-MoE framework refines routing mechanisms within the FFN sublayers to mitigate instabilities during pre-training, enabling a 269 billion parameter model with computational costs comparable to a 32 billion parameter dense model, and achieving state-of-the-art results on benchmarks such as SuperGLUE and XSum. This design enhances generalization by improving transfer efficiency across diverse tasks, without altering the core attention mechanisms.
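
The placement described above (an MoE module substituted for the dense FFN after self-attention) can be sketched as follows. The block below is a simplified, hypothetical example using top-1 routing and small dimensions, not the architecture of any specific published model.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Sparse MoE substitute for the dense FFN: top-1 routing per token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        weight, idx = probs.max(dim=-1)                        # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

class MoETransformerBlock(nn.Module):
    """Pre-norm transformer block with the dense FFN replaced by an MoE module."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_experts: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe_ffn = MoEFeedForward(d_model, 4 * d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # self-attention sublayer
        b, s, d = x.shape
        x = x + self.moe_ffn(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)  # MoE replaces FFN
        return x

block = MoETransformerBlock()
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```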

Large language models

Mixture of experts (MoE) architectures have become pivotal in scaling large language models (LLMs), enabling massive parameter counts while activating only a subset of parameters during inference to reduce computational overhead. By routing tokens to specialized expert sub-networks, MoE allows models to achieve performance comparable to or exceeding dense counterparts with fewer active parameters, facilitating efficient training and deployment under hardware constraints. This approach has driven advancements in open-source and proprietary LLMs, particularly from 2023 onward, where MoE integration has improved benchmark scores and inference speed without proportional increases in resource demands.

Key examples include Mixtral 8x7B, released by Mistral AI in December 2023, which features 47 billion total parameters but activates only 13 billion per token using a sparse MoE layer with eight experts and top-2 routing. Similarly, xAI's Grok-1, open-sourced in March 2024, employs a 314 billion parameter MoE with eight experts, activating two per token to balance capacity and efficiency in handling diverse tasks. These models demonstrate MoE's ability to scale beyond traditional dense architectures like Llama 2, offering high-capacity reasoning while maintaining accessibility for inference on standard GPU clusters.

In 2024, further innovations emphasized efficiency and multilingual support. DeepSeek-V2, developed by DeepSeek AI and released in May 2024, incorporates 236 billion total parameters with 21 billion active per token, combining Multi-head Latent Attention (MLA) for key-value cache compression with a fine-grained MoE layer that routes each token to a small subset of its 160 routed experts plus two always-active shared experts, which reduces communication overhead during distributed training. Alibaba's Qwen2 MoE variant, a 57 billion parameter model with 14 billion active parameters (Qwen2-57B-A14B), enhances multilingual efficiency across 29 languages including English, Chinese, and other major languages, achieving strong performance in cross-lingual tasks with reduced inference cost compared to dense equivalents. These developments highlight MoE's role in optimizing for global applications, where expert specialization improves handling of language-specific nuances.

MoE implementations yield notable performance gains, particularly in throughput and cost efficiency. For instance, Mixtral 8x7B matches or exceeds the dense Llama 2 70B model on benchmarks like MMLU and HellaSwag while enabling roughly 6x faster inference due to its sparse activation, resulting in up to 5x lower computational cost per token. DeepSeek-V2 similarly outperforms dense models like Llama 3 70B in zero-shot reasoning tasks with 42.5% reduced training costs relative to the dense DeepSeek 67B model, attributed in part to MLA's compression of key-value pairs, which minimizes memory usage during inference. Grok-1's design supports context lengths of up to 8,192 tokens with efficient expert utilization, contributing to competitive scores on mathematical and coding benchmarks against models like GPT-3.5. Overall, these gains stem from MoE's conditional computation, which prioritizes relevant experts to enhance model quality at scale.

Training MoE models benefits from established scaling laws, as explored in 2024 research. Studies show that MoE performance follows power-law relationships similar to dense models but modulated by factors like expert count, activation sparsity, and total compute budget; for example, fine-grained MoE variants scale optimally when expert granularity aligns with token-level routing, yielding perplexity improvements proportional to the logarithm of active parameters times training tokens. Comparative analyses confirm that MoE models transfer dense scaling laws effectively, achieving emergent abilities at lower effective compute than dense counterparts by leveraging expert parallelism, with optimal regimes identified around 10-20% activation rates for LLMs up to 100B active parameters. These laws guide hyperparameter selection, ensuring that MoE LLMs like those above maximize throughput on TPUs or GPUs during pretraining on trillions of tokens.

As of 2025, MoE continues to evolve in LLMs, with new releases emphasizing greater efficiency and specialization. For example, surveys highlight advancements in models like enhanced variants of the DeepSeek and Qwen series, incorporating refined routing for better multilingual and multimodal capabilities, further reducing inference costs while scaling to hundreds of billions of parameters.

Other domains

Mixture of experts (MoE) architectures have been adapted to computer vision tasks, where sparse activation enables scaling without proportional increases in computational cost. In Vision MoE (V-MoE), introduced by Riquelme et al. in 2021, experts are integrated into Vision Transformer (ViT) layers to process image patches selectively, achieving competitive performance on ImageNet classification with models up to 15 billion parameters—approximately 25 times larger than base dense ViT counterparts—while activating a small fraction (around 10%) of the parameters per input. This sparse approach allows experts to specialize in different visual features, such as textures or shapes, improving efficiency in resource-constrained settings. Subsequent works have extended sparse experts to ViTs for tasks like object detection, where routing mechanisms assign patches to domain-specific subnetworks, reducing inference latency by up to 50% compared to dense models.

In speech recognition, MoE has been applied to acoustic modeling for automatic speech recognition (ASR) systems. The sparsely-gated MoE layer proposed by Shazeer et al. in 2017 included preliminary evaluations on speech-related tasks, demonstrating that MoE models can match or exceed dense networks in accuracy with only a fraction of parameters activated per utterance. Experts in these models specialize in phonetic regions, such as handling accents or speaker variations, enabling scalable training on massive speech corpora without full network activation. This approach influenced subsequent ASR systems, where MoE layers have improved word error rates in multilingual settings.

MoE techniques also enhance recommendation systems by allowing experts to specialize in user or item embeddings within multi-task learning frameworks. In YouTube's deep neural network-based ranking system, Covington et al. (2016) used shared bottom layers that feed into multiple task-specific towers, routed by features like user watch history, improving next-video prediction accuracy across diverse user behaviors. This specialization enables the towers to capture latent factors like genre preferences or temporal patterns in embeddings, boosting metrics such as mean average precision over single-tower models in large-scale deployments. Modern variants further decompose user-item interactions into expert submodels for cold-start scenarios, enhancing personalization in platforms handling billions of daily interactions.

Multimodal applications of MoE emerged in the early 2020s, integrating vision and language through hybrid architectures. For instance, Zhou et al. (2024) scaled vision-language models like CLIP using sparse MoE layers in CLIP-MoE, where experts handle modality-specific alignments—such as visual semantics or textual descriptions—achieving state-of-the-art zero-shot image-text retrieval on benchmarks like Flickr30k, with up to 4x parameter efficiency over dense baselines. These hybrids route inputs to cross-modal experts, mitigating interference between domains and improving downstream tasks like visual question answering by 5-10% in retrieval accuracy. Recent CLIP-MoE variants further diversify experts via contrastive fine-tuning, enabling specialization for tasks like multimodal classification without retraining the full model.

Challenges and Recent Advances

Scalability and training issues

Training large mixture of experts (MoE) models encounters significant challenges in maintaining stability during optimization, primarily due to router collapse, a failure mode in which the gating network routes nearly all input tokens to a single expert, leaving others idle and diminishing the benefits of sparsity. This instability arises from the competitive nature of routing decisions, which can converge to suboptimal equilibria without intervention. To counteract router collapse, entropy regularization can be applied to the router's output distribution, promoting more uniform probability assignments across experts and encouraging diverse token dispatching throughout training.

Hardware limitations further complicate the scalability of MoE architectures, as the dispatching and combining phases rely on all-to-all communication primitives to exchange tokens between devices hosting different experts, resulting in substantial overhead and contention in distributed environments. These bottlenecks can severely limit model size and training throughput on standard GPU clusters. Expert parallelism mitigates such issues by sharding experts across multiple accelerators, thereby distributing memory demands and reducing per-device load while preserving overall computational efficiency.

The expansive parameter counts in MoE models heighten the risk of overfitting, necessitating enormous training datasets to generalize effectively across diverse inputs. Empirical scaling laws derived in 2024 for fine-grained MoE configurations reveal that compute-optimal mixtures balance model parameters, data volume, and granularity, demonstrating that insufficient data relative to parameters leads to diminished returns and instability, whereas adequately scaled data enables efficient performance gains without excessive memorization.

Assessing MoE efficacy requires specialized metrics that capture routing behavior beyond conventional loss functions, including expert utilization—which quantifies the evenness of expert activation to detect imbalances—and computational efficiency, which evaluates active computational cost relative to total parameters to verify sparsity benefits. Low expert utilization signals persistent routing issues, while high efficiency underscores MoE's capacity to achieve dense-model performance at a fraction of the inference compute.
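
As an illustration of the mitigations and metrics discussed above, the sketch below adds an entropy bonus computed on the batch-averaged router distribution and reports a per-expert utilization statistic; the coefficient and the exact form of the penalty vary across implementations and are illustrative here.

```python
import torch

def router_entropy_bonus(router_logits: torch.Tensor, coeff: float = 1e-2) -> torch.Tensor:
    """Penalty term that is smallest when the batch-averaged routing distribution is
    uniform, discouraging collapse onto a single expert."""
    mean_probs = torch.softmax(router_logits, dim=-1).mean(dim=0)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-9)).sum()
    return -coeff * entropy          # subtracting entropy: add this term to the training loss

def expert_utilization(expert_index: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Fraction of tokens routed to each expert; a near-uniform vector indicates healthy
    routing, while a spike on one entry signals router collapse."""
    counts = torch.bincount(expert_index, minlength=n_experts).float()
    return counts / counts.sum()

logits = torch.randn(2048, 16)
loss = router_entropy_bonus(logits)                  # add to the task loss during training
util = expert_utilization(logits.argmax(dim=-1), 16)
print(loss.item(), util)
```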

Developments since 2023

In late 2023, Mistral AI introduced Mixtral 8x7B, a sparse mixture-of-experts (SMoE) model that integrates grouped-query attention (GQA) to enhance efficiency while scaling capacity through eight specialized experts per layer, activating only two per token. This design achieved competitive performance on benchmarks like MMLU, surpassing denser models of similar active parameter counts, by leveraging GQA to reduce memory overhead in key-value caching during inference. Building on this in 2024, DeepSeek AI released DeepSeek-V2, featuring fine-grained expert segmentation and shared experts that are always activated to capture universal knowledge, isolating them from task-specific experts to improve specialization without increasing computational load. The model, with 21 billion active parameters out of 236 billion total, demonstrated superior efficiency in multilingual tasks, reducing training costs by 42.5% compared to its dense counterpart through sparse activation.

By 2025, Alibaba's Qwen team advanced MoE with its Qwen3 series, open-weighting models such as Qwen3-235B-A22B, which has 235 billion total and 22 billion activated parameters. Researchers further advanced MoE compression with the REAP (Router-weighted Expert Activation Pruning) technique applied to Qwen3 series models, such as Qwen3-480B-Coder, achieving up to 50% expert pruning while retaining over 97% of baseline performance on coding benchmarks. This one-shot method targets redundant experts based on router biases, enabling deployment of trillion-parameter-scale models on resource-constrained hardware without retraining.

Surveys from 2025, such as reviews of MoE in large language models, highlight scaling laws that predict continued efficiency gains, with MoE architectures enabling models several times larger than dense equivalents at equivalent training costs. Innovations in expert diversification, including orthogonal initialization of routers to promote balanced routing and reduce collapse risks, have been shown to improve load balancing in multi-expert setups. Hybrid dense-MoE designs, as implemented in models like DeepSeek-V3 with initial dense layers for stability followed by sparse MoE blocks, further mitigate training instabilities while preserving dense model strengths in early processing stages. Looking ahead, reports indicate MoE's role in prospective AGI-scale systems, emphasizing "smarter scaling" through expert modularity to handle diverse reasoning paths efficiently in models exceeding 1 trillion parameters.

References

  1. "Adaptive Mixtures of Local Experts." Neural Computation, MIT Press. Robert A. Jacobs et al.
  2. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv, January 23, 2017.
  3. Jacobs, R. A., et al. "Adaptive Mixtures of Local Experts." Neural Computation, 3(1):79–87, March 1991. doi:10.1162/neco.1991.3.1.79. ResearchGate.
  4. "Ensemble Methods in Machine Learning" (PDF).
  5. Jacobs, R. "Mixtures-of-Experts" (PDF). Department of Brain & Cognitive Sciences, August 8, 2008.
  6. "Hierarchical Mixtures of Experts and the EM Algorithm." ResearchGate.
  7. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv.
  8. Fedus, W., Zoph, B., & Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv, January 11, 2021.
  9. Jacobs, R. A., & Jordan, M. I. "Adaptive Mixtures of Local Experts" (PDF). Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
  10. "Hierarchies of Adaptive Experts" (PDF).
  11. Jordan, M. I., & Jacobs, R. A. "Hierarchical Mixtures of Experts and the EM Algorithm." Neural Computation, 6(2):181–214, March 1, 1994.
  12. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, February 18, 2022.
  13. Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." arXiv, December 13, 2021.
  14. Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., et al. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv, February 17, 2022.
  15. "Mixtral of Experts." Mistral AI, December 11, 2023.
  16. "Mixtral of Experts." arXiv:2401.04088, January 8, 2024.
  17. "Open Release of Grok-1." xAI, March 17, 2024.
  18. "Qwen/Qwen2-57B-A14B." Hugging Face, July 21, 2025.
  19. "Qwen2 Technical Report." arXiv:2407.10671, July 15, 2024.
  20. "xai-org/grok-1: Grok open release." GitHub.
  21. "Scaling Laws for Fine-Grained Mixture of Experts." arXiv:2402.07871, February 12, 2024.
  22. "CLIP-MoE: Towards Building Mixture of Experts ..." arXiv:2409.19291, September 28, 2024.
  23. "Training MoEs at Scale with PyTorch." June 23, 2024.
  24. "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." arXiv, January 11, 2024.
  25. "Qwen3: Think Deeper, Act Faster." Qwen.
  26. "REAP: One-Shot Pruning for Trillion-Parameter Mixture-of-Experts ..." October 16, 2025.