References
- [1] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79–87. MIT Press.
- [2] Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv, January 23, 2017.
- [3] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, 3(1), 79–87. DOI: 10.1162/neco.1991.3.1.79 (ResearchGate).
- [4] Dietterich, T. G. (2000). Ensemble Methods in Machine Learning [PDF].
- [5] Jacobs, R. Mixtures-of-Experts [PDF]. Department of Brain & Cognitive Sciences, August 8, 2008.
- [6] Jordan, M. I., & Jacobs, R. A. Hierarchical Mixtures of Experts and the EM Algorithm (ResearchGate).
- [7] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv.
- [8] Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv, January 11, 2021.
- [9] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. Adaptive Mixtures of Local Experts [PDF]. Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology.
- [10] Jordan, M. I., & Jacobs, R. A. Hierarchies of Adaptive Experts [PDF].
- [11] Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6(2), 181–214.
- [12] Mixture-of-Experts with Expert Choice Routing. arXiv:2202.09368, February 18, 2022.
- [13] Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., et al. (2021). GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. arXiv, December 13, 2021.
- [14] Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. arXiv, February 17, 2022.
- [15] Mixtral of Experts. Mistral AI, December 11, 2023.
- [16] Mixtral of Experts. arXiv:2401.04088, January 8, 2024.
- [17] Open Release of Grok-1. xAI, March 17, 2024.
- [18] Qwen/Qwen2-57B-A14B. Hugging Face model card.
- [19] Qwen2 Technical Report. arXiv:2407.10671, July 15, 2024.
- [20] xai-org/grok-1: Grok open release. GitHub repository.
- [21] Scaling Laws for Fine-Grained Mixture of Experts. arXiv:2402.07871, February 12, 2024.
- [22] CLIP-MoE: Towards Building Mixture of Experts ... arXiv:2409.19291, September 28, 2024.
- [23] Training MoEs at Scale with PyTorch. PyTorch blog, June 23, 2024.
- [24] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv, January 11, 2024.
- [25] Qwen3: Think Deeper, Act Faster. Qwen blog.
- [26] REAP: One-Shot Pruning for Trillion-Parameter Mixture-of-Experts ... October 16, 2025.