MoEs papers reading list
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 4 Note (Must Read) Scales up the MoE idea for language models, applying it to a 137B-parameter LSTM for language modeling and machine translation. It introduced sparsity, enabling very fast inference at high scale; the main challenges were training instabilities and communication costs. Google
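To make the mechanism concrete, here is a minimal PyTorch sketch (not the paper's code) of top-k gating: a learned router scores every expert for each token, only the k highest-scoring experts run, and their outputs are combined with the renormalized gate weights. The paper's noise term, load-balancing losses, and capacity limits are omitted, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparsely-gated MoE layer: each token is routed to its top-k experts.

    Illustrative sketch only -- the paper's noisy gating, auxiliary losses,
    and capacity limits are left out.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_tokens, d_model)
        logits = self.router(x)                            # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)             # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```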
Sparse Networks from Scratch: Faster Training without Losing Performance
Paper • 1907.04840 • Published • 3 Note (Optional) This is among the first papers on training sparse neural networks from scratch. University of Washington
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 4 Note (Optional) Not really about MoEs, but it discusses at length how to scale to 1 trillion parameters and improve training speed on existing hardware by partitioning optimizer states, gradients, and parameters. Microsoft
A Mixture of h-1 Heads is Better than h Heads
Paper • 2005.06537 • Published • 2 Note (Optional) This paper explores an MoE mechanism for multi-head attention. University of Washington
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper • 2006.16668 • Published • 3 Note (Must Read) Introduces sparsely-gated MoE layers into the Transformer. This is among the first uses of MoEs in Transformers, scaling beyond 600 billion parameters, with a quite interesting hardware and sharding discussion. Google
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 14 Note (Must Read) The first open-sourced MoE, released with a 1.6-trillion-parameter model using 2048 experts. It makes several changes, such as simplifying sparse routing by reducing it to a single expert per token, and more (a minimal routing sketch follows the model entry below). They also perform extensive studies and share interesting insights. Google
google/switch-c-2048
Text2Text Generation • Updated • 101 • 278 Note This is one of the models released for Switch Transformers.
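The single-expert ("switch") routing simplification from the Switch Transformers paper above, together with its load-balancing auxiliary loss, can be sketched in a few lines of PyTorch. This is a hedged illustration rather than the released implementation; capacity factors and token dropping are omitted, and `alpha` stands for the auxiliary-loss coefficient.

```python
import torch
import torch.nn.functional as F

def switch_route(logits: torch.Tensor, alpha: float = 0.01):
    """Top-1 ("switch") routing plus a load-balancing auxiliary loss.

    logits: (num_tokens, num_experts) router outputs.
    Returns the chosen expert per token, its gate value, and the aux loss.
    Illustrative sketch; capacity factors and token dropping are omitted.
    """
    probs = F.softmax(logits, dim=-1)                   # (T, E)
    gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
    num_experts = logits.shape[-1]
    # fraction of tokens dispatched to each expert (hard counts)
    dispatch = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # mean router probability per expert (soft counts)
    importance = probs.mean(dim=0)
    aux_loss = alpha * num_experts * torch.sum(dispatch * importance)
    return expert_idx, gate, aux_loss
```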
FastMoE: A Fast Mixture-of-Expert Training System
Paper • 2103.13262 • Published • 2 Note (Optional) A distributed MoE training system based on PyTorch. Most previous implementations depended on TPUs and Mesh TensorFlow, so the authors aimed to decouple from that stack. Code is open: https://github.com/laekov/fastmoe Tsinghua
BASE Layers: Simplifying Training of Large, Sparse Models
Paper • 2103.16716 • Published • 3 Note (Optional) Routing is formulated as a linear assignment problem, which yields a balanced assignment in which all experts receive the same number of tokens. This removes the need for additional hyperparameters or auxiliary losses. FAIR
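To illustrate the linear-assignment view of routing, here is a small sketch that balances tokens across experts with SciPy's Hungarian solver (`linear_sum_assignment`). The paper solves the same problem at scale with a different, distributed algorithm; this O(T^3) version is only meant to show the formulation, and the function name is made up for the example.

```python
import torch
from scipy.optimize import linear_sum_assignment

def balanced_assign(scores: torch.Tensor) -> torch.Tensor:
    """Balanced token-to-expert assignment in the spirit of BASE layers.

    scores: (num_tokens, num_experts) token-expert affinities; num_tokens must
    be divisible by num_experts. Each expert gets exactly
    num_tokens / num_experts slots, and the solver picks the assignment that
    maximizes total affinity. Illustrative sketch only.
    """
    num_tokens, num_experts = scores.shape
    slots_per_expert = num_tokens // num_experts
    # replicate every expert column once per slot: (num_tokens, num_tokens)
    cost = -scores.detach().cpu().numpy().repeat(slots_per_expert, axis=1)
    _, col = linear_sum_assignment(cost)         # one slot per token, minimum cost
    expert_idx = col // slots_per_expert         # map slots back to experts
    return torch.as_tensor(expert_idx, device=scores.device)
```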
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Paper • 2105.03036 • Published • 2 Note (Optional) MoEs in the speech domain. Not much to use from here, but quite interesting to see this usage! Tencent
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Paper • 2106.03760 • Published • 3 Note (Optional) This introduces a new gating mechanism for MoEs that is differentiable, so it can be trained jointly with the rest of the network, while explicitly controlling the number of experts to select. Google, MIT
Scaling Vision with Sparse Mixture of Experts
Paper • 2106.05974 • Published • 3 Note (Optional) This paper explores Vision MoE, which is essentially a sparse version of ViT. It ends up requiring half the compute for the same quality. It also introduces batch prioritized routing (sketched below), which was later used for LLMs. Google
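Batch prioritized routing is easy to express informally: compute router scores, sort the whole batch by routing confidence, and let the most confident tokens claim expert capacity first, so that only low-priority tokens are dropped. A hedged top-1 sketch in PyTorch follows; names are illustrative and the real implementation is vectorized.

```python
import torch
import torch.nn.functional as F

def batch_prioritized_top1(logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """Batch prioritized routing sketch (top-1 case): tokens are assigned in
    order of their highest router probability, so when an expert is full it is
    the least confident tokens that get dropped. Illustrative only.

    logits: (num_tokens, num_experts). Returns expert index per token,
    with -1 for dropped tokens.
    """
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)
    order = torch.argsort(gate, descending=True)      # most confident tokens first
    load = [0] * logits.shape[-1]                     # tokens already assigned per expert
    assignment = torch.full_like(expert_idx, -1)
    for t in order.tolist():
        e = expert_idx[t].item()
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1                              # otherwise the token is dropped
    return assignment
```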
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2 Note (Optional) A different routing strategy based on hashing the input tokens; it requires no additional parameters, which makes it robust. FAIR
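A hash-based router needs no learned gate at all. The sketch below shows the idea, assuming token-id inputs and a fixed random lookup table standing in for the hash function; the paper evaluates several hashing schemes.

```python
import torch
import torch.nn as nn

class HashRouter(nn.Module):
    """Parameter-free routing in the spirit of Hash Layers: each token id is
    mapped to a fixed expert by a hash, so no gate has to be learned.
    Illustrative sketch only.
    """
    def __init__(self, vocab_size: int, num_experts: int, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # a fixed random token-id -> expert lookup table acts as the "hash"
        table = torch.randint(0, num_experts, (vocab_size,), generator=g)
        self.register_buffer("table", table)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.table[token_ids]        # expert index per token
```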
DEMix Layers: Disentangling Domains for Modular Language Modeling
Paper • 2108.05036 • Published • 3 Note (Optional) DEMix conditions LMs on the domain of the input. A DEMix layer consists of domain-specific experts, which makes LMs modular: experts can easily be mixed, added, or removed. University of Washington, Allen AI, FAIR
A Machine Learning Perspective on Predictive Coding with PAQ
Paper • 1108.3298 • Published • 2 Note (Optional) The PAQ8 compressors are similar to a mixture of experts. University of British Columbia
Efficient Large Scale Language Modeling with Mixtures of Experts
Paper • 2112.10684 • Published • 2 Note (Optional) A detailed empirical study comparing MoEs with dense models in a number of settings (zero- vs. few-shot, etc.). Meta
Unified Scaling Laws for Routed Language Models
Paper • 2202.01169 • Published • 2 Note (Optional) This paper studies the behavior of routed language models and derives scaling laws for them. Google
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper • 2202.08906 • Published • 2 Note (Must Read) This is an excellent guide to all the issues around MoEs and how to handle the training instabilities. It discusses fine-tuning issues, hyperparameters, and trade-offs, and also dives into what the experts learn. Google
Mixture-of-Experts with Expert Choice Routing
Paper • 2202.09368 • Published • 3 Note (Optional) This paper introduces a new routing technique: rather than having tokens select the top-k experts, the experts select the top-k tokens, switching things around. Hence, a single token might be processed by a varying number of experts, or by none at all. Google
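The switch of perspective is compact enough to sketch: instead of a per-token top-k over experts, take a per-expert top-k over tokens, i.e. a column-wise top-k of the routing matrix. A hedged PyTorch illustration, with `capacity` denoting how many tokens each expert picks:

```python
import torch
import torch.nn.functional as F

def expert_choice(scores: torch.Tensor, capacity: int):
    """Expert-choice routing sketch: each expert picks its top-`capacity`
    tokens (columns of the score matrix) instead of tokens picking experts.

    scores: (num_tokens, num_experts) router logits.
    Returns, per expert, the indices of the selected tokens and their gates.
    A token may therefore be picked by several experts, or by none.
    """
    probs = F.softmax(scores, dim=-1)                  # token-to-expert affinities (T, E)
    gates, token_idx = probs.topk(capacity, dim=0)     # top tokens per expert column
    return token_idx, gates                            # both (capacity, E)
```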
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3 Note (Optional) Being the first multimodal MoE, this paper explores how to handle different types of tokens. Google
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Paper • 2208.03306 • Published • 2 Note (Optional) This paper also treats experts as domain experts, which makes it easy to train parameters for a new domain in parallel and merge them back into the model. University of Washington, Allen AI, Meta
A Review of Sparse Expert Models in Deep Learning
Paper • 2209.01667 • Published • 3 Note (Optional) This is more of a survey of sparse expert models. Nice to get an overview of things up to 2022. Google
Sparsity-Constrained Optimal Transport
Paper • 2209.15466 • Published • 1 Note (Optional) Routing techniques using optimal transport. University of Basel, Google
Mixture of Attention Heads: Selecting Attention Heads Per Token
Paper • 2210.05144 • Published • 2 Note (Optional) While most MoEs focus on applying sparsity to the feed-forward layers, MoA explores an MoE mechanism for multi-head attention. Beihang University, Mila, Tencent
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper • 2211.15841 • Published • 7 Note (Optional) The authors reformulate MoE computation in terms of block-sparse operations, which leads to a 40% end-to-end speedup over the state of the art. The framework was open-sourced and used in many follow-up works. Their proposal never drops tokens and maps efficiently to hardware. https://github.com/stanford-futuredata/megablocks Stanford, Microsoft, Google
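"Never drops tokens" can be illustrated without the custom kernels: group tokens by their assigned expert and run each expert on a variable-sized batch, so no capacity padding or dropping is needed. MegaBlocks achieves this with block-sparse GPU kernels rather than the naive Python loop below, which is only a sketch with made-up names.

```python
import torch

def dropless_moe(x: torch.Tensor, expert_idx: torch.Tensor, experts) -> torch.Tensor:
    """Naive "dropless" grouped computation: every token is processed by its
    assigned expert, whatever the per-expert load, by looping over experts
    with variable-sized batches. Illustrative only.

    x: (num_tokens, d_model); expert_idx: (num_tokens,) expert assignment;
    experts: list of nn.Module feed-forward experts.
    """
    out = torch.empty_like(x)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():
            out[mask] = expert(x[mask])   # variable-size batch, no padding or dropping
    return out
```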
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 5 Note (Optional) Sparse upcycling means reusing a dense checkpoint to initialize a sparse MoE. Google
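The upcycling recipe itself is a one-liner in spirit: copy the dense checkpoint's feed-forward block into every expert of the new MoE layer and continue training with a freshly initialized router. A hedged sketch; the `dense_model.layers[i].ffn` path in the comment is hypothetical.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Sparse upcycling sketch: initialize every expert of a new MoE layer as
    a copy of the dense checkpoint's feed-forward block. The router itself is
    trained from scratch. Illustrative only.
    """
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

# e.g. experts = upcycle_ffn(dense_model.layers[i].ffn, num_experts=8)  # hypothetical path
```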
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Paper • 2305.14705 • Published Note (Optional) This is the first paper to do instruction tuning with MoEs. Previous MoE fine-tunes did not have great quality; surprisingly, it turns out that MoEs do quite well with instruction tuning. Google, Berkeley, MIT, University of Massachusetts Amherst, UT Austin
From Sparse to Soft Mixtures of Experts
Paper • 2308.00951 • Published • 20 Note (Optional) Soft MoEs introduce a fully differentiable sparse Transformer architecture. Rather than a hard assignment between tokens and experts, they use a soft assignment: experts process weighted averages of all tokens. Google
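The soft dispatch/combine step can be written directly from that description. The sketch below assumes one slot per expert and illustrative parameter names; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft MoE sketch: every expert slot receives a weighted average
    of all tokens (dispatch), and every token receives a weighted average of
    all slot outputs (combine), so routing is fully differentiable.
    Illustrative; assumes one slot per expert.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(d_model, num_experts) * d_model ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, tokens, d_model)
        logits = x @ self.phi                                # (B, T, E)
        dispatch = logits.softmax(dim=1)                     # normalize over tokens
        combine = logits.softmax(dim=2)                      # normalize over slots
        slots = torch.einsum("btd,bte->bed", x, dispatch)    # weighted averages of tokens
        outs = torch.stack(
            [expert(slots[:, e]) for e, expert in enumerate(self.experts)], dim=1
        )                                                    # (B, E, d_model)
        return torch.einsum("bte,bed->btd", combine, outs)   # mix slot outputs per token
```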
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Paper • 2310.10837 • Published • 10 Note (Optional) σ-MoE makes several changes, such as using a sigmoid rather than a softmax for expert selection, a special initialization, and regularization. Swiss AI Lab, Harvard, KAUST
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 26 Note (Optional) Sub-1-bit compression. Enough said. ISTA
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 40 Note (Optional) SwitchHead reduces compute and memory requirements while maintaining performance with the same parameter budget. Swiss AI Lab, KAUST, Harvard
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Paper • 2312.12379 • Published • 2 Note (Optional) MoCLE is an MoE architecture that activates task-customized parameters based on instruction clusters. It uses an additional universal expert to improve generalization. Southern University of Science and Technology, Hong Kong University of Science and Technology, Huawei Noah's Ark Lab, Peng Cheng Lab.
Fast Inference of Mixture-of-Experts Language Models with Offloading
Paper • 2312.17238 • Published • 7 Note (Optional) The authors use parameter offloading with an MoE-specific strategy to run Mixtral-8x7B with mixed quantization on a free Google Colab instance. Yandex
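The general flavor of MoE-aware offloading (not the authors' exact method) is to keep expert weights in CPU RAM and copy an expert to the GPU only when the router selects it, caching the few most recently used experts. A toy sketch with made-up class and method names:

```python
import copy

class OffloadedExperts:
    """Toy expert-offloading scheme: expert weights live on CPU and a copy is
    moved to the GPU only when the router selects that expert, with a small
    LRU cache of resident experts. Illustrative sketch only.
    """
    def __init__(self, cpu_experts, device="cuda", cache_size=2):
        self.cpu_experts = list(cpu_experts)   # nn.Modules kept on CPU
        self.device = device
        self.cache_size = cache_size
        self.cache = {}                        # expert index -> GPU copy, in LRU order

    def get(self, idx: int):
        if idx in self.cache:
            self.cache[idx] = self.cache.pop(idx)   # refresh LRU position
            return self.cache[idx]
        if len(self.cache) >= self.cache_size:
            oldest = next(iter(self.cache))
            del self.cache[oldest]                  # evict the least recently used expert
        gpu_expert = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
        self.cache[idx] = gpu_expert
        return gpu_expert
```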
Mixtral of Experts
Paper • 2401.04088 • Published • 158 Note (Optional) A great example of an MoE released open source under a permissive license. The paper does not add much new in terms of MoE understanding, but it is still a nice read given that the model is open. Mistral
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 71 Note (Optional) Combines state space models such as Mamba with MoE. IDEAS NCBR, Polish Academy of Sciences, University of Warsaw.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 43 Note (Optional) This paper introduces architectural changes such as fine-grained expert segmentation for specialization, as well as shared-expert isolation. DeepSeek-AI, Peking University, Tsinghua University, Nanjing University
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Paper • 2404.05567 • Published • 10
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Paper • 2405.05949 • Published • 2
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 13
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 59
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Paper • 2405.11273 • Published • 17
Yuan 2.0-M32: Mixture of Experts with Attention Router
Paper • 2405.17976 • Published • 18
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Paper • 2406.06563 • Published • 17
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Paper • 2407.01906 • Published • 34
A Closer Look into Mixture-of-Experts in Large Language Models
Paper • 2406.18219 • Published • 15
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 22
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
Paper • 2407.04078 • Published • 16
On scalable oversight with weak LLMs judging strong LLMs
Paper • 2407.04622 • Published • 11
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Paper • 2406.08085 • Published • 13
Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction
Paper • 2407.03651 • Published • 15
Mixture of A Million Experts
Paper • 2407.04153 • Published • 4
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper • 2407.19985 • Published • 34
Layerwise Recurrent Router for Mixture-of-Experts
Paper • 2408.06793 • Published • 30
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Paper • 2408.15664 • Published • 11
GRIN: GRadient-INformed MoE
Paper • 2409.12136 • Published • 14
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Paper • 2409.16040 • Published • 13
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper • 2410.05993 • Published • 107
Stealing User Prompts from Mixture of Experts
Paper • 2410.22884 • Published • 13