MoEs papers reading list
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 4 Note (Must Read) Scales up the MoE idea for language models, applying it to a 137B-parameter LSTM for language modeling and machine translation. It introduced sparsity, enabling very fast inference at high scale; the main challenges were training instabilities and communication costs. Google
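To make the mechanism concrete, here is a minimal PyTorch sketch (not the paper's code) of top-k gating: a learned router scores every expert for each token, only the k highest-scoring experts run, and their outputs are combined with the renormalized gate weights. The paper's noise term, load-balancing losses, and capacity limits are omitted, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparsely-gated MoE layer: each token is routed to its top-k experts.

    Illustrative sketch only -- the paper's noisy gating, auxiliary losses,
    and capacity limits are left out.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_tokens, d_model)
        logits = self.router(x)                            # (num_tokens, num_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)             # renormalize over the k chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e              # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```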
Sparse Networks from Scratch: Faster Training without Losing Performance
Paper • 1907.04840 • Published • 3 Note (Optional) This is among the first papers on training sparse neural networks from scratch. University of Washington
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 4 Note (Optional) Not really about MoEs, but it discusses at length how to scale to 1 trillion parameters and improve training speed on existing hardware by partitioning optimizer states, gradients, and parameters. Microsoft
A Mixture of h-1 Heads is Better than h Heads
Paper • 2005.06537 • Published • 2 Note (Optional) This paper explores an MoE mechanism for multi-head attention. University of Washington
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper • 2006.16668 • Published • 3 Note (Must Read) Introduces sparsely-gated MoE layers into the Transformer. This is among the first uses of MoEs in Transformers, scaling beyond 600 billion parameters, with a quite interesting hardware and sharding discussion. Google
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 14 Note (Must Read) The first open-sourced MoE, released with a 1.6-trillion-parameter model using 2048 experts. It makes several changes, such as simplifying sparse routing by reducing it to a single expert per token, and more (a minimal routing sketch follows the model entry below). They also perform extensive studies and share interesting insights. Google
google/switch-c-2048
Text2Text Generation • Updated • 101 • 278 Note This is one of the models released for Switch Transformers.
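The single-expert ("switch") routing simplification from the Switch Transformers paper above, together with its load-balancing auxiliary loss, can be sketched in a few lines of PyTorch. This is a hedged illustration rather than the released implementation; capacity factors and token dropping are omitted, and `alpha` stands for the auxiliary-loss coefficient.

```python
import torch
import torch.nn.functional as F

def switch_route(logits: torch.Tensor, alpha: float = 0.01):
    """Top-1 ("switch") routing plus a load-balancing auxiliary loss.

    logits: (num_tokens, num_experts) router outputs.
    Returns the chosen expert per token, its gate value, and the aux loss.
    Illustrative sketch; capacity factors and token dropping are omitted.
    """
    probs = F.softmax(logits, dim=-1)                   # (T, E)
    gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
    num_experts = logits.shape[-1]
    # fraction of tokens dispatched to each expert (hard counts)
    dispatch = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # mean router probability per expert (soft counts)
    importance = probs.mean(dim=0)
    aux_loss = alpha * num_experts * torch.sum(dispatch * importance)
    return expert_idx, gate, aux_loss
```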
FastMoE: A Fast Mixture-of-Expert Training System
Paper • 2103.13262 • Published • 2 Note (Optional) A distributed MoE training system based on PyTorch. Most previous implementations depended on TPUs and Mesh TensorFlow, so the authors aimed to decouple from that stack. Code is open: https://github.com/laekov/fastmoe Tsinghua
BASE Layers: Simplifying Training of Large, Sparse Models
Paper • 2103.16716 • Published • 3 Note (Optional) Routing is formulated as a linear assignment problem, which yields a balanced assignment in which all experts receive the same number of tokens. This removes the need for additional hyperparameters or auxiliary losses. FAIR
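To illustrate the linear-assignment view of routing, here is a small sketch that balances tokens across experts with SciPy's Hungarian solver (`linear_sum_assignment`). The paper solves the same problem at scale with a different, distributed algorithm; this O(T^3) version is only meant to show the formulation, and the function name is made up for the example.

```python
import torch
from scipy.optimize import linear_sum_assignment

def balanced_assign(scores: torch.Tensor) -> torch.Tensor:
    """Balanced token-to-expert assignment in the spirit of BASE layers.

    scores: (num_tokens, num_experts) token-expert affinities; num_tokens must
    be divisible by num_experts. Each expert gets exactly
    num_tokens / num_experts slots, and the solver picks the assignment that
    maximizes total affinity. Illustrative sketch only.
    """
    num_tokens, num_experts = scores.shape
    slots_per_expert = num_tokens // num_experts
    # replicate every expert column once per slot: (num_tokens, num_tokens)
    cost = -scores.detach().cpu().numpy().repeat(slots_per_expert, axis=1)
    _, col = linear_sum_assignment(cost)         # one slot per token, minimum cost
    expert_idx = col // slots_per_expert         # map slots back to experts
    return torch.as_tensor(expert_idx, device=scores.device)
```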
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Paper • 2105.03036 • Published • 2 Note (Optional) MoEs in the speech domain. Not much to use from here, but quite interesting to see this usage! Tencent
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Paper • 2106.03760 • Published • 3 Note (Optional) This introduces a new gating mechanism for MoEs that is differentiable, so it can be trained jointly with the rest of the network, while explicitly controlling the number of experts to select. Google, MIT
Scaling Vision with Sparse Mixture of Experts
Paper • 2106.05974 • Published • 3 Note (Optional) This paper explores Vision MoE, which is essentially a sparse version of ViT. It ends up requiring half the compute for the same quality. It also introduces batch prioritized routing (sketched below), which was later used for LLMs. Google
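Batch prioritized routing is easy to express informally: compute router scores, sort the whole batch by routing confidence, and let the most confident tokens claim expert capacity first, so that only low-priority tokens are dropped. A hedged top-1 sketch in PyTorch follows; names are illustrative and the real implementation is vectorized.

```python
import torch
import torch.nn.functional as F

def batch_prioritized_top1(logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """Batch prioritized routing sketch (top-1 case): tokens are assigned in
    order of their highest router probability, so when an expert is full it is
    the least confident tokens that get dropped. Illustrative only.

    logits: (num_tokens, num_experts). Returns expert index per token,
    with -1 for dropped tokens.
    """
    probs = F.softmax(logits, dim=-1)
    gate, expert_idx = probs.max(dim=-1)
    order = torch.argsort(gate, descending=True)      # most confident tokens first
    load = [0] * logits.shape[-1]                     # tokens already assigned per expert
    assignment = torch.full_like(expert_idx, -1)
    for t in order.tolist():
        e = expert_idx[t].item()
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1                              # otherwise the token is dropped
    return assignment
```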
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2 Note (Optional) A different routing strategy based on hashing the input tokens; it requires no additional parameters, which makes it robust. FAIR
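A hash-based router needs no learned gate at all. The sketch below shows the idea, assuming token-id inputs and a fixed random lookup table standing in for the hash function; the paper evaluates several hashing schemes.

```python
import torch
import torch.nn as nn

class HashRouter(nn.Module):
    """Parameter-free routing in the spirit of Hash Layers: each token id is
    mapped to a fixed expert by a hash, so no gate has to be learned.
    Illustrative sketch only.
    """
    def __init__(self, vocab_size: int, num_experts: int, seed: int = 0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # a fixed random token-id -> expert lookup table acts as the "hash"
        table = torch.randint(0, num_experts, (vocab_size,), generator=g)
        self.register_buffer("table", table)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.table[token_ids]        # expert index per token
```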
DEMix Layers: Disentangling Domains for Modular Language Modeling
Paper • 2108.05036 • Published • 3 Note (Optional) DEMix conditions LMs on the domain of the input. A DEMix layer consists of domain-specific experts, which makes LMs modular: experts can easily be mixed, added, or removed. University of Washington, Allen AI, FAIR
A Machine Learning Perspective on Predictive Coding with PAQ
Paper • 1108.3298 • Published • 2 Note (Optional) The PAQ8 compressors are similar to a mixture of experts. University of British Columbia
Efficient Large Scale Language Modeling with Mixtures of Experts
Paper • 2112.10684 • Published • 2 Note (Optional) A detailed empirical study comparing MoEs with dense models in a number of settings (zero- vs. few-shot, etc.). Meta
Unified Scaling Laws for Routed Language Models
Paper • 2202.01169 • Published • 2 Note (Optional) This paper studies the behavior of routed language models and derives scaling laws for them. Google
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper • 2202.08906 • Published • 2 Note (Must Read) This is an excellent guide to all the issues around MoEs and how to handle the training instabilities. It discusses fine-tuning issues, hyperparameters, and trade-offs, and also dives into what the experts learn. Google
Mixture-of-Experts with Expert Choice Routing
Paper • 2202.09368 • Published • 3 Note (Optional) This paper introduces a new routing technique: rather than having tokens select the top-k experts, the experts select the top-k tokens, switching things around. Hence, a single token might be processed by a varying number of experts, or by none at all. Google
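The switch of perspective is compact enough to sketch: instead of a per-token top-k over experts, take a per-expert top-k over tokens, i.e. a column-wise top-k of the routing matrix. A hedged PyTorch illustration, with `capacity` denoting how many tokens each expert picks:

```python
import torch
import torch.nn.functional as F

def expert_choice(scores: torch.Tensor, capacity: int):
    """Expert-choice routing sketch: each expert picks its top-`capacity`
    tokens (columns of the score matrix) instead of tokens picking experts.

    scores: (num_tokens, num_experts) router logits.
    Returns, per expert, the indices of the selected tokens and their gates.
    A token may therefore be picked by several experts, or by none.
    """
    probs = F.softmax(scores, dim=-1)                  # token-to-expert affinities (T, E)
    gates, token_idx = probs.topk(capacity, dim=0)     # top tokens per expert column
    return token_idx, gates                            # both (capacity, E)
```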
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3 Note (Optional) Being the first multimodal MoE, this paper explores how to handle different types of tokens. Google
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Paper • 2208.03306 • Published • 2 Note (Optional) This paper also treats experts as domain experts, which makes it easy to train parameters for a new domain in parallel and merge them back into the model. University of Washington, Allen AI, Meta
A Review of Sparse Expert Models in Deep Learning
Paper • 2209.01667 • Published • 3 Note (Optional) This is more of a survey of sparse expert models. Nice to get an overview of things up to 2022. Google
Sparsity-Constrained Optimal Transport
Paper • 2209.15466 • Published • 1 Note (Optional) Routing techniques using optimal transport. University of Basel, Google
Mixture of Attention Heads: Selecting Attention Heads Per Token
Paper • 2210.05144 • Published • 2 Note (Optional) While most MoEs focus on applying sparsity to the feed-forward layers, MoA explores an MoE mechanism for multi-head attention. Beihang University, Mila, Tencent
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper • 2211.15841 • Published • 7 Note (Optional) The authors reformulate MoE computation in terms of block-sparse operations, which leads to a 40% end-to-end speedup over the state of the art. The framework was open-sourced and used in many follow-up works. Their proposal never drops tokens and maps efficiently to hardware. https://github.com/stanford-futuredata/megablocks Stanford, Microsoft, Google
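"Never drops tokens" can be illustrated without the custom kernels: group tokens by their assigned expert and run each expert on a variable-sized batch, so no capacity padding or dropping is needed. MegaBlocks achieves this with block-sparse GPU kernels rather than the naive Python loop below, which is only a sketch with made-up names.

```python
import torch

def dropless_moe(x: torch.Tensor, expert_idx: torch.Tensor, experts) -> torch.Tensor:
    """Naive "dropless" grouped computation: every token is processed by its
    assigned expert, whatever the per-expert load, by looping over experts
    with variable-sized batches. Illustrative only.

    x: (num_tokens, d_model); expert_idx: (num_tokens,) expert assignment;
    experts: list of nn.Module feed-forward experts.
    """
    out = torch.empty_like(x)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():
            out[mask] = expert(x[mask])   # variable-size batch, no padding or dropping
    return out
```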
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper • 2212.05055 • Published • 5 Note (Optional) Sparse upcycling means reusing a dense checkpoint to initialize a sparse MoE. Google
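The upcycling recipe itself is a one-liner in spirit: copy the dense checkpoint's feed-forward block into every expert of the new MoE layer and continue training with a freshly initialized router. A hedged sketch; the `dense_model.layers[i].ffn` path in the comment is hypothetical.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Sparse upcycling sketch: initialize every expert of a new MoE layer as
    a copy of the dense checkpoint's feed-forward block. The router itself is
    trained from scratch. Illustrative only.
    """
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))

# e.g. experts = upcycle_ffn(dense_model.layers[i].ffn, num_experts=8)  # hypothetical path
```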
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Paper • 2305.14705 • Published Note (Optional) This is the first paper to do instruction tuning with MoEs. Previous MoE fine-tunes did not have great quality; surprisingly, it turns out that MoEs do quite well with instruction tuning. Google, Berkeley, MIT, University of Massachusetts Amherst, UT Austin
From Sparse to Soft Mixtures of Experts
Paper • 2308.00951 • Published • 20 Note (Optional) Soft MoEs introduce a fully differentiable sparse Transformer architecture. Rather than a hard assignment between tokens and experts, they use a soft assignment: experts process weighted averages of all tokens. Google
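The soft dispatch/combine step can be written directly from that description. The sketch below assumes one slot per expert and illustrative parameter names; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft MoE sketch: every expert slot receives a weighted average
    of all tokens (dispatch), and every token receives a weighted average of
    all slot outputs (combine), so routing is fully differentiable.
    Illustrative; assumes one slot per expert.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(d_model, num_experts) * d_model ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, tokens, d_model)
        logits = x @ self.phi                                # (B, T, E)
        dispatch = logits.softmax(dim=1)                     # normalize over tokens
        combine = logits.softmax(dim=2)                      # normalize over slots
        slots = torch.einsum("btd,bte->bed", x, dispatch)    # weighted averages of tokens
        outs = torch.stack(
            [expert(slots[:, e]) for e, expert in enumerate(self.experts)], dim=1
        )                                                    # (B, E, d_model)
        return torch.einsum("bte,bed->btd", combine, outs)   # mix slot outputs per token
```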
Approximating Two-Layer Feedforward Networks for Efficient Transformers
Paper • 2310.10837 • Published • 10 Note (Optional) σ-MoE makes several changes, such as using a sigmoid rather than a softmax for expert selection, a special initialization, and regularization. Swiss AI Lab, Harvard, KAUST
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 26 Note (Optional) Sub-1-bit compression. Enough said. ISTA
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 40 Note (Optional) SwitchHead reduces compute and memory requirements while maintaining performance with the same parameter budget. Swiss AI Lab, KAUST, Harvard
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
Paper • 2312.12379 • Published • 2 Note (Optional) MoCLE is an MoE architecture that activates task-customized parameters based on instruction clusters. It uses an additional universal expert to improve generalization. Southern University of Science and Technology, Hong Kong University of Science and Technology, Huawei Noah's Ark Lab, Peng Cheng Lab.
Fast Inference of Mixture-of-Experts Language Models with Offloading
Paper • 2312.17238 • Published • 7 Note (Optional) The authors use parameter offloading with an MoE-specific strategy to run Mixtral-8x7B with mixed quantization on a free Google Colab instance. Yandex
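The general flavor of MoE-aware offloading (not the authors' exact method) is to keep expert weights in CPU RAM and copy an expert to the GPU only when the router selects it, caching the few most recently used experts. A toy sketch with made-up class and method names:

```python
import copy

class OffloadedExperts:
    """Toy expert-offloading scheme: expert weights live on CPU and a copy is
    moved to the GPU only when the router selects that expert, with a small
    LRU cache of resident experts. Illustrative sketch only.
    """
    def __init__(self, cpu_experts, device="cuda", cache_size=2):
        self.cpu_experts = list(cpu_experts)   # nn.Modules kept on CPU
        self.device = device
        self.cache_size = cache_size
        self.cache = {}                        # expert index -> GPU copy, in LRU order

    def get(self, idx: int):
        if idx in self.cache:
            self.cache[idx] = self.cache.pop(idx)   # refresh LRU position
            return self.cache[idx]
        if len(self.cache) >= self.cache_size:
            oldest = next(iter(self.cache))
            del self.cache[oldest]                  # evict the least recently used expert
        gpu_expert = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
        self.cache[idx] = gpu_expert
        return gpu_expert
```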
Mixtral of Experts
Paper • 2401.04088 • Published • 158 Note (Optional) A great example of an MoE released open source under a permissive license. The paper does not add much new in terms of MoE understanding, but it is still a nice read given that the model is open. Mistral
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 71 Note (Optional) Combines state space models such as Mamba with MoE. IDEAS NCBR, Polish Academy of Sciences, University of Warsaw.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 43 Note (Optional) This paper introduces architectural changes such as fine-grained expert segmentation for specialization, as well as shared-expert isolation. DeepSeek-AI, Peking University, Tsinghua University, Nanjing University
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
Paper • 2404.05567 • Published • 10
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Paper • 2405.05949 • Published • 2
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 13
Multi-Head Mixture-of-Experts
Paper • 2404.15045 • Published • 59
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Paper • 2405.11273 • Published • 17
Yuan 2.0-M32: Mixture of Experts with Attention Router
Paper • 2405.17976 • Published • 18
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Paper • 2406.06563 • Published • 17
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Paper • 2407.01906 • Published • 34
A Closer Look into Mixture-of-Experts in Large Language Models
Paper • 2406.18219 • Published • 15
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 22
DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
Paper • 2407.04078 • Published • 16
On scalable oversight with weak LLMs judging strong LLMs
Paper • 2407.04622 • Published • 11
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Paper • 2406.08085 • Published • 13
Evaluating Language Model Context Windows: A "Working Memory" Test and Inference-time Correction
Paper • 2407.03651 • Published • 15
Mixture of A Million Experts
Paper • 2407.04153 • Published • 4
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper • 2407.21770 • Published • 22
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper • 2407.19985 • Published • 34
Layerwise Recurrent Router for Mixture-of-Experts
Paper • 2408.06793 • Published • 30
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Paper • 2408.15664 • Published • 11
GRIN: GRadient-INformed MoE
Paper • 2409.12136 • Published • 14
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Paper • 2409.16040 • Published • 13
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper • 2410.05993 • Published • 107
Stealing User Prompts from Mixture of Experts
Paper • 2410.22884 • Published • 13