Mixture-of-Experts Routing Was Called a Stupid Idea

Julien Reszka · 2020-06-15T21:47

In the Lex Fridman Discord paper review sessions, Mixture-of-Experts routing was being dismissed as unscalable. Here is why I think the pushback was wrong.

128: the number of experts used in the sparsely-gated MoE layer proposed by Shazeer et al. in 2017, which outperformed dense LSTM models at lower compute cost per token. (Shazeer et al., Outrageously Large Neural Networks, ICLR 2017)

In June 2020, a group of us were reviewing AI papers in the Lex Fridman Discord server. The paper on the table was about Mixture-of-Experts architectures: the idea that instead of routing every input token through the same dense network, you train a gating network that learns to send each token to whichever specialist sub-network is most likely to handle it well.
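To make the routing idea concrete, here is a minimal sketch in plain Python. It is not the code from any paper: the helper names (`route_token`, `make_expert`), the toy experts that only scale their input, and the hand-picked gate weights are all assumptions chosen for illustration. It shows a linear gate scoring each expert, only the top k experts running, and their outputs being mixed by the renormalised gate weights.

```python
import math


def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def route_token(token, gate_weights, experts, k=2):
    """Send one token to its top-k experts and mix their outputs."""
    # One gate score per expert: a linear layer over the token features.
    scores = [dot(w, token) for w in gate_weights]
    # Keep only the k highest-scoring experts; the others do no work.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalise the gate over the selected experts only.
    probs = softmax([scores[i] for i in top])
    outputs = [experts[i](token) for i in top]
    mixed = [
        sum(p * out[d] for p, out in zip(probs, outputs))
        for d in range(len(token))
    ]
    return top, mixed


def make_expert(scale):
    # Hypothetical stand-in for a specialist sub-network: it just scales the input.
    return lambda x: [scale * v for v in x]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

chosen, output = route_token([0.5, 2.0], gate_weights, experts, k=2)
print(chosen)  # [2, 1]: experts 2 and 1 score highest for this token
```

Only two of the four experts execute for this token, which is where the lower compute cost per token comes from: the parameter count grows with the number of experts, while the work per token grows with k.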

The room was not impressed. The pushback was consistent and confident:

load balancing is an unsolved problem and the training will collapse
you cannot backpropagate cleanly through a discrete routing decision
the communication overhead between experts on different devices will eat the compute savings
it is a clever trick, not a scalable architecture

The people saying this were not uninformed. They had read the papers. They understood the implementation challenges. The instability of early MoE training was real. The load balancing problem was real. The routing collapse problem was real.

What the pushback missed was the direction of travel. Every objection was an engineering problem with a known partial solution, dressed up as a fundamental architectural limit. Load balancing: Shazeer et al. already proposed an auxiliary loss term in 2017 that penalises uneven expert utilisation. Discrete routing gradient: the same paper showed that a noisy top-k gating function is differentiable enough to train, because the gate values of the selected experts still have nonzero derivatives with respect to the gating weights. Communication overhead: it shrinks as hardware interconnects improve, and the parameter efficiency of MoE outpaces the overhead at scale.
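The two answers from the 2017 paper can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, the zero-initialised noise weights, the loss weight of 0.01 and the use of population variance are all my choices. The gate adds Gaussian noise scaled by a softplus term to each expert's score, keeps the top k, and applies a softmax over the survivors. The auxiliary loss is the squared coefficient of variation of per-expert importance, where importance is the sum of gate values over the batch.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(x, w_gate, w_noise, k, rng):
    """Sparse gate vector for one token: zero outside the top-k experts."""
    n = len(w_gate)
    clean = [sum(a * b for a, b in zip(w_gate[i], x)) for i in range(n)]
    noise_scale = [
        softplus(sum(a * b for a, b in zip(w_noise[i], x))) for i in range(n)
    ]
    noisy = [clean[i] + rng.gauss(0.0, 1.0) * noise_scale[i] for i in range(n)]
    # Keep the k largest noisy scores, then softmax over those only.
    top = sorted(range(n), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(n)]


def importance_loss(gates, weight=0.01):
    """weight * CV^2 of per-expert importance (gate values summed over the batch)."""
    n = len(gates[0])
    importance = [sum(g[i] for g in gates) for i in range(n)]
    mean = sum(importance) / n
    var = sum((v - mean) ** 2 for v in importance) / n  # population variance (assumption)
    return weight * var / (mean ** 2 + 1e-10)


rng = random.Random(0)
n_experts, dim = 4, 3
w_gate = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
w_noise = [[0.0] * dim for _ in range(n_experts)]  # toy initialisation
batch = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(8)]
gates = [noisy_top_k_gate(x, w_gate, w_noise, 2, rng) for x in batch]

balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(importance_loss(balanced))   # 0.0: every expert gets equal importance
print(importance_loss(collapsed))  # about 0.03: one expert takes everything
```

The loss is zero when importance is spread evenly and grows as routing collapses onto a few experts, so adding it to the training objective pushes the gate away from collapse. The noise gives low-scoring experts an occasional chance to be selected.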

The Lex Fridman Discord sessions were useful precisely because the disagreement was public and fast. You could watch someone state a confident objection and then watch someone else point to a sentence in the paper that answered it. The people pushing hardest against MoE in those sessions were doing so from a prior that dense transformers were the correct inductive bias for language. That prior was wrong, or at least not obviously right. The paper that would prove it conclusively, Switch Transformers by Fedus, Zoph and Shazeer, had not been submitted yet. It would appear on arXiv in January 2021. We were arguing about an idea whose vindication was about seven months away and nobody in the room knew it.

MoE is now the architecture behind some of the largest deployed models. The routing problem turned out to be tractable. The objections from those sessions were accurate descriptions of hard problems. What they got wrong was treating hard as impossible.

The pattern is worth naming because it repeats:

A new architectural idea surfaces with real implementation problems
The problems are treated as evidence the idea is wrong rather than evidence the idea is hard
The people closest to the current paradigm push back hardest
The implementation problems get solved by people who believed the idea was worth solving
The idea is adopted widely and the pushback is forgotten

Attention was called slow and memory-inefficient before it became the default. Transformers were called a brute-force approach with no inductive bias before they replaced everything else. The Kaplan scaling laws paper came out in January 2020, and months later people were still arguing about whether the laws meant anything. None of those objections were stupid. They were accurate descriptions of real problems. What they got wrong was the conclusion.

Myth: Mixture-of-Experts routing is a clever trick that will never scale to real language models.

Reality: Every objection raised against MoE in 2020 is an engineering problem with a known partial solution in the 2017 Shazeer paper. Calling an unsolved engineering problem a theoretical dead end is a category error. (Shazeer et al., Outrageously Large Neural Networks, ICLR 2017)

When a research idea is being dismissed loudly, read the original paper yourself instead of relying on the dismissal. The people pushing hardest against an idea are often the ones who feel most threatened by what it would mean if it worked.

Discussion

Have you dismissed a technical idea because of implementation problems that might already be solved?

Tobias R. Munich, Germany 2020-06-15

I am in similar discussions right now. The load balancing objection feels decisive to most people in the room. But it is a description of a hard optimisation problem, not a proof that the architecture cannot work.

Julien Reszka Paris, France 2020-06-16

Yes. The distinction between 'this is hard' and 'this is wrong' is the one people keep collapsing. Hard things get solved. Wrong things stay wrong.

Nadia F. Paris, France 2020-06-16

The five-step pattern is the useful part. You can run the same sequence on almost any architectural idea that gets called stupid before someone builds it at scale. The objections are always real. The conclusion is always wrong.

Erik S. Stockholm, Sweden 2020-06-17

Pushback: some ideas get dismissed and stay dismissed because they were actually wrong. Survivorship bias makes every failed dismissal look like a mistake. Most things that get called stupid ideas are stupid ideas.

Nadia F. Paris, France 2020-06-17

True, but the way to distinguish is to read the paper rather than relay the dismissal. The MoE objections were specific and checkable. Most people repeating them had not checked.