---
title: "Mixture-of-Experts Routing Was Called a Stupid Idea"
date: 2020-06-15T21:47
author: Julien Reszka
description: "In the Lex Fridman Discord paper review sessions, Mixture-of-Experts routing is being dismissed as unscalable. Here is why I think the pushback is wrong."
keywords: ["AI", "engineering", "software", "learning", "future of work"]
canonical: https://julienreszka.com/blog/mixture-of-experts-routing-was-called-a-stupid-idea/
---

# Mixture-of-Experts Routing Was Called a Stupid Idea

In the Lex Fridman Discord paper review sessions, Mixture-of-Experts routing was dismissed as unscalable. Here is why I think the pushback was wrong.

In June 2020, a group of us were reviewing AI papers in the Lex Fridman Discord server. The paper on the table was about Mixture-of-Experts (MoE) architectures: the idea that instead of sending every input token through the same dense network, you train a gating network that learns to send each token to the one or few specialist sub-networks, the experts, most likely to handle it well. Only a fraction of the model's parameters run for any given token.
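To make the routing idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation from any particular paper: `route_token`, `moe_forward`, the toy scaling "experts" and the random gate weights are all hypothetical names and values chosen for the example.

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # One logit per expert: dot product of the token with that expert's gate vector.
    logits = [sum(t * w for t, w in zip(token, vec)) for vec in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(token, gate_weights, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, w in route_token(token, gate_weights, k):
        y = experts[idx](token)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy setup (assumption): 4 experts that each scale their input by a constant.
random.seed(0)
dim, num_experts = 3, 4
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]

token = [0.5, -1.0, 2.0]
routing = route_token(token, gate_weights, k=2)
output = moe_forward(token, gate_weights, experts, k=2)
print(routing, output)
```

With `k=2` and four experts, half of the expert parameters are skipped for this token. That skipped work is where the compute saving comes from, and it is also why the routing decision matters so much.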

The room was not impressed. The pushback was consistent and confident:

- load balancing is an unsolved problem and the training will collapse
- you cannot backpropagate cleanly through a discrete routing decision
- the communication overhead between experts on different devices will eat the compute savings
- it is a clever trick, not a scalable architecture

The people saying this were not uninformed. They had read the papers. They understood the implementation challenges. The instability of early MoE training was real. The load balancing problem was real. The routing collapse problem was real.

What the pushback missed was the direction of travel. Every objection was an engineering problem with a known partial solution, dressed up as a fundamental architectural limit. Load balancing: Shazeer et al. had already proposed auxiliary loss terms in 2017 that penalise uneven expert utilisation. Discrete routing gradient: the same paper showed that noisy top-k gating trains with ordinary backpropagation, because gradients flow through the gate weights of the experts that were selected. Communication overhead: it shrinks as hardware interconnects improve, and at scale the extra capacity MoE buys per unit of compute outweighs it.
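The two fixes from the 2017 paper are small enough to sketch. The following is my reading of noisy top-k gating and the importance loss (the squared coefficient of variation of per-expert gate totals over a batch), written in plain Python. The function names, the loss weight of `0.1`, the use of population variance and the hand-written batches are assumptions for illustration, not values taken from the paper.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def noisy_top_k_gates(clean_logits, noise_logits, k, rng):
    """Gate vector with k non-zero entries that sum to 1."""
    # H_i = clean_i + StandardNormal() * softplus(noise_i)
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Softmax over the kept logits; every other expert gets a gate of exactly 0.
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]

def importance_loss(batch_gates, weight=0.1):
    """weight * CV^2 of per-expert importance (sum of gate values over the batch)."""
    num_experts = len(batch_gates[0])
    importance = [sum(g[i] for g in batch_gates) for i in range(num_experts)]
    mean = sum(importance) / num_experts
    var = sum((v - mean) ** 2 for v in importance) / num_experts  # population variance (assumption)
    return weight * var / (mean ** 2)

rng = random.Random(0)
gates = noisy_top_k_gates([2.0, 1.0, 0.5, -1.0], [0.0, 0.0, 0.0, 0.0], k=2, rng=rng)

# Two hand-written batches of gate vectors: evenly spread vs. collapsed onto expert 0.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
loss_balanced = importance_loss(balanced)
loss_collapsed = importance_loss(collapsed)
print(gates, loss_balanced, loss_collapsed)
```

The balanced batch gives a loss of zero, while the collapsed batch is penalised. In a real training loop this term would be added to the task loss, so the gradient nudges the gate away from sending everything to one expert. The noise term gives lower-ranked experts an occasional chance to be selected.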

The Lex Fridman Discord sessions were useful precisely because the disagreement was public and fast. You could watch someone state a confident objection and then watch someone else point to a sentence in the paper that answered it. The people pushing hardest against MoE in those sessions were doing so from a prior that dense transformers were the correct inductive bias for language. That prior was wrong, or at least not obviously right. The paper that would make the case at scale, Switch Transformers by Fedus, Zoph and Shazeer, had not been submitted yet. It would appear on arXiv in January 2021. We were arguing about an idea whose vindication was about seven months away, and nobody in the room knew it.

MoE is now the architecture behind some of the largest deployed models. The routing problem turned out to be tractable. The objections from those sessions were accurate descriptions of hard problems. What they got wrong was treating hard as impossible.

The pattern is worth naming because it repeats:

1. A new architectural idea surfaces with real implementation problems
2. The problems are treated as evidence the idea is wrong rather than evidence the idea is hard
3. The people closest to the current paradigm push back hardest
4. The implementation problems get solved by people who believed the idea was worth solving
5. The idea is adopted widely and the pushback is forgotten

Attention was called slow and memory-inefficient before it became the default. Transformers were called a brute-force approach with no inductive bias before they displaced recurrent models. The Kaplan scaling laws paper came out in January 2020, and by June people were still arguing about whether the laws meant anything. None of those objections were stupid. They were accurate descriptions of real problems. What they got wrong was the conclusion.

---

**Actionable insight:** When a research idea is being dismissed loudly, read the original paper yourself instead of relying on the dismissal. The people pushing hardest against an idea are often the ones who feel most threatened by what it would mean if it worked.

## Key figure

**128** — Number of experts used in the sparsely-gated MoE layer proposed by Shazeer et al. in 2017, which outperformed dense LSTM models at lower compute cost per token

*Source: Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, ICLR 2017*

## Myth vs reality

**Myth:** Mixture-of-Experts routing is a clever trick that will never scale to real language models

**Reality:** Every objection raised against MoE in those 2020 sessions describes an engineering problem with a known partial solution in Shazeer et al. (2017). Calling a partly solved engineering problem a theoretical dead end is a category error.

*Source: Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, ICLR 2017*