In June 2020, a group of us were reviewing AI papers in the Lex Fridman Discord server. The paper on the table was about Mixture-of-Experts architectures: the idea that instead of routing every input token through the same dense network, you train a gating network that learns to send each token to whichever specialist sub-network is most likely to handle it well.
The room was not impressed. The pushback was consistent and confident:
- load balancing is an unsolved problem and the training will collapse
- you cannot backpropagate cleanly through a discrete routing decision
- the communication overhead between experts on different devices will eat the compute savings
- it is a clever trick, not a scalable architecture
The people saying this were not uninformed. They had read the papers. They understood the implementation challenges. The instability of early MoE training was real. The load balancing problem was real. The routing collapse problem was real.
What the pushback missed was the direction of travel. Every objection was an engineering problem with a known line of attack, dressed up as a fundamental architectural limit. Load balancing: Shazeer et al. had already proposed an auxiliary loss term in 2017 that penalises uneven expert utilisation. Discrete routing gradient: the same paper showed that noisy top-k gating trains with ordinary backpropagation, because gradients flow through the gate weights of the selected experts even though the selection itself is discrete. Communication overhead: it shrinks as hardware interconnects improve, and the compute saved by activating only a few experts per token outweighs the overhead at scale.
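For readers who have not seen the two mechanisms written down, here is a minimal sketch of noisy top-k gating and the importance-based auxiliary loss, following the formulas in Shazeer et al. (2017) as I understand them. It is plain Python for readability, not the paper's implementation. The function names, the toy sizes, the `weight=0.01` default and the use of population variance are my own assumptions for illustration.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def matvec(x, w):
    # x: list of d floats; w: d x n list of lists. Returns n logits.
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def noisy_top_k_gate(x, w_gate, w_noise, k, rng):
    # H(x)_i = (x . W_gate)_i + N(0, 1) * softplus((x . W_noise)_i)
    clean = matvec(x, w_gate)
    noise_scale = [softplus(z) for z in matvec(x, w_noise)]
    noisy = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, noise_scale)]
    # Keep the k largest logits; every other expert gets gate weight 0.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Softmax over the kept logits only (shifted by the max for stability).
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def importance_loss(gates_per_token, weight=0.01):
    # Importance of an expert = sum of its gate values over the batch.
    n_experts = len(gates_per_token[0])
    importance = [sum(g[e] for g in gates_per_token) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    # Population variance is an assumption made for this sketch.
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    # Squared coefficient of variation: 0 when all experts are used equally.
    return weight * var / (mean ** 2)


# Toy usage with hypothetical sizes: 32 tokens, 8 experts, top-2 routing.
rng = random.Random(0)
d, n_experts, k = 4, 8, 2
w_gate = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(d)]
w_noise = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(d)]
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(32)]
gates = [noisy_top_k_gate(x, w_gate, w_noise, k, rng) for x in tokens]
aux = importance_loss(gates)
```

The sketch does not compute gradients. In an autodiff framework, the nonzero gate values are ordinary softmax outputs, so the gating weights receive gradient through them, and the auxiliary term is simply added to the task loss. The top-k selection itself is never differentiated.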
The Lex Fridman Discord sessions were useful precisely because the disagreement was public and fast. You could watch someone state a confident objection and then watch someone else point to a sentence in the paper that answered it. The people pushing hardest against MoE in those sessions were doing so from a prior that dense transformers were the correct inductive bias for language. That prior was wrong, or at least not obviously right. The paper that would prove it conclusively, Switch Transformers by Fedus, Zoph and Shazeer, had not been submitted yet. It would appear on arXiv in January 2021. We were arguing about an idea whose vindication was six months away and nobody in the room knew it.
MoE is now the architecture behind some of the largest deployed models. The routing problem turned out to be tractable. The objections from those sessions were accurate descriptions of hard problems. What they got wrong was treating hard as impossible.
The pattern is worth naming because it repeats:
- A new architectural idea surfaces with real implementation problems
- The problems are treated as evidence the idea is wrong rather than evidence the idea is hard
- The people closest to the current paradigm push back hardest
- The implementation problems get solved by people who believed the idea was worth solving
- The idea is adopted widely and the pushback is forgotten
Attention was called slow and memory-inefficient before it became the default. Transformers were called a brute-force approach with no inductive bias before they replaced everything else. The Kaplan scaling laws paper came out in January 2020, and at the time of those sessions people were still arguing about whether the laws meant anything. None of those objections were stupid. They were accurate descriptions of real problems. What they got wrong was the conclusion.
Discussion
I am in similar discussions right now. The load balancing objection feels decisive to most people in the room. But it is a description of a hard optimisation problem, not a proof that the architecture cannot work.
Yes. The distinction between 'this is hard' and 'this is wrong' is the one people keep collapsing. Hard things get solved. Wrong things stay wrong.
The five-step pattern is the useful part. You can run the same sequence on almost any architectural idea that gets called stupid before someone builds it at scale. The objections are always real. The conclusion is always wrong.
Pushback: some ideas get dismissed and stay dismissed because they were actually wrong. Survivorship bias makes every failed dismissal look like a mistake. Most things that get called stupid ideas are stupid ideas.
True, but the way to distinguish is to read the paper rather than relay the dismissal. The MoE objections were specific and checkable. Most people repeating them had not checked.