
Learned Routers Don't Learn: Statistical Evidence for Expert Miscalibration in Mixture-of-Experts Models

Andrew Young
Automate Capture Research

Abstract

We present empirical evidence that learned routers in Mixture-of-Experts (MoE) transformer models are miscalibrated with respect to expert quality. Using a per-layer expert isolation methodology with log-probability scoring and rigorous multiple comparison correction (Benjamini-Hochberg step-up FDR), we demonstrate that: (1) experts have statistically significant domain specialization (207/896 expert-layer-domain combinations survive BH-FDR at alpha = 0.05), (2) the learned router ignores this specialization (Fisher z-averaged Spearman rho = -0.017 between natural routing probability and expert quality), and (3) a single expert (E2) is moderately preferred across all domains (~20% above uniform) despite never being the best expert for any domain tested. We propose semantic routing replacement and cross-model expert grafting as zero-training alternatives, and discuss implications for MoE architecture design.

Introduction

Mixture-of-Experts (MoE) architectures promise efficient scaling by activating only a subset of parameters per input. The core assumption is that the learned router selects experts that are best suited for the current input. But does it?

We present systematic evidence that in Phi-mini-MoE (a 16-expert, 32-layer MoE model), the router’s expert selection has near-zero correlation with actual expert quality on domain-specific tasks. Experts are specialized — the router just doesn’t know it.

Methodology

Per-Layer Expert Isolation

For each of 32 layers, 16 experts, and 4 domains (Math, Science, General Knowledge, Reasoning), we:

  1. Force-route all tokens through a single expert at a single layer
  2. Measure log-probability delta vs. uniform routing baseline
  3. Compare forced performance with the router’s natural activation patterns

This yields 5,120 total evaluations (32 layers × 16 experts × 4 domains × n items).
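The three steps above can be sketched as three routing modes over one layer's router logits. The top-k mechanics and all function names here are illustrative assumptions for a generic MoE layer, not Phi-mini-MoE's actual code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def natural_route(logits, top_k=2):
    """Learned routing: softmax over experts, keep top-k, renormalize."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i])[-top_k:]
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    s = sum(kept)
    return [p / s for p in kept]

def force_route(logits, expert_idx):
    """Expert isolation: every token goes through one expert at this layer."""
    return [1.0 if i == expert_idx else 0.0 for i in range(len(logits))]

def uniform_route(logits):
    """Uniform baseline: each of the E experts weighted 1/E."""
    E = len(logits)
    return [1.0 / E] * E
```

Scoring then reduces to comparing the model's log-probability on the same items under `force_route` versus `uniform_route`, holding every other layer's routing fixed.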

Statistical Framework

  • Pairwise comparisons: Wilcoxon signed-rank test for non-parametric paired differences
  • Multiple comparison correction: Benjamini-Hochberg step-up FDR at α = 0.05
  • Router calibration: Spearman rank correlation with Fisher z-transform
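For the paired test, `scipy.stats.wilcoxon` is the standard tool; the standalone sketch below shows the mechanics with a normal approximation and omits the zero/tie variance corrections scipy applies, so it is illustrative rather than a drop-in replacement:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank test, two-sided, normal approximation.

    x, y: paired samples (e.g. forced-expert vs uniform-routing log-probs).
    Zero differences are dropped; tied |differences| get average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

Each (layer, expert, domain) cell gets one such p-value, and the full set is then corrected with BH-FDR.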

Key Findings

Finding 1: Experts Have Measurable Domain Specialization

207 out of 896 expert-layer-domain combinations (23.1%) show statistically significant specialization after BH-FDR correction. The top-performing expert-layer pair per domain:

  • Math: Expert 7 at Layer 30
  • Science: Expert 1 at Layer 3
  • General Knowledge: Expert 11 at Layer 4
  • Reasoning: Expert 7 at Layer 30

Finding 2: The Learned Router Is Miscalibrated

Fisher z-averaged Spearman correlation between routing probability and expert quality:

Domain              Spearman ρ   Interpretation
Math                +0.043       Effectively zero
Science             -0.027       Effectively zero
General Knowledge   -0.146       Weakly negative
Reasoning           +0.063       Effectively zero
Average             -0.017       No correlation
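The overall average follows from Fisher z-averaging the per-domain values: transform each ρ with atanh, average in z-space, and map back with tanh. This reproduces the reported figure:

```python
import math

# Per-domain Spearman correlations between routing probability and quality
rhos = {
    "Math": 0.043,
    "Science": -0.027,
    "General Knowledge": -0.146,
    "Reasoning": 0.063,
}

# Fisher z-transform, average in z-space, transform back
z_mean = sum(math.atanh(r) for r in rhos.values()) / len(rhos)
rho_avg = math.tanh(z_mean)
print(round(rho_avg, 3))  # -0.017
```

Averaging in z-space rather than on raw ρ values keeps the estimate unbiased when correlations differ in magnitude; at values this close to zero the two averages nearly coincide.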

The router shows no meaningful correlation with expert quality. For general knowledge, routing probability is weakly anti-correlated with quality: the router tends to route away from better experts.

Finding 3: A Single Expert Dominates All Domains

Expert E2 is the most-activated expert across all four domains (~20% above uniform activation of 0.0625), despite never being the best expert for any domain tested. This suggests the router has learned a popularity bias rather than quality-based selection.
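The arithmetic behind "~20% above uniform" is direct; the concrete probability values below are made up to mirror the reported gap, not measured from the model:

```python
# With 16 experts, uniform routing would give each expert 1/16 of the mass.
E = 16
uniform = 1.0 / E                    # 0.0625
e2_share = 0.075                     # hypothetical mean routing prob for E2
other = (1.0 - e2_share) / (E - 1)   # remaining mass split evenly

shares = [e2_share if i == 2 else other for i in range(E)]
print(shares[2] / uniform)  # 1.2, i.e. 20% above uniform
```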

Finding 4: Layer Position Determines Expert Impact

A two-regime model emerges:

  • Early layers (L3–L5) and near-output layer (L30): Highest expert impact (avg best delta +0.647 to +0.779)
  • Middle layers (L8, L13): Minimal to no effect

Finding 5: BH Step-Up vs. Step-Down Matters

BH is a step-up procedure: scan the sorted p-values and reject every hypothesis up to the largest k with p_(k) ≤ (k/m)α. An incorrect step-down scan that stops at the first failing p-value yields 0/896 surviving tests (no specialization found), while the correct step-up yields 207/896. This implementation detail alone reverses the paper's conclusions, making it a critical methodological point.
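The two scans can be contrasted directly. On the p-value pattern below, every sorted p-value clears its threshold except the first, so the step-up procedure rejects all three while the faulty step-down scan rejects none, mirroring the 207/896-versus-0/896 gap:

```python
def bh_stepup(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find the LARGEST k with
    p_(k) <= (k/m) * alpha, then reject hypotheses 1..k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * alpha:
            k_max = k              # keep scanning: a later k can still qualify
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        if k <= k_max:
            reject[i] = True
    return reject

def bh_stepdown_wrong(pvals, alpha=0.05):
    """Incorrect 'step-down' reading: stop at the FIRST failing p-value,
    discarding rejections the step-up procedure allows."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        if pvals[i] > k / m * alpha:
            break
        reject[i] = True
    return reject

p = [0.020, 0.021, 0.022]
print(sum(bh_stepup(p)), sum(bh_stepdown_wrong(p)))  # 3 vs 0
```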

v4.0 Update (n=50)

Scaling from n=10 to n=50 per evaluation dramatically strengthens the findings:

Metric                v3.2 (n=10)     v4.0 (n=50)
BH-FDR significant    207/896 (23%)   482/896 (54%)
Mean Spearman ρ       -0.017          +0.069
Math ρ                +0.043          +0.098
Science ρ             -0.027          -0.005
General ρ             -0.146          +0.051
Reasoning ρ           +0.063          +0.132

With 5× more data, more than half of expert-layer-domain combinations now show significant specialization, yet the router correlation remains near zero. The miscalibration story is even stronger.

Discussion

Why Do Routers Miscalibrate?

The MoE training objective optimizes for the router and experts jointly. The router may learn to distribute tokens for load balancing or gradient flow rather than per-token expert quality. The auxiliary load-balancing loss, required for training stability, explicitly pushes against specialization-aware routing.
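Phi-mini-MoE's exact auxiliary loss is not reproduced here; the sketch below uses the widely cited Switch Transformer form of the load-balancing loss to make the incentive concrete. Note that no term rewards sending a token to the expert that would score it best:

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i, where
    f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean router probability for expert i. Minimized by a uniform spread."""
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    P = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))
```

A perfectly balanced batch achieves the minimum value 1.0; concentrating all tokens on one expert drives the loss toward N. A router that discovered genuine specialization and routed accordingly would pay this penalty.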

Implications

  1. Semantic routing replacement: Replace learned routers with domain-aware selection
  2. Cross-model expert grafting: Transfer specialized experts between models (connects to the Blades framework)
  3. Architecture redesign: Rethink training objectives to align routing with expert quality
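A zero-training semantic router can be as simple as classifying the prompt's domain and forcing the per-domain best (layer, expert) pair from Finding 1. The keyword classifier and lookup-table shape below are illustrative assumptions, not the paper's implementation:

```python
# Domain -> (layer, expert) lookup distilled from Finding 1
BEST_EXPERT = {
    "math":      (30, 7),   # Math: Expert 7 at Layer 30
    "science":   (3, 1),    # Science: Expert 1 at Layer 3
    "general":   (4, 11),   # General Knowledge: Expert 11 at Layer 4
    "reasoning": (30, 7),   # Reasoning: Expert 7 at Layer 30
}

# Toy keyword classifier; a real system would use an embedding classifier
KEYWORDS = {
    "math":      ("integral", "equation", "prove", "solve"),
    "science":   ("molecule", "cell", "energy", "physics"),
    "reasoning": ("therefore", "implies", "premise", "syllogism"),
}

def classify_domain(prompt):
    """Map a prompt to one of the four evaluated domains."""
    words = prompt.lower().split()
    for domain, kws in KEYWORDS.items():
        if any(k in words for k in kws):
            return domain
    return "general"

def semantic_route(prompt):
    """Override the learned router with a domain-aware (layer, expert) pick."""
    return BEST_EXPERT[classify_domain(prompt)]
```

In practice the override would only replace routing at the listed layer, leaving the learned router in place elsewhere, since Finding 4 shows most layers contribute little expert-specific signal.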

Conclusion

Learned routers in MoE models show near-zero correlation with expert quality despite significant expert specialization. This miscalibration suggests that current MoE training objectives do not align routing decisions with individual expert quality, opening opportunities for zero-training improvements through semantic routing and expert grafting.

Cite this article

Andrew Young (2026). Learned Routers Don't Learn: Statistical Evidence for Expert Miscalibration in Mixture-of-Experts Models. Automate Capture Research.