Introduction
Mixture-of-Experts (MoE) architectures promise efficient scaling by activating only a subset of parameters per input. The core assumption is that the learned router selects experts that are best suited for the current input. But does it?
We present systematic evidence that in Phi-mini-MoE (a 16-expert, 32-layer MoE model), the router’s expert selection has near-zero correlation with actual expert quality on domain-specific tasks. Experts are specialized — the router just doesn’t know it.
Methodology
Per-Layer Expert Isolation
For each of 32 layers, 16 experts, and 4 domains (Math, Science, General Knowledge, Reasoning), we:
- Force-route all tokens through a single expert at a single layer
- Measure log-probability delta vs. uniform routing baseline
- Compare forced performance with the router’s natural activation patterns
This yields 5,120 forced-routing evaluations in total across the grid of 32 layers × 16 experts × 4 domains, with n items scored per evaluated cell.
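The forced-routing intervention can be sketched with a toy dense MoE layer in numpy. The layer sizes, router weights, and `force_expert` hook below are illustrative stand-ins for patching Phi-mini-MoE's gating, not its actual API; the log-probability delta would then come from scoring answers under the full LM with and without the intervention.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, D = 16, 8  # 16 experts as in the paper; hidden size is a toy value

# Toy experts: each is a linear map standing in for an FFN expert.
experts = rng.normal(size=(N_EXPERTS, D, D))
router_w = rng.normal(size=(D, N_EXPERTS))

def moe_layer(x, force_expert=None):
    """Route tokens x through the MoE layer.

    force_expert=None -> learned router (softmax over router logits)
    force_expert=k    -> one-hot route every token through expert k
    """
    logits = x @ router_w
    if force_expert is None:
        gates = np.exp(logits - logits.max(-1, keepdims=True))
        gates /= gates.sum(-1, keepdims=True)
    else:
        gates = np.zeros_like(logits)
        gates[..., force_expert] = 1.0  # forced one-hot routing
    # Gate-weighted sum of expert outputs (dense for clarity; real MoE is sparse top-k).
    return np.einsum('te,edk,td->tk', gates, experts, x)

x = rng.normal(size=(4, D))            # 4 toy tokens
natural = moe_layer(x)                 # learned routing
forced = moe_layer(x, force_expert=7)  # all tokens -> expert 7
```

With one-hot gates, the layer output collapses to the single chosen expert's output, which is exactly the isolation the per-layer sweep needs.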
Statistical Framework
- Pairwise comparisons: Wilcoxon signed-rank test for non-parametric paired differences
- Multiple comparison correction: Benjamini-Hochberg step-up FDR at α = 0.05
- Router calibration: Spearman rank correlation with Fisher z-transform
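The three pieces of the framework above can be wired together with scipy; the toy data here is simulated, and only the statistical machinery (paired Wilcoxon test, Spearman rank correlation, Fisher z-averaging) mirrors the paper's setup.

```python
import numpy as np
from scipy.stats import wilcoxon, spearmanr

rng = np.random.default_rng(0)

# Paired per-item scores for one (layer, expert, domain) cell (simulated).
forced_lp = rng.normal(0.3, 1.0, size=50)      # log-probs under forced routing
uniform_lp = rng.normal(0.0, 1.0, size=50)     # log-probs under uniform baseline
stat, p_val = wilcoxon(forced_lp, uniform_lp)  # non-parametric paired test

# Router calibration for one layer/domain: does routing probability track quality?
route_prob = rng.dirichlet(np.ones(16))  # router's mean probability per expert
quality = rng.normal(size=16)            # measured per-expert log-prob delta
rho, _ = spearmanr(route_prob, quality)

def fisher_avg(rhos):
    """Average Spearman correlations via the Fisher z-transform:
    z = arctanh(rho), mean in z-space, tanh back. Less biased than a raw
    mean of rhos, especially for correlations away from zero."""
    z = np.arctanh(np.clip(rhos, -0.999999, 0.999999))
    return float(np.tanh(np.mean(z)))
```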
Key Findings
Finding 1: Experts Have Measurable Domain Specialization
207 out of 896 expert-layer-domain combinations (23.1%) show statistically significant specialization after BH-FDR correction. Best experts differ across domains:
- Math: Expert 7 at Layer 30
- Science: Expert 1 at Layer 3
- General Knowledge: Expert 11 at Layer 4
- Reasoning: Expert 7 at Layer 30
Finding 2: The Learned Router Is Miscalibrated
Fisher z-averaged Spearman correlation between routing probability and expert quality:
| Domain | Spearman ρ | Interpretation |
|---|---|---|
| Math | +0.043 | Effectively zero |
| Science | -0.027 | Effectively zero |
| General Knowledge | -0.146 | Weakly negative |
| Reasoning | +0.063 | Effectively zero |
| Average | -0.017 | No correlation |
The router shows no meaningful correlation with expert quality in any domain. For general knowledge, routing probability is weakly anti-correlated with quality: the router systematically under-selects its better experts.
Finding 3: A Single Expert Dominates All Domains
Expert 2 is the most-activated expert across all four domains (activation share roughly 20% above the uniform baseline of 1/16 = 0.0625), despite never being the best expert for any domain tested. This suggests the router has learned a popularity bias rather than quality-based selection.
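Over-activation relative to the uniform baseline is straightforward to compute from logged top-1 gate decisions. The decisions below are simulated with an inflated share for expert 2; the real pattern comes from instrumenting Phi-mini-MoE's gates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated top-1 routing decisions with expert 2 over-selected by ~20%.
probs = np.full(16, 1 / 16)
probs[2] *= 1.20
probs /= probs.sum()
top1 = rng.choice(16, size=100_000, p=probs)

freq = np.bincount(top1, minlength=16) / len(top1)
uniform = 1 / 16                    # 0.0625 for a 16-expert layer
over = freq / uniform - 1           # e.g. +0.19 means 19% above uniform share
most_used = int(np.argmax(freq))
```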
Finding 4: Layer Position Determines Expert Impact
A two-regime model emerges:
- Early layers (L3–L5) and near-output layer (L30): Highest expert impact (avg best delta +0.647 to +0.779)
- Middle layers (L8, L13): Minimal to no effect
Finding 5: BH Step-Up vs. Step-Down Matters
An incorrect step-down implementation of the Benjamini-Hochberg procedure (stopping at the first p-value that exceeds its threshold) yields 0/896 surviving tests, i.e. no specialization found, while the correct step-up procedure (rejecting every hypothesis up to the largest passing p-value) yields 207/896. This single implementation detail reverses the paper's conclusions entirely, making it a critical methodological contribution.
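The two variants differ only in the scan direction, but that difference matters: in the step-up procedure a later p-value that passes its threshold rescues earlier ones that failed theirs. A minimal sketch of both, with a two-value example where they disagree:

```python
import numpy as np

def bh_step_up(pvals, alpha=0.05):
    """Correct Benjamini-Hochberg: find the LARGEST k with p_(k) <= (k/m)*alpha
    and reject hypotheses 1..k. Later passing p-values rescue earlier failures."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    ok = p <= alpha * np.arange(1, m + 1) / m
    return 0 if not ok.any() else int(np.max(np.nonzero(ok)[0])) + 1

def bh_step_down_bug(pvals, alpha=0.05):
    """Buggy step-down variant of the same thresholds: stop at the FIRST
    p-value exceeding its threshold, rejecting only the prefix before it."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    k = 0
    for i, pi in enumerate(p, start=1):
        if pi > alpha * i / m:
            break
        k = i
    return k

pvals = [0.03, 0.04]  # BH thresholds at alpha=0.05: 0.025 and 0.05
print(bh_step_up(pvals))        # 2 -- both hypotheses rejected
print(bh_step_down_bug(pvals))  # 0 -- scan stops immediately, none rejected
```

Since p_(1) = 0.03 exceeds its threshold of 0.025 but p_(2) = 0.04 passes 0.05, the buggy scan halts at the first comparison while correct BH rejects both, which is exactly the 0/896-versus-207/896 failure mode described above.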
v4.0 Update (n=50)
Scaling from n=10 to n=50 per evaluation dramatically strengthens the findings:
| Metric | v3.2 (n=10) | v4.0 (n=50) |
|---|---|---|
| BH FDR significant | 207/896 (23%) | 482/896 (54%) |
| Mean Spearman ρ | -0.017 | +0.069 |
| Math ρ | +0.043 | +0.098 |
| Science ρ | -0.027 | -0.005 |
| General ρ | -0.146 | +0.051 |
| Reasoning ρ | +0.063 | +0.132 |
With 5× more data per cell, more than half of all expert-layer-domain combinations now show significant specialization, yet every per-domain router correlation remains below ρ = 0.14 in magnitude. Stronger evidence of specialization alongside still-negligible calibration makes the miscalibration finding more robust.
Discussion
Why Do Routers Miscalibrate?
The MoE training objective optimizes for the router and experts jointly. The router may learn to distribute tokens for load balancing or gradient flow rather than per-token expert quality. The auxiliary load-balancing loss, required for training stability, explicitly pushes against specialization-aware routing.
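The tension described above is visible in the loss itself. Below is a sketch of the Switch-Transformer-style auxiliary loss, a common MoE choice; whether Phi-mini-MoE uses exactly this form is an assumption.

```python
import numpy as np

def load_balance_loss(gate_probs, top1_idx, n_experts=16):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i, where
    f_i is the fraction of tokens top-1 routed to expert i and P_i is the
    mean router probability for expert i. It equals 1.0 under perfectly
    uniform routing and grows as routing concentrates -- it rewards spreading
    tokens evenly, saying nothing about whether a token reached its best expert."""
    f = np.bincount(top1_idx, minlength=n_experts) / len(top1_idx)
    P = np.asarray(gate_probs).mean(axis=0)
    return n_experts * float(f @ P)

# Balanced routing over 4 toy experts sits at the minimum of 1.0;
# collapsing all tokens onto one expert drives the loss up to N.
balanced = load_balance_loss(np.full((8, 4), 0.25), np.arange(8) % 4, n_experts=4)
collapsed = load_balance_loss(np.eye(4)[np.zeros(8, int)], np.zeros(8, int), n_experts=4)
```

Because the gradient of this term pushes every expert's traffic toward 1/N, a router that tried to send all math tokens to a genuinely better math expert would be penalized for it.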
Implications
- Semantic routing replacement: Replace learned routers with domain-aware selection
- Cross-model expert grafting: Transfer specialized experts between models (connects to the Blades framework)
- Architecture redesign: Rethink training objectives to align routing with expert quality
Conclusion
Learned routers in MoE models can show near-zero correlation with expert quality despite significant expert specialization. This miscalibration suggests that current MoE training objectives do not align routing decisions with per-expert quality, opening opportunities for zero-training improvements through semantic routing and expert grafting.