
Learned Routers Don't Learn: Statistical Evidence for Expert Miscalibration in Mixture-of-Experts Models

Andrew Young
Automate Capture Research

Abstract

We present empirical evidence that learned routers in Mixture-of-Experts (MoE) transformer models are miscalibrated with respect to expert quality. Using a per-layer expert isolation methodology with log-probability scoring and rigorous multiple comparison correction (Benjamini-Hochberg step-up FDR), we demonstrate that: (1) experts have statistically significant domain specialization (207/896 expert-layer-domain combinations survive BH-FDR at alpha = 0.05), (2) the learned router ignores this specialization (Fisher z-averaged Spearman rho = -0.017 between natural routing probability and expert quality), and (3) a single expert (E2) is moderately preferred across all domains (~20% above uniform) despite never being the best expert for any domain tested. We propose semantic routing replacement and cross-model expert grafting as zero-training alternatives, and discuss implications for MoE architecture design.

Introduction

Mixture-of-Experts (MoE) architectures promise efficient scaling by activating only a subset of parameters per input. The core assumption is that the learned router selects experts that are best suited for the current input. But does it?

We present systematic evidence that in Phi-mini-MoE (a 16-expert, 32-layer MoE model), the router’s expert selection has near-zero correlation with actual expert quality on domain-specific tasks. Experts are specialized — the router just doesn’t know it.

Methodology

Per-Layer Expert Isolation

For each of 32 layers, 16 experts, and 4 domains (Math, Science, General Knowledge, Reasoning), we:

  1. Force-route all tokens through a single expert at a single layer
  2. Measure log-probability delta vs. uniform routing baseline
  3. Compare forced performance with the router’s natural activation patterns

This yields 5,120 total evaluations (32 layers × 16 experts × 4 domains × n items).
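The three steps above can be sketched as three routing modes over one layer's router logits. The top-k mechanics and all function names here are illustrative assumptions for a generic MoE layer, not Phi-mini-MoE's actual code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def natural_route(logits, top_k=2):
    """Learned routing: softmax over experts, keep top-k, renormalize."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i])[-top_k:]
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    s = sum(kept)
    return [p / s for p in kept]

def force_route(logits, expert_idx):
    """Expert isolation: every token goes through one expert at this layer."""
    return [1.0 if i == expert_idx else 0.0 for i in range(len(logits))]

def uniform_route(logits):
    """Uniform baseline: each of the E experts weighted 1/E."""
    E = len(logits)
    return [1.0 / E] * E
```

Scoring then reduces to comparing the model's log-probability on the same items under `force_route` versus `uniform_route`, holding every other layer's routing fixed.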

Statistical Framework

  • Pairwise comparisons: Wilcoxon signed-rank test for non-parametric paired differences
  • Multiple comparison correction: Benjamini-Hochberg step-up FDR at α = 0.05
  • Router calibration: Spearman rank correlation with Fisher z-transform
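For the paired test, `scipy.stats.wilcoxon` is the standard tool; the standalone sketch below shows the mechanics with a normal approximation and omits the zero/tie variance corrections scipy applies, so it is illustrative rather than a drop-in replacement:

```python
import math

def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank test, two-sided, normal approximation.

    x, y: paired samples (e.g. forced-expert vs uniform-routing log-probs).
    Zero differences are dropped; tied |differences| get average ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```

Each (layer, expert, domain) cell gets one such p-value, and the full set is then corrected with BH-FDR.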

Key Findings

Finding 1: Experts Have Measurable Domain Specialization

207 out of 896 expert-layer-domain combinations (23.1%) show statistically significant specialization after BH-FDR correction. The top-performing expert-layer pair per domain:

  • Math: Expert 7 at Layer 30
  • Science: Expert 1 at Layer 3
  • General Knowledge: Expert 11 at Layer 4
  • Reasoning: Expert 7 at Layer 30

Finding 2: The Learned Router Is Miscalibrated

Fisher z-averaged Spearman correlation between routing probability and expert quality:

Domain              Spearman ρ   Interpretation
Math                +0.043       Effectively zero
Science             -0.027       Effectively zero
General Knowledge   -0.146       Weakly negative
Reasoning           +0.063       Effectively zero
Average             -0.017       No correlation
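The overall average follows from Fisher z-averaging the per-domain values: transform each ρ with atanh, average in z-space, and map back with tanh. This reproduces the reported figure:

```python
import math

# Per-domain Spearman correlations between routing probability and quality
rhos = {
    "Math": 0.043,
    "Science": -0.027,
    "General Knowledge": -0.146,
    "Reasoning": 0.063,
}

# Fisher z-transform, average in z-space, transform back
z_mean = sum(math.atanh(r) for r in rhos.values()) / len(rhos)
rho_avg = math.tanh(z_mean)
print(round(rho_avg, 3))  # -0.017
```

Averaging in z-space rather than on raw ρ values keeps the estimate unbiased when correlations differ in magnitude; at values this close to zero the two averages nearly coincide.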

The router shows no meaningful correlation with expert quality. For general knowledge, routing probability is weakly anti-correlated with quality: the router tends to route away from better experts.

Finding 3: A Single Expert Dominates All Domains

Expert E2 is the most-activated expert across all four domains (~20% above uniform activation of 0.0625), despite never being the best expert for any domain tested. This suggests the router has learned a popularity bias rather than quality-based selection.
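The arithmetic behind "~20% above uniform" is direct; the concrete probability values below are made up to mirror the reported gap, not measured from the model:

```python
# With 16 experts, uniform routing would give each expert 1/16 of the mass.
E = 16
uniform = 1.0 / E                    # 0.0625
e2_share = 0.075                     # hypothetical mean routing prob for E2
other = (1.0 - e2_share) / (E - 1)   # remaining mass split evenly

shares = [e2_share if i == 2 else other for i in range(E)]
print(shares[2] / uniform)  # 1.2, i.e. 20% above uniform
```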

Finding 4: Layer Position Determines Expert Impact

A two-regime model emerges:

  • Early layers (L3–L5) and near-output layer (L30): Highest expert impact (avg best delta +0.647 to +0.779)
  • Middle layers (L8, L13): Minimal to no effect

Finding 5: BH Step-Up vs. Step-Down Matters

BH is a step-up procedure: scan the sorted p-values and reject every hypothesis up to the largest k with p_(k) ≤ (k/m)α. An incorrect step-down scan that stops at the first failing p-value yields 0/896 surviving tests (no specialization found), while the correct step-up yields 207/896. This implementation detail alone reverses the paper's conclusions, making it a critical methodological point.
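The two scans can be contrasted directly. On the p-value pattern below, every sorted p-value clears its threshold except the first, so the step-up procedure rejects all three while the faulty step-down scan rejects none, mirroring the 207/896-versus-0/896 gap:

```python
def bh_stepup(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find the LARGEST k with
    p_(k) <= (k/m) * alpha, then reject hypotheses 1..k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * alpha:
            k_max = k              # keep scanning: a later k can still qualify
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        if k <= k_max:
            reject[i] = True
    return reject

def bh_stepdown_wrong(pvals, alpha=0.05):
    """Incorrect 'step-down' reading: stop at the FIRST failing p-value,
    discarding rejections the step-up procedure allows."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        if pvals[i] > k / m * alpha:
            break
        reject[i] = True
    return reject

p = [0.020, 0.021, 0.022]
print(sum(bh_stepup(p)), sum(bh_stepdown_wrong(p)))  # 3 vs 0
```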

v4.0 Update (n=50)

Scaling from n=10 to n=50 per evaluation dramatically strengthens the findings:

Metric                v3.2 (n=10)     v4.0 (n=50)
BH-FDR significant    207/896 (23%)   482/896 (54%)
Mean Spearman ρ       -0.017          +0.069
Math ρ                +0.043          +0.098
Science ρ             -0.027          -0.005
General ρ             -0.146          +0.051
Reasoning ρ           +0.063          +0.132

With 5× more data, more than half of expert-layer-domain combinations now show significant specialization, yet the router correlation remains near zero. The miscalibration story is even stronger.

Discussion

Why Do Routers Miscalibrate?

The MoE training objective optimizes for the router and experts jointly. The router may learn to distribute tokens for load balancing or gradient flow rather than per-token expert quality. The auxiliary load-balancing loss, required for training stability, explicitly pushes against specialization-aware routing.
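Phi-mini-MoE's exact auxiliary loss is not reproduced here; the sketch below uses the widely cited Switch Transformer form of the load-balancing loss to make the incentive concrete. Note that no term rewards sending a token to the expert that would score it best:

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i, where
    f_i is the fraction of tokens dispatched to expert i and P_i is the
    mean router probability for expert i. Minimized by a uniform spread."""
    n_tokens = len(assignments)
    f = [0.0] * num_experts
    for a in assignments:
        f[a] += 1.0 / n_tokens
    P = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))
```

A perfectly balanced batch achieves the minimum value 1.0; concentrating all tokens on one expert drives the loss toward N. A router that discovered genuine specialization and routed accordingly would pay this penalty.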

Implications

  1. Semantic routing replacement: Replace learned routers with domain-aware selection
  2. Cross-model expert grafting: Transfer specialized experts between models (connects to the Blades framework)
  3. Architecture redesign: Rethink training objectives to align routing with expert quality
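A zero-training semantic router can be as simple as classifying the prompt's domain and forcing the per-domain best (layer, expert) pair from Finding 1. The keyword classifier and lookup-table shape below are illustrative assumptions, not the paper's implementation:

```python
# Domain -> (layer, expert) lookup distilled from Finding 1
BEST_EXPERT = {
    "math":      (30, 7),   # Math: Expert 7 at Layer 30
    "science":   (3, 1),    # Science: Expert 1 at Layer 3
    "general":   (4, 11),   # General Knowledge: Expert 11 at Layer 4
    "reasoning": (30, 7),   # Reasoning: Expert 7 at Layer 30
}

# Toy keyword classifier; a real system would use an embedding classifier
KEYWORDS = {
    "math":      ("integral", "equation", "prove", "solve"),
    "science":   ("molecule", "cell", "energy", "physics"),
    "reasoning": ("therefore", "implies", "premise", "syllogism"),
}

def classify_domain(prompt):
    """Map a prompt to one of the four evaluated domains."""
    words = prompt.lower().split()
    for domain, kws in KEYWORDS.items():
        if any(k in words for k in kws):
            return domain
    return "general"

def semantic_route(prompt):
    """Override the learned router with a domain-aware (layer, expert) pick."""
    return BEST_EXPERT[classify_domain(prompt)]
```

In practice the override would only replace routing at the listed layer, leaving the learned router in place elsewhere, since Finding 4 shows most layers contribute little expert-specific signal.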

Conclusion

Learned routers in MoE models show near-zero correlation with expert quality despite significant expert specialization. This miscalibration suggests that current MoE training objectives do not align routing decisions with individual expert quality, opening opportunities for zero-training improvements through semantic routing and expert grafting.

Cite this article

Andrew Young (2026). Learned Routers Don't Learn: Statistical Evidence for Expert Miscalibration in Mixture-of-Experts Models. Automate Capture Research.