Introduction
Large language models waste computation. Every token passes through all feed-forward network (FFN) neurons in every layer, yet many of these computations are redundant for the given input domain. This paper investigates a fundamental hypothesis: as models scale, neurons become increasingly specialized for specific domains, enabling selective computation without quality loss.
Our contribution differs from prior work: through systematic analysis across multiple model scales and architectures, we identify a robust principle that requires no additional training: larger transformers naturally cluster neurons by domain, offering 2-4x compute-reduction potential that grows with model size.
Method
Neuron Activation Profiling
We profile neuron activations across diverse inputs to measure domain-specific activation strengths. For each model layer, we compute mean absolute activation per neuron across representative inputs.
A neuron is specialized if its activation for one domain far exceeds its mean activation across all domains: the same neuron may be critical for one domain and negligible for another.
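The profiling and specialization measure can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the per-domain dictionary layout are assumptions.

```python
import numpy as np

def profile_neurons(activations_by_domain):
    """Mean absolute activation per neuron, per domain.

    activations_by_domain maps a domain name to an array of FFN
    hidden activations with shape (n_tokens, n_neurons).
    """
    return {d: np.abs(a).mean(axis=0) for d, a in activations_by_domain.items()}

def specialization_scores(profiles):
    """Ratio of each neuron's per-domain activation to its mean across
    all domains; a ratio well above 1 marks the neuron as specialized
    for that domain, a ratio near 0 marks it as negligible there."""
    stacked = np.stack(list(profiles.values()))   # (n_domains, n_neurons)
    overall = stacked.mean(axis=0) + 1e-8         # guard against dead neurons
    return {d: p / overall for d, p in profiles.items()}
```

A neuron that fires strongly only on, say, math inputs will score well above 1 for the math domain and well below 1 elsewhere, capturing the "critical for one, negligible for another" behavior described above.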
Negative Selection Strategy
Rather than predict which neurons will contribute (risky), we identify neurons that definitely won’t (safe). For domain d in layer ℓ, a neuron is skippable if its activation is below a percentile threshold. This conservative approach prioritizes quality preservation over maximum compute reduction.
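A minimal sketch of the thresholding step, assuming the per-domain activation profile from the previous section (the function name and default percentile are illustrative):

```python
import numpy as np

def skippable_neurons(domain_profile, percentile=10.0):
    """Indices of neurons whose mean |activation| for a domain falls
    below the given percentile; these are the 'definitely won't
    contribute' candidates that negative selection skips."""
    threshold = np.percentile(domain_profile, percentile)
    return np.flatnonzero(domain_profile <= threshold)
```

Because the threshold bounds only the weakest tail of the activation distribution, raising the percentile trades quality margin for compute reduction in a controlled way.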
Key Findings
Finding 1: Scale Amplifies Neuron Specialization
| Model | Params | Avg Specialized | Potential Speedup |
|---|---|---|---|
| phi-2 | 2.7B | 30.9% | 1.45x |
| Phi-3.5-mini | 3.8B | 30.2% | 1.43x |
| K2-Think | 32B | 68.2% | 3.14x |
Statistical Finding: Specialization increases by +1.31 percentage points per billion parameters (Pearson r = 0.999, 95% CI: [0.995, 0.9999]).
This is counterintuitive: as models scale, they become sparser, not denser. Sparse pathways become increasingly effective at frontier scale.
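The slope and correlation can be recomputed directly from the three table rows with an ordinary least-squares fit (a sanity check on the reported statistic, nothing more):

```python
import numpy as np

# Parameter counts (billions) and average specialization (%) from the table.
params = np.array([2.7, 3.8, 32.0])
specialized = np.array([30.9, 30.2, 68.2])

slope, intercept = np.polyfit(params, specialized, 1)   # least-squares line
r = np.corrcoef(params, specialized)[0, 1]              # Pearson correlation

print(f"slope = {slope:+.2f} points per billion params, r = {r:.3f}")
# → slope = +1.31 points per billion params, r = 0.999
```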
Finding 2: Layer Roles Differ Dramatically
Layer-wise specialization in the 32B model reveals a functional hierarchy:
| Depth | Layer | Specialization | Role |
|---|---|---|---|
| 3% | L2 | 14.4% | Syntax recognition |
| 28% | L18 | 79.6% | Domain knowledge |
| 53% | L34 | 72.9% | Domain computation |
| 88% | L56 | 86.6% | Peak specialization |
| 97% | L62 | 76.9% | Output preparation |
Practical implication: sparsity targets should be set per layer. Syntax-focused layers tolerate 70-90% sparsity, while general-purpose layers should remain dense.
Finding 3: Architecture Effects
Domain specialization varies significantly across architectures:
| Architecture | Model | Avg Specialized |
|---|---|---|
| Gemma | gemma-2b | 51.8% |
| Qwen | Qwen1.5-1.8B | 47.6% |
| Llama | TinyLlama-1.1B | 7.4% |
Gemma, Qwen, and Phi are better candidates for sparse pathways than Llama variants.
Finding 4: Negative Selection is Safe
Analysis of Layer 0 across domains (Phi-3.5-mini):
| Domain | Skippable | Speedup Potential |
|---|---|---|
| Math | 90.4% | 10.47x |
| Code | 72.9% | 3.70x |
| Factual | 52.1% | 2.08x |
| Language | 48.3% | 1.88x |
Math queries can skip 90% of Layer 0 neurons; code queries can skip 73%.
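The speedup columns in both tables are consistent, up to rounding of the reported percentages, with the idealized FFN speedup from skipping a fraction f of neurons, namely 1 / (1 - f). A minimal check:

```python
def ffn_speedup(skippable_fraction):
    """Idealized FFN speedup when a fraction of neurons is skipped:
    compute falls proportionally, so speedup = 1 / (1 - fraction)."""
    return 1.0 / (1.0 - skippable_fraction)

# Layer-0 figures for Phi-3.5-mini from the table above.
for domain, frac in [("Math", 0.904), ("Code", 0.729), ("Factual", 0.521)]:
    print(f"{domain}: {ffn_speedup(frac):.2f}x")
```

This is an upper bound: it counts only FFN multiply-adds and ignores attention, routing overhead, and memory effects.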
Finding 5: MoE Validates Sparse Pathways at Scale
Analysis of Kimi K2.5 (1T params, 384 experts per MoE layer) shows that MoE implements sparse pathways at the expert level: only 8 of 384 experts (2.1%) are active per token. Our neuron-level analysis reveals that the same principle applies within dense layers.
Output Preservation
At 5-15% sparsity with domain-aware selection, cosine similarity between dense and sparse outputs remains above 0.999, indicating that network outputs are effectively preserved despite selective neuron masking.
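The effect can be illustrated on a toy FFN block with random weights (all sizes and the tanh nonlinearity here are illustrative assumptions, not the paper's models): masking the ~10% of hidden neurons with the weakest activations barely moves the output direction.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ffn(x, w_in, w_out, mask=None):
    """Toy FFN block; mask zeroes out skipped hidden neurons."""
    h = np.tanh(x @ w_in)
    if mask is not None:
        h = h * mask
    return h @ w_out

rng = np.random.default_rng(0)
d_model, n_neurons = 64, 256
x = rng.normal(size=d_model)
w_in = rng.normal(size=(d_model, n_neurons)) / np.sqrt(d_model)
w_out = rng.normal(size=(n_neurons, d_model)) / np.sqrt(n_neurons)

# Negative selection: mask the 10% of neurons with the weakest activations.
h = np.tanh(x @ w_in)
mask = np.ones(n_neurons)
mask[np.argsort(np.abs(h))[: n_neurons // 10]] = 0.0

sim = cosine_similarity(ffn(x, w_in, w_out), ffn(x, w_in, w_out, mask))
```

Because the masked neurons contribute least to the hidden state, the perturbation to the output vector is small relative to its norm, which is exactly why negative selection is the safe direction to sparsify from.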
Connection to MoE and Blades
MoE and sparse pathways are complementary:
- MoE: Coarse-grained sparsity (choose 8 of 384 experts)
- Sparse Pathways: Fine-grained sparsity (choose neurons within layers)
- Combined: Experts + intra-expert neuron sparsity = even more efficient inference
Conclusion
We demonstrate that neuron specialization is a fundamental scaling property of transformers, with a correlation of r = 0.999 between model size and domain-specific neuron clustering. Larger models are sparser, not denser: a counterintuitive finding with profound implications for efficient inference. At frontier scale (32B+), sparse pathways offer 3-4x compute-reduction potential.