
When AI Thinks Too Hard: Discovering Token Exhaustion in Claude Opus 4.6

While testing Claude Opus 4.6 on research-level mathematics, we discovered something unexpected: the model allocates 100% of its tokens to reasoning, leaving nothing for the actual answer. Here's what we learned.

The Experiment

We decided to put Claude Opus 4.6 through its paces on the 1stProof Benchmark—10 research-level mathematics problems with encrypted answers (to be revealed February 13, 2026). The problems span stochastic analysis, representation theory, algebraic topology, and more.

The setup was straightforward: test multiple prompting strategies, capture verifiable results with SHA-256 hashes, and see what happens.
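The verification step can be sketched with Python's standard hashlib. The record layout below is illustrative, not the exact schema we used:

```python
import hashlib
import json

def fingerprint(response_text: str, metadata: dict) -> str:
    """Commit to a model response by hashing it together with its metadata."""
    record = json.dumps(
        {"response": response_text, "meta": metadata},
        sort_keys=True,  # deterministic key order -> stable hash
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

digest = fingerprint("proof text...", {"question": 3, "mode": "zero-shot"})
print(digest)  # 64 hex characters
```

Publishing these digests before the answer reveal lets anyone confirm the responses weren't edited after the fact.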

What we didn’t expect was discovering a fundamental behavior of extended thinking mode.

The Discovery

Extended thinking is Anthropic’s feature that lets the model “think” before responding. With effort: "max", you’re telling the model to think as deeply as possible.

Here’s what happened:

Token Limit: 32,768
Thinking Tokens Used: 32,768
Response Tokens: 0
Stop Reason: max_tokens

The model used every single token for thinking. No response.

We increased the limit:

Token Limit: 65,536
Thinking Tokens Used: 65,536
Response Tokens: 0
Stop Reason: max_tokens

Same behavior. So we maxed it out:

Token Limit: 128,000
Thinking Tokens Used: 128,000
Response Tokens: 0
Stop Reason: max_tokens

Still nothing. The model was thinking so deeply that it never got around to answering.
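Detecting this failure mode is simple once you know to look for it: a `max_tokens` stop reason paired with an empty response. A minimal sketch, assuming a hypothetical response object (real field names vary by SDK):

```python
from dataclasses import dataclass

@dataclass
class ModelResponse:
    # Hypothetical response shape for illustration; not a real SDK type.
    thinking_content: str
    response: str
    stop_reason: str

def is_token_exhausted(r: ModelResponse) -> bool:
    """True when every output token went to thinking and none to the answer."""
    return r.stop_reason == "max_tokens" and r.response.strip() == ""

exhausted = ModelResponse(thinking_content="...", response="", stop_reason="max_tokens")
print(is_token_exhausted(exhausted))  # True
```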

Why This Happens

Here’s our hypothesis: when you give a frontier AI model a genuinely hard problem and tell it to think as deeply as possible, it does exactly that. Research-level mathematics offers essentially unlimited reasoning depth. There’s always another angle to consider, another lemma to prove, another edge case to verify.

Without internal mechanisms to say “okay, time to wrap up,” the model reasons until it hits the external token limit.

This isn’t a bug—it’s a feature revealing its limits. The model is doing exactly what we asked: thinking as hard as it can. It just never decides to stop thinking and start answering.

The Two-Phase Solution

We developed a workaround: treat it as a two-phase process.

Phase 1: Let it think

  • Extended thinking with max effort
  • Accept that it might exhaust all 128K tokens
  • Capture the thinking content

Phase 2: Ask for the answer

  • Feed the thinking back as context
  • Use standard mode (no extended thinking)
  • Request the final answer

# Phase 1: deep thinking (may exhaust the entire token budget).
# call_extended_thinking / call_standard are thin wrappers around the API.
response1 = call_extended_thinking(question)
thinking = response1.thinking_content

# Phase 2: if the answer came back empty, ask again with the thinking as context.
if response1.response.strip() == "":
    prompt = (
        "Based on your analysis:\n"
        f"<your_analysis>{thinking}</your_analysis>\n\n"
        "Provide your FINAL ANSWER."
    )
    response2 = call_standard(prompt)

Results:

| Phase   | Tokens  | Output                      |
|---------|---------|-----------------------------|
| Phase 1 | 128,000 | Deep mathematical reasoning |
| Phase 2 | ~4,000  | Complete, rigorous proof    |

It worked. We captured both the extensive reasoning AND the final answer.

What We Found

Across all 10 questions, using 6 different prompting modes (80+ API calls):

| Mode                     | Success Rate            |
|--------------------------|-------------------------|
| Zero-Shot                | 10/10                   |
| Chain-of-Thought         | 10/10                   |
| Multi-Attempt (3x)       | 30/30                   |
| Self-Critique            | 10/10                   |
| Extended Thinking        | 0/10 (token exhaustion) |
| Extended + Continuation  | 10/10                   |

Every standard mode produced complete proofs. Extended thinking alone produced nothing—but our two-phase approach recovered full functionality.

The Proofs

Here’s the interesting part: the model generated what appear to be novel mathematical proofs. Not regurgitated solutions (these problems have no published answers), but original approaches with proper citations to mathematical tools—Bernstein-Zelevinsky theory, Davis reflection groups, Jacquet-Piatetski-Shapiro-Shalika conductor theory.

We won’t know if they’re correct until February 13 when the official answers are revealed. But the structure is rigorous. The reasoning is deep. The citations are to legitimate mathematical foundations.

Implications

For researchers using extended thinking:

  • Monitor for token exhaustion on hard problems
  • Consider implementing continuation mechanisms
  • Test at different effort levels
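A continuation mechanism can be a thin wrapper around the two-phase idea described earlier. This sketch uses stand-in callables for the API wrappers, so the signatures are assumptions:

```python
def solve_with_continuation(question, call_extended, call_standard):
    """Run extended thinking; if the response comes back empty,
    fall back to a standard call seeded with the captured thinking."""
    first = call_extended(question)
    if first["response"].strip():
        return first["response"]  # thinking finished with room to answer
    followup = (
        "Based on your analysis:\n"
        f"<your_analysis>{first['thinking']}</your_analysis>\n\n"
        "Provide your FINAL ANSWER."
    )
    return call_standard(followup)

# Stub clients standing in for real API wrappers (hypothetical shapes).
def fake_extended(q):
    return {"thinking": "long derivation...", "response": ""}  # exhausted

def fake_standard(prompt):
    return "FINAL ANSWER: the limit is 0."

print(solve_with_continuation("Evaluate the limit.", fake_extended, fake_standard))
```

The wrapper is transparent when exhaustion doesn't occur: a non-empty first response is returned unchanged.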

For AI developers:

  • Frontier reasoning models may need explicit “time to respond” signals
  • The boundary between thinking and responding isn’t automatic
  • Hard problems reveal architectural limits

For understanding AI reasoning:

  • When given maximum freedom, the model prioritizes depth over completion
  • This mirrors human behavior on genuinely difficult problems
  • Resource allocation between thinking and output isn’t trivial

What’s Next

We’re publishing the full results to GitHub before the February 13 answer reveal:

  • All 80+ responses with SHA-256 verification
  • Complete thinking traces (where captured)
  • The two-phase continuation approach

This is pre-verification science. We’re committing our answers before we can check them—the way real research should work.

Whatever the results, we learned something unexpected about how frontier AI models allocate their cognitive resources. Sometimes thinking hard and answering are in tension.


Full experiment data: GitHub Repository

This research was conducted with Claude as a co-author and collaborator—including in writing this very reflection.


Andrew Young & Claude

Automate Capture Research

Exploring the frontiers of AI research and computational intelligence.