The Experiment
We decided to put Claude Opus 4.6 through its paces on the 1stProof Benchmark: 10 research-level mathematics problems with encrypted answers (to be revealed February 13, 2026), spanning stochastic analysis, representation theory, algebraic topology, and more.
The setup was straightforward: test multiple prompting strategies, capture verifiable results with SHA-256 hashes, and see what happens.
What we didn’t expect was discovering a fundamental behavior of extended thinking mode.
The Discovery
Extended thinking is Anthropic’s feature that lets the model “think” before responding. With effort: "max", you’re telling the model to think as deeply as possible.
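A request with extended thinking enabled can be sketched roughly as below. This is a hedged sketch, not the exact call used in the experiment: the model id is hypothetical, and we assume the `effort` setting mentioned above maps onto how large a thinking budget the request grants (the public Messages API expresses this as a `thinking` block with `budget_tokens`).

```python
def build_thinking_request(question: str, max_tokens: int, thinking_budget: int) -> dict:
    """Build a Messages API payload with extended thinking enabled.

    The model id below is a placeholder; the `thinking` block follows
    the shape of Anthropic's public SDK, where the thinking budget
    must fit inside the overall max_tokens limit.
    """
    return {
        "model": "claude-opus-4-6",  # hypothetical model id
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": question}],
    }

# A request granting almost the entire 32,768-token limit to thinking
req = build_thinking_request("Prove the lemma...", 32768, 32000)
```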
Here’s what happened:
```
Token Limit:          32,768
Thinking Tokens Used: 32,768
Response Tokens:      0
Stop Reason:          max_tokens
```
The model used every single token for thinking. No response.
We increased the limit:
```
Token Limit:          65,536
Thinking Tokens Used: 65,536
Response Tokens:      0
Stop Reason:          max_tokens
```
Same behavior. So we maxed it out:
```
Token Limit:          128,000
Thinking Tokens Used:  128,000
Response Tokens:       0
Stop Reason:           max_tokens
```
Still nothing. The model was thinking so deeply that it never got around to answering.
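The failure mode is easy to detect programmatically: the call stops with `max_tokens` and the visible response text is empty. A minimal check, with field names assumed to mirror the API's `stop_reason` and text output:

```python
def exhausted_without_answer(stop_reason: str, response_text: str) -> bool:
    """True when a call hit the token ceiling and produced no visible
    answer text - the pattern all three runs above exhibit."""
    return stop_reason == "max_tokens" and response_text.strip() == ""

# All three runs above match this pattern:
exhausted_without_answer("max_tokens", "")    # True
# A normal completion does not:
exhausted_without_answer("end_turn", "QED.")  # False
```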
Why This Happens
Here’s our hypothesis: when you give a frontier AI model a genuinely hard problem and tell it to think as deeply as possible, it does exactly that. Research-level mathematics offers essentially unlimited reasoning depth. There’s always another angle to consider, another lemma to prove, another edge case to verify.
Without internal mechanisms to say “okay, time to wrap up,” the model reasons until it hits the external token limit.
This isn’t a bug—it’s a feature revealing its limits. The model is doing exactly what we asked: thinking as hard as it can. It just never decides to stop thinking and start answering.
The Two-Phase Solution
We developed a workaround: treat it as a two-phase process.
Phase 1: Let it think
- Extended thinking with max effort
- Accept that it might exhaust all 128K tokens
- Capture the thinking content
Phase 2: Ask for the answer
- Feed the thinking back as context
- Use standard mode (no extended thinking)
- Request the final answer
```python
# Phase 1: deep thinking (may exhaust the entire token budget)
response1 = call_extended_thinking(question)
thinking = response1.thinking_content

# Phase 2: if no answer was produced, feed the thinking back as
# context and request the final answer in standard mode
if response1.response.strip() == "":
    prompt = f"""Based on your analysis:
<your_analysis>{thinking}</your_analysis>
Provide your FINAL ANSWER."""
    response2 = call_standard(prompt)
```
Results:
| Phase | Tokens | Output |
|---|---|---|
| Phase 1 | 128,000 | Deep mathematical reasoning |
| Phase 2 | ~4,000 | Complete, rigorous proof |
It worked. We captured both the extensive reasoning AND the final answer.
What We Found
Across all 10 questions, using 6 different prompting modes (80+ API calls):
| Mode | Success Rate |
|---|---|
| Zero-Shot | 10/10 |
| Chain-of-Thought | 10/10 |
| Multi-Attempt (3x) | 30/30 |
| Self-Critique | 10/10 |
| Extended Thinking | 0/10 (token exhaustion) |
| Extended + Continuation | 10/10 |
Every standard mode produced complete proofs. Extended thinking alone produced nothing—but our two-phase approach recovered full functionality.
The Proofs
Here’s the interesting part: the model generated what appear to be novel mathematical proofs. Not regurgitated solutions (these problems have no published answers), but original approaches with proper citations to mathematical tools—Bernstein-Zelevinsky theory, Davis reflection groups, Jacquet-Piatetski-Shapiro-Shalika conductor theory.
We won’t know if they’re correct until February 13 when the official answers are revealed. But the structure is rigorous. The reasoning is deep. The citations are to legitimate mathematical foundations.
Implications
For researchers using extended thinking:
- Monitor for token exhaustion on hard problems
- Consider implementing continuation mechanisms
- Test at different `effort` levels
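A continuation mechanism of the kind suggested above can be sketched as a small wrapper. `think` and `answer_from` are hypothetical adapters around whatever API client you use; the only assumptions are that `think` reports its stop reason and that an empty answer plus `max_tokens` signals exhaustion:

```python
from typing import Callable, Tuple


def solve_with_fallback(
    question: str,
    think: Callable[[str], Tuple[str, str, str]],  # -> (thinking, answer, stop_reason)
    answer_from: Callable[[str], str],             # standard-mode call on a prompt
) -> str:
    """Run extended thinking; if the budget is exhausted with no answer,
    fall back to the two-phase continuation described earlier."""
    thinking, answer, stop_reason = think(question)
    if stop_reason == "max_tokens" and answer.strip() == "":
        # Phase 2: feed the captured reasoning back in standard mode
        prompt = (
            "Based on your analysis:\n"
            f"<your_analysis>{thinking}</your_analysis>\n"
            "Provide your FINAL ANSWER."
        )
        return answer_from(prompt)
    return answer
```

Wiring this to real API calls is left to the client code; the wrapper itself only encodes the fallback decision.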
For AI developers:
- Frontier reasoning models may need explicit “time to respond” signals
- The boundary between thinking and responding isn’t automatic
- Hard problems reveal architectural limits
For understanding AI reasoning:
- When given maximum freedom, the model prioritizes depth over completion
- This mirrors human behavior on genuinely difficult problems
- Resource allocation between thinking and output isn’t trivial
What’s Next
We’re publishing the full results to GitHub before the February 13 answer reveal:
- All 80+ responses with SHA-256 verification
- Complete thinking traces (where captured)
- The two-phase continuation approach
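The SHA-256 commitment scheme mentioned above works like this: publish only the digest of each answer before the reveal, and anyone can later recompute it to confirm nothing was changed. A minimal sketch (function names are ours, not from the repository):

```python
import hashlib


def commit(answer: str) -> str:
    """Hash an answer so its digest can be published before the reveal."""
    return hashlib.sha256(answer.encode("utf-8")).hexdigest()


def verify(answer: str, published_digest: str) -> bool:
    """Recompute the digest and compare against the published one."""
    return commit(answer) == published_digest


digest = commit("The answer is 42.")
assert verify("The answer is 42.", digest)
assert not verify("The answer is 43.", digest)
```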
This is pre-verification science. We’re committing our answers before we can check them—the way real research should work.
Whatever the results, we learned something unexpected about how frontier AI models allocate their cognitive resources. Sometimes thinking hard and answering are in tension.
Full experiment data: GitHub Repository
This research was conducted with Claude as a co-author and collaborator—including in writing this very reflection.