The Experiment
We decided to put Claude Opus 4.6 through its paces on the 1stProof Benchmark: 10 research-level mathematics problems with encrypted answers (to be revealed February 13, 2026), spanning stochastic analysis, representation theory, algebraic topology, and more.
The setup was straightforward: test multiple prompting strategies, capture verifiable results with SHA-256 hashes, and see what happens.
What we didn’t expect was discovering a fundamental behavior of extended thinking mode.
The Discovery
Extended thinking is Anthropic’s feature that lets the model “think” before responding. With effort: "max", you’re telling the model to think as deeply as possible.
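A request with extended thinking enabled can be sketched roughly as below. This is a hedged sketch, not the exact call used in the experiment: the model id is hypothetical, and we assume the `effort` setting mentioned above maps onto how large a thinking budget the request grants (the public Messages API expresses this as a `thinking` block with `budget_tokens`).

```python
def build_thinking_request(question: str, max_tokens: int, thinking_budget: int) -> dict:
    """Build a Messages API payload with extended thinking enabled.

    The model id below is a placeholder; the `thinking` block follows
    the shape of Anthropic's public SDK, where the thinking budget
    must fit inside the overall max_tokens limit.
    """
    return {
        "model": "claude-opus-4-6",  # hypothetical model id
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": question}],
    }

# A request granting almost the entire 32,768-token limit to thinking
req = build_thinking_request("Prove the lemma...", 32768, 32000)
```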
Here’s what happened:
```
Token Limit:          32,768
Thinking Tokens Used: 32,768
Response Tokens:      0
Stop Reason:          max_tokens
```
The model used every single token for thinking. No response.
We increased the limit:
```
Token Limit:          65,536
Thinking Tokens Used: 65,536
Response Tokens:      0
Stop Reason:          max_tokens
```
Same behavior. So we maxed it out:
```
Token Limit:          128,000
Thinking Tokens Used:  128,000
Response Tokens:       0
Stop Reason:           max_tokens
```
Still nothing. The model was thinking so deeply that it never got around to answering.
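The failure mode is easy to detect programmatically: the call stops with `max_tokens` and the visible response text is empty. A minimal check, with field names assumed to mirror the API's `stop_reason` and text output:

```python
def exhausted_without_answer(stop_reason: str, response_text: str) -> bool:
    """True when a call hit the token ceiling and produced no visible
    answer text - the pattern all three runs above exhibit."""
    return stop_reason == "max_tokens" and response_text.strip() == ""

# All three runs above match this pattern:
exhausted_without_answer("max_tokens", "")    # True
# A normal completion does not:
exhausted_without_answer("end_turn", "QED.")  # False
```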
Why This Happens
Here’s our hypothesis: when you give a frontier AI model a genuinely hard problem and tell it to think as deeply as possible, it does exactly that. Research-level mathematics offers essentially unlimited reasoning depth. There’s always another angle to consider, another lemma to prove, another edge case to verify.
Without internal mechanisms to say “okay, time to wrap up,” the model reasons until it hits the external token limit.
This isn’t a bug—it’s a feature revealing its limits. The model is doing exactly what we asked: thinking as hard as it can. It just never decides to stop thinking and start answering.
The Two-Phase Solution
We developed a workaround: treat it as a two-phase process.
Phase 1: Let it think
- Extended thinking with max effort
- Accept that it might exhaust all 128K tokens
- Capture the thinking content
Phase 2: Ask for the answer
- Feed the thinking back as context
- Use standard mode (no extended thinking)
- Request the final answer
```python
# Phase 1: deep thinking (may exhaust the entire token budget)
response1 = call_extended_thinking(question)
thinking = response1.thinking_content

# Phase 2: if no answer was produced, feed the thinking back as
# context and request the final answer in standard mode
if response1.response.strip() == "":
    prompt = f"""Based on your analysis:
<your_analysis>{thinking}</your_analysis>
Provide your FINAL ANSWER."""
    response2 = call_standard(prompt)
```
Results:
| Phase | Tokens | Output |
|---|---|---|
| Phase 1 | 128,000 | Deep mathematical reasoning |
| Phase 2 | ~4,000 | Complete, rigorous proof |
It worked. We captured both the extensive reasoning AND the final answer.
What We Found
Across all 10 questions, using 6 different prompting modes (80+ API calls):
| Mode | Success Rate |
|---|---|
| Zero-Shot | 10/10 |
| Chain-of-Thought | 10/10 |
| Multi-Attempt (3x) | 30/30 |
| Self-Critique | 10/10 |
| Extended Thinking | 0/10 (token exhaustion) |
| Extended + Continuation | 10/10 |
Every standard mode produced complete proofs. Extended thinking alone produced nothing—but our two-phase approach recovered full functionality.
The Proofs
Here’s the interesting part: the model generated what appear to be novel mathematical proofs. Not regurgitated solutions (these problems have no published answers), but original approaches with proper citations to mathematical tools—Bernstein-Zelevinsky theory, Davis reflection groups, Jacquet-Piatetski-Shapiro-Shalika conductor theory.
We won’t know if they’re correct until February 13 when the official answers are revealed. But the structure is rigorous. The reasoning is deep. The citations are to legitimate mathematical foundations.
Implications
For researchers using extended thinking:
- Monitor for token exhaustion on hard problems
- Consider implementing continuation mechanisms
- Test at different `effort` levels
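A continuation mechanism of the kind suggested above can be sketched as a small wrapper. `think` and `answer_from` are hypothetical adapters around whatever API client you use; the only assumptions are that `think` reports its stop reason and that an empty answer plus `max_tokens` signals exhaustion:

```python
from typing import Callable, Tuple


def solve_with_fallback(
    question: str,
    think: Callable[[str], Tuple[str, str, str]],  # -> (thinking, answer, stop_reason)
    answer_from: Callable[[str], str],             # standard-mode call on a prompt
) -> str:
    """Run extended thinking; if the budget is exhausted with no answer,
    fall back to the two-phase continuation described earlier."""
    thinking, answer, stop_reason = think(question)
    if stop_reason == "max_tokens" and answer.strip() == "":
        # Phase 2: feed the captured reasoning back in standard mode
        prompt = (
            "Based on your analysis:\n"
            f"<your_analysis>{thinking}</your_analysis>\n"
            "Provide your FINAL ANSWER."
        )
        return answer_from(prompt)
    return answer
```

Wiring this to real API calls is left to the client code; the wrapper itself only encodes the fallback decision.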
For AI developers:
- Frontier reasoning models may need explicit “time to respond” signals
- The boundary between thinking and responding isn’t automatic
- Hard problems reveal architectural limits
For understanding AI reasoning:
- When given maximum freedom, the model prioritizes depth over completion
- This mirrors human behavior on genuinely difficult problems
- Resource allocation between thinking and output isn’t trivial
What’s Next
We’re publishing the full results to GitHub before the February 13 answer reveal:
- All 80+ responses with SHA-256 verification
- Complete thinking traces (where captured)
- The two-phase continuation approach
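The SHA-256 commitment scheme mentioned above works like this: publish only the digest of each answer before the reveal, and anyone can later recompute it to confirm nothing was changed. A minimal sketch (function names are ours, not from the repository):

```python
import hashlib


def commit(answer: str) -> str:
    """Hash an answer so its digest can be published before the reveal."""
    return hashlib.sha256(answer.encode("utf-8")).hexdigest()


def verify(answer: str, published_digest: str) -> bool:
    """Recompute the digest and compare against the published one."""
    return commit(answer) == published_digest


digest = commit("The answer is 42.")
assert verify("The answer is 42.", digest)
assert not verify("The answer is 43.", digest)
```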
This is pre-verification science. We’re committing our answers before we can check them—the way real research should work.
Whatever the results, we learned something unexpected about how frontier AI models allocate their cognitive resources. Sometimes thinking hard and answering are in tension.
Full experiment data: GitHub Repository
This research was conducted with Claude as a co-author and collaborator—including in writing this very reflection.