Opus 4.7 — the behavioural release
Source: Reza Rezvani, All About Claude Opus 4.7 Features (Medium, 2026-04-16, 11 min read)
Key takeaways
- Opus 4.7 is not a capability release — it is a behavioural release. Anthropic is training Claude to behave like a senior engineer, not a helpful assistant.
- Five recurring behavioural patterns across 28 enterprise customer testimonials: self-verification, honesty over plausibility, pushback, literal instruction-following, persistence through failure.
- Independent CodeRabbit analysis of 100 real PR reviews: 77.6% assertiveness rate, 16.5% hedging.
- Benchmark gains are real (CursorBench 58% → 70%, Rakuten SWE 3×, Databricks OfficeQA 21% fewer errors, Harvey 90.9%) but the behavioural shift is what shows up in production.
- Tokenizer change: same input now maps to 1.0-1.35× more tokens depending on content. Per-token pricing is unchanged at $5/$25 per million, but real bills will shift.
- New
xhigheffort level plus task budgets in public beta — reasoning depth is now a per-task variable, not a model property.
The thesis
“Claude Opus 4.7 is not a capability release. It is a behavioral release. The benchmark gains are real — coding, vision, long-context reasoning all moved. But the pattern across twenty-eight unrelated customers in unrelated domains describes a model being shaped toward something new: epistemic discipline, willingness to disagree, persistence through failure. The traits of a senior engineer, not a helpful assistant.”
The five patterns
1. Self-verifies before reporting back
- Joe Haddad (Vercel, Distinguished SWE): 4.7 “does proofs on systems code before starting work” — new behaviour, unseen in earlier Claudes.
- Sean Ward (iGenius): watched 4.7 autonomously build a Rust TTS engine (neural model + SIMD kernels + browser demo) then feed its output through a speech recogniser to verify against a Python reference implementation. Designed its own validation loop without being asked.
Implication: collapses an entire category of production failure (confident wrong output that nothing downstream catches). Some validation layers teams built over the last two years become redundant.
2. Honesty over plausibility
- Caitlin Colgrove (Hex CTO): 4.7 “correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.”
- Ben Lafferty (Senior Staff Engineer): 4.7 is “cutting out the meaningless wrapper functions and fallback scaffolding that used to pile up.”
- Joe Haddad (Vercel): “noticeably more honest about its own limits.”
Implication: for finance, legal, healthcare — any domain where the cost of plausibly-wrong exceeds correctly-uncertain — this is the critical shift.
3. The model pushes back
- Mario Rodriguez (GitHub): 4.7 is “more opinionated… rather than simply agreeing.”
- Michele Catasta (Replit President): “pushes back during technical discussions to help me make better decisions… feels like a better coworker.”
- CodeRabbit (100 open-source PRs): 77.6% assertiveness, 16.5% hedging. Verdict-style summaries → mechanism explanation → concrete patch.
Implication: every existing prompt may produce different output — the model stopped faking agreement.
4. Literal instruction following
Anthropic’s own warning: “prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.”
- Austin Ray (Ramp): 4.7 “needs much less step-by-step guidance” because it follows the spec given.
- Sarah Sachs (Notion): 4.7 “is the first model to pass our implicit-need tests” — handles clear intent with incomplete instructions, refuses to invent unspecified details.
Implication: re-test every prompt, skill file, CLAUDE.md config before switching. 4.6 prompts that relied on loose interpretation may fail.
5. Persists through tool failures
- Sarah Sachs (Notion): 4.7 “keeps executing through tool failures that used to stop Opus cold.” +14% over 4.6 at fewer tokens and 1/3 the tool errors.
- Scott Wu (Cognition CEO): Devin now works coherently for hours with 4.7.
- Kay Zhu (Genspark): the metric that matters is loop resistance, not accuracy or speed. 4.7 delivers “the highest quality-per-tool-call ratio we’ve measured.”
Implication: long-horizon ralph/ultrawork loops just got materially more reliable.
The second shift — effort as an economic variable
Two features accompanied 4.7:
- New
xhigheffort level betweenhighandmax - Task budgets in public beta
Tokeniser change: same input maps to 1.0–1.35× more tokens depending on content. Per-token price unchanged ($5/$25 per million), but real spend shifts. Model also thinks more at higher effort.
“The quiet signal: we are moving from ‘which model do I pick’ to ‘how much reasoning budget do I allocate per task.’ Agent architectures that treated reasoning depth as a constant need to start treating it as a variable.”
Migration — three things before you ship to production
- Re-test existing prompts on real traffic. Literal interpretation changes behaviour.
- Measure token cost on real traffic. 1.0–1.35× tokeniser shift is content-dependent; don’t assume flat bills.
- Audit verification layers. Distinguish validation that exists because the model was unreliable (now redundant) from validation required by the domain (belt-and-braces).
The author’s own caveats
- Based on launch announcement + customer quotes + one third-party (CodeRabbit) analysis. Not own production usage — GA for hours, not weeks.
- Customer testimonials are vendor-curated; no vendor publishes “we tested and it was worse” quotes.
- Behavioural shifts can regress in post-training updates. Snapshot of 2026-04-16, not permanent.
- Real practitioner test comes in 2-4 weeks.
Related Playbook pages
- Opus 4.7 Reference — full feature reference with migration checklist
- Prompt Discipline — Cialdini-style CLAUDE.md upgrades for 4.7’s literal reading
- Cost & Observability — measuring the real bill shift
- Regulated AI — implications for SR 11-7 / EU AI Act model inventories
About the author
Alireza (Reza) Rezvani — Berlin-based CTO, 7-person engineering team, creator of openLEO and maintainer of the open-source 230+-skill library at github.com/alirezarezvani/claude-skills. Newsletter: claude-code.beehiiv.com.