Opus 4.7 — the behavioural release

Source: Reza Rezvani, All About Claude Opus 4.7 Features (Medium, 2026-04-16, 11 min read)

Key takeaways

Opus 4.7 is not a capability release — it is a behavioural release. Anthropic is training Claude to behave like a senior engineer, not a helpful assistant.
Five recurring behavioural patterns across 28 enterprise customer testimonials: self-verification, honesty over plausibility, pushback, literal instruction-following, persistence through failure.
Independent CodeRabbit analysis of 100 real PR reviews: 77.6% assertiveness rate, 16.5% hedging.
Benchmark gains are real (CursorBench 58% → 70%, Rakuten SWE 3×, Databricks OfficeQA 21% fewer errors, Harvey 90.9%) but the behavioural shift is what shows up in production.
Tokenizer change: same input now maps to 1.0-1.35× more tokens depending on content. Per-token pricing is unchanged at $5/$25 per million, but real bills will shift.
New xhigh effort level plus task budgets in public beta — reasoning depth is now a per-task variable, not a model property.

The thesis

“Claude Opus 4.7 is not a capability release. It is a behavioral release. The benchmark gains are real — coding, vision, long-context reasoning all moved. But the pattern across twenty-eight unrelated customers in unrelated domains describes a model being shaped toward something new: epistemic discipline, willingness to disagree, persistence through failure. The traits of a senior engineer, not a helpful assistant.”

The five patterns

1. Self-verifies before reporting back

Joe Haddad (Vercel, Distinguished SWE): 4.7 “does proofs on systems code before starting work” — new behaviour, unseen in earlier Claudes.
Sean Ward (iGenius): watched 4.7 autonomously build a Rust TTS engine (neural model + SIMD kernels + browser demo) then feed its output through a speech recogniser to verify against a Python reference implementation. Designed its own validation loop without being asked.

Implication: collapses an entire category of production failure (confident wrong output that nothing downstream catches). Some validation layers teams built over the last two years become redundant.

2. Honesty over plausibility

Caitlin Colgrove (Hex CTO): 4.7 “correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.”
Ben Lafferty (Senior Staff Engineer): 4.7 is “cutting out the meaningless wrapper functions and fallback scaffolding that used to pile up.”
Joe Haddad (Vercel): “noticeably more honest about its own limits.”

Implication: for finance, legal, healthcare — any domain where the cost of plausibly-wrong exceeds correctly-uncertain — this is the critical shift.

3. The model pushes back

Mario Rodriguez (GitHub): 4.7 is “more opinionated… rather than simply agreeing.”
Michele Catasta (Replit President): “pushes back during technical discussions to help me make better decisions… feels like a better coworker.”
CodeRabbit (100 open-source PRs): 77.6% assertiveness, 16.5% hedging. Verdict-style summaries → mechanism explanation → concrete patch.

Implication: every existing prompt may produce different output — the model stopped faking agreement.

4. Literal instruction following

Anthropic’s own warning: “prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.”

Austin Ray (Ramp): 4.7 “needs much less step-by-step guidance” because it follows the spec given.
Sarah Sachs (Notion): 4.7 “is the first model to pass our implicit-need tests” — handles clear intent with incomplete instructions, refuses to invent unspecified details.

Implication: re-test every prompt, skill file, CLAUDE.md config before switching. 4.6 prompts that relied on loose interpretation may fail.

5. Persists through tool failures

Sarah Sachs (Notion): 4.7 “keeps executing through tool failures that used to stop Opus cold.” +14% over 4.6 at fewer tokens and 1/3 the tool errors.
Scott Wu (Cognition CEO): Devin now works coherently for hours with 4.7.
Kay Zhu (Genspark): the metric that matters is loop resistance, not accuracy or speed. 4.7 delivers “the highest quality-per-tool-call ratio we’ve measured.”

Implication: long-horizon ralph/ultrawork loops just got materially more reliable.

The second shift — effort as an economic variable

Two features accompanied 4.7:

New xhigh effort level between high and max
Task budgets in public beta

Tokeniser change: same input maps to 1.0–1.35× more tokens depending on content. Per-token price unchanged ($5/$25 per million), but real spend shifts. Model also thinks more at higher effort.

“The quiet signal: we are moving from ‘which model do I pick’ to ‘how much reasoning budget do I allocate per task.’ Agent architectures that treated reasoning depth as a constant need to start treating it as a variable.”

Migration — three things before you ship to production

Re-test existing prompts on real traffic. Literal interpretation changes behaviour.
Measure token cost on real traffic. 1.0–1.35× tokeniser shift is content-dependent; don’t assume flat bills.
Audit verification layers. Distinguish validation that exists because the model was unreliable (now redundant) from validation required by the domain (belt-and-braces).

The author’s own caveats

Based on launch announcement + customer quotes + one third-party (CodeRabbit) analysis. Not own production usage — GA for hours, not weeks.
Customer testimonials are vendor-curated; no vendor publishes “we tested and it was worse” quotes.
Behavioural shifts can regress in post-training updates. Snapshot of 2026-04-16, not permanent.
Real practitioner test comes in 2-4 weeks.

Opus 4.7 Reference — full feature reference with migration checklist
Prompt Discipline — Cialdini-style CLAUDE.md upgrades for 4.7’s literal reading
Cost & Observability — measuring the real bill shift
Regulated AI — implications for SR 11-7 / EU AI Act model inventories

About the author

Alireza (Reza) Rezvani — Berlin-based CTO, 7-person engineering team, creator of openLEO and maintainer of the open-source 230+-skill library at github.com/alirezarezvani/claude-skills. Newsletter: claude-code.beehiiv.com.