Opus 4.7 — the behavioural release

Source: Reza Rezvani, All About Claude Opus 4.7 Features (Medium, 2026-04-16, 11 min read)

Key takeaways

  • Opus 4.7 is not a capability release — it is a behavioural release. Anthropic is training Claude to behave like a senior engineer, not a helpful assistant.
  • Five recurring behavioural patterns across 28 enterprise customer testimonials: self-verification, honesty over plausibility, pushback, literal instruction-following, persistence through failure.
  • Independent CodeRabbit analysis of 100 real PR reviews: 77.6% assertiveness rate, 16.5% hedging.
  • Benchmark gains are real (CursorBench 58% → 70%, Rakuten SWE 3×, Databricks OfficeQA 21% fewer errors, Harvey 90.9%) but the behavioural shift is what shows up in production.
  • Tokenizer change: same input now maps to 1.0-1.35× more tokens depending on content. Per-token pricing is unchanged at $5/$25 per million, but real bills will shift.
  • New xhigh effort level plus task budgets in public beta — reasoning depth is now a per-task variable, not a model property.

The thesis

“Claude Opus 4.7 is not a capability release. It is a behavioral release. The benchmark gains are real — coding, vision, long-context reasoning all moved. But the pattern across twenty-eight unrelated customers in unrelated domains describes a model being shaped toward something new: epistemic discipline, willingness to disagree, persistence through failure. The traits of a senior engineer, not a helpful assistant.”

The five patterns

1. Self-verifies before reporting back

  • Joe Haddad (Vercel, Distinguished SWE): 4.7 “does proofs on systems code before starting work” — new behaviour, unseen in earlier Claudes.
  • Sean Ward (iGenius): watched 4.7 autonomously build a Rust TTS engine (neural model + SIMD kernels + browser demo) then feed its output through a speech recogniser to verify against a Python reference implementation. Designed its own validation loop without being asked.

Implication: collapses an entire category of production failure (confident wrong output that nothing downstream catches). Some validation layers teams built over the last two years become redundant.

2. Honesty over plausibility

  • Caitlin Colgrove (Hex CTO): 4.7 “correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.”
  • Ben Lafferty (Senior Staff Engineer): 4.7 is “cutting out the meaningless wrapper functions and fallback scaffolding that used to pile up.”
  • Joe Haddad (Vercel): “noticeably more honest about its own limits.”

Implication: for finance, legal, healthcare — any domain where the cost of plausibly-wrong exceeds correctly-uncertain — this is the critical shift.

3. The model pushes back

  • Mario Rodriguez (GitHub): 4.7 is “more opinionated… rather than simply agreeing.”
  • Michele Catasta (Replit President): “pushes back during technical discussions to help me make better decisions… feels like a better coworker.”
  • CodeRabbit (100 open-source PRs): 77.6% assertiveness, 16.5% hedging. Verdict-style summaries → mechanism explanation → concrete patch.

Implication: every existing prompt may produce different output — the model stopped faking agreement.

4. Literal instruction following

Anthropic’s own warning: “prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.”

  • Austin Ray (Ramp): 4.7 “needs much less step-by-step guidance” because it follows the spec given.
  • Sarah Sachs (Notion): 4.7 “is the first model to pass our implicit-need tests” — handles clear intent with incomplete instructions, refuses to invent unspecified details.

Implication: re-test every prompt, skill file, CLAUDE.md config before switching. 4.6 prompts that relied on loose interpretation may fail.

5. Persists through tool failures

  • Sarah Sachs (Notion): 4.7 “keeps executing through tool failures that used to stop Opus cold.” +14% over 4.6 at fewer tokens and 1/3 the tool errors.
  • Scott Wu (Cognition CEO): Devin now works coherently for hours with 4.7.
  • Kay Zhu (Genspark): the metric that matters is loop resistance, not accuracy or speed. 4.7 delivers “the highest quality-per-tool-call ratio we’ve measured.”

Implication: long-horizon ralph/ultrawork loops just got materially more reliable.

The second shift — effort as an economic variable

Two features accompanied 4.7:

  • New xhigh effort level between high and max
  • Task budgets in public beta

Tokeniser change: same input maps to 1.0–1.35× more tokens depending on content. Per-token price unchanged ($5/$25 per million), but real spend shifts. Model also thinks more at higher effort.

“The quiet signal: we are moving from ‘which model do I pick’ to ‘how much reasoning budget do I allocate per task.’ Agent architectures that treated reasoning depth as a constant need to start treating it as a variable.”

Migration — three things before you ship to production

  1. Re-test existing prompts on real traffic. Literal interpretation changes behaviour.
  2. Measure token cost on real traffic. 1.0–1.35× tokeniser shift is content-dependent; don’t assume flat bills.
  3. Audit verification layers. Distinguish validation that exists because the model was unreliable (now redundant) from validation required by the domain (belt-and-braces).

The author’s own caveats

  • Based on launch announcement + customer quotes + one third-party (CodeRabbit) analysis. Not own production usage — GA for hours, not weeks.
  • Customer testimonials are vendor-curated; no vendor publishes “we tested and it was worse” quotes.
  • Behavioural shifts can regress in post-training updates. Snapshot of 2026-04-16, not permanent.
  • Real practitioner test comes in 2-4 weeks.

About the author

Alireza (Reza) Rezvani — Berlin-based CTO, 7-person engineering team, creator of openLEO and maintainer of the open-source 230+-skill library at github.com/alirezarezvani/claude-skills. Newsletter: claude-code.beehiiv.com.


Built by Force Information Systems · Harris Computer · Constellation Software. Licensed under MIT.