AI Engineering Briefing — April 2026
Author: Alex O’Reilly, Force Information Systems
Date: 17 April 2026
Length: ~20-minute read
Why this exists
Over the past two weeks the AI-assisted development space moved too fast for a summary email to cover. This briefing covers what shipped, what matters, and what you should do about it, with enough technical depth for an engineering audience but without the raw research trail of the source digest (69 primary items across Medium, Reddit, Forbes, and vendor announcements).
Target audience: engineering leaders inside Harris and Constellation companies who need to make model-selection, tooling, and architecture decisions in the next 30 days.
How to read this
Three complementary layers across the Playbook:
- This briefing (20 min) — what to do in the next 30 days
- News & Research (57 pages) — per-article deep reads of every source cited below. Each page has headline → key takeaways → full notes → what to do → related Playbook pages
- Reference pages (8 pages) — full technical depth on each topic: Opus 4.7, Cost & Observability, Multi-Model Orchestration, Regulated AI, Prompt Discipline, Local Models, Knowledge & Context, BMad Autonomous Development
Every quote and number in this briefing traces back to a named primary source; inline links in each section go to the corresponding News page for full context.
Executive summary
- Claude Opus 4.7 shipped on 16 April. It’s a behavioural upgrade, not a capability upgrade: trained to behave like a senior engineer (self-verifies, pushes back, is honest about gaps, follows instructions literally, persists through tool failures). Your existing 4.6-era prompts may behave differently.
- Output-token costs can be cut ~75% with a two-command plugin install (JuliusBrussee/caveman, 13K GitHub stars). The brevity also improves accuracy, backed by a 2025 paper.
- Production observability is now a solved problem via Claude Code’s built-in OpenTelemetry support. A Docker Compose stack is published. Expect to discover your team’s actual adoption and cache-ratio patterns.
- OpenAI shipped an official Codex plugin for Claude Code. 5-minute install, ~$30/month, gives you /codex:adversarial-review for second-opinion code review.
- Microsoft and Anthropic both shipped multi-model orchestration products the same week. Microsoft’s “Copilot Cowork” has a per-model attribution gap that’s a problem for regulated verticals. Anthropic’s “Managed Agents” has a structural advantage over LangGraph/CrewAI that’s worth understanding before your next architectural decision.
- Gemma 4 makes local agentic coding viable for the first time (86.4% on tau2-bench vs Gemma 3’s 6.6%). Relevant for privacy-sensitive or offline workloads.
- SR 11-7 + EU AI Act Article 12 (effective 2 August 2026) create a compliance problem for any regulated Harris vertical that deploys an opaque multi-model pipeline without per-model attribution in audit logs.
- Autonomous overnight sprint execution is real and installable: /bad (BMad Autonomous Development) coordinates dependency-graph-parallelised stories through git worktrees with self-healing CI.
Everything in this briefing is actionable in the next 30 days if you want it to be. The rest of the document is the “how.”
The April 2026 landscape in one picture
flowchart TB
subgraph Model["Model tier — what shipped"]
O47[Opus 4.7<br/>behavioural release<br/>16 April]
Mythos[Claude Mythos<br/>tier above Opus<br/>restricted preview]
O47 -.-> Mythos
end
subgraph Cost["Cost & observability"]
Cave[caveman<br/>75% output tokens]
OTel[OpenTelemetry<br/>built-in, opt-in]
end
subgraph Orch["Multi-model orchestration"]
Codex[OpenAI Codex plugin<br/>for Claude Code]
Managed[Anthropic<br/>Managed Agents]
Cowork[MS Copilot Cowork<br/>Critique / Council]
end
subgraph Gov["Regulation & compliance"]
SR[SR 11-7 attribution gap]
EU[EU AI Act Article 12<br/>effective 2 Aug 2026]
end
subgraph Local["Local inference"]
G4[Gemma 4<br/>86.4% tau2-bench]
end
subgraph Loop["Autonomous loops"]
BAD["/bad BMad<br/>overnight sprint"]
end
O47 --> Cave
O47 --> OTel
O47 --> Codex
Cowork --> SR
Cowork --> EU
Managed -.-> BAD
classDef hot fill:#ffd966,stroke:#a07800,color:#4d3800;
classDef risk fill:#d9534f,stroke:#8b2c2c,color:#fff;
classDef tool fill:#f4f7fb,stroke:#1a2b4a,color:#1a2b4a;
classDef model fill:#1a2b4a,stroke:#0f1a2e,color:#fff;
class O47,Mythos model;
class Cave,OTel,Codex,Managed,G4,BAD tool;
class Cowork hot;
class SR,EU risk;

1. Claude Opus 4.7 — the behavioural release
Released: 16 April 2026. Available across Claude products, the API (claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Zero migration friction for Harris Azure tenants.
Pricing: Per-token price unchanged at $5 / $25 per million input/output tokens. The tokenizer changed: the same input now maps to 1.0–1.35× as many tokens, depending on content. Real bills will shift even though the rate card didn’t.
Deep reads: Opus 4.7 — the behavioural release (Rezvani) · Opus 4.7 punishes bad prompting (Njenga) · Launch-day community reactions
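To see how an unchanged rate card can still move a bill, here is a minimal sketch of the arithmetic. The prices and the 1.0–1.35× range come from the pricing note above; the monthly token volumes are hypothetical, and the sketch assumes the tokenizer shift applies equally to input and output tokens, which the source does not confirm.

```python
# Rate card from the briefing: $5 / $25 per million input / output tokens.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00


def monthly_cost_usd(input_tokens: float, output_tokens: float,
                     tokenizer_multiplier: float = 1.0) -> float:
    """Estimate spend when the same text maps to more tokens.

    Assumption: the multiplier scales input and output tokens alike.
    """
    scaled_in = input_tokens * tokenizer_multiplier
    scaled_out = output_tokens * tokenizer_multiplier
    return (scaled_in * INPUT_USD_PER_MTOK
            + scaled_out * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical team volume, counted in 4.6-era tokens.
baseline = monthly_cost_usd(200_000_000, 20_000_000)
worst_case = monthly_cost_usd(200_000_000, 20_000_000,
                              tokenizer_multiplier=1.35)
print(f"baseline ${baseline:,.2f}, worst case ${worst_case:,.2f}")
```

The real multiplier is content-dependent, so the upper bound is a planning figure, not a forecast; measured traffic should replace it as soon as it is available.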
Capability vs behavioural release — what actually changed
flowchart LR
P46[4.6-era prompt] -->|loose interpretation<br/>gaps filled silently| O46[4.6 output]
P46 -->|literal interpretation<br/>inconsistencies surfaced| O47O[4.7 output]
O47O -->|verifies itself<br/>pushes back when wrong| SE([Senior-engineer behaviour])
classDef in fill:#1a2b4a,stroke:#0f1a2e,color:#fff,rx:10,ry:10;
classDef weak fill:#f4f7fb,stroke:#1a2b4a,color:#1a2b4a;
classDef strong fill:#ffd966,stroke:#a07800,color:#4d3800;
class P46,SE in;
class O46 weak;
class O47O strong;

The five behavioural patterns
Anthropic published 28 enterprise customer testimonials (Stripe, Replit, Cognition, Harvey, Hex, Vercel, Notion, GitHub, iGenius, Ramp, Genspark, and others). Reza Rezvani’s analysis distils them into five recurring patterns:
- Self-verifies before reporting back. Vercel’s Joe Haddad reports 4.7 “does proofs on systems code before starting work.” iGenius’s Sean Ward watched 4.7 autonomously build a Rust TTS engine and design its own validation loop.
- Honest about missing data. Hex CTO: 4.7 “correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.” For finance, legal, and healthcare this is the critical shift.
- Pushes back when you’re wrong. Independent CodeRabbit analysis of 100 PR reviews measured 77.6% assertiveness, 16.5% hedging. Imperatives replace tentative suggestions.
- Literal instruction following. Anthropic’s own warning: “prompts written for earlier models can sometimes now produce unexpected results.” Re-test everything before migrating.
- Persists through tool failures. Notion reports +14% over 4.6 at fewer tokens and 1/3 the tool errors. Cognition’s CEO: Devin now works coherently for hours.
Benchmark deltas
| Metric | 4.6 | 4.7 |
|---|---|---|
| CursorBench | 58% | 70% |
| Rakuten-SWE-Bench | baseline | 3× production tasks |
| Databricks OfficeQA Pro | baseline | 21% fewer errors |
| Harvey BigLaw Bench (high effort) | baseline | 90.9% substantive |
| Image resolution | ~0.8 MP | ~3.75 MP (3×) |
| Knowledge cutoff | May 2025 | Jan 2025 (regression) |
The knowledge-cutoff regression is unexplained. If your team relies on post-Jan-2025 library awareness, factor this into model routing.
New capabilities in Claude Code
- xhigh effort level: new tier between high and max, now default across all Claude Code plans. Defaults cost more tokens than 4.6 defaults.
- /ultrareview slash command: dedicated review session, 3 free for Pro/Max.
- Auto mode extended to Max: Claude makes permission decisions during agentic runs.
- Task budgets (public beta): guide how Claude spends tokens across a longer run.
- 1M context variant: invoke with /model claude-opus-4-7[1m].
Cybersecurity safeguards — important for security teams
Opus 4.7 is the first Anthropic model shipping with safeguards detecting and blocking requests tied to prohibited or high-risk cybersecurity uses. Legitimate security work (vulnerability research, penetration testing, red-teaming) will hit silent refusals unless the user has access through Anthropic’s Cyber Verification Program.
If you run an internal security or red-team function at Harris:
- Inventory workflows using Claude for security research before upgrading
- Apply for the Cyber Verification Program at claude.com
- Document which workflows require verified access for your compliance team
Field reports worth knowing
- Usage-limit burn is dramatically higher. Community reports of Max-plan users hitting 70%+ of 5-hour limits after two prompts. xhigh default + 1.0–1.35× tokenizer + more thinking at higher effort compound.
- System-reminder leak. Anthropic’s per-file-read malware-check reminder leaks into visible output on code-analysis sessions (“This file is clearly not malware — it’s a standard Vue 3 component…”). A real token tax worth noticing.
- “Car wash test” regressions reported by community testers.
- Model-identification bug — some early testers found 4.7 claiming “4.6 doesn’t exist, did you mean 4.5?” Resolved for most users within 24 hours.
The five patterns — one diagram
mindmap
root((Opus 4.7<br/>behavioural))
Self-verifies
Proofs before starting
Designs own validation
Honesty
Reports missing data
Fewer hallucinated fallbacks
Pushes back
77.6% assertiveness
Imperatives not hedges
Literal instructions
Re-test 4.6 prompts
Surfaces inconsistencies
Persists
Through tool failures
Highest quality-per-tool-call
Migration checklist
Before you ship 4.7 to production:
- Re-test existing prompts on real traffic. Literal interpretation changes output.
- Measure token cost on real traffic. The 1.0–1.35× tokenizer shift is content-dependent — measure, don’t assume.
- Audit verification layers. Some validation built around 4.6 is now redundant (model self-verifies); some still matters (domain requires belt-and-braces). Table:
| Check exists because… | Keep under 4.7? |
|---|---|
| Model was unreliable and hallucinated fallbacks | Probably redundant |
| Regulator requires documented validation | Keep |
| Cross-system data dependencies need reconciliation | Keep |
| We assumed the model wouldn’t surface inconsistency itself | Test — may now be redundant |
2. Cost optimisation — two levers
There are now two complementary techniques for controlling Claude Code spend. Combined, expect 50–75% reduction in token consumption.
Deep reads: Cut Claude Code’s output tokens by 75% (Dunlop) · The New Claude Code Monitoring (Rezvani) · Graperoot “178x” — the honest reframe
The two-lever model
flowchart LR
Prompt([Your prompt]) --> CC[Claude Code]
CC --> Model[Opus 4.7]
Model --> Response[Response]
Response --> Caveman{caveman<br/>plugin}
Caveman -->|compressed| Output([Output tokens -75%])
Telemetry[(OpenTelemetry)] -. metrics .- CC
Telemetry --> Dashboard[Grafana dashboard]
Dashboard --> Signals([Cache ratio<br/>Adoption<br/>Per-prompt cost])
classDef tool fill:#f4f7fb,stroke:#1a2b4a,color:#1a2b4a;
classDef runtime fill:#1a2b4a,stroke:#0f1a2e,color:#fff,rx:8,ry:8;
classDef gate fill:#ffd966,stroke:#a07800,color:#4d3800;
class Prompt,Output,Signals runtime;
class CC,Model,Response,Telemetry,Dashboard tool;
class Caveman gate;

Lever 1: Output compression — the caveman plugin
Alex Dunlop measured it directly. Same bug, same fix:
- Default Claude Code: 1,252 tokens
- With /caveman: 410 tokens
The ~800-token difference is filler: “Certainly,” “Sure, I’d be happy to help with that,” “The issue you’re experiencing is most likely caused by…”
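As a quick check on the two figures above, the reduction can be computed directly. This is only the arithmetic for this one measurement; it works out to roughly two thirds, a little under the headline ~75%, and results in other modes or on other tasks may differ.

```python
def reduction_pct(before_tokens: int, after_tokens: int) -> float:
    """Percentage of tokens removed going from before to after."""
    return (before_tokens - after_tokens) / before_tokens * 100


# Figures from Dunlop's same-bug, same-fix comparison.
default_tokens = 1252
caveman_tokens = 410

saved = default_tokens - caveman_tokens
pct = reduction_pct(default_tokens, caveman_tokens)
print(f"saved {saved} tokens ({pct:.1f}%)")  # prints: saved 842 tokens (67.3%)
```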
Install (two commands):
claude plugin marketplace add JuliusBrussee/caveman
claude plugin install caveman@caveman
Activate with /caveman. Three modes: lite / full (default) / ultra.
A companion caveman-compress rewrites CLAUDE.md files — reported ~45% savings on every session load.
The accuracy story is the important bit. The 2025 paper “Brevity Constraints Reverse Performance Hierarchies in Language Models” finds that brief responses improve accuracy by 26% on benchmarks. Verbose isn’t smarter; it’s just more expensive.
Lever 2: Observability — OpenTelemetry stack
Claude Code ships with OpenTelemetry instrumentation built in but opt-in and off by default. Turn it on with:
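export CLAUDE_CODE_ENABLE_TELEMETRY=1   # opt in; telemetry is off by default
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc   # 4317 is the conventional OTLP gRPC port
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.your-company.com:4317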
Reza Rezvani published a production-ready Docker Compose stack (OTel Collector + Prometheus + Grafana). For small-team pilots, Grafana Cloud’s free tier accepts OTLP directly — skip the collector entirely.
The 8 metrics ranked by operational value:
Track from day one:
- claude_code.cost.usage (by model): catches teams running Sonnet where Haiku would suffice
- claude_code.token.usage (by type: input/output/cache_read/cache_creation): cache-read ratio is the single best indicator of configuration health
- claude_code.active_time.total: Claude-working vs developer-waiting
After baseline:
- claude_code.session.count per user: real adoption vs self-reported
- claude_code.commit.count + pull_request.count: connect usage to output
- claude_code.code_edit_tool.decision: accept/reject ratio reveals CLAUDE.md quality
What the telemetry revealed on Rezvani’s 7-person team
- 3 of 7 engineers got 60%+ cache-read ratios; 4 were below 15%. Same codebase, same CLAUDE.md — only difference was prompt structure. Invisible without telemetry.
- 2 engineers generated 80% of sessions; 3 had essentially stopped after month one. Manager gut feel said the team was using Claude Code daily. The data said otherwise.
This is the kind of signal you can’t get any other way. Running the dashboard for two weeks will tell you the truth about adoption and spend.
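The cache-read comparison above can be reproduced from exported token counts. The sketch below is one way to do it, under assumptions: the ratio is defined here as cache-read tokens over all prompt-side tokens (the source does not give Rezvani's exact formula), the type labels mirror the list above, and the per-engineer numbers are invented for illustration.

```python
from typing import Dict


def cache_read_ratio(tokens_by_type: Dict[str, int]) -> float:
    """Share of prompt-side tokens served from cache.

    Assumed definition: cache_read / (input + cache_read + cache_creation).
    """
    cache_read = tokens_by_type.get("cache_read", 0)
    prompt_side = (tokens_by_type.get("input", 0)
                   + cache_read
                   + tokens_by_type.get("cache_creation", 0))
    return cache_read / prompt_side if prompt_side else 0.0


# Hypothetical per-engineer totals pulled from the metrics backend.
usage = {
    "engineer_a": {"input": 300_000, "cache_read": 650_000,
                   "cache_creation": 50_000, "output": 90_000},
    "engineer_b": {"input": 800_000, "cache_read": 100_000,
                   "cache_creation": 100_000, "output": 120_000},
}

LOW_RATIO = 0.15  # threshold taken from the "below 15%" group above
ratios = {name: cache_read_ratio(t) for name, t in usage.items()}
needs_review = sorted(name for name, r in ratios.items() if r < LOW_RATIO)
print(ratios, needs_review)
```

Engineers who land in `needs_review` are candidates for a prompt-structure conversation, not a performance flag; the number only says the cache is not being hit.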
Traces (beta) — TRACEPARENT propagation
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 + OTEL_TRACES_EXPORTER=otlp enables distributed tracing. Every Bash subprocess Claude Code runs receives a TRACEPARENT environment variable with W3C trace context. Subprocess spans automatically attach to the parent trace — end-to-end visibility from prompt to CI pipeline.
For Claude Code running in CI/CD, this answers: “That automated PR review took 45 seconds. Where did the time go?”
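A script launched as one of those subprocesses could pick up the propagated context as follows. This is a minimal sketch that parses the version-00 `traceparent` layout from the W3C Trace Context specification (`version-traceid-parentid-flags`); the helper and type names are illustrative, and a real pipeline would more likely hand the value to an OpenTelemetry SDK propagator than parse it by hand.

```python
import os
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: Optional[str]) -> Optional[TraceContext]:
    """Return the parsed context, or None if the value is absent or malformed."""
    if not value:
        return None
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None or match["version"] == "ff":
        return None
    # All-zero trace or parent IDs are invalid in W3C Trace Context.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceContext(match["version"], match["trace_id"],
                        match["parent_id"], sampled)


# Inside a subprocess started by Claude Code, the variable would already be set.
ctx = parse_traceparent(os.environ.get("TRACEPARENT"))
if ctx is None:
    print("no trace context; running untraced")
else:
    print(f"joining trace {ctx.trace_id} under span {ctx.parent_id}")
```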
3. Multi-model orchestration — three products, one week
Three separate product launches in early April 2026 converged on the same architectural question: when you have multiple AI models, how do you orchestrate them?
Deep reads: I Ran Codex and Claude Side by Side (Liu) · Anthropic Managed Agents launch · Building a first Managed Agent (Njenga) · The Orchestrator Was Missing (Rezvani)
Three architectures you should know
flowchart TB
subgraph Adv["Adversarial review (sequential)"]
A1[Claude drafts]
A1 --> A2[Codex reviews]
A2 --> A3([Findings])
end
subgraph Council["Model Council (parallel, transparent)"]
M1[Claude answers]
M2[Codex answers]
M1 --> MJ[Judge synthesises]
M2 --> MJ
MJ --> MO([Agreements + divergences])
end
subgraph Crit["Microsoft Critique (sequential, opaque)"]
C1[GPT drafts] --> C2[Claude audits]
C2 --> C3([Single final output<br/>attribution hidden])
end
classDef good fill:#ffd966,stroke:#a07800,color:#4d3800;
classDef opaque fill:#f4f7fb,stroke:#1a2b4a,color:#1a2b4a;
classDef gate fill:#d9534f,stroke:#8b2c2c,color:#fff;
class A1,A2,A3 opaque;
class M1,M2,MJ,MO good;
class C1,C2 opaque;
class C3 gate;

OpenAI Codex plugin for Claude Code (30 March 2026)
Official OpenAI release. Repo: github.com/openai/codex-plugin-cc. Not a fork — OpenAI publishing a plugin that runs Codex inside their direct competitor’s tool.
Install (5 minutes):
npm install -g @openai/codex
codex login
# inside claude code:
/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/codex:setup
Cost: ~$0.02 per call at GPT-5.4 pricing ($2.50/$15 per million). 50 calls/day ≈ $30/month.
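The cost line above can be reconstructed with a few lines of arithmetic. The per-million prices and the 50 calls/day figure are from the text; the token counts per call are an assumption chosen to show one combination that lands on ~$0.02, and a 30-day month is assumed.

```python
# GPT-5.4 pricing quoted above: $2.50 / $15 per million input / output tokens.
INPUT_USD_PER_MTOK = 2.50
OUTPUT_USD_PER_MTOK = 15.00


def cost_per_call_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one review call at the quoted rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical review: ~5,000 tokens of diff and instructions in,
# ~500 tokens of findings out.
per_call = cost_per_call_usd(5_000, 500)
monthly = per_call * 50 * 30  # 50 calls/day over an assumed 30-day month
print(f"${per_call:.2f} per call, ${monthly:.2f} per month")
```

Larger diffs push the input side up quickly, so it is worth re-running the numbers with your own typical PR size.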
Three commands worth knowing:
| Command | When to use |
|---|---|
| /codex:adversarial-review | Before merging anything non-trivial |
| /codex:rescue [task] | Background task hand-off; keeps your Claude context |
| /codex:result | Retrieve the async result |
The loop-breaker heuristic (community practice, validated across multiple sources): when Opus gets stuck in a loop, hand the problem to Codex. Opus overthinks; Codex pinpoints. The plugin makes this a one-command move.
Anthropic Managed Agents (8 April 2026, public beta)
API-only launch. Named customers: Notion, Sentry, Rakuten, Asana, Vibecode. Rakuten also appeared in Opus 4.7’s SWE-Bench 3× claim — suggests they’re running a deep Anthropic-integrated stack worth studying.
The technical argument from the launch thread (Soft_Match5737):
“The key advantage of first-party managed agents over CrewAI/AutoGen/LangGraph is context continuity. Third-party frameworks shuttle messages between isolated API calls, which means every agent handoff loses the implicit reasoning state. When Anthropic controls both the orchestration and the model, they can maintain internal representations across agent boundaries without serialising everything to text.”
If that structural claim holds up in practice, it’s a real efficiency edge over any third-party orchestration framework — including the OMC stack you may be building on.
Worth reading the blog and trying the quickstart before committing further investment to any third-party multi-agent framework for production workloads. Links:
- https://platform.claude.com/workspaces/default/agent-quickstart
- https://claude.com/blog/claude-managed-agents
Microsoft Copilot Cowork (30 March 2026)
Two multi-model architectures inside Outlook / Teams / Excel for M365 Frontier-tier customers.
- Critique (sequential) — GPT drafts, Claude audits, user sees only final reviewed output. Clean UX, opaque per-model attribution.
- Model Council (parallel) — GPT and Claude each produce an independent report, a judge model synthesises showing agreements and divergences. Transparent.
The benchmark marketing — 13.8% DRACO win — is worth being skeptical about: the benchmark was created by Perplexity (Microsoft’s competitor), Microsoft ran the tests, and GPT-5.2 (same vendor) was the judge. No independent replications.
Commercial context: Microsoft stock fell 23% in Q1 2026 — worst quarter since 2008. Only 3.3% of 450M M365 subscribers pay for Copilot. E7 bundle at $99/seat (1 May 2026) — M365 E5 + Copilot + Agent 365 + Entra Suite — is the monetisation vehicle; Cowork multi-model features justify the price jump.
4. Regulated AI — the compliance problem nobody’s writing about
For any Harris / Constellation vertical touching regulated work (financial services, healthcare, utilities, public sector), Critique’s attribution gap is the issue to raise before deployment.
Deep reads: I Ran Codex and Claude Side by Side — Liu’s bank-employee analysis of the SR 11-7 attribution gap · Claude Mythos preview · Glasswing & Claude Mythos for CTOs
The compliance decision tree
flowchart TB
A([Considering multi-model AI product]) --> B{Workload touches<br/>SR 11-7 or Annex III?}
B -- no --> C[Deploy normally]
B -- yes --> D{Vendor provides<br/>per-model attribution?}
D -- yes --> E{Notifies proactively<br/>on model updates?}
E -- yes --> F[Deploy with audit integration]
E -- no --> G[Require contractual SLA<br/>or decline]
D -- no --> H{Transparent alternative<br/>available?}
H -- yes --> I[Use Model Council<br/>or equivalent]
H -- no --> J[Decline until<br/>attribution exists]
classDef ok fill:#1a2b4a,stroke:#0f1a2e,color:#fff,rx:8,ry:8;
classDef gate fill:#ffd966,stroke:#a07800,color:#4d3800;
classDef bad fill:#d9534f,stroke:#8b2c2c,color:#fff;
class C,F,I ok;
class B,D,E,H gate;
class G,J bad;

SR 11-7 (US Federal Reserve)
Applies to every model a bank uses in a consequential decision. Requires:
- Model identification (you must know which models produced which claim)
- Independent validation per model
- Documented limitations per model
- Ongoing monitoring
- No outsourced accountability — using a vendor does not transfer your bank’s regulatory responsibility
The OCC, Fed, and FDIC are actively applying SR 11-7 principles to generative AI deployments. A Critique-style pipeline where the bank cannot identify which GPT and Claude versions produced a given claim has a material compliance gap.
EU AI Act Article 12 (effective 2 August 2026)
Requires automatic, event-level logs with traceability for high-risk AI systems. Annex III classifies as high-risk:
- Credit scoring
- AML / fraud detection
- Loan approval
- KYC / identity verification
- Many HR, insurance, and public-sector use cases
Microsoft’s current M365 audit logging is who-ran-what-query-when. Not the same as per-model attribution per inference call. Whether E7 customers can get model-level logs is a question Microsoft has not publicly answered.
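To make the distinction between query-level and inference-level logging concrete, here is a minimal sketch of what a per-model attribution record could look like in a pipeline you control. Everything in it is illustrative: the field names, the model IDs, and the choice to store content hashes are assumptions, and it is not a statement of what SR 11-7 or Article 12 require in detail.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class InferenceEvent:
    """One log line per model call, not one per user query."""
    request_id: str     # ties all steps of one user request together
    step: str           # pipeline role, e.g. "draft" or "audit"
    model_id: str       # exact model version that produced the output
    timestamp_utc: str
    input_sha256: str   # hashes let auditors match content without storing it
    output_sha256: str


def record_event(request_id: str, step: str, model_id: str,
                 prompt: str, output: str) -> InferenceEvent:
    return InferenceEvent(
        request_id=request_id,
        step=step,
        model_id=model_id,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        input_sha256=_sha256(prompt),
        output_sha256=_sha256(output),
    )


def to_jsonl(events: List[InferenceEvent]) -> str:
    """Serialise events as JSON Lines for an append-only audit log."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


# Hypothetical two-step pipeline; the model IDs are placeholders.
draft = record_event("req-001", "draft", "vendor-a-model-v1",
                     "Summarise Q1 risk", "Draft text")
audit = record_event("req-001", "audit", "vendor-b-model-v2",
                     "Draft text", "Reviewed text")
print(to_jsonl([draft, audit]))
```

Because the audit step's input hash equals the draft step's output hash, a reviewer can reconstruct which model version produced which intermediate result for a given request.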
The three vendor questions
Put these in writing on every vendor call for Cowork (or any similar multi-model enterprise product):
- Which model versions are in the Critique pipeline right now, and will you notify us proactively when either model is updated?
- Do customers at our plan tier receive per-model attribution in audit logs, or only aggregate platform query logs?
- How does your documentation support our SR 11-7 model inventory and EU AI Act Article 12 traceability obligations for the multi-model feature specifically?
If the answers are vague or deferred, that is the answer. It doesn’t mean don’t deploy — it means don’t deploy the opaque pipeline for high-risk AI use cases until the documentation exists. Use Model Council, Anthropic Managed Agents with explicit multi-model visibility, or your own transparent multi-model pattern in the meantime.
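The compliance decision tree from this section can also be written down as a small function, which makes it easy to drop into an intake form or a procurement checklist. The function and parameter names are illustrative; the branches and outcome strings follow the diagram.

```python
def compliance_decision(touches_regulated_workload: bool,
                        per_model_attribution: bool,
                        proactive_update_notice: bool,
                        transparent_alternative: bool) -> str:
    """Walk the decision tree for a multi-model AI product."""
    if not touches_regulated_workload:
        # Workload is outside SR 11-7 and Annex III scope.
        return "Deploy normally"
    if per_model_attribution:
        if proactive_update_notice:
            return "Deploy with audit integration"
        return "Require contractual SLA or decline"
    if transparent_alternative:
        return "Use Model Council or equivalent"
    return "Decline until attribution exists"


# An opaque pipeline on a regulated workload, with a transparent option available.
print(compliance_decision(True, False, False, True))
```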
5. Prompt discipline — the Cialdini playbook for LLMs
Rick Hightower’s April 2026 analysis lands on a finding worth internalising: LLMs don’t just hallucinate — they rationalise, cut corners, and abandon plans under pressure, in patterns measurably similar to tired human developers.
The Wharton study
July 2025, co-authored by Robert Cialdini himself. Across 28,000 conversations with GPT-4o-mini:
| Condition | Compliance rate |
|---|---|
| Baseline | 33.3% |
| With persuasion principles | 72.0% |
| Commitment principle (foot-in-the-door) | 100% (from 10%) |
Authority claims alone lifted compliance by 65% on requests the AI would normally refuse.
The practical translation — four principles
- Authority — replace “it would be good to” with “You MUST” and “This is not negotiable.”
- Commitment — force the agent to state its plan before acting. Once stated, it’s committed.
- Social proof — invoke “standard practice” or “production-grade requires” to activate the model’s pattern-match against professional norms.
- Scarcity (inverted) — use adversarial pressure-tests to verify the agent won’t abandon discipline under urgency.
The key innovation — rationalisation tables
Catalog the specific excuses your agent generates when it wants to cut a corner. Pair each with a pre-written rebuttal.
## Common Rationalisations (Do Not Fall for These)
| If you think... | The reality is... |
|---|---|
| "This change is too small to test" | Small changes cause big outages. Test it. |
| "I'll fix the linting later" | Later never comes. Fix it now. |
| "The existing tests cover this" | Verify that claim. Run them. Check coverage. |
| "This is just a config change" | Config changes cause more outages than code changes. Test it. |
This is more potent than it looks because Opus 4.7’s literal instruction following amplifies both the authority and commitment effects. Weak, suggestion-based CLAUDE.md files produce noticeably worse output on 4.7 than on 4.6. The wording change from “should” to “MUST” is a free upgrade on every prompt your team runs.
Action item
During the 4.7 migration window, audit every CLAUDE.md across your projects. Replace suggestion-language with authority-language. Add rationalisation tables for your agents’ top 3 excuses. Version and iterate.
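A first pass at that audit can be automated. The sketch below flags suggestion-style wording line by line; the phrase list is a hypothetical starting point, not something taken from Hightower's analysis, and it will also flag quoted excuses inside a rationalisation table, so treat the hits as a review queue.

```python
import re
from typing import List, Tuple

# Hypothetical starter list of suggestion-style phrases to review.
SOFT_PHRASES = ["should", "it would be good to", "try to",
                "consider", "if possible", "ideally"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SOFT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_soft_language(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, phrase) for every soft phrase found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((line_no, match.group(1).lower()))
    return hits


sample = """# Testing
You should run the unit tests before committing.
You MUST run the linter before opening a PR.
Try to keep functions small if possible.
"""

for line_no, phrase in find_soft_language(sample):
    print(f"line {line_no}: '{phrase}'")
```

Running it over each project's CLAUDE.md gives a per-file count that can be tracked across iterations of the file.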
6. Local models — Gemma 4 is viable
Gemma 4 scores 86.4% on the tau2-bench function-calling benchmark. Gemma 3 scored 6.6%. That gap is what makes local agentic coding practical.
Deep reads: I ran Gemma 4 as a local model in Codex CLI (Vaughan) · Gemma 4 — Google’s open-source release (Njenga)
flowchart LR
A([Gemma 3<br/>6.6% tau2-bench]) -->|local agentic coding| B([Broken])
C([Gemma 4<br/>86.4% tau2-bench]) -->|local agentic coding| D([Works])
classDef bad fill:#d9534f,stroke:#8b2c2c,color:#fff;
classDef good fill:#ffd966,stroke:#a07800,color:#4d3800;
classDef state fill:#1a2b4a,stroke:#0f1a2e,color:#fff,rx:8,ry:8;
class A bad;
class C good;
class B,D state;

Apache 2.0 licensed, 256K context, self-hostable on Azure ML or on-prem.
When it’s worth it
- Privacy-sensitive workloads — codebases that can’t leave the machine. Especially relevant for Centurion (banking) and any regulated Constellation vertical.
- Cost-sensitive iteration loops — multiple parallel sessions add up on cloud; local is free after hardware.
- Resilience against cloud outages / pricing changes
Setup (24 GB Apple Silicon)
Do not use Ollama on Apple Silicon with Gemma 4 as of April 2026. v0.20.3 has a streaming bug that routes tool-calls to the reasoning field and a Flash Attention freeze on prompts >500 tokens (Codex CLI’s system prompt is ~27,000 tokens).
Use llama.cpp:
brew install llama.cpp
llama-server \
-m /path/to/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0
Setup (NVIDIA)
Ollama v0.20.5 works. vLLM fails (PyTorch ABI mismatch on Blackwell). llama.cpp from source compiles but Codex CLI’s wire_api rejects it.
ollama pull gemma4:31b
codex --oss -m gemma4:31b
Real-world performance (Vaughan’s same-task benchmark)
| Config | Passes | Time | Tool calls |
|---|---|---|---|
| GPT-5.4 cloud | 5/5 first try | 65 s | — |
| GB10 31B Dense | 5/5 first try | 7 min | 3 |
| Mac 26B MoE | 5/5 eventually | 4m 42s | 10 |
Surprise finding: the Mac generates tokens 5.1× faster than the GB10 despite identical memory bandwidth, because MoE sparse activation dominates memory-bandwidth-limited generation. But raw speed alone doesn’t decide the outcome: the Mac’s 5.1× token-speed advantage cut end-to-end time by only about a third (4m 42s vs 7 min) because the extra speed went into retries (10 tool calls vs 3). First-pass reliability matters more than raw throughput for agentic work.
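The gap between token speed and end-to-end time is easy to verify from the table. The sketch uses only the two wall-clock times and the 5.1× figure reported above; no other measurements are assumed.

```python
gb10_seconds = 7 * 60        # GB10 31B Dense: 7 min
mac_seconds = 4 * 60 + 42    # Mac 26B MoE: 4m 42s
token_speed_ratio = 5.1      # Mac vs GB10 token generation speed

time_saved_pct = (gb10_seconds - mac_seconds) / gb10_seconds * 100
end_to_end_speedup = gb10_seconds / mac_seconds

print(f"wall-clock time saved: {time_saved_pct:.1f}%")
print(f"end-to-end speedup: {end_to_end_speedup:.2f}x "
      f"against a {token_speed_ratio}x token-speed advantage")
```

Most of the token-speed advantage never reaches the wall clock, which is the point of the first-pass-reliability argument.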
The hybrid workflow
codex --profile local for iteration and privacy-sensitive work. Default cloud for anything complex. Codex CLI’s profile system makes switching a single flag.
7. CLI vs MCP — choose per integration, not per system
The most useful architectural insight from the April 2026 research is Reza Rezvani’s per-integration decision framework. After 14 months running both in production, his team’s split is roughly 70/30 CLI/MCP — not by philosophy, by triage.
Deep read: The CLI vs MCP Debate Is Asking the Wrong Question (Rezvani)
The per-integration decision
flowchart TB
A([New integration]) --> B{Tool runs locally?}
B -- yes --> C{Has mature CLI?}
B -- no --> D{Has a CLI?}
C -- yes --> E[CLI]
C -- no --> F[Wrap in shell script]
D -- yes, ambient auth --> E
D -- yes, delegated OAuth --> G[MCP]
D -- no --> G
F --> H{Scales past 2 integrations?}
H -- yes --> E
H -- no --> G
classDef start fill:#1a2b4a,stroke:#0f1a2e,color:#fff,rx:8,ry:8;
classDef gate fill:#ffd966,stroke:#a07800,color:#4d3800;
classDef cli fill:#e8f1fa,stroke:#4a6fa5,color:#1a2b4a;
classDef mcp fill:#f4f7fb,stroke:#1a2b4a,color:#1a2b4a;
class A start;
class B,C,D,H gate;
class E,F cli;
class G mcp;
What broke at MCP-only
- Six MCP servers loaded ~48,000 tokens of tool schemas before the user typed a character
- On 200K context that’s 24% consumed by plumbing; on 128K, 37.5%
- Context isn’t storage — it’s attention. Multi-step reasoning degraded visibly
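The percentages above follow directly from the schema size and the context window, and the same two-line calculation works for your own server mix. The 48,000-token figure and the two window sizes are from the list; nothing else is assumed.

```python
SCHEMA_TOKENS = 48_000  # tool schemas loaded by six MCP servers


def overhead_pct(schema_tokens: int, context_window: int) -> float:
    """Share of the context window consumed before the first prompt."""
    return schema_tokens / context_window * 100


for window in (200_000, 128_000):
    pct = overhead_pct(SCHEMA_TOKENS, window)
    print(f"{window:,}-token context: {pct:.1f}% used by tool schemas")
```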
The data (5-workflow controlled comparison)
| Metric | MCP-only | Hybrid (CLI exec + MCP reads) |
|---|---|---|
| Median tokens/workflow | 67,200 | 23,400 |
| Completion rate | 74% | 96% |
| Multi-step reasoning failures | 31% | 8% |
| Avg completion time | 47 s | 19 s |
Three factors for the right transport
- Where does the tool run? Local → CLI. Remote infra → depends.
- How does it authenticate? Ambient credentials → CLI. Delegated OAuth multi-tenant → MCP.
- What does the workflow look like? Single-tenant dev automation → CLI overwhelmingly. Multi-tenant production acting on behalf of customers → MCP governance.
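The per-integration decision flow from earlier in this section can be encoded as a short function, which is a convenient way to triage a list of integrations in one pass. The function name, parameter names, and the two auth labels are illustrative; the branches follow the flowchart, and any remote tool that is not using ambient credentials is treated as MCP.

```python
def choose_transport(runs_locally: bool, has_cli: bool,
                     auth: str = "ambient",
                     scales_past_two_integrations: bool = False) -> str:
    """Return "CLI" or "MCP" for a single integration.

    auth is "ambient" or "delegated_oauth"; it only matters for remote tools.
    """
    if runs_locally:
        if has_cli:
            return "CLI"
        # No mature CLI: wrap it in a shell script, then apply the
        # flowchart's scaling check.
        return "CLI" if scales_past_two_integrations else "MCP"
    if has_cli and auth == "ambient":
        return "CLI"
    # Remote tool with delegated OAuth, or with no CLI at all.
    return "MCP"


# Examples in the spirit of the production split described in this section.
print(choose_transport(runs_locally=True, has_cli=True))    # git-style tool
print(choose_transport(runs_locally=False, has_cli=False))  # SaaS with no CLI
```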
Production split (Rezvani’s team)
- CLI: git, file operations, builds, tests, deploys
- MCP: Slack messaging, SaaS API queries, services with no CLI (Salesforce, Workday, ServiceNow)
What both camps get wrong
- CLI camp: assumes single-dev security scales to multi-tenant (it doesn’t); ignores that most SaaS services don’t have CLIs and never will
- MCP camp: dismisses token overhead as “price of admission” — 48K of plumbing lobotomises the agent’s attention; lacks native composability (can’t pipe one MCP tool’s output into another; Unix solved this decades ago)
The real leverage is agent-native tool interface design — machine-readable output by default, schema introspection at runtime, input validation for agent-specific mistakes, dry-run modes.
8. Knowledge & context — Karpathy’s “LLM Wiki”
Andrej Karpathy dropped a GitHub Gist titled simply “LLM Wiki” in April. It’s not an app or a library — it’s a design pattern for putting an LLM-maintained, compounding layer of markdown files between you and your raw source material.
Deep read: Why Karpathy’s “LLM Wiki” is the Future (evoailabs)
Why this matters
- RAG has no memory of prior questions. Ask the same compound question tomorrow — it redoes the synthesis from scratch.
- Human-maintained wikis decay because bookkeeping burden grows faster than value.
- Karpathy’s pattern inverts both — the LLM does the grunt work of reading, extracting, cross-referencing.
“Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.”
Three operations
- Ingest — drop file in raw/, agent writes summary + updates 10-15 concept pages + adds backlinks
- Query — agent reads index.md, navigates to relevant pages, answers; new insights from chat get filed back
- Lint — periodic health-check for broken links, stale claims, contradictions, orphan pages
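The mechanical half of the lint operation, broken links and orphan pages, needs no model at all. Below is a minimal sketch assuming a flat folder of markdown files that use Obsidian-style `[[wikilinks]]`; that layout and link syntax are assumptions, and checking stale claims or contradictions would still need the agent.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Set

# Captures the target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(root: Path) -> Dict[str, List[str]]:
    """Report links to missing pages and pages nothing links to."""
    pages = {p.stem: p for p in root.glob("*.md")}
    linked: Set[str] = set()
    broken: List[str] = []
    for name, path in sorted(pages.items()):
        for target in WIKILINK.findall(path.read_text(encoding="utf-8")):
            target = target.strip()
            if target in pages:
                linked.add(target)
            else:
                broken.append(f"{name} -> {target}")
    # index.md is the entry point, so it is never counted as an orphan.
    orphans = sorted(n for n in pages if n not in linked and n != "index")
    return {"broken_links": broken, "orphans": orphans}


# Tiny throwaway wiki to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.md").write_text("See [[caching]] and [[pricing]].", encoding="utf-8")
    (root / "caching.md").write_text("Related: [[index]]", encoding="utf-8")
    (root / "stale.md").write_text("No one links here.", encoding="utf-8")
    report = lint_wiki(root)

print(report)
```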
The ecosystem worth knowing
- Waykee Cortex — hierarchical team knowledge; dual-inheritance of Knowledge + Work layers
- Sage-Wiki — treats LLM as a compiler; typed-entity system prevents duplicates
- Thinking-MCP — captures how you think, not factual data; node decay mirrors a live human brain
- ELF — scientific research; base-delta protocol for incremental experiments
- qmd (by Shopify CEO Tobi Lütke) — local BM25 + vector hybrid search with MCP server
What to do with this
For any Harris internal knowledge base (customer support, engineering onboarding, compliance references) — this is the right architecture. Replacing or augmenting Notion/Confluence with an LLM-maintained wiki is practical today. The “new team members instantly browse an up-to-date wiki nobody manually wrote” use case is directly applicable to FIS customer support and onboarding.
9. Autonomous overnight development — /bad
/bad (BMad Autonomous Development) is a coordinator-only skill that takes over the moment your planning is done and runs sprint execution autonomously.
Deep read: /bad: BMad Autonomous Development
GitHub: stephenleo/bmad-autonomous-development
Install: npx skills add https://github.com/stephenleo/bmad-autonomous-development (BMAD must already be installed)
Invoke: /bad in Claude Code.
Architecture
- Dependency mapping: builds a graph from sprint-status.yml to identify parallelisable stories
- Isolated execution: each story runs in its own git worktree, preventing environment pollution
- 4-step lifecycle per task: BMAD Create-Story → Dev-Story → Code-Review → GitHub PR
- Self-healing CI — monitors CI results and reviewer comments, auto-fixes until green
Never writes code itself. Delegates every unit of work to dedicated sub-agents with fresh context windows — prevents the “context explosion” that happens when a single session stays open too long.
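The dependency-mapping idea can be illustrated with Python's standard-library topological sorter. This is a sketch of the general technique, not /bad's actual implementation: the story names and the dictionary shape are hypothetical, and the real sprint-status.yml schema is not shown in the source.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group stories into waves; every story in a wave can run in parallel.

    dependencies maps each story to the set of stories it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # unblocks stories that depended on this wave
    return waves


# Hypothetical sprint: two stories depend on auth, one depends on both.
stories = {
    "story-1-auth": set(),
    "story-2-profile": {"story-1-auth"},
    "story-3-billing": {"story-1-auth"},
    "story-4-dashboard": {"story-2-profile", "story-3-billing"},
}
waves = parallel_waves(stories)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

Each wave would map to a batch of worktrees; the second wave here is where the parallelism pays off.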
Pair with
- The caveman plugin (above) for output compression
- The OpenTelemetry stack (above) for visibility into subagent spend
- The “loop-breaker” heuristic (above) for when a specific task jams
When to use it
- Sprint-level execution where PRDs + stories are already written
- Maintenance backlogs of well-scoped, parallelisable tickets
- Dependency-bumping campaigns across many services
- Generated-code-heavy tasks (migrations, scaffolding)
When not
- Exploratory or discovery work (that’s the human part)
- Customer-facing changes without human PR review
- Anything touching regulatory/compliance decisions
- Projects without good CI coverage — the self-heal loop relies on CI pass/fail signal to know what to fix
Hosted alternative — Claude Code Routines (14 April 2026)
Anthropic shipped Routines the same week: saved Claude Code configs (prompt + repo + connectors) that run on Anthropic’s cloud with three trigger types — scheduled (cron-like), API (HTTP endpoint), GitHub webhook. Research preview. Daily limits: Pro 5 · Max 15 · Team/Enterprise 25. Create via /schedule or at claude.ai/code.
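Before committing to a set of Routines, it is worth checking that the planned runs fit the daily limit for your plan. The sketch below uses the limits quoted above and assumes that every run counts against the limit regardless of trigger type; that counting rule is an assumption, and the helper names are hypothetical. The limits themselves are research-preview figures and may change.

```python
import math
from typing import Dict

# Daily run limits as quoted in this briefing (research preview; subject to change).
DAILY_LIMITS: Dict[str, int] = {"pro": 5, "max": 15, "team": 25, "enterprise": 25}


def runs_per_day(interval_hours: float) -> int:
    """Number of runs a fixed-interval schedule makes in 24 hours, starting at hour 0."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return math.ceil(24 / interval_hours)


def fits_plan(plan: str, scheduled_runs: int, expected_triggered_runs: int = 0) -> bool:
    """Assumption: scheduled, API, and webhook runs all count against one daily limit."""
    return scheduled_runs + expected_triggered_runs <= DAILY_LIMITS[plan]


if __name__ == "__main__":
    nightly_audit = runs_per_day(24)  # one run per day
    triage = runs_per_day(6)          # four runs per day
    scheduled = nightly_audit + triage
    print("scheduled runs per day:", scheduled)
    print("fits Pro:", fits_plan("pro", scheduled))
    print("fits Pro with 3 webhook runs:", fits_plan("pro", scheduled, expected_triggered_runs=3))
    print("fits Max with 3 webhook runs:", fits_plan("max", scheduled, expected_triggered_runs=3))
```

Webhook-triggered Routines are the hard part to budget, since their volume depends on repository activity rather than on a schedule you control.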
/bad vs Routines, one line each:
| Need | Use |
|---|---|
| Dependency-graph-parallel sprint execution, OMC/BMAD tooling in loop | /bad (local, worktree-isolated) |
| Nightly triage / weekly audit / webhook-triggered PR analysis, no local infra | Routines (hosted) |
| Regulated vertical (SR 11-7 / EU AI Act) | Neither until per-run audit logging is confirmed |
10. The 30-day action list
If you read nothing else, do this:
- Install caveman in your Claude Code setup. Two commands, ~75% output-token reduction. Take 5 minutes today.
- Turn on Claude Code telemetry for your team. Console exporter proves the concept in 10 minutes. Stand up the Grafana Cloud free tier this week. In two weeks you’ll know your actual adoption and cache-ratio picture.
- Upgrade to Opus 4.7 with the 3-step checklist — re-test prompts, measure token cost on real traffic, audit which verification layers are now redundant.
- If you have an internal security team doing vulnerability research or pen-testing, apply for the Cyber Verification Program before the 4.7 upgrade hits.
- Install the OpenAI Codex plugin in one Claude Code environment to try /codex:adversarial-review on next week’s PRs. ~$30/month.
- Audit every CLAUDE.md for authority-language vs suggestion-language. Add a rationalisation table for your agents’ top 3 excuses. Takes less than an hour per project.
- If you’re evaluating Copilot Cowork for any regulated vertical, put the three vendor questions in writing before signing.
- If you’re building on LangGraph / CrewAI / custom multi-agent — read the Anthropic Managed Agents blog and try the quickstart before your next architectural decision.
- For any privacy-sensitive codebase (Centurion banking, regulated verticals) — stand up a Gemma 4 local-model test this quarter.
- Pilot one Claude Code Routine on a low-risk scheduled job (nightly dep audit, changelog). Research preview; daily limits are tight on Pro (5), roomier on Team/Enterprise (25). Keep outputs as draft PRs until you’ve watched a few dozen runs.
- Skim the Playbook (link below) — the full technical detail for every topic above sits in a dedicated page there.
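For the telemetry item in the list above, the console-exporter proof of concept comes down to a handful of environment variables. The sketch below builds that environment in Python. The variable names follow Claude Code's monitoring documentation as best I recall it, so confirm them against the current docs before rolling out; `telemetry_env` is a hypothetical helper and the endpoint shown in the usage note is a placeholder.

```python
import os
from typing import Dict, Mapping, Optional


def telemetry_env(
    base: Optional[Mapping[str, str]] = None,
    otlp_endpoint: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with Claude Code telemetry switched on.

    With no endpoint, metrics go to the console exporter (the 10-minute proof of concept).
    With an endpoint, metrics go to an OTLP collector over gRPC.
    """
    env = dict(os.environ if base is None else base)
    env["CLAUDE_CODE_ENABLE_TELEMETRY"] = "1"
    if otlp_endpoint is None:
        env["OTEL_METRICS_EXPORTER"] = "console"
    else:
        env["OTEL_METRICS_EXPORTER"] = "otlp"
        env["OTEL_EXPORTER_OTLP_PROTOCOL"] = "grpc"
        env["OTEL_EXPORTER_OTLP_ENDPOINT"] = otlp_endpoint
    return env
```

A session could then be launched with `subprocess.run(["claude"], env=telemetry_env())` for the console check, or with `telemetry_env(otlp_endpoint="http://localhost:4317")` once a collector is running locally.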
Links and references
Reference pages (full technical depth)
- Claude Opus 4.7 Reference
- Cost & Observability
- Multi-Model Orchestration
- Regulated AI
- Prompt Discipline
- Local Models
- Knowledge & Context
- BMad Autonomous Development
News & Research deep reads (57 articles)
The News & Research section has a dedicated per-article page for every substantive source cited above, with thematic and chronological browsing. Each page links the primary source and the related Playbook reference pages.
Key sources linked inline in each section above:
- Reza Rezvani, All About Claude Opus 4.7 Features — behavioural release thesis
- Joe Njenga, Claude Opus 4.7 Is Here — feature breakdown
- Alex Dunlop, I Cut Claude Code’s Output Tokens by 75% — caveman plugin
- Reza Rezvani, The New Claude Code Monitoring — OpenTelemetry stack
- Yanli Liu, I Ran Codex and Claude Side by Side — multi-model + compliance
- Daniel Vaughan, I ran Gemma 4 as a local model in Codex CLI — local setup
- Rick Hightower, Superpowers: The Psychology Hack — Cialdini prompting
- evoailabs, Why Karpathy’s “LLM Wiki” is the Future
- Reza Rezvani, The CLI vs MCP Debate Is Asking the Wrong Question
- Anthropic, Introducing Claude Managed Agents — public beta
Contact: Alex O’Reilly, Force Information Systems, aoreilly@harriscomputer.com
Generated by Claude Opus 4.7 via oh-my-claudecode. The research trail for this briefing spans 69 research items across Medium, Reddit, Forbes, and vendor announcements, consolidated into 57 per-article News & Research pages.