I ran Gemma 4 as a local model in Codex CLI
Source: Daniel Vaughan, I ran Gemma 4 as a local model in Codex CLI (Medium / Google Cloud Community, 2026-04-13, 7 min read)
Key takeaways
- Gemma 3 scored 6.6% on tau2-bench function-calling. Gemma 4 scores 86.4%. The gap that makes local agentic coding practical.
- Apple Silicon: do NOT use Ollama with Gemma 4 — two bugs block it. Use llama.cpp with
--jinja. - NVIDIA: vLLM fails (PyTorch ABI mismatch), llama.cpp from source rejected by Codex CLI — Ollama v0.20.5 is the first path that works.
- On the same task, GPT-5.4 cloud finished in 65s first-try. Mac 26B MoE finished in 4m 42s with 10 tool calls. GB10 31B Dense finished in 7m with 3 tool calls.
- Token speed doesn’t dominate: first-pass reliability matters more. Mac was 5.1× faster at generation but only 30% faster end-to-end.
- Hybrid workflow:
codex --profile localfor iteration / privacy-sensitive work; cloud default for complex.
Why local, why now
Three drivers pushing Vaughan (and others) toward local:
- Cost — heavy Codex CLI runs add up, multiple parallel sessions compound
- Privacy — some codebases can’t leave the machine
- Resilience — cloud APIs throttle, go down, change pricing. Local runs.
The previous blocker was function calling. Gemma 3: 6.6% on tau2-bench. 93 failures out of 100. Not a foundation for anything agentic. Gemma 4: 86.4%. Broken → works.
Hardware tested
- Mac M4 Pro, 24 GB — ran 26B MoE variant, llama.cpp, Q4_K_M quantisation
- Dell GB10, 128 GB unified (NVIDIA Blackwell) — ran 31B Dense variant, Ollama v0.20.5
- Cloud baseline — GPT-5.4 with high reasoning effort
Both local machines configured as custom model providers in Codex CLI config.toml with wire_api = "responses".
Mac setup (what broke)
Ollama failed — two bugs
- v0.20.3 streaming bug routes Gemma 4’s tool-call responses to the wrong field (reasoning output instead of
tool_callsarray) - Flash Attention freeze hangs Ollama on any prompt longer than ~500 tokens with Gemma 4 on Apple Silicon
- Codex CLI’s system prompt alone is ~27,000 tokens — the request starts ingesting and then nothing useful happens
llama.cpp worked — the six load-bearing flags
brew install llama.cpp
llama-server \
-m /path/to/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--port 1234 -ngl 99 -c 32768 -np 1 --jinja \
-ctk q8_0 -ctv q8_0
| Flag | Why it matters on 24 GB |
|---|---|
-np 1 | Single slot. Multiple slots multiply KV cache memory. |
-ctk q8_0 -ctv q8_0 | Quantises KV cache (940 MB → 499 MB) |
--jinja | Required for Gemma 4’s tool-calling template |
-c 32768 | Context. System prompt needs ≥27K. |
-m with direct path | Avoid -hf which silently downloads a 1.1 GB vision projector causing OOM |
Codex CLI config needs web_search = "disabled" — Codex sends a web_search_preview tool type that llama.cpp rejects.
NVIDIA GB10 setup
- vLLM failed. v0.19.0 compiled against PyTorch 2.10.0; only CUDA-enabled PyTorch for aarch64 Blackwell (compute capability sm_121) is 2.11.0+cu128. Different ABI. ImportError.
- llama.cpp from source compiled fine but Codex CLI’s
wire_api = "responses"sends non-function tool types llama.cpp rejects. - Ollama v0.20.5 worked. The Apple Silicon streaming bug did not reproduce on NVIDIA.
ollama pull gemma4:31b
# SSH tunnel if remote (Codex --oss checks only localhost):
ssh -L 11434:localhost:11434 user@gb10-host
codex --oss -m gemma4:31b
Text generation and tool calling both worked on first attempt.
Setup time: Mac took most of an afternoon. GB10 took about an hour, most of it waiting for model downloads.
The benchmark
Same task through codex exec --full-auto: write a parse_csv_summary Python function with error handling + tests + run them. Single practical spot check, not statistically robust.
| Config | Result | Time | Tool calls | Code quality |
|---|---|---|---|---|
| GPT-5.4 cloud | 5/5 first try | 65 s | — | Type-hinted, clean exception chaining, boolean detection, helper function |
| GB10 31B Dense | 5/5 first try | 7 min | 3 | Functional, no type hints/boolean detection, solid error handling, no dead code |
| Mac 26B MoE | 5/5 eventually | 4m 42s | 10 | Dead code left in; test file took 5 attempts — heredoc failures like filerypt instead of file_path, encoding=' 'utf-8' with rogue space |
The speed surprise
Both machines have 273 GB/s LPDDR5X memory bandwidth. Mac generates tokens 5.1× faster than GB10.
Vaughan did not expect this. The explanation is the Mixture of Experts architecture:
- Token generation is memory-bandwidth limited
- 31B Dense reads all 31.2B params per token (~17.4 GB through 273 GB/s → 10 tok/s)
- 26B MoE activates only 3.8B per token (~1.9 GB through 273 GB/s → 52 tok/s)
Same pipe, vastly different payload. MoE sparse activation dominates memory-bandwidth-limited generation.
Prompt processing surprise too: Mac held its own vs Blackwell GPU — 531 tok/s vs 548 tok/s at 8K context.
What changed Vaughan’s mind
“I went into this assuming token speed would dominate the experience. On this task, it did not. The Mac generated tokens 5.1 times faster. It still finished only 30 per cent sooner (4m 42s versus 6m 59s). The time went into retries: ten tool calls instead of three, five failed test writes and dead code the model did not clean up. The GB10’s slower model got it right first time.”
“The cloud model made the same point more sharply. It was fastest, used the fewest tokens and needed no repair pass. Five out of five in 65 seconds. For this workflow, first-pass reliability mattered more than raw generation speed.”
The author’s honest caveat
- This is a 24 GB / Q4_K_M / Codex CLI harness result — not a universal verdict on Gemma 4 on Apple Silicon
- Has not rerun at a higher quant on a roomier Apple Silicon machine; expects that to matter
The hybrid workflow
“I can see how a hybrid approach might be useful.
codex --profile localfor iteration and privacy-sensitive work. Default cloud for anything complex. Codex CLI’s profile system makes switching a single flag.”
Specific tips that will save you time
- Apple Silicon: Ollama not usable with Gemma 4 for this workload. Use llama.cpp with
--jinja. Setweb_search = "disabled". Use-mwith direct GGUF path, not-hf. Context 32,768 (Codex system prompt needs ≥27K). Quantise KV cache with-ctk q8_0 -ctv q8_0. - NVIDIA GB10: Ollama v0.20.5 is the first path that worked reliably.
codex --oss -m gemma4:31b. Tunnel port 11434 via SSH for remote. - Set
stream_idle_timeout_msto at least 1,800,000. A single tool-call cycle took 1m 39s on the Mac. Default timeout kills sessions before the model finishes thinking. - Pin your llama.cpp version. A reported 3.3× speed regression between builds means benchmarks can change overnight.
Configuration baseline (for reproducibility)
- Date run: 2026-04-12
- Codex CLI v0.120.0
- Mac: llama.cpp ggml 0.9.11 (build 8680), gemma-4-26B-A4B-it Q4_K_M
- GB10: Ollama v0.20.5, gemma-4-31B-it Q4_K_M
- Cloud: GPT-5.4 high reasoning effort
Related Playbook pages
- Local Models — Gemma 4 setup reference with the hybrid workflow pattern
- Multi-Model Orchestration — combining local and cloud
- Regulated AI — when local inference is a compliance win