Empirical benchmark · 600 runs · 5 models · 3 protocols

NAC3 vs MCP vs Raw DOM
at production scale.

Five LLMs, three protocols, four tasks, ten iterations each — the largest public comparison of agentic UI protocols to date. Zero phantom-selector silent damage under NAC3, 15% under Raw DOM, 3.5× fewer round-trips than MCP. Full reproducibility: every JSONL row carries runtime version and manifest checksum, the entire test bundle is downloadable from this repo, and the cost of re-running the full benchmark is under US$15.

5 modelsSonnet 4.6 · Haiku 4.5 · DeepSeek · Gemini 3.1 Flash Lite · GPT-4o-mini

3 protocolsNAC3 · MCP · Raw DOM

4 tasksatomic · exploratory · compound · async

600 runs10 iter × 4 tasks × 3 conds × 5 models

Read the results → How it was built Re-run the benchmark

NAC3 success

95.5%

191/200 runs · vs 91% Raw · vs 78.5% MCP

Silent damage

0/200

NAC3 · vs 30/200 in Raw DOM (15%)

Round-trips

1.0

NAC3 · vs 3.5 for MCP

Cheap-model unlock

100%

Gemini Flash Lite + GPT-4o-mini under NAC3

§ 01

Two documents,
one dataset.

The benchmark output is split into two artifacts. Results is the public report — what we measured and what it means. Methodology is the audit trail — how the numbers were actually made, what fixes were applied along the way, and what the dataset deliberately does not say. Both share the same 600-row JSONL dataset.

§ Findings

Benchmark results.

The public report. Hero numbers, hallucination breakdown by category, success per (model × condition × task), cost-latency trade-off, failure modes, model fit. Eight sections covering what 600 runs revealed about agentic UI protocols in production.

Open results →

§ How it was built

Benchmark methodology.

The audit document. Setup & scale (600/600 verified), the five fixes applied iteratively during development, anti-bias controls (one prompt, one fixture, ten iterations per cell), traceability (runtime version + manifest checksum per row), and explicit limitations.

Open methodology →

Runtime provenance. All 600 runs executed against @nac3/runtime at commit-equivalent of v2.3.0 (the npm constant exposes 2.2.1, kept for backwards compat — see methodology §07 for the equivalence). Manifest checksum 67db3eef3658 across all rows. The benchmark dataset does not test the v2.4 RFCs (snapshot versioning and authority) — those have their own benchmark, separately, when the runtime ships them.

§ 02

Re-run the entire benchmark
in under two hours.

The complete test harness — fixture app, 4 task definitions, 5 model bindings, 3 condition runners, scoring, and reporting scripts — is bundled into a single tarball downloadable from this directory. Cost to re-run the full 600 runs: roughly US$15 in API credits. Wall-clock: 1h 23min on commodity hardware. Bring your own API keys.

Re-running the benchmark

Bring API keys for the five providers in your .env: Anthropic, OpenAI, DeepSeek, Google Gemini. The harness reads them at boot. No credentials ship with the bundle.

Download the bundle.
Download nac3-benchmark-v2.3.tar.gz → or clone the source tree at benchmark/bundle/ in this repo.
Extract and install. tar -xzf nac3-benchmark-v2.3.tar.gz cd nac3-benchmark-v2.3 npm install npx playwright install chromium
Configure API keys — replace the fictional placeholder values with your real keys. cp .env.example .env # The .env.example ships with fictional placeholders like # ANTHROPIC_API_KEY=sk-ant-api03-FAKE_az302_…_replace_with_real_xxxx # Open .env in your editor and REPLACE every "FAKE_..." value with # the actual API key from your provider dashboard: # ANTHROPIC_API_KEY=<your real key from console.anthropic.com> # OPENAI_API_KEY=<your real key from platform.openai.com> # DEEPSEEK_API_KEY=<your real key from platform.deepseek.com> # GEMINI_API_KEY=<your real key from aistudio.google.com> # The harness exits with an auth error if it sees a placeholder value.
Start the fixture server. node fixtures/invoice-app/server.ts & # serves the test fixture at http://127.0.0.1:8080
Run the full 600-cell benchmark. bash scripts/run_final_600.sh # 5 models × 3 conds × 4 tasks × 10 iter # ~1h 23min, ~US$15
Generate the unified report. RESULTS_DIR=results/N10_final \ node scripts/unified_report.mjs \ --since=2026-05-19T23 > docs/MY_REPORT.md
Subset the run. Skip the full sweep and target specific cells: MODEL=gemini-3.1-flash-lite \ PILOT_CONDITIONS=nac3 \ PILOT_TASK_IDS=T1_add_monitor \ PILOT_ITER=3 \ PILOT_OUT_DIR=results/quick \ npx tsx src/harness_3way.ts
Verify provenance. Every row in the resulting JSONL carries runtime_version, bench_version, and manifest_checksum. If your manifest_checksum differs from 67db3eef3658, the fixture diverged from the published baseline.

§ 03

What the dataset does not say.

A benchmark is a measurement, not a verdict. NAC3 caught zero silent damage in this dataset — that is not a guarantee for arbitrary production deployments. A model can still emit a semantically wrong plan with all-valid nac_ids; the protocol only blocks structural hallucinations (phantom selectors). The full list of intentional limitations — one fixture, one language, five models, four task shapes — is in methodology §08.

Two documents,one dataset.