Five LLMs, three protocols, four tasks, ten iterations each — the largest public comparison of agentic UI protocols to date. Zero phantom-selector silent damage under NAC3, 15% under Raw DOM, 3.5× fewer round-trips than MCP. Full reproducibility: every JSONL row carries runtime version and manifest checksum, the entire test bundle is downloadable from this repo, and the cost of re-running the full benchmark is under US$15.
The benchmark output is split into two artifacts. Results is the public report — what we measured and what it means. Methodology is the audit trail — how the numbers were actually made, what fixes were applied along the way, and what the dataset deliberately does not say. Both share the same 600-row JSONL dataset.
The public report. Hero numbers, hallucination breakdown by category, success per (model × condition × task), cost-latency trade-off, failure modes, model fit. Eight sections covering what 600 runs revealed about agentic UI protocols in production.
Open results →The audit document. Setup & scale (600/600 verified), the five fixes applied iteratively during development, anti-bias controls (one prompt, one fixture, ten iterations per cell), traceability (runtime version + manifest checksum per row), and explicit limitations.
Open methodology →@nac3/runtime at commit-equivalent of v2.3.0 (the npm constant exposes 2.2.1, kept for backwards compat — see methodology §07 for the equivalence). Manifest checksum 67db3eef3658 across all rows. The benchmark dataset does not test the v2.4 RFCs (snapshot versioning and authority) — those have their own benchmark, separately, when the runtime ships them.
The complete test harness — fixture app, 4 task definitions, 5 model bindings, 3 condition runners, scoring, and reporting scripts — is bundled into a single tarball downloadable from this directory. Cost to re-run the full 600 runs: roughly US$15 in API credits. Wall-clock: 1h 23min on commodity hardware. Bring your own API keys.
Bring API keys for the five providers in your .env: Anthropic, OpenAI, DeepSeek, Google Gemini. The harness reads them at boot. No credentials ship with the bundle.
benchmark/bundle/ in this repo.
runtime_version, bench_version, and manifest_checksum. If your manifest_checksum differs from 67db3eef3658, the fixture diverged from the published baseline.
A benchmark is a measurement, not a verdict. NAC3 caught zero silent damage in this dataset — that is not a guarantee for arbitrary production deployments. A model can still emit a semantically wrong plan with all-valid nac_ids; the protocol only blocks structural hallucinations (phantom selectors). The full list of intentional limitations — one fixture, one language, five models, four task shapes — is in methodology §08.