The fixture
Calc-raw is a 92-line vanilla calculator: 16 buttons, 4 ops, decimal, clear, backspace, percent, equals. Three files: index.html (51 lines), calc.js (92 lines, function-per-button), calc.css (73 lines). Zero NAC3 awareness in the source. Every button uses an inline onclick="digit('7')" handler; calc.js holds the state machine.
This fixture is the worst possible starting point for NAC3 -- no semantic markup, no data attributes, no contract events, no scoping. If Forge can decorate this, it can decorate anything.
Path 01 -- Forge silent
yf migrate <dir> --ai-silent sends the HTML + JS companion + an optional plugin hint to Claude Sonnet 4.6. The system prompt instructs the model to:
- Decorate every interactive element (buttons, inputs, selects) with
data-nac-id,data-nac-role, and (for actions)data-nac-action. - Set
data-nac-plugin="<slug>"on<body>. - Use a snake_case dotted id pattern
<slug>.<role>.<thing>. - Prefer short imperative verbs (
add,subtract) over redundant prefixes (set_op_add). - Classify value-holding elements as
role=field(neverregion). - Output a single JSON object with
decorated_html+plugin_slug+manifest_json+ambiguities: [].
Forge then post-processes the returned HTML to: ensure data-nac-plugin on body (idempotent), inject the NAC3 runtime <script> in <head>, emit nac-bridge.js next to index.html, and inject inline NAC.register(manifest) before </body>. With the --with-pilot flag, the Pilot bundle is copied + linked too. After this pass the output is drop-in NAC3-ready: open index.html in any browser, the runtime loads, the manifest registers, and an LLM driver can call NAC.click_by_verb('calc', 'digit_3') against it.
Path 02 -- Forge assisted
Same call as silent, but the system prompt grows a clause: "if more than one verb / role / id is plausible, populate an `ambiguities` array with concrete questions and best-guess option lists." The CLI then prompts the user for each question. For the calculator the model returns ambiguities: [] -- the verb intent is unambiguous because the inline onclick handlers reveal it: onclick="digit('7')" can only mean digit_7.
Assisted mode earns its keep on apps with semantic overlap -- a CRM modal with both "Save draft" and "Submit", a settings panel where a button could be apply or save. Those are the cases worth a clarification round.
Path 03 -- Sumi manual
Hand-decorated by Sumi (the Claude Code agent that ran this whole experiment) over ~30 min. Differences from Forge silent:
- Adds an explicit
calc.region.keypadwrapper around the button grid. - Carries
label_i18n: { es, en }on every element including fields and regions (Forge only attaches it to action elements). - Action ids carry an
op_prefix for operator buttons (calc.action.op_addvs Forge'scalc.action.add). The verb names are identical; only the id structure differs. - Whitespace-formatted across multiple lines (file size is ~+15% in bytes, zero functional difference).
The manual decoration is closer to what a NAC3 spec author would write by reflex. The added structure has accessibility value (i18n labels for screen readers + agent-facing UIs) but no functional value for an LLM picking a verb.
The contract bridge -- making any host script NAC3-aware
NAC3 dispatch (NAC.click_by_verb) waits for the host to emit a nac:action:succeeded event after running its click handler. Vanilla calc.js doesn't emit that -- it just updates the display. Without the bridge, every dispatch times out after 5 s.
Forge's solution is a 50-line companion file nac-bridge.js that runs once at DOMContentLoaded:
// nac-bridge.js (excerpt) const slug = document.body.getAttribute('data-nac-plugin'); const els = document.querySelectorAll('[data-nac-role="action"][data-nac-id]'); els.forEach(el => { const id = el.getAttribute('data-nac-id'); const handler = el.onclick; // preserve inline onclick el.onclick = null; window.NAC.bindAction(el, handler, // re-attach via bindAction { plugin: slug, action_id: id }); });
The bridge turns every onclick="digit('7')" into a NAC3 contract-compliant action: when the button fires, the host's handler runs (display updates), and nac:action:succeeded emits. No host code modified. Drop-in.
The Pilot driver
Yujin Pilot is the embedded chat + voice cockpit. A single-file IIFE bundle (pilot.bundle.js) + a companion stylesheet (pilot.css). On load it mounts a floating sumi-e branch button bottom-right. Clicking it opens a chat panel; clicking again toggles a larger pizarra panel with voice + action trace.
Pilot reads the registered manifest via window.NAC.list_registered_plugins() + window.NAC.manifest(slug). The user types or speaks an intent; Pilot calls the configured LLM with a system prompt that includes the manifest; the LLM returns a JSON action chain; Pilot dispatches each via NAC.click_by_verb(plugin, verb).
Configuration lives in localStorage under yujin_pilot.config: per-provider API keys (Anthropic, OpenAI, Google, DeepSeek, Groq), model selector (every model the 600-run benchmark exercised), voice preferences (STT via Web Speech API, TTS via Web Speech / ElevenLabs / Google Cloud TTS), and the on/off toggle. Keys never leave the browser. For production use the LLM call belongs behind a server proxy; the client-side mode here is for demos + local dev.
The harness
bench/scripts/calc_bench.mjs stands up a tiny static server, launches Playwright Chromium headless, navigates to each fixture, waits for window.NAC.list_registered_plugins().includes('calc'), sends the manifest + task prompt to Claude Sonnet 4.6, parses the action chain, dispatches via NAC.click_by_verb, reads document.getElementById('display').textContent, and asserts against the expected value. 15 runs total (3 fixtures × 5 tasks × 1 iter). Writes results/calc_e2e_<ts>.{jsonl,md}.
The harness applies zero monkey-patches to the fixtures. Everything that makes the fixture work -- data-nac-plugin, the runtime, the bridge, the manifest register -- is baked in by Forge (or by hand for the manual fixture). What the harness tests is the fixture in production-equivalent shape.
System prompt for the driver
The same driver prompt runs against all 3 fixtures, so the manifest is the only variable:
You drive a Calculator UI via the NAC3 protocol. You will be given a
manifest listing every interactive element (each action has a stable
verb). Your only output is a JSON object:
{ "plan": "<one-line reasoning>",
"actions": [ { "verb": "<verb>" }, ... ] }
Rules:
- Use ONLY verbs that appear in the manifest's elements[].actions[].verb.
- Each action dispatches NAC.click_by_verb('calc', '<verb>').
- Output actions in execution order.
- No prose, no markdown fences, only the JSON object.
Why this fixture is representative
The calculator is mechanically simple but covers the failure modes that matter:
- Verb naming. Should the multiply button be
multiply,set_op_multiply, orop_multiply? The model is making a choice on every decoration. - Role taxonomy. Is the display a field or a region? The wrong call breaks no functional test but degrades downstream semantics.
- State machine. The driver has to handle compound utterances ("clear then 2+2+2") and emit the verbs in the right order. Easy to test, hard to fake.
- Contract events. Calc.js doesn't emit any. The bridge has to bake them in. Without that, every dispatch times out and the whole experiment looks broken.
If your real app is bigger and more semantically complex, the manifest grows but the methodology is the same. The 600-run benchmark uses an invoice editor (richer fixture, same approach) and reports 15% silent-damage rate on raw DOM vs 0% on NAC3 -- the same protocol contract that lets the calc be dispatched correctly here.
Reproducing this
# 1. Decorate (Forge silent path) yf migrate ./calc-raw --ai-silent --with-pilot \ --ai-out ./calc-forge-silent # 2. Open in browser, or run the harness cd /path/to/nac3-bench/bench node scripts/calc_bench.mjs # 3. Drive by hand: click the floating sumi-e button, # paste an Anthropic key, ask "calcula 3 mas 4". xdg-open ./calc-forge-silent/index.html
Total cost to reproduce: $0.05 for the decoration + $0.03 for 5 driver dispatches. ~$0.08 total. Driver model can be swapped via the Pilot config to anything in the 600-run benchmark's roster.