NAC3 v2.3 / Migration paper / Methodology
Paper 02 of 03 · Harness · prompts · taxonomy decisions

How the migration was driven and how to repeat it.

Three components: Yujin Forge as the AI decorator, the contract bridge that makes any host app NAC3-compliant without touching its JS, and Yujin Pilot as the embedded driver. The harness in bench/scripts/calc_bench.mjs is the test rig. The system prompt and the role taxonomy are the only places where opinions matter; everything else is plumbing.

The fixture

Calc-raw is a vanilla calculator whose logic fits in 92 lines of JavaScript: 16 buttons, 4 ops, decimal, clear, backspace, percent, equals. It ships as three files: index.html (51 lines), calc.js (92 lines, function-per-button), calc.css (73 lines). The source has zero NAC3 awareness. Every button uses an inline onclick="digit('7')" handler; calc.js holds the state machine.

This fixture is the worst possible starting point for NAC3 -- no semantic markup, no data attributes, no contract events, no scoping. If Forge can decorate this, it can decorate anything.

Path 01 -- Forge silent

yf migrate <dir> --ai-silent sends the HTML + JS companion + an optional plugin hint to Claude Sonnet 4.6. The system prompt instructs the model to:

Forge then post-processes the returned HTML to: ensure data-nac-plugin on body (idempotent), inject the NAC3 runtime <script> in <head>, emit nac-bridge.js next to index.html, and inject inline NAC.register(manifest) before </body>. With the --with-pilot flag, the Pilot bundle is copied + linked too. After this pass the output is drop-in NAC3-ready: open index.html in any browser, the runtime loads, the manifest registers, and an LLM driver can call NAC.click_by_verb('calc', 'digit_3') against it.
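
The post-processing pass described above can be sketched as plain string rewriting. This is an illustration, not Forge's source: the function name postProcess, the runtime path in RUNTIME_SRC, and the choice to make all three steps idempotent (the text only states this for data-nac-plugin) are assumptions. Emitting nac-bridge.js and copying the Pilot bundle are left out.

```javascript
// Hypothetical sketch of the post-processing pass; not Forge's actual code.
// RUNTIME_SRC is an assumed file name for the NAC3 runtime.
const RUNTIME_SRC = 'nac3-runtime.js';

function postProcess(html, slug, manifest) {
  let out = html;

  // 1. data-nac-plugin on <body>, added only when absent (idempotent).
  if (!/<body[^>]*\sdata-nac-plugin=/.test(out)) {
    out = out.replace(/<body([^>]*)>/, `<body$1 data-nac-plugin="${slug}">`);
  }

  // 2. Runtime <script> in <head>, added only when absent.
  if (!out.includes(`src="${RUNTIME_SRC}"`)) {
    out = out.replace('</head>', () => `<script src="${RUNTIME_SRC}"></script>\n</head>`);
  }

  // 3. Inline NAC.register(manifest) before </body>, added only when absent.
  //    A function replacer avoids "$" in the manifest being read as a pattern.
  if (!out.includes('NAC.register(')) {
    const call = `<script>NAC.register(${JSON.stringify(manifest)});</script>`;
    out = out.replace('</body>', () => `${call}\n</body>`);
  }
  return out;
}
```

Running the function twice on its own output should return the same string, which is the property that makes re-running a migration safe.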

Path 02 -- Forge assisted

Same call as silent, but the system prompt grows a clause: "if more than one verb / role / id is plausible, populate an `ambiguities` array with concrete questions and best-guess option lists." The CLI then prompts the user for each question. For the calculator the model returns ambiguities: [] -- the verb intent is unambiguous because the inline onclick handlers reveal it: onclick="digit('7')" can only mean digit_7.
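
A minimal sketch of how a CLI might consume that array. Only the field name ambiguities comes from the prompt clause; the per-item fields (question, options), the rule that the first option is the model's best guess, and the function name resolveAmbiguities are assumptions made for illustration. A real CLI prompt would likely be asynchronous; the callback is synchronous here to keep the example short.

```javascript
// Hypothetical item shape: { question: string, options: string[] }.
// `ask` is supplied by the caller (for example a terminal prompt).
function resolveAmbiguities(response, ask) {
  const decisions = {};
  for (const item of response.ambiguities ?? []) {
    const picked = ask(item.question, item.options);
    // Assumption: fall back to the first option (the best guess) on an
    // empty or unrecognized answer.
    decisions[item.question] = item.options.includes(picked)
      ? picked
      : item.options[0];
  }
  return decisions;
}

// The calculator case: nothing to ask, nothing to decide.
const calcDecisions = resolveAmbiguities({ ambiguities: [] }, () => '');
```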

Assisted mode earns its keep on apps with semantic overlap -- a CRM modal with both "Save draft" and "Submit", a settings panel where a button could be apply or save. Those are the cases worth a clarification round.

Path 03 -- Sumi manual

Hand-decorated by Sumi (the Claude Code agent that ran this whole experiment) over ~30 min. Differences from Forge silent:

The manual decoration is closer to what a NAC3 spec author would write by reflex. The added structure has accessibility value (i18n labels for screen readers + agent-facing UIs) but no functional value for an LLM picking a verb.

The contract bridge -- making any host script NAC3-aware

NAC3 dispatch (NAC.click_by_verb) waits for the host to emit a nac:action:succeeded event after running its click handler. Vanilla calc.js doesn't emit that -- it just updates the display. Without the bridge, every dispatch times out after 5 s.
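
The wait-then-time-out behavior can be pictured with a small promise wrapper. This is a sketch of the pattern, not the runtime's implementation: the function name clickAndWait and the bus parameter (standing in for wherever the runtime listens, assumed here to be any EventTarget) are hypothetical.

```javascript
// Sketch only: click, then wait for nac:action:succeeded or give up.
function clickAndWait(bus, click, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const onDone = (event) => {
      clearTimeout(timer);
      resolve(event);
    };
    const timer = setTimeout(() => {
      bus.removeEventListener('nac:action:succeeded', onDone);
      reject(new Error(`no nac:action:succeeded within ${timeoutMs} ms`));
    }, timeoutMs);
    bus.addEventListener('nac:action:succeeded', onDone, { once: true });
    click(); // the host handler runs here; an unbridged handler emits nothing
  });
}
```

With an unbridged handler the promise can only reject, which matches the timeout described above.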

Forge's solution is a 50-line companion file nac-bridge.js that runs once at DOMContentLoaded:

// nac-bridge.js (excerpt)
const slug = document.body.getAttribute('data-nac-plugin');
const els = document.querySelectorAll('[data-nac-role="action"][data-nac-id]');
els.forEach(el => {
  const id      = el.getAttribute('data-nac-id');
  const handler = el.onclick;            // preserve inline onclick
  if (typeof handler !== 'function') return;  // no inline handler: nothing to bridge
  el.onclick = null;                     // detach so the handler cannot fire twice
  window.NAC.bindAction(el, handler,     // re-attach via bindAction
                        { plugin: slug, action_id: id });
});

The bridge turns every onclick="digit('7')" into a NAC3 contract-compliant action: when the button fires, the host's handler runs (display updates), and nac:action:succeeded emits. No host code modified. Drop-in.
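
To make the host-side half concrete, here is a hypothetical stand-in for what a bindAction-style wrapper has to do: run the original handler, then announce success. It is not the NAC3 runtime's code. The extra bus parameter and the detail payload are assumptions added so the sketch runs outside a browser.

```javascript
// Hypothetical stand-in for NAC.bindAction, for illustration only.
function bindAction(el, handler, meta, bus) {
  el.onclick = function (event) {
    const result = handler.call(el, event);   // host logic runs unchanged
    const done = new Event('nac:action:succeeded');
    done.detail = meta;                       // assumed payload: { plugin, action_id }
    bus.dispatchEvent(done);                  // emitted only if the handler did not throw
    return result;
  };
}
```

The design point is that the wrapper owns the event and the host handler stays untouched, which is why calc.js needs no edits.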

The Pilot driver

Yujin Pilot is the embedded chat + voice cockpit, shipped as a single-file IIFE bundle (pilot.bundle.js) plus a companion stylesheet (pilot.css). On load it mounts a floating sumi-e branch button at the bottom right. Clicking it opens a chat panel; clicking again switches to a larger pizarra (whiteboard) panel with voice + action trace.

Pilot reads the registered manifest via window.NAC.list_registered_plugins() + window.NAC.manifest(slug). The user types or speaks an intent; Pilot calls the configured LLM with a system prompt that includes the manifest; the LLM returns a JSON action chain; Pilot dispatches each via NAC.click_by_verb(plugin, verb).
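
The loop above, reduced to its skeleton. The three NAC calls are the ones named in the text; everything else is an assumption: the function name runIntent, the injected callLLM function, picking the first registered plugin, and awaiting click_by_verb (harmless if it turns out to be synchronous).

```javascript
// Sketch of manifest -> LLM -> dispatch. `nac` is window.NAC in a browser;
// `callLLM` is a hypothetical function returning the model's raw JSON text.
async function runIntent(nac, callLLM, intent) {
  const [slug] = nac.list_registered_plugins();   // assumption: drive the first plugin
  const manifest = nac.manifest(slug);
  const raw = await callLLM({ manifest, intent });
  const { actions } = JSON.parse(raw);            // the JSON action chain
  const trace = [];
  for (const { verb } of actions) {
    await nac.click_by_verb(slug, verb);          // dispatch in order, one at a time
    trace.push(verb);
  }
  return trace;
}
```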

Configuration lives in localStorage under yujin_pilot.config: per-provider API keys (Anthropic, OpenAI, Google, DeepSeek, Groq), model selector (every model the 600-run benchmark exercised), voice preferences (STT via Web Speech API, TTS via Web Speech / ElevenLabs / Google Cloud TTS), and the on/off toggle. Keys are stored only in the browser; in this client-side mode they go directly to the selected provider's API with each request. For production use the LLM call belongs behind a server proxy; the client-side mode here is for demos + local dev.

Pilot is not Forge. Forge builds the static NAC3 surface (HTML attrs + manifest + bridge). Pilot is the runtime agent that talks to that surface. The two ship separately and can be used independently: an app can be Forge-decorated and driven by a different agent; Pilot can drive any NAC3-compliant app, whether Forge-decorated or hand-decorated.

The harness

bench/scripts/calc_bench.mjs stands up a tiny static server, launches headless Chromium through Playwright, and navigates to each fixture. Once window.NAC.list_registered_plugins().includes('calc') holds, it sends the manifest + task prompt to Claude Sonnet 4.6, parses the returned action chain, and dispatches each action via NAC.click_by_verb. It then reads document.getElementById('display').textContent and asserts against the expected value. There are 15 runs in total (3 fixtures × 5 tasks × 1 iter), written to results/calc_e2e_<ts>.{jsonl,md}.
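
The run matrix and the pass/fail record can be sketched without a browser. This is not the contents of calc_bench.mjs: the fixture names other than calc-forge-silent, all five task prompts, and the record fields are hypothetical placeholders chosen only to reproduce the 3 × 5 × 1 shape.

```javascript
// Illustrative only; fixture names (except calc-forge-silent) and tasks are made up.
const FIXTURES = ['calc-forge-silent', 'calc-forge-assisted', 'calc-manual'];
const TASKS = [
  { prompt: 'Compute 3 + 4', expected: '7' },
  { prompt: 'Compute 9 - 2', expected: '7' },
  { prompt: 'Compute 6 * 7', expected: '42' },
  { prompt: 'Compute 8 / 2', expected: '4' },
  { prompt: 'Type 12 then clear', expected: '0' },
];

// One record per fixture x task x iteration.
function buildRuns(fixtures, tasks, iters = 1) {
  const runs = [];
  for (const fixture of fixtures)
    for (const task of tasks)
      for (let iter = 0; iter < iters; iter++)
        runs.push({ fixture, ...task, iter });
  return runs;
}

// Compare the display text read from the page with the expected value.
function judge(run, displayText) {
  return { ...run, actual: displayText, pass: displayText.trim() === run.expected };
}

// One JSON object per line, as in a .jsonl results file.
function toJsonl(records) {
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}
```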

The harness applies zero monkey-patches to the fixtures. Everything that makes the fixture work -- data-nac-plugin, the runtime, the bridge, the manifest registration -- is baked in by Forge (or by hand for the manual fixture). The harness therefore tests each fixture in production-equivalent shape.

System prompt for the driver

The same driver prompt runs against all 3 fixtures, so the manifest is the only variable:

You drive a Calculator UI via the NAC3 protocol. You will be given a
manifest listing every interactive element (each action has a stable
verb). Your only output is a JSON object:

{ "plan": "<one-line reasoning>",
  "actions": [ { "verb": "<verb>" }, ... ] }

Rules:
- Use ONLY verbs that appear in the manifest's elements[].actions[].verb.
- Each action dispatches NAC.click_by_verb('calc', '<verb>').
- Output actions in execution order.
- No prose, no markdown fences, only the JSON object.
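
The rules in this prompt are mechanical enough to check before dispatching anything. A minimal validator sketch follows, assuming the manifest exposes verbs at elements[].actions[].verb as the prompt states; the function name parseActionChain and the verbs add and equals in the usage example are hypothetical.

```javascript
// Sketch: enforce the prompt's rules on the model's reply before dispatch.
function parseActionChain(raw, manifest) {
  const allowed = new Set(
    manifest.elements.flatMap((el) => (el.actions ?? []).map((a) => a.verb))
  );
  const reply = JSON.parse(raw);   // throws on prose or markdown fences
  if (!Array.isArray(reply.actions)) throw new Error('missing actions array');
  const verbs = reply.actions.map((a) => a.verb);
  const unknown = verbs.filter((v) => !allowed.has(v));
  if (unknown.length > 0) {
    throw new Error(`verbs not in manifest: ${unknown.join(', ')}`);
  }
  return verbs;                    // already in execution order
}

// Usage with a toy manifest (verb names other than digit_3 are assumptions).
const toyManifest = {
  elements: [
    { actions: [{ verb: 'digit_3' }] },
    { actions: [{ verb: 'digit_4' }] },
    { actions: [{ verb: 'add' }] },
    { actions: [{ verb: 'equals' }] },
  ],
};
const chain = parseActionChain(
  '{"plan":"3 + 4 =","actions":[{"verb":"digit_3"},{"verb":"add"},{"verb":"digit_4"},{"verb":"equals"}]}',
  toyManifest
);
```

Rejecting an unknown verb up front turns a hallucinated action into an explicit error instead of a dispatch that silently does nothing.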

Why this fixture is representative

The calculator is mechanically simple but covers the failure modes that matter:

If your real app is bigger and more semantically complex, the manifest grows but the methodology is the same. The 600-run benchmark uses an invoice editor (a richer fixture, same approach) and reports a 15% silent-damage rate on raw DOM versus 0% on NAC3. The protocol contract behind that result is the same one that lets the calculator be dispatched correctly here.

Reproducing this

# 1. Decorate (Forge silent path)
yf migrate ./calc-raw --ai-silent --with-pilot \
  --ai-out ./calc-forge-silent

# 2. Open in browser, or run the harness
cd /path/to/nac3-bench/bench
node scripts/calc_bench.mjs

# 3. Drive by hand: click the floating sumi-e button,
#    paste an Anthropic key, ask "calcula 3 mas 4" (Spanish for "calculate 3 plus 4").
xdg-open ./calc-forge-silent/index.html

Cost to reproduce: about $0.05 for the decoration plus $0.03 for 5 driver dispatches, roughly $0.08 in total. The driver model can be swapped via the Pilot config to anything in the 600-run benchmark's roster.