NAC3 v2.3 / Migration paper / Results
Paper 01 of 03 · 1 calculator · 3 decorations · 15 driver runs

Cost, latency, success rate after migration to NAC3.

How much does it cost to decorate a vanilla web app with NAC3? How does the manifest shape affect the cost of every subsequent dispatch? Three migration paths against one calculator. Driver: Claude Sonnet 4.6. Tasks: 5 calc operations. Iterations: 1 per (fixture, task). All numbers come from the harness at bench/scripts/calc_bench.mjs.

The three paths

| Path | Effort | One-shot LLM cost (tokens in / out) | Manifest elements |
|---|---|---|---|
| Forge silent (yf migrate --ai-silent) | 23 s wall, no human input | $0.0468 · 2224 in / 2672 out | 23 (4 region/field + 19 actions) |
| Forge assisted (yf migrate --ai-assisted) | 25 s wall, model self-reports no ambiguities | $0.0473 · 2379 in / 2677 out | 23 (identical to silent for this app) |
| Sumi manual (hand-decorated reference) | ~30 min | -- | 23 (2 region + 2 field + 19 actions) |

Why assisted == silent here. Assisted mode lets the model raise ambiguity questions ("is this 'save' or 'submit'?") before producing the final decoration. The calculator's button labels and inline onclick handlers leave no ambiguity, so the model returns the same JSON as silent. Assisted's value shows up on apps with overlapping action semantics (CRM forms with two save-like buttons, multi-tab dashboards, etc.).
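
As a sanity check on the one-shot costs above, the sketch below recomputes them from the reported token counts. The per-token prices ($3 per million input tokens, $15 per million output tokens) are an assumption, not something the harness output states; they are used here only because they reproduce both reported figures.

```python
# Assumed prices in USD per million tokens (hypothetical; not stated in the source).
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00


def one_shot_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost in USD of a single LLM call under the assumed prices."""
    return (tokens_in * PRICE_IN_PER_MTOK + tokens_out * PRICE_OUT_PER_MTOK) / 1_000_000


# Token counts copied from the table of paths: (tokens in, tokens out).
decoration_calls = {
    "forge-silent": (2224, 2672),
    "forge-assisted": (2379, 2677),
}

decoration_costs = {
    name: round(one_shot_cost(*tokens), 4) for name, tokens in decoration_calls.items()
}

for name, cost in decoration_costs.items():
    print(f"{name}: ${cost:.4f}")
```

Under these assumed prices, output tokens account for most of the decoration cost, which is why the two Forge modes land within a fraction of a cent of each other.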

Functional success

All three decorations drive the calc correctly on every task in the suite.

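| Task | Forge silent | Forge assisted | Sumi manual | Expected |
|---|---|---|---|---|
| 3 + 4 = | 7 OK | 7 OK | 7 OK | 7 |
| 12 + 15 = | 27 OK | 27 OK | 27 OK | 27 |
| 7 x 8 = | 56 OK | 56 OK | 56 OK | 56 |
| 100 / 4 = | 25 OK | 25 OK | 25 OK | 25 |
| clear + 2+2+2 = | 6 OK | 6 OK | 6 OK | 6 |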

Cost per dispatch

The same driver (Sonnet 4.6) runs each of the 5 tasks against each fixture. Tokens and latency are mean per task; cost is the sum over 5 tasks.

| Fixture | Mean tokens in | Mean tokens out | Mean LLM latency | Total cost (5 tasks) |
|---|---|---|---|---|
| forge-silent | 1153 | 114 | 2510 ms | $0.02575 |
| forge-assisted | 1153 | 114 | 2398 ms | $0.02564 |
| sumi-manual | 1442 | 113 | 2542 ms | $0.03004 |

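The manual decoration costs about 17% more to drive per task than Forge ($0.03004 vs $0.02575 over the 5 tasks), on about 25% more input tokens (1442 vs 1153). The delta is the label_i18n payload Sumi attached to every action, field, and region. That payload is useful for accessibility but neutral for the model picking a verb, because the model already has the verb name.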
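
The overhead figure can be reproduced from the table above. This is a minimal sketch that uses only the reported totals and mean token counts; the helper name `relative_increase` is illustrative and not part of the harness.

```python
def relative_increase(baseline: float, value: float) -> float:
    """Percentage by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100


# Total cost over 5 tasks (USD), from the cost-per-dispatch table.
cost_overhead = relative_increase(0.02575, 0.03004)

# Mean input tokens per task, from the same table.
token_overhead = relative_increase(1153, 1442)

print(f"cost overhead of sumi-manual vs forge-silent: {cost_overhead:.1f}%")
print(f"input-token overhead of sumi-manual vs forge-silent: {token_overhead:.1f}%")
```

The cost overhead is smaller than the input-token overhead because output tokens are nearly identical across fixtures (113 vs 114) and still contribute to the bill.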

Total cost of ownership

One-time decoration plus N dispatches at the per-task rates above. (Decoration cost for manual modeled at $30 = 30 min @ $60/h for a human.)

| Approach | Decoration | Per task | At N = 100 | At N = 10000 |
|---|---|---|---|---|
| Forge silent | $0.0468 | $0.00515 | $0.56 | $51.5 |
| Forge assisted | $0.0473 | $0.00513 | $0.56 | $51.3 |
| Sumi manual | $30.00 (30 min) | $0.00601 | $30.60 | $90.1 |

Forge wins on TCO at both volumes shown. Its decoration cost and its per-task rate are both lower than manual's, so the gap widens as N grows rather than closing. For the long tail (>50k dispatches in one app), a hybrid wins: have Forge produce the manifest, then trim label_i18n off non-action elements in a one-line post-pass.
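
The TCO table follows a linear model: one-time decoration cost plus N dispatches at a flat per-task rate. The sketch below recomputes the table from the reported figures; the function names `tco` and `break_even` are illustrative, not taken from the harness.

```python
# Figures copied from the TCO table: (one-time decoration USD, per-task USD).
APPROACHES = {
    "Forge silent": (0.0468, 0.00515),
    "Forge assisted": (0.0473, 0.00513),
    "Sumi manual": (30.00, 0.00601),
}


def tco(decoration: float, per_task: float, dispatches: int) -> float:
    """Linear cost model: one-time decoration plus a flat rate per dispatch."""
    return decoration + per_task * dispatches


def break_even(cheap_setup, costly_setup):
    """Dispatch count at which the option with the costlier setup catches up.

    Returns None when its per-task rate is not lower, i.e. it never catches up.
    """
    (d_cheap, r_cheap), (d_costly, r_costly) = cheap_setup, costly_setup
    if r_costly >= r_cheap:
        return None
    return (d_costly - d_cheap) / (r_cheap - r_costly)


for name, (decoration, per_task) in APPROACHES.items():
    at_100 = tco(decoration, per_task, 100)
    at_10k = tco(decoration, per_task, 10_000)
    print(f"{name:15s} N=100: ${at_100:.2f}   N=10000: ${at_10k:.1f}")

manual_break_even = break_even(APPROACHES["Forge silent"], APPROACHES["Sumi manual"])
print("manual break-even vs Forge silent:", manual_break_even)
```

With these inputs the break-even is None: the manual path starts about $30 behind and also pays more per dispatch, so it does not catch up at any N under this model.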

Latency

Driver latency (LLM round-trip) averages ~2.5 s across all three fixtures (2398 to 2542 ms). The decoration call itself (one-shot Claude Sonnet 4.6 with the whole HTML + JS) takes 23 to 25 s; it is a one-off per app and ignorable in TCO. In-browser dispatch latency is <50 ms per verb (NAC.click_by_verb is essentially DOM .click() + a contract event).

Conclusions

  1. NAC3 migration is cheap. Forge does a clean job for $0.05 and 25 s. The output is drop-in: same UI, plus a JSON manifest the agent reads.
  2. Silent beats manual on operational cost. The manual reference costs +17% per task (on about +25% input tokens), which is the price of being thorough on label_i18n. If you're driving the app from an LLM thousands of times, drop the i18n off non-action elements.
  3. Assisted only earns its keep on ambiguous apps. For a calculator, the model has nothing to ask. For a CRM with two save-like buttons, the assisted round saves a bad guess.
  4. Functional parity is the headline. Five tasks, three decoration paths, zero failures -- the model picks the right verb chain from any of the three manifest shapes.