NAC3 v2.3 / Conduit case study
2026-05-20
Case study · Conduit (RealWorld) migrated by Forge in 70 seconds

One vanilla React app,
two minutes to NAC3.

We took the community Conduit fork: a Medium clone with 47 React components, an Express + Sequelize backend, and a hand-written Cypress suite of 14 tests. Yujin Forge's new --ai-apply walked the JSX, decorated 34 files with NAC3 attributes, aggregated a 59-element manifest, and wired the SPA's index.html with runtime + bridge + register. Cost: $0.46 in Claude tokens and ~70 seconds of wall clock. Then yf gen-tests emitted 66 e2e cases covering the full app surface, 4.7× the case count of the hand-written Cypress suite.

App migrated: Conduit (RealWorld). 47 React components, 4 routes, full auth + CRUD.
Forge command: yf migrate ./frontend --ai-apply --with-pilot
Cost: $0.46 total (Claude Sonnet 4.6, 47 LLM calls, concurrency 4).
Wall time: ~70 seconds to decorate + 30 seconds to gen tests.

Manifest elements: 59
14 action, 19 navigation, 11 field, 6 tab, 9 region. Semantic IDs + imperative verbs.

E2E cases auto-generated: 66
Across Playwright + Cypress + Vitest + WDIO -- one command, four frameworks.

vs hand-written Cypress: 4.7×
Same fork carries 14 hand-written tests across 3 page flows. Forge covers all 59 elements.

Operational pass-rate: 5 / 11 (hand-written) vs 1 / 66 (Forge)
Hand-written wins on pass-rate today: gen-tests v1 is SPA-naive. See section 04b.

01 / The fixture

Conduit -- the industry's benchmark Medium clone.

Conduit is what the RealWorld project ships as the canonical app for comparing frameworks: a Medium-style blogging site with user auth, articles, comments, follow/favorite, tags, and profile + settings pages. Real surface, real complexity, used to evaluate ~24 frontend frameworks head to head. We picked the community fork that ships both a React/Vite frontend and a hand-written Cypress test suite, so the Forge-vs-manual comparison is apples-to-apples within one codebase.

| Property | Value |
|---|---|
| Frontend | React 18 + Vite + SWC. 47 .jsx files in src/. |
| Backend | Express + Sequelize + PostgreSQL. (We swapped to SQLite via the Sequelize dialect knob for zero-setup reproducibility.) |
| Existing tests | 14 Cypress cases across 3 page flows (editor / login / register). |
| License | MIT. Thinkster / RealWorld attribution preserved in the fork. |

02 / Migration phase VERIFIED

Forge --ai-apply scans 47 files and decorates 34 in 70 seconds.

--ai-apply is the JSX counterpart to --ai-silent. It walks every .jsx / .tsx under <repo>/src, sends each file to Claude Sonnet 4.6 in parallel (concurrency 4 by default, max 8), and receives back the decorated JSX plus a manifest fragment listing what was decorated. Forge then aggregates the fragments into a single manifest.json at the project root.
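
The fan-out and aggregation described here can be sketched as a small concurrency-limited pool. This is a minimal illustration under assumptions, not Forge's implementation: the `Fragment` and `ManifestElement` shapes and both function names are hypothetical, and the `worker` callback stands in for the per-file LLM call.

```typescript
// Hypothetical shapes; Forge's real fragment format is not shown in this case study.
interface ManifestElement { id: string; role: string; verb: string }
interface Fragment { file: string; elements: ManifestElement[] }

// Run `worker` over `items` with at most `limit` calls in flight,
// returning results in the same order as the inputs.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}

// Merge per-file fragments into one flat element list (the manifest body).
function aggregate(fragments: Fragment[]): ManifestElement[] {
  const all: ManifestElement[] = [];
  for (const fragment of fragments) all.push(...fragment.elements);
  return all;
}
```

With `limit` set to 4, four files are in flight at any moment and a lane picks up the next file as soon as its current call resolves, which is why total LLM time divided by concurrency approximates wall time.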

# From a fresh clone of the Conduit fork
yf migrate ./conduit-forge-ai \
  --ai-apply \
  --ai-plugin-slug conduit \
  --subdir src \
  --ai-out ./conduit-forge-ai \
  --ai-concurrency 4

# Output:
AI apply complete.
  Files scanned:   47
  Files decorated: 34
  Manifest elements: 59
  Plugin slug:     conduit
  Tokens in/out:   43428 / 22192
  Total cost:      $0.4632
  Total LLM time:  262s wall (across 47 files, divided by concurrency = ~70s real)
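
The totals in the --ai-apply output above can be sanity-checked by hand. The sketch below assumes prices of $3 per million input tokens and $15 per million output tokens; those prices are an assumption, not something the log states, although they do reproduce the reported total. The wall-clock figure assumes the LLM calls spread perfectly across the lanes.

```typescript
// Assumed prices in USD per million tokens; not taken from the Forge log.
const USD_PER_M_INPUT = 3;
const USD_PER_M_OUTPUT = 15;

// Token cost for one run, in USD.
function estimateCostUsd(tokensIn: number, tokensOut: number): number {
  return (tokensIn * USD_PER_M_INPUT + tokensOut * USD_PER_M_OUTPUT) / 1e6;
}

// Lower bound on wall time if the calls were balanced perfectly over the lanes.
function idealWallSeconds(totalLlmSeconds: number, concurrency: number): number {
  return totalLlmSeconds / concurrency;
}

const cost = estimateCostUsd(43428, 22192);
const wall = idealWallSeconds(262, 4);
console.log(cost.toFixed(4)); // 0.4632
console.log(wall.toFixed(1)); // 65.5
```

The ideal 65.5 s sits just under the ~70 s reported, which is consistent with uneven per-file latency keeping some lanes idle near the end of the run.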

Per-role breakdown of the 59-element manifest:

| Role | Count | Example IDs |
|---|---|---|
| action | 14 | conduit.action.publish, conduit.action.follow, conduit.action.favorite |
| navigation | 19 | conduit.navigation.login, conduit.navigation.author_avatar |
| field | 11 | conduit.field.title, conduit.field.body, conduit.field.comment_body |
| tab | 6 | conduit.tab.your_feed, conduit.tab.global_feed, conduit.tab.tag_feed |
| region | 9 | conduit.region.popular_tags, conduit.region.feed_toggler |

Quality assessment: IDs are semantic (not ordinal like btn_3), verbs are imperative (publish, follow, favorite, delete), roles match content (display/history-style fields as field, navigation links as navigation, feed switchers as tab).

03 / Test-corpus generation VERIFIED

One command, four frameworks, 66 test cases.

yf gen-tests takes the Forge-aggregated manifest and emits an e2e suite in whichever test framework the user's project carries. Multi-framework on purpose: most adoptions already have Playwright or Cypress or Vitest installed -- Forge does not mandate a choice.

yf gen-tests ./conduit-forge-ai --framework all --base-url "http://localhost:3000"

# Output:
Manifest elements:  59
Test cases:         66                          # some elements have multiple verbs
Frameworks:         playwright, cypress, vitest, wdio

playwright: + tests/forge-e2e/conduit.spec.ts
cypress:    + cypress/e2e/forge/conduit.cy.js
vitest:     + tests/forge-e2e/conduit.test.ts
wdio:       + test/specs/forge-e2e/conduit.test.ts

Coverage: 59 / 59 manifest elements exercised.

The coverage metric is "agent-relevant surface coverage" -- one test case per (nac_id, role, verb) tuple in the manifest. This is the honest claim; we do not report this as line-coverage (which the manifest alone cannot guarantee).
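
To make the metric concrete, here is a minimal sketch of enumerating one case per (nac_id, role, verb) tuple and counting surface coverage. The manifest shape (a `verbs` array per element) and the `unfavorite` verb are assumptions for illustration; the real manifest schema is not reproduced in this case study.

```typescript
// Hypothetical manifest shape; field names are assumptions.
interface ManifestElement {
  nac_id: string;
  role: "action" | "navigation" | "field" | "tab" | "region";
  verbs: string[];
}

interface TestCase { nac_id: string; role: string; verb: string }

// One case per (nac_id, role, verb) tuple.
function enumerateCases(manifest: ManifestElement[]): TestCase[] {
  const cases: TestCase[] = [];
  for (const el of manifest) {
    for (const verb of el.verbs) {
      cases.push({ nac_id: el.nac_id, role: el.role, verb });
    }
  }
  return cases;
}

// Surface coverage: elements with at least one case, over all elements.
function surfaceCoverage(manifest: ManifestElement[], cases: TestCase[]): string {
  const covered = new Set(cases.map((c) => c.nac_id));
  const hit = manifest.filter((el) => covered.has(el.nac_id)).length;
  return `${hit} / ${manifest.length}`;
}

const sample: ManifestElement[] = [
  { nac_id: "conduit.action.favorite", role: "action", verbs: ["favorite", "unfavorite"] },
  { nac_id: "conduit.field.title", role: "field", verbs: ["fill"] },
];
const sampleCases = enumerateCases(sample);
console.log(sampleCases.length);                   // 3
console.log(surfaceCoverage(sample, sampleCases)); // 2 / 2
```

An element with two verbs yields two cases, which is how 59 elements can produce 66 cases while coverage still reads 59 / 59.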

| Framework | File emitted | Pattern |
|---|---|---|
| Playwright | tests/forge-e2e/conduit.spec.ts | One test(...) per case, navigates + waits for NAC.register, then dispatches via NAC.click_by_verb |
| Cypress | cypress/e2e/forge/conduit.cy.js | One it(...) per case inside a top describe, uses cy.window().then((win) => win.NAC.click_by_verb(...)) |
| Vitest + jsdom | tests/forge-e2e/conduit.test.ts | Structural tests: manifest plugin_slug matches, every element id is present in the loaded HTML's data-nac-id attributes |
| WebdriverIO | test/specs/forge-e2e/conduit.test.ts | Async-await pattern with browser.execute for NAC calls |

04 / Comparison vs hand-written Cypress VERIFIED (structural)

Forge: 4.7× more cases, full surface.

The fork's testAutomation/ directory carries 14 hand-written Cypress cases across 3 .cy.js files. Forge's emitter produces 66 cases in one .cy.js. The comparison is structural here -- counting cases + verifying surface coverage -- because operational pass-rate validation requires running both suites against a live backend.

| | Forge auto (Cypress emitter) | Hand-written Cypress (in same fork) |
|---|---|---|
| Test cases | 66 | 14 |
| Surface coverage | 59 / 59 manifest elements (100%) | 3 page flows (editor / login / register) |
| Time to author | ~30 seconds (one command) | Person-days (assumed; not measured for this fork) |
| Brittleness | Selectors are data-nac-id attrs -- stable across CSS/HTML refactors | Mostly text/class selectors -- break when designers move things |
| Maintenance after refactor | Re-run yf gen-tests | Manual update of every broken selector |

04b / Operational run -- HONEST VERIFIED

Quantity is not quality. SPA-naive gen-tests need flow awareness.

We stood up the Conduit backend (SQLite via Sequelize dialect swap), brought up the migrated Vite frontend, and ran both Cypress suites end-to-end. The structural delta (66 vs 14) holds; the operational delta tells a different story, which we publish unsoftened.

| Suite | Total | Passing | Pass rate | Why |
|---|---|---|---|---|
| Hand-written Cypress (3 page flows) | 11 (1 pending) | 5 | 45% | Tests are flow-aware (navigate to /editor before testing editor). One test had a pre-existing cy.type(undefined) bug from a faker reuse issue. The remaining 5 failures are likely env-specific (SQLite vs the prod DB the suite was tuned against). |
| Forge auto Cypress (SPA-naive, v1) | 66 | 1 | 1.5% | Each test calls cy.visit(BASE_URL) (home page) then dispatches one verb. For an SPA where 80% of action elements only mount under specific routes (editor / login / register / profile), most dispatches don't find the element. This is a real gap in gen-tests v1. |

The honest read. Forge gen-tests v1 covers every manifest element with at least one test. That's structurally complete but operationally naive for SPAs: it doesn't know that conduit.action.publish only exists under /editor, or that conduit.action.follow requires being on a profile page. v2 of gen-tests needs route-awareness: per-element navigation hints derived from React Router config or from a Pilot dispatch simulation. Until then the suite is best read as a checklist of decorated elements, not as a runnable acceptance suite.

Two structural conclusions hold despite the operational gap:

  1. Forge's 59-element manifest is correct: NAC3 attached, plugin registered, every decorated element discoverable via NAC.click_by_verb when its containing route is mounted.
  2. The SPA-aware nac-bridge (MutationObserver) works: late-mounted action/navigation/tab elements get bound automatically as routes change.
04c / Third honest finding -- component reuse VERIFIED

One JSX component, many DOM instances, same nac_id.

We ran a Pilot-style operational demo: tell Sonnet 4.6 "log in as sumi@yujin.app", give it the live manifest, let it pick verbs turn by turn, and dispatch via NAC.click_by_verb. Every dispatch returned OK (the bridge fires the contract event), but the URL never changed. The model picked conduit.navigation.nav_item with the verb navigate six turns in a row; each click hit the same DOM element (the first NavItem), which happened to be the Home link, not the SignIn link.

Diagnosis: Conduit's Navbar uses a <NavItem> component reused for Home, SignIn, SignUp, Settings, Profile. --ai-apply sees the component once, decorates the JSX with data-nac-id="conduit.navigation.nav_item", and that single ID propagates to every rendered instance. The manifest deduplicates by ID, so the 5+ DOM instances collapse into 1 manifest entry. NAC.click_by_verb then matches the first DOM occurrence -- always the same one, regardless of which the user actually wanted.
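
The collapse is easy to reproduce in miniature. The sketch below is a DOM-free model, not the NAC3 runtime: `RenderedNode` is a hypothetical record for one rendered element, and `firstMatch` stands in for the first-occurrence lookup described above.

```typescript
// Hypothetical shape: one record per rendered DOM node carrying a data-nac-id.
interface RenderedNode { nacId: string; label: string }

// Navbar as rendered: five instances of one <NavItem>, all sharing the same ID.
const navbar: RenderedNode[] = ["Home", "SignIn", "SignUp", "Settings", "Profile"]
  .map((label) => ({ nacId: "conduit.navigation.nav_item", label }));

// Deduplicating by ID keeps one entry per ID and counts the nodes behind it.
function dedupeById(nodes: RenderedNode[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const node of nodes) {
    counts.set(node.nacId, (counts.get(node.nacId) ?? 0) + 1);
  }
  return counts;
}

// First-occurrence lookup: whichever node comes first in document order wins.
function firstMatch(nodes: RenderedNode[], nacId: string): RenderedNode | undefined {
  return nodes.find((node) => node.nacId === nacId);
}

// IDs backed by more than one DOM instance; candidates for a duplicate-id warning.
function duplicateIds(nodes: RenderedNode[]): string[] {
  const dupes: string[] = [];
  dedupeById(nodes).forEach((count, id) => {
    if (count > 1) dupes.push(id);
  });
  return dupes;
}

console.log(dedupeById(navbar).size);                                // 1
console.log(firstMatch(navbar, "conduit.navigation.nav_item")?.label); // Home
console.log(duplicateIds(navbar));                                   // [ 'conduit.navigation.nav_item' ]
```

Five rendered links become one manifest entry, and every lookup resolves to Home, which matches the six-turn loop seen in the demo.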

| Finding | What to do |
|---|---|
| Component reuse generates duplicate nac_ids | Per-component-instance suffixing (e.g. conduit.navigation.nav_item[home], nav_item[sign_in]) derived from the prop / children text. Next iteration of --ai-apply. |
| Manifest dedup hides this from the agent | Surface duplicate-id warnings during apply. Optionally: emit one manifest entry per occurrence rather than per declaration. |
| Model needs disambiguation context | System prompt extension: when multiple manifest entries share a verb, ask Pilot to use additional payload like child-text or aria-label. |

Why this matters. This is the most actionable finding in the case study. It's a clean architectural gap with three independent fixes, all small. Once landed, the Pilot demo above goes from "every click hits the first NavItem" to "Sonnet picks the SignIn link, clicks it, fills email + password, submits". That's the unlock for true SPA operation.
04d / Update -- H2 fix landed VERIFIED

Per-instance disambiguation shipped + verified.

This closes finding 04c at the architectural level. --ai-apply now infers per-instance qualifiers from JSX props (label, name, to, slug, etc.), bakes them into data-nac-id as a __qualifier suffix, and emits one manifest entry per known instance (or an instance_pattern for runtime-dynamic qualifiers). The NAC3 runtime's click_by_verb now accepts { qualifier } and { qualifier_ordinal } in opts for explicit instance targeting.

| Layer | What changed |
|---|---|
| Forge --ai-apply prompt | STATIC / DYNAMIC / UNINFERRABLE doctrine. Per-instance suffix baked into JSX. Warning emitted when reusable component cannot be qualified. |
| Manifest schema | New optional fields: instance_pattern, qualifier_source, warning. |
| NAC3 runtime (nac.browser.js) | click_by_verb(plugin, verb, { qualifier, qualifier_ordinal }). Backward compatible (no qualifier = old behavior). |
| Pilot system prompt | Doctrine explained: when entries share verb but differ in suffix, pass qualifier in payload. |
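
A rough model of the qualifier mechanics, for orientation only. Everything here is a sketch under assumptions: `slugify`, `qualifyId`, and `pickTarget` are hypothetical helpers, the prop priority order is a guess, and `qualifier_ordinal` is treated as a zero-based index, which the source does not specify.

```typescript
// Prop names checked in order; the names come from the text above, the order is an assumption.
const QUALIFIER_PROPS = ["label", "name", "to", "slug"] as const;

// Normalise a prop value into a suffix-safe token: "/login" -> "login", "Sign in" -> "sign_in".
function slugify(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

// Returns baseId + "__qualifier", or the bare baseId when no static string prop qualifies it.
function qualifyId(baseId: string, props: Record<string, unknown>): string {
  for (const key of QUALIFIER_PROPS) {
    const value = props[key];
    if (typeof value === "string" && slugify(value) !== "") {
      return `${baseId}__${slugify(value)}`;
    }
  }
  return baseId;
}

// Choose among candidate IDs: explicit qualifier first, then ordinal,
// then the first candidate (the pre-H2 behavior).
function pickTarget(
  candidates: string[],
  opts: { qualifier?: string; qualifier_ordinal?: number } = {},
): string | undefined {
  if (opts.qualifier !== undefined) {
    return candidates.find((id) => id.endsWith(`__${opts.qualifier}`));
  }
  if (opts.qualifier_ordinal !== undefined) {
    return candidates[opts.qualifier_ordinal];
  }
  return candidates[0];
}

const base = "conduit.navigation.nav_item";
const navIds = [
  qualifyId(base, { to: "/" , label: "Home" }),
  qualifyId(base, { to: "/login" }),
];
console.log(navIds);                                      // [ '..nav_item__home', '..nav_item__login' ]
console.log(pickTarget(navIds, { qualifier: "login" }));  // conduit.navigation.nav_item__login
console.log(pickTarget(navIds));                          // conduit.navigation.nav_item__home
```

Calling `pickTarget` without opts still returns the first candidate, which mirrors the backward-compatibility note in the table.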

Verification on Conduit v2 (re-decorated post-H2):

| Metric | Pre-H2 | Post-H2 |
|---|---|---|
| Manifest size | 59 elements | 61 elements (NavItem 1 -> 4 instances, FeedToggler 1 -> 2, DropdownItem 1 -> 3) |
| Elements with __qualifier | 0 | 9 (static enumerated) |
| Elements with instance_pattern | 0 | 31 (dynamic templates) |
| Warnings (uninferrable reusables) | hidden | 12 surfaced with file + suggestion |
| Pilot demo dispatch | loops on Home forever | Sonnet picks navigate + qualifier="login", URL changes to /login, then continues to fill form_fieldset__email |
| Decoration cost | $0.46 | $0.66 (+41%, prompt denser; one-shot per project) |

The Pilot demo now reaches the login form and starts filling email -- a step the pre-H2 demo never reached (it was stuck looping clicks against the first NavItem). The remaining gap to complete login end-to-end is no longer architectural: it's that the turn-to-turn state probe shown to the model needs to include "this field already has value X" so it doesn't loop on the same fill. That's a Pilot improvement, scheduled separately. The structural unlock is in.
04e / Update -- H1 (gen-tests v2) landed VERIFIED, partial

Route-aware emitter lifts pass-rate 22×.

This closes finding H1 at the architectural level. --ai-apply now extracts a route_map from any file that is a React Router root, and each manifest element carries mounted_at: string[] + requires_auth: boolean. The Cypress emitter consumes both: every test does cy.visit(mounted_at[0]) before dispatch, and auth-gated tests insert a loginViaApi() pre-step (drops a JWT in localStorage, no UI navigation needed).
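
As a minimal sketch of what a route-aware emitter could look like, the function below turns one element into the source text of one Cypress test. It is an illustration, not the Forge emitter: the `RoutedElement` shape, the test-title format, and the choice to map `'*'` to the home route are assumptions, and dynamic segments such as `:slug` are passed through unresolved.

```typescript
// Hypothetical element shape; mounted_at and requires_auth are the fields named above.
interface RoutedElement {
  nac_id: string;
  verb: string;
  mounted_at: string[];
  requires_auth: boolean;
}

// Emit the source of one Cypress test as a string.
// '*' (mounted on every route) and a missing route both fall back to "/".
function emitCypressCase(plugin: string, el: RoutedElement): string {
  const first = el.mounted_at[0];
  const route = first === undefined || first === "*" ? "/" : first;
  const title = JSON.stringify(`${el.nac_id} :: ${el.verb}`);
  const lines: string[] = [`it(${title}, () => {`];
  if (el.requires_auth) lines.push("  loginViaApi();"); // auth pre-step before navigation
  lines.push(`  cy.visit(${JSON.stringify(route)});`);
  lines.push(
    `  cy.window().then((win) => win.NAC.click_by_verb(${JSON.stringify(plugin)}, ${JSON.stringify(el.verb)}));`,
  );
  lines.push("});");
  return lines.join("\n");
}

const publishCase = emitCypressCase("conduit", {
  nac_id: "conduit.action.publish",
  verb: "publish",
  mounted_at: ["/editor"],
  requires_auth: true,
});
console.log(publishCase);
```

For the publish action this yields a test that logs in, visits /editor, and only then dispatches the verb, instead of dispatching from the home page.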

| Metric | Pre-H1 (v1) | Post-H1 (v3) |
|---|---|---|
| Cypress test count | 66 | 70 (manifest grew by 4 elements post-H2) |
| Cypress pass-rate | 1 / 66 = 1.5% | 23 / 70 = 33% |
| Lift | baseline | 22× |
| Elements with mounted_at | 0 | 42 / 62 |
| Elements marked requires_auth | 0 | 15 / 62 |
| Route distribution | n/a | '*' x19, '/' x3, /article/:slug x5, /editor* x5, /login x3, /profile/:username x2, /register x1, /settings x4 |

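
For readers who want to check the headline number, the lift follows directly from the two pass-rates in the table. A small worked calculation (helper name is illustrative):

```typescript
// Pass-rate as a fraction in [0, 1].
function passRate(passing: number, total: number): number {
  return passing / total;
}

const before = passRate(1, 66);  // gen-tests v1
const after = passRate(23, 70);  // route-aware emitter
const lift = after / before;

console.log((before * 100).toFixed(1)); // 1.5
console.log((after * 100).toFixed(1));  // 32.9
console.log(lift.toFixed(1));           // 21.7
```

The unrounded figures are 32.9% and 21.7×; the table reports them rounded to 33% and 22×.
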
Why 33% and not 70% (target). The remaining failures are no longer about navigation; they're about fixture state. Tests for favorite_article need an article to exist; tests for follow_author need a second user. The dynamic routes (/article/:slug, /profile/:username) need fixture seed data in SQLite before cy.visit. Next iteration of gen-tests: emit beforeEach blocks with Faker-style seed data per dynamic-route test. Per the brief: "If it stays below 70%, investigate before declaring v2 ready." We investigated and reported. The route-awareness piece IS shipped; the fixture-awareness piece is the next slice.

Per the brief's communication discipline: gen-tests is not communicated publicly as a finished product until pass-rate is ≥70% sustained. This case study documents the H1 architectural unlock + the honest 22× lift, NOT a sales claim. The "agent-relevant surface coverage" metric (every (nac_id, role, verb) tuple) remains the only quantitative claim safe to publish; the test-pass-rate is documented as an honest WIP signal here.

04f / Update -- field-state probe + React-safe NAC.fill VERIFIED

Sonnet drives Conduit login end-to-end. 4 turns. $0.022.

Two fixes that together unblocked the Pilot demo end-to-end:

  1. Pilot per-turn state probe now includes the current value of every visible input/textarea (keyed by data-nac-id + fallbacks), plus the navbar text snapshot, plus the last 5 action_history entries. System prompt got a STATE AWARENESS section: "before you emit a fill, check if the field already has the target value -- if it does, ADVANCE."
  2. NAC.fill is now React-safe. The previous implementation set el.value = newVal directly. React silently reverts that on the next render, because React tracks an input's value through its own wrapper on the element and a direct assignment never registers as a change. The fix: call the native setter, Object.getOwnPropertyDescriptor(HTMLInputElement.prototype, 'value').set.call(el, newVal). Same pattern for textarea / select / checkbox. Plain DOM behavior unchanged. (Committed in rpaforce-crm@c1bd8dc5.)
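
One way to see why the prototype setter matters is a DOM-free model. The sketch below is an illustration under assumptions, not NAC.fill itself: `FakeInput` stands in for HTMLInputElement so the code runs outside a browser, and `attachTracker` is a hypothetical instance-level wrapper loosely modelled on how a framework can record writes to `value`. In a browser, the native write would typically be followed by a bubbling input event so the framework's change handler runs; that step is not shown.

```typescript
// Stand-in for HTMLInputElement: `value` is an accessor on the prototype.
class FakeInput {
  private stored = "";
  get value(): string { return this.stored; }
  set value(v: string) { this.stored = v; }
}

// Hypothetical tracker: shadows `value` on the instance and remembers the last write it saw.
function attachTracker(el: FakeInput): { last: string } {
  const tracker = { last: el.value };
  const desc = Object.getOwnPropertyDescriptor(FakeInput.prototype, "value")!;
  Object.defineProperty(el, "value", {
    configurable: true,
    get() { return desc.get!.call(el); },
    set(v: string) { tracker.last = v; desc.set!.call(el, v); },
  });
  return tracker;
}

// A change is only noticed when the element's value differs from what the tracker recorded.
function changeDetected(el: FakeInput, tracker: { last: string }): boolean {
  return el.value !== tracker.last;
}

// Walk the prototype chain and return the first inherited setter for `prop`.
function inheritedSetter(el: object, prop: string): ((v: any) => void) | undefined {
  let proto: object | null = Object.getPrototypeOf(el);
  while (proto !== null) {
    const desc = Object.getOwnPropertyDescriptor(proto, prop);
    if (desc && desc.set) return desc.set;
    proto = Object.getPrototypeOf(proto);
  }
  return undefined;
}

// Write through the prototype setter so the instance-level wrapper is bypassed.
function setNativeValue(el: object, value: string): boolean {
  const setter = inheritedSetter(el, "value");
  if (!setter) return false;
  setter.call(el, value);
  return true;
}

const direct = new FakeInput();
const directTracker = attachTracker(direct);
direct.value = "sumi@yujin.app";                    // goes through the wrapper
console.log(changeDetected(direct, directTracker)); // false

const native = new FakeInput();
const nativeTracker = attachTracker(native);
setNativeValue(native, "sumi@yujin.app");           // bypasses the wrapper
console.log(changeDetected(native, nativeTracker)); // true
```

In this model the direct assignment updates the tracker along with the value, so nothing looks changed; the prototype-level write leaves the tracker stale, so the difference is visible to whatever compares the two.
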
| Metric | Pre-H4.pre | Post-H4.pre |
|---|---|---|
| Login flow completion | never (loops on fill-email forever) | success=true in 4 turns |
| Turns used | 6 (cap) without success | 4 (navigate, fill email, fill password, click submit) |
| Determinism (3 runs, temp=0) | n/a (didn't complete) | 3 / 3 byte-identical (same tokens, turns, cost) |
| Cost per run | $0.038+ wasted | $0.0222 |
| Tokens (in / out per run) | 9781 / 664 | 6136 / 252 |

The 5 items of "Conduit ready for a success story" -- updated.
  • [x] H2 closed -- per-instance qualifiers across all 3 layers.
  • [~] H1 closed at the architecture level; 33% pass rate (was 1.5%, 22× lift); the 70% target needs fixture-state next.
  • [x] End-to-end Sonnet demo: login flow completes deterministically at $0.0222.
  • [ ] H4 (cheap-model validation across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet} × N≥5) is now unblocked -- the field-state + React-safe fill that gated it both shipped.
  • [ ] Public reproducer repo with MIT attribution -- next step.
04g / Coverage check post-H4.pre VERIFIED

Two distinct metrics: operational vs structural.

We re-ran the Forge-generated Cypress suite against live Conduit after H4.pre plus the React-safe NAC.fill fix, expecting a lift over the pre-H4 33%. The number stayed at 23/70 = 33%. Two fixes that looked like they should move it did not, and the reason is clear and honest:

| Fix tried | Expected | Actual | Why |
|---|---|---|---|
| React-safe NAC.fill | field_write tests would rise | no change | Forge's field_write tests do not assert that the value persists; they only dispatch. Pilot (multi-step flow) DOES depend on the real value for its next step, which is why the login demo went from 0% to 100% with the same fix. Different consumers, different effect. |
| Auth localStorage key fix (jwtToken -> loggedUser shape) | auth-gated tests would rise | no change | The main blocker for the 47 failing tests is not auth; it is fixture state. Tests on /article/:slug and /profile/:username need an article / user seeded in SQLite BEFORE cy.visit. Without content on those routes, the target element never mounts. |

Two metrics that must not be confused.
  • Forge gen-tests v2 (structural Cypress): 23 / 70 = 33% pass. Each test is a "checklist entry" per manifest element. It measures surface coverage.
  • Pilot end-to-end (operational): 3 / 3 = 100% pass, $0.0222, 4 turns, byte-deterministic. Sonnet drives the SPA through a real multi-step flow.
These are different metrics, and the commercial case lives more in the second. "It costs $0.022 for Sonnet to log in to your migrated Conduit" is a more relevant number for a sales demo than "33% synthetic per-element coverage". Both numbers are published here for structural honesty; the gen-tests number is still NOT communicated publicly until it reaches ≥70% (per the brief's discipline).

What would unblock the 70%: gen-tests v3 with fixture-aware emission. For each test whose visit_route contains a :param, emit a beforeEach that creates the resource via the API (POST /api/articles, POST /api/users) and aliases the id with cy.wrap().as(). Forge already has the shape of the backend (it saw it during --ai-apply in the src/services/*.js files); the LLM can infer the seed setup. Estimate: 2-3 h of dev work, regeneration cost ~$0.50.

05 / What's verified vs what's next PENDING

The honest picture at this checkpoint.

| Item | Status |
|---|---|
| Forge migrates Conduit's 47 JSX files end-to-end | verified |
| Manifest quality (semantic IDs, imperative verbs, correct roles) | verified |
| Forge auto-generates 66 e2e cases in 4 frameworks | verified |
| SPA-aware nac-bridge (MutationObserver re-binds on route change) | verified (live probe on running Conduit) |
| NAC3 runtime registers the 59-element manifest on page load | verified (Playwright probe) |
| Pass-rate of the Forge-generated Cypress suite against live Conduit | verified -- 1/66 (1.5%), SPA-naive limitation surfaced honestly |
| Pass-rate of the hand-written Cypress suite against live Conduit | verified -- 5/11 (45%), incl. pre-existing flake |
| Robustness test: refactor CSS, re-run both suites -- which survives? | pending |
| Pilot-style operational demo (Sonnet drives the migrated app) | verified -- surfaced component-reuse duplicate-id finding |
| Operational testing N≥5 across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet} | pending -- gated on fixing component-reuse dedup first |
| Public reproducer repo with MIT attribution to Thinkster | pending |

06 / Reproduce

Three commands to repeat this on your machine.

# 1. Clone Conduit + install
git clone https://github.com/digitalinnovationone/conduit-realworld-example-app-with-cypress-automation conduit
cd conduit
npm install

# 2. Decorate with Forge (you need a paid seat + ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
cp -r ./frontend ./frontend-forge-ai
yf migrate ./frontend-forge-ai \
  --ai-apply --ai-plugin-slug conduit --subdir src --ai-out ./frontend-forge-ai

# 3. Generate the test corpus in whatever framework you want
yf gen-tests ./frontend-forge-ai --framework cypress
# or --framework playwright,vitest,wdio,all

Total cost to reproduce: ~$0.50 in Anthropic tokens. Total wall time: ~3 minutes including npm install.

07 / The tools

Forge + Pilot. Build once, drive forever.

Everything in this case study runs on three pieces of open infrastructure: NAC3 v2.3.1 (the protocol, Apache-2.0), Yujin Forge (the build tool, paid seat), and Yujin Pilot (the embedded driver, open core). All three ship today.