NAC3 v2.3 -- Conduit case study (Forge migrates RealWorld in 70s)

01 / The fixture

Conduit -- the industry's benchmark Medium clone.

Conduit is what the RealWorld project ships as the canonical app for comparing frameworks: a Medium-style blogging site with user auth, articles, comments, follow/favorite, tags, profile + settings pages. Real surface, real complexity, used to evaluate ~24 frontend frameworks head-to-head. We picked the community fork that ships both a React/Vite frontend and a hand-written Cypress test suite, so the Forge-vs-manual comparison stays apples-to-apples within one codebase.

Property	Value
Frontend	React 18 + Vite + SWC. 47 .jsx files in `src/`.
Backend	Express + Sequelize + PostgreSQL. (We swapped to SQLite via the Sequelize dialect knob for zero-setup reproducibility.)
Existing tests	14 Cypress cases across 3 page flows (editor / login / register).
License	MIT. Thinkster / RealWorld attribution preserved in the fork.

02 / Migration phase VERIFIED

Forge --ai-apply decorates 47 files in 70 seconds.

--ai-apply is the JSX counterpart to --ai-silent. Walks every .jsx / .tsx under <repo>/src, sends each to Claude Sonnet 4.6 in parallel (concurrency 4 by default, max 8), receives back the decorated JSX plus a manifest fragment listing what was decorated. Forge aggregates fragments into a single manifest.json at the project root.

# From a fresh clone of the Conduit fork
yf migrate ./conduit-forge-ai \
  --ai-apply \
  --ai-plugin-slug conduit \
  --subdir src \
  --ai-out ./conduit-forge-ai \
  --ai-concurrency 4

# Output:
AI apply complete.
  Files scanned:   47
  Files decorated: 34
  Manifest elements: 59
  Plugin slug:     conduit
  Tokens in/out:   43428 / 22192
  Total cost:      $0.4632
  Total LLM time:  262s wall (across 47 files, divided by concurrency = ~70s real)
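The fan-out and aggregation described above can be sketched as follows. This is a minimal illustration, not Forge's implementation: the `ManifestFragment` and `AggregatedManifest` shapes, the `decorateStub` stand-in for the model call, and the function names are all assumptions made for the example. The only behaviors taken from the text are the bounded concurrency (4 by default) and the merge of per-file fragments into one manifest.

```typescript
// Hypothetical shapes: the real fragment and manifest schemas are not shown in this case study.
interface FragmentElement { id: string; role: string }
interface ManifestFragment { file: string; elements: FragmentElement[] }
interface AggregatedManifest { plugin_slug: string; elements: FragmentElement[] }

// Run `worker` over `items` with at most `limit` calls in flight at once,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the queue is empty.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}

// Merge per-file fragments into one manifest, keeping the first entry per id.
function aggregateFragments(slug: string, fragments: ManifestFragment[]): AggregatedManifest {
  const byId = new Map<string, FragmentElement>();
  for (const fragment of fragments) {
    for (const element of fragment.elements) {
      if (!byId.has(element.id)) byId.set(element.id, element);
    }
  }
  return { plugin_slug: slug, elements: Array.from(byId.values()) };
}

// Stand-in for the per-file LLM call; a real run would send the JSX to the model.
async function decorateStub(file: string): Promise<ManifestFragment> {
  const base = file.replace(/^.*\//, "").replace(/\.jsx$/, "").toLowerCase();
  return { file, elements: [{ id: `conduit.region.${base}`, role: "region" }] };
}

// End-to-end sketch: decorate every file with concurrency 4, then aggregate.
async function runApply(files: string[]): Promise<AggregatedManifest> {
  const fragments = await mapWithConcurrency(files, 4, decorateStub);
  return aggregateFragments("conduit", fragments);
}
```

With four lanes, 262 s of summed LLM time has a lower bound of 262 / 4 ≈ 66 s of real time, which is consistent with the ~70 s reported once uneven per-file latency is taken into account.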

Per-role breakdown of the 59-element manifest:

Role	Count	Example IDs
action	14	`conduit.action.publish`, `conduit.action.follow`, `conduit.action.favorite`
navigation	19	`conduit.navigation.login`, `conduit.navigation.author_avatar`
field	11	`conduit.field.title`, `conduit.field.body`, `conduit.field.comment_body`
tab	6	`conduit.tab.your_feed`, `conduit.tab.global_feed`, `conduit.tab.tag_feed`
region	9	`conduit.region.popular_tags`, `conduit.region.feed_toggler`

Quality assessment: IDs are semantic (not ordinal like btn_3), verbs are imperative (publish, follow, favorite, delete), roles match content (display/history-style fields as field, navigation links as navigation, feed switchers as tab).

03 / Test-corpus generation VERIFIED

One command, four frameworks, 66 test cases.

yf gen-tests takes the Forge-aggregated manifest and emits an e2e suite in whichever test framework the user's project carries. Multi-framework on purpose: most adoptions already have Playwright or Cypress or Vitest installed -- Forge does not mandate a choice.

yf gen-tests ./conduit-forge-ai --framework all --base-url "http://localhost:3000"

# Output:
Manifest elements:  59
Test cases:         66                          # some elements have multiple verbs
Frameworks:         playwright, cypress, vitest, wdio

playwright: + tests/forge-e2e/conduit.spec.ts
cypress:    + cypress/e2e/forge/conduit.cy.js
vitest:     + tests/forge-e2e/conduit.test.ts
wdio:       + test/specs/forge-e2e/conduit.test.ts

Coverage: 59 / 59 manifest elements exercised.

The coverage metric is "agent-relevant surface coverage" -- one test case per (nac_id, role, verb) tuple in the manifest. This is the honest claim; we do not report this as line-coverage (which the manifest alone cannot guarantee).
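To make the metric concrete, here is a small sketch of how one case per (nac_id, role, verb) tuple could be derived and counted. The `CoverageElement` shape and the verbs on the sample elements are assumptions for illustration; the real manifest schema is not reproduced in this write-up.

```typescript
// Hypothetical manifest shape; the verbs below are illustrative, not read from the real manifest.
interface CoverageElement { id: string; role: string; verbs: string[] }
interface CoverageCase { nac_id: string; role: string; verb: string }

// One test case per distinct (nac_id, role, verb) tuple.
function expandToCases(elements: CoverageElement[]): CoverageCase[] {
  const seen = new Set<string>();
  const cases: CoverageCase[] = [];
  for (const element of elements) {
    for (const verb of element.verbs) {
      const key = [element.id, element.role, verb].join("|");
      if (seen.has(key)) continue;
      seen.add(key);
      cases.push({ nac_id: element.id, role: element.role, verb });
    }
  }
  return cases;
}

// Surface coverage: how many manifest elements are touched by at least one case.
function surfaceCoverage(
  elements: CoverageElement[],
  cases: CoverageCase[],
): { exercised: number; total: number } {
  const hit = new Set<string>();
  for (const c of cases) hit.add(c.nac_id);
  let exercised = 0;
  for (const element of elements) {
    if (hit.has(element.id)) exercised++;
  }
  return { exercised, total: elements.length };
}

const sampleElements: CoverageElement[] = [
  { id: "conduit.action.publish", role: "action", verbs: ["publish"] },
  { id: "conduit.field.title", role: "field", verbs: ["fill", "clear"] },
  { id: "conduit.tab.global_feed", role: "tab", verbs: ["select"] },
];
const sampleCases = expandToCases(sampleElements); // 4 cases from 3 elements
const sampleCoverage = surfaceCoverage(sampleElements, sampleCases); // exercised 3 of 3
```

The same expansion explains why the case count (66) exceeds the element count (59): an element that exposes more than one verb yields more than one tuple.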

Framework	File emitted	Pattern
Playwright	`tests/forge-e2e/conduit.spec.ts`	One `test(...)` per case, navigates + waits for NAC.register, then dispatches via `NAC.click_by_verb`
Cypress	`cypress/e2e/forge/conduit.cy.js`	One `it(...)` per case inside a top `describe`, uses `cy.window().then((win) => win.NAC.click_by_verb(...))`
Vitest + jsdom	`tests/forge-e2e/conduit.test.ts`	Structural tests: manifest plugin_slug matches, every element id is present in the loaded HTML's data-nac-id attributes
WebdriverIO	`test/specs/forge-e2e/conduit.test.ts`	Async-await pattern with `browser.execute` for NAC calls

04 / Comparison vs hand-written Cypress VERIFIED (structural)

Forge: 4.7× more cases, full surface.

The fork's testAutomation/ directory carries 14 hand-written Cypress cases across 3 .cy.js files. Forge's emitter produces 66 cases in one .cy.js. The comparison is structural here -- counting cases + verifying surface coverage -- because operational pass-rate validation requires running both suites against a live backend.

	Forge auto (Cypress emitter)	Hand-written Cypress (in same fork)
Test cases	66	14
Surface coverage	59 / 59 manifest elements (100%)	3 page flows (editor / login / register)
Time to author	~30 seconds (one command)	Person-days (assumed; not measured for this fork)
Brittleness	Selectors are `data-nac-id` attrs -- stable across CSS/HTML refactors	Mostly text/class selectors -- break when designers move things
Maintenance after refactor	Re-run `yf gen-tests`	Manual update of every broken selector

04b / Operational run -- HONEST VERIFIED

Quantity is not quality. SPA-naive gen-tests need flow awareness.

We stood up the Conduit backend (SQLite via Sequelize dialect swap), brought up the migrated Vite frontend, and ran both Cypress suites end-to-end. The structural delta (66 vs 14) holds; the operational delta tells a different story we publish unsoftened.

Suite	Total	Passing	Pass rate	Why
Hand-written Cypress (3 page flows)	11 (1 pending)	5	45%	Tests are flow-aware (navigate to /editor before testing editor). One test had a pre-existing `cy.type(undefined)` bug from a faker reuse issue. The remaining 5 failures are likely env-specific (SQLite vs the prod DB the suite was tuned against).
Forge auto Cypress (SPA-naive, v1)	66	1	1.5%	Each test calls `cy.visit(BASE_URL)` (home page) then dispatches one verb. For an SPA where 80% of action elements only mount under specific routes (editor / login / register / profile), most dispatches don't find the element. This is a real gap in gen-tests v1.

The honest read. Forge gen-tests v1 covers every manifest element with 1 test apiece. That's structurally complete but operationally naive for SPAs -- it doesn't know that conduit.action.publish only exists under /editor, or that conduit.action.follow requires being on a profile page. v2 of gen-tests needs route-awareness: per-element navigation hints derived from React Router config or from a Pilot dispatch simulation. Until then the suite is best read as a checklist of decorated elements, not as a runnable acceptance suite.

Two structural conclusions hold despite the operational gap:

Forge's 59-element manifest is correct: NAC3 attached, plugin registered, every decorated element discoverable via NAC.click_by_verb when its containing route is mounted.
The SPA-aware nac-bridge (MutationObserver) works: late-mounted action/navigation/tab elements get bound automatically as routes change.

04c / Third honest finding -- component reuse VERIFIED

One JSX component, many DOM instances, same nac_id.

Ran a Pilot-style operational demo: tell Sonnet 4.6 "log in as sumi@yujin.app", give it the live manifest, let it pick verbs turn by turn, dispatch via NAC.click_by_verb. Every dispatch returned OK (the bridge fires the contract event), but the URL never changed. The model picked conduit.navigation.nav_item verb navigate six turns in a row -- each click hit the same DOM element (the first NavItem) which happened to be the Home link, not the SignIn link.

Diagnosis: Conduit's Navbar uses a <NavItem> component reused for Home, SignIn, SignUp, Settings, Profile. --ai-apply sees the component once, decorates the JSX with data-nac-id="conduit.navigation.nav_item", and that single ID propagates to every rendered instance. The manifest deduplicates by ID, so the 5+ DOM instances collapse into 1 manifest entry. NAC.click_by_verb then matches the first DOM occurrence -- always the same one, regardless of which the user actually wanted.
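The collapse can be reproduced without a browser. The sketch below models rendered elements as plain objects; `RenderedNavNode` and both helper functions are assumptions for illustration, and the first-match lookup stands in for whatever selector logic the runtime actually uses.

```typescript
// Stand-in for a rendered DOM node; `label` plays the role of the link text.
interface RenderedNavNode { nacId: string; label: string }

// Five instances of one reusable component, all carrying the same decorated id.
const renderedNavbar: RenderedNavNode[] = ["Home", "Sign in", "Sign up", "Settings", "Profile"].map(
  (label) => ({ nacId: "conduit.navigation.nav_item", label }),
);

// Manifest-style dedup: one entry per distinct id.
function manifestIds(nodes: RenderedNavNode[]): string[] {
  return Array.from(new Set(nodes.map((n) => n.nacId)));
}

// First-match lookup: returns the earliest node with the given id.
function firstMatchById(nodes: RenderedNavNode[], nacId: string): RenderedNavNode | undefined {
  return nodes.find((n) => n.nacId === nacId);
}

const collapsedIds = manifestIds(renderedNavbar); // one id for five nodes
const clickedNode = firstMatchById(renderedNavbar, "conduit.navigation.nav_item"); // always the "Home" node
```

No matter how many times the lookup runs, it resolves to the first instance, which matches the six identical turns observed in the demo.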

Finding	What to do
Component reuse generates duplicate nac_ids	Per-component-instance suffixing (e.g. `conduit.navigation.nav_item[home]`, `nav_item[sign_in]`) derived from the prop / children text. Next iteration of --ai-apply.
Manifest dedup hides this from the agent	Surface duplicate-id warnings during apply. Optionally: emit one manifest entry per occurrence rather than per declaration.
Model needs disambiguation context	System prompt extension: when multiple manifest entries share a verb, ask Pilot to use additional payload like child-text or aria-label.

Why this matters. This is the most actionable finding in the case study. It's a clean architectural gap with three independent fixes, all small. Once landed, the Pilot demo above goes from "every click hits the first NavItem" to "Sonnet picks the SignIn link, clicks it, fills email + password, submits". That's the unlock for true SPA operation.

04d / Update -- H2 fix landed VERIFIED

Per-instance disambiguation shipped + verified.

Closed finding 04c at the architectural level. --ai-apply now infers per-instance qualifiers from JSX props (label, name, to, slug, etc), bakes them into data-nac-id as a __qualifier suffix, and emits N manifest entries per known instance (or instance_pattern for runtime-dynamic qualifiers). The NAC3 runtime's click_by_verb now accepts { qualifier } and { qualifier_ordinal } in opts for explicit instance targeting.

Layer	What changed
Forge `--ai-apply` prompt	STATIC / DYNAMIC / UNINFERRABLE doctrine. Per-instance suffix baked into JSX. Warning emitted when reusable component cannot be qualified.
Manifest schema	New optional fields: `instance_pattern`, `qualifier_source`, `warning`.
NAC3 runtime (`nac.browser.js`)	`click_by_verb(plugin, verb, { qualifier, qualifier_ordinal })`. Backward compatible (no qualifier = old behavior).
Pilot system prompt	Doctrine explained: when entries share verb but differ in suffix, pass qualifier in payload.
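A rough sketch of how qualifier-based targeting might resolve an instance. Only the option names `qualifier` and `qualifier_ordinal`, the `__` suffix convention, and the "no qualifier = old behavior" rule come from the text above; the node shape, the sample ids, and the exact meaning of `qualifier_ordinal` (here: an index into the remaining candidates) are assumptions.

```typescript
// Hypothetical view of a bound element: its data-nac-id and the verbs it accepts.
interface QualifiedNode { nacId: string; verbs: string[] }
interface TargetOpts { qualifier?: string; qualifier_ordinal?: number }

// Pick the element a verb should be dispatched to.
function resolveTarget(
  nodes: QualifiedNode[],
  plugin: string,
  verb: string,
  opts: TargetOpts = {},
): QualifiedNode | undefined {
  // Candidates: same plugin namespace, and the verb is supported.
  let candidates = nodes.filter(
    (n) => n.nacId.startsWith(plugin + ".") && n.verbs.includes(verb),
  );
  // Narrow by the "__qualifier" suffix when one is given.
  if (opts.qualifier !== undefined) {
    const suffix = "__" + opts.qualifier;
    candidates = candidates.filter((n) => n.nacId.endsWith(suffix));
  }
  // Without any options this returns the first candidate (the pre-H2 behavior).
  return candidates[opts.qualifier_ordinal ?? 0];
}

// Illustrative ids only; the real post-H2 manifest is not listed here.
const boundNodes: QualifiedNode[] = [
  { nacId: "conduit.navigation.nav_item__home", verbs: ["navigate"] },
  { nacId: "conduit.navigation.nav_item__login", verbs: ["navigate"] },
  { nacId: "conduit.navigation.nav_item__register", verbs: ["navigate"] },
  { nacId: "conduit.action.publish", verbs: ["publish"] },
];

const legacyTarget = resolveTarget(boundNodes, "conduit", "navigate");
const loginTarget = resolveTarget(boundNodes, "conduit", "navigate", { qualifier: "login" });
const thirdTarget = resolveTarget(boundNodes, "conduit", "navigate", { qualifier_ordinal: 2 });
```

Keeping the unqualified call on the first-candidate path is what makes the change backward compatible: existing callers see no difference, while the model can opt in to explicit targeting.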

Verification on Conduit v2 (re-decorated post-H2):

Metric	Pre-H2	Post-H2
Manifest size	59 elements	61 elements (NavItem 1 -> 4 instances, FeedToggler 1 -> 2, DropdownItem 1 -> 3)
Elements with `__qualifier` suffix	0	9 (static enumerated)
Elements with `instance_pattern`	0	31 (dynamic templates)
Warnings (uninferrable reusables)	hidden	12 surfaced with file + suggestion
Pilot demo dispatch	loops on Home forever	Sonnet picks `navigate` + `qualifier="login"`, URL changes to `/login`, then continues to fill `form_fieldset__email`
Decoration cost	$0.46	$0.66 (+41%, prompt denser; one-shot per project)

The Pilot demo now reaches the login form and starts filling email -- a step the pre-H2 demo never reached (it was stuck looping clicks against the first NavItem). The remaining gap to complete login end-to-end is no longer architectural: it's that the turn-to-turn state probe shown to the model needs to include "this field already has value X" so it doesn't loop on the same fill. That's a Pilot improvement, scheduled separately. The structural unlock is in.

04e / Update -- H1 (gen-tests v2) landed VERIFIED, partial

Route-aware emitter lifts pass-rate 22×.

Closed Finding 1 (H1) at the architectural level. --ai-apply now extracts a route_map from any file that is a React Router root, and each manifest element carries mounted_at: string[] + requires_auth: boolean. The Cypress emitter consumes both: every test does cy.visit(mounted_at[0]) before dispatch, and auth-gated tests insert a loginViaApi() pre-step (drops a JWT in localStorage, no UI navigation needed).
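As an illustration of what a route-aware emitter could produce, the sketch below turns one manifest element into the source text of a Cypress case. The `RoutedElement` shape, the function name, the fallback to "/" when `mounted_at` is empty, and the placement of `loginViaApi()` before the visit are assumptions; `mounted_at`, `requires_auth`, `loginViaApi`, and the `win.NAC.click_by_verb` dispatch are the names used in this case study.

```typescript
// Hypothetical per-element input for the emitter.
interface RoutedElement {
  id: string;
  verb: string;
  mounted_at: string[];
  requires_auth: boolean;
}

// Emit the source text of one route-aware Cypress test.
function emitRouteAwareCase(plugin: string, el: RoutedElement): string {
  // Assumption: fall back to the home page when no route is known.
  const route = el.mounted_at.length > 0 ? el.mounted_at[0] : "/";
  const title = `${el.id} :: ${el.verb}`;
  const lines: string[] = [`it(${JSON.stringify(title)}, () => {`];
  // Auth-gated elements get the API login pre-step before navigation.
  if (el.requires_auth) lines.push("  loginViaApi();");
  lines.push(`  cy.visit(${JSON.stringify(route)});`);
  lines.push(
    `  cy.window().then((win) => win.NAC.click_by_verb(${JSON.stringify(plugin)}, ${JSON.stringify(el.verb)}));`,
  );
  lines.push("});");
  return lines.join("\n");
}

// Example input; marking the publish action as auth-gated is an assumption.
const publishCaseSource = emitRouteAwareCase("conduit", {
  id: "conduit.action.publish",
  verb: "publish",
  mounted_at: ["/editor"],
  requires_auth: true,
});
```

A route such as /article/:slug would be emitted verbatim by this sketch, which is exactly where seeded fixture data becomes necessary.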

Metric	Pre-H1 (v1)	Post-H1 (v2)
Cypress test count	66	70 (manifest grew by 4 elements post-H2)
Cypress pass-rate	1 / 66 = 1.5%	23 / 70 = 33%
Lift	baseline	22×
Elements with `mounted_at`	0	42 / 62
Elements marked `requires_auth`	0	15 / 62
Route distribution	n/a	'' x19, '/' x3, /article/:slug x5, /editor x5, /login x3, /profile/:username x2, /register x1, /settings x4
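The pass-rate and lift figures in the table follow from the raw counts. A short check, with illustrative function and variable names:

```typescript
// Fraction of passing tests.
function passRateOf(passing: number, total: number): number {
  return passing / total;
}

const v1Rate = passRateOf(1, 66);   // about 0.0152, reported as 1.5%
const v2Rate = passRateOf(23, 70);  // about 0.3286, reported as 33%
const liftOverV1 = v2Rate / v1Rate; // about 21.7, reported as 22×

const v1Percent = (v1Rate * 100).toFixed(1); // "1.5"
const v2Percent = Math.round(v2Rate * 100);  // 33
```

The lift is a ratio of rates, so part of its size comes from the very low v1 baseline; the absolute figure to watch remains 23 of 70.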

Why 33% and not 70% (target). The remaining failures are no longer about navigation -- they're about fixture state. Tests for favorite_article need an article to exist; tests for follow_author need a second user. The dynamic routes (/article/:slug, /profile/:username) need fixture seed data in the SQLite database before cy.visit. Next iteration of gen-tests: emit beforeEach blocks with Faker-style seed data per dynamic-route test. Per the brief: "If it stays below 70%, investigate before declaring v2 ready." -- investigated and reported. The route-awareness piece IS shipped; the fixture-awareness piece is the next slice.

Per the brief's communication discipline: gen-tests is not communicated publicly as a finished product until pass-rate is ≥70% sustained. This case study documents the H1 architectural unlock + the honest 22× lift, NOT a sales claim. The "agent-relevant surface coverage" metric (every (nac_id, role, verb) tuple) remains the only quantitative claim safe to publish; the test-pass-rate is documented as an honest WIP signal here.

04f / Update -- field-state probe + React-safe NAC.fill VERIFIED

Sonnet drives Conduit login end-to-end. 4 turns. $0.022.

Two fixes that together unblocked the Pilot demo end-to-end:

Pilot per-turn state probe now includes the current value of every visible input/textarea (keyed by data-nac-id + fallbacks), plus the navbar text snapshot, plus the last 5 action_history entries. System prompt got a STATE AWARENESS section: "before you emit a fill, check if the field already has the target value -- if it does, ADVANCE."
NAC.fill is now React-safe. The previous implementation set el.value = newVal directly. React silently reverts that on the next render: it installs its own value tracker over the element's value property, so a direct assignment is recorded as already seen, no change reaches component state, and the controlled input re-renders with the old value. The fix: call the native setter directly via Object.getOwnPropertyDescriptor(HTMLInputElement.prototype, 'value').set.call(el, newVal), which bypasses the tracker. Same pattern for textarea / select / checkbox. Plain DOM behavior unchanged. (Committed in rpaforce-crm@c1bd8dc5.)
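The effect of the native-setter call can be shown with a toy model that runs without a browser. `FakeInput` stands in for an input element whose value accessor lives on the prototype, and `attachValueTracker` imitates, in simplified form, an instance-level tracker of the kind React installs; neither is React's or NAC's actual code. In a real page the setter would come from HTMLInputElement.prototype as quoted above, presumably followed by dispatching an input event so React re-reads the value (that second step is an assumption, the text does not state it).

```typescript
// Toy element: the "native" value accessor is defined on the prototype.
class FakeInput {
  private stored = "";
  get value(): string { return this.stored; }
  set value(v: string) { this.stored = v; }
}

// Fetch the prototype-level accessor, the analogue of the native DOM setter.
function nativeValueAccessor(): { get: () => string; set: (v: string) => void } {
  const d = Object.getOwnPropertyDescriptor(FakeInput.prototype, "value");
  if (!d || !d.get || !d.set) throw new Error("value accessor missing");
  return { get: d.get, set: d.set };
}

// Simplified tracker: shadows `value` on the instance and remembers the last value written through it.
function attachValueTracker(el: FakeInput): () => string {
  const native = nativeValueAccessor();
  let lastSeen = el.value;
  Object.defineProperty(el, "value", {
    configurable: true,
    get(): string { return native.get.call(el); },
    set(v: string) { lastSeen = v; native.set.call(el, v); },
  });
  return () => lastSeen;
}

// A change is only noticed when the element's value differs from what the tracker saw.
function wouldFireChange(el: FakeInput, lastSeen: () => string): boolean {
  return el.value !== lastSeen();
}

// Naive fill: plain assignment goes through the tracker, so nothing looks changed.
const naiveEl = new FakeInput();
const naiveSeen = attachValueTracker(naiveEl);
naiveEl.value = "sumi@yujin.app";
const naiveDetected = wouldFireChange(naiveEl, naiveSeen); // false

// Prototype-setter fill: bypasses the tracker, so the change is detectable.
const safeEl = new FakeInput();
const safeSeen = attachValueTracker(safeEl);
nativeValueAccessor().set.call(safeEl, "sumi@yujin.app");
const safeDetected = wouldFireChange(safeEl, safeSeen); // true
```

Both fills leave the same text in the element; the difference is only whether the framework can tell that something changed, which is why plain DOM pages behave identically under either approach.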

Metric	Pre-H4.pre	Post-H4.pre
Login flow completion	never (loops on fill-email forever)	success=true in 4 turns
Turns used	6 (cap) without success	4 (navigate, fill email, fill password, click submit)
Determinism (3 runs, temp=0)	n/a (didn't complete)	3 / 3 byte-identical (same tokens, turns, cost)
Cost per run	$0.038+ wasted	$0.0222
Tokens (in / out per run)	9781 / 664	6136 / 252

The 5 items of "Conduit ready as a success case" -- updated.

[x] H2 closed -- per-instance qualifiers across all 3 layers.
[~] H1 closed at the architecture level; 33% pass rate (was 1.5%, 22× lift); the 70% target needs fixture-state seeding next.
[x] End-to-end Sonnet demo: login flow completes deterministically at $0.0222.
[ ] H4 (cheap-model validation across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet} × N≥5) is now unblocked -- the field-state probe and the React-safe fill that gated it have both shipped.
[ ] Public reproducer repo with MIT attribution -- next step.

04g / Coverage check post-H4.pre VERIFIED

Two distinct metrics: operational vs structural.

We re-ran the Forge-generated Cypress suite against the live Conduit after H4.pre and the React-safe NAC.fill fix, expecting a lift over the pre-H4 33%. The number stayed at 23/70 = 33%. Two fixes that looked like they should move it did not, and the reason is clear:

Fix tested	Expected	Actual	Why
React-safe NAC.fill	field_write tests would rise	no change	Forge's `field_write` tests do not assert that the value persists -- they only dispatch. Pilot (multi-step flow) DOES depend on the real value for the next step; that is why the login demo went from 0% to 100% with the same fix. Different consumers, different effect.
Auth localStorage key fix (jwtToken -> loggedUser shape)	auth-gated tests would rise	no change	The main blocker for the 47 failing tests is not auth -- it is fixture state. Tests on `/article/:slug` and `/profile/:username` need an article / user seeded in SQLite BEFORE the `cy.visit`. Without content on those routes, the target element does not mount.

Two metrics that should not be confused.

Forge gen-tests v2 (structural Cypress): 23 / 70 = 33% pass. Each test is a "checklist entry" per manifest element. It measures surface coverage.
Pilot end-to-end (operational): 3 / 3 = 100% pass, $0.0222, 4 turns, byte-deterministic. Sonnet drives the SPA through a real multi-step flow.

They are different metrics, and the commercial case lives more in the second. "It costs $0.022 for Sonnet to log in to your migrated Conduit" is a more relevant number for a sales demo than "33% synthetic per-element coverage". Both numbers are published here for structural honesty; the gen-tests number is still NOT communicated publicly until ≥70% (per the brief's discipline).

What would unblock the 70%: gen-tests v3 with fixture-aware emission. For each test whose visit_route contains :param, emit a beforeEach that creates the resource via the API (POST /api/articles, POST /api/users) and aliases the id with cy.wrap().as(). Forge already has the shape of the backend (it saw it during --ai-apply in the src/services/*.js files); the LLM can infer the seed setup. Estimate: 2-3h of dev work, regeneration cost ~$0.50.
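A possible shape for that fixture-aware step, sketched as a string emitter. Everything here is hypothetical: the `SeedRule` table, the function names, and the `seedBody(...)` helper referenced in the emitted text are placeholders, and the emitted code aliases the whole response body because the backend's response shape is not documented in this write-up. Only the two endpoints and the `cy.wrap().as()` aliasing idea come from the text.

```typescript
// Hypothetical mapping from a route parameter to the API call that seeds it.
interface SeedRule { param: string; endpoint: string; alias: string }

const seedRules: SeedRule[] = [
  { param: "slug", endpoint: "/api/articles", alias: "slug" },
  { param: "username", endpoint: "/api/users", alias: "username" },
];

// Extract parameter names from a route such as "/article/:slug".
function routeParamNames(route: string): string[] {
  return route
    .split("/")
    .filter((segment) => segment.charAt(0) === ":")
    .map((segment) => segment.slice(1));
}

// Emit a beforeEach block that seeds each dynamic parameter, or null for static routes.
function emitSeedBlock(route: string): string | null {
  const params = routeParamNames(route);
  if (params.length === 0) return null;
  const lines: string[] = ["beforeEach(() => {"];
  for (const param of params) {
    const rule = seedRules.find((r) => r.param === param);
    if (!rule) {
      lines.push(`  // no seed rule known for :${param}`);
      continue;
    }
    // seedBody(...) is a placeholder for whatever payload the backend expects.
    lines.push(`  cy.request("POST", "${rule.endpoint}", seedBody("${param}"))`);
    lines.push(`    .then((res) => cy.wrap(res.body).as("${rule.alias}"));`);
  }
  lines.push("});");
  return lines.join("\n");
}

const articleSeedSource = emitSeedBlock("/article/:slug");
const editorSeedSource = emitSeedBlock("/editor"); // static route, nothing to seed
```

The generated test would then read the alias back and substitute it into the route before calling cy.visit, so the target element has content to mount against.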

05 / What's verified vs what's next PENDING

The honest picture at this checkpoint.

Item	Status
Forge migrates Conduit's 47 JSX files end-to-end	verified
Manifest quality (semantic IDs, imperative verbs, correct roles)	verified
Forge auto-generates 66 e2e cases in 4 frameworks	verified
SPA-aware nac-bridge (MutationObserver re-binds on route change)	verified (live probe on running Conduit)
NAC3 runtime registers the 59-element manifest on page load	verified (Playwright probe)
Pass-rate of the Forge-generated Cypress suite against live Conduit	verified -- 1/66 (1.5%), SPA-naive limitation surfaced honestly
Pass-rate of the hand-written Cypress suite against live Conduit	verified -- 5/11 (45%), incl. pre-existing flake
Robustness test: refactor CSS, re-run both suites -- which survives?	pending
Pilot-style operational demo (Sonnet drives the migrated app)	verified -- surfaced component-reuse duplicate-id finding
Operational testing N≥5 across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet}	pending -- gated on fixing component-reuse dedup first
Public reproducer repo with MIT attribution to Thinkster	pending

06 / Reproduce

Three commands to repeat this on your machine.

# 1. Clone Conduit + install
git clone https://github.com/digitalinnovationone/conduit-realworld-example-app-with-cypress-automation conduit
cd conduit
npm install

# 2. Decorate with Forge (you need a paid seat + ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
cp -r ./frontend ./frontend-forge-ai
yf migrate ./frontend-forge-ai \
  --ai-apply --ai-plugin-slug conduit --subdir src --ai-out ./frontend-forge-ai

# 3. Generate the test corpus in whatever framework you want
yf gen-tests ./frontend-forge-ai --framework cypress
# or --framework playwright,vitest,wdio,all

Total cost to reproduce: ~$0.50 in Anthropic tokens. Total wall time: ~3 minutes including npm install.

07 / The tools

Forge + Pilot. Build once, drive forever.

Everything in this case study runs on three pieces of open infrastructure: NAC3 v2.3.1 (the protocol, Apache-2.0), Yujin Forge (the build tool, paid seat), and Yujin Pilot (the embedded driver, open core). All three ship today.