Conduit -- the industry's benchmark Medium clone.
Conduit is what the RealWorld project ships as the canonical app for comparing frameworks: a Medium-style blogging site with user auth, articles, comments, follow/favorite, tags, profile + settings pages. Real surface, real complexity, used to evaluate ~24 frontend frameworks in head-to-head form. We picked the community fork that ships both a React/Vite frontend and a hand-written Cypress test suite, so the Forge-vs-manual comparison is apples-to-apples within the same codebase.
| Property | Value |
|---|---|
| Frontend | React 18 + Vite + SWC. 47 .jsx files in src/. |
| Backend | Express + Sequelize + PostgreSQL. (We swapped to SQLite via the Sequelize dialect knob for zero-setup reproducibility.) |
| Existing tests | 14 Cypress cases across 3 page flows (editor / login / register). |
| License | MIT. Thinkster / RealWorld attribution preserved in the fork. |
Forge --ai-apply scans 47 files and decorates 34 in about 70 seconds.
--ai-apply is the JSX counterpart to --ai-silent. It walks every .jsx / .tsx under <repo>/src, sends each file to Claude Sonnet 4.6 in parallel (concurrency 4 by default, max 8), and receives back the decorated JSX plus a manifest fragment listing what was decorated. Forge aggregates the fragments into a single manifest.json at the project root.
```shell
# From a fresh clone of the Conduit fork
yf migrate ./conduit-forge-ai \
  --ai-apply \
  --ai-plugin-slug conduit \
  --subdir src \
  --ai-out ./conduit-forge-ai \
  --ai-concurrency 4
```
Output:
```text
AI apply complete.
  Files scanned:     47
  Files decorated:   34
  Manifest elements: 59
  Plugin slug:       conduit
  Tokens in/out:     43428 / 22192
  Total cost:        $0.4632
  Total LLM time:    262s wall (across 47 files, divided by concurrency = ~70s real)
```
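The totals in that output can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the per-million-token prices are an assumption (the case study does not state them), chosen because they reproduce the reported total, and the wall-time figure is an idealised even split across workers.

```javascript
// Figures copied from the --ai-apply output above.
const tokensIn = 43428;
const tokensOut = 22192;
const llmSeconds = 262;
const concurrency = 4;

// ASSUMPTION: per-million-token prices in USD; not stated in the case study.
const PRICE_IN_PER_MTOK = 3;
const PRICE_OUT_PER_MTOK = 15;

function estimateCost(inTok, outTok) {
  return (inTok * PRICE_IN_PER_MTOK + outTok * PRICE_OUT_PER_MTOK) / 1e6;
}

// Idealised wall time: total LLM seconds spread evenly over the worker slots.
function estimateWallSeconds(totalSeconds, workers) {
  return totalSeconds / workers;
}

const cost = estimateCost(tokensIn, tokensOut);
const wall = estimateWallSeconds(llmSeconds, concurrency);
console.log(`cost ~ $${cost.toFixed(4)}, wall ~ ${wall.toFixed(1)}s`);
// prints "cost ~ $0.4632, wall ~ 65.5s"
```

Under these assumptions the even split gives about 65 seconds, which the log rounds up to ~70s of real time.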
Per-role breakdown of the 59-element manifest:
| Role | Count | Example IDs |
|---|---|---|
| action | 14 | conduit.action.publish, conduit.action.follow, conduit.action.favorite |
| navigation | 19 | conduit.navigation.login, conduit.navigation.author_avatar |
| field | 11 | conduit.field.title, conduit.field.body, conduit.field.comment_body |
| tab | 6 | conduit.tab.your_feed, conduit.tab.global_feed, conduit.tab.tag_feed |
| region | 9 | conduit.region.popular_tags, conduit.region.feed_toggler |
Quality assessment: IDs are semantic (not ordinal like btn_3), verbs are imperative (publish, follow, favorite, delete), roles match content (display/history-style fields as field, navigation links as navigation, feed switchers as tab).
One command, four frameworks, 66 test cases.
yf gen-tests takes the Forge-aggregated manifest and emits an e2e suite in whichever test framework the user's project carries. Multi-framework on purpose: most adoptions already have Playwright or Cypress or Vitest installed -- Forge does not mandate a choice.
```shell
yf gen-tests ./conduit-forge-ai --framework all --base-url "http://localhost:3000"
```
Output:
```text
Manifest elements: 59
Test cases:        66   # some elements have multiple verbs
Frameworks:        playwright, cypress, vitest, wdio
  playwright: + tests/forge-e2e/conduit.spec.ts
  cypress:    + cypress/e2e/forge/conduit.cy.js
  vitest:     + tests/forge-e2e/conduit.test.ts
  wdio:       + test/specs/forge-e2e/conduit.test.ts
Coverage: 59 / 59 manifest elements exercised.
```
The coverage metric is "agent-relevant surface coverage" -- one test case per (nac_id, role, verb) tuple in the manifest. This is the honest claim; we do not report this as line-coverage (which the manifest alone cannot guarantee).
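A minimal sketch of how such a tuple-based metric could be computed. The manifest shape used here (an `elements` array whose entries carry `id`, `role`, and `verbs`) and the verb names are assumptions for illustration, not Forge's actual schema.

```javascript
// Hypothetical manifest shape: each element has an id, a role and one or more verbs.
const manifest = {
  plugin_slug: "conduit",
  elements: [
    { id: "conduit.action.publish", role: "action", verbs: ["publish"] },
    { id: "conduit.field.title", role: "field", verbs: ["fill", "clear"] },
    { id: "conduit.tab.global_feed", role: "tab", verbs: ["select"] },
  ],
};

// One test case per (nac_id, role, verb) tuple.
function expandCases(m) {
  return m.elements.flatMap((el) =>
    el.verbs.map((verb) => ({ nac_id: el.id, role: el.role, verb }))
  );
}

// Surface coverage: how many manifest elements have at least one case.
function surfaceCoverage(m, cases) {
  const exercised = new Set(cases.map((c) => c.nac_id));
  const covered = m.elements.filter((el) => exercised.has(el.id)).length;
  return { covered, total: m.elements.length };
}

const cases = expandCases(manifest);
const coverage = surfaceCoverage(manifest, cases);
console.log(`${cases.length} cases, ${coverage.covered} / ${coverage.total} elements`);
// prints "4 cases, 3 / 3 elements"
```

This is why the case count (66) can exceed the element count (59): an element with two verbs contributes two cases but counts once toward coverage.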
| Framework | File emitted | Pattern |
|---|---|---|
| Playwright | tests/forge-e2e/conduit.spec.ts | One test(...) per case, navigates + waits for NAC.register, then dispatches via NAC.click_by_verb |
| Cypress | cypress/e2e/forge/conduit.cy.js | One it(...) per case inside a top describe, uses cy.window().then((win) => win.NAC.click_by_verb(...)) |
| Vitest + jsdom | tests/forge-e2e/conduit.test.ts | Structural tests: manifest plugin_slug matches, every element id is present in the loaded HTML's data-nac-id attributes |
| WebdriverIO | test/specs/forge-e2e/conduit.test.ts | Async-await pattern with browser.execute for NAC calls |
Forge: 4.7× more cases, full surface.
The fork's testAutomation/ directory carries 14 hand-written Cypress cases across 3 .cy.js files. Forge's emitter produces 66 cases in one .cy.js. The comparison is structural here -- counting cases + verifying surface coverage -- because operational pass-rate validation requires running both suites against a live backend.
| | Forge auto (Cypress emitter) | Hand-written Cypress (in same fork) |
|---|---|---|
| Test cases | 66 | 14 |
| Surface coverage | 59 / 59 manifest elements (100%) | 3 page flows (editor / login / register) |
| Time to author | ~30 seconds (one command) | Person-days (assumed; not measured for this fork) |
| Brittleness | Selectors are data-nac-id attrs -- stable across CSS/HTML refactors | Mostly text/class selectors -- break when designers move things |
| Maintenance after refactor | Re-run yf gen-tests | Manual update of every broken selector |
Quantity is not quality. SPA-naive gen-tests need flow awareness.
We stood up the Conduit backend (SQLite via Sequelize dialect swap), brought up the migrated Vite frontend, and ran both Cypress suites end-to-end. The structural delta (66 vs 14) holds; the operational delta tells a different story we publish unsoftened.
| Suite | Total | Passing | Pass rate | Why |
|---|---|---|---|---|
| Hand-written Cypress (3 page flows) | 11 (1 pending) | 5 | 45% | Tests are flow-aware (navigate to /editor before testing editor). One test had a pre-existing cy.type(undefined) bug from a faker reuse issue. The remaining 5 failures are likely env-specific (SQLite vs the prod DB the suite was tuned against). |
| Forge auto Cypress (SPA-naive, v1) | 66 | 1 | 1.5% | Each test calls cy.visit(BASE_URL) (home page) then dispatches one verb. For an SPA where 80% of action elements only mount under specific routes (editor / login / register / profile), most dispatches don't find the element. This is a real gap in gen-tests v1. |
gen-tests v1 does not know that conduit.action.publish only exists under /editor, or that conduit.action.follow requires being on a profile page. v2 of gen-tests needs route-awareness: per-element navigation hints derived from React Router config or from a Pilot dispatch simulation. Until then the suite is best read as a checklist of decorated elements, not as a runnable acceptance suite.
Two structural conclusions hold despite the operational gap:
- Forge's 59-element manifest is correct: NAC3 attached, plugin registered, every decorated element discoverable via NAC.click_by_verb when its containing route is mounted.
- The SPA-aware nac-bridge (MutationObserver) works: late-mounted action/navigation/tab elements get bound automatically as routes change.
One JSX component, many DOM instances, same nac_id.
Ran a Pilot-style operational demo: tell Sonnet 4.6 "log in as sumi@yujin.app", give it the live manifest, let it pick verbs turn by turn, dispatch via NAC.click_by_verb. Every dispatch returned OK (the bridge fires the contract event), but the URL never changed. The model picked conduit.navigation.nav_item verb navigate six turns in a row -- each click hit the same DOM element (the first NavItem) which happened to be the Home link, not the SignIn link.
Diagnosis: Conduit's Navbar uses a <NavItem> component reused for Home, SignIn, SignUp, Settings, Profile. --ai-apply sees the component once, decorates the JSX with data-nac-id="conduit.navigation.nav_item", and that single ID propagates to every rendered instance. The manifest deduplicates by ID, so the 5+ DOM instances collapse into 1 manifest entry. NAC.click_by_verb then matches the first DOM occurrence -- always the same one, regardless of which the user actually wanted.
| Finding | What to do |
|---|---|
| Component reuse generates duplicate nac_ids | Per-component-instance suffixing (e.g. conduit.navigation.nav_item[home], nav_item[sign_in]) derived from the prop / children text. Next iteration of --ai-apply. |
| Manifest dedup hides this from the agent | Surface duplicate-id warnings during apply. Optionally: emit one manifest entry per occurrence rather than per declaration. |
| Model needs disambiguation context | System prompt extension: when multiple manifest entries share a verb, ask Pilot to use additional payload like child-text or aria-label. |
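The failure mode can be reproduced without a browser. The sketch below models the rendered navbar as a plain array (a hypothetical stand-in for the DOM, not NAC3 code) and shows how deduplication by id plus first-match dispatch always lands on the Home link, along with the kind of duplicate-id report the findings table calls for.

```javascript
// Hypothetical stand-in for the rendered navbar: five instances of one <NavItem>,
// all carrying the id that --ai-apply wrote into the component's JSX once.
const renderedNodes = ["Home", "Sign in", "Sign up", "Settings", "Profile"].map(
  (text) => ({ nacId: "conduit.navigation.nav_item", verb: "navigate", text })
);

// Manifest deduplication by id: many DOM instances collapse into one entry.
function dedupeById(nodes) {
  const byId = new Map();
  for (const n of nodes) {
    if (!byId.has(n.nacId)) byId.set(n.nacId, { id: n.nacId, verb: n.verb });
  }
  return [...byId.values()];
}

// First-match dispatch, as described for click_by_verb before the fix.
function clickFirstMatch(nodes, id) {
  return nodes.find((n) => n.nacId === id) ?? null;
}

// Duplicate-id report: every id that is rendered more than once.
function duplicateIds(nodes) {
  const counts = new Map();
  for (const n of nodes) counts.set(n.nacId, (counts.get(n.nacId) ?? 0) + 1);
  return [...counts].filter(([, c]) => c > 1).map(([id, count]) => ({ id, count }));
}

const manifestEntries = dedupeById(renderedNodes);
const hit = clickFirstMatch(renderedNodes, "conduit.navigation.nav_item");
console.log(manifestEntries.length, hit.text, duplicateIds(renderedNodes));
```

One manifest entry, five rendered instances, and the dispatch resolves to "Home" every time, which matches the six-turn loop observed in the demo.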
Per-instance disambiguation shipped + verified.
Closed finding 04c at the architectural level. --ai-apply now infers per-instance qualifiers from JSX props (label, name, to, slug, etc), bakes them into data-nac-id as a __qualifier suffix, and emits N manifest entries per known instance (or instance_pattern for runtime-dynamic qualifiers). The NAC3 runtime's click_by_verb now accepts { qualifier } and { qualifier_ordinal } in opts for explicit instance targeting.
| Layer | What changed |
|---|---|
| Forge --ai-apply prompt | STATIC / DYNAMIC / UNINFERRABLE doctrine. Per-instance suffix baked into JSX. Warning emitted when reusable component cannot be qualified. |
| Manifest schema | New optional fields: instance_pattern, qualifier_source, warning. |
| NAC3 runtime (nac.browser.js) | click_by_verb(plugin, verb, { qualifier, qualifier_ordinal }). Backward compatible (no qualifier = old behavior). |
| Pilot system prompt | Doctrine explained: when entries share verb but differ in suffix, pass qualifier in payload. |
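To make the qualifier mechanics concrete, here is a minimal sketch under stated assumptions. The prop precedence, the slug rules, the `qualifiedId` and `resolveTarget` helpers, and the reading of `qualifier_ordinal` as an index among matching instances are all hypothetical; only the `__qualifier` suffix format and the two option names come from the text above.

```javascript
// ASSUMPTION: prop precedence and slug rules are illustrative, not Forge's actual ones.
const QUALIFIER_PROPS = ["to", "name", "label", "slug"];

function slugify(value) {
  return String(value).toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

// Build "<base>__<qualifier>" from the first prop that yields a non-empty slug.
function qualifiedId(baseId, props) {
  for (const key of QUALIFIER_PROPS) {
    const slug = key in props ? slugify(props[key]) : "";
    if (slug) return `${baseId}__${slug}`;
  }
  return baseId; // uninferrable: a real tool would surface a warning here
}

// Resolve a target among rendered nodes; no options means the old first-match behaviour.
function resolveTarget(nodes, baseId, opts = {}) {
  const matches = nodes.filter(
    (n) => n.nacId === baseId || n.nacId.startsWith(`${baseId}__`)
  );
  if (opts.qualifier !== undefined) {
    return matches.find((n) => n.nacId === `${baseId}__${opts.qualifier}`) ?? null;
  }
  return matches[opts.qualifier_ordinal ?? 0] ?? null;
}

const base = "conduit.navigation.nav_item";
const nodes = [
  { to: "/", label: "Home" },
  { to: "/login", label: "Sign in" },
  { to: "/register", label: "Sign up" },
].map((props) => ({ nacId: qualifiedId(base, props), label: props.label }));

console.log(nodes.map((n) => n.nacId));
console.log(resolveTarget(nodes, base, { qualifier: "login" }).label); // prints "Sign in"
```

Calling `resolveTarget(nodes, base)` with no options still returns the first instance, which mirrors the backward-compatibility note in the table.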
Verification on Conduit v2 (re-decorated post-H2):
| Metric | Pre-H2 | Post-H2 |
|---|---|---|
| Manifest size | 59 elements | 61 elements (NavItem 1 -> 4 instances, FeedToggler 1 -> 2, DropdownItem 1 -> 3) |
| Elements with __qualifier | 0 | 9 (static enumerated) |
| Elements with instance_pattern | 0 | 31 (dynamic templates) |
| Warnings (uninferrable reusables) | hidden | 12 surfaced with file + suggestion |
| Pilot demo dispatch | loops on Home forever | Sonnet picks navigate + qualifier="login", URL changes to /login, then continues to fill form_fieldset__email |
| Decoration cost | $0.46 | $0.66 (+41%, prompt denser; one-shot per project) |
Route-aware emitter lifts pass-rate 22×.
Closed Finding 1 (H1) at the architectural level. --ai-apply now extracts a route_map from any file that is a React Router root, and each manifest element carries mounted_at: string[] and requires_auth: boolean. The Cypress emitter consumes both: every test does cy.visit(mounted_at[0]) before dispatch, and auth-gated tests insert a loginViaApi() pre-step (drops a JWT in localStorage, no UI navigation needed).
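A rough sketch of what a route-aware emitter could look like. The `emitCypressTest` and `visitTarget` helpers, the treatment of `*` as "mounted everywhere", and the stripping of a trailing wildcard are assumptions for illustration; `mounted_at`, `requires_auth`, `loginViaApi()`, and the `click_by_verb(plugin, verb)` call shape are taken from the text.

```javascript
// ASSUMPTION: "*" means "mounted on every route", and a trailing "*" is a wildcard
// that can be dropped to get a concrete path to visit.
function visitTarget(mountedAt) {
  const first = (mountedAt && mountedAt[0]) || "*";
  const path = first.replace(/\*$/, ""); // "/editor*" -> "/editor", "*" -> ""
  return path === "" ? "/" : path;
}

// Emit the source of one Cypress test for a manifest element and verb.
// loginViaApi() is the pre-step named above; its implementation is not shown here.
function emitCypressTest(pluginSlug, el, verb) {
  const q = JSON.stringify; // quote values as JS string literals
  const lines = [`it(${q(el.id + " :: " + verb)}, () => {`];
  if (el.requires_auth) lines.push("  loginViaApi();");
  lines.push(`  cy.visit(${q(visitTarget(el.mounted_at))});`);
  lines.push(
    `  cy.window().then((win) => win.NAC.click_by_verb(${q(pluginSlug)}, ${q(verb)}));`
  );
  lines.push("});");
  return lines.join("\n");
}

const publish = {
  id: "conduit.action.publish",
  mounted_at: ["/editor*"],
  requires_auth: true,
};
const emitted = emitCypressTest("conduit", publish, "publish");
console.log(emitted);
```

The emitted test logs in first, visits /editor, and only then dispatches the verb, which is the ordering the v1 emitter lacked.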
| Metric | Pre-H1 (v1) | Post-H1 (v3) |
|---|---|---|
| Cypress test count | 66 | 70 (manifest grew by 4 elements post-H2) |
| Cypress pass-rate | 1 / 66 = 1.5% | 23 / 70 = 33% |
| Lift | baseline | 22× |
| Elements with mounted_at | 0 | 42 / 62 |
| Elements marked requires_auth | 0 | 15 / 62 |
| Route distribution | n/a | '*' x19, '/' x3, /article/:slug x5, /editor* x5, /login x3, /profile/:username x2, /register x1, /settings x4 |
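The lift figure follows directly from the two pass-rates in the table; a short check of the arithmetic:

```javascript
// Pass counts and totals copied from the table above.
function passRate(passing, total) {
  return passing / total;
}

const before = passRate(1, 66);  // v1, SPA-naive emitter
const after = passRate(23, 70);  // v3, route-aware emitter
const lift = after / before;     // (23 / 70) / (1 / 66) = 1518 / 70

console.log(
  `${(before * 100).toFixed(1)}% -> ${(after * 100).toFixed(1)}%, lift ${lift.toFixed(1)}x`
);
// prints "1.5% -> 32.9%, lift 21.7x"
```

Rounded to whole numbers, that gives the 33% and 22× quoted here.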
The remaining failures come down to fixture state: tests for favorite_article need an article to exist; tests for follow_author need a second user. The dynamic routes (/article/:slug, /profile/:username) need fixture seed data in the SQLite database before cy.visit. Next iteration of gen-tests: emit beforeEach blocks with Faker-style seed data per dynamic-route test. Per the brief: "If it stays below 70%, investigate before declaring v2 ready." That has been investigated and reported. The route-awareness piece IS shipped; the fixture-awareness piece is the next slice.
Per the brief's communication discipline: gen-tests is not communicated publicly as a finished product until pass-rate is ≥70% sustained. This case study documents the H1 architectural unlock + the honest 22× lift, NOT a sales claim. The "agent-relevant surface coverage" metric (every (nac_id, role, verb) tuple) remains the only quantitative claim safe to publish; the test-pass-rate is documented as an honest WIP signal here.
Sonnet drives Conduit login end-to-end. 4 turns. $0.022.
Two fixes that together unblocked the Pilot demo end-to-end:
- Pilot per-turn state probe now includes the current value of every visible input/textarea (keyed by data-nac-id + fallbacks), plus the navbar text snapshot, plus the last 5 action_history entries. System prompt got a STATE AWARENESS section: "before you emit a fill, check if the field already has the target value -- if it does, ADVANCE."
- NAC.fill is now React-safe. The previous implementation set el.value = newVal directly. React keeps its own record of an input's value by wrapping the value property on the element instance, so a direct assignment never reaches the component state and the next render restores the old value. The fix: call the native setter from the prototype, Object.getOwnPropertyDescriptor(HTMLInputElement.prototype, 'value').set.call(el, newVal). Same pattern for textarea / select / checkbox. Plain DOM behavior unchanged. (Committed in rpaforce-crm@c1bd8dc5.)
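The prototype-setter pattern can be demonstrated outside a browser. The sketch below is a simplified model, not React's or NAC3's code: `FakeInput` stands in for HTMLInputElement, and `installTracker` imitates a framework that shadows the `value` property on the instance and only reports a change when the real value differs from what it last saw.

```javascript
// Stand-in for HTMLInputElement so the sketch runs outside a browser:
// the "native" value accessor lives on the prototype.
class FakeInput {
  constructor() { this.nativeValue = ""; }
  get value() { return this.nativeValue; }
  set value(v) { this.nativeValue = String(v); }
}

// Simplified model of a framework-installed tracker: an own "value" property
// that shadows the prototype accessor and remembers the last value it saw.
function installTracker(el) {
  const desc = Object.getOwnPropertyDescriptor(Object.getPrototypeOf(el), "value");
  let lastSeen = desc.get.call(el);
  Object.defineProperty(el, "value", {
    configurable: true,
    get() { return desc.get.call(this); },
    set(v) { lastSeen = String(v); desc.set.call(this, v); },
  });
  // A change is only reported when the real value differs from what was tracked.
  return { wouldFireChange: () => lastSeen !== desc.get.call(el) };
}

// Direct assignment goes through the shadowing setter, so the tracker sees it.
function naiveFill(el, newVal) {
  el.value = newVal;
}

// Prototype-setter fill, the pattern quoted above: bypasses the own property.
function reactSafeFill(el, newVal) {
  const proto = Object.getPrototypeOf(el);
  Object.getOwnPropertyDescriptor(proto, "value").set.call(el, newVal);
}

const a = new FakeInput();
const trackerA = installTracker(a);
naiveFill(a, "sumi@yujin.app");

const b = new FakeInput();
const trackerB = installTracker(b);
reactSafeFill(b, "sumi@yujin.app");

console.log(trackerA.wouldFireChange(), trackerB.wouldFireChange()); // prints "false true"
```

Both inputs end up holding the new value, but only the prototype-setter path leaves the tracker out of date, so only that path would be treated as a change. In a real page the fill would typically be followed by dispatching a bubbling input event so the framework gets a chance to notice; whether NAC.fill does so is not stated in this case study.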
| Metric | Pre-H4.pre | Post-H4.pre |
|---|---|---|
| Login flow completion | never (loops on fill-email forever) | success=true in 4 turns |
| Turns used | 6 (cap) without success | 4 (navigate, fill email, fill password, click submit) |
| Determinism (3 runs, temp=0) | n/a (didn't complete) | 3 / 3 byte-identical (same tokens, turns, cost) |
| Cost per run | $0.038+ wasted | $0.0222 |
| Tokens (in / out per run) | 9781 / 664 | 6136 / 252 |
- [x] H2 closed: per-instance qualifiers across all 3 layers.
- [~] H1 closed at the architecture level; 33% pass rate (was 1.5%, 22× lift); 70% target needs fixture-state next.
- [x] End-to-end Sonnet demo: login flow completes deterministically at $0.0222.
- [ ] H4 (cheap-model validation across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet} × N≥5) is now unblocked -- the field-state + React-safe fill that gated it both shipped.
- [ ] Public reproducer repo with MIT attribution -- next step.
Two distinct metrics: operational vs structural.
We re-ran the Forge-generated Cypress suite against the live Conduit after H4.pre plus the React-safe NAC.fill fix, expecting a lift over the pre-H4 33%. The number stayed at 23/70 = 33%. Two fixes that looked like they should move it did not, and the reason is clear:
| Fix tried | Expected | Actual | Why |
|---|---|---|---|
| React-safe NAC.fill | field_write tests would rise | no change | Forge's field_write tests do not assert that the value persists; they only dispatch. Pilot (multi-step flow) DOES depend on the real value for the next step, which is why the login demo went from 0% to 100% with the same fix. Different consumers, different effect. |
| Auth localStorage key fix (jwtToken -> loggedUser shape) | auth-gated tests would rise | no change | The main blocker for the 47 failing tests is not auth but fixture state. Tests on /article/:slug and /profile/:username need an article / user seeded in SQLite BEFORE the cy.visit. Without content on those routes, the target element does not mount. |
- Forge gen-tests v2 (structural Cypress): 23 / 70 = 33% pass. Each test is a "checklist entry" per manifest element. It measures surface coverage.
- Pilot end-to-end (operational): 3 / 3 = 100% pass, $0.0222, 4 turns, byte-deterministic. Sonnet drives the SPA through a real multi-step flow.
What would unblock the 70%: gen-tests v3 with fixture-aware emission. For each test whose visit_route contains :param, emit a beforeEach that creates the resource via the API (POST /api/articles, POST /api/users) and aliases the id with cy.wrap().as(). Forge already has the shape of the backend (it saw it during --ai-apply in the src/services/*.js files); the LLM can infer the seed setup. Estimate: 2-3h of dev work, regeneration cost ~$0.50.
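A sketch of what fixture-aware emission might look like, under stated assumptions: the `SEEDERS` mapping from route param to endpoint and response path, the `emitSeedHook` function, and the `seedFor()` helper referenced in the emitted source are all hypothetical. Only the two endpoints and the cy.wrap().as() aliasing come from the paragraph above.

```javascript
// Extract ":param" names from a route pattern such as "/article/:slug".
function routeParams(route) {
  return [...route.matchAll(/:([A-Za-z_][A-Za-z0-9_]*)/g)].map((m) => m[1]);
}

// ASSUMPTION: which endpoint seeds which param, and where the value sits in the response.
const SEEDERS = {
  slug: { url: "/api/articles", pick: "article.slug" },
  username: { url: "/api/users", pick: "user.username" },
};

// Emit a beforeEach block that seeds one resource per route param and aliases it.
// seedFor(...) in the emitted source is a hypothetical payload factory.
function emitSeedHook(route) {
  const q = JSON.stringify; // quote values as JS string literals
  const params = routeParams(route).filter((p) => p in SEEDERS);
  if (params.length === 0) return "";
  const body = params.map((p) => {
    const s = SEEDERS[p];
    return (
      `  cy.request("POST", ${q(s.url)}, seedFor(${q(p)}))\n` +
      `    .then((res) => cy.wrap(res.body.${s.pick}).as(${q(p)}));`
    );
  });
  return ["beforeEach(() => {", ...body, "});"].join("\n");
}

const seedHook = emitSeedHook("/article/:slug");
console.log(seedHook);
```

Static routes such as /login yield an empty string, so only the dynamic-route tests would pay the seeding cost.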
The honest picture at this checkpoint.
| Item | Status |
|---|---|
| Forge migrates Conduit's 47 JSX files end-to-end | verified |
| Manifest quality (semantic IDs, imperative verbs, correct roles) | verified |
| Forge auto-generates 66 e2e cases in 4 frameworks | verified |
| SPA-aware nac-bridge (MutationObserver re-binds on route change) | verified (live probe on running Conduit) |
| NAC3 runtime registers the 59-element manifest on page load | verified (Playwright probe) |
| Pass-rate of the Forge-generated Cypress suite against live Conduit | verified -- 1/66 (1.5%), SPA-naive limitation surfaced honestly |
| Pass-rate of the hand-written Cypress suite against live Conduit | verified -- 5/11 (45%), incl. pre-existing flake |
| Robustness test: refactor CSS, re-run both suites -- which survives? | pending |
| Pilot-style operational demo (Sonnet drives the migrated app) | verified -- surfaced component-reuse duplicate-id finding |
| Operational testing N≥5 across {Gemini Flash Lite, GPT-4o-mini, Haiku, Sonnet} | pending -- gated on fixing component-reuse dedup first |
| Public reproducer repo with MIT attribution to Thinkster | pending |
Three commands to repeat this on your machine.
```shell
# 1. Clone Conduit + install
git clone https://github.com/digitalinnovationone/conduit-realworld-example-app-with-cypress-automation conduit
cd conduit
npm install

# 2. Decorate with Forge (you need a paid seat + ANTHROPIC_API_KEY)
export ANTHROPIC_API_KEY=sk-ant-...
cp -r ./frontend ./frontend-forge-ai
yf migrate ./frontend-forge-ai \
  --ai-apply --ai-plugin-slug conduit --subdir src --ai-out ./frontend-forge-ai

# 3. Generate the test corpus in whatever framework you want
yf gen-tests ./frontend-forge-ai --framework cypress
# or --framework playwright,vitest,wdio,all
```
Total cost to reproduce: ~$0.50 in Anthropic tokens. Total wall time: ~3 minutes including npm install.
Forge + Pilot. Build once, drive forever.
Everything in this case study runs on three pieces of infrastructure: NAC3 v2.3.1 (the protocol, Apache-2.0), Yujin Forge (the build tool, paid seat), and Yujin Pilot (the embedded driver, open core). All three ship today.