Shipwright

An autonomous product loop that compares options, routes the best work, ships code, and publishes the evidence.

Selected direction: Variant A

Experiment cockpit

Active lane: Observe

observe:ab_consensus_work_order:pricing-page-watch

Judge signal: 12

Scored model votes across judge runs.

Deploy: deployed

2 blockers still visible.

Evidence Bundle

Artifact manifest is complete with 30 required evidence files present.

artifact-manifest.json evidence-index.json lane-handoff.md lane-route.svg

Bundle: complete / Missing: 0 / Health: attention / Required: 30 / Ready: 15 / Attention: 4

GitHub Adoption

Reusable Project Setup

Installer output, repo settings, provider readiness, and website publishing are tracked as one onboarding path.

shipwright github-setup --repo matthoffner/shipwright --workflow .github/workflows/shipwright-dogfood.yml --schedule-mode dry_run --dispatch-dry-run --publish-site false

gh workflow run .github/workflows/shipwright-dogfood.yml -f mode=dry_run -f publish_site=false

Workflow pass

Shipwright GitHub adoption wiring is ready.

site/adoption-status.json

Setup helper not recorded

Setup helper has not been recorded for this repo yet.

site/github-setup-status.json

Fresh repo smoke passed

Fresh reusable GitHub adoption smoke passed.

site/adoption-smoke-status.json

Doctor needs attention

Shipwright can start, but 1 readiness warning should be reviewed before schedules cook.

site/doctor-status.json

OpenRouter configured

Codex is configured for OpenRouter with openai/gpt-5.1-codex-mini.

site/codex-provider-status.json

Website deployed

Vercel accepted the latest site deploy.

site/deploy-status.json

Next: Use this same install, github-setup, and manual dry-run path in target GitHub projects.

Product model

Autonomous Product Loop

Shipwright is the loop. A/B consensus is the decision/evidence layer. Lanes are execution modes, not the product concept.

1 Observe
Read product history, runtime evidence, blockers, and current state.
2 Compare
Use A/B consensus, judges, votes, and screenshots to choose a direction.
3 Route
Turn the selected direction into one lane-specific work order.
4 Ship
Let the autonomous worker make bounded code changes and verify them.
5 Report
Publish the changelog, audit trail, deploy status, and website evidence.

Decision / evidence layer: A/B consensus compares variants and produces a work order before Shipwright routes it to a lane.

Current winner: Variant A / Experiment cockpit. Evidence comes from model judges, human votes, Playwright captures, runtime probes, and readiness gates.
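
As a rough illustration of the five-step loop above, the stages can be modeled as an ordered cycle. This is a minimal sketch, not Shipwright's implementation: the `Stage` enum and `next_stage` helper are hypothetical names, and wrapping from Report back to Observe is an assumption based on the product being described as a loop.

```python
from enum import Enum


class Stage(Enum):
    """The five loop stages, in the order the page lists them."""
    OBSERVE = 1
    COMPARE = 2
    ROUTE = 3
    SHIP = 4
    REPORT = 5


def next_stage(stage: Stage) -> Stage:
    """Return the stage that follows `stage`.

    Assumption: Report wraps back to Observe, since the page calls
    Shipwright "the loop". The source does not spell this out.
    """
    members = list(Stage)
    return members[(members.index(stage) + 1) % len(members)]


if __name__ == "__main__":
    stage = Stage.OBSERVE
    for _ in range(len(Stage)):
        print(stage.name.title())
        stage = next_stage(stage)
```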

Lane routing

Active lane: Observe

Observe owns this run; 1 fallback lane remains eligible and 3 lanes are blocked or off.

Active: 1 / Eligible: 1 / Blocked: 2 / Off: 1

Route map

Observe owns this run; Intake is the first fallback.

Intake fallback -> Build blocked -> Review blocked -> Release disabled -> Observe active

Step: 5/5 / Fallback: Intake / Blocked: 2 / Off: 1

Current owner: Observe

Work starts here for this run.

First fallback: Intake

Use this lane if the owner blocks.

Blocked lanes: Build, Review

2 lanes need signal or repair before routing.

Off lanes: Release

1 lane is disabled by configuration.

Handoff artifact: site/lane-handoff.md / Visual artifact: site/lane-route.svg

1 Intake fallback
Self-dogfood runtime work is eligible from the repository roadmap and config
2 Build blocked
No failing check signal is available in the local scaffold
3 Review blocked
No review comment or maintainer mention signal is available in the local scaffold
4 Release disabled
Lane 'release' is disabled in shipwright.yml
5 Observe active
A/B consensus work order 1 is ready for dogfood execution: Watch: Pricing Page Experiment

eligible

Intake

turn roadmap, config, and product notes into implementation work

Next action: intake:advance_runtime
Why now: Self-dogfood runtime work is eligible from the repository roadmap and config
Priority: 80

Evidence

action_status: eligible
candidate_priority: 80
selected: false

blocked

Build

repair failing checks and local verification failures

Next action: build:repair_verification
Why now: No failing check signal is available in the local scaffold
Priority: none

Evidence

action_status: blocked
candidate_priority: none
selected: false

blocked

Review

respond to review comments and maintainer mentions

Next action: review:address_feedback
Why now: No review comment or maintainer mention signal is available in the local scaffold
Priority: none

Evidence

action_status: blocked
candidate_priority: none
selected: false

off

Release

surface ready changes and unblock release decisions

Next action: release:prepare_delivery
Why now: Lane 'release' is disabled in shipwright.yml
Priority: none

Evidence

action_status: blocked
candidate_priority: none
selected: false

active

Observe

improve reports, journals, website output, and run evidence

Next action: observe:ab_consensus_work_order:pricing-page-watch
Why now: A/B consensus work order 1 is ready for dogfood execution: Watch: Pricing Page Experiment
Priority: 60

Evidence

action_status: eligible
candidate_priority: 60
selected: true
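
The lane cards above can be tallied into the one-line routing summary. The sketch below is illustrative only: the `LANES` layout and the `summarize` helper are hypothetical and do not reflect the actual schema of site/lane-board.json. Note that Observe (priority 60) owns the run even though Intake (priority 80) is eligible, so selection is evidently not a plain highest-priority pick; the sketch reads the `selected` flag rather than guessing the rule.

```python
from collections import Counter

# Lane states copied from the lane cards above. The dict layout is a
# hypothetical stand-in, not the real lane-board.json schema.
LANES = [
    {"lane": "Intake", "state": "eligible", "priority": 80, "selected": False},
    {"lane": "Build", "state": "blocked", "priority": None, "selected": False},
    {"lane": "Review", "state": "blocked", "priority": None, "selected": False},
    {"lane": "Release", "state": "off", "priority": None, "selected": False},
    {"lane": "Observe", "state": "active", "priority": 60, "selected": True},
]


def summarize(lanes):
    """Build the routing summary sentence from per-lane states."""
    counts = Counter(lane["state"] for lane in lanes)
    owner = next(lane["lane"] for lane in lanes if lane["selected"])
    unavailable = counts["blocked"] + counts["off"]
    # Pluralization is hard-coded to match this run's counts.
    return (
        f"{owner} owns this run; {counts['eligible']} fallback lane remains "
        f"eligible and {unavailable} lanes are blocked or off."
    )


if __name__ == "__main__":
    print(summarize(LANES))
```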

Evidence Index

Run Evidence

attention

4 evidence components need attention; Observe owns this run; 1 fallback lane remains eligible and 3 lanes are blocked or off.

Ready: 15 / Attention: 4 / Blocked: 0 / Missing: 0

Next Actions

Pricing Page Experiment: needs more signal
Product history Focused Section Size: fail - Largest changelog section has 15 entries; DevBox-style sections stay small and outcome-specific.
Let the dogfood workflow deploy site/ to Vercel and record the result.
Run Shipwright Dogfood in non-dry-run mode to capture worker execution evidence.
Run Shipwright Dogfood in non-dry-run mode to capture worker contract evidence.

Lane Board

ready

selected / site/lane-board.json

Observe owns this run; 1 fallback lane remains eligible and 3 lanes are blocked or off.

Evidence

outcome: dry_run
mode: dry_run
selected_lane: observe
selected_task: observe:ab_consensus_work_order:pricing-page-watch

Next Actions

Execute observe:ab_consensus_work_order:pricing-page-watch in the Observe lane.
Clear blocker for build: No failing check signal is available in the local scaffold
Clear blocker for review: No review comment or maintainer mention signal is available in the local scaffold

Current State

attention

has_blockers / site/current-state.json

Autonomous Dogfood Runtime is the latest product outcome; 0 work orders ready, 2 blocked targets across 2 failing checks.

Evidence

ready_work_orders: 0
blocked: 2
blocking_checks: 2

Next Actions

Pricing Page Experiment: needs more signal
Product history Focused Section Size: fail - Largest changelog section has 15 entries; DevBox-style sections stay small and outcome-specific.

A/B Consensus Queue

ready

ship / site/ab-consensus-queue.json

Shipwright Site is ready for implementation with Variant A; dogfood, visual, and consensus evidence are aligned.

Evidence

work_orders: 1
ship: 2
watch: 1
blocked: 0

Next Actions

Complete the primary unblock action for Pricing Page Experiment: Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
Regenerate the report and confirm the item leaves blocked status or records a narrower blocker.

Doctor

attention

needs_attention / site/doctor-status.json

Shipwright can start, but 1 readiness warning should be reviewed before schedules cook.

Evidence

mode: dry_run
workspace: /home/runner/work/shipwright/shipwright
workflow: .github/workflows/shipwright-dogfood.yml
config: shipwright.yml

Next Actions

Let the dogfood workflow deploy site/ to Vercel and record the result.

Adoption

ready

pass / site/adoption-status.json

Shipwright GitHub adoption wiring is ready.

Evidence

workflow: .github/workflows/shipwright-dogfood.yml
workflow_template: local_dogfood
config: shipwright.yml
targets: shipwright.targets.json

Adoption Smoke

ready

passed / site/adoption-smoke-status.json

Fresh reusable GitHub adoption smoke passed.

Evidence

repo: matthoffner/shipwright
workspace: /tmp/shipwright-adoption-smoke-oSzOdh
template: reusable
steps: 4

Next Actions

Use this same install, github-setup, and manual dry-run path in target GitHub projects.

Project Dependencies

ready

installed / site/project-dependency-status.json

Project dependencies installed with npm ci.

Evidence

workspace: /home/runner/work/shipwright/shipwright
package_manager: npm
package_name: shipwright
lockfile: package-lock.json

Browser Install

ready

installed / site/browser-install-status.json

Playwright browser chromium is installed for surface captures.

Evidence

runtime_dir: /home/runner/work/shipwright/shipwright
browser: chromium
with_deps: true
command: npx playwright install --with-deps chromium

CI Run

ready

passed / site/ci-run-status.json

Shipwright CI run completed.

Evidence

mode: dry_run
preflight: passed
work_cycle: passed
finalize: passed

Next Actions

Run shipwright orchestrate with site/ab-consensus-queue.json.
Run Shipwright in autonomous or yolo mode to allow changelog and git publishing.
Run shipwright ci-finalize to refresh report and deploy evidence.

CI Preflight

ready

passed / site/ci-preflight-status.json

Shipwright CI preflight completed and wrote setup evidence.

Evidence

mode: dry_run
targets: 3
steps: 11
project_dependencies: passed / Project dependencies installed with npm ci.

Next Actions

Run shipwright orchestrate with site/ab-consensus-queue.json.

CI Work Cycle

ready

passed / site/ci-work-cycle-status.json

Shipwright CI work cycle completed in dry_run mode.

Evidence

mode: dry_run
steps: 6
write_prerequisites: skipped
publish: skipped

Next Actions

Run Shipwright in autonomous or yolo mode to allow changelog and git publishing.
Run shipwright ci-finalize to refresh report and deploy evidence.

Safety

ready

pass / site/safety-status.json

Shipwright dry_run mode has the required safety controls.

Evidence

mode: dry_run
checks: 7
pass: 7
warn: 0

Verification

ready

skipped / site/verification-status.json

Verification was skipped because Shipwright is running in dry_run mode.

Evidence

mode: dry_run
commands: 3
skipped: dry_run

Next Actions

Run Shipwright in autonomous or yolo mode to verify post-worker changes.

Deployment

ready

deployed / site/deploy-status.json

Vercel accepted the latest site deploy.

Evidence

provider: vercel
state: deployed
VERCEL_TOKEN: configured
VERCEL_ORG_ID: configured

Next Actions

Verify the live Vercel site renders the latest site artifacts.

CI Finalize

ready

passed / site/ci-finalize-status.json

CI finalize passed; publish_site true; deploy configured_unverified.

Evidence

publish_site: true
deploy_deferred: true
deploy_state: configured_unverified
doctor_state: needs_attention

Git Publish

ready

no_changes / site/git-publish-status.json

Git publishing skipped because Shipwright is running in dry_run mode.

Evidence

mode: dry_run
pushed: false

Next Actions

Run Shipwright in autonomous or yolo mode when changes should be committed and pushed.

Codex Provider

ready

configured / site/codex-provider-status.json

Codex is configured for OpenRouter with openai/gpt-5.1-codex-mini.

Evidence

provider: openrouter
model: openai/gpt-5.1-codex-mini
wire_api: responses
OPENROUTER_API_KEY: configured

Next Actions

Run Codex with OPENROUTER_API_KEY available in the environment.

Codex Worker

attention

not_recorded / site/codex-worker-status.json

No Codex worker run has been recorded for this report.

Evidence

codex-worker-status.json missing

Next Actions

Run Shipwright Dogfood in non-dry-run mode to capture worker execution evidence.

Worker Contract

attention

unknown / site/codex-contract-status.json

No Codex worker contract trace has been recorded for this report.

Evidence

commands: 0
first edit command index: -1

Next Actions

Run Shipwright Dogfood in non-dry-run mode to capture worker contract evidence.
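
The per-component states listed in this Evidence Index add up to the reported totals (15 ready, 4 attention). A small sketch of that tally follows; the `COMPONENT_STATES` mapping is transcribed from the cards above, while the `tally` helper and the data layout are assumptions rather than the format of evidence-index.json.

```python
from collections import Counter

# Component states as listed in the Evidence Index cards above.
COMPONENT_STATES = {
    "Lane Board": "ready",
    "Current State": "attention",
    "A/B Consensus Queue": "ready",
    "Doctor": "attention",
    "Adoption": "ready",
    "Adoption Smoke": "ready",
    "Project Dependencies": "ready",
    "Browser Install": "ready",
    "CI Run": "ready",
    "CI Preflight": "ready",
    "CI Work Cycle": "ready",
    "Safety": "ready",
    "Verification": "ready",
    "Deployment": "ready",
    "CI Finalize": "ready",
    "Git Publish": "ready",
    "Codex Provider": "ready",
    "Codex Worker": "attention",
    "Worker Contract": "attention",
}


def tally(states):
    """Count components per state, including states with zero entries."""
    counts = Counter(states.values())
    return {key: counts.get(key, 0)
            for key in ("ready", "attention", "blocked", "missing")}


if __name__ == "__main__":
    print(tally(COMPONENT_STATES))
    # Components that still need attention, in listing order.
    print([name for name, state in COMPONENT_STATES.items()
           if state == "attention"])
```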

Current State

Source: product history / History score: 86 / Work orders: 0 / Blocked targets: 2 / Failing checks: 2 / Deploy: deployed

Autonomous Dogfood Runtime is the latest product outcome; 0 work orders ready, 2 blocked targets across 2 failing checks.

Journal-independent digest built from product history, consensus, runtime, credential, and screenshot evidence.

Latest outcome: 2026-06-21 / Autonomous Dogfood Runtime

A/B accepted: 2/3

Work orders: 0

Blocked targets: 2

Failing checks: 2

Deploy blocked: 0

Rendered targets: 3/3

Captured surfaces: 3/3

Current State Details: Work orders, blockers, next actions, and evidence sources.

Work Orders

Consensus

Watch: Pricing Page Experiment 80% watch
Pricing Page Experiment is 80% ready; collect one more signal before promotion.
Gate passed: Visual / Gate passed: Dogfood

Ready

Signals

Shipwright A/B Lab: rendered 3/3
Onboarding Flow Experiment: rendered 3/3
Pricing Page Experiment: rendered 3/3
Shipwright A/B Lab: captured visual evidence
Onboarding Flow Experiment: captured visual evidence
Pricing Page Experiment: captured visual evidence

Blocked

Blockers

Pricing Page Experiment: needs more signal
Product history Focused Section Size: fail - Largest changelog section has 15 entries; DevBox-style sections stay small and outcome-specific.

Actions

Split oversized sections into narrower dated outcomes before the website treats product history as healthy.
Add lightweight page events once Shipwright has real recurring users.
Compare first-viewport comprehension after each generated site change.
Capture real usage events after the first UI slice ships.
Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.

Evidence

Sources

Derived from CHANGELOG.md product history, not raw journal prose.
Uses target runtime, credential, screenshot, dogfood plan, and A/B consensus artifacts.
Uses deployment status evidence so the website cannot silently drift behind CI.
Run journal remains audit evidence only.

UI Consensus

A/B consensus winner: Variant A / Status: accepted / Confidence: high / Mode: ab test

Variant A: Experiment cockpit

Should Shipwright lead with a toggleable experiment cockpit or a fast winner board?

Variant A wins because Shipwright is becoming an autonomous A/B platform: generate two UI versions, capture them with Playwright, ask multiple LLMs to judge, let humans vote, then hand the winner to an agent.

Consensus handoff

Render a first-class Variant A / Variant B toggle on the website.

Score

9.1 / 6.7

Margin 2.4; high confidence; threshold 1.5

Judges

3 selected the winning variant.

Experiment subjects

Shipwright keeps built-in UI subjects in the same judge-and-vote comparison set.

Variant Toggle: The A/B presentation candidates Shipwright can hand to a worker.

Variant Toggle

Switch between the two generated UI directions before trusting the winner.

Variant A Selected winner

Experiment cockpit

Lead with a toggleable A/B workspace: Variant A, Variant B, Playwright evidence, model-judge scorecards, and human vote state.

Strengths

Makes each autonomous UI run inspectable without reading raw logs.
Turns the website into a product control surface for A/B decisions.
Gives model judges and humans the same evidence packet.

Risks

Needs strong hierarchy so the cockpit does not feel like raw CI output.
Needs real Playwright captures for each generated variant before claims are trusted.

LLM Judges

GPT-5.1 Judge 9/10

Variant A / reasoning model. The cockpit keeps variants, evidence, model reasoning, and the worker handoff in one decision packet.

Claude Sonnet Judge 8/10

Variant A / product critique. A is more honest about uncertainty because it shows the loser, dissent, and evidence gaps before shipping.

Gemini Judge 7/10

Variant B / visual comparison. B is easier to scan, but it needs the cockpit below the fold to keep the decision trustworthy.

Human Vote Panel

Human Panel 8/10

Variant A / group vote. People need the toggle and vote record before trusting an autonomous winner.

Playwright Evidence

Promotion stays blocked until rendered screenshots and UI signals are captured for the subject.

Consensus Cockpit: Target tabs, evidence, and implementation handoffs.

Judge cockpit: 3 subjects / Artifact: site/ab-tests.json / Mode: offline consensus

UI consensus target summary

One-card-per-target view of the A/B winner, vote confidence, runtime proof, screenshot capture, and first worker action.

Variant A Ready ready

Shipwright Site

Variant A clears the A/B threshold with a margin of 2.4 weighted points.

Votes: 3/3 majority, 0 dissent
Runtime: rendered 3/3
Capture: captured
First Action: Promote Variant A: Experiment cockpit as the default website layout.

Variant A Ready ready

Onboarding Flow Experiment

Variant A clears the A/B threshold with a margin of 1.2 weighted points.

Votes: 3/3 majority, 0 dissent
Runtime: rendered 3/3
Capture: captured
First Action: Promote Variant A: Guided checklist for Onboarding Flow Experiment.

Variant B Ready ready

Pricing Page Experiment

The margin is 0.9, below the 1-point threshold, so more signal is required.

Votes: 3/3 majority, 0 dissent
Runtime: rendered 3/3
Capture: captured
First Action: Promote Variant B: Judge-proof pricing for Pricing Page Experiment.
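
The accept-or-wait decision on these cards comes down to comparing each subject's margin with its threshold. The sketch below illustrates that comparison with the figures reported on this page (margin 2.4 against a 1.5 threshold for Shipwright Site, margin 0.9 against a 1-point threshold for Pricing Page Experiment). The `decision_status` function is a hypothetical name, and treating a margin equal to the threshold as accepted is an assumption; the page does not say how ties are handled. Onboarding Flow Experiment is left out because its threshold is not stated.

```python
from decimal import Decimal


def decision_status(margin: str, threshold: str) -> str:
    """Classify a consensus result by comparing margin to threshold.

    Assumption: a margin equal to the threshold counts as accepted.
    Decimal avoids float surprises with one-decimal inputs.
    """
    if Decimal(margin) >= Decimal(threshold):
        return "accepted"
    return "needs_more_signal"


if __name__ == "__main__":
    # (margin, threshold) pairs as reported on the page.
    print("Shipwright Site:", decision_status("2.4", "1.5"))
    print("Pricing Page Experiment:", decision_status("0.9", "1"))
```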

Shipwright Site

Should Shipwright lead with a toggleable experiment cockpit or a fast winner board?

Winner: Variant A / Confidence: high / Status: accepted / Readiness: ready / Margin: 2.4

Variant A: Experiment cockpit

Variant A clears the A/B threshold with a margin of 2.4 weighted points.

Readiness

Shipwright A/B Lab rendered 3/3 expected signals and can be used as consensus evidence.

Signals

credential: not required
runtime: rendered
start attempt: not attempted
matched signals: 3/3

Next Actions

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.

Surface Snapshot

Surface: rendered / Signals: 0 / Capture: captured

https://shipwright-seven.vercel.app

Shipwright

No UI signals captured.

Variant A: Experiment cockpit winner

Lead with a toggleable A/B workspace: Variant A, Variant B, Playwright evidence, model-judge scorecards, and human vote state.

Risk: Needs strong hierarchy so the cockpit does not feel like raw CI output.

Variant B: Fast winner board

Lead with the winning variant, summarized rationale, and one next action before showing the deeper judge evidence.

Risk: Can hide dissent and weak evidence behind a premature recommendation.

LLM Judges

GPT-5.1 Judge chose Variant A at 9/10: The cockpit keeps variants, evidence, model reasoning, and the worker handoff in one decision packet.
Claude Sonnet Judge chose Variant A at 8/10: A is more honest about uncertainty because it shows the loser, dissent, and evidence gaps before shipping.
Gemini Judge chose Variant B at 7/10: B is easier to scan, but it needs the cockpit below the fold to keep the decision trustworthy.
Human Panel chose Variant A at 8/10: People need the toggle and vote record before trusting an autonomous winner.

LLM Judge Matrix

Consensus Response Packet

Selected Variant A Runner-up Variant B

Should Shipwright lead with a toggleable experiment cockpit or a fast winner board?

3/3 async judges selected Variant A. No dissent recorded for this consensus packet.

Evidence Checklist

Decision status: accepted
Playwright evidence: captured
Runtime evidence: rendered
Strongest criterion: Playwright Evidence +3
Dissent: none

Answer the UI consensus question for Shipwright Site: "Should Shipwright lead with a toggleable experiment cockpit or a fast winner board?" Use Variant A as the selected direction, keep Variant B risk visible, and verify captured visual evidence plus rendered runtime evidence before implementation.

Majority: 3/3 / Dissent: 0 / Playwright: captured / Runtime: rendered

Criterion deltas for Variant A over Variant B
Criterion	Winner	Runner-up	Delta
LLM Judge Agreement	9	7	+2
Playwright Evidence	9	6	+3
Human Vote Clarity	9	6	+3
Agent Handoff	9	8	+1
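
Each delta in the table above is the winner's score minus the runner-up's, and the "Strongest criterion" entry in the Evidence Checklist is the row with the largest delta. A minimal sketch, assuming ties go to the first row listed (which matches the reported "Playwright Evidence +3" but is not documented on the page); `ROWS`, `deltas`, and `strongest` are hypothetical names.

```python
# (criterion, winner score, runner-up score) from the table above.
ROWS = [
    ("LLM Judge Agreement", 9, 7),
    ("Playwright Evidence", 9, 6),
    ("Human Vote Clarity", 9, 6),
    ("Agent Handoff", 9, 8),
]


def deltas(rows):
    """Return (criterion, winner minus runner-up) for each row."""
    return [(name, winner - runner_up) for name, winner, runner_up in rows]


def strongest(rows):
    """Return the row with the largest delta.

    max() keeps the first maximal item, so ties resolve to the
    earlier row. That tie-break is an assumption.
    """
    return max(deltas(rows), key=lambda item: item[1])


if __name__ == "__main__":
    for name, delta in deltas(ROWS):
        print(f"{name}: {delta:+d}")
    print("Strongest:", strongest(ROWS))
```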

Async judge responses
Judge	Vote	Alignment	Confidence	Evidence
gpt-5.1-codex-mini Judge (openrouter model judge / openai/gpt-5.1-codex-mini)	Variant A	majority	8/10	judge-run:judge-run-landing-shipwright-site, source:openrouter, Variant A weighted score, Variant A strength, Variant B risk
gemini-2.5-flash-lite Judge (openrouter model judge / google/gemini-2.5-flash-lite)	Variant A	majority	8/10	judge-run:judge-run-landing-shipwright-site, source:openrouter, Variant A thesis, Variant B risk
claude-haiku-4.5 Judge (openrouter model judge / anthropic/claude-haiku-4.5)	Variant A	majority	9/10	judge-run:judge-run-landing-shipwright-site, source:openrouter, Variant A thesis: Makes each autonomous UI run inspectable without reading raw logs, Variant A strength: Gives model judges and humans the same evidence packet, Variant A LLM Judge Agreement: 9/10 - The model judges can inspect both variants and explain agreement or dissent, Variant B risk: Can hide dissent and weak evidence behind a premature recommendation, Variant B Agent Handoff: 8/10 - weaker evidence context can make autonomous work brittle
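
The "Majority 3/3, Dissent 0" figures follow directly from the Vote column of the async judge table. A short sketch of that count, with the votes transcribed from the table; the `majority` helper and the dict layout are hypothetical and not taken from site/ab-tests.json.

```python
from collections import Counter

# Votes from the async judge responses table above.
VOTES = {
    "gpt-5.1-codex-mini": "Variant A",
    "gemini-2.5-flash-lite": "Variant A",
    "claude-haiku-4.5": "Variant A",
}


def majority(votes):
    """Return (winning variant, 'agreeing/total', dissent count)."""
    counts = Counter(votes.values())
    winner, agreeing = counts.most_common(1)[0]
    dissent = len(votes) - agreeing
    return winner, f"{agreeing}/{len(votes)}", dissent


if __name__ == "__main__":
    print(majority(VOTES))
```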

Scorecard

Variant A 9.1

Experiment cockpit

Variant B 6.7

Fast winner board

Dissent

The visual judge prefers Variant B's faster first impression for casual readers.
Variant A still needs real per-variant Playwright screenshots before it can be treated as production-grade evidence.

Implementation Handoff

Use site/ab-tests.json as the source of truth. Improve the Shipwright website by implementing the winning A/B consensus direction while preserving product history, dissent, and machine-readable artifacts.

Readiness: ready.
Shipwright A/B Lab rendered 3/3 expected signals and can be used as consensus evidence.
Next: Keep this subject in the default experiment path.
Next: Use this rendered runtime as the baseline for the next UI consensus comparison.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake
curl -fsSL https://shipwright-seven.vercel.app/ab-tests.json

Evidence Gaps

follow_up: landing-live-behavior - The winner is based on offline consensus, not live visitor behavior.
follow_up: landing-first-impression - Design dissent says the product story still needs to survive the evidence-first layout.

Next Actions

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.

Onboarding Flow Experiment

Should onboarding lead with a guided checklist or an autonomous summary?

Winner: Variant A / Confidence: medium / Status: accepted / Readiness: ready / Margin: 1.2

Variant A: Guided checklist

Variant A clears the A/B threshold with a margin of 1.2 weighted points.

Readiness

Onboarding Flow Experiment rendered 3/3 expected signals and can be used as consensus evidence.

Signals

credential: not required
runtime: rendered
start attempt: not attempted
matched signals: 3/3

Next Actions

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.

Surface Snapshot

Surface: metadata only / Signals: 2 / Capture: captured

metadata

Untitled surface

npm run check
npm run report

Variant A: Guided checklist winner

Lead onboarding with a concrete sequence of setup, first experiment, judge review, and publish steps.

Risk: Can feel too procedural for experienced users who already know the flow.

Variant B: Autonomous summary

Lead onboarding with what Shipwright already inferred and one high-confidence next action.

Risk: Can hide important setup gaps if the summary is too confident.

LLM Judges

GPT-5.1 Judge chose Variant A at 9/10: Onboarding Flow Experiment should optimize for the experiment job named by its surface: Built-in onboarding UI variant.
Claude Sonnet Judge chose Variant A at 8/10: The winning variant creates clearer product intent and a better autonomous handoff.
Gemini Judge chose Variant B at 7/10: The alternate keeps useful pressure on visual simplicity and first-impression quality.
Human Panel chose Variant A at 8/10: The group vote favors the direction with the clearest implementation and review path.

LLM Judge Matrix

Consensus Response Packet

Selected Variant A Runner-up Variant B

Should onboarding lead with a guided checklist or an autonomous summary?

3/3 async judges selected Variant A. No dissent recorded for this consensus packet.

Evidence Checklist

Decision status: accepted
Playwright evidence: captured
Runtime evidence: rendered
Strongest criterion: Implementation Fit +2
Dissent: none

Answer the UI consensus question for Onboarding Flow Experiment: "Should onboarding lead with a guided checklist or an autonomous summary?" Use Variant A as the selected direction, keep Variant B risk visible, and verify captured visual evidence plus rendered runtime evidence before implementation.

Majority: 3/3 / Dissent: 0 / Playwright: captured / Runtime: rendered

Criterion deltas for Variant A over Variant B
Criterion	Winner	Runner-up	Delta
Experiment Fit	9	8	+1
Judgeability	8	7	+1
Implementation Fit	9	7	+2
Evidence Quality	8	7	+1

Async judge responses
Judge	Vote	Alignment	Confidence	Evidence
gpt-5.1-codex-mini Judge (openrouter model judge / openai/gpt-5.1-codex-mini)	Variant A	majority	8/10	judge-run:judge-run-target-onboarding-flow, source:openrouter, Variant A thesis, Variant B risk
gemini-2.5-flash-lite Judge (openrouter model judge / google/gemini-2.5-flash-lite)	Variant A	majority	8/10	judge-run:judge-run-target-onboarding-flow, source:openrouter, Variant A thesis, Variant B risk
claude-haiku-4.5 Judge (openrouter model judge / anthropic/claude-haiku-4.5)	Variant A	majority	9/10	judge-run:judge-run-target-onboarding-flow, source:openrouter, Variant A weighted score 8.6 vs Variant B 7.4, Variant A Judgeability 8/10 vs Variant B 7/10, Variant B risk: hidden setup gaps if summary too confident, Variant A strength: makes first successful run obvious, Variant B Implementation Fit 7/10 requires accurate state inference

Scorecard

Variant A 8.6

Guided checklist

Variant B 7.4

Autonomous summary

Dissent

The losing variant may still be better for first-time users or marketing pages.

Implementation Handoff

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant A for Onboarding Flow Experiment. Preserve the primary job "Built-in onboarding UI variant" and verify the generated screenshots before shipping.

Readiness: ready.
Onboarding Flow Experiment rendered 3/3 expected signals and can be used as consensus evidence.
Next: Keep this subject in the default experiment path.
Next: Use this rendered runtime as the baseline for the next UI consensus comparison.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Evidence Gaps

follow_up: onboarding-flow-live-behavior - The current decision is an offline consensus, not a live product experiment.

Next Actions

Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.

Pricing Page Experiment

Should pricing lead with a simple plan comparison or proof from judge outcomes?

Winner: Variant B / Confidence: low / Status: needs more signal / Readiness: ready / Margin: 0.9

Variant B: Judge-proof pricing

The margin is 0.9, below the 1-point threshold, so more signal is required.

Readiness

Pricing Page Experiment rendered 3/3 expected signals and can be used as consensus evidence.

Signals

credential: not required
runtime: rendered
start attempt: not attempted
matched signals: 3/3

Next Actions

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.

Surface Snapshot

Surface: metadata only / Signals: 2 / Capture: captured

metadata

Untitled surface

npm run check
npm run report

Variant A: Simple plan comparison

Lead pricing with straightforward plans, limits, and the first practical upgrade point.

Risk: Underplays the differentiator: autonomous evidence-backed shipping.

Variant B: Judge-proof pricing winner

Lead pricing with proof: experiments run, judges consulted, human votes collected, and winners shipped.

Risk: Requires stronger examples before it beats a simple comparison table.

LLM Judges

GPT-5.1 Judge chose Variant B at 9/10: Pricing Page Experiment should optimize for the experiment job named by its surface: Built-in pricing UI variant.
Claude Sonnet Judge chose Variant B at 8/10: The winning variant creates clearer product intent and a better autonomous handoff.
Gemini Judge chose Variant A at 7/10: The alternate keeps useful pressure on visual simplicity and first-impression quality.
Human Panel chose Variant B at 8/10: The group vote favors the direction with the clearest implementation and review path.

LLM Judge Matrix

Consensus Response Packet

Selected Variant B Runner-up Variant A

Should pricing lead with a simple plan comparison or proof from judge outcomes?

3/3 async judges selected Variant B. No dissent recorded for this consensus packet.

Evidence Checklist

Decision status: needs_more_signal
Playwright evidence: captured
Runtime evidence: rendered
Strongest criterion: Evidence Quality +2
Dissent: none

Answer the UI consensus question for Pricing Page Experiment: "Should pricing lead with a simple plan comparison or proof from judge outcomes?" Use Variant B as the selected direction, keep Variant A risk visible, and verify captured visual evidence plus rendered runtime evidence before implementation.

Majority: 3/3 / Dissent: 0 / Playwright: captured / Runtime: rendered

Criterion deltas for Variant B over Variant A
Criterion	Winner	Runner-up	Delta
Experiment Fit	9	8	+1
Judgeability	9	8	+1
Implementation Fit	8	9	-1
Evidence Quality	9	7	+2

Async judge responses
Judge	Vote	Alignment	Confidence	Evidence
gpt-5.1-codex-mini Judge (openrouter model judge / openai/gpt-5.1-codex-mini)	Variant B	majority	7/10	judge-run:judge-run-target-pricing-page, source:openrouter, Variant B thesis, Variant B Evidence Quality
gemini-2.5-flash-lite Judge (openrouter model judge / google/gemini-2.5-flash-lite)	Variant B	majority	9/10	judge-run:judge-run-target-pricing-page, source:openrouter, Variant B thesis, Variant B strength, Variant B risk, Variant A risk
claude-haiku-4.5 Judge (openrouter model judge / anthropic/claude-haiku-4.5)	Variant B	majority	9/10	judge-run:judge-run-target-pricing-page, source:openrouter, Variant B weighted score 8.9 vs Variant A 8.0, Variant B evidence quality 9/10 vs Variant A 7/10, Variant A risk: underplays autonomous evidence-backed shipping differentiator, Variant B strength: connects price to outcomes and makes platform defensible, Variant B experiment fit 9/10 aligns with Shipwright's core differentiator

Scorecard

Variant A 8

Simple plan comparison

Variant B 8.9

Judge-proof pricing

Dissent

The losing variant may still be better for operators who need dense controls.

Implementation Handoff

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant B for Pricing Page Experiment. Preserve the primary job "Built-in pricing UI variant" and verify the generated screenshots before shipping.

Readiness: ready.
Pricing Page Experiment rendered 3/3 expected signals and can be used as consensus evidence.
Next: Keep this subject in the default experiment path.
Next: Use this rendered runtime as the baseline for the next UI consensus comparison.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Evidence Gaps

follow_up: pricing-page-live-behavior - The current decision is an offline consensus, not a live product experiment.

Next Actions

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.

Scoring Details Rubric, judge votes, and next actions.

A/B Options

Variant A: Experiment cockpit selected

Lead with a toggleable A/B workspace: Variant A, Variant B, Playwright evidence, model-judge scorecards, and human vote state.

Strength: Makes each autonomous UI run inspectable without reading raw logs.

Risk: Needs strong hierarchy so the cockpit does not feel like raw CI output.

Variant B: Fast winner board

Lead with the winning variant, summarized rationale, and one next action before showing the deeper judge evidence.

Strength: Lets a busy operator see what won immediately.

Risk: Can hide dissent and weak evidence behind a premature recommendation.

Weighted Rubric

Variant A 9.1

Experiment cockpit

LLM Judge Agreement: 9/10, weighted 2.7
Playwright Evidence: 9/10, weighted 2.3
Human Vote Clarity: 9/10, weighted 2.3
Agent Handoff: 9/10, weighted 1.8

Variant B 6.7

Fast winner board

LLM Judge Agreement: 7/10, weighted 2.1
Playwright Evidence: 6/10, weighted 1.5
Human Vote Clarity: 6/10, weighted 1.5
Agent Handoff: 8/10, weighted 1.6

Decision threshold: Variant A leads by 2.4 weighted points, clearing the 1.5-point A/B threshold.

LLM Judge Agreement 30%

Multiple model judges can compare the variants and explain the same winner.

Playwright Evidence 25%

The decision is grounded in rendered UI capture instead of prose alone.

Human Vote Clarity 25%

A group of people can vote, see dissent, and understand how their input affects the result.

Agent Handoff 20%

The selected variant produces a concrete implementation prompt for an autonomous worker.
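
As a minimal sketch of how the weighted totals in this rubric could be computed, assuming each criterion contributes weight times score and the total is rounded to one decimal for display. The type and function names here are hypothetical; only the weights and the Variant B scores come from the rubric.

```typescript
// Hypothetical names; weights and scores are copied from the rubric above.
interface Criterion {
  name: string;
  weight: number; // fraction of the total, e.g. 0.3 for 30%
  score: number; // judge score out of 10
}

// Sum weight * score and round to one decimal, as the page displays totals.
function weightedTotal(criteria: Criterion[]): number {
  const weightSum = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`weights must sum to 1, got ${weightSum}`);
  }
  const total = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return Math.round(total * 10) / 10;
}

const variantB: Criterion[] = [
  { name: "LLM Judge Agreement", weight: 0.3, score: 7 },
  { name: "Playwright Evidence", weight: 0.25, score: 6 },
  { name: "Human Vote Clarity", weight: 0.25, score: 6 },
  { name: "Agent Handoff", weight: 0.2, score: 8 },
];

console.log(weightedTotal(variantB)); // 6.7
```

With whole-number scores of 9 on every criterion, the same formula yields 9.0, while adding the displayed contributions (2.7 + 2.3 + 2.3 + 1.8) yields 9.1. The published Variant A total therefore presumably comes from unrounded judge scores or from summing rounded contributions; the page does not say which.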

LLM Judges

GPT-5.1 Judge 9/10

Variant A / reasoning model. The cockpit keeps variants, evidence, model reasoning, and the worker handoff in one decision packet.

Claude Sonnet Judge 8/10

Variant A / product critique. A is more honest about uncertainty because it shows the loser, dissent, and evidence gaps before shipping.

Gemini Judge 7/10

Variant B / visual comparison. B is easier to scan, but it needs the cockpit below the fold to keep the decision trustworthy.

Human Panel 8/10

Variant A / group vote. People need the toggle and vote record before trusting an autonomous winner.

Next Actions

Render a first-class Variant A / Variant B toggle on the website.

Keep model judges, human votes, and Playwright evidence in the same decision packet.

Generate worker prompts from the winning variant only after evidence gates pass.

Judge Runs

Source openrouter Provider scored Winner Variant A Judges 3

Shipwright Site

3/3 model judges selected Variant A for Shipwright Site.

Screenshots

Variant A: judge-runs/landing-shipwright-site/variant-A.png
Variant B: judge-runs/landing-shipwright-site/variant-B.png

Votes

gpt-5.1-codex-mini Judge: Variant A (8/10)
gemini-2.5-flash-lite Judge: Variant A (8/10)
claude-haiku-4.5 Judge: Variant A (9/10)

Source openrouter Provider scored Winner Variant A Judges 3

Shipwright A/B Lab

3/3 model judges selected Variant A for Shipwright A/B Lab.

Screenshots

Variant A: judge-runs/target-shipwright-site/variant-A.png
Variant B: judge-runs/target-shipwright-site/variant-B.png

Votes

gpt-5.1-codex-mini Judge: Variant A (8/10)
gemini-2.5-flash-lite Judge: Variant A (8/10)
claude-haiku-4.5 Judge: Variant A (9/10)

Source openrouter Provider scored Winner Variant A Judges 3

Onboarding Flow Experiment

3/3 model judges selected Variant A for Onboarding Flow Experiment.

Screenshots

Variant A: judge-runs/target-onboarding-flow/variant-A.png
Variant B: judge-runs/target-onboarding-flow/variant-B.png

Votes

gpt-5.1-codex-mini Judge: Variant A (8/10)
gemini-2.5-flash-lite Judge: Variant A (8/10)
claude-haiku-4.5 Judge: Variant A (9/10)

Source openrouter Provider scored Winner Variant B Judges 3

Pricing Page Experiment

3/3 model judges selected Variant B for Pricing Page Experiment.

Screenshots

Variant A: judge-runs/target-pricing-page/variant-A.png
Variant B: judge-runs/target-pricing-page/variant-B.png

Votes

gpt-5.1-codex-mini Judge: Variant B (7/10)
gemini-2.5-flash-lite Judge: Variant B (9/10)
claude-haiku-4.5 Judge: Variant B (9/10)
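
The judge-run summary line can be derived from the vote list with a simple tally. This is a sketch under assumptions: the `JudgeVote` shape and `summarizeJudgeRun` name are hypothetical, and the tie handling is invented for illustration since no tie appears in the report.

```typescript
// Hypothetical shapes; only the judge names, variants, and scores come from the report.
type Variant = "Variant A" | "Variant B";

interface JudgeVote {
  judge: string;
  variant: Variant;
  score: number; // out of 10
}

// Count votes per variant and phrase the result like the judge-run summary line.
function summarizeJudgeRun(subject: string, votes: JudgeVote[]): string {
  const counts: Record<Variant, number> = { "Variant A": 0, "Variant B": 0 };
  for (const vote of votes) {
    counts[vote.variant] += 1;
  }
  // Assumption: a tie is reported rather than resolved.
  if (counts["Variant A"] === counts["Variant B"]) {
    return `No majority among ${votes.length} model judges for ${subject}.`;
  }
  const winner: Variant =
    counts["Variant A"] > counts["Variant B"] ? "Variant A" : "Variant B";
  return `${counts[winner]}/${votes.length} model judges selected ${winner} for ${subject}.`;
}

const pricingVotes: JudgeVote[] = [
  { judge: "gpt-5.1-codex-mini Judge", variant: "Variant B", score: 7 },
  { judge: "gemini-2.5-flash-lite Judge", variant: "Variant B", score: 9 },
  { judge: "claude-haiku-4.5 Judge", variant: "Variant B", score: 9 },
];

console.log(summarizeJudgeRun("Pricing Page Experiment", pricingVotes));
// 3/3 model judges selected Variant B for Pricing Page Experiment.
```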

A/B Test Consensus

Ship 2 Blocked 0 Watch 1

2/3 A/B consensus items are shippable; 0 blocked; 1 needs watch.

Operator queue derived from A/B tests, consensus lanes, judge-panel matrices, and dogfood readiness.
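
A rough sketch of how the queue headline could be produced from per-item statuses. The `QueueStatus` type and `queueSummary` function are hypothetical names; the three statuses and the wording follow the summary line above.

```typescript
// Hypothetical helper; the three statuses mirror the Ship / Blocked / Watch counters.
type QueueStatus = "ship" | "blocked" | "watch";

function queueSummary(statuses: QueueStatus[]): string {
  const count = (wanted: QueueStatus): number =>
    statuses.filter((status) => status === wanted).length;
  const ship = count("ship");
  const blocked = count("blocked");
  const watch = count("watch");
  // Agree the verb with the count so "1 needs watch" and "2 need watch" both read correctly.
  const verb = watch === 1 ? "needs" : "need";
  return `${ship}/${statuses.length} A/B consensus items are shippable; ${blocked} blocked; ${watch} ${verb} watch.`;
}

console.log(queueSummary(["watch", "ship", "ship"]));
// 2/3 A/B consensus items are shippable; 0 blocked; 1 needs watch.
```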

Priority #1 Score 189 watch watch Ready 80% Effort medium Variant B

Watch: Pricing Page Experiment

Pricing Page Experiment is 80% ready; collect one more signal before promotion.

Next verification: node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json

Consensus Handoff Packet

Improve Shipwright's report, queue, or target metadata handoff for Pricing Page Experiment; do not edit the target app source.

Metric

Buyer confidence: a visitor can connect price to shipped A/B outcomes and judge evidence.

Runtime Proof

The margin is 0.9, below the 1-point threshold, so more signal is required. The rendered surface was captured and can be reviewed alongside the scorecard.

Target Metadata

Run state: watch; dogfood ready; visual captured; promotion needs_more_signal.
Operator metric: Buyer confidence: a visitor can connect price to shipped A/B outcomes and judge evidence.
Consensus: Variant B; 3 majority / 0 dissent; score delta +0.9.
Next action: Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
Verification: npm run check

Operator Trust

Run state: watch; dogfood ready; visual captured; promotion needs_more_signal.

Source Boundary

.shipwright/targets/pricing-page is dogfood evidence only; durable changes for this work order belong in Shipwright files.

First Edit

Start in src/report.ts, src/ab-consensus-queue.ts, and their focused tests before inspecting any generated site artifact.

Worker Edit Recipe

Real command 1: run the required rg command from the execution prompt.
Real command 2: inspect one focused src/report.ts or src/ab-consensus-queue.ts range, 160 lines or less.
Real command 3: inspect one final adjacent source range only if needed; do not run another search.
Real command 4: edit src/ab-consensus-queue.ts, src/report.ts, or the matching focused test.
Change shape: make the Pricing Page Experiment target handoff clearer without editing .shipwright/targets/pricing-page.
Verify with npm run check before finishing.

Verify Next

node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json

Experiment Packet

Assignment experiment subject Readout needs more signal

pricing-page selects Variant B with low confidence; 2 observed signals and 1 missing signal.

Allocation

Variant A: 50% - Simple plan comparison
Variant B: 50% - Judge-proof pricing
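
The report only declares a 50/50 split; it does not describe how live visitors would be assigned. One common approach, sketched here purely as an assumption, is deterministic bucketing: hash a stable unit id together with the test id and map the result onto the split. `fnv1a` and `assignVariant` are hypothetical names, and the hash runs over UTF-16 code units, which matches byte-wise FNV-1a only for ASCII input.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (illustrative, not from Shipwright).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Map a unit onto the declared split: buckets 0-49 get Variant A, 50-99 get Variant B.
function assignVariant(testId: string, unitId: string): "Variant A" | "Variant B" {
  const bucket = fnv1a(`${testId}:${unitId}`) % 100;
  return bucket < 50 ? "Variant A" : "Variant B";
}

// The same unit always lands in the same variant for a given test.
console.log(assignVariant("target-pricing-page-ab-test", "visitor-1"));
```

Keying the hash on both the test id and the unit id keeps assignments stable across visits while letting the same visitor fall into different arms of different tests.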

Events

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Missing Signals

accepted decision threshold

Proof

Gate passed: Visual
Gate passed: Dogfood
Gate passed: Judge Matrix
Gate needs work: Decision: watch
Gate needs work: Promotion: watch

Acceptance

Complete the primary unblock action for Pricing Page Experiment: Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
Regenerate the report and confirm the item leaves watch status or records a narrower blocker.

Failed Gates

Decision: watch
Promotion: watch

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant B for Pricing Page Experiment. Preserve the primary job "Built-in pricing UI variant" and verify the generated screenshots before shipping.

Recommended ship winner Status ship

Shipwright Site

Shipwright Site is ready for implementation with Variant A; dogfood, visual, and consensus evidence are aligned.

Recommended Next

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.
landing-live-behavior: Add lightweight page events once Shipwright has real recurring users.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake
curl -fsSL https://shipwright-seven.vercel.app/ab-tests.json

Use site/ab-tests.json as the source of truth. Improve the Shipwright website by implementing the winning A/B consensus direction while preserving product history, dissent, and machine-readable artifacts.

Clear First

0 items need blocker resolution before implementation.

No items in this lane.

Ready To Ship

2 items have aligned consensus and dogfood evidence.

Status ship landing Variant A Confidence high Ready 100% Dogfood ready Playwright captured Delta 2.4

Shipwright Site

Time-to-action: can a returning operator identify the next useful Shipwright move?

Ship winner / Shipwright Site is 100% ready; assign Variant A to a worker with the promotion packet.

3 majority / 0 dissent / ready for build

A/B variant comparison

Winner Variant A Runner-up Variant B Margin 2.4

Winning: Experiment cockpit (9.1)
Runner-up: Fast winner board (6.7)
Votes: A:3 / B:1
Why: Lead with a toggleable A/B workspace: Variant A, Variant B, Playwright evidence, model-judge scorecards, and human vote state.
Risk: Can hide dissent and weak evidence behind a premature recommendation.

Experiment Packet

Assignment synthetic judge Readout accepted

shipwright-site selects Variant A with high confidence; 3 observed signals and 0 missing signals.

Allocation

Variant A: 50% - Experiment cockpit
Variant B: 50% - Fast winner board

Events

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Gates

Decision pass accepted with high confidence and +2.4 margin.
Promotion pass ready for build
Visual pass Captured Shipwright A/B Lab at https://shipwright-seven.vercel.app.
Dogfood pass Shipwright A/B Lab is ready.
Judge Matrix pass 3/3 judges aligned with 0 dissent.

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.
landing-live-behavior: Add lightweight page events once Shipwright has real recurring users.

Criteria

Experiment Fit: 9/6 (+3)
Judgeability: 8/7 (+1)
Implementation Fit: 9/6 (+3)
Evidence Quality: 9/5 (+4)

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake
curl -fsSL https://shipwright-seven.vercel.app/ab-tests.json

Use site/ab-tests.json as the source of truth. Improve the Shipwright website by implementing the winning A/B consensus direction while preserving product history, dissent, and machine-readable artifacts.

Status ship target Variant A Confidence medium Ready 100% Dogfood ready Playwright captured Delta 1.2

Onboarding Flow Experiment

Activation clarity: a new user can identify the first experiment, judge review, and publish path.

Ship winner / Onboarding Flow Experiment is 100% ready; assign Variant A to a worker with the promotion packet.

3 majority / 0 dissent / ready for build

A/B variant comparison

Winner Variant A Runner-up Variant B Margin 1.2

Winning: Guided checklist (8.6)
Runner-up: Autonomous summary (7.4)
Votes: A:3 / B:1
Why: Lead onboarding with a concrete sequence of setup, first experiment, judge review, and publish steps.
Risk: Can hide important setup gaps if the summary is too confident.

Experiment Packet

Assignment experiment subject Readout accepted

onboarding-flow selects Variant A with medium confidence; 3 observed signals and 0 missing signals.

Allocation

Variant A: 50% - Guided checklist
Variant B: 50% - Autonomous summary

Events

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Gates

Decision pass accepted with medium confidence and +1.2 margin.
Promotion pass ready for build
Visual pass Captured Onboarding Flow Experiment at targets/onboarding-flow.html.
Dogfood pass Onboarding Flow Experiment is ready.
Judge Matrix pass 3/3 judges aligned with 0 dissent.

Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.
onboarding-flow-live-behavior: Capture real usage events after the first UI slice ships.
Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.

Criteria

Experiment Fit: 9/8 (+1)
Judgeability: 8/7 (+1)
Implementation Fit: 9/7 (+2)
Evidence Quality: 8/7 (+1)

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant A for Onboarding Flow Experiment. Preserve the primary job "Built-in onboarding UI variant" and verify the generated screenshots before shipping.

Watch

1 item needs more signal but is not hard-blocked.

Status watch target Variant B Confidence low Ready 80% Dogfood ready Playwright captured Delta 0.9

Pricing Page Experiment

Buyer confidence: a visitor can connect price to shipped A/B outcomes and judge evidence.

Watch / Pricing Page Experiment is 80% ready; collect one more signal before promotion.

3 majority / 0 dissent / needs more signal

A/B variant comparison

Winner Variant B Runner-up Variant A Margin 0.9

Winning: Judge-proof pricing (8.9)
Runner-up: Simple plan comparison (8)
Votes: A:1 / B:3
Why: Lead pricing with proof: experiments run, judges consulted, human votes collected, and winners shipped.
Risk: Underplays the differentiator: autonomous evidence-backed shipping.

Experiment Packet

Assignment experiment subject Readout needs more signal

pricing-page selects Variant B with low confidence; 2 observed signals and 1 missing signal.

Allocation

Variant A: 50% - Simple plan comparison
Variant B: 50% - Judge-proof pricing

Events

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Missing Signals

accepted decision threshold

Gates

Decision watch needs more signal with low confidence and +0.9 margin.
Promotion watch needs more signal
Visual pass Captured Pricing Page Experiment at targets/pricing-page.html.
Dogfood pass Pricing Page Experiment is ready.
Judge Matrix pass 3/3 judges aligned with 0 dissent.
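
The gate list above can be read as a simple rule: promotion proceeds only when every gate passes, and anything else is reported as a failed gate. The following is a sketch under that assumption; the `Gate` shape and function names are hypothetical, and the `fail` status is an invented placeholder that does not appear in this report.

```typescript
// Hypothetical gate model; "pass" and "watch" appear in the report, "fail" is assumed.
type GateStatus = "pass" | "watch" | "fail";

interface Gate {
  name: string;
  status: GateStatus;
}

// Every gate that is not passing, formatted like the Failed Gates list.
function failedGates(gates: Gate[]): string[] {
  return gates
    .filter((gate) => gate.status !== "pass")
    .map((gate) => `${gate.name}: ${gate.status}`);
}

// Promotion requires at least one gate and no non-passing gates.
function canPromote(gates: Gate[]): boolean {
  return gates.length > 0 && failedGates(gates).length === 0;
}

const pricingGates: Gate[] = [
  { name: "Decision", status: "watch" },
  { name: "Promotion", status: "watch" },
  { name: "Visual", status: "pass" },
  { name: "Dogfood", status: "pass" },
  { name: "Judge Matrix", status: "pass" },
];

console.log(failedGates(pricingGates)); // ["Decision: watch", "Promotion: watch"]
console.log(canPromote(pricingGates)); // false
```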

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
pricing-page-live-behavior: Capture real usage events after the first UI slice ships.
Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.

Criteria

Experiment Fit: 9/8 (+1)
Judgeability: 9/8 (+1)
Implementation Fit: 8/9 (-1)
Evidence Quality: 9/7 (+2)
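
Each criteria line pairs the winner's score with the runner-up's score and shows the difference in parentheses. A small parser, sketched here with hypothetical names, can check that the printed delta agrees with the two scores and add up the net advantage.

```typescript
// Hypothetical parser for lines such as "Implementation Fit: 8/9 (-1)".
interface CriterionDelta {
  name: string;
  winner: number;
  runnerUp: number;
  delta: number;
}

function parseCriterion(line: string): CriterionDelta {
  const match = /^(.+):\s*(\d+)\/(\d+)\s*\(([+-]?\d+)\)$/.exec(line.trim());
  if (!match) {
    throw new Error(`unrecognized criterion line: ${line}`);
  }
  const winner = Number(match[2]);
  const runnerUp = Number(match[3]);
  const delta = winner - runnerUp;
  // Reject lines whose printed delta disagrees with the two scores.
  if (delta !== Number(match[4])) {
    throw new Error(`delta mismatch in line: ${line}`);
  }
  return { name: String(match[1]), winner, runnerUp, delta };
}

const pricingCriteria = [
  "Experiment Fit: 9/8 (+1)",
  "Judgeability: 9/8 (+1)",
  "Implementation Fit: 8/9 (-1)",
  "Evidence Quality: 9/7 (+2)",
].map(parseCriterion);

const netDelta = pricingCriteria.reduce((sum, c) => sum + c.delta, 0);
console.log(netDelta); // 3
```

For the Pricing Page Experiment, the winner trails on Implementation Fit, which is consistent with the narrow overall margin.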

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant B for Pricing Page Experiment. Preserve the primary job "Built-in pricing UI variant" and verify the generated screenshots before shipping.

Queue Next Actions

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.
landing-live-behavior: Add lightweight page events once Shipwright has real recurring users.
landing-first-impression: Compare first-viewport comprehension after each generated site change.
Keep this subject in the default experiment path.

landing offline consensus Winner Variant A Confidence high Phase ready for build Playwright captured Split Variant A: 50%, Variant B: 50%

Shipwright Site

Should Shipwright lead with a toggleable experiment cockpit or a fast winner board?

Primary metric: Time-to-action: can a returning operator identify the next useful Shipwright move?

Score: 9.1 / 6.7

Votes: Variant A: 3, Variant B: 1

Playwright evidence: captured / 1

Decision Rule

Variant A leads by 2.4 weighted points, clearing the 1.5-point A/B threshold. The rendered surface was captured and can be reviewed alongside the scorecard.
Threshold 1.5; margin 2.4; status accepted.
Synthetic product, design, engineering, and operations judges until live traffic exists.

Experiment Packet

Assignment: synthetic judge
Readout: accepted / high
shipwright-site selects Variant A with high confidence; 3 observed signals and 0 missing signals.
Promote Variant A when promotion gates stay green for Time-to-action: can a returning operator identify the next useful Shipwright move?

Allocation

Variant A: 50% - Experiment cockpit
Variant B: 50% - Fast winner board

Event Contract

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Observed Signals

decision threshold accepted
visual evidence captured
runtime matched 3/3 signals

Missing Signals

No missing signals.

Playwright Evidence

Captured Shipwright A/B Lab at https://shipwright-seven.vercel.app.
The rendered surface was captured and can be reviewed alongside the scorecard.
runtime matched 3/3 signals

Image: captures/shipwright-site.png

Guardrails

First-time comprehension of what Shipwright is becoming.
Auditability of the decision through JSON artifacts.
Visibility of dissent and next actions.

Variant A: Experiment cockpit

Lead with a toggleable A/B workspace: Variant A, Variant B, Playwright evidence, model-judge scorecards, and human vote state.

Audience: Returning operators and agents reviewing autonomous runs.

Experience: Evidence-first console with status, changelog, consensus, and next actions.

Signal: More readers can explain the selected lane and next action without opening logs.

Variant B: Fast winner board

Lead with the winning variant, summarized rationale, and one next action before showing the deeper judge evidence.

Audience: First-time readers trying to understand the product story.

Experience: Narrative journal-led page centered on the latest run.

Signal: More readers understand the story, but fewer can act on the current state.

Next Actions

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.

Promotion Packet

Owner: Shipwright website worker
Rollout: Keep the generated site deterministic; publish Variant A as the default until live traffic exists.
Sample: Synthetic consensus with multiple LLM judges, a human panel, and visual evidence status: captured.
Ship: Variant A leads by 2.4 weighted points, clearing the 1.5-point A/B threshold. Ship when the generated site exposes the A/B toggle, LLM judges, human votes, and Playwright evidence together.
Stop: Stop promotion if the first viewport hides current run state, if JSON artifacts stop being generated, or if the page regresses into raw journal text.

Implementation Brief

Promote Variant A: Experiment cockpit as the default website layout.
Keep the changelog as product history and the journal as compact audit evidence.
Preserve dissent in the page so the fast-winner-board risk stays visible.
Make the A/B consensus artifact good enough for another worker to implement without reading source code.

Instrumentation

shipwright_ab_test_exposed: A user or dogfood worker sees either variant. test_id, subject_id, variant_id, run_id, default_test_id:ui-consensus-landing-ab
shipwright_ab_test_primary_signal: The primary metric can be evaluated for the viewed variant. test_id, variant_id, metric_name, metric_value, default_test_id:ui-consensus-landing-ab
shipwright_ab_test_decision: Variant A is promoted, rejected, or sent back for more signal. test_id, winner, runner_up, margin, decision_status, default_test_id:ui-consensus-landing-ab
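
The instrumentation entries above list field names per event, which maps naturally onto a discriminated union. This is a sketch only: the field types are assumptions (the report gives names, not types), `default_test_id` is interpreted as the fallback value for `test_id`, and the guardrail event is omitted because its fields are not listed here.

```typescript
// Field names come from the Instrumentation list; the field types are assumptions.
interface ExposedEvent {
  name: "shipwright_ab_test_exposed";
  test_id: string;
  subject_id: string;
  variant_id: string;
  run_id: string;
}

interface PrimarySignalEvent {
  name: "shipwright_ab_test_primary_signal";
  test_id: string;
  variant_id: string;
  metric_name: string;
  metric_value: number;
}

interface DecisionEvent {
  name: "shipwright_ab_test_decision";
  test_id: string;
  winner: string;
  runner_up: string;
  margin: number;
  decision_status: string;
}

type AbTestEvent = ExposedEvent | PrimarySignalEvent | DecisionEvent;

// Assumed meaning of default_test_id: the test_id used when none is supplied.
const DEFAULT_TEST_ID = "ui-consensus-landing-ab";

function decisionEvent(
  winner: string,
  runnerUp: string,
  margin: number,
  decisionStatus: string,
  testId: string = DEFAULT_TEST_ID,
): DecisionEvent {
  return {
    name: "shipwright_ab_test_decision",
    test_id: testId,
    winner,
    runner_up: runnerUp,
    margin,
    decision_status: decisionStatus,
  };
}

// The switch narrows on the event name, so each branch sees only its own fields.
function describeEvent(event: AbTestEvent): string {
  switch (event.name) {
    case "shipwright_ab_test_exposed":
      return `${event.test_id}: ${event.variant_id} exposed in run ${event.run_id}`;
    case "shipwright_ab_test_primary_signal":
      return `${event.test_id}: ${event.metric_name}=${event.metric_value} for ${event.variant_id}`;
    case "shipwright_ab_test_decision":
      return `${event.test_id}: ${event.winner} over ${event.runner_up} by ${event.margin} (${event.decision_status})`;
  }
}

console.log(describeEvent(decisionEvent("Variant A", "Variant B", 2.4, "accepted")));
// ui-consensus-landing-ab: Variant A over Variant B by 2.4 (accepted)
```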

Evidence Gaps

landing-live-behavior (follow up): The winner is based on offline consensus, not live visitor behavior. Add lightweight page events once Shipwright has real recurring users.
landing-first-impression (follow up): Design dissent says the product story still needs to survive the evidence-first layout. Compare first-viewport comprehension after each generated site change.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake
curl -fsSL https://shipwright-seven.vercel.app/ab-tests.json

Worker Prompt

Use site/ab-tests.json as the source of truth. Improve the Shipwright website by implementing the winning A/B consensus direction while preserving product history, dissent, and machine-readable artifacts.

target offline consensus Winner Variant A Confidence medium Phase ready for build Playwright captured Split Variant A: 50%, Variant B: 50%

Onboarding Flow Experiment

Should onboarding lead with a guided checklist or an autonomous summary?

Primary metric: Activation clarity: a new user can identify the first experiment, judge review, and publish path.

Score: 8.6 / 7.4

Votes: Variant A: 3, Variant B: 1

Playwright evidence: captured / 1

Decision Rule

Variant A leads by 1.2 weighted points, clearing the 1-point A/B threshold. The rendered surface was captured and can be reviewed alongside the scorecard.
Threshold 1; margin 1.2; status accepted.
Synthetic app-specific judges derived from the target consensus rubric.

Experiment Packet

Assignment: experiment subject
Readout: accepted / medium
onboarding-flow selects Variant A with medium confidence; 3 observed signals and 0 missing signals.
Promote Variant A when promotion gates stay green for Activation clarity: a new user can identify the first experiment, judge review, and publish path.

Allocation

Variant A: 50% - Guided checklist
Variant B: 50% - Autonomous summary

Event Contract

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Observed Signals

decision threshold accepted
visual evidence captured
runtime matched 3/3 signals

Missing Signals

No missing signals.

Playwright Evidence

Captured Onboarding Flow Experiment at targets/onboarding-flow.html.
The rendered surface was captured and can be reviewed alongside the scorecard.
runtime matched 3/3 signals

Image: captures/onboarding-flow.png

Guardrails

The winning direction preserves the target app's declared primary job.
The losing direction's strongest risk remains visible before implementation.
Future dogfood runs have concrete UI signals to inspect.

Variant A: Guided checklist

Lead onboarding with a concrete sequence of setup, first experiment, judge review, and publish steps.

Audience: New operators setting up their first autonomous A/B run.

Experience: Guided checklist for setup, first experiment, judge review, and publish steps.

Signal: New users can complete the first run without reading docs or raw logs.

Variant B: Autonomous summary

Lead onboarding with what Shipwright already inferred and one high-confidence next action.

Audience: Returning operators who want Shipwright to infer the next step.

Experience: Autonomous summary that explains inferred state and recommends one action.

Signal: Operators move faster when setup assumptions are already correct.

Next Actions

Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.

Promotion Packet

Owner: Onboarding Flow Experiment experiment worker
Rollout: Build as a generated UI variant first, then promote the winner after judge and human-vote evidence stays coherent.
Sample: Synthetic experiment judges plus visual evidence status: captured.
Ship: Variant A leads by 1.2 weighted points, clearing the 1-point A/B threshold. Ship when the subject has captured Playwright evidence and no blocking evidence gaps remain.
Stop: Stop promotion if the screenshots are missing, the judge matrix loses consensus, or the losing variant's primary risk becomes a real blocker.

Implementation Brief

Promote Variant A: Guided checklist for Onboarding Flow Experiment.
Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.
Primary target job: Built-in onboarding UI variant.
Keep the losing variant's strongest risk visible in the implementation notes.

Instrumentation

shipwright_ab_test_exposed: A user or dogfood worker sees either variant. test_id, subject_id, variant_id, run_id, default_test_id:target-onboarding-flow-ab-test
shipwright_ab_test_primary_signal: The primary metric can be evaluated for the viewed variant. test_id, variant_id, metric_name, metric_value, default_test_id:target-onboarding-flow-ab-test
shipwright_ab_test_decision: Variant A is promoted, rejected, or sent back for more signal. test_id, winner, runner_up, margin, decision_status, default_test_id:target-onboarding-flow-ab-test

Evidence Gaps

onboarding-flow-live-behavior (follow up): The current decision is an offline consensus, not a live product experiment. Capture real usage events after the first UI slice ships.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Worker Prompt

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant A for Onboarding Flow Experiment. Preserve the primary job "Built-in onboarding UI variant" and verify the generated screenshots before shipping.

target offline consensus Winner Variant B Confidence low Phase needs more signal Playwright captured Split Variant A: 50%, Variant B: 50%

Pricing Page Experiment

Should pricing lead with a simple plan comparison or proof from judge outcomes?

Primary metric: Buyer confidence: a visitor can connect price to shipped A/B outcomes and judge evidence.

Score: 8.9 / 8

Votes: Variant A: 1, Variant B: 3

Playwright evidence: captured / 1

Decision Rule

The margin is 0.9, below the 1-point threshold, so more signal is required. The rendered surface was captured and can be reviewed alongside the scorecard.
Threshold 1; margin 0.9; status needs more signal.
Synthetic app-specific judges derived from the target consensus rubric.
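
The decision rule compares the score margin against a per-test threshold. Here is a minimal sketch of that comparison, with hypothetical names. It assumes a margin exactly equal to the threshold counts as accepted, which the report does not confirm, and it rounds the margin to one decimal to avoid floating-point noise.

```typescript
// Hypothetical decision helper; the scores and thresholds used below come from the report.
type DecisionStatus = "accepted" | "needs more signal";

interface Decision {
  margin: number;
  threshold: number;
  status: DecisionStatus;
}

function decide(winnerScore: number, runnerUpScore: number, threshold: number): Decision {
  // Round to one decimal so 8.9 - 8 reads as 0.9 rather than 0.9000000000000004.
  const margin = Math.round((winnerScore - runnerUpScore) * 10) / 10;
  // Assumption: meeting the threshold exactly is enough to accept.
  const status: DecisionStatus = margin >= threshold ? "accepted" : "needs more signal";
  return { margin, threshold, status };
}

console.log(decide(8.9, 8, 1)); // margin 0.9, status "needs more signal"
console.log(decide(8.6, 7.4, 1)); // margin 1.2, status "accepted"
console.log(decide(9.1, 6.7, 1.5)); // margin 2.4, status "accepted"
```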

Experiment Packet

Assignment: experiment subject
Readout: needs more signal / low
pricing-page selects Variant B with low confidence; 2 observed signals and 1 missing signal.
Hold Variant B until missing signals are resolved.

Allocation

Variant A: 50% - Simple plan comparison
Variant B: 50% - Judge-proof pricing

Event Contract

shipwright_ab_test_exposed
shipwright_ab_test_primary_signal
shipwright_ab_test_guardrail_signal
shipwright_ab_test_decision

Observed Signals

visual evidence captured
runtime matched 3/3 signals

Missing Signals

accepted decision threshold
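
The readout sentence in the Experiment Packet follows directly from the observed and missing signal lists. A sketch with hypothetical function names, assuming the hold recommendation is triggered by any missing signal:

```typescript
// Hypothetical helpers; the signal names and wording follow the packet above.
function plural(count: number, noun: string): string {
  return `${count} ${noun}${count === 1 ? "" : "s"}`;
}

function readout(
  subjectId: string,
  winner: string,
  confidence: string,
  observed: string[],
  missing: string[],
): string {
  return `${subjectId} selects ${winner} with ${confidence} confidence; ${plural(
    observed.length,
    "observed signal",
  )} and ${plural(missing.length, "missing signal")}.`;
}

// Assumption: any missing signal is enough to hold the winner.
function shouldHold(missing: string[]): boolean {
  return missing.length > 0;
}

const observedSignals = ["visual evidence captured", "runtime matched 3/3 signals"];
const missingSignals = ["accepted decision threshold"];

console.log(readout("pricing-page", "Variant B", "low", observedSignals, missingSignals));
// pricing-page selects Variant B with low confidence; 2 observed signals and 1 missing signal.
console.log(shouldHold(missingSignals)); // true
```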

Playwright Evidence

Captured Pricing Page Experiment at targets/pricing-page.html.
The rendered surface was captured and can be reviewed alongside the scorecard.
runtime matched 3/3 signals

Image: captures/pricing-page.png

Guardrails

The winning direction preserves the target app's declared primary job.
The losing direction's strongest risk remains visible before implementation.
Future dogfood runs have concrete UI signals to inspect.

Variant A: Simple plan comparison

Lead pricing with straightforward plans, limits, and the first practical upgrade point.

Audience: Buyers scanning pricing before they understand the platform deeply.

Experience: Simple plan comparison with limits, usage, and the first upgrade moment.

Signal: Visitors understand cost and constraints quickly.

Variant B: Judge-proof pricing

Lead pricing with proof: experiments run, judges consulted, human votes collected, and winners shipped.

Audience: Buyers evaluating whether autonomous experimentation is worth paying for.

Experience: Outcome proof cards showing experiments run, judges consulted, votes collected, and winners shipped.

Signal: Visitors connect pricing to evidence-backed shipping outcomes.

Next Actions

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.

Promotion Packet

Owner: Pricing Page Experiment experiment worker
Rollout: Build as a generated UI variant first, then promote the winner after judge and human-vote evidence stays coherent.
Sample: Synthetic experiment judges plus visual evidence status: captured.
Ship: The margin is 0.9, below the 1-point threshold, so more signal is required. Ship when the subject has captured Playwright evidence and no blocking evidence gaps remain.
Stop: Stop promotion if the screenshots are missing, the judge matrix loses consensus, or the losing variant's primary risk becomes a real blocker.

Implementation Brief

Promote Variant B: Judge-proof pricing for Pricing Page Experiment.
Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
Primary target job: Built-in pricing UI variant.
Keep the losing variant's strongest risk visible in the implementation notes.

Instrumentation

shipwright_ab_test_exposed: A user or dogfood worker sees either variant. test_id, subject_id, variant_id, run_id, default_test_id:target-pricing-page-ab-test
shipwright_ab_test_primary_signal: The primary metric can be evaluated for the viewed variant. test_id, variant_id, metric_name, metric_value, default_test_id:target-pricing-page-ab-test
shipwright_ab_test_decision: Variant B is promoted, rejected, or sent back for more signal. test_id, winner, runner_up, margin, decision_status, default_test_id:target-pricing-page-ab-test

Evidence Gaps

pricing-page-live-behavior (follow up): The current decision is an offline consensus, not a live product experiment. Capture real usage events after the first UI slice ships.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Worker Prompt

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant B for Pricing Page Experiment. Preserve the primary job "Built-in pricing UI variant" and verify the generated screenshots before shipping.

Dogfood Status Current readiness across the built-in experiment subjects.

Ready 3 Blocked 0 Watch 0

3/3 experiment subjects are ready; 0 blocked; 0 need watch.

Single dogfood view derived from target inspection, credentials, runtime starts, rendered probes, surface captures, and consensus handoffs.

Status ready Runtime rendered 3/3 Capture captured Consensus ready for build

Shipwright A/B Lab

Shipwright A/B Lab has rendered runtime, captured surface, and ready consensus evidence.

Target: remote
Credential: not_required
Runtime start: not_recorded
Plan: self

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.
Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.

Evidence

remote url configured: https://shipwright-seven.vercel.app
No repository credential is required for this target.
Fetched https://shipwright-seven.vercel.app; matched 3/3 expected runtime signals.
Captured Shipwright A/B Lab at https://shipwright-seven.vercel.app.

Status ready Runtime rendered 3/3 Capture captured Consensus ready for build

Onboarding Flow Experiment

Onboarding Flow Experiment has rendered runtime, captured surface, and ready consensus evidence.

Target: metadata_only
Credential: not_required
Runtime start: not_recorded
Plan: self

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.
Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.
onboarding-flow-live-behavior: Capture real usage events after the first UI slice ships.

Evidence

target is generated by this repository
No repository credential is required for this target.
Fetched targets/onboarding-flow.html; matched 3/3 expected runtime signals.
Captured Onboarding Flow Experiment at targets/onboarding-flow.html.

Status ready Runtime rendered 3/3 Capture captured Consensus needs more signal

Pricing Page Experiment

Pricing Page Experiment has rendered runtime and a captured surface, but its consensus still needs more signal.

Target: metadata_only
Credential: not_required
Runtime start: not_recorded
Plan: self

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.
Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
pricing-page-live-behavior: Capture real usage events after the first UI slice ships.

Evidence

target is generated by this repository
No repository credential is required for this target.
Fetched targets/pricing-page.html; matched 3/3 expected runtime signals.
Captured Pricing Page Experiment at targets/pricing-page.html.

Experiment Next Actions

Keep this subject in the default experiment path.
Use this rendered runtime as the baseline for the next UI consensus comparison.
Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.
landing-live-behavior: Add lightweight page events once Shipwright has real recurring users.

Consensus Board

Promotion lanes and worker prompts for the next implementation pass.

Source ab tests Ready 2/3 Needs signal 1

2/3 UI consensus subjects are ready for build; 1 needs more signal.

A/B consensus board for deciding what a worker can implement now versus what needs runtime, credential, or screenshot evidence first.

Ready For Build

Consensus winners with accepted decision rules and captured visual evidence.

landing ready for build Winner Variant A Playwright captured Blockers 0

Shipwright Site

Time-to-action: can a returning operator identify the next useful Shipwright move?

Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json as the source of truth. Improve the Shipwright website by implementing the winning A/B consensus direction while preserving product history, dissent, and machine-readable artifacts.

target ready for build Winner Variant A Playwright captured Blockers 0

Onboarding Flow Experiment

Activation clarity: a new user can identify the first experiment, judge review, and publish path.

Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.
onboarding-flow-live-behavior: Capture real usage events after the first UI slice ships.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant A for Onboarding Flow Experiment. Preserve the primary job "Built-in onboarding UI variant" and verify the generated screenshots before shipping.
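
The rule "generate worker prompts from the winning variant only after evidence gates pass" can be sketched as a guard around a prompt template. The gate names below mirror the decision, promotion, visual, dogfood, and judge-matrix gates named in the product history; the `Handoff` shape and the `workerPrompt` function are assumptions, not Shipwright's actual code.

```typescript
// Gate names are taken from the report; the data shape is hypothetical.
type GateName = "decision" | "promotion" | "visual" | "dogfood" | "judge_matrix";

interface Handoff {
  subject: string;    // e.g. "Onboarding Flow Experiment"
  winner: string;     // e.g. "Variant A"
  primaryJob: string; // e.g. "Built-in onboarding UI variant"
  gates: Record<GateName, boolean>;
}

// Return the worker prompt only when every gate has passed; otherwise null.
function workerPrompt(handoff: Handoff): string | null {
  const gateNames = Object.keys(handoff.gates) as GateName[];
  const failed = gateNames.filter((gate) => !handoff.gates[gate]);
  if (failed.length > 0) {
    return null;
  }
  return (
    "Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json " +
    "to implement " + handoff.winner + " for " + handoff.subject + ". " +
    'Preserve the primary job "' + handoff.primaryJob + '" ' +
    "and verify the generated screenshots before shipping."
  );
}
```

Returning `null` instead of a partial prompt keeps a blocked subject from ever producing text a worker could act on.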

Needs More Signal

Consensus winners blocked by missing screenshots, runtime evidence, or credential setup.

target needs more signal Winner Variant B Playwright captured Blockers 0

Pricing Page Experiment

Buyer confidence: a visitor can connect price to shipped A/B outcomes and judge evidence.

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
pricing-page-live-behavior: Capture real usage events after the first UI slice ships.

Verify

npm run check
node dist/cli.js capture-surfaces --registry shipwright.targets.json --site-dir site --output metrics/surface-captures.json
npm run verify:intake

Use site/ab-tests.json, site/consensus-matrix.json, and site/surface-captures.json to implement Variant B for Pricing Page Experiment. Preserve the primary job "Built-in pricing UI variant" and verify the generated screenshots before shipping.

Board Next Actions

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
pricing-page-live-behavior: Capture real usage events after the first UI slice ships.
Render a first-class Variant A / Variant B toggle on the website.
Keep model judges, human votes, and Playwright evidence in the same decision packet.
Generate worker prompts from the winning variant only after evidence gates pass.
landing-live-behavior: Add lightweight page events once Shipwright has real recurring users.

Codex Worker

Autonomous write result, changed files, and failure evidence.

Codex Worker Run

not_recorded

No Codex worker run has been recorded for this report.

Exit none Changed 0 Timed out false

No changed files recorded yet.

Evidence

codex-worker-status.json missing

Next Actions

Run Shipwright Dogfood in non-dry-run mode to capture worker execution evidence.

Worker Contract

Whether the autonomous worker stayed inside the command contract.

Codex Worker Contract

unknown

No Codex worker contract trace has been recorded for this report.

Commands 0 First edit -1 Forbidden 0

Evidence

commands: 0
first edit command index: -1

No command trace captured yet.
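
The three contract counters (commands, first edit index, forbidden) could be computed from a command trace roughly as follows. The prefix lists are purely hypothetical heuristics, since the report does not say how Shipwright classifies edit or forbidden commands. An empty trace reproduces the values shown above.

```typescript
interface ContractSummary {
  commands: number;
  firstEditIndex: number; // -1 when no edit command was seen
  forbidden: number;
}

// Hypothetical classification rules, for illustration only.
const EDIT_PREFIXES = ["apply_patch", "sed -i", "git apply"];
const FORBIDDEN_PREFIXES = ["git push --force", "rm -rf"];

function startsWithAny(command: string, prefixes: string[]): boolean {
  const trimmed = command.trim();
  return prefixes.some((prefix) => trimmed.indexOf(prefix) === 0);
}

// Walk the trace once, recording the first edit and counting forbidden commands.
function summarizeTrace(trace: string[]): ContractSummary {
  let firstEditIndex = -1;
  let forbidden = 0;
  trace.forEach((command, index) => {
    if (firstEditIndex === -1 && startsWithAny(command, EDIT_PREFIXES)) {
      firstEditIndex = index;
    }
    if (startsWithAny(command, FORBIDDEN_PREFIXES)) {
      forbidden += 1;
    }
  });
  return { commands: trace.length, firstEditIndex: firstEditIndex, forbidden: forbidden };
}
```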

Next Actions

Run Shipwright Dogfood in non-dry-run mode to capture worker contract evidence.

UI Experiments

Poll questions, responses, and synthesis outputs.

Shipwright A/B Lab UI Consensus Poll

llm judge / remote / rendered

Autonomous A/B experiment console. Should the experiment console optimize for operator confidence or fast winner selection?

Winner Variant A: Evidence-first surface.

Readiness remote

Surface rendered

Poll Questions

Should the experiment console optimize for operator confidence or fast winner selection? Use Variant A: Evidence-first surface.
What should block Shipwright from implementing the winning UI direction? The losing variant may still be better for first-time users or marketing pages.

Responses

GPT-5.1 Judge selected Variant A at 9/10.
Claude Sonnet Judge selected Variant A at 8/10.
Gemini Judge selected Variant B at 7/10.
Human Panel selected Variant A at 8/10.

Synthesis

3/4 judges selected Variant A.
Shipwright should run Shipwright A/B Lab as an A/B subject with Variant A: Evidence-first surface.
Use the captured rendered surface signals as the baseline for the next UI comparison.
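
The "3/4 judges selected Variant A" line is a simple majority tally over the responses. A minimal sketch, assuming each response reduces to a judge name, a chosen variant, and a score out of 10 (the `Vote` shape, `synthesize` function, and tie wording are hypothetical):

```typescript
interface Vote {
  judge: string;
  variant: "A" | "B";
  score: number; // out of 10, as listed under Responses
}

interface Synthesis {
  winner: "A" | "B" | null; // null when the panel is tied
  summary: string;
}

// Count votes per variant and report the majority, if there is one.
function synthesize(votes: Vote[]): Synthesis {
  const forA = votes.filter((vote) => vote.variant === "A").length;
  const forB = votes.length - forA;
  if (forA === forB) {
    return { winner: null, summary: "Judges are tied " + forA + "/" + votes.length + "; no majority." };
  }
  const winner = forA > forB ? "A" : "B";
  const top = Math.max(forA, forB);
  return {
    winner: winner,
    summary: top + "/" + votes.length + " judges selected Variant " + winner + ".",
  };
}

const labVotes: Vote[] = [
  { judge: "GPT-5.1 Judge", variant: "A", score: 9 },
  { judge: "Claude Sonnet Judge", variant: "A", score: 8 },
  { judge: "Gemini Judge", variant: "B", score: 7 },
  { judge: "Human Panel", variant: "A", score: 8 },
];

const labSynthesis = synthesize(labVotes);
// labSynthesis.summary is "3/4 judges selected Variant A."
```

This tally ignores the scores; a weighted scorecard would need the criteria weights, which this section does not list.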

Surface Signals

No surface signals captured.

Source Model

No source model extracted yet.

Onboarding Flow Experiment UI Consensus Poll

llm judge / metadata only / metadata only

Built-in onboarding UI variant. Should onboarding lead with a guided checklist or an autonomous summary?

Winner Variant A: Guided checklist.

Readiness metadata only

Surface metadata only

Poll Questions

Should onboarding lead with a guided checklist or an autonomous summary? Use Variant A: Guided checklist.
What should block Shipwright from implementing the winning UI direction? The losing variant may still be better for first-time users or marketing pages.

Responses

GPT-5.1 Judge selected Variant A at 9/10.
Claude Sonnet Judge selected Variant A at 8/10.
Gemini Judge selected Variant B at 7/10.
Human Panel selected Variant A at 8/10.

Synthesis

3/4 judges selected Variant A.
Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.
Use declared subject metadata until generated UI evidence is available.

Surface Signals

npm run check
npm run report

Source Model

No source model extracted yet.

Pricing Page Experiment UI Consensus Poll

llm judge / metadata only / metadata only

Built-in pricing UI variant. Should pricing lead with a simple plan comparison or proof from judge outcomes?

Winner Variant B: Judge-proof pricing.

Readiness metadata only

Surface metadata only

Poll Questions

Should pricing lead with a simple plan comparison or proof from judge outcomes? Use Variant B: Judge-proof pricing.
What should block Shipwright from implementing the winning UI direction? The losing variant may still be better for operators who need dense controls.

Responses

GPT-5.1 Judge selected Variant B at 9/10.
Claude Sonnet Judge selected Variant B at 8/10.
Gemini Judge selected Variant A at 7/10.
Human Panel selected Variant B at 8/10.

Synthesis

3/4 judges selected Variant B.
Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.
Use declared subject metadata until generated UI evidence is available.

Surface Signals

npm run check
npm run report

Source Model

No source model extracted yet.

UI Surfaces

Rendered or source-level surface evidence.

shipwright-site

rendered / https://shipwright-seven.vercel.app

Title: Shipwright

Headings: none

Signals: none

onboarding-flow

metadata only / no route

Title: unknown

Headings: none

Signals: npm run check, npm run report

pricing-page

metadata only / no route

Title: unknown

Headings: none

Signals: npm run check, npm run report

Runtime Evidence

Runtime URLs, matched signals, and missing signals.

Shipwright A/B Lab

rendered

URL: https://shipwright-seven.vercel.app

Start: npm run report

Matched: 3/3

Matched Signals

A/B Test Consensus
LLM Judges
Variant Toggle
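
A "Matched: 3/3" result can be produced by checking each expected signal against the fetched page text. The sketch below assumes a case-insensitive substring match, which may not be how Shipwright actually probes a runtime; `matchSignals` and `SignalMatch` are hypothetical names.

```typescript
interface SignalMatch {
  matched: string[];
  missing: string[];
  label: string; // e.g. "3/3"
}

// Assumption: a signal counts as matched when it appears anywhere in the page text,
// ignoring case.
function matchSignals(pageText: string, expected: string[]): SignalMatch {
  const haystack = pageText.toLowerCase();
  const matched = expected.filter((signal) => haystack.indexOf(signal.toLowerCase()) !== -1);
  const missing = expected.filter((signal) => matched.indexOf(signal) === -1);
  return {
    matched: matched,
    missing: missing,
    label: matched.length + "/" + expected.length,
  };
}

// Stand-in page text for illustration; a real probe would use the fetched HTML.
const samplePage = "<h2>A/B Test Consensus</h2><p>LLM Judges</p><button>Variant Toggle</button>";
const labSignals = matchSignals(samplePage, ["A/B Test Consensus", "LLM Judges", "Variant Toggle"]);
// labSignals.label is "3/3"
```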

Next Actions

Use this rendered runtime as the baseline for the next UI consensus comparison.

Onboarding Flow Experiment

rendered

URL: targets/onboarding-flow.html

Start: none

Matched: 3/3

Matched Signals

Guided checklist
Judge review
Publish path

Next Actions

Use this rendered runtime as the baseline for the next UI consensus comparison.

Pricing Page Experiment

rendered

URL: targets/pricing-page.html

Start: none

Matched: 3/3

Matched Signals

Judge-proof pricing
Experiments run
Winners shipped

Next Actions

Use this rendered runtime as the baseline for the next UI consensus comparison.

Runtime Start Attempts

What Shipwright tried to start for each subject.

No runtime start attempts recorded yet.

Doctor

One-command readiness for imported GitHub projects.

Shipwright Doctor

needs attention

Shipwright can start, but 1 readiness warning should be reviewed before schedules cook.

Mode dry_run Workflow configured Checks 5

Workspace: /home/runner/work/shipwright/shipwright

Evidence

mode: dry_run
workspace: /home/runner/work/shipwright/shipwright
workflow: .github/workflows/shipwright-dogfood.yml
config: shipwright.yml
targets: shipwright.targets.json
pass: 4

Next Actions

Let the dogfood workflow deploy site/ to Vercel and record the result.

GitHub Adoption

pass

Shipwright GitHub adoption wiring is ready.

Evidence

workflow: .github/workflows/shipwright-dogfood.yml
workflow_template: local_dogfood
config: shipwright.yml
targets: shipwright.targets.json
pass: 10
warn: 0

Safety Policy

pass

Shipwright dry_run mode has the required safety controls.

Evidence

mode: dry_run
checks: 7
pass: 7
warn: 0
fail: 0

Verification Plan

pass

Verification is configured with 3 commands.

Evidence

command: npm run build
command: npm test
command: npm run report

Codex Provider

pass

Codex is configured for OpenRouter with openai/gpt-5.1-codex-mini.

Evidence

provider: openrouter
model: openai/gpt-5.1-codex-mini
wire_api: responses
OPENROUTER_API_KEY: configured
config_path: /home/runner/.codex/config.toml

Website Deploy

warn

Vercel deploy secrets are configured, but this report has not observed a completed deploy.

Evidence

provider: vercel
state: configured_unverified
VERCEL_TOKEN: configured
VERCEL_ORG_ID: configured
VERCEL_PROJECT_ID: configured
message: Website deploy deferred until ci-run writes top-level status and refreshes final evidence.

Next Actions

Let the dogfood workflow deploy site/ to Vercel and record the result.
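
The overall Doctor status follows from the five check results: four pass and one warns, so the report reads "needs attention". One plausible roll-up rule is sketched below; the "ready" and "blocked" labels and the `doctorStatus` function are assumptions, since only "needs attention" appears in this report.

```typescript
type CheckStatus = "pass" | "warn" | "fail";

interface Check {
  name: string;
  status: CheckStatus;
}

// Assumed roll-up: any failure blocks, any warning needs attention, otherwise ready.
function doctorStatus(checks: Check[]): "ready" | "needs attention" | "blocked" {
  if (checks.some((check) => check.status === "fail")) {
    return "blocked";
  }
  if (checks.some((check) => check.status === "warn")) {
    return "needs attention";
  }
  return "ready";
}

const doctorChecks: Check[] = [
  { name: "GitHub Adoption", status: "pass" },
  { name: "Safety Policy", status: "pass" },
  { name: "Verification Plan", status: "pass" },
  { name: "Codex Provider", status: "pass" },
  { name: "Website Deploy", status: "warn" },
];

const doctorResult = doctorStatus(doctorChecks);
// doctorResult is "needs attention"
```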

Adoption

Reusable GitHub Actions onboarding and evidence wiring.

GitHub Adoption

pass

Shipwright GitHub adoption wiring is ready.

Provider github Template local dogfood Workflow configured Checks 10

Workflow: .github/workflows/shipwright-dogfood.yml

Evidence

workflow: .github/workflows/shipwright-dogfood.yml
workflow_template: local_dogfood
config: shipwright.yml
targets: shipwright.targets.json
pass: 10
warn: 0

GitHub Actions Workflow

pass

Found Shipwright workflow at .github/workflows/shipwright-dogfood.yml.

Evidence

workflow: .github/workflows/shipwright-dogfood.yml

Workflow Triggers

pass

Workflow supports manual and scheduled runs.

Evidence

workflow_dispatch: configured
schedule: configured

Workflow Permissions

pass

Workflow has the permissions required to write evidence and read checks.

Evidence

contents: write
actions: read
checks: read

Workflow Commands

pass

Workflow runs the full Shipwright adoption, evidence, worker, publish, and deploy-gate loop through ci-run.

Evidence

command: ci-run
phase_chain: delegated_to_ci_run

Shipwright Runtime

pass

Dogfood workflow builds Shipwright in-repo and uses the local CLI.

Evidence

runtime_mode: local_dogfood
workflow: .github/workflows/shipwright-dogfood.yml
cli: node dist/cli.js

Evidence Upload

pass

Workflow uploads website and machine-readable evidence artifacts.

Evidence

upload-artifact: v7
site/**
metrics/**/*.json
metrics/**/*.jsonl
metrics/**/*.md
metrics/**/*.log

Adoption Runbook

pass

SHIPWRIGHT.md documents manual validation, secrets, variables, and evidence review.

Evidence

runbook: SHIPWRIGHT.md
manual dry run: documented
OPENROUTER_API_KEY: documented
SHIPWRIGHT_SCHEDULE_MODE: documented
evidence-index: documented
artifact-manifest: documented

Shipwright Config

pass

shipwright.yml is parseable and points at a GitHub repository with verification.

Evidence

provider: github
repo: matthoffner/shipwright
mode: dry_run
target_branches: main
verification: npm run build && npm test && npm run report
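
The single `verification` string maps onto the three commands listed under Verification Plan. A naive way to split it is sketched here; `splitVerification` is a hypothetical helper, and it does not handle `&&` inside quoted arguments.

```typescript
// Split a chained shell string on "&&" into individual commands.
// Naive by design: quoted "&&" sequences are not handled.
function splitVerification(verification: string): string[] {
  return verification
    .split("&&")
    .map((command) => command.trim())
    .filter((command) => command.length > 0);
}

const verificationCommands = splitVerification("npm run build && npm test && npm run report");
// verificationCommands is ["npm run build", "npm test", "npm run report"]
```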

Target Registry

pass

Target registry has 3 experiment subjects.

Evidence

targets: 3
shipwright-site: shipwright/static-site
onboarding-flow: shipwright/static-site
pricing-page: shipwright/static-site

Package Verification

pass

Package scripts and Shipwright verification commands line up.

Evidence

script: build
script: test
script: check
verification: npm run build && npm test && npm run report

Project Dependencies

Package manager detection and install evidence for imported projects.

Project Dependencies

installed

Project dependencies installed with npm ci.

Manager npm Package shipwright Lockfile package-lock.json

Workspace: /home/runner/work/shipwright/shipwright

Install: npm ci

Evidence

workspace: /home/runner/work/shipwright/shipwright
package_manager: npm
package_name: shipwright
lockfile: package-lock.json
install_command: npm ci
steps: 1

Project dependency install

passed

Exit 0 Duration 0ms

Command: npm ci

Project dependencies were already installed before this command.

Browser Install

Playwright browser readiness for surface capture and judge evidence.

Browser Install

installed

Playwright browser chromium is installed for surface captures.

Browser chromium Deps true Exit 0 Duration 32895ms

Runtime: /home/runner/work/shipwright/shipwright

Command: npx playwright install --with-deps chromium

Evidence

runtime_dir: /home/runner/work/shipwright/shipwright
browser: chromium
with_deps: true
command: npx playwright install --with-deps chromium
exit_code: 0
signal: null

Safety

YOLO and autonomous mode policy readiness.

Safety Policy

pass

Shipwright dry_run mode has the required safety controls.

Mode dry_run Checks 7 Failures 0

Evidence

mode: dry_run
checks: 7
pass: 7
warn: 0
fail: 0

Hard Stops

pass

Required hard stops are configured.

Evidence

hard_stop: secret_detected
hard_stop: production_config_change
hard_stop: explicit_human_block
hard_stop: budget_exhausted

Forbidden Paths

pass

Forbidden path coverage includes environment, secret, and production paths.

Evidence

forbidden_path: .env
forbidden_path: secrets/**
forbidden_path: infra/prod/**
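
A worker's changed files can be screened against these three patterns before anything is published. The matcher below is a deliberately small sketch that supports only exact paths and a trailing `/**`; it is not a full glob implementation, and Shipwright's real matching rules may differ (for example, this version does not treat `.env.local` as forbidden).

```typescript
const FORBIDDEN_PATHS = [".env", "secrets/**", "infra/prod/**"];

// Minimal matcher: exact match, or "dir/**" matching dir and everything under it.
function isForbidden(path: string, patterns: string[] = FORBIDDEN_PATHS): boolean {
  const normalized = path.replace(/^\.\//, "");
  return patterns.some((pattern) => {
    if (pattern.slice(-3) === "/**") {
      const dir = pattern.slice(0, -3);
      return normalized === dir || normalized.indexOf(dir + "/") === 0;
    }
    return normalized === pattern;
  });
}

// Return the subset of changed files that would trip the forbidden-path check.
function forbiddenChanges(changedFiles: string[]): string[] {
  return changedFiles.filter((file) => isForbidden(file));
}
```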

Human Approval Gates

pass

Human approval is required for merges, schedule updates, and destructive changes.

Evidence

approval: merge
approval: schedule_update
approval: destructive_change

Verification Commands

pass

Verification commands provide build/test/check coverage.

Evidence

verification: npm run build
verification: npm test
verification: npm run report

Change Limits

pass

Autonomous change limits are bounded.

Evidence

maxOpenChanges: 3
maxChangesPerRun: 1
maxCommentsPerRun: 3
maxIterations: 20
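
These four limits can be enforced with a plain comparison against what a run has used so far. In the sketch below the limit names and values come from the evidence above, while the `RunUsage` shape and the "strictly greater than" rule are assumptions.

```typescript
interface ChangeLimits {
  maxOpenChanges: number;
  maxChangesPerRun: number;
  maxCommentsPerRun: number;
  maxIterations: number;
}

// Hypothetical counters for a single run.
interface RunUsage {
  openChanges: number;
  changesThisRun: number;
  commentsThisRun: number;
  iterations: number;
}

const CHANGE_LIMITS: ChangeLimits = {
  maxOpenChanges: 3,
  maxChangesPerRun: 1,
  maxCommentsPerRun: 3,
  maxIterations: 20,
};

// Return the names of every limit the run has exceeded (empty when within bounds).
function limitViolations(limits: ChangeLimits, usage: RunUsage): string[] {
  const violations: string[] = [];
  if (usage.openChanges > limits.maxOpenChanges) violations.push("maxOpenChanges");
  if (usage.changesThisRun > limits.maxChangesPerRun) violations.push("maxChangesPerRun");
  if (usage.commentsThisRun > limits.maxCommentsPerRun) violations.push("maxCommentsPerRun");
  if (usage.iterations > limits.maxIterations) violations.push("maxIterations");
  return violations;
}
```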

Target Branch

pass

Target branches are explicit and include main.

Evidence

target_branch: main

Mode Readiness

pass

dry_run mode has the lanes required for the loop.

Evidence

mode: dry_run
enabled_lanes: build, review, intake, observe

Verification

Post-worker command gate and durable status evidence.

Verification Gate

skipped

Verification was skipped because Shipwright is running in dry_run mode.

Mode dry_run Commands 3 Config shipwright.yml

Workspace: /home/runner/work/shipwright/shipwright

Evidence

mode: dry_run
commands: 3
skipped: dry_run

Next Actions

Run Shipwright in autonomous or yolo mode to verify post-worker changes.

npm run build

skipped

Exit none Signal none Duration 0ms Timed out false

Skipped because Shipwright is running in dry_run mode.

npm test

skipped

Exit none Signal none Duration 0ms Timed out false

Skipped because Shipwright is running in dry_run mode.

npm run report

skipped

Exit none Signal none Duration 0ms Timed out false

Skipped because Shipwright is running in dry_run mode.
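
The skip behaviour above amounts to a mode check in front of each verification command. A minimal sketch, assuming the three modes named in this report (`dry_run`, `autonomous`, `yolo`); the `pending` status and the `planVerification` function are hypothetical, and nothing here actually executes a command.

```typescript
type Mode = "dry_run" | "autonomous" | "yolo";

interface CommandPlan {
  command: string;
  status: "skipped" | "pending";
  reason?: string;
}

// In dry_run every command is marked skipped; otherwise it is queued to run.
function planVerification(mode: Mode, commands: string[]): CommandPlan[] {
  return commands.map((command): CommandPlan => {
    if (mode === "dry_run") {
      return {
        command: command,
        status: "skipped",
        reason: "Skipped because Shipwright is running in dry_run mode.",
      };
    }
    return { command: command, status: "pending" };
  });
}

const dryRunPlan = planVerification("dry_run", ["npm run build", "npm test", "npm run report"]);
// Every entry in dryRunPlan has status "skipped".
```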

Deployment

Vercel deploy state and next action.

Vercel Site

deployed

Provider: vercel

URL: https://shipwright-seven.vercel.app

Vercel accepted the latest site deploy.

Required Secrets

VERCEL_TOKEN: configured
VERCEL_ORG_ID: configured
VERCEL_PROJECT_ID: configured

Evidence

provider: vercel
state: deployed
VERCEL_TOKEN: configured
VERCEL_ORG_ID: configured
VERCEL_PROJECT_ID: configured
message: Final refreshed CI evidence is projected for the Vercel deploy.

Next Actions

Verify the live Vercel site renders the latest site artifacts.

Git Publish

Direct-to-main commit and push evidence.

Main Branch Publish

no changes

Git publishing skipped because Shipwright is running in dry_run mode.

Remote origin Branch main Pushed false Files 0

Commit: none

No allowed files were published in this run.

Evidence

mode: dry_run
pushed: false

Next Actions

Run Shipwright in autonomous or yolo mode when changes should be committed and pushed.

Codex Provider

OpenRouter provider setup and model readiness.

OpenRouter Codex

configured

Model: openai/gpt-5.1-codex-mini

Config: /home/runner/.codex/config.toml

Env key: OPENROUTER_API_KEY

Codex is configured for OpenRouter with openai/gpt-5.1-codex-mini.

Evidence

provider: openrouter
model: openai/gpt-5.1-codex-mini
wire_api: responses
OPENROUTER_API_KEY: configured
config_path: /home/runner/.codex/config.toml

Next Actions

Run Codex with OPENROUTER_API_KEY available in the environment.

Experiment Subjects

Subject metadata and current recommendations.

Shipwright A/B Lab

static-site / Autonomous A/B experiment console

Question: Should the experiment console optimize for operator confidence or fast winner selection?

Winner: Variant A - Evidence-first surface

Shipwright should run Shipwright A/B Lab as an A/B subject with Variant A: Evidence-first surface.

remote

Package: shipwright

Package manager: npm

Lockfile: package-lock.json

Install: npm ci

Framework: static-site

Scripts: vercel deploy

Onboarding Flow Experiment

static-site / Built-in onboarding UI variant

Question: Should onboarding lead with a guided checklist or an autonomous summary?

Winner: Variant A - Guided checklist

Shipwright should run Onboarding Flow Experiment as an A/B subject with Variant A: Guided checklist.

metadata only

Package: shipwright

Package manager: npm

Lockfile: package-lock.json

Install: npm ci

Framework: static-site

Scripts: none

Pricing Page Experiment

static-site / Built-in pricing UI variant

Question: Should pricing lead with a simple plan comparison or proof from judge outcomes?

Winner: Variant B - Judge-proof pricing

Shipwright should run Pricing Page Experiment as an A/B subject with Variant B: Judge-proof pricing.

metadata only

Package: shipwright

Package manager: npm

Lockfile: package-lock.json

Install: npm ci

Framework: static-site

Scripts: none

Experiment Subject Plan

Preparation plan for each generated subject.

Shipwright A/B Lab

self / generated site

Repository: https://github.com/matthoffner/shipwright.git

Start: npm run report

Checks: none

Next Actions

Keep validating the generated website before using it as the baseline for other experiment subjects.

Onboarding Flow Experiment

self / generated site

Repository: https://github.com/matthoffner/shipwright.git

Start: npm run report

Checks: none

Next Actions

Keep validating the generated website before using it as the baseline for other experiment subjects.

Pricing Page Experiment

self / generated site

Repository: https://github.com/matthoffner/shipwright.git

Start: npm run report

Checks: none

Next Actions

Keep validating the generated website before using it as the baseline for other experiment subjects.

Target Checkouts

Repository and workspace materialization evidence.

Shipwright A/B Lab

self

Repository: none

Workspace: .

Secret: none

Evidence

target is generated by this repository

Onboarding Flow Experiment

self

Repository: none

Workspace: .

Secret: none

Evidence

target is generated by this repository

Pricing Page Experiment

self

Repository: none

Workspace: .

Secret: none

Evidence

target is generated by this repository

Credential Readiness

Secret and credential checks for the run.

Shipwright A/B Lab

not required

Secret: none

Repository: none

Evidence

No repository credential is required for this target.

Next Actions

Keep this subject in the default experiment path.

Onboarding Flow Experiment

not required

Secret: none

Repository: none

Evidence

No repository credential is required for this target.

Next Actions

Keep this subject in the default experiment path.

Pricing Page Experiment

not required

Secret: none

Repository: none

Evidence

No repository credential is required for this target.

Next Actions

Keep this subject in the default experiment path.

Product History

Changelog-derived product memory.

Date 2026-06-21 Added 5 Changed 0 Fixed 7

Autonomous Dogfood Runtime

Added: Add Typed Codex Provider Setup — Add typed codex provider setup.
Added: Surface Codex Contract Failures — Surface codex contract failures.
Added: Lane decision artifacts — made orchestration decisions inspectable through structured metrics.
Added: Codex dogfood writes — enabled bounded Codex improvements inside the Shipwright workflow.
Added: Direct-to-main dogfood — let Shipwright commit its own verified dogfood output to `main`.

Sources: f8681de, 14022ba, 8d809ad, 88a46f8, befd57e, 9bd11f6, 3fd8fd5, a2eb9c5

Date 2026-06-21 Added 7 Changed 0 Fixed 8

Shipwright Product Updates

Added: Add Variant Judge Runs — Add variant judge runs.
Added: Pivot Shipwright To A/B Experiment Platform — Pivot shipwright to ab experiment platform.
Added: Add A/B Consensus Worker Edit Recipes — Add ab consensus worker edit recipes.
Added: Surface A/B Consensus Execution Metadata — Surface ab consensus execution metadata.
Added: Add A/B Consensus Handoff Packets — Add ab consensus handoff packets.

Sources: 22deb3b, dcce361, 5596a2f, 5b8f8f3, 8aba939, d84d6fc, b3bbd49, 20be7b1

Date 2026-06-21 Added 3 Changed 0 Fixed 2

Consensus Work Orders

Added: A/B consensus work orders — turns ready consensus winners and blocker-clearing tasks into assignable worker packets above the detailed lanes.
Added: A/B consensus priority backlog — ranks consensus work orders with priority, effort, and next verification commands so workers can act from the generated queue.
Added: A/B consensus gate checks — shows decision, promotion, visual, dogfood, and judge-matrix gates for each consensus handoff.
Fixed: A/B consensus work orders — turns ready consensus winners and blocker-clearing tasks into assignable worker packets above the detailed lanes.
Fixed: A/B consensus readiness actions — keeps blocked target summary cards pointed at unblock work instead of promotion handoffs.

Sources: 8b62a54, 4f5360e, 17565b6, 4637579, 18244cf, baf2252, a6211db, 5b2f557

Date 2026-06-21 Added 3 Changed 0 Fixed 1

Product History Surface

Added: Product outcome changelog sections — splits same-day changes into DevBox-style outcome sections so the website can show product history without relying on journal entries.
Added: Product history digest — grouped changelog outcomes for the website and reduced journal output to compact audit metadata.
Added: DevBox-style changelog — made `CHANGELOG.md` the human-facing product history and kept journals as audit evidence.
Fixed: Artifact-only dogfood journals — keeps dogfood journal entries in workflow evidence artifacts so tracked history stays focused on the changelog.

Sources: f8f0c7e, 4ac0e4a, e8b7185, 50a329e, 415bd0a

Date 2026-06-21 Added 0 Changed 0 Fixed 4

Product History Quality

Fixed: Changelog history window — keeps the automated changelog scan wide enough to preserve older product outcomes during dogfood refreshes.
Fixed: Product history classification — keeps new outcomes in named DevBox-style changelog sections instead of the generic product-updates bucket.
Fixed: Changelog synthesis quality — tightened deterministic summaries so new product entries stay useful instead of echoing commit subjects.
Fixed: Product history grouping — keeps UI workspace changes summarized as DevBox-style product history instead of noisy commit echoes.

Sources: 41e7f4d, e64776f, c9ca2b5, 0423133, 594ee12, 9486979, cf3b96d

Date 2026-06-21 Added 2 Changed 0 Fixed 1

Autonomous Deployment Evidence

Added: Website evidence artifact — uploads the generated site, screenshots, and machine-readable consensus artifacts when Vercel cannot publish the report.
Added: Deploy status evidence — records Vercel readiness and deploy failures as structured evidence so stale live sites do not look current.
Fixed: Deploy status evidence — records Vercel readiness and deploy failures as structured evidence so stale live sites do not look current.

Sources: 0d1b1c4, 6027161, 52047cc

Date 2026-06-21 Added 3 Changed 0 Fixed 1

Product History Benchmark

Added: Product history section guard — flags oversized changelog sections so Shipwright keeps copying DevBox's concise outcome history instead of growing giant buckets.
Added: Current-state digest — summarizes what changed, what is ready, what is blocked, and next actions from evidence artifacts instead of journal prose.
Added: DevBox changelog benchmark — scores Shipwright product history against DevBox-style outcome sections, category grouping, source traceability, and journal-noise removal.
Fixed: Product history section guard — flags oversized changelog sections so Shipwright keeps copying DevBox's concise outcome history instead of growing giant buckets.

Sources: 5c1e5e9, 4803642, 5ed88f0, b2d24ff

Date 2026-06-21 Added 5 Changed 0 Fixed 1

Consensus Experiment Design

Added: A/B consensus experiment packets — adds assignment, allocation, event contract, and readout state to each UI consensus handoff.
Added: A/B consensus variant comparison — exposes winner, runner-up, margin, vote split, thesis, and risk for every consensus queue item.
Added: A/B test consensus plans — published hypotheses, split, metrics, guardrails, decision rules, and results for each UI consensus test.
Added: A/B consensus scorecards — added weighted criteria, confidence, and machine-readable scorecards for UI consensus.
Added: UI experiment consensus artifacts — added group-vote poll synthesis for dogfood UI decisions.

Sources: 900c725, 7630e18, d4fb1bf, 9c0c80d, 8462a34, 123764a, 4369873

Date 2026-06-21 Added 3 Changed 0 Fixed 1

Consensus Response Artifacts

Added: Consensus response packets — packages each UI consensus target as a group-vote question, majority, dissent, evidence checklist, and copyable response prompt.
Added: Consensus response matrix — publishes group-vote judge alignment, criterion deltas, Playwright evidence, and runtime evidence for each UI consensus subject.
Added: UI consensus report — promoted consensus decisions into the generated Shipwright report.
Fixed: Consensus response matrix — publishes group-vote judge alignment, criterion deltas, Playwright evidence, and runtime evidence for each UI consensus subject.

Sources: 8112ae5, a7dd438, c242277, 204f69f

Date 2026-06-21 Added 3 Changed 0 Fixed 2

Experiment Operator Plan

Added: Experiment blocker diagnosis — gives blocked app targets one primary unblock path with next action and verification command before the raw evidence list.
Added: Experiment status board — merges target inspection, credentials, runtime, screenshots, and consensus handoffs into one per-app readiness view.
Added: Experiment subject plan — published preparation, verification, credential, blocker, and next-action plans for each A/B subject.
Fixed: Experiment blocker diagnosis — gives blocked app targets one primary unblock path with next action and verification command before the raw evidence list.
Fixed: Experiment subject plan — published preparation, verification, credential, blocker, and next-action plans for each A/B subject.

Sources: 63793df, aca4554, bccb365, 71d71e8, 9fc122b, 4cd5549, 74cab03

Date 2026-06-21 Added 1 Changed 0 Fixed 3

Consensus Blocker Clarity

Added: Consensus blocker clarity — separates blocked targets from failing checks so the A/B consensus queue matches the top current-state digest.
Fixed: Consensus blocker clarity — clears blocker language from ready-for-build consensus cards when captured runtime evidence already proves the target is reviewable.
Fixed: A/B consensus lane layout — gives blocker-heavy consensus lanes enough room for readable evidence and worker handoffs.
Fixed: Unique A/B consensus subjects — keeps `site/ab-tests.json` from counting the same generated surface twice.

Sources: 5d742bd, 93f168f, 884aed9, d9de7b5, 5ecc32e

Date 2026-06-21 Added 5 Changed 0 Fixed 0

Consensus Decision Workspace

Added: UI consensus cockpit — added target tabs, decision panels, vote/rubric details, dissent, and copyable worker handoffs to the generated website.
Added: A/B consensus operator lanes — adds clear-first, ready-to-ship, and watch lanes with one recommended worker action for the consensus UI.
Added: A/B consensus queue — ranks consensus items into ship, blocked, and watch states using test plans, judge matrices, and dogfood readiness.
Added: Consensus board — turns A/B consensus plans into ready-for-build and needs-more-signal lanes on the generated website.
Added: UI consensus workspace — turned the report-style consensus section into a judge-panel decision workspace with options, rubric, votes, responses, and surface signals.

Sources: 87c41ff, 524f823, 6b64374, d76685a, 2381c3a, bad937b, 06ad625

DevBox Changelog Benchmark

Checks that keep the product history readable.

Reference DevBox CHANGELOG.md pattern Score 86

6/7 DevBox-style product history checks pass.
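The score of 86 is consistent with the share of passing checks rounded to a whole percent. A minimal sketch, assuming that is how the score is derived (the page does not state the formula, and `benchmark_score` is a hypothetical name):

```python
def benchmark_score(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage.

    Assumption: the published score is round(100 * passed / total).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


# 6 of 7 checks pass on this page.
print(benchmark_score(6, 7))  # 86
```

Under this assumption, fixing the one failing check would move the score to 100.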

Shipwright should keep copying the useful DevBox split: changelog for product history, journal for audit evidence.

Dated sections name product outcomes.
Entries are grouped under Added, Changed, and Fixed.
Bullets use a bold outcome label followed by a short user-facing explanation.
Source commits remain visible, while routine run journals stay audit-only.

Dated Outcome Sections pass

18 dated product outcome sections are available.

Evidence

Autonomous Dogfood Runtime
Shipwright Product Updates
Consensus Work Orders
Product History Surface

Keep grouping changes by named product outcome instead of dumping every run into one feed.

Added Changed Fixed Grouping pass

2 of 3 changelog categories are represented: Added and Fixed. Changed has no entries yet.

Evidence

Added
Fixed

Preserve DevBox-style Added, Changed, and Fixed category headings as the product matures.
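The per-section counts shown in the changelog (for example "Added 1 Changed 0 Fixed 3") can be tallied from the category prefix of each bullet. A rough sketch, assuming every bullet starts with its category followed by a colon; `tally_categories` is a hypothetical helper and the sample bullets are shortened to their labels:

```python
CATEGORIES = ("Added", "Changed", "Fixed")


def tally_categories(bullets):
    """Count bullets per category, keeping zero counts for unused categories."""
    counts = {name: 0 for name in CATEGORIES}
    for line in bullets:
        prefix, sep, _rest = line.partition(":")
        category = prefix.strip()
        # Ignore lines that have no colon or an unknown category.
        if sep and category in counts:
            counts[category] += 1
    return counts


# Labels from the Consensus Blocker Clarity section, explanations omitted.
sample = [
    "Added: Consensus blocker clarity",
    "Fixed: Consensus blocker clarity",
    "Fixed: A/B consensus lane layout",
    "Fixed: Unique A/B consensus subjects",
]

counts = tally_categories(sample)
print(" ".join(f"{name} {counts[name]}" for name in CATEGORIES))
# Added 1 Changed 0 Fixed 3
```

Keeping a zero for Changed makes the missing category visible instead of silently dropping it.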

Concise Product Bullets pass

92/92 entries use the DevBox-style bold outcome plus one-line explanation.

Evidence

Add Typed Codex Provider Setup
Surface Codex Contract Failures
Lane decision artifacts
Codex dogfood writes

Rewrite vague commit-derived entries into product outcome bullets before publishing.

Source Traceability pass

92/92 entries retain source commit references.

Evidence

Add Typed Codex Provider Setup: f8681de
Surface Codex Contract Failures: 14022ba
Lane decision artifacts: 8d809ad
Codex dogfood writes: 88a46f8

Keep source commit links on every generated changelog bullet.
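A traceability check of this kind could be sketched as follows, assuming each entry carries a list of abbreviated commit hashes in the 7-character lowercase hex form shown in the evidence. The function name and data shape are hypothetical, not Shipwright's actual implementation:

```python
import re

# Assumption: sources are 7-character abbreviated commit hashes.
SHORT_SHA = re.compile(r"[0-9a-f]{7}")


def untraced_entries(entries: dict[str, list[str]]) -> list[str]:
    """Return the labels of entries that have no valid source commit."""
    return [
        label
        for label, sources in entries.items()
        if not any(SHORT_SHA.fullmatch(source) for source in sources)
    ]


# Evidence pairs listed above.
evidence = {
    "Add Typed Codex Provider Setup": ["f8681de"],
    "Surface Codex Contract Failures": ["14022ba"],
    "Lane decision artifacts": ["8d809ad"],
    "Codex dogfood writes": ["88a46f8"],
}

print(untraced_entries(evidence))  # []
```

An empty result corresponds to a passing check; any returned label would point at a bullet that lost its source reference.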

Journal Noise Removed pass

No routine dogfood or journal-only entries are promoted into product history.

Keep dogfood run bookkeeping in metrics/journal artifacts and reserve CHANGELOG.md for product outcomes.

Named Outcomes pass

Section titles name product outcomes instead of generic update buckets.

Evidence

Autonomous Dogfood Runtime
Shipwright Product Updates
Consensus Work Orders
Product History Surface

Prefer headings like DevBox's V5 onboarding or duplicate-MR guardrail sections over generic update labels.

Focused Section Size fail

Largest changelog section has 15 entries; DevBox-style sections stay small and outcome-specific.

Evidence

Shipwright Product Updates: 15 entries
Autonomous Dogfood Runtime: 12 entries
Consensus Experiment Design: 6 entries
Consensus Work Orders: 5 entries

Split oversized sections into narrower dated outcomes before the website treats product history as healthy.
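The size check can be illustrated with a small sketch. The entry counts come from the evidence above, but the limit of 12 is purely an assumption for illustration, since the page does not state the threshold it uses; `oversized_sections` is a hypothetical name:

```python
def oversized_sections(section_sizes, max_entries):
    """Return (section, entry_count) pairs above max_entries, largest first."""
    flagged = [
        (name, count)
        for name, count in section_sizes.items()
        if count > max_entries
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Entry counts from the evidence list.
sizes = {
    "Shipwright Product Updates": 15,
    "Autonomous Dogfood Runtime": 12,
    "Consensus Experiment Design": 6,
    "Consensus Work Orders": 5,
}

# Hypothetical limit; the real threshold is not published on this page.
print(oversized_sections(sizes, max_entries=12))
# [('Shipwright Product Updates', 15)]
```

Passing the limit as a parameter keeps the sketch honest about the unknown threshold: a stricter limit would flag more sections as candidates for splitting.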

Run Audit

The latest workflow journal entry.

Audit-only journal entry from 2026-06-21.md at Jun 21, 2026, 10:00 AM UTC.

Writer: matthoffner
Repo: matthoffner/shipwright
Mode: yolo
Outcome: would_run
Lane: intake
Task: intake:advance_runtime
Run: https://github.com/matthoffner/shipwright/actions/runs/27900743899

Operating Model

How Shipwright decides when to plan, write, or stop.

Dry runs explain the next move before Shipwright writes code.

Autonomous runs turn bounded work into verified changes.

YOLO mode keeps shipping while the changelog carries product history.