Reviewer validate layer passes through when worktree is missing — test_reviewer_03 RED #13

Open
opened 2026-06-24 09:09:54 +00:00 by kaykayyali · 0 comments
Owner

Reviewer validate layer passes through when worktree is missing — test_reviewer_03 RED

The contract

wiki/concepts/reviewer-contract.md §1 Layer 1 (validate) lists 6 mechanical checks as a hard gate:

  1. pr_url IS NOT NULL
  2. PR is open and mergeable
  3. ## Test Command exits 0 in the worktree
  4. Build succeeds
  5. Lint clean
  6. Branch is rebased onto current main

The contract treats these as a hard block: any failure → tests_failed (or rebase_conflict for #6), row goes back to build with attempts++.

The gap

src/damascus/phases.py:311-317 has two early-return pass branches:

test_cmd = feedback.get("test_cmd")
if not test_cmd:
    return _verdict("pass", {"note": "no test_cmd recorded; PR exists, passing through"})

wt = _worktree_path(project, story_id)
if not wt.exists():
    return _verdict("pass", {"note": "worktree gone; PR exists, passing through"})

When either is hit, the reviewer passes through to merge even though no actual test/build/lint was run. The "pass" verdict then routes to merged per _next_phase_on_verdict. The validate layer's hard gate is silently downgraded to advisory in the missing-worktree / no-test_cmd case.

Why it matters

This is the "Surviving bugs revealed by revived-dead tests" pattern from the operator skill. The test was previously dead (MySQL connection refused), the Postgres migration revived it, and its first failure surfaces a real Loop-C bug. A build that ships a no-worktree state (e.g. the build phase ran into a network error and left the row in review with no worktree) currently passes review and reaches merged, even though no actual test was run, no actual lint was checked, no actual build was verified. The whole point of the validate layer — defense in depth against build-phase lies — is bypassed.

The E2E test that surfaces it

tests/e2e/test_reviewer.py::test_reviewer_03_validate_layer_runs_test_cmd — currently RED:

E       AssertionError: expected build/blocked, got review

The test inserts a row with test_cmd="exit 1", sets phase='review' with a fake pr_url, runs a cycle, and asserts the row routes to build or blocked. The actual behavior: the worktree doesn't exist, the reviewer returns pass (worktree-gone early-return), the row is supposed to go review → merged but stays in review because the actual merge call against a fake PR URL fails.

The three options (from the gap note)

Option A — Fail-closed on missing worktree / test_cmd (5-line patch)

Replace both early-return pass branches with tests_failed:

if not test_cmd:
    return _verdict("tests_failed", {"review_test": False, "reason": "no test_cmd recorded"})
if not wt.exists():
    return _verdict("tests_failed", {"review_test": False, "reason": "worktree missing"})

Cost: 5 minutes. Risk: rows that legitimately have a PR but no worktree (worktree pruned, branch deleted on remote but PR still open) now go to build instead of merged. Recommendation: ship Option B in the same PR; if Option B is too large for one PR, ship Option A as a follow-up fix and schedule Option B as a follow-up PR.

The state-resume-protocol contract says build() already implements idempotent resume (worktree list, branch check, PR existence check, PR #9 merged 2026-06-24 05:15). Apply the same pattern to the reviewer's validate layer: before failing-closed, try to re-create the worktree at the row's branch field's tip. If re-create succeeds, run the test_cmd. If it fails, return tests_failed with the recreate error.

Cost: 30-60 minutes. Risk: worktree-recreate path could fail in a way that masks the original test failure. Mitigation: log the recreate error prominently in last_feedback.recreate_error.

Option C — Distinguish "validate passed" from "validate skipped" via typed verdict (largest change)

Add a new typed verdict validate_skipped to the VERDICTS set. The reviewer returns pass only when validate actually ran and passed. The state machine then enforces "merged requires validate_ran=true." Requires a contract update, the typed-verdict table, and a re-audit of every place that consumes pass.

Cost: 2-3 hours. Risk: big blast radius.

Recommendation

Option B (re-create worktree from branch field, then validate). Reuses the state-resume-protocol pattern from build(). The reviewer contract should follow the same shape as the build contract.

What needs human/BMAD input

The options require a design call on how aggressive the worktree-recreate path should be:

  • B-1: Re-create worktree at row's branch tip unconditionally
  • B-2: Re-create only if branch is on a known set of projects (whitelist)
  • B-3: Re-create only if the row's last_feedback.pr_url is a Gitea PR that's still open (cross-check via API)

B-1 is the simplest and matches the build phase's pattern. B-2/B-3 add safety for production multi-project setups. For this homelab, B-1 is the right shape.

Minimum bar to call it done

  • One PR that changes src/damascus/phases.py:311-322 to:
    • Try to re-create the worktree at the row's branch field (using git_ops.ensure_worktree or equivalent)
    • If re-create fails, return tests_failed with the recreate error in last_feedback
    • If re-create succeeds, run the existing test_cmd validation path
  • The existing E2E test tests/eviewer.py::test_reviewer_03_validate_layer_runs_test_cmd goes GREEN
  • A new source-grep contract test in tests/contract/test_contracts_match_source.py that asserts the reviewer code does NOT contain the early-return pass branches (or contains a comment naming the gap)
  • No regression in tests/e2e/test_reviewer_01_no_pr_defensive_guard (the no_pr defensive guard must still fire before the worktree check)
  • No regression in tests/e2e/test_reviewer_04_subjective_comments_do_not_block_merge

Out of scope for this issue

  • The no_pr defensive guard (PR #1 baseline, working)
  • The loop-breaker on attempts >= budget_cycles (PR #3, working)
  • The spec-refiner ambiguity-routing drift (PR #12, awaiting Kay)
  • The §7 metrics/analyzer (issue #6, awaiting BMAD picks)

Cross-references

# Reviewer validate layer passes through when worktree is missing — test_reviewer_03 RED ## The contract `wiki/concepts/reviewer-contract.md` §1 Layer 1 (validate) lists 6 mechanical checks as a hard gate: 1. `pr_url IS NOT NULL` 2. PR is open and mergeable 3. **`## Test Command` exits 0** in the worktree 4. Build succeeds 5. Lint clean 6. Branch is rebased onto current main The contract treats these as a hard block: any failure → `tests_failed` (or `rebase_conflict` for #6), row goes back to `build` with `attempts++`. ## The gap `src/damascus/phases.py:311-317` has two early-return `pass` branches: ```python test_cmd = feedback.get("test_cmd") if not test_cmd: return _verdict("pass", {"note": "no test_cmd recorded; PR exists, passing through"}) wt = _worktree_path(project, story_id) if not wt.exists(): return _verdict("pass", {"note": "worktree gone; PR exists, passing through"}) ``` When either is hit, the reviewer **passes through to merge** even though no actual test/build/lint was run. The "pass" verdict then routes to `merged` per `_next_phase_on_verdict`. The validate layer's hard gate is silently downgraded to advisory in the missing-worktree / no-test_cmd case. ## Why it matters This is the "Surviving bugs revealed by revived-dead tests" pattern from the operator skill. The test was previously dead (MySQL connection refused), the Postgres migration revived it, and its first failure surfaces a real Loop-C bug. A build that ships a no-worktree state (e.g. the build phase ran into a network error and left the row in `review` with no worktree) currently passes review and reaches `merged`, even though **no actual test was run, no actual lint was checked, no actual build was verified**. The whole point of the validate layer — defense in depth against build-phase lies — is bypassed. ## The E2E test that surfaces it `tests/e2e/test_reviewer.py::test_reviewer_03_validate_layer_runs_test_cmd` — currently **RED**: ``` E AssertionError: expected build/blocked, got review ``` The test inserts a row with `test_cmd="exit 1"`, sets phase='review' with a fake pr_url, runs a cycle, and asserts the row routes to `build` or `blocked`. The actual behavior: the worktree doesn't exist, the reviewer returns `pass` (worktree-gone early-return), the row is supposed to go `review → merged` but stays in `review` because the actual merge call against a fake PR URL fails. ## The three options (from the gap note) ### Option A — Fail-closed on missing worktree / test_cmd (5-line patch) Replace both early-return `pass` branches with `tests_failed`: ```python if not test_cmd: return _verdict("tests_failed", {"review_test": False, "reason": "no test_cmd recorded"}) if not wt.exists(): return _verdict("tests_failed", {"review_test": False, "reason": "worktree missing"}) ``` **Cost:** 5 minutes. **Risk:** rows that legitimately have a PR but no worktree (worktree pruned, branch deleted on remote but PR still open) now go to `build` instead of `merged`. **Recommendation:** ship Option B in the same PR; if Option B is too large for one PR, ship Option A as a follow-up fix and schedule Option B as a follow-up PR. ### Option B — Re-create worktree from row's `branch` field, then validate (recommended) The state-resume-protocol contract says `build()` already implements idempotent resume (worktree list, branch check, PR existence check, PR #9 merged 2026-06-24 05:15). Apply the same pattern to the reviewer's validate layer: before failing-closed, try to re-create the worktree at the row's `branch` field's tip. If re-create succeeds, run the test_cmd. If it fails, return `tests_failed` with the recreate error. **Cost:** 30-60 minutes. **Risk:** worktree-recreate path could fail in a way that masks the original test failure. **Mitigation:** log the recreate error prominently in `last_feedback.recreate_error`. ### Option C — Distinguish "validate passed" from "validate skipped" via typed verdict (largest change) Add a new typed verdict `validate_skipped` to the `VERDICTS` set. The reviewer returns `pass` only when validate actually ran and passed. The state machine then enforces "merged requires validate_ran=true." Requires a contract update, the typed-verdict table, and a re-audit of every place that consumes `pass`. **Cost:** 2-3 hours. **Risk:** big blast radius. ## Recommendation **Option B (re-create worktree from `branch` field, then validate).** Reuses the state-resume-protocol pattern from `build()`. The reviewer contract should follow the same shape as the build contract. ## What needs human/BMAD input The options require a design call on **how aggressive the worktree-recreate path should be**: - B-1: Re-create worktree at row's `branch` tip unconditionally - B-2: Re-create only if `branch` is on a known set of projects (whitelist) - B-3: Re-create only if the row's `last_feedback.pr_url` is a Gitea PR that's still open (cross-check via API) B-1 is the simplest and matches the build phase's pattern. B-2/B-3 add safety for production multi-project setups. **For this homelab, B-1 is the right shape.** ## Minimum bar to call it done - One PR that changes `src/damascus/phases.py:311-322` to: - Try to re-create the worktree at the row's `branch` field (using `git_ops.ensure_worktree` or equivalent) - If re-create fails, return `tests_failed` with the recreate error in `last_feedback` - If re-create succeeds, run the existing test_cmd validation path - The existing E2E test `tests/eviewer.py::test_reviewer_03_validate_layer_runs_test_cmd` goes GREEN - A new source-grep contract test in `tests/contract/test_contracts_match_source.py` that asserts the reviewer code does NOT contain the early-return `pass` branches (or contains a comment naming the gap) - No regression in `tests/e2e/test_reviewer_01_no_pr_defensive_guard` (the `no_pr` defensive guard must still fire before the worktree check) - No regression in `tests/e2e/test_reviewer_04_subjective_comments_do_not_block_merge` ## Out of scope for this issue - The `no_pr` defensive guard (PR #1 baseline, working) - The loop-breaker on `attempts >= budget_cycles` (PR #3, working) - The spec-refiner ambiguity-routing drift (PR #12, awaiting Kay) - The §7 metrics/analyzer (issue #6, awaiting BMAD picks) ## Cross-references - Gap note: `wiki/queries/damascus-orchestrator/reviewer-validates-failing-test-cmd-still-merges-2026-06-24.md` - Contract: `wiki/concepts/reviewer-contract.md` - State-resume pattern: `wiki/concepts/state-resume-protocol.md`, PR #9 - Skill reference: "Surviving bugs revealed by revived-dead tests" in operator skill
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: kaykayyali/damascus-orchestrator#13