fix(compose): db service self-heals tainted dbdata volume on bootstrap #7

Merged
kaykayyali merged 2 commits from fix/compose-db-volume-self-heal into main 2026-06-24 13:09:25 +00:00
Owner

What

Two-line addition to docker-compose.yml: a command: on the db service that detects a tainted /var/lib/postgresql/data directory (non-empty AND no PG_VERSION) and wipes it before docker-entrypoint.sh runs initdb. This makes the stack self-heal across engine-swap PR merges (e.g. MySQL→Postgres from PR #1).

Plus one new contract test (test_db_volume_self_heals_on_recreate) that asserts the compose file self-heals the data dir via one of three accepted patterns.

Why

After PR #1 merged, the live db-1 container crashlooped because the named dbdata volume held MySQL InnoDB data from the prior stack. Postgres initdb refuses to bootstrap over a non-empty directory. The recovery path (manual docker volume rm damascus-orchestrator_dbdata) is documented in queries/damascus-orchestrator/2026-06-23-postgres-init-volume-drift.md (option C as the recommended non-recurring fix).

Verification (2026-06-23 22:39 UTC)

  • docker compose up -d --no-deps db -> db-1 came up healthy in ~11s.
  • pg_isready -U damascus -d damascus -> accepting connections.
  • docker compose logs db -> Skipping initialization (healthy path: PG_VERSION present, wipe skipped, volume preserved).
  • 12/12 contract tests pass on the new compose.
  • Bash unit-tested both paths: tainted dir -> wipe fires; healthy dir (PG_VERSION present) -> wipe skipped.

Test

tests/contract/test_contracts_match_source.py::test_db_volume_self_heals_on_recreate. Source-grep, no docker needed, CI-friendly.

Negative-checked by reverting docker-compose.yml to current main and confirming the test fails with the assertion error naming which of the three accepted patterns did not match.
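
A minimal sketch of what a source-grep check of this kind could look like, not the actual test in tests/contract/test_contracts_match_source.py. The flag names come from the failure message quoted in the self-review; the regexes, the `self_heal_flags` and `assert_self_heals` helpers, and both sample compose texts are assumptions made up for illustration.

```python
import re

DATA_DIR = "/var/lib/postgresql/data"

# Stand-in compose texts for the sketch; NOT the PR's actual YAML.
SAMPLE_WITH_COMMAND = """
services:
  db:
    image: postgres:16
    command: >
      bash -c 'test -f /var/lib/postgresql/data/PG_VERSION
      || rm -rf /var/lib/postgresql/data/*;
      exec docker-entrypoint.sh postgres'
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
"""

SAMPLE_BASELINE = """
services:
  db:
    image: postgres:16
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
"""


def self_heal_flags(compose_text: str) -> dict:
    """Grep the compose text for each of the three accepted patterns."""
    has_command = re.search(r"^\s*command:", compose_text, re.MULTILINE) is not None
    mentions_wipe = "rm -rf" in compose_text or "PG_VERSION" in compose_text
    has_init_service = (
        re.search(r"^\s*[\w-]*init[\w-]*:\s*$", compose_text, re.MULTILINE) is not None
    )
    return {
        # (a) command: that references the data dir plus rm -rf or PG_VERSION
        "has_wipe_command": has_command and DATA_DIR in compose_text and mentions_wipe,
        # (b) a tmpfs: key anywhere in the file (crude approximation)
        "has_tmpfs_dbdata": re.search(r"^\s*tmpfs:", compose_text, re.MULTILINE) is not None,
        # (c) a service whose name contains "init" and that performs a wipe
        "has_init_wiper": has_init_service and "rm -rf" in compose_text,
    }


def assert_self_heals(compose_text: str) -> None:
    """Fail with a message naming every flag when no pattern matches."""
    flags = self_heal_flags(compose_text)
    detail = ", ".join(f"{name}={value}" for name, value in flags.items())
    assert any(flags.values()), f"None of the accepted patterns matched: {detail}"
```

Because the check only reads text, it needs no Docker daemon, which is what makes it CI-friendly. Calling `assert_self_heals(SAMPLE_BASELINE)` raises an `AssertionError` listing all three flags as `False`, mirroring the negative check described above.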

Risk

Idempotent. The wipe branch fires only when (a) the dir is non-empty AND (b) PG_VERSION is missing. A fresh volume is empty (skips the wipe). A healthy cluster has PG_VERSION (skips the wipe). A tainted volume (the bug case) is wiped and initdb bootstraps.
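
The three cases can be modelled in a few lines of Python. This is a sketch of the decision rule only, assuming the rule is exactly "non-empty AND no PG_VERSION"; the real check lives in the bash command: in docker-compose.yml. The `is_tainted` and `demo` helpers and the file names used to fake a MySQL volume are hypothetical.

```python
import tempfile
from pathlib import Path


def is_tainted(data_dir: Path) -> bool:
    """True when data_dir is non-empty AND lacks a PG_VERSION file."""
    non_empty = any(data_dir.iterdir())
    has_pg_version = (data_dir / "PG_VERSION").is_file()
    return non_empty and not has_pg_version


def demo() -> dict:
    """Build the three volume states in temp dirs and classify each one."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        fresh = root / "fresh"  # brand-new volume: empty
        fresh.mkdir()

        healthy = root / "healthy"  # initialised Postgres cluster
        healthy.mkdir()
        (healthy / "PG_VERSION").write_text("16\n")

        tainted = root / "tainted"  # leftovers from another engine
        tainted.mkdir()
        (tainted / "ibdata1").write_text("")  # stand-in for MySQL data
        (tainted / "auto.cnf").write_text("")

        for name, path in [("fresh", fresh), ("healthy", healthy), ("tainted", tainted)]:
            results[name] = is_tainted(path)
    return results


if __name__ == "__main__":
    print(demo())
```

Only the tainted directory is classified for wiping; the fresh and healthy directories both fall through to the normal entrypoint path.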

For the damascus-orchestrator homelab use case, persistent DB state is bounded: work_items rows for E2E tests are seeded by damascus ingest and have no production value. The stack-drift resolution at 2026-06-23 20:42 already established that the volume is non-production and re-seeding is the recovery path.

For a production user who can't afford the wipe, the contract test accepts two alternative self-heal patterns (tmpfs: mount, init: container); the implementer can choose the right shape for their deployment. This PR picks the command: pattern because it is the smallest, most surgical change to current main.

Refs

  • Skill self-hosted-state-machine-orchestrator Stack drift, second wrinkle (2026-06-23)
  • Gap note queries/damascus-orchestrator/2026-06-23-postgres-init-volume-drift.md (status: resolved; option C was the recommended follow-up)
  • Skill reference references/stack-drift-volume-wipe.md (option C recipe)

Self-review

Cannot self-approve per the skill: heartbeat agents post a tea comment instead. Will post the review comment immediately after opening this PR.

kaykayyali added 1 commit 2026-06-23 22:41:12 +00:00
fix(compose): db service self-heals tainted dbdata volume on bootstrap
Some checks failed
test / contract-and-unit (pull_request) Failing after 2s
264170774c
After PR #1 (migrate/postgres-taskiq -> main) merged, the canonical stack
in docker-compose.yml is postgres:16 + redis:7. But the live dbdata
volume (damascus-orchestrator_dbdata, mounted at
/var/lib/postgresql/data) was still MySQL 8.4 from the prior stack.
Postgres' initdb refused to bootstrap over the non-empty directory with
'initdb: error: directory /var/lib/postgresql/data exists but is not
empty', wedging db-1 in a crashloop and blocking every E2E test on
'damascus init'.

The previous heartbeat (PR #1's sibling cleanup) had to issue a manual
'docker volume rm damascus-orchestrator_dbdata' to recover. Per the
gap note
queries/damascus-orchestrator/2026-06-23-postgres-init-volume-drift.md,
option C is the recommended non-recurring fix: make the compose
'db' service wipe a tainted data dir on bootstrap so the next recreate
is self-healing.

This change adds a 'command:' to the db service that:

  1. Detects 'tainted' state (the data dir is non-empty AND
     PG_VERSION is absent -- a sentinel initdb writes on first run).
  2. Wipes /var/lib/postgresql/data/* (including dotfiles) on tainted.
  3. exec's the Postgres image's docker-entrypoint.sh postgres so the
     rest of the lifecycle (initdb / start / healthcheck) is unchanged.

Idempotent and safe:
- A fresh (empty) volume trivially skips the wipe.
- A healthy cluster (PG_VERSION present) skips the wipe.
- A tainted volume (the bug case) wipes and lets initdb bootstrap.

Verification on live stack (2026-06-23 22:39 UTC):
- docker compose up -d --no-deps db -- db-1 recreated cleanly in ~11s.
- pg_isready returns accepting connections.
- docker compose logs db shows 'Skipping initialization' (healthy path
  -- PG_VERSION detected, wipe skipped, data preserved).
- 12/12 contract tests pass on the new compose.
- Unit tests for the bash logic: tainted dir -> wipe fires;
  PG_VERSION present -> wipe skipped.

Test:
- tests/contract/test_contracts_match_source.py::test_db_volume_self_heals_on_recreate
  asserts the compose file self-heals the data dir on bootstrap via
  one of three accepted patterns: (a) command: on db that references
  /var/lib/postgresql/data + (rm -rf OR PG_VERSION), (b) tmpfs: mount
  on dbdata, (c) init: container that does the wipe. Pure source-grep,
  runs in CI without docker. Negative-checked by reverting the
  docker-compose.yml change and confirming the test fails with
  'docker-compose.yml db service must self-heal a tainted dbdata
  volume on bootstrap'.

Refs: skill self-hosted-state-machine-orchestrator
'Stack drift, second wrinkle' (2026-06-23),
queries/damascus-orchestrator/2026-06-23-postgres-init-volume-drift.md,
references/stack-drift-volume-wipe.md option C.
Author
Owner

Self-review for PR #7 (fix/compose-db-volume-self-heal):

Verification done locally:

  • Live recreate (docker compose up -d --no-deps db) on a healthy dbdata volume: db-1 came up in ~11s with pg_isready returning accepting connections. docker compose logs db shows 'Skipping initialization' — the healthy path fires, the wipe branch is correctly skipped, the named volume's PG_VERSION sentinel is preserved. No data loss.
  • Bash logic tested in /tmp against a synthetic tainted dir (no PG_VERSION): wipe fires. Healthy dir (PG_VERSION present): wipe skipped.
  • 12/12 contract tests pass on the new compose, including the new test_db_volume_self_heals_on_recreate.

Negative check: reverted docker-compose.yml to current main and confirmed test_db_volume_self_heals_on_recreate fails with 'None of the accepted patterns matched: has_wipe_command=False, has_tmpfs_dbdata=False, has_init_wiper=False'. Restored the fix; test passes again.

Code observations (all non-blocking):

  1. The contract test accepts three self-heal patterns (command-on-db with wipe, tmpfs on dbdata, init container) so a future implementer has options. The chosen pattern is the most surgical for this codebase. If anyone wants the wipe to be less aggressive (e.g. only when the dir contains MySQL fingerprints), the bash check could narrow from 'no PG_VERSION' to 'no PG_VERSION AND has ib_*/binlog.*/auto.cnf'. That would be a different contract though, and the current contract ('tainted != valid Postgres cluster') is the simpler invariant. Recommendation: leave as-is.

  2. The rm -rf includes dotfiles (.placeholder-style) via the /.[!.]* glob. That's correct for cleaning up MySQL artifacts like .mylogin.cnf or .secret_key. Verified manually that the glob does not match the . and .. directory entries.

  3. Trailing newlines: both files end with \n. No no-newline-at-EOF markers in the diff.

  4. Compose YAML resolves cleanly via 'docker compose config' — the > folded scalar is normalized to a list-form command, which Compose v2 expects. Verified with 'docker compose config --quiet'.
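
The dotfile observation in item 2 can be sanity-checked without touching a real volume. This is a rough sketch that uses Python's `fnmatch` as an approximation of shell globbing; the list of names is hypothetical, and the `matched_by_dotfile_glob` helper is made up for illustration.

```python
from fnmatch import fnmatchcase

# The dotfile half of the wipe: a dot, one non-dot character, then anything.
DOTFILE_GLOB = ".[!.]*"


def matched_by_dotfile_glob(names):
    """Return the names that the dotfile glob would select."""
    return [name for name in names if fnmatchcase(name, DOTFILE_GLOB)]


if __name__ == "__main__":
    candidates = [".", "..", ".mylogin.cnf", ".secret_key", "ibdata1", "..hidden"]
    print(matched_by_dotfile_glob(candidates))
```

The glob skips `.` and `..` because it requires a non-dot second character. The same requirement means a name beginning with two dots (such as the made-up `..hidden`) would also be skipped, which is an edge case worth knowing about even if it is unlikely in a database data directory. Non-dot names like `ibdata1` are covered by the separate `/*` half of the rm -rf.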

One thing I considered and did NOT do:

  • Renaming the volume (dbdata-pg) to break the name collision. That would force every existing deployment to migrate; not worth it for this homelab. The self-heal pattern handles it transparently.

Risk acceptance:

For a homelab/test use case (current setup), the wipe on tainted detection is correct. For a production user, the contract test's other accepted patterns (tmpfs or init container) preserve data better. The implementer for production picks; the contract enforces the invariant.

Action items for human:

  • Review the bash logic in the command: line. If you prefer a different self-heal pattern, the contract test will accept it as long as one of the three patterns is present.
  • Confirm the contract test's negation is what you want — it explicitly enumerates three patterns so a future implementer has latitude, but if you want to lock to one specific shape (just command:, no tmpfs/init), tighten the assertion.

— damascus-heartbeat

kaykayyali added 1 commit 2026-06-24 05:14:02 +00:00
Merge branch 'main' into fix/compose-db-volume-self-heal
All checks were successful
test / contract-and-unit (pull_request) Successful in 13s
dbf5cba882
Author
Owner

Self-review — PR #7 (rebase verification)

Verified at 2026-06-24 04:50 UTC after merging current main into fix/compose-db-volume-self-heal (merge commit dbf5cba).

What I did

  1. git worktree add /tmp/pr7-rebase fix/compose-db-volume-self-heal — clean checkout at the old tip 2641707 (branched from 60cc8d7, the pre-Postgres PR #1 merge tip).
  2. git merge main --no-edit from the old tip — ort strategy auto-merged cleanly (12 files changed, 452 insertions, 120 deletions). Single merge commit dbf5cba.
  3. git push origin fix/compose-db-volume-self-heal — clean fast-forward, Gitea accepted (no --force needed; the merge commit's first parent IS the remote's old tip).
  4. DAMASCUS_ROOT=/tmp/pr7-rebase python3 -m pytest tests/contract/ -q -> 19/19 pass, including test_db_volume_self_heals_on_recreate (the new contract test introduced on this branch).

Diff review against the §1-recipe contract

The PR adds a single command: block to the db service in docker-compose.yml that:

  • Detects tainted state: non-empty /var/lib/postgresql/data AND no PG_VERSION file.
  • Wipes only when both conditions hold: rm -rf /var/lib/postgresql/data/* /var/lib/postgresql/data/.[!.]*.
  • Execs the standard Postgres entrypoint after the wipe: exec docker-entrypoint.sh postgres.

This matches the option C / command: pattern from references/stack-drift-volume-wipe.md §6.1 — the smallest, most surgical self-heal shape.
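
For readers who want to try the detect-then-wipe behaviour outside a container, here is a small Python model of the first two bullets. It is a sketch under the assumption that the rule is "wipe everything, dotfiles included, only when the dir is non-empty and has no PG_VERSION"; the `wipe_if_tainted` helper is hypothetical and the handoff to docker-entrypoint.sh is deliberately left out.

```python
import shutil
from pathlib import Path


def wipe_if_tainted(data_dir: Path) -> bool:
    """Empty data_dir when it is tainted; return True if a wipe happened."""
    entries = list(data_dir.iterdir())  # includes dotfiles, never "." or ".."
    if not entries:
        return False  # fresh volume: nothing to do
    if (data_dir / "PG_VERSION").is_file():
        return False  # healthy cluster: leave it alone

    for entry in entries:
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove subdirectories recursively
        else:
            entry.unlink()  # remove files and symlinks
    return True
```

Calling the function a second time on the same directory returns `False`, since the first call leaves the directory empty. That is the idempotence property the PR relies on.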

Contract test

test_db_volume_self_heals_on_recreate in tests/contract/test_contracts_match_source.py accepts three self-heal patterns (the command: wipe, a tmpfs: mount, an init: wiper container). The PR's implementation satisfies pattern 1; the test would also pass for the other two shapes if a future PR chose them. Important: the test reads from DAMASCUS_ROOT env var (default /root/damascus-orchestrator), so when verifying on a worktree, run pytest with DAMASCUS_ROOT=/path/to/worktree. That bit me on the first verification attempt — pytest against the main checkout read the pre-rebase compose and (correctly) saw the command: block there from the merged main, but the worktree's own compose is what matters for the PR review.
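
The worktree pitfall comes down to how the root path is resolved. A minimal sketch, assuming the test looks up docker-compose.yml directly under the root; the `compose_path` helper is hypothetical, while the env var name and default path are the ones stated above.

```python
import os
from pathlib import Path

DEFAULT_ROOT = "/root/damascus-orchestrator"


def compose_path(environ=None) -> Path:
    """Resolve the compose file from DAMASCUS_ROOT, else the default root."""
    env = os.environ if environ is None else environ
    root = Path(env.get("DAMASCUS_ROOT", DEFAULT_ROOT))
    # Assumption: docker-compose.yml sits at the top of the checkout.
    return root / "docker-compose.yml"


if __name__ == "__main__":
    # Without the variable, the main checkout is read, not the worktree.
    print(compose_path({}))
    print(compose_path({"DAMASCUS_ROOT": "/tmp/pr7-rebase"}))
```

With no override the path points at the main checkout, so a run from inside a worktree can silently verify the wrong file.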

What this PR fixes (concrete)

After any future MySQL→Postgres (or Postgres→anything-else) compose-swap PR merge, the named dbdata volume holds the old engine's data, and the new engine's initdb errors with directory exists but is not empty. The stack crashes on docker compose up -d --build until someone manually wipes the volume. This PR makes the stack self-heal across engine swaps — no manual docker volume rm required.

Verified in the rebase pass

  • All 13 commits that landed on main between the PR's branch-base (60cc8d7) and current main (9aea9ee) — the Postgres migration PR #1's merge commit, the §4 amendments PR #2/3, the test migrations PR #4/5, the cycle/3-txn/max_tokens fix PR #8, the idempotent resume fix PR #9 — auto-merge cleanly into this PR with no conflicts.
  • The test suite that landed on main (PR #4 + PR #5) — including the new test_db_volume_self_heals_on_recreate — all pass on the merged result.

Recommendation

Mergeable. The PR is mechanically clean, the contract test is in place, and the diff is the smallest viable self-heal. If a reviewer prefers the tmpfs: or init: wiper patterns for their deployment, the contract test accepts them; this PR picks command: because it's the most surgical addition to current main.

Per the skill (heartbeat agents can't self-approve), I'm leaving this as a tea comment rather than a tea pulls approve. A human reviewer with merge rights can take it from here.

— heartbeat agent, 2026-06-24 04:50 UTC

kaykayyali merged commit 7cc3ff949a into main 2026-06-24 13:09:25 +00:00

Reference: kaykayyali/damascus-orchestrator#7