fix(compose): db service self-heals tainted dbdata volume on bootstrap #7
What
Two-line addition to docker-compose.yml: a command: on the db service that detects a tainted /var/lib/postgresql/data directory (non-empty AND no PG_VERSION) and wipes it before docker-entrypoint.sh runs initdb. This makes the stack self-heal across engine-swap PR merges (e.g. MySQL→Postgres from PR #1).
Plus one new contract test (test_db_volume_self_heals_on_recreate) that asserts the compose file self-heals the data dir via one of three accepted patterns.
Why
After PR #1 merged, the live db-1 container crashlooped because the named dbdata volume held MySQL InnoDB data from the prior stack. Postgres initdb refuses to bootstrap over a non-empty directory. The recovery path (manual docker volume rm damascus-orchestrator_dbdata) is documented in queries/damascus-orchestrator/2026-06-23-postgres-init-volume-drift.md (option C as the recommended non-recurring fix).
Verification (2026-06-23 22:39 UTC)
Test
tests/contract/test_contracts_match_source.py::test_db_volume_self_heals_on_recreate. Source-grep, no docker needed, CI-friendly.
Negative-checked by reverting docker-compose.yml to current main and confirming the test fails with the assertion error naming which of the three accepted patterns did not match.
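A source-grep contract test of this kind could look roughly like the sketch below. It is an illustration only: the flag names come from the assertion message quoted in the self-review, but the regexes, the `accepted_patterns` and `self_heals` helpers, and the sample compose text (including the `postgres:16` image tag) are assumptions, not the contents of test_contracts_match_source.py.

```python
import re

# Hypothetical compose fragment for illustration; the real docker-compose.yml
# may differ. "$$" is Compose's escape for a literal "$" in the shell command.
SAMPLE_COMPOSE = """\
services:
  db:
    image: postgres:16
    command: >
      bash -c 'if [ -n "$$(ls -A /var/lib/postgresql/data)" ] && [ ! -f /var/lib/postgresql/data/PG_VERSION ]; then rm -rf /var/lib/postgresql/data/* /var/lib/postgresql/data/.[!.]*; fi; exec docker-entrypoint.sh postgres'
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
"""


def accepted_patterns(compose_text: str) -> dict:
    """Return one boolean per accepted self-heal pattern (assumed heuristics)."""
    has_wipe_command = (
        re.search(r"rm -rf\s+/var/lib/postgresql/data", compose_text) is not None
        and "PG_VERSION" in compose_text
    )
    has_tmpfs_dbdata = re.search(r"^\s*tmpfs:", compose_text, re.MULTILINE) is not None
    has_init_wiper = re.search(r"^\s*init:", compose_text, re.MULTILINE) is not None
    return {
        "has_wipe_command": has_wipe_command,
        "has_tmpfs_dbdata": has_tmpfs_dbdata,
        "has_init_wiper": has_init_wiper,
    }


def self_heals(compose_text: str) -> bool:
    """The contract holds if at least one accepted pattern is present."""
    return any(accepted_patterns(compose_text).values())


if __name__ == "__main__":
    flags = accepted_patterns(SAMPLE_COMPOSE)
    assert self_heals(SAMPLE_COMPOSE), f"None of the accepted patterns matched: {flags}"
    print(flags)
```

Because the check only reads text, it needs no Docker daemon. The trade-off is that it verifies the shape of the compose file, not the runtime behavior of the container.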
Risk
Idempotent. The wipe branch fires only when (a) the dir is non-empty AND (b) PG_VERSION is missing. A fresh volume is empty (skips the wipe). A healthy cluster has PG_VERSION (skips the wipe). A tainted volume (the bug case) is wiped and initdb bootstraps.
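The three cases above can be mirrored in a small Python model of the decision. This is a sketch of the logic only, not the shell command in the PR; `is_tainted` and `self_heal` are hypothetical names introduced for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def is_tainted(data_dir: Path) -> bool:
    """Tainted means: the directory is non-empty AND has no PG_VERSION file."""
    has_entries = any(data_dir.iterdir())
    has_pg_version = (data_dir / "PG_VERSION").is_file()
    return has_entries and not has_pg_version


def self_heal(data_dir: Path) -> bool:
    """Wipe the directory contents only when tainted; return True if wiped."""
    if not is_tainted(data_dir):
        return False
    for entry in data_dir.iterdir():  # iterdir() also yields dotfiles
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp)
        print("fresh volume wiped:", self_heal(data))
        (data / "ibdata1").write_text("leftover MySQL data")
        print("tainted volume wiped:", self_heal(data))
        (data / "PG_VERSION").write_text("16\n")
        print("healthy cluster wiped:", self_heal(data))
```

Running the function twice in a row on the same directory is safe: after a wipe the directory is empty, so the second call takes the fresh-volume branch and does nothing.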
For the damascus-orchestrator homelab use case, persistent DB state is bounded: work_items rows for E2E tests are seeded by damascus ingest and have no production value. The stack-drift resolution at 2026-06-23 20:42 already established that the volume is non-production and re-seeding is the recovery path.
For a production user who can't afford the wipe: the contract test accepts two alternative self-heal patterns (tmpfs: mount, init: container), so the implementer can choose the right shape for their deployment. This PR picks the command: pattern because it is the smallest, most surgical change to current main.
Refs
Self-review
Cannot self-approve per the skill: heartbeat agents post a tea comment instead. Will post the review comment immediately after opening this PR.
Self-review for PR #7 (fix/compose-db-volume-self-heal):
Verification done locally:
Negative check: reverted docker-compose.yml to current main and confirmed test_db_volume_self_heals_on_recreate fails with 'None of the accepted patterns matched: has_wipe_command=False, has_tmpfs_dbdata=False, has_init_wiper=False'. Restored the fix; test passes again.
Code observations (all non-blocking):
The contract test accepts three self-heal patterns (command-on-db with wipe, tmpfs on dbdata, init container) so a future implementer has options. The chosen pattern is the most surgical for this codebase. If anyone wants the wipe to be less aggressive (e.g. only when the dir contains MySQL fingerprints), the bash check could narrow from 'no PG_VERSION' to 'no PG_VERSION AND has ib_/binlog./auto.cnf'. That would be a different contract though, and the current contract ('tainted != valid Postgres cluster') is the simpler invariant. Recommendation: leave as-is.
The rm -rf includes dotfiles (.placeholder-style) via the /.[!.]* glob. That's correct for cleaning up MySQL artifacts like .mylogin.cnf or .secret_key. Verified manually that the glob doesn't try to match . and ..
Trailing newlines: both files end with \n. No no-newline-at-EOF markers in the diff.
Compose YAML resolves cleanly via 'docker compose config' — the > folded scalar is normalized to a list-form command, which Compose v2 expects. Verified with 'docker compose config --quiet'.
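The dotfile observation above can be spot-checked without a shell. The sketch below uses Python's fnmatch module, which implements the same bracket syntax, to test the .[!.]* pattern against a few names. The file names are examples, and `matched` is a hypothetical helper. Note one difference from the shell: in fnmatch a bare * also matches leading dots, so only the dotfile pattern is exercised here.

```python
import fnmatch

# The dotfile half of the rm -rf glob: a dot, one non-dot character, then anything.
DOTFILE_GLOB = ".[!.]*"


def matched(names):
    """Return the names that the dotfile glob would select."""
    return [name for name in names if fnmatch.fnmatchcase(name, DOTFILE_GLOB)]


if __name__ == "__main__":
    candidates = [".", "..", ".mylogin.cnf", ".secret_key", "ibdata1", "..hidden"]
    print(matched(candidates))
```

The pattern needs at least two characters and a non-dot second character, which is why . and .. are never selected. The same rule means a name starting with two dots, such as ..hidden, is also skipped.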
One thing I considered and did NOT do:
Risk acceptance:
For a homelab/test use case (current setup), the wipe on tainted detection is correct. For a production user, the contract test's other accepted patterns (tmpfs or init container) preserve data better. The implementer for production picks; the contract enforces the invariant.
Action items for human:
— damascus-heartbeat
Self-review — PR #7 (rebase verification)
Verified at 2026-06-24 04:50 UTC after rebasing fix/compose-db-volume-self-heal onto current main (merge commit dbf5cba).
What I did
git worktree add /tmp/pr7-rebase fix/compose-db-volume-self-heal: clean checkout at the old tip 2641707 (branched from 60cc8d7, the pre-Postgres PR #1 merge tip).
git merge main --no-edit from the old tip: ort strategy auto-merged cleanly (12 files changed, 452 insertions, 120 deletions). Single merge commit dbf5cba.
git push origin fix/compose-db-volume-self-heal: clean fast-forward, Gitea accepted (no --force needed; the merge commit's first parent IS the remote's old tip).
DAMASCUS_ROOT=/tmp/pr7-rebase python3 -m pytest tests/contract/ -q: 19/19 pass, including test_db_volume_self_heals_on_recreate (the new contract test introduced on this branch).
Diff review against the §1-recipe contract
The PR adds a single command: block to the db service in docker-compose.yml that:
Checks for a non-empty /var/lib/postgresql/data AND no PG_VERSION file.
If both hold, runs rm -rf /var/lib/postgresql/data/* /var/lib/postgresql/data/.[!.]*.
Then runs exec docker-entrypoint.sh postgres.
This matches the option C / command: pattern from references/stack-drift-volume-wipe.md §6.1, the smallest, most surgical self-heal shape.
Contract test
test_db_volume_self_heals_on_recreate in tests/contract/test_contracts_match_source.py accepts three self-heal patterns (the command: wipe, a tmpfs: mount, an init: wiper container). The PR's implementation satisfies pattern 1; the test would also pass for the other two shapes if a future PR chose them. Important: the test reads from the DAMASCUS_ROOT env var (default /root/damascus-orchestrator), so when verifying on a worktree, run pytest with DAMASCUS_ROOT=/path/to/worktree. That bit me on the first verification attempt: pytest against the main checkout read the pre-rebase compose and (correctly) saw the command: block there from the merged main, but the worktree's own compose is what matters for the PR review.
What this PR fixes (concrete)
After any future MySQL→Postgres (or Postgres→anything-else) compose-swap PR merge, the named dbdata volume holds the old engine's data, and the new engine's initdb errors with directory exists but is not empty. The stack crashes on docker compose up -d --build until someone manually wipes the volume. This PR makes the stack self-heal across engine swaps: no manual docker volume rm required.
Verified in the rebase pass
The changes on main between the PR's branch-base (60cc8d7) and current main (9aea9ee) auto-merge cleanly into this PR with no conflicts: the Postgres migration PR #1's merge commit, the §4 amendments PR #2/3, the test migrations PR #4/5, the cycle/3-txn/max_tokens fix PR #8, and the idempotent resume fix PR #9.
The contract tests from main (PR #4 + PR #5), including the new test_db_volume_self_heals_on_recreate, all pass on the merged result.
Recommendation
Mergeable. The PR is mechanically clean, the contract test is in place, and the diff is the smallest viable self-heal. If a reviewer prefers the tmpfs: or init: wiper patterns for their deployment, the contract test accepts them; this PR picks command: because it's the most surgical addition to current main.
Per the skill (heartbeat agents can't self-approve), I'm leaving this as a tea comment rather than a tea pulls approve. A human reviewer with merge rights can take it from here.
— heartbeat agent, 2026-06-24 04:50 UTC