Migrate to Postgres + Taskiq (conform to orchestration plan) #1
Summary
Brings the orchestrator to the design plan: MySQL → Postgres 16 and cron → Taskiq (the Python BullMQ-equivalent over a Redis broker). Postgres FOR UPDATE SKIP LOCKED stays the atomic claim; the per-tick "claim one item, run one phase" model is unchanged. damascus cycle (CLI / bin/run-cycle.sh) remains the deterministic one-shot operator path.
Approved fixes folded in
- claim_for_merge: removed the call to the non-existent state.claim_for_merge; claim order is now review → build → spec (merge happens inside review on pass).
- attempts >= budget_cycles parks the row as blocked, opens a human_issue, and emits work.blocked (design §5/§16).
- spec_wrong: added to phases.VERDICTS; refine_spec emits it when the spec is missing required sections (Goal/Acceptance Criteria/TDD Plan/Test Command). Routes to spec (re-run refiner), not awaiting_human, which keeps it distinct from spec_ambiguous.
Driver / schema
- dict_row cursor, Jsonb() for JSONB, %s params).
- schema.sql rewritten to PG16: guarded CREATE TYPE enums, JSONB, TIMESTAMPTZ, BIGSERIAL, a BEFORE UPDATE trigger replacing MySQL's ON UPDATE CURRENT_TIMESTAMP, v_active_claims view.
- cli init guard-creates the DB and applies the whole schema in one execute() (incl. DO $$ blocks).
Queue
- src/damascus/tasks.py: ListQueueBroker + TaskiqScheduler with a sync run_cycle task (→ cycle.tick()) on a * * * * * cron label. Dockerfile CMD runs the worker; compose adds redis:7 + an orchestrator-scheduler service.
Bugs found & fixed during verification
- cycle.py / cli.py status hardcoded /data/status/active.json → now settings.data_dir / "status".
- DEFAULT_SOCKET_TIMEOUT=5 killed idle Taskiq workers (indefinite BRPOP + uncaught TimeoutError). Broker now sets socket_timeout=None.
- orchestrator-scheduler pointed at damascus.tasks:broker → fixed to damascus.tasks:scheduler.
- tasks.py docstring referenced non-existent --concurrency → corrected to --max-threadpool-threads.
Verification (all green)
- postgres:16; damascus init end-to-end.
- damascus cycle smoke: spec → build on pass; forced-fail at budget → blocked + human_issue + work.blocked; spec_wrong → spec; spec_ambiguous → awaiting_human; answer → spec.
- run_cycle.kiq() → worker runs cycle.tick() → row advances spec → build.
- damascus cycle calls): seeded spec row went spec → build (pass) → build → build (tests_failed, retry) → build → blocked (tests_failed) + work.blocked + open human_issue. Proves that the queue replaces cron and that the loop-breaker fires via the queue.
Out of scope (deliberately)
- assess advisory, sprint reconciler, wiki snapshot-pinning, outbox drainer/overseer, metrics analyzer, fair-share scheduler): design Phase-gated.
- tests/e2e/*) stays docker-stack-dependent; CI runs contract + unit only.
- docker-compose.yml:57: it's a personal admin token. Recommend rotating + moving to a secret.
What this branch established
- SELECT ... FOR UPDATE SKIP LOCKED in state.py. Keep SKIP LOCKED; it's the whole concurrency story.
- src/damascus/tasks.py is the wiring. The per-tick model is unchanged: run_cycle just calls cycle.tick(). Do not redesign dispatch into per-phase queues; the plan keeps one claim + one phase per tick.
How to run it
- damascus cycle / bin/run-cycle.sh = deterministic one-shot (operators + E2E). Bypasses the queue.
- --max-threadpool-threads (sync tasks run in a threadpool). There is no --concurrency flag in taskiq 0.12.x.
Gotchas you will hit if you forget
- socket_timeout=5. ListQueueBroker does an indefinite BRPOP; a 5s read timeout raises TimeoutError, which taskiq's listen() does NOT catch (it's a sibling of ConnectionError, not a subclass) → worker dies + restarts in a loop while idle. The broker is constructed with socket_timeout=None; keep that. Don't "helpfully" add a socket_timeout.
- taskiq scheduler takes the TaskiqScheduler instance, not the broker. Path is damascus.tasks:scheduler (not :broker). The compose orchestrator-scheduler service is set correctly; don't revert it.
- dict → JSONB. Wrap dict/list values bound to JSONB columns with psycopg.types.json.Jsonb(...). See state.upsert_story / set_phase (last_feedback) / emit_event / cli event inserts. A bare dict → "cannot adapt type 'dict'".
- settings.data_dir / "status" / "active.json" (configurable). Don't re-hardcode /data; it breaks anywhere /data isn't writable.
- attempts is post-increment. The claim increments it, so in _next_phase_on_verdict compare item["attempts"] >= item["budget_cycles"] directly (no off-by-one). pass is exempt from the breaker.
Verdict routing (cycle._next_phase_on_verdict)
- pass: review → merged, build → review, spec → build
- tests_failed / rebase_conflict / no_pr → build (retry)
- spec_ambiguous → awaiting_human (opens a human_issue in refine_spec)
- spec_wrong → spec (re-run refiner; no human issue, because it's an internally broken spec, not an ambiguity)
- attempts >= budget → blocked + human_issue + work.blocked event
Still open / next steps (not done in this PR)
- docker-compose.yml:57 (it's a personal admin token in VCS) → move to a secret.
- assess advisory + lint/build gate, sprint-status.yaml reconciler, wiki snapshot-pinning + merge-gate fact writing, outbox drainer + overseer, metrics analyzer, global spend caps, scope-disjoint/fair-share dispatch.
- tests/e2e/*) still needs the docker stack; it's not in CI. If you add it to CI, stop the worker service in E2E setup before exec-ing one-shot ticks (worker would race the seeded rows).
- postgres:16 + redis:7 containers on host alt-ports (5433/6380) due to WSL2 port conflicts; the real compose uses the docker network.
Tests
- pytest tests/contract/ tests/unit/: 19 passing. Requires live Postgres (DAMASCUS_PG_* + DAMASCUS_ROOT + DAMASCUS_SCHEMA_PATH); run damascus init first.
📋 Added the design docs to this branch (commit cc7f442), side by side in docs/:
- docs/multi-project-orchestration-plan_1.md: the original design plan (was only in Downloads; now in-repo).
- docs/multi-project-orchestration-plan_amendments.md: the six architecture amendments, reviewed and revised against the post-migration codebase.
§4 amendment compliance: attempts is not reset when resuming from awaiting_human.
The amendments doc (line ~190, §4 "Which verdicts consume budget") is explicit:
This SET phase='spec' clause returns the row to spec for re-refinement but does not reset attempts. Concrete failure mode:
- awaiting_human after attempts=2 (a spec_ambiguous open question, no autonomous failure).
- spec → attempts becomes 3. The first autonomous failure (e.g. spec_wrong) puts the row over budget_cycles (default 3) and parks it as blocked, even though the human just spent time giving the answer.
Two ways to fix; the second is the one the amendment text implies:
(a) Reset attempts to 0 in this UPDATE when transitioning awaiting_human → spec. The autonomous budget starts fresh because the human just removed the blocker.
(b) Decrement attempts by 1 (or by the count of spec_ambiguous events the row has logged in events_outbox) so the resume is cheaper than a fresh claim but not a full reset. Subtler; needs the spec_ambiguous-event count.
Either is fine; (a) is simpler and matches the plain reading of "the budget resumes counting only on autonomous retries after the human answers." Please add a one-line test in tests/contract/ that:
- spec_ambiguous so attempts=2 and phase='awaiting_human'.
- damascus answer.
- attempts is now 0 (or 1, if you go with option (b)).
- claim_for_spec then tests_failed does not park the row.
Without this fix, the §4 amendment as worded is not actually enforced by the code, and the schema default change in PR #2 makes the budget cap tighter, which exposes the bug faster. Worth landing in PR #1 (or a follow-up) before §7 / §2 / §3 work piles more load on the budget loop-breaker.
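The failure mode and fix (a) described above can be modelled without a database. The following is a minimal in-memory sketch, not the project's code: `Row`, `claim`, `answer`, `next_phase_on_verdict`, and `resume_then_fail` are hypothetical stand-ins for the real `state.py` / `cycle.py` functions. It assumes the breaker check applies to every non-pass verdict and that the claim increments `attempts`, as the PR notes state.

```python
from dataclasses import dataclass

# Routing taken from the PR's "Verdict routing" list.
PASS_ROUTES = {"review": "merged", "build": "review", "spec": "build"}
RETRY_TO_BUILD = {"tests_failed", "rebase_conflict", "no_pr"}


@dataclass
class Row:
    """Hypothetical stand-in for a work item row."""
    phase: str = "spec"
    attempts: int = 0
    budget_cycles: int = 3  # default quoted in the review


def claim(row: Row) -> None:
    """The claim increments attempts before the phase runs."""
    row.attempts += 1


def next_phase_on_verdict(row: Row, verdict: str) -> str:
    """Simplified model of cycle._next_phase_on_verdict (assumed ordering)."""
    if verdict == "pass":
        return PASS_ROUTES[row.phase]  # pass is exempt from the breaker
    if row.attempts >= row.budget_cycles:
        return "blocked"  # loop-breaker; attempts is already incremented
    if verdict in RETRY_TO_BUILD:
        return "build"
    if verdict == "spec_wrong":
        return "spec"
    if verdict == "spec_ambiguous":
        return "awaiting_human"
    raise ValueError(f"unknown verdict: {verdict}")


def answer(row: Row, reset_attempts: bool) -> None:
    """Model of `damascus answer`: awaiting_human -> spec, optionally option (a)."""
    if row.phase != "awaiting_human":
        raise ValueError("row is not awaiting a human answer")
    row.phase = "spec"
    if reset_attempts:
        row.attempts = 0  # option (a): the autonomous budget starts fresh


def resume_then_fail(reset_attempts: bool) -> str:
    """Seed the review's scenario, answer, claim, then fail once with tests_failed."""
    row = Row(phase="awaiting_human", attempts=2)
    answer(row, reset_attempts)
    claim(row)
    row.phase = next_phase_on_verdict(row, "tests_failed")
    return row.phase


print(resume_then_fail(reset_attempts=False))  # blocked
print(resume_then_fail(reset_attempts=True))   # build
```

Without the reset, the claim after the human answer brings `attempts` to 3, so the first failure trips the breaker. With option (a), the same failure is routed to a normal retry. A real contract test would make the same two assertions against Postgres.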
(Side note: the diff also drops the trailing newline on src/damascus/state.py and src/damascus/tasks.py, shown by the "\ No newline at end of file" markers. Not a correctness issue; some linters and git diff UIs complain. Worth a one-character fix on rebase.)
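For the trailing-newline fix, a small helper like the one below would do. It is only a sketch: `missing_final_newline` and `fix_final_newline` are hypothetical names, and the two paths are the ones named in the side note, taken relative to the repository root.

```python
from pathlib import Path


def missing_final_newline(path: Path) -> bool:
    """Return True if the file is non-empty and does not end with a newline."""
    data = path.read_bytes()
    return len(data) > 0 and not data.endswith(b"\n")


def fix_final_newline(path: Path) -> bool:
    """Append a newline if one is missing; return True if the file was changed."""
    if not missing_final_newline(path):
        return False
    with path.open("ab") as fh:
        fh.write(b"\n")
    return True


# Paths from the side note; skipped silently when run outside the repository.
for name in ("src/damascus/state.py", "src/damascus/tasks.py"):
    target = Path(name)
    if target.exists() and fix_final_newline(target):
        print(f"added trailing newline to {name}")
```

Most editors and formatters can enforce the same thing automatically, so this is only useful as a one-off on rebase.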