§7 Metrics/Analyzer — implementation work blocked on 5 BMAD design questions #6

Open
opened 2026-06-23 22:03:35 +00:00 by kaykayyali · 1 comment
Owner

Summary

§7 (Metrics/Analyzer) is the load-bearing prerequisite for §2 (trust threshold) and §3 (dynamic scope-overlap policy) per docs/multi-project-orchestration-plan_amendments.md. §2 and §3 cannot land without it. This issue captures the implementation work and the 5 open design questions that BMAD needs to answer before code lands.

Source of truth

  • Contract: /opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md (status: draft; the canonical input/output/side-effects contract).
  • Amendment: docs/multi-project-orchestration-plan_amendments.md §7 (rolling rates: first-pass merge rate @ 85%/70%, rebase-conflict rate @ 5%/10%).
  • Schema inputs (already in schema.sql, no migration needed):
    • work_items(phase, attempts, last_verdict, updated_at, merged_at, project, story_id, file_scope)
    • events_outbox(kind, work_item_id, payload, created_at) — every transition logged by state.emit_event()
    • cost_ledger(work_item_id, project, phase, input_tokens, output_tokens, usd, recorded_at) — optional consumer for §1 SRVG, not required for §2/§3.

Required output (v1 surface)

Rate Numerator Denominator Window Consumer
first_pass_merge_rate rows merged on attempts=1 rows reaching phase='merged' last 50 consecutive merges (global) §2 trust threshold
rebase_conflict_rate rows whose final pre-merge verdict was rebase_conflict rows reaching phase='merged' last 100 merges (global) §3 scope-overlap policy
p95_attempts_to_merge attempts at merge time n/a last 100 merges observability

Bold rows are the required v1 surface. p95_attempts_to_merge is cheap to compute alongside and called out for completeness.

Open design questions (need BMAD/human input)

These are listed in full in the contract page §6. The implementation is silent on these on purpose. Each option has tradeoffs; the heartbeat cannot pick.

  1. Q1 — Cadence. Run on every merge (free, one event per merge) vs. on a timer (cheaper writes, slightly stale rates) vs. both (on-merge + on-demand CLI).
  2. Q2 — Storage. Write a new metrics_windows table with one row per (metric_name, as_of) for queryable history vs. rely on events_outbox(metric.*) as the only audit trail (cheaper, harder to backfill).
  3. Q3 — Window scope for §3. Global rebase-conflict rate vs. per-project vs. both (and which wins on conflict).
  4. Q4 — First-pass vs. any-pass for §2. The implementation must distinguish "merged on first attempt" from "merged on a later attempt" — derivable from the events_outbox transition log but the contract doesn't say which log event is the canonical "first attempt" marker.
  5. Q5 — Empty windows. What does first_pass_merge_rate mean when fewer than 50 items have merged? §2 needs to be safe when N < 50. Reasonable default: rate is undefined and the threshold check is skipped until N >= 50.

Minimum bar to call §7 "done"

Per the contract page §"What the implementation must include":

  1. A metrics module under src/damascus/ exposing the two required rolling rates and the threshold-crossing event emitter.
  2. A damascus metrics CLI verb returning the current snapshot for the operator.
  3. A tests/contract/test_metrics.py E2E suite that:
    • Seeds a known history of merges with deterministic first-pass and rebase_conflict outcomes.
    • Asserts the rolling rates match expected values at each merge.
    • Asserts threshold-crossing events fire in both directions (rate crossing up AND down).
    • Covers Q5 (empty / partial windows).
  4. Updates to the operator skills/SKILL.md documenting the new CLI verb and the threshold-crossing event type.

Items 1–3 are the load-bearing work. Item 4 is operator hygiene.

Why this is not in the current PR queue

  • The heartbeat cannot make implementation choices (gap-finding discipline: "describe, don't fix").
  • §2 and §3 are explicitly "small policy switch[es] on top of §7" — they block on this.
  • §1 SRVG is non-code-pipeline and §6 wiki-trigger-rule depends on unbuilt §11 merge-gate writer — both deferred per the amendment-implementation order.
  • Contract: /opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md
  • Amendments doc: docs/multi-project-orchestration-plan_amendments.md
  • Amendment sequencing concept: /opt/damascus/llm-wiki/concepts/amendment-sequencing.md
  • Original §7 in plan v1 (now superseded by the analyzer-only reduction): docs/multi-project-orchestration-plan_1.md §7

Heartbeat status (no human currently assigned)

Filed by the 30-min heartbeat agent in YOLO mode (Kay is away). This issue is descriptive — no implementation decision has been made. The first human (or BMAD run) to pick this up should resolve the 5 open questions before code lands.


Heartbeat recommendations (added 2026-06-23 by the 30-min heartbeat, YOLO mode)

Per the gap-finding discipline ("don't punt 'BMAD picks' without a recommendation"), the heartbeat adds the following as proposals for BMAD/human to accept or override. These are NOT implementation choices in code; they live only in the issue body for the design call.

Q1 (Cadence) — Recommend: both (on-merge + on-demand CLI).

  • On-merge keeps the §2/§3 consumer responsive without needing a timer; the §2 hysteresis (85%/70%) wants immediate signaling.
  • The on-demand CLI is item 2 of the minimum bar anyway, so the cost is "expose the same reader in two places" — trivial.
  • Timer-only would mean the scheduler is "5 minutes stale" on a slow day; rate events that trigger threshold crossings want to land in real time.
  • The cost (more events) is bounded: only on phase='merged', rare relative to build/review churn. events_outbox is append-only by design.

Q2 (Storage) — Recommend: rely on events_outbox(metric.*) only.

  • A separate metrics_windows table is a second source of truth that has to stay in sync with the events log.
  • events_outbox is already the audit trail; reading the last 50/100 metric events is one indexed query against events_outbox(work_item_id, kind, created_at).
  • Backfill is no harder than reading the same table — replays from a snapshot rebuild rates from the log.
  • If the on-demand CLI becomes hot, add a Redis cache later. That's an optimization, not v1.

Q3 (Window scope for §3) — Recommend: per-project primary, fallback to global when project history < window size.

  • §3 is about scope-overlap policy — that's a per-project concern; one project's rebase conflicts shouldn't tighten scope on a calm project.
  • A noisy project (e.g. a single bad reviewer prompt) shouldn't drag the global rate above 10%.
  • Fallback rule: if the project has fewer than rebase_conflict_rate's window-size (100) merged items, use the global rate for that project's policy decision.
  • §2 (trust threshold) stays global — fair-share is a system-wide property and a per-project split would defeat its purpose.

Q4 (First-pass vs. any-pass marker) — Recommend: work_items.attempts == 1 at the merge event is the canonical first-pass marker.

  • The attempts column is the existing source of truth; it increments on each retry per the cycle in phases.py.
  • Concretely: query events_outbox for the first row per work_item with payload->>'target' = 'merged'; check work_items.attempts == 1 at that moment.
  • Alternative considered ("the first phase.transition per work item"): noisier — includes spec→build, build→review, etc., not just merge attempts. Stick with attempts == 1 at merge.

Q5 (Empty windows) — Recommend: rate is undefined until N >= window_size; CLI surfaces n/total.

  • Do not invent a value (no 0/1.0 default, no partial-window average).
  • §2/§3 consumers must check n >= window_size (50 for first-pass, 100 for rebase-conflict) before acting on the rate.
  • CLI output exposes both the rate and n/total so the operator sees partial state explicitly (e.g. first_pass_merge_rate = undefined (n=23/50)).
  • This is what the amendments doc implies ("§2 needs to be safe when N < 50") — codify as the contract.

How to action these

BMAD picks from the 5 recommendations above. If any is rejected, BMAD writes the alternative into the contract page (/opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md §6) and this issue gets updated to point to the new contract text. Code PR for the metrics module is then a mechanical translation of the contract — not a design decision.

If all 5 are accepted as-is, the heartbeat will write the contract page updates on the next tick (descriptive work, gap-finding discipline compliant: editing the contract text IS the descriptive work, not choosing implementation). Then a separate PR can land the actual metrics.py module.

## Summary §7 (Metrics/Analyzer) is the load-bearing prerequisite for §2 (trust threshold) and §3 (dynamic scope-overlap policy) per `docs/multi-project-orchestration-plan_amendments.md`. §2 and §3 cannot land without it. This issue captures the implementation work and the 5 open design questions that BMAD needs to answer before code lands. ## Source of truth - **Contract**: `/opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md` (status: draft; the canonical input/output/side-effects contract). - **Amendment**: `docs/multi-project-orchestration-plan_amendments.md` §7 (rolling rates: first-pass merge rate @ 85%/70%, rebase-conflict rate @ 5%/10%). - **Schema inputs** (already in `schema.sql`, no migration needed): - `work_items(phase, attempts, last_verdict, updated_at, merged_at, project, story_id, file_scope)` - `events_outbox(kind, work_item_id, payload, created_at)` — every transition logged by `state.emit_event()` - `cost_ledger(work_item_id, project, phase, input_tokens, output_tokens, usd, recorded_at)` — optional consumer for §1 SRVG, not required for §2/§3. ## Required output (v1 surface) | Rate | Numerator | Denominator | Window | Consumer | |---|---|---|---|---| | `first_pass_merge_rate` | rows merged on `attempts=1` | rows reaching `phase='merged'` | last 50 consecutive merges (global) | §2 trust threshold | | `rebase_conflict_rate` | rows whose final pre-merge verdict was `rebase_conflict` | rows reaching `phase='merged'` | last 100 merges (global) | §3 scope-overlap policy | | `p95_attempts_to_merge` | attempts at merge time | n/a | last 100 merges | observability | Bold rows are the **required v1** surface. `p95_attempts_to_merge` is cheap to compute alongside and called out for completeness. ## Open design questions (need BMAD/human input) These are listed in full in the contract page §6. The implementation is silent on these on purpose. Each option has tradeoffs; the heartbeat cannot pick. 1. **Q1 — Cadence.** Run on every merge (free, one event per merge) vs. on a timer (cheaper writes, slightly stale rates) vs. both (on-merge + on-demand CLI). 2. **Q2 — Storage.** Write a new `metrics_windows` table with one row per `(metric_name, as_of)` for queryable history vs. rely on `events_outbox(metric.*)` as the only audit trail (cheaper, harder to backfill). 3. **Q3 — Window scope for §3.** Global rebase-conflict rate vs. per-project vs. both (and which wins on conflict). 4. **Q4 — First-pass vs. any-pass for §2.** The implementation must distinguish "merged on first attempt" from "merged on a later attempt" — derivable from the `events_outbox` transition log but the contract doesn't say which log event is the canonical "first attempt" marker. 5. **Q5 — Empty windows.** What does `first_pass_merge_rate` mean when fewer than 50 items have merged? §2 needs to be safe when N < 50. Reasonable default: rate is undefined and the threshold check is skipped until N >= 50. ## Minimum bar to call §7 "done" Per the contract page §"What the implementation must include": 1. A `metrics` module under `src/damascus/` exposing the two required rolling rates and the threshold-crossing event emitter. 2. A `damascus metrics` CLI verb returning the current snapshot for the operator. 3. A `tests/contract/test_metrics.py` E2E suite that: - Seeds a known history of merges with deterministic first-pass and rebase_conflict outcomes. - Asserts the rolling rates match expected values at each merge. - Asserts threshold-crossing events fire in both directions (rate crossing up AND down). - Covers Q5 (empty / partial windows). 4. Updates to the operator `skills/SKILL.md` documenting the new CLI verb and the threshold-crossing event type. Items 1–3 are the load-bearing work. Item 4 is operator hygiene. ## Why this is not in the current PR queue - The heartbeat cannot make implementation choices (gap-finding discipline: "describe, don't fix"). - §2 and §3 are explicitly "small policy switch[es] on top of §7" — they block on this. - §1 SRVG is non-code-pipeline and §6 wiki-trigger-rule depends on unbuilt §11 merge-gate writer — both deferred per the amendment-implementation order. ## Related - Contract: `/opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md` - Amendments doc: `docs/multi-project-orchestration-plan_amendments.md` - Amendment sequencing concept: `/opt/damascus/llm-wiki/concepts/amendment-sequencing.md` - Original §7 in plan v1 (now superseded by the analyzer-only reduction): `docs/multi-project-orchestration-plan_1.md` §7 ## Heartbeat status (no human currently assigned) Filed by the 30-min heartbeat agent in YOLO mode (Kay is away). This issue is descriptive — no implementation decision has been made. The first human (or BMAD run) to pick this up should resolve the 5 open questions before code lands. --- ## Heartbeat recommendations (added 2026-06-23 by the 30-min heartbeat, YOLO mode) Per the gap-finding discipline ("don't punt 'BMAD picks' without a recommendation"), the heartbeat adds the following as proposals for BMAD/human to accept or override. These are NOT implementation choices in code; they live only in the issue body for the design call. **Q1 (Cadence) — Recommend: both (on-merge + on-demand CLI).** - On-merge keeps the §2/§3 consumer responsive without needing a timer; the §2 hysteresis (85%/70%) wants immediate signaling. - The on-demand CLI is item 2 of the minimum bar anyway, so the cost is "expose the same reader in two places" — trivial. - Timer-only would mean the scheduler is "5 minutes stale" on a slow day; rate events that trigger threshold crossings want to land in real time. - The cost (more events) is bounded: only on `phase='merged'`, rare relative to build/review churn. `events_outbox` is append-only by design. **Q2 (Storage) — Recommend: rely on `events_outbox(metric.*)` only.** - A separate `metrics_windows` table is a second source of truth that has to stay in sync with the events log. - `events_outbox` is already the audit trail; reading the last 50/100 metric events is one indexed query against `events_outbox(work_item_id, kind, created_at)`. - Backfill is no harder than reading the same table — replays from a snapshot rebuild rates from the log. - If the on-demand CLI becomes hot, add a Redis cache later. That's an optimization, not v1. **Q3 (Window scope for §3) — Recommend: per-project primary, fallback to global when project history < window size.** - §3 is about scope-overlap policy — that's a per-project concern; one project's rebase conflicts shouldn't tighten scope on a calm project. - A noisy project (e.g. a single bad reviewer prompt) shouldn't drag the global rate above 10%. - Fallback rule: if the project has fewer than `rebase_conflict_rate`'s window-size (100) merged items, use the global rate for that project's policy decision. - §2 (trust threshold) stays global — fair-share is a system-wide property and a per-project split would defeat its purpose. **Q4 (First-pass vs. any-pass marker) — Recommend: `work_items.attempts == 1` at the merge event is the canonical first-pass marker.** - The `attempts` column is the existing source of truth; it increments on each retry per the cycle in `phases.py`. - Concretely: query `events_outbox` for the first row per work_item with `payload->>'target' = 'merged'`; check `work_items.attempts == 1` at that moment. - Alternative considered ("the first `phase.transition` per work item"): noisier — includes spec→build, build→review, etc., not just merge attempts. Stick with `attempts == 1 at merge`. **Q5 (Empty windows) — Recommend: rate is undefined until N >= window_size; CLI surfaces `n/total`.** - Do not invent a value (no 0/1.0 default, no partial-window average). - §2/§3 consumers must check `n >= window_size` (50 for first-pass, 100 for rebase-conflict) before acting on the rate. - CLI output exposes both the rate and `n/total` so the operator sees partial state explicitly (e.g. `first_pass_merge_rate = undefined (n=23/50)`). - This is what the amendments doc implies ("§2 needs to be safe when N < 50") — codify as the contract. ## How to action these BMAD picks from the 5 recommendations above. If any is rejected, BMAD writes the alternative into the contract page (`/opt/damascus/llm-wiki/concepts/§7-metrics-analyzer.md` §6) and this issue gets updated to point to the new contract text. Code PR for the metrics module is then a mechanical translation of the contract — not a design decision. If all 5 are accepted as-is, the heartbeat will write the contract page updates on the next tick (descriptive work, gap-finding discipline compliant: editing the contract text IS the descriptive work, not choosing implementation). Then a separate PR can land the actual `metrics.py` module.
Author
Owner

§7 contract page landed (heartbeat, 2026-06-24)

Per the issue body's last paragraph, I wrote the §7 contract page at wiki/concepts/§7-metrics-analyzer.md with my 5 recommendations mirrored as "Heartbeat-proposed answers (pending BMAD/human approval)."

Page is on kaykayyali/damascus-wiki main (commit e475104, pushed). It documents:

  • §1 background + load-bearing dependency on §2/§3 amendments
  • §2 input contract (events_outbox + work_items, no schema migration)
  • §3 output contract: the 3 required v1 metrics (first_pass_merge_rate over 50, rebase_conflict_rate over 100, p95_attempts_to_merge) + metric.threshold_crossed event shape
  • §4 side effects (idempotent replay; no row mutation; observational only)
  • §5 how §2 trust threshold + §3 scope-overlap policy consume the output
  • §6 the 5 open design questions with options + tradeoffs
  • §7 heartbeat-proposed answers with explicit accept/override workflow
  • §8 minimum bar (metrics module + damascus metrics CLI + contract tests + SKILL.md update)

No code change, no PR. The contract is now in the place the implementation will translate from — BMAD/human accepts/overrides in §6/§7, then the implementation PR is a mechanical translation, not a design call.

One unit of progress this tick: the contract page exists. Next tick: pick up PR #7 (fix(compose): db service self-heals tainted dbdata volume on bootstrap) which is 14 commits behind current main and needs a rebase (likely via the git merge main recipe from the skill, conflict zone is one contract-test file at most).

— damascus-heartbeat

## §7 contract page landed (heartbeat, 2026-06-24) Per the issue body's last paragraph, I wrote the §7 contract page at `wiki/concepts/§7-metrics-analyzer.md` with my 5 recommendations mirrored as "Heartbeat-proposed answers (pending BMAD/human approval)." Page is on `kaykayyali/damascus-wiki` main (commit `e475104`, pushed). It documents: - **§1** background + load-bearing dependency on §2/§3 amendments - **§2** input contract (`events_outbox` + `work_items`, no schema migration) - **§3** output contract: the 3 required v1 metrics (`first_pass_merge_rate` over 50, `rebase_conflict_rate` over 100, `p95_attempts_to_merge`) + `metric.threshold_crossed` event shape - **§4** side effects (idempotent replay; no row mutation; observational only) - **§5** how §2 trust threshold + §3 scope-overlap policy consume the output - **§6** the 5 open design questions with options + tradeoffs - **§7** heartbeat-proposed answers with explicit accept/override workflow - **§8** minimum bar (metrics module + `damascus metrics` CLI + contract tests + SKILL.md update) No code change, no PR. The contract is now in the place the implementation will translate from — BMAD/human accepts/overrides in §6/§7, then the implementation PR is a mechanical translation, not a design call. **One unit of progress this tick:** the contract page exists. **Next tick:** pick up PR #7 (`fix(compose): db service self-heals tainted dbdata volume on bootstrap`) which is 14 commits behind current main and needs a rebase (likely via the `git merge main` recipe from the skill, conflict zone is one contract-test file at most). — damascus-heartbeat
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: kaykayyali/damascus-orchestrator#6