zalbot/docs/deployment-guide.md
Kaysser Kayyali e2c92e854f
Add unit tests for LLM clients, persona loader, and XP/Foundry rewards
Expands the unit test suite from 320 to 380 tests (+60) and adds a
Gitea Actions CI workflow. Closes all six follow-up recommendations
from the test-architecture validation report.

New tests (tests/unit/):
  - ollamaClient.test.ts          — Ollama SDK wrapper, options passthrough
  - litellmClient.test.ts         — OpenAI SDK wrapper, model fallback
  - personaLoader.test.ts         — Zod validation + cache invalidation
  - foundryReward.test.ts         — Tool plugin: lookup, errors, partial grants
  - xpAwarder.test.ts             — Bulk XP awards + per-player skip reasons
  - redisErrorPath.test.ts        — Singleton error handler does not crash
  - messageRouterRunLLMTurn.test.ts — 18 cases for the runtime heart:
    narrative-only path, tool dispatch, filter correction, retry loop
    guard, missed-skill-check heuristic, typing indicator interval,
    LLM error fallback, archive on resolve.

Coverage (line %):
  - harness/litellmClient.ts      0 → 100
  - harness/ollamaClient.ts       0 → 100
  - harness/tools/foundryReward.ts 0 → 100
  - session/xpAwarder.ts          0 → 100
  - persona/loader.ts             0 → 100
  - db/redis.ts                   0 → 100
  - bot/handlers/messageRouter.ts 0 → 39.86 (runLLMTurn now covered)

Tooling:
  - package.json: + test:coverage, test:watch scripts
  - devDep: @vitest/coverage-v8@^3.1.0
  - tests/README.md: conventions, anti-patterns, template map
  - .gitignore: exclude coverage/
  - .gitea/workflows/test.yml: Node 22, npm cache, tsc --noEmit gate

Documentation (from earlier /bmad-document-project run, now committed):
  - docs/index.md
  - docs/project-overview.md
  - docs/architecture.md
  - docs/deployment-guide.md
  - docs/api-contracts.md
  - docs/data-models.md
  - docs/source-tree-analysis.md
  - docs/component-inventory.md
  - docs/development-guide.md
  - _bmad-output/test-artifacts/automate-validation-report.md

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-19 05:59:13 +00:00


Deployment Guide

Deploying the Mardonar Encounter Engine. Generated 2026-06-19.

Architecture

The bot is a single long-running Node.js process. It connects to:

  • Discord over WebSocket (discord.js v14)
  • Redis for session and player/character registries
  • GraphMCP (HTTP JSON-RPC) for NPC memory, lore search, and encounter log writes
  • LiteLLM (preferred) or Ollama for LLM inference
  • VTT relay (optional) for Foundry VTT integration

The Dockerfile is a multi-stage build on node:22-alpine. There is currently no production docker-compose.yml, only the dev one (docker-compose.dev.yml). Production deploys use the Dockerfile directly with whatever orchestrator is in use.

Build

npm ci --ignore-scripts
npm run build          # tsc → dist/

The build is reproducible from a clean node_modules. The Dockerfile's builder stage does exactly this.

Container image

Dockerfile:

  • Builder (node:22-alpine): npm ci --ignore-scripts, copy src + tsconfig.json, run npm run build
  • Runtime (node:22-alpine): npm ci --omit=dev --ignore-scripts, copy dist/, specs/, lore/, persona.yaml
  • CMD: ["node", "dist/bot/index.js"]

To build locally:

docker build -t mardonar-bot:latest .

The data/ directory is not copied into the image — it must be mounted as a volume in production so tally and summaries persist across restarts.

Local dev (Docker Compose)

docker-compose.dev.yml is the only compose file in the repo. It declares the mardonar-internal Docker network as external: true — it expects the GraphMCP-Example stack (Redis + MCP server) to be running first.

docker compose -f docker-compose.dev.yml up -d
docker compose -f docker-compose.dev.yml logs -f bot

Two services:

  • deploy-commands — one-shot container that runs node dist/scripts/deploy-commands.js. restart: "no".
  • bot — long-running container. restart: unless-stopped. Mounts ./data:/app/data so tally and summaries persist. depends_on: deploy-commands: service_completed_successfully ensures commands are registered before the bot starts serving traffic.

Production deployment

There is no production compose file. Pick one:

Option A: Plain Docker

docker build -t mardonar-bot:latest .
docker run -d \
  --name mardonar-bot \
  --restart unless-stopped \
  --env-file .env \
  -v /var/lib/mardonar/data:/app/data \
  --network mardonar-internal \
  mardonar-bot:latest

Register slash commands once, before starting the container above, so they exist when the bot begins serving traffic. Run the same image with a different command (the dev compose file does this through its deploy-commands service):

docker run --rm \
  --env-file .env \
  --network mardonar-internal \
  mardonar-bot:latest \
  node dist/scripts/deploy-commands.js

Option B: systemd (Linux host)

# /etc/systemd/system/mardonar-bot.service
[Unit]
Description=Mardonar Encounter Engine
After=network.target redis-server.service

[Service]
Type=simple
User=mardonar
WorkingDirectory=/opt/mardonar
EnvironmentFile=/opt/mardonar/.env
ExecStart=/usr/bin/node /opt/mardonar/dist/bot/index.js
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Then reload systemd, start the service, and follow its logs:

sudo systemctl daemon-reload
sudo systemctl enable --now mardonar-bot
sudo journalctl -u mardonar-bot -f

Environment

All runtime configuration is via environment variables, validated by Zod (src/config.ts). The full list is in development-guide.md.

Production essentials:

DISCORD_TOKEN=...
DISCORD_CLIENT_ID=...
DISCORD_GUILD_ID=...           # instant command registration

# Channel restriction: only respond in specific channels
DISCORD_ALLOWED_CHANNELS=123456789012345678,987654321098765432
# User restriction: only allow specific users to run /encounter
DISCORD_ALLOWED_USERS=111111111111111111

# LiteLLM (preferred)
LITELLM_BASE_URL=http://your-litellm-host:4000
LITELLM_API_KEY=...
LITELLM_MODEL=ollama-cloud

# Ollama fallback
OLLAMA_BASE_URL=http://your-ollama-host:11434
OLLAMA_MODEL=gemma4-it:e2b

# GraphMCP (must be reachable)
GRAPHMCP_URL=http://mcp-server:9000
GRAPHMCP_SCORE_THRESHOLD=0.68
GRAPHMCP_INGEST_STREAM=raw.messages

# Persisted state
DATA_DIR=/app/data              # or wherever you mount the volume

# Logging
LOG_LEVEL=info

Security note: DISCORD_ALLOWED_CHANNELS is empty by default, which means the bot will respond in no channels. This is secure by default, but a bot that silently ignores every channel is easy to mistake for a broken deployment. Set it explicitly.
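
The allowlist behaviour and the LiteLLM/Ollama preference can be sketched without any dependencies. This is not the project's src/config.ts (which uses Zod); the function names are hypothetical, and the rule that LiteLLM wins whenever LITELLM_BASE_URL is set is an assumption about what "preferred" means.

```typescript
// Dependency-free sketch. The project validates env vars with Zod in
// src/config.ts; every function name here is hypothetical.

type Env = Record<string, string | undefined>;

// Split a comma-separated ID list, dropping blanks and surrounding whitespace.
function parseIdList(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((id) => id.trim())
      .filter((id) => id.length > 0),
  );
}

// An empty allowlist matches nothing, so an unset variable denies every channel.
function isChannelAllowed(allowed: Set<string>, channelId: string): boolean {
  return allowed.has(channelId);
}

// ASSUMPTION: "preferred" means LiteLLM wins whenever its base URL is set.
function pickLlmBackend(env: Env): "litellm" | "ollama" {
  if (env.LITELLM_BASE_URL) return "litellm";
  if (env.OLLAMA_BASE_URL) return "ollama";
  throw new Error("Set LITELLM_BASE_URL or OLLAMA_BASE_URL");
}
```

Calling parseIdList(process.env.DISCORD_ALLOWED_CHANNELS) at startup and logging the resulting set size would make an accidentally empty allowlist visible in the first few log lines.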

Persistent state

Two kinds of state to back up:

  1. data/tally.json — per-spec run counts. Useful for analytics, not load-bearing.
  2. data/summaries/ — one .txt per resolved encounter. Permanent record.

Session state lives in Redis with a 12h TTL. If Redis is wiped, in-flight sessions are lost but Discord threads themselves remain — the bot will simply not find a session for that thread on next message. No data corruption risk.
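
A backup of the two items above might look like the following sketch. It assumes Node 22 (for fs.cpSync) and copies data/tally.json and data/summaries/ into a timestamped folder; backupData and the folder layout are hypothetical, not part of the repo.

```typescript
import { cpSync, existsSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: copies the two persisted artifacts named above into
// a timestamped folder under backupRoot and returns that folder's path.
function backupData(dataDir: string, backupRoot: string, now: Date = new Date()): string {
  // ISO timestamp with ":" and "." replaced so it is a safe directory name.
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  const target = join(backupRoot, stamp);
  mkdirSync(target, { recursive: true });

  // tally.json is optional: it may not exist before the first encounter runs.
  const tally = join(dataDir, "tally.json");
  if (existsSync(tally)) cpSync(tally, join(target, "tally.json"));

  // Copy the whole summaries directory, including any subfolders.
  const summaries = join(dataDir, "summaries");
  if (existsSync(summaries)) {
    cpSync(summaries, join(target, "summaries"), { recursive: true });
  }

  return target;
}
```

For the plain Docker option this would be run on the host against the mounted volume, for example backupData("/var/lib/mardonar/data", "/var/backups/mardonar"), where the backup path is only an example.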

Health checks

The bot does not currently expose an HTTP health endpoint. Suggested liveness probe patterns:

  • Discord WebSocket liveness — the bot logs [bot] Logged in as <tag> on ready. Scrape stdout for this.
  • Redis — already externally monitored. The bot logs [redis] connection error on failure.
  • GraphMCP — first call after startup will fail loudly if unreachable.
  • Custom probe — call /encounter status in a known thread and check the response (the bot only responds in DISCORD_ALLOWED_CHANNELS).
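
The first pattern (scraping stdout for the ready line) could be scripted roughly as below. The marker text comes from the bullet above; findReadyTag is a hypothetical name, and the sketch assumes the marker appears verbatim in the line whether the output is plain text or pino JSON, and that the tag contains no double quote.

```typescript
// Marker text taken from the bullet above; the function name is hypothetical.
const READY_MARKER = "[bot] Logged in as ";

// Returns the logged-in tag from the first matching line, or undefined.
// ASSUMPTION: the marker appears verbatim in plain and JSON log lines alike.
function findReadyTag(lines: string[]): string | undefined {
  for (const line of lines) {
    const at = line.indexOf(READY_MARKER);
    if (at === -1) continue;
    // Take everything after the marker, up to a closing quote (JSON) or end of line.
    const rest = line.slice(at + READY_MARKER.length);
    const end = rest.indexOf('"');
    return (end === -1 ? rest : rest.slice(0, end)).trim();
  }
  return undefined;
}
```

A probe script could feed it the output of docker logs mardonar-bot split on newlines and treat undefined as "not ready yet".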

A Docker HEALTHCHECK cannot easily probe the Discord WebSocket connection directly. If you need an HTTP probe, a future iteration could add a small HTTP server (for example Express) that responds 200 while the Discord client is ready and Redis is connected.
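
A minimal sketch of such a probe, using Node's built-in http module in place of Express to avoid a new dependency. Nothing here exists in the repo: the Readiness flags, the /healthz path, and both function names are assumptions for illustration.

```typescript
import { createServer, Server } from "node:http";

// Hypothetical readiness flags; the real bot would set these from its
// Discord client and Redis connection events.
interface Readiness {
  discordReady: boolean;
  redisConnected: boolean;
}

// Pure mapping from readiness to an HTTP status, easy to unit test.
function healthStatus(state: Readiness): 200 | 503 {
  return state.discordReady && state.redisConnected ? 200 : 503;
}

// Starts a tiny probe server. getState is called on every request,
// so the response always reflects the current connection state.
function startHealthServer(getState: () => Readiness, port: number): Server {
  const server = createServer((req, res) => {
    if (req.url !== "/healthz") {
      res.writeHead(404).end();
      return;
    }
    const state = getState();
    res.writeHead(healthStatus(state), { "Content-Type": "application/json" });
    res.end(JSON.stringify(state));
  });
  return server.listen(port);
}
```

Keeping healthStatus separate from the server means the readiness rule can be tested without opening a port, and a container healthcheck only needs to request /healthz and check for a 200.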

Logging

The bot uses pino. In dev, pino-pretty formats to a human-readable stream. In prod, pino emits structured JSON to stdout — pipe to your log shipper (Loki, CloudWatch, etc.).

Useful fields to index:

  • level, time, msg
  • threadId, encounterId (for encounter-specific queries)
  • latencyMs (for LLM and tool latency)
  • error (for failure analysis)
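
As an illustration of querying these fields, the sketch below parses newline-delimited JSON log output and averages latencyMs for one thread. It assumes pino's default output (numeric level, epoch-millisecond time); the LogLine shape and both function names are hypothetical.

```typescript
// Shape of the fields listed above; other keys on a log line are ignored.
interface LogLine {
  level: number;
  time: number;
  msg: string;
  threadId?: string;
  encounterId?: string;
  latencyMs?: number;
}

// Parse newline-delimited JSON, skipping blank lines and anything that is
// not a JSON object (for example pino-pretty output).
function parseLogLines(raw: string): LogLine[] {
  const out: LogLine[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const value: unknown = JSON.parse(line);
      if (value !== null && typeof value === "object") out.push(value as LogLine);
    } catch (e) {
      // Not JSON; ignore the line.
    }
  }
  return out;
}

// Mean latencyMs for one thread, or undefined if no line carries the field.
function meanLatency(lines: LogLine[], threadId: string): number | undefined {
  const values = lines
    .filter((l) => l.threadId === threadId && typeof l.latencyMs === "number")
    .map((l) => l.latencyMs as number);
  if (values.length === 0) return undefined;
  return values.reduce((a, b) => a + b, 0) / values.length;
}
```

In practice a log shipper's query language would do this aggregation; the sketch only shows which fields need to be present on a line for per-thread latency queries to work.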

Operational runbook

Restart the bot

docker restart mardonar-bot
# or: systemctl restart mardonar-bot

Rotate the Discord token

  1. Generate a new token in the Discord developer portal
  2. Update the env var (or secret store)
  3. Restart the bot
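
The old token is invalidated as soon as the new one is generated in step 1, so the bot cannot authenticate again until it restarts with the new token. Complete steps 2 and 3 promptly.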

Re-register slash commands

After changing any src/bot/commands/*.ts:

docker run --rm --env-file .env --network mardonar-internal mardonar-bot:latest \
  node dist/scripts/deploy-commands.js

Or in dev: npm run deploy-commands

Reset a stuck session

A bot restart clears all in-memory state (including reaction managers and burst counters). Redis session state persists. If a session is genuinely stuck (e.g. a tool dispatched but the response was lost), use /encounter end in-thread to force-resolve.

Drain Redis (nuclear option)

docker exec -it <redis-container> redis-cli FLUSHDB

FLUSHDB deletes every key in the currently selected Redis database. Since Redis also holds the player/character registries, confirm they are expendable or stored in a different database before running this.

Open deployment gaps

These are real but not blockers:

  • No production compose file — only docker-compose.dev.yml. Production deploy is ad-hoc.
  • No CD pipeline: a Gitea Actions workflow (.gitea/workflows/test.yml) runs the tests, but there is no .github/workflows/ and build and deploy are manual.
  • No health endpoint — no HTTP probe target.
  • No metrics export — pino logs are the only observability surface.
  • docker-compose.dev.yml references an external Docker network (mardonar-internal) — fine for the dev stack it's designed for, but a fresh deployment needs to either join the same network or remove the reference.