5.1 KiB
Dealix Launch Drills — Execution Plan
Purpose: concrete commands to execute T5, T6, T7, T8 drills from LAUNCH_GATES.md.
All drills are designed to be runnable by Sami on the prod server with minimal agent hand-holding. Each has a clear pass/fail signal.
Preflight (once, on prod server)
# Ensure scripts are executable and up to date
cd /opt/dealix
git pull origin main
chmod +x scripts/ops/*.sh
# Verify current state
git rev-parse --short HEAD # should be latest main
cat /opt/dealix/.last_good_sha # should be ce0027e or newer good commit
systemctl is-active dealix-api # should be: active
curl -s http://127.0.0.1:8001/health/deep | jq . # should show postgres + redis + llm green
T5 — DLQ fault-injection (full closure)
Status: 🟡 primitives live, E2E untested
Target: prove a malformed webhook lands in WEBHOOKS_DLQ
# From any machine with an API key (recommended: the prod server itself)
export API_KEY="$(grep '^API_KEYS=' /opt/dealix/.env | cut -d= -f2- | tr -d '"' | cut -d, -f1)"
export API_BASE="http://127.0.0.1:8001"
bash /opt/dealix/scripts/ops/dlq_fault_injection.sh
Pass criteria:
- Baseline
webhooksdepth isN - After injection, depth is
N+1 /admin/dlq/webhooks/peekreturns the injected event
Cleanup:
curl -X POST -H "X-API-Key: $API_KEY" "$API_BASE/api/v1/admin/dlq/webhooks/drain"
T6 — k6 load test against prod
Status: 🔴 not run
Target: 100 VU, p95 <2s, failure <2% (thresholds baked into k6_smoke.js)
Install k6 (once)
sudo gpg --no-default-keyring \
--keyring /usr/share/keyrings/k6-archive-keyring.gpg \
--keyserver hkp://keyserver.ubuntu.com:80 \
--recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
| sudo tee /etc/apt/sources.list.d/k6.list
sudo apt-get update && sudo apt-get install -y k6
Run
sudo bash /opt/dealix/scripts/ops/run_k6_prod.sh
Pass criteria: k6 exits 0 (both thresholds green). Summary JSON saved to /var/log/dealix_k6/.
If it fails:
- Check
http_req_durationbreakdown — identify which endpoint breached p95 - Likely culprits:
/health/deep(Postgres slow),/api/v1/pricing/plans(pure Python, should be fast) - Re-run at 50 VU to confirm it's load-related not environmental
T7 — Rollback drill
Status: 🔴 not drilled Target: prove rollback works end-to-end in <5 minutes
Step A: dry run (safe, always)
sudo bash /opt/dealix/scripts/ops/rollback_drill.sh --dry-run
Review the log. Confirm .last_good_sha points to a real earlier commit (not HEAD).
Step B: real drill (only on staging OR a maintenance window)
# Only run on prod during a planned maintenance window with no active traffic
sudo CONFIRM=YES bash /opt/dealix/scripts/ops/rollback_drill.sh --real
Pass criteria:
- Script exits 0
- Elapsed time <300s (5 minutes)
/health/deepreturns 200 after rollback- Log file at
/var/log/dealix_rollback_drill.<ts>.log
After drill: roll forward immediately by:
cd /opt/dealix
git fetch origin && git reset --hard origin/main
.venv/bin/pip install -q -r requirements.txt
systemctl restart dealix-api
T8 — Backup restore drill (staging required)
Status: 🔴 not drilled — follow RUNBOOK Scenario 5 Target: restore production DB dump into staging DB, verify row counts
Prerequisites (blocked on staging infra)
- Staging server with Postgres matching prod version
- Recent
pg_dumpfrom prod (should run daily via cron; verify withls -lt /opt/dealix/backups/) - SSH access to staging
Procedure (summary — full detail in RUNBOOK.md)
# On staging server, as root
LATEST_DUMP=$(ls -1t /opt/dealix/backups/*.sql.gz | head -1)
# Pre-check row counts on staging (if it has prior data)
psql -U dealix -d dealix -c "SELECT COUNT(*) FROM leads;"
# Restore
gunzip -c "$LATEST_DUMP" | psql -U dealix -d dealix_restore_test
# Post-check row counts — must be non-zero and match production
psql -U dealix -d dealix_restore_test -c "SELECT COUNT(*) FROM leads;"
psql -U dealix -d dealix_restore_test -c "SELECT COUNT(*) FROM deals;"
Pass criteria: row counts for critical tables (leads, deals, events) are non-zero and within 1% of production counts at dump time.
Current blocker: no staging environment provisioned. Options:
- Spin up a second Hetzner VPS as staging (cost: ~€5/mo) — recommended
- Use a local Docker Postgres + local restore — acceptable for drill but not ongoing
- Skip until G5 (first paid deal) — risky, contradicts launch gates philosophy
Drill cadence (post-launch)
| Drill | Frequency | Owner |
|---|---|---|
| T5 DLQ fault-injection | Monthly | Sami or on-call |
| T6 k6 | Weekly for first month, then monthly | Sami |
| T7 rollback dry-run | Monthly | Sami |
| T7 rollback real | Quarterly (or after major deploy) | Sami |
| T8 backup restore | Monthly | Sami |
All drills should be logged in docs/incidents/YYYY-MM-DD-drill-<name>.md with the log file path and outcome.