system-prompts-and-models-o.../salesflow-saas/RUNBOOK.md
Claude 7f57803b22
feat(dealix): D0 launch hardening — DLQ, PostHog, circuit breaker, pricing, runbook
Close 6 critical launch gates for Primitive Launch Completion:

- DLQ (Dead Letter Queue): Redis-backed failure capture with retry drain
  and admin endpoints (/admin/dlq/queues, /admin/dlq/{queue}/purge)
- PostHog client: zero-dependency HTTP funnel tracker with 16 event types
  (landing_view → deal_won → payment_succeeded)
- Circuit breaker: in-memory fault isolation for external integrations
  with registry and admin status endpoint (/admin/circuit-breakers)
- Pricing router: 3-tier plans (Starter 990/Growth 2490/Enterprise custom)
  with Moyasar invoice checkout and webhook handler
- Config: added POSTHOG_API_KEY, MOYASAR_SECRET_KEY, DLQ settings
- Wiring: PostHog + DLQ initialized in main.py lifespan, pricing router
  in API router
- RUNBOOK.md: 5 incident scenarios (service down, DB down, LLM down,
  DB restore, version rollback)
- LAUNCH_GATES.md: 33-gate checklist across 7 categories
- 20 tests: all passing (DLQ 7, PostHog 4, circuit breaker 5, pricing 4)

https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs
2026-04-23 10:32:53 +00:00

5.0 KiB

Dealix Operational Runbook

Version: 1.0.0
Last updated: 2026-04-23
Owner: Ops Lead


Scenario 1: Service Down (API not responding)

Detection: UptimeRobot alert on api.dealix.me/health or Sentry alert spike.

Steps:

  1. SSH to server: ssh dealix_deploy@188.245.55.180
  2. Check systemd status: sudo systemctl status dealix-api
  3. Check logs: sudo journalctl -u dealix-api --since '10 min ago' -n 100
  4. If crashed: sudo systemctl restart dealix-api
  5. Verify: curl http://localhost:8001/health
  6. If still failing, check port conflict: sudo ss -tlnp | grep 8001
  7. Check disk space: df -h (full disk = crash)
  8. Check memory: free -h (OOM killer may have killed uvicorn)
  9. If persistent: rollback to previous version (see Scenario 5)

Recovery time target: < 5 minutes
Escalation: If not resolved in 15 minutes, escalate to founder.


Scenario 2: Database Down (Postgres unreachable)

Detection: /health/deep returns postgres: failed or Sentry DB connection errors.

Steps:

  1. Check Postgres status: sudo systemctl status postgresql
  2. If stopped: sudo systemctl start postgresql
  3. Check Postgres logs: sudo journalctl -u postgresql --since '10 min ago'
  4. Check connections: sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
  5. If max connections hit: sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '30 min';"
  6. Check disk: df -h /var/lib/postgresql
  7. If data corruption: restore from backup (see Scenario 4)
  8. Verify: curl http://localhost:8001/health/deep | python3 -m json.tool

Recovery time target: < 10 minutes
Last backup location: /var/backups/dealix/ (daily cron)


Scenario 3: LLM Provider Down (Groq/OpenAI)

Detection: /health/deep shows LLM provider failures, or Sentry errors on /api/v1/ai-agents/*.

Steps:

  1. Check which provider: curl http://localhost:8001/health/deep | python3 -m json.tool
  2. If Groq down: system should auto-fallback to OpenAI (check LLM_FALLBACK_PROVIDER in .env)
  3. Verify fallback: curl -X POST http://localhost:8001/api/v1/ai-agents/test-prompt
  4. If both down: check API keys validity
  5. Check provider status pages:
    • Groq: https://status.groq.com
    • OpenAI: https://status.openai.com
  6. If keys expired: rotate keys in .env, restart: sudo systemctl restart dealix-api

Impact: AI features degraded but core CRUD/lead management continues working.
Recovery time target: Automatic (fallback). Manual intervention only if both providers fail.


Scenario 4: Database Restore from Backup

When: Data corruption, accidental deletion, or disaster recovery.

Steps:

  1. Stop the API: sudo systemctl stop dealix-api
  2. List available backups: ls -lt /var/backups/dealix/*.sql.gz
  3. Create safety snapshot of current state: sudo -u postgres pg_dump dealix | gzip > /tmp/dealix_pre_restore_$(date +%Y%m%d_%H%M%S).sql.gz
  4. Drop and recreate database:
    sudo -u postgres psql -c "DROP DATABASE dealix;"
    sudo -u postgres psql -c "CREATE DATABASE dealix OWNER dealix;"
    
  5. Restore: gunzip -c /var/backups/dealix/LATEST.sql.gz | sudo -u postgres psql dealix
  6. Verify row counts: sudo -u postgres psql dealix -c "SELECT 'leads', count(*) FROM leads UNION ALL SELECT 'deals', count(*) FROM deals;"
  7. Start API: sudo systemctl start dealix-api
  8. Verify health: curl http://localhost:8001/health/deep
  9. Check integrity: manually verify recent leads/deals in dashboard

Recovery time target: < 15 minutes (tested)
RPO: 24 hours (daily backup)
RTO: 15 minutes


Scenario 5: Rollback to Previous Version

When: Bad deploy, broken feature in production.

Steps:

  1. Identify last working version: git log --oneline -10
  2. Check current tag: git describe --tags --always
  3. Checkout previous version: git checkout v3.0.0 (or specific commit)
  4. Install deps: pip install -r requirements.txt
  5. Restart: sudo systemctl restart dealix-api
  6. Verify: curl http://localhost:8001/health
  7. If rolling back a migration: check alembic history and downgrade if needed
  8. Notify team of rollback reason

Recovery time target: < 5 minutes
Note: Never force-push or delete the broken commit. Create a revert commit instead for traceability.


Quick Reference

Check Command
API health curl http://localhost:8001/health
Deep health curl http://localhost:8001/health/deep
Service status sudo systemctl status dealix-api
Recent logs sudo journalctl -u dealix-api -n 50 --no-pager
Postgres status sudo systemctl status postgresql
Redis status redis-cli ping
Disk space df -h
Memory free -h
DLQ depth curl http://localhost:8001/api/v1/admin/dlq/queues
Circuit breakers curl http://localhost:8001/api/v1/admin/circuit-breakers
Restart API sudo systemctl restart dealix-api
Backup now sudo -u postgres pg_dump dealix | gzip > /var/backups/dealix/manual_$(date +%Y%m%d).sql.gz