system-prompts-and-models-o.../salesflow-saas/RUNBOOK.md
Claude 7f57803b22
feat(dealix): D0 launch hardening — DLQ, PostHog, circuit breaker, pricing, runbook
Close 6 critical launch gates for Primitive Launch Completion:

- DLQ (Dead Letter Queue): Redis-backed failure capture with retry drain
  and admin endpoints (/admin/dlq/queues, /admin/dlq/{queue}/purge)
- PostHog client: zero-dependency HTTP funnel tracker with 16 event types
  (landing_view → deal_won → payment_succeeded)
- Circuit breaker: in-memory fault isolation for external integrations
  with registry and admin status endpoint (/admin/circuit-breakers)
- Pricing router: 3-tier plans (Starter 990/Growth 2490/Enterprise custom)
  with Moyasar invoice checkout and webhook handler
- Config: added POSTHOG_API_KEY, MOYASAR_SECRET_KEY, DLQ settings
- Wiring: PostHog + DLQ initialized in main.py lifespan, pricing router
  in API router
- RUNBOOK.md: 5 incident scenarios (service down, DB down, LLM down,
  DB restore, version rollback)
- LAUNCH_GATES.md: 33-gate checklist across 7 categories
- 20 tests: all passing (DLQ 7, PostHog 4, circuit breaker 5, pricing 4)

https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs
2026-04-23 10:32:53 +00:00

132 lines
5.0 KiB
Markdown

# Dealix Operational Runbook
**Version:** 1.0.0
**Last updated:** 2026-04-23
**Owner:** Ops Lead
---
## Scenario 1: Service Down (API not responding)
**Detection:** UptimeRobot alert on `api.dealix.me/health` or Sentry alert spike.
**Steps:**
1. SSH to server: `ssh dealix_deploy@188.245.55.180`
2. Check systemd status: `sudo systemctl status dealix-api`
3. Check logs: `sudo journalctl -u dealix-api --since '10 min ago' -n 100`
4. If crashed: `sudo systemctl restart dealix-api`
5. Verify: `curl http://localhost:8001/health`
6. If still failing, check port conflict: `sudo ss -tlnp | grep 8001`
7. Check disk space: `df -h` (full disk = crash)
8. Check memory: `free -h` (OOM killer may have killed uvicorn)
9. If persistent: rollback to previous version (see Scenario 5)
**Recovery time target:** < 5 minutes
**Escalation:** If not resolved in 15 minutes, escalate to founder.
---
## Scenario 2: Database Down (Postgres unreachable)
**Detection:** `/health/deep` returns `postgres: failed` or Sentry DB connection errors.
**Steps:**
1. Check Postgres status: `sudo systemctl status postgresql`
2. If stopped: `sudo systemctl start postgresql`
3. Check Postgres logs: `sudo journalctl -u postgresql --since '10 min ago'`
4. Check connections: `sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"`
5. If max connections hit: `sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '30 min';"`
6. Check disk: `df -h /var/lib/postgresql`
7. If data corruption: restore from backup (see Scenario 4)
8. Verify: `curl http://localhost:8001/health/deep | python3 -m json.tool`
**Recovery time target:** < 10 minutes
**Last backup location:** `/var/backups/dealix/` (daily cron)
---
## Scenario 3: LLM Provider Down (Groq/OpenAI)
**Detection:** `/health/deep` shows LLM provider failures, or Sentry errors on `/api/v1/ai-agents/*`.
**Steps:**
1. Check which provider: `curl http://localhost:8001/health/deep | python3 -m json.tool`
2. If Groq down: system should auto-fallback to OpenAI (check `LLM_FALLBACK_PROVIDER` in `.env`)
3. Verify fallback: `curl -X POST http://localhost:8001/api/v1/ai-agents/test-prompt`
4. If both down: check API keys validity
5. Check provider status pages:
- Groq: `https://status.groq.com`
- OpenAI: `https://status.openai.com`
6. If keys expired: rotate keys in `.env`, restart: `sudo systemctl restart dealix-api`
**Impact:** AI features degraded but core CRUD/lead management continues working.
**Recovery time target:** Automatic (fallback). Manual intervention only if both providers fail.
---
## Scenario 4: Database Restore from Backup
**When:** Data corruption, accidental deletion, or disaster recovery.
**Steps:**
1. Stop the API: `sudo systemctl stop dealix-api`
2. List available backups: `ls -lt /var/backups/dealix/*.sql.gz`
3. Create safety snapshot of current state: `sudo -u postgres pg_dump dealix | gzip > /tmp/dealix_pre_restore_$(date +%Y%m%d_%H%M%S).sql.gz`
4. Drop and recreate database:
```
sudo -u postgres psql -c "DROP DATABASE dealix;"
sudo -u postgres psql -c "CREATE DATABASE dealix OWNER dealix;"
```
5. Restore: `gunzip -c /var/backups/dealix/LATEST.sql.gz | sudo -u postgres psql dealix`
6. Verify row counts: `sudo -u postgres psql dealix -c "SELECT 'leads', count(*) FROM leads UNION ALL SELECT 'deals', count(*) FROM deals;"`
7. Start API: `sudo systemctl start dealix-api`
8. Verify health: `curl http://localhost:8001/health/deep`
9. Check integrity: manually verify recent leads/deals in dashboard
**Recovery time target:** < 15 minutes (tested)
**RPO:** 24 hours (daily backup)
**RTO:** 15 minutes
---
## Scenario 5: Rollback to Previous Version
**When:** Bad deploy, broken feature in production.
**Steps:**
1. Identify last working version: `git log --oneline -10`
2. Check current tag: `git describe --tags --always`
3. Checkout previous version: `git checkout v3.0.0` (or specific commit)
4. Install deps: `pip install -r requirements.txt`
5. Restart: `sudo systemctl restart dealix-api`
6. Verify: `curl http://localhost:8001/health`
7. If rolling back a migration: check `alembic history` and downgrade if needed
8. Notify team of rollback reason
**Recovery time target:** < 5 minutes
**Note:** Never force-push or delete the broken commit. Create a revert commit instead for traceability.
---
## Quick Reference
| Check | Command |
|---|---|
| API health | `curl http://localhost:8001/health` |
| Deep health | `curl http://localhost:8001/health/deep` |
| Service status | `sudo systemctl status dealix-api` |
| Recent logs | `sudo journalctl -u dealix-api -n 50 --no-pager` |
| Postgres status | `sudo systemctl status postgresql` |
| Redis status | `redis-cli ping` |
| Disk space | `df -h` |
| Memory | `free -h` |
| DLQ depth | `curl http://localhost:8001/api/v1/admin/dlq/queues` |
| Circuit breakers | `curl http://localhost:8001/api/v1/admin/circuit-breakers` |
| Restart API | `sudo systemctl restart dealix-api` |
| Backup now | `sudo -u postgres pg_dump dealix \| gzip > /var/backups/dealix/manual_$(date +%Y%m%d).sql.gz` |