Close 6 critical launch gates for Primitive Launch Completion:
- DLQ (Dead Letter Queue): Redis-backed failure capture with retry drain
  and admin endpoints (/admin/dlq/queues, /admin/dlq/{queue}/purge)
- PostHog client: zero-dependency HTTP funnel tracker with 16 event types
  (landing_view → deal_won → payment_succeeded)
- Circuit breaker: in-memory fault isolation for external integrations
  with registry and admin status endpoint (/admin/circuit-breakers)
- Pricing router: 3-tier plans (Starter 990/Growth 2490/Enterprise custom)
  with Moyasar invoice checkout and webhook handler
- Config: added POSTHOG_API_KEY, MOYASAR_SECRET_KEY, DLQ settings
- Wiring: PostHog + DLQ initialized in main.py lifespan, pricing router
  in API router
- RUNBOOK.md: 5 incident scenarios (service down, DB down, LLM down,
  DB restore, version rollback)
- LAUNCH_GATES.md: 33-gate checklist across 7 categories
- 20 tests: all passing (DLQ 7, PostHog 4, circuit breaker 5, pricing 4)
https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs
Dealix Operational Runbook
Version: 1.0.0
Last updated: 2026-04-23
Owner: Ops Lead
Scenario 1: Service Down (API not responding)
Detection: UptimeRobot alert on api.dealix.me/health or Sentry alert spike.
Steps:
- SSH to server: `ssh dealix_deploy@188.245.55.180`
- Check systemd status: `sudo systemctl status dealix-api`
- Check logs: `sudo journalctl -u dealix-api --since '10 min ago' -n 100`
- If crashed: `sudo systemctl restart dealix-api`
- Verify: `curl http://localhost:8001/health`
- If still failing, check port conflict: `sudo ss -tlnp | grep 8001`
- Check disk space: `df -h` (full disk = crash)
- Check memory: `free -h` (OOM killer may have killed uvicorn)
- If persistent: roll back to the previous version (see Scenario 5)
Recovery time target: < 5 minutes
Escalation: If not resolved in 15 minutes, escalate to founder.
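The "Verify" step above can be automated after a restart, since the service may need a few seconds before it answers. The following is a minimal sketch under that assumption; `http_status` and `wait_for_health` are hypothetical helper names, not part of the Dealix codebase, and the attempt count and interval are illustrative defaults.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=3.0):
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, malformed response.
        return 0


def wait_for_health(url, attempts=10, interval_s=3.0,
                    fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200. Return True on success, False otherwise."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)  # no sleep after the final attempt
    return False


if __name__ == "__main__":
    ok = wait_for_health("http://localhost:8001/health")
    print("healthy" if ok else "still failing: continue with the next step")
```

With the defaults, the loop gives up after roughly 30 seconds plus request time, which leaves room inside the 5-minute recovery target for the remaining checks.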
Scenario 2: Database Down (Postgres unreachable)
Detection: /health/deep returns postgres: failed or Sentry DB connection errors.
Steps:
- Check Postgres status: `sudo systemctl status postgresql`
- If stopped: `sudo systemctl start postgresql`
- Check Postgres logs: `sudo journalctl -u postgresql --since '10 min ago'`
- Check connections: `sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"`
- If max connections hit: `sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '30 min';"`
- Check disk: `df -h /var/lib/postgresql`
- If data corruption: restore from backup (see Scenario 4)
- Verify: `curl http://localhost:8001/health/deep | python3 -m json.tool`
Recovery time target: < 10 minutes
Last backup location: /var/backups/dealix/ (daily cron)
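When reading the `/health/deep` output during an incident, it can help to reduce it to the list of failing components. The sketch below assumes the response is a flat JSON object mapping component names to status strings, with `"failed"` marking a failure. That shape is inferred from the `postgres: failed` detection rule above and is not confirmed by the source. `failed_components` is a hypothetical helper.

```python
import json
from typing import List


def failed_components(raw: str) -> List[str]:
    """Return the sorted names of components whose status is "failed".

    Assumes a flat JSON object such as {"postgres": "failed", "redis": "ok"};
    adjust the parsing if the real /health/deep payload is nested.
    """
    payload = json.loads(raw)
    return sorted(name for name, status in payload.items() if status == "failed")


if __name__ == "__main__":
    sample = '{"postgres": "failed", "redis": "ok"}'
    print(failed_components(sample))
```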
Scenario 3: LLM Provider Down (Groq/OpenAI)
Detection: /health/deep shows LLM provider failures, or Sentry errors on /api/v1/ai-agents/*.
Steps:
- Check which provider: `curl http://localhost:8001/health/deep | python3 -m json.tool`
- If Groq down: system should auto-fallback to OpenAI (check `LLM_FALLBACK_PROVIDER` in `.env`)
- Verify fallback: `curl -X POST http://localhost:8001/api/v1/ai-agents/test-prompt`
- If both down: check API keys validity
- Check provider status pages:
  - Groq: https://status.groq.com
  - OpenAI: https://status.openai.com
- If keys expired: rotate keys in `.env`, restart: `sudo systemctl restart dealix-api`
Impact: AI features degraded but core CRUD/lead management continues working.
Recovery time target: Automatic (fallback). Manual intervention only if both providers fail.
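For readers unfamiliar with the fallback behavior described above, this is an illustrative sketch of an ordered provider fallback. It is not the Dealix implementation: `complete_with_fallback`, `AllProvidersFailed`, and the provider callables are hypothetical, and real clients would replace the stand-in functions.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, result) of the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider error triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def groq_down(prompt: str) -> str:   # stand-in for a failing primary provider
        raise ConnectionError("timeout")

    def openai_ok(prompt: str) -> str:   # stand-in for a healthy fallback provider
        return f"echo: {prompt}"

    print(complete_with_fallback("ping", [("groq", groq_down), ("openai", openai_ok)]))
```

The "both down" case in the steps corresponds to `AllProvidersFailed`, whose message lists each provider's error and gives a starting point for checking keys and status pages.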
Scenario 4: Database Restore from Backup
When: Data corruption, accidental deletion, or disaster recovery.
Steps:
- Stop the API: `sudo systemctl stop dealix-api`
- List available backups: `ls -lt /var/backups/dealix/*.sql.gz`
- Create safety snapshot of current state: `sudo -u postgres pg_dump dealix | gzip > /tmp/dealix_pre_restore_$(date +%Y%m%d_%H%M%S).sql.gz`
- Drop and recreate database: `sudo -u postgres psql -c "DROP DATABASE dealix;"` then `sudo -u postgres psql -c "CREATE DATABASE dealix OWNER dealix;"`
- Restore: `gunzip -c /var/backups/dealix/LATEST.sql.gz | sudo -u postgres psql dealix`
- Verify row counts: `sudo -u postgres psql dealix -c "SELECT 'leads', count(*) FROM leads UNION ALL SELECT 'deals', count(*) FROM deals;"`
- Start API: `sudo systemctl start dealix-api`
- Verify health: `curl http://localhost:8001/health/deep`
- Check integrity: manually verify recent leads/deals in dashboard
Recovery time target: < 15 minutes (tested)
RPO: 24 hours (daily backup)
RTO: 15 minutes
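The 24-hour RPO only holds if the newest backup is actually less than a day old, so it is worth checking before dropping the database. The sketch below picks the most recently modified `*.sql.gz` file (the same ordering `ls -lt` shows) and compares its age with the RPO. `newest_backup` and `within_rpo` are hypothetical helpers, and file modification time is assumed to reflect when the backup was taken.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup(directory: str) -> Optional[Path]:
    """Return the most recently modified *.sql.gz file in directory, or None."""
    backups = list(Path(directory).glob("*.sql.gz"))
    return max(backups, key=lambda p: p.stat().st_mtime, default=None)


def within_rpo(backup: Path, rpo_hours: float = 24.0,
               now: Optional[float] = None) -> bool:
    """Return True if the backup's age does not exceed rpo_hours."""
    now = time.time() if now is None else now
    return (now - backup.stat().st_mtime) <= rpo_hours * 3600


if __name__ == "__main__":
    latest = newest_backup("/var/backups/dealix")
    if latest is None:
        print("no backups found")
    else:
        status = "within RPO" if within_rpo(latest) else "OLDER than RPO"
        print(f"{latest} ({status})")
```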
Scenario 5: Rollback to Previous Version
When: Bad deploy, broken feature in production.
Steps:
- Identify last working version: `git log --oneline -10`
- Check current tag: `git describe --tags --always`
- Checkout previous version: `git checkout v3.0.0` (or specific commit)
- Install deps: `pip install -r requirements.txt`
- Restart: `sudo systemctl restart dealix-api`
- Verify: `curl http://localhost:8001/health`
- If rolling back a migration: check `alembic history` and downgrade if needed
- Notify team of rollback reason
Recovery time target: < 5 minutes
Note: Never force-push or delete the broken commit. Create a revert commit instead for traceability.
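If releases are tagged in the `vMAJOR.MINOR.PATCH` form used in the checkout example above, the candidate rollback target can be found by sorting tags numerically rather than alphabetically. This is a sketch under that assumption; `parse_tag` and `previous_tag` are hypothetical helpers, and the tag immediately before the current one is only a candidate, not necessarily the last working version.

```python
import re
from typing import List, Optional, Tuple

_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Optional[Tuple[int, ...]]:
    """Parse 'v3.0.0' into (3, 0, 0); return None for non-matching tags."""
    match = _TAG.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def previous_tag(tags: List[str], current: str) -> Optional[str]:
    """Return the highest version tag strictly below current, or None."""
    cur = parse_tag(current)
    if cur is None:
        return None
    older = []
    for tag in tags:
        version = parse_tag(tag)
        if version is not None and version < cur:
            older.append((version, tag))
    # Tuples compare numerically, so (2, 10, 0) sorts above (2, 9, 0).
    return max(older)[1] if older else None


if __name__ == "__main__":
    # In practice the list could come from `git tag --list`.
    print(previous_tag(["v2.9.1", "v3.0.0", "v3.1.0", "nightly"], "v3.1.0"))
```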
Quick Reference
| Check | Command |
|---|---|
| API health | curl http://localhost:8001/health |
| Deep health | curl http://localhost:8001/health/deep |
| Service status | sudo systemctl status dealix-api |
| Recent logs | sudo journalctl -u dealix-api -n 50 --no-pager |
| Postgres status | sudo systemctl status postgresql |
| Redis status | redis-cli ping |
| Disk space | df -h |
| Memory | free -h |
| DLQ depth | curl http://localhost:8001/api/v1/admin/dlq/queues |
| Circuit breakers | curl http://localhost:8001/api/v1/admin/circuit-breakers |
| Restart API | sudo systemctl restart dealix-api |
| Backup now | sudo -u postgres pg_dump dealix \| gzip > /var/backups/dealix/manual_$(date +%Y%m%d).sql.gz |