mancitrus/system-prompts-and-models-of-ai-tools

mirror of https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools.git synced 2026-06-17 23:09:35 +00:00

feat(dealix): D0 launch hardening — DLQ, PostHog, circuit breaker, pricing, runbook

Close 6 critical launch gates for Primitive Launch Completion:

- DLQ (Dead Letter Queue): Redis-backed failure capture with retry drain
  and admin endpoints (/admin/dlq/queues, /admin/dlq/{queue}/purge)
- PostHog client: zero-dependency HTTP funnel tracker with 16 event types
  (landing_view → deal_won → payment_succeeded)
- Circuit breaker: in-memory fault isolation for external integrations
  with registry and admin status endpoint (/admin/circuit-breakers)
- Pricing router: 3-tier plans (Starter 990/Growth 2490/Enterprise custom)
  with Moyasar invoice checkout and webhook handler
- Config: added POSTHOG_API_KEY, MOYASAR_SECRET_KEY, DLQ settings
- Wiring: PostHog + DLQ initialized in main.py lifespan, pricing router
  in API router
- RUNBOOK.md: 5 incident scenarios (service down, DB down, LLM down,
  DB restore, version rollback)
- LAUNCH_GATES.md: 33-gate checklist across 7 categories
- 20 tests: all passing (DLQ 7, PostHog 4, circuit breaker 5, pricing 4)

https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs

2026-04-23 10:32:53 +00:00


Dealix Operational Runbook

Version: 1.0.0
Last updated: 2026-04-23
Owner: Ops Lead

Scenario 1: Service Down (API not responding)

Detection: UptimeRobot alert on api.dealix.me/health or Sentry alert spike.

Steps:

1. SSH to the server: `ssh dealix_deploy@188.245.55.180`
2. Check systemd status: `sudo systemctl status dealix-api`
3. Check logs: `sudo journalctl -u dealix-api --since '10 min ago' -n 100`
4. If crashed: `sudo systemctl restart dealix-api`
5. Verify: `curl http://localhost:8001/health`
6. If still failing, check for a port conflict: `sudo ss -tlnp | grep 8001`
7. Check disk space: `df -h` (a full disk can crash the service)
8. Check memory: `free -h` (the OOM killer may have killed uvicorn)
9. If persistent: roll back to the previous version (see Scenario 5)

Recovery time target: < 5 minutes
Escalation: If not resolved in 15 minutes, escalate to founder.
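
After a restart the service may need a few seconds before it answers, so a single `curl` can report a false failure. The verify step can be sketched as a short retry loop. This is a minimal sketch, not part of the Dealix codebase: the function names, attempt count, and delay are assumptions, and only the URL comes from the steps above.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times; return True on the first success."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except OSError:
            # Connection refused, timeout, or HTTP error: not healthy yet.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False


def http_health_check(url: str = "http://localhost:8001/health") -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=3) as response:
        return response.status == 200
```

On the server, `wait_until_healthy(http_health_check)` would poll the health endpoint for roughly ten seconds before giving up. A `False` result is the cue to continue with the port, disk, and memory checks.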

Scenario 2: Database Down (Postgres unreachable)

Detection: `/health/deep` returns `postgres: failed`, or Sentry reports DB connection errors.

Steps:

1. Check Postgres status: `sudo systemctl status postgresql`
2. If stopped: `sudo systemctl start postgresql`
3. Check Postgres logs: `sudo journalctl -u postgresql --since '10 min ago'`
4. Check connections: `sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"`
5. If max connections hit: `sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state='idle' AND query_start < now() - interval '30 min';"`
6. Check disk: `df -h /var/lib/postgresql`
7. If data corruption: restore from backup (see Scenario 4)
8. Verify: `curl http://localhost:8001/health/deep | python3 -m json.tool`

Recovery time target: < 10 minutes
Last backup location: /var/backups/dealix/ (daily cron)
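
The idle-connection cleanup above terminates every backend that matches its filter, so it can help to preview which sessions that filter selects. The sketch below mirrors the SQL condition in plain Python over rows shaped like `pg_stat_activity` (`pid`, `state`, `query_start`). The `Backend` class and function name are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class Backend:
    """One row of pg_stat_activity, reduced to the columns the filter uses."""
    pid: int
    state: str
    query_start: datetime


def idle_backends_to_terminate(
    backends: Iterable[Backend],
    now: datetime,
    max_idle: timedelta = timedelta(minutes=30),
) -> List[int]:
    """Return pids that are idle and whose last query started before the cutoff."""
    cutoff = now - max_idle
    return [
        b.pid
        for b in backends
        if b.state == "idle" and b.query_start < cutoff
    ]
```

Active sessions are never selected, however old their query is, which matches the `state='idle'` condition in the runbook command.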

Scenario 3: LLM Provider Down (Groq/OpenAI)

Detection: /health/deep shows LLM provider failures, or Sentry errors on /api/v1/ai-agents/*.

Steps:

1. Check which provider is failing: `curl http://localhost:8001/health/deep | python3 -m json.tool`
2. If Groq is down: the system should auto-fallback to OpenAI (check `LLM_FALLBACK_PROVIDER` in `.env`)
3. Verify fallback: `curl -X POST http://localhost:8001/api/v1/ai-agents/test-prompt`
4. If both are down: check API key validity
5. Check provider status pages:
   - Groq: https://status.groq.com
   - OpenAI: https://status.openai.com
6. If keys expired: rotate keys in `.env`, then restart: `sudo systemctl restart dealix-api`

Impact: AI features degraded but core CRUD/lead management continues working.
Recovery time target: Automatic (fallback). Manual intervention only if both providers fail.
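
The automatic fallback described above can be pictured as trying providers in order and returning the first successful answer. This is a generic sketch of that pattern, not the Dealix implementation: `complete_with_fallback`, `AllProvidersFailed`, and the provider callables are all hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, answer) from the first that works."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the next one
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

When the last provider also fails, the collected error messages show whether the cause looks like an outage or an expired key, which is the point where manual intervention starts.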

Scenario 4: Database Restore from Backup

When: Data corruption, accidental deletion, or disaster recovery.

Steps:

1. Stop the API: `sudo systemctl stop dealix-api`
2. List available backups: `ls -lt /var/backups/dealix/*.sql.gz`
3. Create a safety snapshot of the current state: `sudo -u postgres pg_dump dealix | gzip > /tmp/dealix_pre_restore_$(date +%Y%m%d_%H%M%S).sql.gz`
4. Drop and recreate the database:
   - `sudo -u postgres psql -c "DROP DATABASE dealix;"`
   - `sudo -u postgres psql -c "CREATE DATABASE dealix OWNER dealix;"`
5. Restore: `gunzip -c /var/backups/dealix/LATEST.sql.gz | sudo -u postgres psql dealix`
6. Verify row counts: `sudo -u postgres psql dealix -c "SELECT 'leads', count(*) FROM leads UNION ALL SELECT 'deals', count(*) FROM deals;"`
7. Start the API: `sudo systemctl start dealix-api`
8. Verify health: `curl http://localhost:8001/health/deep`
9. Check integrity: manually verify recent leads/deals in the dashboard

Recovery time target: < 15 minutes (tested)
RPO: 24 hours (daily backup)
RTO: 15 minutes
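
Before restoring, it is worth confirming which backup file is actually the newest and whether it falls inside the 24-hour RPO. The sketch below does both from file modification times. The helper names are hypothetical, and it assumes the file mtime reflects when the dump was taken.

```python
import time
from pathlib import Path
from typing import Optional


def latest_backup(directory: str, pattern: str = "*.sql.gz") -> Optional[Path]:
    """Return the most recently modified backup in `directory`, or None if empty."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def within_rpo(
    backup: Path,
    max_age_hours: float = 24.0,
    now: Optional[float] = None,
) -> bool:
    """Return True if the backup is no older than `max_age_hours`."""
    current = time.time() if now is None else now
    return current - backup.stat().st_mtime <= max_age_hours * 3600
```

For example, `latest_backup("/var/backups/dealix")` returns the path to restore from. If `within_rpo` is false for that path, the daily cron has probably been failing and the data loss will exceed the stated RPO.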

Scenario 5: Rollback to Previous Version

When: Bad deploy, broken feature in production.

Steps:

1. Identify the last working version: `git log --oneline -10`
2. Check the current tag: `git describe --tags --always`
3. Check out the previous version: `git checkout v3.0.0` (or a specific commit)
4. Install deps: `pip install -r requirements.txt`
5. Restart: `sudo systemctl restart dealix-api`
6. Verify: `curl http://localhost:8001/health`
7. If rolling back a migration: check `alembic history` and downgrade if needed. The downgrade script ships with the newer revision, so run `alembic downgrade` before step 3 checks out the older code.
8. Notify the team of the rollback reason

Recovery time target: < 5 minutes
Note: Never force-push or delete the broken commit. Create a revert commit instead for traceability.
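
When several release tags exist, choosing the rollback target by eye is error-prone because `v10.0.0` sorts before `v3.0.0` as text. The sketch below picks the highest tag below the current one by numeric comparison. It assumes tags follow the `vMAJOR.MINOR.PATCH` shape of `v3.0.0` above, and the function names are hypothetical. Whether that tag was actually a working release still needs human judgment.

```python
import re
from typing import Iterable, Optional, Tuple

_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag: str) -> Optional[Tuple[int, ...]]:
    """Turn 'v3.0.0' into (3, 0, 0); return None for tags in any other shape."""
    match = _TAG.match(tag.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def previous_tag(tags: Iterable[str], current: str) -> Optional[str]:
    """Return the highest version tag strictly below `current`, or None."""
    current_key = parse_tag(current)
    if current_key is None:
        raise ValueError(f"not a version tag: {current!r}")
    older = []
    for tag in tags:
        key = parse_tag(tag)
        if key is not None and key < current_key:
            older.append((key, tag.strip()))
    return max(older)[1] if older else None
```

The `tags` argument can be the lines printed by `git tag`. Tags that do not match the pattern are ignored rather than treated as errors.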

Quick Reference

| Check | Command |
|---|---|
| API health | `curl http://localhost:8001/health` |
| Deep health | `curl http://localhost:8001/health/deep` |
| Service status | `sudo systemctl status dealix-api` |
| Recent logs | `sudo journalctl -u dealix-api -n 50 --no-pager` |
| Postgres status | `sudo systemctl status postgresql` |
| Redis status | `redis-cli ping` |
| Disk space | `df -h` |
| Memory | `free -h` |
| DLQ depth | `curl http://localhost:8001/api/v1/admin/dlq/queues` |
| Circuit breakers | `curl http://localhost:8001/api/v1/admin/circuit-breakers` |
| Restart API | `sudo systemctl restart dealix-api` |
| Backup now | `sudo -u postgres pg_dump dealix \| gzip > /var/backups/dealix/manual_$(date +%Y%m%d).sql.gz` |
