Dealix Production Runbook
Version: 1.0 (v3.0.0 Primitive Launch)
Owner: Sami Assiri
Last updated: 2026-04-23
Environment: Production — api.dealix.me, 188.245.55.180
0. Contact & Access
- Server: ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180
- App dir: /opt/dealix
- systemd unit: dealix-api.service
- Database: Postgres, postgresql://dealix@127.0.0.1:5432/dealix
- Cache/queue: Redis, 127.0.0.1:6379/0
- Nginx: api.dealix.me → 127.0.0.1:8001, dealix.me → /var/www/dealix/landing
- GitHub: https://github.com/VoXc2/dealix (main protected, gh CLI with api_credentials=["github"])
- Sentry: DSN in /opt/dealix/.env → SENTRY_DSN
- UptimeRobot: monitors https://api.dealix.me/health
- PostHog: EU region, https://eu.i.posthog.com
First rule
Never edit code or .env directly on the server. All changes flow through GitHub PR → deploy script. Untracked production drift already cost 20 commits once — not again.
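One way to catch drift early is to check the server checkout for modified or untracked files before each deploy. The sketch below is a hypothetical helper, not something that exists in the repo: the names has_drift and report_drift are made up for illustration, and it assumes /opt/dealix is a git checkout in which .env, .env.bak.* and .last_good_sha are gitignored (otherwise they would be reported as drift).

```shell
# has_drift DIR: exit 0 when the git checkout in DIR has modified, staged,
# or untracked files; exit 1 when it is clean; exit 2 when DIR is not a
# git checkout. Hypothetical helper; gitignored files are never reported.
has_drift() {
  git -C "$1" rev-parse --git-dir >/dev/null 2>&1 || return 2
  [ -n "$(git -C "$1" status --porcelain)" ]
}

# report_drift DIR: print the offending paths and return 1 if drift exists.
report_drift() {
  has_drift "$1"
  case $? in
    0) echo "DRIFT in $1:" >&2
       git -C "$1" status --porcelain >&2
       return 1 ;;
    1) echo "clean: $1" ;;
    *) echo "not a git checkout: $1" >&2
       return 2 ;;
  esac
}

# Usage on the server, before pulling:
#   report_drift /opt/dealix || exit 1
```

Running it before the pull step makes a drifted server stop the deploy instead of silently accumulating local changes.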
Scenario 1 — Routine Deploy (merge to main → prod)
When: PR approved and merged to main.
- Verify CI green on main
      gh run list --repo VoXc2/dealix --branch main --limit 3 --json name,status,conclusion
  All three latest runs must be completed / success. If not, abort.
- SSH to server
      ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180
- Snapshot current state (rollback safety)
      cd /opt/dealix
      git rev-parse HEAD > /opt/dealix/.last_good_sha
      cp .env .env.bak.$(date -u +%Y%m%dT%H%M%SZ)
- Pull + install + migrate
      git fetch origin
      git checkout main
      git pull --ff-only origin main
      /opt/dealix/.venv/bin/pip install -r requirements.txt
      /opt/dealix/.venv/bin/alembic upgrade head
- Restart service
      systemctl restart dealix-api
      systemctl status dealix-api --no-pager
- Health verification (all must pass)
      curl -sf https://api.dealix.me/health
      curl -sf https://api.dealix.me/health/deep | jq .
      curl -sf https://api.dealix.me/api/v1/pricing/plans | jq '.plans | length'
  /health/deep must show postgres, redis, llm_providers all green.
- Trigger Sentry + PostHog probes
      curl -sf -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/sentry-check
  Verify in Sentry + PostHog dashboards within 60s.
DoD: Health green, Sentry ping received, no error spike in Sentry for 10 min.
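The "all three latest must be completed / success" gate can be checked mechanically rather than by eye. This is a minimal sketch under stated assumptions: ci_all_green is a hypothetical function name, and it expects one "status conclusion" pair per line on stdin, which gh can produce with its built-in --jq filter.

```shell
# ci_all_green: read "status conclusion" pairs on stdin, one run per line.
# Succeeds only if at least one run was seen and every run is
# completed/success. Hypothetical helper, not part of the repo.
ci_all_green() {
  local seen=0 status conclusion
  while read -r status conclusion || [ -n "$status" ]; do
    [ -z "$status" ] && continue
    seen=$((seen + 1))
    if [ "$status" != "completed" ] || [ "$conclusion" != "success" ]; then
      return 1
    fi
  done
  [ "$seen" -gt 0 ]
}

# Usage on the operator machine:
#   gh run list --repo VoXc2/dealix --branch main --limit 3 \
#     --json status,conclusion --jq '.[] | "\(.status) \(.conclusion)"' \
#     | ci_all_green || { echo "CI not green, aborting deploy" >&2; exit 1; }
```

An empty run list counts as a failure, so a mistyped repo or branch name cannot pass the gate by accident.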
Scenario 2 — Rollback (bad deploy, 5-min target)
Trigger: error rate spike in Sentry, /health/deep red, 5xx surge, or user complaint post-deploy.
- Announce in channel (if you have one; otherwise note in a GitHub issue):
  Rolling back prod to last good SHA due to <reason>.
- SSH in, revert to last known-good SHA
      ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
      cd /opt/dealix
      # If the bad deploy added migrations, run the downgrade from the next step BEFORE this checkout:
      # alembic can only downgrade a revision while its migration file is still on disk.
      LAST_GOOD=$(cat /opt/dealix/.last_good_sha)
      git checkout "$LAST_GOOD"
      /opt/dealix/.venv/bin/pip install -r requirements.txt
- Roll back migrations only if the bad deploy added new ones
      # Check what the bad deploy added:
      /opt/dealix/.venv/bin/alembic history | head
      # Downgrade one step ONLY if necessary:
      /opt/dealix/.venv/bin/alembic downgrade -1
  Rule: never downgrade more than 1 step without Sami's explicit approval.
- Restart + verify
      systemctl restart dealix-api
      curl -sf https://api.dealix.me/health/deep | jq .
- Re-open main for the fix via PR (no direct commits to main; main is protected).
DoD: health green on old SHA within 5 minutes, Sentry error rate back to baseline within 10 min. Target: <5 min from decision to rollback complete.
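Right after a restart the API may need a few seconds before /health/deep answers, so a single curl can fail spuriously while the clock on the 5-minute target is running. A small retry wrapper is one way to handle that. The sketch below is illustrative only: retry is a hypothetical helper, and the attempt count and delay in the usage line are assumptions, not values from this runbook.

```shell
# retry ATTEMPTS DELAY_SECONDS CMD...: run CMD until it succeeds or
# ATTEMPTS tries are used up, sleeping DELAY_SECONDS between tries.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage after `systemctl restart dealix-api`:
#   retry 10 3 curl -sf -o /dev/null https://api.dealix.me/health/deep \
#     || echo "still unhealthy after retries, escalate" >&2
```

Ten attempts three seconds apart keep the wait well inside the rollback target while still tolerating a slow start.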
Scenario 3 — Database Down / Unreachable
Signals: /health/deep reports postgres: error, 500s on leads endpoints, Sentry OperationalError.
- Triage: is it us or the DB?
      ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
      systemctl status postgresql --no-pager
      sudo -u postgres psql -c "SELECT 1;"
- If systemd says failed:
      journalctl -u postgresql -n 200 --no-pager
      systemctl restart postgresql
      sleep 3
      sudo -u postgres psql -c "SELECT 1;"
- If disk full (most common real cause):
      df -h /
      # Clear postgres WAL archives / old backups first:
      du -sh /var/lib/postgresql/* /var/backups/postgres/* 2>/dev/null | sort -h
      # DO NOT delete active WAL. Rotate old backups only.
- If connection pool exhausted (API is up, DB healthy, but API can't connect):
      sudo -u postgres psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
      systemctl restart dealix-api   # drops API's stale connections
- If DB corrupt / cannot start → restore from backup (Scenario 5).
- Post-incident:
  - Write a 5-line postmortem in docs/incidents/YYYY-MM-DD.md.
  - If webhooks arrived during outage, drain WEBHOOKS_DLQ:
        curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
          'https://api.dealix.me/api/v1/admin/dlq/webhooks/drain?limit=100'
DoD: /health/deep postgres green; no lost webhooks (DLQ drained or re-queued).
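Since a full disk is named as the most common real cause, the df check can be turned into a yes/no answer that is easy to script or cron. A minimal sketch, assuming POSIX `df -P` output (the percentage is the fifth column); disk_used_pct and disk_alert are hypothetical names, and the 90% threshold in the usage line is an assumption rather than a value from this runbook.

```shell
# disk_used_pct: read `df -P <mount>` output on stdin and print the integer
# "Capacity" percentage of the last filesystem row. Fails if no row is found.
disk_used_pct() {
  awk 'NR > 1 { gsub(/%/, "", $5); pct = $5 }
       END { if (pct == "") exit 1; print pct }'
}

# disk_alert THRESHOLD: succeed when the usage read from stdin is at or
# above THRESHOLD percent; return 2 if the input could not be parsed.
disk_alert() {
  local pct
  pct=$(disk_used_pct) || return 2
  [ "$pct" -ge "$1" ]
}

# Usage on the server:
#   df -P / | disk_alert 90 && echo "root filesystem nearly full" >&2
```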
Scenario 4 — LLM Provider Down (Anthropic / OpenAI / Google)
Signals: Sentry shows provider timeouts, /health/deep llm_providers yellow/red, ConnectorFacade circuit breaker open, workflow failures in PostHog.
- Confirm it's the provider, not us:
  - Check https://status.anthropic.com / https://status.openai.com / https://status.cloud.google.com.
  - curl https://api.anthropic.com/v1/messages -H "x-api-key: $ANTHROPIC_API_KEY" ...
- The circuit breaker should already be doing its job: requests to the failing provider return fast with CircuitOpenError, and the DLQ absorbs the failures.
- Verify breaker state:
      curl -s -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/dlq/stats | jq .
  If OUTBOUND_DLQ or ENRICHMENT_DLQ is growing fast, the breaker is protecting us; no action on our side.
- Temporary failover (if one provider is the primary and down for >30 min):
  Edit /opt/dealix/.env → LLM_PROVIDER_PRIORITY="openai,google,anthropic" (reorder), then:
      systemctl restart dealix-api
  Rule: this is the ONLY allowed in-place .env edit. Commit the change back to .env.example (without secret) next business day.
- When provider recovers:
  - Breaker auto-half-opens after 60s, then closes on first success.
  - Drain the relevant DLQ to replay queued work:
        curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
          'https://api.dealix.me/api/v1/admin/dlq/outbound/drain?limit=50'
DoD: /health/deep llm_providers green; DLQ depth returning to zero; no workflow failures in last 10 min.
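Reordering the priority list by hand under pressure invites typos, so it can help to compute the new value and paste it. The sketch below only prints the reordered list; it does not touch .env. demote_provider is a hypothetical helper, and the starting order "anthropic,openai,google" in the usage line is an assumption about the current setting.

```shell
# demote_provider LIST NAME: print the comma-separated LIST with NAME moved
# to the end. If NAME is not in LIST, LIST is printed unchanged.
demote_provider() {
  printf '%s\n' "$1" | awk -v down="$2" -F, '
    {
      out = ""; found = 0
      for (i = 1; i <= NF; i++) {
        if ($i == down) { found = 1; continue }
        out = (out == "") ? $i : (out "," $i)
      }
      if (found) out = (out == "") ? down : (out "," down)
      print out
    }'
}

# Usage:
#   demote_provider "anthropic,openai,google" anthropic
#   # → openai,google,anthropic
```

Moving the failing provider to the end, rather than removing it, means traffic can return to it without a second edit once the breaker closes.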
Scenario 5 — Backup Restoration (Data Loss / Corruption)
Trigger: DB corrupt, accidental mass delete, ransomware, or monthly drill (required).
Preflight
- Identify the target backup:
      ls -lht /var/backups/postgres/*.sql.gz | head -5
- Never restore into production DB. Restore into a staging clone first, validate, then swap.
Drill / Restore procedure (on staging — monthly required)
- Create isolated DB:
      sudo -u postgres createdb dealix_restore_test
- Restore:
      BACKUP=/var/backups/postgres/dealix-YYYY-MM-DD.sql.gz
      gunzip -c "$BACKUP" | sudo -u postgres psql dealix_restore_test
- Validate row counts against prod (sanity):
      sudo -u postgres psql dealix_restore_test -c "SELECT 'leads' t, count(*) FROM leads UNION ALL SELECT 'users', count(*) FROM users;"
- Validate latest lead timestamp is within acceptable RPO (≤24h):
      sudo -u postgres psql dealix_restore_test -c "SELECT max(created_at) FROM leads;"
- Teardown:
      sudo -u postgres dropdb dealix_restore_test
Real incident restore (production data loss)
- Stop API to freeze writes:
      systemctl stop dealix-api
- Rename current DB (do NOT drop it: it is evidence):
      sudo -u postgres psql -c "ALTER DATABASE dealix RENAME TO dealix_corrupt_$(date +%Y%m%d);"
      sudo -u postgres createdb dealix
- Restore latest good backup (BACKUP must be set to the file chosen in Preflight, as in the drill):
      gunzip -c "$BACKUP" | sudo -u postgres psql dealix
- Restart API + verify:
      systemctl start dealix-api
      curl -sf https://api.dealix.me/health/deep | jq .
- Postmortem mandatory: how data was lost, why backup gap existed, what changed.
DoD (drill): restore completes in ≤15 min, row counts ±5% of prod, max timestamp within RPO.
DoD (incident): API back up; data loss limited to writes newer than the last backup (document the gap).
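The two numeric drill criteria (row counts within ±5% of prod, newest row within the 24h RPO) can be checked with integer arithmetic instead of by eye. A minimal sketch: within_pct and within_rpo are hypothetical helpers, and the usage lines assume the counts and the epoch timestamp have already been read out of psql.

```shell
# within_pct RESTORED PROD PCT: succeed when RESTORED is within +/-PCT
# percent of PROD (integer arithmetic, no floating point).
within_pct() {
  local restored="$1" prod="$2" pct="$3" diff
  diff=$((restored - prod))
  [ "$diff" -lt 0 ] && diff=$((-diff))
  [ $((diff * 100)) -le $((prod * pct)) ]
}

# within_rpo LATEST_EPOCH NOW_EPOCH [MAX_HOURS]: succeed when the newest row
# is at most MAX_HOURS (default 24) older than NOW_EPOCH.
within_rpo() {
  local latest="$1" now="$2" max_hours="${3:-24}"
  [ $((now - latest)) -le $((max_hours * 3600)) ]
}

# Usage during a drill (psql -At prints bare values):
#   latest=$(sudo -u postgres psql -At dealix_restore_test \
#     -c "SELECT extract(epoch FROM max(created_at))::bigint FROM leads;")
#   within_rpo "$latest" "$(date +%s)" || echo "RPO exceeded" >&2
#   within_pct "$restored_leads" "$prod_leads" 5 || echo "row count off" >&2
```

If created_at is stored without a time zone, the epoch conversion treats it as UTC, so confirm how the column is stored before trusting the RPO result.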
Scenario 6 — Security Incident (suspected breach)
Signals: unexplained admin API calls, fail2ban banning authorized IPs, unexpected outbound traffic, Sentry PermissionError spike, unknown webhook signatures failing.
- Contain first, investigate second:
      ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
      # Rotate ALL secrets immediately
      cd /opt/dealix && bash scripts/rotate_secrets.sh
      systemctl restart dealix-api
- Lock down UFW to known IPs only (temporary):
      # Save current rules first:
      ufw status numbered > /tmp/ufw.before.$(date +%s)
      # Restrict SSH to your IP. Add the narrow rule before removing the broad one to avoid a lockout:
      ufw allow from <YOUR_PUBLIC_IP> to any port 22
      ufw delete allow 22/tcp || true
- Check auth logs:
      journalctl -u ssh -n 500 --no-pager | grep -iE 'accepted|failed'
      fail2ban-client status sshd
- Preserve evidence:
      tar -czf /root/incident-$(date +%s).tgz /var/log/nginx /var/log/auth.log /var/log/dealix*
- Notify Sami, document in docs/incidents/, file GitHub security advisory if user data touched.
DoD: all secrets rotated, UFW locked, attacker IPs banned, incident doc drafted within 1h.
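When reading the auth logs, a per-address tally of failed logins is usually faster to act on than raw grep output. The sketch below is an illustration, not an existing script: failed_ssh_by_ip is a hypothetical name, and it assumes the usual OpenSSH message shape "Failed password for ... from ADDRESS port ...". Other failure types, such as rejected keys, use different messages and are not counted.

```shell
# failed_ssh_by_ip: read sshd log lines on stdin and print "COUNT IP" for
# every source address with failed password attempts, busiest first.
failed_ssh_by_ip() {
  awk '/Failed password/ {
         for (i = 1; i < NF; i++)
           if ($i == "from") { print $(i + 1); break }
       }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage on the server:
#   journalctl -u ssh -n 500 --no-pager | failed_ssh_by_ip | head
```

The top entries are candidates to compare against the fail2ban jail list before banning anything manually.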
Appendix A — Health Check Cheat Sheet
| Signal | Command | Expected |
|---|---|---|
| Liveness | curl -sf https://api.dealix.me/health | 200 OK |
| Deep health | curl -sf https://api.dealix.me/health/deep | postgres+redis+llm_providers green |
| CI status | gh run list --repo VoXc2/dealix --limit 3 | success |
| DLQ depth | GET /api/v1/admin/dlq/stats | 0 across all queues |
| Pending approvals | GET /api/v1/admin/approvals/pending | <10 |
| Service status | systemctl status dealix-api | active (running) |
| fail2ban | fail2ban-client status sshd | jail active |
| Nginx | systemctl status nginx | active (running) |
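
Several rows of the table can be run in one pass with a small wrapper that prints PASS or FAIL per signal. This is a sketch under assumptions: run_check and run_all are hypothetical names, only checks that need no API key are wired in, and `systemctl is-active --quiet` is swapped in for `systemctl status` because it gives a clean exit code.

```shell
# run_check NAME CMD...: run CMD quietly and print "PASS NAME" or
# "FAIL NAME". Returns 0 on pass and 1 on fail so callers can count.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# run_all: example wiring for the unauthenticated checks in the table.
# Succeeds only when every check passed.
run_all() {
  local failed=0
  run_check "liveness"       curl -sf https://api.dealix.me/health || failed=$((failed + 1))
  run_check "service status" systemctl is-active --quiet dealix-api || failed=$((failed + 1))
  run_check "nginx"          systemctl is-active --quiet nginx || failed=$((failed + 1))
  run_check "fail2ban"       fail2ban-client status sshd || failed=$((failed + 1))
  [ "$failed" -eq 0 ]
}
```

Calling run_all on the server gives a four-line summary; the authenticated rows (DLQ depth, pending approvals) are left out because their response format is not described here.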
Appendix B — Do-Not-Touch List
- main branch: protected, no direct push
- /opt/dealix/.env.pre-v3.0.0.bak: emergency reference, never delete
- server-backup-20260423-084442 branch: historical evidence, do not force-delete
- Postgres dealix_corrupt_* dbs: keep for 7 days post-incident before dropping
Appendix C — Runbook Review
Review every 4 weeks. If any command in this runbook failed during a real incident, update it immediately after that incident closes.