Dealix Production Runbook

Version: 1.0 (v3.0.0 Primitive Launch)
Owner: Sami Assiri
Last updated: 2026-04-23
Environment: Production (api.dealix.me, 188.245.55.180)


0. Contact & Access

  • Server: ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180
  • App dir: /opt/dealix
  • systemd unit: dealix-api.service
  • Database: Postgres postgresql://dealix@127.0.0.1:5432/dealix
  • Cache/queue: Redis 127.0.0.1:6379/0
  • Nginx: api.dealix.me → 127.0.0.1:8001, dealix.me → /var/www/dealix/landing
  • GitHub: https://github.com/VoXc2/dealix (main protected, gh CLI with api_credentials=["github"])
  • Sentry: DSN in /opt/dealix/.env → SENTRY_DSN
  • UptimeRobot: monitors https://api.dealix.me/health
  • PostHog: EU region https://eu.i.posthog.com

First rule

Never edit code or .env directly on the server. All changes flow through GitHub PR → deploy script. Untracked production drift already cost 20 commits once — not again.
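
One way to catch drift early is to check the server's working tree before each deploy. The following is a minimal sketch, not part of the existing tooling: the function name has_drift is hypothetical, and it assumes that .env, its backups, and .last_good_sha are git-ignored so they do not show up as untracked files.

```shell
# has_drift: reads `git status --porcelain` output on stdin.
# Returns 0 (drift found) if there is any modified or untracked file,
# and 1 when the working tree is clean.
# ASSUMPTION: .env, .env.bak.* and .last_good_sha are git-ignored.
has_drift() {
    # Any non-empty line in porcelain output means the tree is dirty.
    grep -q .
}

# Usage on the server (not executed here):
#   git -C /opt/dealix status --porcelain | has_drift \
#       && echo "DRIFT: capture the changes in a PR before deploying"
```

If the check reports drift, the safest response is to stop and move the change into a PR rather than deploying over it.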


Scenario 1 — Routine Deploy (merge to main → prod)

When: PR approved and merged to main.

  1. Verify CI green on main

    gh run list --repo VoXc2/dealix --branch main --limit 3 --json name,status,conclusion
    

    All three latest must be completed / success. If not — abort.

  2. SSH to server

    ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180
    
  3. Snapshot current state (rollback safety)

    cd /opt/dealix
    git rev-parse HEAD > /opt/dealix/.last_good_sha
    cp .env .env.bak.$(date -u +%Y%m%dT%H%M%SZ)
    
  4. Pull + install + migrate

    git fetch origin
    git checkout main
    git pull --ff-only origin main
    /opt/dealix/.venv/bin/pip install -r requirements.txt
    /opt/dealix/.venv/bin/alembic upgrade head
    
  5. Restart service

    systemctl restart dealix-api
    systemctl status dealix-api --no-pager
    
  6. Health verification (all must pass)

    curl -sf https://api.dealix.me/health
    curl -sf https://api.dealix.me/health/deep | jq .
    curl -sf https://api.dealix.me/api/v1/pricing/plans | jq '.plans | length'
    

    /health/deep must show postgres, redis, llm_providers all green.

  7. Trigger Sentry + PostHog probes

    curl -sf -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/sentry-check
    

    Verify in Sentry + PostHog dashboards within 60s.

DoD: Health green, Sentry ping received, no error spike in Sentry for 10 min.
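
The CI gate in step 1 can be made mechanical instead of read by eye. This is a sketch under assumptions: the function name all_runs_green is hypothetical, and it expects one "status conclusion" pair per line, which gh can produce with its built-in --jq filter.

```shell
# all_runs_green: reads one "status conclusion" pair per line on stdin.
# Returns 0 only if there is at least one line and every line is
# exactly "completed success"; returns 1 otherwise.
all_runs_green() {
    awk '
        { seen = 1 }
        $1 != "completed" || $2 != "success" { bad = 1 }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage (not executed here):
#   gh run list --repo VoXc2/dealix --branch main --limit 3 \
#       --json status,conclusion \
#       --jq '.[] | "\(.status) \(.conclusion)"' \
#     | all_runs_green || { echo "CI not green, aborting"; exit 1; }
```

An empty result is treated as a failure on purpose, so a broken gh call cannot be mistaken for a green build.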


Scenario 2 — Rollback (bad deploy, 5-min target)

Trigger: error rate spike in Sentry, /health/deep red, 5xx surge, or user complaint post-deploy.

  1. Announce in channel (if you have one — otherwise note in GitHub issue):

    Rolling back prod to last good SHA due to <reason>.

  2. SSH in, revert to last known-good SHA

    ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
    cd /opt/dealix
    LAST_GOOD=$(cat /opt/dealix/.last_good_sha)
    git checkout "$LAST_GOOD"
    /opt/dealix/.venv/bin/pip install -r requirements.txt
    
  3. Roll back migrations only if the bad deploy added new ones

    # Check what the bad deploy added:
    /opt/dealix/.venv/bin/alembic history | head
    # Downgrade one step ONLY if necessary:
    /opt/dealix/.venv/bin/alembic downgrade -1
    

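    Rule: never downgrade more than 1 step without Sami's explicit approval. Alembic can only downgrade a revision whose migration file is on disk, so when a downgrade is needed, run it while the bad deploy's code is still checked out (before the git checkout in step 2).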

  4. Restart + verify

    systemctl restart dealix-api
    curl -sf https://api.dealix.me/health/deep | jq .
    
  5. Re-open main for fix via PR (no direct commits to main; main is protected).

DoD: health green on old SHA within 5 minutes, Sentry error rate back to baseline within 10 min. Target: <5 min from decision to rollback complete.
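
Whether the bad deploy added migrations (step 3) can be answered from the list of changed files rather than from the Alembic history alone. A minimal sketch, assuming migrations live under alembic/versions/ (the default-style layout; this path is an assumption and is not stated in this runbook). The function name touches_migrations is hypothetical.

```shell
# touches_migrations [DIR]: reads changed file paths (one per line) on
# stdin. Returns 0 if any path lies under the migrations directory.
# ASSUMPTION: migrations live in alembic/versions/ unless DIR is given.
touches_migrations() {
    grep -q "^${1:-alembic/versions}/"
}

# Usage on the server after `git fetch origin` (not executed here):
#   LAST_GOOD=$(cat /opt/dealix/.last_good_sha)
#   if git diff --name-only "$LAST_GOOD" origin/main | touches_migrations; then
#       echo "Deploy changed migrations: review before alembic downgrade -1"
#   fi
```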


Scenario 3 — Database Down / Unreachable

Signals: /health/deep reports postgres: error, 500s on leads endpoints, Sentry OperationalError.

  1. Triage — is it us or the DB?

    ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
    systemctl status postgresql --no-pager
    sudo -u postgres psql -c "SELECT 1;"
    
  2. If systemd says failed:

    journalctl -u postgresql -n 200 --no-pager
    systemctl restart postgresql
    sleep 3
    sudo -u postgres psql -c "SELECT 1;"
    
  3. If disk full (most common real cause):

    df -h /
    # List the largest Postgres data and backup directories to find what to rotate:
    du -sh /var/lib/postgresql/* /var/backups/postgres/* 2>/dev/null | sort -h
    # DO NOT delete active WAL. Rotate old backups only.
    
  4. If connection pool exhausted (API is up, DB healthy, but API can't connect):

    sudo -u postgres psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
    systemctl restart dealix-api  # drops API's stale connections
    
  5. If DB corrupt / cannot start → restore from backup (Scenario 5).

  6. Post-incident:

    • Write a 5-line postmortem in docs/incidents/YYYY-MM-DD.md.
    • If webhooks arrived during outage, drain WEBHOOKS_DLQ:
      curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
        'https://api.dealix.me/api/v1/admin/dlq/webhooks/drain?limit=100'
      

DoD: /health/deep postgres green; no lost webhooks (DLQ drained or re-queued).
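
Since a full disk is named as the most common real cause, the df check in step 3 can be turned into a yes/no answer. This is a sketch only: the function names are hypothetical and the 90% default threshold is an illustrative assumption, not a value defined by this runbook.

```shell
# root_usage_pct: reads `df -P /` output on stdin and prints the
# used-capacity percentage as a bare integer.
root_usage_pct() {
    # With -P (POSIX format) the data row is line 2 and capacity is column 5.
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_is_full [THRESHOLD]: reads `df -P /` output on stdin.
# Returns 0 when usage is at or above THRESHOLD percent.
# ASSUMPTION: the default of 90 is illustrative only.
disk_is_full() {
    [ "$(root_usage_pct)" -ge "${1:-90}" ]
}

# Usage on the server (not executed here):
#   df -P / | disk_is_full 90 && echo "Disk nearly full: rotate old backups"
```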


Scenario 4 — LLM Provider Down (Anthropic / OpenAI / Google)

Signals: Sentry shows provider timeouts, /health/deep llm_providers yellow/red, ConnectorFacade circuit breaker open, workflow failures in PostHog.

  1. Confirm it's the provider, not us (for example, check the provider's public status page).

  2. The circuit breaker should already be doing its job — requests to the failing provider return fast with CircuitOpenError, DLQ absorbs the failures.

  3. Verify breaker state:

    curl -s -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/dlq/stats | jq .
    

    If OUTBOUND_DLQ or ENRICHMENT_DLQ is growing fast, breaker is protecting us — no action on our side.

  4. Temporary failover (if one provider is the primary and down for >30 min): Edit /opt/dealix/.env → LLM_PROVIDER_PRIORITY="openai,google,anthropic" (reorder).

    systemctl restart dealix-api
    

    Rule: this is the ONLY allowed in-place .env edit, and the sole exception to the First rule. Commit the change back to .env.example (without secrets) the next business day.

  5. When provider recovers:

    • Breaker auto-half-opens after 60s, then closes on first success.
    • Drain the relevant DLQ to replay queued work:
      curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
        'https://api.dealix.me/api/v1/admin/dlq/outbound/drain?limit=50'
      

DoD: /health/deep llm_providers green; DLQ depth returning to zero; no workflow failures in last 10 min.
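
The reorder in step 4 amounts to moving the failing provider to the end of LLM_PROVIDER_PRIORITY while keeping the others in order. A minimal sketch of that transformation follows. The function name demote_provider is hypothetical, and the starting order in the usage comment is an assumption about the current configuration.

```shell
# demote_provider PRIORITY FAILING
# Prints the comma-separated PRIORITY list with FAILING moved to the
# end, keeping the relative order of the remaining providers.
# The body runs in a subshell so the IFS change does not leak out.
demote_provider() (
    out=""
    found=""
    IFS=,
    for p in $1; do
        if [ "$p" = "$2" ]; then
            found=1
        else
            out="${out:+$out,}$p"
        fi
    done
    # Re-append the failing provider last, only if it was in the list.
    [ -n "$found" ] && out="${out:+$out,}$2"
    printf '%s\n' "$out"
)

# Usage (not executed here; the starting order is an assumption):
#   demote_provider "anthropic,openai,google" anthropic
```

Printing the new value first, rather than editing .env in place from a script, keeps the one permitted manual edit visible and reviewable.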


Scenario 5 — Backup Restoration (Data Loss / Corruption)

Trigger: DB corrupt, accidental mass delete, ransomware, or monthly drill (required).

Preflight

  1. Identify the target backup:
    ls -lht /var/backups/postgres/*.sql.gz | head -5
    
  2. Never restore into production DB. Restore into a staging clone first, validate, then swap.

Drill / Restore procedure (on staging — monthly required)

  1. Create isolated DB:
    sudo -u postgres createdb dealix_restore_test
    
  2. Restore:
    BACKUP=/var/backups/postgres/dealix-YYYY-MM-DD.sql.gz
    gunzip -c "$BACKUP" | sudo -u postgres psql dealix_restore_test
    
  3. Validate row counts against prod (sanity):
    sudo -u postgres psql dealix_restore_test -c "SELECT 'leads' t, count(*) FROM leads UNION ALL SELECT 'users', count(*) FROM users;"
    
  4. Validate latest lead timestamp is within acceptable RPO (≤24h):
    sudo -u postgres psql dealix_restore_test -c "SELECT max(created_at) FROM leads;"
    
  5. Teardown:
    sudo -u postgres dropdb dealix_restore_test
    

Real incident restore (production data loss)

  1. Stop API to freeze writes:
    systemctl stop dealix-api
    
  2. Rename current DB (do NOT drop — evidence):
    sudo -u postgres psql -c "ALTER DATABASE dealix RENAME TO dealix_corrupt_$(date +%Y%m%d);"
    sudo -u postgres createdb dealix
    
  3. Restore latest good backup:
    BACKUP=/var/backups/postgres/dealix-YYYY-MM-DD.sql.gz  # the backup chosen in Preflight
    gunzip -c "$BACKUP" | sudo -u postgres psql dealix
    
  4. Restart API + verify:
    systemctl start dealix-api
    curl -sf https://api.dealix.me/health/deep | jq .
    
  5. Postmortem mandatory — how data was lost, why backup gap existed, what changed.

DoD (drill): restore completes in ≤15 min, row counts ±5% of prod, max timestamp within RPO.

DoD (incident): API back up, no data newer than last backup lost (document gap).
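
The two numeric drill criteria (row counts within ±5% of prod, newest row within the 24h RPO) can be checked with integer arithmetic. This is a sketch with hypothetical function names. It assumes the timestamp is fetched as epoch seconds, for example with psql's -At flags as shown in the usage comment.

```shell
# within_pct RESTORED PROD PCT
# Returns 0 if RESTORED is within +/-PCT percent of PROD.
within_pct() (
    diff=$(( $1 - $2 ))
    # Take the absolute value of the difference.
    [ "$diff" -lt 0 ] && diff=$(( 0 - $diff ))
    [ $(( $diff * 100 )) -le $(( $2 * $3 )) ]
)

# within_rpo LATEST_EPOCH NOW_EPOCH MAX_HOURS
# Returns 0 if the newest restored row is at most MAX_HOURS old.
within_rpo() (
    [ $(( $2 - $1 )) -le $(( $3 * 3600 )) ]
)

# Usage on the server (not executed here):
#   latest=$(sudo -u postgres psql -At dealix_restore_test \
#       -c "SELECT extract(epoch FROM max(created_at))::bigint FROM leads;")
#   within_rpo "$latest" "$(date +%s)" 24 || echo "Backup is older than the RPO"
```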


Scenario 6 — Security Incident (suspected breach)

Signals: unexplained admin API calls, fail2ban banning authorized IPs, unexpected outbound traffic, Sentry PermissionError spike, unknown webhook signatures failing.

  1. Contain first, investigate second:

    ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
    # Rotate ALL secrets immediately
    cd /opt/dealix && bash scripts/rotate_secrets.sh
    systemctl restart dealix-api
    
  2. Lock down UFW to known IPs only (temporary):

    # Save current rules first:
    ufw status numbered > /tmp/ufw.before.$(date +%s)
    # Restrict SSH to your IP:
    ufw delete allow 22/tcp || true
    ufw allow from <YOUR_PUBLIC_IP> to any port 22
    
  3. Check auth logs:

    journalctl -u ssh -n 500 --no-pager | grep -iE 'accepted|failed'
    fail2ban-client status sshd
    
  4. Preserve evidence:

    tar -czf /root/incident-$(date +%s).tgz /var/log/nginx /var/log/auth.log /var/log/dealix*
    
  5. Notify Sami, document in docs/incidents/, file GitHub security advisory if user data touched.

DoD: all secrets rotated, UFW locked, attacker IPs banned, incident doc drafted within 1h.
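
The grep in step 3 returns raw log lines. A per-address summary makes it easier to decide which IPs to ban. A minimal sketch, assuming the standard OpenSSH "Failed password for ... from ADDRESS port ..." log format. The function name failed_by_ip is hypothetical.

```shell
# failed_by_ip: reads sshd log lines on stdin and prints "COUNT IP"
# for each source address with failed password attempts, highest
# count first.
# ASSUMPTION: lines follow the OpenSSH "Failed password ... from IP" format.
failed_by_ip() {
    awk '/Failed password/ {
             # The address is the field right after the word "from".
             for (i = 1; i < NF; i++)
                 if ($i == "from") { print $(i + 1); break }
         }' \
      | sort | uniq -c | sort -rn \
      | awk '{ print $1, $2 }'
}

# Usage on the server (not executed here):
#   journalctl -u ssh -n 500 --no-pager | failed_by_ip | head
```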


Appendix A — Health Check Cheat Sheet

Signal            | Command                                    | Expected
Liveness          | curl -sf https://api.dealix.me/health      | 200 OK
Deep health       | curl -sf https://api.dealix.me/health/deep | postgres+redis+llm_providers green
CI status         | gh run list --repo VoXc2/dealix --limit 3  | success
DLQ depth         | GET /api/v1/admin/dlq/stats                | 0 across all queues
Pending approvals | GET /api/v1/admin/approvals/pending        | <10
Service status    | systemctl status dealix-api                | active (running)
fail2ban          | fail2ban-client status sshd                | jail active
Nginx             | systemctl status nginx                     | active (running)
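
Right after a restart the liveness check may fail for a few seconds while the service starts. A small retry wrapper avoids reading that as an outage. This is a sketch, not existing tooling: the function name retry is hypothetical, and the attempt count and delay in the usage comment are illustrative assumptions.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
# The body runs in a subshell, so `exit` only ends the function.
retry() (
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && exit 0
        [ "$n" -ge "$attempts" ] && exit 1
        n=$(( $n + 1 ))
        sleep "$delay"
    done
)

# Usage (not executed here; 10 attempts and 3 s are assumptions):
#   retry 10 3 curl -sf https://api.dealix.me/health
```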

Appendix B — Do-Not-Touch List

  • main branch: protected, no direct push
  • /opt/dealix/.env.pre-v3.0.0.bak: emergency reference, never delete
  • server-backup-20260423-084442 branch: historical evidence, do not force-delete
  • Postgres dealix_corrupt_* dbs: keep for 7 days post-incident before dropping

Appendix C — Runbook Review

Review every 4 weeks. If any command in this runbook failed during a real incident, update it immediately after that incident closes.