system-prompts-and-models-o.../dealix/docs/ops/INCIDENT_RUNBOOK.md
2026-05-01 14:03:52 +03:00

3.4 KiB

🚨 Dealix — Incident Runbook

Use when production breaks. Goal: Restore service + communicate + learn.


Severity Levels

Level Definition Response Time Communication
SEV-1 Full outage, no customers can use service < 15 min Immediate Slack + email customers
SEV-2 Major feature broken, most customers affected < 1 hour Slack alert, status page update
SEV-3 Minor bug, one customer affected < 4 hours Individual customer comms
SEV-4 Cosmetic, not user-blocking < 24 hours Ticket only

SEV-1 Response (Full Outage)

Within 5 minutes

  1. Confirm: Open /healthz in browser. If 5xx or timeout → SEV-1.
  2. Check Railway dashboard → service status
  3. Check UptimeRobot → when did it start?

Within 15 minutes

  1. Diagnose:
    • Last deploy in Railway?
    • Recent PR merged?
    • DB connection?
    • Moyasar API outage?
  2. Mitigate:
    • Roll back last deploy if caused by recent change
    • Restart service in Railway
    • Check env vars

Communicate

Post to customers (if any active):

نواجه مشكلة فنية مؤقتة في النظام. الفريق يعمل على حلها.
سنحدثكم خلال 30 دقيقة.
— فريق Dealix

After Resolution (within 48h)

Write post-mortem:

  1. Timeline
  2. Root cause
  3. What worked
  4. What didn't
  5. Action items to prevent recurrence

Common Issues + Fixes

Issue: /api/v1/* returns 404

Likely cause: Deploy failed or wrong Start Command. Fix:

  1. Railway → Deployments → check latest deploy status
  2. If failed: check logs, fix, redeploy
  3. If succeeded but still 404: Settings → Start Command = /app/start.sh

Issue: Moyasar webhook returns 401

Likely cause: Secret mismatch. Fix:

  1. Railway → Variables → MOYASAR_WEBHOOK_SECRET
  2. Moyasar Dashboard → Webhooks → same secret
  3. Must be identical string

Issue: Database connection refused

Likely cause: DATABASE_URL wrong or Postgres add-on down. Fix:

  1. Railway → PostgreSQL service → check status
  2. Copy connection string
  3. Update env var
  4. Redeploy

Issue: High error rate in Sentry

Likely cause: New deploy or traffic spike. Fix:

  1. Check last deploy diff
  2. If unrelated: scale Railway resources
  3. If related: roll back

Rollback Procedure

Railway Rollback (2 minutes)

  1. Railway → Deployments
  2. Find previous successful deployment
  3. Click ... → Redeploy
  4. Wait for Active status
  5. Verify /healthz = 200

Git Revert (if code caused)

git checkout main
git revert <bad-commit-sha>
git push origin main
# CI runs, deploy triggered automatically

Who to Contact

Issue Contact
Backend down Sami (founder, on-call 24/7)
Payment processing Moyasar support
Domain DNS Domain registrar
Hosting Railway support

Monitoring Setup Check

Run monthly:

  • Sentry alerts still firing? (trigger test error)
  • UptimeRobot still polling? (check dashboard)
  • Slack channel #dealix-alerts active?
  • Emergency phone numbers current?

Learning from Incidents

Every SEV-1 or SEV-2 requires:

  1. Post-mortem within 48 hours
  2. File in docs/ops/postmortems/YYYY-MM-DD-summary.md
  3. Review in weekly team sync (even solo)
  4. Update this runbook if new pattern