🚨 Dealix — Incident Runbook
Use when production breaks. Goal: Restore service + communicate + learn.
Severity Levels
| Level | Definition | Response Time | Communication |
|---|---|---|---|
| SEV-1 | Full outage, no customers can use service | < 15 min | Immediate Slack + email customers |
| SEV-2 | Major feature broken, most customers affected | < 1 hour | Slack alert, status page update |
| SEV-3 | Minor bug, one customer affected | < 4 hours | Individual customer comms |
| SEV-4 | Cosmetic, not user-blocking | < 24 hours | Ticket only |
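The response-time column can be turned into a concrete deadline once an incident is detected. The following is a minimal sketch, not part of the existing tooling; the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical, and the durations are copied from the table above.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets copied from the severity table in this runbook.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
    "SEV-4": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest time by which a response is due for this severity."""
    if severity not in RESPONSE_TARGETS:
        raise ValueError(f"unknown severity: {severity!r}")
    return detected_at + RESPONSE_TARGETS[severity]


if __name__ == "__main__":
    detected = datetime.now(timezone.utc)
    print("Respond by:", response_deadline("SEV-1", detected).isoformat())
```

Using timezone-aware timestamps keeps the deadline unambiguous when it is pasted into Slack or a post-mortem timeline.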
SEV-1 Response (Full Outage)
Within 5 minutes
- Confirm: Open /healthz in browser. If 5xx or timeout → SEV-1.
- Check Railway dashboard → service status
- Check UptimeRobot → when did it start?
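The confirmation step can also be run from a terminal instead of a browser. This is a rough sketch using only the Python standard library; the helper names `probe` and `is_sev1` are hypothetical, and the URL you pass in is whatever your production host is (it is not stated in this runbook).

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None


def is_sev1(status: Optional[int]) -> bool:
    """Apply the runbook rule: 5xx or timeout on /healthz means SEV-1."""
    return status is None or 500 <= status <= 599
```

A call such as `is_sev1(probe("https://<your-host>/healthz"))` then gives a yes/no answer; a 4xx response is deliberately not treated as SEV-1 here because the rule above names only 5xx and timeouts.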
Within 15 minutes
- Diagnose:
- Last deploy in Railway?
- Recent PR merged?
- DB connection?
- Moyasar API outage?
- Mitigate:
- Roll back last deploy if caused by recent change
- Restart service in Railway
- Check env vars
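For the "Check env vars" step, a quick script can list variables that are unset or blank. This is a sketch under assumptions: only `DATABASE_URL` and `MOYASAR_WEBHOOK_SECRET` are named in this runbook, so the list is certainly incomplete for the real service, and `missing_env_vars` is a hypothetical helper.

```python
import os

# Only these two names appear in this runbook; extend the tuple for your service.
REQUIRED_VARS = ("DATABASE_URL", "MOYASAR_WEBHOOK_SECRET")


def missing_env_vars(required=REQUIRED_VARS, environ=os.environ) -> list:
    """Return the required variable names that are unset or contain only whitespace."""
    return [name for name in required if not environ.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("Missing:", ", ".join(missing) if missing else "none")
```

The script reports names only and never prints values, so its output is safe to paste into an incident channel.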
Communicate
Post to customers (if any active):
We are experiencing a temporary technical issue with the system. The team is working to resolve it.
We will update you within 30 minutes.
The Dealix team
After Resolution (within 48h)
Write post-mortem:
- Timeline
- Root cause
- What worked
- What didn't
- Action items to prevent recurrence
Common Issues + Fixes
Issue: /api/v1/* returns 404
Likely cause: Deploy failed or wrong Start Command. Fix:
- Railway → Deployments → check latest deploy status
- If failed: check logs, fix, redeploy
- If succeeded but still 404: Settings → Start Command = /app/start.sh
Issue: Moyasar webhook returns 401
Likely cause: Secret mismatch. Fix:
- Railway → Variables → MOYASAR_WEBHOOK_SECRET
- Moyasar Dashboard → Webhooks → same secret
- Must be identical string
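Secret mismatches are often invisible to the eye (a trailing newline, pasted quotes). The sketch below compares two copies of a secret and names the kind of difference without printing either value. `diagnose_secret_mismatch` is a hypothetical helper, and it makes no assumption about how Moyasar itself transmits or verifies the secret.

```python
import hmac


def diagnose_secret_mismatch(railway_value: str, moyasar_value: str) -> str:
    """Explain how two copies of a secret differ, without revealing either one."""
    # compare_digest gives a constant-time equality check on the raw bytes.
    if hmac.compare_digest(railway_value.encode(), moyasar_value.encode()):
        return "identical"
    a, b = railway_value.strip(), moyasar_value.strip()
    if a == b:
        return "differ only by leading/trailing whitespace"
    if a.strip("\"'") == b.strip("\"'"):
        return "differ only by surrounding quotes"
    if len(railway_value) != len(moyasar_value):
        return "different lengths"
    return "same length, different content"
```

Run it locally with both values pasted in, then discard the shell history; the point is to learn which side to re-copy, not to log the secret.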
Issue: Database connection refused
Likely cause: DATABASE_URL wrong or Postgres add-on down. Fix:
- Railway → PostgreSQL service → check status
- Copy connection string
- Update env var
- Redeploy
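Before copying a new connection string, it can help to confirm which host and port the current `DATABASE_URL` points at and whether that port accepts TCP connections at all. A minimal standard-library sketch follows; `db_endpoint` and `can_connect` are hypothetical names, and the default port 5432 is the usual PostgreSQL default rather than something stated in this runbook.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse


def db_endpoint(database_url: str) -> Tuple[str, int]:
    """Extract (host, port) from a postgres:// or postgresql:// URL."""
    parsed = urlparse(database_url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("DATABASE_URL has no host")
    # Fall back to the conventional PostgreSQL port when none is given.
    return parsed.hostname, parsed.port or 5432


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection opens; this says nothing about credentials."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `can_connect(*db_endpoint(url))` is false, the problem is reachability (wrong host or port, or the database service is down); if it is true, look at credentials or the database name instead.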
Issue: High error rate in Sentry
Likely cause: New deploy or traffic spike. Fix:
- Check last deploy diff
- If unrelated to the deploy (traffic spike): scale Railway resources
- If related to the deploy: roll back
Rollback Procedure
Railway Rollback (2 minutes)
- Railway → Deployments
- Find previous successful deployment
- Click ... → Redeploy
- Wait for Active status
- Verify /healthz = 200
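The final verification step can be automated by polling until the health endpoint returns 200. The sketch below is illustrative only: `wait_until_healthy` is a hypothetical helper, the check function is injected by the caller, and the defaults (12 attempts, 10 seconds apart) are an assumption chosen to roughly match the two-minute window in the heading.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    check: Callable[[], Optional[int]],
    attempts: int = 12,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if check() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no need to wait after the final attempt
    return False
```

`check` can be any zero-argument callable that returns an HTTP status code or `None`; passing `sleep` as a parameter keeps the function easy to exercise without real delays.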
Git Revert (if code caused)
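git checkout main
git pull --ff-only origin main   # make sure local main matches the remote
git revert <bad-commit-sha>      # adds a new commit that undoes the bad one
git push origin main
# CI runs, deploy triggered automatically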
Who to Contact
| Issue | Contact |
|---|---|
| Backend down | Sami (founder, on-call 24/7) |
| Payment processing | Moyasar support |
| Domain DNS | Domain registrar |
| Hosting | Railway support |
Monitoring Setup Check
Run monthly:
- Sentry alerts still firing? (trigger test error)
- UptimeRobot still polling? (check dashboard)
- Slack channel #dealix-alerts active?
- Emergency phone numbers current?
Learning from Incidents
Every SEV-1 or SEV-2 requires:
- Post-mortem within 48 hours
- File in docs/ops/postmortems/YYYY-MM-DD-summary.md
- Review in weekly team sync (even solo)
- Update this runbook if new pattern
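To lower the friction of writing the post-mortem, the file name and section skeleton can be generated. This sketch assumes that the `summary` part of the path is a short hyphenated slug of the incident title and that the sections are the five listed under "After Resolution"; `postmortem_path` and `postmortem_skeleton` are hypothetical helpers.

```python
import re
from datetime import date
from pathlib import Path

# Sections taken from the "After Resolution" list in this runbook.
SECTIONS = [
    "Timeline",
    "Root cause",
    "What worked",
    "What didn't",
    "Action items to prevent recurrence",
]


def postmortem_path(day: date, summary: str) -> Path:
    """Build docs/ops/postmortems/YYYY-MM-DD-<slug>.md from a date and short title."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return Path("docs/ops/postmortems") / f"{day.isoformat()}-{slug}.md"


def postmortem_skeleton(severity: str, summary: str) -> str:
    """Return a Markdown skeleton with one empty section per required heading."""
    lines = [f"# {severity}: {summary}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```

Writing the result with `postmortem_path(...).write_text(postmortem_skeleton(...))` from the repository root creates the file in the location named above, provided the directory already exists.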