mirror of
https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools.git
synced 2026-06-18 15:29:36 +00:00
135 lines
3.4 KiB
Markdown
135 lines
3.4 KiB
Markdown
# 🚨 Dealix — Incident Runbook
|
|
|
|
**Use when production breaks.**
|
|
**Goal:** Restore service + communicate + learn.
|
|
|
|
---
|
|
|
|
## Severity Levels
|
|
|
|
| Level | Definition | Response Time | Communication |
|
|
|-------|-----------|---------------|---------------|
|
|
| **SEV-1** | Full outage, no customers can use service | < 15 min | Immediate Slack + email customers |
|
|
| **SEV-2** | Major feature broken, most customers affected | < 1 hour | Slack alert, status page update |
|
|
| **SEV-3** | Minor bug, one customer affected | < 4 hours | Individual customer comms |
|
|
| **SEV-4** | Cosmetic, not user-blocking | < 24 hours | Ticket only |
|
|
|
|
---
|
|
|
|
## SEV-1 Response (Full Outage)
|
|
|
|
### Within 5 minutes
|
|
1. Confirm: Open `/healthz` in browser. If 5xx or timeout → SEV-1.
|
|
2. Check Railway dashboard → service status
|
|
3. Check UptimeRobot → when did it start?
|
|
|
|
### Within 15 minutes
|
|
1. **Diagnose:**
|
|
- Last deploy in Railway?
|
|
- Recent PR merged?
|
|
- DB connection?
|
|
- Moyasar API outage?
|
|
2. **Mitigate:**
|
|
- Roll back last deploy if caused by recent change
|
|
- Restart service in Railway
|
|
- Check env vars
|
|
|
|
### Communicate
|
|
Post to customers (if any active):
|
|
```
|
|
نواجه مشكلة فنية مؤقتة في النظام. الفريق يعمل على حلها.
|
|
سنحدثكم خلال 30 دقيقة.
|
|
— فريق Dealix
|
|
```
|
|
|
|
### After Resolution (within 48h)
|
|
Write post-mortem:
|
|
1. Timeline
|
|
2. Root cause
|
|
3. What worked
|
|
4. What didn't
|
|
5. Action items to prevent recurrence
|
|
|
|
---
|
|
|
|
## Common Issues + Fixes
|
|
|
|
### Issue: `/api/v1/*` returns 404
|
|
**Likely cause:** Deploy failed or wrong Start Command.
|
|
**Fix:**
|
|
1. Railway → Deployments → check latest deploy status
|
|
2. If failed: check logs, fix, redeploy
|
|
3. If succeeded but still 404: Settings → Start Command = `/app/start.sh`
|
|
|
|
### Issue: Moyasar webhook returns 401
|
|
**Likely cause:** Secret mismatch.
|
|
**Fix:**
|
|
1. Railway → Variables → `MOYASAR_WEBHOOK_SECRET`
|
|
2. Moyasar Dashboard → Webhooks → same secret
|
|
3. Must be identical string
|
|
|
|
### Issue: Database connection refused
|
|
**Likely cause:** DATABASE_URL wrong or Postgres add-on down.
|
|
**Fix:**
|
|
1. Railway → PostgreSQL service → check status
|
|
2. Copy connection string
|
|
3. Update env var
|
|
4. Redeploy
|
|
|
|
### Issue: High error rate in Sentry
|
|
**Likely cause:** New deploy or traffic spike.
|
|
**Fix:**
|
|
1. Check last deploy diff
|
|
2. If unrelated: scale Railway resources
|
|
3. If related: roll back
|
|
|
|
---
|
|
|
|
## Rollback Procedure
|
|
|
|
### Railway Rollback (2 minutes)
|
|
1. Railway → Deployments
|
|
2. Find previous successful deployment
|
|
3. Click `...` → Redeploy
|
|
4. Wait for `Active` status
|
|
5. Verify `/healthz` = 200
|
|
|
|
### Git Revert (if code caused)
|
|
```bash
|
|
git checkout main
|
|
git revert <bad-commit-sha>
|
|
git push origin main
|
|
# CI runs, deploy triggered automatically
|
|
```
|
|
|
|
---
|
|
|
|
## Who to Contact
|
|
|
|
| Issue | Contact |
|
|
|-------|---------|
|
|
| Backend down | Sami (founder, on-call 24/7) |
|
|
| Payment processing | Moyasar support |
|
|
| Domain DNS | Domain registrar |
|
|
| Hosting | Railway support |
|
|
|
|
---
|
|
|
|
## Monitoring Setup Check
|
|
|
|
Run monthly:
|
|
- [ ] Sentry alerts still firing? (trigger test error)
|
|
- [ ] UptimeRobot still polling? (check dashboard)
|
|
- [ ] Slack channel `#dealix-alerts` active?
|
|
- [ ] Emergency phone numbers current?
|
|
|
|
---
|
|
|
|
## Learning from Incidents
|
|
|
|
Every SEV-1 or SEV-2 requires:
|
|
1. Post-mortem within 48 hours
|
|
2. File in `docs/ops/postmortems/YYYY-MM-DD-summary.md`
|
|
3. Review in weekly team sync (even solo)
|
|
4. Update this runbook if new pattern
|