Incident & Rollback Runbook
What to do when things are on fire. Kept short on purpose — reach for this at 3am, not in a quiet conference room.
0. Severity ladder
| Level | Definition | Response |
|---|---|---|
| P0 | Production down, data loss risk, active security incident | All-hands; incident commander within 15 min |
| P1 | Major feature broken, elevated error rate (>10%), single-customer blocker | On-call owns; update every 30 min |
| P2 | Degraded performance, minor feature broken | Business-hours triage |
| P3 | Cosmetic, non-blocking | Backlog |
1. First 10 minutes (any P0/P1)
- Declare. Post in #incidents: severity, symptoms, impact, one incident commander.
- Stop the bleeding. If a recent deploy is suspect → roll back (see §6).
- Freeze. No new deploys to the affected env until green.
- Preserve. Snapshot logs, metrics, traces, DB state before touching anything.
- Communicate. Customer-facing status update if a customer is affected.
2. Common P0/P1 scenarios
2.1 LLM provider outage
Symptoms: high 5xx rate from core.llm.router, with the error spike concentrated on one provider.
Response:
- Check the provider's status page.
- Verify the router is falling back (router.usage_summary()["<provider>"]["fallbacks_triggered"]).
- If the fallback chain is not triggering, force-route by setting the env var override and restarting the app.
- If multiple providers are down, pause the affected pipeline (set a feature flag).
Prevention: keep fallback chains healthy; watch for drops in trust_policy_decisions_total.
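The fallback check in the response steps can be scripted. A minimal sketch, assuming usage_summary() returns a dict keyed by provider whose values carry an integer fallbacks_triggered counter, as the lookup above suggests. The requests_failed key and the helper name are hypothetical; substitute whatever failure counter the router actually exposes.

```python
from typing import Any, Mapping


def providers_not_falling_back(summary: Mapping[str, Mapping[str, Any]]) -> list[str]:
    """Return providers that report failures but never triggered a fallback.

    `summary` is assumed to have the shape of router.usage_summary().
    "requests_failed" is a hypothetical field used for illustration only.
    """
    stuck = []
    for provider, stats in summary.items():
        failed = stats.get("requests_failed", 0)
        fallbacks = stats.get("fallbacks_triggered", 0)
        if failed > 0 and fallbacks == 0:
            stuck.append(provider)
    return sorted(stuck)
```

Any provider this returns is a candidate for the force-route override.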
2.2 Postgres unavailable
Symptoms: asyncpg.PostgresError, API returning 5xx from /api/v1/leads.
Response:
- Check DB container / managed service status.
- Check connection pool saturation (db.session._engine().pool.status()).
- If the DB is healthy, restart the app (pool may be stale after a failover).
- If the DB is down, surface a 503 from the app and queue inbound webhooks for replay.
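Queueing webhooks for replay can be sketched as below. This is an illustration only: the class name is hypothetical and the buffer lives in memory, so it assumes the process stays up during the outage. A real deployment would want durable storage.

```python
import json
from collections import deque
from typing import Callable


class WebhookReplayQueue:
    """In-memory buffer for webhooks received while the DB is down.

    Hypothetical helper; events are lost if the process restarts.
    """

    def __init__(self) -> None:
        self._pending: deque[str] = deque()

    def enqueue(self, payload: dict) -> None:
        # Store the serialized payload so later mutation cannot alter it.
        self._pending.append(json.dumps(payload))

    def replay(self, handler: Callable[[dict], None]) -> int:
        """Feed queued payloads to `handler` in arrival order.

        Stops at the first failure and leaves that payload, and everything
        after it, in the queue. Returns the number processed.
        """
        processed = 0
        while self._pending:
            payload = json.loads(self._pending[0])
            try:
                handler(payload)
            except Exception:
                break
            self._pending.popleft()
            processed += 1
        return processed

    def __len__(self) -> int:
        return len(self._pending)
```

Replay preserves arrival order and is safe to call repeatedly, provided the handler itself is idempotent for the payload that failed.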
2.3 HubSpot 429 (rate limit)
Symptoms: crm_sync_failed warnings spiking; deals not landing in HubSpot.
Response:
- The CRMAgent already retries with exponential backoff. Confirm retries are firing.
- If sustained, reduce concurrent sync rate via app config.
- If this happens regularly, file a request with HubSpot to raise the rate limit.
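For reference when confirming that retries are firing, this is what capped exponential backoff with jitter generally looks like. It is a generic sketch, not the CRMAgent implementation; RateLimited, the function names, and the default delays are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical exception raised when the CRM answers HTTP 429."""


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 5) -> list[float]:
    """Un-jittered schedule: base, base*factor, ... each capped at `cap`."""
    return [min(cap, base * factor**i) for i in range(attempts)]


def call_with_backoff(fn: Callable[[], T], attempts: int = 5, base: float = 1.0,
                      cap: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying on RateLimited up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(base=base, cap=cap, attempts=attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return fn()
        except RateLimited:
            if attempt == attempts:
                raise
            # Full jitter: sleep a random amount up to the scheduled delay
            # so concurrent workers do not retry in lockstep.
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

In logs, healthy retries show gaps between attempts that grow roughly geometrically; evenly spaced or immediate retries suggest backoff is not applied.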
2.4 WhatsApp webhook signature failures
Symptoms: whatsapp_invalid_signature warnings; inbound leads not processed.
Response:
- Verify WHATSAPP_APP_SECRET matches the Meta app dashboard.
- Check for clock skew on the server.
- If signature verification is misconfigured but source is trusted, temporarily disable the check (config flag) and re-enable after fix.
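To reproduce a failure by hand, the check can be rerun offline against a captured request. A minimal sketch, assuming the webhook carries Meta's X-Hub-Signature-256 header (the string sha256= followed by a hex HMAC-SHA256 of the raw body, keyed with the app secret). The function name is hypothetical and this is not necessarily how the app implements the check.

```python
import hashlib
import hmac


def signature_is_valid(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check an X-Hub-Signature-256 header value against the raw request body."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    received = header_value[len(prefix):]
    expected = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the comparison does not
    # leak how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

The HMAC must be computed over the exact bytes received. Verifying against a re-serialized JSON body is a common cause of spurious signature failures.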
2.5 Suspected secret leak
Symptoms: gitleaks alert, unusual API activity, provider notifying you.
Response:
- Immediately rotate the affected key in the provider dashboard.
- Update .env / secrets manager with the new key.
- Redeploy.
- Run gitleaks detect --source . --log-level debug against the full history.
- If the leak was in a pushed commit: force-push a history rewrite only if the repo is private and team coordination allows. If the repo is public, assume the secret is compromised; rotation is the only remedy.
- File a SECURITY.md report.
2.6 Approval Center stuck
Symptoms: trust_approval_lag_seconds climbing; approvals not flowing.
Response:
- Check notifier health (email/Slack/WhatsApp delivery).
- POST /api/v1/trust/approvals/check-timeouts (or run ApprovalCenter.check_timeouts()).
- If the queue has grown large, temporarily raise the TTL and process backlog manually.
2.7 Tool verification contradictions spiking
Symptoms: trust_tool_contradictions_total{tool=<name>} rising.
Response:
- Inspect ToolVerificationLedger.contradictions().
- If the intended action format changed (e.g. prompt drift), revert the prompt or fix the schema.
- If a tool is actually misbehaving, disable the agent that uses it (feature flag).
3. Data incidents
Breach response (personal data)
Per PDPL:
- Contain: revoke access, rotate keys.
- Assess scope: which entities, which data classes.
- Notify SDAIA within 72 hours (PDPL requirement for qualifying breaches).
- Notify affected data subjects if required by risk assessment.
- Document: root cause, timeline, remediation.
See compliance_saudi.yaml for the full PDPL workflow and DPO contact.
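The 72-hour window is easy to miscount at 3am. A small sketch for working out the deadline, assuming the clock starts when the team becomes aware of the breach; confirm the actual trigger against compliance_saudi.yaml. Function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NOTIFY_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time to notify, counted from when the breach became known."""
    if became_aware_at.tzinfo is None:
        raise ValueError("use a timezone-aware timestamp")
    return became_aware_at + NOTIFY_WINDOW


def hours_remaining(became_aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (notification_deadline(became_aware_at) - now).total_seconds() / 3600


if __name__ == "__main__":
    aware = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    print(notification_deadline(aware).isoformat())
```

Timezone-aware timestamps are required so the deadline does not shift when responders are in different zones.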
4. Incident roles
- Incident Commander (IC) — drives the response; doesn't debug.
- Ops Lead — mitigates, deploys, rolls back.
- Comms Lead — customer status, internal updates.
- Scribe — timeline notes in the incident channel.
For small incidents, one person can hold multiple roles.
5. Post-incident
Within 3 business days:
- Blameless post-mortem document in docs/incidents/YYYY-MM-DD-<slug>.md.
- Timeline, root cause, contributing factors, what worked, what didn't, action items.
- Review in next architecture meeting; close out action items.
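To keep post-mortem filenames consistent, the path can be generated rather than typed. A minimal sketch; the slug rule (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption, not an established project convention.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build docs/incidents/YYYY-MM-DD-<slug>.md from a date and a title."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return PurePosixPath("docs/incidents") / f"{incident_date.isoformat()}-{slug}.md"
```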
6. Rollback procedures
6.1 Application rollback
# Find previous tag
gh release list
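# Pull the previous image (docker compose pull takes service names, not image references)
docker pull ghcr.io/ORG/ai-company-saudi:v<prev>
# Restart; the compose file must reference v<prev> for the old image to be used
docker compose up -d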
# Verify
curl -fv https://api.ai-company.sa/health
6.2 DB migration rollback
# Downgrade one revision
alembic downgrade -1
# Or to a specific revision
alembic downgrade <revision_id>
Rollback window: within 24h of a deploy, rolling back is blame-free. After 24h, prefer fix-forward unless severity demands a rollback.
6.3 Feature flag rollback
If the problematic change is behind a flag, disable the flag first; no deploy needed.
7. Pre-incident hygiene (preventive)
- Healthchecks on every environment (/health) monitored every 30s
- Error rate alerts at >1% for 5 minutes
- Latency p95 alerts at >10s for 5 minutes
- LLM fallback rate alerts at >20% for 10 minutes
- Weekly restore-test on the DB backups
- Monthly game day: simulate an LLM outage + a DB failover
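The "threshold for N minutes" alerts above all share one shape: fire only when the metric stays over the line for the whole window. A sketch of that logic with the thresholds from the list; the rule names, the sample format, and the coverage requirement are assumptions, and a real setup would normally express this in the monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float
    window_seconds: int


# Thresholds taken from the hygiene list; the keys are hypothetical metric names.
RULES = {
    "error_rate": AlertRule("error_rate", 0.01, 5 * 60),
    "latency_p95_seconds": AlertRule("latency_p95_seconds", 10.0, 5 * 60),
    "llm_fallback_rate": AlertRule("llm_fallback_rate", 0.20, 10 * 60),
}


def should_fire(rule: AlertRule, samples: Sequence[tuple[float, float]], now: float) -> bool:
    """Decide whether `rule` has been breached for its entire window.

    `samples` are (unix_timestamp, value) pairs. The alert fires only if
    history reaches back to the start of the window and every sample
    inside the window is above the threshold.
    """
    start = now - rule.window_seconds
    covered = any(ts <= start for ts, _ in samples)
    in_window = [value for ts, value in samples if start < ts <= now]
    return covered and bool(in_window) and all(v > rule.threshold for v in in_window)
```

Requiring full-window coverage keeps a freshly started service from paging on its first bad sample.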
8. Who to page
| Condition | Who |
|---|---|
| App down / 5xx storm | On-call platform |
| DB down | On-call platform + DBA (if staffed) |
| Security incident | Security lead + DPO |
| Customer-facing issue | On-call platform + Customer Success |
| LLM cost spike | On-call platform + CTO |
| PDPL breach candidate | DPO + Legal + Security |
Concrete names, phones, escalation chain: dealix/masters/oncall.md (per deployment — NOT committed publicly).