mancitrus/system-prompts-and-models-of-ai-tools

mirror of https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools.git synced 2026-06-17 23:09:35 +00:00

Sami Assiri f79c69ff25 ci(dealix): root GitHub workflows, ai-company track, full Dealix API tree

Made-with: Cursor

2026-05-01 14:03:52 +03:00


Incident & Rollback Runbook

What to do when things are on fire. Kept short on purpose — reach for this at 3am, not in a quiet conference room.

0. Severity ladder

Level	Definition	Response
P0	Production down, data loss risk, active security incident	All-hands; incident commander within 15 min
P1	Major feature broken, elevated error rate (>10%), single-customer blocker	On-call owns; update every 30 min
P2	Degraded performance, minor feature broken	Business-hours triage
P3	Cosmetic, non-blocking	Backlog

1. First 10 minutes (any P0/P1)

Declare. Post in #incidents: severity, symptoms, impact, one incident commander.
Stop the bleeding. If a recent deploy is suspect → roll back (see §6).
Freeze. No new deploys to the affected env until green.
Preserve. Snapshot logs, metrics, traces, DB state before touching anything.
Communicate. Customer-facing status update if a customer is affected.

2. Common P0/P1 scenarios

2.1 LLM provider outage

Symptoms: high 5xx rate from core.llm.router, with the error spike concentrated on one provider.

Response:

Check the provider's status page.
Verify the router is falling back (router.usage_summary()["<provider>"]["fallbacks_triggered"]).
If the fallback chain is not triggering, force-route to a healthy provider by setting the env var override and restarting the app.
If multiple providers are down, pause the affected pipeline (set a feature flag).

Prevention: keep fallback chains healthy; alert when trust_policy_decisions_total drops.
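
The fallback check in the response steps can be sketched as below. This assumes router.usage_summary() returns a dict keyed by provider name; fallbacks_triggered comes from this runbook, while the errors field and the provider names are hypothetical and used only for illustration.

```python
from typing import Mapping


def providers_without_fallback(summary: Mapping[str, Mapping[str, int]]) -> list[str]:
    """Return providers that report errors but never triggered a fallback.

    `summary` is assumed to look like router.usage_summary(): one entry per
    provider. The "errors" key is a hypothetical field, not a confirmed one.
    """
    stuck = []
    for provider, stats in summary.items():
        errors = stats.get("errors", 0)
        fallbacks = stats.get("fallbacks_triggered", 0)
        if errors > 0 and fallbacks == 0:
            stuck.append(provider)
    return sorted(stuck)


# Made-up numbers; provider names are placeholders.
example = {
    "provider_a": {"errors": 120, "fallbacks_triggered": 0},
    "provider_b": {"errors": 40, "fallbacks_triggered": 38},
    "provider_c": {"errors": 0, "fallbacks_triggered": 0},
}
print(providers_without_fallback(example))  # ['provider_a']
```

Any provider this returns is a candidate for the force-route step.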

2.2 Postgres unavailable

Symptoms: asyncpg.PostgresError, API returning 5xx from /api/v1/leads.

Response:

Check DB container / managed service status.
Check connection pool saturation (db.session._engine().pool.status()).
If the DB is healthy, restart the app (pool may be stale after a failover).
If the DB is down, surface a 503 from the app and queue inbound webhooks for replay.
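
One way to queue inbound webhooks for replay is a local JSON Lines spool. This is a minimal sketch, not the app's actual mechanism: spool_webhook, replay_spool and the payload shape are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def spool_webhook(spool_file: Path, payload: dict) -> None:
    """Append one inbound webhook payload to a JSON Lines spool file."""
    with spool_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(payload) + "\n")


def replay_spool(spool_file: Path, handler: Callable[[dict], None]) -> int:
    """Feed every spooled payload to `handler` in arrival order.

    The spool is deleted only after all payloads were handled, so a crash
    mid-replay re-delivers everything; the handler must be idempotent.
    """
    if not spool_file.exists():
        return 0
    lines = spool_file.read_text(encoding="utf-8").splitlines()
    for line in lines:
        handler(json.loads(line))
    spool_file.unlink()
    return len(lines)


# Demo against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp) / "webhooks.jsonl"
    spool_webhook(spool, {"lead_id": 1})
    spool_webhook(spool, {"lead_id": 2})
    seen: list[dict] = []
    replayed = replay_spool(spool, seen.append)

print(replayed, seen)  # 2 [{'lead_id': 1}, {'lead_id': 2}]
```

A durable queue is the better choice where one is available; the file spool only illustrates the accept-now, process-later pattern.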

2.3 HubSpot 429 (rate limit)

Symptoms: crm_sync_failed warnings spiking; deals not landing in HubSpot.

Response:

The CRMAgent already retries with exponential backoff. Confirm retries are firing.
If sustained, reduce concurrent sync rate via app config.
If this happens regularly, file a rate-limit increase request with HubSpot.
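
For reference when confirming that retries are firing, exponential backoff with jitter looks roughly like this. It is a generic sketch, not the CRMAgent's code: RateLimited, the base delay, the cap and the retry count are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when the CRM answers HTTP 429."""


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Upper bound of the delay before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on RateLimited with jittered exponential backoff."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fn()
        except RateLimited:
            # Full jitter: sleep a random time in [0, delay] to avoid bursts.
            sleep(random.uniform(0, delay))
    return fn()  # final attempt; the error propagates if it still fails


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

In the logs, healthy retries show up as gaps between attempts that roughly double until the cap.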

2.4 WhatsApp webhook signature failures

Symptoms: whatsapp_invalid_signature warnings; inbound leads not processed.

Response:

Verify WHATSAPP_APP_SECRET matches the Meta app dashboard.
Check for clock skew on the server.
If signature verification is misconfigured but source is trusted, temporarily disable the check (config flag) and re-enable after fix.
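
Before disabling the check, it helps to reproduce the verification by hand. Meta signs webhook payloads with HMAC-SHA256 keyed by the app secret and sends the result in the X-Hub-Signature-256 header as sha256=<hex digest>. The sketch below assumes the app verifies that header; the function name and sample values are hypothetical.

```python
import hashlib
import hmac


def is_valid_signature(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check an X-Hub-Signature-256 style header ("sha256=<hex digest>")."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    expected = hmac.new(app_secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = header_value[len(prefix):]
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


secret = "example-app-secret"  # placeholder, not a real secret
body = b'{"entry": []}'
header = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
print(is_valid_signature(secret, body, header))         # True
print(is_valid_signature(secret, body + b" ", header))  # False
```

The digest must be computed over the raw request bytes. Verifying against re-serialized JSON is a common cause of failures that look like a wrong secret.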

2.5 Suspected secret leak

Symptoms: gitleaks alert, unusual API activity, provider notifying you.

Response:

Immediately rotate the affected key in the provider dashboard.
Update .env / secrets manager with the new key.
Redeploy.
Run gitleaks detect --source . --log-level debug against the full history.
If the leak was in a pushed commit: force-push a history rewrite only if the repo is private and team coordination allows. If the repo is public, assume the secret is leaked; rotation is the only remedy.
File a report following SECURITY.md.

2.6 Approval Center stuck

Symptoms: trust_approval_lag_seconds climbing; approvals not flowing.

Response:

Check notifier health (email/Slack/WhatsApp delivery).
Trigger a timeout sweep: POST /api/v1/trust/approvals/check-timeouts (or run ApprovalCenter.check_timeouts()).
If the queue has grown large, temporarily raise the TTL and process backlog manually.
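
To see why raising the TTL buys time, here is a sketch of a timeout sweep. It is not ApprovalCenter.check_timeouts() itself; PendingApproval and expired_approvals are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PendingApproval:
    """Hypothetical shape of a queued approval; not the real model."""
    approval_id: str
    requested_at: datetime


def expired_approvals(
    pending: list[PendingApproval], now: datetime, ttl: timedelta
) -> list[PendingApproval]:
    """Return the approvals that have waited longer than `ttl`."""
    return [a for a in pending if now - a.requested_at > ttl]


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
queue = [
    PendingApproval("a-1", now - timedelta(minutes=30)),
    PendingApproval("a-2", now - timedelta(hours=2)),
    PendingApproval("a-3", now - timedelta(hours=3)),
]
short_ttl = [a.approval_id for a in expired_approvals(queue, now, timedelta(hours=1))]
long_ttl = [a.approval_id for a in expired_approvals(queue, now, timedelta(hours=4))]
print(short_ttl)  # ['a-2', 'a-3']
print(long_ttl)   # []
```

With the longer TTL nothing expires during the sweep, which leaves room to work through the backlog by hand.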

2.7 Tool verification contradictions spiking

Symptoms: trust_tool_contradictions_total{tool=<name>} rising.

Response:

Inspect ToolVerificationLedger.contradictions().
If the intended action format changed (e.g. prompt drift), revert the prompt or fix the schema.
If a tool is actually misbehaving, disable the agent that uses it (feature flag).

3. Data incidents

Breach response (personal data)

Per PDPL:

Contain: revoke access, rotate keys.
Assess scope: which entities, which data classes.
Notify SDAIA within 72 hours (PDPL requirement for qualifying breaches).
Notify affected data subjects if required by risk assessment.
Document: root cause, timeline, remediation.

See compliance_saudi.yaml for the full PDPL workflow and DPO contact.
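
A small helper keeps the 72-hour clock visible during the response. This sketch assumes the window starts when the team becomes aware of the breach; confirm the trigger against compliance_saudi.yaml and the DPO. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at: datetime) -> datetime:
    """Latest time to notify, 72 hours after the team became aware."""
    if became_aware_at.tzinfo is None:
        raise ValueError("use a timezone-aware timestamp")
    return became_aware_at + NOTIFICATION_WINDOW


def hours_remaining(became_aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (notification_deadline(became_aware_at) - now).total_seconds() / 3600


riyadh = timezone(timedelta(hours=3))
aware = datetime(2026, 5, 1, 14, 0, tzinfo=riyadh)
print(notification_deadline(aware).isoformat())  # 2026-05-04T14:00:00+03:00
print(hours_remaining(aware, datetime(2026, 5, 2, 20, 0, tzinfo=riyadh)))  # 42.0
```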

4. Incident roles

Incident Commander (IC) — drives the response; doesn't debug.
Ops Lead — mitigates, deploys, rolls back.
Comms Lead — customer status, internal updates.
Scribe — timeline notes in the incident channel.

For small incidents, one person can hold multiple roles.

5. Post-incident

Within 3 business days:

Blameless post-mortem document in docs/incidents/YYYY-MM-DD-<slug>.md.
Timeline, root cause, contributing factors, what worked, what didn't, action items.
Review in next architecture meeting; close out action items.
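
To keep post-mortem filenames consistent with the docs/incidents/YYYY-MM-DD-<slug>.md convention, a helper like the following could generate the path. The slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption; the runbook does not define one.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build docs/incidents/YYYY-MM-DD-<slug>.md from a date and a title."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return PurePosixPath("docs/incidents") / f"{incident_date.isoformat()}-{slug}.md"


print(postmortem_path(date(2026, 5, 1), "HubSpot 429 storm!"))
# docs/incidents/2026-05-01-hubspot-429-storm.md
```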

6. Rollback procedures

6.1 Application rollback

# Find previous tag
gh release list

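# Pin the previous tag in the compose file (image: ghcr.io/ORG/ai-company-saudi:v<prev>),
# then pull and restart. docker compose pull takes service names, not image references.
docker compose pull
docker compose up -d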

# Verify
curl -fv https://api.ai-company.sa/health
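
A single curl can hit the app while it is still starting. The verification step can instead be polled, as in this standard-library sketch. The attempt count and interval are assumptions, and the fetch and sleep parameters exist only so the loop can be exercised without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    interval: float = 3.0,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False


# Dry run with canned responses instead of real requests.
canned = iter([0, 503, 200])
ok = wait_until_healthy(
    "https://api.ai-company.sa/health",
    fetch=lambda _url: next(canned),
    sleep=lambda _s: None,
)
print(ok)  # True
```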

6.2 DB migration rollback

# Downgrade one revision
alembic downgrade -1

# Or to a specific revision
alembic downgrade <revision_id>

Rollback window: within 24h of the change, roll back without blame. After 24h, prefer fix-forward unless severity demands a rollback.

6.3 Feature flag rollback

If the problematic change is behind a flag, disable the flag first; no deploy needed.

7. Pre-incident hygiene (preventive)

Healthchecks on every environment (/health) monitored every 30s
Error rate alerts at >1% for 5 minutes
Latency p95 alerts at >10s for 5 minutes
LLM fallback rate alerts at >20% for 10 minutes
Weekly restore-test on the DB backups
Monthly game day: simulate an LLM outage + a DB failover
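
The "above X for N minutes" alerts above come down to a sustained-threshold check. The sketch below assumes metrics are sampled every 30 seconds, the same cadence as the health checks; the real alerting stack may evaluate windows differently.

```python
from typing import Sequence

SAMPLE_INTERVAL_S = 30  # assumed sampling cadence


def samples_for(duration_s: int) -> int:
    """Number of samples that cover `duration_s` seconds."""
    return duration_s // SAMPLE_INTERVAL_S


def breaches_for_window(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])


five_minutes = samples_for(5 * 60)  # 10 samples at 30s
error_rates = [0.002] * 5 + [0.03] * 10
print(breaches_for_window(error_rates, threshold=0.01, window=five_minutes))  # True

recovering = [0.03] * 9 + [0.005]
print(breaches_for_window(recovering, threshold=0.01, window=five_minutes))  # False
```

Requiring every sample in the window to breach keeps a single noisy reading from paging anyone.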

8. Who to page

Condition	Who
App down / 5xx storm	On-call platform
DB down	On-call platform + DBA (if staffed)
Security incident	Security lead + DPO
Customer-facing issue	On-call platform + Customer Success
LLM cost spike	On-call platform + CTO
PDPL breach candidate	DPO + Legal + Security

Concrete names, phones, escalation chain: dealix/masters/oncall.md (per deployment — NOT committed publicly).
