system-prompts-and-models-o.../salesflow-saas/SLO.md
Claude 4d385f0482
feat(dealix): k6 smoke test, SLO definition, fault-injection tests, env update
Close 3 more launch gates:
- T5: k6 smoke test script (scripts/k6_smoke_test.js) with p95<500ms
  and <1% error rate thresholds, tests health/pricing/DLQ/approvals
- O5: SLO.md with latency targets per endpoint category, recovery
  objectives (RPO 24h, RTO 15min), and escalation matrix
- DLQ fault-injection tests (6/6 passing): webhook crash → DLQ,
  retry-then-succeed, exhausted retries → dead, circuit breaker
  open/recover, multi-queue isolation

Also:
- .env.example updated with POSTHOG_*, MOYASAR_SECRET_KEY,
  MOYASAR_WEBHOOK_SECRET, DLQ_*, CALENDLY_* settings
- LAUNCH_GATES.md updated: 13/33 gates closed, 5 blocked on
  founder API keys (PostHog/Moyasar/HubSpot/Calendly/UptimeRobot)

https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs
2026-04-23 10:46:57 +00:00

# Dealix Service Level Objectives (SLO)
**Version:** 1.0.0

**Effective:** 2026-04-23

**Review:** Monthly, or after any incident

---
## API Availability
| SLI | Target | Measurement | Alert Threshold |
|-----|--------|-------------|-----------------|
| Uptime (monthly) | 99.5% | UptimeRobot on `/api/v1/health` | < 99% triggers incident |
| Health endpoint response | < 200ms p95 | k6 smoke test | > 500ms p95 |
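
To make the uptime figures concrete, the sketch below converts an availability percentage into an allowed-downtime budget. The function name `error_budget_minutes` and the 30-day month are assumptions for illustration; they are not part of the Dealix tooling.

```bash
# Allowed downtime, in minutes, for an availability target over a window.
# usage: error_budget_minutes <target_percent> <window_days>
# (hypothetical helper, not an existing Dealix script)
error_budget_minutes() {
  LC_ALL=C awk -v target="$1" -v days="$2" 'BEGIN {
    total = days * 24 * 60                      # minutes in the window
    printf "%.1f\n", total * (100 - target) / 100
  }'
}

error_budget_minutes 99.5 30   # SLO target: prints 216.0
error_budget_minutes 99 30     # incident threshold: prints 432.0
```

Under these assumptions, the 99.5% target leaves about 3.6 hours of downtime per 30-day month, and the 99% incident threshold corresponds to 7.2 hours.
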
## API Latency
| Endpoint Category | p50 Target | p95 Target | p99 Target |
|-------------------|------------|------------|------------|
| Health / public reads | < 50ms | < 200ms | < 500ms |
| Pricing / plans | < 100ms | < 300ms | < 1000ms |
| Lead CRUD | < 200ms | < 500ms | < 2000ms |
| AI agent calls | < 2000ms | < 5000ms | < 10000ms |
| Webhook processing | < 500ms | < 2000ms | < 5000ms |
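
One way to compare measured latencies against a row of this table is a nearest-rank percentile over raw samples. This is a minimal sketch: `percentile` and `check_latency` are hypothetical helper names, the input is assumed to be a non-empty list of latencies in milliseconds, one per line, and the sample data is synthetic.

```bash
# Nearest-rank percentile of latency samples (one value in ms per line on stdin).
# usage: percentile <p>   where p is an integer from 1 to 100
percentile() {
  sort -n | LC_ALL=C awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceil(p * NR / 100)
      print v[rank]
    }'
}

# Compare p50/p95/p99 of the samples on stdin against three limits (ms).
# usage: check_latency <p50_max> <p95_max> <p99_max>
check_latency() {
  local samples p limit actual status=0
  samples=$(cat)
  for p in 50 95 99; do
    limit=$1; shift
    actual=$(printf '%s\n' "$samples" | percentile "$p")
    if LC_ALL=C awk -v a="$actual" -v l="$limit" 'BEGIN { exit !(a < l) }'; then
      echo "p$p ${actual}ms < ${limit}ms ok"
    else
      echo "p$p ${actual}ms >= ${limit}ms VIOLATION"
      status=1
    fi
  done
  return "$status"
}

# Synthetic samples 10, 20, ..., 200 ms checked against the Lead CRUD row.
seq 10 10 200 | check_latency 200 500 2000
```

The function returns a non-zero status when any percentile misses its limit, so it can gate a script or CI step. Real samples would have to come from a load test or access logs, which this sketch does not cover.
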
## Error Rate
| Metric | Target | Alert |
|--------|--------|-------|
| HTTP 5xx rate | < 0.5% of requests | > 1% for 5 min |
| Webhook failure rate | < 2% | > 5% for 15 min |
| DLQ depth | < 10 entries | > 50 entries |
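
The 5xx row can be read as three bands: within target, over target but below the alert threshold, and alerting. The sketch below classifies a single measurement window from raw counts. `error_rate_status` is a hypothetical helper, and the "for 5 min" duration condition is deliberately not modeled; the caller is assumed to pass counts for the window of interest.

```bash
# Classify the HTTP 5xx rate for one measurement window.
# usage: error_rate_status <total_requests> <count_5xx>
# (hypothetical helper; the sustained-for-5-min condition is left to the caller)
error_rate_status() {
  LC_ALL=C awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) { print "no-data"; exit }
    rate = 100 * errors / total
    if (rate > 1)        state = "alert"         # above the 1% alert threshold
    else if (rate < 0.5) state = "ok"            # within the 0.5% target
    else                 state = "over-target"   # between target and alert
    printf "%.2f%% %s\n", rate, state
  }'
}

error_rate_status 10000 30    # prints 0.30% ok
error_rate_status 10000 70    # prints 0.70% over-target
error_rate_status 10000 150   # prints 1.50% alert
```
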
## Recovery
| Metric | Target |
|--------|--------|
| RPO (Recovery Point Objective) | 24 hours (daily DB backup) |
| RTO (Recovery Time Objective) | 15 minutes (tested via drill) |
| Rollback time | < 5 minutes (git checkout + restart) |
| MTTR (Mean Time To Recovery) | < 30 minutes |
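
A 24-hour RPO only holds if the most recent backup is never older than 24 hours. The sketch below checks that condition from two Unix timestamps. `backup_within_rpo` is a hypothetical helper, and how the timestamp of the latest backup is obtained is left open, since that depends on the backup setup.

```bash
# Succeeds when the latest backup is no older than the 24-hour RPO.
# usage: backup_within_rpo <backup_epoch_seconds> [now_epoch_seconds]
# (hypothetical helper; the source of the backup timestamp is an assumption)
backup_within_rpo() {
  local backup now rpo_seconds
  backup=$1
  now=${2:-$(date +%s)}            # default to the current time
  rpo_seconds=$((24 * 60 * 60))    # RPO of 24 hours
  [ $((now - backup)) -le "$rpo_seconds" ]
}

# Example with fixed timestamps: a backup taken 23 hours before "now".
if backup_within_rpo 0 $((23 * 3600)); then
  echo "backup is within RPO"
else
  echo "backup is older than RPO"
fi
```
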
## Revenue Funnel
| Step | Freshness Target |
|------|-----------------|
| Lead capture PostHog event | < 5 seconds |
| Payment webhook PostHog event | < 10 seconds |
| DLQ entry first retry | < 60 seconds |
| Approval request notification | < 5 minutes |
## Monitoring
| System | Check Interval | Alert Channel |
|--------|---------------|---------------|
| UptimeRobot | 5 minutes | SMS + Email |
| Sentry | Real-time | Email |
| DLQ depth | On demand (admin API request) | Dashboard |
| Circuit breakers | On demand (admin API request) | Dashboard |

---
## How to Verify
```bash
# Health latency (prints total request time in seconds)
curl -w "%{time_total}s\n" -o /dev/null -s https://api.dealix.me/api/v1/health

# k6 smoke test
k6 run --env API_BASE=https://api.dealix.me scripts/k6_smoke_test.js

# DLQ depth ($TOKEN must hold a bearer token accepted by the admin endpoints)
curl -H "Authorization: Bearer $TOKEN" https://api.dealix.me/api/v1/admin/dlq/queues

# Circuit breaker states
curl -H "Authorization: Bearer $TOKEN" https://api.dealix.me/api/v1/admin/circuit-breakers
```
## Escalation
| Severity | Condition | Response |
|----------|-----------|----------|
| P1 - Critical | Service down > 5 min | Immediate (see RUNBOOK Scenario 1) |
| P2 - Major | Error rate > 5% for 15 min | Within 1 hour |
| P3 - Minor | Latency > SLO for 30 min | Within 4 hours |
| P4 - Low | DLQ depth > 20 | Next business day |
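
The matrix above can also be expressed as a first-match rule, checked from most to least severe. This is a sketch under assumptions: `classify_severity` is a hypothetical helper, the inputs are assumed to be measured elsewhere, and "for 15 min" and "for 30 min" are interpreted as "at least that long".

```bash
# Map current measurements to the highest matching severity.
# usage: classify_severity <down_min> <error_pct> <error_min> <latency_breach_min> <dlq_depth>
# (hypothetical helper; duration conditions are read as "at least N minutes")
classify_severity() {
  LC_ALL=C awk -v down="$1" -v err="$2" -v err_min="$3" -v lat_min="$4" -v dlq="$5" 'BEGIN {
    if (down > 5)                       print "P1"    # service down > 5 min
    else if (err > 5 && err_min >= 15)  print "P2"    # error rate > 5% for 15 min
    else if (lat_min >= 30)             print "P3"    # latency over SLO for 30 min
    else if (dlq > 20)                  print "P4"    # DLQ depth > 20
    else                                print "none"
  }'
}

classify_severity 0 6.5 20 0 3    # prints P2
classify_severity 0 0.2 0 0 25    # prints P4
```
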