mirror of
https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools.git
synced 2026-06-18 15:29:36 +00:00
Close 3 more launch gates: - T5: k6 smoke test script (scripts/k6_smoke_test.js) with p95<500ms and <1% error rate thresholds, tests health/pricing/DLQ/approvals - O5: SLO.md with latency targets per endpoint category, recovery objectives (RPO 24h, RTO 15min), and escalation matrix - DLQ fault-injection tests (6/6 passing): webhook crash → DLQ, retry-then-succeed, exhausted retries → dead, circuit breaker open/recover, multi-queue isolation Also: - .env.example updated with POSTHOG_*, MOYASAR_SECRET_KEY, MOYASAR_WEBHOOK_SECRET, DLQ_*, CALENDLY_* settings - LAUNCH_GATES.md updated: 13/33 gates closed, 5 blocked on founder API keys (PostHog/Moyasar/HubSpot/Calendly/UptimeRobot) https://claude.ai/code/session_01W1rJthWDkasijTdXCfxVHs
87 lines
2.6 KiB
Markdown
87 lines
2.6 KiB
Markdown
# Dealix Service Level Objectives (SLO)
|
|
|
|
**Version:** 1.0.0
|
|
**Effective:** 2026-04-23
|
|
**Review:** Monthly, or after any incident
|
|
|
|
---
|
|
|
|
## API Availability
|
|
|
|
| SLI | Target | Measurement | Alert Threshold |
|
|
|-----|--------|-------------|-----------------|
|
|
| Uptime (monthly) | 99.5% | UptimeRobot on `/api/v1/health` | < 99% triggers incident |
|
|
| Health endpoint response | < 200ms p95 | k6 smoke test | > 500ms p95 |
|
|
|
|
## API Latency
|
|
|
|
| Endpoint Category | p50 Target | p95 Target | p99 Target |
|
|
|-------------------|------------|------------|------------|
|
|
| Health / public reads | < 50ms | < 200ms | < 500ms |
|
|
| Pricing / plans | < 100ms | < 300ms | < 1000ms |
|
|
| Lead CRUD | < 200ms | < 500ms | < 2000ms |
|
|
| AI agent calls | < 2000ms | < 5000ms | < 10000ms |
|
|
| Webhook processing | < 500ms | < 2000ms | < 5000ms |
|
|
|
|
## Error Rate
|
|
|
|
| Metric | Target | Alert |
|
|
|--------|--------|-------|
|
|
| HTTP 5xx rate | < 0.5% of requests | > 1% for 5 min |
|
|
| Webhook failure rate | < 2% | > 5% for 15 min |
|
|
| DLQ depth | < 10 entries | > 50 triggers alert |
|
|
|
|
## Recovery
|
|
|
|
| Metric | Target |
|
|
|--------|--------|
|
|
| RPO (Recovery Point Objective) | 24 hours (daily DB backup) |
|
|
| RTO (Recovery Time Objective) | 15 minutes (tested via drill) |
|
|
| Rollback time | < 5 minutes (git checkout + restart) |
|
|
| MTTR (Mean Time To Recovery) | < 30 minutes |
|
|
|
|
## Revenue Funnel
|
|
|
|
| Step | Freshness Target |
|
|
|------|-----------------|
|
|
| Lead capture → PostHog event | < 5 seconds |
|
|
| Payment webhook → PostHog event | < 10 seconds |
|
|
| DLQ entry → first retry | < 60 seconds |
|
|
| Approval request → notification | < 5 minutes |
|
|
|
|
## Monitoring
|
|
|
|
| System | Check Interval | Alert Channel |
|
|
|--------|---------------|---------------|
|
|
| UptimeRobot | 5 minutes | SMS + Email |
|
|
| Sentry | Real-time | Email |
|
|
| DLQ depth | On admin request | Dashboard |
|
|
| Circuit breakers | On admin request | Dashboard |
|
|
|
|
---
|
|
|
|
## How to Verify
|
|
|
|
```bash
|
|
# Health latency
|
|
curl -w "%{time_total}s\n" -o /dev/null -s https://api.dealix.me/api/v1/health
|
|
|
|
# k6 smoke test
|
|
k6 run --env API_BASE=https://api.dealix.me scripts/k6_smoke_test.js
|
|
|
|
# DLQ depth
|
|
curl -H "Authorization: Bearer $TOKEN" https://api.dealix.me/api/v1/admin/dlq/queues
|
|
|
|
# Circuit breaker states
|
|
curl -H "Authorization: Bearer $TOKEN" https://api.dealix.me/api/v1/admin/circuit-breakers
|
|
```
|
|
|
|
## Escalation
|
|
|
|
| Severity | Condition | Response |
|
|
|----------|-----------|----------|
|
|
| P1 - Critical | Service down > 5 min | Immediate (see RUNBOOK Scenario 1) |
|
|
| P2 - Major | Error rate > 5% for 15 min | Within 1 hour |
|
|
| P3 - Minor | Latency > SLO for 30 min | Within 4 hours |
|
|
| P4 - Low | DLQ depth > 20 | Next business day |
|