system-prompts-and-models-o.../dealix/docs/SLO.md

# Dealix Service Level Objectives (SLO) — v3.0.0

**Status:** Draft / Skeleton
**Owner:** Sami (sami.assiri11@gmail.com)
**Review cadence:** Monthly for first 3 months, then quarterly

---

## Philosophy

We are in **Primitive Launch** phase with <10 customers. SLOs should be:
- **Conservative** (high bar) on correctness and security
- **Lenient** on latency until we have real traffic baselines
- **Boring** — easy to explain, easy to measure, easy to page on

No 99.99% theater. If we can't measure it cheaply, it's not an SLO.

---

## Tier 1 — Paid-customer critical path

| SLI | SLO | Measurement | Why |
|---|---|---|---|
| `/health/deep` returns 200 | **99.5%** over 30 days | UptimeRobot poll every 60s | Core liveness — if this is red, nothing works |
| `/api/v1/webhooks/moyasar` p95 latency | **<2000ms** | Sentry transaction sampling | Payment webhook must not time out and retry incorrectly |
| `/api/v1/checkout` success rate | **>98%** | 2xx / total, excl. 4xx from bad input | Revenue path — 5xx here loses real money |
| Moyasar payment event → PostHog event | **<60s e2e** | timestamp delta | Funnel accuracy depends on this |

**Error budget (30d):** 3.6 hours of `/health/deep` downtime.

## Tier 2 — Operational

| SLI | SLO | Measurement |
|---|---|---|
| DLQ depth (`webhooks` queue) | **<10 entries** at any time | `/admin/dlq/stats` poll every 5m |
| DLQ age (oldest entry) | **<24h** | queue inspection; alert if older |
| Approvals pending | **<50 requests** | `/admin/approvals/stats` |
| LLM provider fallback rate | **<5%** of requests | `/admin/costs` breakdown |

## Tier 3 — Cost

| SLI | SLO | Measurement |
|---|---|---|
| Daily LLM spend | **<$10 USD/day** with alert | `/admin/costs` aggregated daily |
| Redis memory | **<500MB** | `redis-cli INFO memory used_memory_human` |
| Postgres connections | **<80** | `pg_stat_activity` count |

---

## Alerting policy

- **Page Sami immediately** if:
  - `/health/deep` returns non-200 for >5 consecutive minutes
  - DLQ `webhooks` depth >50
  - Moyasar webhook 5xx rate >5% over 10 minutes
- **Slack/email (non-urgent)** if:
  - Any Tier 1 SLO burns >25% of its 30d budget in a single day
  - Daily LLM cost >$15
  - Approvals pending >100

---

## Dashboards (to build)

Minimum viable dashboard (Grafana or Sentry Dashboards):

1. **Liveness row:** `/health`, `/health/deep`, process uptime
2. **Revenue row:** /checkout 2xx/5xx count (last 24h), pending approvals, Moyasar webhook rate
3. **Backlog row:** DLQ depth per queue, oldest entry age, approvals pending
4. **Cost row:** LLM spend per provider (last 24h), Redis memory, Postgres connections

**Current state:** 0/4 rows built. Sentry already collects performance data but no dashboard cut yet.

---

## What closes O5 gate

1. This document merged to `docs/SLO.md` ✅ (this PR)
2. At least Tier 1 dashboard built and linked in RUNBOOK.md → **blocked on UptimeRobot API key** for external health SLI
3. Alert routing configured → **blocked on UptimeRobot + Slack/email settings**

---

## Revisions

| Date | Change | Author |
|---|---|---|
| 2026-04-23 | Initial skeleton created as part of Primitive Launch D0 hardening | Agent (approved by Sami) |