Sami Assiri f79c69ff25 ci(dealix): root GitHub workflows, ai-company track, full Dealix API tree

Made-with: Cursor

2026-05-01 14:03:52 +03:00

Dealix Service Level Objectives (SLO) — v3.0.0

Status: Draft / Skeleton

Owner: Sami (sami.assiri11@gmail.com)

Review cadence: Monthly for first 3 months, then quarterly

Philosophy

We are in the Primitive Launch phase with fewer than 10 customers. SLOs should be:

No 99.99% theater. If we can't measure it cheaply, it's not an SLO.

| SLI | SLO | Measurement | Why |
| --- | --- | --- | --- |
| `/health/deep` returns 200 | 99.5% over 30 days | UptimeRobot poll every 60s | Core liveness: if this is red, nothing works |
| `/api/v1/webhooks/moyasar` p95 latency | <2000ms | Sentry transaction sampling | Payment webhook must not time out and retry incorrectly |
| `/api/v1/checkout` success rate | >98% | 2xx / total, excl. 4xx from bad input | Revenue path: 5xx here loses real money |
| Moyasar payment event → PostHog event | <60s e2e | timestamp delta | Funnel accuracy depends on this |
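The checkout success-rate SLI in the table can be sketched as below. This is a minimal illustration, not the production measurement. The function names are hypothetical. It also assumes every 4xx response counts as "bad input" and is dropped from the denominator, which the real definition may narrow.

```python
def checkout_success_rate(status_codes):
    """Share of counted requests that ended in 2xx.

    Assumption: all 4xx responses are treated as bad input and excluded
    from the total, so only 2xx, 3xx and 5xx remain in the denominator.
    Returns None when no request is left to count.
    """
    counted = [code for code in status_codes if not 400 <= code <= 499]
    if not counted:
        return None
    successes = sum(1 for code in counted if 200 <= code <= 299)
    return successes / len(counted)


def meets_checkout_slo(status_codes, target=0.98):
    """True when the success rate is strictly above the target (>98%)."""
    rate = checkout_success_rate(status_codes)
    # With no countable traffic there is nothing to violate.
    return rate is None or rate > target
```

With fewer than 10 customers, a single 5xx can move this ratio a lot, so it is probably worth reading alongside the raw request count.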

Error budget (30d): 3.6 hours of `/health/deep` downtime (0.5% of 720 hours).
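The 3.6-hour figure follows directly from the 99.5% target over 30 days. The sketch below reproduces that arithmetic and converts it to a count of failed 60-second polls. The function names are illustrative. It assumes each failed poll stands for one full poll interval of downtime.

```python
def error_budget_hours(slo_target, window_days=30):
    """Allowed downtime in hours for an availability target over a window."""
    return (1.0 - slo_target) * window_days * 24


def allowed_failed_polls(slo_target, window_days=30, poll_interval_s=60):
    """Number of failed polls that uses up the whole budget.

    Assumption: one failed poll counts as one full poll interval of downtime.
    """
    total_polls = window_days * 24 * 3600 // poll_interval_s
    return round(total_polls * (1.0 - slo_target))


budget_hours = round(error_budget_hours(0.995), 2)  # 3.6
budget_polls = allowed_failed_polls(0.995)          # 216
print(budget_hours, budget_polls)
```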

| SLI | SLO | Measurement |
| --- | --- | --- |
| DLQ depth (`webhooks` queue) | <10 entries at any time | `/admin/dlq/stats` poll every 5m |
| DLQ age (oldest entry) | <24h | queue inspection; alert if older |
| Approvals pending | <50 requests | `/admin/approvals/stats` |
| LLM provider fallback rate | <5% of requests | `/admin/costs` breakdown |
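One way to evaluate the backlog thresholds above is a single pass over a snapshot of current values. This is a sketch only. The snapshot keys are hypothetical, and the response shape of the `/admin/...` endpoints is not specified here, so the mapping from those responses to this dictionary is left to the caller.

```python
from datetime import timedelta

# Limits taken from the table above; each SLO is "strictly below the limit".
BACKLOG_LIMITS = {
    "dlq_depth": 10,
    "dlq_oldest_age": timedelta(hours=24),
    "approvals_pending": 50,
    "llm_fallback_rate": 0.05,
}


def backlog_violations(snapshot):
    """Return the names of SLIs whose current value is at or above its limit.

    `snapshot` uses the same (hypothetical) keys as BACKLOG_LIMITS.
    """
    return [
        name
        for name, limit in BACKLOG_LIMITS.items()
        if snapshot[name] >= limit
    ]
```

For example, a snapshot with 3 DLQ entries, an oldest entry of 2 hours, 50 pending approvals and a 1% fallback rate reports only `approvals_pending`, because the SLO requires fewer than 50.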

| SLI | SLO | Measurement |
| --- | --- | --- |
| Daily LLM spend | <$10 USD/day with alert | `/admin/costs` aggregated daily |
| Redis memory | <500MB | `redis-cli INFO memory` (`used_memory_human` field) |
| Postgres connections | <80 | `pg_stat_activity` count |
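The Redis and Postgres limits can be checked from values the table already names. The sketch below parses the text that `redis-cli INFO memory` prints and reads the numeric `used_memory` field (bytes) instead of the human-readable one, since it is easier to compare. It assumes "500MB" means 500 MiB, and that the Postgres count comes from a query such as `SELECT count(*) FROM pg_stat_activity`. The function names are illustrative.

```python
REDIS_LIMIT_BYTES = 500 * 1024 * 1024  # assumption: "500MB" read as 500 MiB
PG_CONNECTION_LIMIT = 80


def parse_redis_info(text):
    """Parse `key:value` lines from INFO output, skipping `# Section` headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def resource_violations(redis_info_text, pg_connections):
    """Return the resource SLIs that are at or above their limit."""
    used_bytes = int(parse_redis_info(redis_info_text)["used_memory"])
    violations = []
    if used_bytes >= REDIS_LIMIT_BYTES:
        violations.append("redis_memory")
    if pg_connections >= PG_CONNECTION_LIMIT:
        violations.append("postgres_connections")
    return violations
```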

Page Sami immediately if:
- `/health/deep` returns non-200 for >5 consecutive minutes
- DLQ depth (`webhooks` queue) >50
- Moyasar webhook 5xx rate >5% over 10 minutes

Slack/email (non-urgent) if:
- Any Tier 1 SLO burns >25% of its 30d budget in a single day
- Daily LLM cost >$15
- Approvals pending >100
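Two of the rules above need a little logic beyond a threshold: the "more than 5 consecutive minutes" paging rule, and the 25% daily budget burn rule. A minimal sketch of both follows. It assumes 60-second polls where each failed poll counts as one minute of downtime, and uses the 3.6-hour (216-minute) budget for `/health/deep`. The function names are hypothetical.

```python
def should_page_health(poll_results, poll_interval_s=60, threshold_s=300):
    """Decide whether the trailing run of failed polls exceeds the threshold.

    `poll_results` is ordered oldest to newest; True means the poll got a 200.
    Assumption: each failed poll represents one full poll interval of downtime,
    so with 60s polls the rule fires on the sixth consecutive failure.
    """
    streak = 0
    for ok in reversed(poll_results):
        if ok:
            break
        streak += 1
    return streak * poll_interval_s > threshold_s


def burns_budget_too_fast(bad_minutes_today, budget_minutes=216, max_daily_share=0.25):
    """True when one day used more than 25% of the 30-day budget (54 minutes)."""
    return bad_minutes_today > budget_minutes * max_daily_share
```

Counting only the trailing streak means a single successful poll resets the pager, which keeps brief flaps on the non-urgent path.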

Minimum viable dashboard (Grafana or Sentry Dashboards):

- Liveness row: `/health`, `/health/deep`, process uptime
- Revenue row: `/checkout` 2xx/5xx count (last 24h), pending approvals, Moyasar webhook rate
- Backlog row: DLQ depth per queue, oldest entry age, approvals pending
- Cost row: LLM spend per provider (last 24h), Redis memory, Postgres connections

Current state: 0/4 rows built. Sentry already collects performance data, but no dashboard has been built from it yet.

- This document merged to `docs/SLO.md` ✅ (this PR)
- At least Tier 1 dashboard built and linked in `RUNBOOK.md` → blocked on UptimeRobot API key for external health SLI
- Alert routing configured → blocked on UptimeRobot + Slack/email settings

| Date | Change | Author |
| --- | --- | --- |
| 2026-04-23 | Initial skeleton created as part of Primitive Launch D0 hardening | Agent (approved by Sami) |