mancitrus/system-prompts-and-models-of-ai-tools

mirror of https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools.git synced 2026-06-18 07:19:35 +00:00

Sami Assiri f79c69ff25 ci(dealix): root GitHub workflows, ai-company track, full Dealix API tree

Made-with: Cursor

2026-05-01 14:03:52 +03:00


Dealix Production Runbook

Version: 1.0 (v3.0.0 Primitive Launch)
Owner: Sami Assiri
Last updated: 2026-04-23
Environment: Production (api.dealix.me, 188.245.55.180)

0. Contact & Access

Server: ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180
App dir: /opt/dealix
systemd unit: dealix-api.service
Database: Postgres postgresql://dealix@127.0.0.1:5432/dealix
Cache/queue: Redis 127.0.0.1:6379/0
Nginx: api.dealix.me → 127.0.0.1:8001, dealix.me → /var/www/dealix/landing
GitHub: https://github.com/VoXc2/dealix (main protected, gh CLI with api_credentials=["github"])
Sentry: DSN in /opt/dealix/.env → SENTRY_DSN
UptimeRobot: monitors https://api.dealix.me/health
PostHog: EU region https://eu.i.posthog.com

First rule

Never edit code or .env directly on the server. All changes flow through GitHub PR → deploy script; the single exception is the temporary LLM failover edit described in Scenario 4. Untracked production drift already cost 20 commits once. Not again.

Scenario 1 — Routine Deploy (merge to main → prod)

When: PR approved and merged to main.

Verify CI green on main

gh run list --repo VoXc2/dealix --branch main --limit 3 --json name,status,conclusion

All three latest must be completed / success. If not — abort.
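The gate above can be scripted so that the deploy stops on anything other than green runs. This is a minimal sketch, assuming the `gh` CLI's `--jq` flag is used to flatten each run into a `status conclusion` line; the function name `ci_all_green` is hypothetical and not part of the Dealix repo.

```shell
# Succeeds only if stdin has at least one line and every line
# reads "completed success" (status first, then conclusion).
ci_all_green() {
  awk '
    { n++; if ($1 != "completed" || $2 != "success") bad = 1 }
    END { exit ((n == 0 || bad) ? 1 : 0) }
  '
}

# Usage (needs network access and gh auth, so it is left commented out):
# gh run list --repo VoXc2/dealix --branch main --limit 3 \
#   --json status,conclusion --jq '.[] | "\(.status) \(.conclusion)"' \
#   | ci_all_green || { echo "CI not green on main, aborting" >&2; exit 1; }
```

A run that is still in progress has an empty conclusion, so it fails the check in the same way as a failed run.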

SSH to server

ssh -o StrictHostKeyChecking=no -i ~/.ssh/dealix_deploy root@188.245.55.180

Snapshot current state (rollback safety)

cd /opt/dealix
git rev-parse HEAD > /opt/dealix/.last_good_sha
cp .env .env.bak.$(date -u +%Y%m%dT%H%M%SZ)

Pull + install + migrate

git fetch origin
git checkout main
git pull --ff-only origin main
/opt/dealix/.venv/bin/pip install -r requirements.txt
/opt/dealix/.venv/bin/alembic upgrade head

Restart service

systemctl restart dealix-api
systemctl status dealix-api --no-pager

Health verification (all must pass)

curl -sf https://api.dealix.me/health
curl -sf https://api.dealix.me/health/deep | jq .
curl -sf https://api.dealix.me/api/v1/pricing/plans | jq '.plans | length'

/health/deep must show postgres, redis, llm_providers all green.
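Right after a restart the service may need a few seconds before it answers, so a single failed curl is not necessarily a bad deploy. The sketch below polls an endpoint a bounded number of times; the function name and the default of 10 attempts, 3 seconds apart, are assumptions rather than values from this runbook.

```shell
# Polls URL until curl succeeds or ATTEMPTS is exhausted.
# Arguments: URL [ATTEMPTS=10] [DELAY_SECONDS=3]
wait_until_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf --max-time 5 "$url" >/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage:
# wait_until_healthy https://api.dealix.me/health/deep 10 3 \
#   || echo "still unhealthy, consider Scenario 2" >&2
```

This only confirms an HTTP success status; the per-component fields of /health/deep still need to be read by eye or with jq.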

Trigger Sentry + PostHog probes

curl -sf -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/sentry-check

Verify in Sentry + PostHog dashboards within 60s.

DoD: Health green, Sentry ping received, no error spike in Sentry for 10 min.

Scenario 2 — Rollback (bad deploy, 5-min target)

Trigger: error rate spike in Sentry, /health/deep red, 5xx surge, or user complaint post-deploy.

Announce in channel (if you have one — otherwise note in GitHub issue):

Rolling back prod to last good SHA due to <reason>.

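SSH in and check whether the bad deploy added migrations

ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
cd /opt/dealix
# Check what the bad deploy added (revision applied to the DB vs. history):
/opt/dealix/.venv/bin/alembic current
/opt/dealix/.venv/bin/alembic history | head

Roll back migrations only if the bad deploy added new ones. Do this before checking out the old SHA: Alembic needs the new migration file on disk to run its downgrade, and that file disappears once the old code is checked out.

# Downgrade one step ONLY if necessary:
/opt/dealix/.venv/bin/alembic downgrade -1

Rule: never downgrade more than 1 step without Sami's explicit approval.

Revert to last known-good SHA

LAST_GOOD=$(cat /opt/dealix/.last_good_sha)
git checkout "$LAST_GOOD"
/opt/dealix/.venv/bin/pip install -r requirements.txt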

Restart + verify

systemctl restart dealix-api
curl -sf https://api.dealix.me/health/deep | jq .

Re-open main for fix via PR (no direct commits to main; main is protected).

DoD: health green on the old SHA within 5 minutes of the rollback decision; Sentry error rate back to baseline within 10 min.
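The rollback depends entirely on the contents of /opt/dealix/.last_good_sha, so it can be worth validating that file before running `git checkout "$LAST_GOOD"`. A minimal sketch, assuming the repository uses SHA-1 object names (40 hex characters); `is_full_sha` is a hypothetical helper.

```shell
# True if the argument is a 40-character lowercase hex string,
# which is what `git rev-parse HEAD` prints in a SHA-1 repository.
is_full_sha() {
  case "$1" in
    "" | *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Usage on the server:
# LAST_GOOD=$(cat /opt/dealix/.last_good_sha)
# is_full_sha "$LAST_GOOD" || { echo "bad .last_good_sha, aborting" >&2; exit 1; }
```

An empty or truncated file then stops the rollback early instead of leaving the checkout in an unknown state.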

Scenario 3 — Database Down / Unreachable

Signals: /health/deep reports postgres: error, 500s on leads endpoints, Sentry OperationalError.

Triage — is it us or the DB?

ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
systemctl status postgresql --no-pager
sudo -u postgres psql -c "SELECT 1;"

If systemd says failed:

journalctl -u postgresql -n 200 --no-pager
systemctl restart postgresql
sleep 3
sudo -u postgres psql -c "SELECT 1;"

If disk full (most common real cause):

df -h /
# Find what is using the space (WAL archives, old backups), smallest to largest:
du -sh /var/lib/postgresql/* /var/backups/postgres/* 2>/dev/null | sort -h
# DO NOT delete active WAL. Rotate old backups only.
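To turn the `df` check into a yes/no answer, the use percentage can be extracted from the portable `df -P` output. This is a sketch; the helper name and the 90% threshold in the usage lines are assumptions, not values taken from this runbook.

```shell
# Reads `df -P <path>` output on stdin and prints the use percentage
# (without the % sign) from the first data row.
disk_use_percent() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Usage:
# pct=$(df -P / | disk_use_percent)
# [ "$pct" -lt 90 ] || echo "root filesystem at ${pct}%, rotate old backups" >&2
```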

If connection pool exhausted (API is up, DB healthy, but API can't connect):

sudo -u postgres psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
systemctl restart dealix-api  # drops API's stale connections
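Before restarting, it may help to confirm that the pool is actually near the server limit. The sketch below compares client connections with `max_connections`; the helper name is hypothetical, and the `backend_type` filter assumes PostgreSQL 10 or newer.

```shell
# Prints connections in use as a whole-number percentage of the limit.
# Arguments: USED MAX
conn_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Usage:
# used=$(sudo -u postgres psql -Atc \
#   "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';")
# max=$(sudo -u postgres psql -Atc "SHOW max_connections;")
# echo "connections: ${used}/${max} ($(conn_usage_percent "$used" "$max")%)"
```

A low percentage suggests the exhaustion is in the API's own pool settings rather than on the Postgres side.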

If DB corrupt / cannot start → restore from backup (Scenario 5).
Post-incident:
- Write a 5-line postmortem in docs/incidents/YYYY-MM-DD.md.
- If webhooks arrived during outage, drain WEBHOOKS_DLQ:
```
curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
  'https://api.dealix.me/api/v1/admin/dlq/webhooks/drain?limit=100'
```

DoD: /health/deep postgres green; no lost webhooks (DLQ drained or re-queued).

Scenario 4 — LLM Provider Down (Anthropic / OpenAI / Google)

Signals: Sentry shows provider timeouts, /health/deep llm_providers yellow/red, ConnectorFacade circuit breaker open, workflow failures in PostHog.

Confirm it's the provider, not us:
- Check https://status.anthropic.com / https://status.openai.com / https://status.cloud.google.com.
- curl https://api.anthropic.com/v1/messages -H "x-api-key: $ANTHROPIC_API_KEY" ...
The circuit breaker should already be doing its job — requests to the failing provider return fast with CircuitOpenError, DLQ absorbs the failures.
Verify breaker state:
```
curl -s -H "X-API-Key: $ADMIN_KEY" https://api.dealix.me/api/v1/admin/dlq/stats | jq .
```
If OUTBOUND_DLQ or ENRICHMENT_DLQ is growing fast, breaker is protecting us — no action on our side.
Temporary failover (if one provider is the primary and down for >30 min): Edit /opt/dealix/.env → LLM_PROVIDER_PRIORITY="openai,google,anthropic" (reorder).
```
systemctl restart dealix-api
```
Rule: this is the ONLY allowed in-place .env edit. Commit the change back to .env.example (without secret) next business day.
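Reordering the priority list by hand under pressure is easy to get wrong. As an illustration, a helper like the one below moves the failing provider to the end of a comma-separated list; `demote_provider` is a hypothetical name, and the starting order in the usage line is an assumption.

```shell
# Prints LIST with provider DOWN moved to the last position.
# Arguments: LIST (comma-separated) DOWN
demote_provider() {
  local list="$1" down="$2" out="" p found=0
  local IFS=,
  for p in $list; do
    if [ "$p" = "$down" ]; then
      found=1
    else
      out="${out:+$out,}$p"
    fi
  done
  if [ "$found" -eq 1 ]; then
    out="${out:+$out,}$down"
  fi
  printf '%s\n' "$out"
}

# Usage:
# demote_provider "anthropic,openai,google" anthropic   # prints openai,google,anthropic
```

The output still has to be written into LLM_PROVIDER_PRIORITY in .env by hand; the helper only computes the new order.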
When provider recovers:
- Breaker auto-half-opens after 60s, then closes on first success.
- Drain the relevant DLQ to replay queued work:
```
curl -s -H "X-API-Key: $ADMIN_KEY" -X POST \
  'https://api.dealix.me/api/v1/admin/dlq/outbound/drain?limit=50'
```
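The recovery behaviour described above (half-open after 60s, closed on first success) can be pictured as a small state machine. The sketch below is illustrative only and is not the ConnectorFacade implementation: the event names are hypothetical, and because the runbook does not state the failure threshold that opens the breaker, the sketch opens it on a single failure.

```shell
# Prints the next breaker state.
# Arguments: STATE EVENT [SECONDS_SINCE_OPENED]
#   STATE: closed | open | half_open
#   EVENT: success | failure | tick
breaker_next() {
  local state="$1" event="$2" elapsed="${3:-0}"
  case "$state:$event" in
    closed:failure)    echo open ;;      # assumption: threshold of one failure
    open:*)
      # While open, calls fail fast; after 60s one trial call is allowed.
      if [ "$elapsed" -ge 60 ]; then echo half_open; else echo open; fi ;;
    half_open:success) echo closed ;;    # first success closes the breaker
    half_open:failure) echo open ;;      # trial call failed, open again
    *)                 echo "$state" ;;
  esac
}
```

This is why draining the DLQ is worth delaying until a success has been observed: replaying while the breaker is still open or half-open may just refill the queue.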

DoD: /health/deep llm_providers green; DLQ depth returning to zero; no workflow failures in last 10 min.

Scenario 5 — Backup Restoration (Data Loss / Corruption)

Trigger: DB corrupt, accidental mass delete, ransomware, or monthly drill (required).

Preflight

Identify the target backup:

ls -lht /var/backups/postgres/*.sql.gz | head -5

Never restore into production DB. Restore into a staging clone first, validate, then swap.
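Picking the backup and checking that the archive is intact can be done before any database is touched. A minimal sketch, assuming backups are the `*.sql.gz` files listed above; both helper names are hypothetical.

```shell
# Prints the most recently modified *.sql.gz file in DIR
# (default: /var/backups/postgres); fails if there is none.
latest_backup() {
  local dir="${1:-/var/backups/postgres}" newest
  newest=$(ls -t "$dir"/*.sql.gz 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# True if the file is a complete, uncorrupted gzip stream.
backup_is_readable() {
  gzip -t "$1" 2>/dev/null
}

# Usage:
# BACKUP=$(latest_backup) && backup_is_readable "$BACKUP" \
#   || { echo "no usable backup found" >&2; exit 1; }
```

`gzip -t` only proves the archive decompresses cleanly; it says nothing about whether the SQL inside is complete, which is what the row-count and timestamp checks in the drill are for.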

Drill / Restore procedure (on staging — monthly required)

Create isolated DB:

sudo -u postgres createdb dealix_restore_test

Restore:

BACKUP=/var/backups/postgres/dealix-YYYY-MM-DD.sql.gz
gunzip -c "$BACKUP" | sudo -u postgres psql dealix_restore_test

Validate row counts against prod (sanity):

sudo -u postgres psql dealix_restore_test -c "SELECT 'leads' t, count(*) FROM leads UNION ALL SELECT 'users', count(*) FROM users;"
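The drill DoD allows row counts within ±5% of prod. That comparison can be done with integer arithmetic, as in this sketch; `within_pct` is a hypothetical helper, and the counts are assumed to have been read from each database separately.

```shell
# True if RESTORED is within PCT percent of PROD (integer arithmetic).
# Arguments: RESTORED PROD [PCT=5]
within_pct() {
  local restored="$1" prod="$2" pct="${3:-5}" diff
  diff=$(( restored - prod ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  [ $(( diff * 100 )) -le $(( pct * prod )) ]
}

# Usage:
# prod=$(sudo -u postgres psql -At dealix -c "SELECT count(*) FROM leads;")
# test_db=$(sudo -u postgres psql -At dealix_restore_test -c "SELECT count(*) FROM leads;")
# within_pct "$test_db" "$prod" || echo "leads row count outside 5% of prod" >&2
```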

Validate latest lead timestamp is within acceptable RPO (≤24h):

sudo -u postgres psql dealix_restore_test -c "SELECT max(created_at) FROM leads;"
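The RPO check can be reduced to a comparison of two epoch timestamps. The sketch below assumes the leads table is non-empty and that the RPO is measured from the current time; `within_rpo` is a hypothetical helper.

```shell
# True if LATEST_EPOCH is no more than MAX_HOURS older than NOW_EPOCH.
# Arguments: LATEST_EPOCH NOW_EPOCH [MAX_HOURS=24]
within_rpo() {
  local latest="$1" now="$2" max_hours="${3:-24}"
  [ $(( now - latest )) -le $(( max_hours * 3600 )) ]
}

# Usage:
# latest=$(sudo -u postgres psql -At dealix_restore_test \
#   -c "SELECT extract(epoch FROM max(created_at))::bigint FROM leads;")
# within_rpo "$latest" "$(date +%s)" || echo "newest lead is older than the RPO" >&2
```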

Teardown:

sudo -u postgres dropdb dealix_restore_test

Real incident restore (production data loss)

Stop API to freeze writes:
```
systemctl stop dealix-api
```

Rename current DB (do NOT drop — evidence):

sudo -u postgres psql -c "ALTER DATABASE dealix RENAME TO dealix_corrupt_$(date +%Y%m%d);"
sudo -u postgres createdb dealix

Restore latest good backup:

BACKUP=/var/backups/postgres/dealix-YYYY-MM-DD.sql.gz  # the backup identified in Preflight
gunzip -c "$BACKUP" | sudo -u postgres psql dealix

Restart API + verify:

systemctl start dealix-api
curl -sf https://api.dealix.me/health/deep | jq .

Postmortem mandatory — how data was lost, why backup gap existed, what changed.

DoD (drill): restore completes in ≤15 min, row counts ±5% of prod, max timestamp within RPO.

DoD (incident): API back up, no data newer than last backup lost (document gap).

Scenario 6 — Security Incident (suspected breach)

Signals: unexplained admin API calls, fail2ban banning authorized IPs, unexpected outbound traffic, Sentry PermissionError spike, unknown webhook signatures failing.

Contain first, investigate second:

ssh -i ~/.ssh/dealix_deploy root@188.245.55.180
# Rotate ALL secrets immediately
cd /opt/dealix && bash scripts/rotate_secrets.sh
systemctl restart dealix-api

Lock down UFW to known IPs only (temporary):

# Save current rules first:
ufw status numbered > /tmp/ufw.before.$(date +%s)
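# Restrict SSH to your IP. Add the narrow rule first, then remove the
# broad one, so that you cannot lock yourself out mid-change:
ufw allow from <YOUR_PUBLIC_IP> to any port 22 proto tcp
ufw delete allow 22/tcp || true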
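Because a typo in the address locks the operator out of the server, the value substituted for <YOUR_PUBLIC_IP> can be validated before it reaches ufw. A sketch for IPv4 only; `is_ipv4` and the `MY_IP` variable are hypothetical.

```shell
# True if the argument is a dotted-quad IPv4 address with each octet 0-255.
is_ipv4() {
  local ip="$1" octet
  case "$ip" in
    "" | *[!0-9.]* | .* | *. | *..*) return 1 ;;
  esac
  local IFS=.
  set -- $ip
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}

# Usage:
# MY_IP=203.0.113.7   # example address, replace with your own
# is_ipv4 "$MY_IP" && ufw allow from "$MY_IP" to any port 22 proto tcp
```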

Check auth logs:

journalctl -u ssh -n 500 --no-pager | grep -iE 'accepted|failed'
fail2ban-client status sshd
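To see which addresses are behind the failures, the log lines can be grouped by source IP. This sketch assumes the usual OpenSSH "Failed password for ... from <ip> port ..." message format; servers that only allow key authentication log different messages and would need a different pattern. The function name is hypothetical.

```shell
# Reads sshd log lines on stdin and prints "<count> <ip>" for every
# source address with failed password attempts, busiest first.
failed_logins_by_ip() {
  awk '
    /Failed password/ {
      # Scan from the end so a user name of "from" cannot confuse the match.
      for (i = NF - 1; i >= 1; i--)
        if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -rn
}

# Usage:
# journalctl -u ssh -n 5000 --no-pager | failed_logins_by_ip | head
```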

Preserve evidence:

tar -czf /root/incident-$(date +%s).tgz /var/log/nginx /var/log/auth.log /var/log/dealix*

Notify Sami, document in docs/incidents/, file GitHub security advisory if user data touched.

DoD: all secrets rotated, UFW locked, attacker IPs banned, incident doc drafted within 1h.

Appendix A — Health Check Cheat Sheet

| Signal | Command | Expected |
| --- | --- | --- |
| Liveness | `curl -sf https://api.dealix.me/health` | 200 OK |
| Deep health | `curl -sf https://api.dealix.me/health/deep` | postgres+redis+llm_providers green |
| CI status | `gh run list --repo VoXc2/dealix --limit 3` | success |
| DLQ depth | `GET /api/v1/admin/dlq/stats` | 0 across all queues |
| Pending approvals | `GET /api/v1/admin/approvals/pending` | <10 |
| Service status | `systemctl status dealix-api` | active (running) |
| fail2ban | `fail2ban-client status sshd` | jail active |
| Nginx | `systemctl status nginx` | active (running) |

Appendix B — Do-Not-Touch List

- main branch: protected, no direct push
- /opt/dealix/.env.pre-v3.0.0.bak: emergency reference, never delete
- server-backup-20260423-084442 branch: historical evidence, do not force-delete
- Postgres dealix_corrupt_* dbs: keep for 7 days post-incident before dropping

Appendix C — Runbook Review

Review every 4 weeks. If any command in this runbook failed during a real incident, update it immediately after that incident closes.
