system-prompts-and-models-o.../dealix/docs/AGENT_OBSERVABILITY_EVALS.md
Dealix Builder bcf545c22e feat(self-improving): Hermes-inspired Agent Platform — 6 layers + 30 endpoints + 76 tests + Private Beta launch
Security Curator (4 modules): the first firewall
- secret_redactor: 11 patterns (GitHub PAT, OpenAI/Anthropic/Supabase/WhatsApp/Moyasar/Sentry/Google/AWS/private keys); never returns raw secret
- patch_firewall: blocks .env / credentials.json / RSA keys; scans added lines for secret patterns
- trace_redactor: masks phones (+966...) and emails for PII safety
- tool_output_sanitizer: cleans tool outputs before they hit ledger/Proof Pack/UI/observability

Growth Curator (5 modules): self-improvement
- message_curator: grades Arabic messages (0..100), detects 8 risky phrases, suggests Saudi-tone skeleton
- playbook_curator: scores playbooks by outcome (accept/reply/meeting/deal); winner/promising/needs_work/archive
- mission_curator: scores completed missions; ship_it_widely/iterate/rework_or_retire
- skill_inventory: deterministic 23-skill catalog across 5 layers
- curator_report: weekly Arabic summary "ماذا تعلمنا هذا الأسبوع" ("What we learned this week")

Meeting Intelligence (5 modules)
- transcript_parser: accepts Google Meet entries OR plain "Speaker: text" format
- meeting_brief: 6-section pre-meeting brief in Arabic (objective/questions/objections/offer/next-step)
- objection_extractor: 8 categories (price/timing/authority/trust/integration/competitor/results/complexity)
- followup_builder: email + WhatsApp drafts; live_send_allowed=False always
- deal_risk: 0..100 score from objections + missing next-step + decision-maker absence + days-since-touch

Model Router (5 modules)
- provider_registry: 7 providers (Claude Sonnet/Haiku, GPT-4-class, GPT-4o-mini, Gemini Pro, Azure OAI KSA-region, Local Qwen Arabic-tuned)
- task_router: 10 task types × routing decisions with reasons_ar
- cost_policy: bulk → low; output > 1500 tokens → high
- fallback_policy: high-sensitivity workloads prefer KSA-region/self-hosted FIRST
- usage_dashboard: deterministic demo of all task routes

Connector Catalog (3 modules)
- 14 connectors (WhatsApp Cloud, Gmail, Calendar, Google Meet, Moyasar, LinkedIn Lead Forms, Google Business Profile, X API, Instagram, Sheets, CRM, Website Forms, Composio, MCP Gateway)
- Each has launch_phase (1-4), risk_level, allowed_actions, blocked_actions, Arabic risk dossier
- WhatsApp blocks cold_send_without_consent; Moyasar blocks store_card_number; MCP requires allowlist

Agent Observability (5 modules): agent monitoring + evals
- trace_events: SHA256-hashes user/company IDs; sanitizes payload/output before logging
- safety_eval: 7 rules (guarantee, scarcity_fake, medical_claim, financial, regulatory, personal_data, urgency); 0..100 → safe/needs_review/blocked
- saudi_tone_eval: positive markers (هلا "hi", لاحظت "I noticed", يناسبك "does it suit you") vs negative (تحية طيبة وبعد, a stiff formal-letter opener; synergy; leverage); arabic_ratio bonus
- eval_pack: 5 curated cases with expected verdicts
- cost_tracker: per workflow/provider/task_type aggregation

Routers (6 new) — 30 endpoints
- /api/v1/security-curator/{demo, redact, inspect-diff, sanitize-output}
- /api/v1/growth-curator/{skills/inventory, messages/grade, messages/improve, messages/duplicates, missions/next, report/weekly, report/demo}
- /api/v1/meeting-intelligence/{brief, brief/demo, transcript/summarize, followup/draft, deal-risk}
- /api/v1/model-router/{providers, tasks, route, cost-class, usage/demo}
- /api/v1/connector-catalog/{catalog, summary, status, risks, {key}}
- /api/v1/agent-observability/{trace/build, safety/eval, tone/eval, evals/run}

Tests (6 new files, 76 tests)
- test_security_curator: 16 tests (PAT detect, key redact, env diff block, payload scan, trace mask)
- test_growth_curator: 16 tests (Arabic grade, risky phrases, dup detect, playbook scoring, mission recommend, weekly report)
- test_meeting_intelligence: 13 tests (transcript parse, brief sections, objection extract, followup drafts, deal risk)
- test_dealix_model_router: 11 tests (every task → ≥1 provider, KSA-region for high sensitivity, cost class, primary override)
- test_agent_observability: 12 tests (trace hashing, safety verdicts, tone scoring, eval pack)
- test_connector_catalog: 11 tests (≥12 connectors, every has risk/blocked actions, WA cold-send blocked, Moyasar card-storage blocked)

Docs (8 new + 1 updated)
- AGENT_SECURITY_CURATOR.md (Arabic)
- GROWTH_CURATOR_STRATEGY.md (Arabic)
- MEETING_INTELLIGENCE.md (Arabic)
- MODEL_PROVIDER_ROUTER.md (Arabic)
- CONNECTOR_CATALOG.md (Arabic)
- AGENT_OBSERVABILITY_EVALS.md (Arabic)
- PRIVATE_BETA_LAUNCH_TODAY.md (Arabic) — go-checklist + offer + risks
- DEMO_SCRIPT_12_MINUTES.md (Arabic) — minute-by-minute demo flow
- FIRST_20_OUTREACH_MESSAGES.md (Arabic) — 7 personas + 3 follow-ups, all under safety/tone evals
- DEALIX_100_PERCENT_LAUNCH_PLAN.md — added §34 Self-Improving Agent Platform + §35 Private Beta Launch

Landing
- landing/private-beta.html — Arabic RTL, dark theme, pricing, 11 demo endpoints, safety banner

Test results
- 76/76 new tests pass
- Full suite: 663 passed, 2 skipped (missing API keys, unrelated)
- 0 existing tests broken

Safety
- All 6 layers honor approval-first, draft-only, no-live-send
- Hash user/company IDs before any trace
- No secrets in logs/embeddings/traces (3-layer defense: redactor + sanitizer + firewall)
- Saudi tone eval rejects "تحية طيبة وبعد" (a stiff formal-letter opener) + "synergy" auto-corporate language
- Safety eval blocks "ضمان 100%" ("100% guarantee") + medical claims + fake urgency
- Connector Catalog: WhatsApp blocks cold-send, Moyasar blocks card storage, MCP requires allowlist

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:30:18 +03:00


# Agent Observability + Evals
> Sanitized trace events + safety eval + Saudi tone eval + cost tracker. Everything is deterministic, and no PII reaches the traces.
## 1. Trace Events
`build_trace_event(...)` builds a trace event ready for Langfuse/Sentry:
- `user_id` and `company_id` are hashed (sha256[:16]) before storage.
- `payload` and `output` pass through `sanitize_trace_event`.
- Safe fields (event_type, agent_name, status, latency_ms, cost_estimate, approval_status, tool, policy_result, risk_level, workflow_name, trace_id) are kept as-is.
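
The three rules above can be sketched roughly as follows. This is only an illustration, not the project's `build_trace_event`: the name `sketch_trace_event`, the caller-supplied `sanitize` callable, and the reading of `sha256[:16]` as "first 16 hex characters of the SHA-256 digest" are all assumptions.

```python
import hashlib

# Fields the document lists as safe to copy through unchanged.
SAFE_FIELDS = (
    "event_type", "agent_name", "status", "latency_ms", "cost_estimate",
    "approval_status", "tool", "policy_result", "risk_level",
    "workflow_name", "trace_id",
)


def hash_id(raw_id: str) -> str:
    """SHA-256 the ID and keep the first 16 hex characters (assumed meaning of sha256[:16])."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16]


def sketch_trace_event(user_id, company_id, payload, output, sanitize, **fields):
    """Hypothetical stand-in for build_trace_event; `sanitize` is supplied by the caller."""
    # Copy only the allowlisted fields; anything else is silently dropped.
    event = {name: fields[name] for name in SAFE_FIELDS if name in fields}
    event["user_id"] = hash_id(user_id)
    event["company_id"] = hash_id(company_id)
    event["payload"] = sanitize(payload)
    event["output"] = sanitize(output)
    return event
```

Using an allowlist for the safe fields means a new, unreviewed field never reaches the trace by accident; it has to be added to `SAFE_FIELDS` deliberately.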
## 2. Safety Eval
7 rules:

| Category | Reason | Severity (penalty) |
|----------|--------|--------------------|
| guarantee | Promise of guaranteed results | 50 |
| scarcity_fake | Fake scarcity tactic | 25 |
| medical_claim | Medical claim | 50 |
| financial_claim | Exaggerated returns | 35 |
| regulatory | Licensing claim | 35 |
| personal_data | Hint at selling data | 50 |
| urgency_manipulation | Artificial time pressure | 15 |

`score = max(0, 100 - sum_penalties)`. Tiers: ≥70 safe, ≥40 needs_review, <40 blocked. Taken on its own, the formula puts a single 50-point rule at a score of 50 (needs_review); a message is blocked only once its penalties total more than 60.
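
A minimal sketch of the scoring formula and tiers, using the penalty values from the table. The function names are hypothetical, and counting each rule at most once per message is an assumption; the document does not say how repeated hits are handled.

```python
# Penalty points per rule, copied from the table in this section.
PENALTIES = {
    "guarantee": 50,
    "scarcity_fake": 25,
    "medical_claim": 50,
    "financial_claim": 35,
    "regulatory": 35,
    "personal_data": 50,
    "urgency_manipulation": 15,
}


def safety_score(triggered_rules):
    """score = max(0, 100 - sum_penalties); each rule counted once (assumption)."""
    total = sum(PENALTIES[rule] for rule in set(triggered_rules))
    return max(0, 100 - total)


def safety_tier(score):
    """Map a 0..100 score to the three tiers."""
    if score >= 70:
        return "safe"
    if score >= 40:
        return "needs_review"
    return "blocked"
```

For example, `safety_tier(safety_score(["medical_claim"]))` gives `"needs_review"`, while adding `"guarantee"` on top drives the score to 0 and the tier to `"blocked"`.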
## 3. Saudi Tone Eval
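- Positives: "هلا/أهلاً/مساء الخير" (hi/hello/good evening), "لاحظت/شفت" (I noticed/I saw), "يناسبك/تحب" (does it suit you/would you like), "Pilot/بايلوت": +12 each.
- Negatives: "السيد المحترم" (Dear esteemed Sir), "تحية طيبة وبعد" (a stiff formal-letter opener), "ندعوكم لاكتشاف" (we invite you to discover), "leverage/synergy/best-in-class": -20 each.
- Arabic ratio 60%: +20; 30%: +10.
- Length > 80 words: -10.

Tiers: ≥75 natural, ≥50 decent, <50 off.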
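
The arithmetic above can be sketched as below. Several things here are assumptions rather than facts from the document: the starting score is not stated, so `base` must be passed explicitly; the 60% and 30% figures are read as "at least" thresholds; and marker matching is left to the caller, who passes hit counts. The function names are hypothetical.

```python
def tone_score(positive_hits, negative_hits, arabic_ratio, word_count, *, base):
    """Apply the documented bonuses and penalties on top of a caller-chosen base."""
    score = base + 12 * positive_hits - 20 * negative_hits
    # Assumed reading: ratio thresholds are inclusive lower bounds.
    if arabic_ratio >= 0.60:
        score += 20
    elif arabic_ratio >= 0.30:
        score += 10
    if word_count > 80:
        score -= 10
    return score


def tone_tier(score):
    """Map a tone score to the three tiers."""
    if score >= 75:
        return "natural"
    if score >= 50:
        return "decent"
    return "off"
```

With an illustrative `base=40`, two positive markers in a mostly Arabic 40-word message give `40 + 24 + 20 = 84`, which falls in the natural tier.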
## 4. Eval Pack
5 curated cases (`run_eval_pack()`), each with its expected safety and tone verdicts:
- natural_warm_intro: safe + natural
- fake_urgency: blocked + off
- too_corporate: safe + off
- medical_claim: blocked + off (or needs_review)
- decent_but_short: safe + decent

Result: `{total, passed, failed, pass_rate, results}`.
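
One way such a runner could aggregate results into the documented shape is sketched here. It is not the project's `run_eval_pack()`: the case layout, the `evaluate` callable, and reporting `pass_rate` as a 0..1 fraction are assumptions. Expected verdicts are sets so that a case can accept more than one outcome, as the medical_claim case does.

```python
def run_cases(cases, evaluate):
    """Run each case through `evaluate` and summarize pass/fail counts.

    cases: list of {"name", "text", "expected": {"safety": set, "tone": set}}
    evaluate: callable(text) -> {"safety": str, "tone": str}
    """
    results = []
    for case in cases:
        actual = evaluate(case["text"])
        ok = (
            actual["safety"] in case["expected"]["safety"]
            and actual["tone"] in case["expected"]["tone"]
        )
        results.append({"name": case["name"], "actual": actual, "passed": ok})

    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        # Fraction in 0..1 (assumption); 0.0 for an empty pack.
        "pass_rate": passed / total if total else 0.0,
        "results": results,
    }
```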
## 5. Cost Tracker
`CostTracker.record(workflow_name, provider_key, task_type, cost_estimate)` records one run; `summary()` then returns `{runs, total, by_workflow, by_provider, by_task_type}`.
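
A minimal in-memory sketch of a tracker with this interface follows. The class name `CostTrackerSketch` is hypothetical and the internals are assumed; only the `record` arguments and the `summary()` keys come from the document.

```python
from collections import defaultdict


class CostTrackerSketch:
    """Hypothetical in-memory aggregator mirroring the documented interface."""

    def __init__(self):
        self.runs = 0
        self.total = 0.0
        self.by_workflow = defaultdict(float)
        self.by_provider = defaultdict(float)
        self.by_task_type = defaultdict(float)

    def record(self, workflow_name, provider_key, task_type, cost_estimate):
        # Add one run's estimated cost to every aggregation bucket.
        self.runs += 1
        self.total += cost_estimate
        self.by_workflow[workflow_name] += cost_estimate
        self.by_provider[provider_key] += cost_estimate
        self.by_task_type[task_type] += cost_estimate

    def summary(self):
        # Return plain dicts so the result serializes cleanly to JSON.
        return {
            "runs": self.runs,
            "total": self.total,
            "by_workflow": dict(self.by_workflow),
            "by_provider": dict(self.by_provider),
            "by_task_type": dict(self.by_task_type),
        }
```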
## 6. Endpoints
```
POST /api/v1/agent-observability/trace/build
POST /api/v1/agent-observability/safety/eval
POST /api/v1/agent-observability/tone/eval
GET /api/v1/agent-observability/evals/run
```
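
As a rough illustration of calling these routes, the sketch below only builds the HTTP requests with the standard library; it does not send them. The base URL is an assumption, and the JSON body fields are not documented here, so any body shown is hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local development server
PREFIX = "/api/v1/agent-observability"


def build_request(path, body=None):
    """Build a GET request when body is None, otherwise a JSON POST."""
    url = f"{BASE_URL}{PREFIX}/{path}"
    if body is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

`build_request("evals/run")` targets the GET route; passing the result to `urllib.request.urlopen` would send it against a running server.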
## 7. Limits
- No tokens in traces.
- No secrets (everything passes through `sanitize_trace_event`).
- No raw PII (phones/emails are masked).
- No full customer lists.
- No payment details.
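
The "no raw PII" limit could be enforced with a masking pass along these lines. This is a sketch under assumptions, not the project's redactor: the regular expressions, the placeholder tokens, and the name `mask_pii` are all hypothetical.

```python
import re

# Deliberately broad patterns: over-masking is preferred to leaking PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def mask_pii(text):
    """Replace emails and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)
```

The phone pattern also catches other long digit runs such as order numbers. That trade-off suits trace logging, where a false positive costs little and a missed phone number is a privacy leak.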