system-prompts-and-models-o.../dealix/docs/ops/DATA_LAKE_PLAYBOOK.md
2026-05-01 14:03:52 +03:00

# Dealix Data Lake + Lead Graph Playbook
How to use Dealix as a data ingestion + enrichment + outreach-prep system, not a blast tool.
## Mental model
```
Data Lake (raw_lead_imports + raw_lead_rows)
↓ normalize
Lead Graph (accounts + contacts + signals)
↓ enrich (providers)
Scored Accounts (lead_scores)
↓ suppression check + channel policy
Outreach Queue (always approval_required for first 30 days)
```
Raw rows are kept forever. Outreach happens only after compliance gates pass.
## 4 data types Dealix accepts
| Type | Example source | `source_type` |
|---|---|---|
| Owned | Customer CRM, your own form submissions | `owned` |
| Public | Google Search, Google Maps, business directories | `public` / `google_maps` / `google_search` |
| Paid | Vetted vendor lists with documented allowed-use | `paid` |
| Partner | Co-marketing list with explicit consent | `partner` |
**Never accept:** WhatsApp number lists with no source, scraped LinkedIn profiles,
personal emails without opt-in.
## Required metadata per import
```json
{
"source_name": "vendor_x_saudi_real_estate_2026",
"source_type": "paid",
"allowed_use": "business_contact_research_only",
"consent_status": "legitimate_interest",
"risk_level": "high",
"rows": [...]
}
```
If the vendor can't tell you `source`, `allowed_use`, and `last_updated`, don't buy the list.
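A pre-flight check on the metadata can catch incomplete imports before anything is uploaded. The sketch below is illustrative only: `validate_import_metadata`, `REQUIRED_FIELDS`, and `ALLOWED_SOURCE_TYPES` are hypothetical names, the required fields mirror the JSON example above, and the allowed `source_type` values are taken from the table of accepted data types. Dealix's own validation may differ.

```python
# Hypothetical validator, not Dealix's actual code.
# Field names mirror the JSON example; source types mirror the data-type table.
ALLOWED_SOURCE_TYPES = {"owned", "public", "google_maps", "google_search", "paid", "partner"}
REQUIRED_FIELDS = ("source_name", "source_type", "allowed_use", "consent_status", "risk_level")


def validate_import_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks complete."""
    # Treat absent and empty values the same way: both count as missing.
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not meta.get(field)]
    source_type = meta.get("source_type")
    if source_type and source_type not in ALLOWED_SOURCE_TYPES:
        problems.append(f"unknown source_type: {source_type}")
    return problems
```

An empty result means the import carries every field the playbook asks for; anything else is a reason to go back to the vendor before uploading.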
## Step-by-step ingestion
### 1. Audit the file BEFORE upload
```bash
python scripts/audit_lead_file.py vendor_file.csv
```
The script reports acceptance rate, phone/email validity, and dedup risk. If the
acceptance rate is below 50%, reject the file or ask the vendor to clean it.
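To make the 50% rule concrete, here is a minimal sketch of an acceptance-rate gate. It is not the logic of `scripts/audit_lead_file.py`: the `email` and `phone` column names and the validity rule in `row_is_acceptable` are assumptions chosen for illustration.

```python
import csv
import io

ACCEPTANCE_THRESHOLD = 0.5  # from the playbook: reject files below 50%


def row_is_acceptable(row: dict) -> bool:
    # Assumed rule: a row needs a plausible email or a phone with 9+ digits.
    email = (row.get("email") or "").strip()
    phone = "".join(ch for ch in (row.get("phone") or "") if ch.isdigit())
    has_email = "@" in email and "." in email.split("@")[-1]
    return has_email or len(phone) >= 9


def acceptance_rate(csv_text: str) -> float:
    """Fraction of rows that pass row_is_acceptable; 0.0 for an empty file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    return sum(row_is_acceptable(r) for r in rows) / len(rows)


def should_reject(csv_text: str) -> bool:
    return acceptance_rate(csv_text) < ACCEPTANCE_THRESHOLD
```

A file that sits exactly at 50% passes this gate, because the playbook only rejects files strictly below the threshold.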
### 2. Upload
```bash
python scripts/import_leads.py vendor_file.csv \
--source-name "vendor_x_2026_q2" \
--source-type paid \
--allowed-use "business_contact_research_only" \
--risk-level high \
--auto-pipeline
```
`--auto-pipeline` runs normalize → dedupe → enrich automatically.
### 3. Or call the API directly
```
POST /api/v1/data/import
POST /api/v1/data/import/{id}/normalize
POST /api/v1/data/import/{id}/dedupe
POST /api/v1/data/import/{id}/enrich body: {enrichment_level: "standard", max_accounts: 25}
GET /api/v1/data/import/{id}/report
```
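The post-import calls above have to run in order. The sketch below builds them with the standard library without sending anything, so the sequence can be inspected or logged first. `BASE_URL` and `pipeline_requests` are hypothetical, and authentication headers are omitted because the source does not describe them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's host


def pipeline_requests(import_id: str) -> list[urllib.request.Request]:
    """Build (but do not send) the post-import calls in pipeline order."""
    prefix = f"{BASE_URL}/api/v1/data/import/{import_id}"
    enrich_body = json.dumps(
        {"enrichment_level": "standard", "max_accounts": 25}
    ).encode("utf-8")
    return [
        urllib.request.Request(f"{prefix}/normalize", method="POST"),
        urllib.request.Request(f"{prefix}/dedupe", method="POST"),
        urllib.request.Request(
            f"{prefix}/enrich",
            data=enrich_body,
            headers={"Content-Type": "application/json"},
            method="POST",
        ),
        urllib.request.Request(f"{prefix}/report", method="GET"),
    ]
```

To execute the sequence, pass each request to `urllib.request.urlopen` in turn and stop at the first failure, since each step depends on the previous one.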
### 4. Discover local Saudi sectors via Google Maps
```bash
python scripts/discover_local_to_csv.py dental_clinic riyadh --max 20
# wrote 20 rows → dental_clinic_riyadh.csv
python scripts/import_leads.py dental_clinic_riyadh.csv \
--source-name "maps_dental_clinic_riyadh" \
--source-type google_maps \
--auto-pipeline
```
### 5. Suppress opt-outs
```
POST /api/v1/data/suppression
body: {"email": "...", "reason": "opt_out_request_2026_04"}
```
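A suppression list only works if lookups are insensitive to case and stray whitespace. The following in-memory stand-in shows the idea; `SuppressionList` and `normalize_email` are hypothetical names, and the real store behind `/api/v1/data/suppression` may normalize differently.

```python
def normalize_email(email: str) -> str:
    # Assumed normalization: trim whitespace and lowercase the whole address.
    return email.strip().lower()


class SuppressionList:
    """In-memory stand-in for the suppression store (illustration only)."""

    def __init__(self) -> None:
        self._reasons: dict[str, str] = {}

    def add(self, email: str, reason: str) -> None:
        self._reasons[normalize_email(email)] = reason

    def is_suppressed(self, email: str) -> bool:
        return normalize_email(email) in self._reasons
```

With this normalization, an opt-out recorded as ` Owner@Example.com ` still blocks a later row that lists `owner@example.com`.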
### 6. Prepare outreach
```
POST /api/v1/outreach/prepare-from-data
body: {"priority": ["P0","P1"], "max_accounts": 25, "persist": true}
```
Returns `ready` / `needs_review` / `blocked` lists. Persisted rows go to
`outreach_queue` with `approval_required=True`; Sami still approves manually.
### 7. Export a CSV for human send
```bash
python scripts/export_outreach_ready.py --priority P0,P1 --max 50 \
--out today_outreach_50.csv
```
## Compliance guardrails (already enforced)
- Suppression hits → `blocked`, never queued.
- `opt_out=true` on contact → `blocked`.
- `risk_level=high` → `needs_review`, requires explicit approval.
- Missing `allowed_use` → `needs_review`.
- All queue rows have `approval_required=True` for the first 30 days.
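The guardrails above can be read as a small decision function. This is a sketch, not the enforced implementation: `classify_for_outreach` is a hypothetical name, the contact is assumed to be a plain dict, and the precedence (blocking rules win over review rules) is an assumption the playbook implies but does not state.

```python
def classify_for_outreach(contact: dict, suppressed: bool) -> str:
    """Return 'blocked', 'needs_review', or 'ready' for one contact."""
    # Blocking rules are checked first so a suppressed contact is never queued.
    if suppressed or contact.get("opt_out") is True:
        return "blocked"
    # Review rules: high risk, or no documented allowed use.
    if contact.get("risk_level") == "high" or not contact.get("allowed_use"):
        return "needs_review"
    return "ready"
```

Even a `ready` result does not mean auto-send: queue rows still carry `approval_required=True` during the first 30 days.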
## Data quality scoring
Each account gets a `data_quality_score` 0..100 based on field completeness plus
signal coverage, minus negatives (no source, opt-out, high risk). See
`auto_client_acquisition/pipelines/scoring.py::compute_data_quality`.
`/api/v1/data/accounts?priority=P0` lets you pull the highest-DQ + highest-score
accounts ready for action.
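To show the shape of such a score, here is a toy version. The weights (60, 40, and 15 per negative) are invented for illustration and are not taken from `compute_data_quality`; only the 0..100 range and the three ingredients come from the description above.

```python
def data_quality_sketch(completeness: float, signal_coverage: float, negatives: int) -> int:
    """Toy score: completeness and signal_coverage are fractions in [0, 1];
    negatives counts flags such as no source, opt-out, or high risk."""
    # Hypothetical weights; the real function may combine these differently.
    raw = 60 * completeness + 40 * signal_coverage - 15 * negatives
    # Clamp so the result always stays inside the documented 0..100 range.
    return max(0, min(100, round(raw)))
```

Clamping matters here: several negatives on a sparse record would otherwise push the score below zero.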
## Google Maps cache policy
Per Google Maps Platform terms, we store `place_id` (allowed) and refresh details
on demand rather than caching everything forever. See
`auto_client_acquisition/connectors/google_maps.py::discover_local`.
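The "store `place_id`, refresh details on demand" policy can be sketched as a small cache with a time limit on details. This is not the code in `discover_local`: `PlaceCache`, `fetch_details`, and the 30-day `DETAILS_TTL_SECONDS` are assumptions, and the refresh window should be checked against the current Google Maps Platform terms.

```python
import time
from typing import Callable

DETAILS_TTL_SECONDS = 30 * 24 * 3600  # assumed refresh window, not from the source


class PlaceCache:
    """Keeps place details keyed by place_id and refetches them once stale."""

    def __init__(
        self,
        fetch_details: Callable[[str], dict],
        now: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch_details  # hypothetical callable that hits the Places API
        self._now = now
        self._details: dict[str, tuple[float, dict]] = {}

    def get_details(self, place_id: str) -> dict:
        cached = self._details.get(place_id)
        if cached and self._now() - cached[0] < DETAILS_TTL_SECONDS:
            return cached[1]
        details = self._fetch(place_id)  # refresh on demand
        self._details[place_id] = (self._now(), details)
        return details
```

Injecting the clock and the fetch function keeps the cache testable without network access or real waiting.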
## Don't do
- Auto-send from raw rows. Always normalize → dedupe → enrich → score → queue.
- Cold-blast WhatsApp. WhatsApp is inbound only in Dealix.
- Scrape LinkedIn. Use it for manual research only.
- Use a list with no `source` or `allowed_use`. Reject the data.
- Send unapproved messages in the first 30 days of any new customer.