# Raw extract to validated reporting model
This project documents how I would move survey and facility extracts from raw files into a validated reporting layer that can feed dashboards, briefs, and partner reporting without manual spreadsheet firefighting.
The practical goal is simple: stop every analysis cycle from starting with a new round of ad hoc cleaning. The pipeline standardizes raw extracts, applies validation rules, and publishes a reporting-ready table with audit flags preserved.
The pipeline runs in four stages:

- **Ingest:** load daily survey and facility files from multiple partner sites, with schema checks at entry.
- **Validate:** run range checks, missingness rules, duplicate detection, and referential integrity against facility metadata.
- **Standardize:** harmonize names, dates, units, and indicator logic into a tidy reporting model.
- **Export:** publish clean tables and QA summaries for dashboards, analyst handoff, and stakeholder reporting.
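The ingest stage's "schema checks at entry" could look like the sketch below. The column names in `SURVEY_SCHEMA`, and the `ingest`/`check_schema` helpers, are illustrative assumptions, not the project's actual field list:

```python
import csv
import io

# Hypothetical schema for a daily survey extract: column name -> required flag.
SURVEY_SCHEMA = {"record_id": True, "facility_id": True, "visit_date": True, "income": False}

def check_schema(header, schema):
    """Return a list of schema problems for a file header (empty list = pass)."""
    problems = []
    missing = [col for col, required in schema.items() if required and col not in header]
    unexpected = [col for col in header if col not in schema]
    if missing:
        problems.append(f"missing required columns: {missing}")
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    return problems

def ingest(raw_text, schema):
    """Parse a CSV extract, rejecting it at entry if the header fails the schema check."""
    reader = csv.DictReader(io.StringIO(raw_text))
    problems = check_schema(reader.fieldnames or [], schema)
    if problems:
        raise ValueError("; ".join(problems))
    return list(reader)
```

Rejecting a file at the door, rather than letting a malformed extract flow into staging, is what keeps the later validation rules simple.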
| Rule | Purpose |
|---|---|
| Unique household or patient ID | Prevent duplicate counting |
| Allowed facility IDs only | Catch site mapping errors early |
| Date ordering check | Ensure visit dates do not precede enrollment |
| Income and expenditure bounds | Flag impossible or clearly mis-entered values |
| Required fields by encounter type | Make downstream indicators safe to compute |
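The rules in the table above can be sketched as a per-row check that returns QA flags instead of silently dropping data. The field names and the income bound here are placeholders for illustration, not the project's real thresholds:

```python
from datetime import date

def validate_row(row, seen_ids, facility_ids):
    """Apply the rule table to one survey row; return a list of QA flags (empty = clean)."""
    flags = []
    if row["record_id"] in seen_ids:
        flags.append("duplicate_id")             # unique household or patient ID
    seen_ids.add(row["record_id"])
    if row["facility_id"] not in facility_ids:
        flags.append("unknown_facility")         # allowed facility IDs only
    if row["visit_date"] < row["enrollment_date"]:
        flags.append("visit_before_enrollment")  # date ordering check
    if not (0 <= row["income"] <= 1_000_000):
        flags.append("income_out_of_bounds")     # illustrative bound, not a real cutoff
    return flags
```

Returning flags rather than raising keeps every record auditable downstream, which matches the goal of publishing a reporting table "with audit flags preserved."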
The pipeline produces three layers:
- `raw`: immutable copies of source extracts
- `staging`: standardized field names and basic harmonization
- `reporting`: one clean fact table joined to a facility reference table, plus a QA summary

This is the kind of work that makes dashboards trustworthy. It is also the bridge between research data management and modern data engineering: naming conventions, validation layers, reproducible jobs, and clear handoff contracts.
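The staging-to-reporting step could be a single join that enriches each row from the facility reference while carrying QA flags forward. The `build_reporting` function and its field names are a minimal sketch under assumed schemas, not the project's actual code:

```python
def build_reporting(staging_rows, facility_ref):
    """Join standardized staging rows to the facility reference and tally QA flags."""
    fact, qa_summary = [], {"rows": 0, "flagged": 0}
    for row in staging_rows:
        ref = facility_ref.get(row["facility_id"])
        flagged = ref is None or bool(row.get("qa_flags"))
        fact.append({**row,
                     "facility_name": ref["name"] if ref else None,
                     "district": ref["district"] if ref else None})
        qa_summary["rows"] += 1
        qa_summary["flagged"] += int(flagged)
    return fact, qa_summary
```

Publishing the QA summary next to the fact table gives analysts and stakeholders the same view of data quality, which is the "clear handoff contract" the layers are meant to provide.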