Why Data Quality Is Your Real Job
If your data is trash, your:
- Models are misleading
- Dashboards tell the wrong story
- Policy recommendations can hurt people
Data quality is not “extra” work—it’s core to being a serious analyst.
7 Checks to Automate on Every Dataset
- Missingness patterns by variable and group
- Uniqueness of IDs
- Range checks for numeric variables
- Category consistency for factors/coded responses
- Cross-field logic (e.g., age vs date of birth, pregnancy vs sex)
- Duplicates and near-duplicates
- Date/time sanity (ordering, impossible dates)
Automate these in:
- R (with tidyverse/janitor)
- Python (with pandas)
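As a rough illustration, the seven checks above could be sketched in pandas as follows. This is a minimal sketch, not a definitive implementation: the function name `run_checks`, its parameters, the toy columns (`id`, `sex`, `age`, `pregnant`, `enrolled`, `exited`), and the 0 to 120 age range are all assumptions made up for this example. Near-duplicates are approximated very crudely as rows that match on everything except the ID.

```python
import pandas as pd


def run_checks(df, id_col, group_col, ranges, categories, rules, date_pair):
    """Hypothetical helper: run seven basic quality checks, return findings as a dict."""
    out = {}

    # 1. Missingness: share missing per variable, then per variable within each group.
    out["missing_share"] = df.isna().mean().to_dict()
    out["missing_by_group"] = (
        df.drop(columns=group_col).isna().groupby(df[group_col]).mean()
    )

    # 2. Uniqueness of IDs: count rows whose ID has already appeared.
    out["duplicate_ids"] = int(df[id_col].duplicated().sum())

    # 3. Range checks: non-missing values outside [lo, hi].
    out["out_of_range"] = {
        col: int((df[col].notna() & ~df[col].between(lo, hi)).sum())
        for col, (lo, hi) in ranges.items()
    }

    # 4. Category consistency: observed codes that are not in the allowed set.
    out["unexpected_categories"] = {
        col: sorted(set(df[col].dropna()) - set(allowed))
        for col, allowed in categories.items()
    }

    # 5. Cross-field logic: each rule returns a boolean Series flagging violations.
    out["rule_violations"] = {
        name: int(rule(df).sum()) for name, rule in rules.items()
    }

    # 6. Exact duplicates, plus rows identical apart from the ID (crude near-duplicates).
    out["duplicate_rows"] = int(df.duplicated().sum())
    out["same_except_id"] = int(df.drop(columns=id_col).duplicated().sum())

    # 7. Date sanity: values that fail to parse, and end dates before start dates.
    start_col, end_col = date_pair
    start = pd.to_datetime(df[start_col], errors="coerce")
    end = pd.to_datetime(df[end_col], errors="coerce")
    out["unparseable_dates"] = int(
        (start.isna() & df[start_col].notna()).sum()
        + (end.isna() & df[end_col].notna()).sum()
    )
    out["end_before_start"] = int((end < start).sum())
    return out


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 2, 4, 5],
    "sex": ["F", "M", "M", "f", "F"],
    "age": [34, 29, 29, 210, None],
    "pregnant": [True, True, True, False, False],
    "enrolled": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-01", "2021-04-01"],
    "exited": ["2021-06-01", "2021-01-15", "2021-01-15", "2021-02-30", "2021-09-01"],
})

results = run_checks(
    df,
    id_col="id",
    group_col="sex",
    ranges={"age": (0, 120)},            # assumed plausible range
    categories={"sex": {"F", "M"}},      # assumed codebook
    rules={"pregnant_male": lambda d: d["pregnant"] & (d["sex"] == "M")},
    date_pair=("enrolled", "exited"),
)
```

Returning a plain dict keeps the checks separate from how they are displayed, so the same findings can feed a printed report, a dashboard, or a test that fails when a count is above zero.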
How to Turn This Into a Portfolio Project
- Take any public health or development dataset
- Build:
  - A script that runs all 7 checks
  - A short report or dashboard summarizing issues
- Include:
  - A “recommended data cleaning plan”
  - Examples of how the issues you found explain odd results in the data
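For the report piece, one possible starting point is a plain-text summary that ranks checks by how many rows they flag. This is only a sketch under stated assumptions: `summarize` is a hypothetical helper, and the check names and counts in the example are placeholders, not results from any real dataset.

```python
def summarize(findings, n_rows):
    """Hypothetical helper: turn {check_name: flagged_row_count} into a text report."""
    lines = [f"Data quality report ({n_rows} rows)"]
    # Show the checks that flag the most rows first.
    for check, count in sorted(findings.items(), key=lambda kv: -kv[1]):
        status = "OK" if count == 0 else "REVIEW"
        share = count / n_rows if n_rows else 0.0
        lines.append(f"{status:<6} {check:<24} {count:>5} rows ({share:.1%})")
    return "\n".join(lines)


# Placeholder counts, invented for illustration only.
example_findings = {
    "duplicate_ids": 3,
    "out_of_range_age": 12,
    "end_before_start": 0,
}
report = summarize(example_findings, n_rows=200)
print(report)
```

Each REVIEW line is a natural entry for the recommended cleaning plan: what was flagged, how many rows it affects, and what you propose to do about it.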
Great analysts are paranoid about data quality—and employers love that.