The Truth Nobody Tells You
An oft-cited rule of thumb: 80% of data analysis is cleaning data.
Only 20% is the “fun” stuff - modeling, visualization, insights.
Yet most courses skip data cleaning or treat it as a footnote.
This is why beginners struggle with real-world data.
This post covers the data cleaning skills that make you invaluable.
The 7 Types of Messy Data (And How to Fix Each)
1. Missing Values
The Problem:
Name Age Salary Department
0 John 32 50000.0 Sales
1 Jane NaN 65000.0 Marketing
2 Bob 28 NaN Sales
3 Alice 45 75000.0 NaN
Solutions:
Option A: Remove Rows (if < 5% missing)
# Remove rows with any missing values
df_clean = df.dropna()
# Remove rows with missing in specific columns
df_clean = df.dropna(subset=['Age', 'Salary'])
Option B: Impute (fill with calculated value)
# Fill with mean
# Fill with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Fill with median (better for skewed data)
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
# Fill with mode (categorical)
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])
# Forward fill (time series) - fillna(method='ffill') is deprecated in pandas 2.x
df['Sales'] = df['Sales'].ffill()
# Custom logic: fill with the department-level mean
df['Age'] = df['Age'].fillna(df.groupby('Department')['Age'].transform('mean'))
Option C: Flag as Missing (create indicator column)
df['Age_missing'] = df['Age'].isnull().astype(int)
df['Age'] = df['Age'].fillna(df['Age'].median())
FREE Resources: - Pandas Missing Data - How to Handle Missing Data (Article)
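Putting the options together, here is a self-contained sketch on a toy frame (the names and numbers mirror the example above):

```python
import pandas as pd

# Toy frame mirroring the example above
df = pd.DataFrame({
    "Name": ["John", "Jane", "Bob", "Alice"],
    "Age": [32, None, 28, 45],
    "Salary": [50000.0, 65000.0, None, 75000.0],
    "Department": ["Sales", "Marketing", "Sales", None],
})

# Option C first (flag), then Option B (impute)
df["Age_missing"] = df["Age"].isna().astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].median())
df["Department"] = df["Department"].fillna(df["Department"].mode()[0])

print(df.isna().sum().sum())  # 0 -- no missing values remain
print(df.loc[1, "Age"])       # 32.0 (median of 32, 28, 45)
```

The flag column is created before imputation, so the model (or the analyst) can still tell which ages were originally missing.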
2. Duplicate Records
The Problem:
CustomerID Name Email Order
0 1001 John john@email.com 100
1 1001 John john@email.com 100
2 1002 Jane jane@email.com 150
Solutions:
Find Duplicates:
# Check for duplicates
duplicates = df[df.duplicated()]
print(f"Found {len(duplicates)} duplicates")
# Check specific columns
duplicates = df[df.duplicated(subset=['CustomerID', 'Email'])]
Remove Duplicates:
# Remove all duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['CustomerID'], keep='first')
3. Inconsistent Text Data
The Problem:
Product Category
0 iPhone 13 Electronics
1 iphone 13 electronics
2 IPHONE 13 ELECTRONICS
3 iPhone 13 Electronics
Solutions:
Standardize Case:
# Convert to lowercase
df['Product'] = df['Product'].str.lower()
df['Category'] = df['Category'].str.lower()
# Or title case
df['Name'] = df['Name'].str.title()
# Or uppercase
df['State'] = df['State'].str.upper()
Remove Extra Spaces:
# Strip leading/trailing spaces
df['Product'] = df['Product'].str.strip()
# Remove extra spaces between words
df['Product'] = df['Product'].str.replace(r'\s+', ' ', regex=True)
Find and Replace:
# Simple replacement
df['Category'] = df['Category'].str.replace('electronics', 'Electronics')
# Multiple replacements
replacements = {
'elec': 'Electronics',
'cloth': 'Clothing',
'food': 'Food & Beverage'
}
df['Category'] = df['Category'].replace(replacements)
# Regex replacement
df['Phone'] = df['Phone'].str.replace(r'[^0-9]', '', regex=True)
4. Wrong Data Types
The Problem:
df.dtypes
# Date object # Should be datetime
# Age object # Should be numeric
# Price object # Should be float
# Quantity float # Should be int
Solutions:
Convert Types:
# String to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce') # Invalid → NaN
df['Price'] = df['Price'].astype(float)
# String to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Float to int (handle NaN first!)
df['Quantity'] = df['Quantity'].fillna(0).astype(int)
# String to categorical (saves memory)
df['Category'] = df['Category'].astype('category')
Handle Parsing Errors:
# Identify problematic values
pd.to_numeric(df['Age'], errors='coerce').isna().sum()
# See which values failed
mask = pd.to_numeric(df['Age'], errors='coerce').isna()
print(df.loc[mask, 'Age'].unique())
# Clean before converting
df['Age'] = df['Age'].str.extract(r'(\d+)')[0] # Extract digits
df['Age'] = pd.to_numeric(df['Age'])
5. Outliers
The Problem:
# Salary column has some extreme values
print(df['Salary'].describe())
# mean: 75,000
# std: 50,000
# max: 9,999,999 # Obviously wrong!
Detection Methods:
Method 1: IQR (Interquartile Range)
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Flag outliers
outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]
# Remove outliers
df_clean = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
Method 2: Z-Score
import numpy as np
from scipy import stats
# Calculate z-scores
z_scores = np.abs(stats.zscore(df['Salary']))
# Remove data points with |z| > 3
df_clean = df[z_scores < 3]
Method 3: Domain Knowledge
# Age should be 18-100
df_clean = df[(df['Age'] >= 18) & (df['Age'] <= 100)]
# Salary should be 20K-500K for this role
df_clean = df[(df['Salary'] >= 20000) & (df['Salary'] <= 500000)]
FREE Resources: - Detecting Outliers (Medium)
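A quick, self-contained check of the IQR rule on a handful of invented salaries:

```python
import pandas as pd

# Hypothetical salaries with one obviously wrong entry
salaries = pd.Series([52000, 58000, 61000, 65000, 70000, 75000, 9_999_999])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences
clean = salaries[(salaries >= lower) & (salaries <= upper)]
print(len(clean))  # 6 -- the 9,999,999 entry is dropped
```

Note how robust the IQR method is here: one wild value barely moves the quartiles, so the fences stay sensible. A z-score on the same data would be dragged around by that 9,999,999.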
6. Inconsistent Formatting
The Problem:
# Dates in multiple formats
0 2024-01-15
1 01/15/2024
2 15-Jan-2024
3 Jan 15, 2024
# Phone numbers
0 (123) 456-7890
1 123-456-7890
2 1234567890
3 +1 123 456 7890
Solutions:
Dates:
# Let pandas infer per-element formats (infer_datetime_format is deprecated in pandas 2.x)
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
# Or specify format explicitly
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d', errors='coerce')
# For mixed formats, process separately
mask = df['Date'].str.contains('/')
df.loc[mask, 'Date'] = pd.to_datetime(df.loc[mask, 'Date'], format='%m/%d/%Y')
df.loc[~mask, 'Date'] = pd.to_datetime(df.loc[~mask, 'Date'], format='%Y-%m-%d')
Phone Numbers:
# Extract only digits
df['Phone'] = df['Phone'].str.replace(r'[^0-9]', '', regex=True)
# Standardize format
df['Phone'] = df['Phone'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)
Currency:
# Remove currency symbols and commas
df['Price'] = df['Price'].str.replace('[$,]', '', regex=True).astype(float)
7. Incorrect Values
The Problem:
# Logic errors
Age: -5, 150 # Negative or impossible ages
Date: 2025-13-45 # Invalid date
Gender: M, F, Male, Female, m, f # Inconsistent categories
Solutions:
Range Validation:
# Flag impossible values
df.loc[df['Age'] < 0, 'Age'] = np.nan
df.loc[df['Age'] > 120, 'Age'] = np.nan
# Or remove
df = df[(df['Age'] >= 0) & (df['Age'] <= 120)]
Categorical Standardization:
# Map variations to standard values
gender_map = {
'M': 'Male',
'm': 'Male',
'Male': 'Male',
'F': 'Female',
'f': 'Female',
'Female': 'Female'
}
df['Gender'] = df['Gender'].map(gender_map)
# Or use replace
df['Status'] = df['Status'].replace({
'active': 'Active',
'inactive': 'Inactive',
'pending': 'Pending'
})
Constraint Validation:
# Ensure related fields are consistent
# e.g., EndDate should be after StartDate
invalid = df[df['EndDate'] < df['StartDate']]
print(f"{len(invalid)} records with EndDate before StartDate")
# Fix or flag
df.loc[df['EndDate'] < df['StartDate'], 'EndDate'] = np.nan
The Complete Data Cleaning Workflow
import pandas as pd
import numpy as np
from scipy import stats
def clean_data(df):
    """
    Complete data cleaning pipeline
    """
    print(f"Starting with {len(df)} rows")

    # 1. INSPECT
    print("\n=== DATA OVERVIEW ===")
    df.info()
    print(df.describe())
    print(df.head())

    # 2. REMOVE DUPLICATES
    print("\n=== REMOVING DUPLICATES ===")
    before = len(df)
    df = df.drop_duplicates()
    print(f"Removed {before - len(df)} duplicates")

    # 3. FIX DATA TYPES
    print("\n=== FIXING DATA TYPES ===")
    # Date columns
    date_cols = ['order_date', 'ship_date']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
    # Numeric columns
    numeric_cols = ['age', 'salary', 'price']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

    # 4. HANDLE MISSING VALUES
    print("\n=== HANDLING MISSING VALUES ===")
    missing = df.isnull().sum()
    missing_pct = 100 * missing / len(df)
    missing_df = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    })
    print(missing_df[missing_df['Missing'] > 0])
    # Remove columns with >50% missing
    cols_to_drop = missing_df[missing_df['Percent'] > 50].index
    df = df.drop(columns=cols_to_drop)
    print(f"Dropped columns: {list(cols_to_drop)}")
    # Impute remaining
    for col in df.select_dtypes(include=[np.number]).columns:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].median())
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].mode()[0])

    # 5. STANDARDIZE TEXT
    print("\n=== STANDARDIZING TEXT ===")
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        df[col] = df[col].str.strip()
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True)

    # 6. REMOVE OUTLIERS
    print("\n=== REMOVING OUTLIERS ===")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        outliers = df[(df[col] < lower) | (df[col] > upper)]
        if len(outliers) > 0:
            print(f"{col}: {len(outliers)} outliers removed")
            df = df[(df[col] >= lower) & (df[col] <= upper)]

    # 7. VALIDATE
    print("\n=== VALIDATION ===")
    print(f"Final dataset: {len(df)} rows, {len(df.columns)} columns")
    print(f"Missing values: {df.isnull().sum().sum()}")
    return df

# Usage
df_clean = clean_data(df)
FREE Tools for Data Cleaning
Python Libraries:
- pandas - Core data manipulation
- numpy - Numerical operations
- ydata-profiling (formerly pandas-profiling) - Automated EDA
- missingno - Visualize missing data
- ftfy - Fix Unicode text
- dateparser - Parse dates
GUI Tools (No Coding):
- OpenRefine - Powerful, free, local
- Trifacta Wrangler - Free tier
- Excel Power Query - Built into Excel
- DataWrangler (Tableau Prep) - Free trial
Practice Datasets (Messy on Purpose)
- Data Cleaning Practice (Kaggle)
- Messy Data (GitHub)
- Real-World Messy Data (Data.gov) - Download any dataset
- Quandl Financial Data - Often needs cleaning
Data Cleaning Interview Questions
Be ready to answer:
- “How do you handle missing data?”
- “How do you detect outliers?”
- “Walk me through your data cleaning process”
- “How do you deal with duplicate records?”
- “What would you do with a column that has 60% missing values?”
- “How do you validate data quality?”
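For the “How do you validate data quality?” question, one simple, interview-friendly pattern is a dictionary of boolean checks run after cleaning. The column names and thresholds below are illustrative; adapt them to your data:

```python
import pandas as pd

# Hypothetical cleaned frame to validate
df = pd.DataFrame({
    "Age": [25, 34, 61],
    "Salary": [48000.0, 72000.0, 95000.0],
    "Department": ["Sales", "Marketing", "Sales"],
})

# Assertion-style quality checks: each maps a name to a pass/fail boolean
checks = {
    "no_missing": df.isna().sum().sum() == 0,
    "no_duplicates": not df.duplicated().any(),
    "age_in_range": df["Age"].between(18, 100).all(),
    "salary_positive": (df["Salary"] > 0).all(),
}

failed = [name for name, ok in checks.items() if not ok]
print(failed)  # [] -- all checks pass
```

In production the same idea scales up to dedicated tools (Great Expectations, pandera), but being able to sketch it from scratch shows you understand what those tools automate.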
Data Quality Checklist
Before ANY analysis, verify:
- No duplicate records remain
- Every column has the correct data type
- Missing values are handled (removed, imputed, or flagged)
- Text fields are standardized (case, whitespace)
- Outliers have been reviewed (removed or justified)
- Values pass basic sanity checks (valid ranges, consistent categories)
Take Action Today (1 Hour)
- Download a messy dataset from Kaggle
- Run df.info() and df.describe()
- Find missing values with df.isnull().sum()
- Remove duplicates
- Impute or remove missing values
- Document what you did and why
Post your cleaned dataset on Kaggle or GitHub!
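The steps above, condensed into a runnable sketch. The CSV contents here are an invented stand-in for whatever dataset you download:

```python
import io
import pandas as pd

# Stand-in for a downloaded Kaggle CSV (contents invented; includes
# a duplicate row and a missing value on purpose)
raw = io.StringIO(
    "name,age,city\n"
    "Ann,29,Boston\n"
    "Ann,29,Boston\n"
    "Ben,,Denver\n"
)
df = pd.read_csv(raw)

df.info()                                          # step 2: overview
print(df.isnull().sum())                           # step 3: missing values
df = df.drop_duplicates()                          # step 4: remove duplicates
df["age"] = df["age"].fillna(df["age"].median())   # step 5: impute
print(len(df), df["age"].tolist())                 # 2 rows, Ben's age imputed to 29.0
```

Swap the StringIO for pd.read_csv("your_file.csv") and the same five lines become your first real cleaning script. Document each decision as you go.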
Related Posts: - Master SQL in 30 Days - Python vs R for Data Analytics - Build a Portfolio That Gets You Hired
Tags: #DataCleaning #Pandas #Python #DataQuality #Tutorial #DataScience