The Truth About Statistics in Data Analytics
Here’s what nobody tells you: You don’t need to be a statistics expert to be a great data analyst.
I’ve worked with PhDs who couldn’t explain insights to stakeholders, and self-taught analysts who drove millions in business value.
The difference? Knowing which 15% of statistics to learn deeply, and when to apply them.
The 15 Statistical Concepts That Matter
Tier 1: Descriptive Statistics (Use Daily)
1. Mean, Median, Mode (Central Tendency)
What They Are: - Mean: Average (add all, divide by count) - Median: Middle value when sorted - Mode: Most frequent value
When to Use Which:
| Data Type | Best Measure | Why |
|---|---|---|
| Salaries | Median | Outliers (CEOs) skew mean |
| Test scores | Mean | Normal distribution |
| Shoe sizes | Mode | Discrete choices |
| House prices | Median | High-value outliers |
Real Example:
import pandas as pd
import numpy as np
salaries = [50000, 52000, 55000, 58000, 500000] # CEO ruins the mean
mean_salary = np.mean(salaries) # $143,000 (misleading!)
median_salary = np.median(salaries)  # $55,000 (realistic)
Key Insight: If mean >> median, you have outliers or right-skewed data.
2. Standard Deviation & Variance (Spread)
What They Measure: How spread out your data is
Formula (Don’t Memorize, Understand): - Variance: Average squared distance from mean - Standard Deviation: Square root of variance
Practical Interpretation: - Low StdDev: Data clustered (consistent) - High StdDev: Data spread out (variable)
Real Example:
# Two sales teams with same average
team_a_sales = [100, 105, 98, 102, 95] # Consistent
team_b_sales = [50, 150, 80, 120, 100] # Variable
print(f"Team A StdDev: {np.std(team_a_sales):.2f}")  # 3.41
print(f"Team B StdDev: {np.std(team_b_sales):.2f}")  # 34.06
# Team A is more predictable!
FREE Resources: - Khan Academy: Standard Deviation - StatQuest: SD Explained
3. Percentiles & Quartiles (Distribution Position)
What They Are: - Percentile: % of data below a value - Quartiles: 25th, 50th (median), 75th percentiles - IQR: Interquartile Range (Q3 - Q1)
Why They Matter: - Identify outliers (> Q3 + 1.5×IQR or < Q1 - 1.5×IQR) - Understand distribution - Set thresholds
Real Example:
sales = df['order_amount']
q1 = sales.quantile(0.25) # 25th percentile
q2 = sales.quantile(0.50) # Median
q3 = sales.quantile(0.75) # 75th percentile
iqr = q3 - q1
# Flag potential outliers
outliers = sales[(sales > q3 + 1.5*iqr) | (sales < q1 - 1.5*iqr)]
Use Case: “Our top 10% customers (90th percentile) spend over $500”
Tier 2: Probability & Distributions (Use Weekly)
4. Normal Distribution (The Bell Curve)
Key Properties: - Mean = Median = Mode - 68% within 1 StdDev - 95% within 2 StdDevs - 99.7% within 3 StdDevs
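The 68/95/99.7 rule is easy to verify empirically. A minimal sketch with simulated data (the mean of 100 and StdDev of 15 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=100_000)  # synthetic normal data

mean, std = data.mean(), data.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mean) <= k * std)
    print(f"Within {k} StdDev: {within:.1%}")
# The three fractions land very close to 68%, 95%, and 99.7%
```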
When Data is Normal: - Heights, weights - Test scores - Measurement errors - Many natural phenomena
Why It Matters: Many statistical tests assume normality.
Check for Normality:
from scipy import stats
import matplotlib.pyplot as plt
# Visual check
plt.hist(data, bins=30)
plt.show()
# Statistical test
statistic, p_value = stats.shapiro(data)
if p_value > 0.05:
    print("Data appears normal")
FREE Resources: - Khan Academy: Normal Distribution - StatQuest: Normal Distribution
5. Correlation (Relationships Between Variables)
What It Measures: Strength and direction of linear relationship (-1 to +1)
Correlation Coefficients: - +1: Perfect positive (both increase together) - 0: No linear relationship - -1: Perfect negative (one increases, other decreases)
CRITICAL: Correlation ≠ Causation
Real Examples:
import seaborn as sns
# Calculate correlation matrix
corr_matrix = df[['age', 'income', 'spending']].corr()
# Visualize
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
Common Correlations: - Ice cream sales vs. drownings: High correlation (both caused by summer) - Education vs. income: Positive correlation (potentially causal) - Exercise vs. weight: Negative correlation (likely causal)
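The ice cream/drownings pattern is easy to reproduce: simulate a third variable (temperature) that drives both series, and they correlate strongly even though neither causes the other. A sketch with entirely made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=365)             # daily temps (synthetic)
ice_cream = 20 * temperature + rng.normal(0, 50, 365)   # sales driven by temp
drownings = 0.3 * temperature + rng.normal(0, 1, 365)   # incidents driven by temp

# Strong positive correlation, with no causal link between the two
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"Correlation: {r:.2f}")
```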
FREE Resources: - Khan Academy: Correlation - Spurious Correlations - Fun examples
6. Statistical Significance & P-Values
What They Mean: - P-value: Probability of observing results if null hypothesis is true - p < 0.05: Commonly used threshold for “significant”
Translation: - p = 0.03: if there were truly no effect, a result at least this extreme would occur only 3% of the time (likely a real effect) - p = 0.47: such a result would occur 47% of the time by chance alone (probably no real effect)
IMPORTANT: - p < 0.05 doesn’t mean “important” or “large effect” - With huge samples, tiny effects become “significant” - Always report effect size too
Real Example:
from scipy import stats
# A/B test: control vs variant
# control_conversions / variant_conversions: 0/1 outcome per user (10,000 each)
control_converted = sum(control_conversions)
variant_converted = sum(variant_conversions)
# Chi-square test on a 2x2 contingency table of [converted, not converted] counts
table = [[control_converted, len(control_conversions) - control_converted],
         [variant_converted, len(variant_conversions) - variant_converted]]
chi2, p_value, dof, expected = stats.chi2_contingency(table)
if p_value < 0.05:
    print(f"Variant is significantly different (p={p_value:.3f})")
FREE Resources: - StatQuest: P-Values - Seeing Theory: Frequentist Inference
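The caveat above, that huge samples make tiny effects “significant”, can be demonstrated directly: with a million observations per group, a trivially small true difference yields a minuscule p-value but a negligible effect size. A sketch with simulated data (the 0.02 difference is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=0.00, scale=1, size=1_000_000)
b = rng.normal(loc=0.02, scale=1, size=1_000_000)  # tiny true difference

t_stat, p_value = stats.ttest_ind(a, b)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var() + b.var()) / 2)
print(f"p-value: {p_value:.2e}")      # highly "significant"
print(f"Cohen's d: {cohens_d:.3f}")   # yet a negligible effect size
```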
Tier 3: Hypothesis Testing (Use Monthly)
7. T-Tests (Comparing Two Groups)
When to Use: - Compare means of two groups - Example: “Is average order value different between mobile vs desktop?”
Types: - One-sample: Compare group mean to a known value - Independent samples: Compare two different groups - Paired samples: Before-after comparisons
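The paired variant deserves its own sketch, since it uses the same subjects measured twice. A minimal example with made-up before/after revenue for eight stores, using scipy's `ttest_rel`:

```python
import numpy as np
from scipy import stats

# Same 8 stores, weekly revenue before and after a redesign (made-up numbers)
before = np.array([120, 134, 110, 145, 128, 118, 139, 125])
after = np.array([128, 140, 115, 151, 130, 126, 147, 129])

# Paired t-test: tests whether the per-store differences average to zero
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```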
Real Example:
from scipy import stats
# Compare average order value: mobile vs desktop
mobile_orders = df[df['device'] == 'mobile']['order_value']
desktop_orders = df[df['device'] == 'desktop']['order_value']
# Perform t-test
t_stat, p_value = stats.ttest_ind(mobile_orders, desktop_orders)
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Significant difference between mobile and desktop orders")
8. Chi-Square Test (Categorical Relationships)
When to Use: - Test relationship between two categorical variables - Example: “Is there a relationship between gender and product preference?”
Real Example:
from scipy import stats
import pandas as pd
# Contingency table
data = pd.crosstab(df['gender'], df['product_category'])
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(data)
if p_value < 0.05:
    print("Gender and product preference are related")
9. ANOVA (Comparing 3+ Groups)
When to Use: - Compare means across multiple groups - Example: “Is customer satisfaction different across regions (North, South, East, West)?”
Real Example:
from scipy import stats
north = df[df['region'] == 'North']['satisfaction']
south = df[df['region'] == 'South']['satisfaction']
east = df[df['region'] == 'East']['satisfaction']
west = df[df['region'] == 'West']['satisfaction']
# One-way ANOVA
f_stat, p_value = stats.f_oneway(north, south, east, west)
if p_value < 0.05:
    print("Satisfaction differs significantly across regions")
    # Follow up with post-hoc tests to see which pairs differ
Tier 4: Regression & Forecasting (Advanced)
10. Linear Regression (Predict Numeric Outcomes)
What It Does: - Predicts continuous variable from one or more predictors - Finds “line of best fit”
Equation: y = β₀ + β₁x₁ + β₂x₂ + … + ε
Real Example:
from sklearn.linear_model import LinearRegression
import numpy as np
# Predict sales from advertising spend
X = df[['tv_ads', 'radio_ads', 'digital_ads']]
y = df['sales']
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Interpret coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: ${coef:.2f} sales per $1 spent")
# Make predictions
predictions = model.predict(X_test)
# Evaluate
from sklearn.metrics import r2_score
print(f"R² Score: {r2_score(y_test, predictions):.3f}")
Key Metrics: - R²: % of variance explained (0 to 1, higher is better) - Coefficients: Effect of each predictor - Residuals: Difference between actual and predicted
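All three metrics can be sanity-checked on synthetic data where the true relationship is known. A sketch (the slope of 3, intercept of 10, and noise level are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)
x = rng.uniform(0, 100, size=(500, 1))
y = 3.0 * x[:, 0] + 10 + rng.normal(0, 5, 500)  # true slope 3, intercept 10

model = LinearRegression().fit(x, y)
predictions = model.predict(x)
residuals = y - predictions

print(f"Slope: {model.coef_[0]:.2f}")             # recovers ~3
print(f"R²: {r2_score(y, predictions):.3f}")      # high: noise is small
print(f"Residual mean: {residuals.mean():.4f}")   # ~0: OLS with an intercept
```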
11. Logistic Regression (Predict Binary Outcomes)
When to Use: - Predict yes/no, true/false, 0/1 - Examples: Will customer churn? Will lead convert?
Real Example:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Predict customer churn
X = df[['tenure', 'monthly_charges', 'total_charges']]
y = df['churn'] # 0 or 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, predictions))
Tier 5: Confidence & Sampling
12. Confidence Intervals (Estimate Ranges)
What They Are: Range where true population parameter likely falls (usually 95%)
Translation: “We’re 95% confident the true average order value is between $45-$55”
Real Example:
from scipy import stats
# Calculate 95% confidence interval for mean
data = df['order_value']
confidence = 0.95
mean = np.mean(data)
stderr = stats.sem(data)
interval = stderr * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
print(f"Mean: ${mean:.2f}")
print(f"95% CI: ${mean - interval:.2f} to ${mean + interval:.2f}")
13. Sample Size (How Much Data Do You Need?)
Rules of Thumb: - Surveys: 385+ for 95% confidence, 5% margin of error - A/B tests: Depends on expected effect size (use calculators) - Machine learning: 10x rows per feature minimum
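The 385 figure comes from the standard sample-size formula for estimating a proportion, n = z²·p(1−p)/e², using the worst-case p = 0.5. A quick check:

```python
import math

z = 1.96   # z-score for 95% confidence
p = 0.5    # worst-case proportion (maximizes variance)
e = 0.05   # 5% margin of error

n = math.ceil(z**2 * p * (1 - p) / e**2)
print(n)  # 385
```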
Sample Size Calculators (FREE): - Evan Miller A/B Test Calculator - SurveyMonkey Sample Size Calculator
14. Sampling Methods
Types: - Simple Random: Everyone has equal chance - Stratified: Sample proportionally from subgroups - Cluster: Sample entire groups - Convenience: Whoever is available (⚠️ biased)
When Each Matters: - Random: Most surveys - Stratified: Ensure representation (e.g., demographics) - Cluster: Geographic studies - Convenience: Avoid for serious analysis
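Stratified sampling is one line in pandas: sample within each subgroup so the overall proportions are preserved. A sketch with a made-up `segment` column split 70/20/10:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "segment": ["A"] * 700 + ["B"] * 200 + ["C"] * 100,  # 70/20/10 split
    "value": rng.normal(size=1000),
})

# Stratified: sample 10% within each segment, keeping the 70/20/10 mix
sample = df.groupby("segment").sample(frac=0.1, random_state=0)
print(sample["segment"].value_counts())
```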
15. Type I & Type II Errors
Definitions: - Type I Error (False Positive): Finding effect that doesn’t exist (α = 0.05) - Type II Error (False Negative): Missing real effect (β, power = 1 - β)
Real-World Examples: - Type I: Saying new feature increased signups when it didn’t - Type II: Missing that new feature DID increase signups
Controlling Errors: - Reduce Type I: Lower α (p < 0.01 instead of 0.05) - Reduce Type II: Increase sample size, increase power
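The "increase sample size to reduce Type II errors" point can be simulated: run many experiments where a real effect exists and count how often the test misses it. A sketch (the effect size of 0.3 and trial count are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
effect, alpha, trials = 0.3, 0.05, 1000

type2 = {}
for n in (20, 200):
    misses = 0
    for _ in range(trials):
        a = rng.normal(0, 1, n)
        b = rng.normal(effect, 1, n)  # a real effect exists
        _, p = stats.ttest_ind(a, b)
        if p >= alpha:
            misses += 1  # Type II error: real effect, but not "significant"
    type2[n] = misses / trials
    print(f"n={n}: Type II error rate ≈ {type2[n]:.0%}")
# The miss rate drops sharply as the sample size grows
```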
FREE Statistics Learning Resources
Interactive Learning:
- Khan Academy: Statistics & Probability - Complete course
- Seeing Theory - Beautiful visualizations
- StatQuest YouTube - Best video explanations
- Brilliant.org Statistics - Interactive problems
Books (Free Online):
- OpenIntro Statistics - Comprehensive textbook
- Think Stats - Python-based
- Statistics by Jim - Clear blog explanations
Practice:
The 30-Day Statistics Bootcamp
Week 1: Descriptive Stats
- Day 1-2: Mean, median, mode, standard deviation
- Day 3-4: Percentiles, quartiles, outliers
- Day 5-7: Distributions, normal distribution
Week 2: Probability & Correlation
- Day 8-10: Probability basics, conditional probability
- Day 11-12: Correlation vs causation
- Day 13-14: Practice problems
Week 3: Hypothesis Testing
- Day 15-17: P-values, significance, confidence intervals
- Day 18-19: T-tests, chi-square tests
- Day 20-21: ANOVA, multiple testing
Week 4: Regression & Projects
- Day 22-24: Linear regression
- Day 25-26: Logistic regression
- Day 27-30: Apply to real projects
Statistics Interview Questions
Be ready to answer:
- “Explain p-value to a non-technical person”
- “When would you use median instead of mean?”
- “What’s the difference between correlation and causation?”
- “How do you detect outliers?”
- “Explain Type I and Type II errors”
- “How would you determine if an A/B test result is significant?”
- “What assumptions does linear regression make?”
Common Statistics Mistakes to Avoid
❌ Using mean for skewed data
✅ Use median for income, house prices, etc.
❌ Assuming correlation means causation
✅ Always consider confounding variables
❌ P-hacking (testing until p < 0.05)
✅ Decide hypothesis before testing
❌ Ignoring sample size
✅ Remember that large samples make tiny effects “significant”; check effect size too
❌ Forgetting assumptions (normality, independence)
✅ Check assumptions before running tests
When to Get Help from a Statistician
You probably need expert help if: - Clinical trial or medical study - Multiple hypothesis testing - Complex survey design - Causal inference (not just correlation) - Bayesian analysis - Time series forecasting
Don’t be afraid to admit limitations!
Take Action Today
Your homework (2 hours):
- Download a dataset from Kaggle
- Calculate descriptive statistics (mean, median, std dev)
- Create histograms and boxplots
- Identify outliers
- Calculate correlation matrix
- Document your findings
Share your analysis on LinkedIn/Twitter!
Related Posts: - Your Ultimate 100-Day Data Analytics Roadmap - Python vs R for Data Analytics - Build a Portfolio That Gets You Hired
Tags: #Statistics #DataAnalytics #Math #Tutorial #Beginners #HypothesisTesting