Why R for Health Research?
R has become the gold standard for statistical analysis in health sciences. Here’s why:
✅ Free and open source - No licensing costs
✅ Powerful statistics - Built by statisticians, for statisticians
✅ Reproducible research - Document analysis with code
✅ Rich ecosystem - 19,000+ packages for every need
✅ Beautiful visualizations - Publication-ready graphics
✅ Active community - Help always available
Fun Fact: Over 60% of biostatistics papers now use R!
Setting Up Your R Environment
Step 1: Install R
Download from CRAN:
- Windows: Click “Download R for Windows”
- Mac: Click “Download R for macOS”
- Linux: Use your package manager
Step 2: Install RStudio
Download RStudio Desktop (FREE)
Why RStudio?
- Integrated console, editor, and plots
- Project management
- Git integration
- Package management
- Markdown support
Step 3: Customize Your Setup
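# Install essential packages
install.packages(c(
  "tidyverse",  # Data manipulation & viz (also installs haven)
  "readxl",     # Read Excel files
  "janitor",    # Clean data
  "gtsummary",  # Publication tables
  "survival",   # Survival analysis
  "epiR",       # Epidemiology tools
  "broom",      # Tidy model outputs
  "here",       # File paths
  "survminer",  # Survival plots (ggsurvplot)
  "forestplot", # Forest plots
  "pwr",        # Power and sample size
  "survey",     # Complex survey analysis
  "naniar"      # Missing-data visualization
))
R Basics for Health Researchers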
Your First R Script
# This is a comment in R
# Basic arithmetic
2 + 2
10 - 3
5 * 4
20 / 4
# Variables
age <- 25
weight_kg <- 70
height_m <- 1.75
# Calculate BMI
bmi <- weight_kg / (height_m^2)
print(bmi)
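The BMI arithmetic extends to whole vectors, because R applies operators element by element. A minimal sketch, assuming a hypothetical helper named `calc_bmi()` and made-up measurements (neither comes from the original post):

```r
# Hypothetical helper: BMI = weight (kg) / height (m)^2, applied element-wise
calc_bmi <- function(weight_kg, height_m) {
  stopifnot(is.numeric(weight_kg), is.numeric(height_m))
  weight_kg / height_m^2
}

# Made-up measurements for three people
weights_kg <- c(70, 85, 54)
heights_m  <- c(1.75, 1.80, 1.62)

bmi_values <- calc_bmi(weights_kg, heights_m)
print(round(bmi_values, 1))
```

Wrapping the formula in a function keeps the unit assumptions (kilograms and metres) in one place instead of scattering them through a script.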
Understanding R Objects
# Vectors (most common)
ages <- c(23, 45, 67, 34, 56)
# Named patient_names so the vector does not mask base::names()
patient_names <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
# Calculate mean age
mean(ages)
# Data frames (like Excel tables)
patients <- data.frame(
  id = 1:5,
  name = patient_names,
  age = ages,
  diabetes = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)
# View the data
print(patients)
Working with Health Data
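The examples in this section read files such as data/patients.csv, which do not come with the post. A minimal sketch, assuming made-up column names and simulated values, that builds a comparable data frame in base R and round-trips it through a temporary CSV file:

```r
set.seed(42)  # Make the simulated values reproducible
n <- 100

# Simulated patients; every value here is invented for illustration
sim_patients <- data.frame(
  patient_id = seq_len(n),
  age        = round(runif(n, min = 5, max = 90)),
  gender     = sample(c("Female", "Male"), n, replace = TRUE),
  bmi        = round(rnorm(n, mean = 27, sd = 5), 1),
  treatment  = sample(c("A", "B"), n, replace = TRUE)
)

# Real data usually has gaps, so blank out two BMI values
sim_patients$bmi[c(3, 17)] <- NA

# Write to a temporary CSV and read it back
csv_path <- file.path(tempdir(), "patients.csv")
write.csv(sim_patients, csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path)

str(round_trip)
```

Practising on simulated data first means import mistakes show up before any real patient records are involved.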
Reading Data
library(tidyverse)
library(readxl)
# CSV files
patient_data <- read_csv("data/patients.csv")
# Excel files
survey_data <- read_excel("data/survey.xlsx", sheet = "Sheet1")
# SPSS files
library(haven)
clinic_data <- read_sav("data/clinic.sav")
# View first few rows
head(patient_data)
# Get structure
str(patient_data)
# Summary statistics
summary(patient_data)
Data Cleaning with tidyverse
library(tidyverse)
library(janitor)
# Clean column names
clean_data <- patient_data %>%
clean_names()
# Select specific columns
selected <- clean_data %>%
select(patient_id, age, gender, diagnosis, treatment)
# Filter rows
adults <- clean_data %>%
filter(age >= 18)
# Create new variables
# Each category has an explicit condition, so a missing age or BMI stays NA
# instead of being labelled "Senior" or "Obese" by a catch-all TRUE
clean_data <- clean_data %>%
  mutate(
    age_group = case_when(
      age < 18 ~ "Child",
      age < 65 ~ "Adult",
      age >= 65 ~ "Senior"
    ),
    bmi_category = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25 ~ "Normal",
      bmi < 30 ~ "Overweight",
      bmi >= 30 ~ "Obese"
    )
  )
# Remove missing values
complete_data <- clean_data %>%
  drop_na(age, gender, treatment)
Data Manipulation
# Group by and summarize
summary_stats <- clean_data %>%
group_by(treatment) %>%
summarize(
n = n(),
mean_age = mean(age, na.rm = TRUE),
sd_age = sd(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE)
)
# Pivot tables
cross_tab <- clean_data %>%
count(treatment, outcome) %>%
pivot_wider(names_from = outcome, values_from = n)
# Join datasets
# lab_results is assumed to be a second data frame that shares patient_id
merged_data <- left_join(
  patient_data,
  lab_results,
  by = "patient_id"
)
Statistical Analysis for Health Research
Descriptive Statistics
library(gtsummary)
# Create Table 1 (demographics)
clean_data %>%
select(age, gender, treatment, bmi, hypertension) %>%
tbl_summary(
by = treatment,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
label = list(
age ~ "Age (years)",
gender ~ "Gender",
bmi ~ "BMI (kg/m²)",
hypertension ~ "Hypertension"
)
) %>%
add_p() %>% # Add p-values
add_overall() %>% # Add overall column
bold_labels()
T-tests and ANOVA
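It can help to try these tests on simulated values where the true difference is known before running them on real data. A minimal sketch in base R, assuming made-up group means and standard deviations:

```r
set.seed(123)

# Simulated weights (kg): the true group means differ by 15 kg
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 50),
  weight = c(rnorm(50, mean = 70, sd = 5), rnorm(50, mean = 85, sd = 5))
)

# Independent two-sample test (R uses the Welch version by default)
welch <- t.test(weight ~ gender, data = sim)
print(welch$estimate)  # Group means
print(welch$conf.int)  # 95% CI for the difference in means

# Paired test: the same 30 people measured twice, true mean loss of 2 kg
before <- rnorm(30, mean = 80, sd = 8)
after  <- before - 2 + rnorm(30, mean = 0, sd = 1)
paired <- t.test(before, after, paired = TRUE)
print(paired$estimate)  # Mean of the within-person differences
```

If a test cannot recover a difference that was deliberately built into the data, the problem lies in the code or the design rather than in the patients.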
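# Independent t-test (gender must have exactly two levels)
t.test(weight ~ gender, data = clean_data)
# Paired t-test (columns referenced explicitly because they live in clean_data)
t.test(clean_data$weight_before, clean_data$weight_after, paired = TRUE)
# One-way ANOVA
anova_result <- aov(cholesterol ~ treatment, data = clean_data)
summary(anova_result)
# Post-hoc tests (treatment should be stored as a factor)
TukeyHSD(anova_result)
Chi-square Tests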
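The chi-square statistic compares observed counts with the counts expected if the two variables were independent. A minimal sketch that computes it by hand and checks it against `chisq.test()`, assuming an invented 2x2 table of counts:

```r
# Invented counts: rows are treatment arms, columns are outcomes
counts <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    treatment = c("Drug", "Placebo"),
    outcome   = c("Improved", "Not improved")
  )
)

# Expected count for each cell = row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
manual_stat <- sum((counts - expected)^2 / expected)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default,
# so switch it off to match the hand calculation
chi_res <- chisq.test(counts, correct = FALSE)

print(manual_stat)
print(chi_res)
```

Inspecting `chi_res$expected` is also a quick way to decide whether Fisher's exact test is the safer choice: small expected counts are the usual warning sign.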
# Create contingency table (named so it does not mask base::table)
contingency_table <- table(clean_data$treatment, clean_data$outcome)
# Chi-square test
chisq.test(contingency_table)
# Fisher's exact test (for small samples)
fisher.test(contingency_table)
Linear Regression
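A regression is easier to read once you have seen it recover a slope you chose yourself. A minimal sketch in base R, assuming a simulated relationship in which systolic blood pressure rises by 0.5 mmHg per year of age (an invented value, not a clinical estimate):

```r
set.seed(2024)
n <- 200

# Simulated data: true intercept 100, true slope 0.5, residual SD 5
age         <- runif(n, min = 20, max = 80)
systolic_bp <- 100 + 0.5 * age + rnorm(n, mean = 0, sd = 5)
sim_bp      <- data.frame(age, systolic_bp)

fit <- lm(systolic_bp ~ age, data = sim_bp)

print(coef(fit))     # Estimates should sit close to 100 and 0.5
print(confint(fit))  # 95% confidence intervals for both coefficients
```

The same `coef()` and `confint()` calls work unchanged on models fitted to real data.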
# Simple linear regression
model1 <- lm(systolic_bp ~ age, data = clean_data)
summary(model1)
# Multiple linear regression
model2 <- lm(systolic_bp ~ age + bmi + gender + smoking,
data = clean_data)
summary(model2)
# Get tidy results
library(broom)
tidy(model2, conf.int = TRUE)
# Check assumptions
par(mfrow = c(2, 2))
plot(model2)
Logistic Regression
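For a single binary predictor, the exponentiated logistic regression coefficient equals the cross-product odds ratio from the 2x2 table. A minimal sketch in base R that shows the two agree, assuming invented cell counts:

```r
# Invented 2x2 counts: exposure (1/0) by disease (1/0)
cells <- data.frame(
  exposed = c(1, 1, 0, 0),
  disease = c(1, 0, 1, 0),
  n       = c(30, 20, 15, 35)
)

# Expand the counts to one row per person
long <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("exposed", "disease")]

# Odds ratio by hand: (a * d) / (b * c)
or_manual <- (30 * 35) / (20 * 15)

# Odds ratio from logistic regression
fit    <- glm(disease ~ exposed, data = long, family = binomial)
or_glm <- unname(exp(coef(fit)["exposed"]))

print(c(manual = or_manual, glm = or_glm))
```

With more predictors in the model the coefficient becomes an adjusted odds ratio and no longer matches the crude table.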
# Binary outcome
logit_model <- glm(diabetes ~ age + bmi + family_history,
data = clean_data,
family = binomial)
summary(logit_model)
# Odds ratios with confidence intervals
exp(coef(logit_model))
exp(confint(logit_model))
# Or use broom
tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)
Survival Analysis
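For a version of this workflow that runs without a private dataset, the survival package ships the `lung` dataset (status is coded 1 = censored, 2 = dead; sex is coded 1 = male, 2 = female). A minimal sketch using that dataset in place of the post's `clean_data`:

```r
library(survival)

# Kaplan-Meier estimate by sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(km_fit)  # Events and median survival per group

# Cox proportional hazards model adjusting for age
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Hazard ratios: exponentiated coefficients
hazard_ratios <- exp(coef(cox_fit))
print(hazard_ratios)
```

Writing `Surv()` directly in the formula keeps the time and event columns tied to the `data` argument, which makes the fit easier to reuse for plotting.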
library(survival)
library(survminer)
# Kaplan-Meier curves
# Surv() is written inside the formula so the fit stays tied to clean_data
# (died: 1 or TRUE = event, 0 or FALSE = censored)
km_fit <- survfit(Surv(follow_up_months, died) ~ treatment, data = clean_data)
# Plot
ggsurvplot(km_fit,
           data = clean_data,
           pval = TRUE,
           conf.int = TRUE,
           risk.table = TRUE,
           xlab = "Time (months)",
           ylab = "Survival Probability")
# Cox proportional hazards
cox_model <- coxph(Surv(follow_up_months, died) ~ age + gender + treatment,
                   data = clean_data)
summary(cox_model)
Data Visualization
Basic Plots
library(ggplot2)
# Histogram
ggplot(clean_data, aes(x = age)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
labs(title = "Age Distribution",
x = "Age (years)",
y = "Count") +
theme_minimal()
# Box plot
ggplot(clean_data, aes(x = treatment, y = cholesterol, fill = treatment)) +
geom_boxplot() +
labs(title = "Cholesterol Levels by Treatment",
x = "Treatment Group",
y = "Cholesterol (mg/dL)") +
theme_minimal()
# Scatter plot with trend line
ggplot(clean_data, aes(x = age, y = systolic_bp)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
labs(title = "Blood Pressure vs Age",
x = "Age (years)",
y = "Systolic BP (mmHg)") +
theme_minimal()
Publication-Ready Graphics
# Bar chart with error bars
summary_data <- clean_data %>%
  group_by(treatment) %>%
  summarize(
    mean_bp = mean(systolic_bp, na.rm = TRUE),
    # Divide by the number of non-missing values, not n(), so the SE matches the mean
    se_bp = sd(systolic_bp, na.rm = TRUE) / sqrt(sum(!is.na(systolic_bp)))
  )
ggplot(summary_data, aes(x = treatment, y = mean_bp, fill = treatment)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_bp - se_bp, ymax = mean_bp + se_bp),
                width = 0.2) +
  labs(title = "Mean Systolic Blood Pressure by Treatment",
       subtitle = "Error bars represent standard error",
       x = "Treatment",
       y = "Systolic BP (mmHg)") +
  theme_classic() +
  theme(legend.position = "none")
# Save the plot (make sure the output folder exists first)
dir.create("figures", showWarnings = FALSE)
ggsave("figures/bp_by_treatment.png", width = 8, height = 6, dpi = 300)
Forest Plots
library(forestplot)
# Prepare data
results <- data.frame(
study = c("Study 1", "Study 2", "Study 3", "Pooled"),
OR = c(1.5, 1.8, 1.3, 1.5),
lower = c(1.2, 1.4, 1.0, 1.3),
upper = c(1.9, 2.3, 1.7, 1.8)
)
# Create forest plot
forestplot(
  labeltext = results$study,
  mean = results$OR,
  lower = results$lower,
  upper = results$upper,
  zero = 1, # For odds ratios the line of no effect is 1, not the default 0
  xlab = "Odds Ratio",
  title = "Meta-analysis of Treatment Effect"
)
Epidemiology with R
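The core measures in this section can be computed by hand, which gives a useful cross-check on package output. A minimal sketch in base R, assuming an illustrative cohort in which 50 of 250 exposed and 25 of 250 unexposed people develop disease, with a Wald confidence interval on the log scale:

```r
# Illustrative cohort counts
exposed_cases   <- 50
exposed_total   <- 250
unexposed_cases <- 25
unexposed_total <- 250

# Risks in each group
risk_exposed   <- exposed_cases / exposed_total
risk_unexposed <- unexposed_cases / unexposed_total

# Relative risk and risk difference (attributable risk)
rr        <- risk_exposed / risk_unexposed
risk_diff <- risk_exposed - risk_unexposed

# Wald 95% CI for the relative risk, computed on the log scale
se_log_rr <- sqrt(1 / exposed_cases - 1 / exposed_total +
                  1 / unexposed_cases - 1 / unexposed_total)
rr_ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

print(c(RR = rr, lower = rr_ci[1], upper = rr_ci[2]))
print(risk_diff)
```

Packages may report the risk difference per 100 people or use other interval methods, so small differences from this hand calculation are not necessarily errors.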
Calculate Disease Measures
library(epiR)
# 2x2 table
table_data <- matrix(c(
  50, 200, # Exposed: diseased, not diseased
  25, 225  # Not exposed: diseased, not diseased
), nrow = 2, byrow = TRUE)
# Calculate measures (epi.2by2 expects a table object rather than a bare matrix)
epi_measures <- epi.2by2(as.table(table_data), method = "cohort.count")
# View results
print(epi_measures)
# Extract specific measures
# Element names vary between epiR versions; list them with names(epi_measures$massoc.detail)
# Relative Risk
epi_measures$massoc.detail$RR.strata.wald
# Attributable Risk
epi_measures$massoc.detail$ARisk.strata.wald
Sample Size Calculation
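Base R also has sample size functions in the stats package, which make a handy cross-check. A minimal sketch using `power.prop.test()` and `power.t.test()` with the same design values as this section (proportions of 0.30 and 0.45, a standardized difference of 0.5, alpha 0.05, power 0.80):

```r
# Two proportions: 30% vs 45%, two-sided alpha 0.05, power 80%
n_prop <- power.prop.test(p1 = 0.30, p2 = 0.45,
                          sig.level = 0.05, power = 0.80)
print(ceiling(n_prop$n))  # Required sample size per group

# Two-sample t-test: difference of 0.5 SD (delta / sd = Cohen's d of 0.5)
n_t <- power.t.test(delta = 0.5, sd = 1,
                    sig.level = 0.05, power = 0.80,
                    type = "two.sample")
print(ceiling(n_t$n))     # Required sample size per group
```

The proportion result can differ slightly from a calculation based on Cohen's h, because the two approaches use different approximations; always round up to whole participants.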
# Sample size for comparing two proportions
library(pwr)
pwr.2p.test(
h = ES.h(p1 = 0.30, p2 = 0.45), # Effect size
sig.level = 0.05, # Alpha
power = 0.80, # Power
alternative = "two.sided"
)
# Sample size for t-test
pwr.t.test(
d = 0.5, # Effect size (Cohen's d)
sig.level = 0.05,
power = 0.80,
type = "two.sample"
)
Reproducible Reports with R Markdown
Create an R Markdown Document
---
title: "Clinical Trial Analysis Report"
author: "Your Name"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
---
## Introduction
This report analyzes data from the clinical trial...
## Data Import
```{r}
library(tidyverse)
data <- read_csv("data/trial_data.csv")
```
## Descriptive Statistics
```{r}
data %>%
group_by(treatment) %>%
summarize(
n = n(),
mean_age = mean(age),
sd_age = sd(age)
)
```
## Results
The mean age was `r round(mean(data$age), 1)` years.
```{r}
ggplot(data, aes(x = treatment, y = outcome)) +
geom_boxplot()
```
## Conclusion
Our analysis shows...
Common Health Research Workflows
Clinical Trial Analysis
# Load data
trial_data <- read_csv("data/trial.csv")
# Clean and prepare
trial_clean <- trial_data %>%
filter(eligible == TRUE) %>%
mutate(
treatment_group = factor(treatment_group,
levels = c("Placebo", "Drug A", "Drug B")),
response = factor(response, levels = c("No", "Yes"))
)
# Baseline characteristics
baseline_table <- trial_clean %>%
select(treatment_group, age, gender, baseline_severity) %>%
tbl_summary(by = treatment_group) %>%
add_p()
# Primary outcome analysis
primary_model <- glm(response ~ treatment_group + age + gender,
data = trial_clean,
family = binomial)
# Create results table
tbl_regression(primary_model, exponentiate = TRUE)
Cohort Study Analysis
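The confidence limits used in this workflow are exact Poisson limits written in terms of the chi-square distribution. A minimal sketch, assuming invented totals of 12 cases over 3,400 person-years, that computes them by hand and compares them with base R's `poisson.test()`:

```r
# Invented totals for one exposure group
cases        <- 12
person_years <- 3400

# Incidence rate per 1,000 person-years
rate_per_1000 <- cases / person_years * 1000

# Exact 95% limits via chi-square quantiles
ci_manual <- c(
  qchisq(0.025, 2 * cases) / 2,
  qchisq(0.975, 2 * (cases + 1)) / 2
) / person_years * 1000

# The same interval from poisson.test(), rescaled to per 1,000 person-years
ci_builtin <- as.numeric(poisson.test(cases, T = person_years)$conf.int) * 1000

print(round(c(rate = rate_per_1000, lower = ci_manual[1], upper = ci_manual[2]), 2))
```

Agreement between the two intervals is a quick sanity check before applying the formula inside a grouped summary.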
# Load cohort data
cohort <- read_csv("data/cohort.csv")
# Calculate person-time
cohort <- cohort %>%
mutate(
person_years = follow_up_days / 365.25,
incidence_rate = cases / person_years * 1000
)
# Incidence rates by exposure
incidence_table <- cohort %>%
group_by(exposure) %>%
summarize(
cases = sum(cases),
person_years = sum(person_years),
rate_per_1000 = cases / person_years * 1000,
ci_lower = (qchisq(0.025, 2 * cases) / 2) / person_years * 1000,
ci_upper = (qchisq(0.975, 2 * (cases + 1)) / 2) / person_years * 1000
)
Survey Data Analysis
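Sampling weights matter because each respondent stands in for a different number of people in the population. A minimal sketch in base R with three invented respondents, showing how far a weighted mean can move from the unweighted one:

```r
# Three invented respondents and the number of people each one represents
ages            <- c(30, 50, 70)
sampling_weight <- c(100, 200, 700)

unweighted_mean <- mean(ages)
weighted_mean   <- weighted.mean(ages, w = sampling_weight)

# Same calculation spelled out: sum(weight * value) / sum(weight)
weighted_by_hand <- sum(sampling_weight * ages) / sum(sampling_weight)

print(c(unweighted = unweighted_mean, weighted = weighted_mean))
```

`weighted.mean()` only gives the point estimate. Valid standard errors also need the strata and cluster information, which is what a survey design object carries.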
library(survey)
# Create survey design object
survey_design <- svydesign(
ids = ~cluster_id,
strata = ~strata,
weights = ~sampling_weight,
data = survey_data
)
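# Weighted means
svymean(~age, survey_design, na.rm = TRUE)
# Weighted counts, then proportions within each gender
weighted_counts <- svytable(~diabetes + gender, survey_design)
prop.table(weighted_counts, margin = 2)
# Weighted regression
# quasibinomial() fits the same model as binomial but avoids the
# "non-integer #successes" warnings that sampling weights trigger
svy_model <- svyglm(diabetes ~ age + bmi + gender,
                    design = survey_design,
                    family = quasibinomial())
summary(svy_model)
Tips for Success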
1. Use Projects 📁
Always work in RStudio Projects:
# Create new project: File > New Project > New Directory
# Benefits:
# - Organized file structure
# - Portable paths
# - Version control integration
2. Comment Your Code 💬
Comments should explain why a step is there, not only what it does.
3. Use the Pipe %>% 🔗
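The `%>%` pipe comes from the magrittr package, which the tidyverse loads for you. Since R 4.1.0, base R also has a native pipe, `|>`, that needs no packages. A minimal sketch with invented ages, assuming R 4.1.0 or later:

```r
# Invented ages, including one child
ages <- c(23, 45, 67, 34, 56, 12)

# Nested calls read inside-out
nested <- round(mean(ages[ages >= 18]), 1)

# The native pipe reads left to right: keep adults, average, round
adult_mean_age <- ages[ages >= 18] |> mean() |> round(1)

print(adult_mean_age)
```

Both forms return the same value; the piped version simply lists the steps in the order they happen.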
# Without pipe (hard to read)
summarize(group_by(filter(data, age > 18), treatment), mean_age = mean(age))
# With pipe (easy to read)
data %>%
filter(age > 18) %>%
group_by(treatment) %>%
summarize(mean_age = mean(age))
4. Handle Missing Data ❓
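Missing values propagate silently through most R functions, so it pays to see the behaviour on a tiny example first. A minimal sketch in base R with an invented data frame:

```r
# Invented vitals with gaps
vitals <- data.frame(
  age = c(34, NA, 51, 67, NA),
  sbp = c(118, 131, NA, 145, 122)
)

# Count missing values per column
missing_per_column <- colSums(is.na(vitals))
print(missing_per_column)

# By default a single NA makes the whole mean NA
mean_age_default <- mean(vitals$age)

# na.rm = TRUE drops the missing values before averaging
mean_age_observed <- mean(vitals$age, na.rm = TRUE)

# Keep only rows with no missing values in any column
complete_rows <- vitals[complete.cases(vitals), ]
print(complete_rows)
```

Dropping incomplete rows is the simplest option, not always the right one: here it discards three of five patients, which is worth reporting alongside any result.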
# Check for missing
sum(is.na(data$age))
colSums(is.na(data))
# Visualize missing patterns
library(naniar)
vis_miss(data)
# Handle missing in analysis
mean(data$age, na.rm = TRUE) # Remove NA
5. Save Your Work 💾
# Save cleaned data
write_csv(clean_data, "data/clean/patients_clean.csv")
# Save R objects
saveRDS(model, "outputs/final_model.rds")
# Load R objects
model <- readRDS("outputs/final_model.rds")
Common Mistakes to Avoid
❌ Using attach() - Use data %>% instead
❌ Not setting working directory - Use RStudio Projects
❌ Overwriting original data - Always create new objects
❌ Not checking assumptions - Use diagnostic plots
❌ Ignoring warnings - They’re there for a reason!
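Two of these mistakes, relying on attach() and overwriting the original data, have simple alternatives even in base R. A minimal sketch with an invented data frame:

```r
# Invented raw data
patients <- data.frame(
  age = c(34, 51, 67),
  sbp = c(118, 131, 145)
)

# Instead of attach(patients): with() evaluates one expression inside the
# data frame and leaves the search path untouched
mean_age <- with(patients, mean(age))

# Instead of overwriting patients: derive a new object and keep the raw one
patients_clean <- transform(
  patients,
  age_group = ifelse(age >= 65, "Senior", "Adult")
)

print(mean_age)
print(patients_clean)
```

Because `patients` is never modified, every later step can be rerun from the raw data without reloading the file.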
Learning Resources
Free Online Courses
- R for Data Science - Free online book
- Coursera: R Programming - Johns Hopkins
- DataCamp: Introduction to R - Free chapter
- Swirl - Learn R in R
Health-Specific Resources
Communities
- RStudio Community
- Stack Overflow R Tag
- R-Ladies - Global organization promoting gender diversity
- #rstats on Twitter
Your 30-Day R Learning Plan
Week 1: Basics
- Day 1-2: Install R and RStudio, learn basic syntax
- Day 3-4: Data types and structures
- Day 5-7: Import and explore data
Week 2: Data Manipulation
- Day 8-10: Learn dplyr verbs (select, filter, mutate, etc.)
- Day 11-12: Grouping and summarizing
- Day 13-14: Joining datasets
Week 3: Visualization & Statistics
- Day 15-17: ggplot2 basics
- Day 18-20: Basic statistical tests
- Day 21: Practice project
Week 4: Advanced Topics
- Day 22-24: Linear and logistic regression
- Day 25-26: R Markdown reports
- Day 27-28: Your own health data project
- Day 29-30: Share your work!
Conclusion
R is a powerful tool that will transform how you analyze health data. Start small, practice daily, and don’t be afraid to make mistakes—that’s how you learn!
Remember:
- Use RStudio Projects
- Comment your code
- Save your work frequently
- Ask for help when stuck
- Share your knowledge
Your journey to R mastery starts today! 🚀
Questions about R for health research? Drop them in the comments!