Chapter 32 Statistical Best Practices
What You’ll Learn:
- P-value interpretation
- Multiple testing corrections
- Effect sizes and power
- Reproducible analyses
- Common pitfalls
Key Errors Covered: 15+ conceptual errors
Difficulty: ⭐⭐⭐ Advanced
32.2 P-values
💡 Key Insight: Understanding P-values
# A p-value is NOT:
# - The probability that the hypothesis is true
# - The probability that the result will replicate
# - An effect size
# A p-value IS:
# - P(data at least this extreme | H0 true)
# - The probability of seeing data this extreme, or more extreme,
#   if the null hypothesis is true
# Example
t_result <- t.test(mpg ~ am, data = mtcars)
cat("P-value:", round(t_result$p.value, 4), "\n")
cat("\nInterpretation:\n")
cat("If there were truly no difference in mpg between transmissions,\n")
cat("we would see a difference this large or larger in",
round(t_result$p.value * 100, 2), "% of studies.\n")
# Common thresholds
# (wrapped in braces: a top-level `else` starting a new line is a parse error)
{
  if (t_result$p.value < 0.001) cat("\nVery strong evidence against H0\n")
  else if (t_result$p.value < 0.01) cat("\nStrong evidence against H0\n")
  else if (t_result$p.value < 0.05) cat("\nModerate evidence against H0\n")
  else cat("\nWeak evidence against H0\n")
}
#>
#> Strong evidence against H0
32.3 Multiple Testing
💡 Key Insight: Adjusting for Multiple Tests
# Problem: More tests = more false positives
set.seed(123)
n_tests <- 20
p_values <- replicate(n_tests, {
  x <- rnorm(30)
  y <- rnorm(30)
  t.test(x, y)$p.value
})
cat("Significant at α = 0.05:", sum(p_values < 0.05), "out of", n_tests, "\n")
#> Significant at α = 0.05: 1 out of 20
cat("Expected false positives:", n_tests * 0.05, "\n")
#> Expected false positives: 1
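The problem compounds quickly: with m independent tests at level α, the chance of at least one false positive is 1 - (1 - α)^m. A one-line check:
# Family-wise error rate for 20 independent tests at alpha = 0.05
1 - (1 - 0.05)^20  # about 0.64: a ~64% chance of at least one false positive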
# Solutions:
# 1. Bonferroni correction (conservative)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
cat("\nAfter Bonferroni:", sum(p_bonferroni < 0.05), "significant\n")
#>
#> After Bonferroni: 0 significant
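Bonferroni is easy to compute by hand, which makes it a good sanity check on p.adjust(): each p-value is multiplied by the number of tests and capped at 1.
# Manual Bonferroni, for comparison with p.adjust()
all.equal(p_bonferroni, pmin(p_values * n_tests, 1))  # should be TRUE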
# 2. FDR control (less conservative)
p_fdr <- p.adjust(p_values, method = "fdr")
cat("After FDR:", sum(p_fdr < 0.05), "significant\n")
#> After FDR: 0 significant
# 3. Holm method (sequential)
p_holm <- p.adjust(p_values, method = "holm")
cat("After Holm:", sum(p_holm < 0.05), "significant\n")
#> After Holm: 0 significant
32.4 Effect Sizes
🎯 Best Practice: Always Report Effect Sizes
# t-test with effect size
auto <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
t_result <- t.test(auto, manual)
# Cohen's d
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}
d <- cohens_d(auto, manual)
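A packaged implementation is a useful cross-check on hand-rolled formulas; a hedged sketch, assuming the effsize package is installed:
# Optional cross-check against the effsize package (skipped if absent);
# its estimate should match our cohens_d() up to sign conventions
if (requireNamespace("effsize", quietly = TRUE)) {
  print(effsize::cohen.d(auto, manual))
}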
# Complete reporting
cat("=== Complete Statistical Report ===\n\n")
#> === Complete Statistical Report ===
cat("Descriptives:\n")
#> Descriptives:
cat(" Automatic: M =", round(mean(auto), 2), ", SD =", round(sd(auto), 2), ", n =", length(auto), "\n")
#> Automatic: M = 17.15 , SD = 3.83 , n = 19
cat(" Manual: M =", round(mean(manual), 2), ", SD =", round(sd(manual), 2), ", n =", length(manual), "\n\n")
#> Manual: M = 24.39 , SD = 6.17 , n = 13
cat("Inferential:\n")
#> Inferential:
cat(" t(", round(t_result$parameter, 1), ") = ", round(t_result$statistic, 2), "\n", sep = "")
#> t(18.3) = -3.77
cat(" p =", format.pval(t_result$p.value, digits = 3), "\n")
#> p = 0.00137
cat(" 95% CI: [", round(t_result$conf.int[1], 2), ", ", round(t_result$conf.int[2], 2), "]\n", sep = "")
#> 95% CI: [-11.28, -3.21]
cat(" Cohen's d =", round(d, 2), "\n")
#> Cohen's d = -1.48
32.5 Power Analysis
🎯 Best Practice: Consider Statistical Power
# A priori power analysis (before study)
power_result <- power.t.test(
  delta = 5,        # Expected difference in means
  sd = 6,           # Expected SD
  sig.level = 0.05,
  power = 0.80
)
cat("Required sample size per group:", ceiling(power_result$n), "\n")
#> Required sample size per group: 24
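power.t.test() solves for whichever of n, delta, sd, power, and sig.level is left unspecified. For instance, fixing n at the value just computed and omitting power returns the achieved power:
# Solve for power instead of n (same effect size and alpha as above)
power.t.test(n = 24, delta = 5, sd = 6, sig.level = 0.05)$power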
# Post-hoc power analysis (after study)
# Note: power.t.test() assumes equal group sizes, so using the larger
# group's n makes this an approximation for these unequal groups; also,
# sd(c(auto, manual)) folds the group difference into the SD estimate
power_observed <- power.t.test(
  n = length(auto),
  delta = abs(mean(auto) - mean(manual)),
  sd = sd(c(auto, manual)),
  sig.level = 0.05
)
cat("Observed power:", round(power_observed$power, 2), "\n")
#> Observed power: 0.95
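For designs beyond the two-sample t-test, the pwr package covers correlations, proportions, ANOVA, and more. A hedged sketch of the equivalent a priori calculation, assuming pwr is installed (it works in Cohen's d units, here d = 5/6); the required n should closely match the base-R result above:
# Same a priori question as above, in standardized units
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.t.test(d = 5 / 6, sig.level = 0.05, power = 0.80))
}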
# Sensitivity analysis
cat("\nDetectable effect sizes with 80% power:\n")
#>
#> Detectable effect sizes with 80% power:
for (n in c(10, 20, 30, 50)) {
  result <- power.t.test(n = n, sd = 6, sig.level = 0.05, power = 0.80)
  cat(" n =", n, ": d =", round(result$delta / 6, 2), "\n")
}
#> n = 10 : d = 1.32
#> n = 20 : d = 0.91
#> n = 30 : d = 0.74
#> n = 50 : d = 0.57
32.6 Reproducible Analysis
🎯 Best Practice: Make Analyses Reproducible
# Set seed for reproducibility
set.seed(42)
# Document R version and packages
cat("R version:", R.version.string, "\n")
#> R version: R version 4.5.0 (2025-04-11)
cat("Packages:\n")
#> Packages:
cat(" dplyr:", as.character(packageVersion("dplyr")), "\n")
#> dplyr: 1.1.4
# Save session info
# sessionInfo()
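To make the record durable, the session details can be written to a file that travels with the results; a minimal sketch (the file name is an arbitrary choice):
# Archive full session details alongside the analysis output
writeLines(capture.output(sessionInfo()), "session_info.txt")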
# Complete analysis function
reproducible_t_test <- function(data, formula, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Record context
  context <- list(
    date = Sys.time(),
    r_version = R.version.string,
    data_rows = nrow(data),
    seed = seed
  )
  # Perform test
  result <- t.test(formula, data = data)
  # Return everything
  list(
    result = result,
    context = context,
    data_summary = summary(data)
  )
}
# Use it
analysis <- reproducible_t_test(mtcars, mpg ~ am, seed = 123)
analysis$result
#>
#> Welch Two Sample t-test
#>
#> data: mpg by am
#> t = -3.7671, df = 18.332, p-value = 0.001374
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> -11.280194 -3.209684
#> sample estimates:
#> mean in group 0 mean in group 1
#> 17.14737 24.39231
32.7 Common Pitfalls
⚠️ Avoid These Common Mistakes
# Pitfall 1: P-hacking (testing until significant)
# BAD: Keep testing different variables until p < 0.05
# GOOD: Pre-specify hypotheses
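To make Pitfall 1 concrete, here is a minimal simulation of one common form of p-hacking, optional stopping: testing repeatedly as data accumulate and stopping at the first p < 0.05. Even with no true effect, the false-positive rate lands well above the nominal 5% (the peeking schedule below is an arbitrary choice):
# H0 is true in every replication; we "peek" after every 10 observations
set.seed(1)
false_pos <- replicate(1000, {
  x <- rnorm(100)
  any(sapply(seq(20, 100, by = 10),
             function(n) t.test(x[1:n])$p.value < 0.05))
})
mean(false_pos)  # substantially above 0.05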
# Pitfall 2: HARKing (Hypothesizing After Results Known)
# BAD: Claim you predicted a finding you discovered
# GOOD: Be transparent about exploratory vs confirmatory
# Pitfall 3: Not checking assumptions
model <- lm(mpg ~ hp, data = mtcars)
# BAD: Trust results without checking
# GOOD: Check diagnostics (residuals vs fitted, Q-Q, scale-location, leverage)
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
# Pitfall 4: Confusing significance with importance
# p < 0.05 doesn't mean effect is large or meaningful
# Always report effect sizes
# Pitfall 5: Ignoring multiple comparisons
# BAD: Do 20 t-tests, report any p < 0.05
# GOOD: Adjust for multiple testing
# Pitfall 6: Treating p = 0.051 differently than p = 0.049
# BAD: Arbitrary cutoffs
# GOOD: Report actual p-values and confidence intervals



32.8 Reporting Results
🎯 Best Practice: Complete Reporting Template
report_t_test <- function(x, y, test_name = "t-test") {
  # Perform test
  result <- t.test(x, y)
  # Calculate effect size (Cohen's d with pooled SD)
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  d <- (mean(x) - mean(y)) / s_pooled
  # Format report
  cat("======================\n")
  cat(test_name, "\n")
  cat("======================\n\n")
  cat("Sample sizes:\n")
  cat(" Group 1: n =", n1, "\n")
  cat(" Group 2: n =", n2, "\n\n")
  cat("Descriptive statistics:\n")
  cat(" Group 1: M =", round(mean(x), 2), ", SD =", round(sd(x), 2), "\n")
  cat(" Group 2: M =", round(mean(y), 2), ", SD =", round(sd(y), 2), "\n")
  cat(" Difference: M =", round(mean(x) - mean(y), 2), "\n\n")
  cat("Inferential statistics:\n")
  cat(" t(", round(result$parameter, 1), ") = ", round(result$statistic, 2), "\n", sep = "")
  cat(" p-value:", format.pval(result$p.value, digits = 3), "\n")
  cat(" 95% CI: [", round(result$conf.int[1], 2), ", ",
      round(result$conf.int[2], 2), "]\n", sep = "")
  cat(" Cohen's d =", round(d, 2))
  if (abs(d) < 0.2) {
    cat(" (negligible)")
  } else if (abs(d) < 0.5) {
    cat(" (small)")
  } else if (abs(d) < 0.8) {
    cat(" (medium)")
  } else {
    cat(" (large)")
  }
  cat("\n\nInterpretation:\n")
  if (result$p.value < 0.05) {
    cat(" Statistically significant difference (p < 0.05)\n")
  } else {
    cat(" No statistically significant difference (p >= 0.05)\n")
  }
  cat(" Effect size is", ifelse(abs(d) < 0.5, "small", "moderate to large"), "\n")
  invisible(list(result = result, effect_size = d))
}
# Use it
auto <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
report_t_test(auto, manual, "Transmission Type Comparison")
#> ======================
#> Transmission Type Comparison
#> ======================
#>
#> Sample sizes:
#> Group 1: n = 19
#> Group 2: n = 13
#>
#> Descriptive statistics:
#> Group 1: M = 17.15 , SD = 3.83
#> Group 2: M = 24.39 , SD = 6.17
#> Difference: M = -7.24
#>
#> Inferential statistics:
#> t(18.3) = -3.77
#> p-value: 0.00137
#> 95% CI: [-11.28, -3.21]
#> Cohen's d = -1.48 (large)
#>
#> Interpretation:
#> Statistically significant difference (p < 0.05)
#> Effect size is moderate to large
32.9 Checklist
🎯 Pre-Analysis Checklist
Before running any statistical test:
- Pre-specify hypotheses and the significance level
- Check sample size, missing data, and test assumptions
- Plan adjustments if multiple tests will be run
After running the test:
- Report exact p-values, confidence intervals, and effect sizes
- Interpret results in context, not just against p < 0.05
- Label exploratory findings as exploratory
# Example checklist function
analysis_checklist <- function(data, dv, iv) {
  cat("=== Analysis Checklist ===\n\n")
  # Sample size
  n <- nrow(data)
  cat("✓ Sample size:", n, "\n")
  if (n < 30) cat(" ⚠ Small sample size\n")
  # Missing data
  n_missing <- sum(!complete.cases(data))
  cat("✓ Missing data:", n_missing, "cases\n")
  if (n_missing > 0) cat(" ⚠ Consider handling missing data\n")
  # Assumptions
  cat("✓ Check assumptions:\n")
  cat(" - Independence: Consider study design\n")
  cat(" - Normality: Use shapiro.test() or Q-Q plot\n")
  cat(" - Equal variances: Use var.test()\n\n")
  cat("✓ Analysis ready!\n")
}
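Before declaring the analysis ready, the assumption checks the function lists can be run directly; a minimal sketch using the auto/manual split defined earlier:
# Normality within each group (Shapiro-Wilk)
shapiro.test(auto)
shapiro.test(manual)
# Equality of variances (F test; itself sensitive to non-normality)
var.test(auto, manual)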
analysis_checklist(mtcars, "mpg", "am")
#> === Analysis Checklist ===
#>
#> ✓ Sample size: 32
#> ✓ Missing data: 0 cases
#> ✓ Check assumptions:
#> - Independence: Consider study design
#> - Normality: Use shapiro.test() or Q-Q plot
#> - Equal variances: Use var.test()
#>
#> ✓ Analysis ready!
32.10 Summary
Key Takeaways:
- P-values measure evidence - Not truth or importance
- Adjust for multiple tests - Use Bonferroni, FDR, or Holm
- Always report effect sizes - Cohen’s d, R², etc.
- Consider power - Before and after study
- Be reproducible - Set seeds, document versions
- Check assumptions - Don’t blindly trust results
- Report completely - All statistics, not just p-values
Reporting Template:
# Complete report includes:
1. Sample sizes
2. Descriptive statistics (M, SD)
3. Test statistic and df
4. P-value (exact, not just p < 0.05)
5. Confidence interval
6. Effect size
7. Interpretation in context
Best Practices:
# ✅ Good
Pre-specify hypotheses
Check all assumptions
Report effect sizes
Adjust for multiple testing
Use confidence intervals
Be transparent about exploratory analyses
Set random seeds
Document everything
# ❌ Avoid
P-hacking
HARKing
Treating p = 0.05 as magic threshold
Only reporting significant results
Ignoring effect sizes
Not checking assumptions
Cherry-picking results