Source: Awesome-Agent-Skills-for-Empirical-Research
(https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research),
`skills/26-Data-Wise-scholar/skills/implementation/statistical-software-qa/skill.md`
# Statistical Software QA

Quality assurance patterns and testing strategies for statistical R packages.

Use this skill when working on: R package testing, numerical accuracy validation, reference-implementation comparison, edge-case identification, statistical correctness verification, or software quality assurance for methodology packages.
## Testing Philosophy

### Statistical Software is Different

Unlike typical software, where "correct output" is easy to define, statistical software must:

- **Produce statistically correct results**: not merely bug-free
- **Handle edge cases gracefully**: small samples, boundary conditions
- **Maintain numerical precision**: floating-point considerations
- **Match known results**: verification against published examples
- **Degrade gracefully**: informative errors, not crashes
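The "degrade gracefully" point can be made concrete with a small sketch. The function name, required columns, and thresholds below are illustrative assumptions, not part of any existing package:

```r
# Hypothetical input validator: fail early with informative messages
# instead of letting cryptic errors surface deep inside model fitting.
check_mediation_data <- function(data, vars = c("x", "m", "y")) {
  missing_vars <- setdiff(vars, names(data))
  if (length(missing_vars) > 0) {
    stop("data is missing required columns: ",
         paste(missing_vars, collapse = ", "), call. = FALSE)
  }
  if (nrow(data) < 4) {
    stop("at least 4 rows are required for estimation; got ", nrow(data),
         call. = FALSE)
  }
  if (nrow(data) < 30) {
    warning("small sample size (n = ", nrow(data),
            "); estimates may be unstable", call. = FALSE)
  }
  invisible(data)
}
```

A validator like this is also what makes tests such as `expect_error(..., "variance|constant")` possible: informative conditions are testable, crashes are not.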
### Testing Pyramid for Statistical Packages

```
                    /\
                   /  \
                  /    \
                 /Manual\        <- Human review of outputs
                / Tests  \
               /──────────\
              /Integration \     <- Cross-function workflows
             /    Tests     \
            /────────────────\
           /Statistical Tests \  <- Correctness verification
          /────────────────────\
         /   Reference Tests    \  <- Match known results
        /────────────────────────\
       /       Unit Tests         \  <- Individual functions
      /____________________________\
```
## Unit Testing Patterns

### Pattern 1: Known-Value Tests

Test against analytically derivable results:

```r
test_that("indirect effect equals a*b for simple mediation", {
  # Create data where the true values are known
  set.seed(42)
  n <- 10000
  x <- rnorm(n)
  m <- 0.5 * x + rnorm(n, sd = 0.1)  # a = 0.5
  y <- 0.3 * m + rnorm(n, sd = 0.1)  # b = 0.3

  result <- mediate(y ~ x + m, mediator = "m", data = data.frame(x, m, y))

  # True indirect effect = 0.5 * 0.3 = 0.15
  expect_equal(result$indirect, 0.15, tolerance = 0.02)
})
```
### Pattern 2: Boundary-Condition Tests

```r
test_that("handles minimum sample size", {
  # Minimum viable sample
  small_data <- data.frame(
    x = c(0, 0, 1, 1),
    m = c(0, 1, 1, 2),
    y = c(1, 2, 2, 3)
  )

  # Should run without error
  expect_no_error(mediate(y ~ x + m, mediator = "m", data = small_data))

  # Should warn about low power
  expect_warning(
    mediate(y ~ x + m, mediator = "m", data = small_data),
    "sample size"
  )
})
```
### Pattern 3: Equivalence Tests

```r
test_that("bootstrap CI matches delta-method CI asymptotically", {
  set.seed(123)
  data <- simulate_mediation(n = 5000, a = 0.3, b = 0.4)

  boot_result  <- mediate(data, method = "bootstrap", R = 2000)
  delta_result <- mediate(data, method = "delta")

  # CIs should be similar for large n
  expect_equal(boot_result$ci, delta_result$ci, tolerance = 0.05)
})
```
## Reference Implementation Testing

### Strategy: Compare Against Published Results

```r
test_that("matches Imai et al. (2010) JOBS II example", {
  # Load reference data from the mediation package
  data("jobs", package = "mediation")

  # Our implementation
  our_result <- our_mediate(
    outcome  = job_seek ~ treat + econ_hard + sex + age,
    mediator = job_disc ~ treat + econ_hard + sex + age,
    data = jobs
  )

  # Published results (from Table 2 of the paper)
  expected_acme    <- 0.015
  expected_acme_ci <- c(-0.004, 0.035)

  expect_equal(our_result$acme, expected_acme, tolerance = 0.005)
  expect_equal(our_result$acme_ci, expected_acme_ci, tolerance = 0.01)
})
```
### Cross-Package Validation

```r
test_that("matches lavaan for SEM-based mediation", {
  data <- simulate_mediation(n = 1000)

  # Our implementation
  our_result <- our_mediate(data)

  # lavaan implementation
  library(lavaan)
  model <- '
    m ~ a*x
    y ~ b*m + c*x
    indirect := a*b
  '
  lavaan_fit <- sem(model, data = data)
  lavaan_indirect <- parameterEstimates(lavaan_fit)[
    parameterEstimates(lavaan_fit)$label == "indirect", "est"
  ]

  expect_equal(our_result$indirect, lavaan_indirect, tolerance = 0.01)
})
```
## Statistical Correctness Tests

### Coverage Probability Tests

Verify that confidence intervals achieve their nominal coverage:

```r
test_that("95% CI achieves nominal coverage", {
  set.seed(42)
  n_sims <- 1000
  true_indirect <- 0.15
  coverage <- 0

  for (i in 1:n_sims) {
    data <- simulate_mediation(n = 200, a = 0.5, b = 0.3)
    result <- mediate(data, conf.level = 0.95)
    if (result$ci[1] <= true_indirect && true_indirect <= result$ci[2]) {
      coverage <- coverage + 1
    }
  }
  coverage_rate <- coverage / n_sims

  # Coverage should fall between 93% and 97% (allowing for Monte Carlo error)
  expect_gte(coverage_rate, 0.93)
  expect_lte(coverage_rate, 0.97)
})
```
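The 93% and 97% bounds are not arbitrary: they follow from the Monte Carlo error of an estimated proportion. A minimal sketch (the helper name is illustrative):

```r
# Sketch: Monte Carlo error bounds for an estimated coverage rate.
# With n_sims independent replications, the standard error of a
# proportion p is sqrt(p * (1 - p) / n_sims); a band of +/- 3 SE
# around the nominal 95% gives roughly (0.93, 0.97) for n_sims = 1000.
mc_coverage_bounds <- function(nominal = 0.95, n_sims = 1000, z = 3) {
  se <- sqrt(nominal * (1 - nominal) / n_sims)
  c(lower = nominal - z * se, upper = nominal + z * se)
}

mc_coverage_bounds()
```

If you change `n_sims`, recompute the bounds rather than reusing 0.93/0.97; with fewer replications the acceptable band must be wider.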
### Bias Tests

```r
test_that("estimator is approximately unbiased", {
  set.seed(123)
  n_sims <- 500
  true_indirect <- 0.2
  estimates <- numeric(n_sims)

  for (i in 1:n_sims) {
    data <- simulate_mediation(n = 500, a = 0.5, b = 0.4)
    estimates[i] <- mediate(data)$indirect
  }

  # The mean estimate should be close to the true value
  bias <- mean(estimates) - true_indirect
  expect_lt(abs(bias), 0.02)  # Absolute bias below 0.02
})
```
### Type I Error Tests

```r
test_that("maintains nominal Type I error under the null", {
  set.seed(456)
  n_sims <- 1000
  rejections <- 0

  for (i in 1:n_sims) {
    # Null: no indirect effect (a = 0)
    data <- simulate_mediation(n = 200, a = 0, b = 0.5)
    result <- mediate(data, conf.level = 0.95)

    # Reject if the CI excludes 0
    if (result$ci[1] > 0 || result$ci[2] < 0) {
      rejections <- rejections + 1
    }
  }
  type1_rate <- rejections / n_sims

  # Should be close to the nominal 5%
  expect_lt(type1_rate, 0.07)  # Allow for Monte Carlo error
})
```
## Numerical Precision Testing

### Stability Under Scaling

```r
test_that("results are invariant to variable scaling", {
  data <- simulate_mediation(n = 500)

  # Original scale
  result1 <- mediate(data)

  # Scale variables by 1000
  data_scaled <- data
  data_scaled$y <- data$y * 1000
  data_scaled$m <- data$m * 1000
  result2 <- mediate(data_scaled)

  # Standardized effects should match
  expect_equal(
    result1$indirect / sd(data$y),
    result2$indirect / sd(data_scaled$y),
    tolerance = 1e-10
  )
})
```
### Extreme Values

```r
test_that("handles extreme correlations", {
  set.seed(789)

  # Nearly collinear predictors
  x <- rnorm(100)
  m <- x + rnorm(100, sd = 0.001)  # r ≈ 0.9999
  y <- m + rnorm(100, sd = 0.1)
  data <- data.frame(x, m, y)

  # Should warn about collinearity
  expect_warning(mediate(data), "collinear|singular")
})

test_that("handles near-zero variance", {
  data <- data.frame(
    x = c(rep(0, 99), 1),  # Almost constant
    m = rnorm(100),
    y = rnorm(100)
  )

  # Should fail with an informative error
  expect_error(mediate(data), "variance|constant")
})
```
## Edge Case Identification

### Edge Case Checklist

- **Sample size**: n = 3, 4, 10, 30, 100, 1000, 10000
- **Effect sizes**: 0, very small (0.01), medium (0.3), large (0.8), 1.0
- **Correlations**: 0, near-zero, high (> 0.9), perfect (1.0)
- **Missing data**: none, 1%, 10%, 50%
- **Outliers**: none, 1 extreme, 5%
- **Variable types**: continuous, binary, categorical
- **Model fit**: perfect, good, poor
### Systematic Edge Case Generation

```r
#' Generate Edge Case Test Suite
#'
#' @return List of edge case datasets
generate_edge_cases <- function() {
  list(
    # Minimal sample
    minimal = data.frame(
      x = c(0, 1, 0, 1),
      m = c(0, 0.5, 0.5, 1),
      y = c(0, 0.3, 0.3, 0.6)
    ),

    # Zero effect
    null_effect = {
      set.seed(1)
      n <- 100
      data.frame(
        x = rnorm(n),
        m = rnorm(n),  # No relationship with x
        y = rnorm(n)   # No relationship with m
      )
    },

    # Perfect mediation
    perfect = {
      x <- c(0, 0, 1, 1, 2, 2)
      m <- x  # Perfect a path
      y <- m  # Perfect b path
      data.frame(x, m, y)
    },

    # High collinearity
    collinear = {
      set.seed(2)
      x <- rnorm(100)
      m <- x + rnorm(100, sd = 0.01)
      y <- m + rnorm(100)
      data.frame(x, m, y)
    },

    # Outliers
    with_outliers = {
      set.seed(3)
      n <- 100
      x <- c(rnorm(n - 2), 10, -10)
      m <- 0.5 * x + c(rnorm(n - 2), 20, -20)
      y <- 0.3 * m + rnorm(n)
      data.frame(x, m, y)
    },

    # Missing data
    with_missing = {
      set.seed(4)
      n <- 100
      x <- rnorm(n)
      m <- 0.5 * x + rnorm(n)
      y <- 0.3 * m + rnorm(n)
      # Introduce 10% missingness
      m[sample(n, 10)] <- NA
      y[sample(n, 10)] <- NA
      data.frame(x, m, y)
    }
  )
}

# Run all edge cases
test_that("handles all edge cases", {
  edge_cases <- generate_edge_cases()
  for (name in names(edge_cases)) {
    expect_no_error(
      tryCatch(
        mediate(edge_cases[[name]]),
        warning = function(w) NULL  # Warnings are acceptable
      ),
      info = paste("Edge case:", name)
    )
  }
})
```
## Performance Testing

### Benchmarking

```r
test_that("meets performance requirements", {
  skip_on_cran()  # Skip during CRAN checks

  data <- simulate_mediation(n = 1000)

  # Point estimation should be fast
  time_point <- system.time({
    mediate(data, method = "delta")
  })["elapsed"]
  expect_lt(time_point, 0.1)  # < 100 ms

  # Bootstrap should be reasonable
  time_boot <- system.time({
    mediate(data, method = "bootstrap", R = 1000)
  })["elapsed"]
  expect_lt(time_boot, 10)  # < 10 s for 1000 bootstrap replicates
})
```
### Memory Testing

```r
test_that("memory usage is reasonable", {
  skip_on_cran()

  # Large dataset
  data <- simulate_mediation(n = 100000)

  mem_before <- pryr::mem_used()
  result <- mediate(data)
  mem_after <- pryr::mem_used()

  mem_increase <- as.numeric(mem_after - mem_before) / 1e6  # MB
  expect_lt(mem_increase, 100)  # Less than 100 MB increase
})
```
## Test Organization

### Recommended Test Structure

```
tests/
├── testthat/
│   ├── test-mediate.R            # Main function tests
│   ├── test-mediate-bootstrap.R  # Bootstrap-specific
│   ├── test-mediate-delta.R      # Delta-method-specific
│   ├── test-numerical.R          # Numerical precision
│   ├── test-edge-cases.R         # Edge cases
│   ├── test-reference.R          # Reference implementations
│   ├── test-coverage.R           # Statistical coverage
│   └── helper-simulate.R         # Test helpers
├── testthat.R
└── reference-results/            # Saved reference results
    ├── jobs-example.rds
    └── known-values.rds
```
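The `simulate_mediation()` helper used throughout these examples would live in `helper-simulate.R`. Its definition is not shown in this document; a minimal sketch consistent with how it is called might be (error SDs and defaults are assumptions):

```r
# Sketch of a test helper: simulate x -> m -> y data with known path
# coefficients a (x to m) and b (m to y), so the true indirect effect
# is a * b by construction.
simulate_mediation <- function(n, a = 0.5, b = 0.3, sd_m = 1, sd_y = 1) {
  x <- rnorm(n)
  m <- a * x + rnorm(n, sd = sd_m)
  y <- b * m + rnorm(n, sd = sd_y)
  data.frame(x = x, m = m, y = y)
}
```

Keeping data generation in one helper guarantees every statistical test uses the same, well-understood ground truth.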
### Test Tagging

```r
test_that("coverage probability (slow)", {
  skip_on_cran()
  skip_if_not(Sys.getenv("RUN_SLOW_TESTS") == "true")
  # ... slow coverage test ...
})
```
## Continuous Integration

### GitHub Actions Workflow

```yaml
# .github/workflows/R-CMD-check.yaml
name: R-CMD-check

on: [push, pull_request]

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
      - uses: r-lib/actions/check-r-package@v2

  coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
      - name: Test coverage
        run: covr::codecov()
        shell: Rscript {0}

  slow-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
      - name: Run slow tests
        run: |
          Sys.setenv(RUN_SLOW_TESTS = "true")
          testthat::test_local()
        shell: Rscript {0}
```
## QA Checklist

### Pre-Release Checklist

- All unit tests pass
- Reference tests match published results
- Test coverage > 80%
- No R CMD check warnings
- Edge cases handled gracefully
- Performance benchmarks met
- Documentation examples run
- Vignette builds without error
- NEWS.md updated
- Version number incremented
## Statistical Validation Template

```markdown
## Statistical Validation Report

**Package**: [name] v[version]
**Date**: [date]
**Validator**: [name]

### Coverage Probability (95% CI)

| Scenario      | N    | True Effect | Observed Coverage | Pass |
|---------------|------|-------------|-------------------|------|
| Small sample  | 50   | 0.15        | 94.2%             | Yes  |
| Medium sample | 200  | 0.15        | 95.1%             | Yes  |
| Large sample  | 1000 | 0.15        | 94.8%             | Yes  |
| Null effect   | 200  | 0           | 95.3%             | Yes  |

### Bias Assessment

| Scenario     | True | Mean Estimate | Bias  | Pass |
|--------------|------|---------------|-------|------|
| a=0.5, b=0.3 | 0.15 | 0.151         | 0.001 | Yes  |
| a=0.1, b=0.1 | 0.01 | 0.012         | 0.002 | Yes  |

### Reference Comparison

| Reference           | Our Result | Difference | Pass |
|---------------------|------------|------------|------|
| Imai et al. Table 2 | Match      | <0.001     | Yes  |
| lavaan output       | Match      | <0.001     | Yes  |
```
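Filling the validation-report tables by hand is error-prone; a small formatter can generate rows directly from simulation results. A sketch (the function name and pass bounds are illustrative, not part of any package):

```r
# Sketch: format one coverage-simulation result as a row of the
# validation-report table, applying the Monte Carlo pass bounds.
format_coverage_row <- function(scenario, n, true_effect, coverage,
                                lower = 0.93, upper = 0.97) {
  pass <- coverage >= lower && coverage <= upper
  sprintf("| %s | %d | %s | %.1f%% | %s |",
          scenario, n, format(true_effect), 100 * coverage,
          if (pass) "Yes" else "No")
}

format_coverage_row("Small sample", 50, 0.15, 0.942)
```

Generating the report programmatically keeps the published numbers tied to the simulations that produced them.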
## References

### R Package Testing

- Wickham, H. (2011). testthat: Get started with testing. *The R Journal*.
- Wickham, H., & Bryan, J. (2023). *R Packages* (Testing chapter).

### Statistical Software Validation

- FDA Guidance for Industry: Statistical Software Validation.
- Altman, M., et al. (2004). *Numerical Issues in Statistical Computing*.

### Quality Assurance

- McCullough, B. D. (1999). Assessing the reliability of statistical software.
- Keeling, K. B., & Pavur, R. J. (2007). Statistical accuracy of spreadsheet software.
---

**Version**: 1.0.0
**Created**: 2025-12-09
**Domain**: Quality assurance for statistical R packages
**Target**: MediationVerse ecosystem and similar methodology packages