SciAgent-Skills statistical-analysis
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/biostatistics/statistical-analysis" ~/.claude/skills/jaechang-hits-sciagent-skills-statistical-analysis && rm -rf "$T"
skills/biostatistics/statistical-analysis/SKILL.md

Statistical Analysis
Overview
Statistical analysis is the systematic process of selecting appropriate tests, verifying assumptions, quantifying effect magnitudes, and reporting results. This knowhow guides test selection, assumption diagnostics, and APA-style reporting for frequentist and Bayesian analyses in academic research.
Key Concepts
Frequentist vs Bayesian Framework
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Core output | p-value, confidence interval | Posterior distribution, credible interval |
| Interpretation | "How likely is this data if H0 is true?" | "How likely is H1 given the data?" |
| Null support | Cannot support H0 (only fail to reject) | Can quantify evidence for H0 via Bayes Factor |
| Prior info | Not used | Incorporated via prior distributions |
| Sample size | Requires adequate power | Works with any sample size |
| Best for | Standard analyses, large samples | Small samples, prior info, complex models |
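The "Null support" row can be made concrete with a minimal Bayes factor sketch (not from the skill's bundled references; `bf10_binomial` is a hypothetical helper). For a binomial test of H0: p = 0.5 versus H1: p ~ Uniform(0, 1), both marginal likelihoods have closed forms, and the binomial coefficient cancels:

```python
from math import lgamma, exp, log

def bf10_binomial(k, n):
    """Bayes factor BF10 for H0: p = 0.5 vs H1: p ~ Uniform(0, 1),
    given k successes in n Bernoulli trials."""
    # Marginal under H1: integral of p^k (1-p)^(n-k) dp = Beta(k+1, n-k+1)
    log_m1 = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)
    # Likelihood of the sequence under H0: 0.5^n
    log_m0 = n * log(0.5)
    return exp(log_m1 - log_m0)

bf10 = bf10_binomial(10, 20)   # 10 heads in 20 flips
bf01 = 1 / bf10                # evidence FOR the null
# bf10 ≈ 0.27, bf01 ≈ 3.7: moderate evidence for H0 -- something a
# non-significant frequentist p-value cannot express.
```

This is the asymmetry the table describes: the frequentist test can only fail to reject, while BF01 > 3 actively quantifies support for the null.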
Statistical vs Practical Significance
A statistically significant result (p < .05) may be trivially small in practice. Always report:
- Effect size: Magnitude of the effect (Cohen's d, eta-squared, r, R-squared)
- Confidence interval: Precision of the estimate
- Context: Clinical/practical relevance in the domain
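The gap between statistical and practical significance is easy to demonstrate with a small simulation (synthetic data, illustrative only): with 50,000 observations per group, a negligible true effect (d = 0.05) still comes out highly significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50_000
a = rng.normal(0.00, 1.0, n)   # control group
b = rng.normal(0.05, 1.0, n)   # true effect d = 0.05 -- negligible in practice

t, p = stats.ttest_ind(a, b)

# Pooled-SD Cohen's d for the observed samples
sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / sp

print(f"p = {p:.2e}, d = {d:.3f}")  # significant p, trivially small d
```

Reporting the effect size and its confidence interval, not just the p-value, is what exposes this.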
Common Effect Sizes
| Test | Effect Size | Small | Medium | Large |
|---|---|---|---|---|
| t-test | Cohen's d | 0.20 | 0.50 | 0.80 |
| t-test (small n) | Hedges' g | 0.20 | 0.50 | 0.80 |
| ANOVA | eta-squared partial | 0.01 | 0.06 | 0.14 |
| ANOVA | omega-squared | 0.01 | 0.06 | 0.14 |
| Correlation | r | 0.10 | 0.30 | 0.50 |
| Regression | R-squared | 0.02 | 0.13 | 0.26 |
| Regression | f-squared | 0.02 | 0.15 | 0.35 |
| Chi-square | Cramer's V | 0.07 | 0.21 | 0.35 |
| Chi-square 2x2 | phi coefficient | 0.10 | 0.30 | 0.50 |
Cohen's benchmarks are guidelines, not rigid thresholds -- domain context always matters.
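The first two rows of the table can be computed directly; a minimal sketch (hypothetical helper names, not from the bundled scripts) showing pooled-SD Cohen's d and the small-sample Hedges' g correction:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with pooled standard deviation (independent groups)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def hedges_g(x, y):
    """Hedges' g: Cohen's d scaled by the small-sample bias correction
    J ~= 1 - 3 / (4 * (nx + ny) - 9)."""
    d = cohens_d(x, y)
    return d * (1 - 3 / (4 * (len(x) + len(y)) - 9))
```

For small n the correction shrinks d toward zero, which is why the table lists Hedges' g as the small-sample variant; `pingouin` reports both out of the box.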
Assumptions Overview
Most parametric tests require:
- Independence: Observations are independent of each other
- Normality: Data (or residuals) are approximately normally distributed
- Homogeneity of variance: Groups have similar variances (for group comparisons)
- Linearity: Relationship between variables is linear (for regression)
When assumptions are violated:
- Normality violated, n > 30: Proceed -- parametric tests are robust with large samples
- Normality violated, n < 30: Use non-parametric alternative
- Variance heterogeneity: Use Welch's correction (t-test) or Welch's ANOVA
- Linearity violated: Add polynomial terms, transform variables, or use GAMs
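The variance-heterogeneity remedy above is a one-argument change in scipy (synthetic data for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 30)   # group with SD = 1
b = rng.normal(0.5, 3.0, 50)   # group with SD = 3 (unequal variances, unequal n)

# Student's t-test assumes equal variances; Welch's does not.
t_student, p_student = stats.ttest_ind(a, b)                   # classic
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)      # Welch

# Non-parametric fallback when normality fails:
u, p_mwu = stats.mannwhitneyu(a, b, alternative="two-sided")
```

With unequal group sizes and variances the two t statistics and p-values diverge, and the pooled-variance version can be badly miscalibrated; many style guides recommend defaulting to Welch's test.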
Test-Specific Assumption Workflows
T-test assumptions: (1) Check normality per group with Shapiro-Wilk + Q-Q plots. (2) Check homogeneity with Levene's test. (3) If normality violated: Mann-Whitney U (independent) or Wilcoxon signed-rank (paired). If variance heterogeneity: use Welch's t-test.
ANOVA assumptions: (1) Normality per group. (2) Homogeneity via Levene's test. (3) For repeated measures: check sphericity (Mauchly's test); if violated, apply Greenhouse-Geisser (epsilon < 0.75) or Huynh-Feldt (epsilon > 0.75) correction. (4) If normality violated: Kruskal-Wallis (independent) or Friedman (repeated).
Linear regression assumptions: (1) Linearity via residuals-vs-fitted plot. (2) Independence via Durbin-Watson test (1.5-2.5 acceptable). (3) Homoscedasticity via Breusch-Pagan test + scale-location plot. (4) Normality of residuals via Q-Q plot + Shapiro-Wilk. (5) Multicollinearity via VIF (>10 = severe, >5 = moderate).
Logistic regression assumptions: (1) Independence. (2) Linearity of log-odds with continuous predictors (Box-Tidwell test). (3) No perfect multicollinearity (VIF). (4) Adequate sample size (10-20 events per predictor minimum).
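The first two checks in the t-test workflow above map onto two scipy calls; a sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 5, 35)
group_b = rng.normal(53, 5, 35)

# (1) Normality per group: Shapiro-Wilk (pair with Q-Q plots in practice)
for name, g in (("A", group_a), ("B", group_b)):
    w, p = stats.shapiro(g)
    print(f"group {name}: W = {w:.3f}, p = {p:.3f}")  # p > .05 -> no evidence of non-normality

# (2) Homogeneity of variance: Levene's test
stat, p_levene = stats.levene(group_a, group_b)

# (3) Choose the test accordingly
if p_levene < 0.05:
    result = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch
else:
    result = stats.ttest_ind(group_a, group_b)
```

The same pattern generalizes to the ANOVA workflow by passing three or more groups to `stats.levene` and using `stats.kruskal` as the non-parametric branch.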
Specialized Test Categories
Beyond the main decision flowchart, several specialized test families address specific data types:
Survival / time-to-event analysis:
- Log-rank test: Compares survival curves between groups (non-parametric)
- Cox proportional hazards: Models time-to-event with covariates; assumes proportional hazards
- Parametric survival models: Weibull, exponential, log-normal for known distributional forms
- Use when outcome is time until an event (death, relapse, failure) with possible censoring
Count outcome models:
- Poisson regression: For count data where mean approximately equals variance
- Negative binomial regression: For overdispersed counts (variance > mean)
- Zero-inflated models: For excess zeros beyond what Poisson/NB predicts
- Use when outcome is a count (number of events, incidents, occurrences)
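A quick screen for choosing between Poisson and negative binomial is the sample variance-to-mean ratio (synthetic counts below; for a formal check, fit a Poisson GLM in statsmodels and inspect the Pearson chi-square / df ratio):

```python
import numpy as np

rng = np.random.default_rng(7)
poisson_counts = rng.poisson(lam=3.0, size=1000)                # variance ~= mean
overdispersed = rng.negative_binomial(n=2, p=0.4, size=1000)    # variance > mean

ratio_pois = poisson_counts.var(ddof=1) / poisson_counts.mean()
ratio_nb = overdispersed.var(ddof=1) / overdispersed.mean()

print(f"Poisson counts: variance/mean = {ratio_pois:.2f}")  # near 1
print(f"NB counts:      variance/mean = {ratio_nb:.2f}")    # well above 1
```

A ratio near 1 is consistent with Poisson regression; a ratio well above 1 signals overdispersion and points to the negative binomial model.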
Agreement and reliability:
- Cohen's kappa: Inter-rater agreement for categorical ratings (2 raters)
- Fleiss' kappa / Krippendorff's alpha: Agreement for >2 raters
- Intraclass correlation coefficient (ICC): Continuous ratings reliability
- Cronbach's alpha: Internal consistency of multi-item scales
- Bland-Altman analysis: Agreement between two measurement methods (continuous)
- Use when assessing measurement reliability or inter-rater consistency
Categorical data extensions:
- McNemar's test: Paired binary outcomes (2x2)
- Cochran's Q test: Paired binary outcomes (3+ conditions)
- Cochran-Armitage trend test: Ordered categories in contingency tables
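McNemar's test uses only the discordant pairs of the paired 2x2 table, so the exact version reduces to a binomial test on those cells (a sketch; `statsmodels.stats.contingency_tables.mcnemar` provides the same thing packaged, with the example counts below invented for illustration):

```python
from scipy import stats

# Paired binary outcomes, e.g. symptom present before/after treatment:
#                 after: yes   after: no
# before: yes          30          15     <- b = 15 discordant pairs
# before: no            5          50     <- c = 5  discordant pairs
b, c = 15, 5

# Exact McNemar: under H0 the discordant pairs split 50/50
res = stats.binomtest(b, n=b + c, p=0.5)
print(f"exact McNemar p = {res.pvalue:.4f}")
```

The concordant cells (30 and 50 here) never enter the statistic, which is the point of the design: only pairs that changed carry information about the treatment.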
Decision Framework
Test Selection Flowchart
```
What is your research question?
|
+-- Comparing GROUPS on a continuous outcome?
|   |
|   +-- How many groups?
|   |   +-- 2 groups
|   |   |   +-- Independent -> Independent t-test (or Mann-Whitney U)
|   |   |   +-- Paired/repeated -> Paired t-test (or Wilcoxon signed-rank)
|   |   +-- 3+ groups
|   |       +-- Independent -> One-way ANOVA (or Kruskal-Wallis)
|   |       +-- Repeated -> Repeated-measures ANOVA (or Friedman)
|   |
|   +-- Multiple factors? -> Factorial ANOVA / Mixed ANOVA
|   +-- With covariates? -> ANCOVA
|
+-- Testing a RELATIONSHIP between variables?
|   |
|   +-- Both continuous?
|   |   +-- Normal -> Pearson correlation
|   |   +-- Non-normal or ordinal -> Spearman correlation
|   |
|   +-- Predicting continuous outcome?
|   |   +-- 1 predictor -> Simple linear regression
|   |   +-- Multiple predictors -> Multiple linear regression
|   |
|   +-- Predicting categorical outcome?
|   |   +-- Binary -> Logistic regression
|   |   +-- Ordinal -> Ordinal logistic regression
|   |
|   +-- Predicting count outcome?
|   |   +-- Equidispersed -> Poisson regression
|   |   +-- Overdispersed -> Negative binomial regression
|   |   +-- Excess zeros -> Zero-inflated Poisson/NB
|   |
|   +-- Time-to-event outcome?
|       +-- Compare survival curves -> Log-rank test
|       +-- With covariates -> Cox proportional hazards
|
+-- Testing ASSOCIATION between categorical variables?
|   +-- Expected cell count >= 5 -> Chi-square test
|   +-- Expected cell count < 5 -> Fisher's exact test
|   +-- Ordered categories -> Cochran-Armitage trend test
|   +-- Paired categories -> McNemar's test
|
+-- Assessing AGREEMENT / RELIABILITY?
    +-- Categorical, 2 raters -> Cohen's kappa
    +-- Categorical, >2 raters -> Fleiss' kappa
    +-- Continuous ratings -> ICC
    +-- Two measurement methods -> Bland-Altman analysis
    +-- Internal consistency -> Cronbach's alpha
```
Quick Reference Table
| Research Question | Data Type | Normal? | Test | Non-parametric Alternative |
|---|---|---|---|---|
| 2 independent groups | Continuous | Yes | Independent t-test | Mann-Whitney U |
| 2 paired groups | Continuous | Yes | Paired t-test | Wilcoxon signed-rank |
| 3+ independent groups | Continuous | Yes | One-way ANOVA | Kruskal-Wallis |
| 3+ repeated groups | Continuous | Yes | Repeated-measures ANOVA | Friedman test |
| 2 variables | Continuous | Yes | Pearson r | Spearman rho |
| Predict continuous | Mixed | -- | Linear regression | -- |
| Predict binary | Mixed | -- | Logistic regression | -- |
| Predict counts | Count | -- | Poisson / Negative binomial | -- |
| Time-to-event | Survival | -- | Cox PH / Log-rank | -- |
| 2 categorical | Categorical | -- | Chi-square / Fisher's exact | -- |
| Rater agreement | Categorical | -- | Cohen's kappa / Fleiss' kappa | -- |
| Method agreement | Continuous | -- | Bland-Altman / ICC | -- |
Best Practices
- Pre-register analyses when possible to distinguish confirmatory from exploratory findings. Specify primary outcome, tests, and correction methods before data collection
- Always check assumptions before interpreting results. Run normality tests (Shapiro-Wilk), homogeneity tests (Levene's), and residual diagnostics. Document results even when assumptions are met
- Report effect sizes with confidence intervals for every test. p-values alone are insufficient -- effect sizes convey practical importance
- Report all planned analyses including non-significant findings. Selective reporting inflates false positive rates
- Use appropriate multiple comparison corrections. Bonferroni (conservative), Holm (step-down, less conservative), or FDR/Benjamini-Hochberg (for many tests). Choose based on the number of comparisons and acceptable error rate
- Visualize data before and after analysis. Box plots for group comparisons, scatter plots for correlations, residual plots for regression diagnostics
- Conduct sensitivity analyses to assess robustness: re-run with outliers removed, different transformations, or alternative tests
- Anti-pattern -- p-hacking: Testing multiple outcomes, subgroups, or model specifications until p < .05 inflates false positives. Pre-register to avoid
- Anti-pattern -- HARKing (Hypothesizing After Results are Known): Presenting exploratory findings as confirmatory undermines scientific integrity
- Anti-pattern -- misinterpreting non-significance: Failure to reject H0 does not mean H0 is true. Use Bayesian methods or equivalence testing to support null
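The correction methods named above are a few lines of numpy; this Holm step-down sketch (hypothetical helper, not part of the skill's scripts) tests sorted p-values against successively looser thresholds alpha/(m - i):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down correction: compare sorted p-values against
    alpha/m, alpha/(m-1), ...; stop at the first failure."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break   # all larger p-values fail too
    return reject

print(holm([0.001, 0.04, 0.03, 0.01]))
```

In practice `statsmodels.stats.multitest.multipletests(pvals, method="holm")` provides this ready-made, along with Bonferroni and Benjamini-Hochberg FDR.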
Common Pitfalls
- Misinterpreting p-values as the probability of the hypothesis being true. p-values measure P(data | H0), not P(H0 | data). How to avoid: Use precise language: "If the null hypothesis were true, the probability of observing data this extreme is p = ..."
- Confusing statistical significance with practical importance. A large sample can make trivially small effects significant. How to avoid: Always report and interpret effect sizes alongside p-values
- Running post-hoc power analysis after a non-significant result. Post-hoc power is a mathematical function of the p-value and adds no new information. How to avoid: Use sensitivity analysis instead -- determine what effect size the study could detect at 80% power
- Ignoring assumption violations and proceeding with parametric tests. How to avoid: Run assumption checks systematically. Use Welch's corrections, non-parametric alternatives, or transformations when violated
- Multiple comparisons without correction. Running 20 tests at alpha = .05 gives a ~64% chance of at least one false positive. How to avoid: Apply Bonferroni, Holm, or FDR correction. Report both corrected and uncorrected p-values
- Treating ordinal data as continuous. Likert scales are ordinal -- means and standard deviations assume equal intervals. How to avoid: Use non-parametric tests (Mann-Whitney, Kruskal-Wallis) or ordinal regression
- Ignoring missing data patterns. Listwise deletion assumes MCAR, which is rarely true. How to avoid: Assess the missingness mechanism (MCAR, MAR, MNAR). Use multiple imputation for MAR data
- Confusing correlation with causation. Observational studies cannot establish causal relationships regardless of effect size. How to avoid: Use causal language only for experimental designs with random assignment
- Not reporting non-significant results. Publication bias and the file-drawer effect distort the literature. How to avoid: Report all pre-registered analyses. Consider registered reports
- Using one-tailed tests to "improve" significance. One-tailed tests should be pre-specified based on strong directional hypotheses. How to avoid: Default to two-tailed. Only use one-tailed when justified a priori
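The "~64%" figure in the multiple-comparisons pitfall is just the family-wise error rate for independent tests, which takes one line to verify:

```python
# Probability of at least one false positive across m independent tests
alpha, m = 0.05, 20
fwer = 1 - (1 - alpha) ** m
print(f"FWER for {m} tests at alpha = {alpha}: {fwer:.1%}")  # ~64.2%

# Bonferroni (test each at alpha/m) pulls the family-wise rate back under alpha:
fwer_bonf = 1 - (1 - alpha / m) ** m
print(f"With Bonferroni: {fwer_bonf:.1%}")  # ~4.9%
```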
Workflow
Standard Analysis Pipeline
1. Define research question and hypotheses
   - State H0 and H1 explicitly
   - Specify primary outcome and covariates
2. Select statistical test (use Decision Framework above)
   - Match test to data type, design, and assumptions
   - Plan multiple comparison corrections if needed
3. Conduct a priori power analysis
   - Specify target effect size (from literature or clinical relevance)
   - Set alpha = .05, power = .80 (minimum), determine required n
   - Libraries: `statsmodels.stats.power`, `pingouin`
4. Inspect and clean data
   - Check for missing data patterns (MCAR/MAR/MNAR)
   - Identify outliers (IQR method: Q1 - 1.5*IQR / Q3 + 1.5*IQR, or z-scores > 3)
   - Verify variable types and coding
   - The original `assumption_checks.py` script provides automated normality, homogeneity, and outlier detection with visualization
5. Check assumptions (see Test-Specific Assumption Workflows above)
   - Normality: Shapiro-Wilk test + Q-Q plots (visual primary for n > 50)
   - Homogeneity: Levene's test + box plots
   - Linearity: Residual plots (for regression)
   - Sphericity: Mauchly's test (for repeated measures)
   - Document results and remedial actions
6. Run primary analysis
   - Execute planned test with appropriate library (scipy.stats, pingouin, statsmodels)
   - Calculate effect size and confidence interval
   - For Bayesian analyses: specify priors, run MCMC, check convergence (see `references/bayesian_statistics.md`)
7. Conduct post-hoc and secondary analyses
   - Post-hoc pairwise comparisons (Tukey HSD, Bonferroni)
   - Sensitivity analyses (remove outliers, alternative methods)
   - Exploratory analyses (clearly labeled)
8. Report results in APA format
   - Descriptive statistics (M, SD, n per group)
   - Test statistic, degrees of freedom, exact p-value
   - Effect size with confidence interval
   - See `references/reporting_standards.md` for templates
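Step 3's required-n calculation can be sketched with the usual normal approximation (`n_per_group` is a hypothetical helper; `statsmodels.stats.power.TTestIndPower` gives the exact noncentral-t answer, which runs slightly higher):

```python
import math
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided independent t-test,
    via the normal approximation n ~= 2 * ((z_{1-a/2} + z_{power}) / d)^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.5))  # medium effect -> 63 per group (exact noncentral-t: ~64)
```

Halving the target effect size roughly quadruples the required n, which is why the target d must come from literature or clinical relevance rather than optimism.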
Bundled Resources
- `references/effect_sizes_and_power.md` -- Detailed guide to calculating, interpreting, and reporting effect sizes (Cohen's d, Hedges' g, Glass's delta, eta-squared, omega-squared, partial eta-squared, phi coefficient, standardized beta, f-squared, Cramer's V, odds ratio); a priori, sensitivity, and correlation power analysis with code examples. Condensed from 582-line original.
- `references/bayesian_statistics.md` -- Comprehensive Bayesian analysis guide: Bayes' theorem, prior specification, ROPE (Region of Practical Equivalence), prior sensitivity analysis, Bayesian t-test/ANOVA/correlation/regression, hierarchical models, model comparison (WAIC/LOO), convergence diagnostics. Condensed from 662-line original.
- `references/reporting_standards.md` -- APA-style reporting templates for t-tests, ANOVA, regression, correlation, chi-square, non-parametric, and Bayesian analyses; pre-registration guidance; methods section templates (participants, design, measures); null results reporting; reporting checklist. Condensed from 470-line original.
Fully-Consolidated Files (no separate reference file)
- `test_selection_guide.md` (130 lines original) -- Fully consolidated into Decision Framework (flowchart + Quick Reference Table) and the Specialized Test Categories subsection in Key Concepts. Combined coverage: flowchart (~35 lines) + Quick Reference Table (~15 lines) + Specialized Test Categories (~35 lines) = ~85 lines covering all original capabilities. Original content on sample size considerations, multiple comparisons, and missing data was consolidated into Best Practices and Common Pitfalls. Omitted: study design considerations (RCTs, observational, clustered data) -- general guidance covered by the statsmodels-statistical-modeling skill.
- `assumptions_and_diagnostics.md` (370 lines original) -- Fully consolidated into Key Concepts (Assumptions Overview + Test-Specific Assumption Workflows) and Workflow Steps 4-5. Combined coverage: Assumptions Overview (~12 lines) + Test-Specific Assumption Workflows (~20 lines) + Workflow Steps 4-5 (~16 lines) = ~48 lines. The original contained detailed code blocks for each assumption check; since this is Knowhow (not Skill), code is referenced rather than reproduced. Key diagnostic thresholds preserved (VIF > 10, Durbin-Watson 1.5-2.5, variance ratio < 2-3). Omitted: extensive Python code blocks for individual checks (normality, homogeneity, linearity, logistic regression diagnostics) -- available in scipy.stats and pingouin documentation. Sample size rules of thumb covered in Workflow Step 3.
Script Disposition
`assumption_checks.py` (540 lines) -- Contains 6 functions: `check_normality()`, `check_normality_per_group()`, `check_homogeneity_of_variance()`, `check_linearity()`, `detect_outliers()`, `comprehensive_assumption_check()`. As a Knowhow entry, script functions are referenced in Workflow Step 4 rather than reproduced inline. Key capabilities (Shapiro-Wilk, Levene's, IQR/z-score outlier detection, Q-Q plots) are described in Assumptions Overview and Test-Specific Assumption Workflows. Users needing automated checking should use scipy.stats and pingouin directly, following the patterns described.
Intentional Omissions
- Time series methods (ARIMA, ACF/PACF) -- specialized topic beyond core statistical testing scope
- Mixed-effects models / GEE -- covered by statsmodels-statistical-modeling skill
- Bootstrap and permutation tests -- mentioned in passing; detailed implementation deferred to computational statistics resources
Further Reading
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.)
- Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.)
- Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models
- Kruschke, J. K. (2014). Doing Bayesian Data Analysis (2nd ed.)
- APA Publication Manual: https://apastyle.apa.org/
- Cross Validated (stats Q&A): https://stats.stackexchange.com/
Related Skills
- statsmodels-statistical-modeling -- Implementing OLS, GLM, Logit, time-series models programmatically
- pymc-bayesian-modeling -- Full Bayesian modeling with MCMC sampling
- scikit-learn-machine-learning -- Predictive modeling, cross-validation, classification
- matplotlib-scientific-plotting -- Creating publication-quality statistical figures
- hypothesis-generation -- Structured hypothesis formulation before statistical testing