git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/27-dariia-m-my_claude_skills/paper_verification" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-academic-paper-ve && rm -rf "$T"
skills/27-dariia-m-my_claude_skills/paper_verification/SKILL.md

Academic Paper Verification
A systematic skill for verifying the integrity and replicability of an academic research paper. This covers everything from individual coefficient checks to full end-to-end replication.
Overview
Verification proceeds in six phases. Each phase produces structured output. Do not skip phases - earlier phases feed into later ones.
Phase 1: Discovery -> inventory of all project files, scripts, outputs, paper
Phase 2: Table Audit -> cross-check every number in every table
Phase 3: Inline Claims -> verify quantitative claims in paper body text
Phase 4: Code Review -> audit R scripts for correctness, modeling decisions, data pipeline
Phase 5: Manifest Build -> create verification_manifest.json linking claims to code
Phase 6: Replication -> write and run tests/verify_replication.R, fix failures
Before You Start
- Identify the project root directory. Look for `.Rproj` files, a `README`, or ask the user.
- Read `references/phase-details.md` for the full procedure for each phase.
- Read `references/common-pitfalls.md` for known failure modes to watch for.
Phase 1: Discovery
Scan the entire project and build an inventory. You need to know what you're working with before you can verify anything.
Find and catalog:
- All `.R` and `.Rmd` scripts (note execution order if a master script exists)
- All output files: `.csv`, `.rds`, `.tex`, `.txt`, `.log` in results/, output/, tables/, etc.
- The LaTeX paper file(s): `.tex` in the root or paper/ or draft/ directory
- Any data files: `.csv`, `.dta`, `.rds`, `.xlsx` in data/ or similar
- Any configuration or parameter files
Produce: A file inventory printed to the console, organized by type, with notes on what each script appears to do (based on filename and a quick scan of its first ~30 lines).
Key questions to answer in this phase:
- Is there a master script that runs everything in order?
- Where do intermediate outputs land?
- Which scripts produce which tables/figures?
- Are there any scripts that appear unused or orphaned?
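The discovery scan above can be started mechanically. A minimal base-R sketch (the extension groupings and the `build_inventory` name are illustrative assumptions, not fixed by this skill):

```r
# Hypothetical sketch: group project files by extension using base R.
build_inventory <- function(root = ".") {
  files <- list.files(root, recursive = TRUE, full.names = TRUE)
  patterns <- list(
    scripts = "\\.(R|Rmd)$",            # analysis code
    outputs = "\\.(tex|txt|log)$",      # formatted tables and logs
    data    = "\\.(csv|rds|dta|xlsx)$"  # raw and intermediate data
  )
  lapply(patterns, function(p) grep(p, files, value = TRUE, ignore.case = TRUE))
}
```

Printing the result, grouped by type, gives the Phase 1 inventory; a quick scan of each script's first ~30 lines then fills in the purpose notes.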
Phase 2: Table Audit
This is the most critical phase. Read `references/phase-details.md` Section 2 for the full procedure.
For every table in the paper:
1. Locate the table in the LaTeX source. Extract every number: coefficients, standard errors, t-statistics, p-values, confidence intervals, sample sizes (N), R-squared, F-statistics, means, medians, percentages - everything.
2. Locate the corresponding R output file that produced this table. This might be a `.tex` file generated by `stargazer`, `modelsummary`, `xtable`, `kableExtra`, `huxtable`, or similar. It could also be a `.csv`, `.rds`, or text log.
3. Cross-check every single number. Compare to the R output with appropriate tolerance:
   - Coefficients and standard errors: match to the number of decimal places shown
   - Sample sizes: must match exactly
   - R-squared and similar: match to displayed precision
   - Percentages: verify the arithmetic (numerator/denominator)
4. Check for rounding consistency - if a coefficient is 0.0347 in the R output and 0.035 in the paper, that is acceptable rounding. If it is 0.038, that is a discrepancy.
5. Verify that column headers, variable names, and panel labels in the paper match the specification in the code.
6. Check that the number of observations (N) is consistent across all tables that use the same sample. If Table 1 reports N=4,521 and Table 3 uses the same sample but reports N=4,519, that needs explanation.
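The rounding-consistency check can be made mechanical. A minimal sketch, assuming a tolerance of half a unit in the last displayed decimal place (which matches the 0.0347 -> 0.035 example above):

```r
# Is the paper's printed value an acceptable rounding of the raw R output?
# Tolerance: half a unit in the last decimal place the paper displays.
matches_rounding <- function(paper_str, raw_value) {
  shown_decimals <- nchar(sub("^-?[0-9]*\\.?", "", paper_str))
  abs(raw_value - as.numeric(paper_str)) <= 0.5 * 10^(-shown_decimals)
}
matches_rounding("0.035", 0.0347)  # TRUE: acceptable rounding
matches_rounding("0.038", 0.0347)  # FALSE: discrepancy
```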
Produce: A table-by-table verification report. For each table:
- Table number and title
- Source R script and output file
- Number of values checked
- List of any discrepancies with exact locations (paper line number, output file line number)
- PASS/FAIL status
Phase 3: Inline Claims Audit
Read the paper body text (not just tables) and find every quantitative claim. These include:
- "We find a 3.2 percentage point increase..."
- "The effect is significant at the 5% level..."
- "Our sample includes 12,450 observations..."
- "Column 3 of Table 2 shows that..."
- "The coefficient on X is negative and significant..."
- Footnotes with numbers or statistical claims
- Abstract claims about magnitudes and significance
For each claim, trace it back to a specific table cell, figure, or R output. Flag any claim that cannot be traced or that contradicts the evidence.
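A crude regex pass can surface candidate claims for this audit. The pattern below is an illustrative assumption - it will miss many phrasings, and every hit still needs manual tracing to a table cell or output:

```r
# Flag lines of body text that look like quantitative claims.
find_candidate_claims <- function(lines) {
  pat <- "[0-9]+(\\.[0-9]+)?\\s*(%|percent|percentage points?|observations)"
  hits <- grepl(pat, lines, ignore.case = TRUE)
  data.frame(line = which(hits), text = lines[hits], stringsAsFactors = FALSE)
}
find_candidate_claims(c("We find a 3.2 percentage point increase.",
                        "Prior work motivates our design."))
```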
Produce: A claims checklist with claim text, source location in paper, evidence source, and VERIFIED/UNVERIFIED/DISCREPANCY status.
Phase 4: Code Review
Read every R script in the project, in execution order. This is not just a syntax check - you are auditing the analytical pipeline. Read `references/phase-details.md` Section 4 and `references/common-pitfalls.md` for what to look for.
Data Pipeline Verification:
- At every `merge`, `join`, `filter`, `subset`, or `mutate` step, check: (a) How many observations before vs. after the transformation? (b) Do all column names needed downstream still exist? (c) Are key summary statistics (mean, min, max, N) reasonable after the step?
- Flag any joins that could silently drop or duplicate observations
- Flag any filters that might be too aggressive or too permissive
- Check for proper handling of missing values (NA) - are they dropped, imputed, or ignored?
- Verify that panel/time-series data is properly balanced or that imbalance is handled
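One way to make the before/after counts visible is to wrap each pipeline step in a logging helper. A minimal base-R sketch (the `checked_step` name and demo data are hypothetical):

```r
# Wrap a transformation so row counts before and after are always reported,
# making silent drops or duplications at merges/filters easy to spot.
checked_step <- function(df, step, label) {
  n_before <- nrow(df)
  out <- step(df)
  message(sprintf("%s: %d -> %d rows", label, n_before, nrow(out)))
  out
}
d  <- data.frame(id = 1:5, x = c(1, NA, 3, 4, 5))
d2 <- checked_step(d, function(df) df[!is.na(df$x), ], "drop missing x")
```

The demo prints `drop missing x: 5 -> 4 rows`, so the dropped observation is visible rather than silent.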
Modeling Decisions:
- Are the regression specifications consistent with what the paper describes? (e.g., if the paper says "we control for year fixed effects", is that in the code?)
- Are standard errors clustered as described? (robust, clustered at the right level, etc.)
- Are instrumental variables correctly specified? (first stage, exclusion restriction checks)
- Is the sample restriction for each regression clearly defined and consistent with the paper?
- Are interaction terms, polynomials, or transformations correctly implemented?
- Do subsample analyses actually use the right subsamples?
Robustness and Red Flags:
- Are there hardcoded values that should be computed? (e.g., `filter(year > 2005)` when the paper says "post-treatment period" without defining the cutoff)
- Are there commented-out lines that suggest alternative specifications were tried?
- Is there any evidence of p-hacking patterns (many specifications tried, only one reported)?
- Are random seeds set for any stochastic procedures?
- Are there warnings or errors being suppressed?
Produce: A script-by-script review with:
- Script name and purpose
- Data pipeline issues (with line numbers)
- Modeling decision flags (with line numbers)
- Red flags (with line numbers)
- Overall assessment: CLEAN / MINOR ISSUES / MAJOR ISSUES
Phase 5: Build Verification Manifest
Create `verification_manifest.json` that maps every quantitative claim in the paper to the code that produces it.
Structure:
    {
      "paper_file": "paper/main.tex",
      "generated_at": "2026-02-08T12:00:00Z",
      "claims": [
        {
          "id": "T1_R2_C3",
          "type": "coefficient",
          "paper_location": {"file": "paper/main.tex", "line": 234, "context": "Table 1, Row 2, Col 3"},
          "paper_value": "0.035",
          "source_script": "code/02_main_regression.R",
          "source_line": 87,
          "output_file": "results/table1.tex",
          "output_location": {"line": 15, "context": "second coefficient in column 3"},
          "expected_value": "0.0347",
          "tolerance": 0.001,
          "status": "PASS",
          "notes": "Acceptable rounding from 0.0347 to 0.035"
        },
        {
          "id": "BODY_P12_S3",
          "type": "inline_claim",
          "paper_location": {"file": "paper/main.tex", "line": 412, "context": "paragraph 12, sentence 3"},
          "paper_value": "3.2 percentage points",
          "source_script": "code/02_main_regression.R",
          "source_line": 87,
          "output_file": "results/table1.tex",
          "output_location": {"line": 15},
          "expected_value": "0.032",
          "tolerance": 0.001,
          "status": "PASS",
          "notes": "Coefficient 0.0323 reported as 3.2pp"
        }
      ],
      "summary": {
        "total_claims": 142,
        "passed": 139,
        "failed": 2,
        "unverified": 1
      }
    }
Every coefficient, standard error, sample size, p-value, summary statistic, and verbal claim should appear in this manifest. Be exhaustive.
Phase 6: Replication Test Suite
Write `tests/verify_replication.R` that programmatically reruns the analysis and checks results against the manifest.

Read `references/replication-script-template.md` for the template and structure.
The test script must:
- Source or rerun each analysis script in the correct order
- Extract the relevant outputs (coefficients, SEs, N, R-squared, etc.)
- Compare against the values in `verification_manifest.json`
- Use appropriate tolerance for floating-point comparisons
- Report PASS/FAIL for each claim with clear diagnostics on failure
- Handle dependencies gracefully (if a data file is missing, report it, do not crash)
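The comparison core of such a script might look like the sketch below. It is base R; `claims` mirrors the Phase 5 manifest entries, and the extractor argument is a hypothetical stand-in for the per-file parsing you would write:

```r
# Compare extracted values against manifest claims with tolerance,
# reporting UNVERIFIED instead of crashing when extraction fails.
run_checks <- function(claims, extract_value) {
  lapply(claims, function(cl) {
    got <- tryCatch(extract_value(cl), error = function(e) NA_real_)
    status <- if (is.na(got)) "UNVERIFIED"
    else if (abs(got - as.numeric(cl$expected_value)) <= cl$tolerance) "PASS"
    else "FAIL"
    list(id = cl$id, got = got, status = status)
  })
}
# Demo with one in-memory claim and a stub extractor:
claims  <- list(list(id = "T1_R2_C3", expected_value = "0.0347", tolerance = 0.001))
results <- run_checks(claims, function(cl) 0.0347)
results[[1]]$status  # "PASS"
```

Because extraction errors are caught and mapped to UNVERIFIED, a missing data file is reported rather than aborting the whole run.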
After writing the test script:
- Run it
- For any failures, diagnose the root cause
- If the failure is due to a code bug (not a paper-code mismatch), fix the upstream script and document what you fixed
- Rerun until all tests pass or all remaining failures are genuine paper-code discrepancies
- Produce a final summary
Produce:
- `tests/verify_replication.R` - the test script
- `tests/replication_results.json` - structured test results
- `tests/replication_summary.md` - human-readable summary of what passed, what failed, what was fixed, and what remains unresolved
Output Format
At the end of the full verification, produce a consolidated report. Use this structure:
    # Paper Verification Report

    ## Executive Summary
    - Total quantitative claims checked: X
    - Passed: Y
    - Failed: Z
    - Unverified: W
    - Code issues found: N (M major, K minor)

    ## Table-by-Table Results
    [from Phase 2]

    ## Inline Claims Results
    [from Phase 3]

    ## Code Review Findings
    [from Phase 4]

    ## Replication Test Results
    [from Phase 6]

    ## Recommendations
    [prioritized list of issues to address]
Important Notes
- Never silently skip a number. If you cannot verify a value, mark it UNVERIFIED with an explanation.
- When in doubt, flag it. False positives are better than missed discrepancies.
- Pay special attention to N (sample sizes) - these are the most common source of inconsistencies across tables and text.
- If the project uses R packages that produce formatted output (stargazer, modelsummary, etc.), check the raw model objects too, not just the formatted output.
- If you encounter Stata `.do` files or Python scripts mixed in, verify those too using the same principles.
- The user may want you to run this on a subset (e.g., "just check Table 3"). Adapt accordingly but note what was not checked.