git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/27-dariia-m-my_claude_skills/paper_verification" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-academic-paper-ve && rm -rf "$T"
skills/27-dariia-m-my_claude_skills/paper_verification/SKILL.md

Academic Paper Verification
A systematic skill for verifying the integrity and replicability of an academic research paper. This covers everything from individual coefficient checks to full end-to-end replication.
Overview
Verification proceeds in six phases. Each phase produces structured output. Do not skip phases - earlier phases feed into later ones.
Phase 1: Discovery -> inventory of all project files, scripts, outputs, paper
Phase 2: Table Audit -> cross-check every number in every table
Phase 3: Inline Claims -> verify quantitative claims in paper body text
Phase 4: Code Review -> audit R scripts for correctness, modeling decisions, data pipeline
Phase 5: Manifest Build -> create verification_manifest.json linking claims to code
Phase 6: Replication -> write and run tests/verify_replication.R, fix failures
Before You Start
- Identify the project root directory. Look for `.Rproj` files, a `README`, or ask the user.
- Read `references/phase-details.md` for the full procedure for each phase.
- Read `references/common-pitfalls.md` for known failure modes to watch for.
Phase 1: Discovery
Scan the entire project and build an inventory. You need to know what you're working with before you can verify anything.
Find and catalog:
- All `.R` and `.Rmd` scripts (note execution order if a master script exists)
- All output files: `.csv`, `.rds`, `.tex`, `.txt`, `.log` in results/, output/, tables/, etc.
- The LaTeX paper file(s): `.tex` in the root or paper/ or draft/ directory
- Any data files: `.csv`, `.dta`, `.rds`, `.xlsx` in data/ or similar
- Any configuration or parameter files
Produce: A file inventory printed to the console, organized by type, with notes on what each script appears to do (based on filename and a quick scan of its first ~30 lines).
Key questions to answer in this phase:
- Is there a master script that runs everything in order?
- Where do intermediate outputs land?
- Which scripts produce which tables/figures?
- Are there any scripts that appear unused or orphaned?
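The discovery scan above can be started mechanically. A minimal base-R sketch (the extension groupings and the `build_inventory` name are illustrative assumptions, not fixed by this skill):

```r
# Hypothetical sketch: group project files by extension using base R.
build_inventory <- function(root = ".") {
  files <- list.files(root, recursive = TRUE, full.names = TRUE)
  patterns <- list(
    scripts = "\\.(R|Rmd)$",            # analysis code
    outputs = "\\.(tex|txt|log)$",      # formatted tables and logs
    data    = "\\.(csv|rds|dta|xlsx)$"  # raw and intermediate data
  )
  lapply(patterns, function(p) grep(p, files, value = TRUE, ignore.case = TRUE))
}
```

Printing the result, grouped by type, gives the Phase 1 inventory; a quick scan of each script's first ~30 lines then fills in the purpose notes.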
Phase 2: Table Audit
This is the most critical phase. Read `references/phase-details.md` Section 2 for the full procedure.
For every table in the paper:
1. Locate the table in the LaTeX source. Extract every number: coefficients, standard errors, t-statistics, p-values, confidence intervals, sample sizes (N), R-squared, F-statistics, means, medians, percentages - everything.
2. Locate the corresponding R output file that produced this table. This might be a `.tex` file generated by `stargazer`, `modelsummary`, `xtable`, `kableExtra`, `huxtable`, or similar. It could also be a `.csv`, `.rds`, or text log.
3. Cross-check every single number. Compare to the R output with appropriate tolerance:
   - Coefficients and standard errors: match to the number of decimal places shown
   - Sample sizes: must match exactly
   - R-squared and similar: match to displayed precision
   - Percentages: verify the arithmetic (numerator/denominator)
4. Check for rounding consistency - if a coefficient is 0.0347 in the R output and 0.035 in the paper, that is acceptable rounding. If it is 0.038, that is a discrepancy.
5. Verify that column headers, variable names, and panel labels in the paper match the specification in the code.
6. Check that the number of observations (N) is consistent across all tables that use the same sample. If Table 1 reports N=4,521 and Table 3 uses the same sample but reports N=4,519, that needs explanation.
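The rounding-consistency check can be made mechanical. A minimal sketch, assuming a tolerance of half a unit in the last displayed decimal place (which matches the 0.0347 -> 0.035 example above):

```r
# Is the paper's printed value an acceptable rounding of the raw R output?
# Tolerance: half a unit in the last decimal place the paper displays.
matches_rounding <- function(paper_str, raw_value) {
  shown_decimals <- nchar(sub("^-?[0-9]*\\.?", "", paper_str))
  abs(raw_value - as.numeric(paper_str)) <= 0.5 * 10^(-shown_decimals)
}
matches_rounding("0.035", 0.0347)  # TRUE: acceptable rounding
matches_rounding("0.038", 0.0347)  # FALSE: discrepancy
```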
Produce: A table-by-table verification report. For each table:
- Table number and title
- Source R script and output file
- Number of values checked
- List of any discrepancies with exact locations (paper line number, output file line number)
- PASS/FAIL status
Phase 3: Inline Claims Audit
Read the paper body text (not just tables) and find every quantitative claim. These include:
- "We find a 3.2 percentage point increase..."
- "The effect is significant at the 5% level..."
- "Our sample includes 12,450 observations..."
- "Column 3 of Table 2 shows that..."
- "The coefficient on X is negative and significant..."
- Footnotes with numbers or statistical claims
- Abstract claims about magnitudes and significance
For each claim, trace it back to a specific table cell, figure, or R output. Flag any claim that cannot be traced or that contradicts the evidence.
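A crude regex pass can surface candidate claims for this audit. The pattern below is an illustrative assumption - it will miss many phrasings, and every hit still needs manual tracing to a table cell or output:

```r
# Flag lines of body text that look like quantitative claims.
find_candidate_claims <- function(lines) {
  pat <- "[0-9]+(\\.[0-9]+)?\\s*(%|percent|percentage points?|observations)"
  hits <- grepl(pat, lines, ignore.case = TRUE)
  data.frame(line = which(hits), text = lines[hits], stringsAsFactors = FALSE)
}
find_candidate_claims(c("We find a 3.2 percentage point increase.",
                        "Prior work motivates our design."))
```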
Produce: A claims checklist with claim text, source location in paper, evidence source, and VERIFIED/UNVERIFIED/DISCREPANCY status.
Phase 4: Code Review
Read every R script in the project, in execution order. This is not just a syntax check - you are auditing the analytical pipeline. Read `references/phase-details.md` Section 4 and `references/common-pitfalls.md` for what to look for.
Data Pipeline Verification:
- At every `merge`, `join`, `filter`, `subset`, or `mutate` step, check: (a) How many observations before vs. after the transformation? (b) Do all column names needed downstream still exist? (c) Are key summary statistics (mean, min, max, N) reasonable after the step?
- Flag any joins that could silently drop or duplicate observations
- Flag any filters that might be too aggressive or too permissive
- Check for proper handling of missing values (NA) - are they dropped, imputed, or ignored?
- Verify that panel/time-series data is properly balanced or that imbalance is handled
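One way to make the before/after counts visible is to wrap each pipeline step in a logging helper. A minimal base-R sketch (the `checked_step` name and demo data are hypothetical):

```r
# Wrap a transformation so row counts before and after are always reported,
# making silent drops or duplications at merges/filters easy to spot.
checked_step <- function(df, step, label) {
  n_before <- nrow(df)
  out <- step(df)
  message(sprintf("%s: %d -> %d rows", label, n_before, nrow(out)))
  out
}
d  <- data.frame(id = 1:5, x = c(1, NA, 3, 4, 5))
d2 <- checked_step(d, function(df) df[!is.na(df$x), ], "drop missing x")
```

The demo prints `drop missing x: 5 -> 4 rows`, so the dropped observation is visible rather than silent.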
Modeling Decisions:
- Are the regression specifications consistent with what the paper describes? (e.g., if the paper says "we control for year fixed effects", is that in the code?)
- Are standard errors clustered as described? (robust, clustered at the right level, etc.)
- Are instrumental variables correctly specified? (first stage, exclusion restriction checks)
- Is the sample restriction for each regression clearly defined and consistent with the paper?
- Are interaction terms, polynomials, or transformations correctly implemented?
- Do subsample analyses actually use the right subsamples?
Robustness and Red Flags:
- Are there hardcoded values that should be computed? (e.g., `filter(year > 2005)` when the paper says "post-treatment period" without defining the cutoff)
- Are there commented-out lines that suggest alternative specifications were tried?
- Is there any evidence of p-hacking patterns (many specifications tried, only one reported)?
- Are random seeds set for any stochastic procedures?
- Are there warnings or errors being suppressed?
Produce: A script-by-script review with:
- Script name and purpose
- Data pipeline issues (with line numbers)
- Modeling decision flags (with line numbers)
- Red flags (with line numbers)
- Overall assessment: CLEAN / MINOR ISSUES / MAJOR ISSUES
Phase 5: Build Verification Manifest
Create `verification_manifest.json` that maps every quantitative claim in the paper to the code that produces it.
Structure:
    {
      "paper_file": "paper/main.tex",
      "generated_at": "2026-02-08T12:00:00Z",
      "claims": [
        {
          "id": "T1_R2_C3",
          "type": "coefficient",
          "paper_location": {"file": "paper/main.tex", "line": 234, "context": "Table 1, Row 2, Col 3"},
          "paper_value": "0.035",
          "source_script": "code/02_main_regression.R",
          "source_line": 87,
          "output_file": "results/table1.tex",
          "output_location": {"line": 15, "context": "second coefficient in column 3"},
          "expected_value": "0.0347",
          "tolerance": 0.001,
          "status": "PASS",
          "notes": "Acceptable rounding from 0.0347 to 0.035"
        },
        {
          "id": "BODY_P12_S3",
          "type": "inline_claim",
          "paper_location": {"file": "paper/main.tex", "line": 412, "context": "paragraph 12, sentence 3"},
          "paper_value": "3.2 percentage points",
          "source_script": "code/02_main_regression.R",
          "source_line": 87,
          "output_file": "results/table1.tex",
          "output_location": {"line": 15},
          "expected_value": "0.032",
          "tolerance": 0.001,
          "status": "PASS",
          "notes": "Coefficient 0.0323 reported as 3.2pp"
        }
      ],
      "summary": {
        "total_claims": 142,
        "passed": 139,
        "failed": 2,
        "unverified": 1
      }
    }
Every coefficient, standard error, sample size, p-value, summary statistic, and verbal claim should appear in this manifest. Be exhaustive.
Phase 6: Replication Test Suite
Write `tests/verify_replication.R` that programmatically reruns the analysis and checks results against the manifest.

Read `references/replication-script-template.md` for the template and structure.
The test script must:
- Source or rerun each analysis script in the correct order
- Extract the relevant outputs (coefficients, SEs, N, R-squared, etc.)
- Compare against the values in `verification_manifest.json`
- Use appropriate tolerance for floating-point comparisons
- Report PASS/FAIL for each claim with clear diagnostics on failure
- Handle dependencies gracefully (if a data file is missing, report it, do not crash)
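The comparison core of such a script might look like the sketch below. It is base R; `claims` mirrors the Phase 5 manifest entries, and the extractor argument is a hypothetical stand-in for the per-file parsing you would write:

```r
# Compare extracted values against manifest claims with tolerance,
# reporting UNVERIFIED instead of crashing when extraction fails.
run_checks <- function(claims, extract_value) {
  lapply(claims, function(cl) {
    got <- tryCatch(extract_value(cl), error = function(e) NA_real_)
    status <- if (is.na(got)) "UNVERIFIED"
    else if (abs(got - as.numeric(cl$expected_value)) <= cl$tolerance) "PASS"
    else "FAIL"
    list(id = cl$id, got = got, status = status)
  })
}
# Demo with one in-memory claim and a stub extractor:
claims  <- list(list(id = "T1_R2_C3", expected_value = "0.0347", tolerance = 0.001))
results <- run_checks(claims, function(cl) 0.0347)
results[[1]]$status  # "PASS"
```

Because extraction errors are caught and mapped to UNVERIFIED, a missing data file is reported rather than aborting the whole run.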
After writing the test script:
- Run it
- For any failures, diagnose the root cause
- If the failure is due to a code bug (not a paper-code mismatch), fix the upstream script and document what you fixed
- Rerun until all tests pass or all remaining failures are genuine paper-code discrepancies
- Produce a final summary
Produce:
- `tests/verify_replication.R` - the test script
- `tests/replication_results.json` - structured test results
- `tests/replication_summary.md` - human-readable summary of what passed, what failed, what was fixed, and what remains unresolved
Output Format
At the end of the full verification, produce a consolidated report. Use this structure:
    # Paper Verification Report

    ## Executive Summary
    - Total quantitative claims checked: X
    - Passed: Y
    - Failed: Z
    - Unverified: W
    - Code issues found: N (M major, K minor)

    ## Table-by-Table Results
    [from Phase 2]

    ## Inline Claims Results
    [from Phase 3]

    ## Code Review Findings
    [from Phase 4]

    ## Replication Test Results
    [from Phase 6]

    ## Recommendations
    [prioritized list of issues to address]
Important Notes
- Never silently skip a number. If you cannot verify a value, mark it UNVERIFIED with an explanation.
- When in doubt, flag it. False positives are better than missed discrepancies.
- Pay special attention to N (sample sizes) - these are the most common source of inconsistencies across tables and text.
- If the project uses R packages that produce formatted output (stargazer, modelsummary, etc.), check the raw model objects too, not just the formatted output.
- If you encounter Stata `.do` files or Python scripts mixed in, verify those too using the same principles.
- The user may want you to run this on a subset (e.g., "just check Table 3"). Adapt accordingly but note what was not checked.