Medical-research-skills literature-statistics
Generate statistics for publication-year and journal distributions from local references or PDFs; use when you need standardized Year/Journal tables and a summary without any network access.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Other/literature-statistics" ~/.claude/skills/aipoch-medical-research-skills-literature-statistics && rm -rf "$T"
manifest:
scientific-skills/Other/literature-statistics/SKILL.md
source content
When to Use
- You have a batch of references and need a publication year distribution table (counts and percentages).
- You need a journal distribution table (Top N optional) for a literature review or report appendix.
- Your input is pasted citations (BibTeX/RIS/EndNote/plain text/mixed) and you want quick aggregation.
- Your input is local reference files (.bib/.ris/.txt/.csv) and you want consistent, standardized output.
- You have a local PDF folder and want to extract year/journal signals (best-effort) and summarize them.
Key Features
- Supports multiple input types: pasted text, local reference files, and local PDF directories (via script).
- Extracts Year and Journal using format-specific parsing rules (BibTeX/RIS/plain text/PDF).
- Produces two standardized tables:
- Year distribution: year, count, percent
- Journal distribution: journal title, count, percent
- Provides a summary including totals and unknown-field counts (unknown year / unknown journal).
- Conservative extraction: does not guess when metadata is unclear; ambiguous items are counted as unknown.
- Local-only operation: no network calls, no external APIs, no credential usage.
Dependencies
- Python 3.9+
- Python packages (pinned by your project file):
pip install -r scripts/requirements.txt
Example Usage
1) Process a local PDF directory
python scripts/process_pdfs.py --input-dir "./pdfs" --output "./literature_stats.md"
2) Process a local reference file (example pattern)
If your repository provides a CLI entry or script for reference files, run it similarly to the PDF script. For example:
python scripts/process_references.py --input "./refs/library.bib" --output "./literature_stats.md"
3) Expected output format (Markdown)
## Summary
- Total processed: 120
- Unknown year: 7
- Unknown journal: 15

## Year Distribution
| Year | Count | Percent |
|------|-------|---------|
| 2022 | 22 | 18.3% |
| 2023 | 18 | 15.0% |
| ... | ... | ... |

## Journal Distribution
| Journal | Count | Percent |
|---------|-------|---------|
| Journal of X | 9 | 7.5% |
| ... | ... | ... |
For additional examples, see references/examples.md.
Implementation Details
Processing Pipeline
- Detect input type: pasted text / file path / PDF directory.
- Read content from pasted text or local files.
- Split into individual citations using format cues:
- BibTeX entries
- RIS records
- blank-line separation for plain text/mixed inputs
- Extract year and journal using the parsing rules below.
- Normalize journal names using the normalization rules below.
- Aggregate counts and compute percentages.
- Output:
- Table 1: Year distribution
- Table 2: Journal distribution
- Summary: totals + unknown counts
- For PDF directories, use:
python scripts/process_pdfs.py --input-dir "<pdf_dir>" --output "<output_md>"
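The splitting step of the pipeline can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function name split_citations is hypothetical, and it assumes each pasted input uses a single format (a truly mixed input would need per-block detection).

```python
import re


def split_citations(text: str) -> list[str]:
    """Split raw input into individual citation strings using format cues."""
    text = text.strip()
    if not text:
        return []
    if re.search(r"^@\w+\s*\{", text, flags=re.MULTILINE):
        # BibTeX: every entry starts with "@type{" at the beginning of a line.
        parts = re.split(r"(?m)^(?=@\w+\s*\{)", text)
    elif re.search(r"^ER\s+-", text, flags=re.MULTILINE):
        # RIS: every record is terminated by an "ER  -" tag line.
        parts = re.split(r"(?m)^ER\s+-.*$", text)
    else:
        # Plain text / mixed: records are separated by blank lines.
        parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned string can then be passed to the format-specific parsing rules described next.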
Parsing Rules
BibTeX
- Year: year field
- Journal: journal field
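A minimal sketch of the BibTeX rule, using only the standard library. The helper names bibtex_field and parse_bibtex_entry are hypothetical, and the regex assumes one field per line with the entry's closing brace on its own line; a full BibTeX parser would be needed for anything more irregular.

```python
import re
from typing import Optional


def bibtex_field(entry: str, name: str) -> Optional[str]:
    """Return a field value, or None if the field is absent or empty."""
    # Accepts: name = {value}, name = "value", or name = 2022 (bare number).
    pattern = (
        rf'(?im)^\s*{re.escape(name)}\s*=\s*'
        rf'(?:\{{(.*?)\}}|"(.*?)"|(\d+))\s*,?\s*$'
    )
    m = re.search(pattern, entry)
    if not m:
        return None
    value = next((g for g in m.groups() if g is not None), "").strip()
    return value or None


def parse_bibtex_entry(entry: str) -> dict:
    """Extract year and journal; anything unclear becomes 'unknown'."""
    year = bibtex_field(entry, "year")
    journal = bibtex_field(entry, "journal")
    return {
        # Only a clean 4-digit year is accepted; no guessing otherwise.
        "year": year if year and re.fullmatch(r"\d{4}", year) else "unknown",
        "journal": journal or "unknown",
    }
```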
RIS
- Year: PY or Y1 (use the first 4-digit year)
- Journal: first non-empty value among JO / JF / T2
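The RIS rule can be sketched like this. The function name parse_ris_record is hypothetical, and the sketch assumes the usual two-character tag followed by spaces and a hyphen (for example "PY  - 2021").

```python
import re


def parse_ris_record(record: str) -> dict:
    """Extract year and journal from one RIS record."""
    tags = {}
    for line in record.splitlines():
        m = re.match(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$", line)
        if m:
            tag, value = m.group(1), m.group(2).strip()
            # Keep only the first non-empty value seen for each tag.
            if value and tag not in tags:
                tags[tag] = value

    year = "unknown"
    for tag in ("PY", "Y1"):
        # Take the first standalone 4-digit run, e.g. "2021/03/15/" -> "2021".
        m = re.search(r"(?<!\d)\d{4}(?!\d)", tags.get(tag, ""))
        if m:
            year = m.group(0)
            break

    # First non-empty value among JO, JF, T2, in that order.
    journal = next((tags[t] for t in ("JO", "JF", "T2") if t in tags), "unknown")
    return {"year": year, "journal": journal}
```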
Plain Text / Mixed Citations
- Year: first 4-digit year in the range 1900-2099 found near the end of the citation
- Journal: infer only when patterns are unambiguous (e.g., Journal Name. 2022; or Journal Name, 2022); otherwise set to unknown
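One way to approximate the plain-text rule is shown below. Several choices here are assumptions rather than documented behavior: "near the end" is taken to mean the last 80 characters (TAIL_CHARS), and "unambiguous" is taken to mean that the two patterns yield exactly one distinct candidate. The function names are hypothetical.

```python
import re

YEAR = r"(?:19|20)\d{2}"  # covers exactly 1900-2099
TAIL_CHARS = 80           # assumed window for "near the end"

JOURNAL_PATTERNS = (
    re.compile(rf"([^.,;]+)\.\s*{YEAR};"),      # "Journal Name. 2022;"
    re.compile(rf"([^.,;]+),\s*{YEAR}(?!\d)"),  # "Journal Name, 2022"
)


def plain_text_year(citation: str) -> str:
    """First standalone 1900-2099 year in the tail of the citation."""
    m = re.search(rf"(?<!\d){YEAR}(?!\d)", citation[-TAIL_CHARS:])
    return m.group(0) if m else "unknown"


def plain_text_journal(citation: str) -> str:
    """Return a journal only when exactly one candidate is found."""
    candidates = set()
    for pattern in JOURNAL_PATTERNS:
        for m in pattern.finditer(citation):
            candidates.add(m.group(1).strip())
    candidates.discard("")
    # Zero or several candidates are treated as ambiguous.
    return candidates.pop() if len(candidates) == 1 else "unknown"
```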
PDF Directory (Script-Based)
- Year: prefer PDF metadata; otherwise use the first 4-digit year found on the first page
- Journal: prefer PDF metadata; otherwise scan first-page lines containing keywords such as Journal, Proceedings, or Transactions. If unclear, set to unknown.
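The fallback logic can be sketched on already-extracted text, leaving the PDF library out of the picture. Everything named here is hypothetical: the metadata dictionary with "year" and "journal" keys, the function names, the 1900-2099 year range (borrowed from the plain-text rule), and the reading of "unclear" as "zero or more than one matching line". How scripts/process_pdfs.py actually obtains metadata and first-page text is not shown in this document.

```python
import re

JOURNAL_KEYWORDS = ("journal", "proceedings", "transactions")


def first_page_year(first_page_text: str) -> str:
    """First standalone 4-digit year (1900-2099 assumed) on the first page."""
    m = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", first_page_text)
    return m.group(0) if m else "unknown"


def first_page_journal(first_page_text: str) -> str:
    """Return the single keyword-bearing line, or 'unknown' if unclear."""
    hits = []
    for line in first_page_text.splitlines():
        line = line.strip()
        if any(k in line.lower() for k in JOURNAL_KEYWORDS) and line not in hits:
            hits.append(line)
    return hits[0] if len(hits) == 1 else "unknown"


def pdf_fields(metadata: dict, first_page_text: str) -> dict:
    """Prefer metadata values; fall back to first-page scanning."""
    year = str(metadata.get("year") or "").strip()
    journal = str(metadata.get("journal") or "").strip()
    return {
        "year": year or first_page_year(first_page_text),
        "journal": journal or first_page_journal(first_page_text),
    }
```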
Journal Normalization Rules
- Trim leading/trailing whitespace.
- Collapse multiple spaces into a single space.
- Remove trailing periods and commas.
- If casing is inconsistent, convert to Title Case; otherwise keep original casing.
- Do not expand abbreviations or infer aliases.
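These rules can be sketched as a single function. The rules do not define "inconsistent casing", so the predicate below is an assumption: a name counts as inconsistent when it is entirely lower-case, or entirely upper-case with more than one word (so a single-word acronym is left alone). Title Case is approximated with string.capwords, which also capitalizes short words such as "of". Returning "unknown" for an empty result is likewise an assumption.

```python
import re
import string


def looks_inconsistent(name: str) -> bool:
    """Assumed heuristic for 'casing is inconsistent' (see lead-in)."""
    multiword = len(name.split()) > 1
    return name.islower() or (name.isupper() and multiword)


def normalize_journal(name: str) -> str:
    """Apply the normalization rules without expanding abbreviations."""
    name = name.strip()                # trim leading/trailing whitespace
    name = re.sub(r"\s+", " ", name)   # collapse runs of whitespace
    name = name.rstrip(" .,")          # drop trailing periods and commas
    if not name:
        return "unknown"
    if looks_inconsistent(name):
        name = string.capwords(name)   # naive Title Case
    return name
```

For example, normalize_journal("  journal   of x., ") yields "Journal Of X", while "The Lancet." becomes "The Lancet" and "JAMA" is returned unchanged.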
Failure Handling and Safety Constraints
- Do not guess missing/unclear year or journal values.
- Count ambiguous entries as unknown and report the totals in the summary.
- No network access; no external APIs; no credentials.
- Do not read files outside the user-provided paths.
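The path constraint could be enforced with a containment check along these lines. This is an illustrative sketch with hypothetical function names, not the skill's implementation; it resolves symlinks first so that a link pointing outside the input directory is rejected.

```python
from pathlib import Path


def is_within(candidate, allowed_root) -> bool:
    """True if candidate resolves to allowed_root or a path beneath it."""
    root = Path(allowed_root).resolve()
    target = Path(candidate).resolve()
    return target == root or root in target.parents


def list_pdfs(input_dir) -> list:
    """Collect PDF files under input_dir, skipping anything that escapes it."""
    root = Path(input_dir).resolve()
    return sorted(
        p for p in root.rglob("*.pdf")
        if p.is_file() and is_within(p, root)
    )
```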
Sorting and Reporting Requirements
- Tables are sorted by:
- count descending
- then by name ascending (year or journal title)
- Always report:
- total processed count
- unknown year count
- unknown journal count
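The sorting and reporting requirements can be sketched as follows. The function names are hypothetical. Two assumptions are made: percentages use the total processed count as the denominator (which matches the sample figures, e.g. 18 of 120 is 15.0%), and unknown values are reported separately rather than as a table row.

```python
from collections import Counter


def distribution(values: list) -> tuple:
    """Return (rows, total, unknown_count) for a list of extracted values."""
    total = len(values)
    counts = Counter(v for v in values if v != "unknown")
    unknown = total - sum(counts.values())
    # Sort by count descending, then by name ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = [(name, n, f"{100 * n / total:.1f}%") for name, n in ordered]
    return rows, total, unknown


def render_table(header: str, rows: list) -> str:
    """Render rows as a Markdown table with Count and Percent columns."""
    lines = [f"| {header} | Count | Percent |", "|------|-------|---------|"]
    lines += [f"| {name} | {n} | {pct} |" for name, n, pct in rows]
    return "\n".join(lines)
```

Calling distribution once for years and once for journals gives everything the summary needs: the total, the two unknown counts, and the rows for each table.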
When Not to Use
- Do not use this skill when the required source data, identifiers, files, or credentials are missing.
- Do not use this skill when the user asks for fabricated results, unsupported claims, or out-of-scope conclusions.
- Do not use this skill when a simpler direct answer is more appropriate than the documented workflow.
Required Inputs
- A clearly specified task goal aligned with the documented scope.
- All required files, identifiers, parameters, or environment variables before execution.
- Any domain constraints, formatting requirements, and expected output destination if applicable.
Recommended Workflow
- Validate the request against the skill boundary and confirm all required inputs are present.
- Select the documented execution path and prefer the simplest supported command or procedure.
- Produce the expected output using the documented file format, schema, or narrative structure.
- Run a final validation pass for completeness, consistency, and safety before returning the result.
Output Contract
- Return a structured deliverable that is directly usable without reformatting.
- If a file is produced, prefer a deterministic output name such as literature_statistics_result.md unless the skill documentation defines a better convention.
- Include a short validation summary describing what was checked, what assumptions were made, and any remaining limitations.
Validation and Safety Rules
- Validate required inputs before execution and stop early when mandatory fields or files are missing.
- Do not fabricate measurements, references, findings, or conclusions that are not supported by the provided source material.
- Emit a clear warning when credentials, privacy constraints, safety boundaries, or unsupported requests affect the result.
- Keep the output safe, reproducible, and within the documented scope at all times.
Failure Handling
- If validation fails, explain the exact missing field, file, or parameter and show the minimum fix required.
- If an external dependency or script fails, surface the command path, likely cause, and the next recovery step.
- If partial output is returned, label it clearly and identify which checks could not be completed.
Quick Validation
Run this minimal verification path before full execution when possible:
python scripts/process_pdfs.py --help
Expected output format:
Result file: literature_statistics_result.md
Validation summary: PASS/FAIL with brief notes
Assumptions: explicit list if any