Awesome-omni-skill scholarag

Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.

Install

Source · Clone the upstream repo:

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data-ai/scholarag-hosungyou" ~/.claude/skills/diegosouzapw-awesome-omni-skill-scholarag-fa79d7 && rm -rf "$T"

Manifest: skills/data-ai/scholarag-hosungyou/SKILL.md

Source content

ScholaRAG: Systematic Review Automation Skill

For: Claude Code (AI assistant in VS Code)
Purpose: Guide researchers through PRISMA 2020 systematic literature review + RAG-powered analysis


Quick Start (5 minutes)

For Researchers

  1. Initialize project: `python scholarag_cli.py init`
  2. Paste Stage 1 prompt: Copy from https://www.scholarag.com/guide/01-introduction
  3. Answer Claude's questions → Config created automatically
  4. Proceed through 7 stages conversationally

For AI Assistants (Claude Code)

When researcher provides a ScholaRAG prompt:

  1. Check for HTML metadata block (`<!-- METADATA ... -->` at top of prompt)
  2. Identify current stage (1-7) from metadata `stage` field
  3. Follow conversation pattern (from metadata `conversation_flow`)
  4. Validate completion (using metadata `validation_rules`)
  5. Auto-execute commands (when `auto_execute: true`)
  6. Update `.claude/context.json` (track progress)
  7. Show next stage prompt (from metadata `next_stage.prompt_file`)

Researcher should NEVER touch terminal. You execute all scripts automatically.


7-Stage Workflow Overview

| Stage | Name | Read This File | Duration | Auto-Execute |
|-------|------|----------------|----------|--------------|
| 1 | Research Setup | skills/claude_only/stage1_research_setup.md | 15-20 min | ✅ `scholarag init` |
| 2 | Query Strategy | skills/claude_only/stage2_query_strategy.md | 15-25 min | ❌ Manual review |
| 3 | PRISMA Config | skills/claude_only/stage3_prisma_config.md | 20-30 min | ❌ Manual review |
| 4 | RAG Design | skills/claude_only/stage4_rag_design.md | 10-15 min | ❌ Manual review |
| 5 | Execution | skills/claude_only/stage5_execution.md | 2-4 hours | ✅ Run all 5 scripts |
| 6 | Research Conversation | skills/claude_only/stage6_research_conversation.md | Ongoing | ❌ Interactive |
| 7 | Documentation | skills/claude_only/stage7_documentation.md | 30-60 min | ✅ Generate PRISMA |

Progressive Disclosure: Load stage file only when researcher enters that stage. Don't preload all 7 stages (token waste).


Critical Branching Points

🔴 project_type (Stage 1 Decision)

Two modes available:

| Mode | Threshold | Output | Best For |
|------|-----------|--------|----------|
| `knowledge_repository` | 50% (lenient) | 15K-20K papers | Teaching, AI assistant, exploration |
| `systematic_review` | 90% (strict) | 50-300 papers | Meta-analysis, publication |

Quick decision:

  • Publishing a systematic review? → `systematic_review`
  • Comprehensive domain coverage? → `knowledge_repository`

Detailed decision tree: skills/reference/project_type_decision_tree.md

When to read decision tree:

  • Researcher asks: "Which project_type should I choose?"
  • Researcher says: "I'm unsure about my research goals"
  • Stage 1 initialization (proactively offer decision helper)

🔴 Stage 6 Scenarios (7 Research Modes)

Stage 6 branches into 7 specialized conversation scenarios:

  1. overview (Context Scanning): High-level themes, methods, findings
  2. hypothesis (Hypothesis Validation): Evidence for/against with effect sizes
  3. statistics (Statistical Extraction): RCT data table (tools, Cohen's d, samples)
  4. methods (Methodology Comparison): RCT vs quasi vs mixed methods
  5. contradictions (Contradiction Detection): Conflicting results + analysis
  6. policy (Policy Translation): Actionable recommendations for stakeholders
  7. grant (Future Research Design): Follow-up study design + hypotheses

Details: skills/claude_only/stage6_research_conversation.md

When to read: Stage 6 entry (researcher asks "What can I query?")


Error Recovery

When errors occur: skills/reference/error_recovery.md

Quick fixes (common issues):

| Error | Quick Fix | Detailed Guide |
|-------|-----------|----------------|
| Too many papers (>30K) | Refine query in Stage 2, re-run fetch | error_recovery.md §2.1 |
| API key missing | Add `ANTHROPIC_API_KEY` to `.env` | error_recovery.md §3.1 |
| Low PDF success (<30%) | Filter for `open_access` in Stage 1 | error_recovery.md §4.1 |
| All papers excluded (0 papers) | Lower threshold or broaden query | error_recovery.md §3.2 |
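
A minimal pre-flight sketch for the API-key case, assuming the scripts read `.env` via python-dotenv (the variable name comes from the table above; the loading mechanism is an assumption):

# Sketch: fail fast if ANTHROPIC_API_KEY is not set before running the pipeline.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory, if present
if not os.getenv("ANTHROPIC_API_KEY"):
    raise SystemExit("ANTHROPIC_API_KEY missing: add it to .env (see error_recovery.md §3.1)")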

Reference Materials (Load Only When Needed)

Progressive disclosure: Don't preload these. Read only when researcher asks specific questions.

TopicFileWhen to Read
API endpointsskills/reference/api_reference.mdResearcher asks about Semantic Scholar, OpenAlex, arXiv
Config schemaskills/reference/config_schema.mdResearcher asks "What fields are in config.yaml?"
PRISMA checklistskills/reference/prisma_guidelines.mdResearcher asks about PRISMA 2020 compliance
Troubleshootingskills/reference/troubleshooting.mdResearcher reports errors not in Quick Fixes

Architecture Overview

File dependencies: https://www.scholarag.com/codebook/architecture

Key principle: Scripts read from `config.yaml` (single source of truth), never hardcode values.

Critical scripts (read `project_type` from config):

  • `03_screen_papers.py`: Sets threshold (50% or 90%)
  • `07_generate_prisma.py`: Changes diagram title ("Knowledge Repository" vs "Systematic Review")
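
A minimal sketch of this pattern, assuming `project_type` is a top-level key in config.yaml (see skills/reference/config_schema.md for the actual schema); the 50%/90% mapping comes from the mode table above:

# Sketch: derive the screening threshold from config.yaml instead of hardcoding it.
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# project_type is constrained in Stage 1 to one of these two values.
THRESHOLDS = {"knowledge_repository": 0.50, "systematic_review": 0.90}
threshold = THRESHOLDS[config["project_type"]]
print(f"Screening threshold: {threshold:.0%}")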

For Codex Users

If researcher is using OpenAI Codex instead of Claude Code:

See AGENTS.md for bash-based task workflows.

Codex workflow differs:

  • Task-oriented (not conversation-oriented)
  • Bash commands (not validation rules)
  • Exit codes (not metadata parsing)

Universal reference files (Claude + Codex both use):

  • skills/reference/project_type_decision_tree.md
  • skills/reference/api_reference.md
  • skills/reference/config_schema.md

Token Optimization Notes

This file: ~400 lines (loaded once per conversation)

Stage-specific files: ~300-500 lines each (loaded on-demand)

Total per conversation: ~700 lines (this file + current stage file)

Previous approach: ~2,000 lines (all context upfront)

Token reduction: 65%

How it works:

  1. Researcher starts Stage 1 → You load this file + `stage1_research_setup.md`
  2. Researcher moves to Stage 2 → You load `stage2_query_strategy.md` (Stage 1 file unloaded)
  3. Reference files loaded only when researcher asks (e.g., "How does Semantic Scholar API work?")

Metadata Block Format

All prompts in `prompts/*.md` contain HTML comment metadata at top:

<!-- METADATA
stage: 1
stage_name: "Research Domain Setup"
expected_duration: "15-20 minutes"
conversation_mode: "interactive"
expected_turns: "6-10"
outputs:
  required:
    - project_name: "Descriptive name"
    - research_question: "Clear, answerable question"
    - project_type: "knowledge_repository OR systematic_review"
validation_rules:
  project_type:
    required: true
    allowed_values: ["knowledge_repository", "systematic_review"]
cli_commands:
  - command: "python scholarag_cli.py init ..."
    auto_execute: true
next_stage:
  stage: 2
  prompt_file: "prompts/02_query_strategy.md"
-->

How to use:

  1. Parse YAML inside HTML comment (lines between `<!-- METADATA` and `-->`)
  2. Extract fields: `stage`, `expected_turns`, `validation_rules`, `cli_commands`, `next_stage`
  3. Follow conversation pattern: Ask questions matching `expected_turns` count
  4. Validate: Check user inputs against `validation_rules`
  5. Execute: Run `cli_commands` when conversation complete
  6. Transition: Show prompt from `next_stage.prompt_file`
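
A minimal parsing sketch, assuming PyYAML; the delimiters come from the block format above, everything else is illustrative:

# Sketch: extract and parse the METADATA block from a prompt file.
import re
import yaml  # pip install pyyaml

def parse_metadata(prompt_text):
    """Return the metadata as a dict, or None if no block is present."""
    match = re.search(r"<!--\s*METADATA\s*\n(.*?)\n-->", prompt_text, re.DOTALL)
    if match is None:
        return None
    return yaml.safe_load(match.group(1))

meta = parse_metadata(open("prompts/02_query_strategy.md").read())
if meta:
    print(meta["stage"], meta["next_stage"]["prompt_file"])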

Divergence Handling

Common researcher confusions (from metadata `divergence_handling`):

Divergence 1: "Can you help me download PDFs?" (in Stage 1)

Response: "PDF downloading happens in Stage 4 (after screening in Stage 3). Right now in Stage 1, let's first define your research scope and choose project_type. We'll design queries in Stage 2, configure PRISMA in Stage 3, then download PDFs in Stage 4."

Divergence 2: "I want to skip systematic review" (in Stage 1)

Response: "If you don't need publication-quality systematic review, choose

project_type: knowledge_repository
in the next question. This mode uses lenient filtering (50% threshold) for comprehensive domain coverage (15K-20K papers). It's perfect for teaching materials, AI assistants, or exploratory research."

Divergence 3: "What's the difference between the two modes?" (in Stage 1)

Response: "Let me explain:

knowledge_repository:

  • 50% threshold (lenient, removes only spam)
  • 15,000-20,000 papers output
  • For: Teaching, exploration, AI assistant

systematic_review:

  • 90% threshold (strict, PRISMA 2020)
  • 50-300 papers output
  • For: Meta-analysis, publication

See full decision tree: skills/reference/project_type_decision_tree.md"


Conversation Flow Example (Stage 1)

Typical pattern (6-10 turns):

  1. Turn 1: Researcher provides research topic

    • You ask: "Is this for exploratory domain mapping or publication-quality systematic review?"
  2. Turn 2-3: Researcher answers scope questions

    • You suggest: `project_type` based on answers, explain threshold implications
    • Example: "Based on your goal of meta-analysis, I recommend `systematic_review` mode with 90% screening threshold."
  3. Turn 4-5: Researcher confirms project_type choice

    • You suggest: Year range, publication types, expected databases
    • Example: "For language learning studies, I recommend 2015-2025 (10 years) focusing on Semantic Scholar and ERIC."
  4. Turn 6-8: Researcher provides final details (domain, year range)

    • You summarize: All decisions, ask for confirmation
    • Example: "Here's what I'll create: [summary]. Ready to initialize?"
  5. Turn 9-10: Researcher confirms initialization

    • You execute: `scholarag_cli.py init`, create `config.yaml`, show next steps
    • Example: "✅ Project initialized! Next, let's design your search query in Stage 2."

Completion Checklist (Stage-Specific)

Stage 1 example (from metadata `completion_checklist`):

  • `project_name` is descriptive and unique (≥10 chars)
  • `research_question` is specific and answerable (≥20 chars)
  • `project_type` chosen with understanding of implications (50% vs 90%)
  • `year_range` is realistic for scope (≤25 years, not before 2000)
  • `config.yaml` created successfully (file exists, valid YAML)

When all checked → Auto-execute `scholarag_cli.py init` → Show Stage 2 prompt
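
A toy version of these checks as code; the length and year bounds come from the checklist above, while the function and field names are illustrative:

# Sketch: programmatic Stage 1 completion checklist (field names assumed).
from pathlib import Path

def stage1_complete(outputs):
    year_start, year_end = outputs.get("year_range", (0, 0))
    return all([
        len(outputs.get("project_name", "")) >= 10,       # descriptive name
        len(outputs.get("research_question", "")) >= 20,  # specific, answerable
        outputs.get("project_type") in ("knowledge_repository", "systematic_review"),
        year_start >= 2000 and (year_end - year_start) <= 25,
        Path("config.yaml").exists(),                     # created by the CLI
    ])

print(stage1_complete({
    "project_name": "AI-Chatbots-Language-Learning",
    "research_question": "How do AI chatbots improve speaking proficiency in EFL learners?",
    "project_type": "systematic_review",
    "year_range": (2015, 2025),
}))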


Example Commands You Will Execute

Stage 1: Initialize

python scholarag_cli.py init \
  --name "AI-Chatbots-Language-Learning" \
  --question "How do AI chatbots improve speaking proficiency in EFL learners?" \
  --domain education

Stage 5: Run Pipeline (All 5 Scripts)

# Fetch papers
python scripts/01_fetch_papers.py --project projects/YYYY-MM-DD_ProjectName

# Deduplicate
python scripts/02_deduplicate.py --project projects/YYYY-MM-DD_ProjectName

# Screen with AI
python scripts/03_screen_papers.py --project projects/YYYY-MM-DD_ProjectName

# Download PDFs
python scripts/04_download_pdfs.py --project projects/YYYY-MM-DD_ProjectName

# Build RAG
python scripts/05_build_rag.py --project projects/YYYY-MM-DD_ProjectName
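
Since Stage 5 is the auto-execute stage, a sketch of running all five scripts in sequence with exit-code checks (paths as above; the stop-on-first-failure behavior is an assumption):

# Sketch: run the Stage 5 pipeline end-to-end, stopping at the first failure.
import subprocess
import sys

PROJECT = "projects/YYYY-MM-DD_ProjectName"  # substitute the real project directory
SCRIPTS = [
    "scripts/01_fetch_papers.py",
    "scripts/02_deduplicate.py",
    "scripts/03_screen_papers.py",
    "scripts/04_download_pdfs.py",
    "scripts/05_build_rag.py",
]

for script in SCRIPTS:
    result = subprocess.run([sys.executable, script, "--project", PROJECT])
    if result.returncode != 0:
        sys.exit(f"{script} failed (exit code {result.returncode}); see error_recovery.md")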

Stage 7: Generate PRISMA

python scripts/07_generate_prisma.py --project projects/YYYY-MM-DD_ProjectName

Integration with .claude/context.json

You should update this file after each stage:

{
  "current_stage": {
    "stage": 2,
    "name": "Query Strategy",
    "status": "in_progress",
    "started_at": "2025-10-24T10:30:00Z"
  },
  "completed_stages": [
    {
      "stage": 1,
      "name": "Research Setup",
      "completed_at": "2025-10-24T10:25:00Z",
      "outputs": {
        "project_name": "AI-Chatbots-Language-Learning",
        "research_question": "How do AI chatbots improve speaking proficiency?",
        "project_type": "systematic_review"
      }
    }
  ],
  "project": {
    "name": "AI-Chatbots-Language-Learning",
    "created": "2025-10-24",
    "research_question": "How do AI chatbots improve speaking proficiency in EFL learners?",
    "project_type": "systematic_review"
  }
}

Purpose: Track progress, enable the `scholarag status` command to show the current stage.
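
A minimal update sketch matching the shape above (the helper name and the exact key set are illustrative):

# Sketch: mark the current stage complete and start the next one in context.json.
import json
from datetime import datetime, timezone
from pathlib import Path

def advance_stage(outputs, next_stage, next_name):
    path = Path(".claude/context.json")
    ctx = json.loads(path.read_text())
    now = datetime.now(timezone.utc).isoformat()
    done = ctx["current_stage"]
    ctx["completed_stages"].append({
        "stage": done["stage"],
        "name": done["name"],
        "completed_at": now,
        "outputs": outputs,
    })
    ctx["current_stage"] = {
        "stage": next_stage,
        "name": next_name,
        "status": "in_progress",
        "started_at": now,
    }
    path.write_text(json.dumps(ctx, indent=2))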


FAQ for AI Assistants

Q: Should I always read stage files in order (1→2→3...)?

A: No! Read only the file for the current stage researcher is in. Use progressive disclosure.

Q: What if researcher jumps to Stage 5 without completing Stages 1-4?

A: Check `.claude/context.json` for completed stages. If prerequisites are missing, politely redirect: "Stage 5 requires config.yaml from Stage 1, search_query from Stage 2, and PRISMA criteria from Stage 3. Let's complete those first."

Q: When should I read `skills/reference/` files?

A: Only when researcher explicitly asks. Examples:

  • "How does Semantic Scholar API work?" → Read
    api_reference.md
  • "What are all the config.yaml fields?" → Read
    config_schema.md
  • "Why should I choose systematic_review?" → Read
    project_type_decision_tree.md

Q: What if I don't understand metadata in prompts/*.md?

A: All metadata fields are documented in `skills/claude_only/metadata_spec.md`. Read that file if you encounter unknown fields.


Additional Resources

Detailed implementation guide: See CLAUDE.md for:

  • 🎓 User profile (researchers with limited coding experience)
  • How Claude Code should behave (DO/DON'T guidelines)
  • Auto-execution patterns (echo pipes, CLI arguments)
  • Full CLI reference and troubleshooting

For Codex/Cursor users: See AGENTS.md for task-based bash workflows


Last Updated: 2025-10-24 (v2.0 - Agent Skills Integration)
Companion files: CLAUDE.md (detailed guide), AGENTS.md (Codex workflows)
Compatible with: Claude Code v1.0+, Anthropic API
Token Budget: ~380 lines (this file) + ~300-500 lines (stage file) = ~700-900 lines per conversation