# claude-skill-registry: insight-pilot
Literature research automation - search papers, code, and blogs, deduplicate, download PDFs, analyze and generate research reports. Supports incremental updates.
Install the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or copy just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/insight-pilot" ~/.claude/skills/majiayu000-claude-skill-registry-insight-pilot && rm -rf "$T"
```
---

`skills/data/insight-pilot/SKILL.md`:

# Insight-Pilot Skill
A workflow automation skill for literature research. Searches papers, GitHub repos/code/issues, PubMed, Dev.to, and blogs, deduplicates results, downloads PDFs, analyzes content, and generates incremental research reports.
## Setup
Run the bootstrap script (it checks the environment and creates/installs only what is missing):

```bash
bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh
```

The script detects whether `~/.insight-pilot-venv` exists and whether the packages are installed, installing only when necessary. See `--help` for advanced options.
## Usage

Before running commands, activate the environment:

```bash
source ~/.insight-pilot-venv/bin/activate
```

Then use the CLI:

```bash
insight-pilot <command> [options]
```
## CLI Commands

| Command | Purpose | Required Args | Key Optional Args |
|---|---|---|---|
| `init` | Create research project | `--topic`, `--output` | `--keywords` |
| `search` | Search, merge and dedup | `--project`, `--source`, `--query` | `--limit`, `--since` |
| `download` | Download PDFs + convert to Markdown | `--project` | - |
| `analyze` | Analyze papers with LLM | `--project` | - |
| `index` | Generate index.md | `--project` | - |
| `status` | Check project state | `--project` | - |
| `sources` | Manage blog/RSS sources | `--project` | - |
## JSON Output Mode

Add the `--json` flag for structured output (recommended for agents):

```bash
insight-pilot status --json --project ./research/myproject
```
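When driving the CLI from a script, the JSON output can be captured with a subprocess call and parsed like this (a minimal sketch; the exact key layout of the status payload is not documented here, so any keys you access are assumptions to verify against real output):

```python
import json
import subprocess


def parse_status(raw: str) -> dict:
    """Parse the JSON text emitted by `insight-pilot status --json`."""
    return json.loads(raw)


def get_status(project: str) -> dict:
    """Run the status command for a project and return its parsed output."""
    result = subprocess.run(
        ["insight-pilot", "status", "--json", "--project", project],
        capture_output=True, text=True, check=True,
    )
    return parse_status(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the parser easy to exercise on captured output without invoking the CLI.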
## Blog/RSS Sources Configuration

Create `sources.yaml` in your project root:

```yaml
blogs:
  - name: "Cursor Blog"
    type: "ghost"
    url: "https://cursor.sh/blog"
    api_key: "auto"
  - name: "Example WP Blog"
    type: "wordpress"
    url: "https://blog.example.com"
  - name: "OpenAI Blog"
    type: "rss"
    url: "https://openai.com/blog/rss.xml"
    category: "ai"
```
Manage sources via:

```bash
insight-pilot sources --project ./research/webagent
```
Environment variables:

- `GITHUB_TOKEN` (higher GitHub API rate limit)
- `PUBMED_EMAIL` (required by NCBI)
- `OPENALEX_MAILTO` (OpenAlex polite usage)
- `INSIGHT_PILOT_SOURCES` (override `sources.yaml` path)
### New Source Examples

```bash
# GitHub repositories + code + issues
insight-pilot search --project $PROJECT --source github --query "agent framework" --limit 30

# PubMed (requires PUBMED_EMAIL)
insight-pilot search --project $PROJECT --source pubmed --query "clinical agents" --limit 20

# Dev.to articles
insight-pilot search --project $PROJECT --source devto --query "ai agents" --limit 20

# Blogs (Ghost/WordPress/RSS from sources.yaml)
insight-pilot search --project $PROJECT --source blog --query "agents" --limit 20
```
## Workflow (Agent + CLI Collaboration)

This is the complete workflow for Agent + CLI collaboration.

Execution Principles:
- Run CLI commands in sequence as prescribed; no line-by-line confirmation is needed.
- Agent intervention is ONLY required in Phase 2 for manual review (checking `items.json` and setting `status`/`exclude_reason`).
### Phase 1: Search and Initial Filtering

Execute the following commands directly, no confirmation needed:

```bash
PROJECT=./research/webagent

# Step 1: Initialize project
insight-pilot init --topic "WebAgent Research" --keywords "web agent,browser agent" --output $PROJECT

# Step 2: Search multiple sources (auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex github pubmed devto blog --query "web agent" --limit 50
```
### Phase 2: Agent Review (Manual Check)

After deduplication, the Agent reviews the paper list and removes content unrelated to the research topic.

```bash
# Check current status
insight-pilot status --json --project $PROJECT
```
Agent Actions:
- Read `$PROJECT/.insight/items.json`
- Check `title` and `abstract` for each paper
- Mark unrelated papers: set `status` to `"excluded"` and add an `exclude_reason`
- Save the updated `items.json`

Example of an excluded entry:

```json
{
  "id": "i0023",
  "title": "Unrelated Paper Title",
  "status": "excluded",
  "exclude_reason": "Not related to web agents, focuses on chemical agents"
}
```
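The review edit itself can be scripted once the Agent has decided which ids to drop. A minimal sketch, assuming `items.json` holds a JSON list of item objects with `id` and `status` fields (consistent with the Item schema in this document):

```python
import json
from pathlib import Path


def exclude_items(items_path: Path, bad_ids: dict) -> int:
    """Mark the given item ids as excluded in items.json.

    bad_ids maps item id -> exclude_reason. Returns the number of
    items that were changed.
    """
    items = json.loads(items_path.read_text(encoding="utf-8"))
    changed = 0
    for item in items:
        reason = bad_ids.get(item["id"])
        if reason is not None:
            item["status"] = "excluded"
            item["exclude_reason"] = reason
            changed += 1
    items_path.write_text(json.dumps(items, indent=2), encoding="utf-8")
    return changed
```

For example, `exclude_items(Path("$PROJECT/.insight/items.json"), {"i0023": "Not related to web agents"})` applies the exclusion shown above in one pass.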
### Phase 3: Download PDFs

Execute directly, no confirmation needed:

```bash
# Step 3: Download PDFs (converts to Markdown automatically)
insight-pilot download --project $PROJECT
```
Download Results:
- Success: `download_status: "success"`, PDF saved to `papers/`
- Failed: `download_status: "failed"`, recorded in `$PROJECT/.insight/download_failed.json`

Failure list format:

```json
[
  {
    "id": "i0015",
    "title": "Paper Title",
    "url": "https://...",
    "error": "Connection timeout",
    "failed_at": "2026-01-17T10:30:00Z"
  }
]
```
Note: Advanced download (proxy/browser automation for failed items) is not yet implemented.
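Until advanced download exists, the failure list can still be triaged programmatically. A small sketch that groups failures by error message, using the field names from the failure-list format shown here:

```python
import json
from pathlib import Path


def summarize_failures(failed_path: Path) -> dict:
    """Group download failures by error message for quick triage.

    Returns a mapping of error string -> list of failed item ids.
    """
    failures = json.loads(failed_path.read_text(encoding="utf-8"))
    by_error: dict = {}
    for entry in failures:
        by_error.setdefault(entry["error"], []).append(entry["id"])
    return by_error
```

Grouping by error makes it obvious whether failures are transient (timeouts) or structural (missing PDF URLs) before deciding what to retry manually.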
### Phase 4: Analyze Papers

Precondition: Phase 3 (Download PDFs) must be completed first; the `download` command automatically converts PDFs to Markdown.

MUST try LLM analysis first. If an LLM is configured, run directly:

```bash
# Step 4: LLM Analysis (prefers converted Markdown, falls back to PDF text extraction)
insight-pilot analyze --project $PROJECT
```
Content Source Priority:
1. Markdown (from `download` auto-conversion via pymupdf4llm)
2. PDF text extraction (PyMuPDF)

LLM Configuration: Create `.codex/skills/insight-pilot/llm.yaml`:

```yaml
provider: openai   # openai / anthropic / ollama
model: gpt-4o-mini
api_key: sk-xxx    # or set env var OPENAI_API_KEY
```
When LLM is not configured: Manual Analysis Required

If no LLM is configured, the Agent needs to analyze manually:
- Read the PDF files in the `papers/` directory
- Extract key information for each paper
- Write the analysis results to `$PROJECT/.insight/analysis/{id}.json`
Analysis File Format (`$PROJECT/.insight/analysis/{id}.json`):

```json
{
  "id": "i0001",
  "title": "Paper Title",
  "summary": "One sentence summary",
  "brief_analysis": "2-3 sentences brief analysis",
  "detailed_analysis": "300-500 words detailed analysis",
  "contributions": ["Contribution 1", "Contribution 2"],
  "methodology": "Methodology description",
  "key_findings": ["Finding 1", "Finding 2"],
  "limitations": ["Limitations"],
  "future_work": ["Future work 1"],
  "relevance_score": 8,
  "tags": ["webagent", "benchmark", "multimodal"],
  "analyzed_at": "2026-01-17T12:00:00Z"
}
```
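For the manual path, a writer helper keeps the analysis files consistent. A sketch (the key set is copied from the format above; the idea of rejecting records with missing keys is an assumption, not part of the CLI):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Keys taken from the analysis file format; analyzed_at is filled in on write.
REQUIRED_KEYS = {
    "id", "title", "summary", "brief_analysis", "detailed_analysis",
    "contributions", "methodology", "key_findings", "limitations",
    "future_work", "relevance_score", "tags",
}


def write_analysis(analysis_dir: Path, record: dict) -> Path:
    """Validate a record and write it to analysis/{id}.json."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    record = {**record, "analyzed_at": datetime.now(timezone.utc).isoformat()}
    out = analysis_dir / f"{record['id']}.json"
    out.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

Failing fast on incomplete records means the later `index` step never sees a half-written analysis file.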
### Phase 5: Generate Incremental Report

```bash
# Step 5: Generate/Update Index
insight-pilot index --project $PROJECT
```
Reports are stored in `$PROJECT/index.md`, which lists only analyzed papers and links to the detailed per-paper reports in `reports/{id}.md`.
Report Structure:

```markdown
# WebAgent Research

> **Generated**: 2026-01-18 10:30
> **Keywords**: web agent, browser agent
> **Analyzed**: 5 papers

---

## 📚 Analyzed Papers

### [Paper Title](reports/i0001.md)
**Authors**: Author A, Author B et al. | **Date**: 2026-01-15 | **Links**: arXiv/DOI | **Relevance**: 8/10

**Summary**: One sentence summary...

> 2-3 sentences brief analysis...

**Tags**: `webagent` `benchmark` `multimodal`

---

## ⚠️ Papers Not Available

_The following papers could not be downloaded. Only abstracts are shown._

### Paper Title
**Authors**: ... | **Date**: ... | **Links**: ...

> Abstract...

---

## 📊 Statistics

| Metric | Value |
|--------|-------|
| Papers Analyzed | 5 |
| Download Failed | 1 |
| Total Processed | 6 |
```
## Incremental Update Workflow

For daily/weekly updates:

```bash
# 1. Search new papers (use --since for a date limit; auto merge & dedup)
insight-pilot search --project $PROJECT --source arxiv openalex --query "web agent" --since 2026-01-17 --limit 20

# 2. [Agent] Review newly added papers

# 3. Download PDFs for new papers
insight-pilot download --project $PROJECT

# 4. [Agent] Analyze new papers, update reports

# 5. Regenerate index
insight-pilot index --project $PROJECT
```
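To honor the "only process new papers" rule, one way to find the items that still need analysis is to compare `items.json` against the `analysis/` directory. A sketch, with paths and statuses taken from the project structure and Item schema in this document:

```python
import json
from pathlib import Path


def unanalyzed_items(project: Path) -> list:
    """Return ids of non-excluded items with no analysis/{id}.json yet."""
    insight = project / ".insight"
    items = json.loads((insight / "items.json").read_text(encoding="utf-8"))
    done = {p.stem for p in (insight / "analysis").glob("*.json")}
    return [
        item["id"]
        for item in items
        if item.get("status") != "excluded" and item["id"] not in done
    ]
```

Existing analysis files are treated as authoritative and never touched, which is exactly the incremental-update contract above.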
## Project Structure

```text
research/myproject/
├── .insight/
│   ├── config.yaml            # Project configuration
│   ├── state.json             # Workflow state
│   ├── items.json             # Paper metadata (incl. status, exclude_reason)
│   ├── raw_arxiv.json         # Raw search results
│   ├── raw_openalex.json
│   ├── download_failed.json   # Failed-download list (for advanced download retry)
│   ├── analysis/              # Paper analysis results
│   │   ├── i0001.json
│   │   ├── i0002.json
│   │   └── ...
│   └── markdown/              # PDF conversion output (pymupdf4llm)
│       ├── i0001/
│       │   ├── i0001.md       # Converted Markdown
│       │   └── metadata.json
│       └── ...
├── papers/                    # Downloaded PDFs
├── reports/                   # Archived historical reports
└── index.md                   # Current research report (incrementally updated)
```
## Data Schemas

### Item (Paper)

```json
{
  "id": "i0001",
  "type": "paper",
  "title": "Paper Title",
  "authors": ["Author One", "Author Two"],
  "date": "2026-01-15",
  "abstract": "...",
  "status": "active|excluded|pending",
  "exclude_reason": null,
  "identifiers": {
    "doi": "10.1234/example",
    "arxiv_id": "2601.12345",
    "openalex_id": "W1234567890"
  },
  "urls": {
    "abstract": "https://arxiv.org/abs/2601.12345",
    "pdf": "https://arxiv.org/pdf/2601.12345"
  },
  "download_status": "success|pending|failed|unavailable",
  "local_path": "./papers/i0001.pdf",
  "citation_count": 42,
  "source": ["arxiv", "openalex"],
  "collected_at": "2026-01-17T10:00:00Z"
}
```
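A minimal validator for this Item schema can catch malformed entries before they reach the workflow. A sketch; the allowed values come from the schema above, while which fields to treat as mandatory is an assumption:

```python
# Allowed values taken from the Item schema's status fields.
STATUSES = {"active", "excluded", "pending"}
DOWNLOAD_STATUSES = {"success", "pending", "failed", "unavailable"}


def validate_item(item: dict) -> list:
    """Return a list of problems found in a paper item (empty if valid)."""
    problems = []
    for key in ("id", "type", "title", "status"):
        if key not in item:
            problems.append(f"missing field: {key}")
    if item.get("status") not in STATUSES:
        problems.append(f"bad status: {item.get('status')!r}")
    ds = item.get("download_status")
    if ds is not None and ds not in DOWNLOAD_STATUSES:
        problems.append(f"bad download_status: {ds!r}")
    return problems
```

Returning a problem list instead of raising lets a caller report every issue in `items.json` in one pass.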
## Error Codes

| Code | Meaning | Retryable |
|---|---|---|
| - | Project directory doesn't exist | No |
| - | Required input files missing | No |
| - | `items.json` not found | No |
| - | Unknown data source | No |
| - | API request failed | Yes |
| - | API rate limit hit | Yes |
| - | PDF download failed | Yes |
| - | PDF to Markdown conversion failed | Yes |
| - | Required package not installed | No |
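The retryable failures (API requests, rate limits, downloads) are typically wrapped in exponential backoff. A generic sketch; the `RetryableError` class is a hypothetical stand-in, since the CLI's actual error codes are not reproduced in the table:

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for the retryable error classes."""


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying retryable errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Non-retryable conditions (missing project, missing packages) should not be wrapped this way; retrying them only hides a setup problem.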
## Agent Guidelines

Execution Principles:
- First run: run the bootstrap script to set up the environment automatically
- CLI commands (`init`, `search`, `download`, `analyze`, `index`): run in sequence, no confirmation needed
- Agent intervention is ONLY needed during Phase 2 (Review) and Manual Analysis (if no LLM)
Specific Guidelines:
- Environment Setup: run `bash .codex/skills/insight-pilot/scripts/bootstrap_env.sh` first
- Use the `--json` flag to get structured output for parsing
- Execute the CLI directly: do not ask for confirmation; follow the workflow sequence
- Review: modify `status` and `exclude_reason` in `items.json`
- LLM Analysis First: use the `analyze` command if configured; otherwise manually create `analysis/{id}.json`
- Incremental Updates: only process new papers; keep existing analysis results