Autosearch autosearch:citation-index
Deduplicate URLs across all sources, assign stable citation numbers, and merge citations from multiple subagents / sections into one consistent reference list. Prevents "same URL cited as [3] in one paragraph and [17] in another" and "different URLs merged under [5]" bugs that come from per-section synthesis.
git clone https://github.com/0xmariowu/Autosearch
T=$(mktemp -d) && git clone --depth=1 https://github.com/0xmariowu/Autosearch "$T" && mkdir -p ~/.claude/skills && cp -r "$T/autosearch/skills/meta/citation-index" ~/.claude/skills/0xmariowu-autosearch-autosearch-citation-index && rm -rf "$T"
autosearch/skills/meta/citation-index/SKILL.md
Citation Index: Stable URL-to-Number Map
Borrowed from STORM's `StormArticle.url_to_unified_index` and the citation consolidation in deepagents' `research_agent/prompts.py`. Keeps citations stable across report sections and subagent outputs.
State
    citation_index:
      entries:
        # URL (canonicalized) → entry
        "https://arxiv.org/abs/2401.12345":
          number: 1
          title: "First paper title"
          first_seen_section: "introduction"
          used_in_sections: ["introduction", "background", "method"]
          source_channel: "search-arxiv"
        "https://github.com/foo/bar/issues/42":
          number: 2
          title: "Issue: memory leak"
          first_seen_section: "implementation-notes"
          used_in_sections: ["implementation-notes"]
          source_channel: "search-github-issues"
      next_number: 3
URL Canonicalization Rules
Before indexing, normalize the URL:
- Strip tracking params: `utm_*`, `gclid`, `fbclid`, `ref=`, `source=`.
- Strip fragments (`#section-1`) unless the URL is an anchor-addressed resource (e.g. docs page).
- Lowercase host but keep path case.
- Collapse multiple slashes (`//` → `/`) except for the scheme.
- Platform-specific:
  - YouTube: `watch?v=ID` form is canonical; strip `list=` playlist params.
  - arXiv: strip version suffix for de-dup (`v2` → `""`); keep title for reference.
  - GitHub issues: canonicalize `githubissues.com/X/Y/Z` to `github.com/X/Y/issues/Z`.
URLs that canonicalize to different strings get different citation numbers. If two URLs point to the same resource but canonicalize differently, that is a canonicalization bug; log it and merge the entries.
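The general rules above could be sketched with the standard library roughly as follows. This is a minimal illustration, not the skill's actual implementation: the names `TRACKING_PARAMS` and `keep_fragment` are assumptions, the caller is assumed to decide when a fragment addresses a real anchor, and the GitHub issues rewrite is left out because the source form is not fully specified here.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Tracking parameters named in the rules; utm_* is matched by prefix below.
TRACKING_PARAMS = {"gclid", "fbclid", "ref", "source"}


def canonicalize(url: str, keep_fragment: bool = False) -> str:
    """Normalize a URL so equivalent links map to the same index key."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"malformed URL: {url!r}")

    host = parts.netloc.lower()               # lowercase host, keep path case
    path = re.sub(r"/{2,}", "/", parts.path)  # collapse repeated slashes in the path

    # Drop tracking parameters, keep everything else in its original order.
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_") and k.lower() not in TRACKING_PARAMS
    ]

    if host in ("youtube.com", "www.youtube.com"):
        query = [(k, v) for k, v in query if k != "list"]  # strip playlist param
    elif host in ("arxiv.org", "www.arxiv.org"):
        path = re.sub(r"(\d)v\d+$", r"\1", path)           # strip version suffix

    fragment = parts.fragment if keep_fragment else ""
    return urlunsplit((parts.scheme, host, path, urlencode(query), fragment))


print(canonicalize("https://arXiv.org/abs/2401.12345v2?utm_source=x#top"))
```

Raising `ValueError` on a malformed URL lets the caller decide whether to log and skip, which matches the failure-mode guidance later in this document.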
Write Path
When a subagent / section outputs evidence:
    for ev in evidence:
        url = canonicalize(ev.url)
        if url not in index.entries:
            # First sighting: assign the next stable number.
            index.entries[url] = Entry(
                number=index.next_number,
                title=ev.title,
                first_seen_section=current_section,
                used_in_sections=[current_section],
                source_channel=ev.source_channel,
            )
            index.next_number += 1
        else:
            # Known URL: keep its number, just record the new section.
            if current_section not in index.entries[url].used_in_sections:
                index.entries[url].used_in_sections.append(current_section)
Read Path
When rendering a section body, replace inline citation markers with the assigned number:
- Input from runtime AI: `... as shown in [ref: arxiv.org/abs/2401.12345] ...`
- After indexing: `... as shown in [1] ...`
- References section at end: `[1] Title. URL.`
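A minimal sketch of this read path, assuming the index entries are plain dicts keyed by canonical URL and that a marker without a scheme should be looked up under `https://`. The helper names (`render_section`, `render_references`, `_key`) are hypothetical, and a real implementation would presumably run the marker through the same canonicalization as the write path.

```python
import re

# Matches inline markers such as "[ref: arxiv.org/abs/2401.12345]".
REF_MARKER = re.compile(r"\[ref:\s*([^\]\s]+)\s*\]")


def _key(raw: str) -> str:
    """Assumption: markers without a scheme refer to https URLs."""
    return raw if "://" in raw else "https://" + raw


def render_section(body: str, entries: dict[str, dict]) -> str:
    """Replace [ref: URL] markers with [N]; leave unknown markers untouched."""
    def substitute(match: re.Match) -> str:
        entry = entries.get(_key(match.group(1)))
        return f"[{entry['number']}]" if entry else match.group(0)

    return REF_MARKER.sub(substitute, body)


def render_references(entries: dict[str, dict]) -> str:
    """Emit one '[N] Title. URL.' line per entry, ordered by citation number."""
    ordered = sorted(entries.items(), key=lambda item: item[1]["number"])
    return "\n".join(f"[{e['number']}] {e['title']}. {url}." for url, e in ordered)


entries = {"https://arxiv.org/abs/2401.12345": {"number": 1, "title": "First paper title"}}
print(render_section("... as shown in [ref: arxiv.org/abs/2401.12345] ...", entries))
print(render_references(entries))
```

Leaving unknown markers in place, rather than dropping them, keeps the gap visible so a later check can flag it.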
Merge Rule (Subagent Consolidation)
When merging N subagent outputs:
- Collect all evidence URLs across all subagents.
- Canonicalize and dedupe.
- Assign numbers in stable order (first-seen wins; use subagent order as tie-break).
- Rewrite each subagent's inline `[ref: ...]` tags to the final numbers.
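The merge steps can be sketched as a two-pass rewrite. This is an illustrative sketch only: `merge_subagent_outputs` is a hypothetical name, subagent outputs are assumed to be plain strings given in subagent order, and the URLs inside the markers are assumed to be canonicalized already.

```python
import re

# Matches inline markers such as "[ref: https://a.example/x]".
REF_MARKER = re.compile(r"\[ref:\s*([^\]\s]+)\s*\]")


def merge_subagent_outputs(outputs: list[str]) -> tuple[list[str], dict[str, int]]:
    """Assign first-seen numbers across all outputs, then rewrite every marker."""
    numbers: dict[str, int] = {}

    # Pass 1: walk outputs in subagent order, markers in order of appearance.
    # setdefault keeps the first number a URL received, so duplicates dedupe.
    for text in outputs:
        for url in REF_MARKER.findall(text):
            numbers.setdefault(url, len(numbers) + 1)

    # Pass 2: rewrite each output's markers to the final numbers.
    rewritten = [
        REF_MARKER.sub(lambda m: f"[{numbers[m.group(1)]}]", text)
        for text in outputs
    ]
    return rewritten, numbers


outputs = [
    "A [ref: https://a.example/x] and B [ref: https://b.example/y].",
    "C [ref: https://b.example/y] then D [ref: https://c.example/z].",
]
rewritten, numbers = merge_subagent_outputs(outputs)
print(rewritten)
print(numbers)
```

Numbering everything before rewriting anything means a URL cited by two subagents resolves to a single number in both outputs.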
Failure Modes
- Malformed URL in evidence — log and skip; don't crash the whole index.
- Duplicate title with different URLs (e.g. "paper.pdf" published on two mirrors) — keep both; this is user judgment, not ours to merge without data.
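One way to get the "log and skip" behaviour for malformed URLs is a small filter in front of the write path. The sketch below assumes evidence items are dicts with a `url` key and that only `http`/`https` URLs with a host count as well-formed; `valid_evidence` and the logger name are hypothetical.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("citation_index")


def valid_evidence(evidence: list[dict]) -> list[dict]:
    """Return only items with a well-formed http(s) URL; log the rest."""
    kept = []
    for ev in evidence:
        url = ev.get("url") or ""
        try:
            parts = urlsplit(url)
            ok = parts.scheme in ("http", "https") and bool(parts.netloc)
        except ValueError:
            # urlsplit raises ValueError for some inputs, e.g. a bad IPv6 host.
            ok = False
        if ok:
            kept.append(ev)
        else:
            logger.warning("skipping evidence with malformed URL: %r", url)
    return kept


items = [
    {"url": "https://a.example/x", "title": "A"},
    {"url": "not a url", "title": "B"},
    {"title": "C"},
]
print(valid_evidence(items))
```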
When to Use
- Final report synthesis (always).
- Multi-subagent merge (always).
- Single-section simple answer — skip, overkill.
Cost
Cheap: mostly bookkeeping. No LLM calls are needed unless an ambiguous case has to be resolved (e.g. "are these two URLs the same resource?"), so runtime AI rarely needs to consult an LLM for citation indexing.
MCP Tool Usage
Full citation workflow using MCP tools:
    # Create an index for this research session
    idx = citation_create()
    index_id = idx["index_id"]

    # Add URLs as you collect evidence (idempotent: same URL always gets same number)
    num1 = citation_add(index_id=index_id, url="https://arxiv.org/abs/2501.12345",
                        title="RAG Survey 2026", source="arxiv")["citation_number"]
    num2 = citation_add(index_id=index_id, url="https://github.com/user/repo",
                        title="rag-toolkit", source="github")["citation_number"]

    # Merge citations from a parallel delegate_subtask result
    # (if you ran delegate_subtask with a separate citation_create per subtask)
    citation_merge(target_id=index_id, source_id=other_index_id)

    # Export as Markdown reference list
    refs = citation_export(index_id=index_id)["markdown"]
    # refs = "[1] RAG Survey 2026 — arxiv (https://arxiv.org/abs/2501.12345)\n[2] rag-toolkit ..."
Use `[1]`, `[2]` inline citations in your report body, then append `refs` at the end.
Interactions
- Fed by → all channel skills + `delegate-subtask` output.
- Feeds → `synthesize-knowledge` (which produces the `[1] [2]` report).
- Feeds → `evaluate-delivery` (which can spot "[5] is referenced but not in the index" bugs).
Quality Bar
- Evidence items have non-empty title and url.
- No crash on empty or malformed API response.
- Source channel field matches the channel name.