Cc-skills research-archival

Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.

install
source · Clone the upstream repo
git clone https://github.com/terrylica/cc-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/terrylica/cc-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/gh-tools/skills/research-archival" ~/.claude/skills/terrylica-cc-skills-research-archival && rm -rf "$T"
manifest: plugins/gh-tools/skills/research-archival/SKILL.md
source content

Research Archival

Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.

Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.

FIRST - TodoWrite Task Templates

MANDATORY: Select and load the appropriate template before any archival work.

Template A - Full Archival (scrape + save + issue)

1. Identity preflight — verify GH_ACCOUNT or resolve via curl /user
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
4. Survey labels — gh label list, reuse existing, max 3-6
5. Create GitHub Issue — use --body with heredoc or --body-file
6. Update frontmatter — add github_issue_url and github_issue_number
7. Post canonical backlink comment on Issue

Template B - Save Only (no issue)

1. Identity preflight (still required for consistency)
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter

Template C - Issue Only (file already exists)

1. Identity preflight
2. Read existing file frontmatter
3. Survey labels — gh label list, reuse existing, max 3-6
4. Create GitHub Issue — use --body with heredoc or --body-file
5. Update file frontmatter with issue cross-reference
6. Post canonical backlink comment on Issue

Identity Preflight (MANDATORY — Step 0)

MUST execute before any

gh
write command. Non-negotiable.

The

gh-repo-identity-guard.mjs
PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.

Resolution Order

  1. Fast-path
    GH_ACCOUNT
    env var (set by mise per-directory)
  2. Token filename — scan
    ~/.claude/.secrets/gh-token-*
    for single base match
  3. API call
    curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user

Verification

/usr/bin/env bash << 'IDENTITY_EOF'
# Resolve authenticated user
if [ -n "$GH_ACCOUNT" ]; then
  AUTH_USER="$GH_ACCOUNT"
  AUTH_SOURCE="GH_ACCOUNT"
else
  AUTH_USER=$(curl -sf --max-time 5 -H "Authorization: token $GH_TOKEN" \
    https://api.github.com/user 2>/dev/null | grep -o '"login":"[^"]*"' | cut -d'"' -f4)
  AUTH_SOURCE="API /user"
fi

# Resolve target repo owner
REPO_OWNER=$(git remote get-url origin 2>/dev/null | sed -n 's|.*github\.com[:/]\([^/]*\)/.*|\1|p')

echo "Authenticated as: $AUTH_USER (via $AUTH_SOURCE)"
echo "Target repo owner: $REPO_OWNER"

if [ "$AUTH_USER" != "$REPO_OWNER" ]; then
  echo ""
  echo "MISMATCH — do NOT proceed with gh write commands"
  echo "Fix: export GH_TOKEN=\$(cat ~/.claude/.secrets/gh-token-$REPO_OWNER)"
  exit 1
fi
echo "Identity verified — safe to proceed"
IDENTITY_EOF

BLOCK if mismatch — display diagnostic and do NOT continue to any

gh
write operation.


Scraping Workflow

Route scrape requests based on URL pattern. See url-routing.md for full details.

Decision Tree

URL contains chatgpt.com/share/
  → Jina Reader (https://r.jina.ai/{URL})
  → Use curl (not WebFetch — it summarizes instead of returning raw)

URL contains gemini.google.com/share/
  → Firecrawl (JS-heavy SPA)
  → Preflight: ping -c1 -W2 littleblack

URL contains claude.ai/artifacts/ or is a static web page
  → Jina Reader (https://r.jina.ai/{URL})
  → Use WebFetch or curl

Firecrawl Scrape (with Health Check + Auto-Revival)

CRITICAL: Firecrawl containers can show "Up" in

docker ps
while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.

/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail

# Step 1: Check Tailscale connectivity (littleblack primary, ZeroTier legacy at 172.25.236.1)
if ! ping -c1 -W2 littleblack >/dev/null 2>&1; then
  echo "ERROR: Firecrawl host unreachable. Check Tailscale: tailscale status"
  exit 1
fi

# Step 2: Deep health check — test actual API response, not just container status
# Port 3003 (wrapper) may accept TCP but return empty if Firecrawl API (3002) is dead inside
HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
  -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || echo "000")

if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
  echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."

  # Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
  ssh littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -i "stalled\|error\|exit" || true'

  # Step 2b: Restart the critical containers
  ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
  echo "Containers restarted. Waiting 20s for API to initialize..."
  sleep 20

  # Step 2c: Verify recovery
  HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST http://localhost:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || echo "000")

  if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
    echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
    echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
    echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
    exit 1
  fi
  echo "Firecrawl recovered successfully."
fi

# Step 3: Scrape via wrapper
CONTENT=$(curl -s --max-time 120 "http://littleblack:3003/scrape?url=${URL}&name=${SLUG}")

if [ -z "$CONTENT" ]; then
  echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
  exit 1
fi

echo "$CONTENT"
SCRAPE_EOF

Known Failure Mode: Container "Up" But Processes Dead

Symptom:

docker ps
shows containers with status "Up 4 days" but
curl localhost:3002
returns connection reset.

Root cause: Firecrawl worker exhausts RAM/CPU (observed:

cpuUsage=0.998, memoryUsage=0.858
). Internal Node.js processes exit but Docker container stays alive because the entrypoint shell is still running.

Diagnosis:

ssh bigblack 'docker logs firecrawl-api-1 --tail 50 2>&1 | grep -E "STALLED|cpuUsage|exit"'
# Look for: WORKER STALLED {"cpuUsage":0.998,"memoryUsage":0.858}

Fix:

docker restart
(not
docker compose restart
— may require permissions to compose directory):

ssh bigblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1'
sleep 20  # Wait for API initialization
# Verify:
ssh bigblack 'curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/v1/scrape'

File Saving

Naming Convention

YYYY-MM-DD-{slug}-{source_type}.md
  • slug
    — kebab-case summary (max 50 chars)
  • source_type
    — from enum:
    chatgpt
    ,
    gemini
    ,
    claude
    ,
    web

Default location:

docs/research/
in the current project.

YAML Frontmatter

See frontmatter-schema.md for the full field contract.

---
source_url: https://chatgpt.com/share/...
source_type: chatgpt-share
scraped_at: "2026-02-09T18:30:00Z"
model_name: gpt-4o
custom_gpt_name: Cosmo
claude_code_uuid: SESSION_UUID
github_issue_url: ""
github_issue_number: ""
---

Leave

github_issue_url
and
github_issue_number
empty — update after Issue creation.


GitHub Issue Creation

Label Survey

Survey existing labels first — reuse preferred, create only when concept is genuinely novel.

gh label list --repo owner/repo --limit 100

Policy: Max 3-6 labels per issue. Common labels:

research
,
ai-output
,
chatgpt
,
gemini
,
archival
.

Create Issue

Use

--body
with heredoc for inline composition, or
--body-file
for very large content.

/usr/bin/env bash << 'ISSUE_EOF'
# Write body to temp file
cat > "/tmp/issue-body-${SLUG}.md" << 'BODY_EOF'
## Summary

Brief description of the archived research content.

## Source

- **URL**: SOURCE_URL
- **Type**: source_type
- **Model**: model_name
- **Scraped**: scraped_at

## Key Findings

- Finding 1
- Finding 2

## Archived File

`docs/research/FILENAME.md`
BODY_EOF

# Create issue
gh issue create \
  --repo owner/repo \
  --title "Research: descriptive title here" \
  --body-file "/tmp/issue-body-${SLUG}.md" \
  --label "research,ai-output"

# Clean up
rm -f "/tmp/issue-body-${SLUG}.md"
ISSUE_EOF

Update Frontmatter

After issue creation, update the archived file's frontmatter with the issue URL and number.


Canonical Backlink Comment

Post a comment on the Issue linking back to the archived file:

**Archived**: `docs/research/YYYY-MM-DD-slug-source_type.md`
Scraped: 2026-02-09T18:30:00Z
Source: [chatgpt-share](https://chatgpt.com/share/...)
Session: SESSION_UUID

Post-Change Checklist

After modifying THIS skill:

  1. YAML frontmatter valid (no colons in description)
  2. Trigger keywords current in description
  3. All
    ./references/
    links resolve
  4. Identity preflight section remains FIRST in workflow
  5. Append changes to evolution-log.md
  6. Validate:
    uv run plugins/plugin-dev/scripts/skill-creator/quick_validate.py plugins/gh-tools/skills/research-archival
  7. Validate links:
    bun run plugins/plugin-dev/scripts/validate-links.ts plugins/gh-tools/skills/research-archival

Troubleshooting

IssueCauseFix
Wrong account postingGH_TOKEN mismatchCheck
mise env | grep GH_TOKEN
, verify
GH_ACCOUNT
Body exceeds 65536 charsGitHub API limitSplit across issue body + first comment
Firecrawl unreachableTailscale down
tailscale ping bigblack
, check
tailscale status
Firecrawl "Up" but deadContainer alive, processes crashed
docker restart firecrawl-api-1 firecrawl-playwright-service-1
, wait 20s
Firecrawl WORKER STALLEDRAM/CPU overload (>85% mem)Same as above; check
docker logs firecrawl-api-1 --tail 50
Scrape returns emptyJS-heavy page timeoutIncrease Firecrawl timeout, try Jina fallback
Jina returns login page shellGemini login wall (not rendered)Must use Firecrawl for
gemini.google.com/share/*
URLs
mise parse errorStale .mise.toml syntaxRun
mise doctor
, check
[hooks.enter]
syntax
Identity guard blocksNon-owner account
export GH_TOKEN=$(cat ~/.claude/.secrets/gh-token-OWNER)

References

Post-Execution Reflection

After this skill completes, check before closing:

  1. Did the command succeed? — If not, fix the instruction or error table that caused the failure.
  2. Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
  3. Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.

Only update if the issue is real and reproducible — not speculative.