Claude-skill-registry download-all-transcripts
Download transcripts for all data folders sequentially. Use for overnight batch processing or when you need to download pending transcripts across all channels and collections.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/download-all-transcripts" ~/.claude/skills/majiayu000-claude-skill-registry-download-all-transcripts && rm -rf "$T"
skills/data/download-all-transcripts/SKILL.md
Download All Transcripts
Why? Manually downloading transcripts folder-by-folder is tedious and error-prone. This skill automates overnight batch processing across all channels and collections with built-in rate limiting and resumability.
Quick Start
# Run from repository root - handles everything automatically
./scripts/download_all_transcripts.sh
That's it. The script finds all folders with videos.csv, downloads pending transcripts, and resumes safely if interrupted.
Workflow
1. Verify Prerequisites
Before running, ensure:
- You're in the repository root directory
- The data/ folder contains at least one subfolder with a videos.csv file
- The transcript-download CLI is installed (comes with the project's Python package)
# Check for valid data folders
ls data/*/videos.csv
[!TIP] If no videos.csv files exist, first run extract-videos or sync-all-channels to populate them.
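The prerequisite checks above can be sketched as a small shell function. This is a minimal sketch, not part of the project: the name check_prereqs and its arguments are hypothetical, and it assumes the data/<folder>/videos.csv layout described in this section.

```shell
# check_prereqs ROOT [CLI]   (hypothetical helper, not shipped with the project)
# ROOT: repository root to inspect (assumed layout: ROOT/data/<folder>/videos.csv)
# CLI:  command that must be on PATH (defaults to transcript-download)
check_prereqs() {
  root=${1:-.}
  cli=${2:-transcript-download}
  found=0
  for csv in "$root"/data/*/videos.csv; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$csv" ]; then
      found=$((found + 1))
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no data/*/videos.csv files under $root" >&2
    return 1
  fi
  if ! command -v "$cli" >/dev/null 2>&1; then
    echo "$cli not found on PATH" >&2
    return 1
  fi
  echo "$found folder(s) ready"
}
```

Calling `check_prereqs .` from the repository root returns non-zero and prints the reason on stderr when a prerequisite is missing.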
2. Execute Batch Download
./scripts/download_all_transcripts.sh
The script will:
- Find all folders in data/ containing videos.csv
- Process each folder sequentially
- Download transcripts to <folder>/transcripts/
- Wait 60 seconds between videos to avoid YouTube rate limiting
- Update CSV with download status
[!CAUTION] This is a long-running operation. For a channel with 500 videos, expect 8+ hours. Run overnight or in a tmux/screen session.
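The 8+ hour figure follows from the fixed delay alone: 500 videos × 60 s = 30,000 s, or roughly 8.3 hours. A minimal sketch for estimating other batch sizes, assuming a hypothetical helper named estimate_hours (not part of the project):

```shell
# estimate_hours VIDEOS [DELAY_SECONDS]   (hypothetical helper)
# Prints a lower bound in hours: only the fixed delay between videos is
# counted, not the time each download itself takes.
estimate_hours() {
  awk -v n="$1" -v d="${2:-60}" 'BEGIN { printf "%.1f\n", n * d / 3600 }'
}
```

For example, `estimate_hours 500` prints 8.3. Treat the result as a minimum when deciding whether a run fits in one night.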
3. Monitor Progress
The script outputs real-time progress:
📝 YTScribe - Download All Transcripts
=======================================
Started at: Thu Dec 26 09:00:00 PST 2024
Delay between videos: 60s
Found 12 folders with videos.csv
────────────────────────────────────────
[1/12] Processing: lex-fridman
CSV: /path/to/data/lex-fridman/videos.csv
Output: /path/to/data/lex-fridman/transcripts
4. Handle Completion or Interruption
On successful completion:
✅ All transcripts downloaded!
Finished at: Thu Dec 26 17:30:00 PST 2024
Summary of folders processed:
- lex-fridman: 342 transcripts
- huberman-lab: 156 transcripts
...
On interruption or IP block: Simply run the script again. It automatically skips videos where transcript_downloaded=True in the CSV.
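To see how much work a resumed run still has, you can count the rows that are not yet marked as downloaded. The sketch below is an illustration only: count_pending is a hypothetical helper, and it assumes each videos.csv has a header row containing a transcript_downloaded column whose value is True for finished videos. It handles quoted fields with commas but not quoted fields that span several lines.

```shell
# count_pending CSV   (hypothetical helper; the CSV layout is an assumption)
# Prints the number of data rows whose transcript_downloaded value is not "True".
count_pending() {
  awk '
    # Split one CSV line into fields[], honoring double-quoted fields.
    function split_csv(line, fields,    n, i, c, inq, cur) {
      split("", fields)                 # clear leftovers from the previous row
      sub(/\r$/, "", line)              # tolerate CRLF line endings
      n = 0; cur = ""; inq = 0
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") { inq = !inq }
        else if (c == "," && !inq) { fields[++n] = cur; cur = "" }
        else { cur = cur c }
      }
      fields[++n] = cur
      return n
    }
    NR == 1 {
      n = split_csv($0, header)
      for (i = 1; i <= n; i++) if (header[i] == "transcript_downloaded") col = i
      if (!col) { bad = 1; exit }
      next
    }
    /^[ \t\r]*$/ { next }               # ignore blank lines
    {
      split_csv($0, row)
      if (row[col] != "True") pending++
    }
    END {
      if (bad) { print "no transcript_downloaded column" > "/dev/stderr"; exit 2 }
      print pending + 0
    }
  ' "$1"
}
```

Running `count_pending data/huberman-lab/videos.csv` before and after a run shows how many videos were picked up.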
Output Structure
Transcripts are saved as markdown with YAML frontmatter:
data/huberman-lab/
├── videos.csv
└── transcripts/
    ├── 2024-01-15-abc123.md
    ├── 2024-01-20-def456.md
    └── ...
Each transcript file contains:
---
video_id: abc123
title: "Sleep Optimization Toolkit"
channel: Huberman Lab
published_at: 2024-01-15
duration: PT2H15M30S
---
[Transcript content here...]
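Because the metadata sits in a simple key: value block between two --- lines, a single field can be read without a YAML library. The following is a naive sketch under that assumption: frontmatter_field is a hypothetical helper, and it only handles flat one-line values like the ones in the example above.

```shell
# frontmatter_field FILE KEY   (hypothetical helper; naive, not a YAML parser)
# Prints the value of KEY from the leading '---' ... '---' block of FILE.
frontmatter_field() {
  awk -v key="$2" '
    NR == 1 && $0 != "---" { exit }      # file has no frontmatter
    NR == 1 { next }
    $0 == "---" { exit }                 # end of frontmatter reached
    index($0, key ": ") == 1 {
      val = substr($0, length(key) + 3)
      gsub(/^"|"$/, "", val)             # drop surrounding double quotes, if any
      print val
      exit
    }
  ' "$1"
}
```

For instance, `frontmatter_field data/huberman-lab/transcripts/2024-01-15-abc123.md title` would print the title without its quotes. A missing key produces no output.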
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| 🛑 IP BLOCKED message | YouTube detected automated requests | Switch VPN server, wait 1-2 hours, then resume |
| No folders with videos.csv found | Empty or missing data folders | Run extract-videos or sync-all-channels first |
| Script exits immediately | No pending transcripts | Check CSVs - all may already be downloaded |
| transcript-download command not found | CLI not installed | Install the project's Python package from the repo root |
| Partial download (some videos skipped) | Videos without transcripts/captions | Check YouTube - video may have no captions available |
Common Mistakes
- Running without checking disk space - Transcripts are small (~50KB each), but 10,000 videos = ~500MB. Verify space before overnight runs.
- Interrupting during a download - Safe to Ctrl+C between videos. If you interrupt mid-download, that video's transcript may be incomplete. The CSV won't mark it as downloaded, so it will retry.
- Running multiple instances - Don't run the script twice simultaneously. The 60s delay assumes single-threaded operation to respect rate limits.
- Expecting instant results - The 60s delay is intentional. Faster rates trigger IP blocks. Plan for overnight runs.
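One way to guard against accidentally starting a second instance is a lock directory, since mkdir either creates the directory or fails atomically. This is a sketch of that general technique, not something the project's script is known to do. The names acquire_lock and run_single and the lock path are hypothetical.

```shell
# acquire_lock DIR   (hypothetical helper)
# Returns 0 if this call created DIR, non-zero if it already exists.
# mkdir is atomic, so two concurrent callers cannot both succeed.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# run_single LOCKDIR COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND only if the lock is free, then releases the lock
# and returns the exit status of COMMAND.
run_single() {
  lock=$1
  shift
  if ! acquire_lock "$lock"; then
    echo "another run appears to be active (lock: $lock)" >&2
    return 1
  fi
  status=0
  "$@" || status=$?
  rmdir "$lock"
  return "$status"
}
```

A possible invocation is `run_single /tmp/download_all_transcripts.lock ./scripts/download_all_transcripts.sh`. If the run is killed before the lock is released, the stale directory has to be removed by hand before the next run.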
Quality Checklist
Before considering batch download complete:
- All folders show transcript counts in summary output
- No 🛑 IP BLOCKED errors (or resolved by VPN switch)
- Spot-check 2-3 random .md files have valid content
- CSV transcript_downloaded column reflects actual downloads
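Part of this checklist can be automated by counting the transcript files in each folder and flagging any that are empty. A minimal sketch, assuming the <folder>/transcripts/*.md layout described above. report_folder is a hypothetical helper, and an empty file is only a rough proxy for invalid content.

```shell
# report_folder FOLDER   (hypothetical helper)
# Prints "<name>: <total> transcripts, <empty> empty" for FOLDER/transcripts/*.md.
report_folder() {
  total=0
  empty=0
  for f in "$1"/transcripts/*.md; do
    [ -e "$f" ] || continue        # unmatched glob: no transcripts yet
    total=$((total + 1))
    if [ ! -s "$f" ]; then         # -s is true only for non-empty files
      empty=$((empty + 1))
    fi
  done
  echo "$(basename "$1"): $total transcripts, $empty empty"
}
```

Looping over the data folders, for example `for d in data/*/; do report_folder "${d%/}"; done`, gives counts that can be compared with the script's own summary output.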
When to Use This vs. download-transcripts
| Scenario | Use |
|---|---|
| Download ALL pending transcripts across all channels | download-all-transcripts (this skill) |
| Download transcripts for a single specific folder | download-transcripts |
| Need fine-grained control over which videos | download-transcripts with filters |
Technical Details
- Rate limiting: 60 second delay between videos (configurable in script's DELAY variable)
- Exit codes: 0 = success, 1 = general error, 2 = IP blocked (special handling)
- Resumability: Based on transcript_downloaded column in each CSV
- Dependencies: Requires transcript-download CLI from project's Python package
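The exit codes make it possible to wrap the script so that an IP block (exit code 2) leads to a wait and a rerun, while success or a general error stops immediately. The wrapper below is a sketch built only on the exit codes listed above. run_with_retry is a hypothetical helper, and it does not switch VPN servers for you.

```shell
# run_with_retry MAX_ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Reruns COMMAND while it exits with 2 (IP blocked), up to MAX_ATTEMPTS times.
run_with_retry() {
  max=$1
  wait_s=$2
  shift 2
  attempt=1
  while :; do
    status=0
    "$@" || status=$?
    if [ "$status" -ne 2 ]; then
      return "$status"             # 0 = success, 1 = general error: stop either way
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "still blocked after $attempt attempt(s)" >&2
      return 2
    fi
    echo "IP blocked (exit 2); waiting ${wait_s}s before attempt $((attempt + 1))" >&2
    sleep "$wait_s"
    attempt=$((attempt + 1))
  done
}
```

For example, `run_with_retry 3 7200 ./scripts/download_all_transcripts.sh` waits two hours between attempts, in line with the 1-2 hour wait suggested under Troubleshooting. Because the script is resumable, each rerun continues from the videos that are still pending.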