Claude-skill-registry download-all-transcripts

Download transcripts for all data folders sequentially. Use for overnight batch processing or when you need to download pending transcripts across all channels and collections.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/download-all-transcripts" ~/.claude/skills/majiayu000-claude-skill-registry-download-all-transcripts && rm -rf "$T"

manifest: skills/data/download-all-transcripts/SKILL.md

Download All Transcripts

Why? Manually downloading transcripts folder-by-folder is tedious and error-prone. This skill automates overnight batch processing across all channels and collections with built-in rate limiting and resumability.

Quick Start

# Run from repository root - handles everything automatically
./scripts/download_all_transcripts.sh

That's it. The script finds all folders with `videos.csv`, downloads pending transcripts, and resumes safely if interrupted.

Workflow

1. Verify Prerequisites

Before running, ensure:

You're in the repository root directory
The `data/` folder contains at least one subfolder with a `videos.csv` file
The `transcript-download` CLI is installed (comes with the project's Python package)

# Check for valid data folders
ls data/*/videos.csv

[!TIP] If no `videos.csv` files exist, first run `extract-videos` or `sync-all-channels` to populate them.
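The prerequisite checks above can be bundled into one function that runs before a long batch. This is a minimal sketch, not part of the project: `check_prereqs` is a hypothetical helper name, and it relies only on what this page states (a `data/` folder with `videos.csv` files and a `transcript-download` command on `PATH`).

```shell
# Hypothetical helper (not part of the project): verify the prerequisites
# listed above. Returns 0 when everything is in place, 1 otherwise.
check_prereqs() {
  local data_dir="${1:-data}"
  local found=0 missing=0 csv

  # Count subfolders that contain a videos.csv file.
  for csv in "$data_dir"/*/videos.csv; do
    if [ -f "$csv" ]; then
      found=$((found + 1))
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "No videos.csv files found under $data_dir/" >&2
    missing=1
  fi

  # The CLI must be reachable on PATH.
  if ! command -v transcript-download >/dev/null 2>&1; then
    echo "transcript-download not found; try 'pip install -e .' from the repo root" >&2
    missing=1
  fi

  return "$missing"
}

# Usage (from the repository root):
#   check_prereqs data && ./scripts/download_all_transcripts.sh
```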

2. Execute Batch Download

./scripts/download_all_transcripts.sh

The script will:

Find all folders in `data/` containing `videos.csv`
Process each folder sequentially
Download transcripts to `<folder>/transcripts/`
Wait 60 seconds between videos to avoid YouTube rate limiting
Update CSV with download status
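The steps above can be sketched as an outer loop. This is an illustration of the sequencing only, not the contents of `download_all_transcripts.sh`: `process_all_folders` and `download_folder` are hypothetical names, and the sketch assumes the 60-second per-video delay and the CSV update happen inside whatever `download_folder` stands for. Stopping at the first failure is also an assumption.

```shell
# Sketch of the outer loop only. download_folder is a hypothetical
# placeholder for the command that processes one folder; it must be
# defined by the caller and is not the project's real function.
process_all_folders() {
  local data_dir="${1:-data}"
  local total=0 index=0 csv folder

  # First pass: count folders that contain videos.csv.
  for csv in "$data_dir"/*/videos.csv; do
    if [ -f "$csv" ]; then
      total=$((total + 1))
    fi
  done
  echo "Found $total folders with videos.csv"

  # Second pass: process folders one at a time, never in parallel.
  for csv in "$data_dir"/*/videos.csv; do
    [ -f "$csv" ] || continue
    index=$((index + 1))
    folder=$(dirname "$csv")
    echo "[$index/$total] Processing: $(basename "$folder")"
    # Assumption: stop at the first failure so a rerun can resume.
    download_folder "$csv" "$folder/transcripts" || return $?
  done
}
```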

[!CAUTION] This is a long-running operation. For a channel with 500 videos, expect 8+ hours. Run overnight or in a `tmux`/`screen` session.
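The 8+ hour figure follows from the delay alone: 500 videos at 60 seconds each is 30,000 seconds. A small sketch for estimating other runs is below; `estimate_hours` is a hypothetical helper, and it ignores the time spent on the downloads themselves, so treat the result as a lower bound.

```shell
# Hypothetical helper: lower-bound run time in hours, counting only the
# fixed delay between videos (default 60 seconds).
estimate_hours() {
  local pending="$1" delay="${2:-60}"
  LC_ALL=C awk -v n="$pending" -v d="$delay" \
    'BEGIN { printf "%.1f\n", n * d / 3600 }'
}

# Usage:
#   estimate_hours 500       # 500 pending videos at the default 60s delay
#   estimate_hours 1200 60
```

With 500 pending videos this prints `8.3`. To keep a run alive after disconnecting, one option is to start a session with `tmux new -s transcripts` and launch the script inside it.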

3. Monitor Progress

The script outputs real-time progress:

📝 YTScribe - Download All Transcripts
=======================================
Started at: Thu Dec 26 09:00:00 PST 2024
Delay between videos: 60s

Found 12 folders with videos.csv

────────────────────────────────────────
[1/12] Processing: lex-fridman
  CSV: /path/to/data/lex-fridman/videos.csv
  Output: /path/to/data/lex-fridman/transcripts

4. Handle Completion or Interruption

On successful completion:

✅ All transcripts downloaded!
Finished at: Thu Dec 26 17:30:00 PST 2024

Summary of folders processed:
  - lex-fridman: 342 transcripts
  - huberman-lab: 156 transcripts
  ...

On interruption or IP block: Simply run the script again. It automatically skips videos where `transcript_downloaded=True` in the CSV.

Output Structure

Transcripts are saved as Markdown files with YAML frontmatter:

data/huberman-lab/
├── videos.csv
└── transcripts/
    ├── 2024-01-15-abc123.md
    ├── 2024-01-20-def456.md
    └── ...

Each transcript file contains:

---
video_id: abc123
title: "Sleep Optimization Toolkit"
channel: Huberman Lab
published_at: 2024-01-15
duration: PT2H15M30S
---

[Transcript content here...]
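Because the metadata sits in a plain frontmatter block, individual fields can be read back with standard tools. The following is a rough sketch rather than a YAML parser: `frontmatter_field` is a hypothetical helper that handles only simple single-line `key: value` entries like the ones shown above.

```shell
# Hypothetical helper: print the value of one frontmatter key from a
# transcript file. Only simple "key: value" lines are supported.
frontmatter_field() {
  local file="$1" key="$2"
  awk -v key="$key" '
    /^---[ \t]*$/ { fences++; if (fences == 2) exit; next }
    fences == 1 {
      prefix = key ":"
      if (index($0, prefix) == 1) {
        value = substr($0, length(prefix) + 1)
        sub(/^[ \t]+/, "", value)   # drop leading blanks
        sub(/[ \t]+$/, "", value)   # drop trailing blanks
        sub(/^"/, "", value)        # drop surrounding double quotes
        sub(/"$/, "", value)
        print value
        exit
      }
    }
  ' "$file"
}

# Usage:
#   frontmatter_field data/huberman-lab/transcripts/2024-01-15-abc123.md title
```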

Troubleshooting

Problem	Cause	Solution
`🛑 IP BLOCKED` message	YouTube detected automated requests	Switch VPN server, wait 1-2 hours, then resume
`No videos.csv files found`	Empty or missing data folders	Run `extract-videos` or `sync-all-channels` first
Script exits immediately	No pending transcripts	Check CSVs - all may already be downloaded
`transcript-download: command not found`	CLI not installed	Run `pip install -e .` from repo root
Partial download (some videos skipped)	Videos without transcripts/captions	Check YouTube - video may have no captions available

Common Mistakes

Running without checking disk space - Transcripts are small (~50KB each), but 10,000 videos = ~500MB. Verify space before overnight runs.
Interrupting during a download - Safe to Ctrl+C between videos. If you interrupt mid-download, that video's transcript may be incomplete. The CSV won't mark it as downloaded, so it will retry.
Running multiple instances - Don't run the script twice simultaneously. The 60s delay assumes single-threaded operation to respect rate limits.
Expecting instant results - The 60s delay is intentional. Faster rates trigger IP blocks. Plan for overnight runs.
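One way to guard against accidentally starting a second instance is a lock directory, since `mkdir` either creates the directory or fails atomically. This is a sketch under assumptions: the lock path and the `acquire_lock`/`release_lock` names are hypothetical, and nothing here indicates the project's script does this itself.

```shell
# Hypothetical guard: mkdir is atomic, so only one process can create the
# lock directory. The lock path is an assumption, not a project convention.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/download_all_transcripts.lock}"

acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    return 0
  fi
  echo "Another download appears to be running (lock: $LOCK_DIR)" >&2
  return 1
}

release_lock() {
  rmdir "$LOCK_DIR" 2>/dev/null || true
}

# Usage:
#   acquire_lock || exit 1
#   trap release_lock EXIT
#   ./scripts/download_all_transcripts.sh
```

If a run is killed without the `EXIT` trap firing, the lock directory stays behind and has to be removed by hand before the next run.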

Quality Checklist

Before considering batch download complete:

All folders show transcript counts in summary output
No `🛑 IP BLOCKED` errors (or resolved by VPN switch)
Spot-check that 2-3 random `.md` files have valid content
CSV `transcript_downloaded` column reflects actual downloads
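The spot-check can be partly automated. The sketch below uses a hypothetical helper, `transcript_looks_valid`, which treats a file as valid when it opens with a `---` frontmatter block and has at least one non-blank line after the closing `---`. It does not judge the quality of the transcript text, so a manual read of a few files is still worthwhile.

```shell
# Hypothetical spot-check: succeed only if the file starts with a YAML
# frontmatter block and has at least one non-blank line after it.
transcript_looks_valid() {
  awk '
    NR == 1 && $0 !~ /^---[ \t]*$/ { exit }        # must open with ---
    /^---[ \t]*$/ && fences < 2 { fences++; next } # opening/closing fence
    fences == 2 && /[^ \t]/ { body = 1; exit }     # first real body line
    END { exit (fences == 2 && body) ? 0 : 1 }
  ' "$1"
}

# Usage (shuf is from GNU coreutils and may be missing on some systems):
#   find data -path '*/transcripts/*.md' | shuf -n 3 | while read -r f; do
#     transcript_looks_valid "$f" || echo "Suspicious: $f"
#   done
```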

When to Use This vs. download-transcripts

Scenario	Use
Download ALL pending transcripts across all channels	`download-all-transcripts` (this skill)
Download transcripts for a single specific folder	`download-transcripts --folder <name>`
Need fine-grained control over which videos	`download-transcripts` with filters

Technical Details

Rate limiting: 60 second delay between videos (configurable in script's `DELAY` variable)
Exit codes: 0 = success, 1 = general error, 2 = IP blocked (special handling)
Resumability: Based on `transcript_downloaded` column in each CSV
Dependencies: Requires `transcript-download` CLI from project's Python package
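Since exit code 2 signals an IP block and a rerun resumes where the last one stopped, an unattended run could in principle retry after a pause. The wrapper below is a sketch, assuming the exit codes listed above; `run_until_unblocked` is a hypothetical name, and waiting alone may not lift a block if a VPN switch is also needed.

```shell
# Hypothetical wrapper: rerun a command while it exits with status 2
# (IP blocked), pausing between attempts. Any other status is returned
# unchanged.
run_until_unblocked() {
  local max_attempts="$1" wait_seconds="$2"
  shift 2
  local attempt=1 status=0

  while :; do
    status=0
    "$@" || status=$?
    if [ "$status" -ne 2 ]; then
      return "$status"
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Still blocked after $attempt attempts; giving up" >&2
      return 2
    fi
    echo "IP blocked (attempt $attempt/$max_attempts); waiting ${wait_seconds}s" >&2
    sleep "$wait_seconds"
    attempt=$((attempt + 1))
  done
}

# Usage (wait 2 hours between attempts, at most 3 attempts):
#   run_until_unblocked 3 7200 ./scripts/download_all_transcripts.sh
```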