Asi yt-playlist-acset

Extract transcripts from YouTube playlists into DuckDB ACSet schema. Uses pytubefix + mlx-whisper on Apple Silicon. Supports auto-captions and local transcription fallback.

install
source · Clone the upstream repo
git clone https://github.com/plurigrid/asi
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/yt-playlist-acset" ~/.claude/skills/plurigrid-asi-yt-playlist-acset && rm -rf "$T"
manifest: skills/yt-playlist-acset/SKILL.md
source content

yt-playlist-acset

Extract knowledge from YouTube playlists into queryable DuckDB tables with GF(3) coloring.

Trit: +1 (PLUS - generative/extractive)
Color:

#A855F7
(purple - knowledge synthesis)
Bundle: knowledge-extraction

Overview

This skill extracts transcripts from YouTube playlists and stores them in a DuckDB database with ACSet-compatible schema. It leverages:

  • pytubefix: Reliable YouTube downloads (bypasses PO token requirements)
  • yt-dlp: Fast auto-caption extraction (VTT format)
  • mlx-whisper: Local transcription on Apple Silicon for videos without captions
  • DuckDB: Columnar storage with full-text search

Schema

CREATE TABLE transcripts (
  id VARCHAR PRIMARY KEY,      -- YouTube video ID
  title VARCHAR,               -- Video title
  duration INTEGER,            -- Duration in seconds
  transcript TEXT,             -- Full transcript text
  source VARCHAR,              -- 'vtt' or 'whisper'
  created_at TIMESTAMP DEFAULT NOW()
);

Quick Start

1. Extract Playlist to DuckDB

# Set playlist URL
PLAYLIST="https://www.youtube.com/playlist?list=PLjHvKZZWKqvgYDp7aGa-OSVi392j-Hmsa"

# Run extraction (babashka)
bb extract.clj "$PLAYLIST" ~/my_transcripts.duckdb

2. Query Transcripts

-- Find videos mentioning "category theory"
SELECT title, substr(transcript, 1, 200) 
FROM transcripts 
WHERE transcript ILIKE '%category theory%';

-- Full-text search
SELECT * FROM transcripts 
WHERE transcript ILIKE '%functor%' AND transcript ILIKE '%monad%';

-- Stats
SELECT COUNT(*) as videos, 
       ROUND(SUM(length(transcript))/1e6, 2) as MB 
FROM transcripts;

Dependencies

# Install via uv (recommended)
uv pip install pytubefix mlx-whisper yt-dlp duckdb

# Or pip
pip install pytubefix mlx-whisper yt-dlp duckdb

# Babashka (for orchestration)
brew install borkdude/brew/babashka

Architecture

┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│  YouTube        │────▶│  yt-dlp      │────▶│  VTT files  │
│  Playlist       │     │  (captions)  │     │  (294+)     │
└─────────────────┘     └──────────────┘     └──────┬──────┘
                                                    │
                        ┌──────────────┐            ▼
                        │  pytubefix   │     ┌─────────────┐
                        │  (audio)     │────▶│  DuckDB     │
                        └──────┬───────┘     │  ACSet      │
                               │             └─────────────┘
                               ▼                    ▲
                        ┌──────────────┐            │
                        │  mlx-whisper │────────────┘
                        │  (M-series)  │
                        └──────────────┘

Workflow

  1. Phase 1: Extract auto-captions via yt-dlp (fast, parallel)
  2. Phase 2: Download audio for videos without captions (pytubefix)
  3. Phase 3: Transcribe locally with mlx-whisper (Apple Silicon)
  4. Phase 4: Parse VTT/TXT and load into DuckDB

GF(3) Integration

Each transcript can be assigned a trit based on content analysis:

TritValueContent Type
MINUS-1Critique, analysis, verification
ERGODIC0Neutral, informational, balanced
PLUS+1Generative, creative, synthetic
-- Add trit column
ALTER TABLE transcripts ADD COLUMN trit INTEGER DEFAULT 0;

-- Classify based on keywords
UPDATE transcripts SET trit = 
  CASE 
    WHEN transcript ILIKE '%proof%' OR transcript ILIKE '%theorem%' THEN -1
    WHEN transcript ILIKE '%create%' OR transcript ILIKE '%build%' THEN +1
    ELSE 0
  END;

Related Skills

SkillRelationship
mlx-apple-silicon
Provides whisper transcription backend
video-downloader
General video download (this skill is playlist-focused)
duckdb-ies
Unified interactome analytics (consumes transcripts)
duckdb-timetravel
Temporal versioning for transcript updates
acsets-hatchery
ACSet schema patterns
sense
Diagrammatic video extraction (complementary)

Performance

MetricValue
Auto-caption extraction~5 videos/sec (parallel)
Audio download~1 video/10s (rate-limited)
Whisper transcription~1 hour video/5 min (M3 Max)
Storage~50KB/hour of video

Example: Research Playlist

# Extract David Spivak talks
bb extract.clj "https://youtube.com/playlist?list=..." ~/spivak.duckdb

# Query for polynomial functors
duckdb ~/spivak.duckdb "
  SELECT title 
  FROM transcripts 
  WHERE transcript ILIKE '%polynomial%functor%'
"

Troubleshooting

YouTube 403 Errors

pytubefix handles PO tokens automatically. If issues persist:

# Clear cache
rm -rf ~/.cache/pytubefix

mlx-whisper Model Download

First run downloads ~75MB model:

mlx_whisper test.m4a --model mlx-community/whisper-tiny

Private Videos

Private videos cannot be transcribed. Check with:

yt-dlp --flat-playlist --print "%(id)s|%(title)s" "PLAYLIST_URL" | grep -i private

Thread Origin

This skill was developed in Amp thread T-019ba48d-58db-704d-8eac-578a44bc20f7 extracting 343 transcripts (17.99 MB) from a research playlist.