Claude-skill-registry atft-pipeline
Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.
Install
Source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/atft-pipeline" ~/.claude/skills/majiayu000-claude-skill-registry-atft-pipeline && rm -rf "$T"
Manifest: skills/data/atft-pipeline/SKILL.md
ATFT Pipeline Skill
Mission
- Provision fresh or historical parquet datasets for ATFT-GAT-FAN with GPU-accelerated ETL.
- Maintain deterministic feature graphs (approx. 395 engineered factors, 307 active).
- Guard J-Quants API quota, credential sanity, and cache health to prevent training stalls.
When To Engage
- Any request mentioning dataset builds, ETL, J-Quants, cache, RAPIDS/cuDF, or feature graph refresh.
- Pre-training sanity checks (“ensure latest dataset”, “verify cache integrity”).
- Recovery tasks (“resume interrupted dataset job”, “clean corrupted cache shards”).
Preflight Checklist
- Confirm `nvidia-smi` reports at least one free A100 80GB GPU; fall back to CPU only if no GPU is available.
- Validate credentials: `.env` contains `JQUANTS_AUTH_EMAIL/PASSWORD` and `JQUANTS_PLAN_TIER`.
- Ensure `python -m pip install -e .` has already been executed (dependencies + entry points).
- Check the latest health snapshot: `tools/project-health-check.sh --section dataset`.
- Inspect the existing dataset for reuse: `ls -lh output/ml_dataset_latest_full.parquet`.
Core Playbooks
1. Background Five-Year Refresh (default)
- `make dataset-check-strict` — GPU + secrets verification.
- `make dataset-bg START=<optional> END=<optional>` — SSH-safe background run with logging in `_logs/dataset`.
- `tail -f _logs/dataset/*.log` — monitor progress (auto-prints PID + PGID).
- `make cache-stats` — ensure cache hit rate and size stay within expected bounds (<2.5 TB).
- `python scripts/pipelines/run_full_dataset.py --dry-run` — confirm metadata integrity without a rebuild.
2. Hotfix / Forced Refresh
- `make dataset-gpu-refresh START=YYYY-MM-DD END=YYYY-MM-DD` — bypasses cached parquet; API-throttle aware.
- `make datasets-prune` — keep only the latest dataset generation.
- `make cache-prune CACHE_TTL_DAYS=90` — evict stale graph shards to recover disk space.
3. Resource-Constrained Fallback
- `make dataset-check` — relaxed diagnostics (CPU acceptable).
- `make dataset-cpu START=YYYY-MM-DD END=YYYY-MM-DD` — chunked Pandas path.
- `make dataset-safe-resume` — resume from the last safe checkpoint if memory pressure triggered the fallback.
4. Graph Feature Investigation
- `python scripts/pipelines/run_full_dataset.py --inspect-graph --start YYYY-MM-DD --end YYYY-MM-DD`.
- `python -c "import polars as pl; df = pl.read_parquet('output/ml_dataset_latest_full.parquet'); print(df.select(pl.all().is_null().sum()))"` — null audit.
- `make cache-monitor` — per-window edge density + overlap stats.
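For a quick spot check when polars is unavailable, the null audit above can be mirrored with the stdlib over a sampled subset of rows. The sample rows here are illustrative, not real dataset columns:

```python
def null_counts(rows):
    """Count None values per column across a list of row dicts."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts.setdefault(col, 0)
            if value is None:
                counts[col] += 1
    return counts

# Illustrative sample; real audits should read the parquet with polars.
sample = [
    {"code": "7203", "ret_1d": 0.01, "vol_20d": None},
    {"code": "6758", "ret_1d": None, "vol_20d": 0.18},
]
print(null_counts(sample))  # {'code': 0, 'ret_1d': 1, 'vol_20d': 1}
```

Unexpectedly high null counts in an engineered factor usually point at a stale graph shard rather than bad source data, which is why the null audit pairs with `make cache-monitor`.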
Observability Hooks
- `_logs/dataset/` for job logs.
- `cache/*.json` metadata for cache state.
- `ml_dataset_latest_full_metadata.json` for column coverage & horizon alignment.
- `benchmark_output/dataset_timestamps.json` to confirm pipeline duration vs. baseline (target: <42 min on the GPU path).
Failure Triage
- Credential errors → run `python scripts/pipelines/run_full_dataset.py --auth-test`.
- CUDA OOM → rerun with `make dataset-safe` (40 GB RMM pool pre-configured).
- API rate limits → throttle via `make dataset-gpu REFRESH_THROTTLE=1`.
- Corrupted parquet → `make dataset-rebuild`, then `python tools/parquet_validator.py output/ml_dataset_latest_full.parquet`.
Codex Collaboration
- Escalate complex ETL debugging or architectural refactors via `./tools/codex.sh "Diagnose dataset pipeline bottleneck"` (leverages OpenAI Codex deep reasoning).
- For long-running autonomous maintenance, schedule `./tools/codex.sh --max --exec "Perform full dataset pipeline audit"` off-hours (uses `.mcp.json` from the Codex repo for filesystem/git context).
- When Codex proposes changes, sync the learnings back here and refresh the dataset runbooks if any commands or defaults shift.
Handoff Notes
- Always update `dataset_features_detail.json` if the schema changes.
- Announce each new dataset snapshot in `EXPERIMENT_STATUS.md` with the generation timestamp and settings.
- Surface anomalies (missing tickers, new features) via `docs/data_quality/` reports.