Claude-skill-registry atft-pipeline

Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/atft-pipeline" ~/.claude/skills/majiayu000-claude-skill-registry-atft-pipeline && rm -rf "$T"
manifest: skills/data/atft-pipeline/SKILL.md
source content

ATFT Pipeline Skill

Mission

  • Provision fresh or historical parquet datasets for ATFT-GAT-FAN with GPU-accelerated ETL.
  • Maintain deterministic feature graphs (approx. 395 engineered factors, 307 active).
  • Guard J-Quants API quota, credential sanity, and cache health to prevent training stalls.

When To Engage

  • Any request mentioning dataset builds, ETL, J-Quants, cache, RAPIDS/cuDF, or feature graph refresh.
  • Pre-training sanity checks (“ensure latest dataset”, “verify cache integrity”).
  • Recovery tasks (“resume interrupted dataset job”, “clean corrupted cache shards”).

Preflight Checklist

  • Confirm nvidia-smi reports at least one free A100 80GB GPU; fall back to CPU only if no GPU is available.
  • Validate credentials: .env contains JQUANTS_AUTH_EMAIL/PASSWORD and JQUANTS_PLAN_TIER (see the sketch after this checklist).
  • Ensure python -m pip install -e . has already been run (dependencies + entry points).
  • Check the latest health snapshot: tools/project-health-check.sh --section dataset.
  • Inspect the existing dataset for reuse: ls -lh output/ml_dataset_latest_full.parquet.
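
A minimal credential-check sketch, assuming a standard KEY=value .env layout; the exact JQUANTS_AUTH_PASSWORD key name is inferred from the EMAIL/PASSWORD shorthand above and should be confirmed against the real .env:

```python
# Hedged preflight sketch: verify the J-Quants keys named in this checklist
# exist and are non-empty in .env. The .env parsing and the exact
# JQUANTS_AUTH_PASSWORD key name are assumptions, not project tooling.
from pathlib import Path

REQUIRED = ("JQUANTS_AUTH_EMAIL", "JQUANTS_AUTH_PASSWORD", "JQUANTS_PLAN_TIER")

def missing_keys(env_path: str = ".env") -> list[str]:
    """Return required keys that are absent or empty in the .env file."""
    found: dict[str, str] = {}
    for raw in Path(env_path).read_text().splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            found[key.strip()] = value.strip()
    return [k for k in REQUIRED if not found.get(k)]

if __name__ == "__main__":
    missing = missing_keys()
    print("credentials OK" if not missing else f"missing keys: {missing}")
```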

Core Playbooks

1. Background Five-Year Refresh (default)

  1. make dataset-check-strict — GPU + secrets verification.
  2. make dataset-bg START=<optional> END=<optional> — SSH-safe background run with logging in _logs/dataset.
  3. tail -f _logs/dataset/*.log — monitor progress (the job auto-prints its PID + PGID).
  4. make cache-stats — ensure cache hit rate and size stay within expected bounds (<2.5 TB); a quick size probe is sketched below.
  5. python scripts/pipelines/run_full_dataset.py --dry-run — confirm metadata integrity without a rebuild.
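
A rough size probe matching step 4's 2.5 TB bound, assuming graph shards live under cache/ (per the Observability Hooks section); make cache-stats remains the authoritative report:

```python
# Illustrative cache-size probe: sum file sizes under cache/ and compare
# against the 2.5 TB budget from step 4. The cache/ root is an assumption.
from pathlib import Path

LIMIT_BYTES = 2.5 * 1024**4  # 2.5 TB budget from this playbook

def cache_size_bytes(root: str = "cache") -> int:
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())

size = cache_size_bytes()
print(f"cache size: {size / 1024**4:.2f} TB ({'OK' if size < LIMIT_BYTES else 'over budget'})")
```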

2. Hotfix / Forced Refresh

  1. make dataset-gpu-refresh START=YYYY-MM-DD END=YYYY-MM-DD — bypasses cached parquet; API-throttle aware.
  2. make datasets-prune — keep only the latest dataset generation.
  3. make cache-prune CACHE_TTL_DAYS=90 — evict stale graph shards to recover disk space (see the TTL sketch below).
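
A hedged sketch of what make cache-prune CACHE_TTL_DAYS=90 roughly does, assuming shards are parquet files under cache/; prefer the make target, which knows the real cache layout:

```python
# TTL-based eviction sketch: delete cache shards whose mtime exceeds the
# TTL. Shard location and *.parquet naming are assumptions; dry-run first.
import time
from pathlib import Path

def prune_cache(root: str = "cache", ttl_days: int = 90, dry_run: bool = True) -> None:
    cutoff = time.time() - ttl_days * 86400
    for shard in Path(root).rglob("*.parquet"):
        if shard.stat().st_mtime < cutoff:
            print(f"{'would remove' if dry_run else 'removing'} {shard}")
            if not dry_run:
                shard.unlink()

prune_cache(dry_run=True)  # inspect the candidate list, then rerun with dry_run=False
```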

3. Resource-Constrained Fallback

  1. make dataset-check — relaxed diagnostics (CPU acceptable).
  2. make dataset-cpu START=YYYY-MM-DD END=YYYY-MM-DD — chunked Pandas path (illustrated below).
  3. make dataset-safe-resume — resume from the last safe checkpoint if memory pressure triggered the fallback.
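
A sketch of the chunked idea behind make dataset-cpu: stream the parquet row group by row group so memory stays bounded. The file path comes from this runbook; the per-chunk work is a placeholder:

```python
# Bounded-memory CPU path: iterate parquet row groups instead of loading
# the whole dataset. Replace the row count with real per-chunk feature work.
import pyarrow.parquet as pq

pf = pq.ParquetFile("output/ml_dataset_latest_full.parquet")
rows = 0
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()  # one row group at a time
    rows += len(chunk)
print(f"processed {rows} rows across {pf.num_row_groups} row groups")
```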

4. Graph Feature Investigation

  1. python scripts/pipelines/run_full_dataset.py --inspect-graph --start YYYY-MM-DD --end YYYY-MM-DD.
  2. python -c "import polars as pl; df = pl.read_parquet('output/ml_dataset_latest_full.parquet'); print(df.select(pl.all().is_null().sum()))" — null audit (an expanded variant follows this list).
  3. make cache-monitor — per-window edge density + overlap stats.
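
An expanded variant of the step-2 null audit, reporting per-column null rates; the 1% flagging threshold is an illustrative choice, not a project default:

```python
# Per-column null-rate audit with a flagging threshold (illustrative: 1%).
import polars as pl

df = pl.read_parquet("output/ml_dataset_latest_full.parquet")
null_rates = (
    df.select(pl.all().null_count() / pl.len())
      .transpose(include_header=True, column_names=["null_rate"])
)
print(null_rates.filter(pl.col("null_rate") > 0.01).sort("null_rate", descending=True))
```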

Observability Hooks

  • _logs/dataset/ for job logs; cache/*.json for cache metadata.
  • ml_dataset_latest_full_metadata.json for column coverage and horizon alignment.
  • benchmark_output/dataset_timestamps.json to confirm pipeline duration against the baseline (target: <42 min on the GPU path); a comparison sketch follows.
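
A hedged duration check against the <42 min target; the duration_seconds field is an assumed schema for dataset_timestamps.json and should be adjusted to the file's actual layout:

```python
# Compare recorded pipeline timing against the 42-minute GPU-path target.
# The "duration_seconds" key is a hypothetical schema assumption.
import json
from pathlib import Path

TARGET_SECONDS = 42 * 60
data = json.loads(Path("benchmark_output/dataset_timestamps.json").read_text())
duration = data.get("duration_seconds")  # hypothetical key; adapt as needed
if duration is not None:
    status = "within target" if duration < TARGET_SECONDS else "regression"
    print(f"pipeline took {duration / 60:.1f} min ({status})")
```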

Failure Triage

  • Credential errors → run python scripts/pipelines/run_full_dataset.py --auth-test.
  • CUDA OOM → rerun with make dataset-safe (40GB RMM pool pre-configured).
  • API rate limits → throttle via make dataset-gpu REFRESH_THROTTLE=1.
  • Corrupted parquet → run make dataset-rebuild, then python tools/parquet_validator.py output/ml_dataset_latest_full.parquet (a stand-in probe is sketched below).
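
If tools/parquet_validator.py is unavailable, a pyarrow-based stand-in (an approximation, not the project's validator) can force-decode every row group to surface corruption:

```python
# Stand-in integrity probe: pyarrow raises on a corrupted footer or row
# group, so force-decoding the whole file flushes out bad shards.
import sys
import pyarrow.parquet as pq

def quick_validate(path: str) -> bool:
    try:
        pf = pq.ParquetFile(path)
        for i in range(pf.num_row_groups):
            pf.read_row_group(i)  # decode every row group
        return True
    except Exception as exc:
        print(f"corruption detected: {exc}", file=sys.stderr)
        return False

print("OK" if quick_validate("output/ml_dataset_latest_full.parquet") else "FAILED")
```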

Codex Collaboration

  • Escalate complex ETL debugging or architectural refactors via ./tools/codex.sh "Diagnose dataset pipeline bottleneck" (leverages OpenAI Codex deep reasoning).
  • For long-running autonomous maintenance, schedule ./tools/codex.sh --max --exec "Perform full dataset pipeline audit" off-hours (uses .mcp.json from the Codex repo for filesystem/git context).
  • When Codex proposes changes, sync the learnings back here and refresh the dataset runbooks if any commands or defaults shift.

Handoff Notes

  • Always update dataset_features_detail.json if the schema changes.
  • Announce each new dataset snapshot in EXPERIMENT_STATUS.md with its generation timestamp and settings (a minimal sketch follows).
  • Surface anomalies (missing tickers, new features) via docs/data_quality/ reports.
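
A minimal sketch of the snapshot announcement, assuming EXPERIMENT_STATUS.md takes simple list entries; match whatever convention the file already uses:

```python
# Append a snapshot entry with the generation timestamp and run settings.
# The entry format is an assumption; mirror the file's existing style.
from datetime import datetime, timezone
from pathlib import Path

def announce_snapshot(start: str, end: str, path: str = "EXPERIMENT_STATUS.md") -> None:
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with Path(path).open("a") as f:
        f.write(f"- {stamp}: new dataset snapshot (START={start}, END={end})\n")

announce_snapshot("2020-01-01", "2024-12-31")  # example dates
```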