# claude-skill-registry · ds-plan

REQUIRED Phase 2 of the /ds workflow. Profiles data and creates an analysis task breakdown.

## Install

**Source** · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ds-plan" ~/.claude/skills/majiayu000-claude-skill-registry-ds-plan && rm -rf "$T"
```

Manifest: `skills/data/ds-plan/SKILL.md`

Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."


## Planning (Data Profiling + Task Breakdown)

Profile the data and create an analysis plan based on the spec. Requires `.claude/SPEC.md` from /ds-brainstorm first.

<EXTREMELY-IMPORTANT>

## The Iron Law of DS Planning

SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.

Before exploring data or creating tasks, you MUST have:

1. `.claude/SPEC.md` with objectives and constraints
2. Clear success criteria
3. User-approved spec

If `.claude/SPEC.md` doesn't exist, run /ds-brainstorm first.

</EXTREMELY-IMPORTANT>

## Rationalization Table - STOP If You Think:

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "Data looks clean, profiling unnecessary" | Your data is never clean | PROFILE to discover issues |
| "I can profile as I go" | You'll miss systemic issues | PROFILE comprehensively NOW |
| "Quick .head() is enough" | The head hides tail problems | RUN the full profiling checklist |
| "Missing values won't affect my analysis" | They always do | DOCUMENT and plan handling |
| "I'll handle data issues during analysis" | Those issues will derail your analysis | FIX data issues FIRST |
| "User didn't mention data quality" | They assume YOU'LL check | QUALITY check is YOUR job |
| "Profiling takes too long" | Skipping it costs days later | INVEST the time now |
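To make the ".head() is enough" excuse concrete, here is a minimal sketch on a hypothetical DataFrame where the first rows look spotless while the nulls sit entirely in the tail:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: first 5 rows clean, nulls concentrated at the end
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0] + [np.nan] * 5})

print(df.head())                   # shows only the clean rows - no nulls visible
print(df["value"].isnull().sum())  # the full check reveals 5 missing values
```

A quick glance at `df.head()` would report a clean column; only the full-column check surfaces the problem.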

## Honesty Framing

Creating an analysis plan without profiling the data is LYING about understanding the data.

You cannot plan analysis steps without knowing:

- Your data's shape and types
- Your missing value patterns
- Your data quality issues
- Your cleaning requirements

Profiling costs you minutes. Your wrong plan costs hours of rework and incorrect results.

## No Pause After Completion

After writing `.claude/PLAN.md`, IMMEDIATELY invoke:

```
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
```

DO NOT:

- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates

The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.

## What Plan Does

| DO | DON'T |
|----|-------|
| Read `.claude/SPEC.md` | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write `.claude/PLAN.md` | Make completion claims |

Brainstorm answers: WHAT and WHY. Plan answers: HOW and DATA QUALITY.

## Process

### 1. Verify Spec Exists

```shell
cat .claude/SPEC.md  # verify-spec: read SPEC file to confirm it exists
```

If missing, stop and run /ds-brainstorm first.

### 2. Data Profiling

For multiple data sources: profile in parallel using background Task agents.

#### Single Data Source (Direct Profiling)

MANDATORY profiling steps:

```python
import pandas as pd

df = pd.read_csv("data.csv")   # load the source being profiled

# Basic structure
df.shape                    # (rows, columns)
df.dtypes                   # Column types
df.head(10)                 # Sample data
df.tail(5)                  # End of data

# Summary statistics
df.describe()               # Numeric summaries
df.describe(include='object')  # Categorical summaries
df.info()                   # Memory, non-null counts

# Data quality checks
df.isnull().sum()           # Missing values per column
df.duplicated().sum()       # Duplicate rows
col = "category"            # repeat for each categorical column
df[col].value_counts()      # Distribution of categories

# For time series
date_col = "date"           # the timestamp column
df[date_col].min(), df[date_col].max()  # Date range
df.groupby(date_col).size()             # Records per period
```
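The mandatory checks above can be consolidated into a small helper that returns one report per source, so the same profile runs identically everywhere. This is a sketch; the example frame and column names are hypothetical:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Consolidate the mandatory profiling checks into one report."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_summary": df.describe().to_dict(),
    }

# Hypothetical toy source
report = profile(pd.DataFrame({"a": [1, 2, 2], "b": ["x", None, "x"]}))
print(report["missing"])   # {'a': 0, 'b': 1}
```

Returning a plain dict keeps the report easy to diff across sources and to paste into the Data Profile section of the plan.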

#### Multiple Data Sources (Parallel Profiling)

<EXTREMELY-IMPORTANT>

**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.**

Use `run_in_background: true` for parallel execution.

When profiling 2+ data sources, launch agents in parallel.

</EXTREMELY-IMPORTANT>

```
# PARALLEL + BACKGROUND: All Task calls in ONE message

Task(
    subagent_type="general-purpose",
    description="Profile dataset 1",
    run_in_background=true,
    prompt="""
Profile this dataset and return a data quality report.

Dataset: /path/to/dataset1.csv

Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()

Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning

Tools denied: Write, Edit, NotebookEdit (read-only profiling)
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 2",
    run_in_background=true,
    prompt="""
[Same template for dataset 2]
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 3",
    run_in_background=true,
    prompt="""
[Same template for dataset 3]
""")
```

After launching agents:

- Continue to other work (don't wait)
- Check status with the /tasks command
- Collect results with TaskOutput when ready

```
# Collect profiling results
TaskOutput(task_id="task-abc123", block=true, timeout=30000)
TaskOutput(task_id="task-def456", block=true, timeout=30000)
TaskOutput(task_id="task-ghi789", block=true, timeout=30000)
```

Benefits:

- 3x faster profiling for 3 datasets
- Each agent focused on a single source
- Results consolidated in main chat
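Outside the Task-agent runtime, the same fan-out pattern can be sketched in plain Python with a thread pool. The sources here are hypothetical in-memory frames standing in for files or tables:

```python
import concurrent.futures
import pandas as pd
import numpy as np

def profile_source(name_df):
    """Profile one source; each worker stays focused on a single input."""
    name, df = name_df
    return {"source": name, "shape": df.shape,
            "missing": int(df.isnull().sum().sum())}

# Hypothetical sources standing in for dataset1.csv, dataset2.csv, ...
sources = {
    "sales": pd.DataFrame({"amt": [1.0, np.nan, 3.0]}),
    "users": pd.DataFrame({"id": [1, 2, 3]}),
}
with concurrent.futures.ThreadPoolExecutor() as pool:
    # All sources profiled in parallel; results collected in submission order
    reports = list(pool.map(profile_source, sources.items()))

print([r["missing"] for r in reports])  # [1, 0]
```

As with the Task-agent pattern, the point is that each worker owns one source and the caller consolidates the reports afterwards.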

### 3. Identify Data Quality Issues

CRITICAL: Document ALL issues before proceeding:

| Check | What to Look For |
|-------|------------------|
| Missing values | Null counts, patterns of missingness |
| Duplicates | Exact duplicates, key-based duplicates |
| Outliers | Extreme values, impossible values |
| Type issues | Strings in numeric columns, date parsing |
| Cardinality | Unexpected unique values |
| Distribution | Skewness, unexpected patterns |
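Each check in the table maps onto a short pandas expression. A sketch on a hypothetical frame; the 1.5 × IQR outlier rule is one common choice, not the only one:

```python
import pandas as pd

# Hypothetical frame with one extreme price and repeated category keys
df = pd.DataFrame({"price": [10, 12, 11, 9, 500],
                   "cat": ["a", "a", "b", "b", "b"]})

missing = df.isnull().sum()                  # Missing values: nulls per column
dupes = df.duplicated(subset=["cat"]).sum()  # Duplicates: key-based, on "cat"

# Outliers: flag values outside 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

cardinality = df["cat"].nunique()            # Cardinality: unique values
skew = df["price"].skew()                    # Distribution: skewness
```

Running these on the toy frame flags the 500 as the lone outlier, which is exactly the kind of finding to record under Data Quality Issues.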

### 4. Create Task Breakdown

Break analysis into ordered tasks:

- Each task should produce visible output
- Order by data dependencies
- Include data cleaning tasks FIRST
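Ordering by data dependencies is a topological sort. A sketch using the standard library's `graphlib`, with a hypothetical task graph where cleaning has no prerequisites and everything else depends on it directly or indirectly:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on
tasks = {
    "clean": set(),            # cleaning has no prerequisites - always first
    "aggregate": {"clean"},
    "visualize": {"aggregate"},
    "model": {"clean"},
}
order = list(TopologicalSorter(tasks).static_order())
print(order)  # cleaning precedes everything that depends on it
```

Any valid ordering it emits puts `clean` first, matching the rule that data cleaning tasks come before analysis tasks.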

### 5. Write Plan Doc

Write to `.claude/PLAN.md`:

```markdown
# Analysis Plan: [Analysis Name]

> **For Claude:** REQUIRED SUB-SKILL: Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")` to implement this plan with output-first verification.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-delegate/SKILL.md")` for subagent templates.

## Spec Reference
See: .claude/SPEC.md

## Data Profile

### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]

#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |

#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]

### Source 2: [name]
[Same structure]

## Task Breakdown

### Task 1: Data Cleaning (required first)
- Handle missing values in col2
- Remove duplicates
- Fix data types
- Output: Clean DataFrame, log of rows removed

### Task 2: [Analysis Step]
- Input: Clean DataFrame
- Process: [description]
- Output: [specific output to verify]
- Dependencies: Task 1

### Task 3: [Next Step]
[Same structure]

## Output Verification Plan
For each task, define what output proves completion:
- Task 1: "X rows cleaned, Y rows dropped"
- Task 2: "Visualization showing [pattern]"
- Task 3: "Model accuracy >= 0.8"

## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
```
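The Reproducibility Requirements section of the template can be filled mechanically at plan time. A sketch that captures seed and key package versions in one dict (the seed value is illustrative):

```python
import random
import sys

import numpy as np
import pandas as pd

# Fix the seed once, then record it alongside the environment (42 is illustrative)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

snapshot = {
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "seed": SEED,
}
print(snapshot)
```

Pasting this dict into PLAN.md gives later phases the exact values to reproduce the run.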

## Red Flags - STOP If You're About To:

| Action | Why It's Wrong | Do Instead |
|--------|----------------|------------|
| Skip data profiling | Your data issues will break your analysis | Always profile first |
| Ignore missing values | You'll corrupt your results | Document and plan handling |
| Start analysis immediately | You haven't characterized your data | Complete profiling |
| Assume your data is clean | Never assume, you must verify | Run quality checks |

## Output

Complete the plan when:

- Read and understand `.claude/SPEC.md`
- Profile all data sources (shape, types, stats)
- Document data quality issues
- Define cleaning strategy for each issue
- Order tasks by dependency
- Define output verification criteria
- Write `.claude/PLAN.md`
- Confirm ready for implementation

## Phase Complete

REQUIRED SUB-SKILL: After completing the plan, IMMEDIATELY invoke:

```
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
```