# claude-skill-registry · ds-plan

REQUIRED Phase 2 of the /ds workflow. Profiles data and creates an analysis task breakdown.

## Install

**Source** · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ds-plan" ~/.claude/skills/majiayu000-claude-skill-registry-ds-plan && rm -rf "$T"
```

Manifest: `skills/data/ds-plan/SKILL.md`

Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."


## Planning (Data Profiling + Task Breakdown)

Profile the data and create an analysis plan based on the spec. Requires `.claude/SPEC.md` from /ds-brainstorm first.

<EXTREMELY-IMPORTANT>

## The Iron Law of DS Planning

SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.

Before exploring data or creating tasks, you MUST have:

1. `.claude/SPEC.md` with objectives and constraints
2. Clear success criteria
3. User-approved spec

If `.claude/SPEC.md` doesn't exist, run /ds-brainstorm first.

</EXTREMELY-IMPORTANT>

## Rationalization Table - STOP If You Think:

| Excuse | Reality | Do Instead |
|--------|---------|------------|
| "Data looks clean, profiling unnecessary" | Your data is never clean | PROFILE to discover issues |
| "I can profile as I go" | You'll miss systemic issues | PROFILE comprehensively NOW |
| "Quick .head() is enough" | The head hides tail problems | RUN the full profiling checklist |
| "Missing values won't affect my analysis" | They always do | DOCUMENT and plan handling |
| "I'll handle data issues during analysis" | Those issues will derail your analysis | FIX data issues FIRST |
| "User didn't mention data quality" | They assume YOU'LL check | QUALITY check is YOUR job |
| "Profiling takes too long" | Skipping it costs days later | INVEST the time now |
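To make the ".head() is enough" excuse concrete, here is a minimal sketch on a hypothetical DataFrame where the first rows look spotless while the nulls sit entirely in the tail:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: first 5 rows clean, nulls concentrated at the end
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0] + [np.nan] * 5})

print(df.head())                   # shows only the clean rows - no nulls visible
print(df["value"].isnull().sum())  # the full check reveals 5 missing values
```

A quick glance at `df.head()` would report a clean column; only the full-column check surfaces the problem.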

## Honesty Framing

Creating an analysis plan without profiling the data is LYING about understanding the data.

You cannot plan analysis steps without knowing:

- Your data's shape and types
- Your missing value patterns
- Your data quality issues
- Your cleaning requirements

Profiling costs you minutes. Your wrong plan costs hours of rework and incorrect results.

## No Pause After Completion

After writing `.claude/PLAN.md`, IMMEDIATELY invoke:

```
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
```

DO NOT:

- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates

The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.

## What Plan Does

| DO | DON'T |
|----|-------|
| Read `.claude/SPEC.md` | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write `.claude/PLAN.md` | Make completion claims |

Brainstorm answers: WHAT and WHY. Plan answers: HOW and DATA QUALITY.

## Process

### 1. Verify Spec Exists

```shell
cat .claude/SPEC.md  # verify-spec: read SPEC file to confirm it exists
```

If missing, stop and run /ds-brainstorm first.

### 2. Data Profiling

For multiple data sources: profile in parallel using background Task agents.

#### Single Data Source (Direct Profiling)

MANDATORY profiling steps:

```python
import pandas as pd

df = pd.read_csv("data.csv")   # load the source being profiled

# Basic structure
df.shape                    # (rows, columns)
df.dtypes                   # Column types
df.head(10)                 # Sample data
df.tail(5)                  # End of data

# Summary statistics
df.describe()               # Numeric summaries
df.describe(include='object')  # Categorical summaries
df.info()                   # Memory, non-null counts

# Data quality checks
df.isnull().sum()           # Missing values per column
df.duplicated().sum()       # Duplicate rows
col = "category"            # repeat for each categorical column
df[col].value_counts()      # Distribution of categories

# For time series
date_col = "date"           # the timestamp column
df[date_col].min(), df[date_col].max()  # Date range
df.groupby(date_col).size()             # Records per period
```
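The mandatory checks above can be consolidated into a small helper that returns one report per source, so the same profile runs identically everywhere. This is a sketch; the example frame and column names are hypothetical:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Consolidate the mandatory profiling checks into one report."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_summary": df.describe().to_dict(),
    }

# Hypothetical toy source
report = profile(pd.DataFrame({"a": [1, 2, 2], "b": ["x", None, "x"]}))
print(report["missing"])   # {'a': 0, 'b': 1}
```

Returning a plain dict keeps the report easy to diff across sources and to paste into the Data Profile section of the plan.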

#### Multiple Data Sources (Parallel Profiling)

<EXTREMELY-IMPORTANT>

**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.**

Use `run_in_background: true` for parallel execution.

When profiling 2+ data sources, launch agents in parallel.

</EXTREMELY-IMPORTANT>

```
# PARALLEL + BACKGROUND: All Task calls in ONE message

Task(
    subagent_type="general-purpose",
    description="Profile dataset 1",
    run_in_background=true,
    prompt="""
Profile this dataset and return a data quality report.

Dataset: /path/to/dataset1.csv

Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()

Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning

Tools denied: Write, Edit, NotebookEdit (read-only profiling)
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 2",
    run_in_background=true,
    prompt="""
[Same template for dataset 2]
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 3",
    run_in_background=true,
    prompt="""
[Same template for dataset 3]
""")
```

After launching agents:

- Continue to other work (don't wait)
- Check status with the /tasks command
- Collect results with TaskOutput when ready

```
# Collect profiling results
TaskOutput(task_id="task-abc123", block=true, timeout=30000)
TaskOutput(task_id="task-def456", block=true, timeout=30000)
TaskOutput(task_id="task-ghi789", block=true, timeout=30000)
```

Benefits:

- 3x faster profiling for 3 datasets
- Each agent focused on a single source
- Results consolidated in main chat
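Outside the Task-agent runtime, the same fan-out pattern can be sketched in plain Python with a thread pool. The sources here are hypothetical in-memory frames standing in for files or tables:

```python
import concurrent.futures
import pandas as pd
import numpy as np

def profile_source(name_df):
    """Profile one source; each worker stays focused on a single input."""
    name, df = name_df
    return {"source": name, "shape": df.shape,
            "missing": int(df.isnull().sum().sum())}

# Hypothetical sources standing in for dataset1.csv, dataset2.csv, ...
sources = {
    "sales": pd.DataFrame({"amt": [1.0, np.nan, 3.0]}),
    "users": pd.DataFrame({"id": [1, 2, 3]}),
}
with concurrent.futures.ThreadPoolExecutor() as pool:
    # All sources profiled in parallel; results collected in submission order
    reports = list(pool.map(profile_source, sources.items()))

print([r["missing"] for r in reports])  # [1, 0]
```

As with the Task-agent pattern, the point is that each worker owns one source and the caller consolidates the reports afterwards.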

### 3. Identify Data Quality Issues

CRITICAL: Document ALL issues before proceeding:

| Check | What to Look For |
|-------|------------------|
| Missing values | Null counts, patterns of missingness |
| Duplicates | Exact duplicates, key-based duplicates |
| Outliers | Extreme values, impossible values |
| Type issues | Strings in numeric columns, date parsing |
| Cardinality | Unexpected unique values |
| Distribution | Skewness, unexpected patterns |
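Each check in the table maps onto a short pandas expression. A sketch on a hypothetical frame; the 1.5 × IQR outlier rule is one common choice, not the only one:

```python
import pandas as pd

# Hypothetical frame with one extreme price and repeated category keys
df = pd.DataFrame({"price": [10, 12, 11, 9, 500],
                   "cat": ["a", "a", "b", "b", "b"]})

missing = df.isnull().sum()                  # Missing values: nulls per column
dupes = df.duplicated(subset=["cat"]).sum()  # Duplicates: key-based, on "cat"

# Outliers: flag values outside 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

cardinality = df["cat"].nunique()            # Cardinality: unique values
skew = df["price"].skew()                    # Distribution: skewness
```

Running these on the toy frame flags the 500 as the lone outlier, which is exactly the kind of finding to record under Data Quality Issues.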

### 4. Create Task Breakdown

Break analysis into ordered tasks:

- Each task should produce visible output
- Order by data dependencies
- Include data cleaning tasks FIRST
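Ordering by data dependencies is a topological sort. A sketch using the standard library's `graphlib`, with a hypothetical task graph where cleaning has no prerequisites and everything else depends on it directly or indirectly:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on
tasks = {
    "clean": set(),            # cleaning has no prerequisites - always first
    "aggregate": {"clean"},
    "visualize": {"aggregate"},
    "model": {"clean"},
}
order = list(TopologicalSorter(tasks).static_order())
print(order)  # cleaning precedes everything that depends on it
```

Any valid ordering it emits puts `clean` first, matching the rule that data cleaning tasks come before analysis tasks.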

### 5. Write Plan Doc

Write to `.claude/PLAN.md`:

```markdown
# Analysis Plan: [Analysis Name]

> **For Claude:** REQUIRED SUB-SKILL: Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")` to implement this plan with output-first verification.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-delegate/SKILL.md")` for subagent templates.

## Spec Reference
See: .claude/SPEC.md

## Data Profile

### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]

#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |

#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]

### Source 2: [name]
[Same structure]

## Task Breakdown

### Task 1: Data Cleaning (required first)
- Handle missing values in col2
- Remove duplicates
- Fix data types
- Output: Clean DataFrame, log of rows removed

### Task 2: [Analysis Step]
- Input: Clean DataFrame
- Process: [description]
- Output: [specific output to verify]
- Dependencies: Task 1

### Task 3: [Next Step]
[Same structure]

## Output Verification Plan
For each task, define what output proves completion:
- Task 1: "X rows cleaned, Y rows dropped"
- Task 2: "Visualization showing [pattern]"
- Task 3: "Model accuracy >= 0.8"

## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
```
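The Reproducibility Requirements section of the template can be filled mechanically at plan time. A sketch that captures seed and key package versions in one dict (the seed value is illustrative):

```python
import random
import sys

import numpy as np
import pandas as pd

# Fix the seed once, then record it alongside the environment (42 is illustrative)
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

snapshot = {
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "seed": SEED,
}
print(snapshot)
```

Pasting this dict into PLAN.md gives later phases the exact values to reproduce the run.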

## Red Flags - STOP If You're About To:

| Action | Why It's Wrong | Do Instead |
|--------|----------------|------------|
| Skip data profiling | Your data issues will break your analysis | Always profile first |
| Ignore missing values | You'll corrupt your results | Document and plan handling |
| Start analysis immediately | You haven't characterized your data | Complete profiling |
| Assume your data is clean | Never assume, you must verify | Run quality checks |

## Output

Complete the plan when:

- Read and understand `.claude/SPEC.md`
- Profile all data sources (shape, types, stats)
- Document data quality issues
- Define cleaning strategy for each issue
- Order tasks by dependency
- Define output verification criteria
- Write `.claude/PLAN.md`
- Confirm ready for implementation

## Phase Complete

REQUIRED SUB-SKILL: After completing the plan, IMMEDIATELY invoke:

```
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
```