Claude-skill-registry ds-plan
REQUIRED Phase 2 of /ds workflow. Profiles data and creates analysis task breakdown.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ds-plan" ~/.claude/skills/majiayu000-claude-skill-registry-ds-plan && rm -rf "$T"
skills/data/ds-plan/SKILL.md
Announce: "Using ds-plan (Phase 2) to profile data and create task breakdown."
Planning (Data Profiling + Task Breakdown)
Profile the data and create an analysis plan based on the spec.
Requires .claude/SPEC.md from /ds-brainstorm first.
<EXTREMELY-IMPORTANT>
SPEC MUST EXIST BEFORE PLANNING. This is not negotiable.
Before exploring data or creating tasks, you MUST have:
- .claude/SPEC.md with objectives and constraints
- Clear success criteria
- User-approved spec
If .claude/SPEC.md doesn't exist, run /ds-brainstorm first.
</EXTREMELY-IMPORTANT>
Rationalization Table - STOP If You Think:
| Excuse | Reality | Do Instead |
|---|---|---|
| "Data looks clean, profiling unnecessary" | Your data is never clean | PROFILE to discover issues |
| "I can profile as I go" | You'll miss systemic issues | PROFILE comprehensively NOW |
| "Quick .head() is enough" | .head() hides tail problems | RUN full profiling checklist |
| "Missing values won't affect my analysis" | They always do | DOCUMENT and plan handling |
| "I'll handle data issues during analysis" | Data issues will derail your analysis | FIX data issues FIRST |
| "User didn't mention data quality" | They assume YOU'LL check | QUALITY check is YOUR job |
| "Profiling takes too long" | Skipping it costs days later | INVEST time now |
Honesty Framing
Creating an analysis plan without profiling the data is LYING about understanding the data.
You cannot plan analysis steps without knowing:
- Your data's shape and types
- Your missing value patterns
- Your data quality issues
- Your cleaning requirements
Profiling costs you minutes. A wrong plan costs hours of rework and produces incorrect results.
No Pause After Completion
After writing .claude/PLAN.md, IMMEDIATELY invoke:
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")
DO NOT:
- Ask "should I proceed with implementation?"
- Summarize the plan
- Wait for user confirmation (they approved SPEC already)
- Write status updates
The workflow phases are SEQUENTIAL. Complete plan → immediately start implement.
What Plan Does
| DO | DON'T |
|---|---|
| Read .claude/SPEC.md | Skip brainstorm phase |
| Profile data (shape, types, stats) | Skip to analysis |
| Identify data quality issues | Ignore missing/duplicate data |
| Create ordered task list | Write final analysis code |
| Write .claude/PLAN.md | Make completion claims |
Brainstorm answers: WHAT and WHY
Plan answers: HOW and DATA QUALITY
Process
1. Verify Spec Exists
```bash
cat .claude/SPEC.md  # verify-spec: read SPEC file to confirm it exists
```
If missing, stop and run /ds-brainstorm first.
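The same check can be done programmatically. The following is a minimal sketch, assuming the spec lives at .claude/SPEC.md relative to the working directory; the helper name `spec_is_ready` is hypothetical and not part of the skill.

```python
from pathlib import Path


def spec_is_ready(spec_path: str = ".claude/SPEC.md") -> bool:
    """Return True if the spec file exists and is not blank.

    Hypothetical helper: it only checks that the file has content and
    cannot tell whether the user actually approved the spec.
    """
    path = Path(spec_path)
    if not path.is_file():
        return False
    return bool(path.read_text(encoding="utf-8").strip())
```

If `spec_is_ready()` returns False, planning should stop and /ds-brainstorm should run first.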
2. Data Profiling
For multiple data sources: Profile in parallel using background Task agents.
Single Data Source (Direct Profiling)
MANDATORY profiling steps:
```python
import pandas as pd

# Assumes df is an already-loaded DataFrame; col and date_col are
# placeholder column names to replace with columns from your data.

# Basic structure
df.shape                       # (rows, columns)
df.dtypes                      # Column types
df.head(10)                    # Sample data
df.tail(5)                     # End of data

# Summary statistics
df.describe()                  # Numeric summaries
df.describe(include='object')  # Categorical summaries
df.info()                      # Memory, non-null counts

# Data quality checks
df.isnull().sum()              # Missing values per column
df.duplicated().sum()          # Duplicate rows
df[col].value_counts()         # Distribution of categories

# For time series
df[date_col].min(), df[date_col].max()  # Date range
df.groupby(date_col).size()             # Records per period
```
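The individual checks above can be gathered into one report so the numbers are easy to copy into the plan. This is a sketch only: the function name `profile_dataframe` and the example frame are made up for illustration, and the report covers just a subset of the mandatory steps.

```python
import pandas as pd


def profile_dataframe(df: pd.DataFrame) -> dict:
    """Collect basic profiling numbers into a plain dict (hypothetical helper)."""
    return {
        "rows": int(df.shape[0]),
        "columns": int(df.shape[1]),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing": {col: int(n) for col, n in df.isnull().sum().items()},
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny made-up frame, for illustration only
example = pd.DataFrame(
    {"id": [1, 2, 2, 4], "category": ["a", "b", "b", None]}
)
report = profile_dataframe(example)
```

Returning a dict rather than printing keeps the results available for the data quality section of the plan.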
Multiple Data Sources (Parallel Profiling)
<EXTREMELY-IMPORTANT>
**Pattern from oh-my-opencode: Launch ALL profiling agents in a SINGLE message.**
Use run_in_background: true for parallel execution.
When profiling 2+ data sources, launch agents in parallel:
</EXTREMELY-IMPORTANT>
```
# PARALLEL + BACKGROUND: All Task calls in ONE message
Task(
    subagent_type="general-purpose",
    description="Profile dataset 1",
    run_in_background=true,
    prompt="""
Profile this dataset and return a data quality report.

Dataset: /path/to/dataset1.csv

Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()

Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning

Tools denied: Write, Edit, NotebookEdit (read-only profiling)
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 2",
    run_in_background=true,
    prompt="""
[Same template for dataset 2]
""")

Task(
    subagent_type="general-purpose",
    description="Profile dataset 3",
    run_in_background=true,
    prompt="""
[Same template for dataset 3]
""")
```
After launching agents:
- Continue to other work (don't wait)
- Check status with /tasks command
- Collect results with TaskOutput when ready
```
# Collect profiling results
TaskOutput(task_id="task-abc123", block=true, timeout=30000)
TaskOutput(task_id="task-def456", block=true, timeout=30000)
TaskOutput(task_id="task-ghi789", block=true, timeout=30000)
```
Benefits:
- Up to roughly 3x faster profiling for 3 datasets, because the sources are profiled concurrently
- Each agent focused on single source
- Results consolidated in main chat
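The Task calls above repeat one prompt template per dataset. A small helper can fill that template for each path so every agent gets identical instructions. This is a sketch under assumptions: `build_profile_prompts` is a hypothetical name, and it only builds the prompt strings; it does not launch any agents.

```python
PROFILE_PROMPT_TEMPLATE = """\
Profile this dataset and return a data quality report.

Dataset: {path}

Required checks:
1. Shape: rows x columns
2. Data types: df.dtypes
3. Missing values: df.isnull().sum()
4. Duplicates: df.duplicated().sum()
5. Summary statistics: df.describe()
6. Unique value counts for categorical columns
7. Date range if time series
8. Memory usage: df.info()

Output format:
- Markdown table with column summary
- List of data quality issues found
- Recommendations for cleaning

Tools denied: Write, Edit, NotebookEdit (read-only profiling)
"""


def build_profile_prompts(paths: list[str]) -> dict[str, str]:
    """Map each dataset path to its filled-in profiling prompt."""
    return {path: PROFILE_PROMPT_TEMPLATE.format(path=path) for path in paths}


prompts = build_profile_prompts(
    ["/path/to/dataset1.csv", "/path/to/dataset2.csv"]
)
```

Each value in `prompts` can then be used as the prompt argument of one Task call, with all calls still sent in a single message.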
3. Identify Data Quality Issues
CRITICAL: Document ALL issues before proceeding:
| Check | What to Look For |
|---|---|
| Missing values | Null counts, patterns of missingness |
| Duplicates | Exact duplicates, key-based duplicates |
| Outliers | Extreme values, impossible values |
| Type issues | Strings in numeric columns, date parsing |
| Cardinality | Unexpected unique values |
| Distribution | Skewness, unexpected patterns |
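Several rows of the table above translate directly into pandas checks. The sketch below covers missing values, exact and key-based duplicates, and outliers. It is illustrative only: `find_quality_issues` is a hypothetical helper, the 1.5 * IQR rule is one common rule of thumb rather than something this skill prescribes, and the sample data is made up.

```python
import pandas as pd


def find_quality_issues(df: pd.DataFrame, key: str) -> list[str]:
    """Return human-readable descriptions of basic data quality issues."""
    issues = []

    # Missing values: report each column that has at least one null
    for col, n_missing in df.isnull().sum().items():
        if n_missing > 0:
            issues.append(f"Missing: {col} has {int(n_missing)} nulls")

    # Exact duplicates: whole rows repeated
    n_exact = int(df.duplicated().sum())
    if n_exact:
        issues.append(f"Duplicates: {n_exact} exact duplicate rows")

    # Key-based duplicates: same key value, possibly different other columns
    n_key = int(df.duplicated(subset=[key]).sum())
    if n_key:
        issues.append(f"Duplicates: {n_key} repeated values of {key}")

    # Outliers: numeric values outside 1.5 * IQR (assumed rule of thumb)
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        spread = q3 - q1
        outside = (df[col] < q1 - 1.5 * spread) | (df[col] > q3 + 1.5 * spread)
        if outside.any():
            issues.append(f"Outliers: {col} has {int(outside.sum())} values outside 1.5*IQR")

    return issues


# Made-up sample with one null, one repeated key, and one extreme value
sample = pd.DataFrame(
    {"id": [1, 2, 2, 3, 4, 5], "amount": [10, 12, 11, None, 13, 1000]}
)
issues = find_quality_issues(sample, key="id")
```

Type, cardinality, and distribution checks depend more heavily on the dataset and are left out of this sketch.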
4. Create Task Breakdown
Break analysis into ordered tasks:
- Each task should produce visible output
- Order by data dependencies
- Include data cleaning tasks FIRST
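Ordering by data dependencies amounts to a topological sort. Here is a minimal sketch using the standard library's `graphlib`; the task names and the `order_tasks` helper are hypothetical examples, not names defined by the skill.

```python
from graphlib import TopologicalSorter


def order_tasks(dependencies: dict[str, list[str]]) -> list[str]:
    """Return task names so that each task comes after the tasks it depends on.

    `dependencies` maps a task to the tasks that must finish before it.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan: cleaning has no dependencies, so it is scheduled first
plan = {
    "clean": [],
    "aggregate": ["clean"],
    "visualize": ["aggregate"],
}
ordered = order_tasks(plan)
```

Because every analysis task depends directly or indirectly on the cleaning task, cleaning always ends up first in the result.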
5. Write Plan Doc
Write to .claude/PLAN.md:
```markdown
# Analysis Plan: [Analysis Name]

> **For Claude:** REQUIRED SUB-SKILL: Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")` to implement this plan with output-first verification.
>
> **Delegation:** Main chat orchestrates, Task agents implement. Use `Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-delegate/SKILL.md")` for subagent templates.

## Spec Reference
See: .claude/SPEC.md

## Data Profile

### Source 1: [name]
- Location: [path/connection]
- Shape: [rows] x [columns]
- Date range: [start] to [end]
- Key columns: [list]

#### Column Summary
| Column | Type | Non-null | Unique | Notes |
|--------|------|----------|--------|-------|
| col1 | int64 | 100% | 50 | Primary key |
| col2 | object | 95% | 10 | Category |

#### Data Quality Issues
- [ ] Missing: col2 has 5% nulls - [strategy: drop/impute/flag]
- [ ] Duplicates: 100 duplicate rows on [key] - [strategy]
- [ ] Outliers: col3 has values > 1000 - [strategy]

### Source 2: [name]
[Same structure]

## Task Breakdown

### Task 1: Data Cleaning (required first)
- Handle missing values in col2
- Remove duplicates
- Fix data types
- Output: Clean DataFrame, log of rows removed

### Task 2: [Analysis Step]
- Input: Clean DataFrame
- Process: [description]
- Output: [specific output to verify]
- Dependencies: Task 1

### Task 3: [Next Step]
[Same structure]

## Output Verification Plan
For each task, define what output proves completion:
- Task 1: "X rows cleaned, Y rows dropped"
- Task 2: "Visualization showing [pattern]"
- Task 3: "Model accuracy >= 0.8"

## Reproducibility Requirements
- Random seed: [value if needed]
- Package versions: [key packages]
- Data snapshot: [date/version]
```
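The Column Summary table in the template can be generated from profiling results instead of typed by hand. The sketch below assumes each column is described by a dict with the keys `name`, `dtype`, `non_null_pct`, `unique`, and an optional `notes`; both that shape and the function name `render_column_summary` are hypothetical.

```python
def render_column_summary(columns: list[dict]) -> str:
    """Render per-column stats as the markdown table used in PLAN.md."""
    lines = [
        "| Column | Type | Non-null | Unique | Notes |",
        "|--------|------|----------|--------|-------|",
    ]
    for col in columns:
        lines.append(
            f"| {col['name']} | {col['dtype']} | {col['non_null_pct']:.0f}% "
            f"| {col['unique']} | {col.get('notes', '')} |"
        )
    return "\n".join(lines)


# Values copied from the template's example rows
table = render_column_summary(
    [
        {"name": "col1", "dtype": "int64", "non_null_pct": 100, "unique": 50, "notes": "Primary key"},
        {"name": "col2", "dtype": "object", "non_null_pct": 95, "unique": 10, "notes": "Category"},
    ]
)
```

Generating the table from measured values helps keep the plan consistent with what profiling actually found.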
Red Flags - STOP If You're About To:
| Action | Why It's Wrong | Do Instead |
|---|---|---|
| Skip data profiling | Your data issues will break your analysis | Always profile first |
| Ignore missing values | You'll corrupt your results | Document and plan handling |
| Start analysis immediately | You haven't characterized your data | Complete profiling |
| Assume your data is clean | Never assume; you must verify | Run quality checks |
Output
The plan is complete when all of the following are done:
- Read and understand .claude/SPEC.md
- Profile all data sources (shape, types, stats)
- Document data quality issues
- Define cleaning strategy for each issue
- Order tasks by dependency
- Define output verification criteria
- Write .claude/PLAN.md
- Confirm ready for implementation
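Part of this checklist can be verified mechanically by confirming that the written plan contains every top-level section from the template. A minimal sketch follows; `missing_sections` is a hypothetical helper, and it checks only that the headings are present, not that their content is adequate.

```python
REQUIRED_SECTIONS = [
    "## Spec Reference",
    "## Data Profile",
    "## Task Breakdown",
    "## Output Verification Plan",
    "## Reproducibility Requirements",
]


def missing_sections(plan_text: str) -> list[str]:
    """Return the required headings that do not appear as a line in the plan."""
    present = {line.strip() for line in plan_text.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in present]


# Made-up partial plan, for illustration only
draft = "# Analysis Plan: Example\n\n## Spec Reference\nSee: .claude/SPEC.md\n\n## Data Profile\n"
gaps = missing_sections(draft)
```

An empty result means all template sections exist; anything else lists what still needs to be written before moving on.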
Phase Complete
REQUIRED SUB-SKILL: After completing plan, IMMEDIATELY invoke:
Read("${CLAUDE_PLUGIN_ROOT}/lib/skills/ds-implement/SKILL.md")