# Claude-skill-registry: data-science-tools

Documentation of available data science libraries (scipy, numpy, pandas, sklearn) and best practices for statistical analysis, regression modeling, and organizing analysis scripts. **CRITICAL:** All analysis scripts MUST be placed in `reports/{topic}/scripts/`, NOT in the root `scripts/` directory.

Install the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install only this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-science-tools" ~/.claude/skills/majiayu000-claude-skill-registry-data-science-tools && rm -rf "$T"
# Data Science Tools Skill

Source: `skills/data/data-science-tools/SKILL.md`

## Purpose
This skill documents the data science ecosystem available in this project, including:
- Which Python libraries are installed and available
- How to use them for statistical analysis and regression
- WHERE to place analysis scripts (reports/{topic}/scripts/ - NOT root scripts/)
- Best practices for reproducible data science
## 🚨 CRITICAL: Script Organization Rule

ALL regression, modeling, and analysis scripts MUST go in:

```
reports/{topic}_{timestamp}/scripts/
```

NEVER in:

```
scripts/   ❌ (root scripts/ is only for reusable utilities)
```

See the Script Organization Best Practices section below.
## Available Libraries

### Virtual Environment

The following data science libraries are installed in `.venv` and ready to use:
| Library | Version | Purpose |
|---|---|---|
| numpy | Latest | Numerical computing, arrays, linear algebra |
| scipy | 1.16.3+ | Scientific computing, optimization, statistics |
| pandas | 2.3.3+ | Data manipulation, DataFrames, time series |
| scikit-learn | 1.7.2+ | Machine learning, regression, clustering |
### Activating the Virtual Environment

All Python scripts must use the virtual environment:

```bash
source .venv/bin/activate && python scripts/your_script.py
```

Or add a shebang and run the script directly (note that `#!/usr/bin/env python3` resolves to the venv's interpreter only while the venv is active):

```python
#!/usr/bin/env python3
# Then run directly: ./scripts/your_script.py
```

In Bash tool calls:

```bash
source .venv/bin/activate && python scripts/analysis.py
```
## Common Use Cases
### 1. Regression Modeling (scipy.optimize.curve_fit)

Purpose: Fit non-linear models to data (S-curves, exponentials, etc.)

Example: logistic growth curve (S-curve) fit

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

# Define model
def logistic(t, L, k, t0):
    """Logistic S-curve: L / (1 + exp(-k*(t - t0)))"""
    return L / (1 + np.exp(-k * (t - t0)))

# Prepare data
years = np.array([1993, 1994, ...])     # Time points
shares = np.array([0.004, 0.005, ...])  # Observed values
t = years - 1993                        # Normalize time

# Fit model with bounds
p0 = [80, 0.5, 30]                        # Initial guess: L=80%, k=0.5, t0=30
bounds = ([50, 0.1, 20], [100, 2.0, 50])  # Parameter bounds
params, covariance = curve_fit(
    logistic, t, shares, p0=p0, bounds=bounds, maxfev=10000
)
L, k, t0 = params

# Validate
predictions = logistic(t, L, k, t0)
r2 = r2_score(shares, predictions)
rmse = np.sqrt(np.mean((shares - predictions) ** 2))

print(f"Fitted parameters: L={L:.2f}, k={k:.4f}, t0={t0:.2f}")
print(f"R² = {r2:.6f}, RMSE = {rmse:.4f}")
```
⚠️ Important: Always use curve_fit with:

- An initial guess (`p0`)
- Bounds on parameters (prevents unrealistic values)
- A `maxfev` high enough to allow sufficient iterations
### 2. Model Comparison

Compare multiple models to find the best fit:

```python
def gompertz(t, L, k, t0):
    """Gompertz curve: L * exp(-exp(-k*(t - t0))), an asymmetric S-curve."""
    return L * np.exp(-np.exp(-k * (t - t0)))

models = {
    'logistic': (logistic, [80, 0.5, 30], ([50, 0.1, 20], [100, 2.0, 50])),
    'gompertz': (gompertz, [80, 0.2, 30], ([50, 0.05, 20], [100, 1.0, 50])),
}

results = {}
for name, (func, p0, bounds) in models.items():
    params, _ = curve_fit(func, t, shares, p0=p0, bounds=bounds)
    pred = func(t, *params)
    r2 = r2_score(shares, pred)
    results[name] = {'params': params, 'r2': r2}

# Find best
best_model = max(results.items(), key=lambda x: x[1]['r2'])
print(f"Best model: {best_model[0]} (R² = {best_model[1]['r2']:.6f})")
```
### 3. Data Manipulation with Pandas

Read CSV, filter, aggregate:

```python
import pandas as pd

# Read data
df = pd.read_csv('data/ev_annual_bil10.csv')

# Filter
recent = df[df['year'] >= 2015]

# Aggregate
yearly_avg = df.groupby('year')['ev_share_pct'].mean()

# Export
df.to_csv('data/results.csv', index=False)
```
### 4. Statistical Analysis

```python
from scipy import stats

# Correlation (x, y are your data arrays)
corr, p_value = stats.pearsonr(x, y)

# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# T-test between two samples
t_stat, p_value = stats.ttest_ind(group1, group2)
```
## Script Organization Best Practices

### Directory Structure

```
dst_skills/
├── scripts/                    # Reusable utilities ONLY
│   ├── fetch_and_store.py
│   ├── db/
│   │   └── helpers.py
│   └── utils.py
│
├── data/                       # Raw data and databases
│   ├── dst.db
│   └── *.csv
│
└── reports/                    # Generated reports
    └── {topic}_{timestamp}/
        ├── report.html
        ├── visualizations.html
        ├── data/               # Report-specific intermediate data
        │   └── *.csv
        └── scripts/            # ⚠️ ALL analysis scripts go HERE
            ├── README.md
            ├── fit_models.py
            ├── validate.py
            └── requirements.txt
```
IMPORTANT: Do NOT create analysis scripts in the root `scripts/` directory. All regression, modeling, and analysis scripts must be in the report's `scripts/` folder.
### When to Place Scripts in reports/{topic}/scripts/

#### ✅ ALWAYS for Analysis

Use `reports/{topic}/scripts/` for ALL report-specific analysis:

- Regression modeling (curve_fit, forecasting, etc.)
- Statistical analysis (hypothesis tests, correlations, etc.)
- Data transformation specific to this report
- Validation and model comparison

Why it matters:

- Reproducibility - the reader can re-run your exact analysis
- Documentation - shows exactly what was done
- Versioning - freezes code with the report at time of publication
✅ ALL of these belong in reports/{topic}/scripts/:

- `fit_ev_models.py` - Regression modeling
- `validate_models.py` - Model validation
- `verify_regression_models.py` - scipy verification
- `forecast_scenarios.py` - Forecasting
- `statistical_tests.py` - Hypothesis testing
Example structure:

```
reports/elbiler_danmark_20251031/
├── report.html
├── visualizations.html
├── data/                               # Intermediate data for THIS analysis
│   ├── model_fits.csv
│   ├── forecasts.csv
│   └── residuals.csv
└── scripts/                            # ✅ ALL analysis scripts here
    ├── README.md                       # Explains how to reproduce
    ├── fit_ev_models.py                # Main regression analysis
    ├── validate_models.py              # Cross-validation
    ├── verify_regression_models.py     # scipy verification
    └── requirements.txt                # Dependencies snapshot
```
### When to Use scripts/ (Root Level)

#### ⚠️ ONLY for Reusable Utilities

Root `scripts/` is ONLY for infrastructure utilities that are shared across ALL reports:

- Database utilities (`db/helpers.py`, `db/validate.py`)
- Data fetching (`fetch_and_store.py`)
- Generic helpers (`utils.py`)
- NOT for analysis - no regression, modeling, or statistics
❌ NEVER put these in root scripts/:
- Regression models
- Statistical analysis
- Data transformations
- Forecasting
- Model validation
✅ Root scripts/ should ONLY contain:
```python
# scripts/db/helpers.py - OK (reusable DB utility)
def safe_numeric_cast(column_name):
    """Helper for casting DST suppressed values."""
    return f"CASE WHEN {column_name} != '..' THEN CAST({column_name} AS NUMERIC) ELSE NULL END"


# scripts/utils.py - OK (generic utility)
from datetime import datetime

def format_timestamp():
    """Standard timestamp format for filenames."""
    return datetime.now().strftime('%Y%m%d_%H%M%S')


# scripts/fetch_and_store.py - OK (reusable infrastructure)
def fetch_dst_table(table_id, filters):
    """Fetch data from DST API and store in DuckDB."""
    # ... implementation
```
If you're doing curve_fit, forecasting, or statistics → reports/{topic}/scripts/ ✅
### Template: Report Analysis Script

```python
#!/usr/bin/env python3
"""
EV Adoption Model Fitting and Validation
=========================================

Report: Danmarks Elbilsudvikling 2050
Date: 2025-10-31
Author: Claude Code

Purpose:
    Fit multiple regression models to EV adoption data and compare.

Usage:
    cd reports/{report_name}/scripts/
    source ../../../.venv/bin/activate
    python fit_ev_models.py

Outputs:
    - ../data/model_parameters.csv
    - ../data/forecasts.csv
    - stdout: Model comparison table
"""

import csv
import os

import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score


def main():
    # 1. Load data using relative path from scripts/ directory
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.join(script_dir, '../../..')

    # Path to project-level data
    data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')
    print(f"Loading data from {data_path}...")

    years = []
    shares = []
    with open(data_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            years.append(int(row['year']))
            shares.append(float(row['ev_share_pct']))

    years = np.array(years)
    shares = np.array(shares)

    # 2. Fit models
    print("\nFitting models...")
    # ... implementation

    # 3. Save results to report's data/ directory
    output_dir = os.path.join(script_dir, '../data')
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, 'model_parameters.csv')
    print(f"\nSaving results to {output_path}...")
    # ... save implementation


if __name__ == '__main__':
    main()
```
Key points:

- Use `os.path` for cross-platform compatibility
- Always use relative paths from the script's location
- Project data: `../../../data/`
- Report data: `../data/`
- Activate the venv before running
### README.md Template for Report Scripts

````markdown
# Analysis Scripts for EV Adoption Report

## Report Details

- **Topic:** Danmarks Elbilsudvikling til 2050
- **Generated:** 2025-10-31
- **Data:** BIL10, BIL52, BIL51 (Danmarks Statistik)

## Reproducibility

### Prerequisites

```bash
# From project root
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```

### Run Analysis

```bash
cd reports/elbiler_danmark_20251031/scripts/
python fit_ev_models.py
python validate_models.py
```

## Scripts

- `fit_ev_models.py` - Fits logistic, Gompertz, exponential models
- `validate_models.py` - Cross-validation and residual analysis
- `export_forecasts.py` - Generate 2026-2050 predictions

## Outputs

Results saved to `../data/`:

- `model_parameters.csv` - Fitted parameters (L, k, t0)
- `forecasts.csv` - Year-by-year predictions
- `validation_metrics.csv` - R², RMSE, etc.

## Model Details

See `../report.html` Section 3: Methodology
````
## Common Pitfalls and Solutions

### 1. ModuleNotFoundError

Problem:

```bash
ModuleNotFoundError: No module named 'scipy'
```
Solution:

```bash
# Always activate venv first
source .venv/bin/activate
python scripts/your_script.py
```
### 2. curve_fit Fails to Converge

Problem:

```
OptimizeWarning: Covariance of the parameters could not be estimated
```

Solutions:

- Improve the initial guess `p0`
- Tighten bounds (e.g., L: [60, 90] instead of [50, 100])
- Increase `maxfev` to 20000
- Normalize/scale your data first
- Try a different optimization method (see the sketch after this code)

```python
# Better bounds
bounds = ([65, 0.3, 25], [95, 0.8, 40])  # Tighter

# Or use a different method
from scipy.optimize import minimize, differential_evolution
```
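If `curve_fit`'s local least-squares solver keeps failing, a global optimizer can take over. A minimal sketch using scipy's `differential_evolution`, assuming the `logistic`, `t`, and `shares` definitions from the regression example above; it takes per-parameter (min, max) bounds instead of an initial guess:

```python
import numpy as np
from scipy.optimize import differential_evolution

def sse(params):
    """Objective: sum of squared errors between model and observations."""
    L, k, t0 = params
    residuals = shares - logistic(t, L, k, t0)
    return np.sum(residuals ** 2)

result = differential_evolution(
    sse,
    bounds=[(50, 100), (0.1, 2.0), (20, 50)],  # (min, max) per parameter
    seed=42,   # fixed seed for reproducible runs
    tol=1e-8,
)
L, k, t0 = result.x
print(f"Global fit: L={L:.2f}, k={k:.4f}, t0={t0:.2f}, SSE={result.fun:.6f}")
```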
### 3. Grid Search vs Optimization

Bad (inefficient):

```python
best_r2 = 0
for L in [70, 75, 80, 85, 90, 95]:
    for k in np.arange(0.1, 2.0, 0.05):
        ...  # fit and compare
```

Good (use scipy):

```python
params, _ = curve_fit(logistic, t, shares, p0=[80, 0.5, 30])
```

When grid search is acceptable:

- Quick prototyping to find a good `p0`
- Testing specific scenarios (e.g., compare L=70% vs L=90%)
- Educational purposes
### 4. Overfitting

Warning signs:

- R² > 0.999 on historical data
- Model fits noise, not signal
- Poor performance on a holdout set

Solutions:

```python
# Train-test split for time series: shuffle=False preserves chronological order
from sklearn.model_selection import train_test_split

train_t, test_t, train_y, test_y = train_test_split(
    t, shares, test_size=0.2, shuffle=False
)

# Fit on train, validate on test
params, _ = curve_fit(logistic, train_t, train_y, p0=[80, 0.5, 30])
test_pred = logistic(test_t, *params)
test_r2 = r2_score(test_y, test_pred)

if test_r2 < 0.9:
    print("⚠️ Warning: Poor generalization")
```
## Installation and Verification

### Check Installed Packages

```bash
source .venv/bin/activate
pip list | grep -E "(numpy|scipy|pandas|scikit)"
```

Expected output:

```
numpy         1.x.x
pandas        2.3.3
scikit-learn  1.7.2
scipy         1.16.3
```

### Verify scipy.optimize Works

```bash
source .venv/bin/activate
python -c "from scipy.optimize import curve_fit; print('✓ scipy.optimize available')"
```

### Install Missing Packages

```bash
source .venv/bin/activate
pip install numpy scipy pandas scikit-learn
```
## Integration with DST Skills Workflow

### Typical Workflow

1. Discovery: `/dst-discover` → Find tables
2. Fetch: `/dst-fetch` → Download data to `data/`
3. Analysis: `/dst-analyze` → SQL queries, basic calculations
4. Modeling: Create script in `reports/{topic}/scripts/` for regression
5. Visualize: `/dst-visualize` → Create charts from results
6. Report: `/dst-report` → Generate HTML with all findings
### Where Each Step Happens

| Step | Location | Examples |
|---|---|---|
| Data fetching | `data/` (project root) | dst.db, *.csv |
| SQL queries | Agent (ephemeral) | Aggregations, joins |
| Regression/modeling | `reports/{topic}/scripts/` ✅ | curve_fit, forecasting |
| Results | `reports/{topic}/data/` | model_parameters.csv |
| Report | `reports/{topic}/` | report.html |
## Example: Complete Regression Analysis

### Step 1: Create analysis script in report folder

File: `reports/elbiler_danmark_20251031/scripts/fit_logistic_model.py`
```python
#!/usr/bin/env python3
"""
Fit a logistic growth model to EV adoption data.

Usage:
    cd reports/elbiler_danmark_20251031/scripts/
    source ../../../.venv/bin/activate
    python fit_logistic_model.py
"""

import csv
import os

import numpy as np
from scipy.optimize import curve_fit


def main():
    # Load data from project data/
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.join(script_dir, '../../..')
    data_path = os.path.join(project_root, 'data/ev_annual_bil10.csv')

    # 1. Load data
    years = []
    shares = []
    with open(data_path, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            years.append(int(row['year']))
            shares.append(float(row['ev_share_pct']))

    years = np.array(years)
    shares = np.array(shares)
    t = years - years.min()

    # 2. Define and fit model
    def logistic(t, L, k, t0):
        return L / (1 + np.exp(-k * (t - t0)))

    params, _ = curve_fit(logistic, t, shares,
                          p0=[80, 0.5, 30],
                          bounds=([50, 0.1, 20], [100, 2.0, 50]))
    L, k, t0 = params

    # 3. Forecast
    future_years = np.arange(years.max() + 1, 2051)
    future_t = future_years - years.min()
    forecast = logistic(future_t, L, k, t0)

    # 4. Export to report's data/ folder
    output_dir = os.path.join(script_dir, '../data')
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, 'forecast.csv')
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['year', 'predicted_share'])
        for year, pred in zip(future_years, forecast):
            writer.writerow([year, pred])

    print(f"✓ Forecast exported: {output_path}")
    print(f"  Model: L={L:.1f}%, k={k:.3f}, t0={t0:.1f}")


if __name__ == '__main__':
    main()
```
### Step 2: Run from report's scripts/ directory

```bash
cd reports/elbiler_danmark_20251031/scripts/
source ../../../.venv/bin/activate
python fit_logistic_model.py
```
### Step 3: Use results in visualization and report

The forecast.csv is now in `reports/elbiler_danmark_20251031/data/` and can be used by /dst-visualize and /dst-report.
✅ Benefits of this approach:
- Script stays with report (reproducibility)
- Relative paths work from any machine
- Clear separation: data fetching vs analysis vs reporting
- Easy to version control and share
## References

### Documentation
- scipy.optimize.curve_fit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- sklearn metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
- pandas: https://pandas.pydata.org/docs/
### Regression Theory

- Logistic growth: Bass diffusion model, technology adoption
- Gompertz curve: Asymmetric S-curve for market saturation
- Model selection: AIC, BIC, cross-validation (see the sketch below)
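The AIC/BIC criteria above can be computed directly from least-squares residuals. A minimal sketch, assuming the `models` and `results` structures plus `t` and `shares` from the Model Comparison example; lower values favor the model that balances goodness of fit against parameter count:

```python
import numpy as np

def aic_bic(y, y_pred, n_params):
    """AIC/BIC for a least-squares fit (up to an additive constant)."""
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)
    aic = n * np.log(sse / n) + 2 * n_params
    bic = n * np.log(sse / n) + n_params * np.log(n)
    return aic, bic

for name, res in results.items():
    func = models[name][0]          # model function
    pred = func(t, *res['params'])  # in-sample predictions
    aic, bic = aic_bic(shares, pred, n_params=len(res['params']))
    print(f"{name}: AIC={aic:.1f}, BIC={bic:.1f} (lower is better)")
```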
### Best Practices

- Script placement: ALWAYS put analysis scripts in `reports/{topic}/scripts/`
- Validation: Use a train-test split for model validation
- Reporting: Always report R², RMSE, and residual plots (see the residual-check sketch after this list)
- Documentation: Document assumptions and limitations in script docstrings
- Reproducibility: Version-control analysis scripts WITH the report they generate
- Data paths: Use relative paths with `os.path` for cross-platform compatibility
- Virtual env: Always activate `.venv` before running scipy/numpy code
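For the reporting point above: residuals can be sanity-checked numerically even without a plotting library (matplotlib is not listed among the installed packages). A minimal sketch, assuming `shares` and `predictions` from the first curve_fit example:

```python
import numpy as np

residuals = shares - predictions

# Residuals should be centered on zero with no obvious structure
print(f"mean residual: {np.mean(residuals):.4f} (should be ~0)")
print(f"std residual:  {np.std(residuals):.4f}")

# Lag-1 autocorrelation near ±1 suggests systematic misfit (e.g., wrong curve shape)
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")
```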
## Quick Reference: Where Does It Go?

| What | Where | Example |
|---|---|---|
| Regression scripts | `reports/{topic}/scripts/` | fit_ev_models.py |
| Validation scripts | `reports/{topic}/scripts/` | validate_models.py |
| Forecasting scripts | `reports/{topic}/scripts/` | forecast_scenarios.py |
| Statistical tests | `reports/{topic}/scripts/` | statistical_tests.py |
| Intermediate results | `reports/{topic}/data/` | model_fits.csv |
| Raw data | `data/` (project root) | dst.db, *.csv |
| Reusable utilities | `scripts/` (project root) | fetch_and_store.py, db/helpers.py |

Simple rule: If it uses scipy/curve_fit/statistics → `reports/{topic}/scripts/` ✅