Upsonic data-analysis

Name: data-analysis
Author: upsonic

Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design.

install

source · Clone the upstream repo

git clone https://github.com/Upsonic/Upsonic

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/src/upsonic/skills/builtins/data-analysis" ~/.claude/skills/upsonic-upsonic-data-analysis && rm -rf "$T"

manifest: src/upsonic/skills/builtins/data-analysis/SKILL.md

source content

Data Analysis

Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.

Before You Analyze

Understand the Question

Before touching the data, clarify:

What question are we answering? ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
Who needs the answer? (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
What decisions will this inform? (This determines how precise you need to be and what format the answer should take.)
What's the timeline? (A quick sanity check and a thorough statistical analysis require different approaches.)

If the user says "analyze this data" without a specific question, help them formulate one:

"What would be most useful to know from this data?"
"Are you looking for trends over time, comparisons between groups, or something else?"
"Is there a specific business question this should answer?"

Reference Materials and Scripts

Execute
```
profile_data.py
```
with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with
```
--help
```
for usage.
Load
```
statistical-tests-guide.md
```
when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.

Understand the Data

Before analysis, get your bearings:

Source and context: Where did this data come from? How was it collected? What time period does it cover?
Schema: What are the columns/fields? What do they represent? What are the data types?
Scale: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
Known issues: Is the data known to be incomplete, biased, or have quality problems?

# First look at any dataset
import pandas as pd

df = pd.read_csv("data.csv")  # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")

Analysis Workflow

Step 1: Clean and Validate

Data quality determines analysis quality. Don't skip this.

Handle Missing Values

Count them first: What percentage of each column is missing?
Understand why: Are they random? Systematic? (e.g., optional fields vs data collection failures)
Choose a strategy and document it:
- Drop rows: When missing data is rare and random (less than 5%)
- Impute with median/mode: When missing data is moderate and the distribution is known
- Flag as separate category: When missingness itself is informative
- Leave as-is: When the analysis method handles nulls natively

# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

Handle Outliers

Detect: Use IQR method, z-scores, or domain knowledge
Investigate: Are they errors or legitimate extreme values?
Document your decision: Keep, cap, or remove — and explain why

# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")

Validate Data Types and Ranges

Dates should be dates, numbers should be numbers
Check for impossible values (negative ages, future dates, percentages over 100)
Verify categorical values are consistent (watch for "USA", "US", "United States")

Step 2: Explore

Start broad, then focus on what's interesting.

Descriptive Statistics

Always start here — understand the basics before going deeper.

# Numerical columns
print(df.describe())

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))

Distributions

Understanding shape matters for choosing the right tests.

import matplotlib.pyplot as plt

# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")

Correlations and Relationships

Look for patterns between variables.

# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")

Trends Over Time

If the data has a time dimension, always look at trends.

# Time series analysis
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")

Step 3: Analyze

Choose the right method for the question.

Comparison Questions ("Is A different from B?")

Use the right statistical test:

Two groups, continuous outcome: t-test (if normal) or Mann-Whitney U (if not)
Multiple groups: ANOVA (if normal) or Kruskal-Wallis (if not)
Two groups, categorical outcome: Chi-squared test
Before/after with same subjects: Paired t-test or Wilcoxon signed-rank

Always report:

Sample sizes for each group
Effect size (not just p-value) — Cohen's d, odds ratio, or percentage difference
Confidence intervals
Practical significance, not just statistical significance

from scipy import stats

# Example: comparing two groups
group_a = df[df['variant'] == 'A']['metric']
group_b = df[df['variant'] == 'B']['metric']

# Check normality first
_, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a))))
_, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b))))

if p_normal_a > 0.05 and p_normal_b > 0.05:
    stat, p_value = stats.ttest_ind(group_a, group_b)
    test_name = "t-test"
else:
    stat, p_value = stats.mannwhitneyu(group_a, group_b)
    test_name = "Mann-Whitney U"

# Effect size (Cohen's d)
pooled_std = ((group_a.std()**2 + group_b.std()**2) / 2) ** 0.5
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

print(f"Test: {test_name}")
print(f"Group A: mean={group_a.mean():.3f}, n={len(group_a)}")
print(f"Group B: mean={group_b.mean():.3f}, n={len(group_b)}")
print(f"Difference: {group_a.mean() - group_b.mean():.3f} ({(group_a.mean() - group_b.mean())/group_b.mean()*100:.1f}%)")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"p-value: {p_value:.4f}")

Trend Questions ("Is this changing over time?")

Use rolling averages to smooth noise
Look for seasonality (day-of-week, monthly, quarterly patterns)
Segment by key dimensions — overall trends can mask group-level differences
Compare period-over-period (WoW, MoM, YoY)

Composition Questions ("What's the breakdown?")

Use percentages and proportions
Show both absolute numbers and percentages
Look for the Pareto principle (80/20 rule)
Segment into meaningful groups

A/B Test Analysis

For experiment analysis, follow this checklist:

Sample size: Was the test adequately powered?
Duration: Did it run long enough for weekly patterns?
Randomization: Were groups balanced on key dimensions?
Metric definition: Is the metric clearly defined and correctly computed?
Statistical test: Use the appropriate test for the metric type
Multiple comparisons: Correct for multiple testing if checking many metrics
Practical significance: Is the effect large enough to matter?

Step 4: Communicate Findings

Structure matters as much as the analysis itself.

Lead With the Answer

Start with what the user asked. Then support it.

Pattern:

[Answer to the question in 1-2 sentences]

[Key supporting evidence: 2-4 bullet points with specific numbers]

[Methodology note: How you arrived at this, in 1-2 sentences]

[Caveats and limitations: What could affect this conclusion]

[Recommended next steps: What to do with this information]

Visualization Guidelines

Choose the right chart type:

Trend over time: Line chart
Comparison between categories: Bar chart (horizontal for many categories)
Distribution: Histogram or box plot
Relationship between two variables: Scatter plot
Part-of-whole: Stacked bar or pie chart (only for 2-5 categories)
Multiple dimensions: Heatmap or small multiples

For every chart:

Title that states the finding, not just the data ("Conversion drops 40% on weekends" not "Conversion by day")
Label axes with units
Include sample size or time period
Use consistent colors across related charts

What to Always Include

Sample sizes: How much data backs each claim
Time periods: When the data is from
Definitions: How key metrics are calculated
Confidence: How certain you are (confidence intervals, p-values where appropriate)
Limitations: What the data can't tell you

Common Pitfalls

Survivorship bias: Only analyzing data from things that succeeded/persisted
Simpson's paradox: A trend that appears in groups but reverses when combined
Confounding variables: Correlation doesn't mean causation — look for third variables
Small sample sizes: Don't draw strong conclusions from small N. State the limitation.
Cherry-picking timeframes: Make sure the time period is representative
Precision theater: Reporting "23.847%" when the data only supports "roughly 24%"
Missing baseline: A number without context is meaningless ("10,000 errors" — is that a lot?)

Handling Common Requests

"Just give me a quick look"

Provide: shape, column summary, missing data count, top-level descriptive stats, one key insight. Keep it to one screen of output.

"What's interesting in this data?"

Run the full exploration workflow (Step 2) and surface the 3-5 most notable patterns: unexpected distributions, strong correlations, outliers, trends.

"Is this statistically significant?"

Clarify what comparison they mean, choose the right test, report effect size and p-value, and explain what it means in practical terms.

"Create a dashboard / visualization"

Ask what decisions the dashboard supports. Design 3-5 charts that answer those questions. Use consistent styling and clear titles that state findings.