Upsonic data-analysis

Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design.

install
source · Clone the upstream repo
git clone https://github.com/Upsonic/Upsonic
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/src/upsonic/skills/builtins/data-analysis" ~/.claude/skills/upsonic-upsonic-data-analysis && rm -rf "$T"
manifest: src/upsonic/skills/builtins/data-analysis/SKILL.md
source content

Data Analysis

Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.

Before You Analyze

Understand the Question

Before touching the data, clarify:

  1. What question are we answering? ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
  2. Who needs the answer? (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
  3. What decisions will this inform? (This determines how precise you need to be and what format the answer should take.)
  4. What's the timeline? (A quick sanity check and a thorough statistical analysis require different approaches.)

If the user says "analyze this data" without a specific question, help them formulate one:

  • "What would be most useful to know from this data?"
  • "Are you looking for trends over time, comparisons between groups, or something else?"
  • "Is there a specific business question this should answer?"

Reference Materials and Scripts

  • Execute
    profile_data.py
    with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with
    --help
    for usage.
  • Load
    statistical-tests-guide.md
    when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.

Understand the Data

Before analysis, get your bearings:

  1. Source and context: Where did this data come from? How was it collected? What time period does it cover?
  2. Schema: What are the columns/fields? What do they represent? What are the data types?
  3. Scale: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
  4. Known issues: Is the data known to be incomplete, biased, or have quality problems?
# First look at any dataset
import pandas as pd

df = pd.read_csv("data.csv")  # or read_excel, read_json, etc.
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")

Analysis Workflow

Step 1: Clean and Validate

Data quality determines analysis quality. Don't skip this.

Handle Missing Values

  • Count them first: What percentage of each column is missing?
  • Understand why: Are they random? Systematic? (e.g., optional fields vs data collection failures)
  • Choose a strategy and document it:
    • Drop rows: When missing data is rare and random (less than 5%)
    • Impute with median/mode: When missing data is moderate and the distribution is known
    • Flag as separate category: When missingness itself is informative
    • Leave as-is: When the analysis method handles nulls natively
# Document your decisions
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

Handle Outliers

  • Detect: Use IQR method, z-scores, or domain knowledge
  • Investigate: Are they errors or legitimate extreme values?
  • Document your decision: Keep, cap, or remove — and explain why
# IQR method for outlier detection
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")

Validate Data Types and Ranges

  • Dates should be dates, numbers should be numbers
  • Check for impossible values (negative ages, future dates, percentages over 100)
  • Verify categorical values are consistent (watch for "USA", "US", "United States")

Step 2: Explore

Start broad, then focus on what's interesting.

Descriptive Statistics

Always start here — understand the basics before going deeper.

# Numerical columns
print(df.describe())

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))

Distributions

Understanding shape matters for choosing the right tests.

import matplotlib.pyplot as plt

# Distribution of key metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
    df[col].hist(ax=axes[i], bins=30)
    axes[i].set_title(col)
    axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
    axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")

Correlations and Relationships

Look for patterns between variables.

# Correlation matrix for numerical columns
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if abs(corr.iloc[i, j]) > 0.5:
            print(f"  {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")

Trends Over Time

If the data has a time dimension, always look at trends.

# Time series analysis
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")

Step 3: Analyze

Choose the right method for the question.

Comparison Questions ("Is A different from B?")

Use the right statistical test:

  • Two groups, continuous outcome: t-test (if normal) or Mann-Whitney U (if not)
  • Multiple groups: ANOVA (if normal) or Kruskal-Wallis (if not)
  • Two groups, categorical outcome: Chi-squared test
  • Before/after with same subjects: Paired t-test or Wilcoxon signed-rank

Always report:

  • Sample sizes for each group
  • Effect size (not just p-value) — Cohen's d, odds ratio, or percentage difference
  • Confidence intervals
  • Practical significance, not just statistical significance
from scipy import stats

# Example: comparing two groups
group_a = df[df['variant'] == 'A']['metric']
group_b = df[df['variant'] == 'B']['metric']

# Check normality first
_, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a))))
_, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b))))

if p_normal_a > 0.05 and p_normal_b > 0.05:
    stat, p_value = stats.ttest_ind(group_a, group_b)
    test_name = "t-test"
else:
    stat, p_value = stats.mannwhitneyu(group_a, group_b)
    test_name = "Mann-Whitney U"

# Effect size (Cohen's d)
pooled_std = ((group_a.std()**2 + group_b.std()**2) / 2) ** 0.5
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

print(f"Test: {test_name}")
print(f"Group A: mean={group_a.mean():.3f}, n={len(group_a)}")
print(f"Group B: mean={group_b.mean():.3f}, n={len(group_b)}")
print(f"Difference: {group_a.mean() - group_b.mean():.3f} ({(group_a.mean() - group_b.mean())/group_b.mean()*100:.1f}%)")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"p-value: {p_value:.4f}")

Trend Questions ("Is this changing over time?")

  • Use rolling averages to smooth noise
  • Look for seasonality (day-of-week, monthly, quarterly patterns)
  • Segment by key dimensions — overall trends can mask group-level differences
  • Compare period-over-period (WoW, MoM, YoY)

Composition Questions ("What's the breakdown?")

  • Use percentages and proportions
  • Show both absolute numbers and percentages
  • Look for the Pareto principle (80/20 rule)
  • Segment into meaningful groups

A/B Test Analysis

For experiment analysis, follow this checklist:

  1. Sample size: Was the test adequately powered?
  2. Duration: Did it run long enough for weekly patterns?
  3. Randomization: Were groups balanced on key dimensions?
  4. Metric definition: Is the metric clearly defined and correctly computed?
  5. Statistical test: Use the appropriate test for the metric type
  6. Multiple comparisons: Correct for multiple testing if checking many metrics
  7. Practical significance: Is the effect large enough to matter?

Step 4: Communicate Findings

Structure matters as much as the analysis itself.

Lead With the Answer

Start with what the user asked. Then support it.

Pattern:

[Answer to the question in 1-2 sentences]

[Key supporting evidence: 2-4 bullet points with specific numbers]

[Methodology note: How you arrived at this, in 1-2 sentences]

[Caveats and limitations: What could affect this conclusion]

[Recommended next steps: What to do with this information]

Visualization Guidelines

Choose the right chart type:

  • Trend over time: Line chart
  • Comparison between categories: Bar chart (horizontal for many categories)
  • Distribution: Histogram or box plot
  • Relationship between two variables: Scatter plot
  • Part-of-whole: Stacked bar or pie chart (only for 2-5 categories)
  • Multiple dimensions: Heatmap or small multiples

For every chart:

  • Title that states the finding, not just the data ("Conversion drops 40% on weekends" not "Conversion by day")
  • Label axes with units
  • Include sample size or time period
  • Use consistent colors across related charts

What to Always Include

  • Sample sizes: How much data backs each claim
  • Time periods: When the data is from
  • Definitions: How key metrics are calculated
  • Confidence: How certain you are (confidence intervals, p-values where appropriate)
  • Limitations: What the data can't tell you

Common Pitfalls

  • Survivorship bias: Only analyzing data from things that succeeded/persisted
  • Simpson's paradox: A trend that appears in groups but reverses when combined
  • Confounding variables: Correlation doesn't mean causation — look for third variables
  • Small sample sizes: Don't draw strong conclusions from small N. State the limitation.
  • Cherry-picking timeframes: Make sure the time period is representative
  • Precision theater: Reporting "23.847%" when the data only supports "roughly 24%"
  • Missing baseline: A number without context is meaningless ("10,000 errors" — is that a lot?)

Handling Common Requests

"Just give me a quick look"

Provide: shape, column summary, missing data count, top-level descriptive stats, one key insight. Keep it to one screen of output.

"What's interesting in this data?"

Run the full exploration workflow (Step 2) and surface the 3-5 most notable patterns: unexpected distributions, strong correlations, outliers, trends.

"Is this statistically significant?"

Clarify what comparison they mean, choose the right test, report effect size and p-value, and explain what it means in practical terms.

"Create a dashboard / visualization"

Ask what decisions the dashboard supports. Design 3-5 charts that answer those questions. Use consistent styling and clear titles that state findings.