Upsonic data-analysis
Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design.
git clone https://github.com/Upsonic/Upsonic
T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/src/upsonic/skills/builtins/data-analysis" ~/.claude/skills/upsonic-upsonic-data-analysis && rm -rf "$T"
src/upsonic/skills/builtins/data-analysis/SKILL.mdData Analysis
Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.
Before You Analyze
Understand the Question
Before touching the data, clarify:
- What question are we answering? ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
- Who needs the answer? (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
- What decisions will this inform? (This determines how precise you need to be and what format the answer should take.)
- What's the timeline? (A quick sanity check and a thorough statistical analysis require different approaches.)
If the user says "analyze this data" without a specific question, help them formulate one:
- "What would be most useful to know from this data?"
- "Are you looking for trends over time, comparisons between groups, or something else?"
- "Is there a specific business question this should answer?"
Reference Materials and Scripts
- Execute
with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run withprofile_data.py
for usage.--help - Load
when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.statistical-tests-guide.md
Understand the Data
Before analysis, get your bearings:
- Source and context: Where did this data come from? How was it collected? What time period does it cover?
- Schema: What are the columns/fields? What do they represent? What are the data types?
- Scale: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
- Known issues: Is the data known to be incomplete, biased, or have quality problems?
# First look at any dataset import pandas as pd df = pd.read_csv("data.csv") # or read_excel, read_json, etc. print(f"Shape: {df.shape}") print(f"\nColumn types:\n{df.dtypes}") print(f"\nFirst rows:\n{df.head()}") print(f"\nMissing values:\n{df.isnull().sum()}") print(f"\nBasic stats:\n{df.describe()}")
Analysis Workflow
Step 1: Clean and Validate
Data quality determines analysis quality. Don't skip this.
Handle Missing Values
- Count them first: What percentage of each column is missing?
- Understand why: Are they random? Systematic? (e.g., optional fields vs data collection failures)
- Choose a strategy and document it:
- Drop rows: When missing data is rare and random (less than 5%)
- Impute with median/mode: When missing data is moderate and the distribution is known
- Flag as separate category: When missingness itself is informative
- Leave as-is: When the analysis method handles nulls natively
# Document your decisions missing_pct = df.isnull().sum() / len(df) * 100 print("Missing data percentage per column:") print(missing_pct[missing_pct > 0].sort_values(ascending=False))
Handle Outliers
- Detect: Use IQR method, z-scores, or domain knowledge
- Investigate: Are they errors or legitimate extreme values?
- Document your decision: Keep, cap, or remove — and explain why
# IQR method for outlier detection Q1 = df['value'].quantile(0.25) Q3 = df['value'].quantile(0.75) IQR = Q3 - Q1 outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)] print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
Validate Data Types and Ranges
- Dates should be dates, numbers should be numbers
- Check for impossible values (negative ages, future dates, percentages over 100)
- Verify categorical values are consistent (watch for "USA", "US", "United States")
Step 2: Explore
Start broad, then focus on what's interesting.
Descriptive Statistics
Always start here — understand the basics before going deeper.
# Numerical columns print(df.describe()) # Categorical columns for col in df.select_dtypes(include='object').columns: print(f"\n{col}: {df[col].nunique()} unique values") print(df[col].value_counts().head(10))
Distributions
Understanding shape matters for choosing the right tests.
import matplotlib.pyplot as plt # Distribution of key metrics fig, axes = plt.subplots(1, 3, figsize=(15, 5)) for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']): df[col].hist(ax=axes[i], bins=30) axes[i].set_title(col) axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median') axes[i].legend() plt.tight_layout() plt.savefig("distributions.png")
Correlations and Relationships
Look for patterns between variables.
# Correlation matrix for numerical columns corr = df.select_dtypes(include='number').corr() print("Strong correlations (|r| > 0.5):") for i in range(len(corr.columns)): for j in range(i+1, len(corr.columns)): if abs(corr.iloc[i, j]) > 0.5: print(f" {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")
Trends Over Time
If the data has a time dimension, always look at trends.
# Time series analysis df['date'] = pd.to_datetime(df['date']) daily = df.groupby('date')['metric'].agg(['mean', 'count']) daily['mean'].plot(figsize=(12, 4), title='Daily Average') plt.savefig("trend.png")
Step 3: Analyze
Choose the right method for the question.
Comparison Questions ("Is A different from B?")
Use the right statistical test:
- Two groups, continuous outcome: t-test (if normal) or Mann-Whitney U (if not)
- Multiple groups: ANOVA (if normal) or Kruskal-Wallis (if not)
- Two groups, categorical outcome: Chi-squared test
- Before/after with same subjects: Paired t-test or Wilcoxon signed-rank
Always report:
- Sample sizes for each group
- Effect size (not just p-value) — Cohen's d, odds ratio, or percentage difference
- Confidence intervals
- Practical significance, not just statistical significance
from scipy import stats # Example: comparing two groups group_a = df[df['variant'] == 'A']['metric'] group_b = df[df['variant'] == 'B']['metric'] # Check normality first _, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a)))) _, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b)))) if p_normal_a > 0.05 and p_normal_b > 0.05: stat, p_value = stats.ttest_ind(group_a, group_b) test_name = "t-test" else: stat, p_value = stats.mannwhitneyu(group_a, group_b) test_name = "Mann-Whitney U" # Effect size (Cohen's d) pooled_std = ((group_a.std()**2 + group_b.std()**2) / 2) ** 0.5 cohens_d = (group_a.mean() - group_b.mean()) / pooled_std print(f"Test: {test_name}") print(f"Group A: mean={group_a.mean():.3f}, n={len(group_a)}") print(f"Group B: mean={group_b.mean():.3f}, n={len(group_b)}") print(f"Difference: {group_a.mean() - group_b.mean():.3f} ({(group_a.mean() - group_b.mean())/group_b.mean()*100:.1f}%)") print(f"Cohen's d: {cohens_d:.3f}") print(f"p-value: {p_value:.4f}")
Trend Questions ("Is this changing over time?")
- Use rolling averages to smooth noise
- Look for seasonality (day-of-week, monthly, quarterly patterns)
- Segment by key dimensions — overall trends can mask group-level differences
- Compare period-over-period (WoW, MoM, YoY)
Composition Questions ("What's the breakdown?")
- Use percentages and proportions
- Show both absolute numbers and percentages
- Look for the Pareto principle (80/20 rule)
- Segment into meaningful groups
A/B Test Analysis
For experiment analysis, follow this checklist:
- Sample size: Was the test adequately powered?
- Duration: Did it run long enough for weekly patterns?
- Randomization: Were groups balanced on key dimensions?
- Metric definition: Is the metric clearly defined and correctly computed?
- Statistical test: Use the appropriate test for the metric type
- Multiple comparisons: Correct for multiple testing if checking many metrics
- Practical significance: Is the effect large enough to matter?
Step 4: Communicate Findings
Structure matters as much as the analysis itself.
Lead With the Answer
Start with what the user asked. Then support it.
Pattern:
[Answer to the question in 1-2 sentences] [Key supporting evidence: 2-4 bullet points with specific numbers] [Methodology note: How you arrived at this, in 1-2 sentences] [Caveats and limitations: What could affect this conclusion] [Recommended next steps: What to do with this information]
Visualization Guidelines
Choose the right chart type:
- Trend over time: Line chart
- Comparison between categories: Bar chart (horizontal for many categories)
- Distribution: Histogram or box plot
- Relationship between two variables: Scatter plot
- Part-of-whole: Stacked bar or pie chart (only for 2-5 categories)
- Multiple dimensions: Heatmap or small multiples
For every chart:
- Title that states the finding, not just the data ("Conversion drops 40% on weekends" not "Conversion by day")
- Label axes with units
- Include sample size or time period
- Use consistent colors across related charts
What to Always Include
- Sample sizes: How much data backs each claim
- Time periods: When the data is from
- Definitions: How key metrics are calculated
- Confidence: How certain you are (confidence intervals, p-values where appropriate)
- Limitations: What the data can't tell you
Common Pitfalls
- Survivorship bias: Only analyzing data from things that succeeded/persisted
- Simpson's paradox: A trend that appears in groups but reverses when combined
- Confounding variables: Correlation doesn't mean causation — look for third variables
- Small sample sizes: Don't draw strong conclusions from small N. State the limitation.
- Cherry-picking timeframes: Make sure the time period is representative
- Precision theater: Reporting "23.847%" when the data only supports "roughly 24%"
- Missing baseline: A number without context is meaningless ("10,000 errors" — is that a lot?)
Handling Common Requests
"Just give me a quick look"
Provide: shape, column summary, missing data count, top-level descriptive stats, one key insight. Keep it to one screen of output.
"What's interesting in this data?"
Run the full exploration workflow (Step 2) and surface the 3-5 most notable patterns: unexpected distributions, strong correlations, outliers, trends.
"Is this statistically significant?"
Clarify what comparison they mean, choose the right test, report effect size and p-value, and explain what it means in practical terms.
"Create a dashboard / visualization"
Ask what decisions the dashboard supports. Design 3-5 charts that answer those questions. Use consistent styling and clear titles that state findings.