Claude-code-skills-social-science descriptive-statistics
Summary statistics, data exploration, and descriptive analysis. Use for: summary stats, describe data, data exploration, distributions, means, medians, correlations, cross-tabs.
git clone https://github.com/sshtomar/claude-code-skills-social-science
T=$(mktemp -d) && git clone --depth=1 https://github.com/sshtomar/claude-code-skills-social-science "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/descriptive-statistics" ~/.claude/skills/sshtomar-claude-code-skills-social-science-descriptive-statistics && rm -rf "$T"
skills/descriptive-statistics/SKILL.md
<skill_content>
<overview> Descriptive statistics are the foundation of all statistical analysis. They reveal data structure, quality issues, and patterns that inform subsequent modeling choices. ALWAYS start with descriptive statistics before any inferential or causal analysis. This skill ensures you understand your data thoroughly before making statistical claims. </overview>
<mandatory_requirements>
<requirement priority="critical"> <name>Complete Summary Statistics</name> <description>Generate comprehensive summary statistics for all variables</description> <rationale>You cannot choose appropriate methods without understanding distributions, missingness, and ranges (Tukey, 1977: "Exploratory Data Analysis")</rationale> <consequence>Using wrong methods (e.g., linear regression on bounded outcomes, t-tests on skewed distributions)</consequence> </requirement> <requirement priority="critical"> <name>Missing Data Analysis</name> <description>Report missingness patterns for all variables used in analysis. NEVER fabricate or impute values to "fill in" missing data without explicit justification and documentation. Report actual missing data patterns, not synthetic replacements</description> <rationale>Missing data mechanisms (MCAR/MAR/MNAR) affect validity of all subsequent analyses (Little & Rubin, 2002). Fabricating missing values masks true data limitations and can introduce undetectable bias. See core-methodology skill for data authenticity requirements</rationale> <consequence>Biased estimates if missingness is informative. Fabricated values presented as real data invalidate findings</consequence> </requirement> <requirement priority="critical"> <name>Distribution Visualization</name> <description>Plot distributions for all key variables</description> <rationale>Summary statistics can hide bimodality, outliers, and skewness (Anscombe's Quartet demonstrates this)</rationale> <consequence>Inappropriate method selection and violated assumptions</consequence> </requirement> <requirement priority="high"> <name>Sample Size Reporting</name> <description>Report N for overall sample and by groups</description> <rationale>Statistical power and precision depend on sample size</rationale> <consequence>Underpowered analyses or false precision claims</consequence> </requirement></mandatory_requirements>
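The sample-size requirement above can be sketched as a small helper that reports N overall and by group. This is a minimal sketch; `report_group_sizes` and the `region` column are hypothetical names, not part of this skill's API:

```python
import pandas as pd

def report_group_sizes(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Report N by group (including NaN as its own group) and each group's
    share of the overall sample."""
    counts = df[group_col].value_counts(dropna=False)
    out = pd.DataFrame({
        "N": counts,
        "Percent": (counts / len(df) * 100).round(1),
    })
    print(f"Overall N: {len(df):,}")
    return out

# Hypothetical usage:
df = pd.DataFrame({"region": ["North", "South", "North", "North", None]})
print(report_group_sizes(df, "region"))
```

Reporting the NaN group explicitly (via `dropna=False`) keeps missing group membership visible rather than silently dropping those rows.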
<thinking_process> When implementing descriptive statistics:
- Check data types (numeric, categorical, dates)
- Inspect multi-select question formats (binary columns vs concatenated strings)
- Generate appropriate summaries for each type
- Identify missingness patterns
- Visualize distributions to detect issues
- Check for data quality problems
- Report everything clearly </thinking_process>
<implementation_pattern>
<code_template>
@app.cell
def comprehensive_descriptive_stats(df):
    # Following Tukey's EDA principles, we examine the data from multiple
    # angles before any modeling. This reveals quality issues, appropriate
    # methods, and potential problems that could invalidate subsequent analyses.
    import os

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from scipy import stats

    # MANDATORY: Dataset overview
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Observations: {len(df):,}")
    print(f"Variables: {len(df.columns)}")
    print("\nVariable Types:")
    print(df.dtypes.value_counts())

    # MANDATORY: Identify numeric and categorical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

    # MANDATORY: Summary statistics for numeric variables
    summary = None
    if numeric_cols:
        print("\n" + "=" * 60)
        print("NUMERIC VARIABLES SUMMARY")
        print("=" * 60)
        summary = df[numeric_cols].describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])
        print(summary.round(3))

        # MANDATORY: Check for outliers using IQR method
        print("\n" + "=" * 60)
        print("OUTLIER DETECTION (IQR Method)")
        print("=" * 60)
        for col in numeric_cols[:10]:  # Limit to first 10 for brevity
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)][col]
            if len(outliers) > 0:
                print(f"{col}: {len(outliers)} outliers ({len(outliers) / len(df) * 100:.1f}%)")

    # MANDATORY: Missing data analysis
    print("\n" + "=" * 60)
    print("MISSING DATA ANALYSIS")
    print("=" * 60)
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing_Count': missing,
        'Missing_Percentage': missing_pct,
    })
    missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values(
        'Missing_Count', ascending=False)
    if len(missing_df) > 0:
        print(missing_df.round(2))
        n_any_missing = df.isnull().any(axis=1).sum()
        print(f"\nTotal rows with any missing: {n_any_missing} "
              f"({n_any_missing / len(df) * 100:.1f}%)")
    else:
        print("No missing values detected")

    # MANDATORY: Categorical variables summary
    if categorical_cols:
        print("\n" + "=" * 60)
        print("CATEGORICAL VARIABLES SUMMARY")
        print("=" * 60)
        for col in categorical_cols[:10]:  # Limit to first 10
            n_unique = df[col].nunique()
            print(f"\n{col}: {n_unique} unique values")
            if n_unique <= 20:
                print(df[col].value_counts().head(10))

    # MANDATORY: Distribution plots for numeric variables
    fig = None
    if numeric_cols:
        n_plots = min(len(numeric_cols), 9)  # Max 9 subplots
        n_rows = (n_plots + 2) // 3
        n_cols_plot = min(3, n_plots)
        fig, axes = plt.subplots(n_rows, n_cols_plot, figsize=(15, n_rows * 4))
        axes = np.atleast_1d(axes).flatten()
        for i, col in enumerate(numeric_cols[:9]):
            data = df[col].dropna()
            # Histogram annotated with distribution shape
            axes[i].hist(data, bins=30, density=True, alpha=0.7, edgecolor='black')
            axes[i].set_title(f'{col}\nSkew: {data.skew():.2f}, Kurt: {data.kurtosis():.2f}')
            axes[i].set_xlabel(col)
            axes[i].set_ylabel('Density')
            # Overlay KDE if enough data
            if len(data) > 10:
                kde = stats.gaussian_kde(data)
                x_range = np.linspace(data.min(), data.max(), 100)
                axes[i].plot(x_range, kde(x_range), 'r-', linewidth=2)
        # Hide unused subplots
        for j in range(n_plots, len(axes)):
            axes[j].set_visible(False)
        plt.tight_layout()
        os.makedirs("./images", exist_ok=True)
        plt.savefig("./images/distributions_overview.png", dpi=144, bbox_inches="tight")
        print("\nSUCCESS: Distribution plots saved to ./images/distributions_overview.png")

    # Return summary and figure for inline display (None if no numeric columns)
    return summary, fig,
</code_template>
</implementation_pattern>
<examples> <example context="clean_data" difficulty="basic"> <description>Basic descriptive statistics for clean survey data</description> <code> ```python
@app.cell
def describe_survey_data(df):
    # Survey data requires checking for response patterns, missing data from
    # non-response, and scale distributions to identify potential response
    # biases or data entry errors.
    import numpy as np
    import pandas as pd

    # Check response scales
    likert_cols = [col for col in df.columns
                   if 'rating' in col.lower() or 'scale' in col.lower()]
    if likert_cols:
        print("Likert Scale Variables:")
        for col in likert_cols:
            print(f"\n{col}:")
            print(f"  Range: {df[col].min()} - {df[col].max()}")
            print(f"  Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
            print(f"  % at ceiling: {(df[col] == df[col].max()).mean() * 100:.1f}%")
            print(f"  % at floor: {(df[col] == df[col].min()).mean() * 100:.1f}%")

    # Check for response patterns (straight-lining)
    if len(likert_cols) > 3:
        straight_line = (df[likert_cols].std(axis=1) == 0).sum()
        print(f"\nStraight-lining detected: {straight_line} respondents "
              f"({straight_line / len(df) * 100:.1f}%)")
    return None,
```
</code> <output_interpretation> Look for: - Ceiling/floor effects (>15% at extremes suggests scale problems) - Straight-lining (>5% suggests inattentive respondents) - Unexpected ranges (values outside scale bounds indicate data errors) </output_interpretation> </example> <example context="multiselect_questions" difficulty="basic"> <description>Handling multi-select survey questions in different formats</description> <code> ```python
@app.cell
def inspect_multiselect_format(df):
    """Multi-select questions appear in two common formats depending on the
    survey platform. ALWAYS inspect before analyzing to avoid counting errors.

    Format 1 (Binary/Dummy): ODK, SurveyCTO, Qualtrics, REDCap
      - One column per option with values {0, 1, NaN}
      - Example: 'activities/Business', 'activities/Farming'

    Format 2 (Concatenated): Google Forms, Excel, TypeForm
      - One column with space/comma-separated text
      - Example: 'activities' = "Business Farming CHE"
    """
    import pandas as pd

    # STEP 1: Identify potential multi-select columns by name patterns
    potential_multiselect = [col for col in df.columns
                             if 'select all' in col.lower()
                             or col.count('/') > 0  # Binary format indicator
                             or 'activities' in col.lower()]

    print("=" * 70)
    print("MULTI-SELECT QUESTION FORMAT INSPECTION")
    print("=" * 70)

    # STEP 2: For each potential column, check the format
    for col in potential_multiselect[:5]:  # Check first 5
        print(f"\nColumn: {col}")
        print(f"Data type: {df[col].dtype}")

        # Check unique values
        unique_vals = df[col].dropna().unique()
        n_unique = len(unique_vals)
        print(f"Unique values: {n_unique}")

        # DIAGNOSTIC: Is this binary format?
        if df[col].dtype in ['float64', 'int64'] and set(unique_vals).issubset({0, 1}):
            print("→ FORMAT: Binary/Dummy variable (count where value = 1)")
            missing = df[col].isna().sum()
            zeros = (df[col] == 0).sum()
            ones = (df[col] == 1).sum()
            print(f"  Missing: {missing} ({missing / len(df) * 100:.1f}%)")
            print(f"  0 (not selected): {zeros}")
            print(f"  1 (selected): {ones}")
        # DIAGNOSTIC: Is this concatenated format?
        elif df[col].dtype == 'object':
            print("→ FORMAT: Concatenated string (parse with .str.contains)")
            missing = df[col].isna().sum()
            print(f"  Missing: {missing} ({missing / len(df) * 100:.1f}%)")
            print("  Sample values:")
            for val in df[col].dropna().head(3):
                val_str = str(val)[:70] + "..." if len(str(val)) > 70 else str(val)
                print(f"    - {val_str}")
        else:
            print("→ UNKNOWN FORMAT - Manual inspection needed")
    return None,


@app.cell
def analyze_multiselect_binary(df):
    """Count responses for binary/dummy format multi-select questions.
    Common in ODK/SurveyCTO exports.
    """
    import pandas as pd

    # Identify all binary columns for a multi-select question
    question_prefix = "What are your main income-generating activities? (select all that apply)/"
    activity_cols = [col for col in df.columns if col.startswith(question_prefix)]

    print("=" * 70)
    print("BINARY FORMAT MULTI-SELECT ANALYSIS")
    print("=" * 70)
    print(f"Total respondents: {len(df)}")
    print()

    results = []
    for col in activity_cols:
        activity_name = col.replace(question_prefix, "")
        # CRITICAL: Count only where value = 1 (NOT .notna(), which counts 0s too!)
        count = (df[col] == 1).sum()
        percentage = (count / len(df)) * 100
        results.append({'Activity': activity_name, 'N': count, 'Percentage': percentage})

    results_df = pd.DataFrame(results).sort_values('N', ascending=False)
    print(results_df.to_string(index=False))

    # Calculate average selections per person
    total_selections = sum(r['N'] for r in results)
    avg_per_person = total_selections / len(df)
    print(f"\nAverage selections per person: {avg_per_person:.2f}")
    return results_df,
@app.cell
def analyze_multiselect_concatenated(df):
    """Parse concatenated string format multi-select questions.
    Common in Google Forms/Excel exports.
    """
    import pandas as pd

    col = 'main_income_generating'  # Adjust column name
    assert col in df.columns, f"Column {col} not found"

    # Define possible options to search for
    possible_activities = [
        'Community Health Entrepreneur (CHE)',
        'Community Health Promoter (CHP)',
        'Business',
        'Farming',
        'Casual work',
        'Teaching',
        'Tailoring',
        'Other',
    ]

    print("=" * 70)
    print("CONCATENATED STRING MULTI-SELECT ANALYSIS")
    print("=" * 70)
    print(f"Total respondents: {len(df)}")
    print(f"Missing: {df[col].isna().sum()}")
    print()

    results = []
    for activity in possible_activities:
        # CRITICAL: Use regex=False for literal matching (parentheses are special in regex!)
        count = df[col].str.contains(activity, na=False, regex=False).sum()
        percentage = (count / len(df)) * 100
        if count > 0:  # Only include if found
            results.append({'Activity': activity, 'N': count, 'Percentage': percentage})

    results_df = pd.DataFrame(results).sort_values('N', ascending=False)
    print(results_df.to_string(index=False))

    # Calculate average selections per person
    total_selections = sum(r['N'] for r in results)
    avg_per_person = total_selections / len(df)
    print(f"\nAverage selections per person: {avg_per_person:.2f}")
    return results_df,
```
</code> <output_interpretation>
**Binary format**:
- Value counts show {0, 1} → use `(df[col] == 1).sum()` to count selections
- Using `.notna().sum()` counts BOTH 0s and 1s → wrong totals (will show 100% for all options)
- Missing values (NaN) represent true non-response, not "not selected"

**Concatenated format**:
- Sample values show multiple activities in one string
- MUST use `regex=False` in `.str.contains()` if options have special characters like parentheses
- Delimiter varies (space, comma, semicolon); inspect sample values first
- A person can select multiple activities, so counts will sum to more than 100%

**Common error**: Using `.notna()` on binary format counts all non-missing values as "selected".
**Fix**: Always use `== 1` for binary columns and `.str.contains(..., regex=False)` for strings.
</output_interpretation>
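When the full option list is not known in advance, an alternative to `.str.contains()` is to split the concatenated column on its delimiter and tally with `explode()`. A minimal sketch, assuming a comma-separated column; `count_multiselect` and the `activities` column are hypothetical names:

```python
import pandas as pd

def count_multiselect(df: pd.DataFrame, col: str, sep: str = ",") -> pd.DataFrame:
    """Split a concatenated multi-select column and tally each option.
    Caveat: splitting fails if option labels themselves contain the
    delimiter, so inspect sample values first."""
    exploded = (
        df[col].dropna()
        .str.split(sep)
        .explode()
        .str.strip()
    )
    counts = exploded.value_counts()
    return pd.DataFrame({
        "N": counts,
        # Denominator is all respondents, so percentages can sum to > 100%
        "Percent_of_respondents": (counts / len(df) * 100).round(1),
    })

# Hypothetical usage:
df = pd.DataFrame({"activities": ["Business, Farming", "Farming", None]})
print(count_multiselect(df, "activities"))
```

Unlike the `.str.contains()` approach, this surfaces unexpected options (typos, "Other: …" free text) instead of silently skipping them.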
<best_practice>
ALWAYS run format inspection first:
- Check data type (float/int = binary, object = string)
- Check unique values with `value_counts(dropna=False)`
- View sample rows to confirm structure
- Choose the appropriate counting method
Survey platforms export differently:
- Binary: ODK, SurveyCTO, Qualtrics, REDCap, LimeSurvey
- Concatenated: Google Forms, TypeForm, some Excel/manual entry
Document which format you found and which method you used! </best_practice> </example>
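The inspect-then-count workflow in the best practice above can be collapsed into a small dispatcher. This is a sketch, not part of the skill's API; `detect_and_count` is a hypothetical helper:

```python
import pandas as pd

def detect_and_count(df: pd.DataFrame, col: str):
    """Detect a multi-select column's format and count selections accordingly.
    Returns (format_label, selection_count_or_None)."""
    non_null = df[col].dropna()
    if pd.api.types.is_numeric_dtype(df[col]) and set(non_null.unique()).issubset({0, 1}):
        # Binary/dummy format: count only where value == 1
        return "binary", int((df[col] == 1).sum())
    if df[col].dtype == object:
        # Concatenated format: caller must still supply option labels for
        # .str.contains(..., regex=False); here we only report the format
        return "concatenated", None
    return "unknown", None

# Hypothetical usage:
df = pd.DataFrame({"activities/Farming": [1, 0, 1, None],
                   "activities": ["Farming Business", "Farming", None, "Other"]})
print(detect_and_count(df, "activities/Farming"))
print(detect_and_count(df, "activities"))
```

Dispatching on detected format makes the "wrong counting method" mistake harder to commit, since the binary path never touches `.notna()`.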
<example context="skewed_data" difficulty="intermediate"> <description>Handling highly skewed income data</description> <code> ```python
@app.cell
def describe_skewed_income(df):
    # Income data is typically right-skewed with outliers. We report both
    # standard statistics and robust alternatives, and consider log
    # transformation for modeling.
    import os

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from scipy import stats

    income_col = 'income'  # Adjust as needed
    assert income_col in df.columns, f"Column {income_col} not found"
    income = df[income_col].dropna()

    # Report both parametric and robust statistics
    print("Income Distribution Analysis")
    print("=" * 50)
    print(f"N: {len(income):,}")
    print("\nCentral Tendency:")
    print(f"  Mean:   ${income.mean():,.2f}")
    print(f"  Median: ${income.median():,.2f}")
    print(f"  Trimmed Mean (5%): ${stats.trim_mean(income, 0.05):,.2f}")
    print("\nDispersion:")
    print(f"  Std Dev: ${income.std():,.2f}")
    print(f"  IQR: ${income.quantile(0.75) - income.quantile(0.25):,.2f}")
    print(f"  MAD: ${stats.median_abs_deviation(income):,.2f}")
    print("\nDistribution Shape:")
    print(f"  Skewness: {income.skew():.2f}")
    print(f"  Kurtosis: {income.kurtosis():.2f}")

    # Suggest transformation if highly skewed
    if abs(income.skew()) > 2:
        print("\nHigh skewness detected. Consider log transformation.")
        log_income = np.log1p(income)  # log(1+x) to handle zeros
        print(f"  Log-transformed skewness: {log_income.skew():.2f}")

    # Percentiles for inequality measures
    print("\nPercentiles:")
    for p in [10, 25, 50, 75, 90, 95, 99]:
        print(f"  P{p}: ${income.quantile(p / 100):,.2f}")
    print("\nInequality Measures:")
    print(f"  P90/P10 ratio: {income.quantile(0.9) / income.quantile(0.1):.2f}")
    print(f"  P90/P50 ratio: {income.quantile(0.9) / income.quantile(0.5):.2f}")

    # Visualize with both linear and log scales
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Original scale histogram
    axes[0].hist(income, bins=50, edgecolor='black', alpha=0.7)
    axes[0].set_xlabel('Income ($)')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Income Distribution (Linear Scale)')

    # Log scale histogram
    axes[1].hist(income, bins=50, edgecolor='black', alpha=0.7)
    axes[1].set_xlabel('Income ($)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Income Distribution (Log Scale)')
    axes[1].set_xscale('log')

    # Box plot to show outliers
    axes[2].boxplot(income, vert=True)
    axes[2].set_ylabel('Income ($)')
    axes[2].set_title('Income Box Plot')
    axes[2].set_yscale('log')

    plt.tight_layout()
    os.makedirs("./images", exist_ok=True)
    plt.savefig("./images/income_distribution.png", dpi=144, bbox_inches="tight")

    # Return figure for inline display
    return fig,
```
</code> <lesson> For skewed data: 1. Always report both mean and median 2. Consider robust statistics (trimmed mean, MAD) 3. Visualize on both linear and log scales 4. Document transformation decisions </lesson> </example>
<example context="panel_data" difficulty="advanced"> <description>Descriptive statistics for panel/longitudinal data</description> <code> ```python
@app.cell
def describe_panel_data(df):
    # Panel data requires checking both cross-sectional and time-series
    # properties. We examine balance, attrition, and within vs. between
    # variation.
    import numpy as np
    import pandas as pd

    # Identify panel structure columns
    unit_col = 'unit_id'  # Adjust as needed
    time_col = 'year'     # Adjust as needed
    assert {unit_col, time_col}.issubset(df.columns), "Panel columns not found"

    print("PANEL DATA STRUCTURE")
    print("=" * 60)

    # Basic panel properties
    n_units = df[unit_col].nunique()
    n_periods = df[time_col].nunique()
    n_obs = len(df)
    print(f"Units: {n_units:,}")
    print(f"Time periods: {n_periods}")
    print(f"Total observations: {n_obs:,}")
    print(f"Theoretical balanced panel size: {n_units * n_periods:,}")

    # Check balance
    obs_per_unit = df.groupby(unit_col).size()
    is_balanced = (obs_per_unit == n_periods).all()
    print(f"\nPanel balance: {'Balanced' if is_balanced else 'Unbalanced'}")
    if not is_balanced:
        n_complete = (obs_per_unit == n_periods).sum()
        print(f"  Units with complete data: {n_complete} ({n_complete / n_units * 100:.1f}%)")
        print(f"  Mean observations per unit: {obs_per_unit.mean():.1f}")
        print(f"  Min observations: {obs_per_unit.min()}")
        print(f"  Max observations: {obs_per_unit.max()}")

    # Attrition analysis
    first_period = df[time_col].min()
    last_period = df[time_col].max()
    units_first = set(df[df[time_col] == first_period][unit_col])
    units_last = set(df[df[time_col] == last_period][unit_col])
    attrition = len(units_first - units_last)
    entry = len(units_last - units_first)
    print("\nPanel dynamics:")
    print(f"  Units at start: {len(units_first):,}")
    print(f"  Units at end: {len(units_last):,}")
    print(f"  Attrition: {attrition} units ({attrition / len(units_first) * 100:.1f}%)")
    print(f"  Late entry: {entry} units")

    # Within vs. between variation for numeric variables
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    numeric_cols = [c for c in numeric_cols if c not in [unit_col, time_col]]
    if numeric_cols:
        print("\nWITHIN VS. BETWEEN VARIATION")
        print("=" * 60)
        for col in numeric_cols[:5]:  # First 5 numeric variables
            # Overall variation
            overall_std = df[col].std()
            # Between variation (across units)
            unit_means = df.groupby(unit_col)[col].mean()
            between_std = unit_means.std()
            # Within variation (within units over time)
            df_demeaned = df[col] - df.groupby(unit_col)[col].transform('mean')
            within_std = df_demeaned.std()
            print(f"\n{col}:")
            print(f"  Overall SD: {overall_std:.3f}")
            print(f"  Between SD: {between_std:.3f} ({between_std / overall_std * 100:.1f}% of total)")
            print(f"  Within SD:  {within_std:.3f} ({within_std / overall_std * 100:.1f}% of total)")
            # Check if variation is mostly cross-sectional or time-series
            if between_std > within_std * 2:
                print("  → Primarily cross-sectional variation")
            elif within_std > between_std * 2:
                print("  → Primarily time-series variation")
            else:
                print("  → Substantial both between and within variation")
    return obs_per_unit,
```
</code> <best_practice> For panel data: 1. Always check balance and attrition patterns 2. Decompose variation (within/between) to inform model choice 3. Consider if fixed effects will absorb key variables 4. Document entry/exit patterns for external validity </best_practice> </example> </examples>
<common_mistakes>
<mistake severity="critical"> <what>Using .notna() or .count() to tally binary multi-select questions</what> <consequence>Counts both 0s and 1s as "selected" → inflated percentages (often 100% for all options)</consequence> <prevention>ALWAYS use (df[col] == 1).sum() for binary format. Inspect format first with value_counts(dropna=False)</prevention> <real_world_example>Builder activities showing 100% participation in all activities because code counted zeros as selections</real_world_example> </mistake> <mistake severity="high"> <what>Using regex=True (default) on .str.contains() with special characters</what> <consequence>Parentheses, brackets, dots treated as regex patterns → missing matches</consequence> <prevention>Use regex=False for literal string matching in concatenated multi-select fields</prevention> <real_world_example>CHE and CHP activities not detected because parentheses in "Community Health Entrepreneur (CHE)" treated as regex grouping</real_world_example> </mistake> <mistake severity="high"> <what>Computing means without checking distributions</what> <consequence>Mean is misleading for skewed data (e.g., income)</consequence> <prevention>Always visualize distributions and report median alongside mean</prevention> </mistake> <mistake severity="high"> <what>Ignoring missing data patterns</what> <consequence>Biased estimates if missingness is not random</consequence> <prevention>Test for patterns in missingness (e.g., missing income correlated with age)</prevention> </mistake> <mistake severity="medium"> <what>Not checking data types before analysis</what> <consequence>Treating categorical as numeric or vice versa</consequence> <prevention>Explicitly check dtypes and convert as needed</prevention> </mistake> <mistake severity="medium"> <what>Using sample statistics on population data</what> <consequence>Incorrect uncertainty quantification</consequence> <prevention>Distinguish between sample and population analyses</prevention> 
</mistake></common_mistakes>
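The "ignoring missing data patterns" mistake above has a simple diagnostic: compare observed covariates between rows where a variable is missing and rows where it is observed. A significant difference is evidence against MCAR (it cannot distinguish MAR from MNAR). A minimal sketch; `missingness_diagnostic` and the `income`/`age` columns are hypothetical:

```python
import pandas as pd
from scipy import stats

def missingness_diagnostic(df: pd.DataFrame, target_col: str, covariate: str):
    """Welch t-test comparing a covariate across missing vs. observed rows
    of target_col. Significant difference suggests missingness is not MCAR."""
    miss = df[target_col].isna()
    a = df.loc[miss, covariate].dropna()
    b = df.loc[~miss, covariate].dropna()
    t, p = stats.ttest_ind(a, b, equal_var=False)
    print(f"{covariate} when {target_col} missing: mean {a.mean():.2f} (n={len(a)}) "
          f"vs. observed: mean {b.mean():.2f} (n={len(b)}), p={p:.3f}")
    return t, p

# Hypothetical usage: income missing mostly among older respondents
df = pd.DataFrame({
    "age": list(range(60, 70)) + list(range(20, 30)),
    "income": [None] * 10 + [100.0] * 10,
})
missingness_diagnostic(df, "income", "age")
```

Run this for each covariate of interest; a battery of such comparisons is a rough substitute for a formal MCAR test when one is unavailable.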
<interpretation_guide>
<interpreting_results>
- Mean vs. Median difference > 20% → Consider skewness
- Kurtosis > 3 → Heavy tails, expect outliers (note: pandas `.kurtosis()` reports excess kurtosis, where normal = 0)
- Missing > 10% → Need missing data strategy
- Between-SD > Within-SD → Fixed effects may not be appropriate </interpreting_results>
<red_flags>
- Skewness > 2 or < -2 → Distribution far from normal
- Multiple modes in histogram → Distinct subpopulations
- Straight-lining in surveys → Data quality issues
- Sudden changes in panel balance → Selection bias risk </red_flags>
<next_steps>
- All checks pass → Proceed to inferential analysis
- Skewed outcomes → Consider transformations or GLM
- Missing data → Implement appropriate handling strategy
- Outliers detected → Robustness checks needed </next_steps>
</interpretation_guide>
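The thresholds in the interpretation guide above can be bundled into a quick screening helper. The thresholds are this guide's heuristics, not hard rules, and `red_flags` is a hypothetical name:

```python
import pandas as pd

def red_flags(s: pd.Series) -> list[str]:
    """Screen a numeric series against the heuristic thresholds above."""
    s = s.dropna()
    flags = []
    mean, median = s.mean(), s.median()
    if median != 0 and abs(mean - median) / abs(median) > 0.20:
        flags.append("mean/median gap > 20%: consider skewness")
    if abs(s.skew()) > 2:
        flags.append("|skewness| > 2: distribution far from normal")
    if s.kurtosis() > 3:  # pandas reports excess kurtosis (normal = 0)
        flags.append("excess kurtosis > 3: heavy tails, expect outliers")
    return flags

# Hypothetical usage on a right-skewed series:
s = pd.Series([1] * 50 + [2] * 30 + [100] * 5)
print(red_flags(s))
```

An empty result does not prove a clean distribution (a flat summary can still hide bimodality, per Anscombe), so still plot the histograms.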
<references> <paper>Tukey, J.W. (1977). "Exploratory Data Analysis." Addison-Wesley. Foundation of EDA principles.</paper> <paper>Little, R.J.A. & Rubin, D.B. (2002). "Statistical Analysis with Missing Data." Wiley. Missing data mechanisms.</paper> <paper>Anscombe, F.J. (1973). "Graphs in Statistical Analysis." American Statistician. Why visualization matters.</paper> </references></skill_content>