Claude-awesome-stack data-explore

Profile and explore datasets -- schema inference, distributions, missing values, outliers, correlations. Use when starting work with a new dataset or investigating data quality issues.

install
source · Clone the upstream repo
git clone https://github.com/giacomogaglione/claude-awesome-stack
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/giacomogaglione/claude-awesome-stack "$T" && mkdir -p ~/.claude/skills && cp -r "$T/stacks/python-ml/skills/data-explore" ~/.claude/skills/giacomogaglione-claude-awesome-stack-data-explore && rm -rf "$T"
manifest: stacks/python-ml/skills/data-explore/SKILL.md
source content

Data Exploration Skill

When exploring a dataset, follow this structured approach. Adapt based on whether the data is tabular (CSV/DataFrame), image-based, or text.

1. Schema and Structure

First, understand what you're working with:

  • Load a sample (first 5 rows + last 5 rows)
  • Column names, dtypes, and count of non-null values
  • Dataset dimensions (rows x columns)
  • Memory usage
  • Identify the target variable if this is a supervised learning task

For tabular data:

df.info()                    # dtypes, non-null counts, memory usage
df.describe(include='all')   # summary stats for numeric and categorical columns
df.head()                    # first five rows
df.dtypes.value_counts()     # how many columns of each dtype

2. Missing Values

Map the missing data landscape:

  • Count and percentage of missing values per column
  • Pattern analysis: are values Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)?
  • Identify columns with >50% missing (candidates for dropping)
  • Check if missingness correlates with the target variable

import pandas as pd

# Report columns with missing values, sorted worst-first
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'count': missing, 'pct': missing_pct})
missing_report[missing_report['count'] > 0].sort_values('pct', ascending=False)
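
The last bullet above, checking whether missingness relates to the target, can be sketched by comparing the target mean within missing vs. non-missing groups; the `feature`/`target` names and values below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'feature': [1.0, None, 2.0, None, 3.0, None],
    'target':  [0,   1,    0,   1,    0,   1],
})

# Mean target where `feature` is missing vs. present; a large gap
# suggests the values are not Missing Completely At Random (MCAR)
by_missing = df['target'].groupby(df['feature'].isna()).mean()
```

In this toy frame the target mean is 1.0 when `feature` is missing and 0.0 when present, a strong signal that missingness is informative.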

3. Distribution Analysis

For each feature, characterize its distribution:

Numerical features:

  • Min, max, mean, median, std
  • Skewness and kurtosis
  • Identify if log-transform would help (right-skewed data)

Categorical features:

  • Cardinality (number of unique values)
  • Value counts for top-10 categories
  • Identify rare categories (<1% frequency)
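
The numerical and categorical checks above can be sketched in pandas; the two-column frame here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [10.0, 12.5, 11.0, 200.0, 9.5],  # right-skewed numeric column
    'city':  ['NY', 'NY', 'LA', 'SF', 'NY'],  # categorical column
})

# Numerical: central tendency, spread, and shape
num_summary = df['price'].agg(['min', 'max', 'mean', 'median', 'std'])
skew = df['price'].skew()       # > 0 suggests a log-transform may help
kurt = df['price'].kurtosis()

# Categorical: cardinality, top categories, rare levels (<1% frequency)
cardinality = df['city'].nunique()
top_counts = df['city'].value_counts().head(10)
freq = df['city'].value_counts(normalize=True)
rare = freq[freq < 0.01].index.tolist()
```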

4. Outlier Detection

Flag potential outliers:

  • IQR method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
  • Z-score method: values with |z| > 3
  • Domain-specific checks (e.g., negative ages, future dates)

# `col` is the name of the numeric column under inspection
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5 * IQR) | (df[col] > Q3 + 1.5 * IQR)]
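
The z-score method listed above can be sketched in plain pandas (no SciPy assumed); the series here is a toy example with one gross outlier:

```python
import pandas as pd

s = pd.Series([10] * 29 + [100])   # one gross outlier among 30 points
z = (s - s.mean()) / s.std()       # sample standard deviation (ddof=1)
z_outliers = s[z.abs() > 3]        # flags only the value 100
```

Note that with very small samples the sample z-score can never exceed roughly (n-1)/sqrt(n), so the |z| > 3 rule needs a few dozen points to be meaningful.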

5. Correlations and Relationships

Identify feature relationships:

  • Pearson correlation matrix for numerical features
  • Flag highly correlated pairs (|r| > 0.8) as candidates for feature selection
  • Check correlation with target variable
  • For categorical features, use chi-squared test or Cramér's V
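
The numerical part of the list above can be sketched as: build the correlation matrix, list each pair once via the upper triangle, and flag |r| > 0.8. The column names and random data are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'a': x,
    'b': x * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                       # independent noise
})

corr = df.corr(numeric_only=True)
# Keep the upper triangle only, so each pair appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_corr = pairs[pairs.abs() > 0.8]   # here: only the ('a', 'b') pair
```

Either member of each flagged pair is a candidate for removal during feature selection.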

6. Data Quality Score

Summarize findings with a quality assessment:

  • Completeness: % of non-null values across all cells
  • Uniqueness: % of columns with no duplicates (where expected)
  • Consistency: any type mismatches or encoding issues
  • Validity: % of values within expected ranges
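
A minimal sketch of the completeness, uniqueness, and validity checks; the toy frame and the 0–120 age range are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, None, -1],   # one missing value, one invalid value
    'id':  [1, 2, 3, 3],         # expected unique, but has a duplicate
})

# Completeness: share of non-null cells across the whole frame
completeness = df.notna().sum().sum() / df.size * 100

# Uniqueness: does a column expected to be a key actually have duplicates?
id_is_unique = df['id'].is_unique

# Validity: share of non-null values inside the expected range
valid_age = df['age'].between(0, 120).sum() / df['age'].notna().sum() * 100
```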

Present the final summary as a structured report.