
HarnessML: Exploratory Data Analysis

Install

Clone the upstream repo:

git clone https://github.com/msilverblatt/harness-ml

Or, for Claude Code, install the skill into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/msilverblatt/harness-ml "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/eda" ~/.claude/skills/msilverblatt-harness-ml-eda-c528bb && rm -rf "$T"

Manifest: skills/eda/SKILL.md

Source content

HarnessML: Exploratory Data Analysis

Use when you need to build intuition about the data. This isn't a checkbox — it's how you generate your first hypotheses and catch problems before they poison your models.


The Goal of EDA

EDA produces hypotheses, not charts. Every observation should lead to a question: "If X is true, then feature Y should be predictive because Z." Those hypotheses feed into domain-research and experiment-design.


Target Analysis

Start with the thing you're predicting.

data(action="inspect")

Ask:

  • What's the distribution? (Class balance for classification, skew for regression)
  • Are there temporal patterns? (Does the target rate change over time?)
  • Are there segments where the target behaves differently? (Target rate by category, region, time period)
  • Is the target noisy or clean? (If noisy, your metric ceiling may be lower than you think)

What this tells you:

  • Class imbalance → you may need stratified CV, class weights, or a metric that handles imbalance (F1 over accuracy)
  • Temporal drift → you need temporal CV (expanding window, not random k-fold), and drift-aware features
  • Segment differences → features that capture segment membership may be valuable
  • Noise → calibrate your expectations. Don't chase the last 0.001 on a noisy target.
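
Inside the harness, data(action="inspect") surfaces most of this. For intuition about what the checks above actually compute, here is a minimal pandas sketch, assuming a DataFrame df with a binary target column named target and a datetime column named event_date (both names are illustrative, not part of the harness):

    import pandas as pd

    # "target" and "event_date" are illustrative column names.
    def target_overview(df: pd.DataFrame, target: str = "target", date_col: str = "event_date") -> None:
        # Class balance: heavy skew argues for stratified CV, class weights,
        # or an imbalance-aware metric instead of raw accuracy.
        print(df[target].value_counts(normalize=True))

        # Temporal pattern: does the target rate drift month to month?
        monthly = (
            df.assign(month=pd.to_datetime(df[date_col]).dt.to_period("M"))
              .groupby("month")[target]
              .mean()
        )
        print(monthly)

If the monthly rate moves materially, a random k-fold split will mix past and future rows; an expanding-window split (for example sklearn's TimeSeriesSplit) respects the arrow of time.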

Feature-Target Relationships

features(action="discover")

Ask:

  • Which features have the strongest univariate signal?
  • Are the strong features surprising or expected?
  • Are there features with near-zero signal that you expected to matter? (This is interesting — why don't they matter?)
  • Are there features with suspiciously strong signal? (Possible leakage)

What this tells you:

  • Strong expected features confirm domain knowledge — good, build on this
  • Strong unexpected features need investigation — is it leakage? A proxy for something else?
  • Weak expected features are the most interesting — either the relationship is non-linear (trees will find it), conditional (need an interaction), or your expectation was wrong (update your mental model)
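
features(action="discover") performs this ranking inside the harness. Outside it, a rough univariate screen looks like the sketch below, assuming a numeric feature frame X and a binary target series y; mutual information and single-feature AUC are reasonable screening choices, not the only ones:

    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.metrics import roc_auc_score

    def univariate_signal(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
        # Median-impute only for this screen; this is not a modeling decision.
        X_filled = X.fillna(X.median())

        # Mutual information catches non-linear univariate relationships.
        mi = mutual_info_classif(X_filled, y, random_state=0)

        # Single-feature AUC: a raw feature scoring near 1.0 is a leakage smell.
        aucs = [roc_auc_score(y, X_filled[c]) for c in X_filled.columns]

        return (
            pd.DataFrame({"feature": X.columns, "mutual_info": mi, "auc": aucs})
            .sort_values("mutual_info", ascending=False)
        )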

Feature-Feature Relationships

Ask:

  • Are there highly correlated feature pairs? (A correlation above 0.9 usually means one of them is redundant)
  • Are there feature clusters? (Groups of features measuring the same underlying thing)
  • Do categorical features have high cardinality? (May need encoding strategy)

What this tells you:

  • Correlated features inflate importance scores and can destabilize linear models
  • Feature clusters suggest a latent variable — sometimes a single derived feature capturing that latent variable is better than the cluster
  • High cardinality categoricals need careful handling — frequency encoding, target encoding (with leakage protection), or grouping rare levels
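
A sketch of the redundancy and cardinality checks, assuming a DataFrame df; the 0.9 cutoff is a convention, not a law:

    import numpy as np
    import pandas as pd

    def redundancy_report(df: pd.DataFrame, threshold: float = 0.9) -> None:
        # Absolute correlations between numeric features, upper triangle only
        # so each pair is reported once.
        corr = df.select_dtypes("number").corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        pairs = upper.stack().sort_values(ascending=False)
        print(pairs[pairs > threshold])

        # Cardinality of categorical columns: high counts need an explicit
        # encoding plan (frequency, out-of-fold target encoding, rare-level grouping).
        cats = df.select_dtypes(include=["object", "category"])
        print(cats.nunique().sort_values(ascending=False))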

Missing Data Patterns

Ask:

  • Is missingness random or patterned?
  • Are certain features always missing together? (Suggests a data source issue)
  • Is missingness correlated with the target? (If so, missingness itself is a feature)

What this tells you:

  • Random missingness → impute and move on
  • Patterned missingness → the pattern is information. A null_indicator feature for these columns may be predictive.
  • Target-correlated missingness → important signal, but also a leakage risk if the missingness is caused by the outcome
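
A sketch of the missingness checks, using the same illustrative df and target names as above:

    import pandas as pd

    def missingness_report(df: pd.DataFrame, target: str = "target") -> None:
        miss = df.drop(columns=[target]).isna()

        # Missingness rate per column.
        print(miss.mean().sort_values(ascending=False))

        # Co-missingness: columns that go missing together usually share a data source.
        cols = miss.columns[miss.any()]
        if len(cols) > 1:
            print(miss[cols].astype(int).corr())

        # Target rate for missing vs. present rows. A large gap means the
        # missingness is informative (a null_indicator feature), and a possible
        # leakage risk if the value is missing because of the outcome.
        for col in cols:
            print(col, df.groupby(miss[col])[target].mean().to_dict())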

Segments and Subgroups

Ask:

  • Does the target behave differently across natural segments? (Time periods, categories, geographies)
  • Are there subgroups where prediction will be harder? (Small segments, unusual distributions)
  • Would separate models or segment-specific features help?

What this tells you:

  • Segment-dependent target behavior → regime indicators or interaction features
  • Hard-to-predict subgroups → these will dominate your error. Understanding them is where the biggest gains come from.
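
A sketch of a segment scan, assuming df, a target column, and a hand-picked list of candidate segment columns (all names illustrative):

    import pandas as pd

    def segment_scan(df: pd.DataFrame, target: str, segment_cols: list[str], min_rows: int = 50) -> pd.DataFrame:
        # Target rate and row count per segment level. Small segments with
        # unusual rates are where prediction error tends to concentrate.
        frames = []
        for col in segment_cols:
            grp = (
                df.groupby(col)[target]
                  .agg(rate="mean", rows="count")
                  .reset_index()
                  .rename(columns={col: "level"})
            )
            grp.insert(0, "segment", col)
            frames.append(grp)
        report = pd.concat(frames, ignore_index=True)
        return report[report["rows"] >= min_rows].sort_values("rate")

Levels with decent support and an unusual rate are the natural candidates for regime indicators or interaction features.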

What EDA Is Not

  • It is not running data(action="inspect") and moving on
  • It is not a phase to "complete" before modeling
  • It is something you return to throughout the project — every surprising experiment result is an invitation to look at the data again
  • The best data scientists spend more time here than anywhere else

Output

By the end of initial EDA, you should have:

  1. A written description of the data landscape (target, features, quality issues)
  2. A list of 5-10 hypotheses about what might be predictive and why
  3. Identified risks (leakage candidates, class imbalance, temporal drift)
  4. Questions you can't answer yet that will guide early experiments