Gsd-skill-creator data-wrangling

Data cleaning, transformation, reshaping, joins, missing data handling, and tidy data principles. Covers the full pipeline from raw ingestion to analysis-ready datasets -- type coercion, deduplication, outlier detection, normalization, melting/pivoting, regex extraction, and reproducible transformation chains. Use when preparing, cleaning, or transforming data for analysis.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/data-science/data-wrangling" ~/.claude/skills/tibsfox-gsd-skill-creator-data-wrangling && rm -rf "$T"
manifest: examples/skills/data-science/data-wrangling/SKILL.md
source content

Data Wrangling

Data wrangling is the work that sits between raw data and analysis -- the unglamorous, indispensable practice of making data trustworthy. Estimates vary, but practitioners consistently report that 60-80% of analysis time is spent wrangling. This skill covers the principles and techniques of data cleaning, transformation, reshaping, and integration, grounded in Hadley Wickham's tidy data framework and extended to the realities of messy real-world datasets.

Agent affinity: tukey (EDA-driven cleaning), nightingale (routing wrangling tasks)

Concept IDs: data-data-sources, data-data-quality, data-sampling-bias

The Wrangling Pipeline

| Stage | Goal | Key operations |
|---|---|---|
| 1. Ingestion | Get data into a working environment | Read CSV/JSON/Parquet/SQL, handle encodings, parse dates |
| 2. Profiling | Understand what you have | Shape, dtypes, nulls, distributions, cardinality |
| 3. Cleaning | Fix structural problems | Dedup, type coercion, standardize categories, fix encodings |
| 4. Missing data | Handle gaps | Detect patterns (MCAR/MAR/MNAR), impute or flag |
| 5. Transformation | Derive analysis-ready features | Normalize, bin, log-transform, create indicators |
| 6. Reshaping | Match the analysis structure | Melt, pivot, tidy form, denormalize |
| 7. Integration | Combine sources | Joins (inner/left/right/full/cross), concatenation, dedup post-join |
| 8. Validation | Confirm readiness | Schema checks, assertion tests, row-count reconciliation |
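A minimal sketch of what stages 1-3 and 8 can look like in pandas. The file name and column names (`orders_raw.csv`, `order_date`, `amount`, `order_id`) are illustrative, not part of this skill:

```python
import pandas as pd

# 1. Ingestion: explicit encoding; read everything as strings and coerce deliberately later
raw = pd.read_csv("orders_raw.csv", encoding="utf-8", dtype=str)

# 2. Profiling: shape, types, null counts, cardinality
print(raw.shape)
print(raw.dtypes)
print(raw.isna().sum())
print(raw.nunique())

# 3. Cleaning: drop exact duplicates, then coerce types with explicit formats
clean = (
    raw.drop_duplicates()
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"], format="%Y-%m-%d", errors="coerce"),
           amount=lambda d: pd.to_numeric(d["amount"].str.replace(",", "", regex=False), errors="coerce"),
       )
)

# 8. Validation: assert expectations before handing the data to analysis
assert clean["order_id"].is_unique, "order_id should uniquely identify a row"
assert clean["amount"].dropna().ge(0).all(), "amounts should be non-negative"
```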

Tidy Data Principles

Hadley Wickham (2014) formalized "tidy data" as three rules:

  1. Each variable forms a column. A single column should contain values of exactly one variable.
  2. Each observation forms a row. A single row should contain all values for exactly one observational unit.
  3. Each type of observational unit forms a table. Mixing patient demographics and lab results in one table violates this rule.

Most messy datasets violate one or more of these rules in predictable ways:

| Violation | Example | Fix |
|---|---|---|
| Column headers are values, not variable names | Columns: `income_2020`, `income_2021`, `income_2022` | Melt to columns: `year`, `income` |
| Multiple variables stored in one column | `"M-25"` encodes both sex and age | Split into `sex` and `age` columns |
| Variables stored in both rows and columns | Pivot table with row headers as categories | Melt and re-pivot to tidy form |
| Multiple types in one table | Patient info mixed with visit records | Normalize into two related tables |
| One type spread across multiple tables | Monthly CSV files with identical schema | Concatenate with a `month` column |

Tidy data is not the only valid structure -- wide formats are sometimes more efficient for computation or display. But tidy form is the canonical starting point for analysis, and most tools (ggplot2, pandas groupby, SQL aggregation) assume it.
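As a sketch of the first fix in the table above, here is a melt in pandas. The frame and its `income_2020`..`income_2022` columns are hypothetical:

```python
import pandas as pd

wide = pd.DataFrame({
    "person": ["a", "b"],
    "income_2020": [52000, 61000],
    "income_2021": [54000, 63000],
    "income_2022": [57000, 64000],
})

# Melt the year-in-header columns into tidy (person, year, income) rows
tidy = wide.melt(id_vars="person", var_name="year", value_name="income")
tidy["year"] = tidy["year"].str.replace("income_", "", regex=False).astype(int)
```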

Cleaning Techniques

Type Coercion

Raw data arrives as strings. Coercion converts to the correct type:

  • Numeric: Strip currency symbols, commas, whitespace. Handle locale-specific decimals (`,` vs `.`). Flag non-numeric values rather than silently converting to NaN.
  • Dates: Parse with explicit format strings, never rely on automatic detection. Time zones matter -- store in UTC, display in local.
  • Categorical: Standardize case, strip whitespace, map synonyms (`"USA"`, `"US"`, `"United States"` -> `"US"`). Use controlled vocabularies where possible.
  • Boolean: Map common representations (`"yes"/"no"`, `"1"/"0"`, `"true"/"false"`, `"Y"/"N"`) to a single canonical form.
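A compact sketch of all four coercions in pandas, on an illustrative frame (column names and values are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,200.50", "  840", "n/a"],
    "signup": ["2024-03-01", "2024-13-01", "2024-04-15"],
    "country": [" usa", "US ", "United States"],
    "active": ["Yes", "0", "true"],
})

# Numeric: strip symbols, then coerce; invalid values become NaN (flag them, don't hide them)
price = pd.to_numeric(df["price"].str.replace(r"[$,\s]", "", regex=True), errors="coerce")
print(df.loc[price.isna(), "price"])  # review what failed to parse

# Dates: explicit format; the invalid month above becomes NaT instead of a silent guess
signup = pd.to_datetime(df["signup"], format="%Y-%m-%d", errors="coerce", utc=True)

# Categorical: normalize case/whitespace, then map synonyms to a controlled vocabulary
country = (
    df["country"].str.strip().str.upper()
      .replace({"USA": "US", "UNITED STATES": "US"})
)

# Boolean: map the common spellings to a single canonical form
active = (
    df["active"].str.strip().str.lower()
      .map({"yes": True, "y": True, "1": True, "true": True,
            "no": False, "n": False, "0": False, "false": False})
)
```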

Deduplication

Exact duplicates are trivial to detect. The hard cases are near-duplicates:

  • Record linkage: When the same entity appears with slight variations (`"John Smith"` vs `"J. Smith"` vs `"SMITH, JOHN"`). Use fuzzy matching (Levenshtein distance, phonetic encoding) with a human-reviewed threshold.
  • Temporal duplicates: The same event recorded at slightly different timestamps. Define a dedup window and keep the first/last/most-complete record.
  • Key discipline: Always define what constitutes a unique observation before deduplication. A table of purchases has a different uniqueness key than a table of customers.
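A sketch of exact and near-duplicate handling under hypothetical column names, using only pandas and the standard library; a dedicated record-linkage or fuzzy-matching library would usually replace `difflib` for real work:

```python
import difflib
import pandas as pd

customers = pd.DataFrame({
    "name": ["John Smith", "J. Smith", "SMITH, JOHN", "Maria Lopez"],
    "email": ["js@x.com", "js@x.com", "john@x.com", "ml@x.com"],
})

# Exact duplicates on the declared uniqueness key are trivial
customers = customers.drop_duplicates(subset=["email"], keep="first")

# Near-duplicates: normalize names, then score candidate pairs for human review
def normalize(name: str) -> str:
    parts = name.replace(",", " ").replace(".", " ").lower().split()
    return " ".join(sorted(parts))

names = customers["name"].map(normalize).tolist()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = difflib.SequenceMatcher(None, names[i], names[j]).ratio()
        if score > 0.6:  # threshold should be tuned and human-reviewed
            print(customers["name"].iloc[i], "<->", customers["name"].iloc[j], round(score, 2))
```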

Outlier Detection

Outliers are not errors -- they are values that warrant investigation:

  • Statistical: Values beyond 1.5 * IQR (Tukey's fences), or beyond 3 standard deviations. These thresholds are guidelines, not laws.
  • Domain-based: A human age of 150 is an error. A human age of 95 is unusual but valid. Domain knowledge trumps statistical rules.
  • Multivariate: A value can be normal on each variable individually but extreme in combination (e.g., age 25 with 40 years of work experience). Mahalanobis distance or isolation forests detect these.

Action on outliers: Investigate first. If the value is a data entry error, correct it. If it is a measurement error, flag it. If it is a genuine extreme value, keep it and note its influence on summary statistics.
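A small sketch of Tukey's fences combined with a domain check, on made-up age values:

```python
import pandas as pd

ages = pd.Series([23, 31, 28, 45, 39, 150, 95, 36])  # 150 is an entry error, 95 is genuine

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Statistical flag: values beyond Tukey's fences; flag for investigation, don't drop
statistical = ages[(ages < lower) | (ages > upper)]

# Domain flag: ages above a plausible human lifespan are errors, not extremes
domain_errors = ages[ages > 120]
print(statistical, domain_errors, sep="\n")
```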

Missing Data

Missing Data Mechanisms

Rubin (1976) classified three mechanisms:

| Mechanism | Definition | Example | Implication |
|---|---|---|---|
| MCAR | Missingness is unrelated to any variable | Lab sample randomly dropped | Safe to drop or impute; no bias |
| MAR | Missingness depends on observed variables | Older respondents skip the income question more often, and age is observed | Imputation using observed predictors is valid |
| MNAR | Missingness depends on the missing value itself | People with depression less likely to report depression severity | No imputation is fully valid; requires sensitivity analysis |

Handling Strategies

| Strategy | When to use | Trade-off |
|---|---|---|
| Listwise deletion | MCAR, small fraction missing (<5%) | Simple but loses observations |
| Pairwise deletion | MCAR, different analyses need different subsets | Keeps more data but correlation matrices may not be positive-definite |
| Mean/median imputation | Quick exploration only | Reduces variance, biases correlations toward zero |
| Regression imputation | MAR, continuous variables | Better than mean but inflates R-squared |
| Multiple imputation | MAR, formal inference | Gold standard; accounts for imputation uncertainty |
| Indicator method | Any mechanism, tree-based models | Add a binary "was_missing" column; let the model learn missingness patterns |
| Domain-specific fill | Known defaults | "No response" for surveys, 0 for counts that should exist |
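A small sketch of the indicator method combined with median fill, on an illustrative frame. For formal inference, multiple imputation (see Van Buuren, 2018) is the stronger choice:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, None, 52, 41, None], "visits": [2, 5, None, 1, 3]})

for col in ["age", "visits"]:
    # Indicator method: record where the gap was so a downstream model can use it
    df[f"{col}_was_missing"] = df[col].isna().astype(int)
    # Median fill is acceptable for quick exploration; it shrinks variance, so avoid it for inference
    df[col] = df[col].fillna(df[col].median())
```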

Joins and Integration

Join Types

| Join | Keeps | Use when |
|---|---|---|
| Inner | Rows matching in both tables | You only want complete matches |
| Left | All rows from left, matching from right | Left table is the primary; right is enrichment |
| Right | All rows from right, matching from left | Symmetric to left join |
| Full outer | All rows from both | You need the complete picture of both sources |
| Cross | Every combination of left and right rows | Generating all pairs (e.g., all product-store combinations) |
| Anti | Left rows with NO match in right | Finding orphans or gaps |
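Pandas has no direct anti join; one common emulation, shown here with hypothetical tables, is a left merge with `indicator=True`:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ana", "Bo"]})

# Anti join: orders whose customer_id has no match in the customers table
merged = orders.merge(customers, on="customer_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(orphans)
```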

Join Hazards

  • Many-to-many joins: Produce a Cartesian product of matching rows. Row count explodes. Always check cardinality before joining.
  • Key mismatches: Different key formats (`"001"` vs `1`), trailing whitespace, case differences. Standardize keys before joining.
  • Null keys: NULLs never match other NULLs in standard SQL joins. Decide how to handle null-keyed rows explicitly.
  • Post-join dedup: Joins can introduce duplicates when key relationships are not 1:1. Validate row counts after every join.
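A sketch of cardinality and row-count checks around a join, using made-up tables. The `validate=` argument to pandas `merge` raises if the declared key relationship is violated:

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["a", "b", "a"]})
right = pd.DataFrame({"user_id": [1, 2, 2, 3], "country": ["US", "DE", "DE", "FR"]})

# Check cardinality before joining: duplicated keys on the right mean row counts can grow
print(right["user_id"].is_unique)  # False here

# validate= makes the expectation explicit and fails loudly when it is wrong
try:
    left.merge(right, on="user_id", validate="one_to_one")
except pd.errors.MergeError as exc:
    print("cardinality violated:", exc)

# After any join, reconcile the row count against what the key relationship predicts
joined = left.merge(right, on="user_id", how="left")
print(len(left), "->", len(joined))  # 3 -> 4 because user_id 2 matched twice
```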

Transformation Techniques

Normalization and Scaling

| Method | Formula | Use when |
|---|---|---|
| Min-max | (x - min) / (max - min) | Need values in [0, 1]; distribution shape preserved |
| Z-score | (x - mean) / std | Need zero-centered data; assumes roughly normal |
| Robust scaling | (x - median) / IQR | Outliers present; median/IQR resistant to extremes |
| Log transform | log(x + 1) | Right-skewed data; multiplicative relationships |
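The four methods above, written out on an illustrative series with one extreme value:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 2.5, 3.0, 120.0])  # one extreme value

min_max = (x - x.min()) / (x.max() - x.min())
z_score = (x - x.mean()) / x.std()
robust = (x - x.median()) / (x.quantile(0.75) - x.quantile(0.25))
logged = np.log1p(x)  # log(x + 1), defined at zero
```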

Binning and Discretization

  • Equal-width: Divide range into n bins of equal size. Simple but poor for skewed data.
  • Equal-frequency (quantile): Each bin contains approximately the same number of observations. Better for skewed data.
  • Domain-driven: Age groups (0-17, 18-34, 35-54, 55+), income brackets, clinical thresholds. Always preferred when domain knowledge exists.
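All three binning strategies in pandas, on made-up ages (the domain cut points echo the age groups above):

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 41, 53, 67, 80])

equal_width = pd.cut(ages, bins=4)    # equal-width: 4 bins spanning equal ranges
equal_freq = pd.qcut(ages, q=4)       # equal-frequency: roughly the same count per bin
domain = pd.cut(ages, bins=[0, 17, 34, 54, 120],
                labels=["0-17", "18-34", "35-54", "55+"])  # domain-driven cut points
```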

Feature Engineering

Feature engineering is the creative step where domain knowledge becomes computable:

  • Date decomposition: Extract year, month, day-of-week, is-weekend, days-since-event.
  • Text extraction: Regex for structured patterns (phone numbers, ZIP codes, email domains).
  • Interaction terms: Product of two variables when their combination matters (e.g., age * income).
  • Aggregation: Group-level statistics (customer's average order size, store's monthly variance).
  • Lag features: Previous period's value for time series (yesterday's temperature, last month's sales).
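A short sketch of date decomposition, group-level aggregation, and lag features in pandas; the `sales` frame and its columns are hypothetical:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07",
                            "2024-01-05", "2024-01-06"]),
    "amount": [120.0, 80.0, 200.0, 60.0, 90.0],
})

# Date decomposition
sales["dow"] = sales["date"].dt.dayofweek
sales["is_weekend"] = sales["dow"].ge(5)

# Aggregation: group-level statistic broadcast back to each row
sales["store_avg"] = sales.groupby("store")["amount"].transform("mean")

# Lag feature: previous day's amount within each store (ordering matters)
sales = sales.sort_values(["store", "date"])
sales["amount_prev"] = sales.groupby("store")["amount"].shift(1)
```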

Reproducibility

A wrangling pipeline is only trustworthy if it is reproducible:

  1. Script everything. No manual Excel edits. Every transformation is code.
  2. Version raw data. Hash or checksum the input files. If the raw data changes, the pipeline should detect it.
  3. Document assumptions. Why did you impute with median instead of mean? Why did you drop rows with more than 50% missing? Write it down.
  4. Test transformations. Assert expected row counts, column types, value ranges, and null counts at each stage.
  5. Separate raw from derived. Never overwrite raw data. Produce a new file or table at each stage.
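A minimal sketch of points 2, 4, and 5 in a scripted pipeline; the paths are hypothetical placeholders:

```python
import hashlib
from pathlib import Path

import pandas as pd

RAW = Path("data/raw/orders.csv")              # hypothetical input path
DERIVED = Path("data/derived/orders_clean.parquet")  # hypothetical output path

# Version raw data: record a checksum so silent changes to the input are detected
checksum = hashlib.sha256(RAW.read_bytes()).hexdigest()
print("raw sha256:", checksum)

df = pd.read_csv(RAW, encoding="utf-8")

# ... transformations go here ...

# Test the result instead of trusting it
assert not df.empty, "pipeline produced no rows"
assert df.columns.is_unique, "duplicate column names"

# Separate raw from derived: write a new artifact, never overwrite the input
DERIVED.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(DERIVED, index=False)
```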

Common Mistakes

| Mistake | Why it fails | Fix |
|---|---|---|
| Imputing before exploring | You do not know the missingness mechanism | Profile first, then choose strategy |
| Joining without checking cardinality | Many-to-many join silently explodes row count | Check key uniqueness before joining |
| Dropping outliers without investigation | Outliers may be the most important observations | Investigate, then decide |
| Mixing cleaning and analysis | Impossible to reproduce or audit | Separate wrangling script from analysis |
| Ignoring encoding issues | Garbled text, silent data loss | Specify encoding explicitly at ingestion |
| Normalizing before train/test split | Data leakage -- test set statistics influence training | Fit scaler on training data only |
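A short sketch of the last fix using scikit-learn's StandardScaler on an illustrative frame; the key point is that `fit` sees only the training rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0], "y": [0, 0, 1, 1, 0, 1]})
train, test = train_test_split(df, test_size=0.33, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["x"]])  # fit on training data only
test_scaled = scaler.transform(test[["x"]])        # test data is only transformed
```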

Cross-References

  • tukey agent: Exploratory data analysis techniques that drive cleaning decisions. Primary agent for profiling and outlier investigation.
  • breiman agent: Machine learning pipelines that consume wrangled data. Feature engineering feeds directly into model training.
  • fisher agent: Experimental design that determines what data needs collecting and how.
  • statistical-modeling skill: Regression and inference methods that require clean, properly typed data.
  • ethics-governance skill: Privacy and consent considerations during data collection and integration.

References

  • Wickham, H. (2014). "Tidy Data." Journal of Statistical Software, 59(10), 1-23.
  • Rubin, D. B. (1976). "Inference and Missing Data." Biometrika, 63(3), 581-592.
  • Van Buuren, S. (2018). Flexible Imputation of Missing Data. 2nd edition. CRC Press.
  • Dasu, T. & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. Wiley.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.