Gsd-skill-creator data-wrangling
Data cleaning, transformation, reshaping, joins, missing data handling, and tidy data principles. Covers the full pipeline from raw ingestion to analysis-ready datasets -- type coercion, deduplication, outlier detection, normalization, melting/pivoting, regex extraction, and reproducible transformation chains. Use when preparing, cleaning, or transforming data for analysis.
git clone https://github.com/Tibsfox/gsd-skill-creator
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/data-science/data-wrangling" ~/.claude/skills/tibsfox-gsd-skill-creator-data-wrangling && rm -rf "$T"
examples/skills/data-science/data-wrangling/SKILL.md
Data Wrangling
Data wrangling is the work that sits between raw data and analysis -- the unglamorous, indispensable practice of making data trustworthy. Estimates vary, but practitioners consistently report that 60-80% of analysis time is spent wrangling. This skill covers the principles and techniques of data cleaning, transformation, reshaping, and integration, grounded in Hadley Wickham's tidy data framework and extended to the realities of messy real-world datasets.
Agent affinity: tukey (EDA-driven cleaning), nightingale (routing wrangling tasks)
Concept IDs: data-data-sources, data-data-quality, data-sampling-bias
The Wrangling Pipeline
| Stage | Goal | Key operations |
|---|---|---|
| 1. Ingestion | Get data into a working environment | Read CSV/JSON/Parquet/SQL, handle encodings, parse dates |
| 2. Profiling | Understand what you have | Shape, dtypes, nulls, distributions, cardinality |
| 3. Cleaning | Fix structural problems | Dedup, type coercion, standardize categories, fix encodings |
| 4. Missing data | Handle gaps | Detect patterns (MCAR/MAR/MNAR), impute or flag |
| 5. Transformation | Derive analysis-ready features | Normalize, bin, log-transform, create indicators |
| 6. Reshaping | Match the analysis structure | Melt, pivot, tidy form, denormalize |
| 7. Integration | Combine sources | Joins (inner/left/right/full/cross), concatenation, dedup post-join |
| 8. Validation | Confirm readiness | Schema checks, assertion tests, row-count reconciliation |
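As a concrete sketch of stages 1-2, a minimal profiling pass in pandas might look like the following. The file path and column set are placeholders for illustration, not part of the skill itself.

```python
import pandas as pd

# Ingest with an explicit encoding, read everything as strings first,
# and profile before changing anything. Path is a placeholder.
df = pd.read_csv("raw/orders.csv", encoding="utf-8", dtype=str)

print(df.shape)                       # rows x columns
print(df.dtypes)                      # declared types
print(df.isna().sum())                # null count per column
print(df.nunique().sort_values())     # cardinality: constants and candidate keys stand out
print(df.describe(include="all").T)   # distributions and top categories
```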
Tidy Data Principles
Hadley Wickham (2014) formalized "tidy data" as three rules:
- Each variable forms a column. A single column should contain values of exactly one variable.
- Each observation forms a row. A single row should contain all values for exactly one observational unit.
- Each type of observational unit forms a table. Mixing patient demographics and lab results in one table violates this rule.
Most messy datasets violate one or more of these rules in predictable ways:
| Violation | Example | Fix |
|---|---|---|
| Column headers are values, not variable names | Values such as years or income brackets used as column headers | Melt into a variable column and a value column |
| Multiple variables stored in one column | A single column encodes both sex and age | Split into separate sex and age columns |
| Variables stored in both rows and columns | Pivot table with row headers as categories | Melt and re-pivot to tidy form |
| Multiple types in one table | Patient info mixed with visit records | Normalize into two related tables |
| One type spread across multiple tables | Monthly CSV files with identical schema | Concatenate with a source column identifying each file |
Tidy data is not the only valid structure -- wide formats are sometimes more efficient for computation or display. But tidy form is the canonical starting point for analysis, and most tools (ggplot2, pandas groupby, SQL aggregation) assume it.
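For instance, the first violation above ("column headers are values") is usually fixed with a single melt. A minimal pandas sketch, with invented data:

```python
import pandas as pd

# Wide: year values are stored as column headers
wide = pd.DataFrame({
    "country": ["AR", "BR"],
    "2019": [10, 20],
    "2020": [12, 25],
})

# Tidy: one row per country-year observation
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```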
Cleaning Techniques
Type Coercion
Raw data arrives as strings. Coercion converts to the correct type:
- Numeric: Strip currency symbols, commas, whitespace. Handle locale-specific decimals (a period vs a comma as the decimal separator). Flag non-numeric values rather than silently converting to NaN.
- Dates: Parse with explicit format strings, never rely on automatic detection. Time zones matter -- store in UTC, display in local.
- Categorical: Standardize case, strip whitespace, map synonyms ("US", "USA" -> "United States"). Use controlled vocabularies where possible.
- Boolean: Map common representations ("yes"/"no", "Y"/"N", "1"/"0", "true"/"false") to a single canonical form.
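A sketch of these coercions in pandas, using invented column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,200.50", " 300 ", "n/a"],
    "visit_date": ["2024-01-05", "2024-02-17", "not recorded"],
    "country": [" usa", "US", "United States "],
    "active": ["Yes", "0", "true"],
})

# Numeric: strip symbols, then coerce; keep a flag instead of silently dropping
cleaned = df["price"].str.replace(r"[$,\s]", "", regex=True)
df["price_num"] = pd.to_numeric(cleaned, errors="coerce")
df["price_unparsed"] = df["price_num"].isna() & df["price"].notna()

# Dates: explicit format; invalid values become NaT rather than a guess
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")

# Categorical: normalize case/whitespace, then map synonyms to one canonical value
country_map = {"us": "United States", "usa": "United States",
               "united states": "United States"}
df["country"] = df["country"].str.strip().str.lower().map(country_map)

# Boolean: map the common textual representations to True/False
bool_map = {"yes": True, "no": False, "y": True, "n": False,
            "1": True, "0": False, "true": True, "false": False}
df["active"] = df["active"].str.strip().str.lower().map(bool_map)
print(df.dtypes)
```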
Deduplication
Exact duplicates are trivial to detect. The hard cases are near-duplicates:
- Record linkage: When the same entity appears with slight variations ("John Smith" vs "J. Smith" vs "SMITH, JOHN"). Use fuzzy matching (Levenshtein distance, phonetic encoding) with a human-reviewed threshold.
- Temporal duplicates: The same event recorded at slightly different timestamps. Define a dedup window and keep the first/last/most-complete record.
- Key discipline: Always define what constitutes a unique observation before deduplication. A table of purchases has a different uniqueness key than a table of customers.
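A sketch of exact and near-duplicate detection, using pandas plus the standard-library difflib as a stand-in for fancier fuzzy matchers; the threshold and names are illustrative:

```python
import difflib
import pandas as pd

customers = pd.DataFrame({"name": ["John Smith", "J. Smith", "SMITH, JOHN", "Ana Lima"]})

# Exact duplicates: define the uniqueness key explicitly, then drop
customers = customers.drop_duplicates(subset=["name"], keep="first")

# Near-duplicates: normalize, then score candidate pairs for human review
def normalize(name: str) -> str:
    parts = name.replace(",", " ").lower().split()
    return " ".join(sorted(parts))

names = customers["name"].tolist()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        score = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score > 0.8:  # candidate pair -- review, do not auto-merge
            print(f"possible duplicate: {a!r} ~ {b!r} (score {score:.2f})")
```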
Outlier Detection
Outliers are not errors -- they are values that warrant investigation:
- Statistical: Values more than 1.5 * IQR below Q1 or above Q3 (Tukey's fences), or more than 3 standard deviations from the mean. These thresholds are guidelines, not laws.
- Domain-based: A human age of 150 is an error. A human age of 95 is unusual but valid. Domain knowledge trumps statistical rules.
- Multivariate: A value can be normal on each variable individually but extreme in combination (e.g., age 25 with 40 years of work experience). Mahalanobis distance or isolation forests detect these.
Action on outliers: Investigate first. If the value is a data entry error, correct it. If it is a measurement error, flag it. If it is a genuine extreme value, keep it and note its influence on summary statistics.
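A minimal sketch of Tukey's fences in pandas, on invented values; the 1.5 multiplier is the conventional default, not a rule:

```python
import pandas as pd

ages = pd.Series([23, 31, 28, 45, 52, 38, 95, 150])  # invented values

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag for investigation rather than dropping
flagged = ages[(ages < lower) | (ages > upper)]
print(f"fences: [{lower:.1f}, {upper:.1f}]")
print(flagged)  # here only 150 falls outside the fences; investigate before acting
```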
Missing Data
Missing Data Mechanisms
Rubin (1976) classified three mechanisms:
| Mechanism | Definition | Example | Implication |
|---|---|---|---|
| MCAR | Missingness is unrelated to any variable | Lab sample randomly dropped | Safe to drop or impute; no bias |
| MAR | Missingness depends on observed variables | High-income respondents skip income question less often | Imputation using observed predictors is valid |
| MNAR | Missingness depends on the missing value itself | People with depression less likely to report depression severity | No imputation is fully valid; requires sensitivity analysis |
Handling Strategies
| Strategy | When to use | Trade-off |
|---|---|---|
| Listwise deletion | MCAR, small fraction missing (<5%) | Simple but loses observations |
| Pairwise deletion | MCAR, different analyses need different subsets | Keeps more data but correlation matrices may not be positive-definite |
| Mean/median imputation | Quick exploration only | Reduces variance, biases correlations toward zero |
| Regression imputation | MAR, continuous variables | Better than mean but inflates R-squared |
| Multiple imputation | MAR, formal inference | Gold standard; accounts for imputation uncertainty |
| Indicator method | Any mechanism, tree-based models | Add a binary "was_missing" column; let the model learn missingness patterns |
| Domain-specific fill | Known defaults | "No response" for surveys, 0 for counts that should exist |
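As an illustration of the indicator method combined with a quick median fill (column name and values invented), a minimal pandas sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 48000]})

# Indicator method: record *that* the value was missing before filling it
df["income_was_missing"] = df["income"].isna()

# Median imputation -- acceptable for exploration or tree-based models,
# but remember it shrinks variance and biases correlations toward zero
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```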
Joins and Integration
Join Types
| Join | Keeps | Use when |
|---|---|---|
| Inner | Rows matching in both tables | You only want complete matches |
| Left | All rows from left, matching from right | Left table is the primary; right is enrichment |
| Right | All rows from right, matching from left | Symmetric to left join |
| Full outer | All rows from both | You need the complete picture of both sources |
| Cross | Every combination of left and right rows | Generating all pairs (e.g., all product-store combinations) |
| Anti | Left rows with NO match in right | Finding orphans or gaps |
Join Hazards
- Many-to-many joins: Produce a Cartesian product of matching rows. Row count explodes. Always check cardinality before joining.
- Key mismatches: Different key formats ("001" vs 1), trailing whitespace, case differences. Standardize keys before joining.
- Null keys: NULLs never match other NULLs in standard SQL joins. Decide how to handle null-keyed rows explicitly.
- Post-join dedup: Joins can introduce duplicates when key relationships are not 1:1. Validate row counts after every join.
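A sketch of a defensive join in pandas, using merge's built-in cardinality check, the match indicator, and a row-count reconciliation; table and key names are illustrative:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": ["001", "002", "004"], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": ["001", "002", "003"], "region": ["N", "S", "W"]})

# Standardize keys first (format, whitespace, case)
for t in (orders, customers):
    t["customer_id"] = t["customer_id"].str.strip()

before = len(orders)
merged = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one",   # raises MergeError if customers has duplicate keys
    indicator=True,           # records which side each row matched
)
assert len(merged) == before, "left join changed the row count -- check key cardinality"

# Anti join: orders whose customer is missing from the customers table
orphans = merged[merged["_merge"] == "left_only"]
print(orphans)
```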
Transformation Techniques
Normalization and Scaling
| Method | Formula | Use when |
|---|---|---|
| Min-max | (x - min) / (max - min) | Need values in [0, 1]; distribution shape preserved |
| Z-score | (x - mean) / std | Need zero-centered data; assumes roughly normal |
| Robust scaling | (x - median) / IQR | Outliers present; median/IQR resistant to extremes |
| Log transform | log(x + 1) | Right-skewed data; multiplicative relationships |
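The table's formulas applied column-wise in pandas, as a minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # right-skewed, one extreme value

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std()
robust = (x - x.median()) / (x.quantile(0.75) - x.quantile(0.25))
logged = np.log1p(x)  # log(x + 1), handles zeros

print(pd.DataFrame({"minmax": minmax, "zscore": zscore, "robust": robust, "log": logged}))
```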
Binning and Discretization
- Equal-width: Divide range into n bins of equal size. Simple but poor for skewed data.
- Equal-frequency (quantile): Each bin contains approximately the same number of observations. Better for skewed data.
- Domain-driven: Age groups (0-17, 18-34, 35-54, 55+), income brackets, clinical thresholds. Always preferred when domain knowledge exists.
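In pandas, equal-width bins come from cut, equal-frequency bins from qcut, and domain-driven bins from explicit edges; a small sketch with invented ages:

```python
import pandas as pd

ages = pd.Series([5, 17, 22, 34, 41, 58, 63, 80])

equal_width = pd.cut(ages, bins=4)      # same-sized intervals
equal_freq = pd.qcut(ages, q=4)         # roughly the same count per bin
domain = pd.cut(ages, bins=[0, 17, 34, 54, 120],
                labels=["0-17", "18-34", "35-54", "55+"])  # domain-driven edges
print(pd.DataFrame({"age": ages, "width": equal_width,
                    "freq": equal_freq, "domain": domain}))
```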
Feature Engineering
Feature engineering is the creative step where domain knowledge becomes computable:
- Date decomposition: Extract year, month, day-of-week, is-weekend, days-since-event.
- Text extraction: Regex for structured patterns (phone numbers, ZIP codes, email domains).
- Interaction terms: Product of two variables when their combination matters (e.g., age * income).
- Aggregation: Group-level statistics (customer's average order size, store's monthly variance).
- Lag features: Previous period's value for time series (yesterday's temperature, last month's sales).
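A sketch of a few of these derivations in pandas; the table and column names are invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2024-01-06", "2024-01-07", "2024-01-06", "2024-01-07"]),
    "revenue": [100.0, 150.0, 80.0, 90.0],
})

# Date decomposition
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["is_weekend"] = sales["date"].dt.dayofweek >= 5

# Group-level aggregation broadcast back onto each row
sales["store_avg_revenue"] = sales.groupby("store")["revenue"].transform("mean")

# Lag feature: previous day's revenue within each store
sales = sales.sort_values(["store", "date"])
sales["revenue_lag1"] = sales.groupby("store")["revenue"].shift(1)
print(sales)
```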
Reproducibility
A wrangling pipeline is only trustworthy if it is reproducible:
- Script everything. No manual Excel edits. Every transformation is code.
- Version raw data. Hash or checksum the input files. If the raw data changes, the pipeline should detect it.
- Document assumptions. Why did you impute with median instead of mean? Why did you drop rows with more than 50% missing? Write it down.
- Test transformations. Assert expected row counts, column types, value ranges, and null counts at each stage.
- Separate raw from derived. Never overwrite raw data. Produce a new file or table at each stage.
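A sketch of input fingerprinting plus stage assertions; the path, schema rules, and output name are placeholders for whatever the pipeline promises:

```python
import hashlib
from pathlib import Path

import pandas as pd

RAW = Path("raw/orders.csv")  # placeholder path

def sha256(path: Path) -> str:
    """Fingerprint the raw input so the pipeline notices if it silently changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

print("raw input fingerprint:", sha256(RAW))  # record this alongside each run

df = pd.read_csv(RAW)

# Stage assertions: fail loudly instead of propagating silent corruption
assert df["order_id"].is_unique, "duplicate order_id after ingestion"
assert df["amount"].between(0, 1e7).all(), "amount outside the expected range"
assert df["amount"].notna().all(), "unexpected nulls in amount"

# Separate raw from derived: write a new artifact, never overwrite the input
df.to_parquet("derived/orders_clean.parquet")
```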
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Imputing before exploring | You do not know the missingness mechanism | Profile first, then choose strategy |
| Joining without checking cardinality | Many-to-many join silently explodes row count | Check key uniqueness before joining |
| Dropping outliers without investigation | Outliers may be the most important observations | Investigate, then decide |
| Mixing cleaning and analysis | Impossible to reproduce or audit | Separate wrangling script from analysis |
| Ignoring encoding issues | Garbled text, silent data loss | Specify encoding explicitly at ingestion |
| Normalizing before train/test split | Data leakage -- test set statistics influence training | Fit scaler on training data only |
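For the last mistake in the table, the fix is to compute scaling statistics on the training split only and reuse them on the test split; a minimal pandas sketch with invented data:

```python
import pandas as pd

data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 60.0]})
train, test = data.iloc[:4], data.iloc[4:]

# Fit the scaler on the training data only...
mean, std = train["x"].mean(), train["x"].std()

# ...then apply the same statistics to both splits
train_scaled = (train["x"] - mean) / std
test_scaled = (test["x"] - mean) / std   # no peeking at test-set statistics
print(test_scaled)
```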
Cross-References
- tukey agent: Exploratory data analysis techniques that drive cleaning decisions. Primary agent for profiling and outlier investigation.
- breiman agent: Machine learning pipelines that consume wrangled data. Feature engineering feeds directly into model training.
- fisher agent: Experimental design that determines what data needs collecting and how.
- statistical-modeling skill: Regression and inference methods that require clean, properly typed data.
- ethics-governance skill: Privacy and consent considerations during data collection and integration.
References
- Wickham, H. (2014). "Tidy Data." Journal of Statistical Software, 59(10), 1-23.
- Rubin, D. B. (1976). "Inference and Missing Data." Biometrika, 63(3), 581-592.
- Van Buuren, S. (2018). Flexible Imputation of Missing Data. 2nd edition. CRC Press.
- Dasu, T. & Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. Wiley.
- Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.