CheatCodes-Skill-Library org-data-pipeline
Universalized pipeline for pulling, reconciling, slicing, and reporting organizational data. Handles ad-hoc and recurring requests: ingest from any source, validate against a roster, cut by org dimensions, generate formatted reports, and distribute.
git clone https://github.com/jac007x/CheatCodes-Skill-Library
T=$(mktemp -d) && git clone --depth=1 https://github.com/jac007x/CheatCodes-Skill-Library "$T" && mkdir -p ~/.claude/skills && cp -r "$T/org-data-pipeline" ~/.claude/skills/jac007x-cheatcodes-skill-library-org-data-pipeline && rm -rf "$T"
org-data-pipeline/SKILL.md📊 Org Data Pipeline
A universalized, repeatable pipeline for turning raw organizational data requests into validated, dimensionally-sliced, formatted reports — whether the ask is ad-hoc ("pull me headcount by job family") or recurring ("monthly MBR data").
🧠 Core Philosophy
- Never trust the source — always validate and reconcile before reporting
- Dimensions are the product — raw totals are useless; cuts by org dimension drive decisions
- Idempotent outputs — same request + same data = same report, every time
- Archive everything — every pull gets timestamped and saved for audit trail
- Self-serve is the goal — recurring requests should become dashboards, not emails
🚀 Intake: Customize This Skill
| Variable | Description | Type | Required | Default |
|---|---|---|---|---|
| Where to pull data from | choice | ✅ | — |
| Connection string, file path, or API endpoint | string | ✅ | — |
| Path to authoritative roster for reconciliation | file-path | ❌ | |
| Org dimensions to slice by | list | ✅ | |
| What to measure (headcount, turnover, tenure, etc.) | list | ✅ | |
| Standing filters (e.g., US-only, active associates) | dict | ❌ | |
| How to deliver the report | choice | ❌ | |
| One-time or recurring schedule | choice | ❌ | |
| Prefix for output files | string | ❌ | |
| Where to save outputs | path | ❌ | |
Data Source Options
— use powerbi sub-agent for DAX queriesPowerBI
— use bigquery-explorer sub-agentBigQuery
— local file ingestionExcel/CSV
— export from Workday, load as CSV/XLSXWorkday Report
— REST endpoint (provide URL + auth method)API
🏗️ Architecture
┌───────────────────────────────────────────────────┐ │ ORG DATA PIPELINE │ ├───────────────────────────────────────────────────┤ │ REQUEST → ACQUIRE → RECONCILE → SLICE → REPORT │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ │ │ Parse Pull from Validate Cut by Format │ │ intent source(s) vs roster dims & share │ └───────────────────────────────────────────────────┘
Phase 1: Request Parsing
Goal: Translate a natural language request into structured parameters.
Map the request to the intake variables:
| User Says | Maps To |
|---|---|
| "by job family" | |
| "for Sanjay's org" | |
| "last 12 months" | |
| "headcount and turnover" | |
| "as a slide" | |
| "every Monday" | |
Phase 2: Data Acquisition
Goal: Pull data from the specified source.
Source-Specific Strategies
PowerBI:
invoke_agent('powerbi', 'List datasets in workspace {{WORKSPACE_ID}}, then run DAX: EVALUATE SUMMARIZECOLUMNS(...) with filters {{FILTERS}}')
BigQuery:
invoke_agent('bigquery-explorer', 'Query: SELECT {{METRICS}} FROM {{TABLE}} WHERE {{FILTERS}} GROUP BY {{CUT_DIMENSIONS}}')
Excel/CSV:
def load_data(path: Path) -> pd.DataFrame: suffix = path.suffix.lower() if suffix == ".csv": return pd.read_csv(path, dtype=str) elif suffix in sx", ".xls"): return pd.read_excel(path, dtype=str) raise ValueError(f"Unsupported format: {suffix}")
Data Freshness Check
- Log the data's last-modified timestamp
- Warn if data is > 7 days old for ad-hoc requests
- Warn if data is > 1 day old for recurring reports
Phase 3: Reconciliation & Validation
Goal: Cross-reference against an authoritative roster and flag discrepancies.
def reconcile(data: pd.DataFrame, roster: pd.DataFrame, join_key: str = "Employee ID") -> dict: """Compare data against roster. Return discrepancy report.""" data_ids = set(data[join_key].dropna()) roster_ids = set(roster[join_key].dropna()) return { "in_data_not_roster": data_ids - roster_ids, "in_roster_not_data": roster_ids - data_ids, "matched": data_ids & roster_ids, "match_rate": len(data_ids & roster_ids) / max(len(data_ids), 1), }
Validation Rules
- No duplicate employee IDs (warn + dedupe, keeping latest)
- Org dimension values are non-null for >90% of rows
- Numeric metrics parse correctly (no "N/A", "#REF!", etc.)
- Date fields are parseable and within expected range
- Match rate against roster > 95% (warn if lower)
Phase 4: Dimensional Slicing
Goal: Cut the data by
{{CUT_DIMENSIONS}} and compute {{METRICS}}.
def slice_and_dice(df: pd.DataFrame, dimensions: list[str], metrics: dict[str, str]) -> pd.DataFrame: """ Group by dimensions, compute metrics. metrics = {"headcount": "count", "avg_tenure": ("Tenure", "mean")} """ agg_spec = {} for name, spec in metrics.items(): if spec == "count": agg_spec[name] = (dimensions[0], "count") elif isinstance(spec, tuple): agg_spec[name] = spec return df.groupby(dimensions, dropna=False).agg(**agg_spec).reset_index()
Standard Metric Definitions
| Metric | Computation | Notes |
|---|---|---|
| Headcount | Count distinct employee IDs | Active only |
| Turnover | Termed / Avg HC × 100 | Annualize if partial period |
| Avg Tenure | Mean of tenure-in-months | Exclude < 30 days |
| Open Reqs | Count open requisitions | From recruiting data |
| Span of Control | Direct reports per manager | Mean + distribution |
| Promo Rate | Promoted / Total HC × 100 | By level band |
Phase 5: Report Generation
Goal: Format the sliced data into the requested output.
HTML Dashboard (Default)
- Tailwind CSS + Chart.js
- Executive summary at top with key callouts
- Sortable data table
- Bar/line charts per metric
- Heatmap for multi-dimensional views
- Walmart colors:
primary,#0053e2
accent#ffc220 - Wrap Chart.js canvases in fixed-height container divs
Excel
def write_excel(df: pd.DataFrame, path: Path, sheet_name: str = "Report"): with pd.ExcelWriter(path, engine="openpyxl") as writer: df.to_excel(writer, sheet_name=sheet_name, index=False)
Slide Deck
invoke_agent('slide-deck', 'Create a deck from this data: {{summary_json}}')
Phase 6: Distribution & Archival
Goal: Deliver the report and archive it.
Distribution Options
— open HTML on Mac withopenopen /path/to/report.html
— publish to your publishing platformyour-publishing-agent
— send via msgraph agentemail
— post to Teams channel via msgraph agentteams
Archival
{{OUTPUT_DIR}}/ {{PROJECT_NAME}}_{{YYYYMMDD}}.html {{PROJECT_NAME}}_{{YYYYMMDD}}.csv {{PROJECT_NAME}}_{{YYYYMMDD}}_reconciliation.json {{PROJECT_NAME}}_{{YYYYMMDD}}_audit.log
🛡️ Bulletproofing Checklist
- Always check data freshness before reporting
- Reconcile against roster when available
- Suppress cells with N < 5 (privacy threshold)
- Never expose individual-level data in shared reports
- Timestamp every output file for auditability
- Log every filter applied so the report is reproducible
- Validate that dimension columns have reasonable cardinality (<500 unique values)
- Test with a small filter first, then expand
📚 Example Applications
| Context | Data Source | Dimensions | Metrics | Output |
|---|---|---|---|---|
| Monthly HC review | ARS PowerBI | L2, L3, Job Family | HC, Turnover | Slides |
| Ad-hoc MADA request | Budget report Excel | Cost Center, Leader | Spend, HC | Excel |
| Weekly hiring report | Workday Recruiting | Req Status, Org | Open Reqs, TTF | HTML |
| Quarterly talent review | Master Talent XLSX | Level, Perf Rating, Potential | HC, Promo Rate | Dashboard |
| Job family cleanup audit | Workday export | Job Family, Job Profile | Mismatch count | CSV |
⚠️ Anti-Patterns
❌ Sending raw exports without validation (always reconcile first) ❌ Hardcoding org dimension names (parameterize everything) ❌ Reporting on stale data without noting the refresh date ❌ Building a one-off script when a recurring dashboard would serve better ❌ Manually cutting the same data 20 ways (automate the dimension loop) ❌ Forgetting to suppress small-N cells (privacy violation risk) ❌ Emailing Excel files instead of sharing via OneDrive/SharePoint link
🌐 Platform Notes
| Platform | Compatible | Notes |
|---|---|---|
| code-puppy | ✅ | Activate with |
| wibey | ✅ | Copy SKILL.md to |
| Codex | ✅ | Paste as system prompt prefix |
| Any LLM | ✅ | Plain Markdown — paste as context |
🔁 Deployment Ladder
| Stage | Team | What to validate |
|---|---|---|
| Refine | Skill Owner | Reconciliation logic, metric definitions, output formatting |
| Prove | Peer Teams | Cross-org dimension cuts, data source integrations |
| Scale | Walmart Home Office | Self-serve dashboard mode, recurring schedules, multi-source joins |