CheatCodes-Skill-Library org-data-pipeline

Universalized pipeline for pulling, reconciling, slicing, and reporting organizational data. Handles ad-hoc and recurring requests: ingest from any source, validate against a roster, cut by org dimensions, generate formatted reports, and distribute.

install
source · Clone the upstream repo
git clone https://github.com/jac007x/CheatCodes-Skill-Library
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jac007x/CheatCodes-Skill-Library "$T" && mkdir -p ~/.claude/skills && cp -r "$T/org-data-pipeline" ~/.claude/skills/jac007x-cheatcodes-skill-library-org-data-pipeline && rm -rf "$T"
manifest: org-data-pipeline/SKILL.md
source content

📊 Org Data Pipeline

A universalized, repeatable pipeline for turning raw organizational data requests into validated, dimensionally-sliced, formatted reports — whether the ask is ad-hoc ("pull me headcount by job family") or recurring ("monthly MBR data").


🧠 Core Philosophy

  • Never trust the source — always validate and reconcile before reporting
  • Dimensions are the product — raw totals are useless; cuts by org dimension drive decisions
  • Idempotent outputs — same request + same data = same report, every time
  • Archive everything — every pull gets timestamped and saved for audit trail
  • Self-serve is the goal — recurring requests should become dashboards, not emails

🚀 Intake: Customize This Skill

VariableDescriptionTypeRequiredDefault
{{DATA_SOURCE}}
Where to pull data fromchoice
{{SOURCE_DETAILS}}
Connection string, file path, or API endpointstring
{{ROSTER_FILE}}
Path to authoritative roster for reconciliationfile-path
null
{{CUT_DIMENSIONS}}
Org dimensions to slice bylist
[]
{{METRICS}}
What to measure (headcount, turnover, tenure, etc.)list
["headcount"]
{{FILTERS}}
Standing filters (e.g., US-only, active associates)dict
{}
{{OUTPUT_FORMAT}}
How to deliver the reportchoice
"html"
{{RECURRENCE}}
One-time or recurring schedulechoice
"one-time"
{{PROJECT_NAME}}
Prefix for output filesstring
"report"
{{OUTPUT_DIR}}
Where to save outputspath
"./outputs"

Data Source Options

  • PowerBI
    — use powerbi sub-agent for DAX queries
  • BigQuery
    — use bigquery-explorer sub-agent
  • Excel/CSV
    — local file ingestion
  • Workday Report
    — export from Workday, load as CSV/XLSX
  • API
    — REST endpoint (provide URL + auth method)

🏗️ Architecture

┌───────────────────────────────────────────────────┐
│               ORG DATA PIPELINE                    │
├───────────────────────────────────────────────────┤
│  REQUEST → ACQUIRE → RECONCILE → SLICE → REPORT    │
│     │         │          │          │        │      │
│     ▼         ▼          ▼          ▼        ▼      │
│  Parse     Pull from   Validate   Cut by   Format  │
│  intent    source(s)   vs roster  dims     & share  │
└───────────────────────────────────────────────────┘

Phase 1: Request Parsing

Goal: Translate a natural language request into structured parameters.

Map the request to the intake variables:

User SaysMaps To
"by job family"
CUT_DIMENSIONS += ["Job Family"]
"for Sanjay's org"
FILTERS["L3"] = "Sanjay"
"last 12 months"
FILTERS["date_range"] = "12mo"
"headcount and turnover"
METRICS = ["headcount", "turnover"]
"as a slide"
OUTPUT_FORMAT = "pptx"
"every Monday"
RECURRENCE = "weekly-monday"

Phase 2: Data Acquisition

Goal: Pull data from the specified source.

Source-Specific Strategies

PowerBI:

invoke_agent('powerbi', 'List datasets in workspace {{WORKSPACE_ID}},
  then run DAX: EVALUATE SUMMARIZECOLUMNS(...) with filters {{FILTERS}}')

BigQuery:

invoke_agent('bigquery-explorer', 'Query: SELECT {{METRICS}} FROM {{TABLE}}
  WHERE {{FILTERS}} GROUP BY {{CUT_DIMENSIONS}}')

Excel/CSV:

def load_data(path: Path) -> pd.DataFrame:
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path, dtype=str)
    elif suffix in sx", ".xls"):
        return pd.read_excel(path, dtype=str)
    raise ValueError(f"Unsupported format: {suffix}")

Data Freshness Check

  • Log the data's last-modified timestamp
  • Warn if data is > 7 days old for ad-hoc requests
  • Warn if data is > 1 day old for recurring reports

Phase 3: Reconciliation & Validation

Goal: Cross-reference against an authoritative roster and flag discrepancies.

def reconcile(data: pd.DataFrame, roster: pd.DataFrame,
              join_key: str = "Employee ID") -> dict:
    """Compare data against roster. Return discrepancy report."""
    data_ids = set(data[join_key].dropna())
    roster_ids = set(roster[join_key].dropna())

    return {
        "in_data_not_roster": data_ids - roster_ids,
        "in_roster_not_data": roster_ids - data_ids,
        "matched": data_ids & roster_ids,
        "match_rate": len(data_ids & roster_ids) / max(len(data_ids), 1),
    }

Validation Rules

  • No duplicate employee IDs (warn + dedupe, keeping latest)
  • Org dimension values are non-null for >90% of rows
  • Numeric metrics parse correctly (no "N/A", "#REF!", etc.)
  • Date fields are parseable and within expected range
  • Match rate against roster > 95% (warn if lower)

Phase 4: Dimensional Slicing

Goal: Cut the data by

{{CUT_DIMENSIONS}}
and compute
{{METRICS}}
.

def slice_and_dice(df: pd.DataFrame, dimensions: list[str],
                   metrics: dict[str, str]) -> pd.DataFrame:
    """
    Group by dimensions, compute metrics.
    metrics = {"headcount": "count", "avg_tenure": ("Tenure", "mean")}
    """
    agg_spec = {}
    for name, spec in metrics.items():
        if spec == "count":
            agg_spec[name] = (dimensions[0], "count")
        elif isinstance(spec, tuple):
            agg_spec[name] = spec
    return df.groupby(dimensions, dropna=False).agg(**agg_spec).reset_index()

Standard Metric Definitions

MetricComputationNotes
HeadcountCount distinct employee IDsActive only
TurnoverTermed / Avg HC × 100Annualize if partial period
Avg TenureMean of tenure-in-monthsExclude < 30 days
Open ReqsCount open requisitionsFrom recruiting data
Span of ControlDirect reports per managerMean + distribution
Promo RatePromoted / Total HC × 100By level band

Phase 5: Report Generation

Goal: Format the sliced data into the requested output.

HTML Dashboard (Default)

  • Tailwind CSS + Chart.js
  • Executive summary at top with key callouts
  • Sortable data table
  • Bar/line charts per metric
  • Heatmap for multi-dimensional views
  • Walmart colors:
    #0053e2
    primary,
    #ffc220
    accent
  • Wrap Chart.js canvases in fixed-height container divs

Excel

def write_excel(df: pd.DataFrame, path: Path, sheet_name: str = "Report"):
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)

Slide Deck

invoke_agent('slide-deck', 'Create a deck from this data: {{summary_json}}')

Phase 6: Distribution & Archival

Goal: Deliver the report and archive it.

Distribution Options

  • open
    — open HTML on Mac with
    open /path/to/report.html
  • your-publishing-agent
    — publish to your publishing platform
  • email
    — send via msgraph agent
  • teams
    — post to Teams channel via msgraph agent

Archival

{{OUTPUT_DIR}}/
  {{PROJECT_NAME}}_{{YYYYMMDD}}.html
  {{PROJECT_NAME}}_{{YYYYMMDD}}.csv
  {{PROJECT_NAME}}_{{YYYYMMDD}}_reconciliation.json
  {{PROJECT_NAME}}_{{YYYYMMDD}}_audit.log

🛡️ Bulletproofing Checklist

  • Always check data freshness before reporting
  • Reconcile against roster when available
  • Suppress cells with N < 5 (privacy threshold)
  • Never expose individual-level data in shared reports
  • Timestamp every output file for auditability
  • Log every filter applied so the report is reproducible
  • Validate that dimension columns have reasonable cardinality (<500 unique values)
  • Test with a small filter first, then expand

📚 Example Applications

ContextData SourceDimensionsMetricsOutput
Monthly HC reviewARS PowerBIL2, L3, Job FamilyHC, TurnoverSlides
Ad-hoc MADA requestBudget report ExcelCost Center, LeaderSpend, HCExcel
Weekly hiring reportWorkday RecruitingReq Status, OrgOpen Reqs, TTFHTML
Quarterly talent reviewMaster Talent XLSXLevel, Perf Rating, PotentialHC, Promo RateDashboard
Job family cleanup auditWorkday exportJob Family, Job ProfileMismatch countCSV

⚠️ Anti-Patterns

❌ Sending raw exports without validation (always reconcile first)
❌ Hardcoding org dimension names (parameterize everything)
❌ Reporting on stale data without noting the refresh date
❌ Building a one-off script when a recurring dashboard would serve better
❌ Manually cutting the same data 20 ways (automate the dimension loop)
❌ Forgetting to suppress small-N cells (privacy violation risk)
❌ Emailing Excel files instead of sharing via OneDrive/SharePoint link

🌐 Platform Notes

PlatformCompatibleNotes
code-puppyActivate with
/skill org-data-pipeline
wibeyCopy SKILL.md to
~/.wibey/skills/org-data-pipeline/
CodexPaste as system prompt prefix
Any LLMPlain Markdown — paste as context

🔁 Deployment Ladder

StageTeamWhat to validate
RefineSkill OwnerReconciliation logic, metric definitions, output formatting
ProvePeer TeamsCross-org dimension cuts, data source integrations
ScaleWalmart Home OfficeSelf-serve dashboard mode, recurring schedules, multi-source joins