claude-skill-registry · data-pipeline-architect
Designs ETL/ELT data pipelines with proper extraction, transformation, and loading patterns, including orchestration, error handling, and data quality validation.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-pipeline-architect" ~/.claude/skills/majiayu000-claude-skill-registry-data-pipeline-architect && rm -rf "$T"
manifest:
skills/data/data-pipeline-architect/SKILL.md
Data Pipeline Architect
This skill provides guidance for designing robust, scalable data pipelines that move data reliably from sources to destinations.
Core Competencies
- ETL vs ELT: Traditional Extract-Transform-Load vs modern Extract-Load-Transform patterns
- Orchestration: Airflow, Dagster, Prefect, dbt for workflow management
- Data Quality: Validation, monitoring, lineage tracking
- Scalability: Batch vs streaming, partitioning, parallelization
Pipeline Design Process
1. Requirements Analysis
To begin pipeline design, gather:
- Source systems and data formats (APIs, databases, files, streams)
- Target destinations (data warehouse, lake, lakehouse)
- Freshness requirements (real-time, hourly, daily)
- Data volume and velocity estimates
- Quality and compliance requirements
2. Architecture Selection
Batch Pipelines - For periodic bulk processing:
- Schedule-driven (hourly, daily, weekly)
- Higher latency tolerance
- Simpler error recovery (re-run entire batch)
- Tools: Airflow, dbt, Spark
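A minimal sketch of the schedule-driven pattern above, assuming Airflow 2.4+ and the TaskFlow API; the DAG name, sample data, and retry settings are illustrative placeholders rather than a prescribed design.

```python
# Minimal daily batch DAG using Airflow's TaskFlow API (Airflow 2.4+ assumed
# for the `schedule` argument). Names and data are placeholders.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_orders_pipeline():

    @task(retries=3, retry_delay=timedelta(minutes=5))  # retry transient source failures
    def extract() -> list[dict]:
        # Placeholder: pull the previous day's rows from the source system
        return [{"order_id": 1, "amount": "19.99"}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Cast types and apply business rules before loading
        return [{**r, "amount": float(r["amount"])} for r in rows]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write to the warehouse partition, ideally idempotently
        print(f"loading {len(rows)} rows")

    load(transform(extract()))


daily_orders_pipeline()
```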
Streaming Pipelines - For real-time requirements:
- Event-driven processing
- Sub-second to minute latency
- Complex state management
- Tools: Kafka, Flink, Spark Streaming
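For the event-driven side, a schematic consumer loop, assuming kafka-python, a local broker, and a hypothetical `orders` topic; error handling is omitted, and the manual commit ties checkpointing to successful processing.

```python
# Schematic event-driven consumer using kafka-python; broker address and
# topic name are assumptions for illustration only.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    enable_auto_commit=False,                     # commit only after processing succeeds
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # ... transform the event and write it to the sink ...
    consumer.commit()                             # explicit checkpoint per processed message
```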
Hybrid Approaches - Lambda or Kappa architecture:
- Lambda: batch layer for completeness, speed layer for low latency, serving layer to merge results for queries
- Kappa: a single streaming layer that replays the event log for reprocessing, avoiding duplicate batch and streaming code paths
3. ETL vs ELT Decision
ETL (Transform before Load):
- When target has limited compute
- When transformation reduces data volume significantly
- When sensitive data must be masked before landing
- Legacy data warehouse patterns
ELT (Transform after Load):
- Modern cloud warehouses with cheap compute
- When raw data preservation is needed
- When transformations change frequently
- dbt-style transformations in warehouse
4. Pipeline Components
Extraction Layer:
- Full extraction vs incremental (CDC, timestamp-based)
- API pagination and rate limiting
- Connection pooling and retry logic
- Schema detection and drift handling
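As one possible shape for this layer, a hedged sketch of timestamp-based incremental extraction over a paginated API with simple backoff; the endpoint path, page parameters, and watermark field are assumptions, not a real API.

```python
# Incremental, paginated extraction with retry/backoff; all endpoint and
# parameter names below are hypothetical.
import time

import requests


def get_with_retry(url: str, params: dict, max_retries: int = 3) -> requests.Response:
    """GET with exponential backoff on rate limits and transient server errors."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code in (429, 502, 503):
            time.sleep(2 ** attempt)          # back off: 1s, 2s, 4s
            continue
        resp.raise_for_status()
        return resp
    resp.raise_for_status()                   # give up after max_retries
    return resp


def extract_incremental(base_url: str, last_run_ts: str):
    """Yield records updated since the last run, one page at a time."""
    page = 1
    while True:
        resp = get_with_retry(
            f"{base_url}/records",
            {"updated_since": last_run_ts, "page": page, "per_page": 500},
        )
        rows = resp.json()
        if not rows:
            break                              # empty page: extraction complete
        yield from rows
        page += 1
```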
Transformation Layer:
- Data cleansing and standardization
- Business logic application
- Aggregation and denormalization
- Type casting and null handling
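A small illustrative transformation step in pandas covering casting, standardization, and null handling; the column names and the business rule are placeholders.

```python
# Illustrative cleansing/standardization step; schema is hypothetical.
import pandas as pd


def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Type casting and standardization
    df["order_id"] = df["order_id"].astype("int64")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # bad values -> NaN
    df["country"] = df["country"].str.strip().str.upper()
    # Null handling: drop rows missing required keys, default the rest
    df = df.dropna(subset=["order_id"])
    df["amount"] = df["amount"].fillna(0.0)
    # Simple business rule applied as a derived column
    df["is_large_order"] = df["amount"] >= 1000
    return df
```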
Loading Layer:
- Upsert strategies (merge, delete+insert)
- Partitioning schemes (time, hash, range)
- Index management
- Transaction boundaries
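To show where the transaction boundary sits for an idempotent partition reload, a sketch using psycopg2 against a hypothetical Postgres target; the `target` and `staging` tables mirror the SQL patterns shown later in this document.

```python
# Partition reload wrapped in a single transaction: either both statements
# commit or neither does. DSN and table names are placeholders.
import psycopg2


def reload_partition(dsn: str, ds: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:                              # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("DELETE FROM target WHERE date_partition = %s", (ds,))
                cur.execute(
                    "INSERT INTO target SELECT * FROM staging WHERE date_partition = %s",
                    (ds,),
                )
    finally:
        conn.close()
```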
5. Error Handling Patterns
```
┌──────────────────────────────────────────────────┐
│                Pipeline Execution                │
├──────────────────────────────────────────────────┤
│   ┌─────────┐    ┌───────────┐    ┌──────────┐   │
│   │ Extract │───▶│ Transform │───▶│   Load   │   │
│   └────┬────┘    └─────┬─────┘    └────┬─────┘   │
│        │               │               │         │
│        ▼               ▼               ▼         │
│   ┌─────────┐    ┌───────────┐    ┌──────────┐   │
│   │  Retry  │    │Dead Letter│    │ Rollback │   │
│   │w/Backoff│    │   Queue   │    │Checkpoint│   │
│   └─────────┘    └───────────┘    └──────────┘   │
└──────────────────────────────────────────────────┘
```
- Retry with backoff: Transient failures (network, rate limits)
- Dead letter queues: Poison messages that can't be processed
- Checkpointing: Resume from last successful point
- Idempotency: Safe to re-run without duplicates
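A rough per-record illustration of the first two patterns, retry with backoff and a dead letter queue; `process_record` and the file-based DLQ sink are stand-ins for whatever the pipeline actually uses.

```python
# Retry with exponential backoff, then park poison records in a dead letter
# file; all names here are hypothetical.
import json
import random
import time


def process_with_retry(record: dict, process_record, dead_letter_path: str = "dlq.jsonl",
                       max_retries: int = 5) -> bool:
    for attempt in range(max_retries):
        try:
            process_record(record)
            return True                                   # success
        except Exception as exc:
            last_error = exc
            time.sleep((2 ** attempt) + random.random())  # backoff with jitter
    # Poison message: record it for later inspection instead of blocking the run
    with open(dead_letter_path, "a") as dlq:
        dlq.write(json.dumps({"record": record, "error": str(last_error)}) + "\n")
    return False
```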
6. Data Quality Framework
Implement checks at each stage:
| Stage | Check Type | Example |
|---|---|---|
| Extract | Completeness | Row count matches source |
| Extract | Freshness | Data timestamp within SLA |
| Transform | Validity | Values in expected ranges |
| Transform | Uniqueness | Primary keys unique |
| Load | Reconciliation | Target matches source totals |
| Load | Integrity | Foreign keys valid |
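These checks can be as simple as plain assertions run at stage boundaries; the sketch below illustrates completeness, validity, and uniqueness checks with hypothetical column names and thresholds.

```python
# Minimal stage-level quality checks; column names and ranges are assumptions.
def check_completeness(source_count: int, extracted_count: int) -> None:
    assert extracted_count == source_count, (
        f"row count mismatch: source={source_count}, extracted={extracted_count}"
    )


def check_validity(rows: list[dict]) -> None:
    bad = [r for r in rows if not (0 <= r["amount"] <= 1_000_000)]
    assert not bad, f"{len(bad)} rows outside expected amount range"


def check_uniqueness(rows: list[dict], key: str = "order_id") -> None:
    keys = [r[key] for r in rows]
    assert len(keys) == len(set(keys)), f"duplicate values found in {key}"
```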
7. Monitoring and Observability
Essential metrics to track:
- Pipeline duration and trends
- Row counts at each stage
- Error rates and types
- Data freshness (time since last successful run)
- Resource utilization
Alert on:
- SLA breaches (data not fresh)
- Anomalous row counts (±20% from baseline)
- Schema changes in sources
- Repeated failures
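A compact sketch of the freshness and row-count alerts described above; where metrics are stored and how `send_alert` delivers notifications are left as assumptions.

```python
# Freshness SLA and row-count anomaly checks against a rolling baseline;
# `send_alert` and the metric history are assumed to exist elsewhere.
from datetime import datetime, timedelta
from statistics import mean


def check_run(row_count: int, history: list[int], last_success: datetime,
              freshness_sla: timedelta, send_alert) -> None:
    # Freshness: alert when the last successful run exceeds the SLA window
    if datetime.utcnow() - last_success > freshness_sla:
        send_alert(f"freshness SLA breached; last success at {last_success}")
    # Volume anomaly: alert on +/-20% deviation from the recent baseline
    if history:
        baseline = mean(history)
        if abs(row_count - baseline) > 0.2 * baseline:
            send_alert(f"row count {row_count} deviates >20% from baseline {baseline:.0f}")
```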
Common Patterns
Slowly Changing Dimensions (SCD)
- Type 1: Overwrite (no history)
- Type 2: Add row with validity dates
- Type 3: Previous value column
- Type 4: History table
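For Type 2 specifically, a rough pandas sketch that closes out changed rows and appends new versions with validity dates; the key, attribute, and validity column names are placeholders, and a NULL `valid_to` is used here as the "current row" convention.

```python
# SCD Type 2 merge sketch: expire changed current rows, append new versions.
# Schema (key/attrs/valid_from/valid_to) is an assumption for illustration.
import pandas as pd


def scd2_merge(dim: pd.DataFrame, incoming: pd.DataFrame, key: str,
               attrs: list[str], as_of: pd.Timestamp) -> pd.DataFrame:
    dim = dim.copy()
    current = dim[dim["valid_to"].isna()]          # open-ended rows are "current"

    # Compare current attribute values against the incoming snapshot
    joined = current.merge(incoming, on=key, suffixes=("", "_new"))
    changed = (joined[attrs].values != joined[[a + "_new" for a in attrs]].values).any(axis=1)
    changed_keys = set(joined.loc[changed, key])

    # Close out rows whose attributes changed as of this load
    to_close = dim[key].isin(changed_keys) & dim["valid_to"].isna()
    dim.loc[to_close, "valid_to"] = as_of

    # Append new versions for changed keys plus keys never seen before
    new_keys = changed_keys | (set(incoming[key]) - set(current[key]))
    new_rows = incoming[incoming[key].isin(new_keys)].copy()
    new_rows["valid_from"] = as_of
    new_rows["valid_to"] = pd.NaT
    return pd.concat([dim, new_rows], ignore_index=True)
```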
Incremental Processing
```sql
-- Timestamp-based incremental
SELECT * FROM source
WHERE updated_at > {{ last_run_timestamp }}

-- CDC-based (Change Data Capture)
-- Captures inserts, updates, deletes from transaction log
```
Idempotent Loads
```sql
-- Delete + Insert pattern
DELETE FROM target WHERE date_partition = '2024-01-15';
INSERT INTO target SELECT * FROM staging WHERE date_partition = '2024-01-15';

-- Merge/Upsert pattern
MERGE INTO target t
USING staging s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```
References
- Airflow, Dagster, Prefect patterns: references/orchestration-patterns.md
- Validation frameworks and rules: references/data-quality-checks.md
- Common pipeline architectures: references/pipeline-templates.md