install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/apache-spark" ~/.claude/skills/comeonoliver-skillshub-apache-spark && rm -rf "$T"
manifest:
skills/TerminalSkills/skills/apache-spark/SKILL.md
Apache Spark
Overview
Apache Spark is the de facto standard engine for distributed data processing. It handles batch processing, streaming, SQL, machine learning, and graph processing, and PySpark provides its Python API. Spark runs on standalone clusters, YARN, Kubernetes, or managed services (Databricks, EMR, Dataproc).
Instructions
Step 1: PySpark Setup
```bash
pip install pyspark
```
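Once installed, a local session is enough to experiment before touching a cluster. A minimal sketch (assumes a Java runtime is available locally; `local[*]` uses all cores):

```python
# Minimal local session for development (sketch; assumes a local Java runtime)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # run on all local cores, no cluster needed
    .appName("local-dev")
    .getOrCreate()
)
print(spark.version)      # confirm PySpark is working
spark.stop()
```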
Step 2: DataFrame Operations
```python
# etl/process.py — PySpark data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/raw/events/")

# Transform
processed = (df
    .filter(F.col("event_type").isin(["purchase", "signup"]))
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("revenue", F.col("amount") * F.col("quantity"))
    .groupBy("date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("revenue").alias("total_revenue"),
        F.countDistinct("user_id").alias("unique_users"),
    )
    .orderBy("date")
)

# Write results
processed.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/daily_metrics/")
```
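To see what Catalyst actually plans for a pipeline like the one above, you can inspect the physical plan and cache any intermediate DataFrame that several jobs reuse. A small sketch, separate from the pipeline itself:

```python
# Show the plan Catalyst generated for the aggregation (Spark 3.0+)
processed.explain(mode="formatted")

# Cache an intermediate DataFrame that downstream jobs reuse, then materialize it
events = df.filter(F.col("event_type").isin(["purchase", "signup"])).cache()
events.count()
```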
Step 3: SQL Interface
```python
# Register as SQL table
df.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT
        date_trunc('month', timestamp) as month,
        COUNT(DISTINCT user_id) as monthly_active_users,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) as revenue
    FROM events
    WHERE timestamp >= '2025-01-01'
    GROUP BY 1
    ORDER BY 1
""")
result.show()
```
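The result of `spark.sql` is an ordinary DataFrame, so SQL and the DataFrame API compose freely. A short sketch; note that `toPandas()` collects everything to the driver (and requires pandas), so use it only on small results:

```python
# SQL output is a regular DataFrame and can be transformed further
top_months = result.orderBy(F.col("revenue").desc()).limit(3)

# Collect a small result to the driver as a pandas DataFrame
print(top_months.toPandas())
```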
Step 4: Structured Streaming
```python
# Real-time processing from Kafka
# Note: the Kafka source requires the spark-sql-kafka connector package on the classpath
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Example payload schema (adjust fields to match your events)
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")

query = parsed \
    .groupBy(F.window("timestamp", "5 minutes"), "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
```
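In production the console sink is usually replaced by a durable sink with a checkpoint location so the query can recover after restarts. A hedged sketch that writes the parsed events to Parquet; the paths and trigger interval are placeholders:

```python
# Durable file sink with checkpointing (paths are illustrative placeholders)
file_query = parsed.writeStream \
    .format("parquet") \
    .option("path", "s3://bucket/streaming/events/") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events/") \
    .outputMode("append") \
    .trigger(processingTime="1 minute") \
    .start()

file_query.awaitTermination()  # block until the query stops or fails
```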
Guidelines
- Use DataFrames (not RDDs) for most work; they are optimized by the Catalyst query optimizer.
- Partitioning is critical for performance; partition output by date or other low-cardinality columns, and avoid high-cardinality partition keys, which produce many small files (see the sketch after this list).
- For managed Spark, consider Databricks (easiest), AWS EMR, or GCP Dataproc.
- PySpark's API resembles pandas, but execution is lazy and distributed; think in columns and whole-DataFrame transformations, not row-by-row loops.
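A brief sketch of the partitioning guideline above; the bucket paths are placeholders:

```python
# Write partitioned by a low-cardinality column so readers can prune directories
(df.withColumn("date", F.to_date("timestamp"))
   .repartition("date")                     # group rows per date before writing
   .write.mode("overwrite")
   .partitionBy("date")
   .parquet("s3://bucket/events_by_date/"))

# Filters on the partition column only scan the matching directories
recent = spark.read.parquet("s3://bucket/events_by_date/") \
    .filter(F.col("date") >= "2025-06-01")
```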