apache-spark
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/apache-spark" ~/.claude/skills/terminalskills-skills-apache-spark && rm -rf "$T"
manifest: skills/apache-spark/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Apache Spark
Overview
Apache Spark is the de facto standard engine for distributed data processing. It handles batch processing, streaming, SQL, machine learning, and graph processing, and PySpark provides its Python API. Spark runs on standalone clusters, YARN, Kubernetes, or managed services (Databricks, EMR, Dataproc).
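Those deployment targets are chosen when the session is built or on the spark-submit command line. A minimal sketch follows; the master URLs are placeholders, not real endpoints.
from pyspark.sql import SparkSession

# Local mode: executors run as threads inside this process, no cluster needed.
spark = SparkSession.builder.master("local[*]").appName("overview-demo").getOrCreate()

# On a real cluster the master is usually supplied to spark-submit instead, e.g.:
#   spark-submit --master yarn app.py
#   spark-submit --master spark://<standalone-master>:7077 app.py
#   spark-submit --master k8s://https://<k8s-api-server>:6443 app.py
# Managed services (Databricks, EMR, Dataproc) provision the cluster for you.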
Instructions
Step 1: PySpark Setup
pip install pyspark
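A quick way to confirm the install is to run a trivial local job. A minimal smoke-test sketch, assuming a compatible Java runtime is on the machine (PySpark requires one); the filename and column names are just examples.
# smoke_test.py - verify that PySpark starts and can run a local job
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.createDataFrame([(1, "ok"), (2, "also ok")], ["id", "status"]).show()
spark.stop()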
Step 2: DataFrame Operations
# etl/process.py - PySpark data processing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/raw/events/")

# Transform
processed = (df
    .filter(F.col("event_type").isin(["purchase", "signup"]))
    .withColumn("date", F.to_date("timestamp"))
    .withColumn("revenue", F.col("amount") * F.col("quantity"))
    .groupBy("date", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("revenue").alias("total_revenue"),
        F.countDistinct("user_id").alias("unique_users"),
    )
    .orderBy("date")
)

# Write results
processed.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/processed/daily_metrics/")
Step 3: SQL Interface
# Register as a SQL table
df.createOrReplaceTempView("events")

result = spark.sql("""
    SELECT
        date_trunc('month', timestamp) AS month,
        COUNT(DISTINCT user_id) AS monthly_active_users,
        SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
    FROM events
    WHERE timestamp >= '2025-01-01'
    GROUP BY 1
    ORDER BY 1
""")
result.show()
Step 4: Structured Streaming
# Real-time processing from Kafka
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Schema for the JSON payload (example fields; adjust to the actual event format)
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("timestamp", TimestampType()),
])

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")

query = parsed \
    .groupBy(F.window("timestamp", "5 minutes"), "event_type") \
    .count() \
    .writeStream \
    .outputMode("update") \
    .format("console") \
    .start()
Guidelines
- Use DataFrames (not RDDs) for most work; they are optimized by the Catalyst query optimizer.
- Partitioning is critical for performance: write output partitioned by a low-cardinality column such as date, and repartition in memory by high-cardinality keys to balance work across executors (see the sketch after this list).
- For managed Spark, consider Databricks (easiest), AWS EMR, or GCP Dataproc.
- PySpark's API mirrors Pandas, but execution is distributed and lazy: think in columns and whole-DataFrame transformations, not row-by-row loops.
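A minimal sketch of the two kinds of partitioning mentioned above; the repartition count, output path, and column names are illustrative, not taken from this skill's pipeline.
from pyspark.sql import functions as F

# In-memory partitioning: repartition by a high-cardinality key so work is
# spread evenly across executors before expensive wide operations.
events = spark.read.parquet("s3://bucket/raw/events/")  # illustrative path
balanced = events.repartition(200, "user_id")

# On-disk partitioning: write Hive-style partitions by a low-cardinality
# column such as date, so later reads can prune whole directories.
(balanced
    .withColumn("date", F.to_date("timestamp"))
    .write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://bucket/processed/events_by_date/"))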