AutoSkill PySpark User Activity Log Analysis
Analyze website logs by joining user activity with user info using PySpark, calculating average time and popular pages, and utilizing accumulators and broadcast variables.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/pyspark-user-activity-log-analysis" ~/.claude/skills/ecnu-icalk-autoskill-pyspark-user-activity-log-analysis && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/pyspark-user-activity-log-analysis/SKILL.md
PySpark User Activity Log Analysis
Analyze website logs by joining user activity with user info using PySpark, calculating average time and popular pages, and utilizing accumulators and broadcast variables.
Prompt
Role & Objective
You are a PySpark Data Engineer. Your task is to analyze website user activity by joining two datasets: a user activity log and a user information dataset.
Operational Rules & Constraints
- Environment: Use Apache Spark and PySpark. Ensure compatibility with PySpark 1.6 on the Cloudera VM (e.g., use SQLContext instead of SparkSession, and pass UDF return types as DataType objects such as DoubleType(), not type strings).
- Data Loading: Read the datasets (e.g., CSV) into RDDs or DataFrames and cache them in memory for faster access.
- Join Operation: Perform a join operation on the 'User ID' field to combine the datasets.
- Analysis:
  - Calculate the average time spent on the website per user.
  - Identify the most popular pages visited by each user.
- Metrics Tracking: Use accumulators to keep track of specific metrics, such as the number of records processed and the number of errors encountered.
- Optimization: Use broadcast variables to efficiently share read-only data (e.g., user info) across multiple nodes.
- Error Handling: Handle potential data type issues (e.g., timestamp conversion) and resolve ambiguous column references during joins by using aliases.
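Before writing the Spark job, the join-and-aggregate logic the rules describe can be sketched in plain Python. Everything here is illustrative: the field names (`user_id`, `page`, `seconds`) and sample records are assumptions, the `user_info` dict stands in for a broadcast variable, and the two counters stand in for accumulators.

```python
from collections import Counter, defaultdict

# Illustrative records; real data would come from the CSV logs.
user_info = {"u1": "Alice", "u2": "Bob"}   # stands in for a broadcast variable
activity = [
    ("u1", "/home", 30), ("u1", "/shop", 90),
    ("u2", "/home", 45), ("u2", "/home", 15),
    ("u3", "/shop", 60),                   # no matching user -> join miss
]

records_processed = 0                      # stand-ins for accumulators
join_misses = 0

times = defaultdict(list)
pages = defaultdict(Counter)

for user_id, page, seconds in activity:
    records_processed += 1
    if user_id not in user_info:           # inner-join semantics on User ID
        join_misses += 1
        continue
    times[user_id].append(seconds)
    pages[user_id][page] += 1

# Average time per user, and each user's most-visited page.
avg_time = {u: sum(t) / len(t) for u, t in times.items()}
top_page = {u: c.most_common(1)[0][0] for u, c in pages.items()}

print(avg_time)                            # {'u1': 60.0, 'u2': 30.0}
print(top_page)                            # {'u1': '/home', 'u2': '/home'}
print(records_processed, join_misses)      # 5 1
```

In the real job, the per-record counting moves into accumulator updates inside a map, and the `user_info` lookup becomes a broadcast-variable access, so the same logic runs distributed without shipping the user table with every task.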
Anti-Patterns
- Do not use SparkSession if the environment is PySpark 1.6; use SQLContext.
- Do not ignore caching requirements for the datasets.
- Do not skip the implementation of accumulators and broadcast variables as requested.
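A 1.6-compatible skeleton that avoids the anti-patterns above might look like the following. The HDFS paths, the `user_id,page,seconds` CSV layout, and all column names are assumptions, and the `main()` body requires a Spark 1.6 environment (e.g., `spark-submit` on the Cloudera VM); only `parse_line` runs standalone.

```python
def parse_line(line):
    """Split an assumed 'user_id,page,seconds' row; None on malformed input."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return (parts[0], parts[1], float(parts[2]))
    except ValueError:
        return None


def main():
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    from pyspark.sql.functions import avg, col, count, desc, udf
    from pyspark.sql.types import DoubleType

    sc = SparkContext(appName="UserActivityAnalysis")
    sqlContext = SQLContext(sc)        # 1.6 entry point; no SparkSession yet

    records = sc.accumulator(0)        # accumulators for metrics tracking
    errors = sc.accumulator(0)

    def to_row(line):
        records.add(1)
        parsed = parse_line(line)
        if parsed is None:
            errors.add(1)              # count bad rows instead of crashing
            return []
        uid, page, secs = parsed
        return [Row(user_id=uid, page=page, seconds=secs)]

    # 1.6 has no native CSV reader, so build DataFrames from text RDDs; cache both.
    activity = sqlContext.createDataFrame(
        sc.textFile("hdfs:///data/activity.csv").flatMap(to_row)).cache()
    users = sqlContext.createDataFrame(
        sc.textFile("hdfs:///data/users.csv")
          .map(lambda l: Row(user_id=l.split(",")[0], name=l.split(",")[1]))).cache()

    # Broadcast the small user-info table instead of shuffling it to every task.
    info = sc.broadcast(dict(users.rdd.map(lambda r: (r.user_id, r.name)).collect()))

    # 1.6 UDFs take a DataType *object* as the return type, not a string.
    minutes = udf(lambda s: s / 60.0, DoubleType())

    # Aliases disambiguate the shared 'user_id' column in the join.
    a, u = activity.alias("a"), users.alias("u")
    joined = a.join(u, col("a.user_id") == col("u.user_id")).cache()

    avg_time = joined.groupBy(col("a.user_id")) \
                     .agg(avg(minutes(col("a.seconds"))).alias("avg_minutes"))
    top_pages = joined.groupBy(col("a.user_id"), col("a.page")) \
                      .agg(count("*").alias("visits")).orderBy(desc("visits"))

    avg_time.show()
    top_pages.show()
    print("records=%d errors=%d" % (records.value, errors.value))


if __name__ == "__main__":
    main()   # requires Spark 1.6, e.g., spark-submit on the Cloudera VM
```

Keeping the parser pure and pushing the accumulator updates into the `flatMap` closure keeps the error handling testable locally while the counts are still tracked per-record on the cluster.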
Triggers
- analyze user activity logs with pyspark
- join user info and activity datasets in spark
- calculate average time and popular pages using pyspark
- pyspark script with accumulators and broadcast variables
- cloudera vm pyspark 1.6 data analysis