# Awesome-Agent-Skills-for-Empirical-Research: mooc-analytics-guide

Analyzing MOOC data, learning analytics, and online education metrics.

Clone the full repository:

```bash
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
```

Or copy just this skill into your local skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/education/mooc-analytics-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-mooc-analytics-gu && rm -rf "$T"
```

Skill file: `skills/43-wentorai-research-plugins/skills/domains/education/mooc-analytics-guide/SKILL.md`

# MOOC Analytics Guide
A skill for analyzing Massive Open Online Course data, implementing learning analytics pipelines, and extracting actionable insights from online education platforms. Covers clickstream processing, engagement modeling, dropout prediction, and A/B testing for course design.
## Data Sources and Formats

### Common MOOC Data Schemas
MOOC platforms export several standard data types (a parsing sketch for clickstream logs follows the table):
| Data Type | Description | Typical Format |
|---|---|---|
| Clickstream logs | Page views, video plays, pauses, seeks | JSON event logs |
| Forum posts | Discussion text, timestamps, thread structure | CSV/JSON |
| Grade records | Assignment scores, quiz attempts, certificates | CSV |
| Course structure | Module hierarchy, release dates, prerequisites | XML/JSON |
| Survey responses | Pre/post course surveys, demographics | CSV |
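
Clickstream logs are usually newline-delimited JSON, one event per line, with field names that vary by platform. A minimal loading sketch, assuming an edX-style layout where `username`, `event_type`, and `time` are top-level fields and the `event` payload is sometimes a JSON-encoded string (adjust the keys for your platform's export):

```python
import json
import pandas as pd

def load_clickstream(path: str) -> pd.DataFrame:
    """Parse newline-delimited JSON event logs into a flat DataFrame.

    Field names follow a common edX-style tracking-log layout;
    adjust for your platform's export.
    """
    records = []
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            payload = raw.get("event", {})
            # Some platforms serialize the payload as a JSON string.
            if isinstance(payload, str):
                payload = json.loads(payload or "{}")
            records.append({
                "user": raw.get("username"),
                "event_type": raw.get("event_type"),
                "time": raw.get("time"),
                "video_id": payload.get("id"),
                "position": payload.get("currentTime"),
            })
    df = pd.DataFrame(records)
    df["time"] = pd.to_datetime(df["time"])
    return df
```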
### Accessing Open MOOC Datasets
Several open datasets are available for research:
- MOOCdb: Standardized schema from MIT, includes clickstream, forum, and grade data
- Stanford MOOCPosts: 30,000+ labeled forum posts for sentiment and urgency classification
- Open University Learning Analytics Dataset (OULAD): Anonymized records for 30,000+ students across 7 courses
- edX Research Data Exchange: Available to institutional partners via application
```python
import pandas as pd

# Load OULAD dataset (publicly available)
students = pd.read_csv("studentInfo.csv")
assessments = pd.read_csv("assessments.csv")
interactions = pd.read_csv("studentVle.csv")

# Basic engagement metric: total clicks per student per course
engagement = (
    interactions
    .groupby(["id_student", "code_module", "code_presentation"])
    .agg(total_clicks=("sum_click", "sum"), active_days=("date", "nunique"))
    .reset_index()
)
print(engagement.describe())
```
## Engagement and Retention Analysis

### Defining Engagement Metrics
Key metrics used in learning analytics research:
- Session count: Number of distinct learning sessions (gap-based, e.g., 30-min inactivity threshold; see the sessionization sketch below)
- Time on task: Total seconds spent on content pages and videos
- Video completion ratio: Fraction of video duration actually watched
- Forum participation rate: Posts + replies per student per week
- Assignment submission rate: Fraction of graded assignments submitted on time
- Regularity index: Entropy of daily activity distribution (lower entropy = more regular)
```python
import numpy as np

def regularity_index(daily_counts: np.ndarray) -> float:
    """
    Compute regularity index based on Shannon entropy.
    Lower values indicate more regular study patterns.

    daily_counts: array of click counts per day over the course.
    """
    total = daily_counts.sum()
    # Guard against no activity and the single-day degenerate case
    # (log2(1) == 0 would divide by zero below).
    if total == 0 or len(daily_counts) < 2:
        return float("nan")
    probs = daily_counts / total
    probs = probs[probs > 0]
    entropy = -np.sum(probs * np.log2(probs))
    max_entropy = np.log2(len(daily_counts))
    return round(entropy / max_entropy, 4)  # normalized [0, 1]
```
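
The gap-based session count from the list above can be computed directly from timestamped events. A minimal sketch, assuming a clickstream DataFrame with `user_id` and datetime `timestamp` columns (column names are illustrative):

```python
import pandas as pd

def count_sessions(events: pd.DataFrame, gap_minutes: int = 30) -> pd.Series:
    """Count gap-based sessions per user.

    A new session starts whenever the gap between consecutive events
    for the same user exceeds `gap_minutes`.
    """
    events = events.sort_values(["user_id", "timestamp"])
    gaps = events.groupby("user_id")["timestamp"].diff()
    # The first event per user (NaT gap) also opens a new session.
    new_session = gaps.isna() | (gaps > pd.Timedelta(minutes=gap_minutes))
    return new_session.groupby(events["user_id"]).sum()
```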
### Dropout Prediction
Predicting which learners will drop out is a central MOOC analytics task:
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score

# weekly_features: one row per (student, week), sorted chronologically,
# with engineered weekly aggregates and a dropout label.
features = [
    "clicks_week", "video_time_week", "forum_posts_week",
    "assignments_submitted", "avg_score",
    "days_since_last_login", "regularity_index", "week_number",
]
X = weekly_features[features]
y = weekly_features["dropped_next_week"]

# Time-aware cross-validation (no future leakage)
tscv = TimeSeriesSplit(n_splits=5)
aucs = []
for train_idx, test_idx in tscv.split(X):
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4, learning_rate=0.1
    )
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict_proba(X.iloc[test_idx])[:, 1]
    aucs.append(roc_auc_score(y.iloc[test_idx], pred))

print(f"Mean AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```
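
The `weekly_features` table is assumed above rather than built. A minimal sketch of the weekly aggregation from the OULAD `interactions` frame loaded earlier, showing only `clicks_week` (the other features follow the same groupby pattern) and one simple last-active-week dropout label among several possible conventions:

```python
# OULAD's `date` column counts days relative to module start, so
# integer division by 7 gives a week index.
interactions["week"] = interactions["date"] // 7

weekly = (
    interactions
    .groupby(["id_student", "week"])
    .agg(clicks_week=("sum_click", "sum"))
    .reset_index()
    .sort_values(["id_student", "week"])
)

# Simple label: a student "drops" after their last recorded active week.
last_week = weekly.groupby("id_student")["week"].transform("max")
weekly["dropped_next_week"] = (weekly["week"] == last_week).astype(int)
```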
## Video Analytics

### Clickstream Processing for Video Events
Video interaction is the primary learning activity in MOOCs. Analyzing play, pause, seek, and speed-change events reveals learning patterns:
```python
import pandas as pd

def compute_video_metrics(events: pd.DataFrame) -> dict:
    """
    Process video clickstream events into engagement metrics.

    events: DataFrame with columns [user_id, video_id, event_type,
            timestamp, position_seconds, video_duration]
    """
    plays = events[events.event_type == "play"]
    pauses = events[events.event_type == "pause"]
    seeks = events[events.event_type == "seek"]

    total_duration = events.video_duration.iloc[0]
    watched_positions = set()
    for _, row in plays.iterrows():
        start = int(row.position_seconds)
        # Estimate a 10-second watch window per play event
        for sec in range(start, min(start + 10, int(total_duration))):
            watched_positions.add(sec)

    return {
        "play_count": len(plays),
        "pause_count": len(pauses),
        "seek_count": len(seeks),
        "coverage_ratio": len(watched_positions) / max(total_duration, 1),
        "replay_indicator": len(plays) > 1,
    }
```
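
The function expects the events of a single user-video pair, so applying it across a full log is a groupby away. A short usage sketch, assuming a `video_events` DataFrame shaped like the docstring describes:

```python
# Apply per (user, video) pair and collect results into a flat table.
video_metrics = pd.DataFrame([
    {"user_id": user, "video_id": vid, **compute_video_metrics(group)}
    for (user, vid), group in video_events.groupby(["user_id", "video_id"])
])
```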
### Optimal Video Length
Research findings on video engagement (Guo et al., 2014):
- Videos under 6 minutes have the highest engagement
- Informal talking-head videos outperform studio productions
- Tablet drawing (Khan Academy style) is more engaging than slides
- Pre-production planning matters more than production quality
## A/B Testing for Course Design

### Experimental Design in MOOCs
MOOCs provide large sample sizes ideal for randomized experiments:
- Unit of randomization: Typically the learner, but can be section or cohort
- Outcome metrics: Completion rate, quiz scores, time to completion, forum engagement
- Duration: Run for at least one full module cycle (typically 1-2 weeks)
- Power analysis: With 10,000+ enrollees, even small effects (d=0.05) are detectable
```python
from scipy.stats import norm

def mooc_power_analysis(effect_size: float, n_per_group: int,
                        alpha: float = 0.05) -> float:
    """Compute statistical power for a two-sample t-test in a MOOC A/B test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    # Noncentrality for a two-sample comparison with equal group sizes
    # is d * sqrt(n / 2).
    z_beta = effect_size * (n_per_group / 2) ** 0.5 - z_alpha
    power = norm.cdf(z_beta)
    return round(power, 4)

# Example: 5000 per group, small effects
print(mooc_power_analysis(0.05, 5000))  # ~0.71
print(mooc_power_analysis(0.1, 5000))   # ~0.999
```
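
After the experiment runs, comparing completion rates between arms reduces to a two-proportion z-test. A minimal sketch using `statsmodels` (the counts are illustrative, not real results):

```python
from statsmodels.stats.proportion import proportions_ztest

# Completions and enrollments per arm (illustrative numbers)
completions = [430, 512]   # control, treatment
enrolled = [5000, 5000]

z_stat, p_value = proportions_ztest(completions, enrolled)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```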
## Tools and Platforms
- edX Insights: Built-in analytics dashboard for edX course teams
- Google BigQuery + Coursera Research Exports: SQL-based analysis at scale
- Open edX: Self-hosted platform with full database access (MySQL + MongoDB)
- Learning Locker: Open-source Learning Record Store (xAPI compliant; a minimal statement sketch follows this list)
- MORF (MOOC Replication Framework): Docker-based reproducible analytics pipeline from University of Michigan
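
For context on the xAPI format that Learning Locker stores: statements are actor-verb-object JSON documents posted to the LRS. A minimal sketch of recording a video-completion event, with placeholder endpoint and credentials:

```python
import requests

# A minimal xAPI statement: actor, verb, object
statement = {
    "actor": {"mbox": "mailto:learner@example.com", "name": "Learner"},
    "verb": {
        "id": "http://adlnet.gov/expapi/verbs/completed",
        "display": {"en-US": "completed"},
    },
    "object": {
        "id": "https://example.org/courses/stats101/videos/week1-intro",
        "definition": {"name": {"en-US": "Week 1 intro video"}},
    },
}

# Placeholder LRS endpoint and credentials
resp = requests.post(
    "https://lrs.example.org/xAPI/statements",
    json=statement,
    headers={"X-Experience-API-Version": "1.0.3"},
    auth=("lrs_key", "lrs_secret"),
)
resp.raise_for_status()
```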
## Key References
- Guo, P.J., Kim, J., and Rubin, R. (2014). How video production affects student engagement. ACM L@S.
- Gardner, J. and Brooks, C. (2018). Student success prediction in MOOCs. User Modeling and User-Adapted Interaction.
- Reich, J. and Ruipérez-Valiente, J.A. (2019). The MOOC pivot. Science.