claude-skill-registry: data-exploration
Explore and analyze pilot datasets to uncover patterns, anomalies, and initial insights. Use when performing ad-hoc data investigations, validating data quality, or preparing exploratory visualizations for hypothesis generation.
Clone the full registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-exploration" ~/.claude/skills/majiayu000-claude-skill-registry-data-exploration && rm -rf "$T"
```
skills/data/data-exploration/SKILL.md

Data Exploration
Overview
The data-exploration skill enables systematic investigation of datasets to uncover patterns, validate quality, and surface insights that inform decision-making. Use it when you need to understand a new dataset, validate data quality before analysis, or generate initial findings for hypothesis formation.
Quick Start
Use this skill when you need to:
- Understand a new dataset: Get initial feel for structure, quality, and contents
- Validate data quality: Check for completeness, accuracy, and consistency
- Find patterns: Identify trends, correlations, or anomalies
- Generate insights: Surface findings that could inform strategic decisions
- Prepare for hypothesis testing: Establish baseline understanding before formal analysis
Workflow Decision Tree
```
New Dataset → Data Profiling → Quality Assessment → Pattern Discovery → Insight Generation
                    ↓                   ↓                    ↓                    ↓
          [Statistical Summary]  [Quality Report]    [Visual Analysis]   [Intelligence Brief]
```
Data Profiling
Step 1: Initial Assessment
When you receive a new dataset, start with basic profiling:
```python
import pandas as pd
import numpy as np


def profile_dataset(df: pd.DataFrame) -> dict:
    """Generate initial dataset profile."""
    profile = {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.to_dict(),
        "memory_usage": df.memory_usage(deep=True).sum() / 1024**2,  # MB
        "missing_values": df.isnull().sum().to_dict(),
        "duplicate_rows": df.duplicated().sum()
    }
    return profile
```
Output: Dataset profile with shape, columns, types, memory usage, missing values
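A minimal usage sketch (the file name is a placeholder, not a real dataset):

```python
# Quick first pass over a new dataset; "tickets.csv" is illustrative.
df = pd.read_csv("tickets.csv")

profile = profile_dataset(df)
print(f"Shape: {profile['shape']}, memory: {profile['memory_usage']:.1f} MB")
print(f"Duplicate rows: {profile['duplicate_rows']}")

# Columns with missing values, worst first
missing = {c: n for c, n in profile['missing_values'].items() if n > 0}
for col, n in sorted(missing.items(), key=lambda kv: -kv[1]):
    print(f"  {col}: {n} missing")
```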
Step 2: Statistical Summary
Generate descriptive statistics for numerical and categorical columns:
```python
def generate_statistical_summary(df: pd.DataFrame) -> dict:
    """Create statistical summary of dataset."""
    summary = {}

    # Numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        summary['numerical'] = df[numeric_cols].describe().to_dict()

    # Categorical columns
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(categorical_cols) > 0:
        summary['categorical'] = {
            col: {
                'unique_values': df[col].nunique(),
                'top_5_values': df[col].value_counts().head(5).to_dict(),
                'missing': df[col].isnull().sum()
            }
            for col in categorical_cols
        }

    # Temporal columns
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    if len(datetime_cols) > 0:
        summary['temporal'] = {
            col: {
                'min': df[col].min(),
                'max': df[col].max(),
                'range_days': (df[col].max() - df[col].min()).days
            }
            for col in datetime_cols
        }

    return summary
```
Output: Statistical summary organized by column type
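One way to act on the summary is to flag high-cardinality categorical columns, which are often free-text fields or IDs rather than true categories. A minimal sketch (the 50-value cutoff is an illustrative assumption):

```python
summary = generate_statistical_summary(df)

# Flag categorical columns with many distinct values (illustrative cutoff)
for col, stats in summary.get('categorical', {}).items():
    if stats['unique_values'] > 50:
        print(f"{col}: {stats['unique_values']} unique values; likely free text or an ID")
```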
Quality Assessment
Data Quality Checks
Apply systematic quality checks to identify issues:
```python
def assess_data_quality(df: pd.DataFrame) -> dict:
    """Assess data quality and identify issues."""
    quality_report = {
        "completeness": {},
        "consistency": {},
        "accuracy": {},
        "issues": []
    }

    # Completeness: missing values
    missing_pct = (df.isnull().sum() / len(df) * 100).to_dict()
    quality_report["completeness"] = {
        col: f"{pct:.2f}%" for col, pct in missing_pct.items() if pct > 0
    }
    # Flag columns with >20% missing
    for col, pct in missing_pct.items():
        if pct > 20:
            quality_report["issues"].append({
                "type": "completeness",
                "severity": "high" if pct > 50 else "medium",
                "column": col,
                "message": f"{col} has {pct:.2f}% missing values"
            })

    # Consistency: duplicate rows
    dup_count = df.duplicated().sum()
    if dup_count > 0:
        dup_pct = dup_count / len(df) * 100
        quality_report["consistency"]["duplicate_rows"] = f"{dup_count} ({dup_pct:.2f}%)"
        if dup_pct > 5:
            quality_report["issues"].append({
                "type": "consistency",
                "severity": "high" if dup_pct > 10 else "medium",
                "message": f"{dup_count} duplicate rows found ({dup_pct:.2f}%)"
            })

    # Consistency: numeric values stored as text
    for col in df.select_dtypes(include=['object']).columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        numeric_pct = converted.notna().sum() / max(df[col].notna().sum(), 1)
        if numeric_pct > 0.8:  # >80% of non-null values parse as numbers
            quality_report["issues"].append({
                "type": "consistency",
                "severity": "low",
                "column": col,
                "message": f"{col} appears to contain numeric values but is stored as text"
            })

    # Accuracy: extreme outliers in numerical columns (3x IQR fences)
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 3 * IQR) | (df[col] > Q3 + 3 * IQR)][col]
        if len(outliers) > 0:
            outlier_pct = len(outliers) / len(df) * 100
            quality_report["accuracy"][col] = f"{len(outliers)} outliers ({outlier_pct:.2f}%)"
            if outlier_pct > 5:
                quality_report["issues"].append({
                    "type": "accuracy",
                    "severity": "medium" if outlier_pct < 10 else "high",
                    "column": col,
                    "message": f"{col} has {len(outliers)} potential outliers ({outlier_pct:.2f}%)"
                })

    return quality_report
```
Output: Quality report with completeness, consistency, accuracy issues
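The issues list is built for triage. A minimal sketch that prints issues by severity and gates deeper analysis on high-severity findings (the gate itself is an assumption about your workflow, not part of the function above):

```python
report = assess_data_quality(df)

# Print issues, most severe first
severity_order = {"high": 0, "medium": 1, "low": 2}
for issue in sorted(report["issues"], key=lambda i: severity_order[i["severity"]]):
    print(f"[{issue['severity'].upper()}] {issue['type']}: {issue['message']}")

# Simple gate: resolve high-severity issues before pattern discovery
if any(i["severity"] == "high" for i in report["issues"]):
    print("High-severity issues found; fix these before deeper analysis.")
```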
Duke Data Governance Alignment
When working with Duke data, ensure alignment with data governance standards:
Check:
- Data classification identified (Public, Internal, Confidential, Restricted)
- Privacy requirements assessed (PII, FERPA, HIPAA)
- Data retention policy noted
- Access controls documented
- Data lineage tracked
Reference: references/duke_data_governance.md
Pattern Discovery
Visual Analysis
Create exploratory visualizations to identify patterns:
```python
import os

import matplotlib.pyplot as plt
import seaborn as sns


def explore_patterns_visual(df: pd.DataFrame, output_dir: str = "./outputs"):
    """Generate exploratory visualizations."""
    os.makedirs(output_dir, exist_ok=True)

    # Distribution plots for numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        fig, axes = plt.subplots(len(numeric_cols), 1, figsize=(10, 3 * len(numeric_cols)))
        if len(numeric_cols) == 1:
            axes = [axes]
        for idx, col in enumerate(numeric_cols):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black')
            axes[idx].set_title(f'Distribution of {col}')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
        plt.tight_layout()
        plt.savefig(f"{output_dir}/distributions.png")
        plt.close()

    # Correlation heatmap
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        correlation = df[numeric_cols].corr()
        sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0)
        plt.title('Correlation Heatmap')
        plt.tight_layout()
        plt.savefig(f"{output_dir}/correlation_heatmap.png")
        plt.close()

    # Time series if datetime column exists
    datetime_cols = df.select_dtypes(include=['datetime64']).columns
    if len(datetime_cols) > 0 and len(numeric_cols) > 0:
        for date_col in datetime_cols:
            for num_col in numeric_cols[:3]:  # First 3 numeric columns
                plt.figure(figsize=(12, 4))
                df.sort_values(date_col).plot(x=date_col, y=num_col, ax=plt.gca())
                plt.title(f'{num_col} over time')
                plt.tight_layout()
                # Include the date column in the filename so plots for
                # different datetime columns do not overwrite each other
                plt.savefig(f"{output_dir}/timeseries_{date_col}_{num_col}.png")
                plt.close()
```
Output: Distribution plots, correlation heatmap, time series visualizations
Anomaly Detection
Identify unusual patterns or outliers:
```python
from sklearn.ensemble import IsolationForest


def detect_anomalies(df: pd.DataFrame, contamination: float = 0.1) -> pd.DataFrame:
    """Detect anomalies using Isolation Forest."""
    # Select numerical columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        return pd.DataFrame()

    # Prepare data: impute missing values with column medians
    X = df[numeric_cols].fillna(df[numeric_cols].median())

    # Detect anomalies (-1 = anomaly, 1 = normal)
    clf = IsolationForest(contamination=contamination, random_state=42)
    anomaly_labels = clf.fit_predict(X)

    # Return only the rows flagged as anomalous
    df_with_anomalies = df.copy()
    df_with_anomalies['is_anomaly'] = anomaly_labels == -1
    return df_with_anomalies[df_with_anomalies['is_anomaly']]
```
Output: Dataframe containing detected anomalies
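The contamination parameter is the expected fraction of anomalies and strongly shapes the result. A small sketch comparing a few settings (the values are illustrative starting points, not recommendations):

```python
# Lower contamination flags fewer, more extreme rows
for c in (0.01, 0.05, 0.1):
    flagged = detect_anomalies(df, contamination=c)
    print(f"contamination={c}: {len(flagged)} rows flagged")

# Pull one set for manual review
anomalies = detect_anomalies(df, contamination=0.05)
print(anomalies.head())
```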
Insight Generation
Generate Intelligence Brief
Synthesize findings into actionable intelligence:
Template Structure:
```markdown
# Data Exploration Report: [Dataset Name]

**Date**: [YYYY-MM-DD]
**Analyst**: Mayuri - Captain of Intelligence
**Dataset**: [description]

## Executive Summary
[2-3 sentences: Key findings and recommendations]

## Dataset Overview
- **Rows**: [count]
- **Columns**: [count]
- **Time Period**: [date range if applicable]
- **Data Quality Score**: [High/Medium/Low]

## Key Findings

### 1. [Finding Title]
**Observation**: [What was found]
**Evidence**: [Supporting statistics/visualizations]
**Implication**: [What this means for Duke]

### 2. [Finding Title]
...

## Data Quality Assessment
- **Completeness**: [summary of missing values]
- **Consistency**: [summary of duplicates, format issues]
- **Accuracy**: [summary of outliers, validation results]

### Issues Requiring Attention
- [ ] **High Priority**: [Issue description and recommended action]
- [ ] **Medium Priority**: [Issue description and recommended action]

## Patterns & Trends
- [Pattern 1]: [Description with supporting data]
- [Pattern 2]: [Description with supporting data]

## Anomalies Detected
- [Anomaly 1]: [Description, potential cause, recommended follow-up]
- [Anomaly 2]: [Description, potential cause, recommended follow-up]

## Recommendations
1. [Action item with rationale]
2. [Action item with rationale]
3. [Action item with rationale]

## Next Steps
- [ ] [Follow-up analysis needed]
- [ ] [Data quality improvements]
- [ ] [Stakeholder review required]

## Appendix
- Data Governance Classification: [Public/Internal/Confidential/Restricted]
- Privacy Considerations: [PII/FERPA/HIPAA notes]
- Data Sources: [List with access dates]
```
Duke-Specific Considerations
Alignment with 2026 Goals
When exploring data, always consider alignment with Duke's 2026 goals:
- Goal: Improve student service response times by 30%
  - Data to explore: Service desk tickets, response times, resolution times
  - Look for: Patterns in request types, peak times, staff utilization (see the sketch after this list)
- Goal: Enhance research computing capabilities
  - Data to explore: HPC usage, grant activity, researcher feedback
  - Look for: Utilization trends, capacity constraints, service gaps
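For the first goal, the "request types" and "peak times" patterns can be checked with quick aggregations. A minimal sketch, assuming hypothetical columns created_at, request_type, and response_minutes (names will differ in the real service desk extract):

```python
# Hypothetical column and file names; adjust to the actual extract.
tickets = pd.read_csv("service_desk_tickets.csv", parse_dates=["created_at"])

# Median response time by request type: which categories are slowest?
print(tickets.groupby("request_type")["response_minutes"].median()
             .sort_values(ascending=False).head(10))

# Ticket volume by hour of day: where are the peaks?
print(tickets["created_at"].dt.hour.value_counts().sort_index())
```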
OIT Priority Alignment
Connect findings to OIT priorities:
- Infrastructure modernization
- Security posture improvement
- User experience enhancement
- Operational efficiency
Reference: references/duke_2026_goals.md
Common Use Cases
Use Case 1: Pilot Data Validation
Scenario: Strategy Captain requests validation of data for AI service desk pilot
Workflow (an end-to-end sketch follows this list):
- Load service desk ticket data
- Profile dataset (shape, columns, types)
- Assess quality (missing values, duplicates, consistency)
- Explore patterns (common issues, peak times, resolution patterns)
- Detect anomalies (unusual tickets, outlier response times)
- Generate intelligence brief for Strategy Captain
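A minimal end-to-end sketch chaining the functions defined earlier (file and directory paths are placeholders):

```python
# Pilot data validation, steps 1-6 from the workflow above
df = pd.read_csv("service_desk_tickets_2025.csv")   # 1: load

profile = profile_dataset(df)                        # 2: profile
quality = assess_data_quality(df)                    # 3: quality assessment
explore_patterns_visual(df, "./outputs")             # 4: pattern visuals
anomalies = detect_anomalies(df)                     # 5: anomaly detection

# 6: raw material for the intelligence brief (see template above)
print(f"Rows: {profile['shape'][0]}, issues: {len(quality['issues'])}, "
      f"anomalies flagged: {len(anomalies)}")
```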
Use Case 2: Hypothesis Support
Scenario: Need data to support hypothesis about student service improvement
Workflow:
- Load relevant datasets (tickets, surveys, usage logs)
- Calculate baseline metrics (current response times, satisfaction scores)
- Identify patterns (what's working, what's not)
- Find correlations (factors affecting response time)
- Generate findings to inform hypothesis testing
Use Case 3: Early Warning Detection
Scenario: Monitor ongoing pilot for data quality degradation
Workflow (a threshold-check sketch follows this list):
- Establish baseline data quality metrics
- Set monitoring thresholds
- Regular quality checks (daily/weekly)
- Detect anomalies and trends
- Generate early warning signal if thresholds crossed
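A minimal sketch of the threshold check at the heart of this workflow, reusing assess_data_quality from above; the baseline values are illustrative placeholders to tune against your own data:

```python
# Illustrative thresholds; tune against the established baseline
BASELINE = {"max_issue_count": 3, "max_missing_pct": 10.0}

def early_warning(df: pd.DataFrame) -> list:
    """Return warning messages if quality degrades past thresholds."""
    report = assess_data_quality(df)
    warnings = []
    if len(report["issues"]) > BASELINE["max_issue_count"]:
        warnings.append(
            f"{len(report['issues'])} quality issues "
            f"(baseline {BASELINE['max_issue_count']})"
        )
    worst_missing = (df.isnull().sum() / len(df) * 100).max()
    if worst_missing > BASELINE["max_missing_pct"]:
        warnings.append(
            f"worst missing rate {worst_missing:.1f}% "
            f"(baseline {BASELINE['max_missing_pct']}%)"
        )
    return warnings

# Run on each daily/weekly pull; any message is an early warning signal
for msg in early_warning(df):
    print(f"EARLY WARNING: {msg}")
```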
Tools & Libraries
Required Python Libraries
```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```
Recommended Tools
- Pandas: Data manipulation and analysis
- NumPy: Numerical operations
- Matplotlib/Seaborn: Visualization
- Scikit-learn: Anomaly detection, statistical analysis
- Jupyter: Interactive exploration
Resources
scripts/
- explore_dataset.py: Main exploration script
- quality_checks.py: Data quality assessment utilities
- anomaly_detection.py: Anomaly detection implementations
references/
- duke_data_governance.md: Duke data governance standards
- duke_2026_goals.md: 2026 strategic goals
- privacy_requirements.md: Privacy and compliance requirements
- statistical_methods.md: Reference for statistical techniques
assets/
- exploration_template.ipynb: Jupyter notebook template
- quality_report_template.md: Quality report template
- intelligence_brief_template.md: Brief template
Best Practices
- Start Broad, Then Focus: Begin with overall profiling, then drill into interesting areas
- Document Assumptions: Note any assumptions made during exploration
- Visualize Early: Create visualizations to spot patterns the eye can catch but summary statistics miss
- Check Quality First: Don't analyze bad data; validate quality before deep analysis
- Align with Goals: Always connect findings back to Duke's strategic objectives
- Maintain Provenance: Log data sources, access dates, and transformations applied
- Privacy First: Be mindful of PII and sensitive data throughout exploration
- Iterate: Exploration is iterative - new findings lead to new questions
Decision Log Integration
Every data exploration should generate a decision log entry:
```python
from agents.mayuri_intelligence import MayuriIntelligence

mayuri = MayuriIntelligence()

# After exploration
mayuri.log_decision(
    decision_type="data_exploration_completed",
    inputs={
        "dataset": "service_desk_tickets_2025.csv",
        "rows": 15234,
        "date_range": "2025-01-01 to 2025-12-31"
    },
    outputs={
        "quality_score": "Medium",
        "issues_found": 3,
        "patterns_identified": 5,
        "brief_path": "./outputs/intelligence/BRIEF-20260105-143022.md"
    },
    rationale="Explored service desk data to establish baseline for AI chatbot pilot",
    metadata={
        "alignment": "2026 Goal: Improve student service response times",
        "stakeholder": "Strategy Captain (Kisuke)"
    }
)
```
Remember
- Quality over speed: Thorough exploration prevents bad decisions later
- Context matters: Duke's environment, goals, and constraints shape what's important
- Communicate findings: Insights are useless if not shared in actionable format
- Stay curious: The best insights often come from unexpected patterns
- Respect privacy: Duke data governance and privacy standards are non-negotiable