Claude-skill-registry ai-data-analyst
Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Use when you need to analyze datasets, perform statistical tests, create visualizations, or build predictive models with reproducible, code-based workflows.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ai-data-analyst-grupous-gpus" ~/.claude/skills/majiayu000-claude-skill-registry-ai-data-analyst-27a3aa && rm -rf "$T"
skills/data/ai-data-analyst-grupous-gpus/SKILL.md
Skill: AI data analyst
Purpose
Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Generate publication-quality charts, statistical reports, and actionable insights from data files or databases.
When to use this skill
- You need to analyze datasets to understand patterns, trends, or relationships.
- You want to perform statistical tests or build predictive models.
- You need data visualizations (charts, graphs, dashboards) to communicate findings.
- You're doing exploratory data analysis (EDA) to understand data structure and quality.
- You need to clean, transform, or merge datasets for analysis.
- You want reproducible analysis with documented methodology and code.
- You are performing Convex Backend Engineering (schema design, query optimization, log analysis).
Key capabilities
Unlike point-solution data analysis tools:
- Convex Engineering Integration: Native support for Convex MCP tools (`mcp_convex`) and CLI.
- Full Python ecosystem: Access to pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, and more.
- Runs locally: Your data stays on your machine; no uploads to third-party services.
- Reproducible: All analysis is code-based and version controllable.
- Customizable: Extend with any Python library or custom analysis logic.
- Publication-quality output: Generate professional charts and reports.
- Statistical rigor: Access to comprehensive statistical and ML libraries.
Inputs
- Data sources: CSV files, Excel files, JSON, Parquet, or database connections.
- Analysis goals: Questions to answer or hypotheses to test.
- Variables of interest: Specific columns, metrics, or dimensions to focus on.
- Output preferences: Chart types, report format, statistical tests needed.
- Context: Business domain, data dictionary, or known data quality issues.
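As a minimal sketch of handling the file-based data sources listed above, a script can pick the pandas reader from the file suffix. The `READERS` mapping and the `load_table` helper are hypothetical names introduced for illustration; Excel and Parquet support depends on optional engines being installed.

```python
from pathlib import Path

import pandas as pd

# Map file suffixes to pandas readers. Excel and Parquet need optional
# engines (for example openpyxl or pyarrow) installed alongside pandas.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}


def load_table(path, **kwargs):
    """Load a tabular file into a DataFrame, choosing the reader by suffix."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    # Extra keyword arguments are passed straight to the pandas reader.
    return reader(path, **kwargs)
```

For example, `load_table("data.csv", parse_dates=["date"])` forwards `parse_dates` to `pd.read_csv`. Database connections are not covered by this sketch.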
Out of scope
- Real-time streaming data analysis (use appropriate streaming tools).
- Extremely large datasets requiring distributed computing (use Spark/Dask instead).
- Production ML model deployment (use ML ops tools and infrastructure).
- Live dashboarding (use BI tools like Tableau/Looker for operational dashboards).
Conventions and best practices
Python environment
- Use virtual environments to isolate dependencies.
- Install only necessary packages for the specific analysis.
- Document all dependencies in `requirements.txt` or `environment.yml`.
Code structure
- Write self-contained scripts that can be re-run by others.
- Use clear variable names and add comments for complex logic.
- Separate concerns: data loading, cleaning, analysis, visualization.
- Save intermediate results to files when analysis is multi-stage.
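One possible way to lay out such a script is one function per stage, with a `main` function that wires them together. The function names and the file names `data.csv` and `summary.csv` are placeholders, not part of the skill; a visualization stage would be a further function in the same style.

```python
import pandas as pd


def load_data(path):
    """Read the raw file; the source file itself is never modified."""
    return pd.read_csv(path)


def clean_data(df):
    """Return a cleaned copy: drop exact duplicates and all-empty rows."""
    return df.drop_duplicates().dropna(how="all").reset_index(drop=True)


def analyze(df):
    """Compute summary statistics for the numeric columns."""
    return df.describe()


def main(input_path="data.csv", output_path="summary.csv"):
    df = clean_data(load_data(input_path))
    summary = analyze(df)
    # Persist the intermediate result so later stages can start from it.
    summary.to_csv(output_path)
    return summary
```

Calling `main()` at the bottom of the script runs the whole pipeline, while each stage stays importable and testable on its own.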
Data handling
- Never modify source data files – work on copies or in-memory dataframes.
- Document data transformations clearly in code comments.
- Handle missing values explicitly and document approach.
- Validate data quality before analysis (check for nulls, outliers, duplicates).
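The quality checks above could be bundled into a small report function. This is a sketch only: `quality_report` is a hypothetical helper, and the 1.5 × IQR rule (Tukey's rule) is one common outlier heuristic among several.

```python
import pandas as pd


def quality_report(df):
    """Summarize nulls, duplicate rows, and IQR outliers per numeric column."""
    report = {
        "n_rows": len(df),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "nulls_per_column": df.isna().sum().to_dict(),
        "outliers_per_column": {},
    }
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["outliers_per_column"][col] = int(mask.sum())
    return report
```

The function only reads the DataFrame, so the source data stays untouched.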
Visualization best practices
- Choose appropriate chart types for the data and question.
- Use clear labels, titles, and legends on all charts.
- Apply appropriate color schemes (colorblind-friendly when possible).
- Include sample sizes and confidence intervals where relevant.
- Save visualizations in high-resolution formats (PNG 300 DPI, SVG for vector graphics).
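A short sketch of these practices, using synthetic data generated purely for illustration: labeled axes, sample sizes in the tick labels, approximate 95% confidence intervals as error bars, a colorblind-friendly style sheet that ships with matplotlib, and output as both 300 DPI PNG and SVG. The file names are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
groups = {"A": rng.normal(10, 2, 50), "B": rng.normal(12, 2, 50)}

# Matplotlib ships a colorblind-friendly style sheet.
plt.style.use("tableau-colorblind10")

fig, ax = plt.subplots(figsize=(6, 4))
means = [values.mean() for values in groups.values()]
# Approximate 95% confidence interval half-width: 1.96 * standard error.
errors = [1.96 * v.std(ddof=1) / np.sqrt(len(v)) for v in groups.values()]
labels = [f"{name} (n={len(v)})" for name, v in groups.items()]
ax.bar(labels, means, yerr=errors, capsize=4)
ax.set_xlabel("Group")
ax.set_ylabel("Metric (mean with 95% CI)")
ax.set_title("Mean metric by group")
fig.tight_layout()
fig.savefig("group_means.png", dpi=300)  # raster, 300 DPI
fig.savefig("group_means.svg")  # vector
plt.close(fig)
```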
Statistical analysis
- State assumptions for statistical tests clearly.
- Check assumptions before applying tests (normality, homoscedasticity, etc.).
- Report effect sizes, not just p-values.
- Use appropriate corrections for multiple comparisons.
- Explain practical significance in addition to statistical significance.
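As one example of a multiple-comparison correction, the Holm-Bonferroni step-down adjustment can be written in a few lines of NumPy. `holm_adjust` is a hypothetical helper written for this sketch; other corrections (plain Bonferroni, Benjamini-Hochberg) may suit a given analysis better.

```python
import numpy as np


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Multiply the k-th smallest p-value by (m - k), with k starting at 0.
    scaled = p[order] * (m - np.arange(m))
    # Enforce monotonicity across the sorted values and cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted
```

For raw p-values `[0.01, 0.04, 0.03]` the adjusted values are `[0.03, 0.06, 0.06]`, so only the first test stays below a 0.05 threshold.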
Required behavior
- Understand the question: Clarify what insights or decisions the analysis should support.
- Explore the data: Check structure, types, missing values, distributions, outliers.
- Clean and prepare: Handle missing data, outliers, and transformations appropriately.
- Analyze systematically: Apply appropriate statistical methods or ML techniques.
- Visualize effectively: Create clear, informative charts that answer the question.
- Generate insights: Translate statistical findings into actionable business insights.
- Document thoroughly: Explain methodology, assumptions, limitations, and conclusions.
- Make reproducible: Ensure others can re-run the analysis and get the same results.
Required artifacts
- Analysis script(s): Well-documented Python code performing the analysis.
- Visualizations: Charts saved as high-quality image files (PNG/SVG).
- Analysis report: Markdown or text document summarizing:
  - Research question and methodology
  - Data description and quality assessment
  - Key findings with supporting statistics
  - Visualizations with interpretations
  - Limitations and caveats
  - Recommendations or next steps
- Requirements file: `requirements.txt` with all dependencies.
- Sample data (if appropriate and non-sensitive): Small sample for reproducibility.
Implementation checklist
1. Data exploration and preparation
- Load data and inspect structure (shape, columns, types)
- Check for missing values, duplicates, outliers
- Generate summary statistics (mean, median, std, min, max)
- Visualize distributions of key variables
- Document data quality issues found
2. Data cleaning and transformation
- Handle missing values (impute, drop, or flag)
- Address outliers if needed (cap, transform, or document)
- Create derived variables if needed
- Normalize or scale variables for modeling
- Split data if doing train/test analysis
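Two of the cleaning steps above, sketched under simple assumptions: median imputation that also records which rows were filled, and capping outliers at chosen quantiles. Both helper names are hypothetical, and the 1st/99th percentile defaults are arbitrary starting points rather than recommendations.

```python
import pandas as pd


def impute_with_flag(df, column):
    """Median-impute one numeric column and record which rows were filled."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[f"{column}_was_missing"] = out[column].isna()
    out[column] = out[column].fillna(out[column].median())
    return out


def cap_outliers(series, lower=0.01, upper=0.99):
    """Clip values to the given quantiles (winsorizing)."""
    low, high = series.quantile([lower, upper])
    return series.clip(lower=low, upper=high)
```

When a train/test split is involved, computing the median and quantiles on the training portion only avoids leaking test information into the preparation step.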
3. Analysis execution
- Choose appropriate analytical methods
- Check statistical assumptions
- Execute analysis with proper parameters
- Calculate confidence intervals and effect sizes
- Perform sensitivity analyses if appropriate
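Where no closed-form interval is convenient, a percentile bootstrap is one way to attach a confidence interval to an estimate. The sketch below does this for a difference in group means; the function name, the 5000 resamples, and the fixed seed are illustrative choices, not requirements of the skill.

```python
import numpy as np


def bootstrap_mean_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Resample each group with replacement, n_boot times.
    a_means = rng.choice(a, size=(n_boot, len(a)), replace=True).mean(axis=1)
    b_means = rng.choice(b, size=(n_boot, len(b)), replace=True).mean(axis=1)
    diffs = a_means - b_means
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lower, upper
```

Rerunning with a different `seed` or `n_boot` is a cheap sensitivity check on the interval itself.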
4. Visualization
- Create exploratory visualizations
- Generate publication-quality final charts
- Ensure all charts have clear labels and titles
- Use appropriate color schemes and styling
- Save in high-resolution formats
5. Reporting
- Write clear summary of methods used
- Present key findings with supporting evidence
- Explain practical significance of results
- Document limitations and assumptions
- Provide actionable recommendations
6. Reproducibility
- Test that script runs from clean environment
- Document all dependencies
- Add comments explaining non-obvious code
- Include instructions for running analysis
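To document dependencies with exact versions, a script can query the installed distributions through the standard library. `pinned_requirements` is a hypothetical helper; the package names passed to it are whatever the analysis actually imports.

```python
from importlib import metadata


def pinned_requirements(packages):
    """Return 'name==version' lines for the installed packages listed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Keep a visible marker rather than silently skipping it.
            lines.append(f"# {name}: not installed")
    return lines
```

Writing `"\n".join(pinned_requirements(["pandas", "numpy"]))` to `requirements.txt` records the versions the analysis was run with.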
Convex Engineering Workflow
When working with Convex (backend, database, schemas), you MUST follow this specialized workflow:
1. Protocols & Rules
- READ FIRST: Always read `resources/convex_rules.md` before writing any Convex code.
  - Command: `view_file(AbsolutePath=".../resources/convex_rules.md")`
- MCP Integration: Use `mcp_convex` tools to inspect CURRENT state before proposing changes.
  - `mcp_convex_tables`: Check table schemas.
  - `mcp_convex_functionSpec`: Check existing functions.
  - `mcp_convex_logs`: Analyze recent failures.
2. Implementation & fix
- CLI First: Use `bunx convex` for all operations.
  - DO NOT use generic SQL or other DB commands.
  - Example: `bunx convex run serena/actions:doSomething`
- Log Analysis:
  - When debugging, pull logs via `bunx convex logs --prod --failure` OR `mcp_convex_logs`.
  - Analyze stack traces using Python scripts if text analysis is insufficient.
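A rough sketch of script-based log analysis, assuming the logs have already been saved as plain text. The line format is an assumption: the regular expression expects an error class name followed by a colon and a message (for example `TypeError: ...`), which may not match real Convex log output and would need adjusting.

```python
import re
from collections import Counter

# Hypothetical pattern: an error class name followed by a message,
# e.g. "TypeError: x is undefined". Real log lines may differ.
ERROR_PATTERN = re.compile(r"\b(\w*Error): (.+)")


def count_errors(log_text):
    """Count error occurrences, grouping messages that differ only by numbers."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_PATTERN.search(line)
        if match:
            error_type, message = match.groups()
            # Replace digit runs so IDs and timestamps do not split groups.
            normalized = re.sub(r"\d+", "<n>", message.strip())
            counts[(error_type, normalized)] += 1
    return counts.most_common()
```

The result is a list of `((error_type, message), count)` pairs, most frequent first, which gives a quick view of which failure dominates.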
3. Code Generation
- Schema: Define in `convex/schema.ts` using `defineSchema` and `defineTable`.
- Functions: Use `query`, `mutation`, `action` from `_generated/server`.
- Validation: Ensure `args` and `returns` validators (e.g., `v.string()`, `v.id()`) are strictly typed.
Verification
Run the following to verify the analysis:
    # Create virtual environment
    python3 -m venv venv
    source venv/bin/activate  # or `venv\Scripts\activate` on Windows

    # Install dependencies
    pip install -r requirements.txt

    # Run analysis script
    python analysis.py

    # Check outputs generated
    ls -lh outputs/
The skill is complete when:
- Analysis script runs without errors from clean environment.
- All required visualizations are generated in high quality.
- Report clearly explains methodology, findings, and limitations.
- Results are interpretable and actionable.
- Code is well-documented and reproducible.
Common analysis patterns
Exploratory Data Analysis (EDA)
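    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load and inspect data
    df = pd.read_csv('data.csv')
    df.info()  # prints directly; wrapping it in print() would also print "None"
    print(df.describe())

    # Check for missing values
    print(df.isnull().sum())

    # Visualize distributions of numeric columns
    df.hist(figsize=(12, 10), bins=30)
    plt.tight_layout()
    plt.savefig('distributions.png', dpi=300)
    plt.close()

    # Check correlations (numeric columns only; non-numeric columns raise in pandas >= 2.0)
    corr = df.corr(numeric_only=True)
    plt.figure(figsize=(10, 8))  # new figure so the heatmap is not drawn over the histograms
    sns.heatmap(corr, annot=True, cmap='coolwarm')
    plt.tight_layout()
    plt.savefig('correlations.png', dpi=300)
    plt.close()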
Time series analysis
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Load time series data
    df = pd.read_csv('timeseries.csv', parse_dates=['date'])
    df.set_index('date', inplace=True)

    # Decompose time series
    # (period=30 assumes daily data with a roughly monthly cycle and no missing values)
    decomposition = seasonal_decompose(df['value'], model='additive', period=30)
    fig = decomposition.plot()
    fig.set_size_inches(12, 8)
    plt.savefig('decomposition.png', dpi=300)

    # Calculate rolling statistics
    df['rolling_mean'] = df['value'].rolling(window=7).mean()
    df['rolling_std'] = df['value'].rolling(window=7).std()

    # Plot with trends
    plt.figure(figsize=(12, 6))
    plt.plot(df['value'], label='Original')
    plt.plot(df['rolling_mean'], label='7-day Moving Avg', linewidth=2)
    plt.fill_between(df.index,
                     df['rolling_mean'] - df['rolling_std'],
                     df['rolling_mean'] + df['rolling_std'],
                     alpha=0.3)
    plt.legend()
    plt.savefig('trends.png', dpi=300)
Statistical hypothesis testing
    import numpy as np
    import pandas as pd
    from scipy import stats

    # Load data with a 'group' column and a numeric 'metric' column
    df = pd.read_csv('data.csv')

    # Compare two groups (drop missing values so the tests do not return NaN)
    group_a = df[df['group'] == 'A']['metric'].dropna()
    group_b = df[df['group'] == 'B']['metric'].dropna()

    # Check normality
    _, p_norm_a = stats.shapiro(group_a)
    _, p_norm_b = stats.shapiro(group_b)

    # Choose appropriate test
    if p_norm_a > 0.05 and p_norm_b > 0.05:
        # Parametric test (t-test; the default assumes equal variances)
        statistic, p_value = stats.ttest_ind(group_a, group_b)
        test_used = "Independent t-test"
    else:
        # Non-parametric test (Mann-Whitney U)
        statistic, p_value = stats.mannwhitneyu(group_a, group_b)
        test_used = "Mann-Whitney U test"

    # Calculate effect size (Cohen's d); this simple pooled SD assumes similar group sizes
    pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)
    cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

    print(f"Test used: {test_used}")
    print(f"Test statistic: {statistic:.4f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Effect size (Cohen's d): {cohens_d:.4f}")
Predictive modeling
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    # Load data; all feature columns are assumed to be numeric with no missing values
    df = pd.read_csv('data.csv')

    # Prepare data
    X = df.drop('target', axis=1)
    y = df['target']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on the held-out test set
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"RMSE: {rmse:.4f}")
    print(f"R² Score: {r2:.4f}")

    # Feature importance
    importance = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    plt.figure(figsize=(10, 6))
    plt.barh(importance['feature'][:10], importance['importance'][:10])
    plt.xlabel('Feature Importance')
    plt.title('Top 10 Most Important Features')
    plt.tight_layout()
    plt.savefig('feature_importance.png', dpi=300)
Recommended Python libraries
Data manipulation
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- polars: High-performance DataFrame library (alternative to pandas)
Visualization
- matplotlib: Foundational plotting library
- seaborn: Statistical visualizations
- plotly: Interactive charts
- altair: Declarative statistical visualization
Statistical analysis
- scipy.stats: Statistical functions and tests
- statsmodels: Statistical modeling
- pingouin: Statistical tests with clear output
Machine learning
- scikit-learn: ML algorithms and tools
- xgboost: Gradient boosting
- lightgbm: Fast gradient boosting
Time series
- statsmodels.tsa: Time series analysis
- prophet: Forecasting tool
- pmdarima: Auto ARIMA
Specialized
- networkx: Network analysis
- geopandas: Geospatial data analysis
- textblob / spacy: Natural language processing
Safety and escalation
- Data privacy: Never analyze or share data containing PII without proper authorization.
- Statistical validity: If sample sizes are too small for reliable inference, call this out explicitly.
- Causal claims: Avoid implying causation from correlational analysis; be explicit about limitations.
- Model limitations: Document when models may not generalize or when predictions should not be trusted.
- Data quality: If data quality issues could materially affect conclusions, flag this prominently.
Integration with other skills
This skill can be combined with:
- Internal data querying: To fetch data from warehouses or databases for analysis.
- Web app builder: To create interactive dashboards displaying analysis results.
- Internal tools: To build analysis tools for non-technical stakeholders.