Claude-skill-registry data-ai-guide
Comprehensive data science, machine learning, and AI guide covering Python, deep learning, NLP, LLMs, prompt engineering, and MLOps. Use when building AI models, data pipelines, or machine learning systems.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-ai" ~/.claude/skills/majiayu000-claude-skill-registry-data-ai-guide && rm -rf "$T"
manifest:
skills/data/data-ai/SKILL.md · source content
Data Science & AI Guide
Master data science, machine learning, generative AI, and modern AI engineering practices.
Quick Start
Python Data Science Stack
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load and prepare data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
```
Deep Learning with PyTorch
```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(784, 128)
        self.linear2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        return self.linear2(x)

# Training setup
model = SimpleNN()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Training loop (train_loader is a DataLoader assumed to be defined)
for epoch in range(10):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
```
LLM Prompt Engineering
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7
)
print(response.choices[0].message.content)
```
Data Science Path
Fundamentals
- Mathematics: Statistics, linear algebra, calculus
- Python: Libraries (Pandas, NumPy, Scikit-learn)
- Data Analysis: Exploratory analysis, visualization
- SQL: Querying and data manipulation
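The exploratory-analysis step above can be sketched with Pandas. The dataset here is a small invented stand-in for a real CSV:

```python
import pandas as pd

# Small illustrative dataset (stand-in for a real CSV load)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [30_000, 45_000, 80_000, 72_000, 95_000],
    "churned": [0, 0, 1, 0, 1],
})

# Summary statistics and missing-value counts
summary = df.describe()
missing = df.isna().sum()

# Correlation between numeric features
corr = df[["age", "income"]].corr()
```

`describe()`, `isna()`, and `corr()` cover most of a first-pass exploratory analysis before any modeling.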
Machine Learning
- Supervised Learning: Regression, classification
- Unsupervised Learning: Clustering, dimensionality reduction
- Model Evaluation: Cross-validation, metrics
- Hyperparameter Tuning: Grid search, Bayesian optimization
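Cross-validation and grid search combine naturally in scikit-learn's `GridSearchCV`. A self-contained sketch on synthetic data (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data so the example is self-contained
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# 5-fold cross-validated search over two hyperparameters
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_   # winning hyperparameter combination
best_score = search.best_score_     # mean cross-validated accuracy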
Deep Learning
- Neural Networks: Architecture, training
- CNNs: Computer vision tasks
- RNNs: Sequence modeling
- Transformers: Modern architecture for NLP/Vision
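The core of the transformer architecture is scaled dot-product attention. A minimal NumPy sketch of the formula `softmax(QKᵀ / √d_k) V` (shapes chosen for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 positions, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Real transformers run many such heads in parallel (multi-head attention) and stack the result with feed-forward layers, residual connections, and normalization.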
Natural Language Processing
- Text Processing: Tokenization, embeddings
- Word Embeddings: Word2Vec, GloVe, FastText
- BERT: Contextual embeddings
- Transformers: GPT, BERT for various NLP tasks
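The key property of word embeddings is that semantically related words end up close in vector space, typically measured with cosine similarity. A toy sketch with invented 4-dimensional vectors (real Word2Vec/GloVe vectors have 100–300 dimensions):

```python
import numpy as np

# Toy "embeddings" chosen so that king/queen are similar and apple is not
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.85, 0.75, 0.2, 0.3]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
```

Static embeddings (Word2Vec, GloVe) assign one vector per word; contextual models like BERT instead produce a different vector for each occurrence, depending on the sentence.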
Generative AI & LLMs
Large Language Models
- GPT Family: GPT-3.5, GPT-4 for text generation
- Claude: Constitutional AI models
- Open Source: Llama, Mistral, Zephyr
- Fine-tuning: Adapting models for specific tasks
Prompt Engineering
- Role-based Prompting: Setting context and expertise
- Few-shot Learning: Examples in prompt
- Chain-of-Thought: Step-by-step reasoning
- Retrieval Augmented Generation (RAG): Knowledge augmentation
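The prompting patterns above can be sketched without calling any API, since a prompt is ultimately just a string. The task and examples here are illustrative placeholders:

```python
# Few-shot + chain-of-thought prompt assembled as plain strings
examples = [
    ("The food was amazing!", "positive"),
    ("Terrible service, never again.", "negative"),
]

def build_prompt(text):
    lines = ["Classify the sentiment. Think step by step, then answer."]
    for sample, label in examples:                  # few-shot examples
        lines.append(f"Review: {sample}\nSentiment: {label}")
    lines.append(f"Review: {text}\nSentiment:")     # the actual query
    return "\n\n".join(lines)

prompt = build_prompt("Great value for the price.")
```

The instruction line sets the role and requests step-by-step reasoning (chain-of-thought); the labeled examples give the model the expected output format (few-shot).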
```python
# RAG Example (legacy LangChain import paths; newer versions use langchain_community)
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()
vectorstore = Chroma(embedding_function=embeddings)  # add documents before querying
llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
```
AI Agents
- Tool Use: Agents calling external tools
- Planning: Multi-step task execution
- Memory: Conversation history, context
- Evaluation: Assessing agent performance
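Stripped of any particular framework, tool use reduces to a dispatch loop: the model plans a sequence of tool calls and the runtime executes them. A minimal sketch with invented tools and call format:

```python
def calculator(expression: str) -> str:
    # Demo only: eval with empty builtins; unsafe for untrusted input
    return str(eval(expression, {"__builtins__": {}}))

def lookup(term: str) -> str:
    kb = {"CAP theorem": "consistency, availability, partition tolerance"}
    return kb.get(term, "unknown")

TOOLS = {"calculator": calculator, "lookup": lookup}

def run_agent(tool_calls):
    """Execute a planned sequence of (tool_name, argument) steps."""
    memory = []  # results carried between steps (the agent's working memory)
    for name, arg in tool_calls:
        memory.append(TOOLS[name](arg))
    return memory

results = run_agent([("calculator", "2 + 3 * 4"), ("lookup", "CAP theorem")])
```

In a real agent the (tool, argument) plan comes from the LLM each turn, and the accumulated results are fed back into the next prompt.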
Data Engineering
ETL Pipelines
- Apache Airflow: Workflow orchestration
- dbt: Data transformation
- Kafka: Stream processing
- Spark: Distributed processing
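Whatever the orchestrator, an ETL job is three steps: extract raw records, transform them, load them into a target. A self-contained sketch using in-memory SQLite as the load target (the records are invented):

```python
import sqlite3

# Extract: raw records (stand-in for an API or file source)
raw = [
    {"user": "ada", "amount": "19.99"},
    {"user": "bob", "amount": "5.00"},
    {"user": "ada", "amount": "12.50"},
]

# Transform: cast string amounts to floats and aggregate per user
totals = {}
for row in raw:
    totals[row["user"]] = totals.get(row["user"], 0.0) + float(row["amount"])

# Load: write into a relational target (in-memory SQLite here)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (user TEXT PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO spend VALUES (?, ?)", totals.items())
loaded = dict(conn.execute("SELECT user, total FROM spend").fetchall())
```

Airflow schedules and retries such steps as DAG tasks; dbt expresses the transform step as SQL; Kafka and Spark handle the same pattern at streaming and cluster scale.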
Big Data
- Hadoop: Distributed storage and processing
- Spark: In-memory processing framework
- Scala: Spark's native language
- Distributed Systems: Understanding CAP theorem
Data Warehousing
- Snowflake: Cloud data warehouse
- BigQuery: Google's data warehouse
- Redshift: AWS data warehouse
- Star Schema: Dimensional modeling
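A star schema puts measures in a central fact table that references descriptive dimension tables by key. A minimal sketch via Python's built-in SQLite (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables describe entities; the fact table references them by key
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT)")
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    revenue REAL)""")

cur.execute("INSERT INTO dim_product VALUES (1, 'widget')")
cur.execute("INSERT INTO dim_date VALUES (1, '2024-01-01')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 9.99), (1, 1, 19.99)])

# A typical star-schema query: join the fact table to its dimensions
row = cur.execute("""
    SELECT p.name, d.day, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.name, d.day
""").fetchone()
```

Snowflake, BigQuery, and Redshift all execute this join pattern efficiently because every join fans out from the single fact table.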
MLOps
Model Management
- Model Versioning: Tracking model versions
- Model Registry: MLflow, Weights & Biases
- Experiment Tracking: Monitoring training runs
- Model Cards: Documenting model capabilities
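At its core, experiment tracking means recording each run's hyperparameters and metrics in a queryable place. A minimal sketch of the idea in plain Python (MLflow and Weights & Biases add UIs, comparison views, and model registries on top; the file layout here is invented):

```python
import json
import tempfile
import time
from pathlib import Path

def log_run(run_dir, params, metrics):
    """Persist one training run's parameters and metrics as JSON."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return record

run_dir = Path(tempfile.mkdtemp()) / "exp-001"
record = log_run(run_dir,
                 params={"n_estimators": 100, "max_depth": 8},
                 metrics={"accuracy": 0.93})
```

With every run logged this way, "which hyperparameters produced the best model?" becomes a query over the run records rather than archaeology through notebooks.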
Deployment
- Model Serving: FastAPI, TensorFlow Serving
- Containerization: Docker for models
- Kubernetes: Production ML deployment
- API Monitoring: Performance and data drift
Monitoring
- Data Drift: Detecting distribution changes
- Model Drift: Performance degradation
- Feature Store: Consistent feature serving
- Observability: Logging and metrics
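Data drift can be checked statistically by comparing a feature's training distribution against what the model sees in production. A sketch using the two-sample Kolmogorov–Smirnov test from SciPy (the distributions and threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, 1000)    # feature distribution at training time
production = rng.normal(0.8, 1.0, 1000)  # shifted distribution in production

# Two-sample KS test: a small p-value means the distributions differ
stat, p_value = ks_2samp(training, production)
drift_detected = p_value < 0.01
```

In practice this check runs per feature on a schedule, and a detected drift triggers an alert or a retraining job; model drift is monitored the same way but on prediction quality rather than inputs.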
Technology Stack
Core Libraries
- Pandas: Data manipulation
- NumPy: Numerical computing
- Scikit-learn: Machine learning
- Matplotlib/Seaborn: Visualization
- Plotly: Interactive plots
Deep Learning
- TensorFlow: Keras API, distributed training
- PyTorch: Dynamic graphs, research-friendly
- JAX: Functional programming for ML
LLM Frameworks
- LangChain: Building LLM applications
- LlamaIndex: RAG and indexing
- OpenAI API: GPT models access
- Hugging Face: Model hub and transformers
Learning Path
1. Fundamentals (3 months)
   - Python programming
   - Statistics and mathematics
   - Data manipulation with Pandas
2. Machine Learning (3 months)
   - Supervised learning
   - Model evaluation
   - Feature engineering
3. Deep Learning (2 months)
   - Neural networks
   - CNNs and RNNs
   - Transformers
4. Specialization (ongoing)
   - NLP / Computer Vision / Tabular Data
   - LLMs and generative AI
   - MLOps and production
Projects
- Iris Classification - Classic ML project
- Housing Price Prediction - Regression
- Sentiment Analysis - NLP with transformers
- Image Classification - CNN with deep learning
- LLM Chatbot - Using prompt engineering
- RAG System - Knowledge-augmented AI
- Time Series Forecasting - Stock predictions
Resources
Learning Platforms
- Coursera: Andrew Ng's ML course
- Fast.ai: Practical deep learning
- DataCamp: Interactive data science
- Kaggle: Competitions and datasets
Documentation
Roadmap.sh Reference: https://roadmap.sh/ai-engineer
Status: ✅ Production Ready | SASMP: v1.3.0 | Bonded Agent: 04-data-ai-specialist