Claude-skill-registry llm-basics
LLM architecture, tokenization, transformers, and inference optimization. Use for understanding and working with language models.
install
source · Clone the upstream repo
```bash
git clone https://github.com/majiayu000/claude-skill-registry
```
Claude Code · Install into ~/.claude/skills/
```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-basics" ~/.claude/skills/majiayu000-claude-skill-registry-llm-basics && rm -rf "$T"
```
manifest: skills/data/llm-basics/SKILL.md
LLM Basics
Master the fundamentals of Large Language Models.
Quick Start
Using OpenAI API
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers briefly."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```
Using Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated checkpoint: requires accepting the Llama 2 license on Hugging Face
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, how are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
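A 7B model in full precision needs roughly 28 GB of memory (7B parameters × 4 bytes), so for real use you will usually load in half precision on a GPU. A minimal sketch, assuming a CUDA device and the `accelerate` package installed:

```python
import torch
from transformers import AutoModelForCausalLM

# fp16 weights plus automatic device placement (requires `accelerate`)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```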
Core Concepts
Transformer Architecture
```
Input → Embedding → [N × Transformer Block] → Output

Transformer Block:
┌───────────────────────────┐
│ Multi-Head Self-Attention │
├───────────────────────────┤
│    Layer Normalization    │
├───────────────────────────┤
│   Feed-Forward Network    │
├───────────────────────────┤
│    Layer Normalization    │
└───────────────────────────┘
```

Residual connections wrap the attention and feed-forward sub-layers; they are omitted from the diagram for clarity.
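The attention sub-layer is the core of the block. A minimal sketch of single-head scaled dot-product self-attention (no causal mask, batching, or output projection):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5  # scaled dot-product
    weights = F.softmax(scores, dim=-1)    # attention weights per query
    return weights @ v                     # weighted sum of values

x = torch.randn(4, 8)                 # 4 tokens, model width 8
w = [torch.randn(8, 8) for _ in range(3)]
print(self_attention(x, *w).shape)    # torch.Size([4, 8])
```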
Tokenization
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"

# Encode
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 11, 995, 0]

# Decode
decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello, world!"
```
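For hosted models it is worth counting tokens before sending a request, since billing and context limits are both denominated in tokens. A short sketch, assuming the `tiktoken` package:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Explain transformers briefly."
print(len(enc.encode(prompt)))  # number of tokens this prompt will consume
```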
Key Parameters
```python
# Generation parameters
params = {
    'temperature': 0.7,      # Randomness (0-2)
    'max_tokens': 1000,      # Output length limit
    'top_p': 0.9,            # Nucleus sampling
    'top_k': 50,             # Top-k sampling
    'frequency_penalty': 0,  # Reduce repetition
    'presence_penalty': 0,   # Encourage new topics
}
```
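Temperature divides the logits before the softmax, so low values sharpen the distribution toward the top token while high values flatten it. A toy illustration with made-up logits for three candidate tokens:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])  # illustrative raw scores

for t in (0.2, 0.7, 1.5):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: {[round(p, 3) for p in probs.tolist()]}")
# T=0.2 concentrates nearly all mass on the top token; T=1.5 spreads it out
```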
Model Comparison
| Model | Parameters | Context (tokens) | Best for |
|---|---|---|---|
| GPT-4 | ~1.7T | 128K | Complex reasoning |
| GPT-3.5 | 175B | 16K | General tasks |
| Claude 3 | Not disclosed | 200K | Long context |
| Llama 2 | 7-70B | 4K | Open source |
| Mistral 7B | 7B | 32K | Efficient inference |
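Context limits differ widely across models, as the table shows, and overruns fail at request time. A minimal sketch of trimming an over-long prompt to a token budget, reusing the GPT-2 tokenizer from the Tokenization section; the 4K limit and 512-token reserve are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fit_to_context(text: str, limit: int = 4096, reserve: int = 512) -> str:
    # leave `reserve` tokens of headroom for the generated output
    ids = tokenizer.encode(text)
    budget = limit - reserve
    return tokenizer.decode(ids[:budget]) if len(ids) > budget else text
```

Truncating the tail is the simplest policy; keeping the end (or summarizing the middle) may suit chat history better.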
Local Inference
With Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
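The same endpoint can be called from Python. A sketch assuming the `requests` package and Ollama running on its default port, with streaming disabled so the reply arrives as a single JSON object:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```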
With vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)  # first completion for the first prompt
```
Best Practices
- Start simple: Use API before local deployment
- Mind context: Stay within context window limits
- Temperature tuning: Lower for facts, higher for creativity
- Token efficiency: Shorter prompts = lower costs
- Streaming: Use for better UX in applications (see the sketch below)
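A minimal streaming sketch using the OpenAI client from Quick Start; the model and prompt are placeholders:

```python
# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain attention briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```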
Error Handling & Retry
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt: str) -> str:
    # `client` is the OpenAI client from Quick Start; the model is illustrative
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| Rate limit errors | Too many requests | Add exponential backoff |
| Empty response | max_tokens=0 | Check parameter values |
| High latency | Large model | Use smaller model |
| Timeout | Prompt too long | Reduce input size |
Unit Test Template
```python
def test_llm_completion():
    # `call_llm` is your application's wrapper around the model call
    response = call_llm("Hello")
    assert response is not None
    assert len(response) > 0
```
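To keep the suite fast and free of API keys, stub the client instead of hitting the network. A sketch of the mocking pattern; in a real suite you would inject the fake client into your wrapper rather than call it directly:

```python
from unittest.mock import MagicMock

def test_llm_completion_offline():
    # fake client that mimics the OpenAI response shape
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Hi there"))
    ]
    response = fake_client.chat.completions.create(model="gpt-4", messages=[])
    assert response.choices[0].message.content == "Hi there"
```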