Claude-skill-registry llm-basics
LLM architecture, tokenization, transformers, and inference optimization. Use for understanding and working with language models.
install
source · Clone the upstream repo
```bash
git clone https://github.com/majiayu000/claude-skill-registry
```
Claude Code · Install into ~/.claude/skills/
```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-basics" ~/.claude/skills/majiayu000-claude-skill-registry-llm-basics && rm -rf "$T"
```
manifest: skills/data/llm-basics/SKILL.md
LLM Basics
Master the fundamentals of Large Language Models.
Quick Start
Using OpenAI API
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers briefly."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
```
Using Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated checkpoint: requires accepting the Llama 2 license on Hugging Face
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, how are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
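A 7B model in full precision needs roughly 28 GB of memory (7B parameters × 4 bytes), so for real use you will usually load in half precision on a GPU. A minimal sketch, assuming a CUDA device and the `accelerate` package installed:

```python
import torch
from transformers import AutoModelForCausalLM

# fp16 weights plus automatic device placement (requires `accelerate`)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```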
Core Concepts
Transformer Architecture
```
Input → Embedding → [N × Transformer Block] → Output

Transformer Block:
┌───────────────────────────┐
│ Multi-Head Self-Attention │
├───────────────────────────┤
│    Layer Normalization    │
├───────────────────────────┤
│   Feed-Forward Network    │
├───────────────────────────┤
│    Layer Normalization    │
└───────────────────────────┘
```

Residual connections wrap the attention and feed-forward sub-layers; they are omitted from the diagram for clarity.
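The attention sub-layer is the core of the block. A minimal sketch of single-head scaled dot-product self-attention (no causal mask, batching, or output projection):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5  # scaled dot-product
    weights = F.softmax(scores, dim=-1)    # attention weights per query
    return weights @ v                     # weighted sum of values

x = torch.randn(4, 8)                 # 4 tokens, model width 8
w = [torch.randn(8, 8) for _ in range(3)]
print(self_attention(x, *w).shape)    # torch.Size([4, 8])
```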
Tokenization
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"

# Encode
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 11, 995, 0]

# Decode
decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello, world!"
```
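For hosted models it is worth counting tokens before sending a request, since billing and context limits are both denominated in tokens. A short sketch, assuming the `tiktoken` package:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Explain transformers briefly."
print(len(enc.encode(prompt)))  # number of tokens this prompt will consume
```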
Key Parameters
```python
# Generation parameters
params = {
    'temperature': 0.7,      # Randomness (0-2)
    'max_tokens': 1000,      # Output length limit
    'top_p': 0.9,            # Nucleus sampling
    'top_k': 50,             # Top-k sampling
    'frequency_penalty': 0,  # Reduce repetition
    'presence_penalty': 0,   # Encourage new topics
}
```
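Temperature divides the logits before the softmax, so low values sharpen the distribution toward the top token while high values flatten it. A toy illustration with made-up logits for three candidate tokens:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])  # illustrative raw scores

for t in (0.2, 0.7, 1.5):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}: {[round(p, 3) for p in probs.tolist()]}")
# T=0.2 concentrates nearly all mass on the top token; T=1.5 spreads it out
```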
Model Comparison
| Model | Parameters | Context (tokens) | Best for |
|---|---|---|---|
| GPT-4 | ~1.7T | 128K | Complex reasoning |
| GPT-3.5 | 175B | 16K | General tasks |
| Claude 3 | Not disclosed | 200K | Long context |
| Llama 2 | 7-70B | 4K | Open source |
| Mistral 7B | 7B | 32K | Efficient inference |
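Context limits differ widely across models, as the table shows, and overruns fail at request time. A minimal sketch of trimming an over-long prompt to a token budget, reusing the GPT-2 tokenizer from the Tokenization section; the 4K limit and 512-token reserve are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fit_to_context(text: str, limit: int = 4096, reserve: int = 512) -> str:
    # leave `reserve` tokens of headroom for the generated output
    ids = tokenizer.encode(text)
    budget = limit - reserve
    return tokenizer.decode(ids[:budget]) if len(ids) > budget else text
```

Truncating the tail is the simplest policy; keeping the end (or summarizing the middle) may suit chat history better.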
Local Inference
With Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
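The same endpoint can be called from Python. A sketch assuming the `requests` package and Ollama running on its default port, with streaming disabled so the reply arrives as a single JSON object:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```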
With vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)  # first completion for the first prompt
```
Best Practices
- Start simple: Use API before local deployment
- Mind context: Stay within context window limits
- Temperature tuning: Lower for facts, higher for creativity
- Token efficiency: Shorter prompts = lower costs
- Streaming: Use for better UX in applications (see the sketch below)
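A minimal streaming sketch using the OpenAI client from Quick Start; the model and prompt are placeholders:

```python
# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain attention briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```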
Error Handling & Retry
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt: str) -> str:
    # `client` is the OpenAI client from Quick Start; the model is illustrative
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Troubleshooting
| Symptom | Cause | Solution |
|---|---|---|
| Rate limit errors | Too many requests | Add exponential backoff |
| Empty response | max_tokens=0 | Check parameter values |
| High latency | Large model | Use smaller model |
| Timeout | Prompt too long | Reduce input size |
Unit Test Template
```python
def test_llm_completion():
    # `call_llm` is your application's wrapper around the model call
    response = call_llm("Hello")
    assert response is not None
    assert len(response) > 0
```
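To keep the suite fast and free of API keys, stub the client instead of hitting the network. A sketch of the mocking pattern; in a real suite you would inject the fake client into your wrapper rather than call it directly:

```python
from unittest.mock import MagicMock

def test_llm_completion_offline():
    # fake client that mimics the OpenAI response shape
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Hi there"))
    ]
    response = fake_client.chat.completions.create(model="gpt-4", messages=[])
    assert response.choices[0].message.content == "Hi there"
```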