# llm-app-patterns
Provides architectural patterns for LLM-powered applications and AI assistants, including prompt engineering, RAG, agent loops, conversation management, and evaluation. Use when building AI-based features, chatbots, or complex AI system architectures.
## Install

Clone the upstream repo:

```shell
git clone https://github.com/tranhieutt/software_development_department
```

Claude Code: install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.claude/skills/llm-app-patterns" \
       ~/.claude/skills/tranhieutt-software-development-department-llm-app-patterns \
  && rm -rf "$T"
```

Manifest: `.claude/skills/llm-app-patterns/SKILL.md`
# LLM Application & AI Assistant Patterns
## Resources

See `resources/implementation-playbook.md` for detailed patterns and examples.
## Architecture decision matrix
| Pattern | Use when | Cost |
|---|---|---|
| Simple RAG | FAQ, docs Q&A | Low |
| Hybrid RAG (semantic + BM25) | Mixed query types | Medium |
| Function calling | Structured tool use | Low |
| ReAct agent | Multi-step reasoning | Medium |
| Plan-and-execute | Complex decomposable tasks | High |
| Multi-agent | Research, critique-refine | Very High |
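The matrix above can be read as a cost-ordered routing heuristic: pick the cheapest pattern that covers the task. A sketch of that rule (the function name and its boolean inputs are illustrative, not part of the skill):

```python
def choose_pattern(multi_step: bool, decomposable: bool,
                   needs_tools: bool, mixed_queries: bool) -> str:
    """Map task traits to the cheapest adequate pattern from the matrix."""
    if multi_step and decomposable:
        return "plan-and-execute"   # complex decomposable tasks (high cost)
    if multi_step:
        return "react-agent"        # multi-step reasoning (medium cost)
    if needs_tools:
        return "function-calling"   # structured tool use (low cost)
    if mixed_queries:
        return "hybrid-rag"         # mixed query types (medium cost)
    return "simple-rag"             # FAQ / docs Q&A (low cost)
```

Multi-agent is deliberately omitted here; research and critique-refine workloads usually warrant a human decision rather than automatic routing.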
## RAG: critical config numbers

```python
CHUNK_CONFIG = {
    "chunk_size": 512,    # tokens - sweet spot for most docs
    "chunk_overlap": 50,  # prevents context loss at boundaries
    "separators": ["\n\n", "\n", ". ", " "],
}
# Hybrid search alpha: 1.0 = semantic only, 0.0 = BM25 only, 0.5 = balanced
```
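One way the config above might be consumed: a greedy splitter that breaks near the size limit at the latest separator, then backs up by the overlap. A minimal sketch (character-based as an approximation of tokens; `split_text` is illustrative, not the skill's API):

```python
CHUNK_CONFIG = {
    "chunk_size": 512,
    "chunk_overlap": 50,
    "separators": ["\n\n", "\n", ". ", " "],
}

def split_text(text, size=None, overlap=None):
    """Greedy split: cut at the latest separator inside the window,
    then restart `overlap` characters back to preserve boundary context."""
    size = size or CHUNK_CONFIG["chunk_size"]
    overlap = overlap or CHUNK_CONFIG["chunk_overlap"]
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            for sep in CHUNK_CONFIG["separators"]:  # prefer strongest separator
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)
    return chunks
```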
## RAG: retrieval strategies

```python
# Basic: semantic search
results = vector_db.similarity_search(embed(query), top_k=5)

# Better: hybrid (semantic + keyword via RRF)
def hybrid_search(query, alpha=0.5):
    return rrf_merge(vector_db.search(query), bm25_search(query), alpha)

# Best for recall: multi-query (3 variations, flatten, deduplicate)
queries = llm.generate_variations(query, n=3)
results = deduplicate([doc for q in queries for doc in semantic_search(q)])
```
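The `rrf_merge` helper referenced above is not defined in the skill; a plausible sketch of weighted Reciprocal Rank Fusion over two ranked ID lists (the `k=60` smoothing constant is the commonly used default, and the signature is an assumption):

```python
def rrf_merge(semantic_ids, bm25_ids, alpha=0.5, k=60):
    """Weighted RRF: score(d) = alpha/(k+rank_sem) + (1-alpha)/(k+rank_bm25)."""
    scores = {}
    for rank, doc in enumerate(semantic_ids):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(bm25_ids):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked by both retrievers outscores one ranked by only a single retriever, which is the point of the fusion.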
## RAG: generation prompt template

```python
RAG_PROMPT = """Answer based ONLY on the context below.
If insufficient, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
```
## Agent: function calling loop

```python
messages = [{"role": "user", "content": question}]
while True:
    response = llm.chat(messages=messages, tools=TOOLS, tool_choice="auto")
    if not response.tool_calls:
        return response.content
    messages.append(response)  # keep the assistant turn that requested the tools
    for call in response.tool_calls:
        result = execute_tool(call.name, call.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": str(result)})
```
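The loop assumes a `TOOLS` schema and an `execute_tool` dispatcher, neither defined in the skill. A sketch following the OpenAI-style tools format (the `get_weather` tool and its handler are purely illustrative):

```python
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def execute_tool(name, arguments):
    # A dispatch table keeps the agent loop generic; handlers are stubs here
    handlers = {"get_weather": lambda args: {"city": args["city"], "temp_c": 21}}
    return handlers[name](arguments)
```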
## Production: caching (only temperature=0 responses)

```python
def get_or_generate(prompt, model, **kwargs):
    deterministic = kwargs.get("temperature", 1.0) == 0
    if deterministic:
        key = sha256(
            f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}".encode()
        ).hexdigest()
        if cached := redis.get(key):
            return cached
    response = llm.generate(prompt, model=model, **kwargs)
    if deterministic:
        redis.setex(key, 3600, response)  # cache for 1 hour
    return response
```
## Production: retry + fallback

```python
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=4, max=60),
       stop=stop_after_attempt(5))
def call_llm(prompt):
    return llm.generate(prompt)

# Fallback chain: try the primary model, then each fallback in order
def call_with_fallback(prompt):
    for model in [primary] + fallbacks:
        try:
            return llm.generate(prompt, model=model)
        except (RateLimitError, APIError):
            continue
    raise RuntimeError("all models failed")
```
## LLMOps: key metrics

- Latency: p50, p99 response time
- Quality: satisfaction (thumbs), task completion %, hallucination rate
- Cost: cost_per_request, tokens_per_request, cache_hit_rate
- Health: error_rate, timeout_rate, retry_rate
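A minimal in-process tracker covering the latency and health columns; a sketch only (production setups would export these to a metrics backend rather than hold them in memory):

```python
import time

class LLMMetrics:
    """Wrap LLM calls to record latency, call count, and error count."""

    def __init__(self):
        self.latencies, self.errors, self.calls = [], 0, 0

    def record(self, fn, *args, **kwargs):
        self.calls += 1
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

    def summary(self):
        lat = sorted(self.latencies)
        return {
            "p50": lat[len(lat) // 2] if lat else None,
            "p99": lat[min(len(lat) - 1, int(len(lat) * 0.99))] if lat else None,
            "error_rate": self.errors / self.calls if self.calls else 0.0,
        }
```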
## Embedding model selection
| Model | Dims | Cost | Use |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M | Most cases |
| text-embedding-3-large | 3072 | $0.13/1M | High accuracy |
| bge-large (local) | 1024 | Free | Self-hosted |
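Whichever model you pick, retrieval typically ranks chunks by cosine similarity between the query and chunk embeddings. A pure-Python sketch (vector databases do this internally; shown here only to make the comparison concrete):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = identical
    direction, 0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```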