software_development_department · llm-app-patterns

Provides architectural patterns for LLM-powered applications and AI assistants, including prompt engineering, RAG, agent loops, conversation management, and evaluation. Use when building AI-based features, chatbots, or complex AI system architectures.

install
source · Clone the upstream repo
git clone https://github.com/tranhieutt/software_development_department
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/llm-app-patterns" ~/.claude/skills/tranhieutt-software-development-department-llm-app-patterns && rm -rf "$T"
manifest: .claude/skills/llm-app-patterns/SKILL.md
source content

LLM Application & AI Assistant Patterns

Resources

  • See resources/implementation-playbook.md for detailed patterns and examples.

Architecture decision matrix

Pattern                        Use when                     Cost
Simple RAG                     FAQ, docs Q&A                Low
Hybrid RAG (semantic + BM25)   Mixed query types            Medium
Function calling               Structured tool use          Low
ReAct agent                    Multi-step reasoning         Medium
Plan-and-execute               Complex, decomposable tasks  High
Multi-agent                    Research, critique-refine    Very High

RAG: critical config numbers

CHUNK_CONFIG = {
    "chunk_size": 512,       # tokens — sweet spot for most docs
    "chunk_overlap": 50,     # prevents context loss at boundaries
    "separators": ["\n\n", "\n", ". ", " "],
}
# Hybrid search alpha: 1.0=semantic only, 0.0=BM25 only, 0.5=balanced
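
A minimal chunking sketch under the assumption that you use LangChain's text splitters (package name langchain_text_splitters in recent releases); from_tiktoken_encoder measures size in tokens, matching the config above. document_text is a placeholder for your raw document string.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=CHUNK_CONFIG["chunk_size"],
    chunk_overlap=CHUNK_CONFIG["chunk_overlap"],
    separators=CHUNK_CONFIG["separators"],
)
chunks = splitter.split_text(document_text)  # list of token-bounded chunk strings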

RAG: retrieval strategies

# Basic: semantic search
results = vector_db.similarity_search(embed(query), top_k=5)

# Better: hybrid (semantic + keyword via RRF)
def hybrid_search(query, alpha=0.5):
    return rrf_merge(vector_db.search(query), bm25_search(query), alpha)

# Best for recall: multi-query (3 variations, deduplicate)
queries = llm.generate_variations(query, n=3)
results = deduplicate([r for q in queries for r in semantic_search(q)])  # flatten, then dedup
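
The rrf_merge helper used above is assumed, not defined. Here is one hypothetical weighted variant: classic reciprocal-rank fusion scores each document by 1/(k + rank) across both rankings with k ≈ 60, and alpha skews the blend toward the semantic list.

def rrf_merge(semantic_hits, keyword_hits, alpha=0.5, k=60):
    # Weighted RRF over two ranked lists of doc IDs (adapt to your store's result type).
    scores = {}
    for rank, doc_id in enumerate(semantic_hits):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_hits):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)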

RAG: generation prompt template

RAG_PROMPT = """Answer based ONLY on the context below.
If insufficient, say "I don't have enough information."

Context: {context}
Question: {question}
Answer:"""
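
Filling the template is then a single formatting call. A sketch, assuming results comes from the retrieval step above; the attribute holding each chunk's text varies by vector store.

context = "\n\n".join(chunk.text for chunk in results)  # .text is store-specific
answer = llm.generate(RAG_PROMPT.format(context=context, question=question))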

Agent: function calling loop

def run_agent(question):
    messages = [{"role": "user", "content": question}]
    while True:
        response = llm.chat(messages=messages, tools=TOOLS, tool_choice="auto")
        if not response.tool_calls:
            return response.content  # no more tool calls: final answer
        messages.append(response.message)  # assistant turn with tool_calls must precede tool results
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
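
TOOLS and execute_tool are assumed by the loop. A hypothetical single-tool setup in the OpenAI-style JSON-schema format, where tool arguments arrive as a JSON string; get_weather is a placeholder.

import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def execute_tool(name, arguments):
    # Dispatch table keeps the loop generic; add one handler per tool.
    handlers = {"get_weather": get_weather}
    return handlers[name](**json.loads(arguments))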

Production: caching (only temperature=0 responses)

import hashlib, json

def get_or_generate(prompt, model, **kwargs):
    deterministic = kwargs.get("temperature", 1.0) == 0
    if deterministic:
        raw = f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}"
        key = hashlib.sha256(raw.encode()).hexdigest()
        if cached := redis.get(key):
            return cached
    response = llm.generate(prompt, model=model, **kwargs)
    if deterministic:
        redis.setex(key, 3600, response)  # cache for 1 hour
    return response
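
Only deterministic calls are cached because sampled (temperature > 0) outputs differ between runs, so a cache hit would silently pin one arbitrary sample. The one-hour TTL is a freshness/cost trade-off: lengthen it for stable prompts, shorten it when prompts or models change often.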

Production: retry + fallback

from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5))
def call_llm(prompt): return llm.generate(prompt)

# Fallback chain: try the primary model, then each fallback in order
def generate_with_fallback(prompt, models=(primary, *fallbacks)):
    for model in models:
        try:
            return llm.generate(prompt, model=model)
        except (RateLimitError, APIError):
            continue
    raise RuntimeError("all models in the fallback chain failed")

LLMOps: key metrics

Latency : p50, p99 response time
Quality : satisfaction (thumbs), task completion %, hallucination rate
Cost    : cost_per_request, tokens_per_request, cache_hit_rate
Health  : error_rate, timeout_rate, retry_rate
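
One minimal way to feed these metrics is a decorator around every LLM call, sketched here with a hypothetical StatsD-style metrics client; the response.usage shape varies by SDK.

import time

def track_llm_call(fn):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            response = fn(*args, **kwargs)
            metrics.timing("llm.latency_ms", (time.monotonic() - start) * 1000)
            metrics.incr("llm.tokens", response.usage.total_tokens)  # SDK-specific field
            return response
        except Exception:
            metrics.incr("llm.error")
            raise
    return wrapper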

Embedding model selection

Model                    Dims   Cost ($/1M tokens)   Use
text-embedding-3-small   1536   0.02                 Most cases
text-embedding-3-large   3072   0.13                 High accuracy
bge-large (local)        1024   Free                 Self-hosted
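
For the OpenAI rows, a minimal embedding call with the current openai Python SDK (pick the model per the table):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["first chunk", "second chunk"],
)
vectors = [item.embedding for item in resp.data]  # one 1536-dim vector per input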