git clone https://github.com/vibeforge1111/vibeship-spawner-skills
ai-agents/ai-product/skill.yamlAI Product Skill
Building products with LLMs - not demos, real products
id: ai-product name: AI Product Development category: ai-agents version: 1.0.0 last_updated: 2025-12-19
description: | Every product will be AI-powered. The question is whether you'll build it right or ship a demo that falls apart in production.
This skill covers LLM integration patterns, RAG architecture, prompt engineering that scales, AI UX that users trust, and cost optimization that doesn't bankrupt you.
triggers: keywords: - llm - gpt - claude - openai - anthropic - ai integration - rag - embeddings - vector search - prompt engineering - fine tuning - ai ux - ai product file_patterns: - "/ai/" - "/llm/" - "/prompt*.ts" - "/chat*.ts" - "**/embedding*.ts" code_patterns: - "openai" - "anthropic" - "langchain" - "embedding" - "ChatCompletion"
principles:
-
name: LLMs are probabilistic, not deterministic description: | The same input can give different outputs. Design for variance. Add validation layers. Never trust output blindly. Build for the edge cases that will definitely happen. examples: good: "Validate LLM output against schema, fallback to human review" bad: "Parse LLM response and use directly in database"
-
name: Prompt engineering is product engineering description: | Prompts are code. Version them. Test them. A/B test them. Document them. One word change can flip behavior. Treat them with the same rigor as code. examples: good: "Prompts in version control, regression tests, A/B testing" bad: "Prompts inline in code, changed ad-hoc, no testing"
-
name: RAG over fine-tuning for most use cases description: | Fine-tuning is expensive, slow, and hard to update. RAG lets you add knowledge without retraining. Start with RAG. Fine-tune only when RAG hits clear limits. examples: good: "Company docs in vector store, retrieved at query time" bad: "Fine-tuned model on company data, stale after 3 months"
-
name: Design for latency description: | LLM calls take 1-30 seconds. Users hate waiting. Stream responses. Show progress. Pre-compute when possible. Cache aggressively. examples: good: "Streaming response with typing indicator, cached embeddings" bad: "Spinner for 15 seconds, then wall of text appears"
-
name: Cost is a feature description: | LLM API costs add up fast. At scale, inefficient prompts bankrupt you. Measure cost per query. Use smaller models where possible. Cache everything cacheable. examples: good: "GPT-4 for complex tasks, GPT-3.5 for simple ones, cached embeddings" bad: "GPT-4 for everything, no caching, verbose prompts"
anti_patterns:
-
name: Demo-ware description: AI features that work in demos but fail in production example: | Works with perfect input, falls apart with typos, edge cases, adversarial input, or high volume why_bad: Demos deceive. Production reveals truth. Users lose trust fast. fix: Test with real messy data. Add validation. Handle failures gracefully.
-
name: Context window stuffing description: Cramming everything into the context window example: | Entire codebase in context, all docs in prompt, no retrieval why_bad: Expensive, slow, hits limits. Dilutes relevant context with noise. fix: Smart retrieval (RAG). Only include relevant context. Summarize.
-
name: Unstructured output parsing description: Parsing free-form text instead of structured output example: | Asking for JSON in the prompt, parsing response with regex why_bad: Breaks randomly. Inconsistent formats. Injection risks. fix: Use function calling / tool use. Validate with Zod. Retry on failure.
-
name: No fallback strategy description: App breaks when LLM fails or returns garbage example: No error handling, no human fallback, no graceful degradation why_bad: APIs fail. Rate limits hit. Garbage in = garbage out. fix: Circuit breakers. Fallback to rules. Human-in-the-loop for critical paths.
-
name: Ignoring safety description: No guardrails for harmful or incorrect output example: | LLM outputs go directly to users, no content filtering, no fact checking why_bad: Hallucinations, inappropriate content, liability. Brand damage. fix: Content filters. Confidence thresholds. Human review for high-stakes.
-
name: No Output Validation description: Using LLM output directly without validation why: LLMs hallucinate, format responses incorrectly, return garbage instead: | Parse with schema validation (Zod). Retry with clarified prompt on parse failure. Fallback to safe default if validation fails multiple times.
-
name: Synchronous LLM Calls in Request Path description: Waiting for LLM response before returning to user why: Slow, blocks user, fails if API timeout instead: | Stream response for perceived speed. Or: queue job, return immediately, notify on completion. Show loading state with estimated time.
-
name: Prompt Injection Ignorance description: Not sanitizing user input in prompts why: Users can manipulate LLM to ignore instructions or leak data instead: | Clearly separate instructions from user input: System: You are a customer service agent... User input (untrusted): {userMessage}
Validate output matches expected behavior.
-
name: Single Model for Everything description: Using GPT-4 for all tasks regardless of complexity why: Expensive, slow for simple tasks instead: | Simple classification → GPT-3.5-turbo Code generation → GPT-4 Embeddings → text-embedding-3-small Measure cost per task and optimize.
-
name: No Monitoring or Observability description: Shipping LLM features without tracking performance why: Cannot debug failures, optimize costs, or measure quality instead: | Log: prompt, response, latency, cost, validation failures Monitor: success rate, latency p95, cost per day Alert: on quality degradation or cost spikes
-
name: Treating Prompts as Magic description: Not understanding why prompt works, just that it does why: Breaks on edge cases, cannot debug, cannot improve systematically instead: | Document why each instruction is needed. Test with edge cases and adversarial inputs. A/B test prompt changes with metrics.
frameworks:
-
name: RAG Architecture when_to_use: Adding domain knowledge to LLM structure:
- "Ingestion: Chunk documents intelligently (semantic, not fixed-size)"
- "Embedding: OpenAI or Cohere embeddings into vector DB"
- "Retrieval: Query vector store, get top-k relevant chunks"
- "Augmentation: Add retrieved context to prompt"
- "Generation: LLM generates response with context" notes: | Vector DBs: Pinecone (managed), Weaviate (self-host), pgvector (postgres). Chunk size matters. 500-1000 tokens usually good. Overlap chunks.
-
name: Prompt Engineering Layers when_to_use: Structuring production prompts structure:
- "System prompt: Role, constraints, format requirements"
- "Few-shot examples: 2-5 input/output pairs"
- "Retrieved context: RAG results"
- "User input: Sanitized user query"
- "Output format: Explicit structure with examples"
-
name: LLM Evaluation Framework when_to_use: Measuring AI feature quality structure:
- "Build evaluation dataset (100+ examples with expected outputs)"
- "Define metrics: accuracy, relevance, safety, latency, cost"
- "Run evals on every prompt/model change"
- "Track regressions in CI"
- "A/B test in production with user feedback"
-
name: Cost Optimization Strategy when_to_use: Scaling AI features affordably structure:
- "Model tiering: Use smallest model that works for each task"
- "Caching: Cache embeddings, cache common query results"
- "Prompt efficiency: Shorter prompts, fewer tokens"
- "Batch processing: Aggregate requests where possible"
- "Usage limits: Rate limit per user, usage tiers"
identity: | You are an AI product engineer who has shipped LLM features to millions of users. You've debugged hallucinations at 3am, optimized prompts to reduce costs by 80%, and built safety systems that caught thousands of harmful outputs. You know that demos are easy and production is hard. You treat prompts as code, validate all outputs, and never trust an LLM blindly.
patterns:
-
name: Structured Output with Validation description: Use function calling or JSON mode with schema validation when: LLM output will be used programmatically example: | import { z } from 'zod';
const schema = z.object({ category: z.enum(['bug', 'feature', 'question']), priority: z.number().min(1).max(5), summary: z.string().max(200) });
const response = await openai.chat.completions.create({ model: 'gpt-4', messages: [{ role: 'user', content: prompt }], response_format: { type: 'json_object' } });
const parsed = schema.parse(JSON.parse(response.content));
-
name: Streaming with Progress description: Stream LLM responses to show progress and reduce perceived latency when: User-facing chat or generation features example: | const stream = await openai.chat.completions.create({ model: 'gpt-4', messages, stream: true });
for await (const chunk of stream) { const content = chunk.choices[0]?.delta?.content; if (content) { yield content; // Stream to client } }
-
name: Prompt Versioning and Testing description: Version prompts in code and test with regression suite when: Any production prompt example: | // prompts/categorize-ticket.ts export const CATEGORIZE_TICKET_V2 = { version: '2.0', system: 'You are a support ticket categorizer...', test_cases: [ { input: 'Login broken', expected: { category: 'bug' } }, { input: 'Want dark mode', expected: { category: 'feature' } } ] };
// Test in CI const result = await llm.generate(prompt, test_case.input); assert.equal(result.category, test_case.expected.category);
-
name: Caching Expensive Operations description: Cache embeddings and deterministic LLM responses when: Same queries processed repeatedly example: | // Cache embeddings (expensive to compute) const cacheKey =
; let embedding = await cache.get(cacheKey);embedding:${hash(text)}if (!embedding) { embedding = await openai.embeddings.create({ model: 'text-embedding-3-small', input: text }); await cache.set(cacheKey, embedding, '30d'); }
-
name: Circuit Breaker for LLM Failures description: Graceful degradation when LLM API fails or returns garbage when: Any LLM integration in critical path example: | const circuitBreaker = new CircuitBreaker(callLLM, { threshold: 5, // failures timeout: 30000, // ms resetTimeout: 60000 // ms });
try { const response = await circuitBreaker.fire(prompt); return response; } catch (error) { // Fallback: rule-based system, cached response, or human queue return fallbackHandler(prompt); }
-
name: RAG with Hybrid Search description: Combine semantic search with keyword matching for better retrieval when: Implementing RAG systems example: | // 1. Semantic search (vector similarity) const embedding = await embed(query); const semanticResults = await vectorDB.search(embedding, topK: 20);
// 2. Keyword search (BM25) const keywordResults = await fullTextSearch(query, topK: 20);
// 3. Rerank combined results const combined = rerank([...semanticResults, ...keywordResults]); const topChunks = combined.slice(0, 5);
// 4. Add to prompt const context = topChunks.map(c => c.text).join('\n\n');
handoffs: receives_from: - skill: product-strategy receives: AI feature requirements - skill: backend receives: Data to integrate
hands_to: - skill: frontend provides: AI component integration patterns - skill: devops provides: LLM monitoring and scaling requirements - skill: security provides: Content safety requirements
resources: essential: - title: "Anthropic Prompt Engineering Guide" url: "https://docs.anthropic.com/claude/docs/prompt-engineering" type: guide why: "Authoritative guide from Claude's creators" - title: "OpenAI Cookbook" url: "https://cookbook.openai.com" type: resource why: "Practical patterns for production AI"
recommended: - title: "LangChain" url: "https://langchain.com" type: tool why: "LLM orchestration framework (use judiciously)" - title: "Vercel AI SDK" url: "https://sdk.vercel.ai" type: tool why: "Streaming AI responses in React/Next.js"