claude-skill-registry · agenta
LLM prompt management and evaluation platform. Version prompts, run A/B tests, evaluate with metrics, and deploy with confidence using Agenta's self-hosted solution.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agenta" ~/.claude/skills/majiayu000-claude-skill-registry-agenta && rm -rf "$T"
manifest:
skills/data/agenta/SKILL.md
safety · automated scan (medium risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
- references API keys
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Agenta Skill
Manage, evaluate, and deploy LLM prompts with confidence. Version control your prompts, run A/B tests, and measure quality with automated evaluation.
Quick Start
```bash
# Install the Agenta SDK
pip install agenta

# Start Agenta locally with Docker (skip this if you only need the SDK)
docker run -d -p 3000:3000 -p 8000:8000 ghcr.io/agenta-ai/agenta

# Initialize a project
agenta init --app-name my-llm-app
```
When to Use This Skill
USE when:
- Managing multiple versions of prompts in production
- Running systematic A/B tests on prompt variations
- Evaluating prompt quality with automated metrics
- Collaborating on prompt development across teams
- Requiring audit trails for prompt changes
- Building LLM applications that need to iterate quickly
- Comparing different models with the same prompts
- Experimenting rapidly with prompts in a playground
- Requiring self-hosting for security or compliance
DON'T USE when:
- Simple single-prompt applications
- No need for prompt versioning or testing
- Already using another prompt management system
- Rapid prototyping without evaluation needs
- Cost-sensitive projects (evaluation adds API calls)
Prerequisites
```bash
# SDK installation (quote the spec so the shell does not treat >= as redirection)
pip install "agenta>=0.10.0"

# For self-hosted deployment
docker pull ghcr.io/agenta-ai/agenta

# Or with docker-compose
git clone https://github.com/Agenta-AI/agenta
cd agenta
docker-compose up -d

# Environment setup
export AGENTA_HOST="http://localhost:3000"
export AGENTA_API_KEY="your-api-key"  # If using the cloud version

# For LLM providers
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
Verify Installation
```python
import agenta as ag
from agenta import Agenta

# Initialize the client
client = Agenta()

# Confirm the SDK imports and report its version
print(f"Agenta SDK version: {ag.__version__}")
print("Client initialized successfully!")
```
Core Capabilities
1. Prompt Versioning and Management
Creating Versioned Prompts:
""" Create and manage versioned prompts with Agenta. """ import agenta as ag from agenta import Agenta from typing import Optional, Dict, Any # Initialize Agenta ag.init() @ag.entrypoint def generate_summary( text: str, max_length: int = 100, style: str = "professional" ) -> str: """ Generate a summary with versioned prompt. Args: text: Text to summarize max_length: Maximum summary length style: Writing style (professional, casual, technical) Returns: Generated summary """ # Define prompt template (this becomes versioned) prompt = f"""Summarize the following text in a {style} tone. Keep the summary under {max_length} words. Text: {text} Summary:""" # Call LLM (Agenta tracks this) response = ag.llm.complete( prompt=prompt, model="gpt-4", temperature=0.3, max_tokens=max_length * 2 ) return response.text # Example usage text = """ The company reported strong Q3 results with revenue up 25% year-over-year. Operating margins improved to 18% from 15% in the prior year. The CEO highlighted expansion into new markets and product launches. """ summary = generate_summary(text, max_length=50, style="professional") print(summary)
Managing Prompt Versions:
""" Manage multiple prompt versions programmatically. """ import agenta as ag from agenta import Agenta from dataclasses import dataclass from typing import List, Dict, Optional from datetime import datetime @dataclass class PromptVersion: """Represents a prompt version.""" version_id: str name: str template: str parameters: Dict[str, Any] created_at: datetime is_active: bool = False class PromptManager: """ Manage prompt versions with Agenta. """ def __init__(self, app_name: str): self.app_name = app_name self.client = Agenta() def create_version( self, name: str, template: str, parameters: Dict[str, Any] = None ) -> PromptVersion: """ Create a new prompt version. Args: name: Version name template: Prompt template parameters: Default parameters Returns: Created PromptVersion """ # Create variant in Agenta variant = self.client.create_variant( app_name=self.app_name, variant_name=name, config={ "template": template, "parameters": parameters or {} } ) return PromptVersion( version_id=variant.id, name=name, template=template, parameters=parameters or {}, created_at=datetime.now(), is_active=False ) def list_versions(self) -> List[PromptVersion]: """List all prompt versions.""" variants = self.client.list_variants(app_name=self.app_name) versions = [] for v in variants: versions.append(PromptVersion( version_id=v.id, name=v.name, template=v.config.get("template", ""), parameters=v.config.get("parameters", {}), created_at=v.created_at, is_active=v.is_default )) return versions def set_active_version(self, version_id: str) -> None: """Set a version as the active/default version.""" self.client.set_default_variant( app_name=self.app_name, variant_id=version_id ) def get_version(self, version_id: str) -> PromptVersion: """Get a specific version.""" variant = self.client.get_variant(variant_id=version_id) return PromptVersion( version_id=variant.id, name=variant.name, template=variant.config.get("template", ""), parameters=variant.config.get("parameters", {}), created_at=variant.created_at, is_active=variant.is_default ) def compare_versions( self, version_ids: List[str], test_input: str ) -> Dict[str, str]: """ Compare outputs from multiple versions. Args: version_ids: List of version IDs to compare test_input: Input to test with Returns: Dictionary mapping version_id to output """ results = {} for vid in version_ids: version = self.get_version(vid) # Format prompt with test input prompt = version.template.format(input=test_input) # Generate output response = ag.llm.complete(prompt=prompt) results[vid] = response.text return results # Usage manager = PromptManager("summarizer-app") # Create versions v1 = manager.create_version( name="concise-v1", template="Summarize briefly: {input}", parameters={"max_tokens": 100} ) v2 = manager.create_version( name="detailed-v2", template="Provide a comprehensive summary with key points: {input}", parameters={"max_tokens": 300} ) # List all versions versions = manager.list_versions() for v in versions: print(f"{v.name}: {v.version_id} (active: {v.is_active})") # Set active version manager.set_active_version(v1.version_id)
2. A/B Testing Prompts
Setting Up A/B Tests:
""" Configure and run A/B tests on prompt variations. """ import agenta as ag from agenta import Agenta from typing import Dict, List, Optional from dataclasses import dataclass import random @dataclass class ABTestConfig: """Configuration for A/B test.""" name: str variants: Dict[str, float] # variant_id: traffic_percentage metrics: List[str] min_samples: int = 100 class ABTestRunner: """ Run A/B tests on prompt variants. """ def __init__(self, app_name: str): self.app_name = app_name self.client = Agenta() self.results: Dict[str, List[Dict]] = {} def create_test( self, name: str, control_variant: str, treatment_variant: str, traffic_split: float = 0.5 ) -> ABTestConfig: """ Create an A/B test. Args: name: Test name control_variant: Control variant ID treatment_variant: Treatment variant ID traffic_split: Percentage for treatment (0-1) Returns: ABTestConfig """ config = ABTestConfig( name=name, variants={ control_variant: 1 - traffic_split, treatment_variant: traffic_split }, metrics=["response_quality", "latency", "cost"] ) # Initialize results tracking for variant in config.variants.keys(): self.results[variant] = [] return config def route_request(self, config: ABTestConfig) -> str: """ Route a request to a variant based on traffic split. Args: config: A/B test configuration Returns: Selected variant ID """ rand = random.random() cumulative = 0 for variant_id, percentage in config.variants.items(): cumulative += percentage if rand <= cumulative: return variant_id # Fallback to first variant return list(config.variants.keys())[0] def run_request( self, config: ABTestConfig, input_data: str ) -> Dict: """ Run a single request in the A/B test. Args: config: A/B test configuration input_data: Input for the prompt Returns: Result dictionary with variant and output """ import time # Route to variant variant_id = self.route_request(config) variant = self.client.get_variant(variant_id) # Prepare prompt prompt = variant.config.get("template", "").format(input=input_data) # Run with timing start_time = time.time() response = ag.llm.complete(prompt=prompt) latency = time.time() - start_time result = { "variant_id": variant_id, "input": input_data, "output": response.text, "latency": latency, "tokens_used": response.usage.total_tokens if hasattr(response, 'usage') else 0 } # Store result self.results[variant_id].append(result) return result def get_test_results(self, config: ABTestConfig) -> Dict: """ Get aggregated results for an A/B test. Args: config: A/B test configuration Returns: Aggregated results by variant """ summary = {} for variant_id, results in self.results.items(): if not results: continue latencies = [r["latency"] for r in results] tokens = [r["tokens_used"] for r in results] summary[variant_id] = { "sample_count": len(results), "avg_latency": sum(latencies) / len(latencies), "avg_tokens": sum(tokens) / len(tokens) if tokens else 0, "min_latency": min(latencies), "max_latency": max(latencies) } return summary def declare_winner(self, config: ABTestConfig) -> Optional[str]: """ Analyze results and declare a winner. 
Args: config: A/B test configuration Returns: Winner variant ID or None if inconclusive """ summary = self.get_test_results(config) # Check minimum samples for variant_id, stats in summary.items(): if stats["sample_count"] < config.min_samples: print(f"Insufficient samples for {variant_id}") return None # Simple winner selection based on latency # In production, use statistical significance tests best_variant = min( summary.keys(), key=lambda v: summary[v]["avg_latency"] ) return best_variant # Usage Example ag.init() runner = ABTestRunner("chatbot-app") # Create A/B test test_config = runner.create_test( name="prompt-optimization-test", control_variant="variant-a-id", treatment_variant="variant-b-id", traffic_split=0.5 ) # Run test requests test_inputs = [ "What is machine learning?", "Explain neural networks", "How does backpropagation work?" ] for input_text in test_inputs: result = runner.run_request(test_config, input_text) print(f"Variant: {result['variant_id']}, Latency: {result['latency']:.3f}s") # Get results results = runner.get_test_results(test_config) print("\nTest Results:") for variant, stats in results.items(): print(f" {variant}: {stats}")
3. Evaluation Metrics and Testing
Automated Evaluation Pipeline:
""" Evaluate prompts with automated metrics. """ import agenta as ag from agenta import Agenta from typing import List, Dict, Callable, Any from dataclasses import dataclass import json @dataclass class EvaluationResult: """Result of an evaluation.""" metric_name: str score: float details: Dict[str, Any] class MetricEvaluator: """Base class for evaluation metrics.""" def __init__(self, name: str): self.name = name def evaluate( self, output: str, expected: str = None, context: Dict = None ) -> EvaluationResult: raise NotImplementedError class ExactMatchMetric(MetricEvaluator): """Exact match evaluation.""" def __init__(self): super().__init__("exact_match") def evaluate(self, output: str, expected: str = None, context: Dict = None) -> EvaluationResult: if expected is None: return EvaluationResult(self.name, 0.0, {"error": "No expected value"}) match = output.strip().lower() == expected.strip().lower() return EvaluationResult( metric_name=self.name, score=1.0 if match else 0.0, details={"match": match} ) class ContainsMetric(MetricEvaluator): """Check if output contains expected keywords.""" def __init__(self, keywords: List[str]): super().__init__("contains_keywords") self.keywords = keywords def evaluate(self, output: str, expected: str = None, context: Dict = None) -> EvaluationResult: output_lower = output.lower() found = [kw for kw in self.keywords if kw.lower() in output_lower] score = len(found) / len(self.keywords) return EvaluationResult( metric_name=self.name, score=score, details={ "found_keywords": found, "missing_keywords": [kw for kw in self.keywords if kw.lower() not in output_lower] } ) class LengthMetric(MetricEvaluator): """Evaluate output length.""" def __init__(self, min_length: int = 10, max_length: int = 500): super().__init__("length") self.min_length = min_length self.max_length = max_length def evaluate(self, output: str, expected: str = None, context: Dict = None) -> EvaluationResult: length = len(output.split()) if self.min_length <= length <= self.max_length: score = 1.0 elif length < self.min_length: score = length / self.min_length else: score = max(0, 1 - (length - self.max_length) / self.max_length) return EvaluationResult( metric_name=self.name, score=score, details={ "word_count": length, "min_length": self.min_length, "max_length": self.max_length } ) class LLMJudgeMetric(MetricEvaluator): """Use an LLM to judge output quality.""" def __init__(self, criteria: str = "helpfulness"): super().__init__(f"llm_judge_{criteria}") self.criteria = criteria def evaluate(self, output: str, expected: str = None, context: Dict = None) -> EvaluationResult: judge_prompt = f"""Evaluate the following response on {self.criteria}. Score from 0.0 to 1.0. Response: {output} {f'Expected: {expected}' if expected else ''} Provide your evaluation as JSON: {{"score": 0.0-1.0, "reasoning": "..."}} """ response = ag.llm.complete( prompt=judge_prompt, model="gpt-4", temperature=0 ) try: result = json.loads(response.text) score = float(result.get("score", 0.5)) reasoning = result.get("reasoning", "") except (json.JSONDecodeError, ValueError): score = 0.5 reasoning = "Failed to parse judge response" return EvaluationResult( metric_name=self.name, score=score, details={"reasoning": reasoning, "criteria": self.criteria} ) class EvaluationPipeline: """ Pipeline for running multiple evaluations. 
""" def __init__(self, app_name: str): self.app_name = app_name self.client = Agenta() self.metrics: List[MetricEvaluator] = [] def add_metric(self, metric: MetricEvaluator) -> 'EvaluationPipeline': """Add a metric to the pipeline.""" self.metrics.append(metric) return self def evaluate_single( self, output: str, expected: str = None, context: Dict = None ) -> Dict[str, EvaluationResult]: """ Evaluate a single output with all metrics. Args: output: Generated output expected: Expected output (optional) context: Additional context Returns: Dictionary of metric results """ results = {} for metric in self.metrics: result = metric.evaluate(output, expected, context) results[metric.name] = result return results def evaluate_batch( self, test_cases: List[Dict] ) -> Dict[str, List[EvaluationResult]]: """ Evaluate a batch of test cases. Args: test_cases: List of {input, output, expected} dicts Returns: Aggregated results by metric """ all_results = {metric.name: [] for metric in self.metrics} for case in test_cases: results = self.evaluate_single( output=case.get("output", ""), expected=case.get("expected"), context=case.get("context") ) for metric_name, result in results.items(): all_results[metric_name].append(result) return all_results def get_summary(self, batch_results: Dict[str, List[EvaluationResult]]) -> Dict: """ Get summary statistics from batch evaluation. Args: batch_results: Results from evaluate_batch Returns: Summary statistics """ summary = {} for metric_name, results in batch_results.items(): scores = [r.score for r in results] summary[metric_name] = { "mean": sum(scores) / len(scores) if scores else 0, "min": min(scores) if scores else 0, "max": max(scores) if scores else 0, "count": len(scores) } return summary # Usage ag.init() # Create evaluation pipeline pipeline = EvaluationPipeline("qa-bot") pipeline.add_metric(ContainsMetric(["answer", "explanation"])) pipeline.add_metric(LengthMetric(min_length=20, max_length=200)) pipeline.add_metric(LLMJudgeMetric(criteria="helpfulness")) # Test cases test_cases = [ { "input": "What is Python?", "output": "Python is a programming language known for its simplicity. The answer is that it's versatile. Here's an explanation: it's widely used in data science and web development.", "expected": "Python is a high-level programming language" }, { "input": "Explain recursion", "output": "Recursion is a function calling itself. The answer involves base cases and recursive calls. Explanation: it's useful for tree structures.", "expected": "A function that calls itself" } ] # Run evaluation results = pipeline.evaluate_batch(test_cases) summary = pipeline.get_summary(results) print("Evaluation Summary:") for metric, stats in summary.items(): print(f" {metric}: mean={stats['mean']:.2f}, min={stats['min']:.2f}, max={stats['max']:.2f}")
4. Playground and Experimentation
Creating an Interactive Playground:
""" Build an interactive playground for prompt experimentation. """ import agenta as ag from agenta import Agenta from typing import Dict, List, Any, Optional from dataclasses import dataclass, field from datetime import datetime import json @dataclass class ExperimentRun: """Single experiment run.""" run_id: str prompt: str parameters: Dict[str, Any] output: str metrics: Dict[str, float] timestamp: datetime = field(default_factory=datetime.now) class Playground: """ Interactive playground for prompt experimentation. """ def __init__(self, app_name: str): self.app_name = app_name self.client = Agenta() self.experiments: List[ExperimentRun] = [] self.current_prompt = "" self.current_params = {} def set_prompt(self, prompt: str) -> 'Playground': """Set the current prompt template.""" self.current_prompt = prompt return self def set_parameters(self, **params) -> 'Playground': """Set LLM parameters.""" self.current_params.update(params) return self def run(self, input_data: str) -> ExperimentRun: """ Run the current prompt with input. Args: input_data: Input to format into prompt Returns: ExperimentRun with results """ import time import uuid # Format prompt formatted_prompt = self.current_prompt.format(input=input_data) # Run with timing start_time = time.time() response = ag.llm.complete( prompt=formatted_prompt, **self.current_params ) latency = time.time() - start_time # Create run record run = ExperimentRun( run_id=str(uuid.uuid4())[:8], prompt=formatted_prompt, parameters=self.current_params.copy(), output=response.text, metrics={ "latency": latency, "output_length": len(response.text), "tokens": response.usage.total_tokens if hasattr(response, 'usage') else 0 } ) self.experiments.append(run) return run def compare( self, prompts: List[str], test_input: str, parameters: Dict = None ) -> List[ExperimentRun]: """ Compare multiple prompts with same input. Args: prompts: List of prompt templates test_input: Input to test parameters: Shared parameters Returns: List of ExperimentRuns """ runs = [] original_prompt = self.current_prompt original_params = self.current_params.copy() if parameters: self.set_parameters(**parameters) for prompt in prompts: self.set_prompt(prompt) run = self.run(test_input) runs.append(run) # Restore original state self.current_prompt = original_prompt self.current_params = original_params return runs def parameter_sweep( self, param_name: str, values: List[Any], test_input: str ) -> List[ExperimentRun]: """ Sweep over parameter values. 
Args: param_name: Parameter to sweep values: List of values to try test_input: Input for testing Returns: List of ExperimentRuns """ runs = [] original_value = self.current_params.get(param_name) for value in values: self.current_params[param_name] = value run = self.run(test_input) runs.append(run) # Restore original value if original_value is not None: self.current_params[param_name] = original_value else: self.current_params.pop(param_name, None) return runs def get_history(self, limit: int = 10) -> List[ExperimentRun]: """Get recent experiment history.""" return self.experiments[-limit:] def export_experiments(self, filepath: str) -> None: """Export experiments to JSON file.""" data = [] for exp in self.experiments: data.append({ "run_id": exp.run_id, "prompt": exp.prompt, "parameters": exp.parameters, "output": exp.output, "metrics": exp.metrics, "timestamp": exp.timestamp.isoformat() }) with open(filepath, 'w') as f: json.dump(data, f, indent=2) def find_best_run(self, metric: str = "latency", minimize: bool = True) -> Optional[ExperimentRun]: """ Find the best run based on a metric. Args: metric: Metric to optimize minimize: Whether to minimize (True) or maximize (False) Returns: Best ExperimentRun or None """ if not self.experiments: return None valid_runs = [e for e in self.experiments if metric in e.metrics] if not valid_runs: return None if minimize: return min(valid_runs, key=lambda e: e.metrics[metric]) else: return max(valid_runs, key=lambda e: e.metrics[metric]) # Usage ag.init() playground = Playground("experiment-app") # Set up experiment playground.set_prompt("Answer this question concisely: {input}") playground.set_parameters(model="gpt-4", temperature=0.3, max_tokens=100) # Run single experiment run = playground.run("What is machine learning?") print(f"Output: {run.output}") print(f"Latency: {run.metrics['latency']:.3f}s") # Compare prompts comparison_runs = playground.compare( prompts=[ "Answer briefly: {input}", "Explain in detail: {input}", "Give a one-sentence answer: {input}" ], test_input="What is deep learning?" ) print("\nPrompt Comparison:") for i, run in enumerate(comparison_runs): print(f" Prompt {i+1}: {run.metrics['latency']:.3f}s, {run.metrics['output_length']} chars") # Parameter sweep temperature_runs = playground.parameter_sweep( param_name="temperature", values=[0.0, 0.3, 0.7, 1.0], test_input="Write a creative story opening" ) print("\nTemperature Sweep:") for run in temperature_runs: print(f" temp={run.parameters['temperature']}: {run.output[:50]}...") # Find best run best = playground.find_best_run(metric="latency", minimize=True) if best: print(f"\nBest run: {best.run_id} with latency {best.metrics['latency']:.3f}s") # Export experiments playground.export_experiments("experiments.json")
5. Model Comparison
Comparing Different LLM Models:
""" Compare performance across different LLM models. """ import agenta as ag from agenta import Agenta from typing import Dict, List, Any from dataclasses import dataclass import time @dataclass class ModelResult: """Result from a single model run.""" model: str output: str latency: float tokens: int cost: float class ModelComparator: """ Compare prompts across different models. """ # Cost per 1K tokens (approximate) MODEL_COSTS = { "gpt-4": {"input": 0.03, "output": 0.06}, "gpt-4-turbo": {"input": 0.01, "output": 0.03}, "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015}, "claude-3-opus": {"input": 0.015, "output": 0.075}, "claude-3-sonnet": {"input": 0.003, "output": 0.015}, "claude-3-haiku": {"input": 0.00025, "output": 0.00125} } def __init__(self, models: List[str] = None): self.models = models or ["gpt-4", "gpt-3.5-turbo"] self.results: Dict[str, List[ModelResult]] = {m: [] for m in self.models} def _estimate_cost(self, model: str, input_tokens: int, output_tokens: int) -> float: """Estimate cost for a model run.""" costs = self.MODEL_COSTS.get(model, {"input": 0.01, "output": 0.03}) return (input_tokens / 1000 * costs["input"] + output_tokens / 1000 * costs["output"]) def run_comparison( self, prompt: str, temperature: float = 0.3, max_tokens: int = 200 ) -> Dict[str, ModelResult]: """ Run the same prompt across all models. Args: prompt: Prompt to test temperature: Temperature setting max_tokens: Maximum output tokens Returns: Results for each model """ results = {} for model in self.models: start_time = time.time() try: response = ag.llm.complete( prompt=prompt, model=model, temperature=temperature, max_tokens=max_tokens ) latency = time.time() - start_time # Get token counts input_tokens = len(prompt.split()) * 1.3 # Rough estimate output_tokens = len(response.text.split()) * 1.3 if hasattr(response, 'usage'): input_tokens = response.usage.prompt_tokens output_tokens = response.usage.completion_tokens result = ModelResult( model=model, output=response.text, latency=latency, tokens=int(input_tokens + output_tokens), cost=self._estimate_cost(model, input_tokens, output_tokens) ) except Exception as e: result = ModelResult( model=model, output=f"Error: {str(e)}", latency=0, tokens=0, cost=0 ) results[model] = result self.results[model].append(result) return results def run_benchmark( self, prompts: List[str], temperature: float = 0.3 ) -> Dict[str, Dict]: """ Run benchmark across multiple prompts. Args: prompts: List of prompts to test temperature: Temperature setting Returns: Aggregated benchmark results """ for prompt in prompts: self.run_comparison(prompt, temperature) return self.get_summary() def get_summary(self) -> Dict[str, Dict]: """Get summary statistics for all models.""" summary = {} for model, results in self.results.items(): if not results: continue valid_results = [r for r in results if r.latency > 0] if not valid_results: continue summary[model] = { "runs": len(valid_results), "avg_latency": sum(r.latency for r in valid_results) / len(valid_results), "avg_tokens": sum(r.tokens for r in valid_results) / len(valid_results), "total_cost": sum(r.cost for r in valid_results), "min_latency": min(r.latency for r in valid_results), "max_latency": max(r.latency for r in valid_results) } return summary def recommend_model( self, priority: str = "balanced" ) -> str: """ Recommend best model based on priority. 
Args: priority: "speed", "cost", "quality", or "balanced" Returns: Recommended model name """ summary = self.get_summary() if not summary: return self.models[0] if priority == "speed": return min(summary.keys(), key=lambda m: summary[m]["avg_latency"]) elif priority == "cost": return min(summary.keys(), key=lambda m: summary[m]["total_cost"]) elif priority == "quality": # Assume larger models = better quality quality_order = ["gpt-4", "claude-3-opus", "gpt-4-turbo", "claude-3-sonnet", "gpt-3.5-turbo"] for model in quality_order: if model in summary: return model else: # balanced # Score based on normalized latency and cost scores = {} max_latency = max(s["avg_latency"] for s in summary.values()) max_cost = max(s["total_cost"] for s in summary.values()) or 1 for model, stats in summary.items(): norm_latency = stats["avg_latency"] / max_latency norm_cost = stats["total_cost"] / max_cost scores[model] = norm_latency * 0.5 + norm_cost * 0.5 return min(scores.keys(), key=lambda m: scores[m]) return self.models[0] # Usage ag.init() comparator = ModelComparator(models=["gpt-4", "gpt-3.5-turbo"]) # Single comparison results = comparator.run_comparison("Explain quantum computing in simple terms") print("Single Comparison Results:") for model, result in results.items(): print(f" {model}:") print(f" Latency: {result.latency:.3f}s") print(f" Tokens: {result.tokens}") print(f" Cost: ${result.cost:.4f}") print(f" Output: {result.output[:100]}...") # Benchmark benchmark_prompts = [ "What is machine learning?", "Explain the difference between AI and ML", "Write a haiku about technology" ] comparator.run_benchmark(benchmark_prompts) print("\nBenchmark Summary:") summary = comparator.get_summary() for model, stats in summary.items(): print(f" {model}:") print(f" Runs: {stats['runs']}") print(f" Avg Latency: {stats['avg_latency']:.3f}s") print(f" Total Cost: ${stats['total_cost']:.4f}") # Get recommendation recommended = comparator.recommend_model(priority="balanced") print(f"\nRecommended model (balanced): {recommended}")
6. Self-Hosted Deployment
Setting Up Self-Hosted Agenta:
""" Configure and manage self-hosted Agenta deployment. """ import agenta as ag from agenta import Agenta from typing import Dict, Any, Optional import os import requests from dataclasses import dataclass @dataclass class DeploymentConfig: """Configuration for self-hosted deployment.""" host: str port: int api_key: Optional[str] database_url: str redis_url: Optional[str] enable_tracing: bool = True class SelfHostedManager: """ Manage self-hosted Agenta deployment. """ def __init__(self, config: DeploymentConfig): self.config = config self.base_url = f"http://{config.host}:{config.port}" self.client = None def initialize(self) -> bool: """ Initialize connection to self-hosted instance. Returns: True if successful """ try: # Set environment for SDK os.environ["AGENTA_HOST"] = self.base_url if self.config.api_key: os.environ["AGENTA_API_KEY"] = self.config.api_key # Initialize Agenta ag.init() self.client = Agenta() # Test connection response = requests.get(f"{self.base_url}/api/health") return response.status_code == 200 except Exception as e: print(f"Initialization failed: {e}") return False def create_app( self, name: str, description: str = "" ) -> Dict: """ Create a new application. Args: name: Application name description: Application description Returns: Created application details """ return self.client.create_app( name=name, description=description ) def deploy_variant( self, app_name: str, variant_name: str, environment: str = "production" ) -> Dict: """ Deploy a variant to an environment. Args: app_name: Application name variant_name: Variant to deploy environment: Target environment Returns: Deployment details """ # Get variant variants = self.client.list_variants(app_name=app_name) variant = next((v for v in variants if v.name == variant_name), None) if not variant: raise ValueError(f"Variant '{variant_name}' not found") # Deploy return self.client.deploy_variant( variant_id=variant.id, environment=environment ) def get_deployment_status(self, app_name: str) -> Dict: """ Get deployment status for an application. Args: app_name: Application name Returns: Deployment status """ response = requests.get( f"{self.base_url}/api/apps/{app_name}/deployments", headers={"Authorization": f"Bearer {self.config.api_key}"} if self.config.api_key else {} ) return response.json() def configure_observability( self, tracing_endpoint: str = None, metrics_endpoint: str = None ) -> None: """ Configure observability endpoints. Args: tracing_endpoint: Endpoint for traces (e.g., Jaeger) metrics_endpoint: Endpoint for metrics (e.g., Prometheus) """ config = {} if tracing_endpoint: config["tracing"] = { "enabled": True, "endpoint": tracing_endpoint } if metrics_endpoint: config["metrics"] = { "enabled": True, "endpoint": metrics_endpoint } response = requests.post( f"{self.base_url}/api/config/observability", json=config, headers={"Authorization": f"Bearer {self.config.api_key}"} if self.config.api_key else {} ) if response.status_code != 200: raise Exception(f"Failed to configure observability: {response.text}") def generate_docker_compose(config: DeploymentConfig) -> str: """ Generate docker-compose.yml for self-hosted deployment. 
Args: config: Deployment configuration Returns: Docker compose YAML content """ compose = f"""version: '3.8' services: agenta-backend: image: ghcr.io/agenta-ai/agenta-backend:latest ports: - "{config.port}:8000" environment: - DATABASE_URL={config.database_url} - REDIS_URL={config.redis_url or "redis://redis:6379"} - ENABLE_TRACING={str(config.enable_tracing).lower()} depends_on: - postgres - redis agenta-frontend: image: ghcr.io/agenta-ai/agenta-frontend:latest ports: - "3000:3000" environment: - NEXT_PUBLIC_API_URL=http://agenta-backend:8000 postgres: image: postgres:15 environment: - POSTGRES_DB=agenta - POSTGRES_USER=agenta - POSTGRES_PASSWORD=agenta_password volumes: - postgres_data:/var/lib/postgresql/data redis: image: redis:7 volumes: - redis_data:/data volumes: postgres_data: redis_data: """ return compose # Usage config = DeploymentConfig( host="localhost", port=8000, api_key=None, # Optional for local deployment database_url="postgresql://agenta:agenta_password@postgres:5432/agenta", redis_url="redis://redis:6379", enable_tracing=True ) # Generate docker-compose compose_yaml = generate_docker_compose(config) print("Docker Compose Configuration:") print(compose_yaml) # Initialize manager (after deploying with docker-compose) # manager = SelfHostedManager(config) # if manager.initialize(): # print("Connected to self-hosted Agenta!") # # # Create app # app = manager.create_app("my-llm-app", "Production LLM application") # # # Deploy variant # deployment = manager.deploy_variant("my-llm-app", "v1", "production") # print(f"Deployed: {deployment}")
Integration Examples
FastAPI Integration
""" Integrate Agenta with FastAPI for production deployments. """ from fastapi import FastAPI, HTTPException from pydantic import BaseModel from typing import Optional import agenta as ag from agenta import Agenta app = FastAPI(title="Agenta-Powered API") # Initialize Agenta ag.init() client = Agenta() class QueryRequest(BaseModel): """Request model for queries.""" input: str variant: Optional[str] = None parameters: Optional[dict] = None class QueryResponse(BaseModel): """Response model.""" output: str variant_used: str latency: float @app.post("/generate", response_model=QueryResponse) async def generate(request: QueryRequest): """Generate response using Agenta-managed prompts.""" import time try: # Get variant (default or specified) if request.variant: variant = client.get_variant_by_name( app_name="production-app", variant_name=request.variant ) else: variant = client.get_default_variant(app_name="production-app") # Get prompt template template = variant.config.get("template", "{input}") prompt = template.format(input=request.input) # Get parameters params = variant.config.get("parameters", {}) if request.parameters: params.update(request.parameters) # Generate start_time = time.time() response = ag.llm.complete(prompt=prompt, **params) latency = time.time() - start_time return QueryResponse( output=response.text, variant_used=variant.name, latency=latency ) except Exception as e: raise HTTPException(status_code=500, detail=str(e)) @app.get("/variants") async def list_variants(): """List available variants.""" variants = client.list_variants(app_name="production-app") return [{"name": v.name, "id": v.id, "is_default": v.is_default} for v in variants] # Run with: uvicorn api:app --reload
Langchain Integration
""" Use Agenta for prompt management in Langchain applications. """ import agenta as ag from agenta import Agenta from langchain_core.prompts import PromptTemplate from langchain_openai import ChatOpenAI from langchain_core.output_parsers import StrOutputParser from typing import Dict, Any class AgentaPromptLoader: """ Load prompts from Agenta into Langchain. """ def __init__(self, app_name: str): self.app_name = app_name self.client = Agenta() self._cache: Dict[str, PromptTemplate] = {} def get_prompt( self, variant_name: str = None, use_cache: bool = True ) -> PromptTemplate: """ Get a Langchain PromptTemplate from Agenta. Args: variant_name: Variant to load (None for default) use_cache: Whether to use cached prompts Returns: Langchain PromptTemplate """ cache_key = variant_name or "default" if use_cache and cache_key in self._cache: return self._cache[cache_key] # Get variant from Agenta if variant_name: variant = self.client.get_variant_by_name( app_name=self.app_name, variant_name=variant_name ) else: variant = self.client.get_default_variant(app_name=self.app_name) # Create Langchain prompt template = variant.config.get("template", "{input}") prompt = PromptTemplate.from_template(template) # Cache self._cache[cache_key] = prompt return prompt def create_chain( self, variant_name: str = None, model: str = "gpt-4", temperature: float = 0.3 ): """ Create a Langchain chain from Agenta prompt. Args: variant_name: Variant to use model: Model name temperature: Temperature setting Returns: Langchain chain """ prompt = self.get_prompt(variant_name) llm = ChatOpenAI(model=model, temperature=temperature) return prompt | llm | StrOutputParser() # Usage ag.init() loader = AgentaPromptLoader("qa-app") # Get prompt template prompt = loader.get_prompt("concise-v1") print(f"Template: {prompt.template}") # Create and use chain chain = loader.create_chain(variant_name="detailed-v2") result = chain.invoke({"input": "What is machine learning?"}) print(f"Result: {result}")
Best Practices
1. Prompt Versioning Strategy
"""Best practices for prompt versioning.""" # DO: Use semantic versioning for prompts version_naming = { "v1.0.0": "Initial production version", "v1.1.0": "Added context handling", "v1.1.1": "Fixed edge case in formatting", "v2.0.0": "Major rewrite with new approach" } # DO: Include metadata with versions def create_versioned_prompt(name: str, template: str, metadata: dict): return { "name": name, "template": template, "metadata": { "created_by": metadata.get("author"), "description": metadata.get("description"), "changelog": metadata.get("changelog"), "test_results": metadata.get("test_results") } } # DO: Test before promoting to production def promote_to_production(variant_id: str, min_eval_score: float = 0.8): # Run evaluation score = run_evaluation(variant_id) if score >= min_eval_score: client.set_default_variant(variant_id) return True return False
2. Evaluation Strategy
"""Best practices for prompt evaluation.""" # DO: Define clear evaluation criteria evaluation_criteria = { "accuracy": {"weight": 0.4, "threshold": 0.8}, "relevance": {"weight": 0.3, "threshold": 0.7}, "coherence": {"weight": 0.2, "threshold": 0.7}, "safety": {"weight": 0.1, "threshold": 0.9} } # DO: Use diverse test sets def create_evaluation_set(): return [ {"input": "...", "expected": "...", "category": "basic"}, {"input": "...", "expected": "...", "category": "edge_case"}, {"input": "...", "expected": "...", "category": "adversarial"} ] # DO: Track evaluation over time def track_evaluation_history(app_name: str, variant_id: str, results: dict): # Store results with timestamp for trend analysis pass
3. A/B Testing Guidelines
"""Best practices for A/B testing prompts.""" # DO: Calculate required sample size def calculate_sample_size( baseline_metric: float, minimum_detectable_effect: float, alpha: float = 0.05, power: float = 0.8 ) -> int: # Statistical calculation for required samples pass # DO: Use proper statistical tests def analyze_ab_test(control_results: list, treatment_results: list): from scipy import stats # T-test for continuous metrics t_stat, p_value = stats.ttest_ind(control_results, treatment_results) return { "significant": p_value < 0.05, "p_value": p_value, "effect_size": (sum(treatment_results)/len(treatment_results) - sum(control_results)/len(control_results)) }
Troubleshooting
Connection Issues
```python
# Problem: Cannot connect to Agenta host
# Solution: Verify host and network settings
def diagnose_connection(host: str):
    import requests
    try:
        response = requests.get(f"{host}/api/health", timeout=5)
        if response.status_code == 200:
            print("Connection successful")
        else:
            print(f"Server returned: {response.status_code}")
    except requests.exceptions.ConnectionError:
        print("Cannot reach server - check host/port")
    except requests.exceptions.Timeout:
        print("Connection timed out - server may be overloaded")
```
Evaluation Failures
```python
# Problem: Evaluations failing or inconsistent
# Solution: Add retry logic and validation
import time

import agenta as ag


def robust_evaluation(prompt: str, max_retries: int = 3):
    for attempt in range(max_retries):
        try:
            result = ag.llm.complete(prompt=prompt)
            if validate_result(result):  # validate_result: your own output check
                return result
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff between retries
```
Version Conflicts
```python
# Problem: Multiple team members editing the same variant
# Solution: Use a branching strategy
def create_branch_variant(base_variant: str, branch_name: str):
    # Clone the variant for isolated development
    # (client and app_name are assumed to be defined as in the earlier examples)
    base = client.get_variant_by_name(app_name, base_variant)
    return client.create_variant(
        app_name=app_name,
        variant_name=f"{base_variant}-{branch_name}",
        config=base.config
    )
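```

Once a branch has been reviewed and evaluated, it can be promoted the same way earlier examples switch the active version: by making it the default variant. A minimal sketch, assuming the same `client` and `app_name` as above (the `promote_branch` helper is illustrative):

```python
def promote_branch(branch_variant_id: str) -> None:
    # Make the reviewed branch the default (active) variant for the app
    client.set_default_variant(
        app_name=app_name,
        variant_id=branch_variant_id
    )
```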
Resources
- Agenta Documentation: https://docs.agenta.ai/
- GitHub Repository: https://github.com/Agenta-AI/agenta
- Self-Hosting Guide: https://docs.agenta.ai/self-hosting
- API Reference: https://docs.agenta.ai/api-reference
Version History
- 1.0.0 (2026-01-17): Initial release with versioning, A/B testing, evaluation, playground, model comparison, self-hosting
This skill provides comprehensive patterns for LLM prompt management with Agenta, refined from production prompt engineering workflows.