Learn-skills.dev testing-agent
Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/adenhq/hive/testing-agent" ~/.claude/skills/neversight-learn-skills-dev-testing-agent && rm -rf "$T"
data/skills-md/adenhq/hive/testing-agent/SKILL.md
Testing Workflow
This skill provides tools for testing agents built with the building-agents skill.
Workflow Overview
- Check what tests exist: mcp__agent-builder__list_tests
- Get test guidelines: mcp__agent-builder__generate_constraint_tests or mcp__agent-builder__generate_success_tests
- Write tests directly using the Write tool with the guidelines provided
- Execute tests: mcp__agent-builder__run_tests
- Debug failures: mcp__agent-builder__debug_test
How Test Generation Works
The generate_*_tests MCP tools return guidelines and templates; they do NOT generate test code via an LLM.
You (Claude) write the tests directly using the Write tool, based on those guidelines.
Example Workflow
    # Step 1: Get test guidelines
    result = mcp__agent-builder__generate_constraint_tests(
        goal_id="my-goal",
        goal_json='{"id": "...", "constraints": [...]}',
        agent_path="exports/my_agent"
    )

    # Step 2: The result contains:
    # - output_file: where to write tests
    # - file_header: imports and fixtures to use
    # - test_template: format for test functions
    # - constraints_formatted: the constraints to test
    # - test_guidelines: rules for writing tests

    # Step 3: Write tests directly using the Write tool
    Write(
        file_path=result["output_file"],
        content=result["file_header"] + test_code_you_write
    )

    # Step 4: Run tests via MCP tool
    mcp__agent-builder__run_tests(
        goal_id="my-goal",
        agent_path="exports/my_agent"
    )

    # Step 5: Debug failures via MCP tool
    mcp__agent-builder__debug_test(
        goal_id="my-goal",
        test_name="test_constraint_foo",
        agent_path="exports/my_agent"
    )
Testing Agents with MCP Tools
Run goal-based evaluation tests for agents built with the building-agents skill.
Key Principle: MCP tools provide guidelines, Claude writes tests directly
- ✅ Get guidelines: generate_constraint_tests, generate_success_tests → returns templates and guidelines
- ✅ Write tests: Use the Write tool with the provided file_header and test_template
- ✅ Run tests: run_tests (runs pytest via subprocess)
- ✅ Debug failures: debug_test (re-runs single test with verbose output)
- ✅ List tests: list_tests (scans Python test files)
- ✅ Tests stored in exports/{agent}/tests/test_*.py
Architecture: Python Test Files
    exports/my_agent/
    ├── __init__.py
    ├── agent.py                      ← Agent to test
    ├── nodes/__init__.py
    ├── config.py
    ├── __main__.py
    └── tests/                        ← Test files written with the Write tool
        ├── conftest.py               # Shared fixtures (auto-created)
        ├── test_constraints.py
        ├── test_success_criteria.py
        └── test_edge_cases.py
Tests import the agent directly:
    import pytest
    from exports.my_agent import default_agent


    @pytest.mark.asyncio
    async def test_happy_path(mock_mode):
        result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
        assert result.success
        assert len(result.output) > 0
Why This Approach
- MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
- Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
- run_tests parses pytest output into structured results for iteration
- debug_test provides formatted output with actionable debugging info
- File headers include conftest.py setup with proper fixtures
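To make the idea of "parsing pytest output into structured results" concrete, here is a minimal sketch. It is not the actual run_tests implementation; the function name parse_pytest_verbose and the shape of the returned dictionary are assumptions chosen to resemble the response shown later in this document. It relies only on the standard `pytest -v` line format (`path::test_name STATUS`).

```python
import re
from collections import Counter

# Matches pytest -v result lines such as:
#   exports/my_agent/tests/test_constraints.py::test_constraint_foo PASSED [ 50%]
RESULT_LINE = re.compile(
    r"^(?P<file>\S+?)::(?P<test>\S+)\s+(?P<status>PASSED|FAILED|SKIPPED|ERROR)\b"
)


def parse_pytest_verbose(output: str) -> dict:
    """Turn `pytest -v` text output into a structured summary (hypothetical helper)."""
    results = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results.append({
                "file": match["file"].rsplit("/", 1)[-1],  # keep only the file name
                "test_name": match["test"],
                "status": match["status"].lower(),
            })
    counts = Counter(entry["status"] for entry in results)
    return {
        # An empty run is not treated as a pass
        "overall_passed": bool(results) and counts["failed"] == 0 and counts["error"] == 0,
        "summary": {
            "total": len(results),
            "passed": counts["passed"],
            "failed": counts["failed"],
        },
        "test_results": results,
    }


SAMPLE = """\
exports/my_agent/tests/test_constraints.py::test_constraint_foo PASSED [ 50%]
exports/my_agent/tests/test_success_criteria.py::test_success_bar FAILED [100%]
"""
parsed = parse_pytest_verbose(SAMPLE)
```

A structured result like this is easier to iterate on than raw terminal text, because the failing test names can be passed straight to debug_test.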
Quick Start
1. Check existing tests - list_tests(goal_id, agent_path)
2. Get test guidelines - generate_constraint_tests or generate_success_tests
3. Write tests - Use the Write tool with the provided file_header and guidelines
4. Run tests - run_tests(goal_id, agent_path)
5. Debug failures - debug_test(goal_id, test_name, agent_path)
6. Iterate - Repeat steps 4-5 until all pass
⚠️ API Key Requirement for Real Testing
CRITICAL: Real LLM testing requires an API key. Mock mode only validates structure and does NOT test actual agent behavior.
Prerequisites
Before running agent tests, you MUST set your API key:
export ANTHROPIC_API_KEY="your-key-here"
Why API keys are required:
- Tests need to execute the agent's LLM nodes to validate behavior
- Mock mode bypasses LLM calls, providing no confidence in real-world performance
- Success criteria (personalization, reasoning quality, constraint adherence) can only be tested with real LLM calls
Mock Mode Limitations
Mock mode (--mock flag or mock_mode=True) is ONLY for structure validation:

    ✓ Validates graph structure (nodes, edges, connections)
    ✓ Tests that code doesn't crash on execution
    ✗ Does NOT test LLM message generation
    ✗ Does NOT test reasoning or decision-making quality
    ✗ Does NOT test constraint validation (length limits, format rules)
    ✗ Does NOT test real API integrations or tool use
    ✗ Does NOT test personalization or content quality
Bottom line: If you're testing whether an agent achieves its goal, you MUST use a real API key.
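The decision between real testing, mock mode, and refusing to run can be sketched as a small pure function. This is an illustration only: resolve_test_mode is a hypothetical helper, and it reads the ANTHROPIC_API_KEY environment variable directly as a simplification (the document's own examples go through CredentialManager instead).

```python
import os
from typing import Mapping, Optional


def resolve_test_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "real", "mock", or "blocked" for the given environment (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):
        return "real"      # A key is present: tests exercise real LLM calls
    if env.get("MOCK_MODE"):
        return "mock"      # Structure validation only, no behavioral confidence
    return "blocked"       # Neither set: stop and ask the user how to proceed


mode_with_key = resolve_test_mode({"ANTHROPIC_API_KEY": "your-key-here"})
mode_mock_only = resolve_test_mode({"MOCK_MODE": "1"})
mode_nothing = resolve_test_mode({})
```

Passing the environment as a parameter keeps the check testable without touching the real process environment.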
Enforcing API Key in Tests
When generating tests, ALWAYS include API key checks:
    import os
    import pytest
    from aden_tools.credentials import CredentialManager

    # At the top of every test file
    pytestmark = pytest.mark.skipif(
        not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
        reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only."
    )


    @pytest.fixture(scope="session", autouse=True)
    def check_api_key():
        """Ensure API key is set for real testing."""
        creds = CredentialManager()
        if not creds.is_available("anthropic"):
            if os.environ.get("MOCK_MODE"):
                print("\n⚠️ Running in MOCK MODE - structure validation only")
                print(" This does NOT test LLM behavior or agent quality")
                print(" Set ANTHROPIC_API_KEY for real testing\n")
            else:
                pytest.fail(
                    "\n❌ ANTHROPIC_API_KEY not set!\n\n"
                    "Real testing requires an API key. Choose one:\n"
                    "1. Set API key (RECOMMENDED):\n"
                    " export ANTHROPIC_API_KEY='your-key-here'\n"
                    "2. Run structure validation only:\n"
                    " MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
                    "Note: Mock mode does NOT validate agent behavior or quality."
                )
User Communication
When the user asks to test an agent, ALWAYS check for the API key first:
    from aden_tools.credentials import CredentialManager

    # Before running any tests
    creds = CredentialManager()
    if not creds.is_available("anthropic"):
        print("⚠️ No ANTHROPIC_API_KEY found!")
        print()
        print("Testing requires a real API key to validate agent behavior.")
        print()
        print("Options:")
        print("1. Set your API key (RECOMMENDED):")
        print(" export ANTHROPIC_API_KEY='your-key-here'")
        print()
        print("2. Run in mock mode (structure validation only):")
        print(" MOCK_MODE=1 pytest exports/{agent}/tests/")
        print()
        print("Mock mode does NOT test:")
        print(" - LLM message generation")
        print(" - Reasoning or decision quality")
        print(" - Constraint validation")
        print(" - Real API integrations")

        # Ask user what to do
        AskUserQuestion(...)
The Three-Stage Flow
    ┌────────────────────────────────────────────────────────────────────────┐
    │                               GOAL STAGE                               │
    │                        (building-agents skill)                         │
    │                                                                        │
    │ 1. User defines goal with success_criteria and constraints             │
    │ 2. Goal written to agent.py immediately                                │
    │ 3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL         │
    │    Files created: exports/{agent}/tests/test_constraints.py            │
    └────────────────────────────────────────────────────────────────────────┘
                                        ↓
    ┌────────────────────────────────────────────────────────────────────────┐
    │                              AGENT STAGE                               │
    │                        (building-agents skill)                         │
    │                                                                        │
    │ Build nodes + edges, written immediately to files                      │
    │ Constraint tests can run during development:                           │
    │   run_tests(goal_id, agent_path, test_types='["constraint"]')          │
    └────────────────────────────────────────────────────────────────────────┘
                                        ↓
    ┌────────────────────────────────────────────────────────────────────────┐
    │                        EVAL STAGE (this skill)                         │
    │                                                                        │
    │ 1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL   │
    │    Files created: exports/{agent}/tests/test_success_criteria.py       │
    │ 2. Run all tests: run_tests(goal_id, agent_path)                       │
    │ 3. On failure → debug_test(goal_id, test_name, agent_path)             │
    │ 4. Iterate: Edit agent code → Re-run run_tests (instant feedback)      │
    └────────────────────────────────────────────────────────────────────────┘
Step-by-Step: Testing an Agent
Step 1: Check Existing Tests
ALWAYS check first before generating new tests:
    mcp__agent-builder__list_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent"
    )
This shows what test files already exist. If tests exist:
- Review the list to see what's covered
- Ask user if they want to add more or run existing tests
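For intuition about what "scanning Python test files" involves, the sketch below lists test functions found in a tests directory. It is an approximation under stated assumptions, not the list_tests implementation: the helper name list_test_functions and its return shape are hypothetical, and it uses only the standard library (pathlib and ast).

```python
import ast
import tempfile
from pathlib import Path


def list_test_functions(tests_dir: Path) -> dict:
    """Map each test_*.py file name to the test functions it defines (hypothetical helper)."""
    found = {}
    for path in sorted(tests_dir.glob("test_*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        found[path.name] = [
            node.name
            for node in ast.walk(tree)
            # Both sync and async test functions count
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            and node.name.startswith("test_")
        ]
    return found


# Demonstration against a throwaway directory standing in for exports/{agent}/tests/
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "test_constraints.py").write_text(
        "async def test_constraint_foo():\n    pass\n\ndef helper():\n    pass\n",
        encoding="utf-8",
    )
    (demo_dir / "conftest.py").write_text("", encoding="utf-8")
    listing = list_test_functions(demo_dir)
```

Parsing with ast avoids importing the test modules, so the scan works even when the agent's dependencies or API key are not available.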
Step 2: Get Constraint Test Guidelines (Goal Stage)
After the goal is defined, get test guidelines using the MCP tool:

    # First, read the goal from agent.py to get the goal JSON
    goal_code = Read(file_path="exports/your_agent/agent.py")
    # Extract the goal definition and convert to JSON

    # Get constraint test guidelines via MCP tool
    result = mcp__agent-builder__generate_constraint_tests(
        goal_id="your-goal-id",
        goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}',
        agent_path="exports/your_agent"
    )
Response includes:
- output_file: Where to write tests (e.g., exports/your_agent/tests/test_constraints.py)
- file_header: Imports, fixtures, and pytest setup to use at the top of the file
- test_template: Format for test functions
- constraints_formatted: The constraints to test
- test_guidelines: Rules and best practices for writing tests
- instruction: How to proceed
Write tests directly using the provided guidelines:
    # Write tests using the Write tool
    Write(
        file_path=result["output_file"],
        content=result["file_header"] + "\n\n" + your_test_code
    )
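Before writing the file, it can help to check that the guidelines response actually carries the fields the file depends on. The following is a minimal sketch under assumptions: build_test_file is a hypothetical helper, and the sample response is invented for illustration; only the output_file and file_header field names come from the response description above.

```python
REQUIRED_KEYS = ("output_file", "file_header")


def build_test_file(guidelines: dict, test_code: str) -> tuple:
    """Return (path, content) for a test file assembled from a guidelines response."""
    missing = [key for key in REQUIRED_KEYS if not guidelines.get(key)]
    if missing:
        raise KeyError(f"guidelines response is missing: {', '.join(missing)}")
    # Header first, then the hand-written tests, separated by blank lines
    content = guidelines["file_header"].rstrip() + "\n\n\n" + test_code.strip() + "\n"
    return guidelines["output_file"], content


# Invented sample response, shaped like the fields listed above
sample_guidelines = {
    "output_file": "exports/your_agent/tests/test_constraints.py",
    "file_header": "import pytest\n",
}
path, content = build_test_file(
    sample_guidelines,
    "def test_constraint_example():\n    assert True",
)
```

The returned path and content map directly onto the file_path and content arguments of the Write call.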
Step 3: Get Success Criteria Test Guidelines (Eval Stage)
After the agent is fully built, get success criteria test guidelines:

    # Get success criteria test guidelines via MCP tool
    result = mcp__agent-builder__generate_success_tests(
        goal_id="your-goal-id",
        goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}',
        node_names="analyze_request,search_web,format_results",
        tool_names="web_search,web_scrape",
        agent_path="exports/your_agent"
    )
Write tests directly using the provided guidelines:
    # Write tests using the Write tool
    Write(
        file_path=result["output_file"],
        content=result["file_header"] + "\n\n" + your_test_code
    )
Step 4: Test Fixtures (conftest.py)
The file_header returned by the MCP tools includes proper imports and fixtures.
You should also create a conftest.py file in the tests directory with shared fixtures:

    # Create conftest.py with the conftest template
    Write(
        file_path="exports/your_agent/tests/conftest.py",
        content=conftest_content  # Use PYTEST_CONFTEST_TEMPLATE format
    )
Step 5: Run Tests
Use the MCP tool to run tests (not pytest directly):
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent"
    )

**Response includes structured results:**

```json
{
  "goal_id": "your-goal-id",
  "overall_passed": false,
  "summary": {
    "total": 12,
    "passed": 10,
    "failed": 2,
    "skipped": 0,
    "errors": 0,
    "pass_rate": "83.3%"
  },
  "test_results": [
    {"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
    {"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
  ],
  "failures": [
    {"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
  ]
}
```

Options for run_tests:

    # Run only constraint tests
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        test_types='["constraint"]'
    )

    # Run with parallel workers
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        parallel=4
    )

    # Stop on first failure
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        fail_fast=True
    )
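Once run_tests returns its structured result, the next action usually follows from two things: the pass rate and the names of the failing tests. The sketch below shows one way to pull those out. summarize_run is a hypothetical helper, and the input reuses the field names from the sample response in this step (trimmed for brevity).

```python
def summarize_run(result: dict) -> dict:
    """Extract the pass rate and failing test names from a run_tests-style result."""
    summary = result.get("summary", {})
    total = summary.get("total", 0)
    passed = summary.get("passed", 0)
    failed_names = [
        entry["test_name"]
        for entry in result.get("test_results", [])
        if entry.get("status") == "failed"
    ]
    return {
        "pass_rate": f"{100 * passed / total:.1f}%" if total else "n/a",
        "failed": failed_names,
        # The first failure is a reasonable candidate to hand to debug_test
        "next_to_debug": failed_names[0] if failed_names else None,
    }


sample_result = {
    "summary": {"total": 12, "passed": 10, "failed": 2},
    "test_results": [
        {"file": "test_constraints.py",
         "test_name": "test_constraint_api_rate_limits", "status": "passed"},
        {"file": "test_success_criteria.py",
         "test_name": "test_success_find_relevant_results", "status": "failed"},
    ],
}
report = summarize_run(sample_result)
```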
Step 6: Debug Failed Tests
Use the MCP tool to debug (not Bash/pytest directly):
    mcp__agent-builder__debug_test(
        goal_id="your-goal-id",
        test_name="test_success_find_relevant_results",
        agent_path="exports/your_agent"
    )
Response includes:
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue
Step 7: Categorize Errors
When a test fails, categorize the error to guide iteration:
    def categorize_test_failure(test_output, agent_code):
        """Categorize test failure to guide iteration."""
        # Read test output and agent code, then fill in the failure details
        failure_info = {
            "test_name": "...",
            "error_message": "...",
            "stack_trace": "...",
        }
        message = failure_info["error_message"].lower()

        # Pattern-based categorization (first matching group wins)
        if any(pattern in message for pattern in [
            "typeerror", "attributeerror", "keyerror", "valueerror",
            "null", "none", "undefined", "tool call failed"
        ]):
            category = "IMPLEMENTATION_ERROR"
            guidance = {
                "stage": "Agent",
                "action": "Fix the bug in agent code",
                "files_to_edit": ["agent.py", "nodes/__init__.py"],
                "restart_required": False,
                "description": "Code bug - fix and re-run tests"
            }
        elif any(pattern in message for pattern in [
            "assertion", "expected", "got", "should be", "success criteria"
        ]):
            category = "LOGIC_ERROR"
            guidance = {
                "stage": "Goal",
                "action": "Update goal definition",
                "files_to_edit": ["agent.py (goal section)"],
                "restart_required": True,
                "description": "Goal definition is wrong - update and rebuild"
            }
        elif any(pattern in message for pattern in [
            "timeout", "rate limit", "empty", "boundary", "edge case"
        ]):
            category = "EDGE_CASE"
            guidance = {
                "stage": "Eval",
                "action": "Add edge case test and fix handling",
                "files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
                "restart_required": False,
                "description": "New scenario - add test and handle it"
            }
        else:
            category = "UNKNOWN"
            guidance = {
                "stage": "Unknown",
                "action": "Manual investigation required",
                "restart_required": False,
                # Present in every branch so callers can always read guidance["description"]
                "description": "Unrecognized failure - investigate manually"
            }

        return {
            "category": category,
            "guidance": guidance,
            "failure_info": failure_info
        }
Show categorization to user:
    AskUserQuestion(
        questions=[{
            "question": f"Test failed with {category}. How would you like to proceed?",
            "header": "Test Failure",
            "options": [
                {
                    "label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
                    "description": guidance["description"]
                },
                {
                    "label": "Show detailed error info",
                    "description": "View full stack trace and logs"
                },
                {
                    "label": "Skip for now",
                    "description": "Continue with other tests"
                }
            ],
            "multiSelect": False
        }]
    )
Step 8: Iterate Based on Error Category
IMPLEMENTATION_ERROR → Fix Agent Code
    # 1. Show user the exact file and line that failed
    print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
    print("Issue: 'NoneType' object has no attribute 'get'")

    # 2. Read the problematic code
    code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")

    # 3. User can fix directly, or you suggest a fix:
    Edit(
        file_path=f"exports/{agent_name}/nodes/__init__.py",
        old_string="if results.get('videos'):",
        new_string="if results and results.get('videos'):"
    )

    # 4. Re-run tests immediately (instant feedback!)
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path=f"exports/{agent_name}"
    )
LOGIC_ERROR → Update Goal
    # 1. Show user the goal definition
    goal_code = Read(file_path=f"exports/{agent_name}/agent.py")

    # 2. Discuss what needs to change in success_criteria or constraints

    # 3. Edit the goal
    Edit(
        file_path=f"exports/{agent_name}/agent.py",
        old_string='target="3-5 videos"',
        new_string='target="1-5 videos"'  # More realistic
    )

    # 4. May need to regenerate agent nodes if goal changed significantly
    # This requires going back to building-agents skill
EDGE_CASE → Add Test and Fix
    # 1. Create new edge case test with API key enforcement
    edge_case_test = '''
    @pytest.mark.asyncio
    async def test_edge_case_empty_results(mock_mode):
        """Test: Agent handles no results gracefully"""
        result = await default_agent.run({"query": "xyzabc123nonsense"}, mock_mode=mock_mode)

        # Should succeed with empty results, not crash
        assert result.success or result.error is not None
        if result.success:
            assert result.output.get("message") == "No results found"
    '''

    # 2. Add to test file
    Edit(
        file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
        old_string="# Add edge case tests here",
        new_string=edge_case_test
    )

    # 3. Fix agent to handle edge case
    # Edit agent code to handle empty results

    # 4. Re-run tests
Test File Templates (Reference Only)
⚠️ Do NOT copy-paste these templates directly. Use the generate_constraint_tests and generate_success_tests MCP tools to create properly structured tests with correct imports and fixtures.
These templates show the structure of generated tests for reference only.
Constraint Test Template

    """Constraint tests for {agent_name}.

    These tests validate that the agent respects its defined constraints.
    Requires ANTHROPIC_API_KEY for real testing.
    """

    import os
    import pytest
    from exports.{agent_name} import default_agent
    from aden_tools.credentials import CredentialManager

    # Enforce API key for real testing
    pytestmark = pytest.mark.skipif(
        not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
        reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
    )


    @pytest.mark.asyncio
    async def test_constraint_{constraint_id}():
        """Test: {constraint_description}"""
        # Test implementation based on constraint type
        mock_mode = bool(os.environ.get("MOCK_MODE"))
        result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

        # Assert constraint is respected
        assert True  # Replace with actual check
Success Criteria Test Template

    """Success criteria tests for {agent_name}.

    These tests validate that the agent achieves its defined success criteria.
    Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
    """

    import os
    import pytest
    from exports.{agent_name} import default_agent
    from aden_tools.credentials import CredentialManager

    # Enforce API key for real testing
    pytestmark = pytest.mark.skipif(
        not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
        reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
    )


    @pytest.mark.asyncio
    async def test_success_{criteria_id}():
        """Test: {criteria_description}"""
        mock_mode = bool(os.environ.get("MOCK_MODE"))
        result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

        assert result.success, f"Agent failed: {{result.error}}"

        # Verify success criterion met
        # e.g., assert metric meets target
        assert True  # Replace with actual check
Edge Case Test Template

    """Edge case tests for {agent_name}.

    These tests validate agent behavior in unusual or boundary conditions.
    Requires ANTHROPIC_API_KEY for real testing.
    """

    import os
    import pytest
    from exports.{agent_name} import default_agent
    from aden_tools.credentials import CredentialManager

    # Enforce API key for real testing
    pytestmark = pytest.mark.skipif(
        not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
        reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
    )


    @pytest.mark.asyncio
    async def test_edge_case_{scenario_name}():
        """Test: Agent handles {scenario_description}"""
        mock_mode = bool(os.environ.get("MOCK_MODE"))
        result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)

        # Verify graceful handling
        assert result.success or result.error is not None
Interactive Build + Test Loop
During agent construction (Agent stage), you can run constraint tests incrementally:
    # After adding first node
    print("Added search_node. Running relevant constraint tests...")
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path=f"exports/{agent_name}",
        test_types='["constraint"]'
    )

    # After adding second node
    print("Added filter_node. Running all constraint tests...")
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path=f"exports/{agent_name}",
        test_types='["constraint"]'
    )
This provides immediate feedback during development, catching issues early.
Common Test Patterns
Note: All test patterns should include API key enforcement via conftest.py.
⚠️ CRITICAL: Framework Features You Must Know
OutputCleaner - Automatic I/O Cleaning (NEW!)
The framework now automatically validates and cleans node outputs using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.
What OutputCleaner does:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x
Impact on tests: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.
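The "JSON parsing trap" mentioned above (the entire response stored as a string under one key) can be detected in a test with a few lines of standard-library code. This is a sketch for test-side defensiveness, not OutputCleaner's actual logic; find_json_trap is a hypothetical name.

```python
import json


def find_json_trap(output):
    """Return (key, parsed_dict) if output is a one-key dict whose value is a JSON object string.

    Returns None when the output does not look like the trap.
    """
    if not isinstance(output, dict) or len(output) != 1:
        return None
    (key, value), = output.items()
    if not isinstance(value, str):
        return None
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return None
    return (key, parsed) if isinstance(parsed, dict) else None


trapped = find_json_trap({"recommendation": '{"approval_decision": "APPROVED"}'})
plain_text = find_json_trap({"recommendation": "looks fine to me"})
structured = find_json_trap({"recommendation": {"approval_decision": "APPROVED"}})
```

When the trap is detected, a test can assert against the parsed dictionary instead of failing on a missing top-level key.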
Safe Test Patterns (REQUIRED)
❌ UNSAFE (will cause test failures):
    # Direct key access - can crash!
    approval_decision = result.output["approval_decision"]
    assert approval_decision == "APPROVED"

    # Nested access without checks
    category = result.output["analysis"]["category"]

    # Assuming parsed JSON structure
    for issue in result.output["compliance_issues"]:
        ...
✅ SAFE (correct patterns):
    import json

    # 1. Safe dict access with .get()
    output = result.output or {}
    approval_decision = output.get("approval_decision", "UNKNOWN")
    assert "APPROVED" in approval_decision

    # 2. Type checking before operations
    analysis = output.get("analysis", {})
    if isinstance(analysis, dict):
        category = analysis.get("category", "unknown")

    # 3. Parse JSON from strings (the JSON parsing trap!)
    approval = "UNKNOWN"  # Default, so the name is bound on every path
    recommendation = output.get("recommendation", "{}")
    if isinstance(recommendation, str):
        try:
            parsed = json.loads(recommendation)
            if isinstance(parsed, dict):
                approval = parsed.get("approval_decision", "UNKNOWN")
        except json.JSONDecodeError:
            approval = "UNKNOWN"
    elif isinstance(recommendation, dict):
        approval = recommendation.get("approval_decision", "UNKNOWN")

    # 4. Safe iteration with type check
    compliance_issues = output.get("compliance_issues", [])
    if isinstance(compliance_issues, list):
        for issue in compliance_issues:
            ...
Helper Functions for Safe Access
Add to conftest.py:
    import json
    import re

    import pytest


    def _parse_json_from_output(result, key):
        """Parse JSON from agent output (framework may store full LLM response as string)."""
        raw = (result.output or {}).get(key)
        if not isinstance(raw, str):
            return raw  # Already structured (or missing): nothing to parse
        # Remove markdown code blocks if present
        json_text = re.sub(r'```json\s*|\s*```', '', raw).strip()
        try:
            return json.loads(json_text)
        except json.JSONDecodeError:
            return raw


    def safe_get_nested(result, key_path, default=None):
        """Safely get nested value from result.output."""
        output = result.output or {}
        current = output
        for key in key_path:
            if isinstance(current, dict):
                current = current.get(key)
            elif isinstance(current, str):
                # A string may hold JSON (possibly wrapped in a markdown fence)
                try:
                    json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                    parsed = json.loads(json_text)
                    if isinstance(parsed, dict):
                        current = parsed.get(key)
                    else:
                        return default
                except json.JSONDecodeError:
                    return default
            else:
                return default
        return current if current is not None else default


    # Make available in tests
    pytest.parse_json_from_output = _parse_json_from_output
    pytest.safe_get_nested = safe_get_nested
Usage in tests:
    # Use helper to parse JSON safely
    parsed = pytest.parse_json_from_output(result, "recommendation")
    if isinstance(parsed, dict):
        approval = parsed.get("approval_decision", "UNKNOWN")

    # Safe nested access
    risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)
Test Count Guidance
Generate 8-15 tests total, NOT 30+
- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)
Why fewer tests?
- Each test requires a real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity
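The figures above follow from simple multiplication. The sketch below reproduces them, assuming about 3 seconds per test (stated above) and about $0.01 per test (implied by "$0.30 for 30 tests" and "$0.12 for 12 tests"). Both defaults are rough planning numbers rather than measured values, and estimate_suite is a hypothetical helper.

```python
def estimate_suite(n_tests, seconds_per_test=3.0, dollars_per_test=0.01):
    """Estimate wall-clock time and cost for a sequential run of n_tests real-LLM tests."""
    return {
        "seconds": n_tests * seconds_per_test,
        # Rounded to cents to avoid floating-point noise in the estimate
        "dollars": round(n_tests * dollars_per_test, 2),
    }


large_suite = estimate_suite(30)   # 90 seconds, about $0.30
lean_suite = estimate_suite(12)    # 36 seconds, about $0.12
```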
ExecutionResult Fields (Important!)
result.success=True means NO exception, NOT goal achieved

    # ❌ WRONG - assumes goal achieved
    assert result.success

    # ✅ RIGHT - check success AND output
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    approval = output.get("approval_decision")
    assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
All ExecutionResult fields:
- success: bool - Execution completed without exception (NOT goal achieved!)
- output: dict - Complete memory snapshot (may contain raw strings)
- error: str | None - Error message if failed
- steps_executed: int - Number of nodes executed
- total_tokens: int - Cumulative token usage
- total_latency_ms: int - Total execution time
- path: list[str] - Node IDs traversed
- paused_at: str | None - Node ID if HITL pause occurred
- session_state: dict - State for resuming
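To experiment with assertions without running an agent, a stand-in object with the same field names can be useful. The dataclass below is a hypothetical stand-in built from the field list above, not the framework's ExecutionResult class; the default values and the goal_achieved helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExecutionResult:
    """Stand-in mirroring the documented field names (not the framework class)."""
    success: bool
    output: dict = field(default_factory=dict)
    error: Optional[str] = None
    steps_executed: int = 0
    total_tokens: int = 0
    total_latency_ms: int = 0
    path: list = field(default_factory=list)
    paused_at: Optional[str] = None
    session_state: dict = field(default_factory=dict)


def goal_achieved(result, key, expected) -> bool:
    """success=True only means no exception; the output must also carry the expected value."""
    if not result.success:
        return False
    output = result.output or {}
    return output.get(key) == expected


ran_but_rejected = ExecutionResult(success=True, output={"approval_decision": "REJECTED"})
ran_and_approved = ExecutionResult(success=True, output={"approval_decision": "APPROVED"})
crashed = ExecutionResult(success=False, error="TypeError: ...")
```

The first object illustrates the pitfall: it would pass a bare `assert result.success` even though the goal was not met.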
Happy Path Test
    @pytest.mark.asyncio
    async def test_happy_path(mock_mode):
        """Test normal successful execution"""
        result = await default_agent.run({"query": "python tutorials"}, mock_mode=mock_mode)
        assert result.success
        assert len(result.output) > 0
Boundary Condition Test
    @pytest.mark.asyncio
    async def test_boundary_minimum(mock_mode):
        """Test at minimum threshold"""
        result = await default_agent.run({"query": "very specific niche topic"}, mock_mode=mock_mode)
        assert result.success
        assert len(result.output.get("results", [])) >= 1
Error Handling Test
    @pytest.mark.asyncio
    async def test_error_handling(mock_mode):
        """Test graceful error handling"""
        result = await default_agent.run({"query": ""}, mock_mode=mock_mode)  # Invalid input
        assert not result.success or result.output.get("error") is not None
Performance Test
    @pytest.mark.asyncio
    async def test_performance_latency(mock_mode):
        """Test response time is acceptable"""
        import time

        start = time.time()
        result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
        duration = time.time() - start
        assert duration < 5.0, f"Took {duration:.2f}s, expected <5s"
Integration with building-agents
Handoff Points
| Scenario | From | To | Action |
|---|---|---|---|
| Agent built, ready to test | building-agents | testing-agent | Generate success tests |
| LOGIC_ERROR found | testing-agent | building-agents | Update goal, rebuild |
| IMPLEMENTATION_ERROR found | testing-agent | Direct fix | Edit agent files, re-run tests |
| EDGE_CASE found | testing-agent | testing-agent | Add edge case test |
| All tests pass | testing-agent | Done | Agent validated ✅ |
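The handoff table can also be read as a lookup from failure category to next destination. The sketch below encodes it that way. The HANDOFFS mapping copies the table rows; the priority order (goal problems first, since they may force a rebuild) and the fallback for unrecognized categories are assumptions, and next_step is a hypothetical helper.

```python
# Category -> (destination, action), copied from the handoff table
HANDOFFS = {
    "LOGIC_ERROR": ("building-agents", "Update goal, rebuild"),
    "IMPLEMENTATION_ERROR": ("direct fix", "Edit agent files, re-run tests"),
    "EDGE_CASE": ("testing-agent", "Add edge case test"),
}

# Assumed priority: handle goal problems before code bugs before new scenarios
PRIORITY = ("LOGIC_ERROR", "IMPLEMENTATION_ERROR", "EDGE_CASE")


def next_step(failure_categories):
    """Pick the next handoff given the categories of all current failures."""
    if not failure_categories:
        return ("done", "Agent validated")
    for category in PRIORITY:
        if category in failure_categories:
            return HANDOFFS[category]
    return ("manual", "Manual investigation required")


all_green = next_step([])
mixed = next_step(["EDGE_CASE", "LOGIC_ERROR"])
unknown_only = next_step(["UNKNOWN"])
```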
Iteration Speed Comparison
| Scenario | Old Approach | New Approach |
|---|---|---|
| Bug Fix | Rebuild via MCP tools (14 min) | Edit Python file, pytest (2 min) |
| Add Test | Generate via MCP, export (5 min) | Write test file directly (1 min) |
| Debug | Read subprocess logs | pdb, breakpoints, prints |
| Inspect | Limited visibility | Full Python introspection |
Anti-Patterns
Testing Best Practices
| Don't | Do Instead |
|---|---|
| ❌ Write tests without getting guidelines first | ✅ Use generate_constraint_tests or generate_success_tests to get proper file_header and guidelines |
| ❌ Run pytest via Bash | ✅ Use the run_tests MCP tool for structured results |
| ❌ Debug tests with Bash pytest -vvs | ✅ Use the debug_test MCP tool for formatted output |
| ❌ Check for tests with Glob | ✅ Use the list_tests MCP tool |
| ❌ Skip the file_header from guidelines | ✅ Always include the file_header for proper imports and fixtures |
General Testing
| Don't | Do Instead |
|---|---|
| ❌ Treat all failures the same | ✅ Use debug_test to categorize and iterate appropriately |
| ❌ Rebuild entire agent for small bugs | ✅ Edit code directly, re-run tests |
| ❌ Run tests without API key | ✅ Always set ANTHROPIC_API_KEY first |
| ❌ Write tests without understanding the constraints/criteria | ✅ Read the formatted constraints/criteria from guidelines |
Workflow Summary
    1. Check existing tests: list_tests(goal_id, agent_path)
       → Scans exports/{agent}/tests/test_*.py
       ↓
    2. Get test guidelines: generate_constraint_tests, generate_success_tests
       → Returns file_header, test_template, constraints/criteria, guidelines
       ↓
    3. Write tests: Use Write tool with the provided guidelines
       → Write tests to exports/{agent}/tests/test_*.py
       ↓
    4. Run tests: run_tests(goal_id, agent_path)
       → Executes: pytest exports/{agent}/tests/ -v
       ↓
    5. Debug failures: debug_test(goal_id, test_name, agent_path)
       → Re-runs single test with verbose output
       ↓
    6. Fix based on category:
       - IMPLEMENTATION_ERROR → Edit agent code directly
       - ASSERTION_FAILURE → Fix agent logic or update test
       - IMPORT_ERROR → Check package structure
       - API_ERROR → Check API keys and connectivity
       ↓
    7. Re-run tests: run_tests(goal_id, agent_path)
       ↓
    8. Repeat until all pass ✅
MCP Tools Reference
    # Check existing tests (scans Python test files)
    mcp__agent-builder__list_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent"
    )

    # Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
    mcp__agent-builder__generate_constraint_tests(
        goal_id="your-goal-id",
        goal_json='{"id": "...", "constraints": [...]}',
        agent_path="exports/your_agent"
    )
    # Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

    # Get success criteria test guidelines
    mcp__agent-builder__generate_success_tests(
        goal_id="your-goal-id",
        goal_json='{"id": "...", "success_criteria": [...]}',
        node_names="node1,node2",
        tool_names="tool1,tool2",
        agent_path="exports/your_agent"
    )
    # Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

    # Run tests via pytest subprocess
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent"
    )

    # Debug a failed test (re-runs with verbose output)
    mcp__agent-builder__debug_test(
        goal_id="your-goal-id",
        test_name="test_constraint_foo",
        agent_path="exports/your_agent"
    )
run_tests Options
    # Run only constraint tests
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        test_types='["constraint"]'
    )

    # Run only success criteria tests
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        test_types='["success"]'
    )

    # Run with pytest-xdist parallelism (requires pytest-xdist)
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        parallel=4
    )

    # Stop on first failure
    mcp__agent-builder__run_tests(
        goal_id="your-goal-id",
        agent_path="exports/your_agent",
        fail_fast=True
    )
Direct pytest Commands
You can also run tests directly with pytest (the MCP tools use pytest internally):
    # Run all tests
    pytest exports/your_agent/tests/ -v

    # Run specific test file
    pytest exports/your_agent/tests/test_constraints.py -v

    # Run specific test
    pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs

    # Run in mock mode (structure validation only)
    MOCK_MODE=1 pytest exports/your_agent/tests/ -v
The MCP tools provide test guidelines and templates; Claude writes the tests to Python files with the Write tool, and run_tests executes them via pytest.
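To relate the run_tests options to the direct pytest commands, here is a sketch of how such options could translate into a command line. The mapping is an assumption, not the tool's documented behavior: build_pytest_command is a hypothetical helper, and the test_types-to-file mapping is inferred from the file names used in this document. The flags themselves are real: `-x` stops pytest on the first failure, and `-n` sets the worker count when pytest-xdist is installed.

```python
# Assumed mapping from test_types values to the test files named in this document
TEST_TYPE_FILES = {
    "constraint": "test_constraints.py",
    "success": "test_success_criteria.py",
}


def build_pytest_command(agent_path, test_types=None, parallel=0, fail_fast=False):
    """Build a pytest argument list for an agent's tests directory (hypothetical helper)."""
    tests_dir = f"{agent_path}/tests"
    targets = [f"{tests_dir}/{TEST_TYPE_FILES[kind]}" for kind in (test_types or [])]
    if not targets:
        targets = [f"{tests_dir}/"]          # No filter: run the whole directory
    command = ["pytest", *targets, "-v"]
    if parallel:
        command += ["-n", str(parallel)]     # Requires pytest-xdist
    if fail_fast:
        command.append("-x")                 # Stop on first failure
    return command


run_all = build_pytest_command("exports/your_agent")
constraints_fast = build_pytest_command(
    "exports/your_agent", test_types=["constraint"], parallel=4, fail_fast=True
)
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting.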