Learn-skills.dev hive-test
Iterative agent testing with session recovery. Execute, analyze, fix, resume from checkpoints. Use when testing an agent, debugging test failures, or verifying fixes without re-running from scratch.
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/adenhq/hive/hive-test" ~/.claude/skills/neversight-learn-skills-dev-hive-test && rm -rf "$T"
data/skills-md/adenhq/hive/hive-test/SKILL.md

Agent Testing
Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.
When to Use
- Testing a newly built agent against its goal
- Debugging a failing agent iteratively
- Verifying fixes without re-running expensive early nodes
- Running final regression tests before deployment
Prerequisites
- Agent package at `exports/{agent_name}/` (built with `/hive-create`)
- Credentials configured (`/hive-credentials`)
- `ANTHROPIC_API_KEY` set (or appropriate LLM provider key)

Path distinction (critical — don't confuse these):

- `exports/{agent_name}/` — agent source code (edit here)
- `~/.hive/agents/{agent_name}/` — runtime data: sessions, checkpoints, logs (read here)
The Iterative Test Loop
This is the core workflow. Don't re-run the entire agent when a late node fails — analyze, fix, and resume from the last clean checkpoint.
```
┌──────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios     │
│ Goal → synthetic test inputs + tests │
└──────────────┬───────────────────────┘
               ↓
┌──────────────────────────────────────┐
│ PHASE 2: Execute                     │◄────────────────┐
│ Run agent (CLI or pytest)            │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
   Pass? ──yes──► PHASE 6: Final Verification            │
     │                                                   │
     no                                                  │
     ↓                                                   │
┌──────────────────────────────────────┐                 │
│ PHASE 3: Analyze                     │                 │
│ Session + runtime logs + checkpoints │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 4: Fix                         │                 │
│ Prompt / code / graph / goal         │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 5: Recover & Resume            │─────────────────┘
│ Checkpoint resume OR fresh re-run    │
└──────────────────────────────────────┘
```
Phase 1: Generate Test Scenarios
Create synthetic tests from the agent's goal, constraints, and success criteria.
Step 1a: Read the goal
```python
# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")

# Extract the Goal definition and convert to JSON string
```
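For example, once the Goal definition has been read out of `agent.py`, producing the `goal_json` string is a plain `json.dumps`. A minimal sketch — the goal fields below are illustrative, borrowed from the walkthrough at the end of this document:

```python
import json

# Illustrative goal extracted from agent.py (fields mirror the examples below)
goal = {
    "id": "rigorous-interactive-research",
    "constraints": [{"id": "no-paywalled-sources"}],
    "success_criteria": [{"id": "source-diversity", "target": ">=5"}],
}

goal_json = json.dumps(goal)  # pass as goal_json= to the generate_* tools
```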
Step 1b: Get test guidelines
```python
# Get constraint test guidelines
generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/{agent_name}"
)

# Get success criteria test guidelines
generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/{agent_name}"
)
```
These return `file_header`, `test_template`, `constraints_formatted`/`success_criteria_formatted`, and `test_guidelines`. They do NOT generate test code — you write the tests.
Step 1c: Write tests
```python
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```
Test writing rules
- Every test MUST be `async` with `@pytest.mark.asyncio`
- Every test MUST accept the `runner`, `auto_responder`, `mock_mode` fixtures
- Use `await auto_responder.start()` before running, `await auto_responder.stop()` in `finally`
- Use `await runner.run(input_dict)` — this goes through AgentRunner → AgentRuntime → ExecutionStream
- Access output via `result.output.get("key")` — NEVER `result.output["key"]`
- `result.success=True` means no exception, NOT goal achieved — always check output
- Write 8-15 tests total, not 30+
- Each real test costs ~3 seconds + LLM tokens
- NEVER use `default_agent.run()` — it bypasses the runtime (no sessions, no logs, client-facing nodes hang)
Step 1d: Check existing tests
Before generating, check if tests already exist:
```python
list_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```
Phase 2: Execute
Two execution paths, use the right one for your situation.
Iterative debugging (for complex agents)
Run the agent via CLI. This creates sessions with checkpoints at `~/.hive/agents/{agent_name}/sessions/`:

```bash
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```
Sessions and checkpoints are saved automatically.
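Since sessions land on disk, a quick way to confirm a run was recorded (the session directory name shown is illustrative):

```bash
# Each CLI run creates a session under the agent's runtime data
ls ~/.hive/agents/{agent_name}/sessions/
# e.g. session_20260209_143022_abc12345
```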
Client-facing nodes: Agents with `client_facing=True` nodes (interactive conversation) work in headless mode when run from a real terminal — the agent streams output to stdout and reads user input from stdin via a `>>>` prompt. In non-interactive shells (like Claude Code's Bash tool), client-facing nodes will hang because there is no stdin. For testing interactive agents from Claude Code, use `run_tests` with mock mode or have the user run the agent manually in their terminal.
Automated regression (for CI or final verification)
Use the `run_tests` MCP tool to run all pytest tests:

```python
run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```
Returns structured results:
{ "overall_passed": false, "summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"}, "test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}], "failures": [{"test_name": "test_success_source_diversity", "details": "..."}] }
Options:
```python
# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')

# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)

# Parallel execution
run_tests(goal_id, agent_path, parallel=4)
```
Note: `run_tests` uses AgentRunner with tmp_path storage, so sessions are isolated per test run. For checkpoint-based recovery with persistent sessions, use CLI execution. Use `run_tests` for quick regression checks and final verification.
Phase 3: Analyze Failures
When a test fails, drill down systematically. Don't guess — use the tools.
Step 3a: Get error category
```python
debug_test(
    goal_id="your-goal-id",
    test_name="test_success_source_diversity",
    agent_path="exports/{agent_name}"
)
```
Returns the error category (`IMPLEMENTATION_ERROR`, `ASSERTION_FAILURE`, `TIMEOUT`, `IMPORT_ERROR`, `API_ERROR`) plus the full traceback and suggestions.
Step 3b: Find the failed session
```python
list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)
```
Returns session list with IDs, timestamps, current_node (where it failed), execution_quality.
Step 3c: Inspect session state
```python
get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)
```
Returns the execution path, which node was current, step count, and timestamps — but excludes memory values (to avoid context bloat). Shows `memory_keys` and `memory_size` instead.
Step 3d: Examine runtime logs (L2/L3)
```python
# L2: Per-node success/failure, retry counts
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)

# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    node_id="research"
)
```
Step 3e: Inspect memory data
```python
# See what data a node actually produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)
```
Step 3f: Find recovery points
```python
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    is_clean="true"
)
```
Returns checkpoint summaries with IDs, types (`node_start`, `node_complete`), the node each belongs to, and an `is_clean` flag. Clean checkpoints are safe resume points.
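If scripting the choice of resume point, a minimal sketch — it assumes each summary is a dict exposing `checkpoint_id` and `node_id` keys, which may not match the tool's actual field names:

```python
# Hypothetical summaries as returned by list_agent_checkpoints (already
# filtered to is_clean="true"); field names here are assumptions.
checkpoints = [
    {"checkpoint_id": "cp_node_complete_intake_150005", "node_id": "intake"},
    {"checkpoint_id": "cp_node_complete_research_143030", "node_id": "research"},
]

fixed_node = "review"  # the node whose prompt/code you just fixed

# Summaries are assumed chronological: resume from the latest clean
# checkpoint recorded before the fixed node runs again.
resume_id = next(
    (cp["checkpoint_id"] for cp in reversed(checkpoints) if cp["node_id"] != fixed_node),
    None,
)
print(resume_id)  # pass to: hive run --resume-session ... --checkpoint <resume_id>
```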
Step 3g: Compare checkpoints (optional)
To understand what changed between two points in execution:
```python
compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id_before="cp_node_complete_research_143030",
    checkpoint_id_after="cp_node_complete_review_143115"
)
```
Returns memory diff (added/removed/changed keys) and execution path diff.
Phase 4: Fix Based on Root Cause
Use the analysis from Phase 3 to determine what to fix and where.
| Root Cause | What to Fix | Where to Edit |
|---|---|---|
| Prompt issue — LLM produces wrong output format, misses instructions | Node prompt | `nodes/__init__.py` |
| Code bug — TypeError, KeyError, logic error in Python | Agent code | `nodes/__init__.py`, `agent.py` |
| Graph issue — wrong routing, missing edge, bad condition_expr | Edges, node config | `agent.py` |
| Tool issue — MCP tool fails, wrong config, missing credential | Tool config | `mcp_servers.json`, `config.py` |
| Goal issue — success criteria too strict/vague, wrong constraints | Goal definition | `agent.py` (goal section) |
| Test issue — test expectations don't match actual agent behavior | Test code | `tests/` |
Fix strategies by error category
IMPLEMENTATION_ERROR (TypeError, AttributeError, KeyError):
```python
# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")

# Fix the bug
Edit(
    file_path="exports/{agent_name}/nodes/__init__.py",
    old_string="results.get('videos')",
    new_string="(results or {}).get('videos', [])"
)
```
ASSERTION_FAILURE (test assertions fail but agent ran successfully):
- Check if the agent's output is actually wrong → fix the prompt
- Check if the test's expectations are unrealistic → fix the test
- Use `get_agent_session_memory` to see what the agent actually produced
TIMEOUT / STALL (agent runs too long):
- Check `node_visit_counts` for feedback loops hitting `max_node_visits`
- Check L3 logs for tool calls that hang
- Reduce `max_iterations` in `loop_config` or fix the prompt to converge faster
API_ERROR (connection, rate limit, auth):
- Verify credentials with `/hive-credentials`
- Check MCP server configuration
Phase 5: Recover & Resume
After fixing the agent, decide whether to resume or re-run.
When to resume from checkpoint
Resume when ALL of these are true:
- The fix is to a node that comes AFTER existing clean checkpoints
- Clean checkpoints exist (from a CLI execution with checkpointing)
- The early nodes are expensive (web scraping, API calls, long LLM chains)
```bash
# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
  --resume-session session_20260209_143022_abc12345 \
  --checkpoint cp_node_complete_research_143030
```
This skips all nodes before the checkpoint and only re-runs the fixed node onward.
When to re-run from scratch
Re-run when ANY of these are true:
- The fix is to the entry node or an early node
- No checkpoints exist (e.g., the agent was run via `run_tests`)
- The agent is fast (2-3 nodes, completes in seconds)
- You changed the graph structure (added/removed nodes/edges)
```bash
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
```
Inspecting a checkpoint before resuming
```python
get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)
```
Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.
Loop back to Phase 2
After resuming or re-running, check if the fix worked. If not, go back to Phase 3.
Phase 6: Final Verification
Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:
```python
run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)
```
All tests should pass. If not, repeat the loop for remaining failures.
Credential Requirements
CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).
Prerequisites
Before running agent tests, you MUST collect ALL required credentials from the user.
Step 1: LLM API Key (always required)
```bash
export ANTHROPIC_API_KEY="your-key-here"
```
Step 2: Tool-specific credentials (depends on agent's tools)
Inspect the agent's `mcp_servers.json` and tool configuration to determine which tools the agent uses, then check for all required credentials:

```python
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
```
Common tool credentials:
| Tool | Env Var | Help URL |
|---|---|---|
| HubSpot CRM | `HUBSPOT_ACCESS_TOKEN` | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | `BRAVE_API_KEY` | https://brave.com/search/api/ |
| Google Search | `GOOGLE_API_KEY` + `GOOGLE_CSE_ID` | https://developers.google.com/custom-search |
Why ALL credentials are required:
- Tests need to execute the agent's LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance
Mock Mode Limitations
Mock mode (`--mock` flag or `MOCK_MODE=1`) is ONLY for structure validation:

- Validates graph structure (nodes, edges, connections)
- Validates that `AgentRunner.load()` succeeds and the agent is importable
- Does NOT execute event_loop agents — MockLLMProvider never calls `set_output`, so event_loop nodes loop forever
- Does NOT test LLM reasoning, content quality, or constraint validation
- Does NOT test real API integrations or tool use
Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials.
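For completeness, a structure-only run looks like this (the same invocation the credential check below prints as a hint):

```bash
# Structure validation only — no LLM calls, no tool credentials required
MOCK_MODE=1 pytest exports/{agent_name}/tests/
```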
Enforcing Credentials in Tests
When writing tests, ALWAYS include credential checks:
```python
import os

import pytest

from aden_tools.credentials import CredentialManager

pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1.",
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\nRunning in MOCK MODE - structure validation only")
        else:
            pytest.fail(
                "\nANTHROPIC_API_KEY not set!\n"
                "Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
                "Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
            )

    if not mock_mode:
        agent_tools = []  # Update per agent
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\nMissing tool credentials!"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
            pytest.fail("\n".join(lines))
```
User Communication
When the user asks to test an agent, ALWAYS check for ALL credentials first:
- Identify the agent's tools from `mcp_servers.json`
- Check ALL required credentials using `CredentialManager`
- Ask the user to provide any missing credentials before proceeding
- Collect ALL missing credentials in a single prompt — not one at a time (see the sketch below)
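A minimal sketch of assembling that single prompt, reusing only the `CredentialManager` calls already shown above (the `agent_tools` value is a placeholder to derive per agent):

```python
from aden_tools.credentials import CredentialManager

creds = CredentialManager()
agent_tools = ["web_search"]  # placeholder — derive from mcp_servers.json

missing = creds.get_missing_for_tools(agent_tools)
if missing:
    lines = ["To test this agent, please provide all of the following:"]
    for name in missing:
        spec = creds.specs.get(name)
        if spec:
            lines.append(f"  export {spec.env_var}=...  # {spec.description}")
    # One message covering every missing credential — never one at a time
    print("\n".join(lines))
```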
Safe Test Patterns
OutputCleaner
The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.
Safe Access (REQUIRED)
```python
# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]

# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")

# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# SAFE - handle JSON parsing trap (LLM response as string)
import json

recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
    for item in items:
        ...
```
Helper Functions for conftest.py
```python
import json
import re

import pytest


def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)


def safe_get_nested(result, key_path, default=None):
    """Safely get a nested value from result.output."""
    output = result.output or {}
    current = output
    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default
    return current if current is not None else default


# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
```
ExecutionResult Fields
`result.success=True` means NO exception, NOT goal achieved:

```python
# WRONG
assert result.success

# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
```
All fields:

- `success: bool` — Completed without exception (NOT goal achieved!)
- `output: dict` — Complete memory snapshot (may contain raw strings)
- `error: str | None` — Error message if failed
- `steps_executed: int` — Number of nodes executed
- `total_tokens: int` — Cumulative token usage
- `total_latency_ms: int` — Total execution time
- `path: list[str]` — Node IDs traversed (may repeat in feedback loops)
- `paused_at: str | None` — Node ID if paused
- `session_state: dict` — State for resuming
- `node_visit_counts: dict[str, int]` — Visit counts per node (feedback loop testing)
- `execution_quality: str` — "clean", "degraded", or "failed"
Test Count Guidance
Write 8-15 tests, not 30+
- 2-3 tests per success criterion
- 1 happy path test
- 1 boundary/edge case test
- 1 error handling test (optional)
Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, $0.12.
Test Patterns
Happy Path
```python
@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
    """Test normal successful execution."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "python tutorials"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"
```
Boundary Condition
```python
@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
    """Test at minimum source threshold."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "niche topic"})
    finally:
        await auto_responder.stop()

    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])
    if isinstance(sources, list):
        assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"
```
Error Handling
```python
@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
    """Test graceful handling of empty input."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": ""})
    finally:
        await auto_responder.stop()

    # Agent should either fail gracefully or produce an error message
    output = result.output or {}
    assert not result.success or output.get("error"), "Should handle empty input"
```
Feedback Loop
```python
@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()

    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times — possible infinite loop"
```
MCP Tool Reference
Phase 1: Test Generation
```python
# Check existing tests
list_tests(goal_id, agent_path)

# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
```
Phase 2: Execution
```python
# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)

# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')
```

```bash
# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'
```
Phase 3: Analysis
```python
# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)

# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)

# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)

# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")

# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")

# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)

# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")

# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")

# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)
```
Phase 5: Recovery
```python
# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint
```

```bash
# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
  --resume-session {session_id} --checkpoint {checkpoint_id}
```
Anti-Patterns
| Don't | Do Instead |
|---|---|
| Use `default_agent.run()` in tests | Use `runner.run()` with fixtures (goes through AgentRuntime) |
| Re-run entire agent when a late node fails | Resume from last clean checkpoint |
| Treat `result.success=True` as goal achieved | Check `result.output` for actual criteria |
| Access `result.output["key"]` directly | Use `result.output.get("key")` |
| Fix random things hoping tests pass | Analyze L2/L3 logs to find root cause first |
| Write 30+ tests | Write 8-15 focused tests |
| Skip credential check | Use `/hive-credentials` before testing |
| Confuse `exports/` with `~/.hive/agents/` | Code in `exports/{agent_name}/`, runtime data in `~/.hive/agents/{agent_name}/` |
| Use `run_tests` for iterative debugging | Use headless CLI with checkpoints for iterative debugging |
| Use headless CLI for final regression | Use `run_tests` for automated regression |
| Use the interactive TUI from Claude Code | Use the headless `hive run` command — TUI hangs in non-interactive shells |
| Test client-facing nodes from Claude Code | Use mock mode, or have the user run the agent in their terminal |
| Run tests without reading goal first | Always understand the goal before writing tests |
| Skip Phase 3 analysis and guess | Use session + log tools to identify root cause |
Example Walkthrough: Deep Research Agent
A complete iteration showing the test loop for an agent with nodes `intake → research → review → report`.
Phase 1: Generate tests
```python
# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")

# Get success criteria test guidelines
result = generate_success_tests(
    goal_id="rigorous-interactive-research",
    goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/deep_research_agent"
)

# Write tests
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + test_code
)
```
Phase 2: First execution
run_tests( goal_id="rigorous-interactive-research", agent_path="exports/deep_research_agent", fail_fast=True )
Result: `test_success_source_diversity` fails — the agent found only 2 sources instead of 5.
Phase 3: Analyze
```python
# Debug the failing test
debug_test(
    goal_id="rigorous-interactive-research",
    test_name="test_success_source_diversity",
    agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2

# Find the session
list_agent_sessions(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    status="completed",
    limit=1
)
# → session_20260209_150000_abc12345

# See what the research node produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source

# Check the LLM's behavior in the research node
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    run_id="session_20260209_150000_abc12345",
    node_id="research"
)
# → LLM called web_search only twice, then called set_output
```
Root cause: The research node's prompt doesn't tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.
Phase 4: Fix the prompt
Read(file_path="exports/deep_research_agent/nodes/__init__.py") Edit( file_path="exports/deep_research_agent/nodes/__init__.py", old_string='system_prompt="Search for information on the user\'s topic."', new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."' )
Phase 5: Resume from checkpoint
For this example, the fix is to the `research` node. If we had run via CLI with checkpointing, we could resume from the checkpoint after intake to skip re-running intake:

```python
# Check if a clean checkpoint exists after intake
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    is_clean="true"
)
# → cp_node_complete_intake_150005
```

```bash
# Resume from after intake, re-run research with the fixed prompt
uv run hive run exports/deep_research_agent \
  --resume-session session_20260209_150000_abc12345 \
  --checkpoint cp_node_complete_intake_150005
```
Or for this simple case (intake is fast), just re-run:
```bash
uv run hive run exports/deep_research_agent --input '{"topic": "test"}'
```
Phase 6: Final verification
run_tests( goal_id="rigorous-interactive-research", agent_path="exports/deep_research_agent" ) # → All 12 tests pass
Test File Structure
```
exports/{agent_name}/
├── agent.py                     ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py            ← Node implementations (prompts, config)
├── config.py                    ← Agent configuration
├── mcp_servers.json             ← Tool server config
└── tests/
    ├── conftest.py              ← Shared fixtures + safe access helpers
    ├── test_constraints.py      ← Constraint tests
    ├── test_success_criteria.py ← Success criteria tests
    └── test_edge_cases.py       ← Edge case tests
```
Integration with Other Skills
| Scenario | From | To | Action |
|---|---|---|---|
| Agent built, ready to test | `/hive-create` | `/hive-test` | Generate tests, start loop |
| Prompt fix needed | Phase 4 | Direct edit | Edit `nodes/__init__.py`, resume |
| Goal definition wrong | Phase 4 | `/hive-create` | Update goal, may need rebuild |
| Missing credentials | Phase 3 | `/hive-credentials` | Set up credentials |
| Complex runtime failure | Phase 3 | `/hive-debug` | Deep L1/L2/L3 analysis |
| All tests pass | Phase 6 | Done | Agent validated |