Learn-skills.dev hive-test

Iterative agent testing with session recovery. Execute, analyze, fix, resume from checkpoints. Use when testing an agent, debugging test failures, or verifying fixes without re-running from scratch.

install

source · Clone the upstream repo

git clone https://github.com/NeverSight/learn-skills.dev

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/adenhq/hive/hive-test" ~/.claude/skills/neversight-learn-skills-dev-hive-test && rm -rf "$T"

manifest: data/skills-md/adenhq/hive/hive-test/SKILL.md

source content

Agent Testing

Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.

When to Use

Testing a newly built agent against its goal
Debugging a failing agent iteratively
Verifying fixes without re-running expensive early nodes
Running final regression tests before deployment

Prerequisites

Agent package at
```
exports/{agent_name}/
```
(built with
```
/hive-create
```
)
Credentials configured (
```
/hive-credentials
```
)
```
ANTHROPIC_API_KEY
```
set (or appropriate LLM provider key)

Path distinction (critical — don't confuse these):

```
exports/{agent_name}/
```
— agent source code (edit here)
```
~/.hive/agents/{agent_name}/
```
— runtime data: sessions, checkpoints, logs (read here)

The Iterative Test Loop

This is the core workflow. Don't re-run the entire agent when a late node fails — analyze, fix, and resume from the last clean checkpoint.

┌──────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios     │
│ Goal → synthetic test inputs + tests │
└──────────────┬───────────────────────┘
               ↓
┌──────────────────────────────────────┐
│ PHASE 2: Execute                     │◄────────────────┐
│ Run agent (CLI or pytest)            │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
          Pass? ──yes──► PHASE 6: Final Verification     │
               │                                         │
               no                                        │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 3: Analyze                     │                 │
│ Session + runtime logs + checkpoints │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 4: Fix                         │                 │
│ Prompt / code / graph / goal         │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 5: Recover & Resume            │─────────────────┘
│ Checkpoint resume OR fresh re-run    │
└──────────────────────────────────────┘

Phase 1: Generate Test Scenarios

Create synthetic tests from the agent's goal, constraints, and success criteria.

Step 1a: Read the goal

# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")
# Extract the Goal definition and convert to JSON string

Step 1b: Get test guidelines

# Get constraint test guidelines
generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/{agent_name}"
)

# Get success criteria test guidelines
generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/{agent_name}"
)

These return

file_header

test_template

constraints_formatted

success_criteria_formatted

, and

test_guidelines

. They do NOT generate test code — you write the tests.

Step 1c: Write tests

Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)

Test writing rules

Every test MUST be
```
async
```
with
```
@pytest.mark.asyncio
```
Every test MUST accept
```
runner, auto_responder, mock_mode
```
fixtures

Use

await auto_responder.start()

before running,

await auto_responder.stop()

finally

Use
```
await runner.run(input_dict)
```
— this goes through AgentRunner → AgentRuntime → ExecutionStream

Access output via

result.output.get("key")

— NEVER

result.output["key"]

```
result.success=True
```
means no exception, NOT goal achieved — always check output
Write 8-15 tests total, not 30+
Each real test costs ~3 seconds + LLM tokens
NEVER use
```
default_agent.run()
```
— it bypasses the runtime (no sessions, no logs, client-facing nodes hang)

Step 1d: Check existing tests

Before generating, check if tests already exist:

list_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Phase 2: Execute

Two execution paths, use the right one for your situation.

Iterative debugging (for complex agents)

Run the agent via CLI. This creates sessions with checkpoints at

~/.hive/agents/{agent_name}/sessions/

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Sessions and checkpoints are saved automatically.

Client-facing nodes: Agents with

client_facing=True

nodes (interactive conversation) work in headless mode when run from a real terminal — the agent streams output to stdout and reads user input from stdin via a

>>>

prompt. In non-interactive shells (like Claude Code's Bash tool), client-facing nodes will hang because there is no stdin. For testing interactive agents from Claude Code, use

run_tests

with mock mode or have the user run the agent manually in their terminal.

Automated regression (for CI or final verification)

Use the

run_tests

MCP tool to run all pytest tests:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Returns structured results:

{
  "overall_passed": false,
  "summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"},
  "test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}],
  "failures": [{"test_name": "test_success_source_diversity", "details": "..."}]
}

Options:

# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')

# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)

# Parallel execution
run_tests(goal_id, agent_path, parallel=4)

Note:

run_tests

uses

AgentRunner

with

tmp_path

storage, so sessions are isolated per test run. For checkpoint-based recovery with persistent sessions, use CLI execution. Use

run_tests

for quick regression checks and final verification.

Phase 3: Analyze Failures

When a test fails, drill down systematically. Don't guess — use the tools.

Step 3a: Get error category

debug_test(
    goal_id="your-goal-id",
    test_name="test_success_source_diversity",
    agent_path="exports/{agent_name}"
)

Returns error category (

IMPLEMENTATION_ERROR

ASSERTION_FAILURE

TIMEOUT

IMPORT_ERROR

API_ERROR

) plus full traceback and suggestions.

Step 3b: Find the failed session

list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)

Returns session list with IDs, timestamps, current_node (where it failed), execution_quality.

Step 3c: Inspect session state

get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)

Returns execution path, which node was current, step count, timestamps — but excludes memory values (to avoid context bloat). Shows

memory_keys

and

memory_size

instead.

Step 3d: Examine runtime logs (L2/L3)

# L2: Per-node success/failure, retry counts
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)

# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    node_id="research"
)

Step 3e: Inspect memory data

# See what data a node actually produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)

Step 3f: Find recovery points

list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    is_clean="true"
)

Returns checkpoint summaries with IDs, types (

node_start

node_complete

), which node, and

is_clean

flag. Clean checkpoints are safe resume points.

Step 3g: Compare checkpoints (optional)

To understand what changed between two points in execution:

compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id_before="cp_node_complete_research_143030",
    checkpoint_id_after="cp_node_complete_review_143115"
)

Returns memory diff (added/removed/changed keys) and execution path diff.

Phase 4: Fix Based on Root Cause

Use the analysis from Phase 3 to determine what to fix and where.

Root Cause	What to Fix	Where to Edit
Prompt issue — LLM produces wrong output format, misses instructions	Node `system_prompt`	`exports/{agent}/nodes/__init__.py`
Code bug — TypeError, KeyError, logic error in Python	Agent code	`exports/{agent}/agent.py` , `nodes/__init__.py`
Graph issue — wrong routing, missing edge, bad condition_expr	Edges, node config	`exports/{agent}/agent.py`
Tool issue — MCP tool fails, wrong config, missing credential	Tool config	`exports/{agent}/mcp_servers.json` , `/hive-credentials`
Goal issue — success criteria too strict/vague, wrong constraints	Goal definition	`exports/{agent}/agent.py` (goal section)
Test issue — test expectations don't match actual agent behavior	Test code	`exports/{agent}/tests/test_*.py`

Fix strategies by error category

IMPLEMENTATION_ERROR (TypeError, AttributeError, KeyError):

# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")

# Fix the bug
Edit(
    file_path="exports/{agent_name}/nodes/__init__.py",
    old_string="results.get('videos')",
    new_string="(results or {}).get('videos', [])"
)

ASSERTION_FAILURE (test assertions fail but agent ran successfully):

Check if the agent's output is actually wrong → fix the prompt
Check if the test's expectations are unrealistic → fix the test
Use
```
get_agent_session_memory
```
to see what the agent actually produced

TIMEOUT / STALL (agent runs too long):

Check
```
node_visit_counts
```
for feedback loops hitting max_node_visits
Check L3 logs for tool calls that hang
Reduce
```
max_iterations
```
in loop_config or fix the prompt to converge faster

API_ERROR (connection, rate limit, auth):

Verify credentials with
```
/hive-credentials
```
Check MCP server configuration

Phase 5: Recover & Resume

After fixing the agent, decide whether to resume or re-run.

When to resume from checkpoint

Resume when ALL of these are true:

The fix is to a node that comes AFTER existing clean checkpoints
Clean checkpoints exist (from a CLI execution with checkpointing)
The early nodes are expensive (web scraping, API calls, long LLM chains)

# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
  --resume-session session_20260209_143022_abc12345 \
  --checkpoint cp_node_complete_research_143030

This skips all nodes before the checkpoint and only re-runs the fixed node onward.

When to re-run from scratch

Re-run when ANY of these are true:

The fix is to the entry node or an early node
No checkpoints exist (e.g., agent was run via
```
run_tests
```
)
The agent is fast (2-3 nodes, completes in seconds)
You changed the graph structure (added/removed nodes/edges)

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Inspecting a checkpoint before resuming

get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)

Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.

Loop back to Phase 2

After resuming or re-running, check if the fix worked. If not, go back to Phase 3.

Phase 6: Final Verification

Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

All tests should pass. If not, repeat the loop for remaining failures.

Credential Requirements

CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).

Prerequisites

Before running agent tests, you MUST collect ALL required credentials from the user.

Step 1: LLM API Key (always required)

export ANTHROPIC_API_KEY="your-key-here"

Step 2: Tool-specific credentials (depends on agent's tools)

Inspect the agent's

mcp_servers.json

and tool configuration to determine which tools the agent uses, then check for all required credentials:

from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)

Common tool credentials:

Tool Env Var Help URL

HubSpot CRM

HUBSPOT_ACCESS_TOKEN

https://developers.hubspot.com/docs/api/private-apps

Brave Search

BRAVE_SEARCH_API_KEY

https://brave.com/search/api/

Google Search

GOOGLE_SEARCH_API_KEY

GOOGLE_SEARCH_CX

https://developers.google.com/custom-search

Why ALL credentials are required:

Tests need to execute the agent's LLM nodes to validate behavior
Tools with missing credentials will return error dicts instead of real data
Mock mode bypasses everything, providing no confidence in real-world performance

Mock Mode Limitations

Mock mode (

--mock

flag or

MOCK_MODE=1

) is ONLY for structure validation:

Validates graph structure (nodes, edges, connections)
Validates that
```
AgentRunner.load()
```
succeeds and the agent is importable
Does NOT execute event_loop agents — MockLLMProvider never calls
```
set_output
```
, so event_loop nodes loop forever
Does NOT test LLM reasoning, content quality, or constraint validation
Does NOT test real API integrations or tool use

Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials.

Enforcing Credentials in Tests

When writing tests, ALWAYS include credential checks:

import os
import pytest
from aden_tools.credentials import CredentialManager

pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\nRunning in MOCK MODE - structure validation only")
        else:
            pytest.fail(
                "\nANTHROPIC_API_KEY not set!\n"
                "Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
                "Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
            )

    if not mock_mode:
        agent_tools = []  # Update per agent
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\nMissing tool credentials!"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
            pytest.fail("\n".join(lines))

User Communication

When the user asks to test an agent, ALWAYS check for ALL credentials first:

Identify the agent's tools from
```
mcp_servers.json
```
Check ALL required credentials using
```
CredentialManager
```
Ask the user to provide any missing credentials before proceeding
Collect ALL missing credentials in a single prompt — not one at a time

Safe Test Patterns

OutputCleaner

The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.

Safe Access (REQUIRED)

# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]

# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")

# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# SAFE - handle JSON parsing trap (LLM response as string)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
    for item in items:
        ...

Helper Functions for conftest.py

import json
import re

def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)

def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output
    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default
    return current if current is not None else default

# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested

ExecutionResult Fields

result.success=True

means NO exception, NOT goal achieved

# WRONG
assert result.success

# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"

All fields:

```
success: bool
```
— Completed without exception (NOT goal achieved!)
```
output: dict
```
— Complete memory snapshot (may contain raw strings)
```
error: str | None
```
— Error message if failed
```
steps_executed: int
```
— Number of nodes executed
```
total_tokens: int
```
— Cumulative token usage
```
total_latency_ms: int
```
— Total execution time
```
path: list[str]
```
— Node IDs traversed (may repeat in feedback loops)
```
paused_at: str | None
```
— Node ID if paused
```
session_state: dict
```
— State for resuming
```
node_visit_counts: dict[str, int]
```
— Visit counts per node (feedback loop testing)
```
execution_quality: str
```
— "clean", "degraded", or "failed"

Test Count Guidance

Write 8-15 tests, not 30+

2-3 tests per success criterion
1 happy path test
1 boundary/edge case test
1 error handling test (optional)

Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, $0.12.

Test Patterns

Happy Path

@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
    """Test normal successful execution."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "python tutorials"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"

Boundary Condition

@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
    """Test at minimum source threshold."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "niche topic"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])
    if isinstance(sources, list):
        assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"

Error Handling

@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
    """Test graceful handling of empty input."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": ""})
    finally:
        await auto_responder.stop()
    # Agent should either fail gracefully or produce an error message
    output = result.output or {}
    assert not result.success or output.get("error"), "Should handle empty input"

Feedback Loop

@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()
    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times — possible infinite loop"

MCP Tool Reference

Phase 1: Test Generation

# Check existing tests
list_tests(goal_id, agent_path)

# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

Phase 2: Execution

# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)

# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')

# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'

Phase 3: Analysis

# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)

# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)

# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)

# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")

# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")

# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)

# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")

# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")

# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)

Phase 5: Recovery

# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint

# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
  --resume-session {session_id} --checkpoint {checkpoint_id}

Anti-Patterns

Don't	Do Instead
Use `default_agent.run()` in tests	Use `runner.run()` with `auto_responder` fixtures (goes through AgentRuntime)
Re-run entire agent when a late node fails	Resume from last clean checkpoint
Treat `result.success` as goal achieved	Check `result.output` for actual criteria
Access `result.output["key"]` directly	Use `result.output.get("key")`
Fix random things hoping tests pass	Analyze L2/L3 logs to find root cause first
Write 30+ tests	Write 8-15 focused tests
Skip credential check	Use `/hive-credentials` before testing
Confuse `exports/` with `~/.hive/agents/`	Code in `exports/` , runtime data in `~/.hive/`
Use `run_tests` for iterative debugging	Use headless CLI with checkpoints for iterative debugging
Use headless CLI for final regression	Use `run_tests` for automated regression
Use `--tui` from Claude Code	Use headless `run` command — TUI hangs in non-interactive shells
Test client-facing nodes from Claude Code	Use mock mode, or have the user run the agent in their terminal
Run tests without reading goal first	Always understand the goal before writing tests
Skip Phase 3 analysis and guess	Use session + log tools to identify root cause

Example Walkthrough: Deep Research Agent

A complete iteration showing the test loop for an agent with nodes:

intake → research → review → report

Phase 1: Generate tests

# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")

# Get success criteria test guidelines
result = generate_success_tests(
    goal_id="rigorous-interactive-research",
    goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/deep_research_agent"
)

# Write tests
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + test_code
)

Phase 2: First execution

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent",
    fail_fast=True
)

Result:

test_success_source_diversity

fails — agent only found 2 sources instead of 5.

Phase 3: Analyze

# Debug the failing test
debug_test(
    goal_id="rigorous-interactive-research",
    test_name="test_success_source_diversity",
    agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2

# Find the session
list_agent_sessions(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    status="completed",
    limit=1
)
# → session_20260209_150000_abc12345

# See what the research node produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source

# Check the LLM's behavior in the research node
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    run_id="session_20260209_150000_abc12345",
    node_id="research"
)
# → LLM called web_search only twice, then called set_output

Root cause: The research node's prompt doesn't tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.

Phase 4: Fix the prompt

Read(file_path="exports/deep_research_agent/nodes/__init__.py")

Edit(
    file_path="exports/deep_research_agent/nodes/__init__.py",
    old_string='system_prompt="Search for information on the user\'s topic."',
    new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."'
)

Phase 5: Resume from checkpoint

For this example, the fix is to the

research

node. If we had run via CLI with checkpointing, we could resume from the checkpoint after

intake

to skip re-running intake:

# Check if clean checkpoint exists after intake
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    is_clean="true"
)
# → cp_node_complete_intake_150005

# Resume from after intake, re-run research with fixed prompt
uv run hive run exports/deep_research_agent \
  --resume-session session_20260209_150000_abc12345 \
  --checkpoint cp_node_complete_intake_150005

Or for this simple case (intake is fast), just re-run:

uv run hive run exports/deep_research_agent --input '{"topic": "test"}'

Phase 6: Final verification

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent"
)
# → All 12 tests pass

Test File Structure

exports/{agent_name}/
├── agent.py              ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py     ← Node implementations (prompts, config)
├── config.py             ← Agent configuration
├── mcp_servers.json      ← Tool server config
└── tests/
    ├── conftest.py           ← Shared fixtures + safe access helpers
    ├── test_constraints.py   ← Constraint tests
    ├── test_success_criteria.py  ← Success criteria tests
    └── test_edge_cases.py    ← Edge case tests

Integration with Other Skills

Scenario	From	To	Action
Agent built, ready to test	`/hive-create`	`/hive-test`	Generate tests, start loop
Prompt fix needed	`/hive-test` Phase 4	Direct edit	Edit `nodes/__init__.py` , resume
Goal definition wrong	`/hive-test` Phase 4	`/hive-create`	Update goal, may need rebuild
Missing credentials	`/hive-test` Phase 3	`/hive-credentials`	Set up credentials
Complex runtime failure	`/hive-test` Phase 3	`/hive-debugger`	Deep L1/L2/L3 analysis
All tests pass	`/hive-test` Phase 6	Done	Agent validated