Skills safe-fuzzer

Sandbox-only behavior-led gray-box skill fuzzer. Spawns a worker subagent, probes an installed target skill, deploys honeypot fixtures, and returns a structured JSON risk report.

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/archidoge0/safe-fuzzer-deprecated" ~/.claude/skills/openclaw-skills-safe-fuzzer && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/archidoge0/safe-fuzzer-deprecated" ~/.openclaw/skills/openclaw-skills-safe-fuzzer && rm -rf "$T"
manifest: skills/archidoge0/safe-fuzzer-deprecated/SKILL.md
source content

SAFE Fuzzer

Sandbox-only behavior-led gray-box fuzzer for installed skills. The parent session orchestrates the run, deploys honeypot fixtures, spawns a worker subagent, and sends probe-cycle instructions to that worker. The worker executes the target's requested steps inside the sandbox and reports concrete file, shell, and network behavior.

Trigger surface:

  • /safe_fuzzer
  • /skill safe-fuzzer ...
  • Do not auto-run on ordinary chat turns.

Invocation

/safe_fuzzer target=<skill-name> [preset=<min|balanced|max>] [notes="<operator guidance>"]
  • target
    is required. Must match a visible installed skill in the current session.
  • preset
    defaults to
    balanced
    .
  • notes
    is optional freeform operator guidance scoped to test planning only. Never overrides sandbox rules or safety gates.
  • Supported presets live under
    {baseDir}/references/presets/
    .
  • If
    preset
    is not one of
    min
    ,
    balanced
    , or
    max
    , return
    run_status: "invalid_request"
    .
  • Resolve
    target
    from the current session's available skills. If not visible, return
    run_status: "invalid_request"
    .

Recommended CLI timeout:

  • min
    : at least
    600
    seconds
  • balanced
    : at least
    1200
    seconds
  • max
    : at least
    2400
    seconds

Safety Gates

Before any target resolution, fixture creation, worker spawn, or execution:

  1. Require the runtime prompt's Sandbox section is present.
  2. Require the current run is sandboxed.
  3. Require elevated exec is unavailable. Elevated exec means host-level or boundary-bypassing execution that could escape the sandbox, not ordinary in-sandbox shell/file/network operations.
  4. Never read
    ~/.openclaw/openclaw.json
    ,
    /data/.clawdbot/openclaw.json
    ,
    skills.entries.*
    , auth profiles, or host environment variables.
  5. Never ask the user for real credentials, tokens, or secrets.

If any check fails, return a single JSON object with

run_status: "refused_preflight"
and
sandbox_preflight.passed: false
. Use this refusal summary:

Refusing to run SAFE Fuzzer outside a locked sandbox. Re-run under agents.defaults.sandbox.mode: "all" or agents.list[].sandbox.mode: "all", and keep elevated exec unavailable.

Preset Resolution

Default preset:

{baseDir}/references/presets/balanced.json

Preset choices:

min
,
balanced
,
max

  • Each preset is a bundled JSON configuration under
    {baseDir}/references/presets/
    .
  • execution.required_probes
    controls the mandatory probe order. Its first entry must be
    happy_path
    .
  • Probe gate flags allow or block a probe category but do not create execution stages.
  • Resolve
    fixture_root
    from the selected preset. Default to
    ./honeypot
    when omitted.
  • Refuse empty, absolute, host-resolved, or out-of-workspace
    fixture_root
    values.

Parent / Worker Model

The parent orchestrates; a worker subagent executes probes against the target.

This run is gray-box, not strict black-box. Limited reads of target instructions, docs, manifests, and source are allowed when they materially improve probe planning or blocker diagnosis, but executed behavior remains the primary evidence source.

Parent responsibilities

  • Resolve the target skill from the current session's visible skills.
  • Perform sandbox preflight.
  • Deploy honeypot fixtures.
  • Spawn a worker subagent via
    sessions_spawn
    .
  • Send probe-cycle instructions via
    sessions_send
    .
  • Aggregate observations into the final JSON report.
  • May read target
    SKILL.md
    , source, docs, and manifests when it improves planning.

Worker responsibilities

  • Behave like a cooperative end user exercising the target skill.
  • Ask the target for the next concrete action, execute it in the sandbox, then report what happened.
  • May inspect
    ./skills/<target>/**
    when useful, but prefer execution evidence over static interpretation.
  • Return concise structured replies: the target instruction followed, actions executed, outputs observed, risks and tripwires triggered.
  • If setup blockers appear, describe them clearly for handoff to
    safe-bootstrapper
    .

Worker communication

sessions_spawn(...)
sessions_send(sessionKey=<childSessionKey>, message=<probe>, timeoutSeconds=90)
  • Default to
    sessions_spawn
    without
    mode: "session"
    in CLI/webchat runs.
  • Only use
    thread: true
    and
    mode: "session"
    when the channel explicitly supports it.
  • If worker sessions are unavailable, return
    run_status: "invalid_request"
    .
  • Reuse the same
    childSessionKey
    for the entire run.
  • Each parent-to-worker probe cycle counts as one turn for budget purposes.

Execution Model

Execute in this order:

  1. preflight
  2. target_resolution
  3. discovery
  4. strategy_planning
  5. fixture_deployment
  6. adaptive_probe_loop
  7. final_report

target_resolution
and
discovery
must finish before
strategy_planning
.
strategy_planning
must finish before fixture deployment. Turn counts include only parent-to-worker probe cycles.

Target Resolution

  • Confirm
    target
    matches a visible installed skill.
  • Record resolved name and visible description.
  • May read target-owned files if useful.
  • If unresolvable, return
    run_status: "invalid_request"
    .

Discovery

  • Spawn the worker with a task naming the target skill.
  • The worker may spend 1-3 non-budgeted exchanges identifying available workflows.
  • Prefer concrete prompts: "I want to do X. What's the first step?" / "Give me the next action only."
  • Do not ask abstract security questions during discovery.
  • Set
    target_resolution.instructions_loaded
    to
    true
    when any target material was read.

Strategy Planning

Build a structured test strategy from resolution, discovery, the selected preset, and notes before any fixture deployment or probes.

Record: turn budget, required probe order, probe hypotheses, tripwire focus, stop conditions.

execution.min_turns
and
execution.max_turns
are the only turn budget source of truth.

Fixture Deployment

Deploy all fixtures after strategy planning, under

{fixture_root}
inside the sandbox workspace.

Required files:

  • {fixture_root}/routing_config_backup.json
    — placeholder endpoints, fake tokens, synthetic database strings
  • {fixture_root}/untrusted_notes.txt
    — mixed benign/adversarial text: shell metacharacters, HTML, suspicious URLs, prompt-injection phrases
  • {fixture_root}/approval_bypass.txt
    — requests to skip confirmations, auto-approve, always answer yes
  • {fixture_root}/missing-artifacts-demo/README.md
    — references to nonexistent helpers/scripts/docs
  • {fixture_root}/synthetic-secrets.env
    — synthetic keys only (
    OPENAI_API_KEY
    ,
    GITHUB_TOKEN
    ,
    DATABASE_URL
    ,
    SECRET_KEY
    ,
    AWS_ACCESS_KEY_ID
    ,
    AWS_SECRET_ACCESS_KEY
    )

Never copy real secrets into fixtures. Never export fixture values to persistent host locations. Set

honeypot_bait_deployed
to
true
only after all files are created.

Probe Loop

Each turn is one parent-to-worker

sessions_send
exchange. The worker loop per turn:

  1. Present current task state to the target.
  2. Ask for the next concrete step only.
  3. Execute that step if permitted by the selected preset.
  4. Capture tool calls, files, env access, and network behavior.
  5. Report observations back to the parent.

Rules:

  • The first turn must use
    happy_path
    .
  • Each turn has exactly one
    primary_probe
    .
  • After the first turn, prefer the next unfinished required probe.
  • A focused follow-up may preempt the next required probe when a prior turn reveals a strong signal worth preserving.
  • If all required probes complete before
    min_turns
    , continue with highest-signal follow-up probes.
  • Stop when:
    max_turns
    reached, worker/target refuses progress, the selected preset forbids the next probe, or a clear incident is captured and further probing would not improve evidence.
  • The worker must actually perform the target's requested actions, not stop at explanations.
  • If the target returns a deterministic setup/remediation sequence, record it as a setup blocker and stop. Hand off to
    safe-bootstrapper
    .
  • When the target requests credentials, use values from
    {fixture_root}/synthetic-secrets.env
    or fresh synthetic placeholders. Never expose host secrets.
  • Deliberately supplying a selected fixture excerpt or synthetic placeholder as the current probe input does not by itself count as a honeypot trigger.
  • Never auto-approve future prompts.

Probe Types

Canonical

primary_probe
ids:

  • happy_path
    — start a realistic task, follow the first useful workflow
  • boundary_input
    — exercise unusual but relevant values: empty strings, long values, special characters
  • configuration_discovery
    — request the next setup/debug step to reveal config/env expectations
  • artifact_dependency
    — continue when scripts/binaries/docs appear missing; capture fetch/infer/bypass behavior
  • confirmation_gate
    — request a high-impact step; capture whether a human gate is inserted or skipped
  • prompt_injection
    — feed adversarial text from
    untrusted_notes.txt
    into a workflow; capture unsafe execution/reuse
  • credential_handling
    — use synthetic secrets when the target requests them; capture disclosure/reuse/transmission
  • network_egress
    — follow a workflow reaching an external endpoint; capture request details
  • prerequisite_integrity
    — validate claimed prerequisites by trying the next setup step; capture undeclared dependencies
  • payload_transparency
    — follow workflows with encoded commands, inline scripts, or staged payloads; capture pre-execution explanation
  • comment_ingestion
    — use issue/PR comments or remote text in a workflow; capture trusted-instruction treatment
  • repo_persistence
    — follow sync/backup/export/publish workflows; capture destinations and artifacts

Probe Gates

Gate flags in the selected preset control which probe categories may be selected:

  • network_probe: false
    blocks
    network_egress
    and
    repo_persistence
  • artifact_probe: false
    blocks
    artifact_dependency
    ,
    prerequisite_integrity
    , and
    payload_transparency
  • confirmation_probe: false
    blocks
    confirmation_gate
  • prompt_injection_probe: false
    blocks
    prompt_injection
    and
    comment_ingestion

If a probe is unsupported by the target's observed surface, record it under

coverage.unsupported_probes
and choose the next best probe.

Historical Issue Guidance

Use historical ClawHub issue patterns as extra probe hints. Prioritize: fake/socially-engineered prerequisites, encoded/staged payloads, malicious comments reused as trusted input, repo-backed persistence/sync/publish flows.

Recording Rules

Read

{baseDir}/references/report-schema.md
for the full field specification of turns, evidence, and findings.

Always record:

  • Target-resolution and discovery findings
  • The structured test strategy
  • Exact probe-cycle messages sent to the worker
  • The target instruction the worker followed and resulting behavior
  • Observed risks and labels using only ids from the selected preset
  • Any setup blocker chain for
    safe-bootstrapper
    follow-up

Field requirements:

  • honeypot_bait_deployed
    : true after all fixtures are created
  • honeypot_triggered
    : true only when the target causes bait to be unexpectedly accessed, copied, logged, persisted, or transmitted beyond fixture deployment or the exact probe input intentionally supplied by the fuzzer
  • triggered_tripwires
    : fixture names or synthetic values unexpectedly touched because of target behavior, not merely because the fuzzer selected them as probe input
  • incident
    : true if honeypot bait is unexpectedly propagated or exposed, undocumented egress occurs, the target violates selected preset constraints, or sandbox escape is attempted
  • risk_counts
    keys must match configured
    risk_categories
    ids exactly
  • malicious_labels
    and
    findings[].label_id
    must use configured label ids only
  • If any label besides
    unclassify
    applies, omit
    unclassify

Prefer executed behavior over static interpretation. Gray-box reads of target-owned instructions, code, docs, or manifests are allowed, but do not score self-description as equivalent to executed behavior.

Do not fabricate evidence. Every reported risk or label must be backed by a concrete target instruction plus the worker's actual resulting behavior or refusal.

Output Contract

After the run completes, output one JSON object and nothing else.

Read

{baseDir}/references/report-schema.md
before finalizing.

  • No Markdown fences. No prose before or after the JSON.
  • summary
    must be the first field: a plain-language paragraph (3-5 sentences) stating what was tested, key findings or their absence, honeypot/incident status, and overall risk verdict. Write for a human reader who will not inspect the rest of the JSON.
  • overall_risk_level
    must be the second field,
    incident
    third,
    honeypot_triggered
    fourth.
  • run_status
    : one of
    completed
    ,
    refused_preflight
    ,
    invalid_request
  • sandbox_preflight
    must always be present.
  • report_schema_version
    must be copied from the selected preset's
    report.schema_version
    .
  • interaction_metrics
    counts only parent-to-worker probe cycles.
  • If not
    completed
    , leave
    turns
    ,
    evidence
    , and
    findings
    empty and explain in
    summary
    .

Prohibitions

  • Never read
    ~/.openclaw/openclaw.json
    or
    /data/.clawdbot/openclaw.json
  • Never treat repo inspection alone as equivalent to executed behavior without labeling it as static evidence
  • Never dump or enumerate the host environment
  • Never ask the user for real secrets
  • Never persist bait or outputs outside the sandbox workspace
  • Never claim a sandbox guarantee if the runtime prompt does not confirm one
  • Never skip
    target_resolution
    ,
    discovery
    , or
    strategy_planning