OpenClaw · openclaw-qa-testing
Run, watch, debug, and extend OpenClaw QA testing with qa-lab and qa-channel. Use when Codex needs to execute the repo-backed QA suite, inspect live QA artifacts, debug failing scenarios, add new QA scenarios, or explain the OpenClaw QA workflow. Prefer the live OpenAI lane with regular openai/gpt-5.4 in fast mode; do not use gpt-5.4-pro or gpt-5.4-mini unless the user explicitly overrides that policy.
install
source · Clone the upstream repo
git clone https://github.com/openclaw/openclaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/openclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/openclaw-qa-testing" ~/.claude/skills/openclaw-qa-testing && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/openclaw "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/.agents/skills/openclaw-qa-testing" ~/.openclaw/skills/openclaw-qa-testing && rm -rf "$T"
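To confirm the copy landed, a quick check (using the Claude Code path above; substitute ~/.openclaw/skills for the OpenClaw variant):
test -f ~/.claude/skills/openclaw-qa-testing/SKILL.md && echo "skill installed"   # manifest file present?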
manifest: .agents/skills/openclaw-qa-testing/SKILL.md
OpenClaw QA Testing
Use this skill for qa-lab / qa-channel work. Repo-local QA only.
Read first
- docs/concepts/qa-e2e-automation.md
- docs/help/testing.md
- docs/channels/qa-channel.md
- qa/README.md
- qa/scenarios/index.md
- extensions/qa-lab/src/suite.ts
- extensions/qa-lab/src/character-eval.ts
Model policy
- Live OpenAI lane: openai/gpt-5.4
- Fast mode: on
- Do not use: openai/gpt-5.4-pro, openai/gpt-5.4-mini
- Only change model policy if the user explicitly asks.
Default workflow
- Read the scenario pack and current suite implementation.
- Decide lane:
  - mock/dev: mock-openai (see the mock sketch after this list)
  - real validation: live-frontier
- For live OpenAI, use:
  OPENCLAW_LIVE_OPENAI_KEY="${OPENAI_API_KEY}" \
  pnpm openclaw qa suite \
    --provider-mode live-frontier \
    --model openai/gpt-5.4 \
    --alt-model openai/gpt-5.4 \
    --output-dir .artifacts/qa-e2e/run-all-live-frontier-<tag>
- Watch outputs:
  - summary: .artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-summary.json
  - report: .artifacts/qa-e2e/run-all-live-frontier-<tag>/qa-suite-report.md
- If the user wants to watch the live UI, find the current openclaw-qa listen port and report http://127.0.0.1:<port> (one way to find the port is sketched after this list).
- If a scenario fails, fix the product or harness root cause, then rerun the full lane.
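A minimal sketch of the mock/dev lane, reusing only the suite flags shown in the live command above; the mock output directory name is illustrative:
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --output-dir .artifacts/qa-e2e/run-all-mock-openai-<tag>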
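If nothing reports the port directly, one generic fallback is to list listening sockets and look for the QA lab process (the process-name match is a guess, not a documented contract):
lsof -iTCP -sTCP:LISTEN -P | grep -i openclaw   # then hand the user http://127.0.0.1:<port>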
Character evals
Use qa character-eval for style/persona/vibe checks across multiple live models.
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.4,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-6,thinking=high \
  --concurrency 16 \
  --judge-concurrency 16 \
  --output-dir .artifacts/qa-e2e/character-eval-<tag>
- Runs local QA gateway child processes, not Docker.
- Preferred model spec syntax is provider/model,thinking=<level>[,fast|,no-fast|,fast=<bool>] for both --model and --judge-model.
- Do not add new examples with separate --model-thinking; keep that flag as legacy compatibility only.
- Defaults to candidate models openai/gpt-5.4, openai/gpt-5.2, openai/gpt-5, anthropic/claude-opus-4-6, anthropic/claude-sonnet-4-6, zai/glm-5.1, moonshot/kimi-k2.5, and google/gemini-3.1-pro-preview when no --model is passed.
- Candidate thinking defaults to high, with xhigh for OpenAI models that support it. Prefer inline --model provider/model,thinking=<level>; --thinking <level> and --model-thinking <provider/model=level> remain compatibility shims.
- OpenAI candidate refs default to fast mode so priority processing is used where supported. Use inline ,fast, ,no-fast, or ,fast=false for one model; use --fast only to force fast mode for every candidate.
- Judges default to openai/gpt-5.4,thinking=xhigh,fast and anthropic/claude-opus-4-6,thinking=high.
- Report includes judge ranking, run stats, durations, and full transcripts; do not include raw judge replies. Duration is benchmark context, not a grading signal.
- Candidate and judge concurrency default to 16. Use --concurrency <n> and --judge-concurrency <n> to override when local gateways or provider limits need a gentler lane (see the scoped-run sketch after this list).
- Scenario source should stay markdown-driven under qa/scenarios/.
- For isolated character/persona evals, write the persona into SOUL.md and blank IDENTITY.md in the scenario flow. Use SOUL.md + IDENTITY.md only when intentionally testing how the normal OpenClaw identity combines with the character (a persona-seeding sketch follows this list).
- Keep prompts natural and task-shaped. The candidate model should receive character setup through SOUL.md, then normal user turns such as chat, workspace help, and small file tasks; do not ask "how would you react?" or tell the model it is in an eval.
- Prefer at least one real task, such as creating or editing a tiny workspace artifact, so the transcript captures character under normal tool use instead of pure roleplay.
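A gentler scoped run for tight local or provider limits, using only flags documented above; the candidate subset and output directory are illustrative, and judges fall back to the documented defaults:
pnpm openclaw qa character-eval \
  --model openai/gpt-5.4,thinking=xhigh \
  --model anthropic/claude-opus-4-6,thinking=high \
  --concurrency 4 \
  --judge-concurrency 4 \
  --output-dir .artifacts/qa-e2e/character-eval-scoped-<tag>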
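A minimal persona-seeding sketch for an isolated eval, assuming the scenario flow can write workspace files directly; the persona text is a placeholder, not a prescribed format:
cat > SOUL.md <<'EOF'
You are <persona>: terse, dry, allergic to filler. Stay in character
across chat, workspace help, and small file edits.
EOF
: > IDENTITY.md   # blanked so only the persona shapes the run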
Codex CLI model lane
Use model refs shaped like codex-cli/<codex-model> whenever QA should exercise Codex as a model backend.
Examples:
pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --model codex-cli/<codex-model> \
  --alt-model codex-cli/<codex-model> \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/codex-<tag>
pnpm openclaw qa manual \
  --model codex-cli/<codex-model> \
  --message "Reply exactly: CODEX_OK"
- Treat the concrete Codex model name as user/config input; do not hardcode it in source, docs examples, or scenarios.
- Live QA preserves CODEX_HOME so Codex CLI auth/config works while keeping HOME and OPENCLAW_HOME sandboxed.
- Mock QA should scrub CODEX_HOME.
- If Codex returns fallback/auth text every turn, first check CODEX_HOME, ~/.profile, and gateway child logs before changing scenario assertions (a first-pass check is sketched after this list).
- For model comparison, include codex-cli/<codex-model> as another candidate in qa character-eval; the report should label it as an opaque model name.
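A hypothetical first-pass check for the fallback/auth symptom; the default ~/.codex location is an assumption, not something this doc pins down:
echo "CODEX_HOME=${CODEX_HOME:-unset}"    # preserved by live QA, scrubbed by mock QA
ls "${CODEX_HOME:-$HOME/.codex}"          # auth/config actually present? (default path assumed)
pnpm openclaw qa manual \
  --model codex-cli/<codex-model> \
  --message "Reply exactly: CODEX_OK"     # smoke-test the lane per the manual example above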
Repo facts
- Seed scenarios live in qa/.
- Main live runner: extensions/qa-lab/src/suite.ts
- QA lab server: extensions/qa-lab/src/lab-server.ts
- Child gateway harness: extensions/qa-lab/src/gateway-child.ts
- Synthetic channel: extensions/qa-channel/
What “done” looks like
- Full suite green for the requested lane.
- User gets:
- watch URL if applicable
- pass/fail counts
- artifact paths
- concise note on what was fixed
Common failure patterns
- Live timeout too short:
  - widen live waits in extensions/qa-lab/src/suite.ts (one way to locate them is sketched below)
- Discovery cannot find repo files:
  - point prompts at repo/... inside the seeded workspace
- Subagent proof too brittle:
  - prefer stable final reply evidence over transient child-session listing
- Harness “rebuild” delay:
  - dirty tree can trigger a pre-run build; expect that before ports appear
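One generic way to locate the live waits, assuming they are plain timeout/wait constants in the suite source; the search terms are guesses, not the file's actual identifiers:
grep -n -iE "timeout|wait" extensions/qa-lab/src/suite.ts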
When adding scenarios
- Add or update scenario markdown under qa/scenarios/
- Keep kickoff expectations in qa/scenarios/index.md aligned
- Add executable coverage in extensions/qa-lab/src/suite.ts
- Prefer end-to-end assertions over mock-only checks
- Save outputs under .artifacts/qa-e2e/ (a minimal end-to-end flow is sketched below)
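A minimal sketch of that flow, assuming --scenario scopes the suite to one id as in the Codex lane example; the editor steps stand in for the real changes:
$EDITOR qa/scenarios/<scenario-id>.md        # scenario markdown
$EDITOR qa/scenarios/index.md                # keep kickoff expectations aligned
$EDITOR extensions/qa-lab/src/suite.ts       # executable coverage
pnpm openclaw qa suite \
  --provider-mode mock-openai \
  --scenario <scenario-id> \
  --output-dir .artifacts/qa-e2e/<scenario-id>-mock-<tag>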