Gemini-cli behavioral-evals
Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures or prompt steering, or adding workspace regression tests.
git clone https://github.com/google-gemini/gemini-cli
T=$(mktemp -d) && git clone --depth=1 https://github.com/google-gemini/gemini-cli "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.gemini/skills/behavioral-evals" ~/.claude/skills/google-gemini-gemini-cli-behavioral-evals && rm -rf "$T"
.gemini/skills/behavioral-evals/SKILL.md

Behavioral Evals
Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
🔄 Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
  - Yes -> Use AppRig (appEvalTest). See creating.md.
  - No -> Use TestRig (evalTest). See creating.md.
- Is it a new test?
  - Yes -> Set policy to USUALLY_PASSES.
  - No -> ALWAYS_PASSES (locks in regression).
- Are you fixing a failure or promoting a test?
  - Fixing -> See fixing.md.
  - Promoting -> See promoting.md.
📋 Quick Checklist
1. Setup Workspace
Seed the workspace with necessary files using the files object to simulate a realistic scenario (e.g., a Node.js project with a package.json).
- Details in creating.md
2. Write Assertions
Audit agent decisions using rig.setBreakpoint() (AppRig only) or index verification on rig.readToolLogs().
- Details in creating.md
3. Verify
Run single tests with Vitest and confirm stability locally before relying on CI workflows.
- See evals/README.md for running commands.
📦 Bundled Resources
Detailed procedural guides:
- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.