Gemini-cli behavioral-evals

Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests.

install
source · Clone the upstream repo
git clone https://github.com/google-gemini/gemini-cli
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/google-gemini/gemini-cli "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.gemini/skills/behavioral-evals" ~/.claude/skills/google-gemini-gemini-cli-behavioral-evals && rm -rf "$T"
manifest: .gemini/skills/behavioral-evals/SKILL.md
source content

Behavioral Evals

Overview

Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.


🔄 Workflow Decision Tree

  1. Does a prompt/tool change need validation?
    • No -> Normal integration tests.
    • Yes -> Continue below.
  2. Is it UI/Interaction heavy?
  3. Is it a new test?
    • Yes -> Set policy to
      USUALLY_PASSES
      .
    • No ->
      ALWAYS_PASSES
      (locks in regression).
  4. Are you fixing a failure or promoting a test?

📋 Quick Checklist

1. Setup Workspace

Seed the workspace with necessary files using the

files
object to simulate a realistic scenario (e.g., NodeJS project with
package.json
).

2. Write Assertions

Audit agent decisions using

rig.setBreakpoint()
(AppRig only) or index verification on
rig.readToolLogs()
.

3. Verify

Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.


📦 Bundled Resources

Detailed procedural guides:

  • creating.md: Assertion strategies, Rig selection, Mock MCPs.
  • fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
  • promoting.md: Candidate identification criteria and threshold guidelines.