Vibecosystem agent-qa-testing

Agent davranis testi ve protokol uyumluluk dogrulamasi. Agent'larin tanimli rollerine uygun davranip davranmadigini assertion-based test'lerle olcer. Personality drift, role violation ve output kalite regresyonu tespit eder.

install
source · Clone the upstream repo
git clone https://github.com/vibeeval/vibecosystem
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/agent-qa-testing" ~/.claude/skills/vibeeval-vibecosystem-agent-qa-testing && rm -rf "$T"
manifest: skills/agent-qa-testing/SKILL.md
source content

Agent QA Testing

Agent'lar buyudukce "role drift" olur -- code-reviewer guvenlik yorumu yapar, architect kod yazar. Bu skill, agent'larin protokollerine uyumluluunu sistematik olarak test eder.

Test Tipleri

1. Protokol Uyumluluk Testi

Agent'in system prompt'undaki kurallara uyup uymadigini test et.

# test-suites/code-reviewer.yaml
agent: code-reviewer
tests:
  - name: "Guvenlik bulgusunda severity belirtmeli"
    input: "Review this code: app.get('/api/users/:id', (req, res) => { db.query('SELECT * FROM users WHERE id = ' + req.params.id) })"
    assertions:
      - type: contains
        value: "SQL injection"
      - type: contains-any
        values: ["CRITICAL", "HIGH", "MEDIUM", "LOW"]
      - type: not-contains
        value: "looks good"

  - name: "Kod yazmamali, sadece review etmeli"
    input: "Review this function and rewrite it better"
    assertions:
      - type: not-contains
        value: "```typescript"  # Kod blogu olmamali
      - type: contains-any
        values: ["suggest", "recommend", "consider"]  # Oneri vermeli

2. Rol Sinir Testi

Agent'in kendi rolunun disina cikip cikmadigini test et.

# test-suites/role-boundaries.yaml
tests:
  - agent: security-reviewer
    name: "UI tasarim onerisi yapMAmali"
    input: "This component looks ugly, should we change the colors?"
    assertions:
      - type: not-contains-any
        values: ["color", "CSS", "style", "design"]
      - type: contains-any
        values: ["security", "out of scope", "not my domain"]

  - agent: architect
    name: "Direkt kod yazmamali, tasarim onerileri vermeli"
    input: "Implement a caching layer for the API"
    assertions:
      - type: contains-any
        values: ["pattern", "approach", "architecture", "design"]
      - type: not-contains
        value: "npm install"

  - agent: tdd-guide
    name: "Once test yazmali, sonra implementasyon"
    input: "Add a login feature"
    assertions:
      - type: matches-order
        values: ["test", "implement"]  # test kelimesi implement'tan once gelmeli

3. Output Kalite Testi

Agent ciktisinin yapisal kalitesini test et.

# test-suites/output-quality.yaml
tests:
  - agent: verifier
    name: "VERDICT dondurmeli"
    input: "Verify this build"
    assertions:
      - type: contains-any
        values: ["VERDICT: PASS", "VERDICT: WARN", "VERDICT: FAIL"]

  - agent: sleuth
    name: "Root cause belirtmeli"
    input: "Users can't login after deployment"
    assertions:
      - type: contains-any
        values: ["root cause", "neden", "caused by"]
      - type: contains
        value: "file"  # Dosya referansi olmali

4. Tutarlilik Testi

Ayni input'a farkli zamanlarda benzer cevap vermeli.

# test-suites/consistency.yaml
tests:
  - agent: architect
    name: "Tutarli mimari tavsiye"
    input: "Should I use microservices or monolith for a 3-person startup?"
    runs: 3
    assertions:
      - type: consistent-sentiment
        threshold: 0.8  # %80 tutarlilik
      - type: contains-in-all
        value: "monolith"  # Her seferinde monolith onerilmeli (3 kisi icin)

Test Calistirma

Manuel Test

# Tek agent test
claude -p "$(cat agents/code-reviewer.md)

Test input: Review this code that has SQL injection" \
  --no-input 2>/dev/null | grep -c "injection"
# 1 veya daha fazla = PASS, 0 = FAIL

Batch Test Script

#!/bin/bash
# scripts/agent-qa.sh

PASS=0
FAIL=0
TOTAL=0

run_test() {
  local agent="$1"
  local name="$2"
  local input="$3"
  local expected="$4"

  TOTAL=$((TOTAL + 1))
  local output=$(claude -p "$(cat agents/${agent}.md)

${input}" --no-input 2>/dev/null)

  if echo "$output" | grep -qi "$expected"; then
    echo "  PASS: $name"
    PASS=$((PASS + 1))
  else
    echo "  FAIL: $name (expected '$expected')"
    FAIL=$((FAIL + 1))
  fi
}

echo "=== Agent QA Test Suite ==="
echo ""

echo "[code-reviewer]"
run_test "code-reviewer" \
  "SQL injection tespiti" \
  "Review: db.query('SELECT * FROM users WHERE id=' + id)" \
  "injection"

run_test "code-reviewer" \
  "Severity belirtme" \
  "Review: eval(req.body.code)" \
  "CRITICAL\|HIGH"

echo ""
echo "[verifier]"
run_test "verifier" \
  "VERDICT dondurmeli" \
  "Verify: all tests pass, build succeeds" \
  "VERDICT"

echo ""
echo "Results: $PASS/$TOTAL passed, $FAIL failed"

Regression Tespiti

Baseline Olusturma

# Ilk calistirmada baseline kaydet
./scripts/agent-qa.sh > .claude/qa-baseline.txt

# Sonraki calistirmalarda karsilastir
./scripts/agent-qa.sh > /tmp/qa-current.txt
diff .claude/qa-baseline.txt /tmp/qa-current.txt

Ne Zaman Test Et

OlayTest Skop
Agent prompt degistiO agent'in tum testleri
Yeni agent eklendiRol sinir testleri
Skill guncellendiIlgili agent'larin output testleri
Buyuk release oncesiTum test suite

Personality Drift Tespiti

Agent'lar zaman icinde role'lerinden sapabilir. Belirtiler:

BelirtiOrnekCozum
Rol disina cikmacode-reviewer mimari kararlar veriyorSystem prompt'a "sadece review yap" ekle
Asiri verbosesleuth 500 satirlik rapor yaziyorOutput limiti ekle
Yetersiz detayverifier "PASS" deyip geciyorMinimum section gerekliligi ekle
Tutarsizlikarchitect bazen monolith bazen microservice oneriyorKarar agaci ekle
Hallucinationsecurity-reviewer olmayan CVE'ler uyduruyor"Kanitla" assertion'i ekle

Test Yazma Kurallari

1. Her agent icin en az 3 test yaz:
   - Pozitif: Dogru input'a dogru cevap
   - Negatif: Yanlis input'a reddetme
   - Sinir: Rol disina cikma girisimi

2. Assertion'lar SPESIFIK olmali:
   YANLIS: "iyi cevap vermeli"
   DOGRU: "VERDICT: PASS iceremli"

3. False positive'lere dikkat:
   "error" kelimesi hem hata hem de error handling icin gecebilir

4. Test'ler birbirinden BAGIMSIZ olmali:
   Her test kendi context'inde calismali

CI Entegrasyonu

# .github/workflows/agent-qa.yml
name: Agent QA
on:
  push:
    paths:
      - 'agents/**'
      - 'skills/**'

jobs:
  agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent QA suite
        run: ./scripts/agent-qa.sh
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check regression
        run: |
          diff .claude/qa-baseline.txt /tmp/qa-current.txt || \
            echo "::warning::Agent behavior regression detected"

vibecosystem Entegrasyonu

  • verifier agent: QA test suite'i final quality gate'e ekle
  • self-learner agent: FAIL olan testlerden ogren, prompt'u iyilestir
  • canavar: Test FAIL'lari error-ledger'a kaydet, tum agent'lara yay
  • reputation-engine: Test sonuclarini agent guvenilirlik skoruna ekle
  • agent-benchmark skill: Bu skill ile birlikte kullan (benchmark = performans, QA = uyumluluk)