Awesome-omni-skill re-evaluate

Re-evaluate attempt repositories after issues are closed, update evaluation reports, and file new issues for any regressions or newly discovered problems.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/cli-automation/re-evaluate" ~/.claude/skills/diegosouzapw-awesome-omni-skill-re-evaluate && rm -rf "$T"
manifest: skills/cli-automation/re-evaluate/SKILL.md
source content

Re-Evaluate Attempt After Changes

Overview

This SOP monitors attempt repositories for closed issues and code changes, re-runs relevant evaluations, updates the main evaluation report, and files new issues for any regressions or newly discovered problems.

Parameters

  • attempt_repo (required): Repository name (e.g., 2025-12-13-python-claude-hive)
  • evaluation_path (optional, default: ./results/{attempt_repo}.md): Path to the existing evaluation report
  • full_reeval (optional, default: false): If true, run a complete evaluation instead of targeted checks

Steps

1. Check Repository for Changes

Determine if the repository has changed since the last evaluation.

Constraints:

  • You MUST check for new commits since last evaluation
  • You MUST identify closed issues that triggered changes
  • You MUST NOT proceed if no changes detected (unless full_reeval=true); see the guard sketch after the commands below
# Navigate to the existing clone, or re-clone it if missing (and cd into it afterwards)
cd ./reviews/{attempt_repo} 2>/dev/null || { gh repo clone brazil-bench/{attempt_repo} ./reviews/{attempt_repo} && cd ./reviews/{attempt_repo}; }

# Fetch latest changes
git fetch origin

# Check for new commits on main/master
git log HEAD..origin/main --oneline 2>/dev/null || git log HEAD..origin/master --oneline

# Pull latest changes
git pull origin main 2>/dev/null || git pull origin master

# Get the latest commit info
git log -1 --format="%H %ai %s"
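
If nothing changed and full_reeval was not requested, stop here. A minimal sketch of that guard (FULL_REEVAL is a hypothetical variable holding the full_reeval parameter; adjust the branch name if the repo uses master):

# Exit early when there are no new commits and a full re-evaluation was not requested
NEW_COMMITS=$(git rev-list --count HEAD..origin/main 2>/dev/null || git rev-list --count HEAD..origin/master)
if [ "${NEW_COMMITS:-0}" -eq 0 ] && [ "${FULL_REEVAL:-false}" != "true" ]; then
  echo "No new commits since the last evaluation - skipping re-evaluation"
  exit 0
fi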

2. Identify Closed Issues

List issues that have been closed since the last evaluation.

Constraints:

  • You MUST fetch all closed issues from the repo
  • You MUST categorize issues by type ([Missing], [Test Quality], [Docs], etc.)
  • You SHOULD note which issues were addressed vs closed without fix
# List all closed issues
gh issue list -R brazil-bench/{attempt_repo} --state closed --json number,title,closedAt,labels

# Get details of recently closed issues
gh issue list -R brazil-bench/{attempt_repo} --state closed --limit 20 --json number,title,body,closedAt

# Check issue comments for resolution notes
gh issue view {issue_number} -R brazil-bench/{attempt_repo} --json comments
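
To separate issues closed since the last evaluation from those closed earlier, filter the closedAt field against the report's timestamp. A hedged sketch (LAST_EVAL is a hypothetical variable taken from the existing evaluation report):

# Keep only issues closed after the last evaluation date
LAST_EVAL="2025-12-13T00:00:00Z"
gh issue list -R brazil-bench/{attempt_repo} --state closed --json number,title,closedAt \
  | jq --arg since "$LAST_EVAL" '[.[] | select(.closedAt > $since)]'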

Issue Categories to Track:

| Issue Type | Re-evaluation Focus |
|------------|---------------------|
| [Missing] | Re-check spec compliance for that requirement |
| [Test Quality] | Re-run tests, recalculate skip ratio, check integration tests are self-contained |
| [Docs] | Re-check README for required elements |
| [Quality] | Re-review architecture/code quality |
| [Compliance] | Full spec compliance re-check |

ALWAYS Check (regardless of issue type):

  • Integration tests must run without skipping due to missing dependencies
  • Tests should use testcontainers or pytest-docker to manage data stores

3. Targeted Re-Evaluation

Based on closed issues, re-evaluate only the affected areas.

Constraints:

  • You MUST re-evaluate areas corresponding to closed issues
  • You MUST use the same methodology as the original evaluation
  • You SHOULD skip unchanged areas unless full_reeval=true

3a. Re-Evaluate Missing Requirements

For closed [Missing] issues:

# Read the original issue to understand what was missing
gh issue view {issue_number} -R brazil-bench/{attempt_repo}

# Check if the requirement is now implemented
# Search for relevant code changes
git diff {old_commit}..HEAD --stat

# Search for specific implementations
grep -r "{feature_keyword}" ./reviews/{attempt_repo}/src/

Verification Checklist:

  • Code implementing the requirement exists
  • Tests covering the requirement pass
  • Integration with existing code is complete
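
One way to confirm the last two checklist items is to run only the tests related to the fixed requirement. A small sketch ({feature_keyword} is the same placeholder used above):

# Run only the tests whose names match the fixed feature
cd ./reviews/{attempt_repo}
pytest -k "{feature_keyword}" -v --tb=short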

3b. Re-Evaluate Test Quality

For closed [Test Quality] issues:

cd ./reviews/{attempt_repo}

# Re-run tests to get current status
pytest --tb=no -v 2>&1 | grep -E "(PASSED|FAILED|SKIPPED|ERROR)" | head -100

# Get summary counts
pytest --tb=no -q 2>&1 | tail -5

# Count collected tests (the denominator for the skip ratio)
pytest --collect-only -q 2>&1 | tail -3
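
A minimal sketch for deriving the skip ratio from the pytest summary line (assumes a summary such as "12 passed, 3 skipped in 1.2s"; adjust the parsing if the output differs):

# Parse the final summary line and compute the skip ratio
SUMMARY=$(pytest --tb=no -q 2>&1 | tail -1)
PASSED=$(echo "$SUMMARY" | grep -oE '[0-9]+ passed' | grep -oE '[0-9]+' || echo 0)
FAILED=$(echo "$SUMMARY" | grep -oE '[0-9]+ failed' | grep -oE '[0-9]+' || echo 0)
SKIPPED=$(echo "$SUMMARY" | grep -oE '[0-9]+ skipped' | grep -oE '[0-9]+' || echo 0)
TOTAL=$((PASSED + FAILED + SKIPPED))
[ "$TOTAL" -gt 0 ] && echo "Skip ratio: $((100 * SKIPPED / TOTAL))%"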

Metrics to Update:

| Metric | Before | After |
|--------|--------|-------|
| Total Tests | | |
| Passed | | |
| Skipped | | |
| Skip Ratio | | |

3c. Re-Evaluate Documentation

For closed [Docs] issues:

# Check README for required elements
head -150 ./reviews/{attempt_repo}/README.md

# Look for setup instructions
grep -E "Setup|Install|Prerequisites" ./reviews/{attempt_repo}/README.md

# Look for MCP server setup
grep -E "MCP|claude|server" ./reviews/{attempt_repo}/README.md

# Look for example Q&A
grep -E "Example|Question|Query|Usage" ./reviews/{attempt_repo}/README.md

Documentation Checklist:

  • Setup Instructions present
  • MCP Server Setup documented
  • Example Q&A included
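
The three checklist items can also be verified in one pass. A small sketch using the same keyword heuristics as the greps above:

# Flag any required README element that appears to be missing
for section in "Setup|Install|Prerequisites" "MCP" "Example|Usage"; do
  grep -qiE "$section" ./reviews/{attempt_repo}/README.md || echo "README missing: $section"
done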

3d. Re-Evaluate Spec Compliance

For closed [Compliance] issues or major changes:

# Read the spec requirements
cat ./tasks/beads/spec.md | grep -E "^##|^-"

# Check implementation against each requirement
# (Follow evaluate-attempt SOP Step 6)

3e. Re-Evaluate Integration Test Quality (REQUIRED)

ALWAYS check this - integration tests must be self-contained and actually run.

cd ./reviews/{attempt_repo}

# Check if integration tests still skip due to missing dependencies
pytest --tb=no -v 2>&1 | grep -E "SKIPPED.*neo4j|SKIPPED.*database|SKIPPED.*not running"

# Check for testcontainers usage (good pattern)
grep -r "testcontainers\|Neo4jContainer\|DockerContainer" tests/ --include="*.py"

# Check for pytest-docker usage (good pattern)
grep -r "pytest-docker\|docker_compose_file" tests/ pyproject.toml

# Check for skip patterns that indicate external dependency issues
grep -r "pytest.skip.*neo4j\|pytest.skip.*database\|skipif.*connection" tests/ --include="*.py"

# Run tests and count integration test results
pytest -v tests/ 2>&1 | grep -E "integration|Integration" | head -20

Self-Contained Test Verification:

| Check | Command | Expected |
|-------|---------|----------|
| Uses testcontainers | grep -r "testcontainers" tests/ | Has matches |
| No external skips | grep "pytest.skip.*not running" | No matches |
| Integration tests run | pytest -v output | 0 skipped for Neo4j |
| Docker fixture exists | grep "Neo4jContainer" conftest.py | Has matches |

If Integration Tests Still Skip:

File a new issue using the template from file-issues SOP section 3d-integration:

  • Title: [Test Quality] Integration tests must be self-contained - use testcontainers
  • Label: bug
  • Include testcontainers code examples in the issue body

Scoring Impact:

| Integration Test State | Score Modifier |
|------------------------|----------------|
| Self-contained (testcontainers/docker) | No penalty |
| In-memory mock (not persistent) | -10 points quality |
| Skips due to missing dependency | -10 points quality |
| No integration tests | -15 points quality |

Skipped Test Policy:

  • ANY skipped test requires an issue to be filed
  • Zero tolerance for skips - all tests must run and pass
  • File separate issue for each category of skip

Document in Report:

## Integration Test Quality

| Aspect | Status |
|--------|--------|
| Self-contained | Yes/No |
| Pattern used | testcontainers / pytest-docker / mock / external |
| Integration tests that run | X passed |
| Integration tests that skip | Y skipped |

{If skips exist}
⚠️ **Issue:** Integration tests skip when dependencies not running.
New issue filed: #{issue_number}

4. Update Evaluation Report

Update the existing evaluation report with new findings.

Constraints:

  • You MUST preserve the original report structure
  • You MUST add a "Re-Evaluation History" section
  • You MUST update metrics that have changed
  • You MUST timestamp each re-evaluation

Report Update Template:

## Re-Evaluation History

### {DATE} - Re-evaluation after issue fixes

**Closed Issues Addressed:**
- #{issue_number}: {title} - {resolution}

**Updated Metrics:**
| Metric | Previous | Current | Change |
|--------|----------|---------|--------|
| Spec Compliance | X/16 | Y/16 | +N |
| Skip Ratio | X% | Y% | -N% |
| Tests Passed | X | Y | +N |

**Changes Summary:**
{description of what changed}

**New Score:** {updated_score}

Update Commands:

# Read existing report
cat ./results/{attempt_repo}.md

# Create backup before modifying
cp ./results/{attempt_repo}.md ./results/{attempt_repo}.md.bak

# Update the report (use Edit tool)
# - Update Summary metrics
# - Update Requirements Checklist
# - Add Re-Evaluation History entry

5. Recalculate Benchmark Score

If metrics changed, recalculate the overall benchmark score.

Constraints:

  • You MUST use the same scoring formula as compare-attempts SOP
  • You MUST update the README leaderboard if rank changes
  • You SHOULD note score changes in re-evaluation history

Scoring Formula (from compare-attempts):

Base Score = (Compliance/16 * 40) + (Effective_Tests * 0.5) + (1/Duration * 10)

Skip Penalty = -5 if Skip_Ratio > 10%

Final Score = Base Score + Skip Penalty
# Check current leaderboard
cat ./README.md | grep -A 20 "Leaderboard"

# Calculate new score
# (Apply formula with updated metrics)

# Update leaderboard if needed
# (Use Edit tool to update README.md)
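
A hedged sketch of the calculation itself, assuming Duration is measured in hours and using hypothetical placeholder metric values:

# Compute the benchmark score from updated metrics (values below are placeholders)
COMPLIANCE=15        # requirements met, out of 16
EFFECTIVE_TESTS=120  # effective test count
DURATION=4           # assumed to be hours
SKIP_RATIO=8         # percent
awk -v c="$COMPLIANCE" -v t="$EFFECTIVE_TESTS" -v d="$DURATION" -v s="$SKIP_RATIO" 'BEGIN {
  base = (c / 16) * 40 + t * 0.5 + (1 / d) * 10
  penalty = (s > 10) ? -5 : 0
  printf "Final Score: %.1f\n", base + penalty
}'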

6. File New Issues (If Needed)

If re-evaluation discovers new problems or regressions, file new issues.

Constraints:

  • You MUST check for regressions (things that worked before but don't now)
  • You MUST file issues for any new problems discovered
  • You MUST NOT re-file issues that are still open
  • You SHOULD reference the re-evaluation in issue body

Regression Detection:

# Compare test results
# If tests that passed before now fail, that's a regression

# Compare spec compliance
# If requirements that were met are now missing, that's a regression

# Compare code metrics
# Significant increases in complexity or decreases in coverage are concerns
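
A hedged sketch of the test-result comparison, assuming the previous run's per-test results were saved as "test_id STATUS" lines (the ./results/{attempt_repo}-tests.txt path is hypothetical):

# Save current per-test results
pytest --tb=no -v 2>&1 | grep -E "PASSED|FAILED" | awk '{print $1, $2}' | sort > /tmp/current-tests.txt

# Report tests that previously passed but now fail
grep " PASSED" ./results/{attempt_repo}-tests.txt | cut -d' ' -f1 | while read -r t; do
  grep -qF "$t FAILED" /tmp/current-tests.txt && echo "REGRESSION: $t"
done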

New Issue Template:

gh issue create -R brazil-bench/{attempt_repo} \
  --title "[Regression] {description}" \
  --label "bug" \
  --body "$(cat <<'EOF'
## Issue

This regression was discovered during re-evaluation on {DATE}.

## Previous State
{what was working before}

## Current State
{what is broken now}

## Commits Involved
{commits between evaluations}

## Suggested Fix
{recommendation}

---
Discovered during re-evaluation: [results/{attempt_repo}.md](https://github.com/brazil-bench/pourpoise/blob/main/results/{attempt_repo}.md)
EOF
)"

7. Generate Summary

Output a summary of the re-evaluation results.

Constraints:

  • You MUST list all closed issues that were verified
  • You MUST note any metric changes
  • You MUST note any new issues filed
  • You MUST indicate if leaderboard position changed

Output Format:

# Re-Evaluation Summary: {attempt_repo}

## Date
{YYYY-MM-DD}

## Closed Issues Verified
| Issue | Title | Status |
|-------|-------|--------|
| #1 | [Missing] MCP server | Verified Fixed |
| #2 | [Test Quality] Skip ratio | Verified Improved |
| #3 | [Docs] README | Partially Fixed |

## Metric Changes
| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Spec Compliance | 13/16 | 15/16 | +2 |
| Skip Ratio | 25% | 8% | -17% |
| Benchmark Score | 45.2 | 52.8 | +7.6 |

## Leaderboard Impact
- Previous Rank: #4
- New Rank: #2

## New Issues Filed
| Issue | Title | Reason |
|-------|-------|--------|
| #5 | [Regression] Test X fails | Broken by commit abc123 |

## Report Updated
./results/{attempt_repo}.md

## Next Steps
{recommendations for remaining issues}

Automation Tips

Scheduled Re-Evaluation

To check all repos for changes:

# List all attempt repos
gh repo list brazil-bench --limit 50 --json name | jq -r '.[].name' | grep -E "^20[0-9]{2}-"

# For each repo, check for closed issues since last check
for repo in $(gh repo list brazil-bench --limit 50 --json name -q '.[].name' | grep -E "^20[0-9]{2}-"); do
  closed=$(gh issue list -R brazil-bench/$repo --state closed --json closedAt -q 'length')
  if [ "$closed" -gt 0 ]; then
    echo "$repo has $closed closed issues - needs re-evaluation"
  fi
done

Webhook Trigger

Consider setting up a GitHub Action that triggers re-evaluation when:

  • An issue is closed
  • A PR is merged
  • A commit is pushed to main

Troubleshooting

No changes detected but issues were closed

  • Issues may have been closed without fixes (won't fix, duplicate)
  • Check issue comments for resolution type
  • Use --full-reeval to force a complete re-evaluation

Tests fail during re-evaluation

  • Ensure dependencies are available (Neo4j, etc.)
  • Check if new dependencies were added
  • Review docker-compose or setup instructions for changes

Score decreased after fixes

  • Possible regression introduced
  • New skip patterns added
  • Check diff between commits carefully

Can't update leaderboard

  • Verify you have write access to pourpoise repo
  • Check that README.md format hasn't changed
  • Manually verify score calculations

Examples

Example 1: Re-evaluate after MCP server implementation

# User request
re-evaluate 2025-12-13-python-claude-hive

# Expected workflow:
# 1. Pull latest changes
# 2. Verify #1 [Missing] MCP server is now implemented
# 3. Update spec compliance from 13/16 to 14/16
# 4. Update evaluation report
# 5. Recalculate score
# 6. Update leaderboard if rank changed

Example 2: Re-evaluate after test quality improvements

# User request
re-evaluate 2025-12-14-python-claude-beads-2

# Expected workflow:
# 1. Pull latest changes
# 2. Re-run tests to get new skip ratio
# 3. Verify skip ratio dropped from 20% to <10%
# 4. Remove skip penalty from score
# 5. Update evaluation report
# 6. Update leaderboard

Example 3: Full re-evaluation

# User request
re-evaluate 2025-09-30-python-swarm --full-reeval

# Expected workflow:
# 1. Pull latest changes
# 2. Run complete evaluation (all steps from evaluate-attempt SOP)
# 3. Compare all metrics to previous evaluation
# 4. Update report with comprehensive re-evaluation
# 5. File issues for any new problems found