# claude-night-market: evaluation-framework

Patterns for building evaluation and scoring systems, quality gates, rubrics, and decision frameworks. Use for any scored assessment.

Clone the repository:

```shell
git clone https://github.com/athola/claude-night-market
```

Or install just this skill:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/athola/claude-night-market "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/leyline/skills/evaluation-framework" ~/.claude/skills/athola-claude-night-market-evaluation-framework && rm -rf "$T"
```

Skill file: `plugins/leyline/skills/evaluation-framework/SKILL.md`

## Table of Contents
- Overview
- When to Use
- Core Pattern
  - 1. Define Criteria
  - 2. Score Each Criterion
  - 3. Calculate Weighted Total
  - 4. Apply Decision Thresholds
- Quick Start
  - Define Your Evaluation
  - Example: Code Review Evaluation
- Evaluation Workflow
- Common Use Cases
- Integration Pattern
- Detailed Resources
- Exit Criteria
# Evaluation Framework

## Overview

A generic framework for weighted scoring and threshold-based decision making. Provides reusable patterns for evaluating any artifact against configurable criteria with a consistent scoring methodology.

This framework abstracts the common pattern: define criteria → assign weights → score against criteria → apply thresholds → make decisions.
## When To Use

- Implementing quality gates or evaluation rubrics
- Building scoring systems for artifacts, proposals, or submissions
- Needing a consistent evaluation methodology across different domains
- Wanting threshold-based automated decision making
- Creating assessment tools with weighted criteria

## When NOT To Use

- Simple pass/fail checks with no scoring needs
## Core Pattern

### 1. Define Criteria

```yaml
criteria:
  - name: criterion_name
    weight: 0.30  # 30% of total score
    description: What this measures
    scoring_guide:
      90-100: Exceptional
      70-89: Strong
      50-69: Acceptable
      30-49: Weak
      0-29: Poor
```
### 2. Score Each Criterion

```python
scores = {
    "criterion_1": 85,  # out of 100
    "criterion_2": 92,
    "criterion_3": 78,
}
```
### 3. Calculate Weighted Total

```python
total = sum(score * weights[criterion] for criterion, score in scores.items())
# Example: (85 × 0.30) + (92 × 0.40) + (78 × 0.30) = 85.7
```
### 4. Apply Decision Thresholds

```yaml
thresholds:
  80-100: Accept with priority
  60-79: Accept with conditions
  40-59: Review required
  20-39: Reject with feedback
  0-19: Reject
```
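The four steps above can be combined into a small evaluator. A minimal sketch, using the criterion names, weights, and thresholds from the examples in this section:

```python
def evaluate(scores: dict[str, float], weights: dict[str, float],
             thresholds: list[tuple[float, str]]) -> tuple[float, str]:
    """Weighted-total evaluation: weight each 0-100 score, then map the total to a decision."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    total = sum(score * weights[name] for name, score in scores.items())
    # thresholds are (lower_bound, decision) pairs, highest bound first
    for lower, decision in thresholds:
        if total >= lower:
            return total, decision
    return total, thresholds[-1][1]


weights = {"criterion_1": 0.30, "criterion_2": 0.40, "criterion_3": 0.30}
scores = {"criterion_1": 85, "criterion_2": 92, "criterion_3": 78}
thresholds = [
    (80, "Accept with priority"),
    (60, "Accept with conditions"),
    (40, "Review required"),
    (20, "Reject with feedback"),
    (0, "Reject"),
]

total, decision = evaluate(scores, weights, thresholds)
# ≈ 85.7 → "Accept with priority"
```

Representing thresholds as ordered `(lower_bound, decision)` pairs keeps the lookup a simple first-match scan.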
## Quick Start

### Define Your Evaluation

- Identify criteria: What aspects matter for your domain?
- Assign weights: Which criteria are most important? (Weights must sum to 1.0.)
- Create scoring guides: What does each score range mean?
- Set thresholds: What total scores trigger which decisions?
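Because the weights must sum to 1.0, it is worth validating a rubric before scoring with it. A minimal sketch, assuming criteria are carried as a list of dicts mirroring the YAML structure above:

```python
def validate_criteria(criteria: list[dict]) -> None:
    """Check a rubric before use: every criterion needs a name and a
    weight in [0, 1], and the weights must sum to 1.0."""
    for c in criteria:
        if "name" not in c or "weight" not in c:
            raise ValueError(f"criterion missing name or weight: {c}")
        if not 0.0 <= c["weight"] <= 1.0:
            raise ValueError(f"weight out of range for {c['name']}")
    total = sum(c["weight"] for c in criteria)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1.0")


validate_criteria([
    {"name": "correctness", "weight": 0.40},
    {"name": "maintainability", "weight": 0.25},
    {"name": "performance", "weight": 0.20},
    {"name": "testing", "weight": 0.15},
])  # passes silently; a bad rubric raises ValueError
```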
### Example: Code Review Evaluation

```yaml
criteria:
  correctness:     {weight: 0.40, description: Does the code work as intended?}
  maintainability: {weight: 0.25, description: Is it readable?}
  performance:     {weight: 0.20, description: Meets performance needs?}
  testing:         {weight: 0.15, description: Are tests thorough?}
thresholds:
  85-100: Approve immediately
  70-84: Approve with minor feedback
  50-69: Request changes
  0-49: Reject, major issues
```
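Applying the code-review rubric in code might look like the sketch below; the per-criterion scores are hypothetical, chosen only to illustrate the arithmetic:

```python
weights = {"correctness": 0.40, "maintainability": 0.25,
           "performance": 0.20, "testing": 0.15}

# Hypothetical scores for one pull request (illustrative only)
scores = {"correctness": 90, "maintainability": 80,
          "performance": 70, "testing": 60}

total = sum(scores[name] * w for name, w in weights.items())
# 90×0.40 + 80×0.25 + 70×0.20 + 60×0.15 = 36 + 20 + 14 + 9 = 79.0

if total >= 85:
    decision = "Approve immediately"
elif total >= 70:
    decision = "Approve with minor feedback"
elif total >= 50:
    decision = "Request changes"
else:
    decision = "Reject, major issues"
# → "Approve with minor feedback"
```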
## Evaluation Workflow

1. Review the artifact against each criterion
2. Assign a 0-100 score for each criterion
3. Calculate: total = Σ(score × weight)
4. Compare the total to the thresholds
5. Take action based on the threshold range
## Common Use Cases

- **Quality Gates:** code review, PR approval, release readiness
- **Content Evaluation:** document quality, knowledge intake, skill assessment
- **Resource Allocation:** backlog prioritization, investment decisions, triage
## Integration Pattern

In your skill's frontmatter:

```yaml
dependencies: [leyline:evaluation-framework]
```
Then customize the framework for your domain:
- Define domain-specific criteria
- Set appropriate weights for your context
- Establish meaningful thresholds
- Document what each score range means
## Detailed Resources

- Scoring Patterns: see `modules/scoring-patterns.md` for detailed methodology
- Decision Thresholds: see `modules/decision-thresholds.md` for threshold design
## Exit Criteria
- Criteria defined with clear descriptions
- Weights assigned and sum to 1.0
- Scoring guides documented for each criterion
- Thresholds mapped to specific actions
- Evaluation process documented and reproducible