install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/prompt-tester" ~/.claude/skills/comeonoliver-skillshub-prompt-tester && rm -rf "$T"
manifest:
skills/TerminalSkills/skills/prompt-tester/SKILL.md
Prompt Tester
Overview
Build a systematic approach to prompt engineering. Design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
Instructions
1. Define the evaluation criteria
Before testing prompts, establish what "good" looks like:
## Evaluation Rubric: Customer Support Classifier

| Criterion   | Weight | Description                              |
|-------------|--------|------------------------------------------|
| Accuracy    | 40%    | Correct category assigned                |
| Consistency | 25%    | Same input → same output across runs     |
| Latency     | 15%    | Response time under threshold            |
| Format      | 10%    | Output matches expected JSON schema      |
| Edge cases  | 10%    | Handles ambiguous/unusual inputs         |
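The rubric maps directly onto a weighted score, which makes later comparisons mechanical. A minimal sketch in Python (the weights come from the table above; the per-criterion scores between 0.0 and 1.0 are assumed to come from your own checks):

```python
# Weights from the rubric above. Each per-criterion score is assumed to be
# in [0.0, 1.0], produced by whatever check you run for that criterion.
RUBRIC_WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.25,
    "latency": 0.15,
    "format": 0.10,
    "edge_cases": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores into one number for ranking variants."""
    return sum(weight * scores.get(name, 0.0) for name, weight in RUBRIC_WEIGHTS.items())

# Example: strong on accuracy and format, weaker on edge cases -> 0.929
print(weighted_score({"accuracy": 0.91, "consistency": 0.94, "latency": 1.0,
                      "format": 1.0, "edge_cases": 0.80}))
```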
2. Create test cases
Build a test suite covering normal cases, edge cases, and adversarial inputs:
```yaml
test_cases:
  - id: TC-001
    input: "My order hasn't arrived and it's been 2 weeks"
    expected_category: "shipping_delay"
    expected_priority: "high"
    tags: [normal, shipping]
  - id: TC-002
    input: "I love your product! Also my payment failed"
    expected_category: "billing"
    expected_priority: "high"
    tags: [mixed-intent, edge-case]
  - id: TC-003
    input: "asdf jkl; 12345"
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, garbage-input]
  - id: TC-004
    input: ""
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, empty-input]
```
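If the suite lives in a YAML file, loading and filtering it stays trivial. A sketch assuming PyYAML and a hypothetical `test_cases.yaml` containing the block above:

```python
import yaml  # PyYAML; any structured format works equally well

def load_test_cases(path: str = "test_cases.yaml") -> list[dict]:
    """Load the suite; each case carries id, input, expected_* fields, and tags."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)["test_cases"]

cases = load_test_cases()
hard_cases = [c for c in cases if {"edge-case", "adversarial"} & set(c["tags"])]
```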
3. Design prompt variants
Create 2-3 prompt variants to compare:
Variant A (Concise):
Classify this support ticket into one category: billing, shipping_delay, product_defect, account_access, feature_request, unclassifiable. Return JSON: {"category": "...", "priority": "high|medium|low"}
Variant B (Detailed with examples):
You are a support ticket classifier. Analyze the customer message and assign exactly one category and priority level.

Categories: billing, shipping_delay, product_defect, account_access, feature_request, unclassifiable

Rules:
- If the message contains multiple issues, classify by the most urgent
- If the message is gibberish or empty, use "unclassifiable"
- Priority is "high" for payment/shipping issues, "medium" for product issues, "low" for feature requests

Examples:

Input: "I was charged twice for my subscription"
Output: {"category": "billing", "priority": "high"}

Input: "It would be nice to have dark mode"
Output: {"category": "feature_request", "priority": "low"}

Now classify this message:
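Keeping the variants in one named mapping lets the evaluation loop stay variant-agnostic. A sketch (the variant names and the file path `prompts/variant_b.txt` are illustrative, not part of the skill):

```python
# Prompt variants keyed by name. Variant B is long, so it is assumed to live
# in its own file; Variant A is short enough to inline.
PROMPT_VARIANTS = {
    "A_concise": (
        "Classify this support ticket into one category: billing, shipping_delay, "
        "product_defect, account_access, feature_request, unclassifiable. "
        'Return JSON: {"category": "...", "priority": "high|medium|low"}'
    ),
    "B_detailed": open("prompts/variant_b.txt", encoding="utf-8").read(),
}

def render(variant_name: str, message: str) -> str:
    """Append the test-case input to the chosen prompt variant."""
    return PROMPT_VARIANTS[variant_name] + "\n\nMessage: " + message
```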
4. Run the evaluation
Execute each prompt variant against all test cases and score:
## Evaluation Results

| Metric       | Variant A | Variant B |
|--------------|-----------|-----------|
| Accuracy     | 72%       | 91%       |
| Consistency  | 85%       | 94%       |
| Format match | 100%      | 100%      |
| Edge cases   | 40%       | 80%       |
| Avg tokens   | 12        | 18        |

### Detailed Results

| Test Case | Variant A          | Variant B         | Expected        |
|-----------|--------------------|-------------------|-----------------|
| TC-001    | ✅ shipping_delay  | ✅ shipping_delay | shipping_delay  |
| TC-002    | ❌ general_inquiry | ✅ billing        | billing         |
| TC-003    | ❌ feature_request | ✅ unclassifiable | unclassifiable  |
| TC-004    | ❌ (error)         | ✅ unclassifiable | unclassifiable  |

Winner: Variant B (+19% accuracy, +40% edge case handling)
Tradeoff: ~50% more tokens per request
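A sketch of the loop that produces numbers like these. It assumes a `call_model(prompt) -> str` helper wrapping whatever model API you use, plus the `render` function and the loaded `cases` from the earlier sketches; none of these names are part of the skill itself:

```python
import json

def evaluate(variant_name: str, cases: list[dict], runs: int = 3) -> dict:
    """Score one prompt variant against the whole test suite."""
    correct = consistent = parsed = 0
    for case in cases:
        outputs = []
        for _ in range(runs):  # repeat runs to measure consistency
            raw = call_model(render(variant_name, case["input"]))  # assumed helper
            try:
                outputs.append(json.loads(raw))
                parsed += 1
            except json.JSONDecodeError:
                outputs.append(None)
        categories = [o.get("category") for o in outputs if isinstance(o, dict)]
        if categories and categories[0] == case["expected_category"]:
            correct += 1
        if len(categories) == runs and len(set(categories)) == 1:
            consistent += 1
    n = len(cases)
    return {"accuracy": correct / n,
            "consistency": consistent / n,
            "format_match": parsed / (n * runs)}

# Compare all variants in one pass:
results = {name: evaluate(name, cases) for name in PROMPT_VARIANTS}
```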
5. Iterate and refine
Based on failures, improve the winning variant:
- Analyze which test cases failed and why
- Add specific rules for failure patterns
- Rerun only the failed cases to verify fixes (see the sketch after this list)
- Document the final prompt with its test results
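A small helper for the rerun step, continuing the assumptions of the earlier sketches (`call_model`, `render`, and the loaded `cases`):

```python
import json

def failed_cases(variant_name: str, cases: list[dict]) -> list[dict]:
    """Return the cases the variant currently gets wrong, with what it produced."""
    failures = []
    for case in cases:
        raw = call_model(render(variant_name, case["input"]))  # assumed helper
        try:
            got = json.loads(raw).get("category")
        except json.JSONDecodeError:
            got = None
        if got != case["expected_category"]:
            failures.append({**case, "got": got})
    return failures

# After editing the prompt, verify the fix on just the previous failures:
# assert not failed_cases("B_detailed", previous_failures)
```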
Examples
Example 1: Summarization Prompt
Input: "I need a prompt that summarizes meeting transcripts into action items. Help me test it."
Output:
## Test Suite: Meeting Summary Prompt

Test cases created: 8
- 3 normal meetings (standup, planning, retrospective)
- 2 edge cases (meeting with no action items, all-chat meeting)
- 2 long transcripts (>5000 words)
- 1 adversarial (transcript in mixed languages)

Rubric:
- Action item extraction: 40% (did it find all action items?)
- Owner assignment: 25% (did it identify who owns each item?)
- Brevity: 20% (summary under 200 words?)
- No hallucination: 15% (no invented action items?)

Running 3 variants... Results ready.
Example 2: Code Review Prompt
Input: "Compare these two system prompts for our AI code reviewer"
Output:
## Comparison: Code Review Prompts

Test cases: 6 code snippets (SQL injection, race condition, clean code, style-only issues, empty file, 500-line file)

| Metric              | Prompt A | Prompt B |
|---------------------|----------|----------|
| Bug detection       | 4/6      | 6/6      |
| False positives     | 3        | 1        |
| Actionable feedback | 60%      | 90%      |
| Handles large files | ❌       | ✅       |

Prompt B is better: fewer false positives, catches all bugs, and handles edge cases. Main improvement: explicit severity levels and the "only report issues you are confident about" instruction.
Guidelines
- Always define evaluation criteria BEFORE testing — prevents post-hoc rationalization
- Test at least 8-10 cases: 50% normal, 30% edge cases, 20% adversarial
- Run each variant 3 times to check consistency (LLMs are non-deterministic)
- Track token usage alongside quality — cost matters at scale
- Keep a prompt changelog: version, date, changes, test results
- The winning prompt isn't always the longest — sometimes concise prompts outperform
- Document failure modes: knowing when a prompt breaks is as valuable as knowing when it works
- For production prompts, add regression tests and rerun them when updating the model version (see the sketch below)
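One way to wire the suite into regression testing, assuming pytest plus the `load_test_cases`, `render`, and `call_model` helpers sketched earlier:

```python
import json
import pytest

CASES = load_test_cases()

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_prompt_regression(case):
    raw = call_model(render("B_detailed", case["input"]))
    result = json.loads(raw)  # format check: output must be valid JSON
    assert result["category"] == case["expected_category"]
    assert result["priority"] == case["expected_priority"]
```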