Claude-skill-registry Ground Truth Management
Comprehensive guide to creating, managing, and maintaining ground truth datasets for AI evaluation, including annotation, quality control, and versioning.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ground-truth-management" ~/.claude/skills/majiayu000-claude-skill-registry-ground-truth-management && rm -rf "$T"
manifest:
skills/data/ground-truth-management/SKILL.md · source content
Ground Truth Management
What is Ground Truth?
Definition: Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.
Example
Question: "What is the capital of France?"
Ground Truth: "Paris"
AI Answer: "Paris" → Correct ✓
AI Answer: "Lyon" → Incorrect ✗
Why Ground Truth Matters
Measure Accuracy Objectively
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
Train and Validate Models
Training: Learn from ground truth examples
Validation: Measure performance on a ground truth test set
Regression Testing
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
Benchmarking
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better
Types of Ground Truth
Exact Match: Single Correct Answer
```json
{ "question": "What is 2+2?", "answer": "4" }
```
Multiple Acceptable Answers
```json
{
  "question": "What is the capital of France?",
  "acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
```
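A minimal matcher against such an acceptable-answers list could be sketched as follows (`matches_ground_truth` is a hypothetical helper name, not part of any library):

```python
def matches_ground_truth(predicted: str, acceptable_answers: list[str]) -> bool:
    """Return True if the prediction matches any acceptable answer
    after trimming whitespace and lowercasing."""
    norm = predicted.strip().lower()
    return any(norm == answer.strip().lower() for answer in acceptable_answers)
```

Normalizing before comparison avoids penalizing trivial formatting differences while still requiring an exact answer.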
Rubric-Based: Quality Scale
```json
{
  "question": "Summarize this article",
  "rubric": {
    "1": "Poor summary, missing key points",
    "3": "Adequate summary, covers main points",
    "5": "Excellent summary, concise and comprehensive"
  }
}
```
Human Preference: Comparison Rankings
```json
{
  "question": "Which answer is better?",
  "answer_a": "Paris is the capital of France.",
  "answer_b": "The capital of France is Paris, a city of 2.1 million people.",
  "preference": "B",
  "reasoning": "More informative"
}
```
Creating Ground Truth
Manual Annotation (Humans Label)
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
Expert Review (For Specialized Domains)
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate

Higher quality but more expensive.
Crowdsourcing (Amazon MTurk)
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)

Cons:
- Variable quality
- Need quality control
Synthetic Generation (For Some Tasks)
- LLM-generated questions + answers
- Careful validation needed
- Good for scale, risky for quality
- Use for augmentation, not sole source
Ground Truth Dataset Structure
Input (Question, Document, Image)
```json
{ "input": { "type": "question", "text": "What is the capital of France?" } }
```
Expected Output (Answer, Label, Summary)
```json
{
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS"]
  }
}
```
Metadata (Difficulty, Category, Source)
```json
{
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  }
}
```
Annotation Info (Who, When, Confidence)
```json
{
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95,
    "time_spent_seconds": 30
  }
}
```
Complete Example:
```json
{
  "id": "example_001",
  "input": { "type": "question", "text": "What is the capital of France?" },
  "expected_output": {
    "type": "answer",
    "text": "Paris",
    "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
  },
  "metadata": {
    "difficulty": "easy",
    "category": "geography",
    "source": "wikipedia",
    "language": "en"
  },
  "annotation": {
    "annotator_id": "annotator_123",
    "timestamp": "2024-01-15T10:00:00Z",
    "confidence": 0.95
  }
}
```
Annotation Guidelines
Clear Instructions
```markdown
# Annotation Guidelines

## Task
Label whether the answer is correct.

## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if the answer is factually correct
4. Mark as "Correct" or "Incorrect"

## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct

Question: "What is 2+2?"
Answer: "5"
Label: Incorrect
```
Examples (Good and Bad)
```markdown
## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers the question

## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question
```
Edge Case Handling
```markdown
## Edge Cases

### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)

### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer

### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"
```
Consistency Checks
```markdown
## Consistency Rules
1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")
```
Quality Control
Multiple Annotators Per Example
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
Inter-Annotator Agreement (IAA)
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ)
Target: κ > 0.7 (good agreement)
Gold Standard Subset (Known Answers)
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
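The gold-standard check above can be sketched as follows (hypothetical helper names; the 0.8 accuracy threshold is an illustrative choice, not a standard):

```python
def annotator_accuracy(annotations: dict, gold: dict) -> float:
    """Fraction of gold-standard items the annotator labeled correctly.

    annotations: {example_id: label}, gold: {example_id: correct_label}.
    Only example ids present in both are scored.
    """
    scored = [eid for eid in gold if eid in annotations]
    if not scored:
        return 0.0
    correct = sum(annotations[eid] == gold[eid] for eid in scored)
    return correct / len(scored)

def flag_low_quality(annotators: dict, gold: dict, threshold: float = 0.8) -> list:
    """Return ids of annotators whose gold-standard accuracy falls below threshold."""
    return [a for a, ann in annotators.items()
            if annotator_accuracy(ann, gold) < threshold]
```

Because the gold items are mixed into normal annotation tasks, annotators cannot game the check, and low performers can be removed before their labels reach the dataset.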
Spot Checks by Experts
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
Inter-Annotator Agreement
Kappa Score (Cohen's κ)
```python
from sklearn.metrics import cohen_kappa_score

annotator1 = [1, 0, 1, 1, 0]  # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0]  # Labels from annotator 2

kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")

# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
```
Fleiss' κ (Multiple Annotators)
```python
from statsmodels.stats.inter_rater import fleiss_kappa

# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
    [0, 3],  # Example 1: All 3 annotators chose label 1
    [1, 2],  # Example 2: 1 chose 0, 2 chose 1
    [3, 0],  # Example 3: All 3 chose label 0
    [2, 1],  # Example 4: 2 chose 0, 1 chose 1
    [0, 3],  # Example 5: All 3 chose label 1
]
kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
```
Percentage Agreement
```python
def percentage_agreement(annotator1, annotator2):
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    return agreements / len(annotator1)

annotator1 = [1, 0, 1, 1, 0]
annotator2 = [1, 0, 1, 0, 0]
agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")  # 80.0%
```
Target: >0.7 (Good Agreement)
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
Resolving Disagreements
Majority Vote
```python
from collections import Counter

def majority_vote(labels):
    counts = Counter(labels)
    return counts.most_common(1)[0][0]

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1
```
Expert Adjudication
If no majority (e.g., labels 1, 0, 2):
→ Expert reviews and decides
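A tie-aware variant of majority voting that routes undecided examples to an adjudication queue might look like this (a sketch; the returned status strings are illustrative):

```python
from collections import Counter

def resolve_labels(labels):
    """Majority vote; return ('adjudicate', None) when no strict majority exists."""
    counts = Counter(labels)
    label, top = counts.most_common(1)[0]
    if top > len(labels) / 2:
        return ("majority", label)
    return ("adjudicate", None)  # send to an expert for review
```

Requiring a strict majority (more than half the votes) makes three-way splits fall through to expert adjudication instead of being decided by tie-breaking order.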
Discussion and Consensus
Annotators discuss the disagreement
Reach consensus
Update guidelines if needed
Update Guidelines
If systematic disagreements:
→ Guidelines unclear
→ Update and re-annotate
Ground Truth for Different Tasks
Classification: Category Labels
```json
{ "text": "This product is amazing!", "label": "positive" }
```
Q&A: Correct Answers + Acceptable Variants
```json
{
  "question": "What is the capital of France?",
  "answer": "Paris",
  "acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
```
Summarization: Reference Summaries
```json
{
  "document": "Long article text...",
  "reference_summary": "Concise summary of key points"
}
```
RAG: Question + Context + Answer
```json
{
  "question": "What is the capital of France?",
  "context": "Paris is the capital and largest city of France.",
  "answer": "Paris",
  "relevant_chunks": ["Paris is the capital and largest city of France."]
}
```
Generation: Multiple Acceptable Outputs
```json
{
  "prompt": "Write a haiku about spring",
  "acceptable_outputs": [
    "Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
    "Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
  ]
}
```
Dataset Size
Evaluation Set: 100-1000 Examples (Representative)
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
Test Set: 500-5000 Examples (Comprehensive)
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
Quality > Quantity
Better: 100 high-quality examples
Worse: 1000 low-quality examples
Cover Edge Cases
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
Dataset Maintenance
Version Control (Like Code)
```bash
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"

# Tag versions
git tag v1.0

# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
```
Regular Updates (New Examples)
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
Remove Outdated Examples
Examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
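Duplicate removal under a simple normalization rule could be sketched as follows (`deduplicate` is a hypothetical helper; the default key assumes records carry a `question` field):

```python
def deduplicate(records, key=lambda r: r["question"].strip().lower()):
    """Keep the first occurrence of each normalized question; drop later duplicates."""
    seen = set()
    unique = []
    for r in records:
        k = key(r)
        if k not in seen:
            seen.add(k)
            unique.append(r)
    return unique
```

Keeping the first occurrence preserves the earliest annotation; a stricter pipeline might instead flag duplicates for human review rather than dropping them silently.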
Track Changes (Changelog)
```markdown
# Dataset Changelog

## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels

## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines

## v1.0 (2023-12-01)
- Initial release (500 examples)
```
Stratified Sampling
Balance by Difficulty
Easy: 40%
Medium: 40%
Hard: 20%
Balance by Category
Geography: 25%
Science: 25%
History: 25%
Math: 25%
Include Edge Cases
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
Representative of Production
Sample from actual production queries
Ensures the dataset matches real usage
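The quotas above can be combined into a small stratified sampler, sketched here (hypothetical helper; exact per-stratum counts depend on rounding and pool sizes):

```python
import random

def stratified_sample(records, by, proportions, n, seed=0):
    """Sample n records, allocating by target proportions over strata.

    by: key function mapping a record to its stratum (e.g. difficulty).
    proportions: {stratum: fraction}, fractions summing to 1.
    """
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(by(r), []).append(r)
    sample = []
    for stratum, frac in proportions.items():
        pool = strata.get(stratum, [])
        k = min(round(n * frac), len(pool))  # cap at available examples
        sample.extend(rng.sample(pool, k))
    return sample
```

A fixed seed keeps the evaluation set reproducible across runs, which matters when comparing models over time.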
Synthetic Ground Truth
LLM-Generated Questions + Answers
```python
def generate_synthetic_qa(document):
    # `llm` and `parse_qa_pairs` are placeholders for your LLM client
    # and response parser.
    prompt = f"""
Document: {document}

Generate 5 question-answer pairs based on this document.
Format:
Q1: [question]
A1: [answer]
...
"""
    response = llm.generate(prompt)
    return parse_qa_pairs(response)
```
Careful Validation Needed
LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions

→ Always validate with humans
Good for Scale, Risky for Quality
Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation
Use for Augmentation, Not Sole Source
Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)
Domain-Specific Ground Truth
Medical: Expert Annotations
Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations
Legal: Lawyer Review
Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law
Technical: Engineer Verification
Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A
Ground Truth Storage
JSON/JSONL Files
```jsonl
{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}
```
Database (PostgreSQL, MongoDB)
```sql
CREATE TABLE ground_truth (
    id UUID PRIMARY KEY,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    category VARCHAR(50),
    difficulty VARCHAR(20),
    created_at TIMESTAMP DEFAULT NOW()
);
```
Version Control (Git)
```bash
git add dataset/
git commit -m "Update ground truth dataset"
git push
```
Cloud Storage (S3 + Versioning)
```bash
# Enable bucket versioning first, then upload to S3
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
```
Ground Truth for RAG
Structure:
```json
{
  "question": "What is the capital of France?",
  "expected_answer": "Paris",
  "relevant_document_chunks": [
    "Paris is the capital and largest city of France."
  ],
  "evaluation_criteria": {
    "faithfulness": "Answer must be grounded in context",
    "relevance": "Answer must directly address question",
    "completeness": "Answer should mention Paris"
  }
}
```
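Given records of this shape, a minimal retrieval-quality check could be sketched as follows (hypothetical helper names; assumes chunks are compared as exact strings, which is a simplification of real chunk matching):

```python
def retrieval_hit(retrieved_chunks, relevant_chunks):
    """True if any annotated relevant chunk appears among the retrieved chunks."""
    retrieved = {c.strip() for c in retrieved_chunks}
    return any(c.strip() in retrieved for c in relevant_chunks)

def hit_rate(examples, retrieve):
    """Fraction of ground-truth examples whose relevant chunk was retrieved.

    retrieve: function mapping a question to a list of chunk strings.
    """
    hits = sum(retrieval_hit(retrieve(ex["question"]), ex["relevant_document_chunks"])
               for ex in examples)
    return hits / len(examples)
```

This separates retrieval quality from answer quality: a low hit rate means the generator never saw the evidence, regardless of how good the final answer looks.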
Evaluation with Ground Truth
Exact Match Accuracy
```python
def exact_match(predicted, ground_truth):
    return predicted.strip().lower() == ground_truth.strip().lower()

predictions = ["Paris", "Lyon"]
ground_truths = ["Paris", "Paris"]
accuracy = sum(exact_match(p, gt)
               for p, gt in zip(predictions, ground_truths)) / len(predictions)
```
F1 Score (For Overlapping Spans)
```python
def f1_score(predicted, ground_truth):
    pred_tokens = set(predicted.lower().split())
    gt_tokens = set(ground_truth.lower().split())
    common = pred_tokens & gt_tokens
    if len(pred_tokens) == 0 or len(gt_tokens) == 0:
        return 0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gt_tokens)
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)
```
BLEU/ROUGE (For Generation)
```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]
bleu = sentence_bleu(reference, candidate)
```
Semantic Similarity (Embedding Distance)
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")
similarity = cosine_similarity([emb1], [emb2])[0][0]
```
Continuous Ground Truth
Production Feedback (User Thumbs Up/Down)
```python
# Log user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "user_feedback": "thumbs_up",
    "timestamp": "2024-01-15T10:00:00Z",
}

# Add to ground truth if positive
# (`add_to_ground_truth` is a placeholder for your dataset-update helper)
if feedback["user_feedback"] == "thumbs_up":
    add_to_ground_truth(feedback["question"], feedback["answer"])
```
Human Review of Flagged Outputs
User flags answer as incorrect
→ Human reviews
→ If incorrect, add correct answer to ground truth
→ If correct, keep as is
Incrementally Add to Dataset
Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version
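That incremental flow could be sketched as below (hypothetical helpers; real records would also carry metadata and annotation fields, and `next_id` stands in for your id generator):

```python
def merge_reviewed(dataset, reviewed, next_id):
    """Append human-approved feedback items to the dataset as new examples.

    reviewed: list of {"question", "correct_answer"} dicts that passed review.
    next_id: callable producing a fresh example id (assumed helper).
    Returns the number of examples actually added.
    """
    existing = {ex["input"]["text"].strip().lower() for ex in dataset}
    added = 0
    for item in reviewed:
        q = item["question"].strip().lower()
        if q in existing:
            continue  # already covered by an existing example
        dataset.append({
            "id": next_id(),
            "input": {"type": "question", "text": item["question"]},
            "expected_output": {"type": "answer", "text": item["correct_answer"]},
        })
        existing.add(q)
        added += 1
    return added
```

Checking against existing questions keeps repeated feedback on popular queries from inflating the dataset with duplicates.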
Tools
Annotation: Label Studio, Prodigy, CVAT
Label Studio:
```bash
pip install label-studio
label-studio start
# Open http://localhost:8080
```
Prodigy:
```bash
pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative
```
Management: DVC (Data Version Control)
```bash
pip install dvc
dvc init
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push
```
Storage: S3, GCS, Local Files
See "Ground Truth Storage" section
Summary
Ground Truth: Correct answers for evaluation
Why:
- Measure accuracy objectively
- Train/validate models
- Regression testing
- Benchmarking
Types:
- Exact match
- Multiple acceptable answers
- Rubric-based
- Human preference
Creating:
- Manual annotation
- Expert review
- Crowdsourcing
- Synthetic (with validation)
Quality Control:
- Multiple annotators
- Inter-annotator agreement (κ > 0.7)
- Gold standard subset
- Expert spot checks
Dataset Size:
- Eval: 100-1000 (representative)
- Test: 500-5000 (comprehensive)
- Quality > quantity
Maintenance:
- Version control (Git)
- Regular updates
- Remove outdated
- Changelog
Tools:
- Annotation: Label Studio, Prodigy
- Management: DVC
- Storage: S3, GCS, Git