Claude-skill-registry: create-inspect-task

Create custom inspect-ai evaluation tasks through an interactive, guided workflow.

To install from the registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

To copy only this skill into your local skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/create-inspect-task" ~/.claude/skills/majiayu000-claude-skill-registry-create-inspect-task && rm -rf "$T"
```

Source: skills/data/create-inspect-task/SKILL.md

Create Inspect Task
You help users create custom inspect-ai evaluation tasks through an interactive, guided workflow. Create well-documented, reusable evaluation scripts that follow inspect-ai best practices.
Your Task
Guide the user through designing and implementing a custom inspect-ai evaluation task. Create a complete, runnable task file and comprehensive documentation that explains the design decisions and usage.
Operating Modes
This skill supports two modes:
Mode 1: Experiment-Guided (Recommended)
When an `experiment_summary.yaml` file exists (created by the design-experiment skill), extract configuration to pre-populate:
- Dataset path and format
- Model information
- Evaluation objectives
- System prompts
- Common parameters
Usage: Run skill from experiment directory or provide path to experiment_summary.yaml
Mode 2: Standalone
Create evaluation tasks from scratch without experiment context. User provides all configuration manually.
Usage: Run skill when no experiment exists or when creating general-purpose evaluation tasks
Workflow
Initial Setup (Both Modes)
- Check for experiment context:
  - Look for `experiment_summary.yaml` in the current directory (see the sketch below)
  - If found, ask the user: "I found an experiment summary. Would you like me to use it to configure the evaluation task?"
- If the user says yes, proceed with Mode 1
- If no, or no summary is found, proceed with Mode 2
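A minimal sketch of this check (the branching and variable name are illustrative only, not part of the skill):

```python
from pathlib import Path

# Illustrative only: detect experiment context before choosing a mode
summary_path = Path.cwd() / "experiment_summary.yaml"
if summary_path.exists():
    # Mode 1 candidate: confirm with the user before using the experiment configuration
    use_experiment_config = True   # in practice, set from the user's answer
else:
    # Mode 2: standalone workflow, user provides all configuration manually
    use_experiment_config = False
```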
Mode 1: Experiment-Guided Workflow
- Read experiment_summary.yaml - Extract configuration
- Confirm extracted info - Show user what was found (dataset, models, etc.)
- Understand evaluation objective - What specific aspect to evaluate?
- Configure task-specific details - Solver chain, scorers (guided by experiment context)
- Add task parameters - Make the task flexible and reusable
- Generate code - Create the complete task file with experiment integration
- Create documentation - Write design documentation with experiment context
- Create log - Document all decisions in `create-inspect-task.log`
- Provide usage guidance - Show user how to run the task with their models
Mode 2: Standalone Workflow
- Understand the objective - What does the user want to evaluate?
- Configure dataset - Guide dataset format selection and loading
- Design solver chain - Build the solver pipeline (prompts, generation, etc.)
- Select scorers - Choose appropriate scoring mechanisms
- Add task parameters - Make the task flexible and reusable
- Generate code - Create the complete task file
- Create documentation - Write design documentation with rationale
- Create log - Document all decisions in `create-inspect-task.log`
- Provide usage guidance - Show user how to run the task
Extracting Information from experiment_summary.yaml (Mode 1)
When operating in experiment-guided mode, extract the following information from the YAML structure:
YAML Structure Overview
```yaml
experiment:
  name: string
  type: string
  question: string
data:
  training:
    path: string
    label: string
    format: string
    splits:
      train: int
      validation: int
      test: int
models:
  base:
    - name: string
      path: string
evaluation:
  system_prompt: string
  temperature: float
runs:
  - name: string
    type: string   # "fine-tuned" or "control"
    model: string
```
Extraction Algorithm
```python
import yaml
from pathlib import Path


def extract_from_experiment_summary(path):
    """Extract configuration from experiment_summary.yaml"""
    with open(path, 'r') as f:
        config = yaml.safe_load(f)

    # Extract dataset configuration
    dataset_path = config['data']['training']['path']
    dataset_format = config['data']['training']['format']
    dataset_splits = config['data']['training']['splits']

    # Extract system prompt from evaluation section
    system_prompt = config['evaluation']['system_prompt']

    # Extract research question
    research_question = config['experiment']['question']
    experiment_type = config['experiment']['type']

    # Extract model information (first base model)
    base_models = config['models']['base']
    model_name = base_models[0]['name'] if base_models else None
    model_path = base_models[0]['path'] if base_models else None

    # Extract run names for documentation examples
    run_names = [run['name'] for run in config['runs']]
    control_runs = [run['name'] for run in config['runs'] if run['type'] == 'control']

    return {
        'dataset_path': dataset_path,
        'dataset_format': dataset_format,
        'dataset_splits': dataset_splits,
        'system_prompt': system_prompt,
        'research_question': research_question,
        'experiment_type': experiment_type,
        'model_name': model_name,
        'model_path': model_path,
        'run_names': run_names,
        'control_runs': control_runs
    }
```
Key Fields to Extract
From the `experiment` section:
- `question` → Research question/objective (informs evaluation goal)
- `type` → Experiment type (helps understand what's being compared)

From the `data.training` section:
- `path` → Dataset path for evaluation
- `format` → Dataset format (json, parquet)
- `splits` → Sample counts (use test split for evaluation)

From the `models.base[]` section:
- `name` → Model identifier
- `path` → Full path to base model (for usage examples)

From the `evaluation` section:
- `system_prompt` → Use same prompt for consistency
- `temperature` → Default temperature setting

From the `runs[]` section:
- `name` → Run identifiers (for documentation)
- `type` → Filter for "control" runs that need evaluation
Presenting Extracted Information
After extraction, show the user what was found:
## Configuration Extracted from Experiment

I found the following configuration in your experiment:

**Dataset:**
- Path: `/scratch/gpfs/.../data/green/capitalization/words_4L_80P_300.json`
- Format: JSON
- Splits: train (240), test (60)

**Models:**
- Llama-3.2-1B-Instruct
- Path: `/scratch/gpfs/.../pretrained-llms/Llama-3.2-1B-Instruct`

**System Prompt:**
{extracted_prompt or "(none)"}

**Research Question:** {extracted_question}

I'll use this information to help configure your evaluation task. You can override any of these settings if needed.
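One way to build this message is directly from the dictionary returned by `extract_from_experiment_summary` above; the helper below is only a sketch (its name and exact formatting are not part of the skill):

```python
def format_extraction_summary(info: dict) -> str:
    """Build the user-facing summary from the extracted configuration (illustrative)."""
    lines = [
        "## Configuration Extracted from Experiment",
        "",
        "**Dataset:**",
        f"- Path: `{info['dataset_path']}`",
        f"- Format: {info['dataset_format']}",
        f"- Splits: {info['dataset_splits']}",
        "",
        "**Models:**",
        f"- {info['model_name']} ({info['model_path']})",
        "",
        "**System Prompt:**",
        info['system_prompt'] or "(none)",
        "",
        f"**Research Question:** {info['research_question']}",
    ]
    return "\n".join(lines)
```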
Validation
Check extracted information (see the sketch below):
- ✓ Dataset path exists (verify with `ls`)
- ✓ Dataset format is supported (.json, .parquet, .jsonl)
- ✓ Model path exists (verify with `ls`)
- ✓ System prompt is properly formatted (string, not list)
If validation fails:
- Warn user but continue
- Ask user to provide correct information
- Log validation failures
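A sketch of these checks, assuming the dictionary produced by `extract_from_experiment_summary` above (the helper name and warning messages are illustrative):

```python
from pathlib import Path

SUPPORTED_FORMATS = {".json", ".jsonl", ".parquet"}


def validate_extracted_config(info: dict) -> list[str]:
    """Return a list of warnings; an empty list means everything checked out."""
    warnings = []
    dataset = Path(info["dataset_path"])
    if not dataset.exists():
        warnings.append(f"Dataset path not found: {dataset}")
    if dataset.suffix not in SUPPORTED_FORMATS:
        warnings.append(f"Unsupported dataset format: {dataset.suffix}")
    if info.get("model_path") and not Path(info["model_path"]).exists():
        warnings.append(f"Model path not found: {info['model_path']}")
    if not isinstance(info.get("system_prompt", ""), str):
        warnings.append("System prompt should be a string, not a list")
    return warnings
```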
Logging
IMPORTANT: Create a detailed log file at `{task_directory}/create-inspect-task.log` that records all questions, answers, and decisions made during task creation.
Log Format
```
[YYYY-MM-DD HH:MM:SS] ACTION: Description
  Details: {specifics}
  Result: {outcome}
```
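A minimal helper that appends entries in this format might look like the following (the function name and argument split are assumptions, not part of the skill):

```python
from datetime import datetime


def log_entry(log_path: str, action: str, description: str, details: str, result: str) -> None:
    """Append one timestamped entry to create-inspect-task.log in the format above."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write(f"[{timestamp}] {action}: {description}\n")
        f.write(f"  Details: {details}\n")
        f.write(f"  Result: {result}\n\n")
```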
What to Log
- User's evaluation objective
- Dataset selection and configuration decisions
- Solver chain composition choices
- Scorer selection rationale
- Task parameter decisions
- File creation
- Any validation performed
Example Log Entries
Mode 1: Experiment-Guided
```
[2025-10-24 14:30:00] MODE_SELECTION: Experiment-guided mode
  Details: Found experiment_summary.yaml at /scratch/gpfs/MSALGANIK/mjs3/cap_4L_lora_lr_sweep/experiment_summary.yaml
  Result: User confirmed to use experiment configuration

[2025-10-24 14:30:05] EXTRACT_CONFIG: Reading experiment_summary.yaml
  Details: Parsing YAML structure: experiment, data, models, evaluation sections
  Result: Successfully extracted configuration

[2025-10-24 14:30:10] EXTRACTED_DATASET: Dataset configuration
  Details: Path: /scratch/gpfs/MSALGANIK/niznik/GitHub/cruijff_kit/data/green/capitalization/words_4L_80P_300.json
           Format: JSON, Splits: train (240), test (60)
  Result: Verified dataset exists (43KB)

[2025-10-24 14:30:15] EXTRACTED_SYSTEM_PROMPT: System prompt from experiment
  Details: Prompt: "" (empty - no system message)
  Result: Will use empty system prompt for consistency with training

[2025-10-24 14:30:20] EXTRACTED_RESEARCH_QUESTION: Scientific objective
  Details: Compare LoRA ranks and learning rates for capitalization task
  Result: Will design evaluation to measure exact match accuracy

[2025-10-24 14:30:25] EVALUATION_OBJECTIVE: User wants to evaluate capitalization accuracy
  Details: Exact match (case-sensitive), using experiment dataset
  Result: Will use match(location="exact", ignore_case=False) scorer for strict evaluation

[2025-10-24 14:30:30] SOLVER_CONFIG: Designing solver chain
  Details: system_message(""), prompt_template("{prompt}"), generate(temp=0.0)
  Result: Matches training configuration for consistency
```
Mode 2: Standalone
```
[2025-10-24 14:30:00] MODE_SELECTION: Standalone mode
  Details: No experiment_summary.yaml found
  Result: User will provide all configuration manually

[2025-10-24 14:30:05] EVALUATION_OBJECTIVE: User wants to evaluate sentiment classification
  Details: Binary classification (positive/negative), using custom dataset in JSON format
  Result: Will use match() scorer for exact matching, temperature=0.0 for consistency

[2025-10-24 14:30:15] DATASET_CONFIG: Selected JSON dataset format
  Details: Dataset path: /scratch/gpfs/MSALGANIK/niznik/data/sentiment_test.json
           Field mapping: input="text", target="sentiment"
  Result: Will use hf_dataset with json format and custom record_to_sample function
```
Questions to Ask
1. Evaluation Objective
What do you want to evaluate?
- Classification task? (sentiment, topic, entity type, etc.)
- Generation quality? (summarization, translation, etc.)
- Factual accuracy? (question answering, fact checking)
- Reasoning ability? (math, logic, chain-of-thought)
- Task-specific capability? (code generation, instruction following)
What defines a correct answer?
- Exact match with target?
- Contains specific information?
- Model-graded quality assessment?
- Multiple acceptable answers?
2. Dataset Configuration
What dataset format do you have?
- JSON file (`.json` or `.jsonl`)
- Parquet files (`.parquet`)
- HuggingFace dataset (specify dataset name)
- CSV file
- Custom format (will need conversion)
Where is the dataset located?
- Get full path to dataset
- Verify file exists if possible
- Check file size for sanity
What are the field names?
- Input field name (e.g., "question", "text", "prompt")
- Target/answer field name (e.g., "answer", "label", "output")
- Any metadata fields to preserve? (e.g., "category", "difficulty")
Dataset structure specifics:
- For JSON: Is it a single JSON file with nested structure or JSONL?
- For JSON with splits: Which field contains the test split?
- For Parquet: Is it a directory of parquet files?
- For HuggingFace: Dataset name and split to use?
Example questions (a quick inspection snippet follows below):
- "Does your JSON file have a structure like `{'train': [...], 'test': [...]}`?"
- "Is each line a separate JSON object (JSONL format)?"
- "Do you need to load from a specific split like 'test' or 'validation'?"
3. Solver Configuration
System message:
- Do you want to provide instructions to the model via system message?
- What role should the model play? (e.g., "You are a helpful assistant", "You are an expert classifier")
- Default: empty string (no system message)
Prompt template:
- Should we use the input directly or wrap it in a template?
- Do you need chain-of-thought prompting?
- Default: `"{prompt}"` (direct input)
Generation parameters:
- Temperature:
  - 0.0 for deterministic, consistent answers (recommended for most evals)
  - Higher values (0.7-1.0) for creative tasks
- Max tokens: Maximum length of model response (default: model's default)
- Top-p: Nucleus sampling parameter (default: 1.0)
Common solver patterns:
- Simple generation: `[system_message(""), prompt_template("{prompt}"), generate()]`
- Chain-of-thought: `[chain_of_thought(), generate()]`
- Multiple-choice: `[multiple_choice()]` (don't add a separate generate())
- Custom template: `[prompt_template("Answer: {prompt}\n"), generate()]`
4. Scorer Selection
Based on evaluation objective, suggest scorers:
For exact matching:
- `match()` - Target appears at beginning/end; ignores case, whitespace, punctuation
  - Options: `location="begin"/"end"/"any"`, `ignore_case=True/False`
- `exact()` - Precise matching after normalization
- `includes()` - Target appears anywhere in output
  - Options: `ignore_case=True/False`

For multiple choice:
- `choice()` - Works with the `multiple_choice()` solver
  - Returns letter of selected answer (A, B, C, D, etc.)

For pattern extraction:
- `pattern()` - Extract answer using regex
  - Requires regex pattern parameter

For model-graded evaluation:
- `model_graded_qa()` - Another model assesses answer quality
  - Options: `partial_credit=True/False`, custom `template`
- `model_graded_fact()` - Checks if specific facts appear
- Note: Requires an additional model, adds latency and cost

For numeric/F1 scoring:
- `f1()` - F1 score for text overlap

Multiple scorers:
- Can use a list: `[match(), includes()]` to get multiple scores
- Helpful for comparing scoring methods
5. Task Parameters
Should the task accept parameters for flexibility?
Common parameters to expose:
- `system_prompt` - Allow different system messages
- `temperature` - Enable temperature tuning
- `dataset_path` - Support different datasets
- `grader_model` - For model-graded scoring
- `config_dir` - (legacy) For runtime config reading; scaffold-inspect uses direct params instead
Benefits of parameters:
- Run variations without code changes
- Easier experimentation
- Better reusability
How to pass parameters:

```bash
inspect eval task.py -T param_name=value
```
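To make the mapping concrete, here is a minimal sketch of a parameterized task, assuming the dataset records already use `input`/`target` fields; the parameter names are illustrative and the `generate({...})` call mirrors the style used elsewhere in this document:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import chain, generate, prompt_template, system_message


@task
def my_task(
    system_prompt: str = "",               # override with -T system_prompt="..."
    temperature: float = 0.0,              # override with -T temperature=0.5
    dataset_path: str = "data/test.json",  # override with -T dataset_path=/other/data.json
) -> Task:
    """Each keyword argument can be overridden at the CLI with -T name=value."""
    return Task(
        dataset=json_dataset(dataset_path),  # assumes records with "input"/"target" fields
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate({"temperature": temperature}),
        ),
        scorer=match(),
    )
```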
6. Model Specification
How will the model be specified?
Option 1: CLI specification (most flexible)
- User provides model at runtime:

  ```bash
  inspect eval task.py --model hf/local -M model_path=/path/to/model
  ```

- Recommended for most cases

Option 2: Integration with fine-tuning config (legacy)
- Like the existing `cap_task` example
- Reads from `setup_finetune.yaml` at runtime via the `config_dir` parameter
- Note: scaffold-inspect now bakes values into SLURM instead of using this pattern
Option 3: Hard-coded in task
- Less flexible but simpler
- Can specify model inside task definition
- Better for benchmarking specific models
Output Files
Create two files:
1. Task Script: `{task_name}_task.py`

The complete, runnable inspect-ai task following best practices.
File naming convention:
- Descriptive name: `sentiment_classification_task.py`
- Include domain: `math_reasoning_task.py`
- Follow pattern: `{domain}_{type}_task.py`
Required components:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset, hf_dataset, FieldSpec
from inspect_ai.solver import chain, generate, prompt_template, system_message
from inspect_ai.scorer import match, includes


@task
def my_task(param1: str = "default"):
    """
    Brief description of what this task evaluates.

    Args:
        param1: Description of parameter

    Returns:
        Task: Configured inspect-ai task
    """
    # Dataset loading
    dataset = ...

    # Solver chain
    solver = chain(
        system_message("..."),
        prompt_template("{prompt}"),
        generate({"temperature": 0.0})
    )

    # Return task
    return Task(
        dataset=dataset,
        solver=solver,
        scorer=...
    )
```
Best practices to follow:
- Use type hints for parameters
- Include docstring explaining purpose
- Add comments explaining non-obvious choices
- Handle errors gracefully (try/except for file operations)
- Validate required parameters
- Use descriptive variable names
2. Design Documentation: `{task_name}_design.md`

Comprehensive documentation of design decisions.
Required sections:
# {Task Name} Evaluation Task

**Created:** {timestamp}
**Inspect-AI Version:** {version if known}

## Evaluation Objective

{What this task evaluates and why}

## Dataset Configuration

**Format:** {JSON/Parquet/HuggingFace/etc.}
**Location:** `{full_path_to_dataset}`
**Size:** {number of samples if known}

**Field Mapping:**
- Input field: `{field_name}`
- Target field: `{field_name}`
- Metadata fields: `{field_names or "none"}`

**Loading Method:** {Description of how dataset is loaded}

**Data Structure:** {Explanation of JSON structure, splits, etc.}

## Solver Chain

**Components:**
1. {Solver 1}: {Purpose}
2. {Solver 2}: {Purpose}
3. ...

**System Message:**
{system message text or "none"}

**Prompt Template:**
{template or "direct input"}

**Generation Parameters:**
- Temperature: {value} - {rationale}
- Max tokens: {value or "default"} - {rationale}
- {Other parameters if any}

**Rationale:** {Why this solver chain was chosen}

## Scorer Configuration

**Primary Scorer:** `{scorer_name}()`

**Options:**
- {option1}: {value} - {reason}
- {option2}: {value} - {reason}

**Additional Scorers:** {List if multiple scorers used, or "none"}

**Rationale:** {Why this scorer is appropriate for the task}

## Task Parameters

| Parameter | Type | Default | Purpose |
|-----------|------|---------|---------|
| {param1} | {type} | {default} | {description} |

**Parameter Usage:**

```bash
inspect eval {task_file}.py -T {param}={value}
```

## Model Specification

**Recommended usage:**

```bash
inspect eval {task_file}.py --model hf/local -M model_path=/path/to/model
```

{Any specific notes about model compatibility}

## Example Usage

**Basic evaluation:**

```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model
```

**With parameters:**

```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model -T temperature=0.5
```

**Evaluating fine-tuned model:** {if applicable}

```bash
cd /path/to/experiment/run/epoch_0
inspect eval {task_name}_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```

## Output Files

Inspect-ai will create:
- `logs/{task_name}_{timestamp}.eval` - Evaluation results log
- Console output with accuracy and metrics

## Expected Performance

{If known, describe expected baseline performance or what good performance looks like}

## Notes

{Any additional considerations, limitations, or future improvements}

## References

- Inspect-AI documentation: https://inspect.aisi.org.uk/
- {Any other relevant references}
## Code Generation Guidelines

### Dataset Loading Patterns

**JSON with nested splits:**

```python
from inspect_ai.dataset import hf_dataset, Sample


def record_to_sample(record):
    return Sample(
        input=record["input"],
        target=record["output"]
    )


dataset = hf_dataset(
    path="json",
    data_files="/path/to/data.json",
    field="test",    # Access the "test" split
    split="train",   # Don't get confused - this refers to the top-level split
    sample_fields=record_to_sample
)
```
**JSONL (one JSON object per line):**

```python
from inspect_ai.dataset import json_dataset, Sample


def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"]
    )


dataset = json_dataset(
    "/path/to/data.jsonl",
    record_to_sample
)
```
**Parquet directory:**

```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="parquet",
    data_dir="/path/to/parquet_dir",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)
```
**HuggingFace dataset:**

```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="username/dataset-name",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer",
        metadata=["category", "difficulty"]   # Preserve metadata
    )
)
```
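The dataset questions above also list CSV as an option; inspect-ai provides a `csv_dataset` loader with the same field-mapping approach. A sketch (verify the exact signature against your installed inspect-ai version; the column names are placeholders):

```python
from inspect_ai.dataset import csv_dataset, FieldSpec

# CSV file with "question" and "answer" columns (column names are placeholders)
dataset = csv_dataset(
    "/path/to/data.csv",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)
```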
### Solver Chain Patterns

**Simple generation:**

```python
from inspect_ai.solver import chain, generate, prompt_template, system_message

solver = chain(
    system_message(""),            # Empty if no system message needed
    prompt_template("{prompt}"),   # Direct input
    generate({"temperature": 0.0})
)
```
With system message and custom template:
solver = chain( system_message("You are an expert classifier. Respond with only the category label."), prompt_template("Text: {prompt}\n\nCategory:"), generate({"temperature": 0.0, "max_tokens": 50}) )
Chain-of-thought:
from inspect_ai.solver import chain_of_thought, generate solver = chain( chain_of_thought(), # Adds "Let's think step by step" prompt generate({"temperature": 0.0}) )
Multiple choice:
from inspect_ai.solver import multiple_choice solver = multiple_choice() # Don't add generate() separately # Or with chain-of-thought: solver = multiple_choice(cot=True)
### Scorer Patterns

**Exact matching (case-insensitive):**

```python
from inspect_ai.scorer import match

scorer = match()   # Default: ignore case, whitespace, punctuation

# Or customize:
scorer = match(location="exact", ignore_case=False)
```
Substring matching:
from inspect_ai.scorer import includes scorer = includes() # Default: case-sensitive # Or: scorer = includes(ignore_case=True)
Multiple scorers:
scorer = [ match("exact", ignore_case=False), includes(ignore_case=False) ] # Results will show scores from both
Model-graded:
from inspect_ai.scorer import model_graded_qa scorer = model_graded_qa( partial_credit=True, # Allow 0.5 scores model="openai/gpt-4o" # Specify grading model )
Integration with Fine-Tuning Workflow
Experiment-Guided Task Creation (Recommended)
When creating tasks for an experiment:
- Run from the experiment directory:

  ```bash
  cd /scratch/gpfs/MSALGANIK/mjs3/my_experiment/
  # Invoke create-inspect-task skill
  ```

- Skill automatically extracts from experiment_summary.yaml:
  - Dataset path and format
  - System prompt (ensures eval matches training)
  - Model information
  - Research objectives

- Task parameter modes:
  - Direct parameters (preferred): `data_path`, `prompt`, `system_prompt` passed via `-T` flags. scaffold-inspect bakes these into SLURM scripts at scaffolding time.
  - config_dir mode (legacy): Reads from `setup_finetune.yaml` at runtime. Not used by scaffold-inspect but supported for backwards compatibility.
Generated Task Pattern
For tasks integrated with experiments:
```python
from typing import Optional
from pathlib import Path

import yaml

from inspect_ai import Task, task
from inspect_ai.solver import chain, generate, prompt_template, system_message


@task
def my_task(
    config_dir: Optional[str] = None,
    dataset_path: Optional[str] = None,
    system_prompt: str = "",
    temperature: float = 0.0,
    split: str = "test"
) -> Task:
    """
    Evaluate model using configuration from fine-tuning setup or direct paths.

    Args:
        config_dir: Path to epoch directory (contains ../setup_finetune.yaml).
            If provided, reads dataset path and system prompt from config.
        dataset_path: Direct path to dataset JSON file. Used if config_dir not provided.
        system_prompt: System message for the model. Overrides config if both provided.
        temperature: Generation temperature (default: 0.0 for deterministic output).
        split: Which data split to use (default: "test").

    Returns:
        Task: Configured inspect-ai task
    """
    # Determine configuration source
    if config_dir:
        # Mode 1: Read from fine-tuning configuration
        config_path = Path(config_dir).parent / "setup_finetune.yaml"
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Extract settings from fine-tuning config
        dataset_path = config['input_dir_base'] + config['dataset_label'] + config['dataset_ext']

        # Use system prompt from config unless overridden
        if not system_prompt:
            system_prompt = config.get('system_prompt', '')
    elif dataset_path:
        # Mode 2: Direct dataset path
        # system_prompt and other params used as provided
        pass
    else:
        raise ValueError("Must provide either config_dir or dataset_path")

    # Load dataset
    dataset = ...  # Load using dataset_path

    return Task(
        dataset=dataset,
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate({"temperature": temperature})
        ),
        scorer=...
    )
```
Usage Examples
Evaluating fine-tuned model from experiment:
```bash
cd /path/to/experiment/run_dir/epoch_0
inspect eval /path/to/my_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```
Evaluating base model (control run):
```bash
inspect eval my_task.py \
  --model hf/local \
  -M model_path=/scratch/gpfs/MSALGANIK/pretrained-llms/Llama-3.2-1B-Instruct \
  -T dataset_path=/path/to/dataset.json
```
Integration with setup_inspect.py (Future)
This task pattern enables integration with the setup_inspect.py tool (when implemented):

```bash
python tools/inspect/setup_inspect.py --finetune_epoch_dir /path/to/experiment/run/epoch_0
```
Validation Before Completion
Common Validation (Both Modes)
Before finishing, verify (a rough automation sketch follows this checklist):
- ✓ Task file is syntactically correct Python
- ✓ All imports are present
- ✓ Task decorated with `@task`
- ✓ Dataset loading code matches format
- ✓ Solver chain follows inspect-ai patterns
- ✓ Scorer is appropriate for task
- ✓ Design documentation includes all sections
- ✓ Example usage commands are correct
- ✓ Log file documents all decisions
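These common checks can be partially automated with a rough helper like the one below (the function name is hypothetical and the string checks are only heuristics):

```python
import py_compile


def quick_check_task_file(path: str) -> None:
    """Heuristic pass over the common validation checklist (illustrative only)."""
    py_compile.compile(path, doraise=True)      # syntactically correct Python
    source = open(path, encoding="utf-8").read()
    assert "@task" in source, "No @task decorator found"
    assert "Task(" in source, "No Task(...) construction found"
```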
Mode 1 Specific Validation
Additional checks for experiment-guided mode:
- ✓ experiment_summary.yaml was successfully parsed
- ✓ Extracted dataset path exists and format matches
- ✓ System prompt matches training configuration
- ✓ Task supports both `config_dir` and `dataset_path` parameters
- ✓ Documentation includes experiment context (research question, runs)
- ✓ Usage examples show both fine-tuned and base model evaluation
- ✓ Log includes extraction details and validation results
Next Steps After Creation
After creating the task, guide user:
- Test the task:

  ```bash
  # Validate syntax
  python -m py_compile {task_file}.py

  # Test with small sample
  inspect eval {task_file}.py --model {model} --limit 5
  ```

- Run full evaluation:

  ```bash
  inspect eval {task_file}.py --model {model}
  ```

- View results:

  ```bash
  inspect view   # Opens web UI to browse evaluation logs
  ```

- Iterate if needed:
  - Adjust scorer settings
  - Modify prompts
  - Change generation parameters
  - Use `inspect score` to re-score without re-running
Important Notes
General Best Practices
- Follow inspect-ai best practices from https://inspect.aisi.org.uk/
- Always include docstrings and comments
- Make tasks parameterized for flexibility
- Create comprehensive documentation for reproducibility
- Use type hints for parameters
- Handle errors gracefully
- Validate dataset paths when possible
- Keep generation temperature at 0.0 for consistency unless user needs creativity
- Prefer simple scorers (match, includes) over model-graded when possible
- Test with small samples first (`--limit 5`)
Experiment Integration
- Prefer Mode 1 (experiment-guided) when working with designed experiments
- Always check for experiment_summary.yaml before starting
- Extract and validate all configuration before proceeding
- System prompt consistency is critical - eval must match training
- Generated tasks should work for both fine-tuned and base models
- Include experiment context in documentation (research question, runs)
- Use the `config_dir` parameter pattern for experiment integration (legacy; direct parameters are preferred)
- Log all extraction and validation steps for reproducibility
Error Handling
If dataset file not found:
- Warn user but proceed with code generation
- Note in documentation that path should be verified
- Include validation suggestion in next steps
If unsure about dataset format:
- Ask for example record
- Offer to help convert to supported format
- Suggest user examine file structure
If scorer choice unclear:
- Recommend starting with simple scorers
- Suggest using multiple scorers for comparison
- Note that scorers can be changed later without re-running generation