Claude-skill-registry: create-inspect-task

Create custom inspect-ai evaluation tasks through an interactive, guided workflow.

To install from the registry:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

To copy only this skill into your local skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/create-inspect-task" ~/.claude/skills/majiayu000-claude-skill-registry-create-inspect-task && rm -rf "$T"
```

Source: skills/data/create-inspect-task/SKILL.md

Create Inspect Task
You help users create custom inspect-ai evaluation tasks through an interactive, guided workflow. Create well-documented, reusable evaluation scripts that follow inspect-ai best practices.
Your Task
Guide the user through designing and implementing a custom inspect-ai evaluation task. Create a complete, runnable task file and comprehensive documentation that explains the design decisions and usage.
Operating Modes
This skill supports two modes:
Mode 1: Experiment-Guided (Recommended)
When an `experiment_summary.yaml` file exists (created by the design-experiment skill), extract configuration to pre-populate:
- Dataset path and format
- Model information
- Evaluation objectives
- System prompts
- Common parameters
Usage: Run skill from experiment directory or provide path to experiment_summary.yaml
Mode 2: Standalone
Create evaluation tasks from scratch without experiment context. User provides all configuration manually.
Usage: Run skill when no experiment exists or when creating general-purpose evaluation tasks
Workflow
Initial Setup (Both Modes)
- Check for experiment context:
  - Look for `experiment_summary.yaml` in the current directory (see the sketch below)
  - If found, ask the user: "I found an experiment summary. Would you like me to use it to configure the evaluation task?"
- If the user says yes, proceed with Mode 1
- If no, or no summary is found, proceed with Mode 2
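A minimal sketch of this check (the branching and variable name are illustrative only, not part of the skill):

```python
from pathlib import Path

# Illustrative only: detect experiment context before choosing a mode
summary_path = Path.cwd() / "experiment_summary.yaml"
if summary_path.exists():
    # Mode 1 candidate: confirm with the user before using the experiment configuration
    use_experiment_config = True   # in practice, set from the user's answer
else:
    # Mode 2: standalone workflow, user provides all configuration manually
    use_experiment_config = False
```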
Mode 1: Experiment-Guided Workflow
- Read experiment_summary.yaml - Extract configuration
- Confirm extracted info - Show user what was found (dataset, models, etc.)
- Understand evaluation objective - What specific aspect to evaluate?
- Configure task-specific details - Solver chain, scorers (guided by experiment context)
- Add task parameters - Make the task flexible and reusable
- Generate code - Create the complete task file with experiment integration
- Create documentation - Write design documentation with experiment context
- Create log - Document all decisions in `create-inspect-task.log`
- Provide usage guidance - Show user how to run the task with their models
Mode 2: Standalone Workflow
- Understand the objective - What does the user want to evaluate?
- Configure dataset - Guide dataset format selection and loading
- Design solver chain - Build the solver pipeline (prompts, generation, etc.)
- Select scorers - Choose appropriate scoring mechanisms
- Add task parameters - Make the task flexible and reusable
- Generate code - Create the complete task file
- Create documentation - Write design documentation with rationale
- Create log - Document all decisions in `create-inspect-task.log`
- Provide usage guidance - Show user how to run the task
Extracting Information from experiment_summary.yaml (Mode 1)
When operating in experiment-guided mode, extract the following information from the YAML structure:
YAML Structure Overview
```yaml
experiment:
  name: string
  type: string
  question: string
data:
  training:
    path: string
    label: string
    format: string
    splits:
      train: int
      validation: int
      test: int
models:
  base:
    - name: string
      path: string
evaluation:
  system_prompt: string
  temperature: float
runs:
  - name: string
    type: string   # "fine-tuned" or "control"
    model: string
```
Extraction Algorithm
```python
import yaml
from pathlib import Path


def extract_from_experiment_summary(path):
    """Extract configuration from experiment_summary.yaml"""
    with open(path, 'r') as f:
        config = yaml.safe_load(f)

    # Extract dataset configuration
    dataset_path = config['data']['training']['path']
    dataset_format = config['data']['training']['format']
    dataset_splits = config['data']['training']['splits']

    # Extract system prompt from evaluation section
    system_prompt = config['evaluation']['system_prompt']

    # Extract research question
    research_question = config['experiment']['question']
    experiment_type = config['experiment']['type']

    # Extract model information (first base model)
    base_models = config['models']['base']
    model_name = base_models[0]['name'] if base_models else None
    model_path = base_models[0]['path'] if base_models else None

    # Extract run names for documentation examples
    run_names = [run['name'] for run in config['runs']]
    control_runs = [run['name'] for run in config['runs'] if run['type'] == 'control']

    return {
        'dataset_path': dataset_path,
        'dataset_format': dataset_format,
        'dataset_splits': dataset_splits,
        'system_prompt': system_prompt,
        'research_question': research_question,
        'experiment_type': experiment_type,
        'model_name': model_name,
        'model_path': model_path,
        'run_names': run_names,
        'control_runs': control_runs
    }
```
Key Fields to Extract
From the `experiment` section:
- `question` → Research question/objective (informs evaluation goal)
- `type` → Experiment type (helps understand what's being compared)

From the `data.training` section:
- `path` → Dataset path for evaluation
- `format` → Dataset format (json, parquet)
- `splits` → Sample counts (use test split for evaluation)

From the `models.base[]` section:
- `name` → Model identifier
- `path` → Full path to base model (for usage examples)

From the `evaluation` section:
- `system_prompt` → Use same prompt for consistency
- `temperature` → Default temperature setting

From the `runs[]` section:
- `name` → Run identifiers (for documentation)
- `type` → Filter for "control" runs that need evaluation
Presenting Extracted Information
After extraction, show the user what was found:
## Configuration Extracted from Experiment

I found the following configuration in your experiment:

**Dataset:**
- Path: `/scratch/gpfs/.../data/green/capitalization/words_4L_80P_300.json`
- Format: JSON
- Splits: train (240), test (60)

**Models:**
- Llama-3.2-1B-Instruct
- Path: `/scratch/gpfs/.../pretrained-llms/Llama-3.2-1B-Instruct`

**System Prompt:**
{extracted_prompt or "(none)"}

**Research Question:** {extracted_question}

I'll use this information to help configure your evaluation task. You can override any of these settings if needed.
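One way to build this message is directly from the dictionary returned by `extract_from_experiment_summary` above; the helper below is only a sketch (its name and exact formatting are not part of the skill):

```python
def format_extraction_summary(info: dict) -> str:
    """Build the user-facing summary from the extracted configuration (illustrative)."""
    lines = [
        "## Configuration Extracted from Experiment",
        "",
        "**Dataset:**",
        f"- Path: `{info['dataset_path']}`",
        f"- Format: {info['dataset_format']}",
        f"- Splits: {info['dataset_splits']}",
        "",
        "**Models:**",
        f"- {info['model_name']} ({info['model_path']})",
        "",
        "**System Prompt:**",
        info['system_prompt'] or "(none)",
        "",
        f"**Research Question:** {info['research_question']}",
    ]
    return "\n".join(lines)
```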
Validation
Check extracted information (see the sketch below):
- ✓ Dataset path exists (verify with `ls`)
- ✓ Dataset format is supported (.json, .parquet, .jsonl)
- ✓ Model path exists (verify with `ls`)
- ✓ System prompt is properly formatted (string, not list)
If validation fails:
- Warn user but continue
- Ask user to provide correct information
- Log validation failures
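A sketch of these checks, assuming the dictionary produced by `extract_from_experiment_summary` above (the helper name and warning messages are illustrative):

```python
from pathlib import Path

SUPPORTED_FORMATS = {".json", ".jsonl", ".parquet"}


def validate_extracted_config(info: dict) -> list[str]:
    """Return a list of warnings; an empty list means everything checked out."""
    warnings = []
    dataset = Path(info["dataset_path"])
    if not dataset.exists():
        warnings.append(f"Dataset path not found: {dataset}")
    if dataset.suffix not in SUPPORTED_FORMATS:
        warnings.append(f"Unsupported dataset format: {dataset.suffix}")
    if info.get("model_path") and not Path(info["model_path"]).exists():
        warnings.append(f"Model path not found: {info['model_path']}")
    if not isinstance(info.get("system_prompt", ""), str):
        warnings.append("System prompt should be a string, not a list")
    return warnings
```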
Logging
IMPORTANT: Create a detailed log file at `{task_directory}/create-inspect-task.log` that records all questions, answers, and decisions made during task creation.
Log Format
```
[YYYY-MM-DD HH:MM:SS] ACTION: Description
  Details: {specifics}
  Result: {outcome}
```
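A minimal helper that appends entries in this format might look like the following (the function name and argument split are assumptions, not part of the skill):

```python
from datetime import datetime


def log_entry(log_path: str, action: str, description: str, details: str, result: str) -> None:
    """Append one timestamped entry to create-inspect-task.log in the format above."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write(f"[{timestamp}] {action}: {description}\n")
        f.write(f"  Details: {details}\n")
        f.write(f"  Result: {result}\n\n")
```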
What to Log
- User's evaluation objective
- Dataset selection and configuration decisions
- Solver chain composition choices
- Scorer selection rationale
- Task parameter decisions
- File creation
- Any validation performed
Example Log Entries
Mode 1: Experiment-Guided
```
[2025-10-24 14:30:00] MODE_SELECTION: Experiment-guided mode
  Details: Found experiment_summary.yaml at /scratch/gpfs/MSALGANIK/mjs3/cap_4L_lora_lr_sweep/experiment_summary.yaml
  Result: User confirmed to use experiment configuration

[2025-10-24 14:30:05] EXTRACT_CONFIG: Reading experiment_summary.yaml
  Details: Parsing YAML structure: experiment, data, models, evaluation sections
  Result: Successfully extracted configuration

[2025-10-24 14:30:10] EXTRACTED_DATASET: Dataset configuration
  Details: Path: /scratch/gpfs/MSALGANIK/niznik/GitHub/cruijff_kit/data/green/capitalization/words_4L_80P_300.json
           Format: JSON, Splits: train (240), test (60)
  Result: Verified dataset exists (43KB)

[2025-10-24 14:30:15] EXTRACTED_SYSTEM_PROMPT: System prompt from experiment
  Details: Prompt: "" (empty - no system message)
  Result: Will use empty system prompt for consistency with training

[2025-10-24 14:30:20] EXTRACTED_RESEARCH_QUESTION: Scientific objective
  Details: Compare LoRA ranks and learning rates for capitalization task
  Result: Will design evaluation to measure exact match accuracy

[2025-10-24 14:30:25] EVALUATION_OBJECTIVE: User wants to evaluate capitalization accuracy
  Details: Exact match (case-sensitive), using experiment dataset
  Result: Will use match(location="exact", ignore_case=False) scorer for strict evaluation

[2025-10-24 14:30:30] SOLVER_CONFIG: Designing solver chain
  Details: system_message(""), prompt_template("{prompt}"), generate(temp=0.0)
  Result: Matches training configuration for consistency
```
Mode 2: Standalone
```
[2025-10-24 14:30:00] MODE_SELECTION: Standalone mode
  Details: No experiment_summary.yaml found
  Result: User will provide all configuration manually

[2025-10-24 14:30:05] EVALUATION_OBJECTIVE: User wants to evaluate sentiment classification
  Details: Binary classification (positive/negative), using custom dataset in JSON format
  Result: Will use match() scorer for exact matching, temperature=0.0 for consistency

[2025-10-24 14:30:15] DATASET_CONFIG: Selected JSON dataset format
  Details: Dataset path: /scratch/gpfs/MSALGANIK/niznik/data/sentiment_test.json
           Field mapping: input="text", target="sentiment"
  Result: Will use hf_dataset with json format and custom record_to_sample function
```
Questions to Ask
1. Evaluation Objective
What do you want to evaluate?
- Classification task? (sentiment, topic, entity type, etc.)
- Generation quality? (summarization, translation, etc.)
- Factual accuracy? (question answering, fact checking)
- Reasoning ability? (math, logic, chain-of-thought)
- Task-specific capability? (code generation, instruction following)
What defines a correct answer?
- Exact match with target?
- Contains specific information?
- Model-graded quality assessment?
- Multiple acceptable answers?
2. Dataset Configuration
What dataset format do you have?
- JSON file (`.json` or `.jsonl`)
- Parquet files (`.parquet`)
- HuggingFace dataset (specify dataset name)
- CSV file
- Custom format (will need conversion)
Where is the dataset located?
- Get full path to dataset
- Verify file exists if possible
- Check file size for sanity
What are the field names?
- Input field name (e.g., "question", "text", "prompt")
- Target/answer field name (e.g., "answer", "label", "output")
- Any metadata fields to preserve? (e.g., "category", "difficulty")
Dataset structure specifics:
- For JSON: Is it a single JSON file with nested structure or JSONL?
- For JSON with splits: Which field contains the test split?
- For Parquet: Is it a directory of parquet files?
- For HuggingFace: Dataset name and split to use?
Example questions (a quick inspection snippet follows below):
- "Does your JSON file have a structure like `{'train': [...], 'test': [...]}`?"
- "Is each line a separate JSON object (JSONL format)?"
- "Do you need to load from a specific split like 'test' or 'validation'?"
3. Solver Configuration
System message:
- Do you want to provide instructions to the model via system message?
- What role should the model play? (e.g., "You are a helpful assistant", "You are an expert classifier")
- Default: empty string (no system message)
Prompt template:
- Should we use the input directly or wrap it in a template?
- Do you need chain-of-thought prompting?
- Default: `"{prompt}"` (direct input)
Generation parameters:
- Temperature:
  - 0.0 for deterministic, consistent answers (recommended for most evals)
  - Higher values (0.7-1.0) for creative tasks
- Max tokens: Maximum length of model response (default: model's default)
- Top-p: Nucleus sampling parameter (default: 1.0)
Common solver patterns:
- Simple generation: `[system_message(""), prompt_template("{prompt}"), generate()]`
- Chain-of-thought: `[chain_of_thought(), generate()]`
- Multiple-choice: `[multiple_choice()]` (don't add a separate generate())
- Custom template: `[prompt_template("Answer: {prompt}\n"), generate()]`
4. Scorer Selection
Based on evaluation objective, suggest scorers:
For exact matching:
- `match()` - Target appears at beginning/end; ignores case, whitespace, punctuation
  - Options: `location="begin"/"end"/"any"`, `ignore_case=True/False`
- `exact()` - Precise matching after normalization
- `includes()` - Target appears anywhere in output
  - Options: `ignore_case=True/False`

For multiple choice:
- `choice()` - Works with the `multiple_choice()` solver
  - Returns letter of selected answer (A, B, C, D, etc.)

For pattern extraction:
- `pattern()` - Extract answer using regex
  - Requires regex pattern parameter

For model-graded evaluation:
- `model_graded_qa()` - Another model assesses answer quality
  - Options: `partial_credit=True/False`, custom `template`
- `model_graded_fact()` - Checks if specific facts appear
- Note: Requires an additional model, adds latency and cost

For numeric/F1 scoring:
- `f1()` - F1 score for text overlap

Multiple scorers:
- Can use a list: `[match(), includes()]` to get multiple scores
- Helpful for comparing scoring methods
5. Task Parameters
Should the task accept parameters for flexibility?
Common parameters to expose:
- `system_prompt` - Allow different system messages
- `temperature` - Enable temperature tuning
- `dataset_path` - Support different datasets
- `grader_model` - For model-graded scoring
- `config_dir` - (legacy) For runtime config reading; scaffold-inspect uses direct params instead
Benefits of parameters:
- Run variations without code changes
- Easier experimentation
- Better reusability
How to pass parameters:

```bash
inspect eval task.py -T param_name=value
```
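To make the mapping concrete, here is a minimal sketch of a parameterized task, assuming the dataset records already use `input`/`target` fields; the parameter names are illustrative and the `generate({...})` call mirrors the style used elsewhere in this document:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import chain, generate, prompt_template, system_message


@task
def my_task(
    system_prompt: str = "",               # override with -T system_prompt="..."
    temperature: float = 0.0,              # override with -T temperature=0.5
    dataset_path: str = "data/test.json",  # override with -T dataset_path=/other/data.json
) -> Task:
    """Each keyword argument can be overridden at the CLI with -T name=value."""
    return Task(
        dataset=json_dataset(dataset_path),  # assumes records with "input"/"target" fields
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate({"temperature": temperature}),
        ),
        scorer=match(),
    )
```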
6. Model Specification
How will the model be specified?
Option 1: CLI specification (most flexible)
- User provides model at runtime:

  ```bash
  inspect eval task.py --model hf/local -M model_path=/path/to/model
  ```

- Recommended for most cases

Option 2: Integration with fine-tuning config (legacy)
- Like the existing `cap_task` example
- Reads from `setup_finetune.yaml` at runtime via the `config_dir` parameter
- Note: scaffold-inspect now bakes values into SLURM instead of using this pattern
Option 3: Hard-coded in task
- Less flexible but simpler
- Can specify model inside task definition
- Better for benchmarking specific models
Output Files
Create two files:
1. Task Script: `{task_name}_task.py`

The complete, runnable inspect-ai task following best practices.
File naming convention:
- Descriptive name: `sentiment_classification_task.py`
- Include domain: `math_reasoning_task.py`
- Follow pattern: `{domain}_{type}_task.py`
Required components:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset, hf_dataset, FieldSpec
from inspect_ai.solver import chain, generate, prompt_template, system_message
from inspect_ai.scorer import match, includes


@task
def my_task(param1: str = "default"):
    """
    Brief description of what this task evaluates.

    Args:
        param1: Description of parameter

    Returns:
        Task: Configured inspect-ai task
    """
    # Dataset loading
    dataset = ...

    # Solver chain
    solver = chain(
        system_message("..."),
        prompt_template("{prompt}"),
        generate({"temperature": 0.0})
    )

    # Return task
    return Task(
        dataset=dataset,
        solver=solver,
        scorer=...
    )
```
Best practices to follow:
- Use type hints for parameters
- Include docstring explaining purpose
- Add comments explaining non-obvious choices
- Handle errors gracefully (try/except for file operations)
- Validate required parameters
- Use descriptive variable names
2. Design Documentation: `{task_name}_design.md`

Comprehensive documentation of design decisions.
Required sections:
# {Task Name} Evaluation Task

**Created:** {timestamp}
**Inspect-AI Version:** {version if known}

## Evaluation Objective

{What this task evaluates and why}

## Dataset Configuration

**Format:** {JSON/Parquet/HuggingFace/etc.}
**Location:** `{full_path_to_dataset}`
**Size:** {number of samples if known}

**Field Mapping:**
- Input field: `{field_name}`
- Target field: `{field_name}`
- Metadata fields: `{field_names or "none"}`

**Loading Method:** {Description of how dataset is loaded}

**Data Structure:** {Explanation of JSON structure, splits, etc.}

## Solver Chain

**Components:**
1. {Solver 1}: {Purpose}
2. {Solver 2}: {Purpose}
3. ...

**System Message:**
{system message text or "none"}

**Prompt Template:**
{template or "direct input"}

**Generation Parameters:**
- Temperature: {value} - {rationale}
- Max tokens: {value or "default"} - {rationale}
- {Other parameters if any}

**Rationale:** {Why this solver chain was chosen}

## Scorer Configuration

**Primary Scorer:** `{scorer_name}()`

**Options:**
- {option1}: {value} - {reason}
- {option2}: {value} - {reason}

**Additional Scorers:** {List if multiple scorers used, or "none"}

**Rationale:** {Why this scorer is appropriate for the task}

## Task Parameters

| Parameter | Type | Default | Purpose |
|-----------|------|---------|---------|
| {param1} | {type} | {default} | {description} |

**Parameter Usage:**

```bash
inspect eval {task_file}.py -T {param}={value}
```

## Model Specification

**Recommended usage:**

```bash
inspect eval {task_file}.py --model hf/local -M model_path=/path/to/model
```

{Any specific notes about model compatibility}

## Example Usage

**Basic evaluation:**

```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model
```

**With parameters:**

```bash
inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model -T temperature=0.5
```

**Evaluating fine-tuned model:** {if applicable}

```bash
cd /path/to/experiment/run/epoch_0
inspect eval {task_name}_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```

## Output Files

Inspect-ai will create:
- `logs/{task_name}_{timestamp}.eval` - Evaluation results log
- Console output with accuracy and metrics

## Expected Performance

{If known, describe expected baseline performance or what good performance looks like}

## Notes

{Any additional considerations, limitations, or future improvements}

## References

- Inspect-AI documentation: https://inspect.aisi.org.uk/
- {Any other relevant references}
## Code Generation Guidelines

### Dataset Loading Patterns

**JSON with nested splits:**

```python
from inspect_ai.dataset import hf_dataset, Sample


def record_to_sample(record):
    return Sample(
        input=record["input"],
        target=record["output"]
    )


dataset = hf_dataset(
    path="json",
    data_files="/path/to/data.json",
    field="test",    # Access the "test" split
    split="train",   # Don't get confused - this refers to the top-level split
    sample_fields=record_to_sample
)
```
**JSONL (one JSON object per line):**

```python
from inspect_ai.dataset import json_dataset, Sample


def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"]
    )


dataset = json_dataset(
    "/path/to/data.jsonl",
    record_to_sample
)
```
**Parquet directory:**

```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="parquet",
    data_dir="/path/to/parquet_dir",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)
```
**HuggingFace dataset:**

```python
from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="username/dataset-name",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer",
        metadata=["category", "difficulty"]   # Preserve metadata
    )
)
```
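The dataset questions above also list CSV as an option; inspect-ai provides a `csv_dataset` loader with the same field-mapping approach. A sketch (verify the exact signature against your installed inspect-ai version; the column names are placeholders):

```python
from inspect_ai.dataset import csv_dataset, FieldSpec

# CSV file with "question" and "answer" columns (column names are placeholders)
dataset = csv_dataset(
    "/path/to/data.csv",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)
```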
### Solver Chain Patterns

**Simple generation:**

```python
from inspect_ai.solver import chain, generate, prompt_template, system_message

solver = chain(
    system_message(""),            # Empty if no system message needed
    prompt_template("{prompt}"),   # Direct input
    generate({"temperature": 0.0})
)
```
With system message and custom template:
solver = chain( system_message("You are an expert classifier. Respond with only the category label."), prompt_template("Text: {prompt}\n\nCategory:"), generate({"temperature": 0.0, "max_tokens": 50}) )
Chain-of-thought:
from inspect_ai.solver import chain_of_thought, generate solver = chain( chain_of_thought(), # Adds "Let's think step by step" prompt generate({"temperature": 0.0}) )
Multiple choice:
from inspect_ai.solver import multiple_choice solver = multiple_choice() # Don't add generate() separately # Or with chain-of-thought: solver = multiple_choice(cot=True)
### Scorer Patterns

**Exact matching (case-insensitive):**

```python
from inspect_ai.scorer import match

scorer = match()   # Default: ignore case, whitespace, punctuation

# Or customize:
scorer = match(location="exact", ignore_case=False)
```
Substring matching:
from inspect_ai.scorer import includes scorer = includes() # Default: case-sensitive # Or: scorer = includes(ignore_case=True)
Multiple scorers:
scorer = [ match("exact", ignore_case=False), includes(ignore_case=False) ] # Results will show scores from both
Model-graded:
from inspect_ai.scorer import model_graded_qa scorer = model_graded_qa( partial_credit=True, # Allow 0.5 scores model="openai/gpt-4o" # Specify grading model )
Integration with Fine-Tuning Workflow
Experiment-Guided Task Creation (Recommended)
When creating tasks for an experiment:
- Run from the experiment directory:

  ```bash
  cd /scratch/gpfs/MSALGANIK/mjs3/my_experiment/
  # Invoke create-inspect-task skill
  ```

- Skill automatically extracts from experiment_summary.yaml:
  - Dataset path and format
  - System prompt (ensures eval matches training)
  - Model information
  - Research objectives

- Task parameter modes:
  - Direct parameters (preferred): `data_path`, `prompt`, `system_prompt` passed via `-T` flags. scaffold-inspect bakes these into SLURM scripts at scaffolding time.
  - config_dir mode (legacy): Reads from `setup_finetune.yaml` at runtime. Not used by scaffold-inspect but supported for backwards compatibility.
Generated Task Pattern
For tasks integrated with experiments:
```python
from typing import Optional
from pathlib import Path

import yaml

from inspect_ai import Task, task
from inspect_ai.solver import chain, generate, prompt_template, system_message


@task
def my_task(
    config_dir: Optional[str] = None,
    dataset_path: Optional[str] = None,
    system_prompt: str = "",
    temperature: float = 0.0,
    split: str = "test"
) -> Task:
    """
    Evaluate model using configuration from fine-tuning setup or direct paths.

    Args:
        config_dir: Path to epoch directory (contains ../setup_finetune.yaml).
            If provided, reads dataset path and system prompt from config.
        dataset_path: Direct path to dataset JSON file. Used if config_dir not provided.
        system_prompt: System message for the model. Overrides config if both provided.
        temperature: Generation temperature (default: 0.0 for deterministic output).
        split: Which data split to use (default: "test").

    Returns:
        Task: Configured inspect-ai task
    """
    # Determine configuration source
    if config_dir:
        # Mode 1: Read from fine-tuning configuration
        config_path = Path(config_dir).parent / "setup_finetune.yaml"
        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Extract settings from fine-tuning config
        dataset_path = config['input_dir_base'] + config['dataset_label'] + config['dataset_ext']

        # Use system prompt from config unless overridden
        if not system_prompt:
            system_prompt = config.get('system_prompt', '')
    elif dataset_path:
        # Mode 2: Direct dataset path
        # system_prompt and other params used as provided
        pass
    else:
        raise ValueError("Must provide either config_dir or dataset_path")

    # Load dataset
    dataset = ...  # Load using dataset_path

    return Task(
        dataset=dataset,
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate({"temperature": temperature})
        ),
        scorer=...
    )
```
Usage Examples
Evaluating fine-tuned model from experiment:
```bash
cd /path/to/experiment/run_dir/epoch_0
inspect eval /path/to/my_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD
```
Evaluating base model (control run):
```bash
inspect eval my_task.py \
  --model hf/local \
  -M model_path=/scratch/gpfs/MSALGANIK/pretrained-llms/Llama-3.2-1B-Instruct \
  -T dataset_path=/path/to/dataset.json
```
Integration with setup_inspect.py (Future)
This task pattern enables integration with the setup_inspect.py tool (when implemented):

```bash
python tools/inspect/setup_inspect.py --finetune_epoch_dir /path/to/experiment/run/epoch_0
```
Validation Before Completion
Common Validation (Both Modes)
Before finishing, verify (a rough automation sketch follows this checklist):
- ✓ Task file is syntactically correct Python
- ✓ All imports are present
- ✓ Task decorated with `@task`
- ✓ Dataset loading code matches format
- ✓ Solver chain follows inspect-ai patterns
- ✓ Scorer is appropriate for task
- ✓ Design documentation includes all sections
- ✓ Example usage commands are correct
- ✓ Log file documents all decisions
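These common checks can be partially automated with a rough helper like the one below (the function name is hypothetical and the string checks are only heuristics):

```python
import py_compile


def quick_check_task_file(path: str) -> None:
    """Heuristic pass over the common validation checklist (illustrative only)."""
    py_compile.compile(path, doraise=True)      # syntactically correct Python
    source = open(path, encoding="utf-8").read()
    assert "@task" in source, "No @task decorator found"
    assert "Task(" in source, "No Task(...) construction found"
```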
Mode 1 Specific Validation
Additional checks for experiment-guided mode:
- ✓ experiment_summary.yaml was successfully parsed
- ✓ Extracted dataset path exists and format matches
- ✓ System prompt matches training configuration
- ✓ Task supports both `config_dir` and `dataset_path` parameters
- ✓ Documentation includes experiment context (research question, runs)
- ✓ Usage examples show both fine-tuned and base model evaluation
- ✓ Log includes extraction details and validation results
Next Steps After Creation
After creating the task, guide user:
- Test the task:

  ```bash
  # Validate syntax
  python -m py_compile {task_file}.py

  # Test with small sample
  inspect eval {task_file}.py --model {model} --limit 5
  ```

- Run full evaluation:

  ```bash
  inspect eval {task_file}.py --model {model}
  ```

- View results:

  ```bash
  inspect view   # Opens web UI to browse evaluation logs
  ```

- Iterate if needed:
  - Adjust scorer settings
  - Modify prompts
  - Change generation parameters
  - Use `inspect score` to re-score without re-running
Important Notes
General Best Practices
- Follow inspect-ai best practices from https://inspect.aisi.org.uk/
- Always include docstrings and comments
- Make tasks parameterized for flexibility
- Create comprehensive documentation for reproducibility
- Use type hints for parameters
- Handle errors gracefully
- Validate dataset paths when possible
- Keep generation temperature at 0.0 for consistency unless user needs creativity
- Prefer simple scorers (match, includes) over model-graded when possible
- Test with small samples first (`--limit 5`)
Experiment Integration
- Prefer Mode 1 (experiment-guided) when working with designed experiments
- Always check for experiment_summary.yaml before starting
- Extract and validate all configuration before proceeding
- System prompt consistency is critical - eval must match training
- Generated tasks should work for both fine-tuned and base models
- Include experiment context in documentation (research question, runs)
- Use the `config_dir` parameter pattern for experiment integration (legacy; direct parameters are preferred)
- Log all extraction and validation steps for reproducibility
Error Handling
If dataset file not found:
- Warn user but proceed with code generation
- Note in documentation that path should be verified
- Include validation suggestion in next steps
If unsure about dataset format:
- Ask for example record
- Offer to help convert to supported format
- Suggest user examine file structure
If scorer choice unclear:
- Recommend starting with simple scorers
- Suggest using multiple scorers for comparison
- Note that scorers can be changed later without re-running generation