Awesome-omni-skills hugging-face-evaluation-v2

Overview workflow skill. Use this skill when the user needs to add and manage evaluation results in Hugging Face model cards. It supports extracting eval tables from README content, importing scores from the Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. It works with the model-index metadata format. The operator should preserve the upstream workflow, copied support files, and provenance before merging or handing off.

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/hugging-face-evaluation-v2" ~/.claude/skills/diegosouzapw-awesome-omni-skills-hugging-face-evaluation-v2 && rm -rf "$T"

manifest: skills/hugging-face-evaluation-v2/SKILL.md

Overview

This public intake copy packages `plugins/antigravity-awesome-skills/skills/hugging-face-evaluation` from https://github.com/sickn33/antigravity-awesome-skills into the native Omni Skills editorial shape without hiding its origin.

Use it when the operator needs the upstream workflow, support files, and repository context to stay intact while the public validator and private enhancer continue their normal downstream flow.

This intake keeps the copied upstream files intact and uses `metadata.json` plus `ORIGIN.md` as the provenance anchor for review.

This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:

- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

Imported source sections that did not map cleanly to the public headings are still preserved below or in the support files. Notable imported sections: Integration with HF Ecosystem, Core Dependencies, Inference Provider Evaluation, vLLM Custom Model Evaluation (GPU required), ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones, 1. Inspect and Extract Evaluation Tables from README.

When to Use This Skill

Use this section as the trigger filter. It should make the activation boundary explicit before the operator loads files, runs commands, or opens a pull request.

You need to add structured evaluation results to a Hugging Face model card.
You want to import benchmark data or run custom evaluations with vLLM, lighteval, or inspect-ai.
You are preparing leaderboard-compatible model-index metadata for a model release.
Use when the request clearly matches the imported source intent: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with....
Use when the operator should preserve upstream workflow detail instead of rewriting the process from scratch.
Use when provenance needs to stay visible in the answer, PR, or review packet.

Operating Table

Situation	Start here	Why it matters
First-time use	`metadata.json`	Confirms repository, branch, commit, and imported path before touching the copied workflow
Provenance review	`ORIGIN.md`	Gives reviewers a plain-language audit trail for the imported source
Workflow execution	`SKILL.md`	Starts with the smallest copied file that materially changes execution
Supporting context	`SKILL.md`	Adds the next most relevant copied source file without loading the entire package
Handoff decision	`## Related Skills`	Helps the operator switch to a stronger native skill when the task drifts

Workflow

This workflow is intentionally editorial and operational at the same time. It keeps the imported source useful to the operator while still satisfying the public intake standards that feed the downstream enhancer flow.

Confirm the user goal, the scope of the imported workflow, and whether this skill is still the right router for the task.
Read the overview and provenance files before loading any copied upstream support files.
Load only the references, examples, prompts, or scripts that materially change the outcome for the current request.
Execute the upstream workflow while keeping provenance and source boundaries explicit in the working notes.
Validate the result against the upstream expectations and the evidence you can point to in the copied files.
Escalate or hand off to a related skill when the work moves out of this imported workflow's center of gravity.
Before merge or closure, record what was used, what changed, and what the reviewer still needs to verify.

Imported Workflow Notes

Imported: Integration with HF Ecosystem

Model Cards: Updates model-index metadata for leaderboard integration
Artificial Analysis: Direct API integration for benchmark imports
Papers with Code: Compatible with their model-index specification
Jobs: Run evaluations directly on Hugging Face Jobs with `uv` integration
vLLM: Efficient GPU inference for custom model evaluation
lighteval: HuggingFace's evaluation library with vLLM/accelerate backends
inspect-ai: UK AI Safety Institute's evaluation framework

Version

1.3.0

Dependencies

Examples

Example 1: Ask for the upstream workflow directly

Use @hugging-face-evaluation-v2 to handle <task>. Start from the copied upstream workflow, load only the files that change the outcome, and keep provenance visible in the answer.

Explanation: This is the safest starting point when the operator needs the imported workflow, but not the entire repository.

Example 2: Ask for a provenance-grounded review

Review @hugging-face-evaluation-v2 against metadata.json and ORIGIN.md, then explain which copied upstream files you would load first and why.

Explanation: Use this before review or troubleshooting when you need a precise, auditable explanation of origin and file selection.

Example 3: Narrow the copied support files before execution

Use @hugging-face-evaluation-v2 for <task>. Load only the copied references, examples, or scripts that change the outcome, and name the files explicitly before proceeding.

Explanation: This keeps the skill aligned with progressive disclosure instead of loading the whole copied package by default.

Example 4: Build a reviewer packet

Review @hugging-face-evaluation-v2 using the copied upstream files plus provenance, then summarize any gaps before merge.

Explanation: This is useful when the PR is waiting for human review and you want a repeatable audit packet.

Best Practices

Treat the generated public skill as a reviewable packaging layer around the upstream repository. The goal is to keep provenance explicit and load only the copied source material that materially improves execution.

Keep the imported skill grounded in the upstream repository; do not invent steps that the source material cannot support.
Prefer the smallest useful set of support files so the workflow stays auditable and fast to review.
Keep provenance, source commit, and imported file paths visible in notes and PR descriptions.
Point directly at the copied upstream files that justify the workflow instead of relying on generic review boilerplate.
Treat generated examples as scaffolding; adapt them to the concrete task before execution.
Route to a stronger native skill when architecture, debugging, design, or security concerns become dominant.

Troubleshooting

Problem: The operator skipped the imported context and answered too generically

Symptoms: The result ignores the upstream workflow in `plugins/antigravity-awesome-skills/skills/hugging-face-evaluation`, fails to mention provenance, or does not use any copied source files at all.

Solution: Re-open `metadata.json`, `ORIGIN.md`, and the most relevant copied upstream files. Load only the files that materially change the answer, then restate the provenance before continuing.

Problem: The imported workflow feels incomplete during review

Symptoms: Reviewers can see the generated `SKILL.md`, but they cannot quickly tell which references, examples, or scripts matter for the current task.

Solution: Point at the exact copied references, examples, scripts, or assets that justify the path you took. If the gap is still real, record it in the PR instead of hiding it.

Problem: The task drifted into a different specialization

Symptoms: The imported skill starts in the right place, but the work turns into debugging, architecture, design, security, or release orchestration that a native skill handles better.

Solution: Use the related skills section to hand off deliberately. Keep the imported provenance visible so the next skill inherits the right context instead of starting blind.

Related Skills

`@grafana-dashboards-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.
`@graphql-architect-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.
`@graphql-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.
`@growth-engine-v2` - Use when the work is better handled by that native specialization after this imported skill establishes context.

Additional Resources

Use this support matrix and the linked files below as the operator packet for this imported skill. They should reflect real copied source material, not generic scaffolding.

Resource family	What it gives the reviewer	Example path
`references`	copied reference notes, guides, or background material from upstream	`references/n/a`
`examples`	worked examples or reusable prompts copied from upstream	`examples/n/a`
`scripts`	upstream helper scripts that change execution or validation	`scripts/n/a`
`agents`	routing or delegation notes that are genuinely part of the imported package	`agents/n/a`
`assets`	supporting assets or schemas copied from the source package	`assets/n/a`

Imported Reference Notes

Imported: 3. Model-Index Management

YAML Generation: Create properly formatted model-index entries
Merge Support: Add evaluations to existing model cards without overwriting
Validation: Ensure compliance with Papers with Code specification
Batch Operations: Process multiple models efficiently
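
The "Merge Support" item above can be illustrated with a small sketch. The source does not show how the script decides that two results are the same, so the key used here (task type plus dataset type and name) and the helper names `result_key` and `merge_results` are assumptions made for illustration only.

```python
from copy import deepcopy


def result_key(result):
    """Identify a result by task type and dataset type/name (assumed key)."""
    task = result.get("task", {})
    dataset = result.get("dataset", {})
    return (task.get("type"), dataset.get("type"), dataset.get("name"))


def merge_results(existing, new):
    """Return the existing results plus any new results not already present."""
    merged = deepcopy(existing)
    seen = {result_key(r) for r in merged}
    for result in new:
        key = result_key(result)
        if key not in seen:  # never overwrite an entry that is already there
            merged.append(deepcopy(result))
            seen.add(key)
    return merged


existing = [{
    "task": {"type": "text-generation"},
    "dataset": {"name": "MMLU", "type": "mmlu"},
    "metrics": [{"name": "MMLU", "type": "mmlu", "value": 85.2}],
}]
new = [
    {   # same key as the existing entry: kept out of the merge
        "task": {"type": "text-generation"},
        "dataset": {"name": "MMLU", "type": "mmlu"},
        "metrics": [{"name": "MMLU", "type": "mmlu", "value": 1.0}],
    },
    {   # new dataset: appended
        "task": {"type": "text-generation"},
        "dataset": {"name": "GSM8K", "type": "gsm8k"},
        "metrics": [{"name": "GSM8K", "type": "gsm8k", "value": 72.5}],
    },
]
merged = merge_results(existing, new)
```

A stricter policy (for example, comparing metric names as well) may be closer to what the upstream script does; check `scripts/evaluation_manager.py` before relying on this behavior.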

Imported: Core Dependencies

huggingface_hub>=0.26.0
markdown-it-py>=3.0.0
python-dotenv>=1.2.1
pyyaml>=6.0.3
requests>=2.32.5
re (built-in)

Imported: Inference Provider Evaluation

inspect-ai>=0.3.0
inspect-evals
openai

Imported: vLLM Custom Model Evaluation (GPU required)

lighteval[accelerate,vllm]>=0.6.0
vllm>=0.4.0
torch>=2.0.0
transformers>=4.40.0
accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.

IMPORTANT: Using This Skill

Imported: ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

If open PRs exist:

DO NOT create a new PR - this creates duplicate work for maintainers
Warn the user that open PRs already exist
Show the user the existing PR URLs so they can review them
Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.
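
The rule above can be sketched as a small gate in an automation script. The helper names `build_get_prs_command` and `may_create_pr` are hypothetical, and the sketch assumes the caller has already collected the open PR URLs from the `get-prs` output (its exact output format is not specified in the source).

```python
def build_get_prs_command(repo_id):
    """Build the argv for the documented `get-prs` check."""
    return [
        "uv", "run", "scripts/evaluation_manager.py",
        "get-prs", "--repo-id", repo_id,
    ]


def may_create_pr(open_pr_urls, user_confirmed=False):
    """Allow a new PR only if none are open or the user explicitly confirmed."""
    return not open_pr_urls or user_confirmed


command = build_get_prs_command("username/model-name")
# Placeholder URL for illustration only.
open_prs = ["https://huggingface.co/username/model-name/discussions/1"]
blocked = not may_create_pr(open_prs)
```

Passing `command` to `subprocess.run` would execute the check; the gate itself stays a pure function so the confirmation step is easy to audit.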

All paths are relative to the directory containing this SKILL.md file. Before running any script, first `cd` to that directory or use the full path.

Use `--help` for the latest workflow guidance. Works with plain Python or `uv run`:

uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help

Key workflow (matches CLI help):

`get-prs` → check for existing open PRs first
`inspect-tables` → find table numbers/columns
`extract-readme --table N` → prints YAML by default
add `--apply` (push) or `--create-pr` to write changes

Core Capabilities

Imported: 1. Inspect and Extract Evaluation Tables from README

Inspect Tables: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
Parse Markdown Tables: Accurate parsing using markdown-it-py (ignores code blocks and examples)
Table Selection: Use `--table N` to extract from a specific table (required when multiple tables exist)
Format Detection: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
Column Matching: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
YAML Generation: Convert selected table to model-index YAML format
Task Typing: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)

Imported: 2. Import from Artificial Analysis

API Integration: Fetch benchmark scores directly from Artificial Analysis
Automatic Formatting: Convert API responses to model-index format
Metadata Preservation: Maintain source attribution and URLs
PR Creation: Automatically create pull requests with evaluation updates

Imported: 4. Run Evaluations on HF Jobs (Inference Providers)

Inspect-AI Integration: Run standard evaluations using the `inspect-ai` library
UV Integration: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
Zero-Config: No Dockerfiles or Space management required
Hardware Selection: Configure CPU or GPU hardware for the evaluation job
Secure Execution: Handles API tokens safely via secrets passed through the CLI

Imported: 5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ Important: This approach is only possible on devices with `uv` installed and sufficient GPU memory.

Benefits: No need to use the `hf_jobs()` MCP tool; scripts can be run directly in the terminal.

When to use: The user is working directly on a local device and a GPU is available.

Before running the script

Check the script path
Check that uv is installed
Check that a GPU is available with `nvidia-smi`

Running the script

uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

Features

vLLM Backend: High-performance GPU inference (5-10x faster than standard HF methods)
lighteval Framework: HuggingFace's evaluation library with Open LLM Leaderboard tasks
inspect-ai Framework: UK AI Safety Institute's evaluation library
Standalone or Jobs: Run locally or submit to HF Jobs infrastructure

Usage Instructions

The skill includes Python scripts in `scripts/` to perform operations.

Prerequisites

Preferred: use `uv run` (PEP 723 header auto-installs deps)

Or install manually:

pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests

Set the `HF_TOKEN` environment variable to a token with write access
For Artificial Analysis: set the `AA_API_KEY` environment variable
`.env` is loaded automatically if `python-dotenv` is installed
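
A quick pre-flight check for the variables listed above might look like the following sketch. `missing_env` is a hypothetical helper, not part of the skill; it only inspects a mapping and does not load `.env` itself.

```python
import os


def missing_env(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the result does not depend on the shell.
example_env = {"HF_TOKEN": "placeholder-token"}
missing = missing_env(["HF_TOKEN", "AA_API_KEY"], example_env)
```

Calling `missing_env(["HF_TOKEN"])` with no mapping checks the real process environment, which is useful before any `--apply` or `--create-pr` run.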

Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):

# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply       # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr   # open a PR

Validation checklist:

YAML is printed by default; compare against the README table before applying.
Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
For transposed tables (models as rows), ensure only one row is extracted.

Method 2: Import from Artificial Analysis

Fetch benchmark scores from Artificial Analysis API and add them to a model card.

Basic Usage:

AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"

With Environment File:

# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"

Create Pull Request:

uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr

Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

Direct CLI Usage:

HF_TOKEN=$HF_TOKEN \
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "mmlu"

GPU Example (A10G):

HF_TOKEN=$HF_TOKEN \
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "gsm8k"

Python Helper (optional):

uv run scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
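
When the job is submitted from Python instead of a shell, the same command can be assembled as an argument list. This is a sketch only: `build_hf_jobs_command` is a hypothetical helper, the argument order mirrors the CLI examples in this document, and it uses the `--secrets` spelling that appears in the vLLM examples elsewhere in this document.

```python
import shlex


def build_hf_jobs_command(script, flavor, script_args, secrets=None):
    """Assemble `hf jobs uv run ...` argv; script arguments follow a bare `--`."""
    cmd = ["hf", "jobs", "uv", "run", script, "--flavor", flavor]
    for name, value in (secrets or {}).items():
        cmd += ["--secrets", f"{name}={value}"]
    cmd.append("--")  # everything after this is passed to the script itself
    cmd += list(script_args)
    return cmd


cmd = build_hf_jobs_command(
    "scripts/inspect_eval_uv.py",
    "cpu-basic",
    ["--model", "meta-llama/Llama-2-7b-hf", "--task", "mmlu"],
    secrets={"HF_TOKEN": "PLACEHOLDER_TOKEN"},  # placeholder, not a real token
)
printable = shlex.join(cmd)  # shell-quoted form for logging or review
```

Building a list avoids shell quoting issues; avoid logging `printable` when it contains a real token.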

Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are separate from inference provider scripts and run models locally on the job's hardware.

When to Use vLLM Evaluation (vs Inference Providers)

Feature	vLLM Scripts	Inference Provider Scripts
Model access	Any HF model	Models with API endpoints
Hardware	Your GPU (or HF Jobs GPU)	Provider's infrastructure
Cost	HF Jobs compute cost	API usage fees
Speed	vLLM optimized	Depends on provider
Offline	Yes (after download)	No

Option A: lighteval with vLLM Backend

lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.

Standalone (local GPU):

# Run MMLU 5-shot with vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
uv run scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template

Via HF Jobs:

hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --tasks "leaderboard|mmlu|5"

lighteval Task Format: Tasks use the format `suite|task|num_fewshot`:

`leaderboard|mmlu|5` - MMLU with 5-shot
`leaderboard|gsm8k|5` - GSM8K with 5-shot
`lighteval|hellaswag|0` - HellaSwag zero-shot
`leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

Finding Available Tasks: The complete list of available lighteval tasks can be found at: https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:

`leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
`lighteval` - Additional lighteval tasks
`bigbench` - BigBench tasks
`original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `|0`) and pass it to the `--tasks` parameter. For example:

From file: `leaderboard|mmlu|0|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
From file: `bigbench|abstract_narrative_understanding|0|0` → Use: `bigbench|abstract_narrative_understanding|0`
From file: `lighteval|wmt14:hi-en|0|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values:

--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
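
The task-string rules above can be sketched as a small parser. `TaskSpec`, `parse_task`, `parse_tasks`, and `format_tasks` are hypothetical names introduced for illustration; the sketch assumes a fourth field, when present, is the version flag from the task list and drops it.

```python
from typing import NamedTuple


class TaskSpec(NamedTuple):
    suite: str
    task: str
    num_fewshot: int


def parse_task(entry):
    """Parse `suite|task|num_fewshot`, dropping a trailing version flag if present."""
    parts = entry.strip().split("|")
    if len(parts) == 4:
        parts = parts[:3]  # ignore the trailing version flag from the task list
    if len(parts) != 3:
        raise ValueError(f"expected suite|task|num_fewshot, got {entry!r}")
    suite, task, shots = parts
    return TaskSpec(suite, task, int(shots))


def parse_tasks(arg):
    """Parse a comma-separated --tasks value into TaskSpec items."""
    return [parse_task(entry) for entry in arg.split(",") if entry.strip()]


def format_tasks(specs):
    """Render TaskSpec items back into a --tasks value."""
    return ",".join(f"{s.suite}|{s.task}|{s.num_fewshot}" for s in specs)


specs = parse_tasks("leaderboard|mmlu|5,leaderboard|gsm8k|5")
from_file = format_tasks([parse_task("lighteval|wmt14:hi-en|0|0")])
```

Changing the shot count is then a matter of `spec._replace(num_fewshot=5)` before formatting.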

Option B: inspect-ai with vLLM Backend

inspect-ai is the UK AI Safety Institute's evaluation framework.

Standalone (local GPU):

# Run MMLU with vLLM
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
uv run scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.1-70B \
  --task mmlu \
  --tensor-parallel-size 4

Via HF Jobs:

hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --task mmlu

Available inspect-ai Tasks:

`mmlu` - Massive Multitask Language Understanding
`gsm8k` - Grade School Math
`hellaswag` - Common sense reasoning
`arc_challenge` - AI2 Reasoning Challenge
`truthfulqa` - TruthfulQA benchmark
`winogrande` - Winograd Schema Challenge
`humaneval` - Code generation

Option C: Python Helper Script

The helper script auto-selects hardware and simplifies job submission:

# Auto-detect hardware based on model size
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
uv run scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.1-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
uv run scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf

Hardware Recommendations:

Model Size	Recommended Hardware
< 3B params	`t4-small`
3B - 13B	`a10g-small`
13B - 34B	`a10g-large`
34B+	`a100-large`
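
The table above can be read as a simple threshold lookup. The sketch below is not the helper script's actual logic, which the source does not show; `recommend_flavor` is a hypothetical function, and assigning the shared boundaries (3B, 13B, 34B) to the larger tier is an assumption.

```python
def recommend_flavor(params_billions):
    """Map a parameter count (in billions) to a flavor from the table above."""
    if params_billions < 3:
        return "t4-small"
    if params_billions < 13:
        return "a10g-small"
    if params_billions < 34:
        return "a10g-large"
    return "a100-large"  # 34B and above


flavors = [recommend_flavor(size) for size in (1, 7, 13, 70)]
```

Memory needs also depend on precision and context length, so treat the result as a starting point and fall back to a larger flavor if the job runs out of memory.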

Commands Reference

Top-level help and version:

uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version

Inspect Tables (start here):

uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"

Extract from README:

uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
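
The `[--apply | --create-pr]` notation above reads as two alternative write modes. A minimal sketch of a wrapper that builds the command and rejects both at once follows; `build_extract_readme_args` is a hypothetical helper, and whether the real CLI rejects the combination is an assumption.

```python
def build_extract_readme_args(
    repo_id,
    table,
    model_column_index=None,
    model_name_override=None,
    task_type=None,
    dataset_name=None,
    apply=False,
    create_pr=False,
):
    """Build argv for `extract-readme`; with neither write flag it only previews YAML."""
    if apply and create_pr:
        raise ValueError("choose either --apply or --create-pr, not both")
    args = [
        "uv", "run", "scripts/evaluation_manager.py", "extract-readme",
        "--repo-id", repo_id,
        "--table", str(table),
    ]
    optional = [
        ("--model-column-index", model_column_index),
        ("--model-name-override", model_name_override),
        ("--task-type", task_type),
        ("--dataset-name", dataset_name),
    ]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    if apply:
        args.append("--apply")
    if create_pr:
        args.append("--create-pr")
    return args


preview = build_extract_readme_args("username/model-name", 1, model_column_index=2)
as_pr = build_extract_readme_args("username/model-name", 1, create_pr=True)
```

Running the preview form first and only then the write form matches the "preview, then apply" flow described in this document.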

Import from Artificial Analysis:

AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]

View / Validate:

uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"

Check Open PRs (ALWAYS run before --create-pr):

uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"

Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

Run Evaluation Job (Inference Providers):

hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "task-name"

or use the Python helper:

uv run scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."

Run vLLM Evaluation (Custom Models):

# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "mmlu"

# Helper script (auto hardware selection)
uv run scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

Model-Index Format

The generated model-index follows this structure:

model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com

WARNING: Do not use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field.
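
As an illustration of the structure and the warning above, the following sketch builds the same layout as a Python dictionary and refuses model names that contain markdown. `has_markdown` and `build_model_index` are hypothetical helpers; treating only bold markers and `[text](url)` links as markdown is an assumption.

```python
import re

# Bold markers and [text](url) links count as markdown here (an assumption).
_MARKDOWN = re.compile(r"\*\*|\[[^\]]*\]\([^)]*\)")


def has_markdown(text):
    """Return True if the text contains bold markers or a markdown link."""
    return bool(_MARKDOWN.search(text))


def build_model_index(model_name, task_type, dataset_name, dataset_type,
                      metrics, source_name, source_url):
    """Build a model-index mapping; `metrics` is a list of (name, type, value)."""
    if has_markdown(model_name):
        raise ValueError("model name must be plain text")
    if not source_url.startswith(("http://", "https://")):
        raise ValueError("source url must be a URL")
    return {
        "model-index": [{
            "name": model_name,
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": [
                    {"name": n, "type": t, "value": v} for n, t, v in metrics
                ],
                "source": {"name": source_name, "url": source_url},
            }],
        }]
    }


card = build_model_index(
    "Model Name", "text-generation", "Benchmark Dataset", "benchmark_type",
    [("MMLU", "mmlu", 85.2), ("HumanEval", "humaneval", 72.5)],
    "Source Name", "https://source-url.com",
)
```

The resulting mapping can be serialized with a YAML library such as pyyaml, which is already listed in the core dependencies.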

Error Handling

Table Not Found: Script will report if no evaluation tables are detected
Invalid Format: Clear error messages for malformed tables
API Errors: Retry logic for transient Artificial Analysis API failures
Token Issues: Validation before attempting updates
Merge Conflicts: Preserves existing model-index entries when adding new ones
Space Creation: Handles naming conflicts and hardware request failures gracefully
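
The "API Errors" item mentions retry logic without describing it. One common shape is exponential backoff, sketched below. This is not the script's implementation: `TransientAPIError`, `with_retries`, the attempt count, and the delays are all assumptions chosen for illustration.

```python
import time


class TransientAPIError(Exception):
    """Stand-in for a retryable failure such as a timeout or an HTTP 5xx."""


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientAPIError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)


calls = []
delays = []


def flaky_fetch():
    """Fail twice, then succeed, to exercise the retry path."""
    calls.append(1)
    if len(calls) < 3:
        raise TransientAPIError("temporary failure")
    return {"mmlu": 85.2}


# Record the delays instead of sleeping so the example runs instantly.
scores = with_retries(flaky_fetch, sleep=delays.append)
```

Injecting `sleep` keeps the helper testable; in real use the default `time.sleep` applies the waits.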

Best Practices

Check for existing PRs first: Run `get-prs` before creating any new PR to avoid duplicates
Always start with `inspect-tables`: See table structure and get the correct extraction command
Use `--help` for guidance: Run `inspect-tables --help` to see the complete workflow
Preview first: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
Verify extracted values: Compare YAML output against the README table manually
Use `--table N` for multi-table READMEs: Required when multiple evaluation tables exist
Use `--model-name-override` for comparison tables: Copy the exact column header from `inspect-tables` output
Create PRs for Others: Use `--create-pr` when updating models you don't own
One model per repo: Only add the main model's results to model-index
No markdown in YAML names: The model name field in YAML should be plain text

Model Name Matching

When extracting evaluation tables with multiple models (either as columns or rows), the script uses exact normalized token matching:

Removes markdown formatting (bold `**`, links `[]()`)
Normalizes names (lowercase, replace `-` and `_` with spaces)
Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"Olmo-3-32B"`
Only extracts if tokens match exactly (handles different word orders and separators)
Fails if no exact match found (rather than guessing from similar names)

For column-based tables (benchmarks as rows, models as columns):

Finds the column header matching the model name
Extracts scores from that column only

For transposed tables (models as rows, benchmarks as columns):

Finds the row in the first column matching the model name
Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
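
The matching rules described in this section can be sketched as follows. This is a reading of the description, not the script's code: `normalize_tokens` and `find_model` are hypothetical names, and returning `None` when nothing matches stands in for the script's failure behavior.

```python
import re


def normalize_tokens(name):
    """Strip markdown, lowercase, and split on spaces, hyphens, and underscores."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", name)  # keep link text only
    text = text.replace("**", "")                          # drop bold markers
    text = text.lower().replace("-", " ").replace("_", " ")
    return frozenset(text.split())


def find_model(target, candidates):
    """Return the first candidate whose token set equals the target's, else None."""
    wanted = normalize_tokens(target)
    for candidate in candidates:
        if normalize_tokens(candidate) == wanted:
            return candidate
    return None  # no exact match: do not guess from similar names


headers = ["Benchmark", "**Olmo 3 32B**", "Olmo 3 7B"]
match = find_model("OLMo-3-32B", headers)
no_match = find_model("OLMo-3-32B", ["Olmo 3 32B Instruct"])
```

Comparing sets rather than strings is what makes word order and separators irrelevant, while an extra token such as "Instruct" still prevents a match.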

Common Patterns

Update Your Own Model:

# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation" \
  --apply

Update Someone Else's Model (Full Workflow):

# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr

# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms

Import Fresh Benchmarks:

# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr

Troubleshooting

Issue: "No evaluation tables found in README"

Solution: Check if README contains markdown tables with numeric scores

Issue: "Could not find model 'X' in transposed table"

Solution: The script will display available models. Use `--model-name-override` with the exact name from the list.
Example: `--model-name-override "**Olmo 3-32B**"`

Issue: "AA_API_KEY not set"

Solution: Set environment variable or add to .env file

Issue: "Token does not have write access"

Solution: Ensure HF_TOKEN has write permissions for the repository

Issue: "Model not found in Artificial Analysis"

Solution: Verify creator-slug and model-name match API values

Issue: "Payment required for hardware"

Solution: Add a payment method to your Hugging Face account to use non-CPU hardware

Issue: "vLLM out of memory" or CUDA OOM

Solution: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

Issue: "Model architecture not supported by vLLM"

Solution: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers

Issue: "Trust remote code required"

Solution: Add the `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

Issue: "Chat template not found"

Solution: Only use `--use-chat-template` for instruction-tuned models that include a chat template

Integration Examples

Python Script Integration:

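import subprocess
import sys

def update_model_evaluations(repo_id):
    """Open a PR that adds evaluations extracted from the model's README."""
    # Run the CLI with the current interpreter; its dependencies must already be
    # installed (or swap in ["uv", "run", ...] to use the PEP 723 header).
    result = subprocess.run([
        sys.executable, "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr"
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
    # Let callers branch on success instead of parsing printed output.
    return result.returncode == 0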

Imported: Limitations

Use this skill only when the task clearly matches the scope described above.
Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.