Awesome-copilot arize-experiment
INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI.
```bash
# Clone the full repo
git clone https://github.com/github/awesome-copilot

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/github/awesome-copilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/arize-ax/skills/arize-experiment" ~/.claude/skills/github-awesome-copilot-arize-experiment && rm -rf "$T"
```
plugins/arize-ax/skills/arize-experiment/SKILL.md

Arize Experiment Skill
Concepts
- Experiment = a named evaluation run against a specific dataset version, containing one run per example
- Experiment Run = the result of processing one dataset example -- includes the model output, optional evaluations, and optional metadata
- Dataset = a versioned collection of examples; every experiment is tied to a dataset and a specific dataset version
- Evaluation = a named metric attached to a run (e.g., `correctness`, `relevance`), with optional label, score, and explanation
The typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.
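In command form, the flow looks roughly like this (a sketch; the middle step is whatever your system does):

```bash
ax datasets export DATASET_ID --stdout > examples.json    # export the dataset
# ...process each example, assembling runs.json...
ax experiments create --name "baseline" --dataset-id DATASET_ID --file runs.json
ax experiments get EXPERIMENT_ID                          # confirm it landed
```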
Prerequisites
Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.

If an `ax` command fails, troubleshoot based on the error (see the sketch after this list):
- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys).
- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
- Project unclear → check `.env` for `ARIZE_DEFAULT_PROJECT`, or ask, or run `ax projects list -o json --limit 100` and present as selectable options
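A minimal troubleshooting sketch using only commands documented here; the `.env` path is an assumption about where credentials might live:

```bash
# Inspect the active profile first
ax profiles show

# Check a local .env for credentials before asking the user (assumed location)
grep -E 'ARIZE_(API_KEY|SPACE_ID|DEFAULT_PROJECT)' .env 2>/dev/null

# Enumerate spaces/projects when IDs are unknown
ax spaces list -o json
ax projects list -o json --limit 100
```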
List Experiments: ax experiments list
Browse experiments, optionally filtered by dataset. Output goes to stdout.

```bash
ax experiments list
ax experiments list --dataset-id DATASET_ID --limit 20
ax experiments list --cursor CURSOR_TOKEN
ax experiments list -o json
```
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `--dataset-id` | string | none | Filter by dataset |
| `--limit` | int | 15 | Max results (1-100) |
| `--cursor` | string | none | Pagination cursor from previous response |
| `-o, --output` | string | table | Output format: table, json, csv, parquet, or file path |
| `--profile` | string | default | Configuration profile |
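For more than one page of results, loop on the cursor. A hypothetical sketch; it assumes the JSON response exposes the next page token as a top-level `next_cursor` field, which this doc does not confirm:

```bash
# Page through all experiments (cursor field name is an assumption)
CURSOR=""
while :; do
  if [ -n "$CURSOR" ]; then
    PAGE=$(ax experiments list --limit 100 --cursor "$CURSOR" -o json)
  else
    PAGE=$(ax experiments list --limit 100 -o json)
  fi
  echo "$PAGE"
  CURSOR=$(echo "$PAGE" | jq -r '.next_cursor // empty')
  [ -z "$CURSOR" ] && break
done
```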
Get Experiment: ax experiments get
Quick metadata lookup -- returns experiment name, linked dataset/version, and timestamps.

```bash
ax experiments get EXPERIMENT_ID
ax experiments get EXPERIMENT_ID -o json
```
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `-o, --output` | string | table | Output format |
| `--profile` | string | default | Configuration profile |
Response fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Experiment ID |
| `name` | string | Experiment name |
| `dataset_id` | string | Linked dataset ID |
| `dataset_version_id` | string | Specific dataset version used |
| `project_name` | string | Project where experiment traces are stored |
| `created_at` | datetime | When the experiment was created |
| `updated_at` | datetime | Last modification time |
Export Experiment: ax experiments export
Download all runs to a file. By default uses the REST API; pass `--all` to use Arrow Flight for bulk transfer.

```bash
ax experiments export EXPERIMENT_ID                 # -> experiment_abc123_20260305_141500/runs.json
ax experiments export EXPERIMENT_ID --all
ax experiments export EXPERIMENT_ID --output-dir ./results
ax experiments export EXPERIMENT_ID --stdout
ax experiments export EXPERIMENT_ID --stdout | jq '.[0]'
```
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `--all` | bool | false | Use Arrow Flight for bulk export (see below) |
| `--output-dir` | string | `experiment_<id>_<timestamp>/` | Output directory |
| `--stdout` | bool | false | Print JSON to stdout instead of file |
| `--profile` | string | default | Configuration profile |
REST vs Flight (`--all`)
- REST (default): Lower friction -- no Arrow/Flight dependency, standard HTTPS ports, works through any corporate proxy or firewall. Limited to 500 runs per page.
- Flight (`--all`): Required for experiments with more than 500 runs. Uses gRPC+TLS on a separate host/port (`flight.arize.com:443`) which some corporate networks may block.

Agent auto-escalation rule: If a REST export returns exactly 500 runs, the result is likely truncated. Re-run with `--all` to get the full dataset.
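The escalation rule in script form, using only the flags documented above:

```bash
# Try REST first; if the page is exactly full, re-export via Arrow Flight
ax experiments export EXPERIMENT_ID --stdout > runs.json
if [ "$(jq 'length' runs.json)" -eq 500 ]; then
  ax experiments export EXPERIMENT_ID --all --output-dir ./full-export
fi
```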
Output is a JSON array of run objects:
[ { "id": "run_001", "example_id": "ex_001", "output": "The answer is 4.", "evaluations": { "correctness": { "label": "correct", "score": 1.0 }, "relevance": { "score": 0.95, "explanation": "Directly answers the question" } }, "metadata": { "model": "gpt-4o", "latency_ms": 1234 } } ]
Create Experiment: ax experiments create
Create a new experiment with runs from a data file.

```bash
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments create --name "claude-test" --dataset-id DATASET_ID --file runs.csv
```
Flags
| Flag | Type | Required | Description |
|---|---|---|---|
| `--name` | string | yes | Experiment name |
| `--dataset-id` | string | yes | Dataset to run the experiment against |
| `--file` | path | yes | Data file with runs: CSV, JSON, JSONL, or Parquet |
| `-o, --output` | string | no | Output format |
| `--profile` | string | no | Configuration profile |
Passing data via stdin
Use `--file -` to pipe data directly — no temp file needed:

```bash
echo '[{"example_id": "ex_001", "output": "Paris"}]' | \
  ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file -

# Or with a heredoc
ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file - << 'EOF'
[{"example_id": "ex_001", "output": "Paris"}]
EOF
```
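Stdin also lets you reshape an existing outputs file on the fly. A sketch assuming a hypothetical `model_outputs.json` whose records carry `id` and `completion` keys (rename to match your data):

```bash
# Map hypothetical fields onto the required example_id/output columns, then pipe in
jq '[.[] | {example_id: .id, output: .completion}]' model_outputs.json | \
  ax experiments create --name "reshaped-run" --dataset-id DATASET_ID --file -
```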
Required columns in the runs file
| Column | Type | Required | Description |
|---|---|---|---|
| `example_id` | string | yes | ID of the dataset example this run corresponds to |
| `output` | string | yes | The model/system output for this example |

Additional columns are passed through as `additionalProperties` on the run.
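As an illustration, a minimal CSV runs file; the `model` column is hypothetical and simply rides along as `additionalProperties`:

```csv
example_id,output,model
ex_001,"The answer is 4.",gpt-4o
ex_002,"Paris",gpt-4o
```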
Delete Experiment: ax experiments delete
```bash
ax experiments delete EXPERIMENT_ID
ax experiments delete EXPERIMENT_ID --force   # skip confirmation prompt
```
Flags
| Flag | Type | Default | Description |
|---|---|---|---|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `--force` | bool | false | Skip confirmation prompt |
| `--profile` | string | default | Configuration profile |
Experiment Run Schema
Each run corresponds to one dataset example:
{ "example_id": "required -- links to dataset example", "output": "required -- the model/system output for this example", "evaluations": { "metric_name": { "label": "optional string label (e.g., 'correct', 'incorrect')", "score": "optional numeric score (e.g., 0.95)", "explanation": "optional freeform text" } }, "metadata": { "model": "gpt-4o", "temperature": 0.7, "latency_ms": 1234 } }
Evaluation fields
| Field | Type | Required | Description |
|---|---|---|---|
| `label` | string | no | Categorical classification (e.g., `correct`, `incorrect`, `partial`) |
| `score` | number | no | Numeric quality score (e.g., 0.0 - 1.0) |
| `explanation` | string | no | Freeform reasoning for the evaluation |

At least one of `label`, `score`, or `explanation` should be present per evaluation.
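Before calling create, a jq pre-flight check against these rules can catch malformed runs (a sketch over the schema above):

```bash
# Runs missing a required field
jq '[.[] | select(.example_id == null or .output == null)]' runs.json

# Evaluations that carry no label, score, or explanation at all
jq '[.[] | select(.evaluations != null) | .evaluations | to_entries[]
         | select(.value.label == null and .value.score == null and .value.explanation == null)
         | .key]' runs.json
```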
Workflows
Run an experiment against a dataset
- Find or create a dataset:

```bash
ax datasets list
ax datasets export DATASET_ID --stdout | jq 'length'   # sanity-check example count
```

- Export the dataset examples:

```bash
ax datasets export DATASET_ID
```

- Process each example through your system, collecting outputs and evaluations (a consolidated sketch follows this list)
- Build a runs file (JSON array) with `example_id`, `output`, and optional `evaluations`:

```json
[
  {"example_id": "ex_001", "output": "4", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
  {"example_id": "ex_002", "output": "Paris", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}}
]
```

- Create the experiment:

```bash
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
```

- Verify:

```bash
ax experiments get EXPERIMENT_ID
```
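A consolidated sketch of the steps above. The `my_model` command and the dataset export field names (`id`, `input`) are hypothetical; check your actual dataset export shape before relying on them:

```bash
# Export examples, run each through a placeholder model, assemble runs.json
ax datasets export DATASET_ID --stdout > examples.json
jq -c '.[]' examples.json | while read -r ex; do
  id=$(echo "$ex" | jq -r '.id')         # assumed field name
  input=$(echo "$ex" | jq -r '.input')   # assumed field name
  output=$(my_model "$input")            # hypothetical command -- your system here
  jq -n --arg id "$id" --arg out "$output" '{example_id: $id, output: $out}'
done | jq -s '.' > runs.json

ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file runs.json
ax experiments get EXPERIMENT_ID
```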
Compare two experiments
- Export both experiments:

```bash
ax experiments export EXPERIMENT_ID_A --stdout > a.json
ax experiments export EXPERIMENT_ID_B --stdout > b.json
```

- Compare evaluation scores by `example_id`:

```bash
# Average correctness score for experiment A
jq '[.[] | .evaluations.correctness.score] | add / length' a.json
# Same for experiment B
jq '[.[] | .evaluations.correctness.score] | add / length' b.json
```

- Find examples where results differ:

```bash
jq -s '.[0] as $a | .[1][] | . as $run | {
  example_id: $run.example_id,
  b_score: $run.evaluations.correctness.score,
  a_score: ($a[] | select(.example_id == $run.example_id) | .evaluations.correctness.score)
}' a.json b.json
```

- Score distribution per evaluator (pass/fail/partial counts):

```bash
# Count by label for experiment A
jq '[.[] | .evaluations.correctness.label] | group_by(.) | map({label: .[0], count: length})' a.json
```

- Find regressions (examples that passed in A but fail in B):

```bash
jq -s '
  [.[0][] | select(.evaluations.correctness.label == "correct")] as $passed_a |
  [.[1][]
    | select(.evaluations.correctness.label != "correct")
    | select(.example_id as $id | $passed_a | any(.example_id == $id))
  ]
' a.json b.json
```
Statistical significance note: Score comparisons are most reliable with ≥ 30 examples per evaluator. With fewer examples, treat the delta as directional only — a 5% difference on n=10 may be noise. Report sample size alongside scores: `jq 'length' a.json`.
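One way to report both numbers together, reusing the `correctness` metric from the comparison above:

```bash
# Sample size and mean correctness for each experiment export
for f in a.json b.json; do
  jq --arg file "$f" \
     '{file: $file, n: length,
       avg_correctness: ([.[].evaluations.correctness.score] | add / length)}' "$f"
done
```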
Download experiment results for analysis
- Find experiments:

```bash
ax experiments list --dataset-id DATASET_ID
```

- Download to file:

```bash
ax experiments export EXPERIMENT_ID
```

- Parse:

```bash
jq '.[] | {example_id, score: .evaluations.correctness.score}' experiment_*/runs.json
```
Pipe export to other tools
```bash
# Count runs
ax experiments export EXPERIMENT_ID --stdout | jq 'length'

# Extract all outputs
ax experiments export EXPERIMENT_ID --stdout | jq '.[].output'

# Get runs with low scores
ax experiments export EXPERIMENT_ID --stdout | jq '[.[] | select(.evaluations.correctness.score < 0.5)]'

# Convert to CSV
ax experiments export EXPERIMENT_ID --stdout | jq -r '.[] | [.example_id, .output, .evaluations.correctness.score] | @csv'
```
Related Skills
- arize-dataset: Create or export the dataset this experiment runs against → use `arize-dataset` first
- arize-prompt-optimization: Use experiment results to improve prompts → next step is `arize-prompt-optimization`
- arize-trace: Inspect individual span traces for failing experiment runs → use `arize-trace`
- arize-link: Generate clickable UI links to traces from experiment runs → use `arize-link`
Troubleshooting
| Problem | Solution |
|---|---|
| `ax: command not found` | See references/ax-setup.md |
| `401 Unauthorized` | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| No profile found | No profile is configured. See references/ax-profiles.md to create one. |
| Experiment not found | Verify experiment ID with `ax experiments list` |
| Create fails with validation error | Each run must have `example_id` and `output` fields |
| Run references unknown example | Ensure `example_id` values match IDs from the dataset (export dataset to verify) |
| Export produces no runs | Export returned empty -- verify experiment has runs via `ax experiments get EXPERIMENT_ID` |
| Linked dataset missing | The linked dataset may have been deleted; check with `ax datasets list` |
Save Credentials for Future Use
See references/ax-profiles.md § Save Credentials for Future Use.