Awesome-copilot arize-dataset

INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI.

install
source · Clone the upstream repo
git clone https://github.com/github/awesome-copilot
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/github/awesome-copilot "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/arize-ax/skills/arize-dataset" ~/.claude/skills/github-awesome-copilot-arize-dataset && rm -rf "$T"
manifest: plugins/arize-ax/skills/arize-dataset/SKILL.md
source content

Arize Dataset Skill

Concepts

  • Dataset = a versioned collection of examples used for evaluation and experimentation
  • Dataset Version = a snapshot of a dataset at a point in time; updates can be in-place or create a new version
  • Example = a single record in a dataset with arbitrary user-defined fields (e.g., question, answer, context)
  • Space = an organizational container; datasets belong to a space

System-managed fields on examples (id, created_at, updated_at) are auto-generated by the server -- never include them in create or append payloads.
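
When round-tripping exported examples back into a create or append payload, those system fields must be dropped first. A minimal Python sketch (the helper name is ours, not part of the ax CLI):

```python
# Illustrative round-trip helper: drop server-managed fields from an
# exported example before reusing it in a create/append payload.
SYSTEM_FIELDS = {"id", "created_at", "updated_at"}

def user_fields(example: dict) -> dict:
    """Return only the user-defined fields of an exported example."""
    return {k: v for k, v in example.items() if k not in SYSTEM_FIELDS}

exported = {
    "id": "ex_001",
    "created_at": "2026-01-15T10:00:00Z",
    "updated_at": "2026-01-15T10:00:00Z",
    "question": "What is 2+2?",
    "answer": "4",
}
payload = [user_fields(exported)]  # safe to pass via --json or --file
```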

Prerequisites

Proceed directly with the task — run the ax command you need. Do NOT check versions, env vars, or profiles upfront.

If an ax command fails, troubleshoot based on the error:

  • command not found or version error → see references/ax-setup.md
  • 401 Unauthorized / missing API key → run ax profiles show to inspect the current profile. If the profile is missing or the API key is wrong: check .env for ARIZE_API_KEY and use it to create/update the profile via references/ax-profiles.md. If .env has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
  • Space ID unknown → check .env for ARIZE_SPACE_ID, or run ax spaces list -o json, or ask the user
  • Project unclear → check .env for ARIZE_DEFAULT_PROJECT, or ask, or run ax projects list -o json --limit 100 and present as selectable options

List Datasets: ax datasets list

Browse datasets in a space. Output goes to stdout.

ax datasets list
ax datasets list --space-id SPACE_ID --limit 20
ax datasets list --cursor CURSOR_TOKEN
ax datasets list -o json

Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --space-id | string | from profile | Filter by space |
| --limit, -l | int | 15 | Max results (1-100) |
| --cursor | string | none | Pagination cursor from previous response |
| -o, --output | string | table | Output format: table, json, csv, parquet, or file path |
| -p, --profile | string | default | Configuration profile |

Get Dataset: ax datasets get

Quick metadata lookup -- returns dataset name, space, timestamps, and version list.

ax datasets get DATASET_ID
ax datasets get DATASET_ID -o json

Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| DATASET_ID | string | required | Positional argument |
| -o, --output | string | table | Output format |
| -p, --profile | string | default | Configuration profile |

Response fields

| Field | Type | Description |
|-------|------|-------------|
| id | string | Dataset ID |
| name | string | Dataset name |
| space_id | string | Space this dataset belongs to |
| created_at | datetime | When the dataset was created |
| updated_at | datetime | Last modification time |
| versions | array | List of dataset versions (id, name, dataset_id, created_at, updated_at) |

Export Dataset: ax datasets export

Download all examples to a file. Use --all for datasets larger than 500 examples (unlimited bulk export).

ax datasets export DATASET_ID
# -> dataset_abc123_20260305_141500/examples.json

ax datasets export DATASET_ID --all
ax datasets export DATASET_ID --version-id VERSION_ID
ax datasets export DATASET_ID --output-dir ./data
ax datasets export DATASET_ID --stdout
ax datasets export DATASET_ID --stdout | jq '.[0]'

Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| DATASET_ID | string | required | Positional argument |
| --version-id | string | latest | Export a specific dataset version |
| --all | bool | false | Unlimited bulk export (use for datasets > 500 examples) |
| --output-dir | string | . | Output directory |
| --stdout | bool | false | Print JSON to stdout instead of file |
| -p, --profile | string | default | Configuration profile |

Agent auto-escalation rule: If an export returns exactly 500 examples, the result is likely truncated — re-run with --all to get the full dataset.

Export completeness verification: After exporting, confirm the row count matches what the server reports:

# Get the server-reported count from dataset metadata
ax datasets get DATASET_ID -o json | jq '.versions[-1] | {version: .id, examples: .example_count}'

# Compare to what was exported
jq 'length' dataset_*/examples.json

# If counts differ, re-export with --all

Output is a JSON array of example objects. Each example has system fields (id, created_at, updated_at) plus all user-defined fields:

[
  {
    "id": "ex_001",
    "created_at": "2026-01-15T10:00:00Z",
    "updated_at": "2026-01-15T10:00:00Z",
    "question": "What is 2+2?",
    "answer": "4",
    "topic": "math"
  }
]

Create Dataset: ax datasets create

Create a new dataset from a data file.

ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.csv
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.json
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.jsonl
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.parquet

Flags

| Flag | Type | Required | Description |
|------|------|----------|-------------|
| --name, -n | string | yes | Dataset name |
| --space-id | string | yes | Space to create the dataset in |
| --file, -f | path | yes | Data file: CSV, JSON, JSONL, or Parquet |
| -o, --output | string | no | Output format for the returned dataset metadata |
| -p, --profile | string | no | Configuration profile |

Passing data via stdin

Use --file - to pipe data directly — no temp file needed:

echo '[{"question": "What is 2+2?", "answer": "4"}]' | ax datasets create --name "my-dataset" --space-id SPACE_ID --file -

# Or with a heredoc
ax datasets create --name "my-dataset" --space-id SPACE_ID --file - << 'EOF'
[{"question": "What is 2+2?", "answer": "4"}]
EOF

To add rows to an existing dataset, use ax datasets append --json '[...]' instead — no file needed.

Supported file formats

| Format | Extension | Notes |
|--------|-----------|-------|
| CSV | .csv | Column headers become field names |
| JSON | .json | Array of objects |
| JSON Lines | .jsonl | One object per line (NOT a JSON array) |
| Parquet | .parquet | Column names become field names; preserves types |

Format gotchas:

  • CSV: Loses type information — dates become strings, null becomes empty string. Use JSON/Parquet to preserve types.
  • JSONL: Each line is a separate JSON object. A JSON array ([{...}, {...}]) in a .jsonl file will fail — use the .json extension instead.
  • Parquet: Preserves column types. Requires pandas/pyarrow to read locally: pd.read_parquet("examples.parquet").
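
The JSON-vs-JSONL distinction above is the most common stumbling block. A quick Python sketch (file names are arbitrary) writing the same examples in both shapes:

```python
import json
import os
import tempfile

# Write the same examples as a .json array and as .jsonl. A JSON array
# saved into a .jsonl file would be rejected, since JSONL expects one
# standalone object per line with no surrounding array.
examples = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is gravity?", "answer": "A fundamental force..."},
]

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "data.json")
jsonl_path = os.path.join(out_dir, "data.jsonl")

# .json: a single JSON array of objects
with open(json_path, "w") as f:
    json.dump(examples, f)

# .jsonl: one JSON object per line
with open(jsonl_path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```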

Append Examples: ax datasets append

Add examples to an existing dataset. Two input modes -- use whichever fits.

Inline JSON (agent-friendly)

Generate the payload directly -- no temp files needed:

ax datasets append DATASET_ID --json '[{"question": "What is 2+2?", "answer": "4"}]'

ax datasets append DATASET_ID --json '[
  {"question": "What is gravity?", "answer": "A fundamental force..."},
  {"question": "What is light?", "answer": "Electromagnetic radiation..."}
]'

From a file

ax datasets append DATASET_ID --file new_examples.csv
ax datasets append DATASET_ID --file additions.json

To a specific version

ax datasets append DATASET_ID --json '[{"q": "..."}]' --version-id VERSION_ID

Flags

| Flag | Type | Required | Description |
|------|------|----------|-------------|
| DATASET_ID | string | yes | Positional argument |
| --json | string | mutex | JSON array of example objects |
| --file, -f | path | mutex | Data file (CSV, JSON, JSONL, Parquet) |
| --version-id | string | no | Append to a specific version (default: latest) |
| -o, --output | string | no | Output format for the returned dataset metadata |
| -p, --profile | string | no | Configuration profile |

Exactly one of --json or --file is required.

Validation

  • Each example must be a JSON object with at least one user-defined field
  • Maximum 100,000 examples per request
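
For payloads above the per-request limit, one option is to batch locally and append one chunk at a time. A sketch of the chunking (the helper is ours; shown here with a small size so the mechanics are visible):

```python
# Split a large example list into chunks no bigger than the documented
# 100,000-per-request append limit, then run
# `ax datasets append DATASET_ID --json ...` once per chunk.
MAX_PER_REQUEST = 100_000

def chunked(examples, size=MAX_PER_REQUEST):
    """Yield successive slices of at most `size` examples."""
    for i in range(0, len(examples), size):
        yield examples[i:i + size]

# Demo with a small size: 25 examples in chunks of 10 -> 10, 10, 5
examples = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(25)]
batches = list(chunked(examples, size=10))
```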

Schema validation before append: If the dataset already has examples, inspect its schema before appending to avoid silent field mismatches:

# Check existing field names in the dataset
ax datasets export DATASET_ID --stdout | jq '.[0] | keys'

# Verify your new data has matching field names
echo '[{"question": "..."}]' | jq '.[0] | keys'

# Both outputs should show the same user-defined fields

Fields are free-form: extra fields in new examples are added, and missing fields become null. However, typos in field names (e.g., queston vs question) create new columns silently -- verify spelling before appending.
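
The same check can be scripted locally. A sketch (the helper name is ours, not part of the ax CLI) that flags field names present only in the new rows:

```python
# Compare field names between existing examples and new rows before
# appending, so a typo like "queston" is caught instead of silently
# creating a new column. System fields from an export are ignored.
SYSTEM_FIELDS = {"id", "created_at", "updated_at"}

def new_field_names(existing, new):
    """Return field names present in `new` rows but absent from `existing`."""
    existing_fields = set().union(*(e.keys() for e in existing)) - SYSTEM_FIELDS
    new_fields = set().union(*(n.keys() for n in new))
    return new_fields - existing_fields

existing = [{"id": "ex_001", "question": "What is 2+2?", "answer": "4"}]
new_rows = [{"queston": "What is gravity?", "answer": "A fundamental force..."}]
suspicious = new_field_names(existing, new_rows)
# suspicious == {"queston"}, almost certainly a typo of "question"
```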

Delete Dataset: ax datasets delete

ax datasets delete DATASET_ID
ax datasets delete DATASET_ID --force   # skip confirmation prompt

Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| DATASET_ID | string | required | Positional argument |
| --force, -f | bool | false | Skip confirmation prompt |
| -p, --profile | string | default | Configuration profile |

Workflows

Find a dataset by name

Users often refer to datasets by name rather than ID. Resolve a name to an ID before running other commands:

# Find dataset ID by name
ax datasets list -o json | jq '.[] | select(.name == "eval-set-v1") | .id'

# If the list is paginated, fetch more
ax datasets list -o json --limit 100 | jq '.[] | select(.name | test("eval-set")) | {id, name}'

Create a dataset from file for evaluation

  1. Prepare a CSV/JSON/Parquet file with your evaluation columns (e.g., input, expected_output)
    • If generating data inline, pipe it via stdin using --file - (see the Create Dataset section)
  2. ax datasets create --name "eval-set-v1" --space-id SPACE_ID --file eval_data.csv
  3. Verify: ax datasets get DATASET_ID
  4. Use the dataset ID to run experiments

Add examples to an existing dataset

# Find the dataset
ax datasets list

# Append inline or from a file (see Append Examples section for full syntax)
ax datasets append DATASET_ID --json '[{"question": "...", "answer": "..."}]'
ax datasets append DATASET_ID --file additional_examples.csv

Download dataset for offline analysis

  1. ax datasets list -- find the dataset
  2. ax datasets export DATASET_ID -- download to file
  3. Parse the JSON: jq '.[] | .question' dataset_*/examples.json

Export a specific version

# List versions
ax datasets get DATASET_ID -o json | jq '.versions'

# Export that version
ax datasets export DATASET_ID --version-id VERSION_ID

Iterate on a dataset

  1. Export current version: ax datasets export DATASET_ID
  2. Modify the examples locally
  3. Append new rows: ax datasets append DATASET_ID --file new_rows.csv
  4. Or create a fresh version: ax datasets create --name "eval-set-v2" --space-id SPACE_ID --file updated_data.json

Pipe export to other tools

# Count examples
ax datasets export DATASET_ID --stdout | jq 'length'

# Extract a single field
ax datasets export DATASET_ID --stdout | jq '.[].question'

# Convert to CSV with jq
ax datasets export DATASET_ID --stdout | jq -r '.[] | [.question, .answer] | @csv'

Dataset Example Schema

Examples are free-form JSON objects. There is no fixed schema -- columns are whatever fields you provide. System-managed fields are added by the server:

| Field | Type | Managed by | Notes |
|-------|------|------------|-------|
| id | string | server | Auto-generated UUID. Required on update, forbidden on create/append |
| created_at | datetime | server | Immutable creation timestamp |
| updated_at | datetime | server | Auto-updated on modification |
| (any user field) | any JSON type | user | String, number, boolean, null, nested object, array |
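
These rules (plus the reserved column names listed under Troubleshooting) can be checked before sending a payload. An illustrative pre-flight sketch, assuming reserved names are exactly time, count, and the source_record_* prefix:

```python
import fnmatch

# Pre-flight validation mirroring the rules documented here: system-managed
# fields are forbidden on create/append, and time, count, and
# source_record_* are reserved column names.
SYSTEM_FIELDS = {"id", "created_at", "updated_at"}
RESERVED_FIELDS = {"time", "count"}

def rejected_fields(example: dict) -> set:
    """Return field names the server would reject in a create/append payload."""
    bad = {k for k in example if k in SYSTEM_FIELDS or k in RESERVED_FIELDS}
    bad |= {k for k in example if fnmatch.fnmatch(k, "source_record_*")}
    return bad

example = {
    "id": "ex_001",
    "time": 1700000000,
    "source_record_span": "s1",
    "question": "What is 2+2?",
}
bad = rejected_fields(example)
# bad == {"id", "time", "source_record_span"}
```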

Related Skills

  • arize-trace: Export production spans to understand what data to put in datasets → use arize-trace
  • arize-experiment: Run evaluations against this dataset → next step is arize-experiment
  • arize-prompt-optimization: Use dataset + experiment results to improve prompts → use arize-prompt-optimization

Troubleshooting

| Problem | Solution |
|---------|----------|
| ax: command not found | See references/ax-setup.md |
| 401 Unauthorized | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| No profile found | No profile is configured. See references/ax-profiles.md to create one. |
| Dataset not found | Verify dataset ID with ax datasets list |
| File format error | Supported: CSV, JSON, JSONL, Parquet. Use --file - to read from stdin. |
| platform-managed column | Remove id, created_at, updated_at from create/append payloads |
| reserved column | Remove time, count, or any source_record_* field |
| Provide either --json or --file | Append requires exactly one input source |
| Examples array is empty | Ensure your JSON array or file contains at least one example |
| not a JSON object | Each element in the --json array must be a {...} object, not a string or number |

Save Credentials for Future Use

See references/ax-profiles.md § Save Credentials for Future Use.