# Phoenix Skills Audit

Install from the Phoenix repo (https://github.com/Arize-ai/phoenix):

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/Arize-ai/phoenix "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/phoenix-skills-audit" ~/.claude/skills/arize-ai-phoenix-phoenix-skills-audit && rm -rf "$T"
```

Source: `.agents/skills/phoenix-skills-audit/SKILL.md`
Keep the three external-facing skills —
phoenix-tracing, phoenix-cli, phoenix-evals —
truthful about what the Phoenix Python clients, TypeScript clients, CLI, and APIs actually
do today. The output is patches applied to the skill files, not a report. The skill
reads recent commits, identifies what changed in user-facing surfaces, and updates the
relevant SKILL.md and references/*.md files in place.
This is a sibling skill to
phoenix-docs-gap-audit. The docs-gap-audit produces a report
about gaps in docs/phoenix/; this skill produces edits to .agents/skills/. Skills are
loaded into agent context every time the user asks a question that triggers them, so a
stale skill teaches every future agent the wrong API. That makes drift here strictly
worse than drift in human-facing docs — humans can sanity-check; agents can't.
## Targets — the three skills this audit owns

- `.agents/skills/phoenix-tracing/SKILL.md`
- `.agents/skills/phoenix-tracing/references/*.md`
- `.agents/skills/phoenix-cli/SKILL.md`
- `.agents/skills/phoenix-evals/SKILL.md`
- `.agents/skills/phoenix-evals/references/*.md`
Do not touch any other skill directory. If a change clearly belongs in a skill outside these three (e.g.
phoenix-server, phoenix-frontend), note it in the run summary and
skip — those skills are internal and have their own owners.
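The scope boundary is mechanical enough to check in code. A sketch of a guard that rejects out-of-scope paths before an edit is applied (a hypothetical helper, not part of any existing tooling):

```python
from pathlib import PurePosixPath

# The only directories this audit may edit; everything else is out of scope.
ALLOWED_ROOTS = (
    ".agents/skills/phoenix-tracing",
    ".agents/skills/phoenix-cli",
    ".agents/skills/phoenix-evals",
)

def in_scope(path: str) -> bool:
    """True if `path` lives inside one of the three target skill directories."""
    p = PurePosixPath(path)
    return any(PurePosixPath(root) in p.parents for root in ALLOWED_ROOTS)

print(in_scope(".agents/skills/phoenix-cli/SKILL.md"))     # True
print(in_scope(".agents/skills/phoenix-server/SKILL.md"))  # False
```

Running a check like this on every planned edit path makes "do not touch any other skill directory" a hard failure rather than a convention.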
## Source mapping — which code feeds which skill
| Source area | Skill |
|---|---|
| (Python) | phoenix-tracing |
| (TS) | phoenix-tracing |
| OpenInference semantic conventions, span attributes, instrumentation patterns | phoenix-tracing |
| (commands, flags, output JSON shape) | phoenix-cli |
| (Python) | phoenix-evals |
| (TS) | phoenix-evals |
| New evaluators, eval templates, experiment APIs | phoenix-evals |
| (Python) | depends on what's exposed — see below |
| (TS) | depends on what's exposed — see below |
| Server REST/GraphQL () | depends on what's exposed — see below |
The generic clients and the server APIs are cross-cutting. Map them by the feature they expose, not the file they live in:
- A client method that creates spans / sets attributes → phoenix-tracing
- A client method that runs evaluations or experiments → phoenix-evals
- A new REST/GraphQL endpoint that the CLI wraps (or should wrap) → phoenix-cli
- A new attribute on a span returned by the API → phoenix-cli (JSON shape doc) and potentially phoenix-tracing (if it's an OpenInference attribute)
A single feature can — and often does — span multiple skills. That's fine. Make all the edits; cross-reference between skills only when the user genuinely needs to read both.
## Workflow

### Phase 1: Gather commits
Default window is the last 7 days on `origin/main`. The user may override. Translate their phrasing into a concrete range before running anything.

Always audit `origin/main`, not the local `main` branch. Local `main` is routinely stale by dozens of commits — auditing the stale tip silently misses everything that shipped after the last `git pull`.
```shell
git fetch origin main --quiet

# Default 7-day window
git log --since="7 days ago" origin/main --no-merges --pretty=format:"%h %s" --name-status

# Tag range if the user specified one
git log <prev-tag>..<current-tag> --no-merges --pretty=format:"%h %s" --name-status

# Sanity check
git rev-list --count main..origin/main
```
Save the raw list. Note the audited SHA in the run summary so a reader can reproduce.
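The `--name-status` output is easy to split into per-commit records for triage. A sketch, with made-up sample commits for illustration:

```python
def parse_log(text: str) -> list[dict]:
    """Parse `git log --pretty=format:"%h %s" --name-status` output.

    Commit header lines look like "abc1234 subject"; file lines look like
    "M\tpath" (status letter(s), tab, path); blank lines separate commits.
    """
    commits, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        if "\t" in line and line[0] in "AMDRC":
            # Status line; for renames ("R100\told\tnew") keep the new path.
            current["files"].append(line.split("\t")[-1])
        else:
            sha, _, subject = line.partition(" ")
            current = {"sha": sha, "subject": subject, "files": []}
            commits.append(current)
    return commits

sample = (
    "abc1234 feat: add px experiment compare\n"
    "A\tjs/packages/phoenix-cli/src/commands/compare.ts\n"
    "\n"
    "def5678 chore(main): release 1.2.3\n"
    "M\tCHANGELOG.md\n"
)
print([c["sha"] for c in parse_log(sample)])  # ['abc1234', 'def5678']
```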
### Phase 2: Triage to user-facing surfaces
Commit messages lie or under-report. Use them as an index, not a source of truth. Split the list into three buckets:
- Audit candidates — anything that changes what the Python clients, TypeScript clients, CLI, REST/GraphQL APIs, or instrumentation packages do from the user's perspective. New methods, new flags, new commands, new attributes, new env vars, renamed parameters, behavior changes, deprecations, breaking changes.
- Skip — internal refactors with no public surface, dep bumps, test-only changes, CI/build changes, formatting, frontend-only changes (those belong to other skills), release-please bookkeeping (`chore(main): release …`), and feature flags (env vars named `*_DANGEROUSLY_*`, `*_EXPERIMENTAL_*`, intentionally undocumented escape hatches). Feature flags are deliberately kept out of public skills.
- Unclear — when you cannot tell from the message and changed paths. Default to reading the diff. Cheap, and catches features hidden behind `refactor:` prefixes.
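The mechanical parts of the skip bucket can be expressed as patterns; everything that survives stays "unclear" until its diff is read. A sketch, with the flag-name patterns assumed from the conventions above:

```python
import re

RELEASE_BOOKKEEPING = re.compile(r"^chore\(main\): release")
FEATURE_FLAG_ENV = re.compile(r"_(DANGEROUSLY|EXPERIMENTAL)_")

def first_pass(subject: str, diff_text: str = "") -> str:
    """Triage a commit: 'skip' for release bookkeeping and feature-flag
    env vars; everything else stays 'unclear' until the diff is read."""
    if RELEASE_BOOKKEEPING.search(subject):
        return "skip"
    if FEATURE_FLAG_ENV.search(diff_text):
        return "skip"
    return "unclear"

print(first_pass("chore(main): release 8.0.0"))                        # skip
print(first_pass("refactor: tidy spans", "+PHOENIX_EXPERIMENTAL_X=1"))  # skip
print(first_pass("refactor: tidy spans"))                               # unclear
```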
Group related commits that implement one logical feature across server + SDK + CLI. Audit them as one unit.
Breadth before depth. Enumerate every audit candidate in a flat list before going deep on any one. The goal is a complete sweep, not the single juiciest finding.
For each candidate, in the same one-liner, tag the cross-cutting concept(s) it most likely affects, even if the commit's file path doesn't say so. The four concepts:
- tracing — spans, attributes, instrumentation, OpenInference, otel
- evals — evaluators, experiments, datasets, prompts
- cli — CLI commands, flags, JSON output shape
- none — internal/refactor/dep/frontend-only/feature-flag
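The tagging pass can start from crude keyword hints, as long as the tags are treated as hypotheses rather than conclusions. A sketch; the keyword lists are illustrative, and the diff remains the source of truth:

```python
import re

# Keyword → concept hints; an index only, never a reason to skip the diff.
CONCEPT_HINTS = {
    "tracing": r"span|attribut|instrument|openinference|otel",
    "evals": r"evaluat|experiment|dataset|prompt",
    "cli": r"\bcli\b|subcommand|\bflag\b|--\w",
}

def tag(one_liner: str) -> set[str]:
    """Tag a candidate's one-liner with the concepts it most likely affects."""
    hits = {concept for concept, pattern in CONCEPT_HINTS.items()
            if re.search(pattern, one_liner, re.IGNORECASE)}
    return hits or {"none"}

print(tag("feat(evals): add HallucinationEvaluator"))  # {'evals'}
print(tag("chore: bump dev dependencies"))             # {'none'}
```

Note the word boundary on `cli`: a bare substring match would mis-tag every mention of "client".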
Once a candidate is tagged with
tracing / evals / cli, you have committed to a
hypothesis: there is potentially a skill edit here. The only way to legitimately drop
the candidate later is to read the diff and find that the change is internal, or
already documented, or genuinely user-irrelevant. "I don't see the relevant package in
the path" is not a sufficient reason.
Only after this tagged list exists do you expand each entry into edits.
### Phase 3: Locate the real code
For every candidate, open the actual changed files. Commit messages routinely omit:
- New optional parameters added to existing functions
- New exported symbols added alongside a refactor
- New CLI flags, JSON output fields, error codes
- Behavior changes in existing code paths
- Schema changes that change the shape returned by the CLI or REST API
Read enough of each file to answer:
- What is the public surface? Exact symbol names, paths, command names, attribute keys. Write them down.
- What is the behavior? What the resulting code does, not what the commit said.
- Why would a user reach for it? If the code doesn't make this clear, flag it — the skill needs a usage rationale, not just a signature.
- What does the example look like? Draft a real, runnable example using the current signature. Verify imports.
Do not trust a commit's file path to dictate the skill it belongs to. An evaluator change often lives in
packages/phoenix-client/, not packages/phoenix-evals/. A span
attribute lands in the server or a client, not in phoenix-otel. For every audit
candidate, scan all client/SDK directories — Python and TypeScript, generic client
and topical packages — and map each new symbol to the skill that documents the feature
it exposes, per the source-mapping table. The easiest way to miss a finding is to stop
looking after the first package you find.
Pair each Python change with its TypeScript mirror (and vice versa). Nearly every user-facing feature ships on both runtimes, usually in the same PR. When you locate a Python change, grep the TS package equivalent for the same symbol family (same commit SHA, or within a few commits). If the mirror is there, edit the corresponding `-typescript.md` reference too. If it's genuinely Python-only or TS-only, say so in the run summary — that itself is a finding worth noting.
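The "grep the TS equivalent" step can start from a path translation. A sketch; the directory mapping below is an assumption for illustration, not a verified layout — confirm against the actual tree:

```python
def ts_mirror_roots(py_path: str) -> list[str]:
    """Guess which TypeScript package to grep for the mirror of a Python
    change. The mapping is hypothetical; always confirm with a real search."""
    mapping = {
        "packages/phoenix-client/": "js/packages/phoenix-client/",
        "packages/phoenix-evals/": "js/packages/phoenix-evals/",
    }
    return [ts for py, ts in mapping.items() if py_path.startswith(py)]

print(ts_mirror_roots("packages/phoenix-client/src/client.py"))
# ['js/packages/phoenix-client/']
```

An empty result means "no obvious mirror root", not "no mirror" — that case still needs a repo-wide symbol search before calling the change Python-only.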
### Phase 4: Map each finding to skill edits
For each finding, decide:
- Which target skill(s) does it belong to? (Use the source mapping table above.)
- Which file(s) inside that skill?
  - New cross-cutting concept → new `references/<slug>.md` file, plus a row in `SKILL.md`'s reference table and a link in the navigation section.
  - Update to an existing concept → edit the existing reference file.
  - Top-level navigation, workflow, or invocation change → edit `SKILL.md` directly.
- What is the edit? A new section, a code-snippet replacement, a new row in a table, a renamed parameter, a deleted reference to a removed feature.
Reference file naming follows the existing patterns inside each skill — match what's already there:

- `phoenix-tracing/references/`: prefix by category (`setup-*`, `instrumentation-*`, `span-*`, `production-*`, `attributes-*`) and suffix by language (`-python.md`, `-typescript.md`).
- `phoenix-evals/references/`: prefix by category (`evaluators-*`, `experiments-*`, `validation-*`, `production-*`, `fundamentals-*`) and suffix by language.
- `phoenix-cli`: single-file skill — most edits land directly in `SKILL.md`. Only add a `references/` directory if a topic has grown too large for inline (>40 lines).
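The naming rules can be checked mechanically before a new reference file is written. The regexes below are approximations of the convention sketched above, not an authoritative grammar:

```python
import re

# Approximate filename conventions per skill; assumed, not authoritative.
CONVENTIONS = {
    "phoenix-tracing": r"^(setup|instrumentation|span|production|attributes)"
                       r"-[a-z0-9-]+-(python|typescript)\.md$",
    "phoenix-evals": r"^(evaluators|experiments|validation|production|fundamentals)"
                     r"-[a-z0-9-]+(-(python|typescript))?\.md$",
}

def follows_convention(skill: str, filename: str) -> bool:
    """True if a proposed reference filename matches the skill's pattern."""
    pattern = CONVENTIONS.get(skill)
    return bool(pattern and re.match(pattern, filename))

print(follows_convention("phoenix-tracing", "instrumentation-auto-python.md"))  # True
print(follows_convention("phoenix-tracing", "misc-notes.md"))                   # False
```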
When deleting a removed feature, remove every reference (cross-links in `SKILL.md`, table rows, workflow steps), not just the most obvious one.
### Phase 5: Apply the patches
Edit the skill files in place. Use
Edit for surgical changes; use Write for new
reference files. After each edit:
- Verify the example runs against the current code. Every code snippet must use imports, classnames, and parameter names that exist in the tree as of the audited SHA. If you can't verify, don't include the snippet.
- Verify cross-links resolve. If you add a new `references/foo.md`, also add it to the navigation tables in `SKILL.md`. Broken links in a skill make the agent waste tool calls trying to read missing files.
- Quote signatures verbatim. A docstring or method header is the source of truth. Don't paraphrase parameters.
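Orphan files and broken links are both detectable by comparing what's on disk against what `SKILL.md` mentions. A sketch; the link syntax assumed here is a plain `references/<name>.md` path:

```python
import re
import tempfile
from pathlib import Path

def orphans_and_broken(skill_dir: str) -> tuple[set, set]:
    """Compare references/*.md on disk with paths mentioned in SKILL.md.
    Returns (files never linked, links with no file behind them)."""
    root = Path(skill_dir)
    linked = set(re.findall(r"references/[\w-]+\.md", (root / "SKILL.md").read_text()))
    refs = root / "references"
    on_disk = {f"references/{p.name}" for p in refs.glob("*.md")} if refs.is_dir() else set()
    return on_disk - linked, linked - on_disk

# Demo against a throwaway skill directory:
with tempfile.TemporaryDirectory() as d:
    Path(d, "references").mkdir()
    Path(d, "references", "foo.md").write_text("...")
    Path(d, "SKILL.md").write_text("See references/foo.md and references/gone.md")
    print(orphans_and_broken(d))  # (set(), {'references/gone.md'})
```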
For breaking changes (renames, removals), do the destructive edit and the additive edit in the same pass. Removing the old reference without adding the new one leaves agents flailing; adding the new one without removing the old one leaves them confused about which is current.
### Phase 6: Write the run summary
After patches are applied, produce a single markdown summary on stdout. The GitHub Actions wrapper reads this to populate the PR body. Do not write it to a file — let it surface as the agent's final message.
```markdown
# Phoenix Skills Audit — <date range>

**Audited against:** `origin/main` at `<sha>`
**Commits analyzed:** N (M user-facing, K skipped)

## Edits applied

### phoenix-tracing
- `references/instrumentation-auto-python.md` — added section for new auto-instrumentor shipped in <commit>; example uses `<verified import path>`.
- `SKILL.md` — added row for the new reference file in the Quick Reference table.

### phoenix-cli
- `SKILL.md` — added `px experiment compare` subcommand documentation (commit `<sha>`); JSON shape grounded in `js/packages/phoenix-cli/src/commands/...`.
- `SKILL.md` — updated `--format` description; default changed in commit `<sha>`.

### phoenix-evals
- `references/evaluators-pre-built.md` — added entry for `<NewEvaluator>` from commit `<sha>`; example uses signature `<verified>`.

## Skipped commits
Brief list of commits intentionally excluded (deps, refactors, feature flags, frontend changes that don't affect the three skills) so a reviewer can confirm nothing was missed.

## Out-of-scope findings
Optional. If you saw a user-facing change that belongs in a different skill (e.g. `phoenix-server`, `phoenix-frontend`), note it here so the maintainer knows — but do not edit those skills.
```
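The skeleton above can be assembled from the edit notes gathered during Phase 5. A sketch of the shape only; wording, SHAs, and file paths come from the audit itself:

```python
def render_summary(window: str, sha: str, edits: dict[str, list[str]],
                   skipped: list[str]) -> str:
    """Assemble the run-summary markdown, edits grouped by target skill."""
    lines = [
        f"# Phoenix Skills Audit — {window}",
        f"**Audited against:** `origin/main` at `{sha}`",
        "",
        "## Edits applied",
    ]
    for skill, notes in edits.items():
        lines.append(f"### {skill}")
        lines.extend(f"- {note}" for note in notes)
    lines += ["## Skipped commits"] + [f"- {s}" for s in skipped]
    return "\n".join(lines)

summary = render_summary(
    "<date range>", "abc1234",
    {"phoenix-cli": ["`SKILL.md` — updated `--format` description"]},
    ["def5678 chore(main): release"],
)
print("### phoenix-cli" in summary)  # True
```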
## Grounding rules — do not skip
Every edit must be verifiable against the tree at the audited SHA. Skills are read by agents that will not double-check; if you teach them an API that doesn't exist, every downstream task that follows the skill is wrong.
- Cite file paths and line numbers in the run summary for the source code that motivated each edit.
- Quote signatures verbatim. If the code says `def foo(trace_id: str, *, kind: str = "LLM")`, the skill must say that — not `foo(trace_id, kind)`.
- Examples are real. Every snippet must use import paths, class names, and parameter names that exist in the current tree. Run a mental check: would `python -c "from X import Y"` succeed against this repo right now?
- No invented APIs. If unsure whether a symbol is exported, check the package's `__init__.py` / `index.ts` first.
- Rationale comes from the code or the PR. If the why isn't evident, say so in the run summary as something the author needs to supply, and skip the rationale rather than inventing one.
When in doubt, read more code. The cost of an extra file read is tiny compared to the cost of poisoning every future agent with a hallucinated API.
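The `__init__.py` export check is cheap to script as a first filter. A textual sketch: a miss is a strong signal not to document the symbol, while a hit still requires reading the actual import line:

```python
import re
import tempfile
from pathlib import Path

def mentioned_in_init(package_dir: str, symbol: str) -> bool:
    """Cheap textual check: does the package's __init__.py mention `symbol`
    as a whole word? Not proof of export — follow up by reading the file."""
    init = Path(package_dir) / "__init__.py"
    if not init.is_file():
        return False
    return bool(re.search(rf"\b{re.escape(symbol)}\b", init.read_text()))

# Demo against a throwaway package:
with tempfile.TemporaryDirectory() as d:
    Path(d, "__init__.py").write_text("from .client import Client\n")
    print(mentioned_in_init(d, "Client"), mentioned_in_init(d, "Missing"))
```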
## Decision quick reference
| Question | Answer |
|---|---|
| Commit message says `refactor:` — skip? | Read the diff. Refactors often add exports. |
| Frontend-only change — does it affect any of the three skills? | Almost never. Skip and note in "out-of-scope findings". |
| New env var — feature flag? | Skip. Feature flags (`*_DANGEROUSLY_*`, `*_EXPERIMENTAL_*`) stay out of public skills. |
| New env var that controls a public surface? | Update the relevant skill's setup or environment section. |
| Server-only change — relevant to the three skills? | Only if it changes what a client/CLI sees. A new GraphQL field consumed by the CLI is in scope; an internal server refactor is not. |
| Removed/renamed public symbol? | Edit the skill: remove the old reference, add the new one in the same pass. |
| New evaluator class? | Add to the evaluators reference (or create a topic-specific reference if the patterns are new). |
| New CLI subcommand? | Add to the relevant section of `phoenix-cli`'s `SKILL.md`. Document the JSON shape if the command emits structured output. |
| New OpenInference attribute? | Add to the `phoenix-tracing` attributes reference and the relevant CLI JSON-shape doc. |
| Breaking change to an existing skill example? | High priority. Update the example and search the rest of the skill for adjacent uses of the old signature. |
| Feature has no apparent use case from reading the code? | Add the API entry but flag in the run summary that rationale is missing — don't invent motivation. |
| Python-only change — is there a TS mirror? | Check. Most features ship on both runtimes in the same PR; a Python-only change is the exception, not the rule. If confirmed Python-only, call it out in the run summary so someone can chase the TS side. |
| `SKILL.md` links to a reference file that doesn't exist? | Remove the link. Broken cross-links teach nothing and waste tool calls. Don't add a "TODO" flag — just delete. |
| Python change lives in the generic client but conceptually belongs to tracing/evals? | Map it to the conceptual skill. File path is an implementation detail; the skill that documents the feature owns it. |
## Pre-submit checklist
Before exiting, walk through:
- Every edited code snippet uses symbols that exist in the tree at the audited SHA.
- Every new reference file is linked from its skill's `SKILL.md` (no orphan files).
- Every removed reference is also unlinked from `SKILL.md` (no broken cross-links).
- The run summary lists every edit, grouped by target skill, with the source commit SHA.
- Skipped commits are listed so a reviewer can confirm triage.
- Every Phase-2 audit candidate tagged `tracing`/`evals`/`cli` is either reflected in the edits, or the run summary states a concrete reason it was dropped (e.g., "PR #X already self-patched the skill", with the SHA). A tagged candidate that just disappears is a triage bug — go back and reconcile.
- No edits made outside `.agents/skills/phoenix-tracing/`, `.agents/skills/phoenix-cli/`, or `.agents/skills/phoenix-evals/`.
- No invented APIs, no invented imports, no placeholder values disguised as real code.
## Operating modes
This skill runs in two contexts:
- Manual — a developer asks "audit the skills" or "are the skills up to date". Apply the workflow above end-to-end, edit files, print the run summary, and let the user review the diff before committing.
- CI (GitHub Actions) — `.github/workflows/claude-skills-audit.yml` runs this skill weekly. The action commits the patches to a branch and opens a PR using the run summary as the PR body. The skill itself does not commit or push — that's the workflow's job. Just edit the files and emit the summary.