```bash
# Clone the full repo, or copy just this skill into ~/.claude/skills:
git clone https://github.com/Arize-ai/phoenix
T=$(mktemp -d) && git clone --depth=1 https://github.com/Arize-ai/phoenix "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/phoenix-docs-gap-audit" ~/.claude/skills/arize-ai-phoenix-phoenix-docs-gap-audit && rm -rf "$T"
```
# Phoenix Docs Gap Audit
Find everything in the Phoenix repo that shipped recently without proper documentation. The output is a gap report — not release notes, not new docs. The gap report tells the user (or a follow-up skill) exactly what is missing, where it should live, and what the new content should say, grounded in real code.
This skill is deliberately thorough. Documentation gaps are expensive: users hit undocumented features blind, stale examples teach wrong APIs, and in-code docstrings go missing silently because nothing fails CI. A good audit reads the actual code, cross-references every doc surface, and produces a report a maintainer can act on without re-doing the investigation.
## Scope — what counts as "documentation"
Phoenix has several doc surfaces. A feature can be fully released and still be undocumented on most of them. Audit all of these:
| Surface | Location | Tool |
|---|---|---|
| User-facing product docs | `docs/phoenix/` | Mintlify |
| Server README | repo root `README.md` | GitHub landing |
| Python package READMEs | `packages/<pkg>/README.md` | PyPI landing |
| TS package READMEs | `js/packages/<pkg>/README.md` | npm landing |
| Python built-in API docs | `packages/<pkg>/docs/source/` | Sphinx |
| TS built-in API docs | TSDoc in `js/packages/<pkg>/src/` | TypeDoc |
| Python docstrings | class/function docstrings in `packages/<pkg>/src/` | inline |
| TypeScript TSDoc | on exported symbols in `js/packages/<pkg>/src/` | inline |
| Code comments | non-obvious logic in any source file | inline |
| llms.txt | | machine-readable docs index |
| Onboarding snippets | `app/src/` onboarding components and similar | in-product |
Missing from any of these is a gap. Stale (contradicts current code) is also a gap, and is more dangerous than missing because it teaches users the wrong thing.
## Workflow
### Phase 1: Gather commits
Default window is the last 7 days on `main`. The user may override (e.g., "since last release", "last month", a specific tag range). Translate their phrasing into a concrete range before running anything.
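For instance, "since last release" can be pinned to a concrete tag range before anything runs. A minimal sketch, assuming releases are tagged and reachable from `origin/main`:

```bash
# Resolve "since last release" into a concrete range (assumes release tags exist).
git fetch origin main --tags --quiet
PREV_TAG=$(git describe --tags --abbrev=0 origin/main)
echo "Audit range: ${PREV_TAG}..origin/main"
```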
Always audit `origin/main`, not the local `main` branch. Local `main` is routinely stale by dozens of commits in active repos — if you audit the stale tip you will silently miss every feature and every breaking change that shipped after your last git pull. An early iteration of this skill missed a major breaking change this way.
```bash
# Refresh the remote-tracking branch first — this does NOT touch your working tree
git fetch origin main --quiet

# Then log against origin/main with file stats
git log --since="7 days ago" origin/main --no-merges --pretty=format:"%h %s" --name-status

# If the user gave you a tag range
git log <prev-tag>..<current-tag> --no-merges --pretty=format:"%h %s" --name-status

# Sanity check: how far ahead is origin?
git rev-list --count main..origin/main
```
Save the raw list. You will refer back to it repeatedly — don't re-run git for every commit. Note the commit you are auditing against in the report header (e.g. "audited against `origin/main` at <sha>") so a reader can reproduce the finding.
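One way to capture that header line and stash the raw list for later phases (a sketch, not a required command; the temp-file path is arbitrary):

```bash
# Pin the exact tip being audited and keep the commit list around for later phases.
AUDIT_SHA=$(git rev-parse origin/main)
echo "Audited against origin/main at ${AUDIT_SHA}"
git log --since="7 days ago" origin/main --no-merges \
  --pretty=format:"%h %s" --name-status > /tmp/phoenix-audit-commits.txt
```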
### Phase 2: Triage
Commit messages lie — or at least under-report. Use them as an index, not a source of truth. Split the list into three buckets:
- Audit candidates — anything that could plausibly affect a user: new APIs, new CLI flags, new UI, new config, new env vars, new providers, behavior changes, performance changes visible to users, breaking changes, deprecations. In-product onboarding snippets, integration registries, and provider configs under `app/` count as user-facing — they are literally the instructions users copy out of the product, so a new snippet without a matching `docs/phoenix/integrations/<...>.mdx` page is a real gap, even though the change technically lives in the frontend.
- Skip — dep bumps (`chore(deps):`), internal refactors with no public surface, test-only changes, CI/build changes, formatting, skill/workflow edits, release-please bookkeeping (`chore(main): release …`), feature flags (env vars named `*_DANGEROUSLY_*`, `*_EXPERIMENTAL_*`, `*_ENABLE_*`, internal toggles, or otherwise intentionally undocumented escape hatches — these are deliberately kept out of public docs and should never be flagged as missing documentation). A rough pre-filter for the obvious skips is sketched after this list.
- Unclear — when you cannot tell from the message and changed paths. Default to reading the diff rather than guessing. It is cheap and catches features hidden behind `refactor:` or `chore:` prefixes.
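As referenced in the Skip bucket above, a rough pre-filter can thin out the obvious skips before manual triage. This is a sketch only: the patterns are illustrative, and everything that survives still needs human judgment.

```bash
# Keep only subject lines, then drop commits matching obvious skip patterns.
grep -E "^[0-9a-f]{7,} " /tmp/phoenix-audit-commits.txt \
  | grep -vE "chore\(deps\):|chore\(main\): release" \
  > /tmp/phoenix-audit-candidates.txt
wc -l < /tmp/phoenix-audit-candidates.txt   # how many commits still need triage
```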
Group related commits that implement one logical feature across server + SDK + UI. Audit them as one unit.
Breadth before depth. Enumerate every audit candidate in a flat list before you start going deep on any of them. It is tempting to find a huge breaking change and spend the rest of the audit documenting it, but the user is asking "what landed this week that needs docs" — the answer is a complete list, not the single juiciest finding. A one-line entry per audit candidate is fine at this stage:
```
Audit candidates (N total):
1. 28ecfe023 — ATIF trajectory upload helper (packages/phoenix-client)
2. 15e641510 — evals 3.0 legacy removal (BREAKING, ~24 docs affected)
3. ed559c46e — GraphQL: require explicit first on forward pagination (BREAKING)
4. 81d296bee — 7 new OpenAI-compatible provider onboarding snippets
5. c70eca619 — EvaluatorParams.traceId (TS client)
6. cc644897c — PXI chat tracing env vars
...
```
Only after this list exists do you expand each entry into a full gap analysis. If one finding is so large that fully documenting all its affected files would blow your budget, it is better to have ten entries at medium depth than one exhaustive entry and nine missing. The reader can always come back for more detail; they cannot ask about a feature the report never mentions.
### Phase 3: Locate the real code
For every audit candidate, open the actual changed files. Commit messages routinely omit:
- New optional parameters added to existing functions
- New exported symbols added alongside a refactor
- Config keys, env vars, and CLI flags
- Behavior changes in existing code paths
- New React components or pages
- Schema/migration changes
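To surface those omissions, list what a commit actually touched and then read the files themselves. A sketch with placeholder values:

```bash
SHA=28ecfe023   # placeholder — substitute the audit candidate's SHA
# What did this commit really change? (--name-status lists every touched file)
git show --name-status --oneline "$SHA"
# Read the current version of a touched file, not just the diff hunk
git show "origin/main:packages/phoenix-client/README.md" | head -n 80
```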
Read enough of each changed file to answer:
- What is the public surface? Which exported symbols, endpoints, CLI commands, env vars, or UI entry points did this touch? Write down exact names and file paths.
- What is the behavior? Not "what the commit did" — what the resulting code does now.
- Why would a user reach for it? If you cannot answer this from the code, the gap is bigger than just missing docs: the feature itself may lack a clear use case. Flag it.
- When should a user NOT use it? Constraints, alternatives, tradeoffs. These are the rationale bits the user specifically asked for.
Map each public surface to a package of record:
| Surface | Package(s) |
|---|---|
| server REST | `src/phoenix/server/` (plus the Python/TS client wrappers) |
| server GraphQL | `src/phoenix/server/` |
| Python client | `packages/phoenix-client` (`arize-phoenix-client`) |
| Python evals | `packages/phoenix-evals` |
| Python otel | `packages/phoenix-otel` |
| TS client | `js/packages/phoenix-client` (`@arizeai/phoenix-client`) |
| TS evals | `js/packages/phoenix-evals` |
| TS MCP | `js/packages/phoenix-mcp` |
| TS CLI | `js/packages/phoenix-cli` |
| Phoenix UI | `app/` |
A single feature often spans multiple packages — keep that mapping, it tells you which doc surfaces to check.
### Phase 4: Check every doc surface
For each public surface touched, check all of the applicable doc locations. Do not stop at the first hit. "It's mentioned in the README" is not the same as "it's documented in Mintlify, the README, the docstring, and the TypeDoc output".
Search strategy — grep first, then read targeted regions. Thorough does not mean exhaustive. Reading every doc file top-to-bottom wastes time and tokens. A reliable pattern:
- Grep the symbol, path, or feature name across each doc surface (Mintlify, READMEs, Sphinx sources, TSDoc) to locate candidate files and line numbers.
- Only read the files that matched — and read a window (±30 lines) around each hit, not the whole file.
- For "is it missing?" claims, a zero-result grep across the relevant tree is the evidence — quote the grep command and the empty result in the report.
- Escalate to full-file reads only when the grep result is ambiguous (e.g. nearby code snippet in a different language, alias, partial name).
This turns "audit 20 doc files" into "grep 20 files, read the 3 that hit." It cuts tool calls by roughly an order of magnitude and is more defensible — a grep command is a reproducible claim, a full-file read is not.
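A concrete shape for that pattern, using the same placeholder symbol as the report template (`new_thing`):

```bash
SYMBOL="new_thing"   # placeholder, the symbol under audit

# 1) Grep each doc surface; -n gives the line numbers you will cite.
grep -rn "$SYMBOL" docs/phoenix/ packages/*/README.md js/packages/*/README.md

# 2) A zero-result grep is the evidence for a "missing" claim; quote it in the report.
grep -rn "$SYMBOL" docs/phoenix/ || echo "no hits under docs/phoenix/"

# 3) Read only a window (roughly ±30 lines) around each hit, not the whole file.
```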
Adjacent staleness. When you read an existing doc to check coverage of this week's change, skim the rest of that doc for any other stale references to the same package or symbol family — not just the bit the commit touched. A README that gets no commits for six months is exactly where stale examples accumulate. If you find one, flag it as "adjacent staleness" with a note that it predates the window; do not bury it just because it isn't from this week's commits.
For Python changes touching `packages/<pkg>/src/...`:
- Grep `docs/phoenix/**/*.mdx` for the symbol name, endpoint path, or config key.
- Read `packages/<pkg>/README.md` — is the feature listed? Are examples still valid?
- Check `packages/<pkg>/docs/source/` — if Sphinx picks up the symbol, confirm the autosummary or explicit reference exists.
- Open the source file and check the docstring on the class/function. Does it describe parameters, return values, raises, and at least one example? Is the example using the current signature?
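Put together, the Python-side checks might look like this. A sketch: the package directory and symbol names are placeholders.

```bash
PKG=phoenix-client     # placeholder directory under packages/
SYMBOL="new_thing"     # placeholder symbol

grep -rn "$SYMBOL" docs/phoenix/ --include="*.mdx"        # Mintlify docs
grep -n  "$SYMBOL" "packages/${PKG}/README.md"            # PyPI landing
grep -rn "$SYMBOL" "packages/${PKG}/docs/source/"         # Sphinx sources
grep -rn "def ${SYMBOL}" "packages/${PKG}/src/"           # find the def, then read its docstring
```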
For TypeScript changes touching `js/packages/<pkg>/src/...`:
- Grep `docs/phoenix/**/*.mdx` for the symbol or feature name.
- Read `js/packages/<pkg>/README.md` — listed? examples current?
- Check TSDoc (`/** ... */`) on exported symbols. TypeDoc renders these verbatim — if they're missing, so is the TypeDoc page.
- If the package exposes onboarding snippets (e.g. `phoenix-client` onboarding), check whether the snippet still uses current signatures.
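The TypeScript pass is symmetrical, again with placeholder names:

```bash
PKG=phoenix-client     # placeholder directory under js/packages/
SYMBOL="newThing"      # placeholder exported symbol

grep -rn "$SYMBOL" docs/phoenix/ --include="*.mdx"           # Mintlify docs
grep -n  "$SYMBOL" "js/packages/${PKG}/README.md"             # npm landing
# Is there a /** ... */ TSDoc block right above the export?
grep -rn -B 3 "export function ${SYMBOL}" "js/packages/${PKG}/src/"
```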
For server changes touching `src/phoenix/server/...`:
- REST endpoints: check `docs/phoenix/sdk-api-reference/` and the OpenAPI spec at `schemas/openapi.json`. Check corresponding Python/TS client wrappers.
- GraphQL: the schema is self-documenting, but user-facing features built on it belong in `docs/phoenix/`. Check whether a product doc exists.
- Env vars / config: grep `docs/phoenix/environments.mdx`, `self-hosting/`, and `production-guide.mdx`. New env vars with no mention in any of those are gaps — except feature flags (e.g. `PHOENIX_CLI_DANGEROUSLY_ENABLE_DELETES`, experimental toggles, or any env var the code treats as an internal escape hatch). These are intentionally omitted from public docs; do not flag them as gaps, and if you find them already documented, flag that as a gap in the opposite direction (feature flag leaked into public docs). A quick grep for this check is sketched after this list.
- Migrations/schema changes: check `docs/phoenix/self-hosting/` and `MIGRATION.md`.
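For a new env var, the check might look like the sketch below. The variable name is a placeholder; remember the feature-flag exception above.

```bash
VAR="PHOENIX_NEW_SETTING"   # placeholder env var name

grep -rn "$VAR" docs/phoenix/environments.mdx docs/phoenix/self-hosting/ \
  || echo "no mention of $VAR: a gap, unless it is a feature flag or internal escape hatch"
```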
For UI changes touching `app/src/`:
- Is there a `docs/phoenix/` page for the feature? User-facing UI should have one.
- Does the existing page still match current screenshots and flow?
Code comments are a softer target — only flag as gaps when the changed code is non-obvious (complex algorithms, subtle invariants, workarounds) and carries no explanatory comment. A well-named function does not need a comment; a 40-line regex does.
### Phase 5: Classify every gap
Every gap gets one of these classifications. The labels matter because they drive priority and the shape of the fix.
- Missing — no doc exists for a shipped feature on a surface where it should exist.
- Stale — doc exists but contradicts current code (wrong signature, removed param, renamed function, outdated example output).
- Incomplete — doc exists but omits a material aspect the user needs (e.g. documents the function but not the new optional parameter, documents the happy path but not error modes, no example).
- Missing rationale — doc describes what the feature does but not why a user would choose it or when they should prefer an alternative. This is a gap in its own right for anything non-trivial.
Also assign a severity:
- High — public API, breaking change, new primary feature, anything a first-time user will hit. Stale examples are always high.
- Medium — optional parameters, secondary features, minor behavior changes.
- Low — internal-facing helpers that happen to be exported, edge-case flags.
## Grounding rules — do not skip
Every claim in the gap report must be verifiable against the current tree. This is the single most important thing this skill enforces.
- Cite file paths and line numbers for the code and for any existing doc you are calling missing, stale, or incomplete. `src/phoenix/server/api/routers/v1/spans.py:142` is useful; "the spans endpoint" is not.
- Quote the signature you read. Do not paraphrase. If you say a function takes `trace_id: str`, it must be because you read `def foo(trace_id: str)` in the source.
- Draft examples from real code. Every code snippet you propose must use import paths, class names, and parameter names that exist in the current tree. If you can't verify it, don't include it.
- No hallucinated APIs. If you're unsure whether a symbol is exported, check the package's `index.ts` / `__init__.py` (a quick check is sketched below).
- Stale checks are textual. Read the existing doc's snippet, then read the current source. If they don't match, it's stale — say exactly what changed.
- Rationale must come from the code or adjacent discussion, not from your priors. If the why isn't evident from reading the PR, the tests, and the surrounding code, say so and flag it as something the author needs to supply. Do not invent motivation.
When in doubt, read more code. The cost of an extra file read is tiny compared to the cost of shipping a gap report full of invented APIs.
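A quick way to settle whether a symbol is actually exported, rather than guessing. A sketch: entry-point paths vary by package, and the symbols are placeholders.

```bash
# Python: is the symbol re-exported from the package's __init__.py?
grep -rn "new_thing" packages/phoenix-client/src/ --include="__init__.py"

# TypeScript: is it exported from the package's index.ts?
grep -n "newThing" js/packages/phoenix-client/src/index.ts
```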
## Report format
Produce a single markdown report. The user will consume it directly or hand it to another skill to actually write the docs.
````markdown
# Phoenix Docs Gap Audit — <date range>

## Summary
- N commits analyzed, M user-facing, K gaps found
- High-severity gaps: <count>
- Packages touched: <list>

## Gaps

### <Feature title>

**Source:** <commit SHAs and PR links>
**Package(s):** <e.g., arize-phoenix-client (Python), @arizeai/phoenix-client (TS)>
**Public surface:**
- `packages/phoenix-client/src/phoenix/client/foo.py:42` — `def new_thing(trace_id: str, ...)`
- `js/packages/phoenix-client/src/foo.ts:87` — `export function newThing(...)`

**What it does:** <1-2 sentences grounded in the code, not the commit message>

**Why a user would use it:** <rationale — concrete use case, drawn from the code or tests>

**When NOT to use it / alternatives:** <constraints, tradeoffs, or "N/A — no alternative">

**Gaps:**
- [High | Missing] `docs/phoenix/tracing/` — no page mentions `new_thing`. Should live at `docs/phoenix/tracing/how-to-tracing/<slug>.mdx`.
- [High | Stale] `packages/phoenix-client/README.md:120` — example shows old signature `new_thing(span_id)`; current is `new_thing(trace_id)`.
- [Medium | Incomplete] `packages/phoenix-client/src/phoenix/client/foo.py:42` — docstring describes function but omits `return` type and has no example.
- [Medium | Missing rationale] None of the existing docs explain *why* a user would prefer `new_thing` over `old_thing`.

**Proposed content** (rooted in current code):

```python
from phoenix.client import Client

client = Client()
# real, verified example using the current signature
client.new_thing(trace_id="...")
```

Use `new_thing` when you need X. Prefer it over `old_thing` when Y, because Z.

### <Next feature>
...

## Commits skipped
Brief list of commits intentionally excluded (deps, refactors, etc.) so the reader can confirm nothing was missed.
````
Keep the report tight. One gap per feature, not one per doc surface — group the surface-level gaps as bullets under a single feature entry. A reader should be able to scan it and know exactly what to fix next.
## Decision quick reference
| Question | Answer |
|---|---|
| Commit message says `refactor:` — skip? | Read the diff. Refactors often add exports. |
| Feature exists only in an example file? | Not a public API — skip unless the examples directory is documented as a supported surface. |
| New env var, no doc anywhere? | High-severity Missing in `docs/phoenix/environments.mdx` — unless it is a feature flag (e.g. `PHOENIX_CLI_DANGEROUSLY_ENABLE_DELETES`, experimental toggle). Feature flags are deliberately undocumented; skip them. |
| Feature flag appears in public docs? | Flag as a gap in the opposite direction — feature flags should be removed from user-facing docs. |
| Stale doc with old param name? | High-severity Stale. Stale examples mislead users. |
| Python docstring missing on exported function? | Medium Incomplete (package-level docs). |
| TSDoc (`/** ... */`) missing on an exported function? | Medium Incomplete (TypeDoc will render nothing). |
| Feature has no apparent use case from reading the code? | Flag as "rationale unclear — needs author input". Do not invent one. |
| Unsure if something is user-facing? | Check the package's `__init__.py` / `index.ts` exports. If it's exported, assume yes. |
## Pre-submit checklist
Before handing the report to the user, walk through this (a small citation self-check is sketched after the list):
- Every cited file path and line number was verified against the current tree
- Every proposed code snippet uses symbols that actually exist in the source
- Every "stale" claim quotes both the doc and the current code
- Every "missing rationale" entry either proposes rationale grounded in the code, or explicitly flags the rationale as needing author input
- The skipped-commits list is present so the reader can audit your triage
- Report is scannable: headings, severity labels, one entry per feature
- No invented APIs, no invented imports, no placeholder values disguised as real code
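A minimal self-check for the first item: confirm every backtick-quoted `path:line` citation in the draft resolves in the current tree. The report filename is a placeholder.

```bash
REPORT=gap-audit.md   # placeholder, path to the draft report

# Pull out every `path:line` citation; flag missing files and out-of-range lines.
grep -oE '`[^`]+:[0-9]+`' "$REPORT" | tr -d '`' | sort -u | while IFS=: read -r f n; do
  if [ ! -f "$f" ]; then
    echo "MISSING FILE: $f"
  elif [ "$(wc -l < "$f")" -lt "$n" ]; then
    echo "LINE OUT OF RANGE: $f:$n"
  fi
done
```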