Awesome-omni-skill analysis

Docent is a platform for analyzing AI agent behavior using large language models. Use this skill anytime you want to use Docent to analyze AI agent behavior.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data-ai/analysis" ~/.claude/skills/diegosouzapw-awesome-omni-skill-analysis && rm -rf "$T"
manifest: skills/data-ai/analysis/SKILL.md
source content

Docent Analysis Guide

You can interact with Docent by writing Python scripts that use the Docent SDK, and by calling Docent MCP tools. If Docent MCP tools are not available, alert the user that the Docent MCP server is not installed correctly.

Data models and key concepts

  • A transcript is a sequence of messages from the system, the agent (aka assistant), the user, and/or tools that the agent calls.
  • An agent run represents an AI agent attempting a task or interacting with a user. An agent run may contain one or more transcripts.
  • A collection contains agent runs from a certain experiment or benchmark. When we query, analyze, or compare agent runs, we do so within one collection at a time.

High-level analysis approach

The Docent SDK facilitates two main technqiues:

  • LLMRequest, to have LLMs read transcripts and perform qualitative analysis
  • DQL, to query and aggregate data already in the database (e.g. metadata that was logged with agent runs and transcripts, previous analysis results)

Sometimes, the user's request will clearly pertain to one of these technqiues. Other times, the user may instead ask high-level questions about the behavior of their AI agents, and you will have to investigate. In the latter case, the following general process is recommended:

  1. Use the
    get_metadata_fields
    MCP tool to understand the structure of agent run metadata for the current collection.
  2. If the metadata fields might shed light on the user's question, you may use DQL to query the metadata. You may do this autonomously without consulting the user.
  3. After querying the metadata, you must decide whether qualitative analysis (reading the transcripts) would provide further insight. If it seems appropriate, create a plan to perform the analysis using LLMRequest. Feel free to ask the user clarifying questions. You must let the user review your plan before you proceed. Then write a script to implement the plan, and run it.

Analysis guidelines and reminders

  • If the user asks you to "summarize the agent runs", "classify the results", or similar, they do not necessarily mean that you (the coding agent) should do so directly. In most cases, it is better to use the Docent SDK to submit an LLM analysis request. Then you may open the results of that analysis to show the user.
  • Agent runs contain metadata. Metadata varies by collection. Do not make assumptions about the structure of run metadata. Use the
    get_metadata_fields
    MCP tool to find out.
  • If you are writing code that will submit LLMRequests, you are encouraged to write it out as a script so you can improve and re-use it later. Unless otherwise instructed, you may place analysis scripts in the current working directory. Quick DQL queries can be done in scripts or inline with the Bash tool at your discretion.
  • Unless informed otherwise, assume uv is used for python package management. Run your scripts with
    uv run
    .
  • If you're not sure what collection the user is talking about, refer to the docent.env file in the working directory. If it does not exist, or if it does not include DOCENT_COLLECTION_ID, ask the user to paste the collection UUID.
  • When writing code to perform analysis, Don't Repeat Yourself. This is particularly important when it comes to prompts for LLMs. The user will likely want to modify prompts, and they should not have to track down multiple copies of a prompt throughout your code. If you need to create different variants of a prompt, build them from reusable pieces and/or use string interpolation, so there is a single source of truth for each part of the prompt.
  • When writing code to perform analysis, keep variable names generic. For example, if you are comparing the performance of two models, you might refer to them as "model_a" and "model_b" in your code, and then declare the identity of these models in one place only. This makes your code more reusable, so we can perform the same analysis on other data.
  • When writing code to perform analysis, be sparing with print statements. In many cases a simple success/failure message at the end is enough.
  • If you are analyzing a limited sample of many items (e.g. because you can only fit so many in the context window), be mindful of how you are sampling them. The most recent N items may be a biased sample. It is safe to assume that UUIDs are random.
  • Metadata alone may provide an incomplete picture. Don't forget to consider qualitative analysis!
  • The user must approve all plans for LLMRequest analysis. If the user tells you how to perform the analysis, that counts as approval. If you use your own judgement in planning the analysis, you must present your plan to the user for approval before implementing it.

Building LLMRequests with the Prompt API

The Docent SDK can be installed via

docent-python
(e.g.,
uv add docent-python
).

Start with these imports when using the Docent SDK:

from docent.sdk.client import Docent
from docent.sdk.llm_request import LLMRequest, ExternalAnalysisResult
from docent.sdk.llm_context import Prompt, AgentRunRef, ResultRef

client = Docent()

Note: the Docent SDK will automatically discover and load a docent.env file if it exists. You do not need to explicitly source docent.env.

The

Prompt
class takes a list of strings and context item refs. Context items are agent runs, transcripts, and analysis results. You can reference items in a prompt without fetching their full content.

run = AgentRunRef(id="<uuid>", collection_id="<uuid>")
transcript = TranscriptRef(id="<uuid>", agent_run_id="<uuid>", collection_id="<uuid>")
result = ResultRef(id="<uuid>", result_set_id="<uuid>", collection_id="<uuid>")

request = LLMRequest(
    prompt=Prompt([run, "Summarize this run."]),
    metadata={"summarized_run_id": run.id}
)

A prompt may include multiple context items of different types, which is useful for comparing behavior across runs and looking for recurring patterns.

When the same ref appears multiple times in a single prompt, the first occurrence renders as full content alongside an alias, and subsequent occurrences render as just the alias.

A useful pattern is to use DQL to get the IDs of relevant runs, then iterate over those run IDs to build the prompts for the language model.

Remember that the context window of the LLM is limited. Avoid passing more than a few full agent runs in a single prompt. However, it is fine to pass many LLMRequests to a single result set, since each request is processed separately.

Writing a good prompt

The quality of LLM output depends on the quality the prompt you write for the LLMRequest. The LLM knows it is analyzing agent run transcripts, and knows how to cite items in its context. You can ask the LLM to cite items in its context and it will just work without further guidance. Otherwise, you are responsible for understanding the purpose of the LLMRequest and writing a clear prompt articulating what you want the LLM to do.

  • Include any information about the runs that is not obvious from the transcripts but important for analyzing them appropriately
  • How detailed or brief should output be? A short paragraph is a good default, but it depends on the nature of the analysis.
  • If you're asking for extensive (multi-paragraph) response, how should it be structured? Note: markdown is supported
  • If you are looking for a particular behavior, how exactly is that behavior defined? If you're proposing a specific definition, make sure the user signs off on it.
  • If you are asking the LLM to analyze other analysis results, remind it to cite those analysis results, NOT the original transcripts which the results may refer to.

Submitting Requests for Backend Processing

Use

submit_llm_requests()
to have the backend process your requests:

result = client.submit_llm_requests(
    collection_id="<collection-uuid>",
    requests=[request1, request2, ...], # if submitting multiple requests, send them in a batch, not one-at-a-time
    model_string="openai/gpt-5-mini", # pass a model explicitly, use this one by default
    result_set_name="cheating/v1",  # hierarchical naming
)

Use hierarchical names with

/
separators for organization. It's often good to have a component like
v1
,
v2
, etc. so you can iterate on your methodology and compare results.

Important: Do not use openai/gpt-4o or openai/gpt-4o-mini. Those models are obsolete and superseded by openai/gpt-5 and openai/gpt-5-mini respectively.

Structured Output

By default, each LLMRequest will produce a text response. If you need more structured output, you can pass a JSON schema when you create a result set. All results in a result set must have the same output schema. Keep things simple and do not request more fields than you need.

Output schemas can have string, number, and boolean properties. They should not have nested objects or arrays.

result = client.submit_llm_requests(
    collection_id="<uuid>",
    requests=[request],
    result_set_name="classification/v1",
    output_schema={
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["helpful", "harmful", "neutral"]},
            "confidence": {"type": "number"},
            "reasoning": {"type": "string"}
        },
        "required": ["category", "confidence", "reasoning"]
    }
)

Presenting Results

Once a batch of LLM requests is submitted, open the result set in the browser. Do not wait for a request to finish to open the results. To open a Docent URL in the browser, use the

navigate_to
tool from the Docent MCP server.

Do not attempt to interact with the Docent web UI using other browsing tools.

By default, let user view the results and draw their own conclusion. If asked to draw a conclusion, you may fetch and read results. Never draw conclusions from LLMRequest analysis without reading the results.

Retrieving Results programmatically

You can retrieve results programmatically if you need to process them further (e.g. make a chart, or pass them to another LLMRequest).

Note the browser is the preferred way to view results. Only retrieve results programmatically if you need to process them further. Do not retrieve results for presentation to the user, unless the user specifically requests.

# List all result sets (optionally filtered by prefix)
sets = client.list_result_sets(collection_id, prefix="analysis/")

# Get result set metadata
result_set = client.get_result_set(collection_id, "analysis/experiment_1")

# Get results as DataFrame
df = client.get_result_set_dataframe(
    collection_id,
    "analysis/experiment_1",
)

DQL (Docent Query Language)

Docent Query Language is a read-only SQL subset that supports ad-hoc exploration in Docent.

Queries can only run over a single collection by design.

Executing DQL via the Python SDK

from docent.sdk.client import Docent

client = Docent()
collection_id = "<collection-uuid>"

# (Optional) inspect available tables/columns
schema = client.get_dql_schema(collection_id)

# Execute a DQL query
result = client.execute_dql(
    collection_id,
    "SELECT agent_runs.id AS agent_run_id FROM agent_runs LIMIT 10",
)

# Convert to dict rows (or use result['columns'] + result['rows'] directly)
rows = client.dql_result_to_dicts(result)

Available Tables and Columns

TableDescription
agent_runs
Information about each agent run in a collection.
transcripts
Individual transcripts tied to an agent run; stores serialized messages and per-transcript metadata.
transcript_groups
Hierarchical groupings of transcripts for runs.
judge_results
Scored rubric outputs keyed by agent run and rubric version.
results
Individual LLM analysis results from result sets.

agent_runs

ColumnDescription
id
Agent run identifier (UUID).
collection_id
Collection that owns the run
name
Optional user-provided display name.
description
Optional description supplied at ingest time.
metadata_json
User supplied metadata, stored as JSON.
created_at
When the run was recorded in Docent.

transcripts

ColumnDescription
id
Transcript identifier (UUID).
collection_id
Collection that owns the transcript.
agent_run_id
Parent run identifier; joins back to
agent_runs.id
.
name
Optional transcript title.
description
Optional description.
transcript_group_id
Optional grouping identifier.
messages
Binary-encoded JSON payload of message turns.
metadata_json
Binary-encoded metadata describing the transcript.
created_at
Timestamp recorded during ingest.

transcript_groups

ColumnDescription
id
Transcript group identifier.
collection_id
Collection that owns the transcript.
agent_run_id
Parent run identifier; joins back to
agent_runs.id
.
name
Optional name for the group.
description
Optional descriptive text.
parent_transcript_group_id
Identifier of the parent group (for hierarchical groupings).
metadata_json
JSONB metadata payload for the group.
created_at
Timestamp recorded during ingest.

judge_results

ColumnDescription
id
Judge result identifier.
agent_run_id
Run scored by the rubric.
rubric_id
Rubric identifier.
rubric_version
Version of the rubric used when scoring.
output
JSON representation of rubric outputs.
result_metadata
Optional JSON metadata attached to the result.
result_type
Enum describing the rubric output type.

results

ColumnDescription
id
Result identifier (UUID).
result_set_id
Parent result set identifier; joins back to
result_sets.id
.
llm_context_spec
JSON specification describing the LLM context used.
prompt_segments
The user prompt sent to the LLM.
user_metadata
Optional JSON metadata supplied by the user.
output
JSON output from the LLM (for string schemas:
{"output": str, "citations": [...]}
).
error_json
JSON error details if the LLM call failed.
input_tokens
Number of input tokens consumed.
output_tokens
Number of output tokens generated.
model
Model identifier used for the request.
created_at
Timestamp when the result was created.

JSON Metadata Access Patterns

Docent stores user-supplied metadata as JSON. Access using Postgres operators:

-- Filter agent runs by a metadata attribute
SELECT id, name
FROM agent_runs
WHERE metadata_json->>'environment' = 'staging';
-- Retrieve nested transcript metadata
SELECT
  id,
  metadata_json->'conversation'->>'speaker' AS speaker,
  metadata_json->'conversation'->>'topic' AS topic
FROM transcripts
WHERE metadata_json->>'status' = 'flagged';
-- Cast numeric metadata for aggregation
SELECT
  AVG(CAST(metadata_json->>'latency_ms' AS DOUBLE PRECISION)) AS avg_latency_ms
FROM agent_runs
WHERE metadata_json ? 'latency_ms';

When querying JSON fields, comparisons default to string semantics. Cast values when you need numeric ordering or aggregation.

Allowed Syntax

Feature
SELECT
,
DISTINCT
,
FROM
,
WHERE
, subqueries
JOIN
,
LEFT JOIN
,
RIGHT JOIN
,
FULL JOIN
,
CROSS JOIN
WITH
(CTEs)
UNION [ALL]
,
INTERSECT
,
EXCEPT
GROUP BY
,
HAVING
Aggregations (
COUNT
,
AVG
,
MIN
,
MAX
,
SUM
,
STDDEV_POP
,
STDDEV_SAMP
,
VAR_POP
,
VAR_SAMP
,
ARRAY_AGG
,
STRING_AGG
,
JSON_AGG
,
JSONB_AGG
,
JSON_OBJECT_AGG
,
PERCENTILE_CONT
,
PERCENTILE_DISC
with
WITHIN GROUP
)
Window functions (
ROW_NUMBER
,
RANK
,
DENSE_RANK
,
NTILE
,
LAG
,
LEAD
,
FIRST_VALUE
,
LAST_VALUE
,
NTH_VALUE
,
PERCENT_RANK
,
CUME_DIST
)
ORDER BY
,
LIMIT
,
OFFSET
Conditional & null helpers (
CASE
,
COALESCE
,
NULLIF
)
Boolean logic (
AND
,
OR
,
NOT
)
Comparison operators (
=
,
!=
,
<
,
<=
,
>
,
>=
,
IS
,
IS NOT
,
IS DISTINCT FROM
,
IN
,
BETWEEN
,
LIKE
,
ILIKE
,
EXISTS
,
SIMILAR TO
,
~
,
~*
,
!~
,
!~*
)
Arithmetic & math (
+
,
-
,
*
,
/
,
%
,
POWER
,
ABS
,
SIGN
,
SQRT
,
LN
,
LOG
,
EXP
,
GREATEST
,
LEAST
,
FLOOR
,
CEIL
,
ROUND
,
RANDOM
)
String helpers (
SUBSTRING
,
LEFT
,
RIGHT
,
LENGTH
,
UPPER
,
LOWER
,
INITCAP
,
TRIM
,
REPLACE
,
SPLIT_PART
,
POSITION
,
CONCAT
,
CONCAT_WS
,
STRING_AGG
)
JSON operators & functions (
->
,
->>
,
#>
,
#>>
,
@>
,
?
, `?
Date/time basics (
CURRENT_DATE
,
CURRENT_TIME
,
CURRENT_TIMESTAMP
,
NOW()
,
EXTRACT
,
DATE_TRUNC
,
AGE
,
AT TIME ZONE
,
timezone()
)
Interval arithmetic (
timestamp +/- INTERVAL
,
INTERVAL
literals,
MAKE_INTERVAL
,
JUSTIFY_DAYS
,
JUSTIFY_HOURS
,
JUSTIFY_INTERVAL
)
Construction & conversion (
MAKE_DATE
,
MAKE_TIME
,
MAKE_TIMESTAMP
,
MAKE_TIMESTAMPTZ
,
TO_CHAR
,
TO_DATE
,
TO_TIMESTAMP
,
DATE_PART
)
Array helpers (
ARRAY[...]
,
array_cat
,
array_length
,
cardinality
,
unnest
,
ARRAY(SELECT ...)
,
= ANY
,
= ALL
,
array_position
,
array_remove
)
Type helpers (
CAST
,
::
)

Unsupported constructs include

*
, user-defined functions, and any DDL or DML commands.

Example Queries

Recent Runs

SELECT
  id,
  name,
  metadata_json->'model'->>'name' AS model_name,
  created_at
FROM agent_runs
WHERE metadata_json->>'status' = 'completed'
ORDER BY created_at DESC
LIMIT 10;

Transcript Counts per Group

SELECT
  tg.id AS group_id,
  tg.name AS group_name,
  COUNT(t.id) AS transcript_count
FROM transcript_groups tg
JOIN transcripts t ON t.transcript_group_id = tg.id
GROUP BY tg.id, tg.name
HAVING COUNT(t.id) > 1
ORDER BY transcript_count DESC;

Flagged Judge Results

SELECT
  jr.agent_run_id,
  jr.rubric_id,
  jr.result_metadata->>'label' AS label,
  jr.output->>'score' AS score
FROM judge_results jr
WHERE jr.result_metadata->>'severity' = 'high'
  AND EXISTS (
    SELECT 1
    FROM agent_runs ar
    WHERE ar.id = jr.agent_run_id
      AND ar.metadata_json->>'environment' = 'prod'
  )
ORDER BY score DESC
LIMIT 25;

Completion Rate by Environment

WITH normalized_runs AS (
  SELECT
    metadata_json->>'environment' AS environment,
    metadata_json->>'status' AS status
  FROM agent_runs
  WHERE metadata_json ? 'environment'
)
SELECT
  environment,
  COUNT(environment) AS total_runs,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed_runs,
  CAST(SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS DOUBLE PRECISION)
    / NULLIF(COUNT(environment), 0) AS completion_rate
FROM normalized_runs
GROUP BY environment
ORDER BY total_runs DESC;

Latest Rubric Scores by Model

WITH latest_scores AS (
  SELECT
    agent_run_id,
    MAX(rubric_version) AS rubric_version
  FROM judge_results
  WHERE rubric_id = 'helpful_response_v1'
  GROUP BY agent_run_id
)
SELECT
  ar.id,
  ar.metadata_json->'model'->>'name' AS model_name,
  jr.output->>'score' AS score,
  jr.result_metadata->>'label' AS label
FROM latest_scores ls
JOIN judge_results jr
  ON jr.agent_run_id = ls.agent_run_id
  AND jr.rubric_version = ls.rubric_version
  AND jr.rubric_id = 'helpful_response_v1'
JOIN agent_runs ar ON ar.id = jr.agent_run_id
WHERE ar.metadata_json->>'environment' = 'prod'
ORDER BY CAST(jr.output->>'score' AS DOUBLE PRECISION) DESC
LIMIT 15;

Restrictions and Best Practices

  • Read-only: Only
    SELECT
    -style queries are permitted.
  • Single statement: Batches or multiple statements are rejected.
  • Explicit projection: Wildcard projections (
    *
    ) are disallowed. List the columns you need.
  • Collection scoping: A single query can only access data within a single collection.
  • Limit enforcement: Every query is capped at 10,000 rows. Use pagination (
    OFFSET
    /
    LIMIT
    ) for larger result sets.
  • JSON performance: Heavy JSON traversal across large collections can be slow. Prefer top-level fields when available.
  • Type awareness: Cast values explicitly when precision matters.

Reminders and tips for using DQL

No Wildcards Allowed

  • SELECT *
    is forbidden
  • COUNT(*)
    is forbidden - use
    COUNT(column_name)
    instead

GROUP BY Alias Workaround

Aliases don't work directly in GROUP BY when selecting from

agent_runs
. Use a subquery:

SELECT task, model_name, COUNT(task) AS run_count
FROM (
    SELECT
        metadata_json->>'task' AS task,
        metadata_json->'agent'->>'model_name' AS model_name
    FROM agent_runs
    WHERE ...
) AS subq
GROUP BY task, model_name

Avoid Dynamic IN Clauses with String Interpolation

Building IN clauses with f-strings is dangerous:

  • Task names containing
    ::
    can be parsed as PostgreSQL type casts
  • Instead: fetch all relevant data and filter in Python with
    .isin()

JSON Access Patterns

  • Nested:
    metadata_json->'parent'->>'child'
  • Flat key with dot:
    metadata_json->>'parent.child'
  • Check key existence:
    metadata_json ? 'key'