Dotfiles querying-mlflow-metrics

Fetches aggregated trace metrics (token usage, latency, trace counts, quality evaluations) from MLflow tracking servers. Triggers on requests to show metrics, analyze token usage, view LLM costs, check usage trends, or query trace statistics.

install
source · Clone the upstream repo
git clone https://github.com/msbaek/dotfiles
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/msbaek/dotfiles "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/querying-mlflow-metrics" ~/.claude/skills/msbaek-dotfiles-querying-mlflow-metrics && rm -rf "$T"
manifest: .claude/skills/querying-mlflow-metrics/SKILL.md
source content

MLflow Metrics

Run scripts/fetch_metrics.py to query metrics from an MLflow tracking server.

Examples

Token usage summary:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG

Output:

AVG: 223.91  SUM: 7613

Hourly token trend (last 24h):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
    -t 3600 --start-time="-24h" --end-time=now

Output: Time-bucketed token sums per hour
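The time bucketing that -t requests can be pictured as flooring each trace's timestamp to the start of its window and summing per window. A minimal sketch of that idea (not the script's actual implementation, which runs server-side), assuming epoch-millisecond timestamps:

```python
from collections import defaultdict

def bucket_sums(records, interval_s):
    """Group (epoch_ms, value) pairs into fixed windows and sum each bucket.

    Illustrative only -- assumes windows are aligned to the epoch.
    """
    interval_ms = interval_s * 1000
    buckets = defaultdict(int)
    for ts_ms, value in records:
        bucket_start = (ts_ms // interval_ms) * interval_ms  # floor to window start
        buckets[bucket_start] += value
    return dict(buckets)

# Two traces in hour 0, one in hour 1 (interval 3600 s = hourly)
records = [(1_000, 120), (1_800_000, 80), (3_700_000, 50)]
print(bucket_sums(records, 3600))  # {0: 200, 3600000: 50}
```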

Latency percentiles by trace:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name

Error rate by status:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status

Quality scores by evaluator (assessments):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_value -a AVG,P50 -d assessment_name

Output: Average and median scores for each evaluator (e.g., correctness, relevance)

Assessment count by name:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_count -a COUNT -d assessment_name

JSON output: add -o json to any command.
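The JSON mode is useful for piping results into other tools. The exact schema isn't documented here, so the snippet below only sketches a consumer, assuming a list of records with aggregation and value fields:

```python
import json

# Hypothetical payload shape -- the real field names may differ.
sample = '[{"aggregation": "SUM", "value": 7613}, {"aggregation": "AVG", "value": 223.91}]'
rows = json.loads(sample)

# Index results by aggregation name for easy lookup
totals = {r["aggregation"]: r["value"] for r in rows}
print(totals["SUM"])  # 7613
```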

Arguments

| Arg | Required | Description |
| --- | --- | --- |
| -s, --server | Yes | MLflow server URL |
| -x, --experiment-ids | Yes | Experiment IDs (comma-separated) |
| -m, --metric | Yes | trace_count, latency, input_tokens, output_tokens, total_tokens |
| -a, --aggregations | Yes | COUNT, SUM, AVG, MIN, MAX, P50, P95, P99 |
| -d, --dimensions | No | Group by: trace_name, trace_status |
| -t, --time-interval | No | Bucket size in seconds (3600 = hourly, 86400 = daily) |
| --start-time | No | -24h, -7d, now, ISO 8601, or epoch ms |
| --end-time | No | Same formats as --start-time |
| -o, --output | No | table (default) or json |
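The relative time formats (-24h, -7d, now) read as offsets from the current time. A minimal parser sketch for these formats, assuming only h/d suffixes and epoch-millisecond output (the real script also accepts ISO 8601, which is omitted here):

```python
import re
import time

def parse_time(spec, now_ms=None):
    """Convert 'now', '-24h', '-7d', or epoch ms into epoch milliseconds.

    Illustrative sketch only -- ISO 8601 handling is not shown.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if spec == "now":
        return now_ms
    m = re.fullmatch(r"(-?\d+)([hd])", spec)
    if m:
        n, unit = int(m.group(1)), m.group(2)
        factor = 3_600_000 if unit == "h" else 86_400_000  # ms per hour / day
        return now_ms + n * factor
    return int(spec)  # fall through: assume epoch ms

print(parse_time("-24h", now_ms=86_400_000))  # 0
```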

For SPANS metrics (span_count, latency), add -v SPANS. For ASSESSMENTS metrics, add -v ASSESSMENTS.

See references/api_reference.md for filter syntax and full API details.