Cc-skills telemetry-terminology-similarity
Score telemetry field name similarity across syntactic, taxonomic, and semantic layers. TRIGGERS - field name collision, terminology overlap, schema dedup, naming inconsistency, telemetry naming, log field similarity.
```bash
# Clone the full repository
git clone https://github.com/terrylica/cc-skills

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/terrylica/cc-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/quality-tools/skills/telemetry-terminology-similarity" ~/.claude/skills/terrylica-cc-skills-telemetry-terminology-similarity && rm -rf "$T"
```
`plugins/quality-tools/skills/telemetry-terminology-similarity/SKILL.md`

Telemetry Terminology Similarity
Score pairwise similarity of telemetry field names across three independent layers. Emits raw scores — no thresholds, no clustering, no opinions. The consuming AI agent applies its own domain judgment.
Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
When to Use This Skill
Use this skill when:
- Auditing a telemetry/logging schema for naming collisions
- Comparing two JSONL log schemas for field overlap
- Detecting `trace_id` vs `traceId` vs `request_id` vs `correlation_id` style problems
- Validating field naming consistency before shipping telemetry changes
Architecture
5-layer scoring pipeline — each layer catches what the others miss:
```
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: NORMALIZE                                           │
│   camelCase/snake_case split + abbreviation expansion        │
│   wordninja for concatenated words                           │
│   "traceId" → "trace id", "ts" → "timestamp"                 │
├─────────────────────────────────────────────────────────────┤
│ Layer 2: SYNTACTIC (RapidFuzz, 0-100)                        │
│   token_set_ratio on normalized forms                        │
│   Catches: trace_id ↔ traceId, level ↔ log_level             │
├─────────────────────────────────────────────────────────────┤
│ Layer 3: TAXONOMIC (WordNet Wu-Palmer, 0.0-1.0)              │
│   Head-noun synonym detection via hypernym tree              │
│   Catches: level ↔ severity, error ↔ fault, op ↔ action      │
├─────────────────────────────────────────────────────────────┤
│ Layer 4: SEMANTIC (sentence-transformers, 0.0-1.0)           │
│   Cosine similarity via all-MiniLM-L6-v2 embeddings          │
│   Catches: error ↔ exception, user_id ↔ account_id           │
├─────────────────────────────────────────────────────────────┤
│ Layer 5: CANONICAL (--canonical flag, optional)              │
│   RapidFuzz vs bundled OTel/OCSF/CloudEvents dictionary      │
│   Catches: http_method → http.request.method (OTel)          │
├─────────────────────────────────────────────────────────────┤
│ Output: All pairs scored + canonical anchors.                │
│ Agent decides what to act on — tool computes, agent judges.  │
│ Use proposer-prompt.md for structured rename proposals.      │
└─────────────────────────────────────────────────────────────┘
```
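For intuition, Layer 1 amounts to roughly the following. This is a minimal sketch, not the script's implementation; the abbreviation entries shown are only those grounded elsewhere in this document.

```python
import re

import wordninja  # probabilistic splitter for concatenated words

# Illustrative subset only; the script's ABBREVIATIONS dict holds the full mapping.
ABBREVIATIONS = {"ts": "timestamp", "op": "operation"}

def normalize(field: str) -> str:
    """Sketch of Layer 1: split camelCase/snake_case, expand abbreviations."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", field)  # "traceId" -> "trace Id"
    tokens = re.split(r"[_\-\s]+", spaced.lower())           # snake/kebab/space split
    words = []
    for tok in tokens:
        if tok in ABBREVIATIONS:
            words.append(ABBREVIATIONS[tok])
        elif len(tok) > 6:
            words.extend(wordninja.split(tok))  # e.g. "requestid" -> ["request", "id"]
        else:
            words.append(tok)
    return " ".join(words)

print(normalize("traceId"))  # trace id
print(normalize("ts"))       # timestamp
```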
Two-Phase Workflow: Score → Propose
The skill works in two phases:
- Phase 1 — Score (`term_similarity.py`): Compute raw similarity scores across 5 layers. Tool computes, no opinions emitted.
- Phase 2 — Propose (`references/proposer-prompt.md`): A bundled prompt template that consumes the scoring JSON and asks the LLM to produce structured rename proposals with confidence levels, evidence citations, and explicit escape hatches.
The two phases are deliberately separated. Phase 1 is deterministic and reproducible; Phase 2 applies domain judgment that only an LLM with conversation context can provide.
Dependencies
All installed via `uv run` (PEP 723 inline metadata — no global install needed):
| Package | Purpose | Size |
|---|---|---|
| `sentence-transformers` | Semantic embeddings (MiniLM-L6) | ~80 MB |
| `rapidfuzz` | Fast fuzzy string matching (C++) | ~1.3 MB |
| `wordninja` | Probabilistic word splitting | ~0.5 MB |
| `nltk` | WordNet Wu-Palmer synonym detection | ~30 MB |
| — | Fast JSON serialization | ~0.3 MB |
First run downloads the `all-MiniLM-L6-v2` model (~80 MB) and WordNet data (~30 MB).
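PEP 723 metadata is an inline comment block at the top of the script that `uv run` reads to resolve dependencies on the fly. A sketch of what such a header looks like (the exact dependency list and pins in the real `term_similarity.py` may differ):

```python
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "sentence-transformers",
#     "rapidfuzz",
#     "wordninja",
#     "nltk",
# ]
# ///
```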
Script Location
The analysis script lives in this skill's `references/` directory. Resolve the path before use:

```bash
# SSoT-OK: marketplace path resolution for cross-repo invocation
SCRIPT_DIR="$(dirname "$(find ~/.claude/plugins -path '*/telemetry-terminology-similarity/references/term_similarity.py' -print -quit 2>/dev/null)")"
SCRIPT="$SCRIPT_DIR/term_similarity.py"
```
All examples below assume `$SCRIPT` is set. When invoking from the cc-skills repo itself, use the relative path directly.
Usage
Analyze field names directly
```bash
# SSoT-OK: uv run handles PEP 723 inline deps
uv run --python 3.13 "$SCRIPT" \
  trace_id traceId request_id correlation_id \
  level severity log_level priority
```
Extract fields from a Python codebase
Use Python regex extraction (macOS `grep` lacks `-P`):

```bash
python3 -c "
import re, glob
fields = set()
for f in glob.glob('**/*.py', recursive=True):
    text = open(f).read()
    for m in re.finditer(r'\"([a-z][a-z0-9_]*?)\":', text):
        fields.add(m.group(1))
for f in sorted(fields):
    print(f)
" | uv run --python 3.13 "$SCRIPT"
```
Analyze from stdin (pipe from jq, etc.)
```bash
head -1 telemetry.jsonl | jq -r 'keys[]' | uv run --python 3.13 "$SCRIPT"
```
Analyze a JSONL file's fields
uv run --python 3.13 "$SCRIPT" --jsonl /path/to/telemetry.jsonl
Compare two JSON schemas
uv run --python 3.13 "$SCRIPT" --schema-a schema_v1.json --schema-b schema_v2.json
Control output size
uv run --python 3.13 "$SCRIPT" --top 30 field1 field2 field3 # Top 30 pairs uv run --python 3.13 "$SCRIPT" --top 0 field1 field2 field3 # All pairs uv run --python 3.13 "$SCRIPT" --json field1 field2 field3 # JSON output
Lookup against canonical standards (OTel/OCSF/CloudEvents)
```bash
# Anchor each field against 1,453 bundled canonical names from OTel + OCSF + CloudEvents
uv run --python 3.13 "$SCRIPT" --canonical http_method http_status request_id severity
```
Output adds a `=== CANONICAL ANCHORS ===` section showing the closest standard names per field. Useful for "should we rename to match an industry standard" decisions.
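Conceptually, the canonical lookup fuzzy-matches each local field against the bundled standard names. A minimal sketch with RapidFuzz, using a hand-picked slice of canonical names; the skill's actual dictionary and scorer may differ:

```python
from rapidfuzz import fuzz, process, utils

# Hand-picked slice for illustration; the bundled dictionary has 1,453 names.
CANONICAL = ["http.request.method", "http.response.status_code", "url.path", "user.id"]

# Closest canonical anchor for one local field name.
match, score, _ = process.extractOne(
    "http_method", CANONICAL, scorer=fuzz.token_set_ratio, processor=utils.default_process
)
print(match, score)  # anchors on http.request.method
```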
Generate structured rename proposals (Phase 2)
After running with `--json --canonical`, paste the output into `references/proposer-prompt.md` — a bundled prompt template that produces atomic, reviewable rename proposals with confidence levels and explicit escape hatches.
```bash
# Phase 1: Score
uv run --python 3.13 "$SCRIPT" --json --canonical [fields...] > analysis.json

# Phase 2: Apply proposer prompt (paste analysis.json into the template)
# The LLM produces structured proposals.json — review atomically
```
Parameters
| Parameter | Default | Description |
|---|---|---|
| `--top N` | 50 | Show top N pairs by combined score (0 = all) |
| `--canonical` | false | Look up each field against the bundled OTel/OCSF/CloudEvents dict |
| `--jsonl FILE` | — | Extract fields from a JSONL file (all unique keys) |
| `--schema-a` / `--schema-b` | — | Cross-schema comparison (two JSON schema files) |
| `--json` | false | Output as structured JSON instead of text |
Output Format
Text output (default)
```
Fields analyzed: 21
Unique after normalization: 19

=== EXACT DUPLICATES (after normalization) ===
trace_id == traceId
timestamp == ts

=== SCORED PAIRS (sorted by combined score) ===
  syn    tax    sem   comb   pair
  ---    ---    ---   ----   ----
100.0  0.000  0.560  1.000   level <-> log_level
  0.0  1.000  0.472  1.000   error <-> fault
 66.7  0.909  0.457  0.909   operation <-> action
 46.2  0.833  0.251  0.833   level <-> severity
 28.6  0.667  0.700  0.700   error <-> exception
```
Three independent scores per pair — the agent reads all three to decide:
- syn (syntactic): high = surface-level name variant
- tax (taxonomic): high = WordNet synonym (hypernym tree)
- sem (semantic): high = embedding similarity (distributional)
- comb (combined): max(syn/100, tax, sem) — sorting key (see the worked example below)
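As a concrete check, here is the combined-score formula applied to two pairs from the sample output above:

```python
def combined(syn: float, tax: float, sem: float) -> float:
    """Combined score: the strongest signal from any single layer, on a 0-1 scale."""
    return max(syn / 100.0, tax, sem)

# From the sample output above:
print(combined(46.2, 0.833, 0.251))   # 0.833 -> level <-> severity, taxonomic layer dominates
print(combined(100.0, 0.000, 0.560))  # 1.0   -> level <-> log_level, syntactic layer dominates
```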
JSON output (`--json`)
Structured JSON with `exact_duplicates` and `scored_pairs` arrays.
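A minimal consumer sketch, assuming a `--json` run was saved as `analysis.json` (as in the Phase 1 command above); only the two documented top-level keys are assumed, and per-entry fields are whatever the script emits:

```python
import json

with open("analysis.json") as f:  # e.g. from: uv run ... --json --canonical > analysis.json
    report = json.load(f)

# Only the two documented top-level keys are assumed here.
for dup in report["exact_duplicates"]:
    print("duplicate:", dup)
for pair in report["scored_pairs"][:10]:  # per-pair fields are whatever the script emits
    print(pair)
```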
How Each Layer Contributes
| Scenario | syn | tax | sem | Which layer wins |
|---|---|---|---|---|
| `trace_id` vs `traceId` | 100.0 | 0.0 | 1.0 | Syntactic |
| `level` vs `severity` | 46.2 | 0.833 | 0.251 | Taxonomic |
| `error` vs `fault` | 0.0 | 1.000 | 0.472 | Taxonomic |
| `operation` vs `action` | 66.7 | 0.909 | 0.457 | Taxonomic |
| `error` vs `exception` | 28.6 | 0.667 | 0.700 | Semantic |
| `user_id` vs `account_id` | 47.1 | 0.0 | 0.792 | Semantic |
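To see the taxonomic layer in isolation, Wu-Palmer similarity can be computed directly with NLTK's WordNet interface. This sketch compares a single synset pair; the script's own synset-selection strategy may differ:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download (~30 MB)

# Wu-Palmer similarity over the hypernym tree for the head nouns "error" and "fault".
error = wn.synsets("error", pos=wn.NOUN)[0]
fault = wn.synsets("fault", pos=wn.NOUN)[0]
print(error.wup_similarity(fault))  # a float in (0, 1]; higher = closer in the tree
```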
Abbreviation Dictionary
The normalizer expands common telemetry abbreviations:
| Abbr | Expansion | Abbr | Expansion |
|---|---|---|---|
| `ts` | timestamp | `uid` | user id |
| `req` | request | `resp` | response |
| `err` | error | `msg` | message |
| `svc` | service | `env` | environment |
| `op` | operation | `lvl` | level |
| `evt` | event | `ctx` | context |
| `acct` | account | `cfg` | configuration |
| `dur` | duration | `lat` | latency |
Add domain-specific abbreviations by editing `ABBREVIATIONS` in `term_similarity.py`.
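For example, extending the dictionary with project-specific shorthand might look like the following; the added entries are hypothetical, and the existing keys shown are only those cited elsewhere in this document:

```python
# In term_similarity.py (illustrative; the real dict has many more entries).
ABBREVIATIONS = {
    "ts": "timestamp",       # cited in this doc
    "op": "operation",       # cited in this doc
    # hypothetical domain-specific additions:
    "txn": "transaction",
    "pmt": "payment",
}
```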
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Import errors | Missing deps | Use `uv run` (PEP 723 resolves automatically) |
| Model download slow | First run | Cached after first download (~110 MB total) |
| Script not found from another repo | Path not resolved | Set `$SCRIPT` per the Script Location section |
| `grep -P` fails | macOS lacks PCRE | Use the Python regex pattern instead |
Post-Execution Reflection
After this skill completes, check before closing:
- Did the command succeed? — If not, fix the instruction or error table that caused the failure.
- Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
- Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.