Babysitter homoglyph-detector
Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code
install
source · Clone the upstream repo
git clone https://github.com/a5c-ai/babysitter
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/security-compliance/skills/homoglyph-detector" ~/.claude/skills/a5c-ai-babysitter-homoglyph-detector && rm -rf "$T"
manifest:
library/specializations/security-compliance/skills/homoglyph-detector/SKILL.mdsource content
Homoglyph Detector
Byte-level forensic analysis of code changes to detect Unicode homoglyph substitutions — characters that look identical to ASCII in every editor and diff tool but have different codepoints, silently breaking string comparisons, dictionary lookups, and identifier resolution.
Purpose
Homoglyph attacks (related to CVE-2021-42574 "Trojan Source") are the highest-stealth trojan technique. A Cyrillic
р (U+0440) looks identical to a Latin p (U+0070) in every font, editor, and diff viewer. The only way to detect it is byte-level analysis via hexdump.
This skill pipes git diffs through
hexdump -C and scans for multi-byte UTF-8 sequences where single-byte ASCII is expected, particularly in string literals used as dictionary keys, variable names, and identifiers.
Capabilities
Confusable Character Detection
Scans for these high-risk Unicode confusables:
| Latin | Cyrillic | Greek | UTF-8 Bytes |
|---|---|---|---|
| a (61) | а (D0 B0) | α (CE B1) | 1 vs 2 bytes |
| c (63) | с (D1 81) | — | 1 vs 2 bytes |
| e (65) | е (D0 B5) | ε (CE B5) | 1 vs 2 bytes |
| o (6F) | о (D0 BE) | ο (CE BF) | 1 vs 2 bytes |
| p (70) | р (D1 80) | ρ (CF 81) | 1 vs 2 bytes |
| x (78) | х (D1 85) | χ (CF 87) | 1 vs 2 bytes |
| y (79) | у (D1 83) | — | 1 vs 2 bytes |
Zero-Width Character Detection
- U+200B — Zero-width space
- U+200C — Zero-width non-joiner
- U+200D — Zero-width joiner
- U+FEFF — Byte order mark (in non-BOM position)
Bidi Control Character Detection (Trojan Source)
- U+200F — Right-to-left mark
- U+200E — Left-to-right mark
- U+202A — Left-to-right embedding
- U+202B — Right-to-left embedding
- U+202C — Pop directional formatting
- U+2066 — Left-to-right isolate
- U+2067 — Right-to-left isolate
Context-Aware Analysis
- Focuses on string literals (dictionary keys, config values)
- Focuses on identifiers (variable names, function names, class names)
- Ignores legitimate Unicode in comments, docstrings, and i18n strings
- Compares byte patterns between removed (-) and added (+) diff lines
Input Schema
{ "type": "object", "required": ["projectRoot", "changedFiles"], "properties": { "projectRoot": { "type": "string", "description": "Absolute path to the git repository" }, "changedFiles": { "type": "array", "items": { "type": "string" }, "description": "List of changed file paths to scan" }, "scanMode": { "type": "string", "enum": ["uncommitted", "commit-range", "branch-diff"], "default": "uncommitted" }, "baseRef": { "type": "string" }, "headRef": { "type": "string" } } }
Output Schema
{ "type": "object", "required": ["filesScanned", "homoglyphsFound", "verdict"], "properties": { "filesScanned": { "type": "number" }, "homoglyphsFound": { "type": "array", "items": { "type": "object", "properties": { "file": { "type": "string" }, "line": { "type": "number" }, "byteOffset": { "type": "string" }, "context": { "type": "string" }, "expectedAscii": { "type": "string" }, "actualBytes": { "type": "string" }, "unicodeCodepoint": { "type": "string" }, "scriptName": { "type": "string" }, "impact": { "type": "string" } } } }, "bidiControlChars": { "type": "array" }, "verdict": { "type": "string", "enum": ["CLEAN", "HOMOGLYPH_DETECTED"] } } }
Detection Method
# Step 1: Pipe git diff through hexdump git diff <file> | hexdump -C # Step 2: In added (+) lines, look for multi-byte sequences # where the removed (-) line had single-byte ASCII # # Example — Latin 'p' vs Cyrillic 'р': # Removed: 22 70 70 67 22 | "ppg" | ← 70 = Latin 'p' # Added: 22 d1 80 70 67 | "..pg" | ← d1 80 = Cyrillic 'р' # # The d1 80 bytes where 70 should be = HOMOGLYPH DETECTED
Usage Example
skill: { name: 'homoglyph-detector', context: { projectRoot: '/path/to/project', changedFiles: ['backend/app/prediction/temporal.py'], scanMode: 'uncommitted' } }
Real-World Example
From adversarial drill #6:
- Attack: Dictionary key
changed to"ppg"
(Cyrillic р + Latin pg)"рpg" - Camouflage: 4 lines of harmless
wrappers added as decoyround() - Impact: All
lookups return defaultdict.get("ppg")
, disabling trend detection0 - Detection:
revealed byteshexdump -C
whered1 80
was expected70
Process Files
— Phase 2: Homoglyph Detection (parallel with semantic analysis)nation-state-trojan-detection.js