Claude-skill-registry language-detection-expert
Hybrid language detection algorithm for Vigil Guard v2.0.0. Use for language-detector Flask API, entity-based hints, Polish PESEL/NIP detection, 3-branch pipeline integration, accuracy troubleshooting, and langdetect integration.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/language-detection-expert" ~/.claude/skills/majiayu000-claude-skill-registry-language-detection-expert && rm -rf "$T"
skills/data/language-detection-expert/SKILL.md
Language Detection Expert (v2.0.0)
Overview
Hybrid language detection algorithm for Vigil Guard v2.0.0 combining entity-based hints (Polish PESEL/NIP detection) with statistical analysis (langdetect library) for accurate dual-language PII processing and 3-branch detection pipeline integration.
When to Use This Skill
- Managing language-detector Flask API (services/language-detector/)
- Implementing hybrid detection logic
- Troubleshooting detection accuracy and latency (<10ms target)
- Working with langdetect library
- Polish entity recognition patterns
- 3-branch pipeline integration (v2.0.0)
Tech Stack
- Python 3.11, Flask 3.0.0
- langdetect 1.0.9 (statistical analysis)
- Custom Polish entity patterns (PESEL, NIP, REGON)
v2.0.0 Architecture Integration
Position in 3-Branch Pipeline
n8n Workflow (24 nodes):
1. Input Validation
2. Language Detection ← This Service
3. 3-Branch Executor (parallel):
   - Branch A: Heuristics (uses language for keyword matching)
   - Branch B: Semantic (uses language for embedding model)
   - Branch C: LLM Guard (language-agnostic)
4. Arbiter v2 Decision
5. PII Redaction (uses language for Presidio model selection)
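The ordering above (detect the language once, then fan out to the branches in parallel) can be sketched in Python with `asyncio.gather`. This is only an illustration of the control flow, not the n8n implementation: `run_branches` and the stub branch functions are hypothetical names, and real branches would call the branch services over HTTP.

```python
import asyncio
from typing import Awaitable, Callable

# A branch takes (text, language) and returns a result dict.
Branch = Callable[[str, str], Awaitable[dict]]


async def run_branches(text: str, language: str,
                       branches: dict[str, Branch]) -> dict[str, dict]:
    """Run every branch concurrently and collect results by branch name."""
    names = list(branches)
    results = await asyncio.gather(
        *(branches[name](text, language) for name in names)
    )
    return dict(zip(names, results))


# Stand-in branches (hypothetical); they only echo what they received.
async def heuristics_stub(text: str, language: str) -> dict:
    return {"branch": "A", "language": language}


async def llm_guard_stub(text: str, language: str) -> dict:
    return {"branch": "C"}  # language-agnostic, so the hint is ignored


def demo() -> dict[str, dict]:
    # The language would come from the detection step that runs first.
    return asyncio.run(run_branches(
        "PESEL 92032100157", "pl",
        {"heuristics": heuristics_stub, "llm_guard": llm_guard_stub},
    ))
```

Because `asyncio.gather` preserves argument order, zipping the results back onto the branch names is safe.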
Integration with Branches
    // From n8n 3-Branch Executor
    // Helper so every call is sent as a JSON POST and parsed as JSON
    const postJson = (url, payload) =>
      fetch(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      }).then((res) => res.json());

    const { language, detection_method } = await postJson(
      'http://vigil-language-detector:5002/detect',
      { text: input, detailed: true }
    );

    // Branch A: Heuristics - uses language for keyword patterns
    const branchA = await postJson('http://vigil-heuristics:5005/analyze', {
      text: input, language, request_id
    });

    // Branch B: Semantic - uses language for embedding selection
    const branchB = await postJson('http://vigil-semantic:5006/analyze', {
      text: input, language, request_id
    });

    // PII Redaction - uses language for Presidio model
    const piiResult = await detectPII(input, language === 'pl' ? ['pl', 'en'] : ['en']);
Hybrid Detection Algorithm (v2.0.0)
Decision Flow
1. Check Polish Entity Hints:
   - PESEL pattern: \d{11} with checksum
   - NIP pattern: XXX-XXX-XX-XX or \d{10}
   - REGON pattern: \d{9} or \d{14}
   - Polish keywords: ["PESEL", "NIP", "REGON", "dowód", "paszport"]
   → If found: return "pl" (detection_method: "hybrid_entity_hints")
2. If no entity hints, use langdetect:
   - Statistical analysis of character n-grams
   - Language profiles for 55+ languages
   → If confidence > 0.9: return detected language
   → Otherwise: return "en" (default fallback)
3. Edge cases:
   - Empty text → "en" (default)
   - Numbers only → "en" (default)
   - Very short text (<10 chars) → Check entity hints only
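The flow lists the PESEL pattern as "\d{11} with checksum". A minimal sketch of that check follows, using the standard PESEL check-digit weights. The function names are illustrative and not taken from the service code.

```python
import re

# Standard PESEL weights applied to the first ten digits.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
PESEL_CANDIDATE = re.compile(r"\b\d{11}\b")


def is_valid_pesel(candidate: str) -> bool:
    """Return True if an 11-digit string carries a correct PESEL check digit."""
    if not re.fullmatch(r"\d{11}", candidate):
        return False
    digits = [int(ch) for ch in candidate]
    # zip stops after ten pairs, so the check digit itself is not weighted.
    weighted_sum = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    return (10 - weighted_sum % 10) % 10 == digits[10]


def find_valid_pesels(text: str) -> list[str]:
    """Return all 11-digit runs in the text that pass the checksum."""
    return [m for m in PESEL_CANDIDATE.findall(text) if is_valid_pesel(m)]
```

With this check, `92032100157` (the example used throughout this document) passes, while an arbitrary 11-digit run such as `12345678901` does not. That reduces false Polish hints from unrelated numbers.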
API Endpoint
    # POST /detect
    {
      "text": "Moja karta to 4111111111111111 i PESEL 92032100157",
      "detailed": true
    }

    # Response
    {
      "language": "pl",
      "confidence": 1.0,
      "detection_method": "hybrid_entity_hints",
      "details": {
        "entity_hints_found": ["PESEL"],
        "langdetect_result": "pl",
        "langdetect_confidence": 0.95
      }
    }
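For callers written in Python, a small client for this endpoint might look like the sketch below. It uses only the standard library. The helper names and the default timeout are assumptions, and the URL is the local one from the Quick Reference.

```python
import json
import urllib.request

DETECT_URL = "http://localhost:5002/detect"  # local URL from Quick Reference


def build_detect_request(text: str, detailed: bool = False,
                         url: str = DETECT_URL) -> urllib.request.Request:
    """Build the JSON POST request for /detect without sending it."""
    payload = json.dumps({"text": text, "detailed": detailed}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_detect_response(body: bytes) -> tuple[str, str]:
    """Extract (language, detection_method) from a /detect response body."""
    data = json.loads(body)
    return data["language"], data["detection_method"]


def detect_language_remote(text: str, timeout_s: float = 1.0) -> tuple[str, str]:
    """Send the request; raises urllib.error.URLError if the service is down."""
    request = build_detect_request(text, detailed=True)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_detect_response(response.read())
```

Splitting request building and response parsing from the network call keeps both halves testable without a running service.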
Common Tasks
Task 1: Add Polish Entity Pattern
    # app.py
    import re

    # Shape-only patterns: these do not verify checksums
    POLISH_ENTITY_PATTERNS = [
        (r'\b\d{11}\b', 'PESEL'),                  # 11 digits
        (r'\b\d{3}-\d{3}-\d{2}-\d{2}\b', 'NIP'),   # NIP with dashes
        (r'\b\d{10}\b', 'NIP_OR_REGON'),           # 10 digits (ambiguous)
        (r'\b\d{9}\b', 'REGON'),                   # 9 digits REGON
    ]

    POLISH_KEYWORDS = [
        'PESEL', 'pesel', 'NIP', 'nip', 'REGON', 'regon',
        'dowód', 'paszport', 'legitymacja', 'tożsamość'
    ]

    def has_polish_entities(text: str) -> tuple[bool, list]:
        """Check for Polish-specific entities"""
        found_entities = []

        # Check patterns
        for pattern, entity_type in POLISH_ENTITY_PATTERNS:
            if re.search(pattern, text):
                found_entities.append(entity_type)

        # Check keywords (case-sensitive substring match)
        for keyword in POLISH_KEYWORDS:
            if keyword in text:
                found_entities.append(f'keyword:{keyword}')

        return len(found_entities) > 0, found_entities
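The keyword check in Task 1 is a plain substring test, so `nip` also matches inside English words such as "snippet". One possible alternative, sketched here as an assumption rather than the service's actual behavior, is a single case-insensitive regex that only matches whole words. `find_polish_keywords` is a hypothetical helper name.

```python
import re

POLISH_KEYWORDS = [
    "PESEL", "NIP", "REGON",
    "dowód", "paszport", "legitymacja", "tożsamość",
]

# (?<!\w) and (?!\w) require that the keyword is not embedded in a longer word.
KEYWORD_RE = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in POLISH_KEYWORDS) + r")(?!\w)",
    re.IGNORECASE,
)


def find_polish_keywords(text: str) -> list[str]:
    """Return the distinct keywords found, lower-cased and sorted."""
    return sorted({match.group(0).lower() for match in KEYWORD_RE.finditer(text)})
```

The trade-off is that whole-word matching no longer catches inflected forms such as "paszportu", which the substring test does match. Whether that matters depends on how much Polish free text the keyword list is expected to cover.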
Task 2: Statistical Detection with langdetect
    from langdetect import detect_langs, LangDetectException

    def detect_language_statistical(text: str) -> tuple[str, float]:
        """
        Use langdetect for statistical language detection
        Returns: (language_code, confidence)
        """
        try:
            # Get all language probabilities (sorted, most probable first)
            langs = detect_langs(text)

            # Return most probable language
            if langs:
                top_lang = langs[0]
                return top_lang.lang, top_lang.prob

            return 'en', 0.0

        except LangDetectException:
            # Text too short or only numbers
            return 'en', 0.0
Task 3: Hybrid Detection Implementation
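    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Below this confidence the service falls back to English.
    # Note: the Decision Flow above states 0.9; keep the two values in sync.
    CONFIDENCE_THRESHOLD = 0.5

    @app.route('/detect', methods=['POST'])
    def detect_language():
        # Tolerate a missing or non-JSON body instead of failing on None
        data = request.get_json(silent=True) or {}
        text = data.get('text', '')
        detailed = data.get('detailed', False)

        # 1. Check entity hints
        has_polish, entities = has_polish_entities(text)

        if has_polish:
            # Strong Polish signal from entities
            result = {
                'language': 'pl',
                'confidence': 1.0,
                'detection_method': 'hybrid_entity_hints'
            }
            if detailed:
                result['details'] = {
                    'entity_hints_found': entities,
                    'langdetect_result': None,
                    'langdetect_confidence': None
                }
            return jsonify(result)

        # 2. No entity hints, use statistical detection
        lang, confidence = detect_language_statistical(text)
        is_confident = confidence > CONFIDENCE_THRESHOLD

        result = {
            # Return the default language whenever the method is the fallback
            'language': lang if is_confident else 'en',
            'confidence': confidence,
            'detection_method': 'langdetect' if is_confident else 'default_fallback'
        }
        if detailed:
            result['details'] = {
                'entity_hints_found': [],
                'langdetect_result': lang,
                'langdetect_confidence': confidence
            }
        return jsonify(result)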
Task 4: Performance Optimization
    import signal
    from functools import lru_cache

    # Cache results for the 1000 most recently used texts.
    # The text itself is the cache key: a hash cannot be reversed to recover it.
    @lru_cache(maxsize=1000)
    def cached_detect(text: str) -> tuple:
        """Cache detection results for performance"""
        has_polish, entities = has_polish_entities(text)
        if has_polish:
            # Store an immutable tuple so cached values cannot be mutated
            return ('pl', 1.0, 'hybrid_entity_hints', tuple(entities))

        lang, confidence = detect_language_statistical(text)
        return (lang, confidence, 'langdetect', ())

    # Timeout protection (10ms target)
    # SIGALRM is Unix-only and can only be used from the main thread.
    def timeout_handler(signum, frame):
        raise TimeoutError("Language detection exceeded timeout")

    def detect_with_timeout(text: str, timeout_ms: int = 10):
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.setitimer(signal.ITIMER_REAL, timeout_ms / 1000)
        try:
            return detect_language_statistical(text)
        finally:
            signal.setitimer(signal.ITIMER_REAL, 0)  # Cancel the timer
v2.0.0 Branch Integration Examples
Heuristics Service (Branch A) Integration
    # heuristics-service uses language for keyword patterns
    def analyze_with_language(text: str, language: str):
        if language == 'pl':
            keywords = POLISH_KEYWORDS + COMMON_KEYWORDS
            patterns = POLISH_PATTERNS + COMMON_PATTERNS
        else:
            keywords = ENGLISH_KEYWORDS + COMMON_KEYWORDS
            patterns = ENGLISH_PATTERNS + COMMON_PATTERNS

        return match_patterns(text, patterns, keywords)
Semantic Service (Branch B) Integration
    # semantic-service may use language for preprocessing before embedding
    def get_embeddings(text: str, language: str):
        # Note: all-MiniLM-L6-v2 is trained mainly on English text;
        # the language hint selects the preprocessing applied before encoding
        model = load_model('all-MiniLM-L6-v2')

        # Language-specific preprocessing
        if language == 'pl':
            text = polish_preprocessing(text)

        return model.encode(text)
PII Redaction Integration
    # PII redaction uses language for Presidio model selection
    async def detect_pii_with_language(text: str, detected_language: str):
        if detected_language == 'pl':
            # Polish first for PESEL detection accuracy
            languages = ['pl', 'en']
        else:
            languages = ['en']

        return await dual_language_pii(text, languages)
Test Coverage
Test Categories
Polish Text (15 tests):
- With diacritics: "Cześć, jak się masz?"
- Without diacritics: "Prosze o pomoc"
- Mixed case: "PROSZĘ o pomoc"

English Text (10 tests):
- Common words: "Please help me"
- Technical: "Docker Compose deployment"

Mixed Language (8 tests):
- Polish + English terms: "Użyj Docker Compose"
- English + Polish names: "User Jan Kowalski"

Short Text + Entity Hints (10 tests):
- PESEL only: "PESEL 92032100157"
- NIP only: "NIP 123-456-78-90"
- Credit card (no hint): "Card 4111111111111111" → "en"

Edge Cases (7 tests):
- Numbers only: "12345 67890" → "en"
- Special chars: "!@#$%^&*()" → "en"
- Empty string: "" → "en"
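The entity-hint and edge-case rows above lend themselves to a table-driven check. The sketch below is not the project's test suite. It assumes a `detect` callable that maps text to a language code, and `run_cases` is a hypothetical helper that reports mismatches instead of raising.

```python
from typing import Callable

# (input text, expected language) pairs taken from the categories above.
CASES = [
    ("PESEL 92032100157", "pl"),
    ("NIP 123-456-78-90", "pl"),
    ("Card 4111111111111111", "en"),
    ("12345 67890", "en"),
    ("!@#$%^&*()", "en"),
    ("", "en"),
]


def run_cases(detect: Callable[[str], str],
              cases: list[tuple[str, str]] = CASES) -> list[tuple[str, str, str]]:
    """Return (text, expected, actual) for every case the detector gets wrong."""
    failures = []
    for text, expected in cases:
        actual = detect(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures
```

In a real test module the same table could feed `pytest.mark.parametrize`, with `detect` wrapping a call to the `/detect` endpoint.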
Integration Points
With presidio-pii-specialist:
when: Language detected
action:
  1. language="pl" → Call Presidio with pl_core_news_lg
  2. language="en" → Call Presidio with en_core_web_lg
  3. Dual mode → Call both, deduplicate
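The "call both, deduplicate" step is not spelled out here. One plausible approach, shown only as a sketch, is to keep the highest-scoring entity among overlapping spans. The `Entity` tuple and the entity type names in the usage note are hypothetical and do not reflect Presidio's result classes.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    entity_type: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    score: float


def deduplicate_entities(entities: list[Entity]) -> list[Entity]:
    """Keep the best-scoring entity wherever spans overlap."""
    kept: list[Entity] = []
    # Visit the strongest candidates first so they win any overlap.
    for candidate in sorted(entities, key=lambda e: (-e.score, e.start)):
        overlaps = any(
            candidate.start < existing.end and existing.start < candidate.end
            for existing in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return in document order for stable downstream redaction.
    return sorted(kept, key=lambda e: e.start)
```

For example, if the Polish pass reports a PESEL at offsets 6 to 17 with score 0.95 and the English pass reports a phone number over the same span with score 0.4, only the PESEL entry survives.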
With n8n-vigil-workflow (v2.0.0):
when: 3-Branch Executor runs
action:
  1. Language Detection node runs first
  2. Result passed to all 3 branches
  3. Branch A uses language for keyword selection
  4. Branch B uses language for embedding preprocessing
  5. PII_Redactor_v2 uses language for model selection
With heuristics-service (Branch A):
when: Heuristics analysis
action:
  1. Receive language from detection
  2. Select language-specific patterns
  3. Apply Polish or English keyword list
  4. Return score with language context
Troubleshooting
Incorrect detection for short Polish text:
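    # Add more Polish keywords
    POLISH_KEYWORDS += ['proszę', 'dziękuję', 'przepraszam', 'witam']

    # Fall back to keyword matching when langdetect confidence is low
    def resolve_low_confidence(text: str, lang: str, confidence: float) -> str:
        if confidence < 0.5:
            return 'pl' if any(word in text for word in POLISH_KEYWORDS) else 'en'
        return lang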
Detection too slow (>10ms):
    from functools import lru_cache
    from langdetect import DetectorFactory

    # Make langdetect deterministic (set once at import time).
    # This stabilizes results; it does not reduce the number of trials.
    DetectorFactory.seed = 0

    # Enable caching
    @lru_cache(maxsize=10000)
    def cached_detect(text: str):
        return detect_language_statistical(text)
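Before tuning, it helps to measure where detection actually stands against the 10ms target. The sketch below times any detection callable with `time.perf_counter`. The helper name and the choice of reported statistics are assumptions.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[str], object], texts: list[str],
                       repeats: int = 5) -> dict[str, float]:
    """Time fn over every text `repeats` times and summarize in milliseconds."""
    if not texts or repeats < 1:
        raise ValueError("need at least one text and one repeat")

    samples = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            fn(text)
            samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, int(round(0.95 * len(samples))) - 1)
    return {
        "mean_ms": statistics.fmean(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }
```

Passing `detect_language_statistical` and a handful of representative inputs shows whether the slow path is langdetect itself. If the cached function is measured instead, repeats after the first will mostly hit the cache and understate the cold latency.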
Branch A not using language correctly:
    # Verify language is passed to heuristics
    curl -X POST http://localhost:5005/analyze \
      -H "Content-Type: application/json" \
      -d '{"text":"test PESEL 12345678901","language":"pl","request_id":"debug-1"}'

    # Check logs
    docker logs vigil-heuristics-service --tail 50 | grep language
Quick Reference
    # Test API
    curl -X POST http://localhost:5002/detect \
      -H "Content-Type: application/json" \
      -d '{"text":"PESEL 92032100157","detailed":true}'

    # Run tests
    cd services/language-detector && python -m pytest tests/

    # Health check
    curl http://localhost:5002/health

    # Check service logs
    docker logs vigil-language-detector --tail 50
ClickHouse Logging (v2.0.0)
    -- Language detection results logged with events
    SELECT
        original_input,
        detected_language,
        detection_method,
        branch_a_score,
        branch_b_score,
        branch_c_score
    FROM n8n_logs.events_processed
    WHERE detected_language = 'pl'
    ORDER BY timestamp DESC
    LIMIT 10;
Last Updated: 2025-12-09
Performance: <10ms average detection time
Accuracy: 100% (50/50 tests passing)
Languages Supported: 55+ via langdetect, Polish priority
Integration: 3-branch pipeline (v2.0.0)
Version History
- v2.0.0 (Current): 3-branch pipeline integration, branch language passing
- v1.6.11: Hybrid detection algorithm, entity-based hints