Automatically detect and de-identify PII (Personally Identifiable Information)
install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/aipoch-ai/hipaa-compliance-auditor" ~/.claude/skills/clawdbot-skills-hipaa-compliance-auditor && rm -rf "$T"
manifest: skills/aipoch-ai/hipaa-compliance-auditor/SKILL.md
HIPAA Compliance Auditor
A clinical-grade PII/PHI detection and de-identification tool for healthcare text data.
Overview
This skill analyzes text for HIPAA-protected identifiers and automatically redacts or anonymizes them. It uses a combination of regex patterns, NLP entity recognition, and contextual analysis to identify 18 HIPAA identifier categories.
Features
- Detection of 18 HIPAA Identifiers: Names, dates, SSN, MRN, phone/fax, email, geographic data, etc.
- Automatic De-identification: Replace PII with semantic tokens (e.g., [PATIENT_NAME], [DATE_1])
- Context-Aware Detection: Distinguishes between similar patterns (dates vs. lab values)
- Audit Logging: Track all redaction actions for compliance documentation
- Confidence Scoring: Flag uncertain detections for manual review
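One way to read the Confidence Scoring feature is as a triage step: detections at or above the threshold are redacted automatically, and the rest are queued for a human reviewer. The sketch below assumes that reading. The `Detection` class and the `triage` function are hypothetical names for illustration, not part of the skill's API, and the 0.7 default mirrors the threshold listed under Parameters.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    category: str      # e.g. "PATIENT_NAME", "DATE" (hypothetical labels)
    start: int         # character offset where the match begins
    end: int           # character offset just past the match
    confidence: float  # detector score in [0.0, 1.0]


def triage(detections, threshold=0.7):
    """Split detections into (confident, needs_review) around the threshold."""
    confident = [d for d in detections if d.confidence >= threshold]
    needs_review = [d for d in detections if d.confidence < threshold]
    return confident, needs_review


found = [
    Detection("PATIENT_NAME", 8, 16, 0.95),
    Detection("DATE", 33, 43, 0.55),  # could be a lab value rather than a date
]
confident, needs_review = triage(found)
print([d.category for d in confident])     # ['PATIENT_NAME']
print([d.category for d in needs_review])  # ['DATE']
```

Keeping the low-confidence items in a separate list, rather than dropping them, fits the manual QA step the skill asks for before release.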
Usage
Command Line
python scripts/main.py --input "patient_text.txt" --output "deidentified.txt"
python scripts/main.py --text "Patient John Doe, SSN 123-45-6789..." --audit-log audit.json
Python API
from scripts.main import HIPAAAuditor

auditor = HIPAAAuditor()
result = auditor.deidentify("Patient John Doe was admitted on 2024-01-15...")
print(result.cleaned_text)  # De-identified output
print(result.detected_pii)  # List of found PII entities
Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| --input | string | - | No | Path to input text file |
| --text | string | - | No | Direct text input (alternative to file) |
| --output | string | - | No | Path for de-identified output file |
| --audit-log | string | - | No | Path for JSON audit log |
| | float | 0.7 | No | Minimum confidence threshold (0.0-1.0) |
| | bool | true | No | Maintain document structure |
| | string | - | No | Path to custom regex patterns JSON |
HIPAA Identifier Categories Detected
- Names (patient, relatives, employers)
- Geographic subdivisions smaller than state
- Dates (except year) related to individual
- Phone numbers
- Fax numbers
- Email addresses
- SSN
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- URLs
- IP addresses
- Biometric identifiers
- Full-face photos
- Any other unique identifying numbers
Output Format
De-identified Text
Original identifiers are replaced with semantic tags:
- [PATIENT_NAME_1], [PATIENT_NAME_2]...
- [DATE_1], [DATE_2]...
- [SSN_1]
- [PHONE_1], [PHONE_2]...
- [EMAIL_1]
- [MRN_1] (Medical Record Number)
- [ADDRESS_1]
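The numbered tags suggest that each distinct value in a category gets its own index. A minimal sketch of that numbering follows, assuming that repeated values share one tag and that detected spans do not overlap. `replace_with_tokens` and the span tuple layout are hypothetical, not the skill's actual replacement engine.

```python
import re
from collections import defaultdict


def replace_with_tokens(text, spans):
    """Replace (start, end, category) spans with numbered tags.

    Assumes spans do not overlap. Repeated values share one tag.
    """
    counters = defaultdict(int)  # next index per category
    assigned = {}                # (category, value) -> tag
    ordered = sorted(spans)
    # Assign tags in reading order so numbering follows the text.
    for start, end, category in ordered:
        key = (category, text[start:end])
        if key not in assigned:
            counters[category] += 1
            assigned[key] = f"[{category}_{counters[category]}]"
    # Substitute from the end so earlier offsets stay valid.
    for start, end, category in reversed(ordered):
        tag = assigned[(category, text[start:end])]
        text = text[:start] + tag + text[end:]
    return text


text = "John Doe saw Jane Roe. John Doe left on 2024-01-15."
spans = [(m.start(), m.end(), "PATIENT_NAME")
         for m in re.finditer(r"John Doe|Jane Roe", text)]
spans += [(m.start(), m.end(), "DATE")
          for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]
print(replace_with_tokens(text, spans))
# [PATIENT_NAME_1] saw [PATIENT_NAME_2]. [PATIENT_NAME_1] left on [DATE_1].
```

Reusing a tag for a repeated value keeps the de-identified text readable, since a reader can still tell that two mentions refer to the same person.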
Audit Log JSON
{
  "timestamp": "2024-01-15T10:30:00Z",
  "input_hash": "sha256:abc123...",
  "detections": [
    {
      "type": "PATIENT_NAME",
      "position": [10, 18],
      "confidence": 0.95,
      "replacement": "[PATIENT_NAME_1]",
      "original_length": 8
    }
  ],
  "statistics": {
    "total_pii_found": 5,
    "categories_detected": ["NAME", "DATE", "PHONE", "SSN"]
  }
}
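A record of this shape can be assembled with the standard library alone. The sketch below is an assumption about how such an entry might be built, not the skill's implementation: `build_audit_entry` and the input dictionary keys (`start`, `end`, and so on) are hypothetical. It stores a hash of the input and the length of each match, never the matched text itself.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_entry(original_text, detections):
    """Assemble an audit record without storing any original identifier text."""
    digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "input_hash": f"sha256:{digest}",
        "detections": [
            {
                "type": d["type"],
                "position": [d["start"], d["end"]],
                "confidence": d["confidence"],
                "replacement": d["replacement"],
                "original_length": d["end"] - d["start"],
            }
            for d in detections
        ],
        "statistics": {
            "total_pii_found": len(detections),
            "categories_detected": sorted({d["type"] for d in detections}),
        },
    }


entry = build_audit_entry(
    "Patient John Doe",
    [{"type": "PATIENT_NAME", "start": 8, "end": 16,
      "confidence": 0.95, "replacement": "[PATIENT_NAME_1]"}],
)
print(json.dumps(entry, indent=2))
```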
Technical Architecture
- Preprocessing: Normalize text encoding, handle line breaks
- Regex Engine: Pattern matching for structured identifiers (SSN, phone, email, MRN)
- NLP Pipeline: spaCy NER for names, organizations, locations
- Context Filter: Remove false positives (e.g., "Dr. Smith" vs. "smith fracture")
- Replacement Engine: Sequential replacement with semantic tokens
- Validation: Ensure no original PII remains in output
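The regex and validation stages can be sketched as follows. This is a simplified illustration under stated assumptions: the three patterns are deliberately loose examples and are not the definitions in references/pii_patterns.json, the NLP and context-filter stages are omitted, and `find_structured` and `leaked_values` are hypothetical helper names.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_structured(text):
    """Return (start, end, category) for every pattern match, in text order."""
    spans = [
        (m.start(), m.end(), category)
        for category, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans)


def leaked_values(original, spans, cleaned):
    """Validation pass: list detected values that still appear in the output."""
    values = {original[start:end] for start, end, _ in spans}
    return sorted(v for v in values if v in cleaned)


text = "SSN 123-45-6789, call 555-123-4567 or mail jdoe@example.org."
spans = find_structured(text)

# Replace from the end so earlier offsets stay valid.
cleaned = text
for start, end, category in reversed(spans):
    cleaned = cleaned[:start] + f"[{category}]" + cleaned[end:]

print(cleaned)                              # SSN [SSN], call [PHONE] or mail [EMAIL].
print(leaked_values(text, spans, cleaned))  # []
```

The validation pass only confirms that values the detectors found are gone. It cannot catch identifiers that were never detected, which is why manual review remains necessary.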
Dependencies
- Python 3.9+
- spaCy (en_core_web_trf or en_core_web_lg)
- regex (for advanced pattern matching)
- Presidio (optional, for enhanced PII detection)
See references/requirements.txt for the full dependency list.
Limitations & Warnings
⚠️ CRITICAL: This tool is designed as a helper, not a replacement for human review.
- Context-dependent PII (e.g., rare disease names + location) may not be fully detected
- Unstructured narrative text may contain identifying information not caught by patterns
- Always perform manual QA on output before HIPAA-compliant release
- AI Autonomous Acceptance Status: Requires Manual Review
References
- references/hipaa_safe_harbor_guide.pdf: HIPAA Safe Harbor de-identification standards
- references/pii_patterns.json: Complete regex pattern definitions
- references/test_cases/: Sample clinical texts with expected outputs
- references/requirements.txt: Python dependencies
Technical Difficulty: High
Complex NLP pipelines, contextual disambiguation, regulatory compliance requirements.
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- No hardcoded credentials or API keys
- No unauthorized file system access (../)
- Output does not expose sensitive information
- Prompt injection protections in place
- Input file paths validated (no ../ traversal)
- Output directory restricted to workspace
- Script execution in sandboxed environment
- Error messages sanitized (no stack traces exposed)
- Dependencies audited
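The two path-related items on this checklist (no `../` traversal, output restricted to the workspace) can be enforced with a single resolve-and-compare check. The sketch below is one possible approach, not the skill's actual validation code, and `resolve_inside` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path under workspace, rejecting anything that escapes it."""
    root = Path(workspace).resolve()
    # Joining with an absolute user_path discards root, so the check below
    # also rejects absolute paths that point outside the workspace.
    candidate = (root / user_path).resolve()
    # Path.is_relative_to requires Python 3.9+, matching the stated dependency.
    if not candidate.is_relative_to(root):
        raise ValueError("path escapes workspace")
    return candidate


print(resolve_inside(".", "deidentified.txt").name)  # deidentified.txt

try:
    resolve_inside(".", "../outside.txt")
except ValueError as exc:
    print(exc)  # path escapes workspace
```

Resolving before comparing matters because a plain string check for `../` would miss symlinks and absolute paths.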
Prerequisites
# Python dependencies
pip install -r requirements.txt
Evaluation Criteria
Success Metrics
- Successfully executes main functionality
- Output meets quality standards
- Handles edge cases gracefully
- Performance is acceptable
Test Cases
- Basic Functionality: Standard input → Expected output
- Edge Case: Invalid input → Graceful error handling
- Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support