Claude-skill-registry field-extraction-parsing
Extract structured fields from unstructured log data using OPAL parsing functions. Covers extract_regex() for pattern matching with type casting, split() for delimited data, parse_json() for JSON logs, and JSONPath for navigating parsed structures. Use when you need to convert raw log text into queryable fields for analysis, filtering, or aggregation.
Installation:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/field-extraction-parsing" ~/.claude/skills/majiayu000-claude-skill-registry-field-extraction-parsing && rm -rf "$T"
```
skills/data/field-extraction-parsing/SKILL.md

Field Extraction and Parsing
Overview
This skill covers converting unstructured log data into structured, queryable fields using OPAL's extraction and parsing functions.
Core Functions:
- extract_regex() - Extract fields using regex patterns with named capture groups
- split() / split_part() - Split delimited strings
- parse_json() - Parse JSON strings into objects
- JSONPath / Array access - Navigate parsed structures
When to Use This Skill
Use field extraction when you need to:
- Parse log formats - Extract timestamp, level, message from structured logs
- Extract identifiers - Pull out request IDs, trace IDs, user IDs for correlation
- Parse metrics from text - Extract numbers, durations, status codes from logs
- Structure unstructured data - Convert free-form text into queryable columns
- Parse embedded JSON - Extract fields from JSON-formatted log messages
Function 1: extract_regex()
The most powerful extraction function. Uses regular expressions with named capture groups to create new columns.
Syntax
```
extract_regex source_column, /(?P<field_name>pattern)/
extract_regex source_column, /(?P<field_name::typecast>pattern)/
```
Key Features:
- Named capture groups create new columns: (?P<column_name>pattern)
- Type casting inside the capture group: (?P<name::int64>pattern)
- Multiple captures in one regex
- Regex patterns use forward slashes (/pattern/), not quotes
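For example, a single call can create a string column and a typed numeric column at once (the log format here is illustrative):

```
# Input line: "worker-3 processed 42 records"
extract_regex body, /(?P<worker>[\w\-]+) processed (?P<records::int64>\d+) records/
```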
Supported Type Casts
- string (default)
- int64, float64
- parse_isotime (for ISO timestamps)
- duration, duration_ms, duration_sec, duration_min, duration_hr
- parse_json
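As a sketch (the log line is hypothetical; the cast names come from the list above), a timestamp and a duration can be typed at extraction time:

```
# Input line: "2025-11-16T01:58:12Z request took 150 ms"
extract_regex body, /(?P<event_time::parse_isotime>[\dT:Z\-]+) request took (?P<latency::duration_ms>\d+) ms/
```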
Pattern 1: Extract Timestamp and Log Level
Use Case: Parse structured application logs
Log Format:
[2025-11-16 01:58:12,204] INFO [Component] Message...
```
filter container = "kafka"
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) /
```
Creates Columns:
: "2025-11-16 02:02:45,266"log_time
: "INFO"level
Use For: Java logs, Kafka logs, structured application logs
Pattern 2: Extract with Type Casting
Use Case: Extract numeric values as integers
Log Format:
[SnapshotEmitter id=1] Message...
```
extract_regex body, /\[(?P<component>\w+) id=(?P<component_id::int64>\d+)\]/
```
Creates Columns:
: "SnapshotEmitter" (string)component
: 1 (int64)component_id
Key Point:
::int64 casts the extracted value to integer immediately
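Because the column is already numeric, it can be used directly in comparisons and aggregations; a minimal sketch:

```
extract_regex body, /\[(?P<component>\w+) id=(?P<component_id::int64>\d+)\]/
filter component_id > 0    # numeric comparison, no separate cast needed
```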
Pattern 3: Extract HTTP Request Details
Use Case: Parse access log patterns
Log Format:
GET /api/users/123 200 15ms
```
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
```
Creates Columns:
: "GET"method
: "/api/users/123"path
: 200 (int64)status
: 15 (int64)duration_ms
Use For: Nginx, Apache, application access logs
Pattern 4: Extract Request ID for Correlation
Use Case: Pull out trace/request IDs for distributed tracing
Log Format:
request_id=GHhaU0_7TcVSXpICZ9lh [info] GET /api
```
extract_regex body, /request_id=(?P<request_id>[\w\-]+)/
```
Creates Columns:
: "GHhaU0_7TcVSXpICZ9lh"request_id
Then correlate:
```
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
statsby count(), group_by(request_id, method, path)
```
Use For: Request correlation, distributed tracing, debugging user sessions
Pattern 5: Extract Key=Value Pairs
Use Case: Parse structured key=value log formats
Log Format:
user=john action=login result=success duration=150ms
```
extract_regex body, /user=(?P<user>\w+) action=(?P<action>\w+) result=(?P<result>\w+) duration=(?P<duration_ms::int64>\d+)ms/
```
Creates Columns:
: "john"user
: "login"action
: "success"result
: 150 (int64)duration_ms
Use For: Audit logs, security logs, custom application logs
Pattern 6: Extract IP Addresses
Use Case: Parse network information from logs
Log Format:
Connection from IP=192.168.1.100 to Destination="10.0.0.5"
```
extract_regex body, /IP=(?P<source_ip>[\d\.]+) to Destination="(?P<dest_ip>[\d\.]+)"/
```
Creates Columns:
: "192.168.1.100"source_ip
: "10.0.0.5"dest_ip
Use For: Network logs, firewall logs, connection tracking
Function 2: split() and split_part()
Split delimited strings into arrays or extract specific parts.
Syntax
```
split(string, delimiter)          # Returns JSON array
split_part(string, delimiter, N)  # Returns Nth part (1-based)
```
Key Differences:
- split() returns the entire array: ["a", "b", "c"]
- split_part() returns a single element (1-based indexing)
- Negative indices in split_part() count from the end: -1 = last, -2 = second to last
Pattern 7: Split IP Address into Octets
Use Case: Parse IP address components
```
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col first_octet:split_part(ip, ".", 1)
make_col last_octet:split_part(ip, ".", -1)
```
Results:
: "95.217.183.1"ip
:octets["95", "217", "183", "1"]
: "95"first_octet
: "1"last_octet
Use For: Network analysis, IP classification
Pattern 8: Parse Path Components
Use Case: Extract parts of file paths or URLs
```
make_col path_parts:split("/var/log/app/error.log", "/")
make_col filename:split_part("/var/log/app/error.log", "/", -1)
make_col directory:split_part("/var/log/app/error.log", "/", -2)
```
Results:
- path_parts: ["", "var", "log", "app", "error.log"]
- filename: "error.log"
- directory: "app"
Use For: File path analysis, URL parsing
Pattern 9: Parse CSV-Like Data
Use Case: Extract fields from comma-separated values in logs
```
extract_regex body, /data:(?P<csv_data>[\w,]+)/
make_col fields:split(csv_data, ",")
make_col field1:split_part(csv_data, ",", 1)
make_col field2:split_part(csv_data, ",", 2)
make_col field3:split_part(csv_data, ",", 3)
```
Use For: Legacy systems, CSV exports in logs
Function 3: parse_json()
Parse JSON strings into queryable objects.
Syntax
```
parse_json(json_string)
```
Returns: OPAL object that can be accessed with JSONPath
Pattern 10: Parse JSON from Logs
Use Case: Extract fields from JSON-formatted log messages
Log Format:
MetricsExporter {"kind": "exporter", "data_type": "metrics", "metrics": 23, "data points": 61}
```
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
```
Result:
- json_data: {"kind": "exporter", "data_type": "metrics", ...} (string)
- parsed: {"kind": "exporter", "data_type": "metrics", ...} (object)
Next: Access fields using JSONPath (see below)
Function 4: JSONPath and Array Access
Navigate parsed JSON objects and arrays.
Syntax
```
object.field_name           # Simple field
object."field with spaces"  # Quoted for special chars
array[0]                    # Zero-based array indexing
object.nested.field         # Nested access
```
Critical: Field names with spaces or special characters MUST be quoted
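These forms compose. For instance, with a parsed object that contains an array of objects (the items and name fields here are hypothetical):

```
make_col parsed:parse_json(json_data)
make_col first_item:string(parsed.items[0].name)  # nested field inside an array element
```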
Pattern 11: Access JSON Fields
Use Case: Extract specific fields from parsed JSON
```
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_type:string(parsed."data_type")
make_col metrics_count:int64(parsed.metrics)
make_col data_points:int64(parsed."data points")
```
Results:
: "metrics" (string)data_type
: 23 (int64)metrics_count
: 61 (int64)data_points
Key Points:
- Use quotes for "data_type" and "data points" (special chars/spaces)
- Type cast with int64(), string(), etc.
Pattern 12: Array Access with Split
Use Case: Access specific array elements
```
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col parts:split(ip, ".")
make_col first_octet:int64(parts[0])
make_col second_octet:int64(parts[1])
make_col third_octet:int64(parts[2])
make_col fourth_octet:int64(parts[3])
```
Key Point: Array indexing is zero-based ([0] = first element).
Difference from split_part():
- split_part(): 1-based (first element = 1)
- Array [N]: 0-based (first element = 0)
Complete Examples
Example 1: Parse Application Errors
Goal: Extract error codes and messages from application logs
```
filter body ~ /ERROR/
extract_regex body, /\[(?P<log_time>[\d\-: ,]+)\] (?P<level>\w+) \[(?P<component>\w+)\] (?P<error_code>\w+): (?P<message>.*)/
statsby error_count:count(), sample:any(message), group_by(error_code, component)
sort desc(error_count)
```
Use Case: Error analysis, identifying most common errors
Example 2: Parse and Analyze HTTP Status Codes
Goal: Analyze HTTP response codes and response times
```
filter body ~ /\d{3} \d+ms/
extract_regex body, /(?P<method>\w+) (?P<path>[\w\/\-\.]+) (?P<status::int64>\d{3}) (?P<duration_ms::int64>\d+)ms/
make_col status_class:if(status >= 500, "5xx", if(status >= 400, "4xx", if(status >= 300, "3xx", if(status >= 200, "2xx", "other"))))
statsby request_count:count(), avg_duration:avg(duration_ms), p95_duration:percentile(duration_ms, 0.95), group_by(status_class, path)
sort desc(request_count)
```
Use Case: Performance analysis, identifying slow endpoints
Example 3: Correlate Requests Across Services
Goal: Track requests through multiple services using request_id
```
filter body ~ /request_id=/
extract_regex body, /request_id=(?P<request_id>[\w\-]+) \[info\] (?P<method>\w+) (?P<path>[\w\/]+)/
make_col service:string(resource_attributes."k8s.deployment.name")
statsby services:count_distinct(service), total_logs:count(), group_by(request_id)
filter services > 1
sort desc(services)
```
Use Case: Distributed tracing, identifying cross-service requests
Example 4: Parse JSON Metrics and Alert
Goal: Extract metrics from JSON logs and find anomalies
```
filter body ~ /MetricsExporter/
extract_regex body, /MetricsExporter.(?P<json_data>\{.*\})/
make_col parsed:parse_json(json_data)
make_col data_points:int64(parsed."data points")
make_col metrics_count:int64(parsed.metrics)
filter data_points > 200 or metrics_count > 50
statsby high_count:count(), avg_points:avg(data_points), group_by(metrics_count)
```
Use Case: Monitoring metric collection, detecting unusual activity
Example 5: Network Traffic Analysis
Goal: Analyze network connections by IP range
```
filter body ~ /IP=/
extract_regex body, /IP=(?P<ip>[\d\.]+)/
make_col octets:split(ip, ".")
make_col network:split_part(ip, ".", 1)
make_col is_private:if(network = "10" or network = "172" or network = "192", true, false)
statsby connection_count:count(), unique_ips:count_distinct(ip), group_by(is_private, network)
sort desc(connection_count)
```
Use Case: Security analysis, network traffic patterns
Decision Tree: Which Function to Use?
```
Need to extract data from logs?
│
├─ Data has clear pattern (timestamp, IP, ID)
│  └─ Use extract_regex() with named captures
│
├─ Data is delimited (CSV, path, separated values)
│  ├─ Need all parts → Use split()
│  └─ Need specific part → Use split_part()
│
├─ Data is JSON formatted
│  ├─ Extract JSON first → extract_regex()
│  ├─ Parse JSON → parse_json()
│  └─ Access fields → JSONPath (object.field)
│
└─ Data is mixed (pattern + delimited + JSON)
   └─ Combine: extract_regex() → split() → parse_json()
```
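For the mixed case, a sketch chaining all three (the log format and the meta/tags field names are made up for illustration):

```
# Input line: conn src=10.0.0.5 meta={"region": "us-east", "tags": "a,b,c"}
extract_regex body, /src=(?P<src_ip>[\d\.]+) meta=(?P<meta_json>\{.*\})/
make_col parsed:parse_json(meta_json)
make_col region:string(parsed.region)
make_col tags:split(string(parsed.tags), ",")
make_col first_tag:tags[0]
```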
Common Mistakes and Solutions
Mistake 1: Using Reserved Column Names
ERROR:
```
extract_regex body, /(?P<timestamp>[\d\-:]+)/
# Error: regex capture group 1 overwrites 'valid from' column "timestamp"
```
FIX:
```
extract_regex body, /(?P<log_time>[\d\-:]+)/
```
Reserved names: timestamp, valid_from, valid_to, _c_bucket
Mistake 2: Forgetting Timestamp in pick_col
ERROR:
```
extract_regex body, /(?P<field>\w+)/
pick_col field
# Error: need to pick 'valid from' column "timestamp"
```
FIX:
```
extract_regex body, /(?P<field>\w+)/
pick_col timestamp, field
```
Mistake 3: Wrong Regex Delimiters
ERROR:
```
extract_regex body, "pattern"  # Quotes don't work
```
FIX:
```
extract_regex body, /pattern/  # Forward slashes required
```
Mistake 4: Tab Character in Regex
ERROR:
```
extract_regex body, /field\t(?P<value>.*)/  # Error: Unknown function 't()'
```
FIX:
```
extract_regex body, /field.(?P<value>.*)/      # Use . for any char
# OR
extract_regex body, /field[\t ](?P<value>.*)/  # Character class
```
Mistake 5: JSONPath Without Quotes
ERROR:
```
string(parsed.data points)  # Syntax error (space in name)
```
FIX:
string(parsed."data points") # Quote field names with spaces
Mistake 6: Confusing split_part() vs Array Indexing
Remember:
- split_part() is 1-based (first element = 1)
- Array [N] is 0-based (first element = 0)
```
make_col parts:split("a.b.c", ".")
make_col using_split_part:split_part("a.b.c", ".", 1)  # "a" (1-based)
make_col using_array:parts[0]                          # "a" (0-based)
```
Error Handling
Regex Non-Matches
When regex doesn't match a log line:
- Extracted columns are null (not an error)
- Original data is preserved
- Filter nulls if needed: filter extracted_field != null
```
extract_regex body, /user=(?P<user>\w+)/
# Logs without "user=" will have user=null
filter user != null  # Keep only matched logs
```
Invalid JSON
When parse_json() receives invalid JSON:
- Returns null (not an error)
- Check before accessing: filter parsed != null
```
make_col parsed:parse_json(maybe_json)
filter parsed != null
make_col field:string(parsed.field_name)
```
Array Out of Bounds
When accessing array[999] on a smaller array:
- Returns null (not an error)
- No exception thrown
```
make_col parts:split("a.b.c", ".")  # ["a","b","c"]
make_col safe:parts[0]              # "a"
make_col oob:parts[999]             # null (no error)
```
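Putting these behaviors together, a defensive pipeline can gate on null at each stage (a sketch reusing the hypothetical meta/tags fields from the mixed-data example above):

```
extract_regex body, /meta=(?P<json_data>\{.*\})/
filter json_data != null    # regex matched
make_col parsed:parse_json(json_data)
filter parsed != null       # JSON was valid
make_col tags:split(string(parsed.tags), ",")
make_col first_tag:tags[0]  # null, not an error, if the array is empty
```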
Performance Tips
1. Filter Before Extracting
Extract from relevant logs only:
```
# GOOD - Filter first
filter body ~ /ERROR/
extract_regex body, /ERROR: (?P<error_code>\w+)/

# BAD - Extract from all logs
extract_regex body, /ERROR: (?P<error_code>\w+)/
filter error_code != null
```
2. Single Regex with Multiple Captures
One regex is faster than multiple:
```
# GOOD - Single regex
extract_regex body, /\[(?P<time>[\d: ,]+)\] (?P<level>\w+) (?P<msg>.*)/

# BAD - Multiple regexes
extract_regex body, /\[(?P<time>.*)\]/
extract_regex body, /\] (?P<level>.*) /
extract_regex body, /(?P<msg>.*)/
```
3. Anchor Patterns When Possible
Anchored patterns (^, $) perform better:
```
# GOOD - Anchored (log_time avoids the reserved name "timestamp")
extract_regex body, /^(?P<log_time>[\d\-:]+) /

# SLOWER - Unanchored (searches entire string)
extract_regex body, /(?P<log_time>[\d\-:]+) /
```
4. Only Parse JSON When Needed
```
# GOOD - Filter, then parse
filter body ~ /\{.*"error"/
make_col parsed:parse_json(body)

# BAD - Parse everything
make_col parsed:parse_json(body)
filter parsed != null
```
Related Skills
- filtering-event-datasets - Text search and filtering before extraction
- aggregating-event-datasets - Aggregating extracted fields with statsby
- investigating-textual-data - Error analysis workflows using extraction
- analyzing-text-patterns - Pattern discovery to identify what to extract
Key Takeaways
- extract_regex() is most powerful - supports type casting and multiple captures
- Forward slashes required for regex: /pattern/, not "pattern"
- Type cast in the capture group: (?P<name::int64>pattern) for immediate conversion
- split_part() is 1-based, array [N] is 0-based
- Quote JSONPath fields with spaces or special characters
- Reserved names: avoid timestamp, valid_from, valid_to
- Nulls, not errors: non-matches return null, not exceptions
- Filter before extract: better performance
When in doubt about regex syntax or parsing functions, use learn_observe_skill("OPAL extract_regex") for official documentation.