Pydantic-deepagents data-formats

Working with diverse data formats: binary, text, structured, and custom

install
source · Clone the upstream repo
git clone https://github.com/vstorm-co/pydantic-deepagents
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vstorm-co/pydantic-deepagents "$T" && mkdir -p ~/.claude/skills && cp -r "$T/apps/cli/skills/data-formats" ~/.claude/skills/vstorm-co-pydantic-deepagents-data-formats && rm -rf "$T"
manifest: apps/cli/skills/data-formats/SKILL.md
source content

Data Formats

How to work with diverse and unknown data formats.

Format Detection

Always inspect before parsing:

file <filename>                    # file type via magic bytes (file --mime-type for MIME)
xxd <filename> | head -5           # hex dump (first bytes)
head -3 <filename>                 # text preview
python3 -c "
with open('<filename>', 'rb') as f:
    h = f.read(16)
    print(h, h.hex())
"

Common Formats

Binary

  • Magic bytes: Most binary formats start with a signature (ELF: \x7fELF, PNG: \x89PNG)
  • Endianness: Check if little-endian or big-endian (struct.unpack('<I', ...) vs '>I')
  • Alignment: Fields are often aligned to 4 or 8 bytes
  • Offsets: Binary headers often contain offsets to other sections (see the header-parsing sketch after this list)
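
A minimal sketch tying these bullets together with Python's struct module; the 16-byte layout (4-byte magic, u32 version, u64 section offset) and the filename 'sample.bin' are assumptions for illustration, not a real format:

import struct

with open('sample.bin', 'rb') as f:   # hypothetical file
    header = f.read(16)

magic = header[:4]                    # e.g. b'\x7fELF' or b'\x89PNG'
print('magic:', magic)

# '<' = little-endian, '>' = big-endian; decode both when endianness is unknown
version_le, = struct.unpack_from('<I', header, 4)
version_be, = struct.unpack_from('>I', header, 4)
print('version LE:', version_le, '| BE:', version_be)

# Hypothetical u64 offset to another section, 8-byte aligned at byte 8
offset, = struct.unpack_from('<Q', header, 8)
print('section offset (assuming LE):', offset)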

Structured text

  • CSV/TSV: Check delimiter (comma, tab, pipe), quoting, header row (a sniffing sketch follows this list)
  • JSON: python3 -c "import json; json.load(open('f'))"
  • YAML: Check indentation, anchors/aliases
  • TOML: python3 -c "import tomllib; ..."
  • XML: Check encoding declaration, namespaces
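
A hedged sketch of delimiter and header detection using the standard library's csv.Sniffer; 'data.csv' and the candidate delimiter set are assumptions:

import csv

with open('data.csv', newline='') as f:   # placeholder filename
    sample = f.read(4096)                 # sniff a small sample only
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample, delimiters=',\t|;')
    print('delimiter:', repr(dialect.delimiter))
    print('header row:', sniffer.has_header(sample))
    f.seek(0)
    print('first row:', next(csv.reader(f, dialect)))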

Checkpoints / Model files

  • PyTorch: .pt, .pth → torch.load(f, map_location='cpu') (see the inspection sketch after this list)
  • TensorFlow: .ckpt → index + data files, use tf.train.load_checkpoint()
  • NumPy: .npy, .npz → numpy.load()
  • HuggingFace: config.json + model.safetensors
  • ONNX: onnx.load()
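
A minimal sketch for the PyTorch case, assuming torch is installed; 'model.pt' is a placeholder path. Checkpoints are commonly a dict, either a raw state_dict or a wrapper with keys like 'state_dict' and 'epoch':

import torch

state = torch.load('model.pt', map_location='cpu')  # CPU load: no GPU needed
print(type(state))
if isinstance(state, dict):
    for key, value in list(state.items())[:10]:     # peek at the first entries
        print(key, getattr(value, 'shape', type(value).__name__))

Note that recent torch releases default torch.load to weights_only=True, so wrappers holding arbitrary Python objects may need weights_only=False; pass that only for files you trust.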

Database files

  • SQLite: file says "SQLite 3.x database" → sqlite3 <file> ".tables"
  • WAL files: SQLite write-ahead log; fold it into the main database with sqlite3's PRAGMA wal_checkpoint
  • CSV dumps: Often need schema inference (a schema-peeking sketch follows this list)
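
A sketch of schema inspection with the standard library's sqlite3 module; 'data.db' is a placeholder:

import sqlite3

con = sqlite3.connect('data.db')   # placeholder path
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print('tables:', tables)
for name in tables:
    # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = con.execute(f'PRAGMA table_info({name})').fetchall()
    print(name, '->', [(c[1], c[2]) for c in cols])
con.close()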

Parsing Strategies

Unknown binary format

  1. Hex dump first 256 bytes: xxd file | head -16
  2. Look for magic bytes, version numbers, string tables
  3. Check file size: does it suggest a pattern? (e.g., N * record_size; see the sketch after this list)
  4. Look for documentation of the format online
  5. Write a minimal parser, test on known values
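
A small sketch of step 3, testing whether the size fits header + N * record_size; the filename and candidate sizes are assumptions:

import os

size = os.path.getsize('records.bin')         # placeholder path
for header in (0, 16, 32, 64):                # guessed header lengths
    for record in (8, 16, 24, 32, 64, 128):   # guessed record widths
        body = size - header
        if body > 0 and body % record == 0:
            print(f'header={header} record={record} -> {body // record} records')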

Large structured files

  1. Never load entirely; sample first: head, tail, shuf -n 10 (a Python sampling sketch follows this list)
  2. Check consistency: are all lines the same format?
  3. Count fields:
    head -1 file | awk -F',' '{print NF}'
  4. Watch for: mixed types, missing values, encoding issues
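
A sampling sketch for steps 2 and 3, reading only the first 1,000 lines; 'big.csv' is a placeholder:

import csv
from collections import Counter

field_counts = Counter()
with open('big.csv', newline='', errors='replace') as f:   # placeholder path
    for i, row in enumerate(csv.reader(f)):
        field_counts[len(row)] += 1                        # fields per line
        if i >= 999:
            break
# One key means consistent rows; several keys flag ragged or mis-quoted lines
print(field_counts)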

Multi-file datasets

  1. List all files and sizes
  2. Look for manifest/index files (often JSON or CSV)
  3. Check naming patterns: timestamps, sequence numbers, shards
  4. Process one file first, then generalize (a listing sketch follows this list)
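
A sketch for steps 1 through 3 using pathlib; the directory and manifest filenames are assumptions:

from pathlib import Path

files = sorted(Path('dataset').iterdir())       # placeholder directory
for p in files:
    print(f'{p.stat().st_size:>12}  {p.name}')  # size + name, sorted
# Sorted names make shard/sequence patterns visible; also look for an index
manifests = [p for p in files if p.name.lower() in ('manifest.json', 'index.csv')]
print('manifest candidates:', manifests)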

Common Pitfalls

  • Assuming UTF-8 when the file is Latin-1 or binary (a decoding sketch follows this list)
  • Assuming CSV when it's TSV (or vice versa)
  • Ignoring the header row
  • Not handling quoted fields with embedded delimiters
  • Reading binary files as text (corrupts data)
  • Endianness mismatch (x86 is little-endian, network byte order is big-endian)
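
A minimal sketch for the encoding and binary pitfalls: decode strictly, fall back explicitly, and open unknown files in binary mode so nothing is silently corrupted; 'data.txt' is a placeholder:

with open('data.txt', 'rb') as f:   # binary mode: no silent decoding
    raw = f.read()
try:
    text = raw.decode('utf-8')
    print('valid UTF-8')
except UnicodeDecodeError as e:
    print(f'not UTF-8 (failed at byte {e.start}); falling back to latin-1')
    # latin-1 maps every byte, so it never fails; sanity-check the result
    text = raw.decode('latin-1')
print(text[:200])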