Pydantic-deepagents data-formats
Working with diverse data formats: binary, text, structured, and custom
install
source · Clone the upstream repo

```bash
git clone https://github.com/vstorm-co/pydantic-deepagents
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/vstorm-co/pydantic-deepagents "$T" && mkdir -p ~/.claude/skills && cp -r "$T/apps/cli/skills/data-formats" ~/.claude/skills/vstorm-co-pydantic-deepagents-data-formats && rm -rf "$T"
```
manifest: apps/cli/skills/data-formats/SKILL.md
Data Formats
How to work with diverse and unknown data formats.
Format Detection
Always inspect before parsing:
```bash
file <filename>           # file type detection (add -i for the MIME type)
xxd <filename> | head -5  # hex dump (first bytes)
head -3 <filename>        # text preview
python3 -c "
with open('<filename>', 'rb') as f:
    h = f.read(16)
print(h, h.hex())
"
```
Common Formats
Binary
- Magic bytes: Most binary formats start with a signature (ELF: `\x7fELF`, PNG: `\x89PNG`)
- Endianness: Check if little-endian or big-endian (`struct.unpack('<I', ...)` vs `'>I'`)
- Alignment: Fields are often aligned to 4 or 8 bytes
- Offsets: Binary headers often contain offsets to other sections (see the sketch below)
Structured text
- CSV/TSV: Check delimiter (comma, tab, pipe), quoting, header row (the sniffer sketch below automates this)
- JSON: `python3 -c "import json; json.load(open('f'))"`
- YAML: Check indentation, anchors/aliases
- TOML: `python3 -c "import tomllib; ..."`
- XML: Check encoding declaration, namespaces
Checkpoints / Model files
- PyTorch: `.pt`, `.pth` → `torch.load(f, map_location='cpu')`
- TensorFlow: `.ckpt` → index + data files, use `tf.train.load_checkpoint()`
- NumPy: `.npy`, `.npz` → `numpy.load()`
- HuggingFace: `model.safetensors` + `config.json`
- ONNX: `onnx.load()`
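A sketch of a safe first look at checkpoint files, assuming numpy (and, for the second helper, torch) is installed; it only lists names and shapes rather than instantiating a model:

```python
import numpy as np

def inspect_npz(path: str) -> None:
    with np.load(path) as archive:  # .npz is a zip of named .npy arrays
        for name in archive.files:
            arr = archive[name]
            print(f"{name}: shape={arr.shape} dtype={arr.dtype}")

def inspect_torch(path: str) -> None:
    import torch
    state = torch.load(path, map_location="cpu")  # never assume a GPU is present
    if isinstance(state, dict):
        for key, value in state.items():
            shape = getattr(value, "shape", None)
            print(f"{key}: {shape if shape is not None else type(value).__name__}")
```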
Database files
- SQLite: `file` says "SQLite 3.x database" → `sqlite3 <file> ".tables"`
- WAL files: SQLite write-ahead log; open the database with `sqlite3` and run `PRAGMA wal_checkpoint` to fold pending writes back into the main file
- CSV dumps: Often need schema inference
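When the `sqlite3` CLI is unavailable, the Python stdlib module covers the same inspection; a sketch that opens the file read-only so it cannot disturb the database or its WAL:

```python
import sqlite3

def list_tables(path: str) -> None:
    # mode=ro opens read-only, so inspection can't mutate the file
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for table in tables:
            count, = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            print(f"{table}: {count} rows")
    finally:
        conn.close()
```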
Parsing Strategies
Unknown binary format
- Hex dump first 256 bytes: `xxd file | head -16`
- Look for magic bytes, version numbers, string tables
- Check file size — does it suggest a pattern? (e.g., N * record_size)
- Look for documentation of the format online
- Write a minimal parser, test on known values
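The size check and the minimal parser combined in one sketch; the `<IIff` record layout (two uint32s, two float32s, 16 bytes) is a hypothetical guess for illustration, not a documented format:

```python
import struct
from pathlib import Path

# Hypothetical fixed-size record: two uint32 fields and two float32 fields.
RECORD = struct.Struct("<IIff")

def parse_records(path: str):
    data = Path(path).read_bytes()
    # If the size isn't N * record_size, the layout hypothesis is wrong.
    if len(data) % RECORD.size != 0:
        raise ValueError(f"{len(data)} bytes is not a multiple of {RECORD.size}")
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]
```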
Large structured files
- Never load entirely; sample first: `head`, `tail`, `shuf -n 10`
- Check consistency: are all lines the same format? (see the histogram sketch below)
- Count fields: `head -1 file | awk -F',' '{print NF}'`
- Watch for: mixed types, missing values, encoding issues
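A streaming consistency check along these lines, assuming a comma-delimited file; it stops after a fixed sample so huge files stay cheap, and more than one key in the result means ragged rows:

```python
import csv
from collections import Counter

def field_count_histogram(path: str, sample: int = 10_000) -> Counter:
    """Count fields per line over a sample of rows."""
    counts: Counter = Counter()
    with open(path, newline="", errors="replace") as f:
        for i, row in enumerate(csv.reader(f)):
            counts[len(row)] += 1
            if i >= sample:
                break
    return counts
```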
Multi-file datasets
- List all files and sizes
- Look for manifest/index files (often JSON or CSV)
- Check naming patterns — timestamps, sequence numbers, shards
- Process one file first, then generalize
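A sketch of the inventory step; the shard regex is an assumption about names like `data-00001.csv` and will need adjusting for other conventions:

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: stem, separator, sequence number, suffix.
SHARD = re.compile(r"^(?P<stem>.+?)[-_.]\d+(?P<rest>.*)$")

def inventory(root: str) -> None:
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            m = SHARD.match(p.name)
            key = f"{m['stem']}*{m['rest']}" if m else p.name
            groups[key].append(p)
    for pattern, files in groups.items():
        total = sum(f.stat().st_size for f in files)
        print(f"{pattern}: {len(files)} files, {total:,} bytes")
```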
Common Pitfalls
- Assuming UTF-8 when the file is Latin-1 or binary
- Assuming CSV when it's TSV (or vice versa)
- Ignoring the header row
- Not handling quoted fields with embedded delimiters
- Reading binary files as text (corrupts data)
- Endianness mismatch (x86 is little-endian, network byte order is big-endian)
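A small sketch for the first pitfall: try strict UTF-8, then fall back to Latin-1, which decodes any byte sequence but may be wrong, so the result is labeled as a guess:

```python
def read_text_guessing(path: str) -> tuple[str, str]:
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this always succeeds;
        # treat the result as a guess, not ground truth.
        return raw.decode("latin-1"), "latin-1 (fallback guess)"
```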