Claude-skill-registry io-utilities
Guide for using IO utilities in speedy_utils, including fast JSONL reading, multi-format loading, and file serialization.
Install

Source · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/io-utilities" ~/.claude/skills/majiayu000-claude-skill-registry-io-utilities && rm -rf "$T"
```

Manifest: skills/data/io-utilities/SKILL.md
IO Utilities Guide
This skill provides comprehensive guidance for using the IO utilities in `speedy_utils`.
When to Use This Skill
Use this skill when you need to:
- Read and write data in various formats (JSON, JSONL, Pickle, CSV, TXT).
- Efficiently process large JSONL files with streaming and multi-threading.
- Automatically handle file compression (gzip, bz2, xz, zstd).
- Load data based on file extension automatically.
- Serialize Pydantic models and other objects easily.
Prerequisites
- `speedy_utils` installed.
- Optional dependencies for specific features:
  - `orjson`: for faster JSON parsing.
  - `zstandard`: for `.zst` file support.
  - `pandas`: for CSV/TSV loading.
  - `pyarrow`: for faster CSV reading with pandas.
Core Capabilities
Fast JSONL Processing (`fast_load_jsonl`)
- Streams data line-by-line for memory efficiency.
- Supports automatic decompression.
- Uses `orjson` if available for speed.
- Supports multi-threaded processing for large files.
- Shows a progress bar with `tqdm`.
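To show what this streaming pattern looks like in principle, here is a minimal stdlib sketch (the helper name `stream_jsonl` is hypothetical; this illustrates the idea, not the internals of `fast_load_jsonl`):

```python
import gzip
import json

def stream_jsonl(path):
    """Yield one parsed object per line from a gzip-compressed JSONL file."""
    # Reading line-by-line keeps memory usage flat regardless of file size.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because it is a generator, nothing is parsed until you iterate, so a multi-gigabyte file costs only one line of memory at a time.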
Universal Loading (`load_by_ext`)
- Detects file type by extension.
- Supports glob patterns (e.g., `data/*.json`) and lists of files.
- Uses parallel processing for multiple files.
- Supports memoization via `do_memoize=True`.
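The glob-plus-parallel behaviour can be sketched with the standard library alone (the helper `load_many` is hypothetical and handles only JSON; it is not the library's code):

```python
import glob
import json
from concurrent.futures import ThreadPoolExecutor

def load_many(pattern):
    """Load every JSON file matching a glob pattern, in parallel."""
    paths = sorted(glob.glob(pattern))

    def load_one(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    # Threads overlap the file I/O; map() preserves the sorted path order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(load_one, paths))
```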
Serialization (`dump_json_or_pickle`, `load_json_or_pickle`)
- Unified interface for JSON and Pickle.
- Handles Pydantic models automatically.
- Creates parent directories if they don't exist.
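A minimal sketch of extension-based dispatch with parent-directory creation, using only the stdlib (the names `dump_auto`/`load_auto` are hypothetical and the real functions may cover more formats):

```python
import json
import pickle
from pathlib import Path

def dump_auto(obj, path):
    """Serialize to JSON or Pickle depending on the file extension."""
    path = Path(path)
    # Create parent directories so callers never hit FileNotFoundError.
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == ".json":
        path.write_text(json.dumps(obj), encoding="utf-8")
    else:
        path.write_bytes(pickle.dumps(obj))

def load_auto(path):
    """Inverse of dump_auto: pick the decoder from the extension."""
    path = Path(path)
    if path.suffix == ".json":
        return json.loads(path.read_text(encoding="utf-8"))
    return pickle.loads(path.read_bytes())
```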
Usage Examples
Example 1: Streaming Large JSONL
Read a large compressed JSONL file line by line.
```python
from speedy_utils import fast_load_jsonl

# Iterates lazily, low memory usage
for item in fast_load_jsonl('large_data.jsonl.gz', progress=True):
    process(item)
```
Example 2: Loading Any File
Load a file without worrying about the format.
```python
from speedy_utils import load_by_ext

data = load_by_ext('config.json')
df = load_by_ext('data.csv')
items = load_by_ext('dataset.pkl')
```
Example 3: Parallel Loading
Load multiple files in parallel.
```python
from speedy_utils import load_by_ext

# Returns a list of results, one for each file
all_data = load_by_ext('logs/*.jsonl')
```
Example 4: Dumping Data
Save data to disk, creating directories as needed.
```python
from speedy_utils import dump_json_or_pickle

data = {"key": "value"}
dump_json_or_pickle(data, 'output/processed/result.json')
```
Guidelines
- Prefer JSONL for Large Datasets:
  - Use `fast_load_jsonl` for datasets that don't fit in memory.
  - It handles compression transparently, so keep files compressed (`.jsonl.gz` or `.jsonl.zst`) to save space.
- Use `load_by_ext` for Scripts:
  - When writing scripts that might accept different input formats, use `load_by_ext` to stay flexible.
- Error Handling:
  - `fast_load_jsonl` has an `on_error` parameter (`raise`, `warn`, `skip`) to handle malformed lines gracefully.
- Performance:
  - Install `orjson` for significantly faster JSON operations.
  - `load_by_ext` uses the `pyarrow` engine for CSVs if available, which is much faster.
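To make the three `on_error` policies concrete, here is a hypothetical stdlib reimplementation of the idea (the real `fast_load_jsonl` signature and behaviour may differ):

```python
import json

def parse_lines(lines, on_error="raise"):
    """Parse JSONL lines, applying one of three policies to malformed input."""
    results = []
    for i, line in enumerate(lines):
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            if on_error == "raise":
                raise  # propagate the original error
            if on_error == "warn":
                print(f"warning: skipping malformed line {i}")
            # both "warn" and "skip" drop the bad line and continue
    return results
```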
Limitations
- Memory Usage: `load_by_ext` loads the entire file into memory. Use `fast_load_jsonl` for streaming.
- Glob Expansion: `load_by_ext` with glob patterns loads all matching files into memory at once (in a list). Be careful with massive datasets.