Claude-skill-registry io-utilities

Guide for using IO utilities in speedy_utils, including fast JSONL reading, multi-format loading, and file serialization.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/io-utilities" ~/.claude/skills/majiayu000-claude-skill-registry-io-utilities && rm -rf "$T"

Manifest: skills/data/io-utilities/SKILL.md

Source Content

IO Utilities Guide

This skill provides comprehensive guidance for using the IO utilities in speedy_utils.

When to Use This Skill

Use this skill when you need to:

  • Read and write data in various formats (JSON, JSONL, Pickle, CSV, TXT).
  • Efficiently process large JSONL files with streaming and multi-threading.
  • Automatically handle file compression (gzip, bz2, xz, zstd).
  • Load data based on file extension automatically.
  • Serialize Pydantic models and other objects easily.

Prerequisites

  • speedy_utils installed.
  • Optional dependencies for specific features:
    • orjson: for faster JSON parsing.
    • zstandard: for .zst file support.
    • pandas: for CSV/TSV loading.
    • pyarrow: for faster CSV reading with pandas.

Core Capabilities

Fast JSONL Processing (fast_load_jsonl)

  • Streams data line-by-line for memory efficiency.
  • Supports automatic decompression.
  • Uses orjson if available for speed.
  • Supports multi-threaded processing for large files.
  • Shows a progress bar with tqdm.
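
A minimal sketch of materializing the stream, assuming only the documented call signature; do this only when the decompressed data is known to fit in memory (the file name is illustrative):

from speedy_utils import fast_load_jsonl

# Materialize the lazy stream into a list; only do this when the
# decompressed data fits comfortably in memory.
records = list(fast_load_jsonl('events.jsonl.zst', progress=True))
print(len(records))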

Universal Loading (load_by_ext)

  • Detects file type by extension.
  • Supports glob patterns (e.g., data/*.json) and lists of files.
  • Uses parallel processing for multiple files.
  • Supports memoization via do_memoize=True (see the sketch below).
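
A brief sketch of the memoization flag; do_memoize is the only keyword this guide names, and the path is illustrative:

from speedy_utils import load_by_ext

# The first call parses the file; with do_memoize=True, repeated
# calls can reuse the cached result instead of re-reading from disk.
config = load_by_ext('config.json', do_memoize=True)
config_again = load_by_ext('config.json', do_memoize=True)  # served from cache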

Serialization (dump_json_or_pickle, load_json_or_pickle)

  • Unified interface for JSON and Pickle.
  • Handles Pydantic models automatically (see the sketch below).
  • Creates parent directories if they don't exist.
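
A hedged sketch of the Pydantic handling; the User model is invented for illustration, and the exact on-disk representation is up to the library:

from pydantic import BaseModel
from speedy_utils import dump_json_or_pickle, load_json_or_pickle

class User(BaseModel):  # hypothetical model, for illustration only
    name: str
    age: int

# The model is serialized automatically, and 'cache/' is created
# if it doesn't exist; loading returns the deserialized data.
dump_json_or_pickle(User(name="Ada", age=36), 'cache/user.json')
data = load_json_or_pickle('cache/user.json')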

Usage Examples

Example 1: Streaming Large JSONL

Read a large compressed JSONL file line by line.

from speedy_utils import fast_load_jsonl

# Iterates lazily, low memory usage
for item in fast_load_jsonl('large_data.jsonl.gz', progress=True):
    process(item)
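
Because the iterator is lazy, you can also peek at the first few records without parsing the whole file (islice comes from the standard library; the path is the same illustrative one as above):

from itertools import islice
from speedy_utils import fast_load_jsonl

# Take only the first 5 records; the rest of the file is never read.
head = list(islice(fast_load_jsonl('large_data.jsonl.gz'), 5))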

Example 2: Loading Any File

Load a file without worrying about the format.

from speedy_utils import load_by_ext

data = load_by_ext('config.json')
df = load_by_ext('data.csv')
items = load_by_ext('dataset.pkl')

Example 3: Parallel Loading

Load multiple files in parallel.

from speedy_utils import load_by_ext

# Returns a list of results, one for each file
all_data = load_by_ext('logs/*.jsonl')
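
Since load_by_ext also accepts an explicit list of paths (per the capabilities above), the same parallel loading works without a glob; the file names are illustrative:

from speedy_utils import load_by_ext

# Explicit list of files instead of a glob pattern.
all_data = load_by_ext(['logs/day1.jsonl', 'logs/day2.jsonl'])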

Example 4: Dumping Data

Save data to disk, creating directories as needed.

from speedy_utils import dump_json_or_pickle

data = {"key": "value"}
dump_json_or_pickle(data, 'output/processed/result.json')
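
Choosing a .pkl extension switches the same call to Pickle; a minimal round trip, assuming load_json_or_pickle mirrors the dump call:

from speedy_utils import dump_json_or_pickle, load_json_or_pickle

# Same data, Pickle on disk this time; load it back by path.
dump_json_or_pickle({"key": "value"}, 'output/processed/result.pkl')
restored = load_json_or_pickle('output/processed/result.pkl')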

Guidelines

  1. Prefer JSONL for Large Datasets:
     • Use fast_load_jsonl for datasets that don't fit in memory.
     • It handles compression transparently, so keep files compressed (.jsonl.gz or .jsonl.zst) to save space.
  2. Use load_by_ext for Scripts:
     • When writing scripts that might accept different input formats, use load_by_ext to be flexible.
  3. Error Handling:
     • fast_load_jsonl has an on_error parameter (raise, warn, skip) to handle malformed lines gracefully; see the sketch after this list.
  4. Performance:
     • Install orjson for significantly faster JSON operations.
     • load_by_ext uses the pyarrow engine for CSVs if available, which is much faster.
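
A minimal sketch of guideline 3, using only the on_error values named above (the path and process are illustrative):

from speedy_utils import fast_load_jsonl

# Skip malformed lines instead of raising, so one bad record
# doesn't abort the whole pass over the file.
for item in fast_load_jsonl('noisy_logs.jsonl', on_error='skip'):
    process(item)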

Limitations

  • Memory Usage: load_by_ext loads the entire file into memory. Use fast_load_jsonl for streaming.
  • Glob Expansion: load_by_ext with glob patterns loads all matching files into memory at once (in a list). Be careful with massive datasets.