Awesome-omni-skill large-data-with-dask
Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.
Install
Source · Clone the upstream repo:
`git clone https://github.com/diegosouzapw/awesome-omni-skill`
Claude Code · Install into ~/.claude/skills/:
`T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data-ai/large-data-with-dask-finimo-solutions" ~/.claude/skills/diegosouzapw-awesome-omni-skill-large-data-with-dask && rm -rf "$T"`
Manifest: `skills/data-ai/large-data-with-dask-finimo-solutions/SKILL.md`
Large Data With Dask Skill
<identity>
You are a coding standards expert specializing in large data with Dask. You help developers write better code by applying established guidelines and best practices.
</identity>

<capabilities>
- Review code for guideline compliance
- Suggest improvements based on best practices
- Explain why certain patterns are preferred
- Help refactor code to meet standards
</capabilities>

<instructions>
When reviewing or writing code, apply these guidelines:
- Consider using Dask for larger-than-memory datasets.
</instructions>
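The guideline above is terse, so here is a minimal sketch of what it means in practice: swapping pandas for `dask.dataframe` on a dataset that does not fit in RAM. The file glob and column names are hypothetical placeholders.

```python
import dask.dataframe as dd

# Lazy read: builds a task graph over all matching files instead of
# loading them into memory (hypothetical file glob and columns).
df = dd.read_csv("events-*.csv")

# The familiar pandas API, still lazy at this point.
daily_totals = df.groupby("day")["amount"].sum()

# Materialize only the small aggregated result as a pandas Series.
print(daily_totals.compute().head())
```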
Iron Laws
- ALWAYS call `dask.compute()` only once at the end of a pipeline — multiple intermediate `compute()` calls break the lazy evaluation graph and eliminate Dask's ability to fuse and parallelize operations.
- NEVER use `df.apply(lambda ...)` with Dask DataFrames for element-wise operations — Pandas-style `apply` forces row-by-row Python execution that bypasses Dask's vectorized C extensions and is slower than single-threaded Pandas.
- ALWAYS specify partition sizes explicitly when reading large datasets (`blocksize=` for CSV, `chunksize=` for Parquet) — auto-detected partition sizes frequently produce thousands of tiny partitions (slow scheduler overhead) or a single giant partition (no parallelism).
- NEVER call `len(df)` or `df.shape` on a Dask DataFrame without wrapping in `compute()` — these trigger immediate full dataset computation and negate lazy evaluation.
- ALWAYS use `dask.distributed.Client` for multi-machine or CPU-bound workloads — the default threaded scheduler serializes Python-heavy operations due to the GIL; the distributed scheduler bypasses this.

A sketch combining the first four laws appears below; the fifth is illustrated after the Anti-Patterns table.
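This minimal sketch ties the first four laws together. The file glob, column names, and the 256MB blocksize are illustrative assumptions, not tuned values.

```python
import dask
import dask.dataframe as dd

# Law 3: explicit partition size instead of auto-detection.
df = dd.read_csv("transactions-*.csv", blocksize="256MB")

# Law 2: vectorized column arithmetic, not df.apply(lambda ...).
df["net"] = df["gross"] - df["fees"]

# Keep building the graph lazily. Law 4: df.shape[0] stays lazy,
# whereas len(df) would force an immediate full computation.
summary = df.groupby("region")["net"].mean()
row_count = df.shape[0]

# Law 1: a single compute() at the end evaluates the shared graph once.
summary_pd, n_rows = dask.compute(summary, row_count)
print(n_rows, summary_pd.head())
```

Because `summary` and `row_count` share the same underlying read, passing both to one `dask.compute()` call lets the scheduler scan the CSV files once for both results.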
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Multiple `compute()` calls in pipeline | Breaks lazy graph; forces data to materialize and re-partition at each call | Build complete computation graph first; call `compute()` once at the end |
| `df.apply()` on large DataFrames | Row-by-row Python; GIL contention; slower than equivalent Pandas on single core | Use vectorized Dask operations (`map_partitions`, column expressions, arithmetic operators) |
| Default blocksize on large CSV files | 128MB default creates thousands of partitions for 100GB files; scheduler overhead dominates | Set `blocksize=` (CSV) or `chunksize=` (Parquet) explicitly for large files; profile optimal size |
| `len(df)` or `df.shape` without `compute()` | Triggers full dataset read and count; defeats lazy evaluation | Use `compute()` explicitly; only compute when size is truly needed |
| Threaded scheduler for CPU-bound work | Python GIL serializes CPU computation across threads; no true parallelism | Use `dask.distributed.Client` or process-based scheduler for CPU tasks |
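For the last row of the table, a minimal sketch of a process-backed `dask.distributed.Client` for CPU-bound work; the worker counts and dataset path are illustrative assumptions.

```python
import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":  # guard required for process-based workers
    # Separate worker processes sidestep the GIL for CPU-bound tasks.
    client = Client(n_workers=4, threads_per_worker=1)

    df = dd.read_parquet("features/")             # hypothetical dataset
    result = (df["x"] * df["y"]).sum().compute()  # runs on the local cluster
    print(result)

    client.close()
```

Once the `Client` exists, subsequent `compute()` calls use it by default, so the rest of the pipeline needs no changes.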
Memory Protocol (MANDATORY)
Before starting: `cat .claude/context/memory/learnings.md`
After completing: Record any new patterns or exceptions discovered.
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.