Daft daft-distributed-scaling

Scale Daft workflows to distributed Ray clusters. Invoke when optimizing performance or handling large data.

install
source · Clone the upstream repo
git clone https://github.com/Eventual-Inc/Daft
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Eventual-Inc/Daft "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/daft-distributed-scaling" ~/.claude/skills/eventual-inc-daft-daft-distributed-scaling && rm -rf "$T"
manifest: .claude/skills/daft-distributed-scaling/SKILL.md
source content

Daft Distributed Scaling

Scale single-node workflows to distributed execution.

Core Strategies

| Strategy  | API             | Use Case                             | Pros/Cons                                                                  |
|-----------|-----------------|--------------------------------------|----------------------------------------------------------------------------|
| Shuffle   | repartition(N)  | Light data (e.g. file paths), joins  | Global balance. High memory usage (materializes data).                      |
| Streaming | into_batches(N) | Heavy data (images, tensors)         | Low memory (streaming). High scheduling overhead if batches are too small.  |

Quick Recipes

1. Light Data: Repartitioning

Best for distributing file paths before heavy reads.

import daft

# Create enough partitions to saturate workers before the heavy read
df = daft.read_parquet("s3://metadata").repartition(100)
df = df.with_column("data", read_heavy_data(df["path"]))  # read_heavy_data: user-defined UDF (see sketch below)

2. Heavy Data: Streaming Batches

Best for processing large partitions without OOM.

import daft

# Stream each large (~1GB) partition in 64-row chunks to control memory
df = daft.read_parquet("heavy_data").into_batches(64)
df = df.with_column("embed", model.predict(df["img"]))  # model: user-defined inference UDF (see sketch below)

Advanced Tuning

Target: Keep all actors busy without OOM or scheduling bottlenecks.

Formula 1: Repartitioning (Light Data / Paths)

Calculate the maximum partition count that still leaves each task enough rows to feed its local actors.

  1. Min Rows Per Partition = Batch Size * (Total Concurrency / Nodes)
  2. Max Partitions = Total Rows / Min Rows Per Partition

Example:

  • 1M rows, 4 nodes, 16 total concurrency, Batch Size 64.
  • Min Rows Per Partition: 64 * (16 / 4) = 256.
  • Max Partitions: 1,000,000 / 256 ≈ 3906.
  • Recommendation: Use ~1000 partitions to run multiple batches per task.
df = df.repartition(1000) # Balanced fan-out
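
The same arithmetic in Python, for plugging in your own cluster numbers:

# Formula 1 with the example's values: 1M rows, 4 nodes, 16 total concurrency, batch size 64
total_rows = 1_000_000
nodes = 4
total_concurrency = 16
batch_size = 64

min_rows_per_partition = batch_size * (total_concurrency // nodes)  # 64 * 4 = 256
max_partitions = total_rows // min_rows_per_partition               # 1,000,000 // 256 = 3906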

Formula 2: Streaming (Heavy Data / Images)

Avoid creating tiny partitions. Use into_batches to stream data within larger partitions.

Strategy: Keep partitions large (e.g. 1GB+) and use into_batches(Batch Size) to control memory.

# Stream batches to control memory usage per actor
df = df.into_batches(64).with_column("preds", model(max_concurrency=16).predict(df["img"]))
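
When picking the batch size, a back-of-the-envelope memory estimate per actor helps; the per-row size and in-flight batch count below are illustrative assumptions, not figures from the recipe:

# Rough per-actor memory budget for into_batches(64)
row_bytes = 5 * 1024**2         # assume ~5 MB per decoded image
batch_size = 64
in_flight_batches = 2           # assume one batch processing + one prefetching

per_actor_bytes = row_bytes * batch_size * in_flight_batches
print(f"{per_actor_bytes / 1024**3:.2f} GiB per actor")  # ~0.62 GiB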