install
source · Clone the upstream repo
```sh
git clone https://github.com/ComeOnOliver/skillshub
```
Claude Code · Install into ~/.claude/skills/
```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/noartem/skills/laravel-data-chunking-large-datasets" ~/.claude/skills/comeonoliver-skillshub-laravel-data-chunking-large-datasets && rm -rf "$T"
```
manifest:
skills/noartem/skills/laravel-data-chunking-large-datasets/SKILL.md
# Data Chunking for Large Datasets

Process large datasets efficiently by breaking them into manageable chunks to reduce memory consumption and improve performance.
## The Problem: Memory Exhaustion

```php
// BAD: Loading all records into memory
$users = User::all(); // Could be millions of records!

foreach ($users as $user) {
    $user->sendNewsletter();
}

// BAD: Even with select, still loads everything
$emails = User::pluck('email'); // Collection of millions of emails

foreach ($emails as $email) {
    Mail::to($email)->send(new Newsletter());
}
```
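To make the cost concrete, you can compare peak memory before and after hydrating the full table. This is a rough sketch, not a benchmark: the seeded `users` table is an assumption, and each variant should run in a fresh process because `memory_get_peak_usage()` only ever increases:

```php
// Rough measurement sketch -- run each variant in its own
// `php artisan tinker` session, since the peak counter never resets.

// Variant A: hydrate everything at once
User::all();
echo round(memory_get_peak_usage(true) / 1048576) . " MB\n";
// grows roughly linearly with the number of rows

// Variant B: chunked pass over the same table
User::chunk(500, fn ($users) => true);
echo round(memory_get_peak_usage(true) / 1048576) . " MB\n";
// stays near the framework baseline regardless of table size
```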
## Solution: Chunking Methods

### 1. Basic Chunking with chunk()

```php
// Process 100 records at a time
User::chunk(100, function ($users) {
    foreach ($users as $user) {
        $user->calculateStatistics();
        $user->save();
    }
});

// With conditions
User::where('active', true)
    ->chunk(200, function ($users) {
        foreach ($users as $user) {
            ProcessUserJob::dispatch($user);
        }
    });
```
### 2. Chunk By ID for Safer Updates

```php
// Prevents issues when modifying records during iteration
User::where('newsletter_sent', false)
    ->chunkById(100, function ($users) {
        foreach ($users as $user) {
            $user->update(['newsletter_sent' => true]);
            Mail::to($user)->send(new Newsletter());
        }
    });

// With custom column
Payment::where('processed', false)
    ->chunkById(100, function ($payments) {
        foreach ($payments as $payment) {
            $payment->process();
        }
    }, 'payment_id'); // Custom ID column
```
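The reason `chunkById()` is safer comes down to the SQL each method emits: `chunk()` pages with LIMIT/OFFSET, so rows that change mid-run can shift between pages, while `chunkById()` keys each page on the last seen ID. You can observe this with a query listener; a sketch, and the exact SQL will vary by Laravel version and database driver:

```php
// Log every query the chunking run issues (for local inspection)
DB::listen(fn ($query) => logger($query->sql));

User::chunk(100, fn ($users) => true);
// select * from `users` order by `users`.`id` asc limit 100 offset 0
// select * from `users` order by `users`.`id` asc limit 100 offset 100
// ^ offsets shift if matching rows are added or removed mid-run

User::chunkById(100, fn ($users) => true);
// select * from `users` order by `id` asc limit 100
// select * from `users` where `id` > ? order by `id` asc limit 100
// ^ keyed on the last seen ID, so page boundaries never shift
```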
### 3. Lazy Collections for Memory Efficiency

```php
// Uses PHP generators, minimal memory footprint
User::where('created_at', '>=', now()->subDays(30))
    ->lazy()
    ->each(function ($user) {
        $user->recalculateScore();
    });

// With chunking size control
User::lazy(100)->each(function ($user) {
    ProcessRecentUser::dispatch($user);
});

// Filter and map with lazy collections
$results = User::lazy()
    ->filter(fn ($user) => $user->hasActiveSubscription())
    ->map(fn ($user) => [
        'id' => $user->id,
        'revenue' => $user->calculateRevenue(),
    ])
    ->take(1000);
```
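Note that `lazy()` paginates with offsets internally, so it shares `chunk()`'s caveat about modifying filter columns mid-iteration. Laravel also ships `lazyById()` as the keyed counterpart; a small sketch, where the `imported` flag is a hypothetical column for illustration:

```php
// Keyed lazy iteration: safe even while updating the filtered column
// (`imported` is a hypothetical flag column for this example)
User::where('imported', false)
    ->lazyById(200)
    ->each(function ($user) {
        $user->update(['imported' => true]);
    });
```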
### 4. Cursor for Forward-Only Iteration

```php
// Most memory-efficient for simple forward iteration
foreach (User::where('active', true)->cursor() as $user) {
    $user->updateLastSeen();
}

// cursor() returns a LazyCollection, so collection methods work too
User::where('verified', true)
    ->cursor()
    ->filter(fn ($user) => $user->hasCompletedProfile())
    ->each(fn ($user) => SendWelcomeEmail::dispatch($user));
```
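One caveat worth knowing: with the pdo_mysql driver, buffered queries mean `cursor()` can still hold the entire raw result set in client memory even though only one model is hydrated at a time. Unbuffered mode avoids that at the cost of tying up the connection until iteration finishes; a sketch to weigh carefully before any production use:

```php
// Disable client-side result buffering for this connection (MySQL only).
DB::connection()->getPdo()->setAttribute(
    PDO::MYSQL_ATTR_USE_BUFFERED_QUERY,
    false
);

foreach (User::where('active', true)->cursor() as $user) {
    // Read-only work here: with unbuffered queries, issuing another
    // query on this connection before the cursor is drained will fail.
    $emails[] = $user->email;
}
```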
## Real-World Examples

### Export Large CSV

```php
class ExportUsersCommand extends Command
{
    public function handle()
    {
        $file = fopen('users.csv', 'w');

        // Write headers
        fputcsv($file, ['ID', 'Name', 'Email', 'Created At']);

        // Process in chunks to avoid memory issues
        User::select('id', 'name', 'email', 'created_at')
            ->chunkById(500, function ($users) use ($file) {
                foreach ($users as $user) {
                    fputcsv($file, [
                        $user->id,
                        $user->name,
                        $user->email,
                        $user->created_at->toDateTimeString(),
                    ]);
                }

                // Optional: Show progress
                $this->info("Processed up to ID: {$users->last()->id}");
            });

        fclose($file);
        $this->info('Export completed!');
    }
}
```
### Batch Email Campaign

```php
class SendCampaignJob implements ShouldQueue
{
    public function __construct(
        public int $campaignId
    ) {}

    public function handle()
    {
        $campaign = Campaign::find($this->campaignId);

        // Process subscribers in chunks
        $campaign->subscribers()
            ->where('unsubscribed', false)
            ->chunkById(50, function ($subscribers) use ($campaign) {
                foreach ($subscribers as $subscriber) {
                    SendCampaignEmail::dispatch($campaign, $subscriber)
                        ->onQueue('emails')
                        ->delay(now()->addSeconds(rand(1, 10)));
                }

                // Prevent rate limiting
                sleep(2);
            });
    }
}
```
### Data Migration/Transformation

```php
class MigrateUserData extends Command
{
    public function handle()
    {
        $bar = $this->output->createProgressBar(User::count());

        User::with(['profile', 'settings'])
            ->chunkById(100, function ($users) use ($bar) {
                DB::transaction(function () use ($users, $bar) {
                    foreach ($users as $user) {
                        // Complex transformation
                        $newData = $this->transformUserData($user);
                        NewUserModel::create($newData);
                        $bar->advance();
                    }
                });
            });

        $bar->finish();
        $this->newLine();
        $this->info('Migration completed!');
    }
}
```
### Cleanup Old Records

```php
class CleanupOldLogs extends Command
{
    public function handle()
    {
        $deletedCount = 0;

        ActivityLog::where('created_at', '<', now()->subMonths(6))
            ->chunkById(1000, function ($logs) use (&$deletedCount) {
                $ids = $logs->pluck('id')->toArray();

                // Batch delete for efficiency
                ActivityLog::whereIn('id', $ids)->delete();

                $deletedCount += count($ids);
                $this->info("Deleted {$deletedCount} records so far...");

                // Give the database a breather
                usleep(100000); // 100ms
            });

        $this->info("Total deleted: {$deletedCount}");
    }
}
```
## Choosing the Right Method

| Method | Use Case | Memory Usage | Notes |
|---|---|---|---|
| `chunk()` | General processing | Moderate | May skip/duplicate if modifying filter columns |
| `chunkById()` | Updates during iteration | Moderate | Safer for modifications |
| `lazy()` | Large result processing | Low | Returns LazyCollection |
| `cursor()` | Simple forward iteration | Lowest | Returns a generator-backed LazyCollection |
| `all()` / `get()` | Simple operations | High (loads all) | Avoid for large datasets |
## Performance Optimization Tips

### 1. Select Only Needed Columns

```php
User::select('id', 'email', 'name')
    ->chunkById(100, function ($users) {
        // Process with minimal data
    });
```
### 2. Use Indexes

```php
// Ensure indexed columns in where clauses
User::where('status', 'active')         // status should be indexed
    ->where('created_at', '>', $date)   // created_at should be indexed
    ->chunkById(200, function ($users) {
        // Process efficiently
    });
```
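If those columns are not indexed yet, adding them is a one-line migration each; a sketch, with table and column names matching the example above:

```php
// In a migration: index the columns the chunked query filters on
Schema::table('users', function (Blueprint $table) {
    $table->index('status');      // supports the where('status', ...) filter
    $table->index('created_at');  // supports the date-range filter
});
```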
### 3. Disable Eloquent Events When Appropriate

```php
User::withoutEvents(function () {
    User::chunkById(500, function ($users) {
        foreach ($users as $user) {
            $user->update(['processed' => true]);
        }
    });
});
```
### 4. Use Raw Queries for Bulk Updates

```php
// Instead of updating each record
User::chunkById(100, function ($users) {
    $ids = $users->pluck('id')->toArray();

    // Bulk update with raw query
    DB::table('users')
        ->whereIn('id', $ids)
        ->update([
            'last_processed_at' => now(),
            'processing_count' => DB::raw('processing_count + 1'),
        ]);
});
```
### 5. Queue Large Operations

```php
class ProcessLargeDataset extends Command
{
    public function handle()
    {
        User::chunkById(100, function ($users) {
            ProcessUserBatch::dispatch($users->pluck('id'))
                ->onQueue('heavy-processing');
        });
    }
}

class ProcessUserBatch implements ShouldQueue
{
    public function __construct(
        public Collection $userIds
    ) {}

    public function handle()
    {
        User::whereIn('id', $this->userIds)
            ->get()
            ->each(fn ($user) => $user->process());
    }
}
```
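If you also need completion tracking or failure handling across all of the dispatched work, the same pattern maps onto Laravel's job batching. A sketch, assuming `ProcessUserBatch` uses the `Batchable` trait:

```php
// Collect one job per chunk, then dispatch them as a single batch.
// Each job holds only a collection of IDs, so this stays lightweight.
$jobs = [];

User::chunkById(100, function ($users) use (&$jobs) {
    $jobs[] = new ProcessUserBatch($users->pluck('id'));
});

Bus::batch($jobs)
    ->then(fn (Batch $batch) => logger('All user batches processed'))
    ->onQueue('heavy-processing')
    ->dispatch();
```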
## Testing Chunked Operations

```php
test('processes all active users in chunks', function () {
    // Create test data
    User::factory()->count(150)->create(['active' => true]);
    User::factory()->count(50)->create(['active' => false]);

    $processed = [];

    User::where('active', true)
        ->chunkById(50, function ($users) use (&$processed) {
            foreach ($users as $user) {
                $processed[] = $user->id;
            }
        });

    expect($processed)->toHaveCount(150);
    expect(count(array_unique($processed)))->toBe(150);
});

test('handles empty datasets gracefully', function () {
    $callCount = 0;

    User::where('id', '<', 0) // No results
        ->chunk(100, function ($users) use (&$callCount) {
            $callCount++;
        });

    expect($callCount)->toBe(0);
});
```
## Common Pitfalls

- **Modifying filter columns during chunk()**

  ```php
  // WRONG: May skip records
  User::where('processed', false)
      ->chunk(100, function ($users) {
          foreach ($users as $user) {
              $user->update(['processed' => true]); // Changes the WHERE condition!
          }
      });

  // CORRECT: Use chunkById()
  User::where('processed', false)
      ->chunkById(100, function ($users) {
          foreach ($users as $user) {
              $user->update(['processed' => true]);
          }
      });
  ```

- **Not handling chunk callback returns**

  ```php
  // Return false to stop chunking
  User::chunk(100, function ($users) {
      foreach ($users as $user) {
          if ($user->hasIssue()) {
              return false; // Stop processing
          }

          $user->process();
      }
  });
  ```

- **Ignoring database connection limits** (see the reconnect sketch after this list)

  ```php
  // Consider connection timeouts for long operations
  DB::connection()->getPdo()->setAttribute(PDO::ATTR_TIMEOUT, 3600);

  User::chunkById(100, function ($users) {
      // Long-running process
  });
  ```
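For very long runs, another option is to refresh the connection periodically instead of raising the timeout; a sketch (avoid this inside an open transaction, since reconnecting discards it):

```php
$processed = 0;

User::chunkById(100, function ($users) use (&$processed) {
    foreach ($users as $user) {
        $user->process();
    }

    $processed += $users->count();

    // Re-establish the connection every ~5,000 rows to dodge
    // server-side idle/wait timeouts; chunkById issues a fresh
    // query per page, so reconnecting between pages is safe.
    if ($processed % 5000 === 0) {
        DB::reconnect();
    }
});
```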
Remember: When dealing with large datasets, always think about memory usage, query efficiency, and processing time. Chunk your data appropriately!