Claude-skill-registry llm-inference-batching-scheduler
Guidance for implementing batching schedulers for LLM inference systems with compilation-based accelerators. This skill applies when optimizing request batching to minimize cost while meeting latency thresholds, particularly when dealing with shape compilation costs, padding overhead, and multi-bucket request distributions. Use this skill for tasks involving batch planning, shape selection, generation-length bucketing, and cost-model-driven optimization for neural network inference.
To install, either clone the full registry:
git clone https://github.com/majiayu000/claude-skill-registry
or copy just this skill into your local skills directory:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-inference-batching-scheduler" ~/.claude/skills/majiayu000-claude-skill-registry-llm-inference-batching-scheduler && rm -rf "$T"
skills/data/llm-inference-batching-scheduler/SKILL.md

LLM Inference Batching Scheduler
This skill provides systematic approaches for designing batching schedulers that optimize LLM inference workloads on compilation-based accelerators (TPUs, custom ASICs). The core challenge involves balancing multiple competing objectives: minimizing compilation cost (fewer shapes), reducing padding waste (tighter batches), and meeting latency thresholds (P95, P99 constraints).
When to Apply This Skill
Apply this skill when:
- Designing batch schedulers for LLM inference with shape compilation constraints
- Optimizing request packing to minimize padding overhead
- Balancing cost metrics against latency thresholds
- Working with generation-length bucketing strategies
- Implementing plan files that assign requests to batches with specific shapes
Core Concepts
Shape Compilation Cost Model
Compilation-based accelerators require pre-compiled shapes. Each unique (prompt_length, generation_length) shape incurs a one-time compilation cost. The total cost typically follows:
total_cost = per_token_cost × total_tokens + compilation_cost × num_shapes²
The quadratic term on num_shapes creates strong pressure to minimize unique shapes, while the per-token cost penalizes excessive padding.
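For concreteness, a minimal sketch of this cost model in Python (parameter names are illustrative; substitute the exact weights from your cost specification):

```python
def total_cost(total_tokens: int, num_shapes: int,
               per_token_cost: float, compilation_cost: float) -> float:
    # total_tokens counts padded tokens. The linear term penalizes padding;
    # the quadratic term penalizes shape proliferation.
    return per_token_cost * total_tokens + compilation_cost * num_shapes ** 2
```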
Padding Analysis Framework
Before implementing, calculate padding budgets mathematically:
- Identify the threshold constraints: extract the maximum allowed cost, pad_ratio, and latency percentiles
- Calculate the current baseline: sum actual tokens across all requests
- Derive the padding budget: max_padded_tokens = actual_tokens / (1 - max_pad_ratio)
- Compute the allowable padding: padding_budget = max_padded_tokens - actual_tokens
This budget constrains the maximum generation bucket size.
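A small sketch of this derivation, with a worked example (the numbers are illustrative):

```python
def padding_budget(actual_tokens: int, max_pad_ratio: float) -> int:
    # pad_ratio = 1 - actual / padded, so padded must stay
    # <= actual / (1 - max_pad_ratio).
    max_padded_tokens = actual_tokens / (1.0 - max_pad_ratio)
    return int(max_padded_tokens) - actual_tokens

# Example: 1,000,000 actual tokens under a 0.20 pad_ratio ceiling allows
# up to 1,250,000 padded tokens, i.e. a 250,000-token padding budget.
print(padding_budget(1_000_000, 0.20))  # 250000
```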
Generation-Length Bucketing
Requests are grouped by generation length into buckets. The bucket size directly affects padding:
- Smaller buckets: Less padding waste, but more batches (higher P95 latency risk)
- Larger buckets: More padding waste, but fewer batches (better latency)
To derive optimal bucket size:
optimal_bucket_size ≈ padding_budget / num_requests_in_worst_bucket
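For illustration, a minimal bucketing sketch that rounds each generation length up to its bucket ceiling and totals the resulting padding:

```python
from collections import defaultdict

def bucket_by_generation_length(gen_lengths: list[int],
                                bucket_size: int) -> dict[int, list[int]]:
    # Round each length up to the next multiple of bucket_size; that
    # multiple is the ceiling every request in the bucket pads to.
    buckets: dict[int, list[int]] = defaultdict(list)
    for length in gen_lengths:
        ceiling = -(-length // bucket_size) * bucket_size  # ceiling division
        buckets[ceiling].append(length)
    return buckets

def total_bucket_padding(buckets: dict[int, list[int]]) -> int:
    return sum(ceiling - length
               for ceiling, members in buckets.items() for length in members)
```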
Systematic Approach
Phase 1: Mathematical Analysis (Before Any Code)
1. Parse the cost model completely
- Identify all cost components and their weights
- Understand how shape count affects compilation cost
- Map latency calculation formulas
2. Analyze request distribution
- Compute statistics: prompt length distribution, generation length distribution
- Identify required prompt shapes to cover all requests (max prompt length determines minimum shape)
- Calculate baseline token counts per bucket
3. Derive parameter bounds from constraints
- From pad_ratio threshold: calculate max padding tokens allowed
- From cost threshold: calculate max compilation overhead
- From latency thresholds: estimate max batch count implications
4. Document derived constraints
- Write down: "Max padding = X tokens", "Max shapes = Y", "Gen bucket size must be ≤ Z"
Phase 2: Systematic Parameter Search
Instead of ad-hoc trial-and-error, implement structured search:
1. Define the parameter space
- Generation bucket sizes: range based on Phase 1 analysis
- Shape configurations: enumerate valid combinations covering all prompt lengths
2. Implement reusable evaluation
- Create a standalone validation function that computes all metrics
- Return structured results: {cost, pad_ratio, p95_latency, p99_latency, valid: bool}
3. Search systematically
- Grid search over small parameter spaces
- Binary search when optimizing single parameters
- Track all configurations and results
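A grid-search sketch along these lines, where `evaluate` is a hypothetical callback returning the structured result described above:

```python
import itertools

def grid_search(bucket_sizes, shape_configs, evaluate):
    # Track every configuration and keep the cheapest valid one.
    results, best = {}, None
    for bucket_size, shapes in itertools.product(bucket_sizes, shape_configs):
        metrics = evaluate(bucket_size, shapes)
        results[(bucket_size, tuple(shapes))] = metrics
        if metrics["valid"] and (best is None or metrics["cost"] < best[1]["cost"]):
            best = ((bucket_size, shapes), metrics)
    return best, results
```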
Phase 3: Implementation with Invariant Checking
1. Critical invariants to validate continuously
- All prompt lengths covered: max(prompt_lengths) ≤ max(shape_prompt_dims)
- All requests assigned: sum(batch_sizes) == total_requests
- Shape count within limit: len(unique_shapes) ≤ max_shapes
2. Build validation into the workflow
- Check invariants after every modification
- Fail fast when invariants break
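The three invariants translate directly into assertions; a minimal sketch (argument names are illustrative):

```python
def check_invariants(prompt_lengths, shape_prompt_dims, batch_sizes,
                     total_requests, unique_shapes, max_shapes):
    # Fail fast: the first broken invariant stops the run with a clear message.
    assert max(prompt_lengths) <= max(shape_prompt_dims), "uncovered prompt length"
    assert sum(batch_sizes) == total_requests, "not all requests assigned"
    assert len(unique_shapes) <= max_shapes, "shape budget exceeded"
```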
Common Pitfalls and Mitigations
Pitfall 1: Removing Required Shapes
Symptom: Assertion errors about uncovered prompt lengths after optimizing shape count.
Cause: When reducing shapes for compilation cost, accidentally removing shapes needed for long prompts.
Prevention:
- Before removing any shape, verify: shape_to_remove.prompt_dim < max(all_prompt_lengths)
- Always keep at least one shape covering the maximum prompt length
- Implement a required_shapes() function that returns non-removable shapes, as sketched below
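One possible sketch of that helper, treating a shape as non-removable when it is the sole cover for some prompt length (this guards single removals only; shapes are assumed to be (prompt_dim, gen_dim) tuples):

```python
def required_shapes(shapes, prompt_lengths):
    required = set()
    for length in set(prompt_lengths):
        covering = [s for s in shapes if s[0] >= length]
        if len(covering) == 1:
            # Sole cover for this length; removing it breaks coverage.
            required.add(covering[0])
    return required
```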
Pitfall 2: Empirical Parameter Selection Without Mathematical Justification
Symptom: Cycling through bucket sizes (20, 21, 22...) without clear rationale, losing track of best configurations.
Cause: Not computing optimal parameters from constraints first.
Prevention:
- Calculate bounds analytically before testing
- If bucket_size=21 works but 20 doesn't, understand why mathematically
- Log all tested configurations with results
Pitfall 3: Optimizing One Metric While Breaking Others
Symptom: Reducing pad_ratio but exceeding cost threshold, or vice versa.
Cause: Metrics are interconnected: fewer shapes increase padding, smaller buckets increase latency.
Prevention:
- Always evaluate ALL metrics after each change
- Create a multi-objective feasibility check
- Understand trade-off relationships before optimizing
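A minimal multi-objective feasibility check might look like this (metric and threshold names are illustrative):

```python
def feasible(metrics: dict, thresholds: dict) -> bool:
    # A configuration passes only if every metric clears its threshold.
    return (metrics["cost"] <= thresholds["max_cost"]
            and metrics["pad_ratio"] <= thresholds["max_pad_ratio"]
            and metrics["p95_latency"] <= thresholds["max_p95"]
            and metrics["p99_latency"] <= thresholds["max_p99"])
```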
Pitfall 4: Inconsistent Configuration Tracking
Symptom: Re-testing configurations or forgetting which parameters achieved which results.
Prevention:
- Maintain a results log: {config_hash: {params: {...}, metrics: {...}}}
- Before testing, check whether the configuration was already evaluated
- Keep "best so far" state updated
Pitfall 5: Late Validation of Structural Constraints
Symptom: Plan file passes metric thresholds but fails structural validation (duplicate requests, missing requests).
Prevention:
- Validate structure FIRST, before computing metrics
- Check: no duplicate request IDs, all request IDs present, batch counts match
Verification Strategy
Structural Verification (Run First)
1. Parse the generated plan file
2. Extract all request IDs assigned to batches
3. Verify: set(assigned_ids) == set(all_request_ids)
4. Verify: len(assigned_ids) == len(all_request_ids)  # no duplicates
5. Verify: every batch's shape has prompt_dim >= the max prompt length in that batch
6. Verify: unique_shape_count <= max_allowed_shapes
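These checks are mechanical; a sketch assuming each batch exposes a shape (prompt_dim, gen_dim) and its assigned request IDs:

```python
def verify_structure(plan, prompt_lengths, max_allowed_shapes):
    # plan: list of batches; prompt_lengths: request_id -> prompt length.
    assigned = [rid for batch in plan for rid in batch.request_ids]
    assert len(assigned) == len(set(assigned)), "duplicate request IDs"
    assert set(assigned) == set(prompt_lengths), "missing or unknown request IDs"
    for batch in plan:
        prompt_dim, _gen_dim = batch.shape
        longest = max(prompt_lengths[rid] for rid in batch.request_ids)
        assert prompt_dim >= longest, "batch shape too small for its longest prompt"
    assert len({b.shape for b in plan}) <= max_allowed_shapes, "too many shapes"
```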
Metric Verification (Run After Structural)
1. Compute total padded tokens using shape dimensions
2. Compute actual tokens from request data
3. Calculate: pad_ratio = 1 - (actual / padded)
4. Compute cost using the exact cost model formula
5. Simulate the latency distribution, extract P95/P99
6. Compare all metrics against thresholds
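A sketch of the token and cost portion (latency simulation is omitted because the latency model is system-specific); requests are assumed to map request_id -> (prompt_len, gen_len):

```python
def compute_metrics(plan, requests, per_token_cost, compilation_cost):
    padded = sum((b.shape[0] + b.shape[1]) * len(b.request_ids) for b in plan)
    actual = sum(p + g for p, g in requests.values())
    num_shapes = len({b.shape for b in plan})
    return {
        "pad_ratio": 1.0 - actual / padded,
        "cost": per_token_cost * padded + compilation_cost * num_shapes ** 2,
    }
```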
Iterative Verification Pattern
After any parameter change:
- Generate new plan
- Run structural verification → fix if broken
- Run metric verification → analyze trade-offs
- Update configuration tracking log
- Decide next parameter adjustment based on which metrics are furthest from threshold
Reference: Optimization Decision Tree
```
START
│
├─ Is pad_ratio too high?
│   └─ YES → Reduce generation bucket size OR add more prompt shapes
│
├─ Is cost too high?
│   └─ YES → Reduce number of unique shapes (but verify coverage!)
│
├─ Is P95/P99 latency too high?
│   └─ YES → Increase generation bucket size (larger batches, fewer total)
│            OR redistribute requests across batches more evenly
│
└─ All metrics passing?
    └─ YES → DONE (record configuration as solution)
```
Key Insight Summary
- Math first, code second: Derive bounds from constraints before implementing
- Track everything: Maintain configuration→results mapping
- Validate continuously: Check invariants after every modification
- Understand trade-offs: Metrics are interconnected; optimize with awareness
- Separate concerns: Structure validation, metric computation, and parameter search should be independent, reusable components