Claude-skill-registry llm-inference-batching-scheduler
Guidance for implementing batching schedulers for LLM inference systems with compilation-based accelerators. This skill applies when optimizing request batching to minimize cost while meeting latency thresholds, particularly when dealing with shape compilation costs, padding overhead, and multi-bucket request distributions. Use this skill for tasks involving batch planning, shape selection, generation-length bucketing, and cost-model-driven optimization for neural network inference.
To install, either clone the full registry:
git clone https://github.com/majiayu000/claude-skill-registry
or copy just this skill into your local skills directory:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llm-inference-batching-scheduler" ~/.claude/skills/majiayu000-claude-skill-registry-llm-inference-batching-scheduler && rm -rf "$T"
skills/data/llm-inference-batching-scheduler/SKILL.md

LLM Inference Batching Scheduler
This skill provides systematic approaches for designing batching schedulers that optimize LLM inference workloads on compilation-based accelerators (TPUs, custom ASICs). The core challenge involves balancing multiple competing objectives: minimizing compilation cost (fewer shapes), reducing padding waste (tighter batches), and meeting latency thresholds (P95, P99 constraints).
When to Apply This Skill
Apply this skill when:
- Designing batch schedulers for LLM inference with shape compilation constraints
- Optimizing request packing to minimize padding overhead
- Balancing cost metrics against latency thresholds
- Working with generation-length bucketing strategies
- Implementing plan files that assign requests to batches with specific shapes
Core Concepts
Shape Compilation Cost Model
Compilation-based accelerators require pre-compiled shapes. Each unique (prompt_length, generation_length) shape incurs a one-time compilation cost. The total cost typically follows:
total_cost = per_token_cost × total_tokens + compilation_cost × num_shapes²
The quadratic term on num_shapes creates strong pressure to minimize unique shapes, while the per-token cost penalizes excessive padding.
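For concreteness, a minimal sketch of this cost model in Python (parameter names are illustrative; substitute the exact weights from your cost specification):

```python
def total_cost(total_tokens: int, num_shapes: int,
               per_token_cost: float, compilation_cost: float) -> float:
    # total_tokens counts padded tokens. The linear term penalizes padding;
    # the quadratic term penalizes shape proliferation.
    return per_token_cost * total_tokens + compilation_cost * num_shapes ** 2
```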
Padding Analysis Framework
Before implementing, calculate padding budgets mathematically:
- Identify the threshold constraints: extract the maximum allowed cost, pad_ratio, and latency percentiles
- Calculate the current baseline: sum actual tokens across all requests
- Derive the padding budget: max_padded_tokens = actual_tokens / (1 - max_pad_ratio)
- Compute the allowable padding: padding_budget = max_padded_tokens - actual_tokens
This budget constrains the maximum generation bucket size.
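A small sketch of this derivation, with a worked example (the numbers are illustrative):

```python
def padding_budget(actual_tokens: int, max_pad_ratio: float) -> int:
    # pad_ratio = 1 - actual / padded, so padded must stay
    # <= actual / (1 - max_pad_ratio).
    max_padded_tokens = actual_tokens / (1.0 - max_pad_ratio)
    return int(max_padded_tokens) - actual_tokens

# Example: 1,000,000 actual tokens under a 0.20 pad_ratio ceiling allows
# up to 1,250,000 padded tokens, i.e. a 250,000-token padding budget.
print(padding_budget(1_000_000, 0.20))  # 250000
```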
Generation-Length Bucketing
Requests are grouped by generation length into buckets. The bucket size directly affects padding:
- Smaller buckets: Less padding waste, but more batches (higher P95 latency risk)
- Larger buckets: More padding waste, but fewer batches (better latency)
To derive optimal bucket size:
optimal_bucket_size ≈ padding_budget / num_requests_in_worst_bucket
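For illustration, a minimal bucketing sketch that rounds each generation length up to its bucket ceiling and totals the resulting padding:

```python
from collections import defaultdict

def bucket_by_generation_length(gen_lengths: list[int],
                                bucket_size: int) -> dict[int, list[int]]:
    # Round each length up to the next multiple of bucket_size; that
    # multiple is the ceiling every request in the bucket pads to.
    buckets: dict[int, list[int]] = defaultdict(list)
    for length in gen_lengths:
        ceiling = -(-length // bucket_size) * bucket_size  # ceiling division
        buckets[ceiling].append(length)
    return buckets

def total_bucket_padding(buckets: dict[int, list[int]]) -> int:
    return sum(ceiling - length
               for ceiling, members in buckets.items() for length in members)
```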
Systematic Approach
Phase 1: Mathematical Analysis (Before Any Code)
1. Parse the cost model completely
- Identify all cost components and their weights
- Understand how shape count affects compilation cost
- Map latency calculation formulas
2. Analyze request distribution
- Compute statistics: prompt length distribution, generation length distribution
- Identify required prompt shapes to cover all requests (max prompt length determines minimum shape)
- Calculate baseline token counts per bucket
3. Derive parameter bounds from constraints
- From pad_ratio threshold: calculate max padding tokens allowed
- From cost threshold: calculate max compilation overhead
- From latency thresholds: estimate max batch count implications
4. Document derived constraints
- Write down: "Max padding = X tokens", "Max shapes = Y", "Gen bucket size must be ≤ Z"
Phase 2: Systematic Parameter Search
Instead of ad-hoc trial-and-error, implement structured search:
1. Define the parameter space
- Generation bucket sizes: range based on Phase 1 analysis
- Shape configurations: enumerate valid combinations covering all prompt lengths
2. Implement reusable evaluation
- Create a standalone validation function that computes all metrics
- Return structured results: {cost, pad_ratio, p95_latency, p99_latency, valid: bool}
3. Search systematically
- Grid search over small parameter spaces
- Binary search when optimizing single parameters
- Track all configurations and results
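A grid-search sketch along these lines, where `evaluate` is a hypothetical callback returning the structured result described above:

```python
import itertools

def grid_search(bucket_sizes, shape_configs, evaluate):
    # Track every configuration and keep the cheapest valid one.
    results, best = {}, None
    for bucket_size, shapes in itertools.product(bucket_sizes, shape_configs):
        metrics = evaluate(bucket_size, shapes)
        results[(bucket_size, tuple(shapes))] = metrics
        if metrics["valid"] and (best is None or metrics["cost"] < best[1]["cost"]):
            best = ((bucket_size, shapes), metrics)
    return best, results
```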
Phase 3: Implementation with Invariant Checking
1. Critical invariants to validate continuously
- All prompt lengths covered: max(prompt_lengths) ≤ max(shape_prompt_dims)
- All requests assigned: sum(batch_sizes) == total_requests
- Shape count within limit: len(unique_shapes) ≤ max_shapes
2. Build validation into the workflow
- Check invariants after every modification
- Fail fast when invariants break
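The three invariants translate directly into assertions; a minimal sketch (argument names are illustrative):

```python
def check_invariants(prompt_lengths, shape_prompt_dims, batch_sizes,
                     total_requests, unique_shapes, max_shapes):
    # Fail fast: the first broken invariant stops the run with a clear message.
    assert max(prompt_lengths) <= max(shape_prompt_dims), "uncovered prompt length"
    assert sum(batch_sizes) == total_requests, "not all requests assigned"
    assert len(unique_shapes) <= max_shapes, "shape budget exceeded"
```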
Common Pitfalls and Mitigations
Pitfall 1: Removing Required Shapes
Symptom: Assertion errors about uncovered prompt lengths after optimizing shape count.
Cause: When reducing shapes for compilation cost, accidentally removing shapes needed for long prompts.
Prevention:
- Before removing any shape, verify: shape_to_remove.prompt_dim < max(all_prompt_lengths)
- Always keep at least one shape covering the maximum prompt length
- Implement a required_shapes() function that returns non-removable shapes, as sketched below
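One possible sketch of that helper, treating a shape as non-removable when it is the sole cover for some prompt length (this guards single removals only; shapes are assumed to be (prompt_dim, gen_dim) tuples):

```python
def required_shapes(shapes, prompt_lengths):
    required = set()
    for length in set(prompt_lengths):
        covering = [s for s in shapes if s[0] >= length]
        if len(covering) == 1:
            # Sole cover for this length; removing it breaks coverage.
            required.add(covering[0])
    return required
```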
Pitfall 2: Empirical Parameter Selection Without Mathematical Justification
Symptom: Cycling through bucket sizes (20, 21, 22...) without clear rationale, losing track of best configurations.
Cause: Not computing optimal parameters from constraints first.
Prevention:
- Calculate bounds analytically before testing
- If bucket_size=21 works but 20 doesn't, understand why mathematically
- Log all tested configurations with results
Pitfall 3: Optimizing One Metric While Breaking Others
Symptom: Reducing pad_ratio but exceeding cost threshold, or vice versa.
Cause: Metrics are interconnected: fewer shapes increase padding, smaller buckets increase latency.
Prevention:
- Always evaluate ALL metrics after each change
- Create a multi-objective feasibility check
- Understand trade-off relationships before optimizing
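A minimal multi-objective feasibility check might look like this (metric and threshold names are illustrative):

```python
def feasible(metrics: dict, thresholds: dict) -> bool:
    # A configuration passes only if every metric clears its threshold.
    return (metrics["cost"] <= thresholds["max_cost"]
            and metrics["pad_ratio"] <= thresholds["max_pad_ratio"]
            and metrics["p95_latency"] <= thresholds["max_p95"]
            and metrics["p99_latency"] <= thresholds["max_p99"])
```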
Pitfall 4: Inconsistent Configuration Tracking
Symptom: Re-testing configurations or forgetting which parameters achieved which results.
Prevention:
- Maintain a results log: {config_hash: {params: {...}, metrics: {...}}}
- Before testing, check whether the configuration was already evaluated
- Keep "best so far" state updated
Pitfall 5: Late Validation of Structural Constraints
Symptom: Plan file passes metric thresholds but fails structural validation (duplicate requests, missing requests).
Prevention:
- Validate structure FIRST, before computing metrics
- Check: no duplicate request IDs, all request IDs present, batch counts match
Verification Strategy
Structural Verification (Run First)
1. Parse the generated plan file
2. Extract all request IDs assigned to batches
3. Verify: set(assigned_ids) == set(all_request_ids)
4. Verify: len(assigned_ids) == len(all_request_ids)  # no duplicates
5. Verify: every batch's shape has prompt_dim >= the max prompt length in that batch
6. Verify: unique_shape_count <= max_allowed_shapes
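These checks are mechanical; a sketch assuming each batch exposes a shape (prompt_dim, gen_dim) and its assigned request IDs:

```python
def verify_structure(plan, prompt_lengths, max_allowed_shapes):
    # plan: list of batches; prompt_lengths: request_id -> prompt length.
    assigned = [rid for batch in plan for rid in batch.request_ids]
    assert len(assigned) == len(set(assigned)), "duplicate request IDs"
    assert set(assigned) == set(prompt_lengths), "missing or unknown request IDs"
    for batch in plan:
        prompt_dim, _gen_dim = batch.shape
        longest = max(prompt_lengths[rid] for rid in batch.request_ids)
        assert prompt_dim >= longest, "batch shape too small for its longest prompt"
    assert len({b.shape for b in plan}) <= max_allowed_shapes, "too many shapes"
```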
Metric Verification (Run After Structural)
1. Compute total padded tokens using shape dimensions
2. Compute actual tokens from request data
3. Calculate: pad_ratio = 1 - (actual / padded)
4. Compute cost using the exact cost model formula
5. Simulate the latency distribution, extract P95/P99
6. Compare all metrics against thresholds
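A sketch of the token and cost portion (latency simulation is omitted because the latency model is system-specific); requests are assumed to map request_id -> (prompt_len, gen_len):

```python
def compute_metrics(plan, requests, per_token_cost, compilation_cost):
    padded = sum((b.shape[0] + b.shape[1]) * len(b.request_ids) for b in plan)
    actual = sum(p + g for p, g in requests.values())
    num_shapes = len({b.shape for b in plan})
    return {
        "pad_ratio": 1.0 - actual / padded,
        "cost": per_token_cost * padded + compilation_cost * num_shapes ** 2,
    }
```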
Iterative Verification Pattern
After any parameter change:
- Generate new plan
- Run structural verification → fix if broken
- Run metric verification → analyze trade-offs
- Update configuration tracking log
- Decide next parameter adjustment based on which metrics are furthest from threshold
Reference: Optimization Decision Tree
```
START
│
├─ Is pad_ratio too high?
│   └─ YES → Reduce generation bucket size OR add more prompt shapes
│
├─ Is cost too high?
│   └─ YES → Reduce number of unique shapes (but verify coverage!)
│
├─ Is P95/P99 latency too high?
│   └─ YES → Increase generation bucket size (larger batches, fewer total)
│            OR redistribute requests across batches more evenly
│
└─ All metrics passing?
    └─ YES → DONE (record configuration as solution)
```
Key Insight Summary
- Math first, code second: Derive bounds from constraints before implementing
- Track everything: Maintain configuration→results mapping
- Validate continuously: Check invariants after every modification
- Understand trade-offs: Metrics are interconnected; optimize with awareness
- Separate concerns: Structure validation, metric computation, and parameter search should be independent, reusable components