Batching and Throughput Optimization for ML Inference
Production ML inference systems face a fundamental trade-off: low latency for individual requests versus high throughput for batch processing. This article covers practical techniques for optimizing both, with implications for compute efficiency in data center deployments.
The Latency-Throughput Trade-off
Unbatched Inference
- Each request processed independently
- Low latency (ms) but poor GPU utilization
- High per-request overhead
Static Batching
- Fixed batch size, wait to fill
- High throughput but variable latency
- Underutilization during low traffic
Dynamic Batching
- Adaptive batch formation
- Balance latency SLA with throughput
- Better operational efficiency
Dynamic Batching Strategies
Time-Based Batching
import time

class DynamicBatcher:
    """Collects requests into batches, flushing when the batch is full
    or the oldest queued request has waited max_wait_ms."""

    def __init__(self, max_batch_size=32, max_wait_ms=50):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.first_request_time = None  # arrival time of oldest queued request

    def add(self, request):
        if not self.queue:
            self.first_request_time = time.monotonic()
        self.queue.append(request)
        waited_ms = (time.monotonic() - self.first_request_time) * 1000
        if len(self.queue) >= self.max_batch_size or waited_ms >= self.max_wait_ms:
            return self.flush()
        return None  # caller should also flush on a timer so quiet periods still drain

    def flush(self):
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        self.first_request_time = time.monotonic() if self.queue else None
        return batch
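The size-triggered path can be exercised in isolation. The snippet below uses a compact stand-in for the batcher so it runs standalone (the deadline is measured from the oldest queued request, and the class name `MiniBatcher` is just for this demo): with `max_batch_size=4` and a generous deadline, the fourth `add()` returns a full batch.

```python
import time

class MiniBatcher:
    """Compact stand-in for the DynamicBatcher above, for demonstration."""
    def __init__(self, max_batch_size=4, max_wait_ms=1000):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.first_request_time = None

    def add(self, request):
        if not self.queue:
            self.first_request_time = time.monotonic()
        self.queue.append(request)
        waited_ms = (time.monotonic() - self.first_request_time) * 1000
        if len(self.queue) >= self.max_batch_size or waited_ms >= self.max_wait_ms:
            batch, self.queue = self.queue[:self.max_batch_size], self.queue[self.max_batch_size:]
            self.first_request_time = None
            return batch
        return None

b = MiniBatcher(max_batch_size=4, max_wait_ms=1000)
results = [b.add(i) for i in range(4)]
print(results[:3])  # [None, None, None] -- batch still filling
print(results[3])   # [0, 1, 2, 3] -- size trigger fires on the 4th add
```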
Priority Queues
Route latency-sensitive requests to a fast path; batch non-urgent requests for throughput.
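A minimal sketch of this routing (the `PriorityRouter` class and its thresholds are illustrative, not a standard API): urgent requests return immediately as a batch of one, while bulk requests accumulate until a batch fills.

```python
class PriorityRouter:
    """Routes urgent requests to a fast path (batch of 1) and
    accumulates bulk requests into larger batches for throughput."""
    def __init__(self, batch_size=8):
        self.batch_size = batch_size
        self.bulk_queue = []

    def route(self, request, urgent=False):
        if urgent:
            return ("fast_path", [request])  # skip batching entirely
        self.bulk_queue.append(request)
        if len(self.bulk_queue) >= self.batch_size:
            batch, self.bulk_queue = self.bulk_queue, []
            return ("batched", batch)
        return None  # still accumulating

r = PriorityRouter(batch_size=3)
urgent_result = r.route("a", urgent=True)
print(urgent_result)   # ('fast_path', ['a'])
r.route("b")
r.route("c")
bulk_result = r.route("d")
print(bulk_result)     # ('batched', ['b', 'c', 'd'])
```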
GPU Utilization
Profiling
- Use Nsight Systems or PyTorch profiler
- Identify kernel launch overhead
- Measure memory bandwidth saturation
Optimization Levers
- Increase batch size until you hit the GPU memory limit or the latency SLA
- Use TensorRT or ONNX Runtime for fused kernels
- Pipeline data loading with inference
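The first lever can be reasoned about with a simple cost model (the overhead and per-item figures below are assumptions for illustration, not measurements): each batch pays a fixed launch overhead plus per-item compute, so throughput climbs as larger batches amortize the overhead.

```python
# Assumed costs: fixed kernel-launch/scheduling overhead per batch,
# plus a per-request compute cost.
LAUNCH_OVERHEAD_MS = 2.0
PER_ITEM_MS = 0.5

def throughput(batch_size):
    """Items served per second under the cost model above."""
    batch_time_ms = LAUNCH_OVERHEAD_MS + PER_ITEM_MS * batch_size
    return batch_size / (batch_time_ms / 1000.0)

for bs in (1, 8, 32, 128):
    print(bs, round(throughput(bs)))
# batch size 1 -> 400 items/s; batch size 128 -> 1939 items/s
```

The returns diminish: the curve approaches the 1/PER_ITEM_MS ceiling (2000 items/s here), which is why profiling matters before pushing batch size to the memory limit.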
Queue Sizing
Little's Law: L = λW
- L = average number of requests in the system (queue length)
- λ = average arrival rate
- W = average time a request spends in the system
Size queues to absorb burstiness without exceeding latency SLA. Monitor p99 latency and drop rate.
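A worked example of the formula (the arrival rate and latency budget are assumed figures, not measurements):

```python
# Little's Law, L = λW: with λ = 2000 requests/s and a queueing-delay
# budget of W = 50 ms, the steady-state queue holds about 100 requests.
arrival_rate = 2000.0   # λ: requests per second (assumed)
max_wait_s = 0.050      # W: queueing-delay budget in seconds (assumed)

queue_length = arrival_rate * max_wait_s  # L = λW
print(queue_length)  # 100.0 -> provision roughly 100 queue slots
```

Sizing much above this trades latency for burst absorption; much below it forces drops during bursts, which is why both p99 latency and drop rate need monitoring.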
Operational Efficiency
Efficient batching reduces:
- Idle GPU time
- Per-request overhead
- Energy per inference (better utilization)
- Cost per thousand inferences
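A back-of-the-envelope sketch of the last point (the GPU price and throughput figures are assumptions): at a fixed hourly cost, cost per thousand inferences falls in direct proportion to the throughput gained from batching.

```python
# Assumed hourly GPU price; throughputs are illustrative unbatched vs
# batched figures, not benchmarks.
GPU_COST_PER_HOUR = 2.50  # USD (assumed)

def cost_per_1k(throughput_per_sec):
    """USD per 1000 inferences at a given sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return GPU_COST_PER_HOUR / inferences_per_hour * 1000

print(round(cost_per_1k(400), 4))   # unbatched: ~0.0017 USD per 1k
print(round(cost_per_1k(1900), 4))  # batched:   ~0.0004 USD per 1k
```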
Conclusion
Dynamic batching and queue tuning are essential for production ML inference. Getting the balance right improves both user experience and the operational efficiency of inference infrastructure.