
Batching and Throughput Optimization for ML Inference

Dynamic batching, queue sizing, and GPU utilization strategies to balance latency and throughput in production inference pipelines.

Akshay Mulgavkar · April 10, 2024 · 11 min read

Production ML inference systems face a fundamental trade-off: low latency for individual requests versus high throughput for batch processing. This article covers practical techniques to optimize both, with implications for compute efficiency in data centre deployments.

The Latency-Throughput Trade-off

Unbatched Inference

  • Each request processed independently
  • Low latency (ms) but poor GPU utilization
  • High per-request overhead

Static Batching

  • Fixed batch size, wait to fill
  • High throughput but variable latency
  • Underutilization during low traffic

Dynamic Batching

  • Adaptive batch formation
  • Balance latency SLA with throughput
  • Better operational efficiency

Dynamic Batching Strategies

Time-Based Batching

import time

class DynamicBatcher:
    """Forms a batch when either the size cap or the wait deadline is hit."""

    def __init__(self, max_batch_size=32, max_wait_ms=50):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.last_batch_time = time.time()

    def add(self, request):
        """Enqueue a request; return a ready batch, or None if still waiting."""
        self.queue.append(request)
        elapsed_ms = (time.time() - self.last_batch_time) * 1000
        # Note: the deadline is only checked on arrival; production systems
        # pair this with a background timer that also calls flush().
        if len(self.queue) >= self.max_batch_size or elapsed_ms >= self.max_wait_ms:
            return self.flush()
        return None

    def flush(self):
        """Drain up to max_batch_size requests and reset the wait timer."""
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        self.last_batch_time = time.time()
        return batch
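
One caveat with the batcher above: the deadline is only checked when a new request arrives, so during an idle spell a queued request can wait well past max_wait_ms. A common remedy is a background flush loop; a minimal sketch, where the dispatch callback and interval are illustrative, not prescribed:

```python
import threading
import time

def periodic_flush(batcher, dispatch, interval_s, stop_event):
    """Flush on a timer so a lone request still meets its deadline
    even when no new arrivals trigger the size/deadline check."""
    while not stop_event.is_set():
        time.sleep(interval_s)
        batch = batcher.flush()
        if batch:
            dispatch(batch)
```

Run this in a daemon thread alongside the request handler; the interval should be a fraction of max_wait_ms so the timer adds little extra latency.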

Priority Queues

Route latency-sensitive requests to a fast path; batch non-urgent requests for throughput.
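
A minimal sketch of this two-path routing, assuming each request carries a deadline in milliseconds (the field name and the 20 ms cutoff are illustrative):

```python
import heapq
import itertools

class PriorityRouter:
    """Sends tight-deadline requests to a fast path; batches the rest,
    most urgent first."""

    def __init__(self, fast_cutoff_ms=20):
        self.fast_cutoff_ms = fast_cutoff_ms
        self.batch_queue = []               # min-heap ordered by deadline
        self._counter = itertools.count()   # tie-breaker for equal deadlines

    def route(self, request, deadline_ms):
        if deadline_ms <= self.fast_cutoff_ms:
            return "fast"                   # run immediately, batch of one
        heapq.heappush(self.batch_queue,
                       (deadline_ms, next(self._counter), request))
        return "batched"

    def next_batch(self, max_batch_size):
        """Pop the most urgent batched requests first."""
        batch = []
        while self.batch_queue and len(batch) < max_batch_size:
            _, _, req = heapq.heappop(self.batch_queue)
            batch.append(req)
        return batch
```

The heap ordering means a near-deadline request never sits behind a relaxed one, which keeps p99 latency bounded on the batched path.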

GPU Utilization

Profiling

  • Use Nsight Systems or PyTorch profiler
  • Identify kernel launch overhead
  • Measure memory bandwidth saturation

Optimization Levers

  • Increase batch size until GPU memory limit
  • Use TensorRT or ONNX Runtime for fused kernels
  • Pipeline data loading with inference
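
The first lever can be reasoned about with a simple cost model: if each batch pays a fixed launch overhead plus a per-item cost, throughput rises with batch size but with diminishing returns, approaching the per-item rate. A toy sketch with assumed constants (real values come from profiling):

```python
def throughput(batch_size, overhead_ms=2.0, per_item_ms=0.5):
    """Requests/sec when a batch costs a fixed overhead plus per-item time."""
    batch_latency_ms = overhead_ms + per_item_ms * batch_size
    return batch_size * 1000.0 / batch_latency_ms

# Diminishing returns: the curve flattens toward 1000 / per_item_ms = 2000 req/s
for b in (1, 8, 32, 128):
    print(b, round(throughput(b), 1))
```

This is why "increase batch size until the memory limit" works: each doubling amortizes the fixed overhead further, until memory (or bandwidth saturation) caps the batch.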

Queue Sizing

Little's Law: L = λW

  • L = queue length
  • λ = arrival rate
  • W = wait time

Size queues to absorb burstiness without exceeding latency SLA. Monitor p99 latency and drop rate.
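
As a worked example (numbers assumed for illustration): at λ = 500 requests/s with a wait budget of W = 40 ms, Little's Law gives L = λW = 500 × 0.04 = 20 queued requests. A queue much deeper than that just converts overload into SLA violations instead of drops:

```python
def max_queue_length(arrival_rate_per_s, wait_budget_s):
    """Little's Law, L = lambda * W: the queue depth that keeps
    average wait within the budget at the given arrival rate."""
    return arrival_rate_per_s * wait_budget_s

# 500 req/s with a 40 ms wait budget -> cap the queue near 20 requests
print(max_queue_length(500, 0.040))
```

Since Little's Law relates averages, leave headroom for bursts: cap the queue near this value and shed or fast-fail load beyond it, watching p99 latency and drop rate as the two failure signals.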

Operational Efficiency

Efficient batching reduces:

  • Idle GPU time
  • Per-request overhead
  • Energy per inference (better utilization)
  • Cost per thousand inferences

Conclusion

Dynamic batching and queue tuning are essential for production ML inference. Getting the balance right improves both user experience and the operational efficiency of inference infrastructure.