
Batching and Throughput Optimization for ML Inference

Dynamic batching, queue sizing, and GPU utilization strategies to balance latency and throughput in production inference pipelines.

Akshay Mulgavkar · April 10, 2024 · 11 min read

Production ML inference systems face a fundamental trade-off: low latency for individual requests versus high throughput for batch processing. This article covers practical techniques to optimize both, with implications for compute efficiency in data centre deployments.

The Latency-Throughput Trade-off

Unbatched Inference

  • Each request processed independently
  • Low latency (ms) but poor GPU utilization
  • High per-request overhead

Static Batching

  • Fixed batch size, wait to fill
  • High throughput but variable latency
  • Underutilization during low traffic

Dynamic Batching

  • Adaptive batch formation
  • Balance latency SLA with throughput
  • Better operational efficiency

Dynamic Batching Strategies

Time-Based Batching

import time

class DynamicBatcher:
    """Forms a batch when either the size cap or the wait deadline is hit."""

    def __init__(self, max_batch_size=32, max_wait_ms=50):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.last_batch_time = time.time()

    def add(self, request):
        """Enqueue a request; return a ready batch, or None if still waiting."""
        self.queue.append(request)
        elapsed_ms = (time.time() - self.last_batch_time) * 1000
        # Note: the deadline is only checked on arrival; production systems
        # pair this with a background timer that also calls flush().
        if len(self.queue) >= self.max_batch_size or elapsed_ms >= self.max_wait_ms:
            return self.flush()
        return None

    def flush(self):
        """Drain up to max_batch_size requests and reset the wait timer."""
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        self.last_batch_time = time.time()
        return batch
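
One caveat with the batcher above: the deadline is only checked when a new request arrives, so during an idle spell a queued request can wait well past max_wait_ms. A common remedy is a background flush loop; a minimal sketch, where the dispatch callback and interval are illustrative, not prescribed:

```python
import threading
import time

def periodic_flush(batcher, dispatch, interval_s, stop_event):
    """Flush on a timer so a lone request still meets its deadline
    even when no new arrivals trigger the size/deadline check."""
    while not stop_event.is_set():
        time.sleep(interval_s)
        batch = batcher.flush()
        if batch:
            dispatch(batch)
```

Run this in a daemon thread alongside the request handler; the interval should be a fraction of max_wait_ms so the timer adds little extra latency.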

Priority Queues

Route latency-sensitive requests to a fast path; batch non-urgent requests for throughput.
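
A minimal sketch of this two-path routing, assuming each request carries a deadline in milliseconds (the field name and the 20 ms cutoff are illustrative):

```python
import heapq
import itertools

class PriorityRouter:
    """Sends tight-deadline requests to a fast path; batches the rest,
    most urgent first."""

    def __init__(self, fast_cutoff_ms=20):
        self.fast_cutoff_ms = fast_cutoff_ms
        self.batch_queue = []               # min-heap ordered by deadline
        self._counter = itertools.count()   # tie-breaker for equal deadlines

    def route(self, request, deadline_ms):
        if deadline_ms <= self.fast_cutoff_ms:
            return "fast"                   # run immediately, batch of one
        heapq.heappush(self.batch_queue,
                       (deadline_ms, next(self._counter), request))
        return "batched"

    def next_batch(self, max_batch_size):
        """Pop the most urgent batched requests first."""
        batch = []
        while self.batch_queue and len(batch) < max_batch_size:
            _, _, req = heapq.heappop(self.batch_queue)
            batch.append(req)
        return batch
```

The heap ordering means a near-deadline request never sits behind a relaxed one, which keeps p99 latency bounded on the batched path.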

GPU Utilization

Profiling

  • Use Nsight Systems or PyTorch profiler
  • Identify kernel launch overhead
  • Measure memory bandwidth saturation

Optimization Levers

  • Increase batch size until GPU memory limit
  • Use TensorRT or ONNX Runtime for fused kernels
  • Pipeline data loading with inference
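
The first lever can be reasoned about with a simple cost model: if each batch pays a fixed launch overhead plus a per-item cost, throughput rises with batch size but with diminishing returns, approaching the per-item rate. A toy sketch with assumed constants (real values come from profiling):

```python
def throughput(batch_size, overhead_ms=2.0, per_item_ms=0.5):
    """Requests/sec when a batch costs a fixed overhead plus per-item time."""
    batch_latency_ms = overhead_ms + per_item_ms * batch_size
    return batch_size * 1000.0 / batch_latency_ms

# Diminishing returns: the curve flattens toward 1000 / per_item_ms = 2000 req/s
for b in (1, 8, 32, 128):
    print(b, round(throughput(b), 1))
```

This is why "increase batch size until the memory limit" works: each doubling amortizes the fixed overhead further, until memory (or bandwidth saturation) caps the batch.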

Queue Sizing

Little's Law: L = λW

  • L = queue length
  • λ = arrival rate
  • W = wait time

Size queues to absorb burstiness without exceeding latency SLA. Monitor p99 latency and drop rate.
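
As a worked example (numbers assumed for illustration): at λ = 500 requests/s with a wait budget of W = 40 ms, Little's Law gives L = λW = 500 × 0.04 = 20 queued requests. A queue much deeper than that just converts overload into SLA violations instead of drops:

```python
def max_queue_length(arrival_rate_per_s, wait_budget_s):
    """Little's Law, L = lambda * W: the queue depth that keeps
    average wait within the budget at the given arrival rate."""
    return arrival_rate_per_s * wait_budget_s

# 500 req/s with a 40 ms wait budget -> cap the queue near 20 requests
print(max_queue_length(500, 0.040))
```

Since Little's Law relates averages, leave headroom for bursts: cap the queue near this value and shed or fast-fail load beyond it, watching p99 latency and drop rate as the two failure signals.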

Operational Efficiency

Efficient batching reduces:

  • Idle GPU time
  • Per-request overhead
  • Energy per inference (better utilization)
  • Cost per thousand inferences

Conclusion

Dynamic batching and queue tuning are essential for production ML inference. Getting the balance right improves both user experience and the operational efficiency of inference infrastructure.