You can auto-scale an LLM inference service in Kubernetes by configuring a HorizontalPodAutoscaler (HPA) that targets CPU utilization or custom metrics.
The manifest below is a minimal sketch of that setup; the resource names (llm-inference), the container image, the port, and the CPU/memory values are illustrative assumptions you would adapt to your own model server:
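```yaml
# Deployment: runs the LLM inference pods.
# All names, the image, and the resource values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2                     # initial count; the HPA takes over scaling from here
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: inference-server
          image: your-registry/llm-inference:latest   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"            # CPU requests are required for the HPA to compute utilization
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
---
# HorizontalPodAutoscaler: scales the Deployment between 2 and 10 replicas
# based on average CPU utilization relative to the requests above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```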

The manifest above has three key pieces:
- A Deployment manages the LLM inference pods, with CPU requests and limits defined so the HPA can compute utilization against them.
- A HorizontalPodAutoscaler (HPA) dynamically scales the number of pods between 2 and 10.
- CPU utilization is the scaling metric, with a target of 70% average usage across the pods.
With this in place, the inference service scales out as CPU load rises and scales back down toward the two-replica floor when traffic subsides.
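Assuming the manifest is saved as llm-inference.yaml (a hypothetical filename), you can apply it and watch the autoscaler react:

```sh
# create the Deployment and the HPA
kubectl apply -f llm-inference.yaml

# watch current vs. target CPU utilization and the replica count
kubectl get hpa llm-inference-hpa --watch
```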