OpenShift Virtualization and Resource Overcommitment
By Mark DeNeve
As OpenShift Virtualization continues to gain attention and attract new users in the field, certain questions come up over and over again:
- How do I overcommit CPU?
- How do I overcommit memory?
- How do I make sure my overcommitting doesn’t have adverse effects on my VMs?
This post (and a few subsequent posts) will delve into these questions in the hope that, by going deep into the details, you can better understand how overcommit works in OpenShift Virtualization and what guardrails are in place to help make sure you don’t adversely affect your VMs.
There is a lot to digest here, so sit back, grab your favorite “cold snack” and follow along as we discuss Kubernetes scheduling, and how OpenShift Virtualization uses the features of K8s to manage your virtual resources.
One final note: the discussion below does NOT take into account things like CPU pinning or setting your VMs to use dedicated resources.
Kubernetes Requests and Limits
To understand how OpenShift Virtualization manages VMs, we first must understand a KEY Kubernetes construct … Requests and Limits. Because CPU and memory resources differ in how they can be used, we need to look at each individually.
As we go into the details keep the following in mind (from the official Kubernetes documentation):
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
Stated another way, Kubernetes scheduling does not care about the ACTUAL usage on a node, only what resources it has been told to hold in reserve for each container running on that node.
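To make this concrete, here is a minimal sketch of a pod that declares requests (the pod and image names are just placeholders, and the numbers are arbitrary):

```yaml
# Illustrative pod spec: the scheduler sums the requests of all containers
# already placed on a node and only admits this pod if the node's allocatable
# CPU and memory can still cover the 500m / 1Gi below, regardless of how busy
# (or idle) the node actually is right now.
apiVersion: v1
kind: Pod
metadata:
  name: request-demo                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
```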
Requests/Limits - CPU
CPU requests can be thought of as a weighted accounting of processor time: the higher the CPU request on a given container, the more CPU cycles it will get. CPU requests are measured in millicores, and 1000m = 1 CPU. So a node with 4 CPUs has a total CPU capacity of 4000m. To explain this better, let’s use an example:
Given a node with 4 CPUs (4000m) running 5 containers, we would divide up our CPU as follows:
| Container Name | CPU Requests |
| --- | --- |
| container1 | 1000m |
| container2 | 750m |
| container3 | 750m |
| container4 | 500m |
| container5 | 1000m |
In a heavily loaded scenario, container2 and container3 each get 18.75% of the CPU time, container1 and container5 each get 25%, and container4 gets 12.5%. Note that this weighting only matters when the CPUs are running at 100% utilization. If the physical CPU resources are not maxed out, a container can consume free CPU cycles as long as a container with a higher request doesn’t ask for them. So in the scenario above, if containers 1, 2, 3, and 5 are idle … container4 could consume all 4000m until something else needed the cycles. This is known as being “Burstable”. It will come into play later when we discuss virtualization overcommit.
If a limit is defined for CPU, the Linux kernel will throttle the process once it has used up its allocated share of CPU time for the current scheduling period. This throttling occurs even if there are free CPU cycles available, which is why the use of limits on CPU resources is not encouraged.
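As a rough sketch, here is how two of the containers from the table above would declare their CPU requests (shown in a single pod for brevity; in practice they could be spread across several pods on the node). A limit is added to container4 purely to illustrate throttling, not as a recommendation:

```yaml
# Sketch of CPU requests from the example table, plus one illustrative limit.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-shares-demo              # hypothetical name
spec:
  containers:
  - name: container1
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 1000m                   # 25% of CPU time on a fully loaded 4000m node
  - name: container4
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: 500m                    # 12.5% of CPU time under contention
      limits:
        cpu: 500m                    # hard cap: throttled even when the node is idle
```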
Requests/Limits - Memory
When it comes to memory requests and limits, Kubernetes is much more rigid. If memory is assigned to an application, it is assumed to be in use and cannot be shared by other apps. (Yeah, I know … SWAP, but let’s put that aside for a bit. We will come back to swap later.) So when you deploy an application in Kubernetes, you should always assign a “Request” and a “Limit” for memory usage. The request is “how much memory does my application need to start and run under normal conditions?” This number is used to help Kubernetes decide where to schedule the pod: it will place your pod on a machine that has enough free memory to meet the request. The limit is even more important. If you set a limit on your pod and your application starts to use more memory than that limit, the Linux kernel will KILL your process with an “Out of Memory” (OOM) signal. This ends your application immediately, with no chance to recover the running process. Having a lower and an upper bound allows an application to “burst” into free memory if it is available, but the moment you go over your request, you are subject to OOM termination if the node comes under memory pressure.
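A minimal sketch of what that looks like in a pod spec (names and numbers are placeholders):

```yaml
# The request (512Mi) drives scheduling; usage above the 1Gi limit gets the
# process OOM-killed, and usage between the two makes the pod a likely victim
# if the node comes under memory pressure.
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo                  # hypothetical name
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sleep", "infinity"]
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi
```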
Requests/Limits and OpenShift Virtualization
OK, so now that we have a base understanding of Requests and Limits, how does this come into play with OpenShift Virtualization and overcommitting resources? As before, let’s start with CPU.
OpenShift Virtualization and CPU assignment
When it comes to hypervisors and virtualization in general, users tend to want to “oversubscribe” resources, or put another way, “place 10 lbs of dog food in a 1 lb bag”. Out of the box, OpenShift Virtualization is set to oversubscribe CPU resources 10:1. This means that for every one physical CPU, OpenShift Virtualization will allocate up to 10 virtual CPUs. In other hypervisors this oversubscription has different names:
- vSphere - This is called CPU-Overcommitment and is controlled at the DRS Level
- Nutanix - This is called CPU Oversubscription and is managed by the Acropolis Dynamic Scheduler
In OpenShift this ratio is known as the “vmiCPUAllocationRatio” and can be configured by Setting the CPU allocation ratio in the kubevirt-hyperconverged definition, and this setting applies to the entire cluster.
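As a rough sketch (check the linked “Setting the CPU allocation ratio” procedure against your OpenShift Virtualization version for the exact field path), setting a 4:1 ratio in the HyperConverged custom resource looks something like this:

```yaml
# Sketch of configuring the cluster-wide CPU allocation ratio.
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  resourceRequirements:
    vmiCPUAllocationRatio: 4         # 1 physical core backs up to 4 vCPUs
```

If you prefer to edit the resource in place, `oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv` gets you there. Note that requests are set when a virt-launcher pod is created, so VMs that are already running should only pick up a new ratio the next time their pod is recreated (for example after a restart or live migration).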
So how does this work in practice? Remember that Kubernetes treats every physical CPU core as 1000 millicores, so a 10-core physical server has 10,000 millicores available to be allocated to workloads. To oversubscribe the CPU resources, we combine the vmiCPUAllocationRatio with Kubernetes CPU requests. Let’s assume we are using the default 10:1 vmiCPUAllocationRatio. This means that for every 1 physical CPU we have in the cluster, we can allocate 10 virtual CPUs. To achieve this with Kubernetes controls, we take our 1000 millicores and break them into 10 slices, so for every 1 vCPU you assign to a virtual machine, OpenShift Virtualization requests 100 millicores. If the ratio were 4:1, it would request 250 millicores for every vCPU you assign to your virtual machines.
Kubernetes now has a way to schedule OpenShift Virtualization VMs in a context that it understands. Using a 10:1 ratio and our example server with 10 physical cores, Kubernetes knows that it can pack up to 100 vCPUs onto that physical server.
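To illustrate (this VirtualMachine is trimmed to the fields relevant here, and the name is a placeholder), with the default 10:1 ratio a 4 vCPU VM ends up requesting roughly 400m of CPU on the virt-launcher pod that backs it, plus a small amount of management overhead:

```yaml
# Illustrative VM CPU definition under the default 10:1 allocation ratio.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm                      # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4                   # 4 vCPUs presented to the guest -> ~400m requested
        memory:
          guest: 8Gi
        devices: {}
```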
When it comes to CPU oversubscription ratios, 10:1 is high in my opinion. Industry recommendations are normally closer to the 4:1 range; however, now you know how to set the oversubscription to a number you feel more comfortable with.
It is NOT a requirement to oversubscribe CPUs. When it comes to OpenShift Virtualization you have two options. The first is to set the vmiCPUAllocationRatio to “1”, which ensures that for every 1 vCPU, 1000 millicores are assigned. Keep in mind that those 1000 millicores may come from multiple physical cores. If you need dedicated physical cores, the other option is Enabling dedicated resources for virtual machines, as sketched below. You may want to do this when you need more fine-grained control over process switching and want to keep latency for your application to a minimum.
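Here is a rough sketch of the dedicated-resources approach (again trimmed to the relevant fields, with a placeholder name; see the linked procedure for the full requirements, such as CPU Manager on the target nodes):

```yaml
# Sketch of a VM that asks for dedicated physical cores instead of
# participating in CPU overcommit.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: latency-sensitive-vm         # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 4
          dedicatedCpuPlacement: true   # pin the vCPUs to host cores
        memory:
          guest: 8Gi
        devices: {}
```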
When using CPU overcommit, make sure you monitor metrics like kubevirt_vmi_vcpu_wait_seconds_total to ensure that your VMs are not sitting around waiting for additional CPU cycles. I will cover monitoring oversubscription in another post.
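As a starting point, something like the following PrometheusRule sketch could flag VMs that spend a noticeable share of their time waiting on CPU. The rule name, namespace, threshold, and label names are all assumptions (label names on KubeVirt metrics can vary between versions, and where you place the rule depends on how monitoring is set up in your cluster):

```yaml
# Hedged sketch of an alert on vCPU wait time; 0.2 (20%) is an arbitrary
# starting threshold, not a recommendation.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vmi-cpu-wait                 # hypothetical name
  namespace: openshift-cnv           # adjust to where your monitoring stack picks up rules
spec:
  groups:
  - name: vmi-cpu-overcommit
    rules:
    - alert: VirtualMachineCPUStarved
      expr: sum by (name, namespace) (rate(kubevirt_vmi_vcpu_wait_seconds_total[5m])) > 0.2
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "VM {{ $labels.name }} is spending a significant share of time waiting for CPU"
```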
OpenShift Virtualization and Memory Assignment
Just like with CPU, OpenShift Virtualization makes use of requests and limits when it comes to virtual machine memory. Out of the box, there is no way to overcommit memory, so if your VM has 6Gi of memory, 6Gi of memory has been “requested” from Kubernetes and is dedicated to your VM. There is no opportunity for another VM to eat into your requested/reserved memory. That being said, there are ways to optimize memory usage with OpenShift Virtualization and to overcommit memory as well. These options need to be enabled and configured on your cluster; they are not active by default.
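For illustration (trimmed to the relevant fields, placeholder name), the 6Gi given to the guest below becomes roughly a 6Gi memory request on the backing virt-launcher pod, plus a small amount of virtualization overhead, so the full amount is reserved on the node for this VM:

```yaml
# Illustrative VM memory definition: reserved in full, no overcommit by default.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: memory-demo-vm               # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 2
        memory:
          guest: 6Gi                 # requested from Kubernetes and dedicated to this VM
        devices: {}
```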
The first option is to enable Kernel Same-page Merging, or KSM. KSM deduplicates identical data found in the memory pages of the virtual machines (VMs) on a given host node. If you have very similar VMs, KSM can make it possible to schedule more VMs on a single node than the node’s total amount of memory would otherwise support. The caveat is that you must only use KSM with trusted workloads; there are significant security concerns with using KSM in non-trusted environments. The details of why are very technical and out of the scope of this post, but you can find more details in the references linked at the end of this post.
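As a rough sketch (verify the field path against the documentation for your version), KSM is enabled through the HyperConverged custom resource; an empty node label selector targets every node, while a real selector lets you scope it to specific nodes:

```yaml
# Sketch of enabling KSM cluster-wide; remember: trusted workloads only.
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  ksmConfiguration:
    nodeLabelSelector: {}            # empty selector = run KSM on all nodes
```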
The second way to overcommit memory is to leverage SWAP, where inactive memory is copied out to local disk on your physical servers. The use of SWAP in Kubernetes (and thus OpenShift Virtualization) has been frowned upon in the past and was not actually supported for many years. However, in more recent versions of OpenShift, support for SWAP in certain scenarios has been added. By swapping unused virtual machine memory out to disk, OpenShift is able to put more running VMs on a given physical host. You can find the official documentation on enabling swap here: Configuring higher VM workload density.
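Once swap is available on the workers (via the wasp-agent described in the linked procedure), the overcommit knob itself lives in the HyperConverged custom resource. The sketch below is illustrative only; the exact fields and their semantics are version-dependent, so follow “Configuring higher VM workload density” for your release:

```yaml
# Rough sketch of the memory overcommit setting for higher workload density.
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  higherWorkloadDensity:
    memoryOvercommitPercentage: 150  # example value; see the linked docs for exact semantics
```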
In an upcoming blog post we will discuss enabling SWAP and KSM to optimize the amount of VMs that can be run in a cluster.
Conclusion
In this post, we covered the basics of Kubernetes and OpenShift CPU and memory allocation and utilization. In the next post, we will focus on how to use WASP and KSM to overcommit memory, allowing you to get much denser utilization of your physical hosts. Following that post, we will go into CPU overcommit and making sure you are not adversely affecting your VM performance, by looking at statistics gathered by OpenShift Virtualization that will tell you if you have set your CPU ratio too high. The final post in this series will go into the Kube Descheduler: what it is, what it does, and most importantly what it does NOT do.