NUMA¶
FEATURE STATE: KubeVirt v0.43
NUMA support in KubeVirt is at this stage limited to a small set of special use-cases and will improve over time together with improvements made to Kubernetes.
In general, the goal is to map the host NUMA topology as efficiently as possible to the Virtual Machine topology to improve the performance.
The following NUMA mapping strategies can be used:
Preconditions¶
In order to use current NUMA support, the following preconditions must be met:
- Dedicated CPU Resources must be configured.
- Hugepages need to be allocatable on target nodes.
- The
NUMA
feature gate must be enabled.
GuestMappingPassthrough¶
GuestMappingPassthrough will pass through the node numa topology to the guest.
The topology is based on the dedicated CPUs which the VMI got assigned from the
kubelet via the CPU Manager. It can be requested by
setting spec.domain.cpu.guestMappingPassthrough
on the VMI.
Since KubeVirt does not know upfront which exclusive CPUs the VMI will get from the kubelet, there are some limitations:
- Guests may see different NUMA topologies when being rescheduled.
- The resulting NUMA topology may be asymmetrical.
- The VMI may fail to start on the node if not enough hugepages are available on the assigned NUMA nodes.
While this NUMA modelling strategy has its limitations, aligning the guest's NUMA architecture with the node's can be critical for high-performance applications.
An example VMI may look like this:
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
name: numavm
spec:
domain:
cpu:
cores: 4
dedicatedCpuPlacement: true
numa:
guestMappingPassthrough: { }
devices:
disks:
- disk:
bus: virtio
name: containerdisk
- disk:
bus: virtio
name: cloudinitdisk
resources:
requests:
memory: 64Mi
memory:
hugepages:
pageSize: 2Mi
volumes:
- containerDisk:
image: quay.io/kubevirt/cirros-container-disk-demo
name: containerdisk
- cloudInitNoCloud:
userData: |
#!/bin/sh
echo 'printed from cloud-init userdata'
name: cloudinitdisk
Running real-time workloads¶
Overview¶
It is possible to deploy Virtual Machines that run a real-time kernel and make use of libvirtd's guest cpu and memory optimizations that improve the overall latency. These changes leverage mostly on already available settings in KubeVirt, as we will see shortly, but the VMI manifest now exposes two new settings that instruct KubeVirt to configure the generated libvirt XML with the recommended tuning settings for running real-time workloads.
To make use of the optimized settings, two new settings have been added to the VMI schema:
-
spec.domain.cpu.realtime
: When defined, it instructs KubeVirt to configure the linux scheduler for the VCPUS to run processes in FIFO scheduling policy (SCHED_FIFO) with priority 1. This setting guarantees that all processes running in the host will be executed with real-time priority. -
spec.domain.cpu.realtime.mask
: It defines which VCPUs assigned to the VM are used for real-time. If not defined, libvirt will define all VCPUS assigned to run processes in FIFO scheduling and in the highest priority (1).
Preconditions¶
A prerequisite to running real-time workloads include locking resources in the cluster to allow the real-time VM exclusive usage. This translates into nodes, or node, that have been configured with a dedicated set of CPUs and also provides support for NUMA with a free number of hugepages of 2Mi or 1Gi size (depending on the configuration in the VMI). Additionally, the node must be configured to allow the scheduler to run processes with real-time policy.
Nodes capable of running real-time workloads¶
When the KubeVirt pods are deployed in a node, it will check if it is capable of running processes in real-time scheduling policy and label the node as real-time capable (kubevirt.io/realtime). If, on the other hand, the node is not able to deliver such capability, the label is not applied. To check which nodes are able to host real-time VM workloads run this command:
$>kubectl get nodes -l kubevirt.io/realtime
NAME STATUS ROLES AGE VERSION
worker-0-0 Ready worker 12d v1.20.0+df9c838
Internally, the KubeVirt pod running in each node checks if the kernel setting kernel.sched_rt_runtime_us
equals to -1, which grants processes to run in real-time scheduling policy for an unlimited amount of time.
Configuring a VM Manifest¶
Here is an example of a VM manifest that runs a custom fedora container disk configured to run with a real-time kernel. The settings have been configured for optimal efficiency.
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
labels:
kubevirt.io/vm: fedora-realtime
name: fedora-realtime
spec:
running: true
template:
metadata:
labels:
kubevirt.io/vm: fedora-realtime
spec:
domain:
devices:
autoattachSerialConsole: true
autoattachMemBalloon: false
autoattachGraphicsDevice: false
disks:
- disk:
bus: virtio
name: containerdisk
- disk:
bus: virtio
name: cloudinitdisk
machine:
type: ""
resources:
requests:
memory: 1Gi
cpu: 2
limits:
memory: 1Gi
cpu: 2
cpu:
model: host-passthrough
dedicatedCpuPlacement: true
isolateEmulatorThread: true
ioThreadsPolicy: auto
features:
- name: tsc-deadline
policy: require
numa:
guestMappingPassthrough: {}
realtime: {}
memory:
hugepages:
pageSize: 1Gi
terminationGracePeriodSeconds: 0
volumes:
- containerDisk:
image: quay.io/kubevirt/fedora-realtime-container-disk:v20211008-22109a3
name: containerdisk
- cloudInitNoCloud:
userData: |-
#cloud-config
password: fedora
chpasswd: { expire: False }
bootcmd:
- tuned-adm profile realtime
name: cloudinitdisk
Breaking down the tuned sections, we have the following configuration:
Devices: - Disable the guest's memory balloon capability - Avoid attaching a graphics device, to reduce the number of interrupts to the kernel.
spec:
domain:
devices:
autoattachSerialConsole: true
autoattachMemBalloon: false
autoattachGraphicsDevice: false
CPU:
- model: host-passthrough
to allow the guest to see host CPU without masking any capability.
- dedicated CPU Placement: The VM needs to have dedicated CPUs assigned to it. The Kubernetes CPU Manager takes care of this aspect.
- isolatedEmulatorThread: to request an additional CPU to run the emulator on it, thus avoid using CPU cycles from the workload CPUs.
- ioThreadsPolicy: Set to auto to let the dedicated IO thread to run in the same CPU as the emulator thread.
- NUMA: defining guestMappingPassthrough
enables NUMA support for this VM.
- realtime: instructs the virt-handler to configure this VM for real-time workloads, such as configuring the VCPUS to use FIFO scheduler policy and set priority to 1.
cpu:
cpu:
model: host-passthrough
dedicatedCpuPlacement: true
isolateEmulatorThread: true
ioThreadsPolicy: auto
features:
- name: tsc-deadline
policy: require
numa:
guestMappingPassthrough: {}
realtime: {}
Memory - pageSize: allocate the pod's memory in hugepages of the given size, in this case of 1Gi.
How to dedicate VCPUS for real-time only¶
It is possible to pass a regular expression of the VCPUs to isolate to use real-time scheduling policy, by using the realtime.mask
setting.
When applied this configuration, KubeVirt will only set the first VCPU for real-time scheduler policy, leaving the remaining VCPUS to use the default scheduler policy. Other examples of valid masks are:
- 0-3
: Use cores 0 to 3 for real-time scheduling, assuming that the VM has requested at least 3 cores.
- 0-3,^1
: Use cores 0, 2 and 3 for real-time scheduling only, assuming that the VM has requested at least 3 cores.
Additional Reading¶
Kubernetes provides additional NUMA components that may be relevant to your use-case but typically are not enabled by default. Please consult the Kubernetes documentation for details on configuration of these components.
Topology Manager¶
Topology Manager provides optimizations related to CPU isolation, memory and device locality. It is useful, for example, where an SR-IOV network adaptor VF allocation needs to be aligned with a NUMA node.
https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
Memory Manager¶
Memory Manager is analogous to CPU Manager. It is useful, for example, where you want to align hugepage allocations with a NUMA node. It works in conjunction with Topology Manager.
The Memory Manager employs hint generation protocol to yield the most suitable NUMA affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with these affinity hints. Based on both the hints and Topology Manager policy, the pod is rejected or admitted to the node.
https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/