Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The Talos-published NVIDIA OSS drivers are bound to a specific Talos release, so the extension versions also need to be updated when upgrading Talos. We will be using the following NVIDIA OSS system extensions:
- nvidia-open-gpu-kernel-modules
- nvidia-container-toolkit
Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions. The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
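For reference, here is a minimal sketch of an Image Factory schematic that bakes both extensions into the Talos installer image. The exact official extension names (for example, production or LTS driver variants) vary by Talos release, so verify them against the extensions published for your version:

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nvidia-open-gpu-kernel-modules
      - siderolabs/nvidia-container-toolkit
```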
Proprietary vs OSS Nvidia Driver Support
The NVIDIA Linux GPU Driver contains several kernel modules: nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko, nvidia-drm.ko, and nvidia-peermem.ko.
Two “flavors” of these kernel modules are provided, and both are available for use within Talos:
- Proprietary: this is the flavor that NVIDIA has historically shipped.
- Open, i.e. source-published/OSS: kernel modules that are dual-licensed MIT/GPLv2. With every driver release, the source code for the open kernel modules is published at https://github.com/NVIDIA/open-gpu-kernel-modules and a tarball is provided at https://download.nvidia.com/XFree86/.
Enabling the NVIDIA OSS modules
Patch the Talos machine configuration using the patch gpu-worker-patch.yaml:
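A sketch of what gpu-worker-patch.yaml typically contains, loading the NVIDIA kernel modules at boot; treat the module list and sysctl as assumptions to check against your Talos release:

```yaml
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
```

Apply it to the GPU worker nodes, for example:

```bash
talosctl patch mc --patch @gpu-worker-patch.yaml
```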
Deploying NVIDIA device plugin
First we need to create the RuntimeClass.
Apply the following manifest to create a runtime class that uses the extension:
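The RuntimeClass below uses the handler name nvidia, matching the container runtime registered by the nvidia-container-toolkit extension:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

With the runtime class in place, the device plugin itself can be deployed; one approach (a sketch using the upstream Helm chart, with the chart version an assumption you should pin to a current release) is:

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --version=0.13.0 \
  --set=runtimeClassName=nvidia
```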
(Optional) Setting the default runtime class as nvidia
Do note that this will set the default runtime class to nvidia for all pods scheduled on the node.
Create a patch YAML nvidia-default-runtimeclass.yaml to update the machine config, similar to the one below:
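A sketch of nvidia-default-runtimeclass.yaml using Talos's CRI configuration drop-in mechanism. The containerd CRI plugin section name depends on the containerd version shipped with your Talos release (io.containerd.grpc.v1.cri for containerd 1.x, io.containerd.cri.v1.runtime for containerd 2.x), so treat the exact TOML path as an assumption to verify:

```yaml
- op: add
  path: /machine/files
  value:
    - content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create
```

Apply it with, for example, talosctl patch mc --patch @nvidia-default-runtimeclass.yaml.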
Testing the runtime class
Run the following command to test the runtime class. Note the spec.runtimeClassName being explicitly set to nvidia in the pod spec:
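A sketch of such a test, running nvidia-smi in a CUDA base image (the image tag is an assumption; any recent CUDA base image works):

```bash
kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  -- nvidia-smi
```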
Collecting NVIDIA GPU debug data
When debugging NVIDIA GPU issues (for example, NVRM: GPU has fallen off the bus messages in the kernel log), NVIDIA support will often ask for the output of nvidia-bug-report.sh.
Talos does not allow direct shell access on the nodes, but you can still generate this report by using kubectl debug. To do this, follow the steps below; a command sketch covering all of them follows the list:
- Start a debug pod on the affected node.
- Then attach to it with the sysadmin debug profile.
- Inside the debug container, download the NVIDIA driver bundle and extract nvidia-bug-report.sh, replacing <nvidia-driver-version> and <node-architecture> with your actual values: <nvidia-driver-version> is the NVIDIA driver version running on your Talos nodes (it matches the driver version of the nvidia-open-gpu-kernel-modules extension), and <node-architecture> is the architecture of the node (for example x86_64 or aarch64).
- Run nvidia-bug-report.sh, which generates nvidia-bug-report.log.gz in the current directory.
- Copy the report off the cluster and send nvidia-bug-report.log.gz to NVIDIA support.
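A minimal sketch of the full sequence, assuming kubectl 1.27 or newer (for the --profile=sysadmin flag on kubectl debug); <node-name>, <debug-pod-name>, <nvidia-driver-version>, <node-architecture>, the debug image, and the extraction path are placeholders to adjust for your environment:

```bash
# 1. Start a debug pod on the affected node; kubectl prints the generated
#    pod name (node-debugger-<node-name>-xxxxx).
kubectl debug node/<node-name> --profile=sysadmin --image=ubuntu:24.04 -- sleep infinity

# 2. Attach to it (an interactive shell via exec).
kubectl exec -it <debug-pod-name> -- bash

# 3. Inside the debug container: fetch the driver bundle matching the node's
#    driver version and architecture, then extract it to obtain
#    nvidia-bug-report.sh.
apt-get update && apt-get install -y curl
DRIVER_VERSION=<nvidia-driver-version>   # e.g. 550.90.07
ARCH=<node-architecture>                 # x86_64 or aarch64
curl -LO "https://download.nvidia.com/XFree86/Linux-${ARCH}/${DRIVER_VERSION}/NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run"
sh "NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run" --extract-only
cd "NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}"

# 4. Run the report script; it writes nvidia-bug-report.log.gz to the
#    current directory.
./nvidia-bug-report.sh

# 5. Back on your workstation, copy the report off the cluster (adjust the
#    path to wherever the bundle was extracted).
kubectl cp "<debug-pod-name>:/NVIDIA-Linux-<node-architecture>-<nvidia-driver-version>/nvidia-bug-report.log.gz" ./nvidia-bug-report.log.gz
```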