Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The Talos-published NVIDIA OSS drivers are bound to a specific Talos release, and the extension versions also need to be updated when upgrading Talos.
We will be using the following NVIDIA OSS system extensions:
  • nvidia-open-gpu-kernel-modules
  • nvidia-container-toolkit
Create the boot assets which include the system extensions mentioned above (or create a custom installer and perform a machine upgrade if Talos is already installed).
Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions. The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
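For example, if you build your boot assets through the Talos Image Factory, a schematic including both extensions could look like the sketch below (the file name schematic.yaml is arbitrary; confirm the exact extension names against the current extension catalog):
# schematic.yaml - minimal Image Factory schematic sketch
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nvidia-open-gpu-kernel-modules
      - siderolabs/nvidia-container-toolkit
If you use the hosted Image Factory, uploading the schematic returns a schematic ID that can be used to build the installer image:
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics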

Proprietary vs OSS NVIDIA Driver Support

The NVIDIA Linux GPU Driver contains several kernel modules: nvidia.ko, nvidia-modeset.ko, nvidia-uvm.ko, nvidia-drm.ko, and nvidia-peermem.ko. Two “flavors” of these kernel modules are provided, and both are available for use within Talos: the proprietary modules and the open-source (OSS) modules. The choice between the proprietary and OSS flavors can be made after referencing the official NVIDIA announcement.

Enabling the NVIDIA OSS modules

Patch Talos machine configuration using the patch gpu-worker-patch.yaml:
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
talosctl patch mc --patch @gpu-worker-patch.yaml
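If you do not want to rely on the default talosctl context, you can target the GPU nodes explicitly (the node IP below is a placeholder):
talosctl --nodes 10.5.0.3 patch mc --patch @gpu-worker-patch.yaml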
The NVIDIA modules should be loaded and the system extensions should be installed. This can be confirmed by running:
talosctl get modules
which should produce an output similar to below:
NODE       NAMESPACE   TYPE                 ID                     VERSION   STATE
10.5.0.3   runtime     LoadedKernelModule   nvidia_uvm             1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia_drm             1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia_modeset         1         Live
10.5.0.3   runtime     LoadedKernelModule   nvidia                 1         Live
talosctl get extensions
which should produce an output similar to below:
NODE           NAMESPACE   TYPE              ID                                                                           VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0            1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0       1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0

Deploying NVIDIA device plugin

First, we need to create the RuntimeClass. Apply the following manifest to create a runtime class that uses the extension:
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
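Assuming the manifest above is saved as nvidia-runtimeclass.yaml (the file name is arbitrary), apply it with:
kubectl apply -f nvidia-runtimeclass.yaml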
Install the NVIDIA device plugin:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.13.0 --set=runtimeClassName=nvidia
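Once the device plugin is running, GPU nodes should advertise an nvidia.com/gpu resource in their allocatable capacity; one way to check (replace <node-name> with an actual node name) is:
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'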

(Optional) Setting the default runtime class as nvidia

Do note that this will set the default runtime class to nvidia for all pods scheduled on the node.
Create a patch file nvidia-default-runtimeclass.yaml to update the machine config, similar to the one below:
- op: add
  path: /machine/files
  value:
    - content: |
        [plugins]
          [plugins."io.containerd.cri.v1.runtime"]
            [plugins."io.containerd.cri.v1.runtime".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create
Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:
talosctl patch mc --patch @nvidia-default-runtimeclass.yaml
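To confirm the file landed on a node, you can read it back with talosctl (the node IP is a placeholder):
talosctl --nodes 10.5.0.3 read /etc/cri/conf.d/20-customization.part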

Testing the runtime class

Note the spec.runtimeClassName being explicitly set to nvidia in the pod spec.
Run the following command to test the runtime class:
kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  nvidia-smi

Collecting NVIDIA GPU debug data

When debugging NVIDIA GPU issues (for example, NVRM: GPU has fallen off the bus messages in the kernel log), NVIDIA support will often ask for the output of nvidia-bug-report.sh. Talos does not allow direct shell access on the nodes, but you can still generate this report by using kubectl debug. To do this:
  1. Start a debug pod on the affected node:
# Replace ${USER} with any unique suffix if needed
kubectl run "node-debugger-${USER}" \
  --restart=Never \
  --namespace kube-system \
  --image nvcr.io/nvidia/cuda:12.8.0-base-ubuntu22.04
  2. Then attach to it with the sysadmin debug profile:
kubectl debug "node-debugger-${USER}" \
  -it \
  --namespace kube-system \
  --profile sysadmin
This will drop you into a shell inside a container running on the node.
  3. Inside the debug container, download the NVIDIA driver bundle and extract nvidia-bug-report.sh by running the following commands:
a. Confirm the driver version Talos is using:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
b. Set the driver version and node architecture in variables:
DRIVER_VERSION=<nvidia-driver-version>
ARCH=<node-architecture>
Replace the placeholders <nvidia-driver-version> and <node-architecture> with your actual values (a worked example with hypothetical values follows step c):
  • <nvidia-driver-version>: The NVIDIA driver version running on your Talos nodes, which you found in step 3a.
  • <node-architecture>: The architecture of the node, for example x86_64 or aarch64.
c. Download the NVIDIA driver bundle and extract the bug report script:
apt-get update && apt-get install -y curl
curl -O "https://us.download.nvidia.com/XFree86/Linux-${ARCH}/${DRIVER_VERSION}/NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run"

sh NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}.run --extract-only
cp NVIDIA-Linux-${ARCH}-${DRIVER_VERSION}/nvidia-bug-report.sh /usr/bin/
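As a worked illustration of steps 3b and 3c, with purely hypothetical values (substitute the version from step 3a and your node's architecture):
DRIVER_VERSION=535.129.03   # hypothetical; use the version reported by nvidia-smi
ARCH=x86_64                 # or aarch64, depending on the node
which makes the downloaded bundle NVIDIA-Linux-x86_64-535.129.03.run.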
  4. Run nvidia-bug-report.sh:
nvidia-bug-report.sh
This will generate nvidia-bug-report.log.gz in the current directory.
  5. To copy the report off the cluster:
a. First, find the name of the debug container (if needed):
kubectl get pod "node-debugger-${USER}" \
  --namespace kube-system \
  -o jsonpath='{.spec.containers[*].name}'
b. Then, from your workstation:
kubectl cp \
  "kube-system/node-debugger-${USER}:/nvidia-bug-report.log.gz" \
  ./nvidia-bug-report.log.gz
You can now upload nvidia-bug-report.log.gz to NVIDIA support.