
I've followed this official tutorial to give a bare-metal k8s cluster GPU access. However, I received errors while doing so.

The setup is Kubernetes 1.21, containerd 1.4.11, and Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-91-generic x86_64).

The NVIDIA driver (version 495, headless) is preinstalled on the host OS.

After pasting the following config into /etc/containerd/config.toml and restarting the service, containerd fails to start with exit code 1.

systemd log here.

Containerd config.toml:

# persistent data location
root = "/var/lib/containerd"
# runtime state information
state = "/run/containerd"

# Kubernetes doesn't use containerd restart manager.
disabled_plugins = ["restart"]

# NVIDIA CONFIG START HERE

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

# NVIDIA CONFIG ENDS HERE

[debug]
  level = ""

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[plugins.linux]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"

I can confirm that the NVIDIA driver detects the GPU (NVIDIA GTX 750 Ti): running nvidia-smi gives the following output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   34C    P8     1W /  38W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Modified config.toml that got it to work.

XPLOT1ON
  • Two things: you'll likely get better help if you post the logs from the container that exited non-zero, since the details matter. Secondly, don't use 1.4.11, there was a security fix in [1.4.12](https://github.com/containerd/containerd/releases/tag/v1.4.12) – mdaniel Dec 01 '21 at 16:57
  • @mdaniel thank you for notifying me of this vulnerability; I've updated all nodes. Also, I've updated the post above with the system log. – XPLOT1ON Dec 02 '21 at 03:22

1 Answer


As best I can tell, it's this:

Dec 02 03:15:36 k8s-node0 containerd[2179737]: containerd: invalid disabled plugin URI "restart" expect io.containerd.x.vx

Dec 02 03:15:36 k8s-node0 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE

So if you know that the restart-ish plugin is in fact enabled, you'll need to track down its new URI syntax. I'd actually recommend just commenting out that stanza, or going with `disabled_plugins = []`, since the containerd Ansible role we use doesn't mention anything about "restart" and does have the `= []` flavor.
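
For illustration, here is a minimal sketch of what the top of the file could look like with that change. The commented-out fully qualified name is an assumption, not something taken from the logs above; it should be checked against the output of `ctr plugins ls` before relying on it.

```toml
# /etc/containerd/config.toml (top-level keys only)
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

# With version = 2, plugin names must be fully qualified
# (the log says "expect io.containerd.x.vx"); a bare "restart" is rejected.
disabled_plugins = []

# Assumption: if the restart plugin really should stay disabled,
# its qualified name is probably the following (verify first):
# disabled_plugins = ["io.containerd.internal.v1.restart"]
```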


Tangentially, you may want to restrict your journalctl invocation in the future to just the containerd.service unit, since that throws out a lot of distracting text: `journalctl -u containerd.service`. You can even restrict it to just the last few lines, which can sometimes help further: `journalctl -u containerd.service --lines=250`

mdaniel
  • Thanks for the extensive reply. I've tried setting `disabled_plugins` to an empty list. It gave me a different error: `containerd: invalid plugin key URI "linux" expect io.containerd.x.vx`. I've attached a complete containerd `config.toml` in the original post. If you could have a look that would be great. – XPLOT1ON Dec 02 '21 at 13:15
  • Yes, it seems to be the same problem; `linux` as an unqualified name is evidently the old style, so what you'll likely want is `[plugins."io.containerd.runtime.v1.linux"]` just like you see with the `[plugins]` members at the top of the file and [as shown in the template I linked to](https://github.com/particuleio/symplegma-containerd/blob/v1.4.3-rel.0/templates/config.toml.j2#L132) – mdaniel Dec 02 '21 at 16:51
  • Thanks for the help, I can now start containerd with the integrated config based on the NVIDIA docs. For future reference, I've updated my original post with the updated config.toml. – XPLOT1ON Dec 03 '21 at 18:22
  • I'm glad to hear it, and I'm always glad when it's something simple, and I wish you good luck on your journey running GPUs in k8s! Please consider putting the config inline in your question, since linking to external sites runs the risk of them being 404 for future generations – mdaniel Dec 04 '21 at 21:07
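
The fix from the comments above can be sketched as the following replacement for the `[plugins.linux]` stanza. It reuses the `shim` and `runtime` values from the question and only illustrates the change discussed in this thread; it is not the poster's actual final file.

```toml
# Old, unqualified form (rejected when version = 2):
# [plugins.linux]

# Fully qualified form, matching the other [plugins."io.containerd..."] tables:
[plugins."io.containerd.runtime.v1.linux"]
  shim = "/usr/bin/containerd-shim"
  runtime = "/usr/bin/runc"
```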