I have the following on-prem Kubernetes environment:
- OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
- Kubernetes: 1.23.7 (single-node, built with kubeadm)
- NVIDIA driver: 515.65.01
- nvidia-container-toolkit: 1.10.0-1.x86_64 (rpm)
- containerd: v1.6.2
- Device plugin: nvcr.io/nvidia/k8s-device-plugin:v0.12.2
I run the following Pod on this server. Only app2 (initContainer2) uses the GPU; a minimal manifest sketch is shown after the diagram below.
initContainer1: app1
↓
initContainer2: app2 (Uses GPU)
↓
container1: app3
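
For reference, here is a minimal sketch of that layout. The image references, memory limit, and restartPolicy are placeholders/assumptions, not the real manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never            # assumed; RESTARTS stays 0 in the output below
  initContainers:
    - name: app1                  # plain init container, no GPU
      image: registry.example.com/app1:latest
    - name: app2                  # GPU init container that gets OOM killed
      image: registry.example.com/app2:latest
      resources:
        limits:
          nvidia.com/gpu: 1       # exposed by the NVIDIA k8s-device-plugin
          memory: 512Mi           # app2 exceeds this and is OOM killed
  containers:
    - name: app3
      image: registry.example.com/app3:latest
```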
When app2 uses too much RAM and is OOM-killed, the Pod should go into the OOMKilled status, but in my environment it is stuck in the PodInitializing status.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default gpu-pod 0/1 PodInitializing 0 83m xxx.xxx.xxx.xxx xxxxx <none> <none>
The relevant part of the kubectl describe pod output is as follows:
Init Containers:
app1:
...
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 30 Aug 2022 10:50:38 +0900
Finished: Tue, 30 Aug 2022 10:50:44 +0900
...
app2:
...
State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Tue, 30 Aug 2022 10:50:45 +0900
Finished: Tue, 30 Aug 2022 10:50:48 +0900
...
app3:
...
State: Waiting
Reason: PodInitializing
...
...
This problem never happens when I replace app2 with another container that doesn't use the GPU, or when I run app2 as the Pod's only regular container (not an init container). In both cases, the status correctly becomes OOMKilled.
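
For example, the second variation roughly corresponds to this sketch (again with a placeholder image and limits):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-single
spec:
  restartPolicy: Never
  containers:
    - name: app2                  # same GPU workload, but as a regular container
      image: registry.example.com/app2:latest
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 512Mi           # exceeding this makes the Pod report OOMKilled as expected
```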
Is this a bug? If so, are there any workarounds?