
Problem description

I am trying to set up a CentOS 7 GPU (NVIDIA Tesla K80) instance on Google Cloud to run CUDA workloads.

Unfortunately, I can't seem to install/configure the drivers properly.

Here is what happens when trying to interact with nvidia-smi (the NVIDIA System Management Interface):

# nvidia-smi -pm 1
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The same operation with the more recent nvidia-persistenced method:

# nvidia-persistenced
nvidia-persistenced failed to initialize. Check syslog for more details.

Additionally, I get the following error in syslog (via the journalctl command):

Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

Indeed, no NVIDIA device files are present:

# ll /dev/nvidia*
ls: cannot access /dev/nvidia*: No such file or directory
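
(For reference, the device files are normally created when the NVIDIA kernel module loads, so a quick way to narrow this down is to check whether the module is present and loadable at all; a sketch, run as root:)

# Is any NVIDIA-related kernel module currently loaded?
lsmod | grep -Ei 'nvidia|nouveau'

# Try loading the module by hand and look at the kernel log for errors
modprobe nvidia
dmesg | tail -n 20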

However, here is proof that the GPU is correctly attached to the instance:

# lshw -numeric -C display
  *-display UNCLAIMED       
       description: 3D controller
       product: GK210GL [Tesla K80] [10DE:102D]
       vendor: NVIDIA Corporation [10DE]
       physical id: 4
       bus info: pci@0000:00:04.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm cap_list
       configuration: latency=0
       resources: iomemory:40-3f iomemory:80-7f memory:fc000000-fcffffff memory:400000000-7ffffffff memory:800000000-801ffffff ioport:c000(size=128)
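
(The UNCLAIMED flag above means no kernel driver is bound to the card; lspci can show the same thing more directly. Just a complementary check:)

# Show the NVIDIA device and which kernel driver, if any, is in use
lspci -nnk -d 10de: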

Installation process I followed

Creation of the CentOS 7 instance, following this section of the Google Cloud docs:

gcloud compute instances create test-gpu-drivers \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family centos-7 --image-project centos-cloud \
    --maintenance-policy TERMINATE
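
(To double-check that the accelerator really is attached to the instance, a query like the following can be used; the --format expression is just one way to pull out the relevant field:)

gcloud compute instances describe test-gpu-drivers \
    --format="value(guestAccelerators)"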

The installation process I then followed for the drivers and CUDA is based on the Google Cloud documentation, but uses the latest versions instead:

gcloud compute ssh test-gpu-drivers
sudo su
yum -y update

# Reboot so the kernel update takes effect
reboot

gcloud compute ssh test-gpu-drivers
sudo su

# Install the NVIDIA driver repository, found here: https://www.nvidia.com/Download/index.aspx?lang=en-us
curl -J -O http://us.download.nvidia.com/tesla/410.72/nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm
yum -y install ./nvidia-diag-driver-local-repo-rhel7-410.72-1.0-1.x86_64.rpm

# Install the CUDA repository, found here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=rpmlocal
curl -J -O https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
yum -y install ./cuda-repo-rhel7-10.0.130-1.x86_64.rpm

# Install CUDA & drivers & dependencies
yum clean all
yum -y install cuda

nvidia-smi -pm 1

reboot

gcloud compute ssh test-gpu-drivers
sudo su
nvidia-smi -pm 1

Full logs here.

(I also tried the exact GCE driver install script, without upgrading versions, but with no luck either.)
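
For reference, a few checks that can help show whether the kernel module was ever built and loaded after this install (a sketch; dkms is only relevant if the driver package actually goes through DKMS):

# Was a module built for the running kernel?
dkms status
find /lib/modules/$(uname -r) -name 'nvidia*.ko*'

# Is anything loaded, and what does the kernel log say about it?
lsmod | grep -Ei 'nvidia|nouveau'
dmesg | grep -iE 'nvidia|nouveau' | tail -n 20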

Environment

  • Distribution release

    [root@test-gpu-drivers myuser]# cat /etc/*-release | head -n 1
    CentOS Linux release 7.6.1810 (Core) 
    
  • Kernel release

    [root@test-gpu-drivers myuser]# uname -r
    3.10.0-957.1.3.el7.x86_64
    

I can make it work on Ubuntu!

To analyze the problem, I tried doing the same thing on Ubuntu 18.04 LTS. This time, I had no problem.

Instance creation:

gcloud compute instances create gpu-ubuntu-1804 \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE

Install process:

gcloud compute ssh gpu-ubuntu-1804
sudo su
apt update
apt -y upgrade
reboot

gcloud compute ssh gpu-ubuntu-1804
sudo su
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt -y install ./cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
rm cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt-get update
apt-get -y install cuda
nvidia-smi -pm 1

Full installation logs available here.

Test:

# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:00:04.0.
All done.
# ll /dev/nvidia*
crw-rw-rw- 1 root root 241,   0 Dec  4 14:01 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Dec  4 14:01 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  4 14:01 /dev/nvidiactl
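
(To additionally confirm that the CUDA toolkit itself is usable, and not just the driver, something like the following works; /usr/local/cuda is the default symlink created by the cuda packages:)

# Driver side: list the GPUs the driver can see
nvidia-smi -L

# Toolkit side: the compiler installed by the cuda package
/usr/local/cuda/bin/nvcc --version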

One thing I noticed is that on Ubuntu, installing the nvidia-dkms package triggers some extra steps (Nouveau blacklisting, initramfs regeneration, building the kernel modules through DKMS) that I did not see on CentOS:

Setting up nvidia-dkms-410 (410.79-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

Loading new nvidia-410.79 DKMS files...
Building for 4.15.0-1025-gcp
Building for architecture x86_64
Building initial module for 4.15.0-1025-gcp
Generating a 2048 bit RSA private key
.............................................................................................................+++
..........+++
writing new private key to '/var/lib/shim-signed/mok/MOK.priv'
-----
EFI variables are not supported on this system
/sys/firmware/efi/efivars not found, aborting.
Done.

nvidia:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/4.15.0-1025-gcp/updates/dkms/

depmod...

DKMS: install completed.
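
For comparison, the manual equivalent of the Nouveau blacklisting that nvidia-dkms performs above would presumably look like this on CentOS (a sketch only, using the standard modprobe.d/dracut tooling; the file name is my own choice):

# Blacklist Nouveau so it cannot claim the GPU at boot
cat > /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist applies early in boot, then reboot
dracut --force
reboot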

Environment

  • Distribution release

    root@gpu-ubuntu-1804:/home/elouan_keryell-even# cat /etc/*-release
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=18.04
    DISTRIB_CODENAME=bionic
    DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
    NAME="Ubuntu"
    VERSION="18.04.1 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.1 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
    
  • Kernel release

    root@gpu-ubuntu-1804:/home/elouan_keryell-even# uname -r
    4.15.0-1025-gcp
    

Question

Does anyone understand what goes wrong with my installation of the NVIDIA drivers on CentOS 7?

Elouan Keryell-Even

2 Answers


There are two issues:

  • By default, CentOS uses the open-source Nouveau driver, which conflicts with the NVIDIA driver and has to be disabled.
  • The drivers from the NVIDIA repository do not seem to work; the NVIDIA DKMS module is needed.

To fix this (a combined sketch of these steps is shown after the list):

  1. Install the required packages (epel-release first, since dkms comes from EPEL):

    yum -y install epel-release
    yum -y install kernel-devel dkms
    
  2. Edit /etc/default/grub and add nouveau.modeset=0 to GRUB_CMDLINE_LINUX.

  3. Regenerate the GRUB config to apply the change:

    grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
    
  4. Reboot for changes to take effect.

  5. Then download and install this driver directly: http://fr.download.nvidia.com/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run

After that, nvidia-smi should be working.
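
Put together, the whole procedure looks roughly like this (a sketch under the assumptions above; the sed one-liner is just one way to edit the GRUB command line, and --silent/--dkms are the non-interactive options of the NVIDIA .run installer):

# 1. Packages needed to build the kernel module (kernel-devel must match the running kernel)
yum -y install epel-release
yum -y install dkms kernel-devel-$(uname -r)

# 2. Disable Nouveau on the kernel command line and regenerate the GRUB config
sed -i 's/\(GRUB_CMDLINE_LINUX=".*\)"/\1 nouveau.modeset=0"/' /etc/default/grub
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
reboot

# 3. After the reboot, install the driver with NVIDIA's .run installer
curl -O http://fr.download.nvidia.com/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run
sh NVIDIA-Linux-x86_64-410.79.run --silent --dkms
nvidia-smi -pm 1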

btapia

This issue has been reported to Google and is being worked on here.

Adebisi