
I am trying to add the GPU Nvidia module in ganglia (/ganglia/gmond_python_modules/gpu/nvidia/).

Do we need to apply the ganglia_web.patch patch?

If I do not apply the patch, I don't see any GPU metrics when I go to http://localhost/ganglia/

If I try to apply the patch, I have the following issue:

ubuntu@server:/usr/share/ganglia-webfrontend$ sudo patch -p0 < /home/ubuntu/gmond_python_modules/gpu/nvidia/ganglia_web.patch
sudo: unable to resolve host server
patching file host_view.php
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 37.
Hunk #3 FAILED at 144.
Hunk #4 FAILED at 153.
Hunk #5 FAILED at 169.
5 out of 5 hunks FAILED -- saving rejects to file host_view.php.rej
patching file templates/default/host_view.tpl
Hunk #1 FAILED at 80.
Hunk #2 FAILED at 89.
2 out of 2 hunks FAILED -- saving rejects to file templates/default/host_view.tpl.rej
ubuntu@server:/usr/share/ganglia-webfrontend$ cd /usr/share/ganglia-webfrontend 

The readme does not mention what to do with the patch file.

The web interface does contain the GPU metric, but all images return 404 (screenshot: http://i.stack.imgur.com/rJaxU.png).

When I go to Grid > [name] > [gpu node], I don't see any GPU option (screenshot: http://i.stack.imgur.com/80Tnq.png).

On the Ganglia server (i.e., on the server where gmetad is running), I ran:

git clone https://github.com/ganglia/gmond_python_modules.git
sudo cp gmond_python_modules/gpu/nvidia/graph.d/*   /usr/share/ganglia-webfrontend/graph.d/
sudo /etc/init.d/gmetad restart

On the Ganglia client (i.e., on the server where gmond is running, and where the GPU is located), I ran:

git clone https://github.com/ganglia/gmond_python_modules.git
sudo pip install nvidia-ml-py
sudo cp gmond_python_modules/gpu/nvidia/python_modules/nvidia.py /usr/lib/ganglia/nvidia.py
sudo cp gmond_python_modules/gpu/nvidia/conf.d/nvidia.pyconf /etc/ganglia/conf.d
sudo /etc/init.d/ganglia-monitor restart

I use:

  • Ganglia Web Frontend version 3.6.1
  • Ganglia Web Backend (gmetad) version 3.6.0
  • RRDtool version 1.4.7.
  • Ubuntu 14.04.3 LTS x64 server
Franck Dernoncourt

2 Answers


I ran into this myself, strangely enough just yesterday, and asked a developer of the module. He said it should "just work", so after playing with it a bit I found the following to work:

On web host:

  1. cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
  2. cp graph.d/* /usr/share/ganglia-webfrontend/graph.d/
  3. Restart gmond

On GPU Node (Note, this is RHEL/SL/Cent package names and locations):

  1. Install pip (python-pip)
  2. Run: pip install nvidia-ml-py
  3. From source:

    cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
    cp python_modules/nvidia.py /usr/lib64/ganglia/
    
  4. Restart gmond

No need to patch the web tree now. So, on the Web interface go to:

Grid > [name] > [gpu node]

There should now be a "gpu metrics" section in the listing; it may be collapsed, so expand it to look. If it isn't there, go to the Grid > [Name] page and, in the Metric drop-down at the bottom, select one of the gpu_* metrics. That may kick something into place. I had to do that to get one of the nodes to display the "gpu metrics" section, but not for another.
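If the section still does not show up, it can help to check whether gmond on the GPU node is emitting any gpu_* metrics at all, independently of the web frontend. The following is a minimal sketch, assuming gmond listens on its default TCP port 8649 and dumps its XML state on connect; the function names are made up for illustration, and the host name is an assumption to adjust for your setup.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump that gmond sends on its TCP accept channel."""
    chunks = []
    sock = socket.create_connection((host, port), timeout)
    try:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def gpu_metrics(xml_bytes, prefix="gpu"):
    """Map each host name to the sorted metric names starting with prefix."""
    root = ET.fromstring(xml_bytes)
    found = {}
    for host in root.iter("HOST"):
        names = [m.get("NAME", "") for m in host.iter("METRIC")]
        found[host.get("NAME")] = sorted(n for n in names if n.startswith(prefix))
    return found


if __name__ == "__main__":
    try:
        report = gpu_metrics(read_gmond_xml())
    except (socket.error, ET.ParseError) as err:
        print("Could not read metrics from gmond: %s" % err)
    else:
        for host_name, names in sorted(report.items()):
            print("%s: %s" % (host_name, ", ".join(names) or "no gpu metrics"))
```

If a GPU node shows "no gpu metrics" here, the problem is likely on the gmond side (module not loaded or failing) rather than in the web tree.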

YMMV.

-J

  • Thanks, don't you have to do `sudo cp gmond_python_modules/gpu/nvidia/graph.d/* /usr/share/ganglia-webfrontend/graph.d/` on web host? – Franck Dernoncourt Apr 22 '16 at 20:26
  • Doh! Yes... I did partial yesterday and more today and got stuff crossed. I'll edit the above to add it in. Thanks. – Jason Stover Apr 22 '16 at 20:31
  • Thanks, I did everything you mentioned. The web interface does contain the GPU metric, but all images are 404: http://i.stack.imgur.com/rJaxU.png Any idea what could be missing? – Franck Dernoncourt Apr 22 '16 at 20:42
  • @FranckDernoncourt I had 404 images as well when looking at the GPU metrics from the footer area. To get the "real" ones, I needed to go into a node listing. So, from the header link thing: Grid > [Grid Name] > [Node] -- From the node listing was a 'gpu metric' section. – Jason Stover Apr 23 '16 at 22:01
  • @FranckDernoncourt Ahh... also, the guy I talked with mentioned nvidia-smi and spawning it. So maybe for some reason they weren't actually using the nvidia-ml-py module to access statistics directly, or there's some other interaction I don't know about. I'm not a Python programmer, and I didn't look through the code to see exactly what it was doing. – Jason Stover Apr 23 '16 at 22:05
  • Thanks. When I go to a `Grid > [name] > [gpu node]`, I don't see any GPU option: http://i.stack.imgur.com/80Tnq.png – Franck Dernoncourt Apr 23 '16 at 22:22
  • Hrmm.... do you get output on a GPU node if you run: `python /usr/lib/python2.7/site-packages/nvidia_smi.py` --- Replacing for location where the nvidia-ml-py package was installed. – Jason Stover Apr 25 '16 at 16:11
  • I get: `username@server:~$ python /usr/local/lib/python2.7/dist-packages nvidia_smi.py /usr/bin/python: can't find '__main__' module in '/usr/local/lib/python2.7/dist-packages'`. Is it supposed to contain a `__main__`? – Franck Dernoncourt Apr 25 '16 at 16:24
  • Hrmm... if I remember right, `__main__` is what's defined when the file is run directly. That file should have an `if __name__ == "__main__":` at the end of it. The output of `python -V` is 2.7.5 on the system I have it running on. – Jason Stover Apr 25 '16 at 20:13
  • There is no `__main__`, but if I open a Python interpreter, and do `from pynvml import *; nvmlInit(); print "Driver Version:", nvmlSystemGetDriverVersion() ` it does manage to display the correct Nvidia driver version ( which in my case is `352.63`) – Franck Dernoncourt Apr 30 '16 at 21:45
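The interpreter check from the last comment can be turned into a small script that lists what NVML can see on the GPU node. This is a minimal sketch, assuming the `pynvml` module from nvidia-ml-py is installed; the helper name `list_gpus` is made up for illustration, and the script says nothing about how nvidia.py itself collects its metrics.

```python
def list_gpus(nvml):
    """Return (index, name, temperature in C) for each GPU visible via NVML.

    nvml is the imported pynvml module (passed in so it can be swapped out).
    """
    nvml.nvmlInit()
    try:
        gpus = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            name = nvml.nvmlDeviceGetName(handle)
            temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
            gpus.append((index, name, temp))
        return gpus
    finally:
        # Always release NVML, even if a query above failed.
        nvml.nvmlShutdown()


if __name__ == "__main__":
    try:
        import pynvml
        found = list_gpus(pynvml)
    except ImportError:
        print("pynvml is not importable; is nvidia-ml-py installed?")
    except Exception as err:  # e.g. an NVML error when no driver is loaded
        print("NVML query failed: %s" % err)
    else:
        for index, name, temp in found:
            print("GPU %d: %s, %d C" % (index, name, temp))
```

If this prints the GPUs but Ganglia still shows nothing, NVML access works for your user, so the remaining suspects are the gmond module loading and configuration.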

On Ubuntu Xenial (16.04), I found I also needed to add a modpython.conf to tell Ganglia's modpython.so to load the nvidia.py module:

  1. Install pip (python-pip)
  2. Run: sudo pip install nvidia-ml-py
  3. From source:

    cp conf.d/nvidia.pyconf /etc/ganglia/conf.d/
    cp python_modules/nvidia.py /usr/lib/ganglia/python_modules/
    
  4. If you don't have /etc/ganglia/conf.d/modpython.conf, create it:

    cat <<EOF | sudo tee /etc/ganglia/conf.d/modpython.conf
    modules {
      module {
        name = "python_module"
        path = "/usr/lib/ganglia/modpython.so"
        params = "/usr/lib/ganglia/python_modules/"
      }
    }
    include ('/etc/ganglia/conf.d/*.pyconf')
    EOF
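After these steps, a quick sanity check is to confirm that the files gmond needs are where the configuration says they are. This is a minimal sketch; the paths are the ones used in this answer (Ubuntu Xenial) and are assumptions for any other distribution, and `missing_files` is a made-up helper name.

```python
import os

# Paths as used in the steps above; adjust for other distributions.
EXPECTED_FILES = [
    "/usr/lib/ganglia/modpython.so",
    "/usr/lib/ganglia/python_modules/nvidia.py",
    "/etc/ganglia/conf.d/modpython.conf",
    "/etc/ganglia/conf.d/nvidia.pyconf",
]


def missing_files(paths):
    """Return the subset of paths that are not regular files."""
    return [path for path in paths if not os.path.isfile(path)]


if __name__ == "__main__":
    missing = missing_files(EXPECTED_FILES)
    if missing:
        for path in missing:
            print("missing: %s" % path)
    else:
        print("all expected files are present")
```

Remember to restart gmond after fixing anything this reports.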
    
szym