2

I've installed Sun Grid Engine on 10 nodes plus one virtual master host.

Now I have to monitor all of its resources before launching it into production, but I don't know the best way to do that. I've tried xml-qstat, but it seems unstable.

Any tips or suggestions?

Does anyone have experience with this?

Thanks.

John Gardeniers
Marc Riera

4 Answers

4

You could use Ganglia. We use Ganglia with thousands of nodes at the Holland Computing Center, and for the most part it works fairly well, especially if you are looking for historical graphs. We use Nagios for active monitoring.

ryanlim
1

If I understand you correctly, you need to monitor a bunch of grid servers. What kind of monitoring do you have in mind? Perhaps something like Nagios with some additional scripting could fit your needs?

There is an example over here.

solefald
  • I thought he meant monetize. – Ward - Reinstate Monica Apr 21 '10 at 21:46
  • I don't care much about the money. :) And Nagios is too generic, I think. I'm looking for something that fits with Sun Grid Engine, maybe Hadoop. It's a grid, which means I must monitor each CPU/process separately, not only the service and the performance. – Marc Riera Apr 21 '10 at 22:04
  • Well, Nagios is only as generic as you want it to be. The real value of Nagios is that it can be made to work with anything. So if you figure out how to extract the data you want from `xml-qstat`, you can feed it into Nagios. For graphing the data and keeping it for historical purposes you can use Cacti. Again, as with Nagios, you can feed anything into Cacti and get pretty pictures you can show to management. It just depends on how much effort you want to put into this project. – solefald Apr 21 '10 at 22:15
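
To make the "extract the data and feed it into Nagios" idea concrete, here is a minimal sketch of a Nagios-style check in Python. It rests on several assumptions that are not from this thread: that `qstat -f -xml` is available on the monitoring host, that its output contains `<Queue-List>` elements with `<name>`, `<slots_used>`, `<slots_total>` and an optional `<state>` child, and that a state containing `E` or `u` should be treated as critical. The function names are made up for illustration, so check the XML your SGE version actually emits before relying on it.

```python
import subprocess
import xml.etree.ElementTree as ET

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_queue_slots(xml_text):
    """Sum slot usage over every <Queue-List> element (assumed layout).

    Returns (used, total, bad), where bad is a list of (queue_name, state)
    for queue instances that report a non-empty state string.
    """
    root = ET.fromstring(xml_text)
    used = total = 0
    bad = []
    for queue in root.iter("Queue-List"):
        used += int(queue.findtext("slots_used", default="0"))
        total += int(queue.findtext("slots_total", default="0"))
        state = queue.findtext("state", default="")
        if state:
            bad.append((queue.findtext("name", default="?"), state))
    return used, total, bad


def evaluate(used, total, bad):
    """Map parsed values to a Nagios exit code and a one-line message."""
    # Assumption: 'E' (error) or 'u' (unknown) in a state string is critical;
    # any other non-empty state (alarm, disabled, suspended) is a warning.
    if any("E" in state or "u" in state for _, state in bad):
        code, label = CRITICAL, "CRITICAL"
    elif bad:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    detail = ", ".join(f"{name}={state}" for name, state in bad)
    detail = detail or "all queue instances healthy"
    # Text after '|' is Nagios performance data, which graphing tools can store.
    message = (f"SGE {label} - {used}/{total} slots used, {detail}"
               f" | slots_used={used};;;0;{total}")
    return code, message


def main():
    """Run qstat, print the status line, and return the exit code."""
    try:
        result = subprocess.run(["qstat", "-f", "-xml"], capture_output=True,
                                text=True, check=True)
        code, message = evaluate(*parse_queue_slots(result.stdout))
    except (OSError, subprocess.CalledProcessError, ET.ParseError) as exc:
        code, message = UNKNOWN, f"SGE UNKNOWN - {exc}"
    print(message)
    return code
```

To use it as an actual plugin, you would add `import sys` and end the script with `sys.exit(main())`, since Nagios reads the status from the process exit code. The performance data after the `|` is what a grapher such as Cacti or a Nagios graphing add-on could pick up for historical charts.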
1

Just for the record, Munin (http://munin-monitoring.org/) is also very nice.

markusN
0

It sounds like you're more interested in metrics than uptime or availability. Circonus (http://circonus.com/) is a good fit here. You can correlate virtually any metrics, which can be imported over the Resmon XML DTD.

obfuscurity