In Solaris, how to monitor and auto-respond to critical events

Question

I have a website that fails at random. It runs on OpenSolaris, hosted on Joyent.

I have a monitoring service that alerts me when the site is down, but I want an "insider" tool on the server that tells me why it happened.

Is it because CPU usage is too high? Not enough memory? Which process failed? Is it possible to get a backtrace?

Everything runs under the Solaris Service Management Facility (SMF). The web server is Cherokee, the database is MySQL, and the application is Python/Django.

I want the simplest setup that can monitor this and auto-respond, i.e. restart the web server or the Django process when it fails.

I prefer a low-overhead tool. I don't need the fancy monitoring that some tools offer: no graphs, no SMS alerts. I only need to know what failed, restart it if possible (maybe up to n times), and have a log somewhere that I can check later.

If everything is running under SMF, as you wrote, you already have the logging, monitoring and restart facilities. Or am I missing something? — jlliagre, Jan 17 '11 at 00:40
Well, is there any way to see that info? I have no expertise in Solaris administration... — mamcx, Jan 17 '11 at 17:00

score 1 · Answer 1 · answered Mar 02 '12 at 06:56

All of your needs can be met by the logs in /var/svc/log.

Those are the logs for everything SMF is doing to your system, behind the scenes.

Extracting the 'interesting' data is left as an exercise for the reader.
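As a starting point for that exercise, here is a minimal Python sketch that pulls likely failure entries out of an SMF log. It assumes the restarter writes its own entries in the bracketed form `[ Mon DD HH:MM:SS message ]` and that anything unbracketed is output from the service itself; the phrase list and the function names `interesting_entries` and `scan` are illustrative choices, not part of SMF, so adjust them to what your logs actually contain.

```python
import re

# Assumed shape of a restarter entry: "[ Jan 17 03:12:45 some message ]".
ENTRY = re.compile(r"^\[ (\w{3} [ \d]\d \d\d:\d\d:\d\d) (.*) \]$")

# Phrases assumed to indicate trouble (illustrative, not exhaustive).
PATTERNS = ("Stopping because", "failed", "core dumped", "timed out")

# A method that exits with a non-zero status is treated as a failure.
EXIT_STATUS = re.compile(r"exited with status (\d+)")


def interesting_entries(lines):
    """Yield (timestamp, message) for entries that look like failures."""
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if not match:
            continue  # not a restarter entry, e.g. the service's own output
        stamp, message = match.groups()
        status = EXIT_STATUS.search(message)
        if status and status.group(1) != "0":
            yield stamp, message
        elif any(phrase in message for phrase in PATTERNS):
            yield stamp, message


def scan(path):
    """Return the failure-like entries found in one log file."""
    with open(path, errors="replace") as handle:
        return list(interesting_entries(handle))
```

Calling `scan()` on one of the files under /var/svc/log returns the matching entries with their timestamps. For a quick look without any script, `svcs -x` explains why a service is not running and prints the path of its log file.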

Elijah Wright

score 1 · Accepted Answer · answered Mar 02 '12 at 06:59

You might also choose to add external monitoring with NodeFly, New Relic, PagerDuty, or Pingdom, or run your own with Nagios, Munin, or Zabbix.

You have a lot of choices available.

Elijah Wright

score 0 · Answer 3 · answered Mar 14 '12 at 18:49

Look into collectd. I've gotten it to compile on illumos/SmartOS. Also:

https://github.com/gflarity/nervous and https://github.com/gflarity/response

gflarity