14

GlusterFS, while being a nice distributed filesystem, provides almost no way to monitor its integrity. Servers can come and go, bricks might go stale or fail, and I'm afraid I will only find out about it when it is probably too late.

Recently we had a strange failure where everything appeared to be working, but one brick fell out of the volume (we found it by pure coincidence).

Is there a simple and reliable way (a cron script?) to keep me informed about the health status of my GlusterFS 3.2 volume?

Arie Skliarouk
  • For now we use a dirty shell-script-based monitoring: [check_gluster.sh](http://t11.mine.nu/check_gluster.sh) – Arie Skliarouk Aug 01 '11 at 16:56
  • Have a look at [glfs-health.sh](http://www.sirgroane.net/category/gluster/). – quanta Aug 01 '11 at 17:12
  • I checked glfs-health.sh and it looks like it is for old versions of GlusterFS, which were configuration-file controlled. I will clarify my question to specify GlusterFS 3.2. – Arie Skliarouk Aug 03 '11 at 13:34

5 Answers

3

This has been a request to the GlusterFS developers for a while now, and there is no out-of-the-box solution you can use. However, with a few scripts it's not impossible.

Pretty much the entire Gluster system is managed by a single `gluster` command, and with a few options you can write your own health-monitoring scripts. See here for listing info on bricks and volumes -- http://gluster.org/community/documentation/index.php/Gluster_3.2:_Displaying_Volume_Information
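
As an illustration, a minimal cron-able sketch might parse `gluster peer status` output for disconnected peers. The hostnames and UUIDs below are made-up stand-ins for the live CLI output, so the parsing can be shown without a running cluster:

```shell
#!/bin/sh
# Stand-in for the output of `gluster peer status` (hosts/UUIDs are hypothetical).
sample_output='Number of Peers: 2

Hostname: gluster2
Uuid: 5e987bda-16dd-43c2-835b-08b7d55e94e5
State: Peer in Cluster (Connected)

Hostname: gluster3
Uuid: 43e0caba-d441-4553-8b44-1cfa2af9fc75
State: Peer in Cluster (Disconnected)'

# In a real check you would pipe the live command output instead of $sample_output.
bad=$(printf '%s\n' "$sample_output" | grep -c 'Disconnected')
echo "disconnected peers: $bad"   # prints: disconnected peers: 1
```

From cron you would substitute the live command and mail yourself whenever the count is non-zero.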

To monitor performance, look at this link -- http://gluster.org/community/documentation/index.php/Gluster_3.2:_Monitoring_your_GlusterFS_Workload

UPDATE: Do consider upgrading to http://gluster.org/community/documentation/index.php/About_GlusterFS_3.3

You are always better off on the latest release, since newer releases tend to have more bug fixes and are better supported. Of course, run your own tests before moving to a newer release -- http://vbellur.wordpress.com/2012/05/31/upgrading-to-glusterfs-3-3/ :)

There is an admin guide with a specific section (Chapter 10) on monitoring your GlusterFS 3.3 installation -- http://www.gluster.org/wp-content/uploads/2012/05/Gluster_File_System-3.3.0-Administration_Guide-en-US.pdf
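
For a quick manual health check on 3.3, the commands that chapter documents can be run directly (the volume name is a placeholder):

```
gluster volume status myvol detail   # per-brick online status, disk and inode usage
gluster volume heal myvol info       # pending self-heal entries per brick
```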

See here for another nagios script -- http://code.google.com/p/glusterfs-status/

Chida
  • Thanks Chida, I guess what's got me hung up is that some folks (https://github.com/semiosis/puppet-gluster) are monitoring gluster via the proc table ('--with-brick', etc) and logfiles (egrep ' E ' for error), and some people are using the CLI and I have no idea which is more likely to accurately report gluster's state. – r_2 Aug 14 '12 at 16:44
  • I'd recommend using the CLI since that's the one GlusterFS recommends and is bound to be up-to-date. – Chida Aug 14 '12 at 16:51
2

Please check the attached script at https://www.gluster.org/pipermail/gluster-users/2012-June/010709.html for gluster 3.3; it's probably easily adaptable to gluster 3.2.

#!/bin/bash

# This Nagios script was written against version 3.3 of Gluster.  Older
# versions will most likely not work at all with this monitoring script.
#
# Gluster currently requires elevated permissions to do anything.  In order to
# accommodate this, you need to allow your Nagios user some additional
# permissions via sudo.  The line you want to add will look something like the
# following in /etc/sudoers (or something equivalent):
#
# Defaults:nagios !requiretty
# nagios ALL=(root) NOPASSWD:/usr/sbin/gluster peer status,/usr/sbin/gluster volume list,/usr/sbin/gluster volume heal [[\:graph\:]]* info
#
# That should give us all the access we need to check the status of any
# currently defined peers and volumes.

# define some variables
ME=$(basename -- "$0")
SUDO="/usr/bin/sudo"
PIDOF="/sbin/pidof"
GLUSTER="/usr/sbin/gluster"
PEERSTATUS="peer status"
VOLLIST="volume list"
VOLHEAL1="volume heal"
VOLHEAL2="info"
peererror=
volerror=

# check for commands
for cmd in $SUDO $PIDOF $GLUSTER; do
    if [ ! -x "$cmd" ]; then
        echo "$ME UNKNOWN - $cmd not found"
        exit 3
    fi
done

# check for glusterd (management daemon)
if ! $PIDOF glusterd &>/dev/null; then
    echo "$ME CRITICAL - glusterd management daemon not running"
    exit 2
fi

# check for glusterfsd (brick daemon)
if ! $PIDOF glusterfsd &>/dev/null; then
    echo "$ME CRITICAL - glusterfsd brick daemon not running"
    exit 2
fi

# get peer status
peerstatus="peers: "
for peer in $($SUDO $GLUSTER $PEERSTATUS | grep '^Hostname: ' | awk '{print $2}'); do
    state=
    state=$($SUDO $GLUSTER $PEERSTATUS | grep -A 2 "^Hostname: $peer$" | grep '^State: ' | sed -nre 's/.* \(([[:graph:]]+)\)$/\1/p')
    if [ "$state" != "Connected" ]; then
        peererror=1
    fi
    peerstatus+="$peer/$state "
done

# get volume status
volstatus="volumes: "
for vol in $($SUDO $GLUSTER $VOLLIST); do
    thisvolerror=0
    entries=
    for entries in $($SUDO $GLUSTER $VOLHEAL1 $vol $VOLHEAL2 | grep '^Number of entries: ' | awk '{print $4}'); do
        if [ "$entries" -gt 0 ]; then
            volerror=1
            thisvolerror=$((thisvolerror + entries))
        fi
    done
    volstatus+="$vol/$thisvolerror unsynchronized entries "
done

# drop extra space
peerstatus=${peerstatus:0:${#peerstatus}-1}
volstatus=${volstatus:0:${#volstatus}-1}

# set status according to whether any errors occurred
if [ "$peererror" ] || [ "$volerror" ]; then
    status="CRITICAL"
else
    status="OK"
fi

# actual Nagios output
echo "$ME $status $peerstatus $volstatus"

# exit with appropriate value
if [ "$peererror" ] || [ "$volerror" ]; then
    exit 2
else
    exit 0
fi
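
To hook this into Nagios over NRPE, a command definition along these lines would work (the script path is hypothetical; place it wherever your plugins live):

```
# /etc/nagios/nrpe.cfg on each Gluster node
command[check_gluster]=/usr/local/lib/nagios/plugins/check_gluster.sh
```

Then define a corresponding `check_nrpe` service on the Nagios server.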
S19N
2

There is a Nagios plugin available for monitoring. You may have to edit it for your version, though.

HopelessN00b
chandank
1

@Arie Skliarouk, your check_gluster.sh has a typo—on the last line, you grep for exitst instead of exist. I went ahead and rewrote it to be a bit more compact, and to remove the requirement for a temporary file.

#!/bin/bash

# Ensure that all peers are connected
gluster peer status | grep -q Disconnected && echo "Peer disconnected." && exit 1

# Ensure that all bricks have a running log file (i.e., are sending/receiving).
# The grep runs inside the loop so that $vol and $brick still name the failing
# pair when we report it; a pipeline after "done" would run the loop in a
# subshell and leave both variables empty in the error message.
for vol in $(gluster volume list); do
  for brick in $(gluster volume info "$vol" | awk '/^Brick[0-9]*:/ {print $2}'); do
    if gluster volume log locate "$vol" "$brick" |
        grep -qE "does not (exist|exitst)"; then
      echo "Log file missing - $vol/$brick."
      exit 1
    fi
  done
done
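
Since the question asks for a cron script: either version can be driven straight from cron, mailing on a non-zero exit status (the schedule, path, and recipient below are placeholders):

```
# /etc/cron.d/check-gluster -- run every 5 minutes, mail on failure
*/5 * * * * root /usr/local/bin/check_gluster.sh || echo "gluster check failed on $(hostname)" | mail -s "gluster alert" root
```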
dannyw
BMDan
  • The "exitst" typo is what is written in the logs. I don't buy the "compact" advantage - the script is much harder to understand when lines are overloaded. A temporary file is a cheap price to pay for easy-to-understand code. – Arie Skliarouk Feb 20 '13 at 16:13
  • @ArieSkliarouk: Updated to cover both cases, but be forewarned that the relevant message was removed in November 2011; see http://git.gluster.org/?p=glusterfs.git;a=commitdiff;h=a3c49bb260263dce98d44c28e7106da2a2173ed9 . Thus, this is likely not going to work on newer Glusters. If you find the shorter code harder to understand, that's fine, but it is significantly more robust than using a temporary file, so consider refactoring it for readability instead of dismissing it for perceived lack of that attribute. – BMDan Mar 18 '13 at 15:54
  • An anonymous editor noted that `gluster volume info | awk ...` can be abbreviated to `gluster volume list`. – Lekensteyn Dec 05 '16 at 10:06
1

I was able to configure Nagios monitoring for GlusterFS as described here:

http://gopukrish.wordpress.com/2014/11/16/monitor-glusterfs-using-nagios-plugin/

user173141
  • Because links go dead over time, we would prefer it if you could include the essence of the answer here on Server Fault. – Ladadadada Nov 19 '14 at 13:12