
I've seen this question on the mailing list a few times but haven't had a satisfactory answer.

How best to monitor that the pipeline isn't stuck? Clients -> logstash -> elasticsearch.

Logstash and especially elasticsearch are prone to resource starvation. They are both fantastic at picking up where they left off but how, exactly, are people watching their watchers?

Opinions welcome.

Dan Garthwaite
  • Maybe this will help: [How to check Logstash's pulse](https://www.elastic.co/blog/how-to-check-logstashs-pulse) – jBee Jul 08 '16 at 08:00

4 Answers


Personally, I check that redis is still dequeuing on the central logging host, which is upstream of LS+ES.

i.e. check that `redis-cli llen logstash` is less than some fixed number.

This won't tell you whether logs are arriving in redis at all, though that could be checked too.

Something like checking that `redis-cli info | grep total_commands_processed` keeps increasing, maybe?
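
A minimal Nagios-style sketch of the llen check might look like this; the list name `logstash` and the threshold are assumptions to adjust for your setup:

```bash
#!/bin/bash
# Alert when the redis queue feeding Logstash grows past a threshold,
# which suggests the downstream consumers (LS+ES) have stalled.
# The list name "logstash" and THRESHOLD are assumptions; adjust to taste.
THRESHOLD=10000

LEN=$(redis-cli llen logstash)
if [ "$LEN" -gt "$THRESHOLD" ]; then
    echo "CRITICAL: redis list 'logstash' holds $LEN entries (> $THRESHOLD)"
    exit 2
fi
echo "OK: redis list 'logstash' holds $LEN entries"
exit 0
```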

Sirex
  • Wouldn't that continue to increase as more logs roll in? We would need the total number of LPOPs. Or maybe warn when LLEN gets too large? – Dan Garthwaite Aug 07 '14 at 14:00
  • Yeah, I worded it badly; I check that llen is less than some number and alert if it isn't. – Sirex Aug 07 '14 at 20:01
  • Wouldn't `total_commands_processed` always increment, if not from logstash polling it then from the `info` command itself? – Dan Garthwaite Dec 18 '14 at 15:53

I use Zabbix in my environment, but I suppose this method could work in other setups as well. I have configured the following command that Zabbix is allowed to use:

```
UserParameter=elasticsearch.commits,/usr/bin/curl -s 'localhost:9200/_cat/count?v' | /bin/sed -n '2p' | /bin/awk '{print $3}'
```

This returns the total number of records committed to Elasticsearch. I take this value and divide it by the number of seconds since the last sample (I check every minute); if that rate drops below an arbitrary limit, I can alert on it. I also use Zabbix to check whether the Logstash PID has died and alert on that too. Finally, I run the following command:

```
UserParameter=elasticsearch.health,/usr/bin/curl -s 'http://localhost:9200/_cluster/health?pretty=true' | /bin/sed -n '3p' | /bin/awk -F'\"' '{print $4}' | /bin/sed s/yellow/0/ | /bin/sed s/green/0/ | /bin/sed s/red/1/
```

This will return 1 if cluster health has gone red (yellow and green are okay), which I can also alert on.
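
For reference, here is a rough sketch of the same rate calculation done as a standalone script rather than inside Zabbix; the state-file path and `MIN_RATE` are assumptions to tune for your volume:

```bash
#!/bin/bash
# Compare the current total document count against the previous run's
# value and alert when the indexing rate falls below a baseline.
# STATE path and MIN_RATE are assumptions; tune them for your volume.
STATE=/var/tmp/es_commits.last
MIN_RATE=1   # documents per second

NOW=$(date +%s)
COUNT=$(curl -s 'localhost:9200/_cat/count?v' | sed -n '2p' | awk '{print $3}')

if [ -f "$STATE" ]; then
    read -r LAST_TIME LAST_COUNT < "$STATE"
    ELAPSED=$(( NOW - LAST_TIME ))
    [ "$ELAPSED" -gt 0 ] || ELAPSED=1   # guard against division by zero
    RATE=$(( (COUNT - LAST_COUNT) / ELAPSED ))
    if [ "$RATE" -lt "$MIN_RATE" ]; then
        echo "CRITICAL: only $RATE docs/s indexed since last check"
        echo "$NOW $COUNT" > "$STATE"
        exit 2
    fi
fi
echo "$NOW $COUNT" > "$STATE"
echo "OK"
exit 0
```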

Rumbles

Check to see that the logs per second at your final endpoint (e.g. elasticsearch) are above some baseline.

That is, do an end-to-end check: if your end result is working correctly, you know that all the steps in the pipeline are working correctly.

If you frequently have problems or need better introspection, start instrumenting each piece of the pipeline, such as redis, as suggested above.
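
One way to sketch such an end-to-end check is to count documents stamped within the last minute; the `logstash-*` index pattern, the `@timestamp` field, and the baseline are assumptions based on common Logstash defaults:

```bash
#!/bin/bash
# End-to-end check: count documents indexed in the last minute and
# alert if the count falls below a baseline. The index pattern,
# field name, and BASELINE are assumptions; adapt to your setup.
BASELINE=10   # assumed minimum docs per minute

RECENT=$(curl -s -H 'Content-Type: application/json' \
  'localhost:9200/logstash-*/_count' \
  -d '{"query":{"range":{"@timestamp":{"gte":"now-1m"}}}}' \
  | grep -o '"count":[0-9]*' | cut -d: -f2)

if [ "${RECENT:-0}" -lt "$BASELINE" ]; then
    echo "CRITICAL: only ${RECENT:-0} documents indexed in the last minute"
    exit 2
fi
echo "OK: $RECENT documents indexed in the last minute"
exit 0
```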

dmourati

We use several approaches:

  1. Monit, to listen on the Elasticsearch and Logstash ports and restart them if they go down.
  2. For cases when something bad has happened and everything looks fine from Monit's perspective, but logs are not being consumed/stored, there is a simple script that checks the active index every hour and alerts if the document count hasn't changed in the last hour (see the sketch below).
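
A rough sketch of such an hourly check, assuming the default daily `logstash-YYYY.MM.dd` index naming and a hypothetical state-file location:

```bash
#!/bin/bash
# Hourly check: alert if the active index's document count hasn't
# changed since the previous run. The index naming scheme and the
# state-file path are assumptions; adapt to your setup.
INDEX="logstash-$(date +%Y.%m.%d)"   # assumed daily index naming
STATE="/var/tmp/${INDEX}.count"

COUNT=$(curl -s "localhost:9200/${INDEX}/_count" \
  | grep -o '"count":[0-9]*' | cut -d: -f2)

if [ -f "$STATE" ] && [ "$COUNT" = "$(cat "$STATE")" ]; then
    echo "CRITICAL: document count in $INDEX unchanged for an hour ($COUNT)"
    exit 2
fi
echo "$COUNT" > "$STATE"
echo "OK: $INDEX now has $COUNT documents"
exit 0
```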
GregL