Across a cluster of more than 12 CentOS 5.8 servers, I deployed logstash using the native logstash shipper, which sends /var/log/*/*.log
back to a central logstash server.
We tried using rsyslogd as the shipper, but due to a bug in rsyslogd's imfile module, if the remote end didn't reply, the logs would pile up in memory.
We're currently using Redis as the transport mechanism, so logstash01 has Redis running locally, bound to the IP on the VLAN these logs travel over.
So the logstash shipper sends to Redis on logstash01, and logstash01 in turn sends to Elasticsearch, which runs as a separate process.
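For context, the pipeline is wired up roughly like this. This is a minimal sketch only; the Redis key name, the hosts, and the type are placeholders rather than our exact config:

# shipper (runs on each webserver)
input {
  file {
    type => "syslog"
    path => "/var/log/*/*.log"
  }
}
output {
  redis {
    host      => "logstash01"
    data_type => "list"
    key       => "logstash"
  }
}

# indexer (runs on logstash01)
input {
  redis {
    host      => "127.0.0.1"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    host => "localhost"
  }
}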
Here's what we're seeing. Elasticsearch has 141 blocked threads. Stracing the elasticsearch parent shows:
futex(0x7f4ccd1939d0, FUTEX_WAIT, 26374, NULL
Here's the jstack from elasticsearch
Here's the jstack from logstash
So... last night, some of the webservers (whose logs are tailed by logstash) went nuts, with load averages over 500.
On logstash01, there's this:
Dec 19 00:44:45 logstash01 kernel: [736965.925863] Killed process 23429 (redis-server) total-vm:5493112kB, anon-rss:4248840kB, file-rss:108kB
So the OOM-killer killed redis-server, which meant logs piled up in memory on the servers doing the shipping, which in turn somehow got Apache's knickers in a twist. (Frankly, I'm not sure how; I just assume it's because logstash is tailing Apache's logs.)
This is my theory of how events unfolded:
- We had a traffic spike.
- An immense volume of logs was generated.
- These piled up in Redis, as logstash/elasticsearch only seems to be able to handle 300-400 new events per second (see the backlog check after this list for how I'd verify that).
- Redis filled up entirely, to the point where the OOM-killer slaughtered it senselessly.
- Redis stopped accepting new items.
- Items then started to pile up on the remote hosts' side.
- Everything went nuts. Apache stopped accepting requests. (Why?)
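If that theory holds, the backlog should be visible directly in Redis. Something along these lines would show it (assuming the list key is named "logstash", as in the shipper sketch above; the IP is a placeholder):

# number of events queued in the logstash list
redis-cli -h <redis-vlan-ip> llen logstash
# how much memory redis is currently holding on to
redis-cli -h <redis-vlan-ip> info | grep used_memory_human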
My questions are these:
Why does Apache go nuts if there's just something tailing its log? Is it that the thing tailing it blocks Apache from writing?
Is there a sane way to make elasticsearch faster/better/more resilient?
Is there a sane way to make Redis resilient and not die from being OOM'd? (I've sketched a maxmemory idea below.)
Is there a fundamental flaw in the way I've set it all up, or does everyone have this problem?
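For the Redis question, one approach would be to cap its memory in redis.conf rather than let the kernel OOM-killer decide. A sketch (the 2gb value is a guess, and I haven't tested this):

# cap redis well below physical RAM so the kernel OOM-killer never fires
maxmemory 2gb
# don't silently evict queued events; reject writes instead so shippers back off
maxmemory-policy noeviction

Of course that just moves the back-pressure onto the shippers, which is part of why I'm asking.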
-- EDIT --
Some specs for @lusis.
admin@log01:/etc/init$ free -m
             total       used       free     shared    buffers     cached
Mem:          7986       6041       1944          0        743       1157
-/+ buffers/cache:       4140       3845
Swap:         3813       3628        185
Filesystem             Size  Used Avail Use% Mounted on
/dev/sda2               19G  5.3G   13G  31% /
udev                   3.9G  4.0K  3.9G   1% /dev
tmpfs                  1.6G  240K  1.6G   1% /run
none                   5.0M     0  5.0M   0% /run/lock
none                   3.9G     0  3.9G   0% /run/shm
/dev/sda1               90M   72M   14M  85% /boot
/dev/mapper/data-disk  471G  1.2G  469G   1% /data
/dev/sda2 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/data-disk on /data type ext3 (rw)
/data/elasticsearch on /var/lib/elasticsearch type none (rw,bind)
log01:/etc/init$ top
top - 14:12:20 up 18 days, 21:59, 2 users, load average: 0.20, 0.35, 0.40
Tasks: 103 total, 1 running, 102 sleeping, 0 stopped, 0 zombie
Cpu0 : 3.0%us, 1.0%sy, 0.0%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 12.0%us, 1.0%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu2 : 4.7%us, 0.3%sy, 0.0%ni, 94.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 5.6%us, 1.3%sy, 0.0%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 5.3%us, 1.3%sy, 0.0%ni, 93.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 6.4%us, 1.0%sy, 0.0%ni, 92.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 8178120k total, 6159036k used, 2019084k free, 761780k buffers