We are running a 4-node Elasticsearch cluster; each machine has 12 cores, 96 GB of RAM, and 4 spinning disks. Under normal operation, CPU usage is mostly user time at around 5-10%. Every few days, one machine's CPU usage gets pegged at 80-100%, all of it user and system time; I/O wait actually decreases. We first thought it was an Elasticsearch-specific issue, but after extensive debugging that doesn't seem to be the case:

  • the high CPU utilization survives a restart of the Elasticsearch node process
  • the Elasticsearch threads are all behaving normally; things just take 10x longer
  • non-Elasticsearch operations (garbage collection) also take 10x longer, but heap activity is normal

If we stop the process for about an hour and then restart the process only (not the machine) the problem goes away and things work fine for a few days.

We have also noticed that disk copy tests are very slow during the problem. With the process up but idle (not indexing or searching data), or soon after the process has stopped, copying a 1 GB file with dd runs at about 18 MB/s on the problematic machine, compared with 490 MB/s when healthy. Interestingly, dstat showed that the slow copy spent about 25 seconds before doing any I/O and then took another 30 seconds to complete. The strace output didn't seem to be significantly different between the two cases.
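To make the copy test repeatable across healthy and unhealthy machines, something like the following could be used. This is only a rough sketch, not a replacement for dd: the function name `timed_copy` and its return keys are made up for illustration, and it assumes the source file is large enough (and not already in the page cache) for the numbers to mean anything. It records the delay before the first chunk is read separately, since the dstat observation above suggests the time is lost before any I/O starts.

```python
import os
import sys
import tempfile
import time


def timed_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks and report timing (hypothetical helper).

    Returns a dict with the bytes copied, the delay until the first chunk
    was read, the total elapsed time, and the throughput in MB/s.
    """
    copied = 0
    first_read_s = None
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if first_read_s is None:
                # Time spent before the first read returned any data.
                first_read_s = time.monotonic() - start
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # include the time to push data to disk
    elapsed_s = time.monotonic() - start
    mb = copied / (1024 * 1024)
    rate = mb / elapsed_s if elapsed_s > 0 else float("inf")
    return {
        "bytes": copied,
        "first_read_s": first_read_s,
        "elapsed_s": elapsed_s,
        "mb_per_s": rate,
    }


if __name__ == "__main__":
    if len(sys.argv) == 3:
        print(timed_copy(sys.argv[1], sys.argv[2]))
    else:
        # Tiny self-test with a throwaway file; far too small to be a
        # meaningful disk benchmark.
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "src.bin")
            with open(src, "wb") as f:
                f.write(os.urandom(4 * 1024 * 1024))
            print(timed_copy(src, os.path.join(tmp, "dst.bin")))
```

Running it with the same source file on a healthy and a problematic machine would give two directly comparable sets of numbers.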

Any idea what further tests we could run?

  • You never explicitly said that it is the elasticsearch processes using the CPU when usage spikes. Is it them? – sciurus Jul 27 '14 at 18:49
  • sorry, yes it's the elasticsearch process taking the cpu, however after a lot of analysis, the operations it's doing are quite normal -- they just seem to take a lot of resources to complete. The hot threads output shows no discernible pattern, which meshes with our thinking that it's not an elasticsearch issue. – slushi Jul 27 '14 at 22:24
  • are you sure it is CPU? not say disk i/O ? do you have enough memory? – Sverre Jul 30 '14 at 07:28
  • our monitoring tools show predominately user cpu usage. i/o actually decreases a bit, probably because the machine becomes so loaded. the machines have plenty of memory, 96GB of physical memory with elasticsearch using 32GB max. we set the mlockall option in elasticsearch so it won't swap. – slushi Jul 30 '14 at 15:24

3 Answers


There are a lot of known issues with Elasticsearch, and a quick search will turn up some of them. One likely cause of high CPU usage is a lack of control over cache usage. See the references below:

https://github.com/elasticsearch/elasticsearch/issues/4288
http://elasticsearch-users.115913.n3.nabble.com/Very-high-sys-cpu-usage-with-HTTP-KeepAlive-td4049998.html
http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/

  • it's not an elasticsearch related issue. we have even disabled queries so the cache is not a factor. – slushi Jul 24 '14 at 15:33

Processes using a lot of CPU should show up in atop as suggested by Ian Macintosh, but because it's based on sampling the process table on a regular cycle, that visibility can be dependent on how long those processes run for.
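To illustrate the sampling limitation, here is a minimal sketch of the general idea of diffing per-process CPU counters between two snapshots of /proc. It is not how atop is actually implemented; the helper names (`parse_pid_stat`, `snapshot`, `cpu_ticks_delta`) are invented for this example, and it assumes a Linux /proc layout where utime and stime are fields 14 and 15 of /proc/<pid>/stat. A process that starts and exits between two snapshots never appears in the result, which is exactly the blind spot described above.

```python
import os


def parse_pid_stat(line):
    """Parse one /proc/<pid>/stat line into (pid, comm, utime, stime).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    head, _, rest = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = rest.split()
    # fields[0] is field 3 (state), so utime (14) and stime (15)
    # land at indexes 11 and 12.
    return int(pid_text), comm, int(fields[11]), int(fields[12])


def snapshot(proc_root="/proc"):
    """Return {pid: (comm, total_ticks)} for every readable process."""
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                pid, comm, utime, stime = parse_pid_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the line was unparseable
        result[pid] = (comm, utime + stime)
    return result


def cpu_ticks_delta(before, after):
    """CPU ticks used between two snapshots, for pids present in both."""
    return {
        pid: (after[pid][0], after[pid][1] - before[pid][1])
        for pid in after
        if pid in before
    }
```

Taking two snapshots a few seconds apart and sorting the deltas would show the long-running CPU consumers; the short-lived ones are what process accounting is for.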

The GNU accounting utilities can also be very useful for this sort of thing (package 'acct' on Debian-based systems, or 'psacct' on Red Hat-based ones). I routinely run atop and have the accounting package turned on (accton on) for most servers.

After you enable accounting data collection, stats are kept about the CPU usage of every process that runs, along with when it started and finished. This can be very useful when a bunch of short-lived processes are consuming your CPU, which is hard to see with atop, strace, etc. (though strace may be more helpful with the -f or -ff flag). It's less useful when you have processes with a lifetime much longer than the CPU spike, but in those cases atop should give you what you want. lastcomm is probably the tool you want for accessing the collected statistics.

While very useful, strace only tells you about system calls. If something is using CPU intensively without making system calls, it won't show up. Sometimes ltrace can be useful for this, but again, only if the relevant activity occurs within a library call, and that's not always the case.

Tools like strace and ltrace, and perhaps even a debugger like gdb only come into play once you've identified the process that's consuming the CPU, and it doesn't sound like you've got that yet. At this point, atop and lastcomm are probably the way to go.


What further tests could you run?

Some information is missing, such as how the pegged CPU splits between system % and user %. As a first step, check what percentage of CPU is spent on IRQs, just in case that leads somewhere.
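One way to get the user/system/IRQ split without extra tooling is to diff two readings of the aggregate `cpu` line in /proc/stat. The sketch below assumes a Linux kernel that exposes at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical.

```python
import time

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Extract the aggregate 'cpu' counters from /proc/stat content."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time per category between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_proc_stat(path="/proc/stat"):
    with open(path) as f:
        return parse_cpu_line(f.read())


if __name__ == "__main__":
    first = read_proc_stat()
    time.sleep(5)  # sampling interval; an arbitrary choice
    second = read_proc_stat()
    for name, pct in cpu_percentages(first, second).items():
        print(f"{name:8s} {pct:5.1f}%")
```

If `irq` plus `softirq` account for a noticeable share during an incident but not on a healthy machine, that would be worth following up.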

Assuming the system CPU % is fairly high and it's not IRQs, you might want to check memory. A handy tool for an overview is atop. It should tell you whether something like page scans or page steals is the cause because you're under heavy memory pressure.
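If atop isn't installed yet, the same page scan and page steal counters can be read from /proc/vmstat. The sketch below is an approximation under stated assumptions: counter names differ between kernel versions (for example `pgscan_kswapd_normal` on older kernels versus `pgscan_kswapd` on newer ones), so it sums everything under the kswapd and direct reclaim prefixes and skips `pgscan_direct_throttle`, which counts throttling events rather than pages. The function name is made up for this example.

```python
PREFIXES = {
    "pgscan": ("pgscan_kswapd", "pgscan_direct"),
    "pgsteal": ("pgsteal_kswapd", "pgsteal_direct"),
}
EXCLUDED = {"pgscan_direct_throttle"}  # an event counter, not a page count


def reclaim_counters(vmstat_text):
    """Sum page-scan and page-steal counters from /proc/vmstat content."""
    totals = {"pgscan": 0, "pgsteal": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or parts[0] in EXCLUDED:
            continue
        name, value = parts
        for key, prefixes in PREFIXES.items():
            if name.startswith(prefixes):
                totals[key] += int(value)
    return totals


if __name__ == "__main__":
    with open("/proc/vmstat") as f:
        print(reclaim_counters(f.read()))
```

These are cumulative counters since boot, so what matters is how fast they grow between two readings taken during an incident, compared with the same interval on a healthy machine.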

I'm going to assume you've excluded swapping as being an issue.

Because atop gives you quite a comprehensive overview of the machine state it's very handy in getting a handle on the overall state. It would help comparing atop on a properly operating system vs one that's misbehaving as well.

If the only abnormality is user CPU % and the system is otherwise operating correctly, then you're likely dealing with a software bug and will have to go back to the authors for help, or change the way you're using the software to avoid triggering whatever bug you've found. Just check that you're not dealing with your own bug, i.e. you're calling it badly or in a loop or something of that nature. I've seen that a few times.

In summary, dig into memory, IRQs, swap, etc. and see if you can exclude those. I suggest atop for a quick comparison between normal and aberrant behaviour and to highlight critical items.

Ian Macintosh