We have a Sphinx install (2.0.3) running on a cluster of 3 EC2 instances (currently m3.large).

Our Sphinx config (the same on each box) sets workers = threads and max_children = 30. We periodically get the dreaded "temporary searchd error: server maxed out, retry in a second", even though our instances hover around 5% CPU utilization. Some example top output:

top - 19:51:56 up 22:15,  1 user,  load average: 0.08, 0.04, 0.01
Tasks:  82 total,   2 running,  80 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.0%us,  0.0%sy,  0.0%ni, 98.5%id,  0.3%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   7872040k total,  2911920k used,  4960120k free,   245168k buffers
Swap:        0k total,        0k used,        0k free,  2190992k cached

All the Sphinx documentation seems to say about max_children is that it is "useful to control server load". While searching I found a forum post indicating that setting it either too high or too low can cause "server maxed out" (I presume too high causes it because the individual queries are starved), but the post offered no further tips on choosing the right level. (I can't find the link to this post again to save my life. Sorry.)

Two related questions:

  • Am I right in thinking the low CPU suggests max_children could/should be higher than 30?
  • How can I find the optimal number (i.e., the max number of children which [usually] does not lead to query slowdown)? I'm not entirely sure what kind of info Sphinx logs beyond query.log. Is there a tool I can use to determine whether queries are slowing down because too many run in parallel? If they are not, how can I tell whether queries are CPU-bound or memory-bound (or should I be looking at some other value entirely)?
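For what it's worth, here is the back-of-envelope estimate I would start from: a minimal sketch based on Little's law (average number of in-flight queries = arrival rate × mean query time). The function names, the headroom factor, and the example numbers are all hypothetical and not measurements from our cluster; whether this is even a sensible way to size max_children is part of what I'm asking.

```python
import math


def estimate_concurrency(queries_per_second, mean_query_seconds):
    """Little's law: average in-flight queries = arrival rate * mean time per query."""
    return queries_per_second * mean_query_seconds


def suggest_max_children(queries_per_second, mean_query_seconds, headroom=2.0):
    """Scale the average concurrency by a headroom factor and round up.

    The headroom factor is an assumption meant to absorb bursts;
    it is not a value taken from the Sphinx documentation.
    """
    average = estimate_concurrency(queries_per_second, mean_query_seconds)
    return math.ceil(average * headroom)


if __name__ == "__main__":
    # Hypothetical load: 100 queries/s per box, 0.25 s mean query time.
    print(estimate_concurrency(100, 0.25))
    print(suggest_max_children(100, 0.25))
```

If the estimate for our real traffic came out well above 30, that would at least be consistent with hitting the limit while the CPU sits idle, but it says nothing about where query slowdown would start.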
