4

Last night at around midnight, our app fell over and I'm trying to determine why. We currently have a nginx front end server and 2 unicorn worker (app) servers on EC2.

Pretty much our unicorn workers kept timing out and hence, were restarted by the master.

From what I can see, we have no crontabs or anything set to run at this time so I'm a little stumped.

I was able to get the app up this morning when I woke (after 6 hours of downtime) by killing the unicorn processes and re-executing the unicorn binary. (unicorn_rails -c unicorn.rb etc)

Any ideas where to look? The fact that both app servers went down makes me think it could be the database (RDS) ?

The log was filled with the following (for 6 hours .. etc);

E, [2013-02-28T00:07:40.367981 #11097] ERROR -- : worker=2 PID:26941 timeout (31s > 30s), killing
E, [2013-02-28T00:07:40.468495 #11097] ERROR -- : reaped #<Process::Status: pid 26941 SIGKILL (signal 9)> worker=2
I, [2013-02-28T00:07:40.756724 #28319]  INFO -- : worker=2 ready
E, [2013-02-28T00:07:44.519818 #11097] ERROR -- : worker=1 PID:11292 timeout (31s > 30s), killing
E, [2013-02-28T00:07:44.626362 #11097] ERROR -- : worker=0 PID:26933 timeout (31s > 30s), killing
E, [2013-02-28T00:07:44.726936 #11097] ERROR -- : reaped #<Process::Status: pid 11292 SIGKILL (signal 9)> worker=1
E, [2013-02-28T00:07:44.727254 #11097] ERROR -- : worker=0 PID:26933 timeout (31s > 30s), killing
E, [2013-02-28T00:07:44.932858 #11097] ERROR -- : reaped #<Process::Status: pid 26933 SIGKILL (signal 9)> worker=0
I, [2013-02-28T00:07:45.661356 #28329]  INFO -- : worker=1 ready
I, [2013-02-28T00:07:45.828289 #28334]  INFO -- : worker=0 ready
E, [2013-02-28T00:08:11.113970 #11097] ERROR -- : worker=2 PID:28319 timeout (31s > 30s), killing
E, [2013-02-28T00:08:11.214770 #11097] ERROR -- : reaped #<Process::Status: pid 28319 SIGKILL (signal 9)> worker=2
I, [2013-02-28T00:08:11.518723 #28368]  INFO -- : worker=2 ready
E, [2013-02-28T00:08:16.270463 #11097] ERROR -- : worker=1 PID:28329 timeout (31s > 30s), killing
E, [2013-02-28T00:08:16.371067 #11097] ERROR -- : worker=0 PID:28334 timeout (31s > 30s), killing
E, [2013-02-28T00:08:16.471684 #11097] ERROR -- : reaped #<Process::Status: pid 28329 SIGKILL (signal 9)> worker=1
E, [2013-02-28T00:08:16.471983 #11097] ERROR -- : reaped #<Process::Status: pid 28334 SIGKILL (signal 9)> worker=0
I, [2013-02-28T00:08:17.038915 #28376]  INFO -- : worker=0 ready
I, [2013-02-28T00:08:17.128931 #28379]  INFO -- : worker=1 ready
E, [2013-02-28T00:08:42.628665 #11097] ERROR -- : worker=2 PID:28368 timeout (31s > 30s), killing
E, [2013-02-28T00:08:42.729290 #11097] ERROR -- : reaped #<Process::Status: pid 28368 SIGKILL (signal 9)> worker=2
I, [2013-02-28T00:08:43.015140 #28390]  INFO -- : worker=2 ready
E, [2013-02-28T00:08:48.778221 #11097] ERROR -- : worker=0 PID:28376 timeout (31s > 30s), killing
E, [2013-02-28T00:08:48.878530 #11097] ERROR -- : worker=1 PID:28379 timeout (31s > 30s), killing
James
  • 151
  • 1
  • 6
  • Can you post the contents of /var/log/syslog at the same time? – Tom O'Connor Feb 28 '13 at 08:05
  • You need to look into your application log to find out why it takes so long to handle requests. Real cause could be anything - overloaded DB backend, overloaded application server (CPU usage or heavy swapping), slow external API, empty cache or just bug in application code which causes it to hang. – AlexD Feb 28 '13 at 15:32
  • You have to understand, the app usually works without a fault. It's been running fine since I restarted the instances and can't see any "long running requests", hence why I can't get my head around why the slaves would timeout for 30s (and be restarted by the master) for 6 hours straight. Thanks for the advice guys, I'll keep looking. – James Mar 03 '13 at 08:02

1 Answers1

1

Managed to solve this guys. Further investigation showed larges amounts of network traffic (and cpu usage!) between the times of 12am - 4am. It turns out, our bing webmaster settings were set to "high" between these times, causing bingbot to go crazy and absolutely hammer unicorn. (poor unicorn).

More info; https://webmasters.stackexchange.com/questions/23084/ms-bing-web-crawler-out-of-control-causing-our-site-to-go-down

James
  • 151
  • 1
  • 6