
I'm currently using Nginx to proxy requests to a Unicorn server running a Sinatra application. The application only has a couple of routes, each of which makes a fairly simple (non-costly) query against a PostgreSQL database and returns the data as JSON. These services are monitored by God.
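The routes are essentially of this shape (a minimal sketch with made-up names, not the real code):

# Hypothetical sketch of the kind of route involved: one cheap
# PostgreSQL query, with the result returned as JSON.
require 'sinatra'
require 'pg'
require 'json'

DB = PG.connect(dbname: 'app_db') # illustrative connection settings

get '/records/:id' do
  content_type :json
  result = DB.exec_params('SELECT * FROM records WHERE id = $1 LIMIT 1', [params[:id]])
  (result.first || {}).to_json
end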

I'm currently experiencing extremely slow response times from this application server. I have another two Unicorn servers being proxied via Nginx, and those are responding perfectly fine, so I think I can rule out any wrongdoing on Nginx's part.

Here is my God configuration:

# God configuration

APP_ROOT = File.expand_path '../', File.dirname(__FILE__)

God.watch do |w|
  w.name = "app_name"
  w.interval = 30.seconds # default

  w.start = "cd #{APP_ROOT} && unicorn -c #{APP_ROOT}/config/unicorn.rb -D"

  # -QUIT = graceful shutdown; waits for workers to finish their current request before exiting
  w.stop = "kill -QUIT `cat #{APP_ROOT}/tmp/unicorn.pid`"

  w.restart = "kill -USR2 `cat #{APP_ROOT}/tmp/unicorn.pid`"

  w.start_grace = 10.seconds
  w.restart_grace = 10.seconds
  w.pid_file = "#{APP_ROOT}/tmp/unicorn.pid"

  # User under which to run the process
  w.uid = 'web'
  w.gid = 'web'

  # Cleanup the pid file (this is needed for processes running as a daemon)
  w.behavior(:clean_pid_file)

  # Conditions under which to start the process
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 5.seconds
      c.running = false
    end
  end

  # Conditions under which to restart the process
  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.above = 150.megabytes
      c.times = [3, 5] # 3 out of 5 intervals
    end

    restart.condition(:cpu_usage) do |c|
      c.above = 50.percent
      c.times = 5
    end
  end

  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end

Here is my Unicorn configuration:

# Unicorn configuration file

APP_ROOT = File.expand_path '../', File.dirname(__FILE__)

worker_processes 8

preload_app true

pid "#{APP_ROOT}/tmp/unicorn.pid"

listen 8001

stderr_path "#{APP_ROOT}/log/unicorn.stderr.log"
stdout_path "#{APP_ROOT}/log/unicorn.stdout.log"

before_fork do |server, worker|
  # On a USR2 restart the old master's pid file is renamed to unicorn.pid.oldbin,
  # so if that file exists a new master has taken over and the old one can be
  # asked to shut down gracefully.
  old_pid = "#{APP_ROOT}/tmp/unicorn.pid.oldbin"

  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end

I have checked the God status logs, but it appears CPU and memory usage are never out of bounds. I also have something in place to kill high-memory workers, which can be found on the GitHub blog page here.
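The worker-killing part is, in spirit, something along these lines (a hypothetical sketch, not the exact snippet from that post):

# Hypothetical Rack middleware sketch: ask a worker to exit gracefully once
# its resident memory crosses a threshold; the Unicorn master then forks a
# replacement worker.
class KillBloatedWorker
  def initialize(app, limit_kb = 150 * 1024)
    @app = app
    @limit_kb = limit_kb
  end

  def call(env)
    status, headers, body = @app.call(env)
    rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
    # QUIT lets the worker finish the current request before exiting.
    Process.kill(:QUIT, Process.pid) if rss_kb > @limit_kb
    [status, headers, body]
  end
end

It gets wired in with a use KillBloatedWorker line in config.ru.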

When running a tail -f on the Unicorn logs I do see some requests, but they're few and far between, whereas I was seeing around 60-100 a second before this trouble arrived. The log also shows workers being reaped and started as expected.

So my question is: how would I go about debugging this? What are the next steps I should be taking? I'm extremely baffled that the server will sometimes respond quickly, but at other times is very slow for long periods (which may or may not be peak traffic times).

Any advice is much appreciated.

injekt

1 Answer


I would start by looking at general system health with tools like atop. Next, I would look closely at what postgres is doing. If that doesn't reveal the problem, I would use tcpdump (or the nicer tshark) to view the communication between your browser, nginx, and Unicorn. In conjunction with that, I would try strace on nginx and Unicorn.
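If those tools point back at the application itself, a simple pair of Sinatra filters (a hypothetical sketch, names made up) will show how long each request spends inside a worker:

# Hypothetical before/after filters that log how long each request spends
# inside the Sinatra application.
before do
  @request_started_at = Time.now
end

after do
  elapsed_ms = ((Time.now - @request_started_at) * 1000).round
  # $stderr ends up in unicorn.stderr.log given the stderr_path in your Unicorn config.
  $stderr.puts "#{request.request_method} #{request.path} took #{elapsed_ms}ms"
end

If those numbers stay small while clients see multi-second responses, the delay is in front of the workers (e.g. requests queueing on the listen socket) rather than in the application code.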

sciurus
  • Thanks for the suggestions, I've done both of these and it doesn't appear there are any issues. It hasn't been too unresponsive lately, though, so I'll need to try these techniques when that happens again. I guess I'll just keep trying; I was hoping to find others with the same issue. Thanks for your time – injekt Apr 26 '11 at 08:03