
Context:

We use a RabbitMQ 2.8.4 HA cluster for messaging across a mid-size (hundreds to thousands of servers) datacenter. The messaging functions are critical to our app.

We use the C client API for AMQP/RabbitMQ operations, and the HTTP API to the RabbitMQ management plugin to periodically monitor queue status and send alerts based on size/throughput/etc.
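For illustration, the periodic monitoring poll looks roughly like the following (the hostname, credentials, port, and alert threshold are placeholders, not our real values):

    # Rough sketch of our periodic queue-status poll against the management API.
    # Everything identifying here (URL, credentials, threshold) is illustrative.
    import requests

    MGMT_URL = "http://rabbit-mgmt.example.com:55672/api"  # 55672 is the 2.8.x mgmt port
    AUTH = ("monitoring", "secret")
    QUEUE_DEPTH_ALERT = 10000

    def alert(message):
        # stand-in for our real alerting path
        print("ALERT:", message)

    def check_queues():
        resp = requests.get(MGMT_URL + "/queues", auth=AUTH, timeout=30)
        resp.raise_for_status()
        for q in resp.json():
            depth = q.get("messages", 0)
            if depth > QUEUE_DEPTH_ALERT:
                alert("queue %s/%s has %d messages" % (q["vhost"], q["name"], depth))

    if __name__ == "__main__":
        check_queues()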

The only ports/connections allowed from clients to the messaging servers are on the AMQP ports and on HTTP for the management API.

For the near future, management has decided that downtime of the messaging cluster (which translates into downtime for certain requesting parts of the app) is unacceptable. In a few months a maintenance window will be possible, but until then sales-driven concerns won't allow any maintenance of the messaging system. It sucks, but we (the infrastructure/devops people) don't get to make those calls.

Problem:

For a while, we have been victimized by the problem discussed in this SO question. After a connection storm, or a long period during which new clients connect and use the AMQP system very frequently, the RabbitMQ management API shows hundreds of thousands of connections, listed as "running", that no longer exist. netstat, lsof, and other diagnostics prove that the sockets corresponding to these connections are not open on the clients; the management API is simply unable to prune the entries for some reason. rabbitmqctl list_connections sometimes shows some of the nonexistent connections as still open, but it always shows far fewer than the management API; its staleness detection seems better than the API's, and it eventually purges the false connections from its listing.
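To make the discrepancy concrete, here is the kind of quick check we can run on one of the messaging servers (credentials are placeholders; the rabbitmqctl column names are the 2.8.x ones):

    # Rough illustration of the discrepancy: the management API's connection list
    # versus what rabbitmqctl sees locally on the same node.
    import subprocess
    import requests

    MGMT_URL = "http://localhost:55672/api"
    AUTH = ("monitoring", "secret")

    api_conns = requests.get(MGMT_URL + "/connections", auth=AUTH, timeout=120).json()

    ctl_out = subprocess.check_output(
        ["rabbitmqctl", "-q", "list_connections", "peer_address", "peer_port"]).decode()
    ctl_conns = [line for line in ctl_out.splitlines() if line.strip()]

    print("management API reports %d connections" % len(api_conns))
    print("rabbitmqctl reports    %d connections" % len(ctl_conns))
    # after a connection storm, the first number is far larger than the second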

When we attempt to delete one of these "false" connections via the web UI, the page doesn't respond to pressing "delete", and the connection stays in the listing. When we try to delete it by sending a DELETE request to the JSON API, we get a 500 server error indicating that the connection isn't valid and so can't be deleted (exactly as described in the linked SO question above).

After a long accumulation of these nonexistent connection objects, three bad things happen, in no particular order:

  • The RabbitMQ server has massive memory spikes that come out of nowhere; they don't correspond to the slow connection creep in the management API, though its memory consumption does grow slowly over time. These spikes often lead to a crash.
  • The management API stops responding to requests (times out on the API clients).
  • The RabbitMQ server itself starts sporadically refusing connections from AMQP clients that are not doing anything with HTTP/the management API.

In cases where this problem begins breaking our app, we have to do one of two things:

  • Restart the whole messaging cluster, which is bad because it causes momentary downtime, and bad because messages can be dropped during this process.
  • Restart and purge the management API's database, which is bad because we then can't use the management API, and worse because it often breaks the management API in such a way that it will not start properly again until we restart the whole cluster as in the first bullet.

Most people afflicted by similar problems resolve them by upgrading to RabbitMQ 3.*. We have neither the time nor the permission, let alone the downtime authorization, for such a project. Even in the face of periodic several-minute outages when we restart the cluster due to this issue, we were not authorized to upgrade. It's a purely political restriction, but an unfortunate reality.

Question:

Given that we can't upgrade RabbitMQ in the near future, and that we need to continue using the management API, how can we resolve the issue of persistent, stale connections that cause messaging cluster failures?

We have the ability to develop new/different client code, and run jobs on the messaging servers themselves, but we don't have the ability to use rabbitmqctl locally or via ssh on any of the client machines.

What we've tried:

  • We've tried switching AMQP client libraries in hopes that a better client would clean up after itself and not leave stale connections, to no avail. The issue persists.
  • We've tried playing with the tcp_keepalive and exit_on_close TCP settings on the RabbitMQ servers themselves (an example config stanza is shown after this list). No combination of these values resolves the issue.
  • We've made a script which iterates over all connections returned by the JSON API's connections list and sends a DELETE for each one that has been open longer than it would need to be if it were actually in use (a sketch of this loop is included after this list). All such requests are met with 500 server errors, as described in the SO question linked above.
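For reference, the TCP options from the second bullet live under tcp_listen_options in rabbitmq.config; a typical stanza looks roughly like the following (values illustrative, not our exact ones):

    [
      {rabbit, [
        {tcp_listen_options, [binary,
                              {packet, raw},
                              {reuseaddr, true},
                              {backlog, 128},
                              {nodelay, true},
                              {exit_on_close, false},
                              {keepalive, true}]}
      ]}
    ].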
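And here is roughly what the cleanup script from the last bullet does (the URL, credentials, and the staleness test are placeholders rather than our exact code):

    # Sketch of the stale-connection cleanup attempt. The staleness check is a
    # placeholder; the real script compares each connection's age, as reported
    # by the API, against the longest a legitimate connection should live.
    import urllib.parse
    import requests

    MGMT_URL = "http://rabbit-mgmt.example.com:55672/api"
    AUTH = ("admin", "secret")

    def is_stale(conn):
        return False  # placeholder for the age-based check described above

    def purge_stale_connections():
        conns = requests.get(MGMT_URL + "/connections", auth=AUTH, timeout=120).json()
        for conn in conns:
            if not is_stale(conn):
                continue
            name = urllib.parse.quote(conn["name"], safe="")
            resp = requests.delete(MGMT_URL + "/connections/" + name, auth=AUTH, timeout=30)
            if not resp.ok:
                # for the stale entries this is consistently an HTTP 500
                print("could not delete %r: HTTP %d" % (conn["name"], resp.status_code))

    if __name__ == "__main__":
        purge_stale_connections()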
Zac B

1 Answer


One option is what you're already doing: keep restarting everything periodically.

The other option is to backport the fix from 3.x to your current version, but if you aren't allowed to upgrade, you probably won't be allowed to do this either. And it will almost certainly be much more work than upgrading, especially if the RabbitMQ code has been significantly refactored.

Until you can change the politics of the situation, those are your options.

Michael Hampton