I am exploring RabbitMQ quorum queues to improve HA for some services in a Kubernetes cluster. From what I have read, they are designed with data safety in mind.

However, the chapter "Managing Replicas" states:

Replicas of a quorum queue are explicitly managed by the operator. When a new node is added to the cluster, it will host no quorum queue replicas unless the operator explicitly adds it to a member (replica) list of a quorum queue or a set of quorum queues.

It therefore seems that, after a disruption (especially an involuntary one), the following situation could arise in a 3-node cluster:

  1. after a disruption, one node goes down: the other two nodes still form a majority and will "keep the queue alive", possibly electing a new leader;
  2. Kubernetes provides a new node (pod) to replace the failed one; the new node automatically rejoins the RabbitMQ cluster, but
  3. unless the operator manually intervenes, the new node does not host a replica of any existing quorum queue;
  4. for a 3-node cluster, this means that there is no HA anymore: if one of the other two nodes fails at some point in the future, the queue is effectively lost.
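
The arithmetic behind steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration, assuming the usual majority rule (more than half of the queue's configured members must be online) and assuming the failed node stays in the member list; `has_quorum` is a made-up helper, not part of any RabbitMQ API.

```python
def has_quorum(members: int, online: int) -> bool:
    """Majority rule: strictly more than half of the members must be online."""
    return online > members // 2


MEMBERS = 3  # replicas configured for the queue (the lost node still counts)

# Step 1: one replica is down, two of three are still a majority.
print(has_quorum(MEMBERS, online=2))  # True

# Step 4: the replacement pod never became a member and a second
# replica fails, so only one of three configured members is online.
print(has_quorum(MEMBERS, online=1))  # False
```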

Is there any way to mitigate this scenario? For example, is it possible to have new nodes automatically join all existing quorum queues, perhaps by maintaining a list of "startup commands" (run after RabbitMQ starts) to which we could add the rejoin commands?

matpen

Which version of Kubernetes did you use and how did you set up the cluster? Did you use bare metal installation or some cloud provider? It is important to reproduce your problem. – Mikołaj Głodziak Mar 14 '22 at 15:09

1 Answer

The RabbitMQ team highly recommends using the official Kubernetes operator: https://www.rabbitmq.com/kubernetes/operator/operator-overview.html

Aside from that, here's what the local k8s expert has to say:

Kubernetes will not just randomly delete a persistent volume. If the node went down for some reason, it will start again with the same name and the same data.

As long as the same name and data are used, the "new" node will join just as if it were the old one.

There are probably scenarios that require manual intervention but they aren't as frequent as you'd think.
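
For the cases that do need manual intervention, the `rabbitmq-queues grow` CLI command adds a quorum queue replica on a given node for queues that do not yet have one there. Below is a minimal Python sketch that only builds that command line. The node name `rabbit@rabbitmq-server-2` is a placeholder and `build_grow_command` is a hypothetical helper, so adapt both to your deployment.

```python
import shlex


def build_grow_command(node: str, strategy: str = "all") -> list[str]:
    """Build the argv for `rabbitmq-queues grow <node> <all|even>`.

    "all" targets every matching quorum queue; "even" targets only
    queues that currently have an even number of members.
    """
    if strategy not in ("all", "even"):
        raise ValueError("strategy must be 'all' or 'even'")
    return ["rabbitmq-queues", "grow", node, strategy]


if __name__ == "__main__":
    # Placeholder node name (assumption); replace with the real node.
    cmd = build_grow_command("rabbit@rabbitmq-server-2")
    # Only print the command here; pass `cmd` to subprocess.run on a
    # host where the RabbitMQ CLI tools are installed to execute it.
    print(shlex.join(cmd))
```

Whether it is wise to run this automatically on startup depends on the setup, so treat it as a starting point and not as a recommended procedure.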


NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

Luke Bakken