
We run a RabbitMQ cluster with 6 c5.2xlarge nodes on AWS. The version is 3.6.14 on Erlang 19.1. The nodes run in Docker containers and mount a local volume to /var/lib/rabbitmq/mnesia.

The docker run command is as follows:

docker run -d \
    --name rabbitmq \
    --net=host \
    --dns-search=eu-west-1.compute.internal \
    --ulimit nofile=65536:65536 \
    --restart on-failure:5 \
    -p 1883:1883 \
    -p 4369:4369 \
    -p 5672:5672 \
    -p 15672:15672 \
    -p 25672:25672 \
    -e AUTOCLUSTER_TYPE=aws \
    -e AWS_AUTOSCALING=true \
    -e AUTOCLUSTER_CLEANUP=true \
    -e CLEANUP_WARN_ONLY=false \
    -e AWS_DEFAULT_REGION=eu-west-1 \
    -v /mnt/storage:/var/lib/rabbitmq/mnesia \
    dockerregistry/rabbitmq-autocluster:3.6.14
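
To sanity-check that the autocluster plugin actually joined all six nodes, rabbitmqctl can be run inside the container (a sketch, using the container name rabbitmq from the command above):

# list the running cluster members as seen by this node
docker exec rabbitmq rabbitmqctl cluster_status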

Part of the RabbitMQ config:

[
  {
    rabbit,
    [
      {loopback_users, []},
      {collect_statistics_interval, 10000},
      {cluster_partition_handling, autoheal},
      {delegate_count, 64},
      {fhc_read_buffering, false},
      {fhc_write_buffering, false},
      {heartbeat, 60},
      {background_gc_enabled, true},
      {background_gc_target_interval, 60000},
      {queue_index_embed_msgs_below, 0},
      {queue_index_max_journal_entries, 8192},
      {log_levels, [{connection, error},
                    {channel, warning}]},
      {vm_memory_high_watermark, 0.8},
      {auth_backends,
        [
          rabbit_auth_backend_ldap,
          rabbit_auth_backend_internal
        ]
      }
    ]
  },

  {
    rabbitmq_mqtt,
    [
      {default_user,     <<"**********">>},
      {default_pass,     <<"************">>},
      {allow_anonymous,  false},
      {vhost,            <<"/">>},
      {exchange,         <<"amq.topic">>},
      {subscription_ttl, 1800000},
      {prefetch,         10},
      {ssl_listeners,    []},
      {tcp_listeners,    [1883]},
      {tcp_listen_options,
        [
          binary,
          {packet, raw},
          {reuseaddr, true},
          {backlog, 128},
          {nodelay, true}
        ]
      }
    ]
  },
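
To confirm what the nodes actually pick up from this file, the effective application settings can be dumped (again assuming the container name rabbitmq from the run command above):

# print the effective Erlang application environment, including rabbit and rabbitmq_mqtt settings
docker exec rabbitmq rabbitmqctl environment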

Last Friday evening the queued messages peaked at ~25k when, out of nowhere, all nodes started to experience massive iowait issues. Usually iowait is always < 5, but it started to spike to > 70. We checked the machines but have so far been unable to find a reasonable explanation. After we rotated the entire autoscaling group to new instances, the issue went away and did not return, even on Saturday when we reached the same message rate. In iotop we often see the ext4 journaling process on top. However, with the NVMe SSDs on the c5 machines, we think iowait should not be an issue. We also checked the network and found no cause for concern.
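
For reference, a sketch of the kind of disk monitoring we are doing on the affected hosts (iotop is where we see the ext4 journaling process on top; the iostat and pidstat calls assume the sysstat package is installed):

# per-device utilisation and await times, refreshed every 5 seconds
iostat -dx 5

# per-process disk I/O, only showing processes that are actually doing I/O
iotop -obP

# per-process read/write rates, to see how much the Erlang VM (beam.smp) is writing
pidstat -d 5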

Any input or hints you might be able to give would be much appreciated. What can we check?

  • "the queued messages peaked at ~25k messages" - that suggests your consumers were not keeping up, and RabbitMQ started paging incoming messages to disk to relieve memory pressure. Do you monitor memory use and other RMQ stats? – Luke Bakken Feb 02 '18 at 20:17

0 Answers