
I'm running django/celery on EC2, with rabbitmq as the broker. The machine I was using failed, so I fired up another instance. But since switching to the new machine, I haven't been able to get celery to work.

EDIT: I've included a lot of logs below, just in case I'm misdiagnosing the problem. But I'm 85% sure that the problem is that rabbitmq-server fails to start up in the "starting database" phase.

node          : rabbit@ip-10-212-66-181
app descriptor: /usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/../ebin/rabbit.app
home dir      : /var/lib/rabbitmq
cookie hash   : 5+uQ077En5bpvle3HJCQMg==
log           : /var/log/rabbitmq/rabbit.log
sasl log      : /var/log/rabbitmq/rabbit-sasl.log
database dir  : /var/lib/rabbitmq/mnesia/rabbit

starting internal event notification system                           ...done
starting logging server                                               ...done
starting database                                                     ...Erlang has closed

Any ideas on how to further diagnose/solve this problem?

Here's what happens when I try to run celery:

$ python manage.py celeryd -l info
/opt/bitnami/python/lib/python2.6/site-packages/django_celery-2.4.2-py2.6.egg/djcelery/loaders.py:86: UserWarning: Using settings.DEBUG leads to a memory leak, never use this setting in production environments!
  warnings.warn("Using settings.DEBUG leads to a memory leak, never "
[2011-12-05 19:40:13,545: WARNING/MainProcess]  

 -------------- celery@ip-10-212-66-181 v2.4.3
---- **** -----
--- * ***  * -- [Configuration]
-- * - **** ---   . broker:      amqp://guest@localhost:5672//
- ** ----------   . loader:      djcelery.loaders.DjangoLoader
- ** ----------   . logfile:     [stderr]@INFO
- ** ----------   . concurrency: 1
- ** ----------   . events:      OFF
- *** --- * ---   . beat:        OFF
-- ******* ----
--- ***** ----- [Queues]
 --------------   . celery:      exchange:celery (direct) binding:celery


[Tasks]
  . tbAnalytics.models.processAnalysis
  . tbCollections.models.processCollection

[2011-12-05 19:40:13,558: INFO/PoolWorker-1] child process calling self.run()
[2011-12-05 19:40:13,562: WARNING/MainProcess] celery@ip-10-212-66-181 has started.
[2011-12-05 19:40:13,564: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 2 seconds...
[2011-12-05 19:40:15,574: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 4 seconds...

Tracing it back, it looks like the rabbitmq server is the problem, and the database in particular:

$ sudo rabbitmqctl status
Status of node 'rabbit@ip-10-212-66-181' ...
Error: unable to connect to node 'rabbit@ip-10-212-66-181': nodedown
diagnostics:
- nodes and their ports on ip-10-212-66-181: [{rabbitmqctl14448,38289}]
- current node: 'rabbitmqctl14448@ip-10-212-66-181'
- current node home dir: /var/lib/rabbitmq
- current node cookie hash: 5+uQ077En5bpvle3HJCQMg==

But I haven't been able to figure out how to restart the server:

bitnami@ip-10-212-66-181:/var/log/rabbitmq$ sudo rabbitmq-server start_app

+---+   +---+
|   |   |   |
|   |   |   |
|   |   |   |
|   +---+   +-------+
|                   |
| RabbitMQ  +---+   |
|           |   |   |
|   v1.7.2  +---+   |
|                   |
+-------------------+
AMQP 8-0
Copyright (C) 2007-2010 LShift Ltd., Cohesive Financial Technologies LLC., and Rabbit Technologies Ltd.
Licensed under the MPL.  See http://www.rabbitmq.com/

node          : rabbit@ip-10-212-66-181
app descriptor: /usr/lib/rabbitmq/lib/rabbitmq_server-1.7.2/sbin/../ebin/rabbit.app
home dir      : /var/lib/rabbitmq
cookie hash   : 5+uQ077En5bpvle3HJCQMg==
log           : /var/log/rabbitmq/rabbit.log
sasl log      : /var/log/rabbitmq/rabbit-sasl.log
database dir  : /var/lib/rabbitmq/mnesia/rabbit

starting internal event notification system                           ...done
starting logging server                                               ...done
starting database                                                     ...Erlang has closed
{"init terminating in do_boot",{{nocatch,{error,{cannot_start_application,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_vhost,rabbit_config,rabbit_listener,rabbit_durable_route,rabbit_route,rabbit_reverse_route,rabbit_durable_exchange,rabbit_exchange,rabbit_durable_queue,rabbit_queue]}}},[{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},{rabbit,run_boot_step,1},{rabbit,'-start/2-lc$^0/1-0-',1},{rabbit,start,2},{application_master,start_it_old,4}]}}}}}}},[{init,start_it,1},{init,start_em,1}]}}

Crash dump was written to: erl_crash.dump
init terminating in do_boot ()

Also, don't know if it's relevant, but this process is running in the background.

$ ps aux | grep rabbit
rabbitmq   714  0.0  0.0   1980   408 ?        S    Dec04   0:00 /usr/lib/erlang/erts-5.7.4/bin/epmd -daemon

I haven't been able to find any documentation for this kind of failure. Any suggestions?

Abe

3 Answers


I got some very good help from the rabbitmq-discuss list:

The database RabbitMQ uses is bound to the machine's hostname, so if you copied the database dir to another machine, it won't work. If this is the case, you have to set up a machine with the same hostname as before and transfer any outstanding messages to the new machine. If there's nothing important in rabbit, you could just clear everything by removing the RabbitMQ files in /var/lib/rabbitmq.

I deleted everything in /var/lib/rabbitmq/mnesia/rabbit/ and it started up without trouble. Hooray!
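A minimal sketch of that cleanup, under a few assumptions: it moves the directory aside instead of deleting it, so the old files can still be restored; the function name set_aside_mnesia and the .bak-<timestamp> suffix are made up for illustration; and it assumes the broker is stopped first and recreates its database directory on the next start, as it does on a fresh install.

```shell
#!/bin/sh
# Sketch: set the old Mnesia directory aside instead of deleting it outright.
# set_aside_mnesia and the ".bak-<timestamp>" suffix are illustrative names,
# not part of RabbitMQ.
set_aside_mnesia() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Strip a trailing slash, then append a timestamped suffix.
    backup="${dir%/}.bak-$(date +%Y%m%d%H%M%S)"
    # Print the backup location only if the move succeeded.
    mv "$dir" "$backup" && echo "$backup"
}

# Example (as root, with the broker stopped):
#   set_aside_mnesia /var/lib/rabbitmq/mnesia/rabbit
# then start rabbitmq-server again so it builds a fresh database.
```

If the fresh start works and nothing in the old database is needed, the backup directory can be removed afterwards.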

Abe

The issue is related to the fact that Mnesia, which stores the queue and meta-data configuration of RabbitMQ, creates a database using the machine's hostname.

These hostname-based database directories are located under:

<rabbitmq_installdir>/var/lib/rabbitmq/mnesia/rabbit@<yourhostname>
<rabbitmq_installdir>/var/lib/rabbitmq/mnesia/rabbit@<yourhostname>-plugins-expanded

So deleting the two directories above and restarting RabbitMQ will work. If you migrated the RabbitMQ server from one host to another, you will have carried over the Mnesia database named after the former hostname. Simply renaming the directory to match the new hostname does not work, according to my tests.
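To check whether a migrated host is carrying a database from another hostname, something like the following can help. It is only a sketch: the function name list_foreign_mnesia_dirs is made up, and it assumes the directories follow the rabbit@<hostname> naming shown above and that the node uses the short hostname.

```shell
#!/bin/sh
# Sketch: list Mnesia directories that do not belong to the expected host.
# Assumes directory names of the form rabbit@<hostname> and
# rabbit@<hostname>-plugins-expanded; the function name is illustrative.
list_foreign_mnesia_dirs() {
    base=$1
    host=$2
    for d in "$base"/rabbit@*; do
        # Skip the unexpanded glob (no matches) and anything not a directory.
        [ -d "$d" ] || continue
        name=${d##*/}
        case "$name" in
            "rabbit@${host}"|"rabbit@${host}-plugins-expanded") ;;
            *) echo "$name" ;;
        esac
    done
}

# Example:
#   list_foreign_mnesia_dirs /var/lib/rabbitmq/mnesia "$(hostname -s)"
```

Any name it prints belongs to a different hostname than the one passed in, which is the situation described above.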

So, in case you need to preserve the queue structure, user accounts, and any other metadata defined for your RabbitMQ server, you need to keep a copy of such metadata.

There are two ways to export or import the metadata configuration:

  • Management plugin: activate the RabbitMQ management plugin and go to the URL server:15672. At the bottom of the main page there are two options, one to export and one to import the definitions.

  • Command line: rabbitmqadmin export rabbit.config (or import instead of export)

So, bottom line suggestions:

  • keep a current export of your queue structure/users/etc
  • when migrating servers, or going through recovery, delete the former directory structure (if the queued data is irrelevant) and re-import the original configuration/metadata.
  • if any persistent queued data is relevant, the best option is to set the hostname of the recovered host back to the original one and let the messages process/dequeue; you can then adjust the hostname again if needed.
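The first suggestion can be scripted roughly as below. This is a sketch, not a definitive backup procedure: it assumes the classic rabbitmqadmin script is on the PATH and the management plugin is enabled, and the function name export_definitions and the dated file name are illustrative.

```shell
#!/bin/sh
# Sketch: keep a dated export of the broker definitions (queues, users, ...).
# Assumes the classic rabbitmqadmin tool is on PATH and the management plugin
# is enabled. export_definitions and the file naming scheme are illustrative.
export_definitions() {
    dest_dir=$1
    out="$dest_dir/rabbit-definitions-$(date +%Y%m%d).json"
    # Print the file name only if the export succeeded.
    rabbitmqadmin export "$out" && echo "$out"
}

# Example:
#   export_definitions /var/backups
# Later, on the rebuilt broker:
#   rabbitmqadmin import /var/backups/rabbit-definitions-YYYYMMDD.json
```

Running the export from cron keeps a recent copy on hand; store it somewhere that survives the loss of the instance.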
gextra

Hi, I had a similar situation when I migrated from an AWS EC2 Small to a Large instance and needed to keep RabbitMQ running with the old mnesia DB files on the new instance, as they contained a lot of important delayed tasks and queue info. Below is the workaround I used. It avoids deleting the mnesia folder and preserves the data, so perhaps it can help someone.

The main problem is that your new machine has a new hostname, while the directory is named after the old one (just renaming the directory, as mentioned before, does not help). So we need to change the machine's hostname to make RabbitMQ work with the old files. Let "ip-0-0-0-0" be the old machine name (so there should be a mnesia folder /var/lib/rabbitmq/mnesia/ip-0-0-0-0), and let the new machine's hostname be something like "ip-1-1-1-1"; the new name does not matter, as we will overwrite it. Execute the following commands:

sudo -s
echo "127.0.0.1 ip-0-0-0-0" >> /etc/hosts 
echo "ip-0-0-0-0" > /etc/hostname
reboot

After the reboot your machine will have the old hostname again, and RabbitMQ should work with the old files.
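One caveat with the echo ... >> /etc/hosts line: running it twice adds the entry twice. A small sketch of an idempotent variant follows. The function name ensure_hosts_entry is made up, and the hosts file path is a parameter so it can be tried on a copy first.

```shell
#!/bin/sh
# Sketch: append "<ip> <name>" to a hosts file only if <name> is not already
# listed on a non-comment line. ensure_hosts_entry is an illustrative name.
ensure_hosts_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # awk exits 0 when some non-comment line has <name> as a hostname field.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file" 2>/dev/null; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (as root):
#   ensure_hosts_entry /etc/hosts 127.0.0.1 ip-0-0-0-0
```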