
I have a build of Apache Mesos done on a Red Hat 6.6 machine. It's installed to a directory I'll call $INST.

I set up $INST/etc/masters and $INST/etc/slaves to contain the master and slave hostnames, then set up $INST/etc/mesos-slave-env.sh like so:

export MESOS_work_dir=/path/to/some/directory/$HOSTNAME/work
export MESOS_log_dir=/path/to/some/directory/$HOSTNAME/log
export MESOS_master=masternodename:5050

And $INST/etc/mesos-master-env.sh is exactly the same but without MESOS_master defined.

/path/to/some/directory can be either shared by all nodes or unique to each node; the behavior I'm describing happens either way.
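
For clarity, $INST/etc/mesos-master-env.sh ends up looking roughly like this (the same two variables as the slave file, just without MESOS_master):

# same as mesos-slave-env.sh above, minus MESOS_master
export MESOS_work_dir=/path/to/some/directory/$HOSTNAME/work
export MESOS_log_dir=/path/to/some/directory/$HOSTNAME/log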

Then I run $INST/sbin/mesos-start-cluster.sh for the first time. It starts: I can open masternodename:5050 in Firefox and see the web UI, which shows all of the slaves attached.

However, if I run $INST/sbin/mesos-stop-cluster.sh to kill the cluster and then restart it with mesos-start-cluster.sh, it hangs forever. curl masternodename:5050 opens a connection to the port and then waits forever for data that never comes. The master log shows the following and never progresses past this point:

I1102 16:05:39.334799 27997 logging.cpp:172] INFO level logging started!
I1102 16:05:39.335925 27997 main.cpp:229] Build: 2015-11-02 20:29:24 by sbhaide
I1102 16:05:39.335942 27997 main.cpp:231] Version: 0.25.0
I1102 16:05:39.336308 27997 main.cpp:252] Using 'HierarchicalDRF' allocator
I1102 16:05:39.344976 27997 leveldb.cpp:176] Opened db in 7.787897ms
I1102 16:05:39.346916 27997 leveldb.cpp:183] Compacted db in 1.90886ms
I1102 16:05:39.347038 27997 leveldb.cpp:198] Created db iterator in 94694ns
I1102 16:05:39.347062 27997 leveldb.cpp:204] Seeked to beginning of db in 4003ns
I1102 16:05:39.347074 27997 leveldb.cpp:273] Iterated through 0 keys in the db in 513ns
I1102 16:05:39.347393 27997 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1102 16:05:39.351538 28017 recover.cpp:449] Starting replica recovery
I1102 16:05:39.352499 27997 main.cpp:465] Starting Mesos master
I1102 16:05:39.352665 28017 recover.cpp:475] Replica is in EMPTY status
I1102 16:05:39.356853 28023 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I1102 16:05:39.356978 28025 master.cpp:376] Master 6fa2ccac-3527-4522-a72d-8eeba06f55eb (xxxxxx.xxx.xxxxxxxx.xxx.xxx) started on 10.148.0.101:5050
I1102 16:05:39.357002 28025 master.cpp:378] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --log_dir="/path/to/log/directory/xxxxxxxx/log" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/install/directory/path/share/mesos/webui" --work_dir="..." --zk_session_timeout="10secs"
I1102 16:05:39.357393 28025 master.cpp:425] Master allowing unauthenticated frameworks to register
I1102 16:05:39.357405 28025 master.cpp:430] Master allowing unauthenticated slaves to register
I1102 16:05:39.357467 28025 master.cpp:467] Using default 'crammd5' authenticator
W1102 16:05:39.357502 28025 authenticator.cpp:505] No credentials provided, authentication requests will be refused
I1102 16:05:39.358242 28025 authenticator.cpp:512] Initializing server SASL
I1102 16:05:39.359158 28011 recover.cpp:195] Received a recover response from a replica in EMPTY status
I1102 16:05:39.360354 28029 recover.cpp:566] Updating replica status to STARTING
I1102 16:05:39.361856 28016 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.183548ms
I1102 16:05:39.361889 28016 replica.cpp:323] Persisted replica status to STARTING
I1102 16:05:39.362313 28014 recover.cpp:475] Replica is in STARTING status
I1102 16:05:39.363344 28014 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I1102 16:05:39.363711 28016 recover.cpp:195] Received a recover response from a replica in STARTING status
I1102 16:05:39.364202 28007 recover.cpp:566] Updating replica status to VOTING
I1102 16:05:39.364570 28029 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 222611ns
I1102 16:05:39.364594 28029 replica.cpp:323] Persisted replica status to VOTING
I1102 16:05:39.364678 28022 recover.cpp:580] Successfully joined the Paxos group
I1102 16:05:39.364972 28022 recover.cpp:464] Recover process terminated

(Data is anonymized somewhat)
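
(For anyone trying to reproduce the check: this is roughly how I probe the master; the URL and the timeout value are only illustrative.)

# first start: returns the web UI page; second start: connects, then times out with no data
curl -v --max-time 10 http://masternodename:5050/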

It works perfectly the first time I start it but hangs forever the second time, and I can't figure out why. It must be storing state somewhere, but lsof doesn't show me any files it could be touching on any node while it runs!
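
(This is roughly the check I ran on each node; the process names are just what the binaries in my install are called.)

# list files held open by the master and slave processes on a node
lsof -p "$(pgrep -d, -f mesos-master)"
lsof -p "$(pgrep -d, -f mesos-slave)"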

Any ideas where to look or what might be causing this?

1 Answer


The issue was that the system's entropy pool was getting exhausted and the master was blocking on a read from /dev/random. It was solved by compiling a new version of the cyrus-sasl library to use /dev/urandom instead of /dev/random, and linking my Mesos build against that.
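
A quick way to confirm this on a stuck master node is to check the kernel's available entropy while the master hangs; values near zero mean reads from /dev/random will block:

cat /proc/sys/kernel/random/entropy_avail

The rebuild itself was along these lines. Exact configure flags, install paths, and how your Mesos build locates libsasl2 will vary with versions, so treat this as a sketch rather than the exact commands:

# build cyrus-sasl so it draws randomness from /dev/urandom (non-blocking)
./configure --with-devrandom=/dev/urandom --prefix=/opt/cyrus-sasl   # prefix is illustrative
make && make install
# then rebuild/relink Mesos against the libsasl2 under /opt/cyrus-sasl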