OpsCenter 5.1.3 tells agents wrong host IP

Question

When adding new Cassandra nodes to the cluster, we also start the DataStax agent on each of them. After some time the agent is shown as no longer connected. Whenever we restart the agent, the following error is logged:

ERROR [Initialization] 2015-12-15 10:42:25,309 Can't connect to Cassandra, retrying soon.
 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.10.1:9042 (com.datastax.driver.core.TransportException: [/192.168.10.1:9042] Cannot connect))

The IP address 192.168.10.1 is the broadcast_address of the Cassandra node. The rpc_address is different, which is why the agent cannot connect to the node. The wrong IP is sent from OpsCenter to the agent, as the log file shows (reformatted for readability):

INFO [StompConnection receiver] 2015-12-15 10:42:23,492 Got new config from OpsCenter: { :cassandra_port 9042, :rollups300_ttl 7776000, :destinations [], :restore_req_update_period 1, :cassandra_rpc_interface "192.168.10.1", :rollups60_ttl 7776000, :thrift_port 9160, :ec2_metadata_api_host "169.254.169.254", :metrics_enabled 1, :backup_staging_dir "", :rollups7200_ttl 7776000, :ssl_keystore nil, :metrics_ignored_column_families "", :cassandra_log_location "/var/log/cassandra/system.log", :config_md5 "49a3234ff4e1eca80f3b2c2027ae5d9c", :jmx_port 7199, :provisioning 0, :use_ssl 1, :max_pending_repairs 5, :rollups86400_ttl -1, :api_port "61621", :storage_keyspace "OpsCenter", :hosts ["192.168.10.1"], :metrics_ignored_solr_cores "", :metrics_ignored_keyspaces "system, system_traces, system_auth, dse_auth, OpsCenter", :rollup_subscriptions [], :cassandra_install_location ""}
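
One way to confirm that the address pushed in :hosts is the problem is to test, from the agent's host, whether each candidate address accepts TCP connections on the native transport port. The following is a minimal sketch, not something the agent or OpsCenter provides: the function name can_connect is made up for illustration, and the default port 9042 is taken from :cassandra_port in the config above. It only checks that the port is open. It does not perform a CQL or SSL handshake.

```python
import socket


def can_connect(host, port=9042, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper for illustration. It only proves that something
    is listening on the port; it does not speak the CQL protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False
```

Calling can_connect("192.168.10.1") and then can_connect with the node's rpc_address should show which of the two addresses the native transport is actually bound to.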

After restarting OpsCenter and then the agent, the correct IP is sent to the agent and no more errors are logged.

While the node is joining, the system.peers table does not contain an entry for the new node. Once it has joined, the table contains the correct addresses.

How can we make new agents use the correct address (the rpc_address) without having to restart OpsCenter every time new nodes are added?

Update: Setting hosts in address.yaml doesn't work

I just tried setting hosts: ["192.168.8.1"] (the rpc_address of the node) in /var/lib/datastax-agent/conf/address.yaml. The behavior is exactly the same. It seems this value is overwritten by what OpsCenter provides:

INFO [main] 2015-12-22 08:55:15,207 Loading conf files: /var/lib/datastax-agent/conf/address.yaml
INFO [main] 2015-12-22 08:55:15,258 Java vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.8.0_45
INFO [main] 2015-12-22 08:55:15,258 DataStax Agent version: 5.1.3
INFO [main] 2015-12-22 08:55:15,282 Default config values: {... :agent_rpc_broadcast_address "192.168.10.1", ... :hosts ["192.168.8.1"]}
...
INFO [StompConnection receiver] 2015-12-22 08:55:21,015 Got new config from OpsCenter: {... :cassandra_rpc_interface "192.168.10.1", ... :hosts ["192.168.10.1"] ...}
...
ERROR [Initialization] 2015-12-22 08:55:22,926 Can't connect to Cassandra, retrying soon.
 com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /192.168.10.1:9042 (com.datastax.driver.core.TransportException: [/192.168.10.1:9042] Cannot connect))
...
WARN [Initialization] 2015-12-22 08:55:32,652 Resetting cluster because {:hosts ["192.168.8.1"]} changed to {:hosts ["192.168.10.1"], :local_interface "192.168.10.1"}
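
The override visible in the log above can also be detected automatically. The following is a rough sketch that assumes the agent log keeps the two message formats shown here ("Default config values:" and "Got new config from OpsCenter:"). The function names hosts_in and find_override are hypothetical, not part of any DataStax tool.

```python
import re

# Matches the `:hosts [...]` entry as it appears in the agent log lines above.
HOSTS_RE = re.compile(r':hosts\s+\[([^\]]*)\]')


def hosts_in(line):
    """Return the quoted addresses from the first `:hosts [...]` entry in a line."""
    match = HOSTS_RE.search(line)
    if match is None:
        return []
    return re.findall(r'"([^"]+)"', match.group(1))


def find_override(log_lines):
    """Return (local_hosts, pushed_hosts) if OpsCenter replaced the local hosts.

    Returns None when either log line is missing or both lists are equal.
    """
    local = pushed = None
    for line in log_lines:
        if "Default config values:" in line:
            local = hosts_in(line)
        elif "Got new config from OpsCenter:" in line:
            pushed = hosts_in(line)
    if local and pushed and local != pushed:
        return local, pushed
    return None
```

Feeding the lines of the agent log to find_override would, for the excerpt above, report that the locally configured 192.168.8.1 was replaced by 192.168.10.1.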

Summarizing to make sure I understand correctly: while nodes are joining the cluster, the agents are unable to connect without restarting opscenterd and the agent. Once a node has finished joining the cluster, everything works as expected. If this is correct, you can work around the issue by setting the hosts property in address.yaml yourself to point to the proper rpc_address (e.g. hosts: ["1.2.3.4"]). There is an internal ticket tracking improved behavior for bootstrapping nodes (OPSC-2543). Feel free to use this for future reference. - mbulman, Dec 16 '15 at 12:26
We restarted only the agent while a node was joining, and that didn't help. We never restarted OpsCenter itself while a node was joining. When we restarted both after a node had finished joining, everything worked. - Severin Leonhardt, Dec 17 '15 at 12:43
