We have two data centers (192.X.X.X and 10.X.X.X); gossip (port 7001) is allowed between them, but Thrift and the native protocol are not. OpsCenter runs on a node in the first data center (192.X.X.X).
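For completeness, this is roughly how I checked the port situation from the OpsCenter host. It is only a quick sketch: 9042 (native protocol) and 9160 (Thrift) are simply the Cassandra defaults, and 10.0.0.1 stands in for any node of the second data center.

import socket

def port_open(host, port, timeout=3.0):
    # True if a plain TCP connection to host:port succeeds within the timeout.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False

# From the OpsCenter node, only the gossip port of the remote DC is reachable:
for port in (7001, 9042, 9160):
    print(port, port_open('10.0.0.1', port))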
After upgrading from OpsCenter 5.1.3 to 5.2.0 on CentOS 6.6, the dashboard only shows "Cannot Connect to Cluster".
The opscenterd.log file shows repeated attempts to connect to the cluster.
It begins by connecting to a seed node:
2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: Connecting to cluster, contact points: ['192.168.0.100', '192.168.0.101']; protocol version: 2
2015-08-10 11:52:04+0200 [] DEBUG: Host 192.168.0.100 is now marked up
2015-08-10 11:52:04+0200 [] DEBUG: Host 192.168.0.101 is now marked up
2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: [control connection] Opening new connection to 192.168.0.100
2015-08-10 11:52:04+0200 [] INFO: Starting factory
2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: [control connection] Established new connection , registering watchers and refreshing schema and topology
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Refreshing node list and token map using preloaded results
The following part is repeated for each node in the other data center, and also for each node in the local data center that is not in the list of seed nodes:
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Found new host to connect to: 10.0.0.1
2015-08-10 11:52:05+0200 [Cluster_01] INFO: New Cassandra host 10.0.0.1 discovered
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Handling new host 10.0.0.1 and notifying listeners
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Not adding connection pool for new host 10.0.0.1 because the load balancing policy has marked it as IGNORED
2015-08-10 11:52:05+0200 [] DEBUG: Host 10.0.0.1 is now marked up
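That IGNORED line is exactly what I would expect with used_hosts_per_remote_dc = 0. As far as I understand, opscenterd uses the Python cassandra-driver under the hood; the following is only a sketch of my understanding (the DC and rack names are illustrative, and the Host construction details may differ between driver versions), showing why remote hosts never get a connection pool:

from cassandra.policies import DCAwareRoundRobinPolicy, SimpleConvictionPolicy, HostDistance
from cassandra.pool import Host

# Policy as it should follow from local_dc_pref / used_hosts_per_remote_dc.
policy = DCAwareRoundRobinPolicy(local_dc='DC1', used_hosts_per_remote_dc=0)

local_host = Host('192.168.0.100', SimpleConvictionPolicy)
local_host.set_location_info('DC1', 'RAC1')   # assumed DC/rack names
remote_host = Host('10.0.0.1', SimpleConvictionPolicy)
remote_host.set_location_info('DC2', 'RAC1')

# LOCAL -> a connection pool is opened; IGNORED -> "Not adding connection pool ..."
print(policy.distance(local_host) == HostDistance.LOCAL)     # True
print(policy.distance(remote_host) == HostDistance.IGNORED)  # True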
The log continues a bit until the control connection is closed:
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Finished fetching ring info
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Rebuilding token map due to topology changes
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Attempting to use preloaded results for schema agreement
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Schemas match
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] user types table not found
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Fetched schema, rebuilding metadata
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Control connection created
2015-08-10 11:52:05+0200 [] DEBUG: Initializing new connection pool for host 192.168.0.100
2015-08-10 11:52:05+0200 [] INFO: Starting factory
2015-08-10 11:52:05+0200 [] INFO: Starting factory
2015-08-10 11:52:05+0200 [] DEBUG: Finished initializing new connection pool for host 192.168.0.100
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Added pool for host 192.168.0.100 to session
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Shutting down Cluster Scheduler
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Not executing scheduled task due to Scheduler shutdown
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Shutting down control connection
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (46700368) to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (44407568) to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly. ]
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (47567568) to 192.168.0.100
2015-08-10 11:52:05+0200 [] INFO: Stopping factory
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly. ]
2015-08-10 11:52:05+0200 [] INFO: Stopping factory
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly. ]
2015-08-10 11:52:05+0200 [] INFO: Stopping factory
Then something strange happens: OpsCenter opens a new cluster connection using the first node of the other data center as its only contact point:
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Connecting to cluster, contact points: ['10.0.0.1']; protocol version: 2
2015-08-10 11:52:05+0200 [] DEBUG: Host 10.0.0.1 is now marked up
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Opening new connection to 10.0.0.1
2015-08-10 11:52:05+0200 [] INFO: Starting factory
2015-08-10 11:52:07+0200 [] TRACE: Sending heartbeat.
2015-08-10 11:52:10+0200 [Cluster_01] WARN: [control connection] Error connecting to 10.0.0.1: errors=Timed out creating connection, last_host=None
2015-08-10 11:52:10+0200 [Cluster_01] ERROR: Control connection failed to connect, shutting down Cluster: ('Unable to connect to any servers', {'10.0.0.1': OperationTimedOut('errors=Timed out creating connection, last_host=None',)})
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Shutting down Cluster Scheduler
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Shutting down control connection
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Not executing scheduled task due to Scheduler shutdown
2015-08-10 11:52:10+0200 [] WARN: No cassandra connection available for hostlist ['192.168.0.100', '192.168.0.101'] . Retrying.
This of course fails, since we do not allow clients to communicate across data centers.
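For what it's worth, the same timeout can be reproduced outside of OpsCenter with a plain driver connection whose only contact point is the remote node (a sketch assuming the Python cassandra-driver; authentication is left out here):

from cassandra.cluster import Cluster, NoHostAvailable

# Mirrors what opscenterd apparently does on its second attempt: a brand-new
# Cluster whose only contact point is the first node of the other data center.
cluster = Cluster(contact_points=['10.0.0.1'], protocol_version=2)
try:
    session = cluster.connect()
except NoHostAvailable as exc:
    # Expected from the 192.X.X.X side, because the native protocol port is
    # filtered between the data centers, e.g.
    # {'10.0.0.1': OperationTimedOut('errors=Timed out creating connection, ...')}
    print('native protocol to remote DC not reachable:', exc.errors)
finally:
    cluster.shutdown()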
Even with the following cluster configuration, OpsCenter still tries to connect to the other (wrong) data center:
[cassandra]
seed_hosts = 192.168.0.100,192.168.0.101
username = opscenter
password = XXX
local_dc_pref = DC1
used_hosts_per_remote_dc = 0
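At the driver level, my understanding is that these settings should translate into something like the following (just a sketch under the assumption that opscenterd builds its connection with the bundled Python cassandra-driver; this is not taken from the OpsCenter sources):

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy

cluster = Cluster(
    contact_points=['192.168.0.100', '192.168.0.101'],       # seed_hosts
    protocol_version=2,
    auth_provider=PlainTextAuthProvider(username='opscenter', password='XXX'),
    load_balancing_policy=DCAwareRoundRobinPolicy(
        local_dc='DC1',                  # local_dc_pref
        used_hosts_per_remote_dc=0),     # no connection pools to the other DC
)
session = cluster.connect()  # should only ever open pools inside DC1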
This setup worked without problems with every OpsCenter version before 5.2.0. Is it a new requirement in 5.2.0 that all nodes must be reachable from the OpsCenter host over the native protocol? Can't I tell OpsCenter to connect only to its local data center?