
We have two data centers (192.X.X.X and 10.X.X.X). Gossip (port 7001) works between them, but Thrift and the native protocol are blocked. OpsCenter runs on a node in the first data center (192.X.X.X).
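
For reference, here is a quick connectivity check I can run from the OpsCenter node (9042 and 9160 are the default native protocol and Thrift ports; 10.0.0.1 is one of the remote nodes):

import socket

def port_open(host, port, timeout=3):
    # Returns True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open('10.0.0.1', 7001))  # True:  gossip is reachable across DCs
print(port_open('10.0.0.1', 9042))  # False: native protocol is blocked
print(port_open('10.0.0.1', 9160))  # False: Thrift is blocked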

After updating from OpsCenter 5.1.3 to 5.2.0 on CentOS 6.6, the dashboard only shows "Cannot Connect to Cluster".

The opscenterd.log file shows repeated attempts to connect to the cluster.

It begins by connecting to a seed node:

2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: Connecting to cluster, contact points: ['192.168.0.100', '192.168.0.101']; protocol version: 2
2015-08-10 11:52:04+0200 [] DEBUG: Host 192.168.0.100 is now marked up
2015-08-10 11:52:04+0200 [] DEBUG: Host 192.168.0.101 is now marked up
2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: [control connection] Opening new connection to 192.168.0.100
2015-08-10 11:52:04+0200 []  INFO: Starting factory 
2015-08-10 11:52:04+0200 [Cluster_01] DEBUG: [control connection] Established new connection , registering watchers and refreshing schema and topology
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Refreshing node list and token map using preloaded results
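
These messages appear to come from the DataStax Python driver bundled with OpsCenter; a minimal sketch of the equivalent driver call (not OpsCenter's actual code) would be:

from cassandra.cluster import Cluster

# Contact points and protocol version exactly as they appear in the log above.
cluster = Cluster(contact_points=['192.168.0.100', '192.168.0.101'],
                  protocol_version=2)
session = cluster.connect()  # opens the control connection to the first reachable seed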

The following part is repeated for each node in the other data center, and also for each node in the local data center that is not in the list of seed nodes:

2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Found new host to connect to: 10.0.0.1
2015-08-10 11:52:05+0200 [Cluster_01]  INFO: New Cassandra host 10.0.0.1 discovered
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Handling new host 10.0.0.1 and notifying listeners
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Not adding connection pool for new host 10.0.0.1 because the load balancing policy has marked it as IGNORED
2015-08-10 11:52:05+0200 [] DEBUG: Host 10.0.0.1 is now marked up
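
As far as I understand, the IGNORED marking comes from the driver's DC-aware load balancing policy: with used_hosts_per_remote_dc set to 0, every host outside the local data center gets distance IGNORED and no connection pool. A small sketch against the driver's public API (FakeHost is just a stand-in for the driver's Host object, and DC2 is my guess for the remote DC name):

from cassandra.policies import DCAwareRoundRobinPolicy, HostDistance

# used_hosts_per_remote_dc=0 means: never open connections to remote DCs.
policy = DCAwareRoundRobinPolicy(local_dc='DC1', used_hosts_per_remote_dc=0)

class FakeHost:  # stand-in for cassandra.pool.Host; only .datacenter is consulted
    def __init__(self, address, datacenter):
        self.address = address
        self.datacenter = datacenter

print(policy.distance(FakeHost('192.168.0.100', 'DC1')) == HostDistance.LOCAL)  # True
print(policy.distance(FakeHost('10.0.0.1', 'DC2')) == HostDistance.IGNORED)     # True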

The log continues briefly until the control connection is closed:

2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Finished fetching ring info
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Rebuilding token map due to topology changes
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Attempting to use preloaded results for schema agreement
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Schemas match
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] user types table not found
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Fetched schema, rebuilding metadata
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Control connection created
2015-08-10 11:52:05+0200 [] DEBUG: Initializing new connection pool for host 192.168.0.100
2015-08-10 11:52:05+0200 []  INFO: Starting factory 
2015-08-10 11:52:05+0200 []  INFO: Starting factory 
2015-08-10 11:52:05+0200 [] DEBUG: Finished initializing new connection pool for host 192.168.0.100
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Added pool for host 192.168.0.100 to session
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Shutting down Cluster Scheduler
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Not executing scheduled task due to Scheduler shutdown
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Shutting down control connection
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (46700368) to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (44407568) to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly.
        ]
2015-08-10 11:52:05+0200 [] DEBUG: Closing connection (47567568) to 192.168.0.100
2015-08-10 11:52:05+0200 []  INFO: Stopping factory 
2015-08-10 11:52:05+0200 [] DEBUG: Closed socket to 192.168.0.100
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly.
        ]
2015-08-10 11:52:05+0200 []  INFO: Stopping factory 
2015-08-10 11:52:05+0200 [] DEBUG: Connect lost: [Failure instance: Traceback (failure with no frames): : Connection was closed cleanly.
        ]
2015-08-10 11:52:05+0200 []  INFO: Stopping factory 

Then something strange happens: a connection is attempted directly to the first node in the other data center:

2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: Connecting to cluster, contact points: ['10.0.0.1']; protocol version: 2
2015-08-10 11:52:05+0200 [] DEBUG: Host 10.0.0.1 is now marked up
2015-08-10 11:52:05+0200 [Cluster_01] DEBUG: [control connection] Opening new connection to 10.0.0.1
2015-08-10 11:52:05+0200 []  INFO: Starting factory 
2015-08-10 11:52:07+0200 [] TRACE: Sending heartbeat.
2015-08-10 11:52:10+0200 [Cluster_01]  WARN: [control connection] Error connecting to 10.0.0.1: errors=Timed out creating connection, last_host=None
2015-08-10 11:52:10+0200 [Cluster_01] ERROR: Control connection failed to connect, shutting down Cluster: ('Unable to connect to any servers', {'10.0.0.1': OperationTimedOut('errors=Timed out creating connection, last_host=None',)})
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Shutting down Cluster Scheduler
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Shutting down control connection
2015-08-10 11:52:10+0200 [Cluster_01] DEBUG: Not executing scheduled task due to Scheduler shutdown
2015-08-10 11:52:10+0200 []  WARN: No cassandra connection available for hostlist ['192.168.0.100', '192.168.0.101'] .  Retrying.

This fails, of course, since we don't allow clients to communicate across data centers.
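
For illustration, that failing step is roughly equivalent to building a fresh driver Cluster whose only contact point is the remote node; with the native protocol blocked between the data centers, this can only time out (a sketch, not OpsCenter's actual code):

from cassandra.cluster import Cluster, NoHostAvailable

# The only contact point is the remote node; the native protocol is blocked
# between the data centers, so connect() mirrors the OperationTimedOut above.
cluster = Cluster(contact_points=['10.0.0.1'], protocol_version=2)
try:
    cluster.connect()
except NoHostAvailable as exc:
    print(exc)  # ('Unable to connect to any servers', {'10.0.0.1': OperationTimedOut(...)})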

Even with the following cluster configuration, OpsCenter still tries to connect to the other (wrong) data center:

[cassandra]
seed_hosts = 192.168.0.100,192.168.0.101
username = opscenter
password = XXX
local_dc_pref = DC1
used_hosts_per_remote_dc = 0

This setup worked without problems in every OpsCenter version before 5.2.0. Is it a new requirement that all nodes be reachable from OpsCenter via the native protocol? Can't I tell OpsCenter to connect only to its local data center?

1 Answer


I can confirm your bug; it can be tracked as OPSC-6299 (sorry, there is no public bug tracker, but the ID can be used in communications with DataStax or in future ticket references).

The short of it is that OpsCenter should respect that load balancing policy (your configuration is valid), but in this case there's a bug.

Dio