aws cluster overprovisioning detection framework/tools

Question

Our team manages many Cassandra clusters on AWS. One of our problems is that when a user asks us to spawn a new cluster, they can't predict how many nodes they need because there is no production traffic yet. Most of the time the cluster ends up overprovisioned. Since we manage so many clusters for so many customers, it becomes hard over time to examine each of them by hand to decide whether it is overprovisioned.

Is there an open-source project or framework that tackles this problem in a scientific manner?

score 1 · Answer 1 · answered Sep 21 '16 at 14:29

Cassandra exposes many metrics via JMX that you can use to assess cluster load. One of the common open-source tools to monitor it is Graphite, and I've seen folks use collectd to feed JMX (and other data) to Graphite.

For DSE, OpsCenter collects and displays metrics automatically (disclaimer, I'm an OpsCenter developer so am biased).

I would avoid feeding triggers from these tools to AWS autoscaling groups until you have a very strong understanding of how your cluster load changes over time. You can cause a cluster outage by shrinking sloppily (taking multiple nodes out of a replica set without waiting for rebalancing to finish, until you can no longer meet your consistency level), or even by adding nodes (if a cluster is heavily loaded, adding nodes temporarily creates MORE load to replicate data to the new nodes, which can cause a cascading failure). Until you feel very confident about the conditions under which you should add/remove nodes, you can go a long way with manually triggered add/remove advised by automatic monitoring/alerting.
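
As a rough illustration of monitoring that only advises a manual decision, here is a minimal Python sketch. It assumes you have already pulled a per-cluster series of some load metric (for example CPU utilization in percent) out of Graphite or a similar store. The function name, the 30% threshold and the minimum sample count are hypothetical values chosen for illustration, not recommendations.

```python
from statistics import quantiles


def flag_overprovisioned(samples_by_cluster, p95_threshold=30.0, min_samples=100):
    """Return (cluster, p95) pairs for clusters that look overprovisioned.

    samples_by_cluster: dict mapping a cluster name to a list of load
    samples in percent (assumed to be collected elsewhere).
    The threshold and min_samples defaults are illustrative assumptions.
    """
    flagged = []
    for name, samples in sorted(samples_by_cluster.items()):
        if len(samples) < min_samples:
            continue  # too little history to judge this cluster
        # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile
        p95 = quantiles(samples, n=100)[94]
        if p95 < p95_threshold:
            flagged.append((name, round(p95, 1)))
    return flagged
```

The output is just a list for a human to review. Looking at a high percentile rather than the average is a deliberate choice in this sketch, so that a cluster with short but heavy peaks is not flagged as idle.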

score 0 · Answer 2 · edited May 23 '17 at 11:33

According to information I found with a quick Google search (one, two), Cassandra supports dynamically adding nodes to, and removing nodes from, a running cluster. Netflix has an interesting article here.

Based on this, you can probably find a way to use auto-scaling to change the number of nodes as demand changes. You may (or may not) need to create a custom metric based on some kind of Cassandra-specific information and have it sent up to CloudWatch, but the process should be reasonably straightforward otherwise. You might, for example, set a threshold for average CPU utilization across the cluster that has to be crossed before a node is added or removed. You probably need to be careful not to remove nodes too quickly, in case there's some rebalancing happening - auto-scaling does support this.
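
To make the threshold idea concrete, here is a small Python sketch of the decision logic only. It does not call AWS or Cassandra. The function name, the 70%/25% thresholds, the six-hour cooldown and the three-node minimum are all assumptions made up for this example, and it returns a recommendation rather than acting on it.

```python
from statistics import mean


def recommend(cpu_by_node, seconds_since_last_change,
              high=70.0, low=25.0, cooldown=6 * 3600, min_nodes=3):
    """Suggest "add", "remove" or "hold" from average CPU across the cluster.

    cpu_by_node: dict mapping node name to CPU utilization in percent.
    All thresholds are illustrative assumptions, not tuned values.
    """
    # Do nothing while a previous change may still be rebalancing.
    if seconds_since_last_change < cooldown:
        return "hold"
    avg = mean(cpu_by_node.values())
    if avg > high:
        return "add"
    # Never suggest shrinking below the assumed minimum cluster size.
    if avg < low and len(cpu_by_node) > min_nodes:
        return "remove"
    return "hold"
```

For example, `recommend({"n1": 80, "n2": 90, "n3": 85}, 10 * 3600)` suggests adding a node, while the same call made one minute after the last change holds.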

However, I have no experience with Cassandra, so I could be completely wrong, and if I am I'm sure someone will correct me. I hope that the thoughts give you some ideas that you can research and develop yourself.
