
We're currently running a 24-node Cassandra cluster in production that holds 30 TB of data and handles an average live load of 100k requests per minute, 24/7. We support multiple partners. One of our partners is leaving our org, so we have to filter out their data (around 6 TB) and migrate it to a cluster of their own. We wrote Apache Spark utilities in Java to perform the migration.
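For context, a minimal sketch of what our migration utility does, using the spark-cassandra-connector DataFrame API. The keyspace, table, seed hosts, and the `partner_id` column/value are hypothetical placeholders; the real schema differs:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartnerMigration {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partner-data-migration")
                .getOrCreate();

        // Read from the live (source) cluster; the connection host is set
        // per-read so one job can talk to both clusters.
        Dataset<Row> partnerRows = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("spark.cassandra.connection.host", "source-seed-1")
                .option("keyspace", "prod_ks")          // hypothetical keyspace
                .option("table", "partner_events")      // hypothetical table
                .load()
                .filter("partner_id = 'PARTNER_X'");    // hypothetical column/value

        // Write the filtered rows to the partner's new cluster.
        partnerRows.write()
                .format("org.apache.spark.sql.cassandra")
                .option("spark.cassandra.connection.host", "target-seed-1")
                .option("keyspace", "partner_ks")       // hypothetical keyspace
                .option("table", "partner_events")
                .mode("append")
                .save();

        spark.stop();
    }
}
```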

We submitted the Spark job on a Spark cluster with 1 master and 3 workers (r4.4xlarge EC2 instances), but it affected our live load: we saw a significant number of writes timing out, so we had to stop the job. (The same job worked fine in our staging environment, which holds 10 TB of data under a live load of 20k requests per minute.)

How can we implement this job so that it does not put a massive load on the live Cassandra cluster? What would be an ideal number of workers, cores, and memory for the Spark cluster?
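One direction we're considering is the connector's own throttling knobs. A hedged sketch of a throttled session: the parameter names are taken from the connector's reference doc (linked in the comment below), but the values are illustrative guesses that would need benchmarking against the live cluster's headroom:

```java
import org.apache.spark.sql.SparkSession;

public class ThrottledSession {
    public static SparkSession create() {
        return SparkSession.builder()
                .appName("partner-data-migration-throttled")
                // Cap write throughput per executor core, in MB/s
                // (effectively unlimited by default).
                .config("spark.cassandra.output.throughputMBPerSec", "1")
                // Cap read requests/pages per executor core per second.
                .config("spark.cassandra.input.readsPerSec", "100")
                // Fewer in-flight write batches per task (default is 5).
                .config("spark.cassandra.output.concurrent.writes", "2")
                // Smaller read pages (default is 1000 rows).
                .config("spark.cassandra.input.fetch.sizeInRows", "500")
                .getOrCreate();
    }
}
```

Since both caps are per core, the total pressure on Cassandra scales with the number of executor cores, so presumably shrinking the core count at submit time (e.g. via `--total-executor-cores`) throttles the job further, independent of worker count or memory.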

Mano
  • https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters – Alex Ott May 05 '21 at 17:20

0 Answers