
I'm investigating moving some large DBs from MySQL to Cassandra and I'm trying to figure out how to plan the cluster. Historically, one would just buy enough disks to contain the relevant data, but I'm not clear on how Cassandra uses disk space vs. RAM.

In planning a cluster, the questions of how many machines, and how much disk and RAM per machine, will come up. How do I answer these for 1 TB? 10 TB? More?

ethrbunny

2 Answers


Capacity planning really is a science (in terms of math/statistics), but mathematical models alone won't get you anywhere here. You really have to set up a test bed that can answer your questions, because nobody here can give you the theoretical model you seem to be asking for.

How to answer this:

  1. Get a (scalable) testbed
  2. Populate it with your data
  3. Write appropriate load-generation tools (a minimal sketch follows below)
  4. Apply load and measure
  5. Measure again and run a sanity check on your results
  6. Optionally tune, then go back to step 3 or 4

or hire a professional.
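For step 3, here is one minimal way such a load-generation tool could start out, using the DataStax Python driver (cassandra-driver). The contact point, keyspace, table, row count, and payload size are all placeholder assumptions; substitute your own schema and a workload that mirrors your real read/write mix:

    import time
    from cassandra.cluster import Cluster   # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])          # placeholder contact point
    session = cluster.connect("test_ks")      # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO events (id, payload) VALUES (?, ?)")  # hypothetical table

    rows = 100000                             # adjust to a realistic volume
    start = time.time()
    for i in range(rows):
        session.execute(insert, (i, "x" * 1024))            # ~1 KB per row
    elapsed = time.time() - start

    print("wrote %d rows in %.1f s (%.0f rows/s)" % (rows, elapsed, rows / elapsed))
    cluster.shutdown()

After a run, flush and compact (nodetool flush / nodetool compact) before measuring the on-disk size, so you look at settled sstables rather than commit logs and unmerged writes.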

pfo
  • Is it possible to answer strictly on a size basis? I.e., to store 500 GB on 10 servers you would need x total disk space – ethrbunny Aug 10 '11 at 18:04
  • You can't answer this question properly solely in terms of disk capacity. – pfo Aug 10 '11 at 18:11

Basically, the formula for disk per node is (D x RF x O x C) / N, with the variables defined below:

  • D is your overall data size.
  • RF is your replication factor. Most clusters use at least 2 (for durability) or 3 (for combined durability and availability at CL=Quorum).
  • N is the number of nodes in your cluster. This has to be at least RF. You'll also want to increase this number until you get to a comfortable "disk per node" result.
  • O is an overhead multiplier for indexes and unmerged sstables on disk. I would use at least an O=2 factor here unless you have almost no indexes and extremely stable data.
  • C is the compression factor if you're on Cassandra 1.0+ and enable compression: roughly the fraction of the original size that remains after gzipping a file with representative content. Use C=1 if compression is disabled. If compression tends to cut the size of your data in half, try C=0.6 or so, because compression isn't applied to everything (for example, indexes).

Once you've gotten some numbers, you should target a "disk per node" that's no more than 30% of the available local storage so that you don't have to immediately grow the cluster and so snapshots are possible.
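To make that concrete, here is a small back-of-the-envelope calculation; the inputs (10 TB of raw data, RF=3, 12 nodes, O=2, C=0.6) are illustrative assumptions, not recommendations:

    # Illustrative numbers only -- substitute your own measurements.
    D, RF, N, O, C = 10.0, 3, 12, 2.0, 0.6   # TB of data, replication, nodes, overhead, compression

    disk_per_node = D * RF * O * C / N        # data each node actually holds: 3.0 TB
    raw_disk_per_node = disk_per_node / 0.30  # stay under ~30% of local storage: 10.0 TB

    print("disk per node:      %.1f TB" % disk_per_node)
    print("raw disk per node:  %.1f TB" % raw_disk_per_node)

So in this hypothetical cluster, each of the 12 nodes would want roughly 10 TB of local disk to stay inside the 30% guideline.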

Memory planning depends much more on what your schema looks like, but you'll want at least 4 GB devoted to Cassandra on each node. The OS will use anything beyond that for highly beneficial disk caching. Additional memory only stops paying off once it substantially exceeds the amount of data actually resident on disk.
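Following that logic, a deliberately crude sketch of the useful-RAM ceiling per node (reusing the assumed 3 TB-per-node figure from the disk example above):

    heap_gb = 4.0                        # minimum to devote to Cassandra itself
    data_per_node_gb = 3.0 * 1024        # assumed data resident on disk per node

    # The OS uses everything above the heap as page cache; extra RAM stops
    # paying off roughly once heap + cache could hold all resident data.
    useful_ram_ceiling_gb = heap_gb + data_per_node_gb
    print("RAM much beyond ~%.0f GB per node buys little" % useful_ram_ceiling_gb)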