Basically, the formula for disk per node is D x RF x O x C / N, with the variables defined below:
- D is your overall data size.
- RF is your replication factor. Most clusters use at least 2 (for durability) or 3 (for combined durability and availability at CL=Quorum).
- N is the number of nodes in your cluster. This has to be at least RF. You'll also want to increase this number until you get to a comfortable "disk per node" result.
- O is an overhead multiplier for indexes and uncompacted SSTables on disk. I would use at least O=2 here unless you have almost no indexes and extremely stable data.
- C is the factor you'll save with Cassandra 1.0+ compression support, assuming you enable it. This will be approximately the savings you get from gzipping a file with representative content. Use C=1 if compression is disabled. If compression tends to cut the size of your data in half, try C=0.6 or so, since compression isn't applied to everything (indexes, for example).
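To make the arithmetic concrete, here's a quick sketch of the formula with hypothetical inputs (the 2 TB data size, 12-node cluster, and other numbers below are illustrative assumptions, not from the text):

```python
# Hypothetical capacity-planning inputs -- adjust to your own cluster.
D = 2000   # overall data size, in GB (assumed: ~2 TB)
RF = 3     # replication factor (durability + availability at quorum)
N = 12     # number of nodes (must be >= RF)
O = 2      # overhead multiplier for indexes and uncompacted SSTables
C = 0.6    # compression factor (1.0 = compression disabled)

# disk per node = D x RF x O x C / N
disk_per_node = D * RF * O * C / N
print(f"disk per node: {disk_per_node:.0f} GB")  # prints "disk per node: 600 GB"
```

If 600 GB per node is more than you're comfortable with, increase N and recompute until the result fits your hardware.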
Once you've gotten some numbers, target a "disk per node" that's no more than 30% of the available local storage, so you don't have to immediately grow the cluster and so snapshots remain possible.
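The 30% guideline can also be inverted to size the raw disks. A minimal sketch, assuming a hypothetical 600 GB per-node result from the formula:

```python
disk_per_node = 600      # GB -- assumed output of the disk-per-node formula
target_fraction = 0.30   # keep Cassandra data under 30% of local storage

# Minimum raw local storage each node should be provisioned with.
min_local_storage = disk_per_node / target_fraction
print(f"provision at least {min_local_storage:.0f} GB of local storage per node")
```

The remaining ~70% is headroom for data growth, compaction, and snapshots.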
Memory planning depends much more on what your schema looks like, but you'll want at least 4GB devoted to Cassandra on each node. The OS will use anything beyond that for highly beneficial disk caching. Additional memory only stops paying off once it substantially exceeds the amount of data actually resident on disk.