After much pleading and case building, my group got the budget to buy 4 nodes and start a Cassandra cluster. Each machine has 3 x 1 TB drives, so I'm wondering whether it's reasonable to skip Cassandra's replication (i.e. run with a replication factor of 1) and mirror the data drive(s) instead.

The data will be backed up so that's not an issue.

It seems like the likelihood of losing a machine in such a small cluster is very low.

Is this reasonable or am I missing some larger issue / factor?

ethrbunny

1 Answer

It really depends on what you are using Cassandra for. Are you using it for availability of your data, partitioning of your data, or both? From the sound of it, you are using it mainly to partition your data so you can scale out.

Part of the reason to replicate your data in Cassandra is availability. With a 4-node cluster and a replication factor of 3, for example, you could survive the loss of one node without any maintenance at QUORUM consistency level, or the loss of two nodes at ONE. On the other hand, each of your nodes would hold 75% of the data on the cluster, which is probably something you were hoping to avoid. This is why I would try to plead for another server or two, although you may not need them right away and can add more servers as your data needs increase.
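To make the arithmetic behind those numbers concrete, here is a minimal sketch. The function names are my own, and the per-node share assumes token ranges are evenly balanced across the cluster, which a real cluster only approximates.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerable_replica_losses(replication_factor: int, required: int) -> int:
    """Replicas of a row that can be down while `required` can still respond."""
    return replication_factor - required


def data_share_per_node(replication_factor: int, nodes: int) -> float:
    """Approximate fraction of the full data set stored on each node.

    Assumes evenly balanced token ranges (an idealisation).
    """
    return min(replication_factor, nodes) / nodes


rf, nodes = 3, 4
print(quorum(rf))                                # 2
print(tolerable_replica_losses(rf, quorum(rf)))  # 1 node down at QUORUM
print(tolerable_replica_losses(rf, 1))           # 2 nodes down at ONE
print(data_share_per_node(rf, nodes))            # 0.75
print(data_share_per_node(rf, 6))                # 0.5 with two more servers
```

The last line shows why adding a server or two helps: with the same replication factor, each node's share of the data drops from 75% to 50%.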

While you mention that losing a machine is unlikely, running with a replication factor of 1 is asking for trouble in my opinion. You may never run into problems, but when you do, it will not be fun. If you were using one giant server for your database, it would seem less likely to fail than any 1 of 4 individual servers, right?
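A rough way to see this, assuming each server fails independently with the same probability over some period (the 5% figure below is purely hypothetical, not a measured failure rate):

```python
def p_any_node_fails(p_single: float, nodes: int) -> float:
    """Probability that at least one of `nodes` machines fails,
    assuming independent failures with probability `p_single` each."""
    return 1.0 - (1.0 - p_single) ** nodes


p = 0.05  # hypothetical per-server failure probability for the period
print(round(p_any_node_fails(p, 1), 4))  # 0.05   (one big server)
print(round(p_any_node_fails(p, 4), 4))  # 0.1855 (any 1 of 4 servers)
```

With a replication factor of 1, any one of those failures makes part of the data unavailable, so the second number is the one that matters.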

There are also other things that can cause a Cassandra node to fail or become unresponsive (OS faults, garbage collection pauses, networking issues, etc.).

When you start using a distributed database, fault tolerance should become more of a concern than with a traditional single-database setup, and Cassandra focuses on and excels at this.

I have been in situations where it was difficult to justify hardware purchases and environment configuration to management. The best way to get them to understand the implications is to outline a failure scenario and ask whether or not it is acceptable, for example:

If one server has a hardware failure, its data gets corrupted, or Cassandra crashes, how long can we tolerate downtime?

If the answer is '0 minutes', you will want a replication factor of at least 3. There are other benefits as well: with a replication factor of 3, more nodes can service an individual read request, potentially improving read performance.

Additionally, mirroring (RAID 1) is considered a bit of an anti-pattern for your Cassandra data (although it's not a bad idea for the commit log). It would be better to use RAID 0 or multiple data directories, set your replication factor to 3, and let Cassandra take care of redundancy for you.
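The replication factor is set per keyspace. As a small sketch, the helper below only builds the CQL statement as a string; the helper name and the `metrics` keyspace are hypothetical, and `SimpleStrategy` is assumed here because the question describes a single small cluster (a multi-datacenter setup would normally use `NetworkTopologyStrategy` instead).

```python
def create_keyspace_cql(keyspace: str, replication_factor: int = 3) -> str:
    """Build a CREATE KEYSPACE statement with the given replication factor.

    Hypothetical helper: it only formats the CQL text, it does not run it.
    """
    if not keyspace.isidentifier():
        raise ValueError(f"unexpected keyspace name: {keyspace!r}")
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {{'class': 'SimpleStrategy', "
        f"'replication_factor': {replication_factor}}};"
    )


# 'metrics' is a made-up keyspace name for illustration.
print(create_keyspace_cql("metrics"))
```

The printed statement can be pasted into cqlsh or passed to whichever driver you use.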

Andy Tolbert