
I'm putting together a 16-node Cassandra cluster (replication factor 2) and want to set up a schedule for nodetool repair. gc_grace_seconds is at the default.

Two questions:

  1. My first impulse is to set up a cron job on each machine and manually randomize the timing around a one-week schedule. Is there a better way?
  2. Does nodetool repair have to be run on every node, or only on (number of nodes / replication factor) of them? (I.e., for my 16 nodes with replication factor 2, on 8 nodes: one of each pair.)
ethrbunny

1 Answer


I would not randomize it. Your best bet is to schedule the repairs so they don't stomp on each other.

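You should use the -pr option when running repair, and run it on every node: with -pr each node repairs only its primary range, so skipping nodes leaves ranges unrepaired.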
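As a rough illustration of a non-overlapping schedule, the sketch below spreads one weekly `nodetool repair -pr` run per node evenly across seven days and prints a crontab entry for each. The host names and the one-week cycle are assumptions for the example, not something from the cluster in question. A weekly cycle stays inside the default gc_grace_seconds of 10 days, but the spacing only avoids overlap if each repair finishes before the next one starts, so measure your actual repair times first.

```python
# Sketch: spread one weekly "nodetool repair -pr" run per node evenly over 7 days.
# Host names and the one-week cycle are illustrative assumptions.

MINUTES_PER_WEEK = 7 * 24 * 60


def weekly_repair_slots(nodes):
    """Return {node: (day_of_week, hour, minute)} with evenly spaced start times."""
    step = MINUTES_PER_WEEK // len(nodes)  # minutes between consecutive starts
    slots = {}
    for i, node in enumerate(nodes):
        offset = i * step
        day, rest = divmod(offset, 24 * 60)  # cron day of week: 0 = Sunday
        hour, minute = divmod(rest, 60)
        slots[node] = (day, hour, minute)
    return slots


def cron_line(day, hour, minute, command="nodetool repair -pr"):
    """Format one crontab entry: minute hour day-of-month month day-of-week."""
    return f"{minute} {hour} * * {day} {command}"


if __name__ == "__main__":
    nodes = [f"cass{n:02d}" for n in range(1, 17)]  # hypothetical host names
    for node, (day, hour, minute) in weekly_repair_slots(nodes).items():
        print(f"{node}: {cron_line(day, hour, minute)}")
```

With 16 nodes this gives a new repair start every 10.5 hours. Each printed line belongs in the crontab of the node named in front of it.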

If you're using Cassandra 2.1, you have the option of incremental repair, which will speed things up considerably.

RF=2 is also a recipe for disaster. With two replicas a quorum is both of them, so QUORUM queries will fail if either replica is unavailable. I recommend RF=3.
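The arithmetic behind that recommendation is short enough to check by hand. A quorum is floor(RF / 2) + 1 replicas, and the function names below are just illustrative:

```python
def quorum(replication_factor):
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (2, 3):
        print(f"RF={rf}: quorum={quorum(rf)}, tolerates {tolerated_failures(rf)} down")
```

RF=2 and RF=3 both need two replicas to answer, but only RF=3 leaves a spare.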

Jon Haddad
  • On a small cluster (12-16 nodes) on new, reasonable hardware, will there truly be failures that often? – ethrbunny Dec 17 '14 at 22:49
  • It's not just about node failures. It's about cluster configuration changes, restarts, network partitions, power failure, rack failure. Additionally, as I mentioned, if you're using QUORUM - your queries will fail if only one node goes down. – Jon Haddad Dec 18 '14 at 04:38
  • @JonHaddad what is the purpose of using the -pr option? – Selvam Palanimalai Jan 27 '15 at 09:45
  • From "nodetool help repair": -pr, --partitioner-range. Use -pr to repair only the first range returned by the partitioner. Otherwise you initiate a repair on the entire cluster. – Jon Haddad Jan 27 '15 at 23:00
  • In the last line of your answer, should that say "RF=2 is a recipe for disaster..." as indicated by OP's RF? I've personally been burned by setting RF=2 and QUORUM CL. – BeepBoop Mar 06 '15 at 21:37