I am running a 6-node cluster using Apache Cassandra 2.1.2 with DataStax OpsCenter 5.0.2, launched from the AWS EC2 AMI "DataStax Auto-Clustering AMI 2.5.1-hvm" (DataStax Community AMI). When I run a repair on the rollups60 column family in the OpsCenter keyspace, the Cassandra system log shows errors about failed snapshot creation. The repair appears to keep going, but it hasn't finished yet.
I am wondering whether this is making the repair ineffectual, or whether I can expect it to finish at all.
I am running the command
nodetool repair OpsCenter rollups60
on one of the nodes (10.63.74.70). From the command, I've gotten this output so far:
[2015-01-23 19:36:06,261] Starting repair command #9, repairing 511 ranges for keyspace OpsCenter (seq=true, full=true)
And here is an example of what I see in the log:
INFO [AntiEntropyStage:1] 2015-01-23 19:38:28,235 RepairSession.java:171 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Received merkle tree for rollups60 from /10.63.74.70
INFO [AntiEntropySessions:9] 2015-01-23 19:38:28,236 RepairSession.java:260 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (5848435723460298978,5868916338423419522] for OpsCenter.[rollups60]
INFO [RepairJobTask:3] 2015-01-23 19:38:28,237 Differencer.java:74 - [repair #138b42e0-a337-11e4-9e78-37e5027a626b] Endpoints /10.13.157.190 and /10.63.74.70 have 1 range(s) out of sync for rollups60
INFO [AntiEntropyStage:1] 2015-01-23 19:38:28,237 ColumnFamilyStore.java:840 - Enqueuing flush of rollups60: 465365 (0%) on-heap, 0 (0%) off-heap
INFO [MemtableFlushWriter:25] 2015-01-23 19:38:28,238 Memtable.java:325 - Writing Memtable-rollups60@204861223(51960 serialized bytes, 1395 ops, 0%/0% of on/off-heap limit)
INFO [RepairJobTask:3] 2015-01-23 19:38:28,239 StreamingRepairTask.java:68 - [streaming task #138b42e0-a337-11e4-9e78-37e5027a626b] Performing streaming repair of 1 ranges with /10.13.157.190
INFO [MemtableFlushWriter:25] 2015-01-23 19:38:28,262 Memtable.java:364 - Completed flushing /raid0/cassandra/data/OpsCenter/rollups60-445613507ca411e4bd3f1927a2a71193/OpsCenter-rollups60-ka-331933-Data.db (29998 bytes) for commitlog position ReplayPosition(segmentId=1422038939094, position=31047766)
ERROR [RepairJobTask:2] 2015-01-23 19:38:39,067 RepairJob.java:127 - Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /10.63.74.70
at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:347) ~[apache-cassandra-2.1.2.jar:2.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
INFO [AntiEntropySessions:10] 2015-01-23 19:38:39,068 RepairSession.java:260 - [repair #6dec29c0-a337-11e4-9e78-37e5027a626b] new session: will sync /10.63.74.70, /10.51.180.16 on range (-6918744323658665195,-6916171087863528821] for OpsCenter.[rollups60]
ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,068 RepairSession.java:303 - [repair #67772db0-a337-11e4-9e78-37e5027a626b] session completed with the following error
java.io.IOException: Failed during snapshot creation.
at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,070 CassandraDaemon.java:153 - Exception in thread Thread[AntiEntropySessions:9,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.2.jar:2.1.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.io.IOException: Failed during snapshot creation.
at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.2.jar:2.1.2]
at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.2.jar:2.1.2]
at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
... 3 common frames omitted
These errors are repeated many times. The IP address 10.63.74.70 in the log is the node I'm running the repair from. I can repair the other OpsCenter column families, and those repairs complete quickly without error.
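To put a number on "many times", a rough sketch like the one below could tally which repair sessions failed and which token ranges they covered. It is written in Python and assumes the log lines look exactly like the excerpt above (the "new session: will sync ... on range" and "session completed with the following error" messages). The names `failed_ranges` and `SAMPLE` are made up for illustration and are not part of Cassandra or nodetool.

```python
import re

# Patterns assume the Cassandra 2.1.2 log format shown in the excerpt above.
NEW_SESSION = re.compile(
    r"\[repair #([0-9a-f-]+)\] new session: will sync (.+) on range "
    r"\((-?\d+),(-?\d+)\]"
)
FAILED_SESSION = re.compile(
    r"\[repair #([0-9a-f-]+)\] session completed with the following error"
)


def failed_ranges(lines):
    """Return (session_id, (start_token, end_token)) for each failed session.

    The range is None if the matching "new session" line was not seen.
    """
    ranges = {}   # session id -> token range announced at session start
    failed = []   # failed sessions, in the order they were logged
    for line in lines:
        started = NEW_SESSION.search(line)
        if started:
            ranges[started.group(1)] = (int(started.group(3)), int(started.group(4)))
            continue
        errored = FAILED_SESSION.search(line)
        if errored:
            session_id = errored.group(1)
            failed.append((session_id, ranges.get(session_id)))
    return failed


# Three lines copied from the log excerpt above (hypothetical test input).
SAMPLE = [
    "INFO [AntiEntropySessions:9] 2015-01-23 19:38:28,236 RepairSession.java:260 - "
    "[repair #67772db0-a337-11e4-9e78-37e5027a626b] new session: will sync "
    "/10.63.74.70, /10.51.180.16 on range "
    "(5848435723460298978,5868916338423419522] for OpsCenter.[rollups60]",
    "INFO [AntiEntropySessions:10] 2015-01-23 19:38:39,068 RepairSession.java:260 - "
    "[repair #6dec29c0-a337-11e4-9e78-37e5027a626b] new session: will sync "
    "/10.63.74.70, /10.51.180.16 on range "
    "(-6918744323658665195,-6916171087863528821] for OpsCenter.[rollups60]",
    "ERROR [AntiEntropySessions:9] 2015-01-23 19:38:39,068 RepairSession.java:303 - "
    "[repair #67772db0-a337-11e4-9e78-37e5027a626b] session completed with the "
    "following error",
]

for session_id, token_range in failed_ranges(SAMPLE):
    print(session_id, token_range)
```

Passing the lines of the real system log instead of `SAMPLE` (for example, `failed_ranges(open(path))`, where `path` is wherever the system log lives on the node) would show how many of the 511 ranges reported a failed session. That might help judge whether the repair is doing any useful work.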
I have also tried creating a snapshot manually with the following command, and it completes successfully with nothing logged:
nodetool snapshot OpsCenter
The disk has plenty of space left. Are these errors a problem? Should I just let the repair continue for however long it takes? No application is currently using the cluster, but it does show some load while the repair is running (it has no load when I'm not repairing), so the repair appears to be doing work.
Thanks for any help.
BTW, there is no datastax-community tag here, so I have to use the datastax-enterprise one.