We are preparing to implement our first Hadoop cluster, so we are starting out small with a four-node setup (1 master node and 3 worker nodes). Each node will have 6 TB of storage (6 x 1 TB disks). We went with a SuperMicro 4-node chassis so that all four nodes share a single 4U box.
We are now looking at how to back up this solution for disaster recovery (think rack or site loss, not drive loss). The best option seems to be a cluster-to-cluster copy, though I've also read about people copying data to a NAS or SMB share. We are going to back up the master node via traditional backup means, so I'm only concerned about the HDFS data here. Here are my questions:
1) For the cluster-to-cluster copy, can I set up a SINGLE-node cluster with a large amount of storage to act as my off-site replica? I don't care about its performance, just its existence and its ability to hold the entire dataset (restore times aren't a concern, as this cluster isn't mission critical). Can the copy be scheduled so that it only runs once a day? A rough sketch of what I'm picturing is below.
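In case it helps frame the question, this is a minimal sketch of what I have in mind, assuming distcp is the right tool for cluster-to-cluster copies and that the replica cluster's NameNode is reachable at a hypothetical hostname backup-nn (the source path and hostnames are placeholders, not our real setup). The idea is that cron kicks the script off once a day:

```python
#!/usr/bin/env python3
# Sketch of a once-a-day cluster-to-cluster copy using distcp.
# Assumptions: "hadoop" is on PATH on the primary cluster's master node,
# and "backup-nn" is a hypothetical hostname for the NameNode of the
# single-node off-site replica. Scheduled via cron, e.g.:
#   0 1 * * * /opt/scripts/hdfs_dr_copy.py >> /var/log/hdfs_dr_copy.log 2>&1

import subprocess
import sys
from datetime import datetime

SOURCE = "hdfs://master-nn:8020/data"   # dataset to protect (hypothetical path)
TARGET = "hdfs://backup-nn:8020/data"   # off-site single-node replica

def main():
    print(f"{datetime.now().isoformat()} starting distcp {SOURCE} -> {TARGET}")
    # -update copies only files that are new or have changed, and -delete
    # removes files on the target that no longer exist on the source, so
    # repeated daily runs stay incremental rather than recopying everything.
    result = subprocess.run(
        ["hadoop", "distcp", "-update", "-delete", SOURCE, TARGET]
    )
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```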
2) For the SMB or NAS option, how does this work? Does the target disk need to be formatted as HDFS? Would I need to back up each of the three worker nodes in their entirety, or is there some intelligent script out there that can back up the dataset without the replication overhead? I'm not very familiar with this approach; I've only seen references to it online and haven't had much luck locating resources or information. My rough guess at how it might work is sketched below.
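My guess (and please correct me if this is the wrong mental model) is that the NAS approach just means pulling a single logical copy of the dataset out of HDFS onto an ordinary filesystem, so the NAS would not need to be formatted as HDFS and the 3x replication would stay inside the cluster. Something like this sketch, where /mnt/nas_backup is a hypothetical mount point for the NAS/SMB share and /data is a placeholder for the dataset:

```python
#!/usr/bin/env python3
# Sketch of exporting one logical copy of the HDFS dataset to a mounted
# NAS/SMB share. Assumptions: the share is mounted at /mnt/nas_backup on
# the node running this script, and "hdfs" is on PATH. Only one copy of
# each file comes out; HDFS replication is not carried along.

import subprocess
import sys

HDFS_PATH = "/data"                  # dataset to protect (hypothetical path)
NAS_PATH = "/mnt/nas_backup/data"    # hypothetical NAS/SMB mount point

def main():
    # "hdfs dfs -copyToLocal -f" copies the directory tree recursively and
    # overwrites existing files, so a daily run refreshes the copy instead
    # of failing on the second pass.
    result = subprocess.run(
        ["hdfs", "dfs", "-copyToLocal", "-f", HDFS_PATH, NAS_PATH]
    )
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```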
I'm also open to any other DR options for Hadoop HDFS. Our goal is to obtain a full copy of the HDFS dataset so that we could use it to recover from a rack or site loss.
Thanks!