
We are preparing to implement our first Hadoop cluster, so we are starting small with a four-node setup (1 master node and 3 worker nodes). Each node will have 6TB of storage (6 x 1TB disks). We went with a SuperMicro 4-node chassis so that all four nodes share a single 4U box.

We are now looking at how to back up this solution for disaster recovery (think rack or site loss, not drive loss). The best solution seems to be a cluster-to-cluster copy, though I've also read about people copying data to a NAS or SMB share. We are going to back up the master node via traditional backup means, so I'm only concerned about the HDFS data. Here are my questions:

1) For the cluster-to-cluster copy, can I set up a SINGLE-node cluster with a large amount of storage to act as my off-site replica? I don't care about its performance, just its existence and ability to hold the entire dataset. (Restore times aren't a concern, as this cluster isn't mission critical.) Can the copy be scheduled so that it only runs once a day, etc.?

2) For the SMB or NAS option, how does this work? Does the target disk need to be formatted as HDFS? Will I need to back up each of the three worker nodes in their entirety? Or is there some intelligent script out there that can back up the dataset without the parity? I'm not very familiar with this solution and have only seen references to it online. I haven't had much luck locating resources or information.

I'm also open to any other DR options for Hadoop HDFS. Our goal is to obtain a full copy of the HDFS dataset so that we could use it to recover from a rack or site loss.

Thanks!

Matt Keller

2 Answers


For option 1, you could use distcp to copy from one cluster to another. The backup cluster could certainly be a single-node server, as long as it has a namenode and a datanode running on it. Basically, you're looking at running in pseudo-distributed mode.

To do this periodically, I would create a shell script that did something like the following:

  1. check for a lockfile
  2. if the lockfile exists, bail out (and optionally send you an alert if the lockfile has been around too long -- this would signify that a previous distcp either exited badly and didn't unlock or that the previous distcp is taking longer than you expect).
  3. if it doesn't exist, touch the lockfile.
  4. run the distcp.
  5. check the status of the distcp job to verify that it completed correctly.
  6. unlock.
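
The steps above could be sketched roughly as follows. This is a minimal illustration in Python rather than shell, not a tested production script. The function name `run_with_lock`, the status strings, and the 24-hour staleness threshold are all assumptions made for the example; the command to run is passed in by the caller.

```python
import os
import subprocess
import time


def run_with_lock(cmd, lock_path, stale_after_s=24 * 3600):
    """Run cmd under a lockfile and return a short status string.

    Hypothetical helper: the name, the status values and the default
    staleness threshold are illustrative assumptions.
    """
    # Steps 1-3: create the lockfile atomically. O_EXCL makes the call
    # fail if the file already exists, so check and touch happen together.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Step 2: bail out, and flag a lock that has been around too long.
        age = time.time() - os.path.getmtime(lock_path)
        return "stale-lock" if age > stale_after_s else "locked"
    os.close(fd)
    try:
        # Steps 4-5: run the copy and check its exit status.
        result = subprocess.run(cmd)
        return "ok" if result.returncode == 0 else "failed"
    finally:
        # Step 6: unlock, whether or not the copy succeeded.
        os.remove(lock_path)
```

A daily cron entry could call this with the distcp command line as `cmd` and send an alert on any status other than `"ok"`; a `"stale-lock"` result corresponds to the "lockfile has been around too long" case in step 2.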

I'm suggesting the use of a lockfile because you don't want multiple distcp jobs running at once in this particular setup; you'll end up overwhelming your pseudo-distributed cluster. I would also set the default replication factor to 1 in the pseudo-distributed cluster's configuration. No need to double up on blocks if you don't need to (though I can't remember if a pseudo-distributed cluster does this by default; YMMV).

distcp can be made to work like a dumb rsync, only copying those things that change.
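
As a rough illustration of the rsync-like mode, the sketch below builds a `hadoop distcp` command line using the `-update` flag, which skips files that already match on the target, and optionally `-delete`, which removes target files that no longer exist on the source. The function name and the example URIs are hypothetical; check the flags against the DistCp documentation for your Hadoop version.

```python
def distcp_update_cmd(src_uri, dst_uri, delete_missing=False):
    """Build an incremental `hadoop distcp` command line as a list.

    Hypothetical helper; src_uri and dst_uri are whatever HDFS URIs
    your two clusters use.
    """
    cmd = ["hadoop", "distcp", "-update"]
    if delete_missing:
        # -delete is only accepted together with -update or -overwrite.
        cmd.append("-delete")
    return cmd + [src_uri, dst_uri]
```

For example, `distcp_update_cmd("hdfs://prod-nn:8020/data", "hdfs://backup-nn:8020/data")` (both hostnames made up) yields a list that can be handed to `subprocess.run`. Leaving `delete_missing` off is the more conservative choice for a backup, since an accidental delete on the source is then not propagated.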

For option 2, you could use hadoop fs -copyToLocal. The downside to this is that it's a full copy every time, so if you copy /, it copies everything each time it runs.
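
A small sketch of what that might look like, assuming the NAS or SMB share is already mounted on the machine running the command. `-copyToLocal` writes ordinary files to the local filesystem, so the target does not need to be HDFS, and it copies each file once rather than once per replica. The function name, the mount point, and the one-directory-per-day layout are assumptions for illustration.

```python
import datetime


def copy_to_local_cmd(hdfs_path, nas_mount, day=None):
    """Build a `hadoop fs -copyToLocal` command into a dated directory.

    Hypothetical helper: nas_mount is wherever the share is mounted,
    and the per-day target directory is just one possible layout.
    """
    day = day or datetime.date.today()
    # Each run gets its own directory, since every run is a full copy.
    target = f"{nas_mount.rstrip('/')}/{day.isoformat()}"
    return ["hadoop", "fs", "-copyToLocal", hdfs_path, target]
```

With a layout like this the share needs room for several full copies, so old dated directories would have to be pruned separately.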

For the hadoop metadata, you'll want to copy the fsimage and edits file. This blog has a pretty reasonable overview of what to do. It's geared towards using Cloudera, but should be basically the same for any Hadoop 1.0 or 2.0 cluster.

Travis Campbell

HDFS is replicated by design, usually across a minimum of 3 nodes, so if you have 3 nodes the data is already replicated on all three.

Of course, these nodes should be on different physical servers. Then data loss is unlikely, since all 3 would have to fail at once.

To replicate your current HDFS, you could simply add nodes to the HDFS service on other servers and the data will replicate. To be sure that the data is replicated beyond the 3 original nodes, increase the fault tolerance setting (the replication factor) to 4 or more nodes. Then shut down the other nodes on the single unit, and your data will be on all the nodes left active.

MrE
  • Though it is a common misconception, **replication is NOT a backup**. It is only designed to increase efficiency and to ensure continuity in case of hardware failure. -- A simple example of why this is not a proper backup: if you accidentally delete files, they will be deleted on all nodes and you cannot recover them normally. – Dennis Jaheruddin Dec 13 '16 at 08:52