DFS for several small clusters over WAN

My friends and I each have TBs of data on our systems. None of us has a geographically distributed full backup, though, because at that volume of data, solutions such as Dropbox, S3, et al. are cost-prohibitive for us. On the other hand, each of us has local storage in excess: TBs each, in fact, going unused.

We began thinking: If we could network our hosts into some form of Distributed File System, we could each gain geographically distributed backups of our complete data sets while achieving higher utilization of the storage capacity we have. The perfect solution... we think.

  1. There are at least 3 of us. Surely 6 or more if the project yields fruit.
  2. Each of us has 1-2TB of data, and at least that much to spare.
  3. We're all spread out over WAN.
  4. We'd need the ability for any host(s) to enter and leave the cloud service arbitrarily.
  5. Real(ish)-time synchronization. Otherwise we'd just meet up once a week over beers and trade around a pile of external HDDs.
  6. F/OSS is requisite, but we have plenty of elbow grease.
  7. If we can use/learn/leverage a distributed computing platform in the process, so much the better.

We started out thinking about building a Dropbox-esque interface on top of OpenStack or Hadoop, but I'd like to hear whether there are alternatives out there that we're overlooking. Perhaps for our case there is an even simpler solution? Is something like this even feasible, given the low number of nodes per cluster?

NB: Naturally the initial synchronization/balancing/transfer/etc will take days at the least, but that's acceptable.

user16511

Posted 2012-01-10T04:58:40.090

If it didn't need to be FOSS, I suspect CrashPlan would work perfectly for this. Even if it doesn't, they have some interesting ideas – Journeyman Geek – 2012-01-10T05:26:48.713

@JourneymanGeek: Post as an answer and I'll accept. Doesn't seem like we're going to be able to F/OSS this thing with the hardware we've got (unless we custom-build the whole system). – None – 2012-04-25T15:14:14.563

Answers

It's not FOSS, but CrashPlan is a pretty good option for this. It handles requirements 3, 4, and 5 perfectly, and it's dead simple to set up and run: install the client, set the usable space, and add the people you want to allow to use that space.

Journeyman Geek

Posted 2012-01-10T04:58:40.090

Reputation: 119,122

I used sshfs on Ubuntu Server and a simple rsync script run via cron. Each host retains its own autonomy (even though I have root access across all 3 hosts in my configuration), and how often to replicate across nodes, and to which nodes, is fully controllable. The amount of storage can be limited via a dedicated partition or a quota; I chose a partition simply because I control all 3 hosts. A disadvantage is the lack of built-in control over replication frequency (synchronization): if a host syncs too often it can saturate the link, particularly if snapshots are pushed across the WAN. Playing nicely with others and putting bandwidth limits on the rsync commands are necessary.
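
For illustration, here's a minimal sketch of the kind of cron-driven script I mean. The peer host name, paths, and the 500 KB/s cap are placeholders for whatever your own setup and link can tolerate; note that rsync's --bwlimit takes kilobytes per second.

    #!/bin/sh
    # Hypothetical nightly sync job; scheduled from cron with something like:
    #   0 3 * * * /usr/local/bin/sync-to-peer.sh

    REMOTE=peer.example.com       # placeholder peer host
    MOUNTPOINT=/mnt/peer-backup   # local mount point for the peer's storage

    # Mount the peer's backup area over SSH (sshfs) if it isn't mounted already.
    mountpoint -q "$MOUNTPOINT" || sshfs backup@"$REMOTE":/srv/backup "$MOUNTPOINT"

    # Replicate local data into a per-host directory on the peer.
    # --bwlimit is in KBytes/sec, so 500 leaves the WAN link usable for everyone else.
    rsync -a --delete --bwlimit=500 /data/ "$MOUNTPOINT/$(hostname)/"

    # Unmount the FUSE filesystem when the transfer finishes.
    fusermount -u "$MOUNTPOINT"

If you don't control the remote host, a disk quota on the backup account takes the place of the dedicated partition.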

Kam Salisbury

Posted 2012-01-10T04:58:40.090

Reputation: 21