2

Even though I am fully aware that versions of this question have been asked a googol times, I'll try not to repeat them.

I have many sets of many files (some files are small, but some are large, like ~10-20 GB). I have multiple servers, each of which can host one or more of those sets of files. Of course, one server might host, say, 50% of the total number of sets, while another hosts a different subset.

You can think of a set as a collection of large media files, a really big image library, a complete application, whatever; it doesn't really matter, as long as the set contains large files.

A server can update its copy of a set at any point in time, either by replacing files in the set with completely new files, or by applying patches to some of the files (which results in almost identical files with only slight differences).

On the other side, I have many clients who should be able to obtain any given set (or multiple sets) from the servers, and keep their copies of those sets up-to-date (synchronized) with the sets on the server, whenever one wants to use a set.

The tools that I have considered are the following:

  • rsync -- It's great for syncing many small-to-medium-sized files, but less ideal for large files, as its delta algorithm reads the entire file on both sides to determine whether (and which parts of) a file should be copied; there is a sketch of that rolling-checksum idea after this list. That's fine when a file is copied for the first time, or when it has changed completely, but not so fine when, say, only 1% of a 10 GB file has changed.
  • SVN -- It's great at finding differences and transferring only those deltas, but I'm not so sure how optimal it is disk-wise: would the entire set end up twice as big on both client and server, since it is stored once in the repository and once as a working copy?
  • Torrent -- This one could be feasible, distribution-wise. For instance, create a torrent for each set on the server, start seeding it there, and have clients that receive those sets continue seeding to other clients, thus distributing the load across every computer that holds a copy of the set. However, I'm not sure whether it could somehow distribute differences once the set on the server changes... Would it require creating a new torrent for each change? Also, I don't know how torrent would behave on a local network, speed-wise: could it transfer files between one server and one client at the maximum, network-limited speed, or does it add serious protocol overhead? What about network congestion?
  • Custom solution. Well, not much to add here, except that it would most likely mean reinventing the wheel, and that some existing solution would probably fit my needs, if only I were aware of it.
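
For reference, here is a minimal, illustrative Python sketch of the weak rolling checksum at the heart of rsync's delta algorithm (the simplified formulas and function names are mine, not rsync's; real rsync pairs this with a strong per-block hash and a block-matching protocol). The takeaway: even though the window slides in O(1), every byte of the file on both sides still has to be read once, which is what hurts with 10-20 GB files.

    # Illustrative sketch of an rsync-style weak rolling checksum
    # (Adler-32-like). Real rsync also computes a strong hash per block;
    # function names here are hypothetical.

    M = 1 << 16  # weak checksum components are kept modulo 2^16

    def weak_checksum(block: bytes) -> tuple[int, int]:
        """Checksum a whole block: a = byte sum, b = position-weighted sum."""
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, out_byte, in_byte, block_len):
        """Slide the window one byte to the right in O(1), without re-reading."""
        a = (a - out_byte + in_byte) % M
        b = (b - block_len * out_byte + a) % M
        return a, b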

So, the question is: what method of distribution/synchronization (utilities, approach) would be best suited for my situation?

mr.b
  • Any particular server OS, or are you after an OS-agnostic solution? – Kara Marfia Oct 22 '10 at 11:49
  • If you want to reduce the disk usage (as per the *rsync*) you basically have one solution -- distributed filesystem: http://www.openafs.org/ may be what you're looking for. – Hubert Kario Oct 22 '10 at 16:12
  • Clients are running Windows (sorry for not mentioning that), and servers are OS-agnostic, but let's say Windows or Linux for server platform. – mr.b Oct 24 '10 at 14:14
  • @Hubert Kario: I can't do that, mostly because the speed at which clients access those sets is important; in other words, it's highly beneficial for each client to have copies of the sets on local hard drives, and improving the network infrastructure to the point where network speed would surpass HDD speeds is not doable in this case... – mr.b Oct 24 '10 at 14:21
  • That's why I specifically suggested AFS. It has very extensive caching mechanisms and, if I remember correctly, can be configured to use completely asynchronous writes to the server. – Hubert Kario Oct 24 '10 at 19:40
  • Oh, and there's Coda: http://www.coda.cs.cmu.edu/, but it supports only Windows XP or Linux as a client. – Hubert Kario Oct 24 '10 at 19:58
  • Oh, all right. It might be worth trying. Thanks for the suggestions! You might want to re-post them as an answer, so I can up-vote you properly, until I figure out some final solution. – mr.b Oct 25 '10 at 09:03

4 Answers

1

If you can safely assume that all the clients have consistent versions, you could use an off-the-shelf binary patching tool and roll your own solution to push diffs out to clients and apply them. If the clients have inconsistent versions, though, you'll have to read the file on the client to determine which diffs need to be sent (basically the rsync problem). If the clients are consistent, you can just compute each diff once and ship it out, as sketched below.
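
For what it's worth, here is a minimal sketch of that "compute the diff once, ship it to everyone" idea, assuming the third-party bsdiff4 package as the off-the-shelf binary patcher (the file names are hypothetical):

    # Sketch: one binary diff computed on the server, applied on every
    # client. Assumes all clients hold the same old version, as noted
    # above. Requires the bsdiff4 package (pip install bsdiff4).
    import bsdiff4

    # Server side: diff the old release against the new one, once.
    bsdiff4.file_diff("set_v1/movie.bin", "set_v2/movie.bin", "movie.patch")

    # Client side: rebuild the new file from the old copy plus the patch.
    bsdiff4.file_patch("local/movie.bin", "local/movie.new", "movie.patch")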

It sounds like you're looking for something like a multicast rsync implementation. I've never used this tool, but it would be worth looking at. It looks like they're only targeting Linux and Unix OSes right now.

Evan Anderson
1

In the end, I chose BitTorrent. Here's why.

  • It's fast: it completely saturates the server's uplink (though it can really slow down the network on the computers involved, due to the huge number of tiny packets; this can be somewhat mitigated by disabling UDP packets).
  • It's really good and fast at distributing any set of changes over any set of files. The BT protocol's smallest unit of data is a "piece" (from 4 KB to 4 MB in size); every file is split into pieces, the pieces are checksummed, and then only the differing pieces are transferred, whether the file in question is KBs or GBs in size. It is done very quickly; see the sketch after this list.
  • It's fully distributed: you can host many sets of files from many different source servers, and have clients retrieve files regardless of where they are stored (kind of a moot point, I know).
  • After the server uploads its copy of the content to the network, the server load drops drastically, and the time for a newly deployed client to receive up-to-date sets decreases dramatically, since the sets are then received from an entire network of computers instead of a single, centralized server.
  • It can be used in small installations with nothing more than a properly configured uTorrent client, which can create .torrents, track seeds/peers, and receive data on client computers.
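
To illustrate the piece mechanism mentioned in the second bullet, here is a rough Python sketch of fixed-size piece hashing (the piece size and file names are made up, and a real torrent also hashes pieces across file boundaries within the set):

    # Sketch: split a file into fixed-size pieces, hash each piece, and
    # compare against the previous piece table to see what actually needs
    # transferring after an in-place change.
    import hashlib

    PIECE_SIZE = 4 * 1024 * 1024  # 4 MB, the large end of typical piece sizes

    def piece_hashes(path: str) -> list[bytes]:
        """Return the SHA-1 digest of every fixed-size piece of the file."""
        hashes = []
        with open(path, "rb") as f:
            while piece := f.read(PIECE_SIZE):
                hashes.append(hashlib.sha1(piece).digest())
        return hashes

    old = piece_hashes("set_v1/library.dat")
    new = piece_hashes("set_v2/library.dat")
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    changed += list(range(len(old), len(new)))  # pieces appended at the end
    print(f"{len(changed)} of {len(new)} pieces need transferring")

Note that this only pays off when changes rewrite bytes in place: an insertion shifts every later piece and dirties the rest of the file, which is one reason in-place patching on the server keeps the deltas small.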

About the only two cons I have encountered:

  • Creating a torrent for a big data set can take a long time (a lot: 5-10 minutes) while the .torrent is created (the entire set is read, split into pieces, and checksummed), and it takes even longer if the set is not available locally but has to be fetched over the network. The same amount of time is needed whenever one wants to distribute an arbitrary amount of changes over a large set: each computer, both the server and all clients, needs to do the checksumming, which, as I said, can be lengthy. (I must note that in my case the changes were really small, and it would be impractical to copy GBs of data around for only a few MBs of changed data, so this is a very acceptable trade-off.)
  • It can take a while for the initial seeder to get up to full speed, so this method is not suitable if one simply needs to copy files between fewer than, say, 5 computers (although, in reality, benefits can be noticed even with 2-3 computers).

There you go; I hope this helps someone who faces the same dilemma.

mr.b
0

You can try caching network file systems:

  • OpenAFS: http://www.openafs.org/
  • Coda: http://www.coda.cs.cmu.edu/

They both cache reads and writes locally, and as such are not bound by network performance if you have enough local space for the cache.

Hubert Kario
0

You can use Windows Storage Server 2008. It is sold with NAS devices from different vendors, but it is very good and effective, and with Single Instance Storage it will save you a few GBs. You can then have a dedicated device serving those big files.

Most of these NAS devices come with dual NICs, and you can even get quad-port NICs, so if you have a gigabit or faster LAN infrastructure, you can bond/team those ports to provide more throughput.

Put more RAM into it and you should be good to go: http://www.broadberry.com/nasstorage_servers.html

Dell sells Windows Storage Server too; get one with iSCSI so you can also utilize the storage via iSCSI later, if you need to.

Hope that helps

Mutahir