7

Let's say I have a couple of thousand large files (1-800 MB each) that are all accessed at random. Newly uploaded files are accessed very frequently, and as time passes the access rate drops off in an inverse-square fashion, but there can be random spikes in usage of the older files.

The total throughput is in the 2-4Gbit range.

I'm looking for a self-hosted solution, not the Amazon offerings, since they are way too expensive.

What I roughly had in mind is this:

Expensive "main" server with several 15k rpm SAS drives (or SSDs) which would be hosting new files that are just uploaded to the site. Once the download rate drops (or file reaches a certain age) its moved to one of the cheaper archive nodes.

EDIT: Files are to be served via HTTP to a wide variety of users. Servers are running FC5. Need mostly read access, but write is important also.

Right now I've got a simple two-server setup maxing out a gigabit, and I'm seeing crazy I/O load. The box is formatted with 4K blocks. Would increasing that to, say, 1024K have a huge impact?

  • I'm curious myself... I wouldn't think self-hosted costs can beat S3, particularly when bandwidth is considered. – tomjedrz Feb 18 '10 at 22:23
  • According to Amazon pricing, serving 1 Gbit of traffic would cost $30,000/month. You can get a 1 Gbit line with a server for 1/10th of that, and on super-premium bandwidth for 1/5th of that. – Feb 18 '10 at 22:25
  • 1
    How many users uploading/downloading? If there are many, you could conceivably use bittorrent to decentralize the process. – MattB Feb 18 '10 at 22:32
  • What kind of access? (the answer is different if the files are going to be read-write versus read-only) – voretaq7 Feb 18 '10 at 22:42
  • +1 for mucking in with a DIY approach. 30k/month is ridiculous. – hookenz Feb 18 '10 at 23:15

8 Answers

1

If you only serve this data locally, you could easily assemble a single server with a couple of terabytes of storage using off-the-shelf components. Teaming up a couple of gigabit NICs could provide the network throughput you need.

If the content has to be served over larger distances, it might be better to replicate the data across several boxes. If you can afford it, you could fully replicate the data, and if files never get overwritten, crude timestamp-based replication scripts could work.
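For example, something along these lines run from cron could do (a rough sketch only; the hostnames, paths and state file are made up):

    #!/usr/bin/env python
    # Crude timestamp-based replication: push anything newer than the last
    # run to each replica. Safe only because files are never overwritten.
    import os
    import subprocess
    import time

    SOURCE = "/srv/files/"
    REPLICAS = ["mirror1.example.com", "mirror2.example.com"]
    STATE_FILE = "/var/run/last-replication"

    now = time.time()
    last_run = os.path.getmtime(STATE_FILE) if os.path.exists(STATE_FILE) else 0

    # collect files modified since the previous replication pass
    new_files = [os.path.join(SOURCE, f) for f in os.listdir(SOURCE)
                 if os.path.getmtime(os.path.join(SOURCE, f)) > last_run]

    for host in REPLICAS:
        for path in new_files:
            subprocess.call(["rsync", "-a", path, "%s:%s" % (host, SOURCE)])

    # remember when this pass started so the next run only ships newer files
    open(STATE_FILE, "w").close()
    os.utime(STATE_FILE, (now, now))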

Otherwise, you could look at parallel filesystem implementations; if you want a free one, look at Lustre (for Linux) or Hadoop (multi-platform).

b0fh
  • 3,313
  • 1
  • 20
  • 32
1

What you're proposing is an automated tiered storage solution. This is not a trivial achievement. Some high-end storage vendors like EMC are touting automated tiering solutions, but they're geared towards top-end enterprise LAN solutions and come with a corresponding price tag.

You're going to want to take a look at Sun's ZFS storage system, as it touts the kind of capabilities you're after and may be closer to the price point too.

http://blogs.oracle.com/studler/entry/zfs_and_the_hybrid_storage

alanc
  • 1,500
  • 9
  • 12
Chris Thorpe
  • 9,903
  • 22
  • 32
1

All of these are significant:

1) lots of RAM

2) multiple network cards and/or frontends to reduce bottlenecks

3) a reverse proxy server, such as Squid (see e.g. http://www.visolve.com/squid/whitepapers/reverseproxy.php ) or Varnish

4) RAID setup for disks (striped or stripes/mirrors combo possibly)

5) choice of the correct filesystem and, yes, block size. XFS used to be a good performer for large amounts of data; ZFS is probably better now.

These should all help. How much of this needs to be implemented, and what exactly, you should be able to calculate based on your target requirements (i.e. the total network bandwidth you want to saturate, the throughput of a single card, the maximum throughput of your disks unraided and raided, etc.).
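As a back-of-envelope example of that kind of calculation (all figures below are illustrative assumptions, not measurements):

    import math

    # Rough sizing: how many NICs and spindles to hit the target throughput.
    TARGET_GBIT = 4.0       # total outbound bandwidth to fill
    NIC_GBIT = 1.0          # usable throughput of a single gigabit card
    DISK_MB_S = 120.0       # sequential MB/s of one 15k SAS drive
    RANDOM_PENALTY = 0.25   # fraction of that left under heavy random reads

    target_mb_s = TARGET_GBIT * 1000 / 8         # ~500 MB/s of payload
    per_disk_mb_s = DISK_MB_S * RANDOM_PENALTY   # ~30 MB/s per spindle

    print("NICs needed:  %d" % math.ceil(TARGET_GBIT / NIC_GBIT))
    print("Disks needed: %d (before RAID overhead, ignoring RAM cache hits)"
          % math.ceil(target_mb_s / per_disk_mb_s))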

Gnudiff
  • 533
  • 5
  • 20
  • 1
    +1 for lots of ram in conjunction with XFS or ZFS. Highly recommended for large volumes (and large files). – pauska Feb 19 '10 at 15:01
0

If you don't want a DIY tiered storage option (if I had to, I'd probably use the File System Management task in Windows 2008 R2), I'd highly recommend you take a look at a solution from Compellent. You would not need any additional nodes (per se) for lower-cost storage, as you would simply have some fast disks and some inexpensive slow disks mounted from the SAN via the OS of your choice. Compellent's out-of-the-box feature set is access-based HSM. This solution also provides scalability. Right now this approach might be expensive (and you provided no future outlook), but long term it might be more cost-effective than trying to manage and maintain a roll-your-own solution.

Jim B
  • 23,938
  • 4
  • 35
  • 58
0

It's not clear what OS you are running, or whether you plan on having these files moved automatically or will write a script to handle it. When you say "accessed", do you mean via the web (HTTP) or some other method?

I worked on a social networking site that had a "lock box" for files. As the site grew we were burning through about 200GB a day in storage.

We kept track of busy files using web stats, which ran every night. If a file appeared in the top-files list, the script would update the database and flag the file as "high priority". That told the web app to use the high-priority URL and to make sure the file was copied to the fast storage system.

This worked reasonably well until they could afford a scalable SAN solution.
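A simplified sketch of that kind of nightly job (the log format, paths and promotion step here are assumptions for illustration, not the real setup):

    #!/usr/bin/env python
    # Nightly "promote hot files" job: count yesterday's requests per file
    # and make sure the busiest ones sit on the fast tier.
    import collections
    import os
    import shutil

    ACCESS_LOG = "/var/log/httpd/access_log"
    SLOW_TIER = "/srv/archive"
    FAST_TIER = "/srv/fast"
    TOP_N = 50    # how many of the busiest files get promoted

    hits = collections.Counter()
    with open(ACCESS_LOG) as log:
        for line in log:
            parts = line.split()
            if len(parts) > 6 and parts[5] == '"GET':
                hits[parts[6]] += 1          # request path, e.g. /files/foo.bin

    for path, count in hits.most_common(TOP_N):
        src = SLOW_TIER + path
        dst = FAST_TIER + path
        if os.path.isfile(src) and not os.path.exists(dst):
            dst_dir = os.path.dirname(dst)
            if not os.path.isdir(dst_dir):
                os.makedirs(dst_dir)
            # copy onto fast storage; the real job also flagged the file as
            # "high priority" in the database so the app served the fast URL
            shutil.copy2(src, dst)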

jeffatrackaid
  • 4,112
  • 18
  • 22
  • If you have large files, do some testing with various block sizes. The formatting can have a very large impact on disk IO. – jeffatrackaid Feb 19 '10 at 04:27
0

I haven't really heard enough detail, but knowing what I know I'd look into a basic 1U server (or two for HA) with a lot of RAM running your choice of OS/storage software, connected to a Xiotech Emprise 5000. Assuming you can fit a good working set in memory, the IOPS that make it through to the spindles will be pretty broad random I/O, and that's what the box is best at. You could probably do a one-server (64 GB) / one-array (2.4 TB) combo for a touch under $20K.

cagenut
  • 4,808
  • 2
  • 23
  • 27
0

We do this exact thing with our VoD servers: we use many unclustered servers with lots of memory to act as a cache for the local disks, which are in turn multiple SAS-connected shelves of 25 x 2.5" 15k rpm disks, and this is then streamed over either multiple 1Gb NICs or dual 10Gb ones. We spent a LONG time getting the PCIe slot/SAS HBA positions correct, as well as the RAID cluster and disk block size settings, etc.

Chopper3
  • 100,240
  • 9
  • 106
  • 238
  • Does disk block size have a large impact on IO? I currently have 4 x 146 GB 15k SAS drives in RAID 5, maxing out a gigabit (realistically it would be doing 2 Gbit if uncapped), and I'm getting 10-30 io with 4K blocks. – Feb 19 '10 at 08:26
  • Yes it can make an *enormous* difference but tuning is a big subject requiring a fair amount of data capture, analysis and testing to hit a specific 'sweet-spot' for an application - if you have a reference or test environment then that's the place to do it. – Chopper3 Feb 19 '10 at 09:58
0

Interesting problem. Looks like you're hosting a bunch of pirated movies :P

Jokes aside, I think your solution might work as a good starting point. It's the kind of problem that you want to be familiar with before cooking up a solution that's either too expensive or too limited.

I would do something like this:

  • (either assume, or do a perf test to confirm, that) the bottleneck is most likely users accessing different parts of the same file at the same time, since people will have different download speeds and will log in at different times;
  • therefore, for best throughput you should load the most requested files into RAM or a parallel sort of storage (i.e. replicate them across many, many disks and distribute users' access round-robin);
  • ergo, you might want to have several front-line servers with a ton of RAM each, and a back-line server with a gazillion gigs of disk space;
  • also place a reverse proxy or something like that to redirect users towards the correct server, i.e. server A holds movies #1-#20, server B holds #21-#40, and so on (a toy sketch of this redirect follows the list);
  • finally, put in a managing node to move movies from the back-end storage to the front end according to download frequency, time of year, a celebrity's birthday and whatnot.
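The redirect part could start out as small as this (a toy sketch with made-up hostnames; in reality the mapping would come from the managing node's database and you'd front it with a proper reverse proxy):

    #!/usr/bin/env python3
    # Toy redirector: answer each request with a 302 to whichever front-line
    # server currently holds the file. Hostnames and paths are made up.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LOCATIONS = {
        "/files/movie-001.mkv": "http://frontend-a.example.com",
        "/files/movie-021.mkv": "http://frontend-b.example.com",
    }

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            target = LOCATIONS.get(self.path)
            if target is None:
                self.send_error(404)
                return
            self.send_response(302)
            self.send_header("Location", target + self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), Redirector).serve_forever()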

(if it works, can I have the servers when you're done with them? I've got a couple of spiking neural network experiments I'd like to run)

lorenzog
  • 2,719
  • 1
  • 18
  • 24