21

I've got 40 years in computing, but I've never had to build a server quite like this one, so this might be a n00b question.

I have a client that is going to offer ultra-high-def music files for download. In this case that means FLAC-compressed 24-bit/192kHz =~ 10GB/album. (No, I don't want to discuss the desirability of the product, just the server configuration.) The catalog will be about 3,000 albums, with both ultra-high and low-def versions (for their iPods, I guess), giving about 35-40TB or so of primary data.

Since this is a very specialized product, the market size is relatively small (think: people who spend $20,000+ on their audio systems), which means most of the time the server is going to be 100% idle (or close to it). I have what looks like a good colocation offer from ColocationAmerica with a 1Gbps connection and bandwidth at about $20/TB, so now I just have to build a box to deliver the goods.

The data-access use case is write-once / read-many, so I'm thinking of just using software RAID 1 for pairs of drives. This would allow me (I think) to reconfigure spare drives for failed ones on the fly, thereby being able to start the rebuild of the second drive before some sysadmin notices the red light on the system (they do free swap-outs). It would be great if I could get most of the drives to sleep/spin down if they aren't needed, which will be most of the time for most of the drives.
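To make the idea concrete, here's a minimal sketch of the pair-per-array layout using Linux mdadm (device names are hypothetical and the spare handling is simplified):

```python
# Hypothetical sketch of the "N independent RAID 1 pairs" idea using Linux
# mdadm driven from Python. Device names are made up for illustration.
import subprocess

pairs = [("/dev/sdb", "/dev/sdc"),
         ("/dev/sdd", "/dev/sde")]  # ...and so on for the remaining drives

for i, (a, b) in enumerate(pairs):
    subprocess.run(
        ["mdadm", "--create", f"/dev/md{i}",
         "--level=1", "--raid-devices=2", a, b],
        check=True,
    )

# When a member fails, attach a spare to that array and md starts the rebuild:
# subprocess.run(["mdadm", "--manage", "/dev/md0", "--add", "/dev/sdz"], check=True)
```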

I don't need much in the way of compute power—this thing is just shoving fat-objects down the pipe—and so the CPU/motherboard can be pretty modest so long as it can support this number of drives.

I'm currently considering the following configuration:

Chassis: Supermicro CSE-847E26-RJBOD1
Drives: 30 4TB SAS drives (Seagate ST4000NM0023 ?)
MB: SUPERMICRO MBD-X10SAE-O w/ 8GB
CPU: Xeon E3-1220V3 3.1GHz LGA 1150 80W Quad-Core Server

So, am I going in the right direction, or is this a completely n00b / dinosaur way of approaching the problem?

Update to clarify a couple of points:

  1. I have no experience with ZFS, since the last Sun product I owned was back in the late '80s. I will do a little RTFMing to see if it feels right.
  2. I don't really need the filesystem to do anything spectacular since the file names are going to be simple UUIDs, and the objects are going to be balanced across the drives (sort of like a large caching system; see the sketch after this list). So I really was thinking of these as 40 separate filesystems, and that made RAID 1 sound about right (but I admit ignorance here).
  3. Because our current expectations are that we will be unlikely to be downloading more than a couple dozen files at any one time, and in most cases there will be exactly one person downloading any given file, I don't know if we need tons of memory for buffers. Maybe 8GB is a bit light, but I don't think 128GB will do anything more than consume energy.
  4. There are 2 separate machines not mentioned here: their current web store, and an almost completely decoupled Download Master that handles all authentication, new product ingest management, policy enforcement (after all, this is the RIAA's playground), ephemeral URL creation (and possibly handing downloads off to more than one of these beasts if the traffic exceeds our expectations), usage tracking, and report generation. That means this machine could almost be built using gerbils on Quaaludes.
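A toy sketch of the balancing idea from item 2 (mount points, pair count, and the modulo scheme are all hypothetical):

```python
# Hypothetical UUID-to-drive placement from item 2: each object lands on a
# mirror pair chosen deterministically from its UUID, so no lookup table is
# needed. Mount points and the pair count are invented for illustration.
import uuid

MOUNTS = [f"/data/pair{i:02d}" for i in range(15)]  # e.g. 15 RAID 1 pairs

def path_for(object_uuid: str) -> str:
    u = uuid.UUID(object_uuid)
    return f"{MOUNTS[u.int % len(MOUNTS)]}/{u}.flac"  # simple modulo balancing

print(path_for("2f1e6f64-9f1a-4c7e-9a30-1b2c3d4e5f60"))
# Note: plain modulo remaps objects if pairs are added later; a consistent-hash
# ring would avoid that, but it's probably overkill for a fixed-size catalog.
```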

ZFS? Where's the benefit?

OK, I'm slogging my way through multiple ZFS guides, FAQs, etc. Forgive me for sounding stupid, but I'm really trying to understand the benefit of using ZFS over my antediluvian notion of N RAID 1 pairs. On this Best Practices page (from 2006), they even suggest not doing a 48-device ZFS, but rather 24 2-device mirrors, which sounds kind of like what I was talking about doing. Other pages mention the number of devices that have to be accessed in order to deliver 1 (one) ZFS block. Also, please remember that at 10GB per object and 80% disk utilization, I'm storing a grand total of 320 files per 4TB drive. My rebuild time with N RAID 1s, for any given drive failure, is a 4TB write from one device to another. How does ZFS make this better?
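For concreteness, the arithmetic behind those numbers (the sequential write speed is an assumption, not a measured figure):

```python
# Back-of-the-envelope numbers for the paragraph above. The ~150 MB/s
# sequential write speed is an assumption for 4 TB nearline SAS drives.
DRIVE_TB = 4
OBJECT_GB = 10
UTILIZATION = 0.80
SEQ_WRITE_MB_S = 150

files_per_drive = DRIVE_TB * 1000 * UTILIZATION / OBJECT_GB
rebuild_hours = DRIVE_TB * 1e6 / SEQ_WRITE_MB_S / 3600  # md copies the whole disk

print(f"{files_per_drive:.0f} files per 4TB drive")      # 320
print(f"~{rebuild_hours:.1f} hours per RAID 1 rebuild")  # ~7.4 h at 150 MB/s
```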

I'll admit to being a dinosaur, but disk is cheap, RAID 1 I understand, my file management needs are trivial, and ZFS on Linux (my preferred OS) is still kind of young. Maybe I'm too conservative, but when I'm looking at a production system, that's how I roll.

I do thank all of you for your comments that made me think about this. I'm still not completely decided and I may have to come back and ask some more n00b questions.

ewwhite
Peter Rowell
    For this amount of storage, I wouldn't even consider using less than 128GB of RAM. Also, strongly consider using the ZFS filesystem. – EEAA Feb 12 '14 at 22:17
  • Is the server just going to serve files, or are there specific performance demands? – Molotch Feb 12 '14 at 22:18
  • 3
    Pairs of disks in RAID1 sounds... awful. Personally, I'd spec out a storage server/shelf, cram it full of near-line SAS drives, put the whole thing in RAID 10 or 6, add a hot spare or two and call it a day. – HopelessN00b Feb 12 '14 at 22:38
  • @EEAA, Personally, I can't see anything to warrant 128GB of ram. Whatever's cheapest. I do heartily concur with the ZFS suggestion. RAID1 seems a colossal waste given that he sacrifices half the gross storage for I/O throughput he doesn't need. RAID6 or raidz2 with hotspares makes more sense - especially if the machine isn't readily accessible. – etherfish Feb 12 '14 at 22:41
  • 3
    @etherfish - the RAM is not needed for computational purposes, but it is definitely needed for filesystem cache. Performance with only 8GB would be horrendous. Even more so if using ZFS, which is really the only fs I'd give serious consideration to at this size. ZFS needs a *lot* of RAM to function well. Fortunately RAM is relatively cheap. – EEAA Feb 12 '14 at 22:46
  • 1
    Performance would be more than sufficient to saturate 1Gbps. Performance would only be impaired if the filesystem had to reread blocks from disk that had been expunged from the buffer cache, and given little or no expectation of temporal locality, the point of diminishing returns for extra RAM is reached well, well before 128GB. Given an extent-based filesystem and large files, even filesystem metadata will occupy an insignificant amount of RAM. He even expects usage to be sparse enough that drives will be able to spin down. '73s. – etherfish Feb 12 '14 at 22:54
  • 5
    Just a note on spinning down the disks -- [***DON'T DO IT!*** (Click me to find out why)](http://serverfault.com/questions/135967/hard-drives-always-on-or-spin-up-down-as-needed) Spin-Up/Spin-Down is a lot of wear on the moving parts of a traditional hard drive, and will cause premature failure. The money you save on power will be lost replacing failed disks. – voretaq7 Feb 12 '14 at 23:11
  • @voretaq7 - Got it! I was thinking that might be the case, but then I went all Green, so ... you know? (I'll be reading your link in a moment.) – Peter Rowell Feb 12 '14 at 23:34
  • @HopelessN00b: Could you expand on "storage shelf"? Is that just another name for a box full of disks? – Peter Rowell Feb 12 '14 at 23:37
  • @etherfish, I was looking at RAID 1 for mirroring/rebuild rather than for any performance gains. This box is going to be hosted 500 miles away from us, so I was just hoping to ship them a few spare drives every once in a while. I will be reading up on ZFS and its ability to recover from multiple disk failures. – Peter Rowell Feb 12 '14 at 23:44
  • @PeterRowell RE: "shelves" - I have a preference for Dell hardware, so I was thinking of the [NX storage servers](http://www.dell.com/us/business/p/powervault-nx/fs) (because I own one at home for a 24 TB media library), but they also have [direct-attached-storage options, like these](http://www.dell.com/us/business/p/direct-attached-storage) - essentially shelves of disks that you have to attach to a computer. Other vendors have similar offerings, but I'm not familiar with them. Look for network-attached-storage (NAS) and/or direct-attached-storage (DAS) offerings from your vendor of choice. – HopelessN00b Feb 12 '14 at 23:48
  • @HopelessN00b: Got a link to something specific? I was wandering around the Dell site and they kept on trying to sell me something different, even when I *thought* I was clicking on a link for a NX3300. – Peter Rowell Feb 12 '14 at 23:59
  • @PeterRowell [I would think the NX3200 would be what you're looking for](http://configure.us.dell.com/dellstore/config.aspx?oc=breca2&model_id=powervault-nx3200&c=us&l=en&s=bsd&cs=04), but honestly, I got mine off ebay... and would always check the outlet store for deal on a refurb/return before paying full price. In any event, it's the customize and buy button you're looking for... but again, if I were to pay full price, I'd make my Dell rep spec and quote it out. :) – HopelessN00b Feb 13 '14 at 00:06
  • @PeterRowell - The last time I checked the Dell NX-series are just PowerEdge servers with MD-series DASD cabinets attached, running Windows Storage Server. If you're not planning to run Windows I don't know that these units make sense. You really want to present JBOD to ZFS, if you're going to go that route. The Dell MD1200 is a nice SAS JBOD enclosure that you can populate w/ 4TB disks. They're getting old enough that you can find them off-lease as well as brand new. – Evan Anderson Feb 13 '14 at 05:33
  • @PeterRowell, regarding your question: *ZFS? Where's the benefit?*, if you run RAID-Z3, you can experience up to 3 drive failures without losing data integrity. If you were unlucky enough to have 2 drives fail in your RAID1 setup, you could have immediate trouble *if the drives were in the same pair.* In the case of an environmental issue, that isn't *that* unlikely. With ZFS, you could use *all* your 2nd RAID1 drives to provide coverage so you could lose up to (number of drives/2 - 1) before losing parity coverage. And that would be *any* of the drives... – MrWonderful Feb 13 '14 at 16:26
  • This sounds like a great use-case for [building](http://blog.backblaze.com/2013/02/20/180tb-of-good-vibrations-storage-pod-3-0/) (or [buying](http://www.45drives.com/)) a Backblaze Storage Pod. – Moshe Katz Feb 13 '14 at 21:06
  • @MosheKatz: Interesting thought, although I need to figure out if I can even deliver my downloads from Backblaze to arbitrary (mostly U.S.) IP addresses. I'm trying to find a number to give them a call. Thanks! – Peter Rowell Feb 14 '14 at 18:47
  • @PeterRowell I think you misunderstood what I meant. Backblaze designed a server specifically for the use-case of lots of disks with occasional reads/writes. You would still have to host it somewhere, but it is an alternative to the SuperMicro *hardware* you are considering. – Moshe Katz Feb 14 '14 at 20:23
  • @MosheKatz: Indeed. Once I started reading I realized that was the only direction to go. Currently looking at 45drives.com and OpenStoragePod.org, amongst others. – Peter Rowell Feb 14 '14 at 22:13

2 Answers

12

Based on your problem description, your issue isn't so much the server as the storage.
You want a reliable, robust filesystem like ZFS that's designed to handle large storage capacity well, and has built-in management capabilities to make that end of the system easier to manage.

As was mentioned in the comments, I'd go with ZFS for the storage pool (probably on FreeBSD because I'm most familiar with that operating system and because it's got a long, proven track record of solid performance with ZFS - My second choice OS would be Illumos, again because of the well-tested ZFS support).


As far as serving up the files I agree - you don't need much in terms of hardware to just push data out the network port. Your primary driver for CPU/RAM is going to be the needs of the filesystem (ZFS).
The general rule of thumb is ZFS needs 1GB of RAM, plus 1GB for every 10TB of disk space it manages (so for 40TB you would need 5GB of RAM for ZFS) -- the relationship isn't quite linear though (there are plenty of good books/tutorials/docs on ZFS that can help you come up with an estimate for your environment).
Note that adding in ZFS bells and whistles like deduplication will require more RAM.

Obviously round RAM requirements up rather than down and don't be stingy: If your math says you need 5GB of RAM don't load the server with 8GB -- step up to 16GB.
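As a rough sizing sketch (note: per the comment exchange below, the rule of thumb is more commonly quoted as roughly 1GB of RAM per 1TB of pool, which is the ratio used here):

```python
# Rough ZFS RAM sizing sketch. Uses ~1 GB of RAM per 1 TB of pool (the
# corrected figure from the comments); treat it as a floor, not a target,
# and remember that enabling dedup would require far more.
POOL_TB = 40
BASE_GB = 1
GB_PER_TB = 1

estimate_gb = BASE_GB + POOL_TB * GB_PER_TB
print(f"baseline: {estimate_gb} GB -> round up to, say, 64 GB")
```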

You can then either run your server directly on the storage box (which means you're going to need even more RAM on that box to support the server processes), or you can remote-mount the storage to "front-end" servers to actually serve client requests.
(The former is cheaper initially, the latter scales better long-term.)
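If you do run the server directly on the storage box, the serving side can start out very simple. A minimal stand-in (the path and port are hypothetical; in production you'd put nginx or similar in front and only honor the ephemeral URLs the other machine hands out):

```python
# Minimal stand-in for "serve files straight off the storage box".
# The directory and port are hypothetical; a real deployment would sit
# behind nginx/haproxy and validate the Download Master's ephemeral URLs.
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="/tank/downloads")
ThreadingHTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```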


Beyond this advice the best suggestions I can give you are already well covered in our Capacity Planning series of questions -- basically "Load Test, Load Test, Load Test".

voretaq7
  • Methinks your math is off. By your formula, he'd need 41G. – EEAA Feb 12 '14 at 22:54
  • @EEAA Indeed, I dropped a zero :-) And note that that's a bare minimum amount of RAM. ZFS would be quite happy to use 41G and soak it all up with cache :-) – voretaq7 Feb 12 '14 at 23:07
  • @voretaq7: Thanks for the link to capacity planning; it's next on my list after reading about ZFS. – Peter Rowell Feb 12 '14 at 23:47
  • If you do go with ZFS, consider hardware from http://www.ixsystems.com/ – sciurus Feb 13 '14 at 00:04
  • I filled out a quote form there, but I have to admit I *don't like* sites that give me no idea of their pricing. Makes me think of Oracle ... not a good company to be compared to. – Peter Rowell Feb 13 '14 at 01:44
  • @voretaq7: Please see my update at the end of the main question. Thanks for your time in looking at my problem. – Peter Rowell Feb 13 '14 at 03:51
  • 1
    @PeterRowell The major advantages of ZFS are that it's ***designed*** to handle multi-terabyte scale filesystems - It was forged in the crucible of Sun Microsystems and built as a 21st century filesystem for 21st century data sizes (of the kind you're talking about). A question about the benefits/drawbacks of ZFS versus other approaches would be a good subject for another separate question, but I'll drop this nugget: There is no such thing as waiting for `fsck` if you're using ZFS and the machine crashes. I've `fsck`'d terabyte filesystems. It's pretty terrible. – voretaq7 Feb 13 '14 at 04:21
  • @voretaq7: You have a good point: fsck sucks! So maybe that right there is the tipping point. Sigh, back to the FM! :-) What about the Best Practices (from 2006) that recommend *against* having a 48 disk (25 or so in my case) ZFS? Is that still valid or is it last decade's news? – Peter Rowell Feb 13 '14 at 04:51
  • I gave you the Check because a) it's the only actual answer :-), but b) it's making me dig deeper into what I thought was a relatively straightforward solution. ZFS may not end up being what I do, but it clearly represents a greater step forward than I had realized. Thanks. – Peter Rowell Feb 13 '14 at 06:00
  • You can have 25 disks in your ZFS volume, but you will need to consider how you put them together. – tegbains May 14 '14 at 08:49
2

I use ZFS for a multi-TB server and it has been rock solid. I used OpenIndiana to start with and have now moved to FreeNAS as it does what I need it to do.

I would recommend using an LSI HBA card (9211-8i is a good base card) with SAS expanders (SuperMicro cases can be ordered with integral SAS expanders that are based on LSI chipsets). The LSI firmware is supported in FreeNAS and FreeBSD. Check for appropriate versions (V16 is good on FreeBSD V9.x).

Given the write-once/read-many nature of your system, I would use a ZFS Z2 (RAID-Z2) topology (avoid RAID-5 and Z1 with drives this size). Given that you are using 4TB disks, the rebuild (resilver) time for a large single-vdev array would be long if the pool is full. To avoid long rebuild times, arrange the vdevs in groups of 6 or 10 to create the pool (a recommendation from the FreeNAS documentation). A pool made of three 6-drive vdevs (4TB drives assumed) would have a usable capacity of ~48TB and offers a good level of fault tolerance (remember you still need to back stuff up, as RAID does not replace backups :) ).
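The capacity math behind that layout, plus the general shape of the pool-creation command (device names are placeholders):

```python
# Usable-capacity check for the layout above: three RAID-Z2 vdevs of six
# 4 TB drives each. Device names in the example command are placeholders.
DRIVE_TB, VDEVS, PER_VDEV, PARITY = 4, 3, 6, 2  # RAID-Z2 = 2 parity per vdev

usable_tb = VDEVS * (PER_VDEV - PARITY) * DRIVE_TB
print(f"usable ~= {usable_tb} TB before ZFS overhead")  # 48 TB

# Equivalent pool creation (placeholder device names):
# zpool create tank \
#   raidz2 da0 da1 da2 da3 da4 da5 \
#   raidz2 da6 da7 da8 da9 da10 da11 \
#   raidz2 da12 da13 da14 da15 da16 da17
```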

To speed things up for commonly accessed files, you can throw in a couple SSDs for L2ARC (likely not needed for your application but they are pretty cheap for 120GB SSDs).

And as stated, use lots of RAM. 64GB is not overly expensive given the other hardware in the system. Unfortunately, the smaller Xeon cannot use more than 32GB. You could try it, but more RAM would be better according to the ZFS literature (I use the Xeon you mention with 32GB of RAM and a 24TB-capacity Z2 array and it works fine).

Another advantage of ZFS is that you can set up periodic snapshots. This way, you can restore previous versions easily, and the snapshots are very space efficient. Furthermore, you can replicate any snapshot to another dataset (local or remote), and this can be done over SSH for security.
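A sketch of that snapshot-and-replicate cycle (dataset and host names are invented; in practice you would drive this from cron or a tool like zfs-auto-snapshot):

```python
# Hypothetical periodic snapshot + SSH replication, as described above.
# Dataset and host names are invented; schedule via cron/systemd timers.
import subprocess
from datetime import datetime, timezone

snap = f"tank/music@{datetime.now(timezone.utc):%Y%m%d-%H%M}"
subprocess.run(["zfs", "snapshot", snap], check=True)

# zfs send | ssh remote zfs recv -- stream the snapshot to another box
send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
subprocess.run(["ssh", "backuphost", "zfs", "recv", "-F", "backup/music"],
               stdin=send.stdout, check=True)
send.stdout.close()
send.wait()
```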

I really like the reliability of the ZFS system. I also like the fact that it is hardware INDEPENDENT!! Any system that can see the drives can import the pool. No firmware dependencies etc. that can happen with hardware RAID (not an issue with better cards, but they are more expensive than HBA cards and need drivers etc. - been bitten by that in the past).

Given this post is older, you likely have a solution. If so, mind telling us what you built?

Cheers,

Scharbag