
Disclaimer: Yes, I'm asking you to design a system for me :)

I've been tasked with designing a system to store about 10 TB/day with a retention time of 180 days.

My first approach would be to go with GlusterFS and use a HW setup like this:

Single Node in the System:

I'd need 9 nodes to get net storage (before any replication or RAID on the local disks) that can hold the data.
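
For reference, the node count is just this back-of-the-envelope calculation (the roughly 200 TB of net capacity per node is what my single-node disk layout works out to, so treat it as an assumption):

```sh
# Rough sizing behind the "9 nodes" / "1.8 PB" figures (~200 TB net per node assumed)
echo $(( 10 * 180 ))    # 1800 TB, i.e. ~1.8 PB online at any given time
echo $(( 1800 / 200 ))  # -> 9 nodes at ~200 TB net each, before replication or RAID
```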

Pros:

  • I can start with a single server without shelves
  • Grow by adding shelves to a single server, or by adding servers (just put some thought into whether to scale first by adding nodes, first by adding shelves, or some mix of both; see the sketch after this list)
  • scales "infinitely" (for certain definitions of "infinite")
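
To illustrate the growth path mentioned above, adding capacity to a distributed Gluster volume is essentially an add-brick plus a rebalance; the volume, host, and brick names below are made up:

```sh
# Hypothetical example: grow an existing distributed volume by one brick
gluster volume add-brick bigvol server10:/export/brick1
# spread the existing files across the new brick as well
gluster volume rebalance bigvol start
gluster volume rebalance bigvol status
```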

Cons:

  • In general: I actually have no idea how to verify whether this will still be a viable setup once I reach the final stage of expansion (estimated at 1.8 PB)

I don't have a strong preferred direction, just some experience with GlusterFS: I already run a 4 TB system on it (distributed, replicated, 4 nodes).

I'm pretty sure there isn't much of a difference whether this setup runs on Hadoop/Gluster/NetApp/EMC/Hitachi/EveryoneElse, but the use case is (drumroll):

```sh
ls -ltr | grep 'something' | xargs grep somethingelse
```

Yes, that is scary. I tried to convince people to actually run real analytical jobs over that data, but it seems that won't happen. (OK, it's not quite that bad, but those people will use a simple SSH session on some "analysis" system to manually go to some directory, recursively look through some files and then determine whether the data is OK or not, which sounds even worse now that I've written it down.)
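
Since the files land on this system already gzipped (see the comments below), those manual sessions will in practice look more like the following; the paths and patterns are invented, this is just to show the access pattern:

```sh
# Hypothetical ad-hoc search session over gzipped files
cd /archive/2012-02-23/some-feed
ls -ltr | grep 'something'                    # eyeball the candidate files
zgrep 'somethingelse' somefile-*.gz           # grep inside the compressed files
# or recursively across a whole day:
find . -name '*.gz' -print0 | xargs -0 zgrep -l 'somethingelse'
```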

I'm open to any ideas. I do have people who run "big storage" within our company (one backup system has 2 PB, for example) and I'd love to go with whatever they already have that works. But I also have to prove that they are doing the right thing (please don't ask about this, it's a political thing; I'd trust my data to the storage team, I just have no idea why I have to duplicate their work).

Thinking about how to actually run analysis on the data is explicitly out of scope.

There have been countless meetings, and I brought up everything from Splunk to analysis jobs developed in-house (with or without a MapReduce system). There's no interest in any of that. All the people care about is:

  • 10 TB/day
  • Keep the data for 180 days
  • Make it highly available (not yet fully defined, but something along the lines of 99.9%, 99.99%, ...; see the quick downtime math below)
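
For reference, here is the back-of-the-envelope downtime budget those availability targets translate to (8760 hours in a non-leap year):

```sh
# Allowed downtime per year for the candidate SLA levels
awk 'BEGIN { h = 8760
             printf "99.9%%  -> %.1f hours/year downtime budget\n",   h * 0.001
             printf "99.99%% -> %.1f minutes/year downtime budget\n", h * 0.0001 * 60 }'
```
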
Martin M.
  • If I understand this right, you're basically going to be parsing (log?) files for "error"? I know this doesn't help you, but there's a reason that Splunk is a bajillion dollars for a dataset of this size. – MDMarra Feb 23 '12 at 22:30
  • I have no idea what they'll be parsing for. But I did suggest that they should develop that with proper requirements first... I've also clarified the use case above... – Martin M. Feb 23 '12 at 22:33
  • I'm curious as to the industry... Do you have any insight into the data? If it's text/flat files, is it compressible? In finance, I've dealt with tick-data that's usually in a binary format, so there were no optimizations at that level. Since you're talking about grep, how compressible is the data? – ewwhite Feb 23 '12 at 22:35
  • Also: proper log analysis unfortunately doesn't seem to interest the people defining the requirements. It's "Store 10 TB of data, keep the data for 180 days, end of story". – Martin M. Feb 23 '12 at 22:36
  • @ewwhite: already gzipped. We have a 4 TB "spooling" system sitting in front of this one that does the compression. As of now, the 10 TB/day is indeed the gzipped data that will be stored; the ratio is about 1:20 (compressed:raw). – Martin M. Feb 23 '12 at 22:38
  • What type of data is it? I'm thinking of ZFS and compressed filesystems as a partial option... – ewwhite Feb 23 '12 at 22:42
  • I know this doesn't help, but I literally laughed out loud, all by myself in my office, when I read your last bolded edit. So, now everyone here thinks I'm losing it. – MDMarra Feb 23 '12 at 22:45
  • I would probably look at something like a [HP EVA8400](http://h18006.www1.hp.com/products/storageworks/eva8400/index.html); but I think your maths might be a bit off... 2PB is 2,000TB, yet 108x2Tb = 216Tb..? Also the only D2400 I can find is a 24-disk enclosure, which is 8*24*2=384TB? Or am I missing something obvious here? As a point of reference you should be able to get an 8400 (without disks) for about $US60,000 and you would need 3 fully loaded plus one partially loaded. – Mark Henderson Feb 24 '12 at 00:44
  • Uh, never mind. I missed the sentence where you mentioned that's a single node and that you need 9 nodes. – Mark Henderson Feb 24 '12 at 01:05
  • I'll go ahead and offer http://www.isilon.com/x-series as a possible solution--scales to 10.4PB and optimized for sequential reads. (I am not an Isilon shill.) – Mark Wagner Feb 24 '12 at 01:11

2 Answers


Well, you didn't mention budget... So buy this now. Data at that scale should probably be left in the hands of a team with experience in that realm. It's nice having support and someone to yell at :)

http://www.racktopsystems.com/products/brickstor-superscalar/

http://www.racktopsystems.com/products/brickstor-superscalar/tech-specs/

4 x Storage Heads BrickStor Foundation Units
10 x BrickStor Bricks (36 x 3.5″ Bay JBOD)
2 x 16-port SAS switch
1 x pullout rackmount KVM
1 x 48U Rack
1 x 10Gb Network Switch (24 x 10Gb non-Blocking)
NexentaStor Plug-ins:VMDC, WORM, HA-cluster or Simple-HA
Onsite installation 5-days
24/7/365 day email and phone support
Onsite Support

Since the application you describe really doesn't seem to be in the realm of clustered storage (given the use-case), use ZFS. You'll get the infinite scalability. You'll get a chance to offload some of the compression to the storage system and you can tell all of your friends about it :)

More than that, the L2ARC caching (using SSDs) will keep the hot data available for analysis at SSD speed.
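
To make that concrete, the relevant ZFS knobs look roughly like this; the pool layout and device names are invented, and an appliance like the one above would ship preconfigured:

```sh
# Rough sketch only: pool with SSD read cache (L2ARC) and filesystem-level compression
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
zpool add tank cache c2t0d0 c2t1d0    # SSDs as L2ARC to keep hot data fast
zfs set compression=gzip tank         # offload compression to the storage system
zfs get compressratio tank            # check what you actually get on this data
```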

Edit: Another ZFS-based solution - http://www.aberdeeninc.com/abcatg/petarack.htm


Also, Red Hat is now in the scale-out storage industry.

See: http://www.redhat.com/products/storage/storage-software/

ewwhite
  • That looks like a really neat system. And the price seems about right as well... – Mark Henderson Feb 24 '12 at 01:09
  • I like the idea of anything marketed as "$1.4 million for the first one and $1.1 million for each additional". >smile – Evan Anderson Feb 24 '12 at 02:46
  • They seem fairly confident. Is this a bad recommendation? :) – ewwhite Feb 24 '12 at 02:49
  • Part of the Red Hat stuff you're referring to is GlusterFS. And yes, budget is not the blocker in this case. Regarding "distributed": with 360 disks of 2 TB each I'm not going to fit the 1.8 PB, so I need two stores, and that is distributed in my book. – Martin M. Feb 24 '12 at 08:25
  • And what did you decide to do? – ewwhite Feb 27 '12 at 23:28

As MDMarra mentions, you need Splunk for this. I'm a big user and fan, at very similar volumes to the ones you're discussing, and right away it will save you from buying anywhere near that much storage and will cut a lot of the complexity. One decent-sized server (maybe 150-200 TB max) will do the job if used with Splunk; its on-the-fly indexing is perfect for this kind of thing, and its search capabilities far outstrip anything you'll manage yourself. It's not free, of course, but I wouldn't consider anything else.
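
To show how the one hard requirement (keep the data 180 days) maps onto Splunk: retention is essentially a single indexes.conf setting. The attribute names below are standard Splunk settings, but the index name and paths are made up, so treat this as a sketch and check the docs for the rest of the sizing attributes:

```sh
# Hypothetical 180-day retention for a dedicated Splunk index
cat >> "$SPLUNK_HOME/etc/system/local/indexes.conf" <<'EOF'
[logarchive]
homePath   = $SPLUNK_DB/logarchive/db
coldPath   = $SPLUNK_DB/logarchive/colddb
thawedPath = $SPLUNK_DB/logarchive/thaweddb
# 180 days * 86400 s/day = 15552000 s before buckets roll to frozen (deleted by default)
frozenTimePeriodInSecs = 15552000
EOF
```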

Chopper3
  • I agree the "real" solution would be something like Splunk, but to the people defining the requirements, anything that isn't somehow mountable is worse, even at the cost of being able to meaningfully search the data. I made that suggestion already; indexing the data to get structured access is unfortunately not an option. I hate to say it, but thinking about how to get information out of the data mountain is way beyond the scope of this -- read: thinking about the real problem isn't something the people with the requirements want me to do. – Martin M. Feb 24 '12 at 08:33
  • Wow - that sucks; they're literally robbing you of the best option - Splunk's indexing alone makes it worth the effort. To be honest, if they just want you to have a 2 PB file system then you're going about it generally the right way - though it's a shame you're doing this now, as there's a forthcoming HP SL model that would REALLY help you with this, but it's not out until the summer. Consider the 70-slot HP MDS 600 drive shelves over the D2612s - I love the 'D's, but you owe it to yourself to consider the 600s; they may be more your thing. Oh, and use 10GigE for your Gluster node links, OK? – Chopper3 Feb 24 '12 at 09:04