I'm exploring options for making more efficient use of our primary storage.
Our current NAS is an HP ProLiant DL380 G5 with an HP StorageWorks MSA20, plus one other disk shelf whose model I'm not sure of.
The vast majority of our files are PDF files (hundreds of millions of them), with a high degree of similarity.
In an expert opinion from George Crump (referenced from Data Domain's Dedupe Central), in the section on granularity, he says: "To be effective data de-duplication needs to be done at a sub file level using variable length segments."
This is hard to find, yet it is exactly what I need. Most dedupe options seem to be block-based, which works really well for minimizing how much space backups take up, since only the changed blocks get stored; however, block-based techniques do not find identical segments located at different offsets within the blocks of our PDFs.
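To make sure I understand what I'm asking for, here is my own rough sketch of variable-length (content-defined) chunking in Python, not any vendor's implementation: a rolling hash over the data decides where chunk boundaries fall, so identical segments produce identical chunks even when they sit at different byte offsets in different files. The Gear-style hash table, the roughly 8 KiB average chunk size, and the min/max limits are arbitrary choices of mine.

```python
import hashlib
import random
import sys

random.seed(1)  # fixed seed so chunk boundaries are reproducible between runs
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random values

MASK = 0x1FFF        # boundary when the low 13 hash bits are zero (~8 KiB avg)
MIN_CHUNK = 2048     # don't emit tiny chunks
MAX_CHUNK = 65536    # force a boundary eventually

def chunks(data):
    """Yield variable-length chunks; boundaries depend only on nearby bytes,
    not on absolute offsets, so shifted-but-identical data still matches."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF  # bytes older than ~32 shift out
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def estimate(paths):
    """Hash every chunk; bytes in chunks already seen would dedupe away."""
    seen, total, unique = set(), 0, 0
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        total += len(data)
        for chunk in chunks(data):
            digest = hashlib.sha1(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                unique += len(chunk)
    return total, unique

if __name__ == "__main__":
    total, unique = estimate(sys.argv[1:])
    if total:
        print(f"raw: {total} bytes, unique chunk data: {unique} bytes "
              f"({unique / total:.1%} of original)")
```

Saved as, say, chunk_estimate.py and run over a sample directory (`python chunk_estimate.py sample/*.pdf`), it gives a ballpark figure for how much a variable-length approach could reclaim on our data, which is roughly the question I'd want a product to answer at scale.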
I came across Ocarina Networks the other day, which looks like exactly what we need.
Storage Switzerland's Lab Report Overview - The Deduplication of Primary Storage describes Ocarina Networks and NetApp as "two of the leaders in primary storage deduplication".
Ideally we'd like to continue using our current NAS, but much more efficiently.
The other solution I've come across is Storwize, which seems to perform inline compression of individual files and to integrate with existing dedupe solutions.
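For context on the compression side, here is a quick, vendor-neutral check of how compressible a sample of our PDFs is with plain per-file deflate (Python's zlib at level 6, both arbitrary choices of mine); it says nothing about what Storwize or anyone else would actually achieve inline, it just gives me a baseline to compare against dedupe numbers.

```python
import sys
import zlib

def compression_ratio(paths):
    """Compress each file independently and total up raw vs. compressed size."""
    raw, packed = 0, 0
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        raw += len(data)
        packed += len(zlib.compress(data, 6))  # per-file, no cross-file sharing
    return raw, packed

if __name__ == "__main__":
    raw, packed = compression_ratio(sys.argv[1:])
    if raw:
        print(f"raw: {raw} bytes, compressed: {packed} bytes "
              f"({packed / raw:.1%} of original)")
```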
What other solutions and informational resources are there?