I'm exploring options for making more efficient use of our primary storage.
Our current NAS is an HP ProLiant DL380 G5 with an HP StorageWorks MSA20, plus one other disk shelf whose model I'm not sure of.
The vast majority of our files are PDF files (hundreds of millions of them), with a high degree of similarity.
In an expert opinion from George Crump (referenced from Data Domain's Dedupe Central), in the section on granularity, he says: "To be effective data de-duplication needs to be done at a sub file level using variable length segments."
This is hard to find, yet it is exactly what I need. Most dedupe options seem to be block-based, which works really well for minimizing how much space backups take up, since only the changed blocks get stored; however, block-based techniques do not find identical segments located at different offsets within the blocks of our PDFs.
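To make sure I understand what I'm asking for, here is my own rough sketch of variable-length (content-defined) chunking in Python, not any vendor's implementation: a rolling hash over the data decides where chunk boundaries fall, so identical segments produce identical chunks even when they sit at different byte offsets in different files. The Gear-style hash table, the roughly 8 KiB average chunk size, and the min/max limits are arbitrary choices of mine.

```python
import hashlib
import random
import sys

random.seed(1)  # fixed seed so chunk boundaries are reproducible between runs
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random values

MASK = 0x1FFF        # boundary when the low 13 hash bits are zero (~8 KiB avg)
MIN_CHUNK = 2048     # don't emit tiny chunks
MAX_CHUNK = 65536    # force a boundary eventually

def chunks(data):
    """Yield variable-length chunks; boundaries depend only on nearby bytes,
    not on absolute offsets, so shifted-but-identical data still matches."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF  # bytes older than ~32 shift out
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def estimate(paths):
    """Hash every chunk; bytes in chunks already seen would dedupe away."""
    seen, total, unique = set(), 0, 0
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        total += len(data)
        for chunk in chunks(data):
            digest = hashlib.sha1(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                unique += len(chunk)
    return total, unique

if __name__ == "__main__":
    total, unique = estimate(sys.argv[1:])
    if total:
        print(f"raw: {total} bytes, unique chunk data: {unique} bytes "
              f"({unique / total:.1%} of original)")
```

Saved as, say, chunk_estimate.py and run over a sample directory (`python chunk_estimate.py sample/*.pdf`), it gives a ballpark figure for how much a variable-length approach could reclaim on our data, which is roughly the question I'd want a product to answer at scale.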
I came across Ocarina Networks the other day, which looks like exactly what we need.
Storage Switzerland's Lab Report Overview - The Deduplication of Primary Storage describes Ocarina Networks and NetApp as "two of the leaders in primary storage deduplication".
Ideally we'd like to continue using our current NAS, but much more efficiently.
The other solution I've come across is Storwize, which seems to perform inline compression of individual files and to integrate with existing dedupe solutions.
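For context on the compression side, here is a quick, vendor-neutral check of how compressible a sample of our PDFs is with plain per-file deflate (Python's zlib at level 6, both arbitrary choices of mine); it says nothing about what Storwize or anyone else would actually achieve inline, it just gives me a baseline to compare against dedupe numbers.

```python
import sys
import zlib

def compression_ratio(paths):
    """Compress each file independently and total up raw vs. compressed size."""
    raw, packed = 0, 0
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        raw += len(data)
        packed += len(zlib.compress(data, 6))  # per-file, no cross-file sharing
    return raw, packed

if __name__ == "__main__":
    raw, packed = compression_ratio(sys.argv[1:])
    if raw:
        print(f"raw: {raw} bytes, compressed: {packed} bytes "
              f"({packed / raw:.1%} of original)")
```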
What other solutions and informational resources are there?