
Do any of the commercial IaaS object stores (S3, Azure Blob Storage, etc.) avoid charging multiple times for storing duplicate data (identical files, or parts of files)? For instance, we have a 15 TB dataset of tweets and one of our team members wants to make a copy and then make a few modifications to the data. Will we be charged for 30 TB of storage?

Is there a good way to find duplicate chunks on these large object stores, or to compress large datasets in-place? Can we replace duplicate files with some kind of symlinks?
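For concreteness, the naive thing we could script ourselves would be to scan the bucket and group objects by content hash, roughly like the sketch below (assuming boto3; the bucket name is a placeholder, and an S3 ETag only equals the MD5 of the content for non-multipart uploads, so matches are candidates rather than proof):

    # Rough sketch: list everything in the bucket and group objects by ETag to
    # spot byte-identical files. Assumes boto3; "tweets-dataset" is a placeholder.
    # Caveat: an ETag equals the MD5 of the content only for non-multipart
    # uploads, so treat matches as candidates rather than proof.
    from collections import defaultdict

    import boto3

    def find_duplicate_objects(bucket):
        s3 = boto3.client("s3")
        groups = defaultdict(list)
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                groups[obj["ETag"]].append((obj["Key"], obj["Size"]))
        # Keep only ETags that occur more than once.
        return {etag: objs for etag, objs in groups.items() if len(objs) > 1}

    if __name__ == "__main__":
        dupes = find_duplicate_objects("tweets-dataset")
        wasted = sum(objs[0][1] * (len(objs) - 1) for objs in dupes.values())
        print(f"~{wasted / 1e12:.2f} TB stored more than once")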

Jedi

2 Answers


Storage providers (at least AWS, Google and Microsoft) don't do deduplication or compression on blob objects. Doing so would lead to unpredictable delays, increased jitter and increased RAM consumption on their side. Not to mention that it would be hard to design a sensible billing model for such a scenario, and deduplicating objects across several servers/availability zones is a huge technological challenge.

You can implement compression on your end. Deduplication is harder, because you will need to maintain middleware that keeps hash tables of the content you have already stored, and so on.
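As a rough sketch of what that middleware could look like (assuming boto3; the bucket name, key prefix and in-memory manifest are placeholders, and in practice the manifest would need to live somewhere durable such as a database), you can compress on the client and use a content hash as the object key, so identical files collapse into a single stored and billed blob:

    # Sketch of client-side compression plus content-addressed keys: identical
    # content hashes to the same key and is therefore uploaded and stored once.
    # Bucket name and key prefix are placeholders.
    import gzip
    import hashlib

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "tweets-dataset"   # placeholder
    MANIFEST = {}               # logical path -> content hash (the "hash table")

    def put_deduped(path, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        key = f"blobs/{digest}.gz"
        try:
            s3.head_object(Bucket=BUCKET, Key=key)      # already stored?
        except ClientError:
            s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(data))
        MANIFEST[path] = digest  # behaves like a symlink to the shared blob

    def get_deduped(path) -> bytes:
        key = f"blobs/{MANIFEST[path]}.gz"
        return gzip.decompress(s3.get_object(Bucket=BUCKET, Key=key)["Body"].read())

Note that this only deduplicates whole files; deduplicating parts of files would mean splitting them into chunks and hashing each chunk separately, which is essentially what dedicated dedup products do.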

Another approach is to use ZFS on your EC2 instances instead of S3. You can attach EBS volumes and put them in a ZFS pool, and ZFS has built-in support for compression and deduplication. If you need those files/objects on several EC2 instances, you can export the ZFS dataset as an NFS share. Once again, deduplication will require a significant amount of extra RAM.
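A minimal sketch of that setup, assuming two attached EBS volumes at /dev/xvdf and /dev/xvdg and ZFS already installed on the instance (device and pool names will differ; a commonly quoted rule of thumb is on the order of 5 GB of RAM per TB of deduplicated data):

    # Sketch of the ZFS-on-EBS setup: one pool over attached EBS volumes with
    # compression, dedup and NFS sharing turned on. Device names and the pool
    # name are placeholders; run as root on the EC2 instance.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    EBS_DEVICES = ["/dev/xvdf", "/dev/xvdg"]            # assumed attachment points

    run(["zpool", "create", "tank"] + EBS_DEVICES)      # pool spanning the volumes
    run(["zfs", "set", "compression=lz4", "tank"])      # cheap inline compression
    run(["zfs", "set", "dedup=on", "tank"])             # block-level dedup (RAM hungry)
    run(["zfs", "set", "sharenfs=on", "tank"])          # export to other instances

You can check the actual savings later with zfs get compressratio tank and the DEDUP column of zpool list.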

Sergey Kovalev
  • Do you mean that they don't expose deduplication to end users (which is possible), or are you suggesting that they don't use dedup even under the hood (which is probably untrue, even when you consider cross-AZ/region storage)? Why would a lack of exposed dedup in object stores increase RAM usage on an EC2 instance? – Jedi Dec 23 '16 at 06:36
  • I'm pretty sure they don't use deduplication under the hood. In distributed environments it increases latency, which is exactly the opposite of AWS's goals. When I said "increased RAM consumption", I was talking about AWS servers, not EC2 instances. An EC2 instance only faces extra RAM consumption when it runs its own deduplication solution, like ZFS or StorReduce. – Sergey Kovalev Dec 23 '16 at 07:19

You can use on-site deduplication, which some backup solutions can perform (Veeam, for example: https://www.veeam.com/hyper-v-vmware-backup-deduplication-compression.html), and push the already-deduplicated data to the cloud, saving network bandwidth as well as storage. It is especially useful if point-in-time recovery is critical.

We have quite a large number of VMs running in production at the moment, using Veeam and StarWind, so I think it is a similar case. We also tested other solutions, e.g. MS DPM and Backup Exec, but Veeam showed better results.

Strepsils
  • Totally agree, Veeam is a polished and mature backup solution. It does the job exceptionally well. – Stuka Dec 30 '16 at 15:19