9

How do companies that handle large amounts of data, for example Google or Facebook, back up everything?

According to this Google platform article on Wikipedia, Google has an estimated 450,000+ servers, each with an 80+ GB hard disk. That's a lot of data. Do they really keep 1+ GB of backup for every 1 GB of data?
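As a rough back-of-envelope (just a sketch based on those estimated figures, which may well be out of date):

```python
# Back-of-envelope scale estimate using the (estimated) figures above:
# 450,000 servers with an 80 GB disk each -- purely illustrative numbers.
servers = 450_000
disk_gb = 80

total_gb = servers * disk_gb
print(f"Raw capacity: ~{total_gb / 1e6:.0f} PB")          # ~36 PB

# A naive 1:1 backup would need the same amount again.
print(f"Naive 1:1 backup: ~{total_gb / 1e6:.0f} PB more")
```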

Olivier Lalonde
  • 713
  • 3
  • 13
  • 20
  • I doubt Google backs up the server software, as they seem to be able to build a server from bare metal very quickly. They do seem to have backup copies of user data. – BillThor Nov 18 '10 at 02:53
  • Well, Google has more than 1 million servers (from 2007): http://www.pandia.com/sew/481-gartner.html – Kedare Jan 08 '11 at 00:13
  • I think you make ONE fundamental mistake: Google has a LOT of servers, all being SIMILAR. Nodes of X servers serving the index. You do not back up the same index a million times. – TomTom Jan 09 '13 at 22:57

3 Answers

8

It depends on what your purpose is.

If you're looking at backups for disaster recovery (server exploded, datacentre burnt down, etc.), then the short answer is they may not do backups at all. We have a client who deals in sensitive government data, and part of their mandate is that we are not permitted to take backups or to copy data onto removable media. We are permitted live replication to a DR site, and that's it. Both sites are covered by the same level of physical and logical security. The catch here is that if I screw something up on Site A, it's replicated to Site B almost instantly.

If you're talking about backups from a data integrity point of view (e.g. you accidentally dropped the Customers table and it's already replicated to the DR site), then LTO-5 tapes in a big tape library are often the way to go. With up to 3 TB per tape and multiple tapes in a tape library, you can quickly back up vast amounts of data (quick here refers to throughput; it may still take many, many hours to back up 25 TB of data).
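To put "many, many hours" in numbers, here is a rough sketch (assuming LTO-5's commonly quoted native transfer rate of roughly 140 MB/s per drive; real throughput depends on compression and on keeping the drives streaming, and the drive count is a made-up example):

```python
# Rough estimate of the time needed to stream 25 TB to LTO-5 tape.
# 140 MB/s is the commonly quoted native LTO-5 rate; the drive count
# is an illustrative example of a library writing in parallel.
data_tb = 25
drive_mb_per_s = 140
drives = 4

total_mb = data_tb * 1_000_000
hours = total_mb / (drive_mb_per_s * drives) / 3600
print(f"~{hours:.1f} hours with {drives} drives")   # ~12.4 hours
```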

Any decent backup suite will do high compression and de-duplication, which vastly reduces the amount of storage space required. I saw an estimate for a compressed and de-duplicated Exchange backup tool once that claimed a 15:1 ratio (15 GB of data stored in 1 GB of backups).

I very much doubt Google bother with backups for a lot of their search engine data, because most of it is replaceable, and it's distributed so far and wide that even if they lose a significant portion of a datacentre, or perhaps an entire one, the system stays online thanks to failover BGP routes.


Actually, it looks like Google do back up a metric crap-ton of data onto tape, which isn't quite what I was expecting:

[Image: Part of the Google tape library]

Mark Henderson
  • 68,316
  • 31
  • 175
  • 255
2

Most of their data is stored on their own GFS filesystem. GFS splits files into 64 MB chunks and requires that there are at least three copies of every chunk. That said, I don't think they bother with backups, as they have at least three copies of every file, and chunks on a failing node can quickly be replaced by replicating the data from either of the two remaining good copies to a new node.

For more information, take a look at http://labs.google.com/papers/gfs.html
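As a minimal sketch of what that replication means for raw storage (the 64 MB chunk size and threefold replication come from the GFS paper; the file size below is just an example):

```python
import math

# GFS-style accounting: files are split into 64 MB chunks and each
# chunk is stored on (at least) three chunkservers.
CHUNK_MB = 64
REPLICAS = 3

def raw_storage_mb(file_mb: float) -> int:
    """Raw space consumed across the cluster for one file."""
    chunks = math.ceil(file_mb / CHUNK_MB)
    return chunks * CHUNK_MB * REPLICAS

# Example: a 1 GB file occupies 16 chunks, so ~3 GB of raw disk.
print(raw_storage_mb(1024))   # 3072
```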

ipozgaj
  • 1,061
  • 10
  • 10
  • 1
    Redundancy increases availability, but it isn't exactly a backup (and you didn't call it that) because it is easy to overwrite. – Tobu Nov 24 '10 at 23:17
  • Yes, that's a good point. My point was merely that they probably *don't need* backups for most of their data. – ipozgaj Nov 25 '10 at 07:03
0

farseeker's answer is good, but I think it could be clarified by thinking about it from this perspective: what are you trying to restore? Is it for DR? What's the required recovery time? As an example, suppose your company relies on a 25 TB SQL Server database. In case of a data failure or error (dropped table, corrupted DB, etc.) the CTO wants to be able to recover the database in under an hour. In case of a site failure, the requirement is two hours.

On the face of it this sounds difficult, but it's not impossible. Since you know your backup strategy has to recover in an hour, you know that you are not going to be restoring full backups; you are going to have to work with the DBA teams to ensure that the DB is partitioned into manageable chunks. You are also going to be doing frequent transaction-log backups. For DR, you should be looking at a replication strategy (maybe a time-delayed version, with log data replicated in real time but not applied). As farseeker said, it depends on the purpose, and that purpose should be to do some form of recovery.
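To see why partitioning and parallelism matter for that one-hour target, here is a quick sanity check (hypothetical numbers; the point is the required throughput, not the exact figures):

```python
# How fast must a restore run to bring 25 TB back within a 1-hour RTO?
db_tb = 25
rto_hours = 1

full_restore_gb_per_s = db_tb * 1000 / (rto_hours * 3600)
print(f"Full restore needs ~{full_restore_gb_per_s:.1f} GB/s sustained")   # ~6.9 GB/s

# If only one damaged 2 TB partition (plus transaction-log replay)
# has to be restored, the target is far more realistic.
partition_tb = 2
print(f"Single partition needs ~{partition_tb * 1000 / 3600:.2f} GB/s")    # ~0.56 GB/s
```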

Jim B
  • 23,938
  • 4
  • 35
  • 58