How are sites like Pinterest able to hold all those pictures?

4

2

Does anyone know how sites hold massive numbers of pictures in general? I tried researching this, but it seems like they would need massive amounts of storage space to hold them all, unless there is a trick to it. I'm sure they compress them, but that's still a huge amount of data to hold for one site.

user105651

Posted 2013-10-22T04:26:10.390

http://www.howstuffworks.com/pinterest.htm Haven't you seen this? – BlueBerry - Vignesh4303 – 2013-10-22T04:39:46.713

4 Yes, they just store them. Storage is cheap relative to CPU and memory. – Paul – 2013-10-22T04:39:49.083

4 One thing that a site like Pinterest can obviously do is deduplication: store each image only once, no matter how many people's pages it appears on. – Michael Borgwardt – 2013-10-22T05:46:17.083

My guess is, more than 1 hard drive! :) – Dave – 2013-10-22T07:52:45.457
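The deduplication idea from the comments can be sketched in a few lines. This is only a toy illustration, not how Pinterest actually does it: the `DedupStore` class, the user names, and the use of SHA-256 as the content key are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}  # content digest -> image bytes (stored once)
        self.pins = {}   # (user, pin name) -> content digest

    def add(self, user, pin_name, data):
        # Identical image bytes always produce the same digest.
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.pins[(user, pin_name)] = digest
        return digest

    def get(self, user, pin_name):
        return self.blobs[self.pins[(user, pin_name)]]


store = DedupStore()
cat = b"pretend these are JPEG bytes"
store.add("alice", "cute cat", cat)
store.add("bob", "lol", cat)  # same picture pinned by a second user

print(len(store.pins), "pins,", len(store.blobs), "stored image")
```

Two users pin the same picture, but its bytes are stored only once; each pin is just a small reference to the shared copy.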

Answers

5

When it comes to storing large amounts of data, content providers often use storage area networks (SANs) and SAN storage hardware.

From Wikipedia:

A storage area network (SAN) is a dedicated network that provides access to consolidated, block level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system. A SAN typically has its own network of storage devices that are generally not accessible through the local area network by other devices. The cost and complexity of SANs dropped in the early 2000s to levels allowing wider adoption across both enterprise and small to medium sized business environments.

So what does a SAN storage device look like? Some look like this:

[image: SAN storage device]

Every one of those slots (the red square is the drive eject button) is a hard disk drive. The one in the picture is a relatively small SAN storage device; others can look quite different and be much larger.

Where I used to work, our SAN storage consisted of boxes roughly the size of 3 refrigerators side by side, filled with hard drives. We took those drives as needed and created RAID arrays for redundancy. As we needed more space, we could order more SAN storage devices and attach them to our storage area network. This allowed us to have petabytes of redundant storage.

Sites like Flickr, Picasa, Facebook, etc., have very large SANs filling massive datacenters.

Keltari

Posted 2013-10-22T04:26:10.390

Reputation: 57 019

2

From your question I gather that you do not have a Computer Science background, so I'll avoid throwing around geeky-sounding terms.

Popular websites handling extremely large amounts of data (or traffic) are nothing new or unique. Usually there is no trickery in the form of massive compression, since most pictures uploaded as JPEG are already highly compressed and further compression often loses detail. What does go into it is some clever architecture, lots and lots of computers, a fast and reliable network, and of course many terabytes (or even petabytes) of storage. Actually, storage is often the least of the issues: storage and compute power are pretty inexpensive these days.

What happens is that data is often distributed (as several copies) across multiple computers, for redundancy and faster retrieval, and seeking or searching the data happens in parallel. Other techniques include keeping frequently used data closer to the edge of the network (nearer to users) and refreshing that data based on usage.
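As a rough sketch of "several copies across multiple computers", the snippet below decides which servers should hold each picture. It uses rendezvous (highest-random-weight) hashing, which is just one possible placement scheme and not necessarily what any particular site uses. The server names, the replica count of 3, and the file name are made up for the example.

```python
import hashlib

SERVERS = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
REPLICAS = 3  # assumed number of copies kept of every picture


def score(server, key):
    """Deterministic pseudo-random weight for a (server, key) pair."""
    digest = hashlib.sha256(f"{server}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(key, servers=SERVERS, n=REPLICAS):
    """Return the n servers that should each hold a copy of this key."""
    ranked = sorted(servers, key=lambda s: score(s, key), reverse=True)
    return ranked[:n]


holders = replicas_for("photo-12345.jpg")
print(holders)
```

Every front-end machine can compute the same answer without asking a central directory, and a read can go to any of the three holders, which is where the redundancy and the faster retrieval come from.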

Some geeky keywords that are often used, and might be seen as wizardry are:

  • Multi-level caching
  • Distributed storage
  • Data Warehousing
  • NoSQL
  • Map-Reduce
  • Data sharding (mostly in the SQL world)
  • Parallel processing
  • CDN (Content Delivery Networks)

How to do this well, efficiently, and effectively is a field of study and research within computer science and computer architecture. Different techniques are used depending on the nature of the data, the nature and frequency of access (more writes versus more reads), the kind of reliability required, etc.

Edited: This picture of Google's server rack (from 1999) is epic. Note the exposed hard drives (3-4 of them) in the middle of each server "tray" (especially the one labelled "g61").

[image: Google server rack, 1999]

and the full journey is captured in this post here:

jay

Posted 2013-10-22T04:26:10.390

Reputation: 226

Don't have enough rep to leave a comment on @Keltari's otherwise nice answer, so I will comment here. SANs are a popular, more common, and slightly more traditional approach that many big enterprises and their IT departments prefer. There is, however, an alternative approach that was popularized by the likes of Google and Yahoo. Roughly speaking, this approach is based on having a huge cluster of servers (pizza boxes), each with large attached storage, where each server contributes not only storage space but also computation power. That distributed computation is used to... – jay – 2013-10-22T06:07:56.690

...break down complex search and lookup operations into smaller operations that are spread across the cluster and run in parallel. The partial results are then combined to form the answer to the more complex question. This is typical of search networks and "read-heavy" operations. Note that these days Google, Yahoo, and the like do use SANs too. Those organizations are far too complex, and have grown too rapidly, to stick to a single technology for storage or computing. In the end, it boils down to using the right tool for the right job. – jay – 2013-10-22T06:12:05.223
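The split-then-combine pattern described in these comments can be illustrated with a small scatter-gather sketch. It is a toy under stated assumptions: the three "shards" are plain in-memory lists standing in for separate servers, the file names are invented, and a thread pool stands in for real network calls.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the files held by one server (hypothetical data).
SHARDS = [
    ["cat.jpg", "dog.jpg", "catalog.png"],
    ["sunset.jpg", "bobcat.jpg"],
    ["tree.png", "car.jpg"],
]


def search_shard(shard, term):
    """The small operation: search a single server's files."""
    return [name for name in shard if term in name]


def search(term, shards=SHARDS):
    """Scatter the query to every shard in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        return sorted(name for part in partials for name in part)


print(search("cat"))  # ['bobcat.jpg', 'cat.jpg', 'catalog.png']
```

Each shard only scans its own slice of the data, so adding servers adds both storage space and search capacity.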

1

They can't compress the photos, because photos are almost certainly already compressed, either with JPEG or PNG compression, and it's not possible to compress already-compressed data. (That's oversimplifying it a bit, but unless you want to get deep into information theory, just accept that as a given.)

There's really no shortcut. A site that holds massive amounts of data has massive numbers of computers to hold it on.

Let's say an image weighs in at 1 MB. There are plenty that are bigger, and plenty that are smaller, but just for simplicity's sake let's say the average image is 1 MB. It's not hard to find affordable 2 TB drives these days, which means that each drive could theoretically hold around 2 million images. (Obviously there will be some space lost to overhead, but you get the idea.)
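The back-of-the-envelope arithmetic above is easy to reproduce. The sketch below assumes decimal units (1 MB = 10^6 bytes, 2 TB = 2 × 10^12 bytes) and ignores filesystem overhead; the one-billion-image collection and the 3-copy figure are hypothetical numbers chosen only to show the scale.

```python
IMAGE_BYTES = 1_000_000    # 1 MB average image, as assumed above
DRIVE_BYTES = 2 * 10**12   # one 2 TB drive (decimal units, no overhead)

images_per_drive = DRIVE_BYTES // IMAGE_BYTES


def drives_needed(image_count, copies=1):
    """Drives required to hold image_count images, each stored `copies` times."""
    total = image_count * copies
    return (total + images_per_drive - 1) // images_per_drive  # round up


print(images_per_drive)                          # 2000000
print(drives_needed(1_000_000_000))              # 500
print(drives_needed(1_000_000_000, copies=3))    # 1500
```

Even a hypothetical billion images kept in triplicate fits on about 1,500 drives, which is a large but entirely ordinary amount of hardware for a datacenter.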

A server can have a RAID configuration set up with multiple hard drives. Some of the capacity is lost to redundancy, but even so you can have several TB worth of drives per computer. And a server farm can hold dozens, hundreds, or even thousands of servers. That's how sites like Pinterest and Facebook manage so much content.

They tend to have massive server farms, with computers in front of them that route requests from Web browsers, looking up the content in the appropriate place in the server farm and serving it back to the user. It's a really big topic to try to cover here, but that's the basic idea.

Mason Wheeler

Posted 2013-10-22T04:26:10.390

Reputation: 942

1 The number of computers has nothing to do with storage capacity or capability. – Keltari – 2013-10-22T05:14:12.520

1 @Keltari Sure it does: you can only hook up so many drives to a controller, and only so many controllers to a server. There is a finite amount of space a single server can handle, so it needs to be distributed over several servers. – Richie Frame – 2013-10-22T05:53:52.497

@Richie: That's one reason you use SANs - you no longer need a controller for "so many drives", you only need an interface card to talk to the SAN, and the server need not care how many drives the SAN contains. Of course, there may still be a maximum partition size in the OS, but that is typically very high. – sleske – 2013-10-22T07:29:56.490

@RichieFrame, you are correct that a server can physically hold only so many drives. However, this is the reason for SANs and NAS devices: storage is abstracted, which allows it to be virtually limitless. Servers can contain 0 drives and simply be connected to a SAN or NAS. – Keltari – 2013-10-22T07:38:08.030

In all fairness, a SAN really is just a bunch of computers, each of which can hold a huge number of disks. They're built especially for that purpose. But even before SANs became popular, some servers could hold well over 100 disks. – MSalters – 2013-10-22T07:46:06.250