Best practice for storing really large amounts of user uploaded images

1

We currently have a Django-powered website that allows users to upload a lot of images, all of which are stored on a single hard drive on our server. The problem is that we are slowly reaching the maximum capacity of the available hard drives, so vertical scaling is no longer an option.

As far as I know, Amazon S3/CloudFront have no such limit; however, for high-traffic sites these services are far more expensive than our own server rack. Is there a best practice for splitting the uploads across several disks in our own environment?
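For reference, one way to interpret "splitting the uploads across several disks" at the application layer is to hash each filename to one of several mount points, one per disk. A minimal sketch, assuming four hypothetical mount points `/mnt/disk0` through `/mnt/disk3`:

```shell
# Choose a shard directory by hashing the file name, so uploads spread
# roughly evenly across disks. The mount points are placeholders.
f="photo_12345.jpg"
hash=$(printf '%s' "$f" | md5sum | cut -c1-2)  # first hex byte of the digest
shard=$(( 0x$hash % 4 ))                       # map it to one of 4 disks
echo "/mnt/disk$shard/uploads/$f"
```

Because the hash is deterministic, the same filename always resolves to the same disk, so reads need no lookup table; the trade-off is that adding a disk later changes the mapping unless you rehash or use consistent hashing.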

Simon Steinberger

Posted 2013-12-06T22:16:44.013

Reputation: 280

Answers

2

This is bad: in a server environment where the data matters, you should at least use RAID to mitigate the significant risk of disk failure. RAID is also an answer to your storage problem, since an array can increase the capacity of your storage. (RAID combines multiple disks into a single virtual disk, with varying performance characteristics and degrees of redundancy.)
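On Linux, a software RAID array along these lines can be built with mdadm. An illustrative-only sketch (the device names are placeholders, and these commands destroy any existing data on the disks involved):

```shell
# Build a 4-disk RAID6 array: capacity of 2 disks, survives 2 failures.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0                # put a filesystem on the array
mount /dev/md0 /var/www/uploads   # mount it where the app expects uploads
cat /proc/mdstat                  # watch the initial sync / array health
```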

There are also other technologies you really should know about and use. You have not specified your OS, but hopefully it's a Linux variant, in which case you should be looking at LVM, which handles disk management and, among other things, can merge multiple disks into a single virtual disk below the filesystem level.
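A hedged sketch of that LVM approach, merging two extra disks into one volume and growing it later when a third arrives. The device names, volume names, and mount point are all hypothetical, and these commands wipe the disks they touch:

```shell
pvcreate /dev/sdb /dev/sdc                     # mark disks as LVM physical volumes
vgcreate uploads_vg /dev/sdb /dev/sdc          # pool them into one volume group
lvcreate -l 100%FREE -n uploads_lv uploads_vg  # one logical volume spanning both
mkfs.ext4 /dev/uploads_vg/uploads_lv
mount /dev/uploads_vg/uploads_lv /var/www/uploads

# Later, when capacity runs low, add another disk without downtime:
pvcreate /dev/sdd
vgextend uploads_vg /dev/sdd                   # grow the pool
lvextend -l +100%FREE /dev/uploads_vg/uploads_lv
resize2fs /dev/uploads_vg/uploads_lv           # grow the filesystem online
```

Note that plain LVM concatenation adds no redundancy; it is usually layered on top of RAID (mdadm or hardware) rather than used instead of it.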

Of course, you can also look at things like SANs, which typically take a number of disks and present them as a single large external disk.

davidgo

Posted 2013-12-06T22:16:44.013

Reputation: 49 152

RAID is not a backup solution. If they are already doing proper backups to off-site or at least out-of-band offline storage, such as tape or at least another server, RAID is not going to help them all that much. I'm not saying RAID is bad -- redundancy can certainly improve uptime -- but it is not a good strategy for maintaining data integrity and data retention. Backups are the only way to get any confidence that the data will be retained. After all, recursively shredding your files with the shred command works equally well on RAID and non-RAID configurations. – allquixotic – 2013-12-06T22:49:48.743

RAID is not a backup solution, and I never said otherwise. That said, the difference between RAID and no RAID in a server environment is huge: with RAID, if a disk fails you just swap it out. With hot-swappable drives there is no downtime and no lost data while the disk is failing, as opposed to the risk of data corruption and many hours of downtime for a rebuild. – davidgo – 2013-12-06T23:02:40.123

So yes, it does maintain data integrity, but it's not a panacea or an alternative to a robust backup system. In any server environment worth having, the extra cost of a $300-or-less hard disk is more than covered by the first drive failure -- and remember that the failure rate for new drives is between 1.5% and 13% per annum (realistically I'd say around 5% per year) depending on the disk (see http://www.pcworld.com/article/129558/article.html), so on a multiple-disk setup a failure at some point is almost guaranteed. – davidgo – 2013-12-06T23:03:05.643

Thanks! I've accepted this answer, because it appears to have been posted first. Second answer is great, too! – Simon Steinberger – 2013-12-06T23:50:27.250

So the answer that was posted first should get accepted? That's odd... I thought they should be judged on their own merits, not when they were posted. Anyway, I was done writing mine and posted it less than a minute after his, and I hadn't even read his answer when I posted mine. Regardless, you can choose whichever answer you like, it makes no difference to me. I just found it odd that you would reason that "the first answer should be the one that's accepted". If you think his answer is better, well then that's perfectly valid. – allquixotic – 2013-12-07T05:11:42.673

3

Assuming that you want to avoid cloud-based services, the traditional large enterprise approach is to procure hardware or software that can merge many separate disks into a single logical filesystem. There are many possible ways to do this. I will enumerate a few:

  • Distributed filesystems such as GlusterFS allow you to have multiple servers, each with its own CPU, RAM, and storage, while sharing a single logical filesystem among all of them.

  • You can also take this distributed concept a step further and cluster the entire system, soup to nuts, so that it appears that you're running one logical computer, when in fact it's a series of networked computers closely joined at the hip (preferably via some very high-speed networking).

  • You can save on buying motherboards, chassis, CPUs, RAM, etc. by procuring a "storage server": a moderately powerful enterprise-grade server attached to many hard disks, either installed directly in the chassis or connected via Fibre Channel or SAS to an external storage rack, sometimes holding 60 or more drives. In these configurations, the disks are usually joined into one logical device using a hardware RAID controller or backplane. Of course, this method eventually reaches a maximum capacity once every bay is filled at the highest available disk density, at which point you could scale out with a filesystem-layer or system-layer cluster of these storage servers.
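The distributed-filesystem option in the first bullet can be sketched with GlusterFS. A hedged example of a two-node replicated volume; the hostnames, brick paths, and volume name are made up for illustration:

```shell
# Run on server1: join server2 to the trusted pool, then create a
# volume where every file is replicated on both nodes' bricks.
gluster peer probe server2
gluster volume create uploads replica 2 \
    server1:/data/brick1 server2:/data/brick1
gluster volume start uploads

# Any client (e.g. the Django host) then mounts one logical filesystem:
mount -t glusterfs server1:/uploads /var/www/uploads
```

Swapping `replica 2` for a plain distributed layout trades redundancy for capacity: files are then spread across the bricks instead of mirrored.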

Depending on the exact size of storage you expect to need within the next N years (where N is the number of years you're willing to plan ahead for), some of these solutions will be more expensive or harder to administer than others.

In the extreme example of needing many thousands of terabytes of redundant storage, on the scale of what Amazon S3 provides to its downstream customers, you pretty much have to have some sort of cluster system, usually with centralized infrastructure to manage it. In these cases, very fast inter-node networking is critical to maintaining good performance. Definitely look into 10G ethernet at a minimum.

Judging from the fact that you said you're currently running on a single hard drive, though, the most economical way to scale up from here without blowing your scale way out of proportion would be to buy a 2U or 3U server that can hold 4 to 8 hard drives, and stick a bunch of disks in there in RAID. RAID10, RAID5 and RAID6 are all fairly common configurations for this number of disks, but if you go with RAID5/RAID6, make sure you use a hardware RAID controller to avoid undue CPU load.

You can probably scale up to about 16 TB of usable storage (with redundancy) using this method and currently available disks, but be aware that larger-capacity disks also tend to be slower, with lower throughput and higher response times, which is why very high-traffic sites tend to use disks with smaller capacity -- which of course means you'll need more of them to achieve the same usable capacity. :/
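The arithmetic behind a figure like that 16 TB, assuming a hypothetical 8-bay server filled with 4 TB disks (both numbers are assumptions, not from the question):

```shell
# Rough usable capacity per RAID level for n disks of size_tb each.
n=8; size_tb=4
echo "RAID10: $(( n / 2 * size_tb )) TB usable"   # half the disks mirror the other half
echo "RAID5:  $(( (n - 1) * size_tb )) TB usable" # one disk's worth of parity
echo "RAID6:  $(( (n - 2) * size_tb )) TB usable" # two disks' worth of parity
```

So with these assumed numbers RAID10 yields 16 TB usable, RAID5 28 TB, and RAID6 24 TB; RAID10 costs the most capacity but rebuilds fastest.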

allquixotic

Posted 2013-12-06T22:16:44.013

Reputation: 32 256

Thanks allquixotic! Sorry I couldn't mark both answers as accepted! – Simon Steinberger – 2013-12-06T23:51:06.580