Assuming that you want to avoid cloud-based services, the traditional large enterprise approach is to procure hardware or software that can merge many separate disks into a single logical filesystem. There are many possible ways to do this. I will enumerate a few:
Distributed filesystems such as GlusterFS let you run multiple servers, each with its own CPU, RAM, and storage, while presenting a single logical filesystem shared among all of them.
You can also take this distributed concept a step further and cluster the entire system, soup to nuts, so that it appears that you're running one logical computer, when in fact it's a series of networked computers closely joined at the hip (preferably via some very high-speed networking).
You can save on buying motherboards, chassis, CPUs, RAM, etc. by procuring a "storage server", which is a moderately powerful enterprise-grade server attached to many hard disks -- either directly installed in the chassis, or connected via fibre channel or SAS to an external storage rack, sometimes holding 60 drives or even more. In these configurations, the hard disks are usually joined into one logical device using a hardware RAID controller or backplane. Of course, this method will eventually reach a maximum capacity once you have all the disks you can possibly fit in a single rack at the maximum disk density, in which case you could scale up further with a filesystem-layer or system-layer cluster of these storage servers.
Depending on the exact size of storage you expect to need within the next N years (where N is the number of years you're willing to plan ahead for), some of these solutions will be more expensive or harder to administer than others.
In the extreme example of needing many thousands of terabytes of redundant storage, on the scale of what Amazon S3 provides to its downstream customers, you pretty much have to have some sort of cluster system, usually with centralized infrastructure to manage it. In these cases, very fast inter-node networking is critical to maintaining good performance. Definitely look into 10G ethernet at a minimum.
Judging from the fact that you said you're currently running on a single hard drive, though, the most economical way to scale up from here without blowing your scale way out of proportion would be to buy a 2U or 3U server that can hold 4 to 8 hard drives, and stick a bunch of disks in there in RAID. RAID10, RAID5 and RAID6 are all fairly common configurations for this number of disks, but if you go with RAID5/RAID6, make sure you use a hardware RAID controller to avoid undue CPU load.
You can probably scale up to about 16 TB of usable storage (with redundancy) using this method and currently available disks, but be aware that larger-capacity disks also tend to be slower, with lower throughput and higher response times, which is why very high-traffic sites tend to use smaller-capacity disks... which of course means you'll need more of them to achieve the same usable capacity. :/
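To get a feel for how the RAID level affects usable capacity, here's a quick back-of-the-envelope sketch (the disk counts and sizes are illustrative, not a recommendation, and filesystem overhead is ignored):

```python
def usable_tb(disks, size_tb, level):
    """Rough usable capacity for common RAID levels."""
    if level == "raid10":
        return disks // 2 * size_tb   # half the disks mirror the other half
    if level == "raid5":
        return (disks - 1) * size_tb  # one disk's worth of parity
    if level == "raid6":
        return (disks - 2) * size_tb  # two disks' worth of parity
    raise ValueError(level)

# 8 x 4 TB disks in a 2U/3U server:
for level in ("raid10", "raid5", "raid6"):
    print(level, usable_tb(8, 4, level), "TB")
# raid10 16 TB, raid5 28 TB, raid6 24 TB
```

So an 8-bay server with 4 TB disks in RAID10 lands right around that 16 TB figure; RAID5/RAID6 buy you more usable space at the cost of slower parity writes (hence the hardware controller recommendation above).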
RAID is not a backup solution. If they are doing proper backups already to an off-site or at least out-of-band offline storage mechanism, such as tape or at least another server, RAID is not going to help them all that much. I'm not saying RAID is bad -- redundancy can certainly improve uptimes -- but it is not a good strategy for maintaining data integrity and data retention. Backups are the only way to get any confidence that the data will be retained. After all, recursively shredding your files with the shred command works equally well on RAID and non-RAID configurations. – allquixotic – 2013-12-06T22:49:48.743

RAID is not a backup solution, and I never said otherwise. That said, the difference between RAID and no RAID in a server environment is huge: with RAID, if a disk fails you just switch it out -- with hot-swappable drives there is no downtime -- and there is also no lost data while the disk is failing, as opposed to the risk of data corruption and many hours of downtime for a rebuild. – davidgo – 2013-12-06T23:02:40.123
So yes, it does maintain data integrity, but it's not a panacea or an alternative to a robust backup system. In any server environment worth having, the extra cost of a $300-or-less hard disk is more than covered by the first hard drive failure -- and remember that the drive failure rate (for new drives) is between 1.5% and 13% per annum (realistically I'd say 5% per year), depending on the disk [see http://www.pcworld.com/article/129558/article.html], so on a multiple-disk setup a failure at some point is almost guaranteed. – davidgo – 2013-12-06T23:03:05.643
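To see why failure becomes almost guaranteed as you add disks, here's a quick calculation assuming independent failures at that 5%-per-year rate (a simplifying assumption; real drives in one chassis share heat, vibration, and batch defects):

```python
# Chance that at least one drive in an array fails within a year,
# given an independent per-drive annual failure rate p.
p = 0.05  # 5% per drive per year, per the figures cited above

for n in (1, 4, 8, 16):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:2d} drives: {at_least_one:.1%} chance of at least one failure per year")
```

With 8 drives that's already roughly a one-in-three chance per year, and at 16 drives better than even odds -- which is exactly why hot spares and prompt rebuilds matter.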
Thanks! I've accepted this answer, because it appears to have been posted first. Second answer is great, too! – Simon Steinberger – 2013-12-06T23:50:27.250
So the answer that was posted first should get accepted? That's odd... I thought they should be judged on their own merits, not when they were posted. Anyway, I was done writing mine and posted it less than a minute after his, and I hadn't even read his answer when I posted mine. Regardless, you can choose whichever answer you like, it makes no difference to me. I just found it odd that you would reason that "the first answer should be the one that's accepted". If you think his answer is better, well then that's perfectly valid. – allquixotic – 2013-12-07T05:11:42.673