
The web application I'm working on will be used to upload and download a large number of smaller files: close to 1B files with a total size of more than 10 PB. I'm currently struggling to decide on a scalable architecture that can support such volumes. And here's my question: is there a way to build some sort of storage that a Windows server would see as one huge (10 PB and up) network storage drive, so I can write all the files to subfolders of that virtual drive? And how would it perform?

Right now I'm trying to understand if that's even possible, or if I have to implement software-level sharding: writing files to different drives based on some key.
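For concreteness, here is what such key-based sharding could look like: a minimal Python sketch, with hypothetical UNC drive paths, that hashes each file key to choose a drive and a two-level subfolder (so no single directory ends up holding a huge share of the ~1B files):

```python
import hashlib
import os
import shutil

# Hypothetical mount points / network shares; replace with real drives.
DRIVES = [r"\\storage01\files", r"\\storage02\files", r"\\storage03\files"]

def shard_path(file_key: str) -> str:
    """Map a file key to a drive and subfolder via a stable hash."""
    digest = hashlib.md5(file_key.encode("utf-8")).hexdigest()
    drive = DRIVES[int(digest, 16) % len(DRIVES)]
    # Two hex levels fan out into 65,536 subfolders per drive.
    return os.path.join(drive, digest[:2], digest[2:4], file_key)

def store(file_key: str, src_path: str) -> str:
    """Copy a local file to its sharded location (assumes filesystem-safe keys)."""
    dest = shard_path(file_key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copyfile(src_path, dest)
    return dest
```

One design caveat: adding a drive changes the modulus and remaps almost every key, which is why consistent hashing is the usual refinement once drives come and go.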

I'm a developer, not a sysadmin, so I apologize if this is a naive question, and thanks in advance for your patience in explaining possibly trivial things to me.

Andrey

  • I would research if SQL Server filestreams + NTFS can handle this. – alex Sep 02 '10 at 20:11
  • You mean, store files "in" SQL Server? – Andrey Sep 02 '10 at 20:40
  • the FileStream data type is a special kind of field. It does not store the data in the database (that would be **bad**); instead it's like holding the file on disk and the metadata in SQL Server, but integrated (see the sketch below). – Mark Henderson Sep 03 '10 at 05:36
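To make that comment concrete: a minimal sketch, assuming the pyodbc package and a hypothetical FILESTREAM-enabled table (none of these names come from the thread). Inserting VARBINARY data through T-SQL lets SQL Server keep the bytes in its NTFS FILESTREAM container while the table row holds the metadata:

```python
import pyodbc

# Hypothetical connection string and table. Assumes a FILESTREAM-enabled
# database containing something like:
#   CREATE TABLE Files (
#       Id   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
#       Name NVARCHAR(400),
#       Data VARBINARY(MAX) FILESTREAM)
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=dbhost;DATABASE=filestore;Trusted_Connection=yes"
)

def store_file(name: str, local_path: str) -> None:
    # SQL Server writes the VARBINARY payload to the FILESTREAM container
    # on NTFS; only the metadata lives in the table row.
    with open(local_path, "rb") as f:
        payload = f.read()
    cursor = conn.cursor()
    cursor.execute("INSERT INTO Files (Name, Data) VALUES (?, ?)", name, payload)
    conn.commit()
```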

4 Answers


As a 'normal but huge' fileserver:

With a file-like application-level library (see the sketch after this list):

  • Amazon S3
  • Rackspace Cloud Files
  • MogileFS

Generic key-value:
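For the application-level route above, a minimal upload/download sketch, assuming the boto3 package and a made-up bucket name; the other object stores listed expose a similar put/get interface:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-files"  # hypothetical; S3 stores billions of keys per bucket

def upload(file_key: str, src_path: str) -> None:
    # The key itself can encode a folder-like hierarchy, e.g. "ab/cd/<key>",
    # so no "virtual drive" with real subfolders is needed.
    s3.upload_file(src_path, BUCKET, file_key)

def download(file_key: str, dest_path: str) -> None:
    s3.download_file(BUCKET, file_key, dest_path)
```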

Javier

Check out how Backblaze stores its data. It's a very good read, and they have a blog post about the new 3 TB drives. This probably won't answer the question about the file system, and I'm not sure how Backblaze lays out its file structure, but it's good information nevertheless.

xeon
  • They have a nice design, but the key concern is performance: a single server with just one processor and 4 GB of memory can't be speedy serving 45 drives. But it's an interesting read, so +1 – Andrey Sep 03 '10 at 00:09
  • Ah - it can. Sorry. One processor is 6 cores these days. 4 GB of memory for a simple FS + Linux is enough. 45 discs is 25% of the capacity of a RAID card like Adaptec's 5xxx series. You should be more concerned with bandwidth (go go InfiniBand, 10 GbE here), which is going to be the bottleneck. – TomTom Sep 03 '10 at 03:31

Before you continue looking, you need to decide more exactly what kind of semantics you need. For instance, you say they're files: do you need POSIX file semantics (mostly consistency and locking) from the storage, or is the 'eventual consistency' of various distributed datastores enough? What are your I/O requirements, and how much concurrent access will there be? What are your redundancy requirements? Also: what kind of hardware are you going to use? 10 PB arrays don't grow on trees, and just managing them is a full-time job; that much hardware means failure is a normal event, so constant repair and replacement are needed.

From what you've said ("web application... storing files...") I think an OpenStack- or S3-style solution should suit you. Since you're mostly a developer, I'd suggest you actually use Amazon or Rackspace or whoever as your provider, unless you really want to get into the hardware management biz.

pjz

These days you might consider HDFS and the general Hadoop ecosystem.
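For illustration, writing and reading a file over WebHDFS; a minimal sketch assuming the HdfsCLI Python package (`hdfs`) and a hypothetical namenode address:

```python
from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint and user.
client = InsecureClient("http://namenode:9870", user="webapp")

# Push a local file into HDFS, then pull it back out.
client.upload("/data/files/ab/cd/somefile.bin", "somefile.bin")
client.download("/data/files/ab/cd/somefile.bin", "restored.bin")
```

One caveat at this scale: the namenode keeps every file's metadata in RAM, so ~1B small files is a known pressure point; packing small files into larger containers (HAR or sequence files) is the usual workaround.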

Xorlev