
I have a service that serves images to end users at a very high rate over plain HTTP. The images vary between 4 and 64 KiB, and there are 1,300,000,000 of them in total. The dataset is about 30 TiB in size, and changes (new objects, updates, deletes) make up less than 1% of the requests. The number of requests per second varies from 240 to 9,000 and is spread pretty much across the whole dataset, with few objects being especially "hot".

As of now, these images are files on an ext3 filesystem distributed read-only across a large number of mid-range servers. This poses several problems:

  • Using a filesystem is very inefficient: the metadata overhead is large, the inode/dentry cache is volatile on Linux, and some daemons tend to stat()/readdir() their way through the directory structure, which in my case becomes very expensive.
  • Updating the dataset is very time consuming and requires remounting between sets A and B.
  • The only reasonable way to handle backups, copies, and so on is to operate on the block device.

What I would like is a daemon that:

  • speaks HTTP (GET, PUT, DELETE and perhaps UPDATE)
  • stores data in an efficient structure (see the sketch after this list for the rough shape I have in mind).
  • The index should remain in memory, and considering the number of objects, the per-object overhead must be small.
  • The software should be able to handle a massive number of connections with little (if any) ramp-up time.
  • The index should be read into memory at startup.
  • Statistics would be nice, but not mandatory.
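
For illustration, a minimal sketch in Go of the rough shape I mean, assuming a 64-bit hex key in the URL path and a single flat data file (store.dat is a made-up name); index persistence, crash recovery, compaction of freed space and UPDATE are deliberately left out:

    // A sketch only: one flat data file plus an in-memory map from a
    // 64-bit key to (offset, length). All names here are illustrative.
    package main

    import (
        "io"
        "net/http"
        "os"
        "strconv"
        "sync"
    )

    type entry struct{ off, length int64 }

    type store struct {
        mu   sync.RWMutex
        idx  map[uint64]entry // ~16 bytes of payload per object
        file *os.File
        end  int64 // next append offset
    }

    func (s *store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        key, err := strconv.ParseUint(r.URL.Path[1:], 16, 64) // e.g. GET /1a2b3c
        if err != nil {
            http.Error(w, "bad key", http.StatusBadRequest)
            return
        }
        switch r.Method {
        case http.MethodGet:
            s.mu.RLock()
            e, ok := s.idx[key]
            s.mu.RUnlock()
            if !ok {
                http.NotFound(w, r)
                return
            }
            // The index lookup costs no I/O; this is the single read per object.
            buf := make([]byte, e.length)
            if _, err := s.file.ReadAt(buf, e.off); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.Write(buf)
        case http.MethodPut:
            body, err := io.ReadAll(r.Body)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            s.mu.Lock()
            off := s.end
            if _, err := s.file.WriteAt(body, off); err != nil {
                s.mu.Unlock()
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            s.end += int64(len(body))
            s.idx[key] = entry{off, int64(len(body))} // updates just re-point the key
            s.mu.Unlock()
            w.WriteHeader(http.StatusNoContent)
        case http.MethodDelete:
            s.mu.Lock()
            delete(s.idx, key) // space is leaked until a compaction pass
            s.mu.Unlock()
            w.WriteHeader(http.StatusNoContent)
        default:
            http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
        }
    }

    func main() {
        f, err := os.OpenFile("store.dat", os.O_RDWR|os.O_CREATE, 0644)
        if err != nil {
            panic(err)
        }
        s := &store{idx: make(map[uint64]entry), file: f}
        panic(http.ListenAndServe(":8080", s))
    }

With this layout a GET costs a map lookup plus one pread(2) on an already-open file descriptor, which is the single-IOP behaviour I'm after.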

I have experimented a bit with Riak, Redis, MongoDB, Kyoto and Varnish with persistent storage, but I haven't had the chance to dig in really deep yet.

Tommy

1 Answer


There is no magic solution for your needs. A NoSQL database is not really going to help; you need to make some basic decisions about your application architecture.

and some daemons tend to stat()/readdir() their way through the directory structure

Moving the data into any sort of database is not going to help unless these daemons shouldn't be reading the data in the first place. Wouldn't it just be simpler to reconfigure these or switch them off?

Without knowing anything about your application (no, that's not an invitation for a detailed specification of requirements), a hybrid approach is probably the way to go - with metadata held in a database while the content itself is kept on the filesystem (and there are some very specific reasons why a relational database may be a lot more appropriate than a NoSQL one). If it were me, I'd also be looking at distributing the storage rather than just replicating it.
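
As a toy illustration of the "distributing the storage" part (in Go; the node names are invented, and a real deployment would want consistent hashing so that adding a node doesn't remap most keys):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Each key deterministically maps to one owning node, so a thin
    // frontend can route requests without any central lookup table.
    var nodes = []string{"img01:8080", "img02:8080", "img03:8080"}

    func nodeFor(key string) string {
        h := fnv.New32a()
        h.Write([]byte(key))
        return nodes[h.Sum32()%uint32(len(nodes))]
    }

    func main() {
        fmt.Println(nodeFor("cat-4711.jpg")) // always the same node for a given key
    }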

The index should remain in memory

If you've got 1.3 billion records, each with, say, 300 bytes of metadata, you'll need about 390 GB of memory. Most of it will never be accessed, yet it will prevent that memory from being available for content caching.
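
To put numbers on that (a back-of-envelope in Go; the 16-byte figure is a hypothetical minimal entry of a 64-bit key plus offset and length, for comparison with the in-memory index you describe):

    package main

    import "fmt"

    func main() {
        const records = 1.3e9 // object count from the question
        const gib = 1 << 30
        fmt.Printf("300 B/record (rich metadata): %.0f GiB\n", records*300/gib) // ~363 GiB
        fmt.Printf("16 B/record (key + offset):   %.1f GiB\n", records*16/gib)  // ~19 GiB
    }

Even the compact variant ties up roughly 20 GiB that is then unavailable for content caching.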

the inode/dentry cache is volatile on Linux

Have you tried tuning it (e.g. vm.vfs_cache_pressure)?

symcbean
  • Thanks for your input. I know that a NoSQL database won't solve all my problems (probably), but if it can store objects more efficiently with less metadata/overhead (with, for instance, a 64-bit key) it would solve some of my seek issues. Looking up a file in a large directory structure generates a lot of IOPS and thus a delay. I was hoping an in-memory index could solve the seek issue and result in a single IOP for each object. – Tommy Sep 06 '12 at 15:40
  • if this is a native ext3 (rather than ext2 + journal) then it will be using HTrees - a database index isn't that much different (indeed, a btree index is probably less efficient for this kind of access). Also you don't need to have especially deep trees. – symcbean Sep 06 '12 at 21:42
  • It's a native ext3. And a relational database might not be that much more efficient, but a key/value store should be, since it doesn't have to store metadata such as permissions, last updated, etc. A daemon that stores a simple 64-bit key with the image in the value field should be small enough that all keys could fit in memory. As such, all lookups should take one IOP, given no fragmentation. Also note that with large numbers of files in one directory, listing is limited by the readdir() call, which only buffers 32K at a time, and is thus very slow. – Tommy Sep 07 '12 at 07:30
  • But presumably you have some basis for selecting specific items from this huge pool - even if it's just a filename – symcbean Sep 07 '12 at 08:30