I have a service that serves images to end users at a very high rate over plain HTTP. The images are between 4 and 64 kbytes each, and there are 1,300,000,000 of them in total. The dataset is about 30 TiB in size, and changes (new objects, updates, deletes) make up less than 1% of the requests. The number of requests per second varies from 240 to 9000, and the requests are spread fairly evenly across the whole dataset, with few objects being especially "hot".
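For a rough sense of scale, the figures above can be turned into a few back-of-envelope numbers. This is only a sketch: it assumes "64 kbytes" means 64 KiB and that peak load is served from a single logical pool, neither of which is stated above. The average object comes out at roughly 25 KiB.

```python
# Back-of-envelope sizing from the figures quoted in the question.
OBJECTS = 1_300_000_000
DATASET_BYTES = 30 * 2**40        # 30 TiB
PEAK_RPS = 9000
MAX_OBJECT_BYTES = 64 * 1024      # assumption: "64 kbytes" means 64 KiB

# Mean object size implied by the dataset size and the object count.
avg_object_bytes = DATASET_BYTES / OBJECTS

# Aggregate outbound bandwidth at peak, for average-sized and
# worst-case (all maximum-sized) responses, ignoring HTTP overhead.
peak_avg_gbit = PEAK_RPS * avg_object_bytes * 8 / 1e9
peak_worst_gbit = PEAK_RPS * MAX_OBJECT_BYTES * 8 / 1e9

print(f"average object size: {avg_object_bytes / 1024:.1f} KiB")
print(f"peak bandwidth (average objects): {peak_avg_gbit:.2f} Gbit/s")
print(f"peak bandwidth (64 KiB objects):  {peak_worst_gbit:.2f} Gbit/s")
```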
As of now, these images are files on an ext3 filesystem, distributed read-only across a large number of mid-range servers. This poses several problems:
- Using a filesystem is very inefficient: the metadata size is large, the inode/dentry cache is volatile on Linux, and some daemons tend to stat()/readdir() their way through the directory structure, which in my case becomes very expensive.
- Updating the dataset is very time-consuming and requires remounting between set A and set B.
- The only reasonable way to handle backup, copying, etc. is to operate on the block device.
What I would like is a daemon that:
- speaks HTTP (GET, PUT, DELETE and perhaps update)
- stores data in an efficient structure.
- keeps the index in memory; considering the number of objects, the per-object overhead must be small.
- handles a massive number of connections with little (if any) time needed to ramp up.
- reads the index into memory at startup.
- Statistics would be nice, but not mandatory.
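To make the storage requirements above concrete, here is a minimal sketch of one possible layout: an append-only data file with an in-memory index that is rebuilt at startup. Everything in it is an assumption for illustration (the `BlobStore` class, the record header layout, the tombstone marker and the 20-byte packed `ENTRY` format are all hypothetical), and it is not how any of the products mentioned here work internally. The HTTP layer is left out.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<HI")   # on-disk record header: key length, value length
TOMBSTONE = 0xFFFFFFFF          # value-length marker meaning "key deleted"
ENTRY = struct.Struct("<QQI")   # hypothetical packed index entry: key hash, offset, length


class BlobStore:
    """Append-only data file plus an in-memory index (illustrative only)."""

    def __init__(self, path):
        self.index = {}         # key -> (value offset, value length)
        self.f = open(path, "r+b" if os.path.exists(path) else "w+b")
        self.end = self._load_index()

    def _load_index(self):
        """Scan the file once at startup; return the offset where it ends."""
        pos = 0
        self.f.seek(0)
        while True:
            header = self.f.read(HEADER.size)
            if len(header) < HEADER.size:
                return pos      # end of file (torn records are not handled here)
            klen, vlen = HEADER.unpack(header)
            key = self.f.read(klen)
            value_offset = pos + HEADER.size + klen
            if vlen == TOMBSTONE:
                self.index.pop(key, None)
                pos = value_offset
            else:
                self.index[key] = (value_offset, vlen)   # later records win
                pos = value_offset + vlen
            self.f.seek(pos)

    def _append(self, key, vlen, payload):
        """Write one record at the end of the file; return the payload offset."""
        self.f.seek(self.end)
        self.f.write(HEADER.pack(len(key), vlen) + key + payload)
        self.f.flush()
        value_offset = self.end + HEADER.size + len(key)
        self.end = value_offset + len(payload)
        return value_offset

    def put(self, key, value):
        self.index[key] = (self._append(key, len(value), value), len(value))

    def delete(self, key):
        self._append(key, TOMBSTONE, b"")
        self.index.pop(key, None)

    def get(self, key):
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, length = entry
        self.f.seek(offset)
        return self.f.read(length)

    def close(self):
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "images.dat")
store = BlobStore(path)
store.put(b"a.jpg", b"\xff\xd8 first")
store.put(b"b.jpg", b"\xff\xd8 second")
store.put(b"a.jpg", b"\xff\xd8 updated")   # an update is just a newer record
store.delete(b"b.jpg")
store.close()

reopened = BlobStore(path)                  # index rebuilt by scanning the file
print(reopened.get(b"a.jpg"), reopened.get(b"b.jpg"))
print(ENTRY.size * 1_300_000_000 / 2**30, "GiB for a packed index of all objects")
```

A Python dict costs far more per entry than the 20-byte packed format, which would need about 24 GiB for all 1,300,000,000 objects, so a real implementation would likely use a compact hash table or sorted array instead. Scanning the whole data file at startup also conflicts with the fast ramp-up requirement; persisting the index separately would be one way around that.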
I have experimented a bit with Riak, Redis, MongoDB, Kyoto and Varnish with persistent storage, but I haven't had the chance to dig in really deep yet.