13

This is very system dependent, but chances are near certain we'll scale past some arbitrary cliff and get into Real Trouble. I'm curious what kind of rules-of-thumb exist for a good RAM to Disk-space ratio. We're planning our next round of systems, and need to make some choices regarding RAM, SSDs, and how much of each the new nodes will get.

But now for some performance details!

During the normal workflow of a single project-run, MongoDB is hit with a very high percentage of writes (70-80%). Once the second stage of the processing pipeline hits, it's extremely read-heavy as it needs to deduplicate records identified in the first half of processing. This is the workflow that "keep your working set in RAM" is made for, and we're designing around that assumption.

The entire dataset is continually hit with random queries from end-user-derived sources; the frequency is irregular, but the result sizes are usually pretty small (groups of 10 documents). Since this is user-facing, the replies need to come back under the "bored-now" threshold of 3 seconds. This access pattern is much less likely to be in cache, so it will very likely incur disk hits.

A secondary processing workflow is a high-read pass over previous processing runs that may be days, weeks, or even months old; it runs infrequently but still needs to be zippy. Up to 100% of the documents in the previous processing run will be accessed. No amount of cache-warming can help with this, I suspect.

Finished document sizes vary widely, but the median size is about 8K.

The high-read portion of normal project processing strongly suggests the use of replicas to help distribute the read traffic. I have read elsewhere that 1:10 RAM-GB to HD-GB is a good rule of thumb for slow disks. As we are seriously considering using much faster SSDs, I'd like to know if there is a similar rule of thumb for fast disks.

I know we're using Mongo in a way where cache-everything really isn't going to fly, which is why I'm looking at ways to engineer a system that can survive such usage. The entire dataset will likely be most of a TB within half a year and keep growing.

sysadmin1138
  • A difficult question well-asked. – gWaldo Jul 16 '12 at 12:18
  • It sounds like you're probably going to hit write lock problems before you can tune for IO much, honestly. If you hammer the DB with writes, you'll likely hold write locks long enough that queries are going to stall regardless of how fast the underlying IO is. Something like Fusion IO can cut down on write lock a little, but it just buys some time, it's not a real fix. – MrKurt Jul 16 '12 at 21:36
  • @MrKurt Part of what I'm trying to figure out is when I need to shard, in addition to how beefy I can make the individual replica nodes. My provisional spec does have a PCIe-based SSD card involved. – sysadmin1138 Jul 16 '12 at 21:41
  • Ah, got it. You might consider sharding from the beginning, we do single server sharding a lot. It lets you get around the write lock and effectively scale writes to your total cores. Plus, it's easy to move shards around between servers at a later time. – MrKurt Jul 17 '12 at 04:52

3 Answers

13

This is intended as an addendum to the other answers posted here, which discuss many of the relevant elements to be considered. However, there is another, often overlooked, factor when it comes to efficient RAM utilization in a random-access system: readahead.

You can check the current readahead settings (on Linux) by running blockdev --report (usually requires sudo/root privileges). This will print a table with one row per disk device; the RA column contains the readahead value. That value is the number of 512-byte sectors (unless the sector size is non-default - note that, as of this writing, even disks with larger physical sectors are presented by the kernel as 512-byte sectors) that are read on every disk access.

You can set the readahead setting for a given disk device by running:

blockdev --setra <value> <device name>

When using a software-based RAID system, make sure to set the readahead on each member disk device as well as on the device that corresponds to the RAID array itself.
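
For example (device names and the 32-sector value here are hypothetical - substitute your own layout and a value appropriate to your workload):

# list all block devices; the RA column is the current readahead in 512-byte sectors
blockdev --report

# set readahead on each RAID member and on the md device itself
blockdev --setra 32 /dev/sda
blockdev --setra 32 /dev/sdb
blockdev --setra 32 /dev/md0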

Why is this important? Well, readahead uses the same resource MongoDB is trying to use - RAM - in order to optimize your reads for sequential access. When you are doing sequential reads on spinning disks (or devices that behave something like spinning disks anyway - EBS, I am looking at you), fetching nearby data into RAM can boost performance massively, save you seeks, and a high readahead setting in the right environment can get you some impressive results.

For a system like MongoDB, where access is generally random across the data set, this just wastes memory that is better used elsewhere. The OS, which as mentioned elsewhere also manages memory for MongoDB, will allocate a chunk of memory to readahead whenever it is requested, and hence leave less RAM for MongoDB to use effectively.

Picking the correct readahead size is tricky and depends on your hardware, your configuration, block size, stripe size, and the data itself. If you do move to SSDs, for example, you will want a low setting, but how low will depend on the data.

To explain: you want to make sure that readahead is high enough to pull in a full single document without having to go back to disk. Take your stated median size of 8k: since disk sectors are generally 512 bytes, it would take 16 disk accesses to read in the whole document with no readahead. With a readahead of 16 sectors or more, you would read in the whole document with only one trip to disk.

Actually, since MongoDB index buckets are 8k, you will never want to set readahead below 16 anyway, or it will take 2 disk accesses to read in one index bucket. A good general practice is to start with your current setting, halve it, then re-assess your RAM utilization and IO, and move on from there.
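
As a hedged illustration of that start-with-current-and-halve approach (the device name and starting value are assumptions, not a recommendation):

blockdev --getra /dev/sda        # suppose this prints 256 (128KB of readahead)
blockdev --setra 128 /dev/sda    # halve it, then watch iostat/mongostat for a while
# keep halving while RAM headroom and IO latency improve, but don't go below 16
blockdev --setra 16 /dev/sda     # 16 x 512 bytes = 8KB - one median document or index bucket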

Adam C
5

This is going to be a bunch of small points. There is sadly no single answer to your question, however.

MongoDB allows the OS kernel to handle memory-management. Aside from throwing as much RAM as possible at the problem, there are only a few things that can be done to 'actively manage' your Working Set.

The one thing that you can do to optimize writes is to first query for that record (do a read) so that it's in working memory. This avoids the performance problems associated with the process-wide global lock (which is supposed to become per-database in v2.2).
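
A minimal sketch of that read-before-write pattern from the mongo shell (the database, collection, and field names are invented for illustration; see the comment below about whether this is still worthwhile on newer versions):

mongo projectdb --eval '
    // touch the document first so it is resident in RAM...
    db.records.findOne({ runId: 42, docId: 1001 });
    // ...then perform the actual update
    db.records.update({ runId: 42, docId: 1001 }, { $set: { stage: "dedup" } });
'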

There is no hard-and-fast rule for RAM vs SSD ratio, but I think that the raw IOPS of SSDs should allow you to go with a much lower ratio. Off the top of my head, 1:3 is probably the lowest you want to go with. But given the higher costs and lower capacities, you are likely going to need to keep that ratio down anyway.

Regarding 'write vs. read phases': am I reading correctly that once a record is written, it is seldom updated ("upserted")? If that is the case, it may be worthwhile to host two clusters: the normal write cluster, and a read-optimized cluster for "aged" data that hasn't been modified in [X time period]. I would definitely enable slave reads on that cluster. (Personally, I'd manage that by including a date-modified value in your db's object documents, as in the sketch below.)
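
A hedged sketch of how such an aged-data query might look (the hostname, collection, and field names are made up; the 30-day cutoff is arbitrary):

mongo aged-cluster.example.com/projectdb --eval '
    rs.slaveOk();   // allow reads from a secondary
    var cutoff = new Date(Date.now() - 30 * 24 * 3600 * 1000);
    db.records.find({ lastModified: { $lt: cutoff } }).limit(10).forEach(printjson);
'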

If you have the ability to load-test before going into prod, perf-monitor the hell out of it. MongoDB was written with the assumption that it would often be deployed in VMs (their reference systems are on EC2), so don't be afraid to shard out to VMs.

gWaldo
  • During processing an initial document stub is created and is then continually updated by various sub-stages in the first part of processing. We've been weighing the possibility of doing some hand-padding on the initial create to reduce the amount of extending we're doing, but our current write-lock percentage is happily low. – sysadmin1138 Jul 16 '12 at 13:45
  • The advice to read a record before writing to it to get it into RAM is not good advice. Since 2.0 (mid-2011) MongoDB has had yielding if data to be accessed isn't in RAM so you're just causing an extra read and an extra round trip to the server for no good reason if you do that since the lock wouldn't be held for that duration anyway. – Asya Kamsky Mar 28 '14 at 12:44
3

You should consider using replicas for end-user queries and having your workflow done on other machines.

Using your 1:10 rule of thumb, you're looking at about 128GB of RAM for 1TB of disk storage. While some affordable SSDs today claim to reach >60K IOPS, real-world numbers may differ quite a bit; it also matters whether you're using RAID with your SSDs or not, and if you are, then the RAID card is extremely important as well.

At the time of this post, going from 128GB to 256GB of DDR3 ECC RAM seems to be around $2,000 extra on a 1U Intel server, and that would give you a 1:5 ratio with 1TB of data, which I feel would be an even better ratio. If you need your workload finished as fast as possible, more RAM will definitely help, but is it really that urgent?

You'll need to do some filesystem tuning as well - something like "noatime,data=writeback,nobarrier" on ext4 - and you may need to tweak some kernel settings to squeeze the most performance you can out of your system.
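
As a rough example (the device and mount point are assumptions - adjust for your own layout), the fstab entry and a manual mount would look something like:

# /etc/fstab
/dev/sdb1   /var/lib/mongodb   ext4   noatime,data=writeback,nobarrier   0   0

# or mounting by hand
mount -t ext4 -o noatime,data=writeback,nobarrier /dev/sdb1 /var/lib/mongodb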

If you're going with RAID, RAID-10 will be a pretty good choice, and with the proper RAID controller it will offer quite a performance boost, but it will halve your available space. You can also look into RAID-50 if you want a decent performance boost without halving your available space. The risk of running RAID is that you no longer have access to TRIM on your drives, which means every now and again you need to move your data out, break up the RAID, TRIM the drives, and recreate the RAID.

Ultimately, you need to decide how much complexity you want, how much money you want to spend and how quickly you want your workload processed. I would also evaluate whether MongoDB is the ideal database to use, as you could still use Mongo for end-user queries that need quick responses, but use something else to process your data, which doesn't need to be ready in a few seconds, and it may also allow you to spread your workload across multiple machines with more ease.

gekkz