
We've been running a server on a VM at a hosting company, and have just signed up for a dedicated host (AMD Opteron 3250, 4 cores, 8GB RAM, 2 x 1TB in software RAID, ext3).

While running performance tests, we noticed that some SQLite transactions (a combination of inserts, deletes, and/or updates) were taking 10x to 15x longer than on my 2010 MacBook Pro.

After lots of googling and reading, we ended up looking at the mount options, which were:

    data=ordered,barrier=1

We've done some experimenting, and got best performance with

    data=writeback,barrier=0
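
In case it matters, we're setting these via /etc/fstab, along these lines (the device and mount point here are placeholders rather than our real layout; as far as I can tell the data= mode can't be changed on a live remount, so it takes effect on a fresh mount or reboot):

    # /etc/fstab entry for the partition holding the SQLite databases
    /dev/md0   /srv/data   ext3   data=writeback,barrier=0   0  2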

I've read up on these, and understand the basics of what they're doing, but I don't have a good sense of whether it's a good idea for us to run like this.

Questions

Is the above config even sensible to consider for a hosted service?

If we had a power outage, or hard crash, then we might end up with data being lost, or files corrupted. If we were taking snapshots of the DB every 15 minutes, that might mitigate the situation, but the DB might not be sync'd when the snapshot is taken. How should (can?) we ensure the integrity of such a snapshot?

Are there other options we should be considering?

Thanks

NeilB
  • Many factors are involved. Do you expect many hard crashes? Do you have a UPS (or something equivalent) connected to your hosted machine? Did you do some benchmarking with other file systems (e.g. ext4, XFS, etc.)? Can you control (activate/deactivate) the HDD cache? How did you configure your software RAID? Are your HDDs properly aligned (if they have 4K sectors)? – Huygens Mar 11 '13 at 15:40
  • We don't expect many hard crashes. We don't have a UPS. The machine spec was a standard "off the shelf" from the hosting company, so: we didn't benchmark other fs, ext3 is what we got. Don't know on the HDD cache, will look into that, and similarly for RAID and HDD alignment. Thanks. – NeilB Mar 11 '13 at 17:38
  • Another question I forgot: how much history can you afford to lose? Or can you not afford to lose any? Note: SQLite supports snapshots, in other words backing up a running database. https://www.sqlite.org/backup.html – Huygens Mar 11 '13 at 18:29
  • What is your kernel version? Barriers are honored by md since 2.6.33, not in previous kernel release. – Huygens Mar 12 '13 at 10:45
  • uname -r reports "2.6.32-220.2.1.el6.x86_64". What's "md"? If barriers aren't honoured in this version of the kernel, why did I see a performance improvement when turning off barriers? – NeilB Mar 12 '13 at 11:37
  • Red Hat does a lot of backporting of stuff from newer kernel releases into their currently supported release, so it is rather difficult to know what their kernel version actually supports. `md` stands for multiple devices; it is the layer you configure with `mdadm` when you do software RAID-1. – Huygens Mar 12 '13 at 16:02

2 Answers


First advice
If you cannot afford to lose any data (I mean, once a user has entered new data, it must not be lost in the coming seconds), and because you do not have something like a UPS, then I would not remove the write barrier, nor would I switch to writeback.

Removing write barriers
If you remove write barriers, then in case of a crash or power loss the file system will need an fsck to repair the disk structure (note that even with barriers ON, most journaling file systems would still run an fsck, even though replaying the journal should have been sufficient). When removing write barriers, it is advisable to also disable any disk caching (at the hardware level) if possible; this helps minimize the risk. You should benchmark the impact of such a change, though. You can try this command (if your hardware supports it): `hdparm -W0 /dev/<your HDD>`.
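For example, something like this (the device names are placeholders for your two RAID-1 members):

    # query the current on-disk write cache setting
    hdparm -W /dev/sda
    # disable the write cache on both members of the software RAID-1
    hdparm -W0 /dev/sda
    hdparm -W0 /dev/sdb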
Note that ext3 uses two barriers per metadata change, whereas ext4 uses only one when mounted with the option journal_async_commit.

Although Ted Ts'o explained why some data corruption happened in the early days of ext3 (barriers were OFF by default until kernel 3.1), the journal is placed in such a way that, unless the journal log wraps (the journal is a circular log), data gets written to disk in a safe order - journal first, data second - even when the hard disk reorders writes.
Basically, it would take bad luck for a system crash or power loss to happen exactly when the journal log wraps. However, you need to keep data=ordered. Try benchmarking with data=ordered,barrier=0 as well.

If you can afford to lose a few seconds of data, you could activate both options, data=writeback,barrier=0, but then experiment with the commit=<nrsec> parameter as well (it is described in the ext3 documentation). Basically, you give a number of seconds, which is the period at which the ext3 file system syncs its data and metadata.
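For example (the mount point is a placeholder; commit= and barrier= can be changed on a live remount, whereas the data= mode cannot):

    # sync data and metadata every 60 seconds instead of the default 5
    mount -o remount,commit=60 /srv/data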
You could also try to fiddle and benchmark with some kernel tunables for dirty pages (pages that still need writing to disk); there are good articles online that explain these tunables and how to play with them.
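For instance, the main tunables look like this (the values are just starting points to benchmark, not recommendations):

    # inspect the current dirty-page settings
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
    # start background writeback earlier, allow more dirty pages overall,
    # and let a dirty page live up to 30 s before it must be flushed
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=40
    sysctl -w vm.dirty_expire_centisecs=3000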

Summary regarding barriers
You should benchmark a few more combinations of tunables:

  1. Use data=writeback,barrier=0 in conjunction with hdparm -W0 /dev/<your HDD>
  2. Use data=ordered,barrier=0
  3. Use data=writeback,barrier=0 in conjunction with the other mount option commit=<nrsec> and try different values for nrsec
  4. Use option 3. and try further kernel-level tunables for dirty pages.
  5. Use the safe data=ordered,barrier=1, but try other tunables: especially the file system elevator (CFQ, Deadline or Noop) and their respective tunables (a quick way to switch elevators is sketched below).
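
For point 5, switching the elevator per device does not require a remount; for example (the device is a placeholder):

    # the bracketed entry is the scheduler currently in use
    cat /sys/block/sda/queue/scheduler
    # switch to deadline for the duration of a benchmark run
    echo deadline > /sys/block/sda/queue/scheduler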

Considering moving to ext4 and benchmarking it
As said, ext4 requires fewer barriers than ext3 for a write. Furthermore, ext4 supports extents, which might bring better performance for large files. So it is a solution worth exploring, especially since it is easy to migrate from ext3 to ext4 without reinstalling: official documentation; I did that on one system, but using this Debian guide. Ext4 has been stable since kernel 2.6.32, so it is safe to use in production.
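The migration itself boils down to a few commands, roughly as follows (run it on the unmounted file system; the device is a placeholder, and note that files written before the migration keep their old block mapping - only new files get extents):

    # enable the main ext4 features on the existing ext3 file system
    tune2fs -O extents,uninit_bg,dir_index /dev/md0
    # a full fsck is mandatory after changing these features
    fsck -pf /dev/md0
    # then mount it as ext4 (and update /etc/fstab accordingly)
    mount -t ext4 /dev/md0 /srv/data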

Last consideration
This answer is far from complete, but it gives you enough material to start investigating. This depends so much on requirements (at the user or system level) that it is hard to give a straightforward answer; sorry about that.

Huygens
  • Thanks - lots of useful stuff there. I'd already read the ext3 doc at kernel.org, and tried changing commit, but didn't have a sense for what counted as a big value. Setting it to 15 rather than 5 seconds, I saw no change. I'll do some more benchmarking to cover the permutations you suggested. Thanks again. – NeilB Mar 14 '13 at 10:15
  • That was a good idea, to try increasing the commit time while keeping the safe defaults! It is possible that SQLite is the one flushing/syncing, which could explain why you did not measure any performance change using the commit option. – Huygens Mar 14 '13 at 10:40
  • @NeilB I just stumbled on these articles: 1. http://www.sqlite.org/draft/lockingv3.html - look for `ext3` in it; it gives a perhaps easier-to-understand (or simplified) explanation of what I tried to address in my answer. 2. http://sqlite.1065341.n5.nabble.com/fsync-and-ext3-ordered-mode-data-integrity-td25882.html - you could try keeping the safe ext3 defaults (ordered + barrier) but removing the sync in SQLite. I will update my answer soon regarding this second aspect. – Huygens Mar 14 '13 at 10:45
  • Thanks for those. I'm about to work out all the permutations and run performance tests with them in turn. Early on I tried with sync off in SQLite and got good performance figures. I need to write some code to gather a range of data for different combinations of write operations first. I'll post a summary here, but if you want more details I'm neil at bowers dot com. – NeilB Mar 20 '13 at 11:53

Caveat: there may be inaccuracies below. I've been learning about a lot of this stuff as I go along, so take it with a pinch of salt. This is quite long, but you could just read the parameters we were playing with, then skip to the Conclusion at the end.

There are a number of layers where you can worry about SQLite write performance:

[diagram: different levels for thinking about performance]

We looked at the ones highlighted in bold. The particular parameters were

  • Disk write cache. Modern disks have a RAM cache that is used to optimise writes to the spinning platter. With this enabled, data can be written in out-of-order blocks, so if a crash happens you can end up with a partially written file. Check the setting with `hdparm -W /dev/...` and set it with `hdparm -W1 /dev/...` (`-W1` to turn it on, `-W0` to turn it off).
  • barrier=(0|1). There are lots of comments online saying "if you run with barrier=0, then don't have disk write caching enabled". You can find a discussion of barriers at http://lwn.net/Articles/283161/
  • data=(journal | ordered | writeback). Look at http://www.linuxtopia.org/HowToGuides/ext3JournalingFilesystem.html for a description of these options.
  • commit=N. Tells ext3 to sync all data and metadata every N seconds (default 5).
  • SQLite pragma synchronous = FULL | NORMAL | OFF. At FULL (the default), SQLite will ensure that a transaction is "written to disk" before continuing. Turning it OFF essentially makes the other settings largely irrelevant.
  • SQLite pragma cache_size. Controls how much memory SQLite will use for its in-memory cache. I tried two sizes: one where the whole DB would fit in the cache, and one where the cache was half the maximum DB size.

Read more about the ext3 options in the ext3 documentation.
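To make the runs concrete, each scenario was set up along these lines (the device, path, and values are placeholders; the pragmas are per-connection, so the test harness issued them itself when it opened the DB):

    # disk write cache on (or -W0 for off) for the scenario
    hdparm -W1 /dev/sda
    # remount with the scenario's options (data= can't be changed on a
    # live remount, so that one was set in fstab)
    mount -o remount,barrier=0,commit=5 /srv/data
    # per-connection SQLite pragmas, shown here via the sqlite3 shell
    sqlite3 test.db "PRAGMA synchronous=OFF; PRAGMA cache_size=100000;"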

I ran performance tests on a number of combinations of these parameters. The ID is a scenario number, referred to below.

[table: scenarios I tried]

I started off by running with the default configuration on my machine as scenario 1. Scenario 2 is what I assume to be the "safest", and then I tried various other combinations where it seemed appropriate, or where the results prompted it. This is probably easiest to understand with the map I ended up using:

[image: map relating scenarios to parameters]

I wrote a test script which ran a lot of transactions, with inserts, updates, and deletes, all on tables with either INTEGER only, TEXT only (with id column), or mixed. I ran this a number of times on each of the configurations above:

[plot: timings for the scenarios]

The bottom two scenarios are #6 and #17, which have "pragma synchronous=off", so it's unsurprising that they were the fastest. The next cluster of three is #7, #11, and #19. These three are highlighted in blue on the "configuration map" above. Basically the configuration is: disk write cache on, barrier=0, and data set to something other than 'journal'. Changing commit between 5 seconds (#7) and 60 seconds (#11) seems to make little difference. In these tests there didn't seem to be much, if any, difference between data=ordered and data=writeback, which surprised me.

The mixed update test is the middle peak. There is a cluster of scenarios that are clearly slower on this test: they are all the ones with data=journal. Otherwise there's not much between the other scenarios.

I had another timing test, which did a more heterogeneous mix of inserts, updates and deletes on the different type combinations. These took a lot longer, which is why I didn't include them on the above plot:

[plot: mixed types and insert/update/delete]

Here you can see that the writeback configuration (#19) is a bit slower than the ordered ones (#7 and #11). I expected writeback to be slightly faster, but perhaps it depends on your write patterns, or maybe I just haven't read enough on ext3 yet :-)

The various scenarios were somewhat representative of the operations done by our application. After picking a shortlist of scenarios we ran timing tests with some of our automated test suites. They were in line with the results above.

Conclusion

  • The commit parameter seemed to make little difference, so we're leaving that at 5s.
  • We're going with disk write cache on, barrier=0, and data=ordered. I read some things online arguing this is a bad setup, and others suggesting it should be the default in a lot of situations. I guess the most important thing is that you make an informed decision, knowing what trade-offs you're making.
  • We're not going to be using the synchronous pragma in SQLite.
  • Setting the SQLite cache_size pragma so the DB would fit in memory improved performance on some operations, as we expected.
  • The above configuration means we're taking slightly more risk. We'll be using the SQLite backup API to limit the damage if a crash or disk failure leaves the DB partially written: taking a snapshot every N minutes, and keeping the last M around (sketched below). I tested this API while running the performance tests, and it's given us the confidence to go this way.
  • If we still wanted more, we could look at mucking about with the kernel, but we improved things enough without going there.
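
As a sketch of the snapshot scheme (the paths and frequency are placeholders): the sqlite3 shell's .backup command drives the online backup API, so it yields a consistent copy even while the application is writing:

    # crontab entry: snapshot every 15 minutes; the HHMM in the file name
    # rolls over daily, so this keeps roughly the last 24 hours of snapshots
    */15 * * * *  sqlite3 /srv/data/app.db ".backup /srv/backups/app-$(date +\%H\%M).db"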

Thanks to @Huygens for various tips and pointers.

NeilB