
I have an 8-drive RAID 10 array connected to an Adaptec 5805Z, running CentOS 5.5 with the deadline scheduler.

A basic dd read test shows about 400 MB/s, and a basic dd write test shows roughly the same.
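
For concreteness, the tests were along these lines (a minimal sketch; /dev/sda stands in for the array device and the file path is a placeholder, so adjust for your setup):

dd if=/dev/sda of=/dev/null bs=1M count=4096
dd if=/dev/zero of=/mnt/array/testfile bs=1M count=4096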

When I run the two simultaneously, the read speed drops to ~5 MB/s while the write speed stays at more or less the same 400 MB/s. The output of iostat -x, as you would expect, shows that very few read transactions are being executed while the disk is bombarded with writes.
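
To watch this as it happens, an interval argument makes iostat print extended stats every second (the exact columns vary by sysstat version):

iostat -x 1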

If I turn the controller's writeback cache off, I don't see a 50:50 split, but I do see a marked improvement: somewhere around 100 MB/s reads and 300 MB/s writes. I've also found that if I lower the nr_requests setting on the drive's queue (somewhere around 8 seems optimal), I can end up with 150 MB/s reads and 150 MB/s writes, i.e. a reduction in total throughput, but certainly more suitable for my workload.
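
For anyone reproducing this, the queue tunable lives in sysfs (again assuming the array shows up as /dev/sda):

cat /sys/block/sda/queue/nr_requests
echo 8 > /sys/block/sda/queue/nr_requests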

Is this a real phenomenon? Or is my synthetic test too simplistic?

The reason this could happen seems clear enough: when the scheduler switches from reads to writes, it can run heaps of write requests because they all just land in the controller's cache, but they must be carried out at some point. I would guess the actual disk writes are occurring when the scheduler starts trying to perform reads again, resulting in very few read requests being executed.

This seems a reasonable explanation, but it also seems like a massive drawback to using writeback cache on a system with non-trivial write loads. I've been searching for discussions around this all afternoon and have found nothing. What am I missing?

Khaled
  • Would you mind putting together your data in some sort of structured table? These are some very interesting observations you're making; I'd like to see if there's a pattern. – Marcin Feb 20 '11 at 04:25

2 Answers


Well, a basic dd is probably not the best way to measure drive throughput; it's not a realistic load. However, if you do run dd, pass the oflag=direct flag on the command line to eliminate the effect of the filesystem cache. Also see: How to measure disk throughput? for suggestions on how to measure workloads.
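
For example, the write test might look like this (the output path is a placeholder; iflag=direct does the same for the read side):

dd if=/dev/zero of=/mnt/array/testfile bs=1M count=4096 oflag=direct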

I think your scheduler choice is having a larger effect on your results than anything else. For RAID controllers with battery or flash-backed cache (write cache), I used to run with the deadline scheduler, but now use the noop scheduler if the cache is 512MB or 1GB. You can swap the scheduler on the fly, so try the tests with the noop algorithm and the oflag=direct and see how the results look.
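
Swapping the elevator on the fly looks like this, assuming the array is /dev/sda (reading the file back shows the active scheduler in brackets):

cat /sys/block/sda/queue/scheduler
echo noop > /sys/block/sda/queue/scheduler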

Have you run bonnie++ or iozone?

ewwhite
  • +1 for oflag=direct. I also recommend iozone. – sysadmin1138 Feb 19 '11 at 16:19
  • I will try noop, though thus far I've seen the scheduler have very little effect while write-back is enabled. Will check out iozone too, but I'm not too interested in the overall performance; I'm more interested in what I'm interpreting as read starvation because of the "instant" nature of writes when write-back is enabled. – Nathan O'Sullivan Feb 19 '11 at 21:36
  • oflag=direct tends to give exactly the opposite results. With writeback enabled, I see a 50/50 split, which is what I want. With writeback disabled, I see write starvation – Nathan O'Sullivan Feb 21 '11 at 07:13
  • And the change in kernel elevator? How did `noop` do? – ewwhite Feb 21 '11 at 07:22
1

If you do plan on using iozone, here are some ways to check your performance. These are better than dd, as they allow the kind of test you're looking for.

iozone -s 4G -a -i 0 -i 1 -i 2

That will run tests with a 4GB dataset (-s 4G), using a variable record size, and run the write test (-i 0), the read test (-i 1), and the random read/write test (-i 2). Selecting the file size is critical: if you pick one that fits in RAM, your results will be based more on file cache than actual storage performance. So if you have a server with 4GB of RAM, test with a file size larger than that.

However, if you have obscene amounts of RAM (I have one server with 12GB) and don't want your tests to run for many hours, you can supply the -I option, which tells iozone to set O_DIRECT and bypass the filesystem cache. You'll get your true storage subsystem performance there.
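
Putting that together with the first command, an O_DIRECT run would look like:

iozone -I -s 4G -a -i 0 -i 1 -i 2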

You can also do tests that check for concurrent access.

iozone -s 128M -r 4k -t 32 -i 0 -i 1 -i 2

That will run 32 concurrent threads (-t 32) on 128MB files, running the same tests as the previous command but with a 4K record size (-r 4k). The working set is 4GB, but some of the files will fit in file cache. Depending on what you're doing with this storage, this may be a more accurate test of your probable performance. As before, the -I parameter will set O_DIRECT.

iozone -s 128M -r 4k -l 16 -u 32 -i 0 -i 1 -i 2

This does the same as the above command, but runs a series of tests starting with 16 threads (-l 16) and increasing to 32 threads (-u 32).

sysadmin1138