
I have a Sun M4000 connected to an EMC CX4-120 array with a write-heavy database. Writes peak at around 1200 IO/s and 12MB/s.

According to EMC, I am saturating the write cache on the EMC array.

I think the simplest solution is to move the redo logs to a DRAM-based SSD. This will reduce the load on the EMC array by half, and the apps won't be seeing log buffer waits. Yes, DBWR may become a bottleneck, but the apps won't be waiting on it (the way they do on redo commits!).

I currently cycle through about four 4GB redo logs, so even 20GB or so of SSD would make a big difference. Since this is short-term storage that is constantly being overwritten, Flash-based SSDs probably aren't a great idea.
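
For reference, the current redo layout and switch rate can be confirmed with something like this (just a sketch, assuming SYSDBA access; v$log and v$log_history are standard views):

    -- online redo log groups and their sizes
    SELECT group#, thread#, bytes/1024/1024 AS size_mb, members, status
      FROM v$log;

    -- rough log-switch rate, one row per hour
    SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour, COUNT(*) AS switches
      FROM v$log_history
     GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
     ORDER BY 1;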

The M4000 doesn't have any spare drive slots, so a PCIe card would be perfect; alternatively I could go external, or move the boot volumes to the EMC and free up the local drives.

Sun sells a Flash Accelerator F20 PCIe card, but that seems to be a cache for some SATA disks rather than a DRAM SSD solution. Details are sketchy; it doesn't list the M4000 as supported, and I'm tired of fighting Sun's phone tree looking for human help. :(

Do others agree that a DRAM SSD is the way to go? Any hardware recommendations?

UPDATE: In addition to the info in a comment below, I tried various settings for "commit_write" and it didn't make a difference.
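
For context, these are the sorts of "commit_write" settings I mean (illustrative only - 10gR2 syntax, and not an exact record of every combination tried):

    -- default behaviour: LGWR writes immediately and the session waits
    ALTER SYSTEM SET commit_write = 'IMMEDIATE,WAIT';
    -- the "don't wait for LGWR on commit" variant
    ALTER SYSTEM SET commit_write = 'BATCH,NOWAIT';
    -- can also be set per session
    ALTER SESSION SET commit_write = 'BATCH,NOWAIT';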

rmeden
  • Are you archiving the logs somewhere? If they ultimately need to be copied off the SSD to disk then you may just be moving the bottleneck to the archiving. – Gary Jul 13 '10 at 00:44
  • Yes... redo logs are being archived, and I/O actually increases to about 80MB/s during the redo log copy because it's a sequential write. I always thought redo logs were sequential, but I guess not. – rmeden Jul 13 '10 at 14:09

5 Answers


First, I guess you have very few disks in the array. 1200 IOPS can easily be supported by 12 spinning disks (100 IOPS per disk is very reasonable). If the cache can't handle it, it means that your sustained write rate of 1200 IOPS is way more than your disks can support.

Anyway, SSD for redo logs isn't likely to help. First, do your sessions mostly wait on the COMMIT statement? Check the top wait events in statspack / AWR to verify. I would guess ~95% of your I/O is not to the redo logs at all. For example, a single-row insert into a table with 5 indexes can do 1 I/O to read a table block (one that has space for the row), read 5 index blocks (to update them), write 1 data block, 1 undo block and 5 index blocks (or more, if non-leaf blocks are updated), and 1 redo block. So check statspack and look at your wait events; you are likely waiting a lot on both READs and WRITEs for data / indexes. Waiting for reads slows down the INSERT, and the WRITE activity makes READs even slower - it's the same disks. (BTW - do you really need all the indexes? Dropping the ones that aren't must-haves will speed up the inserts.)
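
A quick way to see this without pulling a full statspack / AWR report (just a sketch - v$system_event is cumulative since instance startup, and TIME_WAITED is in centiseconds):

    -- top 10 non-idle wait events since startup
    SELECT *
      FROM (SELECT event, total_waits, time_waited
              FROM v$system_event
             WHERE wait_class <> 'Idle'
             ORDER BY time_waited DESC)
     WHERE ROWNUM <= 10;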

Another thing to check is the RAID definition - is it RAID 1 (mirroring - each write is two writes) or RAID 5 (each write is 2 reads and 2 writes for the parity calculation)? RAID 5 is way slower under a write-intensive load.
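
As a rough back-of-the-envelope illustration (the ~100 IOPS per spindle and 12 disks are my assumptions, not your numbers): 12 disks give ~1200 raw IOPS; with the RAID 1/10 write penalty of 2 that is ~600 sustained write IOPS, while the RAID 5 penalty of 4 (read data + read parity, write data + write parity) leaves only ~300 - either way well below the 1200 write IOPS the cache is being asked to absorb once it fills.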

BTW - if the disks can't handle the write load, DBWR will be the bottleneck. Your SGA will be full of dirty blocks, and you will not have room left to read new blocks (like index blocks that need to be processed / updated) until DBWR can write some dirty blocks to disk. Again, check the statspack / AWR report / ADDM to diagnose the bottleneck, typically based on the top 5 wait events.
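
One concrete symptom to look for if DBWR really is the choke point (same caveat - these counters are cumulative, so compare two snapshots taken some minutes apart):

    -- waits that climb when DBWR can't keep up
    SELECT event, total_waits, time_waited
      FROM v$system_event
     WHERE event IN ('free buffer waits', 'write complete waits', 'buffer busy waits')
     ORDER BY time_waited DESC;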

Ofir Manor
  • +1 - and I'd give it +10 if I could. – Helvick Jul 12 '10 at 21:17
  • +1 for advice to actually see where the bottleneck is. – DCookie Jul 12 '10 at 22:22
  • The waits are "log file sync" and "log buffer space". I can get about 150MB/s to the volume using dd. I suspect LGWR is waiting for an I/O to complete before submitting the next one. I/O service time is about 1ms. The EMC has a whopping 500MB of cache, which, according to EMC, cannot be increased without upgrading the whole box. We have 22 TB in the array; why they would offer something with so little cache is beyond me. The redo logs are currently on a 5-wide RAID 5, but there was no difference with RAID 10 (another reason to suspect the cache). – rmeden Jul 12 '10 at 22:30
  • BTW, even if there were more cache, the disks still might not keep up. Moving the REDO off the EMC array frees up capacity for the data disks and cuts its I/O in half. A small DRAM SSD may be the cheapest high-performance disk, since it can be small. – rmeden Jul 12 '10 at 22:38
  • rmeden - how much redo does Oracle write per second? You said total I/O is 12 MB/s and 1200 IOPS, which means a lot of small I/Os (average 10KB). If you move the redo logs to SSD, you'll just see different wait events, as DBWR will become the bottleneck and INSERTs will wait for free buffers in the SGA. Please check what type of RAID you have, what the stripe size is, and what the Oracle block size is (also, are your datafiles striped over all the disks?). Also, check in statspack where most of the I/O goes - is it redo or something else? Check I/O per tablespace. – Ofir Manor Jul 13 '10 at 01:41

A dd test is nothing like real database block I/O.

For some other views, check around; anandtech.com did an exhaustive test (granted, with MS SQL Server) of rotating SAS vs SSD in various combinations, and the Solaris world has ZFS with SSDs making up various parts (logs, cache, etc.).

But yes, if RAID 5 vs RAID 10 is the same (for writes), you are doing something wrong. With linear writes, heck, RAID 5 could even be faster (it can compute the parity in memory, then write the stripes and parity all at once), but with small random blocks (4-8k) you get killed updating stripes (as noted by others). RAID 10 should be more than 2x faster; if it isn't, something is wrong.

You need to dig deeper before you spend money on hardware.

Ronald Pottol

I saw a post about mounting UFS partitions using the "forcedirectio" option and setting the Oracle parameter "filesystemio_options" to "setall".
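
For anyone trying the same thing, the change looks roughly like this (a sketch - the device names and mount point are placeholders, not my actual layout):

    -- FILESYSTEMIO_OPTIONS is a static parameter, so SPFILE scope plus a restart
    ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;
    -- verify after the restart (SQL*Plus)
    SHOW PARAMETER filesystemio_options
    -- the UFS side is the "forcedirectio" mount option, e.g. a line in /etc/vfstab:
    --   /dev/dsk/cXtYdZs6  /dev/rdsk/cXtYdZs6  /u02  ufs  2  yes  forcedirectio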

I tried it and saw a 4-5x improvement in Oracle writes! Yea!

The key symptoms were low throughput but good response times on the disk. This seems to help some people but not others. It certainly did the job for me.

I may consider SSDs for new servers, but this server is running fine now.

Robert

rmeden
  • Most likely the speed-up you experienced was not caused by enabling direct I/O, but by enabling asynchronous I/O. In Oracle, setall means direct+async. – kubanczyk Sep 29 '13 at 22:17

If this box were just an x86/64 box running Linux, I'd have happily recommended one of the FusionIO PCIe drive cards - they're astonishingly fast and don't 'die' under heavy writes the way SSDs do. Unfortunately they're not supported on either SPARC or Solaris; you might want to contact them to discuss this, though.

Chopper3

The F20 PCIe card is similar to the Fusion-io card in function. It's basically just a PCIe-attached Flash SSD. With a write-heavy workload, you'll need to worry both about maintaining enough free blocks (via drive-based garbage collection of some kind), so that the erase/program cycle on the SSD doesn't become the bottleneck, and about the limited write cycles available on a Flash-based SSD. It's definitely fast, but it might not be the best kit for this job.

John
  • tks John. I didn't think it would work for me. Sun doesn't even support it on an M4000 anyway. :( – rmeden Sep 01 '10 at 04:49