3

I just installed an LSI 9260-8i, with two virtual drives: the first composed of 4 SSDs, the second of 4 HDDs. Obviously the idea is to get better performance while maintaining some redundancy and plenty of storage capacity.

The SSDs are great and that array is ridiculously fast when dealing with small to relatively large files. The HDDs host mostly huge files (500MB-30GB). It's intended as the main long-term storage facility, while the SSD array is for working files and short-term storage only. This means files will very often be moved from the SSD array to the HDD array.

Problem is that performance declines very quickly after the first gig or so of a large operation is written. It starts at around 250MB/s, which isn't half bad write performance for a RAID 5 array of only 5 HDDs, but the copy I just did, consisting of 4 files totalling 12GB, gradually declined to a 35MB/s low.

Now I guess any advice will depend on a lot of details about the setup, so here goes:

  • The LSI card does not have a BBU (yet) so write-back is disabled.
  • The HDDs are WD15EARS 2TB drives. Obviously these aren't the fastest HDDs out there, but a consistent 200MB/s isn't too much to ask I think.
  • The SSDs are OCZ Vertex 2 60GB drives.
  • Don't think it's relevant, but the HDDs have their idle spin-down time raised to 5 minutes instead of the default 8 seconds
  • Drives show healthy in Storage Manager, no errors of note in logs
  • Like I said, the SSDs are really fast, sporting up to 1100MB/s read speed, so they don't seem to be the bottleneck.
  • The copy seems to pause: it runs fast for about 500MB, stops, runs fast again, and so on, resulting in a lower overall speed.
  • When creating the HDD array, I used a strip size of 512KB. That's huge, but I'm expecting only large to huge files on that array. I'd rather not change it now either, as that would destroy the existing data and I don't have a backup (yet)
  • Operating system is Ubuntu 10.04 (64bit)
  • Motherboard Asus WS Revolution (it's a workstation), 24GB of ECC RAM, Xeon W3570 at stock 3.2GHz
  • The LSI card is inserted in the first PCIe slot (to avoid latency introduced by NF200)
  • System is otherwise perfectly stable
  • The HDD array was formatted using mkfs.ext4 -b 4096 -E stride=128,stripe-width=384 -L DATA /dev/sdb (the arithmetic behind those numbers is sketched just after this list)
  • fstab does not include data=writeback or noatime, though I don't think that should matter much for large files
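
For reference, here's the arithmetic behind those stride/stripe-width values, plus the alignment check I'd run (assuming the array really shows up as /dev/sdb):

    # ext4 block size 4KB, controller strip 512KB, 4-disk RAID 5 => 3 data disks per stripe
    #   stride       = 512KB / 4KB        = 128
    #   stripe-width = 128 * 3 data disks = 384    (matches -E stride=128,stripe-width=384)
    # mkfs can't fix a misaligned partition start, though; if there is a GPT partition on
    # the device, its first sector should be a multiple of 1024 (512KB / 512-byte sectors):
    parted /dev/sdb unit s print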

Any and all advice is appreciated.

Keiran Holloway
  • 1,146
  • 6
  • 13
Jake
  • 33
  • 1
  • 5
  • You write "4 HDDs" and then "RAID 5 array ... 5 HDDs". 4 or 5 HDDs, which one is it? What RAID level are you using for the SSDs and HDDs? Where are you copying from, and where *to*, when you observe the 35MB/s transfer rate? –  Dec 05 '10 at 20:05
  • It's 4. And from SSD to HDD, as I explained. – Jake Dec 05 '10 at 20:36

3 Answers

4

TomTom has essentially answered this already, but a little more context might be useful.

You're using RAID 5, and RAID 5 has well-known performance issues when writing data.

For each RAID 5 stripe there is a parity data block, and the parity data blocks are spread out over all disks in a round-robin fashion. For each write to a RAID 5 array, the controller needs to recompute the parity information, and then write the new parity block to disk. A quote from here illustrates this (regarding a partial stripe update, but the same principle applies):

If you [...] modify the data block it recalculates the parity by subtracting the old block, and adding in the new version. Then in two separate operations it writes the data block followed by the new parity block. To do this it must first read the parity block from whichever drive contains the parity for that stripe block and reread the unmodified data for the updated block from the original drive. This read-read-write-write is known as the RAID5 write penalty since these two writes are sequential and synchronous the write system call cannot return until the reread and both writes complete, [...]

Around 35MB/s sounds about right for a single SATA HDD doing a fair amount of more-or-less random I/O due to the RAID 5 striping, and real-world RAID 5 write speeds for smaller arrays are generally around the performance of a single disk. So this is more-or-less expected performance; that the copy is faster at the beginning is probably OS caching at play.
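
As a rough back-of-the-envelope sketch (the ~70 IOPS figure is an assumption for a 5400 RPM Caviar Green, not a measured number, and it assumes the controller can't coalesce full-stripe writes because write-back is off):

    RAID 5 partial-stripe update = read old data + read old parity + write data + write parity
                                 = a write penalty of ~4 IOs per host write
    with ~70 random IOPS per WD15EARS and the 512KB strip:
        achievable write IOPS ~ (4 disks * 70 IOPS) / 4 ~ 70
        worst-case throughput ~ 70 IOPS * 0.5MB         ~ 35MB/s

That lands right about where your copy bottoms out once the page cache stops hiding the disks.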

Getting a Battery Backup Unit and enabling write caching is not a cure-all solution. You write that you often copy large files (>1 GB). BBU + write caching helps tremendously with random small file writes, but less so with large sequential writes (because the on-controller buffer eventually fills up).
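
To put a rough number on "eventually fills up" (512MB is the 9260-8i's on-board cache size; the ~100MB/s drain rate is just an assumption for illustration):

    512MB controller cache, host pushing ~250MB/s, array draining at ~100MB/s:
        time until the cache is full ~ 512MB / (250 - 100)MB/s ~ 3-4 seconds

After that the copy runs at whatever the disks can sustain, cache or no cache.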

If you want to have good write performance, the answer is generally RAID 10.

And lastly, when you create your partitions, you should take care to ensure that the partition boundaries align with the array stripe boundaries.
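
A minimal sketch of what that looks like on Linux with parted (assuming GPT and the 512KB strip from the question; a 1MiB start is a whole multiple of the strip, so it is aligned - and note that repartitioning destroys the existing filesystem, so this is only for the next time the array is built):

    parted -s /dev/sdb mklabel gpt
    parted -s -a optimal /dev/sdb mkpart primary 1MiB 100%   # starts at sector 2048 = 1MiB
    parted /dev/sdb unit s print                             # verify the start sector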

Jeremy Visser
  • 1,405
  • 8
  • 16
  • > And lastly, when you format your partitions, you should take care to ensure that the partition boundaries align with the array stripe boundaries. ---- Could you explain that a bit? What mkfs options correspond to what you're saying? What more context could you possibly need? – Jake Dec 05 '10 at 21:29
  • @jake: Say your RAID array uses a 64 kb stripe, starting from block 0. Your file system uses a 4 kb cluster unit, but exists on a partition that starts from block 5. Now your file system is offset by 5 blocks, and the last filesystem cluster in the stripe has a few bytes 'sticking out' (extending over into the next stripe on the RAID). Thus when this filesystem cluster is written to, the RAID controller has to update **two** stripes on the array, which induces a performance penalty. You can set the start offset when you create the partition. –  Dec 05 '10 at 21:39
  • I sort of get what you're saying, but can't say I FULLY understand. I mean to say I understand the issue, but I'm not sure what I do should do differently. I just used mkfs.ext4 to format the whole device (after creating a gpt partition table). You can see the mkfs.ext4 options I used above, and that the stripe size is 512KB. – Jake Dec 05 '10 at 21:44
  • @Jake: I don't use Linux, so no copy'n'paste solution for you. Try Googling for "Linux partition alignment", or searching this site. Note that alignment is done when you create the partition, not when you create the filesystem. Lastly, if you switch to RAID 10, alignment matters less. –  Dec 05 '10 at 22:17
  • Ok, I'll figure it out. I have been aware of this issue, but it didn't seem to apply since the cluster and stripe sizes are evenly divisible and the partition starts at sector zero, but I could be wrong. I'm not confident that will solve it though, as the performance seems to degrade over time; it's not just x% less than it should be. – Jake Dec 05 '10 at 22:33
1

I think that "The LSI card does not have a BBU (yet) so write-back is disabled" is the bottleneck.

If you have a UPS, enable write-back.

If not, try to get the BBU.

If you can't, you can either enable write-back and risk the data consistency of the virtual drive by losing the cached data on a power failure, or stick with these speeds using write-through cache.
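
If you do flip the setting, it's done per logical drive from the OS with LSI's MegaCli utility. This is only a sketch - the binary name, option casing and logical drive numbering vary between installs, so check MegaCli's own help first (here the HDD array is assumed to be logical drive 1 on adapter 0):

    MegaCli64 -LDGetProp -Cache -LAll -aAll   # show the current cache policy of every logical drive
    MegaCli64 -LDSetProp WB -L1 -a0           # enable write-back on logical drive 1, adapter 0
    MegaCli64 -LDSetProp WT -L1 -a0           # revert to write-through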

Even if you align the partition to the logical volume (which most modern OSes do automatically these days) and format the volume with a cluster/block size big enough (I think it should be 2MB in your case) for a single IO request to hit all the drives, I don't think you will see a very big difference in write performance.

That's because RAID 5 writes carry a lot of overhead, and since the cache is write-through, the XOR processor doesn't have the whole data set in cache to perform the parity calculations in real time, I think.

With write-back cache enabled on a RAID 5 of 4x 320GB HDDs with a 512KB strip size, I get an average of 250-350MB/s writing big sequential files, or an average of 150MB/s copying big files within the virtual volume. (I still don't have a BBU, but I have an old APC 700VA Smart-UPS, so I think that's enough to greatly reduce the risk of power loss and the resulting cache loss.)

Are we discussing 100% random, 100% sequential, or some mixed pattern? I mostly see high speeds when I read, write or copy big files on/from/to my array. On the other hand, as already said, random writes (and reads) are much slower, varying from less than 1MB/s up to 190MB/s average depending on file and/or request sizes - mostly under 20MB/s in everyday small-file use. So in real life it depends a lot on the application's transfer pattern. Since I'm using a Windows OS my volumes stay pretty much defragmented, so big operations like copying large files to/from the array are fast.

And one suggestion for the slow random read/write speeds of normal HDDs: if you get to the point of reconfiguring the whole controller, why not consider CacheCade, using 1 or 2 of the SSDs as a non-volatile RAID cache (something like Adaptec's hybrid RAID) and the rest for your OS/app drive as you're using them now? That way you should be able to boost the speed of your RAID 5 volume even with write-through, I think, because the actual writes to the physical HDDs take place in the background, and since the cache lives on SSDs instead of the controller's on-board RAM, you should be fairly worry-free about system resets. But for actual, concrete information on how CacheCade works, please read LSI's documentation or even ask LSI's technical support, as I haven't had the chance to use it yet.

Angel
  • 104
  • 3
  • RAID 5 performance should be very good on the LSI, comparable with the Adaptec 5805. I doubt that is the bottleneck. I do have a (1200VA) UPS, but that doesn't help when the system freezes up and I have to do a hard reset (or does it?). – Jake Dec 05 '10 at 21:23
  • Well, if the OS freezes completely, the cache should be flushed to the disks after 4 seconds of inactivity (by default), after which a hard reset or even unplugging the power cord shouldn't be a problem. I accidentally did the unplugged-power-cord thing once, and a consistency check afterwards showed no problems, although I'm not sure whether the cache had already been flushed. But I have yet to experience an OS/system freeze, even though the box is quite abused. – Angel Dec 05 '10 at 21:57
  • 1
    I agree it's rare, but not unheard of. I guess though the journal would prevent any serious damage? Generally I have most important data in code repositories on other servers, so /sda is not really critical. What I'm worried about is irreparable damage to the data on the HDD array. That is my mass storage and gone is gone. – Jake Dec 05 '10 at 22:11
  • 1
    The journal doesn't help at all with write-back enabled on a hardware RAID. To the filesystem, data counts as written to disk the moment it enters the controller's cache; the filesystem doesn't know when it actually reaches the physical drives. That's why I'm not saying you must enable write-back at all costs if you don't have a BBU. A UPS is overall system protection against external threats, but not against internal threats/faults; a BBU protects only the controller's cache. Which also doesn't mean BBU faults are unheard of, etc. – Angel Dec 05 '10 at 22:31
  • You're right though, writeback makes a shitload of difference. I just copied 20 <1GB files at almost 200MB/s and a single 20GB file at 350MB/s. I didn't expect such a massive difference. – Jake Dec 05 '10 at 22:51
  • It still goes down awful quick though, especially when copying multiple (really small numbers, like four, but even with two) files. Is there any way to help that? – Jake Dec 05 '10 at 22:59
  • It's also very inconsistent, the same two files that copied at 170MB/s a minute ago copy at 300MB/s now. Both times the system was pretty much idle otherwise. – Jake Dec 05 '10 at 23:04
  • RAID 5 is really good for big files and for a more heavily loaded storage machine (more concurrent transfers), rather than for small random writes. But reread my post, I edited it a while ago: I'm using a Windows box instead of Linux, so there will be some filesystem differences. ext3/4 is said to hardly fragment at all, and I'm not even sure there's a tool to defragment the remaining 1 percent if there is any. But as you said, it's better to use the RAID 5 volume for something like backup storage of big (sequentially read/written) files. – Angel Dec 05 '10 at 23:37
  • Turns out though that when I disable write-back on the SSD array, Ubuntu won't boot, claiming a missing /tmp directory. Obviously I had to reboot yesterday, but I guess that never really cuts the power. Doing a cold boot, it chokes. Guess I'll just hurry up with that BBU. – Jake Dec 06 '10 at 08:18
-1

Hmpf. Some basics.

It starts at around 250MB/s, which isn't half bad write performance for a RAID 5 array of only 5 HDDs

Reality check: the write speed of any RAID 5 is slow, limited to the write speed of ONE disk. 3, 5, or 15 disks makes no difference on writes.

The HDDs are WD15EARS 2TB drives. Obviously these aren't the fastest HDDs out there, but a consistent 200MB/s isn't too much to ask I think.

Reality check: it is too high for a consumer disk in a random IO situation, and that is what you are demonstrating. Even 200MB/s raw is too high. And once you add all the operations a RAID 5 needs (check Wikipedia), it is comical to ask for it. You want speed? Get FAST disks, and move to RAID 10.

Don't think it's relevant, but the HDDs have the spin down time upped to 5 minutes instead of the normal 8 seconds

An 8 second spin-down time for a HDD? Where the heck did you get that number from? That is WAY too low, WAY too aggressive. Spin-ups should be avoided when more operations happen within a short timeframe - we're talking minutes. 5 minutes is too low; 8 seconds is suicidal.

At the end of the day you expect a lot from little - you went cheap with RAID 5 and never did a reality check. The numbers you are seeing are perfectly okay.

Things to check:

  • Do the drives support NCQ? Is it used? (A quick way to check from the OS is sketched below this list.)
  • How expensive is a BBU? Moving to write-back instead of write-through can make a HUGE improvement. Right now the IO pattern of your writes cannot be optimized because nothing in between can cache them.
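
On the NCQ question, a quick check from the Linux side (the device name is a placeholder; behind a MegaRAID you only see the virtual drive, so per-physical-disk NCQ state has to come from the controller's own tools rather than from hdparm):

    cat /sys/block/sdb/device/queue_depth   # queue depth the kernel is using for the virtual drive
    # whether the individual WD15EARS drives have NCQ enabled is shown by MegaRAID
    # Storage Manager / MegaCli, not by querying /dev/sdb directly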

Besides that, getting FASTER disks and moving away from RAID 5 are your only options. Doing some basic math up front will also help you - your assumption about what speed you should be getting is terribly off.

TomTom
  • 50,857
  • 7
  • 52
  • 134
  • 2
    Uhm, that tone isn't needed, I'll just assume that wasn't intended. I'll take your point that 200MB/s may be too high an expectation. 8 seconds is the idle spin down time of WD's power saving feature. There's a vendor specific tool to adjust it. – Jake Dec 05 '10 at 21:03
  • As you say, 200MB/s might not be realistic for a 4-disk RAID 5 array, at least not for real-life applications. A quick benchmark claims over 300MB/s, but I'd be happy with just half of that, consistently. When just attached as separate disks on the motherboard's controller, real-life performance is around 80MB/s. You understand why I think a speed running down to 35MB/s is not normal. Also the fact that file copies seem to "pause" leads me to believe there's some other bottleneck. – Jake Dec 05 '10 at 21:09
  • Angel just above this claims 250-350MB/s on a RAID 5 array. That kinda doesn't support your claim that RAID 5 performance doesn't exceed that of a single disk (which seems illogical too). – Jake Dec 05 '10 at 21:27
  • 3
    FYI aside from being unnecessarily rude, you're also wrong. I enabled write back and a 20GB file copies from the SSD array to the SSDs at ~350MB/s. – Jake Dec 05 '10 at 22:53
  • 1
    Obviously you're mixing up RAID-5 sequential write performance and RAID-4 (Ciprico, NetApp, DDN) random write performance... Your information is wrong. – wazoox Dec 13 '10 at 16:02