4

I am testing an LSI 9207-8i controller with 8x Samsung 850 Pro 256GB SSDs attached. The SSDs are running the latest firmware (EXM02B6Q); the controller is running P17 and has exhibited the same issues with P19. The server RAM is ECC and I have been testing in mirrored mode.

I have tested with ZFS-On-Linux and FreeBSD, and have tried LSI's driver on both operating systems.

The disks behave as expected, but during heavy IO they appear to be writing bad blocks: running a scrub afterwards shows checksum errors. To simulate heavy IO, I set a recordsize of 16k with primarycache=metadata and secondarycache=none, generate a 4 GB random file, and dd it to another file in 4 parallel threads. Looping this a few times is enough for a scrub to show checksum errors.
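
A rough sketch of the test, for reference (the pool/dataset and file names below are placeholders rather than the exact paths I used):

    # Sketch only - "tank/test" and the file names are placeholders
    zfs set recordsize=16k tank/test
    zfs set primarycache=metadata tank/test
    zfs set secondarycache=none tank/test

    # Generate a 4 GB random source file (262144 x 16 KiB records)
    dd if=/dev/urandom of=/tank/test/randomfile bs=16384 count=262144

    # Copy it to four new files in parallel to generate heavy IO
    for i in 1 2 3 4; do
        dd if=/tank/test/randomfile of=/tank/test/newfile.$i bs=16384 &
    done
    wait

    # Scrub and check for checksum errors
    zpool scrub tank
    zpool status -v tank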

I have yet to confirm whether this is an issue with the controller, the SSDs, or the cables. I suspect the SSDs, but will be testing with a 9211-8i at the next opportunity.

Has anyone experienced a similar issue, or does anyone have any suggestions on what to do next - beyond replacing controller/SSDs?

Update: I have tested another Samsung 850 Pro 256GB (with EXM01B6Q firmware) in an entirely different server, using the onboard SATA controller. The same checksum errors occur.

  • Further tests suggest that the problem is related to the garbage collection on the drives and to the sector size used by ZFS. Creating the pool with ashift=9 (the default) results in checksum errors, but ashift=12 has been working without errors so far (a quick way to check which ashift a pool was built with is sketched just after these comments). I have also noticed that simply writing data to a pool with ashift=9, waiting for a while and then scrubbing results in checksum errors. Waiting again and scrubbing again produces further checksum errors, hence my belief that it is related to the garbage collection. – Christopher King Mar 23 '15 at 19:24
  • Further to the original poster, we had the exact same problem on OmniOS with 6x Samsung 850 EVO drives. We followed his comment and changed the ashift to 12 (using http://lists.omniti.com/pipermail/omnios-discuss/2013-August/001261.html) and that's worked a treat for us too! Thanks a lot OP! –  Jun 11 '15 at 14:06
  • @ChristopherKing If you have been able to come up with an answer (which your comment seems like), you should post that as a self-answer, accept that answer, and upvote any other answers that were helpful (such as it seems ewwhite's was). Comments are ephemeral and subject to deletion at any time, while question and answer posts remain on the site. – user Jun 12 '15 at 19:04
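
For anyone wanting to confirm what a given pool was built with, one quick check (with "tank" as a placeholder pool name) is to pull the ashift out of the cached pool configuration:

    # Show the ashift recorded for each top-level vdev ("tank" is a placeholder)
    zdb -C tank | grep ashift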

3 Answers

4

I've had this problem in the past with Samsung 850 Evos as well. The drives present themselves as 512-byte sector devices in OmniOS/OpenSolaris, and because those platforms lack an ashift parameter at pool creation, you end up with this issue. It appears to be some kind of garbage-collection issue on the disks themselves: I'd write a ton of data, scrub, and see errors.

We ended up forcing the disks to present as 4K aligned in sd.conf, and ZFS then started behaving properly.
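
For illustration, a minimal sketch of that kind of sd.conf entry; the vendor/product string here is only an assumption and has to be matched against your drives' actual SCSI inquiry data (vendor ID padded to eight characters):

    # /kernel/drv/sd.conf - sketch only; verify the vendor/product string first
    sd-config-list =
        "ATA     Samsung SSD 850", "physical-block-size:4096";

After editing sd.conf the sd driver configuration has to be reloaded (e.g. with update_drv -vf sd) or the box rebooted, and the pool re-created, since the alignment is fixed when each vdev is created.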

I thought I'd bring this up in case someone else hits the same problem.

Shane
3

I have managed to resolve the problem by setting ashift=12 (4k alignment) when creating the pool.
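
For reference, a sketch of what that looks like at pool creation time (pool and device names are placeholders; on OmniOS of that era, which lacked an ashift option, the sd.conf approach from the other answer was used instead, as noted in the question comments):

    # ZFS on Linux: set ashift explicitly when creating the pool
    # ("tank" and the device names are placeholders)
    zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

    # FreeBSD: raise the minimum auto-detected ashift, then create the pool
    sysctl vfs.zfs.min_auto_ashift=12
    zpool create tank mirror /dev/da0 /dev/da1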

0

I'd suspect the consumer-oriented Samsung 850 SSDs or the drive backplane, assuming there is one in this configuration.

This is mainly because you're experiencing errors across two different operating systems. Can you provide any other details about the hardware configuration?

ewwhite
  • Yes, this particular server is a Supermicro X9DRL-7F with 4x 32GB ECC RAM. It is in a 1U chassis with a Supermicro backplane (not sure what model). I have just tested a Samsung 850 Pro 256GB in a completely different machine with the onboard SATA controller. This was a drive from a different batch with the earlier EXM01B6Q firmware. The same checksum issues appeared after 1 run of my multithreaded dd test. I think this confirms that the Samsung 850 Pro drives are the culprit! – Christopher King Mar 17 '15 at 18:51
  • This is very interesting. Do you have any other SSDs you could try? Or could you share your testing protocol? – ewwhite Mar 17 '15 at 19:15
  • I'll be trying some alternative SSDs next week. I currently have several servers full of Crucial M4 512GB SSDs which have been operating perfectly as DB servers for several years, and I have performed the same tests on these just to make sure I can only reproduce the problem on Samsung 850 Pro drives. My test is to set recordsize=16k, primarycache=metadata, secondarycache=none, create a 4GB random file using /dev/urandom, then loop: fork 4 instances of `dd if=randomfile of=newfile.1 bs=16384 &`, run `zpool scrub` and wait for it to finish. Checksum errors happen after a few runs. – Christopher King Mar 19 '15 at 15:17