
I am in the process of setting up a mirrored storage system for our business.

We don't have the budget for prebuilt systems, so I am trying to get the best bang for our buck. Here is our hardware breakdown:

San1 and San2: Windows Server 2019

Supermicro MBD-H11SSL-I, AMD EPYC 7251 8-core CPU

64GB RAM (8 x 8GB)

500GB SSD for the OS

LSI 9380-8i8e

Intel 10G NIC, 4 port - iSCSI network

Intel 25G NIC, 2 port - sync between servers - jumbo frames 9014

1 internal 1G NIC (data), 1 IPMI in use on the motherboard

IW-RJ224-03 24-bay SSD enclosure, populated with 24 2TB Samsung 860 Pros in a RAID 10 configuration. Connected via 2 SAS cables to the 9380 card.

We will be using Starwind to sync the 2 servers.

While setting up Starwind, I have been testing our sync performance using image sizes from 500GB to 5TB.

When a sync starts, the system writing the sync data is barely usable. The system stutters, Performance Monitor hangs, and everything runs horribly unless I turn off all caching options. If I enable write-back or the disk cache, core 0 on NUMA node 0 pegs at 100% and everything goes south. The other cores show little or no usage, apart from a couple.

I have tried every combination of drive setup I can think of, but I am getting nowhere. I must be missing something. I have configured the array as 2x8, 6x4, and 4x6 (standard 64k strip), thinking some drive limitation was holding me back. There was exactly one instance where nothing went wrong: the array wrote a 5TB sync in an hour with perfect system response. It was running at over 1.6GB/s at the time, with both caches enabled on a 4x6 array. I did notice that core 0 on NUMA node 0 was near idle that time, and core 2 on NUMA node 0 was doing the heavy lifting. I then took everything down to rebuild and replicate that result, and I have been stuck since. Now every transfer maxes out at about 600MB/s with the caches off, and with them on it reaches about 1GB/s before the system is noticeably struggling.
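As a rough sanity check on the figures above, the following sketch works out how long a 5TB sync takes at each of the quoted write rates. It assumes decimal units (1TB = 1000GB) and a constant rate for the whole transfer; the function name `transfer_minutes` is only illustrative.

```python
def transfer_minutes(size_tb: float, rate_gb_per_s: float) -> float:
    """Minutes needed to write size_tb terabytes at a steady rate_gb_per_s (decimal units)."""
    size_gb = size_tb * 1000  # assumption: 1 TB = 1000 GB
    return size_gb / rate_gb_per_s / 60


# Rates quoted in the post: caches off, caches on, and the single good run.
for label, rate in [("caches off, ~0.6 GB/s", 0.6),
                    ("caches on, ~1.0 GB/s", 1.0),
                    ("good run, ~1.6 GB/s", 1.6)]:
    print(f"{label}: {transfer_minutes(5, rate):.0f} min")
```

At 1.6GB/s the 5TB image takes about 52 minutes, which matches the "in an hour" run; at 600MB/s the same sync takes well over two hours.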

Any ideas to help point me in the right direction are appreciated! Firmware on the 9380 is up to date, as are the drivers for the RAID cards, NICs, and motherboard components.

mazelon

2 Answers


Here are some thoughts that may help solve the issue:

  1. If you are using any kind of NIC teaming, it may affect iSCSI and replication performance in unpredictable ways. Most SAN/vSAN vendors don’t support teaming and recommend MPIO instead. Disable NIC teaming.
  2. You mentioned an Intel 25G NIC. The XXV710 model may have issues with jumbo frames enabled. Disable jumbo frames and run additional tests.
  3. A jumbo frame value of 9126 is not typical for Windows and is used mostly on switches. The typical Windows jumbo frame setting is 9014.
  4. The LSI 9380 doesn’t have the Samsung 980 Pro in its list of supported drives. Moreover, the 980 Pro is an NVMe drive (not SATA). Are you sure that you have the 980 Pro?

I’d also recommend contacting Starwind’s support, as BaronSamedi1958 mentioned.

batistuta09
    Yikes, I was all over the place there, huh? Yeah, they are 860 SSDs, and yeah, it was 9014. I was in a rush after 10 hours of pulling my hair out :). I did pin it down to the 710 25G NIC not having NUMA scaling enabled. Enabling it cleared up the issues I was having instantly. – mazelon Dec 13 '21 at 22:59

You need to fine-tune the synchronization priority for the whole thing to function properly.

https://www.starwindsoftware.com/help/ChangingSynchronizationPriority.html

Since you are dealing with a paid solution, I’d suggest contacting support.

BaronSamedi1958
    Priority should not affect server performance. It's a 2x25Gb server-to-server sync, so there is plenty of bandwidth. The sync is choking the server when it's only using about 5Gb per connection. – mazelon Dec 11 '21 at 10:28
    This isn’t about the network; it’s about synchronization traffic saturating DISK bandwidth. – BaronSamedi1958 Dec 12 '21 at 19:00
    Thanks for the help. It turned out that NUMA scaling was not turned on for the 25G NIC, so it was pegging one core and holding everything up, bringing the system to an unresponsive state. Thank you. – mazelon Dec 13 '21 at 22:57
    Great to hear the issue is gone! :) – BaronSamedi1958 Dec 14 '21 at 05:09
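
For anyone hitting the same single-core symptom, one way to inspect a NIC's receive-side scaling (RSS) state is the `Get-NetAdapterRss` PowerShell cmdlet. The thread does not say exactly which driver setting "NUMA scaling" refers to, so treating it as part of the adapter's RSS configuration is an assumption here. The sketch below only wraps that cmdlet from Python; the helper names and the adapter name in the usage note are hypothetical.

```python
import subprocess


def build_rss_query(adapter_name: str) -> list[str]:
    """Build a PowerShell command line that prints the RSS settings of one adapter."""
    # Single quotes inside a PowerShell single-quoted string are escaped by doubling.
    quoted = adapter_name.replace("'", "''")
    ps_command = f"Get-NetAdapterRss -Name '{quoted}' | Format-List"
    return ["powershell.exe", "-NoProfile", "-Command", ps_command]


def show_rss_settings(adapter_name: str) -> str:
    """Run the query (Windows only) and return PowerShell's text output."""
    result = subprocess.run(
        build_rss_query(adapter_name),
        capture_output=True,
        text=True,
        check=True,  # raise if PowerShell reports an error, e.g. unknown adapter
    )
    return result.stdout
```

On the storage server, something like `print(show_rss_settings("Sync-25G"))` (with your own adapter name) would show whether RSS is enabled and which profile and processor range the adapter uses. If everything lands on one core, as described in the question, that output is a reasonable first place to look.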