
I have a site that runs a very IO-sensitive application (Accredo Saturn); it's an accounting/CRM package written in Delphi with a local flat-file database.

For various historical reasons the site was running it on a Windows Server 2008 R2 terminal server running under Server 2012 R2 Hyper-V on a Proliant DL380 G9, with their DC being an old DL380 G7 with SBS 2011 (Exchange has long been on Office 365).

I've now upgraded them to a new DL380 G10 running Server 2019. The host and domain controller are running on 6x 600 GB 10k SAS in RAID10 (host on its own partition, one big partition for the rest) on a P408i-p, with the remote desktop server on 4x 480 GB mixed use SATA SSD in RAID10 on a P408i-a. The server has 2x Xeon 4210 and 64 GB memory. The data for this software is on a VHDX on the SSD array, mounted directly on the remote desktop server.

They have 18 users, all of whom use the remote desktop server for this program, with 8 callcentre users also using a Unify phone system agent. One or two use Edge. I intended to go a bit overboard on specs because this client is fussy about speed, and as I mentioned the software is fussy!

The client has complained about slow speeds within the software. I've tested, and found that an operation that was taking 5 seconds is now taking up to 15. The old 2008 R2 VM on the same hardware is performing as it always did, so it almost seems to be something to do with the guest.

I have run diskspd with no users logged in (-c100b -b4K -o32 -F8 -T1b -s8b -W60 -d60 -Sh) and am seeing similar read IOPS and throughput on both VMs, but with wide per-thread variation on the new 2019 VM. I'm seeing about 531.41 MB/s and 136k IOPS on the 2019 guest, but with two threads down at 1.9 MB/s. The old VM sits at 520.44 MB/s with a total of 133k IOPS, consistently around 72-76 MB/s per thread except for one thread down at 3.75 MB/s. That's on the SSD array.
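
For reference, the full invocation looks roughly like this (run from an elevated prompt; the target path is just a placeholder for a test file on the volume under test):

    # Same parameters as above; D:\io-test\testfile.dat is a placeholder path.
    .\diskspd.exe -c100b -b4K -o32 -F8 -T1b -s8b -W60 -d60 -Sh D:\io-test\testfile.dat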

In comparison, the bare-metal SSD array with the same parameters gives me 999 MB/s, consistently 124-125 MB/s per thread, and a total of 255k IOPS.

I have spent days looking into this. I've tried the registry entry to disable the IO load balancer, to no effect - I'm not sure it even applies to 2019. I've tried both fixed and dynamic VHDXs, and even swapped the data volume between servers (it is its own VHDX). I have tried dynamic and static memory, and NUMA enabled and disabled.
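
For what it's worth, the fixed/dynamic conversions were done with something along these lines (paths are examples only, with the VM shut down and the disk detached first):

    # Example only - convert the data disk to a fixed-size VHDX, then reattach and retest.
    # Use -VHDType Dynamic to go the other way.
    Convert-VHD -Path 'V:\VHDX\AccredoData.vhdx' `
        -DestinationPath 'V:\VHDX\AccredoData-Fixed.vhdx' -VHDType Fixed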

I'm at my wits' end and have a frustrated client who is starting their callcentre for the year on the old VM tomorrow!

The 2008 R2 is a generation 1, version 5 VM while the 2019 is generation 2, version 9.
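
(For anyone comparing their own, both values are visible from the host with something like the following, assuming the Hyper-V PowerShell module is installed.)

    # Run on the Hyper-V host; lists generation and configuration version per VM.
    Get-VM | Select-Object Name, Generation, Version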

Any hints on getting those missing IOPS back would be appreciated!

This is my first post, so apologies if I haven't included enough relevant or specific information.

arjoll

3 Answers


on 6x 600 GB 10k SAS in RAID10

Instead of a RAID 1 of 2x high-performance SSDs, which would give you what - 50 times the IO?

Generally: GET SSD.

On top of that, use statically sized (fixed) VHDXs on the SSDs.

There's not a lot more you CAN do - your numbers sound insane, though. Needing 200k+ IOPS points to amazingly crappy programming.

TomTom
  • Sorry, my initial question wasn't clear about where the data was stored, and where I was testing. I've edited now. The data for that program is a VHDX stored on the RAID10 SSD array (4x 480 GB mixed use SATA). The guest tests were run on that VHDX, and the bare metal test on the SSD array; so guest is roughly half the speed of host. The SAS RAID10 is only used for the Hyper-V host, and a VM that does domain control and file and print. I'm not concerned about its performance. – arjoll Jan 29 '20 at 09:45
  • 480 GB mixed use SATA SSD -> hard to go any more low-end. SATA can't really handle this load; those should have been SAS or U.2 if you want high performance. – TomTom Jan 29 '20 at 09:48

That is not evidence that the performance problem is due to the storage.

Analyze the slow application workflow in detail.

  • What code paths does it take? Profile how much time each function takes.

  • What database queries does it do?

  • How many data records are involved, at what size?

  • How does it handle concurrency, including file or database based locks?

  • Does it use any external resources over the network? What is the latency to those?

  • What does communication to the client look like on the wire? The client could be the terminal server in this case.

Likely you will need the assistance of the software vendor to look this deep. Insist on detailed profiling and visibility of the type you would get from an application performance monitoring package.

Resource limits like CPU, memory, IOPS, and network bandwidth can be why things are slow, and those are metrics to measure. However, it is also possible that the stack of that application on that OS won't go faster even if you throw hardware at it. The only way to tell is to isolate what actually is slow.
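
As a rough sketch, a few of those metrics can be sampled with PowerShell while the slow operation runs; the counter names below assume an English-language OS and are only a starting point:

    # Sample CPU, memory, disk latency/IOPS and network throughput every 5 seconds, 12 times.
    Get-Counter -Counter @(
        '\Processor(_Total)\% Processor Time',
        '\Memory\Available MBytes',
        '\LogicalDisk(_Total)\Avg. Disk sec/Read',
        '\LogicalDisk(_Total)\Disk Reads/sec',
        '\Network Interface(*)\Bytes Total/sec'
    ) -SampleInterval 5 -MaxSamples 12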

John Mahowald
  • Thanks - I have been back to the developer and they are looking into things in more detail to see what is going on. Turns out they use very similar HPE hardware in-house and plan to spin up 2008 R2 and 2019 VMs on 2019 bare metal to see if they can reproduce whatever is happening. – arjoll Jan 30 '20 at 03:34

I've just noticed this when coming back here to research another issue. The problem turned out to be TSFairShare Disk, the Remote Desktop Services disk fair share feature. Disabling it resolved the issue - it turns out to be a problem for many applications that use a file-level database.

We found the solution buried in a Microsoft Dynamics GP forum. The details of the actual fix are summarised here: https://www.ryslander.com/disable-fair-sharing-in-windows-server/ . For the likes of GP and the application we were using (Accredo), only the disk component (FSSDisk) needs to be disabled; we left the others alone.
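
For reference, the change amounts to something like the following on the session host (the registry path is as described in that article - please verify it against your own build, and reboot afterwards):

    # Disable only the disk component of RDS Fair Share, leaving CPU/network alone.
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\TSFairShare\Disk' `
        -Name 'EnableFairShare' -Type DWord -Value 0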

I note with Server 2022 the default has gone back to disabled.

arjoll