
My question closely relates to my last question here on serverfault.

I was copying about 5 GB from a 10-year-old desktop computer to the server, using Windows Explorer. In this situation I would expect the server to barely notice the data flow.

But as usual with this server, it slowed down badly. I could at least keep working in the remote session, even though there was serious latency. The copy took its time (20 minutes?). During this time I went to a colleague, and he tried to log in to the same server via Remote Desktop (for some other reason). It took about a minute to get to the login screen, a minute to open the Control Panel, a minute to open the Performance Monitor, and so on. Icons were loading maybe one per second. We saw the following (from memory):

  • CPU: 2%
  • Avg. Queue Length: 50
  • Pages/sec: 115 (?)

There was no other considerable activity on the server. The server occasionally serves some ASP.NET pages, which also became very slow during this time.
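
To capture these numbers the next time it happens, the same counters can be logged from the command line; here is a minimal Python sketch (assuming Python is available on the server and using Windows' built-in typeperf tool) - the counter list, sample count and output file name are just examples:

    import subprocess

    # Counters that separate "disks are the bottleneck" from CPU or memory
    # pressure: queue length and seconds-per-transfer show disk saturation,
    # Pages Input/sec shows hard page faults that also hit the disks.
    COUNTERS = [
        r"\Processor(_Total)\% Processor Time",
        r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
        r"\PhysicalDisk(_Total)\Avg. Disk sec/Transfer",
        r"\Memory\Pages Input/sec",
    ]

    # Take 60 one-second samples and write them to a CSV for later inspection.
    subprocess.run(
        ["typeperf"] + COUNTERS + ["-sc", "60", "-o", "disk_counters.csv", "-y"],
        check=True,
    )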

The relevant configuration is as follows:

  • Windows 2003

  • SEAGATE ST3500631NS (7200 rpm, 500 GB)

  • LSI MegaRAID based RAID 5

  • 4 disks, 1 hot spare

  • Write Through

  • No read-ahead
  • Direct Cache Mode
  • Hard disk cache mode: off

Is this normal behaviour for such a configuration? What measurements could give further clues?

Is it reasonable to reduce the priority of such copy I/O and favour other processes like the remote desktop? How would you do that?
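
Windows 2003 has no per-process I/O priority (background I/O prioritization only arrived with Vista), so in practice the copy itself would have to be throttled. A minimal sketch of a rate-limited copy in Python - the chunk size, the 10 MB/s cap and the example paths are arbitrary placeholders:

    import time

    CHUNK = 1024 * 1024                      # copy in 1 MB chunks
    LIMIT = 10 * 1024 * 1024                 # cap at ~10 MB/s (placeholder value)

    def throttled_copy(src, dst, limit=LIMIT):
        """Copy src to dst, sleeping between chunks so the average
        throughput stays at or below `limit` bytes per second."""
        start = time.time()
        copied = 0
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            while True:
                chunk = fin.read(CHUNK)
                if not chunk:
                    break
                fout.write(chunk)
                copied += len(chunk)
                # Sleep until the average rate drops back under the limit.
                ahead = copied / float(limit) - (time.time() - start)
                if ahead > 0:
                    time.sleep(ahead)

    # Hypothetical example paths - adjust to the real share and target volume.
    throttled_copy(r"\\old-desktop\share\data.zip", r"D:\incoming\data.zip")

For whole directory trees, robocopy's /IPG switch (inter-packet gap) achieves a similar throttling effect without any scripting.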

Many thanks!

Wikser
  • possible duplicate of http://serverfault.com/questions/118046/very-slow-harddisk-performance – Oskar Duveborn Apr 02 '10 at 09:15
  • How much RAM is there in the server, and what are the disk queue stats when it's not having files copied to it? – xenny Apr 02 '10 at 11:36
  • 2
  • Ugh. Another server brought to its knees by the horrible myth that disk cache should be turned off. Caching exists for a reason and is absolutely critical to the performance of mechanical drives. Stick the server on a $50 UPS, turn caching back on, and be happy. – Nicholas Knight Apr 02 '10 at 13:44
  • The server has 6 GB RAM and is attached to a UPS. I heard I can't enable the write cache without a battery, but it seems I'll have to look it up. – Wikser Apr 02 '10 at 19:50
  • The UPS _is_ the battery. There is no technical limitation requiring an on-controller battery for cache -- you think every desktop and laptop in the world runs without cache? An on-controller battery just adds a little extra insurance, and so long as you're on a reliable power feed (in this case provided by the UPS), the gain is very nearly zero. – Nicholas Knight Apr 02 '10 at 21:32

3 Answers


Disc overload. It's that simple. An average queue length of 50 says it all - check "Avg. Disk sec/Read" and "Avg. Disk sec/Write" as well; those will be far too high, too.

It looks a lot like you are simply overloading the discs, and having the hard disc cache mode off does not help either (a bad setting - at least enable the read cache there... better, write + UPS - without caching, SATA NCQ cannot work, killing your performance).

The main problem is that your RAID 5 holds everything - the file area AND the operating system - so overloading it overloads the whole system.

For real servers I use WD Scorpio Black drives in a RAID 10 (4 discs) for the operating system and (I only do virtual) the virtualization root - the RAID 10 gives me better performance. For a high-performance file server I would/do add a SECOND array (which can be RAID 5) for the files. The trick is that the file area and the operating system area are never, ever allowed to share the same discs. In your case: get two small hard discs (80 GB or so), mirror them, and move the operating system onto that pair. Then the server stays usable when I/O piles up.

Pages/sec on its own says little - it only means some virtual-memory activity is going on. If those page faults hit the discs during your file copy (likely; "Page Reads/sec" is the counter that marks physical disc activity caused by page faults), then naturally they end up in the same queue.

And please turn caching on. Can LSI sell you a BBU (battery backup unit)? I use Adaptec RAID controllers myself, and ever since I have had a BBU on them I run the cache in write-back mode (NOT write-through) - the performance gain from those optimizations is significant.

TomTom
  • Thanks for your answer. For the next server (which I won't be ordering myself) I'll make those things (RAID 10, separate OS discs, BBU) a requirement for sure. :) For the existing server I'll try to enable the read cache. What do you mean by "write + UPS"? The server is on a UPS, but can I enable the write cache with just a UPS? Is that impossible or just too risky? Will the "gain from optimizations" be significant in such "mainly copy" situations too? – Wikser Apr 02 '10 at 21:04
  • At least turn on disc caching for read and write - the UPS should handle problems there, unless the power supply burns out. Without full disc caching, you lose NCQ, which makes discs a LOT slower and less responsive, as they have to work on the commands in the EXACT ORDER they arrive instead of reordering them. – TomTom Apr 03 '10 at 03:53
  • Ok, thanks. I'll try to enable the write cache (in a quiet moment, after a backup) and test the performance gain. If it works out, a BBU seems like a good choice. – Wikser Apr 03 '10 at 05:39

The problem has been characterized well by the other answers but in short:

Your RAID array, with 3 (active) 7200 RPM disks in RAID 5, has write performance of roughly 3/4 that of a single 7200 RPM drive for extended copies (RAID 5 pays a write penalty of four I/Os per write, so three disks give about 3 × single-disk write IOPS / 4). Given that you have disabled caching/read-ahead etc., the performance will be even worse than that. For the most part, the write performance of this server is going to be pretty poor with this configuration.

If your 5 GB is a single large file (or a couple of fairly large files), and if your network copy is arriving faster than about 30 Meg/sec (easy enough with a Gigabit connection), then your server's disks won't be able to keep up: the network copy buffering on the server will grow until it consumes all available memory, which will then force the OS to start paging excessively, further worsening your performance problem. Depending on what else is actually happening on the server, the copy speed needed to kill your system may be even lower than this; if there is any other sustained read/write activity, even at very low rates, then an inbound copy over a 100 Meg connection might be enough to trigger this sort of problem.
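
One way to check whether the array really is that slow is to measure its sustained sequential write rate directly during a quiet period; a rough Python sketch, where the test-file path and the 1 GB test size are just placeholders:

    import os
    import time

    TEST_FILE = r"D:\iotest.bin"            # placeholder path on the RAID 5 volume
    TEST_SIZE = 1024 * 1024 * 1024          # 1 GB, large enough to swamp small caches
    CHUNK = 4 * 1024 * 1024                 # write in 4 MB chunks

    buf = os.urandom(CHUNK)
    start = time.time()
    with open(TEST_FILE, "wb") as f:
        written = 0
        while written < TEST_SIZE:
            f.write(buf)
            written += len(buf)
        f.flush()
        os.fsync(f.fileno())                # force the data out of the OS cache
    elapsed = time.time() - start
    print("sustained write: %.1f MB/s" % (TEST_SIZE / elapsed / 1024 / 1024))
    os.remove(TEST_FILE)

If that figure comes out well below what the incoming copy delivers, the buffering-then-paging scenario described above is exactly what you would expect.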

Helvick
  • Thanks for your answer. In this situation the memory usage of the system was not rising, so I don't think the system was swapping. But this may explain a problem we've had, where about once a week the server literally crashed at night: memory usage rose endlessly while backing up the disks to a (regularly exchanged) USB drive. We've worked around the problem by stopping all regular tasks at backup time. – Wikser Apr 02 '10 at 20:01

Are you sure the RAID array wasn't being rebuilt? I've seen a rebuild/verify bring a box to its knees. You might even have a drive that is marginal and can't keep up with the others, but isn't throwing error codes (yet).

A 'RAID' drive should immediately tell the controller that it has a problem; 'consumer' drives (the same hardware, just different firmware) will keep retrying a failed request instead of failing fast. I've had a few that eventually got dropped from an array due to timeouts under load. They'd check out all right and rebuild (usually) without incident, only to start timing out again as soon as the box was under load. The constant rebuilds and stalling drives would bring the box to a standstill after a few rebuild cycles.
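
If smartmontools happens to be installed on the box, the individual drives behind the MegaRAID controller can be queried for reallocated/pending sectors and interface errors, which sometimes exposes a marginal drive the controller still reports as healthy. A rough Python sketch - the device name, slot count and attribute names are assumptions that need adapting to the actual setup:

    import subprocess

    DEVICE = r"\\.\PhysicalDrive0"   # assumed device name for the controller
    SLOTS = 4                        # assumed number of physical drive slots

    for slot in range(SLOTS):
        print("=== drive in slot %d ===" % slot)
        result = subprocess.run(
            ["smartctl", "-a", "-d", "megaraid,%d" % slot, DEVICE],
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():
            # Attributes that typically degrade first on a marginal drive.
            if any(key in line for key in ("Reallocated_Sector", "Current_Pending_Sector",
                                           "UDMA_CRC_Error", "overall-health")):
                print(line)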

Gazzonyx
  • Thanks for your answer. The RAID controller management tool showed no problems. Wouldn't a rebuild be shown in the logs? How can I diagnose this kind of problem other than with the RAID tools? – Wikser Apr 02 '10 at 20:10