
I have a Windows 10 workstation used within my business for things like image processing (Photoshop) and software development (Eclipse). It's an i7-2600K based computer with a Gigabyte GA-B75M-D3H B75 motherboard and 16 GB RAM. The OS is on a Samsung 850 Pro SSD; there's another 850 Pro for data, a WD Black for data, plus two 4TB HGST drives, each on SATA 3 ports, formatted ReFS, in a Storage Spaces mirror. The array has 1.63TB used, 1.99TB free.

Recently the ReFS drives in the storage spaces mirror have started dropping - so far three times in a month. This usually occurs under moderate to heavy load, after an extended period. None of the other disks drop under load as far as I can tell, so I assume it's ReFS, Storage Spaces, or a problem with an underlying disk. A reboot brings the disk online.

I can see errors in the event viewer such as those below. These are not all in one place, and while there are NTFS and Storage Spaces log areas under "application and services log -> microsoft -> windows" there doesn't seem to be one for ReFS.

I'd appreciate help tracking down what's causing these problems, and resolving them, so my system stays up.

16:27.05 (under event viewer -> application and services log -> microsoft -> windows -> storagespaces-driver -> operational)
Virtual disk {26bf58b3-1cb9-4b93-a945-1b89331bb565} requires a data integrity scan.
Data on the disk is out-of-sync and a data integrity scan is required.
To start the scan, run the following command:

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask                  

Once you have resolved the condition listed above, you can online the disk by using the following commands in PowerShell:                  

Get-VirtualDisk | ?{ $_.ObjectId -Match "{26bf58b3-1cb9-4b93-a945-1b89331bb565}" } | Get-Disk | Set-Disk -IsReadOnly $false                  
Get-VirtualDisk | ?{ $_.ObjectId -Match "{26bf58b3-1cb9-4b93-a945-1b89331bb565}" } | Get-Disk | Set-Disk -IsOffline  $false

16:27.05 (windows system event log): The file system was unable to write metadata to the media backing volume R:. A write failed with status "A device which does not exist was specified." ReFS will take the volume offline. It may be mounted again automatically.
16:27.06 (windows system event log): The file system detected a checksum error and was not able to correct it. The name of the file or folder is "<unable to determine file name>".
18:35.50 (windows system event log): Failed to connect to the driver: (-2147024894) The system cannot find the file specified. 
18:35.50 (Kernel PNP) The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\_??_USBSTOR#Disk&Ven_Generic&Prod_STORAGE_DEVICE&Rev_9451#7&2a9fd895&0#{53f56307-b6bf-11d0-94f2-00a0c91efb8b}.

18:35.58: Virtual disk {26bf58b3-1cb9-4b93-a945-1b89331bb565} could not be repaired because there is not enough free space in the storage pool.                  
Replace any failed or disconnected physical disks. The virtual disk will then be repaired automatically or you can repair it by running this command in PowerShell:                  
Get-VirtualDisk | ?{ $_.ObjectId -Match "{26bf58b3-1cb9-4b93-a945-1b89331bb565}" } | Repair-VirtualDisk

UPDATE: as yagmoth555 points out, one of these errors mentions USB. The scenarios where I recall this error happening are: a) when backing up to an external USB disk; b) when running CrashPlan backups to another internal SATA disk.

Tim
  • Have you tried removing the antivirus, in case it's a Windows 10 compatibility issue? McAfee, for example, had trouble with the Anniversary Update. – yagmoth555 Dec 31 '16 at 04:43
  • @yagmoth555 I could give that a go I guess, but it doesn't seem super likely to me. I'm quite interested in how to troubleshoot the issue itself, rather than guess and check. – Tim Jan 01 '17 at 01:15
  • Sadly, without more logs I suspect an underlying hardware problem. The AV idea was the easy test: if your mirror can't handle the I/O, the AV's on-access/on-write driver doesn't help, since it slows your I/O by sitting in between. If you want to investigate without trial and error, you could try Process Monitor first to log everything. If all else fails, I would try removing the mirroring and testing again (though that is still trial and error, sadly). Please keep us updated with the monitoring log. – yagmoth555 Jan 01 '17 at 04:46
  • Can I ask if your mirror is built on USB disks? The kernel error mentions a USB driver. – yagmoth555 Jan 01 '17 at 04:49
  • Both disks are on SATA 3Gbps. Good spotting about the USB error. Question updated. I'm not at the machine for a week or so, but I'll try to trigger it again and see if the logs are different. ProcessMonitor looks like it would generate masses of logs, but I'll give it a shot sometime thanks. – Tim Jan 01 '17 at 22:39
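
On troubleshooting rather than guessing: the errors quoted in the question are spread across the System log and the Storage Spaces operational log. A minimal PowerShell sketch that merges them into one timeline might look like the following. The function name `Get-StorageTimeline`, the 24-hour default window, and the choice of logs are assumptions for illustration. The channel name `Microsoft-Windows-StorageSpaces-Driver/Operational` is inferred from the Event Viewer path in the question, so it is worth confirming with `Get-WinEvent -ListLog *StorageSpaces*` first.

```powershell
# Sketch: merge warnings and errors from several logs into one timeline.
# Log names and the default time window are assumptions; adjust as needed.
function Get-StorageTimeline {
    param(
        [datetime]$Since = (Get-Date).AddHours(-24),
        [string[]]$LogName = @(
            'System',
            'Microsoft-Windows-StorageSpaces-Driver/Operational'
        )
    )

    $filter = @{
        LogName   = $LogName
        Level     = 1, 2, 3      # Critical, Error, Warning
        StartTime = $Since
    }

    # Get-WinEvent raises an error when nothing matches, so silence that case.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, LogName, ProviderName, Id, Message
}
```

Running `Get-StorageTimeline -Since (Get-Date).AddDays(-7) | Format-List` from an elevated prompt shortly after a drop should show the ReFS, Storage Spaces and PnP events in the order they occurred, which makes it easier to see which one came first.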

2 Answers


Storage Spaces seems very sensitive to write latency: if it spikes too much, the volume can be dropped.

This seems to be a known problem when using consumer SSDs, as you can find here
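
If write latency is the suspect, the per-disk reliability counters are one place to look. The following is a hedged sketch rather than a definitive diagnostic: the function name `Get-DiskLatencyReport` is hypothetical, it needs an elevated prompt, and some drives do not report every counter (those fields simply come back empty).

```powershell
# Sketch: list each physical disk with its name and worst-case latency counters.
function Get-DiskLatencyReport {
    Get-PhysicalDisk | ForEach-Object {
        $disk    = $_
        $counter = $disk | Get-StorageReliabilityCounter

        # Combine identity fields from the disk with counters from the device.
        [pscustomobject]@{
            FriendlyName     = $disk.FriendlyName
            SerialNumber     = $disk.SerialNumber
            HealthStatus     = $disk.HealthStatus
            ReadLatencyMax   = $counter.ReadLatencyMax
            WriteLatencyMax  = $counter.WriteLatencyMax
            FlushLatencyMax  = $counter.FlushLatencyMax
            ReadErrorsTotal  = $counter.ReadErrorsTotal
            WriteErrorsTotal = $counter.WriteErrorsTotal
        }
    }
}
```

For example, `Get-DiskLatencyReport | Sort-Object WriteLatencyMax -Descending | Format-Table -AutoSize` puts the slowest writer at the top. A mirror member whose maximum write latency is far above its twin's would be worth a closer look.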

shodanshok
  • Thanks for that information. My ReFS / Storage Space disks are 4TB HGST hard drives, not SSDs. My SSDs run NTFS. The write latency thing is interesting though. – Tim Apr 22 '17 at 09:02

First, you really should check the HCL. I'd bet a pretty nice dinner that nothing you mentioned is on the Storage Spaces HCL. As with vSAN, Windows and Storage Spaces have completely different HCLs. I can tell without even looking up your drives that none of them are going to be on the HCL, because none of them are enterprise-grade drives.

If you want a stable solution, get an LSI SAS card (non-RAID), enterprise SATA HDDs and an Intel DC series SSD. Is it expensive? Yep. Will it be reliable? As much as Windows can be when it comes to storage (which isn't great).

Me, I dumped storage spaces and went back to an LSI RAID card. Went from data corruption every week to rock solid storage for over two years on the same hardware. And I had ALL enterprise grade kit that was on the HCL.

You can find the Hardware Compatibility List (HCL) here https://www.windowsservercatalog.com/results.aspx?&chtext=&cstext=&csttext=&chbtext=&bCatID=1642&cpID=0&avc=10&ava=0&avq=0&OR=1&PGS=25&ready=0

Eric C. Singer
  • Thanks for your thoughts Eric. Perhaps you could edit your post to expand "HCL" as I don't know what it means, and neither will others reading this post. I thought the whole point of software RAID such as Storage Spaces was to avoid the need for expensive enterprise-grade solutions. I'm not sure your answer "throw it away and buy something else" really addresses my question. – Tim Oct 08 '17 at 20:16
  • See my comments. In layman's terms, HCL = hardware that is certified to work. SW RAID doesn't mean you can skimp on enterprise-grade HW; it means you have the flexibility of a SW-based solution. – Eric C. Singer Oct 09 '17 at 00:57
  • ...and stop using ReFS, switch back to NTFS. It's infinitely more mature than ReFS. ReFS has a lot of promise, but it also has a ton of maturing to do. – Eric C. Singer Oct 09 '17 at 01:11
  • I wrote a whole blog post on Storage Spaces BTW. I started a series but gave up on it because IMO, it's a POS. Like most MS solutions, it's half-baked. http://www.ericcsinger.com/backup-storage-part-5-realization-of-a-failure/ We switched back to RAID cards and haven't had an issue since. Since you're using mirroring, you'd almost be better off setting up a robocopy script to copy data from one disk to another. It's easier to troubleshoot and probably more reliable, and it would allow you to keep your consumer-grade disks. – Eric C. Singer Oct 09 '17 at 01:17
  • I would also suggest checking out StableBit DrivePool if you want a more consumer-friendly, Storage Spaces-like solution. – Eric C. Singer Oct 09 '17 at 01:42
  • BTW, another reason you might want to move away from ReFS on windows 10 https://arstechnica.com/gadgets/2017/08/microsoft-to-remove-full-refs-support-from-windows-10-pro-push-workstation-sku/ – Eric C. Singer Oct 09 '17 at 15:15
  • Thanks for all your thoughts Eric, it's appreciated. I'm not going to mark it as the answer, as it doesn't answer the question I asked, but it'll definitely have me reconsidering my storage and backup approach. Right now my internal disks are ReFS Storage spaces mirror, my offsite backup disk is ReFS. Fortunately I also have archives in AWS Glacier and daily backups in AWS S3 so even if ReFS failed the data would be ok. – Tim Oct 09 '17 at 18:17
  • Curious, have you seen this article and checked for any disk errors? https://blogs.msdn.microsoft.com/san/2012/09/26/determining-disk-health-using-windows-powershell-on-windows-server-2012-and-windows-8/ – Eric C. Singer Oct 11 '17 at 12:03
  • I don't know how to find DiskID or disk FriendlyName. I ran it by ID from 1 to 20 and it came back with zero errors for each. I haven't seen the error I asked about in months, but I haven't done anything particularly intensive in months either. Given ReFS is being removed from the OS I probably will go back to NTFS at least for the backup disk, and maybe for the two disks that the storage spaces mirror runs over. The problem is not having the spare disk space to do that. – Tim Oct 11 '17 at 16:29
  • You would use "Get-Disk | Get-StorageReliabilityCounter", which outputs the results for each disk automatically. – Eric C. Singer Oct 11 '17 at 20:07
  • That said, since you are mirroring your disks, you *could* evict one disk, format it with NTFS and copy your data over to it. I would *strongly* advise against Storage Spaces, as the issue you're seeing isn't so much a ReFS one as a Storage Spaces one. Anything to do with disks getting evicted or showing up as failed is related to Storage Spaces. If you're just trying to keep your disks in sync so you have a local replica, robocopy is the easiest way to do it; it's just not real time. – Eric C. Singer Oct 11 '17 at 20:09
  • So for example, if you had two disks you wanted to mirror, E: to F:, you would run "robocopy E:\ F:\ /MIR /COPYALL /W:0 /R:0" (with the trailing backslashes, so the drive roots are used rather than the current directory on each drive). Set up a scheduled task to run it every hour or so. Always save your data to E: and it will get replicated to F:. You now have two independent copies of your data *and* you know how to recover it. – Eric C. Singer Oct 11 '17 at 20:11
  • Thanks for your thoughts again Eric. I've used Robocopy in the past. I really only have a RAID mirror because I had bad RAM for a while and wanted to increase protection. I have good backups so I might just use a standard disk and keep regular backups. I might look into StableBit as well, if I want to keep RAID. Or I might do nothing at all, since it seems OK recently. – Tim Oct 11 '17 at 21:34
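
The hourly robocopy mirror suggested in the comments could be set up roughly as follows. This is a sketch under stated assumptions, not a tested deployment: the function names, the task name, the drive letters `E:\` and `F:\`, and the `/XD` exclusions for the per-volume system folders are all illustrative additions. It should be run from an elevated prompt, since `/COPYALL` copies auditing information.

```powershell
# Sketch: hourly one-way mirror from E:\ to F:\ via a scheduled task.
# Drive letters, task name and exclusions are assumptions for illustration.
function Get-MirrorArguments {
    param(
        [string]$Source      = 'E:\',
        [string]$Destination = 'F:\'
    )
    # /MIR mirrors the tree (including deletions); /R:0 /W:0 disable retries.
    # /XD skips system folders that exist at the root of every volume.
    $parts = @(
        $Source, $Destination, '/MIR', '/COPYALL', '/R:0', '/W:0',
        '/XD', '"System Volume Information"', '"$RECYCLE.BIN"'
    )
    $parts -join ' '
}

function Register-MirrorTask {
    param([string]$TaskName = 'Hourly disk mirror')

    $action  = New-ScheduledTaskAction -Execute 'robocopy.exe' `
                   -Argument (Get-MirrorArguments)
    # Start now and repeat every hour.
    $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
                   -RepetitionInterval (New-TimeSpan -Hours 1)

    Register-ScheduledTask -TaskName $TaskName -Action $action `
        -Trigger $trigger -RunLevel Highest
}
```

Calling `Register-MirrorTask` once creates the task. Keep in mind that `/MIR` also propagates deletions to the destination on the next run, so this gives a second copy of the current data rather than a history of it.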