
I'm in the process of decommissioning an old 2003 server, which acts as a file server, and am just attempting a dry-run of migrating the file repository over to a new Windows Storage Server 2012 box. I'm using robocopy to copy over the files, and currently just doing some test runs to see how long it takes, before we make the final change over.

The first time I ran robocopy I supplied the following switches:

Options : *.* /S /E /COPYALL /PURGE /MIR /MT:128 /R:100000 /W:30

It ran fine (although I wouldn't recommend those /R and /W values, as it'll take forever to complete!).

The second time, the destination directory already contained a copy of the source from the first run, and /MIR should ensure it's brought up to date. I ran it with the following switches:

Options : *.* /S /E /COPYALL /PURGE /MIR /MT:128 /R:0 /W:0
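To put a rough number on why /R:100000 /W:30 can take forever, here is a minimal sketch of the worst case for a single file that never becomes copyable. It assumes /R is the retry count and /W the wait between retries in seconds, and it ignores the time taken by the copy attempts themselves; the function name is made up for illustration.

```python
# Worst-case time robocopy spends waiting on ONE file that keeps failing.
# Assumption: /R:n is the retry count, /W:n the wait in seconds between retries.

def worst_case_retry_seconds(retries: int, wait_seconds: int) -> int:
    """Total wait time for a single permanently failing file."""
    return retries * wait_seconds

first_run = worst_case_retry_seconds(100000, 30)  # /R:100000 /W:30
second_run = worst_case_retry_seconds(0, 0)       # /R:0 /W:0

print(first_run, "seconds, roughly", round(first_run / 86400, 1), "days per failing file")
print(second_run, "seconds per failing file")
```

So a single locked file could stall the first configuration for weeks, while the second gives up on it immediately.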

This caused the server to hang about 5 minutes after the job started. It completely hung and I had to manually power cycle it to restart it. The logs aren't giving me a huge indication of what went wrong - thoughts were that /mt:128 had caused issues, but I supplied that switch the first time and that was fine.

The second time I changed a couple of switches to /R:0 and /W:0, although I wouldn't imagine that they would cause it to hang.

Finally, is the fact that I've chosen /MIR problematic, given that the destination has already been copied over from the source once before? I wouldn't have thought so, as I thought the only potential downside of mirroring was that it deletes files in the destination which are no longer in the source. If anyone could shed any light on what went wrong, it'll help ensure that it doesn't go wrong the next time I try it.

EDIT: the switches I mentioned above are taken from the robocopy log file, and in a sense they are an interpretation of the switches I specified, which were: /MIR /COPY:DATSOU /MT:128 /R /W
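As a rough illustration of how the switches I typed relate to the ones in the log, here is a small sketch that expands the documented robocopy shorthands (/MIR is /E plus /PURGE, /COPYALL is /COPY:DATSOU, /SEC is /COPY:DATS). The helper name and table are my own for illustration, not anything robocopy provides, and the retry switches are left out.

```python
# Documented robocopy shorthands and the switches they stand for.
SHORTHAND = {
    "/MIR": ["/E", "/PURGE"],
    "/COPYALL": ["/COPY:DATSOU"],
    "/SEC": ["/COPY:DATS"],
}

def expand(switches):
    """Replace shorthands, upper-case everything, drop duplicates, keep order.

    Only meant for comparing switch lists, not for building a real command.
    """
    seen, result = set(), []
    for switch in switches:
        key = switch.upper()
        for item in SHORTHAND.get(key, [key]):
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result

specified = ["/MIR", "/copy:DATSOU", "/MT:128"]
longhand = ["/E", "/PURGE", "/COPYALL", "/MT:128"]

print(expand(specified))
print(expand(specified) == expand(longhand))
```

Both lists reduce to the same set of switches, which is why the log looks more verbose than what was typed.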

2nd Edit: The server in question has dual NICs, teamed using the built-in Windows Server NIC teaming. I feel this is important information, which I did not share when I originally posted the question, and I would like to investigate it. The NIC in question is an Intel(R) 82574L Gigabit Network Connection. The NIC team is 'Microsoft Network Adapter Multiplexor Driver'.

kafka
  • Running robocopy should not hang a server no matter what options specified. But stuff does happen, I'd call the storage server OEM and harass them. – tony roth May 09 '13 at 16:44
  • I use either /S, /E or /MIR but not more than one. /MIR should work fine and you don't need /PURGE either. I would also leave the threads (/MT) at the default of 8. 128 seems excessive. – Peter Hahndorf May 09 '13 at 16:49
  • Yeah, there's no way it should have completely hung the server. I'm happy for the network to be saturated, but it shouldn't overwhelm the server. The option I specified was just /MIR (equivalent to /E and /PURGE); the /S must be from /COPY:DATSOU. – kafka May 09 '13 at 17:00
  • This is an OEM version of Storage Server, correct? If so, you'll need to contact the OEM, because I'm sure it's a NIC driver issue. – tony roth May 09 '13 at 17:25
  • It could well be a driver issue, as we've got Windows Server 2012 on there, which is pretty recent, and I'm not 100% sure all the hardware has totally up-to-date drivers. The box was supplied without an OS, which I installed myself, then I put the drivers on from the CD. I'll see if I can update them. – kafka May 09 '13 at 19:50
  • I've had mixed results when adding in the /MT option. I've had UNC shares become unavailable for no apparent reason, and a bit of lag/slowness as well. While not a lockup, it can make the machine look like it is basically frozen. Maybe open up Resource Monitor before starting Robocopy and see if it points to the cause. My guess is that the /MT is kicking your server in the pants. Peter's suggestion of /MT:8 is what I do. But if I need to be sure it copies, I leave off /MT completely. – MikeAWood May 18 '13 at 01:12
  • I'm starting to think it may be related to the NIC. The server has dual NIC, with latest drivers, so I don't think it's a driver issue. However, we have teamed the NICs, using the inbuilt Windows NIC teaming feature. I've not uncovered much but have come across (anecdotal) warnings about using Windows NIC teaming and would like further information on this. – kafka May 23 '13 at 12:37
  • Do you have access to logs on the switch? Perhaps that would have more info if it is a NIC issue. And I presume you don't see anything on the Win2012 logs? – CC. May 23 '13 at 16:01
  • Unmanaged switch so no access to logs I'm afraid. The Win2012 logs told me very little, nothing along the lines of what I'm suspecting anyhow. – kafka May 23 '13 at 16:07

3 Answers


It appears to me that Robocopy is A) buggy, and B) hooks into the kernel in some way that can make the entire system incredibly unstable when it bugs out. We've seen this happen quite often (especially with the /MT option) when syncing over reasonably high-speed WAN links (20 Mbps to 100 Mbps). So I'm pretty sure it's not a NIC driver having traffic volume issues: we do things in production that abuse them far more badly than this, and we see this even with 10 Gbps LAN connections on Cisco UCS / VMware 5.5, with everything patched current and Robocopy v6.3.9600.17415 dated 10/28/2014.

I'd love it if somebody can definitively prove we're all doing something stupid, but it looks like Microsoft is just putting out some unbelievably dangerous code.

  • BTW, robocopy is an UNSUPPORTED M$ release. That should tell you a little something. – mdpc Dec 14 '14 at 19:34

It sounds very much like a network card driver issue. To see if this is a bug with your dual-NIC setup, set the /IPG parameter to about 20 milliseconds and remove your /MT:128 parameter (since /IPG and /MT are not compatible). Based on the "switches I specified" line in your original post, it would look like this:

/MIR /COPYALL /R /W /IPG:20 /Z

The /IPG:20 (inter-packet gap) will slow down the transmission considerably, but provides stability.

The /Z (restartable mode) is important for copies over the network, in case of network disruptions (caused by bad cards, drivers, or by actual network issues) because it will allow the copy to pick up where it left off.

If this completes successfully, you've got an issue with your network driver. The issue would be that whatever driver you're using can't handle the throughput of the default /IPG:0.

The final nail in the coffin for the NIC driver being the root cause of your server hanging would be to replace the card and rerun the command that caused it to hang. Apart from that you could probably also unplug one of the connections so the multiplexing doesn't occur, and run the command that produced the error.
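If you script the test runs, the throttled command can be assembled in a way that refuses the /IPG plus /MT combination up front. This is only a sketch: the function name and the paths are hypothetical, and the retry switches are left out because the original post does not give their values.

```python
def build_robocopy_args(source, dest, ipg_ms=None, threads=None):
    """Build a robocopy argument list for a test run.

    Raises ValueError if both /IPG and /MT are requested, since the two
    switches are documented as incompatible.
    """
    if ipg_ms is not None and threads is not None:
        raise ValueError("/IPG and /MT cannot be combined")
    args = ["robocopy", source, dest, "/MIR", "/COPYALL", "/Z"]
    if ipg_ms is not None:
        args.append(f"/IPG:{ipg_ms}")  # inter-packet gap in milliseconds
    if threads is not None:
        args.append(f"/MT:{threads}")  # multithreaded copy
    return args

# Hypothetical paths, for illustration only.
SRC = r"\\oldserver\share"
DST = r"D:\share"

print(" ".join(build_robocopy_args(SRC, DST, ipg_ms=20)))
```

On the Windows box itself, the returned list could be handed to `subprocess.run`; the sketch only prints it so nothing is copied by accident.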

Suggestion came from cb42 on technet.

http://social.technet.microsoft.com/Forums/en-US/itprovistaapps/thread/9555a996-1301-4f68-b9d3-82a87fc6ba46/

...and ss64 rocks (just sayin!) http://ss64.com/nt/robocopy.html

Lucretius
  • Thanks for this Lucretius. Interestingly robocopy has hung on this server again (fortunately it's not crashed the whole server) and this was copying from an internal hard disk to an external, encrypted hard disk! (I need to figure out a way to encrypt our external hard drives and get them working with wbadmin - it's not liking my truecrypt encrypted partitions though). So although the original error is almost certainly a driver issue on the nic_team (it's the latest driver too) I'm also having problems with robocopy without crossing the network. Shame cos when it works it's great. – kafka May 28 '13 at 19:48
  • You could try copying in raw mode with robocopy by appending /EFSRAW at the end. `/EFSRAW : Copy any encrypted files using EFS RAW mode.` – Lucretius Jun 18 '13 at 18:35

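Why do you use /S with /E? They seem to be opposites: /S skips empty subdirectories, while /E includes them. Also, /E plus /PURGE is equivalent to /MIR, so you don't need to specify them separately. And I think /MT:128 is too high; you should reduce it. Try:

/MIR /COPYALL /MT:64 /R:10 /W:60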
cuonglm
  • The flags are taken from the log file, the params I specified are: /MIR /copy:DATSOU /mt:128 /R /W – kafka May 09 '13 at 17:01
  • Was Robocopy logging to the console? If not, it's possible that it wasn't hung but because you didn't see any output you assumed it was hung. – joeqwerty May 09 '13 at 18:27
  • it was logging to a text file. the system was completely unresponsive, and when I rebooted it today the server logs stopped last night at 530pm, nothing after that. – kafka May 09 '13 at 19:48