
Question: Is there a way to make this 350,000-file backlog complete faster? For nearly every file in the backlog, the only change was to its ACL; some files have changed content, but that is not the common case here.

This might be fixed. I'll edit this text to confirm success or failure after some time has passed and I've been able to verify it. Toward the end of this question I have detailed the recent changes that might have fixed it.

We have a DFSR replication group containing about 450,000 files and taking up 1.5TB of space. In this situation there are two Windows Server 2008 R2 servers that are about 500 miles apart. There are other servers, but they aren't involved in this replication group. Server ALPHA is the main server and is the one used by most of the staff; server BETA is the server in the remote office and is less busy.

Here is a graph of backlog for this replication group (PNG hosted on Google Drive) showing the slow sync progress.

I needed to remove a permission entry from the root directory of that replication group, which of course was inherited across most of the subfolders. I made this change on server ALPHA. Immediately afterward, DFSR had a 350,000-file backlog. More than a week has passed and it is now at 267,000. The only thing that changed (initially) was that single permission.
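
For anyone wanting to check the same numbers, a command along these lines reports the backlog count from ALPHA to BETA (the group and folder names here are placeholders, not our real names):

```
# Count the files queued from ALPHA (sender) to BETA (receiver).
# MyRG and MyFolder are placeholders for the real replication group / replicated folder names.
dfsrdiag backlog /rgname:MyRG /rfname:MyFolder /sendingmember:ALPHA /receivingmember:BETA
```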

This is what happened (this is not the solution, just another explanation of what happened to cause this issue): http://blogs.technet.com/b/askds/archive/2012/04/14/saturday-mail-sack-because-it-turns-out-friday-night-was-alright-for-fighting.aspx#dfsr

Any changes that occur on server BETA are replicated to server ALPHA very quickly since there is no backlog in that direction. Any files changed on BETA do make it to ALPHA without trouble.

It's replicating 24/7 at full speed across a 50Mbps connection on one end and 100Mbps fiber on the other. The staging area is 100GB on each server. There is nothing interesting in the event logs: the only high watermark event is for an unrelated replication group that involves neither this replicated folder nor this ALPHA/BETA server pair, and there are no high watermark or connection error events for this group.

ALPHA's view of the replication group:

Bandwidth Savings: 99.83% reduction (30.85 MB replicated instead of 18.1 GB)

I believe the 30.85MB/18.1GB figure covers the period since I last restarted the DFSR service on ALPHA and BETA. If so, it shows that even though replication is taking a very long time (longer than I believe it should), it isn't actually transferring the file contents across the wire.

Replicated folder: 1.46TB (actual size), 439,387 (files), 52,886 (folders)

Conflict and Deleted folder: 100.00GB (configured size), 34.01GB (actual size), 19,620 (files), 2,393 (folders)

Staging folder: 200.00GB (configured size), 92.54GB (actual size)

I got one high watermark error in the logs (May 14, 7pm) and so have raised the staging quota from 100GB to 200GB. I know that the Microsoft-approved route is to increase it by 20%, but I'm not playing around on this. We have plenty of disk space to spare on the staging disk arrays.
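
For reference, the staging quota can also be raised from PowerShell via the DFSR WMI provider rather than the DFS Management console. This is only a sketch: the folder name is a placeholder, and the change needs to be made on each member:

```
# Sketch: raise the staging quota for one replicated folder to 200GB (204800 MB) on this member.
# "MyFolder" is a placeholder for the real replicated folder name.
$rf = Get-WmiObject -Namespace "root\MicrosoftDFS" -Class DfsrReplicatedFolderConfig |
      Where-Object { $_.ReplicatedFolderName -eq "MyFolder" }
$rf.StagingSizeInMb = 204800
$rf.Put()
```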

Disabling anti-virus on all servers did not help, though I thought it would have helped a little bit. For now I have re-enabled anti-virus but set the replication group's path to be excluded from scanning in order to remove that variable from the equation.

Is there a way to get this to go faster? I would just make the same change on server BETA, but there are files that have changed on ALPHA and haven't yet replicated to BETA, and making the inherited permission change on BETA would push OLD files from BETA to ALPHA (because DFSR seems to ignore file timestamps when deciding the winner in a collision). Having that happen would be rather bad.

The backlog is reducing slowly. Very, very slowly. It is going forward, though. But at this rate, it will be weeks before it finishes. I'm contemplating just shoving a copy of the data set onto a 3TB drive and shipping it to the remote office. Is there a better way?

May 16, 4am US PT: What might have fixed the problem (assuming it's honestly fixed, anyway):

I made multiple changes to the DCs that should have been made a long time ago. The problem is that this network was inherited from someone else who probably inherited it from someone else, etc. I can't promise which change fixed the problem. Here they are in no particular order:

  • The DCs were not in the "Domain Controllers" OU. I've never seen a Windows domain that kept its DCs elsewhere, so I moved them back to where they belonged. They were previously in OUs segregated by the name of the city each office is in. (I have a feeling I've got some plumbing work to deal with now that I've moved them, but all seems okay at present...)
  • AVG Anti-Virus is running on all DCs and DFSR-participating servers. I excluded the replicated folders and the staging folders from active/on-access scanning. I don't think this fixed the problem and I'm likely to test this issue later on to see if undoing that change will interfere with the replication speed of DFSR. That's a challenge for another day.
  • dcdiag.exe complained of a DNS issue with regard to RODCs. I remedied that problem even though we have no RODCs on the domain at all. I doubt this fixed anything.
  • One of the _ldap._tcp.domain.GUID._msdcs.DOMAIN.NET SRV records was missing for one of the DCs (not one of the DFSR servers) and I remedied that. I don't think this helped either.
  • One of the times I rebooted server BETA, it complained of a bad shutdown of the DFSR database (event 2212) and then took hours to rebuild the database, reporting event 2214 when it was done. After that, replication was still running extremely slowly, but it might have helped unstick whatever was stuck.
  • One of the DCs didn't have 127.0.0.1 as a secondary DNS server in its interface configuration. I added it. This wasn't one of the DFSR servers, so that probably had nothing to do with it.
  • I followed the recommended registry settings for DFSR servers from the TechNet blog post "Tuning replication performance in DFSR". I used all of the "tested high performance" values, except that AsyncIoMaxBufferSizeBytes was set to 4194304, one notch lower than the high value (a sketch of this registry change follows this list). This could have helped with the problem... or maybe not. It's difficult to tell when one changes too many variables.
  • dcdiag.exe complained about a problem communicating with the RPC service on BETA, but only after I had already made the above changes. This seemed like the most likely culprit, but I did nothing specific to correct it; the VPN was running properly and the firewall wasn't blocking it. It's possible that one of the above items caused and then remedied the RPC issue, or it could have been simple coincidence. I am not getting that error now and replication is running smoothly at present.
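
Regarding the registry tuning mentioned in the list, here is a sketch of what that change looks like on each DFSR member. Only the AsyncIoMaxBufferSizeBytes value of 4194304 is exactly what I used; the registry path is my recollection of the one from the blog post, and the remaining value names and "tested high performance" numbers should be taken from the post itself rather than from this sketch:

```
# Sketch of the DFSR tuning registry change (run on each DFSR member, then restart the service).
# Only AsyncIoMaxBufferSizeBytes=4194304 is confirmed above; verify the path and the remaining
# value names/numbers (e.g. UpdateWorkerThreadCount, TotalCreditsMaxCount) against the blog post.
$settings = "HKLM:\SYSTEM\CurrentControlSet\Services\DFSR\Parameters\Settings"
New-Item -Path $settings -Force | Out-Null   # create the key if it doesn't already exist
New-ItemProperty -Path $settings -Name AsyncIoMaxBufferSizeBytes -PropertyType DWord -Value 4194304 -Force
# ...repeat New-ItemProperty for the other tuning values listed in the blog post...
Restart-Service -Name DFSR
```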

The moral of the story is: change one thing at a time or you'll never really know what fixed it. But I was desperate and was running out of time to fix it, so I just fired a bunch of bullets at the problem. If I ever pinpoint the fix, I'll report that here. Don't bank on me narrowing it down, though.

EDIT 5/21/2012: I solved this by driving about seven hours to the remote office yesterday with a spare server (GAMMA). GAMMA is now acting as their primary local server while their usual server (BETA) catches up on the replication. Since I put it into place, replication has been running at about double the speed. While this suggests it could be a VPN-related issue, I'm less inclined to believe that, since new updates replicating from ALPHA to GAMMA have been very quick and going well.

EDIT 5/22/2012: The backlog is at 12,000 right now and should be finished in a few hours. I'll post a nice graph of the progress from slow start to fast finish. The catch is that the only thing that really "fixed" it was the local server connection, so I'm now thinking the VPN may be part of the problem, and if that's the case this question isn't quite answered yet. After I've had more time to watch how things replicate over the VPN and note any failures, I'll dig in and report back.

If something changes I'll update here.

Emmaly
  • How much data needs to be replicated and how much bandwidth is available between your site and the remote site? Also, are you throttling DFS replication? – MDMarra May 12 '12 at 03:57
  • My answer to add is the same as MDMarra (check your replication schedule and staging size), so I'll just leave a comment. If it was a permission change, then it's not the actual data that is being replicated, rather the security attributes on each file. In these cases, the backlog typically isn't dependent on bandwidth. You haven't mentioned anything that is shown in the Event Log, but it's worth taking a look. Also run a DFSR Diagnostic report for the replication group. – Jeff Miles May 12 '12 at 04:33
  • Also, Windows Server 2012 has a feature that should take away this problem forever: http://blogs.technet.com/b/askds/archive/2012/04/14/saturday-mail-sack-because-it-turns-out-friday-night-was-alright-for-fighting.aspx#dfsr – Jeff Miles May 12 '12 at 04:33
  • I updated the question to answer these questions. – Emmaly May 13 '12 at 06:16
  • `dfsrdiag replicationstate /a` shows that it is only sending two files, but both have the same filename. It says that it has two outbound connections to BETA from ALPHA, anyway. The file that it is sending is 850MB. As described before, I'm not convinced that it is actually _sending_ the entire file's contents, though I'm not sure _what_ it would be doing if not since it takes a very long time just to deal with a single file. The file was last updated in 2008 (on both servers) so there is no reason it needs to do anything except update the ACL info on the file on BETA. – Emmaly May 15 '12 at 07:36
  • On BETA, it says the following: **Total number of inbound updates scheduled**: 252, **Active inbound connections**: 2, **Updates received**: 496 – Emmaly May 15 '12 at 07:38
  • From 8pm until 4am the backlog count went from 266840 to 245640, which is amazingly good progress. I changed quite a few variables, so it'll be hard to tell which thing fixed it, but I will update the question above to detail the changes I made in case it helps someone in the future. Of course this is assuming that I don't hit another snag, but even if I did, I think it'd be file-specific in that case. – Emmaly May 16 '12 at 10:33

3 Answers


You can tweak the replication schedule to allow DFS-R to replicate at full-speed during off hours (or even on hours if appropriate).

You can also try increasing the staging size on the backlogged server. It should improve performance in this situation.

You don't mention whether or not the bandwidth is capped, but I assume it is, since you have replication across a WAN.
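
If you'd rather script those two changes than click through DFS Management, something along these lines works where the DFSR PowerShell module is available (Windows Server 2012 R2 and later; on 2008 R2 you'd make the same changes in the console). This is only a sketch, and the group, folder, and server names are placeholders:

```
# Sketch: open the group schedule to full bandwidth 24/7 and enlarge staging on the backlogged member.
# Requires the DFSR PowerShell module (2012 R2+); names below are placeholders.
Set-DfsrGroupSchedule -GroupName "MyRG" -ScheduleType Always
Set-DfsrMembership -GroupName "MyRG" -FolderName "MyFolder" -ComputerName "BETA" -StagingPathQuotaInMB 204800
```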

MDMarra
  • I updated the question to respond to your response. In particular it details the 24/7 full-speed replication schedule and the 100GB staging area. What you said would be helpful if these items were not already in place. I appreciate your interaction on this. – Emmaly May 13 '12 at 06:18

Very strange problem, especially after reviewing the edit.

I would inspect the DFSR debug logs, which are located in %systemroot%\debug. By default there should be nine previous log files that have been GZ-archived, plus one that is currently being written to.

Open the current log in a text editor and search for the text "warning" or "error". You can check out this blog series for more detailed information on the debug logs: http://blogs.technet.com/b/askds/archive/2009/03/23/understanding-dfsr-debug-logging-part-1-logging-levels-log-format-guid-s.aspx
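
For example, a quick way to do that search from PowerShell against the current (uncompressed) log; the archived .gz files would need to be expanded first:

```
# Scan the active DFSR debug log(s) for warning/error lines and show the most recent matches.
Select-String -Path "$env:SystemRoot\debug\Dfsr*.log" -Pattern "warn|error" |
    Select-Object -Last 50
```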

Other questions/suggestions:

Is there anything out of place when looking at the Resource Monitor? Excess hard drive or CPU activity that is outside a baseline?

If possible, I'd restart both the Alpha and Beta servers. If that resolves your issue you may never know what the real problem was, but if it's critical that this gets resolved soon, it's worth a try.
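
If a full reboot is hard to schedule, restarting just the DFS Replication service is a lighter-weight first attempt (the service short name is DFSR):

```
# Restart only the DFS Replication service rather than rebooting the whole server.
Restart-Service -Name DFSR
```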

Edit based on Question Update

You mentioned two entries related to an 850 MB file, as well as an error within the DFSR debug log.

Can you try changing the staging location to a different folder or drive on each server, in case the files currently being staged are corrupt or blocking replication in some way?
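
If you want to script that rather than use the DFS Management console, I believe the staging path can be moved via the DFSR WMI provider; a sketch, with the folder name and target path as placeholders:

```
# Sketch: move the staging folder for one replicated folder to a different drive on this member.
# "MyFolder" and the target path are placeholders; repeat on each member.
$rf = Get-WmiObject -Namespace "root\MicrosoftDFS" -Class DfsrReplicatedFolderConfig |
      Where-Object { $_.ReplicatedFolderName -eq "MyFolder" }
$rf.StagingPath = "E:\DfsrStaging\MyFolder"
$rf.Put()
```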

Jeff Miles
  • The newest log file has nothing matching "warning" but it does have errors. The errors are _all_ just like this one: "20120513 23:38:59.198 6592 ASYN 755 [WARN] AsyncUnbufferedFileWriter::SetFileSizeEstimate [Error:87(0x57) FileUtil::SetFileValidDataLength fileutil.cpp:1657 6592 W The parameter is incorrect.]" I have disabled anti-virus as well to see if that is causing this horrible slowdown. I forgot av was even on those servers and it may very well be the cause of the trouble. :-| – Emmaly May 14 '12 at 22:31
  • Anti-virus notes were added to the question. It doesn't appear to affect anything, as noted. – Emmaly May 15 '12 at 02:07
  • I have rebooted both ALPHA and BETA many times over the course of debugging this issue. It hasn't seemed to have an effect on anything aside from the related errors in the event logs on the opposite server. CPU activity on both servers is very low. It hardly averages 20% even with high mid-day load. Same with RAM. Disk writes are very frequent but it never shows as pegged at 100%. It doesn't seem to be disk IO bound. Right now I just have to assume that something somewhere is waiting on some sort of lookup and timing out? I don't see any other reason for this behavior. I'm still digging... – Emmaly May 15 '12 at 02:14
  • I had to reboot BETA again because of applied Windows Updates and it came back up with a 2212 but hasn't come back with a 2214, so now I wait and wait. Maybe it's a sign of good things to come. Or it means that there is just more screwed up stuff on BETA. Servers: pfft. – Emmaly May 15 '12 at 05:20
  • ... no dice. Same slowness, same problems. I'll keep pushin' on. – Emmaly May 15 '12 at 07:01
  • Great progress has been made. I will detail changes made on the servers in the question text, though too many variables were changed to promise a specific cause/fix. Your responses were the most helpful. Even though the changes I made were only vaguely related, at least some of them were prompted by digging through the debug logs, etc. I'm going to wait just a little bit before doing so, but I'm most likely to flag your response as the answer. Thanks for the support on this. – Emmaly May 16 '12 at 10:37
  • I didn't actually mean to unset this as the answer. – Emmaly May 21 '12 at 09:19

My experience is that this is Just How It Works.

I stumbled across this after updating security on a fairly small collection of 4 DFS replication groups (550 GB of data, 58k files, 3.4k folders in total). The data actually transmitted on the wire is low, so it appears DFSR is not moving entire files for security-only changes, but the disk activity feels like the entire hierarchy is being recopied: sustained disk transfer rates of 60-100 MB/sec and disk queue lengths of 30, peaking as high as 500, on an SSD-tiered storage space.

My sense is that DFSR has a lot of churn in its staging and destaging process, which results in extreme disk I/O. An initial replication between two gigabit-LAN-connected boxes takes several times longer than simply file-copying the same data between them, which would seem to indicate that every byte replicated requires multiple bytes of disk reads and writes.

Security updates don't seem to have any special replication logic barring the use of the 2012 claims-based security (which isn't widely used AFAICT), resulting in the same stage/destage churn you would get for data changes.

Mobocracy
  • Hello from the future. Ever since this issue happened, we set up security groups in AD that are specific to the data being shared/stored, and then do our very best to never touch that ACL ever again. Then we can just adjust group memberships with no DFSR penalty. I'm sure this is a best practice already, but if it isn't, it certainly should be! – Emmaly Dec 27 '21 at 02:56