blog.brian nuszkowski.com - A blip in IT.
Faster than Distributed File System Replication (DFSR)

DFSR is an extremely important piece of my daily environment. I primarily use it as it was intended - to efficiently replicate files from one location to another over a slow to average link. My situation is a little unusual, though: files always originate on the far end of the link (the remote site) and replicate to a datacenter. Once DFSR replicates this data to the datacenter, a “data monster” moves the data from the near end DFSR server into some giant information warehouse in the sky. So, let me recap this process from start to finish, with a little more detail.

1.) Data is generated on the far end.
2.) DFSR replicates this data to our datacenter DFSR servers.
3.) The “data monster” moves the data from the datacenter DFSR server to its final resting place.
4.) The datacenter DFSR server tells its far end replication partner, “Hey there, this file is gone, so please tombstone (delete) this file on your end.”
5.) The far end DFSR server deletes the data as instructed and life goes on.

Let’s call this the “transfer lifecycle.”

Once this data has been generated at the far end and moved to its final resting place in the data warehouse, we no longer need it to exist on the far end DFSR replication partner. That’s why it’s not a big deal that when the data is moved off of the near end DFSR partner, it’s also removed from the far end replication partner.

Initially, my team developed a relatively efficient PowerShell script whose only job was moving data from the near end DFSR partner to the data warehouse. This script never really gave us much grief. I must admit that it seemed a bit sluggish at times, and it was never so fast that we would go around giving high fives about it. Up until this point, both the script and the DFSR process itself were running smoothly and we didn’t experience any bottlenecks. Then one day my team decided to write an application that would run as a service on the near end DFSR replication partner, replacing the PowerShell script.

When this application was implemented, we started to notice two things. 1.) We were randomly receiving TONS of duplicate files and 2.) it would take 5 times as long for a large group of files to fully complete its “transfer lifecycle”. Again, this only happened when a large group of files was being transferred.

After lots of trial and error, here is what I discovered. Our near end file moving application was too efficient! Before DFSR could complete the entire transaction for a large group of files, our application would have already moved some of those files off of the server! DFSR would then say, “Hey! The file was here and now it’s not - let me try again.” This explained the large number of duplicate files. Even though it would retransmit over and over again, DFSR was intelligent enough to slowly catch up. This explained the increase in time to complete the “transfer lifecycle.” I attribute these results to the increased efficiency of using the Windows API for file movement, combined with WAN link performance (or lack thereof).

In the end, we determined that lengthening the interval at which the process looks for new file arrivals would take care of this issue. Lesson learned. Don’t try to beat something at its own game unless you are fully informed and fully prepared!
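
To make the idea of "slowing down on purpose" concrete, here is a minimal sketch in Python. It is not our actual service. The class name `DelayedMover`, the `settle_seconds` parameter, and the `run` helper are all hypothetical names made up for this illustration. It also goes one step beyond what we did: on top of a longer gap between polls, it refuses to move a file until the mover itself has seen that file sitting in the folder for a minimum amount of time. Whether that gives DFSR enough room in a given environment is an assumption you would have to test against your own link.

```python
import shutil
import time
from pathlib import Path


class DelayedMover:
    """Moves files out of a watched folder only after they have been visible for a while.

    Hypothetical illustration: the wait is measured from the moment this process
    first notices a file, not from the file's own timestamps.
    """

    def __init__(self, source, destination, settle_seconds):
        self.source = Path(source)
        self.destination = Path(destination)
        self.settle_seconds = settle_seconds
        self.first_seen = {}  # file path -> time this process first noticed it

    def poll(self, now=None):
        """Run one pass over the folder and return the names of the files moved."""
        now = time.monotonic() if now is None else now
        present = {p for p in self.source.iterdir() if p.is_file()}
        # Forget files that vanished without our help since the last pass.
        self.first_seen = {p: t for p, t in self.first_seen.items() if p in present}
        moved = []
        for path in sorted(present):
            # Record the first sighting; files seen earlier keep their original time.
            seen_at = self.first_seen.setdefault(path, now)
            if now - seen_at >= self.settle_seconds:
                shutil.move(str(path), str(self.destination / path.name))
                del self.first_seen[path]
                moved.append(path.name)
        return moved


def run(mover, poll_interval_seconds):
    """Poll forever, sleeping between passes (the "increased gap of time")."""
    while True:
        mover.poll()
        time.sleep(poll_interval_seconds)
```

With something like `run(DelayedMover(src, dst, settle_seconds=300), poll_interval_seconds=60)`, a file that shows up in one pass is left alone until a pass at least five minutes later. Those numbers are arbitrary placeholders, not values we measured.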