Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:
- The first node will initiate a reconnect of the stream between the two nodes. This will cause the second node to add the new stream to its data.
- Adding the new stream tears down the old stream and sets its failure handler to ignore. In the failure case, the old stream is the failed stream that has not been detected yet.
- When the connection break is detected on the second node, the second node will initiate a reconnect sequence of its own. If the connection break is detected in the proper race window, the failed stream’s failure handler will be set to ignore, and the reconnect process will not initiate a reconnect. It will, however, issue a pause for the send queue, which stops messages from being sent between the nodes. When the messages are stopped, this prevents GUM from operating correctly and forces a cluster restart.
In addition to fixing the issue described above, KB2550886 also includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:
- http://support.microsoft.com/kb/2549472 – Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2
- http://support.microsoft.com/kb/2549448 – Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2
- http://support.microsoft.com/kb/2552040 – A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication fail
I’ve seen the issue that this hotfix looks to fix, a network issue causes the entire DAG to fail at at least one client. This hotfix came out in August and I hadn’t heard of it until now.
Last week I installed this hotfix on my two Exchange servers and at two client sites and had no issues.
The Exchange Team on 11/21 posted this as “strongly recommended: hotfix to install on any Exchange 2010 DAG member running on Windows 2008 R2. EHLO Blog Post
From this post: