Exchange DAG member dies during snapshot creation
Yesterday, a customer called me and told me about a scary observation on one of his Exchange 2016 DAG (Database Availability Groups) nodes.
In preparation of a security check, my customer created a snapshot of a Exchange 2016 DAG node. This node is part of a two node Windows Server 2012 R2/ Exchange 2016 CU7 cluster.
That something went wrong was instantly clear, after the first alarm messages were received. My customer opened a console windows and saw, that the VM was booting.
What went wrong?
Nothing. Something worked as designed, except the fact, that the observed behaviour was not intended.
That a snapshot was created was clearly visible in the logs. Interesting was the amount of time, that the snapshot creation took. It took 5 minutes from the start of the snapshot creation until the task finished. During this time, pretty much data was written to the disks.
The server eventlog contained an entry, that pointed me to to the right direction.
Event Type: Information Event ID: 1001 Source: BugCheck Description: The computer has rebooted from a bugcheck. The bugcheck was 0x0000009E (0xffffe0001eccf900, 0x000000000000003c, 0x000000000000000a, 0x0000000000000000).
The Ask the Core Team wrote a nice blog post about this STOP error. In short: The failvoer clustering service incorporates a detection mechanism that may detect unresponsive user-mode processes. If an unresponsive user-mode process is detected, a HangRecoveryAction is called. Since Windows Server 2008, a STOP error (Bugcheck) is caused on the cluster node.
Most likely hypothesis
My explanation of the observed behaviour is, that my customer accidentally created a snapshot that has contained the VM memory. Because the Exchange server has 32 GB memory, the snapshot creation took some time and the VM became unresponsive. As the VM was responding again, the HangRecoveryAction did its dirty job.
Check if the checkbox for the VM memory is disabled, before you create a snapshot. Otherwise the bugcheck will do its job. Please note, that you might see this behaviour in all Microsoft Windows Failover Clusters, not only with Microsoft Exchange.