Wednesday, 30 December 2015

Exchange 2016 Database Availability Group Troubleshooting (Part 1)

Introduction


I’ve recently posted instructions on how to set up an Exchange 2016 DAG and set up of the mailbox database copies so this post will be dedicated to testing our DAG in various failure scenarios while also demonstrating some of the DAG troubleshooting methods.

For more information on how to set up an Exchange 2016 DAG, see here. For more information on how to perform maintenance on your DAG, see here.

To go to other parts in this series:



Database Availability Group Testing


In order to make sure that our DAG is working correctly and to take a bit more of a look into what failures a DAG can tolerate without administrator input, we will test the following failures:

  • DAG member failure
  • REPLICATION network failure
  • MAPI network failure
  • Failure of single Exchange server storage
  • Failure of single Exchange server database storage
  • Information store crash on a single Exchange server
  • OWA Virtual Directory failure
  • DAG member and file share witness failure



Database Availability Group Troubleshooting Tools


The key tools we will look at are below:
  • System Event Log
  • Application Event Log
  • Crimson Channel Event Log
  • Cluster Logs
  • CollectOverMetrics.ps1 script

Lab Environment


The lab follows on from what we set up earlier in part 1 and 2. Anyways, here’s a reminder - we have two Exchange 2016 servers running on Server 2012 R2 configured in an IP-less DAG: LITEX01 and LITEX02:

image_thumb7

As for the mailbox databases, we have four mailbox databases and each has a copy on LITEX01 and LITEX02:

  • MDB01
  • MDB02
  • MDB03
  • MDB04

The preferred mailbox database copy for MDB01 and MDB02 is on LITEX01 and for MDB03 and MDB04, this is LITEX02. Here we can see the databases mounted on the preferred mailbox servers:

Get-MailboxDatabaseCopyStatus -Server LITEX01 | Sort Name

image

Get-MailboxDatabaseCopyStatus -Server LITEX02 | Sort Name

image

Our file share witness is on LITFS01, (file server) and our quorum model is node and file share majority, as is expected for a two node DAG:

Get-ClusterQuorum | fl *

image

In these tests, all client access services have been made highly available using DNS round robin. Instructions on how to set this up can be found here.

Ok, now we have confirmed that our DAG is in a healthy state before we go ahead and test it (break it?), we can begin with our first test.


DAG Member Failure


To test this, I’ll force power off server LITEX01 which will be the same effect as having a power cut or other sudden total failure.


Outlook 2016


Outlook in online mode may become unresponsive and you may see the warning below.

image

After a few seconds, Outlook recovers and you can open your emails and continue working without issue. Whether you see this warning or not depends on how long it takes for your DAG to fail over and for the databases to be mounted again.


Outlook Web Access


If the user was using OWA, they may be prompted with this error during failover but OWA then recovers without needing to log out and back in:

image


DAG and Cluster Status


Let’s take a look at what happened behind the scenes running these below commands from our surviving server, LITEX02. Our cluster failed over as we can see that LITEX01 is marked as down:

Get-ClusterNode

image

....and all databases failed over to LITEX02:

Get-MailboxDatabaseCopyStatus

image

Windows Failover Cluster Log


We can also check the cluster log which we can generate using the command to get the last 20mins of logs and output it to C:\temp\ClusterLog:

Get-ClusterLog -TimeSpan 20 -Destination C:\temp\ClusterLog

image

See below lines from the cluster log which show that node 1, 172.16.0.21, 10.2.0.22 has been marked as unreachable/dead.

00000060.00000854::2015/12/28-21:58:04.358 INFO  [CHM] Received notification for two consecutive missed HBs to the remote endpoint 10.2.0.21:~3343~ from 10.2.0.22:~3343~
00000060.00000a14::2015/12/28-21:58:07.483 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 172.16.0.22:3343 remote address 172.16.0.21:3343
00000060.00000854::2015/12/28-21:58:07.483 INFO  [IM] got event: Remote endpoint 172.16.0.21:~3343~ unreachable from 172.16.0.22:~3343~
00000060.00000854::2015/12/28-21:58:07.483 INFO  [IM] Marking Route from 172.16.0.22:~3343~ to 172.16.0.21:~3343~ as down
00000060.00000854::2015/12/28-21:58:07.483 INFO  [NDP] Checking to see if all routes for route (virtual) local fe80::f17a:75b4:6c65:f533:~0~ to remote fe80::783b:e33e:84d9:8878:~0~ are down
00000060.00000854::2015/12/28-21:58:07.483 INFO  [NDP] All routes for route (virtual) local fe80::f17a:75b4:6c65:f533:~0~ to remote fe80::783b:e33e:84d9:8878:~0~ are down
00000060.00003738::2015/12/28-21:58:07.952 ERR   [NODE] Node 2: Connection to Node 1 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::783b:e33e:84d9:8878%17:~6057~ is closed'
00000060.000008d4::2015/12/28-21:58:07.952 INFO  [CORE] Node 2: executing node 1 failed handlers on a dedicated thread
00000060.000008d4::2015/12/28-21:58:07.952 INFO  [NODE] Node 2: Cleaning up connections for n1.


System Event Log


We can also take a look at the system event log on LITEX02 and we see event 1135 which states that LITEX01 was removed from the cluster:

“Cluster node 'litex01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges”

image

Application Event Log


Let’s move on an check the application event logs and just a few seconds later, we see event 2091 logged for each database that was mounted on LITEX02 stating that the attempts to copy the last logs failed and that the database will be mounted on LITEX02:

Database: MDB02\LITEX02
Mailbox Server: litex02.litwareinc.com
Database MDB02\LITEX02 will be mounted with the following loss information:
* The last log generated (known to the server) before the switchover or failover was: 750
* The last log successfully replicated to the passive copy was: 750
* AutoDatabaseMountDial is set to: GoodAvailability

Attempts to copy the last logs from the active database copy weren't successful. Error: The log copier was unable to communicate with server 'litex01.litwareinc.com'. The copy of database 'MDB02\LITEX02' is in a disconnected state. The communication error was: A timeout occurred while communicating with server 'litex01.litwareinc.com'. Error: "A connection could not be completed within 5 seconds." The copier will automatically retry after a short delay.

image

We also see event 2113 for each database where Exhcange attempts re-delivery of the emails from the transport dumpster:

Re-delivery of messages from the transport dumpster will be attempted for database MDB02. Messages originally delivered between 28/12/2015 21:52:07 (UTC) and 28/12/2015 22:03:23 (UTC) will be re-delivered.

image

Eventually event 3169 is logged for each mailbox database to inform the administrator that the database has now been successfully moved to LITEX02:

(Active Manager) Database MDB01 was successfully moved from litex01.litwareinc.com to litex02.litwareinc.com. Move comment: None specified.

image

Event 40018 is also logged by the information store to let us know that each database was mounted successfully:

Active database MDB01 (fa0c03ad-3b51-45fe-b779-b689a3936aed) was successfully mounted.      
Process id: 9576      
Mount time: 20.3938645 seconds      
Log generation: 0x3      
Previous state: 3    


image

Crimson Channel Event Logs


The next log we should check is the Crimson channel event logs. This provides the same events as we see in the application log but also more information. You can access the Crimson Channel Event Logs by opening up Event Viewer, clicking on Application and Services Logs > Exchange > HighAvailability.

For example, here is one of the earlier events where attempts to LITEX01 time out after 5 seconds:

A client-side attempt to connect to litex01.litwareinc.com using '10.2.0.21:64327' from '<Null>' failed: Microsoft.Exchange.Cluster.Replay.NetworkTimeoutException: A timeout occurred while communicating with server 'litex01.litwareinc.com'. Error: "A connection could not be completed within 5 seconds."
   at Microsoft.Exchange.Cluster.Replay.TcpClientChannel.TryOpenChannel(NetworkPath netPath, Int32 timeoutInMs, TcpClientChannel& channel, NetworkTransportException& networkEx)


image

CollectOverMetrics.ps1


The CollectOverMetrics.ps1 script which you can find in the Exchange Scripts directory is also quite useful. It allows us to output an HTML file with the failovers during a specified time. See below for how to generate all events since 28th December 2015 21:55, just a few minutes before the failover:

.\CollectOverMetrics.ps1 -StartTime "12/28/2015 21:55" -GenerateHtmlReport -ShowHtmlReport -DatabaseAvailabilityGroup DAG01

image

The reported errors are because LITEX01 is still offline and the script is attempting to collect data from it. Part of the HTML report is below where we can see that there was a failover due to a node being down.

image

I’ll now power LITEX01 back on, check the services are running and then redistribute the databases using the .\RedistributeActiveDatabases.ps1 script. You can find out more information about this script here.


Conclusion


That concludes the DAG member failover test and hopefully you can see a number logs and scripts that you can use to find out what is happening in your IP-less DAG setup.
In the next part, we’ll continue our DAG testing with a simulated REPLICATION network failure. See here for part 2.

1 comment: