Thursday, 31 December 2015

Content Index State Failed in Exchange 2016

After a failover or similar event, you may find that a passive mailbox database copy has a ContentIndexState of Failed or FailedAndSuspended in your Exchange 2016, 2013 or 2010 Database Availability Group.

In our case, we have four mailbox database copies and one of them, MDB03, has a ContentIndexState of FailedAndSuspended:

Get-MailboxDatabaseCopyStatus | sort Name | ft -AutoSize

image

To resolve this, we need to use the Update-MailboxDatabaseCopy cmdlet and specify the -CatalogOnly switch so that only the content index is re-seeded rather than the entire database:

Update-MailboxDatabaseCopy MDB03\LITEX01 -CatalogOnly

image
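If several copies on a server have a failed content index, you could also reseed them all in one pass rather than one at a time. Here is a minimal sketch using the same cmdlets (the server name is just an example from this lab):

# Find copies with a failed content index and reseed only the catalog.
Get-MailboxDatabaseCopyStatus -Server LITEX01 |
    Where-Object { $_.ContentIndexState -in 'Failed','FailedAndSuspended' } |
    ForEach-Object { Update-MailboxDatabaseCopy $_.Name -CatalogOnly -Confirm:$false }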

After this completes, check your content index state:

Get-MailboxDatabaseCopyStatus | sort Name | ft -AutoSize

image

As we can see, all content indexes are healthy and our issue is resolved.

Wednesday, 30 December 2015

Exchange 2016 Database Availability Group Troubleshooting (Part 1)

Introduction


I’ve recently posted instructions on how to set up an Exchange 2016 DAG and its mailbox database copies, so this post is dedicated to testing our DAG in various failure scenarios while also demonstrating some of the DAG troubleshooting methods.

For more information on how to set up an Exchange 2016 DAG, see here. For more information on how to perform maintenance on your DAG, see here.

To go to other parts in this series:



Database Availability Group Testing


To make sure that our DAG is working correctly, and to get a better look at what failures a DAG can tolerate without administrator input, we will test the following failures:

  • DAG member failure
  • REPLICATION network failure
  • MAPI network failure
  • Failure of single Exchange server storage
  • Failure of single Exchange server database storage
  • Information store crash on a single Exchange server
  • OWA Virtual Directory failure
  • DAG member and file share witness failure



Database Availability Group Troubleshooting Tools


The key tools we will look at are below:
  • System Event Log
  • Application Event Log
  • Crimson Channel Event Log
  • Cluster Logs
  • CollectOverMetrics.ps1 script

Lab Environment


The lab follows on from what we set up earlier in parts 1 and 2. As a reminder, we have two Exchange 2016 servers running on Windows Server 2012 R2 configured in an IP-less DAG: LITEX01 and LITEX02:

image

As for the mailbox databases, we have four mailbox databases and each has a copy on LITEX01 and LITEX02:

  • MDB01
  • MDB02
  • MDB03
  • MDB04

The preferred mailbox database copy for MDB01 and MDB02 is on LITEX01, and for MDB03 and MDB04 it is on LITEX02. Here we can see the databases mounted on their preferred mailbox servers:

Get-MailboxDatabaseCopyStatus -Server LITEX01 | Sort Name

image

Get-MailboxDatabaseCopyStatus -Server LITEX02 | Sort Name

image

Our file share witness is on LITFS01 (a file server), and our quorum model is Node and File Share Majority, as expected for a two-node DAG:

Get-ClusterQuorum | fl *

image

In these tests, all client access services have been made highly available using DNS round robin. Instructions on how to set this up can be found here.
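If you want to quickly verify the round robin records from a client, Resolve-DnsName will show the A records returned for your client access namespace; mail.litwareinc.com below is just a placeholder for whatever namespace you use:

# Each DAG member's IP should come back as an A record for the namespace.
Resolve-DnsName mail.litwareinc.com -Type A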

OK, now that we have confirmed our DAG is in a healthy state, we can go ahead and test it (break it?) with our first test.


DAG Member Failure


To test this, I’ll force power off server LITEX01, which has the same effect as a power cut or other sudden total failure.


Outlook 2016


Outlook in online mode may become unresponsive and you may see the warning below.

image

After a few seconds, Outlook recovers and you can open your emails and continue working without issue. Whether you see this warning or not depends on how long it takes for your DAG to fail over and for the databases to be mounted again.


Outlook Web Access


If the user is using OWA, they may be prompted with this error during the failover, but OWA then recovers without them needing to log out and back in:

image


DAG and Cluster Status


Let’s take a look at what happened behind the scenes by running the commands below from our surviving server, LITEX02. Our cluster failed over, and we can see that LITEX01 is marked as down:

Get-ClusterNode

image

....and all databases failed over to LITEX02:

Get-MailboxDatabaseCopyStatus

image

Windows Failover Cluster Log


We can also check the cluster log, which we can generate using the command below to collect the last 20 minutes of logs and output them to C:\temp\ClusterLog:

Get-ClusterLog -TimeSpan 20 -Destination C:\temp\ClusterLog

image

The lines below from the cluster log show that node 1 (172.16.0.21 and 10.2.0.21) has been marked as unreachable/dead.

00000060.00000854::2015/12/28-21:58:04.358 INFO  [CHM] Received notification for two consecutive missed HBs to the remote endpoint 10.2.0.21:~3343~ from 10.2.0.22:~3343~
00000060.00000a14::2015/12/28-21:58:07.483 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 172.16.0.22:3343 remote address 172.16.0.21:3343
00000060.00000854::2015/12/28-21:58:07.483 INFO  [IM] got event: Remote endpoint 172.16.0.21:~3343~ unreachable from 172.16.0.22:~3343~
00000060.00000854::2015/12/28-21:58:07.483 INFO  [IM] Marking Route from 172.16.0.22:~3343~ to 172.16.0.21:~3343~ as down
00000060.00000854::2015/12/28-21:58:07.483 INFO  [NDP] Checking to see if all routes for route (virtual) local fe80::f17a:75b4:6c65:f533:~0~ to remote fe80::783b:e33e:84d9:8878:~0~ are down
00000060.00000854::2015/12/28-21:58:07.483 INFO  [NDP] All routes for route (virtual) local fe80::f17a:75b4:6c65:f533:~0~ to remote fe80::783b:e33e:84d9:8878:~0~ are down
00000060.00003738::2015/12/28-21:58:07.952 ERR   [NODE] Node 2: Connection to Node 1 is broken. Reason GracefulClose(1226)' because of 'channel to remote endpoint fe80::783b:e33e:84d9:8878%17:~6057~ is closed'
00000060.000008d4::2015/12/28-21:58:07.952 INFO  [CORE] Node 2: executing node 1 failed handlers on a dedicated thread
00000060.000008d4::2015/12/28-21:58:07.952 INFO  [NODE] Node 2: Cleaning up connections for n1.
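If you don’t fancy scrolling through the whole file, Select-String can pull the interesting lines out of the generated log (using the same output path as the Get-ClusterLog command above):

# Pull heartbeat and route failures out of the cluster log.
Select-String -Path 'C:\temp\ClusterLog\*.log' -Pattern 'missed HBs','unreachable','Marking Route'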


System Event Log


We can also take a look at the System event log on LITEX02, where we see event 1135, which states that LITEX01 was removed from the cluster:

“Cluster node 'litex01' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges”

image
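If you prefer PowerShell to Event Viewer, a quick way to pull this event out of the System log is with Get-WinEvent (event ID as above):

# Event 1135 is logged by failover clustering when a node drops out of cluster membership.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 1135 } -MaxEvents 5 | Format-List TimeCreated,Message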

Application Event Log


Let’s move on and check the application event log. Just a few seconds later, we see event 2091 logged for each database that was mounted on LITEX02, stating that the attempts to copy the last logs failed and that the database will be mounted on LITEX02:

Database: MDB02\LITEX02
Mailbox Server: litex02.litwareinc.com
Database MDB02\LITEX02 will be mounted with the following loss information:
* The last log generated (known to the server) before the switchover or failover was: 750
* The last log successfully replicated to the passive copy was: 750
* AutoDatabaseMountDial is set to: GoodAvailability

Attempts to copy the last logs from the active database copy weren't successful. Error: The log copier was unable to communicate with server 'litex01.litwareinc.com'. The copy of database 'MDB02\LITEX02' is in a disconnected state. The communication error was: A timeout occurred while communicating with server 'litex01.litwareinc.com'. Error: "A connection could not be completed within 5 seconds." The copier will automatically retry after a short delay.

image

We also see event 2113 for each database, where Exchange attempts re-delivery of the emails from the transport dumpster:

Re-delivery of messages from the transport dumpster will be attempted for database MDB02. Messages originally delivered between 28/12/2015 21:52:07 (UTC) and 28/12/2015 22:03:23 (UTC) will be re-delivered.

image

Eventually event 3169 is logged for each mailbox database to inform the administrator that the database has now been successfully moved to LITEX02:

(Active Manager) Database MDB01 was successfully moved from litex01.litwareinc.com to litex02.litwareinc.com. Move comment: None specified.

image

Event 40018 is also logged by the information store to let us know that each database was mounted successfully:

Active database MDB01 (fa0c03ad-3b51-45fe-b779-b689a3936aed) was successfully mounted.      
Process id: 9576      
Mount time: 20.3938645 seconds      
Log generation: 0x3      
Previous state: 3    


image
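To see these application log events together in one place, you can also filter the Application log by the event IDs referenced above; a minimal sketch:

# 2091, 2113 and 3169 are logged by the replication service, 40018 by the information store.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 2091,2113,3169,40018 } |
    Sort-Object TimeCreated | Format-Table TimeCreated,Id,ProviderName -AutoSize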

Crimson Channel Event Logs


The next logs we should check are the Crimson Channel event logs. These provide the same events that we see in the application log but with additional detail. You can access the Crimson Channel event logs by opening Event Viewer and navigating to Applications and Services Logs > Microsoft > Exchange > HighAvailability.
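You can also query the same channel from PowerShell. The sketch below assumes the default HighAvailability operational channel name; the first line lists the Exchange high availability channels if yours differ:

# List the HighAvailability crimson channels, then read the most recent operational events.
Get-WinEvent -ListLog '*Exchange*HighAvailability*' | Format-Table LogName,RecordCount
Get-WinEvent -LogName 'Microsoft-Exchange-HighAvailability/Operational' -MaxEvents 20 | Format-List TimeCreated,Id,Message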

For example, here is one of the earlier events, where attempts to connect to LITEX01 time out after 5 seconds:

A client-side attempt to connect to litex01.litwareinc.com using '10.2.0.21:64327' from '<Null>' failed: Microsoft.Exchange.Cluster.Replay.NetworkTimeoutException: A timeout occurred while communicating with server 'litex01.litwareinc.com'. Error: "A connection could not be completed within 5 seconds."
   at Microsoft.Exchange.Cluster.Replay.TcpClientChannel.TryOpenChannel(NetworkPath netPath, Int32 timeoutInMs, TcpClientChannel& channel, NetworkTransportException& networkEx)


image

CollectOverMetrics.ps1


The CollectOverMetrics.ps1 script, which you can find in the Exchange scripts directory, is also quite useful. It allows us to output an HTML report of the failovers during a specified time period. See below for how to generate a report covering all events since 28 December 2015 21:55, just a few minutes before the failover:

.\CollectOverMetrics.ps1 -StartTime "12/28/2015 21:55" -GenerateHtmlReport -ShowHtmlReport -DatabaseAvailabilityGroup DAG01

image

The reported errors are because LITEX01 is still offline and the script is attempting to collect data from it. Part of the HTML report is below where we can see that there was a failover due to a node being down.

image

I’ll now power LITEX01 back on, check the services are running and then redistribute the databases using the .\RedistributeActiveDatabases.ps1 script. You can find out more information about this script here.


Conclusion


That concludes the DAG member failure test, and hopefully you can now see a number of logs and scripts that you can use to find out what is happening in your IP-less DAG setup.
In the next part, we’ll continue our DAG testing with a simulated REPLICATION network failure. See here for part 2.

Tuesday, 29 December 2015

Exchange 2016 Database Availability Group Maintenance

Introduction


In this post, I’ll demonstrate how to do maintenance on a two node single site Exchange 2016 Database Availability Group.


For more information on Exchange 2016 Database Availability Groups, see here.




Lab Setup


In this lab, I have two Exchange 2016 servers in a DAG with mailbox databases replicated between them for high availability. The Exchange servers are:

  • LITEX01
  • LITEX02


We have four mailbox databases:
  • MDB01
  • MDB02
  • MDB03
  • MDB04

There’s a copy of each of these mailbox databases on LITEX01 and LITEX02.




Put a mailbox server into maintenance mode



We’ll start with putting LITEX01 into maintenance mode so we can install Exchange updates, Windows Updates, hardware maintenance etc.


In Exchange 2016, the Mailbox server role includes all the Exchange services, both client access (CAS) and mailbox (MBX). If your Exchange 2016 server is providing client access services for your clients, you should first remove it from the load-balanced array. How you do this will depend on how you have configured load balancing.


Also note that your incoming and outgoing external messages need to be able to route through either server, so that putting one into maintenance mode doesn’t stop external message delivery. This will depend on how your message routing is configured.
Other than the client access services, the server will be performing the functions below:


  • Message delivery
  • Unified Messaging (Call routing)
  • Cluster services (Primary Active Manager)
  • Mailbox service (either active or passive mailbox databases)

Message Delivery


The HubTransport component on LITEX01 needs to be drained. To do this, we put the HubTransport component into a draining state, restart the transport service, and then redirect messages that are pending delivery to LITEX02. Log into LITEX01 and run these commands from an Exchange Management Shell running as administrator:


Set-ServerComponentState LITEX01 -Component HubTransport -State Draining -Requester Maintenance

Restart-Service MSExchangeTransport
Redirect-Message -Server LITEX01 -Target LITEX02.litwareinc.com


Press y when prompted


image
The server should no longer be involved in message transport. We can confirm this by checking that the HubTransport component on LITEX01 is draining:

Get-ServerComponentState LITEX01 -Component HubTransport


image
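Before moving on, it is also worth confirming that the queues on LITEX01 have emptied. Here is a quick check with Get-Queue; any remaining entries should only be shadow or poison queues:

# Delivery queues on the draining server should be empty or emptying.
Get-Queue -Server LITEX01 | Where-Object { $_.MessageCount -gt 0 }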


Unified Messaging


You may or may not be using the server for Unified Messaging, but if you are, just run this command to stop the server from handling new calls. Calls will be drained, which means that ongoing calls are allowed to complete:


Set-ServerComponentState LITEX01 -Component UMCallRouter -State Draining -Requester Maintenance


image
Confirm that the UMCallRouter component is draining (maintenance mode):

Get-ServerComponentState LITEX01 -Component UMCallRouter

image


Cluster Services


If you’re wondering what the Primary Active Manager (PAM) is, it’s the DAG member that owns the cluster quorum resource and reacts to server failures. Although a failure of the server holding the PAM causes it to fail over to the Standby Active Manager (SAM), it’s best to move it over gracefully. To do this, we move the cluster group and then pause the cluster node, LITEX01. This not only moves the PAM from LITEX01 to LITEX02, it also prevents LITEX01 from owning the role until the cluster node is resumed.


First, let’s confirm where our PAM is located:


Get-DatabaseAvailabilityGroup -Status | fl Name,PrimaryActiveManager


image
Here we see that it’s currently on LITEX01 which means we need to move it (yes, more work, excellent!).

Right, let’s move it to LITEX02 by running this command:


Move-ClusterGroup "Cluster Group" -Node LITEX02


image

We also need to prevent LITEX01 from becoming the PAM by pausing the cluster node. You need to run this command from an elevated PowerShell window:

Suspend-ClusterNode LITEX01


image
We’ll just confirm this has in fact worked:
Get-ClusterNode

image
Get-DatabaseAvailabilityGroup -Status | fl Name,PrimaryActiveManager

image
Ok, the PAM has been moved just fine and the cluster node LITEX01 is paused. We can move on to the next step.



Mailbox service


We need to move any active mailbox databases off LITEX01. They should fail over when we shut down the server or when the services stop, but we’ll move them off manually, which is the recommended approach.

Let’s just see what databases are mounted on LITEX01 before we start this step:


Get-MailboxDatabaseCopyStatus -Server LITEX01


image
Ok, we can see mailbox databases MDB01 and MDB02 are mounted on LITEX01. To move these to LITEX02, we use this command:

Get-MailboxDatabaseCopyStatus -Server LITEX01 | ? {$_.Status -eq "Mounted"} | % {Move-ActiveMailboxDatabase $_.DatabaseName -ActivateOnServer LITEX02 -Confirm:$false}
image
We can now confirm our databases have been moved to LITEX02:

Get-MailboxDatabaseCopyStatus -Server LITEX02


image

All our mailbox databases are mounted on LITEX02.

The next step is to prevent LITEX01 automatically mounting the databases in case of a problem with LITEX02. To do this, we set the DatabaseCopyAutoActivationPolicy property to blocked on LITEX01:


Set-MailboxServer LITEX01 -DatabaseCopyAutoActivationPolicy Blocked


image

We can confirm that this was done by running this command:

Get-MailboxServer LITEX01 | ft Name,DatabaseCopyAutoActivationPolicy


image

Our mailbox service on LITEX01 is now in maintenance mode.

We then put the server itself into maintenance mode:


Set-ServerComponentState LITEX01 -Component ServerWideOffline -State Inactive -Requester Maintenance


image


We can confirm that LITEX01 is now inactive by running the command below:

Get-ServerComponentState LITEX01 -Component ServerWideOffline


image

Congratulations! Your server is now in maintenance mode and we can now do the required work on it.
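If you do this regularly, you could wrap the steps above into a small helper function. The sketch below is just that, a sketch: it assumes an Exchange Management Shell with the FailoverClusters module available, and the function and parameter names are my own invention, so adjust it for your environment before relying on it.

function Start-DagMemberMaintenance {
    param(
        [Parameter(Mandatory)][string]$Server,       # e.g. LITEX01
        [Parameter(Mandatory)][string]$PartnerFqdn   # e.g. litex02.litwareinc.com
    )
    $partner = ($PartnerFqdn -split '\.')[0]

    # Drain transport and redirect pending messages to the partner
    Set-ServerComponentState $Server -Component HubTransport -State Draining -Requester Maintenance
    Invoke-Command -ComputerName $Server { Restart-Service MSExchangeTransport }
    Redirect-Message -Server $Server -Target $PartnerFqdn -Confirm:$false

    # Drain UM call routing (harmless if UM isn't in use)
    Set-ServerComponentState $Server -Component UMCallRouter -State Draining -Requester Maintenance

    # Move the PAM to the partner and pause the cluster node
    Move-ClusterGroup "Cluster Group" -Node $partner | Out-Null
    Suspend-ClusterNode $Server | Out-Null

    # Move any mounted databases to the partner and block auto activation
    Get-MailboxDatabaseCopyStatus -Server $Server | Where-Object { $_.Status -eq 'Mounted' } |
        ForEach-Object { Move-ActiveMailboxDatabase $_.DatabaseName -ActivateOnServer $partner -Confirm:$false }
    Set-MailboxServer $Server -DatabaseCopyAutoActivationPolicy Blocked

    # Finally, take the whole server offline for maintenance
    Set-ServerComponentState $Server -Component ServerWideOffline -State Inactive -Requester Maintenance
}

For example: Start-DagMemberMaintenance -Server LITEX01 -PartnerFqdn litex02.litwareinc.com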



Take a mailbox server out of maintenance mode



When we’re done with our maintenance, we can take LITEX01 out of maintenance mode. We’ll reverse the changes we’ve made to put the server into maintenance mode.



Set the mailbox server as active


Set-ServerComponentState LITEX01 -Component ServerWideOffline -State Active -Requester Maintenance

image

Confirm this has worked:

Get-ServerComponentState LITEX01 -Component ServerWideOffline


image



Set the Unified Messaging component to active


Set-ServerComponentState LITEX01 -Component UMCallRouter -State Active -Requester Maintenance

image
Confirm this has worked:
Get-ServerComponentState LITEX01 -Component UMCallRouter

image


Resume the cluster node


Run this command from an elevated PowerShell window:
Resume-ClusterNode LITEX01

image

Confirm the node is now up in the cluster:
Get-ClusterNode

image


Set the mailbox server DatabaseCopyAutoActivationPolicy


Here we set the DatabaseCopyAutoActivationPolicy property to Unrestricted to allow LITEX01 to mount databases automatically if needed:

Set-MailboxServer LITEX01 -DatabaseCopyAutoActivationPolicy Unrestricted


image

We can confirm this has worked by running this command:
Get-MailboxServer LITEX01 | ft Name,DatabaseCopyAutoActivationPolicy

image


Set the HubTransport component to active


Set-ServerComponentState LITEX01 -Component HubTransport -State Active -Requester Maintenance
Restart-Service MSExchangeTransport

image


Confirm that the HubTransport component is active:

Get-ServerComponentState LITEX01 -Component HubTransport

image


Confirm that our server is not in maintenance mode


To confirm that our server is no longer in maintenance mode, we can run the command below to check that all required components are active:
Get-ServerComponentState LITEX01 | ft Component,State -AutoSize

image
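Rather than scanning the whole list, you can also just ask for anything that isn’t active; after maintenance this should return little or nothing (on-premises servers typically show ForwardSyncDaemon and ProvisioningRps as inactive by default, which is normal):

Get-ServerComponentState LITEX01 | Where-Object { $_.State -ne 'Active' }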

Optional tasks


Optionally, you can re-balance your mailbox databases, since after these steps all of them are mounted on LITEX02. Instructions on how to do this are here.
You can now repeat the above tasks to do maintenance on LITEX02.


Conclusion


In this post, I’ve done a run-through of how you can perform maintenance on your DAG members without downtime. 

Monday, 28 December 2015

Exchange 2016 - Balance Active Mailbox Databases

Introduction


After a failover, your mailbox databases may no longer be mounted on their preferred servers. If you have a large number of mailbox servers and databases, it can take a while to work out which server each database should be mounted on and then move the active copies. This is where the RedistributeActiveDatabases.ps1 script comes in handy.
For more information on Exchange 2016 Database Availability Group setup, see here.


Lab setup


In this lab, I have two Exchange 2016 mailbox servers configured in a Database Availability Group:

  • LITEX01
  • LITEX02

There are four mailbox databases:

  • MDB01
  • MDB02
  • MDB03
  • MDB04

Each mailbox database has a copy on LITEX01 and LITEX02.


Activation Preference


Each mailbox database has an ActivationPreference property. This lists all the mailbox servers that host a copy of the database, along with a number for each server. Servers with lower numbers are preferred when selecting which server to mount the database on in the event of a failover.

This is handy when you have a multi-site DAG (with a production and a disaster recovery site) and you’d prefer your mailbox databases to fail over to a mailbox server in the production site rather than to the DR site, where users would need to connect across the WAN.

To find out the ActivationPreference property for our mailbox databases, we run the command:

Get-MailboxDatabase | sort Name | fl Name,ActivationPreference

image

From the ActivationPreference property, we see that the preferred mailbox database copy for MDB01 and MDB02 is on LITEX01 and the preferred mailbox database copy for MDB03 and MDB04 is on LITEX02.
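If you ever need to change a preference, for example after adding a new database copy, it can be set per copy with Set-MailboxDatabaseCopy; the value below is just an example:

# Make the LITEX01 copy of MDB01 the most preferred (activation preference 1).
Set-MailboxDatabaseCopy MDB01\LITEX01 -ActivationPreference 1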


Mailbox database copy status


To find out where our mailbox databases are mounted, we can use the Get-MailboxDatabaseCopyStatus cmdlet against each of our mailbox servers:

Get-MailboxDatabaseCopyStatus -Server litex01

image

Get-MailboxDatabaseCopyStatus -Server litex02

image

Here we can see that all the mailbox databases are mounted on LITEX02 which, in my case, was the result of a mailbox server failover.


Re-distribute mailbox databases


The RedistributeActiveDatabases.ps1 script allows us to re-balance the mailbox servers by redistributing the active mailbox databases. You can balance databases in two different ways:

  • BalanceDbsByActivationPreference - this moves the databases to the server that is marked as the most preferred copy (based on ActivationPreference)
  • BalanceDbsBySiteAndActivationPreference - this balances databases to their most preferred copy but also tries to balance the databases between the AD sites (useful for multi-site DAGs where both sites have active users)

In our case, we only have two servers in one AD site so we’ll use BalanceDbsByActivationPreference. The script is located in “C:\Program Files\Microsoft\Exchange Server\V15\scripts” by default so we first need to change directory. We can use this neat shortcut to do this:

cd $exscripts

image

Now we’ll run our script to balance the databases according to ActivationPreference and then output the results using the -ShowFinalDatabaseDistribution switch:

.\RedistributeActiveDatabases.ps1 -BalanceDbsByActivationPreference -ShowFinalDatabaseDistribution

image

In the first part of the output, under the heading “Starting Server Distribution”, we can see that the script has worked out that we have 4 mailbox databases and that they’re all mounted on LITEX02.

Under the heading “Starting Database Moves”, the script then moves MDB01 and MDB02 to LITEX01 as it has worked out that LITEX02 has an ActivationPreference (AP) of 2 and LITEX01 has an AP of 1 for these mailbox databases. We’re prompted to confirm each move unless we use the -Confirm:$false parameter.

Once the databases are moved, we can see an output stating that there are two active and two passive databases on each mailbox server.

image

In the next part of the output, we can see a summary of the moves, listing for each database the server and activation preference (AP) of the copy that was active at the start and at the end of the script. We can see that MDB01 and MDB02 were activated on LITEX01, and MDB03 and MDB04 were not moved.
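If you’d rather not confirm each move, for example when rebalancing from a scheduled task, the prompts can be suppressed with the -Confirm:$false parameter mentioned above:

.\RedistributeActiveDatabases.ps1 -BalanceDbsByActivationPreference -Confirm:$false -ShowFinalDatabaseDistribution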


Conclusion


In this post, I've demonstrated how to balance your mailbox databases among your Exchange 2016 mailbox servers.