Exchange Log


Symptom: The customer reported Jetstress failures with the message, "Fail – The test has 1.05381856535713 Average Database Page Fault Stalls/sec. This should be not higher than 1." The customer had recently purchased multiple servers to be used in an Exchange DAG, and these Jetstress failures were halting the project. What was unique about this deployment is that the customer was using all local SSD storage for the solution.

Analysis: I asked the customer to provide their Jetstress configuration XML file as well as their Jetstress results HTML file. As soon as I saw the results file I knew what the issue was, but only because I've had discussions with fellow Exchange MCMs, MVPs, and Microsoft employees who had encountered this same odd behavior in the past. In short, Jetstress was generating TOO many IOPS, as seen in the below output:

To the untrained eye this may not be anything special, but I had to do a double take the first time I saw this. Jetstress was generating over 25,000 IOPS on this system. Also impressive was the performance of the hardware, as there actually weren't any disk latency issues from an IO read/write perspective:

As you can see from the above screenshot, database and log read/write latency (msec) was still fairly low for the extremely high amount of IOPS being generated. Yet our issue was with Database Page Fault Stalls/Sec, which should always remain below 1 (shown below):
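To make the pass/fail criteria concrete, here's a minimal sketch in Python of the checks involved, assuming the commonly documented Jetstress thresholds (average database read latency under 20 msec, log write latency under 10 msec, and no more than 1 Database Page Fault Stall/sec); the function name and structure are mine, not Jetstress's:

```python
# Simplified sketch of the Jetstress pass/fail checks described above.
# Thresholds are approximations of Jetstress's documented criteria,
# not values pulled from its source.
def jetstress_passes(db_read_latency_ms, log_write_latency_ms, page_fault_stalls_per_sec):
    """Return (passed, reasons) for a simplified Jetstress result."""
    reasons = []
    if db_read_latency_ms >= 20:
        reasons.append("average database read latency >= 20 msec")
    if log_write_latency_ms >= 10:
        reasons.append("average log write latency >= 10 msec")
    if page_fault_stalls_per_sec > 1:
        reasons.append("average Database Page Fault Stalls/sec > 1")
    return (not reasons, reasons)

# The customer's run: disk latency was fine, yet stalls/sec was 1.05 -> fail.
passed, reasons = jetstress_passes(5.0, 2.0, 1.05381856535713)
```

This mirrors what the results file showed: every latency counter healthy, yet a single stalls counter over 1 fails the whole test.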

Background: Let's spend some time covering how Jetstress is meant to behave, as well as the proper strategy for effectively utilizing Jetstress. Your first step in working with Jetstress should be to download the Jetstress Field Guide, where most of this information is held.

The primary purpose of Jetstress is to ensure a storage solution can adequately deliver the amount of IOPS needed for a particular Exchange design BEFORE Exchange is installed on the hardware. You should use the Exchange Sizing Calculator to properly size the environment, using inputs such as the number of mailboxes, average messages sent/received per day, and average message size. Once completed, on the "Role Requirements" tab you will find a value called "Total Database Required IOPS" (per server), which tells you the number of IOPS each server must be able to deliver for your solution. The value for "Total Log Required IOPS" can be ignored, as log IO is sequential IO, which is very easy on the disk subsystem.

With the per server Database Required IOPS value in hand, your job now is to get Jetstress to generate at least that amount of IOPS while delivering passing latency values. This is where some customers get confused due to improper expectations. They may think that Jetstress should always pass no matter what parameters they configure for it. I can tell you that I can make any storage solution fail Jetstress if I crank the thread count high enough, so it actually takes a bit of “under-the-hood” understanding to use Jetstress effectively.

Jetstress generates IO based on a global thread count, with each thread meant to generate 30-60 IOPS. Simply put, if I run a Jetstress test with 2 threads, I would expect it to generate ~120 IOPS on well-performing hardware. Therefore, if my Exchange calculator stated I needed to achieve 1,200 IOPS on a server, I would begin by starting a quick 15-minute test with 20 threads (20 x 60 = 1,200). If that test generated at least 1,200 IOPS and passed, I would then run a 24-hour test with 20 threads to ensure there are no demons hiding in my hardware that only a long stress test can uncover. If that test passes, then I'm technically in the clear and can proceed with the next phase of the Exchange project, making sure to keep my calculator files, Jetstress configuration XML files, and Jetstress result HTML files in a safe location for potential future reference.
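The thread math above can be expressed as a quick planning helper. The 60 IOPS-per-thread figure is the best-case value from the guidance, and the function names are my own:

```python
import math

# Best-case per-thread throughput on well-performing hardware,
# per the Jetstress Field Guide's 30-60 IOPS-per-thread guidance.
IOPS_PER_THREAD_MAX = 60

def starting_thread_count(target_iops):
    """Threads to start a quick validation test with, assuming ~60 IOPS/thread."""
    return math.ceil(target_iops / IOPS_PER_THREAD_MAX)

def expected_iops(threads, per_thread=IOPS_PER_THREAD_MAX):
    """The ballpark IOPS a given thread count should generate."""
    return threads * per_thread

# Calculator says we need 1,200 IOPS per server -> begin with 20 threads.
threads = starting_thread_count(1200)  # 20
```

If the 15-minute run at that thread count hits the target and passes, the same thread count goes into the 24-hour stress run.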

You could spend an hour reading the Jetstress Field Guide, but what I've just covered is the short version: get it to pass with the amount of IOPS you need and you're in the clear. Of course, it can often be much more complex than that. You may need to tweak the thread count, update hard drive firmware, or correct a controller caching setting (see my post here for the correct settings) to achieve a pass. Auto-tuning actually makes this process a bit simpler, as it tries to determine the maximum number of threads (which, as stated, is directly proportional to generated IOPS) the system can handle. However, it can lead to some confusion, as people may focus too much on the maximum number of IOPS a system can deliver instead of the number of IOPS they actually need. While that's certainly valuable information for future planning or even hardware repurposing, you shouldn't stall your project trying to squeeze every last bit of IOPS out of the system if you're already easily hitting your IOPS target and passing the latency tests.

As someone who works for a hardware vendor, I'll often get pulled into an escalation where a customer has opened Jetstress, cranked the thread count up to 50, watched it fail, and is pointing the finger at the hardware. However, based on what we already discussed, a thread count of 50 means ~3,000 IOPS. This is fine if the storage purchased with the system can support it. A single 7.2K NL-SAS drive can achieve ~55 IOPS for Exchange database workloads, so if a customer has 20 single-disk RAID 0 drives in a system, the math tells us they can't expect to achieve much more than 1,100 IOPS (20 x 55 = 1,100). It gets a bit more complicated when using RAID, as you have to consider which disks are actually in play (for example, in a 10-disk RAID 10 you only get to factor in 5 disks in terms of performance, the other 5 being used for mirroring) as well as write penalties when using RAID 5 or RAID 6. The bottom line is that you need a realistic understanding of the amount of IOPS the hardware you purchased can actually achieve, as we all must adhere to the performance laws of rotational media.
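That spindle math can be sketched as follows; the per-disk figure and the RAID 10 halving are the rules of thumb from the paragraph above, not measured values:

```python
# Rule-of-thumb IOPS for a 7.2K NL-SAS spindle under an Exchange
# database workload (from the discussion above, not a benchmark).
IOPS_PER_NLSAS_DISK = 55

def raw_ceiling(disks, per_disk=IOPS_PER_NLSAS_DISK):
    """Best case: every spindle serving IO (e.g. single-disk RAID 0 volumes)."""
    return disks * per_disk

def raid10_ceiling(disks, per_disk=IOPS_PER_NLSAS_DISK):
    """RAID 10: half the spindles are mirror copies, so only half count."""
    return (disks // 2) * per_disk

# 20 single-disk RAID 0 drives -> ~1,100 IOPS. A 50-thread test
# (~3,000 IOPS) simply asks more of the disks than they can give.
ceiling = raw_ceiling(20)  # 1100
```

RAID 5 and RAID 6 write penalties would lower these ceilings further for write-heavy workloads, which is why the read/write mix matters when comparing against a thread-count target.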

Resolution: Coming back to the issue at hand, now that we understand how Jetstress is supposed to work, we can see why the results were troubling. Jetstress actually should NOT be generating that many IOPS when using 35 threads; 35 threads should generate ~2,100 IOPS, not 25,000 IOPS (35 threads x 60 expected IOPS per thread = 2,100). More IOPS is not a good thing here because, if nothing else, Jetstress is supposed to be predictable in terms of the amount of IO it generates. So why did this system generate so many IOPS, and why did it fail? The short answer is that Jetstress doesn't play well with SSD drives; it will always try to generate more IOPS per thread than expected on SSD storage. As I am not a Jetstress developer I can't explain why this occurs, but after several colleagues have also seen this issue I can at least confirm it happens and provide a workaround. In our customer's case, they manually specified 1 thread, which generated ~2,200 IOPS and passed without any Page Fault Stall errors. This was still far more IOPS than 1 thread should be generating, but it achieved their calculator IOPS requirements and allowed them to continue with their project. As for why the test was failing with Page Fault Stalls even though actual disk latency was fine, I can only speculate. As a Page Fault Stall is an Exchange-related operation (related to querying disk for a database page) and not a pure disk latency operation, I wonder whether Jetstress was ever designed to generate that many IOPS. I've never personally seen it run with more than a few thousand IOPS, so it's possible the Jetstress application itself couldn't handle it.

I hope this cleared up some confusion around how to effectively utilize Jetstress. Also, if you happen to come across this specific issue I’d be interested in hearing about it in the comments.

In a Lync and Exchange UM environment (version doesn’t particularly matter in this case), voicemail messages were not being delivered. The voicemail folder on Exchange (C:\Program Files\Microsoft\Exchange Server\V15\UnifiedMessaging\voicemail) was filling up with hundreds of .txt (header files) and .wav (voicemail audio files).

Resolution

This issue is not necessarily new (Reference 1, Reference 2), but it didn't immediately come up in search results. I also wanted to spend more time discussing why this issue happened and why it's important to understand receive connector scoping.

This issue was caused by incorrectly modifying a receive connector on Exchange. Specifically, a custom connector used for application relay was modified so instead of only the individual IP addresses needed for relay (EX: Printers/Copiers/Scanners/3rd Party Applications requiring relay), the entire IP subnet was included in the Remote IP Ranges scoping. This ultimately meant that instead of Lync/ExchangeUM using the default receive connectors (which have the required “Exchange Server Authentication” enabled), they instead were using the custom application relay connector (which did not have Exchange Server Authentication enabled).

This resulted in the voicemail messages sitting in the voicemail folder and errors (Event ID 1423/1446/1335) being thrown in the Application log. The errors will state that processing failed for the messages:

It’s also possible that the voicemail messages will eventually be deleted due to having failed processing too many times (EventID 1335):

The Microsoft Exchange Unified Messaging service on the Mailbox server encountered an error while trying to process the message with header file “C:\Program Files\Microsoft\Exchange Server\V15\UnifiedMessaging\voicemail\<string>.txt”. The message will be deleted and the “MSExchangeUMAvailability: % of Messages Successfully Processed Over the Last Hour” performance counter will be decreased. Error details: “Microsoft.Exchange.UM.UMCore.ReachMaxProcessedTimesException: This message has reached the maximum processed count, “6”.

Unfortunately, once you see the message above (EventID 1335), the message cannot be recovered. When UM states the message will be deleted, it will in fact be deleted with no chance of recovery. If the issue had been going on for several days and this folder were part of your daily backup sets, then you could technically restore the files and paste them into the current directory, where they would be processed. However, if you did not have a backup, then these voicemails would be permanently lost.

Note: Certain failed voicemail messages can be found in the “C:\Program Files\Microsoft\Exchange Server\V15\UnifiedMessaging\badvoicemail” directory. However, as our failure was a permanent failure related to Transport, they did not get moved to the badvoicemail directory and instead were permanently deleted.

Background

I wanted to further explain how this issue happened, and hopefully clear up confusion around receive connector scoping. In our scenario, someone left a voicemail for an Exchange UM-enabled mailbox which was received and processed by Exchange. The header and audio files for this voicemail message were temporarily stored in the “C:\Program Files\Microsoft\Exchange Server\V15\UnifiedMessaging\voicemail” directory on the Exchange UM server. Our scenario involved Exchange 2013, but the same general logic would apply to Exchange 2007/2010/2016. UM would normally submit these voicemail messages to transport using one of the default Receive Connectors which would have “Exchange Server Authentication” enabled. These messages would then be delivered to the destination mailbox.

Our failure was a result of the UM services being directed to a Receive Connector which did not have the necessary authentication enabled on it (the custom relay connector which only had Anonymous authentication enabled). Under normal circumstances, this issue would probably be detected within a few hours (as users began complaining of not receiving voicemails) but in our case the change was made before the holidays and was not detected until this week (another reason to avoid IT changes before a long holiday). This resulted in the permanent Event 1335 failure noted above and the loss of the voicemail. Since this failure occurs before reaching transport, Safety Net will not be any help.

So let's turn our focus to Receive Connector scoping, and specifically, defining the RemoteIPRange parameter. Remote IP Ranges define which incoming IP addresses a connector is responsible for handling. Depending on the local listening port, local listening IP address, and RemoteIPRange configuration of each Receive Connector, the Microsoft Exchange Frontend Transport Service and Microsoft Exchange Transport Service will route incoming connections to the correct Receive Connector. The chosen connector then handles the connection accordingly, based on the connector's configured authentication methods, permission groups, etc. A Receive Connector must have a unique combination of local listening port, local listening IP address, and Remote IP Address (RemoteIPRange) configuration. This means you can have multiple Receive Connectors with the same listening IP address and port (25, for instance) as long as each of their RemoteIPRange configurations is unique. You could also have the same RemoteIPRange configuration on multiple Receive Connectors if your port or listening IP are different; and so on.

The default Receive Connectors all have a default RemoteIPRange of 0.0.0.0-255.255.255.255 (all IPv4 addresses) and ::-ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff (all IPv6 addresses). The rule for processing RemoteIPRange configurations is that the most specific matching configuration is used. Say I have two Receive Connectors in the below configuration:

With this configuration, if an inbound connection on port 25 destined for 192.168.1.10 is created from 192.168.1.55, then ApplicationRelayConnector would be used and its settings would apply. If an inbound connection to 192.168.1.10:25 came from 192.168.1.200, then Default Receive Connector would instead be used.
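That most-specific-match behavior can be simulated with a short Python sketch. The connector names and ranges here mirror the example above, and the longest-prefix tiebreak is a simplification of what Transport actually does, not an Exchange API:

```python
import ipaddress

# Illustrative connectors (same local IP/port assumed for both):
# the default connector covers all IPv4, while the custom relay
# connector is scoped to a single host.
connectors = {
    "Default Receive Connector": ipaddress.ip_network("0.0.0.0/0"),
    "ApplicationRelayConnector": ipaddress.ip_network("192.168.1.55/32"),
}

def pick_connector(source_ip):
    """Among connectors whose range covers the source IP, the most specific wins."""
    ip = ipaddress.ip_address(source_ip)
    matches = [(name, net) for name, net in connectors.items() if ip in net]
    # Most specific = longest prefix (smallest range).
    return max(matches, key=lambda m: m[1].prefixlen)[0]
```

Widen the relay connector's range to 192.168.1.0/24, as in the customer's misconfiguration, and suddenly every host on the subnet, including Exchange and Lync themselves, lands on the relay connector instead of the default one.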

The below image was taken from the “Troubleshooting Transport” chapter of the Exchange Server Troubleshooting Companion, an eBook co-authored by Paul Cunningham and myself. It’s a great visual aid for understanding which Receive Connector will accept which connection from a given remote IP address. The chapter also contains great tips for troubleshooting connectors, mail flow, and Exchange in general.

So in my customer's specific scenario, instead of defining individual IP addresses on their custom application relay receive connector, they defined the entire internal IP subnet (192.168.1.0/24). This resulted in not only the internal devices needing to relay hitting the custom application relay connector, but the Exchange Server itself and the Lync server hitting it as well, thus breaking Exchange Server Authentication. As a best practice, you should always use individual IP addresses when configuring custom application relay connectors, so that you do not inadvertently break other Exchange communications. If this customer had multiple Exchange Servers, this change would have also broken Exchange server-to-server port 25 communications.

Overview

Over the span of the last 3 weeks, I've encountered five different customers experiencing this issue. Through my own lab testing and working with Microsoft Premier Support, we were able to diagnose the issue as being related to a recent Windows Update that was installed on the customers' Windows Server 2012 Domain Controllers that introduced authentication issues.

Symptoms

Outlook users in Exchange environments experience repeated authentication prompts when attempting to access their mailbox. OWA and ActiveSync were not affected by this "login prompt loop." The issue began presenting itself after Windows Updates were applied to the Domain Controllers in the environment. Another symptom experienced in some customer environments was authentication issues when accessing Terminal Services/Remote Desktop Services.

Troubleshooting

In testing, I actually found the authentication loop began when the Outlook client attempted to authenticate to the AutoDiscover service. From the client machine, I decided to test the authentication process by using Internet Explorer to browse to https://AutoDiscover.Contoso.com/AutoDiscover/AutoDiscover.xml. After initially authenticating, I was presented with repeated login prompts; the same symptoms seen in the Outlook clients. At one point I was curious to see if the issue was related to Windows/Kerberos authentication, so I decided to disable all authentication methods on the AutoDiscover virtual directory except for Basic. After doing this, I was able to successfully authenticate to AutoDiscover. Since this issue was affecting both the AutoDiscover and RPC virtual directories, it made me think it wasn't necessarily Exchange that was broken, but AD authentication itself.

We first looked at recent Windows Updates on the Exchange servers themselves, but saw nothing unusual. Even as a precautionary measure we removed the Windows Updates that were recently installed on the Exchange Servers the previous day, but the issue remained. We then noticed some recently installed updates on the Domain Controllers which were related to authentication. After removing all updates installed the previous day, the issue was resolved and the authentication loops were gone. We then spent time (via process of elimination) trying to determine which update was the culprit.

Solution

Steps to reproduce:

- Have a Windows Server 2012 (non-R2) Domain Controller
- Have KB3175024 installed but DO NOT have KB3177108 (released back in August) installed

Obviously one possible resolution in my case is to install KB3177108 (which immediately resolved my RDP and Exchange login issues). However, I was curious as to why KB3177108 was not installed on multiple customers’ environments. After working with Microsoft, it appears that it was either because there was an issue initially installing KB3177108, or some customers chose to not install it for possible incompatibility reasons.

Ultimately, the reason we encountered these issues is that KB3175024 builds on dependencies of KB3177108 but will install anyway (in error) if it is not present, resulting in the issue above. In short, KB3175024 makes changes that assume KB3177108 is present, yet it installs even when KB3177108 is missing, causing the authentication issues.
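Reduced to logic, the broken combination is easy to flag in a patch audit. This is a hypothetical helper of my own, not a Microsoft-provided check:

```python
# Flags the broken update combination described above: the dependent
# update (KB3175024) installed without its prerequisite (KB3177108).
def at_risk(installed_kbs):
    """True if a 2012 DC has KB3175024 without KB3177108."""
    kbs = set(installed_kbs)
    return "KB3175024" in kbs and "KB3177108" not in kbs
```

Feeding it the output of your update inventory (however you collect it) across all non-R2 2012 Domain Controllers would identify which servers need KB3177108 installed.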

The error encountered above was a result of the IIS Metabase still holding remnants of a past instance of the OWA Virtual Directory, which was preventing the New-OwaVirtualDirectory Cmdlet from successfully completing. It's important to understand that an Exchange Virtual Directory is really located in two places: Active Directory and IIS. When running the Get-OwaVirtualDirectory Cmdlet (or similar commands for other Virtual Directories), you're really querying Active Directory. For example, the OWA Virtual Directories for both the Default Web Site and Exchange Back End website in my lab are located in the following location in AD (via ADSIEDIT):

So if a vDir is missing in IIS but present in AD, you'll likely need to first remove it using the Remove-*VirtualDirectory Cmdlet; otherwise it will generate an error stating that it already exists. In my customer's scenario, I had to do this beforehand, as the OWA vDir was present in AD but missing in IIS.

This brought us to the state we were in at the beginning of this post; receiving the above error message. The OWA vDir was no longer present in AD nor in the Default Web Site, but when trying to recreate it using New-OwaVirtualDirectory we received the above error message.

Tip: Use Get-*VirtualDirectory with the –ShowMailboxVirtualDirectories parameter to view the Virtual Directories on both web sites. For example:

Get-OwaVirtualDirectory -Server <ServerName> -ShowMailboxVirtualDirectories

The solution was to install the IIS 6 Resource Kit and use Metabase Explorer to delete the ghosted vDir. When installing the Resource Kit, select Custom Install, uncheck all features except for Metabase Explorer 1.6, and proceed with the installation. Once it finishes, it may require you to add the .NET Framework 3.5 Feature.

When you open the tool on the Exchange Server in question, navigate to the below tree structure and delete the old OWA Virtual Directory by right-clicking it and selecting Delete. When completed, the OWA vDir should no longer be present (as seen below).

You should now be able to successfully execute the New-OwaVirtualDirectory Cmdlet. It’s always a bit nostalgic seeing a tool of days gone by still able to save the day. I’d like to thank my co-worker John Dixon for help with this post. When I can’t figure something out in Exchange/IIS (or anything really) he’s who I lean on for help.

I recently worked with a customer who had introduced an Exchange 2013 Server into an existing Exchange 2007 environment. The issue was that the 2013 Server was unable to send email anywhere, neither externally nor to other Exchange Servers. If you executed the below command to view the status of the transport queues, you received the below output:

This was the resolution in my scenario as well. To resolve the issue, I simply had to re-check the “Register this connection’s addresses in DNS” option on the IPv4> Properties>Advanced>DNS tab on the primary NIC used for Active Directory communications. While you can uncheck this box on secondary NICs (such as for iSCSI, Replication, Backup, etc.), it should always remain checked on the MAPI/Primary NIC. I’ve also seen issues where having this unchecked on a 2013/2016 DAG node will result in Managed Availability-triggered database failovers.

Note: This article should also apply when Exchange 2016 CU1 releases and includes Mailbox Anchoring (unless Microsoft makes a change to the behavior before its release). So the scenario of installing the first Exchange 2016 server using CU1 bits into an existing environment would also apply.

Summary

It was announced in Microsoft's recent blog post about Exchange Management Shell and Mailbox Anchoring that the way Exchange is managed will change going forward. Starting with Exchange 2013 CU11 (released 12/10/2015) and Exchange 2016 CU1 (soon to be released), an Exchange Management Shell session will be directed to the Exchange Server hosting the mailbox of the user attempting the connection. If the connecting user does not have a mailbox, an arbitration mailbox (specifically SystemMailbox{bb558c35-97f1-4cb9-8ff7-d53741dc928c}) will be used instead. In either case, if the mailbox is unavailable (because it's on a database that's dismounted or is on a legacy version of Exchange) then Exchange Management Shell will be inoperable.
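The anchoring behavior amounts to a simple fallback chain, sketched below; the field names are illustrative and this is not how Exchange implements it internally:

```python
# Sketch of the Mailbox Anchoring routing described above: EMS connects
# to the server hosting the admin's mailbox, falls back to the arbitration
# mailbox if the admin has none, and fails if that anchor mailbox is
# unreachable (dismounted database or legacy Exchange version).
def ems_target(admin_mailbox, arbitration_mailbox):
    """Return the server an EMS session would anchor to, or raise."""
    anchor = admin_mailbox or arbitration_mailbox
    if anchor is None or not anchor["available"]:
        raise ConnectionError("Exchange Management Shell is inoperable")
    return anchor["server"]
```

This is exactly why the CU11-first-server scenario below breaks: both candidate anchors live on the legacy version, so the chain ends in the failure branch.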

Issue

While it has always been recommended to move system and Arbitration mailboxes to the newest version of Exchange as soon as possible, there is a scenario involving Exchange 2013 CU11 which has led to customer issues:

Existing Exchange 2010 Environment

The first version of Exchange 2013 installed into the environment is CU11

Upon installation, the Exchange Admin is unable to use Exchange Management Shell on Exchange 2013, thus preventing the management of Exchange 2013 objects

The Exchange Admin may also be unable to access the Exchange Admin Center using traditional means

This is due to the new Mailbox Anchoring changes. If the Exchange Admin’s mailbox (or the Arbitration mailbox, if the Exchange Admin did not have a mailbox) was on Exchange 2013 then this issue would not exist. However, because this was the first Exchange 2013 server installed into the environment, and it was CU11, there was no way to prevent this behavior.

This issue was first reported by Exchange MVP Ed Crowley, and yesterday a customer of mine also encountered the issue. The symptoms were mostly the same but the ultimate resolution was fairly straightforward.

Possible Resolutions

Resolution#1:

Attempt to connect to Exchange Admin Center on 2013 using the “Ecp/?ExchClientVer=15” string at the end of the URL (Reference). For Example:

I’ve heard mixed results using this method. When Ed Crowley encountered this issue, this URL worked, yet when I worked with my customer I was still unable to access EAC by using this method. However, it is worth an attempt. Once you’re connected to EAC, you can use it to move your Exchange Admin mailbox to 2013. However, should you not have a mailbox for your Exchange Admin account, this method may fail because there’s currently no way to move Arbitration Mailboxes via the EAC. So it’s recommended to create a mailbox for your Exchange Admin account using the EAC and then you’ll be able to connect via EMS.

Resolution#2:

Note: Using this method has a low probability of success as Microsoft recommends using the newer version of Exchange to “pull” a mailbox from the older version. Based on feedback I’ve received from Microsoft Support, you may consider just skipping this step and going to Step 3.

Use Exchange 2010 to attempt to move the Exchange Admin mailbox to a database on Exchange 2013. Historically, it’s been recommended to always use the newest version of Exchange to perform a mailbox move. In my experience this is hit or miss depending on the version you’re moving from and the version you’re moving to. However, it’s worth attempting:

Issue the below command using Exchange 2010 Management Shell to move the Exchange Admin’s mailbox to the Exchange 2013 server:

New-MoveRequest <AdminMailbox> -TargetDatabase <2013Database>

If the Exchange Administrator does not have a mailbox, then move the Arbitration mailboxes to Exchange 2013 instead, for example:

Get-Mailbox -Arbitration | New-MoveRequest -TargetDatabase <2013Database>

In addition, there have been reports of 2013 EMS still having connectivity issues even after the relevant mailboxes have been moved. A different Windows user with appropriate Exchange permissions (using a different Windows profile) will work fine, however. It seems there are PowerShell cookies for the initial profile used which could still be causing problems. In this scenario, you may have to remove all listed cookies in the following registry key (Warning: edit the registry at your own risk; a backup of the registry is recommended before making modifications):

So unless Microsoft changes the behavior of Mailbox Anchoring, this is a precaution that should be taken when installing the first Exchange 2013 CU11/2016 CU1 (when released) server into an existing environment.

Edit: This forum post also describes the issue. In it, the user experiences odd behavior with the 2013 servers not being displayed when running Get-ExchangeServer, along with other oddities. This is similar to what I experienced in some lab testing. Ultimately, the same resolution applies.
