Avamar Exchange 2013 DAG backup fails – Waiting for VSS

Today I have been troubleshooting Avamar backups of an Exchange 2013 DAG (Database Availability Group) setup. The Avamar job would start, but no data was backed up and the job failed within 15 minutes. VSS was initially suspected to be the culprit, but additional troubleshooting revealed some strange behavior in the Avamar client: the individual plug-ins kept waiting for a status message that never arrived.

The setup is simple: three Windows Server 2008 R2 SP1 servers, Exchange 2013 Cumulative Update 5 with a Database Availability Group, and Avamar 7.0.2 as the backup software and target. The error that pops up looks like this:

During the backup, the Avamar server connects to the DAG, lists the databases and sends workorders to the individual clients (SENT_SUBWORKORDER). You can see that only EXCHANGE01 returns a SUBWORKORDER_STARTED; EXCHANGE03 doesn’t respond. After a while Avamar notices that EXCHANGE03 isn’t responding and sends it a cancel request; EXCHANGE01 is cancelled a while later. At this point the VSS writer on EXCHANGE01 shows a Stable state with a Retryable error, while the EXCHANGE03 VSS writer doesn’t show an error at all.

You might be tempted to troubleshoot the server with the VSS error, except there isn’t really an error on that server! See the drilldown Avamar log of EXCHANGE01:

We were a bit puzzled by the IP addresses displayed in the initial Avamar log: IPv6 is disabled on the servers, and the APIPA addresses didn’t make any sense either. So while I confirmed some settings, my colleague Daniel Hass was googling these errors and ended up on a very helpful blogpost from Dan Anstis. We ran those suggestions by EMC Support, who confirmed we had to adjust the timeout values. We ended up with the following avexvss.cmd file in the “shared /var” location created during the initial DAG client configuration.
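For reference, an Avamar .cmd flag file is just one --flag=value entry per line. A minimal sketch of the two timeout switches involved here (the values below are illustrative assumptions, not the exact numbers from our file; assuming the timeouts are in seconds, pick something comfortably longer than your largest database needs to snapshot and report back):

```
--vss-snapshot-timeout=3600
--subworkorder-timeout=3600
```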

We also had to create identical avexvss.cmd files in the “local /var” folder (e.g. C:\Program Files\avs\var\) that contained only the --vss-snapshot-timeout and --subworkorder-timeout switches. That seems redundant to me, but I didn’t have time to wait for the backup to complete and then try again, so give it a whirl without them if you want to try the leanest possible solution…

The above config file fixed the issue: the Avamar client no longer tried to use non-existent or loopback IP addresses, and the newly configured timeouts gave the subworkorders enough time to start and report back. VSS got to work, wasn’t shot down by the parent Avamar process, and soon after the consistency checks started running, followed by the actual backup.

Let me know if you run into the same issue and if this solves it for you!