Service Bus Relay-based services: slow startup and slow runtime performance

While working on a project using the WCF webHttpRelayBinding binding with SAS-based authentication over transport security, I found that my services were taking a very long time to spin up (30-60 seconds) and that runtime latency was higher than it should have been (I had already proven that the latency was not caused by my backend service). To give you an idea of what I was working with, the system.serviceModel element of my web.config file had contents similar to the below.

I didn’t observe such problems on my development VM (which I was running with pretty much no firewalls in front of it), but did observe them on my client’s UAT environment. This was in spite of following Microsoft’s guidelines, which state that outbound TCP ports 9350-9354 should be open on your firewall to enable Service Bus connectivity. I went through an exercise using the PortQuiz website to prove that these ports were indeed accessible from the UAT server, so the performance issues were puzzling.
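If you would rather script this kind of check than use a website, the following is a minimal sketch of an outbound port test in Python. The function name check_outbound_ports is my own invention for illustration, and the host you pass in (for example your own Service Bus namespace host name) is something you need to supply. Bear in mind that a successful TCP connect only proves that the port is reachable; it says nothing about which port the client library will actually choose to use.

```python
import socket


def check_outbound_ports(host, ports, timeout=3.0):
    """Return {port: True/False} for whether a TCP connection could be opened.

    Hypothetical helper for illustration; not part of any Service Bus tooling.
    """
    results = {}
    for port in ports:
        try:
            # create_connection completes the TCP handshake or raises
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures
            results[port] = False
    return results
```

Calling check_outbound_ports(host, range(9350, 9355)) with your own host name covers the documented port range, and the same function can be pointed at any other port you want to rule in or out.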

Next up, I spun up a Wireshark capture. To start with, I applied the below filter in Wireshark to get rid of the extra noise.

I then initialized some of my services (I shut them down forcibly and then spun them up again) to observe which ports were in use. I saw that the conversation with Service Bus was being initiated on port 9350 as expected, but that appeared to be the end of the story: I wasn’t seeing any traffic on ports 9351-9354. I then right-clicked one of the displayed records in Wireshark and chose “Conversation Filter -> IP”, which updates the filter so that it displays anything with a source or destination IP address matching those on the selected record.

This suddenly resulted in a whole lot more records being displayed and helped me get to the root of the issue. What I was observing was that after Service Bus made the initial connection on port 9350, it attempted to continue the conversation on port 5671 (AMQPS or AMQP over SSL) which hadn’t been opened on the firewall. This connection attempt was obviously failing, and the observed behavior was that some retries were attempted with fairly large gaps in between until Service Bus finally decided to fall back to port 443 (HTTPS) instead. Pay particular attention to the lines in the following screenshot with the numbers 1681, 2105, and 2905 in the first column.

This explained why my service was taking a long time to start up (Service Bus was failing to connect via AMQPS and going through retry cycles before falling back to HTTPS) and also why my runtime performance was lower than I expected (relaying over HTTPS carries more overhead than a direct TCP connection). However, this didn’t explain why Service Bus was attempting to use port 5671 rather than 9351-9354 as per the documentation.
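To make the port roles concrete, here is a small Python sketch that groups a list of destination ports (for instance, ones copied out of a capture) by the transport they correspond to in this post. The labels and function names are my own and purely illustrative; the port-to-transport mapping only reflects what I observed here, not an official specification.

```python
from collections import Counter


def classify_port(port):
    """Map a destination port to the transport observed in this post."""
    if 9350 <= port <= 9354:
        return "relay-tcp"  # documented Service Bus relay TCP range
    if port == 5671:
        return "amqps"      # AMQP over SSL
    if port == 443:
        return "https"      # fallback transport
    return "other"


def summarize_ports(dest_ports):
    """Count how many captured records fall under each transport."""
    return Counter(classify_port(p) for p in dest_ports)
```

A summary showing 9350 followed only by "amqps" and "https" entries, with nothing else from the 9350-9354 range, matches the pattern I saw on the misbehaving machine.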

Repeating the same test on my own VM showed that Service Bus was continuing the connection on ports 9351-9354 as expected… So why the difference? On the suggestion of my colleague Mahindra, I compared the assembly versions of the Microsoft.ServiceBus assembly across the two machines. You can do this by running “gacutil -l Microsoft.ServiceBus” in a Visual Studio command prompt, or by manually checking the GAC directory which is typically “C:\Windows\Microsoft.NET\assembly\GAC_MSIL” for .NET 4+ assemblies.
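If you need to run this comparison across several servers, the manual GAC check can be scripted. The sketch below is Python and assumes that the version folders under the assembly’s GAC directory are named in the form v4.0_<version>_<culture>_<public key token> (for example v4.0_2.6.0.0__ followed by the token); the helper names are hypothetical, and you should confirm the folder layout on your own machine before relying on it.

```python
import os
import re

# Assumed GAC folder naming: "v4.0_<version>_<culture>_<publicKeyToken>"
_GAC_DIR = re.compile(r"^v[\d.]+_(\d+\.\d+\.\d+\.\d+)_")


def parse_gac_version(dirname):
    """Extract the assembly version from a GAC folder name, or return None."""
    match = _GAC_DIR.match(dirname)
    return match.group(1) if match else None


def installed_versions(assembly_dir):
    """List the assembly versions found under one GAC assembly folder."""
    versions = [parse_gac_version(name) for name in os.listdir(assembly_dir)]
    # Sort numerically so that 2.10.0.0 would come after 2.6.0.0
    return sorted(
        (v for v in versions if v),
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )
```

On a typical machine you would pass in the Microsoft.ServiceBus subfolder of the GAC directory mentioned above, then compare the output between the two servers.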

Voila. I found that I was running version 2.1.0.0 on the machine that was behaving correctly, and version 2.6.0.0 on the machine that was misbehaving. It appears that the protocol choosing behavior for Service Bus relays has changed sometime in between these two assembly versions. I have not pinpointed exactly which version this change occurred in, and I don’t yet know whether this change was by design or accidental. Either way, Microsoft have not yet updated their documentation, which means that others will be as confused as I was.

So what are your choices?

1. You can downgrade to an older version of the assembly. Version 2.1.0.0 will definitely work for you. A slightly higher version below 2.6.0.0 might work too, but it will be up to you to establish which versions are acceptable, since I haven’t managed to do this. You’ll also need to update the webHttpRelayBinding binding registration in your machine.config files (or wherever you’ve chosen to register it if you’ve gone for a more granular approach) to point to the older assembly.

2. You can choose to stick with the latest version of the assembly and open up outbound TCP connections on port 5671 on your firewall.

I chose to stick with option #1 because I’m not sure at this stage whether this change in behavior is intentional or incidental, and also because my impression is that raw TCP over ports 9351-9354 would be faster than the AMQPS protocol. You will find that option #2 is also functional.

With the older version of the assembly in play, I could now see traffic on ports 9351-9354 as expected, my services were spinning up in less than a second, and latency was much more in line with my expectations.