Network planning with ExpressRoute for Office 365

ExpressRoute for Office 365 provides layer 3 connectivity between your network and Microsoft’s datacenters. The circuits use Border Gateway Protocol (BGP) route advertisements of Office 365’s front-end servers. From the perspective of your on-premises devices, when they need to select the correct TCP/IP path to Office 365, Azure ExpressRoute is seen as an alternative to the Internet.

Azure ExpressRoute adds a direct path to a specific set of supported features and services that are offered by Office 365 servers within Microsoft’s datacenters. Azure ExpressRoute doesn’t replace Internet connectivity to Microsoft datacenters or basic Internet services such as domain name resolution. Azure ExpressRoute and your Internet circuits should be secured and redundant.
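To illustrate how an on-premises device might choose between the two paths, here is a minimal sketch of longest-prefix route selection. It assumes a routing table in which more specific prefixes have been learned over ExpressRoute via BGP and a default route points to the Internet. The prefixes are documentation-range placeholders, not real Office 365 ranges, and the names ROUTES and select_path are hypothetical.

```python
import ipaddress

# Hypothetical routing table. The /24 prefixes are documentation ranges that
# stand in for routes learned over ExpressRoute via BGP; 0.0.0.0/0 is the
# default route toward the Internet.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "internet"),
    (ipaddress.ip_network("203.0.113.0/24"), "expressroute"),
    (ipaddress.ip_network("198.51.100.0/24"), "expressroute"),
]


def select_path(destination: str) -> str:
    """Return the path label of the most specific route that matches."""
    address = ipaddress.ip_address(destination)
    matches = [(net, path) for net, path in ROUTES if address in net]
    # Longest prefix wins: pick the matching route with the largest prefix length.
    _, best_path = max(matches, key=lambda item: item[0].prefixlen)
    return best_path
```

In this sketch, a destination that falls outside the advertised prefixes still matches the default route, which mirrors the point that ExpressRoute adds a path rather than replacing Internet connectivity.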

The following table highlights a few differences between the internet and Azure ExpressRoute connections in the context of Office 365.

Differences in network planning | Internet network connection | ExpressRoute network connection
Access to required internet services, including DNS name resolution, certificate revocation verification, and content delivery networks | Yes | Requests to Microsoft-owned DNS and/or CDN infrastructure may use the ExpressRoute network.

Existing Azure ExpressRoute customers

If you’re using an existing Azure ExpressRoute circuit and would like to add Office 365 connectivity over this circuit, you should look at the number of circuits, egress locations, and size of the circuits to ensure they'll meet the needs of your Office 365 usage. Most customers require additional bandwidth and many require additional circuits.

The Azure ExpressRoute subscription is customer-centric, meaning subscriptions are tied to customers. As a customer, you can have multiple Azure ExpressRoute circuits and can access many Microsoft cloud resources over those circuits. For example, you can choose to access an Azure-hosted virtual machine, an Office 365 test tenant, and an Office 365 production tenant over a pair of redundant Azure ExpressRoute circuits.

This table outlines the two types of peering relationships you can choose to implement over your circuits.

Peering relationship | Azure Private | Microsoft
Services | IaaS: Azure Virtual Machines | PaaS: Azure public services; SaaS: Office 365; SaaS: Dynamics 365
Connection initiation | Customer-to-Microsoft; Microsoft-to-Customer | Customer-to-Microsoft; Microsoft-to-Customer
QoS support | No QoS | QoS (1)

(1) QoS supports Skype for Business only at this time.

Bandwidth planning for Azure ExpressRoute

Every Office 365 customer has unique bandwidth needs depending on the number of people at each location, how active they are with each Office 365 application, and other factors such as the use of on-premises or hybrid equipment and network security configurations.

Having too little bandwidth will result in congestion, retransmissions of data, and unpredictable delays. Having too much bandwidth will result in unnecessary cost. On an existing network, bandwidth is often referred to in terms of the amount of available headroom on the circuit as a percentage. Having 10% headroom will likely result in congestion and having 80% headroom generally means unnecessary cost. Typical headroom target allocations are 20% to 50%.
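As a rough illustration of the headroom arithmetic, the following sketch computes headroom as the unused share of a circuit at peak and compares it with the 20% to 50% target range mentioned above. The function names and the idea of measuring against peak usage are assumptions made for this example, not a prescribed method.

```python
def headroom_percent(capacity_mbps: float, peak_usage_mbps: float) -> float:
    """Unused share of the circuit at peak usage, as a percentage."""
    if capacity_mbps <= 0:
        raise ValueError("capacity_mbps must be positive")
    return 100.0 * (capacity_mbps - peak_usage_mbps) / capacity_mbps


def classify_headroom(percent: float, low: float = 20.0, high: float = 50.0) -> str:
    """Compare a headroom percentage with the target range (20% to 50% by default)."""
    if percent < low:
        return "below target: congestion risk"
    if percent > high:
        return "above target: possible unnecessary cost"
    return "within target"
```

For example, a 1,000 Mbps circuit that peaks at 900 Mbps has 10% headroom, which this sketch flags as a congestion risk.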

To find the right level of bandwidth, the best mechanism is to test your existing network consumption. This is the only way to get a true measure of usage and need, because every network configuration and set of applications is in some ways unique. When measuring, pay close attention to total bandwidth consumption, latency, and TCP congestion to understand your network needs.

Once you have an estimated baseline that includes all network applications, pilot Office 365 with a small group that comprises the different profiles of people in your organization to determine actual usage, and use the two measurements to estimate the amount of bandwidth you’ll require for each office location. If there are any latency or TCP congestion issues found in your testing, you may need to move the egress closer to the people using Office 365 or remove intensive network scanning such as SSL decryption/inspection.
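One way to combine the two measurements is sketched below. It assumes that the pilot figure is the additional Office 365 traffic measured on top of the baseline, that usage scales linearly with the number of people (a simplification that real traffic may not follow), and that you size the circuit to keep a chosen headroom. All parameter names are hypothetical.

```python
def estimate_location_bandwidth(
    baseline_peak_mbps: float,
    pilot_peak_mbps: float,
    pilot_users: int,
    location_users: int,
    target_headroom: float = 0.3,
) -> float:
    """Estimate the circuit size (Mbps) for one office location.

    Assumes linear scaling from the pilot group to the whole location.
    target_headroom is a fraction, for example 0.3 for 30% headroom.
    """
    if pilot_users <= 0:
        raise ValueError("pilot_users must be positive")
    if not 0.0 <= target_headroom < 1.0:
        raise ValueError("target_headroom must be in [0, 1)")
    per_user_mbps = pilot_peak_mbps / pilot_users
    projected_peak = baseline_peak_mbps + per_user_mbps * location_users
    # Size the circuit so the projected peak leaves the target headroom free.
    return projected_peak / (1.0 - target_headroom)
```

Treat the result as a starting point for discussion with your provider and re-measure after rollout, because the linear-scaling assumption rarely holds exactly.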

All of our recommendations on network processing apply to both ExpressRoute and Internet circuits. The same is true for the rest of the guidance on our performance tuning site.

Securing Azure ExpressRoute connectivity starts with the same principles as securing Internet connectivity. Many customers choose to deploy network and perimeter controls along the ExpressRoute path connecting their on-premises network to Office 365 and other Microsoft clouds. These controls may include firewalls, application proxies, data leakage prevention, intrusion detection, intrusion prevention systems, and so on. In many cases customers apply different levels of controls to traffic initiated from on-premises going to Microsoft, traffic initiated from Microsoft going to the customer's on-premises network, and traffic initiated from on-premises going to a general Internet destination.

Install new security/perimeter infrastructure specific to the ExpressRoute path and terminate the Point-to-Point connection there.

Any-to-Any IPVPN

Leverage an existing on-premises security/perimeter infrastructure at all locations that egress into the IPVPN used for ExpressRoute for Office 365 connectivity.

Hairpin the IPVPN used for ExpressRoute for Office 365 to specific on-premises locations designated to serve as the security/perimeter.

Some service providers also offer managed security/perimeter functionality as a part of their integration solutions with Azure ExpressRoute.

When considering the topology placement of the network/security perimeter options used for ExpressRoute for Office 365 connections, the following are additional considerations:

The depth and type of network/security controls may have an impact on the performance and scalability of the Office 365 user experience.

Outbound (on-premises->Microsoft) and inbound (Microsoft->on-premises) [if enabled] flows may have different requirements. These are likely different from the requirements for outbound flows to general Internet destinations.

Office 365 requirements for ports/protocols and necessary IP subnets are the same whether traffic is routed through ExpressRoute for Office 365 or through the Internet.

Topological placement of the customer network/security controls determines the ultimate end-to-end network path between the user and the Office 365 service and can have a substantial impact on network latency and congestion.

Customers are encouraged to design their security/perimeter topology for use with ExpressRoute for Office 365 in accordance with best practices for redundancy, high availability, and disaster recovery.

Here's an example of Woodgrove Bank that compares the different Azure ExpressRoute connectivity options with the perimeter security models discussed above.

Example 1: Securing Azure ExpressRoute

Woodgrove Bank is considering implementing Azure ExpressRoute. After planning the optimal architecture for Routing with ExpressRoute for Office 365 and using the above guidance to understand bandwidth requirements, they’re determining the best method for securing their perimeter.

For Woodgrove, a multi-national organization with locations on multiple continents, security must span all perimeters. The optimal connectivity option for Woodgrove is a multi-point connection with multiple peering locations around the globe to service the needs of their employees on each continent. Each continent includes redundant Azure ExpressRoute circuits within the continent and security must span all of these.

Woodgrove's existing infrastructure is reliable and can handle the additional work. As a result, Woodgrove Bank is able to use the infrastructure for their Azure ExpressRoute and internet perimeter security. If this weren’t the case, Woodgrove could choose to purchase additional equipment to supplement their existing equipment or to handle a different type of connection.

High availability and failover with Azure ExpressRoute

We recommend provisioning at least two active circuits from each egress with ExpressRoute to your ExpressRoute provider. This is the most common place we see failures for customers and you can easily avoid it by provisioning a pair of active/active ExpressRoute circuits. We also recommend at least two active/active Internet circuits because many Office 365 services are only available over the Internet.

Inside the egress point of your network are many other devices and circuits that affect how people perceive availability. These portions of your connectivity scenarios are not covered by ExpressRoute or Office 365 SLAs, but they play a critical role in the end-to-end service availability as perceived by people in your organization.

Focus on the people using and operating Office 365. If a failure of any one component would affect people’s experience using the service, look for ways to limit the total percentage of people affected. If a failover mode is operationally complex, consider people’s experience of a long time to recovery and look for operationally simple and automated failover modes.

Outside of your network, Office 365, ExpressRoute, and your ExpressRoute provider all have different levels of availability.

Service Availability

Office 365 services are covered by well-defined service level agreements, which include uptime and availability metrics for individual services. One reason Office 365 can maintain such high service availability levels is the ability of individual components to fail over seamlessly between the many Microsoft datacenters, using the global Microsoft network. This failover extends from the datacenter and network to the multiple Internet egress points, and is seamless from the perspective of the people using the service.

ExpressRoute provides a 99.9% availability SLA on individual dedicated circuits between the Microsoft Network Edge and the ExpressRoute provider or partner infrastructure. These service levels are applied at the ExpressRoute circuit level, which consists of two independent interconnects between the redundant Microsoft equipment and the network provider equipment in each peering location.

Provider Availability

Microsoft’s service level arrangements stop at your ExpressRoute provider or partner. This is also the first place you can make choices that will influence your availability level. You should closely evaluate the architecture, availability, and resiliency characteristics your ExpressRoute provider offers between your network perimeter and your provider's connection at each Microsoft peering location. Pay close attention to both the logical and physical aspects of redundancy, peering equipment, carrier-provided WAN circuits, and any additional value-add services such as NAT services or managed firewalls.

Designing your availability plan

We strongly recommend that you plan and design high availability and resiliency into your end-to-end connectivity scenarios for Office 365. A design should include:

no single points of failure, including both Internet and ExpressRoute circuits.

minimizing the number of people affected and duration of that impact for most anticipated failure modes.

optimizing for simple, repeatable, and automatic recovery process from most anticipated failure modes.

supporting the full demands of your network traffic and functionality through redundant paths, without substantial degradation.

Your connectivity scenarios should include a network topology that is optimized for multiple independent and active network paths to Office 365. This will yield a better end-to-end availability than a topology that is optimized only for redundancy at the individual device or equipment level.
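The effect of independent paths on end-to-end availability can be sketched with basic probability. The example below assumes that component failures are statistically independent, which real networks only approximate, and the function names are hypothetical. Components that must all work (in series) multiply their availabilities, while a set of redundant paths fails only when every path fails.

```python
from math import prod
from typing import Iterable


def series_availability(components: Iterable[float]) -> float:
    """Availability of a path in which every component must be working."""
    return prod(components)


def parallel_availability(paths: Iterable[float]) -> float:
    """Availability of redundant paths where any one working path is enough.

    Assumes the paths fail independently of one another.
    """
    return 1.0 - prod(1.0 - availability for availability in paths)


# Illustrative figures only: a WAN link and a circuit, each assumed 99.9% available.
single_path = series_availability([0.999, 0.999])
two_independent_paths = parallel_availability([single_path, single_path])
```

Under these assumptions, two independent paths give a higher combined availability than either path alone, while adding more components in series to a single path lowers it.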

Tip: If your users are distributed across multiple continents or geographic regions and each of those locations connects over redundant WAN circuits to a single on-premises location where a single ExpressRoute circuit is located, your users will experience lower end-to-end service availability than they would with a network topology that includes independent ExpressRoute circuits connecting the different regions to the nearest peering location.

We recommend provisioning at least two ExpressRoute circuits, with each circuit connecting to a different geographic peering location. You should provision this active/active pair of circuits for every region where people will use ExpressRoute connectivity for Office 365 services. This allows each region to remain connected during a disaster that affects a major location such as a datacenter or peering location. Configuring them as active/active allows end user traffic to be distributed across multiple network paths. This reduces the scope of people affected during device or network equipment outages.

We don't recommend using a single ExpressRoute circuit with the Internet as a backup.

Example 2: Failover and High Availability

Woodgrove Bank’s multi-geographic design has undergone a review of routing, bandwidth, and security, and now must go through a high availability review. Woodgrove thinks about high availability as covering three categories: resiliency, reliability, and redundancy.

Resiliency allows Woodgrove to recover from failures quickly. Reliability allows Woodgrove to offer a consistent outcome within the system. Redundancy allows Woodgrove to move between one or more mirrored instances of infrastructure.

Within each edge configuration, Woodgrove has redundant Firewalls, Proxies, and IDS. For North America, Woodgrove has one edge configuration in their Dallas datacenter and another edge configuration in their Virginia datacenter. The redundant equipment at each location offers resiliency to that location.

The network configuration at Woodgrove Bank is built based on a few key principles:

Within each geographic region, there are multiple Azure ExpressRoute circuits.

Each circuit within a region can support all of the network traffic within that region.

Routing will clearly prefer one or the other path depending on availability, location, and so on.
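The first two principles above can be expressed as a simple capacity check. This is a minimal sketch with hypothetical names and figures: it treats a region as able to survive the loss of a circuit only if it has at least two circuits and the smallest one can carry the region's full peak traffic by itself.

```python
from typing import Sequence


def region_survives_circuit_loss(
    circuit_capacities_mbps: Sequence[float],
    regional_peak_mbps: float,
) -> bool:
    """True if the region has multiple circuits and each can carry the full peak."""
    if len(circuit_capacities_mbps) < 2:
        return False
    # Every circuit must be large enough on its own, so check the smallest one.
    return min(circuit_capacities_mbps) >= regional_peak_mbps
```

A region with circuits of 1,000 Mbps and 500 Mbps and a peak of 800 Mbps would fail this check, because the smaller circuit could not carry all traffic during an outage of the larger one.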

In this configuration, with redundancy at the physical and virtual level, Woodgrove Bank is able to offer local resiliency, regional resiliency, and global resiliency in a reliable way. Woodgrove elected this configuration after evaluating a single Azure ExpressRoute circuit per region as well as the possibility of failing over to the internet.

If Woodgrove were unable to have multiple Azure ExpressRoute circuits per region, routing traffic originating in North America to the Azure ExpressRoute circuit in Asia Pacific would add an unacceptable level of latency, and the required DNS forwarder configuration would add complexity.

Using the internet as a backup configuration isn't recommended. This breaks Woodgrove’s reliability principle, resulting in an inconsistent experience using the connection. Additionally, manual configuration would be required to fail over, considering the BGP advertisements that have been configured, NAT configuration, DNS configuration, and the proxy configuration. This added failover complexity increases the time to recover and decreases their ability to diagnose and troubleshoot the steps involved.

Work with your provider or providers to select the best connectivity options: point-to-point, multi-point, or hosted. Remember, you can mix and match the connectivity options as long as the bandwidth and other redundant components support your routing and high availability design.