A site dedicated to my experiences with Microsoft's UC and Azure platforms.

Friday, November 23, 2012

Lync Server 2013 HA Design Changes and Considerations

Lync Server 2013 introduces new capabilities for recovering from a single server or pool failure and for failing over between pools of servers, whether Enterprise or Standard Edition.

This post discusses these capabilities, demonstrates their use, and offers suggestions for organizations wondering which path to choose.

Lync Server 2013...what's changed?

Enterprise Edition pools are now recommended to have a minimum of three (yes, THREE) front-end servers. This is due to the "Windows Fabric" replication architecture, which is based on Azure. The back-end SQL database is no longer the store for real-time data.

(subject to change) Enterprise Edition pools use a quorum model similar to Exchange Server 2010/2013, in that a Majority Node Set (MNS) quorum leverages a tie-breaker for pools with an even number of front-end servers. In the case of Lync Server 2013, the tie-breaker is the pool's back-end SQL server.
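The tie-breaker idea can be sketched in a few lines of Python. This is only an illustration of majority voting with a witness as described above, not Lync's or Windows Fabric's actual implementation; the function name `has_quorum` and its parameters are hypothetical.

```python
def has_quorum(total_front_ends: int, front_ends_up: int, backend_sql_up: bool) -> bool:
    """Return True if the surviving voters form a strict majority.

    Assumption (from the description above, not from product documentation):
    in a pool with an even number of front ends, the back-end SQL server
    counts as one extra vote. In a pool with an odd number, only the
    front ends vote.
    """
    if total_front_ends < 1 or not 0 <= front_ends_up <= total_front_ends:
        raise ValueError("front_ends_up must be between 0 and total_front_ends")

    voters = total_front_ends
    votes = front_ends_up
    if total_front_ends % 2 == 0:
        # Even-sized pool: the back-end SQL server acts as the tie-breaker.
        voters += 1
        if backend_sql_up:
            votes += 1

    # A strict majority means more than half of all voters are reachable.
    return votes > voters // 2


if __name__ == "__main__":
    for total, up, sql in [(3, 2, False), (4, 2, True), (4, 2, False)]:
        print(total, up, sql, has_quorum(total, up, sql))
```

Under this model a four-server pool that has lost two front ends stays up only while the back-end SQL server is reachable, which is why the tie-breaker matters for even-sized pools.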

Lync servers can be "paired" with like infrastructure (Enterprise to Enterprise and Standard to Standard) to ensure resiliency in the event of a site outage (DR). Pairing ensures replication of critical pool/server data; the failover itself must be invoked manually by an administrator via PowerShell commands.

Multiple Federation routes can be applied to the topology. For example, a Boston Standard Edition server can use a Boston Lync Edge server as its Federation route whereas a Seattle Enterprise pool can use a regional Seattle Lync Edge server/pool for Federation.

Now that Enterprise Edition pools can be paired with other EE pools, and Standard Edition servers can be paired with Standard Edition servers, the way we design Lync solutions changes in certain cases. I talk to customers who often suggest they need "High Availability" (HA) in their Lync infrastructure, and this often comes from those who are implementing IM&P only. Instead of trying to meet some kind of unrealistic expectation or designing to a requirement that centers around a term like HA, drive the conversation toward Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These two factors, along with a Service Level Agreement (SLA) percentage (e.g. 99.9%), should drive the outcome. Anyway, here is the rule of thumb I use personally today:
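To make an SLA percentage concrete in that conversation, it helps to convert it into allowed downtime. The snippet below is a minimal sketch of that arithmetic; the function name `allowed_downtime_hours` is made up for illustration, and the year is assumed to be 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted over the period at a given SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% allows about {hours:.2f} hours of downtime per year")
```

A 99.9% SLA works out to roughly 8.76 hours of downtime per year, which is a useful number to set next to a proposed RTO of more than one hour.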

If the organization suggests they need HA, are they willing to accept a Recovery Time Objective of >1 hour? If so, and the per-server user count does not exceed ~5000, use Lync Standard Edition. Two Lync Standard Edition servers could even be used to split the load of 5000 users in a location, where 2500 are homed on each server and the two servers are paired (each acting as backup for the other). The build list would look something like this:

If the organization insists they cannot incur downtime for Lync components contained within a single site, and they insist "high availability" is a requirement, the infrastructure looks something like this:

That's 24 servers to build a site-redundant Lync Server 2013 Enterprise environment. This may seem a bit ridiculous; however, the point I'm illustrating is the value Standard Edition now brings in Lync Server 2013. Additionally, I haven't yet found an organization that would dedicate server hardware or VMs in this manner. You can collocate many of the roles and scale back on things like WAC and SQL mirroring. Lastly, organizations might suggest their DR infrastructure only needs to accommodate lower user counts, which may drive a design lacking redundancy at the "warm" site.
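The rule of thumb above can be written down as a small decision function. This is just a sketch of my reading of it; the name `suggest_edition`, the returned labels, and treating ~5000 as a hard limit are all assumptions for illustration.

```python
SE_USER_LIMIT = 5000  # approximate per-server ceiling quoted in this post


def suggest_edition(accepts_rto_over_one_hour: bool, users_per_server: int) -> str:
    """Map the two questions from the rule of thumb to a starting design."""
    if users_per_server < 0:
        raise ValueError("users_per_server must not be negative")

    if not accepts_rto_over_one_hour:
        # No tolerance for downtime within a site: the large EE build.
        return "Enterprise Edition pools, paired"

    if users_per_server <= SE_USER_LIMIT:
        return "Standard Edition servers, paired"

    # RTO is acceptable but one server is too small for the load.
    return "Standard Edition servers, paired, with users split across more servers"


if __name__ == "__main__":
    print(suggest_edition(True, 2500))
    print(suggest_edition(False, 2500))
    print(suggest_edition(True, 8000))
```

It is deliberately crude: the point is that the first question is about acceptable recovery time, not about the label "HA".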

Hold on... what about Persistent Chat?
Okay, so you want a Persistent Chat pool as well... we need to add dual redundant servers at each site, raising the total to 28 servers.

As you can see, the case for paired Standard Edition servers quickly becomes favorable from a cost and complexity perspective, albeit at the expense of availability in the event of a single server outage. The fact that hardware load balancers can be completely eliminated also tells a great story around simplicity. To date I have yet to see a successful implementation of OCS or Lync where hardware load balancers are in the mix at all. This is mostly due to lack of knowledge, lack of understanding of how the solution works, or in some cases simple reluctance to work together.

What if I have more than 5000 users at a single site and need DR?
Consider placing multiple Standard Edition servers at the site, each paired with a similar server at your backup site. You can split homed users between servers (e.g. 3000 on ServerA in SiteA and 3000 on ServerB in SiteA) to meet your capacity requirements.
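The split can be worked out with a few lines of Python. This is a rough sizing sketch only: `split_users` is a hypothetical helper, the ~5000 figure is the approximate per-server number used earlier in this post, and it does not account for the extra load a server carries after its partner fails over to it.

```python
import math

SE_USER_LIMIT = 5000  # approximate per-server ceiling quoted in this post


def split_users(total_users: int, per_server_limit: int = SE_USER_LIMIT) -> list[int]:
    """Spread users as evenly as possible over the fewest servers that fit."""
    if total_users < 0 or per_server_limit <= 0:
        raise ValueError("total_users must be >= 0 and per_server_limit > 0")

    # Fewest servers that keep every server at or under the limit.
    servers = max(1, math.ceil(total_users / per_server_limit))

    # Even split, handing any remainder out one user at a time.
    base, extra = divmod(total_users, servers)
    return [base + 1 if i < extra else base for i in range(servers)]


if __name__ == "__main__":
    print(split_users(6000))
    print(split_users(12000, per_server_limit=2500))
```

For 6000 users this gives two servers with 3000 users each, matching the example above. Passing a lower `per_server_limit` is one way to leave headroom on each server.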

What are the drawbacks to Lync Standard Edition anyway?
Well, the first point people typically jump on is no "high availability". This is due to the lack of a shared data store to which multiple front ends connect. Here are some of the more important drawbacks when considering this approach:

Restoration of service is a manual effort, so users are left in "Resiliency Mode" until this action is taken.

Your Edge proxy to 'next hop' internal server can be only one SE server even if you have several of them. An outage to this next hop server results in an outage for all remote users' traffic. It is important to note as well that if Edge cannot contact the next hop, clients will not attempt to sign into another Edge proxy even if another exists (without manual intervention at each client system).

Response Groups and Call Park are a manual effort to switch over.

Assignment of users to a collection of SE servers takes thought and proper assignment so as to not overload a single server. In the case where you have two servers, decide if you're going to run them active/active or active/passive as this will change your user placement behavior. This can also be scripted for ease of user placement automatically.

You could argue this is more complex to manage; however, the same argument can be made about the HLB/SQL infrastructure that Enterprise Edition requires.

Your PSTN conference DID is homed to a single server. If this server is down, the DID is as well. I have not yet tested whether a pool failover restores this DID on the backup registrar (TBD).

Exchange OWA/UCS integration has a single point of failure due to the lack of multiple server definitions in the Exchange 2010/2013 CAS setup.

Certainly you will have to weigh your own requirements against what is both supported and recommended. This article is intended to keep us thinking on our toes when designing Lync solutions for our customers. Enjoy!

That's an interesting write up Jason. We have been having numerous discussions on approach to Lync 2013 designs and have come to similar conclusions to yourself.

It's interesting that customers focus on HA with little real thought as to the cost and real business benefit of near-line HA, isn't it? Usually failover with near-line voice recovery suffices, and is a good position to design from commercially.

I agree completely. We often suffer from overly complex designs which are difficult to implement to begin with and nearly impossible to support for someone who doesn't have extensive multi-platform experience. I find it helps to keep my grey hair in check by keeping it simple.

Great post, Jason. I have a quick question regarding our implementation design. Currently we have one main site, two branch offices, and our datacenter, all connected via an MPLS cloud. We have 700 users and are planning on moving all our voice calls to Lync. We want to deploy two Lync 2013 Standard Edition servers at our datacenter and an SBA at each site for voice resiliency. If it's not too much to ask, can you give me some feedback on this design? Thanks.

You may consider placing one Standard Edition server at your main site and one at the datacenter. If you have redundancy in your connectivity back to either site, you could rely on that as your sole method of providing voice to the branch. Also, something many people miss these days is the cost difference between a simple 1-port T1 gateway with an SBS and an SBA. I would try to steer clear (if it makes sense) of having all Standard Edition servers in your topology in one site.

Hi, I have a few questions about this infrastructure setup:

1) How about DNS requirements? Do I have to set up SRV records and all the other A records for the second Standard Edition server, or does the failover mean re-pointing the DNS records?

2) The failover in this setup would be manual, triggered by an admin? And I've read the downtime for end users is up to 30 minutes. Why 30 minutes and not instantly?

Jason - great write up. However, I think that you would serve your readers well by including some of the sacrifices associated with utilizing Standard over Enterprise. I would recommend to anyone considering a Lync 2013 deployment to evaluate what Enterprise offers over Standard and ensure that your organization doesn't require what might be lacking from Standard.

That doesn't change the central point that Jason made though. Very well written and received.


The views expressed on this site are my own and do not represent those of Cisco, Microsoft, Apple, or any other company including the one I work for. They are my own personal thoughts so take it for what it's worth.