More Users, More Problems: How LinkedIn Manages their Cloud Infrastructure

One of the most basic truths of the IT industry is that as the number of users grows, so must the number of infrastructure components in a company’s architecture, and with it the number of things that can go wrong.

This challenge is nothing new to the IT giants of the world. Companies like LinkedIn, which serves over 500 million users in more than 200 different countries and territories around the world, have deployed a wide range of infrastructure components, such as multiple DNS resolvers and content delivery networks (CDNs), to meet the performance needs of their users. These services are necessary for LinkedIn to reach their customers at the edge, but with only one CDN under their direct control, they must also have a reliable monitoring solution in place to ensure that their third-party vendors are performing up to expectations.

Putting this monitoring solution in place, however, is a challenge in and of itself. With so many third parties and so many end-user locations, it becomes even more important to monitor the digital experience from a point of presence as close to the end user as possible. Many performance problems are confined to specific regions because they originate in local networks; for example, if you’re running your synthetic tests from Tokyo, you might miss a micro-outage that is felt only by users in Shanghai.

Another crucial aspect of this strategy is that it mitigates the risk of outsourcing much of your infrastructure to third parties. No company, however large or successful, has the resources to put first-party infrastructure in place to reach 500 million global users; it would simply be cost-prohibitive. It makes much more sense to rely on third parties dedicated to that specific purpose, but that also requires a certain level of trust that those vendors won’t hamper your customer experience and thus negatively impact your brand.

Service Level Agreements (SLAs) exist for this express purpose: they tie a vendor to specific performance thresholds and require financial restitution if the vendor fails to meet them. However, enforcing an SLA can itself be a challenge, because the requisite monitoring capabilities must be in place to evaluate how the vendor’s service is truly performing.
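To make the SLA evaluation concrete, here is a minimal sketch of how synthetic-test results could be checked against SLA thresholds for each vendor. It is not LinkedIn’s actual tooling; the Sample record, the sla_report function, and the threshold values (99.9% availability, 500 ms at the 95th percentile) are assumptions chosen purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One synthetic test result (hypothetical record layout)."""
    vendor: str
    ok: bool            # did the test succeed?
    latency_ms: float   # measured response time


def sla_report(samples, min_availability=0.999, max_p95_ms=500.0):
    """Summarize availability and p95 latency per vendor.

    The default thresholds are illustrative, not taken from any real SLA.
    """
    by_vendor = {}
    for s in samples:
        by_vendor.setdefault(s.vendor, []).append(s)

    report = {}
    for vendor, rows in by_vendor.items():
        availability = sum(r.ok for r in rows) / len(rows)
        latencies = sorted(r.latency_ms for r in rows if r.ok)
        if latencies:
            # Nearest-rank 95th percentile of successful requests.
            rank = max(1, math.ceil(0.95 * len(latencies)))
            p95 = latencies[rank - 1]
        else:
            p95 = float("inf")
        report[vendor] = {
            "availability": availability,
            "p95_ms": p95,
            "met": availability >= min_availability and p95 <= max_p95_ms,
        }
    return report
```

A report like this is only as trustworthy as the samples behind it, which is why where and how often the tests run matters as much as the thresholds themselves.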

Of course, an SLA payment is only made after the damage has been done; the first priority must be to prevent the end user experience from ever suffering in the first place. Here, too, is where LinkedIn’s SRE team and monitoring strategy play a key role.

The SREs are tasked with staying ahead of performance issues and minimizing the impact whenever one occurs, which requires them to catch problems in real time and troubleshoot them as quickly as possible. When something such as a spike in network latency occurs, the SRE team must be alerted immediately. If the issue can’t be solved right away, they need to start handing the affected users off to a different CDN until the vendor can correct the problem.
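The alert-then-fail-over logic described above can be sketched as follows. This is a simplified illustration, not LinkedIn’s implementation: the CdnFailover class, the window size, and the 300 ms threshold are all hypothetical, and in practice traffic steering is usually carried out through DNS or load-balancer configuration rather than in application code.

```python
from collections import deque
from statistics import median


class CdnFailover:
    """Track recent latency per CDN and pick a healthy one (illustrative)."""

    def __init__(self, cdns, window=5, threshold_ms=300.0):
        self.cdns = list(cdns)            # in order of preference
        self.threshold_ms = threshold_ms  # assumed alerting threshold
        self.recent = {c: deque(maxlen=window) for c in self.cdns}
        self.active = self.cdns[0]

    def _healthy(self, cdn):
        samples = self.recent[cdn]
        # With no data we assume the CDN is healthy. The median keeps a
        # single slow measurement from triggering a failover.
        return not samples or median(samples) <= self.threshold_ms

    def record(self, cdn, latency_ms):
        """Store a measurement; return an alert message or None."""
        self.recent[cdn].append(latency_ms)
        if self._healthy(self.active):
            return None
        for candidate in self.cdns:
            if candidate != self.active and self._healthy(candidate):
                previous, self.active = self.active, candidate
                return f"ALERT: {previous} degraded, routing to {candidate}"
        return f"ALERT: {self.active} degraded, no healthy alternative"
```

Using a rolling median rather than a single reading is one possible design choice: it trades a few seconds of detection delay for fewer false alarms.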

This means that in certain cases, LinkedIn could be aware of an issue even before the vendor is, provided its own monitoring solution is faster and more accurate than the vendor’s. In such situations, in-depth reporting and analytical capabilities allow them to help the vendor identify the root cause. By capturing headers for every single object on the page and collecting and analyzing the data quickly, they can share the results with the vendor and reach a resolution faster.

When you’re trying to maintain digital performance for 500 million users, the speed, accuracy, and reliability of the data make all the difference in the world.