Rethinking Endorsements Infrastructure, Part 2: The New Endorsements Infrastructure

September 15, 2016

This is the second part in a series on how LinkedIn has evolved the Endorsements feature. Part 1 of the series, which discusses the basic existing Endorsements infrastructure and the new success metric developed to measure Endorsements, can be found here.

As explained in Part 1, the Endorsements feature has two main components: one for suggested endorsements, and another for serving endorsements. The suggested endorsement pipeline is a part of our backend infrastructure that produces a set of suggestions that are then presented to our members. These suggestions help guide their endorsements for people in their network.

While the suggested endorsements pipeline was not fundamentally changed during our overhaul, we did decide to make some changes to the serving endorsements infrastructure by investing in improving the relevance of the suggestions we give. To accomplish this, we integrated the endorser’s reputation score for skills to ensure that we suggest endorsements from members with good skill reputation. This post will explain in further detail how we achieved this integration.

Becoming an edge in the graph

For several years at LinkedIn, we have built and expanded our graph database (called GraphDB). It now powers many of the most important queries on the platform, such as the different degrees of each member’s connections.

Graph databases shine when you are trying to relate entities (nodes) to each other along relationships (edges). On top of new functionalities, the GraphDB at LinkedIn is heavily optimized, and is able to support millions of queries per second at very low latencies.

However, migrating the Endorsements feature to GraphDB also brought new challenges. As with every eventually-consistent system, the updates pushed to the GraphDB are not immediately visible to all of our members. This becomes an issue if you’ve just endorsed or unendorsed a member and the GraphDB nodes that the member is getting data from are still catching up with the stream of updates. To solve this issue, we chose to continue to query the SQL database as the source of truth for a member’s own endorsements. By mixing sensitive real-time data from the SQL database and near real-time data from the GraphDB, we can achieve both performance and flexibility, ensuring that our members see the most up-to-date version of their endorsements at all times.
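One way to picture this mixed read path is the minimal sketch below. It is an illustration under stated assumptions, not the actual implementation: EndorsementStore is a hypothetical in-memory stand-in for both the SQL database and GraphDB, and merged_view is a hypothetical helper. For a given profile, everything the viewer did comes from the SQL store, and everyone else's endorsements come from the possibly stale graph.

```python
from dataclasses import dataclass, field


@dataclass
class EndorsementStore:
    """Hypothetical in-memory stand-in for the SQL database or GraphDB."""
    # Maps recipient member id -> set of (endorser_id, skill) tuples.
    rows: dict = field(default_factory=dict)

    def endorsements_for(self, member_id):
        # Return a copy so callers cannot mutate the store by accident.
        return set(self.rows.get(member_id, set()))


def merged_view(viewer_id, profile_id, sql_store, graph_store):
    """Combine real-time data about the viewer with near real-time graph data."""
    # Endorsements by other members: served from the graph, may lag slightly.
    others = {
        e for e in graph_store.endorsements_for(profile_id) if e[0] != viewer_id
    }
    # The viewer's own endorsements: always read from the source of truth.
    own = {
        e for e in sql_store.endorsements_for(profile_id) if e[0] == viewer_id
    }
    return others | own
```

With this shape, an endorsement the viewer just gave shows up even if the graph has not caught up, and one the viewer just removed disappears even if the graph still holds it.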

Also, to make sure that we did not overload the graph, we chose to not load the entire Endorsements dataset immediately. To control the progressive load of data and validate that all systems were still fully functional, we decided to add 200M nodes (a node being a tuple of member and skill) every two weeks until the entire Endorsements dataset was loaded.
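The pacing of such a ramp is simple arithmetic, sketched below. The batch size and interval come from the paragraph above; the total node count and start date passed to load_schedule are hypothetical inputs, since this post does not state them.

```python
import math
from datetime import date, timedelta

BATCH_NODES = 200_000_000     # from the post: 200M nodes per step
STEP = timedelta(weeks=2)     # from the post: one step every two weeks


def load_schedule(total_nodes, start):
    """Return a list of (load date, cumulative nodes loaded) per step."""
    steps = math.ceil(total_nodes / BATCH_NODES)
    return [
        (start + i * STEP, min((i + 1) * BATCH_NODES, total_nodes))
        for i in range(steps)
    ]


if __name__ == "__main__":
    # Hypothetical total of 1 billion nodes, purely for illustration.
    for when, loaded in load_schedule(1_000_000_000, date(2016, 1, 4)):
        print(when.isoformat(), loaded)
```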

Bootstrapping the Endorsements data

In order to avoid falling too far behind if events were ever to be dropped from the Endorsements update and creation pipeline, GraphDB re-bootstraps itself (rebuilds from scratch) every week. In order to do that, Hadoop jobs run on offline data (all the Endorsements data is stored on HDFS, as well as online) and push the entire dataset to the graph.

As illustrated above, we wrote Hadoop jobs that fetch raw Endorsements data from HDFS, massage that data, and apply the proper filtering to generate Endorsements edge data. Another job in the graph stack then picks up that data and loads it into the GraphDB clusters. With billions of endorsements, despite the heavy filtering that we do, the final dataset that we have to add to the graph is on the order of 400GB.
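The filter-and-transform step can be pictured with the small sketch below. It is written as plain Python rather than an actual Hadoop job, and the record fields, status values, and filtering rules (keep only accepted endorsements, drop self-endorsements, drop duplicates) are all assumptions made for illustration; the post does not list the real filters. The only detail taken from the text is that a node is a tuple of member and skill.

```python
from typing import Iterable, Iterator, NamedTuple


class RawEndorsement(NamedTuple):
    """Hypothetical shape of one raw record read from HDFS."""
    endorser_id: int
    recipient_id: int
    skill: str
    status: str  # assumed values: "ACCEPTED", "PENDING", "DELETED"


class Edge(NamedTuple):
    """An edge from an endorser to a (member, skill) node."""
    source: int
    target: tuple


def to_edges(records: Iterable[RawEndorsement]) -> Iterator[Edge]:
    """Filter raw records and emit deduplicated edge data."""
    seen = set()
    for r in records:
        if r.status != "ACCEPTED":          # assumed filter: accepted only
            continue
        if r.endorser_id == r.recipient_id:  # assumed filter: no self-endorsement
            continue
        # Normalize the skill name so near-duplicates collapse to one edge.
        edge = Edge(r.endorser_id, (r.recipient_id, r.skill.strip().lower()))
        if edge in seen:
            continue
        seen.add(edge)
        yield edge
```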

Handling online events

The number of writes to the Endorsements database (new endorsements and endorsement status updates) is pretty high, reaching more than a hundred updates per second. A bulk load approach that runs every week or even every day would mean falling behind over time. More importantly, when our members choose to edit their endorsements, we want our systems to very quickly reflect that change.

The legacy Endorsements update pipeline was pretty straightforward: when an update would reach our service (for example when a member updated, deleted, or accepted an endorsement), we would relay that query down our services stack, transform the query into a SQL command, and update the rows in the database. Because the database was the source of truth as well as the serving mechanism for the endorsements, the changes would be immediately reflected.
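As a rough illustration of that legacy flow, the sketch below turns a service-level update into a SQL command and reads the result straight back. It uses an in-memory SQLite database purely as a stand-in; the table layout, column names, and action values are hypothetical and are not the production schema.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical endorsements table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE endorsements ("
        " endorser_id INTEGER, recipient_id INTEGER, skill TEXT, status TEXT,"
        " PRIMARY KEY (endorser_id, recipient_id, skill))"
    )
    return conn


def apply_update(conn, endorser_id, recipient_id, skill, action):
    """Translate one update into a SQL command and commit it."""
    key = (endorser_id, recipient_id, skill)
    with conn:  # commits on success, rolls back on error
        if action == "DELETE":
            conn.execute(
                "DELETE FROM endorsements"
                " WHERE endorser_id = ? AND recipient_id = ? AND skill = ?",
                key,
            )
        else:
            # Any other action is stored as the row's new status.
            conn.execute(
                "INSERT OR REPLACE INTO endorsements VALUES (?, ?, ?, ?)",
                key + (action,),
            )


def status_of(conn, endorser_id, recipient_id, skill):
    """Read back the current status, or None if the row does not exist."""
    row = conn.execute(
        "SELECT status FROM endorsements"
        " WHERE endorser_id = ? AND recipient_id = ? AND skill = ?",
        (endorser_id, recipient_id, skill),
    ).fetchone()
    return row[0] if row else None
```

Because the same database both stores and serves the rows, a read issued right after apply_update already sees the change.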

With the new infrastructure, where we have the SQL database as the source of truth and the GraphDB as our serving mechanism, we had to alter the way we process the updates. The first constraint is that it is not possible for our service to write directly to the GraphDB. To keep control over the rate of updates reaching the graph, and to handle the complex nature of some updates (for example, a member connection triggers a lot of smaller updates to a lot of nodes and edges), we use a messaging queue to push our updates.

The producer (Endorsements service nodes) will emit an event and push it to the Kafka cluster on a specific topic. The consumers of this event are Samza instances that take this event and transform the initial event with additional data that would be too expensive for the producer to acquire. Samza will then call the graph through dedicated APIs to update the graph data.
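The producer, enrich, and update flow described above can be sketched as follows. This is a single-process simulation, not Kafka or Samza code: a queue.Queue stands in for the Kafka topic, drain stands in for a Samza task, and FakeGraphClient stands in for the dedicated graph APIs. The event fields and the choice of the endorser's reputation score as the enrichment data are assumptions made for illustration.

```python
import json
import queue


def emit_update(topic, endorser_id, recipient_id, skill, action):
    """Producer side: publish a small event with only what the service knows."""
    topic.put(json.dumps({
        "endorser_id": endorser_id,
        "recipient_id": recipient_id,
        "skill": skill,
        "action": action,  # assumed values: "UPSERT" or "DELETE"
    }))


def enrich(event, reputation_lookup):
    """Consumer side: attach data that is too expensive for the producer."""
    enriched = dict(event)
    enriched["endorser_reputation"] = reputation_lookup.get(
        (event["endorser_id"], event["skill"]), 0.0
    )
    return enriched


class FakeGraphClient:
    """Hypothetical stand-in for the graph update API."""

    def __init__(self):
        self.edges = {}

    def apply(self, event):
        key = (event["endorser_id"], event["recipient_id"], event["skill"])
        if event["action"] == "DELETE":
            self.edges.pop(key, None)
        else:
            self.edges[key] = event["endorser_reputation"]


def drain(topic, reputation_lookup, graph):
    """Consume every queued event, enrich it, and push it to the graph."""
    processed = 0
    while True:
        try:
            raw = topic.get_nowait()
        except queue.Empty:
            return processed
        graph.apply(enrich(json.loads(raw), reputation_lookup))
        processed += 1
```

Keeping the producer's event small and doing the lookup in the consumer mirrors the split described above: the service stays fast, and the consumer controls the rate at which updates reach the graph.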

Conclusion

With these changes in place, we are ready to change the Endorsements experience. The Endorsements dataset living in the graph enables us to provide more highly-rated endorsements and improved endorsements insights to our members. Without this new infrastructure, it would be impossible for us to serve these targeted, highly-rated endorsements to our members.

This technological advancement takes us one step closer to realizing our goal of being the largest and most trusted professional peer validation system.