Nuage: Making Data Systems Management Scalable

Kafka

Kafka is a distributed messaging system used for real-time data activities, like application communication, metrics collection, and log aggregation. It was developed at LinkedIn in 2010 and is the heart of LinkedIn’s data pipeline. The idea of Kafka is simple, which is critical for its scalability. It is a publish-subscribe system, with an arbitrary number of producers and consumers.

Producers publish messages to a Kafka topic, while consumers retrieve messages from that topic. The messages are partitioned and are guaranteed to be delivered in order within a partition. To keep scalability unimpaired, Kafka does not throttle the producers’ write speed. Instead, it handles the mismatch between producer and consumer speeds by persisting messages in a commit log, which lets each consumer read at its own pace.
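To make the commit-log idea concrete, here is a minimal in-memory sketch. It is not Kafka's implementation or client API: the `Topic` class, its method names, and the CRC32-based partitioning are assumptions chosen purely for illustration. It shows that messages with the same key stay in order within one partition, and that each consumer keeps its own offset so a slow reader never holds back a fast one.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy stand-in for a partitioned, append-only commit log (hypothetical)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[consumer][partition] -> index of the next message to read
        self.offsets = defaultdict(lambda: defaultdict(int))

    def produce(self, key, value):
        # The same key always maps to the same partition, so per-key order holds.
        partition = crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition

    def consume(self, consumer, partition, max_messages):
        # Reading advances only this consumer's offset; the log is not modified.
        start = self.offsets[consumer][partition]
        batch = self.partitions[partition][start:start + max_messages]
        self.offsets[consumer][partition] = start + len(batch)
        return batch


topic = Topic(num_partitions=1)
for i in range(5):
    topic.produce("member-42", f"event-{i}")

fast = topic.consume("fast-consumer", 0, max_messages=5)  # reads everything
slow = topic.consume("slow-consumer", 0, max_messages=2)  # reads at its own pace
```

Because the producer only appends and never waits on a reader, the slow consumer can pick up the remaining messages later from where it left off.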

Nuage manages almost all the Kafka topics at LinkedIn. Application developers can create or update their topics through Nuage with necessary information, like Kafka pipeline type, partition number, topic owner, and retention time. Nuage also provides them with the ability to view the metric dashboard, configuration information, topic schema, and cluster information.

Venice

Venice is a distributed key-value storage system. It handles derived data at LinkedIn and leverages the Lambda architecture to mitigate the latency inherent in batch processing. Because it uses partitioning and replication, it scales horizontally far more readily than a traditional relational database. Furthermore, Venice uses efficient caching so that customers do not experience noticeable latency in the distributed environment. Venice leverages Helix for its cluster management, which makes it very easy to operate and scale the cluster.

Nuage not only enables users to perform create, read, update, and delete (CRUD) operations on a Venice store but also allows them to evolve their schemas and promote their stores. Nuage also provides resource-monitoring and quota-management functions for the Venice administrators. Application developers can request quotas, including both traffic quotas and disk quotas, from Nuage, and Venice administrators can approve or reject their requests through Nuage. Nuage also continuously monitors resource usage for all the stores. If a store is overutilized or underutilized, Nuage sends alerts to the store owners asking them to make the corresponding adjustment.
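The utilization check described above can be sketched roughly as follows. This is only an illustration: the `StoreUsage` fields, the function names, and the 20% and 90% thresholds are assumptions, not values or interfaces from Nuage.

```python
from dataclasses import dataclass


@dataclass
class StoreUsage:
    store: str
    owner: str
    quota_gb: float
    used_gb: float


def classify_utilization(usage, low=0.2, high=0.9):
    """Return 'overutilized', 'underutilized', or None (thresholds are assumed)."""
    ratio = usage.used_gb / usage.quota_gb
    if ratio > high:
        return "overutilized"
    if ratio < low:
        return "underutilized"
    return None


def build_alerts(usages):
    # One alert message per store whose usage falls outside the healthy band.
    alerts = []
    for usage in usages:
        status = classify_utilization(usage)
        if status is not None:
            alerts.append(f"To {usage.owner}: store '{usage.store}' is {status}")
    return alerts


alerts = build_alerts([
    StoreUsage("profiles", "alice", quota_gb=100, used_gb=95),
    StoreUsage("badges", "bob", quota_gb=100, used_gb=5),
    StoreUsage("skills", "carol", quota_gb=100, used_gb=50),
])
```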

Espresso

Espresso is LinkedIn’s distributed document database. It currently hosts many important LinkedIn applications, such as member profile, messaging (LinkedIn’s member-to-member messaging system), and portions of the homepage and mobile applications, among others. The purpose of Espresso is to fill the gap between traditional relational database management systems (RDBMSs) and current key-value stores.

Espresso is much more flexible than a traditional relational database for many reasons. For example, it supports easy schema evolution and data center failover. It also offers richer features than simple key-value stores, including a change capture stream and cross-colo replication for maintaining data consistency.

Espresso provides a hierarchical data model: a database contains tables, and a table contains documents. Nuage allows users to manage all of these data layers and their corresponding schemas for their own Espresso databases. Nuage also provides health monitoring, job management, and quota management for Espresso service owners. Espresso also relies on Helix to provide partition management and fault tolerance. Nuage is working to incorporate the Helix information of a data platform, which will further facilitate the service owner’s monitoring and troubleshooting.

How Nuage handles scalability challenges

As the number of systems it manages grows, Nuage itself faces many scalability challenges in managing so many large-scale distributed systems with such large footprints of metadata. We take scalability seriously and factor it into all of our design decisions.

Caching, caching

As the metadata from the data platforms that Nuage manages grows very large, it is unrealistic for Nuage to consult the underlying platforms to serve every request. Instead, Nuage uses caching extensively to reduce latency and improve the user experience.

For example, if a user opens a Nuage page to view all the databases he or she owns, Nuage can render the page in almost no time thanks to caching, as compared to a round trip to the underlying platform, which could significantly hurt the user’s experience. Furthermore, Nuage persists some additional information about the data platforms in its own metadata database, including database owners and descriptions. These fields are for the convenience of application developers and do not need to be passed to the underlying data platform if they get updated.
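As a rough sketch of this pattern, the snippet below implements a small read-through cache with a time-to-live. The class name, the 60-second TTL, and the `list_databases_from_platform` loader are all hypothetical; the post does not describe which caching library or expiry policy Nuage uses.

```python
import time


class ReadThroughCache:
    """Serve from memory when possible; otherwise call the loader and remember."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no round trip to the platform
        value = self._loader(key)  # cache miss or expired entry
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after an update so the next read fetches fresh data.
        self._entries.pop(key, None)


platform_calls = []


def list_databases_from_platform(owner):
    # Stand-in for a slow call to an underlying data platform.
    platform_calls.append(owner)
    return [f"{owner}-db-1", f"{owner}-db-2"]


cache = ReadThroughCache(list_databases_from_platform, ttl_seconds=60)
first = cache.get("alice")   # goes to the platform
second = cache.get("alice")  # served from the cache
```

The trade-off in any such design is staleness: a longer TTL saves more round trips but delays the visibility of changes unless updates explicitly invalidate the affected keys.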

Distributed for load balancing

Similar to the underlying data infrastructure, Nuage is a distributed system, with its frontend and backend services running across multiple data centers at LinkedIn. Customers’ requests are routed to the closest service machines and are load balanced among the distributed set of Nuage instances across data centers. Distribution avoids a single point of failure and improves the system’s availability. It is a must-have to support future growth.
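A simplified way to picture this routing is shown below: pick the data center with the lowest measured latency, then rotate through its instances. The `Router` class, the data center names, and the round-robin policy are assumptions for illustration; the post does not specify the actual routing or balancing algorithm.

```python
import itertools


class Router:
    """Route to the nearest data center, then round-robin within it."""

    def __init__(self, instances_by_dc):
        # One endless iterator per data center to rotate through its instances.
        self._cycles = {
            dc: itertools.cycle(hosts) for dc, hosts in instances_by_dc.items()
        }

    def route(self, latency_ms_by_dc):
        # Only consider data centers this router actually knows about.
        candidates = {
            dc: ms for dc, ms in latency_ms_by_dc.items() if dc in self._cycles
        }
        nearest = min(candidates, key=candidates.get)
        return next(self._cycles[nearest])


router = Router({"dc-a": ["a1", "a2"], "dc-b": ["b1"]})
near_a = {"dc-a": 5, "dc-b": 40}
near_b = {"dc-a": 80, "dc-b": 10}
picks = [router.route(near_a), router.route(near_a), router.route(near_a)]
pick_b = router.route(near_b)
```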

Asynchronous to avoid waiting time

Nuage uses an asynchronous approach to better support operations involving some of its data platforms, such as the search service, as well as operations that usually take a long time to complete, like database promotion. An asynchronous approach dramatically reduces idle cycles and improves throughput. We are currently modifying more operations to use an asynchronous framework, as a synchronous approach generally limits scalability.
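The benefit can be illustrated with Python's `asyncio`, used here only as a stand-in because the post does not name the asynchronous framework Nuage relies on. The `promote_database` coroutine and its short sleep are hypothetical placeholders for a long-running platform call.

```python
import asyncio
import time


async def promote_database(name, delay_seconds):
    # Placeholder for a slow platform operation; the sleep yields control
    # so other operations can make progress while this one waits.
    await asyncio.sleep(delay_seconds)
    return f"{name}: promoted"


async def promote_all(names, delay_seconds=0.05):
    # Start all promotions together; gather returns results in input order.
    tasks = [promote_database(name, delay_seconds) for name in names]
    return await asyncio.gather(*tasks)


started = time.monotonic()
results = asyncio.run(promote_all(["db-a", "db-b", "db-c"]))
elapsed = time.monotonic() - started
```

Since the three waits overlap, the total elapsed time is close to one delay rather than the sum of all three, which is where the reduction in idle cycles comes from.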

Stateless frontend

Nuage is equipped with a stateless frontend that acts as a thin client. The frontend keeps no state of its own, so for the same user input it always produces the same output. This statelessness makes the frontend easy to test and also enables the system to scale, as users’ requests can be routed to any web server without the user noticing any difference.

Easy to monitor and diagnose

Scalability is not only about how much traffic the system can handle; it also encompasses how easy it is to monitor behavior and diagnose problems as the load increases. Nuage automatically sends email alerts to Nuage administrators upon certain failures, such as a failed database provisioning operation. This way, we can detect urgent issues in time. Debugging a distributed system is always difficult and time-consuming. In Nuage, we have set up the ELK stack and tag every request so that we can easily find the relevant logs and investigate the root cause of an issue.
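Request tagging can be sketched with Python's standard `logging` module: every log line produced while serving a request carries the same identifier, so a log search for that identifier returns the full trail. The logger name, the `request_id` field, and the message format are assumptions; the post does not describe Nuage's actual tagging scheme.

```python
import io
import logging
import uuid


class RequestTagAdapter(logging.LoggerAdapter):
    """Prefix every message with the request ID held in `extra`."""

    def process(self, msg, kwargs):
        return f"[request_id={self.extra['request_id']}] {msg}", kwargs


# Log to an in-memory stream so the example is self-contained.
stream = io.StringIO()
base_logger = logging.getLogger("nuage_sketch")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(logging.StreamHandler(stream))
base_logger.propagate = False


def handle_request(operation, request_id=None):
    # Generate an ID at the edge if the caller did not supply one.
    request_id = request_id or uuid.uuid4().hex
    log = RequestTagAdapter(base_logger, {"request_id": request_id})
    log.info("started %s", operation)
    log.error("%s failed", operation)
    return request_id


handle_request("provision_database", request_id="req-123")
handle_request("update_schema", request_id="req-456")
tagged_lines = [
    line for line in stream.getvalue().splitlines() if "request_id=req-123" in line
]
```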

Capacity management and prediction

Nuage manages several clusters for each data platform. By continuously profiling resources, such as disk usage and quota utilization (for example, read and write quotas), Nuage has a good understanding of each cluster’s resource usage history and trends. Therefore, Nuage can provide periodic capacity predictions to the platform teams, giving them advance warning that they need to increase their system’s capacity by either expanding the cluster or moving some databases to a different cluster.

We are also working on enabling Nuage to automatically expand system capacity through underlying resource management tooling, which will further improve the auto-management of our platforms. Capacity prediction helps form the growth plan for underlying database platforms and enables them to scale more smoothly.
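One simple way to turn a usage history into a capacity prediction is to fit a straight line to past samples and extrapolate to the point where usage reaches capacity. The sketch below does this with ordinary least squares. The linear model, the function names, and the sample numbers are assumptions made for illustration; the post does not say which forecasting method Nuage uses.

```python
def fit_line(samples):
    """Least-squares fit of usage = slope * day + intercept."""
    n = len(samples)
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    covariance = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    variance = sum((day - mean_x) ** 2 for day, _ in samples)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def days_until_full(samples, capacity):
    """Days after the last sample until the fitted trend reaches capacity."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None  # usage is flat or shrinking: no predicted exhaustion
    last_day = samples[-1][0]
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - last_day)


# (day, disk used in GB) for a hypothetical cluster with 200 GB of capacity.
history = [(0, 100.0), (1, 110.0), (2, 120.0), (3, 130.0)]
remaining_days = days_until_full(history, capacity=200.0)
```

With usage growing by 10 GB per day from 130 GB, the fitted trend reaches 200 GB seven days after the last sample, which is the kind of lead time a platform team could use to plan an expansion.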

Toward future growth at LinkedIn

We expect tremendous data growth at LinkedIn over the next few years. Nuage plays an essential role in this trend, and our priority is to make Nuage even more scalable and automated. A vital goal on our roadmap is the self-integration framework. Many teams at LinkedIn would like to integrate with Nuage. However, the integration work still requires extensive collaboration between Nuage and the platform team, with a lot of time and effort from engineers.

To scale Nuage, we are making the integration steps more accessible and easier to plug in. Our team is actively working on providing an SDK for all platform developers so that they can plug their platforms into Nuage by themselves. They will be able to choose which frontend and backend features to onboard. Nuage, as a team, will shift the balance of its duties from onboarding new systems to developing new, value-added features.
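Since the SDK is still in progress, its real interface is not described here. Purely as a thought experiment, a self-service integration could look like a small plugin contract that each platform team implements and registers on its own. Every name below (`PlatformPlugin`, `PluginRegistry`, `InMemoryPlatform`) is hypothetical.

```python
from abc import ABC, abstractmethod


class PlatformPlugin(ABC):
    """Hypothetical contract a platform team would implement."""

    name: str

    @abstractmethod
    def create(self, resource, config):
        """Provision a resource on the platform and return its identifier."""


class PluginRegistry:
    """Hypothetical registry that dispatches requests to the right plugin."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        if plugin.name in self._plugins:
            raise ValueError(f"platform already registered: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def create(self, platform, resource, config):
        return self._plugins[platform].create(resource, config)


class InMemoryPlatform(PlatformPlugin):
    # Toy platform used only to exercise the contract.
    name = "in-memory-demo"

    def __init__(self):
        self.resources = {}

    def create(self, resource, config):
        self.resources[resource] = dict(config)
        return f"{self.name}/{resource}"


registry = PluginRegistry()
demo = InMemoryPlatform()
registry.register(demo)
resource_id = registry.create("in-memory-demo", "orders", {"owner": "alice"})
```

The point of such a contract is that onboarding a new platform means writing one plugin class rather than coordinating changes inside Nuage itself.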

I would like to express my sincere appreciation to all of them. I would also like to extend special thanks to Lei Xia for his great feedback on this blog. Finally, a huge thank you to all data platform teams (Espresso, Kafka, Venice, Samza, Ambry, and others) for their continuous cooperation and support.