Outbrain Tech Blog

The Cloud is an Illusion

Cloud service providers have enabled innovations in many areas of our society from the way we watch movies to the way we share media with our family and friends. Companies can focus on amazing products without having to worry about managing data centers, servers, networking equipment, and all of the complexities therein.

But the cloud is an illusion, just an abstraction. Behind every cloud are data centers full of countless racks of servers, routers, firewalls, and engineers who design, deploy, and manage them. There is physical infrastructure that powers the technology we enjoy every day and, at Outbrain, we run the majority of our workloads on bare metal infrastructure. We are, for that matter, our own cloud provider. This post is a peek behind the curtain of how we do this, at scale.

It was a sunny day in California…

In the summer of 2017, our new West Coast data center went live, along with our first fully automated 10G network deployment. It was a great success, which is why we wanted to also roll it out in our other data centers. The new mission was to migrate every server from our legacy 1G network to our shiny new 10G network. Essentially, this meant installing 10G network cards and running new twinax cabling to several thousand servers. No biggie… except:

The vast majority of those servers run production workloads

Many of them run stateful applications such as data stores (mysql, elasticsearch, etc.)

Many of these stateful applications can only tolerate a limited amount of downtime before they start shuffling (a lot of) data around

Servers need to be migrated around the clock, with no downtime to any cluster

The application and networking engineers are in Israel

The physical server engineers are in NYC

There is a 7 hour time difference between NYC and Israel (and the workweeks overlap on only 4 days)

The on-site technicians who work on our servers cannot log into them for reboots, health checks, etc.

How many Engineers does it take to…?

Let’s take a look at everything that goes into a manual migration of one server, so we can better understand the task at hand. We will narrow the focus to one specific case: migrating a mysql node to the new network.

The people involved:

Yuval, an application engineer in Israel, who manages the mysql cluster

Adi, a networking engineer in Israel

Gerry, a physical server engineer in NYC

Mike, an on-site remote hands technician at the data center

The process:

Yuval removes the mysql server from the cluster and makes sure that the cluster is still healthy.

Gerry properly powers down the server, and sends the location details to the remote hands team. Based on the location of the server within the rack, he also lets them know which switch ports to use.

Mike opens up the server, installs the 10G network card, and runs redundant twinax cabling to the 10G switches. He powers it back up, connected to both the legacy and the 10G networks.

Adi prepares the server to join the 10G network and reboots it for the final changes to take effect. After the reboot, he checks that the server is indeed part of the new network and that both twinax connections are up and stable.

Yuval adds the server back to the mysql cluster and makes sure everything looks healthy.

Mike removes the old RJ-45 cables and waits for Yuval and Gerry to prepare and shut down the next server (to avoid multi-node shutdown in the same cluster).

* This is a simplified explanation of the process, and assumes everything goes as smoothly as possible. It requires 3 Outbrain engineers to be available across a major time zone difference + on-site remote hands. That’s 4 engineers to migrate a single node.

And now that we’re done, there are only… a few thousand servers left to migrate… running on various types of hardware, with different versions of Linux, different applications, and different availability restrictions.

Which begs the question – Will it Scale?

Putting Remote Hands in the Driver’s Seat

After a few iterations, this is what the process looks like from Outbrain’s perspective:

Gerry executes a Rundeck job with one mouse click, and then emails Mike a list of server locations.

Gerry goes to sleep.

This is the process from the perspective of a remote hands technician at the data center:

Mike receives a list of server locations, and plugs an iPad into the first server on the list.

The iPad is running Slack and the chat room starts displaying new messages. It lets Mike know that the server is attempting to safely stop mysql and power itself down.

The server shuts itself down, but right before it goes down, it sends a message to the Slack channel explaining all the next steps, including which switch ports to connect the server to.

Mike installs the 10G network card, finishes up the cabling, and powers the server back up.

The server runs the network preparation scripts, reboots itself, checks the network status, starts up mysql, runs health checks, and lets Mike know that it’s time to remove the old cabling and move on to the next server.

Meanwhile, Gerry and Yuval’s teams are getting updates via email every time a server begins and completes the migration process. They can monitor the Slack channel during the process, or even look back at all of the migrations in a Kibana dashboard. If anything ever goes wrong, the iPad communicates that Mike should stop and contact Outbrain. The iPad won’t take any wrong actions if it is plugged into the wrong server or reseated at any point.

How We Built It

When the iPad is plugged into the server, it is recognized by a udev rule. This triggers a wrapper script that contains all of the migration logic. Here’s a deeper look at the individual steps and components:

Rundeck is used to put the desired servers into maintenance mode. It does this by touching empty files on the servers, indicating that they are in maintenance mode and in the first stage of the network migration process.

Chef contains all of the necessary scripts to run the migration. This includes the udev rules and the wrapper, networking, and application specific scripts. The Chef recipe chooses the proper pre and post migration scripts based on the role of the server.

UDEV Rules are very powerful, and we use them to define what happens when the iPad is plugged into a server:
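A rule of roughly this shape does the job (a sketch with a placeholder serial number and script path, not our production rule):

```
# /etc/udev/rules.d/99-migration.rules (illustrative)
ACTION=="add", SUBSYSTEM=="usb", ATTRS{serial}=="IPAD_SERIAL", RUN+="/usr/local/bin/migration-wrapper.sh"
```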

This rule roughly translates to: “When a USB device with this serial number is plugged into the server, run the following wrapper script”

A custom startup script triggers the wrapper script when the server boots back up after a shutdown or reboot, so the next migration step can be performed.

The wrapper script holds all of the logic and functionality of the process.

It only runs if a maintenance file exists on this server.

It checks for a state file, and depending on what it finds, understands which part of the migration process to run next.

It sends the event logs (for the Kibana dashboard), emails (to the proper teams who manage this server), and Slack messages (via webhooks).

It handles the reboots and shutdowns.

When it runs an application specific pre/post migration script, it’s expecting to get an exit status of 0 (or else it will let everyone involved know that something went wrong).

When it runs the network preparation script (which is worthy of its own blog post), it can make a few decisions based on the exit status. It can ring the alarms, move on to the next step, or even let the remote hands technician know that one of the twinax cables seems loose and should be fully seated.

It cleans up by removing all migration related files, including itself.
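Stripped down to its skeleton, the wrapper behaves something like this (a sketch under assumed file names and step names, not the actual script):

```shell
#!/bin/sh
# Sketch of the migration wrapper's state machine (paths are assumptions).
MAINT_FILE="${MAINT_FILE:-/etc/migration/maintenance}"
STATE_FILE="${STATE_FILE:-/etc/migration/state}"

# Map the saved state to the next migration phase to run.
next_step() {
  case "$1" in
    "")            echo "pre-migration"  ;;  # stop the app, power down
    pre-migration) echo "network-prep"   ;;  # configure 10G, reboot
    network-prep)  echo "post-migration" ;;  # verify network, start the app
    *)             echo "cleanup"        ;;  # remove all migration files
  esac
}

run_wrapper() {
  # Only act on servers that Rundeck put into maintenance mode.
  [ -f "$MAINT_FILE" ] || return 0
  state="$(cat "$STATE_FILE" 2>/dev/null || true)"
  step="$(next_step "$state")"
  printf '%s\n' "$step" > "$STATE_FILE"
  # Each step would also notify Slack/email and run the role-specific
  # pre/post migration scripts here.
  echo "running step: $step"
}
```

Each invocation (from the udev rule or the startup script) advances the state file by one phase, which is what lets the same script resume correctly across shutdowns and reboots.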

More time to build more automation =]

Now we can set a bunch of servers to maintenance mode, hand off the list of server locations to remote hands, and continue our daily work while they are migrated.

And since this works so well using the current approach, work is underway to generalise the concept and make it available to many other types of physical maintenance.

This is part two of the Outbrain Haggling Game blog series. For the full description of the game we highly recommend you read the first part here.

When we heard about the game we were very excited. It’s fun to take a break from everyday work every once in a while to create something new and cool. Additionally, these games are a great opportunity to get to know coworkers we don’t get to interact with on a day to day basis. Moreover, the competitive nature of the haggling game was a huge draw to us – we love to win!

Before we describe our strategy and the implementation details, we want to reiterate the rules of the Haggling Game:

The Rules

The game is played by two players who negotiate the division of a group of items between them. The goal of each player is to maximise their score.

At the start of the game a set of ‘items’ is placed on the table (for example, a pair of sunglasses, two cups and a pencil). The players must agree on how to divide these items between them. The items are worth a different amount to each of the players, and the players know only their own preferences, and have no idea of the opponent’s. The total worth of all objects on the table is the same for both players.

The game consists of 9 rounds. In each round one of the players receives an offer of how to split the items and decides whether to accept it or return a counteroffer. If after 9 rounds neither player has accepted any offer the game ends with both players receiving 0. If at any point an offer is accepted each player receives the amount that their portion of the goods is worth to them.

Strategy – General Overview

Technical Strategy

We implemented a collaborative strategy, using two types of players:

The “Overlord”, our winning player

The “Minions”, helper bots whose job was to give points to the Overlord while blocking other players.

The Overlord and Minions recognize each other using a triple handshake protocol, based on mathematical calculations of the game parameters. Whenever a handshake completes successfully the Overlord requests the entire pot, while the Minion accepts the offer. In case the Overlord sees an unsuccessful handshake it moves to a competitive strategy and tries to claim the highest possible rewards using a player-vs-competitor bargaining strategy. Similarly, when a Minion realises that its opponent isn’t the Overlord it requests the entire pot in every round until the end of the game. This effectively blocks the opponent from receiving any points in this game.

Social Engineering

Beyond the technical strategy, we also employed a sort of “social engineering”: we hid the strength of the Overlord by ensuring that for the majority of the two-week development period the Overlord went no higher than third place. Given the amount of hacking going on during the game we wanted to make sure we didn’t draw undue attention, or become a major target to beat. We did this by populating the game with “sleeper cells” — players under our control running basic strategies ready to turn into minions at the right moment.

We waited until the last couple of hours of the game to start converting the minions. During the final hour, the Overlord shot to first place, outpacing the competitors by a huge margin. The strength of the technical strategy is evident from our spectacular win. However, the social engineering aspect is not to be diminished. In the last half hour of the game, we saw an attempt to hack our code in order to beat the Overlord. Our sneaky strategy did not leave enough time for the hacker to bring about our downfall.

Strategy – The Protocol

As we described, our protocol employs a triple handshake between the Overlord and a Minion. The roles of the Overlord and the Minion in the handshake are asymmetric.

The Overlord:

Never initiates the handshake

Keeps the handshake state between rounds (Is the opponent possibly a Minion?)

If at any round the handshake fails, falls back to play the “competitive strategy”

If the opponent is believed to be a minion, demands all items

The Minion:

Always initiates a handshake in the first round

Keeps the handshake state between rounds (is the opponent possibly the Overlord?)

If at any round the handshake fails, blocks the opponent from getting any points by demanding all items.

If the opponent is believed to be the Overlord (the handshake succeeded and the opponent requested the entire pot), accepts the offer

Here we present the protocol flow and the code for the Overlord and the Minion.

For the remainder of this post “Counts” refers to the set of items that are on the table for this game.

Overlord Protocol:

We can see that the Overlord uses both the saved state and the received offer to determine whether this is a valid phase of the handshake.

This is a snippet of the Overlord’s code:

We can see that when the Overlord is the starting player it always plays the competitive strategy, and does not initiate a handshake. The Overlord checks the offers it receives to decide whether a handshake is ongoing. If at any point in the game the Overlord decides that the opponent is not a Minion it switches to the competitive strategy.
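The logic described above can be sketched as follows (a simplified stand-in with our own class, method, and helper names, not the original snippet):

```java
import java.util.Arrays;

// Simplified stand-in for the Overlord's per-turn decision logic.
class Overlord {
    private boolean opponentMayBeMinion = true; // handshake state across rounds

    int[] accept(int[] counts, int[] received, int round) {
        if (received == null) {
            // As the starting player the Overlord never initiates the
            // handshake; it simply plays the competitive strategy.
            return competitiveOffer(counts);
        }
        if (opponentMayBeMinion
                && Arrays.equals(received, expectedHandshakeOffer(counts, round))) {
            // The handshake is still valid: demand the entire pot.
            return counts.clone();
        }
        // The handshake failed at some round: switch to the competitive
        // strategy for the rest of the game.
        opponentMayBeMinion = false;
        return competitiveOffer(counts);
    }

    // Stand-in for the deterministic protocol offer both sides compute.
    static int[] expectedHandshakeOffer(int[] counts, int round) {
        int[] offer = counts.clone();
        int i = round % counts.length;
        offer[i] = Math.max(0, offer[i] - 1);
        return offer;
    }

    // Stand-in for the competitive strategy's offer.
    static int[] competitiveOffer(int[] counts) {
        int[] offer = counts.clone();
        offer[0] = Math.max(0, offer[0] - 1); // concede one of the first item
        return offer;
    }
}
```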

Minion Protocol:

The flow of a Minion playing against the Overlord:

The flow of a Minion playing against a competitor:

This is a snippet of the Minion’s code:

We can see that the Minion always initiates a handshake on its first offer, regardless of what it received. At any point in the game, if the Minion decides that the opponent is not the Overlord, it switches to a blocking strategy.
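The Minion’s side can be sketched as follows (again a simplified stand-in with our own names, not the original snippet):

```java
import java.util.Arrays;

// Simplified stand-in for the Minion's per-turn decision logic.
class Minion {
    private boolean firstOffer = true;
    private boolean opponentMayBeOverlord = true; // handshake state

    int[] accept(int[] counts, int[] received, int round) {
        if (firstOffer) {
            // Always initiate the handshake on our first offer,
            // regardless of what we received.
            firstOffer = false;
            return handshakeOffer(counts, round);
        }
        if (opponentMayBeOverlord && received != null
                && Arrays.equals(received, counts)) {
            // The opponent demanded the entire pot after a valid handshake:
            // it is the Overlord, so accept and hand everything over.
            return received;
        }
        // Not the Overlord: block the opponent by demanding all items in
        // every remaining round.
        opponentMayBeOverlord = false;
        return counts.clone();
    }

    // Stand-in for the deterministic protocol offer both sides compute.
    static int[] handshakeOffer(int[] counts, int round) {
        int[] offer = counts.clone();
        int i = round % counts.length;
        offer[i] = Math.max(0, offer[i] - 1);
        return offer;
    }
}
```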

Protocol Offer

The key to making the protocol work as we designed is to create a proper offer for the handshake. The offer must be deterministic and computable by both sides. Moreover, it should not be static: we want it to change from game to game, and also from round to round.

The main reason for that is to lower the probability that the opponent can learn our strategy, or that their strategy accidentally coincides with our protocol, giving them the win.

Additionally, the offer must not be too bad for us, so that we don’t give away free points. If the opponent accepts the offer while we are still trying to perform the handshake, we won’t lose too much.

Before we describe the algorithm for calculating the offer, let’s go back to the game rules:

Both players know which items are on the table for bargaining

Each player receives the round number and the opponent’s offer in each round

External communication, such as HTTP calls or other IO, is not allowed

In every game there is a randomisation of the bargaining items

From the rules, we can see that the only common information between the two players is the group of bargaining items, the offer, and the round number. Moreover, the only information that passes between the two players is the offer.

Thus, we built the protocol offer as follows:

For each round we need to create an offer that will fulfil the requirements described above. In order to do so, we decided to create an offer by picking a single item in each round. We offer a single instance of this item to the competitor, keeping the rest of the items to ourselves.

This way we try to minimise the worth of the offer to the competitor, while also minimising our loss if they accept it.

In order to choose the item to offer the opponent, we hash the items and amounts in “Counts” along with the current round number. We sort the items in “Counts” by their names and amounts, then use the hash value modulo the size of “Counts” to choose the item to offer. “Counts” changes every game and the round number changes every round, meaning that the hash value will be different in every round of every game.

This is a snippet of the offer code
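A simplified stand-in for that snippet, following the description above (the class, method names, and exact hashing details are ours, not the original code):

```java
import java.util.Arrays;

// Sketch of the deterministic protocol-offer calculation.
class ProtocolOffer {
    // Deterministically pick which single item to offer the opponent,
    // given the items on the table ("Counts") and the current round.
    static int chooseItemIndex(int[] counts, String[] names, int round) {
        // Build a canonical, sorted description so both sides hash the
        // same value regardless of item order.
        String[] descriptors = new String[counts.length];
        for (int i = 0; i < counts.length; i++) {
            descriptors[i] = names[i] + ":" + counts[i];
        }
        Arrays.sort(descriptors);
        // Hash the canonical item list together with the round number,
        // then map the hash onto a valid item index.
        int hash = (String.join(",", descriptors) + "#" + round).hashCode();
        return Math.floorMod(hash, counts.length);
    }

    // Build the offer: one instance of the chosen item goes to the
    // opponent; we keep everything else (table items have count >= 1).
    static int[] buildOffer(int[] counts, String[] names, int round) {
        int[] keep = counts.clone();
        keep[chooseItemIndex(counts, names, round)] -= 1;
        return keep;
    }
}
```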

Competitive Strategy

We started the game by developing a strategy for playing the game by the book – no Minions. We compared the results of several different kinds of strategies, from the simplest greedy algorithms to attempts to analyse the opponents’ preferences and build an offer accordingly. In the end, in order to emphasise the strength of the Overlord-Minion coalition we chose one of the simpler attempts. This was to show that even with a “bad” strategy the Minions could elevate the Overlord to the winning position.

We’ll describe the two basic directions we developed:

1. Greedy:

During the initialisation phase we built a static list of “best offers”. These were offers that ensured we got as many points as possible, sorted in descending order of value to us. We assumed that an offer which gave nothing to the opponent would not be accepted. Thus all the “best offers” included some items to be given to the opponent. The “best offer” options were hardcoded for each round. We tried several different configurations, sometimes allowing for more losses, sometimes for less. Finally, we compared the results of several such players and chose one for the basic master strategy.

2. Algorithmic:

The algorithmic strategy was to try to give the opponent the best deal we thought we could, while ensuring that our own gains would not fall below a certain threshold.

We kept a score for each item in counts, based on the number of times the opponent was willing to offer it to us. Our assumption was that the opponent was more likely to offer us items that were worth less to them. We sorted the items by their perceived importance to the opponent and built an offer containing items we thought were important to them, while not giving over items with the highest worth to us. This was an attempt to “sweeten the deal”. In this case we both win, but hopefully we win more. Here too we played with different configurations. This was done by changing amounts of attractive and unattractive items to the opponent, and playing with our threshold value. This strategy showed promising results, but was nowhere near those of the Overlord-Minion collaboration.
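The bookkeeping described above can be sketched like this (a simplified stand-in with our own names, not the original code):

```java
// Sketch of the opponent-preference scoring behind the algorithmic strategy.
class PreferenceTracker {
    private final int[] offeredToUs; // times the opponent offered us each item

    PreferenceTracker(int itemCount) {
        offeredToUs = new int[itemCount];
    }

    // Record an incoming offer: items the opponent is willing to give us
    // are probably worth less to them.
    void record(int[] offerToUs) {
        for (int i = 0; i < offerToUs.length; i++) {
            if (offerToUs[i] > 0) offeredToUs[i]++;
        }
    }

    // The item offered to us least often is assumed to be the one the
    // opponent values most; conceding it "sweetens the deal".
    int likelyMostValuedByOpponent() {
        int best = 0;
        for (int i = 1; i < offeredToUs.length; i++) {
            if (offeredToUs[i] < offeredToUs[best]) best = i;
        }
        return best;
    }
}
```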

Results

The final result of the game was the crushing defeat of all those who dared to oppose us.

We can bask in our glory all day, but it is more interesting to describe the way we reached this victorious result. As we mentioned above, we felt that keeping a low profile was imperative to winning the game. The multiple hacking attempts and the general competitive atmosphere only served to strengthen this decision. Once we had finished working on the code we started to carefully test how much of a lift the Minions offered. The graph on the right is from a couple of days before the end of the competition. We enabled four minions, and saw that the Overlord immediately shot ahead of the competitors. We quickly disabled the minions, so as not to draw attention. In the final hour before the end of the game we enabled 20 minions, giving us the enormous lead seen in the graph on the left.

All hail the Overlord!

P.S.

Stay tuned to hear about how the game was hacked, and the different hacking strategies used by Roy Bass.

The past decade has produced substantial research verifying what may come as no surprise: developers want to have fun. While we also need our salaries, salaries alone will not incentivize us developers who, in most cases, entered a field to do what we love: engage in problem-solving. We like competition. We like winning. We like getting prizes for winning. To be productive, we need job satisfaction. And job satisfaction can be achieved only if we get to have fun using the skills we were hired to use.

We wanted to keep the backend developers challenged and entertained.
That’s why Guy Kobrinsky and I created our own version of Haggling, whose basic idea we adapted from Hola, a negotiation game.

The Negotiation Game:

Haggling consists of rounds of negotiations between pairs of players. Each player’s goal is to maximize their own score in the following manner:

Let’s say there are a pair of sunglasses, two tickets, and three cups on the table. Both players have to agree on how to split these objects between them. To one player, the sunglasses may be worth $4, a cup $2, and the tickets worthless. The opponent might value the same objects differently; while the total worth of all the objects is the same for both players, each player’s valuation is kept secret.

Both players take turns making offers to each other about how to split the goods. A proposed split must distribute all objects between the partners such that no items are left on the table. On each turn, a player can either accept the offer or make a counter-offer. If an agreement is reached within 9 offers, each player receives the amount that their portion of the goods is worth, according to the assigned values. If there is still no agreement after the last turn, both players receive no points.

The Object of the Game:

Write code to obtain a collection of items with the highest value by negotiating items with an opponent player.

User Experience:

We wanted it to be as easy as possible for players to submit, play, and test their code.
Therefore, we decided to keep player code simple – not relying on any third-party libraries.
To do this, we built a simple web application for testing and submitting code, supplying a placeholder with the method “accept” – the code that needs to be implemented by the participants. The “accept” method describes a single iteration within the negotiation, in which each player must decide whether to accept the offer given to them (by returning null or the received offer) or to return a counter offer.

To assist in verifying their strategy, we added a testing feature allowing players to run their code against a random player. Developers were able to play around with it, re-implementing their code before actual submission.

Java Code Example:
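The original example isn’t reproduced here; a sketch of the kind of player the placeholder expects might look like this (the exact method signature and parameter names are assumptions):

```java
// Hypothetical player skeleton. "accept" is called once per negotiation turn.
class MyPlayer {
    // counts - how many of each item are on the table
    // values - what each item is worth to this player
    // offer  - the share proposed to us by the opponent, or null
    // round  - the current round number
    // Returning the received offer accepts it; returning a new array makes
    // a counter-offer.
    int[] accept(int[] counts, int[] values, int[] offer, int round) {
        if (offer != null && worth(offer, values) >= worth(counts, values) / 2) {
            return offer; // good enough: accept
        }
        return counts.clone(); // naive counter-offer: ask for everything
    }

    private int worth(int[] portion, int[] values) {
        int total = 0;
        for (int i = 0; i < portion.length; i++) {
            total += portion[i] * values[i];
        }
        return total;
    }
}
```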

Test Your Code and Submit Online:

Tournament And Scoreboard:

Practice tournaments ran continuously for two weeks, taking all submitted players into account and allowing developers to see their rank. During this time, competitors were able to edit their code, so there was plenty of time to learn and improve.

We also provided analytics for every player. Developers were able to analyze and improve their strategy.

At the end of the two weeks, we declared a code freeze and the real tournament took place. Players’ final score was determined only from the results of the real tournament, not the practice tournaments.

Game Execution And Score:

We executed the game tournament using multiple agents, each of which reported its results to Kibana:

The Back-Stage:

Where did we store players’ code?
We decided to store all players’ code in AWS S3 to avoid revealing the code to other players.

What languages were supported?

We started with Java only, but players expressed interest in using Scala and Kotlin as well. So we gave these developers free rein to add support for those languages, which we then reviewed before integrating into the base code. Ultimately, developers were able to play in all three languages.

What was the scale of Haggling?

In the final tournament, 91 players competed in 164 million rounds in which 1.14 billion “accepts” were called. The tournament was executed on 45 servers, with 360 cores and 225GB of memory.

The greatest advantage of our approach was our decision to use Kubernetes, enabling us to add more nodes, as well as tune their cores and memory requirements. Needless to say, it was no problem to get rid of all these machines when the game period ended.

How did the tournament progress?

The tournament was tense, and we saw a lot of interaction with the game over the two weeks.
The player in the winning position changed every day, and the final winner was not apparent until very near the end (and even then we were surprised!).
We saw a variety of single-player strategies with sophisticated calculations and different approaches to gameplay.
Moreover, in contrast to the original game, we allowed gangs: groups of players belonging to a single team that can “help” each other to win.

So how do you win at haggling?

The winning strategy was collaborative – the winning team created two types of players: the “Overlord” which played to win, and several “Minions” whose job was to give points to the Overlord while blocking other players. The Overlord and Minions recognized each other using a triple handshake protocol, based on mathematical calculations of the game parameters. Beyond this, the team employed a human psychological strategy – hiding the strength of the Overlord by ensuring that for the majority of the development period the Overlord went no higher than third place. They populated the game with “sleeper cells” – players with basic strategies ready to turn into minions at the right moment. The upheaval occurred in the final hour of the game when all sleepers were converted to minions.

The graph shows the number of commits in the last hour before the code freeze:

Hats Off to the Hacker: who got the better of us?

During the two weeks, we noticed multiple hacking attempts. The hacker’s intent was not to crash the game, but rather to prove that it is possible and make a lesson of it.
Although it was not our initial intent, we decided to make hacking part of the challenge and to reward the hacker for demonstrated skills and creativity.

On the morning of November 7th, we arrived at the office and were faced with the following graph of the outcomes:

The game had been hacked! As can be seen in the graph, one player was achieving an impossible success rate. What we discovered was the following: the read-only hash map that we provided as a method argument to players was written in Kotlin; but when players converted the map to play in either Java or Scala, the conversion produced a mutable hash map, and this is how one of the players was able to modify it. We had failed to validate the preferences, i.e., to ensure that the hash map values players returned matched the original.

In conclusion, this is exactly the sort of sandbox experience that makes us better, safer, and smarter developers. We embraced the challenge.

Micro Front Ends — Doing it Angular Style Part 2

In the previous part, I talked about the motivations for moving towards an MFE solution and some of the criteria for a solution to be relevant. In this part, I’ll get into how we implemented it at Outbrain.

As I mentioned in the previous part, one of the criteria was a solution that could integrate with our current technological ecosystem and require little or no change to the applications we currently maintain.

This is a big step towards our goal of separating our application into mini applications.

Moving from feature modules to mini apps

Angular feature modules along with Webpack bundling give us the code separation we need, but this is not enough. Webpack only allows us to create bundles as part of a single build process. What we want is to be able to produce a separate JS bundle, built at a different time from a separate code base in a separate build system, that can be loaded into the application at runtime and share common resources, such as Angular.

In order to resolve this, we created our own Webpack loader, called share-loader.

Share-loader allows us to specify a list of modules that we would like to share between applications. It bundles a given module into one application’s JS bundle and provides a namespace through which other bundles can access those modules.
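The configuration might look roughly like the following (the option names here are illustrative assumptions, not the exact share-loader API):

```javascript
// webpack.config.js for application A (illustrative):
// bundle angular and lodash and expose them under the 'container-app' namespace.
module.exports = {
  module: {
    rules: [
      {
        test: /\.js$/,
        use: [
          {
            loader: 'share-loader',
            options: {
              modules: [/@angular/, /lodash/],
              namespace: 'container-app',
            },
          },
        ],
      },
    ],
  },
};

// Application B's config would instead mark angular and lodash as externals
// that resolve from the 'container-app' namespace at runtime, rather than
// bundling its own copies.
```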

In this example, we are telling Webpack to bundle angular and lodash into application A and expose it under the ‘container-app’ namespace.

In application B, we are defining that angular and lodash will not be bundled but rather be pointed to by the namespace ‘container-app’.

This way, we can share some modules across applications but maintain others that we wish not to share.

So far we have tackled several of the keys we specified in the previous post. We now have two applications that can be run independently or loaded remotely at runtime, wrapped in a JS namespace, with CSS and HTML encapsulation. They can also share modules between them and encapsulate modules that shouldn’t be shared. Now let’s look into some of the other keys we mentioned.

DOM encapsulation

In order to tackle CSS encapsulation, we wrapped each mini-app in a generic Angular component. This component uses Angular’s CSS encapsulation feature, with two options: emulated mode or native mode, depending on the browser support we require. Either way, we are sure that our CSS will not leak out.

This wrapper component also serves as a communication layer between each mini-app and the other apps. All communication is done via an event bus instance that is hosted by each wrapper instance. By using an event system we have a decoupled way to communicate data in and out, which we can easily clear when a mini application is removed from the main application.
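The event bus hosted by each wrapper can be as small as this (a minimal sketch of the assumed design, not the original implementation):

```typescript
// Minimal per-wrapper event bus. Each mini-app wrapper hosts its own
// instance, so clearing it tears down all of that app's subscriptions.
type Handler = (payload: unknown) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload?: unknown): void {
    (this.handlers.get(event) ?? []).forEach((h) => h(payload));
  }

  // Called when the mini-app is removed from the container: drop all
  // subscriptions so nothing leaks across app lifecycles.
  clear(): void {
    this.handlers.clear();
  }
}
```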

If we take a look at the situation so far, we can see that we have a solution very much in line with the web component concept: each mini application is wrapped by a standalone component that encapsulates all JS, HTML, and CSS, and all communication is done through an event system.

Testing

Since each application can also run independently, we can run test suites on each one independently. This means each application owner knows when their changes have broken the application, and each team is concerned mostly with their own application.

Deployment and serving

In order to provide each application with its own deployment, we created a Node service for each application. Each time a team creates a new deployment of their application, a JS bundle is created that encapsulates the application, and each service exposes an endpoint that returns the path to the bundle. At runtime, when a mini app is loaded into the container app, a call to the endpoint is made and the JS file is loaded and bootstrapped into the main application. This way each application can be built and deployed separately.

Closing Notes:

Thanks for reading! I hope this article helps companies that are considering this move to realize that it is possible to do it without revolutionizing your code base.

Moving to a Micro Front End approach is a move in the right direction, as applications get bigger, velocity gets smaller.

This article shows a solution using Angular as a framework, similar solutions can be achieved using other frameworks.

At Outbrain we work at a fast pace, trying to combine the challenge of developing new features quickly with maintaining systems that can cope with the constant growth of traffic. We deliver many changes on a daily basis to our production and testing environments, so our velocity is greatly affected by our DevOps tools. One of the tools we use the most is the deployment tool, since every new artifact must be deployed to the simulation and staging environments and pass its tests before it can be deployed to production.

The simulation environment is used for running E2E integration tests. These tests simulate real use cases and involve all relevant services. The staging environment is actually a single production machine (AKA a canary machine) which receives a small portion of production traffic. It allows us to make sure the new version works properly in the production environment before we deploy it to the rest of the production servers.

In this post, you’ll find out how we increased velocity with safe automatic deployment of high-scale services.

Our deployment flow

The illustration above depicts the flow each code change must pass until it arrives in production.

A developer commits code changes and triggers a “build & deploy” action that creates an artifact for the requested service and deploys it to our simulation servers. Once an hour, a build in TeamCity runs the simulation tests of our services.

If the developer doesn’t want to wait for the periodic run, they need to run the simulation tests manually. Once the build passes, the developer is allowed to deploy the artifact to the staging server. At this point, we verify that the staging server behaves properly by reviewing various metrics of the server, and by checking the logs of that server.

For instance, we verify that the response time hasn’t increased and that there are no errors in the log. Once all these steps are completed, the new version is deployed to all production servers. This whole process can take 30-45 minutes.

As one can see, this process has a lot of problems:

It requires many interventions of the developer.

The developer either spends time waiting for actions to complete in order to trigger the next ones or they suffer from context switches which slow them down.

The verification of the version in staging is done manually, and hence:

It’s time-consuming.

There is no certainty that all the necessary tests are performed.

It’s hard to share knowledge among team members about the expected result of each test.

The new automatic pipeline

Recently we have introduced a pipeline in Jenkins that automates this whole process. The pipeline allows a developer to send code changes to any environment (including production) simply by committing them to source control, while ensuring that these changes don’t break anything.

The illustration below shows all stages of our new pipeline

Aside from automating the whole process, which was relatively easy, we had to find a way to automate the manual tests of our staging environment. As mentioned, our staging servers serve real requests coming from our users.

Some of our services handle around 2M requests per minute, so any bad version can affect our customers, our users, and us very quickly. Therefore we want to identify bad versions as soon as possible. To tackle this, the pipeline starts running health tests on the staging servers 5 minutes after the server goes up, since it sometimes takes time for the servers to warm up.

The tests, which are executed by TeamCity, pull a list of metrics for the staging server from our Prometheus server and verify that they meet the criteria we defined. For example, we check that the average response time is below a certain number of milliseconds. If one of these tests fails, the pipeline fails. At that point, the developer who triggered the pipeline receives a notification e-mail so that they can look into it and decide whether the new version is bad and should be reverted, or whether the tests need more fine-tuning and the version is okay to deploy to the rest of the servers.
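The health-check logic can be sketched as follows. The PromQL queries, metric names, and thresholds below are hypothetical, not our actual criteria; only the shape of the Prometheus instant-query HTTP API is standard:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical criteria: PromQL query -> maximum allowed value.
CRITERIA = {
    # average response time over the last 5 minutes, in milliseconds
    'avg(rate(request_time_ms_sum[5m]) / rate(request_time_ms_count[5m]))': 250.0,
    # errors per second
    'sum(rate(errors_total[5m]))': 0.1,
}

def query_prometheus(base_url, promql):
    """Run an instant query against the Prometheus HTTP API and return its value."""
    with urlopen(base_url + '/api/v1/query?' + urlencode({'query': promql})) as resp:
        payload = json.load(resp)
    # Instant-query results carry [timestamp, "value"] pairs.
    return float(payload['data']['result'][0]['value'][1])

def failed_checks(values, criteria):
    """Return the queries whose measured value exceeds their threshold."""
    return [q for q, limit in criteria.items() if values[q] > limit]
```

Five minutes after the staging server goes up, the pipeline would collect `values = {q: query_prometheus(url, q) for q in CRITERIA}` and fail the build if `failed_checks(values, CRITERIA)` is non-empty.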

The pipeline ends when the new version is deployed to production, but this doesn’t necessarily mean the version is 100% okay, although the chances of that at this stage are low.

To ensure our production servers function properly, many periodic tests constantly monitor them, trigger alerts in case of failure, and allow us to react fast and keep our services available.

What we gained

The automated deployment process ensures the quality of our deliveries and that they don’t break our production servers.

Reduction of the time developers spend on DevOps tasks.

The decision whether a version in staging is okay is more accurate as it is based on comparable metrics and not on a subjective decision of the developer.

The developer doesn’t need to remember which metrics to check for each service in order to tell whether a service functions properly.

Outbrain has a long history with Cassandra. We run around 30 production clusters of different sizes, ranging from small ones to a cluster with 100 nodes spanning 3 datacenters.

Background

Cassandra has proven to be a very reliable choice of datastore that employs an eventual consistency model. We used the Datastax Enterprise 4.8.9 distribution (equivalent to Cassandra 2.1), and as Cassandra 3 evolved it became clear that at some point we would have to plan an upgrade. We ended up with two options: either upgrade to a newer commercial distribution provided by Datastax, or migrate to the Apache Cassandra distribution. Since Outbrain uses Cassandra extensively, it was an important decision to make.
At first, we identified that we do not use any of the Datastax Enterprise features and that we were quite happy with the basics still offered by Apache Cassandra. So, at this point, we decided to give Apache Cassandra 3.11.1 a try.
When planning an upgrade, aside from the necessity of staying on a supported version, there are always additional goals, like leveraging a new feature or seeking performance improvements. Our main goal was to get more from the hardware we have and potentially reduce the number of nodes per cluster.

In addition, we wanted an in-place upgrade: instead of building new clusters, we would perform the migration on the production clusters themselves while they were still running. It is like replacing the tracks under a moving train, and in our case a very big train moving at high speed.

The POC

But before defining the in-place upgrade, we needed to validate that the Apache Cassandra distribution fit our use cases, and for that we performed a POC. After selecting one of our clusters for the POC, we built a new Apache Cassandra 3.11.1 cluster. Given the new storage format, and hence the read path optimisations, in Cassandra 3, we started with a small cluster of 3 nodes in order to see if we could benefit from them to reduce the cluster size, that is, whether we could achieve acceptable read and write latencies with fewer nodes.
The plan was to copy data from the old cluster using sstableloader, while the application would employ a dual-writes approach before we started the copying. This way we would have two clusters in sync, and we could direct our application to do some or all reads from the new cluster.
And then the problems started. At first, it became clear that with our multi-datacenter configuration we did not want sstableloader to stream data to foreign datacenters; we wanted to control the target datacenter of the streaming. We reported a bug, and instead of waiting for a new release of Apache Cassandra we made our own custom build.
After passing this barrier we reached another one, related to LeveledCompactionStrategy (LCS) and how Cassandra performs compactions. With a cluster of 3 nodes and 600GB of data in a table with LCS, it took a few weeks to put the data into the right levels. Even with 6 nodes it took days to complete compactions following the sstableloader data load. That was a real pity, because read and write latencies were quite good even with fewer nodes. Luckily Cassandra has a feature called multiple data directories. At first glance it looks like something to do with allowing more data, but this feature actually helps enormously with LCS: once we tried it and played with the actual number of data directories, we found that we could compact 600GB of data on L0 in less than a day with three data directories.

Encouraged by these findings, we conducted various tests on our Cassandra 3.11 cluster and realised that not only could we benefit from storage space savings and a faster read path, but we could also potentially put more data per node on clusters where we use LCS.

Storage before and after the migration

The plan

The conclusions were very good and indicated that we were happy with the Apache Cassandra distribution and could start preparing our migration plan using an in-place upgrade. At a high level, the migration steps were:

Upgrade to the latest DSE 4.x version (4.8.14)

Upgrade to DSE 5.0.14 (Cassandra 3)

Upgrade sstables

Upgrade to DSE 5.1.4 (Cassandra 3.11)

Replace DSE with Apache Cassandra 3.11.1

After defining the plan, we were ready to start migrating our clusters. We mapped our clusters and categorised them according to whether they are used for online serving or for offline activities. Then we were good to go, and performed the upgrade cluster by cluster.

Upgrade to the latest DSE 4.x version (4.8.14)

The first upgrade, to DSE 4.8.14 (we had 4.8.9), is a minor one. The reason this step is needed at all is that DSE 4.8.14 includes all the important fixes required to co-exist with Cassandra 3.0 during the upgrade, and our experience confirmed that DSE 4.8.14 and DSE 5.0.10 (Cassandra 3.0.14) indeed co-existed very well. The upgrade should be done in a rolling fashion, one node at a time. “One node at a time” applies within a datacenter only; datacenters can be processed in parallel (which is what we actually did).

Upgrade to DSE 5.0.14 (Cassandra 3)

Even if you aim for DSE 5.1, you cannot jump to it directly from 4.x; the 4.x to 5.0.x to 5.1.x chain is mandatory. The steps and prerequisites for the upgrade to DSE 5.0 are documented quite well by Datastax. What is probably not documented as well is which driver version to use and how to configure it so that the application works smoothly with both DSE 4.x and DSE 5.x nodes during the upgrade. We instructed our developers to use cassandra-driver-core version 3.3.0 and to enforce protocol version 3, so that the driver can successfully talk to both DSE 4.x and DSE 5.x nodes. Another thing to check before starting the upgrade is that all your nodes are healthy. Why? Because one of the most important limitations of clusters running both DSE 4.x and 5.x is that you cannot stream data, which means that bootstrapping a node during node replacement is simply not possible.

For clusters where we had SLAs on latencies we wanted more confidence, so for those clusters we first upgraded a single node in one datacenter to DSE 5.0.x. This is a 100% safe operation: since it is the first node running DSE 5.0.x, it can still be replaced using the regular node replacement procedure with DSE 4.x. One very important point here is that DSE 5.0 does not force any schema migration until all nodes in the cluster are running DSE 5.0.x. This means the system keyspaces remain untouched, which is what guarantees that a DSE 5.0.x node can be replaced with DSE 4.x. Since we have two kinds of clusters (serving and offline) with different SLAs, we took two different approaches: for the serving clusters we upgraded the datacenters sequentially, leaving one datacenter running DSE 4.x, so that if something went really wrong we would still have a healthy datacenter on DSE 4.x; for the offline clusters we upgraded all the datacenters in parallel. Both approaches worked quite well.

Upgrade sstables

Whether clusters were upgraded in parallel or sequentially, we always wanted to start the sstable upgrade as quickly as possible after DSE 5.0 was installed. The sstable upgrade is the most important phase of the upgrade because it is the most time-consuming one. There are not many knobs to tune:

control the number of compaction threads with the -j option of the nodetool upgradesstables command.

control the compaction throughput with nodetool setcompactionthroughput.

Whether -j actually speeds up the sstable upgrade depends a lot on how many column families you have, how big they are, and which compaction strategy is configured. As for the compaction throughput, we determined that for our storage system it was okay to set it to 200 MB/sec, which allowed regular read/write operations to be carried out at reasonable higher-percentile latencies; this value certainly has to be identified for each storage system individually.
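As a rough illustration, these two knobs can be driven per node from a script. This is only a sketch: it assumes nodetool is on the operator's PATH, and the host names and values are illustrative.

```python
import subprocess

def upgrade_commands(host, threads=2, throughput_mb=200):
    """nodetool invocations for one node: throttle compactions, then rewrite sstables."""
    return [
        # Cap compaction throughput so regular reads/writes keep reasonable latencies.
        ['nodetool', '-h', host, 'setcompactionthroughput', str(throughput_mb)],
        # -j controls how many sstables are upgraded in parallel.
        ['nodetool', '-h', host, 'upgradesstables', '-j', str(threads)],
    ]

def upgrade_node(host, **kwargs):
    """Run the upgrade sequence on a single node, failing fast on errors."""
    for cmd in upgrade_commands(host, **kwargs):
        subprocess.run(cmd, check=True)
```

A wrapper like this makes it easy to iterate over a datacenter's nodes and keep the throttling consistent across the fleet.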

Upgrade to DSE 5.1.4 (Cassandra 3.11)

In theory, the upgrade from DSE 5.0.x to DSE 5.1.x is a minor one: just upgrade the nodes in a rolling fashion, exactly as minor upgrades were done within DSE 4.x; the storage format stays the same and no sstable upgrade is required. But when upgrading our first reasonably large cluster (in terms of node count) we noticed something really bad: nodes running DSE 5.1.x suffered from never-ending compactions in the system keyspace, which killed their performance, while the DSE 5.0.x nodes were running well. It was a tough moment, because that was a serving cluster with SLAs, and we had not seen anything like it in our previous cluster upgrades! We suspected that something was wrong with the schema migration mechanism, and the nodetool describecluster command confirmed that there were two different schema versions: one from the 5.0.x nodes and one from the 5.1.x nodes. So we decided to speed up the DSE 5.1.x upgrade, under the assumption that once all nodes were on 5.1.x there would be no issues with different schema versions and the cluster would settle. This is indeed what happened.

Replace DSE with Apache Cassandra 3.11.1

So how did the final migration step, replacing DSE 5.1.4 with Apache Cassandra, go? In the end, it worked well across all our clusters. Basically, the DSE packages have to be uninstalled, and Apache Cassandra has to be installed and pointed at the existing data and commit log directories.

However, two extra steps were required:

DSE creates DSE-specific keyspaces (e.g. dse_system). These keyspaces are created with EverywhereStrategy, which is not supported by Apache Cassandra, so before replacing DSE with Apache Cassandra on a given node, the keyspace replication strategy has to be changed.

A few system column families receive extra columns in DSE deployments (system.peers and system.local), so if you attempt to start Apache Cassandra after DSE is removed, it will fail the schema check against these column families. There are two ways to work around this. The first option is to delete all system keyspaces, set auto_bootstrap to false, and set the allow_unsafe_join JVM option to true; assuming the node’s IP address stays the same, Cassandra will re-create the system keyspaces and jump back into the ring. We used an alternative approach: a bunch of scripts that first exported the tables with the extra columns to JSON, then imported them into a helper cluster whose system tables had the same schema as Apache Cassandra’s. We then imported the data from the JSON files and copied the resultant sstables to the node being upgraded.

Minor version, big impact

Cassandra 3.11 clearly has vast improvements in storage format efficiency and read latency. However, even within a given Cassandra major version, a minor version can introduce a bug that heavily affects performance. Our latency improvements were achieved on Cassandra 3.11.1, but this very specific version introduced an unpleasant bug (CASSANDRA-14010, fix sstable ordering by max timestamp in singlePartitionReadCommand).

The bug was introduced during refactoring and meant that for tables with STCS almost all sstables were queried on every read. Typically Cassandra sorts sstables by timestamp in descending order and starts answering a query from the most recent sstables until it can fulfil the query (with some restrictions, e.g. for counters or unfrozen collections); the bug made sstables sort in ascending order instead. It was fixed in 3.11.2, and the impact was clear: the number of sstables per read at the 99th percentile dropped after the upgrade. Reading fewer sstables naturally leads to lower local read latencies.

Drop in sstables per read

Local read latency drop on Cassandra 3.11.2

Always mind page cache

On one of our newly built Cassandra 3.11.2 clusters we noticed unexpectedly high local read latencies. This came as a surprise, because the cluster was specifically dedicated to a table with LCS. We spent quite some time in blktrace/blkparse/btt confirming that our storage by itself was performing well. We then identified that the high latency affected only the 98th and 99th percentiles, while the btt output made it clear that there were write IO bursts resulting in very large blocks, thanks to the scheduler merge policy. The next step was to rule out compactions, but even with compactions disabled we still saw the write bursts. Looking further, it became evident that the page cache defaults were different from what we typically use: the culprit was vm.dirty_ratio, which was set to 20%. With 256GB of RAM, flushing 20% of it certainly results in an IO burst that affects reads. After tuning the page cache (we opted for vm.dirty_bytes and vm.dirty_background_bytes instead of vm.dirty_ratio), the high-percentile read latencies dropped.

Page cache flush settings impact on local read latency

The results

So, in the end – what were our gains from Apache Cassandra 3.11?
Well, aside from any new features, we got lots of performance improvements. The major storage savings translate directly into lower read latencies, because more data fits into the OS cache, which improved the clusters’ performance a lot.

In addition, the newer storage format results in lower JVM pressure during reads (the object creation rate decreased by 1.5 to 2 times). The following image describes the decrease in read latency in our datacenters (NY, CHI & SA) during load tests (LT) that we performed before and after the migration.

Read latency before and after the migration

Another important aspect is the new metrics exposed. A few interesting ones:

The ability to identify the consistency level requested for both reads and writes. This is useful to track whether the requested consistency level is occasionally not what would be expected for a given column family, for example if QUORUM is requested instead of LOCAL_QUORUM due to a bug in the application configuration.

Metrics on the network messaging.

Read metrics before the migration

Read metrics after the migration

We did something that lots of people thought was impossible: an in-place migration of running production clusters. Although there were some bumps along the way, it is totally possible to conduct an in-place upgrade of DSE 4.8.x to Apache Cassandra 3.11 while maintaining acceptable latencies and avoiding any cluster downtime.

I’m in the habit of leaving classes cleaner than they were before I got there. CMD+ALT+O (optimize imports) and CMD+ALT+L (reformat code) have become muscle memory by now. I will spend my time removing unused properties, deleting dead code, and removing deprecations or duplications. One of my all-time favorites is rewriting tests to make them clean and readable. I do this as part of any day-to-day task, no matter how small or insignificant. It is satisfying to know you leave your code in a better state than before (and my OCD is fed and happy).

One of the best tools I use to achieve this is none other than the good old IntelliJ IDEA. It highlights unused parameters, typos, the accessibility level of a class or method, a parameter that is always called with the same value, etc. You know what I’m talking about: that nagging little yellow square you never pay attention to, or deliberately ignore because you’re always working on some important, time-sensitive feature. You “never” have time to quickly sweep and clean up the class you’re working on.

“That nagging little square”

My team at work is divided into three groups:

The people who use a several-major-versions-old IntelliJ (reciting the age-old mantra — “if it ain’t broke, don’t fix it”)

A person who uses the latest stable version of IntelliJ

And me, always making sure to use the latest EAP.

The first group of people doesn’t see some of the suggestions IntelliJ gives me. The first example that pops to mind is the support for lambda expressions we got with Java 8. IntelliJ is very helpful in highlighting code that can be converted to a lambda, which makes the code shorter, cleaner, and quite frankly more readable (which is a matter of personal taste and habit). Today I got to a class I had never seen before to make a tiny change. It’s a class in a service I don’t usually work on, so I hadn’t had the chance to make my usual OCD-powered sweep and clean-up. I proceeded to change one global parameter from public to private accessibility and added a ‘final’ keyword to another parameter. Happy and satisfied, my OCD let me commit this minor change.

Imagine how surprised I was to have my OCD rub my nose in a piece of code I entirely missed:

IntelliJ just told me that it found a bug:

After changing the code and calling the ‘putAll’ method with ‘innerMap’:

IntelliJ even added a bonus of highlighting another optimization I could do — “Lambda can be replaced with method reference.” and now the code looks like so:

Now, I know what you’re thinking. It’s something along the lines of: “where were the tests?” or “tests would have protected him from making this stupid mistake” (I can hear your scolding tone in my head right now, thank you very much). It’s a valid point, but it’s not the point I’m trying to make and it’s out of the scope of this post. What I’m trying to tell you is: trust your IntelliJ!

Working in an infrastructure team is different from working in a development team. The team develops tools that are used by development teams inside Outbrain, so it is very important to provide a convenient way of working with new technologies. Having a Spark cluster enables running distributed data processing at scale. Until now, developers worked with Spark by writing ad-hoc crontab jobs in production. This created a messy and unstable environment and required manual maintenance in case of failure. For that reason, our team wanted to provide an infrastructural way of working with Spark.

In this blog post, I explain how we built a solution that makes Spark a native tool in our WorkflowEngine.

Meet WorkflowEngine

The Outbrain Data Infrastructure team enables running hundreds of ETL jobs every hour. These jobs are orchestrated by WorkflowEngine – an internal tool, which is developed by the team.

The WorkflowEngine (WFE) enables running various types of transformations from many kinds of data sources: MySQL, Cassandra, Hive, Vertica, Hadoop etc., located in two data centers and the cloud. Every transformation, or flow, implements some business logic. A flow may contain several tasks, which are executed one after another.

Flows are developed and owned by researchers, data scientists and developers in other groups at Outbrain, such as Algorithm research groups, BI and Data Science.

Defining flows in WFE gives users a lot of benefits. It enables users to autonomously specify dependencies between flows, so when one flow ends, another is triggered. In addition, users can define a number of retries for their flows, so if a flow fails for some reason, WFE reruns it. If all retries fail, WFE sends an email or alert about the flow failures. Logs for each running flow are shown in a convenient form in WFE UI. Furthermore, WFE collects metrics for all flows, which enables us to track the health of the system.

From the operational perspective, several instances of WFE are deployed on machines owned by the infrastructure team; users just define ETL flows in a separate repository.

Diving Into the Solution

In order to respond to requirements for running data analytics on Spark, our team had to provide an infrastructural way of running Spark jobs orchestrated by WFE. To achieve this, we developed a new kind of WorkflowEngine task called SparkStep.

When approaching the design of SparkStep, we had several principles in mind:

Avoid producing additional load on WFE machines

Create an isolated environment for every Spark job

Allow users to specify, and at the same time allow the system to limit, the number of YARN resources allocated to every Spark job

Enable using multiple Spark clusters with different versions

Taking all this into consideration, we came up with the following architecture:

Keeping the control while having fun

The Spark job definition specifies how a job will run on the Spark cluster (e.g. jarURL, className, configuration parameters). In addition, there are properties that are set internally by WFE. When WFE starts executing a Spark job, it builds a spark-submit command using all these properties.
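This translation step can be sketched as a small builder function. The field names here (className, jarUrl, conf, args) are illustrative, not WFE's actual schema:

```python
def build_spark_submit(job):
    """Translate a (hypothetical) job definition into a spark-submit command line."""
    cmd = [
        'spark-submit',
        '--master', 'yarn',
        '--deploy-mode', 'cluster',  # the driver runs on the cluster, not on the WFE machine
        '--class', job['className'],
    ]
    # Per-job configuration, e.g. executor memory/cores, becomes --conf pairs.
    for key, value in job.get('conf', {}).items():
        cmd += ['--conf', f'{key}={value}']
    cmd.append(job['jarUrl'])
    cmd += job.get('args', [])
    return cmd
```

Running in "cluster" deploy mode is what keeps the load off the submitting machine, since YARN hosts the driver as well as the executors.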

The spark-submit command submits the Spark job to YARN, which allocates needed resources and launches the application. After submitting the job, WFE constantly checks its status using YARN REST API. When the job is completed in YARN, WFE reports a COMPLETED status for the WFE Spark task and continues running the next task. However, if the Spark job fails in YARN, WFE reports a FAILED status for the WFE job. In this case, WFE can rerun the job, according to the defined number of retries. If the job runs in YARN for too long, and its running time exceeds the defined maximum, WFE stops the job using the YARN REST API. In this case, the job is marked as FAILED in WFE. This functionality ensures both that we have a cap on the amount of resources each flow can use and that resources are saved when a job is canceled for any reason.
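The status-tracking loop described above can be sketched against the standard YARN ResourceManager REST API; the ResourceManager address, timeout, and poll interval below are illustrative:

```python
import json
import time
from urllib.request import Request, urlopen

RM = 'http://yarn-rm:8088'  # hypothetical ResourceManager address
TERMINAL_STATES = ('FINISHED', 'FAILED', 'KILLED')

def is_terminal(state):
    """True once YARN will no longer change the application's state."""
    return state in TERMINAL_STATES

def app_state(app_id):
    """Fetch the current state of a YARN application over the RM REST API."""
    with urlopen(f'{RM}/ws/v1/cluster/apps/{app_id}') as resp:
        return json.load(resp)['app']['state']

def kill_app(app_id):
    """Ask the ResourceManager to kill an application."""
    req = Request(f'{RM}/ws/v1/cluster/apps/{app_id}/state',
                  data=json.dumps({'state': 'KILLED'}).encode(),
                  headers={'Content-Type': 'application/json'}, method='PUT')
    urlopen(req)

def wait_for(app_id, max_seconds=3600, poll_seconds=30):
    """Poll until the app terminates; kill it if it exceeds the allowed runtime."""
    deadline = time.time() + max_seconds
    while time.time() < deadline:
        state = app_state(app_id)
        if is_terminal(state):
            return state
        time.sleep(poll_seconds)
    kill_app(app_id)
    return 'KILLED'
```

The scheduler would map FINISHED to a COMPLETED task status and FAILED/KILLED to a FAILED one, applying its retry policy on the latter.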

The spark-submit command runs in a Docker container, and the container exits once the job is submitted to the cluster. This lets us create an isolated environment for every Spark job executed by WFE. In addition, with Docker containers we can easily use different Spark versions simultaneously, simply by creating different Docker images.

By using this architecture we achieve all our goals. Spark jobs are submitted to YARN in a “cluster” mode, so they do not create a load on WFE machines. YARN handles all resources for each job. Every Spark job runs in an isolated environment owing to the Docker container.

What if?

What if someone told you it is forbidden to use logs anymore?

In my case, it began in a meeting with our ops team, who claimed our services write too many logs and we should write less. Therefore logs were being “throttled”, i.e. some log messages were simply discarded.

“Too many logs” turned out to mean 100 lines per minute, which in my opinion is a ridiculously low number.

Maybe I am doing something wrong?

It might be that logs are not a good pattern in a highly distributed microservices environment. I decided to rethink my assumptions.

Why do I need logs anyway?

The obvious reason to write logs is debugging. When something goes wrong in production, logs are how I understand the flow that led to the erroneous state.

The main alternative that I am aware of is connecting a debugger and stepping through the code. There are a couple of disadvantages to using the debugger: it is time-consuming, it might slow down the production process itself, and you have to be connected when the problem happens, so if the bug occurs only at night, bummer. In addition, debugging is a one-time process; what you learn stays in your head, so it is hard to improve that way.

Another alternative is adding metrics. Metrics are pretty similar to logs, but they have the extra feature of nice dashboards and alerting systems (we use Prometheus and Grafana). On the other hand, metrics have a bigger setup overhead, which is their main disadvantage in my opinion. Metrics are also more rigid and usually do not allow logging all of the state, context, and parameters. The fact that writing logs is easy makes it a no-brainer to use them everywhere in the code, applying the “think later” paradigm.

The third alternative is auditing via systems like ELK. Similar to metrics, it has higher overhead, and it is also hard to follow sequential operations through discrete events.

There are even more reasons for logging. It is additional documentation in the code that can help readers understand what is going on. Logging can also feed metrics and alerting systems, or even replace them. Many insights can be gained from logs, sometimes even user passwords.

Specification

If I go back to that meeting with the ops guys, one question I was asked was:

‘What are your requirements from your logging system’?

Here it is:

Order matters — messages should be kept in “written before” order, so it is possible to understand the flow of the code.

Zero throttling — I expect no rate limit on writing, only on the volume of saved messages, so whenever a message is thrown away, it is the oldest one.

X days history — log files should keep messages for at least a couple of days. If that is not enough, you might be writing too many log messages, so move some to debug level.

Logs should be greppable — text files have the big advantage of flexibility and the multitude of tools that work with them.

Metadata — the logging system should provide metadata like calling thread, timestamp, calling method or class etc. to remove that burden from the developer (logging should be easy).

Distributed and centralized — it should be convenient to look at all logs in a central location, but also possible to split them and see the logs of a specific process.

Easy to use, easy to install, easy to consume — in general, logging should be fun.
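As an illustration, several of these requirements (bounded history, greppable text files, free metadata) are met out of the box by standard logging frameworks. Here is a minimal sketch with Python's stdlib logging; the format and retention values are illustrative:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def build_logger(name, path, days=7):
    """File logger that rotates at midnight and keeps `days` days of history."""
    handler = TimedRotatingFileHandler(path, when='midnight', backupCount=days)
    # Metadata for free: timestamp, level, thread, and origin of every message.
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s [%(threadName)s] %(name)s:%(lineno)d - %(message)s'))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Rotation by time (rather than a write-rate limit) matches the "zero throttling" requirement: nothing is dropped at write time, and only the oldest files age out.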

Tips and Tricks

I am in. Is there anything else to know?

Use logging framework

Don’t log to standard output with print statements in a long-running service. A logging framework gives you levels, various appenders (like logging to HTML via HTTP), log rotation, and all sorts of features and tricks.

Developer tip — Log levels

It’s important to be consistent when using different levels, otherwise, you will lose semantics and meaning of the message severity. The way I use them is:

Debug — something happened that I wish to see only under special circumstances; otherwise it would clutter normal logs.

There are some special reasons to override that rule: Error and Warn messages are monitored in our services, so sometimes I move something down to Info level to snooze its alerting. I move messages from Info to Debug if there is too much clutter, and in the other direction if I want to focus on something.

Developer tip — Meaningful messages

I often see messages like “start processing” or “end method”. It is always a good idea to imagine what else you would like to see when reading the log, and add as much info as possible: specific parameters, fields, and contextual data that might have different values in different scenarios.

Developer tip — Lazy evaluated strings

Especially for the Debug level, it is good practice to use frameworks that avoid the overhead of string concatenation when the level is turned off, without the need to explicitly check whether the level is suppressed. In kotlin-logging, for example, it looks like this:

logger.debug { "Some $expensive message!" }
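For comparison, Python's stdlib logging gets a similar effect by deferring the string formatting, though the arguments themselves are still evaluated eagerly unless you guard the call:

```python
import logging

logger = logging.getLogger(__name__)

def expensive():
    return 'costly-to-build string'

# Formatting is deferred: the %s substitution only happens if a handler
# actually emits the record.
logger.debug('Some %s message!', 'cheap value')

# The argument itself is still evaluated eagerly, so for truly costly
# values, guard the call explicitly:
if logger.isEnabledFor(logging.DEBUG):
    logger.debug('Some %s message!', expensive())
```

The Kotlin lambda form above gets the second behavior for free, since the whole message block is lazily evaluated.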

Developer tip — Don’t log errors more than once per error

It is a common anti-pattern to log an exception and then re-throw it, only for it to be logged again elsewhere in the code. This makes it harder to understand the number of errors and their origin. Log exceptions only if you are not re-throwing them.
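A minimal sketch of the pattern, with hypothetical function names: inner layers raise without logging, and only the outermost handler logs the exception, once, with its stack trace:

```python
import logging

logger = logging.getLogger(__name__)

def parse_order(raw):
    """Inner layer: raise, don't log; the caller decides what to do."""
    if not raw:
        raise ValueError('empty order payload')
    return raw.split(',')

def handle_request(raw):
    """Outermost layer: the single place where the error is logged."""
    try:
        return parse_order(raw)
    except ValueError:
        # logger.exception logs at Error level and appends the stack trace.
        logger.exception('failed to handle order request')
        return None
```

With this discipline, counting Error lines in the log gives an accurate error count, and each stack trace points at one real failure.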

Ad-Hoc enablement of log level

Some frameworks and services allow changing the active log level on the fly. This means you can enable debug messages for a specific class on a specific instance for a couple of minutes, for example. It allows debugging without trashing the log file the rest of the time.
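The mechanism itself is tiny; here is a sketch with Python's stdlib logging (wiring it to a REST endpoint or admin console is left out):

```python
import logging

def set_log_level(logger_name, level_name):
    """Change a named logger's level at runtime, e.g. from an admin endpoint."""
    # Level names map to constants on the logging module: DEBUG, INFO, WARNING...
    logging.getLogger(logger_name).setLevel(getattr(logging, level_name.upper()))
```

A companion call scheduled a few minutes later can restore the previous level, so the verbose window stays short.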

Ad-Hoc addition of logging messages

When I worked at Intel, one of my peers developed a JVM tool that used bytecode manipulation to add log messages at the beginning and end of methods.

This meant you didn’t have to think about all those messages in advance; you could inject log messages as needed, while the process was running.

In-Memory logs

Another useful technique is keeping the last messages in memory. This makes it easy to access them remotely, via a REST call for example. It is also possible to dump them to a file in case the process crashes.
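A minimal sketch of an in-memory ring-buffer handler, again using Python's stdlib logging; the capacity is illustrative:

```python
import logging
from collections import deque

class InMemoryHandler(logging.Handler):
    """Ring buffer of the most recent formatted log messages."""

    def __init__(self, capacity=1000):
        super().__init__()
        self.records = deque(maxlen=capacity)  # oldest messages drop off first

    def emit(self, record):
        self.records.append(self.format(record))

    def dump(self):
        """Snapshot of the buffer, e.g. to serve over REST or write on crash."""
        return list(self.records)
```

Attach an instance to the root logger alongside the file handler, and a crash hook or REST endpoint can call `dump()` at any time.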

Logging as a poor-man profiler

It is possible to analyze logs to gain insight into the performance of the application as well. The simple technique I have seen uses the timestamps in the logs. A more advanced technique uses context to calculate and show the time elapsed since the beginning of the sequence (i.e. when the HTTP call started), using MDC.
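A sketch of the idea in Python's stdlib, with a logging Filter plus a context variable playing the role of MDC; the names are illustrative:

```python
import contextvars
import logging
import time

# Per-request start time, set at the entry point (e.g. when the HTTP call starts).
request_start = contextvars.ContextVar('request_start', default=None)

class ElapsedFilter(logging.Filter):
    """Attach elapsed milliseconds since the start of the current sequence."""

    def filter(self, record):
        start = request_start.get()
        # -1 flags messages logged outside any request context.
        record.elapsed_ms = round((time.monotonic() - start) * 1000, 1) if start is not None else -1
        return True
```

Install the filter on a handler and add `%(elapsed_ms)s` to the format string; each line then doubles as a crude profiler sample for its request.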

Log formatting

The content of the message is also important. Various logging frameworks allow embedding predefined template parameters such as:

Location info — Class and file name, method name and line number of where the log message was issued.

Date and time.

Log level as discussed above.

Thread info — relevant to multi-threads environments to be able to separate different flows.

Context info — similar to thread info but more specific to a use case; adds contextual information like user id, request id, etc. Framework features like MDC make this easier to implement.

I highly recommend using those, but bear in mind that some of these features incur a performance overhead when they are evaluated.
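As an illustration, here is a hypothetical java.util.logging Formatter that emits several of the template parameters listed above; note that the location info (class and method name) is exactly the kind of field that is expensive to compute:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class ContextFormatter extends Formatter {
    // Stand-in for MDC-style context; a real app would use a framework's MDC.
    static final ThreadLocal<String> REQUEST_ID = ThreadLocal.withInitial(() -> "-");

    @Override
    public String format(LogRecord r) {
        return String.format("%1$tF %1$tT [%2$s] req=%3$s %4$s.%5$s - %6$s%n",
                r.getMillis(),              // date and time
                r.getLevel(),               // log level
                REQUEST_ID.get(),           // context info
                r.getSourceClassName(),     // location info: class...
                r.getSourceMethodName(),    // ...and method (inferred from the stack — costly!)
                r.getMessage());
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("demo");
        logger.setUseParentHandlers(false);
        ConsoleHandler handler = new ConsoleHandler();
        handler.setFormatter(new ContextFormatter());
        logger.addHandler(handler);

        REQUEST_ID.set("req-42");
        logger.info("user logged in");  // e.g. 2024-05-01 12:00:00 [INFO] req=req-42 ...
    }
}
```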

Logging — the essentials

Logging is a big world and I couldn’t cover all of it here, but I hope I convinced you to use it well.

In this episode we will focus on the migration itself. Building a POC environment is all nice and easy; however, migrating 2 PB of data (the raw portion of the 6 PB that includes replication) turned out to be a new challenge. But before we jump into the technical issues, let’s start with the methodology.

The big migration

We learned from past experience that for such a project to be successful, like in many other cases, it is all about the people – you need to be mindful of the users and make sure you have their buy-in.

On top of that, we wanted to complete the migration within 4 months, as a renewal of our datacenter space was coming up and we wanted to benefit from the space reduction resulting from the migration.

With those two considerations in mind, we decided to keep the same technologies – Hadoop and Hive – in the cloud environment, and only look into leveraging new technologies available on GCP after the migration was done.

Once the decision was made, we started planning the migration of the research cluster to GCP, looking into different aspects such as:

Build the network topology (VPN, VPC etc.)

Copy the historical data

Create the data schema (Hive)

Enable the runtime data delivery

Integrate our internal systems (monitoring, alerts, provision etc.)

Migrate the workflows

Decommission the bare metal cluster (with all its supporting systems)

All with the purpose of productizing the solution and making it production grade, based on our standards. We made a special effort to leverage the same management and configuration control tools we use in our internal datacenters (such as Chef, Prometheus, etc.) – so we could treat this environment as just another datacenter.

Copying the data

Sounds like a straightforward activity – you need to copy your data from location A to location B.

Well, turns out that when you need to copy 2 PB of data, while the system is still active in production, there are some challenges involved.

The first restriction was that copying the data must not impact the usage of the cluster – research work still needed to be performed.

Second, once the data was copied, we also needed to validate it.

Starting with data copy

Option 1 – Copy the data using Google Transfer Appliance

Google can ship you their transfer appliance (depending on the location of your datacenter), which you attach to the Hadoop cluster and use to copy the data. You then ship it back to Google, and the data is loaded from the appliance into GCS.

Unfortunately, from a capacity perspective we would have needed several iterations of this process to copy all the data, and on top of that the Cloudera community version we were using was so old that it was not supported.

Option 2 – Copy the data over the network

When taking that path, the main restriction is that the network is used both for the production environment (serving) and for the copy, and we could not allow the copy to create congestion on the lines.

However, if we throttled the copy process, copying all the data would take too long and we would not be able to meet our timeline.

Setup the network

As part of our network infrastructure, per datacenter we have 2 ISPs, each with 2 x 10G lines for backup and redundancy.

We decided to leverage those backup lines and build a tunnel over them, dedicated solely to the Hadoop data copy. This enabled us to copy the data in a relatively short time on one hand, and to ensure it would not impact our production traffic on the other, as it was contained to specific lines.

Once the network was ready we started to copy the data to the GCS.

As you may remember from previous episodes, our cluster was set up over 6 years ago and had acquired a lot of tech debt, including in the data kept in it. We decided to take advantage of the situation and use the migration to also do some data and workload cleanup.

We invested time in mapping which data we needed and which could be cleared. Although this didn’t significantly reduce the data size, we managed to delete 80% of the tables and 80% of the workloads.

Data validation

As we migrated the data, we had to validate it, making sure there was no corrupted or missing data.

There were more challenges to take into consideration on the data validation front:

The migrated cluster is a live cluster – new data keeps being added to it and old data keeps being deleted

With our internal Hadoop cluster, all tables are stored as files while on GCS they are stored as objects.

It was clear that we needed to automate the process of data validation and build dashboards to help us monitor our progress.

We ended up implementing a process that creates two catalogs – one for the internal bare metal Hadoop cluster and one for the GCP environment – compares them, and alerts us to any differences.

This dashboard shows, per table, the file differences between the bare metal cluster and the cloud.
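A toy sketch of such a catalog comparison (table and file names are made up; the real process would crawl HDFS and list GCS objects to build the catalogs):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Compares two data catalogs (table -> file/object names) and reports differences. */
public class CatalogDiff {
    // Returns, per table, the entries present in `source` but missing from `target`.
    static Map<String, Set<String>> diff(Map<String, Set<String>> source,
                                         Map<String, Set<String>> target) {
        Map<String, Set<String>> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : source.entrySet()) {
            Set<String> gone = new TreeSet<>(e.getValue());
            gone.removeAll(target.getOrDefault(e.getKey(), Set.of()));
            if (!gone.isEmpty()) missing.put(e.getKey(), gone); // files not yet copied
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical catalogs: one built from the bare metal cluster, one from GCS.
        Map<String, Set<String>> hdfs = Map.of(
                "clicks", Set.of("part-0000", "part-0001"),
                "events", Set.of("part-0000"));
        Map<String, Set<String>> gcs = Map.of(
                "clicks", Set.of("part-0000"),
                "events", Set.of("part-0000"));

        System.out.println(diff(hdfs, gcs)); // → {clicks=[part-0001]}
    }
}
```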

In parallel to the data migration, we worked on building the Hadoop ecosystem on GCP: the table schemas with their partitions in Hive, our runtime data delivery systems adding new data to the GCP environment in parallel to the internal bare metal Hadoop cluster, our monitoring systems, data retention systems, etc.

The new environment on GCP was finally ready and we could migrate the workloads. Initially, we duplicated jobs to run in parallel on both clusters, making sure we completed validation and did not impact production work.

After a month of validation, parallel work and required adjustments we were able to decommission the in-house Research Cluster.

What we achieved in this journey

Upgraded the technology

Improved utilization and gained the elasticity we wanted

Reduced the total cost

Introduced new GCP tools and technologies

Epilogue

This amazing journey lasted almost 6 months of focused work. As planned, the first step was to use the same technologies we had in the bare metal cluster; now that the migration to GCP is finished, we can start planning how to take further advantage of the opportunities that arise from leveraging GCP technologies and tools.