Ohad Shai

What if?

What if someone told you it is forbidden to use logs anymore?

In my case, it began in a meeting with our ops team that claimed our services are writing too many logs and we should write less. Therefor logs are being “throttled” ie: some log messages are just discarded.

Too many logs were 100 lines per minute, which in my opinion was a ridiculously low number.

Maybe I am doing something wrong?

It might be that logs are not a good pattern in a micro-services-highly-distributed environment. I decided to rethink my assumptions.

Why do I need logs anyway?

The obvious reason to write logs is debugging purposes. When something goes wrong in production, logs are a way I can understand the flow that leads to that erroneous state.

The main alternative that I am aware of is connecting a debugger and stepping thru the code in the flow. There are a couple of disadvantages for using the debugger: It is time-consuming, It might slow down the production process itself and you have to be connected when it happens — so If this bug happens only at night — bummer. In addition debugging is a one time process, the learning from it is just in your head so it is hard to improve that way.

Another alternative is adding metrics. Metrics are pretty similar to logs but they have the extra feature of having nice dashboards and alerting systems (we use metrics , prometheus and grafana). On the other hand, metrics have bigger overhead in the setup process which is the main disadvantage in my opinion. Metrics are also more rigid and do not allow usually to log all state, context, and parameters. The fact that writing logs is easy makes it a no-brainer to use everywhere in the code while applying the “think later” paradigm.

The third alternative is auditing via systems like ELK etc’. Similar to metrics it has higher overhead and it also hard to follow sequential operations with those discrete events.

There are even more reasons for logging. It is an additional documentation in the code that can help understand what is going on. Logging can also be used as metrics and alerting systems and even replace those systems. Many insights can be gained from logs, sometimes even user passwords.

Specification

If I go back to that meeting with the ops guys, one questions I was asked was:

‘What are your requirements from your logging system’?

Here it is:

Order matters — messages should be in ‘written before’ order so it will be possible to understand the flow of the code.

Zero throttling — I expect that there will be no rate limit on writing, only on the volume of saved messages, so every time a message is thrown away, it is the oldest one.

X days history — Log files should keep log messages for at least couple of days — if that is not the case it might be you are writing too many log messages, so move some to debug level.

Logs should be greppable — text files has this big advantage of flexibility and multiple tools that can be used with them. So text files are a big advantage.

Metadata — logging system should provide some metadata like calling thread, timestamp, calling method or class etc’ to remove that burden from the developer (logging should be easy).

Distributed and centralized — it should be convenient to look at all logs in a central location, but also be able to split it and see logs of a specific process.

Easy to use, easy to install, easy to consume — in general logging should be fun.

Tips and Tricks

I am in. Is there anything else to know?

Use logging framework

Don’t log to standard output with print lines in a long-running service. Logging framework allows various levels, various appenders like logging to HTML via HTTP, log rotation and all sort of features and tricks.

Developer tip — Log levels

It’s important to be consistent when using different levels, otherwise, you will lose semantics and meaning of the message severity. The way I use them is:

Debug — something happened, I wish to see only under special circumstances otherwise it will clutter normal logs.

There are some special reasons to override that rule: Error and Warn messages are monitored in our services, so sometimes I might move something to info level to snooze alerting of it. I move messages from Info to Debug if there is too much clutter and the other direction if I want to focus on something.

Developer tip — Meaningful messages

I often see messages like “start processing” or “end method”. It is always a good idea to try and imagine what else you would like to see when reading log messages s and add as much info as possible, specific parameters, fields and contextual data which might have different value at different scenarios.

Developer tip — Lazy evaluated strings

Especially for Debug level, it is a good practice to use frameworks that prevent the overhead of string concatenation when the level is turned off without the need to explicitly check if the level isn’t suppressed. In kotlin-logging, for example, it will look like this:

logger.debug { “Some $expensive message!” }

Developer tip — Don’t log errors more than once per error

It is a common anti-pattern to log an exception and then re-throw it just to be logged later in another place in the code again. It makes it harder to understand the number of errors and their origin. Log exceptions only if you are not re-throwing them.

Ad-Hoc enablement of log level

Some frameworks and services allow on-the-fly change of the active log level. It means you can print debug messages of a specific class in a specific instance for a couple of minutes, for example. It allows debugging while not trashing the log file in the rest of the time.

Ad-Hoc addition of logging messages

When I worked at Intel, one of my peers developed a JVM tool that allowed bytecode manipulation of methods and adding log messages at the beginning of methods and at the end of methods.

It means you didn’t have to think in advance on all those messages, but just to inject them with log messages when needed while the process is running.

In-Memory logs

Another useful technique is keeping last messages in memory. It allows developing an easy way to access them from remote via REST call, for example. It is also possible to have that dumped into a file in case the process has a crash.

Logging as a poor-man profiler

It is possible to analyze logs to gain also insight on the performance of the application. The simple technique I saw is using the timestamp in the logs. A more advanced technique is using the context to calculate and show the time from the beginning of the sequence (ie: when the HTTP call started), by using MDC.

Log formatting

The content of the message is also important. Various logging framework allows embedding predefined template parameters such as:

Location info — Class and file name, method name and line number of where the log message was issued.

Date and time.

Log level as discussed above.

Thread info — relevant to multi-threads environments to be able to separate different flows.

Context info — similar to thread info but more specific to a use case, add context information like user id, request id etc’. Framework features like MDC make it easier to implement

I highly recommend using those, but bare in mind that some features pose performance overhead when they are evaluated.

Logging — the essentials

Logging is a big world, I couldn’t cover all of it here, but I hope I convinced you to use it.

Prelude

With SVN life was easy. We used Intellij IDEA with Teamcity for CI so it supports pre-tested commits.

It works as follows: instead of directly commit to trunk, there is a button in intellij IDEA “Run remote…” that essentially sends a patch to Teamcity to run all integration tests and when all tests pass it reports back to IDEA, that in turn commits the tested files directly to trunk (==master).

What it gives you? a higher certainty that your change hasn’t broken anything. As the team gets bigger (and in outbrain we had more than 100 developers committing to trunk) it becomes essential to have such a gatekeeper that will prevent commits that break the trunk. Otherwise, it becomes tedious to keep the trunk green all the time.

But…

When hitting the same button with GIT there is a small asterisk says: “Pre-tested commits are not supported in distributed VCS”. It still puzzles me why it can’t just send a patch with the exact same flow?! Anyway, we wanted this when moving to GIT so we had to find a solution.

Firstly, we could work with pull requests / dev branches / feature branches. That is a viable solution but it makes development speed slower and requires more clicks and time to get to production (ie: more bureaucracy).

The obvious solution was to commit directly to master without testing first in CI server. Since we were in a small team at the time that was actually the first approach we have taken. It works quite well but does not scale well to big teams.

Another viable alternative is to send a patch to Teamcity for testing, and after the tests pass manually approve the files commit and push. Usually, when I commit/push code I love to take a coffee break or something, while this approach forces me to wait until all tests pass.

Auto-merge

It took a while to configure but eventually, that worked for us (see how to configure it below).

One notable advantage over the SVN approach is that the workstation doesn’t have to be connected to the network to make the actual approval of the merge while in SVN the commit happens from the IDEA itself.

However, there are some caveats. Most notably, this process has more complicated merges that result in more merge commits. It makes the commit tree a bit more obscure and trashed with merges.

How to configure?

We created a development branch on the server for each developer, myn is dev-oshai. On local repo developers works on master as usual and pull from master. Push is made always to the developer branch. Teamcity, in turn, run all integration tests and auto-merge to master when all tests pass.

Configure Teamcity

In Project configuration settings -> Build features add “Automatic merge”. For some obscure reason, the branch to watch should be defined as “+:oshai”. Not sure why…

Note those branches must be configured also as build triggers, otherwise build will not run at all.

Eventually, Teamcity will show all those branches in the branches view:

Configure Intellij IDEA

Actually here there is not too much to configure. When pushing the code, specify the dev-<you-name> branch itself:

Push from the command line

There are 2 alternatives here:

The simple one is to push like that:

git push origin dev-oshai

In case you would like to use git push(without the branch), it is also possible. See here how to configure it:

Enjoy!

Continuous delivery is a methodology where each commit can potentially get into production in a timely manner.

Jenkins Pipeline is one of the tools out there that automates the delivery process to make it short, robust, and without human intervention as much as possible.

We have recently done such an integration on our team at Outbrain, so here are some tips and advises from our humble experience.

X. You should have done this ages ago (so do it today)

Don’t wait till you have all the building blocks in place. Start with a partial pipeline and add all the automated steps you already have in place. It will give you the motivation to add more automation and improve the visibility of the process.

The pipeline set of plugins in Jenkins are ~1 year old in its current form, So it is mature and well documented. Definitely ready to use.

X. Validate artifacts and source code consistency across pipeline

That tip I read in the Teamcity pipeline post but is relevant for Jenkins as well. Make sure that the same version of sources and artifacts is used across all stages. Otherwise, commit might be pushed while the pipeline is executing, and you might end up deploying untested version.

X. Use commit hook with message regexp

Well, if I will try to generalise this tip I would say: try to ask for as little human intervention as possible (when it is not required). A good place to start is a commit hook. It works in a way that when a developer push code with a specific commit message — in our case #d2p (deploy to production), the pipeline is automatically triggered.

Here is a code sample from Jenkinsfile (the pipeline configuration file):

X. Try the Blue Ocean view

Blue Ocean set of plugins are in release-candidate stage as of the time of writing (now it is GA). Stable enough and a very good UI— especially for parallel stages. So I would recommend using. In addition, it is working side-by-side with the old UI.

All is green

When something goes wrong

X. Ask for user authorization on sensitive operations

If you still not sure that your monitoring system is robust enough, start by automating the pipeline, and ask for developer authorization before the actual deploy to production.

X. Integrate slack or other notifications

Slack is awesome and has a very documented API as well. Sending notifications on pipeline triggering and progress helps to communicate the work between team members. We use it to send start, completed, and failure notifications right now. We plan to integrate the above approval input with a slack bot so we can approve it directly from slack.

A slack notification

X. Make the pipeline fast (parallelize it)

Making pipeline turnaround time short helps to keep work efficient and fun. Set a target for the total turnaround time. Our target is less than 10 minutes. One of the easiest ways to keep it fast is by running stages that are independent in parallel. For example, we run a deployment to a test machine in parallel to the integration tests and deployment to a canary machine in parallel to our black-box tests.

Some people say writing code is kind of an art.
I don’t think so.
Well, maybe it is true if you are writing an ASCII-Art script or you are a Brainfuck programmer. I think that in most cases writing code is an engineering. Writing a program that will do something is like a car taking you from A to B that someone engineered. When a programmer writes code it should do something: data crunching, tasks automation or driving a car. Something. For me, an art is a non-productive effort of some manner. Writing a program is not such case. Or maybe it is more like a martial-art (or marshaling-art :-)) where you fight your code to do something.

So — what’s the problem?

Most of the programs I know that evolve over time, needs to have a quality which is not an art, but an ability to be maintainable. That’s usually reflect in a high level of readability.

Readability is difficult.

First, when you write some piece code, there is a mental mode you are in, and a mental model you have about the code and how it should be. When you or someone else read the code, they don’t have that model in mind, and usually, they read only a fragment of the entire code base.
In addition, there are various languages and styles of coding. When you read something that was written by someone else, with a different style or in a different language it is like reading a novel someone wrote in a different dialect or language.
That is why, when you write code you should be thoughtful to the “future you” reading the code, by making the code more readable. I am not going to talk here about how to do it, design patterns or best practices to writing your code, but I would just say that practicing and experience are important aspects in relate to code readability and maintainability.
As a programmer, when I retrospect what I did last week at work I can estimate about 50% of the time or less was coding. Among other things were writing this blog post, meetings (which I try to eliminate as much as possible) and all sort of work and personal stuff that happens during the day. When writing code, among the main goals are the quality, answering the requirements and do both in a timely manner.
From a personal perspective, one of my main goals is improving my skill-set as a programmer. Many times, I find that goal in a conflict with the goals above that were dictated by business needs.

Practice and more practice

There are few techniques that come to solve that by practicing on classroom tasks. Among them are TDD Kata’s and code-retreat days. Mainly their agenda says: “let’s take a ‘classroom’ problem, and try to solve it over and over again, in various techniques, constraints, languages and methodologies in order to improve our skill-set and increase our set of tools, rather than answering business needs”.

Code Retreat @ Outbrain — What do we do there?

So, in Outbrain we are doing code-retreat sessions. Well, we call it code-retreat because we write code and it is a classroom tasks (and a buzzy name), but it is not exactly the religious Corey-Haines-full-Saturday Code-Retreat. It’s an hour and a half sessions, every two weeks, that we practice writing code. Anyone who wants to code is invited — not only the experts — and the goals are: improve your skills, have fun, meet developers from other teams in Outbrain that usually you don’t work with (mixing with others) and learn new stuff.
We are doing it for a couple of months now. Up until now, all sessions were about fifteen minutes of presentation/introduction to the topic, and the rest was coding.
In all sessions, the task was Conway’s game of life. The topics that we covered were:

Cowboy programming — this was the first session we did. The game of life was presented and each coder could choose how to implement it upon her own wish. The main goal was an introduction to the game of life so in the next sessions we can concentrate on the style itself. I believe an essential part of improving the skills is the fact that we solve the same problem repeatedly.

The next session was about Test-Driven-Development. We watched uncle-bob short example of TDD, and had a fertile discussion while coding about some of the principles, such as: don’t write code if you don’t have a failing test.

After that, we did a couple of pair programming sessions. In those sessions, one of the challenges was matching pairs. Sometimes we do it in a lottery and sometimes people could group together by their own selection, but it was dictated by the popular programming languages that the developers choose to use: Java, Kotlin, Python or JavaScript. We plan to do language introduction sessions in the future.
In addition, we talked about other styles we might practice in the future: mob-programming and pairs switches.
These days we are having functional programming sessions, the first one was with the constraint of “all-immutable” (no loops, no mutable variables, etc’) and it will be followed by more advanced constructs of functional programming.
All-in-all I think we are having a lot of fun, a lot of retreat and little coding. Among the main challenges are keeping the people on board as they have a lot of other tasks to do, keeping it interesting and setting the right format (pairs/single/language). The sessions are internal for Outbrain employees right now, but tweet me (@ohadshai) in case you are around and would like to join.
And we also have cool stickers:
P.S. — all the materials and examples are here.

In this post, I will describe a technical methodology we used to remove a piece of functionality from a Monolith/Service and turn it into a Micro-Service. I will try to reason about some of the decisions we have made and the path we took, as well as a more detailed description of internal tools, libraries and frameworks we use at Outbrain and in our team, to shed some light on the way we work in the team. And as a bonus you might learn from our mistakes!
Let’s start with the description of the original service and what it does.
Outbrain runs the largest content discovery platform. From the web surfer’s perspective it means serving a recommended content list that might interest her, in the form of ‘You might also like’ links. Some of those impression links are sponsored. ie: when she clicks on a link, someone is paying for that click, and the revenue is shared between Outbrain and the owner of the page with the link on it. That is how Outbrain makes its revenue.

My team, among other things, is responsible for the routing of the user to the requested page after pressing the link, and for the bookkeeping and accounting that is required in order to calculate the cost of the click, who should be charged, etc.
In our case the service we are trying to split is the ‘Bookkeeper’. Its primary role is to manage the paid impression links budget. After a budget is spent, The ‘Bookkeeper’ should notify Outbrain’s backend servers to refrain from showing the impression link again. And this has to be done as fast as possible. If not, people will click on links we cannot charge because the budget was already spent. Technically, this is done by an update to a database record. However, there are other cases we might want to stop the exposure of impression links. One such an example is a request from the customer paying for the future click to disable the impression link exposure. So for such cases, we have an API endpoint that does exactly the same with the same code. That endpoint is actually part of the ‘Bookkeeper’ that is enabled by a feature toggle on specific machines. This ‘Activate-Impressionable’ endpoint as we call it, is what was decided to split out of the ‘Bookkeeper’ into a designated Micro-Service.
In order to execute the split, we have chosen a step-by-step strategy that will allow us to reduce the risk during execution and keep it as controlled and reversible as possible. From a bird’s eye view I will describe it as a three steps process: Plan, Up and Running as fast as possible and Refactor. The rest of the post describes these steps.

Plan (The who’s and why’s)

In my opinion, this is the most important step. You don’t want to split a service just in order to split. Each Micro Service introduces maintenance and management overhead, with its own set of challenges[1]. On the other hand, Microservices architecture is known for its benefits such as code maintainability (for each Micro Service), the ability to scale out and improved resilience[2].
Luckily for me, someone already did that part for me and took the decision that ‘Activate-Impressionable’ should split from the ‘Bookkeeper’. But still, Let’s name some of the key factor of our planning step.
Basically, I would say that a good separation is a logical separation with its own non-overlap RESTful endpoints and isolated code base. The logical separation should be clear. You should think what is the functionality of the new service, and how isolated it is. It is possible to analyze the code for inter-dependencies among classes and packages using tools such as lattix. At the bottom line, it is important to have a clear definition of the responsibility of the new Micro Service.
In our case, the ‘Bookkeeper’ was eventually split so that it remain the bigger component, ‘Activate-Impressionable’ was smaller and the common library was smaller than both. The exact lines of code can be seen in the table below.

Unfortunately, I assessed it only after the split and not in the plan step. We might say that there is too much code in common when looking at those numbers. It is something worth considering when deciding what to split. A lot of common code implies low isolation level.

Of course part of the planning is time estimation. Although I am not a big fan of “guestimates” I can tell that the task was planned for couple of weeks and took about that long.
Now that we have a plan, let’s get to work.

Up and Running – A Step by Step guide

As in every good refactor, we want to do it in small baby steps, and remain ‘green’ all the time[3]. In continuous deployment that means we can and do deploy to production as often as possible to make sure it is ‘business as usual’. In this step we want to get to a point the new service is working in parallel to the original service. At the end of this step we will point our load-balancers to the new service endpoints. In addition, the code remains mutual in this step, means we can always deploy the original fully functioning ‘Bookkeeper’. We actually do that if we feel the latest changes had any risk.
So let’s break it down into the actual phases:

Overview

Step Details

Starting phase

Create the new empty Micro-Service ‘Activate-Impressionable’. In outbrain we do it using scaffolding of ob1k framework. Ob1k is an open source Micro Services Framework that was developed in-house.

Create a new empty Library dependent both by the new ‘Activate-Impressionable’ service and the ‘Bookkeeper’. Ideally, if there is a full logic separation with no mutual code between the services that library will be deleted in the cleanup phase.

Move the relevant source code to the library. Luckily in our case, there was one directory that was clearly what we have to split out. Unluckily, that code also pulled up some more code it was dependent on and this had to be done carefully not to pull too much nor too little. The good news are that this phase is pretty safe for static typing languages such as Java, in which our service is written in. The compiler protects us here with compilation errors so the feedback loop is very short. Tip: don’t forget to move unit tests as well.

Move common resources to the library, such as spring beans defined in xml files and our feature flags files that defined in yaml files. This is the dangerous part. We don’t have the compiler here to help, so we actually test it in production. And when I say production I mean using staging/canary/any environment with production configuration but without real impact. Luckily again, both yaml and spring beans are configured to fail fast, so if we did something wrong it will just blow out in our face and the service will refuse to go up. For this step I even ended up developing a one-liner bash script to assist with those wicked yaml files.

Copy and edit web resources (web.xml) to define the service endpoints. In our case web.xml can’t reside in a library so it has to be copied. Remember we still want the endpoints active in the ‘Bookkeeper’ at that phase. Lesson learned: inspect all files closely. In our case log4j.xml which seems like an innocent file by its name contains designated appenders that are consumed by other production services. I didn’t notice that and didn’t move the required appender, and it was found only later in production.

Deploy

Deploy the new service to production. What we did is deploy the ‘Activate-Impressionable’ side-by-side on the same machines as the ‘Bookkeeper’, just with a different ports and context path. Definitely makes you sleep better at night.

Up-And-Running

Now is a good time to test once again if both ‘Bookkeeper’ and ‘Activate-Impressionable’ are working as expected. Generally now we are up and running with only few more things to do here.

Clients Redirect

Point clients of the service to the new endpoints (port + context path). A step that might take some time depends on the number of clients and the ability to redeploy them. In outbrain we use HA-Proxy, so reconfiguring it did most of the work, but some clients did require code modifications.

(More) Validation

Move/copy simulator tests and monitors. In our team, we heavily rely on tests we call simulator tests. These are actually black-box tests written in JUnit that runs against the service installed on a designated machine. These tests see the service as a black-box and calls its endpoints while mock/emulate other services and data in the database for the test run. So usually a test run can look like: put something in the database, trigger the endpoint, and see the result in the database or in the http response. There is also a question here whether to test ‘Activate-Impressionable’ or the ‘Bookkeeper’. Ideally you will test them both (tests are duplicated for that phase), and that is what we did.

Refactor, Disconnect & Cleanup

When we got here the new service is working and we should expect no more behavior changes from the endpoints point of view. But we still want the code to be fully split and the services to be independent from each other. In the previous step we performed the phases in a way that everything remains reversible with a simple feature toggle & deploy.

In this step, we move to a state where the ‘Bookkeeper’ will no longer host the ‘Activate-Impressionable’ functionality. Sometimes it is a good idea to have a gap from the previous step to make sure that there are no rejections and backfires that we didn’t trace in our tests and monitoring.
First thing, If was not done up until now, is deploying the ‘Bookkeeper’ without the service functionality and make sure everything is still working. And wait a little bit more…
And now we just have to push the sources and the resources from the library to the ‘Activate-Impressionable’ service. In the ideal case there is no common code, we can also delete the library. This was not how it was in our case. We still have a lot of common code we can’t separate for the time being.
Now is also the time to do resources cleanup, web.xml edit etc’.
And for the bold and OCD among us – packages rename and refactor of code with the new service naming conventions.

Conclusion

The entire process in our case took a couple of weeks. Part of the fun and advantage in such process, is the opportunity to know better an old code and its structure and functionality without the need to modify something for a new feature with its constraints. Especially when someone else wrote it originally.
In order to perform well such a process it is important to plan and remain organized and on track. In case of a context switch it is very important to keep a bookmark of where you need to return to in order to continue. In our team we even did that with a handoff of the task between developers. Extreme Programming, it is.
It is interesting to see the surprising results in terms of lines of code. Originally we thought of it as splitting a micro-service from a monolith. In retrospective, it looks to me more like splitting a service into two services. ‘Micro’ in this case is in the eye of the beholder.