There was an API that ingested events and forwarded them to a distributed message queue. An event, in this case, is a JSON object generated by a web or mobile app containing information about users and their actions. As events were consumed from the queue, customer-managed settings were checked to decide which destinations should receive the event. [...] The event was then sent to each destination’s API, one after another, which was useful because developers only need to send their event to a single endpoint, Segment’s API, instead of building potentially dozens of integrations.

When an event failed to deliver, it was re-queued in the system. In some cases this meant workers were delivering new events while also retrying previously failed ones, which could cause delays across all destinations. As Noonan explains:

To solve the head-of-line blocking problem, the team created a separate service and queue for each destination. This new architecture consisted of an additional router process that receives the inbound events and distributes a copy of the event to each selected destination. Now if one destination experienced problems, only its queue would back up and no other destinations would be impacted. This microservice-style architecture isolated the destinations from one another, which was crucial when one destination experienced issues as they often do.
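The difference between the two designs can be sketched with a toy model. This is only an illustration under stated assumptions, not Segment's implementation: the tick costs, the retry limit, and the names `drain`, `shared_queue`, and `per_destination_queues` are all hypothetical, and each queue is assumed to have exactly one worker.

```python
from collections import deque

# All constants below are assumptions chosen for illustration only.
ATTEMPT_COST = 1   # ticks spent on a successful delivery
TIMEOUT_COST = 5   # ticks wasted on a delivery to a failing destination
MAX_ATTEMPTS = 3   # attempts before an event is dropped


def drain(jobs, failing):
    """Process one queue with a single worker.

    Returns {(event, destination): tick at which delivery succeeded}.
    """
    clock = 0
    finished = {}
    work = deque((event, dest, 1) for event, dest in jobs)
    while work:
        event, dest, attempt = work.popleft()
        if dest in failing:
            clock += TIMEOUT_COST
            if attempt < MAX_ATTEMPTS:
                # Failed deliveries are re-queued behind the remaining work.
                work.append((event, dest, attempt + 1))
        else:
            clock += ATTEMPT_COST
            finished[(event, dest)] = clock
    return finished


def shared_queue(events, destinations, failing):
    """One queue for everything: a failing destination delays the others."""
    jobs = [(event, dest) for event in events for dest in destinations]
    return drain(jobs, failing)


def per_destination_queues(events, destinations, failing):
    """Router copies each event into a separate queue per destination."""
    finished = {}
    for dest in destinations:
        # Each destination queue has its own worker and its own clock.
        finished.update(drain([(event, dest) for event in events], failing))
    return finished


if __name__ == "__main__":
    events = ["e1", "e2"]
    destinations = ["a", "b"]
    print(shared_queue(events, destinations, failing={"a"}))
    print(per_destination_queues(events, destinations, failing={"a"}))
```

In this model, deliveries to the healthy destination `b` wait behind the timeouts for `a` in the shared queue, while with one queue per destination only `a`'s queue backs up.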

The article then discusses how the Segment development team originally had all of the code in a single repository, but that led to problems:

A huge point of frustration was that a single broken test caused tests to fail across all destinations. When we wanted to deploy a change, we had to spend time fixing the broken test even if the changes had nothing to do with the initial change. In response to this problem, it was decided to break out the code for each destination into their own repos.

This did lead to improvements in flexibility for the development teams. However, as the number of destinations grew, so too did the number of repositories. To ease the burden on developers having to maintain these codebases, the Segment team created a number of shared libraries for common transformations and functionality across all of the destinations. This set of shared libraries brought with them an obvious benefit to maintenance. However, there was a less obvious downside: updating and testing changes to the shared libraries began to take up a lot of time and introduced an element of risk for fear of breaking unrelated destinations. Eventually different versions of these libraries began to arise and they diverged from each other, leading to an unforeseen problem where each destination codebase relied on different versions of the shared libraries. As Noonan admits, they could have built tools to help automate the rollout of changes to these libraries. However, at about this time they were encountering other issues with their microservices architecture.
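As a rough idea of what tooling around the shared libraries could start from, the sketch below reports which libraries are pinned at more than one version across destination codebases. It is a hypothetical helper, not something described in Noonan's article: the function name `find_version_drift`, the input shape, and the library and destination names are all assumptions.

```python
from collections import defaultdict


def find_version_drift(pins):
    """Return the libraries that are pinned at more than one version.

    `pins` maps a destination name to a {library: version} dict
    (an assumed, simplified stand-in for each repo's dependency manifest).
    """
    versions = defaultdict(set)
    for libraries in pins.values():
        for library, version in libraries.items():
            versions[library].add(version)
    # Versions are sorted as plain strings, which is enough for reporting.
    return {
        library: sorted(seen)
        for library, seen in versions.items()
        if len(seen) > 1
    }


if __name__ == "__main__":
    # Hypothetical destinations and shared libraries.
    pins = {
        "dest-a": {"transforms": "1.2.0", "http-client": "2.0.0"},
        "dest-b": {"transforms": "1.4.1", "http-client": "2.0.0"},
    }
    print(find_version_drift(pins))
```

A report like this only makes the divergence visible; rolling a change out to every destination would still need the testing effort described above.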

An additional problem was that each service had a distinct load pattern. Some services would handle a handful of events per day while others handled thousands of events per second. For destinations that handled a small number of events, an operator would have to manually scale the service up to meet demand whenever there was an unexpected spike in load.

Their system did support auto-scaling, but because each service needed its own mix of CPU and memory resources, tuning the auto-scaling configuration was "more art than science". As mentioned earlier, the number of repositories grew with every new destination, and at one point the team was adding three destinations per month on average, each requiring more queues and yet more services.

In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded. [...] Therefore, we decided to take a step back and rethink the entire pipeline.

In the rest of her article, Noonan describes how they moved away from their microservices architecture. This included developing Centrifuge, which replaced all of the individual queues, and sending events to a single monolithic service instead. They also moved all of the destination code into a single repository, this time imposing specific rules for managing it: there would be one version of each dependency for all destinations, and all destinations would be updated accordingly. They no longer had to worry about differences between dependency versions, as all destinations used the same version and would continue to do so. For the developers, maintaining a growing number of destinations became much less time-consuming and less risky.

There is much more in Noonan's article about their journey back to a monolithic service, and interested readers should check it out, as it includes details on the architecture, thoughts on repository structure, and their approach to building a resilient test suite. However, the summary of the benefits the team saw in the end includes the following:

In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past six months than in all of 2016. The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU and memory-intense destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.

However, there are some downsides/trade-offs to this re-architecture which include the fact that isolation of faults is difficult (if a bug in one destination causes the service to crash then it fails for all other destinations) and updating the version of a dependency may break some other destinations which then need to be updated too. Noonan ends the article on a pragmatic note:

When deciding between microservices or a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out, the solution for us was a monolith.

In fact, some of these concerns with microservices may sound familiar. Earlier this year we reported that ThoughtWorks suggested microservices would not reach the Adopt Ring in their Technology Radar. As reported then, "one of the main reasons for this is that many organisations are simply not microservices ready, lacking in some foundational practices around operations and automation". Furthermore, as Jan reported in another article around failures with microservices from a few years ago, Richard Clayton, chief software engineer at Berico Technologies, suggested one problem they had at the time:

Balancing the desire to share common utility code between services against independent services with replicated functionality became a huge tradeoff finally leading to a major refactoring.

Returning to the original article, there has been a lot of discussion elsewhere on the topic, including on Hacker News and Reddit, with several commenters suggesting that areas other than microservices may have been the cause of the problems. For example, some comments point out that Noonan's original article makes no reference to CI, only CD, which is an odd combination at least. Another commenter suggested that the problems were perhaps not specific to microservices but common to distributed systems in general, something we've touched on before too, referring to a similar experience with SOA:

I worked on a code base like that back when it was called SOA and before the cloud. Every call to a service would launch a full instance of the service, call a method and then shutdown the instance. I think we need to make network latency mandatory elements of architecture diagrams.

Interestingly, much of that comment thread discusses problems with data in the context of microservices, a topic we have covered several times elsewhere and a common source of problems as well as disagreements. As one comment on Hacker News illustrates:

It's worse than that; it's my observation that most microservice architectures just ignore consistency altogether ("we don't need no stinking transactions!") and blindly follow the happy path. I've never quite understood why people think that taking software modules and separating them by a slow, unreliable network connection with tedious hand-wired REST processing should somehow make an architecture better. I think it's one of those things that gives the illusion of productivity - "I did all this work, and now I have left-pad-as-a-service running! Look at the little green status light on the cool dashboard we spent the last couple months building!"

Furthermore, defining domains for microservices is something we have raised over the years as being important for successful deployments of microservices. In fact, there was a presentation on using DDD to deconstruct monoliths and this may be relevant to something else discussed on the Reddit thread:

Building a good microservice architecture is hard - and I tend to think that it's all about properly segregating your domains successfully and reevaluating this aspect consistently when the system evolves. Despite the name, microservices don't have to be small, but rather fulfil certain characteristics of the architecture - that's the biggest pitfall most seem to fall for.

What do others think? For example, did the Segment microservice architecture have problems which could have been solved in other ways without having to return to a monolith-based approach? Or could their original monolith-based architecture have been evolved to better accommodate their growing needs without introducing microservices in the first place?

This news item was updated on the 16th of July 2018 with details from Hacker News and Reddit.

The silver bullet thinking

I think of microservices as a way of decoupling code at an architecture level. It is just much harder to do and debug, the consequences of ignoring something (like the latency mentioned) can hurt much more, and refactoring is also tricky.

One way to navigate this is to take a slower but steadier approach and decompose parts of an existing monolith. I assume the monolith has a good internal design, so you can swap pieces for external ones, or easily refactor to do so.

This approach assumes you are evolving the software and creating "miniservices" first. You could realize that micro is not the way to go long before it becomes a problem, or you could see that it is the way to go. In either case, I think a better understanding of the problem and of ways to solve it will emerge.

That has the advantage of letting you apply different techniques and patterns. For instance, reading the article I had the impression that the team tried to solve everything using queues. I am not familiar with the problem, but if this impression is confirmed, is that a good approach?

Microservices don't need to be all or nothing. They are not a silver bullet.

To visualize, imagine a puzzle of Manhattan that you are building. You could try to cut it into one million pieces in the first round, or you could first cut it into 10, stop to evaluate, and then cut each of those into 10 again. Now you have one hundred smaller pieces. You need to stop and evaluate: Do I keep cutting? How do these pieces talk to each other? What tools will help me now and as we grow? How is it performing? What are the best patterns? How is the team dealing with it? Can we improve our workflow? And many other questions. You have just 100 pieces, which is far from 1 million, but I think you are learning much more and deciding how to scale better along the way. It can turn out that parts become products on their own, the company switches priorities, or workloads and performance are different, and one million pieces may simply not be what you wanted in the first place.

It's not about microservices at all

Time and time again I come across exactly the same situation: some team starts with a microservices-first approach, then ends up with a distributed monolith and plunges into the nightmare that goes with it, like chatty services, huge latency, fragile deployments, etc. I've always tried to work out at what point things went wrong. Now I'm pretty sure that the root of the evil hides a bit deeper -- in a procedural mindset. Instead of treating a request as a data flow, like 1) validate data, 2) check business rules, 3) transform data, 4) send data, etc., which often leads to a distributed monolith, I think concentrating on DDD aggregates is the way to go. I write about it in more detail here.