
“This article was originally published on Medium by Alexandra Grant, and with their permission, we are sharing it here for Codeship readers.”

I’ve dabbled in JavaScript since college, making a few web pages here and there, and while JS was always an enjoyable break from C or Java, I regarded it as a fairly limited language, imbued with the special purpose of serving up animations and pretty little things to make users go “ooh” and “aah”. It was the first language I taught anyone who wanted to learn how to code because it was simple enough to pick up and would quickly deliver tangible results to the developer. Smash it together with some HTML and CSS and you have a web page. Beginner programmers love that stuff.

Then something happened two years ago. At that time, I was in a researchy position working mostly on server-side code and app prototypes for Android. It wasn’t long before Node.js popped up on my radar. Backend JavaScript? Who would take that seriously? At best, it seemed like a new attempt to make server-side development easier at the cost of performance, scalability, etc. Maybe it’s just my ingrained developer skepticism, but there’s always been that alarm that goes off in my brain when I read about something being fast and easy and production-level.

Then came the research, the testimonials, the tutorials, the side-projects and 6 months later I realized I had been doing nothing but Node since I first read about it. It was just too easy, especially since I was in the business of prototyping new ideas every couple months. But Node wasn’t just for prototypes and pet projects. Even big boy companies like Netflix had parts of their stack running Node. Suddenly, the world was full of nails and I had found my hammer.

Fast forward another couple months and I’m at my current job as a backend developer for Digg. When I joined, back in April of 2015, the stack at Digg was primarily Python with the exception of two services written in, wait for it, Node. I was even more thrilled to be assigned the task of reworking one of the services which had been causing issues in our pipeline.

Our troublesome Node service had a fairly straightforward purpose. Digg uses Amazon S3 for storage which is peachy, except S3 has no support for batch GET operations. Rather than putting all the onus on our Python web server to request up to 100+ keys at a time from S3, the decision was made to take advantage of Node’s easy async code patterns and great concurrency handling. And so Octo, the S3 content fetching service, was born.

Node Octo performed well except for when it didn’t. Once a day it needed to handle a traffic spike where the requests per minute jump from 50 to 200+. Also keep in mind that for each request, Octo typically fetches somewhere between 10–100 keys from S3. That’s potentially 20,000 S3 GETs a minute. The logs showed that our service slowed down substantially during these spikes, but the trouble was it didn’t always recover. As such, we were stuck bouncing our EC2 instances every couple weeks after Octo would seize up and fall flat on its face.

The requests to the service also pass along a strict timeout value. After the clock hits X number of milliseconds since receiving the request, Octo is supposed to return to the client whatever it has successfully fetched from S3 and move on. However, even with a max timeout of 1200ms, in Octo’s worst moments we had request handling times spiking up to 10 seconds.

The code was heavily asynchronous and we were caching S3 key values aggressively. Octo was also running across 2 medium EC2 instances which we bumped up to 4.

I reworked the code three times, digging deeper than ever into Node optimizations, gotchas, and tricks for squeezing every last bit of performance out of it. I reviewed benchmarks for popular Node web server frameworks, like Express or Hapi, vs. Node’s built-in HTTP module. I removed any third-party modules that, while nice to have, slowed down code execution. The result was three one-off iterations, all suffering from the same issue. No matter how hard I tried, I couldn’t get Octo to time out properly, and I couldn’t reduce the slowdown during request spikes.

A theory eventually emerged and it had to do with the way Node’s event loop works. If you don’t know about the event loop, here’s some insight from Node Source:

Node’s “event loop” is central to being able to handle high throughput scenarios. It is a magical place filled with unicorns and rainbows, and is the reason Node can essentially be “single threaded” while still allowing an arbitrary number of operations to be handled in the background.

Not-So Magic Event Loop Blocking (X-Axis: Time in milliseconds)

You can see when all the unicorns and rainbows went to hell and back again as we bounced the service.

With event loop blocking as the biggest culprit on my list, it was just a matter of figuring out why it was getting so backed up in the first place.

Most developers have heard about Node’s non-blocking I/O model; it’s great because it means all requests are handled asynchronously without blocking execution or incurring any overhead (like with threads and processes), and as the developer you can be blissfully unaware of what’s happening in the backend. However, it’s always important to keep in mind that Node is single-threaded, which means none of your code runs in parallel. I/O may not block the server, but your code certainly does. If I call sleep for 5 seconds, my server will be unresponsive during that time.

And the non-blocking code? As requests are processed and events are triggered, messages are queued along with their respective callback functions. To explain further, here’s an excerpt from a particularly insightful blog post from Carbon Five:

In a loop, the queue is polled for the next message (each poll referred to as a “tick”) and when a message is encountered, the callback for that message is executed. The calling of this callback function serves as the initial frame in the call stack, and due to JavaScript being single-threaded, further message polling and processing is halted pending the return of all calls on the stack. Subsequent (synchronous) function calls add new call frames to the stack…

Our Node service may have handled incoming requests like a champ if all it needed to do was return immediately available data. But instead it was waiting on a ton of nested callbacks, all dependent on responses from S3 (which can be god awful slow at times). Consequently, when any request timeout happened, the event and its associated callback were put on an already overloaded message queue. While the timeout event might occur at 1 second, the callback wasn’t getting processed until all other messages currently on the queue, and their corresponding callback code, finished executing (potentially seconds later). I can only imagine the state of our stack during the request spikes. In fact, I didn’t need to imagine it. A little bit of CPU profiling gave us a pretty vivid picture. Sorry for all the scrolling.

The flames of failure

As a quick intro to flame graphs, the y-axis represents the number of frames on the stack, where each function is the parent of the function above it. The x-axis has to do with the sample population more so than the passage of time. It’s the width of the boxes that shows the total time on-CPU; greater width may indicate slower functions, or it may simply mean that the function is called more often. You can see in Octo’s flame graph the huge spikes in our stack depth. More detailed info on profiling and flame graphs can be found here.

In light of these realizations, it was time to entertain the idea that maybe Node.js wasn’t the perfect candidate for the job. My CTO and I sat down and had a chat about our options. We certainly didn’t want to continue bouncing Octo every other week and we were both very interested in a promising case study that had cropped up on the internet: Handling 1 Million Requests per Minute with Go

If the title wasn’t tantalizing enough, the topic was on creating a service for making PUT requests to S3 (wow, other people have these problems too?). It wasn’t the first time we had talked about using Golang somewhere in our stack and now we had a perfect test subject.

Two weeks later, after my initial crash course introduction to Golang, we had a brand new Octo service up and running. I modeled it closely after the inspiring solution outlined in Malwarebytes’ Golang article; the service has a worker pool and a delegator which passes off incoming jobs to idle workers. Each worker runs on its own goroutine, and returns to the pool once the job is done. Simple and effective. The immediate results were pretty spectacular.
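The worker-pool pattern described above can be sketched in a page of Go. Names and sizes here are illustrative, not Digg’s actual code:

```go
// A fixed pool of workers pulls jobs off a channel, so concurrency stays
// bounded no matter how fast requests arrive; each worker "returns to the
// pool" simply by looping back to receive the next job.
package main

import (
	"fmt"
	"sync"
)

// Job stands in for one S3 key to fetch.
type Job struct{ Key string }

func worker(id int, jobs <-chan Job, results chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		// The real service would perform the S3 GET for job.Key here.
		results <- fmt.Sprintf("worker %d fetched %s", id, job.Key)
	}
}

func main() {
	jobs := make(chan Job, 100) // the delegator's queue of incoming work
	results := make(chan string, 100)

	var wg sync.WaitGroup
	for w := 1; w <= 4; w++ { // a pool of four workers, each on its own goroutine
		wg.Add(1)
		go worker(w, jobs, results, &wg)
	}

	for i := 0; i < 10; i++ { // hand incoming jobs off to idle workers
		jobs <- Job{Key: fmt.Sprintf("key-%d", i)}
	}
	close(jobs)
	wg.Wait()
	close(results)

	for r := range results {
		fmt.Println(r)
	}
}
```

The key contrast with the Node version: slow S3 responses only tie up their own goroutines, which the runtime schedules across OS threads, so a backlog never stalls timers or other requests the way a clogged event loop does.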

A nice simmer

Our average response time from the service was almost cut in half, our timeouts (in the scenario that S3 was slow to respond) were happening on time, and our traffic spikes had minimal effects on the service.

Blue = Node.js Octo | Green = Golang Octo

With our Golang upgrade, we are easily able to handle 200 requests per minute and 1.5 million S3 item fetches per day. And those 4 load-balanced instances we were running Octo on initially? We’re now doing it with 2.

Since our transition to Golang we haven’t looked back. While the majority of our stack is (and probably will always be) in Python, we’ve begun the process of modularizing our code base and spinning up microservices to handle specific roles in our system. Alongside Octo, we now have 3 other Golang services in production which power our realtime message system and serve up important metadata for our content. We’re also very proud of the newest addition to our Golang codebase, DiggBot.

This is not to say that Golang is a silver bullet for all our problems. We’re careful to consider the needs of each of our services. As a company, we make the effort to stay on top of new and emerging technologies and to always ask ourselves, can we be doing this better? It’s a constantly evolving process and one that takes careful research and planning.

I’m proud to say that this story has a happy ending as our Octo service has been up and running for a couple months with great success (a few bug fixes aside). For now, Digg is going the way of the Gopher.


Join the Discussion



Christiaan

Are these issues with node.js single-threading and performance eliminated by ‘serverless’ implementations of node like AWS Lambda/Google CloudFunctions/IBM OpenWhisk etc? Are your problems mainly around running it all on two node servers, instead of using a serverless service?

Jay Looney

Can you describe this in more detail? It’s new to me. How do you run “serverless” server code?

CB

@zenware:disqus ‘serverless’ is a bit of a misnomer. It’s ‘less’ in that the developer is not managing the server. Instead, a single process or thread equivalent is invoked on a server managed by AWS/Google to run your particular function just that one time. It’s interesting in that it’s yet another layer of abstraction, easy to get started and touts similar scaling benefits to other managed services like Heroku etc. The pricing model is compelling as well, in that you only pay for compute time used instead of the VM instance itself.

Jay Looney

Wow, that’s really interesting. I think as another layer of abstraction it sounds like it could potentially get in the way of integration, maybe I’m wrong. I’ll definitely investigate it and see if I can’t deploy it somewhere just to get a feel for myself.

That does sound pretty interesting, but it also sounds like a very oddly isolated way to develop. At some point, a developer becomes accustomed to a file having several functions and if they start using OOP (which seems out of the question in this environment) then they have many methods on a class. And with serverless only being able to spin up a function at a time, that leaves a developer only able to interact with the code powering a single function at a given time. I feel like there might be many applications where this constraint is too restrictive.

Sebastian Acuña

In this specific example, it sounds more like the benefit was switching the process model from an overeager async I/O slurp to a more controlled distributed worker model. In other words, wouldn’t you expect similar results if you implemented the same distributed worker model in Node? Why did the language matter? Is the point that such a model is a more comfortable/natural fit in Go?

notbrain

This was exactly what I was thinking the whole time reading, would love to see if a different optimization of node with multiple workers (each with its own processor core) could approach the same performance.

In Go, this distributed worker model is all executed inside a single app using goroutines. In Node, even if you tried a distributed worker model inside the same app, it would still be executing on a single thread. Now if you wanted to have separate worker applications and use a message queue between them (or something similar), I suspect you would achieve the same results. But like you said, Go is a more natural fit, with built-in queues (channels) and goroutines that the runtime schedules across more threads when necessary.

Sebastian Acuña

Thanks, that makes sense

bressonnemesis

Goroutines are multi-threaded as opposed to Node’s always single threaded?

rideon88

Goroutines are created/queued for execution correctly based on the number of processes on the machine right? You could have just built a more robust FIFO queue and Node could have handled the requests better.

I have to agree. While I see the use case for Go here, this sounds exactly like what node-worker-farm was meant to be used for, which was built by a core Node contributor. Essentially, it’s a messaging process that spawns a separate process for each unit of work that needs to be done, which gives it the durability and flexibility that is being looked for.

Not to say that Go isn’t great at this sort of thing – It is. Just that there are ways to get a similar level of concurrency from Node. I have doubts that it’d reach the same heights of speed that Go does, even with the worker farm, still. :)

Great article either way. Thanks for the write up.

Alex Mills

The Libuv threadpool in Node.js represents a distributed worker model?

Willem Goudsbloem

Very good article, I already suspected the outcome and it is always great to see when someone validates this in the real world. Most people in our industry don’t look at all the components involved and the real computer science. They seem to believe high level languages like node can solve problems somehow. Go is becoming a very viable low(er) level programming language in the likes of C/C++ and gives power back to the programmer. Again, great article Alexandra!

Tony Garcia

“They seem to believe high level languages like node can solve problems somehow”. Actually, high level languages like JavaScript (node) (or Python, or Ruby) can solve LOTS of problems. You just have to know which problems to apply them to.

Willem Goudsbloem

There is nothing a high level language can solve that a lower level one can’t. And the article clearly proves that using Node is maybe easy but comes at the cost of performance, and if the language execution engine doesn’t get that fixed for you, there is nothing you as a developer can do.

Tony Garcia

The article clearly proves that _for_this_particular_problem_ Go may have been a better solution than node. Not all problems or applications require super-high performance or massive concurrency and sometimes the developer experience of using a high level language over something like C++ is a worthwhile tradeoff if those other factors aren’t critical.

Luke Autry

There is plenty to be said about Node’s shortcomings, especially in the performance department. I’ve used Node and Go extensively.

From a development perspective though, Go leaves a lot to be desired – especially if you’ve been using TypeScript on top of Node as I have. There’s simply no way to write succinct, functional style code in Go. For instance, something like users.map(u => doThingsToUser(u)) would probably require 10 lines of code in Go. And, of course, there are no generics, so when you want to reuse code, you end up doing some really counter-intuitive things to make it all work.

I think there are strong use cases for Node and Go. My gripe with Go is how clunky it is when writing the real business logic in your app.

Victor Ivri

Should I say… Scala? ;)

Alex Mills

The only valid opinion – someone who has actually used both!

Jeremy Wight

Great article. This is really helpful in evaluating the use case. Thank you!

Vishal Chougule

Did you consider the option of Vert.x, which is becoming a competitor to Node but can spawn multiple event loops (generally equal to the number of cores)?

m8

“If I call sleep for 5 seconds, my server will be unresponsive during that time” – What do you mean by sleep? In javascript it is setTimeout, right? It does not block the thread. That is the sole design goal of Node.js. Would you be willing to try for an alternative solution under Node.js, just for fun? Can you reach out to me – murukesh @ gmail?

justin

Right, you are good with the native setTimeout; however, do not use something like this…

This blocks the entire event loop! Unless of course, that is what you intend.

Zach

If I read your article correctly, you ran all of this on a single instance of Node? As some others have pointed out, Node is single-threaded, so it’s not going to use more than one CPU core by default. If this single core maxes out and you’ve still got requests coming in, the stack is probably going to grow really quickly and you’ve got a mess on your hands. The same applies to any scenario when all of your cores are maxing out, but this is less likely to happen when you’re using all of your cores.

As of Node v6.x there is an out-of-the-box solution for scaling Node to leverage multiple cores for throughput scenarios like this one: the Clustering module. If you use a production process manager like PM2, just run this command on a multi-core machine to leverage all cores:

pm2 start app.js -i max

You may need to tweak your application design to allow all cores to be used in parallel, I think most TCP connections will load-balance but I can’t say for sure. It has worked well for me when using NSQ.io.

What I like about this approach is that it puts a lot of power in the hands of someone who needs to do both front and back-end type work; it’s tough to juggle too many active languages. Surely a lower-level language is going to be better for squeezing every last drop out of performance though.

GorillaFart

Was this issue not fixed with promises?

Graham Perich

No, promises are just a layer of abstraction on top of traditional callback chaining. Think of it as just a cleaner syntax. It doesn’t add new functionality or remedy old problems with JS. It just makes asynchronous code easier to read and write. JavaScript is fundamentally limited by the Event Loop, and Promises don’t change that.

NickG

I “like” how everyone assumes she missed some OBVIOUS feature of Node and somehow failed to implement or try it (like clustering). It’s clear that she was pushing Node to the limit, coming up with fixes beyond the obvious, trying to exploit any possible performance boost. The simple truth is that Golang was built by google to handle insane amounts of traffic. The built-in concurrency (not to be confused w/ parallelism) utilizes resources in a fundamentally more efficient way. I think the V8 engine itself is the bottleneck to Node. I’m sure someone out there will create a new engine (maybe in golang) that runs something like Node just so JS devs can avoid learning a new language. I’m serious about that last bit, it’s not an attack on the JS community.

Alex Mills

The V8 engine is in C++ homie.

NickG

Guess it was just Node itself that was her problem. Regardless, if someone found a solution that fixes their problems elegantly, I don’t see the use in so many trolls doubting their ability. Things get ad hominem real quick.

Alex Mills

I rarely go full ad-hominem! :)

DR01D

Brilliant article! However as a node.js Noob my first thought was, “I wonder if someone will produce a module to fix this.”