Sunday, October 2, 2011

Your Favorite Language is Probably Terrible at Concurrency too

The internet has been ablaze with posts on NodeJS, to some people's joy and to others' chagrin. Some have claimed that Node solves a long-standing problem in concurrency, saying:

People are starting to build more on Node.js because it’s superior and it solves these problems that have always existed. I/O has been done wrong for the last 30 years

In my opinion, Node is bad at concurrency, and guess what? Your language probably isn't any better. But let's make sure we're on the same page first.

Language/Framework - Most languages do not have concurrency as a first class citizen. So when I say "your language is bad at concurrency", what I really mean is "the options available for doing concurrent things in your language are bad". The former just rolls off your tongue better.

Concurrency - What do I mean by concurrency? I mean a model by which you can define actions that can happen at the same time. That could mean running multiple pieces of code in parallel or interleaving them. Specifically in this post I am concerned with solving problems where the number of things you want to do concurrently is significantly larger than the number of cores you have.

There are a lot of options for concurrency out there. You may have heard of things like Pi calculus, Join calculus, Communicating Sequential Processes, event loops, and coroutines. Your language probably has an implementation of one of these, or a conceptual subset. NodeJS and Twisted implement an event loop. Coroutines are the path Python's Gevent has taken, as well as libraries for Ruby, C, and C++. Go has chosen Communicating Sequential Processes. But all these distinctions aren't important unless I can say what I consider a good solution to concurrency.

Ideally, a good solution should have the following properties:

Scaling - If you are writing concurrent software you've already decided handling one thing at a time is not a scalable solution, so now you want to handle multiple things at a time. An ideal solution should scale to the limits of the machine. That means making use of multiple cores, if available.

Reasoning - It should be easy for a reader of your code to reason about what it does. Edge cases and gotchas should be limited. Preferably one shouldn't even be aware of the concurrent aspects of the code unless they need to be.

Debugging - Debugging should not be painful. Standard tools like stacktraces should be meaningful. Tracing the path a piece of code takes shouldn't be harder than launching the space shuttle.

My claim is that very few concurrent solutions meet these criteria. But let me be clear: I'm not saying these are the only criteria by which you should judge a solution. Is there a Python library that does basically everything you think you need, one that would be really hard to re-implement in another language? Well, maybe dealing with Python's concurrency shortcomings is less work than rewriting the library.

Scaling

Most languages were built for writing serial code. Memory is accessible by any piece of code in the process and it is assumed that nothing interesting happens between two function calls. But modern computers are not fast enough to do all the work programmers want them to do in serial and these languages have a lot of momentum behind them. For valid reasons, it is challenging to just move to another solution. Instead, we duct tape concurrency on top of these serial languages. One problem is that some of these languages can't even run code in parallel (that is, have two functions running at the same exact time) even if they wanted to. Python and OCaml have a global lock that restricts this. In other languages it's just too much coordination to do safely. In C and C++ it can be too hard and time-consuming to coordinate distributing concurrent work over multiple threads. For this reason, many mainstream solutions to concurrency are limited to running on a single core. It's insane, right? I can buy a laptop with, ostensibly, 8 cores now, yet a program written in most mainstream languages cannot make use of more than one.

For this reason, most solutions fail to be scalable. Take NodeJS, Twisted, OCaml/Lwt, and Gevent: from the point of view of a user of these frameworks, their code not only cannot run on multiple cores, it depends on never doing so. Consider some Twisted code that downloads N web pages and appends the result to a list:

Ignoring my failure to handle failures, this code is acceptable Twisted, and it could not work if Python suddenly got the ability to run code on multiple cores and Twisted used it. The reason being, there is no coordination around the ret.append(s) line. What if two threads were to try to append to ret at the same time? NodeJS and Gevent have the same idea in mind. Almost no data access is surrounded by a mechanism to coordinate multiple pieces of code accessing it at the same time. The result is, none of the code using these frameworks can be run on multiple cores. If CPython or V8 got multicore support it would take a rewrite of all of the code to make use of it.
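To make the missing coordination concrete, here is a minimal sketch of what a multicore-safe version of that pattern would need. It is not Twisted code: it uses plain threads from the standard library as a stand-in, and the names gather, fetch, on_page and on_done are hypothetical. The lock around the append and the completion check is the coordination that single-threaded callback code leaves out.

```python
import threading

def gather(n_pages, fetch, on_done):
    """Collect n_pages results, then call on_done(results) exactly once."""
    results = []
    lock = threading.Lock()

    def on_page(body):
        # The append and the completion check form one atomic step.
        # Without the lock, two threads could both see the list as complete.
        with lock:
            results.append(body)
            finished = len(results) == n_pages
        if finished:
            on_done(list(results))

    threads = [threading.Thread(target=lambda i=i: on_page(fetch(i)))
               for i in range(n_pages)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Every callback that touches shared state would need the same treatment, which is the rewrite the paragraph above is talking about.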

But, you say, who cares? "I can just spin up N instances of my program, where N is the number of cores on my machine. I can easily scale that way". You can't even get concurrency right and now you want to move into distributed programming? Who are you fooling? But seriously, the problem is your code now needs to be "location aware". If you want to do something with object X, you have to be aware of where object X lives. This adds another layer of complexity to your system. Without a good way of communicating between instances you are limited to solving embarrassingly parallel problems or pushing the concurrency to another software layer. Either way, you aren't actually solving the problem with your framework. Luckily, a lot of what people want concurrency for is serving webpages, which requires almost no interprocess communication right now.

Reasoning

No matter how you slice it, writing concurrent code is hard. When it comes to serial code, looking at it and knowing what it does is as simple as understanding how each function operates given the current state of the program. But with concurrent code, the state of the program is changing while a function runs. Understanding a concurrent program involves understanding how the concurrent components are interacting with each other. Some solutions make this easier than others.

If you want to add proper error handling, the situation gets worse with callback code. Twisted has attempted to solve this by encapsulating code flow in an object called a Deferred, but the problem remains: a unit of work in callback-based code is not a function, as one is used to in serial code; it is the work to do between events. As the example code above showed, there isn't a function that connects to a db, does a query, and returns the result. There is a function to open a db connection, another function that runs when the connection is ready and issues the db query, and another function to handle the result. You have defined three functions where you previously needed one. More importantly, you have to define functions not because it makes your code clearer but because the framework requires it.
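A rough sketch of that three-function shape, written against a toy event loop rather than any real framework. Everything here (call_later, run_loop, open_connection, query, fetch_users) is a hypothetical stand-in, and the "database" just echoes its input.

```python
from collections import deque

events = deque()  # pending callbacks, run one at a time by run_loop()

def call_later(fn, *args):
    events.append((fn, args))

def run_loop():
    while events:
        fn, args = events.popleft()
        fn(*args)

# Hypothetical non-blocking "driver": each call returns at once and
# delivers its result to a callback on a later loop iteration.
def open_connection(name, callback):
    call_later(callback, {"db": name})

def query(conn, sql, callback):
    call_later(callback, ["row from %s: %s" % (conn["db"], sql)])

def fetch_users(on_result):
    # One logical operation, split into three functions by the framework.
    def on_connected(conn):
        query(conn, "SELECT name FROM users", on_rows)

    def on_rows(rows):
        on_result(rows)

    open_connection("app", on_connected)
```

Calling fetch_users(print) does nothing visible until run_loop() is called; the result only arrives after the loop has dispatched both intermediate callbacks.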

Given how negatively this affects code, there are a lot of attempted solutions. Twisted, for example, allows one to use the defer.inlineCallbacks decorator so a function can use generators to express asynchronous code. Our previous NodeJS code might look like this:
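The idea behind generator-based approaches can be sketched in plain Python. This is not how Twisted implements defer.inlineCallbacks; it is a minimal illustration with a toy event loop, and run_async, open_connection and query are hypothetical names. The generator yields "start this operation" functions, and a small driver resumes it with each result.

```python
from collections import deque

events = deque()  # toy event loop: pending callbacks

def call_later(fn, *args):
    events.append((fn, args))

def run_loop():
    while events:
        fn, args = events.popleft()
        fn(*args)

def open_connection(name, callback):
    call_later(callback, {"db": name})

def query(conn, sql, callback):
    call_later(callback, ["row from %s: %s" % (conn["db"], sql)])

def run_async(gen_func, on_result):
    """Drive a generator whose yielded values are functions taking a callback."""
    gen = gen_func()

    def step(value=None):
        try:
            start_operation = gen.send(value)  # resume until the next yield
        except StopIteration as stop:
            on_result(stop.value)              # the generator's return value
            return
        start_operation(step)                  # resume again when it completes

    step()

def fetch_users():
    # Reads top to bottom like serial code; each yield is a trip to the loop.
    conn = yield lambda cb: open_connection("app", cb)
    rows = yield lambda cb: query(conn, "SELECT name FROM users", cb)
    return rows
```

The three callbacks collapse back into one function body, but each yield is still a point where other code can run.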

The NodeJS community has been at work solving this problem for themselves too. One person added coroutines to V8, and gave it a C#-like syntax. OKCupid gave us TameJS. Both of these solutions have their problems which are deal breakers for many.

There are also less complete solutions, like Step. But library solutions like Step only give you access to a subset of the functionality you would get from the sequential code you really want to write. To get all of it you need a full CPS transformation (which is what TameJS gives you, at a cost to debugging). This is actually how the syntax extensions for OCaml/Lwt work. The previous NodeJS code might look like this in OCaml/Lwt (the relevant part is that lwt causes a CPS transformation to turn the code into the appropriate callback-based code):

This is one reason for Gevent/Eventlet's popularity in Python. Gevent uses coroutines to give you asynchronous code that looks sequential. The trick is, underneath the hood, some function calls actually result in all of the state for your current function call being saved, another one switched to, executed, rinse, repeat. Gevent has a cooperative scheduler that tries to intelligently decide which function to switch to.

Say you want to write the earlier NodeJS code in sequential Python. You might get:

How would this look in Gevent? Exactly the same. The openConnection and query functions have an I/O call which actually jumps back to the Gevent scheduler so it can do something else while the I/O happens.

But Gevent is not without its cost when it comes to reasoning about code. Consider this:

def foo(data):
    print(data.bar)
    do_something()
    print(data.bar)

Looking at this code, will the same value be printed twice? The answer is: no idea. Even though do_something does not take data as input, it could do something that causes Gevent to context switch to another function, another function which also has access to data and modifies it. There is no way to tell, simply by looking at the code, if it will context switch or not.
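The hazard can be reproduced without Gevent. In this sketch the cooperative scheduler is faked with a list of pending tasks, and blocking_io stands in for any I/O call that hands control back to the scheduler. All names here are hypothetical and the switch is simulated, not a real greenlet switch.

```python
class Data(object):
    def __init__(self):
        self.bar = 1

pending = []  # other tasks waiting for a chance to run

def blocking_io():
    # Stand-in for an I/O call in a cooperative framework: while this
    # task waits, the scheduler runs whatever else is ready.
    while pending:
        pending.pop(0)()

def do_something():
    blocking_io()  # nothing at the call site reveals this switch

def foo(data):
    seen = [data.bar]
    do_something()
    seen.append(data.bar)
    return seen

shared = Data()
# Another task that also holds a reference to the same object.
pending.append(lambda: setattr(shared, "bar", 2))
```

Here the first call to foo(shared) returns [1, 2]: the value changed between the two reads even though foo never passed data to do_something.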

Debugging

The previous Gevent code is printing out two different values for data.bar and you don't want this. How do you fix it? The first thing you might try, from your serial programming days, is a debugger. But that might not work very well. Why? You're in concurrent-land now, multiple things are happening at once! That means timing is important. If you set a breakpoint somewhere, you've disrupted the timing of events and your program could take a completely different path, not the one you want to debug.

If you're smart and you control access to data.bar through function calls, you can do some printf debugging. Perhaps print out a stacktrace when one modifies it. But let's say, even those prints are causing the timing of your program to change, so now data.bar is coming out as the same value at each print. What do you do?!

The point is, debugging concurrent code can be very hard. Event-loop code adds another problem to debugging: your code doesn't have a linear path. If you could visualize sequential code, it would be a line. You start at point A and do things in order to get to point B. If you have an error at any point, your callstack represents the path you took to get there. Event-loop code always needs to hit the event loop for a blocking call though. The callstack you see is always limited to the path from the last event you got. A callstack in the code handling a database query may not show how you got there. If that query is part of a piece of fairly generic code you don't have many leads to go on to track it down.
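A small sketch of the truncated callstack, again using a toy event loop with hypothetical names (run_loop, handle_request, on_query_done). The callback records the function names on its own stack using the standard traceback module.

```python
import traceback
from collections import deque

events = deque()  # toy event loop: pending callbacks

def run_loop():
    while events:
        fn, args = events.popleft()
        fn(*args)

captured = []

def on_query_done(rows):
    # Record the names of the functions on the call stack right now.
    captured.append([frame.name for frame in traceback.extract_stack()])

def handle_request():
    # Schedules the callback and returns; this frame is gone by the
    # time the callback actually runs.
    events.append((on_query_done, (["row"],)))

handle_request()
run_loop()
```

The recorded stack contains run_loop and on_query_done, but not handle_request, the function that actually caused the query to happen.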

Who got it right then?

Three languages come to mind: Erlang, Oz, Haskell. There are more out there but I'm not omniscient. In my opinion, these languages are capable of the three properties I previously mentioned. Right now you are probably rolling your eyes and saying "I should have known, one of THOSE guys". But my argument is conservative: based on the properties that I believe are important for a concurrent solution to be good, these languages excel (or are capable of it) at them. Real world problems contain more than just concurrency issues though, so this does not mean you're wrong to use a language that doesn't meet my criteria, but it does mean you are sacrificing something. Perhaps that sacrifice is acceptable. But don't fool yourself into thinking your language is not terrible at concurrency, because it probably is.

21 comments:

"this code is acceptable Twisted, and it could not work if Python suddenly got the ability to run code on multiple cores and Twisted used it. The reason being, there is no coordination around the ret.append(s) line. What if two threads were to try to append to ret at the same time?"

Why on earth couldn't list.append be made threadsafe? That would seem a basic requirement of free threading in (C)Python. list.append from multiple threads is safe in IronPython and Jython.

@Michael - It very well might be, but it's irrelevant. The list.append call is there just to show the basic problem, replace list.append with any multistep data modification and you're back to the same problem.

@Nick - I don't know Clojure well enough to say. I tried to limit myself to things I felt sure of.

What I would really like to see is the Haskell model (green threads + epoll + real threads) but with a reasonable language. There's work going on to bring something like this to LuaJIT, but then Lua as a language kinda sucks too.

Nice post. This is why I have moved on from non-blocking i/o and green threads to Actors as my preferred fundamental concurrency primitive. Actors don't use shared memory, can be scheduled with epoll and on multiple os threads, and have the additional advantage of decoupling message-send and message-receive with the mailbox, a property which I think is currently underappreciated.

"there is no coordination around the ret.append(s)" .. But there is. The Python run-time model requires that append() be thread-safe.

"Almost no data access is surrounded by a mechanism to coordinate multiple pieces of code accessing it at the same time" .. excepting run-time guarantees made by the language, and the parts of Python which use locks or other thread-safe protections. Python has been around for 20 years, and multi-threaded support isn't new.

"If CPython or V8 got multicore support it would take a rewrite of all of the code to make use of it." Really? How come most of the standard library works without modification on Jython, which is a Python implementation which does support multicore? How can PyPy talk about using Software Transactional Memory to get multicore support on existing Python code, without a rewrite?

When evaluating languages or platforms, I am NOT looking at what is convenient, instead I'm looking at what is possible.

The truth of the matter is that virtual machines like the JVM or .NET do not have global-interpreter locks, while also having really efficient garbage collectors and also providing the right primitives for dealing with non-blocking algorithms and data-structures.

You want user-level light cooperative threading on the JVM? Then you can do something that scales on multiple cores. You want transactional memory? You can also do that. You want actors and message-passing? No problem!

Of course, the language does stay in your way and most solutions involve byte-code manipulations and other workarounds. And it's pretty easy to fuck up your implementation too, but you know what SUCKS the most?

What sucks the most for me is when using a language powered by a shitty VM / runtime. And the problem with this is that people do get blinded by their usecases -- just because your web app is I/O bound that doesn't mean other apps don't have different requirements.

@Andrew - You are correct about list.append, I should have chosen a better example, but you are missing the larger point: code written for Twisted is not written to be thread safe. Replace list.append with anything you would normally place in a Lock and the issue should be clearer. The point is not that if Python got multicore support you couldn't write multicore code in it, it is that the model Twisted provides depends on running on a single core. I'm not saying you have to rewrite the entire standard library, I'm saying your Twisted code would need to be rewritten.

@Alex - I'm not sure you have disagreed with anything I have said. You have mentioned two language platforms, JVM and .NET, which are below the level of language that I am talking about. I am talking about languages, not platforms. I know Erjang is doing some impressive things on the JVM. I didn't explicitly mention anything on the JVM or .Net because I simply don't know those technologies well enough to say. But if you agree with my criteria for what makes a good concurrency solution then you should be able to apply it to whatever you come across and decide if it is a good concurrency solution. My list of languages, as I explicitly state, is not the complete list. It is just the list of ones I am aware of.

do_something could do things to data today; there is no way to know that in Python. Maybe do_something also has a reference to the data object, or maybe it introspects it from the stack frames. Gevent does make things a bit harder, but it is not something that could not happen before.

"you are missing the larger point: code written for Twisted is not written to be thread safe."

On the contrary, I'm pointing out that code written for Twisted can be thread-safe, and in fact this example is thread-safe. Twisted code can also be non-thread-safe, but there's no surprise there since it can't make Python itself be thread-safe.

I agree with your points about the difficulty of debugging and of working with the callback conceptual model, but the idea that Twisted's defers "depend on" single core is just wrong.

Deferreds are conceptually identical to C++ promises/Java futures, which are absolutely designed for multicore systems.

But that isn't what I said. I said that the way people write Twisted, data accesses do not have any coordination mechanism around them. People write Twisted with the intent of avoiding threading so, almost invariably, the code in a callback is presumed to not be preempted. Perhaps my word choice was poor but my point is valid: almost no Twisted code written right now could validly run over multiple cores.

Your example shows it, but I was distracted by the append(). The problem is actually with the "len(ret) == len(urls)". If it's running in two separate threads (and multicore is not important) then it's possible that two callbacks will both see that test to be true.

It can be solved with something like:

def _return(_, x=list(range(len(urls)))):
    if x.pop() == 0:
        d.callback(ret)

That falls back on the thread-safety guarantees made by Python.

Since I'm certain that your view of how people write Twisted code is correct, and since all Twisted code is unproven in multithreaded environments (I don't see viable Twisted support for Jython or IronPython), then your conclusion is inescapable.