state of ActiveRecord and concurrency, another update

Over the years I’ve blogged about my experience with ActiveRecord concurrency. Those posts continue to get a lot of hits, and occasionally comments and emails, so here is another installment with things I’ve learned or things that have changed since last time.

Concurrent support back in master/rails4 ConnectionPool, great

Since my last update, @tenderlove changed his mind and restored blocking into ConnectionPool in rails4, so you can again create a pool of N connections, which serves M threads where M > N, and have it work out.

Great. Thanks tenderlove.

Fair Queueing greatly improves M-thread-to-N-connection behavior.

However, my rosy outlook in my last update about the state of M-thread-to-N-connection concurrency in rails 3-2 was premature.

For some context: let’s say you have a bunch of threads that want to share some connections through AR, potentially more threads than there are connections. The ConnectionPool contract theoretically supports this. However, the only sane way to do it (IMO) within the AR contract is to make each thread (or at least each thread that’s NOT a standard rails request handling thread, but one you created yourself) explicitly check out a connection with `with_connection`, holding it for as short a time as it can.
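To make the pattern concrete, here is a minimal stdlib-only sketch of M threads sharing N connections. `ToyPool` and `run_workers` are hypothetical names invented for illustration and the "connections" are just strings; this is not AR's ConnectionPool, only the same check-out, use briefly, check-in shape that `with_connection` gives you.

```ruby
require "thread"

# Hypothetical stand-in for a connection pool (NOT ActiveRecord's class).
class ToyPool
  def initialize(size)
    @idle = Queue.new
    size.times { |i| @idle << "conn-#{i}" }
  end

  # Borrow a connection for the duration of the block only.
  def with_connection
    conn = @idle.pop          # blocks until another thread checks one back in
    yield conn
  ensure
    @idle << conn if conn     # always return it, even if the block raised
  end

  def idle_count
    @idle.size
  end
end

# Hypothetical helper: start more threads than there are connections.
def run_workers(pool, thread_count)
  results = Queue.new
  threads = thread_count.times.map do |n|
    Thread.new do
      prepared = n * 2        # work that needs no connection stays outside the block
      pool.with_connection do |conn|
        results << [n, conn, prepared]   # stand-in for one short query
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

Calling `run_workers(ToyPool.new(2), 8)` lets all eight threads finish on two connections, because each thread holds a connection only for the short block.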

I was doing this. The problem is that some threads were being starved for connections and timing out even though a connection reasonably should have been available to them. It’s complicated to describe the scenario, but it all ended up coming back to the fact that ruby’s mutex primitives are not “first in first out”. Let’s say a thread is waiting on a mutex. Over the next second, 5 more threads start waiting on the same mutex. Half a second after that, the resource becomes available and the mutex ‘signals’. That first thread that was waiting doesn’t necessarily get access to the mutex; it can get ‘jumped’ in line. And this can happen multiple times, until eventually the first thread waiting times out and gives up, even though if things had been ‘first in, first chance at mutex’, all the contenders could have happily completed their mutex work within their timeouts.

I think Java’s thread primitives behave the same way, so it’s not unique to ruby.

But in this particular case, it causes problems. I submitted a patch to Rails, which was accepted, that works around the issue and improves things at least somewhat. But it wasn’t good enough: my production app still required a lot more connections than it should have to avoid timing out, in cases where fewer connections would have been fine if contention had been first-in-first-access.

Fortunately, just as I was tearing my hair out wondering what to do, @pmahoney showed up with a patch to Rails master to fix this issue correctly with a clever fair thread-safe queue implementation, which involved concurrency programming of a trickier sort than I would have been able to come up with.
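To illustrate what ‘fair’ means here, below is a rough stdlib-only sketch of a first-in-first-served pool. It is my own toy and not @pmahoney’s actual patch: `FairPool`, `Waiter` and `fair_order_demo` are hypothetical names, and the technique assumed is a direct hand-off, where a checked-in resource is given straight to the oldest waiter so a newcomer can’t jump the line.

```ruby
require "thread"

# Hypothetical toy pool: waiting threads are served strictly in arrival order.
class FairPool
  class TimeoutError < StandardError; end

  Waiter = Struct.new(:cond, :resource)

  def initialize(resources)
    @mutex = Mutex.new
    @idle = resources.dup
    @waiters = []               # oldest waiter first
  end

  def checkout(timeout)
    @mutex.synchronize do
      # With direct hand-off, an idle resource implies nobody is queued.
      return @idle.shift unless @idle.empty?

      waiter = Waiter.new(ConditionVariable.new, nil)
      @waiters << waiter
      deadline = monotonic_now + timeout
      until waiter.resource     # loop also guards against spurious wakeups
        remaining = deadline - monotonic_now
        if remaining <= 0
          @waiters.delete_if { |w| w.equal?(waiter) }
          raise TimeoutError, "no resource within #{timeout}s"
        end
        waiter.cond.wait(@mutex, remaining)
      end
      waiter.resource
    end
  end

  def checkin(resource)
    @mutex.synchronize do
      if (waiter = @waiters.shift)
        waiter.resource = resource   # hand over directly to the oldest waiter
        waiter.cond.signal
      else
        @idle << resource
      end
    end
  end

  def waiting_count
    @mutex.synchronize { @waiters.size }
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Hypothetical demo: queue up waiters in a known order, then release the resource.
def fair_order_demo(waiter_count = 5)
  pool = FairPool.new([:conn])
  held = pool.checkout(1)
  order = Queue.new
  threads = waiter_count.times.map do |i|
    t = Thread.new do
      conn = pool.checkout(5)
      order << i
      pool.checkin(conn)
    end
    sleep 0.001 until pool.waiting_count == i + 1   # make arrival order deterministic
    t
  end
  pool.checkin(held)
  threads.each(&:join)
  Array.new(waiter_count) { order.pop }
end
```

Because each waiter gets its own condition variable and the resource is assigned before the signal, the wake-up order no longer depends on how the runtime happens to schedule threads contending for one shared lock.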

I’m not running rails4. But I took pmahoney’s patch, and hackily monkey patch backported it into AR in my rails 3.2 app. And I put it in production. And it worked marvelously. For about two months now, it’s been running, and finally my connection use under concurrency was reasonable, and threads weren’t timing out waiting on connections.

I tried to submit a patch to backport pmahoney’s fair queue in ConnectionPool to rails 3-2. But we ran into weird problems, where it made the AR test suite fail in ways that shouldn’t have been affected by the change. @rafaelfranca helped out with a bunch of them, but we were left with one failing test (related to pessimistic locking with postgres) that we couldn’t figure out.

Rails 4.0.0 will be out at some point in the not too distant future. But while you’re stuck with Rails 3.2, if you are trying to use concurrency, I strongly recommend monkey patching a version of @pmahoney’s fix into your app’s ConnectionPool. You can use the diff from the attempted backport as a guide. It didn’t (yet?) get backported into rails 3.2, but the only failing test in the end was around pessimistic locking, and only with postgres. If you don’t use both postgres and pessimistic locking, I wouldn’t worry about it. Really, I wouldn’t worry much even if you do (although in general pessimistic locking is not a great solution; it’s a very niche feature).

(and if anyone with some time and some postgres-fu and concurrency-fu wants to help get the backport passing all tests in 3-2-stable, please!)

Of timeouts

Another issue concerns timeout values with the ConnectionPool. The default timeout value there is 5 seconds, meaning a thread will wait up to 5 seconds to get a checkout from the ConnectionPool. This is actually usually far too long for a properly performing app under most sorts of designs. If your thread has to wait more than even a second, it probably means you need either more connections in the pool, or fewer threads, or something else is wrong. The whole point of the many-threads-with-connection design is to keep checkouts short. If a thread has to wait longer than however long the longest checkout should take, your code is not functioning well, and you need more connections or fewer threads (or to rewrite your code to make its `with_connection` blocks more granular). So you might want to set this to a lower value.

The problem is that in Rails 3.2, you set this value with the key `wait_timeout` on your database connection spec (what you usually have in config/database.yml), defaulting to 5 seconds. But, uh oh, that very same `wait_timeout` key is also used by the mysql2 adapter for an entirely different value: setting the MySQL server’s own `wait_timeout` value, how long the server will allow an idle connection before closing it. That one is normally measured in hours (MySQL’s stock default is 28800 seconds, or 8 hours). That’s a clue that you usually won’t want your ConnectionPool checkout timeout to be the same as your MySQL wait_timeout, but rails 3-2 gives you no way to change either one away from its default without making them the same. If you change MySQL’s `wait_timeout` to like 1 second, you probably will have problems (with MySQL dropped connection exceptions all over the place).

We fixed this in master (to be rails4) by changing the ConnectionPool’s key to be `checkout_timeout` instead of `wait_timeout`. I just submitted a backport patch to rails 3-2, maybe it’ll make it into a future rails 3-2 release. In the meantime, be aware of the issue (instead of spending hours/days debugging and figuring it out when you run into it like I had to :) ).
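As a rough illustration of the rails4 split, here is a connection spec written as a Ruby hash, using the same keys you would put in config/database.yml. The constant name `DB_SPEC`, the database name and all the numbers are assumptions made up for the example, not recommendations.

```ruby
# Illustrative values only; the database name and numbers are hypothetical.
DB_SPEC = {
  adapter:          "mysql2",
  database:         "myapp_production",
  pool:             5,        # N connections shared by all threads
  checkout_timeout: 1,        # seconds a thread waits for a checkout (rails4 key)
  wait_timeout:     28_800    # seconds; used by the mysql2 adapter for the server-side idle timeout
}.freeze
```

With two separate keys, lowering the checkout wait to a second no longer drags the MySQL idle timeout down with it.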

Evaluating AR’s concurrency contract, comparing to alternatives

The slightly depressing thing is that even once we’ve got AR ConnectionPool working ‘correctly’ as advertised, its basic design, its API/contract for concurrency, is still kind of a pain.

If you are making multi-threaded use of AR, you want each thread to reserve a connection briefly for only as long as it needs it and return it to the pool for other threads to use. That’s just standard default good design for multi-threaded use of an exclusive resource, like a db connection, right? If you want to do that, you’ve got to wrap every single area of your code that will end up making AR use a database connection in a `with_connection` block.

This is a pain to begin with, with all those `MyModel.with_connection do`s to write. Making things more complicated and confusing, it can be difficult to predict which AR calls might require a trip to the database. You don’t normally have to think about that, and in fact not having to think about that is part of AR’s design. So you basically have to wrap any code that touches any AR model in a `with_connection` block, as even just accessing an association (or attribute?) might result in a trip to the db.

Worse is what happens if you forget or miss a `with_connection`. If you invoke some AR code that requires a trip to the db, and forgot to wrap it in a `with_connection`, AR will ‘helpfully’ automatically check out a connection to your thread for you. But then it’ll never get checked back in: you’ve ‘leaked’ a connection. Rails 3.2 tries to look for leaked connections and reclaim them on nearly every `checkout`, but this is an expensive thing to do, and master/rails4 doesn’t do it anymore, and in fact takes away any easy way of even identifying leaked connections in order to reclaim them. That could be a really big problem.

So this is not a great design in the first place. What are the other options? Well, as far as I can tell, both DataMapper and Sequel do this differently: you aren’t responsible for checking out connections yourself at all. The library takes care of it for you, doing the equivalent of a `with_connection` itself under the hood in any code that requires a db connection (including any `with_transaction` type code, which needs to keep the connection checked out for the whole block).

This makes things a lot more reasonable. It also makes things somewhat more expensive, with more mutex action going on under the hood on a very fine-grained (and thus potentially high-volume) level. But it’s really the only reasonable way to do things; it’s a necessary cost for multi-threaded use. I recall seeing that at least one of those frameworks (I forget which, DataMapper or Sequel, maybe Sequel?) lets you tell it you won’t be using more than one thread, and then it doesn’t bother with all its mutexes, to give you better performance.
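A rough sketch of that library-managed style, in plain Ruby, might look like the following. Everything here (`MiniConn`, `MiniPool`, `MiniStore`, the shared `rows` hash standing in for a database) is hypothetical and is not how Sequel or DataMapper are actually implemented. It only shows the shape: every operation borrows a connection internally, and a transaction pins one connection to the calling thread for the whole block.

```ruby
require "thread"

# Hypothetical toy connection: an id plus a shared hash standing in for the db.
MiniConn = Struct.new(:id, :rows)

class MiniPool
  def initialize(size, rows)
    @idle = Queue.new
    size.times { |i| @idle << MiniConn.new(i, rows) }
  end

  def with_connection
    conn = @idle.pop
    yield conn
  ensure
    @idle << conn if conn
  end

  def idle_count
    @idle.size
  end
end

# Hypothetical store: callers never touch the pool themselves.
class MiniStore
  def initialize(pool)
    @pool = pool
    @pin_key = :"mini_store_conn_#{object_id}"   # per-store thread-variable key
  end

  def fetch(key)
    hold { |conn| conn.rows[key] }
  end

  def store(key, value)
    hold { |conn| conn.rows[key] = value }
  end

  # Keeps one connection pinned to the calling thread for the whole block.
  def transaction
    hold { yield self }
  end

  private

  def hold
    pinned = Thread.current.thread_variable_get(@pin_key)
    return yield(pinned) if pinned               # already inside a transaction
    @pool.with_connection do |conn|
      Thread.current.thread_variable_set(@pin_key, conn)
      begin
        yield conn
      ensure
        Thread.current.thread_variable_set(@pin_key, nil)
      end
    end
  end
end
```

With a pool of one connection, `store.transaction { |s| s.store(:b, s.fetch(:a)) }` completes without deadlocking, because the nested calls reuse the pinned connection instead of asking the pool again. Note the sketch uses true thread variables rather than `Thread#[]` for the pin.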

So if you’re just starting a project that’s going to need multi-threading and databases, I’d recommend investigating DataMapper or Sequel as an alternative to ActiveRecord. No doubt they’ll each have their own bugs and things you wish had been designed differently too though, it’s always a trade off. But if I were starting a new project I knew had need of serious multi-threading, I’d definitely be considering it. (My existing complex project is probably stuck with AR for a while).

Could AR itself be modified to use that approach, doing `with_connection` under the hood itself every time it uses a database connection, instead of making the client code do it? Certainly, in theory, without even making any significant architectural changes to AR. You’ve just got to find every place in AR that uses a db connection (including `transaction`) and wrap it in a `with_connection`. The end-developer’s code would only need to call `with_connection` manually if it actually wanted to deal with a raw Connection object itself.

In theory. In actuality, I’m sure there would be catches. And it would take someone more familiar with AR than me to pull it off. And I think it may conflict with different ideas @tenderlove might have for the evolution of AR, I’m not really sure. At any rate, so far as I know nobody with both the time and AR expertise to pull this off seems to be pursuing this path at present.

Oh, and fibers

Another topic worth briefly mentioning, although I don’t understand all the details myself, is that as ‘fibers’ become more popular in various ways in ruby, it complicates things for AR too. Depending on exactly what you’re doing and how you’re using fibers, it might be just fine with AR, or it might cause serious problems with connections not being properly allocated to threads/fibers by AR ConnectionPool. For instance, the awesome Celluloid uses fibers in such a way that it can mess things up with AR ConnectionPool. (Unless you put all your AR use in an Actor in a Celluloid ‘exclusive’ block, which will cause fibers not to be used in the way that confuses AR).

Sorry I’m being vague and hand-wavy here, because I haven’t completely wrapped my head around what’s going on. Concurrency issues are confusing! I’m not sure what the solution is, but it’s something to keep your eye on. I’m also very curious as to how DataMapper or Sequel are affected (or not) by these issues, for instance whether things will work just fine with Celluloid Actors’ use of fibers with DataMapper and/or Sequel. It’s a difficult question to answer simply by automated testing, since it’s hard to tell if there are still potential race conditions not caught by your tests. You really need someone who understands both how Sequel (or DataMapper) does concurrency and connection pooling and how Celluloid (for instance) uses fibers, or a room with both people in it talking to each other. So far as I know that hasn’t happened yet, but I’m very curious.

Mixing the APIs for fiber-local and thread-local storage really causes a lot of trouble. That is, in the context of a non-root fiber, Thread#[] is actually fiber-local, not thread-local. Also, Fiber.current changes in the context of a non-main thread, as if an invisible fiber had been created. There’s also no Fiber.root, analogous to Thread.main, to give you the top-level fiber.

I used to embrace fibers, but given all those caveats, I started to think threads might be a better solution. At least people have historically known threads much better than fibers, I guess. Still, it would be good if all the caveats in fibers could be ironed out. I would be more than happy to see that happen, too.

Thank you for your work on threading in ActiveRecord. Much appreciated.