https://cjf.io/2015/06/12/improving-performance-of-appengine-mapreduce-using-batching/
Fri, 12 Jun 2015 13:30:18 GMT

At Khan Academy, we run most of our software on Google's appengine and store our data in Google's cloud datastore (a non-relational key/value-like store). If we need to go back and add a new field to one of our datastore models, we usually run a map[reduce] job (no reduce step required) over the objects in the datastore using the appengine-mapreduce library, adding the new field to each object in the process.

I recently found myself needing to add a new field to ProblemLog, our datastore model recording every problem ever done on Khan Academy. We have well over 3 billion of them in the datastore. I started to run the mapreduce job and estimated that in the best case, it would take a little over 6 weeks. In practice, because of uneven splitting of the objects among different processes, and transient datastore errors that cause partial restarts of the job, it might take much longer (or never finish at all).

One potential source of slowness is that each object is processed one at a time. A single process will make an RPC to fetch the object from the datastore, apply the change that adds the field, and then make another RPC to write the updated version. If those RPCs are the rate-limiting step, then we could potentially speed things up a lot by batching: fetch N objects in each process in parallel and then write them back in parallel too. In the best case, this speeds things up by a factor of N.

I went ahead and implemented this via a new mapreduce InputReader. (In appengine's mapreduce, there's an InputReader hierarchy of classes controlling fetching the data and passing it to the processes doing the mapping.) The standard InputReader I was previously using would pass datastore keys to the mapper processes one at a time. I modified it to pass a lazy python generator that would yield up to N keys per process. (Doing this lazily ensured that if the mapper was interrupted partway through, we didn't miss any keys. I'm not actually sure if this was necessary, but better safe than sorry.) Using this, I could then modify the mapper process to perform the RPCs for fetching and storing the N objects in parallel.
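As a sketch of the core idea (this is hypothetical helper code, not the actual appengine-mapreduce InputReader), the lazy batching looks something like:

```python
def batched_keys(key_iter, batch_size):
    """Lazily yield lists of up to batch_size keys from key_iter.

    Because batches are built lazily, a mapper that is interrupted partway
    through never consumes keys it didn't get a chance to process.
    """
    batch = []
    for key in key_iter:
        batch.append(key)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# The mapper can then fetch and store the whole batch per round trip,
# e.g. with the datastore's multi-key RPCs, roughly:
#   for keys in batched_keys(reader, 100):
#       entities = db.get(keys)    # one RPC fetches the whole batch
#       for e in entities:
#           add_new_field(e)       # hypothetical per-entity change
#       db.put(entities)           # one RPC writes them all back
```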

After deploying the new input reader, I did a bit of casual performance testing. I started a mapreduce over our ProblemLogs with the old input reader as well as the new one with the batch size set to various values. I then let the job run for at least 10 minutes and read out the number of objects processed per second (this rate stabilized by about 5 minutes, so the timing didn't need to be exact; just longer than 5 minutes).

This is a plot of objects processed per second vs batch size. The new batching input reader is shown in the red solid line; the baseline from the old non-batching reader is shown as a blue dashed line.

The processing rate plateaued in this test at around 100 datastore keys per batch. Unfortunately, this only represents a relatively modest ~3.75x speedup -- far less than the maximum possible factor-of-N speedup for a batch size of N, but it still makes the job considerably more convenient to run! Of course, the position of the plateau and the relative speedup will vary considerably depending on what the mapper is actually doing, so this may look different for other mapreduce jobs.

In the end, I didn't actually run the complete mapreduce over ProblemLog (it would have been extremely costly, and we're pursuing alternate approaches). The new input reader has stuck around, though, and it's a nice little speedup for RPC-heavy mapreduces of any size.

https://cjf.io/2014/10/24/fixing-my-touchpad-in-ubuntu-14-10/
Fri, 24 Oct 2014 00:26:48 GMT

I upgraded to Ubuntu 14.10 today. It was mostly quick and easy, except my Dell XPS13's touchpad stopped working after the upgrade -- the hardware wasn't even being recognized, and it didn't appear in the output of xinput. It turns out this was a result of the kernel upgrade from linux 3.13 to 3.16.

After some playing around, I found that I needed to un-blacklist the kernel module i2c_hid (by deleting the file /etc/modprobe.d/blacklist-i2c_hid.conf). Then, after either restarting or running modprobe i2c_hid, the trackpad was recognized again. I suspect that I had previously blacklisted this module as a fix for some touchpad issue in an earlier kernel; a quick search indicates it's a fairly common suggestion for touchpad problems.
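For reference, the whole fix was just (run as root; the blacklist filename is what it was on my system and might differ on yours):

```shell
# Stop blacklisting the touchpad's kernel module
sudo rm /etc/modprobe.d/blacklist-i2c_hid.conf
# Load it immediately instead of rebooting
sudo modprobe i2c_hid
```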

https://cjf.io/2014/03/10/exceptional-exceptions/
Mon, 10 Mar 2014 06:00:37 GMT

This week I was introduced to the Elixir programming language. I'm super excited to try it out on a project. It's a scripting language built on erlang. The syntax looks a lot like ruby (yay!) but it's a functional language, feels a bit lispy, and it's got the concurrency benefits of erlang. Awesome.

I was reading the getting started documentation and came across the following in the section on exceptions:

Developers should not use exception values to drive their software. In fact, exceptions in Elixir should only be used under exceptional circumstances.

Notice that File.read does not raise an exception in case something goes wrong; it returns a tuple containing { :ok, contents } in case of success and { :error, reason } in case of failure.
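So reading a file idiomatically becomes a pattern match on the returned tuple; a minimal sketch (with a made-up filename):

```elixir
case File.read("config.txt") do
  { :ok, contents }  -> IO.puts contents
  { :error, reason } -> IO.puts "couldn't read file: #{reason}"
end
```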

This makes sense for a semi-pure functional language: if you're trying to minimize the side-effects of functions, the possibility of a nonlocal jump via an exception is probably a bad thing. In most cases, though, this felt a bit extreme to me. I remember when I first learned about java exceptions long ago, they felt like a breath of fresh air. No longer was it necessary to set some sort of integer return code -- instead you could raise a named exception. Exceptions were readable and useful, and in java they were part of the interface (public void myFunction() throws MyAwesomeException), so you could do error handling well.

This weekend, I encountered a lot of problems with exceptions in production propagating way outside of the scope where they were raised. The issue was my code's fault, not the fault of how exceptions are used. Nonetheless, this got me thinking about whether exceptions really are such a great thing. They can cause nonlocal jumps in code, which is generally bad because it can be hard to figure out how you got to where you are in the code. (This is one of the principal reasons that goto statements are a bad thing. Unlike exceptions, though, goto statements also make code exceptionally unreadable.) It's not good that a function I wrote can cause problems in code outside that function -- code that doesn't even use its results. My overpropagating exception problem essentially boiled down to a "file not found". Why did it have to be an exception? It's not unreasonable that a file might be missing. This shouldn't crash the whole program.

In all the programming languages I'm familiar with, there are generally two groups of exceptions: those that are expected to be caught by programs, and those that aren't. (e.g. in java RuntimeException and its subclasses vs. other exceptions). But if we're expecting some of the exceptions, why are we throwing them? Are they really exceptional? Is it worth the risk of nonlocal jumps to use them? Perhaps it works ok in java, where there are compile-time checks for non-runtime exceptions.

My issue was in python, however, which doesn't have such checks. (If you're willing to say that any language overuses exceptions, it's got to be python. Heck, iterators use exceptions to signal that they're done iterating! This is something normal that is expected from any finite iterator, not something exceptional!) While my code was at fault, it wouldn't have been a problem if a regular error that can be reliably dealt with didn't raise exceptions that can cause non-local program flow.
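The iterator point is easy to demonstrate:

```python
it = iter([1, 2])
print(next(it))  # 1
print(next(it))  # 2
try:
    next(it)  # the iterator is exhausted...
except StopIteration:
    # ...which python signals with an exception, even though running out
    # of elements is the completely ordinary outcome for a finite iterator
    print("done iterating")
```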

Maybe it's worth taking elixir's advice to heart: any time you're tempted to raise an exception, consider whether a different approach might work just as well without being as dangerous. Something to ponder, at least!

https://cjf.io/2014/03/05/adventures-with-the-ruby-garbage-collector-in-nmatrix/
Wed, 05 Mar 2014 05:48:30 GMT

NMatrix is a high-performance linear algebra library for ruby, currently entering its first beta version. Think NumPy, but for, you know, a sensible programming language. A few months ago, I worked on resolving some insidious segfaults resulting from the way we interacted with the ruby garbage collector in C code. I'm still no expert in ruby garbage collection in C modules, but we learned a lot of interesting things from the experience.

I'd meant to blog about this a long time ago and never got around to it, but I was revisiting some of this in preparation for GSoC 2014 (we've been selected as a mentoring organization!). Here's a perhaps meandering description of some of what we learned.

Garbage collection in ruby C modules

Initially I was a bit surprised that I might have to interact with the garbage collector at all from C code. Most of the time, though, it's a fairly pleasant experience, especially if you're passing ruby objects around between a lot of different functions. Interacting with the garbage collector means that you don't need to worry about keeping track of ruby objects and free()ing them when you're done with them.

Normally, if you're creating a ruby object from a custom class that you've defined in C code, you use the macro Data_Wrap_Struct(cls, mark_fct, free_fct, mystruct). In this function's signature,

cls is a VALUE (a typedef that is kind of like a pointer to a ruby object most of the time) that is the class object for the resulting object.

mystruct is a struct that is the C representation of the object.

free_fct is a function that the garbage collector calls to free the structure's storage. This is fairly straightforward and just needs to free up any storage in the struct that you previously malloced (or if you're being a good rubyist, that you previously allocated using ruby's macros ALLOC or ALLOC_N).

mark_fct is a function that takes one of these wrapped-up structs and tells the garbage collector how to keep track of (mark) its components.

mark_fct warrants a bit more discussion:

mark_fct is easy if your struct doesn't store any ruby objects; it can just be NULL. It gets slightly more complicated if your struct does store ruby objects internally. Ruby has a mark and sweep garbage collector. mark_fct is called during the garbage collector's mark phase, and its job is to in turn mark all internally stored objects. Ruby provides a number of convenient functions to do this. The simplest is rb_gc_mark(VALUE), which just takes a VALUE and marks it.
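Putting those pieces together, a generic sketch (hypothetical struct and names, not NMatrix's actual code) might look like:

```c
#include <ruby.h>

typedef struct {
  VALUE label;    /* a ruby object stored inside the C struct */
  double *data;   /* plain C storage */
} my_thing_t;

/* Called during the GC's mark phase: mark every VALUE we hold. */
static void my_thing_mark(void *p) {
  my_thing_t *t = (my_thing_t *)p;
  rb_gc_mark(t->label);
}

/* Called when the wrapper is collected: free our C storage. */
static void my_thing_free(void *p) {
  my_thing_t *t = (my_thing_t *)p;
  xfree(t->data);
  xfree(t);
}

/* Wrapping a struct t in a ruby object of (hypothetical) class cMyThing: */
/*   VALUE obj = Data_Wrap_Struct(cMyThing, my_thing_mark, my_thing_free, t); */
```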

We need to make sure that the marking function is associated with a struct via Data_Wrap_Struct right away because ruby garbage collection might potentially happen during any call to the ruby C API. (Actually, I think it may only happen when using the ruby memory allocation macros (or calling functions/using macros that use them), but there might be other cases I'm not aware of.)

This is all fairly straightforward so far and has the awesome (in my opinion) benefit of making some objects in C sort of garbage collected!
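But there are subtleties. Consider, for example, a loop along these lines (a sketch matching the description below):

```c
/* Yield each index to the calling ruby code; store the results in n. */
VALUE n[100];
long i;
for (i = 0; i < 100; ++i) {
  n[i] = rb_yield(LONG2NUM(i));  /* GC may run during any rb_yield */
}
```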

This is just a simple iterator (perhaps from an implementation of a map method) that yields an index to the calling code and stores the result back in the array n. At any point during the rb_yield, garbage collection might run and collect all our stored results in n because they're not in scope in the ruby code any more, and we haven't wrapped them in a struct with a marking function.

Not a problem! Ruby also does something seriously cool and looks on the stack for things that look like VALUEs or VALUE*s and marks them automatically. So we're still ok.

If this is some other data structure and not a VALUE or VALUE* on the stack, though, the garbage collector may run and collect any VALUEs in that struct. For example:
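A sketch of the dangerous case, with the VALUEs now stored in heap-allocated memory that the GC's stack scan can't see:

```c
typedef struct {
  VALUE *elements;
  long   len;
} holder_t;                     /* hypothetical struct for illustration */

holder_t *s = ALLOC(holder_t);
s->len = 100;
s->elements = ALLOC_N(VALUE, s->len);
for (long i = 0; i < s->len; ++i) {
  s->elements[i] = rb_yield(LONG2NUM(i));  /* GC may run here and collect
                                              earlier, unmarked elements */
}
```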

Then at some later point when these are used, boom: segfault. These are particularly nice because they're often nondeterministic and depend on when exactly the garbage collector runs, which in turn depends on how much free memory there is on the system.

One solution to this problem might be to wrap whatever data structure has the VALUEs using Data_Wrap_Struct before starting the loop, but if the struct you're constructing isn't finished, you can run into trouble with segfaulting during the marking loop as well.

A more commonly seen solution is just to put a pointer or a value on the stack. So in the last example, this would look like:
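Reconstructed as a sketch:

```c
/* s->elements is the heap-allocated VALUE array from the previous example */
VALUE *hn = s->elements;  /* copy the pointer onto the stack, where the
                             GC's conservative scan will find it */
for (long i = 0; i < s->len; ++i) {
  s->elements[i] = rb_yield(LONG2NUM(i));
}
```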

This works great... until you turn on compiler optimizations. NMatrix is intended to be a high-performance scientific computing library, so we compile with pretty aggressive optimizations. This unfortunately means that it's hard to guarantee that anything is on the stack when you think it is, and in simple cases like this last one, the variable hn will almost certainly be optimized away entirely since it's never used.

Volatile variables

C and C++ have a rather cryptic keyword volatile (note: very different from java's volatile) intended for declaring variables that might be changed by hardware. Essentially, volatile tells the compiler that the value of this variable might be changed by some code or hardware about which it has no knowledge, so it shouldn't do any optimizations that assume when it will and won't be changed. This happens to mean that volatile is also useful for this unintended case:
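In a sketch of the last example:

```c
/* s->elements is the heap-allocated VALUE array from before; volatile
   tells the compiler it can't optimize hn away, so the GC's stack scan
   will always find it */
volatile VALUE *hn = s->elements;
for (long i = 0; i < s->len; ++i) {
  s->elements[i] = rb_yield(LONG2NUM(i));
}
```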

Now this works fine: the garbage collector will not collect the array of values being returned from the rb_yield, and thus no segfault!

However, this is not ideal for several reasons:

First, I'm not a huge fan of using volatile for something other than its intended purpose. While it's probably fine, it may mean that it's not doing precisely what we think it is in all cases, and it's also possible that the behavior may change in the future.

Second, we have to add a lot of useless volatile variables everywhere. On the one hand, this isn't really adding a lot of extra code, since we're going to have to add something to fix the problem. On the other, this particular solution massively detracts from the readability of the code. While I was researching this problem in other projects, almost every time I saw someone introduce volatile variables for this reason, I saw someone else comment something to the effect of "wtf is this doing?!? what does this mean?". Moreover, when you have complicated data structures with lots of variables, this ends up being a lot of volatile variables. Both the readability problem and the proliferation of variables could potentially be mitigated with liberal use of preprocessor macros.

Third, we're preventing compiler optimizations! Not all of them, but some. This was something we'd hoped to avoid. It's not clear to me how much of a performance hit this actually is. I didn't profile because at the time the code was segfaulting in the relevant places, and I was pretty confused about what was going on, so I didn't have a good baseline. It'd be interesting to go back and see whether this is actually any slower now that I've got a better handle on what's going on.

Registering ruby objects

The solution we arrived at in the end is actually pretty simple: create a static structure wrapped with Data_Wrap_Struct before the code uses any VALUEs. Then any time there's a VALUE being used in a place where it could potentially cause problems, add it to the static structure so the garbage collector can find it, and then remove it when the danger has passed. We implemented this static structure with a linked list acting as a stack (since most of the time VALUEs could be removed in the reverse order that they were added). We then added functions nm_gc_register(VALUE) and nm_gc_unregister(VALUE) that add a VALUE to the stack and remove it, respectively. We also added helper functions for registering more complicated data structures.
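In use, the pattern looks something like this (a hypothetical sketch, not actual NMatrix code):

```c
VALUE results = rb_ary_new();  /* a ruby object we'll hold across risky calls */
nm_gc_register(results);       /* push it onto the static, marked stack */

/* ... allocations and ruby API calls that might trigger GC ... */

nm_gc_unregister(results);     /* pop it off once the danger has passed */
```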

This does cause some slowdown in the code because of the extra function calls. At this point, I had figured out enough to be able to profile it. In the places using nm_gc_(un)register the most, computations took a little under 5% longer. In other places, the effect was smaller. Not bad, but something to keep an eye on in case we end up making performance improvements elsewhere that make the registration a more significant time sink.

There are a lot of cases where we just didn't know enough about compiler optimizations to tell whether it was going to be necessary to register a VALUE or not. For instance, is there ever a case where an optimization could occur that would cause one of the parameters of a function not to be on the stack? I have no idea. For these cases, we made a preprocessor macro called NM_CONSERVATIVE() that just deletes whatever is in the parentheses, and we put the registration/unregistration calls inside the macro. That way, they're not sitting around causing slowdown, but if we start seeing segfaults again, we can quickly redefine the macro and see if any of these cases are to blame. We can even write a script to remove them one at a time and see what particular registration (or lack thereof) is to blame.

If you want to have a look at all of this in action, check out the NMatrix source. The static data structure and helper functions are found in ext/nmatrix/ruby_nmatrix.c. The registration functions and NM_CONSERVATIVE are used all over the place, but that same file has several nice examples of usage too.

I particularly enjoyed the part on effect sizes: "Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect." I've been ranting at my colleagues about this for years: you may have measured so many cells that you can detect a 1% change in some signal with p < 0.01, but does that mean anything for the actual biological system?

Instead of demanding a particular p-value, do experiments to determine what effect size might actually matter for your system, and then use (more appropriate) statistics to figure out whether your measured effect is at least this large.