Deep Dive Failure: Learning Through Investigation

November 15th, 2016

You may have noticed us talking about serverless architecture a lot recently. It’s no secret that running Node code in this manner is a highly efficient way to deploy very robust systems that are made up of many microservices that work together. I had a problem with the way Lambda handles Node dependencies, though, and I wanted to investigate a solution for it.

Beginning the Investigation

You, as the application maintainer, are required to build and package your dependencies before deploying to Lambda. That’s not a big deal for Node modules written entirely in JavaScript, but it falls apart as soon as native modules are in play (modules with C or C++ code that gets compiled at install time), because those modules must be built on the same kind of system they will run on. If you don’t have a Linux system similar to the Lambda stack, the modules may misbehave at best and fail to load at worst.
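To make the constraint concrete, here is a minimal sketch of a pre-build check that compares the build host with the deployment target. It assumes the Lambda target is 64-bit Linux; the names LAMBDA_TARGET and canBuildNativeModulesFor are made up for this illustration. A matching platform and architecture is necessary but not sufficient, since system library versions can still differ.

```javascript
// Assumed target for illustration: a 64-bit Linux runtime.
const LAMBDA_TARGET = { platform: 'linux', arch: 'x64' };

// Returns true when native modules compiled on `host` could plausibly
// load on `target`. Defaults to the machine running this script.
function canBuildNativeModulesFor(
  target,
  host = { platform: process.platform, arch: process.arch }
) {
  return host.platform === target.platform && host.arch === target.arch;
}

if (!canBuildNativeModulesFor(LAMBDA_TARGET)) {
  console.warn('Native modules built on this machine may not work on the target.');
}
```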

Amazon’s suggested solution is to spin up an EC2 box, build your native modules, then deploy that code. Given that the whole point of using Lambda is to avoid having to run and maintain servers, I wasn’t thrilled with this answer. I wanted to see if there was a way to build native modules that didn’t require me to manage more infrastructure.

tl;dr: There is a practical way to build native modules without spinning up a dedicated server. Use Travis or another CI system that runs in an AWS-like environment. That’s not the solution I explored, though. The solution I wanted to implement is either not possible or would require more up-front investment than is really worthwhile.

That said, I’d like to describe what I investigated and talk about how failures in research are sometimes just as valuable as successes. I may not have been able to build a clever and successful solution to the problem I was faced with, but I did learn a lot about Lambda, npm, and node-gyp along the way.

The Local Prototype

The design was simple in my mind: create a Lambda function that would (1) accept a list of dependencies in the same format that’s used by npm, (2) install those dependencies, and (3) zip up the result and send it to S3 so it could be downloaded later.
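The three steps can be sketched as a small pipeline. This is an illustration of the design only, not the code from this project: buildDependencies and the steps object (install, zip, upload) are hypothetical names, and the steps are passed in so the sketch does not depend on npm, a zip library, or the AWS SDK.

```javascript
// Hypothetical handler core. `steps` supplies the three operations:
//   install(dependencies) -> directory containing node_modules
//   zip(directory)        -> path of the archive
//   upload(archivePath)   -> wherever the archive ended up (e.g. an S3 key)
async function buildDependencies(event, steps) {
  const dependencies = event && event.dependencies;
  const isPlainMap =
    dependencies && typeof dependencies === 'object' && !Array.isArray(dependencies);
  if (!isPlainMap) {
    throw new TypeError('event.dependencies must look like { "name": "range" }');
  }

  const installDir = await steps.install(dependencies); // step 2
  const archivePath = await steps.zip(installDir);      // step 3a
  return steps.upload(archivePath);                     // step 3b
}
```

Keeping the steps injectable also makes it easy to run the flow locally with stubs before anything is deployed.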

I was able to build scaffolding in a matter of minutes and was confident I would have a fully working and deployed solution in a matter of hours. But then reality struck in the form of npm’s lightly documented programmatic API. The API for using npm from the command line is very well documented, and I could have rather easily forked child processes to run npm commands, but that’s not what I wanted to do. I wanted to require('npm') in my script and use it directly from within JavaScript.
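For comparison, the child-process route that I passed over might look roughly like the sketch below. The helper names (toInstallSpecs, npmInstallArgs, installWithCli) are invented for this example, and it assumes an npm binary is available on the PATH. It uses npm’s documented command line (install with --prefix) rather than the programmatic API.

```javascript
const { execFile } = require('child_process');

// Turn a package.json-style map into "name@range" install arguments.
function toInstallSpecs(dependencies) {
  return Object.entries(dependencies).map(([name, range]) => `${name}@${range}`);
}

// Build the argument list for `npm install`, targeting `targetDir`.
function npmInstallArgs(dependencies, targetDir) {
  return ['install', '--prefix', targetDir, ...toInstallSpecs(dependencies)];
}

// Run the install in a child process. The callback receives
// (error, stdout, stderr), as with any execFile call.
function installWithCli(dependencies, targetDir, callback) {
  execFile('npm', npmInstallArgs(dependencies, targetDir), callback);
}
```

Calling installWithCli({ 'left-pad': '^1.1.0' }, '/tmp/build', callback) would run the equivalent of npm install --prefix /tmp/build left-pad@^1.1.0.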

The top-level commands that I was interested in were easy enough to locate in npm’s source, but how to actually use them wasn’t quite as clear. Before I go any further, I must point out: this is a warning sign. If a piece of code isn’t well documented, it’s highly likely that it’s not meant to be used directly by an external application. You should consider this kind of code volatile and be prepared for it to break without warning.

Since this was just for a toy project, I decided to press on and see if I could get things to work anyway. After a few rounds of “ok I’ll change this and try again” I decided to slow down and accept that the problem was going to take a little longer to solve. I spent some time looking for examples of npm usage, but either my search skills were off that day or there just isn’t much out there. Then it dawned on me: a lot of Node.js code is well tested. Tests are great examples of how to use code and can sometimes be even more useful than a minimally documented method.

It turns out, npm is no different. There is an extensive test suite and examples of how to call npm.install directly. Now armed with some examples, I was able to construct the correct chain of commands I needed to successfully install an arbitrary set of node modules. Hooray! Time to package it up and send the code to Lambda!

Works On My Machine…

Packing up my code and deploying it out to Lambda was easy peasy thanks to the Serverless framework. As soon as my code was in place I ran a test command with a small module that I knew had no dependencies. I wanted something that would fail fast and hopefully in an obvious way if something was wrong. It turns out it did fail fast, but not so obviously…

From the Lambda output and logs I could tell that npm was being initialized correctly, but as soon as the install began the process crashed because Lambda has a read-only file system. This didn’t seem like too hard a problem to solve: I could just install the modules in the temporary directory, which Lambda does let you write to. I made the code changes and redeployed, but the read-only errors persisted.

I spent a lot of time using my usual debugging tricks locally (a step-through debugger and monitoring the directory that was being touched), but for the life of me I couldn’t understand why the command was still failing. I decided to look at npm’s test suite again to see if anything stood out. One of the keys to a good test suite is creating an environment that is not polluted between test runs, so that your tests behave predictably and without outside influence. I was hopeful that there would be some clues in the code as to what else might be changing on disk.

Once again, the answer was right in front of me. In order to speed up installs, npm relies on a cache. Normally this cache directory is located somewhere accessible to your user, so that no matter where you run npm install, the cache can be used. That obviously wouldn’t work on Lambda, since the only place on disk you can touch is the temporary directory.
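One way to move the cache is npm’s cache setting, which can be supplied through the npm_config_cache environment variable. The sketch below shows that approach; it is not necessarily the change made in this project, and the function name npmEnvWithTempCache and the .npm-cache directory name are assumptions for illustration.

```javascript
const os = require('os');
const path = require('path');

// Return a copy of `baseEnv` with npm's cache pointed at a directory
// under `tmpRoot`. The input object is left untouched.
function npmEnvWithTempCache(baseEnv = process.env, tmpRoot = os.tmpdir()) {
  return Object.assign({}, baseEnv, {
    npm_config_cache: path.join(tmpRoot, '.npm-cache'),
  });
}
```

The resulting object can be passed as the env option when spawning npm, or copied onto process.env before npm is loaded.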

Another minor code change and a deploy out to Lambda and success! I was able to build a node_modules directory on a remote machine by providing only a list of dependencies. I immediately cleaned up my code and added a test with my simple case, certain that scaling up would be no problem. And then I tried it again with a native module, the whole reason I started on this path, and it failed, now with a message about node-gyp.

So I began to tear through node-gyp code, trying to understand what it might be doing that was causing a failure. It turns out that node-gyp saves common header files in a central place that can be used to compile any native Node modules. Unlike npm’s cache directory, though, node-gyp accounts for access problems and falls back to a temporary directory, so the reason behind the failure wasn’t as clear-cut as the npm problem was.

I determined that at least part of the problem was that node-gyp was not in the PATH variable for Lambda, so the build commands in the native module were failing. With some clever path manipulation I was able to work around this, but new problems emerged. After looking through the Lambda logs, it became inescapably clear that Lambda did not have a compiler installed. Game over, man.
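The PATH workaround boils down to putting the directory that contains the node-gyp executable at the front of the search path before the build commands run. A minimal sketch follows; prependToPath is a hypothetical helper, and the directory passed to it depends on where node-gyp actually lives in the deployed bundle.

```javascript
const path = require('path');

// Return a copy of `env` whose PATH starts with `dir`.
// Any existing occurrence of `dir` is dropped so it is not listed twice.
function prependToPath(env, dir) {
  const current = env.PATH ? env.PATH.split(path.delimiter) : [];
  const entries = [dir, ...current.filter((entry) => entry !== dir)];
  return Object.assign({}, env, { PATH: entries.join(path.delimiter) });
}
```

This only helps the build find node-gyp. It does nothing about a missing compiler.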

Enough Already!

In theory, this is a problem I could still solve by building a compiler and any dependencies it needs on an AWS machine, then packaging all of that with my code. That still wouldn’t handle cases where a native module needs a specific library that’s not available, and it really leads to an unmaintainable situation. Regardless, code was written and lessons were learned. If you’re interested in the code that led to this piece you can find it here.

At the end of it all I didn’t feel defeated. Through my exploration I learned a lot about systems that I take for granted every day and not only feel like I have a better understanding of them but have gained useful knowledge about the way those systems are architected. My high level takeaways from this adventure can be summed up in a few bullet points:

- Test your code! It’s not only useful to ensure that your code behaves as expected but is an invaluable source for developers that might use your code.

- Don’t be afraid to read source code. You’ll learn a lot about how other systems work and can take those lessons and apply them to your own code. When you’re reading through well-used and well-built systems you will learn valuable lessons on code design.

- Don’t be afraid to fail. As developers we don’t always have the luxury of playing with something that might fail, but if you do have that luxury, exercise it. A best-effort failure can yield just as much valuable experience as a success, even if it’s frustrating.

- Know when you need to stop. At some point, even for a toy project, the effort required to get your solution working might not be worth it. Knowing when it’s time to investigate other options is extremely important.