Wednesday, 23 March 2016

Yesterday someone broke the internet, and this time it had nothing to do with Kim Kardashian. According to The Register and various other sources, developer Azer Koçulu removed over 250 of his own modules from NPM, the package management platform used by almost all open-source JavaScript projects. Among the projects he removed was left-pad, a tiny library that performs string padding. Turns out that literally thousands of open-source projects depend on left-pad – over 2 million downloads last month – and with left-pad removed from NPM, suddenly none of these projects would build any more. Oops.

It's worth remembering this is JavaScript, and one of the many things that makes JavaScript so entertaining is that it doesn't have a standard runtime library. In every other language I have ever used, string padding is either built-in, or it's part of a standard runtime that's available locally on every workstation and build server. But not in JavaScript – if you want to pad a string in JS, you either write your own function, you copy & paste one from StackOverflow, or you import a package to do it for you, and based on the fallout from yesterday it looks like a lot of people went for the package option.
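For context, string padding really is as small as it sounds. Here's a sketch of the kind of function we're talking about – not the actual left-pad source, whose implementation differs in its details:

```javascript
// A minimal left-pad sketch (not the real left-pad package's source;
// the actual implementation differs in details).
function leftPad(str, len, ch) {
  str = String(str);
  ch = ch === undefined ? ' ' : String(ch);
  // Prepend the pad character until we reach the target length.
  while (str.length < len) {
    str = ch + str;
  }
  return str;
}

console.log(leftPad('5', 3, '0')); // "005"
```

A handful of lines – and two million downloads a month depended on them.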

Now, package management and dependency management are hard. Projects like NPM and RubyGems and NuGet appear to have made it a lot easier, but in most cases what they're actually doing is asking us to relinquish control of elements of our own projects in exchange for convenience. And they are convenient – as long as you're online, and everything's working, and everybody's being friendly, it's great – but I don't think we, as an industry, have done nearly enough to understand what happens if those assumptions turn out to be false.

Say you're about to get on a long-haul flight; just before they call your gate, you git clone a couple of projects onto your laptop and then board the plane. Would you be able to build those projects once you're in flight? I suspect in many cases the answer is "no", because standard practice these days is to download all the required libraries as the first stage of the build process – and that means you can't build your project without being online.

OK, that one's easy. This time round you shrug and watch the in-flight movies instead, and next time you remember to do a full build before you go offline. Not a big deal. But what would happen if nuget.org was offline? Online services go dark for all sorts of reasons – infrastructure problems, legal action, DoS attacks. Most of the time, they come back, but not always – in 2014, Code Spaces was completely wiped out by an attacker who gained access to their AWS account. Does your build pipeline quietly rely on the fact that nuget.org isn't going to go away? And if it did, how long would it take to get things back up and running?

But NuGet or NPM going down isn't even the worst case scenario. According to NuGet.org, the most downloaded packages over the last six weeks are:

NewtonSoft.Json (2.4M downloads)
jQuery (1.2M downloads)
Microsoft.AspNet.Mvc (993K downloads)
Microsoft.AspNet.Razor (987K downloads)
EntityFramework (974K downloads)

Now, let's imagine a Nefarious Super Villain breaks into James Newton-King's house one night and forces him at gunpoint to deploy a new point release of NewtonSoft.Json. A release that works perfectly, only every time you call JsonConvert.SerializeObject(), it sneakily posts a copy of the JSON output to an IP address in Russia. How long before you'd end up running this malicious code on your own production systems? How long before you'd notice something was amiss – assuming you ever noticed at all? OK, somebody would notice eventually – that's the beauty of open source, after all – but what about closed-source libraries? If you're using EntityFramework, I'd wager good money that you're using a single set of database credentials with read/write/delete access to all your data – and trusting that the code isn't going to do anything unpleasant.

EDIT: Demis Bellot pointed out on Twitter after I first posted this article that source code is just one of many attack vectors that exploit our faith in package repositories. Unless you're comparing checksums or building your own reference binaries, you're blindly trusting that the source you're reading on GitHub is the same source that was used to build the binaries that are now running on your production servers.

There are ways around some of this. At the very least, cache the packages used by your recent builds in case you need to run them again. Most package repositories run over HTTP and use stable URLs, so all you really need is a caching proxy between your build pipeline and your package repo servers. Here at Spotlight, we run a NuGet server on our LAN that hosts our own packages, and we've also set up TeamCity so that following a successful build, all the .nupkg files (including the dependencies used by the build) are published to our local NuGet server. We actually started doing this by mistake – we set up a build to publish our own .nupkg files to our server and then realised afterwards it had also picked up all the package dependencies – but it works pretty well, and it means that if nuget.org were to go dark for a while, we could still build and deploy software as long as we didn't update any package dependencies.
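The caching idea boils down to a read-through cache: serve the package from your own copy if you have one, and only go upstream when you don't. Here's a toy sketch in Node – `makePackageCache` and `fetchUpstream` are names I've invented for illustration, and a real proxy would fetch over HTTP and persist packages to disk rather than holding them in memory:

```javascript
// Sketch of a read-through package cache. `fetchUpstream` stands in
// for a real HTTP request to nuget.org or the npm registry; a real
// proxy would persist to disk so the cache survives restarts.
function makePackageCache(fetchUpstream) {
  const cache = new Map();
  return function get(packageUrl) {
    if (!cache.has(packageUrl)) {
      // Only hit the upstream repo on a cache miss.
      cache.set(packageUrl, fetchUpstream(packageUrl));
    }
    // Cache hits work even if the upstream repo is offline.
    return cache.get(packageUrl);
  };
}
```

Once a package has been pulled through the cache, later builds can keep working even when the upstream repository is unreachable – which is exactly the property you want when nuget.org goes dark.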

As for package dependencies as an attack vector for malicious code? That one's a lot harder. We often do some ad-hoc traffic inspection as part of our review process – fire up the application with SQL Monitor and Fiddler running, take a look at the network traffic, database activity, debug logs and so on just to make sure we haven't done something stupid – but it's interesting to think how this could be turned into something more rigorous.

But as with so much else, it boils down to a choice – trust people, or do everything yourself. One is a calculated risk, one is a guaranteed expense, and it's managing that balance between risk and cost that makes IT – and business – so endlessly fascinating.