Monday, November 27, 2017

I'm announcing a small nix-diff utility I wrote for comparing Nix derivations. This post will walk through two use cases for how you might use this utility.

Background

This section provides some required background for understanding this post if you're new to Nix.

There are three stages to a Nix build:

Nix source code (i.e. *.nix files)

This corresponds to a source distribution in a typical package manager

Nix derivations (i.e. /nix/store/*.drv files)

This is the stage that caching works at

Nix build products (i.e. /nix/store/* files that are not derivations)

This corresponds to a binary distribution in a typical package manager

You can convert between these stages using the following command-line tools:

nix-instantiate converts Nix source code to Nix derivations

i.e. *.nix → /nix/store/*.drv

nix-store --realise converts Nix derivations to Nix build products

i.e. /nix/store/*.drv → /nix/store/*

nix-build is a convenience utility which combines the two preceding steps to go straight from source code to build products

i.e. *.nix → /nix/store/*

Nix supports caching binary build products so if you try to build the same derivation twice then the second build will reuse the result of the first build (i.e. a "cache hit"). If the derivation changes in any way, you get a "cache miss" and you need to build the derivation.
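
The cache-hit rule can be modelled with a few lines of Python. This is a toy sketch, not how Nix is implemented: the names `drv_key` and `build`, the dictionary standing in for a derivation, and the placeholder field values are all assumptions made for illustration. The only point is that the cache is keyed on the derivation's contents, so any change to the derivation produces a different key and therefore a miss.

```python
import hashlib
import json

def drv_key(derivation):
    """Hash a (toy) derivation via its canonical JSON serialization."""
    canonical = json.dumps(derivation, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def build(derivation, cache):
    """Return (product, cache_hit). The 'build' itself is only simulated."""
    key = drv_key(derivation)
    if key in cache:
        return cache[key], True
    product = "built-" + key[:8]
    cache[key] = product
    return product, False

cache = {}
# Hypothetical derivation contents, chosen only for the example.
ghc = {"name": "ghc", "builder": "bash", "env": {"src": "source-A"}}
changed = {"name": "ghc", "builder": "bash", "env": {"src": "source-B"}}

_, first_hit = build(ghc, cache)        # first build: cache miss
_, second_hit = build(ghc, cache)       # identical derivation: cache hit
_, third_hit = build(changed, cache)    # derivation changed: cache miss
```

Two source files that produce the same dictionary would share a key here, which mirrors the idea that caching works on derivations rather than on source code.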

Carefully note that caching works at the level of Nix derivations and not at the level of Nix source code. For example, the following two Nix files differ at the source code level:

These *.drv files use the ATerm file format and are Nix-independent. Conceptually, Nix is just a domain-specific language for generating these ATerm files. That means, for example, that you could replace Nix with any front-end language or tool that can generate these ATerm files. In fact, this is how Guix works, by replacing Nix with Guile Scheme as the front-end language.

Understanding how Nix derivations work is fundamental to understanding the Nix ecosystem. nix-diff is one tool that aids this learning process as the following sections will illustrate.

Cache misses

nix-diff is a tool that I wish I had back when Awake Security first adopted Nix. We frequently ran into cache misses when using Nix because of subtle differences in Nix derivations in different development environments.

We can understand why we got cache misses by referring back to the three stages of a Nix build:

Nix source code (i.e. *.nix files)

Nix derivations (i.e. /nix/store/*.drv files)

Nix build products (i.e. /nix/store/* files that are not derivations)

For production we prefer to distribute Nix build products (i.e. binary distributions), but internally for development we distribute Nix source code. We prefer Nix code internally because this gives developers complete control over all of their transitive dependencies. For example, a developer can easily patch the systemd executable used on the virtual machine that runs their integration tests.

However, this flexibility comes at a price: if you don't know what you are doing you can easily accidentally change the derivation. This is because Nix and Nixpkgs are customizable to a fault and they have all sorts of "impure" defaults that change depending on the development environment. If you trip over one of these pitfalls you end up with a cache miss, which is a poor user experience.

The most common pitfalls we ran into early on in our Nix adoption were:

Let's motivate this with a real example. Suppose that I have the following derivation to build the Glasgow Haskell compiler (ghc):

$ cat example0.nix
let
pkgs = import <nixpkgs> { };
in
pkgs.ghc

This Nix expression is "impure" because the expression depends on the ambient nixpkgs channel that the user has installed. Compare this to the following expression which pins nixpkgs to a specific revision protected by a hash:

Now we can see at a glance that the versions of several dependencies changed and GHC has split out its man pages into a new man output for better granularity of the build graph.

Note that these are not the only differences between the two derivations. However, all of the other differences are downstream of the above differences. For example, the two derivations have different out paths, but we expect them to differ for any two derivations that are not identical so there's no point including that in the diff. nix-diff makes an effort to highlight the root cause of the difference.

Understanding differences

Nix is more than just a package manager. You can use Nix to build and deploy an entire machine, which is how NixOS (the Nix operating system) works. The machine configuration is a Nix expression that you can instantiate and build like any other Nix expression.

This means that we can also use nix-diff to compare two machine configurations and understand how they differ. For example, when we change our production systems at Awake Security we sometimes run the change through nix-diff during code review to ensure that reviewers understand every change being made to the system.

We can illustrate this with a small example comparing two NixOS system specifications. The first system specification is a mostly blank system:

$ nix-diff $(nix-instantiate example0.nix) $(nix-instantiate example1.nix)
- /nix/store/6z9nr5pzs4j1v9mld517dmlcz61zy78z-nixos-system-nixos-18.03pre119245.5cfd049a03.drv:{out}
+ /nix/store/k05ibijg0kknvwrgfyb7dxwjrs8qrlbj-nixos-system-nixos-18.03pre119245.5cfd049a03.drv:{out}
• The input named `etc` differs
- /nix/store/05c0v10pla0v8rfl44rs744m6wr729jy-etc.drv:{out}
+ /nix/store/8waqvzjg7bazzfzr49m89q299kz972wv-etc.drv:{out}
• The input named `dbus-1` differs
- /nix/store/a16j2snzz25dhh96jriv3p6cgkc0vhxr-dbus-1.drv:{out}
+ /nix/store/mliabzdkqaayya67xiwfhwkg4gs9k0cg-dbus-1.drv:{out}
• The input named `system-path` differs
- /nix/store/jcf6q7na01j8k9xcmqxykl62k4x6zwiv-system-path.drv:{out}
+ /nix/store/kh4kgsms24d02bxlrxb062pgsbs3riws-system-path.drv:{out}
• The set of input names do not match:
+ apache-kafka-2.12-0.10.2.0
• The input named `system-path` differs
• These two derivations have already been compared
• The input named `system-units` differs
- /nix/store/yqnqdajd4664rvycrnwxwaj0mxp7602c-system-units.drv:{out}
+ /nix/store/2p5c4arwqphdz5wsvz6dbrgv0vhgf5qh-system-units.drv:{out}
• The set of input names do not match:
+ unit-apache-kafka.service
• The input named `user-units` differs
- /nix/store/x34dqw5y34dq6fj5brj2b5qf0nvglql9-user-units.drv:{out}
+ /nix/store/4iplnk260q2dpr8b8ajrjkrn44yk06aq-user-units.drv:{out}
• The input named `unit-dbus.service` differs
- /nix/store/fd6j972zn1hfvqslxc8c64xxaf1wg475-unit-dbus.service.drv:{out}
+ /nix/store/s7rpgwbald9qx8rwlw4v276wj2x3ld8r-unit-dbus.service.drv:{out}
• The input named `dbus-1` differs
• These two derivations have already been compared
• The input named `system-path` differs
• These two derivations have already been compared
• The input named `users-groups.json` differs
- /nix/store/x6c7pqx40wfdzwf96jfi1l0hzxjgypri-users-groups.json.drv:{out}
+ /nix/store/gk5yyjw579hgyxgwbrh1kzb3hbdbzgbq-users-groups.json.drv:{out}
• The environments do not match:
text=''{"groups":[{"gid":55,"members":[],"name":"adm"},{"gid":17,"members":[],"name":"audio"},{"gid":24,"members":[],"name":"cdrom"},{"gid":27,"members":[],"name":"dialout"},{"gid":6,"members":[],"name":"disk"},{"gid":18,"members":[],"name":"floppy"},{"gid":174,"members":[],"name":"input"},{"gid":96,"members":[],"name":"keys"},{"gid":2,"members":[],"name":"kmem"},{"gid":20,"members":[],"name":"lp"},{"gid":4,"members":[],"name":"messagebus"},{"gid":30000,"members":["nixbld1","nixbld10","nixbld11","nixbld12","nixbld13","nixbld14","nixbld15","nixbld16","nixbld17","nixbld18","nixbld19","nixbld2","nixbld20","nixbld21","nixbld22","nixbld23","nixbld24","nixbld25","nixbld26","nixbld27","nixbld28","nixbld29","nixbld3","nixbld30","nixbld31","nixbld32","nixbld4","nixbld5","nixbld6","nixbld7","nixbld8","nixbld9"],"name":"nixbld"},{"gid":65534,"members":[],"name":"nogroup"},{"gid":0,"members":[],"name":"root"},{"gid":62,"members":[],"name":"systemd-journal"},{"gid":110,"members":[],"name":"systemd-journal-gateway"},{"gid":152,"members":[],"name":"systemd-network"},{"gid":153,"members":[],"name":"systemd-resolve"},{"gid":154,"members":[],"name":"systemd-timesync"},{"gid":25,"members":[],"name":"tape"},{"gid":3,"members":[],"name":"tty"},{"gid":100,"members":[],"name":"users"},{"gid":29,"members":[],"name":"utmp"},{"gid":19,"members":[],"name":"uucp"},{"gid":26,"members":[],"name":"video"},{"gid":1,"members":[],"name":"wheel"}],"mutableUsers":true,"users":[{"createHome":false,"description":"→Apache Kafka daemon user","group":"nogroup","hashedPassword":null,"home":"/tmp/kafka-logs","initialHashedPassword":null,"initialPassword":null,"isSystemUser":false,"name":"apache-kafka","password":null,"passwordFile":null,"shell":"/run/current-system/sw/bin/nologin","uid":169},{"createHome":false,"description":"→D-Bus system mess...

However, this doesn't do the diff justice because the output is actually colorized, like this:

From the diff we can see that:

This change adds Kafka executables to the system PATH

This change adds a new apache-kafka systemd service

This change adds a new apache-kafka user to the system

Note how nix-diff does more than diffing the two root derivations. If the two derivations differ on a shared input then nix-diff will descend into that input and diff that and repeat the process until the root cause of the change is found. This works because Nix's dependency graph is complete and reachable from the root derivation.
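
The descent described above can be sketched in Python. This is a simplified model and not nix-diff's actual implementation: the `diff_drv` function, the dictionary-based graph, and the short `.drv` names are all hypothetical. The sketch also assumes that environments are only worth reporting when no input differs, on the theory that environment changes below a differing input are downstream effects.

```python
def diff_drv(graph, left, right, seen=None, depth=0):
    """Return report lines explaining why two (toy) derivations differ."""
    if seen is None:
        seen = set()
    pad = "    " * depth
    lines = [pad + "- " + left, pad + "+ " + right]
    # Shared sub-graphs are reported once, then short-circuited.
    if (left, right) in seen:
        lines.append(pad + "• These two derivations have already been compared")
        return lines
    seen.add((left, right))

    l_inputs = graph[left]["inputs"]
    r_inputs = graph[right]["inputs"]
    inputs_differ = False

    if set(l_inputs) != set(r_inputs):
        inputs_differ = True
        lines.append(pad + "• The set of input names do not match:")
        for name in sorted(set(l_inputs) - set(r_inputs)):
            lines.append(pad + "    - " + name)
        for name in sorted(set(r_inputs) - set(l_inputs)):
            lines.append(pad + "    + " + name)

    # Descend into every shared input whose derivation changed.
    for name in sorted(set(l_inputs) & set(r_inputs)):
        if l_inputs[name] != r_inputs[name]:
            inputs_differ = True
            lines.append(pad + "• The input named `" + name + "` differs")
            lines.extend(
                diff_drv(graph, l_inputs[name], r_inputs[name], seen, depth + 1)
            )

    # Assumption of this sketch: only report environments at a leaf difference.
    if not inputs_differ and graph[left]["env"] != graph[right]["env"]:
        lines.append(pad + "• The environments do not match:")
    return lines

# Hypothetical graph: the second system adds a Kafka package to system-path.
graph = {
    "sys-a.drv": {"inputs": {"etc": "etc-a.drv", "system-path": "path-a.drv"}, "env": {}},
    "sys-b.drv": {"inputs": {"etc": "etc-b.drv", "system-path": "path-b.drv"}, "env": {}},
    "etc-a.drv": {"inputs": {"system-path": "path-a.drv"}, "env": {}},
    "etc-b.drv": {"inputs": {"system-path": "path-b.drv"}, "env": {}},
    "path-a.drv": {"inputs": {}, "env": {}},
    "path-b.drv": {"inputs": {"apache-kafka": "kafka.drv"}, "env": {}},
    "kafka.drv": {"inputs": {}, "env": {}},
}

report = diff_drv(graph, "sys-a.drv", "sys-b.drv")
print("\n".join(report))
```

The `seen` set is what keeps the report short when the same pair of derivations is reachable along several paths, as `system-path` is in this toy graph.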

Conclusion

You can find the nix-diff utility on Hackage or GitHub if you would like to use this in your own development workflow. Hopefully nix-diff will help you better understand how Nix works under the hood and also help you pin Nix derivations more robustly.

Friday, November 3, 2017

The Dhall configuration language just added support for "semantic integrity checks". This post explains what "semantic integrity check" means, motivates the new feature, and compares it to semantic versioning.

The problem

I added this feature in response to user concerns about code injection in Dhall configuration files.

We'll illustrate the problem using the following example.dhall configuration file which derives a summary of student information from a list of students:

Values, functions, and types are all Dhall expressions, so we can inject all of them in our code via URLs or paths. When we interpret a Dhall configuration file these imports get substituted with their contents and then we evaluate the fully resolved configuration file as an expression in a functional language:

Users were concerned that these imports could be compromised, resulting in malicious code injection.

The solution

The latest release of Dhall added support for import integrity checks to address user concerns about malicious tampering. We can use these integrity checks to "freeze" our imports by adding a SHA-256 hash after each import.

First, we ask the dhall-hash utility to compute the current hash for our imports:

Once you add these integrity checks the Dhall interpreter will enforce them when resolving imports. In this case, the example configuration still successfully evaluates to the same result after adding the integrity checks:

Dhall recognizes that this is no longer the same expression and rejects the import. Only an import that represents the same value can pass the check.

This means, for example, that malicious users cannot tamper with our imports, even if we were to distribute the imported code over an insecure channel. The worst that an attacker can do is cause our configuration to reject the import, but they cannot trick the configuration into silently accepting the wrong expression.
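
The enforcement rule can be illustrated with a short Python sketch. It is not Dhall's implementation: `checked_import` and `IntegrityError` are hypothetical names, and for brevity the sketch hashes the raw text of the import, whereas Dhall hashes the resolved and normalized expression. What it does show is the failure mode described above: tampered content is rejected outright rather than silently accepted.

```python
import hashlib

class IntegrityError(Exception):
    """Raised when fetched content does not match its pinned hash."""

def sha256_hex(content):
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def checked_import(fetched, pinned):
    """Accept `fetched` only if its SHA-256 digest equals `pinned`."""
    actual = sha256_hex(fetched)
    if actual != pinned:
        raise IntegrityError(
            "expected sha256:" + pinned + " but got sha256:" + actual
        )
    return fetched

# Hypothetical import contents used only for this example.
trusted = "λ(x : Natural) → x * +2"
tampered = "λ(x : Natural) → x * +3"
pinned = sha256_hex(trusted)

accepted = checked_import(trusted, pinned)

try:
    checked_import(tampered, pinned)
    rejected = False
except IntegrityError:
    rejected = True
```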

Refactoring

We can use these integrity checks to do more than just secure code. We can also repurpose these checks to assert that our code refactors are safe and behavior-preserving.

I originally introduced semantic integrity checks to protect against malicious code modification then later realized that they can also be used to protect against non-malicious modifications (such as a refactor gone wrong).

Textual hashes

The semantic hash provides more information than a textual hash of the import. For example, suppose we changed our ./double.dhall function to triple the argument:

λ(x : Natural) → x * +3

A textual hash of the ./students.dhall import would not detect this change because the real change took place in the text of another file that ./students.dhall imported. However, a semantic hash can follow these imports to detect transitive changes to dependencies.

The semantic hash is also more flexible than a textual hash because the semantic hash does not change when we make cosmetic changes like refactoring, reformatting, or commenting code.
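
A toy Python model makes both properties visible. Everything here is an assumption made for illustration: the file contents are invented, `resolve` only substitutes tokens that start with `./`, and "normalization" is limited to stripping `--` comments and collapsing whitespace, which is far weaker than the β-normalization Dhall performs. The sketch also assumes imports are acyclic.

```python
import hashlib
import re

def strip_cosmetics(text):
    """Drop `--` line comments and collapse runs of whitespace."""
    without_comments = re.sub(r"--[^\n]*", "", text)
    return " ".join(without_comments.split())

def resolve(path, files):
    """Substitute every `./...` token with the resolved contents of that file."""
    resolved = []
    for token in strip_cosmetics(files[path]).split(" "):
        if token.startswith("./"):
            resolved.append("(" + resolve(token, files) + ")")
        else:
            resolved.append(token)
    return " ".join(resolved)

def textual_hash(path, files):
    return hashlib.sha256(files[path].encode("utf-8")).hexdigest()

def semantic_hash(path, files):
    return hashlib.sha256(resolve(path, files).encode("utf-8")).hexdigest()

# Hypothetical file contents: one file imports the other.
files = {
    "./students.dhall": "./double.dhall +21",
    "./double.dhall": "λ(x : Natural) → x * +2",
}
# A transitive change: the dependency now triples its argument.
tripled = dict(files)
tripled["./double.dhall"] = "λ(x : Natural) → x * +3"
# A cosmetic change: a comment and extra spacing, same meaning.
cosmetic = dict(files)
cosmetic["./double.dhall"] = "-- doubles its argument\nλ(x : Natural)  →  x * +2"

same_text = textual_hash("./students.dhall", files) == textual_hash("./students.dhall", tripled)
same_meaning = semantic_hash("./students.dhall", files) == semantic_hash("./students.dhall", tripled)
cosmetic_ok = semantic_hash("./students.dhall", files) == semantic_hash("./students.dhall", cosmetic)
```

Here `same_text` is true even though the behavior changed, `same_meaning` is false because the resolved expression differs, and `cosmetic_ok` is true because the comment and spacing disappear before hashing.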

Caveats

Dhall's semantic integrity checks can reject some behavior-preserving changes to functions. Dhall only attempts to detect if two functions are β-equivalent (i.e. the same if fully β-reduced).

For example, the following two functions are equivalent, but will not produce the same hash:

λ(x : Bool) → x

λ(x : Bool) → if x then True else False

Similarly, Dhall's semantic hash cannot detect that these two functions are the same:

λ(x : Natural) → x * +2

λ(x : Natural) → x + x

On the other hand, Dhall will (almost) never give two semantically distinct expressions the same hash. Only an astronomically improbable hash collision can cause this and at the time of this writing there is no known vulnerability in the SHA-256 hash algorithm.

Dhall will support other hash algorithms should SHA-256 ever be broken. This is why Dhall prefixes the hash with the algorithm to leave the door open for new hash algorithms.

Semantic versioning

You might wonder how semantic integrity checks compare to semantic versioning. I like to think of semantic integrity checks and semantic versions as two special cases of the following abstract interface:

a package publishes a version string for each official release

you can compare two version strings to detect a breaking change to the package

Semantic versioning is one special case of that abstract interface where:

the version string has a major number and minor number

a difference in major version numbers signals a breaking change

Some variations on semantic versioning propose independently versioning each exported function/value/type instead of versioning the package as a whole. Also, some languages (like Elm) mechanically enforce semantic versioning by detecting API changes programmatically and forcing a major version bump if there is a breaking change.

A semantic integrity check is another special case of that abstract interface where:

the version string is a SHA-256 hash

if two hashes are different then that signals a breaking change
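
The two special cases can be written as two implementations of the same comparison, sketched here in Python. The function names and the `major.minor` string format are assumptions made for the example, not part of any particular versioning tool.

```python
def breaking_by_semver(old, new):
    """Semantic versioning: only a change in the major number is breaking."""
    old_major = old.split(".")[0]
    new_major = new.split(".")[0]
    return old_major != new_major

def breaking_by_hash(old, new):
    """Semantic integrity check: any change in the hash is breaking."""
    return old != new

minor_bump = breaking_by_semver("1.2", "1.3")    # False: same major number
major_bump = breaking_by_semver("1.3", "2.0")    # True: major number changed
hash_change = breaking_by_hash("sha256:aaaa", "sha256:aaab")  # True
```

Both functions take two version strings and answer the same question; they differ only in how much change they tolerate before answering yes.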

The key difference between semantic versioning and semantic integrity checks is how we define "a breaking change". Semantic version numbers (usually) treat changes to types as breaking changes whereas semantic integrity checks treat changes to values as breaking changes. (To be totally pedantic: semantic integrity checks treat changes to expressions as breaking changes, and in a language like Dhall everything is an expression, including types).

This does not imply that semantic integrity checks are better than semantic version numbers. Sometimes you want to automatically pick up small changes or improvements from your dependencies without adjusting a hash. In cases like those you want the expected type to be the contract with your dependency and you don't want to pin the exact value.

For example, we could "simulate" semantic versioning in Dhall by attaching a type annotation to our ./students.dhall import like this:

... and now we can add or remove students from our imported list without breaking anything. We've used the type system as a coarser integrity check to state that certain changes to our configuration file's meaning are okay.

Conclusion

You can think of a semantic integrity check as a "value annotation" (i.e. the term-level equivalent of a type annotation). Instead of declaring an expected type we declare an expected value summarized as a hash.

This is why the title of this post declares that "semantic integrity checks are the next generation of semantic versioning". If you think of a semantic version as a concise summary of an imported package's type, then a semantic integrity check is a concise summary of an imported package's value.

Monday, October 16, 2017

This post summarizes advice that I frequently give to Haskell beginners asking how to start out learning the language.

First, in general I recommend reading the Haskell Programming from first principles book, mainly because the book teaches Haskell without leaving out details and also provides plenty of exercises to test your understanding. This is usually good enough if you are learning Haskell as your first language.

However, I would like to give a few additional tips for programmers who are approaching Haskell from other programming languages.

Learn Haskell for the right reasons

Some people learn Haskell with the expectation that they will achieve some sort of programming enlightenment or nirvana. You will be disappointed if you bring these unrealistic expectations to the language. Haskell is not an achievement to unlock or a trophy to be won because learning is a never-ending process and not a finish line.

I think a realistic expectation is to treat Haskell as a pleasant language to use that lets you focus on solving real problems (as opposed to wasting your time fixing silly self-induced problems like null pointers and "undefined is not a function").

Avoid big-design-up-front

Haskell beginners commonly make the mistake of trying to learn as much of the language as possible before writing their first program and overengineering the first draft. This will quickly burn you out.

You might come to Haskell from a dynamically typed background like JavaScript, Python, or Ruby where you learned to avoid refactoring large code bases due to the lack of static types. This aversion to refactoring promotes a culture of "big-design-up-front" where you try to get your project as close to correct on the first try so that you don't have to refactor your code later.

This is a terrible way to learn Haskell, for two reasons. First, Haskell has a much higher ceiling than most other programming languages, so if you wait until you hit that ceiling before building something you will wait a looooooong time. Second, refactoring is cheap in Haskell so you don't need to get things right the first time.

You will accelerate your learning process if you get dirty and make mistakes. Write really ugly and embarrassing code and then iteratively refine your implementation. There is no royal road to learning Haskell.

Avoid typeclass abuse

Specifically, avoid creating new typeclasses until you are more comfortable with the language.

Functional programming languages excel because many language features are "first class". For example, functions and effects are first class in Haskell, meaning that you can stick them in a list, add them, nest them, or pass them as arguments, which you can't (easily) do in imperative languages.

However, typeclasses are not first-class, which means that if you use them excessively you will quickly depend on advanced language features to do even simple things. Programming functionally at the term-level is much simpler and more enjoyable than the type-level Prolog that typeclasses encourage.

Begin by learning how to solve problems with ordinary functions and ordinary data structures. Once you feel like you understand how to solve most useful problems with these simple tools then you can graduate to more powerful tools like typeclasses. Typeclasses can reduce a lot of boilerplate in proficient hands, but I like to think of them as more of a convenience than a necessity.

You can also take this approach with you to other functional languages (like Elm, Clojure, Elixir, or Nix). You can think of "functions + data structures" as a simple and portable programming style that will improve all the code that you write, Haskell or not.

Build something useful

Necessity is the mother of invention, and you will learn more quickly if you try to build something that you actually need. You will quickly convince yourself that Haskell is useless if you only use the language to solve Project Euler exercises or Sudoku puzzles.

You are also much more likely to get a Haskell job if you have a portfolio of one or two useful projects to show for your time. These sorts of projects demonstrate that you learned Haskell in order to build something instead of learning Haskell for its own sake.

Conclusion

Hopefully these tips will help provide some guard rails for learning the language for the first time. That's not to say that Haskell is perfect, but I think you will enjoy the language if you avoid these common beginner pitfalls.

Saturday, October 7, 2017

I wrote this post to challenge basic assumptions that people make about software architecture, which is why I chose a deliberately provocative title. You might not agree with all the points that I am about to make, but I do hope that this post changes the way that you think about programming.

This post is an attempt to restate in my own words what Conal Elliott (and others before him) have been saying for a while: modern programming is a Rube-Goldberg machine that could be much simpler if we change the way we compose code.

Most programmers already intuitively understand this at some level. They will tell you that the programming ecosystem is deeply flawed, fragmented, and overly complex. I know that I felt that way myself for a long time, but in retrospect I believe that my objections at the time were superficial. There were deeper issues with programming that I was blind to because they are so ingrained in our collective psyche.

Disclaimer: This post uses my pet configuration language Dhall to illustrate several points, mainly because Dhall is a constructive proof-of-concept of these ideas. The purpose of this post is not so much to convince you to use Dhall but rather to convince you to think about software in a new way.

Input and output

Consider the title of this post for example:

"Why do our programs need to read input and write output?"

Most people will answer the title with something like:

"Our programs need a way to communicate with the outside world"

"Programs need to do something other than heat up CPUs"

Now suppose I phrased the question in a different way:

"What if only the compiler could read input and write output?"

"What's the difference?", you might ask. "Surely you mean that the language provides some library function that I can call to read or write input to some handles, whether they be file handles or standard input/output."

No, I mean that only the compiler implementation is allowed to read input or write output, but programs written within the compiled language cannot read input or write output. You can only compute pure functions within the language proper.

Again, this probably seems ridiculous. How could you communicate at all with the program?

Imports

Most languages have some way to import code, typically bundled in the form of packages. An imported function or value is something that our compiler reads as input statically at compile time as opposed to a value read by our program dynamically at run time.

Suppose I told you that our hypothetical programming language could only read input values by importing them.

"Ridiculous!" you exclaim as you spit out your coffee. "Nobody would ever use such a language." You probably wouldn't even know where to begin since so many things seem wrong with that proposal.

Perhaps you would object to the heavyweight process for publishing and subscribing to new values. You would recite the package management process for your favorite programming language:

Create a source module containing your value

Create a standard project layout

Create a package description file

Check your project into version control

Publish your package to a package repository

Perhaps you would object to the heavyweight process for configuring programs via imports? Your favorite programming language would typically require you to:

Retrieve the relevant program from version control

Modify the project description file to reference the newly published dependency

Modify project code to import your newly published value

Compile the program

Run the program

Why would a non-technical end user do any of that just to read and write values?

This is exactly the Rube-Goldberg machine I'm referring to. We have come to expect a heavyweight process for source code to depend on other source code.

Importing paths

Distributing code doesn't have to be heavyweight, though. Consider Dhall's import system which lets you reference expressions directly by their paths. For example, suppose we saved the value True to a file named example.dhall:

$ echo 'True' > example.dhall

Another Dhall program can reference the above file anywhere the program expects a boolean value:

$ dhall <<< './example.dhall || False'
Bool
True

This is exactly the same as if we had just replaced the path with the file's contents:

$ dhall <<< 'True || False'
Bool
True

Dhall doesn't need to support an explicit operation to read input because Dhall can read values by just importing them.

Similarly, Dhall doesn't need to support an explicit write operation either. Just save a Dhall expression to a file using your favorite text editor.

"What if I need a way to automate the generation of files?"

You don't need to automate the process of saving a file because one file is always sufficiently rich to store as much information as you need. Dhall is a programmable configuration language which supports lists and records so any one file can store or programmatically generate any number of values. Files are human-generated artifacts which exist purely for our convenience but Dhall code does not behave any differently whether or not the program spans multiple files or a single file.

Most of the time people need to automate reads and writes because they are using non-programmable configuration file formats or data storage formats.

Programmable configuration

You might object: "Configuration files shouldn't be Turing-complete!"

However, Dhall is not Turing-complete. Dhall is a total programming language, meaning that evaluation eventually halts. In practice, we don't actually care that much if Dhall halts, since we cannot guarantee that the program halts on reasonable human timescales. However, we can statically analyze Dhall and most of the Halting Problem objections to static analysis don't apply.

For example, Dhall can statically guarantee that programs never fail or throw exceptions (because obviously a configuration file should never crash). Dhall also lets you simplify confusing files by eliminating all indirection because Dhall can reduce every program to a canonical normal form.

In fact, most objections to programmable configuration files are actually objections to Turing-completeness.

Importing URLs

Dhall also lets you import code by URL instead of path. Dhall hosts the Prelude of standard utilities online using IPFS (a distributed hashtable for the web), and you can browse the Prelude using this link, which redirects to the latest version of the Prelude:

... and this post is not about Dhall so much as Conal's vision of an effect-free purely functional future for programming. I believe explicitly reading input and writing output will eventually become low-level implementation details of higher-level languages, analogous to allocating stack registers or managing memory.

A smarter approach would be to keep the accumulator strict, which means that we evaluate as we go instead of deferring all evaluation to the end. For example, the accumulator starts off as just the empty string:

""

... then after one iteration of the loop we get the following accumulator:

(λ(x : Text) → x ++ "!") ""

... and if we evaluate that accumulator immediately we get:

"!"

Then the next iteration of the loop produces the following accumulator:

(λ(x : Text) → x ++ "!") "!"

... which we can again immediately evaluate to get:

"!!"

This is significantly more efficient than leaving the expression unevaluated.
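
The contrast between the two strategies can be imitated in Python with closures. This is only an analogy and not the Dhall interpreter's code: `lazy_loop` and `strict_loop` are hypothetical names, and a closure stands in for an unevaluated expression.

```python
def lazy_loop(n):
    """Defer everything: build n nested unevaluated steps, then force them."""
    thunk = lambda: ""
    for _ in range(n):
        # Wrap the previous thunk instead of evaluating it.
        thunk = (lambda prev: lambda: prev() + "!")(thunk)
    return thunk()  # all the work happens here, n calls deep

def strict_loop(n):
    """Evaluate as we go: the accumulator is always a plain string."""
    acc = ""
    for _ in range(n):
        acc = (lambda x: x + "!")(acc)
    return acc

lazy_result = lazy_loop(3)
strict_result = strict_loop(3)
```

Both loops compute the same value, but the lazy version holds a chain of pending steps that grows with every iteration, and in Python it would exceed the default recursion limit for large `n`. The strict version only ever holds the current string.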

We can easily implement such a strict loop by making the following change to the interpreter:

... or in other words about 30 microseconds per element. We could still do more to optimize this but at least we're now in the right ballpark for an interpreter. For reference, Python is 4x faster on my machine for the following equivalent program:

In this case the accumulator of the fold is a list that grows by one element after each step of the fold. We don't want to normalize the list on each iteration because that would lead to quadratic time complexity. Instead we prefer to defer normalization to the end of the loop so that we get linear time complexity.

We can measure the difference pretty easily. A strict loop takes over 6 seconds to complete:

Why not both?

This poses a conundrum because we'd like to efficiently support both of these use cases. How can we know when to be lazy or strict?

We can use Dhall's type system to guide whether or not we keep the accumulator strict. We already have access to the type of the accumulator for our loop, so we can define a function that tells us if our accumulator type is compact or not:
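
As a rough illustration of that idea (a Python sketch, not the interpreter's actual code), a type-directed predicate might look like the following. The type encoding is hypothetical: primitive types are strings and compound types are `(tag, payload)` tuples. Which types count as compact is also an assumption of this sketch, beyond the two cases the post discusses, namely a Text accumulator that benefits from strictness and a List accumulator that does not.

```python
# Assumed classification: scalar-like types are treated as compact.
COMPACT_PRIMITIVES = {"Bool", "Natural", "Integer", "Double", "Text"}

def is_compact(ty):
    """Decide from the accumulator's type whether to normalize on every step."""
    if isinstance(ty, str):
        return ty in COMPACT_PRIMITIVES
    tag, payload = ty
    if tag == "List":
        # Lists grow with each step, so normalizing eagerly would be quadratic.
        return False
    if tag == "Optional":
        return is_compact(payload)
    if tag == "Record":
        # A record is compact only if every field type is compact.
        return all(is_compact(field) for field in payload.values())
    return False

def strategy(acc_type):
    return "strict" if is_compact(acc_type) else "lazy"

text_strategy = strategy("Text")
list_strategy = strategy(("List", "Natural"))
record_strategy = strategy(("Record", {"count": "Natural", "label": "Text"}))
mixed_strategy = strategy(("Record", {"items": ("List", "Text")}))
```

Unknown types fall through to the lazy strategy, which is the conservative choice in this sketch because it never introduces repeated normalization.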

Conclusion

Many people associate dynamic languages with interpreters, but Dhall is an example of a statically typed interpreter. Dhall's evaluator is not sophisticated at all but can still take advantage of static type information to achieve performance comparable to Python (which is a significantly more mature interpreter). This makes me wonder if the next generation of interpreters will be statically typed in order to enable better optimizations.