Saturday, August 18, 2012

Functional programming is all the rage these days, but in this post I want to emphasize that functional programming is a subset of a more important overarching programming paradigm: compositional programming.

If you've ever used Unix pipes, you'll understand the importance and flexibility
of composing small reusable programs to get powerful and emergent behaviors.
Similarly, if you program functionally, you'll know how cool it is to compose a bunch of small reusable functions into a fully featured program.

Category theory codifies this compositional style into a design pattern, the category. Moreover, category theory gives us a precise prescription for how to create our own abstractions that follow this design pattern: the category laws. These laws differentiate category theory from other design patterns by providing rigorous criteria for what does and does not qualify as compositional.

One could easily dismiss this compositional ideal as just that: an ideal, something unsuitable for "real-world" scenarios. However, the mathematics behind category theory provides the meat that shows that this compositional ideal appears everywhere and can rise to the challenge of messy problems and complex business logic.

This post is the first of many in which I will demonstrate how to practically use this compositional style in your programs, even for things that may seem like they couldn't possibly lend themselves to compositional programming. This first post starts off by introducing the category as a compositional design pattern.

Categories

I'm going to give a slightly different introduction to category theory than most
people give. I'm going to gloss over the definition of what a morphism or an object is and skip over domains and codomains and instead just go straight to composition, because from a programmer's point of view a category is just a compositional design pattern.

Category theory says that for any given category there must be some sort of composition operator, which I will denote (.). The first rule is that this composition operator is associative:

(f . g) . h = f . (g . h) -- Associativity law

This is useful because it means we can completely ignore the order of grouping and write it without any parentheses at all:

f . g . h

Category theory also says that this composition operator must have a left and right identity, which I will denote id. Being an identity means that:

id . f = f -- Left identity law
f . id = f -- Right identity law

The associativity law and the two identity laws are known as the category laws.

Notice that the definition of a category does not define:

what (.) is,

what id is, or

what f, g, and h might be.

Instead, category theory leaves it up to us to discover what they might be.
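The most familiar example is Haskell functions themselves, where (.) is ordinary function composition and id is the identity function. A quick sanity check of the laws, using some arbitrary functions of my own choosing (f, g, and h here are illustrative examples, not anything special):

```haskell
f :: Int -> Int
f = (+ 1)

g :: Int -> Int
g = (* 2)

h :: Int -> Int
h = subtract 3

-- All three category laws hold for function composition:
lawsHold :: Bool
lawsHold = and
    [ ((f . g) . h) 10 == (f . (g . h)) 10  -- associativity
    , (id . f) 10 == f 10                   -- left identity
    , (f . id) 10 == f 10                   -- right identity
    ]
```

Of course, checking the laws at a single point proves nothing; for functions they hold by definition, since (f . g) x = f (g x) and id x = x.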

The brilliance behind the category design pattern is that any composition operator that observes these laws will be:

easy to use,

intuitive,

and free from edge cases.

This is why we try to formulate abstractions in terms of the category design pattern whenever possible.

Not a coincidence! Monadic functions just generalize ordinary functions, and
the Kleisli category demonstrates that monadic functions are composable, too.
They just use a different composition operator, (<=<), and a different
identity, return.

Well, let's assume that category theorists aren't bullshitting us and that (<=<) really is some sort of composition and return really is its identity. If that were true, we'd expect the following laws to hold:

(f <=< g) <=< h = f <=< (g <=< h) -- Associativity law
return <=< f = f -- Left identity law
f <=< return = f -- Right identity law

Look familiar? Those are just the
monad laws, which
all Monad instances are required to satisfy. If you have ever wondered
where those monad laws came from, now you know! They are just the category laws
in disguise.

Consequently, every new Monad we define gives us a category for free! Let's try out some of these brave new categories:

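Here is a sketch of what that looks like in the Maybe monad (safeRecip and safeSqrt are hypothetical helpers I made up for illustration, not code from this post):

```haskell
import Control.Monad ((<=<))

-- Compare the types side by side:
--
--   (.)    ::              (b ->   c) -> (a ->   b) -> (a ->   c)
--   (<=<)  :: (Monad m) => (b -> m c) -> (a -> m b) -> (a -> m c)
--
--   id     ::              a ->   a
--   return :: (Monad m) => a -> m a

-- Hypothetical helpers: partial functions made total with Maybe
safeRecip :: Double -> Maybe Double
safeRecip 0 = Nothing
safeRecip x = Just (1 / x)

safeSqrt :: Double -> Maybe Double
safeSqrt x
    | x < 0     = Nothing
    | otherwise = Just (sqrt x)

-- Monadic functions compose just like ordinary ones
safeRecipSqrt :: Double -> Maybe Double
safeRecipSqrt = safeSqrt <=< safeRecip
```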
Those look an awful lot like an identity and composition. I leave it as an
exercise for the reader to prove that they actually do form a category.

Pipes show how more complicated things that don't fit neatly into the functional
programming paradigm can still be achieved with a compositional programming style. I won't belabor the compositionality of pipes, though, since my tutorial already does that.

So if you find something that doesn't seem like it could be compositional,
don't give up! Chances are that a compositional solution exists just beneath the surface!

Conclusions

All category theory says is that composition is the best design pattern, but
then leaves it up to you to define what precisely composition is. It's up to
you to discover new and interesting ways to compose things besides just
composing functions. As long as the composition operator you define obeys the
category laws, you're golden.

Also, I'm really glad to see a resurgence in functional programming (since functions form a category), but in the long run we really need to think about
more interesting composition operators than just function composition if we are serious about tackling more complicated problem domains.

Hopefully this post gets you a little bit excited about category theory. In
future posts, I will expand upon this post with the following topics:

Why the category laws ensure that code is easy, intuitive, and free of edge cases

Friday, August 10, 2012

I've seen beginners on /r/haskell ask for practical code examples so I thought I
would share some code from my own work. Today's example will focus on how you
can use Haskell to write clear and self-documenting code.

Today my PI gave me the following task:

Parse the alpha carbons from a PDB file

Scan the sequence using a window of a given size

For each window, collect all residues within a certain distance (called the "context")

Split the context into contiguous chains

Don't worry if you don't understand the terminology, because I will explain the
terms as I go along.

If you want to follow along, use the full code sample in the appendix and download and extract the sample PDB file: 1YU0.pdb.

Parsing

The first thing I had to do was to take a protein structure and parse it into
its alpha carbons. The input format is a Protein Data Bank (i.e. PDB) file,
which has a well documented file format found
here. I'm interested in the
ATOM record, which specifies the coordinates of a single atom in the
PDB file:

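The post's original parsing code is not shown here, but a minimal sketch of the idea might look like this (the Atom type and parseCAs are my own hypothetical reconstruction, relying on the fixed-width columns the PDB format documents for ATOM records):

```haskell
-- Hypothetical reconstruction: pull the alpha carbons ("CA" atoms)
-- out of PDB ATOM records. PDB columns are fixed-width:
-- 13-16 atom name, 22 chain ID, 23-26 residue number,
-- 31-38 / 39-46 / 47-54 the x / y / z coordinates.

type Point = (Double, Double, Double)

data Atom = Atom { coord :: Point, chainID :: Char, resSeq :: Int }
    deriving (Eq, Show)

parseCAs :: String -> [Atom]
parseCAs = concatMap parseLine . lines
  where
    parseLine l
        | take 4 l == "ATOM" && name == "CA" = [Atom (x, y, z) cID rSeq]
        | otherwise                          = []
      where
        field a b = take (b - a + 1) (drop (a - 1) l)  -- 1-indexed slice
        trim      = filter (/= ' ')
        name      = trim (field 13 16)
        cID       = head (field 22 22)
        rSeq      = read (trim (field 23 26))
        x         = read (trim (field 31 38))
        y         = read (trim (field 39 46))
        z         = read (trim (field 47 54))
```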
Here I use a type synonym to document what I want to use coord for.
This has the added advantage that I can use a different Point type
later on (such as a length-indexed type), and I need only update the type
synonym to update my type signatures.

This function takes a list of atoms, and associates each atom with a list of its
neighbors within some distance cutoff. I will be joining these lists of
neighbors later on, so really a more efficient data structure would be a
Set, but this is good enough for now, and I can optimize it later if
necessary.
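A sketch of the shape such a function might take (this is my own version of the naive quadratic scan the text describes, not the post's actual code; note that it deliberately keeps each point in its own neighbor list):

```haskell
type Point = (Double, Double, Double)

-- Pair every point with all points within the cutoff (itself
-- included), using a naive O(n^2) scan.
neighbors :: Double -> [Point] -> [(Point, [Point])]
neighbors cutoff ps =
    [ (p, [ q | q <- ps, dist p q <= cutoff ]) | p <- ps ]
  where
    dist (x1, y1, z1) (x2, y2, z2) =
        sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2 + (z1 - z2) ^ 2)
```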

Also, note the use of type synonyms as quick and dirty documentation. A
well-chosen type synonym can go a long way towards making your code easy to
understand. And if I go back and decide to use Set, I only need to change
the type synonym definition.

This is definitely not an efficient algorithm (a HashMap-based binning
algorithm would work faster), but I will let it slide for now since this is just a simple script. Also, notice that it does not eliminate an atom from its own neighbor list. This is because the atom will be reintroduced later when joining contexts, so I postpone doing this.

Windows

My PI wants a sliding window of 7 residues and a context generated for each
window. Rather than manually slide the window in an imperative style, I instead
generate the set of all windows:
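The windowing code itself is missing here, but a standard Haskell idiom for generating every length-n window (a guess at the original's intent, not its actual code) is:

```haskell
import Data.List (tails)

-- Every contiguous window of length n, built from the tails of
-- the list; shorter trailing fragments are filtered out.
windows :: Int -> [a] -> [[a]]
windows n = filter ((== n) . length) . map (take n) . tails
```

For example, windows 3 [1..5] yields [[1,2,3],[2,3,4],[3,4,5]].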

I want to draw attention to the sortBy call in the chains
function. This takes advantage of two very useful tricks. The first is the
comparing function from Data.Ord. The second is the
Monoid trick to compose two orderings, giving priority to the first
one. Combining the two tricks gives code that reads like English: "Sort by
comparing chainID first, then by comparing resSeq".
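The trick reads roughly like this sketch (the Atom record here is a hypothetical stand-in for the post's actual type):

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

data Atom = Atom { chainID :: Char, resSeq :: Int }
    deriving (Eq, Show)

-- Ordering is a Monoid: (<>) keeps the first comparison unless it
-- is EQ, in which case it falls through to the second. So this
-- sorts by chainID first, then by resSeq.
sortAtoms :: [Atom] -> [Atom]
sortAtoms =
    sortBy (\x y -> comparing chainID x y <> comparing resSeq x y)
```

(In 2012 this would have been spelled mappend; (<>) is the same operation.)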

The last step is that I have to take each window and the chains in its context and
pair the window with each chain. If that didn't make sense, I'm pretty sure the type signature would make it clear:
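I can't reproduce the original signature, but a plausible shape for it (with hypothetical names) is:

```haskell
-- Pair the window with each chain found in its context
distribute :: (window, [chain]) -> [(window, chain)]
distribute (w, cs) = [ (w, c) | c <- cs ]
```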

The above function pipeline clearly illustrates the flow of information and
makes it easy to reason about what my code actually does. I could even use it
as a high-level specification of my program. Reading from bottom to top it says
that you:

Attach contexts

Partition the sequence into windows

For each window:

Join the contexts

Group the joined context into chains

Distribute the window over each chain

Concatenate all the results

If you prefer to order things from left to right (or top to bottom), then you
can use the (>>>) operator from Control.Category to reverse
the direction in which you compose functions:
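For example, with some made-up stages, a (>>>) pipeline reads in the order the data flows:

```haskell
import Control.Category ((>>>))

-- (f >>> g) is (g . f): data flows left to right.
-- Keep the evens, scale them, then sum.
pipeline :: [Int] -> Int
pipeline = filter even >>> map (* 10) >>> sum
```

Applied to [1..5], this first keeps [2,4], then scales to [20,40], then sums to 60.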

I then load up the protein in PyMOL to inspect the matched ranges and, sure enough, it works correctly:

In the image I highlighted the matched window in yellow and the matched context in purple.

Conclusion

I hope this shows some of the tricks I use to improve code clarity. I left out a lot of type signatures for brevity, since this was a one-off script and I only wanted to draw attention to certain functions; if I were really serious about fully documenting the code, I would include all the type signatures.