Saturday, October 12, 2013

An all-atom protein search engine powered by Haskell

This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss what problem this search engine solves, why I chose Haskell for this project, and what were the pros and cons of my language choice.

The Challenge

Proteins are amazing molecules that control biology on the molecular level. If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching The Inner Life of the Cell, an artistic rendition of several diverse processes controlled by proteins, including:

cellular adhesion (how cells recognize and stick to each other)

forming a skeleton for the cell

assembling and disassembling cellular highways

trafficking cargo along those highways

In addition, proteins also:

sense the molecular environment

perform chemistry

are cellular weapons

and much, much more!

I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:

A protein structure in atomic detail

Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds. Within these tangled balls we see several patterns emerge, both on small, medium, and large scales.

On a larger scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state. They are often easy to connect together to generate new functionality:

On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.

On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:

An example motif, with a geometrically well-defined hydrogen-bonding pattern

Our lab needs to design proteins on this small and compact scale, so they use my search engine to discover and incorporate these small motifs into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.

Overview

The most popular interface to the search engine is through PyMOL, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a PyMOL search plugin on the official Suns web site, which also includes an extended tutorial for how to use the search plugin.

The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically within less than a second it will stream up to 100 matches to the fragment of interest, superimposed on your original search query. You can then expand on your original query by incorporating new fragments from the search results:

An exmple iterative design process that begins from a guanidinium group and which grows to an N-terminal helix-capping motif. Black dashes highlight fragments incorporated from search results.

The PyMOL plugin is a thin Python-based client that connects to a remote search backend written in Haskell.

Why Haskell?

I initially did not begin this project in Haskell. I actually began it in C. The reason why was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects to try to make up for it by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and the bare-metal performance does not. My initial implentation in C attempted to brute force the solution to no avail, taking hours to perform a single search.

The second problem that I encountered was that when I got the first slow draft working I immediately got additional feature requests and I was not prepared to handle them. C is a very brittle language to program in because of manual memory management and no standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.

At that time I had been dabbling in Haskell in my free time and took interest in the language. Things that intrigued me about the language included:

expressive and strong types

an elegant mix of imperative and functional style (using monads)

high-level idioms and design patterns

Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.

This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:

You have a rich set of high-performance purely functional data structures in the containers, unordered-containers, and vector libraries

Strong static types make it a breeze to refactor code without having to write tests

Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of getting complacent with my first working solution. These changes improved performance and now queries took milliseconds instead of hours. At this point, PyMOL became the bottleneck and could not load search results fast enough, so there was no point improving performance further.

Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. I want to give a special shoutout to attoparsec and aeson, which are fantastic for this purpose.

Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.

Problems with Haskell

The first issue I encountered is that if you program in Haskell you need to be prepared to fill in the missing gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I needed to write myself because I was unsatisfied with existing iteratee-based solutions, and this eventually evolved into my pipes stream programming library. If you are considering starting a large project in Haskell you need to be prepared to do some library work of your own.

Another issue was numerical code. This is limited almost exclusively to hmatrix (Haskell bindings to the GNU scientific library) and repa (a fast and native Haskell library, but without a standard library of numerical algorithms). hmatrix was problematic because it is GPLv3 licensed, and my university, UCSF, forbids releasing code with a GPLv3 license because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign-function interface (FFI) is nice, but not having to use it at all is even nicer.

Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. With experience I learned how to avoid them, mainly by:

Defining data types with strict fields

Ensuring all folds are strict in the accumulator

... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because it compromises what would be an otherwise rock-solid stability story for the language.

The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the errors library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of Yesod fame) to try to solve this problem.

Another problem is the pre-existing ecosystem of functionality dependent on lazy IO. I'm not going to name any names, but one library that I depend on uses lazy IO internally and it is still a source of bugs. Lazy IO causes these sort of problems because it does not guarantee ordering of effects and also results in synchronous exceptions being thrown asynchronously in pure code (which is a big no-no in the Haskell world).

Conclusions

The code for this project is open source and hosted on Github. It consists of three separate projects:

People interested in learning Haskell may find the code for the search engine useful to see what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.

I started the Haskell version in January 2012. I did it off and on until about Fall 2012 when I felt that I finally understood how to write a complete Haskell application. What helped a lot was that I also do a lot of scripting for my lab (mainly a bunch of parsing and formatting of instrument data and also processing order requisitions for the lab) and I would make a point of writing all these scripts exclusively in Haskell.

The original C algorithm for search was just a recursive backtracking search over entire protein structures. A variation on it still lives on within the Haskell code as the very last search stage (after all other optimizations have been applied). You can find it here:

The equivalent C code was much larger than that (maybe 30x larger?, I don't have the code with me at the moment). They're not exactly the same, but still basically the same idea: a recursive depth-first search.

The real optimizations occur before even getting to that point. I set up a forward index that narrows down the search results a lot and I also organize the data along motifs instead of arbitrary atom configurations, which means I can cache a lot of useful data and just look it up in memory. These are what ended up giving the biggest performance gains. Then when it gets to the actual recursive depth-first search it's only searching a very small subset of the data that has a very high chance of matching.

Great Job!! I'm working in a package for doing ab-initio simulations and molecular dynamics in Haskell, but I work in no man's land, physicists and chemists barely known what functional programming is, and I have found myself in great trouble to explain quantum chemistry to the Haskell community. You have managed to integrate both worlds, could you please share your experience about trying to join both fields?

The main reason why it is hard to sell functional programming to researchers is that functional programming shines most on large projects with lots of moving parts, but researchers rarely build large projects. It's sort of a chicken-and-egg problem: in my the languages they prefer to use (i.e. Matlab, Python and Perl) limit them to simple programming projects, so they only ask questions that can be solved by simple programming projects. This gives them the false impression that these languages answer all the questions they are interested in asking.

For example, before I wrote this search engine the kind of question my professor would think to ask the programmers in our lab was "how often do you see an aspartate binding to two backbone amides in a particular configuration". This would usually take the programmers in the lab days to write up the relevant script and compute the result, so he never dreamed of asking anything harder than that. After writing scripts like these for a while I just stopped and said "Why don't I just write a search engine so he can interactively do these searches himself?" Once I wrote the search engine this immediately raised the bar and he started asking more powerful questions like "How can I build a binding site using this tool?"

So if you really want to sell researchers on functional programming, you don't want to waste your time showing them how to keep asking their old and simple questions using functional languages. Instead, you want to show them how functional languages allow you to ask more powerful and challenging research questions.

Now, regarding the other way around (explaining your research to Haskell programmers), that just takes practice explaining your research in a simplified way. Just pretend that you are explaining your research to somebody in your family who is not a scientists. I also find that blogging on a regular basis greatly improved my writing skills. Regularly writing and getting feedback from your readers gives you a better sense of what things they might find interesting about your own research.

I can approve that your writing skills really make the difference for me to understand all the things you do in haskell. I enjoy your writing style even when i did not find the topic itself interesting at first. My expectations change when i see how haskell with it pro's and con's help to improve algorithmic understanding of complex problems and transport them into to a working codebase in really fast turnaround times. This feels amazing and i did not encounter this in any other language and writings in other blogs. So this is probably owed to your writing style and haskell together.

For me this shows that thinking differently is the best way to evolve decisively. To overcome the problems that you have found and others that exist will be solved faster when more people find their way into haskell.

But still you don't treat haskell as the holy grail of programming. And yet i have the feeling that functional languages will be a big part of the future that is ahead of the it industry. So its also fun to get into it and trying to be a part of it.

Actually, I do think Haskell is the holy grail of programming! I have to be the biggest Haskell fanboy ever. :)

However, I think the best way to promote a language is to be honest about its shortcomings as much as its advantages so that people evaluating the language feel like they are making an informed decision.

It's definitely a completely different way of thinking about programs, both in terms of their data structures, algorithms, and control flow and I think it's inevitable that people will think in the more mathematical style of Haskell as programming evolves. It feels like a really exciting time to be a programmer in general, seeing theory and practice intersect in such a big way.