Monday, October 15, 2012

Parsing chemical substructures

This is the second in a series of coding examples from my own work. /r/programming asked for non-trivial, yet digestible, Haskell examples, so I hope this fits the bill.

In this post I will show how learning Haskell changes the way you think. I will take a common Haskell idiom (monadic parsing) and apply it in a new way to solve a structural bioinformatics problem. To make this post more accessible to non-Haskell programmers, I'm going to gloss over implementation details and instead try to discuss at a high-level how I apply Haskell idioms to solve my problem.

The Parser type

When we approach parsing problems we normally focus on the text as the central element of our algorithm and design our program around combining and manipulating text. Monadic parsing turns this approach on its head and places the parser front and center, where we instead combine and manipulate parsers.

This means we need to define a Parser type:

type Parser a = String -> [(a, String)]

A Parser a takes a starting String as its input and parses it to return a single value of type a and the remaining unconsumed String. There might be multiple valid parses, so we actually return a list of possible parsings instead of a single one. If the parse fails, we return an empty list signifying no valid parsings.
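For intuition before we wrap the type, here is a tiny hand-rolled parser written directly against this bare function type (the name parseDigit is my own, not from the post):

```haskell
import Data.Char (isDigit)

type Parser a = String -> [(a, String)]

-- Succeed on a leading digit, returning it along with the unconsumed rest;
-- return the empty list (no valid parses) otherwise.
parseDigit :: Parser Char
parseDigit (c:cs) | isDigit c = [(c, cs)]
parseDigit _                  = []
```

For example, `parseDigit "42"` evaluates to `[('4', "2")]`, while `parseDigit "abc"` evaluates to `[]`.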

We can also use Haskell's newtype feature to encapsulate the implementation and hide it from the user:

newtype Parser a
= Parser { runParser :: String -> [(a, String)] }

This defines the Parser constructor which we use to wrap our parsing functions:

Parser :: (String -> [(a, String)]) -> Parser a

... and the runParser function which unwraps the Parser to retrieve the underlying function:

runParser :: Parser a -> String -> [(a, String)]
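The sections below combine small primitive parsers for True, False, and whitespace. Their definitions aren't shown at this point in the post; a sketch consistent with the wrapped Parser type (the names match, parseTrue, parseFalse, and skipSpaces are assumed from context) could be:

```haskell
import Data.List (stripPrefix)

newtype Parser a = Parser { runParser :: String -> [(a, String)] }

-- Match a literal token at the head of the input, returning a fixed value.
match :: String -> a -> Parser a
match token a = Parser $ \str -> case stripPrefix token str of
    Just rest -> [(a, rest)]
    Nothing   -> []

parseTrue, parseFalse :: Parser Bool
parseTrue  = match "True"  True
parseFalse = match "False" False

-- Consume any number of leading spaces; this parser never fails.
skipSpaces :: Parser ()
skipSpaces = Parser $ \str -> [((), dropWhile (== ' ') str)]
```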

Monads

Now we want to combine our Parsers so we can parse both a True and a False with optional spaces in between. This means we need some elegant way to take the unconsumed input from each Parser and feed it directly into the next Parser in the chain.

Fortunately, Haskell solves this problem cleanly using monads. A Monad defines an interface to two functions:

class Monad m where
    return :: a -> m a
    (>>=)  :: m a -> (a -> m b) -> m b

Like interfaces in any other language, we can program generically to that interface. Haskell's do notation works with this generic Monad interface, so we can use the imperative do syntax to manipulate anything that implements Monad.

Our Parser implements the Monad interface quite nicely:

instance Monad Parser where
    return a = Parser (\str -> [(a, str)])
    m >>= f  = Parser (\str1 ->
        -- This is a list comprehension, basically
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2])

This is not a monad tutorial, so I'm glossing over why that is the correct definition or what it even means, but if you want to learn more about monads, I highly recommend: You could have invented monads.

When we make Parser a Monad, we gain the ability to assemble Parsers using do notation, so let's use do notation to combine multiple parsers in an imperative style:

trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- parseTrue
    skipSpaces
    f <- parseFalse
    return (t, f)
That reads just like imperative code: parse a True, skip some spaces, then parse a False. Finally, return the two values you parsed. This seems straightforward enough until you realize we haven't actually parsed any text yet! All we've done is combine our smaller parsers into a larger parser poised to be run on as many inputs as we please.
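Here is a complete, runnable version of the example. This is a sketch: modern GHC additionally requires Functor and Applicative instances alongside Monad, and the match helper is my own name, not the post's:

```haskell
newtype Parser a = Parser { runParser :: String -> [(a, String)] }

-- GHC now requires Functor and Applicative instances for every Monad.
instance Functor Parser where
    fmap f m = Parser $ \s ->
        [ (f a, s') | (a, s') <- runParser m s ]

instance Applicative Parser where
    pure a    = Parser $ \s -> [(a, s)]
    mf <*> ma = Parser $ \s ->
        [ (f a, s'') | (f, s')  <- runParser mf s
                     , (a, s'') <- runParser ma s' ]

instance Monad Parser where
    m >>= f = Parser $ \s ->
        [ (b, s'') | (a, s')  <- runParser m s
                   , (b, s'') <- runParser (f a) s' ]

-- Match a literal token, returning a fixed value.
match :: String -> a -> Parser a
match token a = Parser $ \s -> case splitAt (length token) s of
    (prefix, rest) | prefix == token -> [(a, rest)]
    _                                -> []

-- Consume any number of leading spaces.
skipSpaces :: Parser ()
skipSpaces = Parser $ \s -> [((), dropWhile (== ' ') s)]

trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- match "True" True
    skipSpaces
    f <- match "False" False
    return (t, f)
```

Running `runParser trueThenFalse "True  False!"` yields `[((True, False), "!")]`, while mismatched input like `"False"` yields `[]`.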

Alternatives

Sometimes we want to try multiple parsing alternatives. For example, what if I want to parse a True or a False? I can define a (<|>) operator that tries both parsers and then returns the union of their results:

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str -> runParser p1 str ++ runParser p2 str)

Notice that nothing about the Parser machinery itself depends on String. We can generalize the type by replacing String with a type variable s:

newtype Parser s a
    = Parser { runParser :: s -> [(a, s)] }

Since s is polymorphic, we can set it to any conceivable type and the above code still works. The only String-specific behavior lies within the individual parser definitions, and their new compiler-inferred types reflect that:

parseTrue  :: Parser String Bool
parseFalse :: Parser String Bool
skipSpaces :: Parser String ()

But there's no reason we can't define entirely different Parsers that accept completely different non-textual input, such as chemical structures. So I'll switch gears and define parsers for chemical Structures, where a Structure is some sort of labeled graph:
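The post doesn't include the concrete Structure definition here; a minimal stand-in (the field names atomNames and neighbors, and the Map-based representation, are my assumptions) could look like:

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type AtomName = String

-- A labeled graph: node labels indexed by Int, plus an adjacency list.
-- An undirected bond is stored as two directed edges, one in each
-- neighbor list.
data Structure = Structure
    { atomNames :: Map Int AtomName
    , neighbors :: Map Int [Int]
    } deriving (Eq, Show)

-- A tiny example: a C-C-O fragment (roughly an ethanol backbone).
ccoFragment :: Structure
ccoFragment = Structure
    (Map.fromList [(0, "C"), (1, "C"), (2, "O")])
    (Map.fromList [(0, [1]), (1, [0, 2]), (2, [1])])
```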

Now, I can define new Parsers that operate on Structures instead of Strings.

Parsing bonds

The most primitive parser I'm interested in parses a single bond. It requires two AtomNames which specify what kind of bond to look for (e.g. a carbon-carbon bond, though the specification can be even more precise). Then, it outputs which two indices in the graph matched that bond specification:

parseBond :: AtomName -> AtomName -> Parser Structure (Int, Int)

I can use the list monad (i.e. a list comprehension) to define this primitive parser (and don't worry if you can't precisely follow this code):
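The implementation itself is elided in the post; the following self-contained sketch (reusing my hypothetical Map-based Structure, with my own field names) shows how a list comprehension can express it. Matching a bond consumes it, returning the mutated graph as the "unconsumed input":

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type AtomName = String

data Structure = Structure
    { atomNames :: Map Int AtomName
    , neighbors :: Map Int [Int]
    } deriving (Eq, Show)

newtype Parser s a = Parser { runParser :: s -> [(a, s)] }

-- Delete both directed edges of an undirected bond.
deleteBond :: (Int, Int) -> Structure -> Structure
deleteBond (i, j) (Structure names adj) =
    Structure names
        (Map.adjust (filter (/= j)) i (Map.adjust (filter (/= i)) j adj))

-- Every bond whose endpoints carry the two given names is a valid parse;
-- each result pairs the matched indices with the structure minus that bond.
parseBond :: AtomName -> AtomName -> Parser Structure (Int, Int)
parseBond name1 name2 = Parser $ \s ->
    [ ((i, j), deleteBond (i, j) s)
    | (i, js) <- Map.toList (neighbors s)
    , Map.lookup i (atomNames s) == Just name1
    , j <- js
    , Map.lookup j (atomNames s) == Just name2
    ]
```

Because both directed edges are stored, a carbon-carbon query returns both orientations of every matching bond.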

Haskell strongly encourages a pure functional style, which keeps me from "cheating" and using side effects or mutation to do the parsing. By sticking to a pure implementation, I gain several bonus features for free:

If our bond occurs more than once, this correctly matches each occurrence, even if some matches share an atom

If both AtomNames are identical, this correctly returns both orientations for each matched bond

This handles backtracking with (<|>) correctly

I can parallelize the search easily since every search branch is pure

I got all of that for just 6 lines of code!

Parsing substructures

Now I can build more sophisticated parsers on top of this simple bond parser. For example, I can build a generic substructure parser which takes a sub-Structure to match and returns a list of matched indices:
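The post's 18-line implementation isn't reproduced here. As a self-contained sketch (again using my hypothetical Map-based Structure; the real code surely differs), one can fold over the sub-Structure's bonds, matching each one in the target while keeping the sub-to-target index mapping consistent:

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type AtomName = String

data Structure = Structure
    { atomNames :: Map Int AtomName
    , neighbors :: Map Int [Int]
    } deriving (Eq, Show)

newtype Parser s a = Parser { runParser :: s -> [(a, s)] }

-- Delete both directed edges of an undirected bond.
deleteBond :: (Int, Int) -> Structure -> Structure
deleteBond (i, j) (Structure names adj) =
    Structure names
        (Map.adjust (filter (/= j)) i (Map.adjust (filter (/= i)) j adj))

-- List each undirected bond of the sub-structure once (i < j).
subBonds :: Structure -> [(Int, Int)]
subBonds s = [ (i, j) | (i, js) <- Map.toList (neighbors s), j <- js, i < j ]

-- Match every bond of the sub-structure in the target, returning the
-- mapping from sub-structure indices to target indices for each hit.
parseSubstructure :: Structure -> Parser Structure (Map Int Int)
parseSubstructure sub = Parser (go (subBonds sub) Map.empty)
  where
    go []              m s = [(m, s)]
    go ((a, b) : rest) m s =
        [ result
        | (i, js) <- Map.toList (neighbors s)
        , Map.lookup i (atomNames s) == Map.lookup a (atomNames sub)
        , consistent a i m
        , j <- js
        , Map.lookup j (atomNames s) == Map.lookup b (atomNames sub)
        , consistent b j (Map.insert a i m)
        , result <- go rest (Map.insert b j (Map.insert a i m))
                            (deleteBond (i, j) s)
        ]
    -- A sub-atom must always map to the same (and a fresh) target atom.
    consistent a i m = case Map.lookup a m of
        Just i' -> i' == i
        Nothing -> i `notElem` Map.elems m
```

Laziness means we only pay for as many of these candidate matches as we actually demand.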

Reusable abstractions

I find it pretty amazing that you can build a substructure parser in just 18 lines of Haskell code. You might say I'm cheating because I'm not counting the lines of code it took to define the Parser type, the Monad implementation, and the (<|>) operator. However, the truth is that I can actually get all of those features using one line of Haskell:

type Parser s = StateT s []

I don't expect the reader to know what StateT or [] are, but what you should take away from this is that both of them are part of every Haskell programmer's standard repertoire of abstractions.

Moreover, when I combine them I automatically get a correct Monad implementation (i.e. do notation) and a correct Alternative implementation (which provides the (<|>) function), both for free!
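To make that concrete, here is the one-liner in action. This is a sketch: parseChar is my own example, and StateT comes from the transformers package that ships with GHC:

```haskell
import Control.Applicative ((<|>))
import Control.Monad.Trans.State (StateT(..))

-- The whole Parser machinery in one line: StateT threads the unconsumed
-- input, and the list base monad provides multiple results plus (<|>).
type Parser s = StateT s []

-- Match a single specific character at the head of the input.
parseChar :: Char -> Parser String Char
parseChar c = StateT $ \s -> case s of
    (x:xs) | x == c -> [(c, xs)]
    _               -> []
```

For example, `runStateT (parseChar 'a' <|> parseChar 'b') "abc"` evaluates to `[('a', "bc")]`, and do notation works unchanged.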

Conclusions

This is just one of many abstractions I used to complete a structural search engine for proteins. Now that it's done, I'll be blogging more frequently about various aspects of the engine's design to give people ideas for how they could use Haskell in their own projects. I hope these kinds of code examples pique people's interest in learning Haskell.

Appendix

I've included the full code for the String-based Parsers. The Structure-based Parsers depend on several project-specific data types, so I will just release them later as part of my protein search engine and perhaps factor them out into their own library.

Also, as a stylistic note, I prefer to use ($) to remove dangling final parentheses like so:

Parser (\str ->        =>    Parser $ \str ->
    someCode           =>        someCode
    )                  =>

... but I didn't want to digress from the post's topic by explaining how the ($) operator behaves.

15 comments:

Do you have any performance statistics for your structural search engine? For example, how big is a typical Structure graph and how long does your implementation take to search it? Have you compared it to an optimized C/C++ implementation?

I've tinkered with Haskell before but never managed to get anything close to C/C++ performance out of it. However, I've been working on numerical PDE solvers, which is quite a different sort of problem, so maybe Haskell just isn't so suitable for that sort of thing.

Yes, I do. First off, the index building is not really the rate-limiting step and the search algorithm is what must be fast (and it is VERY fast). However, the indexing algorithm still has good enough performance for our purposes.

I've only benchmarked the indexing on a subset of the protein data bank since my computer only has 1GB of memory free for testing and we're still waiting on a part for the final server that has enough memory to hold the index for the entire data set, so until then I can't benchmark the final data set. However, I have extrapolated based on what my computer can hold, and it indexes 4 structures a second on a single 1GHz core. The graphs of these structures have 2000 nodes on average and they are very sparse with a maximum degree of 4 for any node (since protein atoms form 4 bonds maximum), and 3 is typical. Even when I eventually branch out to all molecules, the maximum outdegree is 6 and that's rare, so I store these graphs as adjacency lists to take advantage of their sparseness.

Extrapolating to the entire protein data bank (i.e. 85,000 structures), the indexing stage would take on the order of 6 hours and 40 GB of memory. However, I'm only indexing the entire data set just to prove that we can (and that's what the server is for). In practice, the search service will index a high-resolution non-redundant data set of 2000 protein structures, which should take about 8 minutes and 1.1 GB of memory and runs rapidly on any ordinary desktop computer.

Once the index is built the search runs very quickly on just one 1 GHz core and responds in milliseconds to queries. It turns out that the front-end (PyMOL, a molecular visualization software that uses a C backend for loading the results) is the rate limiting step, so short of modifying PyMOL I can't really improve performance any more, and it's the fastest molecular visualization software available since it is commercial.

I haven't written the equivalent algorithm in C (the data structures and tests alone would take forever to implement), but for other projects I have written the same code in both Haskell and C. I can usually get within a factor of 3 of C (both for performance and memory usage) just by using high-performance Haskell libraries that are either heavily optimized to generate efficient core or just wrappers around C APIs.

For example, just by defining Storable instances for my data types and using Data.Vector.Storable, I got an in-memory representation that was byte-for-byte identical to what I calculated the equivalent C representation would take. Similarly, the Vector library gives almost C-like array performance and generates really good assembly even for its high-level API.

So if you want an economic assessment of the performance vs. language tradeoff, I cost my professor about $200 a work day ($100 in salary and $100 in tuition+benefits) and this project took exactly four weeks (20 work days), so the price of labor is $8000. I expect that had I written it in C it would have taken at least 3 times as long, so the cost savings is on the order of at least $10,000 and more like $20,000. The cost of our server that can fit the entire PDB data set in memory is $2500. So in this case, the hardware is much cheaper than I am and it's a better use of my professor's money to program more quickly in Haskell and just buy a cheap server.

I've compared substructure matchers in cheminformatics and I can say that two implementations, both in C++, can vary by a factor of 8. The difference is in data structure layout, attention to algorithm implementation, etc. I don't have the PDB data set hanging around any more, so I can't do extensive testing. I loaded 2PLV, which is a virus capsid structure containing 7000+ atoms. I can find all 76 phenol-like structures (6 carbons in a ring, with a bond to an oxygen; SMARTS "[#6]~1~[#6]~[#6]~[#6]~[#6]~[#6]~1O") in about 0.56 ms. The search for the peptide backbone (SMARTS "CCNC(=O)") finds 1709 hits in 6.6 ms.

This is with OEChem, a high-performance, commercial toolkit for cheminformatics. There's been a lot of time spent optimizing this code base.

Hence it's *extremely* unfair to ask about a comparison to an "optimized C/C++ implementation" because that's like asking how my wife's speed on her recumbent compares to a Tour de France racer - they are for entirely different goals.

BTW, your salary estimate is off by about a factor of two. Your professor pays your salary out of grant funding, and about 1/2 of that funding goes into overhead for the building, library, network, etc. But a purely economic argument isn't appropriate because if it were you would get the academic version of the OEChem toolkit. It is free to academics. It also means that you wouldn't be learning how to implement fundamental algorithms, which is a part of the training you likely need for your future research.

For the last few years, I have been working in an industry with a different balance of programmer cost versus runtime cost. Our jobs typically take weeks (sometimes months) to complete on clusters of thousands of cores, so a couple of months of programmer time to make something run twice as fast is often worth it.

I had a go at implementing a finite difference application in Haskell. I enjoyed working with Haskell but my initial implementation was painfully slow. That's fair enough: it was my first serious attempt at using the language. So I learned about Data.Vector and some other libraries. However, after a while I found my beautiful, clean Haskell was becoming harder to read and understand than my original C so I gave up.

I came to two conclusions. First, Haskell has a steep learning curve. Simple algorithms are beautiful and easy to understand but realistic programs are hard for a novice to both read and write. Second, high-level functional languages are not applicable to performance-critical applications. Would you agree?

I generally agree with you. Usually, the Haskell approach to get C-like performance is to try to write batch operations in a C library and then just put a high-level API (i.e. `hmatrix`, which is just a wrapper around `libgsl`, which in turn just wraps `LAPACK`). Usually I switch to Haskell when the cost overhead of calling out to the C FFI is much smaller than the operation it calls, but even then I imagine you can still get cache issues or garbage collection issues that you would not get in a tuned pure C/C++ implementation.

(Reddit is down so I moved the conversation here.) The conversation there is:

Me:

I've done molecular structure parsing for a long time, and I don't understand what you explained. First, what is your graph data structure? I get that atoms only have a name, and that a bond either exists or doesn't (that is, there's no bond type information; that's reasonable for structural bioinformatics though not for cheminformatics.) Does that mean that each bond link is stored twice? That is, if the atom at index 5 has 8 as an adjacent atom then the atom at index 8 has 5 as an adjacent? That appears to be the case.

(A cheminformatics package would have the atoms pointing to a bond, with bond type and aromaticity information, and the bond type would point back to each of the two atoms it's connected to.)

You do a 'deleteBond', and I couldn't tell if deletion goes as O(n) or O(log n) time. For your code to make sense, it must be the latter. Your molecules are protein sized, but how big are your substructures?

I can't figure out what your substructure patterns look like. You're matching on atom names, and atom names in proteins are unique in a residue, yes? This means there's very little backtracking. Even if you match on element name, there's relatively little call for extensive backtracking. If that's the case then that would explain why parseSubstructure is so short. I usually expect something like Ullmann or VF2 for the matching, and from what I can see, your algorithm for the general case is much less efficient. That is, it looks like you do successive linear scans of the entire bond list, which implies an n*m time, where n is the number of atoms in the molecule and m the number of atoms in the subgraph.

I have a question about the <|> operator. What would happen if you wanted to find if a single substructure containing a linear chain of 12 carbon atoms exists in a buckyball? Would the (<|>) operator end up generating all 118440 matches? Or would you just get the first that it finds? How long does it take?

You replied:

The graph is undirected but since the Graph type is directed I store bonds as two directed edges and similarly delete them in pairs.

The graph is stored as an adjacency array, so the time complexity of deletion is only proportional to the worst-case valency, which is 4 for proteins, so it is constant (one array lookup, delete an element from a linked list of maximum size 4, repeat for the other direction).

Edit: Or maybe the correct term is adjacency list (I'm still new to graphs)? It's a node-indexed array where you store linked lists of neighbors, if that clears it up.

Everything you said about substructure matching is correct except that I don't scan the entire bond or atom list. I just scan the atom's neighbors. Again, because valency is capped, proteins are the epitome of sparse graphs, so this is efficient.

Regarding your last question, Haskell is lazy, so it computes only as many results as I demand. I always take the first result so this prevents the combinatorial blowup. But, it does not guarantee that the first match won't require a lot of backtracking (especially for parsing something like a buckyball). I specialized this algorithm to protein structures and their chemistry. I should have made that clear in the post, however the topic was on the novelty of parsing non-textual things and not about the algorithm's efficiency.

Also, since you are interested in backtracking, Haskell has a very sophisticated backtracking library named LogicT that also handles pruning and fair branching. I didn't use it because it was unnecessary for the small motifs that I index.

Also, I use several other tricks to keep the search performant. The most significant one is that I partition the structure into pages of fixed volume since I have a specified upper bound on the size of each query (In this case, a cube of 15 angstroms). This is probably the optimization you were looking for me to talk about.

Yes, bounded valences help a lot. In general there may be higher-order valences, but if you're just working with proteins, and skipping, say, organometallic systems, then there shouldn't be any problem. Then again, even with 10-valent systems, I can't see there being much of an effect. In fact, you can likely store them as a vector, and not have the linked-list overhead.

I mostly work in graph space, so spatial cutoffs aren't so relevant. The Ullmann and VF2 algorithms refine the algorithm you describe, in order to improve its efficiency. Ullmann is described at http://www.engr.uconn.edu/~vkk06001/GraphIsomorphism/Papers/Ullman_Algorithm.pdf . You may be interested in some of the work described in http://www.jcheminf.com/content/4/1/13/abstract , although I have disagreements about some of the historical statements they asserted, and I found their work somewhat shoddy.

BTW, I think there's a problem about using the Parser for this task. If you want to match a single bond then there's no reason to return the mutated graph from parseBond. Instead, it should return the original graph, and parseSubstructure, if it needs to continue the search, should do the mutation itself. As it stands, you're always doing an extra graph mutation. I suspect that the performance difference is measurable, and guess it's about 5%. (Figuring motifs of length 10 atoms, and 1/2 the time spent in graph search and 1/2 in graph edits.)

You are right. I could also inline the definition of parseMotif into parseSubstructure and handwrite the state-passing and list operations manually to maximize performance and combine or trim graph updates. In practice, though, this is not the rate-limiting step of the service, since it only indexes once and then just runs as a search. It goes through 4 PDB structures a second, so even for large data sets it is just a few hours to index at most and then I just serialize the results to an SSD for safe keeping. The search service is the only part that has to be performant and it just loads the results into memory once and returns results on the millisecond time scale from that point onward.

Also, keep in mind I'm only using the full protein data bank just to prove in my paper that it scales to the worst case scenario. The actual search service I will provide will only index a couple thousand high-quality non-redundant structures and that only takes minutes to index.

Also, keep in mind that I'm not even mentioning the parsing algorithm in the paper I'm publishing. The main novelty of my paper is the blazing fast all-atom structural search that is provided as a network service that seamlessly interfaces with molecular visualization software. That's why this is a blog post and not a paper! I mainly use my blog to write about non-publishable lessons I learn while programming in Haskell to get people excited about learning the language.

I do lots of substructure matches, so when I saw the link on reddit I nearly jumped. "Lots" to me means something like 20 billion matches, of 650 substructures against 30 million small molecule compounds, or more recently some work I've been doing to find maximum common substructure profiles for a set of compounds. See http://dalke.developer.se:8888/ for my work-in-progress.

Since I'm a consultant, and don't get paid to write papers, and have to pay either for access to pay-walled-literature or pay for some US$12000 to publish in the open access literature, I tend to write my blog posts as publications. ;)

Well, it's not a generalized substructure search. The only reason it performs well at all is that it only keeps track of a limited set of substructures (i.e. imidazole, carboxyl, indole ring, peptide bond, carboxamide) that are relevant to the target scientific audience (Structural biologists and protein designers). There is no way I could reasonably use my algorithm to index every possible chemical substructure and if I tried searching the database online without any sort of indexing my substructure search would fall down and cry! Like I said, my algorithm is specialized for protein-centric searches where it focuses on protein-centric applications with very low latency and integration with visualization software. It definitely is not a general purpose drug-design tool.

Oh, also, since we seem to have some overlap in interests, I'm curious if you live near the San Francisco bay area or in Texas. I currently work at UCSF, but I will eventually move back to Texas after a couple of years. Also, since you are a consultant that does a lot of substructure stuff, I'm wondering if you know Barry Bunin or his CDD company. That seems like something that would be right up your alley.

Oh, don't get me wrong - I wasn't saying that it should be a generalized search and neither was I complaining about anything. I was describing how my history got me to follow the link to your essay. Your "non-publishable lessons I learn while programming in Haskell to get people excited about learning the language" just also happens to pick up "people interested in molecular substructure search" ;)

I live in Sweden, which is rather far from both San Francisco and Texas. I don't know who Barry Bunin or CDD is. I looked at the web site now; more specifically, the video. It's going to be hard to convince pharmas to trust their data to an external cloud service. I can't judge the usefulness of the tool because I don't know the target audience and their goals. My guess is small pharma/biotech who don't have internal IT support. But they need to be big enough that using Excel/SD files isn't enough. Again, this isn't a market that I know much about.

It is better to use Haskell for all tasks. C++ can't efficiently use multiple cores and has data race issues. Haskell is slightly slower than C++ but an optimized Haskell can be slower than C++ just by a factor of 2.

Dear Mr. Dalke, you have pointed out that one can employ a programmer for a couple of extra months to do the task in C++ rather than use Haskell. I feel that Haskellers can do another new project/task in that period. C++, Java, and scripting languages are a failure now. There is no way to handle concurrency and parallelism in these languages efficiently. The author of the article rightly pointed out that we can throw more hardware at the problem rather than use programmer time. => Haskell is the future whether you like it or not. That's it.