Saturday, March 21, 2009

Concise Generic Queries

A few weeks ago David Miani asked how to write concise queries over a data type. The answer is certainly generic programming, a technique that I feel is underused in the Haskell community. I suggested David look at Uniplate, but he found greater success with SYB. Sean Leather gave a solution using EMGM. One of the advantages of Uniplate is conciseness, so I decided to tackle the same problem and compare.

A full description of the task, including data type definitions, is at Sean's blog. From a data type representing structured files (tables, headings, paragraphs) find a heading with a particular name then within that heading find a paragraph starting with "Description". The rest of this post contains solutions using Uniplate, EMGM (taken from Sean) and SYB (from David). The SYB solution is slightly different from the EMGM or Uniplate solutions, but they all do roughly the same generic operations. It is entirely possible that the EMGM/SYB solutions could be improved, but that is a job for other people.

Line 1: Type signature, given a name and a file, return the paragraph if you find one

Line 3: Find a heading with the right name

Line 4: Find a paragraph below that heading, whose name starts with "Description"

Line 2: Pick the paragraph

I find this code to be a clear, concise and simple description of the problem. The thought process to come up with the solution was as follows: You want to search, or perform a query. The first question is whether this is a deep (all nodes) or shallow (just the children) query - David doesn't say but the example seems to imply deep. If it's deep use universeBi. Operations are combined with a list comprehension that finds an element, check it has the necessary properties (the name), then performs more operations. The result is the code you see above.

This solution isn't bad, but is more verbose than the Uniplate solution. Perhaps it could be rewritten with list comprehensions? It seems that G.collect is similar to universeBi - although I am not sure.

SYB Solution

David's SYB solution can be found here along with another solution using different combinators.

The relative merits of each solution are highly subjective, but I believe the Uniplate solution is concise. The Uniplate solution is a simple translation of the problem, without any clever steps, so hopefully other users (who didn't write the library!) will be able to achieve similar results. The Uniplate solution required only one function from the Uniplate library, so has a small learning curve. Even if you don't choose Uniplate, generic programming techniques are very useful, and can make your code concise and robust.

The SYB solution is working with Either to provide feedback when the element is not found. The feedback is better than that provided by the Uniplate or EMGM versions, at the cost of some extra lines. Using Maybe we can get a shorter version:

You are right in that Uniplate's "universeBi" and EMGM's "collect" are the same. EMGM's solution used Maybe directly, instead of keeping things in [], so list comprehensions couldn't be used. However, since the EMGM solution uses "firstr" only to convert lists to Maybe, you could use listToMaybe as well. We can keep using [] and rely on lazyness such that the outermost listToMaybe will only actually force calculation of the first element in the comprehension, instead of cutting off at the first element with firstr inside "firstPara", we get:

José: Thanks for the Maybe version of SYB, that does follow the same pattern now. I realised there were differences in the operations with the Either vs Maybe, but couldn't be bothered to change/test the SYB again.

Joeri: Is there any reason for EMGM to use Maybe over list comprehensions? You revised EMGM seems much clearer.

There's actually a couple of differences in that they both search for something different, due to the backtracking nature of [].

The Maybe version will only consider the first OrgHeaderP with the right name, and only return Just when the first ParagraphP inside that OrgHeaderP matches "Description".

The [] version will consider all OrgHeaderPs with the right name, and return Just if any ParagraphP inside matches "Description". It'll return Just the text of the first matching ParagraphP inside the first matching OrgHeaderP.

If we somehow have the guarantees (due to file structure or other) that if we have a "Description", it's in the first ParagraphP, and the names of OrgHeaderP are unique, it's the same thing. If we cannot guarantee such a thing, it depends on what you're looking for.

Of course, the same holds for the Uniplate version, which can, in a similar way, be rewritten using Maybe. Just throw in some extra listToMaybes.

I worded that third paragraph poorly. The [] version will search search for other matching OrgHeaderPs when the first one with the right name doesn't have a matching ParagraphP. The conversion to Maybe cuts short this further searching.

If we have the guarantees mentioned in my previous comment, the shortcutting done by conversion to Maybe will be more efficient, because it'll save us from searching through the whole tree.

But as far as I know, it's not something library-specific. If needed, insert listToMaybe in the Uniplate version if you want the notationally more cumbersome, but non-backtracking version.

So I think in this case, I think it's just a different interpretation of the problem.