Getting into blogging before it's too late

Category Archives: Haskell

I’m now in my fourth week of a PhD in Computer Science at the Australian National University under the supervision of Professor Brendan McKay (of nauty fame). My topic is as yet still not completely defined, but at the moment it is to do with comparison of graph generation algorithms.

PEPM’10

In January this year, I presented a paper on my SourceGraph program at the 2010 ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation (PEPM), a reprint of which is available here. For those of you who are interested, the slides are used (which were heavily edited the night before I gave my talk!) are also available. I would have made these available before now but as soon as I arrived back in Australia I was busy getting ready to move down to Canberra, and I only obtained internet access last week.

The experience was interesting: not only was this the first time I have presented at a professional conference, it was also the first time I had even attended one. What made it even more interesting was the large proportion of people who presented there who (like myself) were not from the actual targeted field (e.g. Florian Haftmann who presented a paper on Isabelle and Haskabelle is from a theorem-proving background).

I’m still working on SourceGraph (if you compare the paper to my Honour’s thesis you should be able to spot the improvements it has already); however, the fact that I actually have university work I’m meant to be doing is proving a hindrance.

Other Projects

Generic graph class

Some people have expressed interest in whether or not I’ve gotten anywhere on my generic graph class proposal. The answer is that I have a rough-and-ready form of one based upon initial discussions I had with Cale Gibbard that uses associated types to define various types such as the vertex type, vertex label type and arc label type. However, the main stumbling block at this stage is that I wanted to add a sub-class dealing with “mappable graphs”, i.e. those graph types that support applying a mapping function over the vertex and arc labels. The problem is, superclass equalities have not yet been implemented within GHC as of 6.12 (but maybe 6.14 will have them). In any case, if anyone is interested I can send them the current messy version, but even without the mappable graphs class it still requires at least GHC 6.10 for associated types support.

Noam Lewis (aka sinelaw) had asked me to add functionality to augment a list of DotGraphs using a single dot (or related command) process; however it turns out that for more than five DotGraphs this takes longer than doing each one individually. As far as I can tell, the problem relates to the rendering of the pretty-printed list of DotGraphs. As of now I’ve rolled back support for this, but I may try rendering to lazy text values, which will also let me avoid encoding problems by forcing usage of UTF-8.

I’ve also been asked by several people to add support for record labels; I’m doing so, but I got side-tracked into implementing proper support for HTML-like labels, which involves a lot of manual experimentation to determine what is valid syntax, etc.

Another area which involves manual experimentation: proper parsing of nodes and edges. Up until now, the library has assumed that there is at least a semi-colon between each individual statement in Dot code. This is not actually the case (which I found out when trying to parse the output of ghc-pkg dot), but when I tried to change this to assuming that each line ended in either a semicolon or a newline (I’m a mathematician, so or implies both as well!) then the edge a -> b was instead parsed as just the node a (with a parsing failure for the rest of the input). What makes it even better is that this is a valid Dot graph:

digraph { a b -> d c -> b -> a [color="red"] }

Now, how the hell am I meant to sensibly parse that?!?!? (Note: it isn’t the multi-edge c -> b -> a that’s the problem; the library can already cope with that; it’s just being able to tell when a particular statement ends and another begins). Bonus points if you can tell just by looking at it which edges are coloured red.

For version 2999.7.0.0 of my graphviz library, one of the improvements I’ve made was to re-write how to call the actual Graphviz command so that error messages could actually be reported to the user and to make it more robust (compare the old version to the new one). Duncan Coutts helped me out with this, by giving me a starting point from Cabals’ rawSystemStdout' function.

So I wrote the code, recorded a Darcs patch, then a few days later decided to actually test it (for some reason I didn’t seem to have actually tried using the function after writing it…). But whenever I tried to use it, I kept finding that the call to Graphviz would block: the input would seem to get consumed, but no output was generated. Duncan and I spent a few hours putting trace statements, etc. throughout before we eventually worked out what the problem was.

My challenge: without looking at the released version of the code that I’ve linked to above, try to find the problem part of the code below. There’s a crucial thing to remember here whenever making a system call to another executable. I’ve cleaned up and simplified the actual code I’ve put here, but otherwise its exactly what I had when I tried to work out why it wasn’t working. Note that hGetContents' is a strict version of hGetContents

However, it isn’t clear what should happen if a number is used as a string value: does it need quotes or not? Furthermore, that page doesn’t specifically mention that keywords (graph, node, edge, etc.) need to be quoted when used as string values (it just says that compass points don’t have to be quoted).

What is a cluster?

The language specification page mentions that it is possible to have sub-graphs inside an overall graph, and that these sub-graphs can have optional identifiers. The Attributes page has mention of cluster attributes. But the only way to tell how define a cluster is to look at the examples page and notice that a sub-graph is a cluster if it has an ID that begins with cluster_ (with the underscore also appearing to be optional when playing with the Dot code manually). Furthermore, it isn’t specified that if you have more than one cluster, then they must have unique identifiers; it doesn’t even suffice to have two “main” clusters with identifiers of Foo and Bar, each with a sub-cluster with an identifier of Baz: the sub-clusters have to have unique identifiers as well; it took me a few hours to work this out.

If that isn’t bad enough, the fact that cluster_ is at the beginning of every cluster identifier means that the normal quotation, etc. rules for values doesn’t seem to work: a HTML identifier for a cluster now has the form of "cluster_<http://www.haskell.org&gt;"; that’s right, it’s a URL prepended with a string and then wrapped in quotes! This plays merry hell with any attempts at properly generating and parsing identifiers for sub-graphs, especially when considering what happens to escaped quotes inside that string (my approach has been to do a two-level printing/parsing).

Poor/inconsitent documentation

In several cases, the documentation for Graphviz contradicts itself. Take output values for example: the official list of output types can be found here. Yet, if we look at the documentation for how to define a color value, we find it mentions a non-existent “mif” output type. Not only that, but there are apparently various renderers and formatters available for each output type; not only are these renderers and formatters not listed anywhere, it isn’t even explained what these renderers and formatters do (let alone what’s the difference between them). Furthermore, to make matters even interesting on my system I have at least one more output type (x11) than what is listed there.

Custom standards

Another annoying factor is how Graphviz treats named colors. The default colorscheme is to use X11 colors. However, if you compare Graphviz’s X11 colors to the “official” list (such as it is; there’s no real official standard, but most X11 implementations seem to use the same one) you’ll notice that they’re different: some colors have been added and others removed. I admit that it could arise from an older X11 implementation’s definition of X11 colors, but it prevented me from making a common library to use for X11 colors.

Assertion Madness

Every now and again, Graphviz fails to visualise a graph because an internal assertion failed; for example: dot: rank.c:237: cluster_leader: Assertion `((n)->u.UF_size <= 1) || (n == leader)' failed. This is extremely annoying, not least because even looking through the relevant source code doesn’t reveal what the problem is. If these assertions are really needed for some reason, please say why and what the actual problem is.

Getting help

I’m spoiled: #haskell is one of the largest IRC channels on Freenode, and the various Gentoo ones are usually rather large and helpful as well. Usually whenever I try to get help from #graphviz, I get no help; partially because there’s sometimes only two other people there, neither of whom respond (probably due to time zones).

There’s more

There are other niggles I’ve had with Graphviz, but these are the main big problems I’ve had that I can recall.

Overall, however, Graphviz is a great set of applications; unfortunately, they seem to be feeling their age (along with keeping a large number of deprecated items floating around for compatibility purposes).

As I’ve mentioned previously, I’m currently writing a QuickCheck-based test suite for graphviz. Overall, I’m quite pleased with QuickCheck, especially since the amount of moaning from people that QuickCheck-2 (which I’m using) is too different from the version 1.x series. The monadic usage of Gen for arbitrary means in most cases instances are just a matter of picking the right liftM function with multiple calls to arbitrary. However, in the course of using it, I’ve come across some observations/problems with QuickCheck.

Default Instances

Usually, it’s great to see a library define instances of its own classes for the common data types (that is, those available from the Prelude, etc.). However, I’m finding the the default instances of Arbitrary for lists (and to a lesser but related extent Chars) a pain. Specifically, how the shrink method is defined: it not only tries to shrink the size of the list (which is great) but also individually shrink each element in the list. My preferred behaviour (which I’ve defined a custom function that my code explicitly calls) is to just shrink the size of the list, unless it’s a singleton, in which case try to shrink that value.

The reason the default behaviour is so bad in my case is that I quite often have lists of custom data types, which can individually have lots of other sub-types in them, possibly with lists of their own. As such, if I used the default shrinking behaviour on lists, this could result in a lot of attempts at shrinking.

Note that this isn’t really a problem of QuickCheck per-se: it’s great that it defines an Arbitrary instance for lists; it would be great (but probably not type-safe, etc.) if it was possible to override class instances in Haskell.

Why lists anyway?

One of the problems with the shrinking behaviour of lists is due to the number of appends that occur; whilst lists are nicer/easier to deal with, using something like Seq from Data.Sequence might improve performance.

Getting big for its boots

In most of the tests that I’ve done, the problems that occur are usually in printing and parsing Attributes. As such, at the start of my test suite I run a test on lists of Attributes; to try and ensure that they’re valid I run 10000 tests rather than the default 100. These extra tests, however, come at a price: QuickCheck keeps testing longer and longer lists, which means that each individual test takes longer and longer to run. I’d prefer to run even more tests which are individually smaller (around the mid-point of what gets generated with 10000 tests); as it is, 10000 tests take over half an hour here.

Documentation

Whilst the high-level details are explained rather well, there’s parts of the QuickCheck documentation that is rather lacking. First of all, how to use QuickCheck; I wasn’t aware that there was a community standard of starting the names of all properties with prop_ (though Real World Haskell deals relatively well with how to use QuickCheck). Also, it took me a while to dig through the (relatively un-documented) source to work out that a Result of “GaveUp{...}” is returned when too many values were discarded.

Keep going

You’ve found a value that breaks the property? Excellent (well, not in that it’s great to have a bug, but it’s great that it was picked up)! But can’t you please keep going and trying to find more?

Edit: one of the reasons I would like this behaviour is for when the test isn’t actually a failure per-se, it’s just a matter of my Arbitrary instances not being strict enough. For example, if it generates a String value that is actually a number (e.g. "1.2") and my data type can be either a Double or a String, then obviously this value should actually be parsed back as a Double; this though breaks the print . parse = id property in its most strictest sense. As such, if quickCheck kept going, then I could manually verify whether it is a bug or not and fix it (so that an arbitrary String isn’t actually a number) whilst it kept doing the rest of the tests.

Getting results

Related to the previous one: once quickCheck has found a data value that breaks a property, the only way of getting that value to manually determine why the property is breaking is to copy/paste it: whilst the output can be redirected, it’s an all or nothing affair of the entire output rather than just the data value itself. Even better if the Result data type was parametrised so that it could return the value in its Failure constructor, so that in my code I can manually write it to file using my wrapper script around the QuickCheck tests.

Recursive values

In graphviz, I have a DotGraph data type which contains a DotStatement value; this contains a list of DotSubGraph values, each of which contains a DotStatement value. As such, my initial implementation of Arbitrary for these data types resulted in large, deeply recursive structures even for “small” sample values; this resulted in making it almost impossible to track down the source of the problem that resulted in an error. As such, to solve this I’ve done the following:

Define an arbDotStatement :: Bool -> Gen DotStatement function which will only have a non-empty list of sub graphs if the boolean is True.

The Arbitrary instance for DotStatement has arbitrary = arbDotStatement True; that is, an arbitrary DotStatement value can contain DotSubGraphs.

The Arbitrary instance for DotSubGraphs uses arbDotStatement False to generate its DotStatement; that is, a DotSubGraph cannot have any DotSubGraphs of its own.

This results in an Arbitrary instance of any of these data types that won’t endlessly recurse and is thus easier to debug.

Brent Yorgey is doing work on testing functions that use recursive data types; that should also help in the future.

Shrinking

I think the inclusion of shrinking into QuickCheck is great, in how it helps find a minimal common case for a bug. I’ve found, however, that for large data types you need to be very careful how you implement the shrink method: I’ve found it useful to only shrink the sub-values that are most likely to have errors (that is, the Attributes) rather than checking every possible shrink of the integral node ID, etc.

How do you have 0.11043 of a shrink?

With shrinking, however, what does QuickCheck mean when it says something like 0.11043 shrinks? Is it trying to say how deeply its shrinking? Note that this doesn’t seem to be a real floating point number; it seems to be treated as Int . Int.

In my previous post (what? I’m doing another post just three days after my previous one? 😮 ), I mentioned that I was planning on adding QuickCheck support to graphviz. Last night, I finished implementing the Arbitrary instances for the various Attribute sub-types and did a brief test to see if it worked… and came across three bugs :s

Parsing my own dog-food

The property that I was testing was that parse . print == id; that is, graphviz should be able to parse back in its own generated code output and get the same result back. I decided to do a quick test on the Pos value type, as I figured this would be reasonably complex due to the usage of either points or splines. And yes, I was right that it was complex, as this revealed the following three bugs:

When printing the optional start and end points in a spline, they should be separated from each other and from the other points with spaces; I had used the pretty-printer <> combinator rather than <+> .

Lists of splines should have only semi-colons in between each spline and not a semicolon and a space: using hcat rather than hsep fixed this.

The parsing behaviour was initially to try parsing the Pos value as a point first and then a spline. However, if the spline didn’t contain an optional start or end point, then the parser would successfully parse the first point in the spline as a stand-alone point, and then choke on the space following it (or indeed, a spline consisting of a single point followed by another spline would also confuse the parser). Thus, testing for a spline-based position first fixed this.

Note that this is two printing-based problems and one parsing-based problem. The initial fix for the last bug, however, created another problem: as I alluded, a spline consisting of a single point is equivalent to a point, so all point-based positions would be parsed as a single spline consisting of a single point. The current parsing behaviour is now to only parse as a spline-based position, then convert it to a point-based position if necessary.

Rant about tests in packages

Duncan Coutts recently mentioned that QuickCheck is one of the packages that split HackageDB, due to the newer version 2 branch being incompatible with the (more popular) version 1 branch. My opinion is that this is a problem with how Haskell developers treat testing in their packages both from a user and from a distribution-packager point of view.

Let’s take hmatrix as an example. It uses both QuickCheck and HUnit for testing purposes. However, why does an end-user care about the tests, as long as the developer has run them? This introduces two compulsory dependencies to the package (which I have no problem with overall) that most people don’t need or care about. Some library developers include their tests (and the dependencies for those tests) in a separate executable that is disabled by default; however, due to how Cabal deals with this (Duncan has partially fixed the problem for Cabal-1.8), these “optional” dependencies are still required. I can think of several reasons why developers include the tests inside the main package in this way (listed in what I think is decreasing order of validity):

The tests use internal data structure implementations or functions that should not be publicly accessible.

The tests are also located inside the library as extra documentation about the properties of the library.

Convenience; everything is all bundled together, and if end users want to test the validity of the code it’s there for them.

Laziness: why should they bother separating it out when it makes it easier for them to do “cabal install” and run the test binary?

I myself have never run a test suite for a package that is not my own, and I wonder how many people actually do. I just find that this makes packaging libraries for Gentoo more difficult, and leads to the problems Duncan has been having on Hackage.

My approach

I have a darcs repository for graphviz, the location of which is listed in the Cabal file (and is displayed by Hackage). This darcs repository is publically available for anyone to get a copy of, so if people want to send me a patch with extra functionality they are able to get the latest stuff that I’m working on.

My testing files are located within this repository; the actual tests are defined an run in an external module from where the data structures and functions are defined. I can use it to test my code; if anyone else wants to test it they are able to grab it and do so. However, I am not going to include the testing module[s] within any releases of the library.

I believe that for many cases, this would be a much better approach on how to develop and distribute test suites. At the very least, if you are unable to extricate the tests from within your projects source files, using a pre-processor to remove them from the distributed tarballs might be a valid approach (I have no idea how hard/easy this would do though). This way, people who want to run the test suite can, and other people who trust the developers (like me for the most part) can do so without having to install dependencies that are for the most part useless.

I was planning on posting semi-regularly here, but I’ve been procrastinating way too much. It’s not that I don’t have anything to write about, it’s just that by the time I get on the computer I can usually find better things to do 😉

Anyway, this is kind of a summary update about what I’ve been working on and what I’m going to be doing.

The Graph Trifecta

I’m responsible for three graph-related Haskell packages on HackageDB: graphviz, Graphalyze and SourceGraph. These three packages are interrelated: graphviz is used by Graphalyze which is in turn used by SourceGraph (which also uses graphviz).

graphviz

The graphviz provides bindings to the Graphviz (note the differences in capitalisation) suite of tools for visualising graphs. Well, OK, I’m not actually binding the actual C library, just generating the appropriate representation into Graphviz’s Dot code and calling the appropriate tool, but close enough. I’m hoping that it will eventually become the definitive way of using Graphviz in Haskell (as opposed to all the other packages that either provide “bindings” to Graphviz or do so internally). It does have some limitations, however:

Only supports conversion to/from FGL graphs for now, as there’s (as-yet) no standard way of considering an arbitrary graph data structure.

Does not support arbitrary placement of Dot statements but follows a specific order; I did ask on the Haskell mailing lists if anyone wanted this feature but was told resoundly “no”; I might add it as an option in a future release however.

It seems every release of graphviz introduces an API change. Well, OK, not all; there are bug-fix releases, etc., and I use the Haskell Package Versioning Policy (PVP) so releases that do break the API are obvious. Also, after the 2999.5.0.0 release, I’m trying to keep the API changes to a minimum whilst still improving the library. After all, it’s all part of the process of making A Better Library (TM) for everyone 😉

What I’m working on at the moment for graphviz is improved support for interaction with the actual Graphviz tools in how you call them and how you choose which format they’re created in. Rather than just indicating if the operation was successful or not and printing any error messages to stderror, it’s now using the Either datatype to return the actual error if there is one. The code isn’t perfect (a full disk exception will currently freeze it, etc.) but it’s getting there. I’ve also improved the image output support by allowing the programmer to use any (non-deprecated) image type that Graphviz officially supports and not just those that I happened to have had available on my system when the first version was written (note that there might be other types as well that aren’t listed in the official documentation; my install has support for a gv output type and I have no idea what it is); there is also the ability for the library to automatically add the correct file extension to the output filename. Canvas (i.e. make a window with the graph visualisation rather than create a file) output types are now also treated separately.

I’m also working on adding QuickCheck support to ensure that parse . print == id for all supported types. Note that it isn’t valid to consider this the other way round: graphviz doesn’t support arbitrary placement of statements in the Dot code, and it parses more liberally than it prints. This should hopefully find any lingering bugs I’ve got lying around in the parsing and printing side of things. The next step would then be to ensure that any generated Dot code is then able to be processed by Graphviz using the Dot (why does Graphviz have to call so many things “Dot”?) output type to add position information, etc. and have that fully parseable.

Also, documentation is gradually being improved; I’m never intending that users will be able to completely ignore the official Graphviz documentation; however, I’m trying to make it as obvious as possible the differences of opinions/terminology/etc. between Graphviz and what graphviz supports, as well as providing broad overviews of the different options and giving links to the appropriate upstream documentation.

Graphalyze

Graphalyze was the main focus of my mathematics honours thesis last year. It provides a library to use graph-theory to analyse the relationships in discrete data sets. Nowadays, it mainly serves to do the heavy lifting for SourceGraph.

Work on Graphalyze is mainly to add extra graph algorithms that I want for SourceGraph. I’m also wanting to eventually replace the current reporting framework with a pretty-printing based one; possibly removing the pseudo-option of using something apart from Pandoc and just making it the default; this will definitely necessitate a licensing change for most of the library however (which I have no problems with, and it doesn’t appear to be used by anyone else…).

SourceGraph

First of all, I was urged (by people who shall remain nameless unless they say its OK) to submit a paper based upon SourceGraph to Partial Evaluation and Program Manipulation 2010 (PEPM) which is going to be held in Madrid next January. I didn’t expect to get it expected (to be honest, there’s nothing that I think is that interesting/new in SourceGraph for the moment except for possibly the overall concept), but it gave me a figurative kick up the proverbial to actually update SourceGraph (which was my first ever real released Haskell program… 😮 ) and I thought that the actual process of writing the paper would be good practice. It turns out however that I waschosen, which resulted in a mad rush to update the paper (many thanks to quicksilver on #haskell for fixing up my Haskell terminology). So I now find myself preparing for my first ever real academic conference, let alone the first one that I’ll be presenting at! What makes the situation even more interesting is that I’m currently not a student (more on that later) and so the financial aspect will be interesting… I’ll be putting a version of my paper up here as soon as I fully understand the hoops I have to jump through to do so.

Whilst SourceGraph was just a sample application of Graphalyze in my honours thesis, it was why I’d originally thought up the topic; my supervisor convinced me to make the application aspects more general by splitting out the graph-theory side of things into a library and having several sample usages, of which I only ever ended up doing the one (mainly due to time constraints). It is now the driving force behind graphviz and Graphalyze (well, OK, users sending me bug reports/feature requests and my own sense of pride do help with graphviz; the latter was what prompted the re-writing of the printing/parsing layout whilst I was meant to be on holidays overseas visiting family…) and the focus (transitively or otherwise) of most of my Haskell hacking at the moment.

Anyway, I’m currently trying to improve the quality of the output produced by SourceGraph in terms of usage (by adding command line parameters) and in what kind of analyses it produces. I have kind of shot myself in the foot, however, as I’ve already updated it to use the as-yet-unreleased 1.8 series of Cabal, so I can’t release it (not that it’s anywhere close to being in a releasable state) until Cabal-1.8 is available…

Haskell in Gentoo

I’m going to do a proper post over on the official Gentoo-Haskell blog later on about most of this, but here’s a brief rundown on what I’ve been doing (along with kolmodin, trofi, etc.):

I’m still not an official dev, as I don’t have time/can’t be bothered (pick whichever you think is more accurate :p ) to finish off the dev quiz; I still help out on the IRC channel, answer (but not close) bugs on bugzilla, etc.

I’ve been doing some “spring-cleaning” in the overlay to try and remove old, unused packages (and versions of packages). I’m defining “unused” as “the version in the overlay is out of date, no-one has complained about it being out of date and I can’t be bothered checking through all the dependencies, etc.” 😉

Trying to get GHC 6.12 RC1 to work; whilst I’ve managed to build an x86 binary bootstrap package, I’m not able to actually use it to bootstrap, and if I try and install that with USE="binary", then it appears our hack for not installing haddock with GHC doesn’t seem to work anymore :s

haskell-updater has been updated to work with GHC 6.12/Cabal 1.8 \o/ . Of course, I can’t actually test this out fully as I’m not able to install 6.12 here… The underlying code has also been rewritten to make it easier to add extra options down the track.

University

After I finished my maths honours last year, I had been hoping to study at the University of Edinburgh this year (starting in September, which would have been convenient for going to ICFP), but whilst I was accepted I was unable to obtain any funding 😦 As such, I’ve been either working at the University of Queensland (doing some casual sysadmin-ing as well as developing material for and helping to run the new SCIE1000 subject) or working on a few odds and ends at home (that is, the projects mentioned above) all this year.

Because my dreams of studying overseas have been dashed, I’ve now applied to study at Australian National University in Canberra under the supervision of Professor Brendan McKay. Whilst I haven’t been officially accepted, I’ve been told that I’m guaranteed a spot (and at least some funding) due to my academic results for honours last year. I’ll be working on combinatorial algorithms, so nothing officially Haskelly in nature (but I’m intending on doing most – if not all – of my coding in my favourite programming language).

PEPM

I mentioned previously that I’d be going to PEPM next year… will anyone else from the Haskell community be there (or at any of the other conferences associated with POPL)? I’ve as yet only had the pleasure of meeting one person (dibblego), and it’d be great if I could put some faces to some of the other names I see in #haskell or on the mailing lists…

That’s all folks

In case anyone cares, this is what I’ve been up to and am planning on doing for the most part. This blog post on the main was to try and get me writing here more often, as well as an attempt to stifle Joe Fredette’s continual complaints of lack of Haskell Weekly News material because I haven’t released anything recently 😉 So Joe, go ahead and write about this! :p

As far as I know, no-one has as yet reported that Real World Haskell has hit the shores of Australia. Well, I’d like to rectify that situation now that I just received my copy (which whilst later than people in the Northern Hemisphere and the Americas have reported, is a day earlier than the earliest I was told to expect it by Fishpond)!