30 December 2010

Graphs and Labels

As some of you may be aware, I’ve been working on and off on a new library to define what graphs are in Haskell. This is the first part of a series on some of the thought processes involved in trying to define classes that fit the vast majority of graphs.

One of the first things I’ve been considering how to deal with in the new graph classes that I’m working on is how to deal with node and edge labels in graphs. My point of view is that graphs contain two separate but related types of information:

The structure of the graph.

The information explaining what the structure means.

As an example, consider graph colouring: we have the actual structure of the graph and then the colours attached to individual vertices (or edges, depending how you’re doing the colouring). Another example is a flow graph, where the distances/weights are not an actual “physical” part of the graph structure yet nevertheless form an important part of the overall graph.

Yet there are times when the extra labelling/information is an inherent part of the structure: either we are concerning ourselves solely with some graph structural problem (e.g. connected components) or – more commonly when programming – the information about the structure is embedded within the structure (for example, Cabal’s PackageIndex type: this is simplistically equivalent to an unlabelled graph with PackageIndexID as the node type).

As such, I’ve come up with at least three different ways of dealing with graph labels:

A graph can choose whether or not it has node or edge labels (if I understand correctly, this is the approach taken by the Boost Graph Library for C++).

A graph either has no labels or it has both node and edge labels.

All graphs must have both node and edge labels (even if they’re just implicit labels of type ()).

Something along the lines of the first two options is very tempting: there is no requirement to force graphs that don’t have or need labels to pretend to have them just to fit the constraints of some class. Furthermore, different graph types can thus be more specific in terms of which graph classes they are instances of.

However, there is a problem here: duplication. Let us consider a simplified set of graph classes that fit the second criteria:

So if some graph type wants to be an instance of LabelledGraph, it must specify two ways of getting all of the nodes available (admittedly, it will probably have something along the lines of nodes = map fst labNodes, but wouldn’t it be nice if this could be done automatically?).

But OK, writing a set of classes and then instances for those classes is a one-off cost. Let’s say we accept that cost: the problems don’t stop there. First of all, consider something as simple as adding a node to the graph. There is no way (in general) that the two classifications (labelled and unlabelled) can share in the slightest a method to add a node, etc. Furthermore, this segregation would spread to other aspects of using a graph: almost all algorithms/functions on graphs would thus need to be duplicated (if possible). Since one of the main criteria I have for designing this library is that it should be possible to use graphviz to visualise the PackageIndex type, this kind of split is not something I think would be beneficial.

As such, the only real viable choice is to enforce usage of labels for all graphs. This might be to the detriment of graphs without labels, but I’m planning on adding various functions that let you ignore labels (e.g. a variant of addNode that uses mempty for the graph label, which means it’s usable by graphs that have () as the label type). The distinction between nodes and labNodes above could also be made automatic, with only the latter being a class method and the former being a top-level function.

This solution isn’t perfect: to ensure it works for all suitable graph types, it has to be kind *. But this means that no Functor-like mapping ability will be available, at least without really ugly type signatures (which the current experimental definition uses) at least until superclass constraints become available (or possibly some kind of kind polymorphism, no pun intended). However, this is still the best available solution that I can come up with at this stage.

There is a major problem with this kind of system: each edge stores the labels of the nodes, leading to redundancy, etc. problems. For example, if you want to make an edge f -> t, do you really want to have to bother looking up the labels for those nodes just so you can create the edge, when ideally you just want to use the node IDs?

Note also that technically, both my sample cut-down class and yours won’t work due to how edges are handled (see the graph example in Fun with type functions for why). For yours to work, you would have to use an actual data family, whereas I’ve settled on using data Edge g = Edge { fromNode :: Node g; toNode :: Node g } which doesn’t lend itself to possibly allowing labels.

The node IDs represent the structure. We have to use a distinct type because in the general case the labels will have repeats.

However, if we make the overall node type part of the label, then we get duplication: say we have two nodes, (1,'a') and (2,'b'), where the first part of the tuple is the node ID and the second is the label. Using your proposed scheme, if I want to make an edge between them then I’d have to say something like: addEdge g (1,'a') (2,'b') "a to b".

So what happens here if I somehow accidentally instead did addEdge g (1,'c') (2,'b') "a to b" (i.e. the label for the first node is wrong)? Should the method accept it or fail? And whilst it might be more of an implementation rather than an interface/representation detail, but does this mean that the node label gets stored for every edge? Either way, this will result in more bookkeeping and more possible ways for things to go wrong.

Even if we can live with these limitations, I’m not a big fan of this style because rather than just using a tuple or some such, there is no standard way of differentiating between the ID part and the label part of the node and edge types. Instead, any time that you do want to deal with that overall node type, you have to customise how you do so depending on the instance involved.

Instead, I’m wanting to push towards making the ID type completely abstract. To see what I mean, see my current draft of planar-graph: users of the library cannot construct node IDs directly, and instead must “request” new such IDs by specifying what labels they want. If they are using a label-less graph, then defaults exist that use mempty for the label value.

What happens when you insert a node with the same ID but a different label depends on the Eq instance of your Node type. The Node type might not even include an ID and only have a label. Having ID and label separately instead of together in one Node type is what I see as more bookkeeping and more ways for things to go wrong.

You don’t even need a way to differentiate between the ID part and the label part. Node identity is per Eq (or Ord) instance. And the label is extracted with the nodeLabel function in the LabelledNodesGraph class.

As there doesn’t have to be a specific ID part, making it completely abstract is then no problem either, that happens almost automatically.

In just about all cases, you do need an ID part, unless you can guarantee that your node labels are all unique (for edge labels this doesn’t really matter, unless you want a simple graph or some such). In general, this isn’t true.

Don’t get me wrong, this type of definition is very tempting. However, I think it involves too much work/compromise to be able to “scale”. Using graphviz for another example, specifically the GraphvizParams type. With the “labels are compulsory” specification, using it remains almost the same as it currently is. What’s more, you can then also change the underlying graph type and – as long as the label types remain the same – you can use the pre-existing GraphvizParams, etc. to visualise the graph.

However, if we go with your suggested layout, then every time you change your graph type, the transformation functions in the GraphvizParams change since there is no known structural layout that you can use: you can’t even tell if there _is_ a node or edge label! As such, functions such as these will become less generic and require more per-instance customisation because you can’t tell whether or not there is a label.

Another advantage of making usage of labels explicit and compulsory: sometimes you may be wishing to compact/collapse/etc. several nodes into one. If the presence of labels (and their types) are explicit, you can specify the requirement of a function that tells you how to combine them to form the new label. If you don’t know if there are labels or not, then creating a new node is much more difficult (another advantage of making the IDs abstract and distinct from the labels).

When you need labels, your type just gets the LabelledNodesGraph class constraint. The class can always be implemented with const (). The class also makes sure you can change the node and graph representation without having to reimplement anything else.

If you combine nodes into one node, there could be more attributes other than the label that need combining, so it would make much more sense to require a function for combining complete Nodes.

I have to agree with vivian, a graph library should not know anything about labels. (Your arguments all similarly apply to node color for example.)

I’m not sure how your Label class is meant to work, but when I did try implementing such a set of classes I very quickly came up to the problems I described earlier: any time I want to consider adding a node, I now have to determine if the graph I’m dealing with requires labels or not. If I want to visualise the graph or do something with all the nodes, etc. I have to determine if it requires labels or not. Very quickly there is a huge amount of duplication present.

If an algorithm has no need for labels, then it doesn’t have to use them. I’m concerned more with trying to find the best compromise definition of what a graph-like data structure is that will work for the vast majority of cases (e.g. one of the projects I’m currently working on is a planar graph type that stores the embeddings on the plane/surface; it will be possible to make it an instance of the various graph classes, but in some cases – notably adding edges – it will not be recommended).

By forcing the requirement of labels, all it will mean in practice is that if the graph type does not in fact understand the notion of labels then the person writing the instance will have to bother explicitly adding (and ignoring) a lot of () values everywhere.

If you can provide an outline at least of classes like you recommend such that – for example – I can quite easily adapt graphviz from using FGL-only graphs to allowing any instance of these new-style classes without requiring duplication (for labelled and unlabelled graphs), I would be most interested. But I’ve been playing with how to do this with Thomas Bereknyei for a while now, and neither of us have been able to get anything like this working.

From a frequent user of FGL: Ivan, thanks for putting the effort and thought into this, I can’t wait to see the first versions of the new graph libraries/classes.

From my experience with FGL I’ve always needed labels on my graphs, and in my opinion, even if you were only interested in the properties of the graph itself, the internal representation of the graph nodes (Ints with FGL) still essentially represent labels.

Additionally, my main frustration with FGL is the need to constantly handle the mapping between the Node (Int) and the labels and the associated reverse mapping. Having the label BE the node representation would simplify things greatly to my inexpert eye.

Well, in general you can’t make the nodes be the labels, since in a lot of cases the labels will not be unique. However, with this new graph classes we’re writing you will be able to define a graph type with any type you want for the node type.

I just free hand wrote that, so i’m sure there are issues with it, but it seems that this labeled/unlabeled split causes so many problems. How about we try to separate that concept from the concept of a graph? A graph being just the structure. Then we can add labels to it.

Welcome!

Greetings and salutations be upon thee to my little corner of what we like to charmingly call "the internet". This is where I put forth cunning arguments and repartee on such topics as Haskell, Gentoo, university... just don't ask me if I have a social life :s