As you can see, there's a lot of information here. The bit I'm interested in is the 'Depends' line. This lists all the packages that are required in order for this package to work correctly. When you install autopilot, your package manager will install all its dependencies, and the dependencies of all those packages, and so on. This is (in my opinion) the best feature of a modern Linux distribution, compared with Windows.
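To make the 'Depends' line a bit more concrete, here is a minimal sketch of how such a field could be split into package names. It is plain Python, not the script from this post, and the example string is made up for illustration. It assumes the usual control-file syntax: commas separate requirements, '|' separates alternatives, and version constraints sit in parentheses.

```python
import re

def parse_depends(field):
    """Split a 'Depends' value into groups of alternative package names."""
    groups = []
    for clause in field.split(","):
        alternatives = []
        for alt in clause.split("|"):
            # Drop version constraints such as "(>= 2.14)" and
            # architecture qualifiers such as ":any".
            name = re.sub(r"\(.*?\)", "", alt).strip().split(":")[0]
            if name:
                alternatives.append(name)
        if alternatives:
            groups.append(alternatives)
    return groups

# Hypothetical field value, not taken from a real package record.
example = "libc6 (>= 2.14), python-gi | python-gobject, python:any (>= 2.7)"
print(parse_depends(example))
```

Each inner list holds the alternatives for one requirement, so a graph builder could either add an edge to every alternative or only to the first one.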

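Packages and their dependencies form a directed graph: each package is a node, and each 'Depends' entry is an edge pointing from a package to something it needs. My goal is to make pretty pictures of this graph to see if I can learn anything useful.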
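As a rough illustration of what "directed graph" means here, the sketch below stores a toy dependency graph as a dictionary and walks it to find everything a package pulls in, directly or indirectly. The package names and the `transitive_dependencies` helper are invented for this example and have nothing to do with real apt data.

```python
from collections import deque

# Hypothetical toy data: package -> set of packages it depends on.
depends = {
    "app": {"libfoo", "python"},
    "libfoo": {"libc6"},
    "python": {"libc6"},
    "libc6": set(),
}

def transitive_dependencies(graph, package):
    """Return every package reachable from `package` along Depends edges."""
    seen = set()
    queue = deque([package])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, ()):
            if dep not in seen:  # also protects against dependency cycles
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(transitive_dependencies(depends, "app")))
```

This breadth-first walk is roughly what "the dependencies of all those packages" amounts to when a package manager resolves an install, minus version checks and alternatives.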

Process: The Python

First, I wanted to extract the data from the apt package manager and create a graph data structure I could fiddle with. Using the excellent graph-tool library, I came up with this horrible, horrible piece of Python code:

Yes, I realise this is terrible code. However, I also wrote it in about 10 minutes, and I'm not planning on using it for anything serious - this is an experiment!

Running this script gives me a 2.6MB .gml file (it also takes about half an hour - did I mention that the code is terrible?). I can then import this file into Gephi, run a layout algorithm over it for the best part of an hour (during which time my laptop starts sounding a lot like a vacuum cleaner), and start making pretty pictures!
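For anyone wondering what ends up in a .gml file, here is a small sketch that writes one by hand from a list of edges. This is a plain-Python illustration under my own assumptions, not the graph-tool export the script relies on; the `to_gml` helper and the sample package names are hypothetical.

```python
def to_gml(edges):
    """Render (source, target) package pairs as a GML document string."""
    names = sorted({name for edge in edges for name in edge})
    ids = {name: i for i, name in enumerate(names)}
    lines = ["graph [", "  directed 1"]
    for name in names:
        lines.append(f'  node [ id {ids[name]} label "{name}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {ids[source]} target {ids[target]} ]")
    lines.append("]")
    return "\n".join(lines) + "\n"

# Hypothetical edges: (package, package it depends on).
gml_text = to_gml([("app", "libc6"), ("app", "libfoo"), ("libfoo", "libc6")])
print(gml_text)
```

The format is just nested key/value blocks: one `node` entry per package and one `edge` entry per dependency.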

The Pretties:

Without further ado - here's the first rendering. This is the entire graph. The node colouring indicates the node degree (the number of edges connected to the node) - blue is low, red is high. Edges are coloured according to their target node.
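If "node degree" is unfamiliar, the sketch below counts it for a toy edge list and maps each count onto a blue-to-red colour. The linear colour ramp is my own assumption for illustration and is not necessarily how Gephi computes its gradient; the edge data and the `colour` helper are hypothetical.

```python
from collections import Counter

# Hypothetical edges: (package, package it depends on).
edges = [("app", "libfoo"), ("app", "libc6"), ("libfoo", "libc6"), ("tool", "libc6")]

# Degree counts every edge touching a node, incoming or outgoing.
degree = Counter()
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

def colour(value, lowest, highest):
    """Map a degree onto an (R, G, B) tuple: blue for low, red for high."""
    if highest == lowest:
        return (0, 0, 255)
    t = (value - lowest) / (highest - lowest)
    return (round(255 * t), 0, round(255 * (1 - t)))

lowest, highest = min(degree.values()), max(degree.values())
for name, value in degree.most_common():
    print(name, value, colour(value, lowest, highest))
```

In this toy graph the node that everything depends on has the highest degree and comes out pure red.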

These images are all rendered small enough to fit on the web page. Click on them to get the full image.

A few things are fairly interesting about this graph. First, there's a definite central node, surrounded by a community of other packages. This isn't that surprising - most things (everything?) rely on the standard C library eventually.

The graph has several other distinct communities as well. I've produced a number of images below that show the various communities, along with a short comment.

C++

These two large nodes are libgcc1 (top) and libstdc++ (bottom). As we'll see soon, the bottom-right corner of the graph is dominated by C++ projects.

Qt and KDE

This entire island of nodes is made up of the Qt and KDE libraries. The higher nodes are the Qt libraries (QtCore, QtGui, QtXml etc), and the nodes lower down are KDE libraries (kcalcore4, akonadi, kmime4 etc).

Python

The two large nodes here are 'python' and 'python2.7'. Interestingly, 'python3' is a much smaller community, just above the main python group.

System

Just below the python community there's a large, loosely-connected network of system tools. Notable members of this community include the Linux kernel packages, upstart, netbase, adduser, and many others.

Gnome

This is GNOME. At its core is 'libglib', and it expands out to libgtk, libgdk-pixbuf (along with many other libraries), and from there to various applications that use these libraries (gnome-settings-daemon for example).

Mono

At the very top of the graph, off on an island by themselves, are the Mono packages.

Others

The wonderful thing about this graph is that the neighbourhoods are fractal. I've outlined several of the large ones, but looking closer reveals small clusters of related packages. For example, here are the multimedia packages:

This is Just the Beginning...

This is an amazing dataset, and really this is just the beginning. There are a number of things I want to look into, including:

Adding 'Recommends' and 'Suggests' links between packages - with a lower edge weight than Depends.

Colour coding nodes according to which repository section the package can be found in.

Trying to categorise libraries vs applications - do applications end up clustered like libraries do?
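The first idea in that list could look something like the sketch below: every relation type gets a weight, and an edge keeps the strongest weight seen for it. The specific numbers (1.0, 0.5, 0.25), the `weighted_edges` helper and the sample record are all placeholders I made up; the right weights would need some experimenting with the layout.

```python
# Hypothetical weights: Depends pulls hardest, Suggests barely at all.
WEIGHTS = {"Depends": 1.0, "Recommends": 0.5, "Suggests": 0.25}

def weighted_edges(package, fields):
    """Return {(package, target): weight} for one package record."""
    edges = {}
    for relation, weight in WEIGHTS.items():
        for target in fields.get(relation, []):
            key = (package, target)
            # If a target shows up under several relations, keep the strongest.
            edges[key] = max(weight, edges.get(key, 0.0))
    return edges

# Hypothetical record, already parsed into lists of package names.
record = {
    "Depends": ["libc6", "python"],
    "Recommends": ["python-doc"],
    "Suggests": ["python", "gephi"],
}
print(weighted_edges("app", record))
```

A force-directed layout that honours edge weights would then pull hard dependencies close together while letting suggested packages drift further away.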

I'm open to suggestions, however - what do you think I should look into next?

I recall getting very similar graphs from apt-rdepends for debugging purposes. The script was as simple as "apt-rdepends --dotty | springgraph" with some parameters. But your solution is more elaborate and admittedly fancier!

First, thanks for the feedback and words of encouragement. This is the first time I've done any sort of visualisation, so I'm still learning how to do this stuff.

I promise I'll do a follow-up post some time soon (hopefully this week), and release both the .gml and .gephi files, so you can play around with the data as well.

Finally, I'd love to produce Andrew's suggestion - an interactive visualisation of the graph. I can produce a 12MB SVG image, but most browsers choke when I try to load it. So I'm looking for a JS library to selectively load & display parts of a large SVG file. Any pointers most welcome :)