July 16, 2009

I recently downloaded the package manifests from Haskell's hackage (a global repository for random people to post useful Haskell code) database and started performing some interesting analysis on them. My goal was to check out various easy statistics. All packages in this database are nicely indexed with a file describing all the dependencies, version information, etc. That manifest file (for package foo, it's foo.cabal) enables cabal (the package manager system Haskell / hackage uses) to interact with the main hackage database and ghc. However, there is all sorts of interesting information to visualize and toy with. For a sample of what one looks like check the manifest for the tar package tar.cabal. (This package allow

The whole blob of these database headers is stored in a .tar.gz (a tar gzip'd). So first, I wget'd the archive so it was stored locally.

% wget http://hackage.haskell.org/packages/archive/00-index.tar.gz

I could have manually decompressed (with tar) the manifests first, but they are just fat text and I had a more interesting idea: hackage has nice packages for handling tar and gzip'd data. I decided to use those instead.

% cabal install tar...% cabal install zlib...

Now at the header we use import all these libraries (to illustrate some).

The entries value holds a Tar.Entries, a linked list of tar entries. I define the (>|=) operator as just a ghetto use of fmap(or liftM) to make the transformations look like more of a pipeline. Now tar entries are basically a ByteString of the file and path info. Those byte buffers are the raw text of the manifest files (the .cabal files).

Now all we need is is a way to turn those byte strings into abstract syntax. At any rate, we left that worry to my ill-named mapEntries function above and we fill in the details here.

For a bit I considered actually doing some raw work, by parsing and representing the syntax of these cabal files (actually quite a bit of work). But I recalled seeing some packages when perusing the hierarchical libraries. Turns out the packages the cabal tool uses are nice and modular and are even part of ghc's default hierarchical libraries. So it was as easy as:

Then all we need is to call parsePackageDescription passing a String of the file as an argument. There are better ways to convert a ByteString to a String, I just assumed it was ASCII and used the unpack function mapping the Word8's to Char's via(Data.Char.chr . fromIntegeral)(I think one is supposed to import a different ByteString package, but this works fine for me.) Now, parsePackageDescription gives us its super complicated syntax representation as GenericPackageDescription data values. After scanning all the options and depth of this syntax tree, I am really glad I didn't mess with it directly. Next, I group packages in a tree, removing older versions (each hackage component might have several versions, I only want the newest).

The type of scanArchive is:

scanArchive :: FilePath -> IO [GenericPackageDescription]

Lazy Evaluation RantParsing the 12m archive with 14oo+ packages takes ghci around 350m in memory. I have no idea where the bloat is, I am using Data.Map.adjust to add things to the update packages.Even after the parse an culling of duplicate packages, for some reason ghci holds onto all 350m. (Even after I force evaluation of the stupid thing via printing.)

Forget Control.Strategies.Parallel.rnftoo, the big honking GenericPackageDescription syntax tree of data types is not an instance, and I am not going to derive all of them myself. If I force the map holding the modules as I add them to partially reevaluate each iteration, I can get it down to about 190m with (Control.Exception.evaluate (m /= m')) where m and m' are the old and new map. Using (length . show) on the map takes much longer, but prunes the memory use to 150m. Even better, if I use the in-equality approach every 10 packages or so, it takes 200m and completes in less than 20 seconds. I'll take it.

What makes me furious about this is that is not that I don't have 350m lying around, it is that I cannot easily reason about my ridiculously simple program. That is scary to me and so galling that it quickly drives me int a bloodlust.

Now I have a list of all the packages. Since I don't want to wait 20 seconds per query of our database, we use the ``global'' value trick.

db = unsafePerformIO $ scanArchive "00-index.tar.gz"

Now db only gets evaluated once, but holds the entire list of packages.

The Fruits Of Our LaborA simple function to query our ``database'' creates histograms of information. It takes the database, a display function and a function to project with.

The first arguments are self as described above. The projection function ((.>) is flip (.)) selects all the category attributes of the package descriptions. Since this is already a String, we use id, show would add quotes. The listing shows that 128 packages are categorized as being related to data. The second most common category is system related packages.

StabilityAs a second example, we project on package stability. The lowercase allows us to be case insensitive (e.g. some people list Alpha and others alpha).

(Note: that the 703 at the top is not a total, 703 entries list the empty string as their stability level.)

Now, I love hackage. In a couple of hours I cobbled together a really cool script mainly using hackage-based code. However, about, ``all the free code that is already out there for us to use,'' what do we make of its stability? I would say provisional and stable are the good ones that is 140 of like 1400 packages. So caveat emptor and choose carefully.

That said, if 2/3 of software cost is maintenance; nobody wants to service a bunch of whiny users if they don't have to. It's the first 1/3 that is fun I suppose. I know if I uploaded something to hackage, I wouldn't want to babysit it once I got bored with it. But this is kind of a nice feature of hackage: post the code, if someone wants to use it, let them fix it.

How many packages have a different author than maintainer? Does this imply a such a change of hands as above?