Hackage GSoC status for the 3/7ths mark

A week and a half ago I talked about how the new Hackage server is internally structured. What I’d like to communicate now, to commemorate the Google Summer of Code coding period being exactly 3/7ths of the way through (36 days down, 48 to go), is more of a status report than anything else. (Granted, 3/7ths marks would be more significant if we used base 7 numbers.) I spent last week on a feature-implementing spree. With an emphasis on “spree”, and none on doing-anything-else-or-even-leaving-the-house. So I spent the weekend regenerating, and I think it’s time I reflected on how I’ve been doing.

The original schedule

When I applied to do the Hackage project for Summer of Code, I included a tentative schedule. I have not strictly followed it so far, though I didn’t quite expect to. Here’s why.

1. 2 weeks. Become familiar with codebase and add documentation to declarations as I understand them. Find functionality not in the old server and not covered by the coming weeks and fully port it. Do the same for items in the hackage-server TODO list.

I didn’t anticipate all of the restructuring that needed to be done, thinking I could mostly append rather than modify. Well, I have substantially altered the already-great codebase into a modular form I’m pretty happy with, but it nonetheless takes a long time to do so when you’re starting with a 10,000-line codebase developed over 2 years (now it’s around 12,000 lines). The old server is mostly fully ported, although it wasn’t done within the space of these two weeks.

2. 1.5 weeks. Get build reports to display and gather useful information: already partially implemented. Use this feature as an opportunity to become even more comfortable refactoring and enhancing the hackage-server source.

Non-anonymous build reports are essentially complete. Anonymous ones are a bundle of privacy pitfalls, so we’ll have them as a separate feature, using a variant on the data structure currently used to house per-package reports. The idea is to publish them to everyone but do so in a way that mostly eliminates identification or cross-referencing. More on this below.

3. 1.5 weeks. Get user accounts and settings working, writing a system for web forms, both the dynamic JavaScript kind and static kind. Use this system to get package configurating [sic] settings editable by both package maintainers and Hackage administrators.

I’ve written precious little HTML and no JavaScript, instead using curl to prod the server and setting up an Arch VM to ensure compliance with the current (soon to be old?) cabal-install. User accounts, digest authentication, and user groups — essentially access control lists — are all here. Most of this information is served in text/plain at the moment. Given that the new server will probably require a redesign by more design-minded Haskellers, I’d rather keep everything minimalistic for the time being. As I mentioned last post, I think the server architecture has a good separation of model and view.

4. 1 week. If a viable solution for changelogs comes up by this point, I’ll implement it here. This might be as simple as a ./changelog file with a simple prescribed format.

That’s this week! At least 100 packages on Hackage already have changelogs. Of those, about two dozen are named changelog.md (they use markdown fix/feature structure, which git uses). The rest have whichever format the author chose, and these formats are all over the place. Some use darcs changes, which is too fine-grained for Hackage. All this is too non-uniform for an automatic uniform interface. One approach that I can probably code up in a day or two is to have a changelog editable on Hackage. It could be entered on upload and possibly edited afterwards by maintainers. Otherwise, I’ll leave this one until “a viable solution for changelogs comes up”.

What’s been done

All of the features I listed in the last blog post have been implemented, although not all of them are exposed through HTML. Brief descriptions of them are there. The most interesting one, which has also proven the most challenging, is the candidate packages feature, which is an enhanced version of the check-pkg CGI script. Here’s what you can do with it.

/packages/candidates/: see the list of candidate packages. POST here to upload a candidate package; candidates for existing packages can only be added by maintainers.

/package/{package}/candidate: see a preview package page for a candidate tarball, with any warnings or errors that would prevent putting it in the main index

/package/{package}/candidate/publish: POST here to put the package in the main index. It has to be a later version than the latest one currently existing under that name, and only maintainers can do this. If no package exists under the name, these restrictions don’t apply.

/package/{package}/candidate/{cabal}: get the cabal file for this package

/package/{package}/candidate/{tarball}: get the tarball
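The publish rule described above can be sketched as a small pure function. This is only an illustration under assumptions: the names PublishError and canPublish are hypothetical and not taken from the hackage-server source, and the real check presumably looks at more than a maintainer flag and a version list.

```haskell
import Data.Version (Version, makeVersion)

-- Hypothetical error type; the real server's checks are richer.
data PublishError
  = NotMaintainer
  | VersionNotNewer Version   -- carries the latest version already published
  deriving (Eq, Show)

-- | Decide whether a candidate may go into the main index.
--   'existing' lists the versions already published under the package name.
canPublish :: Bool -> [Version] -> Version -> Either PublishError ()
canPublish isMaintainer existing candidate
  | null existing       = Right ()                      -- new name: no restrictions
  | not isMaintainer    = Left NotMaintainer
  | candidate <= latest = Left (VersionNotNewer latest)
  | otherwise           = Right ()
  where
    latest = maximum existing   -- only forced once 'existing' is non-empty

-- A maintainer publishing 1.1 over an existing 1.0 is accepted.
exampleOk :: Either PublishError ()
exampleOk = canPublish True [makeVersion [1, 0]] (makeVersion [1, 1])
```

Returning an Either rather than a Bool keeps the reason for a refusal available, which is what a preview page listing warnings and errors would need.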

In the immediate future

I’d like to get the newer server ready for running on sparky by the end of the week. It doesn’t yet look very different from the current Hackage in terms of what web browsers can access.

Currently there are four ways to start up the server. The first is to initialize it on a blank slate and go from there with hackage-server --initialise. Second, you can start it normally with an existing dataset stored by happstack-state, just hackage-server. Otherwise, you can import from an existing source. You can import almost everything from the old Hackage server, as I described in my first post. Alternatively, you can initialize it from a single backup tarball produced by the server.

I’d like to revamp the interface to make it easier to deploy. Instead of importing directly from old sources, there’s going to be an auxiliary mode to convert legacy data into a newer backup tarball. Then, the new tarball can be imported directly. I haven’t had any backup tarballs on hand to test the newer import/export system, though it compiles. This is next on the todo list.

Some features that I’d like to get done soon are uploading documentation and implementing deprecation. Deprecated packages might still be needed as dependencies, so they’re kept around and will probably go in the index tarball, but they won’t be highly visible on any of the HTML pages. Currently documentation will be implemented by uploading tarballs. This is compatible with the current solution, which is to have a dedicated build client. It would be easier to have users upload their own docs, and not have to deal with the build client not being able to do so. This would be simple if .haddock files provided everything necessary for generating HTML docs and linking them with hscolour pages. I’m not sure if this is the case. Holding onto .haddock files also makes documentation statistics a lot easier. For now, documentation tarball upload is the route I’m taking.

Another nice feature would be serving directly from package tarballs, preferably without having to store them in memory or unpack them on the server filesystem. Like the documentation feature, it would use a data structure defined in the hackage-server source: a TarIndexMap. Given a file path, it can efficiently give you the byte offset of the tar entry where that file is stored, and from that retrieve the file directly. There are some downsides here. First, package tarballs are not .tar but .tar.gz, so this might more than double the amount of storage required, which unpacking would do anyway. Second, the TarIndexMap of every single package tarball would be kept in memory, although this uses an efficient trie structure, so it’s not so bad.
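To make the path-to-offset idea concrete, here is a minimal sketch. It is not the TarIndexMap from the hackage-server source: it assumes a plain Data.Map instead of a trie, an uncompressed archive, and the simplest tar layout in which every file is one 512-byte header followed by its data padded to a multiple of 512 bytes (no long-name extension headers). The names Entry, buildIndex and lookupEntry are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- | Where a file's contents live inside an uncompressed tarball.
data Entry = Entry { dataOffset :: Integer, dataSize :: Integer }
  deriving (Eq, Show)

-- Tar archives are organised in 512-byte blocks.
blockSize :: Integer
blockSize = 512

-- | Build an index from (path, size) pairs given in archive order.
buildIndex :: [(FilePath, Integer)] -> Map.Map FilePath Entry
buildIndex = go 0 Map.empty
  where
    go _ acc [] = acc
    go off acc ((path, size) : rest) =
      let start  = off + blockSize                       -- skip this entry's header
          padded = ((size + blockSize - 1) `div` blockSize) * blockSize
      in go (start + padded) (Map.insert path (Entry start size) acc) rest

-- | Find the byte range to read for a path, if the archive contains it.
lookupEntry :: FilePath -> Map.Map FilePath Entry -> Maybe Entry
lookupEntry = Map.lookup
```

For an archive holding a 700-byte file followed by a 10-byte file, the data offsets come out as 512 and 2048: the first file occupies two padded blocks after its header, and the second header starts at 1536. Serving a file is then a seek to dataOffset and a read of dataSize bytes.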

There are also some internal server design challenges, which I’ll describe in the next two paragraphs; skip them if you like. One of them is making URI generation less clunky. Every resource provides enough information to generate its canonical URI given an association list of string pairs. However, this requires passing around the resource itself, which also contains the server function and other things. I’m considering making a global map that, given the string name of a resource, gives a URI-generating function, which means either passing this mapping to every single server function or setting up a ReaderT monad around Happstack’s ServerPartT. The other issue is that a URI is not guaranteed; it’s wrapped in a Maybe, since this system doesn’t provide the type safety guarantees of libraries like web-routes: it’s ‘stringly typed’.
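A rough sketch of this kind of ‘stringly typed’ URI generation follows. The Segment type and renderURI are hypothetical names for illustration only, and the sketch does no URL-escaping. The Maybe in the result is exactly the missing guarantee: nothing stops a caller from forgetting a key.

```haskell
-- | One path segment of a resource's URI template.
data Segment
  = Static String    -- literal text, e.g. "package"
  | Dynamic String   -- looked up by key in the association list

-- | Fill a template from string pairs; Nothing if any key is missing.
renderURI :: [Segment] -> [(String, String)] -> Maybe String
renderURI segs env = concatMap ('/' :) <$> traverse fill segs
  where
    fill (Static s)  = Just s
    fill (Dynamic k) = lookup k env

-- Template for the candidate preview page listed earlier.
candidateURI :: [Segment]
candidateURI = [Static "package", Dynamic "package", Static "candidate"]
```

Here renderURI candidateURI [("package", "foo")] gives Just "/package/foo/candidate", while an empty association list gives Nothing.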

In addition, user groups are currently totally decentralized, but perhaps they could use some more coordination. The MediaWiki system of having a global mapping for which groups are allowed to exercise which permissions is pretty good, though in a typical PHP manner, it uses strings to do this. It might be better for each type of group to list the permissions it grants, rather than having this check in the code itself, but again this might require passing this mapping to every single server function.
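A typed version of such a global mapping might look like the following sketch. The group and permission names are invented for illustration and do not come from the server source.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical groups and permissions, as data types rather than strings.
data Group = Admins | Maintainers | Uploaders
  deriving (Eq, Ord, Show)

data Permission = UploadPackage | PublishCandidate | EditGroups
  deriving (Eq, Ord, Show)

-- | The single global table: which group grants which permissions.
groupPermissions :: Map.Map Group (Set.Set Permission)
groupPermissions = Map.fromList
  [ (Admins,      Set.fromList [UploadPackage, PublishCandidate, EditGroups])
  , (Maintainers, Set.fromList [PublishCandidate])
  , (Uploaders,   Set.fromList [UploadPackage])
  ]

-- | A user may act if any of their groups grants the permission.
allowed :: [Group] -> Permission -> Bool
allowed groups perm = any grants groups
  where
    grants g = maybe False (Set.member perm) (Map.lookup g groupPermissions)
```

With constructors instead of strings, a misspelled permission is a compile error rather than a silent denial, though the table still has to reach every server function somehow.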

Memory and performance

I’ve done some rudimentary statistics-gathering, but much more will need to be done soon.

For instance, importing from the old Hackage server causes the memory used by the server to reach around 700MB and stay there (any memory allocated by GHC always stays there), and this is only for the current tarball versions. However, this is only needed for initialization, as I mentioned I plan on making a separate mode for legacy import.

By contrast, starting up the server with the current set of package versions occupies 390 MB of memory, although only 148 MB is used by the RTS at any given time. When initializing the server in this mode, 40% of the CPU time is used on garbage collection, but things seem reasonably stable afterwards. The directory storage with the current tarball versions occupies 130 MB disk space, and the happstack-state database is just 17 MB. This database is pretty small comparatively, likely because it doesn’t include the parsed PackageDescription data structure, which contains lots of fields and lots of strings.

In general, I suspect I need some modifications to ensure that GHC isn’t too heap-hungry. Heap profiling has proven suspect thus far, since apparently the server has a special affinity for ghc-prim:GHC.Types.:, and if I’m reading it right I find it somewhat hard to believe that over 90% of the server’s memory is used on cons cells. On the other hand, maybe there are that many Strings and [Dependency]s. I think later on I’ll be asking the advice of some more senior Haskell hackers to keep memory usage down, even if one of the selling points of Happstack is that all data’s in memory. (Not entirely true here: the blob storage is used for package tarballs.)

In the eventual future

Build reports are a must-do, and at present authenticated clients can submit build reports and build logs. Anonymous reports are tricky though (but still immensely useful), and I know many of you guys wouldn’t submit reports without them. Statistics need to be done as well: how to take a large number of build reports and tell you something useful about them. Perhaps it could tell you that a given report is not recent.

Also, a solution for systematic client-side and server-side caching of HTML hasn’t come up yet, if this is in the cards at all. Making an ETag-generating function is not a simple matter, particularly when multiple representations of the same resource are served in multiple formats at multiple URIs (sadly, I can’t rely solely on the Accept header, because browser implementers seemingly read RFCs highlighted with black markers).
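One possible direction, sketched under heavy assumptions, is to derive the ETag from the rendered body together with its media type, so that two representations of the same resource never share a tag. The hash here is a simple FNV-1a over characters, chosen only because it needs nothing beyond base; it is not cryptographic, and fnv1a and etagFor are hypothetical names, not part of Happstack.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

-- | 64-bit FNV-1a over the characters of a string (not cryptographic).
--   Word64 arithmetic wraps around, which is what the algorithm expects.
fnv1a :: String -> Word64
fnv1a = foldl step 14695981039346656037
  where
    step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- | Quoted ETag for one representation. The media type is mixed into the
--   hash so different renderings of one resource get different tags.
etagFor :: String -> String -> String
etagFor mediaType body =
  "\"" ++ showHex (fnv1a (mediaType ++ "\n" ++ body)) "" ++ "\""
```

This requires rendering the body before the tag can be computed, so it saves bandwidth but not server work; a tag derived from the underlying state instead would avoid the rendering, at the cost of tracking what each page depends on.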

Finally, there’s no clear procedure for migrating data, and I’m still not fully familiar with happstack-state’s data versioning system. Apparently both the old and new data types need to exist at the same time, and then the old one can be discarded. I could probably write a startup mode for this.
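The shape of such a migration, independent of happstack-state’s actual versioning API (which this sketch deliberately does not use), is an old type, a new type, and a total function between them. UserV0, UserV1 and the choice of default are all hypothetical.

```haskell
-- Old and new shapes of a hypothetical piece of server state.
data UserV0 = UserV0 { v0Name :: String }
  deriving (Eq, Show)

data UserV1 = UserV1 { v1Name :: String, v1Enabled :: Bool }
  deriving (Eq, Show)

-- | Upgrade a stored value. Both types must stay in scope until every
--   stored UserV0 has been converted; fields that did not exist before
--   need an explicit default (here: accounts start out enabled).
migrateUser :: UserV0 -> UserV1
migrateUser (UserV0 n) = UserV1 { v1Name = n, v1Enabled = True }
```

A dedicated startup mode would then amount to loading the old state, mapping a function like this over it, and writing a fresh checkpoint.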

The most eventual of future elements is more shiny features. This future will extend beyond this summer, so while some individual features might deserve Summer of Code projects in their own right, I’ll try to knock out as many of the others as possible. Let the other 4/7ths begin!