The Debian repository format was designed a long time ago. The oldest
versions of it were produced with the help of tools such as
dpkg-scanpackages and consumed by dselect access methods such as
dpkg-ftp. The access methods just fetched a Packages file (perhaps
compressed) and used it as an index of which packages were available; each
package had an MD5 checksum to defend against transport errors, but being
from a more innocent age there was no repository signing or other protection
against man-in-the-middle attacks.

An important and intentional feature of the early format was that, apart
from the top-level Packages file, all other files were static in the
sense that, once published, their content would never change without also
changing the file name. This means that repositories can be efficiently
copied around using rsync without having to tell it to re-checksum all
files, and it avoids network races when fetching updates: the repository
you’re updating from might change in the middle of your update, but as long
as the repository maintenance software keeps superseded packages around for
a suitable grace period, you’ll still be able to fetch them.

The repository format evolved rather organically over time as different
needs arose, by what one might call distributed consensus among the
maintainers of the various client tools that consumed it. Of course all
sorts of fields were added to the index files themselves, which have an
extensible format so that this kind of thing is usually easy to do. At some
point a Sources index for source packages was added, which worked pretty
much the same way as Packages except for having a different set of fields.
But by far the most significant change to the repository structure was the
“package pools” project.

The original repository layout put the packages themselves under the
dists/ tree along with the index files. The dists/ tree is organised by
“suite” (modern examples of which would be “stable”, “stable-updates”,
“testing”, “unstable”, “xenial”, “xenial-updates”, and so on). This meant
that making a release of Debian tended to involve copying lots of data
around, and implementing the “testing” suite would have been very costly.
Package pools solved this problem by moving individual package files out of
dists/ and into a new pool/ tree, allowing those files to be shared
between multiple suites with only a negligible cost in disk space and mirror
bandwidth. From a database design perspective this is obviously much more
sensible. As part of this project, the original Debian “dinstall”
repository maintenance scripts were
replaced
by “da-katie” or “dak”, which among other things used a new apt-ftparchive
program to build the index files; this replaced dpkg-scanpackages and
dpkg-scansources, and included its own database cache which made a big
difference to performance at the scale of a distribution.

A few months after the initial implementation of package pools, Release
files were added. These formed a sort of meta-index for each suite, telling
APT which index files were available (main/binary-i386/Packages,
non-free/source/Sources, and so on) and what their checksums were.
Detached signatures were added alongside that (Release.gpg) so that it was
now possible to fetch packages securely given a public key for the
repository, and client-side verification
support for
this eventually made its way into Debian and Ubuntu. The repository
structure stayed more or less like this for several years.

At some point along the way, those of us by now involved in repository
maintenance realised that an important property had been lost. I mentioned
earlier that the original format allowed race-free updates, but this was no
longer true with the introduction of the Release file. A client now had
to fetch Release and then fetch whichever other index files such as
Packages they wanted, typically in separate HTTP transactions. If a
client was unlucky, these transactions would fall on either side of a mirror
update and they’d get a “Hash Sum Mismatch” error from APT. Worse, if a
mirror was unlucky and also didn’t go to special lengths to verify index
integrity (most don’t), its own updates could span an update of its upstream
mirror and then all its clients would see mismatches until the next mirror
update. This was compounded by using detached signatures, so Release and
Release.gpg were fetched separately and could be out of sync.

Fixing this has been a long road (the first time I remember talking about
this was in late 2007!), and we’ve had to take care to maintain
client/server compatibility along the way. The first step was to add
inline-signed versions of the Release file, called InRelease, so that
there would no longer be a race between fetching Release and fetching its
signature. APT has had this for a while, Debian’s repository supports it as
of stretch, and we finally implemented it for
Ubuntu six months ago.
Dealing with the other index files is more complicated, though; it isn’t
sensible to inline them, as clients usually only need to fetch a small
fraction of all the indexes available for a given suite.

The solution we’ve ended up with, thanks to Michael Vogt’s work implementing
it in APT, is called
by-hash
and should be familiar in concept to people who’ve used git: with the
exception of the top-level InRelease file, index files for suites that
support the by-hash mechanism may now be fetched using a URL based on one of
their hashes listed in InRelease. This means that clients can now operate
like this:
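
1. Fetch InRelease and verify its signature.
2. Look up the SHA256 entry that InRelease lists for each index file they
   need.
3. Fetch each index from an immutable by-hash URL derived from that entry,
   rather than from its canonical name, so a mirror update in the middle of
   the transaction can no longer change the file out from under them.

As a rough sketch of the URL construction and verification involved (with
illustrative names and hashes, not APT’s actual code):

import hashlib

# Illustrative layout: an index listed in InRelease as
#   SHA256:
#    9f3c0e1a... 1234567 main/binary-amd64/Packages.gz
# lives at dists/<suite>/main/binary-amd64/by-hash/SHA256/9f3c0e1a...

def by_hash_url(archive_root, suite, index_path, sha256_hex):
    # The by-hash directory sits alongside the index file itself.
    directory = index_path.rsplit('/', 1)[0]
    return '%s/dists/%s/%s/by-hash/SHA256/%s' % (
        archive_root, suite, directory, sha256_hex)

def verified(content, sha256_hex):
    # The URL names the content, so what we fetch either matches the hash
    # we asked for or is simply missing; there is no window for a silent
    # mismatch.
    return hashlib.sha256(content).hexdigest() == sha256_hex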

This is now enabled by default in
Ubuntu. It’s only there
as of xenial (16.04), since earlier versions of Ubuntu don’t have the
necessary support in APT. With this, hash mismatches on updates should be a
thing of the past.

There will still be some people who won’t yet benefit from this.
debmirror doesn’t support by-hash yet; apt-cacher-ng only supports it as
of xenial, although there’s an easy configuration
workaround. Full archive mirrors must make
sure that they put new by-hash files in place before new InRelease files
(I just fixed our recommended two-stage sync
script to do this;
ubumirror still needs some work; Debian’s
ftpsync is almost correct but
needs a tweak for its handling of translation files, which I’ve sent to its
maintainers). Other mirrors and proxies that have specific handling of the
repository format may need similar changes.

Please let me know if you see strange things happening as a result of this
change. It’s useful to check the output of apt -o
Debug::Acquire::http=true update to see exactly what requests are being issued.

Julian has
written
about their efforts to strengthen security in APT, and shortly before that
notified us that Launchpad’s
signatures on PPAs use
weak SHA-1 digests. Unfortunately we hadn’t noticed that before; GnuPG’s
defaults tend to result in weak digests unless carefully tweaked, which is a shame.

I started on the necessary fixes for this immediately we heard of the
problem, but it’s taken a little while to get everything in place, and I
thought I’d explain why since some of the problems uncovered are interesting
in their own right.

Firstly, there was the relatively trivial matter of using SHA-512 digests
on new
signatures.
This was mostly a matter of adjusting our configuration, although writing
the test was a bit tricky since
PyGPGME isn’t as helpful as it could
be. (Simpler repository implementations that call gpg from the command
line should probably just add the --digest-algo SHA512 option instead of
imitating this.)
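
For a simple repository signed by calling gpg directly, that might look
something like this minimal sketch (paths and the key ID are illustrative):

import os
import subprocess

def sign_release(release_path, key_id):
    """Sign a Release file with SHA-512 digests.

    Produces both an inline-signed InRelease and a detached Release.gpg
    for older clients.
    """
    common = ['gpg', '--batch', '--yes', '--local-user', key_id,
              '--digest-algo', 'SHA512']
    # Inline-signed copy, fetched atomically by modern APT:
    subprocess.check_call(common + [
        '--clearsign',
        '--output', os.path.join(os.path.dirname(release_path), 'InRelease'),
        release_path])
    # Detached signature for clients that still fetch Release + Release.gpg:
    subprocess.check_call(common + [
        '--detach-sign', '--armor',
        '--output', release_path + '.gpg', release_path])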

After getting that in place, any change to a suite in a PPA will result in
it being re-signed with SHA-512, which is good as far as it goes, but we
also want to re-sign PPAs that haven’t been modified. Launchpad hosts more
than 50000 active PPAs, though, a significant percentage of which include
packages for sufficiently recent Ubuntu releases that we’d want to re-sign
them for this. We can’t expect everyone to push new uploads, and we need to
run this through at least some part of our usual publication machinery
rather than just writing a hacky shell script to do the job (which would
have no idea which keys to sign with, to start with); but forcing full
reprocessing of all those PPAs would take a prohibitively long time, and at
the moment we need to interrupt normal PPA publication to do this kind of
work. I therefore had to spend some quality time working out how to make
things go fast enough.

The first couple of changes
(1,
2)
were to add options to our publisher script to let us run just the one step
we need in “careful” mode: that is, forcibly re-run the Release file
processing step even if it thinks nothing has changed, and entirely disable
the other steps such as generating Packages and Sources files. Then
last week I finally got around to timing things on one of our staging
systems so that we could estimate how long a full run would take. It was
taking a little over two seconds per archive, which meant that if we were to
re-sign all published PPAs then that would take more than 33 hours!
Obviously this wasn’t viable; even just re-signing xenial would be
prohibitively slow.

The next question was where all that time was going. I thought perhaps that
the actual signing might be slow for some reason, but it was taking about
half a second per archive: not great, but not enough to account for most of
the slowness. The main part of the delay was in fact when we committed the
database transaction after processing each archive: not in the actual
PostgreSQL commit itself, but in the ORM’s invalidate method called to prepare for a commit.

Launchpad uses the excellent Storm for all
of its database interactions. One property of this ORM (and possibly of
others; I’ll cheerfully admit to not having spent much time with other ORMs)
is that it uses a
WeakValueDictionary
to keep track of the objects it’s populated with database results. Before
it commits a transaction, it iterates over all those “alive” objects to note
that if they’re used in future then information needs to be reloaded from
the database first. Usually this is a very good thing: it saves us from
having to think too hard about data consistency at the application layer.
But in this case, one of the things we did at the start of the publisher
script was:

def getPPAs(self, distribution):
    """Find private package archives for the selected distribution."""
    if (self.isCareful(self.options.careful_publishing) or
            self.options.include_non_pending):
        return distribution.getAllPPAs()
    else:
        return distribution.getPendingPublicationPPAs()

def getTargetArchives(self, distribution):
    """Find the archive(s) selected by the script's options."""
    if self.options.partner:
        return [distribution.getArchiveByComponent('partner')]
    elif self.options.ppa:
        return filter(is_ppa_public, self.getPPAs(distribution))
    elif self.options.private_ppa:
        return filter(is_ppa_private, self.getPPAs(distribution))
    elif self.options.copy_archive:
        return self.getCopyArchives(distribution)
    else:
        return [distribution.main_archive]

That innocuous-looking filter means that we do all the public/private
filtering of PPAs up-front and return a list of all the PPAs we intend to
operate on. This means that all those objects are alive as far as Storm is
concerned and need to be considered for invalidation on every commit, and
the time required for that stacks up when many thousands of objects are
involved: this is essentially accidentally
quadratic behaviour, because all
archives are considered when committing changes to each archive in turn.
Normally this isn’t too bad because only a few hundred PPAs need to be
processed in any given run; but if we’re running in a mode where we’re
processing all PPAs rather than just ones that are pending publication, then
suddenly this balloons to the point where it takes a couple of seconds. The
fix
is very simple, using an
iterator instead
so that we don’t need to keep all the objects alive:

from itertools import ifilter

def getTargetArchives(self, distribution):
    """Find the archive(s) selected by the script's options."""
    if self.options.partner:
        return [distribution.getArchiveByComponent('partner')]
    elif self.options.ppa:
        return ifilter(is_ppa_public, self.getPPAs(distribution))
    elif self.options.private_ppa:
        return ifilter(is_ppa_private, self.getPPAs(distribution))
    elif self.options.copy_archive:
        return self.getCopyArchives(distribution)
    else:
        return [distribution.main_archive]
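
To see why keeping the whole list alive hurt in the first place, here’s a
toy illustration (not Launchpad or Storm code) of a weak-value cache like
the one Storm uses to track live objects:

import gc
import weakref

class Row(object):
    """Stand-in for an ORM-tracked database object."""

cache = weakref.WeakValueDictionary()

def fetch(i):
    row = Row()
    cache[i] = row  # the ORM tracks live objects in a similar structure
    return row

rows = [fetch(i) for i in range(10000)]
print(len(cache))  # 10000: every row is still alive, so every commit
                   # must iterate over all of them

del rows
gc.collect()
print(len(cache))  # 0: with no strong references held, the cache (and
                   # the per-commit invalidation work) stays small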

After that, I turned to that half a second for signing. A good chunk of
that was accounted for by the signContent method taking a fingerprint
rather than a key, despite the fact that we normally already had the key in
hand; this caused us to have to ask GPGME to reload the key, which requires
two subprocess calls. Converting this to take a key rather than a
fingerprint
gets the per-archive time down to about a quarter of a second on our staging
system, about eight times faster than where we started.

Using this, we’ve now re-signed all xenial Release files in PPAs using
SHA-512 digests. On production, this took about 80 minutes to iterate over
around 70000 archives, of which 1761 were modified. Most of the time
appears to have been spent skipping over unmodified archives; even a few
hundredths of a second per archive adds up quickly there. The remaining
time comes out to around 0.4 seconds per modified archive. There’s
certainly still room for speeding this up a bit.

We wouldn’t want to do this procedure every day, but it’s acceptable for
occasional tasks like this. I expect that we’ll similarly re-sign wily,
vivid, and trusty Release files soon in the same way.

Launchpad operates a few SSH endpoints: bazaar.launchpad.net and
git.launchpad.net for code hosting, and upload.ubuntu.com and
ppa.launchpad.net for uploading packages. None of these are
straightforward OpenSSH servers, because they don’t give ordinary shell
access and they authenticate against users’ SSH keys recorded in Launchpad;
both of these are much easier to do with SSH server code that we can use in
library form as part of another service. We use
Twisted for several other tasks
where we need event-based networking code, and its
conch package is a good
fit for this.

Of course, this means that it’s important that conch keeps up to date with
the cryptographic state of the art in other SSH implementations, and this
hasn’t always been the case. OpenSSH 7.0 dropped support for some old
algorithms, including disabling the
1024-bit diffie-hellman-group1-sha1 key exchange method at run-time.
Unfortunately, this also happened to be the only key exchange method that
Launchpad’s SSH endpoints supported (conch supported the slightly better
diffie-hellman-group-exchange-sha1 method as well, but that was disabled
in Launchpad due to a missing piece of configuration). SHA-2
support was clearly called for,
and the fact that we had to get this sorted out in conch first meant that
everything took a bit longer than we’d hoped.

In Twisted
15.5,
we contributed several conch improvements, including support for SHA-2-based
key exchange and MAC algorithms.

Between them, plus some adjustments to the
lazr.sshserver package we use
to glue all this together (adding support for DH group exchange), these are
enough to allow us not to rely on SHA-1 at all, and these improvements have
now been rolled out to all four endpoints listed above. I’ve thus also
uploaded OpenSSH 7.1 packages to Debian unstable.

If you also run a Twisted-based SSH server, upgrade it now! Otherwise it
will be harder for users of recent
OpenSSH client versions to use your server, and for good reason.

Step down considerately: When somebody leaves or disengages from the
project, we ask that they do so in a way that minimises disruption to the
project. They should tell people they are leaving and take the proper
steps to ensure that others can pick up where they left off.

I’ve been working on Ubuntu for over ten years now, almost right from the
very start; I’m Canonical’s employee #17 due to working out a notice period
in my previous job, but I was one of the founding group of developers. I
occasionally tell the story that Mark originally hired me mainly to work on
what later became Launchpad Bugs due to my experience maintaining the Debian
bug tracking system, but then not long afterwards Jeff Waugh got in touch
and said “hey Colin, would you mind just sorting out some installable CD
images for us?”. This is where you imagine one of those movie time-lapse
clocks … At some point it became fairly clear that I was working on
Ubuntu, and the bug system work fell to other people. Then, when Matt
Zimmerman could no longer manage the entire Ubuntu team in Canonical by
himself, Scott James Remnant and I stepped up to help him out. I did that
for a couple of years, starting the Foundations team in the process. As the
team grew I found that my interests really lay in hands-on development
rather than in management, so I switched over to being the technical lead
for Foundations, and have made my home there ever since. Over the years
this has given me the opportunity to do all sorts of things, particularly
working on our installers and on the GRUB boot loader, leading the
development work on many of our archive maintenance tools, instituting the
+1 maintenance effort and proposed-migration, and developing the Click
package manager, and I’ve had the great pleasure of working with many
exceptionally talented people.

However. In recent months I’ve been feeling a general sense of malaise and
what I’ve come to recognise with hindsight as the symptoms of approaching
burnout. I’ve been working long hours for a long time, and while I can draw
on a lot of experience by now, it’s been getting harder to summon the
enthusiasm and creativity to go with that. I have a wonderful wife, amazing
children, and lovely friends, and I want to be able to spend a bit more time
with them. After ten years doing the same kinds of things, I’ve accreted
history with and responsibility for a lot of projects. One of the things I
always loved about Foundations was that it’s a broad church, covering a wide
range of software and with a correspondingly wide range of opportunities;
but, over time, this has made it difficult for me to focus on things that
are important because there are so many areas where I might be called upon
to help. I thought about simply stepping down from the technical lead
position and remaining in the same team, but I decided that that wouldn’t
make enough of a difference to what matters to me. I need a clean break and
an opportunity to reset my habits before I burn out for real.

One of the things that has consistently held my interest through all of this
has been making sure that the infrastructure for Ubuntu keeps running
reliably and that other developers can work efficiently. As part of this,
I’ve been able to do a lot of
work over the years
on Launchpad where it was a good fit with my
remit: this has included significant performance improvements to archive
publishing, moving most archive administration operations from
excessively-privileged command-line operations to the webservice, making
build cancellation reliable across the board, and moving live filesystem
building from an unscalable ad-hoc collection of machines into the Launchpad
build farm. The Launchpad development team has generally welcomed help with
open arms, and in fact I joined the ~launchpad
team last year.

So, the logical next step for me is to make this informal involvement
permanent. As such, at the end of this year I will be moving from Ubuntu
Foundations to the Launchpad engineering team.

This doesn’t mean me leaving Ubuntu. Within Canonical, Launchpad
development is currently organised under the Continuous Integration team,
which is part of Ubuntu Engineering. I’ll still be around in more or less
the usual places and available for people to ask me questions. But I will
in general be trying to reduce my involvement in Ubuntu proper to things
that are closely related to the operation of Launchpad, and a small number
of low-effort things that I’m interested enough in to find free time for
them. I still need to sort out a lot of details, but it’ll very likely
involve me handing over project leadership of Click, drastically reducing my
involvement in the installer, and looking for at least some help with boot
loader work, among others. I don’t expect my Debian involvement to change,
and I may well find myself more motivated there now that it won’t be so
closely linked with my day job, although it’s possible that I will pare some
things back that I was mostly doing on Ubuntu’s behalf. If you ask me for
help with something over the next few months, expect me to be more likely to
direct you to other people or suggest ways you can help yourself out, so
that I can start disentangling myself from my current web of projects.

Please contact me sooner rather than later if you’re interested in helping out with
any of the things I’m visible in right now, and we can see what makes sense.
I’m looking forward to this!

We had
some requests
to get GHC (the Glasgow Haskell Compiler) up
and running on two new Ubuntu architectures:
arm64, added in 13.10,
and ppc64el, added
in 14.04. This has been something of a saga, and has involved rather more
late-night hacking than is probably good for me.

Book the First: Recalled to a life of strange build systems

You might not know it from the sheer bulk of uploads I do sometimes, but I
actually don’t speak a word of Haskell and it’s not very high up my list of
things to learn. But I am a pretty experienced build engineer, and I enjoy
porting things to new architectures: I’m firmly of the belief that breadth
of architecture support is a good way to shake out certain categories of
issues in code, that it’s worth doing aggressively across an entire
distribution, and that, even if you don’t think you need something now, new
requirements have a habit of coming along when you least expect them and you
might as well be prepared in advance. Furthermore, it annoys me when we
have excessive noise in our build failure
and proposed-migration output
and I often put bits and pieces of spare time into gardening miscellaneous
problems there, and at one point there was a lot of Haskell stuff on the
list and it got a bit annoying to have to keep sending patches rather than
just fixing things myself, and … well, I ended up as probably the only
non-Haskell-programmer on the Debian Haskell team and found myself fixing
problems there in my free time. Life is a bit weird sometimes.

Bootstrapping packages on a new architecture is a bit of a black art that
only a fairly small number of relatively bitter and twisted people know very
much about. Doing it in Ubuntu is specifically painful because we’ve always
forbidden direct binary uploads: all binaries have to come from a build
daemon. Compilers in particular often tend to be written in the language
they compile, and it’s not uncommon for them to build-depend on themselves:
that is, you need a previous version of the compiler to build the compiler,
stretching back to the dawn of time where somebody put things together with
a big magnet or something. So how do you get started on a new architecture?
Well, what we do in this case is we construct a binary somehow (usually
involving cross-compilation) and insert it as a build-dependency for a
proper build in Launchpad. The ability to do this is restricted to a small
group of Canonical employees, partly because it’s very easy to make mistakes
and partly because things like the classic “Reflections on Trusting
Trust” are in the backs of our
minds somewhere. We have an iron rule for our own sanity that the injected
build-dependencies must themselves have been built from the unmodified
source package in Ubuntu, although there can be source modifications further
back in the chain. Fortunately, we don’t need to do this very often, but it
does mean that as somebody who can do it I feel an obligation to try and
unblock other people where I can.

As far as constructing those build-dependencies goes, sometimes we look for
binaries built by other distributions (particularly Debian), and that’s
pretty straightforward. In this case, though, these two architectures are
pretty new and the Debian ports are only just getting going, and as far as I
can tell none of the other distributions with active arm64 or ppc64el ports
(or trivial name variants) has got as far as porting GHC yet. Well, OK.
This was somewhere around the Christmas holidays and I had some time.
Muggins here cracks his knuckles and decides to have a go at bootstrapping
it from scratch. It can’t be that hard, right? Not to mention that it was
a blocker for over 600 entries on that build failure list I mentioned, which
is definitely enough to make me sit up and take notice; we’d even had the
odd customer request for it.

Several attempts later and I was starting to doubt my sanity, not least for
trying in the first place. We ship GHC 7.6, and upgrading to 7.8 is not a
project I’d like to tackle until the much more experienced Haskell folks in
Debian have switched to it in unstable. The porting documentation for
7.6 has bitrotted
more or less beyond usability, and the corresponding documentation for
7.8 really isn’t
backportable to 7.6. I tried building 7.8 for ppc64el anyway, picking that
on the basis that we had quicker hardware for it and didn’t seem likely to
be particularly more arduous than arm64 (ho ho), and I even got to the point
of having a cross-built stage2 compiler (stage1, in the cross-building case,
is a GHC binary that runs on your starting architecture and generates code
for your target architecture) that I could copy over to a ppc64el box and
try to use as the base for a fully-native build, but it segfaulted
incomprehensibly just after spawning any child process. Compilers tend to
do rather a lot, especially when they’re built to use GCC to generate object
code, so this was a pretty serious problem, and it resisted analysis. I
poked at it for a while but didn’t get anywhere, and I had other things to
do so declared it a write-off and gave up.

Book the Second: The golden thread of progress

In March, another mailing list conversation prodded me into finding a blog
entry by Karel
Gardas
on building GHC for arm64. This was enough to be worth another look, and
indeed it turned out that (with some help from Karel in private mail) I was
able to cross-build a compiler that actually worked and could be used to run
a fully-native build that also worked. Of course this was 7.8, since as I
mentioned cross-building 7.6 is unrealistically difficult unless you’re
considerably more of an expert on GHC’s labyrinthine build system than I am.
OK, no problem, right? Getting a GHC at all is the hard bit, and 7.8 must
be at least as capable as 7.6, so it should be able to build 7.6 easily
enough …

Not so much. What I’d missed here was that compiler engineers generally
only care very much about building the compiler with older versions of
itself, and if the language in question has any kind of deprecation cycle
then the compiler itself is likely to be behind on various things compared
to more typical code since it has to be buildable with older versions. This
means that the removal of some deprecated interfaces from 7.8 posed a
problem, as did some changes in certain primops that had gained an
associated compatibility layer in 7.8 but nobody had gone back to put the
corresponding compatibility layer into 7.6.

GHC supports running Haskell code through the C preprocessor, and there’s a
__GLASGOW_HASKELL__ definition with the compiler’s version number, so this
was just a slog tracking down changes in git and adding #ifdef-guarded code
that coped with the newer compiler (remembering that stage1 will be built
with 7.8 and stage2 with stage1, i.e. 7.6, from the same source tree).

More inscrutably, GHC has its own packaging system called Cabal which is
also used by the compiler build process to determine which subpackages to
build and how to link them against each other, and some crucial subpackages
weren’t being built: it looked like it was stuck on picking versions from
“stage0” (i.e. the initial compiler used as an input to the whole process)
when it should have been building its own. Eventually I figured out that
this was because GHC’s use of its packaging system hadn’t anticipated this
case, and was selecting the higher version of the ghc package itself from
stage0 rather than the version it was about to build for itself, and thus
never actually tried to build most of the compiler. Editing
ghc_stage1_DEPS in ghc/stage1/package-data.mk after its initial generation
sorted this out.

One late night building round and round in circles for a while until I had
something stable, and a Debian source upload to add basic support for the
architecture name (and other changes which were a bit over the top in
retrospect: I didn’t need to touch the embedded copy of libffi, as we build
with the system one), and I was able to feed this all into Launchpad and
watch the builders munch away very satisfyingly at the Haskell library
stack for a while.

This was all interesting, and finally all that work was actually paying off
in terms of getting to watch a slew of several hundred build failures vanish
from arm64 (the final count was something like 640, I think). The fly in
the ointment was that ppc64el was still blocked, as the problem there wasn’t
building 7.6, it was getting a working 7.8. But now I really did have other
much more urgent things to do, so I figured I just wouldn’t get to this by
release time and stuck it on the figurative shelf.

Book the Third: The track of a bug

Then, last Friday, I cleared out my urgent pile and thought I’d have another
quick look. (I get a bit obsessive about things like this that smell of
“interesting intellectual puzzle”.) slyfox on the #ghc IRC channel gave me
some general debugging advice and, particularly usefully, a reduced example
program that I could use to debug just the process-spawning problem without
having to wade through noise from running the rest of the compiler. I
reproduced the same problem there, and then found that the program crashed
earlier (in stg_ap_0_fast, part of the run-time system) if I compiled it
with +RTS -Da -RTS. I nailed it down to a small enough region of assembly
that I could see all of the assembly, the source code, and an intermediate
representation or two from the compiler, and then started meditating on what
makes ppc64el special.

You see, the vast majority of porting bugs come down to what I might call
gross properties of the architecture. You have things like whether it’s
32-bit or 64-bit, big-endian or little-endian, whether char is signed or
unsigned, that sort of thing. There’s a big
table on the Debian wiki
that handily summarises most of the important ones. Sometimes you have to
deal with distribution-specific things like whether GL or GLES is used;
often, especially for new variants of existing architectures, you have to
cope with foolish configure scripts that think they can guess certain things
from the architecture name and get it wrong (assuming that powerpc* means
big-endian, for instance). We often have to update config.guess and
config.sub, and on ppc64el we have the additional hassle of updating
libtool macros too. But I’ve done a lot of this stuff and I’d accounted for
everything I could think of. ppc64el is actually a lot like amd64 in terms
of many of these porting-relevant properties, and not even that far off
arm64 which I’d just successfully ported GHC to, so I couldn’t be dealing
with anything particularly obvious. There was some hand-written assembly
which certainly could have been problematic, but I’d carefully checked that
this wasn’t being used by the “unregisterised” (no specialised machine
dependencies, so relatively easy to port but not well-optimised) build I was
using. A problem around spawning processes suggested a problem with
SIGCHLD handling, but I ruled that out by slowing down the first child
process that it spawned and using strace to confirm that SIGSEGV was the
first signal received. What on earth was the problem?

From some painstaking gdb work, one thing I eventually noticed was that
stg_ap_0_fast’s local stack appeared to be corrupted by a function
call, specifically a call to the colourfully-named debugBelch. Now, when
IBM’s toolchain engineers were putting together ppc64el based on ppc64, they
took the opportunity to fix a number of problems with their ABI: there’s an
OpenJDK bug with a handy
list of references. One of the things I noticed there was that there were
some stack allocation
optimisations in
the new ABI, which affected functions that don’t call any vararg functions
and don’t call any functions that take enough parameters that some of them
have to be passed on the stack rather than in registers. debugBelch takes
varargs: hmm. Now, the calling code isn’t quite in C as such, but in a
related dialect called “Cmm”, a variant of C-- (yes, minus minus), that GHC uses
to help bridge the gap between the functional world and its code generation,
and which is compiled down to C by GHC. When importing C functions into
Cmm, GHC generates prototypes for them, but it doesn’t do enough parsing to
work out the true prototype; instead, they all just get something like
extern StgFunPtr f(void);. In most architectures you can get away with
this, because the arguments get passed in the usual calling convention
anyway and it all works out, but on ppc64el this means that the caller
doesn’t generate enough stack space and then the callee tries to save its
varargs onto the stack in an area that in fact belongs to the caller, and
suddenly everything goes south. Things were starting to make sense.

Now, debugBelch is only used in optional debugging code; but
runInteractiveProcess (the function associated with the initial round of
failures) takes no fewer than twelve arguments, plenty to force some of them
onto the stack. I poked around the GCC patch for this ABI change a bit and
determined that it only optimised away the stack allocation if it had a full
prototype for all the callees, so I guessed that changing those prototypes
to extern StgFunPtr f(); might work: it’s still technically wrong, not
least because omitting the parameter list is an obsolescent feature in C11,
but it’s at least just omitting information about the parameter list rather
than actively lying about it. I tweaked that and ran the cross-build from
scratch again. Lo and behold, suddenly I had a working compiler, and I
could go through the same build-7.6-using-7.8 procedure as with arm64, much
more quickly this time now that I knew what I was doing. One upstream
bug, one Debian upload, and
several bootstrapping builds later, and GHC was up and running on another
architecture in Launchpad. Success!

Epilogue

There’s still more to do. I gather there may be a Google Summer of Code
project in Linaro to write proper native code generation for GHC on arm64:
this would make things a good deal faster, but also enable GHCi (the
interpreter) and Template Haskell, and thus clear quite a few more build
failures. Since there’s already native code generation for ppc64 in GHC,
getting it going for ppc64el would probably only be a couple of days’ work
at this point. But these are niceties by comparison, and I’m more than
happy with what I got working for 14.04.

The upshot of all of this is that I may be the first non-Haskell-programmer
to ever port GHC to two entirely new architectures. I’m not sure if I gain
much from that personally aside from a lot of lost sleep and being
considered extremely strange. It has, however, been by far the most
challenging set of packages I’ve ported, and a fascinating trip through some
odd corners of build systems and undefined behaviour that I don’t normally
need to touch.

This is mostly a repost of my ubuntu-devel
mail
for a wider audience, but see below for some additions.

I’d like to upgrade to GRUB 2.02 for Ubuntu 14.04; it’s currently in beta.
This represents a year and a half of upstream development, and contains many
new features, which you can see in the
NEWS file.

Obviously I want to be very careful with substantial upgrades to the default
boot loader. So, I’ve put this in trusty-proposed, and filed a blocking
bug to ensure
that it doesn’t reach trusty proper until it’s had a reasonable amount of
manual testing. If you are already using trusty and have some time to try
this out, it would be very helpful to me. I suggest that you only attempt
this if you’re comfortable driving apt-get directly and recovering from
errors at that level, and if you’re willing to spend time working with me on
narrowing down any problems that arise.

Please ensure that you have rescue media to hand before starting testing.
The simplest way to upgrade is to enable trusty-proposed, upgrade ONLY
packages whose names start with “grub” (e.g. use apt-get dist-upgrade to
show the full list, say no to the upgrade, and then pass all the relevant
package names to apt-get install), and then (very important!) disable
trusty-proposed again. Provided that there were no errors in this process,
you should be safe to reboot. If there were errors, you should be able to
downgrade back to 2.00-22 (or 1.27+2.00-22 in the case of grub-efi-amd64-signed).

Please report your experiences (positive and negative) with this upgrade in
the tracking
bug. I’m
particularly interested in systems that are complex in any way: UEFI Secure
Boot, non-trivial disk setups, manual configuration, that kind of thing. If
any of the problems you see are also ones you saw with earlier versions of
GRUB, please identify those clearly, as I want to prioritise handling
regressions over anything else. I’ve assigned myself to that bug to ensure
that messages to it are filtered directly into my inbox.

I’ll add a couple of things that weren’t in my ubuntu-devel mail. Firstly,
this is all in Debian experimental as well (I do all the work in Debian and
sync it across, so the grub2 source package in Ubuntu is a verbatim copy of
the one in Debian these days). There are some configuration differences
applied at build time, but a large fraction of test cases will apply equally
well to both. I don’t have a definite schedule for pushing this into jessie
yet - I only just finished getting 2.00 in place there, and the release
schedule gives me a bit more time - but I certainly want to ship jessie with
2.02 or newer, and any test feedback would be welcome. It’s probably best
to just e-mail feedback to me directly for now, or to the pkg-grub-devel list.

Secondly, a couple of news sites have picked this up and run it as
“Canonical intends to ship Ubuntu 14.04 LTS with a beta version of GRUB”.
This isn’t in fact my intent at all. I’m doing this now because I think
GRUB 2.02 will be ready in non-beta form in time for Ubuntu 14.04, and
indeed that putting it in our development release will help to stabilise it;
I’m an upstream GRUB developer too and I find the exposure of widely-used
packages very helpful in that context. It will certainly be much easier to
upgrade to a beta now and a final release later than it would be to try to
jump from 2.00 to 2.02 in a month or two’s time.

Even if there’s some unforeseen delay and 2.02 isn’t released in time,
though, I think nearly three months of stabilisation is still plenty to
yield a boot loader that I’m comfortable with shipping in an LTS. I’ve been
backporting a lot of changes to 2.00 and even 1.99, and, as ever for an
actively-developed codebase, it gets harder and harder over time (in
particular, I’ve spent longer than I’d like hunting down and backporting
fixes for non-512-byte sector disks). While I can still manage it, I don’t
want to be supporting 2.00 for five more years after upstream has moved on;
I don’t think that would be in anyone’s best interests. And I definitely
want some of the new features which aren’t sensibly backportable, such as
several of the new platforms (ARM, ARM64, Xen) and various networking
improvements; I can imagine a number of our users being interested in things
like optional signature verification of files GRUB reads from disk, improved
Mac support, and the TrueCrypt ISO loader, just to name a few. This should
be a much stronger base for five-year support.

I’ve just finished deploying automatic installability checking for Ubuntu’s
development release, which is more or less equivalent to the way that
uploads are promoted from Debian unstable to testing. See my ubuntu-devel
post
and my ubuntu-devel-announce
post
for details. This now means that we’ll be opening the archive for general
development once glibc 2.16 packages are ready.

I’m very excited about this because it’s something I’ve wanted to do for a
long, long time. In fact, back in 2004 when I had my very first telephone
conversation with a certain spaceman about this crazy Debian-based project
he wanted me to work on, I remember talking about Debian’s testing migration
system and some ways I thought it could be improved. I don’t remember the
details of that conversation any more and what I just deployed may well bear
very little resemblance to it, but it should transform the extent to which
our development release is continuously usable.

The next step is to hook in autopkgtest
results. This will allow us to do a degree of automatic testing of
reverse-dependencies when we upgrade low-level libraries.

OpenSSH 6.0p1 was released a
little while back; this weekend I belatedly got round to uploading packages
of it to Debian unstable and Ubuntu quantal.

I was a bit delayed by needing to put together an improvement to privsep
sandbox selection that
particularly matters in the context of distributions. One of the experts on
seccomp_filter has commented favourably on it, but I haven’t yet had a
comment from upstream themselves, so I may need to refine this depending on
what they say.

(This is a good example of how it matters that software is often not built
on the system that it’s going to run on, and in particular that the kernel
version is rather likely to be different. Where possible it’s always best
to detect kernel capabilities at run-time rather than at build-time.)
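
As a minimal illustration of the principle (my own sketch, not OpenSSH’s
actual check), probing for seccomp support at run time on Linux might look
like this:

import ctypes
import errno

def kernel_has_seccomp():
    PR_GET_SECCOMP = 21  # constant from linux/prctl.h
    libc = ctypes.CDLL(None, use_errno=True)
    # Kernels built without CONFIG_SECCOMP reject this with EINVAL;
    # anything else means the interface is present.
    if libc.prctl(PR_GET_SECCOMP, 0, 0, 0, 0) == -1:
        return ctypes.get_errno() != errno.EINVAL
    return True

print(kernel_has_seccomp())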

I didn’t make it very clear in the changelog, but using the new
seccomp_filter sandbox currently requires UsePrivilegeSeparation sandbox
in sshd_config as well as a capable kernel. I won’t change the default
here in advance of upstream, who still consider privsep sandboxing experimental.

I’ve managed to go for eleven years working on Debian and nearly eight on
Ubuntu without ever needing to teach myself how APT’s resolver works. I get
the impression that there’s a certain mystique about it in general
(alternatively, I’m just the last person to figure this out). Recently,
though, I had a couple of Ubuntu upgrade bugs to fix that turned out to be
bugs in the resolver, and I thought it might be interesting to walk through
the process of fixing them based on the Debug::pkgProblemResolver=true log files.

Breakage with Breaks

The first was Ubuntu bug #922485
(apt.log). To understand
the log, you first need to know that APT makes up to ten passes of the
resolver to attempt to fix broken dependencies by upgrading, removing, or
holding back packages; if there are still broken packages after this point,
it’s generally because it’s got itself stuck in some kind of loop, and it
bails out rather than carrying on forever. The current pass number is shown
in each “Investigating” log entry, so they start with “Investigating (0)”
and carry on up to at most “Investigating (9)”. Any packages that you see
still being investigated on the tenth pass are probably something to do with
whatever’s going wrong.

In this case, most packages have been resolved by the end of the fourth
pass, but xserver-xorg-core is causing some trouble. (Not a particular
surprise, as it’s an important package with lots of relationships.) We can
see that each breakage is a Breaks
(a relatively new package relationship type introduced a few years ago as a
sort of weaker form of Conflicts) on a virtual package, which means that
in order to unpack xserver-xorg-core each package that provides
xserver-xorg-video-6 must be deconfigured. Much like Conflicts, APT
responds to this by upgrading providing packages to versions that don’t
provide the offending virtual package if it can, and otherwise removing
them. We can see it doing just that in the log.

OK, so that makes sense - presumably upgrading those packages didn’t help at
the time. But look at the pass numbers. Rather than just fixing all the
packages that provide xserver-xorg-video-6 in a single pass, which it
would be perfectly able to do, it only fixes one per pass. This means that
if a package Breaks a virtual package which is provided by more than ten
installed packages, the resolver will fail to handle that situation. On
inspection of the code, this was being handled correctly for Conflicts by
carrying on through the list of possible targets for the dependency relation
in that case, but apparently when Breaks support was implemented in APT
this case was overlooked. The fix is to carry on through the list of
possible targets for any “negative” dependency relation, not just
Conflicts, and I’ve filed a patch as Debian
bug #657695.
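
In pseudocode terms, the fixed control flow looks something like this
sketch (illustrative names, not APT’s actual C++):

def fix_negative_dep(installed_providers, can_upgrade_away, upgrade, remove):
    # A negative dependency (Conflicts or Breaks) on a virtual package
    # must deal with every installed provider in the same pass.
    for pkg in installed_providers:
        if can_upgrade_away(pkg):
            upgrade(pkg)  # move to a version not providing the target
        else:
            remove(pkg)
        # The bug was equivalent to stopping here for Breaks (but not
        # Conflicts), fixing one provider per pass and so running out of
        # passes when more than ten packages provide the virtual package.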

My cup overfloweth

The second bug I looked at was Ubuntu
bug #917173
(apt.log). Just as in
the previous case, we can see the resolver “running out of time” by reaching
the end of the tenth pass with some dependencies still broken. This one is
a lot less obvious, though. The last few entries clearly indicate that the
resolver is stuck in a loop.

So ultimately the problem is something to do with libc6; but what? As
Steve Langasek said in the
bug,
libc6’s dependencies have been very carefully structured, and surely we
would have seen some hint of it elsewhere if they were wrong. At this point
ideally I wanted to break out GDB or at the very least experiment a bit with
apt-get, but due to some tedious local problems I hadn’t been able to
restore the apt-clone state file for this bug onto my system so that I
could attack it directly. So I fell back on the last refuge of the
frustrated debugger and sat and thought about it for a bit.

Eventually I noticed something. The numbers after the package names in the
third line of each of these log entries are “scores”: roughly, the more
important a package is, the higher its score should be. The function that
calculates these is pkgProblemResolver::MakeScores() in
apt-pkg/algorithms.cc.
Reading this, I noticed that the various values added up to make each score
are almost all provably positive, for example:

Scores[I->ID] += abs(OldScores[D.ParentPkg()->ID]);

The only exceptions are an initial -1 or -2 points for Priority: optional
or Priority: extra packages respectively, or some values that could
theoretically be configured to be negative but weren’t in this case. OK.
So how come libc6 has such a huge negative score of -17473, when one would
normally expect it to be an extremely powerful package with a large positive score?

Oh. This is computer programming, not mathematics … and each score is
stored in a signed short, so in a sufficiently large upgrade all those
bonus points add up to something larger than 32767 and everything goes
haywire. Bingo. Make it an int instead - the number of installed
packages is going to be on the order of tens of thousands at most, so it’s
not as though it’ll make a substantial difference to the amount of memory
used - and chances are everything will be fine. I’ve filed a patch as
Debian bug #657732.
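
A toy illustration of the wraparound, using ctypes to mimic C’s signed
short:

import ctypes

score = ctypes.c_short(0)
for bonus in (30000, 5000):
    score.value += bonus  # wraps modulo 2**16 on assignment
print(score.value)  # -30536, not 35000: enough positive bonuses flip an
                    # "important" package to a huge negative score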

I’d expected this to be a pretty challenging pair of bugs. While I
certainly haven’t lost any respect for the APT maintainers for dealing with
this stuff regularly, it wasn’t as bad as I thought. I’d expected to have
to figure out how to retune some slightly out-of-balance heuristics and not
really know whether I’d broken anything else in the process; but in the end
both patches were very straightforward.