My last post explained how I retrieved and corrected data from
snapshot.debian.org so that dose3 was able to parse it. In this post I will
cover some surprising results I found when using my tools on those Packages and
Sources files from 2005 until today.

For each pair of Packages and Sources files I did the following:

created a reduced distribution

calculated the dependency graph

I call a reduced distribution the smallest set of binary and source packages
with the following properties:

all source packages can be built with the available binary packages

all binary packages are built from the available source packages

Creating a reduced distribution first greatly increases the execution speed of
my algorithms, as it reduces the number of binary and source packages by an
order of magnitude while still preserving the dependency cycle situation of the
core packages. In many cases, once the packages of a reduced distribution are
available, all the rest of Debian can be compiled from them without any
dependency cycles.
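
The two properties above can be read as a closure computation. The following is a minimal Python sketch of that reading, not my actual tooling: it assumes a seed set of binary packages to start from (the definition above does not say how the set is seeded), treats every dependency as a single package name without versions or alternatives, and uses made-up package names. A real implementation needs a dependency solver such as dose3 to pick installation sets.

```python
def reduce_distribution(seed, built_from, build_depends, depends):
    """Close a seed set of binary packages under 'needs to run' and 'needs to build'.

    seed:          binary packages to start from (an assumption of this sketch)
    built_from:    binary package name -> source package that builds it
    build_depends: source package name -> binary packages needed to build it
    depends:       binary package name -> binary packages needed to install it
    """
    binaries, sources = set(), set()
    todo = list(seed)
    while todo:
        binary = todo.pop()
        if binary in binaries:
            continue
        binaries.add(binary)
        # a binary package is only usable together with its runtime dependencies
        todo.extend(depends.get(binary, ()))
        # every binary package must be built from an available source package
        source = built_from[binary]
        if source not in sources:
            sources.add(source)
            # every source package must be buildable with available binaries
            todo.extend(build_depends.get(source, ()))
    return binaries, sources


# Hypothetical toy archive: "docgen" and "game" are not needed by the core.
built_from = {"cc": "cc-src", "libc": "libc-src", "make": "make-src",
              "docgen": "docgen-src", "game": "game-src"}
build_depends = {"cc-src": ["cc", "libc", "make"],
                 "libc-src": ["cc", "make"],
                 "make-src": ["cc", "libc"],
                 "docgen-src": ["cc"],
                 "game-src": ["cc", "docgen"]}
depends = {"cc": ["libc"]}

binaries, sources = reduce_distribution(["cc"], built_from, build_depends, depends)
print(sorted(binaries), sorted(sources))
```

In this toy archive the closure keeps three of the five binary packages and three of the five source packages; everything else can be built afterwards from that core.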

As also mentioned in earlier posts, there is always one central, big strongly
connected component (SCC) in the dependency graph.
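
For readers unfamiliar with the term: a strongly connected component is a maximal set of nodes in which every node can reach every other node. As an illustration, here is a minimal Python sketch of Tarjan's algorithm on a hypothetical toy graph. It is not the code my tools use, the node names are made up, and the recursive form is only suitable for small graphs.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps a node to an iterable of its successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop the stack down to v
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


# Hypothetical dependency graph with one big and one small cycle.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"],
       "d": ["e"], "e": ["d"], "f": ["a"]}
sccs = strongly_connected_components(toy)
print(max(sccs, key=len))  # the "central" SCC of the toy graph
```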

I am especially interested in how the size of the reduced distribution and the
SCC change over time as both are an indication of:

the number of interdependencies between core packages

the number of dependency cycles in the dependency graph

Let's look at the plots I made from the data I gathered. The gray data points
indicate that at that point in time, one or more of the core source packages
(the ones in the reduced distribution) in Debian Sid was not compilable. This
means that the resulting values cannot be fully trusted. But as it is mostly
only a single source package that doesn't compile, it doesn't influence the
overall result much and therefore I included them anyway. Red and green data
points represent a fully successful run.

The only thing that I do not yet understand is what happened in 2007...

So while a potential porter in 2005 only had to look at a graph of 150 nodes,
he now needs to solve a graph of nearly 1000 nodes. The number of edges in the
dependency graph grew even more dramatically, from about 500 to over 8000.

While the dependency situation for Debian Sid in 2005 can easily be printed
using xdot and solved visually, this is no longer possible in 2012.

While the dependencies of only a few dozen source packages had to be dropped
manually in 2005, now even dropping build dependencies from a few hundred
source packages doesn't solve the dependency situation.

So my assumption is that, due to a growing number of interdependencies between
source and binary packages (as both gain more features), bootstrapping Debian
for a new architecture becomes harder over time. Is this also the subjective
impression of people who have ported Debian in the past?

If my assumption is correct, then there is a growing need for official support
of droppable build dependencies (or "stage builds" or "profile builds") to
break dependency cycles during the bootstrapping process. A porter's work would
be much easier if source packages already contained information about which
build dependencies can be dropped (if needed). In the best case, a machine
could use those annotations to calculate a build order automatically.
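
To illustrate what "calculate a build order automatically" could mean, here is a minimal Python sketch. It assumes a simplified graph in which each source package directly lists the source packages that must be built before it (binary packages are collapsed into their sources), and a set of (package, dependency) pairs that have been annotated as droppable. The package names and the function name are hypothetical; `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(build_needs, droppable):
    """Return a build order after removing all edges annotated as droppable.

    build_needs: source package -> set of source packages needed to build it
    droppable:   set of (package, dependency) pairs that may be dropped
    Raises graphlib.CycleError if cycles remain after dropping.
    """
    reduced = {
        pkg: {dep for dep in deps if (pkg, dep) not in droppable}
        for pkg, deps in build_needs.items()
    }
    return list(TopologicalSorter(reduced).static_order())


# Hypothetical cycle: libfoo needs docgen for its documentation,
# docgen links against libfoo.
build_needs = {
    "libfoo": {"docgen"},
    "docgen": {"libfoo"},
    "app": {"libfoo", "docgen"},
}

try:
    build_order(build_needs, droppable=set())
except CycleError:
    print("no build order without dropping something")

print(build_order(build_needs, droppable={("libfoo", "docgen")}))
# → ['libfoo', 'docgen', 'app']
```

This sketch drops every annotated edge up front. A smarter tool would drop only as many as are needed to break the cycles and rebuild the affected packages in full afterwards.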

As one can see in the graph above, there are currently 370 source packages in
the main SCC. This means that no more than this number of packages (and
probably far fewer) have to be annotated to break the SCC into a directed
acyclic graph.

Discussion about what syntax to use to mark potentially droppable build
dependencies currently happens in bug#661538 but should maybe be discussed by a
wider audience. The currently favored solution was proposed in said bug report
by Guillem Jover and is called "build profiles". It has the advantage that it
is not only trivial to implement (a patch exists for dpkg, and dose3 already
supports them) but would also be useful for other purposes like embedded
builds. The format is similar to how architecture restrictions for individual
dependencies are specified but uses angle brackets:

Build-Depends: huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny

The work Patrick McDermott did for his GSoC project over the summer already
uses the above syntax.
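
To make the proposed format concrete, here is a minimal Python sketch that parses the example line and decides which dependencies still apply when a profile is active. It is not the dpkg or dose3 implementation. It assumes that profile terms behave like architecture restrictions (a dependency with only negated terms is dropped as soon as one of the named profiles is active), it ignores alternatives written with `|`, and it records architecture lists without evaluating them.

```python
import re

# name, optional (version), optional [architectures], optional <profiles>
ENTRY = re.compile(
    r"""^\s*(?P<name>[a-z0-9][a-z0-9+.\-]*)
        (?:\s*\((?P<version>[^)]*)\))?
        (?:\s*\[(?P<archs>[^\]]*)\])?
        (?:\s*<(?P<profiles>[^>]*)>)?\s*$""",
    re.VERBOSE,
)


def parse_build_depends(field):
    """Split a Build-Depends value into a list of dicts, one per dependency."""
    deps = []
    for entry in field.split(","):
        m = ENTRY.match(entry)
        if m is None:
            raise ValueError(f"cannot parse {entry!r}")
        deps.append({
            "name": m["name"],
            "version": m["version"],
            "archs": (m["archs"] or "").split(),
            "profiles": (m["profiles"] or "").split(),
        })
    return deps


def applies(dep, active_profiles):
    """Assumed semantics, modelled on architecture restrictions."""
    terms = dep["profiles"]
    if not terms:
        return True
    negated = [t[1:] for t in terms if t.startswith("!")]
    positive = [t for t in terms if not t.startswith("!")]
    if any(p in active_profiles for p in negated):
        return False
    return not positive or any(p in active_profiles for p in positive)


field = "huge (>= 1.0) [i386 arm] <!embedded !bootstrap>, tiny"
deps = parse_build_depends(field)
print([d["name"] for d in deps if applies(d, {"bootstrap"})])  # → ['tiny']
print([d["name"] for d in deps if applies(d, set())])  # → ['huge', 'tiny']
```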

When I wanted to use my dependency graph analysis tools to analyze earlier
states of Debian Sid, I naturally used snapshot.debian.org to retrieve the
Packages and Sources files from which my tools retrieve the dependency
information.

The problem is that many of those Packages and Sources files contain syntax
errors that make the dose3 parser choke. This leads to my tools being unable to
parse the affected files.

The following script not only downloads all Packages and Sources files in a
five day interval (4460 MB from 2005/03/12 to 2012/10/11) but also cleans up
all the syntax errors that dose3 could not parse. This includes invalid version
naming, architecture lists separated by commas, disjunctions in Conflicts
fields and incorrect brace/bracket usage.
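
To give an idea of what one of these fixes looks like, here is a minimal Python sketch (separate from the script mentioned above) that rewrites comma-separated architecture lists into the space-separated form. The function name and the package names in the example are made up for illustration.

```python
import re

ARCH_LIST = re.compile(r"\[([^\]]*)\]")


def fix_arch_lists(value):
    """Replace commas inside [...] architecture lists with single spaces."""
    def repl(match):
        archs = [a for a in re.split(r"[,\s]+", match.group(1)) if a]
        return "[" + " ".join(archs) + "]"
    return ARCH_LIST.sub(repl, value)


broken = "libfoo-dev [i386, amd64], bar (>= 2) [!arm,!mips]"
print(fix_arch_lists(broken))
# → libfoo-dev [i386 amd64], bar (>= 2) [!arm !mips]
```

Commas outside the square brackets are left alone, since they separate the individual dependencies.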

Maybe this helps others who also want to profit from Packages and Sources files
from the past.

Fun fact #1: starting from June 2010, there were no more syntax errors in the
Packages and Sources files of Debian Sid.

Fun fact #2: starting from December 2009, there are no more mismatches between
versions of binary packages in the Packages file and the versions of the
corresponding source packages in the Sources file.

Automatically devising a build order for bootstrapping Debian currently fails
(among other reasons) because of missing metadata about which build
dependencies can potentially be dropped from source packages. If that
information were available, an algorithm could decide which build dependencies
to drop so that dependency cycles can be broken.

Finding droppable build dependencies of a source package is something only
humans can do. This is because it involves manually analyzing and testing the
build system of a source package. Build systems are neither uniform nor do they
encode their dependencies in a way that can directly be mapped to Debian
packages. Therefore they are not machine readable.

One idea to solve the dilemma is to find a Linux distribution that provides
the following:

allows "profile builds" of its source packages with different features enabled
or disabled

stores information about which feature requires which build dependency

stores everything in a format that can be parsed and analyzed

covers a similar range of software packages as Debian does

If such a distribution can be found, then the information from it can be used
to find dependencies that can also be dropped from Debian source packages.

Gentoo is a distribution that fulfills the above requirements through so-called
USE flags, which allow features to be enabled or disabled during compilation.
Dependencies of Gentoo source packages are stored in .ebuild files that control
the build process. Since .ebuild files are bash scripts, parsing them is not
trivial. I therefore used the emerge software package to extract that
information. Thanks to the well-written emerge code and to quick help in the
Gentoo IRC channel, it didn't take long to make the code run on Debian. My
source code is downloadable here:

Before I list the results of using Gentoo USE flags to determine dependencies
that can potentially be dropped from Debian source packages, let me list the
problems that this method entails.

Only package name matching, no version matching

When writing the mapping from Debian to Gentoo packages and back I discard
version information. There are just too many versions that exist in either
Debian or Gentoo but not in the other. So the assumption is that Debian Sid and
Gentoo both have the most recent major versions of upstream software, which
have roughly the same requirements in terms of build dependencies.

Gentoo packages are matched to Debian source packages

In Gentoo there are only source packages and no binary packages. So I map
Gentoo packages to Debian source packages. But Gentoo source packages
build-depend on other source packages while Debian source packages build-depend
on binary packages. So at some point I have to translate Gentoo packages to
Debian source packages and those source packages to Debian binary packages. I
do this by taking the original binary package build dependencies of a Debian
source package and marking as droppable those binary packages that are built by
Debian source packages that were found to be droppable.
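
The translation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not my actual code: the function name, the dictionaries and all package names are hypothetical, and version information is ignored as explained earlier.

```python
def droppable_build_depends(source, build_depends, built_from,
                            gentoo_to_debian, gentoo_droppable):
    """Return the binary build dependencies of `source` that may be droppable.

    build_depends:    Debian source package -> list of binary build dependencies
    built_from:       Debian binary package -> Debian source package building it
    gentoo_to_debian: Gentoo package name -> Debian source package name
    gentoo_droppable: Gentoo packages that USE flags allow to be dropped
                      for the Gentoo counterpart of `source`
    """
    # Gentoo packages without a Debian counterpart are silently skipped
    droppable_sources = {
        gentoo_to_debian[g] for g in gentoo_droppable if g in gentoo_to_debian
    }
    return [
        binary for binary in build_depends[source]
        if built_from.get(binary) in droppable_sources
    ]


# Hypothetical data for one source package called "app".
build_depends = {"app": ["buildtool", "libfoo-dev", "libbar-dev", "libbaz-dev"]}
built_from = {"buildtool": "buildtool", "libfoo-dev": "foo",
              "libbar-dev": "bar", "libbaz-dev": "baz"}
gentoo_to_debian = {"dev-libs/foo": "foo", "dev-libs/bar": "bar"}
gentoo_droppable = ["dev-libs/bar", "dev-libs/only-in-gentoo"]

print(droppable_build_depends("app", build_depends, built_from,
                              gentoo_to_debian, gentoo_droppable))
# → ['libbar-dev']
```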

Not the exact same package set

There is some software that is only in Gentoo and some that is only in Debian.
Debian and Gentoo also split some source packages differently.

Gentoo has more direct dependencies

Many build dependencies in Debian are indirectly pulled in through dependencies
of direct build dependencies. In Gentoo, source packages directly depend on
most things they need to build successfully. This makes the list of
dependencies in Gentoo much larger than the list of dependencies in the
corresponding Debian source package. It also means that lots of dependencies
that can be dropped in Gentoo are not found to be droppable in Debian because
they are not direct dependencies of that source package.

There are no implicit dependencies

Gentoo will often drop dependencies that are essential or build-essential
packages in Debian and are therefore implicit build dependencies that cannot be
dropped.

Result

Despite the many problems, the result doesn't look too wrong. Thorsten Glaser
gave me some Debian source packages that he had found to have droppable build
dependencies, and all dependencies that Gentoo found to be droppable had also
been dropped by him.

To put everything into numbers: the current SCC of 912 nodes in Debian Sid can
be reduced to 6 individual SCCs with 422, 5, 5, 3, 2 and 2 nodes each. So using
Gentoo cuts the size of the central component by more than half (from 912 to
422 nodes).

Surely, there will be a number of dependencies that were found to be droppable
in Gentoo but are actually not droppable in Debian. The point is that it is
better to have "some" data, even if it contains false positives, than no data
at all. It is easier for a human to verify whether some suggested droppable
build dependencies are actually correct than to go through hundreds of source
packages with thousands of dependencies manually.