A Download Redirector and Torrent/Metalink Generator

News Archive

Here's 2.18.0, nearly two years after the last release. Well, with
MirrorBrain running solidly here and there, how much could there be to
do? Yes, one must prevent "bitrotting" and make sure that building on
today's platforms works, and there were also quite a few accumulated bug
reports and even a few patches. So here we go! First and foremost, an annoying
bug was fixed that hit new installs (error message about a missing
database column). Plus numerous other small bug fixes. And most
pleasingly, the HTML output of the .mirrorlist pages has been
modernized. (You might therefore want to adjust your CSS styling.)

Update: We have already issued another point release, 2.18.1, because the
geoip-lite-update script had a small bug.

I nearly missed this one! For quite a while now, VLC downloads have been handled via MirrorBrain. That's great news! The VLC folks describe here how they used Sourceforge for a certain time, and then rethought their mirror infrastructure to use MirrorBrain.

MirrorBrain 2.13.4 improves usability of the mirror scanner, by adding a terse report format (which makes it easy to spot problems), and a totally quiet mode where only errors will be output. Surely something that everybody has been waiting for (and myself not the least).

This release also improves usability in some other corners, and adds important documentation. Noteworthy are the added instructions on setting up automatic GeoIP database updates. See the 2.13.4 release notes for details.

Packaged binaries are built and ready for upgrading. You will find them on the download page, as usual.

The Document Foundation was launched on 28th September 2010 and is proud to be the home of LibreOffice, the next evolution of the world's leading free office suite. Its mirror network was built on MirrorBrain from the start, and in a very short time. Thanks to the awesome support of the involved mirrors, a mere 24 hours were enough to get 33 mirrors up and running. Viva la LibreOffice!

MirrorBrain 2.13.2 adds worthwhile new features to the mirror list generator that you will enjoy:

The content of the mirror lists (details pages) is now wrapped into an XHTML/HTML DIV container to allow for individual styling. In addition, an arbitrary XHTML/HTML header and footer can be specified to be placed around the page body.

Due to popular demand, the way hashes are sent can now be influenced. A client can request the pure hash, without filename, via a query parameter in the URL. Likewise, admins can configure this site-wide with a new Apache configuration directive.

I am glad to announce that OpenOffice.org just completed switching their download system to MirrorBrain.

The project is releasing OpenOffice.org 3.2 today. While MirrorBrain has already delivered 3.1.1 recently, 3.2 is the first release that is fully handled through MirrorBrain.

Downloads of OOo have long suffered from a lack of mirror selection and from stability issues. The new setup should greatly help users in obtaining OOo, and facilitate the spread of this important piece of free software.

A lot of work went into this migration, and I want to thank everybody involved! A really good team!

Just 5 days after the last release, MirrorBrain 2.11.0 has been released. In addition to lots of bug fixes and minor corrections, there is a new feature.

It’s now possible to configure "fallback mirrors" in the Apache config, using the MirrorBrainFallback directive; these mirrors are used when no reachable mirror is found in the database. Thus, these mirrors get all those requests that MirrorBrain would otherwise deliver itself (which is the normal last-resort behaviour). This makes it possible to run a MirrorBrain instance with a pseudo file tree (cf. the recently added null-rsync script). In planning is a "degraded mode" that keeps MirrorBrain running during a database outage, for which the new feature is one of the foundations. This new feature is still in its infancy, but ready to be tested. It may be subject to refinement, based on future discussion.

Other enhancements and bug fixes:

mod_mirrorbrain:

Compile fix for old APR (1.2)

Obsolete MirrorBrainHandleDirectoryIndexLocally removed

Default of MirrorBrainHandleHEADRequestLocally changed to off

mb:

Parse errors in the configuration file are now caught and reported nicely.

Passwords now can contain special characters.

mb scan:

A warning that appeared since the last release has been removed. It was caused by the removal of obsolete code, and purely cosmetic.

null-rsync:

An --exclude command line option has been implemented; it is passed through to rsync.

MirrorBrain 2.10.3 has been released. This is a minor bugfix and feature update,
but the changes are not insignificant all the same.

First, there is a new program called null-rsync. It creates a pseudo mirror of a remote
file tree, without occupying significant disk space. Use case: running MirrorBrain instances
without hosting the file tree locally; and also experimentation and development.
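To illustrate how a pseudo file tree can avoid occupying disk space, here is a minimal Python sketch. It is not the null-rsync code; it assumes that a placeholder only needs the right size and modification time, and it uses sparse files for that. The function names and the format of the listing are made up for the example.

```python
import os


def make_pseudo_file(path, size, mtime):
    """Create a placeholder that reports `size` bytes and the given mtime.

    truncate() extends the file without writing data, so on filesystems
    that support sparse files almost no disk space is used.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.truncate(size)
    os.utime(path, (mtime, mtime))


def make_pseudo_tree(root, listing):
    """`listing` is an iterable of (relative_path, size, mtime) tuples,
    as one might obtain from a remote file listing (hypothetical format)."""
    for relpath, size, mtime in listing:
        make_pseudo_file(os.path.join(root, relpath), size, mtime)
```

A directory listing run against such a tree shows the same names, sizes and timestamps as the real tree, which is all that is needed when the files themselves are always served by mirrors.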

Then, this release fixes usability issues in the mb tool that could occur when creating new
mirrors and running into DNS intricacies. The change is that the admin is now given a link to
in-depth background information, which is hopefully helpful.

Finally, some small sorting issues in the generation of mirror lists have been fixed.

(In fact, this release followed 2.10.0 by only a few days, and thus has been available since 9th of September. Due to lack of time it wasn't formally announced earlier. Apologies.)

2.10.1 revised the metalink hash cache again, after it was found that some filesystems do not guarantee stable inode numbers. To avoid expensive regeneration of hashes, previously existing hash files are automatically migrated. As a new feature, the metalink-hasher can now easily be run in parallel on large file trees, since it uses per-directory locking to make sure that two jobs won't work on the same files.
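The idea of per-directory locking can be sketched in a few lines of Python. This is not the metalink-hasher implementation; the lock file name and the function names are assumptions made for the example. Each job tries to create a lock file atomically in a directory and simply skips the directory if another job got there first.

```python
import os

LOCK_NAME = ".hasher.lock"  # hypothetical lock file name


def try_lock_dir(directory):
    """Return the lock path if we got the lock, or None if it is taken."""
    lock_path = os.path.join(directory, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return lock_path


def process_tree(root, handle_dir):
    """Walk the tree and call handle_dir(dirpath, filenames) for every
    directory that no other job is currently working on."""
    done = []
    for dirpath, dirnames, filenames in os.walk(root):
        lock_path = try_lock_dir(dirpath)
        if lock_path is None:
            continue  # another job holds this directory
        try:
            handle_dir(dirpath, [f for f in filenames if f != LOCK_NAME])
            done.append(dirpath)
        finally:
            os.remove(lock_path)
    return done
```

Several such jobs can be started on the same tree; a directory that is busy is left alone, and the lock disappears again once its files have been handled.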

Closely following 2.9.0, there is 2.9.1 out now. This fixes two (old) bugs that became apparent just now.

One concerns new installations: If the supplementary tool geoiplookup_continent wasn't installed yet, it was impossible to create a new mirror, because the mb new tool relied on its existence. Now, a meaningful error message should point in the right direction.

Regarding the other issue, it is not likely that anyone (but me) ran into it. It turns out that database connection strings used in the Apache configuration need to be unique per vhost. This release adds debugging output that may be helpful to track this down.

An important change is that a restriction in the mb tool which made it require mod_asn to be installed alongside MirrorBrain has been removed. Thus, MirrorBrain can now be installed without installing mod_asn.

The tools have been much revisited. The metalink-hasher received major work. File probing has been parallelized, and enhanced with many features.

Perhaps the most significant advance is the new docs subdirectory in the code tree. Any changes there are automatically reflected online at http://mirrorbrain.org/docs/. The current content there still needs to be looked at with one eye slightly squinted, but now everything's up and running to really document things.

I wrote up what I believe could be a good plan for collecting download
statistics. I believe it would satisfy the needs of many projects. And not only MirrorBrain
users — as projected, it could be used independently, or with other
redirectors.

To make it easy to try out MirrorBrain and play with it, there's now a VirtualBox Appliance ready to download: it contains an openSUSE 11.1 system with a MirrorBrain 2.8.1 setup installed. (Regularly updated; see the download page.)

The image is about 500 MB in size and can be downloaded from http://mirrorbrain.org/eval/openSUSE_11.1/ or rsync'ed from rsync://mirrorbrain.org/mirrorbrain-eval/ . There's a README file which contains further notes useful to set up the image.

svn switch --relocate doesn't work in this case, unfortunately, because both the server URL and the path inside the repository have changed. The following worked for me on Linux and OSX, but your mileage may vary. It is recommendable that you just get a new working copy. If you want to try it, do so on a backup of your working copy. Don't update your working copy from the old location first:

The MirrorBrain web site was completely rewritten and launched today. On the surface, it looks very similar, but behind the scenes everything is new and shiny. I switched from the Zope application server (which I have been very happy with) to the Django web framework (which I'm even happier with).

SourceForge.net has worked on their mirror redirector to improve mirror
selection, and announced the launch of their new redirector
yesterday. Their new mirror selection uses parts of MirrorBrain. This is
great, and there could be room for more collaboration in the future!

The security of the way openSUSE delivers its content has been
recognized by a paper in ;login, the USENIX association's magazine.
According to the article, openSUSE is the only community Linux
distro that's on par with enterprise Linux distributions in protecting
against recently discovered package management vulnerabilities.

This is a combined result of the design of metadata, of client features
and the setup chosen by openSUSE - and MirrorBrain. MirrorBrain plays a
central role because it provides cryptographic signatures and allows
fine-grained configuration to make sure that certain key files are
always delivered directly.

The goal of this is that users can download software and deploy updates
safely even though they're obtaining them through a decentralized system
of community maintained mirrors.

The mirror scanner program underwent a cleanup and now offers a
better way to include or exclude files on mirrors. Old, hardcoded
excludes have been removed from the program, and excludes are now
configurable where one would expect it: in /etc/mirrorbrain.conf.
There are two ways to configure excludes:

scan_exclude = REGEXP [...]
scan_exclude_rsync = RSYNC_PATTERN [...]

The former directive takes regular expressions and is effective for
FTP and HTTP scans. The latter takes rsync patterns and is effective for
rsync scans, because these patterns are passed directly to the remote
rsync daemon. (This constitutes a duplication for the
admin, and it would be nice if it were possible to automatically
convert rsync patterns into regexps and vice versa, to be able to
specify the excludes only once.)
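As a rough illustration of regexp-based excludes, here is a small Python sketch. It is not the scanner's code: whether the scanner matches with search or match semantics is an assumption here, and the example patterns are invented. The last function hints at the conversion wished for above, but it only covers plain shell-style globs; real rsync patterns have additional rules (anchoring, "**", trailing slashes) that it ignores.

```python
import fnmatch
import re


def compile_excludes(regexps):
    """Compile a list of exclude regular expressions once."""
    return [re.compile(r) for r in regexps]


def is_excluded(path, compiled):
    """A path is skipped if any exclude expression matches anywhere in it
    (search semantics are an assumption of this sketch)."""
    return any(rx.search(path) for rx in compiled)


def simple_glob_to_regexp(pattern):
    """Naive conversion of a plain glob such as '*.iso' into a regexp.
    This does not implement rsync's pattern rules."""
    return fnmatch.translate(pattern)
```

For instance, compile_excludes([r"^tmp/", simple_glob_to_regexp("*.iso")]) would skip everything below a top-level tmp directory as well as all ISO images.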

A mirrorbrain.conf directive with similar effect is
scan_top_include. It lists directories at the top level of the tree
that are scanned; all others are ignored:

scan_top_include = DIR [...]

With the new configurability, and much better excludes,
the size of the openSUSE database could be decreased by 20%. For many
mirrors, scan time is considerably shorter with good exclusions. (The
reason is that some mirrors have foreign stuff in-tree, or keep old
files.)

A bug was fixed where the scanner could abort when encountering
filenames in (valid or invalid) UTF-8 encoding.

There is mb dirs, a new subcommand for showing directories that the
database contains, useful to tune scan exclude patterns. See output of "mb
help dirs".

There is mb export --format=vcs, which implements a new output format named
"vcs". It is suitable to commit changes to a subversion repository and get
change notifications from it. The command generates a file tree which
can be imported/committed into a version control system (VCS). This
mechanism can be used to periodically dump the database into a working
copy of such a repository and commit the changes, making use of the
standard commit mail mechanism of the VCS to send change notifications.

In the scanner, deletion of files for subdirectory scans from the
mirror database is now implemented. This required a full scan before,
because the database was too bloated to efficiently select the affected
files. So this became possible with the new database schema. Very cool
is that this opens the door for better scanning, which works much more
directory-based now and can do cleanups whenever needed. This again
allows for a tighter integration of mirror syncing with the database
update. A (push) rsync can not only trigger a scan right after syncing a
directory, but it could also enter the files directly into the database
-- and delete the ones that are obsolete.

A bug in the scanner which prevented the correct usage of
inclusion/exclusion of top-level directories in relation to subdirectory
scans has been fixed.

The mirror choice can now be influenced with a query parameter,
as=1234, appended to the URL. The number specifies the autonomous system
number which the server will base its mirror selection on,
instead of the AS of the client IP. Another possible parameter is country=XY,
where XY is a two-letter country code. As an example, you could look at
the following URLs:

The first URL gives a result depending on your location. The other two
generate a list for AS 680, or for the United Kingdom, respectively.
This shows some of the criteria for mirror selection that MirrorBrain
uses. (In reality, it uses more criteria for mirror selection; whatever
is available.)
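How such query parameters can override the detected values may be sketched like this in Python. The real handling happens inside the Apache module; the function name, the example host and the way the detected values are passed in are assumptions for the sake of illustration.

```python
from urllib.parse import parse_qs, urlsplit


def selection_overrides(url, detected_as, detected_country):
    """Return the (AS number, country code) to base mirror selection on.

    The detected values are used unless the URL carries as= or country=.
    """
    query = parse_qs(urlsplit(url).query)
    asn = detected_as
    country = detected_country
    if "as" in query:
        asn = int(query["as"][0])
    if "country" in query:
        country = query["country"][0].upper()
    return asn, country
```

With a client detected in AS 64500 and country "FR", a URL ending in ?as=680 would yield (680, "FR"), and one ending in ?country=gb would yield (64500, "GB").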

Just as the mirrorlist is more or less for human admins to see what's
going on, the as= and country= parameters are not meant for machine clients to
technically influence the mirror selection. For that, it would be more
appropriate to override the IP address "detection" in the first place.
The IP address, as looked up by mod_geoip and mod_asn, could be passed
via a X-Forwarded-For header, for instance. This would allow frontend
servers to influence the mirror selection appropriately. mod_geoip
already supports this. For mod_asn I plan to add this in the future.
mod_mirrorbrain just lets mod_asn and mod_geoip do that work and uses
what it finds in Apache's subprocess environment.

The "mb list" tool has new options to customize what's being displayed
when mirrors are listed, namely:

--country --region --prefix --as --prio

The "mb file ls" tool can now probe files that were looked up in
the mirror database. So, contrary to "mb probefile", which probes for a
given file on all mirrors, "mb file ls --probe" looks up which mirrors
are known to have a certain file, or a certain list of files matching a
pattern. The --probe switch causes it to probe the file on each mirror,
and the --md5 switch to display the md5 hash of the returned content.
This can be used to check functionality of the mirrors. Example:

[Quite a long text, which aims to explain the recent under-the-hood
changes.]

MirrorBrain 2.7 has been released, with the main change being a huge
improvement in the database structure.

I have been using a "classic" relational database schema for years now, and
wasn't very happy with the relational table alone being 2-3G
in size with indexes, for the huge openSUSE file tree. For a
small database that doesn't matter at all, but that file tree
happens to be large enough (and growing) that 2.000.000 files and 200 mirrors result
in a sufficiently large number of rows in the relational table that the
size is unavoidable. After optimizing out everything which wasn't
needed, I still found 48 bytes used per row (two references to primary
keys, and one timestamp column that was used to determine whether a file
has been seen before or during the last scan).

I had an idea about a completely different organization of these data
which doesn't waste 48 bytes per file per mirror where, in theory, one bit in
a bit field would suffice. I found something that comes close in
PostgreSQL in the form of the array datatype. The "list of mirrors per
file" is now an array of two-byte integers which lives in a single
column directly next to the path name. That way, only a single index
remains.
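The difference between the two layouts can be modelled in a few lines of Python. This is only a model of the idea, not the PostgreSQL schema: the dictionary stands in for the table, Python's array type with two-byte items stands in for the array column, and the payload functions deliberately ignore any per-row or per-array overhead that the database adds.

```python
from array import array

BYTES_PER_RELATION_ROW = 48  # per (file, mirror) pair, as measured above
BYTES_PER_ARRAY_ENTRY = 2    # one two-byte integer per mirror


def relation_payload(pairs):
    """Bytes needed for `pairs` rows in the old relational table."""
    return pairs * BYTES_PER_RELATION_ROW


def array_payload(pairs):
    """Bytes needed for the same pairs stored as array entries."""
    return pairs * BYTES_PER_ARRAY_ENTRY


def add_mirror(files, path, mirror_id):
    """Record that a mirror has a file; `files` maps path -> array of ids."""
    ids = files.setdefault(path, array("h"))
    if mirror_id not in ids:
        ids.append(mirror_id)


def mirrors_for(files, path):
    """Look up the mirrors for a path: one lookup, no join."""
    return list(files.get(path, ()))
```

For every (file, mirror) pair, the payload shrinks from 48 bytes to 2 bytes in this model; the real saving is smaller because the path names and the remaining index still take their share.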

All in all, the openSUSE database is now 5 times faster and 1/3
the size, which is exactly what I wanted.

The data is also more logically structured, looking up mirrors for a
file doesn't require table joins anymore (which already were damn
fast...), and the single index is a fast b-tree which is perfect for all
needs. In particular, it is now easy to do efficient substring matches
on the beginning of path names, which would have required a join over
huge tables in the past. (The smaller size helps a lot as well, of
course.)

This opens the door for fixing a previous shortcoming in the scanner: it was not
possible to efficiently delete files from a subdirectory only, which
have disappeared between two scans. That's now straightforward to
implement. It also opens the door for a tight integration of mirror
syncing with database updating, which would work its way through a
large tree on a directory basis.

The scanner doesn't need a timestamp anymore. It now creates a
temporary table with the list of files at the beginning, scans, and in
the end it just deletes all files that are still in the temp table.
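In Python terms, the temporary-table approach boils down to something like the following sketch. A set plays the role of the temp table; the function name and the prefix argument (for scans restricted to a subdirectory) are assumptions of the example, not the scanner's interface.

```python
def scan(known_paths, seen_on_mirror, prefix=""):
    """Return (new_paths, vanished_paths) for one mirror.

    known_paths:    set of paths the database has for this mirror
    seen_on_mirror: paths found during the scan
    prefix:         restricts the scan to a subdirectory, if given
    """
    # Stand-in for the temp table: everything we expect to see again
    unseen = {p for p in known_paths if p.startswith(prefix)}
    new = set()
    for path in seen_on_mirror:
        if path in unseen:
            unseen.remove(path)   # still there, nothing to do
        elif path not in known_paths:
            new.add(path)         # appeared since the last scan
    # Whatever is left was not seen and can be deleted
    return sorted(new), sorted(unseen)
```

Because only paths below the prefix are put into the set, a subdirectory scan never deletes files outside the directory that was scanned.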

With this change, MySQL is no longer supported; at least not by the
framework as a whole. The core, mod_mirrorbrain, will still work;
it doesn't care about the database, it just runs a database query and
the query can be anything.
The rest of the framework has become quite adjusted to the
PostgreSQL database schema now.

Of course, if there's interest, MySQL support in the toolchain could be
maintained as well. For now, nobody uses it.

MirrorBrain 2.6 has been released, with a major new feature. Through the
Apache module mod_asn, it uses BGP routing data to introduce two
additional mirror selection criteria: network prefix and autonomous
system number (AS). This network-topological knowledge supplements the
country-based mirror selection (which relies on the GeoIP database).
They work at a much lower level and don't replace the latter. The
country lookup is still needed for many requests, because there are many
more ASs than mirrors — but for a subpopulation of users the
change has a significant impact.
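A simplified way to picture how the criteria stack up is the following Python sketch. It assumes a strict order (same prefix first, then same AS, then same country) and leaves out everything else the real selection takes into account; the Mirror class and its field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    prefix: str   # network prefix the mirror lives in
    asn: int      # autonomous system number
    country: str  # two-letter country code


def rank(mirror, client_prefix, client_asn, client_country):
    """Lower is better: 0 = same prefix, 1 = same AS, 2 = same country."""
    if mirror.prefix == client_prefix:
        return 0
    if mirror.asn == client_asn:
        return 1
    if mirror.country == client_country:
        return 2
    return 3


def order_mirrors(mirrors, client_prefix, client_asn, client_country):
    """Sort mirrors by closeness to the client (stable for equal ranks)."""
    return sorted(
        mirrors,
        key=lambda m: rank(m, client_prefix, client_asn, client_country),
    )
```

For most clients nothing matches at the prefix or AS level, and the country decides as before; the network criteria only kick in where a mirror actually sits close to the client.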

I owe a big "thank you" to Björn Metzdorf who approached me with this
idea, nearly a year ago. Also, Christian Deckelmann, Simon Leinen and Marko
Jung have provided very fruitful discussion, insight and support.

The change has a number of important implications:

It increases the likelihood of selecting the fastest mirror for a
client. (See below.)

Traffic from clients of, for instance, a large university network
can be sent to their local mirror automatically, with full-featured
fallback to external mirrors if the internal one doesn't have what's
requested yet. Such a local mirror is highly likely to be the fastest one.
This has the potential to save large amounts of needless traffic
between organizations.

Due to the further narrowing on subnet prefix, this works also for
huge "hypertrophic" autonomous systems like the German AS680 which
contains the majority of the universities.

This can be interesting for corporations / organizations which desire
to run a mirror and have only their clients sent to it. The point is:
the new criteria can effectively be used not only for mirror
selection, but also to limit mirror selection to a certain client
population, based on network topology. The option to set up a
"private" mirror can spare the organization external traffic.

And this should be helpful for regions with thin or costly Internet
bandwidth, enabling them to establish new mirrors. They can receive normal
redirects from MirrorBrain, but have the requests restricted to those
from clients in the vicinity of the mirror (same network). Thus,
traffic to clients would primarily be local traffic, and the need for
outgoing bandwidth would be small compared to what a "traditional"
public mirror would have to expect.

This will hopefully lower the bar for finding mirrors in many countries.
Please spread the word!

The change is up and running on download.opensuse.org and also on the
other MirrorBrain instances.

It is written with scalability in mind. To do lookups at high speed, it uses
the PostgreSQL ip4r datatype that is indexable with a Patricia
Trie algorithm to store network prefixes. This algorithm
can search through the ~250.000 existing prefixes in a breeze.
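To give an impression of why a trie makes such lookups cheap, here is a small Python sketch of a longest-prefix match. It is a plain, uncompressed binary trie for IPv4 and not the ip4r index or a real Patricia trie (which additionally collapses chains of single-child nodes); the class name and the example prefixes and AS numbers are made up.

```python
import ipaddress


class PrefixTrie:
    """Binary trie mapping IPv4 prefixes to (prefix, AS number)."""

    def __init__(self):
        self.root = {}

    def insert(self, prefix, asn):
        net = ipaddress.ip_network(prefix)
        # Only the first `prefixlen` bits of the address are significant
        bits = format(int(net.network_address), "032b")[:net.prefixlen]
        node = self.root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["value"] = (prefix, asn)

    def lookup(self, ip):
        """Return the most specific (prefix, asn) covering ip, or None."""
        bits = format(int(ipaddress.ip_address(ip)), "032b")
        node = self.root
        best = node.get("value")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            best = node.get("value", best)
        return best
```

A lookup walks at most 32 steps regardless of how many prefixes are stored, which is why a few hundred thousand prefixes are no problem.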

It comes with a script to create such a database (and keep it up to date) with
snapshots from global routing data - from a router's "view of the
world", so to speak.

Apache-internally, the module sets the looked up data as env table variables,
for perusal by other Apache modules. In addition, it can send it as response
headers to the client.

MirrorBrain actually uses this already. Announcement to follow. :-)

The source code is available under the terms of the Apache License, Version 2.0.

It is available here (requires an openSUSE buildservice account) or
here (in source RPM form). You can browse (or check out) the source
code from the svn repository (viewvc link).

Version 2.5 was released: it adds support for using the PostgreSQL
database as backend, alternatively to MySQL.

MySQL is still fully supported. However, PostgreSQL is recommended now,
particularly for large installations with dozens of mirrors and more.
PostgreSQL support is an important step to a next-generation mirror
selection regime that is in the works. Migration is pretty easy; the mb
tool can export data in a format that PostgreSQL can understand.

The 2.5 release also sees major improvements in the mirror scanner. It
now produces much more readable output and error reporting. This makes
it easier to spot problems encountered on mirrors. Database
operations done by the scanner are more efficient in this release.

The installation instructions have undergone a rework to be more
complete, and reflect the recent changes.

Second, the toolchain got the following new features and improvements:

the mirrorprobe now does GET requests instead of HEAD requests. This is safer.
A mirror with crashed filesystem might still be able to answer a HEAD correctly.

mb, the mirrorbrain tool, has a powerful "probefile" command now that can
check for existence of a file on all mirrors, probing all known URLs -
HTTP, FTP and rsync ones. This is especially useful for checking whether the
permission setup for staged content is correct on all mirrors.

Third, the database got new fields named public_notes,
operator_name, operator_url, to store additional data about
mirrors. Plus, it got two new tables:

The principal change of this release is massive space savings in the
mirror database. An unused database column was eliminated, which was
intended to serve as a special index when the database was designed. It
didn't bring any benefits, but increased the database size by as much as
30-40%.

There is a new feature: It's now possible to configure specific mirrors to get
only requests for files smaller than a certain size.

Mirrors with limited bandwidth can easily become very slow, and result in a bad
user experience, when large files are downloaded. So there is often a need to
disable redirection to such mirrors altogether. However, those mirrors could still
be useful to handle smallish requests. So the idea is that you just don't send
them requests for the very large files. At the same time, this takes load off
them and should increase their performance for those smaller requests.

With the latest change in MirrorBrain, you can configure a maximum filesize for
specific mirrors in the database.
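The effect of such a limit can be pictured with a tiny Python sketch. How the limit is actually stored and evaluated is not shown here; the function name is invented, and using None for "no limit" is an assumption of the example.

```python
def mirrors_for_size(mirrors, file_size):
    """Keep only mirrors that may serve a file of `file_size` bytes.

    `mirrors` is a list of (name, max_size) pairs; max_size is None for
    mirrors without a limit (an assumption of this sketch).
    """
    return [
        name
        for name, max_size in mirrors
        if max_size is None or file_size < max_size
    ]
```

A mirror limited to 100 MB would then still receive requests for packages and metadata, while requests for DVD images go to the other mirrors.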

Another change in this release is a significant simplification of the Apache
configuration.

The fallback mirror selection has been considerably improved. Fallback
mirrors are now defined in the SQL database instead of the Apache
configuration. This approach is a lot more flexible, and allows
arbitrary mirrors to be assigned to handle arbitrary countries. But most
importantly, fallback mirrors are no longer considered unconditionally,
but are chosen only when no local mirror could be found. (Note that the
obsolete ZrkadloTreatCountryAs directive has been removed from the
Apache config.)

Since today, generated metalinks can automatically include links to
Bittorrent resources. When a file name ending in .torrent is found, a
hyperlink to it is included in the metalink.

The webserver does all this fully automatically. However, the additional
check doesn't need to be done by Apache for every request. With a new
configuration directive, a file mask can be given which specifies the
files for which this happens, e.g. *.iso. Thus, there is no tradeoff
in scalability.
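The role of the file mask can be sketched as follows in Python. The real check is done by the Apache module; the function name and the default mask are assumptions. The point is that the filesystem is only consulted for files matching the mask, so ordinary requests pay no extra cost.

```python
import fnmatch
import os


def torrent_for(path, mask="*.iso"):
    """Return the path of an accompanying .torrent file, or None.

    The existence check only happens for file names matching `mask`.
    """
    if not fnmatch.fnmatchcase(os.path.basename(path), mask):
        return None  # not covered by the mask: no filesystem access
    candidate = path + ".torrent"
    return candidate if os.path.exists(candidate) else None
```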

So how can you use this new feature? The command line metalink client
aria2 can automatically use P2P resources and HTTP resources from
metalinks at the same time. There are other clients, check
http://en.wikipedia.org/wiki/Metalink.

Metalinks can now automatically include PGP signatures. When a file name
ending in ".asc" is found, its content is embedded into the
metalink.

The command line metalink client aria2 automatically downloads
the PGP signature file, so it can be verified locally. Note that aria2
doesn't verify the signature itself.

This new feature is implemented carefully to have no impact on scalability and
performance. Apache doesn't need to scan for further files or open them and
read their content. The signature file's content is saved together with the
piece-wise hashes - which are created offline with the
metalink-hasher script.

MirrorBrain is the first metalink generator that automates this. Hopefully,
this paves the way for wider use of this very interesting feature of metalinks.

A few days ago, I gave a presentation about the current state at the
openSUSE offices. See the presentations page. It is updated for the
newest state of affairs, and gives details about the deployment at
openSUSE and things that can be learnt from it. It is available as ogg
video and PDF (slides). The video includes a live demo of two popular
metalink clients.