Submitting a research paper to a conference or journal can be a stressful
process; there are many things to remember, and mistakes can have dire
consequences. Here is my attempt to summarize all the steps, along with my
experience on some points: it's easier to follow a checklist than to
remember every time what you have to do. Note that I'm making some implicit
assumptions, e.g., that you are using a revision control
system, etc.; also, this list
may not be sensible or exhaustive in all fields (I'm talking from the perspective
of theoretical computer science).

TODO flushing.

There should be no tasks left before you start finalizing the
submission. Of course, it's fine to remove a TODO because you decided not to
do it: you did not have the time or the energy, it wasn't worth the
effort, etc.; but the status of all TODOs must have been cleared, so you
won't realize after proofreading that there were some changes left that you
intended to do. Possible sources of TODOs to process:

In the main text: those should be the most obvious.

In LaTeX comments: I don't use those, but some people do; consider grepping for
"TODO" or similar keywords (for this reason, TODOs that you gave up on
should probably be moved to a separate file, not just commented out).

In a separate TODO file: did you start jotting down things to do in
separate files ("TODO", "ideas", "meeting_notes")?

In your mailbox: if you have any mail in your inbox about the submission,
you probably intend to get rid of it once submitted, so go through it
before you start finalizing, and ensure nothing is left to do.

In the reviews of a previous unsuccessful submission of the paper.

Once the only task left is "proofread the submission", you can continue.

Proofreading.

Go over the submission once, linearly, from start to end. You may
want to do it on screen, or on paper (annotating the paper copy and then
implementing any changes). If using a
WYSIWYM system such as LaTeX, whenever you
edit anything during proofreading, it is crucial to recompile and
reread what you just changed, preferably the whole paragraph: fixing errors
can easily introduce new errors, and you won't be proofreading again what
you are currently changing.

Hopefully, at this point, the scientific content of the submission should not
change. If it turns out you have to make significant fixes when proofreading,
consider doing a second proofreading pass.

Check that you proofread the following, which are easily forgotten:

Paper title, author information: especially affiliations, which are often
filled in casually early on and may not be sensible.

Section titles in the paper: go once through the paper looking just at
section titles, to check that they make sense and match the "Paper
structure" paragraph if you have one.

Footnotes.

Floats, figures, algorithms, etc.

Formal statements: it's easy to focus on the prose, but you should
double-check that the precise wording of theorems makes sense, no matter
how they are framed.

Bibliography: Do all references follow the same format? Are you consistent
when abbreviating venue names? when including or discarding page numbers,
addresses, publishers? Are all author names sensible? Are all acronyms and
proper names correctly capitalized in titles (e.g., no "Designing an xml
language for hadoop")?

Once you are done proofreading, you have a known good version. It may make
sense, when doing further changes, to check their diff against that version to
spot any errors that you would be introducing.

Spellchecking.

Use any spellchecking tool. Also check for notations and terms and other
things that shouldn't be there anymore:

Remove the definitions of macros that should no longer be used (for old
macros, for TODO notes, etc.)

If you introduced a macro to replace a notation and changed the notation,
check for hardcoded instances of the notation or of previous notations.

If a name has multiple variants and you have been using them
inconsistently, standardize to one name (e.g., "foo bar" to "foo-bar").
When looking for occurrences of multi-word names, don't forget that
there may be a line break in the middle (i.e., check for "foo" at end of
line).

Typography.

Check for the following:

Math identifiers at the beginning of lines: if "of x" gets split with the math
identifier x at the start of a line, make the space between the words
non-breaking (in LaTeX, "of~x").

Overfull \hboxes: check for them in the LaTeX log and fix them by
rewording, or tweaking hyphenation for technical terms that LaTeX does not
know.

General page layout: widows and
orphans, spacing around environments
and floats, positioning of section titles (not at the bottom of a column!),
etc.

Sanity checks.

Verify the following one last time:

Go through the call for papers and formatting instructions:

Did you get the page limit right? Including or excluding references?

Are you using the correct style file? Are the font size and paper size
correct?

Should you be including author information? (depends on whether the venue is
double-blind)

Search one last time in the PDF, and grep in the source folder, for bad strings like "TODO",
"FIXME", and other such stuff.

Check the LaTeX log for bad errors like broken references or
multiply-defined labels. Search for "??" in the PDF to double-check that there are no broken references.

Will you be submitting your LaTeX source? If yes (e.g., on arXiv), make sure
there are no embarrassing comments left in the LaTeX.

Are all your changes committed to the revision control system? Check this
on all the machines where you have been editing the document (to make sure
that there aren't uncommitted changes on a different machine).

Submitting.

Submit the paper. Check the following:

Are the title, author information, track, keywords, etc., correct on the
submission interface?

If you had to indicate conflicts of interest, did you do it?

Are you supposed to submit additional things, like uploading an appendix
elsewhere? If yes, don't forget them.

Update the abstract on the submission interface.

Double-checking.

Re-download the file that you submitted from the submission
interface and go through it to check that it looks good.

Are all pages there?

What was the last change that you made? Is it there?

Did you submit the right file, for the right paper? This may sound stupid, but strange things happen at
5 AM. It is especially important if you have multiple submissions for
the same deadline.

It's normal to find mistakes at the last minute when going over the PDF after
submitting. Of course, when fixing them, be extra careful; at this stage you
should probably be double-checking the changes that you make, and checking them
again by proofreading the diff before you commit.

Concluding.

Decide you are done. Let your coauthors know. Thank them for their hard work. Then, celebrate as you see
fit. :)

Aftermath.

The following are things that you should probably do,
but there is no urgency; to avoid doing them multiple times because of
last-minute fixes, the safest is to wait until the deadline has passed.

Ensure that your version control system knows what was the final submitted
version.

If applicable and not double-blind, upload your preprint on your website
and/or a preprint repository such as arXiv.

In April 2014,
Google Paris
organized an onsite programming contest, the
Google Hash Code.
The task to solve was quite fun. This post describes the problem, and some of
the strategies to solve it that were used by the ENS teams.
This gets published now because the
2015 edition
is currently being organized, which for some reason gave us the impetus to
document our work.

ENS scored
pretty well
in the contest
(ENS students
were in teams ranked 1st, 2nd, 3rd, 4th, 7th, 10th; mine was "ENS Ulm 1" which
ranked 7th), and substantially improved on their solutions afterwards during the
Extended Contest,
which lasted for some time after the contest.

Problem description

We are given a
map of Paris
(from OpenStreetMap data)
that includes1 all intersections and streets.
As in real life, some of the streets are one-way.
The segments are labeled by their length in meters and the time it takes to traverse them.
We have 8 Street View cars
that start from the Google office and want to photograph as many street meters as possible
in a limited amount of time (15 hours).
You can traverse the same street as many times as you want, but it only counts
once towards your total meters.

The goal is to choose a route for the 8 cars which allows you to traverse as
many unique meters of the streets of Paris as possible, within the allotted time.

The brute-force method

Of course, in principle we can solve the problem by enumerating all of the
possible routes for the 8 cars, and picking the best one. There are finitely
many routes, so this approach is guaranteed to find the optimal solution. This
is good enough for a mathematician, but not for a computer scientist:
the paths are too numerous and you will never
be able to enumerate them in practice.

Imagine a single car that, for 10 hours, reaches every minute an intersection with at
least two possible streets. The number of paths is \(2^{10 \times 60} = 2^{600}\), which is about
\(10^{180}\) (as \(\log_{10} 2 \approx 0.3\)). Even if every atom in the Solar System were an impossibly powerful
supercomputer, you wouldn't be able to reach this number before the Sun goes
bust.

The greedy method

At the opposite end of the spectrum, the simplest possible approach (beyond
purely random approaches) is to choose each street to take, one after the
other, based on some greedy optimization criterion.

Basic greedy

For instance, plan a route for each car, one after the other. Whenever the car
reaches an intersection that allows it to reach streets that haven't been
traversed yet, choose one of them. Maybe pick the one with the most meters, or the
one with the most meters per second. Otherwise, if all reachable streets have
been visited, choose one of them at random. Repeat while there is time, then do
the same for all cars.
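To make this concrete, here is a minimal Python sketch of such a greedy routing
for one car. The data structures are hypothetical, not the actual contest code:
adj[u] is assumed to be a list of (intersection, traversal time, street length)
triples for the streets leaving intersection u.

    import random

    def greedy_route(adj, start, time_limit):
        """Greedy route for one car: adj[u] lists (v, seconds, meters)."""
        visited = set()                  # streets already photographed
        route, pos, elapsed = [start], start, 0
        while True:
            options = [(v, t, m) for (v, t, m) in adj[pos] if elapsed + t <= time_limit]
            if not options:
                break
            fresh = [o for o in options if frozenset((pos, o[0])) not in visited]
            if fresh:                    # prefer new streets, best meters per second first
                v, t, m = max(fresh, key=lambda o: o[2] / o[1])
            else:                        # everything reachable was seen: pick at random
                v, t, m = random.choice(options)
            visited.add(frozenset((pos, v)))
            elapsed += t
            route.append(v)
            pos = v
        return route

Scoring by meters per second is just one of the possible criteria mentioned
above; meters alone works too.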

The greedy strategy will yield reasonable results. There are, of course, tweaks
to make, for instance planning the route for all cars simultaneously instead of
sequentially, or initially sending the cars to hardcoded start positions which are far
apart, to increase coverage.

However, the key drawback of greedy solutions is that they
cannot plan things in advance. When choosing among already visited streets, for
instance, you are not trying to get closer to unvisited streets. If
visiting a new street will trap you in a maze of already visited streets, you
will still fall in the trap.

Greedy with lookahead

A simple solution to this problem is to improve the greedy strategy with
brute-force that looks ahead for a few steps. Whenever you need to take a decision,
don't take it based on what it will bring you now. Rather, consider all possible
paths of length, say, 10, that you can take from this point, and see which one
brings you the best results. Choose your next move to be the first move of the
best path. Do the routing for cars in parallel following this idea.
The path length that you use is tweakable, and in general using higher path
lengths will improve the result: I think we used something like 17 in practice.
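Here is a rough Python sketch of the lookahead scoring, under the same
hypothetical graph representation as above. For simplicity it ignores the time
limit during the lookahead and makes no attempt at pruning, so it is
exponential in the depth; in practice one would prune or memoize to afford a
depth like 17.

    def lookahead_score(adj, pos, visited, depth):
        """Best number of new meters collectable by a path of `depth` streets
        from pos, together with the first street to take (None if nothing
        new is reachable within the horizon)."""
        if depth == 0:
            return 0, None
        best_score, best_move = 0, None
        for (v, t, m) in adj[pos]:
            street = frozenset((pos, v))
            gain = 0 if street in visited else m
            sub_score, _ = lookahead_score(adj, v, visited | {street}, depth - 1)
            if gain + sub_score > best_score:
                best_score, best_move = gain + sub_score, v
        return best_score, best_move

The caller takes the returned first move, falls back to a random street when
the move is None, and repeats until the time runs out.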

This simple idea is easy to implement and yields pretty good results.
My team
used
it
with a few tweaks, and to my knowledge many of the top-ranking teams
also did.

The clever method

Of course, the point of this post is to describe the clever solution, the one
that was implemented after the contest, for the extended round.

Rephrasing the problem as Eulerian paths

The key insight is to notice the similarity between our problem and the
Eulerian path
problem: computing a path in a graph which visits each edge exactly once.
The map of Paris is seen as a graph, where the intersections are the vertices,
and the streets are the edges.
If we could find an Eulerian path in this graph, then it would be the ideal
solution, going through all segments without wasting any time in already visited
segments.

Of course, you need to squint a bit to see the reduction. First, we have 8
cars; but for now we will assume that we have only one, and will split the
path between the various cars later. Second, we have a time limit, and edges
have a value in terms of their number of meters; but we will disregard this. We
will pretend that we will be able to visit all edges in the allotted time
(which isn't so unrealistic, as we will see ;)), so that we ignore the time limit
and the lengths of the streets in meters (we only care about the time needed to
traverse them).

Third, and more importantly, our graph is a bit weird, because some edges are
undirected, and others (the one-way streets) are directed.
Because of the one-way streets, we cannot rely on undirected Eulerian paths, and
will be forced to reduce to directed Eulerian paths. We will thus need to choose
an orientation for the undirected edges.

Fourth: well, obviously, we won't be able to find an Eulerian path. There is a
well-known characterization of directed graphs that have Eulerian paths (I call
them Eulerian graphs), which is not hard to prove: the in-degree of each vertex
(the number of edges that go towards it) must be equal to its
out-degree2. Does the input graph satisfy
this condition? Not a chance. However, in our case it isn't really forbidden
to traverse edges multiple times; we just want to avoid doing it as it is
wasteful.

Hence, there will be two steps to our process to make the graph Eulerian.
First, we will choose a direction for all undirected edges, in a way that makes the
graph "more Eulerian". Second, we will add edges to make it really Eulerian,
which materialize the (wasteful) choice of going through edges which are
otherwise visited (that is, that have been visited or will be visited at another
point in the solution).

Orienting undirected edges

In order to reduce to a directed Eulerian path, we have to assign a direction to
all undirected edges. To make the Eulerian path problem easier to solve on the
resulting graph, we will try to minimize the sum, across all vertices, of the
absolute difference between in-degree and out-degree: intuitively, the smaller this
quantity, the fewer edges we have to add to make the graph Eulerian.

To introduce some terminology, I will say that a vertex is overloaded (in the
original graph, or in a directed graph obtained from it by edge orientation) if
its in-degree (counting only directed edges) is more than its out-degree; it is
underloaded if it is less, and balanced if both are equal. The load of the
vertex is the difference between in-degree and out-degree (which is \(>0\) for
overloaded vertices and \(<0\) for underloaded vertices).

How can we choose the best orientation of the edges? Our choice of terminology
gives some insight: we would like edge orientations that allow originally
overloaded vertices to "transfer" some of their load to the underloaded vertices,
following directed paths, which have to be disjoint to avoid double-counting the
intermediate edges. It turns out that this can actually be phrased neatly as
a flow problem.

Overloaded and underloaded vertices are respectively represented as sources and
sinks, with capacity equal to the absolute value of their load. Each undirected edge is represented as
a pair of edges going in both directions, with capacity 1. Edges that are
already oriented are ignored.

We solve this flow problem using any flow algorithm. By the
integral flow theorem,
the flow along each edge in the result is either 0 or 1. Hence, the result
describes a way to orient some (but not all) of the undirected edges, depending
on the direction of the flow going through them.
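As an illustration, here is a sketch of this construction in Python using
networkx for the flow computation; this is my own reconstruction, not what we
actually used. It assumes at most one undirected street between two
intersections, and that the special source and sink names do not clash with
real vertex identifiers.

    import networkx as nx

    SRC, SNK = "__source__", "__sink__"      # assumed not to clash with vertex ids

    def orient_by_flow(load, undirected_edges):
        """load[v]: in-degree minus out-degree over one-way streets only.
        Returns the orientation chosen by the flow for some undirected streets."""
        G = nx.DiGraph()
        for v, l in load.items():
            if l > 0:
                G.add_edge(SRC, v, capacity=l)    # overloaded: may shed l units
            elif l < 0:
                G.add_edge(v, SNK, capacity=-l)   # underloaded: may absorb -l units
        for u, v in undirected_edges:
            G.add_edge(u, v, capacity=1)          # each undirected street carries
            G.add_edge(v, u, capacity=1)          # at most one unit, either way
        _, flow = nx.maximum_flow(G, SRC, SNK)
        orientation = {}
        for u, v in undirected_edges:
            if flow[u][v] > flow[v][u]:
                orientation[(u, v)] = (u, v)
            elif flow[v][u] > flow[u][v]:
                orientation[(u, v)] = (v, u)
            # streets with no flow stay undirected and are handled in the next step
        return orientation

Since the capacities are integers, the computed maximum flow is integral, so
each undirected street carries either 0 or 1 unit, as claimed above.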

We still have the problem that some edges in the result are still undirected,
namely those which were not used at all in the optimal flow. To take care of
this, we simply orient all of the remaining edges at random. Of course,
this may make the final sum worse, and to alleviate this we enforce manually the
property that the flow algorithm had enforced for us before: for any path
of our newly oriented edges, if reversing the direction of
the path makes the sum better (because the path origin was underloaded and the
path target was overloaded), we do the reversal, which increases the load of the
origin by two (it trades an outgoing edge for an incoming one), decreases the load
of the target by two, and does not change the load of the intermediate vertices.
To perform this last step efficiently, observe that it suffices to consider all
pairs of vertices. Whenever we find a pair of an underloaded vertex and an overloaded
vertex, we can search for a path of newly oriented edges from the underloaded vertex to the overloaded one, and reverse it if it
exists.

Hence, at the end of this process, we have a directed graph where we have tried
to minimize the sum of differences between in-degree and out-degree, to make it
as Eulerian as possible. We now make it Eulerian for real, by adding edges so
that all vertices are balanced.

Adding edges

Intuitively, the edges that we add to make the graph Eulerian correspond to the
possibility of adding already traversed edges to the path.

In other words, if we add an edge from vertex \(i\) to vertex \(j\), it means that
we will be going from \(i\) to \(j\) through a path of edges that will be visited
anyway. All the time that we spend doing such traversals is wasted, and does not
contribute towards our objective. Hence, the cost of adding an edge from \(i\) to
\(j\) should be the time needed to go from \(i\) to \(j\) following any edge path. In
fact, we may follow a path that violates our chosen orientation for the edges
that used to be undirected, and it does not matter; so we are really talking
about following a shortest path in the original graph.

Hence, we start by precomputing
the shortest path time between all pairs of vertices in the input graph, using
the
Floyd-Warshall algorithm.
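For reference, a textbook Floyd-Warshall sketch in Python, just to fix notation
for what follows; vertices are assumed to be numbered 0 to n-1 and `edges`
lists each directed traversal with its time (this is not the contest code).

    INF = float('inf')

    def all_pairs_shortest_times(n, edges):
        """edges: list of (u, v, seconds), undirected streets listed both ways."""
        dist = [[INF] * n for _ in range(n)]
        for v in range(n):
            dist[v][v] = 0
        for u, v, t in edges:
            dist[u][v] = min(dist[u][v], t)
        for k in range(n):                   # standard triple loop
            for i in range(n):
                dik = dist[i][k]
                if dik == INF:
                    continue
                for j in range(n):
                    if dik + dist[k][j] < dist[i][j]:
                        dist[i][j] = dik + dist[k][j]
        return dist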

Now, we model the problem of adding more edges as a
minimum weight matching in a weighted bipartite graph.
On one part of the bipartite graph, we put all vertices that are overloaded in
the result of the previous phase, duplicated as many times as their load.
We analogously create the other part3
of the graph with the underloaded vertices.
We add edges in the bipartite graph between any two vertices in the two parts,
weighted by the time of the shortest path from the overloaded vertex to the underloaded one.

Now, a
perfect matching
in this graph represents a way to add the right number of edges to ensure that
all vertices are balanced,
and if its weight is minimal then it wastes as little
total time as possible traversing edges that are otherwise visited. We use the
Hungarian algorithm to
solve this problem, and to complete the oriented graph.
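As a sketch, the same assignment problem can be solved with SciPy's
linear_sum_assignment, which solves the problem that the Hungarian algorithm
solves; the data structures are again hypothetical (`load` maps each vertex to
its in-degree minus out-degree after orientation, `dist` is the Floyd-Warshall
matrix from above).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def extra_edges(load, dist):
        """Return the list of (overloaded, underloaded) pairs to link by an
        extra edge, minimizing the total shortest-path time."""
        over = [v for v, l in load.items() if l > 0 for _ in range(l)]
        under = [v for v, l in load.items() if l < 0 for _ in range(-l)]
        cost = np.array([[dist[u][v] for v in under] for u in over])
        rows, cols = linear_sum_assignment(cost)
        # each matched pair becomes an added edge, later expanded into the
        # corresponding shortest path in the original graph
        return [(over[i], under[j]) for i, j in zip(rows, cols)]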

Constructing the Eulerian path

The resulting graph is now directed, and each vertex is balanced, so it has an
Eulerian path. To compute it, it is well-known that a
simple greedy approach
works. (Maybe there
would be possible improvements with respect to the later steps, but we didn't
look into this.)
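For completeness, here is a standard Hierholzer-style sketch of this greedy
construction (a textbook version on adjacency lists of the balanced directed
graph, not our code).

    def eulerian_circuit(adj, start):
        """adj: dict mapping each vertex to the list of its outgoing neighbours."""
        adj = {v: list(out) for v, out in adj.items()}   # local copy we can pop from
        stack, circuit = [start], []
        while stack:
            v = stack[-1]
            if adj.get(v):
                stack.append(adj[v].pop())               # follow an unused edge
            else:
                circuit.append(stack.pop())              # dead end: back up and record
        circuit.reverse()
        return circuit        # vertex sequence traversing every edge exactly once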

From the Eulerian path, expanding the edges that were added in the previous
phase to the shortest paths to which they correspond, we obtain an itinerary,
which is a fairly good solution for a single car to traverse all the
streets of Paris in as little time as possible.

Splitting the path among cars

We now remember that we have 8 cars rather than a single one, so that our route
must be split among them.

Here is our approach: we take the first car, and have it run the beginning of
the route for one quarter of the allotted time. Then, we look at the rest of the
route until the end of the allotted time, and search for the traversed vertex
that is the easiest to reach (in terms of shortest path time) from the starting
vertex (the Google office). We continue the route of the first car until it
reaches this vertex, at which point we end it. Then, the next
car starts at the starting vertex, goes to the cut vertex following the shortest
path, follows the rest of the route again for one quarter of the allotted time,
and continues until the closest point to the starting vertex. We repeat until
the last car, which takes care of all the rest of the route.
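A simplified Python sketch of this splitting step; the `leg_time` helper and
the `dist` matrix are hypothetical, and for simplicity the cut vertex is
searched in the whole remaining route rather than only up to the allotted time.

    def split_route(route, office, dist, leg_time, n_cars=8, total_time=15 * 3600):
        """Cut one long route (a vertex sequence) into n_cars subroutes."""
        routes, i = [], 0
        for _ in range(n_cars - 1):
            elapsed, j = 0, i
            while j + 1 < len(route) and elapsed < total_time / 4:
                elapsed += leg_time(route[j], route[j + 1])   # run at least a quarter
                j += 1
            # then cut at the upcoming vertex that is cheapest to reach from the office
            k = min(range(j, len(route)), key=lambda idx: dist[office][route[idx]])
            routes.append(route[i:k + 1])
            i = k                        # the next car rejoins the route here
        routes.append(route[i:])         # the last car takes whatever remains
        return routes

Each car after the first is then understood to first drive from the office to
the first vertex of its subroute along a shortest path.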

Of course, the reason why we cut when each car goes close to the starting vertex
is to minimize the effort spent by the next car to reach the position where it
can resume the route.
The solution will still give too much work to the last car, but we will see later
how to balance the duration of the routes of all cars. The reason why we ensured
that each car ran for at least one quarter of the time is to make this balancing
operation easier, by ensuring that the car paths are sufficiently long to
intersect many times.

We use this as our initial solution, and now optimize it further.

Optimizations

We now iterate a bunch of optimizations to tweak the paths and make them better,
until we get something which we judge is good enough.

Of course, not all optimizations yield the same savings; and the
order in which they are applied can make a large difference. That said, I don't
remember in which order, or how many times, they were applied;
there probably wasn't much logic to it. :)

Pruning

For any subpath of the path of any car that consists only of otherwise visited
edges, we replace it by a shortest path between its endpoints (courtesy of the precomputed
Floyd-Warshall).

Loop swapping

Assume some car's path includes a move from vertex \(i\) to vertex \(j\), then a
cycle \(C\), then a move from vertex \(j\) to vertex \(i\); and that the edge from \(i\)
to \(j\) is otherwise visited. If we can find any other
car that passes through an intersection of \(C\), we can have that other car take
care of the edges of \(C\) on the way, so that the first car doesn't need to do
\(C\) and doesn't need to do the steps from \(i\) to \(j\) and back (so we save twice
the time of the \(i\)—\(j\) edge).

Of course, this can lead to a situation where some cars have a longer path than
others, but this will be dealt with later.

Trimming

If the last edge of a car's path is already visited by someone else, just remove it.

Brute-force

For every subpath of every car's path, up to a certain length, if the path can be
replaced by any shorter path that leaves no street unvisited, then do so. This
differs from the pruning optimization in that the subpath may also contain edges
that are not visited anywhere else; of course, the replacement will then also need to
traverse these edges (possibly in a different order).

Balancing

The time limit of 15 hours has been ignored until now, but it has to be obeyed
by all cars, so we shouldn't return a solution where one car does all the work
and the others do not do much.
Further, in case we do manage to traverse all edges (which we did!), the quality
of a solution is judged according to its remaining time, which is
\(T - \max_{c \in \{1, \ldots, 8\}} t_c\)
where \(T\) is the time limit and \(t_c\) the time of car \(c\).

Hence, we would ideally want all the cars to follow routes that take the same
time. This implies that the busiest cars should try to swap longer parts of
their paths with shorter parts of less busy cars.

In practice, look for any two vertices \(x\) and \(y\) and two cars such that both
cars will pass through \(x\) before passing through \(y\). Now, in general the paths
followed by the two cars between \(x\) and \(y\) will take a different amount of
time; if swapping the paths will decrease the time difference between the two
cars' routes, then swap them.

This is a simple greedy scheme that can be iterated as long as we want. In fact,
as the paths of cars will tend to intersect very often, it works very well and
quickly converges to a state where all cars take the same time to do their
routes, no matter the initial routes.
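A sketch of one such balancing pass in Python, with a hypothetical `duration`
helper giving the time of a route; it is quadratic and simplistic (it only uses
the first occurrence of each vertex), but it conveys the idea.

    def balance_once(routes, duration):
        """Try one swap between two cars' routes; return True if a swap was made."""
        for a in range(len(routes)):
            for b in range(a + 1, len(routes)):
                ra, rb = routes[a], routes[b]
                common = set(ra) & set(rb)
                for x in common:
                    for y in common:
                        ia, ja = ra.index(x), ra.index(y)
                        ib, jb = rb.index(x), rb.index(y)
                        if not (ia < ja and ib < jb):
                            continue      # both cars must pass through x before y
                        na = ra[:ia] + rb[ib:jb] + ra[ja:]    # swap the x..y sections
                        nb = rb[:ib] + ra[ia:ja] + rb[jb:]
                        if abs(duration(na) - duration(nb)) < abs(duration(ra) - duration(rb)):
                            routes[a], routes[b] = na, nb
                            return True
        return False

Calling this repeatedly until it returns False implements the iteration
described above.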

Results

The clever strategy described here managed to traverse all streets of Paris in
the allotted time (in fact, in 9 minutes less than the allotted time). It's a pity
that this didn't happen during the contest proper, but still, it is nice. :)

Related work

We developed all of this on our own, and we didn't look seriously into the
related work of people who already tried to solve similar problems. It turns out
that our problem is known as the
postman problem,
because of course it is also useful to optimize mail delivery.

More specifically, our
version is known as the "\(k\)-mixed postman problem", because we have multiple
cars (here, \(k = 8\)), and because the graph is mixed: some of the edges are
directed while others are undirected. Surprisingly, the directed postman problem
and the undirected postman problem (for \(k = 1\)) can be optimally solved in polynomial
time4, whereas the mixed postman problem is
NP-hard5.

It would be nice to compare the results of our method to state-of-the-art
approaches. We did a preliminary comparison to
existing datasets and we have no
reason to believe that our algorithms are competitive. But, still, we had fun. :)

Credits

The ideas in this post are not mine: essentially everything was done by
other ENS students. All steps of the clever solution except for the
post-processing were conceived and implemented by
gnomnain
(code).
The optimizations were implemented by
elarnon
(code),
and
Mc and
mehdi
(code).
(All code is purely write-only and there is no documentation of any kind, so
don't try to use it.)

This post was re-written from an earlier version authored by Mc.
Thanks to Jill-Jênn,
elarnon, Armavica
and gprano for proofreading drafts and offering feedback.
There is another
write-up about the
topic, in French, by Jill-Jênn.

The input also included the geographic coordinates of the intersections, but we
won't be using this. That said, to anticipate on what follows, some teams had nice
results with the simple greedy heuristic of taking the leftmost non-visited
street at each intersection, to make spiral-like itineraries. ↩

Technically, this is the criterion for the existence of a cycle, not a
path, but as we would have to force the path at the starting point anyway, it is
simpler to look for a cycle, which is just a special case of path. ↩

It is clear that both parts have the same size, because the total of in-degrees
is the same as the total of out-degrees: it is exactly the number of edges in the
directed graph of the previous phase. ↩

This post is about how (and why) I recently produced my own Debian packages for a recent
upstream release of Mesa to work around a bug on my own machine. Though the
specifics are not particularly interesting, you may extrapolate my story into a more
general guide of what to do when you need to run an upstream software release that's too recent to
be packaged by Debian, and too complex to be easy to install by hand with make
install.

I recently tried to figure out the source of weird crashes on my machine, which
runs Debian testing. I
understood from /var/log/syslog that I was probably affected by
Bug #771045. From
the linked discussions it seemed that a
patch
in Mesa had a good chance of fixing the problem. The patch was part of
version 10.3.4 of Mesa, but Debian
testing only had 10.3.2-1, judging from the output of:

dpkg-query -l | grep -i mesa

So, if I wanted the patch, I needed to install manually a newer version of Mesa.

A first thing to check was whether it was already packaged in Debian unstable. I
have the unstable repository set up and pinned at a lower priority with the
following in /etc/apt/preferences.d/prefs:
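Package: *
Pin: release a=unstable
Pin-Priority: 300

(This stanza is a reconstruction of my setup rather than a verbatim copy; the
exact priority does not matter much, as long as it is lower than the priority
of your main release, so that unstable packages are available but never
installed unless explicitly requested.)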

I use this to manually install a recent version of
iceweasel from unstable.
Could it be the case that a recent Mesa was packaged in unstable? Alas, no,
checking on a random mesa package among those at version 10.3.2-1:
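apt-cache policy libgl1-mesa-glx

(The package name here is just an illustration, and the command is my
reconstruction of the check; any binary package built from the mesa source
would do, and at the time nothing newer than 10.3.2-1 showed up for unstable.)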

The source for the packaged version (10.3.2-1) should be downloaded first,
e.g., with apt-get source mesa (which requires deb-src entries in your APT sources).
You could then go into the source tree and run uscan from the devscripts package:

cd mesa-10.3.2
uscan

This would automatically search for a newer release on the Mesa website and
download it. However, it would pick the latest version (10.4.1 as of this
writing), for which the packaging wouldn't work as is. So instead, directly download
Mesa 10.3.4,
the version that we want, from the official website,
and symlink the tarball to the file name expected by the Debian tools
(mesa_10.3.4.orig.tar.gz, as used below).

Now we can go back in the source tree for the old version, and use uupdate to
unpack the new version and migrate the Debian-specific stuff. Then go to the new
source tree thus created:

cd mesa-10.3.2
uupdate ../mesa_10.3.4.orig.tar.gz
cd ../mesa-10.3.4

Now, after installing the build dependencies (e.g., with sudo apt-get build-dep mesa),
we are ready to start the compilation process:

debuild -us -uc

This should compile everything and take 10-30 minutes.
If a problem occurs, you should have copious output to try to understand what
went wrong.
Otherwise, if all goes well, all the
deb packages for the new version will have been generated in the parent folder.
That was easy!2 :)

That's all nice and well, but then I realized that I needed the i386 version of
the packages, because I use Skype, whose "multiarch" package is actually i386.
How to cross-compile the packages? In fact, it's quite easy.
We will use pbuilder to create a
chroot where the compilation can take place,
but everything is automatic:
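sudo apt-get install pbuilder
sudo pbuilder create --architecture i386
sudo pbuilder build --architecture i386 ../mesa_10.3.4-1.dsc

(These invocations are my reconstruction rather than a verbatim recipe: the
first command installs pbuilder, the second creates the i386 chroot, which only
needs to be done once, and the third builds the source package inside it; see
the note about the .dsc file at the end of this post.)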

This should take care of everything. The packages will be produced in
/var/cache/pbuilder/result.

Now you just have to figure out which packages you need to install (using
dpkg-query -l) and install them with sudo dpkg -i, then correct any
dependency problems with sudo apt-get -f install. Alternatively, you can just
issue sudo debi -u from the source tree, which should do those steps
automatically. However, I haven't tested it. Of course to install the i386
packages in this way you will need to move them, or let debi know where to
find them.

Note that to avoid problems
it looks like it is safer to install all packages at once. If you run into file
conflicts between the i386 and amd64 packages, a solution may be to apt-get
remove all the old versions and then install the new ones. Of course, this will
probably remove a bunch of useful packages on your system that depend on Mesa;
if you do this, you
should make sure that you have kept track of the packages that you will want
to reinstall afterwards, and also that you are confident using a tty or ssh if
your graphical environment breaks.

And, well, that's it. If you have installed the packages successfully, you can
just reboot. Then you can check in dpkg-query that you are indeed using the
newer version of the packages, and double-check using:

glxinfo | grep -i version

Of course, if something went bad, maybe your system will not boot (or at least
not boot graphically) and then it will be up to you to investigate. Hopefully
removing the new packages and installing back the Debian ones should make things
work again. :)

One last thing: hopefully, as the version numbers of your home-made Mesa packages correspond to
the actual version number of the release, apt-get upgrade should move you to
the more recent versions as soon as they start getting packaged by Debian.

Note that the packages installed by apt-get build-dep will
be marked as "manually installed" and clutter the output of apt-mark
showmanual. If you care, you may want to save its output
before proceeding. ↩

If you're wondering how this automated process could take care of things like
updating the changelog, have a look at
debian/changelog in the source folder. A dummy entry will have been
generated. ↩

If you don't have the file
../mesa_10.3.4-1.dsc, it should have been produced when compiling for amd64
before. ↩

I am a fond user of Wikipedia, so I was interested to
find out how to use it when I have no Internet connection. On my mobile phone, I
already use the free and open-source Aard Dictionary,
but for space reasons I only have the French
Wikipedia, without images, and with some formatting
glitches. On my computers, where I have more space and where it is easier to set
things up, I wanted to have more, so that I can have a usable Wikipedia when I'm
on the train, in a plane, or in a place with a crappy Internet connection.

This blog post explains how to set up a local Wikipedia mirror using
Kiwix. Start by
downloading Kiwix and
unpacking it. Kiwix ships with a special GUI to browse Wikipedia offline, but I
prefer to use my usual Web browser. Fortunately, Kiwix also includes the
kiwix-serve program that can serve dumps as a regular Web server.

Next, download the dumps
for the projects that you want. Ideally I would be interested in generating my
own dumps, but I haven't looked into this yet. I used Wikipedia and Wiktionary
French and English, totalling about 60 GB.

Next, to be able to perform full-text search, you must index the dumps. Maybe
the pre-indexed dumps can be used, I haven't tried. I indexed them manually
instead. For each file a.zim, I did kiwix-index -v a.zim a.zim.idx, where
kiwix-index is provided by Kiwix. The process takes a lot of time (10-30 hours
or so) but does not require any interaction. The indexes take another 30 GB.

To serve all dumps with kiwix-serve, you need to build a library. First, move
the .zim and the .zim.idx files (or symlink them) to have shorter names;
this will make the URLs shorter afterwards. I use wen.zim, wfr.zim,
wtfr.zim and wten.zim. Now run, for each file a.zim in the working
directory: kiwix-manage `pwd`/wiki.xml add `pwd`/a.zim
--indexPath=`pwd`/a.zim.idx. (It is safer to use absolute paths.)

Now, to start the server, choose a port number (say 4242) and run kiwix-serve
--port=4242 --library /where/you/put/wiki.xml. Test it by browsing to
http://localhost:4242 and checking that it works. If it does, you probably want to
arrange for this command to be run at startup.

The last step, if you use Firefox, is to use Smart
Keywords to
be able to reach your local dump efficiently. To do this, for each of the dumps
that you have on localhost:4242, right-click on the search text field and add
a keyword for it. As I use Firefox
Sync, those bookmarks are
synchronized across my different machines.

These smart keywords work for full-text search in the dumps. If you want to have
bookmarks that directly reach articles and will never perform full-text search
(because it is faster), you can edit the bookmarks in Firefox to set "Location"
to, e.g., "http://localhost:23552/wen/A/%s.html". However, this technique does
not work with all dumps (it depends on the URL structure), and in this case you
must remember that the first letter of the search term must be uppercase.

In terms of formatting, the Kiwix dumps are fairly OK. There are occasional
glitches (e.g., with <span>) but not many of them, and certainly a lot fewer
than with Aard Dictionary. Equations are supported, and images are there if you
pick the right dumps. Most templates, formatting, etc., are fine. The HTML
interface added by kiwix-serve is not perfect but it's mostly unobtrusive.

I have recently migrated my passwords to pass
and so far I've been really happy about it. The previous system was to have them
all in a huge text file, which wasn't especially convenient or secure1, and wasn't shared between my various machines. Here is some info
about pass.

pass has been packaged for Debian since Jessie, so installing it is as simple
as sudo apt-get install pass. However, it's just a shell script of a bit over 600
lines, so it's really easy to review, and to install manually if you need to.

The way pass manages passwords is dead simple: a hierarchy of gpg-encrypted
files. The assumption is that each file corresponds to a website, or machine, or
other authentication realm, and contains the password. The use of gpg provides a
layer of security, so that your gpg key and passphrase serve as a master
password. Of course, it is nice to have a properly configured gpg-agent(1) to
avoid having to enter the passphrase multiple times.

The basic commands of pass are pass init KEYID which sets up the store for gpg
key KEYID (by default in ~/.password-store), pass FILE which decrypts and
shows FILE, and pass edit FILE, which decrypts FILE to a secure temporary
location in /dev/shm, edits it, and encrypts it back. You can also use pass
ls (which shows a nice output using
tree), pass find to search
for files using find, pass grep to search in the decrypted password files
using grep, and pass rm, pass mv, pass cp. Of course, you can also mess
around in the password store by hand.

As pass has this very nice CLI interface, migrating my passwords from my custom
system was very easy, although it seems like the Debian package also installs a
bunch of scripts to migrate from other password managers.

Beyond the generic commands I have presented, pass obviously offers commands
tailored for password management. You have pass insert FILE which creates FILE
with the password you provide (and turns off echo and makes you enter it twice
for confirmation). You have pass -c FILE which copies the password in FILE to
the clipboard, so you can input the password where you need it, and
automatically clears it after 45 seconds (which is a reasonable thing to do).
You have pass generate FILE LENGTH which generates a password of LENGTH chars
in FILE and displays it (or copies it to the clipboard with -c); what is very
nice is that pass itself does not include password generation logic, but entrusts
pwgen(1) with the task.

Icing on the cake: pass is designed to be used with git, and provides pass
git to call git commands. If you use git, all the pass commands will
automatically git commit what's needed. This makes it very easy to share
passwords between different machines. Of course, as the files are encrypted, git
cannot be expected to solve conflicts within files, but it can nicely merge
changes across various files. You can also use this setup to share passwords
between different people, as pass supports encrypting for multiple keys.

For once, I find it hard to find something to dislike about pass. Eventually I
may want to tweak password generation so that it generates passwords the way I'm
used to, but this would be easy to do. I'm also missing support for usernames,
as I use different usernames on different websites, but pass allows you to
store anything in the password file (and only the first line is taken into
account for pass -c and others), so I can just add the username as the second
line if needed; it's just that I will have to retrieve it by hand, or script
something that does what I want. Other than that, I'm very happy to have a
convenient, lightweight, and secure way to manage my passwords and share them
across machines using git.

My home partition is encrypted, but there was no security whatsoever if
the machine was ever compromised. ↩