Summary

We propose several enhancements to make the Apport crash retracer bots more robust and easier to maintain.

Release Note

Not applicable; this is an infrastructure change that is not directly exposed to users.

Rationale

The current retracer bots crash too often, have a huge memory leak, and do not notify anyone when things go wrong. This also results in some bad retraces when chroots are out of date or broken, or when debug symbols are not available.

Use Cases

A bug in fakechroot or a maintainer script causes the automatic dist-upgrade of retracer chroots to fail. The retracer stops itself and sends a mail to the maintainers with the problem.

During chroot setup for retracing a particular crash, a package or ddeb fails to download. The retracer stops itself and sends a mail to the maintainers with the problem.

The retracer crashes due to a new Launchpad version breaking current python-launchpad-bugs. The exception is mailed to the maintainers and the retracer stops itself.

The computer hosting the retracers is rebooted. After reboot, the retracers automatically continue to work.

Chris maintains the server which hosts the retracers. He no longer has to prod Martin or Sebastien to restart them because they have used up all available memory.

Design

Change the program so that it no longer runs forever, but processes one batch of currently pending bugs and then exits. This works with cron, makes memory leaks irrelevant, and also allows running it in the foreground for testing and debugging. (Use case 5)

Move away from a permanently running foreground program in screen sessions, since this is nontrivial to set up again after a reboot. Instead, use cron to call the retracer regularly to process batches. (Use case 4)
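The batch mode described above can be sketched as follows. This is only an illustration, not the actual crash-digger code: the names process_batch, pending_bugs and retrace are hypothetical placeholders for however the retracer obtains and handles its work.

```python
import sys


def process_batch(pending_bugs, retrace):
    """Retrace each currently pending bug once, then return.

    pending_bugs: iterable of bug IDs (hypothetical input).
    retrace: callable that handles a single bug (hypothetical).
    Returns the number of bugs processed.
    """
    count = 0
    for bug_id in pending_bugs:
        retrace(bug_id)
        count += 1
    return count


def main(pending_bugs, retrace):
    # One pass over the pending bugs, then exit. The process ends after
    # every batch, so leaked memory is released by the operating system.
    processed = process_batch(pending_bugs, retrace)
    print("processed %d bugs" % processed)
    return 0
```

Because the program returns after one batch, the same entry point can be called from cron or run by hand in the foreground for debugging.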

Implementation

A lock file causes the retracer to abort immediately if it is already present, and thus prevents both parallel instances (I/O thrashing) and bad retracing results due to broken chroots or a broken system.

retracer creates lock file at startup

retracer removes lock file at clean shutdown, but not on crash (use cases 1 to 3)

after fixing the problem, maintainers need to remove the lock file
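A minimal sketch of the lock file behaviour listed above, assuming a single lock file path chosen by the maintainers. The names acquire_lock, run_with_lock and LockFileExists are made up for this example and are not part of Apport.

```python
import os


class LockFileExists(Exception):
    """Raised when a lock file from a running or crashed instance exists."""


def acquire_lock(path):
    # O_CREAT | O_EXCL makes creation atomic: the call fails if the file
    # already exists, so two instances cannot both acquire the lock.
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        raise LockFileExists("lock file %s is present, aborting" % path)
    with os.fdopen(fd, "w") as f:
        f.write("%d\n" % os.getpid())


def run_with_lock(path, work):
    acquire_lock(path)
    # If work() raises, the exception propagates and the lock file stays
    # behind, which blocks later runs until a maintainer removes it.
    work()
    # Reached only after a clean run.
    os.unlink(path)
```

The removal is deliberately not placed in a finally clause: leaving the lock behind after a crash is what stops subsequent cron runs from retracing in a broken environment.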

apt-get update needs to be run for every retrace to make sure that the package lists are up to date. If this, a dist-upgrade, or a package installation exits with a nonzero status, the program raises a SystemError.
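The exit status check could look roughly like this. The helper name run_checked is an assumption for illustration; how the command is actually executed inside the chroot is not covered here.

```python
import subprocess


def run_checked(argv):
    """Run a command and raise SystemError if it exits with nonzero status."""
    status = subprocess.call(argv)
    if status != 0:
        raise SystemError(
            "%s failed with exit status %d" % (" ".join(argv), status))
```

A call such as run_checked(["apt-get", "update"]) would then abort the retrace with an exception instead of silently continuing with stale package lists.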

Exceptions are printed to stderr; set up cron to mail output to martin.pitt@ubuntu.com and seb128@ubuntu.com. Call crash-digger with stdout redirected to a log file, so that normal logging does not cause cron mail.

In addition, we need a daily cronjob which creates backups of the duplicate database, so that in the event of corruption we don't lose all our data. (This happened once already.)
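One possible shape for the daily backup job, assuming the duplicate database is a single file and that no retracer is writing to it while the copy is made. The function name and the dated file naming scheme are illustrative choices, not existing conventions.

```python
import datetime
import os
import shutil


def backup_database(db_path, backup_dir, today=None):
    """Copy the database file to backup_dir under a dated name.

    Returns the path of the backup that was written.
    """
    if today is None:
        today = datetime.date.today()
    os.makedirs(backup_dir, exist_ok=True)
    # Example result: <backup_dir>/<database file name>.2007-05-01
    target = os.path.join(
        backup_dir,
        "%s.%s" % (os.path.basename(db_path), today.isoformat()))
    shutil.copy2(db_path, target)
    return target
```

Dated file names keep several generations around, so a corruption that goes unnoticed for a day does not overwrite the last good copy.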

Unresolved issues

Consolidating the duplicate database (i.e. updating the status of each bug in the database to match its status in Launchpad) currently takes about an hour. This could be dramatically improved by direct read-only database access, but that is not currently possible from ronne.