Tag: mailman

I wanted to search discussions on mailing lists and view conversations. I didn’t want to use some webinterface because that wouldn’t allow me to search quickly and offline. So making my mail client aware of these emails seemed to be the way to go. Fortunately, the GNOME mailinglists are mbox archived. So you download the entire traffic in a standardised mbox.

But how to properly get this into your email clients then? I think Thunderbird can import mbox natively. But I wanted to access it from other clients, too, so I needed to make my server aware of these emails. Of course, I configured my mailserver to use maildir, so some conversion was needed.

I will present my experiences dealing with this problem. If you want to do similar things, or even only want to import the mbox directly, this post might be for you.

The archives

First, we need to get all the archives. As I had to deal with a couple of mailinglists and more than a couple of month, I couldn’t be arsed to click every single mbox file manually.

The following script scrapes the mailman page. It makes use of the interesting Splinter library, basically a wrapper around selenium and other browsers for Python.

I use splinter because handling cookies is not fun as well as parsing the web page. So I just use whatever is most convenient for me, I wanted to get things done, after all. The script will print a line for each link it found, nicely prefixed with wget and its necessary arguments for the authorization cookie. You can pipe that to sh but if you want to download many month, you want to do it in parallel. And fortunately, there is an app for that!

Conversion to maildir

After having received the mboxes, it turned out to be a good idea nonetheless to convert to maildir; if only to extract properly formatted mails only and remove duplicates.

I came around mb2md-3.20.pl from 2004 quite soon, but it is broken. It cannot parse the mboxes I have properly. It will create broken mails with header lingering around as it seems to be unable to detect the beginning of new mails reliably. It took me a good while to find the problem though. So again, be advised, do not use mb2md 3.20.

As I use mutt myself I found this blog article promising. It uses mutt to create a mbox out of a maildir. I wanted it the other way round, so after a few trial and errors, I figured that the following would do what I wanted:

wouldn’t work, because the $target would be interpreted as a raw string due to the single quotes. And I couldn’t find a way to make it work so I decided to make it work with the language that I like the most: Python. So an hour or so later I came up with the following which works (kinda):

But well, I shouldn’t become productive just yet by doing real work. Mutt apparently expects a terminal. It would just prompt me with “No recipients were specified.”.

So alright, this unfortunately wasn’t what I wanted. I you don’t need batch processing though, you might very well go with mutt doing your mbox to maildir conversion (or vice versa).

Damnit, another two hours or more wasted on that. I was at the point of just doing the conversion myself. Shouldn’t be too hard after all, right? While researching I found that Python’s stdlib has some email related functions *yay*. Some dude on the web wrote something close to what I needed. I beefed it up a very little bit and landed with the following:

And it seemed to work well! It recovered many more emails than the Perl script (hehe) but the generated maildir wouldn’t work with my IMAP server. I was confused. The mutt maildirs worked like charm and I couldn’t see any difference to mine.

I scped the file onto my .maildir/ on my server, which takes quite a while because scp isn’t all too quick when it comes to many small files. Anyway, it wouldn’t necessarily work for some reason which is way beyond me. Eventually I straced the IMAP server and figured that it was desperately looking for a tmp/ folder. Funnily enough, it didn’t need that for other maildirs to work. Anyway: Lesson learnt: If your dovecot doesn’t play well with your maildir and you have no clue how to make it log more verbosely, check whether you need a tmp/ folder.

But I didn’t know that so I investigated a bit more and I found another PERL script which converted the emails fine, too. For some reason it put my mails in “.new/” and not in “.cur/“, which the other tools did so far. Also, it would leave the messages as unread which I don’t like.

Fortunately, one (more or less) only needs to rename the files in a maildir to end in S for “seen”. While this sounds like a simple

for f in maildir/cur/*; do mv ${f} ${f}:2,S

it’s not so easy anymore when you have to move the directory as well. But that’s easily being worked around by shuffling the directories around.

Another, more annoying problem with that is “Argument list too long” when you are dealing with a lot of files. So a solution must involve “find” and might look something like this: find ${CUR} -type f -print0 | xargs -i -0 mv '{}' '{}':2,S

Duplicates

There was, however, a very annoying issue left: Duplicates. I haven’t investigated where the duplicates came from but it didn’t matter to me as I didn’t want duplicates even if the downloaded mbox archive contained them. And in my case, I’m quite confident that the mboxes are messed up. So I wanted to get rid of duplicates anyway and decided to use a hash function on the file content to determine whether two file are the same or not. I used sha1sum like this:

So it’ll check the files for the same contents via a sha1sum. In order to make uniq detect equal lines, we need to give it sorted input. Hence the sort. We cannot, however, check the whole lines for equality as the filename will show up in the line and it will of course be different. So we only compare the size of the hex representation of the hash, in this case 40 bytes. If we found such a duplicate hash, we cut off the hash, take the filename, which is the remainder of the line, and delete the file.