Several weeks ago MarkMail, a project sponsored and run by Mark Logic, started indexing the KDE mailing list archives. After about a week of hard work, the KDE archives are now directly searchable from MarkMail. Besides interesting analytics, this brings some powerful search capabilities to the table. Read on for a short interview with Jason Hunter, who was responsible for engineering on the project.

Hi Jason! Could you give a little introduction of yourself and Mark Logic?

Hi, KDE! I'm a Silicon Valley hacker. I've been working at Mark Logic for about 5 years now, since the days it was an early startup. We sell MarkLogic Server, a special-purpose database built for content (where "content" is the stuff that's textual, hierarchical, irregular, and not often regularly repeating - like books, articles, and presentations). We use XML as our native data type instead of tables, and pride ourselves on performing very well at high scale.

Until about a year ago I worked with our customers to help them write content apps. I had the idea that we could use the core server to build a public email archive repository, using some of the product features to push the envelope of what people had done before with email archives. That's where MarkMail came from. We started with 4,000,000 emails from the Apache Software Foundation mailing lists.

I've been involved with open source for a long time, leading JDOM and participating as a member of the Apache Software Foundation, so it felt natural to put MarkMail to work initially on the problem of getting more value from open source mailing lists.

Konqueror showing MarkMail's search results

Why did you decide to grab the KDE mailing lists?

Cornelius Schumacher started the ball rolling when he asked if we could load the KDE lists. OK, that's not quite true. We have a long list of communities whose lists we hope to load, and KDE was actually on that list since the very beginning. It's just that one day in April we heard from Cornelius, and the next day received a separate request from Adriaan de Groot. That popped KDE to the top of the priority list.

The KDE mailing lists aren't the largest you have at MarkMail, but they sure aren't small. Did that pose any problems?

Yes, KDE is Big. At current count there are 2.7 million KDE emails. Hosting those emails isn't an issue (we're designed to scale to hundreds of millions) but we had to work hard to gather clean historical archives. We have one person on the MarkMail team dedicated only to this (we like to call him an email archaeologist. I'm not sure he's happy about that nickname).

Why the challenge? Well the most authoritative archives for KDE were the web-based Pipermail archives (I'm using past tense because I'd like to think that today the most authoritative archives are in MarkMail). Pipermail exposes a set of "mbox" files for each archived list. Very handy. The mbox file format is a classic storage format for email and a format from which we can readily load. But as we found out, the mbox files aren't really mbox and there was a lot of post-processing we had to do. Some examples:

Pipermail "scrubs" attachments from its mbox files. Instead of placing the attachment content into the message as normal, it gets placed at an external URL with a marker in the message dictating where you can find it. We had to recognize the scrubbed references, fetch the attachments, and then inline the contents. Sounds simple, doesn't it? It probably would be if the external links were always accurate. Sometimes we could guess and fix things and sometimes we couldn't - bonus points go to anyone who finds an email in MarkMail mentioning an attachment that doesn't really exist. Extra bonus points if you know our search syntax well enough to write a query that directly lists those emails.
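As a rough illustration of the first step described above (recognizing the scrubbed references), here is a minimal sketch in Python. It is not MarkMail's code. The marker layout it looks for ("... attachment was scrubbed..." followed by "Name:", "Type:" and "URL:" lines) is an assumption modelled on what Mailman's scrubber typically writes, and the function name and sample URL are hypothetical. Fetching and inlining are left out, since that is where the inaccurate links mentioned above come in.

```python
import re

# Assumed marker layout; real archives vary, which is exactly the
# difficulty described in the interview.
SCRUBBED_BLOCK = re.compile(
    r"^An? [^\n]*? attachment was scrubbed\.\.\.\n"  # opening marker line
    r"(?P<fields>(?:[A-Za-z-]+ ?:[^\n]*\n?)+)",       # Name:/Type:/URL: lines
    re.MULTILINE,
)
URL_FIELD = re.compile(
    r"^URL ?:\s*<?(?P<url>[^>\s]+)>?\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def find_scrubbed_urls(body):
    """Return one entry per scrubbed-attachment marker found in `body`.

    Each entry is the URL the marker points to, or None when the marker
    has no usable URL line (a reference that cannot be resolved).
    """
    urls = []
    for block in SCRUBBED_BLOCK.finditer(body):
        match = URL_FIELD.search(block.group("fields"))
        urls.append(match.group("url") if match else None)
    return urls


sample = (
    "Here is the patch.\n"
    "-------------- next part --------------\n"
    "A non-text attachment was scrubbed...\n"
    "Name: fix.diff\n"
    "Type: text/x-diff\n"
    "Size: 1024 bytes\n"
    "Desc: not available\n"
    "URL: <http://lists.example.org/pipermail/demo/attachments/fix.diff>\n"
)
print(find_scrubbed_urls(sample))
```

A caller would then try to download each URL and splice the content back into the message, keeping the None entries (and failed downloads) as the "attachment that doesn't really exist" cases.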

Then there's the problem with character encodings in old emails. If you look at an mbox file it seems like ASCII, but in fact it's a binary file. That's because each message may have a different character encoding for its body (or even a portion of the body). The Pipermail list archiver didn't always realise this, and fixing that was non-trivial and imperfect.
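To make the encoding point concrete, here is a small sketch using Python's standard email package. It shows why a message has to be handled as bytes: each text part is decoded with the charset that part itself declares. This is only an illustration of the idea, not how MarkMail does it. The function name is hypothetical, and the fallbacks (ASCII when no charset is declared, Latin-1 when the declared one is unknown) are assumptions.

```python
import email
from email import policy


def decode_text_parts(raw_message):
    """Decode every text part of a raw message (bytes) to a str.

    Each part is decoded with its own declared charset, so two parts of
    the same message, or two messages in the same mbox file, may use
    different encodings.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        payload = part.get_payload(decode=True)  # bytes, base64/QP undone
        if payload is None:
            continue
        # Assumption: fall back to ASCII when no charset is declared.
        charset = part.get_content_charset() or "ascii"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Assumption: unknown charset name, Latin-1 never fails.
            texts.append(payload.decode("latin-1"))
    return texts


latin1_mail = (
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n\n"
    b"Gr\xfc\xdfe\n"
)
utf8_mail = (
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n\n"
    b"Gr\xc3\xbc\xc3\x9fe\n"
)
print(decode_text_parts(latin1_mail), decode_text_parts(utf8_mail))
```

The two sample messages contain different bytes but decode to the same text, which is what goes wrong when an archiver treats the whole file as one encoding.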

There are more examples, but I don't need to bore you. I should make clear it's nothing special with KDE or even with Pipermail. Turns out if you load a couple million emails you'll see at least one example of almost every problem that's ever existed. It's the same for every community, just with different challenges.

Graphically drilling down to a specific date

You mentioned pushing the envelope. Can you give an example of that?

Sure, here's a good example: When you do a search, besides getting the top 10 most relevant emails, you see lots of analytics. You see a histogram chart showing the number of messages matching your query each month across time. With it you can watch trends for lists, people, ideas, or any combination. Every query also shows the top senders, lists, attachment types, and message types for the messages matching the query. You can learn who's an expert on a topic, on what lists something is being discussed, which people are most involved on lists, and so on. By dragging across bars on the graph you can limit the view to just a particular time period. You can also click on any person's name or list name to limit the search. It's convenient to start with a simple query and refine interactively.
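The per-month histogram described above boils down to counting matching messages by the month in their Date header. The following is a minimal sketch of that counting step only, assuming the matching messages' Date headers are already at hand. The function name is hypothetical and this says nothing about how MarkMail computes its charts internally.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def monthly_histogram(date_headers):
    """Count messages per calendar month, keyed as 'YYYY-MM'.

    `date_headers` is an iterable of RFC 2822 style Date header values.
    Unparsable dates are skipped rather than guessed at.
    """
    counts = Counter()
    for value in date_headers:
        try:
            sent = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # malformed Date header, common in old archives
        counts[sent.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


dates = [
    "Mon, 05 May 2008 10:00:00 +0200",
    "Tue, 06 May 2008 09:30:00 +0000",
    "Sun, 01 Jun 2008 12:00:00 +0000",
    "garbage",
]
print(monthly_histogram(dates))
```

The same Counter approach would work for the other facets mentioned (top senders, lists, attachment types) by counting a different header or field.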

We've also strived to make the site easy to navigate. You can hit "n" and "p" to go to the next and previous search results. To move up and down the thread view you hit "j" and "k" (a homage to vim users). If you find an attachment (search for ext:pdf) you can view it inline in your browser.

Oh, and here's a little-known tip. If your screen is sufficiently wide, we give you all three panes (analytics, results, messages) at once. If not, you get the "slide".

Do you have any tips for the KDE community to take advantage of the available capabilities in MarkMail?

The first thing to remember is that you can limit your view to KDE-related mails by going to kde.markmail.org. The use of a subdomain adds an implicit constraint to all your queries.

Are there also services like these for non-public mailing lists, or at least a means to stop Google indexing?

The reason I detest public archives is that they get indexed by Google and may also pose legal liabilities under German law (Abmahnungen) that you can hardly manage if enforced in a ruthless manner. They further invade privacy.

I always had a problem with attachments to mailing list archives under mailman.

This is a nice idea, but I think there is a big privacy issue. First, a lot of bug reports get automatically forwarded to mailing lists, exposing the email address of the reporter to the public and to search engines. I know this has happened with the old archives, but now it is even easier. I guess most of the bug reporters are not aware of this fact. A similar problem arises with the support email addresses of some KDE programs, which are in fact mailing lists. Here, too, the email addresses of people asking for support are exposed to the public through the archives being scanned by search engines. It would be great if there was a solution for this.

Can I get free advertising for my proprietary privacy invading web20 web site just cause I'm barely helping a FOSS project so I can get free advertising for my proprietary privacy invading web20 web site?

Have you ever heard about lists.kde.org? It already has everything needed to invade your privacy; this site adds nothing new to it. There is also gmane.org, where you even have a search interface (for example: http://dir.gmane.org/gmane.comp.kde.devel.core). The problem with privacy is more general: all the data is available regardless of MarkMail.

What MarkMail adds is a really good interface for browsing the archives (which is really important for people who have not been subscribed to those lists for ages and thus don't have a local copy of the history). For example, one can use it to check whether a given topic was already discussed before asking about it oneself. Yes, of course it's not free, and that sucks. But hey, the google is also not free, but you don't blame it when it sponsors FOSS projects, do you ?

And about free advertising: well, it's actually not free. The company does something for the project and as a reward gets advertisement. It's quite common in the free software world, it's good for both the company and the project, and it is a very natural way of working together.

There are plenty of websites that offer nice interfaces for searching mailing lists, source code, coders etc... Are we going to advertise each and every one of them that helps FOSS a little more than other especially if they are not FOSS themself? (not to mention this one is full of flash nasty stuff).

> But hey, the google is also not free, but you don't blame it when it sponsors FOSS projects, do you ?

What does that have to do with it? I've never seen an announcement on the dot about how Google products should be used to help KDE developers. The only Google announcements here are about Google SoC, which are not about how people should use Google products. And as far as I know, MarkMail is not sending any checks to young KDE hackers.

"Are we going to advertise each and every one of them that helps FOSS a little more than other especially if they are not FOSS themself?"

Probably not. And why should they?

"not to mention this one is full of flash nasty stuff)."

Then don't use it. Sheesh. "we should not talk about this web-service because it uses Flash!"...

And if you consider this service to be "invasion of privacy".... Well, how on earth can you expect one shred of privacy on a PUBLIC mailinglist? All the data they display and use is 100% public by default.

Well, the search is for the "better tool". In recent years the concern for privacy has been growing, as no data is lost anymore and governments apply anti-terror madness. People are not sensitive enough and reveal everything about themselves they can on MySpace and the like.

On the other hand, mail encryption is still immature and poses a usability burden. Whoever uses Gmail makes the decision to go public. Fine.

Paranoid mode: when I reveal what software version I am using I consider this a personal security problem for a targeted attack.

Two-thirds of respondents throughout the EU (65%) indicated that their company transferred personal data via the Internet. One in three respondents (32%) admitted that their company did not take any security measures when transferring personal data over the Internet.

Among companies that transferred personal data to non-EU countries, almost half of respondents (46%) indicated that this data mostly concerned clients or consumers data for commercial purposes, and 27% said it was human resources data for HR purposes.

"Security by obscurity" relates to non-transparency of the tools as such.

If someone has the ability to find out that I am using a service x, he can target me specifically. E.g. I get a lot of phishing messages for financial institutions I am not a client of. But if an attacker has access to my customer data, or knows that I contacted the bank and who the person at the bank is, he might succeed, as he can customize the attack in a more intelligent way. Don't reveal what you don't have to reveal.

Even the fact that you use Linux may be an issue you have no desire and no reason to disclose to the public at large. It is as private as which books you have in your room. I don't want "my room" to communicate to the rest of the world which books I store.

For submitting a defect report an association with my real name is factually irrelevant.

That's normal in FOSS - if someone does something for the project, we try to thank them. MarkMail didn't ask for this interview, you know... WE asked them to index our stuff, they did, we decided it would be cool to let the world know about it.

All email addresses shown by MarkMail are obfuscated. Only by clicking on an address and solving a CAPTCHA will you be able to see what the true address is. This is true for any emails mentioned in the body of a message also.

The goal is to make the site useless to address harvesters but useful to people who want to communicate with each other. We think this strikes a nice balance.
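For readers curious what the display side of such obfuscation can look like, here is a small hypothetical sketch. The masking format (first three characters of the local part, then "...") and the function name are assumptions made for illustration, not MarkMail's actual scheme, and the CAPTCHA step that reveals the full address is not shown.

```python
import re

# Deliberately simple address pattern; assumed good enough for a sketch,
# not a full RFC 5322 parser.
ADDRESS = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def obfuscate_addresses(text):
    """Mask the local part of every email address found in `text`."""
    def mask(match):
        local, domain = match.group(1), match.group(2)
        return local[:3] + "...@" + domain
    return ADDRESS.sub(mask, text)


print(obfuscate_addresses("Contact jason.hunter@example.org for details."))
```

Applying the same function to message bodies as well as headers covers addresses mentioned inside the text of an email.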

No, if I click on a message, I get a *list* of messages in a conversation, not a tree-like view that gives me an idea of which message is a reply to which other message (the different threads in the conversation). Or is there something I am overlooking here?

I personally don't like that it wants to use Flash for the graph. The layout (results in a scrollbar container) seems broken, but that could be my outdated KHTML (3.5.9). Overall, when searching for myself, I found a pretty nice set of results and a good way to view them.

I would say this site is very useful. Only a pity that its software is not Free, is it?

I really wish MarkMail provided an NNTP interface. I far prefer using a good NNTP client with threading to any online forum/list archive. I find discussions a whole lot easier to follow. I also prefer it to mailing lists since I don't have to worry about my mail account filling up with lots and lots of messages.