What is MarkMail?

MarkMail is a community-focused searchable message archive, accessible at http://markmail.org, developed and hosted by MarkLogic Corporation.
It provides end users
with powerful search and discovery tools for finding answers and understanding
activity in popular mailing lists, such as those used by open source projects.

What powers MarkMail?

MarkMail is powered by
MarkLogic Server,
an Enterprise NoSQL database,
built to load, query, manipulate, and render large amounts of data.
In MarkMail every email is represented and held as an XML document.
All the text searches, faceted navigation, analytic queries and
HTML page renderings you see are performed by a small
MarkLogic Server cluster against millions of XML documents. You can
download a free copy
of MarkLogic Server, if you're interested in checking it out.

Why did you create MarkMail?

One reason we built the site is to show what MarkLogic Server can do. Many of
our customers are content
publishers who build really interesting sites on our platform but host them
behind passwords, for paying customers only. Other customers of ours work in
the defense and intelligence sectors, so you're even less likely to see their
work. MarkMail gives us a public, open site to demonstrate MarkLogic Server
capabilities.

We've chosen to focus on email because we believe there is tremendous
value in email archives, and they are so underutilized today.
So much information is locked up in email, read once and never
to be found again. We're hoping to change that. There's lots of potential.
We're starting with public email lists.

Will you load more lists?

Absolutely. We're constantly loading new list archives. We prioritize
which list archives to load based on feedback, so
let us know if you'd like us to host
the archives for your community.

Why would I use MarkMail instead of Google or some other search engine?

Many reasons:

1. Scope. Many of the emails we host aren't in Google's
index. Or at least they weren't in Google's index before we went online. But
even now with Google and the rest spidering us you'll still want to use
MarkMail because...

2. Speciality. Search engines only index public email if
it appears on an HTML page somewhere. Google doesn't know why the words are
on the page - doesn't know the sender, the date, the subject, what's in the
body of the attachments, and so on. We do. So if you want to search for an
issue with Apache James involving respooling, with MarkMail you type
list:james respool. And hey, if you remember the email you want
came from your friend James you type list:james from:james
respool and you're there. You can be very specific because MarkMail is
a site built specifically for email.

3. Structure. We know about the structure of email.
This lets us exclude searching those annoying "copyright notice" footers at
the bottom of emails, add relevance weight to the important message headers,
reduce the importance of quoted message text, or let you exclude quoted
message text if you'd prefer (see the opt:noquote feature). We
even understand the structure inside attachment files.

4. Analytics. Where else have you seen a chart showing
the historical activity corresponding to any arbitrary query you type in?
It's a ton of fun to watch activity trends for lists, people, and keywords, or
any mix of the three. Don't forget to use a minus sign to negate a query:
javaone -from:sun.com.

5. Attachments. Do a search for emails with PowerPoint
attachments (hint, the query is ext:ppt), click on the attachment
link, and watch how you can view the attachment without leaving the search
results! Same goes for Word files and PDFs. If you include a search term,
we'll even show you which slides include the term: ext:ppt axis.

6. Convenience. We've worked to build the MarkMail site
as an immersive experience. We don't like how regular search engines make you
click, read, hit the Back button, only to click, read, and hit the Back button
again. With MarkMail all the important information stays in front of you all
the time, results right next to hits.

7. Shortcuts. Part of convenience is keyboard shortcuts.
Try hitting "n" and "p" to move to the next and previous emails in the
results. You can hit "s" to jump to the search box, and "x" to close the
attachment popup. Fans of the vi editor will find comfort in using "j"
and "k" to move up and down the messages listed in the thread view.
Want more? See the full list of keyboard shortcuts later in this FAQ.

8. Security. From previous experience running email archive
systems, we know one of the biggest complaints of users is showing their
email addresses out in the open where spam harvesters have a field day
collecting them. We obfuscate every email address in the system, even those
in the bodies of messages, something most sites overlook. You can still
search on email addresses (because we know what they are) and you can view the
email addresses in a particular message if you solve a captcha (one of those squiggly
line words that prove to us you're a human).
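
As a rough illustration of the address obfuscation described in the Security item, here is a minimal Python sketch. The regular expression and the masked output format (first character of the local part, then "...") are assumptions made for this example; the FAQ does not say how MarkMail actually rewrites addresses.

```python
import re

# Deliberately simple pattern for illustration; real-world address syntax is broader.
EMAIL_RE = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def obfuscate(text: str) -> str:
    """Mask the local part of every email address found in `text`.

    The masked format is hypothetical, not MarkMail's actual output.
    """
    return EMAIL_RE.sub(lambda m: m.group(1)[0] + "...@" + m.group(2), text)


print(obfuscate("Contact jane.doe@example.com today"))
# Contact j...@example.com today
```

Because the same function can run over headers and message bodies alike, addresses quoted inside a message get masked too.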

How do I post a reply to a message I've found?

MarkMail hosts list archives in read-only mode and doesn't yet provide
a mechanism for you to participate in mailing list discussions directly.
To post an email to one of the lists, you need to use your normal
mail program (Outlook, Thunderbird, etc.) to send an email to the list.

You can derive the listname from our archive name. If the archive name
is com.domain.project.list then the list's public email address is
list@project.domain.com.
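
The rule above (the last component is the list name, and the remaining components are the domain in reverse order) can be sketched in Python as follows. The function name `list_address` is made up for this example, and the sketch assumes every archive name follows the pattern exactly.

```python
def list_address(archive_name: str) -> str:
    """Derive a list's posting address from a MarkMail-style archive name.

    Assumes the name is a reversed domain followed by the list name,
    e.g. "com.domain.project.list".
    """
    parts = archive_name.split(".")
    if len(parts) < 3:
        raise ValueError(f"unexpected archive name: {archive_name!r}")
    *reversed_domain, list_name = parts
    domain = ".".join(reversed(reversed_domain))
    return f"{list_name}@{domain}"


print(list_address("com.domain.project.list"))
# list@project.domain.com
```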

Important Note: most of these mailing lists require you to subscribe before
posting, as a way to reduce spam. You can find subscribe instructions
on the project web sites. If you're interested in having us expose a
direct-reply capability, write in.

Is there a discussion forum?

Absolutely. We've set up an email
list for discussion. You have to join to post, to reduce spam. We
anticipate low traffic. If you'd like to kibitz with the creators of the
site, sign up. And of course, there's a searchable
archive.

I found a bug, how do I file it?

What's next for the site?

We have a long list of ideas. If you'd like to make your own suggestions,
join the discussion forum. It's always better to add what people want than
only what you think they want. Legally, only post thoughts to the public
forum that you're OK with other people seeing and our engineering team
implementing in MarkMail (see the
Terms of Use and our
Content Policy
for the full terms and conditions about this and other matters).

How many emails and lists do you manage?

The archive includes many millions of emails across thousands of lists.
Current counts are always displayed on the home page. New messages
to those lists are added continuously throughout the day, and are immediately
available for search.
We also run a version of MarkMail inside MarkLogic for our internal mailing
lists (but it's a lot smaller).

How many senders does that make?

In the original archive of 4,000,000 Apache emails we counted a little over
150,000 unique poster names. In 2007 we saw almost 20,000 posters
not seen in years past (and 12,000 repeat posters).

How many emails have attachments?

We find roughly 1% of emails have attachments.

Which browsers do you support?

We strive to work with Firefox 1.5+, Internet Explorer 6+, Opera 9+, and
Safari 1.2+. More recent browsers sometimes have more features. The site
works best if you leave your font size normal.

How can I help?

Can I buy one?

At this point MarkMail is only offered as a free service hosting public
emails. If you're interested in getting MarkMail for your own lists, company,
or personal inbox, let us know using the feedback
form.

What search syntax are you using, and what can it do?

If you're a beginner, just type your word or phrase into the search box.
We'll show you relevant results, and you can always use the dynamic
faceted navigation links in the left
side analytics pane to get more specific based on list, sender, attachment
types, and message types.

If you're using the MarkMail gadget on iGoogle.com, the search box
can be found in the Edit Settings dialog. Clicking on any message
or thread will bring you through to markmail.org. The tabs in the
gadget let you see some of the same analytics you can see in the
faceted navigation at MarkMail.

Over time you may want to learn our search syntax, explained below.
We support searching for...

Search Capability                       Example Query
General terms:                          javaone
Or phrases:                             "godwin's law"
Terms in the sender's name or email:    from:"Roy Fielding", from:fielding, from:ibm.com
Terms in the subject:                   subject:"apachecon eu hackathon"
Terms in the list name:                 list:tomcat
An attachment file extension:           ext:ppt
Any part of the attachment name:        attach:ajax
The classification of the message:      type:development type:checkins
Messages by date:                       date:2007, date:200712, date:20071230
Or by date range:                       date:1995-2001, date:20070502-20070909
Or by one-sided range:                  date:20050509-, date:-2001
Negations:                              from:mazzocchi -list:cocoon
Order:                                  javaone order:date-backward

Search terms are case insensitive. They are also stemmed, meaning a search
for "issue" will additionally match "issues" and vice-versa.

Constraints are ANDed together except in the case of multiple fielded
constraints of the same type which will be OR'd together. So for example the
query list:tomcat list:struts includes both tomcat and struts
lists.

Any of the above constraints may be negated with a minus sign indicating that
matching messages should be excluded from the results. For example, the
following query finds Tomcat traffic excluding CVS or SVN commits:
list:tomcat -type:checkins. Negations are not OR'd; more
negations always exclude more messages.
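
To make the combination rules concrete, here is a small Python sketch that evaluates fielded constraints against a message held as a plain dictionary. It is only a model of the semantics described above, not how MarkLogic Server evaluates queries. The function name `matches`, the dictionary layout, and the use of case-insensitive substring matching are all assumptions, and the sketch ignores free-text terms and quoted phrases.

```python
from collections import defaultdict


def matches(message: dict, query: str) -> bool:
    """Return True if `message` satisfies a query made of field:value tokens."""
    positive = defaultdict(list)  # field -> accepted values (OR'd together)
    negative = []                 # (field, value) pairs; each one excludes on its own
    for token in query.split():
        negated = token.startswith("-")
        field, sep, value = token.lstrip("-").partition(":")
        if not sep:
            raise ValueError(f"only field:value tokens are handled: {token!r}")
        if negated:
            negative.append((field, value.lower()))
        else:
            positive[field].append(value.lower())

    # Negations are never OR'd: any single match removes the message.
    for field, value in negative:
        if value in message.get(field, "").lower():
            return False

    # Different fields are ANDed; repeated constraints on one field are OR'd.
    return all(
        any(value in message.get(field, "").lower() for value in values)
        for field, values in positive.items()
    )


commit = {"list": "tomcat-dev", "type": "checkins"}
print(matches(commit, "list:tomcat list:struts"))     # True
print(matches(commit, "list:tomcat -type:checkins"))  # False
```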

By default search results are ordered by relevance, most relevant first.
If the search string contains order:date-forward the results
will be sorted chronologically, oldest to newest; if the string contains
order:date-backward it's the reverse. If that's too much
typing you can try order:df and order:db.
Searches that don't have a meaningful relevance score, like
list:tomcat, are by default sorted date-backward.

If the search string contains the special opt:noquote argument the results
will avoid searching text that appears in quoted messages. This can be used
to find messages where a particular person said something.

Lastly, the opt:nostem argument in a search indicates you
want the search without "stemming", meaning a search for run won't
match runs or ran as would normally happen. This lets you be
a bit more precise in what you're looking for.
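
The date forms listed in the syntax table can be read as inclusive date ranges. The Python sketch below shows one plausible interpretation: a bare year or month expands to the whole period it covers, and a one-sided range leaves the missing end open (`None`). These expansion rules and the function names are assumptions for illustration; the FAQ does not spell out the exact boundaries MarkMail uses.

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def _expand(token: str, end: bool) -> date:
    """Expand YYYY, YYYYMM, or YYYYMMDD to the first or last day it covers."""
    if len(token) not in (4, 6, 8) or not token.isdigit():
        raise ValueError(f"unsupported date token: {token!r}")
    year = int(token[:4])
    month = int(token[4:6]) if len(token) >= 6 else (12 if end else 1)
    if len(token) == 8:
        day = int(token[6:8])
    else:
        day = calendar.monthrange(year, month)[1] if end else 1
    return date(year, month, day)


def parse_date_constraint(value: str) -> Tuple[Optional[date], Optional[date]]:
    """Turn the value of a date: constraint into an inclusive (start, end) pair."""
    start_tok, sep, end_tok = value.partition("-")
    if not sep:  # single date such as 2007 or 200712
        return _expand(start_tok, end=False), _expand(start_tok, end=True)
    start = _expand(start_tok, end=False) if start_tok else None
    finish = _expand(end_tok, end=True) if end_tok else None
    return start, finish


print(parse_date_constraint("200712"))
# (datetime.date(2007, 12, 1), datetime.date(2007, 12, 31))
print(parse_date_constraint("-2001"))
# (None, datetime.date(2001, 12, 31))
```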

Are there any useful keyboard shortcuts?

We're geeks, so we love keyboard shortcuts. Here's a list:

Keystroke    Action
n/p          Next and previous in search results
j/k          Up and down in the thread view
s            Jump to the search box
arrows       Move left and right in the attachment popup
v            Toggle text/image view in the attachment popup
z            Toggle zoom in the attachment popup
x            Close the attachment popup

Can I put a MarkMail search box on my site?

Sure, and that can be convenient if you want to give people the opportunity
to search messages without visiting the MarkMail site first. Just copy and
paste this HTML into your site:

You should replace "tomcat" with the name of your list or project. You can
adjust the default search constraint too. For example, the following searches
messages sent by a particular person, in this case Sam Ruby:

How do I link to MarkMail?

We love incoming links. You'll probably want to link to one of our
project-based homepages:

What to Search     Sample URL
All of MarkMail    http://markmail.org
Apache lists       http://apache.markmail.org
Tomcat lists       http://tomcat.markmail.org
Ant lists          http://ant.markmail.org
Maven lists        http://maven.markmail.org
MySQL lists        http://mysql.markmail.org

(You get the idea.)

How do I link to an email or thread?

Each email in MarkMail has a special canonical identifier, something people
frequently call a "permalink". It looks like
http://markmail.org/message/fbdkpdqfgutyp47h.
Next to the message headers for each email you'll see a link offered that's
that email's permalink, something you can bookmark or email. There's another
link near the search results that's a permalink for the full browser state.
If bookmarked or emailed this link will recreate the whole view including the
search performed and selected message.

How can I avoid having my emails archived?

MarkMail supports the x-no-archive header. If your email
includes this header with a value of yes, then in most situations
your email will not be shown publicly on the MarkMail site. Note that because
in Apache's history this header has often been added accidentally by list
management software, we may in some cases choose to display emails with this
header set. There may be other situations in the future where we judge it
best to ignore the header. The best and only 100% reliable way to make sure
your email doesn't appear in MarkMail or an archive system like it is to avoid
posting to a public mailing list.
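
If your mail client does not offer a way to add custom headers, you can also set the header programmatically. The sketch below uses Python's standard `email` package; the addresses and subject are placeholders, and actually sending the message (for example with `smtplib`) is left out.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.com"          # placeholder sender
msg["To"] = "list@project.example.com"     # placeholder list address
msg["Subject"] = "Question about respooling"
msg["X-No-Archive"] = "yes"                # header names are case-insensitive
msg.set_content("Please do not archive this message.")

print(msg["X-No-Archive"])
# yes
```

As noted above, the header is a request rather than a guarantee.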

Can you load my private email?

At this point, no. But if you have specific feature ideas, let us
know. And if you'd like to see MarkMail running on your private e-mail,
send us a note using the feedback form,
because we're curious to hear from you. No promises, mind you.

Can I just browse the lists and messages?

Yes, we have a rudimentary browse interface.
It's primarily intended for web crawlers and debugging, but you can use it too if you'd like.
No need to bookmark, there's a "Browse" link in the footer of every page.

Tip for geeks: You can manually add a ?q=<term>
constraint to the query string of the browse page to restrict your view to messages matching
the constraint. For example manually typing
http://markmail.org/browse?q=from:mcclanahan
gets you a browsing look at emails from Craig McClanahan.
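
If you are generating such links from a script, a small helper along these lines may be handy. This is a minimal Python sketch: the helper name `browse_url` is made up, and only the single-constraint form shown above comes from this FAQ. Whether the browse page accepts more complex queries is not stated here.

```python
from urllib.parse import urlencode


def browse_url(query: str) -> str:
    """Build a browse-page URL with a ?q= constraint.

    The colon is kept unescaped so the result matches the form shown in the FAQ.
    """
    return "http://markmail.org/browse?" + urlencode({"q": query}, safe=":")


print(browse_url("from:mcclanahan"))
# http://markmail.org/browse?q=from:mcclanahan
```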

Any known issues?

There are a few known issues:

Use of large fonts or larger DPI settings can cause text to exceed the allotted bounds

Firefox can report "Transferring data from markmail.org..." in the browser's status bar
even after all data has been transferred

How do I request that content be removed?

TECHIE FAQ

What's hard about searching email?

Email doesn't work well in a relational model because there's too much free
text (all those words in the message). It doesn't work well in a search
engine either because there's too much ad hoc structure (headers, footers,
paragraphs) and hierarchy (attachment files containing pages containing
paragraphs). And if you try to marry relational with search you only get
into trouble with slow joins across the systems and indexes being out of
sync and slow to update.

We've found email works naturally as XML, with the structure represented by
elements and hierarchy by the nesting of the elements. We're hoping that on
this solid base we'll be able to push the envelope for email management.

How do you store the emails?

Each email is stored as an XML document inside MarkLogic Server. Every document
representing a message has a <message> root element with attributes to
hold the message's unique id, thread id, list name, date, and so on.

Underneath the root, there's a <headers> element containing more
elements, one for each of the email's headers. For special headers like
<from> we examine the content and extract it in a normalized format, so
for example <from> has @address and @personal
attributes holding the sender's email and real name. We use these attributes
for search and display instead of the raw From: header value.

Following <headers> there resides an optional <attachments>
element that holds information about the email's attachments - pointers to
their original versions, any image representations, and (for attachments we
know about like DOC, PPT, and PDF) the content on each page. The binary files
reside in MarkLogic Server also.

Lastly, under the root there's a <body> element that holds the message
content. The body isn't held as simple text but rather as <para>
elements and <quotepara> elements with attributes indicating the quote
level. We also have <footer> elements. Each footer has a @type
attribute to let us know if it's a person's signature block, a confidentiality
statement, a footer added by a free hosting provider, a list subscription
management line, or several other classifications. Signature footers we
display in italics, confidentiality footers we display in gray, and some
footers aren't displayed at all. Inline we've added markup for recognized
emails and URLs. This enables targeted searching against these items as well
as custom display (e.g. email obfuscation). A fun query we can do with all
this markup is: given a person's name find the email address embedded within
their most recent footer. (Note that we haven't exposed this query to the
public yet. Let us know if you want us to.)
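
To give a feel for this layout, here is a hypothetical sample document and a short Python sketch that reads it with the standard `xml.etree.ElementTree` module. The element names and the `address`, `personal`, and `type` attributes come from the description above. The root attribute names (`id`, `thread-id`, `list`, `date`), the `level` attribute on `<quotepara>`, the placement of `<footer>` inside `<body>`, and all sample values are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample; attribute names on <message> and <quotepara> are guesses.
SAMPLE = """
<message id="abc123" thread-id="t456" list="com.example.project.dev" date="2007-12-30">
  <headers>
    <from address="jane@example.com" personal="Jane Doe">Jane Doe &lt;jane@example.com&gt;</from>
    <subject>Re: respooling issue</subject>
  </headers>
  <body>
    <quotepara level="1">Has anyone seen mail get respooled twice?</quotepara>
    <para>Yes, I hit this last week.</para>
    <footer type="signature">Jane</footer>
  </body>
</message>
"""


def sender(message_xml: str) -> Tuple[str, str]:
    """Return the normalized (name, address) pair from the <from> element."""
    frm = ET.fromstring(message_xml.strip()).find("./headers/from")
    return frm.get("personal"), frm.get("address")


def unquoted_text(message_xml: str) -> List[str]:
    """Return only the paragraphs the sender wrote, skipping quoted text and footers."""
    root = ET.fromstring(message_xml.strip())
    return [p.text for p in root.findall("./body/para")]


print(sender(SAMPLE))         # ('Jane Doe', 'jane@example.com')
print(unquoted_text(SAMPLE))  # ['Yes, I hit this last week.']
```

Selecting only `<para>` elements is the same idea behind the opt:noquote search option: quoted text lives in its own elements, so it is easy to leave out.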

What generates the charts?

The raw data for the charts comes from a MarkLogic Server product feature
called lexicons. Lexicons enable fast calculation of the distinct values for
a specific element or attribute along with the occurrence frequency of each
value. To generate the historical activity chart, we ask for the distinct
month values and frequencies across the emails satisfying the user's query.

We pass that raw data as XML to a commercial Flash charting library whose code
we have modified to better suit our purposes.
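
Conceptually, the data behind the activity chart is a list of (month, count) pairs. The Python sketch below computes the same shape of result by brute force over a list of ISO date strings. It is only an analogy: a lexicon is an index maintained inside MarkLogic Server, so the real system does not scan messages like this. The function name and input format are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def monthly_activity(message_dates: Iterable[str]) -> List[Tuple[str, int]]:
    """Count messages per month from ISO dates such as "2007-12-30"."""
    counts = Counter(d[:7] for d in message_dates)  # "YYYY-MM" prefix
    return sorted(counts.items())


print(monthly_activity(["2007-11-02", "2007-12-01", "2007-12-30"]))
# [('2007-11', 1), ('2007-12', 2)]
```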

How do you load the emails?

A Java program converts emails to XML and loads them into MarkLogic Server.
The program can load emails from an mbox archive or catch new emails as they
come across the wire. We subscribe a user to each archived list in order to
receive the new messages.

The delay between message receipt and searchability is less than a second.
You're always searching the very latest.
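
The real loader is the Java program mentioned above. As a stand-in, the following Python sketch shows the general idea of turning a raw email into XML with a normalized `<from>` element and separate `<para>` and `<quotepara>` elements. The paragraph splitting rule (blank lines separate paragraphs, a leading ">" marks quoted text) and the sample message are simplifying assumptions, and loading the result into MarkLogic Server is not shown.

```python
import email
import xml.etree.ElementTree as ET
from email.utils import parseaddr

RAW = """\
From: Jane Doe <jane@example.com>
Subject: Re: respooling issue

> Has anyone seen mail get respooled twice?

Yes, I hit this last week.
"""


def email_to_xml(raw: str) -> ET.Element:
    """Convert a plain-text (non-multipart) email into a <message> element."""
    msg = email.message_from_string(raw)
    root = ET.Element("message")

    headers = ET.SubElement(root, "headers")
    personal, address = parseaddr(msg["From"])
    frm = ET.SubElement(headers, "from", address=address, personal=personal)
    frm.text = msg["From"]
    ET.SubElement(headers, "subject").text = msg["Subject"]

    body = ET.SubElement(root, "body")
    for block in msg.get_payload().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        tag = "quotepara" if block.startswith(">") else "para"
        ET.SubElement(body, tag).text = block
    return root


print(ET.tostring(email_to_xml(RAW), encoding="unicode"))
```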

How do you do the search?

We use the XML-aware full text search capabilities built into MarkLogic
Server. With it we're able to use all the markup in each message as part of
the query - even though the markup is hierarchical, ordered, and irregular.
See How do you store the emails?

What runs the front-end?

We actually use MarkLogic Server to generate the HTML pages. Well,
technically they're XHTML. Our back-end data model holds emails as XML, each
visitor's browser requires (X)HTML, and in between we have the XML-centric
scripting language XQuery. It seems wasteful and unnecessarily difficult to
marshal XML into objects or strings and then marshal it back out as XML.
Instead we query XML, process XML in an XML-aware scripting language, and
output XML as XHTML. It's the only time in our web lives we haven't suffered
an impedance mismatch at some level of a web architecture. And as a bonus,
the XQuery language always outputs well-formed XHTML.

How will this scale?

At this point we're running on a small MarkLogic Server cluster,
utilizing the inbuilt capabilities of MarkLogic Server to scale as necessary
for user load and email traffic. The largest clusters deployed at some of
our customer projects exceed 100 machines. We have loads of runway left.

This sounds like fun, are you hiring?

Actually, yes, we are hiring. If you're the kind of person who reads to the
bottom of a technical FAQ (ahem), you reside in or can relocate to the Bay Area,
and you'd like to work as part of a small team focused on making MarkMail
even better, let us know.