Saturday, October 16, 2010

It's been about seven months since I last ran code coverage tools on Thunderbird, so I thought I would do it again. In these intervening seven months, Thunderbird has moved to libxul, and the mozmill tests have become more important, which means I get to change my methodology slightly.

Problematic caveats first, then. For various reasons, my build for code-coverage results runs on a 64-bit machine which I have to ssh to get to and on which I lack administrative privileges. Previously, the build setup caused gcov to fail to work properly for the mozilla-central code, causing my build scripts to require hacks to only cover the comm-central code. It seems that something in either the environment, the code, or the building of libxul caused it to be fixed.

On the other hand, the environment prevented me from running the mozmill tests correctly. On this computer, the user accounts are set up via LDAP, so the gtk initialization code tried to get user information from libc, which got it from NSS, which got it from LDAP, which tried to get it from another LDAP library. Unfortunately, it chose to use the directory/c-sdk ldap instead of the system ldap causing a crash. The only solution was to disable ldap, which required a few tricks to work correctly. Oh, and somehow the tests failed if I didn't enable libxul.

My last issue was with mozmill. I had a plan to use Xvfb for the display for mozmill (the intent being that I could automate mozmill tests on several computers overnight via screen). Turns out that it someone complained about needing Xrandr, so I got to run all of the mozmill tests via twice-forwarded X connections.

Saturday, September 4, 2010

I have noted before, by a nonscientific and utterly biased survey, that Thunderbird appeared to account for a significant share of the newsreader market (testing bug 16913 was what caused me to discover this fact). But actually finding any attempts to measure usage share of newsreaders via Google has actually been rather frustrating. You can easily find market shares of web browsers, desktop operating systems, server operating systems (though the numbers vary wildly?), and mobile platforms. But not things like email client shares or newsreader shares.

Okay, I am not about to find market shares of email clients. I have no access to anywhere near enough a representative sample that could work. But collecting newsreader market shares should not be that hard. After all, pretty much anyone can pick up a large, representative sample of news postings... connect to a NNTP server of your choice. So, seeing as how it's a three-day weekend, I thought I might as well collect the data myself. The other reason for my collecting this data was to demonstrate that a significant number of Thunderbird users are NNTP, so removing NNTP support would adversely affect the userbase.

Methodology

First off, I have to define what I mean by "usage share." Unlike other mediums, a relatively small numbers of users account for a relatively large share of NNTP postings. I've decided to measure it by the number of posts generated by each NNTP client, since it's easier to calculate, and I think it is more informative than the measuring by individual users.

I also have to pick the subset to log. For this set of data, I collected every single news article in the Big-8 newsgroups on my school's NNTP server (news.gatech.edu), which has a retention time of a month (30 days is the exact number, I think). I did not even attempt to filter out spam messages, and I did not account for cross-posting (my script managed to crash due to races a few times, so the totalized data which accounted for cross-posting was unreported).

Essentially, this is what I did. I ran a python which collected every group in the big 8 (determined by LIST ACTIVE wildmats) on the server. Then, it entered every group, performed an XOVER to find all messages, and then XHDR'd User-Agent, X-Mailer, and X-Newsreader to figure out what the user agent was. This script output, for every group, a total for each full string that represented the UA into a ~2MB csv file.

I collected all of the csv data into OpenOffice.org Calc and then ran a macro which attempted to collect the program number and the version from the UA strings. Unsurprisingly, I had to do some hacks to get it to recognize SeaMonkey and Thunderbird correctly (Mnenhy was not helping). I output tables that broke readers down by versions and by total program counts.

Results

It turns out that there is an incredibly long tail of newsreaders. There are about 250 different UA strings I found. Excluding one particularly prevalent bot and those postings for which I could not find a UA string, I found around 430,000 messages (there may be some other things dropped by copy-paste errors). Of these, the top 5 newsreaders account for just 79% of the total count. By contrast, the top 5 web browsers account for very nearly 100% of the total. Some interesting newsreaders I found in the long tail:

Mozilla 4.8 [en] (Windows NT 5.0; U)

Mozilla 3.04 (WinNT; U)

trn 4.0-test70 (17 January 1999)

Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/0.8.12

MyBB

Finally, here is the table of the top newsreaders:

Newsreader

Total

Percent

Google Groups

189536

43.94%

Thunderbird

52258

12.11%

Forte Agent

47100

10.92%

Microsoft Outlook Express

41042

9.51%

Microsoft Windows Live Mail

11196

2.60%

MT-NewsWatcher

10872

2.52%

Other

79359

18.40%

That Google Groups has the highest market share is not surprising, but I was surprised by the strong showing of Forte Agent and the poor showing of traditional newsreaders (e.g., tin, rn-based newsreaders). I guess this goes to show you that Windows has a surprisingly large market share in the Big 8 newsgroups. For SeaMonkey enthusiasts, your newsreader has a mere 4,187 postings (with another ~5K provided by other Mozilla distributions, some of whom cannot be determined... Mnenhy made processing UA strings difficult).

In terms of individual versions, one of Outlook Express's versions clocks in #1 at 31,617 total posts, with Thunderbird 3.1.2 trailing at a "mere" 23,661. Thunderbird has around 14,000 on the 2.x branch, 11,000 on the 3.0.x branch, and 25,000 on the 3.1.x branch. There is apparently some spoofing going on for SeaMonkey users as well (I found a dozen or so Firefox entries, which I presumed is a SeaMonkey-spoofed UA string).

Another datum incidentally collected was the number of postings in each hierarchy. Here they are:

Hierarchy

Count

Largest Newsgroup

comp.*

64,360

comp.soft-sys.matlab (8,399)

humanities.*

2,460

humanities.lit.authors.shakespeare (1,455)

misc.*

28,518

misc.test (8,796)

news.*

31,635

news.list.filters (26,238)

rec.*

217,548

rec.games.pinball (14,707)

sci.*

47,948

sci.electronics.design (6,076)

soc.*

87,192

soc.retirement (6,053)

talk.*

12,513

talk.origins (6,498)

Remind me again why we have the humanities hierarchy? Almost 60% of its messages come from a single newsgroup, and it has just 8 newsgroups.

Future Work

What could be done in the future is to expand this research into binary newsgroups. However, merely counting posts becomes a more inappropriate metric because binary newsgroups use a lot of multipart messages, so just because someone uploads a ginormous binary does not mean it should be counted 50 times. I also don't have access to any binary newsservers.

Another opportunity for fixing is to discount spam. As a brief test, I looked into only those newsgroups which had the name `moderated'--this resulted in a paltry sample of 3,272 messages. The statistics also appear to not change much, but the newsgroups are likely not a representative sample of the Big 8 anyways.

Finally, this needs to be broadened and run repeatedly so it can collect snapshots of the data across time. This metric suffers poorly at capturing historical data, but it could be an excellent way to get data every few months from the future, so long as someone collects all of the data in the future.

Monday, June 7, 2010

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

This blog post is a continuation of my previous two posts, which is being broken up into multiple segments to lower the amount of text one has to read in a single sitting. The current step is to actually implement the folder update.

Only new messages

Now that we know how to add messages to the database, we need to figure out how to find the downloaded messages.

It should go without saying that checking to see if you actually need to update the folder should be the first thing you probably want to do in this function. In my extension, I need to download the front page of the specific board and check the topic list to see if it matches what is stored in the database.

For now, at least, I can rely on the forum telling me about the number of replies in a thread (one less than the number of total messages), as this is shown in the thread index of a forum. What I do is grab the reply count that I've seen and subtract that from the number that is listed to get the number of new messages I need to download. Then I need only to look at the last few messages to add them to the database.

At this point, I have two main issues to worry about. First, I am working with paginated return results. That means I actually need to load multiple documents. Second, I am not getting a list of messages, but a list of threads; therefore, I need a database that is associated with threads [8].

The database I use is a simple JSON object that exists for each folder, and so far only has a mapping of threads to the reply count that I've seen; I may give it more in later iterations of this extension.

Pagination is where the trickery in implementation comes in. First, I need to look at the thread index for new messages; if I have seen all of the messages in the last thread, I can stop looking at new pages. Otherwise, I have to grab the next page and continue recursing. Note that it is possible to hit a thread that I've fully seen and still have threads I've not seen: sticky messages can be infrequently updated yet still make it first on my list of messages.

The other issue is when loading threads. The link I end up scraping is to the first page of messages for that thread, which I may already have seen. So I
need to skip over pages until I find the page that first has new messages. For now, I'm doing this naïvely by actually loading each page and counting the number of posts rather than trying to deconstruct URLs and calculating where to load. I then need to look at the last set of posts, not the first set, so I calculate the start position and read forwards. Since I'm using querySelectorAll, I get an array of results, so I don't worry about having to throw out a number of iterations; I can just start in the middle when iterating.

Once all of that is implemented, we can then put everything together to make a proper implementation of updateFolder, the function we started implementing a few pages ago. The end result is that, when all is said and done, you can load up the message pane (the last column is the number of messages in the thread):

By comparison, here is an equivalent view of the forum that I loaded this from:

Now, I wish to ask you, which user interface would you rather use to view the forum?

Some notes for implementors: be prepared to delete your msf files over and over again. I would recommend tackling the individual components in this order: first build a message, then your protocol object (I found it easier to test when the running tasks were already known to be working), and then start work on tying it all into the database. Leave issues like threading for after the basic stuff is laid out, then tackle determining which messages are new if it's not implicit in what you do (i.e., you don't have a "get new messages" query you can readily use). Pagination should be last: everything is easier to test if you only have a small number of messages you really need to test.

I apologize for the excessive length of this step; this happened to be pretty much the first step where most of the necessary technology had to be used. The next step is to actually be able to display the messages in our database, which should be shorter.

Notes

Kent James and I are both working on developing new account type extensions (he doing an Exchange connector and I this blog series); both of us have identified the narrow-mindedness of the database as an issue. It is therefore possible that my workaround here will not be necessary in the next few versions.

Friday, May 21, 2010

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

This blog post is a continuation of my previous post, which is being broken up into multiple segments to lower the amount of text one has to read in a single sitting. The current step is to actually implement the folder update.

Folder updating

To actually achieve our goal of getting a correct message list, we are going to modify the implementation of updateFolder. This function is called whenever a folder is selected in the folder pane; conceptually, you can view the function as causing the cached database to be resynchronized with the actual folder. For example, this is where a local folder would actually reparse the mailbox if the database was incorrect or missing.

This function essentially consists of three steps: figure out new messages, process them (i.e., apply filters), and then announce to the world that they exist. Some account types (like IMAP) may need to do more involved message processing, but this is the general gist of what goes on [4]. I'll ignore the processing step until I start talking about filters.

Database Details Devil

To start with, I'll cover the last step. Announcing to the world that a message exist boils down to adding a new header to the database. So how do you add a new header to the database? It requires three easy steps: create the header, populate the fields, and then add it to the database. With the proper listener setup, all of the other notification is done for you automatically. But as they say, the devil is in the details.

Let me begin by explaining some things about messages. There are five different representations of the message: the message key, the message header, the message ID, the message URI, and the necko URL object. Siddharth Agarwal has a nice diagram that shows how to convert between these representations. The last two are more concerned with displaying messages; it is the first three that are interesting right now.

Message keys are the internal database key for a message; the tuple (folder, key) is guaranteed to be unique by the database. Message keys are unsigned 32-bit integers (with 0xFFFFFFFF, or -1 in 2's complement, reserved as the "no message here" key). In general, any time a property needs to refer to another message, the message key is used; as a consequence, it means that such properties cannot refer to stuff across folders.

Message IDs are the RFC 5322 identifier for a message. These identifiers are supposed to be unique (for logical messages, not in a "the message at offset 0x234f3d in this file" sense). The most important use case for message IDs is that they are a critical
component for threading.

The message header object is an object of type nsIMsgDBHdr. These are objects are directly backed by the database. However, many of the properties do not notify the database of changes, so you generally do not want to actually set them. Like all generalities [5], there are exceptions to this rule. Right now, we want to manipulate headers before adding them to database, and therefore we do not want to notify people of changes to not-yet-existing headers, so we want to actually use the fields of nsIMsgDBHdr.

So, the first thing you need to do is to decide what your message key is. Message keys are going to be used to get the message URI, so it should be a property that is easy to associate with methods. IMAP uses message UIDS, local folders the offset into the mbox [6], and NNTP uses the key numbers in the group. In my case, it appears that the forum assigns each post a unique number, so that is what I'll use.

After the message key, the most important properties are the major ones for display. The author attribute correlates to the "From" header, subject to the "Subject" header, and date to the "Date" header. All of these will be used to generate values in the thread pane columns; things would look strange without these.

The other major property in the display is flags. Flags, as the name implies, is an integer where each bit corresponds to a different flag. The most important of these are probably HasRe, Flagged, and New. Flags should be set with OrFlags and AndFlags instead of manipulating the value directly. And don't set these values with the mark* methods, as these cause notifications to be fired (remember that we haven't added the message to the database yet).

If you want to do real threading, you will want to set message IDs and references [7]. The References header is a space-separated list of message ID tokens (wrapped in angle brackets), although the parser routine in the database does a pretty good job of ignoring any random crap. The list is in the reverse order of hierarchy, so the last element is the message's parent, second-to-last the grandparent, etc.

Threading is implemented in the following manner. First, the database attempts to find a message for each message ID in reverse order. If it finds one, that is made the parent header and threading stops. Otherwise, if correct threading is enabled, an attempt is to made to find a thread which has that message ID. Otherwise, if use strict threading is not enabled, a thread that has a message which has the same subject (without Re) is used as the thread. If threading without re is
disabled, the message has to have the HasRe flag checked to perform the last step. Finally, if a thread could not be found by this point, a new one is created.

To combine messages in a thread, then, the References field needs to be set for the messages. If people enable correct threading (this is done by default), you can use a simple trick: create a valid message ID for each thread and stuff that as the References header.

A practical example

In my case, I have an author (without email addresses), a subject (with possible non-ASCII text but without Re: stuff), a date in a standard format, as well as a simple per-thread unique identifier for message keys. I also want to make threads—although this will only be two-level threads. Ideally, I should also be flagging the sticky threads, but I'll leave that for a later version. So what does this code look like?

First, we get a reference to the database. Remember we implemented this in our last step, so this shouldn't present any problems. We also get the things that are shared in this thread: the subject, hostname of the server, and the charset. For each of the posts, we collect the post ID, the author, and the date of the post as text strings, and then convert them into an integer, string, and a date respectively.

Using the CreateNewHdr function, we get a new message header that we can manipulate. Since I'm trying to be aware of non-ASCII text, I'm using the MIME encoding strings to prepare the author and subject. Remember that the MIME specifications want you to encode non-ASCII text in the headers; the function we use is the simplest way to do the encoding.

If you're not working with actual email, the from string can be contorted. What I did was to create a fictituous email that could be theoretically tied back to the author in a systematic way (for a possible future compose code that does forum private messaging). The purpose of the pipe character at the end is to prevent accidental mail delivery; I also used the hostName and not the realHostName, so this email address would be traceable even if the user changes the host name on me.

The message date I have is a formatted string; the Date constructor is pretty handy at converting most forms of these strings into a usable JS Date object. Then I have a JS Date object, which is measured in milliseconds, whereas the date attribute is a PRTime, which is measured in microseconds, so I need to multiply by 1000 to actually set the property. Ironically, the date is actually stored in seconds in database and is converted to and from microseconds on the fly.

The Charset attribute, apparently only used for search right now, is derived from the character set as reported by the DOM. This means that it is the same character set as would be assumed by the layout engine, including character set overrides.

The message ID is simpler to generate: valid URIs are pretty much valid right-hand-sides of a message ID. A post is pretty much representable as a tuple of the thread page and the path to the post in the DOM, so this message ID is also an easy way to get to the message. References are also generated as I described above; in a later version, I may try to do sniffing to figure out from quoting who is replying to whom and recreate actual threads. Note that when setting the message
ID, the outer angle brackets are optional.

The last thing I set is the flags. A complete listing of flags can be found on MDC. In this case, the only flags I care about are HasRe (since I want to generate "Re:" headers) and New; most of the others will probably be set by the user in the UI.

Finally, we add the header to the database. The last parameter tells the database to tell anyone listening that we have a new message. After we have loaded all of the messages, we need to commit the database:

database.Commit(Ci.nsMsgDBCommitType.kLargeCommit);

A brief note to make here: it doesn't really matter if you do a large or session commit, they both end up doing the same thing. Small commits end up doing nothing.

Notes

Like most synchronization stuff, you theoretically also have to deal with deletion on the remote side as well as read changes, etc. The more I think about it, the more I'm torn on whether or not I should implement it. For now, I'll recommend that you weigh the cost of trying to determine deleted messages versus the commonality of deletion or other modification.

Except, I am told, that all words that end in -tion in French are female.

Incidentally, this is a major part of the reason why there is a 4 GiB limit on mailbox size in Thunderbird and SeaMonkey.

What about In-Reply-To, you may ask. This information is pretty much redundant with References, so what happens is that, for the purposes of computing threading, this header is appended to the References header. And you do this before calling on the database header.

Wednesday, May 12, 2010

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

In the previous blog post, I showed how to get an empty message list displayed in the folder pane. The next step is to actually implement the folder update. Since this task involves several tasks, I will be breaking this step into multiple blog posts.

Getting a DOM for HTML

In terms of webscraping, I treat the first step as simply turning a URI into a DOM. The developer center actually has some good resources on this, if you have access to a document object. The issue, though, is getting a document object, since your code will likely be running from an XPCOM component [1]. What is needed then, is a utility method for loading the DOM. This is the code I've been using:

The first argument is the URI to load, as a string, and the second argument is the function to be called back with the DOM document as its sole argument. An added benefit to this method is that it also uses an asynchronous callback method, so you're not blocking the UI while you wait for the page to download. This code will likely not be called except by the protocol object, though, since we probably want to throttle the number of pages loaded up at once.

The protocol object

Earlier, I mentioned that one of the implemented objects wasn't actually mandatory. This object was the protocol object. An instance of this object is meant to wrap around an actual connection to the server; where you don't need to connect to a server, this object might not be worth implementing. In reality, it is still a useful thing to have if you have a non-trivial account type—any time a task is more complicated than "load this thing and use it," a protocol object can help with managing multiple subtasks.

For a wire protocol, the implementation of this object should be straightforward. It would essentially be a state machine, with an idle state entered after setting up the connection during which the instance can accept tasks to do. A state machine could also be done for webscraping-based account types, but I am using a more queue-based
approach due to how I have structured the web loads.

At a high level, server requests are chunked at two levels. On the higher level, the application makes calls to functions like updateFolder; these calls I have decided to term tasks. The lower level requests are the requests you communicate to the server; for lack of any better terminology, I will refer to these as states[2]. In my implementation, I keep two queues, one for each of these.

Managing the queue for tasks is best done at the server. The overall logic is actually rather simple:

The runTask method is designed to be called with a task object; for the core mailnews protocols, this is primarily being called by the service [3]. For now, I've made the value for the maximum number of protocol objects unchangeable, but it is probably better to allow this
value to be configurable via a per-server preference.

The core implementation of the protocol running object for webscraping is not too difficult:

The protocol is initialized by calling loadTask, which calls runTask on the task object. This would make some calls to loadUrl which will load it (since the max has not been loaded yet). When the function is loaded, via UrlRunner.runUrl, the callback function is called
and then the onUrlLoaded function is called to clean up the URL from the queue and run any more. When this function detects that there are no more URLs are being loaded—hence why the callback is called before this function is—finishTask is called on the task object.

The working of loadUrl bears special mention. The first argument is the URL (as a string) to be loaded. The second argument is the method on wfProtocol to be called when the URL is loaded. This implies that the actual code for implementing tasks is mostly contained on wfProtocol as opposed to the task objects. All subsequent arguments are passed in as arguments to the callback function; the first argument to this function is the DOM document.

Notes

Well, there is an nsIDOMParser which can turn text into a DOM without needing a document object. Unfortunately, it only supports XML. There is a patch for making it parse HTML, but it has gotten no traction in recent months.

Just to muddle it all up, the URL instances in most mailnews implementations are actually how the tasks are implemented, although I internally use a URL to represent a state (kind of). A potentially clarifying discussion can be found in mozilla.dev.apps.thunderbird.

I am not totally happy with the current model of the protocol system in mailnews, particularly with the technique of crossing over to the service to make the calls to the protocol. In my implementation, I've made those functions static functions on the protocol object. Since this is somewhat different from the current
implementations and I'm not sure I want to keep this, I've couched my statements of how things work.

Tuesday, April 27, 2010

Seeing as how the first build candidates of Thunderbird 3.1 beta 2 are currently being spun, it is time for me to update from the 3.0 builds to 3.1 builds (I have a policy of switching to the next branch of Thunderbird as my primary at the time of the last beta). I decided to take this opportunity to work out issues in my folder categories extension for real.

I've decided to give up on supporting 3.0 (trying to support two different broken versions of code is not fun), so the oldest supported version is now listed as 3.1 beta 2. In reality, it would work with any nightly since the broken code was fixed (it's still not fully fixed now, but it's just a cosmetic issue). The result is an experimental addon on addons.mozilla.org. For those of you dying for screenshots, this page on mozdev should satiate you.

Sunday, April 11, 2010

I recently spent a fair amount of time collecting historical code coverage data for Thunderbird; the result is 312 distinct files of raw lcov data covering the first year of Thunderbird in a mercurial repository. I also recently wrote a program that makes a treemap for each day (thanks to the geninfo man page and this treemap library), and then wrote another program to convert that treemap into a static PNG image:

I then converted the high-def, lossless AVI into an Ogg file, and produced the following animated video of historical code coverage:

Use a web browser that supports playing box Ogg videos. It's the least you could do, considering how much time converting the AVI to Ogg took!

Okay, so no sound yet for the animation—the encoding is painful enough that I don't want to try it out right now. I also didn't filter out any of the days where the tests failed early, so you will occasionally see flashes of red. The data also doesn't have recent stuff (I am holding off until I can figure out how to run mozmill tests and get JS code coverage). Anyways, enjoy!

Sunday, April 4, 2010

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to
explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

In the previous blog post, I showed how to get an account displayed in the folder pane. Now, we will prepare the necessary components of getting an empty message list displayed in the folder pane.

Database basics

As mentioned previously, the database is one of the key components of an account. It is, essentially, the object that actually stores the state of messages in folders and even some folder attributes themselves. The database is currently backed by a mork database (the .msf files you see in your profile storage); in principle, you could make your own database from scratch that doesn't use mork, but that is likely a very bad idea. [1]

Originally, as I understand it, the database was merely a cache of the data in the actual mailbox. Its purpose was to store the data that was needed to drive the user interface to prevent having to reparse the potentially large mailbox every time you opened up Netscape. The
implicit assumption here was that blowing away the database was more or less lossless. Well, times change, and now such actions are no longer lossless: pretty much any per-folder or finer-grained property is stored in the message database; in many cases, these properties are not stored elsewhere.

The database itself is represented by the nsIMsgDatabase interface. Messages and threads are represented by the nsIMsgDBHdr and nsIMsgThread interfaces, respectively. Per-folder property stores are represented by nsIDBFolderInfo. Finally, the code to open a new database comes from nsIMsgDBService. Most of the database
stuff just works; subclasses would implement only a few methods to override the default
ones.

Getting databases

There are two main entry points for getting databases: msgDatabase, and getDBFolderInfoAndDB. Both of these must be implemented for anything to work:

This portion of the code can turn out to be surprisingly complicated. What is listed is generally a safe option: if the database is incorrect (out of date or non-existent), blow away the database and re-retrieve the information from other sources. Recreating the database is done in the catch statement. Then we set the member variable to be the newly-created database (this is also used by nsMsgDBFolder code) and we return it. Retrieving the folder info should be self-explanatory.

You may notice that when the database is invalid, all we do is create a new database: we don't try to fix it. This is because these calls to get the database are interested in getting a version of the database quickly: this is one of the calls the folder pane makes, and it is synchronous. Imagine what would happen if, say, a local folder which had a 3GiB backing store needed to be reparsed during this call. The actual recovery of the database would most likely happen when the folder is told to update.

Other stuff can be added to these calls. Not everything is necessarily stored in the database: news folders store their read information in the newsrc file, so it needs to sync this with the
database in the method too.

Displaying an empty message list

If you just try to implement this code and run, you will discover that this is not sufficient to load the database. The key is in the getIncomingServerType function, which is what tells the database service which implementation of nsIMsgDatabase to use. For now, we can just use the default implementation of nsMsgDatabase, but we can't change the parameter output (otherwise URIs will get messed up). The solution is to create a DB proxy:

What this does is use some XPCOM magic to link creating one contract ID to creating the other. I have not yet used the extend-C++-in-JS glue to create the ability to subclass nsMsgDatabase due to the fact that the nsIMsgDatabase interface is more complicated than the others, as well as it being more C++-specific codewise and generally less
useful to override methods.

The next thing to do to display the list is to write a simple no-op implementation for updateFolder (the default implementation doesn't do this, for some reason [2]):

Here, atoms is merely is an associative array that contains a list of necessary atoms for the code. The end result of all of these changes is the following screenshot:

In the next part, I'll cover how to replace that screenshot with one containing an actual folder list.

Notes

As annoying as it would be, implementing nsIMsgIncomingServer or nsIMsgFolder from scratch is still somewhat feasible. I don't think the same holds true for nsIMsgDatabase (or the other database helper interfaces): static_casts permeate the code here, with the note that it is a "closed system, cast ok".

If you're wondering why this post took so long to be produced, this is a major reason why. It turns out that not having this implementation causes the folder display to not display the database load, so it just displayed the server page with the server name changed to the folder name. That, on top of having no time to debug it.

Friday, April 2, 2010

If all goes well, sometime tonight I will have completed 362 builds of Thunderbird, one for each day from July 24, 2008 to July 23, 2009 excluding August 3, 2008, December 25, 2008, and July 4, 2009 (more may turn up as I get more data; bonus points if you can figure out the significance of each of those days!). Included for each build is either a build log to tell me why the build failed or a test log telling me what ran. Also included is a copy of the Thunderbird code coverage data.

What, you may ask, do I intend to do with a year's worth of code coverage data? I intend to use this data to help answer some questions I have about our code coverage. Already, I've wondered about a more general overview of code coverage data (see my last post for more details). Now, I want to pose some of the following questions:

Whose code is not covered?

Who is adding code right now without making sure to cover it?

Whose tests are responsible for most improving code coverage?

How is code coverage being impacted over time?

My answers to these questions involves taking a snapshot of the code coverage data over time. That, however, proves to be a little more difficult than you'd imagine. First of all, hg doesn't support, as far as I can tell, an "update to what the repo looked like at this time" (hg up -d goes to the revision that most matches that date, not to a snapshot at that time). So I had to write a few scripts to pull out the revisions to look at. Second, gloda ruined some of this data. Fortunately, that's easy to tell due to the <1KB log files complaining about no client.mk. Then there's the issue of my revision logs containing m-c data, not m-1.9.1, so I have to hack around the client.py for Thunderbird trying to pull a different revision.

Another source of complaints was actually building and running the things. The computers I'm doing this on are all 64-bit Linux. There are a few m-c revisions that cause 64-bit to break, and libthebes and gcov just can't seem to work together on 64-bit Linux. Plus, libpango has some breaking API changes between 2.22 and 2.24. One of the XPCOM tests seems to crash and sit there with a prompt saying "Do you want to debug me?" Finally, the test plugins seem to cause massive test failure due to assertions. Not to mention that these machines don't have lcov on them and I don't have sudo privileges (so I'm not running mozmill tests yet).

In short, it's somewhat surprising to me that this actually works. Just looking at some of the build generation shows some coarse changes: between October 2008 and June 2009, the size of the compressed test log files increase 6-fold, and the compressed lcov output has nearly doubled in the same period. Lcov also reported that the coverage increased from about 20% to around 40% as well.

Sometime later, I'll hope to get mozmill tests working, as well as improving the JS code coverage to actually work for Thunderbird (it doesn't like E4X nor some of the other files for no apparent reason). Since jscoverage works by modifying the JS code, I can run that without really needing the builds (archived nightlies plus tricking the build-system will work). When all that data is collected, or sometime before, I'll make a nice little web-app that shows all of this information so people can gawp at pretty pictures.

If you want to try this on your own, here is the shell script I used to actually collect data:

Don't bother complaining to me if it doesn't work for you. I just did what I needed to do to get it to reliably work. And be prepared to wait for a few hours to collect any non-trivial number of builds. It took me about 12 hours to get 6 months worth of data using 6 different computers; the next 6 months is still going on right now.

Thursday, March 25, 2010

One recent goal of Thunderbird development has been to increase test coverage. Murali Nandigama has prepared a nice document on getting code coverage data. Running this on just the xpcshell tests for a recent build gave me this output.

So the output of LCOV (which does the post-processing) is passable. With enough clicks, I can figure out which lines are being covered and which ones aren't. But if you step back and try to look at the big picture… that's hard to do. Some directories sure seem good at code coverage: I mean, we hit both the lines of code in there. On the other hand, we seem bad at covering IMAP, only hitting around 11,000 lines of code (note the difference of scale). There's got to be a better big picture.

The answer I came up with was to use a treemap. Basically, treemaps are a good way to display two key attributes of data on the leafs of a tree at once: one is the color, the other is the size (actually, you can probably manage to squeeze three attributes of data under certain conditions if you vary color/saturation independently, but I'm not going that far here). In this case, the hierarchy is the folder hierarchy under mailnews (I'm not interested in m-c coverage) with the leaves being individual files, size being number of functions in a file, and color being the ratio of functions. The result with the same coverage data is the following graphic:

I've also taken the liberty to label the top-level directories so you can read them without having the mouseover capabilities. Immediately, you can see some interesting points about mailnews:

The IMAP code is the largest of the protocol code in terms of functions by a considerable amount. Local code, NNTP, and compose are all roughly equal in size by the same metric.

Some of the extensions (SMIME and MDN, actually) are not tested at all. Import code is also poorly tested.

Some of the MIME code is well tested; others aren't. In fact, it's hard to test a function in libmime without testing half the functions in that file. Perhaps we should have more encrypted messages in our tests?

Speaking of libmime, it's spread out across several files. In other components, functions are centralized into fewer files: specifically protocol, server, and folder. Wonder why? :-)

Wednesday, March 24, 2010

One goal I've had for a while with respect to JSHydra was to have it actually spit out an easy-to-understand AST, akin to the kind of AST you get from Pork, as opposed to the parse tree from SpiderMonkey. After reading around in a fashion, I've written a postprocessing script to do so.

The basic idea for the output format is along the lines of the JsonML AST format, with a mixture of pork and "I think this is what's happening" to top it off. The actual ["Type", {}, child1, child2] format I quickly gave up using because it proves cumbersome to look at; in the interest of keeping something akin to the Pork format, I moved to a more ad-hoc format, which loosely follows the visitor pattern they mention.

I've added this output format to the WebJSHydra reader (yes, it is a copy-paste in part of the webpork code), so you can play with it to your heart's content. Just don't make it large. It also doesn't support E4X, and I'm not entirely assured of its correctness. Also, I don't support the visitor yet, nor do I have a C or C++ version of the AST for static analysis tools.

Tuesday, March 23, 2010

One complaint I have made a few times is that my hierarchy of accounts does not necessarily match up to the logical structure. For example, I have Mozilla-related folders splayed out across three accounts, one newsgroup, and two email accounts. They're different because, well, you can't combine mail folders, newsgroups, and RSS feeds all under one account.

It doesn't actually work in Thunderbird 3, only some of the newer nightlies. It turns out that the folder tree view stuff changed between Thunderbird 3 and Thunderbird 3.1, and the newer version is what I used to make the extension.

Speaking of which, it turns out that there is a bug in gFolderTreeView.load. Just to make life fun, the strings in the bundles are different between Thunderbird 3 and 3.1. Argh!

Categorizing works by setting a property on DBFolderInfo, for now at least. So this means it doesn't appear to work on server folders.

Uncategorized folders fall under the categories of their parents. So, basically, at the beginning, everything is laid out like the all folders view just shifted one level down. As you categorize more stuff, portions are spliced under different categories.

Categories should be marked as having new or unread messages if any folders beneath them are so marked.

Friday, February 5, 2010

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types. I hope to add a new post once every two weeks (I cannot guarantee it, though).

In the previous blog post, I gave a broad overview on the overall structure of the backend interfaces and the components of account implementation. Now, we will prepare the necessary components of getting your extension's folder displayed in the folder pane.

Account implementation decisions

Before you start implementing, you have to decide how to structure the account. The first decision is what the internal account type will be. This
will be the value of nsIMsgAccount::type and will dictate the contract IDs for several interfaces. The next decision is what the account URI scheme is. This will be the scheme for the URI and dictates the contract IDs for a few more interfaces; for mailbox accounts, this scheme will be mailbox. For my extension, I have decided to choose webforum for both of these.

Another important decision to make will be the server for which you will be doing most of your initial tests. It should be something that is manageable for debugging purposes. In my case, I've decided to bestow this honor on the
Kompozer web forum, because it seems lower traffic than any other forum I'm reasonably interested in. As you may notice, I am starting my extension with the intention of focusing on phpBB access—it's sufficiently widely used that I expect that only supporting phpBB at first would still make a worthwhile extension.

Once you have decided that, you should take the time to study how things will be structured: what determines a folder? What determines a message? A thread? Replies? How are you going to be carrying out new actions, such as checking for new messages? What internal information are you going to need to save for accessing? Heck, what determines the "server" to begin with? In my case, the DOM inspector is an invaluable tool for answering this questions. Don't worry about how to figure out the list of possible subscribable folders yet. Subscription will come into play much later; we are going to start by just hardcoding this list somewhere.

In my case, I am choosing to structure the folders as a Category → Forum hierarchy. I'll pick a few of the smaller forums to use so I don't overwhelm debug logs.

Implementing protocol information

Since nsIMsgProtocolInfo is the shortest and simplest of the interfaces, let me start by implementing this one. There are a total of 12 attributes and 1 function on this interface, so the code will not be hard to write. Following is an implementation of the code [1]:

The meaning of each of the attributes can be found in more detail on the MDC page. The properties used by the account wizard mostly control initial preference values; those used by the UI code mostly control which UI elements are enabled. I have excluded from the implementation also those attributes which are unused.

Perhaps the most leeway you have is in implementing defaultLocalPath. In this case, I have adapted the RSS implementation, which does not allow users to change this location. The other implementation (used by IMAP, POP, NNTP, Movemail, and Local Folders) uses a preference to return the default path. An example implementation of this method
is like thus:

get defaultLocalPath(){ // This will probably be found in the constructorthis._prefs =Cc["@mozilla.org/preferences-service;1"].getService(Ci.nsIPrefService).getBranch("extensions.webfora."); // Preference looks like [ProfD]WebForumslet pref =this._prefs.getComplexValue("rootDir",Ci.nsIRelativeFilePref);return pref.file;},

Once you have completed that, you should test that the service implementations work as expected via test snippets in the Error Console. The account manager can be mean when it comes to unusable
account types [2], so this will help fix the most obvious bugs before the account manager attempts to do it for you.

Server and root folder discovery

Before I start going any further with code, let me take a minute to explain how servers and folders interact. The server objects themselves do surprisingly little in the UI; the most common property calls are probably rootFolder and type. This even includes
what you might think of as server attributes: the bold display name, has new messages treeview properties, etc. Instead, those features can be found on the root folder, which is a "fake" folder object. Most of what we care about in this part happens on the root folder instead of the server; however, if you browse the implementation in nsMsgDBFolder, you can see that some of the property calls get forwarded back to the server for root folders.

The backend code will create server objects early on and hold onto them for the duration of the program (or until they are deleted). The server objects then create the root folders which then create subfolders as necessary. Links that go backwards (parent links and server links) are weak references to avoid refcount cycles. Most of this work is hidden in nsMsgDBFolder for you. After creation, various properties are accessed at will; some properties will be loaded in from the database info (a topic for later).

In more concrete code terms, the following is the steps in loading the
folder pane:

The account manager loads the mail.accountmanager.accounts preference; the values here are a comma-separated list of account keys.

For each account key, an account is instantiated. Per-account data is read off of the mail.account.<key> preference branch; in specific, the server preference contains the server key to load and the identities preference is a comma-separated list of identity keys.

The identities and servers are then bootstrapped. In the case of servers, the server is created as an object with the @mozilla.org/messenger/server;1?type=<type>
contract ID. The server pref branch is mail.server.<key>; key preferences here are type, the type for the contract ID; userName, the (optional) username of the server; and hostname, the (required) host of the server.

The account manager sets the key, type, username, and hostName properties, in that order on the server object instance and then retrieves the port property. The (type, username, hostName, port) tuple is the unique identifier for a server: no two servers can have the same
combination of these values. Now your server is constructed and returned to the folder pane.

The folder pane retrieves the rootFolder of your server. If you happened to be saved in the expanded state, subFolders is recursively retrieved from folders as corresponding to the saved open state. The folder pane also calls performExpand() on the
server if the root folder is expanded.

So that explains how your server gets created; how do your folders get created? nsMsgIncomingServer::GetRootFolder[3] calls nsMsgIncomingServer::CreateRootFolder, which calls serverURI and uses it to construct an RDF resource. serverURI creates a URI of the form localstoretype://[<username>@]<hostname> by default. This URI is actually the URI of your root folder; other code will assume that this invariant holds true (especially subscribe!). Other folders are created when you get the subFolders property. When the
folder URI is parsed (which is pretty much the first time a useful property is called), getIncomingServerType is called to get the type of the server.

In summary, you may need to implement localStoreType and possible serverURI on your server, and subFolders, and getIncomingServerType, and CreateBaseMessageURI on your folder [4]. First we'll start by getting the root folder display working:

At this point, I recommend you again check to make sure resources are properly registering via the Error Console. With that in hand, it's time to modify your preferences manually. I personally recommend changing settings via editing prefs.js while Thunderbird is off so you
don't accidentally confuse the account manager. I'm using the keys account99 and server99 to make it plain which account is being edited. First, I copy the mail.identity.id3 pref branch (any identity would do) and change the id3 to id99. Then I copy the mail.account.account3 pref branch and change the 3's to 99's.

The next changes are the server preferences, which are going to be the most unique. directorydirectory-rel are set to a folder where I want to store stuff ([ProfD]WebForums/kompozer, in my case). download_on_biff and login_at_startup are set to false (to avoid
dealing with biff for a bit longer). name is set to be the display name of the server. hostname and userName were set to the appropriate values for this account [5]. To the preference mail.accountmanager.accounts, I appended account99. With those changes done, I then start up Thunderbird to see the outcome:
Perhaps I should have chosen a shorter name for display.

Folder discovery

Now that the root folder is displayed, we need to get the folders added to the display pane.
Somehow, we need to figure out what the folder hierarchy looks like—it has to be stored in some file, in other words. The NNTP code uses the newsrc file to store its folder tree, and local folders looks at the directory hierarchy for its map, to name two examples.

In my code, I'm going to choose the use of a JSON file to store this data. I've considered SQLite, but I don't really need synchronization (per-server files work nicely here), and I'm mostly doing simple lookups. Plus, I can probably handle automatic schema migration more easily in SQLite.

For this next part, we concentrate on a single property: subFolders. This function typically has two parts: it first checks for initialization (if so, it returns the enumerator to the stored values); if it's not initialized, the rest of the function, or perhaps a second function altogether, is used to create the subfolders.

Some code to initialize these subfolders is as follows (the logic to retrieve the database is not included and can instead be found in the source code for my extension):

There are a few major things to note. First, the new folders are created via the RDF resource. Both Thunderbird and SeaMonkey use RDF for folder access, so it is still a good idea to create via the RDF service so you don't confuse the caller code. Also, with that in mind, the subfolder name still needs to be escaped as well in the URI, hence the calls to nsINetUtil. The auxiliary function array2enum takes in a JS array and returns a proper nsISimpleEnumerator for the array. I've excluded it's definition here do to its simplicity and the length of this document; if you want to see it, you can view it from the extension source code. The last thing to note is that this code is using this._inner: this variable is a link to the nsMsgDBFolder implementation which was created for us by the JSExtendedUtils inheritance call. I will defer a more thorough treatment of this C++-JS glue until later.

Folder pane extras

At this point, you should have a simple, plain folder hierarchy, which is navigable if not fully usable. In terms of UI, though, it's not quite fully perfect: if you have an inbox, it will be rather indistinguishable from other folders; similarly, "fake" folders (think the [Gmail] folder if you have Gmail IMAP) show up as regular folders. These things are handled to a large degree by CSS.

I would provide some example styling code here, but when I was doing testing, I discovered some related assertion failures that I have not yet had time to grok. In the interest of keeping to a posting every two weeks, I am going to defer this until either a mini "part 1.5" or the beginning of part 2, depending on how much time I will have available next week.

Notes

I will not, in general, post the full code for any of the classes, only enough to demonstrate what needs to be done. For example, the classID property is omitted in this example. Something to note is that I have a modification to XPCOMUtils locally that will accept arrays of contract IDs as opposed to a single one (wfService will be implementing more than one contract ID).

What it specifically does is attempt to get the server; if it fails, then it removes the account from the accounts pref. If you are compiling your own builds for your extension development profile, I recommend you remove the lines in nsMsgAccountManager::LoadAccount that remove the account on failure.

In general, I will mix the IDL and C++ names for methods and properties in the course of the guide. As a basic rule of thumb, if you see a :: in the name, it's a C++ name; otherwise, it's the IDL name.

getBaseMessageURI is a local function called by nsMsgDBFolder during initialization that is used to set up the URIs for getting individual messages. This function will be covered in more depth as we get messages working, but it is technically necessary for startup (a stub that does nothing is provided).

A strong temptation for accounts whose sources are some web address (for example, RSS or my web forums account) is to put the base address as the hostname property. However, as you would quickly realize, that plays havoc on URI parsing, and nsMsgDBFolder::parseURI is not virtual. A better option would probably be to leave the hostname as some identifier that you use only for guaranteeing uniqueness and to store the base URI somewhere else. Since all of my folders have independent URIs associated with them, I can safely ignore the issue until account creation and subscription are covered.