Thursday, June 30, 2011

For the past few days, it appears that Google Groups has stopped mirroring newsgroups properly; this includes the various Mozilla mailing lists. It does, however, appear to be forwarding posted messages, so if you send a message via the UI and you don't see anything show up... don't panic, and don't repost it 5 times.

I don't know how long Google Groups will be down, but, in the meantime, you can use your favorite newsreader to read the newsgroups. Heck, Thunderbird has a built-in news client. If you wish to use the Mozilla mailing lists, just add the server news.mozilla.org. You can then happily see everything that you would be able to see with Google Groups, if it were working.

Wednesday, June 29, 2011

Some of you may be wondering why it is taking me so long to build a DXR index of mozilla-central. Part of the answer is that I'm waiting to bring DXR on clang back up to the original instantiation in terms of feature support. The other part of the answer is that mozilla-central is rather big to index. How big? Well, some statistics await…

I first noticed a problem when I was compiling and found myself at 4 GiB of space remaining and dwindling fast. I'm used to fairly tight space restrictions (a result of malportioning the partitions on my last laptop), so that's not normally a source of concern. Except I had been at 10 GiB of free space an hour before that, before I started the run to index mozilla-central.

The original indexing algorithm is simple: for each translation unit (i.e., C++ file), I produce a corresponding CSV file of the "interesting" things I find in that translation unit—which includes all of the header files included. As a result, I emit the data for every header file each time it is included by some compiled file. The total size of all data files produced by the mozilla-central run turned out to be around 12 GiB. When I collected all of the data and removed duplicate lines, the total size turned out to be about 573 MiB.
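That collect-and-dedup step is conceptually tiny; a sketch of it in Python (file layout and names are my invention, not DXR's actual code):

```python
import glob

def merge_csv_files(pattern="*.csv"):
    """Merge per-translation-unit CSV dumps, dropping duplicate lines.

    Every header's data is re-emitted once per includer, so the bulk of
    the 12 GiB is literal duplicate lines; a set suffices to remove them.
    """
    seen = set()
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            for line in f:
                if line not in seen:
                    seen.add(line)
                    merged.append(line)
    return merged
```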

Step back and think about what this means for a moment. Since "interesting" things to DXR basically boil down to all warnings, declarations, definitions, and references (macros and templates underrepresented), this implies that every declaration, definition, and reference is parsed by the compiler, on average, about 20-25 times. Or, if you take this as a proxy for the "interesting" lines of code, the compiler must read every line of code in mozilla-central about 20-25 times.

The solution for indexing is obviously not to output that much data in the first place. Since everyone who includes, say, nscore.h does so in the same way, there's no reason to reoutput its data in all of the thousands of files that end up including it. However, a file like nsTString.h (for the uninitiated, this is part of the standard internal string code interfaces, implemented using C's versions of templates, more commonly known as "the preprocessor") can be included multiple times but produce different results: one for nsString and one for nsCString in the same place. Also, there is the question of making it work even when people compile with -jN, since I have 4 cores that are begging for the work.

It was Taras who thought up the solution. We separate out all of the CSV data by the file it comes from, then store each file's data in a separate output file whose name is a function of both the source file and the data's contents (actually, it's <file>.<sha1(contents)>.csv, for the curious). This also solves the problem of multiple compilers trying to write the same file at the same time: if we open with O_CREAT | O_EXCL and the open fails because someone else created the file… we don't need to do anything, because whoever did create the file will write the same data we wanted to write! Applying this technique brings the total generated CSV file data down to around 1 GiB (declaration/definition mappings account for the need for duplicates), or down to about 2 times the real data size instead of 20 times. Hence the commit message for fixing this is titled Generate much less data. MUCH LESS.
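A minimal sketch of that write-once scheme (the function name and calling convention are mine, not DXR's):

```python
import hashlib
import os

def write_csv_once(source_file, contents):
    """Write CSV data to <file>.<sha1(contents)>.csv exactly once.

    If another compiler process already created the file, it wrote (or
    will write) identical bytes, so losing the race is harmless.
    """
    digest = hashlib.sha1(contents.encode()).hexdigest()
    path = "%s.%s.csv" % (source_file, digest)
    try:
        # O_EXCL makes creation atomic: exactly one process wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return path  # someone else won the race; contents are identical
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    return path
```

Because the name is keyed on the contents, the nsTString.h case falls out naturally: two different expansions of the same header hash to two different files.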

Of course, that doesn't solve all of my problems. I still have several hundred MiB of real data that needs to be processed and turned into the SQL database and HTML files. Consuming the data in Python requires a steady state of about 3-3.5 GiB of resident memory, which is a problem for me since I stupidly gave my VM only 4 GiB of memory and no swap. Switching the blob serialization from cPickle (which tries to keep track of duplicate references) to marshal (which doesn't) allowed me to postpone crashing until I generate the SQL database, where SQL allocates just enough memory to push me over the edge, despite my generating all of the SQL statements using iterators (knowing that duplicate processing is handled more implicitly). I also switched from using multiple processes to using multiple threads to avoid Linux failing to fork Python for lack of memory (shouldn't it only need to copy-on-write?).
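The serialization swap is easy to illustrate (in modern Python, the Python 2 cPickle module is simply pickle); both round-trip plain containers, but marshal skips the object-identity bookkeeping that pickle does, which is exactly the overhead I didn't need:

```python
import marshal
import pickle

# A stand-in blob: primitive containers only, which is all marshal handles.
data = {"path": "nsString.cpp", "refs": [("nsAString", 42), ("Assign", 97)]}

# pickle tracks shared/recursive references so they survive a round trip;
# marshal just dumps the values, which is cheaper when you know your data
# is a plain tree of dicts, lists, tuples, strings, and numbers.
pickled = pickle.dumps(data)
marshaled = marshal.dumps(data)

assert pickle.loads(pickled) == data
assert marshal.loads(marshaled) == data
```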

Obviously, I need to do more work on shrinking the memory usage. I stripped out the use of SQL for HTML generation and achieved phenomenal speedups (I swore that I had to have broken something, since it went from a minute to near-instantaneous). I probably need to move to a lightweight database solution, for example LevelDB, but I don't quite have a simple (key, value) situation: mine is a (file, plugin, table, key, value) one. Still not hard to do, but more than I want to test for a first pass.
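Flattening that (file, plugin, table, key) shape onto a plain key-value store mostly comes down to packing the tuple into one ordered key; a sketch under my own conventions (the separator choice is mine):

```python
def encode_key(file, plugin, table, key):
    """Pack a (file, plugin, table, key) tuple into one flat key.

    Since '\x00' is forbidden in components, encoded keys sort the same
    way the tuples do, which keeps range scans (e.g. everything for one
    file) cheap in a LevelDB-style ordered store.
    """
    parts = (file, plugin, table, key)
    assert not any("\x00" in p for p in parts)
    return "\x00".join(parts)

def decode_key(flat):
    """Inverse of encode_key."""
    return tuple(flat.split("\x00"))
```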

Wednesday, June 22, 2011

I'm going to go out on a limb and guess that the GNU autotools are one of the most heavily used build systems in existence. What requires no guessing is that, as a build system, autotools sucks. I will grant you that C and C++ code in particular are annoying to compile by virtue of the preprocessor mechanics, and I will also grant that autotools can be useful in trying to work out the hairiness of working on several close-but-not-quite-the-same platforms. But that doesn't justify making a build system that is so bad.

One of the jobs of autotools (the configure script in particular) is to figure out which compilers to use, how to invoke them, and what capabilities they support. For example, is char signed or unsigned? Does the compiler support static_assert(expr), or does it need to resort to extern void func(int arg[expr ? 1 : -1])? Language standards progress, and, particularly in the case of C++, compilers can be slow to bring their implementations in line with the language.

The reason I bring this up is that I discovered today that my configure run failed because my compiler "didn't produce an executable." Having had to deal with this error several times (my earlier work with dxr spent a lot of time figuring out how to manipulate $CXX and still get a workable compiler), I immediately opened up the log, expecting to have to track down configure guessing that the wrong file was the executable (last time, it was a .sql file instead of a .o). No, instead the problem was that the compiler had crashed (the reason essentially boiling down to the fact that std::string doesn't like being assigned NULL). For reference, this is the program that autoconf uses to check that the compiler works:

#line 1861 "configure"
#include "confdefs.h"
main(){return(0);}

(I literally copied it from Mozilla's configure script; go look around line 1861 if you don't believe me). That is a problem because it's not legal C99 code. No, seriously: autoconf has decided to verify that my C compiler is working by relying on a feature so old that it's been removed (not deprecated) in a 12-year-old specification. While I might understand that there are some incredibly broken compilers out there, I'm sure this line of code is far more likely to trip up working compilers than the correct code would be, especially considering that it is probably harder to write a compiler that accepts this code than one that doesn't. About the only way I can imagine making this "test" program more failtastic is to use trigraphs (which are legal C that gcc does not honor by default). Hey, you could be running on systems that don't have a `#' key, right?

Addendum: Okay, yes, I'm aware that the annoying code is a result of autoconf 2.13 and that the newest autoconfs don't have this problem. In fact, after inspecting some source history (I probably have too much time on my hands), the offending program was changed by a merge of the experimental branch in late 1999. But the subtler point, which I want to make clearer, is that the problem with autoconf is that it spends time worrying about arcane configurations that the projects which use it probably don't even support. It also wraps the checks for these configurations in scripts that render the actual faults incomprehensible, including "helpfully" cleaning up after itself so you can't actually see the offending command lines and results. Like, for example, the fact that your compiler never produced a .o file to begin with.

Friday, June 17, 2011

I am pleased to announce an alpha release of DXR built on top of Clang. A live demo of DXR can be found at http://dxr.mozilla.org/clang/, which is an index of a relatively recent copy of the Clang source code. Since this is merely an alpha release, expect to find bugs and inconsistencies in the output. For more information, you can go to #static on irc.mozilla.org or contact the staticanalysis mailing list. A list of most of the bugs I am aware of is at the end of this blog post.

So what is DXR? DXR is a smart code browser that uses instrumented compilers to capture what the compiler knows about the code and expose it as a database. For C and C++ in particular, using an instrumented compiler is necessary, since it is the only reliable way to cope with macros. Take, for instance, RecursiveASTVisitor in the Clang codebase. Most of its almost 1200 functions are defined via macros as opposed to in raw code; as a consequence, the doxygen output for this class is useless: as far as I can tell, there are only five methods I can override to visit AST nodes. On the other hand, DXR neatly tells me all of the methods that are defined, and can point me to the place where each function is defined (within the macro, of course).

Friday, June 10, 2011

For various reasons, I want to get the extent of a particular expression in clang. Most AST objects in clang have a method along the lines of getSourceRange(), which, on inspection, returns start and end locations for the object in question. Naturally, it's the perfect fit (more or less), so I start using it. However, it turns out that the documentation is a bit of a liar. I would expect the end location to be the location (either file offset or file/line) of the end of the expression, give or take a character depending on whether people prefer half-open or fully closed ranges. Instead, it returns the location of the beginning of the last token in the expression. Getting the true end requires this mess: sm.getFileOffset(Lexer::getLocForEndOfToken(end, 0, sm, features)). Oh yeah, in case you couldn't have guessed, that causes it to relex the file starting at that point. For an idea of my current mood, pick a random image from Jennifer's recent post and you'll likely capture it.
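Since I can't paste clang's C++ lexer here, a toy Python illustration of what getLocForEndOfToken has to do: given only the offset where the last token *starts*, lex forward from that point to find where it stops. (The token pattern is grossly simplified; real lexing is exactly why this is expensive.)

```python
import re

# Toy "token": an identifier, a number, or a single symbol character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def end_of_token(source, start_offset):
    """Return the offset just past the token beginning at start_offset.

    Mimics (very loosely) what clang's Lexer::getLocForEndOfToken must
    do, because SourceRange.getEnd() only points at the last token's
    first character.
    """
    m = TOKEN.match(source, start_offset)
    if m is None:
        raise ValueError("no token at offset %d" % start_offset)
    return m.end()
```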

It's been almost a week since I last discussed DXR, but, if you look at the commit log, you can see that I've not exactly been lying back and doing nothing. No, the main reason is that most of the changes I've been making so far aren't exactly ground-breaking.

In terms of UI, the only thing that's changed is that I now actually link the variable declarations correctly, a result of discovering the typo that was causing it not to work properly in the midst of the other changes. From a user's perspective, I've also radically altered the invocation of DXR. Whereas before it involved a long series of steps consisting of "build here, export these variables, run this script, build there, run that script, then run this file, and I hope you did everything correctly, otherwise you'll wait half an hour to find out you didn't" (most of which take place in different directories), it's now down to . dxr/setup-env.sh (if you invoke that incorrectly, you get a loud error message telling you how to do it correctly; unfortunately, the nature of shells prohibits me from just doing the right thing in the first place), building your code, and then running python dxr/dxr-index.py in the directory with the proper dxr.config. Vastly improved, then. But most of my changes are merely in cleaning up the code.

The first major change I made was replacing the libclang API with a real clang plugin. There's only one thing I've found harder (figuring out inheritance, and I'm not sure I had it right in the first place), and there are a lot of things I've found easier. Given how crazy people get with build systems, latching onto a compiler makes things go a lot more smoothly. The biggest downside I've seen with clang is that its documentation is lacking. RecursiveASTVisitor is the main visitor API I'm using, but the documentation for that class is complete and utter crap, a result of doxygen failing to clear the hurdle of comprehending macros. I ended up using the libindex-based DXR to dump a database of all of the functions in the class, of which there are something like 1289, or around 60 times the number listed in the documentation. Another example is Decl, where the inheritance picture is actually helpful. The documentation, however, manages to omit DeclCXX.h entirely, which is a rather glaring omission if what you are working with is C++ source.

The last set of changes I did was rearchitecting the code to make it hackable by other people. I have started on a basic pluggable architecture for actually implementing new language modules, although most of the information is still hardcoded to just use cxx-clang. In addition, I've begun work on ripping out SQL as the exchange medium of choice: the sidebar list is now directly generated using the post-processing ball of information, and linkification is now set up so that it can be generated independent of tokenization. In the process, I've greatly sped up HTML generation times only to regress a lot of it by tokenizing the file twice (the latter part will unfortunately remain until I get around to changing tokenization API). It shouldn't take that much longer for me to rip SQL fully out of the HTML builder and shove SQL generation into a parallel process for improved end-to-end time.

Some things I've discovered about Python along the way. First, Python closures don't let you mutate closed-over variables. That breaks things. Second, if you have a very large Python object, don't pass it around as an argument to multiprocessing: just make it a global and let the kernel's copy-on-write code make cloning cheap. Third, apparently the fastest way to concatenate strings in Python is to use join with a list comprehension. Finally, Python's dynamic module importing sucks big time.
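Concrete versions of the first and third gotchas, for anyone who hasn't hit them (the standard workaround for the closure one is to stash the state in a mutable container, or use nonlocal in Python 3):

```python
def make_counter_broken():
    count = 0
    def bump():
        # Writing `count += 1` here raises UnboundLocalError: assignment
        # makes `count` local to bump, shadowing the closed-over variable.
        return count
    return bump

def make_counter():
    state = {"count": 0}  # mutating a container's contents is always allowed
    def bump():
        state["count"] += 1
        return state["count"]
    return bump

def render(rows):
    # join over a generator/comprehension beats repeated `+=` on a string,
    # which copies the whole accumulated result on every concatenation.
    return "".join("%s=%s\n" % (k, v) for k, v in rows)
```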

Where am I going from here? I'll have HTMLification (i.e., removing SQL queries and replacing them with ball lookups) fixed up tomorrow; after that, I'll make the CGI scripts work again. The next major change after that is getting inheritance and references into good shape, and then making sure that I can do namespaces correctly. At that point, I think I'll be able to make a prerelease of DXR.

Wednesday, June 8, 2011

This is perhaps a bit belated, but I thought I'd give a brief overview of where I think fakeserver should be going over the next several months (i.e., when I find time to work on it). For those who don't know, Fakeserver is a generalized testing framework for the 4 major mail protocols: IMAP, NNTP, SMTP, and POP. For various reasons, I'm actually getting around to fixing problems with it right now. The following are the features I am going to be adding:

This is a long-standing bug in fakeserver, which I first discovered about 3 years ago when I started testing the IMAP implementation. Fakeserver uses the same connection handler for every connection, which means the state of the connection is shared between all connections, obviously very problematic for a protocol as stateful as IMAP. In my defense, I cribbed from the stateless (more or less) HTTP server and used the barely-stateful NNTP as my first testbed. This bug is probably the most API-breaking of any change to fakeserver code: every invocation of the server must replace new nsMailServer(new handler(daemon)) with new nsMailServer(function (d) { return new handler(d); }, daemon). Subsequent to that, handlers are no longer easily exposed by the server, which means all manipulation must happen on the daemon end. This causes major changes to SMTP code and NNTP code for different reasons; pretty much everything else hides behind helper functions that act as a firewall, minimizing code changes.
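The shape of the fix translates directly out of that JS signature change; in Python terms (all names invented for illustration), it is the difference between a server holding one shared handler and a server holding a handler factory:

```python
class ImapHandler:
    """One connection's worth of protocol state."""
    def __init__(self, daemon):
        self.daemon = daemon
        self.selected_mailbox = None  # per-connection state; must not be shared

class MailServer:
    """Fixed design: takes a handler *factory* plus the daemon, so every
    incoming connection gets a fresh handler with fresh state."""
    def __init__(self, make_handler, daemon):
        self.make_handler = make_handler
        self.daemon = daemon

    def on_connect(self):
        return self.make_handler(self.daemon)

# Analogous to: new nsMailServer(function (d) { return new handler(d); }, daemon)
server = MailServer(lambda d: ImapHandler(d), daemon="shared-maildir")
a = server.on_connect()
b = server.on_connect()
```

The buggy original amounts to constructing one ImapHandler up front and handing the same object to every connection.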

This change (which has no patch yet) will move the fakeserver running into another process. I have some working prototypes for manipulating objects across two different processes, but I don't yet have a good IPC mechanism (which Mozilla strangely lacks) for actually communicating this information. As for API changes, I haven't gotten far enough yet to know the impact of multiprocess fakeserver, but the answer will probably be that handler information goes from "hard to get" to "impossible to get". On the plus side, this should essentially enable mozmill tests to be able to use fakeserver.

I figure fakeserver needs SSL support--it's common in the mail world to use SSL instead of plaintext, so we need to support it. In terms of external API, the only change will be some added methods to start up an SSL server (and methods to get STARTTLS working too, provided my knowledge of how SSL works is correct). The API pseudo-internal to maild.js gets more convoluted, though: essentially, I need to support multiple pumping routines for communication. On the plus side, if the API I'm planning to implement turns out to be feasible, we should also be able to get generic charset decoding nearly for free, for i18n tests.

LDAP address book code is almost completely untested. I'd like to see that fixed. However, LDAP is a binary protocol (BER, to be precise). If I can implement the SSL support in the API framework I want to, then it shouldn't be that much more work to throw a BER-to-text-ish layer in on top of it. The downside is that I'm pretty much looking at using our own LDAP library to do BER encoding/decoding, since I don't want to write that myself.

Saturday, June 4, 2011

A primary goal of DXR is to be easy to use. Like any simple statement, translating that into design decisions is inordinately difficult, and I best approach such issues by thinking out loud. Normally, I do this in IRC channels, but today I think it would be best to think in a louder medium, since the problem is harder.

There are three distinct things that need to be made "easy to use". The first is the generation of the database and the subsequent creation of the web interface. The second is extension of DXR to new languages, while the last is the customization of DXR to provide more information. All of them have significant issues.

Building with DXR

Starting with the first one, build systems are complicated. For a simple GNU-ish utility, ./configure && make is a functioning build system. But where DXR is most useful is on the large, hydra-like projects where figuring out how to build the program is itself a nightmare: Mozilla, OpenJDK, Eclipse, etc. There is also a substantial number of non-autoconf based systems which throw great wrenches in everything. At the very least, I know this much about DXR: I need to set environment variables before configuring and building (i.e., replace the compiler), I need to "watch" the build process (i.e., follow warning spew), and I need to do things after the build finishes (post-processing). Potential options:

Tell the user to run this command before the build and after the build. On the plus side, this means that DXR needs to know absolutely nothing about how the build system works. On the down side, this requires confusing instructions: in particular, since I'm setting environment variables, the user has to specifically type ". " in the shell to get them set up properly. There are a lot of people who do not have significant shell exposure to actually understand why that is necessary, and general usage is different enough from the commands that people are liable to make mistakes doing so.

Guess what the build system looks like and try to do it all by ourselves. This is pretty much the opposite extreme, in that it foists all the work on DXR. If your program is "normal", this won't be a problem. If your program isn't... it will be a world of pain. Take also into consideration that any automated approach is likely to fail hard on Mozilla code to begin with, which effectively makes this a non-starter.

Have the user input their build system to a configuration file and go from there. A step down from the previous item, but it increases the need for configuration files.

Have DXR spawn a shell for the build system. Intriguing, solves some problems but causes others.

Conclusion: well, I don't like any of those options. While the goal of essentially being able to "click" DXR and have it Just Work™ is nice, I have reservations about such an approach being able to work in practice. I think I'll go for a "#1 and punt on this issue to someone with more experience."

Multiple languages

I could devote an entire week's worth of blog posts to this topic, I think, and I would wager that this is more complicated and nebulous than even build systems are. In the end, all we really need to worry about with build systems is replacing compilers with our versions and getting to the end; with languages, we actually need to be very introspective and invasive to do our job.

Probably the best place to start is actually laying out what needs to be done. If the end goal is to produce the source viewer, then we need to at least be able to do syntax highlighting. That by itself is difficult, but people have done it before: I think my gut preference at this point is to basically ask authors of DXR plugins to expose something akin to vim's syntax highlighting instead of asking them to write a full lexer for their language.

On the other end of the spectrum is generating the database. The idea is to use an instrumenting compiler, but while that works for C++ or Java, someone whose primary code is a website in Perl or a large Python utility has a hard time writing a compiler. Perhaps the best option here is just parsing the source code when we walk the tree. There is also the question about what to do with the build system: surely people might want help understanding what it is their Makefile.in is really doing.

So what does the database look like? For standard programming languages, we appear to have a wide-ranging and clear notion of types/classes, functions, and variables, with slightly vaguer notions of inheritance, macros (in both the lexical preprocessing sense and the type-based sense of C++'s templates), and visibility. Dynamic languages like JavaScript or Python might lack some reliable information (e.g., variables don't have types, although people often still act as if they have implicit type information), but they still uphold this general contract. If you consider instead things like CSS and HTML or Makefiles in the build system, this general scheme completely fails to hold, but you can still desire information in the database: for example, it would help to be able to pinpoint which CSS rules apply to a given HTML element.

This raises the question: how does one handle multiple languages in the database? As I ponder this, I realize that there are multiple domains of knowledge: what is available in one language is not necessarily available in another. Of the languages Mozilla uses, C, assembly, C++, and Objective-C[++] all share the ability to access any information written in the others; contrast this to JS code, which can only interact with native code via a subset of IDL or interactions with native functions. IDL is a third space, a middle ground between native and JS code, but insufficiently compatible with either to be lumped in with one. Options:

Dump each language into the same tables with no distinction. This has problems in so far as some languages can't be shoehorned into the same models, but I think that in such cases, one is probably looking for different enough information anyways that it doesn't matter. The advantage of this is that searching for an identifier will bring it up everywhere. The disadvantage... is that it gets brought up everywhere.

Similar to #1, but make an extra column for language, and let people filter by language.

Going a step further, take the extra language information and build up the notion of different bindings: this foo.bar method on a python object may be implemented by this Python_foo_bar C binding. In other words, add another table which lists this cross-pollination and takes it into account when searching or providing detailed information.

Instead of the language column in tables, make different tables for every language.

Instead of tables, use databases?

Hmm. I think the binding cross-reference is important. On closer thought, it's not really languages themselves that are the issue here, it's essentially the target bindings: if we have a system that is doing non-trivial build system work that involves cross-compiling, it matters if what we are doing is being done for the host or being done for the target. Apart from that, I think right now that the best approach is to have different tables.
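As a sketch of that conclusion (per-language tables plus a cross-binding table), using sqlite; the table and column names here are invented for illustration, not DXR's actual schema:

```python
import sqlite3

LANGUAGES = ("cxx", "js", "idl")

def create_schema(conn):
    """One functions table per language, plus a bindings table recording
    'this name in language A is implemented by that name in language B'."""
    for lang in LANGUAGES:
        conn.execute(
            "CREATE TABLE funcs_%s ("
            " name TEXT, file TEXT, line INTEGER)" % lang)
    conn.execute(
        "CREATE TABLE bindings ("
        " from_lang TEXT, from_name TEXT,"
        " to_lang TEXT, to_name TEXT)")

conn = sqlite3.connect(":memory:")
create_schema(conn)
```

Searching for an identifier then means querying the tables whose models actually fit it, and following bindings rows to surface the cross-language implementations.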

Extraneous information

The previous discussion bleeds into this final one, since they both ultimately concern themselves with one thing: the database. This time, the question is how to handle generation of information beyond the "standard" set of information. Information, as I see it, comes in a few forms. There is additional information at the granularity of identifiers (this function consumes 234 bytes of space or this is the documentation for the function), lines (this line was not executed in the test suite), files (this file gets compiled to this binary library), and arguably directories or other concepts not totally mappable to constructs in the source tree (e.g., output libraries).

The main question here is not the design of the database: it's only a question of extra tables or extra columns (or both!). Instead, the real question is the design of the programmatic mechanisms. In dxr+dehydra, the simple answer is to load multiple scripts. For dxr+clang, however, the question becomes a lot more difficult, since the code is written in C++ and doesn't dynamically load modules the way dehydra does. It also raises the question of what API to expose. On the other hand, I'm not sure I know enough of the problem space to be able to come up with solutions. I think I'll leave this one for later.

Thursday, June 2, 2011

As you might have gathered from my last post, I spend a fair amount of time thinking up ways of visualizing software metrics. Today's visualization is inspired by a lunch conversation I had yesterday. Someone asked "how much space does SVG take up in libxul?" This isn't a terribly hard problem to solve: we have a tool that can tell us the size of symbols (i.e., codesighs), and I have a tool that gives me a database of where every symbol is defined in source code.

Actually, it's not so simple. codesighs, as I might have predicted, doesn't appear to work that well, so I regressed back to nm. And when I actually poked the generated DXR database for mozilla-central using dehydra (I haven't had the guts to try my dxr-clang integration on m-c yet), I discovered that it appears to lack exactly the information about functions, both static functions and C++ member functions; my suspicions of breakage are now confirmed. So I resorted to one of my "one-liner" scripts, as follows:

It's really 4 commands, which just happen to include two HEREDOC cats and a multi-line Python script. The output is a not-so-little HTML file that loads a treemap visualization using the Google API. The end result can be seen here. To answer the question posed yesterday: SVG accounts for about 5-6% of libxul in terms of content, and about another 1% for the layout component; altogether, about as much of libxul as our xpconnect code accounts for. For Thunderbird folks, I've taken the time to similarly divvy up libxul, and that can be found here.
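For flavor, the aggregation step of such a pipeline might look roughly like this in Python; this is a stand-in, not the original script, and both the `nm -S`-style input format (address, size, type, name) and the symbol-to-path mapping are assumptions:

```python
from collections import defaultdict

def sizes_by_directory(nm_lines, symbol_to_path):
    """Aggregate symbol sizes from `nm -S -C`-style output
    (address size type name) into per-top-level-directory totals,
    suitable for feeding a treemap."""
    totals = defaultdict(int)
    for line in nm_lines:
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip undefined symbols and malformed lines
        _addr, size, _kind, name = fields
        path = symbol_to_path.get(name)
        if path is not None:
            top = path.split("/")[0]
            totals[top] += int(size, 16)  # nm prints sizes in hex
    return dict(totals)
```

The symbol-to-path mapping is exactly what a DXR database of definition locations provides.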

P.S., if you're complaining about the lack of hard numbers, blame Google for giving me a crappy API that doesn't allow me to make custom tooltips.

This is a rant in prelude to a blog post I expect to make tomorrow morning (or later today, depending on your current timezone). As someone who is mildly interested in data, I spend a fair amount of time thinking up things that would be interesting to just see dumped out visually. And I'm not particularly interested in small datasets: one of my smaller datasets I've been playing with is "every symbol in libxul.so"; one of my larger datasets is "every news message posted to Usenet this year, excluding binaries" (note: this is 13 GiB worth of data, and that's not 100% complete).

So while I have a fair amount of data, what I need is a simple way to visualize it. If what I have is a simple bar chart or scatter plot, it's possible to put it in Excel or LibreOffice... only to watch those programs choke on a mere few thousand data points (inputting a 120K file caused LibreOffice to hang for about 5 minutes to produce the scatter plot). But spreadsheet programs clearly lack the power for serious visualization; the most glaring chart they lack is the box-and-whiskers plot. Of course, if the data I'm looking at isn't simple 1D or 2D data, then what I really need won't be satisfied by them anyway.

I also need to visualize large tree data, typically in a tree map. Since the code for making squarified tree maps is more than I care to write for a simple vis project, I'd rather just use a simple toolkit. But which to use? Java-based toolkits (i.e., prefuse) require me to sit down and make a full-fledged application for what should be a quick data visualization, and I don't know any Flash to be able to use Flash toolkits (i.e., flare). For JavaScript, I've tried the JavaScript InfoVis Toolkit, Google's toolkit, and protovis, all without much luck. JIT and protovis both require too much baggage for a simple "what does this data look like", and Google's API is too inflexible to do anything more than "ooh, pretty treemap". Hence why my previous foray into an application for viewing code coverage used a Java applet: it was the only thing I could get working.

What I want is a toolkit that gracefully supports large datasets, allows me to easily drop data in and play with views (preferably one that doesn't try to dynamically change the display based on partial options reconfiguration like most office suites attempt), supports a variety of datasets, and has a fine-grained level of customizability. Kind of like SpotFire or Tableau, just a tad bit more in my price range. Ideally, it would also be easy to put on the web, too, although supporting crappy old IE versions isn't a major feature I need. Is that really too much to ask for?

Wednesday, June 1, 2011

As I'm building yet another program from scratch (I've now run DXR on four separate trees in the past week, and I'm still looking for a good one to use for tests), let me take the time to explain how to set up DXR on your own computer. Instead of stuffing it all in this blog post, most of the information is on wiki.mozilla.org. One thing not on that page: I currently have the SQL files in the clang dumper dump to the source directory instead of the object directory, so the DXR config needs to have the source and object directory be the same setting, even if you don't actually build the source in-place!