Faster JavaScript parsing

Over the past year or so I’ve almost doubled the speed of SpiderMonkey’s JavaScript parser. I did this by speeding up both the scanner (bug 564369, bug 584595, bug 588648, bug 639420, bug 645598) and the parser (bug 637549). I used patch stacks in several of those bugs, and so in those six bugs I actually landed 28 changesets.

Notable things about scanning JavaScript code

Before I explain the changes I made, it will help to explain a few notable things about scanning JavaScript code.

JavaScript is encoded using UCS-2. This means that each character is 16 bits.

There are several character sequences that indicate the end of a line (EOL): ‘\n’, ‘\r’, ‘\r\n’, \u2028 (a.k.a. LINE_SEPARATOR), and \u2029 (a.k.a. PARA_SEPARATOR). Note that ‘\r\n’ is treated as a single character.

JavaScript code is often minified, and the characteristics of minified and non-minified code are quite different. The most important difference is that minified code has much less whitespace.

Scanning improvements

Before I made any changes, there were two different modes in which the scanner could operate. In the first mode, the entire character stream to be scanned was already in memory. In the second, the scanner read the characters from a file in chunks a few thousand chars long. Firefox always uses the first mode (except in the rare case where the platform doesn’t support mmap or an equivalent function), but the JavaScript shell used the second. Supporting the second made things complicated in two ways.

It was possible for an ‘\r\n’ EOL sequence to be split across two chunks, which required some extra checking code.

The scanner often needs to unget chars (up to six chars, due to the lookahead required for \uXXXX sequences), and it couldn’t unget chars across a chunk boundary. This meant that it used a six-char unget buffer. Every time a char was ungotten, it would be copied into this buffer. As a consequence, every time it had to get a char, it first had to look in the unget buffer to see if there was one or more chars that had been previously ungotten. This was an extra check (and a data-dependent and thus unpredictable check).

The first complication was easy to avoid by only reading N-1 chars into the chunk buffer, and only reading the Nth char in the ‘\r\n’ case. But the second complication was harder to avoid with that design. Instead, I just got rid of the second mode of operation; if the JavaScript engine needs to read from file, it now reads the whole file into memory and then scans it via the first mode. This can result in more memory being used but it only affects the shell, not the browser, so it was an acceptable change. This allowed the unget buffer to be completely removed; when a character is ungotten now the scanner just moves back one char in the char buffer being scanned.

Another improvement was that in the old code, there was an explicit EOL normalization step. As each char was gotten from the memory buffer, the scanner would check if it was an EOL sequence; if so it would change it to ‘\n’, if not, it would leave it unchanged. Then it would copy this normalized char into another buffer, and scanning would proceed from this buffer. (The way this copying worked was strange and confusing, to make things worse.) I changed it so that getChar() would do the normalization without requiring the copy into the second buffer.
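The two changes above (ungetting by stepping back in the buffer, and normalizing EOLs inside getChar() without a second buffer) can be sketched roughly as follows. This is a Python illustration only, not the actual SpiderMonkey code, which is C++; the class `CharStream` and its method names are hypothetical, and the whole input is assumed to be in memory already.

```python
LINE_SEPARATOR = "\u2028"
PARA_SEPARATOR = "\u2029"


class CharStream:
    """Hypothetical stand-in for the scanner's char source (input fully in memory)."""

    def __init__(self, source):
        self.buf = source
        self.pos = 0
        self.lineno = 1

    def get_char(self):
        """Return the next char, with every EOL sequence normalized to a newline."""
        if self.pos >= len(self.buf):
            return None  # end of input
        c = self.buf[self.pos]
        self.pos += 1
        if c == "\r":
            # Treat CR LF as a single EOL by also consuming the LF.
            if self.pos < len(self.buf) and self.buf[self.pos] == "\n":
                self.pos += 1
            c = "\n"
        elif c == LINE_SEPARATOR or c == PARA_SEPARATOR:
            c = "\n"
        if c == "\n":
            self.lineno += 1
        return c

    def unget_char(self, c):
        """Undo the last get_char() by moving back in the buffer; no unget buffer."""
        if c is None:
            return  # nothing was consumed at end of input
        self.pos -= 1
        if c == "\n":
            # Step back over both halves of a CR LF pair.
            if (self.pos > 0 and self.buf[self.pos] == "\n"
                    and self.buf[self.pos - 1] == "\r"):
                self.pos -= 1
            self.lineno -= 1
```

Because the normalized char is returned directly, nothing is copied into a second buffer, and ungetting is just index arithmetic.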

The scanner has to detect EOL sequences in order to know which line it is on. At first glance, this requires checking every char to see if it’s an EOL, and the scanner uses a small look-up table to make this fast. However, it turns out that you don’t have to check every char. For example, once you know that you’re scanning an identifier, you know that if you hit an EOL sequence you’ll immediately unget it, because that marks the end of the identifier. And when you unget that char you’ll undo the line update that you did when you hit the EOL. This same logic applies in other situations (eg. parsing a number). So I added a function getCharIgnoreEOL() that doesn’t do the EOL check. It has to always be paired with ungetCharIgnoreEOL() and requires some care as to where it’s used, but it avoids the EOL check on more than half the scanned chars.
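To make the pairing concrete, here is a rough Python sketch of an identifier loop that skips the EOL check. The names mirror getCharIgnoreEOL() and ungetCharIgnoreEOL(), but the `Scanner` class, the ASCII-only `IDENT_CHARS` set and the simplified identifier rule are all assumptions made for illustration.

```python
EOL_CHARS = frozenset("\n\r\u2028\u2029")
# Simplified, ASCII-only identifier chars (an assumption for this sketch).
IDENT_CHARS = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_$"
)


class Scanner:
    def __init__(self, source):
        self.buf = source
        self.pos = 0
        self.lineno = 1

    def get_char(self):
        """General path: every char is checked against the EOL set."""
        if self.pos >= len(self.buf):
            return None
        c = self.buf[self.pos]
        self.pos += 1
        if c in EOL_CHARS:
            if c == "\r" and self.buf[self.pos:self.pos + 1] == "\n":
                self.pos += 1  # CR LF counts as one EOL
            self.lineno += 1
            return "\n"
        return c

    def get_char_ignore_eol(self):
        """Fast path: no EOL check and no line-count update."""
        if self.pos >= len(self.buf):
            return None
        c = self.buf[self.pos]
        self.pos += 1
        return c

    def unget_char_ignore_eol(self):
        """Must only undo a get_char_ignore_eol() call."""
        self.pos -= 1

    def scan_identifier(self):
        start = self.pos
        while True:
            c = self.get_char_ignore_eol()
            if c is None:
                break
            if c not in IDENT_CHARS:
                # Any terminator, EOL included, is put straight back, so the
                # line count never needed to change inside this loop.
                self.unget_char_ignore_eol()
                break
        return self.buf[start:self.pos]
```

The EOL that ends the identifier is still in the buffer afterwards, so the general get_char() path sees it and updates the line count exactly once.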

As well as detecting where each token starts and ends, for a lot of token kinds the scanner has to compute a value. For example, after scanning the character sequence “123” it has to convert that to the number 123. The old scanner would copy the chars into a temporary buffer before calling the function that did the conversion. This was unnecessary: the conversion function didn’t even require NULL-terminated strings because it gets passed the length of the string being converted! Also, the old scanner was using js_strtod() to do the number conversion. js_strtod() can convert both integers and fractional numbers, but it’s quite slow and overkill for integers. And when scanning, even before converting the string to a number, we know if the number we just scanned was an integer or not (by remembering if we saw a ‘.’ or exponent). So now the scanner instead calls GetPrefixInteger() which is much faster. Several of the tests in Kraken involve huge arrays of integers, and this made a big difference to them.
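A minimal Python sketch of the integer fast path might look like this. The helper `digits_to_int` is a hypothetical stand-in for the role GetPrefixInteger() plays, Python’s float() stands in for js_strtod(), and the exponent handling is simplified; none of this is the real implementation.

```python
DIGITS = "0123456789"


def digits_to_int(buf, start, end):
    """Cheap integer conversion: one multiply-add per digit, no copying."""
    value = 0
    for i in range(start, end):
        value = value * 10 + (ord(buf[i]) - ord("0"))
    return value


def scan_decimal(buf, pos):
    """Scan a decimal literal starting at buf[pos]; return (value, end_pos)."""
    n = len(buf)
    start = pos
    is_integer = True  # cleared as soon as a '.' or an exponent is seen
    while pos < n and buf[pos] in DIGITS:
        pos += 1
    if pos < n and buf[pos] == ".":
        is_integer = False
        pos += 1
        while pos < n and buf[pos] in DIGITS:
            pos += 1
    if pos < n and buf[pos] in "eE":
        p = pos + 1
        if p < n and buf[p] in "+-":
            p += 1
        if p < n and buf[p] in DIGITS:
            is_integer = False
            while p < n and buf[p] in DIGITS:
                p += 1
            pos = p
    if is_integer:
        # Fast path: work directly on the chars in the buffer.
        return digits_to_int(buf, start, pos), pos
    # Slow path: general floating-point conversion.
    return float(buf[start:pos]), pos
```

The scanner already knows which path to take by the time the last digit has been read, so the slow conversion is only paid for when it is needed.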

There’s a similar story with identifiers, but with an added complication. Identifiers can contain \uXXXX chars, and these need to be normalized before we do more with the string inside SpiderMonkey. So the scanner now remembers whether a \uXXXX char has occurred in an identifier. If not, it can work directly (temporarily) with the copy of the string inside the char buffer. Otherwise, the scanner will rescan the identifier, normalizing and copying it into a new buffer. I.e. the scanner de-optimizes the (very) rare case in order to avoid the copying in the common case.
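The same trade-off, sketched in Python under simplifying assumptions: identifiers are ASCII letters, digits, ‘_’ and ‘$’ plus \uXXXX escapes, and `scan_identifier` and `is_unicode_escape` are hypothetical names. In Python the slice on the fast path is itself a copy; in C++ it would be a pointer and a length into the char buffer.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
IDENT_CHARS = frozenset(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_$"
)


def is_unicode_escape(buf, pos):
    """True if buf[pos:] starts with a backslash, 'u' and four hex digits."""
    digits = buf[pos + 2:pos + 6]
    return (buf[pos:pos + 2] == "\\u" and len(digits) == 4
            and all(ch in HEX_DIGITS for ch in digits))


def scan_identifier(buf, pos):
    """Return (identifier_text, end_pos), decoding escapes only if any were seen."""
    start = pos
    saw_escape = False
    while pos < len(buf):
        if buf[pos] in IDENT_CHARS:
            pos += 1
        elif is_unicode_escape(buf, pos):
            saw_escape = True
            pos += 6
        else:
            break
    raw = buf[start:pos]
    if not saw_escape:
        return raw, pos  # common case: use the chars as they are
    # Rare case: rescan the identifier and build a normalized copy.
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\":
            out.append(chr(int(raw[i + 2:i + 6], 16)))
            i += 6
        else:
            out.append(raw[i])
            i += 1
    return "".join(out), pos
```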

JavaScript supports decimal, hexadecimal and octal numbers. The number-scanning code handled all three kinds in the same loop, which meant that it checked the radix every time it scanned another number char. So I split this into three parts, which makes it both faster and easier to read.
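As a rough illustration of the split, the Python sketch below decides the radix once from the literal’s prefix and then runs a loop that only knows about that radix. The function names are hypothetical, only integer literals are handled, and the octal rule (a leading ‘0’ followed by an octal digit) is a simplification.

```python
DEC_DIGITS = frozenset("0123456789")
OCT_DIGITS = frozenset("01234567")
HEX_DIGITS = frozenset("0123456789abcdefABCDEF")


def scan_run(buf, pos, allowed):
    """Advance past a run of chars from one fixed digit set."""
    while pos < len(buf) and buf[pos] in allowed:
        pos += 1
    return pos


def scan_integer(buf, pos):
    """Return (value, end_pos) for an integer literal starting at buf[pos]."""
    nxt = buf[pos + 1:pos + 2]  # empty string at end of input
    if buf[pos] == "0" and nxt in ("x", "X"):
        start = pos + 2
        end = scan_run(buf, start, HEX_DIGITS)
        if end == start:
            raise ValueError("missing hexadecimal digits")
        return int(buf[start:end], 16), end
    if buf[pos] == "0" and nxt in OCT_DIGITS:
        start = pos + 1
        end = scan_run(buf, start, OCT_DIGITS)
        return int(buf[start:end], 8), end
    end = scan_run(buf, pos, DEC_DIGITS)
    return int(buf[pos:end], 10), end
```

The radix test happens once per literal rather than once per digit, and each branch reads as a small self-contained scanner.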

Although JavaScript chars are 16 bits, the vast majority of chars you see are in the first 128 chars. This is true even for code written in non-Latin scripts, because of all the keywords (e.g. ‘var’) and operators (e.g. ‘+’) and punctuation (e.g. ‘;’). So it’s worth optimizing for those. The main scanning loop (in getTokenInternal()) now first checks every char to see if its value is 128 or greater. If so, it handles it in a side-path (the only legitimate such chars are whitespace, EOL or identifier chars, so that side-path is quite small). The rest of getTokenInternal() can then assume that it’s a sub-128 char. This meant I could be quite aggressive with look-up tables, because having lots of 128-entry look-up tables is fine, but having lots of 65,536-entry look-up tables would not be. One particularly important look-up table is called firstCharKinds; it tells you what kind of token you will have based on the first non-whitespace char in it. For example, if the first char is a letter, it will be an identifier or keyword; if the first char is a ‘0’ it will be a number; and so on.
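A 128-entry table with a side-path for everything else could be sketched like this in Python. The kind names, the table contents and the use of Python’s isspace() and isidentifier() for the non-ASCII side-path are assumptions for illustration; the real firstCharKinds table has its own set of kinds and its own rules.

```python
import string

OTHER, SPACE, EOL, IDENT, NUMBER, STRING = (
    "other", "space", "eol", "ident", "number", "string"
)

# One entry per char below 128, built once up front.
FIRST_CHAR_KINDS = [OTHER] * 128
for ch in string.ascii_letters + "_$":
    FIRST_CHAR_KINDS[ord(ch)] = IDENT
for ch in string.digits:
    FIRST_CHAR_KINDS[ord(ch)] = NUMBER
for ch in "\"'":
    FIRST_CHAR_KINDS[ord(ch)] = STRING
for ch in " \t\v\f":
    FIRST_CHAR_KINDS[ord(ch)] = SPACE
for ch in "\n\r":
    FIRST_CHAR_KINDS[ord(ch)] = EOL


def first_char_kind(c):
    """Classify the first char of a token."""
    code = ord(c)
    if code >= 128:
        # Side-path: only whitespace, EOL or identifier chars are legitimate.
        if c in "\u2028\u2029":
            return EOL
        if c.isspace():
            return SPACE
        if c.isidentifier():
            return IDENT
        return OTHER
    # Common path: a single index into a small table.
    return FIRST_CHAR_KINDS[code]
```

Keeping the table at 128 entries is what makes it cheap to have several of them; the single range check up front pays for that.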

Another important look-up table is called oneCharTokens. There are a handful of tokens that are one-char long, cannot form a valid prefix of another token, and don’t require any additional special handling: ;,?[]{}(). These account for 35–45% of all tokens seen in real code! The scanner can detect them immediately and use another look-up table to convert the token char to the internal token kind without any further tests. After that, the rough order of frequency for different token kinds is as follows: identifiers/keywords, ‘.’, ‘=’, strings, decimal numbers, ‘:’, ‘+’, hex/octal numbers, and then everything else. The scanner now looks for these token kinds in that order.
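A table of this kind can be sketched in a few lines of Python. The token-kind names below are illustrative placeholders, not necessarily the identifiers used in SpiderMonkey, and `try_one_char_token` is a hypothetical helper.

```python
# Maps a sub-128 char code to a token kind, or None if the char is not a
# stand-alone one-char token.
ONE_CHAR_TOKENS = [None] * 128
for ch, kind in [
    (";", "SEMI"), (",", "COMMA"), ("?", "HOOK"),
    ("[", "LEFT_BRACKET"), ("]", "RIGHT_BRACKET"),
    ("{", "LEFT_CURLY"), ("}", "RIGHT_CURLY"),
    ("(", "LEFT_PAREN"), (")", "RIGHT_PAREN"),
]:
    ONE_CHAR_TOKENS[ord(ch)] = kind


def try_one_char_token(c):
    """Return the token kind for a one-char token, else None."""
    code = ord(c)
    if code >= 128:
        return None
    return ONE_CHAR_TOKENS[code]
```

A scanner would try this lookup first and only fall through to the identifier, ‘.’, ‘=’, string and number checks when it returns None.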

Those are just a few of the improvements; there were lots of other little clean-ups. While writing this post I looked at the old scanning code, as it was before I started changing it. It was awful, and it was really hard to see what was happening; getChar() was 150 lines long because it included code for reading the next chunk from file (if necessary) and also normalizing EOLs.

In comparison, as well as being much faster, the new code is much easier to read, and much more DFA-like. It’s worth taking a look at getTokenInternal() in jsscan.cpp.

Parsing improvements

The parsing improvements were all related to the parsing of expressions. When the parser parses an expression like “3” it needs to look for any following operators, such as “+”. And there are roughly a dozen levels of operator precedence. The way the parser did this was to get the next token, check if it matched any of the operators of a particular precedence, and then unget the token if it didn’t match. It would then repeat these steps for the next precedence level, and so on. So if there was no operator after the “3”, the parser would have gotten and ungotten the next token a dozen times! Ungetting and regetting tokens is fast, because there’s a buffer used (i.e. you don’t rescan the token char by char) but it was still a bottleneck. I changed it so that the sub-expression parsers were expected to parse one token past the end of the expression, instead of zero tokens past the end. This meant that the repeated getting/ungetting could be avoided.
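The difference between the two conventions can be sketched with a toy precedence parser in Python. Everything here is hypothetical: tokens are plain strings, there are only six binary precedence levels rather than a dozen, and `Tokens` simply counts how often a token is fetched.

```python
class Tokens:
    """Toy token stream that counts get() calls."""

    def __init__(self, toks):
        self.toks = toks
        self.pos = 0
        self.gets = 0

    def get(self):
        self.gets += 1
        tok = self.toks[self.pos] if self.pos < len(self.toks) else "EOF"
        self.pos += 1
        return tok

    def unget(self):
        self.pos -= 1


# Lowest precedence first; a simplified subset of binary operators.
LEVELS = [["||"], ["&&"], ["==", "!="], ["<", ">"], ["+", "-"], ["*", "/"]]


def parse_old(ts, level=0):
    """Old convention: each level stops zero tokens past its expression."""
    if level == len(LEVELS):
        return ts.get()  # primary expression
    left = parse_old(ts, level + 1)
    while True:
        tok = ts.get()
        if tok not in LEVELS[level]:
            ts.unget()  # the next level up fetches this same token again
            return left
        left = (tok, left, parse_old(ts, level + 1))


def parse_new(ts, level=0):
    """New convention: return (tree, lookahead), one token past the end."""
    if level == len(LEVELS):
        primary = ts.get()
        return primary, ts.get()
    left, tok = parse_new(ts, level + 1)
    while tok in LEVELS[level]:
        right, nxt = parse_new(ts, level + 1)
        left = (tok, left, right)
        tok = nxt
    return left, tok
```

For the single-token input `["3"]`, `parse_old` calls `get()` seven times in this sketch (once for the primary and once per precedence level), whereas `parse_new` calls it twice; both build the same tree.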

These operator parsers are also very small. I inlined them more aggressively, which also helped quite a bit.

Results

I had some timing results but now I can’t find them. But I know that the overall speed-up from my changes was about 1.8x on Chris Leary’s parsemark suite, which takes code from lots of real sites, and parsing times for different codebases tend not to vary that much.

Many real websites, e.g. gmail, have MB of JS code, and this speed-up will probably save one or two tenths of a second when they load. Not something you’d notice, but certainly something that’ll add up over time and help make the browser feel snappier.

Tools

I used Cachegrind to drive most of these changes. It has two features that were crucial.

First, it does event-based profiling, i.e. it counts instructions, memory accesses, etc, rather than time. When making a lot of very small improvements, noise variations often swamp the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.

Second, it gives counts of these events for individual lines of source code. This was particularly important for getTokenInternal(), which is the main scanning function and has around 700 lines; function-level stats wouldn’t have been enough.

I’m driving a bunch of very similar-looking changes to the CSS scanner (bug 543151, as Boris mentions, and its dependencies, plus some more that haven’t made it into Bugzilla yet). I’d like to know a bit more about your testing setup. Unfortunately there’s no CSS equivalent of the JS shell and I’ve had nothing but bad luck trying to use any valgrind tool on a full Firefox build (understandably). I played with processor performance counters for a while, but they weren’t as stable as I would have liked.

The other thing I’m wondering about is keyword processing. CSS is unusual in that it has far more keywords and far less use of arbitrary identifiers than conventional programming languages. I’ve thought for some time that the right way to handle this would be DFA-style keyword detection in the scanner, sort of like what FindKeyword() does in jsscan.cpp, but integrated more deeply into the main scanner loop. On the other hand, I’m not sure that approach scales to hundreds of keywords (the set is layout/style/nsCSSKeywordList.h plus nsCSSPropList.h plus a small handful of extras currently being detected with strcmp() in the parser — order of 750 unique identifiers). What do you think?

Zack: I set up parsemark so it was structured exactly like SunSpider, and this meant I was able to use my existing scripts that run Cachegrind on the JS shell. Using the shell instead of the browser makes things *much* easier. I’ve successfully run Cachegrind on the whole browser before, but only once or twice, and I haven’t done it in a serious, repeated fashion so I don’t know how much non-determinism complicates things. I wonder if Cachegrind’s cg_diff tool might be useful?

As for keywords, that’s one part of the scanner I didn’t touch. It tended not to show up very high in profiles, and already looked like it had been highly optimized. I can believe that a DFA approach would be best for CSS keywords, but I don’t have any great insights.

I just read this whole post. Most of it was over my head but it’s interesting to read how much thought went into this work. Anyway, having got to the end, I’m still stunned by the first sentence:

“Over the past year or so I’ve almost doubled the speed of SpiderMonkey’s JavaScript parser.”

On a personal level, was it an incredible buzz to be able to write that sentence? Superheroes have been ‘crowned’ on less

Where are you based, Nicholas? It’s curious that you are not based in Mozilla central, given that your work and ideas make a lot more sense than those of some people who work in the central Mozilla office!

Thanks for this, Nicholas; I’ve often wondered what was meant by “faster JavaScript parsing” and thanks to your post, the meaning is a lot clearer. (While not that familiar with JavaScript, or with FF internals at all, I have written a parser or two in my career, so I do understand the issues at hand.)

John raises an interesting point, indirectly. Are there any tips you can give ordinary website JS developers for speed based on this new code of yours, Nicholas? For example, is it better to always use semi-colons to end lines, even though JS does not require it?

pd: my advice is this: don’t worry about it! Minify your JS code if you like, anything other than that and I suspect you’ll be wasting time saving mere nanoseconds. I’m sure there are 100 better ways to optimize a website, both at the JS level and at other levels.

Thanks Nick. Nice to get that confirmed. I do use Google PageSpeed, which makes a big difference. For years I’d avoided compression when serving up sites because of a nasty case of IE legacy pain. Lately I’ve turned it on again and it’s great how much difference it makes (that said, I have not used IE much; who would want to?).

So what’s the story with Mozilla? Do they have an employee of the month spatula, McDonalds style? LOL. If so, you and the team on pdf.js would be neck and neck I reckon.

I really liked this post, thanks. I wrote my own js parser (in js) and you gave me a few ideas to try out. You can read a (recent) post on my blog about tokenizing js, if you like. I doubt it’ll give you any new insights, but who knows.

Oh and I recently made a poc with Zeon* for doing the exact kind of profiling you mentioned, with visual inline feedback (number of times a certain statement was hit, relative to the entire source, as a heatmap). Still need to work that one out as a firefox plugin, so you can use it on live code (because the setup for the poc was a bit of a hassle, as you can imagine).

I thought javascript was using UTF-16, else how are you supporting characters like \u1F34E? In which case, this is 2 UTF-16/UCS-2 code points. If not, it seems a waste because UTF-8 supports these upper code points, which would have better support.

Yes, understood, but after 15 years, I thought it would have been rolled into the new specs at some point. C/C++ was created in the 70s, generally only for ASCII, and it supports UTF-16 now. Also, you should clarify that they are code points, not characters, because 2 or more code points could be one character. You could have several succeeding diacritics or markups. It’s one of the reasons comparison operators almost never work. For example, “\u0041\u030A” == “\u00C5” should return true, as they represent the same character.

Just looking, I saw this on Wikipedia: “As of 2009, the latest version of the language is JavaScript 1.8.1. It is a superset of ECMAScript (ECMA-262) Edition 3”. The weird thing here is that it says it conforms to Unicode standard 2.1 but you can use UCS-2 or UTF-16, which seems weird since 2.0 removed UCS-2 in favor of UTF-16. In ECMAScript 5.0, it conforms to Unicode 3.0, which seems odd because 6.0 has been out for a while now. By using UCS-2, you are not conforming to Unicode 2.0+. Also, using UCS-2 will be insanely slow on Intel/AMD chipsets because it only conforms to big endian. “UCS-2 encoding is defined to be big-endian only.” So either you are byte swapping, or not using UCS-2. UTF-16 comes in LE/BE flavors and can be determined by the BOM.