Syntax Highlighting Done Right, Part II

Syntax highlighting is a secondary form of notation that helps human readers interpret what a code snippet does and how. The idea is to use different colors (and other font attributes) for different categories of tokens, so that the contrasts become visual cues for keywords, blocks of text, operators and symbols. Most code editors offer syntax highlighting which, along with line numbering, brace matching and code folding, helps developers determine scope, detect errors and organize code into sections.

In Part I of this series I explained what is at the heart of every syntax highlighter, and presented a basic strategy that highlights well-formed code snippets using HTML, CSS and JavaScript, which incidentally is the strategy I use to highlight every code snippet on this website.

In this last part of the series, I will demonstrate how I use the Syntax object to highlight code.

The HTML code table

An HTML page typically contains more than one code snippet, so I use a repeatable structure to identify and contain each one. It is a table with a specific ID (tbcode) that contains the following elements:

Figure 1. HTML elements in a code table

The TEXTAREA element is the source control, where the actual code resides. It is hidden through the style attribute, so it is just a placeholder. Unlike the contents of most HTML elements, the contents of a text area are not parsed as markup by the HTML engine, which is an advantage in this scenario.

The preformatted (PRE) element is the target control, where the highlighted code will be shown, and the SPAN element is the placeholder for line numbers, filled in after the code is highlighted. I could have used a tag other than PRE as the target, but if the syntax highlighter fails for any reason, the contents of the text area are copied to the preformatted element, which ensures that the code is at least shown in a monospaced font.
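As a rough illustration of the structure described above, here is a small JavaScript helper that builds the markup for one code table as a string. Only the tbcode ID, the three element types and the language hint in the PRE element's class attribute come from this article. The cell layout, the helper names (codeTableMarkup, escapeForTextarea) and the escaping step are my assumptions, not the exact markup used on this site.

```javascript
// Hypothetical helper: escapes the two characters that still matter inside
// a TEXTAREA. Tags are not parsed there, but character references are, and
// a literal "</textarea>" in the code would close the element early.
function escapeForTextarea(source) {
  return source.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Hypothetical helper: returns the markup for one code table.
// Assumed layout: line numbers (SPAN) in the first cell, the hidden source
// (TEXTAREA) and the target (PRE) in the second cell.
function codeTableMarkup(language, source) {
  return [
    '<table id="tbcode">',
    '  <tr>',
    '    <td><span></span></td>',
    '    <td>',
    '      <textarea style="display:none">' + escapeForTextarea(source) + '</textarea>',
    '      <pre class="' + language + '"></pre>',
    '    </td>',
    '  </tr>',
    '</table>'
  ].join("\n");
}
```

Calling codeTableMarkup("JavaScript", "if (a < b) {}") yields a table whose text area holds the escaped source and whose PRE element is empty, waiting for the highlighter.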

When a page on my website loads, I gather browser info to handle cross-browser mishaps, and I collect all the elements that require highlighting in the code tables (ctables) array. I then create an array of syntaxes. A page typically has several code snippets, sometimes in different languages; if a Syntax object has already been instantiated for a particular language, I just reuse it to minimize the code's footprint on the client's machine.

Also, before parsing the code snippets, a universal style sheet object is added to the HEAD section of the page by createStyleElement (found in dom.js):
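A helper of this kind could look roughly like the sketch below. It is not the actual createStyleElement from dom.js: the name createStyleElementSketch, its parameters and its return value are assumptions made for illustration. The document is passed in as an argument so the function can be tried outside a browser with a stand-in object.

```javascript
// Hypothetical sketch of a createStyleElement-like helper (not the dom.js code).
// doc:     the browser's `document`, or any object with the same three methods
// cssText: optional initial rules for the new style sheet
function createStyleElementSketch(doc, cssText) {
  const style = doc.createElement("style");
  style.type = "text/css";
  // Put the rules in a text node so they are treated as plain CSS text
  style.appendChild(doc.createTextNode(cssText || ""));
  // Older browsers lack document.head, so fall back to a tag lookup
  const head = doc.head || doc.getElementsByTagName("head")[0];
  head.appendChild(style);
  return style;
}
```

In a page this would be called once, for example createStyleElementSketch(document, ""), and the CSS rules of each syntax object would then be appended to the returned element.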

The onload event continues by looping through code tables, and for each, it grabs the expected elements and the target language, embedded in the class attribute of the target element. It then references a syntax object for the language if it exists in the array:

In the switch statement, a subclassed Syntax object is created for the target language when none exists, and it is pushed into the array to complete the reusability logic.
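The look-up-then-create logic can be sketched as follows. The two constructor functions are bare stand-ins for the subclassed Syntax objects in syntax.js, and the language property and the getSyntax name are assumptions. Only the array of syntaxes, the switch on the language and the push into the array come from the description above.

```javascript
// Stand-ins for the subclassed Syntax objects defined in syntax.js (hypothetical).
function JavaScriptSyntax() { return { language: "JavaScript" }; }
function HtmlSyntax() { return { language: "HTML" }; }

// One syntax object per language seen so far on this page
const syntaxes = [];

// Returns the syntax object for a language, creating it on first use.
// Returns null when the language is unknown to the script.
function getSyntax(language) {
  // Reuse an existing object if one was already created for this language
  for (let i = 0; i < syntaxes.length; i++) {
    if (syntaxes[i].language === language) return syntaxes[i];
  }
  let syntax = null;
  switch (language) {
    case "JavaScript": syntax = JavaScriptSyntax(); break;
    case "HTML": syntax = HtmlSyntax(); break;
    default: return null; // unknown language: caller falls back to plain text
  }
  syntaxes.push(syntax); // remember it so later snippets reuse it
  return syntax;
}
```

With this shape, ten JavaScript snippets on a page cost a single JavaScriptSyntax object, which is the point of the reuse.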

Each subclassed Syntax object is defined in syntax.js, with its own set of predefined highlighting groups and blocks, but notice how we can add custom keyword groups in the onload event handler. We could have different pages load different scripts with different syntax objects.

To add a language, say Visual Basic, you start by defining a syntax object for it, say VisualBasicSyntax. It should subclass the Syntax object, add keyword groups and block definitions to it (along with their CSS styles) and return it. Then you add a case to the switch statement to handle the "VisualBasic" language, and finally you add code tables containing Visual Basic code snippets to your HTML page.
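To make those steps concrete, here is a minimal sketch of what VisualBasicSyntax could look like. The Syntax stand-in below is deliberately tiny and its methods (addKeywordGroup, addBlock) are hypothetical names; the real Syntax object from Part I may expose a different interface. The keyword list and colors are also just examples.

```javascript
// Minimal stand-in for the Syntax object from Part I (hypothetical API).
function Syntax(language) {
  this.language = language;
  this.groups = []; // keyword groups, each with its own CSS style
  this.blocks = []; // delimited blocks such as comments and strings
}
Syntax.prototype.addKeywordGroup = function (name, keywords, css) {
  this.groups.push({ name: name, keywords: keywords, css: css });
};
Syntax.prototype.addBlock = function (name, start, end, css) {
  this.blocks.push({ name: name, start: start, end: end, css: css });
};

// Creates a Syntax, adds Visual Basic groups and blocks, and returns it.
function VisualBasicSyntax() {
  const syntax = new Syntax("VisualBasic");
  syntax.addKeywordGroup(
    "keyword",
    ["Dim", "As", "Sub", "Function", "End", "If", "Then", "Else"],
    "color: blue"
  );
  syntax.addBlock("comment", "'", "\n", "color: green"); // ' to end of line
  syntax.addBlock("string", '"', '"', "color: maroon");
  return syntax;
}
```

The matching change in the page script would be a new case "VisualBasic" in the switch that calls VisualBasicSyntax().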

Still inside the code tables loop, if a syntax object is successfully created, it is used to parse the source. A helper function then counts source lines and tweaks the SPAN's style; I could have hardcoded the style, but at the very least its font family and size must match those of the PRE element so that the line numbers line up with the code. Finally, if anything goes wrong when parsing or if the language is unknown to the script, the source is copied to the target "as is".
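The last step of the loop might be sketched like this. All three function names are illustrative, and I am assuming the syntax object exposes a parse method that returns highlighted HTML; the real method name and return type may differ. The target can be any object with innerHTML and textContent properties, which a PRE element has.

```javascript
// Counts the lines in a snippet; a single trailing line break is ignored.
function countLines(source) {
  const trimmed = source.replace(/(\r\n|\r|\n)$/, "");
  return trimmed.split(/\r\n|\r|\n/).length;
}

// Builds the text for the line-number SPAN: "1\n2\n3" for three lines.
function lineNumbers(count) {
  return Array.from({ length: count }, function (_, i) { return i + 1; }).join("\n");
}

// Highlights `source` into `target`; falls back to plain text on any failure.
// Returns true if the code was highlighted, false if it was copied as is.
function highlightInto(target, syntax, source) {
  try {
    if (!syntax) throw new Error("unknown language");
    target.innerHTML = syntax.parse(source); // assumed to return HTML
    return true;
  } catch (e) {
    // textContent keeps "<" and "&" literal, so the raw code shows unchanged
    target.textContent = source;
    return false;
  }
}
```

The text from lineNumbers(countLines(source)) would then go into the SPAN, whose font family and size are copied from the PRE element.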

And that is it for the parsing logic. It works based on a structure of HTML placeholders and JavaScript. In general, I can simply copy the code into the text area placeholder and render the page. The exceptions are complicated expressions such as embedded hierarchical HTML and JavaScript regular expressions (see Part I for more details). Hopefully I have given you a good starting point for thinking about languages at the meta level: remember that the code on this page is highlighted by the code described on this page!

I will leave it to you to build on this... If you want a challenge, I have always wanted to highlight the background of HTML color literals with the actual color being defined... This takes a new type of object, namely a prefix object, not covered here.