Outline

This is the first of a two-article series. This article discusses the techniques and ideas and gives a JavaScript solution. Part 2 gives a C# solution.

Unfortunately for JScript users, I will no longer update the JScript code; I will focus on C# only.

Introduction

Have you ever wondered how the CP team highlights the source code in their edited articles? I suppose it's not done by hand and that they must have some clever code to do it.

However, if you look around the forums on the web, you will see that few, if any, have this feature. That's a sad thing, because colored source code is much easier to read. In fact, it would be great to have source code in forums automatically colored with your favorite coloring scheme.

The last-but-not-least reason for writing this article was to learn regular expressions, JavaScript and DOM in one project.

The source code is entirely written in JScript, so it can be included server-side or client-side in your web pages.

The techniques used are:

regular expressions

XML DOM

XSL transformation

CSS style

In this article, I will assume that you have a little knowledge of regular expressions, DOM and XSLT, although I'm also a newbie in those 3 topics.

Live Demo

CP does not accept script or form tags in the article. To play with the live demo, download the "JScript" enabled page (see download section).

Transformation Overview

Parsing pipe

All the boxes will be discussed in detail in the next chapters. Here I will give a short overview of the process.

First, a language syntax specification file is loaded (Language specification box). This specification is a plain XML file supplied by the user. To speed things up, this document is preprocessed (Preprocessing box).

Let us suppose for simplicity that we have the source code to colorize (Code box). Note that I will show how to apply the coloring to a whole HTML page later on. The parser, using the preprocessed syntax document, builds an XML document representing the parsed code (Parsing box). The technique used by the parser is to split the code into a succession of nodes of different types: keyword, comment, literal, etc.

Finally, an XSLT transformation is applied to the parsed code document to render it to HTML, and a CSS style is supplied to match the desired appearance.

Parsing Procedure

The philosophy used to build the parser is inspired by the Kate documentation (see [1]).

The code is considered as a succession of contexts. For example, in C++,

keyword: if, else, while, etc...

preprocessor instruction: #ifdef, ...

literals: "..."

line comment: // ...

block comment: /* ... */

and the rest.

For each context, we define rules that have 3 properties:

a regular expression for matching a string

the context of the text matched by the rule: attribute

the context of the text following the rule: context

The rules are ordered by priority. For example, we first look for a /* ... */ comment, then a // ... line comment, then a literal, etc.

When a rule is matched using its regular expression, the matched string is assigned the attribute context, the current context is updated to context, and parsing continues. The diagram shows the possible paths between contexts. As one can see, some rules do not lead to a new context.

Context dynamics

Let me explain the diagram a bit. Suppose that we are in the code context. We look for the first match of any of the code rules: /**/, //, "...", keyword. Moreover, we have to take their priorities into account: a keyword is not really a keyword inside a block comment, so it has a lower priority. This task is easily and naturally done with regular expressions.

Once we find a match, we look for the rule that triggered it (always following the priority of the rules). Therefore, a pathological line like the following is parsed correctly:

// a keyword while in a comment

while is not considered as a keyword since it is in a comment.
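The context mechanism described above can be sketched in a few lines of JavaScript. This is only a minimal illustration, not the parser shipped with the article: the `spec` object, the `parse` function and the node shape `{ type, text }` are hypothetical names, the specification is written as a plain object instead of XML, and every rule is assumed to match at least one character.

```javascript
// Hypothetical, simplified specification: each context has an attribute
// (label for text matched by no rule) and rules ordered by priority.
const spec = {
  code: {
    attribute: "code",
    rules: [
      { regexp: /\/\*/, attribute: "blockcomment", context: "blockcomment" },
      { regexp: /\/\//, attribute: "linecomment", context: "linecomment" },
      { regexp: /"/, attribute: "literal", context: "literal" },
      { regexp: /\b(?:if|else|while)\b/, attribute: "keyword", context: "code" }
    ]
  },
  blockcomment: {
    attribute: "blockcomment",
    rules: [{ regexp: /\*\//, attribute: "blockcomment", context: "code" }]
  },
  linecomment: {
    attribute: "linecomment",
    rules: [{ regexp: /\n/, attribute: "linecomment", context: "code" }]
  },
  literal: {
    attribute: "literal",
    rules: [{ regexp: /"/, attribute: "literal", context: "code" }]
  }
};

// Split code into a list of { type, text } nodes by walking the contexts.
function parse(code, spec, startContext) {
  const nodes = [];
  let context = startContext;
  let pos = 0;
  while (pos < code.length) {
    const current = spec[context];
    let best = null;
    for (const rule of current.rules) {
      // Search from the current position only.
      const re = new RegExp(rule.regexp.source, "g");
      re.lastIndex = pos;
      const m = re.exec(code);
      // Strict "<" keeps the earlier (higher-priority) rule on ties.
      if (m && (best === null || m.index < best.match.index)) {
        best = { rule: rule, match: m };
      }
    }
    if (best === null) {
      // No rule matches: the rest of the text belongs to the current context.
      nodes.push({ type: current.attribute, text: code.slice(pos) });
      break;
    }
    if (best.match.index > pos) {
      nodes.push({ type: current.attribute, text: code.slice(pos, best.match.index) });
    }
    nodes.push({ type: best.rule.attribute, text: best.match[0] });
    pos = best.match.index + best.match[0].length;
    context = best.rule.context; // switch to the context named by the rule
  }
  return nodes;
}

const nodes = parse("// a keyword while in a comment\nwhile (x) {}", spec, "code");
const keywords = nodes.filter(function (n) { return n.type === "keyword"; });
```

In this sketch, the `while` inside the comment ends up in a linecomment node because the parser is in the linecomment context when it reaches it; only the `while` on the second line becomes a keyword node.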

Rules Available

There are 5 rules currently available:

detect2chars: detects a pattern made of 2 characters.

detectchar: detects a pattern made of 1 character.

linecontinue: detects the end of a line.

keyword: detects a keyword out of a keyword family.

regexp: matches a regular expression.

regexp is by far the most powerful rule of all as all other rules are represented internally by regular expressions.
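To illustrate how the other rules can be reduced to regular expressions, here is a minimal sketch. The function names `escapeRegExp` and `ruleToRegExpSource` and the object shape of a rule are hypothetical; the attribute names (char, char1, family, expression) follow the specification table. Mapping linecontinue to `\n` is my assumption based on the description "detects the end of a line".

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the regular expression source for one rule.
// keywordFamilies maps a family id to an already-built expression source.
function ruleToRegExpSource(rule, keywordFamilies) {
  switch (rule.type) {
    case "detect2chars":
      return escapeRegExp(rule.char) + escapeRegExp(rule.char1);
    case "detectchar":
      return escapeRegExp(rule.char);
    case "linecontinue":
      return "\\n"; // assumption: "end of line" means a newline character
    case "keyword":
      return keywordFamilies[rule.family];
    case "regexp":
      return rule.expression;
    default:
      throw new Error("Unknown rule type: " + rule.type);
  }
}

// Example: the rule that detects the start of a C++ block comment.
const blockCommentStart = ruleToRegExpSource(
  { type: "detect2chars", char: "/", char1: "*" },
  {}
);
```

Escaping matters here because characters such as * and ( are operators in a regular expression, while the detectchar and detect2chars rules are meant to match them literally.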

Language Specification

From the rules and contexts above, we derive an XML structure as described in the XSD schema below (I don't really understand XSD, but .NET generates this nice diagram...)

Language specification schema. Click on the image to view it full size.

I will briefly discuss the language specification file here. For more details, look at the XSD schema or at the highlight.xml specification file (for C++). Basically, you must define families of keywords, choose contexts and write the rules to pass from one to another.

Nodes

| Name | Type (E = element, A = attribute) | Parent Node | Description |
| --- | --- | --- | --- |
| highlight | root | none | The root node |
| needs-build | A (optional) | highlight | "yes" if the file needs preprocessing |
| save-build | A (optional***) | highlight | "yes" if the file has to be saved after preprocessing |
| keywordlists | E | highlight | Node containing families of keywords as children |
| keywordlist | E | keywordlists | A family of keywords |
| id | A | keywordlist | String identifier |
| pre | A (optional) | keywordlist | Regular expression to append before the keyword |
| post | A (optional) | keywordlist | Regular expression to append at the end of the keyword |
| regexp | A (optional*) | keywordlist | Regular expression matching the keyword family. Built by the preprocessor |
| kw | E | keywordlist | Text or CDATA node containing the keywords |
| languages | E | highlight | Node containing languages as children |
| language | E | languages | A language specification |
| contexts | E | language | A collection of context nodes |
| default | A | contexts | String identifying the default context |
| context | E | contexts | A context node containing rules as children |
| id | A | context | String identifier |
| attribute | A | context | The name of the node in which the context will be stored |
| detect2chars** | E | context | Rule to detect a pair of characters (ex: /*) |
| char | A | detect2chars | First character of the pattern |
| char1 | A | detect2chars | Second character of the pattern |
| detectchar** | E | context | Rule to detect one character (ex: ") |
| char | A | detectchar | Character to match |
| keyword** | E | context | Rule to match a family of keywords |
| family | A | keyword | Family identifier, must match /highlight/keywordlists/keywordlist[@id] |
| regexp | E | context | A regular expression to match |
| expression | A | regexp | The regular expression |

Comments:

*: this attribute is optional provided that preprocessing takes place. The usual approach is either to always preprocess, or to preprocess once with the "save-build" parameter set to "yes" so that the result of the preprocessing is saved. Note that if you modify the language syntax, you will have to preprocess again.

**: all those elements have two other attributes:

| Name | Type | Parent Node | Description |
| --- | --- | --- | --- |
| attribute (optional) | A | a rule | The name of the node in which the matched string will be stored. If not set or equal to "hidden", no node is created. |
| context | A | a rule | The next context. |

***: Client-side JavaScript is not allowed to write files. Hence, this option applies only to server-side execution.

Preprocessing

In the preprocessing phase, we build the regular expressions that will be used later on to match the rules. This section makes extensive use of regular expressions. As mentioned before, this is not a tutorial on regular expressions since I'm also a newbie in that topic. A tool that I have found really useful is Expresso (see [3]), a regular expression testing tool.

Keyword Families

Building the regular expression for a keyword family is straightforward. You just need to concatenate the keywords together using |:

<keywordlist...><kw>if</kw><kw>else</kw></keywordlist>

will be matched by

\b(if|else)\b
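A minimal sketch of this concatenation step is shown below. The function names `escapeRegExp` and `buildKeywordRegExp` are hypothetical, and the keywords are passed as a plain array rather than read from the XML document. Using `\b` when pre or post is not given is an assumption made to reproduce the expression above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the regular expression source for one keyword family.
// pre and post play the role of the optional pre/post attributes;
// defaulting them to \b is an assumption of this sketch.
function buildKeywordRegExp(keywords, pre, post) {
  const before = pre === undefined ? "\\b" : pre;
  const after = post === undefined ? "\\b" : post;
  const alternatives = keywords.map(escapeRegExp).join("|");
  return before + "(" + alternatives + ")" + after;
}

const source = buildKeywordRegExp(["if", "else"]);
const keywordRegExp = new RegExp(source);
```

With the word boundaries in place, the expression matches `else` as a whole word but not the `else` inside `elsewhere`.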

The generated regular expression is added as an attribute to the keywordlist node:

Basic Templates

This template applies to the node cpp-linecomment, which corresponds to a single-line comment in C++. We apply the CSS style to this node by encapsulating it in span tags and specifying the CSS class. Moreover, we do not want character escaping for that, so we use

<xsl:value-of select="text()" disable-output-escaping="yes"/></span>

The Parsedcode Template

It gets a little complicated here. As everybody knows, XSL quickly becomes really complicated once you want to write more advanced stylesheets. Below is the template for parsedcode; it does a simple thing but looks ugly: it checks whether the in-box parameter is true and, if so, creates pre tags; otherwise it creates code tags.

Javascript Call

This is where you have to customize the methods a bit. The rendering is done in the method highlightCode:

highlightCode( sLang, sRootTag, bInBox, sCode)

where

sLang is a string identifying the language ("cpp" for C++),

sRootTag is the name of the node that will encapsulate the code. For example, pre for boxed code, code for inline code,

bInBox is a boolean set to true if in-box has to be set to true,

sCode is the source code.

It returns the modified code.

The file names are hardcoded inside the highlightCode method: highlight.xml for the language specification, highlight.xsl for the stylesheet. In the article, the XML syntax is embedded in an xml tag and is simply accessed using its id.

Applying Code Transformation to an Entire HTML Page

So now you are wondering how to apply this transformation to an entire HTML page? Well, surprisingly, this can be done in... 2 lines! In fact, there is a method String::replace(regExp, replace) that replaces the substrings matching the regular expression regExp with replace. The best part of the story is that replace can be a function... So we (almost) just need to pass highlightCode and we are done.
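The idea can be sketched as follows. This is an illustration under assumptions, not the code from the download: `fakeHighlightCode` is a hypothetical stand-in with the same parameter list as highlightCode that only wraps the code in a tag, `highlightPage` is a hypothetical name, and the pattern assumes that code blocks are written as plain `<pre>...</pre>` without attributes and always contain C++.

```javascript
// Stand-in for the article's highlightCode (same parameters, trivial body).
function fakeHighlightCode(sLang, sRootTag, bInBox, sCode) {
  return "<" + sRootTag + ' class="' + sLang + '">' + sCode + "</" + sRootTag + ">";
}

// Replace every <pre>...</pre> block of the page with its highlighted version.
function highlightPage(html) {
  // [\s\S]*? matches any characters, including newlines, as lazily as possible.
  return html.replace(/<pre>([\s\S]*?)<\/pre>/gi, function (match, code) {
    // The callback receives the whole match, then each captured group.
    return fakeHighlightCode("cpp", "pre", true, code);
  });
}

const page = "<p>x</p><pre>if (a) b;</pre>";
const result = highlightPage(page);
```

The callback is invoked once per match, and its return value is substituted for the matched text, which is why the function cannot be passed completely unchanged: the arguments supplied by replace have to be mapped to the parameters the highlighter expects.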

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Jonathan de Halleux is a Civil Engineer in Applied Mathematics. He finished his PhD in 2004 in the rainy country of Belgium. After 2 years in the Common Language Runtime (i.e. .NET), he is now working at Microsoft Research on Pex (http://research.microsoft.com/pex).

Works fine in theory, but I have just found a whole swag of issues with style positioning of textbox controls and other controls when using a Mozilla browser to view ASP.NET pages.

I'm not quite sure why, when Microsoft detects a Mozilla browser, they refrain from sending critical positioning information to the client. Weird... I understand older versions of Mozilla don't handle it properly, but the new browser supposedly works correctly.

I don't see what I could do with your syntax files without the parsing code. Anyway, after a quick look at them, I'm glad to see that you had the same ideas as I did. Of course, 30 language syntax files is impressive!

Maybe you could add details on the client JavaScript used to do the folding? This is a developer site, no?

NOOOH thats not advertising , is it :P
oke , maybe it was a little attempt to do that

actually its not that easy to share the javascript, coz the exporter uses the "SyntaxDocument" class (which parses the document and handles all the folding structures internally), and the exporter merely loops through all rows in the document and reads the folding settings from each row, and in case the row is foldable, the exporter outputs a little javascript....

so the javascript is nothing that can be applied to a set of text or anything, you need the hierarchies from the document in order to make anything useful of it...
it is not much more than a function that sets the "display" style of a html DIV element.

although, it would be interesting to compare the pros and cons of our different syntax definition files

I had a brief look through the code and the article and I didn't see any quick way of specifying if keywords are case sensitive or not. Eg C/C#/C++ are case sensitive while VB/VB.NET/SQL etc aren't. Did I miss something obvious?

Also: I was wondering if you've thought of extending the algorithm to be context sensitive. When colorizing HTML, the HTML itself obeys one set of rules but javascript or C# code blocks obey a completely different set of rules. Any thoughts on this?

Chris Maunder wrote: I had a brief look through the code and the article and I didn't see any quick way of specifying if keywords are case sensitive or not. Eg C/C#/C++ are case sensitive while VB/VB.NET/SQL etc aren't. Did I miss something obvious?

Damn VB! A priori, there is no straightforward solution since the "case sensitivity" is specified for the entire regular expression.

Maybe I could specify if a language is case sensitive at all ? This could be added quite easily, but it's not a very flexible solution. For example, if a language mixes case sensitive, insensitive keywords it will fail.

Chris Maunder wrote:Also: I was wondering if you've thought of extending the algorithm to be context sensitive. When colorizing HTML, the HTML itself obeys one set of rules but javascript or C# code blocks obey a completely different set of rules. Any thoughts on this?

I'm afraid I don't understand the meaning of the sentence ? Do you mean CSS class?

Jonathan de Halleux wrote:Maybe I could specify if a language is case sensitive at all ?

Yep - that would be fine.

Jonathan de Halleux wrote:I'm afraid I don't understand the meaning of the sentence ? Do you mean CSS class?

What I mean is that the colorization works well for HTML and for VBScript but what happens when you have a VBScript block inside some HTML and you want the HTML to be coloured too? They are different languages so they need to be treated separately.

Maybe just a regular expression to define code boundaries. That way you can colour everything inside a code block (eg VBScript) in one pass, and then colour everything outside the code block using a different colourisation scheme (HTML) in another pass.

hello Sir,
How can i get the highlighted text in IE, or Netscape, by JavaScript?
As i am writing a Java program to speak the highlighted text in webpages
Should i communicate with Windows API? If you got any idea about this, Pls
tell me~
Thank You very much~