Recently I have started contributing my own syntax highlighters for M4 macros and Dockerfiles, and it is this latter project that this blog post is about.

Under the hood, Brackets uses CodeMirror to provide language syntax highlighting. It comes with a range of language “modes”, which are really just JavaScript modules that statefully tokenize code, mapping each token to a CSS style for syntax colouring/highlighting. They can also handle indenting and commenting.

I wrote my original extensions in a similar manner, hand-coding the state machine and tokenizing the code with regular expressions. However, with my Dockerfile extension I quickly realised that this code had become too complicated, too convoluted and too difficult to maintain. Just look at this code in my project’s history…

Now, my background is in C, with experience of tools like Lex/Flex and Yacc/Bison. Flex is an open-source lexical analyser generator and Bison a grammar parser generator. What I wanted was something similar, but for JavaScript. On searching, I found Jacob (also available via NPM here), which provides both of these capabilities in one tool. It seemed the lexer component of Jacob would be an ideal way of coding, and hopefully simplifying, my Dockerfile extension.

Installing Jacob was easy:

$ npm install jacob

I created a Dockerfile.jacoblex file. This provides a lexical definition of the language I wanted to parse and tokenize. The file is divided into three sections, separated by %%.

One section holds named regular-expression definitions: in this case, just a regex matching all of the Dockerfile’s possible keywords.
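To illustrate what that keyword regex covers, here is a plain JavaScript sketch of the pattern. This is not the .jacoblex definition itself, and the keyword list is my own reconstruction of the Dockerfile instruction set of the time:

```javascript
// Hypothetical reconstruction of the directive keyword pattern.
// Dockerfile keywords are case-insensitive, hence the /i flag.
var directive = /^(ADD|CMD|COPY|ENTRYPOINT|ENV|EXPOSE|FROM|LABEL|MAINTAINER|ONBUILD|RUN|USER|VOLUME|WORKDIR)\b/i;

console.log(directive.test('FROM ubuntu'));        // true
console.log(directive.test('run apt-get update')); // true
console.log(directive.test('# a comment'));        // false
```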

The final section defines the parsing rules and state machine. Here is a simple example, which parses a comment and returns the ‘COMMENT’ token:

%%
/********************************************************************
 * Eat Comments in default state
 */
<>#.* {
    return 'COMMENT';
}

A more complicated example, using the above named regex:

/********************************************************************
 * Match all directives and enter respective states
 */
<>{directive} {
    var d = this.jjtext.match(/(\w+)/)[1];
    this.jjval = d;
    if (d.match(/RUN|CMD/i)) {
        this.localMode = this.bashMode;
        this.localState = this.bashMode.startState();
    } else {
        if (d.match(/FROM|MAINTAINER|EXPOSE/i)) {
            this.stringAs = d.toLowerCase();
        }
        if (!d.match(/ENV|LABEL/i)) {
            d = 'DOCKDIR';
        }
        this.pushState(d.toUpperCase());
    }
    return 'DOCKDIR';
}

<DOCKDIR>\s+ {
    this.jjval = this.jjtext.trim();
    this.popState();
    this.pushState('UNQUOTEDSTRING');
    return null;
}

<DOCKDIR>. {
    return 'ERROR';
}

The first part of this rule matches on {directive} (the Dockerfile keywords) and then uses this.pushState() to advance the state machine, e.g. to DOCKDIR, so that the rules associated with that state (denoted by <DOCKDIR>) can then be applied. The method this.popState(), as its name implies, reverts to the previous state on the stack.
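The push/pop behaviour is just a stack of state names. Here is a minimal sketch of the mechanics, for illustration only: Jacob generates this machinery for you, and only the pushState()/popState() names below mirror its API:

```javascript
// Illustrative state stack, mirroring the behaviour of the generated lexer's
// pushState()/popState(). Not Jacob source code.
function StateStack(initial) {
    this.states = [initial];
}
StateStack.prototype.pushState = function (name) {
    this.states.push(name);
};
StateStack.prototype.popState = function () {
    return this.states.pop();
};
StateStack.prototype.topState = function () {
    return this.states[this.states.length - 1];
};

var s = new StateStack('INITIAL');
s.pushState('DOCKDIR');          // after matching {directive}
s.pushState('UNQUOTEDSTRING');   // after the whitespace rule
s.popState();                    // back to DOCKDIR
```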

Integrating the generated lexer into a custom CodeMirror Mode proved a little challenging, until I realised that I could simply 1) use the lexer itself as the mode’s State object, and 2) extend the Stream object to provide the extra methods expected by Jacob.

As CodeMirror was feeding my tokenizer stream line-by-line, I needed to think carefully about how the lexer could work (e.g. the regex ‘$’ anchor does not work, requiring an alternative approach using this.input.more()), and also to reapply the stream on each iteration.
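A minimal sketch of that stream wiring, purely for illustration: LineStream below stands in for CodeMirror’s StringStream, and the more() method I graft on is my assumed shape for the extra look-ahead the lexer needs (check Jacob’s documentation for the real contract):

```javascript
// Minimal stand-in for CodeMirror's StringStream (illustrative only).
function LineStream(text) {
    this.text = text;
    this.pos = 0;
}
LineStream.prototype.next = function () {
    // Consume and return the next character, or undefined at end of line.
    return this.pos < this.text.length ? this.text.charAt(this.pos++) : undefined;
};
LineStream.prototype.eol = function () {
    return this.pos >= this.text.length;
};
// Extra method grafted on for the lexer: is there unread input on this line?
// (Assumed semantics for illustration.)
LineStream.prototype.more = function () {
    return !this.eol();
};

var stream = new LineStream('FROM ubuntu');
```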

The start state is created using:

if (state.start) {
    state.start = false;
    state.localMode = null;
    state.localState = null;
    state.stringAs = null;
    state.quote = null;
    state.setInput(stream);
}
state.bashMode = bashMode;

Then for each iteration, I ensured the lexer’s input was reset to the current stream object:

state.input = stream;
var token = state.nextToken();
if (token.name === 'STRING' && state.stringAs) {
    token.name = state.stringAs;
}
return (token.name ? token.name.toLowerCase() : null);
};

The call to state.nextToken() in fact calls the lexer generated by Jacob. The returned token’s name attribute is then passed back as the syntax-highlighting style name (e.g. ‘def’, ‘string’, ‘error’, etc.).

I realised CodeMirror’s internal copyState() method couldn’t fully copy the lexer state object, so I coded a custom method: