Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer).

Create a lexical analyzer for the simple programming language specified below. The
program should read input from a file and/or stdin, and write output to a file and/or
stdout. If the language being used has a lexer module/library/class, it would be great
if two versions of the solution are provided: one without the lexer module, and one with.

Input Specification

The simple programming language to be analyzed is more or less a subset of C. It supports the following tokens:

Operators

Name             Common name            Character sequence
Op_multiply      multiply               *
Op_divide        divide                 /
Op_mod           mod                    %
Op_add           plus                   +
Op_subtract      minus                  -
Op_negate        unary minus            -
Op_less          less than              <
Op_lessequal     less than or equal     <=
Op_greater       greater than           >
Op_greaterequal  greater than or equal  >=
Op_equal         equal                  ==
Op_notequal      not equal              !=
Op_not           unary not              !
Op_assign        assignment             =
Op_and           logical and            &&
Op_or            logical or             ||

The - token should always be interpreted as Op_subtract by the lexer. Turning some Op_subtract into Op_negate will be the job of the syntax analyzer, which is not part of this task.

Symbols

Name        Common name        Character
LeftParen   left parenthesis   (
RightParen  right parenthesis  )
LeftBrace   left brace         {
RightBrace  right brace        }
Semicolon   semi-colon         ;
Comma       comma              ,

Keywords

Name           Character sequence
Keyword_if     if
Keyword_else   else
Keyword_while  while
Keyword_print  print
Keyword_putc   putc

Identifiers and literals

These differ from the previous tokens in that each occurrence of them has a value associated with it.

Identifier (identifier)
    Format: one or more letter/digit/underscore characters, but not starting with a digit
    Regex:  [_a-zA-Z][_a-zA-Z0-9]*
    Value:  as is

Integer (integer literal)
    Format: one or more digits
    Regex:  [0-9]+
    Value:  as is, interpreted as a number

Integer (char literal)
    Format: exactly one character (anything except newline or single quote) or one of the allowed escape sequences, enclosed by single quotes
    Regex:  '([^'\n]|\\n|\\\\)'
    Value:  the ASCII code point number of the character, e.g. 65 for 'A' and 10 for '\n'
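To make the specification concrete, here is a minimal Python sketch of a lexer for the keyword/identifier/integer/operator/symbol subset above (the function name `lex` and the table names are mine, not part of the task; char literals, comments, and the "42fred is invalid" adjacency rule are omitted for brevity). Note that, per the spec, `-` always lexes as Op_subtract:

```python
import re

# Longest operators first so maximal munch works ("<=" matches before "<")
OPERATORS = [("<=", "Op_lessequal"), (">=", "Op_greaterequal"),
             ("==", "Op_equal"), ("!=", "Op_notequal"),
             ("&&", "Op_and"), ("||", "Op_or"),
             ("*", "Op_multiply"), ("/", "Op_divide"), ("%", "Op_mod"),
             ("+", "Op_add"), ("-", "Op_subtract"),
             ("<", "Op_less"), (">", "Op_greater"),
             ("!", "Op_not"), ("=", "Op_assign")]
SYMBOLS = {"(": "LeftParen", ")": "RightParen", "{": "LeftBrace",
           "}": "RightBrace", ";": "Semicolon", ",": "Comma"}
KEYWORDS = {"if", "else", "while", "print", "putc"}

def lex(src):
    """Yield (name, value) pairs; value is None except for identifiers/integers."""
    i, n = 0, len(src)
    while i < n:
        if src[i].isspace():
            i += 1
            continue
        m = re.match(r"[_a-zA-Z][_a-zA-Z0-9]*", src[i:])
        if m:  # keyword or identifier
            word = m.group()
            yield ("Keyword_" + word, None) if word in KEYWORDS else ("Identifier", word)
            i += len(word)
            continue
        m = re.match(r"[0-9]+", src[i:])
        if m:  # integer literal
            yield ("Integer", int(m.group()))
            i += len(m.group())
            continue
        for text, name in OPERATORS:
            if src.startswith(text, i):
                yield (name, None)
                i += len(text)
                break
        else:
            if src[i] in SYMBOLS:
                yield (SYMBOLS[src[i]], None)
                i += 1
            else:
                raise SyntaxError("unrecognised character: " + src[i])
```

For example, `lex("if (x <= 10) x = x - 1;")` produces Keyword_if, LeftParen, Identifier, Op_lessequal, Integer, RightParen, and so on.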

/// <summary>
/// Advance the cursor forward the given number of characters
/// </summary>
/// <param name="characters">Number of characters to advance</param>
private void advance(int characters = 1)
{
    try
    {
        // reset position when there is a newline
        if (CurrentCharacter == "\n")
        {
            _position = 0;
            _line++;
        }

        // ensure that any incompatible characters are not next to the token
        // e.g. 42fred is invalid, and neither recognized as a number nor an identifier.
        // _letters would be the notNextClass
        if (notNextClass != null && notNextClass.Contains(CurrentCharacter))
            error("Unrecognised character: " + CurrentCharacter, _line, _position);

        // only add tokens to the stack that aren't marked as discard - don't want
        // things like open and close quotes/comments
        if (!discard)
        {
            Token token = new Token()
            {
                Type = tokenType,
                Value = tokenValue,
                Line = line,
                Position = position - offset
            };
            tokens.Add(token);
        }
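The adjacency check described in the C# comments above (rejecting forms like 42fred, which is neither a valid number nor a valid identifier) can be sketched in Python; the function name `match_integer` is mine, and error handling is deliberately minimal:

```python
import re

def match_integer(src, pos):
    """Match [0-9]+ at pos, but reject forms like 42fred where a letter
    or underscore immediately follows the digits.
    Returns (value, end_position), or None if no digits at pos."""
    m = re.match(r"[0-9]+", src[pos:])
    if not m:
        return None
    end = pos + len(m.group())
    # the letter/underscore class plays the role of notNextClass in the C# excerpt
    if end < len(src) and (src[end].isalpha() or src[end] == "_"):
        raise SyntaxError("Unrecognised character: " + src[end])
    return int(m.group()), end
```

For example, `match_integer("x = 42;", 4)` returns `(42, 6)`, while `match_integer("42fred", 0)` raises a SyntaxError.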

Lisp has a built-in reader, and you can customize the reader by modifying its readtable. I'm also using Gray streams, an almost-standard feature of Common Lisp, for counting lines and columns.

#| Returns the value of a matched char literal, or dies if it is invalid
sub char_val {
    my $str = string_val();
    die "Multiple characters\n" if length $str > 1;
    die "No character\n"        if length $str == 0;
    ord $str;
}

#| Returns the value of a matched string literal, or dies if it is invalid
sub string_val {
    my ($str, $end) = ($1, $2);
    die "End-of-file\n" if not defined $end;
    die "End-of-line\n" if $str =~ /\n/;
    $str =~ s/\\(.)/$1 eq 'n' ? "\n" : $1 eq '\\' ? $1 : $1 eq $end ? $1 : die "Unknown escape sequence \\$1\n"/rge;
}

#| Returns the source string of a matched literal
sub raw { $& }

#| Returns the source string of a matched string literal, or dies if invalid
sub string_raw {
    string_val();  # Just for the error handling side-effects
    $&;
}

#| Returns a closure, which can be fed a string one piece at a time and gives
#| back the cumulative line and column number each time
sub linecol_accumulator {
    my ($line, $col) = (1, 1);
    sub {
        my $str = shift;
        my @lines = split "\n", $str, -1;
        my ($l, $c) = @lines ? (@lines - 1, length $lines[-1]) : (0, 0);
        if ($l) { $line += $l; $col = 1 + $c }
        else    { $col += $c }
        ($line, $col)
    }
}
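The line/column accumulator idea above translates directly to a Python closure; this is a sketch of the same technique, not part of the Perl solution:

```python
def linecol_accumulator():
    """Return a closure that, fed source text one piece at a time,
    reports the cumulative (line, column) after each piece."""
    line, col = 1, 1
    def feed(piece):
        nonlocal line, col
        lines = piece.split("\n")  # len(lines) - 1 newlines in this piece
        if len(lines) > 1:
            line += len(lines) - 1
            col = 1 + len(lines[-1])  # column restarts after the last newline
        else:
            col += len(lines[-1])
        return line, col
    return feed
```

For example, feeding `"ab"` then `"c\nde"` reports `(1, 3)` and then `(2, 3)`.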

token tokens {
    [
    | <operator>   { make $/<operator>.ast   }
    | <keyword>    { make $/<keyword>.ast    }
    | <symbol>     { make $/<symbol>.ast     }
    | <identifier> { make $/<identifier>.ast }
    | <integer>    { make $/<integer>.ast    }
    | <char>       { make $/<char>.ast       }
    | <string>     { make $/<string>.ast     }
    | <error>
    ]
}

proto token operator      {*}
token operator:sym<*>     { '*'  { make 'Op_multiply' } }
token operator:sym</>     { '/' <!before '*'> { make 'Op_divide' } }
token operator:sym<%>     { '%'  { make 'Op_mod' } }
token operator:sym<+>     { '+'  { make 'Op_add' } }
token operator:sym<->     { '-'  { make 'Op_subtract' } }
token operator:sym('<=')  { '<=' { make 'Op_lessequal' } }
token operator:sym('<')   { '<'  { make 'Op_less' } }
token operator:sym('>=')  { '>=' { make 'Op_greaterequal' } }
token operator:sym('>')   { '>'  { make 'Op_greater' } }
token operator:sym<==>    { '==' { make 'Op_equal' } }
token operator:sym<!=>    { '!=' { make 'Op_notequal' } }
token operator:sym<!>     { '!'  { make 'Op_not' } }
token operator:sym<=>     { '='  { make 'Op_assign' } }
token operator:sym<&&>    { '&&' { make 'Op_and' } }
token operator:sym<||>    { '||' { make 'Op_or' } }

Deviates from the task requirements in that it is written in modular form, so that the output
from one stage can be used directly in the next rather than being re-loaded from a human-readable
form. If required, demo\rosetta\Compiler\extra.e contains some code that achieves the latter.
Code to print the human-readable forms is likewise kept separate from any reusable parts.

procedure skipspacesandcomments()
    while 1 do
        if not find(ch, whitespace) then
            if ch='/' and col<length(oneline) and oneline[col+1]='*' then
                tok_line = line  -- (in case of EOF error)
                tok_col = col
                ch = next_ch()   -- (can be EOF)
                ch = next_ch()   -- (   ""   )
                while 1 do
                    if ch='*' then
                        ch = next_ch()
                        if ch='/' then exit end if
                    elsif ch=EOF then
                        error("EOF in comment")
                    else
                        ch = next_ch()
                    end if
                end while
            else
                exit
            end if
        end if
        ch = next_ch()
    end while
end procedure