token-type: one of the following symbols:
'NAME, 'NUMBER, 'STRING,
'OP, 'COMMENT, 'NL,
'NEWLINE, 'DEDENT, 'INDENT,
'ERRORTOKEN, or 'ENDMARKER. The only difference between
'NEWLINE and 'NL is that 'NEWLINE occurs only
if the indentation level is at 0.

text: the string content of the token.

start-pos: the line and column where the token starts, as a list of
two numbers.

end-pos: the line and column where the token ends, as a list of two
numbers.

current-line: the current line that the tokenizer is on.

The last token produced, under normal circumstances, will be
'ENDMARKER.
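
For comparison, Python's own tokenize module yields tuples with the
same five fields. The following is a minimal sketch using only the
Python standard library; the sample source string is made up for
illustration, and the Racket generate-tokens may differ in details.

```python
import io
import tokenize

# A made-up one-line program to tokenize.
source = "x = 1\n"

# generate_tokens takes a readline-style callable and yields 5-tuples:
# (type, string, start, end, line).
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok_name maps the numeric token type to a name such as NAME or OP.
    print(tokenize.tok_name[tok.type], repr(tok.string),
          tok.start, tok.end, repr(tok.line))
```

The last tuple in the list has the ENDMARKER type, which matches the
behavior described above.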

If a recoverable error occurs, generate-tokens will produce
single-character tokens with the 'ERRORTOKEN type until it
can recover.

Unrecoverable errors occur when the tokenizer encounters EOF
in the middle of a multi-line string or statement, or if an
indentation level is inconsistent. On an unrecoverable error,
generate-tokens will raise an exn:fail:token or
exn:fail:indentation error.
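
The same two failure modes exist in Python's tokenize module, where
they surface as tokenize.TokenError and IndentationError. The sketch
below shows them on the Python side only; the helper names
tokenize_all and error_kind are hypothetical, and the mapping to
exn:fail:token and exn:fail:indentation is an assumption based on the
description above.

```python
import io
import tokenize

def tokenize_all(source):
    """Run Python's tokenizer over a string and collect every token."""
    return list(tokenize.generate_tokens(io.StringIO(source).readline))

def error_kind(source):
    """Return the name of the exception raised while tokenizing, or None."""
    try:
        tokenize_all(source)
    except tokenize.TokenError:
        return "TokenError"
    except IndentationError:
        return "IndentationError"
    return None

# EOF inside a triple-quoted (multi-line) string.
unterminated_string = 's = """abc\n'

# The last line dedents to a column that matches no enclosing block.
bad_dedent = "if x:\n        a\n    b\n"

print(error_kind(unterminated_string))
print(error_kind(bad_dedent))
print(error_kind("x = 1\n"))
```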

2 Translator comments

The translation is a fairly direct one; I wrote an auxiliary package
to deal with the while loops, which proved invaluable during the
translation of the code. It may be instructive to compare the source
here to that of tokenize.py.

Here are some points I observed while doing the translation:

Mutation pervades the entirety of the tokenizer’s main loop.
The main reason is that Python’s while is a statement: it returns no
value and doesn’t carry variables around. The while loop therefore
communicates values from one part of the code to others through
mutation, often in wildly distant locations.
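
To make the point concrete, here is a small sketch of the style in
question. It is not taken from tokenize.py; scan_words is a
hypothetical function written only to show how a while loop passes
state along by mutating variables declared outside the loop body.

```python
def scan_words(line):
    """Split a line into words using while loops that mutate shared state."""
    pos = 0       # mutated on every iteration of both loops
    words = []    # results accumulate here by side effect
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        start = pos
        # The inner loop advances the same pos that the outer loop tests.
        while pos < len(line) and not line[pos].isspace():
            pos += 1
        words.append(line[start:pos])
    return words

print(scan_words("a  bc d"))
```

Neither loop produces a value of its own; everything the caller sees
comes from the mutations to pos and words.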

Racket makes a syntactic distinction between variable definition
(define) and mutation (set!), whereas Python uses plain assignment
for both. I’ve had to deduce which variables were intended to be
temporaries, and hopefully I haven’t introduced any errors along the
way.

In some cases, Racket has finer-grained type distinctions than
Python. Python does not use a separate type to represent individual
characters, and instead uses a length-1 string. In this translation,
I’ve used characters where I think they’re appropriate.
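
A short Python illustration of the length-1 string convention
(the variable names are made up for the example):

```python
s = "ab"

# Indexing a string yields another string, not a distinct character type.
first = s[0]
first_is_str = isinstance(first, str)
first_length = len(first)

# A length-1 string can itself be indexed, yielding an equal string.
indexed_again = first[0]

print(first_is_str, first_length, indexed_again == first)
```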

Most uses of raw strings in Python can be translated to uses of the
at-exp reader.
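
As a reminder of what a raw string buys on the Python side, the
sketch below writes the same regular expression with and without the
r prefix. The pattern is an illustrative one and is not copied from
tokenize.py.

```python
import re

# Raw string: backslashes are kept as written.
pattern_raw = r"\d+\.\d+"

# Ordinary string: every backslash has to be doubled.
pattern_plain = "\\d+\\.\\d+"

same = pattern_raw == pattern_plain
matched = re.match(pattern_raw, "3.14")

print(same, matched.group(0))
```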

Generators in Racket and in Python are pretty similar, though
the Racket documentation could do a better job of documenting them.

When dealing with generators in Racket, what one usually wants to
produce is a generic sequence. For that reason, the Racket
documentation really needs to place more emphasis on in-generator,
rather than on the raw generator form.
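
In Python the generator object is already usable as a sequence, which
is roughly the role the text above assigns to in-generator in Racket.
A minimal Python sketch, with countdown as a hypothetical example
function:

```python
def countdown(n):
    """Yield n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

# The generator object can be consumed directly by a for loop or a
# comprehension, with no extra wrapping step.
values = [v for v in countdown(3)]

print(values)
```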

Python heavily overloads the in operator. Its expressivity
makes it easy to write code with it. On the flip side, its
flexibility makes it a little harder to know what it actually means.
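
A few of the meanings that in takes on, shown as a small Python
sketch with made-up values:

```python
checks = {
    # Substring test on strings.
    "substring": "ab" in "cabbage",
    # Element membership on lists.
    "list element": 3 in [1, 2, 3],
    # Key membership on dicts (keys are tested, not values).
    "dict key": "k" in {"k": 1},
    "dict value": 1 in {"k": 1},
}

# The same keyword also drives iteration.
letters = [ch for ch in "abc"]

print(checks, letters)
```

Each use has to be translated differently, depending on the type of
the right-hand operand.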

Regular expressions, on the whole, match well between the two
languages. Minor differences in the syntax are potholes: Racket’s
regular expression matcher does not have an implicit begin
anchor, and Racket’s regexps are more sensitive to escape characters.
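
The implicit begin anchor on the Python side comes from re.match,
which only matches at the start of the string. The sketch below
contrasts it with re.search; the sample text is made up.

```python
import re

text = "x = 42"

# re.match is implicitly anchored at the start, so this finds nothing.
anchored = re.match(r"\d+", text)

# re.search scans the whole string for the first match.
unanchored = re.search(r"\d+", text)

print(anchored, unanchored.group(0))
```

A translation of code that relies on re.match therefore has to add
the anchor explicitly to the pattern.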

Python’s regexp engine returns a single match object that can support
different operators. Racket, on the other hand, requires the user to
select between getting the position of the match, with
regexp-match-positions, or getting the textual content with
regexp-match.
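
On the Python side, a single match object answers both questions.
A minimal sketch with a made-up input string:

```python
import re

m = re.search(r"\d+", "x = 42")

# Textual content of the match.
matched_text = m.group(0)

# Position of the match as (start, end) offsets.
matched_span = m.span()

print(matched_text, matched_span)
```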

3 Release history

1.0 (2012-02-29): initial release

1.1 (2012-09-10): corrected an infinite-loop bug due to mis-typing a paren. Thanks to Joe Politz for the bug report!