VERSION

DESCRIPTION

This module provides a tokenizer, "tokenize", which breaks C source code into its smallest meaningful components, and the regular expressions which match each of these components. For example, the module supplies a regular expression "$comment_re" which matches a C comment line.

It also supplies some extra regular expressions, such as "$include_local" for local include statements and "$cvar_re" for C variables, as well as extra functions, like "decomment" for stripping the comment marks from traditional C comments.

REGULAR EXPRESSIONS

The following regular expressions can be imported from this module using, for example,

use C::Tokenize '$cpp_re';

to import $cpp_re.

Except where noted, the following regular expressions do not capture anything. If you want to capture, add your own parentheses around the regular expression.
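
As a rough illustration of adding your own parentheses, the sketch below wraps a non-capturing pattern in a capture group. The pattern $stand_in_re is a simplified stand-in written for this example, not the module's actual regular expression; with C::Tokenize installed, an imported pattern such as $trad_comment_re would presumably take its place.

```perl
use strict;
use warnings;

# Simplified stand-in for a non-capturing comment regex. This is an
# assumption made for the example, not the pattern C::Tokenize uses.
my $stand_in_re = qr{/\*(?:.|\n)*?\*/};

my $c = "int x; /* first */ int y; /* second */";

# The parentheses around the interpolated regex make each match
# available in $1.
my @comments;
while ($c =~ /($stand_in_re)/g) {
    push @comments, $1;
}

print "$_\n" for @comments;
```

Because the stand-in itself uses only the non-capturing group (?: ), the outer parentheses are always capture group 1.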

Comments

$trad_comment_re

Match /* */ comments.

$cxx_comment_re

Match // comments.

$comment_re

Match both /* */ and // comments.

Preprocessor instructions

$cpp_re

Match all C preprocessor instructions, such as #define, #include, #endif, and so on.

$include_local

Match an include statement which uses double quotes, like #include "some.c".

$cvar_re

Match a C variable.

Because in theory this can contain very complex things, this regex is somewhat heuristic and there are edge cases where it is known to fail. See t/cvar_re.t in the distribution for examples.

This was added in version 0.11 of C::Tokenize.

VARIABLES

@fields

The exported variable @fields contains a list of all the fields which are extracted by "tokenize".

FUNCTIONS

decomment

my $out = decomment ('/* comment */');
# $out = " comment ";

Remove the traditional C comment marks /* and */ from the beginning and end of a string, leaving only the comment contents. The string has to begin and end with comment marks.
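
The behaviour described above can be approximated with two anchored substitutions. This is only a sketch: strip_comment_marks is a hypothetical helper written for illustration, and the module's own implementation of "decomment" may differ.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented behaviour of decomment;
# it is not the module's own code.
sub strip_comment_marks
{
    my ($comment) = @_;
    # Both substitutions are anchored, so comment marks elsewhere in
    # the string are left alone.
    $comment =~ s{\A/\*}{};
    $comment =~ s{\*/\z}{};
    return $comment;
}

my $out = strip_comment_marks ('/* comment */');
print "[$out]\n";
```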

tokenize

my $tokens = tokenize ($file);

Convert $file into a series of tokens. The return value is an array reference which contains hash references. Each hash reference corresponds to one token in the C file. Each token contains the following keys:

leading

Any whitespace which comes before the token (called "leading whitespace").

type

The type of the token, which may be one of the following:

comment

A comment, like

/* This */

or

// this.

cpp

A C preprocessor instruction like

#define THIS 1

or

#include "That.h".

char_const

A character constant, like '\0' or 'a'.

grammar

A piece of C "grammar", like { or ] or ->.

number

A number, such as 42.

word

A word, which may be a variable name or a function name.

string

A string, like "this", or even "like" "this".

reserved

A C reserved word, like auto or goto.

All of the fields which may be captured are available in the variable "@fields" which can be exported from the module:

use C::Tokenize '@fields';

$name

The value of the type, stored under a key named after the type. For example, if $token->{type} equals 'comment', then the value of the type is in $token->{comment}.

if ($token->{type} eq 'string') {
    my $c_string = $token->{string};
}

line

The line number of the C file where the token occurred. For a multi-line comment or preprocessor instruction, the line number refers to the final line.
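
To show how the keys fit together, the following sketch walks a list of tokens, counts them by type, and rebuilds the source text from the "leading" and value keys. The token list is hand-built in the shape described above so that the example runs without the module or a C file; it is an assumption about what "tokenize" might return for the text shown, not captured output.

```perl
use strict;
use warnings;

# Hand-built tokens for the C text "p->n /* count */", following the
# documented keys (leading, type, a key named after the type, line).
# In real use this array reference would come from tokenize.
my $tokens = [
    {leading => '',  type => 'word',    word    => 'p',           line => 1},
    {leading => '',  type => 'grammar', grammar => '->',          line => 1},
    {leading => '',  type => 'word',    word    => 'n',           line => 1},
    {leading => ' ', type => 'comment', comment => '/* count */', line => 1},
];

my $rebuilt = '';
my %count;
for my $token (@$tokens) {
    my $type = $token->{type};
    # The token's text is stored under the key named by its type.
    $count{$type}++;
    $rebuilt .= $token->{leading} . $token->{$type};
}

print "$rebuilt\n";
print "$_: $count{$_}\n" for sort keys %count;
```

Looking up $token->{$type} avoids a chain of comparisons against each possible type name.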

EXPORTS

Nothing is exported by default.

use C::Tokenize ':all';

exports all the regular expressions and functions from the module, and also "@fields".