Introduction

This article presents the tokenizing and splitting functionality
of a simple C++ library called the String Toolkit. Tokenization, in
the context of string processing, is the method by which a sequence of
elements is broken up or fragmented into sub-sequences called tokens.
The indices in the original sequence that determine such breaks in the
sequence are known as delimiters.

There are two types of delimiters: normal or thin delimiters, which
are of length one element, and thick delimiters, which are of length
two or more elements. Even though tokenization is primarily used in
conjunction with strings, any sequence of types that can be iterated
in a linear fashion can be tokenized. Examples include a list of
integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a
"standard" Tokenizer implementation in C++. Of them all, the most
likely candidate might be the implementation in the Boost
library. Regardless, proposed implementations should to some extent
consider one or more of the following areas: overall usage
patterns, constructions, generality (or is it genericity these
days?) of design, and performance and efficiency.

1. Overall Usage Patterns

This requirement is concerned with how easy it is to instantiate the
tokenizer and integrate it into existing processing patterns, which
more often than not requires integration with C++ STL algorithms and
containers. A tokenizer by definition would be classed as a producer,
so the question becomes how easy is it for others to consume its
output? Another issue is consistency of the definition of a token in
the situation where one has consecutive delimiters but is not
compressing them: can or should there be such a thing as an
empty token? What do preceding and trailing delimiters mean? And
when should they be included as part of the token?

2. Constructions

In the context of string tokenizing, the majority of implementations
return the token as a new instance of a string. This process requires
a string to be created on the heap, populated by the sub-string in
question from the original sequence, then returned back (some of
this may be alleviated by Return Value Optimization, RVO).
In the case of iterators this is essentially another copy to the
caller. Furthermore, two kinds of input can make this situation
worse: a large sequence made up of lots of very short tokens, or a
large sequence made up of lots of very large tokens. The solution is
not to return the token as a string but rather as a range, and to
allow the caller to decide how they wish to inspect and further
manipulate the token.

This minor change in interface design provides a great deal of
flexibility and performance gain.
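
As a rough illustration of the idea, the following sketch uses only the
standard library. The names token_range and split_ranges are
hypothetical and are not part of StrTk. Each token is returned as a
pair of iterators into the original string, so no per-token string is
allocated.

```cpp
#include <string>
#include <utility>
#include <vector>

// A token is a half-open range [first, second) into the original string.
using token_range =
   std::pair<std::string::const_iterator, std::string::const_iterator>;

// Hypothetical helper (not StrTk): split on a single delimiter and return
// ranges rather than newly constructed strings.
std::vector<token_range> split_ranges(const std::string& s, char delimiter)
{
   std::vector<token_range> tokens;
   auto token_begin = s.begin();

   for (auto it = s.begin(); it != s.end(); ++it)
   {
      if (*it == delimiter)
      {
         tokens.emplace_back(token_begin, it);
         token_begin = it + 1;
      }
   }

   // The final token, which is empty if the input ends with a delimiter.
   tokens.emplace_back(token_begin, s.end());
   return tokens;
}
```

The caller decides what to do with each range: compare it in place,
measure its length, or construct a std::string from it only when a copy
is really needed. The ranges are only valid while the original string
is alive and unmodified.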

3. Generality (Genericity) of Design

Most tokenizer implementations concern themselves only with strings,
which for the most part is ok, because most things that need
tokenizing are strings. However, there will be times when one has a
sequence of types that may need to be tokenized that aren't strings,
hence a tokenizer should be designed in such a way as to enable such
features. Moreover, it becomes clear that the most basic of tokenizing
algorithms are invariant to the type of the delimiter.

4. Performance and Efficiency

Tokenizing of strings ranges from low frequency inputs such as user
input or parsing of simple configuration files to more complex
situations such as tokenizing of HTML pages that a web crawler/indexer
might do, or parsing of large market-data streams in FIX format.
Performance is generally important and can usually be helped along
with good usage patterns that encourage reuse of instances, minimal
preprocessing and allow for user supplied predicates for the nastier
areas of the process. It should be noted that everything in the
following article can be done purely using the STL/IOStream
libraries. That said, C++'s ability to allow one to skin the
proverbial cat in numerous ways gives rise to novel solutions that are
for the most part not of any practical use other than to showcase
one's abilities in using the STL/IOStreams.

Getting started

The String Toolkit Library (StrTk) provides two common tokenization
concepts, a split function and a token iterator. Both these concepts
require the user to provide a delimiter predicate and an iterator
range over which the process will be carried out.

The tokenization process can be further parametrized by switching
between "compressed delimiters" or "no compressed delimiters" mode.
This essentially means that consecutive delimiters will be compressed
down to one and treated as such. Below are two tables depicting the
expected tokens from various types of input. The tables represent no
compressed and compressed tokenization processes respectively. The
delimiter in this instance is a pipe symbol | and <> denotes
an empty token.

No Compressed Delimiters

Input       Token List
a           a
a|b         a,b
a||b        a,<>,b
|a          <>,a
a|          a,<>
|a||b       <>,a,<>,b
||a||b||    <>,<>,a,<>,b,<>,<>
|           <>,<>
||          <>,<>,<>
|||         <>,<>,<>,<>

Compressed Delimiters

Input       Token List
a           a
a||b        a,b
|a||b       <>,a,b
||a||b||    <>,a,b,<>
|           <>,<>
||          <>,<>
|||         <>,<>
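
The behaviour of the two modes can be sketched with a few lines of
standard C++17. This is an illustrative model only and not StrTk's
implementation. The functions tokenize and render are hypothetical
names introduced for the example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not StrTk): tokenize 'input' on one delimiter,
// optionally treating a run of consecutive delimiters as a single one.
std::vector<std::string_view> tokenize(std::string_view input,
                                       char delimiter,
                                       bool compress)
{
   std::vector<std::string_view> tokens;
   std::size_t token_begin = 0;
   std::size_t i = 0;

   while (i < input.size())
   {
      if (input[i] != delimiter) { ++i; continue; }

      // A delimiter closes the current token (which may be empty).
      tokens.push_back(input.substr(token_begin, i - token_begin));
      ++i;

      if (compress)
      {
         // Swallow the rest of the delimiter run.
         while (i < input.size() && input[i] == delimiter) ++i;
      }

      token_begin = i;
   }

   // Whatever follows the last delimiter is the final token.
   tokens.push_back(input.substr(token_begin));
   return tokens;
}

// Render tokens in the notation used by the tables, eg: "<>,a,b".
std::string render(const std::vector<std::string_view>& tokens)
{
   std::string out;
   for (std::size_t i = 0; i < tokens.size(); ++i)
   {
      if (i != 0) out += ',';
      if (tokens[i].empty()) out += "<>";
      else out.append(tokens[i].data(), tokens[i].size());
   }
   return out;
}
```

Calling render(tokenize("||a||b||", '|', false)) yields the
corresponding row of the first table, and passing true for the last
argument yields the row of the second table.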

Delimiters

Two forms of delimiters are supported: the single delimiter
predicate and the multiple delimiters predicate, otherwise known
as SDP and MDP respectively. Essentially an SDP is
where there is only one type that can break or fragment the
sequence, whereas with MDPs there is more than one unique type that
can break the sequence. It is possible to represent an SDP using
the MDP, however from a performance point of view having separate
predicates is far more efficient. Additionally, for strings based
on char or unsigned char (8-bit versions) there is an MDP that has a
look-up complexity of O(1), making it considerably more efficient
than the basic MDP.

Single Delimiter Predicate

Instantiation requires specialization of type and construction
requires an instance of the type.

As previously mentioned tokenization of data need not be limited to
strings comprised of chars, but can also be extended to other PODs or
complex types. In the above example a predicate used for tokenizing a
sequence of unsigned ints is being defined.

Multiple Char Delimiter Predicate

Instantiation requires an iterator range based on either unsigned
char or char. This delimiter is more efficient than the simple MDP as
it has a predicate evaluation of O(1) due to the use
of a lookup-table, as opposed to O(n) where n is the
number of delimiters. The performance gain grows as more
delimiters are used.
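
The lookup-table idea can be sketched as follows. The class below is a
hypothetical stand-in written for illustration and is not StrTk's own
predicate. It assumes 8-bit characters, so a 256-entry table covers
every possible value.

```cpp
#include <array>
#include <string>

// Hypothetical predicate (illustration only): one table entry per possible
// 8-bit character value, set to true for delimiter characters.
class char_table_predicate
{
public:
   explicit char_table_predicate(const std::string& delimiters)
   {
      table_.fill(false);
      for (char c : delimiters)
      {
         table_[static_cast<unsigned char>(c)] = true;
      }
   }

   // A single array access, independent of the number of delimiters.
   bool operator()(char c) const
   {
      return table_[static_cast<unsigned char>(c)];
   }

private:
   std::array<bool, 256> table_;
};
```

The cast to unsigned char matters: on platforms where char is signed, a
character above 127 would otherwise produce a negative index.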

The delimiter concept can be extended to the point where the
predicate itself can act as a state machine transitioning from
state to state based on conditions and rules related to the current
symbol being processed. This simple extension can lead to some very
interesting parsing capabilities.
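
As a small sketch of such a stateful predicate, consider a comma that
should only act as a delimiter when it appears outside double quotes.
Both quote_aware_predicate and the driver split_with are hypothetical
names using only the standard library. The sketch assumes the predicate
is handed every character exactly once and in order.

```cpp
#include <string>
#include <vector>

// Hypothetical stateful predicate: a two-state machine (inside or outside
// quotes). A comma is a delimiter only in the "outside" state.
class quote_aware_predicate
{
public:
   bool operator()(char c)
   {
      if (c == '"')
      {
         in_quotes_ = !in_quotes_;   // state transition on every quote
         return false;               // the quote itself stays in the token
      }
      return (c == ',') && !in_quotes_;
   }

private:
   bool in_quotes_ = false;
};

// Minimal driver: feeds each character to the predicate once, in order,
// and starts a new token whenever the predicate reports a delimiter.
template <typename Predicate>
std::vector<std::string> split_with(const std::string& s, Predicate pred)
{
   std::vector<std::string> tokens(1);
   for (char c : s)
   {
      if (pred(c)) tokens.emplace_back();
      else         tokens.back() += c;
   }
   return tokens;
}
```

With this predicate the input a,"b,c",d produces three tokens rather
than four, because the comma between b and c is seen while the machine
is in the quoted state.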

Split

This is a function that performs tokenization over an entire sequence
in one go. strtk::split takes a sequence, either through
a range of iterators or, in the case of a string, through
a reference to its instance, a delimiter predicate and an output
iterator, otherwise known as a type sink. It populates the output
iterator with the tokens it extracts. The tokens in this case are
std::pairs of iterators for the sequence.

Split can be used in a "simple - no frills" manner by simply
passing the necessary parameters:

strtk::split provides an additional usage option that
allows the user to specify whether they would like to compress the
delimiters and whether or not they would like to include the
delimiter as part of the token range. This enum parameter is called
strtk::split_options and has the following values:

Split Option                                   Definition
strtk::split_options::default_mode             Default options
strtk::split_options::compress_delimiters      Consecutive delimiters are treated as one
strtk::split_options::include_1st_delimiter    The first delimiter is included in the resulting token range
strtk::split_options::include_delimiters       All delimiters are included in the resulting token range

The simple example below demonstrates a split that will occur over
a string given a predicate where the provided split options indicate
that consecutive delimiters will be treated as one and also all delimiters
encountered after each token will also be included in the token up until
the next token begins.

Split N-Tokens

A natural extension of strtk::split is strtk::split_n.
This function provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more tokens left.
The return value of strtk::split_n is the number of tokens
encountered.

Offset Splitter

Another simple variant is the strtk::offset_splitter.
This form of split takes a series of offsets and an iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been completely traversed. The example
below demonstrates how a string representing date and time can be
tokenized into its constituent parts (year, month, day, hour, minutes,
seconds, milliseconds).
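
The underlying technique can be sketched with the standard library
alone. The function split_by_widths below is a hypothetical helper and
not the StrTk routine. It assumes that each offset is the length of the
next field, and that an incomplete trailing field is dropped rather
than returned.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not StrTk): cut 'input' into consecutive tokens whose
// lengths are given by 'widths'. With rotate == true the widths are reused
// cyclically until the input is exhausted.
std::vector<std::string_view> split_by_widths(std::string_view input,
                                              const std::vector<std::size_t>& widths,
                                              bool rotate = false)
{
   std::vector<std::string_view> tokens;
   std::size_t pos = 0;
   bool progressed = true;

   while (progressed && pos < input.size())
   {
      progressed = false;

      for (std::size_t w : widths)
      {
         // Stop on a zero width or when a full field no longer fits.
         if (w == 0 || pos + w > input.size()) return tokens;
         tokens.push_back(input.substr(pos, w));
         pos += w;
         progressed = true;
      }

      if (!rotate) break;   // single pass over the widths
   }

   return tokens;
}
```

For a timestamp such as 20000101091514345 and the widths
4, 2, 2, 2, 2, 2, 3, the tokens are the year, month, day, hour, minutes,
seconds and milliseconds, each still a view into the original string.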

Note: The parameter regex_match_mode represents the capture
of the marked sub-pattern in the current match. By default it is
strtk::regex_match_mode::match_all which in this case
would provide the entire match including the bounding pattern,
eg: Token3 would be (0ijkx). However in the above example
we are only interested in the sub-pattern between the round-brackets,
hence strtk::regex_match_mode::match_1 is used resulting
in Token3 being 0ijkx.

The following examples demonstrate the use of strtk::split_regex
and strtk::split_regex_n routines in extracting specific
types of data - in this case the POD types int and double.

Tokenizer

The tokenizer is specialized on a sequence iterator and predicate. It
is constructed with a range of iterators for the sequence and a
reference to the desired predicate. Definitions exist for std::string
tokenizers. The tokenizer provides an iteration pattern as a means
for accessing the tokens in the sequence.

Note: For performance and efficient resource management purposes
the strtk::tokenizer does not take ownership or make an
internal copy of the sequence being tokenized. As such the
strtk::tokenizer expects the range to be valid during
the entirety of the tokenization process. This is also the case for
the specified predicate.

The Parse Routine

Till now the mentioned routines worked specifically with tokens, or in
other words ranges of characters. The responsibility of managing the
tokens and converting the tokens to user specified types was done
manually via "range to type" oriented back inserters and converters.
This can be a bit cumbersome and as such StrTk provides a series of
helper routines called strtk::parse. Parse takes an
std::string representing a tuple of delimited values as
input data, a delimiter set, and a series of references to variables
that are to be populated with the values from the parsed tokens. The
following diagram demonstrates the flow of data, tokens and the
corresponding relationships and conversions between each of the
parameters.

Note: strtk::parse returns a boolean value of true
upon successful parsing and false for all other results. Situations
that cause strtk::parse to fail include:

Insufficient number of tokens for the given number of variables

Conversion failure from token range to variable type

Empty or null token(s)
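
To make the flow of data concrete, here is a minimal C++17 sketch of
the same idea built on the standard library. It is not StrTk's
implementation: parse_tuple and convert are hypothetical names, only a
single delimiter character is supported, and conversion is delegated to
stream extraction. It fails in the same three situations listed above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Convert one token to T via stream extraction. The whole token must be
// consumed, so "12x" is rejected for an int target.
template <typename T>
bool convert(const std::string& token, T& out)
{
   if (token.empty()) return false;
   std::istringstream stream(token);
   stream >> out;
   return !stream.fail() &&
          (stream.peek() == std::char_traits<char>::eof());
}

// Strings are copied as-is so that embedded spaces survive.
inline bool convert(const std::string& token, std::string& out)
{
   if (token.empty()) return false;
   out = token;
   return true;
}

// Hypothetical parse: split 'data' on 'delimiter', then convert the i-th
// token into the i-th target, stopping at the first failure.
template <typename... Ts>
bool parse_tuple(const std::string& data, char delimiter, Ts&... targets)
{
   std::vector<std::string> tokens(1);
   for (char c : data)
   {
      if (c == delimiter) tokens.emplace_back();
      else                tokens.back() += c;
   }

   if (tokens.size() < sizeof...(Ts)) return false;   // too few tokens

   std::size_t index = 0;
   // Left-to-right fold: each conversion runs only if all earlier ones passed.
   return (convert(tokens[index++], targets) && ...);
}
```

A call such as parse_tuple("42,3.5,hello world", ',', i, d, s) fills an
int, a double and a std::string in order. Unlike this sketch, a
production parser would avoid materialising every token as a string.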

Some Simple Parse Examples

strtk::parse can take an arbitrary number of variable
references. The code below demonstrates the basic usage of strtk::parse
taking various number of parameters.

Note: The return value of the routine strtk::parse_n
indicates how many elements were parsed and placed into the specified
sequence.

Some Initial Simple Examples

Simple Example 0

As a first example, we'll tackle the simple problem of reversing words in a sentence.
That is given a sentence, to have the first word be the last and the last to be
the first, the second word to be the second last so on and so forth. Using StrTk
we come up with the following solution:

Simple Example 2

Another simple example would be given a text file to read each
of the lines and populate a word list structure by
tokenizing each line into words. The following is an example
of how this can be achieved using StrTk:

Simple Example 3

The following simple example takes a user specified text file,
proceeds to process it and returns information relating to
the file, such as word, letter, uppercase character,
lowercase character, vowel and consonant counts.

Simple Example 4

For the next example, assume we have a text file with a list of names,
one per line that represents the order of people that entered a
building. Some of the people may enter and leave then reenter the
building many times, hence their name will appear multiple times in
the list. Our task is to reduce this list to a list of unique names
but to also maintain the relative order of names found in the original
list. The following is how this particular requirement can be
accomplished by using StrTk:
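
For comparison, the core of the task can also be sketched with the
standard library alone. This is an illustrative alternative and not the
StrTk solution. The function unique_in_order is a hypothetical name,
and it assumes one name per line with blank lines ignored.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: return each distinct line once, in order of first
// appearance. The set answers "seen before?", the vector keeps the order.
std::vector<std::string> unique_in_order(std::istream& input)
{
   std::unordered_set<std::string> seen;
   std::vector<std::string> result;
   std::string line;

   while (std::getline(input, line))
   {
      if (line.empty()) continue;
      // insert() reports whether the name was newly added.
      if (seen.insert(line).second)
      {
         result.push_back(line);
      }
   }

   return result;
}
```

The function accepts any std::istream, so it works equally with a
std::ifstream opened on the names file or a std::istringstream in a
test.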

Simple Example 5

As a final simple example, we would like to calculate the word
frequency model of a user specified text file. The process involves
reading each line, splitting the line into words, then incrementing
the relevant count for each word and maintaining a global word count.
Once the file has been processed, the occurrence frequency of each
word will be printed to stdout.

A Practical Example

Let's assume we have been given an English text file to process, with
the intention of extracting a lexicon from the file.

One solution would be to break the problem down to a line by line
tokenization problem. In this case we would define a functional
object such as the following which will take the container in which
we plan on storing our tokens (words) and a predicate and insert
the tokens as strings into our container.

Now coming back to the original problem, that being the construction
of a lexicon. In this case the set of "words" should only
contain words of interest. For the sake of simplicity let's define
words of interest as being anything other than the following
prepositions: as, at, but, by, for, in, like, next, of, on,
opposite, out, past, to, up and via. This type of list is commonly
known as a Stop Word List. In this example the stop-word list
definition will be as follows:

Some minor updates to the parse_line processor include using the
filter_on_match predicate for determining if the
currently processed token is a preposition and also the invocation of
the range_to_type back_inserter to convert the tokens
from their range iterator representation to a type representation
compatible with the user defined container. For the new implementation
to provide unique words of interest, the simplest change that can be
made is to replace the deque used as the container for the word_list
with some kind of 1-1 associative container such as a set. The following
is the improved version of the parse_line processor:

The above described predicate can be greatly simplified by using
binders and various lambda expressions.

Another Example

When performing serialization or deserialization of an
instance object such as a class or struct, a simple approach
one could take would be to take each of the members and
convert them into string representations and from those strings
construct a larger string delimiting each member with a special
character guaranteed not to exist in any of the string
representations.

In this example we will assume that there exists a struct which
represents the properties of a person, a person struct if you
will:

The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
strtk::parse function.

Batch processing of a text file comprised of one person tuple
per-line is somewhat similar to the previous example. A predicate
is defined that takes a container specialized on the person struct,
and a delimiter predicate with which the
strtk::parse function will be invoked. This
predicate is then instantiated and coupled with the text
file name, is fed to the strtk::for_each_line
processor.

The following assumes an input of date-time values separated by a
pipe. To facilitate parsing of a date-time by
the strtk::parse routine into an STL compatible sequence an
implementation of string_to_type_converter_impl specific
to the datetime type is required. The following demonstrates how such
a routine can be generated and used within the strtk::parse
context:

As a side note, the more commonly used date, time and date-time formats
can be easily parsed with a simple utilities library based on StrTk called
Datetime_Utils. The library makes use of the technique described above
in conjunction with the strtk::offset_splitter to provide efficient and
high performance parsers for formats such as the ones denoted below:

Format                      Example
YYYYMMDD                    20060304
YYYYDDMM                    20060403
YYYY/MM/DD                  2006/03/04
YYYY/DD/MM                  2006/04/03
DD/MM/YYYY                  04/03/2006
HH:MM:SS.mss                13:27:54.123
HH:MM:SS                    13:27:54
YYYYMMDD HH:MM:SS.mss       20060304 13:27:54.123
YYYY/MM/DD HH:MM:SS.mss     2006/03/04 13:27:54.123
DD/MM/YYYY HH:MM:SS.mss     04/03/2006 13:27:54.123
YYYYMMDD HH:MM:SS           20060304 13:27:54
YYYY/MM/DD HH:MM:SS         2006/03/04 13:27:54
DD/MM/YYYY HH:MM:SS         04/03/2006 13:27:54
YYYY-MM-DD HH:MM:SS.mss     2006-03-04 13:27:54.123
DD-MM-YYYY HH:MM:SS         04-03-2006 13:27:54
YYYY-MM-DDTHH:MM:SS         2006-03-04T13:27:54
YYYY-MM-DDTHH:MM:SS.mss     2006-03-04T13:27:54.123
YYYYMMDDTHH:MM:SS           20060304T13:27:54
YYYYMMDDTHH:MM:SS.mss       20060304T13:27:54.123

In the following simple example we have an array of data
representing tuples of trade executions in CSV format. The
objective is to populate the trade_list instance with the
given data via the defined trade struct. In
the example the dt_utils::datetime_format6
date-time parser is used. It populates a general date-time
type instance called dt_utils::datetime. If
the parse operation succeeds, then the date-time components
the trade type requires are updated and the
instance itself is subsequently added to the trade_list.

Parsing Sub-Lists

So far the demonstrated capabilities of the strtk::parse
function have been based on passing a series of parameters that are
populated in a linear fashion as the parser processes the tokens it
encounters. That said, some formats have their own sub-structures. A
simple example would be a list of values (such as integers) that need
to be loaded into a deque or stack. StrTk provides a series of sink
facilities that consume a range and an STL container which can be
forwarded onto strtk::parse.

In the following example, the data string is comprised of 3
separate lists delimited by a pipe "|": an integer list, a
string list and a double list. Each list is to be parsed into an STL
container of appropriate type, in this case a vector, a deque and a
list. StrTk provides the ability to instantiate a sink for the
specific container type that is compatible with
strtk::parse.

If only a certain number of elements in the list are required, for
example only the first 3, then the element count on the sink can be
set appropriately. The above example could be modified as follows:

Note: If there aren't enough elements in a particular
list, then parsing of that list fails and subsequently the call to
strtk::parse will fail.

Parsing Trailing-Lists

Another way one might want to parse a tuple of values might be to parse
a prefix of values into a specific number of possibly varying types,
then to parse the remaining values (assuming they are all of the same
type) into a sequence or list etc. StrTk provides the following simple
solution to the given problem, as demonstrated below:

Extending The Date-Time Parser Example

Building upon the previous datetime example, we are presented with a
tuple of data that represents an astronomical event. The event
defines a name, a location and a series of date-times in UTC the
event was observed. In order to construct the necessary sink(s) that
will be used for parsing the required type into a container, the
macro strtk_register_userdef_type_sink with the specified type
is invoked. The following is a definition of the struct one might
construct:

Token Processing During Parsing

StrTk offers a set of convenient and simple to use token processing
primitives that can be invoked during a call to the strtk::parse
routine to perform various actions upon the tokens being parsed.
These actions include such things as modifications and
validations of tokens.

The following is a list of token processing primitives used for
constraint and verification purposes:

strtk::ignore_token

strtk::expect

strtk::iexpect

strtk::like

strtk::inrange

The following is a list of token processing primitives used for
modifying and normalising purposes:

strtk::trim

strtk::trim_leading

strtk::trim_trailing

strtk::as_lcase

strtk::as_ucase

The primitives all return either a true or false value upon
parsing completion, which is then further used by the
strtk::parse routine to determine if the parse
operation as a whole has succeeded or failed.

Ignore Token Processing

There may be scenarios when given a delimited tuple of data, that
one or more of the tokens need to be ignored or skipped. StrTk
provides a mechanism called strtk::ignore_token that allows the
parser to consume specific tokens easily without affecting
overall performance. Below is an example of how
strtk::ignore_token can be used in conjunction with
strtk::parse to skip the 2nd and 4th tokens in the
tuple:

Expect and IExpect Token Processing

When parsing a tuple, one may want to ensure that specific tokens
of the tuple are of a certain string value. StrTk provides this
type of functionality via the strtk::expect and
strtk::iexpect mechanisms. The strtk::expect form
enforces an exact string match, whereas the strtk::iexpect
enforces only a case-insensitive match. The following is an
example where we attempt to parse a 'pascal-like' variable
declaration and definition. The requirement is that the first
token be "var" followed by a variable name and then a type name
of 'Integer' which is not case sensitive.

Like Token Processing

Similar to the above mentioned strtk::expect and strtk::iexpect
primitives, StrTk provides a simple wildcard matching of tokens
functionality via the strtk::like mechanism. The special
characters of '*' and '?' are used denoting 'zero or more' and
'zero or one' match modes respectively. The following example
uses the strtk::like in conjunction with strtk::expect to parse a
tuple of key/value pairs.
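
The matching rules as described above can be sketched in a few lines of
standard C++17. The function wildcard_match is a hypothetical helper
that follows this article's description ('*' for zero or more
characters, '?' for zero or one). It is not StrTk's implementation, and
the library's exact matching behaviour may differ.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if 'text' matches 'pattern', where
// '*' stands for zero or more characters and '?' for zero or one character.
// Simple recursive form, adequate for short tokens.
bool wildcard_match(std::string_view pattern, std::string_view text)
{
   if (pattern.empty()) return text.empty();

   const char p = pattern.front();
   const std::string_view rest = pattern.substr(1);

   if (p == '*')
   {
      // Try every possible length for the part consumed by '*'.
      for (std::size_t n = 0; n <= text.size(); ++n)
      {
         if (wildcard_match(rest, text.substr(n))) return true;
      }
      return false;
   }

   if (p == '?')
   {
      // Either consume nothing, or consume exactly one character.
      return wildcard_match(rest, text) ||
             (!text.empty() && wildcard_match(rest, text.substr(1)));
   }

   // Ordinary character: must match exactly.
   return !text.empty() && (text.front() == p) &&
          wildcard_match(rest, text.substr(1));
}
```

Under these rules the pattern colo?r accepts both color and colour,
while k*y accepts ky as well as key.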

In-Range Token Processing

When parsing tokens, one may wish to determine if the token when
viewed in its target type resides within a specified range
[r0,r1]. As the tokens can be of any type, not necessarily
just string or numerical in nature, the type must have a less-
than comparable attribute. The following example attempts to
parse a key/value tuple that contains a temperature and a name
component.

Trim Token Processing

At times tokens within a tuple may have padding added to either
the left, right or both ends. When processing the token it may be
necessary to remove the superfluous padding before attempting
to convert the string or range representation of the token into
its target type (int, double etc). The example below demonstrates
the use of various forms of token trimming in conjunction with
strtk::parse.

Case Normalisation Token Processing

Another pair of token processing mechanisms provided by StrTk,
are the strtk::as_lcase and strtk::as_ucase mechanisms. They
convert the string representation of the token to lowercase and
uppercase characters respectively. The following example, parses
a two token tuple, and converts the first token s0 to all
lowercase and the second token s1 to all uppercase.

Parsing Truncated Values

There may be times during parsing when a token which is intended to be
parsed as an integral type (eg: int, short, unsigned int et al) may be
represented using decimal notation (eg: 1234.00000).

Normally if the token were to be passed as-is it would cause a
conversion error due to the fact that there are invalid characters
within the token.

StrTk provides a facility called strtk::truncated_int
that can be used with both signed and unsigned integral types. The
type truncated_int is specialised with the required type, then an
instance of the type is registered with it either prior to or inline
with the conversion/parse call. When the conversion occurs
strtk::truncated_int simply redefines the end of the token range
to be the decimal point (if it is present) and then passes
it along to the appropriate string_to_type_converter_impl call.

An optional parameter that can be utilized is the
'fractional_size' which denotes the exact number of
digits after the decimal place that is expected. In the event this
condition is not met a conversion error is returned.
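
The mechanism can be sketched with std::from_chars from C++17. The
function parse_truncated is a hypothetical helper and not the StrTk
facility. As an assumption of this sketch, a token without a decimal
point is treated as having zero fractional digits, and the fractional
part must consist of digits only.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse the integral part of a token such as
// "1234.00000". If fractional_size is given, exactly that many digits must
// follow the decimal point, otherwise the conversion fails.
template <typename Int>
std::optional<Int> parse_truncated(
   std::string_view token,
   std::optional<std::size_t> fractional_size = std::nullopt)
{
   const std::size_t dot = token.find('.');
   // substr(0, npos) yields the whole token when there is no decimal point.
   const std::string_view integral = token.substr(0, dot);

   std::size_t fraction_digits = 0;
   if (dot != std::string_view::npos)
   {
      const std::string_view fraction = token.substr(dot + 1);
      for (char c : fraction)
      {
         if (c < '0' || c > '9') return std::nullopt;
      }
      fraction_digits = fraction.size();
   }

   if (fractional_size && (fraction_digits != *fractional_size))
   {
      return std::nullopt;
   }

   // Redefine the end of the token to be the decimal point, then convert.
   Int value{};
   const char* first = integral.data();
   const char* last  = first + integral.size();
   const auto [ptr, ec] = std::from_chars(first, last, value);
   if (ec != std::errc() || ptr != last) return std::nullopt;

   return value;
}
```

For example, parse_truncated<int>("1234.00000") produces 1234, whereas
parse_truncated<int>("56.00", 4) fails because only two fractional
digits are present.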

In the following example, we have an array of trade execution
tuples in csv format. The tuple is comprised of the following
fields: ticker name (string), trade id (uint64 right
aligned with space as padding), execution price (double),
executed volume (unsigned int with 4 decimal place suffix).
The struct trade will be used to store the tuples in memory.
The objective is to parse each tuple and populate the trade_list
structure with all the trades, noting any errors that occur along
the way.

It should be noted that truncated_int can be used in conjunction
with the various StrTk token processing primitives such as strtk::trim,
strtk::trim_leading, strtk::as_lcase et al. The following example
demonstrates parsing a tuple of values intended for int types, where the tokens
have random space padding. The simple composition of strtk::truncated_int and
strtk::trim allows for efficient and error free parsing of the tuple.

Columnwise Parsing

In the previous section the ability to ignore tokens in a tuple was
discussed. The concept works well if only a few tokens need to be
ignored. However problems arise when the tuples contain a large number
of tokens and the tokens that are to be ignored are numerous and
distributed uniformly over the entire tuple.

Situations such as this one are common, and using the ignore_token
technique can not only make the coding of the solution cumbersome
and error prone but also make the parsing process itself quite
inefficient.

A natural extension to ignore_token that scales and is also extremely
efficient, can be found in the combined functionalities of the
parse_columns and column_list. The column_list is a structure used to
hold the indexes of the tokens in the tuple that are required. The
indexes have to be valid, unique and in ascending order.

The strtk::parse_columns function takes a string of
data representing a tuple, a delimiter to determine the tokens in
the tuple, a strtk::column_list and a compatible number
of types as target references.

The number of types has to be equal to the number of indexes in the
column_list, and the types need to be convertible within the strtk
namespace from a string range representation to the type. In the
following example we have a tuple consisting of integers. We're only
interested in the first four even numbered indexes in the tuple, the
code below demonstrates how the tuple is parsed with the given
constraints:

In the following example we have a tuple consisting of a mixture of
types. We are only interested in the first, fifth and eighth indexes
in the tuple, which happen to be of type int, double and std::string
respectively. The code below demonstrates how the tuple is
parsed with the given constraints:
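
The selection step behind columnwise parsing can be sketched with the
standard library. The function select_columns is a hypothetical helper
and not the StrTk routine. It assumes zero-based indexes that are
unique and in ascending order, scans the tuple once, and stops as soon
as the last requested column has been found.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: return the tokens found at the requested column
// indexes. If fewer tokens come back than were requested, the tuple did
// not contain all of the columns.
std::vector<std::string_view> select_columns(
   std::string_view data,
   char delimiter,
   const std::vector<std::size_t>& columns)
{
   std::vector<std::string_view> selected;
   std::size_t column = 0;   // index of the token currently being scanned
   std::size_t wanted = 0;   // position in 'columns' of the next request
   std::size_t begin  = 0;   // start offset of the current token

   while (wanted < columns.size() && begin <= data.size())
   {
      std::size_t end = data.find(delimiter, begin);
      if (end == std::string_view::npos) end = data.size();

      if (column == columns[wanted])
      {
         selected.push_back(data.substr(begin, end - begin));
         ++wanted;
      }

      ++column;
      begin = end + 1;
   }

   return selected;
}
```

Tokens that are not requested are never copied or converted, which is
where the saving over repeatedly ignoring tokens comes from. Each
selected view would then be converted to its target type.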

Simple 3D Mesh File Format Parser

Let's assume there is a file format for expressing a mesh. The format
consists of a section that defines the unique vertexes in the mesh,
and another section that defines the triangles in the mesh as a tuple
of three indices indicating the vertexes used. Each section is
preceded by an integer that denotes the number of subsequent
elements in that section. An example of such a file is the following:

Simple Semantic Actions

A semantic action is an action that is undertaken upon a token, it can be
in the form of either an assessment or a manipulation. StrTk provides a very
simplified semantic action capability, named strtk::util::semantic_action
for types that are being parsed via the strtk::parse function.
A function (stateful or stateless), taking an iterator range is used to construct
the semantic_action. The body of the function performs whatever operations
are required and also makes sure to maintain the contract with regards to the
return status for the parse routine to complete successfully.

The following is an example where a comma delimited string is parsed into 3 types:
an integer, a double and a string. The rules regarding parsing and updating of the
variables are as follows: the int variable "i" will only be updated if the
value parsed is odd, the double value "d" will only be updated if the parsed
value is greater than or equal to 99.99 and the string value "s" will only
be updated if the presented range contains the string "ring". Upon every successful
update a corresponding counter will be incremented.

Pick A Random Line From A Text File

A random line of text is to be selected from a user provided
text file comprised of lines of varying length, such that the
probability of the line being selected is 1/N where N is the number
of lines in the text file. There are many trivial solutions to this
problem, however if one were to further constrain the problem by
indicating the file is very large (TBs) and that the system upon
which the solution is to run is very limited memory-wise, most if
not all trivial solutions such as storing indexes of all line offsets,
or reading the entire file into memory etc will be eliminated.

However, there exists a very elegant solution to this problem with
O(n) time complexity and O(1) space complexity. It involves
scanning the entire file once line by line, and at every ith
line choosing whether or not to replace the final result line with
the current line by sampling a uniform distribution of range
[0,1) and performing a replace if and only if the random value is
less than 1 / i.

The logic behind this solution revolves around the fact that the
probability of selecting the ith line will be 1/i, and as such the
total probability of still holding one of the previous i-1 lines will
be 1 - (1/i) or (i - 1)/i. Because each of the (i - 1) lines before
the ith line is equally likely to be the one held, we divide that
probability by (i - 1), resulting in a selection probability of 1/i
for any one of the lines up to and including the ith line. If the ith
line were to be the last line of the text file, this then results in
each of the lines having a selection probability of 1/N. Simple and
sweet, and so too is the following implementation of said solution:
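
A standard-library sketch of the procedure is given below. It is
written for illustration and does not use StrTk. The function name
pick_random_line is hypothetical, and the random number generator is
supplied by the caller so that runs can be made reproducible.

```cpp
#include <cstddef>
#include <istream>
#include <random>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical helper: scan the input once and leave a uniformly chosen
// line in 'chosen'. Returns false if the input contained no lines.
template <typename Rng>
bool pick_random_line(std::istream& input, Rng& rng, std::string& chosen)
{
   std::uniform_real_distribution<double> uniform(0.0, 1.0);   // [0,1)
   std::string line;
   std::size_t i = 0;

   while (std::getline(input, line))
   {
      ++i;
      // Replace the current choice with probability 1/i. For i == 1 the
      // test always succeeds, so the first line is always taken initially.
      if (uniform(rng) < 1.0 / static_cast<double>(i))
      {
         chosen.swap(line);
      }
   }

   return i != 0;
}
```

Only the current line and the current choice are held in memory at any
time, regardless of the number of lines in the file. Swapping rather
than assigning avoids copying the line's characters on each
replacement.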

What changes to the above code would be required if the probability of line selection
was changed to 1/(N-K), where 0 <= K <= N and K is the number of lines that will be randomly
ignored?

Note: TAOCP Volume II section 3.4.2 has an in depth discussion
about this problem, which is generally known as reservoir sampling,
and other similar problems relating to random distributions. Also one
should note that the above example has an inefficiency, in that upon
every string replace an actual string is being copied. Normally all
one needs to do is maintain a file offset to the beginning of the line.
Not doing this causes slow downs due to continuous memory allocations/reallocations,
which is made all the worse when large lines are encountered.

The Buy Low And Sell High Problem

Assume we are given a piece of data in csv format which represents
a time-series for the close prices of the SPDR S&P 500 ETF.
The time-series ranges from 04/01/1999 to 11/11/2011. The objective
is to select two dates, the first being the buy date and the second
being the sell date, such that if we were to buy then sell X shares
of the ETF we will have maximized our profit. It should be noted that of course
the buy date must be before the sell date and that short selling is not
an option in this strategy. The following is a chart that represents the
price of the SPDR over the given period:

Through visual inspection we can approximate that the best buy date
would be towards the end of 2002 and the corresponding sell date would
be towards the end of 2007, as these time points seem to give the largest
difference between the two prices. However it also seems that a buy
roughly at the start of 2009 and a sell at the beginning of 2011
could also provide such a large price difference. As such the visual
inspection approach has led to an ambiguity, hence a more thorough
and precise approach is required.

One could take the naive approach to solving the buy low sell high
problem by initially loading the entire time series into memory, then
for each ith time point taking price_i and testing it against
every other price_j where i < j, while maintaining a "best
profit encountered" structure that contains the best profit so
far and the corresponding buy/sell prices and dates. This solution has
a few problems. To begin with, it is of O(N^2) time complexity
and O(N) space complexity. As an example, for one million time points
it will require on the order of one trillion comparisons and one million
date/price units of storage. If memory and computational processing are
limited on the hardware, this solution cannot be practically executed in a
continuous online manner. Furthermore, as suggested by the time
complexity, as the size of the data grows, regardless of computational
abilities, the compute times for the results would become astronomical
and practically useless - especially for real-time reactive systems.

In these types of problems one tries to assess if an online or
streaming based solution is feasible. That is a solution that does not
require the data to be available all at once, can work on the data
incrementally and requires no more than one-pass for each piece of
data. Such a solution would typically have a time complexity of O(N)
and a space complexity of O(1).

With regards to this problem, the crucial insight required to convert
the naive solution from O(N^2) time complexity to an online
version of O(N) time complexity is that every new global minimum
encountered is the beginning of a new period and an indicator of the
end of the previous period. Looking at the chart, if one were to scan
from left to right, the intuitive response is to find the point with
the lowest price, ignore everything preceding it, then try to find the
next point with the highest price, or global maximum. There are a few
edge cases that need to be dealt with. The main one is the problem
described above, where there are two buy/sell periods that could
potentially be the solution. The way around this is to simply maintain
the best encountered period and compare the profit from any new
period to the best so far; if it is better (more), then replace
the best with the current period. Another edge case is when the data
is in continual decline. In a situation like this there will
be no profitable buy/sell points. The following is a small
subsection of the time-series in question:

The code below is a very simple single pass incremental solution to the given problem.
It reads every line of the input file and parses each line into a date and a price variable.
It then checks whether the current price is less than the current global minimum price. If that
is the case, it sets the current period start to the current date and sets the buy price
to the current price. Otherwise it checks whether the current price is larger than the
global maximum price, and if so it updates the current profit, sell price
and sell date accordingly. If at the end the buy price is not less than the sell price,
there exist no two time points within the given time series for
which a profitable transaction could occur. Otherwise it prints out the buy and sell dates
for the required transaction and the expected profit per share.
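The single pass logic can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: the names best_trade and trade_tracker are hypothetical, file reading and line parsing are left out, and the sketch compares each price against the running minimum and the best profit so far rather than tracking a separate global maximum.

```cpp
#include <string>

struct best_trade
{
   std::string buy_date;
   std::string sell_date;
   double      buy_price;
   double      sell_price;
   double      profit;      // remains 0 when no profitable pair exists
};

class trade_tracker
{
public:
   trade_tracker() : have_low_(false), low_price_(0.0), best_() {}

   // Feed one (date, close price) observation, in chronological order.
   void update(const std::string& date, double price)
   {
      if (!have_low_ || price < low_price_)
      {
         // A new global minimum marks the start of a new candidate period.
         have_low_  = true;
         low_date_  = date;
         low_price_ = price;
      }
      else if (price - low_price_ > best_.profit)
      {
         // Selling now beats the best period seen so far.
         best_.buy_date   = low_date_;
         best_.sell_date  = date;
         best_.buy_price  = low_price_;
         best_.sell_price = price;
         best_.profit     = price - low_price_;
      }
   }

   bool profitable() const { return best_.profit > 0.0; }
   const best_trade& best() const { return best_; }

private:
   bool        have_low_;
   std::string low_date_;
   double      low_price_;
   best_trade  best_;
};
```

Each observation is handled in constant time and only the current minimum and the best period are retained, which gives the O(N) time and O(1) space behaviour discussed earlier.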

We want to process independent sections of the time series concurrently so as to speed up overall processing time.

The prices could be extremely large or small?

Token Grid

StrTk provides a means to easily parse and consume 2D grids of tokens
in an efficient and simple manner. A grid is simply defined as a
series of rows comprised of tokens, otherwise known as Delimiter
Separated Values (DSV). The ith token of a row is grouped with
every ith token of all other rows to produce a column. Tokens can be
processed as either rows or columns.

An example of a simple token grid, where each numeric value is deemed
to be a token:

1.1  2.2  3.3  4.4  5.5
1.1  2.2  3.3  4.4  5.5
1.1  2.2  3.3  4.4  5.5
1.1  2.2  3.3  4.4  5.5
1.1  2.2  3.3  4.4  5.5

A token grid can be passed in via either a file or a string buffer.
The delimiters to be used for parsing the columns and rows can also
be provided; if they are not provided, standard common defaults will be used.

The following demonstrates how each cell in the grid can be accessed
and cast to a specific type.

The strtk::token_grid provides various helper functions
for traversing rows and columns in batch mode, namely:
for_each_row, which is used for iterating over either all or a sub-range
of rows of the token_grid, and for_each_column, which is used
for iterating over either all or a sub-range of columns of a row.

Processing Of Comma Separated Values Data

The original intent of the token grid was to support fast and
efficient processing of simple data tuples, such as comma separated
values (CSV) and similar formats. The following example demonstrates a
simple summation of traded floor volume and average daily volume based
on NASDAQ OHLC (Open-High-Low-Close) data.

The strtk::token_grid is thread-safe provided that only read
operations are in play. As such the above calls to accumulate_column et al. can
all be safely and easily executed concurrently using threads. This
allows for a far more efficient data processing methodology.

TIMTOWTDI

Playing devil's advocate, there is another way of performing the above
processing task. Assuming only the specific values for computing the
ADV are required and no further processing of the CSV data is
needed, the problem can be solved efficiently by utilizing a
single pass of the data as follows:

TIMTOWTDI - II (with a vengeance)

Playing the devil's other advocate, the above two examples have both
required that the filter condition be explicitly defined at compile
time. Even though the condition may be set in stone at
compile time, some of the underlyings (such as symbol) can be
engineered to be modified at run-time. That still doesn't give us the
freedom to perform arbitrarily complex filter expressions determined
at run-time.

That said, an extremely efficient and very simple solution is at hand. The solution
is called the C++ DSV Filter library, and it is based on the StrTk and ExprTk libraries. It
uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as
the underlying expression evaluation engine. The example below takes the OHLC market
data table defined above and performs a row-wise query. The expression's definition
is:

select all rows where the open price is greater than the close price and the
symbol matches the wild-card pattern of "*FT*" and the date is equal to or
after 20090101.

It should be noted that in the example above, the rows begin at index 1.
That is done because the dsv_filter expects the first row, the
row at index 0, to be a column definition header. The format of the column
definitions is to simply add a suffix of "_s" if the values in the column
are to be treated as strings or "_n" if they are to be treated as numbers.
When column names are used in expressions, the suffixes should not be
included. The above mentioned OHLC CSV file's header would be as
follows:

Date_s,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n

C++ DSV Filter and Dependencies

Sequential Partitions

A typical operation carried out upon time-series data is to group
tuples into buckets (or bins) based upon the time index
value, for example grouping data into 2-minute buckets and then
performing some kind of operation upon the grouped tuples, such as a
summation or an average. This process is sometimes also called
"discretization".

The strtk::token_grid class provides a method known as
sequential_partition. The sequential_partition
method requires a Transition Predicate, a Function
and optionally a row-range. The Transition Predicate
consumes a row and returns true only if the row is
in the next partition. All subsequent consecutive rows, until
the Transition Predicate next returns true, are said to be
in the current partition. Prior to transitioning to a new partition,
the Function is invoked and provided with the range of
rows in the current partition.

The following example takes a simple time-series (time and
value), partitions the tuples into groups of Time-Buckets
of period length 3 and then proceeds to compute the total sum of each
group. The summarizer class below provides a
Transition Predicate and Function.
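To make the roles of the Transition Predicate and the Function concrete, the following is a minimal sketch of the same idea in standard C++, operating on a plain vector rather than a token_grid. The sample struct, the free function sequential_partition and the helper bucket_sums are all hypothetical names for this illustration and do not reflect the StrTk interface.

```cpp
#include <vector>

struct sample { unsigned time; double value; };

// Calls function(first, last) for each run [first, last) of consecutive
// rows. A run ends just before a row for which transition(row) is true.
template <typename Iterator, typename TransitionPredicate, typename Function>
void sequential_partition(Iterator begin, Iterator end,
                          TransitionPredicate transition, Function function)
{
   Iterator first = begin;
   for (Iterator it = begin; it != end; ++it)
   {
      if (transition(*it) && it != first)
      {
         function(first, it);   // close the current partition
         first = it;            // the row starts the next partition
      }
   }
   if (first != end)
      function(first, end);     // final partition
}

// Sums the values of each time bucket of length 'period'.
std::vector<double> bucket_sums(const std::vector<sample>& rows, unsigned period)
{
   std::vector<double> sums;
   if (rows.empty() || period == 0) return sums;

   // Exclusive upper bound of the bucket holding the first row.
   unsigned next = (rows.front().time / period + 1) * period;

   auto transition = [&](const sample& row) -> bool
   {
      if (row.time < next) return false;
      next = (row.time / period + 1) * period;
      return true;
   };

   auto sum_range = [&](std::vector<sample>::const_iterator first,
                        std::vector<sample>::const_iterator last)
   {
      double total = 0.0;
      for (; first != last; ++first) total += first->value;
      sums.push_back(total);
   };

   sequential_partition(rows.begin(), rows.end(), transition, sum_range);
   return sums;
}
```

Note that the predicate carries state (the end of the current bucket), which is why it is captured by reference here.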

Parsing CSV Files With Embedded Double-Quotes

One of the simple extensions to the CSV format is the concept of
double quoted tokens. Such tokens may contain column or row
delimiters. When such a scenario is encountered, all subsequent
delimiters are ignored, and kept as part of the token, until the
corresponding closing double quote is encountered. The StrTk
token_grid supports the parsing of such tokens. This
parsing mode can be easily activated via the token_grid
option set. Below is an example of a token_grid loading a
CSV data set representing various airports from around the world and
their specific codes and locations, in which some of the cells are
double-quoted:

| ICAO | IATA | Airport | City | Country |
|------|------|---------|------|---------|
| AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
| BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
| BZGD | ZGD | Godley | Auckland | New Zealand |
| CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
| EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
| FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
| FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
| KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
| UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
| WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
| ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |

The following is the StrTk code example using token_grid to parse the above CSV data set:
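Independently of StrTk, the quoting rule itself can be sketched in a few lines of standard C++. The function split_quoted_row is a hypothetical helper written for this illustration. It keeps the double quotes as part of the token, which is a choice of the sketch and not a statement about token_grid behaviour.

```cpp
#include <string>
#include <vector>

// Splits one row on 'delimiter'. Delimiters that appear between a pair
// of double quotes are treated as ordinary token characters.
std::vector<std::string> split_quoted_row(const std::string& row,
                                          char delimiter = ',')
{
   std::vector<std::string> tokens;
   std::string token;
   bool in_quotes = false;

   for (char c : row)
   {
      if (c == '"')
      {
         in_quotes = !in_quotes;   // opening or closing quote
         token += c;
      }
      else if (c == delimiter && !in_quotes)
      {
         tokens.push_back(token);  // delimiter outside quotes ends the token
         token.clear();
      }
      else
         token += c;
   }

   tokens.push_back(token);        // last token of the row
   return tokens;
}
```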

Extending Delimiter Predicates

As previously mentioned, the concept of a delimiter based predicate
can lead to some very interesting solutions. A predicate, as
defined so far, has been a stateless entity, with the exception of
the offset predicate. Adding the ability to maintain a state based on
what the predicate has encountered so far can allow it to behave
differently from the simple single and multiple delimiter predicates.

For this example, let's assume a typical command line parsing problem
which consists of double quotation mark groupings and escapable
special characters, which can be considered dual use as either
delimiters or data. An example input and output is as follows:

Inputs
  Data:       abc;"123, mno xyz",i\,jk
  Delimiters: <space>;,.

Output Tokens
  Token0: abc
  Token1: 123, mno xyz
  Token2: i\,jk

In order to tokenize the above described string, one can create a
composite predicate using a multiple char delimiter predicate and some
simple state rules. The following is an example of such an extended
predicate:
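The idea of a predicate that carries state can be sketched in standard C++ as shown here. Both quoted_escaped_delimiter and tokenize are hypothetical names for this illustration and are not the StrTk predicate interface. Stripping the enclosing quotes is done in tokenize so that the result matches the output tokens listed above.

```cpp
#include <string>
#include <vector>

// Delimiter predicate that remembers whether it is inside a quoted
// section and whether the previous character was an escape.
class quoted_escaped_delimiter
{
public:
   explicit quoted_escaped_delimiter(const std::string& delimiters)
   : delimiters_(delimiters), in_quote_(false), escaped_(false) {}

   // Returns true when 'c' acts as a delimiter in the current state.
   bool operator()(char c)
   {
      if (escaped_)  { escaped_ = false;       return false; } // escaped char is data
      if (c == '\\') { escaped_ = true;        return false; }
      if (c == '"')  { in_quote_ = !in_quote_; return false; }
      if (in_quote_)                           return false;   // quoted char is data
      return delimiters_.find(c) != std::string::npos;
   }

private:
   std::string delimiters_;
   bool in_quote_;
   bool escaped_;
};

// Splits 'data' wherever the predicate reports a delimiter, removing one
// pair of enclosing double quotes from each token.
template <typename Predicate>
std::vector<std::string> tokenize(const std::string& data, Predicate is_delimiter)
{
   std::vector<std::string> tokens;
   std::string token;

   auto flush = [&]()
   {
      if (token.size() >= 2 && token.front() == '"' && token.back() == '"')
         token = token.substr(1, token.size() - 2);
      tokens.push_back(token);
      token.clear();
   };

   for (char c : data)
   {
      if (is_delimiter(c)) flush();
      else                 token += c;
   }
   flush();

   return tokens;
}
```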

High Performance Key-Value Parsing

Taking our previous person struct as an example, it
is clear that the tuple format has to be very specific with
regards to the ordering of data. For most situations this is
acceptable, as the serializers and deserializers attempt to
function in the simplest manner possible. However, sometimes
pieces of information may come in different orderings, or may be
deemed optional and hence not be present in a particular
tuple of data.

An approach is required that provides the means of mapping
specific pieces of data to the corresponding variables (or
members) that will store or make sense of them, in a very
efficient and simple manner. For clarity purposes, the term
"mapping" here means to populate the desired member with the
given data. The association between the two is made with the
key that is paired in-line with the data. Key-Value pairs,
sometimes known as Attribute-Value pairs, are one of the means
by which such an association can be accomplished. The following
is a diagram that demonstrates the mapping of various fields
of a data struct to their corresponding data elements in a tuple
comprised of key-value pairs.

In the diagram above, the key-value pairs are separated (delimited)
by the pipe symbol "|". With regards to the key-value pairs
themselves, the key is traditionally the first element in the pair
and the value is the second element. They are separated in this case
by a single equal sign "=".

Back to our original problem relating to the person struct.
If we were to add keys to each of the pieces of data, we could then not
only parse a tuple of data representing the various fields of the person
struct, but those fields could be in any order within the tuple. Furthermore,
any of the fields could be deemed optional, and hence not necessarily be present
in the tuple. It should be noted that what denotes a correct or successful
parse of a tuple may not only depend upon successful parsing of ranges into the
various types, but may also depend upon the mandatory presence of certain
fields.

Our objective will be to successfully and in the most efficient way
possible parse the following list of tuples that represent instances
of our person struct.

StrTk provides a means to achieve the above key-value pair parsing task, namely
via the strtk::keyvalue::parser and associated key-mappers. The
following is a table that depicts the various kinds of key to value mappers that
are available in the StrTk library:

Key To Value Mappers

| Mapper | Type | Key Lookup Complexity | Maximum Size |
|--------|------|-----------------------|--------------|
| strtk::keyvalue::stringkey_map | std::string | O(log(n)) | Limited to available memory |
| strtk::keyvalue::uintkey_map | cardinal value | O(1) | Limited to expected maximum key value |

In the code below we initially begin by defining the delimiters we expect to see
between pairs of key-values and in between the key and value pairs themselves. In
the example below the pair_block_delimiter field denotes the delimiter
we expect between pairs (or blocks) of key-values and the field pair_delimiter
denotes the delimiter we expect between a particular key and value.

Next we define p as an instance of the person struct and
register each of the members of the instance p with a corresponding
key with the keyvalue::parser. We then process
each tuple of data by parsing the tuple, populating the instance p, then
printing out the various fields to stdout.
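The register-then-parse flow described above can be illustrated with a small standard C++ sketch. It is not the strtk::keyvalue::parser. The class keyvalue_sketch, its member names and the fields of the stand-in person struct are assumptions made for this example, and conversions use std::istringstream for brevity rather than speed.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>

struct person            // stand-in for the article's person struct
{
   unsigned    id;
   std::string name;
   unsigned    age;
   double      height;
};

class keyvalue_sketch
{
public:
   keyvalue_sketch(char pair_block_delimiter, char pair_delimiter)
   : block_delimiter_(pair_block_delimiter), pair_delimiter_(pair_delimiter) {}

   // Associates 'key' with the variable that will receive its value.
   template <typename T>
   void register_key(const std::string& key, T& target)
   {
      setters_[key] = [&target](const std::string& text)
      {
         std::istringstream stream(text);
         return !(stream >> target).fail();
      };
   }

   // Strings are copied whole so that embedded spaces are preserved.
   void register_key(const std::string& key, std::string& target)
   {
      setters_[key] = [&target](const std::string& text)
      {
         target = text;
         return true;
      };
   }

   // Parses e.g. "id=7|name=Alice". Fails on unknown keys or bad values.
   bool parse(const std::string& tuple) const
   {
      std::size_t begin = 0;
      while (begin <= tuple.size())
      {
         std::size_t end = tuple.find(block_delimiter_, begin);
         if (end == std::string::npos) end = tuple.size();
         const std::string pair = tuple.substr(begin, end - begin);
         begin = end + 1;
         if (pair.empty()) continue;

         const std::size_t split = pair.find(pair_delimiter_);
         if (split == std::string::npos) return false;

         const auto it = setters_.find(pair.substr(0, split));
         if (it == setters_.end()) return false;
         if (!it->second(pair.substr(split + 1))) return false;
      }
      return true;
   }

private:
   char block_delimiter_;
   char pair_delimiter_;
   std::map<std::string, std::function<bool(const std::string&)>> setters_;
};
```

Because lookup is by key, the pairs may appear in any order, and a pair that is absent simply leaves its registered member untouched.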

It should be noted that the underlying associative container for the
strtk::keyvalue::stringkey_map can be explicitly specified
at compile time. By default it is std::map, however it can
be easily changed to std::unordered_map giving it a key
lookup complexity of O(1) or in fact any other container that is
compatible with STL associative map semantics. The following are three
examples of keyvalue_parsers specialized using std::unordered_map,
the first being based on the std::string type, the second
being based on the int type and the third being based on
the double type:

Key-Value Pairs And Multiple Distinct Data Structures

To further the previous example, one might have a situation where
a tuple contains values for different members of different types.
The following example demonstrates parsing of key-value pairs that
map to members of multiple types. The tuple in the example consists
of values some of which are intended for the instance of data1 and
others which are intended for the instance of data2.

Key-Value Pairs And Lists

The values in the key-value pairs need not always be singular. In some
scenarios, a value could be a list of values. The StrTk key-value parser
supports parsing of such key-value pairs through the previously demonstrated
sequence sink mechanisms.

In the following example we have a complex_data struct that
consists of some PODs but also a number of sequences (vector, deque, list).
The tuple to be parsed consists of simple key-value pairs for the PODs and
more complex looking pairs where the values for the specific sequence
are separated by commas ",".

Key-Value Pairs With Cardinal Keys

So far all the examples have assumed keys of arbitrary values. In
some situations such as the FIX protocol,
the keys are always guaranteed to be positive integer values. If
this is the case then a different kind of key-mapper can be used
that is much more efficient than the general purpose string key-mapper,
providing a key lookup complexity of O(1). As such StrTk provides
the strtk::keyvalue::uintkey_map type for this purpose.
The only difference in terms of setting up the uintkey_map
from the stringkey_map is that it requires a key_count
to be set. This value represents the largest possible key value that
can exist. The following is an example of how the strtk::keyvalue::uintkey_map
key-mapper can be used.
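The reason a cardinal key allows O(1) lookup is that the key can be used directly as an index into a pre-sized table. The following standard C++ sketch shows that idea only. The class uint_key_table and its members are hypothetical and are not the uintkey_map interface.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class uint_key_table
{
public:
   typedef std::function<bool(const std::string&)> handler_type;

   // 'key_count' is the largest key value that may be registered.
   explicit uint_key_table(std::size_t key_count)
   : handlers_(key_count + 1) {}

   // Returns false when 'key' exceeds the configured maximum.
   bool register_key(std::size_t key, handler_type handler)
   {
      if (key >= handlers_.size()) return false;
      handlers_[key] = std::move(handler);
      return true;
   }

   // O(1): the key is the index, no search or hashing is required.
   bool dispatch(std::size_t key, const std::string& value) const
   {
      if (key >= handlers_.size() || !handlers_[key]) return false;
      return handlers_[key](value);
   }

private:
   std::vector<handler_type> handlers_;
};
```

The trade-off is memory: the table holds one slot per possible key value, whether or not that key is ever registered.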

Optional Key-Value Pairs

As previously mentioned, there is a use-case where certain fields
may be deemed optional and hence their absence would not constitute a
parsing error. That said, it would also be beneficial to know if a
particular field has been populated or not once the sequence of key
value pairs has been completely parsed and mapped. StrTk provides such
functionality through the use of the strtk::util::attribute
type.

The strtk::util::attribute acts as a proxy for the underlying
type which it is specialized upon. It provides a conversion cast to the
underlying type, and also maintains an 'initialised' state value that can
be used to query the attribute about the underlying type's initialised
status. The following is an example of how one could use the
strtk::util::attribute type in conjunction with the
keyvalue::parser:
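The proxy idea can be illustrated with a few lines of standard C++. The class optional_field below is a hypothetical stand-in written for this sketch. Its member names are assumptions and should not be read as the interface of strtk::util::attribute.

```cpp
// Wraps a value of type T and remembers whether it has been assigned.
template <typename T>
class optional_field
{
public:
   optional_field() : value_(), initialised_(false) {}

   // Assignment from the underlying type marks the field as populated.
   optional_field& operator=(const T& value)
   {
      value_       = value;
      initialised_ = true;
      return *this;
   }

   // Conversion cast to the underlying type.
   operator const T&() const { return value_; }

   bool initialised() const { return initialised_; }

   // Returns the field to its unpopulated state, e.g. before the next tuple.
   void reset()
   {
      value_       = T();
      initialised_ = false;
   }

private:
   T    value_;
   bool initialised_;
};
```

After a tuple has been parsed, the caller can query initialised() on each optional member to decide whether a default should be applied.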

Semantic Actions with Key-Value Pairs

There may be times when, as key-value pairs are being parsed, certain
actions need to be executed or behaviours exhibited in-situ with the
parsing process. As previously mentioned, StrTk provides the type
semantic_action that can act as a proxy for a generic
type during the parsing process, and that also takes a functor or lambda
and executes it at the conversion call. The following example
demonstrates the parsing of an array of tuples comprised of key-value
pairs that map to members of a struct, namely data_store.
The keys 111, 222 and 333 each represent a specific value type. They
also require a certain behaviour to be exhibited. In this example, for
simplicity, as the values of the various keys are being parsed, a
simple message will be printed to the console denoting the nature of
the parsing process. The code is as follows:

Note: Semantic actions during the parsing
process have a multitude of uses, some of which are: validation of parsed
values, that is making sure that they're in a specified range or within a
predefined set of values; complex parsed value manipulations; and invoking
external state-machines to transition to new states based on the parsed
value or even simply the presence of a key.

An Attempt At Improving File Processing Performance

In a number of examples from above, a call to strtk::for_each_line
was invoked. The routine is intended to provide a simple wrapper around
the error-prone boilerplate code that is required when processing a
text file line by line.

The idea behind strtk::for_each_line is that the user
provides a lambda/functor that accepts a std::string. The routine will
in turn invoke the lambda, passing the current line it has read from
the file. All other details related to opening the file, reading
lines from the file and the handling of errors are transparent to
the user specified lambda.

For most situations this is perfectly fine. However, for certain
types of high performance data processing, the overhead incurred
from std::ifstream (virtual method calls on a per
char basis, multiple copies due to back buffers etc) via the
invocation of std::getline results in an unacceptable
performance profile.

Due to the simple nature of the I/O access pattern (forward only)
utilized within strtk::for_each_line, a memory mapped
file approach would be far more efficient, being roughly 5x-7x
faster than the standard strtk::for_each_line routine. The following
is a simple example of code that utilises the Boost.IOStreams mmap
facility namely: mapped_file_source for mmap'ing
the user specified file in conjunction with strtk::split
to delimit on line boundaries that are denoted by new-line characters.

Note: The following are some points to consider with the solution
presented above:

Because the routine attempts to map the entire file, on 32-bit
systems this may fail when the file is very large, due to the
limited availability of virtual address space. This can be easily
remedied by windowing the mmap access over the file.

If the system is under heavy load memory-wise, the performance
observed is expected to be similar to but no worse than if the
original strtk::for_each_line routine were used.

The interface with the lambda can be improved to instead only
pass a range rather than to populate and pass a std::string
as is done when using std::getline. This improvement
results in a roughly 5% increase in performance.

An interesting side effect of the range based approach is
that the ranges denote the exact offsets from the original
file for the begin and end positions of the string currently
being processed. This can be very useful in situations where
the exact location of a token is required post processing -
as is the case with the problem presented earlier:
"Pick A Random Line From A Text File"

'for_each_line' via MMap Benchmark

The following benchmark demonstrates all three variations of the
strtk::for_each_line routine, which include normal,
mmap+std::string and mmap+ranges. The example
iterates through a file of roughly 825MB in size, comprised of
two types of CSV lines, each containing 100 tokens, where some
of the tokens are to be parsed as integers. Each line is parsed,
the integers extracted as necessary and then accumulated into
a final sum so as to prevent aggressive optimizations from eliding the work.

The following is a set of results derived from a run of the above
benchmark, based on a build using GCC 4.8, with O2 and PGO, running on
a Xeon X5650 3.2GHz 64GB RAM, kernel 3.11. The base measure is lines
processed per second. The greater the value the higher
the performance.

| Method | Lines/sec |
|--------|-----------|
| Standard | 126817 |
| MMap | 710317 |
| MMap+Range | 735317 |

The Letters Game

In the popular TV game show Countdown
(aka Letters and Numbers), contestants in the Letters round take turns
choosing letters from either a vowel or consonant bin. Typically up to
9 letters are chosen, after which the contestants are given a certain
amount of time (usually 30 seconds) to find the longest 'valid'
English word made up of only the letters that had been chosen. The
contestant with the longest valid word wins the round. What defines
a valid word is usually specific to the version of the show. Some examples
are using the OED or the Macquarie dictionary, coupled together with rules
related to proper nouns, plurals and combination words. The Letters round
is essentially an anagram solving challenge.

The Letters game can be generally defined as: Given a canonical set of
substrings of varying length called C, generated from the alphabet A,
and a subset of not necessarily unique elements derived from A called
D, find the longest substring in the set C that is also in the set of
the 2^|D| unique combinations generated from the set D.
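A brute force reading of this definition can be sketched in standard C++ as follows. This is an illustrative sketch, not the article's listing: the function longest_words is a hypothetical name, the dictionary is assumed to be an in-memory list of lowercase words, and each of the 2^|D| combinations is matched by comparing sorted letters.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns every longest word in 'dictionary' that can be spelled from a
// combination of 'letters', using each chosen letter at most once.
std::vector<std::string> longest_words(const std::vector<std::string>& dictionary,
                                       const std::string& letters)
{
   const std::size_t n = letters.size();
   if (n >= 64) return std::vector<std::string>();  // mask would overflow

   // Index the dictionary by the sorted letters of each word.
   std::map<std::string, std::vector<std::string>> index;
   for (const std::string& word : dictionary)
   {
      std::string key = word;
      std::sort(key.begin(), key.end());
      index[key].push_back(word);
   }

   std::set<std::string> best;     // a set, as repeated letters repeat keys
   std::size_t best_length = 0;

   // Each bit pattern of 'mask' selects one combination of the letters.
   for (unsigned long long mask = 1; mask < (1ULL << n); ++mask)
   {
      std::string key;
      for (std::size_t i = 0; i < n; ++i)
         if (mask & (1ULL << i)) key += letters[i];

      if (key.size() < best_length) continue;   // cannot improve the result

      std::sort(key.begin(), key.end());
      const auto it = index.find(key);
      if (it == index.end()) continue;

      if (key.size() > best_length)
      {
         best.clear();
         best_length = key.size();
      }
      best.insert(it->second.begin(), it->second.end());
   }

   return std::vector<std::string>(best.begin(), best.end());
}
```

For the nine letters of a typical round this enumerates 511 non-empty combinations, each resolved with a single map lookup.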

Notes On The Letters Round Solution

The time complexity of the given solution is O(2^min(L,N)),
which is quite large. What makes the solution practical is the fact
that natural languages such as English tend to have short common words
derived from relatively small alphabets, with an upper length of
about 10-12 characters (excluding names et al. and of course
Pneumonoultramicroscopicsilicovolcanoconiosis). So for example an
N of 10 or even 20 (assuming L is adequately large) will only amount
to a total of 1024 and 1048576 unique combinations respectively.
Furthermore, both search spaces can be trivially enumerated using brute
force in a mere fraction of a millisecond using modern hardware.

However, the problem space becomes daunting at an N of around 64 and larger, at which
point a constant multiplier can be applied by distributing the enumeration
process and performing said computations concurrently. Note that this will not
reduce the overall complexity of the solution, just the time it will take
to complete. Furthermore, today this technique may only be practical for
values of N less than 67.

One final note: the above process will not only provide the first
solution it encounters, it will return all possible solutions for
the largest encountered length combination.

Past Letters Round Games

The following is a short list of the Letters round games played on Countdown
during the 2010 season.

Performance Comparisons

The following are tables of results generated by running the
strtk_tokenizer_cmp test. Currently it covers simple
comparisons between Boost String Algorithms, Boost lexical_cast, The
Standard Library, Spirit (Karma, Qi) and StrTk in the following areas:

Tokenization

Splitting

Integer To String

String To Integer

String To Double

Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)

| Source | Test | Size | Time(sec) | Rate | % from Baseline |
|--------|------|------|-----------|------|-----------------|
| Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
| StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
| Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
| StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
| sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
| Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
| Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
| StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
| atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
| Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
| Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
| StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
| atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
| Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
| Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
| StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |

Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)

| Source | Test | Size | Time(sec) | Rate | % from Baseline |
|--------|------|------|-----------|------|-----------------|
| Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
| StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
| Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
| StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
| sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
| Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
| Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
| StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
| atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
| Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
| Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
| StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
| atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
| Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
| Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
| StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |

Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)

| Source | Test | Size | Time(sec) | Rate | % from Baseline |
|--------|------|------|-----------|------|-----------------|
| Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
| StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
| Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
| StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
| sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
| Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
| Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
| StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
| atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
| Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
| Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
| StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
| atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
| Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
| Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
| StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |

Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)

| Source | Test | Size | Time(sec) | Rate | % from Baseline |
|--------|------|------|-----------|------|-----------------|
| Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
| StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
| Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
| StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
| sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
| Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
| Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
| StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
| atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
| Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
| Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
| StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
| atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
| Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
| Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
| StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |

Scenario 4 - GCC 4.5 (O3, PGO)

| Source | Test | Size | Time(sec) | Rate | % from Baseline |
|--------|------|------|-----------|------|-----------------|
| Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
| StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
| Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
| StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
| sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
| Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
| Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
| StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
| atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
| Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
| Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
| StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
| atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
| Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
| Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
| StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |

Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450

Source | Test | Size | Time(sec) | Rate | % from Baseline
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0%
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7%
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0%
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5%
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0%
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2%
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9%
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0%
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0%
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2%
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1%
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9%
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0%
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3%
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6%
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6%

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon E5540

Source | Test | Size | Time(sec) | Rate | % from Baseline
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0%
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3%
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0%
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2%
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0%
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3%
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0%
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8%
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0%
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2%
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0%
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5%
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0%
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1%
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1%
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1%

Scenario 7 - GCC 4.5 (O3, PGO) Intel Xeon X5650

Source | Test | Size | Time(sec) | Rate | % from Baseline
Boost | Tokenizer | 24000000 | 4.1944sec | 5721901.2924tks/sec | 100.00%, 100.00%
StrTk | Tokenizer | 24000000 | 2.5087sec | 9566860.3956tks/sec | 59.81%, 167.19%
Boost | Split | 9600000 | 2.9104sec | 3298520.2014tks/sec | 100.00%, 100.00%
StrTk | Split | 9600000 | 1.1105sec | 8644949.2334tks/sec | 38.15%, 262.08%
sprintf | Integer To String | 80000000 | 9.7148sec | 8234840.3537nums/sec | 100.00%, 100.00%
Boost | Integer To String | 80000000 | 11.5726sec | 6912860.7120nums/sec | 119.12%, 83.94%
Karma | Integer To String | 80000000 | 4.0832sec | 19592620.4395nums/sec | 42.03%, 237.92%
StrTk | Integer To String | 80000000 | 2.2204sec | 36029641.5861nums/sec | 22.85%, 437.52%
atoi | String To Integer | 88500000 | 3.1028sec | 28522836.1561nums/sec | 100.00%, 100.00%
Boost | String To Integer | 88500000 | 6.1270sec | 14444145.2249nums/sec | 197.46%, 50.64%
Qi | String To Integer | 88500000 | 1.5313sec | 57794031.2153nums/sec | 49.35%, 202.62%
StrTk | String To Integer | 88500000 | 1.4409sec | 61417814.6362nums/sec | 46.43%, 215.32%
atof | String To Double | 42880000 | 6.1342sec | 6990292.6549nums/sec | 100.00%, 100.00%
Boost | String To Double | 42880000 | 28.9461sec | 1481374.9232nums/sec | 772.37%, 21.19%
Qi | String To Double | 42880000 | 3.6549sec | 11732137.3557nums/sec | 59.58%, 167.83%
StrTk | String To Double | 42880000 | 1.3860sec | 30937683.0792nums/sec | 22.59%, 442.58%

Note 1: The tests are compiled with specific optimisation
flags to produce the best possible results for the respective
compilers and architectures. Furthermore the tests are run
natively (no virtualization was used) on an almost completely idle
machine so as to reduce interference from background processes. The
Boost version used was 1.55. In addition, the standard libraries,
including libc, were rebuilt for the Linux based tests using
architecture specific flags and optimizations. The following is a
table mapping the scenarios to their respective architectures:

Note 2:
The percentages in the final column represent the percentage of the
current row versus the baseline in total running time and rate
respectively. For the first percentage the lower the value the better
and for the second percentage the higher the value the better. The
baseline used for a specific combination of tests is defined in the
following table:

Test Combination | Baseline
Boost, StrTk | Boost
Boost, StdLib/STL, Spirit, StrTk | StdLib/STL
StdLib/STL, Spirit, StrTk | StdLib/STL
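
As an illustration of Note 2, the two percentages of a row can be
reproduced from its time and rate and those of its baseline. The
following is a minimal sketch; the helper names are hypothetical and
are not part of StrTk or of the benchmark sources. The figures are
taken from the Tokenizer rows of Scenario 3, where Boost is the
baseline.

```cpp
#include <cstdio>

// Hypothetical helpers for illustration only.

// Running time of a row as a percentage of the baseline's running time.
// Lower is better.
inline double time_percentage(double row_seconds, double baseline_seconds)
{
   return 100.0 * row_seconds / baseline_seconds;
}

// Rate of a row as a percentage of the baseline's rate.
// Higher is better.
inline double rate_percentage(double row_rate, double baseline_rate)
{
   return 100.0 * row_rate / baseline_rate;
}

// Scenario 3, Tokenizer test: StrTk row measured against the Boost baseline.
inline void print_scenario3_tokenizer_row()
{
   std::printf("%.1f%%, %.1f%%\n",
               time_percentage(3.1837, 10.0096),
               rate_percentage(7538416.8541, 2397697.7836));
}
```

Calling print_scenario3_tokenizer_row() should print 31.8%, 314.4%,
which is the value listed for that row in the Scenario 3 table.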

Note 3:
The test sizes are set such that no single run will result in a
running time of less than one second. This is done to ensure that
the per-second rates are actually measured, rather than projected
from a run shorter than one second. In the future these sizes may
need to be revisited once 3.5+GHz CPU speeds become more commonplace.
Furthermore the charts represent the rate of operation over a one
second interval. In short, the larger the rate the better.

Note 4:
The binaries used for the above performance tests can be downloaded from
here

Note 5:
It would be great to have comparisons for other architectures. If you can
provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the
following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please
feel free to get in contact.

A Final Digression - Fast Integer To String Conversion

The task of converting an integer to a std::string
is quite common and rather trivial. It is mainly used in serialisation,
such as printing to stdout or to a file. The above benchmarks do
include timings for int to std::string conversions,
however let's see if there's more that can be done.

The following takes this very simple task to the next level. Based
on a series of solutions found on StackOverflow
and elsewhere, the following benchmark has been derived. Its objective
is simple: determine the fastest integer to std::string
conversion routine, written entirely in conforming C++. The results
are as follows:

Note: The following are details and also some points to consider with
regards to the benchmark presented above:

A total of nine conversion routines are employed, of varying
complexities, ranging from simple divide/mod and loop to
elaborate single pass bit-twiddle LUT based strategies.

A wide range of values within the int space are used. They consist
of sets of small and large random values, values near to and including
min/max of int and all values from -20000000 to +20000000. Furthermore
the various values are interleaved amongst each other during the main
loop for each conversion routine.

Prior to the benchmark beginning, a simple sanity check is run over
all the conversion routines to make sure each routine functions
correctly.

Before each conversion routine's benchmark begins, an attempt is made
to flush the I/D caches. This is done so as to avoid the previous
conversion routine's execution profile affecting the next routine.

It should be noted that the results presented were derived from running
the benchmark upon a processor which has a large L1/L2 cache. This is
important because when run on processors that have smaller caches or
none at all, the rankings change quite a bit. In fact the simple div/mod
and loop strategies will tend to outdo most of the LUT based strategies
under certain architectural conditions. This point is applicable not only
to processes running atop embedded and low power processors but also
to processes running within virtualized environments.

The conversion implementations found in boost::lexical_cast,
NumToA and Folly were found to be exceedingly lacking when it came
to performance and were hence left out.

StrTk Library Dependency

StrTk makes use of the Boost library for its
boost::lexical_cast routine for types other than
PODs, and its TR1 compliant Random and Regex libraries. These
dependencies are not compulsory and can easily be removed
by defining the preprocessor symbol strtk_no_tr1_or_boost. That
said, Boost is an integral part of modern C++ programming, and
having it around is as beneficial as having access to the STL,
hence it is recommended that it be installed. For Visual Studio
users, BoostPro provides a free and easy to use
installer for the latest Boost libraries that can be obtained
from here. For Linux users, mainstream
distributions such as Ubuntu and Red Hat (Fedora) provide easy
installation of the Boost libraries via their respective package
management systems. For more information please consult the
readme.txt found in the StrTk distribution.
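
As a minimal sketch of removing the Boost dependency, and assuming
that strtk.hpp is on the include path, the strtk_no_tr1_or_boost
definition mentioned above would be placed before the header is
included. The same definition can typically be passed on the compiler
command line instead, for example with the -D option of GCC.

```cpp
// Assumption: the definition must be seen before strtk.hpp is included.
#define strtk_no_tr1_or_boost
#include "strtk.hpp"
```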

Compiler Support

The following is a listing of the various compilers that StrTk can be
built with, free of errors and warnings.

GCC - versions 3.1+

Clang/LLVM - versions 1.0+

Intel C++ Compiler - versions 8+

MSVC - versions 7.1+

Comeau C/C++ - versions 4.3+

PGI C++ - versions 10.x+

IBM XL C/C++ - versions 10.x+

Note: Versions of compilers prior to the ones denoted above
"should" compile, however they may require a very lenient
warning/error level be set during compilation.

Conclusion

StrTk was designed with performance and efficiency as its primary
principles, and as such some of the available interfaces may not be as
user-friendly as they should be. That said, the gains made in other
areas will hopefully compensate for any perceived difficulties. Like
most things, there is a trade-off between performance and usability
with the above mentioned tokenizers and parsing methods. The original
aim was to provide an interface similar to that of the Boost Tokenizer
and Split routines, but to also provide the developer with abstractions
and various other simplifications that will hopefully give them more
flexibility and efficiency in the long run. Tokenizing a string isn't
the most fascinating problem one could tackle, but it does have its
interesting points: when one has a few TB of data to process, doing it
properly could mean the difference between finishing a simple data
processing job today or next month.

Comments and Discussions

What is the proper way to parse delimited hex values "0xA1|0xdf|0x23|...." into a vector of unsigned chars?
In strtk_examples parse_example01() when I change strtk::hex_to_number_sink to unsigned char it does not compile:
C2665: 'strtk::details::string_to_type_converter_impl' : none of the 20 overloads could convert all the argument types strtk.hpp 329 strtk_test

I'm not sure why you made the change. A comment can be found at strtk.hpp:13838 that should help explain the problem you are seeing.

Hex conversion with hex_to_number_sink is used in the person tuple part. I tried to adapt it for my task with little success: vc9 complained about something I was not able to solve quickly. I tried your samples and they compiled just fine, so I modified the input string in parse_example01 to hex separated bytes. It worked, but the sink used unsigned int and the result was byte reversed, which is why I tried to use unsigned char in this place.

Thanks for your answer. English is not my native language and it's very difficult for me to understand the texts. When I searched, the FAQ I found was unavailable.

I understand that if I use your code without modification, then I can add a note pointing to your site and then I would be compliant with the license. If somebody could tell me if this is correct I'd be very grateful.

My understanding of the CPL with regards to source code, in a general sense, is as follows:

1. You can use the code in commercial, academic and open source projects without the need to give back any changes you make, or any code that may use the code/library(derivatives).

2. You do not have to pay royalties of any kind.

3. It is not necessary to include a 'note' or similar in your project, though it would be appreciated. However the preamble found at the top of the strtk.hpp file must remain.

4. You can not re-license the library with another license that does not allow for or limits any of the above. For example you can not re-license it under GPL or LGPL but you can change it to actual/real open source licenses such as: BSD, MIT, Mozilla licenses etc.

I'm primarily using wide strings in my code and I often suffer from the lack of wchar_t support in StrTk. Right now I was going to join() an array of strings and ooops... only std::string version is there.

Thus I would like to contribute to the library by parameterizing functions with char type. Does that make sense to you? Will you review/accept 3rd party contributions to your great lib?

Now with regards to wchar_t, I've thought about it for some time too, and the main problem I see is that it has no standard definition, other than it has to be no less than char. On some systems it is 2 bytes (win32), on others it is 4 bytes, on some it is even 8 bytes and on certain embedded targets it is compiled as 1 byte. In short it makes writing efficient wchar_t routines really difficult. Furthermore a general Unicode solution would be great, however no one to this day has been able to develop a proper iterator algorithm that gives both efficient read and write access to an arbitrary Unicode stream.

The next problem with wchar_t and strtk::join is that join (and join-like routines) hooks into the underlying strtk type conversion system, which has been purpose built for speed. If a wchar_t or std::wstring interface were to be created it would dispatch to the default converter which uses stringstream, which is very very slow.

That said, most of the routines that take iterator pairs over a "string" like object and make no assumptions about the underlying type will work seamlessly with wchar_t, or any other type. Such routines include split, the tokenizer, the delimiter predicates, removes, replaces etc. In fact, there's an example in the article that tokenizes a string-type which is comprised of unsigned ints: