This library is routinely tested on Steel Bank CL, Clozure CL,
Embeddable CL and Armed Bear CL. Chances are really high that it will
work on other platforms without problems (check its status on
CL-TEST-GRID).

RATIONALE

Since the standard search function works fine, one might ask: why do
we need yet another implementation? The answer is simple: advanced
algorithms offer different benefits compared to the standard
implementation, which is based on the brute-force algorithm.

Benchmarks show that, depending on the environment and the usage
pattern, a Boyer-Moore-Horspool implementation can be almost 18 times
faster than the standard search function in SBCL. Check the code in
the bench folder for further details.

USAGE

CL-STRING-MATCH is supported by Quicklisp and is known by its system name:

(ql:quickload :cl-string-match)

CL-STRING-MATCH exports its functions from the cl-string-match package
(also nicknamed sm).

Shortcut functions search for a given pattern pat in the text txt. They
are usually much slower (because they build index structures every
time they are called) but are easier to use:

string-contains-brute pat txt - Brute-force

string-contains-bm pat txt - Boyer-Moore

string-contains-bmh pat txt - Boyer-Moore-Horspool

string-contains-kmp pat txt - Knuth-Morris-Pratt

string-contains-ac pat txt - Aho-Corasick

string-contains-rk pat txt - Rabin-Karp

A more robust approach is to use pre-calculated index data that is
processed by a pair of initialize and search functions:

initialize-bm pat and search-bm bm txt

initialize-bmh pat and search-bmh bm txt

initialize-bmh8 pat and search-bmh8 bm txt

initialize-rk pat and search-rk rk txt

initialize-kmp pat and search-kmp kmp txt

initialize-ac pat and search-ac ac txt. initialize-ac
can accept a list of patterns that are compiled into a trie.

The brute-force algorithm does not use pre-calculated data and has no
"initialize" function.

The Boyer-Moore-Horspool implementation (the -BMH and -BMH8 functions)
also accepts the :start2 and :end2 keywords for the "search" and
"contains" functions.

The following example looks for a given substring pat in a given line
of text txt using the Boyer-Moore-Horspool implementation:
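
A minimal sketch of such a search, using only the functions listed
above. It assumes that Quicklisp is loaded in the running image, that
the forms are evaluated one by one (at the REPL or with load), and that
the search functions return the index of the first match or NIL. The
sample strings and the *pat* and *txt* variables are made up for
illustration.

```lisp
;; Assumes Quicklisp is installed and loaded in the running image.
(ql:quickload :cl-string-match)

(defparameter *pat* "abc")            ; hypothetical sample pattern
(defparameter *txt* "ababcfbgsldkj")  ; hypothetical sample text

;; One-shot search: the index structure is rebuilt on every call.
(sm:string-contains-bmh *pat* *txt*)

;; Reusable index: build it once with initialize-bmh, then search
;; as many texts as needed with search-bmh.
(let ((idx (sm:initialize-bmh *pat*)))
  (list (sm:search-bmh idx *txt*)
        (sm:search-bmh idx "no match here")))
```

The second form is the one to prefer when the same pattern is searched
for in many lines, since the index is built only once.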

Note that the Boyer-Moore-Horspool (bmh) implementation can offer an
order-of-magnitude performance boost compared to the standard search
function.

However, some implementations create a "jump table" that can be the
size of the alphabet (CHAR-CODE-LIMIT is over 1M on implementations
supporting Unicode) and thus consume a significant chunk of
memory. There are different solutions to this problem; at the
moment a version for ASCII strings is offered: initialize-bmh8 pat
and search-bmh8 bm txt, as well as string-contains-bmh8 pat txt, work
for strings whose characters are inside the 256 char code limit.
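
To illustrate why limiting the character codes keeps the jump table
small, here is a rough, self-contained sketch of a Horspool-style
search with a fixed 256-entry table. It is not the library's code: the
names make-skip-table and bmh8-search-sketch are hypothetical, and the
sketch assumes every character in both strings has a char code
below 256.

```lisp
(defun make-skip-table (pat)
  "Return a 256-entry Horspool skip table for PAT.
Assumes every character of PAT has a char code below 256."
  (let* ((m (length pat))
         (table (make-array 256 :initial-element m)))
    ;; For every character except the last one, record its distance
    ;; to the end of the pattern. Later occurrences overwrite earlier ones.
    (dotimes (i (1- m) table)
      (setf (aref table (char-code (char pat i))) (- m 1 i)))))

(defun bmh8-search-sketch (pat txt)
  "Return the index of the first occurrence of PAT in TXT, or NIL.
Assumes every character of TXT has a char code below 256."
  (let ((m (length pat))
        (n (length txt))
        (table (make-skip-table pat)))
    (if (zerop m)
        0
        (loop with pos = 0
              while (<= (+ pos m) n)
              do (if (string= pat txt :start2 pos :end2 (+ pos m))
                     (return pos)
                     ;; Shift by the table entry of the text character
                     ;; aligned with the last pattern position.
                     (incf pos (aref table
                                     (char-code (char txt (+ pos m -1))))))))))
```

The table always has 256 entries regardless of CHAR-CODE-LIMIT, which
is the memory saving the -BMH8 variants aim for.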

CONTRIB

This project also contains code that is not directly involved with the
pattern search algorithms but nevertheless might be found useful for
text handling/processing. Check the contrib folder in the repository
for more details. Currently it contains:

ascii-strings.lisp aims to provide single-byte strings
functionality for Unicode-enabled Common Lisp implementations. Another
goal is to reduce the memory footprint and boost the performance of
string-processing tasks, e.g. read-line.

simple-scanf implements a subset of the original POSIX standard
scanf(3) function features.

TODO

The project still lacks some important features and is under constant
development. Any kind of contributions or feedback are welcome.

5.3 ascii-strings.system

5.4 ascii-strings

This library implements functions and data types
similar to the standard Common Lisp functions and types but prefixed
with ub- to avoid naming conflicts.

This package aims at providing single-byte strings functionality
for Unicode-enabled Common Lisp implementations. Another aim is to
reduce memory footprint and boost performance of the
string-processing algorithms.

There are similar libraries/packages with slight differences. Check,
for instance, com.informatimago.common-lisp.cesarum.ascii.

This package also provides a faster alternative to the standard
read-line function. A line reader is created by the
make-ub-line-reader function, a ub-string is read by
ub-read-line, and a standard line can be read by
ub-read-line-string.

Please note that while ASCII uses 7 bits per character, this library
works with octets, using 8 bits per character.
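
As an illustration of what a single-byte string looks like, the
following sketch converts between a standard string and a vector of
octets. The function names are hypothetical and are not part of this
package; the sketch also assumes that char-code and code-char agree
with ASCII for the characters involved, which holds on the common
implementations.

```lisp
(defun string-to-octets-sketch (string)
  "Return STRING as a vector of octets, one octet per character.
Assumes every character has a char code below 256."
  (map '(simple-array (unsigned-byte 8) (*)) #'char-code string))

(defun octets-to-string-sketch (octets)
  "Return a standard string built from a vector of OCTETS."
  (map 'string #'code-char octets))
```

Each element of the octet vector occupies 8 bits, whereas a character
in a Unicode-enabled implementation usually needs more.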

Returns the current position within the stream according to the
amount of information really read.

When the buffer caches more information than was really read by one of
the UB-READ-LINE functions, the standard FILE-POSITION function will
return the position of the buffer, which is larger than the position
that was read by the user.

The returned number can be used by the standard FILE-POSITION function
to adjust the position within a stream.

When the optional argument POSITION is supplied, the file position is
adjusted accordingly in the underlying stream. The buffer is flushed.

Reads data into the pre-allocated buffer in the READER structure
and returns two values: start and end positions of the line within the
buffer that can be used to extract this line contents from the
buffer.

Please note that, unlike the standard read-line or the
liberal-read-line by jasonmelbye, this function works with
Unix-style lines: a sequence of characters delimited by the Newline
character.
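
The idea of returning line bounds instead of a fresh string can be
sketched as follows. This is not the reader's actual code: the
function name next-line-bounds is hypothetical, the buffer is assumed
to be a vector of octets that already holds the data, and refilling
the buffer from a stream is left out.

```lisp
(defun next-line-bounds (buffer start)
  "Return two values: the start and end positions of the line that
begins at START in BUFFER. The end position excludes the Newline
octet (code 10); if no Newline follows, the line runs to the end of
the buffer."
  (let ((end (or (position 10 buffer :start start)
                 (length buffer))))
    (values start end)))
```

A caller can then extract the line with subseq, or process it in place
without allocating anything.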

Each node of a trie contains a list of child nodes, a label (the
letter) and a mark (some value attributed to the matching string).

The trie root node is like all other nodes, but its ‘ID‘ is used as an
increment to create ids for new nodes.

Slots:

* ‘id‘ - unique node identifier, root node has the largest id.
* ‘children‘ - a hash table with labels as keys, ‘trie-node‘ as
values.
* ‘mark‘ - output function, when not null marks the last character of a
keyword and is returned as the search result.
* ‘fail‘ - fail transition to another node.
* ‘depth‘ - number of nodes from the root node to this node.
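
The slot layout above can be sketched with a plain structure. This is
an illustration under stated assumptions, not the library's
definition: the names sketch-trie-node, stn- and sketch-trie-add are
hypothetical, and the fail transitions are left unset because
computing them is a separate step of the Aho-Corasick construction.

```lisp
(defstruct (sketch-trie-node (:conc-name stn-))
  (id 0 :type fixnum)                           ; unique node identifier
  (children (make-hash-table) :type hash-table) ; label -> child node
  (mark nil)                                    ; non-NIL at the end of a keyword
  (fail nil)                                    ; fail transition, not computed here
  (depth 0 :type fixnum))                       ; distance from the root

(defun sketch-trie-add (root keyword mark)
  "Insert KEYWORD into the trie rooted at ROOT and attach MARK to the
node of its last character. Return that node."
  (let ((node root))
    (loop for ch across keyword
          do (let ((child (gethash ch (stn-children node))))
               (unless child
                 ;; The root's id serves as the counter for fresh ids,
                 ;; so the root always holds the largest id.
                 (setf child (make-sketch-trie-node
                              :id (stn-id root)
                              :depth (1+ (stn-depth node))))
                 (incf (stn-id root))
                 (setf (gethash ch (stn-children node)) child))
               (setf node child)))
    (setf (stn-mark node) mark)
    node))
```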

The standard optimize settings used by most declaration
expressions. Tuned for best performance by default, but when the
SM-DEBUG-ENABLED keyword is present in the *FEATURES* list, makes
debug and safety its priority at the expense of everything else.
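
One way such a switch could be written is sketched below. The
variable name *sketch-optimize-settings*, the exact optimize qualities
and the sum-octets helper are assumptions made for illustration, not
the library's actual definitions.

```lisp
;; Defined inside eval-when so that the value is available when the
;; reader evaluates #. in later forms of the same file.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (defparameter *sketch-optimize-settings*
    (if (member :sm-debug-enabled *features*)
        '(optimize (speed 0) (space 0) (safety 3) (debug 3))
        '(optimize (speed 3) (space 0) (safety 0) (debug 0)))))

(defun sum-octets (octets)
  "Sum a vector of octets, compiled with the shared optimize settings."
  (declare #.*sketch-optimize-settings*
           (type (simple-array (unsigned-byte 8) (*)) octets))
  (let ((sum 0))
    (declare (fixnum sum))
    (loop for o across octets do (incf sum o))
    sum))
```

Pushing :sm-debug-enabled onto *FEATURES* before loading the code
switches every function that uses the declaration to the debug-friendly
settings.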

The branch function tests whether a position is the end point and
turns the implicit node into an explicit node if necessary. Because a
sentinel node is not used, the special case is handled in the first
if-clause.

Computes the hash function for an END-digit base-+ALPH-SIZE+ number
represented as a char array, in time proportional to END. (We pass END
as an argument so that we can use the function for both the pattern
and the text.)
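
The computation described here is Horner's method with a modulus, as
commonly used in Rabin-Karp. The sketch below is a stand-alone
illustration: *sketch-alph-size* stands in for +ALPH-SIZE+, and the
modulus 1000003 and the function name horner-hash are assumptions,
not values taken from the library.

```lisp
(defparameter *sketch-alph-size* 256)   ; stands in for +ALPH-SIZE+
(defparameter *sketch-modulus* 1000003) ; hypothetical prime modulus

(defun horner-hash (key end)
  "Hash the first END characters of KEY, treating them as the digits
of a base-*SKETCH-ALPH-SIZE* number reduced modulo *SKETCH-MODULUS*."
  (let ((h 0))
    (dotimes (i end h)
      (setf h (mod (+ (* h *sketch-alph-size*)
                      (char-code (char key i)))
                   *sketch-modulus*)))))
```

Because END is a parameter, the same function can hash a whole pattern
and the equally long prefix of a text, which is what the initial
comparison in Rabin-Karp needs.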