SRFI 13 (string library). Certain procedures contained in this SRFI, such as string-append, are identical to R5RS versions and are omitted from this document. For full documentation, see the original SRFI-13 document.

On systems that support dynamic loading, the srfi-13 unit can be made available in the Chicken interpreter (csi) by entering

(require-extension srfi-13)

The string-hash and string-hash-ci procedures are not provided in this library unit. Unit srfi-69 has compatible definitions.

Upper- and lower-casing characters is complex in super-ASCII encodings. SRFI 13 makes no attempt to deal with these issues; it uses a simple 1-1 locale- and context-independent case-mapping, specifically Unicode's 1-1 case-mappings given in ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.

On Chicken, case-mapping is restricted to operate on ASCII characters.

Chicken does not currently have shared-text substrings, nor does its implementation of SRFI 13 routines ever return one of the strings that was passed in as a parameter, as is allowed by the specification.

On the other hand, the library provides what is needed to write efficient code without shared-text substrings: instead of building a shared-text substring, pass around START/END ranges that index into the original string.

START and END parameters are half-open string indices specifying a substring within a string parameter; when optional, they default to 0 and the length of the string, respectively. When specified, it must be the case that 0 <= START <= END <= (string-length S), for the corresponding parameter S. They typically restrict a procedure's action to the indicated substring.
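As a small illustration of the half-open START/END convention, the following sketch calls string-index on different ranges of one string. It assumes the srfi-13 unit is loaded with require-extension as shown earlier; the sample string and the variable names are made up for the example.

```scheme
(require-extension srfi-13)

(define s "hello world")

;; No START/END: the whole string, indices [0, 11), is searched.
(define whole (string-index s #\o))      ; 4

;; START = 5: only s[5] through s[10] are examined.
(define tail (string-index s #\o 5))     ; 7

;; START = 0, END = 4: s[4] is excluded, so there is no match.
(define head (string-index s #\o 0 4))   ; #f
```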

A PRED parameter is a unary character predicate procedure, returning a true/false value when applied to a character.

A CHAR/CHAR-SET/PRED parameter is a value used to select/search for a character in a string. If it is a character, it is used in an equality test; if it is a character set, it is used as a membership test; if it is a procedure, it is applied to the characters as a test predicate.

An I parameter is an exact non-negative integer specifying an index into a string.

LEN and NCHARS parameters are exact non-negative integers specifying a length of a string or some number of characters.

An OBJ parameter may be any value at all.

Passing values to procedures with these parameters that do not satisfy these types is an error.

Parameters given in square brackets are optional. Unless otherwise noted in the text describing the procedure, any prefix of these optional parameters may be supplied, from zero arguments to the full list. When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well. So, for example, the procedure with signature

halts? F [X INIT-STORE] -> [BOOLEAN INTEGER]

would take one (F), two (F, X) or three (F, X, INIT-STORE) input parameters, and return two values, a boolean and an integer.

A parameter followed by "..." means zero-or-more elements. So the procedure with the signature

sum-squares X ... -> NUMBER

takes zero or more arguments (X ...), while the procedure with signature

spell-check DOC DICT_1 DICT_2 ... -> STRING-LIST

takes two required parameters (DOC and DICT_1) and zero or more optional parameters (DICT_2 ...).

If a procedure is said to return "unspecified," this means that nothing at all is said about what the procedure returns. Such a procedure is not even required to be consistent from call to call. It is simply required to return a value (or values) that may be passed to a command continuation, e.g. as the value of an expression appearing as a non-terminal subform of a begin expression. Note that in R5RS, this restricts such a procedure to returning a single value; non-R5RS systems may not even provide this restriction.

Checks to see if the given criterion is true of every / any character in S, proceeding from left (index START) to right (index END).

If CHAR/CHAR-SET/PRED is a character, it is tested for equality with the elements of S.

If CHAR/CHAR-SET/PRED is a character set, the elements of S are tested for membership in the set.

If CHAR/CHAR-SET/PRED is a predicate procedure, it is applied to the elements of S. The predicate is "witness-generating:"

If string-any returns true, the returned true value is the one produced by the application of the predicate.

If string-every returns true, the returned true value is the one produced by the final application of the predicate to S[END-1]. If string-every is applied to an empty sequence of characters, it simply returns #t.

If string-every or string-any applies the predicate to the final element of the selected sequence (i.e., S[END-1]), that final application is a tail call.

The names of these procedures do not end with a question mark -- this is to indicate that, in the predicate case, they do not return a simple boolean (#t or #f), but a general value.
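The "witness-generating" behaviour can be seen in a short sketch. The helper digit-or-false is a hypothetical predicate written for this example; it returns the character itself rather than #t, and string-any passes that value through.

```scheme
(require-extension srfi-13)

;; Hypothetical predicate: returns the character as its "witness".
(define (digit-or-false c) (and (char-numeric? c) c))

(define first-digit (string-any digit-or-false "ab3d9"))    ; #\3, not merely #t
(define no-digit    (string-any digit-or-false "abcd"))     ; #f
(define all-alpha   (string-every char-alphabetic? "abc"))  ; a true value
(define vacuous     (string-every char-alphabetic? ""))     ; #t for an empty sequence
(define by-char     (string-every #\a "aaa"))               ; character used as an equality test
```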

PROC is an integer->char procedure. Construct a string of size LEN by applying PROC to each index to produce the corresponding string element. The order in which PROC is applied to the indices is not specified.
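A minimal sketch of string-tabulate, assuming an ASCII-compatible character encoding (97 is the code of #\a). Because the order of application is unspecified, PROC should not rely on side effects.

```scheme
(require-extension srfi-13)

;; Index i maps to the i-th lowercase letter.
(define alphabet-prefix
  (string-tabulate (lambda (i) (integer->char (+ i 97))) 5))   ; "abcde"
```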

string->list is extended from the R5RS definition to take optional START/END arguments.

[procedure](reverse-list->string char-list) -> string

An efficient implementation of (compose list->string reverse):

(reverse-list->string '(#\a #\B #\c)) -> "cBa"

This is a common idiom in the epilog of string-processing loops that accumulate an answer in a reverse-order list. (See also string-concatenate-reverse for the "chunked" variant.)
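The idiom might look like the following sketch, where keep-digits is a hypothetical helper written only for illustration. The loop conses matching characters onto an accumulator, which therefore ends up in reverse order.

```scheme
(require-extension srfi-13)

;; Hypothetical helper: collect the digits of S, in order.
(define (keep-digits s)
  (let lp ((i 0) (acc '()))
    (if (= i (string-length s))
        (reverse-list->string acc)            ; the epilog: reverse and convert in one step
        (let ((c (string-ref s i)))
          (lp (+ i 1)
              (if (char-numeric? c) (cons c acc) acc))))))

(define digits (keep-digits "a1b22c"))        ; "122"
```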

[procedure](string-join string-list [delimiter grammar]) -> string

This procedure is a simple unparser --- it pastes strings together using the delimiter string.

The GRAMMAR argument is a symbol that determines how the delimiter is used, and defaults to 'infix.

'infix means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string -- note, however, that parsing an empty string with an infix or separator grammar is ambiguous. Is it an empty list, or a list of one element, the empty string?

'strict-infix means the same as 'infix, but will raise an error if given an empty list.

'suffix means a suffix or terminator grammar: insert the delimiter after every list element. This grammar has no ambiguities.

'prefix means a prefix grammar: insert the delimiter before every list element. This grammar has no ambiguities.

The delimiter is the string used to delimit elements; it defaults to a single space " ".
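The grammars can be compared side by side in a short sketch; the word list is arbitrary sample data.

```scheme
(require-extension srfi-13)

(define words '("foo" "bar" "baz"))

(define default-join (string-join words))             ; "foo bar baz" (delimiter defaults to a space)
(define infix-join   (string-join words ":"))         ; "foo:bar:baz"
(define suffix-join  (string-join words ":" 'suffix)) ; "foo:bar:baz:"
(define prefix-join  (string-join words ":" 'prefix)) ; ":foo:bar:baz"

;; The infix ambiguity: an empty list and a list holding one
;; empty string both unparse to the empty string.
(define empty-list-join   (string-join '() ":"))      ; ""
(define empty-string-join (string-join '("") ":"))    ; ""
```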

[R5RS+] substring/shared returns a string whose contents are the characters of S beginning with index START (inclusive) and ending with index END (exclusive). It differs from the R5RS substring in two ways:

The END parameter is optional, not required.

substring/shared may return a value that shares memory with S or is eq? to S.

string-copy is extended from its R5RS definition by the addition of its optional START/END parameters. In contrast to substring/shared, it is guaranteed to produce a freshly-allocated string.

Use string-copy when you want to indicate explicitly in your code that you wish to allocate new storage; use substring/shared when you don't care if you get a fresh copy or share storage with the original string.
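A brief sketch of the difference, using an arbitrary sample string. Only the string-copy results are guaranteed to be fresh; nothing should be assumed about the identity of the substring/shared result.

```scheme
(require-extension srfi-13)

(define s "Beta substitution")

(define fresh        (string-copy s 5))        ; "substitution", freshly allocated
(define middle       (string-copy s 1 10))     ; "eta subst"
(define whole-copy   (string-copy s))          ; equal to s, but never eq? to it
(define maybe-shared (substring/shared s 5))   ; same contents; may or may not share storage
```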

Copy the sequence of characters from index range [START,END) in string S to string TARGET, beginning at index TSTART. The characters are copied left-to-right or right-to-left as needed -- the copy is guaranteed to work, even if TARGET and S are the same string.

It is an error if the copy operation runs off the end of the target string, e.g.
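As a sketch, assuming the argument order (string-copy! target tstart s [start end]): the first two calls stay within the target, while the call shown in the final comment would need six characters of room in a three-character target and is therefore an error.

```scheme
(require-extension srfi-13)

;; Copy "bcd" (indices [1,4) of the source) into TARGET at index 2.
(define target (make-string 9 #\-))
(string-copy! target 2 "abcdef" 1 4)          ; target is now "--bcd----"

;; Overlapping copy within one string is guaranteed to work.
(define s (string-copy "abcdefgh"))
(string-copy! s 2 s 0 4)                      ; s is now "ababcdgh"

;; Erroneous, shown for illustration only (runs off the end of the target):
;; (string-copy! (make-string 3) 0 "abcdef")
```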

string-take returns the first NCHARS of S; string-drop returns all but the first NCHARS of S. string-take-right returns the last NCHARS of S; string-drop-right returns all but the last NCHARS of S. If these procedures produce the entire string, they may return either S or a copy of S; in some implementations, proper substrings may share memory with S.
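The four procedures can be sketched on sample strings as follows.

```scheme
(require-extension srfi-13)

(define taken         (string-take "Pete Szilagyi" 6))     ; "Pete S"
(define dropped       (string-drop "Pete Szilagyi" 6))     ; "zilagyi"
(define taken-right   (string-take-right "Beta rules" 5))  ; "rules"
(define dropped-right (string-drop-right "Beta rules" 5))  ; "Beta "
```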

Build a string of length LEN comprised of S padded on the left (right) by as many occurrences of the character CHAR as needed. If S has more than LEN chars, it is truncated on the left (right) to length LEN. CHAR defaults to #\space.

If LEN <= END-START, the returned value is allowed to share storage with S, or be exactly S (if LEN = END-START).
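A short sketch of padding and truncation with string-pad and string-pad-right; the numeric strings are arbitrary sample data.

```scheme
(require-extension srfi-13)

(define padded-left     (string-pad "325" 5))            ; "  325"
(define exact-fit       (string-pad "71325" 5))          ; "71325"
(define truncated-left  (string-pad "8871325" 5))        ; "71325" (leftmost chars dropped)
(define padded-right    (string-pad-right "325" 5))      ; "325  "
(define truncated-right (string-pad-right "8871325" 5))  ; "88713" (rightmost chars dropped)
(define zero-padded     (string-pad "42" 5 #\0))         ; "00042"
```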

Apply PROC<, PROC=, or PROC> to the mismatch index, depending upon whether S1 is less than, equal to, or greater than S2. The "mismatch index" is the largest index I such that for every 0 <= J < I, S1[J] = S2[J] -- that is, I is the first position that doesn't match.

string-compare-ci is the case-insensitive variant. Case-insensitive comparison is done by case-folding characters with the operation

(char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and context-insensitive, and compatible with the 1-1 case mappings specified by Unicode's UnicodeData.txt table.

The optional start/end indices restrict the comparison to the indicated substrings of S1 and S2. The mismatch index is always an index into S1; in the case of PROC=, it is always END1; we observe the protocol in this redundant case for uniformity.
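The three-continuation protocol can be sketched with a hypothetical wrapper, classify, that tags the mismatch index with the outcome of the comparison.

```scheme
(require-extension srfi-13)

;; Hypothetical helper: report which branch ran and the mismatch index.
(define (classify s1 s2)
  (string-compare s1 s2
                  (lambda (i) (list 'less i))
                  (lambda (i) (list 'equal i))
                  (lambda (i) (list 'greater i))))

(define r1 (classify "abcd" "abxd"))   ; (less 2): first difference at index 2
(define r2 (classify "abd" "abc"))     ; (greater 2)
(define r3 (classify "ab" "abc"))      ; (less 2): s1 is a proper prefix of s2
(define r4 (classify "abc" "abc"))     ; (equal 3): for PROC= the index is END1
```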

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string< is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.

The optional start/end indices restrict the comparison to the indicated substrings of S1 and S2.

Comparison is simply done on individual code-points of the string. True text collation is not handled by this SRFI.

Compute a hash value for the string S. BOUND is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,BOUND).

If BOUND is either zero or not given, the implementation may use an implementation-specific default value, chosen to be as large as is efficiently practical. For instance, the default range might be chosen for a given implementation to map all strings into the range of integers that can be represented with a single machine word.

The optional start/end indices restrict the hash operation to the indicated substring of S.

string-hash-ci is the case-insensitive variant. Case-insensitive comparison is done by case-folding characters with the operation

(char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and context-insensitive, and compatible with the 1-1 case mappings specified by Unicode's UnicodeData.txt table.

Rationale: allowing the user to specify an explicit bound simplifies user code by removing the mod operation that typically accompanies every hash computation, and also may allow the implementation of the hash function to exploit a reduced range to efficiently compute the hash value. E.g., for small bounds, the hash function may be computed in a fashion such that intermediate values never overflow into bignum integers, allowing the implementor to provide a fixnum-specific "fast path" for computing the common cases very rapidly.

string-index (string-index-right) searches through the string from the left (right), returning the index of the first occurrence of a character which

equals CHAR/CHAR-SET/PRED (if it is a character);

is in CHAR/CHAR-SET/PRED (if it is a character set);

satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure).

If no match is found, the functions return false.

The START and END parameters specify the beginning and end indices of the search; the search includes the start index, but not the end index. Be careful of "fencepost" considerations: when searching right-to-left, the first index considered is

END-1

whereas when searching left-to-right, the first index considered is

START

That is, the start/end indices describe the same half-open interval [START,END) in these procedures as they do in all the other SRFI 13 procedures.
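The fencepost behaviour can be checked with a small sketch on the sample string "hello", whose two #\l characters sit at indices 2 and 3.

```scheme
(require-extension srfi-13)

(define from-left   (string-index "hello" #\l))            ; 2
(define from-right  (string-index-right "hello" #\l))      ; 3

;; With END = 3 the right-to-left search starts at index END-1 = 2,
;; so index 3 is never examined.
(define right-bound (string-index-right "hello" #\l 0 3))  ; 2

;; With START = 3 the left-to-right search starts at index 3.
(define left-bound  (string-index "hello" #\l 3))          ; 3

(define missing     (string-index "hello" #\z))            ; #f
```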

The skip functions are similar, but use the complement of the criterion: they search for the first char that doesn't satisfy the test. E.g., to skip over initial whitespace, say

(cond ((string-skip s char-set:whitespace) =>
       (lambda (i) ...)) ; s[i] is not whitespace.
      ...)
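A complete version of that pattern might look like the following sketch. The procedure name drop-leading-whitespace is hypothetical, and the example assumes the srfi-14 unit is loaded as well so that char-set:whitespace is bound.

```scheme
(require-extension srfi-13 srfi-14)

;; Hypothetical helper: return S without its leading whitespace.
(define (drop-leading-whitespace s)
  (cond ((string-skip s char-set:whitespace)
         => (lambda (i) (string-copy s i)))   ; s[i] is the first non-whitespace char
        (else "")))                           ; s is empty or entirely whitespace

(define trimmed (drop-leading-whitespace "  hi there"))   ; "hi there"
(define blank   (drop-leading-whitespace "   "))          ; ""
```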

[procedure](string-count s char/char-set/pred [start end]) -> integer

Return a count of the number of characters in S that satisfy the CHAR/CHAR-SET/PRED argument. If this argument is a procedure, it is applied to the character as a predicate; if it is a character set, the character is tested for membership; if it is a character, it is used in an equality test.

The names of these procedures do not end with a question mark -- this is to indicate that they do not return a simple boolean (#t or #f). Rather, they return either false (#f) or an exact non-negative integer.

Unicode note: Reversing a string simply reverses the sequence of code-points it contains. So a zero-width accent character A coming after a base character B in string S would come out before B in the reversed result.

[procedure](string-concatenate string-list) -> string

Append the elements of string-list together into a single string. Guaranteed to return a freshly allocated string.

Note that the (apply string-append STRING-LIST) idiom is not robust for long lists of strings, as some Scheme implementations limit the number of arguments that may be passed to an n-ary procedure.

These two procedures are variants of string-concatenate and string-append that are permitted to return results that share storage with their parameters. In particular, if string-append/shared is applied to just one argument, it may return exactly that argument, whereas string-append is required to allocate a fresh string.

This procedure is useful in the construction of procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done.
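A sketch of that accumulation pattern, assuming the SRFI 13 signature (string-concatenate-reverse string-list [final-string end]). The optional FINAL-STRING is appended after the reversed list, and END limits how many of its characters are used, which suits a partially filled buffer.

```scheme
(require-extension srfi-13)

;; Chunks are consed onto the front as they arrive, so the list is in reverse order.
(define chunks (list " must be" "Hello, I"))

(define plain (string-concatenate-reverse chunks))   ; "Hello, I must be"

;; Only the first 7 characters of the final buffer hold valid data.
(define with-final
  (string-concatenate-reverse chunks " going.XXXX" 7))   ; "Hello, I must be going."
```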

Unicode note: Reversing a string simply reverses the sequence of code-points it contains. So a zero-width accent character AC coming after a base character BC in string S would come out before BC in the reversed result.

Interested functional programmers may enjoy noting that string-fold and string-unfold-right are in some sense inverses. That is, given operations KNULL?, KAR, KDR, KONS, and KNIL satisfying

(KONS (KAR X) (KDR X)) = X and (KNULL? KNIL) = #t

then

(string-fold KONS KNIL (string-unfold-right KNULL? KAR KDR X)) = X

and

(string-unfold-right KNULL? KAR KDR (string-fold KONS KNIL S)) = S.
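Taking KNULL?, KAR, KDR, KONS and KNIL to be the list operations null?, car, cdr, cons and '(), both identities can be checked in a short sketch.

```scheme
(require-extension srfi-13)

(define s "abc")

;; Folding with cons walks S left to right, so the list comes out reversed.
(define as-list (string-fold cons '() s))                         ; (#\c #\b #\a)

;; Unfolding from the right rebuilds the original string.
(define round-trip (string-unfold-right null? car cdr as-list))   ; "abc"

;; The other direction: start from a list X of characters.
(define x (list #\x #\y #\z))
(define list-round-trip
  (string-fold cons '() (string-unfold-right null? car cdr x)))   ; (#\x #\y #\z)
```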

The final string constructed does not share storage with either BASE or the value produced by MAKE-FINAL.

Note: implementations should take care that runtime stack limits do not cause overflow when constructing large (e.g., megabyte) strings with string-unfold-right.

[procedure](string-for-each proc s [start end]) -> unspecified

Apply PROC to each character in S. string-for-each is required to iterate from START to END in increasing order.

[procedure](string-for-each-index proc s [start end]) -> unspecified

Apply PROC to each index of S, in order. The optional START/END pairs restrict the endpoints of the loop. This is simply a method of looping over a string that is guaranteed to be safe and correct. Example:
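One possible example is a string reversal written with string-for-each-index. The procedure name reverse-by-index is hypothetical; note that the target index is LEN-1-I, so that index 0 maps to the last position.

```scheme
(require-extension srfi-13)

;; Hypothetical helper: build a reversed copy of S by index.
(define (reverse-by-index s)
  (let* ((len (string-length s))
         (ans (make-string len)))
    (string-for-each-index
     (lambda (i) (string-set! ans (- len 1 i) (string-ref s i)))
     s)
    ans))

(define reversed (reverse-by-index "abc"))   ; "cba"
```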

This is the "extended substring" procedure that implements replicated copying of a substring of some string.

S is a string; START and END are optional arguments that demarcate a substring of S, defaulting to 0 and the length of S (i.e., the whole string). Replicate this substring up and down index space, in both the positive and negative directions. For example, if S = "abcdefg", START=3, and END=6, then we have the conceptual bidirectionally-infinite string

...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
... -9 -8 -7 -6 -5 -4 -3 -2 -1  0 +1 +2 +3 +4 +5 +6 +7 +8 +9 ...

xsubstring returns the substring of this string beginning at index FROM, and ending at TO (which defaults to FROM+(END-START)).

You can use xsubstring to perform a variety of tasks:

To rotate a string left: (xsubstring "abcdef" 2) => "cdefab"

To rotate a string right: (xsubstring "abcdef" -2) => "efabcd"

To replicate a string: (xsubstring "abc" 0 7) => "abcabca"

Note that

The FROM/TO indices give a half-open range -- the characters from index FROM up to, but not including, index TO.

The FROM/TO indices are not in terms of the index space for string S. They are in terms of the replicated index space of the substring defined by S, START, and END.

It is an error if START=END -- although this is allowed by special dispensation when FROM=TO.

Exactly the same as xsubstring, but the extracted text is written into the string TARGET starting at index TSTART. This operation is not defined if (eq? TARGET S) or these two arguments share storage -- you cannot copy a string on top of itself.

That is, the segment of characters in S1 from START1 to END1 is replaced by the segment of characters in S2 from START2 to END2. If START1=END1, this simply splices the S2 characters into S1 at the specified index.
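A sketch of string-replace, assuming the signature (string-replace s1 s2 start1 end1 [start2 end2]); the strings are sample data.

```scheme
(require-extension srfi-13)

;; Replace s1[2,4) = "cd" with all of s2.
(define replaced (string-replace "abcdef" "XY" 2 4))   ; "abXYef"

;; START1 = END1: nothing is removed, s2 is spliced in at index 3.
(define spliced (string-replace "abcdef" "XY" 3 3))    ; "abcXYdef"

;; Replace "TCL" (s1[4,7)) with s2[8,22) = "miserable perl".
(define sentence
  (string-replace "The TCL programmer endured daily ridicule."
                  "another miserable perl drone" 4 7 8 22))
```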

Split the string S into a list of substrings, where each substring is a maximal non-empty contiguous sequence of characters from the character set TOKEN-SET.

TOKEN-SET defaults to char-set:graphic (see SRFI 14 for more on character sets and char-set:graphic).

If START or END indices are provided, they restrict string-tokenize to operating on the indicated substring of S.

This function provides a minimal parsing facility for simple applications. More sophisticated parsers that handle quoting and backslash effects can easily be constructed using regular-expression systems; be careful not to use string-tokenize in contexts where more serious parsing is needed.
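A brief sketch of string-tokenize with the default token set and with an explicit one. It assumes the srfi-14 unit is loaded so that char-set:letter is available.

```scheme
(require-extension srfi-13 srfi-14)

;; Default TOKEN-SET (char-set:graphic): punctuation stays attached to words.
(define by-graphic
  (string-tokenize "Help make programs run, run, RUN!"))
;; ("Help" "make" "programs" "run," "run," "RUN!")

;; Letters only: every non-letter acts as a separator,
;; and runs of separators produce no empty tokens.
(define by-letter (string-tokenize "a,b;;c" char-set:letter))   ; ("a" "b" "c")
```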

Filter the string S, retaining only those characters that satisfy / do not satisfy the CHAR/CHAR-SET/PRED argument. If this argument is a procedure, it is applied to the character as a predicate; if it is a char-set, the character is tested for membership; if it is a character, it is used in an equality test.

If the string is unaltered by the filtering operation, these functions may return either S or a copy of S.

The following procedures are useful for writing other string-processing functions. In a Scheme system that has a module or package system, these procedures should be contained in a module named "string-lib-internals".

string-parse-start+end may be used to parse a pair of optional START/END arguments from an argument list, defaulting them to 0 and the length of some string S, respectively. Let the length of string S be SLEN.

If ARGS = (), the function returns (values '() 0 SLEN)

If ARGS = (I), I is checked to ensure it is an exact integer, and that 0 <= I <= SLEN. Returns (values (cdr ARGS) I SLEN).

If ARGS = (I J ...), I and J are checked to ensure they are exact integers, and that 0 <= I <= J <= SLEN. Returns (values (cddr ARGS) I J).

If any of the checks fail, an error condition is raised, and PROC is used as part of the error condition -- it should be the client procedure whose argument list string-parse-start+end is parsing.

string-parse-final-start+end is exactly the same, except that the ARGS list passed to it is required to be of length two or less; if it is longer, an error condition is raised. It may be used when the optional START/END parameters are final arguments to the procedure.

Note that in all cases, these functions ensure that S is a string (by necessity, since all cases apply string-length to S either to default END or to bounds-check it).
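A sketch of how a client procedure might use string-parse-final-start+end, assuming the helper is exported by the srfi-13 unit and returns the two values START and END. The procedure count-vowels is hypothetical and exists only to show the calling pattern.

```scheme
(require-extension srfi-13)

;; Hypothetical client: count vowels in S, or in S[START,END) if given.
(define (count-vowels s . maybe-start+end)
  (call-with-values
      (lambda ()
        ;; The client procedure itself is passed for use in error reports.
        (string-parse-final-start+end count-vowels s maybe-start+end))
    (lambda (start end)
      (let lp ((i start) (n 0))
        (if (= i end)
            n
            (lp (+ i 1)
                (if (memv (string-ref s i) '(#\a #\e #\i #\o #\u))
                    (+ n 1)
                    n)))))))

(define all-vowels  (count-vowels "education"))       ; 5
(define some-vowels (count-vowels "education" 0 3))   ; 2, counting only "edu"
```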

The Knuth-Morris-Pratt string-search algorithm is a method of rapidly scanning a sequence of text for the occurrence of some fixed string. It has the advantage of never requiring backtracking -- hence, it is useful for searching not just strings, but also other sequences of text that do not support backtracking or random-access, such as input ports. These routines package up the initialisation and searching phases of the algorithm for general use. They also support searching through sequences of text that arrive in buffered chunks, in that intermediate search state can be carried across applications of the search loop from the end of one buffer application to the next.

A second critical property of KMP search is that it requires the allocation of auxiliary memory proportional to the length of the pattern, but constant in the size of the character type. Alternate searching algorithms frequently require the construction of a table with an entry for every possible character -- which can be prohibitively expensive in a 16- or 32-bit character representation.

Build a Knuth-Morris-Pratt "restart vector," which is useful for quickly searching character sequences for the occurrence of string S (or the substring of S demarcated by the optional START/END parameters, if provided). C= is a character-equality function used to construct the restart vector. It defaults to char=?; use char-ci=? instead for case-folded string search.

The definition of the restart vector RV for string S is: If we have matched chars 0..I-1 of S against some search string SS, and S[I] doesn't match SS[K], then reset I := RV[I], and try again to match SS[K]. If RV[I] = -1, then punt SS[K] completely, and move on to SS[K+1] and S[0].

In other words, if you have matched the first I chars of S, but the I+1'th char doesn't match, RV[I] tells you what the next-longest prefix of S is that you have matched.

The following string-search function shows how a restart vector is used to search. Note the attractive feature of the search process: it is "on line," that is, it never needs to back up and reconsider previously seen data. It simply consumes characters one-at-a-time until declaring a complete match or reaching the end of the sequence. Thus, it can be easily adapted to search other character sequences (such as ports) that do not provide random access to their contents.
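A minimal sketch of such a search function, modeled loosely on the one in the original SRFI-13 document. The name find-substring and its argument order are illustrative; it returns the index in SOURCE where the match begins, or #f.

```scheme
(require-extension srfi-13)

;; Search SOURCE[START,END) for PATTERN using a KMP restart vector.
(define (find-substring pattern source start end)
  (let ((plen (string-length pattern))
        (rv (make-kmp-restart-vector pattern)))
    (let lp ((si start)   ; index into SOURCE; it never moves backwards
             (p 0))       ; number of pattern characters matched so far
      (cond ((= p plen) (- si plen))                  ; whole pattern matched
            ((= si end) #f)                           ; ran out of text
            ((char=? (string-ref source si) (string-ref pattern p))
             (lp (+ si 1) (+ p 1)))                   ; advance both
            (else
             (let ((next (vector-ref rv p)))          ; consult the restart vector
               (if (= next -1)
                   (lp (+ si 1) 0)                    ; give up on this text character
                   (lp si next))))))))                ; retry it against a shorter prefix

(define found     (find-substring "abc" "xxabcxx" 0 7))   ; 2
(define not-found (find-substring "abc" "ababab" 0 6))    ; #f
```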

The optional START/END parameters restrict the restart vector to the indicated substring of PAT; RV is END - START elements long. If START > 0, then RV is offset by START elements from PAT. That is, RV[I] describes pattern element PAT[I + START]. Elements of RV are themselves indices that range just over [0, END-START), not [START, END).

Rationale: the actual value of RV is "position independent" -- it does not depend on where in the PAT string the pattern occurs, but only on the actual characters comprising the pattern.

[procedure](kmp-step pat rv c i c= p-start) -> integer

This function encapsulates the work performed by one step of the KMP string search; it can be used to scan strings, input ports, or other on-line character sources for fixed strings.

PAT is the non-empty string specifying the text for which we are searching. RV is the Knuth-Morris-Pratt restart vector for the pattern, as constructed by make-kmp-restart-vector. The pattern begins at PAT[P-START], and is (vector-length RV) characters long. C= is the character-equality function used to construct the restart vector, typically char=? or char-ci=?.

Suppose the pattern is N characters in length: PAT[P-START, P-START + N). We have already matched I characters: PAT[P-START, P-START + I). (P-START is typically zero.) C is the next character in the input stream. kmp-step returns the new I value -- that is, how much of the pattern we have matched, including character C. When I reaches N, the entire pattern has been matched.

Rationale: this procedure takes no optional arguments because it is intended as an inner-loop primitive and we do not want any run-time penalty for optional-argument parsing and defaulting, nor do we wish barriers to procedure integration/inlining.
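As a sketch of kmp-step driving an on-line search, the hypothetical procedure port-find below reads characters from an input port one at a time and returns how many characters were consumed when the pattern was first completed, or #f at end of input. It assumes a non-empty pattern and that open-input-string is available.

```scheme
(require-extension srfi-13)

;; Hypothetical helper: scan PORT for PAT without ever backing up.
(define (port-find pat port)
  (let ((rv (make-kmp-restart-vector pat))
        (n (string-length pat)))
    (let lp ((i 0) (consumed 0))
      (if (= i n)
          consumed                               ; all N pattern characters matched
          (let ((c (read-char port)))
            (if (eof-object? c)
                #f                               ; input exhausted without a match
                (lp (kmp-step pat rv c i char=? 0)
                    (+ consumed 1))))))))

(define hit  (port-find "abc" (open-input-string "xxabcxx")))   ; 5
(define miss (port-find "abc" (open-input-string "ababab")))    ; #f
```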

Suppose PLEN = (vector-length RV) is the length of the pattern. I is an integer index into the pattern (that is, 0 <= I < PLEN) indicating how much of the pattern has already been matched. (This means the pattern must be non-empty -- PLEN > 0.)

On success, returns -J, where J is the index in S bounding the end of the pattern -- e.g., a value that could be used as the END parameter in a call to substring/shared.

On continue, returns the current search state I' (an index into RV) when the search reached the end of the string. This is a non-negative integer.

Hence:

A negative return value indicates success, and says where in the string the match occurred.

A non-negative return value provides the I to use for continued search in a following string.

This utility is designed to allow searching for occurrences of a fixed string that might extend across multiple buffers of text. This is why, for example, we do not provide the index of the start of the match on success -- it may have occurred in a previous buffer.

To search a character sequence that arrives in "chunks," write a loop of this form:
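One way to sketch such a loop is shown below, with the chunks supplied as a list of strings rather than read from a real buffer. The procedure search-chunks is hypothetical; it returns a pair of the chunk number and the end index of the match within that chunk, or #f if the pattern never occurs.

```scheme
(require-extension srfi-13)

;; Hypothetical helper: search a list of string chunks for PAT.
(define (search-chunks pat chunks)
  (let ((rv (make-kmp-restart-vector pat)))
    (let lp ((chunks chunks)
             (n 0)      ; number of the current chunk
             (i 0))     ; search state carried over from the previous chunk
      (if (null? chunks)
          #f                                      ; end of data, no match
          (let ((r (string-kmp-partial-search pat rv (car chunks) i)))
            (if (< r 0)
                (cons n (- r))                    ; success: match ends at index -R
                (lp (cdr chunks) (+ n 1) r)))))))  ; continue with state R

;; "needle" straddles the boundary between the two chunks.
(define across (search-chunks "needle" '("hay ne" "edle hay")))   ; (1 . 4)
(define absent (search-chunks "needle" '("hay" "stack")))         ; #f
```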