9.34 gauche.unicode - Unicode utilities

Module: gauche.unicode

This module provides various operations on a sequence of Unicode codepoints.

Gauche can be compiled with a native encoding other than Unicode,
and the full Unicode-compatible behavior on characters and strings may
not be available on such systems. So we provide most operations in two
flavors: Operations on characters and strings, or operations on
codepoints represented as a sequence of integers.

If Gauche is compiled with its native encoding being none,
euc-jp or sjis, character-and-string operations
are likely to be partial functions of the operations defined
in the Unicode standard. That is, if an operation would yield a
character that is not supported in the native encoding, the result
may be remapped to an alternative character. Each manual entry
explains the detailed behavior.

The codepoint operations are independent of Gauche’s native
encoding and support the full specification defined in the Unicode standard.
If Gauche is compiled with the utf-8 native encoding,
the operations are essentially the same as the character-and-string flavors
when you convert between codepoints and characters with char->integer and
integer->char. The codepoint operations are handy when
you need to fully support the algorithms described in the Unicode standard,
no matter what the running Gauche’s native encoding is.

9.34.1 Unicode transfer encodings

The procedures in this group operate on codepoints represented as integers.
In the following descriptions, ‘octet’ refers to an integer
between 0 and 255, inclusive.

They take an optional strictness argument. It specifies
what to do when the procedure encounters a datum outside
of the defined domain. Its value can be either one of the
following symbols:

strict

Raises an error when the procedure encounters such input.
This is the default behavior.

permissive

Whenever possible, treats the datum as if it were a valid value.
For example, a codepoint value beyond #x10ffff is invalid
in the Unicode standard, but accepting it may be useful for applications
that just want to use UTF-8 as an encoding scheme for binary data.

ignore

Whenever possible, treats the invalid input as if it did not exist.

The procedure may still raise an error in permissive or
ignore strictness mode, if there can’t be a sensible
way to handle the input data.

Function: ucs4->utf8 codepoint :optional strictness

Takes an integer codepoint and returns a list of octets that
encodes the input in UTF-8.

(ucs4->utf8 #x3bb) ⇒ (206 187)
(ucs4->utf8 #x3042) ⇒ (227 129 130)

If strictness is strict (default), input codepoints
between #xd800 and #xdfff, and those at or above #x110000,
are rejected. If strictness is permissive, it accepts
input between 0 and #x7fffffff, inclusive; it may produce
5 or 6 octets if the input is large (as in the original UTF-8 definition).
If strictness is ignore, it returns an empty list
for invalid codepoints.
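
The three strictness modes described above can be compared on a lone
surrogate codepoint. This is a minimal sketch, not taken from the manual;
the variable names are hypothetical, and the expected values follow from
the description above.

```scheme
(use gauche.unicode)

;; strict (default): a surrogate codepoint is rejected with an error.
;; The error is caught here and mapped to the symbol `error'.
(define strict-result
  (guard (e [else 'error])
    (ucs4->utf8 #xd800)))

;; ignore: an invalid codepoint yields an empty list.
(define ignore-result (ucs4->utf8 #xd800 'ignore))

;; A valid codepoint is encoded normally when a mode is given explicitly.
(define valid-result (ucs4->utf8 #x3bb 'permissive))

;; strict-result ⇒ error
;; ignore-result ⇒ ()
;; valid-result  ⇒ (206 187)
```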

Function: utf8-length octet :optional strictness

Takes octet as the first octet of a UTF-8 sequence, and
returns the total number of octets required to decode
the codepoint.

If strictness is strict (default), this
procedure returns either 1, 2, 3 or 4. An error is
thrown if octet cannot be a leading octet of
a proper UTF-8 encoded Unicode codepoint.

If strictness is permissive, this procedure
may return an integer between 0 and 6, inclusive.
It allows the codepoint range #x110000 to
#x7fffffff as in the original UTF-8 specification, so
the maximum number of octets can be up to 6.
If the input is in the range between #xc0
and #xdf, inclusive, this procedure returns
1; it’s up to the application how to treat these illegal
octets. For other values, it returns 0.

If strictness is ignore, this procedure
returns 0 when it would raise an error if
strictness is strict. Other than that,
it works the same as the default case.
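
As a small illustration of the default strict mode, the following sketch
feeds utf8-length the leading octets of 1-, 2-, 3- and 4-octet sequences.
The variable name is hypothetical.

```scheme
(use gauche.unicode)

;; #x41 starts a 1-octet sequence, #xce a 2-octet one,
;; #xe3 a 3-octet one, and #xf0 a 4-octet one.
(define lengths
  (map utf8-length '(#x41 #xce #xe3 #xf0)))

;; lengths ⇒ (1 2 3 4)
```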

Function: utf8->ucs4 octet-list :optional strictness

Takes a list of octets, and decodes it as a utf-8 sequence.
Returns two values: The decoded ucs4 codepoint, and the
rest of the input list.

An invalid utf8 sequence causes an error if strictness
is strict, or is skipped if it is ignore.
If strictness is permissive, the procedure accepts
original utf-8 sequences that decode to the surrogate
range (between #xd800 and #xdfff) or to the range
between #x110000 and #x7fffffff. An invalid
octet sequence is still an error in permissive mode.
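
Because utf8->ucs4 returns the remaining input as its second value, it can
be called repeatedly to decode a whole octet list. The sketch below shows
one way to do that; utf8-octets->codepoints is a hypothetical helper, not
part of gauche.unicode.

```scheme
(use gauche.unicode)

;; A single call: the decoded codepoint and the undecoded remainder.
(define first-decode
  (values->list (utf8->ucs4 '(227 129 130 65))))

;; Hypothetical helper: decode every codepoint by feeding the
;; remainder back into utf8->ucs4 until the list is exhausted.
(define (utf8-octets->codepoints octets)
  (let loop ((rest octets) (acc '()))
    (if (null? rest)
        (reverse acc)
        (receive (cp tail) (utf8->ucs4 rest)
          (loop tail (cons cp acc))))))

;; first-decode ⇒ (12354 (65))
;; (utf8-octets->codepoints '(206 187 227 129 130 65)) ⇒ (955 12354 65)
```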

Function: utf8->string u8vector :optional start end

[R7RS]
Converts a sequence of utf8 octets in u8vector to a string.
Optional start and/or end argument(s) will limit the
range of the input.

If Gauche’s native encoding is utf8,
u8vector->string (see Uvector conversion operations)
will do the job faster; but this routine can be used regardless
of Gauche’s native encoding, and it raises an error if u8vector
contains octet sequences illegal as utf8.

Function: string->utf8 string :optional start end

[R7RS]
Converts a string to a u8vector of utf8 octets.
Optional start and/or end argument(s) will limit the
range of the input.

If Gauche’s native encoding is utf8,
string->u8vector (see Uvector conversion operations)
will do the job faster; but this routine can be used regardless
of Gauche’s native encoding.
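
A short round-trip sketch of string->utf8 and utf8->string, including the
optional range arguments. It assumes the R7RS convention that start is
inclusive and end is exclusive; the variable names are hypothetical.

```scheme
(use gauche.unicode)

(define octets (string->utf8 "abcde"))      ; the whole string
(define part   (string->utf8 "abcde" 1 3))  ; characters at index 1 and 2
(define back   (utf8->string octets))       ; round trip
(define tail   (utf8->string octets 3))     ; from octet index 3 to the end

;; octets ⇒ #u8(97 98 99 100 101)
;; part   ⇒ #u8(98 99)
;; back   ⇒ "abcde"
;; tail   ⇒ "de"
```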

Function: ucs4->utf16 codepoint :optional strictness

Takes an integer codepoint and returns a list of integers
that encodes the input in UTF-16. The output is either
one integer or two integers, and each integer is in the
range between 0 and 65535 (inclusive).

If strictness is strict (default), input codepoints
between #xd800 and #xdfff, and those at or above #x110000,
are rejected. If strictness is permissive, it accepts
high surrogates and low surrogates, in which case the result is
a single-element list containing the input. If strictness is ignore,
an empty list is returned for an invalid codepoint (including surrogates).
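
The sketch below encodes one BMP codepoint and one codepoint outside the
BMP, then shows how a lone surrogate is treated in the non-default modes.
The variable names are hypothetical; the expected values follow from the
description above and from the UTF-16 surrogate pair arithmetic.

```scheme
(use gauche.unicode)

(define bmp    (ucs4->utf16 #x3042))   ; fits in one 16-bit integer
(define astral (ucs4->utf16 #x1f600))  ; needs a surrogate pair

;; A lone high surrogate as input:
(define lone-permissive (ucs4->utf16 #xd800 'permissive))
(define lone-ignored    (ucs4->utf16 #xd800 'ignore))

;; bmp             ⇒ (12354)
;; astral          ⇒ (55357 56832)   ; #xd83d #xde00
;; lone-permissive ⇒ (55296)
;; lone-ignored    ⇒ ()
```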

Function: utf16-length code :optional strictness

Code must be an integer between 0 and 65535, inclusive.
Returns 1 if code is a BMP character codepoint, or
2 if code is a high surrogate codepoint.

If strictness is strict (default), an error is
signalled if code is a low surrogate, or it is out of range.
If strictness is permissive, 1 is returned
for low surrogates, but an error is signalled for out of range arguments.
If strictness is ignore, 0 is returned
for low surrogates and out of range arguments.
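
A minimal sketch of the cases listed above, with hypothetical variable
names:

```scheme
(use gauche.unicode)

(define bmp-len  (utf16-length #x3042))  ; BMP character
(define high-len (utf16-length #xd83d))  ; high surrogate

;; A low surrogate cannot start a sequence; the result depends on the mode.
(define low-permissive (utf16-length #xde00 'permissive))
(define low-ignored    (utf16-length #xde00 'ignore))

;; bmp-len        ⇒ 1
;; high-len       ⇒ 2
;; low-permissive ⇒ 1
;; low-ignored    ⇒ 0
```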

Function: utf16->ucs4 code-list :optional strictness

Takes a list of integers and decodes it as a utf-16 sequence.
Returns two values: The decoded ucs4 codepoint, and the rest of
input list.

If strictness is strict (default), an invalid utf-16
sequence and out-of-range integer raise an error. If strictness
is permissive, an out-of-range integer causes an error, but
a lone surrogate is allowed and returned as is. If strictness
is ignore, lone surrogates and out-of-range integers are just
ignored.
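
As an illustration, the following sketch decodes a surrogate pair followed
by one more code unit. The variable name is hypothetical.

```scheme
(use gauche.unicode)

;; #xd83d #xde00 is the surrogate pair for U+1F600; #x41 is left over.
(define decoded
  (values->list (utf16->ucs4 '(#xd83d #xde00 #x41))))

;; decoded ⇒ (128512 (65))   ; 128512 = #x1f600
```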

9.34.2 Unicode text segmentation

Function: string->words string

Function: codepoints->words sequence

From the given string or codepoint sequence (a <sequence>
object containing codepoints), returns a list of
words. Each word is represented as a string, or
a sequence of the same type as the input, respectively.
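
The following sketch applies both word-splitting procedures to the same
text. The variable names are hypothetical, and the results shown assume the
default UAX #29 word boundary rules, under which the apostrophe does not
split "That's" and the space forms a word of its own.

```scheme
(use gauche.unicode)

(define words (string->words "That's it."))

(define cp-words
  (codepoints->words '(84 104 97 116 39 115 32 105 116 46)))

;; words    ⇒ ("That's" " " "it" ".")
;; cp-words ⇒ ((84 104 97 116 39 115) (32) (105 116) (46))
```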

In the second example, the list is a list of codepoints
of the characters in "That's it."

Function: string->grapheme-clusters string

Function: codepoints->grapheme-clusters sequence

From the given string or codepoint sequence (a <sequence>
object containing codepoints), returns a list of
grapheme clusters. Each cluster is represented as a string,
or a sequence of the same type as the input, respectively.
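
A grapheme cluster can span several codepoints, for instance a base letter
followed by a combining mark. The sketch below, with hypothetical variable
names, shows plain ASCII input and a combining sequence given as codepoints.

```scheme
(use gauche.unicode)

;; Each ASCII letter is its own cluster.
(define ascii-clusters (string->grapheme-clusters "abc"))

;; #x65 #x301 is "e" followed by COMBINING ACUTE ACCENT; the two
;; codepoints stay together in one cluster.
(define combined
  (codepoints->grapheme-clusters '(#x65 #x301 #x61)))

;; ascii-clusters ⇒ ("a" "b" "c")
;; combined       ⇒ ((101 769) (97))
```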

The following procedures are low-level building blocks
used to build string->words and the other procedures above.
A generator argument is a procedure
with no arguments that returns a value (or some values) at a time
on every call, until it returns EOF.

Function: make-word-breaker generator

Function: make-grapheme-cluster-breaker generator

Given generator, which is a generator of characters or codepoints,
returns a generator that returns two values: The first value is the
character or codepoint generated from the original generator, and the
second value is a boolean flag, which is #t if a word break
or a grapheme cluster break, respectively,
occurs before the character/codepoint, and #f otherwise.

Suppose a generator g returns characters in a string
That's it., one at a time. Then the created generator
will work as follows:
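
A minimal sketch of such a session follows. It assumes string->generator
from gauche.generator as the source of characters; drain-breaker is a
hypothetical helper that collects each character with its flag until EOF.

```scheme
(use gauche.unicode)
(use gauche.generator)

(define g (string->generator "That's it."))
(define brk (make-word-breaker g))

;; Hypothetical helper: call the breaker until the first value is EOF,
;; collecting (character . break-before?) pairs.
(define (drain-breaker brk)
  (let loop ((acc '()))
    (receive (ch . more) (brk)
      (if (eof-object? ch)
          (reverse acc)
          (loop (cons (cons ch (car more)) acc))))))

(define flags (drain-breaker brk))

;; flags ⇒ ((#\T . #t) (#\h . #f) (#\a . #f) (#\t . #f) (#\' . #f)
;;          (#\s . #f) (#\space . #t) (#\i . #t) (#\t . #f) (#\. . #t))
```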

The word breaks fall at the character boundaries marked
by the caret ^ below (for clarity, _ indicates
the space).

 T h a t ' s _ i t .
^           ^ ^   ^ ^

Function: make-word-reader generator return

Function: make-grapheme-cluster-reader generator return

The input generator is a generator of characters or codepoints,
and return is a procedure that takes a list of characters or
codepoints, and returns an object. These procedures create a
generator that returns one object at a time, each made by applying
return to the characters or codepoints of a
word or a grapheme cluster, respectively.

Suppose a generator g returns characters in a string
That's it., one at a time, again.
Then the created generator works as follows:
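
A minimal sketch of such a session follows, using list->string as the
return procedure so that each generated object is a string. It assumes
string->generator and generator->list from gauche.generator; the variable
names are hypothetical.

```scheme
(use gauche.unicode)
(use gauche.generator)

(define g (string->generator "That's it."))
(define rdr (make-word-reader g list->string))

;; Drain the reader; each call to rdr would return the next word,
;; and EOF once the input is exhausted.
(define words (generator->list rdr))

;; The grapheme cluster reader is used the same way.
(define clusters
  (generator->list
   (make-grapheme-cluster-reader (string->generator "abc") list->string)))

;; words    ⇒ ("That's" " " "it" ".")
;; clusters ⇒ ("a" "b" "c")
```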

9.34.3 Full string case conversion

Function: string-upcase string

Function: string-downcase string

Function: string-titlecase string

Function: string-foldcase string

[R6RS][R7RS]
Converts the given string to upper case, lower case, title case,
or folded case, respectively, using the language-independent
full case conversion defined by the Unicode standard.
They differ from srfi-13’s procedures
with the same names (see SRFI-13 String case mapping),
which simply use character-by-character case mapping.
Notably, the length of the resulting string may differ from that of
the source string,
and some conversions are sensitive to whether the character is at a
word boundary or not. The word boundaries are determined according
to the UAX #29 text segmentation rules.
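
The sketch below illustrates two of the points above: a conversion that
changes the string length, and title casing that depends on word
boundaries. It assumes Gauche is built with the utf-8 native encoding and
that the source file is read as UTF-8; the variable names are hypothetical.

```scheme
(use gauche.unicode)

;; Full case conversion maps the single character ß to "SS",
;; so the result is one character longer than the source.
(define up (string-upcase "straße"))

;; Only the first letter of each word is upcased.
(define title (string-titlecase "hello world"))

;; Case folding makes strings comparable regardless of case.
(define same?
  (string=? (string-foldcase "Hello") (string-foldcase "hELLO")))

;; up    ⇒ "STRASSE"
;; title ⇒ "Hello World"
;; same? ⇒ #t
```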