Text strings in scripts

Terminology

Be warned — these terms are not always used correctly or consistently in the LSL wikis. Here's the terminology as defined in the relevant standards:

Unicode is a character set of more than 100,000 characters and their assigned numeric ID codes. The numeric codes are 21-bit integers in the range 0 through 0x10ffff, although a few numbers in that range are reserved for purposes other than character data. How are those 21-bit numbers stored in memory? That's what UTF-8 and UTF-16 are all about.

UTF-8 and UTF-16 are two ways to represent the Unicode numeric codes in memory. UTF-8 encodes a Unicode ID number in a variable-length sequence of one to four bytes. UTF-16 encodes most characters in 16 bits, and some in 32 bits.

For a summary of the Unicode character set and examples of the UTF-8 and UTF-16 storage formats, see Unicode_In_5_Minutes.

LSO-compiled scripts store strings internally in UTF-8 format and Mono uses UTF-16, but all that should be transparent to the script for most purposes. The main impact on the script and scripter is the amount of memory used. Globally scoped strings use one byte per character in LSO and two bytes per character in Mono, but that is true only for strings containing nothing but 7-bit ASCII characters. When a string consists mostly of international characters outside the ASCII range, the memory consumption for UTF-8 in LSO and UTF-16 in Mono will both be close to two bytes per character, and the UTF-8 form can even be longer when many characters require the three-byte or four-byte UTF-8 encodings (CJK characters, for example, take three bytes each in UTF-8 but only two in UTF-16).

So, to summarize:

If we use only the ASCII character set and compile with Mono, we can reduce the memory requirements to near LSO levels through clever encoding. Otherwise, any attempt to compress the text will probably cost more in added code space than it saves in string memory.

ASCII text compression in Mono

This technique applies only if your script uses nothing but ASCII characters and only if it is compiled with Mono. We'll take the ASCII characters two at a time, convert them to their Unicode numeric ID codes (which for these characters are identical to their ASCII codes), combine them into a 14-bit integer, add a bias so that the resulting numbers all fall inside a valid range of Unicode characters, then convert that into a single Unicode character. The result is a 16-bit Unicode character that bears no resemblance to the two ASCII characters it encodes.
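For example, take the two-character string "Hi": 'H' is 0x48 and 'i' is 0x69, so the packed 14-bit value is (0x48 << 7) | 0x69 = 0x2469. Adding a bias of 0x1000 (the offset assumed throughout the sketches below; any offset that keeps every result inside a valid character range would do) gives 0x3469, so the pair "Hi" is stored as the single character U+3469.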

The decoding process is just the reverse — convert each Unicode character back into its numeric value, subtract the bias to recover the 14-bit integer, split it into two 7-bit numbers, and convert those back into a string of two characters.

For encoding, we need a couple of functions — one to convert a single Unicode character to its numeric ID, and one to convert a 14-bit number into a single Unicode character. Here's the former:
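The listing below is a sketch of such a function. The name charToUnicodeIdNumber is illustrative; the bit manipulation assumes the UTF-8 bytes are packed into the high end of the integer returned by llBase64ToInteger(), and it handles only the one-, two-, and three-byte UTF-8 forms (i.e. code points up to 0xffff):

integer charToUnicodeIdNumber(string c)
{
    // llStringToBase64() always converts its argument to UTF-8 first,
    // so this decodes the UTF-8 bytes packed into the top of the integer.
    integer cInt = llBase64ToInteger(llStringToBase64(c));

    if (!(cInt & 0x80000000)) {
        // one-byte (ASCII) form: 0xxxxxxx
        cInt = cInt >> 24;
    } else if ((cInt & 0xe0000000) == 0xc0000000) {
        // two-byte form: 110xxxxx 10xxxxxx
        cInt = ((cInt & 0x1f000000) >> 18) |
               ((cInt & 0x003f0000) >> 16);
    } else {
        // three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
        cInt = ((cInt & 0x0f000000) >> 12) |
               ((cInt & 0x003f0000) >> 10) |
               ((cInt & 0x00003f00) >>  8);
    }
    return cInt;
}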

This works because the function llStringToBase64() converts the character c into UTF-8 format first, then into base64 encoding. This isn't documented well, so let's be clear about it — regardless of whether the script runs under LSO, where text is stored as UTF-8, or under Mono, where text is stored as UTF-16, llStringToBase64() returns a base64 encoding of the UTF-8 form of its argument. For example, llStringToBase64("é") returns "w6k=", the base64 form of the two UTF-8 bytes 0xC3 0xA9, under both VMs.

This function is very similar to the function

integer UTF8ToUnicodeInteger(string input);

found in Combined_Library. I'm offering my own version here for two reasons: (1) I've named the function to emphasize that it takes a single character as its string input, and to drop "UTF-8" from the name, because UTF-8 appears in this function only as a necessary intermediate encoding; the function works the same under LSO and Mono, so the internal encoding of the input argument stays transparent. (2) My version is a little easier to read (at the expense of being a few bytes longer).

And just for completeness, if you need to revise the function to also work with Unicode ID codes above 0xffff, the additional else clause looks like this:
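Here is a sketch of that extension, using the same byte-packing assumptions as above: the final else in the sketch becomes an explicit test for the three-byte form, and a new else clause decodes the four-byte form.

    } else if ((cInt & 0xf0000000) == 0xe0000000) {
        // three-byte form: 1110xxxx 10xxxxxx 10xxxxxx
        cInt = ((cInt & 0x0f000000) >> 12) |
               ((cInt & 0x003f0000) >> 10) |
               ((cInt & 0x00003f00) >>  8);
    } else {
        // four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        cInt = ((cInt & 0x07000000) >>  6) |
               ((cInt & 0x003f0000) >>  4) |
               ((cInt & 0x00003f00) >>  2) |
                (cInt & 0x0000003f);
    }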

Next, we need a way to combine two 7-bit character codes into a single Unicode character. We'll simply combine two numbers to make a 14-bit integer, then convert that to a Unicode character using the function encode15BitsToChar() found in User:Becky_Pippen/Numeric_Storage:
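A sketch of the encoder, assuming the charToUnicodeIdNumber() helper above and encode15BitsToChar() from Numeric_Storage (the name compressAscii and the trailing-space padding for odd-length input are this sketch's own choices):

string compressAscii(string s)
{
    // Pad with a space so the length is even -- one output character per input pair
    integer len = llStringLength(s);
    if (len % 2) {
        s += " ";
        ++len;
    }

    string result;
    integer i;
    for (i = 0; i < len; i += 2) {
        // Pack two 7-bit ASCII codes into one 14-bit number, then let
        // encode15BitsToChar() add the bias and emit one Unicode character.
        integer pair = (charToUnicodeIdNumber(llGetSubString(s, i, i)) << 7)
                     |  charToUnicodeIdNumber(llGetSubString(s, i + 1, i + 1));
        result += encode15BitsToChar(pair);
    }
    return result;
}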

For decoding, we'll use the function decodeCharTo15Bits() found in User:Becky_Pippen/Numeric_Storage to get our 14-bit number, then split that into two 7-bit numbers, then use llUnescapeURL() to turn those into two characters in a string.
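A sketch of the decoder, assuming decodeCharTo15Bits() from Numeric_Storage already subtracts the bias and returns the raw 14-bit value; the names uncompressAscii and hexByte are illustrative:

string hexByte(integer b)
{
    // Two-digit hex for a value 0-255, for building "%xx" escapes
    string digits = "0123456789ABCDEF";
    return llGetSubString(digits, b >> 4, b >> 4) +
           llGetSubString(digits, b & 0xf, b & 0xf);
}

string uncompressAscii(string s)
{
    string result;
    integer len = llStringLength(s);
    integer i;
    for (i = 0; i < len; ++i) {
        // Recover the 14-bit value, then split it back into two 7-bit ASCII codes
        integer pair = decodeCharTo15Bits(llGetSubString(s, i, i));
        result += llUnescapeURL("%" + hexByte(pair >> 7) +
                                "%" + hexByte(pair & 0x7f));
    }
    return result;
}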

Code Example

First, we need to run a benchmark to measure memory usage the normal, uncompressed way. Simply put a bunch of ASCII text into a notecard and drop it into a prim along with a reader script. The script concatenates all the notecard lines into a single global string named bigText. I used the text from http://www.gutenberg.org/files/6274/6274.txt as the notecard text. It runs out of memory after saving 1063 notecard lines.
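A minimal sketch of such a reader script follows. It reads the first notecard in the prim's inventory line by line, appends each line to the global bigText, and chats the line count and free memory as it goes so you can see how far it got (the chat format is illustrative; when the script finally runs out of memory it crashes with a stack-heap collision):

string bigText;     // all notecard lines concatenated here
integer lineNum;    // next notecard line to request

default
{
    state_entry()
    {
        lineNum = 0;
        llGetNotecardLine(llGetInventoryName(INVENTORY_NOTECARD, 0), lineNum);
    }

    dataserver(key queryId, string data)
    {
        if (data != EOF) {
            bigText += data;
            ++lineNum;
            llOwnerSay((string)lineNum + " lines, free memory: " +
                       (string)llGetFreeMemory());
            llGetNotecardLine(llGetInventoryName(INVENTORY_NOTECARD, 0), lineNum);
        } else {
            llOwnerSay("Done. Saved " + (string)lineNum + " lines.");
        }
    }
}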

Now we can compare — here's the same notecard reader script using ASCII compression. Even though the added compression functions consume about 3K of program space, we can now store 1854 lines. That's 74% more text using compression.
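A sketch of the change: the script is the same as the benchmark above except that each line is run through compressAscii() before being stored (to read the text back later, each stored chunk goes through uncompressAscii()).

    dataserver(key queryId, string data)
    {
        if (data != EOF) {
            // The only change from the benchmark: store the compressed form
            bigText += compressAscii(data);
            ++lineNum;
            llOwnerSay((string)lineNum + " lines, free memory: " +
                       (string)llGetFreeMemory());
            llGetNotecardLine(llGetInventoryName(INVENTORY_NOTECARD, 0), lineNum);
        } else {
            llOwnerSay("Done. Saved " + (string)lineNum + " lines.");
        }
    }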