Cliff Hacks Things.

Saturday, April 01, 2006

Resuming exceptions in Mongoose

Since the initial rev of Mongoose's Signal framework (the basis for its exception handling), exceptions have supported resumption. You can send the message #resume to an exception and, if it supports resumption, execution will continue as though the exception had not been signaled.

Generally, of course, resuming after an exception is a bad idea, and most exceptions don't support it. A few do, however; one of these is EncodingException.

EncodingException is signaled by a CharacterDecoder object when it encounters invalid data in its input. For example, the UTF8Decoder will signal an EncodingException if it encounters truncated, invalid, or overlong sequences in the input.

CharacterDecoders agree (in their interface) that if the EncodingException is resumed, they will insert the Unicode replacement character (U+FFFD) in their output and attempt to continue decoding.
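For comparison, Python's built-in codecs offer a rough analogue of this contract, though the policy is chosen up front through the `errors` argument rather than by resuming an exception. The sketch below is Python, not Mongoose, and the sample bytes are made up for illustration.

```python
data = b"abc\xffdef"  # 0xFF can never appear in well-formed UTF-8

# Strict decoding: the error propagates and the decode is abandoned.
try:
    data.decode("utf-8")  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError as ex:
    strict_result = f"failed at byte {ex.start}"

# Lenient decoding: insert U+FFFD for the bad input and keep going.
replaced = data.decode("utf-8", errors="replace")

print(strict_result)               # failed at byte 3
print(replaced == "abc\ufffddef")  # True
```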

So, let's look at two ways of handling a malformed UTF-8 byte sequence, taken from the UTF8Decoder test suite:

In an upcoming revision, you'll be able to provide a character or character sequence to use in place of invalid input, and have some control over overlong encodings specifically.

(Readers might note some subtle syntax changes in the code fragments above. The Mongoose syntax is evolving as we build out the standard libraries; stay tuned.)

Update: god, I love this language. The enhancements are in place, in under a dozen lines of code.

If you catch an EncodingException ex, your options are as follows:

ex resume will resume decoding. If the exception was signaled due to an invalid byte sequence, that sequence becomes the Unicode REPLACEMENT CHARACTER (U+FFFD) in the output. If the exception was signaled due to an overlong encoding, the sequence is decoded as if it were valid. (This replicates the behavior of most (broken) UTF-8 decoders.)

ex resumeAndSkip will resume decoding; whatever input caused the exception will simply be ignored, and decoding will resume after it.

ex resumeAndSubstitute: someCharacter does exactly what it sounds like: substitutes a character of your choosing for the invalid input. So, to be like most Unix libraries, you can substitute '?'.
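The three options map loosely onto error policies in Python's codecs, which may help if you want to experiment with the same behaviors outside Mongoose. This is a sketch in Python, not Mongoose code; the handler name "question-mark" and the function `substitute_question_mark` are my own inventions. One difference worth noting: Python's UTF-8 decoder always treats overlong encodings as invalid, so there is no counterpart here to decoding them as if they were valid.

```python
import codecs


def substitute_question_mark(ex):
    # A codec error handler receives the exception and returns
    # (replacement text, position at which to resume decoding).
    if isinstance(ex, UnicodeDecodeError):
        return ("?", ex.end)
    raise ex


# Register the handler under a made-up name so decode() can find it.
codecs.register_error("question-mark", substitute_question_mark)

data = b"ab\xffcd"  # one invalid byte in the middle

replaced = data.decode("utf-8", errors="replace")           # like ex resume
skipped = data.decode("utf-8", errors="ignore")             # like ex resumeAndSkip
substituted = data.decode("utf-8", errors="question-mark")  # like ex resumeAndSubstitute: '?'

print(replaced == "ab\ufffdcd")  # True
print(skipped)                   # abcd
print(substituted)               # ab?cd
```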

Name: Cliff Biffle