The Codec class describes a mapping between UString and anything else.

Unicode is used as the native character set and encoding in Warehouse. All other encodings are mapped to or from it: to Unicode when, for example, parsing a mail message, and from Unicode when storing data in the database (as UTF-8).

A Codec is responsible for one such mapping. The Codec class also contains a factory to create an instance of the right subclass based on a name.
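A name-based factory of this kind can be sketched roughly as follows. This is only an illustration: SketchCodec, its two subclasses and codecByName() are hypothetical names invented for the example, and the real Codec factory may use a different signature and will recognise far more names and aliases.

```cpp
#include <algorithm>
#include <cctype>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real Codec and its subclasses differ.
class SketchCodec {
public:
    explicit SketchCodec(std::string name) : n(std::move(name)) {}
    virtual ~SketchCodec() = default;
    const std::string& name() const { return n; }
private:
    std::string n;
};

class SketchUtf8Codec : public SketchCodec {
public:
    SketchUtf8Codec() : SketchCodec("UTF-8") {}
};

class SketchLatin2Codec : public SketchCodec {
public:
    SketchLatin2Codec() : SketchCodec("ISO-8859-2") {}
};

// Returns a codec for the given character set name, compared
// case-insensitively, or a null pointer if the name is unknown.
std::unique_ptr<SketchCodec> codecByName(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    if (name == "utf-8")
        return std::make_unique<SketchUtf8Codec>();
    if (name == "iso-8859-2")
        return std::make_unique<SketchLatin2Codec>();
    return nullptr;
}
```

In this sketch the caller has to check for a null pointer before using the result, since an unknown name yields no codec at all.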

The source code for the codecs includes a number of generated files, e.g. the list of MIME character set names and the map from Unicode to ISO-8859-2. We choose to regard them as source files, because we may want to sever the link between the upstream data and our version. For example, if the upstream data is updated, we may or may not want to follow along.

Codec::Codec( const char * cs )

Constructs an empty Codec for character set cs, setting its state to Valid.

The construction of a codec sets it to its default state, whatever that is for each codec.

Returns a codec likely to describe the encoding for s. This uses word lists and many other strategies.

Its assumptions:

If s contains a Unicode Byte Order Mark, it probably is a UTF-16BE or UTF-16LE string.

If s is a Russian string, it probably contains lots of common Russian words, and we can identify the character encoding by scanning for the KOI8-R and ISO-8859-5 forms of some common words. Ditto for other languages.

If s uses typical Windows punctuation and is mostly ASCII, it's in a typical Windows encoding.

This function is a little slower than it could be, since it creates a largish number of short EString objects.
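Two of the assumptions above, the byte order mark check and the common-word scan, can be illustrated with a small sketch. It is not the actual detection code: guessEncoding() and countOccurrences() are hypothetical helpers, std::string stands in for EString, only one Russian word ("что") is scanned for, and the function returns an encoding name rather than a Codec.

```cpp
#include <cstddef>
#include <string>

// Counts how often word occurs in s (overlapping matches included).
std::size_t countOccurrences(const std::string& s, const std::string& word) {
    std::size_t n = 0;
    for (std::size_t at = s.find(word); at != std::string::npos;
         at = s.find(word, at + 1))
        ++n;
    return n;
}

// Hypothetical helper: returns a guessed encoding name for s, or an
// empty string if neither heuristic applies.
std::string guessEncoding(const std::string& s) {
    // A leading byte order mark suggests UTF-16.
    if (s.size() >= 2) {
        unsigned char a = static_cast<unsigned char>(s[0]);
        unsigned char b = static_cast<unsigned char>(s[1]);
        if (a == 0xFE && b == 0xFF)
            return "UTF-16BE";
        if (a == 0xFF && b == 0xFE)
            return "UTF-16LE";
    }

    // The same common Russian word in two encodings.
    const std::string koi8 = "\xDE\xD4\xCF"; // "что" in KOI8-R
    const std::string iso5 = "\xE7\xE2\xDE"; // "что" in ISO-8859-5

    std::size_t k = countOccurrences(s, koi8);
    std::size_t i = countOccurrences(s, iso5);
    if (k > i)
        return "KOI8-R";
    if (i > k)
        return "ISO-8859-5";
    return "";
}
```

A realistic detector would scan for many words in many languages and weigh the evidence; the sketch only shows the shape of the idea.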

void Codec::recordError( uint pos )

Records that at octet index pos in input, an error happened and no code point could be found. This also sets the state() to Invalid.

void Codec::recordError( uint pos, uint codepoint )

Records that codepoint (at octet index pos) is not valid and could not be converted to Unicode. This also sets the state() to Invalid.

void Codec::reset()

This virtual function resets the codec. After calling reset(), the codec again reports that the input was wellformed() and valid(), and any codec state must have been set to the default state.

void Codec::setState( State st )

Sets the codec's state to st, which is one of Valid, BadlyFormed and Invalid.

Valid is the initial setting, and means that the Codec has seen only valid input. BadlyFormed means that the Codec has seen something it did not like, but was able to determine the meaning of that input. Invalid means that the Codec has seen input whose meaning could not be determined.

State Codec::state() const

Returns the current state of the codec, reflecting the codec's input up to this point.

This pure virtual function maps s from the codec's encoding to Unicode, and returns a UString containing the result.

Reimplementations are expected to handle errors only by calling setState(). Each reimplementation is free to recover as seems suitable for its encoding.
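As an illustration of that contract, here is a minimal sketch of a decoder that reports errors only through setState() and picks its own recovery. SketchAsciiCodec is a hypothetical, free-standing class rather than a real Codec subclass, std::u32string stands in for UString, and substituting U+FFFD is just one possible recovery strategy.

```cpp
#include <string>

// Hypothetical stand-in for a Codec subclass; std::u32string plays the
// role of UString and std::string the role of the input octets.
class SketchAsciiCodec {
public:
    enum State { Valid, BadlyFormed, Invalid };

    State state() const { return st; }
    void setState(State s) { st = s; }

    // Maps 7-bit input to Unicode. An octet above 127 has no meaning in
    // US-ASCII, so the codec sets its state to Invalid and recovers by
    // emitting U+FFFD REPLACEMENT CHARACTER and carrying on.
    std::u32string toUnicode(const std::string& s) {
        std::u32string result;
        for (char c : s) {
            unsigned char octet = static_cast<unsigned char>(c);
            if (octet < 128) {
                result += static_cast<char32_t>(octet);
            } else {
                setState(Invalid);
                result += U'\uFFFD';
            }
        }
        return result;
    }

private:
    State st = Valid;
};
```

The decoder never throws or aborts on bad input; the caller learns about problems by inspecting the state afterwards.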

bool Codec::valid() const

Returns true if this codec's input has not yet seen any syntax errors, and false if it has.

bool Codec::wellformed() const

Returns true if this codec's input has so far been well-formed, and false if not. The definition of wellformedness is left to each subclass. As general guidance, to be wellformed, the input must avoid features that are discouraged or obsoleted by the relevant standard.
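The relationship between the three states, recordError(), reset() and the two predicates can be sketched as below. This is one plausible reading of the descriptions above, not the actual implementation: SketchCodecState is a hypothetical class, the mapping of valid() and wellformed() onto State is an assumption, and so are the stored error positions and their clearing in reset().

```cpp
#include <vector>

// Hypothetical model of the state handling described above.
class SketchCodecState {
public:
    enum State { Valid, BadlyFormed, Invalid };

    State state() const { return st; }
    void setState(State s) { st = s; }

    // Assumed reading: only input whose meaning could not be
    // determined makes valid() false.
    bool valid() const { return st != Invalid; }

    // Assumed reading: anything short of Valid is not well-formed.
    bool wellformed() const { return st == Valid; }

    // Notes the octet index of an undecodable input and marks the
    // codec Invalid.
    void recordError(unsigned int pos) {
        errors.push_back(pos);
        setState(Invalid);
    }

    // Returns to the default state (assumed to forget recorded errors).
    void reset() {
        errors.clear();
        setState(Valid);
    }

    const std::vector<unsigned int>& errorPositions() const { return errors; }

private:
    State st = Valid;
    std::vector<unsigned int> errors;
};
```

Under this reading a BadlyFormed codec is still valid() but no longer wellformed(), which matches the idea that its input was disliked yet understood.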

Codec::~Codec()

Destroys the Codec.

This web page is based on source code belonging to The Archiveopteryx Developers. All rights reserved.