Bulat Ziganshin <bulatz at HotPOP.com> writes:
> MQK> It should be possible to use iconv for recoding. Iconv works on
> MQK> blocks and it should not be applied to one character at a time.
> recoding don't need any startup.
Calling iconv (or other similar routine) does need startup. And you
really don't want to reimplement all encoders/decoders by hand in
Haskell.
Processing a stateful encoding takes time to pick up the state and
convert the materialized state into the form used during recoding.
Dispatching to the encoding function (usually not known statically)
takes time. When we generically convert an encoder which fails on
invalid data into an encoder which replaces invalid data with U+FFFD
or question marks, setting up exception handlers takes time. Each of
these costs is small, but they can be paid once per block instead of
once per character.
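
A minimal sketch of turning a failing decoder into a replacing one,
block-wise. Everything here is an assumption made for illustration:
the BlockDecoder type, asciiStrict and decodeReplacing are hypothetical
names, and failure is modelled by returning the undecoded rest rather
than by real exception handlers. The wrapper is entered once per block
and only loops again at an invalid byte.

```haskell
import Data.Word (Word8)

-- Hypothetical block decoder: decodes as much as it can and returns the
-- decoded prefix together with the undecoded rest (empty on success).
type BlockDecoder = [Word8] -> (String, [Word8])

-- A strict ASCII decoder that stops at the first byte >= 0x80.
asciiStrict :: BlockDecoder
asciiStrict bs = (map (toEnum . fromIntegral) ok, rest)
  where
    (ok, rest) = span (< 0x80) bs

-- Generic wrapper: where the strict decoder stops, emit U+FFFD,
-- skip one byte, and resume decoding the remainder.
decodeReplacing :: BlockDecoder -> [Word8] -> String
decodeReplacing dec bs =
  case dec bs of
    (out, [])       -> out
    (out, _ : rest) -> out ++ '\xFFFD' : decodeReplacing dec rest
```

For example, decodeReplacing asciiStrict applied to the bytes
61 FF 62 gives "a", U+FFFD, "b".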
Converting newlines takes time, and it's very similar to character
recoding. It should be done transparently; network protocols often use
CR-LF newlines, and it's painful to remember to output a '\r' before
every newline by hand. It should be done on top of character recoding;
consider UTF-16, where newline conversion works in terms of characters
rather than bytes.
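
To illustrate the layering, here is a small sketch where newline
conversion runs on characters and the result is then encoded. The
names toCRLF, encodeUtf16BE and encodeLine are made up for this
example, and the encoder is an assumption that handles only the Basic
Multilingual Plane (no surrogate pairs). A byte-level filter would go
wrong here, because the byte 0x0A also occurs inside other code units
such as U+010A.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Newline conversion on characters: every '\n' becomes "\r\n".
toCRLF :: String -> String
toCRLF = concatMap (\c -> if c == '\n' then "\r\n" else [c])

-- Simplified UTF-16BE encoder: one 16-bit unit per character, so it is
-- only correct for code points below U+10000 (illustrative assumption).
encodeUtf16BE :: String -> [Word8]
encodeUtf16BE = concatMap unit
  where
    unit c = [fromIntegral (ord c `shiftR` 8), fromIntegral (ord c .&. 0xFF)]

-- Newline conversion sits on top of the character recoding stage.
encodeLine :: String -> [Word8]
encodeLine = encodeUtf16BE . toCRLF
```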
Some conversions can be implemented with tight loops which keep data
in machine registers. The tightness matters when there are many
iterations; loop startup is amortized by buffering.
Buffering can provide arbitrarily far lookahead, arbitrarily long
putback, and checking for end of stream while logically not moving
the current position. But this works only if buffering is the last
stage which changes stream contents.
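
A pure sketch of what such a buffer offers, assuming it holds
characters that have already passed through every content-changing
stage. Buffered, peekN, atEnd, getC and unGet are hypothetical names
for this example, not an existing API.

```haskell
-- Hypothetical buffer of pending characters, already decoded and
-- newline-converted by the earlier stages.
newtype Buffered = Buffered String

-- Lookahead: inspect up to n characters without consuming them.
peekN :: Int -> Buffered -> String
peekN n (Buffered s) = take n s

-- End-of-stream check that does not move the current position.
atEnd :: Buffered -> Bool
atEnd (Buffered s) = null s

-- Consume one character, if any is left.
getC :: Buffered -> Maybe (Char, Buffered)
getC (Buffered [])       = Nothing
getC (Buffered (c : cs)) = Just (c, Buffered cs)

-- Putback: push characters back in front of the pending input.
unGet :: String -> Buffered -> Buffered
unGet xs (Buffered s) = Buffered (xs ++ s)
```

If a recoding stage sat after this buffer, characters put back with
unGet would be recoded a second time, which is why the buffer has to
come last.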
> MQK> Byte streams and character streams should be distinguished in types,
> MQK> preferably by class-constrained parametric polymorphism. In particular
> so that vGetBuf, vGetChar, and getWord32 can't be used at the same
> stream?
You can get bytes from a given byte stream, and get bytes from a
character stream put on top of that byte stream. But if the protocol
mixes bytes with characters and is specified in terms of bytes, it's
probably better to work in terms of bytes, and convert byte strings
to character strings after determining where they end.
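
A sketch of that approach for a hypothetical protocol whose lines are
terminated by CR-LF at the byte level. splitLine and decodeLatin1 are
names invented for this example, and Latin-1 is only a stand-in for
whatever encoding the protocol really specifies.

```haskell
import Data.Word (Word8)

-- Find where the line ends in terms of bytes: split at the first CR-LF.
-- Returns Nothing if no complete line is available yet.
splitLine :: [Word8] -> Maybe ([Word8], [Word8])
splitLine = go []
  where
    go acc (0x0D : 0x0A : rest) = Just (reverse acc, rest)
    go acc (b : bs)             = go (b : acc) bs
    go _   []                   = Nothing

-- Stand-in decoder (assumption): Latin-1 maps each byte to the code
-- point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (toEnum . fromIntegral)

-- Decode to characters only after the byte-level boundary is known;
-- the remaining bytes stay available as raw bytes.
readTextLine :: [Word8] -> Maybe (String, [Word8])
readTextLine bs = do
  (line, rest) <- splitLine bs
  return (decodeLatin1 line, rest)
```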
--
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/