Service C++ functions and classes
"advanced" i/o, (arithmetic) compression, and networking
$Id: README,v 2.7 2005/06/26 23:04:50 oleg Exp oleg $
***** Platforms
***** Verification files
***** Highlights and idioms
---- Extended file names
---- Explicit Endian I/O of short/long integers
---- Reading and writing of floating-point numbers
---- Stream sharing
---- Simple variable-length coding of short integers
---- Arithmetic compression of a stream of integers
---- TCP streams
---- TCP transactor, a shell RPC-like tool
---- Logging Service
---- Convenience Functions
---- Portability Tips
***** Grand plans
***** Revision history
***** Comments/questions/problem reports/etc
are all very welcome. Please send them to me at
oleg -at- pobox.com or oleg -at- okmij.org
http://pobox.com/~oleg/ftp/
***** Platforms
I have personally compiled and tested this package on the
following platforms:
i686/FreeBSD 4.9, gcc 3.2 and gcc 2.95.2
i686/Linux 2.4.21, gcc 3.2.3
I have received reports that the library compiles and tests
on Fedora Core 2 with GCC 3.3.3 and 3.4.0.
The previous (2.6) version also works on
Sun Ultra-2/Solaris 2.6, gcc 2.95.2
i686/FreeBSD 4.0, gcc 2.95.2
i686/Linux 2.2.14, gcc 2.95.2
WinNT, Visual C++ 6.0
BEOS R4b4
SunSparc20/Solaris 2.4, gcc 2.7.2, libg++ 2.7.1
SunSparc20/Solaris 2.3, SunPro C++ compiler
HP 9000/{750,770,712}, HP/UX 9.0.5, 9.0.7 and 10.0,
gcc 2.7.2, libg++ 2.7.2
PowerMac 7100/80, 8500/132,
Metrowerks CodeWarrior C++, v. 7 - 11
Intel, Windows95, Borland C++ 4.5/5.0
(the binaries then ran under Windows NT 4.0 beta)
I know that the package also works on DEC Alpha
and Concurrent Maxion 8000/RTU 6.2V25 (all with gcc 2.7.2 compiler)
***** Verification files: vmyenv, vendian_io, vendian_io_ext,
vhistogram, varithm, vTCPstream, vTCPstream_server, vvoc
Don't forget to compile and run them; see the comments in the Makefile
for details. The verification code checks that all the functions in
this package compile and run correctly. The code can also serve as an
example of how the package's classes and functions can be used.
For each verification executable, the distribution includes a
corresponding *.lst file, containing the output produced by that
validation code on one particular platform (Sun Ultra-2/Solaris 2.6
to be precise). You can use these files for reference or as a base
for regression tests.
***** Highlights and idioms
---- Extended file names
The package adds support for "extended" file names: file names that
contain a pipe symbol ('|') in a leading or trailing position,
or start with a "tcp://" or "ltcp://" prefix. These "files"
can be opened for reading, writing or even reading _and_ writing.
EndianIn istream;
istream.open("gunzip < /tmp/aa.gz |");
EndianOut stream("| compress > /tmp/aa.Z");
image.write_pgm("| xv -");
FILE * fp = fopen("tcp://localhost:7","r");
fstream fp("| cat | cat",ios::in|ios::out);
The "pipes" can be uni- or bi-directional. "Piped" filenames are
actually commands that are passed to a '/bin/sh', which is launched
in a subprocess. The process' stdin, stdout or both are plumbed to
a pipe or a bidirectional socket, which is returned to the user
as a "file descriptor". The code in vendian_io.cc shows many examples
of using the various extended file names.
This extension is implemented on the lowest possible level, right
before the request to open a file goes to the OS. A function sys_open()
(in a source file sys_open.c) acts as a "patch": that is, if you
call sys_open() instead of open() to open a file, you get all
the open() functionality plus the extended file names.
The Makefile contains some "black magic" that shows how to effectively
"substitute" the standard open(2) function with sys_open(), without
changing any of the system code. The substitution is completely safe
and does not require any extra privileges or permissions beyond what a
regular user already has. With this substitution in place, *no matter*
how you open a file -- with open(), fopen(), fstream(), etc -- you can
submit extended file names and enjoy their functionality.
---- Explicit Endian I/O of short/long integers
EndianOut stream("/tmp/aa");
stream.set_littlendian();
stream.write_long(1);
That is, the value 1 will be written as a long integer with the least
significant byte first, NO MATTER which computer (computer
architecture) the code is running on. Using an explicit endian
specification as above is the only way to ensure portability of
binary files containing arithmetic data.
Note it is perfectly appropriate to pass, say, -1 or any other signed
integer to write_short() even if write_short() was declared to take an
unsigned short. Any signed number can be transformed into the
corresponding unsigned number without any loss of precision or
range. You can use a typecast if your compiler wants it. The reverse
transformation, unsigned->signed is not generally possible (say, 32768
cannot be represented as a signed short). Still, if we know that we
wrote a signed integer, we are justified in demanding a signed number
back, e.g.,
const short exponent = (signed short)read_short("reading exp");
The cast is really necessary here. The methods read_short()/write_short()
are intentionally made to take or return unsigned numbers. This is
to emphasize that these methods are to operate on 16-bit chunks. They move
16-bit quantities without assigning any particular meaning to them.
It is the user who provides all interpretation, by using typecasts.
---- Reading and writing of floating-point numbers
It is certainly possible to use EndianIO to read/write floating
point numbers in a portable way. Although EndianIn/EndianOut streams
currently support reading/writing of only integers, every FP number
can be split into exponent/mantissa parts, and reconstructed from
them, in a portable, platform-independent way. ANSI C/POSIX specify
functions frexp(), ldexp() and modf() for that purpose. See functions
write_double() and read_double() in file vendian_io.cc as an example.
These functions transfer floating-point numbers without any loss of
precision. Chances are however that a particular application does not
require the full precision. If you can afford to lose some of it,
you can write out the values in a more compact way. For example,
if single precision is enough for you, only the first 24 bits of
mantissa need to be written. BTW, if you can tolerate some loss,
the best strategy would be to scan the array of numbers to write,
determine the min and max values, subtract the min value from all the
elements of the array, and normalize the differences to be in range,
say, [0,255] or [0,65535]. You can use ArithmCodingIn/Out to read/write thus
normalized numbers (taking advantage of the (lossless) compression
built into these c++advio streams).
Because efficient storing and communication of floating point numbers
is so application-specific, the write_double() and read_double() functions
are not made members of EndianIn/EndianOut classes.
---- Stream sharing
EndianIn/Out streams can share the same i/o buffer. This is useful
when one needs to read/write a "stratified" (layered) file consisting
of various variable-bit encoded data interspersed with headers. For
example, a file may begin with a header (telling the total number of
data items, normalization factors) followed by some variable-bit
encoding of items, followed by another header, followed by an
arithmetically compressed stream of data, etc. Like a waffle pie, a
file can be made of many layers, each interpreted by a different
stream, with all the streams sharing the same file and the same file
pointer. The situation is similar to sharing an open file and a file
pointer among parent and child (forked) processes.
Note that a mere opening of a stream on a dup()-ed file handle, or
sync()-ing the stream doesn't cut it entirely. See endian_io.cc for
more discussion. This package implements stream sharing in a safe
and portable way: it works on a Mac and WinNT just as well as on
different flavors of UNIX.
---- Simple variable-length coding of short integers
The code is intended for writing a collection of short integers where
many of them are rather small in value; still, big values can crop up
at times, so we can't limit the size of the encoding to anything less
than 16 bits. The code is a variation of a start-stop code described
in Appendix A, "Variable-length representations of the integers" of the
"Text Compression" book by T.Bell, J.Cleary and I.Witten,
p.290-295. The present code supports both negative and positive
numbers, and features an optimization based on the fact that all
numbers are no larger than 2^15-1 in absolute value, along with an
assumption that most of them are smaller than 512 in absolute value.
---- Arithmetic compression of a stream of integers
The present package provides a clean C++ implementation of Bell,
Cleary and Witten's arithmetic compression code, with a clear
separation between a model and the coder. ArithmCodingIn /
ArithmCodingOut act as i/o streams that encode signed short integers
you put() to, and decode them when you get() them. The
ArithmCodingIn/Out object needs a "plug-in" of a class
Input_Data_Model when the stream is created. The Input_Data_Model
object is responsible for providing the codec with the probabilities
(frequencies) a given data item is expected to appear with, and for
finding a symbol given its cumulative frequency. Input_Data_Model may
also modify itself to account for a new symbol. Thus, the ArithmCoding
class is a sort of 'iostream' class that writes/reads data items
to/from the stream performing encoding/decoding. It relies upon the
Input_Data_Model for the probabilities needed to perform the
arithmetic coding.
The current version of the package provides two Input_Data_Model
plug-ins, both performing adaptive "modeling" of a stream of
integers. The first plug-in uses a simple 0-order adaptive prediction
(like the model given in Witten's book). The other one takes a
histogram to sketch the initial distribution, and is a bit more
sophisticated in updating the model. It is used in compressing a
wavelet decomposition of an image. The code below (taken literally
from varithm.cc) demonstrates how the coder classes are actually used.
The first example writes two different streams (of different patterns,
that's why it was better to encode them separately) into the same file:
EndianOut stream("/tmp/aa");
stream.set_littlendian();
const int sample_header = 12345;
{
AdaptiveModel model(-1,4);
ArithmCodingOut ac(model);
ac.open(stream);
for(i=0; ...; i++)
ac.put(...);
}
... a similar block, with its own model, encodes the second stream ...
---- TCP streams
The package provides TCPStream: an iostream that reads and writes
over a TCP connection. Once connected, the familiar iostream
operations and manipulators can be used to conduct a dialog with the
peer. For example, given an established connection link, a client may
read a server's response like this:
int resp_code;
link >> resp_code;
if( ! link.get(buffer,sizeof(buffer)-1,'\r').good() )
_error("error reading a response line from the link");
if( resp_code >= 300 )
_error("bummer");
...etc...
See also vTCPstream.cc for more examples. This code has been used in
HTTP VFS,
http://pobox.com/~oleg/ftp/HTTP-VFS.html
and in the TCP transactor tool below.
TCP streams are helpful on the server side as well. See vTCPstream_server.cc
for an example.
---- TCP transactor, a shell RPC-like tool
tcp-trans is an application to perform a single transaction --
a request/reply exchange -- with a "server" on the other end of a TCP pipe.
tcp-trans is based on TCPStreams (see above), and shows an example of their
usage. This code establishes a connection to a server, sends a simple
request, listens to the reply and prints it out on its standard output.
This code can then be used to talk to any TCP server (an HTTP daemon,
or an RPC-like service). tcp-trans is particularly useful as a scripting tool
(in sh or other scripts) to talk to TCP daemons. For example,
tcp-trans localhost:80 "GET / HTTP/1.0" ""
will fetch the root web page off the site.
tcp-trans some.host:25 "expn " "quit"
reveals the real person behind the postmaster.
See the title comments in the tcp-trans.cc code for more examples.
---- Logging Service
A trivial service to help log various system activities onto
stderr, or some other log stream or file. One can use it like
Logger() << "Log this message" << "... and this too!" << endl;
Note that endl at the end is not necessary: ~Logger() destructor would
take care of it (provided anything was logged at all).
The Logger class is intended to be as light-weight as possible so that
all the logging operations can be inlined. Other examples:
Logger clog;
clog << "\nConnecting to " << connection_parms.q_host_to_connect() << ':'
<< connection_parms.q_port_to_connect() << endl;
...
const int resp_code = read_response_status_line(link);
clog << "\nreceived response code " << resp_code << endl;
...
Logger() << "soft errors will be re-tried " << max_retries_count
<< " times ";
---- Convenience Functions
The package defines a few functions I found convenient to use, like
message(...) (which is equivalent to fprintf(stderr,...)) and
_error(...) (the same as message(...) followed by abort()). One
doesn't need any special #include to use them.
Also included:
xgetenv() - getenv() with a fall-back clause
get_file_size() - also with a default clause
does_start_with_ci() - an amazingly useful function in input parsing
See vmyenv.cc for examples of their usage.
The validation file vmyenv.cc also illustrates how to catch an abort
condition without crashing the main process (see the macro
must_have_failed()).
---- Portability Tips
Borland C++ 4.5 is sometimes unhappy with the order BitIn, BitOut (in
endian_io.h) and ArithmCodingIn, ArithmCodingOut (in arithm.h) classes
are derived. Right now,
class BitIn : BitIOBuffer, public EndianIn
upsets BC because "RTTI class BitIn being derived from non-RTTI class
BitIOBuffer". I have a hunch that an error like this can be avoided
by tinkering with the C++ compiler options. On the other hand, merely
switching the order of inheritance,
class BitIn : public EndianIn, BitIOBuffer
solves the problem. The same holds for BitOut, ArithmCodingIn, and
ArithmCodingOut.
***** Grand plans
Consider a shared BitIO class that permits switching ArithmCoding
streams freely, w/o the overhead of padding bits. See message
by Erik Kruus, Jun 23, 2000.
***** Revision history
Version 2.7 - Jun 2005
- Compiles with GCC 3.2-3.4
- Added support for the "ltcp://" file name prefix, to open a
listening socket and accept one connection. The code was
contributed by Bernhard Mogens Ege.
- Added an example simple-proxy.cc, which has actually been used
as a simple inetd-like server.
Version 2.6 - Nov 2000
- Added passive open to TCPStream and the corresponding validation
test vTCPstream_server.cc.
- Renamed library libserv.a into libcppadvio.a
- Minute corrections (mainly to make the compiler happier)
Version 2.5 - Jan 2000
- added tcp-trans.cc, a TCP transactor, a shell RPC-like tool.
- a new section on reading and writing of floating-point numbers.
- "renaming" of open() is tested on Solaris and FreeBSD systems
- A user of a TCPStream can affect async i/o and error
call-backs by instantiating and registering a
CurrentNetCallback object.
- sys_open() supports more "extended" file names, which denote
TCP connections and bidirectional pipes.
- validation code (vendian_io.cc) was updated to
test the new functionality (esp. sys_open())
- double pow(long x, long y) and double pow(double x, long y)
- a few minor adjustments to please gcc 2.95/egcs 2.xx
on Linux, FreeBSD, Solaris and BeOS platforms
Version 2.4 - Mar 1998
- a few minor adjustments to please gcc 2.8.1 and Visual C++ 5.0
- added primitive Logging service
- added TCP stream
- extended i/o is done in a more universal way (by "renaming"
of open(2), although no system function is changed)
Version 2.3 - Mar 1997
- added xgetenv(), does_start_with_ci(), get_file_size()
- created vmyenv.cc to validate myenv.h's functions
- a few adjustments (mainly to endian_io.h and arithm.h)
to account for changes in the implementation (and interfaces)
of the C++ iostream library, made in new versions
of libg++ (v. 2.7.2) and Metrowerks CodeWarrior (v. 11).
This brings c++advio closer to the (ever evolving) C++ standard.
- _Vocabulary_ (an embedded language, actually) is now
distributed with the c++advio, see voc.h for more detail.
Version 2.2.3 - Mar 1996
- sys_open.cc now accepts an input pipe with more than one link
as a "file" name
- endian_io.*: added an EndianIOData::unshare() method to break
sharing of a streambuffer (if there was any). This method is intended
for destructors only (it makes the code more portable).
- careful attention to comparisons between signed and unsigned
(mainly to get gcc 2.7.2 to shut up)
- now everything compiles with gcc 2.7.2/libg++ 2.7.1 and
Metrowerks CodeWarrior 8.
- portability tweaks in myenv.h (declaring bool for platforms
that lack it)
- arithm_modadh.*: more logical (and efficient) way of "pulling-to-
the-front" when updating adaptive model frequency counters
by more than 1. Also, the initial distribution is slightly
tweaked. The upshot is that the compression is a tiny bit
better (at least, the algorithm makes more sense).
Version 2.2.1 - Jun 1995
Fixed the last remaining incompatibility glitches. Now, exactly the
same code compiles on a Mac with CodeWarrior 6 and on Unix with gcc
2.6.3
Version 2.2 - May 1995
Added a variable-length (start/stop) coding of signed short integers.
Added dealing with simple histograms of an integer-valued
distribution.
Version 2.1 - Mar 1995
Introducing bool where appropriate (instead of int) and adding checks
to make sure an EndianIn/Out stream was opened successfully.
Version 2.0 - Feb 1995
Big change: splitting EndianIO into EndianIn and EndianOut and
removing all libg++-specific things; everything should be very
portable now. Making sharing of the streambuffer portable.
Version 1.4 - Feb 1994
Updated for libg++ 2.5.3
Version 1.3 - Aug 1993
Introducing attachment of one stream to another, or sharing of a
streambuf among several streams. Took care of properly terminating an
arithm coding stream by writing a few phony bits at the end (so we
won't hit the EOF on reading). Thus it is possible now to concatenate
arithmetic coding streams.
Version 1.2 - Jun 1992
Updated to compile under gcc/g++ 2.2.1 and work with libg++ 2.0. The
first implementation of the arithmetic coding package.
Version 1.1 - Nov 1991 - May 1992
Initial revision