XML and Erlang

I've just written some code to measure how big a parsed document
would be using several different representations. The test document
uses the same DTD I used as an example yesterday, except that a couple
of attributes are really (large) enumerations and I didn't show you those.
There's a factor of about 5.4 between best and worst overall.
But the really important factor is NOT whether you use a general-purpose
XML representation or one tailored to your particular problem,
it's HOW YOU STORE STRINGS.
Now when we are processing XML, it is quite true that much of the
time we actually ignore the strings and just transform one structure
to another structure, so it's the amount of memory we TOUCH that matters,
not the amount of memory we HOLD. But we do have to allocate and fill
in all that memory, and it does have to be reclaimed.
It looks as though the biggest space win for Erlang might be representing
parsed character data and attribute values other than enumeration values
as binaries rather than lists.
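A rough sketch of why that is, written in Python purely as a calculator. The 3 words per list cell is the Erlang cost model used for the measurements in this post; the 2-word header on a packed byte string is an assumption for illustration, not a measured figure for Erlang binaries, and the function names are made up.

```python
WORD_BYTES = 4  # sizes in this post are reported in 32-bit words


def list_string_words(n_chars):
    """Words for a string held as a list of character codes (3 words per cell)."""
    return 3 * n_chars


def packed_string_words(n_bytes, header_words=2):
    """Words for the same text packed one byte per character.

    header_words is an assumed fixed overhead, not a measured value.
    """
    data_words = (n_bytes + WORD_BYTES - 1) // WORD_BYTES  # round up to whole words
    return header_words + data_words


if __name__ == "__main__":
    for n in (5, 20, 200):
        print(n, list_string_words(n), packed_string_words(n))
```

Under these assumptions a 20-character attribute value costs 60 words as a list and 7 words packed, so the gap grows with every character of text in the document.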
The original document was 31707 bytes, excluding the DTD.
Size is reported in 32-bit words.
Language is one of
  Erlang (cost model: [_|_] = 3 words, {X1,...,Xn} = n+2 words),
  Prolog (WAM cost model), or
  Smalltalk (a non-interactive Smalltalk dialect with cost model
    unindexed object = 1 + #slots words,
    indexed object = 2 + #slots words + element space).
  C is just C.
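The Erlang and Smalltalk cost models above are simple enough to write down as functions. This is a sketch in Python with made-up function names; the only assumption beyond the formulas quoted above is that "element space" is the character data rounded up to whole 32-bit words. The WAM model for Prolog is not spelled out in this post, so it is left out.

```python
WORD_BYTES = 4  # 32-bit words


def erlang_cons_words(n_cells=1):
    """[_|_] = 3 words per list cell."""
    return 3 * n_cells


def erlang_tuple_words(n):
    """{X1,...,Xn} = n + 2 words."""
    return n + 2


def smalltalk_unindexed_words(slots):
    """Unindexed object = 1 + #slots words."""
    return 1 + slots


def smalltalk_indexed_words(slots, element_words):
    """Indexed object = 2 + #slots words + element space."""
    return 2 + slots + element_words


def element_words_for_chars(n_chars, bytes_per_char):
    """Assumed element space: character data rounded up to whole words."""
    return (n_chars * bytes_per_char + WORD_BYTES - 1) // WORD_BYTES


if __name__ == "__main__":
    n = 10  # a 10-character string
    print("Erlang list:        ", erlang_cons_words(n))
    print("Smalltalk char=byte:", smalltalk_indexed_words(0, element_words_for_chars(n, 1)))
    print("Smalltalk char=word:", smalltalk_indexed_words(0, element_words_for_chars(n, 4)))
    print("3-field node, Erlang tuple:    ", erlang_tuple_words(3))
    print("3-field node, Smalltalk object:", smalltalk_unindexed_words(3))
```

For a 10-character string this gives 30 words as an Erlang list, 5 words as a Smalltalk char=byte string, and 12 words as a char=word string.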
Elem.rep is generic, meaning that it's like the current Erlang
XML representation in working for _any_ XML with or without
a DTD or schema, or specific, meaning that it is tailored
to this particular DTD. Erlang,specific is basically the
tightly packed "Erl'" version I outlined yesterday.
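To make the generic/specific distinction concrete, here is a hypothetical element sized both ways under the Erlang cost model. Everything in it is an assumption for illustration: the element, its attribute names, and both tuple layouts are invented (they are not the actual "Erl'" layout). The sizing also assumes the per-node figures simply add up, and that atoms and small integers cost nothing beyond the slot that holds them. Python tuples stand in for Erlang tuples, Python lists for Erlang lists, and Python strings for atoms.

```python
def term_words(term):
    """Size of a nested term in words under the Erlang cost model in this post.

    Assumptions: a tuple of n fields costs n + 2 words, each list cell costs
    3 words, and atoms or small integers cost nothing extra.
    """
    if isinstance(term, tuple):
        return len(term) + 2 + sum(term_words(x) for x in term)
    if isinstance(term, list):
        return sum(3 + term_words(x) for x in term)
    return 0  # atom or small integer: lives in the slot that refers to it


# Generic layout: works for any element, attributes kept as a name/value list.
generic = ("element", "note", [("id", "n1"), ("kind", "memo")], [])

# Specific layout: the DTD says which attributes exist, so each gets a fixed slot.
specific = ("note", "n1", "memo", [])

if __name__ == "__main__":
    print("generic: ", term_words(generic))
    print("specific:", term_words(specific))
```

With these invented layouts the generic form costs 20 words and the specific form 6. The saving comes entirely from dropping the attribute list and the name/value pairs, which the DTD makes redundant.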
String rep is string=atom or string=list (of integers) for Erlang
and Prolog, char=byte (1 byte per Latin-1 character) or char=word
(4 bytes per 21-bit Unicode character) for Smalltalk, or, for C,
"my DVM2 library"
which uses UTF8 + unique storage. The string=atom case
is a useful approximation to what a string=binary
representation would cost.
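The general idea behind "UTF8 + unique storage" can be sketched as a string pool that keeps one UTF-8 copy of each distinct string and hands out small identifiers. This is not the DVM2 library's code or API; the class and method names are hypothetical. It also shows why atoms make a fair stand-in: an atom table likewise stores each distinct name once.

```python
class StringPool:
    """Stores each distinct string once, as UTF-8 bytes (illustrative only)."""

    def __init__(self):
        self._ids = {}   # text -> identifier
        self._data = []  # identifier -> UTF-8 encoded bytes

    def intern(self, text):
        """Return the identifier for text, storing its bytes on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._data)
            self._data.append(text.encode("utf-8"))
        return self._ids[text]

    def lookup(self, ident):
        """Recover the original text from an identifier."""
        return self._data[ident].decode("utf-8")

    def stored_bytes(self):
        """Total bytes of character data held, counting each string once."""
        return sum(len(b) for b in self._data)


if __name__ == "__main__":
    pool = StringPool()
    ids = [pool.intern(s) for s in ["memo", "draft", "memo", "memo"]]
    print(ids, pool.stored_bytes())
```

Repeated attribute values and element names cost one identifier per occurrence plus a single stored copy, which is where the saving over per-occurrence lists comes from.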
  Size (words)  Language   Elem.rep  String rep
  ------------  ---------  --------  ---------------
         11779  Smalltalk  specific  char=byte
         13384  Prolog     specific  string=atom
         15912  Smalltalk  generic   char=byte
         16151  Erlang     specific  string=atom
         17820  Prolog     generic   string=atom
         18673  C          generic   my DVM2 library
         22735  Erlang     generic   string=atom
         29343  Smalltalk  specific  char=word
         34583  Smalltalk  generic   char=word
         53101  Prolog     specific  string=list
         54918  Erlang     specific  string=list
         59752  Prolog     generic   string=list
         63441  Erlang     generic   string=list
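The ratios behind the claim that string storage matters more than element representation can be checked directly from these figures. A small Python sketch; the data is copied from the table, and the helper names are made up.

```python
# (language, elem.rep, string rep) -> size in 32-bit words, from the table
SIZES = {
    ("Smalltalk", "specific", "char=byte"): 11779,
    ("Prolog", "specific", "string=atom"): 13384,
    ("Smalltalk", "generic", "char=byte"): 15912,
    ("Erlang", "specific", "string=atom"): 16151,
    ("Prolog", "generic", "string=atom"): 17820,
    ("C", "generic", "my DVM2 library"): 18673,
    ("Erlang", "generic", "string=atom"): 22735,
    ("Smalltalk", "specific", "char=word"): 29343,
    ("Smalltalk", "generic", "char=word"): 34583,
    ("Prolog", "specific", "string=list"): 53101,
    ("Erlang", "specific", "string=list"): 54918,
    ("Prolog", "generic", "string=list"): 59752,
    ("Erlang", "generic", "string=list"): 63441,
}

ORIGINAL_BYTES = 31707  # size of the source document, excluding the DTD
WORD_BYTES = 4


def ratio(a, b):
    """How many times larger representation a is than representation b."""
    return SIZES[a] / SIZES[b]


def expansion(key):
    """Parsed size in bytes divided by the size of the original document."""
    return SIZES[key] * WORD_BYTES / ORIGINAL_BYTES


if __name__ == "__main__":
    print("worst / best:", max(SIZES.values()) / min(SIZES.values()))
    for rep in ("generic", "specific"):
        print("Erlang", rep, "list / atom:",
              ratio(("Erlang", rep, "string=list"), ("Erlang", rep, "string=atom")))
    for s in ("string=atom", "string=list"):
        print("Erlang", s, "generic / specific:",
              ratio(("Erlang", "generic", s), ("Erlang", "specific", s)))
    print("Erlang generic list vs source:",
          expansion(("Erlang", "generic", "string=list")))
```

For Erlang, switching from lists to atoms shrinks the generic form by a factor of about 2.8 and the specific form by about 3.4, while switching from generic to specific gains only about 1.4 with atoms and about 1.16 with lists. The Erlang generic string=list form is about 8 times the size of the source text.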
"Honesty is praised and starves." -- Juvenal