In my previous column, I started
out examining the Standard string class "in the real
world". That column looked primarily at the Standard string
interface. In this column I will get more into the
implementation. Let me preface these remarks with a caveat: it is
not my intent to in any way suggest that there is anything
"wrong" with the Standard string class. For the
vast majority of programmers, the Standard string class
will adequately meet all of their needs, and do so with elegant
efficiency. Nevertheless, as with any other library, there is
simply no such thing as "one size fits all". This
column is about situations that might affect only a small
fraction of C++ users, and then only occasionally. As I hope to
show, even when the implementation of the string class
that comes with your compiler is not what you need, my preferred
solution is to stick with the string specification as
provided by the Standard and provide a more appropriate
implementation.

[Soapbox - On the one hand I
really feel like I should not have to constantly make these
qualifications; most good programmers understand that it is up to
them to determine whether a given function, class, or library is
appropriate to their application or not. On the other hand, I
seem to be running into an awful lot of marginally competent
(shall we say) programmers these days. In particular, there seem
to be a lot of people out there who have nothing but criticism
for the C++ Standard. The C++ Standard is not perfect, but nobody
(at least nobody I know) claims that it is. It is true that in
the end the committee was constrained by time and procedural
rules, but most programmers working in the real world understand
how deadlines and QA requirements can keep you from making last
minute changes to a product. The overall consensus of the
committee remains: "The C++ Standard is the best we could
make it. If we could have agreed on how to make it better, then
we would have made it better." This is especially true for
the Standard C++ Library. In spite of this, there apparently are
a lot of people who seem convinced that had they been on the C++
committee, it would have turned out better. I want to make it
clear that I am not one of them. It is true that I personally
prefer a slightly different interface from the Standard string
class in my own work, but I am aware that there are others whose
opinions differ from mine. I am also aware of how difficult it is
to come up with something as good as the Standard string
class when starting from scratch. The key point is not that the
Standard isnt perfect as is, but that it is an excellent
starting point that most users will find perfectly adequate for
their tasks, and that others can use as a basis for something
more appropriate when necessary. While the pundits and the
amateurs complain about C++s complexity and the size of its
defining Standard, those of us with large-scale problems to solve
in the real world now have at our disposal a programming language
of almost immeasurable power and flexibility. You want C-like
efficiency? Use the C subset. Garbage collection? Link in a
garbage collector. Need multiple inheritance? No problem. Want to
do generic programming? Ditto. Need the flexibility of a symbolic
programming language? You can do it in C++ (though it does take
some work). Yes, C++ has depths of complexity that can require
study and practice to master. In my opinion, the Standard did not
really add a lot of complex features to what was already in the
C++ language -- namespaces being the only one I consider
important (of course, if your older C++ compiler did not support
templates or exceptions (or you were carefully avoiding them)
then you may disagree). The library is another story.
Unfortunately, study and practice seem to be anathema to the vast
majority of programmers out there. For myself, study and practice
of my craft do not bother me; if I wanted to be half-good at
something, I would still be playing the guitar. End Soapbox]

Background

I was recently working on a
financial data server. In that application a financial instrument
was identified by three elements: its type, source,
and symbol. Each of these was represented in network
requests by short strings. It therefore seemed reasonably to
create a class such as the following:

class ItemName {

string _type, _source,
_symbol;

public:

. . .

const string& type() const
{ return _type; }

const string& source()
const { return _source; }

const string& symbor()
const { return _symbor; }

ItemName& type(const
string& s) { _type = s; return *this; }

ItemName& source(const
string& s) { _source = s; return *this; }

ItemName& symbol(const
string& s) { _symbol = s; return *this; }

};

This is pretty straight forward,
and everything worked fine, but lets look a little deeper
into the actual values that this class represents.

The type value is
determined by the software itself (in this case it is actually
determined by a legacy C library). There are about 10 different types
defined, each represented by a three or four character string.
The most common type is "gnrl". The source
value is an identifier for the actual source of the instrument
data. Various codes represent these sources. Since these codes
have to be transmitted over the network to the server, they were
deliberately chosen to conserve bandwidth. For example, the New
York Stock Exchange is coded as the single character
N. Most source codes are 2 to 5 characters in
length. The symbol is the actual assigned symbol for the
instrument. Most of these are short 3-5 character strings.

Now consider the typical string
implementation. Many Standard libraries implement string
using a reference counting scheme similar to that shown in figure
1. The actual string object is just a pointer (often
called a handle) to the data on the free-store. The
Standard makes it clear that string can be implemented
using such a shared data scheme, but it must act as if
every copy of a string were separate. This is often referred to
as "delayed copy" or "copy on write." The
point is that reference counting is an internal optimization that
vendors can, and often do, adopt to improve the efficiency of
copying strings.

The header of the data area often
looks something like this:

struct stringRep {

int refCount;

int length;

int size;

bool flag;

};

On a typical machine with a
32-bit word size, this struct will take up 4 words, or 16
bytes. Most implementations will have at least one
flag field, many will have more than one, but I will
assume that all of the flags can fit into one word. Some
implementations will also include a copy of the allocator
object that was an argument of the string constructor.
Since the Standard string class uses the default allocator,
which in turn is specified to use operator new and operator
delete, it is possible for an implementation to optimize this
field away. For now I will assume that this has been done.

In addition to the stringRep
header, most string implementations will also have a
minimum data allocation size. I have seen this range from as low
as 8 bytes (my own string implementation) to as high as 64 bytes.
Lets assume 16 bytes for now. This means that our
"typical" string class will allocate a minimum
of 32 bytes (StringRep plus data) from the free-store for
even the shortest strings.

The Problem

Now lets return to ItemName.
One of the applications that used ItemName was a local
financial data server. This application was to act as a buffer
for the global financial network. As such, it would receive
requests for instruments and cache the data. An instruments
ItemName was the key by which its data could be requested,
and in turn was the index into the cache. Needless to say, a lot
of ItemName objects were typically in existence within
this application. A typical test run involved a cache of 10,000
instruments. A real world instance of this application was
expected to be able to handle up to 100,000 instruments, with
future versions going as high as 500,000 instruments. Looking
only at the data cache, 10,000 ItemNames used for indexes
implies 30,000 strings. With a typical minimal allocation
of 32 bytes, this implies almost a megabyte of data held in strings.
Not a big deal by itself, but when you consider that the
average length of each of those strings was less than 4
characters, the 700% overhead starts to look pretty bad. It
became unacceptable when we started to consider real-world size
caches and the hardware it would take to support them. Since I
was the C++ / library / STL / middle-ware guru, I was given the
problem of reducing this memory overhead to an acceptable level
(the problem was much larger than ItemName  ItemName
is just the subject of this column).

My first attempt at reducing ItemNames
overhead was aimed at the _type field. Since
there were only a limited number of types I knew I could
share the data representations. I decided that I should be able
to take advantage of the sharing inherent in the string class
- if I just did the copies right. So I changed the type
update function like so:

typedef set<string>
Types;

const Types types =
init_types(); // built at startup

ItemName&
ItemName::type(const string& s) {

Types::const_iterator it =
types.find(s);

check(it != types.end()); //
like assert

_type = *it;

return *this;

}

The idea here is to look up the
new value for the type in a table and then make the
assignment from the value in the table. For the table, I
naturally turned to the STL and used a set<string>.
Assuming the assignment just increments the reference count of
the string in the table, this should mean that every _type
field in the application just points to one of the elements in
the table.

In the world of strange
coincidences, it turned out that I had not had this change in
place 24 hours when we received a new C++ environment for one of
our platforms. I took some time to do an evaluation of this
versions Standard string class and immediately
discovered that their Standard string did not use a
reference counting / copy-on-write scheme. Instead, it was
implemented much like most vector<> implementations,
and made a copy of the data within the object at the point where
a copy was invoked. Initially, I was slightly annoyed at this; I
considered the non-reference counting implementation slightly
archaic#. Nevertheless, I quickly realized that the string
implementation was completely compliant with the Standard, and my
annoyance was better aimed at myself for developing code that
depended on a given implementation. As I noted above, the use of
reference counting is an internal optimization. If you
want portable code, you should only depend upon what is
guaranteed by the Standard and not make any assumptions about the
implementation. In this particular case, if I wanted different string
objects to share the same data, it was up to me to code it so
that they would.

At this point I admit to a brief
period of weakness whereby I fell back into old C habits. In C,
if you want to share common data, you use pointers. So I switched
ItemNames _type member to be a const char*
and its mutator and accessor functions became:

typedef set<const char*,
cstr_less> Types;

const Types types =
init_types();

ItemName&
ItemName::type(const string& s) {

Types::const_iterator it =
types.find(s.c_str());

check(it != types.end());

_type = *it;

return *this;

}

string ItemName::type() const
{

return string(_type);

}

You will note that I had to
change the return type of type() from a const
string& to just a string. You will also note that
the instantiation of the set<> template includes the
template argument cstr_less. This is one of several little
utility functors that I have sitting around in a utility library.
As you would expect, it compares two C-style strings using the strcmp()
function.

With this done, I turned my
attention to the _source field. Like the _type
field, the typical execution run would include only a few
distinct values for _source. Unlike the _type
field, I could not enumerate in advance all the possible
candidate values for source. I needed something a little
more dynamic than I had used for _type. It is true that I
could have just expanded upon the set<const char*>
idea, but that would have meant allocating character arrays and
copying strings whenever a new source value came along. Since new
source values are rare there was no performance issue involved;
my mind simply balked at doing anything so C-like. What I wanted
to do was put strings into a set, and then share the strings
themselves. In essence, I wanted to hold a pointer to the string
in the set<>. I could have used a string*,
but I realized that I already had something just like a
pointer-to-string that I could use instead. ItemName
became:

class ItemName {

typedef set<string>
StringSet;

static StringSet sources;

StringSet::iterator _source;

. . . .

};

ItemName&
ItemName::source(const string& s) {

_source =
sources.insert(s).first;

return *this;

}

const string&
ItemName::source() const {

return *_source;

}

Most people think of iterators as
just being useful for stepping through a collection. Here, I take
advantage of an iterators pointer-like property of being a
reference to something in a collection, and hold on to it. While
the mutator function looks like it inserts every requested source
value into the set, it really doesnt. Since a set<>
can not have duplicates, the insert function returns a pair<iterator,
bool>. The bool element indicates whether the
insert succeeded or not. If the insert succeeded, the iterator
portion of the pair<> references the newly inserted
value. If the insert fails because there is already an element in
the set<> with the same key, the iterator references
the element already in the set<>. In either case,
the iterator returned from insert is what I want, so I ignore the
second element in the pair<> and save the iterator.

Naturally, this scheme works for type
as well, so I went back and changed the _type member
to also be a StringSet::iterator. Since new types
are not allowed, the mutator function remained

ItemName&
ItemName::type(const string& s) {

StringSet::iterator it =
types.find(s);

check(it != types.end());

_type = it;

return *this;

}

With these two changes, I
significantly collapsed the memory footprint used by ItemNames
in an execution. The fact that the allocated data for a type
or source string might still have a considerable
amount of overhead was quite a bit less important when thousands
of such strings were all sharing the same data.

That left symbol. Up to
this point, the "improvements" had been relatively
painless. Clients of ItemName had been forced to recompile
(several times), but there had been no changes to the interface
which required coding changes (this had been an unstated, but
important, requirement of the tuning effort). Furthermore, the
coding effort necessary to realize significantly measurable
improvements had been small. Things might have been left as they
were at this point, but symbol was still a problem. With a
average symbol size being less than 4 characters, and the
typical string allocation to hold those 3.x characters
being 32+ bytes, there was still room for improvement. Again, the
first ideas involved falling back to ordinary C-style strings.
Besides my personal aversion to using C-style strings, I was now
sensitized to the whole idea of free store overhead. While it was
certainly true that

char* p = std::strcpy(new
char[4], "IBM");

would actually use much less
space than

string s("IBM");

it was still likely to use more
than I wanted. Beside the 4 characters of actual data, the
typical heap manager would also include 4-8 bytes for its own
control information. It would also be likely to align allocation
on double-word boundaries. Since I did not consider any of this
heap manager overhead in discussing the string class
overhead, it is not really fair to consider it now, but it is
there. Since "IBM" would fit into the space actually
used by the char*, it seemed a shame to have to put up
with any heap overhead for such short strings.

At this point, lets admit
that a great many programmers would just pick a size (4, 6, or 8
bytes) for a fixed length character buffer and say something like
"that will work for the vast majority of cases" and let
it go at that. Since it probably will work for all the test
cases, it is quite possible that the program could actually be in
users hands before somebody had a problem. Then the
"fix" would be to just make the buffer size bigger.
While there are a lot of advantages in simple solutions, I have
had to fix enough systems designed around such short-sighted
limitations that I dont even consider them acceptable for
in-house products. Of course this is an overwhelming argument for
the use of Standard C++ strings (and the STL): they provide open
ended, general-purpose solutions that are as simple (or even
simpler) to use as the more limited older versions.

The problem in this case was how
to deal with those occasions when the symbol would not fit
into a fixed 4-byte area (or 6, or 8, or whatever). I came up
with a new class that I called cpstring (that was suppose
to stand for Compact String  in retrospect the
name wasnt very good, but I am pretty much stuck with it
now). Its representation looks something like this

class cpstring {

struct Local {

short int len;

char data[6];

};

struct Remote {

short int len;

short int siz;

char* data;

};

union Impl {

local loc;

remote rem;

};

Impl _impl;

public:

. . . .

};

Within the implementation, if _impl.loc.len
< 6 then the actual character data is in _impl.loc.data.
If _impl.loc.len >= 6 then the data is on the heap and
is referenced via _impl.rem.data. I dont show the
details of the cpstring interface because they are exactly
the same as the Standard string -- with some additions
described below. Readers who are interested can get a copy of the
entire implementation from my web site (www.bleading-edge.com). I will mention a few of the
implementation details.

Obviously, there is no reference
counting with cpstring  any copies result in
physical copies being made.

Also, since the length and size
elements are declared to be shorts, the maximum length of
string that can be held in a cpstring is limited to 32767.
While this is usually not a problem, it is probably less than
what a Standard string will support (although there are no
guarantees).

Since the objective of cpstring
is to conserve memory, it tries to keep empty space allocated on
the free-store to a minimum. To hold a string of length l
(where l >= 6), it will allocate a buffer of size l+1$
rounded up to the next double-word boundary. If the string
shrinks, the allocated memory will shrink with it. Naturally, if
the size of the string shrinks below 6 characters, the data is
copied into the object itself and any memory allocated from the
free-store is released. This scheme would make the impl.rem.siz
element redundant were it not for reserve().

The member function reserve()
can be called to allocate a bigger chunk of memory than might
otherwise be needed, but the allocation created by reserve()
will remain only so long as any changes to a string do not cause
its current length to shrink. In other words: you can not do a reserve()
on a cpstring to create a fixed size character buffer that
you can then fill and empty as needed (the reserve()
function of the Standard string class does not guarantee
this kind of behavior either  though many implementations
handle it that way. This is another example where you need to be
careful to not depend upon a given implementation. If you need a
fixed size character buffer, then you should allocate a fixed
size buffer. If you want to manipulate that buffer as a string
then you need another of my string implementations
 one that lets you specify a fixed size data area, or wrap
a string interface around an existing character buffer.
Next column).

To facilitate the use of cpstrings
interchangeably with Standard strings, there is a
converting constructor which takes a const string& and
makes a cpstring. There is also a conversion operator
string() which will make a string from a cpstring.
In addition, most of the functions of cpstring (assign,
append, insert, replace, the search functions, the comparison
functions, and the free comparison operators) also have versions
that take a const string& argument. This allows common
operations such as a == b where one is a cpstring
and the other a string to be carried out without having to
create a temporary of one type or the other.

Naturally, after creating cpstring,
I went back and used it to store the type and source
elements in ItemName. The final (or at least the current)
version of ItemName now looks like this:

class ItemName {

typedef set<cpstring>
StringSet;

static const StringSet types;

static StringSet sources;

StringSet::const_iterator
_type;

StringSet::const_iterator
_source;

cpstring _symbol;

public:

. . . . // constructors,
destructors, and such

const cpstring& type()
const { return *_type; }

const cpstring& source()
const { return *_source; }

const cpstring& symbol()
const { return _symbol; }

ItemName& type(const
string& s);

ItemName& source(const
string& s);

ItemName& symbol(const
string& s);

. . . . // etc.

};

I will close this (already too
long) column with a few final observations. ItemName is a
very small class, but it provides a number of valuable insights
into the problems of designing reusable, general-purpose types.
One of the often stated rules of C++ is "always make data
elements private." In the original version of ItemName,
this may have seemed like a rather silly restriction. As we saw
however, when we started changing the implementation of ItemName
(with many thousands of lines of code already using it), we were
very glad that we had isolated the data behind a functional
interface. Even then we had problems.

As I changed ItemNames
representation, all clients of ItemName were forced to
re-compile. Even though I only posted new versions of ItemName
to the released libraries at carefully timed intervals, the fact
remained that whenever a new version of ItemName appeared,
people would get annoyed at the time it would take to rebuild
their code. I considered completely hiding the details of ItemNames
implementation in a separate class, with ItemName becoming
just a handle. While that might have helped while I
was working on things, I rejected it as a long-term solution,
however. Since these changes came about from the need to address
problems with excessive memory allocation, I decided that adding
another level of free-store allocation was not appropriate. The
best thing I could do was to stabilize ItemName again as
soon as possible.

I also decided to go ahead and
return cpstring references in the interface of ItemName.
Most of us never think twice about having a member function such
as

const string& type()
const;

The problem is that by returning
a reference we are exposing details of the implementation, even
though we are using a member function. If we change the
implementation, we will be forced to change the interface, which
could cause our clients to have to change code. As I noted above,
one of my unstated requirements in changing ItemName was
that I not break any existing code. What this meant in practice
was that I had to continue to return a string from the
accessor functions, or something that could be automatically
converted into one.

I decided to go ahead and return
a const cpstring& (and to do so with an inline
function) for the sake of efficiency (the usual reason for
returning references instead of objects). As it was, I worried
about the possibility that a lot of code using ItemName
might have to convert the cpstrings to regular strings,
with the attendant memory allocation that would go along with it.
This is one of the primary reasons that I went to the trouble of
making cpstring a complete replacement for string.
If I had not been worried about the overhead of converting cpstrings
to strings, I could have solved the memory problem with symbol
in some simpler fashion, including using ordinary C-style
strings. If I had done so however, I would have been forced to
create a string temporary in order to return the value
from an accessor function (you can see this in my second solution
for type, above). Doing anything else would have broken
code. Since ItemName was a heavily used type, adding this
level of overhead could have caused performance problems in its
own right. By returning a cpstring&, and by making cpstrings
interface accept string objects without conversion, I
hoped to keep to a minimum the extra overhead I would introduce
by using cpstring instead of string. As it was, I
came real close to sticking with the Standard string class
for _type and _source since I was storing these in
a container and sharing them anyway.

I also thought long and hard (in
fact I am still thinking) about not providing cpstring
with an operator string() member function. By preventing cpstrings
from being automatically converted back into strings, I
would ensure that the compiler would not silently start creating string
temporaries that were not there before. The downside for
preventing this performance hit was that I could possibly break
some existing code. This last possibility turned out to be the
deciding factor in this case  I simply could not afford to
break code at that point. After examining some typical code it
was decided that we could worry about performance problems with string
temporaries after we discovered that we actually had performance
problems with string temporaries. If I had been tuning ItemName
earlier in the project, I am pretty sure that I would have
provided an explicit member function (as_string()?)
instead of operator string(). This would have forced users
of cpstring to be aware of any string temporaries
they created.

In the final analysis, cpstring
turned out to be a useful class. It helped solve ItemNames
memory problem, though the object sharing approaches used for type
and source provided much higher payoff for the effort
involved. Outside of ItemName, cpstring has turned
out to be only of limited use  there are few uses for a
string optimized to hold fewer than 6 characters. Still, I think
the effort involved in creating cpstring was worth it.
Still, if I could have come up with a more general-purpose
solution, I would have preferred it. (One option I have been
considering is to make cpstring a template with an
argument to specify the size of the internal buffer. That way, a
user with an application where most of the strings are less than
n characters could create a cpstring optimized
for n characters, but still have a general purpose
type that could accommodate longer strings if necessary.)

In summary, the Standard string
class provides a very useful abstraction. If you find yourself in
a situation where your standard implementation is inadequate for
your application, you definitely want to consider keeping the
interface, but providing your own implementation. If your
situation is like mine and you discover your problems late in the
project (and that is probably what will happen), then you may
have very little choice. Even if you anticipate them up front, it
is probably better to stick as close to the Standard string
specification as you can. More and more, this is what programmers
are going to expect, and by using Standard string you will
minimize the surprises that your maintenance staff will have down
the road.