During my recent work on Tornado, I've come across a number of things which I
haven't found documented anywhere else on the internet - so I thought I'd post
them here for others to use. It's a hotchpotch of stuff in no particular order
or organisation, but I think it's stuff you won't find anywhere else!

1. Bugs in Microsoft software:

Tornado pushes everything to the max - the way it uses the system is extremely
taxing, and as a result I have uncovered a large number of bugs in Qt (for most
of which I supplied source fixes which Trolltech then applied to the main
distribution - thank you Trolltech!) and, more interestingly, quite a few bugs
in Microsoft's products.

Since in order to report bugs to Microsoft you need to pay them several
thousand of whatever currency you use to get a direct line to real developers,
I am publishing my list of discovered bugs and how I worked around them in the
hope it will help others. The list is in no particular order:

Microsoft Visual Studio C++ v6.0 & .NET 2002 (v7.0)
This isn't a bad little compiler - though the Visual Studio .NET 2002 IDE
is bug-ridden, it should be better by .NET 2003 (where the hell is the
service pack???). However, both compilers corrupt memory (or the stack)
when you make a copy of an object where:

The object being copied has a default copy constructor

The object being copied has no data members and is subclassed from
an object with a virtual table (ie; it has virtual methods) and
subclassed from another object with no data members (eg; a class used
purely for namespace purposes)

Basically the default copy constructor code correctly copies the first
inherited class' data and adjusts the virtual table correctly. However,
unfortunately, it then goes on to alter the table for the class with no
data members - which, because it has no data, means it overwrites
something else lower down in memory. Result: crashes.

Solution: Add a dummy data member to either the class itself or the dataless
class it derives from - it doesn't matter which. I would suggest adding it to
the base class, as that makes the problem go away permanently.
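To make the conditions concrete, here is a sketch (class names are mine, not from Tornado) of the shape of class that triggered the bug, with the dummy-member workaround applied. Modern compilers don't exhibit the bug; this only illustrates the layout being described:

```cpp
// Sketch of the class shape that hit the MSVC6/7.0 default-copy-constructor
// bug, with the suggested workaround applied.

class Namespacey {                  // class used purely for namespace purposes
public:
    char _dummy;                    // workaround: dummy member so the class
                                    // is no longer empty
    Namespacey() : _dummy(0) {}
};

class HasVTable {                   // base with a virtual table
    int data;
public:
    HasVTable() : data(42) {}
    virtual ~HasVTable() {}
    int value() const { return data; }
};

// The problem case: no data members of its own, subclassed from a class with
// a vtable and from a (formerly empty) namespacing class, relying on the
// default copy constructor.
class Problem : public HasVTable, public Namespacey {
public:
    Problem() {}
    // default copy constructor used - this is what corrupted memory
};
```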

Duplex pipes are broken (ie; ones created by CreateNamedPipe())
I was completely stumped by the weird behaviour of my pipes - they'd
work fine and then hang infrequently. It turns out that if one thread
fills a duplex pipe's write buffer and then another thread tries writing
more while the reading thread has not yet read any data, the second write
hangs and so do all subsequent writes. Best of all, the read doesn't
return either! I found this bug reported for Windows NT 3.1,
so it's clearly an oldie!

Solution: Break your duplex pipes into two separate pipes, each going in one
direction. You get improved latency too, I noticed, because it seems you can
only write once to a pipe on NT before it blocks.

OutputDebugString() isn't thread-safe
Yes, should your threads try outputting debug text at the same time,
this call will hang! Perfect!

Solution: Stick a critical section around OutputDebugString().
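On Windows the fix is literally EnterCriticalSection()/OutputDebugString()/LeaveCriticalSection(). The same idea in portable C++ (std::mutex standing in for the critical section, and a shared log string standing in for the debugger output, so the sketch is checkable anywhere) looks like:

```cpp
#include <mutex>
#include <string>

// Shared log buffer - stand-in for the debugger's output stream.
inline std::string &debugLog() {
    static std::string log;
    return log;
}

// Serialise debug output across threads: one lock for all callers, released
// automatically on scope exit (even if an exception is thrown).
inline void safeDebugPrint(const std::string &msg) {
    static std::mutex lock;
    std::lock_guard<std::mutex> hold(lock);
    debugLog() += msg + "\n";       // stand-in for OutputDebugString()
}
```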

Overlapped i/o reads continue even after closing the file handle
Arguably this isn't a bug, but it results in memory corruption as the
overlapped i/o writes its data into what is now something else. It isn't
mentioned very clearly in the documentation, but you can use CancelIo() to
cancel the i/o before closing the file.

Unfortunately, CancelIo() only cancels the overlapped i/o started by
the calling thread. So, for example, if your reader thread dies and
you're trying to cleanup after it, you can close all the handles and free
the memory but you can't stop it still corrupting memory. Thanks
Microsoft!

Update: Yes, this frigging misfunctionality got me again. There was
literally one exit route (caused by a thrown exception under rare
circumstances) where CancelIo() wasn't getting called and yes, I was
getting my heap corruption again.

Solution: Either find some way of ensuring each thread cancels its
own i/o, or stick a Sleep(500) in before releasing the memory the
overlapped i/o would write into, or indeed just never free the memory.
Your choice.

2. Multithreaded programming tips

Most of the programs I write are multithreaded. The usual sage advice
regarding multithreading is not to use it except when absolutely necessary,
ie; where there is a substantial performance advantage to be gained.

My problem with this advice is that it seems to me like the proverbial
ostrich burying its head in the sand. If you never practice something, you'll
never get good at it, and with future computers certainly being SMP and then
NUMA it seems daft not to write your code today for what is obviously going
to be the future, especially if it costs you little right now except a tiny
bit more effort. Furthermore, a lot of techniques are used in computer
software to maintain the illusion that things happen one at a time - the
model most computer programmers are comfortable with. Things like sockets,
events, messaging, data arriving - all these are asynchronous events which
happen in parallel with one another, and for each of them there is a
programmatic convention for serialising them into a set of consecutive
operations easier for the programmer to get their head around.

Now most of these conventions I'm fine with - anything making my life
easier is good. But what I don't agree with is when use of these conventions
makes a program considerably more complex and harder to maintain. A good
example here is notification network sockets: these have an interrupt
routine post a window message to the application so it can deal with a socket
event synchronously - and invariably I can't help seeing it as making
things more complex, not less so.

There are many others - something doing intensive i/o with a serial port,
for example. Using overlapped i/o on Windows, you can have the i/o happen in
the background while you get on with running the GUI. But it's a much better
design to have a separate thread do the i/o and leave the GUI alone to do its
own thing.

People have this notion, and it's not helped by past-generation
programmers, that writing multithreaded code is hard. I have written 30,000
line, 20 thread programs and they work perfectly. Sure, the effort expended
was a little greater, but not hugely so - and the customer got a high-speed
control program capable of very low latency (for Windows).

2.1 The key to success

The key to success for any program is discipline. There are many
excellent but sloppy programmers who are just as dangerous as an awful
programmer, and there are many technically poor programmers whose discipline
makes them very valuable members of a team.

Discipline is following rules. The rules are extremely simple, and if you
follow them you will write code:

With less than one memory leak per 5,000 lines

With a negligible chance of causing memory corruption

With perhaps only one to two bugs per 1000 lines

Which is modular, maintainable and easy to debug

Which is fast, scales excellently with hardware, and lasts long into
the future

And here they are:

Error checking is your friend
It's sad to see how little code has the good old assert() function in
it. It costs almost nothing and yet permits sanity checks which will often
give a clue to an underlying problem if sprinkled around liberally.
It's also distressing to see how little code checks the return code of an
API call for an error condition - this is a particular problem on Windows,
but Unix suffers from it plenty too.

I suggest you wrap every API call in a macro. Tornado uses TERRHWIN()
for Win32 API calls and TERRHOS() for Unix ones. These macros test for the
error condition and, if true, use FormatMessage() and strerror()
respectively to decode a global error code into a string, which they
construct into an exception also containing the source file name and line
number (on debug Win32 builds it adds a stack backtrace too), which is
then thrown. Unlike older C++ compilers, every popularly used compiler
today does a good job of generating efficient and correct exception
handling code - most problems in this area nowadays stem from sloppy
programming. If you want a library that already has all this done for you,
choose my own: TnFOX.
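The Tornado macros themselves aren't reproduced here, but a hypothetical portable equivalent of TERRHOS() (the name and shape below are my assumption, not the original) might look like this - it checks the conventional Unix -1 failure return, decodes errno via strerror() and throws an exception carrying the failing expression, file and line:

```cpp
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for Tornado's TERRHOS(): wrap a Unix API call,
// and on the conventional -1 failure return, throw an exception that
// records what failed, why (strerror), and where (__FILE__/__LINE__).
#define ERRH_OS(expr)                                                       \
    do {                                                                    \
        if ((expr) == -1) {                                                 \
            throw std::runtime_error(std::string(#expr) + " failed: "       \
                + std::strerror(errno) + " at " + __FILE__ + ":"            \
                + std::to_string(__LINE__));                                \
        }                                                                   \
    } while (0)
```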

Good starting points are checking parameters for exceeding maximums or
minimums, checking whether a list index exceeds the number of members in
that list, or whether finding an item that must be in a list fails, etc.
Nowadays with C++, assert() should be used for stuff likely to fail only in
debug sessions, as obviously it won't check in release builds.

However, you should still sprinkle your code with runtime checks which
throw an exception also in release builds shipped to the user. I use a
special kind of exception which indicates if the failure of the check is
fatal and also reports it to the user in a special way - with an
accompanying stack backtrace for example.

You should fully test your exception handling. I've had a worker thread
slowly corrupt memory randomly to test how well the application responds
to an extreme environment - does it correctly save out as much data as it
can, clean up after itself as best it can, etc. It's like how an old boss
of mine once tested his filing system - he stuck in a new hard drive, ran
the FS and then struck the drive with a hammer to simulate a platter crash.

Zero pointers throw exceptions
If I had a dime for every time someone said pointers are evil, I'd be
rich. Well sure, they are powerful, but they are not evil. And in the hands
of a disciplined programmer, they can make your code (a) considerably more
efficient and (b) catch more errors.

Yes, you read that last bit right. C++ references (using &) are all well
and good, but they must point to an object and hence they are ripe
for referencing an object which has been deleted. I only ever use
references for an object whose existence I can guarantee, and pointers for
everything else.

Pointers are good because they can be zeroed and any attempt to indirect
through a zero pointer will throw an obvious exception on every platform
I've ever used. The key to using pointers as a powerful and useful tool
instead of corrupting data and leading to unstable code is to always
zero a pointer when it doesn't point to anything.

That single, simple rule can remove perhaps 40% of all bugs seen in C or
C++ code. It even works for assembler! In practice, it means always
initialising a pointer with zero eg; MyClass *ptr=0; especially
in class constructors eg;

class MyClass {
    MyOtherClass *myptr;
public:
    MyClass() : myptr(0) {
        ...
    }
};

You should also always zero a pointer after deletion. In fact, I
use a macro called TDELETE(x) which is simply { delete x; x=0; }
which is as simple as it gets.

Of all the rules here, if you follow this one you will see the greatest
improvement in the quality of your code. You will find that attempts to pass
a freed block of data to some code, or calls to methods on a deleted
instantiation, no longer result in memory corruption or changes "not
happening", but in a guaranteed fatal exception you can track down and fix.

Memory leakage
95% of memory problems with C-like languages are due to sloppy pointer
discipline. C++ gives you a wonderful thing to look after your memory
allocations - it's called a destructor, and you should use it wherever
possible. The best form is to encapsulate and abstract your memory and
pointer use in a class - preferably one someone else has already debugged
for you, eg; the STL's list, vector, array or string objects. If you must
write it yourself, make it small and hide as much of the realities
underneath from the outside world as possible - this makes fixing it when
it's broken much easier. Needless to say, plenty of sanity checks here are
a must.

For the 10% of memory allocation code you can't use a class for, use this
class:
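The original listing doesn't survive in this copy of the page. A minimal stand-in doing the same job (my sketch, not the Tornado original): a holder whose destructor frees the pointer, so stack unwinding from an exception cannot leak it - much like the std::auto_ptr of the day:

```cpp
#include <cstddef>

// Minimal destructor-based pointer holder (a stand-in for the missing
// original listing). Whatever way the scope exits - return, exception,
// goto - the destructor runs and the pointer is freed and zeroed.
template<class T> class PtrHold {
    T *ptr;
    PtrHold(const PtrHold &);            // non-copyable: one owner only
    PtrHold &operator=(const PtrHold &);
public:
    explicit PtrHold(T *p = 0) : ptr(p) {}
    ~PtrHold() { delete ptr; ptr = 0; }
    T *operator->() const { return ptr; }
    T &operator*() const { return *ptr; }
    T *get() const { return ptr; }
};

// Little helper so a test can observe that destruction really happened.
struct Probe {
    static int alive;
    Probe() { ++alive; }
    ~Probe() { --alive; }
};
int Probe::alive = 0;
```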

Using this class is a must if you use exception handling with code
which allocates pointers on the stack. An exception may exit your code at
any time, so the code deleting a pointer may never get called - using a
destructor, this is fixed.

(Added 21st Nov 2003: Nowadays I rarely use that class except for
static data, I use rollbacks instead. See
the docs in TnFOX)

A lot of people use expensive tools like BoundsChecker or Purify to detect
memory problems. This is all well and good, but I feel they are missing
the fact that a great deal of these facilities are already provided free
with your language and toolkit - irrespective of platform.

(You need to add a static const char *_tmemdbg_current_file_=__FILE__;
to each .cpp file individually - this is because of how __FILE__ works.)
Indeed, MSVC6 and later come with a debug heap which already has this
debugging operator new written for you, but it's a trivial implementation
on other platforms like GCC. Simply place the allocated pointer with
filename and line number into a list (preferably accessed via a hash table)
and at the end of the program dump the unfreed memory blocks, showing where
they were allocated. Voila, you have an industrial strength memory leak
detection mechanism.
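A hypothetical sketch of that idea (a real implementation would hook global operator new/delete; plain functions and a macro keep the sketch simple, and all names here are mine):

```cpp
#include <cstdlib>
#include <map>
#include <string>

// Record every live allocation with the file and line that made it.
struct AllocRecord { std::string file; int line; std::size_t size; };

// Table of currently unfreed blocks, keyed by address.
inline std::map<void*, AllocRecord> &liveAllocs() {
    static std::map<void*, AllocRecord> table;
    return table;
}

inline void *trackedAlloc(std::size_t n, const char *file, int line) {
    void *p = std::malloc(n);
    AllocRecord r = { file, line, n };
    liveAllocs()[p] = r;
    return p;
}

inline void trackedFree(void *p) {
    liveAllocs().erase(p);
    std::free(p);
}

// At program end, anything still in liveAllocs() is a leak, complete with
// the file and line where it was allocated.
#define TRACKED_ALLOC(n) trackedAlloc((n), __FILE__, __LINE__)
```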

MSVC7 or later (or Visual Studio .NET <year> as it's called by MS) comes
with extensive run-time check abilities. It can check for exceeding
boundaries known to it (your array or list class should already be doing
this), writing off the ends of buffers (you should already be using
strncpy() - never strcpy() - or in C++, always a string class) and,
usefully for the transition to 64 bit code, value truncations not silenced
by a cast (eg; an int into a short, or worst of all, especially on
Windows, the common int = void *). What I find most useful about this new
MSVC7 feature is testing my code to see how well it works on a machine
three times slower than mine - because it should always run adequately at
a third of the speed of whatever you're seeing as the developer with a
high-end machine.

Lastly, if you're on Linux you should seriously check out
valgrind, which can debug more than just memory problems (eg; how well
does your program use the processor's cache?). I usually belt any code
through it just before release.

Lock or prevent all shared data access between threads
It's the first rule of multithreaded programming - synchronise access to
all shared data. It would be better to say, however, that if you structure
your code differently you can reduce access to shared data in the first
place, which results in (a) much faster parallelised execution, (b) fewer
errors because you forgot to hold the lock and (c) fewer thread deadlocks.

It's hard to put into words how to structure your code differently.
Certainly, you need to do many more deep copies of objects which comes
with an obvious memory usage and speed penalty. But I think the essence is
to keep objects within themselves as much as possible and try to make
everything reentrant (ie; able to have multiple things running the same
code but with different inputs eg; two instantiations of a class could run
without locking because they are working on two separate sets of data).
Obviously, static buffers must be avoided at all costs - the stack is your
friend here!

Don't be afraid to split stuff off into another thread when it seems
appropriate. The more practice you get writing multithreaded code, the
better you will get at it. It's best to hold locks more than you need to
at first, but if you hold a lock while waiting for another you will run
into thread deadlock problems, which can be solved in two ways: (a) release
the first while waiting on the second, or (b) use a more global lock
covering access to more data. If the task you want to perform in parallel
is very small, consider using a threadpool instead of starting a whole
dedicated thread. And yep, you guessed it - TnFOX has a FXThreadPool
class.

Because of this deadlock problem, sticking a mutex on every little piece
of data will usually lead to non-functional code. Furthermore, it results
in very slow performance because all synchronisation code is expensive -
it negates the effect of the processor's cache and stalls the entire
memory subsystem on many architectures. No, the solution is to strike the
right balance between how specific and global your locks are, and it's
something you get much much better at with practice.

One point here is again the use of destructors to help you, much as with
freeing memory pointers above. If you hold any kind of synchronisation
lock, hold it indirectly by instantiating a class - then when an exception
or other form of strange exit happens, the lock is correctly freed. The
Java and Win32 approach of try...finally is fine and useful, but by
thinking a little more laterally you can get something more efficient in
C++. (Added 21st Nov 2003: See TnFOX's
FXMtxHold)

Lastly, at least once per large multithreaded project you will usually
have to write a really difficult piece of multithreaded code - memorable
for me have been the intelligent mutex (which permitted a configurable
pattern of multiple users, eg; define sets A, B & C of possible user
combinations, so only locks of say type A can execute in parallel - a bit
like a read/write mutex but more complex), a hardware write access lock
which the user switched between threads and, most recently, an optimised
scalable fully recursive read/write mutex. My advice for these kinds of
code is to keep the difficult bit absolutely as small as possible. It's
better to have 100 lines of extremely difficult code than 1000 lines of
moderately difficult code, especially as you can rip out and completely
replace 100 lines much more easily than 1000. And of course, reuse an
existing implementation off the web long before writing your own -
TnFOX comes with LGPLed source code you
can study.

Multithreading efficiency
A well written multithreaded program will come within 10%, and usually
5%, of the speed of a non-multithreaded program on a uniprocessor machine,
ie; not noticeable to the user. If it's done right, the unique benefits
include a much more responsive program capable of working simultaneously
on dozens of things at the same time and furthermore a much more stable
program, because fatal errors can often be localised to within a single
thread, letting your program relaunch that thread or otherwise save data
which might otherwise be lost. A multithreaded program is furthermore much
more scalable for the future - your code will continue running faster with
increased hardware far more linearly than an older-style program.

Thread support varies greatly between platforms. Windows uses the OS/2
model, which is very flexible and powerful but also slower than POSIX can
be, POSIX being considerably more basic. Really useful stuff Windows has
which POSIX should have includes thread message queues, Asynchronous
Procedure Calls (APCs), inter-process synchronisation (POSIX defines
optional, extremely basic inter-process locks, but they are rarely
implemented) and, lastly, the thing I miss the most: the ability to wait on
more than one wait object at once. Simple things in Windows, like having
arbitrary code executed in another thread's context, are very cumbersome in
POSIX (usually involving signals), and I recommend from bitter experience
that you try to find a different way of structuring things so you don't
have to.

I have four tips to dramatically improve multithreading efficiency - apart,
that is, from design and structure, which have the biggest effect (but I've
already covered that above):

Use spin counts on anything which can put a thread to sleep.
Basically, putting a thread to sleep and waking it again is extremely
expensive - a call to the kernel, lots of modifying of inter-thread shared
lists, updating the scheduler, etc. A simple count of 1...4000 tries to
get a lock before actually waiting on it makes a huge difference on a
multiprocessor system, and can make some difference on a uniprocessor one
if the lock is being contended a lot (it depends on that system's
scheduler). Of course, if trying a lock is itself expensive (some POSIX
implementations), you won't want to do this. If using Windows, take a
good look at InitializeCriticalSectionAndSpinCount().

Mutexes are dirt simple to implement - yet amazingly most POSIX
implementations are still horrendously inefficient. Since they're the
most used synchronisation primitive, you may consider rolling your own
mutexes - on Windows, critical sections are 57 times quicker than Win32
mutexes, so it should be obvious which to use! To roll your own mutexes,
you'll need access to the hardware's atomic synchronisation instruction,
which can at the very minimum swap a memory location atomically, though
many architectures can also increment atomically or even compare and
exchange at once. I suggest using the one I wrote for
TnFOX (check the FXMutex class).

Use more thread local storage. Even experienced multithreaded
programmers often forget how thread local storage can greatly improve an
algorithm working with shared data. I used TLS in my RWMutex below to
save having to maintain a list of threads currently owning a read lock
and it not only made things more efficient, it made the design simpler
too. You can encapsulate TLS for C++ using something like the code in
the RWMutex source below.

Consider more intelligent synchronisation primitives. The most
common complex primitive is the read-write mutex, though I listed some
stranger ones above. RWMutexes are best suited to code which mostly
reads from shared data but rarely writes to it - which, in my
experience, is 85-90% of shared data use. Or, to put it another way: if
you are writing just as much as reading your shared data structures,
consider separating the read-only parts out into a RWMutex-protected
index and keeping the writeable parts individually locked by a normal
mutex - thus you can use the index for quick lookup and not hold up any
other threads except when two want to write to the same data.

POSIX has optional support for RWMutexes; Win32 & OS/2 just don't
have them at all. I provide a working, optimal algorithm with source
below which can aid you in this.
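The spin-count and roll-your-own-mutex tips above can be sketched together portably with std::atomic (my sketch, not TnFOX's FXMutex - a real mutex would fall back to a kernel wait object rather than just yielding the timeslice):

```cpp
#include <atomic>
#include <thread>

// Spin-count mutex sketch: try the atomic lock a few thousand times
// before giving the timeslice away. The atomic test_and_set is the
// "swap a memory location atomically" hardware op mentioned above.
class SpinMutex {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock(int spinCount = 4000) {
        for (;;) {
            for (int i = 0; i < spinCount; ++i)
                if (!locked.test_and_set(std::memory_order_acquire))
                    return;                 // got the lock while spinning
            std::this_thread::yield();      // stand-in for a kernel wait
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};
```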

3. A Read-Write Mutex

I provide here C++ source code for Tornado's implementation of a generic
read write mutex and thread local storage (TLS) encapsulation for C++ using
POSIX threads (Win32 users please consult TlsAlloc()). Tornado's kernel runs
hundreds of threads all of which work simultaneously on the kernel namespace
which is made up of tens of thousands of linked nodes, the entire tree of
which is protected by one RWMutex. Most of the writeable stuff is linked off
the node tree into separately protected items, so the nodes are effectively
a giant index.

This implementation has gone through three incarnations to try and
balance out the priority of readers and writers and to remove a few
deadlocking bugs. As far as I know, the following implementation is bug free
but I offer no warranties or liability if it is not. This source is
completely free for use by anyone for any reason, so you get no ability to
whinge if it doesn't work for you!

Basically, the below is as perfect a RWMutex as I know how to make. It
permits infinite recursion on any combination of read and write locks and
also gives writers priority - as soon as a write is requested, all new
reader requests wait in a queue. It is extremely efficient, normally having
only a 4x mutex overhead for reader lock claim and release. It keeps the
amount of time holding the mutex protecting its private data as low as
possible to reduce contention, avoids waiting on the kernel as much as
possible and, when it wakes threads waiting for something else to finish,
does so as selectively as is portably possible for POSIX threads. It also
deals with the classical problem of two or more threads already holding a
read lock both requesting a write lock at the same time.

As it's copied directly from the Tornado source tree, it uses some
classes you may not be familiar with: TMutex (an optimised implementation),
TMtxHold (a class holding a mutex lock and thus guaranteeing its unlocking
irrespective of how the function exits), TWaitCondition (an object upon
which a thread can wait for a notification ie; it sleeps till another thread
indicates it should wake - roughly corresponds to an event object in Win32 &
OS/2 or a wait condition under POSIX threads). See Tornado's documentation
if you want to know more.
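The Tornado source itself is not reproduced in this copy of the page. To illustrate the writer-priority behaviour described above, here is a deliberately minimal sketch of my own (not the Tornado code - no recursion support, and a plain mutex plus two wait conditions in place of TMutex/TWaitCondition):

```cpp
#include <condition_variable>
#include <mutex>

// Minimal writer-priority read-write mutex sketch. As soon as a writer is
// waiting, new readers queue behind it - the behaviour the text describes.
// Unlike the Tornado original, this does NOT support recursive locking.
class RWMutex {
    std::mutex m;
    std::condition_variable readers_cv, writers_cv;
    int active_readers = 0;
    int waiting_writers = 0;
    bool writer_active = false;
public:
    void lockRead() {
        std::unique_lock<std::mutex> h(m);
        // Writer priority: block if a writer is active OR merely waiting.
        readers_cv.wait(h, [this]{ return !writer_active && waiting_writers == 0; });
        ++active_readers;
    }
    void unlockRead() {
        std::unique_lock<std::mutex> h(m);
        if (--active_readers == 0)
            writers_cv.notify_one();        // last reader out wakes a writer
    }
    void lockWrite() {
        std::unique_lock<std::mutex> h(m);
        ++waiting_writers;
        writers_cv.wait(h, [this]{ return !writer_active && active_readers == 0; });
        --waiting_writers;
        writer_active = true;
    }
    void unlockWrite() {
        std::unique_lock<std::mutex> h(m);
        writer_active = false;
        writers_cv.notify_one();            // prefer a waiting writer
        readers_cv.notify_all();            // then release queued readers
    }
};
```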