Using fibers to simplify enumerators, part 3: Having it both ways

Now we will use fibers to fight back.
Before you decide to use fibers in your programs,
make sure to read the
dire warnings at the end of this article.
My goal here is to show one use of fibers,
not to say that fibers are the answer to all your problems.
Fibers can create more problems than they solve.
We'll come back to all the dire warnings later.

As with most clever ideas, this one has a simple kernel:
Use a fiber to run the caller and the enumerator
each on its own stack.

This helper class does the basic bookkeeping
of fiber-based enumeration. At construction, it
remembers the fiber that is consuming the enumeration,
as well as creating a fiber that will produce the
enumeration. At destruction, it cleans up the fiber.
The derived class is expected to implement the
FiberProc method and call Produce()
every so often.

The real magic happens in the (somewhat anticlimactic)
Produce()
and Next() methods:

To Produce() something, we remember the production code,
pre-set the enumeration result to its default of
FER_CONTINUE, and switch to the consumer
fiber. When the consumer fiber comes back with an answer,
we return it from Produce().

To get the next item, we
remember the identity of the calling fiber, then
switch to the enumerator fiber.
This runs the enumerator until it decides to Produce()
something, at which point we take the production code and
return it.

That's all there is to it. The m_fef and
m_fer members are for passing the parameters and
results back and forth across the fiber boundary.
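Putting the pieces together, a helper class consistent with the description above might look like the following sketch. The exact names (FEFOUND, FEF_FILE, FEF_DIR, FEF_DONE) and signatures are assumptions drawn from the prose and the schematic below; only FER_CONTINUE, FER_STOP, m_fef, and m_fer are named in the text.

```cpp
#include <windows.h>

// Hypothetical production and result codes, inferred from the article.
enum FEFOUND  { FEF_FILE, FEF_DIR, FEF_DONE };
enum FERESULT { FER_CONTINUE, FER_STOP };

class FiberEnumerator {
public:
    FiberEnumerator() : m_hfibCaller(nullptr) {
        // Create the producer fiber; it does not run until we switch to it.
        m_hfib = CreateFiber(0, s_FiberProc, this);
    }
    virtual ~FiberEnumerator() { DeleteFiber(m_hfib); }

    // Consumer side: run the producer fiber until it Produce()s something,
    // then hand back the production code it left for us.
    FEFOUND Next(FERESULT fer = FER_CONTINUE) {
        m_fer = fer;
        m_hfibCaller = GetCurrentFiber();  // captured afresh on every call
        SwitchToFiber(m_hfib);
        return m_fef;
    }

protected:
    // Producer side: remember the production code, pre-set the result to
    // its default of FER_CONTINUE, and switch to the consumer fiber.
    // When the consumer comes back with an answer, return it.
    FERESULT Produce(FEFOUND fef) {
        m_fef = fef;
        m_fer = FER_CONTINUE;
        SwitchToFiber(m_hfibCaller);
        return m_fer;
    }

    virtual void FiberProc() = 0;  // implemented by the derived class

private:
    static void WINAPI s_FiberProc(void* pv) {
        FiberEnumerator* self = static_cast<FiberEnumerator*>(pv);
        self->FiberProc();
        // A fiber procedure must never return; keep reporting "done".
        for (;;) self->Produce(FEF_DONE);
    }

    void* m_hfib;        // the producer fiber
    void* m_hfibCaller;  // the consumer fiber, captured in Next()
    FEFOUND  m_fef;      // parameter passed producer -> consumer
    FERESULT m_fer;      // result passed consumer -> producer
};
```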

Okay, with that groundwork out of the way, writing the
producer itself is rather anticlimactic.

As you can see, this class is a mix of the two previous
classes. Like the consumer-based class, information about
the item being enumerated is obtained by calling methods
on the enumerator object. But like the callback-based
version, the loop that generates the objects themselves is
a very simple recursive function, with a call to
Produce in place of a callback.

In fact, it's even simpler than the callback-based version,
since we don't have to worry about the FER_STOP code.
If the consumer wants to stop enumeration, the consumer simply
stops calling Next().

Most of the complexity in the class
is just bookkeeping to permit
abandoning the enumeration prematurely.
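A producer along these lines might be sketched as follows. The class shape, the FEF_* codes, and the member names are illustrative assumptions; error handling and the bookkeeping for premature abandonment are omitted for brevity.

```cpp
#include <windows.h>
#include <string>

// Sketch only: assumes a FiberEnumerator base like the one described
// above, with a Produce() method and a virtual FiberProc().
class DirectoryTreeEnumerator : public FiberEnumerator {
public:
    explicit DirectoryTreeEnumerator(const wchar_t* pszDir) : m_sDir(pszDir) {}

    // Like the consumer-based class, information about the current item
    // is obtained by calling methods on the enumerator object.
    const WIN32_FIND_DATAW* Found() const { return &m_wfd; }

private:
    void FiberProc() override { Walk(m_sDir); }

    // Like the callback-based version, the loop is a simple recursive
    // function, with Produce() in place of a callback. No FER_STOP
    // plumbing: if the consumer wants to stop, it stops calling Next().
    void Walk(const std::wstring& sDir) {
        HANDLE hff = FindFirstFileW((sDir + L"\\*").c_str(), &m_wfd);
        if (hff == INVALID_HANDLE_VALUE) return;
        do {
            if (m_wfd.cFileName[0] == L'.') continue;  // crude "." / ".." skip
            if (m_wfd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
                Produce(FEF_DIR);
                Walk(sDir + L"\\" + m_wfd.cFileName);  // recurse into subtree
            } else {
                Produce(FEF_FILE);
            }
        } while (FindNextFileW(hff, &m_wfd));
        FindClose(hff);
    }

    std::wstring m_sDir;
    WIN32_FIND_DATAW m_wfd;
};
```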

Okay, let's take this fiber out for a spin. You can use
the same TestWalk function as last time,
but for added generality, change the first parameter
from DirectoryTreeEnumerator* to
FiberEnumerator*.
(The significance of this will become apparent next time.)

Since the enumerator is going to switch between fibers,
we'd better convert the thread to a fiber so it'll have
something to switch back to!
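A driver along the lines described might look like this sketch; the TestWalk body and the FEF_* codes are assumptions based on the article's description.

```cpp
#include <windows.h>

// Consume the enumeration, whatever produces it.
void TestWalk(FiberEnumerator* pfe) {
    FEFOUND fef;
    while ((fef = pfe->Next()) != FEF_DONE) {
        // ... use the produced item ...
        // To abandon the enumeration early, simply stop calling Next().
    }
}

int main() {
    ConvertThreadToFiber(nullptr);  // the main thread is now a fiber
    {
        DirectoryTreeEnumerator e(L"C:\\Windows");
        TestWalk(&e);
    }  // destructor runs, which calls DeleteFiber
    ConvertFiberToThread();         // undo the conversion when done
    return 0;
}
```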

Here's a schematic of what happens when you run this
fiber-based enumerator:

    Main fiber                           Enumerator fiber
    ----------                           ----------------
    ConvertThreadToFiber
    construct DirectoryTreeEnumerator
      CreateFiber                        (not running)
    initialize variables
    Next(CONTINUE)
      SwitchToFiber()                →   starts running
                                         FindFirstFile etc
                                         Produce(FILE)
    Next(CONTINUE) "returns" FILE    ←   SwitchToFiber()
    use the result
    Next(CONTINUE)
      SwitchToFiber()                →   Produce(FILE) "returns" CONTINUE
                                         FindNextFile etc
                                         Produce(DIR)
    Next(CONTINUE) "returns" DIR     ←   SwitchToFiber()
    use the result
    Next(CONTINUE)
      SwitchToFiber()                →   Produce(DIR) "returns" CONTINUE
                                         and so on... until...
                                         Produce(DONE)
    Next(CONTINUE) "returns" DONE    ←   SwitchToFiber()
    cleanup
    destruct DirectoryTreeEnumerator
      DeleteFiber
    ConvertFiberToThread

Observe that from each fiber's point of view, the other fiber
is just a subroutine!

Coding subtlety: Why do we capture the caller's fiber each time
the Next() method is called?
Why not capture it when the FiberEnumerator is constructed?

Next time, we'll see how this fiber-based enumerator easily
admits higher-order operations such as filtering and composition.

Dire warnings about fibers

Fibers are like dynamite. Mishandle them and your process explodes.

The first dire warning is that
fibers are expensive in terms of address space,
since each one gets its own stack (typically a megabyte).
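If the default reservation is too rich for your purposes, the CreateFiberEx function lets you specify the commit and reserve sizes explicitly. The sizes below are purely illustrative.

```cpp
#include <windows.h>

VOID WINAPI MyFiberProc(void* /*pv*/) { /* ... */ }

// Reserve only 64KB of address space and commit 4KB up front,
// instead of the default (typically one-megabyte) stack reservation.
void* CreateSmallFiber() {
    return CreateFiberEx(/*dwStackCommitSize*/  4096,
                         /*dwStackReserveSize*/ 65536,
                         /*dwFlags*/ 0,
                         MyFiberProc, nullptr);
}
```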

And since each fiber has its own stack, it also has its own exception chain.
This means that if a fiber throws an exception, only that fiber can
catch it. (Same as threads.)
That's a strong argument against using an
STL std::stack object to maintain our state:
STL is based on an exception-throwing model, but you can't catch
exceptions raised by another fiber.
(You also can't throw exceptions past a COM boundary, which
severely limits how much you can use STL in a COM object.)

One of the big problems with fibers is that
everybody has to be in cahoots.
You need to decide on one person who will
call
the ConvertThreadToFiber function
since fiber/thread conversion is not reference-counted.
If two people call ConvertThreadToFiber on the same
thread, the first will convert it, and so will the second!
This results in two fibers for the same thread,
and things can only get worse from there.

You might think, "Well, wouldn't
the GetCurrentFiber function
return NULL if the thread hasn't been converted to a fiber?"
Try it: It returns garbage.
(It's amazing how many people ask questions without taking
even the slightest steps towards figuring out the answer themselves.
Try writing
a test program.)

But even if GetCurrentFiber told you whether or not the thread
had been converted to a fiber, that still won't help.
Suppose two people want to do fibrous activity on the thread.
The first converts, the second notices that the thread
is already a fiber (somehow) and skips the conversion.
Now the first operation completes and calls
the ConvertFiberToThread function.
Oh great, now the second operation is stranded doing
fibrous activity without a fiber!

Therefore, you can use fibers safely only if you control
the thread and can get all your code to agree on who
controls the fiber/thread conversion.

An important consequence of the "in cahoots" rule is that
you have to make sure all the code you use on a fiber
is "fiber-safe" - a level of safety even beyond thread-safety.
The C runtime library keeps
information in per-thread state: There's
errno,
all sorts of bonus bookkeeping when you
create a thread,
or call various functions that maintain state in per-thread data
(such as
strerror,
_fcvt,
and
strtok).

In particular, C++ exception handling
is managed by the runtime,
and the runtime tracks this data in per-thread state (rather than
per-fiber state). Therefore, if you throw a C++ exception from
a fiber, strange things happen.

(Note: Things may have changed in the C runtime lately;
I'm operating from information that's a few years old.)

Even if you carefully avoid the C runtime library,
you still have to worry about any other libraries you use
that use per-thread data. None of them will work with fibers.
If you see a call to
the TlsAlloc function,
then there's a good chance that the library is not fiber-safe.
(The fiber-safe version is
the FlsAlloc function.)

Another category of things that are not fiber-safe are windows.
Windows have thread affinity, not fiber affinity.

BTW, did the Windows team introduce fibers for any particular reason (e.g., MS SQL)? From previous conversations, I get the impression that only Win32 has such an explicitly defined fiber API, and fibers are a relatively obscure thing that most people probably wouldn’t ask for, let alone know about.

Fibers are definitely designed to help applications which really understand their data sets, their I/O patterns and don’t have a big plug-in/extensibility mechanism. At some point large applications like this become CPU bound since they’re actually used by enterprises who do insanely expensive things like buy large storage arrays that let the number of concurrent I/Os scale as far as the CPU and operating system can push them.

So then they need to start optimizing CPU usage and they can’t afford the statistical goodness of things like thread scheduling and I/O completion ports.

I bet you can figure out which applications really take advantage of fibers and therefore which bugs around using fibers have gotten fixed. They’re enterprise applications which do a lot of I/O but are run by large enough physical systems that the real limiting factor is CPU use.

Caveat emptor. To solve the easy enumerators problem you really want to have smaller things like closures/continuations without having to keep the entire stack around. This is hard without compiler support and even then managing the lifetime of the storage that the closure references is hard without GC. This is effectively how iterators/"yield" works in C#.

I co-wrote a simple game engine some time ago which used coroutines (of which fibers appear to be an implementation) to allow objects to operate independently.

In my case, each "object" in the game was able to register one or more coroutines which would be resumed each frame. Each time the coroutine was resumed the current tick count was passed in so that the coroutine would know how much to adjust its associated object by, depending on what it was supposed to be doing.

This wasn’t done with Win32 Fibers, though. Indeed, Raymond’s series here was the first I’ve ever heard of them. My co-author had implemented a simple scripting system, so my coroutine implementation just played with the stack kept inside the script interpreter. The coroutines didn’t exist outside the script context.

The nice thing about coroutines vs. threads is that it’s up to each routine when it yields, so things like locking become less important. If a coroutine is going to do something unsafe, it just has to make sure it leaves everything in a sensible state before yielding.

In practice, most applications other than "real-time" games are better off with threads, of course.

One nitpick: ConvertThreadToFiber() should fail with GetLastError() == ERROR_ALREADY_FIBER if the thread’s already a fiber. So try it; if it fails for that reason, keep going, but skip the call to ConvertFiberToThread() at the end, and you can work with components which called yours on a fiber.
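That suggestion can be sketched like this; error handling is abbreviated, and the ERROR_ALREADY_FIBER behavior is as the comment describes it.

```cpp
#include <windows.h>

// Convert the thread to a fiber, tolerating a caller that already did so.
// Sets *pfWeConverted when *we* performed the conversion and therefore
// must also perform the matching ConvertFiberToThread.
bool EnsureFiber(bool* pfWeConverted) {
    *pfWeConverted = false;
    if (ConvertThreadToFiber(nullptr)) {
        *pfWeConverted = true;  // we converted; we clean up
        return true;
    }
    // Already a fiber? Keep going, but skip the cleanup later.
    return GetLastError() == ERROR_ALREADY_FIBER;
}

// Usage sketch:
//   bool fWeConverted;
//   if (EnsureFiber(&fWeConverted)) {
//       /* ... do fibrous work ... */
//       if (fWeConverted) ConvertFiberToThread();
//   }
```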

In Longhorn, there’s also IsThreadAFiber(), which should always work.

GetCurrentFiber() and GetFiberData() don’t return garbage; they just don’t return what you’re expecting (the field in the TEB is overloaded with OS/2 information). If you file a bug against me, I’ll look into fixing them.

Given all the dire warnings about fibers, has anyone actually measured how much slower the equivalent version using threads would be?

I suspect that switching back and forth between two fiber stacks for each produced element is also not very good for performance.

If threads are used instead then for large collections you could batch the elements by putting a producer-consumer queue between the enumerator thread and the main thread. Then the overhead of thread switching will be amortized over a large number of elements, and you will not have to worry about fiber-safety.

Because fibers are a brain-the-size-of-a-planet feature. You are expected to have already worked through all the difficulties yourself – like deciding who will do the thread-to-fiber conversion. In the same way you don’t get an error if you free memory twice; you’re expected simply not to do that in the first place.

Remember, the ENTIRE PROGRAM must coordinate fiber management – see all the dire warnings – if you’re coordinating fibers on a program-wide basis surely you can decide who will be responsible for the thread-to-fiber conversion; that’s the least of your problems.

Pavel Lebedinsky, switching between fibers is *cheap* compared to switching between threads.

A switch between fibers involves saving the current register state and then loading the register state of the new fiber. This switches the stack as well. You are looking at a few dozen bytes that need to be moved around.

With a thread, it involves a trip into the kernel, saving the register state, loading the new register state, and then a trip back to user mode.

The time saved comes from eliminating the user-mode/kernel-mode round trip needed to do a context switch.

If memory serves, SQL Server gets something like a 5-10% boost when running in fiber mode. However I believe that it then can’t use XSPs (the DLL-based plug-in mechanism) because, in general, code can’t deal with fibers / shifting across threads. (This is why it’s not the default; it does break some code that may be running in-process with SQL.)

Maybe they’ve done the work so that XSP invocations are remoted to run on threads that aren’t fibrous; it’s been too many years since I’ve paid deep attention to SQL server, so please research the details here (XSPs) yourself.

Re: cost:

I may be wrong here, but I believe that a context switch forces reloading the PCR (on x86) which is a very expensive operation for the CPU. Fiber switches switch a lot less state, don’t go into kernel mode and don’t change the PCR.

> switching between fibers is *cheap* compared to switching between threads.

My point was that any context switching (thread or fiber) has additional costs because of poor locality of reference. These costs may dominate everything else (things like saving the registers, transition to KM and back, running the scheduler code etc). If that’s the case then using fibers doesn’t really buy you anything, except all the bugs you’re going to discover because most code has never been tested with fibers and probably is not fiber-safe.

Now I’m not an expert on this to be able to say one way or another. But when I googled for ["context switch" "cache miss" cost] the first hit was this article:

They say that at least in some cases, cache-performance dominates the cost of a context switch. Note that this was back in 1991, and by all accounts the situation has only become worse in recent years because CPU speeds grew much faster than memory speeds.

So it seems to me that unless you’ve actually *measured* the performance of both solutions and found fibers significantly faster, you’re better off staying with threads.

Any kind of context switch — whether an explicit one made in fibers, or a true thread change, is painful in terms of cache swap. It’s still true, though, that SwitchToFiber() is cheaper than a normal thread context switch, as fiber switching can be done almost entirely in non-privileged (user-mode) code.

And yes, it is still true today that context switches kill the cache. Those of us in the performance team have been on several crusades to reduce the number of switches needed in an "idle" system over the last few years — pushing for fewer services, fewer ISRs/DPCs, etc. In my particular case, my rallying cry has been high-definition video playback; we’ve found that having a high number of ISRs/DPCs per second will absolutely sink decoding, even if every single interrupt executes quickly, because they kill the cache. (This is in addition, of course, to the problem of ISRs/DPCs that take too long and lock up the CPU for extended periods of time, such as long NDIS chains or non-conformant ATAPI drives.)

Yes, a fiber is very similar to a jmp_buf. The thing is that setjmp/longjmp are explicitly permitted only for jumping up the stack, not across between stacks. There’s no defined way to allocate a new stack, get code running on it and save its context into a jmp_buf in standard C etc.

Note also that since a jmp_buf is a struct, its size can’t really vary and depends on the version of the C runtime (regardless of platform) that you are built with. Fibers are provided by the OS on windows and therefore if someone discovers that some critical piece of state wasn’t captured/switched, it can be added.

Unfortunately those additions do tend to break applications that were depending on that piece of context not being switched. Luckily most people have heeded the warning and aren’t using fibers, except for people who think they already understand their per-fiber context pretty well.

> (It’s amazing how many people ask questions without taking even the slightest steps towards figuring out the answer themselves. Try writing a test program.)

Isn’t that exactly how people come to rely on undocumented behavior? Just because a test program behaves in a certain way on your machine doesn’t mean it will on others. Answers to questions like this should be in the docs (don’t know if this one is, just making a general point). Of course, sometimes writing a test program is the only way, but it shouldn’t be the first one tried.

While C# iterators cover a good chunk of what you’d normally want to do, they don’t use separate stacks and so you can’t do everything fibres can. The main restriction is that all your "yield"s have to be in the top function instance.

IEnumerator<X> InOrder(Tree<X> t) {
    if (t != null) {
        InOrder(t.left);   // oops: builds an enumerator and discards it
        yield return t.value;
        InOrder(t.right);  // likewise; the subtree values are never yielded
    }
}

At the least, you’d have to put a for loop around the recursive calls and "re-yield" the values to the top. This is pretty inefficient.

Full continuation support would fix this. Unfortunately (AFAIK), adding continuation support causes a significant performance hit (even if you don’t use them).