internal override ThreadScoped<T> Allocate()
{
    // If 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'.
    if (next != null) return this;
    // The CLR has some fundamental limits on generic nesting depths, so we
    // circumvent this by using two generic parameters and nesting them via counting.
    Interlocked.CompareExchange(ref next, CreateNext(), null);
    // A failed compare-exchange doesn't matter, since the caller of Acquire()
    // accesses whatever instance ends up at this.next.
    return this;
}

This is simple base-99 arithmetic. Anyone familiar with positional arithmetic should recognize the pattern here: when we reach certain powers of the base, we reset the lower digits and carry the 1 into the next slot. Normally, humans deal in base-10, so a carry into a new digit happens at 10¹, 10², 10³, and so on.

In this case, we are dealing with base-99, so carries happen at 99¹, 99² and 99³, and the "carry" operation consists of nesting a generic type parameter. Simple!
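
The digit arithmetic itself is trivial. As a sketch (the method and names here are mine, not from the Sasa source):

```csharp
// Decompose an allocation index into base-99 "digits", where each digit
// is the nesting depth of one of the two generic parameters.
static void NestingDepths(int index, out int first, out int second)
{
    first = index % 99;   // resets to 0 on every carry
    second = index / 99;  // the carried 1s accumulate here
}
// index 98 gives first = 98, second = 0: the first digit is saturated.
// index 99 gives first = 0, second = 1: a carry into the second parameter.
```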

This scheme is also trivially extensible to as many additional parameters as are needed, so if someone somewhere really does need more than a million fast thread-local variables, I have you covered.

Tuesday, September 27, 2011

A while ago, Jeroen Frijters blogged about .NET 4's new ThreadLocal<T> and how it was implemented. It was based on a neat generics trick exploiting the fact that the CLR does not erase types. Instantiating a generic type X<T> at different type arguments means that X<string> has static and thread-local fields distinct from those of, say, X<int>.
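
A minimal illustration of the trick (my own example, not Jeroen's code):

```csharp
using System;

class X<T>
{
    // The CLR instantiates a separate copy of this field for every
    // distinct type argument T, and [ThreadStatic] additionally gives
    // each thread its own copy of each of those fields.
    [ThreadStatic]
    internal static int Counter;
}

class Demo
{
    static void Main()
    {
        X<string>.Counter = 1;
        X<int>.Counter = 2;
        // The two instantiations are entirely unrelated fields:
        Console.WriteLine(X<string>.Counter); // 1
        Console.WriteLine(X<int>.Counter);    // 2
    }
}
```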

Jeroen's sample implementation exploiting this trick for thread-local instance variables seemed far too complicated, however, and it suffered from a number of limitations that he notes in his post. If Microsoft's implementation was even more complicated, as was claimed, I wasn't confident it would perform well at all. I decided to do my own implementation, without conforming to ThreadLocal<T>'s interface, since I have very specific applications in mind.

ThreadScoped<T>

As I suspected, my implementation was considerably simpler, coming in at less than 90 lines of code, and it doesn't suffer from the problems Jeroen identifies, save for the fact that generated types are not reclaimed until the AppDomain exits. However, ThreadScoped instances are aggressively reused, so I don't anticipate this being a huge problem.

Conceptually, the implementation is quite simple. We have four views of data to keep in mind here:

Thread-global static data: ordinary static fields, visible to all threads.

Thread-global instance data: ordinary object fields, visible to all threads with access to that instance.

Thread-local static data: static fields marked with [ThreadStatic], of which each thread sees its own copy.

Thread-local instance data: per-thread, per-instance fields; the CLR provides no such thing directly, so this is what we must simulate.

Our thread-global static data consists of a lock-free linked list of unallocated ThreadScoped instances. Our thread-global instance data is a version number and a 'next' pointer to the next available ThreadScoped<T>; the latter is used when freeing an instance by pushing it back on the global list. Access to the global and instance data is done via Interlocked.CompareExchange, so all updates are lock-free.

Now comes the tricky part: the private type ThreadScoped<T>.Ref<TIndex> has two thread-local static fields, one for the value of type T and one for a version number (more on this later):
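
That listing isn't reproduced in this excerpt, but its shape is roughly the following (a reconstruction; the field names are mine):

```csharp
sealed class Ref<TIndex> : ThreadScoped<T>
{
    // One copy of each field exists per thread and per instantiation
    // of TIndex, so every allocated ThreadScoped<T> gets its own
    // thread-local slot.
    [ThreadStatic]
    static T slot;
    [ThreadStatic]
    static int slotVersion;

    // Value, Allocate, etc. elided.
}
```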

The type parameter TIndex is not actually used in any static or instance field; it's merely used here as a "phantom type". Basically, if we keep substituting new types for TIndex, we keep forcing the CLR to generate new thread-local static fields for us that we can use to simulate thread-local instance fields!

This is done in the Ref<TIndex>.Allocate() method. The global instance list always contains at least one unallocated instance. Whenever we try to allocate a new ThreadScoped<T>, we check whether we're down to the last instance in the list. If so, this last instance will generate a Ref<Ref<TIndex>> and enqueue that on the end:

internal override ThreadScoped<T> Allocate()
{
    // If 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'.
    if (next != null) return this;
    Interlocked.CompareExchange(ref next, new Ref<Ref<TIndex>>(), null);
    // A failed compare-exchange doesn't matter, since the caller of Acquire()
    // accesses whatever instance ends up at this.next.
    return this;
}

ThreadScoped.Create then pops the front of the list to obtain that next-to-last instance, and we're back to having only one unallocated instance.
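
Create itself isn't shown in this excerpt. One plausible shape, assuming a static 'freeList' field holding the head of the global list (the names here are mine, and the real Sasa source may differ):

```csharp
public static ThreadScoped<T> Create()
{
    ThreadScoped<T> head;
    do
    {
        // Allocate() guarantees head.next is non-null before we pop,
        // preserving the invariant that the list is never empty.
        head = freeList.Allocate();
    }
    while (Interlocked.CompareExchange(ref freeList, head.next, head) != head);
    return head;
}
```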

There's an important invariant here: there is always at least one item in the global list, and the last item on the global list is always the most deeply nested generic type that has been generated so far for ThreadScoped<T>. This means when we get to the last remaining unallocated instance, we can always safely generate a new instance without interfering with other threads.

The version numbers also bear some explanation. Basically, even if Thread0 disposes of a ThreadScoped instance, Thread1 may still have data sitting in that instance. Suppose the ThreadScoped instance is then pushed on the free list and later allocated again. If Thread1 participates in that second computation, it will find data already sitting in its supposedly fresh thread-local instance, left over from the last computation where it was supposed to have been disposed!

Obviously this is not what we want, but while Thread0 is disposing of the instance, it can't access Thread1's thread-local fields to clear them. This is the purpose of the instance and the thread-local version numbers. During dispose, we increment the instance's version number. If the thread-local version number and the instance version number don't match, it means the instance was disposed in the intervening time, so we should reinitialize it before proceeding:
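
In code, the check on access might look roughly like this (a sketch inside Ref<TIndex>, where 'slot' and 'slotVersion' are hypothetical names for the thread-local static value and version fields, and 'version' is the instance's version field; the real Sasa source may differ):

```csharp
public override T Value
{
    get
    {
        // 'version' is bumped on Dispose; 'slotVersion' is this thread's
        // last-seen version. A mismatch means the instance was disposed
        // (and possibly reallocated) since this thread last touched it,
        // so the stale thread-local data must be discarded.
        if (slotVersion != version)
        {
            slot = default(T);
            slotVersion = version;
        }
        return slot;
    }
    set
    {
        if (slotVersion != version)
            slotVersion = version;
        slot = value;
    }
}
```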

And that's it! There are a few other minor details related to bookkeeping that aren't really important. I think that's pretty much as simple as you can get, and since this results in effectively a direct pointer to thread-local data, it should be quite fast. There are also no intrinsic allocation limits, as it will just keep allocating or reusing instances as needed.

Benchmarks

Here is the source for the benchmarks I used. The main loop runs 10,000,000 iterations of a simple calculation that performs one read and one write to the thread-local instance per iteration. I ran this test 20 times in a loop, logging the number of iterations per second on each run. Then I used Peirce's Criterion to filter out the outliers, and used the resulting data set.

The numbers are iterations/second, benchmarks were run on .NET 4 and Windows 7. The results are fairly compelling:

                ThreadLocal<T>   ThreadScoped<T>   ThreadStatic
                       5022000          16260000       22396000
                       5042000          14378000       18484000
                       4972000          16514000       20790000
                       5002000          15722000       22470000
                       5070000          14244000       19860000
                       5098000          13946000       16570000
                       5076000          16474000       22246000
                       5102000          13994000       20532000
                       5012000          13494000       14482000
                       5108000          16090000       21598000
                       5074000          16778000       20000000
                       5104000          15408000       21620000
                       5076000          12762000       16312000
                       5054000          16792000       21008000
                       5064000          16380000       21856000
                       5108000          16154000       15910000
                       5066000          16750000       22988000
                       5108000          16076000       18778000
                       5012000          15986000       23094000

MIN                    4972000          12762000       14482000
MAX                    5108000          16792000       23094000
AVG                    5061579          15484316       20052316

% overhead relative
to ThreadStatic           296%               29%             0%

Microsoft's implementation takes a real beating in this simple benchmark, with almost 300% overhead over raw ThreadStatic fields. I'm not sure what could be going on in there to cause such a significant slowdown.

By contrast, ThreadScoped has a modest 30% overhead, which is far more reasonable. I suspect this overhead is due to two factors: 1. virtual dispatch overhead, because ThreadScoped.Value is a virtual property, and 2. the encapsulated instance for Ref<TIndex> may require a bit of pointer-chasing to resolve the right thread-local static field. I can't think of a way to eliminate this overhead, so this is as good as it's going to get for now.

I will note that I've had a few programs where ThreadLocal<T> did not perform as badly as demonstrated above, so it may be that reads or writes are more heavily penalized. However, no benchmark or program I tested had ThreadLocal outperforming ThreadScoped, so I can say with confidence that ThreadScoped is much faster than ThreadLocal.

Jeroen's post also implied that he had tested a ThreadLocal version that used arrays for the backing store, and that it was faster. I also implemented a version of ThreadScoped using arrays, but haven't tested it enough. It did seem quite a bit faster in some code I had been playing with, but there were many uncontrolled variables, so I can't reach any conclusion with confidence. The array-backed version is committed to the Sasa repository for the adventurous, though. There are serious limitations to using arrays as a backing store, however: you're stuck with a fixed number of instances defined at compile time, and allocating arrays large enough to avoid falling back on slow thread-local data wastes quite a bit of memory.

Still, I will run some benchmarks at some future date, so stay tuned! As for ThreadScoped<T>, it will probably be in the next Sasa release coming in a month or two, or you can just grab it from the Mercurial repo.

Sunday, September 25, 2011

There's a great post on implementing idioms with LINQ, where the example application is implementing formlets, as WebSharper does for F#. Tomas's post is well written, so if you're unclear on the above concepts, I recommend reading it first before proceeding with this article.

The claim in that post is that idioms can only be encoded via LINQ's 'join' operators. While strictly true if you stick to all the LINQ rules, you don't have to follow the rules, because LINQ queries are just naive syntactic transforms. You can thus exploit this to hijack the signatures of the SelectMany overloads to yield idiom signatures. It's not all sunshine and roses though, as there are consequences.

Overview

LINQ is a standard set of methods one can implement that the C# compiler can use to provide "query patterns". This query:

var foo = from x in SomeFoo
          from y in x.Values
          select y;

is translated by the C# compiler to:

var foo = SomeFoo.SelectMany(x => x.Values, (x, y) => y);

This is a purely syntactic transformation, meaning that the C# compiler simply takes the text from above, and naively translates each 'from', 'where', 'select', etc. into calls to instance or extension methods, SelectMany, Where, Select, etc. Type inference must then be able to infer the types used in your query, and everything must type check.

The fact that we're dealing with a purely syntactic transform means that we can be sneaky and alter the signatures of these LINQ functions and the C# compiler would be none the wiser. The resulting calls to the LINQ methods would still need to compile, but we can ensure that they only compile following the rules we want, in this case, of the rules of idioms.

LINQ Methods

The core LINQ methods are as follows, using Formlet<T> as the LINQ type:
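
The original listing is missing from this excerpt; following the standard LINQ query-pattern shapes, the signatures would be roughly the following (signatures only, bodies elided):

```csharp
public static Formlet<R> Select<T, R>(
    this Formlet<T> formlet, Func<T, R> selector);

public static Formlet<R> SelectMany<T, R>(
    this Formlet<T> formlet, Func<T, Formlet<R>> collector);

public static Formlet<R> SelectMany<T, U, R>(
    this Formlet<T> formlet,
    Func<T, Formlet<U>> collector,   // unwraps T: the problem for idioms
    Func<T, U, R> selector);
```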

The problematic methods for idioms are the two SelectMany overloads, specifically the parameter I've called 'collector'. You can see that the LINQ type is unwrapped and its value extracted on each SelectMany, then passed to the rest of the query. Accessing previous values like this is forbidden in idioms.

Fortunately, the SelectMany methods don't have to take this exact form; they need only have a similar structure. You must have two SelectMany overloads, with one and two delegate parameters respectively, and the first delegate parameter must return your LINQ type, in this case Formlet<T>, as this allows you to chain query clauses one after another. You can also modify the second delegate parameter in various ways, but I haven't found much use for that myself.

To implement idioms, we will simply alter the first delegate parameter so instead of unwrapping the value encapsulated by the Formlet<T>, we simply pass the Formlet<T> itself:
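
The altered overload would then look something like this (a sketch; only the collector's argument type changes):

```csharp
public static Formlet<R> SelectMany<T, U, R>(
    this Formlet<T> formlet,
    Func<Formlet<T>, Formlet<U>> collector,  // receives the formlet itself,
    Func<T, U, R> selector);                 // not the unwrapped T
```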

The example query from the overview would then no longer compile, because 'x' is now not a Foo, but in fact a Formlet<Foo>, and the formlet type does not have a "Values" property. Of course, you shouldn't provide a property to extract the encapsulated value, or this is all for naught.

The Downsides

Simple queries work great, but longer queries may run into some problems if you alter the LINQ signatures. In this case, if you try to access previous values by mistake, as in our example query above, you will get a complicated error message:

Error 1 Could not find an implementation of the query pattern for source type
'Formlet.Formlet<AnonymousType#1>'. '<>h__TransparentIdentifier0' not found.

Basically, your incorrect program was naively translated to use the LINQ methods, but because it does not match the type signatures you've hijacked, type inference fails. So you can't break your idioms by hijacking the query pattern this way, but depending on your target audience, perhaps such inscrutable errors will render the library unusable.

Still, it's a neat trick that should be in every type wizard's toolbox.

Saturday, September 17, 2011

Equality and identity are often intermingled in the .NET framework. .NET delegates correctly implement a straightforward structural equality, where two delegates are equal if they designate the same method and the same object, regardless of whether each delegate is a unique instance:
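
For example (my own snippet, demonstrating the documented delegate equality semantics):

```csharp
using System;

class DelegateEquality
{
    static void Handler() { }

    static void Main()
    {
        // Two distinct delegate instances...
        var a = new Action(Handler);
        var b = new Action(Handler);
        // ...that nonetheless compare equal, because they designate
        // the same static method:
        Console.WriteLine(a == b);      // True
        Console.WriteLine(a.Equals(b)); // True
    }
}
```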

IObservers created from delegates, by contrast, are compared by reference identity. This is unfortunate, because it means that by default we cannot implement existing event handling patterns using IObservable without significantly restructuring the code to propagate the IDisposable returned by IObservable.Subscribe. By this, I mean code that properly managed event registration and unregistration has no easy transition to using IObservable; it must be completely rewritten.

Like event equality, the equality of IObservers created from delegates should be structural, not based on identity. Manually managing subscriptions would then be possible via an explicit "Unsubscribe" operation.

This decision has a real consequence that I just hit: I can implement IObservable given an object implementing INotifyPropertyChanged, but I cannot do the reverse using the built-in abstractions. You'd either have to define your own IObservers that implement structural equality, or you'd have to store the actual event locally and trigger it manually when a new value is available, as I have done with NamedProperty<T> in Sasa.

On a slightly related note, I've switched Sasa over to use Mercurial on Sourceforge, and have closed the old subversion repo.