When I started rewriting this website, I wanted to make good use of my multi-core CPU. Generating hundreds of pages using XSL transforms and plenty of pre-processing in C#, there's a lot of parallelism to be had.

I began by using the TPL's data parallelism features: mainly Parallel.ForEach and Parallel.Invoke. These are super easy to use, and made an immediate huge difference.

Then the Visual Studio 11 developer preview came out, and I felt compelled to make use of its new async features. This meant ditching the Parallel methods altogether and writing for task parallelism.

There are still parts of the .NET Framework which don't support async, and XML is one of them. Because I'm reading relatively small documents, I was able to work around these limitations by asynchronously filling a MemoryStream from a file and feeding the MemoryStream to the XML classes:
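For flavor, here's the same trick sketched in C++ rather than the C# the site actually uses (the function name is mine): slurp the whole file into an in-memory buffer up front, then hand that buffer to the parser so the parse itself never touches the disk.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read the entire file into memory in one pass, then hand the
// in-memory buffer to the (synchronous-only) parser. The file I/O
// can happen asynchronously; the parse never blocks on disk.
std::string slurp(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        throw std::runtime_error("could not open " + path);

    std::ostringstream buffer;
    buffer << file.rdbuf();
    return buffer.str();
}
```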

But I had one more problem to solve. For efficiency, Parallel.ForEach partitions its items into ranges which will be operated on concurrently. A side effect of this, which I was relying on, was that only so many I/O operations could happen at once. In my new code I'm simply launching all these tasks at once rather than partitioning—this absolutely killed performance, as potentially hundreds of concurrent I/Os caused my disk to seek like crazy.

What I ended up doing here was creating a ticket system which can be used to allow only a limited number of I/Os to happen concurrently: essentially a safe task-based semaphore.

When the lock gets disposed, it'll let the next operation in line progress. This was simple to implement efficiently using Interlocked methods and a ConcurrentQueue.
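The shape of the idea, sketched as a C++ analog (the original was C# tasks with Interlocked and a ConcurrentQueue; this mutex-based version trades the lock-free details for brevity, and all names are hypothetical):

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

// Allows at most `limit` operations in flight. acquire() runs the
// continuation immediately if a ticket is free, otherwise queues it.
// release() hands the freed ticket straight to the next waiter.
class ticket_semaphore
{
public:
    explicit ticket_semaphore(std::size_t limit) : available_(limit) {}

    void acquire(std::function<void()> continuation)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        if (available_ > 0) {
            --available_;
            lock.unlock();
            continuation();          // a ticket was free: run now.
        } else {
            waiters_.push_back(std::move(continuation));
        }
    }

    void release()
    {
        std::function<void()> next;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (waiters_.empty()) { ++available_; return; }
            next = std::move(waiters_.front());
            waiters_.pop_front();
        }
        next();                      // pass the ticket on without releasing it.
    }

private:
    std::mutex mutex_;
    std::size_t available_;
    std::deque<std::function<void()>> waiters_;
};
```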

Some operations—file opening and existence testing, directory creation, etc.—have no asynchronous analog. For these there is no good solution, so I simply wrapped them in a task as in the OpenReadAsync example above. They're rare enough that it hasn't been a problem.

The end result? Actually about 50% better performance than using the Parallel methods. When all the files are in cache, I'm able to generate this entire website from scratch in about 0.7 seconds.

I moved to WordPress when I got sick of limitations in an XSLT-based site. I've now moved back to an XSLT-based site due to WordPress's heavy SQL usage—SourceForge's SQL servers are pretty overloaded, and it was making this site run pretty slowly.

It took me a while to finish the transition, and there are still a few things I need to iron out, but I think it's ready enough for prime time. It's been a great transition: I finally got to rewrite that old theme using clean HTML and CSS. All the limitations I hated in my old XSLT-based site are gone as well, albeit with a good deal more pre-processing involved.

One of the things I've changed is how images are scaled: when there isn't enough screen space (such as when the window is narrowed), they'll all shrink to fit. This was important to me because I've grown to use Windows 7's snapping feature, and it's important that sites still work when using only half the screen. This actually revealed a bug in Google Chrome and perhaps other WebKit-based browsers, so hopefully that gets fixed soon.

Another thing I've started trying is to size floating images based on megapixels instead of simply a maximum width/height. This was simple to do and improves aesthetics by ensuring no image appears abnormally large compared to the others. So far I like the results.
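The idea in code (a hedged sketch; the function name and numbers are made up): scale both dimensions by the square root of a pixel budget divided by the image's area, so every image lands near the same megapixel count regardless of aspect ratio.

```cpp
#include <cmath>
#include <utility>

// Scale (width, height) so the result has at most `maxPixels` pixels,
// preserving aspect ratio. Images already under budget are untouched.
std::pair<int, int> fit_to_megapixels(int width, int height, double maxPixels)
{
    double area = static_cast<double>(width) * height;
    if (area <= maxPixels)
        return {width, height};

    // Area scales with the square of a linear factor, hence the sqrt.
    double scale = std::sqrt(maxPixels / area);
    return {static_cast<int>(width * scale + 0.5),
            static_cast<int>(height * scale + 0.5)};
}
```

A 4000×2000 image and a 2000×2000 image clamped to the same width would look very different in size; clamped to the same pixel budget, they occupy roughly the same visual area.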

Now that I'm mostly done with this, I should be able to write a lot more. Those promised resampling articles are coming, I swear!

Years ago when I was a fledgling still learning to code, one of the first things I tried creating was an image resizer. I preferred (and given time, still do) to just have a go at things without research so while I succeeded in making a resizer, the results were (predictably) poor. I soon got sidetracked and left the code to rot, but the idea remained.

It’s taken me over 10 years to find a reason to revisit the subject, but I finally have: gamma compression.

In HTML, the color #777 will have roughly half the intensity of #FFF. This matches our perception and makes working with color fairly easy, but the way light really works is much different. Our eyes perceive light on a logarithmic scale—twice the photons won’t appear twice as bright to us. Transforming the actual linear intensity into our familiar representation is called gamma compression.

When we blend two gamma-compressed colors, the result is not the same as if we blended two linear colors. To correctly blend colors, we must first uncompress them into their linear values. Take the gamma-compressed values 0.1 and 0.9. If we just add 0.1 to 0.9, we'll of course get 1.0: an 11% change in value. Doing it the correct way, we first decompress them into the linear values 0.01 and 0.79. Add 0.01 to 0.79, re-compress, and the result will be 0.905: a 0.5% change in value. Gamma-ignorant processing gave us a way-off result!
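Here's that arithmetic using a simple power-law gamma of 2.2 (an approximation; real sRGB uses a piecewise curve, but the effect is the same):

```cpp
#include <cmath>

const double gamma_exp = 2.2;

// Convert a gamma-compressed value to linear intensity and back.
double decompress(double v) { return std::pow(v, gamma_exp); }
double compress(double v)   { return std::pow(v, 1.0 / gamma_exp); }

// Gamma-ignorant: 0.1 + 0.9 = 1.0, an 11% jump from 0.9.
double naive = 0.1 + 0.9;

// Gamma-correct: decompress, add in linear light, re-compress.
// decompress(0.1) is about 0.006 and decompress(0.9) about 0.79,
// so `correct` comes out near 0.90: barely brighter than 0.9 alone.
double correct = compress(decompress(0.1) + decompress(0.9));
```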

Downsizing this 4800×2400 image presents a worst-case scenario for a gamma-ignorant resizer. The sharp contrast between the lights and surrounding darkness makes the blending error very prominent, and the massive downsizing gives it a chance to do a lot of blending.

At the top we see the result of gamma-correct resizing. This is how it's supposed to look—you can still see the lights along the African and Australian coasts. Western USA is clearly still very awake, as well as Europe and parts of Asia.

On the bottom we see the result of gamma-ignorant resizing. The fainter lights have been completely drowned out. Australia and Africa now barely register, and all the other continents look far darker overall. Big difference! The unfortunate thing is that the majority of resizers will produce the image on the bottom. The incorrect result is often good enough that their authors either don't notice or don't care.

One of these incorrect resizers is in Avisynth, a scripted video processor used for everything from simple DVD deinterlacing all the way to heavy restoration of old 8mm film. A while ago I was using it to resize a Quake video and discovered that, similar to the lights of the image above, all the starry teleporters lost their stars: a clear sign of gamma-ignorant processing.

I decided to make a high-quality resizing plugin for Avisynth that would use a fully linear, high bit depth pipeline. No shortcuts. Working on it has been a lot of fun and a good challenge, so I'll be writing about it here over the next few days.

Synchronous, copying from OS cache (fread). This is the simplest form of I/O, but isn't very scalable.

Synchronous, reading directly from OS cache (memory mapping). This is wicked fast and efficient once memory is filled, but aside from some cases with read-ahead, your threads will still block with page faults.

Asynchronous, copying from OS cache (ReadFile). Much more scalable than fread, but each read still involves duplicating data from the OS cache into your buffer. Fine if you're reading some data only to modify the buffer in place, but still not great when you're treating it as read-only (such as to send over a socket).

Asynchronous, maintaining your own cache (FILE_FLAG_NO_BUFFERING). More scalable still than ReadFile, but you need to do your own caching, and it's not shared with other processes.

Note that there's one important choice missing: memory mapping with asynchronous page faults. As far as I know there are no operating systems that actually offer this—it's kind of a dream feature of mine. Two hypothetical APIs would be needed to support it:

CreateMemoryManager opens a handle to the Windows memory manager, and MakeResident will fill the pages you specify (returning true for synchronous completion, false for error/async like everything else). The best of both worlds: fast, easy access through memory, a full asynchronous workflow, and shared cache usage. This would be especially useful on modern CPUs that offer gigantic address spaces.

The memory manager already has similar functionality in there somewhere, so it might not be difficult to pull into user mode. Just an educated guess. Maybe it'd be terribly difficult. Dream feature!

For all the cons of giving a single entity control over C#, one pro is that it gives the language an unmatched agility to try new things in the C family of languages. LINQ—both its language integration and its backing APIs—is an incredibly powerful tool for querying and transforming data with very concise code. I really can’t express how much I’ve come to love it.

The new async support announced at PDC10 is basically the holy grail of async coding, letting you focus on what your task is rather than on how you'll implement a complex async code path for it. It's an old idea that many async coders have come up with but that, as far as I know, has never been successfully implemented, simply because it required too much language support.

The lack of peer review and a standards committee for .NET shows—there's a pretty high rate of turnover as Microsoft tries to settle on the right way to tackle problems, and it results in a very large library with lots of redundant functionality. As much as this might hurt .NET, I'm starting to view C# as a sort of Boost for the C language family. Some great ideas are getting real-world use, and if other languages eventually feel the need for something similar, they will have a bounty of experience to pull from.

C++, at least, is a terrifyingly complex language. Getting new features into it is an uphill battle, even when they address a problem that everyone is frustrated with. Getting complex new features like these into it would be a very long process, with a lot of arguing and years of delay. Any extra incubation time we can give them is a plus.

I've had my Nook for about two months now, and have read about 8 books on it so far. It's a great little device that I've been unable to put down in my spare time. My only real gripe is that, like almost every other e-reader out there, it has such poor typesetting that you have to wonder if it was designed by software engineers who aren't big readers—it certainly provides the paper look, but it's still got a long way to go to provide a reading experience equal to a dead tree book.

To give an example, there are currently two types of books you can buy on the Nook: those that are left-aligned, and those that are justified.

Here’s the left alignment. It’s readable, but it wastes a lot of space on the right edge and isn’t very aesthetic. Next we have the justified version:

This kind of very basic justification is pretty poor looking—in attempting to create a nice straight margin, it adds a bunch of ugly space between words. This is the same kind of justification web browsers do, and almost nobody uses it because it looks so bad. I knew about this going in. Science fiction author Robert J. Sawyer wrote a very informative post about it, and I set about finding a way to solve this problem before my nook even arrived in the mail.

One great feature of the Nook is that it supports viewing PDFs, with and without re-flowing text. With re-flowing turned off, you're free to manually typeset the document any way you want, and it will appear exactly how you want on the screen. This is what I was looking for. With this you can use Microsoft Word to create a great-looking PDF. If you're feeling brave and want even better-looking text, professional typesetting software like Adobe InDesign provides more advanced features that will give fantastic results.

Here we can see proper hyphenation and justification. A good justification algorithm won’t just add more space between words, but will also be willing to shrink it just slightly if it will make things look better. You can see this in the first line of text, where it fits “adipiscing” in to remove the whitespace that plagued the text in the previous image. It also evenly spaces the entire paragraph at once instead of just a single line, and fits more text into each line by hyphenating words on line breaks.

It’s looking pretty good now, and is almost like a real book. But there’s a little more that can be done:

Can you spot the difference? I admit, without a keen eye and a little knowledge of typography, I wouldn't expect most people to. Here's an animation to help show the differences better:

There are two things happening here, one more subtle than the other. The first is optical margin adjustment. This improves aesthetics by adjusting each line's margins ever so slightly to reduce empty space, giving a flatter look to the edges. You can see it on the fifth line, where it compensates for the empty space under the V's left edge by moving it out a little bit. It's even more noticeable with punctuation on the right edge, where it pushes the periods and hyphens out into the margin.

The second thing happening is ligature substitution. Certain combinations of characters have similar fine details in the same spots and can look a little awkward together; ligatures make them look better by combining them into a single specialized glyph. You can see this in the middle left of the text, where "officae" is slightly altered—look closely at the "ffi" and you will see the three letters merged together, with the first "f" becoming a little smaller and the dot over the "i" merging with the second "f" to create a slightly larger overhang. Look in your favorite dead tree book and you'll probably find it in "ff" and "fi" combinations—it's pretty hard to notice without looking for it, but it is used to subtly improve legibility.

There is nothing about EPUBs that prevents e-readers from performing this kind of typesetting automatically—the fault is solely within the e-reader's software. With any luck, one day we'll get this nice look in all the e-books we can download. Until e-readers start to take typesetting seriously, though, the only way we'll get it is with PDFs. Unfortunately, most e-books aren't legally available without DRM, making this kind of dramatic alteration impossible with most of the stuff you can buy.

It's easy to say this isn't very important. After all, it doesn't affect functionality, right? You can still read the first picture! But when you're doing a lot of reading, it is important. Proper text flow reduces eye movement and takes less work for your brain to process, letting you read longer and faster. It also happens to fit significantly more text onto each page, which means fewer delays from page turns—in my experience, it reduces a book's page count by around 15%.

There are a lot of compelling reasons to get an e-reader. They can store thousands of books at once and remember your place in each one of them. You can look up unknown words in a dictionary, and bookmark stuff for later reference without drawing all over a page with a highlighter. You can browse through massive book stores and read bought books instantly without ever leaving your home. It baffles me that they don't bother to improve the single most important part of the device—reading!

One of my big pet peeves with ClearType prior to Windows 7 was that it only anti-aliased horizontally with sub-pixels. This is great for small fonts, because at such a small scale traditional anti-aliasing has a smudging effect, reducing clarity and increasing the font's weight. For large fonts, however, it introduces some very noticeable aliasing on curves, as best seen in the "6" and "g" here:

You’ve probably noticed this on websites everywhere, but have come to accept it. Depending on your browser and operating system, you can probably see it in the title here. This problem is solved in Windows 7 with the introduction of DirectWrite, which combines ClearType’s horizontal anti-aliasing with regular vertical anti-aliasing when using large font sizes:

Of course, DirectWrite affects more than just Latin characters. Any glyphs with very slight angles will see a huge benefit, such as hiragana:

Unfortunately, this isn't a free upgrade. For whatever reason, Microsoft didn't make the old GDI functions use DirectWrite's improvements, so to take advantage of them, all your old GDI and DrawText code will need to be upgraded to use Direct2D and DirectWrite directly—even a simple old WM_PAINT procedure needs rewriting.

This is no small change, and considering this API won’t work on anything but Vista and Windows 7, you’ll be cutting out a lot of users if you specialize for it. While you could probably make a clever DrawText wrapper, Direct2D and DirectWrite are really set up to get you the most benefit if you’re all in. Hopefully general libraries like Pango and Cairo will get updated backends for it.

DirectWrite has other benefits too, like sub-pixel rendering. When you render text in GDI, glyphs will always get snapped to pixels. If you have two letters side by side, it will choose to always start the next letter 1 or 2 pixels away from the last—but what if the current font size says it should actually be a 1.5 pixel distance? In GDI, this will be rounded to 1 or 2. This is also noticeable with kerning, which tries to remove excessive space between specific glyphs such as “Vo”. Because of this, most of the text you see in GDI is very slightly warped. It’s much more apparent when animating, where it causes the text to have a wobbling effect as it constantly snaps from one pixel to the next instead of smoothly transitioning between the two.

DirectWrite’s sub-pixel rendering helps to alleviate this by doing exactly that: glyphs can now start rendering at that 1.5 pixel distance, or any other point in between. Here you can see the differing space between the ‘V’ and ‘o’, as well as a slight difference between the ‘o’s with DirectWrite on the right side, because they are being rendered on sub-pixel offsets:

The difference between animating with sub-pixel rendering and without is staggering when we view it in motion:

Prior to DirectWrite the normal way to animate like this was to render to a texture with monochrome anti-aliasing (that is, without ClearType), and transform the texture while rendering. The problem with that is the transform will introduce a lot of imperfections without expensive super-sampling, and of course it won’t be able to use ClearType. With DirectWrite you get pixel-perfect ClearType rendering every time.

Apparently WPF 4 is already using Direct2D and DirectWrite to some degree; hopefully there will be high-quality text in Flash's future as well. Firefox has also been looking at adding DirectWrite support, but I haven't seen any news of WebKit (Chrome/Safari) or Opera doing the same. It looks like Firefox might actually get it in before Internet Explorer. Edit: looks like Internet Explorer 9 will use DirectWrite—I wonder which will go gold with the feature first?

A while ago I wrote about creating a good parser, and while the non-blocking idea was spot-on, the rest of it really doesn't hold up in C++, where we have the power of templates to help us.

I'm currently finishing up an HTTP library and have been revising my views on stream parsing because of it. I'm still not entirely set on my overall implementation, but I'm nearing completion and am ready to share my ideas. First, I'll list my requirements:

I/O agnostic: the parser does not call any I/O functions and does not care where the data comes from.

Pull parsing: expose a basic stream of parsed elements that the program reads one at a time.

Non-blocking: when no more elements can be parsed from the input stream, it must immediately return something indicating that instead of waiting for more data.

In-situ reuse: for optimal performance and scalability the parser should avoid copying and allocations, instead re-using data in-place from buffers.

A simple, easy to follow parser: having the parser directly handle buffers can easily lead to spaghetti code, so I’m simply getting rid of that. The core parser must operate on a single iterator range.

To accomplish this I broke this out into three layers: a core parser, a buffer, and a buffer parser.

The core parser

Designing the core parser was simple. I believe I already have a solid C++ parser design in my XML library, so I'm reusing that concept. It is a fully in-situ pull parser that operates on a range of bidirectional iterators and returns a sub-range of those iterators. The pull function returns ok when it parses a new element, done when it has reached a point that could be considered an end of the stream, and need_more when an element can't be extracted from the passed-in iterator range. Using this parser is pretty simple:

By letting the user pass in a new range of iterators to parse each time, we have the option of updating the stream with more data when need_more is returned. The parse() function also updates the first iterator so that we can pop any data prior to it from the data stream.
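To give a feel for that loop, here's a toy version (a deliberately simplified stand-in, not the real http::parser, which returns a sub-range instead of copying): it pulls newline-terminated elements from a bidirectional iterator range and advances the first iterator past consumed data.

```cpp
#include <deque>
#include <string>

// Result of a pull: a new element, the end of the stream, or a
// request for more data. The toy below never reaches `done`.
enum class parse_result { ok, done, need_more };

// Pull one newline-terminated element out of [first, last).
// On ok, `first` is advanced past the consumed data so the caller
// can pop everything before it from the data stream.
template<typename Iterator>
parse_result parse(Iterator& first, Iterator last, std::string& element)
{
    for (Iterator it = first; it != last; ++it) {
        if (*it == '\n') {
            element.assign(first, it);   // the real parser returns a range here.
            first = ++it;
            return parse_result::ok;
        }
    }
    return parse_result::need_more;      // no complete element yet.
}
```

A caller keeps pulling until need_more comes back, then appends fresh data (say, into a std::deque<char>) and resumes with a new iterator range.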

By default the parser will throw an exception when it encounters an error. This can be changed by calling an overload and handling the error result type:

The buffer

Initially I was testing my parser with a deque<char> like above. This let me test the iterator-based parser very easily by incrementally pushing data on, parsing some of it, and popping off what was used. Unfortunately, using a deque means we always have an extra copy, from an I/O buffer into the deque. Iterating a deque is also a lot slower than iterating a range of pointers because of the way deque is usually implemented. This inefficiency is acceptable for testing, but just won't work in a live app.

My buffer class is I/O- and parsing-optimized, operating on pages of data. It allows pages to be inserted directly from I/O without copying. Ones that weren't filled entirely can still be filled later, allowing the user to commit more bytes of a page as readable. One can use scatter/gather I/O to make operations span multiple pages contained in a buffer.

The buffer exposes two types of iterators. The first type is what we are used to in deque: just a general byte stream iterator. But this incurs the same cost as deque: each increment to the iterator must check if it's at the end of the current page and move to the next. A protocol like HTTP can fit a lot of elements into a single 4KiB page, so it doesn't make sense to have this cost. This is where the second iterator comes in: the page iterator. A page can be thought of as a Range representing a subset of the data in the full buffer. Overall the buffer class looks something like this:

One thing you may notice is it expects you to push() and pop() pages directly onto it, instead of allocating its own. I really hate classes that allocate memory – in terms of scalability the fewer places that allocate memory, the easier it will be to optimize. Because of this I always try to design my code to – if it makes sense – have the next layer up do allocations. When it doesn't make sense, I document it. Hidden allocations are the root of evil.

The buffer parser

Unlike the core parser, the buffer parser isn't a template class. The buffer parser exposes the same functionality as a core parser, but using a buffer instead of iterator ranges.

This is where C++ gives me a big advantage. The buffer parser is actually implemented with two core parsers. The first is a very fast http::parser&lt;const char*&gt;. It uses this to parse as much of a single page as possible, stopping when it encounters need_more and no more data can be added to the page. The second is an http::parser&lt;buffer::iterator&gt;. This gets used when the first parser stops, which will happen very infrequently – only when an HTTP element spans multiple pages.

This is fairly easy to implement, but required a small change to my core parser concept. Because each has separate internal state, I needed to make it so I could move the state between two parsers that use different iterators. The amount of state is actually very small, making this a fast operation.

The buffer parser works with two different iterator types internally, so I chose to always return a buffer::iterator range. The choice was either that or silently copy elements spanning multiple pages, and this way lets the user of the code decide how they want to handle it.

The I/O layer

I'm leaving out an I/O layer for now. I will probably write several small I/O systems for it once I'm satisfied with the parser: perhaps one using asio, one using I/O completion ports, and one using epoll. I've designed this from the start to be I/O agnostic, but with optimizations that facilitate efficient forms of all I/O, so I think it could be a good benchmark of the various I/O subsystems that different platforms provide.

One idea I've got is to use Winsock Kernel to implement a kernel-mode HTTPd. Not a very good idea from a security standpoint, but would still be interesting to see the effects on performance. Because the parser performs no allocation, no I/O calls, and doesn't force the use of exceptions, it should actually be very simple to use in kernel-mode.

Bjarne Stroustrup and Herb Sutter have both reported on the ISO C++ meeting in Frankfurt a week ago, in which the much-heralded "concepts" feature was removed from C++1x.

Concepts are a powerful feature aimed at improving overloading (basically, removing the extra work in using things like iterator categories) and moving type checking up the ladder so that more reasonable error messages can be produced when a developer passes in the wrong type (think a single error line instead of multiple pages of template crap). Apparently the feature was a lot less solid than most of us thought, with a huge amount of internal arguing within the committee on a lot of the fundamental features of it. It seems that while most agreed concepts were a good idea, nobody could agree on how to implement them.

I'm definitely disappointed by this, but I'm also glad they chose to remove concepts instead of further delaying the standard or, worse, putting out a poorly designed one. Instead, it seems like there is hope for a smaller C++ update to come out in 4–5 years that adds a better-thought-out concepts feature. There are plenty of other C++1x language features to be happy about in the meantime, though, like variadic templates, rvalue references, and lambda functions!

You may notice I've been saying C++1x here instead of C++0x—that's because it's pretty obvious to everyone now that we won't be getting the next C++ standard in 2009, but more likely 2011 or 2012. Just in time for the end of the world!

My tips for efficient I/O are relevant all the way back to coding for Windows 2000. A lot of time has passed since then though, and for all the criticism it got, Windows Vista actually brought in a few new ways to make I/O even more performant than before.

This will probably be my last post on user-mode I/O until something new and interesting comes along, completing what started a couple weeks ago with High Performance I/O on Windows.

Synchronous completion

In the past, non-blocking I/O was a great way to reduce the stress on a completion port. An unfortunate side effect of this was an increased number of syscalls—the last non-blocking call you make will do nothing, only returning WSAEWOULDBLOCK. You would still need to call an asynchronous version to wait for more data.

Windows Vista solved this elegantly with SetFileCompletionNotificationModes. This function lets you tell a file or socket that you don't want a completion packet queued up when an operation completes synchronously (that is, a function returned success immediately and not ERROR_IO_PENDING). Using this, the last I/O call will always be of some use—either it completes immediately and you can continue processing, or it starts an asynchronous operation and a completion packet will be queued up when it finishes.

Like the non-blocking I/O trick, continually processing synchronous completions can starve other operations on a completion port if a socket or file feeds data faster than you can process it. Care should be taken to limit the number of times you continue processing synchronous completions in a row.

Reuse memory with file handles

If you want to optimize even more for throughput, you can associate a range of memory with an unbuffered file handle using SetFileIoOverlappedRange. This tells the OS that a block of memory will be re-used, and should be kept locked in memory until the file handle is closed. Of course if you won't be performing I/O with a handle very often, it might just waste memory.

Dequeue multiple completion packets at once

A new feature to further reduce the stress on a completion port is GetQueuedCompletionStatusEx, which lets you dequeue multiple completion packets in one call.

If you read the docs for it, you'll eventually realize it only returns error information if the function itself fails—not if an async operation fails. Unfortunately this important information is missing from all the official docs I could find, and searching gives nothing. So how do you get error information out of GetQueuedCompletionStatusEx? Well, after playing around a bit I discovered that you can call GetOverlappedResult or WSAGetOverlappedResult to get it, so not a problem.

This function should only be used if your application has a single thread or handles a large number of concurrent I/O operations; otherwise you might end up defeating the multithreading baked into completion ports by not letting them spread completion notifications around. If you can use it, it's a nice and easy way to boost the performance of your code by lowering contention on a completion port.

Bandwidth reservation

One large change in Windows Vista was I/O scheduling and prioritization. If you have I/O that is dependent on steady streaming, like audio or video, you can now use SetFileBandwidthReservation to help ensure it will never be interrupted by something else hogging the disk.

Cancel specific I/O requests

A big pain pre-Vista was the inability to cancel individual I/O operations. The only option was to cancel all operations for a handle, and only from the thread which initiated them.

If it turns out some I/O operation is no longer required, it is now possible to cancel individual I/Os using CancelIoEx. This much needed function replaces the almost useless CancelIo, and opens the doors to sharing file handles between separate operations.