A More Powerful BinaryReader/Writer

Extending BinaryReader/Writer to support a different byte order, string and date formats, and advanced navigation in binary files

Introduction

This article discusses how to extend the functionality of the standard .NET BinaryReader and BinaryWriter classes with a number of new, generally applicable and useful features.

The API discussed in this article has been wrapped into a NuGet package so you can easily use it in your existing .NET 4.5 projects.

A separate wiki for the NuGet package, focusing more on the implementor's side, can be found in the GitLab repository; this article additionally discusses how the package evolved and what I had to consider when writing its internals.

Background

Whenever I wrote a library to handle loading and saving a custom binary file format - sometimes complex proprietary formats of other companies - I used the standard .NET BinaryReader and BinaryWriter classes to parse the binary data out of these files.

However, the more complex the formats got, the more I missed pretty generic tasks and features in these .NET classes. Especially, I was looking for the following functionality:

Handle data stored in a different byte order than the one of the executing machine.

Understand strings not stored in the .NET format, for example, 0-terminated strings.

Simple reading and storing of repetitive data types (e.g., loading 12 Int32s), without writing for loops over and over.

One-off reading of strings with a different encoding than the one specified for the whole reader / writer instance.

Easier navigation in the file, like temporarily seeking to an offset and back again, or aligning to a block size.

At first, I wrote extension methods appending new features to the existing BinaryReader and BinaryWriter classes. However, this was insufficient for implementing the behavior of reading data in a byte order different from that of the system executing the code. This eventually led me to create new classes named BinaryDataReader and BinaryDataWriter, inheriting from the .NET ones.

Let's see how I realized the different aspects listed above, and have a look at how to use them from the implementor's side.

Implementation & Usage

Byte Order

Implementing the support for reading multi-byte data in a different byte order than the one of the machine which is executing the code required a lot of changes to the standard .NET class.

Remember that BitConverter does not define a fixed endianness for the data it is working with; it converts values in the byte order of the machine the code runs on, so it might produce or parse data in either big or little endian depending on the architecture. (BinaryReader and BinaryWriter themselves, by contrast, are documented to always use little endian.)

So at first, it was important to detect the system's byte order correctly. This is trivial by simply checking the System.BitConverter.IsLittleEndian field:
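In code, that check could look as follows. The ByteOrder enumeration shown here is a sketch of the one used by the reader / writer classes; its underlying values are an assumption (chosen to mirror the Unicode byte order mark), so the package's actual definition may differ:

```csharp
using System;

// Sketch of the ByteOrder enumeration; the underlying values are an
// assumption, mirroring the Unicode byte order mark.
enum ByteOrder : ushort
{
    BigEndian = 0xFEFF,
    LittleEndian = 0xFFFE
}

static class SystemByteOrder
{
    // Detect the byte order of the executing machine.
    public static ByteOrder Get() =>
        BitConverter.IsLittleEndian ? ByteOrder.LittleEndian : ByteOrder.BigEndian;
}
```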

To make the reader / writer respect a specific byte order, I introduced a ByteOrder property to which one of those enumeration values can be assigned. The implementation checks whether the system runs in a byte order different from the one set, and reverses the read / written bytes if required:
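Stripped down to the essentials, that check can be sketched like this (the names are illustrative, not the package's exact internals):

```csharp
using System;

static class EndianHelper
{
    // True when the requested byte order differs from the machine's own.
    static bool NeedsReversion(bool wantLittleEndian) =>
        wantLittleEndian != BitConverter.IsLittleEndian;

    // Convert 4 bytes to an Int32, honoring the requested byte order.
    // Note: reverses the passed buffer in place for brevity.
    public static int ToInt32(byte[] buffer, bool wantLittleEndian)
    {
        if (NeedsReversion(wantLittleEndian))
            Array.Reverse(buffer);
        return BitConverter.ToInt32(buffer, 0);
    }
}
```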

This worked out really well with the help of the BitConverter.GetBytes and BitConverter.ToXxx methods, allowing me to retrieve the bytes of multi-byte values or convert bytes back into multi-byte values. Decimal values were an oddball, however, requiring some manual byte conversion, with the code from here taken as the basis.
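The idea behind the manual decimal handling can be sketched with decimal.GetBits and the decimal(int[]) constructor, which expose the four Int32 parts a decimal consists of (a simplified stand-in, not the package's exact code):

```csharp
using System;

static class DecimalBytes
{
    // Serialize a decimal to its 16-byte representation (four Int32 parts).
    public static byte[] GetBytes(decimal value)
    {
        int[] parts = decimal.GetBits(value); // lo, mid, hi, flags
        byte[] bytes = new byte[16];
        for (int i = 0; i < 4; i++)
            BitConverter.GetBytes(parts[i]).CopyTo(bytes, i * 4);
        return bytes;
    }

    // Restore a decimal from the 16 bytes written above.
    public static decimal ToDecimal(byte[] bytes)
    {
        int[] parts = new int[4];
        for (int i = 0; i < 4; i++)
            parts[i] = BitConverter.ToInt32(bytes, i * 4);
        return new decimal(parts);
    }
}
```

Reversal for a foreign byte order would then be applied on top of these bytes, in whatever way the format at hand defines it.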

Usage

By default, the BinaryDataReader/Writer uses the system byte order.

You simply assign a ByteOrder enumeration value to the ByteOrder property of a BinaryDataReader/Writer - even between calls to Read* or Write(T) - to switch the byte order whenever you want:
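For example, with a buffer containing the value 1 first in big and then in little endian (a sketch against the API described in this article; exact overloads may differ in the current package):

```csharp
using System;
using System.IO;
using Syroot.BinaryData; // namespace of the NuGet package

byte[] data = { 0x00, 0x00, 0x00, 0x01,   // 1 in big endian
                0x01, 0x00, 0x00, 0x00 }; // 1 in little endian

using (MemoryStream stream = new MemoryStream(data))
using (BinaryDataReader reader = new BinaryDataReader(stream))
{
    reader.ByteOrder = ByteOrder.BigEndian;
    int first = reader.ReadInt32();       // 1
    reader.ByteOrder = ByteOrder.LittleEndian;
    int second = reader.ReadInt32();      // 1
}
```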

Repetitive Data Types

When working with 3D file formats, I often had to read transformation matrices, which are 16 floats, one after another. Of course, I could write a highly specific "ReadMatrix" method, but I wanted to keep it reusable and added methods like ReadSingles / Write(T[]), to which you pass the count of values you want to read or write. Internally, they just run a for loop and call the method with the corresponding singular name.
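A sketch of that pattern with the plain BinaryReader (the package's ReadInt32s / ReadSingles work the same way internally):

```csharp
using System;
using System.IO;

// Serialize four Int32 values, then read them back in one call.
byte[] data;
using (MemoryStream stream = new MemoryStream())
using (BinaryWriter writer = new BinaryWriter(stream))
{
    foreach (int value in new[] { 10, 20, 30, 40 })
        writer.Write(value);
    writer.Flush();
    data = stream.ToArray();
}

int[] values;
using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
{
    values = ReadInt32s(reader, 4); // { 10, 20, 30, 40 }
}

// The repetitive overloads simply loop over the singular methods.
static int[] ReadInt32s(BinaryReader reader, int count)
{
    int[] result = new int[count];
    for (int i = 0; i < count; i++)
        result[i] = reader.ReadInt32();
    return result;
}
```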

Different String Formats

Most of the file formats I worked with used 0-terminated strings, i.e., strings without any length prefix which simply end as soon as a byte with the value 0 has been read. So I added overloads of the ReadString (or Write(string)) methods to which you can pass a value of the BinaryStringFormat enumeration, supporting the following representations:

ByteLengthPrefix: The string has a prefix of an unsigned byte, determining the number of the following characters.

WordLengthPrefix: The string has a prefix of a signed two-byte value (i.e., Int16 / short), determining the number of the following characters.

DwordLengthPrefix: The string has a prefix of a signed four-byte value (i.e., Int32 / int), determining the number of the following characters.

ZeroTerminated: The string has no prefix, but ends on the first encountered byte with the value 0.

NoPrefixOrTermination: The string has neither a prefix nor a postfix, and the length must be known to read it.

Usage

Whenever you want to read or write a string in the according format, use the corresponding method overloads:
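With the package, reading a zero-terminated string is a call like reader.ReadString(BinaryStringFormat.ZeroTerminated). What happens internally can be sketched with a plain BinaryReader - note this simplified version only handles single-byte encodings:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// "ABC\0" followed by further data; read up to the 0 byte.
byte[] data = { 0x41, 0x42, 0x43, 0x00, 0x99 };

string text;
using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
{
    text = ReadZeroTerminatedString(reader, Encoding.ASCII); // "ABC"
}

// Sketch of the ZeroTerminated handling: collect bytes until a 0 is read.
static string ReadZeroTerminatedString(BinaryReader reader, Encoding encoding)
{
    var bytes = new List<byte>();
    byte b;
    while ((b = reader.ReadByte()) != 0)
        bytes.Add(b);
    return encoding.GetString(bytes.ToArray());
}
```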

Since NoPrefixOrTermination requires you to know the number of characters you want to read, it is handled by a dedicated ReadString overload taking the length; you cannot pass this format to the overloads accepting a BinaryStringFormat enumeration value.

One-time String Encoding

The default .NET BinaryReader/Writer classes allow you to specify a string encoding in the constructor, but they don't allow you - for example - to write a one-off ASCII string for an instance you created with UTF8 encoding. I added overloads to override this encoding by simply passing it to the ReadString or Write(string) calls.

The encoding of a standard .NET reader or writer cannot be changed at runtime. In fact, it cannot even be retrieved after creating the instances. My inherited classes remember that encoding though and it can be queried - but not set - through the Encoding property.

Usage

Simply pass the one-off encoding to the ReadString or Write(string) methods:
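A sketch of what such an override boils down to, shown with a plain BinaryReader: the prefix is read as usual, but the raw bytes are decoded with the passed encoding instead of the reader's own (for simplicity, the prefix counts bytes here rather than characters):

```csharp
using System;
using System.IO;
using System.Text;

// A byte-length-prefixed string encoded in UTF-16 LE, inside a stream that
// is otherwise treated as ASCII: 6 bytes follow, spelling "abc".
byte[] data = { 0x06, 0x61, 0x00, 0x62, 0x00, 0x63, 0x00 };

string text;
using (BinaryReader reader = new BinaryReader(new MemoryStream(data), Encoding.ASCII))
{
    text = ReadPrefixedString(reader, Encoding.Unicode); // "abc"
}

// One-off encoding override: decode the raw bytes with the passed encoding.
static string ReadPrefixedString(BinaryReader reader, Encoding encoding)
{
    byte length = reader.ReadByte();
    return encoding.GetString(reader.ReadBytes(length));
}
```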

Different Date / Time Formats

Not only strings have different binary representations; DateTime instances can also be stored in different, common ways. These mostly differ in when the tick with index 0 occurred and how granular the ticks are, which also determines the minimum and maximum representable DateTime value. Right now, the API only supports a few representations, which the BinaryDateTimeFormat enumeration lists:
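Sketched as the enumeration might look - the member names and storage details are assumptions based on the package, so check its documentation for the current set:

```csharp
// Sketch of the BinaryDateTimeFormat enumeration (names assumed).
enum BinaryDateTimeFormat
{
    // .NET ticks: 100-nanosecond intervals since 0001-01-01 00:00:00,
    // stored as the Int64 returned by DateTime.Ticks.
    NetTicks,

    // C time_t: whole seconds since the Unix epoch 1970-01-01 00:00:00 UTC,
    // stored as a UInt32.
    CTime
}
```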

Usage

As you might have guessed, this enumeration can be used like the string methods accepting a BinaryStringFormat. The new methods are ReadDateTime(BinaryDateTimeFormat) / Write(DateTime value, BinaryDateTimeFormat):
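For example, for the Unix time_t representation you would call reader.ReadDateTime(BinaryDateTimeFormat.CTime). The conversion behind it can be sketched with a plain BinaryReader as a stand-in:

```csharp
using System;
using System.IO;

// 4 bytes of Unix seconds (little endian): 0x00000000 = the epoch itself.
byte[] data = { 0x00, 0x00, 0x00, 0x00 };

DateTime value;
using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
{
    value = ReadCTime(reader); // 1970-01-01 00:00:00 UTC
}

// Sketch of the CTime conversion: seconds since the Unix epoch.
static DateTime ReadCTime(BinaryReader reader)
{
    var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
    return epoch.AddSeconds(reader.ReadUInt32());
}
```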

Advanced Stream Navigation

Another common task not covered at all in the default BinaryReader/Writer classes is temporarily seeking to another position, fetching or storing data there, and then going back to the previous position.

I implemented temporary seeking with the using / IDisposable pattern. When you call TemporarySeek(long), it returns an instance of the SeekTask class and immediately moves the stream position to the one you specified. When the instance is disposed, the stream returns to the position it had before the seek.
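The pattern can be sketched in a self-contained way with a minimal SeekTask stand-in (the exact signatures of the package's class may differ):

```csharp
using System;
using System.IO;

// File layout: [offset:4][padding:4][128 payload bytes].
byte[] file = new byte[4 + 4 + 128];
file[0] = 8; // offset 8 (little endian) -> points at the payload

long positionAfter;
using (BinaryReader reader = new BinaryReader(new MemoryStream(file)))
{
    uint offset = reader.ReadUInt32();           // read the offset (8)
    using (new SeekTask(reader.BaseStream, offset))
    {
        byte[] data = reader.ReadBytes(128);     // fetch data at the offset
    }
    positionAfter = reader.BaseStream.Position;  // back after the offset: 4
}

// Sketch of the SeekTask class: remembers the previous position, seeks on
// construction, and seeks back when disposed at the end of the using block.
class SeekTask : IDisposable
{
    private readonly Stream _stream;
    private readonly long _previousPosition;

    public SeekTask(Stream stream, long position)
    {
        _stream = stream;
        _previousPosition = stream.Position;
        stream.Seek(position, SeekOrigin.Begin);
    }

    public void Dispose() => _stream.Seek(_previousPosition, SeekOrigin.Begin);
}
```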

This first reads an offset from the file itself, then seeks to that offset read to fetch 128 more bytes from there. After that, the using block ends and the stream returns to the position after the offset originally read.

Of course, you can also seek to absolute offsets; reading an offset from the file first is just a common pattern seen in many file formats.

Block Alignment

Several file formats are highly optimized to be loaded quickly by the hardware they will run on, and thus organize their data in blocks of a specific size in bytes. Some finicky calculation is required to seek to the start of the next block from the current position, but the BinaryDataReader/Writer classes simply wrap it for you in one call to which you pass the size of your blocks.
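The calculation itself is short - one common way is to round the position up to the next multiple of the block size (whether the package keeps an already-aligned position or advances a full block is an implementation detail; this sketch keeps it):

```csharp
using System;

long aligned = Align(0x1F4, 0x200); // rounds 0x1F4 up to 0x200

// Round a stream position up to the next multiple of alignment.
// Positions already on a boundary stay where they are.
static long Align(long position, long alignment) =>
    (position + alignment - 1) / alignment * alignment;
```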

Usage

Let's say you know your file is organized in blocks of 0x200 bytes size. Use Align as follows:

using (MemoryStream stream = new MemoryStream(...))
using (BinaryDataReader reader = new BinaryDataReader(stream))
{
    string header = reader.ReadString(4);
    reader.Align(0x200); // Seek to the start of the next block of 0x200 bytes size.
}

Shortcuts to Stream Properties

Some important stream properties and methods like Length, Position or Seek are a little buried in the default .NET BinaryReader/Writer, since you have to access them through BaseStream. My classes forward them, so you can access them directly on the reader / writer instance.

/// <summary>
/// Gets or sets the position within the current stream. This is a shortcut to the base stream Position
/// property.
/// </summary>
public long Position
{
    get { return BaseStream.Position; }
    set { BaseStream.Position = value; }
}

Points of Interest

Optimizing the performance of the reader and writer is surely a high priority, and I did the best I could without resorting to unsafe code. Maybe someone knows ways to optimize it even further - I'm eager to learn the tricks that can speed up binary data handling!

Don't forget to check out the NuGet package if you want to start using discussed features right away (note that the API has changed quite a bit over the years, and you can find the new documentation here).

Comments and Discussions

I was thinking though: the endianness idea means that each time a numerical value is read (or written), a logical branching operation occurs to check if it needs reversal. For one-off reads/writes, or a handful of them, it is definitely no big deal.

However, what if you're reading 1000s of these in one go? With your current implementation, it checks the endianness for each and every one of those reads. Not sure if the compiler / JIT is "smart" enough to optimize the checks away in this case, though I doubt it very much.

I was thinking you could implement the endianness handling as a contained helper class instead. E.g., instead of just defining the _needsReversion value, have a contained helper class handle it. A static constructor then instantiates one of two such classes (one for direct byte order and the other for reversed order), and all your Read* and Write(T) functions simply pass the value through the instantiated class. No more need for each of them to check the _needsReversion value.

Of course, just how much (if any) of an improvement this would make needs some testing. And my guess is it would be minimal for small amounts of values, and only linearly scaling for larger amounts. Still, it may be worth investigating.

This is actually done since the 4.x version of the Syroot.BinaryData library. You can have a look at BinaryStream.cs in the repository, where you see a ByteConverter passed to the BinaryStream instance, which is reused for all endian-dependent calls and simply uses its implementations to reverse data or not.
There are static properties to retrieve Big or Little ByteConverter instances, and one representing the system endianness (as set at startup), which of course does not do much other than converting bytes to value types. You can find all implementations here.

However, in the newer Syroot.BinaryData.Memory library working with Spans, I retired this approach in favor of an if-else on a boolean again, mostly to be able to use the new functionality of BinaryPrimitives.ReverseEndianness, but also because the virtual call "overhead" did not really make it faster.

Sorry, I was starting to go through the source, but didn't have time to peruse it all. My comment was only related to the article.

Yep, that does seem more than adequate! I'll do a bit more reading the source when I'm less congested, but so far I cannot find any "glaring" possibilities for improvement. Once again, great job with this!

First of all, very nice work!
One suggestion for the repetitive data types solution is to read a full block of bytes, depending on the data type you want to read.
Example:
You want to read 16 Int32 values. You can use BinaryReader.ReadBytes(sizeof(Int32)*16) and convert the resulting array in a for loop or any way you want.
I am not sure if and how much the speedup is in C#, but it is huge in other languages, e.g. Matlab, depending of course on how often you read blocks of data and how large these blocks are.
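For reference, the suggested block-read approach could be sketched like this (endianness handling left aside):

```csharp
using System;
using System.IO;

// Read 16 Int32 values with a single ReadBytes call, then convert.
byte[] data = new byte[sizeof(int) * 16]; // all zero here for brevity

int[] values = new int[16];
using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
{
    byte[] block = reader.ReadBytes(data.Length); // one stream access
    for (int i = 0; i < values.Length; i++)
        values[i] = BitConverter.ToInt32(block, i * sizeof(int));
}
```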

Yeah, that's true. I did not implement it this way yet because of endianness. It would make iterating over the array kind of useless if you just go in and pick out parts to reverse them again; better to rewrite the data inside the buffer first. It sure is a thing I should test for performance in the future.

Dang, I should've uploaded it earlier... it was copied around between my projects over the past decade and got more 'mature' and complete over time, until I was bored enough to polish it last weekend and finally make a NuGet package out of it.

However, due to a shortcoming in the BitConverter class, this value is false the first time you query it. The reason for this basically boils down to "static constructors are evil". The public, static IsLittleEndian field is initialized after the static constructor of the BitConverter class is run, but it might not be run at the time when you query that value.

I suggest you re-read the linked blog post, and the associated StackOverflow thread. The behaviour you're trying to avoid only applies when you access the field through the debugger - eg: in Visual Studio's Immediate window, Watch window, Locals window, etc.

When you access the field in code, the static constructor is guaranteed to run before the value of the field is returned. There is no need to call a method on the class to get the field initialized.


Yeah, you're right. After reading the article again, I realized my code (that spot is already a few years old) was the result of a misunderstanding. I already fixed it on the project's wiki, and will update the article, including the package, ASAP.