I’ve recently been looking into different serialization options. While there are plenty of writeups already available (even for C#), I wanted to:

- Have one about C#
- Learn something new 😉
- Look into my particular data distribution / characteristics
- Understand not only the performance impact, but also the size impact

I’d say there are very few summaries you should just rely on. With some exceptions, serialization is generally pretty fast, and the choice you make also comes with the following considerations:

- Do you need it to be ‘human readable’? Or is it enough to have a tool that presents the serialized form for you?
- Do you need tagged or untagged serialization? Untagged serialization can be noticeably faster and smaller. What if even the field names aren’t preserved in the serialized output? If you don’t have a schema agreed upon ahead of time, some consumers of your data might not know how to interpret it (e.g. data visualization tools).
- Do you care about maximum compactness of the data representation?
- How much do you really care about performance? If you are going to send the data ‘over the wire’, chances are that any of the serializers will be ‘fast enough’.
- Do you need cross-language support? Do you need code-gen for those languages?

In general, picking a serializer means picking your personal favorite. TL;DR: don’t use XML. If you need a human-readable format, use JSON. It’s easy to deal with when schemaless (although dates can pose some challenges). In C#, avoid JavaScriptSerializer, DataContractSerializer and DataContractJsonSerializer if you can. They are on the slow side.

Goals

I want to determine the time and size impact of different serializers. I am more concerned about size, assuming that time will be about the same for the majority of serializers. The goal of this post is not to document the differences between the serialization stacks themselves (analysis of languages, APIs, etc.).

Serializers in the set

Almost all of the serializers in this list support some kind of RPC on their own; I’ll skip that part in this analysis.

Newtonsoft.Json – a good JSON serializer. It supports the ‘DataContract’ attributes from System.Runtime.Serialization, and it also supports BSON. (todo: BSON)

Thrift – a fully fledged, cross-language/platform RPC library (service development), used (among others) by Twitter and Salesforce (often combined with Finagle). It has a language-agnostic data (and service) definition layer, which then transpiles to the specific language of your choice. Various serialization options are available.

Avro – a data serialization system that also provides an RPC layer if needed. It relies on schemas (embedded with the message/data), but code does not have to be generated (unlike Thrift and Protobuf).

Binary formatter – the built-in binary serializer for .NET. You’d probably never use it in production, as it doesn’t really provide any backward compatibility in case of schema changes.

About the project

I used Benchmark.Net for the performance experiments. While it puts some constraints on code layout, it not only measures performance (and correctly prepares the measurement, e.g. warm-up), it also measures approximate allocations and garbage collections.
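For reference, a Benchmark.Net (BenchmarkDotNet) harness is roughly shaped like this. The type and helper names (`MyPayload`, `ProtoHelper`) are illustrative placeholders, not my exact code; only the attributes and runner call are the library’s real API:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]  // reports approximate allocations and Gen 0/1/2 collections
public class SerializationBenchmarks
{
    // Test object and serializer state are created once, before measurement.
    private readonly MyPayload _payload = MyPayload.CreateLarge();  // illustrative

    [Benchmark]
    public byte[] SerializeProto() => ProtoHelper.Serialize(_payload);  // illustrative

    public static void Main() => BenchmarkRunner.Run<SerializationBenchmarks>();
}
```

BenchmarkDotNet runs each `[Benchmark]` method many times and reports mean, standard deviation, and (with `MemoryDiagnoser`) the allocation columns you’ll see in the results table below.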

If a serializer cannot use the standard data type, I use AutoMapper to map from my original type to that serializer’s type. Since some serializers don’t handle nulls, at some point I decided not to have nulls in my properties.

The test data

I decided to test two scenarios. One is an object that contains 4 strings. The other contains binary data (in my case the ‘binary’ is HTML). It is meant to represent content fetched by a web fetcher, so the data contains a Url (as string), a Response Header (as text, since it’s ANSI), and Content (as byte[]), since the fetcher itself might not know what encoding to apply.
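As a rough sketch (the class and property names here are illustrative, not the exact types from my project), the fetched-content scenario looks something like this:

```csharp
using System;

// Illustrative shape of the larger test object: a fetched web page.
public class FetchedPage
{
    public string Url { get; set; }             // the fetched address
    public string ResponseHeader { get; set; }  // header text (ANSI)
    public byte[] Content { get; set; }         // raw body; encoding unknown to the fetcher
}

class Demo
{
    static void Main()
    {
        var page = new FetchedPage
        {
            Url = "http://example.com/",
            ResponseHeader = "HTTP/1.1 200 OK",
            Content = new byte[] { 0x3C, 0x68, 0x74, 0x6D, 0x6C, 0x3E }  // "<html>"
        };
        Console.WriteLine(page.Content.Length);  // 6
    }
}
```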

I generated two objects which I use across all tests. Both objects are generated before the tests are performed. All instances of the serializers are also created before the tests begin – I assume the one-time creation cost is negligible (even if it’s one-time per type).

What I didn’t test is how these serializers handle nested objects, cycles, etc. While all of them work fine with nested objects, they differ in cycle handling, and some of them are configurable in that regard. Note that cycle (and reference) handling almost always has an additional performance impact, hence it was out of scope.

Size & compressability

One of the goals is to save space. It turns out that you cannot get much better than untagged serialization that supports binary arrays as a first-class citizen. Of course, there is an open question of how inheritance is implemented (if supported), but that’s outside the scope of this document.

Let’s take a look at the basic object (4 strings). The first column is the uncompressed size; the second uses DeflateStream to compress the data. The third column is the size of the object as a percentage of the largest one, and the fourth is the size of the compressed object as a percentage of the largest uncompressed object.

As we can see, all of the ‘top of the line’ serializers produce very similarly sized results (to be frank, with 4 strings there is not much rocket science involved).

| Serializer | uncompressed bytes | compressed (optimal) | % of max | % of max compressed |
| --- | ---: | ---: | ---: | ---: |
| MessagePack | 710 | 457 | 71% | 46% |
| Avro | 708 | 457 | 71% | 46% |
| ThriftCompact | 713 | 460 | 71% | 46% |
| Proto3 | 712 | 463 | 71% | 46% |
| BondUnsafeSimpleCopied | 719 | 464 | 72% | 46% |
| BondUnsafeCompactReused | 717 | 469 | 72% | 47% |
| ThriftBinary | 732 | 478 | 73% | 48% |
| NewtonsoftJsonReusedSerializer | 774 | 488 | 77% | 49% |
| JavascriptSerializer | 774 | 488 | 77% | 49% |
| DataContractJsonSerializer | 846 | 497 | 84% | 50% |
| Xml | 989 | 634 | 99% | 63% |
| BinaryFormatter | 1002 | 649 | 100% | 65% |

When looking at the larger messages with binary content, we see similar results (the top-of-the-line serializers take roughly the same space). I added two more columns that ignore the outlier JavaScript-style serializers (they serialize byte[] as an array of byte values represented as strings).

| Serializer | uncompressed bytes | compressed (optimal) | % of max | % of max compressed | % of max without outliers | % of max without outliers compressed |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| MessagePack | 314822 | 49330 | 29% | 4% | 75% | 12% |
| Avro | 314817 | 49309 | 29% | 4% | 75% | 12% |
| ThriftCompact | 314821 | 49320 | 29% | 4% | 75% | 12% |
| Proto3 | 314822 | 49325 | 29% | 4% | 75% | 12% |
| BondUnsafeSimpleCopied | 314823 | 49322 | 29% | 4% | 75% | 12% |
| BondUnsafeCompactReused | 314825 | 49332 | 29% | 4% | 75% | 12% |
| ThriftBinary | 314833 | 49341 | 29% | 4% | 75% | 12% |
| NewtonsoftJsonReusedSerializer | 419349 | 84895 | 38% | 8% | 100% | 20% |
| JavascriptSerializer | 1101496 | 78217 | 100% | 7% | 263% | 19% |
| DataContractJsonSerializer | 1101719 | 78238 | 100% | 7% | 263% | 19% |
| Xml | 419617 | 85045 | 38% | 8% | 100% | 20% |
| BinaryFormatter | 315231 | 49551 | 29% | 4% | 75% | 12% |

Size-wise, all of the binary serializers perform roughly the same. The largest difference comes from the fact that in the binary formats byte[] does not have to be represented as base64 (Newtonsoft.Json) or as a string of byte values (JavaScriptSerializer, DataContractJsonSerializer). To put it into perspective: uncompressed JSON is 33% bigger than uncompressed anything-else (base64 overhead), and in compressed size the difference is even larger (50%+).
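The 33% figure follows directly from how base64 works: every 3 input bytes become 4 output characters, so the encoded length is 4 × ⌈n/3⌉. A quick check with the BCL:

```csharp
using System;

class Base64Overhead
{
    static void Main()
    {
        // 300,000 bytes of 'binary' content; 300,000 is divisible by 3,
        // so there is no padding and the length is exactly 4/3 of the input.
        byte[] payload = new byte[300_000];
        string encoded = Convert.ToBase64String(payload);

        Console.WriteLine(encoded.Length);  // 400000, i.e. ~33% larger
    }
}
```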

Performance

Since we established that the size differences among the binary serializers are minor, let’s look at the performance results. Not all serializers were tested under the same conditions, but more on that in a bit.

On the large objects, the serializers operate on the order of 100-400 us, with the built-in C# serializers being the slowest (well outside that range), Newtonsoft.Json being slow as well, and the rest not that far from each other.

The fastest configuration (Bond) reaches 16 us per serialization, which seems crazy. Note, however, that it didn’t involve allocating a single byte (nor any garbage collection), because I configured it to reuse its buffer. I didn’t do that with the other serializers, but if performance is important to you, you should consider using buffer pools to avoid unnecessary garbage collections. (On the other hand, you still need to create your object, which involves memory operations, so in the big picture the difference might not be that noticeable.) Having said that, the slower BondSimple variant with additional buffer copying is about as fast as Proto3 or Avro.

Surprisingly, Thrift is at the back of the peloton, and it looks like it allocates twice the required memory: while Avro, Bond and Proto3 allocate around 600 kB, Thrift and MessagePack allocate 1.2 MB and are almost 2 times slower. This may well be due to how MemoryStream works: when it needs to expand, it doubles its allocation.

| Method | Mean | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| NewtonsoftJsonReusedSerializer | 1,456.6460 us | 4.5955 us | 324.4792 | 296.0938 | 292.7083 | 1.48 MB |
| NewtonsoftJsonGenericSerializer | 1,256.3022 us | 7.9291 us | 321.6146 | 242.9688 | 163.8021 | 1.69 MB |
| NewtonsoftJsonDataContract | 1,260.6177 us | 7.7955 us | 318.75 | 244.2708 | 161.9792 | 1.69 MB |
| Xml | 716.6923 us | 3.6916 us | 276.1719 | 250.651 | 249.8698 | 1.32 MB |
| DataContractJsonSerializer | 45,589.4428 us | 141.1303 us | 45.8333 | 45.8333 | 45.8333 | 4.57 MB |
| BondUnsafeCompact | 265.6632 us | 2.5722 us | 152.7344 | 137.5 | 136.849 | 857.03 kB |
| BondUnsafeSimple | 122.7984 us | 0.6369 us | 72.3307 | 56.7057 | 56.7057 | 382.2 kB |
| BondUnsafeSimpleCopied | 225.2874 us | 3.5498 us | 126.6276 | 111.1328 | 111.0026 | 698.52 kB |
| BondUnsafeCompactReused | 366.4184 us | 5.9772 us | 201.5625 | 185.8724 | 184.9609 | 1.17 MB |
| BondUnsafeCompactReusedCopied | 372.5610 us | 3.7343 us | 209.375 | 193.6849 | 192.7734 | 1.17 MB |
| BondUnsafeSimpleReused | 121.1955 us | 1.3657 us | 71.5169 | 55.9896 | 55.8919 | 382.13 kB |
| BondUnsafeSimpleReusedBuffer | 16.2607 us | 0.0823 us | – | – | – | 0 B |
| JavascriptSerializer | 70,611.3527 us | 263.3033 us | 1687.5 | 1062.5 | 125 | 14.46 MB |
| Proto3 | 213.8052 us | 2.3267 us | 108.0729 | 106.1198 | 106.1198 | 640.15 kB |
| BinaryFormatter | 415.9039 us | 5.1037 us | 197.526 | 195.5729 | 195.5729 | 1.27 MB |
| MessagePack | 381.4887 us | 4.9364 us | 190.3646 | 190.3646 | 190.3646 | 1.26 MB |
| Avro | 219.0960 us | 2.3580 us | 108.2031 | 107.2266 | 107.2266 | 637.4 kB |
| ThriftBinary | 379.5297 us | 3.1255 us | 192.3177 | 190.3646 | 190.3646 | 1.27 MB |
| ThriftCompact | 388.4297 us | 8.7005 us | 193.3594 | 191.5365 | 191.5365 | 1.27 MB |

On the smaller object, the performance was not measurable with the default Benchmark.Net settings (it was too fast). I might come back to these tests later.

What really matters in terms of performance

Based on the test results, I’d risk saying that allocations and garbage collections have the largest impact on perf. Most performance problems come from memory allocation. If you can avoid additional allocations, you will see (in some cases) 50% performance improvements. If your application is serialization heavy, using buffer pools can significantly improve your performance. Keep in mind that it might not matter for your application: chances are that your logic is far more time-consuming than the serialization itself.
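A minimal sketch of the buffer-pool idea using `System.Buffers.ArrayPool<T>` (this is the BCL pooling primitive, not the exact mechanism used in these benchmarks):

```csharp
using System;
using System.Buffers;

class PooledSerialization
{
    static void Main()
    {
        ArrayPool<byte> pool = ArrayPool<byte>.Shared;

        // Rent a buffer instead of allocating a fresh byte[] per serialization.
        byte[] buffer = pool.Rent(64 * 1024);
        try
        {
            // ... serialize into 'buffer' here ...
            Console.WriteLine(buffer.Length >= 64 * 1024);  // True: Rent may return a larger array
        }
        finally
        {
            pool.Return(buffer);  // hand it back so the next Rent can reuse it
        }
    }
}
```

Note that `Rent` may hand you an array larger than requested, so code writing into a pooled buffer must track the written length separately.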

I briefly touched on this point in the previous paragraph, but let’s compare BondUnsafeSimple, BondUnsafeSimpleCopied and BondUnsafeSimpleReusedBuffer. The first and second differ in that Bond’s “OutputBuffer” is copied into a right-sized array (the buffer has more capacity than the serialized size). You probably won’t do that if you are saving the object to disk or sending it over the wire, but you can see that the copy operation basically doubles the memory allocation. Similarly, BondUnsafeSimpleReusedBuffer differs from BondUnsafeSimple in that it doesn’t even recreate the “OutputBuffer” for subsequent serializations. Once the buffer has grown to a sufficient size (and gets reused), no more reallocations are required. This proves (or at least hints!) that the majority of serialization time is spent allocating memory, not actually dumping the data (especially when we are talking about copying a byte[] into a stream).
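The reused-buffer variant follows roughly the pattern below. This is a sketch against the Bond C# API as I understand it; `Record` stands in for a Bond-generated type, and treat the `Position` reset and the returned segment as assumptions about my setup rather than the one true way to do it:

```csharp
using System;
using Bond;
using Bond.IO.Unsafe;   // OutputBuffer (the 'unsafe', faster implementation)
using Bond.Protocols;   // SimpleBinaryWriter

public static class ReusedBufferSerializer
{
    // Created once; the backing array grows as needed and is then reused.
    static readonly OutputBuffer Output = new OutputBuffer(64 * 1024);
    static readonly Serializer<SimpleBinaryWriter<OutputBuffer>> Serializer =
        new Serializer<SimpleBinaryWriter<OutputBuffer>>(typeof(Record));

    public static ArraySegment<byte> Serialize(Record record)
    {
        Output.Position = 0;  // rewind: no new allocation once the buffer is big enough
        var writer = new SimpleBinaryWriter<OutputBuffer>(Output);
        Serializer.Serialize(record, writer);
        return Output.Data;   // the written bytes; copy them out only if you must
    }
}
```

The returned segment is only valid until the next call, which is exactly the trade-off: zero allocations in exchange for giving up ownership of the bytes.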

Conclusion

If you need to serialize data that contains binaries (alongside other properties), you have lots of choices, though none of them involve a ‘human readable’ representation. Any of Avro, Thrift, MessagePack or Proto3 would do the trick, and Avro, Proto3 and Bond might be standing out. Proto3 is well established, and so is Bond (it is public knowledge that Bond is used in at-scale infrastructure at Microsoft). I will look later into whether there is something I am doing wrong with Thrift that would cause it to have ~70% worse performance (and higher memory usage) than the others.

What about deserialization?

Now, when it comes to deserialization… there will be another article. One interesting property I want to check in subsequent chapters is lazy deserialization: sometimes, when the object is large, you might want to deserialize (load into memory) just a part of it. This is a bigger deal in environments like JavaScript, where to load something into memory you not only have to read the string but also interpret it, but it might nevertheless be an interesting property to have.