Bλog

Text/UTF-8: Studying memory usage

Benchmarking the memory usage of an example server
Published on August 9, 2011 under the tag haskell

What is this?

This blogpost continues where the previous one left off. Again, I study the performance of an application that makes intensive use of the Data.Text library. The difference is that this blogpost focuses almost exclusively on the memory usage of the resulting application.

The application used is a simple document store. Clients can store documents per ID, and retrieve document IDs based on terms in the document. This blogpost is written in Literate Haskell, so feel free to grab the raw version.

We use the OverloadedStrings language extension for general prettiness…
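As a small illustration of what the extension buys us (a sketch only; the names greeting and greetingPacked are made up for this example), a string literal can be used directly as a Text value:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T

-- With OverloadedStrings enabled, this literal is a Text, not a String,
-- so no explicit T.pack is needed.
greeting :: Text
greeting = "hello world"

-- The same value built the long way, as it would be without the extension.
greetingPacked :: Text
greetingPacked = T.pack "hello world"
```

Both definitions produce the same Text; the extension only removes the conversion noise.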

The pure logic

Let’s first write down the pure logic of our web application. When we receive a document from a client, we want to extract the terms (i.e., words) used in the document. This is the job of the tokenize function.
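A minimal sketch of what this pure logic might look like follows. It is an illustration under assumptions, not necessarily the exact code behind the measurements: the tokenization rule (lowercase, split on non-alphanumeric characters), the use of T.copy, and the helper names addDocument and getDocuments are all my choices for this example. The Store shape, a Map from terms to a Set Int of document IDs, matches the structures discussed in the conclusions.

```haskell
import Data.Char (isAlphaNum)
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import Data.Text (Text)
import qualified Data.Text as T

-- Split a document into lowercase terms. T.split yields slices of the
-- original text, so each term is copied here to give it its own small
-- array instead of keeping the whole document alive (an assumption about
-- what is desirable in a long-lived store).
tokenize :: Text -> [Text]
tokenize = map T.copy
         . filter (not . T.null)
         . T.split (not . isAlphaNum)
         . T.toLower

-- Inverted index: every term maps to the IDs of the documents containing it.
type Store = M.Map Text (S.Set Int)

-- Index all terms of one document under its ID.
addDocument :: Int -> Text -> Store -> Store
addDocument docId body store = foldr insertTerm store (tokenize body)
  where
    insertTerm term = M.insertWith S.union term (S.singleton docId)

-- Look up the IDs of all documents containing a term.
getDocuments :: Text -> Store -> S.Set Int
getDocuments term = M.findWithDefault S.empty term
```

Keeping these functions pure means the web layer only has to decide how the Store is shared between requests.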

The web logic

That is, in addition to the features which Snap provides, we also need access to a shared Store. All of our web controllers have this type. Let’s look at the controller which adds a document. The function is fairly straightforward: it fetches the document ID and body, and adds the document using modifyMVar_. Lastly, it also shows a response to the client (we define the blaze auxiliary function later).
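The shared-state part of this can be sketched without Snap at all. The example below assumes the Store lives in an MVar and leaves out the request parsing and the response; newStore, addDocumentShared and lookupShared are hypothetical names for this illustration, and the tokenizer is simplified to T.words.

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, readMVar)
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import Data.Text (Text)
import qualified Data.Text as T

-- Inverted index from terms to document IDs.
type Store = M.Map Text (S.Set Int)

-- Pure update, using whitespace splitting as a simplified tokenizer.
addDocument :: Int -> Text -> Store -> Store
addDocument docId body store = foldr insertTerm store (T.words body)
  where
    insertTerm term = M.insertWith S.union term (S.singleton docId)

-- Create an empty store that can be shared between request handlers.
newStore :: IO (MVar Store)
newStore = newMVar M.empty

-- What the "add document" controller does once it has the ID and body:
-- modifyMVar_ takes the store, applies the pure update and puts it back.
-- The ($!) forces the new map to weak head normal form before it is
-- stored, so the MVar does not hold an unevaluated chain of updates.
addDocumentShared :: MVar Store -> Int -> Text -> IO ()
addDocumentShared ref docId body =
    modifyMVar_ ref $ \store -> return $! addDocument docId body store

-- Read-only lookup of the document IDs for a term.
lookupShared :: MVar Store -> Text -> IO (S.Set Int)
lookupShared ref term = M.findWithDefault S.empty term <$> readMVar ref
```

In the real application, a Snap handler would obtain the MVar from its environment and call something like addDocumentShared with the request parameters.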

Results

Next up is running it! I ran the application twice, once using the current version of Text, and once using my UTF-8 based port. A simulated client sent a large volume of Twitter data in a variety of languages to the server. The following graph represents memory usage over time:

[Figure: Memory usage results]

Conclusions

While there is a very clear difference, it isn’t as large as I first suspected. There are a number of reasons for this:

we use a Text value per token in the document. Each value carries an additional 6 words of overhead, which is non-negligible for the relatively small tokens;

a lot of memory is taken up by Set Int as well;

the internal structure of the Map also takes up 6 words per item.
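To get a feeling for the first point, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions: a 64-bit machine (8-byte words), the 6 words of per-value overhead mentioned above, and a payload that is rounded up to a whole number of words. The function names are made up for this example, and the Map and Set Int overhead is ignored entirely.

```haskell
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Assumption: 64-bit machine, so one word is 8 bytes.
wordSize :: Int
wordSize = 8

-- Per-value overhead of a Text, as stated in the list above.
overheadWords :: Int
overheadWords = 6

-- Assumption: the payload occupies a whole number of words.
roundUpToWord :: Int -> Int
roundUpToWord n = ((n + wordSize - 1) `div` wordSize) * wordSize

-- Estimated bytes for one token if the payload is stored as UTF-8.
utf8Cost :: Text -> Int
utf8Cost t = overheadWords * wordSize + roundUpToWord (B.length (E.encodeUtf8 t))

-- Estimated bytes for one token if the payload is stored as UTF-16.
utf16Cost :: Text -> Int
utf16Cost t = overheadWords * wordSize + roundUpToWord (B.length (E.encodeUtf16LE t))
```

For the seven-letter token "haskell", this estimate gives 56 bytes with UTF-8 against 64 bytes with UTF-16: the payload is halved, but the fixed 48 bytes of overhead dominate, so the total shrinks by only an eighth.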

That being said, I think the difference shows that UTF-8 clearly has some benefits over UTF-16 in many situations. I’m looking forward to discussing more of the possible advantages and disadvantages… perhaps at CamHac?