2008-07-21

A year and a half ago, I posted what seemed to be the simplest recipe for reading and writing UTF-8 in Haskell. In this post, I will provide an even simpler recipe, made possible by Eric Mertens' utf8-string package.

For those who are not familiar with Haskell, its internal representation for characters is Unicode, but for IO it effectively assumes that that it is reading and writing in the ISO8859-1 format. This used to be annoying for those of us who wanted to work with the UTF-8 encoding, but now there is a very simple solution, perfect for those of us who don't want to think too much and just get the job done.

the example

The sample problem from my last post was to take a UTF-8 encoded file as input, reverse all its lines, writing the results in the same file, with a ".rev" extension appended to its name. The solution might be self-explanatory if you are used to Haskell, but I will make some minor comments below, just in case.

In the above code, we use some drop-in replacements for some System.IO functions. Some of these functions are also provided in the Prelude, so we must hide them so that they do not overlap with what we import. (Alternatively, we could import the UTF-8 ones qualified, which could be handy in contexts where we want the option of reading and writing in UTF-8 without committing to it). The rest is straightforward. Notice that we do not jump through any hoops whatsoever. In fact, you can pretty much take any pre-existing Haskell program that you have written and turn it into a UTF-8 version by changing the import statements.

The utf8-string package is available on HackageDB. Thanks to Eric M. for providing this little wrapper! It's a perfect example of the kind of thing which seems obvious... after somebody else has thought to do it.

A year and a half ago, I posted what seemed to be the simplest recipe for reading and writing UTF-8 in Haskell. In this post, I will provide an even simpler recipe, made possible by Eric Mertens' utf8-string package.

For those who are not familiar with Haskell, its internal representation for characters is Unicode, but for IO it effectively assumes that that it is reading and writing in the ISO8859-1 format. This used to be annoying for those of us who wanted to work with the UTF-8 encoding, but now there is a very simple solution, perfect for those of us who don't want to think too much and just get the job done.

the example

The sample problem from my last post was to take a UTF-8 encoded file as input, reverse all its lines, writing the results in the same file, with a ".rev" extension appended to its name. The solution might be self-explanatory if you are used to Haskell, but I will make some minor comments below, just in case.

In the above code, we use some drop-in replacements for some System.IO functions. Some of these functions are also provided in the Prelude, so we must hide them so that they do not overlap with what we import. (Alternatively, we could import the UTF-8 ones qualified, which could be handy in contexts where we want the option of reading and writing in UTF-8 without committing to it). The rest is straightforward. Notice that we do not jump through any hoops whatsoever. In fact, you can pretty much take any pre-existing Haskell program that you have written and turn it into a UTF-8 version by changing the import statements.

The utf8-string package is available on HackageDB. Thanks to Eric M. for providing this little wrapper! It's a perfect example of the kind of thing which seems obvious... after somebody else has thought to do it.

As far as I can tell, the Urdu, Pashto, Persian and Arabic sentences are basically gibberish. Perhaps the source of these sentences is like that? Any way, I love any post that contains Haskell and Urdu :)