How To Find Out What Character Encoding Is Used In a Particular File

Although the character encoding is essential for us to know what the bytes in a file mean, it is amazing how often nobody seems to know what it is. So you would hope that there would be some clever application that would work it out for you. Wikipedia has some good sense on Character-set detection. The bottom line is that character-set detection is not an exact science: there are tools available, but none of them can tell you with complete certainty which character set was actually used.

We do have our own tool, and you can get it free from our Downloads page. It has the same limitations as all the other tools, but it can help to point you in the right direction. It is a simple Java program that reads a byte stream and reports what the character encoding is likely to be. It can analyse a 12GB file in 4 minutes and 30 seconds on my MacBook Pro, so you really can afford to find out this important information about your source file. It is a lot less bother to do it early on than to struggle with all the odd things you find in your data if you load it without checking the character encoding first.

There are five examples below, showing what you see with different kinds of files:

It turns out that this is a very clean file. I wish all our source files were as good as this. It is clear from this output that the file is a UNIX (or Mac) file, because the end of each line is marked by a newline character alone (not carriage return followed by newline, as in Windows files). We can also see that most of the characters are single bytes, with 388 UTF-8 multi-byte sequences, all of which have the right number of continuation bytes following the leading byte. There are no bytes in this file that are invalid in UTF-8.

This means that if we read this file using the UTF-8 encoding and load it into a database that is also in UTF-8, then all the characters should appear correctly.
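
If you want the load step itself to object to bad bytes rather than quietly substituting replacement characters, you can read the file with a strict UTF-8 decoder. Here is a minimal Java sketch of that idea (the file name comes from the command line; this is an illustration, not the profiling tool itself):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class ReadUtf8Strictly {
        public static void main(String[] args) throws Exception {
            // A decoder that throws an exception on malformed input instead of
            // quietly replacing bad bytes with U+FFFD, so problems surface early.
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0]), decoder))) {
                long lines = 0;
                while (reader.readLine() != null) {
                    lines++;
                }
                System.out.println("Read " + lines + " lines as valid UTF-8");
            }
        }
    }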

We can also see that this file has no control characters apart from newlines. This is very good news and will make working with this data much easier.

There may still be some issues with characters that we want to regard as the same, but that are actually variants, such as straight quotes and curly quotes, various kinds of hyphens, accented characters, etc. That isn't a fault in the file; it is a feature of working in a multi-lingual world.

The next section of the analysis tells us some more about the characters we find in this file. The report gives us a table with one row for each of the 256 possible byte values, how many of each there are in the file, and what they might mean. Since we have already convinced ourselves that this is a valid UTF-8 file, the "Name" column and the "Specific to Windows-1252" column are not very interesting to us (we'll see examples where they are interesting later). The UTF-8 Group is interesting.
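
The counting behind a table like that is simple: one pass over the file, tallying how many times each of the 256 possible byte values occurs. A rough Java sketch of the idea (not the actual tool's code) might look like this:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class ByteFrequency {
        public static void main(String[] args) throws Exception {
            long[] counts = new long[256];              // one slot per possible byte value
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                int b;
                while ((b = in.read()) != -1) {
                    counts[b]++;                        // read() already returns 0..255
                }
            }
            for (int value = 0; value < 256; value++) {
                if (counts[value] > 0) {
                    System.out.printf("%02x  %,d%n", value, counts[value]);
                }
            }
        }
    }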

From this table, we see that almost all of the bytes are valid ASCII characters. ASCII is a subset of UTF-8, so that is good.

Next there are a number of Continuation Bytes. These only make sense with their Leading Bytes, so we'll look at those next.

The Names shown for Windows-1252 and ISO 8859-1 are irrelevant for a UTF-8 file. We didn't know that it was UTF-8 until we had run the analysis, but now that we do know, don't be distracted by those names.

The leading bytes we have are:

Leading Byte (hex) | Number of Bytes in Sequence | Character Group (Unicode block) | Number Found In This File
c2                 | 2                           | Latin                           | 6
c3                 | 2                           | Latin                           | 212
c4                 | 2                           | Latin                           | 7
c5                 | 2                           | Latin                           | 9
e2                 | 3                           | Symbol                          | 154
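
The "Number of Bytes in Sequence" column follows directly from the bit pattern of the Leading Byte; this is standard UTF-8 structure rather than anything specific to our tool. A small Java sketch of the rule:

    public class SequenceLength {
        // Sequence length implied by a UTF-8 Leading Byte.
        // 0xc0, 0xc1 and 0xf5..0xff never appear in valid UTF-8.
        static int utf8SequenceLength(int leadingByte) {
            if (leadingByte < 0x80) return 1;                          // 0xxxxxxx: plain ASCII
            if (leadingByte >= 0xc2 && leadingByte <= 0xdf) return 2;  // 110xxxxx
            if (leadingByte >= 0xe0 && leadingByte <= 0xef) return 3;  // 1110xxxx
            if (leadingByte >= 0xf0 && leadingByte <= 0xf4) return 4;  // 11110xxx
            return -1;  // continuation byte (10xxxxxx) or an invalid value
        }

        public static void main(String[] args) {
            System.out.println(utf8SequenceLength(0xc3));  // 2
            System.out.println(utf8SequenceLength(0xe2));  // 3
        }
    }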

I want to find these, to see what kinds of characters are being used, because our users probably will not think to search for accented characters or special symbols. So how are we going to do this in large files? I thought back to the time when I was using the Netezza Data Warehouse Appliance (happy days!). The Netezza Data Loader was very good at telling me:

When something had gone wrong

What the problem was

Exactly where it was in the file: line number and character within the line

More recently, I was working on a Hive project. Hive works on the philosophy that you just don't want to know about errors. Its design principle is "keep quiet and carry on". (That's why Data Scientists spend 80% of their time cleaning data. Brilliant. Not!). Anyway, having spent much time helping developers play "Hunt The Dodgy Character", I decided to write a program to give me this information.

So, if we run the same command as before, but add a second parameter to switch on verbose output, we get this.

This is just the first few lines of what can be very verbose. It highlights possible errors (as we'll see in later examples) and shows all the multi-byte UTF-8 sequences. For most files using a Latin alphabet this will be fine, but if you are using a writing system that generates a lot of multi-byte sequences (Cyrillic or Chinese, for example) this would be a pain. You could redirect it to a file, or pipe it through more as I did here.

In the extract above we see some valid UTF-8 characters. The report shows the line number and character position within the line for each of the interesting characters (which are all UTF-8 Leading Bytes in this case). Given the line number, we can find that line in the file. Here's an example using standard UNIX commands (you aren't trying to do this in Windows are you?):

This takes us directly to line 2470, the first line mentioned in the verbose report above. Then we have to find the 77th character in that line. We can see that this is the apostrophe in "JAMES’S". There are several characters that can be used as an apostrophe. Here's one way to find out which of the options we are using here (just piping the output from the previous command through hexdump):

We can see that the hex string for the apostrophe between "JAMES" and "S" is e2 80 99. Hex e2 is a UTF-8 Leading Byte that introduces a 3-byte sequence.

We've made this a bit easier by showing in the verbose report the 3-byte UTF-8 sequence, and the Unicode Code Point that this sequence encodes:

Valid UTF-8 multi-byte sequence: e2 80 99 Unicode Code Point: U+2019
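
You can check that arithmetic yourself: the leading byte of a 3-byte sequence contributes 4 bits of the code point and each continuation byte contributes 6 more. A tiny Java sketch of the calculation:

    public class DecodeSequence {
        public static void main(String[] args) {
            // 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 = 16 bits of code point
            int b1 = 0xe2, b2 = 0x80, b3 = 0x99;
            int codePoint = ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f);
            System.out.printf("U+%04X%n", codePoint);  // prints U+2019, the right single quotation mark
        }
    }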

There are over a million Unicode Code Points, so how do we find this one? There are several websites that I have found useful for this:

Micha Köllerwirth's website. This is probably the easiest way to look up any code point or UTF-8 sequence. It is an awesome piece of work - thank you Micha.

The Wikipedia UTF-8 page: scroll down to the multi-coloured table, find the cell for the Leading Byte that you are interested in and click on the link at the top of that cell. For e2 we see that this Leading Byte encodes sequences for symbols. Although most of the Leading Byte cells have a link to that block of Unicode characters, this one doesn't. I forgive them unreservedly; it's still a very useful page.

You could also try FileFormat.Info. This also has a rich supply of useful information.

The next one we find is at line: 4878, column: 109. This is Unicode Code Point: U+0130 which is "İ", capital "I" with a dot above it. This is used in Turkish to show one particular pronunciation of the letter "I". "İ" is the correct first letter of the city "İstanbul". Since Istanbul is often spelled (outside Turkey) with a plain "I" we would want to find both "İstanbul" and "Istanbul" if we were searching for this city in our databases. This is another reason for maintaining a standardised version for searching.
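
One common way to maintain a standardised version for searching is to decompose the text and strip the combining marks, so that "İstanbul" and "Istanbul" produce the same search key. This is only one possible policy; here is a hedged Java sketch of it:

    import java.text.Normalizer;
    import java.util.Locale;

    public class SearchKey {
        // Produce a crude, accent-free, upper-cased key for matching.
        // NFD decomposition separates base letters from combining marks, which
        // are then removed; Locale.ROOT avoids the Turkish i/İ casing rules.
        static String searchKey(String s) {
            String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
            String stripped = decomposed.replaceAll("\\p{M}+", "");
            return stripped.toUpperCase(Locale.ROOT);
        }

        public static void main(String[] args) {
            System.out.println(searchKey("İstanbul"));  // ISTANBUL
            System.out.println(searchKey("Istanbul"));  // ISTANBUL
        }
    }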

What this shows is that, even with a very clean file like this one, it is worth understanding what data you have so that you can make informed decisions about how to offer this data to users.

Example 2: A file that appears to be encoded using ISO/IEC 8859-1

This is another file from the UK Government's Open Data initiative. It is a list of places with map references (a gazetteer). You can download the file from Open Data Gazeteer. The format is described as "ASCII text, Colon separated".

All the line ends are Windows-style (carriage return/linefeed) so this is clearly a Windows file. There are no other control characters and no carriage returns without linefeeds, so all that is good.

The encoding is clearly not UTF-8, because there are 1,586 bytes that are invalid in UTF-8, and none of the bytes that could be UTF-8 Leading Bytes is followed by the correct number of Continuation Bytes. In fact, there are no Continuation Bytes at all.

The documentation says that the file format is ASCII. It is not really ASCII because there are 8-bit values. These could be valid in the ISO 8859 and Windows 12xx families, both of which are sometimes described as "8-bit ASCII". It's time to look at what byte values we actually have.

This shows us nearly 5 million colons, and that is consistent with the field delimiter being a colon.

The most interesting thing here is to see which characters are used from the ISO 8859 and Windows 12xx families, with byte values between 128 and 255. None of the characters used are both:

specific to Windows 1252, and

not in ISO 8859

Assuming that we have either ISO 8859-1 or Windows 1252, the characters that are used are: À, È, Ì, Ò, Ù, à, á, â, è, é, ê, ë, ì, î, ò, ô, ù and û. The gazetteer covers England, Scotland and Wales. Place names in Scotland and Wales are often shown in the Gaelic and Welsh languages and these languages do use the accents we have found. Everything so far suggests that this file is valid. We can look in more detail by running the verbose analysis. Here are the first few lines:

Line 5, column 15 is "à"
Line 11, column 19 is "À"
Line 12, column 17 is "ì"
Line 16, column 29 is "ò"

So, at this level, this is another good quality file. The encoding is compatible with Windows 1252 and ISO/IEC 8859-1, as well as some of the less common variants in the Windows 12xx family and in the ISO/IEC 8859 family. I would be happy to load this file into a database as a one-off exercise, but if I were planning to load a new version of it every month, say, then I would want to know exactly what the encoding is, because next month's data might have something that works properly only in one of the various possible encodings that we have found.

Databases should usually be in UTF-8 now, because more and more of the data we receive will arrive in UTF-8. For a database in an established organisation, with a set of applications consistently using one of the older encodings such as ISO 8859-1, it would be easier to create the database to use that encoding. But I would prefer to be ahead of the game, so I would always favour creating databases in UTF-8 now and converting all data to UTF-8 on loading. This isn't difficult (see Load a File to a Database and Convert the Character Encoding). What will make life difficult is dirty data, but that is always true, whatever encoding we use.
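
Converting on the way in can be as simple as decoding with the encoding you believe the file uses and re-encoding as UTF-8. A minimal Java sketch, assuming the source really is Windows-1252 (the file names are command-line placeholders, and this reads the whole file into memory, so it suits modest files rather than huge ones):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class Recode {
        public static void main(String[] args) throws Exception {
            Charset source = Charset.forName("windows-1252");  // assumed source encoding
            // Decode using the source encoding, then write the same text back out as UTF-8.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), source);
            Files.write(Paths.get(args[1]), lines, StandardCharsets.UTF_8);
        }
    }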

Example 3: An XML file encoded in UTF-8

This is also a file from the UK Government's Open Data initiative. It is a list of roads, giving the road name, classification, coordinates (as points and series of connected points) and some other data. You can download the file from Open Roads. The format is described as "GML 3"; this is XML with a particular XML schema that is designed to support geographical information. The XML header says that the file is encoded in UTF-8.

This shows us that we have a Windows file. It appears to use only ASCII characters. Since ASCII is a valid subset of UTF-8, this is consistent with its description in the XML header. There is an average of three control characters per line. We will need to understand what these control characters are and decide what to do about them.

Running the verbose output gives a file of 4.8GB, because there are so many tab characters. From a quick scan of the output, it seems that tabs are used to indent the GML. See Loading XML Data to a Database. So the verbose output is not very useful in this case.

If we are going to load this data into a database, then the encoding will not give us any problems. Because it is purely 7-bit ASCII (which is compatible with all the Latin character encodings), we can load it into any of these without conversion. However, any XML format should always be restructured into atomic values, in tables and columns, as described in Loading XML Data to a Database. While doing this we would remove the tab characters because they are not part of the data.

Example 4: A file from a commercial back-office system

This section is about the file that made me realise the need for a character profiler. The file comes from a system that was the definitive record of customers for a large company. The company was formed by buying a piece of an even larger company, and this particular system was still run by the old company while the new offshoot built its own customer database. The old system was written decades ago and still runs on an IBM mainframe. The relationship between the old company and the new one is such that it is very difficult to get questions answered about data in this file. We know that the file comes from a mainframe, but the process by which the file is periodically delivered to a Windows server in the new company is a mystery. In fact the file is one of hundreds that make up the old customer database, but it is the file that contains the main customer details. It is nowhere near being normalised; the file has over 150 fields in each record, some of which are directly related to the customer and many of which should be off in other tables because they have a many-to-one relationship, or an indirect relationship, to the customer.

None of this is unusual. It is the stuff that is everyday work for people in big organisations migrating data to new systems or building Data Warehouses, and now Data Lakes. Somehow we have to try to make sense of such data. Before we even get to that stage though, we have to load the data into some system that can help us to analyse it. This file was a delimited file, like a csv file but with an exclamation mark as the delimiter. The choice of delimiter is not a problem, but this file was a problem. Whatever parser we used to divide the file into records, and then fields within records, would periodically be broken by some unexpected characters in the data. Hence our need to find out, quickly, what had caused today's failure.

Here is the first part of the analysis. It does not include all the UTF-8 summary statistics because these were not written when I ran this and I no longer have access to the file to rerun it. It doesn't matter because this was not a UTF-8 file.

A line is defined here as any sequence of characters ending in newline, or carriage return/newline. "Windows line ends" counts the number of carriage return/newline pairs, and in this case we have 3,811,261 of these. "UNIX line ends" counts the number of newlines that are not preceded by a carriage return, and there are none of these. This tells us that the file we are looking at was produced on a Windows system (not UNIX and not Mac).

This summary also tells us that there are two carriage returns without newlines and that there are 1,456 other control characters.
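
Counting these categories needs only a single pass over the bytes, remembering whether the previous byte was a carriage return. A rough Java sketch of that classification (again, not the tool's actual code):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    public class LineEndCounts {
        public static void main(String[] args) throws Exception {
            long windows = 0, unix = 0, loneCr = 0, otherControls = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                int prev = -1, b;
                while ((b = in.read()) != -1) {
                    if (b == '\n') {
                        if (prev == '\r') windows++; else unix++;    // CRLF pair vs bare LF
                    } else {
                        if (prev == '\r') loneCr++;                  // CR not followed by LF
                        if (b < 0x20 && b != '\r') otherControls++;  // tabs, nulls, etc.
                    }
                    prev = b;
                }
                if (prev == '\r') loneCr++;                          // file ends with a bare CR
            }
            System.out.println("Windows line ends: " + windows);
            System.out.println("UNIX line ends:    " + unix);
            System.out.println("Lone CRs:          " + loneCr);
            System.out.println("Other controls:    " + otherControls);
        }
    }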

The next extract is the start of the table that lists every byte value (from 0 to 255) and the count of bytes with that value. I'm showing just the first 32 values. This covers all but one of the control characters:

There are 1,428 tab (or horizontal tab, to be precise) characters in the file. These explain some of the strange search results we saw later.

There are 28 "null" bytes (hex 00). This has nothing to do with the nulls we know in SQL or Java, they are simply bytes with a value of zero. They cause problems because we can't normally see them in query results or in text editors. They also cause some applications to ignore any content of the column that follows the null byte because the null byte is sometimes taken to mark the end of a string.

There are two "file separator" control characters (hex 1c). I don't even know (or care) what they mean, but they will mess up our queries and possibly the parsing of the file.

In this early version of the character encoding profile program, we did have the line and column numbers of characters that might cause us problems, or give us information about the possible encoding. Here is a very short extract:

This extract shows a couple of lines that have a byte value of 162 (decimal), which is hex a2. If the file were in UTF-8 this would be a Continuation Byte, so there should be a Leading Byte before it (and there is not, in this file, because it would show on the line above). If the file were in Windows-1252 this would be the cent currency symbol (¢). Since this is a plausible symbol in Windows-1252 and an invalid byte sequence in UTF-8, it does suggest that this file is more likely to be in Windows-1252 and less likely to be in UTF-8. Hex a2 is also a valid character in the ISO/IEC 8859 character families.

The third line in this extract shows one of the lone carriage return characters. This enabled us to go and look at the line containing this aberration, and decide what to do about it. It is hard to see what use a lone carriage return would be. Since we gave up teletypes there has not been much use for the carriage return character, except its ill-advised use as part of the end of a line in Windows.

Example 5: A large binary file

In binary files, the bytes do not represent characters. The whole file is a sequence of bits that are organised in some way that a particular program can interpret. Binary files are used for images, video, audio, compressed data, and many other things.

In general, there is not much point in running the character encoding profile on a binary file except, maybe:

To establish that it is a binary file

To find out if there are significant chunks of text in the file

As a boundary test of the profiling tool

I tried it for the third reason above. I was looking for the biggest file on my Mac and it was the latest Hortonworks Sandbox virtual machine. Here goes!

As predicted, this isn't very useful. It shows a pretty even distribution of byte values. In fact the report of counts by byte value shows about 45 million occurrences of each possible value (so I won't bother you with it here). The most useful thing for me was the following line:

Elapsed Time: 00:04:20

That's nearly 46 million bytes per second. For me that makes the profiling worth doing on every file we are asked to process.