
Tuesday, September 28, 2010

A Shell Script to Find and Remove the BOM Marker

Edited:

As pointed out by Omri, the script fails on OS X, apparently because of an idiosyncrasy in Apple's sed implementation. I temporarily fixed the script by switching from sed to perl on OS X: perl also ships by default on OS X, so there shouldn't be any problem. However, on OS X this version of the script scans the entire file by default, not only the first line as it does with other sed implementations.

Introduction

Have you ever seen these characters while dumping the contents of some of your text files?

ï»¿

If you have, you found a BOM marker! The BOM (byte order mark) is a Unicode character with code point U+FEFF that specifies the endianness of a Unicode text stream.

Since Unicode characters can be encoded as a multibyte sequence with a specific endianness, and since different architectures may adopt different endianness, it's fundamental to signal the endianness of the data stream to the receiver. Dealing with the BOM, then, is part of the game.

If you want to know more about when to use the BOM you can start by reading this official Unicode FAQ.

This post has been modified to solve some problems and improve the script according to your comments:

Files can now be filtered by extension using the -e option, as suggested by Goldan.

BOM can be removed throughout the file using the -a option, as suggested by Goldan.

An arbitrary number of files can be safely passed as a parameter.

The script behaves correctly even with filenames that contain whitespace.

Safe Harbour Statement: I try to test the script on as many systems as I can, but I can't guarantee that it works correctly on yours. I'll be glad to hear your feedback: any suggestion or bug report will be appreciated.

UTF-8

UTF-8 is one of the most widely used Unicode character encodings in software and protocols that have to deal with textual data streams. UTF-8 represents each Unicode character with a sequence of 1 to 4 octets. Each octet contains control bits that are used to identify the beginning and the length of an octet sequence. The Unicode code point is simply the concatenation of the non-control bits in the sequence. One of the advantages of UTF-8 is that it retains backwards compatibility with ASCII in the ASCII range [0-127], since such characters are represented with the same octet in both encodings.
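As a small worked example of the control-bit scheme described above, the following sketch encodes a code point as a three-octet UTF-8 sequence using shell arithmetic. The function name is hypothetical, and the sketch only covers code points in the range U+0800 to U+FFFF, which is where U+FEFF falls.

```shell
# Hypothetical helper: encode a code point in the range U+0800..U+FFFF
# as a three-octet UTF-8 sequence and print the octets in hex.
utf8_three_octets() {
  cp=$(( $1 ))
  o1=$(( 0xE0 | (cp >> 12) ))          # 1110xxxx: control bits + top 4 bits
  o2=$(( 0x80 | ((cp >> 6) & 0x3F) ))  # 10xxxxxx: control bits + middle 6 bits
  o3=$(( 0x80 | (cp & 0x3F) ))         # 10xxxxxx: control bits + low 6 bits
  printf '%02X %02X %02X\n' "$o1" "$o2" "$o3"
}

utf8_three_octets 0xFEFF   # the BOM code point; prints "EF BB BF"
```

Reading the result backwards, dropping the leading 1110 and 10 control bits from each octet and concatenating what is left gives the code point back.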

If you feel curious about how the UTF-8 encoding works, I've written an introductory post about it.

Common Problems

Because of its design, the UTF-8 encoding is not endianness-sensitive, and the Unicode standard neither requires nor recommends using the BOM with this encoding. Unfortunately some common utilities, notably Microsoft Notepad, keep adding a BOM to your UTF-8 files, thus breaking those applications that aren't prepared to deal with it.

Some programs could, for example, display the following characters at the beginning of your file:

ï»¿

A more serious problem is that a BOM will break a UNIX shell script by interfering with the shebang (#!): the kernel expects #! to be the very first two bytes of the file.
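One way to check whether a file starts with the UTF-8 BOM is to dump its first three bytes in hex and compare them with EF BB BF. This is only a minimal sketch: the function name has_bom is hypothetical and is not part of the script discussed in this post.

```shell
# Hypothetical helper: succeed if the file given as $1 starts with the
# UTF-8 BOM (bytes EF BB BF), fail otherwise.
has_bom() {
  # od prints the first three bytes as hex; tr removes all whitespace,
  # so the result is either "efbbbf" or something else.
  [ "$(od -An -tx1 -N3 "$1" | tr -d '[:space:]')" = "efbbbf" ]
}
```

It can be used as a guard, for example: if has_bom myscript.sh; then echo "BOM found"; fi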

A Shell Script to Check for BOMs and Remove Them

The UTF-8 representation of the BOM is the following three-byte sequence:

Binary: 1110 1111  1011 1011  1011 1111
Hex:    E    F     B    B     B    F

The quickest way I know of to process a text file and perform this operation is sed. The following syntax will instruct sed to remove the BOM from the first line of its input file:

sed '1 s/\xEF\xBB\xBF//' < input > output
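As far as I know, the \xHH escapes used above are a GNU sed extension, so other sed implementations may not understand them. A possible alternative, sketched below under that assumption, is to build the three BOM bytes with printf and pass them to sed literally, forcing the C locale so that sed works on raw bytes. The function names are hypothetical, and this has not been verified against every sed implementation.

```shell
# The three BOM octets (EF BB BF), written as POSIX octal escapes.
BOM=$(printf '\357\273\277')

# Remove the BOM only if it appears at the start of the first line of stdin.
strip_bom_first_line() {
  LC_ALL=C sed "1 s/^${BOM}//"
}

# Remove every occurrence of the BOM anywhere in stdin.
strip_bom_everywhere() {
  LC_ALL=C sed "s/${BOM}//g"
}
```

Both functions are filters, so they are used like the command above: strip_bom_first_line < input > output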

A Warning for Solaris Users

I haven't (yet) found a way to perform this operation correctly with the sed implementations bundled with Solaris 10, neither /usr/bin/sed nor /usr/xpg4/bin/sed. If you're a Solaris user, please consider installing GNU sed to use the following script.

The quickest way to install GNU sed and a lot of fancy Solaris packages is using Blastwave or OpenCSW. I've also written a post about loopback-mounting the Blastwave/OpenCSW installation directory in Solaris Zones to simplify Blastwave/OpenCSW software administration.

A Suggestion for Windows Users

If you want to execute this script in a Windows environment, you can install Cygwin. The base install with bash and the core utilities will be sufficient for this script to work in your Cygwin environment.

Source

This is the source code of a skeleton implementation of a bash shell script that will remove the BOM from its input files. The script supports recursive scanning of directories to "clean" an entire file system tree, and a flag (-x) to avoid descending into a filesystem mounted elsewhere. The script uses temporary files while doing the conversion, and the original file will be overwritten only if the -d option is not specified.

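function doJob() {
    # Check if the script has been called from the outside.
    if [ "$PROCESSING_FILES" == true ] ; then
        # Iterate over the array directly, quoting each element so that
        # filenames with whitespace survive and an empty array is a no-op.
        for file in "${FILES[@]}"
        do
            echo "$file"
            processFile "$file"
        done
    fi
    # (the rest of the script is not reproduced in this excerpt)
}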
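The body of processFile is not shown in this excerpt, so what follows is not the original implementation. It is only a minimal sketch of the temporary-file approach described above, with hypothetical function names. It assumes that copying the cleaned bytes back into the original file, instead of moving the temporary file over it, is an acceptable way to keep the original permissions, and it assumes filenames do not contain newlines.

```shell
# Hypothetical stand-in for the per-file step; all names are assumptions.
strip_bom_in_place() {
  local file tmp bom status
  file=$1
  status=0
  bom=$(printf '\357\273\277')   # the three BOM octets, in octal
  tmp=$(mktemp) || return 1
  # Write the cleaned copy to a temporary file first...
  if LC_ALL=C sed "1 s/^${bom}//" "$file" > "$tmp"; then
    # ...then copy the bytes back rather than moving the temp file over
    # the original, so the original file keeps its mode and owner.
    cat "$tmp" > "$file" || status=1
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Hypothetical helper: clean every file with extension $2 under directory
# $1, without crossing into other mounted filesystems (-xdev).
clean_tree() {
  find "$1" -xdev -type f -name "*.$2" -print |
  while IFS= read -r f; do
    strip_bom_in_place "$f"
  done
}
```

For example, clean_tree . tex would process every .tex file below the current directory. Because read consumes one line per file name, names with spaces are handled, but names containing a newline are not.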

Great script... it worked just fine except that the temp file writes to 644 permissions. Is there any way you could modify it to hold the permissions on the file so that they are preserved? I would definitely use it all the time if so...

It should, but this is not always the case, unfortunately. I've got the BOM marker at the beginning of every \footnote{} in a LaTeX file after opening and saving it with TeXMaker on Windows. As noted on unicode.org, the marker's usage in the middle of a file is deprecated: http://unicode.org/faq/utf_bom.html#bom6

Thanks again for the bash script, I just tried it to recursively correct a directory and it worked perfectly. By the way, can I use it to recursively correct files with a specified extension (e.g. .tex) in a directory?

And another suggestion: since your script is so well written, you could easily add an option to remove the BOM marker not only from the beginning, but from the whole document. To do that, I just needed to replace the sed command on line 76 with the one I mentioned above.

I wasn't able to get this to work on MacOSX Lion. I know UNIX but am pretty new to Apple's environment so bear with me :-) When running the script, everything appeared to run fine but the output file still had the BOM in it (I double-checked by running with the -x option and examining the temp file).

When trying to debug this and playing around with sed on MacOSX, it appears that it ignores the BOM at the beginning of the file. It's possible that this is an idiosyncrasy of the Apple implementation (or maybe something that's new in Lion, since it appears from the article that you tested on previous versions of OSX?)

I'm using the native /usr/bin/sed. You mentioned that you couldn't get this to work on the native Solaris sed, so maybe this is a similar issue. You also mentioned substituting GNU sed: do you know by any chance whether there's a simple way to download standalone GNU utils for OSX as opposed to installing a big package?

Although I'm on OS X most of the time, I don't really use it for shell scripting; I'm still a Solaris guy for that. And yes: I just tried, and the script doesn't work correctly with Apple's sed.

This is a quick workaround: I put it here because it's not going to fix the entire script as it is. Instead of using sed, replace the line where sed is invoked with the following one:

cat "$1" | perl -pe 's/\xEF\xBB\xBF//' > "$TEMPFILENAME"

I preferred using sed instead of perl mainly because it's available on almost any system, even the most stripped-down installations. However, for OS X we've got to stick with perl and wait for me to fix the script.

Also, I just realized I missed an argument check during the last script refactoring. In a few minutes it will be updated.