UTF-8 can be auto-detected better by contents than by BOM. The method is simple: try to read the file (or a string) as UTF-8, and if that succeeds, assume that the data is UTF-8. Otherwise assume that it is CP1252 (or some other 8-bit encoding). Any non-UTF-8 eight-bit encoding will almost certainly contain sequences that are not permitted by UTF-8. Pure ASCII (7-bit) gets interpreted as UTF-8, but the result is correct that way too.
– Tronic, Feb 11 '10 at 13:25
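
For illustration, a minimal Python sketch of that try-to-decode approach (the CP1252 fallback is an assumption; substitute whatever legacy codepage fits your environment):

def guess_decode(data: bytes) -> str:
    # Strict UTF-8 decoding rejects any byte sequence that is not valid UTF-8.
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Assumed fallback; 'replace' guards against the few undefined CP1252 bytes.
        return data.decode('cp1252', errors='replace')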

Scanning large files for UTF-8 content takes time. A BOM makes this process much faster. In practice you often need to do both. The culprit nowadays is that a lot of text content still isn't Unicode, and I still bump into tools that say they do Unicode (for instance UTF-8) but emit their content in a different codepage.
– Jeroen Wiert Pluimers, Dec 18 '13 at 7:41

@Tronic I don't really think that "better" fits in this case. It depends on the environment. If you are sure that all UTF-8 files are marked with a BOM, then checking the BOM is the "better" way, because it is faster and more reliable.
– mg30rg, Jul 31 '14 at 9:31

UTF-8 does not have a BOM. When you put a U+FEFF code point at the start of a UTF-8 file, special care must be taken to deal with it. This is just one of those Microsoft naming lies, like calling an encoding "Unicode" when there is no such thing.
– tchrist, Oct 1 '14 at 22:37

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be
encountered in contexts where UTF-8 data is converted from other
encoding forms that use a BOM or where the BOM is used as a UTF-8
signature. See the “Byte Order Mark” subsection in Section 16.8,
Specials, for more information.

It might not be recommended, but from my experience in Hebrew conversions the BOM is sometimes crucial for UTF-8 recognition in Excel, and may make the difference between gibberish and Hebrew.
– Matanya, Dec 7 '12 at 8:13

It might not be recommended, but it did wonders for my PowerShell script when trying to output "æøå".
– Marius, Nov 12 '13 at 9:22

Regardless of it not being recommended by the standard, it's allowed, and I greatly prefer having something to act as a UTF-8 signature rather than the alternatives of assuming or guessing. Unicode-compliant software should/must be able to deal with its presence, so I personally encourage its use.
– martineau, Dec 31 '13 at 20:41

@bames53: Yes, in an ideal world storing the encoding of text files as file system metadata would be a better way to preserve it. But most of us living in the real world can't change the file system of the OS(s) our programs get run on -- so using the Unicode standard's platform-independent BOM signature seems like the best and most practical alternative IMHO.
– martineau, Jan 16 '14 at 19:37

@martineau Just yesterday I ran into a file with a UTF-8 BOM that wasn't UTF-8 (it was CP936). What's unfortunate is that the ones responsible for the immense amount of pain caused by the UTF-8 BOM are largely oblivious to it.
– bames53, Jan 16 '14 at 23:21

@Alcott: You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ï»¿ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid sequences when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes.
– paercebal, Sep 11 '11 at 18:57
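
To see that point concretely, here is a short Python illustration of the same bytes under different assumed encodings (utf-8-sig is Python's codec that strips a leading BOM):

data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])
print(data.decode('iso-8859-1'))  # 'ï»¿ABC' -- the BOM bytes read as Latin-1 text
print(data.decode('utf-8'))       # '\ufeffABC' -- the BOM kept as code point U+FEFF
print(data.decode('utf-8-sig'))   # 'ABC' -- the BOM recognized and stripped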

@paercebal While "ï»¿" is valid Latin-1, it is very unlikely that a text file begins with that combination. The same holds for the UCS-2 LE/BE markers ÿþ and þÿ. Also, you can never know.
– user877329, Jun 21 '13 at 16:48

@deceze It is probably linguistically invalid: first ï (which is OK), then some quotation mark without a space in between (not OK). ¿ indicates it is Spanish, but ï is not used in Spanish. Conclusion: it is not Latin-1, with a certainty well above the certainty without it.
– user877329, Nov 5 '13 at 7:20

@user Sure, it doesn't necessarily make sense. But if your system relies on guessing, that's where uncertainties come in. Some malicious user submits text starting with these 3 letters on purpose, and your system suddenly assumes it's looking at UTF-8 with a BOM, treats the text as UTF-8 where it should use Latin-1, and some Unicode injection takes place. Just a hypothetical example, but certainly possible. You can't judge a text encoding by its content, period.
– deceze♦, Nov 5 '13 at 7:44

"Encodings should be known, not divined." The heart and soul of the problem. +1, good sir. In other words: either standardize your content and say, "We're always using this encoding. Period. Write it that way. Read it that way," or develop an extended format that allows for storing the encoding as metadata. (The latter probably needs some "bootstrap standard encoding," too. Like saying "The part that tells you the encoding is always ASCII.")
– jpmc26, Jul 23 '15 at 21:25

There are at least three problems with putting a BOM in UTF-8 encoded files.

Files that hold no text are no longer empty because they always contain the BOM.

Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII, because the BOM is not ASCII. This makes some existing tools break down, and it can be impossible for users to replace such legacy tools.

It is not possible to concatenate several files together, because each file now has a BOM at the beginning (a workaround is sketched below).
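
As a sketch of that workaround, here is one way in Python to join UTF-8 files while dropping any leading BOMs (the file names are hypothetical; a plain byte-level cat would instead embed U+FEFF in the middle of the output):

BOM = b'\xef\xbb\xbf'

with open('combined.txt', 'wb') as out:
    for name in ['part1.txt', 'part2.txt']:  # hypothetical inputs
        data = open(name, 'rb').read()
        # Strip a leading BOM from each part before appending.
        out.write(data[len(BOM):] if data.startswith(BOM) else data)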

And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:

It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.

It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.

Re point 1, "Files that hold no text are no longer empty because they always contain the BOM": this (1) conflates the OS filesystem level with the interpreted-contents level, plus it (2) incorrectly assumes that when using a BOM one must put a BOM also in every otherwise empty file. The practical solution to (1) is to not do (2). Essentially the complaint reduces to "it's possible to impractically put a BOM in an otherwise empty file, thus preventing the easiest detection of a logically empty file (by checking the file size)". Still, good software should be able to deal with it, since it has a purpose.
– Cheers and hth. - Alf, Jun 18 '14 at 14:22

Re point 2, "Files that hold ASCII text are no longer themselves ASCII": this conflates ASCII with UTF-8. A UTF-8 file that holds ASCII text is not ASCII, it's UTF-8. Similarly, a UTF-16 file that holds ASCII text is not ASCII, it's UTF-16. And so on. ASCII is a 7-bit single-byte code. UTF-8 is an 8-bit variable-length extension of ASCII. If "tools break down" due to values > 127 then they're just not fit for an 8-bit world. One simple practical solution is to use only ASCII files with tools that break down for non-ASCII byte values. A probably better solution is to ditch those ungood tools.
– Cheers and hth. - Alf, Jun 18 '14 at 14:27

Re point 3, "It is not possible to concatenate several files together because each file now has a BOM at the beginning": that is just wrong. I have no problem concatenating UTF-8 files with BOM, so it's clearly possible. I think maybe you meant that the Unix-land cat won't give you a clean result, a result that has a BOM only at the start. If you meant that, then that's because cat works at the byte level, not at the interpreted-contents level, and in similar fashion cat can't deal with photographs, say. Still, it doesn't do much harm. That's because the BOM encodes a zero-width non-breaking space.
– Cheers and hth. - Alf, Jun 18 '14 at 14:34

@Cheersandhth.-Alf This answer is correct. You are merely pointing out Microsoft bugs.
– tchrist, Oct 1 '14 at 22:34

It's an old question with many good answers, but one thing should be added.

All answers are very general. What I'd like to add are examples of BOM usage that actually cause real problems, and yet many people don't know about them.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter - all start with a shebang line, which looks like one of these:

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it has a different magic number, and that can lead to problems.

The shebang characters are represented by the same two bytes in
extended ASCII encodings, including UTF-8, which is commonly used for
scripts and other text files on current Unix-like systems. However,
UTF-8 files may begin with the optional byte order mark (BOM); if the
"exec" function specifically detects the bytes 0x23 and 0x21, then the
presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent
the script interpreter from being executed. Some authorities recommend
against using the byte order mark in POSIX (Unix-like) scripts,[14]
for this reason and for wider interoperability and philosophical
concerns. Additionally, a byte order mark is not necessary in UTF-8,
as that encoding does not have endianness issues; it serves only to
identify the encoding as UTF-8. [emphasis added]
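
A quick way to check a script for this problem, sketched in Python (the path is hypothetical):

with open('myscript.sh', 'rb') as f:  # hypothetical path
    head = f.read(5)
if head.startswith(b'\xef\xbb\xbf'):
    print('BOM before the shebang: the kernel will not see #! as a magic number')
elif head.startswith(b'#!'):
    print('clean shebang')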

BOM is illegal in JSON

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

RFC 4627 determines the encoding and endianness of JSON by examining the first four bytes for NUL bytes. Prepend a BOM and the detection fails:

In UTF-32LE, the first byte is no longer followed by three NULs, so the stream won't be recognized.

In UTF-16BE, there is now only one NUL in the first four bytes, so the stream won't be recognized.

In UTF-16LE, there is now only one NUL in the first four bytes, so the stream won't be recognized.

Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally, if the implementation tests for valid JSON as I recommend, it will reject even input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
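
A Python sketch of that RFC 4627 heuristic shows the breakage directly (the function name is mine, not from the RFC):

def detect_json_encoding(head: bytes) -> str:
    # RFC 4627: the first two characters are ASCII, so the NUL pattern
    # in the first four bytes reveals the encoding and endianness.
    nul = [b == 0 for b in head[:4]]
    if nul == [True, True, True, False]:
        return 'utf-32-be'
    if nul == [True, False, True, False]:
        return 'utf-16-be'
    if nul == [False, True, True, True]:
        return 'utf-32-le'
    if nul == [False, True, False, True]:
        return 'utf-16-le'
    return 'utf-8'

print(detect_json_encoding('{}'.encode('utf-16-le')))                # utf-16-le
print(detect_json_encoding(b'\xff\xfe' + '{}'.encode('utf-16-le')))  # utf-8 -- wrong, the BOM broke detection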

Other data formats

BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it then, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need it - just don't call it JSON then.

For data formats other than JSON, take a look at how the format is really specified. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make it more complicated and error-prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed information specifically about scripting and serialization, because these are examples of BOM characters causing real problems.

RFC 7159, which supersedes RFC 4627, actually suggests that supporting a BOM may not be so evil. Basically, not having a BOM is just an ambiguous kludge so that old Windows and Unix software that is not Unicode-aware can still process UTF-8.
– Eric Grange, Apr 10 '17 at 7:59

Sounds like JSON needs updating in order to support it, same with Perl scripts, Python scripts, Ruby scripts and Node.js. Just because these platforms opted not to include support doesn't necessarily kill the use for BOM. Apple has been trying to kill Adobe for a few years now, and Adobe is still around. But an enlightening post.
– htm11h, Jul 24 '17 at 15:47

@EricGrange, you seem to be very strongly supporting BOM, but fail to realize that this would render the all-ubiquitous, universally useful, optimal-minimum "plain text" format a relic of the pre-UTF8 past! Adding any sort of (in-band) header to the plain text stream would, by definition, impose a mandatory protocol to the simplest text files, making it never again the "simplest"! And for what gain? To support all the other, ancient CP encodings that also didn't have signatures, so you might mistake them with UTF-8? (BTW, ASCII is UTF-8, too. So, a BOM to those, too? ;) Come on.)
– Sz., Mar 14 '18 at 22:20

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
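
A quick Python check of how U+FEFF comes out in each encoding form (bytes.hex with a separator needs Python 3.8+):

bom = '\ufeff'
print(bom.encode('utf-8').hex(' '))      # ef bb bf -- no byte order to signal
print(bom.encode('utf-16-be').hex(' '))  # fe ff
print(bom.encode('utf-16-le').hex(' '))  # ff fe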

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.

This would also invalidate valid UTF-8 that has a single erroneous byte in it, though :/
– endolith, Jul 15 '12 at 1:05

-1 re "It causes problems with non-BOM-aware software.": that has never been a problem for me; on the contrary, the absence of a BOM causing problems with BOM-aware software (in particular Visual C++) has been a problem. So this statement is very platform-specific, a narrow Unix-land point of view, but it is misleadingly presented as if it applies in general. Which it does not.
– Cheers and hth. - Alf, Jun 18 '14 at 14:46

No, UTF-8 has no BOM. This answer is incorrect. See the Unicode Standard.
– tchrist, Oct 1 '14 at 22:35

You can even think you have a pure ASCII file when just looking at the bytes, but this could be a UTF-16 file as well, where you'd have to look at words and not at bytes. Modern software should be aware of BOMs. Still, reading UTF-8 can fail upon detecting invalid sequences, code points that could use a smaller sequence, or code points that are surrogates. For UTF-16, reading might fail too when there are orphaned surrogates.
– brighty, Feb 9 '15 at 16:56

UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add the bytes "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8, or Notepad++ with UTF-8 with BOM), Excel opens it fine.
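
In Python, for example, this is what the utf-8-sig codec is for; a minimal sketch of writing an Excel-friendly CSV (the file name and rows are made up):

import csv

# 'utf-8-sig' prepends the EF BB BF signature, so Excel detects UTF-8.
with open('report.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'city'])
    writer.writerow(['Müller', 'Kraków'])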

Thanks for this excellent tip in case one is creating UTF-8 files for use by Excel. In other circumstances though, I would still follow the other answers and skip the BOM.
– Thomas Jensen, May 7 '13 at 19:20

It's also useful if you create files that contain only ASCII and may later have non-ASCII added to them. I have just run into such an issue: software that expects UTF-8 creates a file with some data for user editing. If the initial file contains only ASCII and is opened in some editors and then saved, it ends up in Latin-1 and everything breaks. If I add the BOM, it will get detected as UTF-8 by the editor and everything works.
– Roberto Alsina, Sep 9 '13 at 22:03

Where do you read a recommendation for using a BOM into that RFC? At most, there's a strong recommendation to not forbid it under certain circumstances where doing so is difficult.
– Deduplicator, Aug 11 '15 at 18:37

If "Excel thinks it's ANSI and shows gibberish", then the problem is in Excel.
– sorontar, Nov 26 '16 at 8:10

BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, an HTML file, JSON response, RSS, etc.) and causes the kind of embarrassment seen in the recent encoding issue experienced during the talk of Obama on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.

Yes, I just spent hours identifying a problem caused by a file being encoded as UTF-8 instead of UTF-8 without BOM. (The issue only showed up in IE7, so that led me on quite a goose chase. I used Django's "include".)
– user984003, Jan 31 '13 at 20:45

Future readers: Note that the tweet issue I've mentioned above was not strictly related to BOM, but if it was, then the tweet would be garbled in a similar way, but at the start of the tweet.
– Halil Özgür, Feb 1 '13 at 7:26

@user984003 No, the problem is that Microsoft has misled you. What it calls UTF-8 is not UTF-8. What it calls UTF-8 without BOM is what UTF-8 really is.
– tchrist, Oct 2 '14 at 0:11

What does the "sic" add to your "no pun intended"?
– JoelFan, Oct 23 '17 at 21:15

@JoelFan I can't recall anymore but I guess the pun might have been intended despite the author's claim :)
– Halil Özgür, Oct 23 '17 at 21:34

Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require
or recommend its use. Byte order has no meaning in UTF-8, so its
only use in UTF-8 is to signal at the start that the text stream is
encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility
with software that is not Unicode-aware... Another motivation for not
using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is
required to determine what character encoding a file is using.
Historically such analysis, to distinguish various 8-bit encodings, is
complicated, error-prone, and sometimes slow. A number of libraries
are available to ease the task, such as Mozilla Universal Charset
Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally
difficult (it is not, because the vast majority of byte sequences
are invalid UTF-8, while the encodings these libraries are trying to
distinguish allow all possible byte sequences). Therefore not all
Unicode-aware programs perform such an analysis and instead rely on
the BOM.

In particular, Microsoft compilers and interpreters, and many
pieces of software on Microsoft Windows such as Notepad will not
correctly read UTF-8 text unless it has only ASCII characters or it
starts with the BOM, and will add a BOM to the start when saving text
as UTF-8. Google Docs will add a BOM when a Microsoft Word document is
downloaded as a plain text file.

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8,
or (b) has some other way to indicate what encoding is being used,
then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic as it is for other applications.

I'd better stick to WITHOUT the BOM. I found that .htaccess and gzip compression in combination with a UTF-8 BOM gives an encoding error. Changing to UTF-8 without BOM, following a suggestion as explained here, solved the problems.
– Chetabahana, Apr 16 '15 at 15:09

'Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.' -- Which is so strong and valid an argument that you could have actually stopped the answer there!... ;-o Unless you have a better idea for universal text representation, that is. ;) (I don't know how old you are, or how many years you had to suffer in the pre-UTF8 era (when linguists desperately considered even changing their alphabets), but I can tell you that every second we get closer to ridding ourselves of the mess of all the ancient single-byte-with-no-metadata encodings, in favor of having "the one", is pure joy.)
– Sz., Mar 14 '18 at 22:41

See also this comment about how adding a BOM (or anything!) to the simplest of the text file formats, "plain text", would mean preventing exactly the best universal text encoding format from being "plain", and "simple" (i.e. "overheadless")!...
– Sz., Mar 14 '18 at 22:58

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"

Do you have any example where software makes a decision of whether to use UTF-8 with or without BOM based on whether the previous encoding it is converting from had a BOM or not?! That seems like an absurd claim.
– barlop, Mar 3 '18 at 15:31

I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

Thanks for the excellent tip about Windows classic Notepad. I already spent some time finding out the exact same thing. My conclusion was to always use Notepad++ instead of Windows classic Notepad. :-)
– Thomas Jensen, May 7 '13 at 19:22

You'd better use MadEdit. It's the only editor that, in hex mode, shows one character if you select a UTF-8 byte sequence, instead of a 1:1 basis between byte and character. A hex editor that is aware of a UTF-8 file should behave like MadEdit does!
– brighty, Feb 9 '15 at 16:49

@brighty I don't think you need one-to-one for the sake of the BOM. It doesn't matter; it doesn't take much to recognise that a UTF-8 BOM is EFBBBF or FFFE (FFFE if read wrong). One can simply delete those bytes. It's not bad to have a mapping for the rest of the file, though, but to also be able to delete byte by byte too.
– barlop, Mar 3 '18 at 15:34

@barlop Why would you want to delete a UTF-8 BOM if the file's content is UTF-8 encoded? The BOM is recognized by modern text viewers, text controls as well as text editors. A one-to-one view of a UTF-8 sequence makes no sense, since n bytes result in one character. Of course a text editor or hex editor should allow deleting any byte, but this can lead to invalid UTF-8 sequences.
– brighty, Mar 4 '18 at 16:41

@brighty UTF-8 with BOM is an encoding, and UTF-8 without BOM is an encoding. The cmd prompt uses UTF-8 without BOM. So if you have a UTF-8 file and you run the command chcp 65001 for UTF-8 support, it's UTF-8 without BOM. If you do type myfile it will only display properly if there is no BOM. If you do echo aaa>a.a or echo אאא>a.a to output the characters to file a.a, and you have chcp 65001, it will output with no BOM.
– barlop, Mar 5 '18 at 4:55

UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.

"which has no use for UTF-8 as it is 8-bits per glyph anyway." Er... no, only ASCII-7 glyphs are 8-bits in UTF-8. Anything beyond that is going to be 16, 24, or 32 bits.
– PowerlordFeb 8 '10 at 18:38

"The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases."... endianness simply does not apply to UTF-8, regardless of use case
– JoelFan, Oct 23 '17 at 21:30

When you want to display information encoded in UTF-8, you may not face problems. For example, declare an HTML document as UTF-8 and your browser will display everything that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file in Windows or Linux, one of the easiest things imaginable, is not (usually) UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in the IDE and "Create New File") or adding the BOM at the beginning of the file, the French letters would not be read correctly.

It should be noted that for some files you must not have the BOM even on Windows. Examples are SQL*Plus or VBScript files. In case such files contain a BOM, you get an error when you try to execute them.

UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non-ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

Edit: Just want to make it clear that I prefer to not have the BOM at all; add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.

That’s because Microsoft has swapped the meaning of what the standard says. UTF-8 has no BOM: they have created Microsoft UTF-8 which inserts a spurious BOM in front of the data stream and then told you that no, this is actually UTF-8. It is not. It is just extending and corrupting.
– tchrist, Oct 2 '14 at 0:14

This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (byte order mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte sequence EF BB BF.

If a byte sequence corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a file with a BOM is not UTF-8 (i.e., that it's Latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
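
When you do expect UTF-8, one low-effort way to tolerate an optional BOM is Python's utf-8-sig codec, which strips the signature if present and passes BOM-less input through untouched; a sketch with a hypothetical file name:

# Reads both BOM-prefixed and plain UTF-8 transparently; still raises
# UnicodeDecodeError on non-UTF-8 input, which is the cue to fall back
# to a legacy guess (e.g. Latin-1 or ANSI).
with open('input.txt', encoding='utf-8-sig') as f:  # hypothetical file
    text = f.read()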

Why is a BOM not recommended?

Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.

It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.

As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOM-fed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns

The byte order mark (BOM) is a Unicode
character used to signal the
endianness (byte order) of a text file
or stream. Its code point is U+FEFF.
BOM use is optional, and, if used,
should appear at the start of the text
stream. Beyond its specific use as a
byte-order indicator, the BOM
character may also indicate which of
the several Unicode representations
the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.

αβγ is not ASCII, but can appear in 8-bit-ASCII-based encodings. The use of a BOM disables a benefit of UTF-8, its compatibility with ASCII (the ability to work with legacy applications where pure ASCII is used).
– ctrl-alt-delor, Jan 7 '11 at 13:03

This is the wrong answer. A string with a BOM in front of it is something else altogether. It is not supposed to be there and just screws everything up.
– tchrist, Oct 2 '14 at 0:13

"Without BOM this opens as ANSI in most editors." I agree absolutely. If this happens, you're lucky if you deal with the correct codepage, but indeed it's just a guess, because the codepage is not part of the file. A BOM is.
– brighty, Feb 9 '15 at 16:59

A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

Some protocols allow optional BOMs in the case of untagged text. In those cases,

Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM,
the encoding could be anything.

Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there
is no BOM, the text should be interpreted as big-endian.

Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the
BOM as encoding form signature should be avoided.

Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE,
UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

UTF with BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page. That is my opinion (30 years in the computing and IT industry).

I find this to be true as well. If you use characters outside of the first 255 ASCII set and you omit the BOM, browsers interpret it as ISO-8859-1 and you get garbled characters. Given the answers above, this is apparently a case of the browser vendors doing the wrong thing when they don't detect a BOM. But unless you work on Microsoft Edge/Mozilla/WebKit/Blink, you have no choice but to work with the defects these apps have.
– funkwurm, Nov 28 '17 at 8:42