Nov 21, 2018

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. BOM use is optional, and, if used, should appear at the start of the text stream.

Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
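As a rough illustration of how a BOM can identify the encoding, the sketch below compares the first bytes of a buffer against the BOM constants in Python's standard codecs module. The helper name sniff_bom and the returned encoding labels are assumptions made for this example, and a missing BOM simply means "unknown", not "not Unicode".

```python
import codecs

# Check the longer signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xef\xbb\xbfhello"))  # UTF-8 BOM followed by ASCII text
print(sniff_bom(b"hello"))              # no BOM at all
```

One caveat with this approach: UTF-16 LE text whose first character after the BOM is U+0000 would be indistinguishable from UTF-32 LE, so the result is best treated as a hint.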

The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
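The byte sequence and its mis-decoded appearance can be checked with a few lines of Python. This is only a small sketch; the variable names are made up for the example.

```python
# U+FEFF is the code point used as the byte order mark.
bom_bytes = "\ufeff".encode("utf-8")

# Decoding those same bytes with a single-byte legacy encoding
# turns each byte into its own visible character.
as_cp1252 = bom_bytes.decode("cp1252")
as_latin1 = bom_bytes.decode("iso-8859-1")

print(bom_bytes.hex(" "))  # the three bytes in hex
print(as_cp1252)           # what a CP1252 viewer would show
```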

The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its usage. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.

One reason the UTF-8 BOM is not recommended is that many pieces of software without Unicode support can nevertheless handle UTF-8 inside a text, but not at the start of a text. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in many programming languages, and the language will write the correct UTF-8 to a file or to a display, despite not knowing anything about UTF-8. This provides an easy migration path to convert systems to Unicode and to remove all legacy encodings, without simultaneously upgrading the programming language. The three unexpected bytes of the BOM break this, however, because they sit at the very start of the file, where they are certain to be a syntax error.

A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts three bytes before the pattern. Though commonly associated with the Unix shebang at the start of an interpreted script, the problem is more widespread. For instance, in PHP a BOM causes the page to begin output before the initial code is interpreted, which causes problems if the page is trying to send custom HTTP headers (these must be set before output begins).
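The shebang case can be reproduced with a naive prefix check. The function has_shebang and the sample script below are hypothetical and only stand in for whatever real software does when it looks at the first two bytes of a file.

```python
import codecs


def has_shebang(data: bytes) -> bool:
    """Naive check: does the file start with the two bytes '#!'?"""
    return data.startswith(b"#!")


script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script  # same script saved with a BOM

print(has_shebang(script))           # True
print(has_shebang(script_with_bom))  # False: the BOM comes before '#!'
```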

Many Windows programs (including Windows Notepad!) add a BOM to UTF-8 files by default, and this can cause a lot of trouble if you use them for development.
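If an editor has already added a BOM, one possible way to get rid of it is to drop the first three bytes when they match. The sketch below assumes the file is small enough to read into memory; both function names are invented for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_utf8_bom_in_file(path: str) -> bool:
    """Rewrite the file without its BOM. Returns True if it was changed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # nothing to do, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

When only reading is needed, Python's built-in "utf-8-sig" codec does something similar: open(path, encoding="utf-8-sig") skips a leading BOM if there is one and otherwise behaves like plain UTF-8.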

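Note: if you save a file as UTF-8 without a BOM and it contains only English (ASCII) letters, the saved bytes are identical to plain ASCII. Nothing in the file marks it as UTF-8 any more, so an editor may reopen it in a legacy encoding instead!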
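The reason is that UTF-8 encodes ASCII characters as the same single bytes that ASCII uses. A quick check in Python, with an arbitrary sample string chosen for this example:

```python
text = "plain english text"

utf8_bytes = text.encode("utf-8")          # UTF-8 without a BOM
ascii_bytes = text.encode("ascii")         # plain ASCII
utf8_bom_bytes = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

print(utf8_bytes == ascii_bytes)  # True: nothing distinguishes the two
print(utf8_bom_bytes[:3].hex())   # efbbbf: the only marker is the BOM
```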