Xponent's Mostly XML Blog

What Is A Byte Order Mark?

Have you ever gotten an error like "Missing byte order mark" or "Byte order mark found" and wondered what the error message means? First of all, you need to know what a byte order mark (BOM) is. The BOM is a Unicode character, U+FEFF, placed at the very start of a document to indicate its byte order. This matters when the encoding uses code units wider than one byte, such as the two-byte units of UTF-16: the BOM tells the reader whether the most significant or the least significant byte of each unit comes first. Because the BOM is encoded differently in each Unicode encoding, it can also be used to indicate which of the several Unicode representations the text is encoded in.
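To make this concrete, here is a small Python sketch (Python is my choice for illustration, and the helper name bom_bytes is made up) that prints the bytes U+FEFF turns into under each encoding. The byte order is visible in the UTF-16 and UTF-32 rows.

```python
import codecs

def bom_bytes(encoding: str) -> bytes:
    """Return the bytes that U+FEFF (the BOM character) becomes in `encoding`."""
    return "\ufeff".encode(encoding)

# Print the BOM for the common Unicode encodings as hex bytes.
for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{name:10} {bom_bytes(name).hex(' ')}")

# The standard library ships the same byte sequences as constants.
assert bom_bytes("utf-8") == codecs.BOM_UTF8
assert bom_bytes("utf-16-be") == codecs.BOM_UTF16_BE
assert bom_bytes("utf-16-le") == codecs.BOM_UTF16_LE
```

Note that the UTF-8 BOM carries no byte-order information at all; it only acts as a signature saying "this is UTF-8".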

A UTF-8 XML document is not required to have a BOM (the XML specification does expect one on UTF-16 documents), but if a BOM is present it must be the very first thing in the file. It tells XML parsers (software that reads XML byte by byte, checks for syntax errors and identifies each node and value) what encoding the XML document is written in. If the document has neither a BOM nor an encoding declaration, parsers assume UTF-8. But there is another way. XML documents may begin with an optional XmlDeclaration, which may include an optional encoding attribute. For example, the following XmlDeclaration specifies that the document is written in UTF-8 encoding:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

Note that if an XmlDeclaration is included, it must occur at the very beginning of the document (directly after the BOM, if there is one). The version attribute is required, while the encoding and standalone attributes are optional. If the encoding attribute is present and conflicts with the BOM, an error will occur when you attempt to open the file with a program that uses an XML parser. It is therefore recommended that you neither change the encoding attribute nor insert one where none exists in an existing XML document.
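If you want to check a file for this kind of conflict before handing it to a parser, something along these lines may help. It is only a sketch in Python: the function name and the regular expression are my own, the match on the declaration is deliberately loose, and a real parser applies stricter rules.

```python
import codecs
import re

# (BOM bytes, codec for decoding the text after it, encoding family).
# UTF-32 comes first because its little-endian BOM begins with the UTF-16 one.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le", "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32-be", "utf-32"),
    (codecs.BOM_UTF8, "utf-8", "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le", "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16-be", "utf-16"),
)

# Loose match for the encoding attribute of an XmlDeclaration.
DECLARED = re.compile(r"""<\?xml[^>]*?encoding\s*=\s*["']([^"']+)["']""")

def bom_conflicts_with_declaration(head: bytes) -> bool:
    """True if the first bytes of a file carry a BOM that disagrees
    with the encoding attribute of the XmlDeclaration."""
    for bom, codec, family in BOMS:
        if head.startswith(bom):
            text = head[len(bom):].decode(codec, errors="ignore")
            break
    else:
        return False  # no BOM, so nothing to conflict with
    match = DECLARED.match(text)
    if match is None:
        return False  # no encoding attribute in the declaration
    try:
        declared = codecs.lookup(match.group(1)).name
    except LookupError:
        return True  # an unknown label cannot agree with the BOM
    return not declared.startswith(family)
```

Passing the first few hundred bytes of the file is enough, since both the BOM and the XmlDeclaration sit at the very start.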

How to remove the Byte Order Mark

There are several scenarios where you might need to remove the BOM from an existing XML document. One is if you know the BOM is incorrect and is causing an error when the XML document is read. Some editors have an option for saving a file without a BOM. You may also use a hex editor to delete the BOM and then save the file.

But what if the XML file is too big to load into your editor? If you are a developer, you can use a stream object that reads and writes bytes to read the entire XML file and write it back out without the BOM. This works with a file of any size as long as you specify a reasonably small buffer size. The code snippet at the end of this post is a C# method that returns the encoding indicated by the BOM, or an empty string if no BOM is found.
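The stream approach might look roughly like this. It is a Python sketch rather than C#, and strip_bom and its buffer_size parameter are names I made up for the example. Only the first four bytes are inspected; everything else is copied through in fixed-size chunks, so memory use does not grow with the file.

```python
import codecs
import io
import shutil

# Longest BOMs first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = (
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
)

def strip_bom(src, dst, buffer_size=64 * 1024):
    """Copy binary stream `src` to `dst` without a leading BOM.

    Returns the number of BOM bytes that were skipped (0 if none)."""
    head = src.read(4)  # no BOM is longer than four bytes
    skipped = next((len(b) for b in BOMS if head.startswith(b)), 0)
    dst.write(head[skipped:])                  # keep whatever followed the BOM
    shutil.copyfileobj(src, dst, buffer_size)  # stream the rest in chunks
    return skipped

# Small in-memory demonstration.
source = io.BytesIO(codecs.BOM_UTF8 + b"<root/>")
target = io.BytesIO()
print(strip_bom(source, target), target.getvalue())  # prints: 3 b'<root/>'
```

For a real file you would open the input with mode "rb" and a separate output file with mode "wb" and pass both to strip_bom. Writing to a new file rather than over the original is the safer choice.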

XMLMax and XmlSplit have an undocumented method that can be used to remove the BOM. Select the split method that splits the file at every Nth element and specify a value greater than the number of elements in the XML file. Instead of splitting the XML, the program rewrites it without the BOM. XMLMax by default does not write a byte order mark. For XmlSplit, simply omit the command-line argument for writing a BOM. If the option to write an XmlDeclaration is used, a new XmlDeclaration is written to replace the one in the original file.

Inserting a BOM

The methods described above for removing a BOM can also be used to insert one. In the case of editors, most either insert a BOM by default when saving or have an option to not write the BOM. If you use a hex editor, or rewrite the file using a stream, be careful that the BOM matches the actual encoding of the file. Note that for UTF-16 the BOM is written as two bytes (FE FF for big-endian, FF FE for little-endian), whereas for UTF-8 it is written as three bytes (EF BB BF).
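Here is a rough Python sketch of inserting a BOM with a stream. The function add_bom and its parameters are hypothetical, and it assumes you already know the file's real encoding: it does not verify that the bytes actually are in that encoding, so passing the wrong name produces exactly the mismatch warned about above.

```python
import codecs
import io
import shutil

# BOM bytes for the encodings this sketch knows about.
BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}

def add_bom(src, dst, encoding, buffer_size=64 * 1024):
    """Copy binary stream `src` to `dst`, prefixing the BOM for `encoding`.

    The caller must pass the file's actual encoding. If the stream
    already starts with that BOM, it is copied unchanged."""
    bom = BOM_FOR[encoding]
    head = src.read(len(bom))
    if head != bom:
        dst.write(bom)                         # BOM missing, so write it first
    dst.write(head)
    shutil.copyfileobj(src, dst, buffer_size)  # stream the rest in chunks

# Small in-memory demonstration.
target = io.BytesIO()
add_bom(io.BytesIO(b"<root/>"), target, "utf-8")
print(target.getvalue())  # prints: b'\xef\xbb\xbf<root/>'
```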

Code to read the BOM and return the encoding.
[Image: "bom", the C# code listing]
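As a rough stand-in for the C# method described above, here is a Python sketch with the same contract: it returns the encoding indicated by the BOM, or an empty string if no BOM is found. This is not the original listing. The function names and the encoding labels it returns are my own choices.

```python
import codecs

# Longest BOMs first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def detect_bom(head: bytes) -> str:
    """Return the encoding named by a leading BOM, or "" if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return ""

def detect_bom_in_file(path) -> str:
    """Read only the first four bytes of the file and inspect them."""
    with open(path, "rb") as f:
        return detect_bom(f.read(4))
```

Because only four bytes are read, this works on a file of any size.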

Submitted by Bill Conniff, Founder of Xponent, on November 15, 2011