What Is A Byte Order Mark?
Have you ever gotten an error like "Missing byte order mark" or "Byte order mark found" and wondered what the error
message means? First of all, you need to know what a byte order mark(BOM) is. The BOM is a unicode character that
is used to indicate the byte order of the document. This is important when the encoding uses two bytes per character,
such as with utf-16. The BOM indicates which byte is significant. The BOM character may be used to indicate which of
the several Unicode representations the text is encoded in.
An XML document is not required to have a BOM, but if it does it should occur at the beginning of the file. It is
used to inform XML parsers(software that reads XML byte by byte, checks for syntax errors and identifies each node
and value) what encoding the XML document is written in. If the document contains no BOM, most
parsers default to utf-8. But there is another way. XML documents may begin with an optional XmlDeclaration which
may include an optional encoding attribute. For example, the following XmlDeclaration specifies that the document
is written in utf-8 encoding:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
Note that if an XmlDeclaration is included it must occur at the beginning of the document and the version attribute
is required, while the encoding and standalone attributes are optional. If the encoding attribute is used, and it
conflicts with the BOM, an error will occur when you attempt to open the file with a program that uses an XML parser.
Therefore, it is recommended to not change the encoding attribute, or insert one if none exists, in an existing XML
How to remove the Byte Order Mark
There are several scenarios where you might need to remove the BOM from an existing XML document. One is if you know
the BOM is incorrect and is causing an error when the XML document is read.
Some editors have an option for saving a file without a BOM. You may also use a hex editor to delete the BOM and
then save the file.
But what if the XML file is too big to load into your editor?
If you are a developer, you can use a stream object that reads and writes bytes to read the entire XML file and
write it back out without the BOM. This will work with any size file if you specify a reasonably small buffer size.
The code snippet below is a c# method that returns the encoding indicated by the BOM, or an empty string if no
BOM is found.
XMLMax and XmlSplit have an undocumented method that can be used to remove the BOM. Select the
split method that splits the file at every Nth element and specify a value greater than the number of elements in
the XML file. XMLMax by default does not write a byte order mark. For XmlSplit, simply omit the command-line argument
for writing a BOM. Instead of splitting the xml, the XML is re-written without the BOM. If the option to write an
XmlDeclaration is used, a new XmlDeclaration is written to replace the one in the original file.
Inserting a BOM.
The methods described above for removing a BOM can also be used to insert one. In the case of editors, most either
insert a BOM as the default when saving, or have an option to not write the BOM. If you use a hex editor, or re-write
the file using a stream, be careful that the BOM matches the actual encoding of the file. Note that for utf-16 the
BOM is written with two bytes whereas for utf-8 it is written with three bytes.
Code to read the BOM and return the encoding.
Submitted by Bill Conniff, Founder of Xponent, on November 15, 2011