Splitting Large XML Files: Some Gotchas
A common requirement for splitting large XML files is to create multiple, smaller XML files that contain the same, or roughly the same, number of elements.
This is a fairly simple coding project. Most programming languages provide tools for reading and writing large XML documents. Other XML splitting
requirements are not as simple. XmlSplit was designed to handle many common splitting projects and even some
fairly complex ones, but sometimes the requirements are so complicated that custom programming is required. A good idea of the range of
real-world requirements can be obtained by searching the Web for "Split XML" and reading some of the discussions.
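The simple case described above can be sketched in a few lines. The following is a minimal illustration in Python, not how XmlSplit works: it assumes the "records" are the direct children of the root element, that any text between records is only whitespace, and that namespace prefixes need not be preserved exactly. The names split_xml and build_document are hypothetical, and the xml_declaration argument requires Python 3.8 or later.

```python
import io
import xml.etree.ElementTree as ET


def build_document(root_tag, root_attrib, records):
    """Wrap a batch of record elements in a copy of the original root."""
    root = ET.Element(root_tag, dict(root_attrib))
    for record in records:
        record.tail = None  # assumption: text between records is only whitespace
        root.append(record)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def split_xml(source, records_per_file):
    """Yield well-formed XML documents (bytes), each holding at most
    records_per_file direct children of the source root element."""
    root = None
    root_attrib = {}
    depth = 0
    batch = []
    # iterparse reads the input incrementally rather than all at once.
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem  # the first start event belongs to the root
                root_attrib = dict(elem.attrib)
            continue
        depth -= 1
        if depth != 1:
            continue  # only direct children of the root count as records
        batch.append(elem)
        root.clear()  # detach finished records so memory use stays bounded
        if len(batch) == records_per_file:
            yield build_document(root.tag, root_attrib, batch)
            batch = []
    if batch:
        yield build_document(root.tag, root_attrib, batch)


if __name__ == "__main__":
    sample = io.BytesIO(b"<items><item>1</item><item>2</item><item>3</item></items>")
    for number, document in enumerate(split_xml(sample, 2), start=1):
        print(number, document.decode("utf-8"))
```

Each yielded document repeats the original root element, so every piece is well-formed on its own. Writing each piece to a file is left out to keep the sketch short.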
XSLT is a declarative programming language that can handle almost any splitting requirement. Most XSLT processors do require that the entire file be loaded
into memory, which is generally not a problem given the large amount of RAM that can be installed on personal computers. However, if the XML
is several gigabytes, then XSLT processing could take considerable time.
Before coding your own program to split an XML file, consider the following issues and recommendations:
1. Each split file must have a root element in order to be well-formed XML.
2. Don't split the XML into exactly equal size files by reading and writing bytes. The resulting files will likely be split in the middle of an XML
node and, therefore, not be well-formed XML.
3. Use an XML parser that conforms to the W3C XML specification. Most programming languages include a fast, forward-only XML reader that can
read XML files of any size while requiring very little memory.
4. Avoid reading XML line by line using a text reader that looks for line ending characters unless you know the XML contains them. Some
documents do not, especially transmitted XML messages.
5. Avoid doing your own parsing or using tools like regular expressions unless you are certain that the XML is well-formed. If XML content
contains the ">" character, which is legal in XML text and attribute values, then locating end tags can be problematic. If you need to
determine the depth of XML elements, then additional code is needed to keep track of it.
6. If the original XML has an XmlDeclaration node, should it be replicated in each split file? This question also applies to a DocumentType
node (DOCTYPE). If the XML contains entity references declared in a DocumentType, then each split file containing such entities must have the
DocumentType declaration in order to be well-formed XML.
7. What if the original XML has a header element? Should it be skipped or replicated in each split file?
8. If the original XML has a byte order mark, should it be written to each split file?
9. XML parsers are required to normalize line endings by replacing each carriage return (CR) and line feed (LF) pair with a single LF, and by
replacing a CR that occurs by itself with an LF. Because CRLF is the standard line ending in the Windows operating system,
almost all Windows software, including text editors and word processors, uses CRLF for line endings. Text that passes through a parser
therefore contains only LFs, so split files may not preserve the CRLF line endings of the original unless your code restores them.
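Items 2, 5 and 9 are easy to check with a small experiment. The sketch below uses Python's standard xml.etree.ElementTree module and a made-up sample document. The helper is_well_formed and the naive pattern are hypothetical and exist only to illustrate the pitfalls, not to recommend an approach.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: CRLF line endings and a legal ">" in an attribute value.
DOC = b'<log>\r\n<entry note="a > b">first\r\nsecond</entry>\r\n</log>'


def is_well_formed(data):
    """Return True if the parser accepts data as a complete XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Item 2: cutting at a byte offset leaves two fragments, neither well-formed.
middle = len(DOC) // 2
first_half, second_half = DOC[:middle], DOC[middle:]

# Item 5: a naive start-tag pattern stops at the ">" inside the attribute.
naive_match = re.search(rb"<entry[^>]*>", DOC).group()

# Item 9: the parser reports LF even though the input contains CRLF.
entry = ET.fromstring(DOC).find("entry")

if __name__ == "__main__":
    print(is_well_formed(DOC), is_well_formed(first_half), is_well_formed(second_half))
    print(naive_match)
    print(repr(entry.text), entry.get("note"))
```

The parser recovers the full attribute value that the pattern truncates, and the text it returns no longer contains the carriage returns present in the input bytes.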
Submitted by Bill Conniff, Founder of Xponent, on August 29, 2012