Monday, December 15, 2008

XML Parsing

When it comes to parsing XML documents, particularly in Java, things have been quite puzzling, at least for me! Jargon like DOM, SAX, JDOM and dom4j made things quite confusing. I am going to put up my findings on the issue, gathered from different sources, to share with others; I hope it helps. Please correct me if I am wrong about something, or tell me if something is missing! Let us talk a little about DOM and SAX first, then JDOM and dom4j.
Generally, an XML document can be parsed in one of two ways: via the Document Object Model (DOM), which is specified by the W3C, or via the Simple API for XML (SAX).
Looking at an XML document (at least a well-formed one), it is easy to see that it has a tree-like structure: a root node (the root element), child nodes, their children and so on. So a tree data structure is a natural way to access the contents of the document, and this is the approach the DOM APIs are built around. On the other hand, the contents of an XML document can also be accessed by reading the document sequentially, very much like humans do when reading a book; that is where the SAX APIs come from!
Document Object Model, or DOM, is a platform- and language-independent API from the W3C for accessing and modifying XML documents. A DOM-based parser reads the whole XML document in one go and keeps it in memory as a tree-like structure in which each node stands for a part of the document: an element, an attribute, a piece of text and so on. The document is read from start to end, and as elements are encountered the corresponding tree is built up. Once the tree is in memory, the API provides methods to access its contents. DOM allows navigation along the tree in any direction, giving random access to any node.
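
To make the tree idea concrete, here is a minimal sketch using the DOM classes that ship with the JDK (javax.xml.parsers and org.w3c.dom). The sample document, the class name DomSketch and its helper methods are made up for illustration; treat it as one possible way to do it, not the only one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSketch {
    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<library><book id=\"1\"><title>Dune</title></book>"
        + "<book id=\"2\"><title>Emma</title></book></library>";

    // Reads the whole document and returns the in-memory tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Random access: jump straight to the title at the given 0-based index.
    static String titleAt(Document doc, int index) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.item(index).getTextContent();
    }

    // Navigation in another direction: walk up from a title to its
    // parent <book> element and read the id attribute.
    static String bookIdOfTitle(Document doc, int index) {
        Element title = (Element) doc.getElementsByTagName("title").item(index);
        Element book = (Element) title.getParentNode();
        return book.getAttribute("id");
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(doc.getDocumentElement().getTagName()); // library
        System.out.println(titleAt(doc, 1));                       // Emma
        System.out.println(bookIdOfTitle(doc, 1));                 // 2
    }
}
```

Note that the second title is fetched without touching the first one, and that we can move from a node back up to its parent; both are possible only because the whole tree is sitting in memory.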
Simple API for XML, or SAX, is a collection of callbacks for sequential parsing of an XML document. SAX-based parsers are event-driven: they parse the XML document sequentially without storing its structure in memory, which makes SAX memory-efficient compared to DOM. The events issued by the parser include the start and end of the document, the start and end of each element, character data and processing instructions (comments are reported only through the optional LexicalHandler extension). SAX-based parsers, being sequential, provide access to only one piece of the document at a time.
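
The callback style is easier to see in code. Below is a small sketch with the SAX classes from the JDK (org.xml.sax and javax.xml.parsers) that simply records each event as it arrives. The sample document, the class name SaxSketch and the "start:"/"end:"/"text:" labels are my own inventions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<library><book><title>Dune</title></book></library>";

    // Parses the document and returns the events in the order they fired.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();

        // The parser calls these methods back while it reads the input.
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }

            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override public void endElement(String uri, String localName,
                                             String qName) {
                log.add("end:" + qName);
            }

            // Text may be delivered in more than one chunk, so a real
            // handler should be ready to see several calls per element.
            @Override public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Print the events one per line, in document order.
        for (String event : events(XML)) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept once a callback returns unless the handler stores it itself; here the only thing retained is the log of event names.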

Remember: SAX and DOM are just APIs for parsing XML documents (so are JDOM and dom4j, which we will discuss later); they are not parsers! You do need DOM, SAX or some other parsing API to parse an XML document, and a parser can be based on either of the APIs.

SAX and DOM have evolved as specifications that describe how XML parsers pass the contents of an XML document to client applications, through interfaces that the parsers implement. SAX was first developed for Java, and versions for other major programming languages followed; DOM was language-independent from the start. There are a number of parsers available: Apache Xerces, Sun Crimson, Oracle’s XML Parser, MSXML etc. A parser also ships with the JDK itself: Crimson in JDK 1.4, and a Xerces-based parser in JDK 1.5 and later.
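
If you are curious which concrete parser sits behind the APIs on your own machine, the JAXP factory classes in the JDK (javax.xml.parsers) can tell you. This is only a small sketch; the class name ParserLookupSketch and its methods are made up, and the names printed depend on your JDK and classpath, so I am not showing any expected output.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

class ParserLookupSketch {
    // Name of the class that implements the DOM side on this JVM.
    static String domImplementation() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the class that implements the SAX side on this JVM.
    static String saxImplementation() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The application code only sees the API; the factory picks
        // whichever parser implementation is configured or bundled.
        System.out.println("DOM factory: " + domImplementation());
        System.out.println("SAX factory: " + saxImplementation());
    }
}
```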

The SAX and DOM APIs differ in structure, as explained above: SAX is an event-based collection of callbacks, while DOM is an in-memory tree structure. Thus DOM uses a tree data structure while SAX does not use any, unless you build one yourself, which essentially means that a parse tree can also be constructed with the event-based API.
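
Here is a rough sketch of that last point: building a tree by hand from SAX events, using a stack of the elements that are currently open. The Node class, the class name SaxTreeSketch and the sample document are all hypothetical, and the tree keeps only element names to stay short.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTreeSketch {
    // Minimal hand-rolled tree node (not part of any library).
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<library><book><title>Dune</title></book><book/></library>";

    static Node buildTree(String xml) {
        final Deque<Node> open = new ArrayDeque<>(); // elements not yet closed
        final Node[] root = new Node[1];             // holder for the result

        DefaultHandler handler = new DefaultHandler() {
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                Node node = new Node(qName);
                if (open.isEmpty()) {
                    root[0] = node;                  // first element is the root
                } else {
                    open.peek().children.add(node);  // attach to current parent
                }
                open.push(node);
            }

            @Override public void endElement(String uri, String localName,
                                             String qName) {
                open.pop();                          // this element is complete
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        Node root = buildTree(XML);
        System.out.println(root.name + " has " + root.children.size() + " children");
    }
}
```

The design choice is the usual one for event streams: the top of the stack is always the parent of whatever element starts next, so each start event attaches a node and each end event closes one.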

Next time we'll discuss a comparison of DOM and SAX.

1 comment:

dontcare said...

you might also want to look at vtd-xml, the latest and most advanced XML processing API available today