Atif's Blog: December 2008

Monday, December 22, 2008

When to use SAX and DOM?

This question has no single, fixed answer because it really depends on the situation. Different parameters (memory, speed, etc.) have to be taken into consideration. Let us go through some points that help in deciding which API to use in which situation.

DOM is memory intensive while SAX is not: DOM creates a Document object in memory using a tree data structure, while SAX creates no such object by default. With SAX the application keeps only the custom objects it chooses to build, which makes SAX less memory intensive.
DOM is generally slower than SAX for the same reason: the whole tree has to be built before the application can use it.
DOM provides random access to the contents of the XML document because it holds them in memory, while SAX provides sequential access because it sees only one part of the document at a time.
DOM has an edge over SAX in that it can modify the document, which SAX cannot do. This implies that if the client application cares about where elements sit within the document (perhaps in order to modify it), DOM is the option; if it only cares about individual elements as they go by, SAX is the right option.
From a parser’s point of view, SAX asks for little: the parser only has to call a few handler interfaces as it reads, while a DOM parser must also build the whole tree. Most parsers support both models.
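
To make the points above a bit more concrete, here is a minimal sketch using the JAXP classes that ship with the JDK. The class name, the method names and the sample document are all made up for this illustration; treat it as one possible way to show the trade-off, not as the only way to do it. The SAX method counts elements while streaming and keeps no tree, while the DOM method loads the whole tree so that it can change a node and write the document back out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDomSketch {
    // Hypothetical sample document, made up for this illustration.
    static final String XML =
        "<library><book><title>Old</title></book>"
      + "<book><title>Other</title></book></library>";

    // SAX: count elements with the given name while streaming; no tree is kept.
    static int countWithSax(String xml, final String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return count[0];
    }

    // DOM: load the whole tree, change the text of the first <title>,
    // then serialize the modified tree back to a string.
    static String retitleWithDom(String xml, String newTitle) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Node firstTitle = doc.getElementsByTagName("title").item(0);
            firstTitle.setTextContent(newTitle);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countWithSax(XML, "book"));
        System.out.println(retitleWithDom(XML, "New"));
    }
}
```

The SAX half never holds more than the current event, so it suits large documents that only need to be read; the DOM half needs the full tree precisely because it wants to change one node and keep everything else.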

Based on these points, you should have some idea of which model suits your needs best!

Tuesday, December 16, 2008

XML Parsing

With respect to XML document parsing, particularly with Java, things have been quite puzzling, at least for me! Terms like DOM, SAX, JDOM, dom4j, etc. made things quite confusing! I am going to put up my findings on the issue, gathered from different sources, to share with others; I hope it helps. Please correct me if I am wrong on something or if something is missing! Let us talk a little about DOM and SAX first, then JDOM and dom4j.
Generally an XML document can be parsed in one of two ways: either via the Document Object Model (DOM), which has been specified by the W3C, or via the Simple API for XML (SAX).
Looking at an XML document (at least a well-formed one), it is easy to see that it has a tree-like structure: a root node (the root element), child nodes, sub-child nodes and so on. So we can infer that a tree data structure will help us access the contents of the document. This is the approach the DOM APIs are based on. On the other hand, the contents of an XML document can also be accessed by reading the document sequentially, very much like humans do while reading a book; that is where the SAX APIs originated!
Document Object Model, or DOM, is a platform- and language-independent API from W3C for accessing and modifying XML documents. A DOM-based parser reads the whole XML document in one go and keeps it in memory in a tree-like structure, where each node of the tree stands for a part of the document such as an element, an attribute or a piece of text. The document is read from start to end, and the tree is built up as elements are encountered. Once the tree is in memory, the API provides methods to access the contents, and DOM allows navigation along the tree in any direction, giving random access to any node.
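
As a rough illustration of the DOM description above, the following sketch parses a small document into a tree with the JDK's `DocumentBuilder` and then navigates it in both directions. The class name, the helper methods and the sample XML are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSketch {
    // Hypothetical sample document, made up for this illustration.
    static final String XML =
        "<library><book id=\"1\"><title>First</title></book>"
      + "<book id=\"2\"><title>Second</title></book></library>";

    // Reads the whole document in one go and returns the in-memory tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    // Random access: jump straight to the title at the given position.
    static String titleAt(Document doc, int index) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.item(index).getTextContent();
    }

    // Navigation in the other direction: from a <title> up to its parent <book>.
    static String bookIdOfTitle(Document doc, int index) {
        Element title = (Element) doc.getElementsByTagName("title").item(index);
        Element book = (Element) title.getParentNode();
        return book.getAttribute("id");
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(titleAt(doc, 1));       // prints "Second"
        System.out.println(bookIdOfTitle(doc, 1)); // prints "2"
    }
}
```

Note that the second title is reached directly, without touching the first one, and that the code can walk from child to parent; both are only possible because the whole tree is already in memory.
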
Simple API for XML, or SAX, is a collection of callbacks for sequential parsing of an XML document. SAX-based parsers are event-driven: they parse the XML document sequentially without storing its structure in memory, which makes SAX memory efficient compared to DOM. The events issued by the parser include the start and end of the document, the start and end of each element, character data and processing instructions; comments are reported only through the optional LexicalHandler extension. Being sequential, SAX-based parsers provide access to only one bit of the document at a time.
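
The event-driven style described above can be sketched as follows, assuming the JDK's `SAXParser` and a `DefaultHandler` subclass. The class name, the log format and the sample input are hypothetical. The handler simply records each callback in the order the parser issues it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Parses the XML sequentially and returns the events in the order received.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Text may arrive in several chunks, so it is buffered per element.
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startDocument() {
                log.add("startDocument");
            }

            @Override
            public void endDocument() {
                log.add("endDocument");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                text.setLength(0);
                log.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String content = text.toString().trim();
                if (!content.isEmpty()) {
                    log.add("text:" + content);
                }
                text.setLength(0);
                log.add("end:" + qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        // Hypothetical input, made up for this illustration.
        for (String event : events("<a><b>hi</b><c/></a>")) {
            System.out.println(event);
        }
    }
}
```

Nothing here remembers the document once an event has been handled; whatever the application wants to keep, it has to store itself, as the log list does in this sketch.
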

Remember: SAX and DOM are just APIs for parsing XML documents (so are JDOM and dom4j, as we’ll discuss later); they are not parsers! You need DOM, SAX or some other parsing API to get at the contents of an XML document, and a parser can be based on any of these APIs.

SAX and DOM have evolved as specifications that describe how XML parsers pass the contents of an XML document to client applications, through interfaces that the parsers implement. SAX was first developed for Java, while DOM was language-independent from the start; both are now available for the major programming languages. There are a number of parsers available: Apache Xerces, Sun Crimson, Oracle’s XML Parser, MSXML, etc. Crimson was the parser bundled with JDK 1.4, and a Xerces-based parser has been bundled since JDK 1.5.

The SAX and DOM APIs differ in structure, as explained above: SAX is an event-based collection of callbacks, while DOM is an in-memory tree structure. Thus DOM uses a tree data structure while SAX uses none unless the application builds one itself, which essentially means that a parse tree can also be constructed with the event-based API.
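
The last point, that a tree can be built by hand on top of the event-based API, might look roughly like this. The `Node` class and everything else named here are hypothetical and deliberately tiny (element names only, no attributes or text); real tree models such as DOM carry far more detail. A stack tracks the currently open elements so that each new element can be attached to its parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTreeSketch {
    // Hypothetical minimal tree node: just an element name and its children.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // Builds a tree of element names from SAX events and returns its root.
    static Node build(String xml) {
        final Deque<Node> open = new ArrayDeque<>(); // elements not yet closed
        final Node[] root = new Node[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                Node node = new Node(qName);
                if (open.isEmpty()) {
                    root[0] = node;                  // first element is the root
                } else {
                    open.peek().children.add(node);  // attach to current parent
                }
                open.push(node);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();                          // element is complete
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        // Hypothetical input, made up for this illustration.
        Node root = build("<a><b><d/></b><c/></a>");
        System.out.println(root.name + " has " + root.children.size() + " children");
    }
}
```

The advantage of doing this manually is that the application decides what goes into the tree, so it can keep only the parts it cares about instead of the whole document.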

Next time we'll discuss a comparison of DOM and SAX.