Suppose there is a very big XML document and a DOM parser is used to parse it.
Now there is a requirement to add/delete elements from the XML, i.e. edit the XML.
How can the XML be edited, given that the entire document cannot be loaded into memory?
What could be the strategy to solve this?
You may consider using a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.
As two other answers mentioned already, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.
Traditionally, XML APIs are either:
DOM based - the entire document is read into memory as a tree structure for random access by the calling application
event based - the application registers to receive events as entities are encountered within the source document.
Both have advantages; the former (for example, DOM) allows for random access to the document, the latter (e.g. SAX) requires a small memory footprint and is typically much faster.
These two access metaphors can be thought of as polar opposites. A tree based API allows unlimited, random access and manipulation, while an event based API is a 'one shot' pass through the source document.
StAX was designed as a median between these two opposites. In the StAX metaphor, the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward - 'pulling' the information from the parser as it needs. This is different from an event based API - such as SAX - which 'pushes' data to the application - requiring the application to maintain state between events as necessary to keep track of location within the document.
StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:
http://xerces.apache.org/xerces-j/faq-write.html#faq-4
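For illustration, here is a minimal sketch of the StAX cursor style described above, assuming a file named large.xml and an element named "record" (both placeholder names). The application pulls events forward and never holds more than the current cursor position in memory:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCursorDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("large.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Pull the next event; only the cursor position is kept in memory.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // getElementText() reads the text content and advances to the end tag.
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }
}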
Your assumption that memory constraints prevent loading the XML document may only apply to DOM. VTD-XML loads the entire XML into memory and does it efficiently (about 1.3x the size of the XML document), both in memory usage and performance:
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Another distinct benefit, which no other XML framework in existence has, is its incremental update capability:
http://www.devx.com/xml/Article/36379
As stivlo mentioned, you can use a SAX parser for reading the XML.
But for writing the XML, you can write to a file output stream as plain text. I am sure you will get a requirement that specifies after which tag, or under which tag, the new data should be inserted.
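One way to sketch that read-and-write-as-you-go idea is with the StAX event API rather than raw SAX plus plain-text output: copy every event straight through to the output and inject the new element right after the chosen tag. The file names and the "items"/"item" element names below are only placeholders for whatever your requirement specifies:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class StreamingInsert {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("big.xml"));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("big-edited.xml"));
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            writer.add(event);                      // copy everything through unchanged
            // Right after the opening <items> tag, inject a new child element.
            if (event.isStartElement()
                    && "items".equals(event.asStartElement().getName().getLocalPart())) {
                writer.add(events.createStartElement("", "", "item"));
                writer.add(events.createCharacters("new entry"));
                writer.add(events.createEndElement("", "", "item"));
            }
        }
        writer.close();
        reader.close();
    }
}

Because both the reader and the writer stream, only the current event is held in memory, so this scales to documents that would never fit in a DOM tree.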
Reading the docs, this is the method used in all the examples I've seen:
(Version of org.jdom.input.SAXBuilder is jdom-1.1.jar)
Document doc = new SAXBuilder().build(is);
Element root = doc.getRootElement();
Element child = root.getChild("someChildElement");
...
where is refers to an InputStream variable.
I'm wondering: since this is a SAX builder (as opposed to a DOM builder), does the entire input stream get read into the Document object by the build method? Or does it work off a lazy load, so that as long as I request elements with Element.getChildren() or similar functions (stemming from the root node) in a forward-only pass through the document, the builder automatically takes care of loading chunks of the stream for me?
I need to be sure I'm not loading the whole file into memory.
Thanks,
Mike
The DOM parser, like the JDOM parser, loads the whole XML resource into memory in order to give you a Document instance that lets you navigate the elements of the XML.
Some references here:
the DOM standard is a codified standard for an in-memory document model.
And here:
JDOM works on the logical in-memory XML tree,
Both DOM and JDOM use a SAX parser internally to read the XML resource, but they use it only to build the whole content into the Document instance that they return. Indeed, with DOM and JDOM, the client never needs to provide a handler to intercept the events triggered by the SAX parser.
Note that neither DOM nor JDOM is obliged to use SAX internally; they do so mainly because the SAX standard is already there, so it also makes sense to use it for reporting errors.
I need to be sure I'm not loading the whole file into memory.
You have two programming models to work with XML: streaming and the document object model (DOM).
You are looking for the first one.
So use the SAX parser, providing your own handler to handle the events generated by the parser (startDocument(), startElement(), and so on), or as an alternative look at a more user-friendly API: StAX (Streaming API for XML):
As an API in the JAXP family, StAX can be compared, among other APIs, to SAX, TrAX, and JDOM. Of the latter two, StAX is not as powerful or flexible as TrAX or JDOM, but neither does it require as much memory or processor load to be useful, and StAX can, in many cases, outperform the DOM-based APIs. The same arguments outlined above, weighing the cost/benefits of the DOM model versus the streaming model, apply here.
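As a rough sketch of that handler approach (the element counting is just an arbitrary stand-in for whatever you would actually do in the callbacks, and input.xml is a placeholder file name):

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CountingHandler extends DefaultHandler {
    private int elementCount;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elementCount++;                       // react to each start tag as it streams past
    }

    @Override
    public void endDocument() {
        System.out.println("Elements seen: " + elementCount);
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File("input.xml"), new CountingHandler());
    }
}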
It eagerly parses the whole file to build the in-memory representation (i.e. Document) of the XML file.
If you want to be absolutely certain of that, you can go through the source on GitHub, in particular the following classes: SAXBuilder, SAXHandler, and Document.
I have a complex XML file with multi-level elements. I have to parse the XML file and, based on the elements present, handle the incoming request. I can use JAXB to generate the classes and parse the XML, but walking through the multi-level elements and matching against the rules makes the program too complex and heavy (it leads to 4-5 levels of nested loops). Is there a more efficient and lighter way of achieving the same?
Depending on whether or not you need to store the read data temporarily, you can count on these parsers:
DOM (Document Object Model) parsers store the XML data into an in-memory structure.
jDOM (Java DOM) parser is a DOM-like implementation in Java, with its own API.
SAX (Simple API for XML) parsers traverse an XML stream, pushing events to user-supplied callbacks for every piece of data read.
StAX (Streaming API for XML) parsers read an XML stream on demand, letting the application pull events as it needs them.
All of them can be found in any standard runtime (JRE), except jDOM, which is a separate open-source library.
So, if you are looking for an efficient way to process XML and make decisions based upon the read data, StAX might suit your needs, because as soon as you get the data you need, you can stop reading and discard the rest of the input XML.
Update
To apply matching rules over the whole document, I recommend using XPath over DOM.
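For example, a rule can often be expressed as a single XPath expression evaluated over the DOM tree instead of hand-written nested loops. The file name and the expression below are made-up placeholders for whatever your rules actually look like:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathRuleDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("request.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        // One rule, one expression - no 4-5 levels of nested loops.
        NodeList matches = (NodeList) xpath.evaluate(
                "//order[@status='pending']/item[price > 100]",
                doc, XPathConstants.NODESET);
        System.out.println("Matching items: " + matches.getLength());
    }
}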
I read some articles about the XML parsers and came across SAX and DOM.
SAX is event-based and DOM is tree model -- I don't understand the differences between these concepts.
From what I have understood, event-based means some kind of event happens to a node. For example, when one clicks a particular node it will give all the sub-nodes, rather than loading all the nodes at the same time. But in the case of DOM parsing, it will load all the nodes and build the tree model.
Is my understanding correct?
Please correct me if I am wrong, or explain event-based and tree-model parsing in a simpler manner.
Well, you are close.
In SAX, events are triggered when the XML is being parsed. When the parser is parsing the XML, and encounters a tag starting (e.g. <something>), then it triggers the tagStarted event (actual name of event might differ). Similarly when the end of the tag is met while parsing (</something>), it triggers tagEnded. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event.
In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.
In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.
In just a few words...
SAX (Simple API for XML): Is a stream-based processor. You only have a tiny part in memory at any time and you "sniff" the XML stream by implementing callback code for events like tagStarted() etc. It uses almost no memory, but you can't do "DOM" stuff, like use xpath or traverse trees.
DOM (Document Object Model): You load the whole thing into memory - it's a massive memory hog. You can blow memory with even medium sized documents. But you can use xpath and traverse the tree etc.
Here in simpler words:
DOM
Tree model parser (Object based) (Tree of nodes).
DOM loads the file into memory and then parses the file.
Has memory constraints since it loads the whole XML file before parsing.
DOM is read and write (can insert or delete nodes).
If the XML content is small, then prefer DOM parser.
Backward and forward search is possible for searching tags and evaluating the information inside the tags, so it offers easy navigation.
Slower at run time.
SAX
Event based parser (Sequence of events).
SAX parses the file as it reads it, i.e. parses node by node.
No memory constraints as it does not store the XML content in the memory.
SAX is read-only, i.e. it can't insert or delete nodes.
Use the SAX parser when the XML content is large.
SAX reads the XML file from top to bottom and backward navigation is not possible.
Faster at run time.
You are correct in your understanding of the DOM based model. The XML file will be loaded as a whole and all its contents will be built as an in-memory representation of the tree the document represents. This can be time- and memory-consuming, depending on how large the input file is. The benefit of this approach is that you can easily query any part of the document, and freely manipulate all the nodes in the tree.
The DOM approach is typically used for small XML structures (where small depends on how much horsepower and memory your platform has) that may need to be modified and queried in different ways once they have been loaded.
SAX on the other hand is designed to handle XML input of virtually any size. Instead of the XML framework doing the hard work for you in figuring out the structure of the document and preparing potentially lots of objects for all the nodes, attributes etc., SAX completely leaves that to you.
What it basically does is read the input from the top and invoke callback methods you provide when certain "events" occur. An event might be hitting an opening tag, an attribute in the tag, finding text inside an element or coming across an end-tag.
SAX stubbornly reads the input and tells you what it sees in this fashion. It is up to you to maintain all state-information you require. Usually this means you will build up some sort of state-machine.
While this approach to XML processing is a lot more tedious, it can be very powerful, too. Imagine you want to just extract the titles of news articles from a blog feed. If you read this XML using DOM it would load all the article contents, all the images etc. that are contained in the XML into memory, even though you are not even interested in it.
With SAX you can just check if the element name is (e. g.) "title" whenever your "startTag" event method is called. If so, you know that you need to add whatever the next "elementText" event offers you. When you receive the "endTag" event call, you check again if this is the closing element of the "title". After that, you just ignore all further elements, until either the input ends, or another "startTag" with a name of "title" comes along. And so on...
You could read through megabytes and megabytes of XML this way, just extracting the tiny amount of data you need.
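A minimal sketch of such a title-extracting handler, assuming the feed wraps each article title in a <title> element (feed.xml is a placeholder file name):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TitleHandler extends DefaultHandler {
    private boolean insideTitle;
    private final StringBuilder current = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) {
            insideTitle = true;               // start collecting character data
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideTitle) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            insideTitle = false;
            System.out.println(current);      // one title extracted; everything else is ignored
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File("feed.xml"), new TitleHandler());
    }
}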
The negative side of this approach is of course, that you need to do a lot more book-keeping yourself, depending on what data you need to extract and how complicated the XML structure is. Furthermore, you naturally cannot modify the structure of the XML tree, because you never have it in hand as a whole.
So in general, SAX is suitable for combing through potentially large amounts of data you receive with a specific "query" in mind, but need not modify, while DOM is more aimed at giving you full flexibility in changing structure and contents, at the expense of higher resource demand.
You're comparing apples and pears. SAX is a parser that parses serialized DOM structures. There are many different parsers, and "event-based" refers to the parsing method.
Maybe a small recap is in order:
The document object model (DOM) is an abstract data model that describes a hierarchical, tree-based document structure; a document tree consists of nodes, namely element, attribute and text nodes (and some others). Nodes have parents, siblings and children and can be traversed, etc., all the stuff you're used to from doing JavaScript (which incidentally has nothing to do with the DOM).
A DOM structure may be serialized, i.e. written to a file, using a markup language like HTML or XML. An HTML or XML file thus contains a "written out" or "flattened out" version of an abstract document tree.
For a computer to manipulate, or even display, a DOM tree from a file, it has to deserialize, or parse, the file and reconstruct the abstract tree in memory. This is where parsing comes in.
Now we come to the nature of parsers. One way to parse would be to read in the entire document and recursively build up a tree structure in memory, and finally expose the entire result to the user. (I suppose you could call these parsers "DOM parsers".) That would be very handy for the user (I think that's what PHP's XML parser does), but it suffers from scalability problems and becomes very expensive for large documents.
On the other hand, event-based parsing, as done by SAX, looks at the file linearly and simply makes call-backs to the user whenever it encounters a structural piece of data, like "this element started", "that element ended", "some text here", etc. This has the benefit that it can go on forever without concern for the input file size, but it's a lot more low-level because it requires the user to do all the actual processing work (by providing call-backs). To return to your original question, the term "event-based" refers to those parsing events that the parser raises as it traverses the XML file.
The Wikipedia article has many details on the stages of SAX parsing.
A practical example: book.xml
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
DOM presents the XML document as a tree structure in memory.
DOM is a W3C standard.
The DOM parser works on the Document Object Model.
DOM occupies more memory; it is preferred for small XML documents.
DOM is easy to navigate either forward or backward.
SAX presents the XML document as a sequence of events, like start element: abc, end element: abc.
SAX is not a W3C standard; it was developed by a group of developers.
SAX does not keep the document in memory; it is preferred for large XML documents.
Backward navigation is not possible as it processes the document sequentially.
An event happens for each node/element and it gives all sub-nodes (Latin nodus, ‘knot’).
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
start element: bookstore
start element: book with an attribute category equal to cooking
start element: title with an attribute lang equal to en
Text node, with data equal to Everyday Italian
....
end element: title
.....
end element: book
end element: bookstore
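For contrast, a small sketch of reading the same book.xml with DOM, where the whole tree is built in memory first and can then be navigated freely in any direction:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BookDomDemo {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse("book.xml");   // the whole tree is now in memory

        Element book = (Element) doc.getDocumentElement()
                .getElementsByTagName("book").item(0);
        String category = book.getAttribute("category");
        String title = book.getElementsByTagName("title").item(0).getTextContent();

        System.out.println(category + ": " + title); // cooking: Everyday Italian
    }
}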
Both SAX and DOM are used to parse XML documents. Both have advantages and disadvantages and can be used in our programming depending on the situation.
SAX:
Parses node by node
Does not store the XML in memory
We can't insert or delete a node
Top to bottom traversing
DOM
Stores the entire XML document into memory before processing
Occupies more memory
We can insert or delete nodes
Traverse in any direction.
If we only need to find a node and do not need to insert or delete anything, we can go with SAX itself; otherwise use DOM, provided we have enough memory.
I need to parse an XML file using Java and have to create a bean out of that XML file after parsing.
I need this while using Spring JMS, in which a producer is producing an XML file. First I need to read the XML file and take action accordingly.
I read something about parsing and came up with these options:
XPath
DOM
Which will be the best option to parse the XML file?
Did you check JAXB?
There are three ways of parsing an XML file: SAX, DOM and StAX.
DOM will parse the whole file and build up a tree in memory - great for small files, but obviously if this is huge then you don't want the entire tree just sitting in memory! SAX is event based - it doesn't load anything into memory per se but just fires off a series of events as it reads through the file. StAX is a middle ground between the two: the application moves the cursor forward as it needs, grabbing the data as it goes (so no event firing or huge memory consumption).
Which one you use will really depend on your application - all have built-in support in the JDK since Java 6.
It looks like you receive a serialized object via Java messaging. First, have a look at how the object is being serialized. Usually this is done with a library (JAXB, Axis, ...), and you could use the very same library to create a deserializer.
You will need:
The XML schema (an XSD file)
The Java bean class (very helpful, it should exist)
Then the library will usually create all the helper classes and files, and you don't have to care about parsing.
If you need to create an object, just extract the needed properties and go on...
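As a small illustration of that approach, here is a JAXB sketch; the Order class and its fields are invented stand-ins for whatever bean your schema actually generates:

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "order")
class Order {
    public String id;          // hypothetical fields matching the incoming XML
    public double amount;
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<order><id>42</id><amount>19.99</amount></order>";
        Unmarshaller unmarshaller = JAXBContext.newInstance(Order.class).createUnmarshaller();
        Order order = (Order) unmarshaller.unmarshal(new StringReader(xml));
        System.out.println(order.id + " -> " + order.amount);
    }
}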
I recommend using StAX; see this tutorial for more information.
Umh.. there are several ways you can parse an XML document into memory and work with it. You mentioned DOM. DOM actually loads the whole document into memory and then allows you to move between different branches of the XML document.
Alternatively, you could use StAX. It works similarly to DOM, the main difference being that it streams the content of the XML document, thus allowing better allocation of memory. On the other hand, it does not retain the information that has already been read.
Look at http://download.oracle.com/javaee/5/tutorial/doc/bnbem.html - it gives details about both parsing methods and example code. Hope that helps.
(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure if it's possible to parse XML in the way I'm describing; by walking over as much of the document is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
StAX is the right way. I would recommend looking at Woodstox.
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event based parser like SAX, but that you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event based parser and not a tree based, so you will not use a lot of memory.
Implementations of StAX can be found from SUN (SJSXP), Codehaus or a few other providers.
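To make that concrete, here is a rough sketch of pulling from a StAX reader layered over the decrypting stream and stopping as soon as the needed element has been read; DecryptingInputStream and the "routing-key" element name are hypothetical placeholders for this example:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullUntilFound {
    public static void main(String[] args) throws Exception {
        // Decryption happens as an InputStream filter, as described in the question.
        InputStream in = new DecryptingInputStream(new FileInputStream("input.xml.enc"));
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);

        String routingKey = null;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "routing-key".equals(reader.getLocalName())) {
                routingKey = reader.getElementText();
                break;        // stop pulling; the rest of the stream is never materialized
            }
        }
        reader.close();
        in.close();
        System.out.println("Found: " + routingKey);
    }
}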
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the part the extension class needs is very far into the file, then this might become extremely memory intensive, though.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
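A minimal sketch of that mark()/reset() idea, with a made-up 64 MB read-ahead limit; if the second phase reads past that limit before reset() is called, the reset will fail:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class MarkResetDemo {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.xml"))) {
            in.mark(64 * 1024 * 1024);   // allow up to 64 MB to be buffered before reset()
            // ... hand `in` to the second-phase extension class here ...
            in.reset();                  // rewind to the marked position
            // ... hand `in` to the splitting extension class here ...
        }
    }
}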
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
    new DecryptingInputStream(),
    new MyHandler()
);
You might be interested in XOM:
XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.
XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large XML document, building only specific nodes, processing them and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
If you want to work at a higher level, look at NUX. This provides a high-level streaming XPath API that reads into memory only the amount of data needed to evaluate the XPath.