Java: processing large XML files - extracting data without coding state automata?

Java: processing large XML files - extracting data without coding state automata? - java

I am not experienced in Java XML processing. My colleague rapidly made implementation on JAXP SAX parser so large XML isn't loaded to memory and we operated on streams. That mean that we implemented callback interface with methods like:
public void startElement(..., String elementName, ...){ ... }
public void characters(char [] buf, int offset, int len) { ... }
Implementation maintain state of current position in tags hierarchy managed by stack of element names and depth.
Each startElement/endElement full of spaghetti if/case and register callbacks which called in characters method to decide a need and an algorithm how to extract and where to save new partially processed portion of data. This code poured by filtering logic. Actual logic are larger but not harder.
On each closing 2nd level tag if filters make positive decision we pass gathered data to other place, cleanup current context state and begin process another portion of data.
Our logic is primitive - if lvl2 tag is person and has subtags in that order: skills / skill / id with specified value for id - extract lvl3 email tag value + lvl4 tag value address / city.
This task isn't XPath as we extract several categories at once and if I properly understood XPath operate on DOM and isn't stream oriented.
I see possible use of XSLT (which is stream oriented language) but seems that it scope - from one XML document make another XML document. It is possible to pipe large document through XSLT processor to build easy to process XML with descriptive XSLT source code and then process resulted data with SAX parser. But this look like bad decision.
What Java technology used for extracting data from regular structured large XML stream with using descriptive instruction (better in XPath like reduced syntax that define tag ordering from root and checks for tag/attribute values) when and what to extract and that provide callback extension point to pass extracted portion of data for further processing?
My main goal make code more maintainable by expressing extraction rules in descriptive way and avoid writing toy custom finite state automata to track context where we stay in SAX parser.

SAX is old-skool and, as you point out, you end up with lots of logic inside your startElement callbacks.
StAX is the streaming parser which I would think is much more suited to your use case as it allows you to pull events from the XML stream so there's no DOM-like requirement to load the entire document and you get more support for XML semantics than the SAX approach. StAX is described here

Related

Writing an XML rule parser in java

I have a complex xml file which has multi-level elements. I have to parse the XML file and based on the elements present, I have to handle the incoming request. I can use JAXB to generate the classes and parse the xml. But to go through the multi-level elements and match against the rules makes the program too complex and heavy (leads to 4-5 levels of loops). Is there an efficient and lighter way of achieving the same?

Depending on your needs to store -or not- temporarily the read data, you can count on these parsers:
DOM (Document Object Model) parsers store the XML data into a memory structure.
jDOM (Java DOM) parser is a DOM-like implementation in Java, with its own API.
SAX (Simple API for XML) parsers traverse an XML stream completely asynchronously, throwing user events for every read data.
StAX (Streaming API for XML) parsers reads an XML stream synchronously.
All of them can be found in any standard runtime (JRE), except jDOM, which is open-source.
So, if you are looking for an efficient way to process XML and take decissions based upon the read data, maybe StAX would suit your needs, because as soon as you get the data you need, you might stop reading and discard the rest of the input XML.
Update
To apply matching rules over the whole document I recommend you to use XPath over DOM.

Read result of XSL Transformation with StAX in a streaming mode

My use case is:
XML File as input
Needs to be transformed with XSLT (using Java 8 build-in XSLT processor)
Process the Result with a XMLStreamReader (using Java 8 build-in StAX implementation)
I want to do this in a "streaming mode" (so not writing the output of the XSL Transformation to a File and than parsing it by the XMLStreamReader).
Is this possible? If so, how? I can only find SAX-based examples.

Probably not possible. Most XSLT engines write the result "tree" in push mode, so you can skip the creating of a physical tree by accepting events as they occur, but if you want to get the result in pull mode you're going to need to run the transformation in one thread and read the results in another thread, using something like a BlockingQueue to communicate the events from on thread to another. Saxon has internal mechanisms to handle push-pull conflicts using multiple threads in this way, but only at item level, not at event level, and this isn't much use to you because the whole result tree is one item. So the basic answer is, the only way I can see to do what you want is to write a multi-threaded SAX-to-StAX converter, which is a tricky bit of Java coding.

When you run a transform the XSLT processor needs to build its result within the transform call.
It is easy to see that the engine can write the result to a stream, or build up a DOM (or any other in memory structure), or call a SAXHandler.
But how should the engine directly accept and invoke a XMLStreamReader which has a different programming model than a callback oriented SAX Handler?
But you could create a DOM tree as result of the transformation. Given the DOM it is possible to create a DOMXmlStreamReader which iterates over the DOM and emits Stax tokens. But I don't know if such an implementation exists.

Java XML Parsing: SAX Vs. DOM [duplicate]

I read some articles about the XML parsers and came across SAX and DOM.
SAX is event-based and DOM is tree model -- I don't understand the differences between these concepts.
From what I have understood, event-based means some kind of event happens to the node. Like when one clicks a particular node it will give all the sub nodes rather than loading all the nodes at the same time. But in the case of DOM parsing it will load all the nodes and make the tree model.
Is my understanding correct?
Please correct me If I am wrong or explain to me event-based and tree model in a simpler manner.

Well, you are close.
In SAX, events are triggered when the XML is being parsed. When the parser is parsing the XML, and encounters a tag starting (e.g. <something>), then it triggers the tagStarted event (actual name of event might differ). Similarly when the end of the tag is met while parsing (</something>), it triggers tagEnded. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event.
In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.
In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.

In just a few words...
SAX (Simple API for XML): Is a stream-based processor. You only have a tiny part in memory at any time and you "sniff" the XML stream by implementing callback code for events like tagStarted() etc. It uses almost no memory, but you can't do "DOM" stuff, like use xpath or traverse trees.
DOM (Document Object Model): You load the whole thing into memory - it's a massive memory hog. You can blow memory with even medium sized documents. But you can use xpath and traverse the tree etc.

Here in simpler words:
DOM
Tree model parser (Object based) (Tree of nodes).
DOM loads the file into the memory and then parse- the file.
Has memory constraints since it loads the whole XML file before parsing.
DOM is read and write (can insert or delete nodes).
If the XML content is small, then prefer DOM parser.
Backward and forward search is possible for searching the tags and evaluation of the
information inside the tags. So this gives the ease of navigation.
Slower at run time.
SAX
Event based parser (Sequence of events).
SAX parses the file as it reads it, i.e. parses node by node.
No memory constraints as it does not store the XML content in the memory.
SAX is read only i.e. can’t insert or delete the node.
Use SAX parser when memory content is large.
SAX reads the XML file from top to bottom and backward navigation is not possible.
Faster at run time.

You are correct in your understanding of the DOM based model. The XML file will be loaded as a whole and all its contents will be built as an in-memory representation of the tree the document represents. This can be time- and memory-consuming, depending on how large the input file is. The benefit of this approach is that you can easily query any part of the document, and freely manipulate all the nodes in the tree.
The DOM approach is typically used for small XML structures (where small depends on how much horsepower and memory your platform has) that may need to be modified and queried in different ways once they have been loaded.
SAX on the other hand is designed to handle XML input of virtually any size. Instead of the XML framework doing the hard work for you in figuring out the structure of the document and preparing potentially lots of objects for all the nodes, attributes etc., SAX completely leaves that to you.
What it basically does is read the input from the top and invoke callback methods you provide when certain "events" occur. An event might be hitting an opening tag, an attribute in the tag, finding text inside an element or coming across an end-tag.
SAX stubbornly reads the input and tells you what it sees in this fashion. It is up to you to maintain all state-information you require. Usually this means you will build up some sort of state-machine.
While this approach to XML processing is a lot more tedious, it can be very powerful, too. Imagine you want to just extract the titles of news articles from a blog feed. If you read this XML using DOM it would load all the article contents, all the images etc. that are contained in the XML into memory, even though you are not even interested in it.
With SAX you can just check if the element name is (e. g.) "title" whenever your "startTag" event method is called. If so, you know that you needs to add whatever the next "elementText" event offers you. When you receive the "endTag" event call, you check again if this is the closing element of the "title". After that, you just ignore all further elements, until either the input ends, or another "startTag" with a name of "title" comes along. And so on...
You could read through megabytes and megabytes of XML this way, just extracting the tiny amount of data you need.
The negative side of this approach is of course, that you need to do a lot more book-keeping yourself, depending on what data you need to extract and how complicated the XML structure is. Furthermore, you naturally cannot modify the structure of the XML tree, because you never have it in hand as a whole.
So in general, SAX is suitable for combing through potentially large amounts of data you receive with a specific "query" in mind, but need not modify, while DOM is more aimed at giving you full flexibility in changing structure and contents, at the expense of higher resource demand.

You're comparing apples and pears. SAX is a parser that parses serialized DOM structures. There are many different parsers, and "event-based" refers to the parsing method.
Maybe a small recap is in order:
The document object model (DOM) is an abstract data model that describes a hierarchical, tree-based document structure; a document tree consists of nodes, namely element, attribute and text nodes (and some others). Nodes have parents, siblings and children and can be traversed, etc., all the stuff you're used to from doing JavaScript (which incidentally has nothing to do with the DOM).
A DOM structure may be serialized, i.e. written to a file, using a markup language like HTML or XML. An HTML or XML file thus contains a "written out" or "flattened out" version of an abstract document tree.
For a computer to manipulate, or even display, a DOM tree from a file, it has to deserialize, or parse, the file and reconstruct the abstract tree in memory. This is where parsing comes in.
Now we come to the nature of parsers. One way to parse would be to read in the entire document and recursively build up a tree structure in memory, and finally expose the entire result to the user. (I suppose you could call these parsers "DOM parsers".) That would be very handy for the user (I think that's what PHP's XML parser does), but it suffers from scalability problems and becomes very expensive for large documents.
On the other hand, event-based parsing, as done by SAX, looks at the file linearly and simply makes call-backs to the user whenever it encounters a structural piece of data, like "this element started", "that element ended", "some text here", etc. This has the benefit that it can go on forever without concern for the input file size, but it's a lot more low-level because it requires the user to do all the actual processing work (by providing call-backs). To return to your original question, the term "event-based" refers to those parsing events that the parser raises as it traverses the XML file.
The Wikipedia article has many details on the stages of SAX parsing.

In practical: book.xml
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</bookstore>
DOM presents the xml document as a the following tree-structure in memory.
DOM is W3C standard.
DOM parser works on Document Object Model.
DOM occupies more memory, preferred for small XML documents
DOM is Easy to navigate either forward or backward.
SAX presents the xml document as event based like start element:abc, end element:abc.
SAX is not W3C standard, it was developed by group of developers.
SAX does not use memory, preferred for large XML documents.
Backward navigation is not possible as it sequentially process the documents.
Event happens to a node/element and it gives all sub nodes(Latin nodus, ‘knot’).
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
start element: bookstore
start element: book with an attribute category equal to cooking
start element: title with an attribute lang equal to en
Text node, with data equal to Everyday Italian
....
end element: title
.....
end element: book
end element: bookstore

Both SAX and DOM are used to parse the XML document. Both has advantages and disadvantages and can be used in our programming depending on the situation
SAX:
Parses node by node
Does not store the XML in memory
We cant insert or delete a node
Top to bottom traversing
DOM
Stores the entire XML document into memory before processing
Occupies more memory
We can insert or delete nodes
Traverse in any direction.
If we need to find a node and does not need to insert or delete we can go with SAX itself otherwise DOM provided we have more memory.

Editing a BIG XML via DOM parser

If there is a very big XML and DOM parser is used to parse it.
Now there is a requirement to add/delete elements from the XML i.e edit the XML
How to edit the XML as the entire XML will not be loaded due to memory constraints ?
What could be the strategy to solve this ?

You may consider to use a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.

As two other answers mentioned already, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.
Traditionally, XML APIs are either:
DOM based - the entire document is read into memory as a tree
structure for random access by the calling application
event based - the application registers to receive events as
entities are encountered within the source document.
Both have advantages; the former (for example, DOM) allows for random
access to the document, the latter (e.g. SAX) requires a small memory
footprint and is typically much faster.
These two access metaphors can be thought of as polar opposites. A
tree based API allows unlimited, random access and manipulation, while
an event based API is a 'one shot' pass through the source document.
StAX was designed as a median between these two opposites. In the StAX
metaphor, the programmatic entry point is a cursor that represents a
point within the document. The application moves the cursor forward -
'pulling' the information from the parser as it needs. This is
different from an event based API - such as SAX - which 'pushes' data
to the application - requiring the application to maintain state
between events as necessary to keep track of location within the
document.

StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:
http://xerces.apache.org/xerces-j/faq-write.html#faq-4

Your assumption of memory constraint loading the XML document may only apply to DOM. VTD-XML loads the entire XML in memory, and does it efficiently (1.3x the size of XML document)... both in memory and performance...
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Another distinct benefit, which none other XML framework in existence has, is its incremental update capability...
http://www.devx.com/xml/Article/36379

As stivlo mentioned you can use a SAX parser for reading the XML.
But for writing the XML you can write into fileoutput stream as plain text. I am sure that you will get requirement that mentions after which tag or under which tag the new data should be inserted.

Parsing very large XML documents (and a bit more) in java

(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, ie: I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure if it's possible to parse XML in the way I'm describing; by walking over as much of the document is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.

Stax is the right way. I would recommend looking at Woodstox

This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event based parser like SAX, but that you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event based parser and not a tree based, so you will not use a lot of memory.
Implementations of StAX can be found from SUN (SJSXP), Codehaus or a few other providers.

You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the parts the extension class needs is very far into the file, then this might become extremely memory intensive, 'though.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.

I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
new DecryptingInputStream(),
new MyHandler()
);

You might be interested by XOM:
XOM is fairly unique in that it is a
dual streaming/tree-based API.
Individual nodes in the tree can be
processed while the document is still
being built. The enables XOM programs
to operate almost as fast as the
underlying parser can supply data. You
don't need to wait for the document to
be completely parsed before you can
start working with it.
XOM is very memory efficient. If you
read an entire document into memory,
XOM uses as little memory as possible.
More importantly, XOM allows you to
filter documents as they're built so
you don't have to build the parts of
the tree you aren't interested in. For
instance, you can skip building text
nodes that only represent boundary
white space, if such white space is
not significant in your application.
You can even process a document piece
by piece and throw away each piece
when you're done with it. XOM has been
used to process documents that are
gigabytes in size.

Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large xml document only building specific nodes, processing them and discarding them. It is very similar to a sax approach, but has a lot more parsing capability built in so a streaming parse can be achieved pretty easily.
If you want to work at higher level look at NUX. This provides a high level streaming xpath API that only reads the amount of data into memory needed to evaluate the xpath.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.