Parsing a xml file using Java - java

I need to parse a xml file using JAVA and have to create a bean out of that xml file after parsing .
I need this while using Spring JMS in which producer is producing a xml file .First I need to read the xml file and take action according .
I read some thing about parsing and come with these option
xpath
DOM
Which ll be the best option to parse the xml file.

did you check JAXB

There's three ways of parsing an XML file, SAX, DOM and StAX.
DOM will parse the whole file and build up a tree in memory - great for small files but obviously if this is huge then you don't want the entire tree just sitting in memory! SAX is event based - it doesn't load anything into memory per-se but just fires off a series of events as it reads through the file. StAX is a median between the two, the application moves the cursor forward as it needs, grabbing the data as it goes (so no event firing or huge memory consumption.)
What one you use will really depend on your application - all have built in libraries since Java 6.

Looks like, you receive a serialized object via Java messaging. Have a look first, how the object is being serialized. Usually this is done with a library (jaxb, axis, ...) and you could use the very same library to create a deserializer.
You will need:
The xml schema (a xsd file)
The Java bean class (very helpful, it should exist)
Then, usually the library will create all helper classes and files and you don't have to care about parsing.

if you need to create an object, just extract the needed properties and go on...
I recommend using StaX, see this tutorial for more information.

Umh..there are several ways you can parse an xml document to into memory and work with it. You mentioned DOM. DOM actually holds uploads the whole document into memory and then allows you to move between different branches of the XML document.
On the other hand, you could use StAX. It works similar to DOM. The only difference is that, it streams the content of the XML document thus allowing better allocation of memory. On the other hand, it does not retain the information that has already been read.
Look at : http://download.oracle.com/javaee/5/tutorial/doc/bnbem.html It gives details about both parsing methods and example code. Hope that helps.

Related

Parse a large XML file and get the duplicate attributes

I have a large XML file. It is structured like below:
...
<LexicalEntry id="tajaAhul_$axoS_1">
<Lemma partOfSpeech="n" writtenForm="تجاهُل شخْص"/>
<Sense id="tajaAhul_$axoS_1_<homaAl_$axoS_n1AR" synset="<homaAl_$axoS_n1AR"/>
<WordForm formType="root" writtenForm="جهل"/>
</LexicalEntry>
...
The file has been created automatically, so it may contain a duplicate writtenForm. I want to parse it with JAVA to check if there is really a duplicate writtenForm and if so I want to get them. With JAVA, the more I read about parsing XML files the more I get confused! I found that if the file is a large one, I should use SAX Parser but I am not familiar with all his functions and methods and I also found that with SAX Parser, I should create all the work in some handler class.
Since you mentioned your XML is large, the best option to parse is the SAX parser as you have already found out. It's not as scary as you assume. It reads through your XML content and calls your "Handler" to handle what it "sees" in the XML. Your handler class will be the one that will 'capture' and structure the XML content. Because it reads 'through' your XML it doesn't consume memory to store the content of XML. There are many examples out there on SAX parsing but this could be a starter example. Good luck!

Use xpath instead of XSD object generation for accessing XML details?

There is an XML file hosted on a server that I want to parse. Normally I generate an XSD from the XML and then generate the java pojo's from this XSD. Using jackson I then parse the XML to a java object representation. Is it not more straightforward to just use xpath ? This means I do not need to generate a object hierarchy based on the XML and also I do not need to regenerate the object hierarchy if the XML changes. xpath seems much more concise and intuitive ?
Why should I use XSD , object generation instead of xpath ?
According to the XML Schema specification XSD is used for defining the structure, content and semantics of XML documents. This means that you can use XSD to validate your XML file.
Depending on your circumstances you might be able to do without generating the whole object tree if all you need is to get some values from the XML file. In this case XPath is the way to go. However, you still might want to have an XSD file in order to validate the XML file before parsing it. This way you make your software fail fast, when the structure of your XML file changes, which will suggest that you change your XPath expressions. But for this to work, you shouldn't use the XSD you generate from your XML file, instead you should have a separate pre-generated XSD file which complies with the XPath expressions.
I think both approaches are valid, depending on the circumstances.
At the end of the day, you want to extract the values from that remote xml file and do something with them.
First criteria to consider is the size of that file, and the number of data elements.
If it's just a few, then xpath extraction should be straightforward. However, if that xml file represent a sizable and/or complex data structure, then you probably want the de-serialization to a Java data structure that you can then utilize, and JAXB would be a good candidate.
JAXB is going to be easier/better if the remote server adheres or publishes an XML Schema. If it doesn't, and changes often and significantly, you're going to suffer either way, but particularly so with JAXB. There are ways to smooth things over by pre-processing that xml with XSLT to force it into a more reliable form, but that is going to be a partial solution most likely.

Efficient way to read a small part of a BIG XML file in Java

We have a new requirement:
There are some BIG xml files keep coming into our system and we will need to process them immediately and quickly using Java. The file is huge but the required information for our processing is inside a element which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing. If we try to load the entire file, we will get out of memory error immediately due to size. What is the efficient way in Java that I can use to get the ..data..data..data.. data element without loading or reading the file line by line. Is there any SAX Parser that I can use to get this done?
Thank you
The SAX parsers are event based and are much faster because they do what you need: they don't read the xml document entirely. There is a SAXParser available in the Java distributions.
I had to parse huge files in a previous project (1G-2G) and didn't want to deal with using SAX. I find SAX too low-level in some instances and like keepings a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read a part of a file, you will need to read each line of the file to be able to identify the part of the file of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
StAX is another option based on steaming the data, like SAX, but benefits from a more friendly approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.

Parsing very large XML files and marshalling to Java Objects

I have the following issue: I have very large XML files (like 300+ Megs), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use Stax Parser as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, and thus not loading the whole thing in memory, but on the other hand getting the values with Stax (at least on these XML files) is cumbersome, I need to write a ton of code. From this latter point of view it will immensly help me if I could marshall the XML file to Java objects (like JAX-B does) however this would load the whole file plus a ton of Object instances in memory all at once.
My question is, is there some way to pull-parse (or just partially parse) the file sequentially, and then marshall only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem, if you give it the file name it would parse the whole thing. Although there are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, If your XML format is well defined, then one workaround is to fool the EMF by breaking down the whole file into several smaller (but still well defined) XML snippets. Then feed each snippet one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. Which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML each starting with <tag2> and ending with </tag2>. And in java most parsers would accept a Stream, so just parse using whatever you want, create some StringStream or something for each <tag2> in a loop and pass to JAX-B or EMF.
HTH
Well, first off I wanna thank the two persons answering my questions, but I finally ended up not using those propositions partly because those proposed technologies are a bit far from the Java let's say "standard XML parsing" and it feels weird going so far when there's a similar tool already present in Java and partly also because in fact I did found a solution that only uses Java API's to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is the fact that if you have an XML document AND it's corresponding XSD schema, you can parse & marshall it with JAXB, and you can do it in chunks, and said chunks can be read with an even parser such as STAX and then passed to the JAXB Marshaller.
This practically means that you must first decide where's a good place in your XML file where you can say "this part here has A LOT of repetive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your STAX parser that is triggered at the start of each of those child tags, than stream over to JAXB the content of that child tag, marshall it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml

Editing a BIG XML via DOM parser

If there is a very big XML and DOM parser is used to parse it.
Now there is a requirement to add/delete elements from the XML i.e edit the XML
How to edit the XML as the entire XML will not be loaded due to memory constraints ?
What could be the strategy to solve this ?
You may consider to use a SAX parser instead, which doesn't keep the whole document in memory. It will be faster and will also use much less memory.
As two other answers mentioned already, a SAX parser will do the trick. Your other alternative to DOM is a StAX parser.
Traditionally, XML APIs are either:
DOM based - the entire document is read into memory as a tree
structure for random access by the calling application
event based - the application registers to receive events as
entities are encountered within the source document.
Both have advantages; the former (for example, DOM) allows for random
access to the document, the latter (e.g. SAX) requires a small memory
footprint and is typically much faster.
These two access metaphors can be thought of as polar opposites. A
tree based API allows unlimited, random access and manipulation, while
an event based API is a 'one shot' pass through the source document.
StAX was designed as a median between these two opposites. In the StAX
metaphor, the programmatic entry point is a cursor that represents a
point within the document. The application moves the cursor forward -
'pulling' the information from the parser as it needs. This is
different from an event based API - such as SAX - which 'pushes' data
to the application - requiring the application to maintain state
between events as necessary to keep track of location within the
document.
StAX is my preferred approach for handling large documents. If DOM is a requirement, check out DOM implementations like Xerces that support lazy construction of DOM nodes:
http://xerces.apache.org/xerces-j/faq-write.html#faq-4
Your assumption of memory constraint loading the XML document may only apply to DOM. VTD-XML loads the entire XML in memory, and does it efficiently (1.3x the size of XML document)... both in memory and performance...
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Another distinct benefit, which none other XML framework in existence has, is its incremental update capability...
http://www.devx.com/xml/Article/36379
As stivlo mentioned you can use a SAX parser for reading the XML.
But for writing the XML you can write into fileoutput stream as plain text. I am sure that you will get requirement that mentions after which tag or under which tag the new data should be inserted.

Categories