Best way to read XML in Java

Best way to read XML in Java - java

From some of our other application i am getting XML file.
I want to read that XML file node by node and store node values in database for further use.
So, what is the best way/API to read XML file and retrieve node values using Java?

There are various tools for that. Today, I prefer two:
Simple XML
JAXB
StAX
Here is a good comparison between the Simple and JAXB: http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html
Personally, I like Simple a bit better because support by Niall is excellent but JAXB (as explained in the blog post above) can produce better output with less code.
StAX is a more basic API which allows you to read XML documents that simply don't fit into RAM (neither Simple nor JAXB allow you to read an XML document "object by object" - they will always try to load everything into RAM at once).

I would advice for a simple XML tool if you can manage by that.
For example I and my colleges have introduces complex XML frameworks that worked like a charm at first.
Then you forget about the framework, you have special build files just for mapping XML to beans, you have annotated beans, you provide a new barrier for new developers to your project. You loose much of your freedom to refactor.
At the end you will be sorry that you used the complex framework to save some time in the beginning and I have seen more than one time that the frameworks have been thrown out in refactoring because everybody had a negative feeling about it although they are great at paper.
So think twice about introducing complex XML frameworks if you seldom use them. If you and your team use them rather frequently then they are the way to go.

I suggest using XPath. Xalan is already included in the JDK (no external jars needed) and it fits your requirement, i.e. iterating through element nodes (i presume) and storing their text values. For example:
String xml = "<root> <item>One</item> <item>Two</item> <item>Three</item> </root>";
XPathFactory xpf = XPathFactory.newInstance();
InputSource is = new InputSource(new StringReader(xml));
NodeList nodes = (NodeList) xpf.newXPath().evaluate("/*/*", is,
XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getNodeName() + " -> " + e.getTextContent());
}
}
This example returns a list of all non-root elements and print out the corresponding element name and text content. Adapt the xpath expression to fit your needs.

dom4j and jdom are pretty easy to use (ignoring the requirement "best" for a moment ;) )

Try Apache Xerces. It is mature and robust. Any such available alternatives will do also, just be sure not to roll out your own implementation.

Bypassing alltogether the question of parsing the xml and storing the values in a database, I'd like to question the need to do the above. Most databases can handle xml nowadays, so it can be stored in some way into a table without the need of parsing the content; and the content of such an xml within a column in a table can typically be queried by use of 'xmlselect()' and similar functions.
Think about this for a second; if in the near or distant future the content of the xml that you get from the other application changes, you'll have plenty of changes to do. If it changes often, it'll become a nightmare.
Cheers,
Wim

Try XStream, this one's really simple.

well，i used stax to parse quite a huge of XML nodes, which consumes less memory than Dom and sax, cauz it is of style of pulling XML data. Stax might be a good choice for large XML data nodes.

Related

What's the right way to produce a XML content in Java?

I've read several questions and tutorials over the internet such as
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java I'm still a little bit confused. The XML content is written with String concatenation and that seems to me wrong. It is the same to concatenate Strings to produce a JSON object instead of using a JSONObject class. That's the way the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research I think that almost everything I've read pointed to SAX as the best solution but I never really found something really useful to read about it, almost everything states that we have to create a handler and override startElement, endElement and characters methods.
I don't have to serialize the XML file in hard disk or database or anything else, I just need to pass its content to a webservice.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change node texts? My XML without the files would be like that:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template Engine since I don't actually have to serialize the XML and that's a approach that I really like because I've already worked with this framework and it's a really simple framework.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.

It all depends on context. There is no single "right way".
The biggest issues with concatenation is the combination of escaping the XML in to a String constant (which is fiddly), but also escaping the values that you can using so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of the other several libraries may be quite effective. It helps keep the template "native" (you don't have to wrap it in "'s and escape it), plus it keeps the XML out of your code, but easily lets you stamp in the parts that you need to do.

I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html

The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
The major issue is memory.
Abundant MemoryIf you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse a document forwards, backwards, and jump around arbitrarily. This approach is know as Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce MemoryIf you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. Much simpler with a quicker learning-curve than the other tools.

How to efficiently make a large XML file searchable in a web application?

I have an XML document and I need to make it searchable via a webapp. The document is currently only 6mb.. but could be extrememly large, thus from my research SAX seems the way to go.
So my question is, given a search term do I:
Do I load the document in memory once (into a list of beans and then
store it in memory)? And then search it when need be?
or
Parse the document looking for the desired search term and only add
the matches to the list of beans? And repeat this process with each
search?
I am not that experienced with webapps, but I am trying to figure out the optimal way to approach this, does anyone with Tomcat, SAX and Java Web apps have any suggestions as to which would be optimum?
Regards,
Nate

When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed accesses, without a full read at each time. IMHO, the only way to achieve that is to parse the file and load the data in a lightweight file database (Derby, HSQL or H2) and add relevant indexes to the database. Databases do allow indexed search on off memory data, XML files do not.

Assuming your search field is a field that is known to you, for example let the structure of the xml be:
<a>....</a>
<x>
<y>search text1</y>
<z>search text2</z>
</x>
<b>...</b>
and say the search has to be made on the 'x' and its children, you can achieve this using STAX parser and JAXB.
To understand the difference between STAX and SAX, please refer:
When should I choose SAX over StAX?
Using these APIs you will avoid storing the entire document in the memory. Using STAX parser, you parse the document, when you encounter the 'x' tag load it into memory(java beans) using JAXB.
Note: Only x and its children will be loaded to memory, not the entire document parsed till now.
Do not use any approaches that use DOM parsers.
Sample code to load only the part of the document where the search field is present.
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
xsr.nextTag();
while(!xsr.getLocalName().equals("x")) {
xsr.nextTag();
}
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<Customer> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();
X x = jb.getValue();
System.out.println(x.y.content);
Now you have the field content to return the appropriate field. When the user again searches for the same field under 'x', give the results from the memory and avoid parsing the XML again.

Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.
Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.

Android XML parser for simple xml node strings

I need to parse a series of simple XML nodes (String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seams slightly overkill to do it in this way
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.

I'd use the DOM Parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and moving.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html

You might want to take a look at the XPath library. This is a more simple way of parsing xml. It's similar to building SQL queries and regex's.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html

I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse an entire document into objects, instead the parser performs a single uni-directional pass over the document and triggers events as it goes through.
It's actually surprisingly easy to implement: for instance take a look at Working with XML on Android by IBM. It's only listings 5 and 6 that are the actual implementation of their SAX parser so it's not a lot of code.

You can try to use Konsume-XML: SAX/STAX/Pull APIs are too low-level and hard to use; DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull and therefore it's extremely efficient, yet the API is higher-level and much easier to use.

How to deserialize Java objects from XML?

I'm sure this might have been discussed at length or answered before, however I need a bit more information on the best approach for my situation...
Problem:
We have some large XML data (anywhere from 100k to 5MB) which we need to inflate into Java objects. The issue is that the data doesn't really doesn't map onto an object very well at all, so we need to only pull certain parts of the data out and create the objects. Given that, solutions such as JAXB or XStream really aren't appropriate.
So, we need to pull XML data out and get it into java objects as efficiently as possible.
Possible Solutions:
The way I see it, we have 3 possible solutions:
SAX parsing
DOM parsing
XSLT
We can load the XML into any JAXP implementation and pull the data out using one of the above methods.
Question(s)
I have a few questions/concerns:
How does XSLT work under the hood? Is it just a DOM parser? I ask because XSLT seems like a good way to go, but I don't really want to consider it if it won't give us better performance than DOM.
What are some popular libraries that provide DOM, XSLT, and SAX XML parsers?
In your experience, what are the reasons for picking DOM, SAX, or XSLT? Does the ease of use of DOM or XSLT totally dominate the performance improvements SAX offers?
Any benchmarks out there? The ones I've found are old (as in, 8 years old). So some recent benchmarks would be appreciated.
Are there any other solutions besides those outlined above that I could be missing?
Edit:
A few clarifications... You can use XSLT to directly inject values into a Java object... it is normally used to transform XML into some other XML, however I'm talking from the standpoint of calling a method from XSLT into java to inject the value.
I'm still not clear on how an XSLT processor works exactly... How is it feeding the XML into the XSLT code you write?

Use XSLT to transform the large XML files into a local domain model that is mapped to java objets with JAXB.
Start with the JDK 5+ built in XML libraries (unless you absolutely need XSLT 2.0, in which case use Saxon)
Don't focus on relative performance of SAX/DOM, focus on learning how to write XPath expressions and use XSLT, and then worry about performance later if and only if you find it to be a problem.
The Eclipse XML editors are decent, but if you can afford it, spring for Oxygen XML, which will let you do XPath evaluation in realtime.

We had a similar situation and I just threw together some XPath code that parsed the stuff I needed.
It was amazingly quick even on 100k+ XML files. We went as low tech as possible. We handle around 1000 files a day of that size and parsing time is very low. We have no memory issues, leaks etc.
We wrote a quick prototype in Groovy (if my memory is accurate) - proof of concept took me about 10 minutes

JAXB, the Java API for XML Binding might be what you want. You use it to inflate an XML document into a Java object graph made up of "Java content objects". These content objects are instances of classes generated by JAXB to match the XML document's schema
But if you already have a set of Java classes, or don't yet have a schema for the document, JAXB probably isn't the best way to go. I'd suggest doing a SAX parse and then building up your Java objects during the parse. Alternatively you could try a DOM parse and then walk the resulting Document tree to pull out the parts of interest (maybe with XPath) -- but 5MB of XML might turn into 50MB of DOM tree objects in Java.

DOM, SAX and XSLT are different animals.
DOM parsing loads the entire document into memory, which for 100K to 5MB (very small by today's standards) would work.
SAX is a stream parser which reads the XML and delivers events to your code for each tag.
XSLT is a system for transforming one XML tree into another. Even if you wrote a transform that converts the input to a more suitable format, you'd still have to write something using DOM or SAX to convert it into Java objects.

You can use the #XmlPath extension in EclipseLink JAXB (MOXy) to easily handle this use case. For a detailed example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Sample Code:
package blog.geocode;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;
#XmlRootElement(name="kml")
#XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {
#XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
private String street;
#XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
private String city;
#XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
private String state;
#XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
private String country;
#XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
private String postalCode;
}

What's the best way to retrieve two pieces of data from an XML file?

I've got an XML document that is in either a pre or post FO transformed state that I need to extract some information from. In the pre-case, I need to pull out two tags that represent the pageWidth and pageHeight and in the post case I need to extract the page-height and page-width parameters from a specific tag (I forget which one it is off the top of my head).
What I'm looking for is an efficient/easily maintainable way to grab these two elements. I'd like to only read the document a single time fetching the two things I need.
I initially started writing something that would use BufferedReader + FileReader, but then I'm doing string searching and it gets messy when the tags span multiple lines. I then looked at the DOMParser, which seems like it would be ideal, but I don't want to have to read the entire file into memory if I could help it as the files could potentially be large and the tags I'm looking for will nearly always be close to the top of the file. I then looked into SAXParser, but that seems like a big pile of complicated overkill for what I'm trying to accomplish.
Anybody have any advice? Or simple implementations that would accomplish my goal? Thanks.
Edit: I forgot to mention that due to various limitations I have, whatever I use has to be "builtin" to core Java, in which I can't use and/or download any 3rd party XML tools.

While XPath is very good for querying XML data, I am not aware of good and fast XPath implementation for Java (they all use DOM model at least).
I would recommend you to stick with StAX. It is extremely fast even for huge files, and it's cursor API is rather trivial:
XMLInputFactory f = XMLInputFactory.newInstance();
XMLStreamReader r = f.createXMLStreamReader("my.xml");
try {
while (r.hasNext()) {
r.next();
. . .
}
} finally {
r.close()
}
Consult StAX tutorial and XMLStreamReader javadocs for more information.

You can use XPath to search for your tags. Here is a tutorial on forming XPath expressions. And here is an article on using XPath with Java.
An easy to use parser (dom, sax) is dom4j. It would be quite easier to use than the built-in SAXParser.

try "XMLDog"
This uses sax to evaluate xpaths

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.