I'm sure this might have been discussed at length or answered before, however I need a bit more information on the best approach for my situation...
Problem:
We have some large XML data (anywhere from 100k to 5MB) which we need to inflate into Java objects. The issue is that the data doesn't really map onto an object very well at all, so we need to pull only certain parts of the data out and create the objects. Given that, solutions such as JAXB or XStream really aren't appropriate.
So, we need to pull XML data out and get it into Java objects as efficiently as possible.
Possible Solutions:
The way I see it, we have 3 possible solutions:
SAX parsing
DOM parsing
XSLT
We can load the XML into any JAXP implementation and pull the data out using one of the above methods.
Question(s)
I have a few questions/concerns:
How does XSLT work under the hood? Is it just a DOM parser? I ask because XSLT seems like a good way to go, but I don't really want to consider it if it won't give us better performance than DOM.
What are some popular libraries that provide DOM, XSLT, and SAX XML parsers?
In your experience, what are the reasons for picking DOM, SAX, or XSLT? Does the ease of use of DOM or XSLT totally dominate the performance improvements SAX offers?
Any benchmarks out there? The ones I've found are old (as in, 8 years old). So some recent benchmarks would be appreciated.
Are there any other solutions besides those outlined above that I could be missing?
Edit:
A few clarifications... You can use XSLT to directly inject values into a Java object. It is normally used to transform XML into some other XML; however, I'm talking about calling a Java method from XSLT to inject the value.
I'm still not clear on exactly how an XSLT processor works... How does it feed the XML into the XSLT code you write?
Use XSLT to transform the large XML files into a local domain model that is mapped to Java objects with JAXB.
Start with the JDK 5+ built-in XML libraries (unless you absolutely need XSLT 2.0, in which case use Saxon).
Don't focus on the relative performance of SAX/DOM; focus on learning how to write XPath expressions and use XSLT, and then worry about performance later if and only if you find it to be a problem.
The Eclipse XML editors are decent, but if you can afford it, spring for Oxygen XML, which will let you do XPath evaluation in real time.
We had a similar situation and I just threw together some XPath code that parsed the stuff I needed.
It was amazingly quick even on 100k+ XML files. We went as low tech as possible. We handle around 1000 files a day of that size and parsing time is very low. We have no memory issues, leaks etc.
We wrote a quick prototype in Groovy (if my memory is accurate) - the proof of concept took me about 10 minutes.
JAXB, the Java Architecture for XML Binding, might be what you want. You use it to inflate an XML document into a Java object graph made up of "Java content objects". These content objects are instances of classes generated by JAXB to match the XML document's schema.
But if you already have a set of Java classes, or don't yet have a schema for the document, JAXB probably isn't the best way to go. I'd suggest doing a SAX parse and then building up your Java objects during the parse. Alternatively you could try a DOM parse and then walk the resulting Document tree to pull out the parts of interest (maybe with XPath) -- but 5MB of XML might turn into 50MB of DOM tree objects in Java.
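For illustration, a minimal sketch of that SAX approach: a handler that collects just the elements of interest while the parser streams past everything else (the "name" element and the file name are hypothetical):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class NameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder text;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("name".equals(qName)) {
            text = new StringBuilder();          // start collecting character data
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);      // may be delivered in several pieces
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString());          // build your own object here instead
            text = null;
        }
    }

    public static void main(String[] args) throws Exception {
        NameCollector handler = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("data.xml"), handler);
        System.out.println(handler.names);
    }
}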
DOM, SAX and XSLT are different animals.
DOM parsing loads the entire document into memory, which for 100K to 5MB (very small by today's standards) would work.
SAX is a stream parser which reads the XML and delivers events to your code for each tag.
XSLT is a system for transforming one XML tree into another. Even if you wrote a transform that converts the input to a more suitable format, you'd still have to write something using DOM or SAX to convert it into Java objects.
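To make the XSLT mechanics concrete (this also speaks to the "how is the XML fed in" question above): in Java you hand the processor your stylesheet and your input as Sources, and the processor typically builds an internal tree of the input (comparable to DOM in memory cost, though implementations like Saxon use more compact models) before applying templates. A minimal JAXP sketch, with hypothetical file names:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RunTransform {
    public static void main(String[] args) throws Exception {
        // The stylesheet is compiled once; the input is then read into the
        // processor's internal tree model and the templates are applied to it.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("extract.xsl"));
        t.transform(new StreamSource("input.xml"), new StreamResult("output.xml"));
    }
}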
You can use the @XmlPath extension in EclipseLink JAXB (MOXy) to easily handle this use case. For a detailed example see:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Sample Code:
package blog.geocode;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;
@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {
    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
    private String street;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
    private String city;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
    private String state;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
    private String country;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
    private String postalCode;
}
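For completeness, a sketch of using that class, assuming MOXy is registered as the JAXB provider (via a jaxb.properties file in the same package) and a hypothetical response.xml:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

public class AddressDemo {
    public static void main(String[] args) throws Exception {
        // MOXy must be the active JAXB provider for @XmlPath to be honored
        JAXBContext context = JAXBContext.newInstance(Address.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Address address = (Address) unmarshaller.unmarshal(new File("response.xml"));
        // Only the five mapped fields are populated; the rest of the document is skipped
    }
}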
I've read several questions and tutorials on the internet, such as:
Best XML parser for Java [closed]
JAVA XML - How do I get specific elements in an XML Node?
What is the best way to create XML files in Java?
how to modify xml tag specific value in java?
Using StAX - From Oracle Tutorials
Apache Xerces Docs
Introduction to XML and XML With Java
Java DOM Parser - Modify XML Document
But since this is the very first time I have to manipulate XML documents in Java, I'm still a little bit confused. The XML content is written with String concatenation, and that seems wrong to me. It's like concatenating Strings to produce a JSON object instead of using a JSONObject class. That's the way the code is written right now:
"<msg:restenv xmlns:msg=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.b2wdigital.com/umb/NEXM_processa_nf_xml_req req.xsd\"><autenticacao><usuario>"
+ usuario + "</usuario><senha>" + StringUtils.defaultIfBlank(UmbrellaRestClient.PARAMETROS_INFRA_UMBRELLA.get("SENHA_UMBRELLA"), "WS.INTEGRADOR")
+ "</senha></autenticacao><parametros><parametro><p_vl_gnre>" + valorGNRE + "</p_vl_gnre><p_cnpj_destinatario>" + cnpjFilial + "</p_cnpj_destinatario><p_num_ped_compra>" + idPedido
+ "</p_num_ped_compra><p_xml_sefaz><![CDATA[" + arquivoNfe + "]]></p_xml_sefaz></parametro></parametros></msg:restenv>"
In my research, almost everything I've read pointed to SAX as the best solution, but I never really found anything useful to read about it; almost everything just states that you have to create a handler and override the startElement, endElement and characters methods.
I don't have to serialize the XML file to disk or a database or anything else; I just need to pass its content to a web service.
So my question really is, which is the right way to do it?
Concatenate Strings the way things are done right now?
Write the XML file using a Java API like Xerces? I have no clue on how that can be done.
Read the XML file with streams and just change the node texts? My XML, without the file contents, would look like this:
<msg:restenv xmlns:msg="{url}"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="{schemaLocation}">
<autenticacao>
<usuario></usuario>
<senha></senha>
</autenticacao>
<parametros>
<parametro>
<p_vl_gnre></p_vl_gnre>
<p_cnpj_destinatario></p_cnpj_destinatario>
<p_num_ped_compra></p_num_ped_compra>
<p_xml_sefaz><![CDATA[]]></p_xml_sefaz>
</parametro>
</parametros>
</msg:restenv>
I've also read something about using Apache Velocity as a template engine, since I don't actually have to serialize the XML, and that's an approach that I really like because I've already worked with this framework and it's a really simple framework.
Again, I'm not looking for the best way, but the right one with tutorials and examples, if possible, on how to get things done.
It all depends on context. There is no single "right way".
The biggest issue with concatenation is twofold: escaping the XML into a String constant (which is fiddly), and also escaping the values you insert so that they're correct for an XML document.
For small XMLs, that's fine.
But for larger ones, it can be a pain.
If most of your XML is boilerplate with just a few values inserted, you may find that templates using something like Velocity or any of several other libraries can be quite effective. It keeps the template "native" (you don't have to wrap it in quotes and escape it), it keeps the XML out of your code, and it easily lets you stamp in the parts that you need.
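A minimal Velocity sketch of that idea (the template is inlined here for brevity; in practice it would live in its own file, and the inserted values would still need XML escaping):

import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class XmlTemplateDemo {
    public static void main(String[] args) {
        // Inlined for brevity; normally the template would be a separate file
        String template = "<autenticacao><usuario>$usuario</usuario>"
                        + "<senha>$senha</senha></autenticacao>";

        VelocityEngine engine = new VelocityEngine();
        engine.init();

        VelocityContext context = new VelocityContext();
        context.put("usuario", "WS.INTEGRADOR");   // values still need XML escaping
        context.put("senha", "secret");

        StringWriter out = new StringWriter();
        engine.evaluate(context, out, "xml-template", template);
        System.out.println(out);                    // the XML with the values stamped in
    }
}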
I agree that there's not just one way to do it, but I would advise you to take a look at JAXB. You can easily consume and produce XML without any of that pesky String manipulation. Look here for a simple intro: https://docs.oracle.com/javase/tutorial/jaxb/index.html
The Answer by Will Hartung is correct. There is not one right way as it depends on your situation.
For a beginner programmer, I suggest writing the strings manually so you get to understand XML in general and your content in particular. As for the mechanics of String concatenation, you would generally be using StringBuilder rather than String for better performance. Where thread-safety is needed, use StringBuffer.
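For example, a small sketch of the StringBuilder mechanics; note that you must still escape inserted values by hand (the escape helper here is a hypothetical stand-in):

public class XmlBuilderDemo {
    // Hypothetical helper: escape the characters XML forbids in text content
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String usuario = "user&name";
        StringBuilder xml = new StringBuilder();
        xml.append("<autenticacao>")
           .append("<usuario>").append(escape(usuario)).append("</usuario>")
           .append("</autenticacao>");
        System.out.println(xml);
    }
}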
The major issue is memory.
Abundant memory: If you have lots of memory and small XML documents, you can load the entire document into memory. This way you can traverse the document forwards, backwards, and jump around arbitrarily. This approach is known as the Document Object Model (DOM). One better-known implementation of this approach is Apache Xerces. There are other good implementations as well.
Scarce memory: If you have little memory and large XML documents, then you need to plow through the document from start to finish, biting off small chunks at a time for lower memory usage. This approach is known as SAX. You can find multiple good implementations.
Another issue is validation. Do you want to validate the XML documents against a DTD or Schema? Some tools do this and some do not.
When all you need is to serialize the content of a Java object and read it back, I recommend the Simple XML Serialization library. It's much simpler, with a quicker learning curve than the other tools.
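A sketch of what that looks like with Simple (the Person class here is illustrative):

import java.io.File;
import org.simpleframework.xml.Element;
import org.simpleframework.xml.Root;
import org.simpleframework.xml.Serializer;
import org.simpleframework.xml.core.Persister;

@Root
public class Person {
    @Element
    private String name;

    public static void main(String[] args) throws Exception {
        Serializer serializer = new Persister();

        Person p = new Person();
        p.name = "Alice";
        serializer.write(p, new File("person.xml"));     // object -> XML

        Person back = serializer.read(Person.class, new File("person.xml")); // XML -> object
        System.out.println(back.name);
    }
}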
I have the following issue: I have very large XML files (like 300+ Megs), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use the StAX parser, as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, and thus not loading the whole thing in memory; but on the other hand, getting the values with StAX (at least on these XML files) is cumbersome, and I need to write a ton of code. From this latter point of view it would immensely help me if I could unmarshal the XML file to Java objects (like JAXB does); however, this would load the whole file plus a ton of Object instances in memory all at once.
My question is, is there some way to pull-parse (or just partially parse) the file sequentially, and then unmarshal only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem: if you give it the file name, it will parse the whole thing. Although there are some options to reduce how much is loaded, I didn't bother much, as we run on machines with 96 GB RAM. :)
Anyway, if your XML format is well defined, then one workaround is to fool EMF by breaking the whole file down into several smaller (but still well defined) XML snippets, then feeding each snippet in one after the other. I don't know JAXB, but perhaps the same workaround can be applied there as well, which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML document per element, each starting with <tag2> and ending with </tag2>. In Java, most parsers accept a Stream, so just parse using whatever you want: create a StringReader or similar for each <tag2> in a loop and pass it to JAXB or EMF.
HTH
Well, first off I want to thank the two people who answered my questions, but I finally ended up not using those suggestions, partly because those proposed technologies are a bit far from what you might call "standard" Java XML parsing, and it feels weird going so far afield when a similar tool is already present in Java, and partly because I did in fact find a solution that uses only Java APIs.
I won't detail the solution I found too much, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is that if you have an XML document and its corresponding XSD schema, you can parse and unmarshal it with JAXB, and you can do it in chunks, where those chunks are read with an event parser such as StAX and then passed to the JAXB unmarshaller.
This practically means that you must first decide on a good place in your XML file where you can say "this part here has A LOT of repetitive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your StAX parser that is triggered at the start of each of those child tags, then stream the content of that child tag over to JAXB, unmarshal it, and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
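A condensed sketch of that pattern, reusing the repeated <tag2> element from the earlier example (the Chunk class and its fields are placeholders):

import java.io.FileInputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ChunkedUnmarshal {

    @XmlRootElement(name = "tag2")
    static class Chunk {
        // mapped fields for the repeated element go here
    }

    public static void main(String[] args) throws Exception {
        Unmarshaller um = JAXBContext.newInstance(Chunk.class).createUnmarshaller();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("big.xml"));

        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "tag2".equals(reader.getLocalName())) {
                // JAXB consumes exactly one <tag2> subtree and leaves the reader
                // positioned after it, so only one chunk is in memory at a time
                Chunk chunk = um.unmarshal(reader, Chunk.class).getValue();
                // ... process the chunk, then continue with the next one
            } else {
                reader.next();
            }
        }
    }
}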
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml
I need to parse a series of simple XML nodes (String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seems like slight overkill to do it this way.
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.
I'd use the DOM parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and running.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html
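A minimal DOM sketch of that (the file and element names are hypothetical):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("nodes.xml"));   // the whole document is now in memory
        NodeList items = doc.getElementsByTagName("node");
        for (int i = 0; i < items.getLength(); i++) {
            System.out.println(items.item(i).getTextContent());
        }
    }
}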
You might want to take a look at the XPath library. This is a simpler way of parsing XML. It's similar to building SQL queries and regexes.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
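A short sketch of the javax.xml.xpath API that the article covers (the expression and file name are illustrative):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("nodes.xml"));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The expression reads much like a query over the tree
        NodeList hits = (NodeList) xpath.evaluate(
                "//node[@type='wanted']", doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(hits.item(i).getTextContent());
        }
    }
}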
I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse the entire document into objects; instead, the parser performs a single uni-directional pass over the document and triggers events as it goes through.
It's actually surprisingly easy to implement: for instance, take a look at Working with XML on Android by IBM. Only listings 5 and 6 are the actual implementation of their SAX parser, so it's not a lot of code.
You can try Konsume-XML: the SAX/StAX/Pull APIs are too low-level and hard to use, and DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull, so it's extremely efficient, yet the API is higher-level and much easier to use.
I have a somewhat large file (~500 KiB) with a lot of small elements (~3000). I want to pick one element out of this and parse it into a Java class.
Attributes Simplified
<xml>
<attributes>
<attribute>
<id>4</id>
<name>Test</name>
</attribute>
<attribute>
<id>5</id>
<name>Test2</name>
</attribute>
<!--3000 more go here-->
</attributes>
class Simplified
public class Attribute{
private int id;
private String name;
//Mutators and accessors
}
I kinda like XPath, but people suggested StAX and even VTD-XML. What should I do?
500 KiB is not that large. If you like XPath, go for it.
I kinda like XPath, but people suggested StAX and even VTD-XML. What should I do?
DOM, SAX and VTD-XML are three different ways to parse an XML document, roughly in this order of memory efficiency. DOM needs over 5 times as much memory as the XML file is large. SAX is only a bit more efficient; VTD-XML uses only a little more memory than the size of the XML file, about 1.2 times.
XPath is just a way to select elements and/or data from a (parsed) XML document.
In other words, you can use XPath in combination with any of the XML parsers, so this is after all a non-concern. If you just want the best memory efficiency and performance, go for VTD-XML.
I have commented above as well, because there are a few options to consider, but by the sound of your initial description I think you could get away with a simple SAX processor here, which will probably run faster than the other mechanisms (although it might not look as pretty when it comes to mapping to the Java class):
There is an example here which matches your example quite closely:
http://www.informit.com/articles/article.aspx?p=26351&seqNum=6
Avoid anything that is a DOM parser - no need for that, especially with a large-ish file and relatively simple XML syntax.
As for which specific one to use: sorry, I haven't used them, so I can't give you any more guidance than to look at licensing, performance, and support (for questions).
My favorite XML library is Dom4j
Whenever I have to deal with XML I just use XMLBeans. It may be overkill for what you are after, but it makes life easy (once you know how to use it).
If you don't care about performance at all, Apache Digester may be useful for you, as it will already initialize the Java objects for you after you define the rules.
I have a big XML document returned by SAP. Of this, I need only a few nodes, maybe 30% of the returned data.
After googling, I learned that I can filter the nodes in either of these ways:
Apply XSLT templates - I have seen some nice solutions, of the kind I want, on this site.
Use a parser - JDOM or a SAX parser.
Which is the more efficient way to "filter XML nodes"?
thanks
A SAX parser will be the fastest and most efficient (in that you don't need to read the entire document into memory to process it).
XSLT will probably be a terser solution, since all you need is an identity transform (to copy the input document) with a few templates to copy out the bits you want.
Personally I'd go with the SAX parser.
The StAX API may suit your needs - have a look at StreamFilter or EventFilter. It has an advantage over SAX in that its pull model makes it easier to quit processing when you've parsed all the data you want, without resorting to artificial mechanisms like throwing an exception.
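A sketch of a StreamFilter, with a hypothetical element name; note that non-element events still pass through, while start/end element events for everything else are suppressed:

import java.io.FileInputStream;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class FilteredRead {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader raw = factory.createXMLStreamReader(
                new FileInputStream("sap-response.xml"));

        // Suppress start/end element events for everything except the node we want;
        // other event types (characters, etc.) still pass through
        StreamFilter filter = reader ->
                !(reader.isStartElement() || reader.isEndElement())
                || "wantedNode".equals(reader.getLocalName());

        XMLStreamReader filtered = factory.createFilteredReader(raw, filter);
        while (filtered.hasNext()) {
            filtered.next();
            // ... pull out data, and simply stop looping once you have everything
        }
    }
}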
If your employer can afford SAP, then they certainly can afford Saxon, which is an XSLT processor that can process streams of arbitrary length.