XML API for best performance - java

I have an application that works with a lot of XML data, so I want to ask which API is best for handling XML in Java. Today I'm using the W3C DOM and, for performance, I want to migrate to another API.
I build XML from scratch, run a lot of transforms, import into databases (MySQL, MS SQL, etc.), export from the database to HTML, modify those XML documents, and more.
Is JDOM the best option? Do you know of anything better than JDOM?
I have read about Javolution. Does anybody use it?
Which API do you recommend?

If you have vast amounts of data, the main thing is to avoid having to load it all into memory at once (because it will use a vast amount of memory, and because it prevents you from overlapping IO and processing). Sadly, I believe most DOM and DOM-like libraries (like DOM4J) do just that, so they are not well suited for processing vast amounts of XML efficiently.
Instead, look at using a streaming API, like SAX or StAX. StAX is, in my experience, usually easier to use.
There are other APIs that try to give you the convenience of DOM with the performance of SAX. Javolution might be one; VTD-XML is another. But to be honest, I find StAX quite easy to work with: it's basically a fancy stream, so you just think in the same way as if you were reading a text file from a stream.
One thing you might try is combining JAXB with StAX. The idea is that you stream the file using StAX, then use JAXB to unmarshal chunks within it. For instance, if you were processing an Atom feed, you could open it, read past the header, then work in a loop unmarshalling entry elements to objects one at a time. This only really works if your format consists of a sequence of independent elements, like Atom; it would be largely useless on something richer like XHTML. You can see examples of this in the JAXB reference implementation and a guy's blog post.
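A minimal sketch of the chunk-at-a-time loop described above, assuming a small Atom-like feed held in a string. JAXB is not bundled with recent JDKs, so instead of unmarshalling each entry this sketch only collects each entry's title by hand; the class name StaxEntries and the sample feed are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxEntries {
    // Streams the document and returns the <title> text of every <entry>.
    // Only the current event is held in memory, never the whole tree.
    static List<String> readTitles(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        boolean inEntry = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("entry")) {
                    inEntry = true;
                } else if (inEntry && name.equals("title")) {
                    // getElementText() reads up to the matching end tag
                    titles.add(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("entry")) {
                inEntry = false;
            }
        }
        reader.close();
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<feed><title>Feed</title>"
                + "<entry><title>One</title></entry>"
                + "<entry><title>Two</title></entry></feed>";
        System.out.println(readTitles(feed));
    }
}
```

With JAXB on the classpath, the body of the entry branch could hand the reader to an Unmarshaller instead of picking out fields manually.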

The answer depends on what performance aspects are important for your application. One factor is whether you are handling large XML documents.
For parsing, DOM-based approaches will not scale well to large documents. If you need to parse large documents, non-DOM parsers such as those using SAX and StAX will be faster and less resource intensive. However, if you need to transform XML after parsing, using either XSL or a DOM API, you are going to need the whole document in memory in any case.
For creating XML from code, StAX provides a nice API for this. Since the approach is stream-based, this will scale well to writing very large documents.
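As a rough illustration of stream-based writing, here is a sketch using the JDK's XMLStreamWriter. The element names (orders, order) and the class name StaxWrite are invented for the example; for a really large document you would write to a file or socket stream rather than a StringWriter.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StaxWrite {
    // Writes n <order> elements one after another; nothing is buffered as a tree.
    static String writeOrders(int n) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("orders");
        for (int i = 1; i <= n; i++) {
            writer.writeStartElement("order");
            writer.writeAttribute("id", String.valueOf(i));
            writer.writeCharacters("item " + i);
            writer.writeEndElement(); // closes <order>
        }
        writer.writeEndElement();     // closes <orders>
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeOrders(3));
    }
}
```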

Most developers I know, myself included, use dom4j. If you have the time, you could write a small performance test using both frameworks and see the difference for yourself. I prefer dom4j.

Related

Best Practice for large XML file builder

I have to build an XML file for an input to a SOAP service in Java. The input xml can consist of at least 1000 tags. What is the best way to build the XML? I have the XSD files but it is a bit complicated to use JAXB. Is XMLStreamWriter a good option for that?
XMLStreamWriter is one of the better APIs to use for writing XML from a Java application, but it has a few quirks (e.g. its namespace handling is a bit bizarre) and you may find it worthwhile to wrap it in a convenience API that knows about the kind of document you are writing, e.g. what namespaces it uses.
One of the advantages of the XMLStreamWriter interface is that there are plenty of implementations to choose from. For example Saxon has an implementation that gives you full control over all the XSLT/XQuery serialization options plus Saxon extensions (for example, you can even control the output order of attributes!)
One of the problems I hit with all event-based APIs is that sooner or later you find yourself forgetting to write an end tag, and that can be quite tricky to debug. Using a wrapper API that forces you to include the element name in a call on endElement() can be useful for debugging; if debugging is switched on you can keep a stack of element names and check that endElement() is writing the right tag; with debugging switched off you just drop this check.
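The wrapper idea can be sketched roughly as follows. CheckedXmlWriter and its method names are hypothetical, not part of any library; the only real API used is javax.xml.stream.XMLStreamWriter.

```java
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class CheckedXmlWriter {
    private final XMLStreamWriter out;
    private final boolean debug;
    private final Deque<String> open = new ArrayDeque<>(); // names of open elements

    CheckedXmlWriter(XMLStreamWriter out, boolean debug) {
        this.out = out;
        this.debug = debug;
    }

    void startElement(String name) throws XMLStreamException {
        if (debug) open.push(name);
        out.writeStartElement(name);
    }

    void text(String s) throws XMLStreamException {
        out.writeCharacters(s);
    }

    // The caller names the element it believes it is closing.
    // With debug on, a mismatch fails immediately instead of producing odd output.
    void endElement(String name) throws XMLStreamException {
        if (debug) {
            String expected = open.isEmpty() ? null : open.pop();
            if (!name.equals(expected)) {
                throw new IllegalStateException(
                        "endElement(" + name + ") but open element is " + expected);
            }
        }
        out.writeEndElement();
    }

    public static void main(String[] args) throws Exception {
        StringWriter sw = new StringWriter();
        XMLStreamWriter raw = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        CheckedXmlWriter w = new CheckedXmlWriter(raw, true);
        w.startElement("a");
        w.startElement("b");
        w.text("x");
        w.endElement("b");
        w.endElement("a");
        raw.close();
        System.out.println(sw);
    }
}
```

With debug set to false the stack is never touched, so the check costs nothing in production.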
Serializing using JAXB is higher-level, of course, but the downside is that it gives you less control.

What is the advantage of using JAXP instead of DOM / SAX directly in Java?

Being new to XML parsing I'm trying to understand the different technologies. There is a confusing amount of different technologies for different needs:
W3C-DOM
XOM
jDom
JAXP
JAXB
DOM
SAX
StAX
TrAX
Woodstox
dom4j
Crimson
VTD-XML
Xerces-J
Castor
XStream
...
Just to name a few.
DOM and SAX seem to be a low-level way for parsing and working on XML, so I decided to focus on the ones that get mentioned the most in different sources and are low-level:
DOM, SAX, JAXP.
I've read about parsers in general here on stackoverflow, JAXP-Tutorial from Oracle, XML-Parsing in general, and so on.
I've also tried some tutorials like this german one and others.
I'm grasping a little bit about DOM and SAX now, but the reason to use JAXP is still beyond me. It seems to be more of an interface to use DOM, SAX, ... internally, but why not use DOM or SAX directly?
What is the advantage of using JAXP in layman's-terms?
(Although you haven't said so explicitly, your question seems to relate exclusively to the Java world, and this answer reflects that.)
JAXP is a set of interfaces covering XML parsing, XSLT transformation, and XML schema validation. If we just focus on the XML parsing side, its main contribution is to provide a mechanism for locating an XML parser implementation, so your source code isn't locked into a particular product. Frankly that's of limited value these days; the only two SAX/DOM parsers in common use are the one embedded in the JDK, and Apache Xerces. Apache Xerces is better in every respect except that you need to download it separately.
As for the other parsing interfaces, they break down into two categories: event-based APIs and tree-based APIs. Tree-based APIs are much easier to work with, but can use a lot of memory when handling large documents.
The two dominant event-based APIs are SAX (push) and StAX (pull). Pull parsing is something many programmers find easier because you can use the program stack to maintain state information; unfortunately though the StAX API is a bit buggy - different implementations have fixed its gaps in different ways. The most complete and reliable implementation of StAX is the Woodstox parser; the most complete and reliable implementation of SAX is Apache Xerces. But don't attempt to use an event-based parsing approach unless your application really needs that level of performance (and unless you have the level of experience needed to avoid losing all the performance gains at the application level.)
For tree-based APIs, the DOM remains dominant solely because it was defined by W3C and is implemented in the JDK, and is therefore perceived as "standard"; also it's the one mentioned in all the books on the subject. However, of all the tree models, it is unquestionably the worst designed (mainly because it predates the introduction of namespaces). Alternatives include JDOM2, DOM4J, XOM, and AXIOM. I tend to recommend JDOM2 or XOM.
JAXP is just Sun's (now Oracle's) name for a collection of SAX and DOM classes they bundle with the JDK. If you're using JAXP, you're also using SAX and/or DOM. It's not a different thing.
JAXP also adds a few helper classes in the javax.xml.parsers package that fill gaps in SAX 1 and DOM 1, i.e. old versions of these libraries from 15+ years ago. However these are not necessary with SAX2/DOM3 that are used today. Worse yet, javax.xml.parsers classes such as DocumentBuilderFactory and SAXParserFactory are designed in a confusing way (they're not namespace aware by default) so they are almost always used incorrectly. Then developers come here to ask why their program doesn't do what they think it should. Just ignore these classes and use XMLReaderFactory (SAX 2) or DOMImplementationLS (DOM 3) instead.
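To make the "not namespace aware by default" pitfall concrete, here is a small sketch using DocumentBuilderFactory with and without setNamespaceAware(true). The class name and the sample document are made up; the behaviour noted in the comments is what I would expect from the parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceAwareDemo {
    // Parses a tiny namespaced document and returns the root's local name.
    static String rootLocalName(boolean namespaceAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // the factory default is false
        Document doc = factory.newDocumentBuilder().parse(
                new InputSource(new StringReader("<a:root xmlns:a='urn:example'/>")));
        return doc.getDocumentElement().getLocalName();
    }

    public static void main(String[] args) throws Exception {
        // Without namespace awareness the element is built DOM Level 1 style,
        // so getLocalName() is expected to return null.
        System.out.println("default-style factory: " + rootLocalName(false));
        System.out.println("namespace aware:       " + rootLocalName(true));
    }
}
```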

Best way to parse large XML document in Jython

I need to parse a large (>800MB) XML file from Jython. The XML is not deeply nested, containing about a million relevant elements. I need to convert these elements into real objects.
I've used nu.xom.* successfully before, but now that I've switched from Java to Jython, the library fails with the following message:
The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.
I have not found a way to fix this, so I probably have to look for another XML library. It could be either Java or Jython-compatible Python and should be efficient. Pythonic would be great, nu.xom.* is simple but not very pythonic. Do you have any suggestions?
SAX is the best way to parse large documents.
Sounds like you're hitting the default expansion limit.
See this note:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4843787
You need to set System property "entityExpansionLimit" to change the default.
(added) see also the answer to this question.
Try using the SAX parser, it is great for streaming large XML files.
Does jython support xml.etree.ElementTree? If so, use the iterparse method to keep your memory size down. Read this and use elem.clear() as described.
There is also lxml, a Python library that can parse large files without loading all the data into memory, but I don't know if it is Jython-compatible.

Processing large xml files

I have a large XML file that contains many sub-elements. I want to be able to run some XPath queries. I tried using VTD-XML in Java, but I sometimes get an OutOfMemoryError because the XML is too large to fit into memory. Is there an alternative way of processing such large XML files?
Try http://code.google.com/p/jlibs/wiki/XMLDog
It executes XPath expressions using SAX without creating an in-memory representation of the XML document.
SAXParser is very efficient when working with large files
What are you trying to do right now? By the sounds of it you are trying to use a DOM-based parser, which essentially loads the entire XML file into memory as a DOM representation. If you are dealing with a large file, you'll be better off using a SAX parser, which processes the XML document in a streaming fashion.
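A minimal sketch of that streaming style with the JDK's SAX parser: the handler below only counts elements with a given name, so memory use does not grow with document size. The class name SaxCount and the sample input are invented; a real program would pass a file or stream instead of a string.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCount {
    // Counts elements whose local name matches, without building a tree.
    static int countElements(String xml, String name) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        final int[] count = {0};
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (localName.equals(name)) {
                            count[0]++;
                        }
                    }
                });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item/><other/><item>text</item></root>";
        System.out.println(countElements(xml, "item"));
    }
}
```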
I personally recommend StAX for this.
Did you use standard VTD-XML or extended VTD-XML? If you use extended VTD-XML then you have the option of using memory mapping... did you try that?
Using XPath might not be a very good idea if you plan on compiling many expressions dynamically in a long lived application.
I'm not entirely sure how the java version of XPath works, but in .NET XPath compiles a dynamic assembly then adds it to the app domain. Subsequent uses of the expression look at the assembly now loaded into memory.
In one case where I was using XPath, I think this same mechanism was slowly filling up memory, much like a memory leak.
My theory is that, as each expression was built using values from the user, each compiled expression was likely unique, so a new expression was compiled and added to the app domain every time.
Since you cannot remove an assembly from the app domain without restarting the entire app domain, memory was being consumed each time an expression was evaluated and it could not be recovered. As a result, the code was leaking memory in the form of assemblies, and after a while, well, you know the results.
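On the Java side, one way to avoid compiling a fresh expression for every user-supplied value is to compile once and pass the value in as an XPath variable. The sketch below uses javax.xml.xpath from the JDK; the class name, the sample document and the variable name id are made up, and it does not show whether Java suffers from the kind of leak described above.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathReuse {
    // Values for XPath variables; looked up when the expression is evaluated.
    private final Map<String, Object> vars = new HashMap<>();
    private final XPathExpression byId;

    XPathReuse() throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The resolver must be installed before compile() is called.
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
        byId = xpath.compile("string(/items/item[@id = $id])");
    }

    // Reuses the single compiled expression; only the variable value changes.
    // Not thread-safe: the variable map is shared state.
    String find(String xml, String id) throws XPathExpressionException {
        vars.put("id", id);
        return byId.evaluate(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<items><item id='a'>first</item><item id='b'>second</item></items>";
        XPathReuse lookup = new XPathReuse();
        System.out.println(lookup.find(xml, "a"));
        System.out.println(lookup.find(xml, "b"));
    }
}
```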

Why choose an XSL-transformation?

For a current project the decision has to be made whether to use XML and an XSL-transformation to produce HTML or to directly use HTML-templates.
I'd be interested in arguments for or against the XSL-approach. I understand that in cases where you have to support many different layouts, an XSL-solution has a lot of advantages, but why would you choose it in those cases where you only have to support one target layout?
Edit: We're talking about Java here.
XSLT is a functional programming language and you can use it to create frontends as rich as any templating system. However, you shouldn't — you and your team will go insane.
Both options present the opportunity of transforming objects into a presentation form in a logical sort of way. XSLT is best suited for creating more XML, which might lead you to believe that it's a perfect candidate to use to create XHTML. However, creating XHTML shouldn't be the primary goal — Creating a user experience is. Don't concern yourself with the medium.
Two significant drawbacks to XSLT concern the syntax: Your templates, and the templates that they include, and the templates that those templates include will all be gigantic and verbose. Second, you'll have to do a lot of functional programming, and less-experienced engineers may be confused and terrified when they encounter a recursive template with an accumulating function parameter instead of a simple for loop.
If you're attracted by the beauty of transforming logically-constructed, valid XML entities, consider instead a type-safe templating system that transforms beans instead. Check out Google XML Pages, and create logically-organized, type-safe templates that will be easy for future engineers to pick up and extend.
I created an XML/XSLT-driven UI for an enterprise product about 5 years ago. We're still using it, and I can now look back on my experience and see many pros and cons:
Pros:
XSL is a powerful declarative language, useful & fun for experienced developers, and transforms can do pretty amazing things in a few lines of code
XSL is designed for use with XML, so if your data is already XML then it makes a lot of sense
Separation of concerns (rendering vs. data) is better than many template languages
XSL-based rendering can be easily "subclassed". By that I mean: let's say you have data class A with associated template A.xslt. For class B derived from A, you can easily create B.xslt with only the small differences, and include A.xslt for inherited behaviors. This makes it less susceptible to breaking due to changes in A.xslt.
The above point also gives you the power to do overrides. For class A with associated A.xslt, we can easily switch the associated template to A-custom.xslt, which is a few small changes plus inheritance of A.xslt. We can do this on the fly in the field and again, the benefit is that A-custom.xslt is only a few lines, not an entire modified copy of the original A.xslt. The small footprint means it's more likely to work with multiple versions of A.xslt.
In .NET 2.0, XSLT is compiled and becomes very fast. There may be similar tech for Java. (Most template languages do this now too.)
In .NET, it's possible to create an "Object XPath Navigator", which lets you transform your data objects without having to convert them to an XML object. Again there may be similar tech in Java
XSLT is smart about HTML & handles escaping, white space issues, etc. well
Cons:
XSL is a powerful declarative language, confusing to newer programmers - and fewer people know XSLT well
XSL is verbose. XML is often verbose too.
XSL transforms are probably slower than "native" templates. Even when compiled there's still more state overhead to XSL than most template languages
It's hard to pass parameters to XSLs, you have to either send them in line with your data (forcing you to create extra XML) or via system-specific methods (which may also involve constructing XML data)
If you don't have an ObjectXPathNavigator or equivalent, you'll incur significant overhead when turning your data objects into XML for transformation
Depending on the capabilities of your transformer, you may also incur buffering overhead as you transform into a string buffer and then send that string to the output device
The more advanced your XSLT usage, the less likely it is that your tools will support you (specifically as you start to use includes or faster ways to pass XML data in)
I'll try to update as I think of more issues. I think that looking back now, my verdict would be to stick with a common template language. What were once big issues when I selected XML/XSLT have now been addressed by newer and more mature revisions of the major template engines. We do still benefit greatly from the ability to inherit .xslt files, which is something most template engines don't do well. But in the end the value of having lots of developers providing examples is far greater (compare ASP.NET answers vs XSLT answers on StackOverflow, for instance.)
Hope that helps!
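Since the question is about Java: the JDK's javax.xml.transform API does offer a precompiled stylesheet object (Templates) and a way to pass parameters without extra XML (Transformer.setParameter). A rough sketch, with a made-up stylesheet, parameter name and class name:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CompiledXslt {
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:param name='title'/>"
      + "<xsl:template match='/items'>"
      + "<div><h1><xsl:value-of select='$title'/></h1>"
      + "<ul><xsl:for-each select='item'><li><xsl:value-of select='.'/></li></xsl:for-each></ul></div>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once; the Templates object can be reused across requests.
    static Templates compile() throws Exception {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(XSL)));
    }

    // Each call gets a fresh Transformer, which is cheap compared to recompiling.
    static String render(Templates templates, String title, String xml) throws Exception {
        Transformer transformer = templates.newTransformer();
        transformer.setParameter("title", title); // parameter passed from Java, not via the XML
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Templates templates = compile();
        System.out.println(render(templates, "Stock",
                "<items><item>apple</item><item>pear</item></items>"));
    }
}
```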
I've done significant development using XSLT and it has been both tremendously successful and a complete failure at two different sites.
A few thoughts before a conclusion:
I don't think anyone would dispute that XSLT is far more powerful than a template parsing engine; it's a functional language.
Although it's not as widely adopted as most procedural languages, it's still a real language that's being used out there for actual projects, people can be hired already with knowledge of XSLT and it's a transferable skill for your current staff.
XSLT has also been around for a while now and the implementations are mature. I'm sure this is also the case for long-running templating engines (like Velocity), but newer engines may be less robust.
Whatever template language you decide on it's unlikely to be as well documented as XSLT. Check out any of the Michael Kay Programmer's reference series for an example on how to do a great reference book.
Tool support is generally very good ... if you have a budget. XMLSpy and Stylus Studio have both been very useful for me in the past.
XSLT is not only hard but, more importantly, different. Most people are not Computer Science graduates formally trained in functional programming. The majority of programmers will write XSLT in a procedural style which will not harness any power of the language and give you a maintenance headache.
XSLT transforms can be slow and can take a lot of memory. You may have problems if you have a stylesheet with a large XML input.
I love XSLT but whether you should use it or not comes down to a few points:
Are you committed to XSLT? Do you have serious in-house expertise in XSLT? Are you prepared to get some?
Is your data in XML? Does it make sense in XML? Do you have someone in-house who loves your data enough to make sure it's well structured and there's always an appropriate schema?
Unless the answer to those questions is yes and you have complex data that requires a complex rendering process, I wouldn't consider using XSLT ... especially if there's no experience in the team. Bad XSLT is much, much, much worse than a bad template.
However, it can render complex data in a maintainable fashion which would be impossible using many of today's templating engines.
Going the XSL way will future-proof your application. Meaning, if you decide in the future to add more templates with different layouts you will be able to capitalize on those advantages. In my current project we save off the XML used (in an XMLType or CLOB) and allow other applications to access the data and XSL templates to generate documents via a web service. This was an afterthought to the original design that was super easy to implement due to our decision to use XML/XSL.
XSLT has the advantage of also being able to produce output in other document types (e.g. PDF), and PDF output is very likely to be needed nowadays. XML/XSLT also separates data from the view.
When we have done XSLT in the past, it was to allow the ability to extend our product. The output remained the same, only the presentation layer needed to change. This allowed us a lot of flexibility when we had clients that wanted to "customize" their UI, since all we needed to do was replace the XSLT file. If you foresee needing to make a lot of those kinds of changes, XSLT might be your answer.
However, as stated above, the XSLT syntax and functional programming mentality can make it difficult to effectively produce templates. We found that we liked to stick to the tricks that we learned and when we had client requests that fell outside of what we already knew, no one wanted to volunteer for the ticket. Usually someone eventually figured out how to do the task and our "bag of tricks" got larger, but it was often very cumbersome to figure out new things.
If you don't foresee ever changing the UI, or at least not much, XSLT may not be worth the extra effort.
Please don't use XML/XSLT for web front-ends. I was in projects like this and it's horrible. Often you have to first produce the XML from objects or something similar, which doesn't make sense. A second point is, that there are so many good HTML editors out there for free, but I've found none for XSLT. So editing complex XSLT is no fun. I would recommend to go with HTML templates and a common template engine.
Depending on your application, having an XML layer that is then transformed to XHTML via XSLT also means that you can easily write web services on top of the XML layer, allowing your customers to consume your site's data...
Having the XML sent to the browser with a transformation link (forgot the exact syntax...) also means less bandwidth is needed, as the XSLT file will stay the same and you only need to pass the raw XML it is built from - sort of like using an external CSS style sheet instead of adding the style attributes to your markup ;)
I think you need to examine what the source of your data will be. As mentioned by boris callens earlier, if you are pulling from a database you will have to transform first to XML, then apply your transformations. Should the data source be RSS or the like, then XSLT is a natural choice.
XPath and XSLT have a steep learning curve, and functional programming can be daunting to get your arms around. In a time crunch this may not be the right choice.
For front end work JSON has a lighter payload, and is readily supported by jQuery and other JavaScript libraries. You may want to consider JSON as the data protocol, as the jQuery library is far more accessible to developers and the time to productivity with the framework is far less than with XSLT, embedded JavaScript in tags, awful syntax and all the other minutiae that come with XML/XPath/XSLT on the front end.
Keep it simple. That's a principle that one gets to appreciate more and more.
Velocity or Freemarker are incredibly flexible and versatile. Your code base will be clear, easily understandable, and it will run much (much) faster than the X monstrosities.
http://fishbowl.pastiche.org/2002/02/12/xslt_is_the_spawn_of_satan/
I see how the XSL approach can be handy if your data is already XML.
But usually it isn't. It's somewhere in a database, needs to be generated on the spot or comes from some service.
Creating XML from this source to then be able to create HTML from that XML is useless in my opinion. I would stick with (X)Html templates.
In contrast to HTML, there are a lot of XML tools available if you need to do parsing and processing of the templates in any way. So you should choose XML to get the benefits of using tools and libraries for XML.
However, that said, it may just be that XHTML fits your needs, since this gives you full support of XML tools and libraries while still being normal HTML which is correctly processed by modern web browsers. If you need to do post-processing of those later on, you can still apply XSLT to the XHTML data.
I've used XML & XSLT in a previous project, financial web sites, and it worked well for us, but:
We had multiple customers, which varied the number of outputs we had. We could replace the XSLT stylesheet and this made changes to the site easier to manage for the developers
We had a specialist web editor on the team. We gave them example XML & they could edit the stylesheets directly
If there were ever any wording changes that needed to go onto the website yesterday ( it was a bank, this happened surprisingly often), we could just deploy the new XSLT without redeploying the entire site.
Multiple different output formats were needed. We used FOP for transformation to PDF, which is based upon the same sort of technology, so wasn't too hard for us to understand :-)
The main reason I see for using XSLT is if you have multiple sites all based upon the same XML, but requiring different HTML output.
XML + XSLT are really cool. You have the ability to output many types of target formats in the future.
But be aware of embedded HTML in the XML. Firefox XSLT doesn't support "disable-output-escaping". See Bugzilla.
We use XSLT to generate html in our content management system and it works just fine.
Some hints: Don't try to generate all the page at once from one big hairy XML, you'll go insane. Use the HTML template (plain text/html file with styles, decorations and basic markup) with embedded markers (like, <!--MENU-->, <!--CONTENT-->), and replace markers with xslt-transformation of appropriate data.
Having said that, I doubt you really need XSLT if you are only going to have one layout, forever.
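The marker approach from the hints above might look roughly like this in Java. The page template, the menu XML and the stylesheet are all invented for the sketch, and only the <!--MENU--> marker is handled; it is one possible reading of the idea, not the poster's actual system.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class MarkerTemplate {
    // Small stylesheet that renders only the menu fragment.
    static final String MENU_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/menu'>"
      + "<ul><xsl:for-each select='entry'><li><xsl:value-of select='.'/></li></xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs one stylesheet over one XML fragment and returns the HTML text.
    static String transform(String xsl, String xml) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)))
                .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // The page itself stays plain HTML; only the marker is replaced.
    static String render(String page, String menuXml) throws Exception {
        return page.replace("<!--MENU-->", transform(MENU_XSL, menuXml));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><!--MENU--><p>Static content</p></body></html>";
        String menu = "<menu><entry>Home</entry><entry>News</entry></menu>";
        System.out.println(render(page, menu));
    }
}
```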
