I need to parse a large (>800MB) XML file from Jython. The XML is not deeply nested, containing about a million relevant elements. I need to convert these elements into real objects.
I've used nu.xom.* successfully before, but now that I've switched from Java to Jython, the library fails with the following message:
    The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.
I have not found a way to fix this, so I probably have to look for another XML library. It could be either Java or Jython-compatible Python, and it should be efficient. Pythonic would be great; nu.xom.* is simple but not very pythonic. Do you have any suggestions?
SAX is the best way to parse large documents.
Sounds like you're hitting the default expansion limit.
See this note:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4843787
You need to set the system property "entityExpansionLimit" to change the default.
(added) see also the answer to this question.
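As a minimal sketch of raising the limit from code before any parser is created; the property name comes from the linked bug report, the value here is arbitrary, and it only takes effect if the underlying parser is the JDK's built-in one:

    import java.io.File;
    import nu.xom.Builder;
    import nu.xom.Document;

    public class RaiseExpansionLimit {
        public static void main(String[] args) throws Exception {
            // Equivalent to passing -DentityExpansionLimit=2000000 on the command line.
            System.setProperty("entityExpansionLimit", "2000000");

            // Build the document as usual once the limit has been raised.
            Builder builder = new Builder();
            Document doc = builder.build(new File(args[0]));
            System.out.println(doc.getRootElement().getLocalName());
        }
    }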
Try using a SAX parser; it is great for streaming large XML files.
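As a rough sketch in Java (the same javax.xml.parsers API is callable from Jython); the element name "record" and the handler logic are placeholders for whatever your million elements look like:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class BigFileSax {
        public static void main(String[] args) throws Exception {
            DefaultHandler handler = new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    text.setLength(0);                // reset the buffer for each element
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);   // collect character data
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if ("record".equals(qName)) {     // hypothetical element of interest
                        // build your real object from 'text' here, one element at a time
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser().parse(new File(args[0]), handler);
        }
    }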
Does Jython support xml.etree.ElementTree? If so, use the iterparse method to keep your memory footprint down. Read this and use elem.clear() as described.
There is the lxml Python library, which can parse large files without loading the data into memory, but I don't know if it is Jython compatible.
Can you please help me choose the best Java APIs to map CSV files and XML files into Java objects, in the context of a Spring Boot application and microservices? OpenCSV, Apache Commons, JAXB ...? What are the best APIs for mapping CSV and XML to Java objects?
Thank you.
I have used OpenCSV a lot without any issues. You can get a good feel for it from this article.
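As a rough sketch of the typical OpenCSV reading loop (this assumes the com.opencsv package used by recent versions; the file name and column indices are made up):

    import java.io.FileReader;
    import com.opencsv.CSVReader;

    public class CsvExample {
        public static void main(String[] args) throws Exception {
            try (CSVReader reader = new CSVReader(new FileReader("customers.csv"))) {
                String[] row;
                while ((row = reader.readNext()) != null) {
                    // map the columns onto your own object; the indices are illustrative
                    String name = row[0];
                    String email = row[1];
                    System.out.println(name + " <" + email + ">");
                }
            }
        }
    }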
You will need a different library for XML. You first need to choose between DOM and SAX. The most important criterion is size: does the document fit in memory with ease? If so, use a DOM parser, as it's faster. Otherwise, use SAX.
A good recommendation for DOM parsing is dom4j.
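A minimal dom4j sketch (the file name and element names are illustrative):

    import java.io.File;
    import java.util.Iterator;
    import org.dom4j.Document;
    import org.dom4j.Element;
    import org.dom4j.io.SAXReader;

    public class Dom4jExample {
        public static void main(String[] args) throws Exception {
            SAXReader reader = new SAXReader();
            Document document = reader.read(new File("orders.xml"));  // loads the whole document into memory
            Element root = document.getRootElement();
            for (Iterator<?> it = root.elementIterator("order"); it.hasNext();) {
                Element order = (Element) it.next();
                System.out.println(order.attributeValue("id"));
            }
        }
    }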
I have an application that works with a lot of XML data, so I want to ask which is the best API to handle XML in Java. Today I'm using the W3C DOM API, and for performance I want to migrate to some other API.
I create XML from scratch, do a lot of transforms, import into databases (MySQL, MSSQL, etc.), export from the database to HTML, modify those XML documents, and more.
Is JDOM the best option? Do you know of anything better than JDOM?
I have heard (by reading around) about Javolution. Has anybody used it?
Which API would you recommend?
If you have vast amounts of data, the main thing is to avoid having to load it all into memory at once (because it will use a vast amount of memory, and because it prevents you from overlapping IO and processing). Sadly, I believe most DOM and DOM-like libraries (like DOM4J) do just that, so they are not well suited for processing vast amounts of XML efficiently.
Instead, look at using a streaming API, like SAX or StAX. StAX is, in my experience, usually easier to use.
There are other APIs that try to give you the convenience of DOM with the performance of SAX. Javolution might be one; VTD-XML is another. But to be honest, I find StAX quite easy to work with - it's basically a fancy stream, so you just think in the same way as if you were reading a text file from a stream.
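To illustrate that "fancy stream" feel, here is a rough StAX cursor-API sketch; the element name "entry" is just a placeholder:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxReadExample {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            while (reader.hasNext()) {
                // pull events one at a time; only the current event is held in memory
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    // handle one element of interest here
                }
            }
            reader.close();
        }
    }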
One thing you might try is combining JAXB with StAX. The idea is that you stream the file using StAX, then use JAXB to unmarshal chunks within it. For instance, if you were processing an Atom feed, you could open it, read past the header, then work in a loop unmarshalling entry elements to objects one at a time. This only really works if your format consists of a sequence of independent elements, like Atom; it would be largely useless on something richer like XHTML. You can see examples of this in the JAXB reference implementation and a guy's blog post.
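A rough sketch of that pattern, assuming an Atom-like feed and a hand-written JAXB class named Entry (a real Atom mapping would also need the Atom namespace declared; the names here are illustrative):

    import java.io.FileInputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class JaxbStaxExample {
        public static void main(String[] args) throws Exception {
            JAXBContext context = JAXBContext.newInstance(Entry.class);
            Unmarshaller unmarshaller = context.createUnmarshaller();

            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream("feed.xml"));
            int event = reader.next();                       // step past START_DOCUMENT
            while (reader.hasNext()) {
                if (event == XMLStreamConstants.START_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    // unmarshal one <entry> as it streams past, then let it be garbage collected
                    Entry entry = unmarshaller.unmarshal(reader, Entry.class).getValue();
                    event = reader.getEventType();           // the unmarshaller already advanced the cursor
                } else {
                    event = reader.next();
                }
            }
            reader.close();
        }
    }

    @XmlRootElement(name = "entry")
    class Entry {
        public String title;   // illustrative field; a real mapping would declare the namespace
    }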
The answer depends on what performance aspects are important for your application. One factor is whether you are handling large XML documents.
For parsing, DOM-based approaches will not scale well to large documents. If you need to parse large documents, non-DOM parsers such as those using SAX and StAX will be faster and less resource intensive. However, if you need to transform XML after parsing, using either XSL or a DOM API, you are going to need the whole document in memory in any case.
For creating XML from code, StAX provides a nice API. Since the approach is stream-based, it will scale well to writing very large documents.
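For example, a minimal StAX writer sketch (the element names and the number of rows are made up):

    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class StaxWriteExample {
        public static void main(String[] args) throws Exception {
            XMLStreamWriter writer = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(new FileOutputStream("out.xml"), "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("records");
            for (int i = 0; i < 1000000; i++) {       // rows are streamed straight to disk
                writer.writeStartElement("record");
                writer.writeAttribute("id", Integer.toString(i));
                writer.writeCharacters("value " + i);
                writer.writeEndElement();
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        }
    }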
Well, most developers I know, myself included, use dom4j. If you have the time, you could write a small performance test using both frameworks; then you will see the difference. I prefer dom4j.
I have a large XML file which contains many sub-elements. I want to be able to run some XPath queries. I tried using vtd-xml in Java, but I sometimes get an OutOfMemoryError because the XML is too large to fit into memory. Is there an alternative way of processing such large XML files?
Try http://code.google.com/p/jlibs/wiki/XMLDog
It executes XPath expressions using SAX, without creating an in-memory representation of the XML documents.
SAXParser is very efficient when working with large files
What are you trying to do right now? By the sounds of it you are trying to use a DOM-based parser, which essentially loads the entire XML file into memory as a DOM representation. If you are dealing with a large file, you'll be better off using a SAX parser, which processes the XML document in a streaming fashion.
I personally recommend StAX for this.
Did you use standard VTD or extended VTD-XML? If you use extended VTD-XML then you have the option of using memory mapping... did you try that?
Using XPath might not be a very good idea if you plan on compiling many expressions dynamically in a long-lived application.
I'm not entirely sure how the Java version of XPath works, but in .NET, XPath compiles a dynamic assembly and then adds it to the app domain. Subsequent uses of the expression look at the assembly now loaded into memory.
In one case where I was using XPath, it led to a situation where, I think, this same type of mechanism was slowly filling up memory, similar to a memory leak.
My theory is that as each expression was compiled using values from the user, each compiled expression was likely unique, so a new expression was compiled and added to the app domain.
Since you can't remove an assembly from the app domain without restarting the entire app domain, memory was being consumed each time an expression was evaluated and it could not be recovered. As a result, the code was leaking memory in the form of assemblies in memory, and after a while, well, you know the results.
I am looking at a set of parsers generated for Atom, XAL, KML etc., seemingly using an automated technique with an XML pull-based parser. The clue towards the automation is the presence of "package.html" in all the XML-to-Java mapped class folders. I would like to produce a similar one for the rather large Collada 1.4 spec. My first attempt with Altova ran into small problems due to the "enum" keyword. I am sure I can fix it in the next run with appropriate renaming. Khronos admit to not designing the 1.4 spec to be friendly to automated parser generation.
The actual parsers, i.e. the XAL parser, the Atom parser etc., implement the XMLEventParser interface. I would like to know if anybody has encountered/used this pattern. If so, which tool can be used to map the XSD to a class set that simply gives access to the data components of the nodes using getters and setters?
I'm not sure I understand your question, but it appears that you want to process XML formats like Atom and represent it in objects with getters/setters. This can easily be done with JAXB.
For an example see:
http://bdoughan.blogspot.com/2010/09/processing-atom-feeds-with-jaxb.html
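As a rough sketch of the JAXB workflow, assuming classes generated from the schema with xjc (the Feed stub below and the file name are illustrative; a real Atom mapping would also declare the Atom namespace):

    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;

    public class AtomJaxbExample {
        public static void main(String[] args) throws Exception {
            // The classes are normally generated from the XSD first, e.g.:
            //   xjc -d src -p com.example.atom atom.xsd
            JAXBContext context = JAXBContext.newInstance(Feed.class);
            Unmarshaller unmarshaller = context.createUnmarshaller();
            Feed feed = (Feed) unmarshaller.unmarshal(new File("feed.xml"));
            // generated classes expose the node data through plain getters and setters
            System.out.println(feed.getTitle());
        }
    }

    @XmlRootElement(name = "feed")
    class Feed {
        private String title;
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
    }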
I am writing dynamic HTML parser functionality.
I will want to modify existing parsers and also add more parsers (I expect parsers will be modified as sites are modified, and new parsers will be needed for new sites).
I started writing generic functionality which uses an XML file with conditions and rules for each site, and while this works fine for now, I'm pretty sure it will need constant modification...
The parsers will parse and write the data to a DB.
My application runs on JBOSS 4.
Any known best practice for that?
Thanks,
Rod
Thanks for your answer. Maybe I was unclear; I realized that immediately from the rating my question got. What I am writing is a feature that manages parser execution. Each parser will parse a different text document structure. Document structures might change from time to time, and new structured documents will be added to be parsed. I don't want to recompile, build, and deploy my application for each parser change.
I want to manage the execution of each parser, as they might be executed in parallel or according to execution rules.
Might using the Java ScriptEngine be a good option?
There are lots of ways to have some code that can be modified without redeploying. Using Groovy scripts to do the parsing is one. It is a rather simple matter to check whether the script has been modified and automatically reload it.
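A rough sketch of that idea using the javax.script API (the "groovy" engine requires Groovy on the classpath; the script path, the reload policy, and the parse(String) function the script is assumed to define are all illustrative):

    import java.io.File;
    import java.io.FileReader;
    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class ReloadableParser {
        private final File scriptFile = new File("parsers/site-a.groovy");   // hypothetical script
        private final ScriptEngine engine = new ScriptEngineManager().getEngineByName("groovy");
        private long lastLoaded;

        public Object parse(String document) throws Exception {
            if (scriptFile.lastModified() > lastLoaded) {       // reload the script when it changes
                FileReader reader = new FileReader(scriptFile);
                try {
                    engine.eval(reader);
                } finally {
                    reader.close();
                }
                lastLoaded = scriptFile.lastModified();
            }
            // assumes the Groovy script defines a function called parse(String)
            return ((Invocable) engine).invokeFunction("parse", document);
        }
    }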
The design sounds convoluted to me, but IFF you prove to yourself there's not a much simpler way to accomplish the same task, you may want a rules engine like Drools...