How can I efficiently parse 200,000 XML files in Java?

I have 200,000 XML files I want to parse and store in a database.
Here is an example of one: https://gist.github.com/902292
This is about as complex as the XML files get. This will also run on a small VPS (Linode) so memory is tight.
What I am wondering is:
1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML is small.
2) Where is a simple tutorial on said parser? (DOM or SAX)
Thanks
EDIT
I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM and I thought that, since the average file size is only about 3k - 4k, it would easily fit in memory.
However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.
Here is part of the project.
https://gist.github.com/905550#file_xm_lparser.java
Should I ditch DOM now and just use SAX? It just seems like, with files this small, DOM should be able to handle it.
Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).
Thanks

Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).

Divide and conquer
Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use Spring Batch if this is a recurring task, in which case you can benefit from a higher-level framework.
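For illustration, a minimal sketch of the executor approach, assuming a directory of XML files; parseAndInsert() is a hypothetical stand-in for whatever parse-and-store logic you already have:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParse {
    public static void main(String[] args) throws InterruptedException {
        File[] files = new File("/path/to/xml/dir").listFiles(); // path is a placeholder
        // Keep the pool small on a memory-constrained VPS.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final File f : files) {
            pool.submit(new Runnable() {
                public void run() {
                    parseAndInsert(f); // hypothetical: your existing parse + DB insert
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void parseAndInsert(File f) { /* parse the file and insert into the DB */ }
}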
API
Using SAX can help, but it isn't necessary since you are not going to keep the parsed model around (i.e. all you are doing is parsing, inserting, and then letting go of the parsed data, at which point the objects become eligible for GC). Look into a simple API like JDOM.
Other ideas
You can implement a producer/consumer model where the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the database. The advantage here is that you can batch the inserts for better performance.
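A rough sketch of that producer/consumer shape; HotelDoc and insertBatch() below are placeholders for the parsed POJO type and the actual database call from your own code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BatchingConsumer implements Runnable {
    static class HotelDoc { /* placeholder POJO produced by the parser */ }

    private static final int BATCH_SIZE = 100;
    private final BlockingQueue<HotelDoc> queue;

    BatchingConsumer(BlockingQueue<HotelDoc> queue) { this.queue = queue; }

    public void run() {
        List<HotelDoc> batch = new ArrayList<HotelDoc>(BATCH_SIZE);
        try {
            while (true) {
                batch.add(queue.take());                         // wait for the next POJO
                queue.drainTo(batch, BATCH_SIZE - batch.size()); // grab whatever else is ready
                insertBatch(batch);                              // one DB round-trip per batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void insertBatch(List<HotelDoc> batch) { /* e.g. a bulk insert into Mongo */ }

    public static void main(String[] args) {
        BlockingQueue<HotelDoc> q = new ArrayBlockingQueue<HotelDoc>(1000);
        new Thread(new BatchingConsumer(q)).start();
        // parser threads call q.put(pojo) for every document they finish parsing
    }
}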

SAX always beats DOM on speed, but since you say the XML files are small you may proceed with a DOM parser. One thing you can do to speed things up is create a thread pool and do the database operations in it. Multithreaded updates will significantly improve performance.
Lalith

Go with SAX or, if you want, StAX. Forget about DOM. Use an efficient library like Aalto.
I am sure that parsing will be quite cheap compared to making the database requests.
But 200k is not such a big number if you only need to do this once.
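For a simple starting point, here is a minimal StAX (cursor API) sketch using the JDK's built-in XMLInputFactory; the file name and the "name" element are made-up examples, since the structure from your gist isn't reproduced here:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        // "hotel.xml" and "name" are invented examples; substitute your own file/elements.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("hotel.xml"));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "name".equals(reader.getLocalName())) {
                    System.out.println("name = " + reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
    }
}

If Aalto is on the classpath, XMLInputFactory.newInstance() will normally pick it up without any change to this code.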

SAX will be faster than DOM, and that could well matter when you have 200,000 files to parse.

StAX is faster than SAX, and both are much faster than DOM. If performance is super critical you can also think about building a special compiler to parse the XML files. With StAX, though, lexing and parsing are usually not the bottleneck; the "after-processing" is.

Related

Parallel Processing Wikipedia's XML Data Dump with Storm

I'm trying to process the wikipedia dump found here. Specifically with the file - enwiki-latest-pages-articles-multistream.xml.bz2. This is about 46GB uncompressed. I am currently using the STAX parser in Java (xerces) and am able to extract 15K page elements per second. However the bottleneck seems to be the parser and I have toyed around with aalto-xml but it hasn't helped.
Since I'm parsing each page element in the Storm spout, it is a bottleneck. I thought I could simply emit the text between ... tags and have several bolts process each of those page elements in parallel, which would reduce the amount of work the Storm spout has to perform. However, I am not sure of the specific approach to take here. If I use a parser to extract the content between the tags, it would still parse every single element from the start tag to the end tag. Is there a way to eliminate this overhead in a standard SAX/StAX parser?
I tried something similar in an attempt to parallelise.
Anyway, since I use the Wikipedia data for many tasks, it was simpler to generate a one-article-per-line dump, which I can then run many experiments on in parallel.
It takes only a few minutes to generate, and then I have a dump that I can feed to Spark (in your case Storm) very easily.
If you want to use our tools check:
https://github.com/idio/wiki2vec
There is no way to do random access on an XML document, but many Java XML parsers have somewhat more efficient skipping of unused content: Aalto and Woodstox, for example, defer decoding of String values (and construction of String objects), so that if tokens are skipped, no allocations are needed.
One thing to make sure of with Stax is NOT to use the Event API unless there is a specific need to buffer contents: it does not offer much functionality over the basic Streaming API (XMLStreamReader), but it does add significant allocation overhead, since every XMLEvent is constructed regardless of whether it is needed. The Streaming API, on the other hand, only indicates the type of event/token and lets the caller decide whether the content (attributes, textual value) is needed, so it can avoid most object allocations.
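To make that concrete, a sketch of the cursor style applied to something like the Wikipedia dump; the "title" element name and file name are assumptions, and the point is that getElementText() is only called for the elements you actually want, so everything else is skipped without building Strings or XMLEvent objects:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SkimTitles {
    public static void main(String[] args) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("enwiki-pages.xml")); // file name is a placeholder
        int count = 0;
        while (r.hasNext()) {
            // r.next() only reports the token type; nothing is materialized here.
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(r.getLocalName())) {
                String title = r.getElementText(); // the only String we ever build
                if (++count <= 5) System.out.println(title);
            }
        }
        r.close();
        System.out.println(count + " titles seen");
    }
}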

Fast way of loading large corpus of xml files?

I have a large corpus of XML files (~20,000 files). When I load the entire corpus, it takes around 1 sec to load each document. The XMLs are pretty large (> 10,000 lines). Each XML represents a document with nodes for sentences, tokens in the sentences and other similar attributes.
I am using DocumentBuilder in Java to load the XML. After loading the XML, I also need to extract some relevant XML nodes (around 100 sentences). For this I used getElementsByTagName().
Is there a faster way to load xml documents in java?
You can consider a SAX implementation. SAX is typically 2 to 5 times faster, according to this benchmark: http://dublintech.blogspot.be/2011/12/jaxb-sax-dom-performance.html. It makes a lot of sense when you only need to process part of your document and not all of its contents.
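As a sketch of what that can look like for this corpus, assuming the text lives in <sentence> elements (the element name is a guess at your schema; adjust it to match):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SentenceHandler extends DefaultHandler {
    private final List<String> sentences = new ArrayList<String>();
    private StringBuilder current;   // non-null only while inside a <sentence>

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("sentence".equals(qName)) current = new StringBuilder();
    }
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);
    }
    public void endElement(String uri, String local, String qName) {
        if ("sentence".equals(qName)) { sentences.add(current.toString()); current = null; }
    }

    public static void main(String[] args) throws Exception {
        SentenceHandler h = new SentenceHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new File("doc.xml"), h);
        System.out.println(h.sentences.size() + " sentences extracted");
    }
}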
You can also use faster disks like SSDs or maybe a virtual file system with a caching strategy.
If you have slow disks it might even make sense to zip all of the files into one big archive, which can reduce disk access by 80 to 90%. The unzipping overhead should be offset by the gain in disk access performance.
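A sketch of the zip idea using the standard java.util.zip API; the archive name is made up, and the DOM loading mirrors what the question already does:

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ZippedCorpus {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        ZipFile zip = new ZipFile("corpus.zip"); // placeholder archive holding the 20,000 files
        try {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.getName().endsWith(".xml")) {
                    // One large sequential read instead of thousands of small file opens.
                    Document doc = builder.parse(zip.getInputStream(entry));
                    // ... extract the ~100 sentence nodes you need, then let doc go
                }
            }
        } finally {
            zip.close();
        }
    }
}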
But the fact that you are also considering Lucene implies that we are missing some crucial information about your use case, because it suggests that the action you are optimizing is more or less a one-off anyway.
If you really need only a relatively small part of the content of each document, you could also consider storing that information in one data structure and serializing it. That way you only need to deserialize one file instead of processing 20,000 XML documents. In case the documents change, you could also store document paths and a hash such as MD5 to detect modified documents.

Efficient way to read a small part of a BIG XML file in Java

We have a new requirement:
There are some BIG XML files that keep coming into our system, and we need to process them immediately and quickly using Java. Each file is huge, but the information required for our processing is inside one element which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we immediately get an out-of-memory error due to its size. What is an efficient way in Java to get the ..data..data..data.. element without loading or reading the file line by line? Is there any SAX parser that I can use to get this done?
Thank you
SAX parsers are event based and much faster, because they do only what you need: they don't build the whole XML document in memory. There is a SAXParser available in the standard Java distribution.
I had to parse huge files (1G-2G) in a previous project and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer to keep a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read only part of a file, you will still need to read through the file to identify the part of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
StAX is another option based on streaming the data, like SAX, but it benefits from a more friendly approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.
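For example, a streaming sketch that stops as soon as the small element has been read, so the rest of the huge file is never pulled; "data" and "huge.xml" stand in for whatever your element and file are actually called:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ExtractSmallPart {
    public static void main(String[] args) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("huge.xml")); // placeholder file
        String payload = null;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "data".equals(r.getLocalName())) {   // placeholder element name
                payload = r.getElementText();
                break;                                      // stop; skip the rest of the file
            }
        }
        r.close();
        System.out.println(payload);
    }
}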

Android XML parser for simple xml node strings

I need to parse a series of simple XML nodes (in String format) as they arrive from a persistent socket connection. Is a custom Android SAX parser really the best way? It seems slightly overkill to do it that way.
I had naively hoped I could cast the strings to XML then reference the names / attributes with dot syntax or similar.
I'd use the DOM Parser. It isn't as efficient as SAX, but if it's a simple XML file that's not too large, it's the easiest way to get up and moving.
Great tutorial on how to use it here: http://tutorials.jenkov.com/java-xml/dom.html
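A small sketch of that, parsing an XML string received off the socket with the built-in DOM parser; the <msg> element and its "type" attribute are invented purely for illustration:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SmallDomParse {
    public static void main(String[] args) throws Exception {
        String xml = "<msg type=\"chat\"><body>hello</body></msg>"; // invented example payload
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        System.out.println(root.getAttribute("type"));                                   // chat
        System.out.println(root.getElementsByTagName("body").item(0).getTextContent());  // hello
    }
}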
You might want to take a look at the XPath library. This is a simpler way of parsing XML; it's similar to building SQL queries or regexes.
http://www.ibm.com/developerworks/library/x-javaxpathapi.html
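A quick sketch of that with the JDK's javax.xml.xpath; the XML and the expression here are invented examples:

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExample {
    public static void main(String[] args) throws Exception {
        String xml = "<msgs><msg id=\"1\">hi</msg><msg id=\"2\">bye</msg></msgs>"; // invented
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select the text of the <msg> whose id attribute is 2.
        String second = xpath.evaluate("/msgs/msg[@id='2']",
                new InputSource(new StringReader(xml)));
        System.out.println(second);   // bye
    }
}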
I'd go for a SAX Parser:
It's much more efficient in terms of memory, especially for larger files: you don't parse an entire document into objects; instead, the parser performs a single unidirectional pass over the document and triggers events as it goes.
It's actually surprisingly easy to implement: for instance take a look at Working with XML on Android by IBM. It's only listings 5 and 6 that are the actual implementation of their SAX parser so it's not a lot of code.
You can try to use Konsume-XML: SAX/STAX/Pull APIs are too low-level and hard to use; DOM requires the XML to fit into memory and is still clunky to use. Konsume-XML is based on Pull and therefore it's extremely efficient, yet the API is higher-level and much easier to use.

Parsing very large XML documents (and a bit more) in java

(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, ie: I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: by walking over only as much of the document as is required to gather the second step's information, and then rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
Stax is the right way. I would recommend looking at Woodstox.
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event-based parser such as SAX, but you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event-based parser and not a tree-based one, so you will not use a lot of memory.
Implementations of StAX can be found from SUN (SJSXP), Codehaus or a few other providers.
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the part the extension class needs is very far into the file, though, this might become extremely memory intensive.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
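A minimal sketch of the mark()/reset() idea; readSomePortion() and splitIntoJobs() are hypothetical stand-ins for the two extension phases (in your setup the decrypting stream would sit underneath the BufferedInputStream), and the 64 MB read limit is an arbitrary example:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class MarkAndRewind {
    public static void main(String[] args) throws Exception {
        InputStream in = new BufferedInputStream(
                new FileInputStream("big.xml"), 8 * 1024);  // placeholder file
        in.mark(64 * 1024 * 1024);   // how far we may read before reset() stops working
        readSomePortion(in);         // hypothetical: the second-phase extension reads here
        in.reset();                  // rewind to the mark for the splitting phase
        splitIntoJobs(in);           // hypothetical: the third-phase extension
        in.close();
    }

    static void readSomePortion(InputStream in) { /* ... */ }
    static void splitIntoJobs(InputStream in) { /* ... */ }
}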
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
new DecryptingInputStream(),
new MyHandler()
);
You might be interested in XOM:

XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.

XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. It shows a technique for performing a streaming parse of a large XML document, building only specific nodes, processing them and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
If you want to work at a higher level, look at NUX. It provides a high-level streaming XPath API that reads only the amount of data into memory needed to evaluate the XPath.
