Fast way of loading large corpus of xml files? - java

I have a large corpus of XML files (~20,000 files). When I load the entire corpus, it takes around 1 second to load each document. The XML files are pretty large (> 10,000 lines each). Each XML file represents a document with nodes for sentences, tokens in each sentence, and other similar attributes.
I am using DocumentBuilder in Java to load the XML. After loading the XML, I also need to extract some relevant XML nodes (around 100 sentences). For this I use getElementsByTagName().
Is there a faster way to load xml documents in java?

You can consider a SAX implementation. SAX is typically around 2 to 5 times faster, according to this comparison: http://dublintech.blogspot.be/2011/12/jaxb-sax-dom-performance.html. It makes a lot of sense when you only need to process part of your document rather than all of its contents.
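For illustration, a minimal SAX handler along these lines pulls out just the sentence text and discards everything else; the "sentence" element name is an assumption, so substitute whatever your schema actually uses:
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SentenceExtractor extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private boolean inSentence = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("sentence".equals(qName)) {           // assumed element name
            inSentence = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inSentence) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("sentence".equals(qName)) {
            inSentence = false;
            handleSentence(text.toString());      // process and discard; nothing else is kept in memory
        }
    }

    private void handleSentence(String sentence) {
        // do whatever you need with the extracted sentence here
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new SentenceExtractor());
    }
}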
You can also use faster disks like SSDs or maybe a virtual file system with a caching strategy.
If you have slow disks, it might even make sense to pack all of the files into one big ZIP archive, which can reduce disk access by 80 to 90%. The unzipping overhead should be offset by the gain in disk access performance.
But the fact that you are also considering Lucene implies that we are missing some crucial information about your use case, because it suggests that the operation you are optimizing is more or less a one-off anyway.
If you really need only a relatively small part of the content of each document, you could also consider extracting that information into one data structure and serializing it. That way you only need to deserialize one file instead of processing 20,000 XML documents. In case the documents change, you could also store document paths and a hash such as MD5 to detect modified documents.
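Computing the hash is straightforward; a small sketch (the class and method names are just for illustration):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class FileHash {
    // Returns the MD5 digest of a file as a hex string; compare it with a stored
    // value to decide whether a document has changed and needs to be re-parsed.
    public static String md5(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}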

Related

How to replace a string in an xml file without loading file contents into memory in java?

My application creates a very big XML file (of about 300K transactions). Each transaction has about 20 XML elements, so it produces a huge XML file. We did not use JAXB, SAX, or DOM for creating the XML file because memory is the constraint. Now I need to replace certain tag values in the XML file once it has been created. I know what is to be replaced and the value to replace it with. How can I replace those values without loading the entire file into memory? For 300K transactions, the file size comes to about 600 MB, so we do not want to load the entire file into memory just to replace a few values.
We are using Java 5. Is there a way we can do it?
You can try VTD-XML:
Memory-efficient (1.3x~1.5x the size of the XML document) random-access XML parser.
Fastest XML parser: on a Core 2 2.5 GHz desktop, VTD-XML outperforms DOM parsers by 5x~12x, delivering 150~250 MB/sec per core sustained throughput.
Incremental-update-capable XML parser that can cut, paste, split, and assemble XML documents with maximum efficiency.
Available in C, C++, C# and Java.
See the example of modifying XML.
Everything I've ever read on this topic indicates that you can't do this without loading the file into memory or streaming it to another file. That's probably what you'll end up needing to do - stream your source into a new file, modifying as you go.
More info about that process - http://docs.oracle.com/javaee/5/tutorial/doc/bnbfl.html#bnbgq
I like the way Stephen C addresses your problem in an answer here - How to modify a huge XML file by StAX?
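For illustration, a stream-copying sketch with StAX looks roughly like this; OLD_VALUE and NEW_VALUE are placeholders, and note that javax.xml.stream only ships with the JDK from Java 6 onwards, so on Java 5 you would need a StAX implementation such as Woodstox on the classpath:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class StreamReplace {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("transactions.xml"));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("transactions-fixed.xml"));
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Replace the character data we are interested in, copy everything else untouched.
            // Note: a parser may split text into several character events; production code
            // would need to buffer adjacent ones before comparing.
            if (event.isCharacters() && "OLD_VALUE".equals(event.asCharacters().getData())) {
                writer.add(events.createCharacters("NEW_VALUE"));
            } else {
                writer.add(event);
            }
        }
        writer.close();
        reader.close();
    }
}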
You could try a streaming transformation using XSLT 3.0 (specifically, Saxon-EE).
I'm not sure what you mean by "tag values" (it's so much easier if people use the correct terminology...) but if you mean the values of text nodes, then you could write a streaming transformation something like this:
<xsl:mode streamable="yes" on-no-match="shallow-copy"/>
<xsl:template match="xyz/text()[.='old value']">
<xsl:text>new value</xsl:text>
</xsl:template>
with further rules for additional substitutions. You can also, of course, have rules that rename or delete selected elements, etc.
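Running such a stylesheet in streaming mode from Java would look roughly like this with Saxon's s9api interface (the file names are placeholders, and streaming requires Saxon-EE, hence the licensed-edition flag):
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.Xslt30Transformer;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;

public class StreamingTransform {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(true);   // true = licensed edition, needed for streaming
        XsltCompiler compiler = processor.newXsltCompiler();
        XsltExecutable stylesheet = compiler.compile(new StreamSource(new File("replace.xsl")));
        Xslt30Transformer transformer = stylesheet.load30();
        Serializer out = processor.newSerializer(new File("output.xml"));
        // Apply the streamable templates directly to the input source, never building a tree.
        transformer.applyTemplates(new StreamSource(new File("huge-input.xml")), out);
    }
}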

Best way to compare two very large XML files record by record

I have two large XML files (3 GB, 80,000 records). One is an updated version of the other. I want to identify which records changed (were added/updated/deleted). There are some timestamps in the files, but I am not sure they can be trusted. The same goes for the order of records within the files.
The files are too large to load into memory as XML (even one, never mind both).
The way I was thinking about it is to do some sort of parsing/indexing of content offsets within the first file at record level, with an in-memory map of IDs, then stream the second file and use random access to compare the records that exist in both. This would probably take 2 or 3 passes, but that's fine. However, I cannot find an easy library/approach that would let me do it. VTD-XML with VTDNavHuge looks interesting, but I cannot tell from the documentation whether it supports random-access revisiting and loading of records based on pre-saved locations.
Java library/solution is preferred, but C# is acceptable too.
Just parse both documents simultaneously using SAX or StAX until you encounter a difference, then exit. This doesn't keep the documents in memory. Any standard XML library will support S(t)AX. The only problem would be if you consider a different order of elements to be insignificant...
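A rough sketch of that lockstep approach with two StAX readers; it assumes the records appear in the same order in both files and only compares event types, element names, and text (real code would also compare attributes and skip insignificant whitespace):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class XmlDiff {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader a = factory.createXMLStreamReader(new FileInputStream(args[0]));
        XMLStreamReader b = factory.createXMLStreamReader(new FileInputStream(args[1]));

        while (a.hasNext() && b.hasNext()) {
            int eventA = a.next();
            int eventB = b.next();
            if (eventA != eventB) {
                report(a, b);
                return;
            }
            if (eventA == XMLStreamConstants.START_ELEMENT
                    && !a.getLocalName().equals(b.getLocalName())) {
                report(a, b);
                return;
            }
            if (eventA == XMLStreamConstants.CHARACTERS
                    && !a.getText().equals(b.getText())) {
                report(a, b);
                return;
            }
        }
        if (a.hasNext() || b.hasNext()) {
            System.out.println("Files have different lengths");
        } else {
            System.out.println("No difference found");
        }
    }

    private static void report(XMLStreamReader a, XMLStreamReader b) {
        System.out.println("First difference near line " + a.getLocation().getLineNumber()
                + " / " + b.getLocation().getLineNumber());
    }
}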

How can I efficiently parse 200,000 XML files in Java?

I have 200,000 XML files I want to parse and store in a database.
Here is an example of one: https://gist.github.com/902292
This is about as complex as the XML files get. This will also run on a small VPS (Linode) so memory is tight.
What I am wondering is:
1) Should I use a DOM or SAX parser? DOM seems easier and faster, since each XML file is small.
2) Where is a simple tutorial on said parser? (DOM or SAX)
Thanks
EDIT
I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM and I thought that since the average file size is about 3 KB to 4 KB it would easily be able to hold that in memory.
However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.
Here is part of the project.
https://gist.github.com/905550#file_xm_lparser.java
Should I ditch DOM now and just use SAX? It just seems like DOM should be able to handle files this small.
Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).
Thanks
Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).
divide and conquer
Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use Spring Batch if this is a recurring task, in which case you can benefit from a high-level framework.
API
Using SAX can help but is not necessary, since you are not going to keep the parsed model around (i.e. all you are doing is parsing, inserting, and then letting go of the parsed data, at which point the objects become eligible for GC). Look into a simple API like JDOM.
Other ideas
You can implement a producer/consumer model where the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the database. The advantage here is that you can batch the inserts to gain more performance; a sketch is shown below.
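A rough sketch of the divide-and-conquer idea using Java 5 Executors with batched inserts; the directory name, batch size, and processBatch body are placeholders for your own parsing and JDBC code:
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    private static final int BATCH_SIZE = 500;

    public static void main(String[] args) throws Exception {
        File[] files = new File("xml-corpus").listFiles();        // placeholder directory
        ExecutorService pool = Executors.newFixedThreadPool(4);   // tune to your cores / DB connections

        List<File> batch = new ArrayList<File>();
        for (File file : files) {
            batch.add(file);
            if (batch.size() == BATCH_SIZE) {
                submit(pool, batch);
                batch = new ArrayList<File>();
            }
        }
        if (!batch.isEmpty()) {
            submit(pool, batch);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void submit(ExecutorService pool, final List<File> batch) {
        pool.submit(new Runnable() {
            public void run() {
                processBatch(batch);
            }
        });
    }

    // Placeholder: parse each file (DOM is fine at 3-4 KB), collect the POJOs,
    // and insert the whole batch with one batched JDBC statement.
    private static void processBatch(List<File> batch) {
        for (File file : batch) {
            // parse 'file' and add the result to the insert batch
        }
    }
}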
SAX always beats DOM on speed, but since you say the XML files are small you may proceed with a DOM parser. One thing you can do to speed things up is to create a thread pool and do the database operations in it. Multithreaded updates will significantly improve the performance.
Go with SAX, or if you want, StAX. Forget about DOM. Use an efficient library like Aalto.
I am sure that parsing will be quite cheap compared to making the database requests.
But 200k is not such a big number if you only need to do this once.
SAX will be faster than DOM; this could well matter if you have 200,000 files to parse.
StAX is faster than SAX, and both are much faster than DOM. If performance is super critical, you can also think about building a special compiler to parse the XML files. But with StAX, lexing and parsing are usually not the bottleneck; the "after-processing" is.

Processing XML file with Huge data

I am working on an application which has the following requirements:
Download a ZIP file from a server.
Uncompress the ZIP file, get the content (which is in XML format) from this file into a String.
Pass this content into another method for parsing and further processing.
Now, my concern here is that the XML file may be huge, say 100 MB, and my JVM has only 512 MB of memory, so how can I get this content in chunks, pass it for parsing, and then insert the data into PL/SQL tables?
Since there can be multiple requests running at the same time, and considering the 512 MB of memory, what would be the best possible way to process this?
How can I get the data in chunks and pass it as a stream for XML parsing?
Java's XMLReader is a SAX2 parser. Where a DOM parser reads the whole XML file and creates an (often large) data structure (usually a tree) to represent its contents, a SAX parser lets you register a handler that will be called when pieces of the XML document are recognized. In that callback code, you can save only enough data to do what you need -- e.g. you might save all the fields that will end up as a single row in the database, insert that row, and then discard the data. With this type of design, your program's memory consumption depends less on the file size than on the complexity and size of a single logical data item (in your case, the data that will become one row in the database).
Even if you did use a DOM-style parser, things might not be quite as bad as you expect. XML is pretty verbose, so (depending on how it's structured and such) a 100 MB file will often represent only 10-20 MB of data, and as little as 5 MB of data wouldn't be particularly rare or unbelievable.
Any SAX parser should work since it won't load the entire XML file into memory like a DOM parser.
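For illustration, the unzipped content can be fed straight to a SAX parser without ever building the String mentioned in the question. A rough sketch, where the ZIP file name, the assumption of a single XML entry, and the empty DefaultHandler are placeholders:
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ZipXmlProcessor {
    public static void main(String[] args) throws Exception {
        ZipInputStream zip = new ZipInputStream(new FileInputStream("download.zip"));
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.getName().endsWith(".xml")) {
                SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                // The parser reads directly from the decompressing stream, so the
                // 100 MB document never exists in memory as a String.
                parser.parse(zip, new DefaultHandler());   // replace DefaultHandler with your row-building handler
                break;   // assuming a single XML entry; some parsers close the stream when done
            }
        }
        zip.close();
    }
}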

Parsing very large XML documents (and a bit more) in java

(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- and will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will partly overlap the portion of the document dealt with by the second operation, i.e. I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure whether it's possible to parse XML in the way I'm describing: by walking over as much of the document as is required to gather the second step's information, then rewinding the document and passing over it again to split it into jobs, ideally releasing all the parts of the document that are no longer in use after they have been processed.
StAX is the right way. I would recommend looking at Woodstox.
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event-based parser such as SAX, but you have more control over when to stop reading, which elements to pull, and so on.
The usability of this solution will depend a lot on what your extension classes are actually doing, whether you have control over their implementation, and so on.
The main point is that if the document is very large, you probably want to use an event-based parser and not a tree-based one, so you will not use a lot of memory.
Implementations of StAX are available from Sun (SJSXP), Codehaus, and a few other providers.
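As a small illustration of the pull model, the loop below reads only until it finds the element it needs and then stops, so the rest of the document is never parsed; the "metadata" element name is hypothetical, and getElementText() assumes the element has text-only content:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullUntilFound {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("big.xml"));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "metadata".equals(reader.getLocalName())) {   // hypothetical element the extension needs
                String value = reader.getElementText();
                System.out.println("found: " + value);
                break;   // stop pulling; the rest of the document is never parsed
            }
        }
        reader.close();
    }
}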
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the part the extension class needs is very far into the file, though, this might become extremely memory intensive.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
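A minimal sketch of the mark()/reset() idea; the buffer size is arbitrary, and as noted above the whole document may end up buffered in memory if the mark limit is large:
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RewindableRead {
    static void process(InputStream decrypted) throws IOException {
        BufferedInputStream in = new BufferedInputStream(decrypted, 8 * 1024 * 1024);
        in.mark(Integer.MAX_VALUE);  // everything read from here on is retained so reset() can rewind
        // phase 2: hand 'in' to the first extension class so it can read what it needs
        in.reset();                  // rewind to the marked position
        // phase 3: hand 'in' to the splitting extension class, which now sees the document from the start
    }
}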
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
new DecryptingInputStream(),
new MyHandler()
);
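A skeleton for such a stream might look like this; the constructor shape and the byte-by-byte decrypt() are purely illustrative, since the real decryption depends on the client's algorithm:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: the actual decryption is whatever the client's algorithm requires.
public class DecryptingInputStream extends FilterInputStream {

    public DecryptingInputStream(InputStream encrypted) {
        super(encrypted);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return (b == -1) ? -1 : decrypt((byte) b) & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) {
            buf[off + i] = decrypt(buf[off + i]);
        }
        return n;
    }

    // Placeholder for the client's byte-level decryption; a real cipher would
    // usually work on blocks rather than single bytes.
    private byte decrypt(byte b) {
        return b;
    }
}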
You might be interested in XOM:
XOM is fairly unique in that it is a dual streaming/tree-based API. Individual nodes in the tree can be processed while the document is still being built. This enables XOM programs to operate almost as fast as the underlying parser can supply data. You don't need to wait for the document to be completely parsed before you can start working with it.
XOM is very memory efficient. If you read an entire document into memory, XOM uses as little memory as possible. More importantly, XOM allows you to filter documents as they're built so you don't have to build the parts of the tree you aren't interested in. For instance, you can skip building text nodes that only represent boundary white space, if such white space is not significant in your application. You can even process a document piece by piece and throw away each piece when you're done with it. XOM has been used to process documents that are gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large XML document, only building specific nodes, processing them, and discarding them. It is very similar to a SAX approach, but has a lot more parsing capability built in, so a streaming parse can be achieved pretty easily.
If you want to work at a higher level, look at Nux. This provides a high-level streaming XPath API that reads into memory only the amount of data needed to evaluate the XPath expression.
