I implemented VTD-Parser in java which successfully parsed XML file of around 500 MB easily and was able to write in an excel.
I understand that DOM parser first created a tree node structure and then get the data, and SAX is an event based parser. But what makes VTD parse the file so easy and efficient?
I tried to search, I got many examples of implementation but never got the logic .I tried the below link to get the idea but did not get the clear picture.
VTD_Parser
If any one can explain the brief idea.
According to the Wikipedia page on the subject, VTD-XML (Virtual Token Descriptor for XML) utilizes non exctractive parsing, meaning it does not extract the data out of the document into some memory based data-structure, but rather builds a data structure that contains pointers (in the form of offset and length) to the original document. This processing is clearly the most memory efficient, but I believe it comes at the cost of performance since the inevitable IO operation is done when the data is requested (but caching can help a lot here).
It seems to me that tis processing is most usefull when the input is very big and the requested data is very small (kind of data mining scenario)
Related
I need a xml parser to parse a file that is approximately 1.8 gb.
So the parser should not load all the file to memory.
Any suggestions?
Aside the recommended SAX parsing, you could use the StAX API (kind of a SAX evolution), included in the JDK (package javax.xml.stream ).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
Use a SAX based parser that presents you with the contents of the document in a stream of events.
StAX API is easier to deal with compared to SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control and being event-driven makes sense. The api is a little hard to get a grip on, you have to pay attention to some things like when the characters() method is called, but the basic idea is you write a content handler that gets called when the start and end of each xml element is read. So you can keep track of the current xpath in the document, identify which paths have which data you're interested in, and identify which path marks the end of a chunk that you want to save or hand off or otherwise process.
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
FIrstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for those text using its UTF-16.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StaX. It's easier to use than SaX because you don't need to write callbacks (you essentially just loop over all elements of the while until you're done) and it has (AFAIK) no limit as to the size of the files it can process.
I'm trying to process the wikipedia dump found here. Specifically with the file - enwiki-latest-pages-articles-multistream.xml.bz2. This is about 46GB uncompressed. I am currently using the STAX parser in Java (xerces) and am able to extract 15K page elements per second. However the bottleneck seems to be the parser and I have toyed around with aalto-xml but it hasn't helped.
Since I'm parsing each page element in the Storm spout, it is a bottleneck. However, I thought I could simply emit the text between ... tags and have several bolts process each of those page elements in parallel. This would reduce the amount of work the Storm spout has to perform. However I am not sure of the specific approach to take here. If I use a parser to extract the content between tags then that would mean it would parse every single element from the beginning of the tag until the end. Is there a way to eliminate this overhead in a standard SAX/STAX parser?
I tried something similar in an attempt to parallelise.
anyway since I use the wikipedia data for many tasks, it was just simpler to generate a one-article-perline-dump which then I can run many experiments from in parallel.
It takes only a few minutes to run, then I have a dump which I can feed to Spark (in your case Storm) very easily.
If you want to use our tools check:
https://github.com/idio/wiki2vec
There is no way to do random access on XML document, but many Java XML parsers have somewhat more efficient skipping of unused content: Aalto and Woodstox for example defer decoding of String values (and construction of String objects) so that if tokens are skipped, no allocations are needed.
One thing to make sure with Stax is to NOT use Event API unless there is specific need to buffer contents -- it does not offer much functionality over basic Streaming API (XMLStreamReader), but does add significant allocation overhead since every XMLEvent is constructed, regardless of whether it is needed. Streaming API, on the other hand, only indicates type of Event/Token, and lets caller decide whether content (attributes, textual value) is needed and can avoid most of object allocations.
I have the following issue: I have very large XML files (like 300+ Megs), and I need to parse them in order to add some of their values to the db. The structure of these files is also very complex. I want to use Stax Parser as it offers the nice possibility of pull-parsing (and thus processing) only parts of the XML file at a time, and thus not loading the whole thing in memory, but on the other hand getting the values with Stax (at least on these XML files) is cumbersome, I need to write a ton of code. From this latter point of view it will immensly help me if I could marshall the XML file to Java objects (like JAX-B does) however this would load the whole file plus a ton of Object instances in memory all at once.
My question is, is there some way to pull-parse (or just partially parse) the file sequentially, and then marshall only those parts to Java objects so I can deal with them easily without bogging down on memory?
I would recommend Eclipse EMF. But it has the same problem, if you give it the file name it would parse the whole thing. Although there are some options to reduce how much is loaded, but I didn't bother much as we run on machines with 96 GB RAM. :)
Anyway, If your XML format is well defined, then one workaround is to fool the EMF by breaking down the whole file into several smaller (but still well defined) XML snippets. Then feed each snippet one after the other. I don't know JAX-B, but perhaps the same workaround can be applied there as well. Which I would recommend, because EMF is too big a hammer for such a small issue.
Just to elaborate a bit if your XML looks like this:
<tag1>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
............
<tag2>
<tag3/>
<tag4>
<tag5/>
</tag4>
<tag6/>
<tag7/>
</tag2>
</tag1>
Then it can be broken down into one XML each starting with <tag2> and ending with </tag2>. And in java most parsers would accept a Stream, so just parse using whatever you want, create some StringStream or something for each <tag2> in a loop and pass to JAX-B or EMF.
HTH
Well, first off I wanna thank the two persons answering my questions, but I finally ended up not using those propositions partly because those proposed technologies are a bit far from the Java let's say "standard XML parsing" and it feels weird going so far when there's a similar tool already present in Java and partly also because in fact I did found a solution that only uses Java API's to accomplish this.
I will not detail too much the solution I found, because I've already finished the implementation, and it's quite a big chunk of code to place here (I use Spring Batch on top of it all, with a ton of configuration and stuff).
I will however make a small comment on what I finally ended up doing:
The big idea here is the fact that if you have an XML document AND it's corresponding XSD schema, you can parse & marshall it with JAXB, and you can do it in chunks, and said chunks can be read with an even parser such as STAX and then passed to the JAXB Marshaller.
This practically means that you must first decide where's a good place in your XML file where you can say "this part here has A LOT of repetive structure, I will treat those repetitions one at a time". Those repetitive parts are usually the same (child) tag repeated a lot inside a parent tag. So all you have to do is make an event listener in your STAX parser that is triggered at the start of each of those child tags, than stream over to JAXB the content of that child tag, marshall it with JAXB and process it.
Really the idea is excellently described in this article, which I followed (true, it's from 2006, but it deals with JDK 1.6 which at that time was pretty new, so version-wise it's not that old at all):
http://www.javarants.com/2006/04/30/simple-and-efficient-xml-parsing-using-jaxb-2-0/
Document projection might be the answer here. Saxon and a number of other XQuery processors offer this as an option. If you have a reasonably simple query that selects a small amount of data from a large document, the query processor analyses the query to work out which parts of the tree need to be available for the query, and which can be discarded during processing. The resulting tree can often be only 1% of the size of the full document. Details for Saxon here:
http://saxonica.com/documentation/sourcedocs/projection.xml
I have 200,000 XML files I want to parse and store in a database.
Here is an example of one: https://gist.github.com/902292
This is about as complex as the XML files get. This will also run on a small VPS (Linode) so memory is tight.
What I am wondering is:
1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML is small.
2) Where is a simple tutorial on said parser? (DOM or SAX)
Thanks
EDIT
I tried the DOM route even though everyone suggested SAX. Mainly because I found an "easier" tutorial for DOM and I thought that since the average file size was about 3k - 4k it would easily be able to hold that in memory.
However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.
Here is part of the project.
https://gist.github.com/905550#file_xm_lparser.java
Should I ditch DOM now and just use SAX? Just seems like with such small files DOM should be able to handle it.
Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).
Thanks
Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).
divide and conquer
Split 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple or use spring-batch if this is a recurring task in which case you can benefit from a high level framework.
API
Use of SAX can help but not necessary as you are not going to keep the parsed model around (i.e. all you are doing is parsing, inserting and then let go of the parsed data at which time the objects are eligible for GC). Look into a simple API like JDOM.
Other ideas
You can implement a producer/consumer kind of model where producer produces the pojo's created after parsing and consumer takes the pojo's and inserts them into db. the advantage here is that you can batch the inserts to gain more performance.
SAX always beats DOM at speed. But since you say XML files are small you may proceed with DOM parser. One thing you can do to speedup is create a Thread-Pool and do the database operations in it. Multithreaded updates will significantly improve the performance.
Lalith
Go with SAX, or if you want, StAX. Forget about DOM. Use an effective library like aalto.
I am sure that parsing will be quite cheap compared to making the database requests.
But 200k is not such a big number if you only need to do this once.
SAX will be faster than DOM, this could well be an issue if you have 200,000 files to parse.
StAX is faster then SAX and this is much faster then DOM. If performance is super critical you can also think about building a special compiler to parse the XML files. But usually lexing and parsing is not that much of an issue with StAX but the "after-processing".
(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, ie: I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure if it's possible to parse XML in the way I'm describing; by walking over as much of the document is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
Stax is the right way. I would recommend looking at Woodstox
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event based parser like SAX, but that you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event based parser and not a tree based, so you will not use a lot of memory.
Implementations of StAX can be found from SUN (SJSXP), Codehaus or a few other providers.
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the parts the extension class needs is very far into the file, then this might become extremely memory intensive, 'though.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
new DecryptingInputStream(),
new MyHandler()
);
You might be interested by XOM:
XOM is fairly unique in that it is a
dual streaming/tree-based API.
Individual nodes in the tree can be
processed while the document is still
being built. The enables XOM programs
to operate almost as fast as the
underlying parser can supply data. You
don't need to wait for the document to
be completely parsed before you can
start working with it.
XOM is very memory efficient. If you
read an entire document into memory,
XOM uses as little memory as possible.
More importantly, XOM allows you to
filter documents as they're built so
you don't have to build the parts of
the tree you aren't interested in. For
instance, you can skip building text
nodes that only represent boundary
white space, if such white space is
not significant in your application.
You can even process a document piece
by piece and throw away each piece
when you're done with it. XOM has been
used to process documents that are
gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large xml document only building specific nodes, processing them and discarding them. It is very similar to a sax approach, but has a lot more parsing capability built in so a streaming parse can be achieved pretty easily.
If you want to work at higher level look at NUX. This provides a high level streaming xpath API that only reads the amount of data into memory needed to evaluate the xpath.