Extract part of XML file [duplicate] - java

I need a xml parser to parse a file that is approximately 1.8 gb.
So the parser should not load all the file to memory.
Any suggestions?

Aside the recommended SAX parsing, you could use the StAX API (kind of a SAX evolution), included in the JDK (package javax.xml.stream ).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html

Use a SAX based parser that presents you with the contents of the document in a stream of events.

StAX API is easier to deal with compared to SAX. Here is a short tutorial

Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.

Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control and being event-driven makes sense. The api is a little hard to get a grip on, you have to pay attention to some things like when the characters() method is called, but the basic idea is you write a content handler that gets called when the start and end of each xml element is read. So you can keep track of the current xpath in the document, identify which paths have which data you're interested in, and identify which path marks the end of a chunk that you want to save or hand off or otherwise process.

Use almost any SAX Parser to stream the file a bit at a time.

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
FIrstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for those text using its UTF-16.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

+1 for StaX. It's easier to use than SaX because you don't need to write callbacks (you essentially just loop over all elements of the while until you're done) and it has (AFAIK) no limit as to the size of the files it can process.

Related

Parallel Processing Wikipedia's XML Data Dump with Storm

I'm trying to process the wikipedia dump found here. Specifically with the file - enwiki-latest-pages-articles-multistream.xml.bz2. This is about 46GB uncompressed. I am currently using the STAX parser in Java (xerces) and am able to extract 15K page elements per second. However the bottleneck seems to be the parser and I have toyed around with aalto-xml but it hasn't helped.
Since I'm parsing each page element in the Storm spout, it is a bottleneck. However, I thought I could simply emit the text between ... tags and have several bolts process each of those page elements in parallel. This would reduce the amount of work the Storm spout has to perform. However I am not sure of the specific approach to take here. If I use a parser to extract the content between tags then that would mean it would parse every single element from the beginning of the tag until the end. Is there a way to eliminate this overhead in a standard SAX/STAX parser?
I tried something similar in an attempt to parallelise.
anyway since I use the wikipedia data for many tasks, it was just simpler to generate a one-article-perline-dump which then I can run many experiments from in parallel.
It takes only a few minutes to run, then I have a dump which I can feed to Spark (in your case Storm) very easily.
If you want to use our tools check:
https://github.com/idio/wiki2vec
There is no way to do random access on XML document, but many Java XML parsers have somewhat more efficient skipping of unused content: Aalto and Woodstox for example defer decoding of String values (and construction of String objects) so that if tokens are skipped, no allocations are needed.
One thing to make sure with Stax is to NOT use Event API unless there is specific need to buffer contents -- it does not offer much functionality over basic Streaming API (XMLStreamReader), but does add significant allocation overhead since every XMLEvent is constructed, regardless of whether it is needed. Streaming API, on the other hand, only indicates type of Event/Token, and lets caller decide whether content (attributes, textual value) is needed and can avoid most of object allocations.

reading files from memory instead of disk

I have a Java project with a huge set of XML files (>500). Reading this files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory mapped files, as provided by RandomAccessFile and FileChannel in standard java library. This way OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use In-Memory databases to store intermediate files (XML files). This will give the speed of using ram and db together.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as in memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
Use java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
Also I would suggest using a MemoryMappedFile, to read the file directly from the disk instead of loading it in memory.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024*50);
And then you can read the buffer as usual.
have you considered creating an object structure for these files and serializing them, java object serialization and deserialization is much faster than parsing an XML, this is again considering that these 500 or so XML files don't get modified between reads.
here is an article which talks about serializing and deserializing.
if the concern is to load file content into memory, then consider ByteArrayInputStream, ByteArrayOutputStream classes maybe even use ByteBuffer, these can store the bytes in memory
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 seconds time to load the XML files is way out of line - you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (such as by using Java 8 parallel Streams).
Looks like your main issue is large number of files and RAM is not an issue. Can you confirm?
Is it possible that you do a preprocessing step where you append all these files using some kind of separator and create a big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% the size of the original or better. You can uncompress it when it is visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
It all depends, whether you read the data more than once or not.
Assuming we use some sort of Java-based-RamDisk (it would actually be some sort of Buffer or Byte-array).
Further assume the time to process the data takes less than reading from. So you have to read it at least one single time. So it would make no difference if you'd read it first from disk-to-memory and then process it from memory.
If you would read a file more than once, you could read all the files into memory (various options, Buffer, Byte-Arrays, custom FileSystem, ...).
In case processing takes longer than reading (which seems not to be the case), you could pre-fetch the files from disk using a separate thread - and process the data from memory using another thread.

Efficient way to read a small part of a BIG XML file in Java

We have a new requirement:
There are some BIG xml files keep coming into our system and we will need to process them immediately and quickly using Java. The file is huge but the required information for our processing is inside a element which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing. If we try to load the entire file, we will get out of memory error immediately due to size. What is the efficient way in Java that I can use to get the ..data..data..data.. data element without loading or reading the file line by line. Is there any SAX Parser that I can use to get this done?
Thank you
The SAX parsers are event based and are much faster because they do what you need: they don't read the xml document entirely. There is a SAXParser available in the Java distributions.
I had to parse huge files in a previous project (1G-2G) and didn't want to deal with using SAX. I find SAX too low-level in some instances and like keepings a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read a part of a file, you will need to read each line of the file to be able to identify the part of the file of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
StAX is another option based on steaming the data, like SAX, but benefits from a more friendly approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.

When am I doubling my memory usage?

I have a servlet that users post a XML file to.
I read that file using:
String xml = request.getParameter("...");
Now say that xml document is 10KB, since I created the variable xml I am now using 10KB of memory for that variable correct?
Now I need to parse that xml (using xerces), and I converted it to a input stream when passing to it to the saxparsers parse method (http://docs.oracle.com/javase/1.5.0/docs/api/javax/xml/parsers/SAXParser.html).
So if I convert a string to a stream, is that doubling my memory usage?
Need some clarifications on this.
If I connect my process with visualvm or jconsole, while stepping through the code, can I see if I am using additional memory as I step through the code in my debugger?
I want to make sure I am not doing this inefficienctly as this endpoint will be hit hard.
A 10,000 bytes of text generally turns into 20 KB.
When you process text you generally need 2-10x more memory as you will be doing something with that information, such as creating a data structure.
This means you could need 200 KB. However given that in a PC this represents 1 cents worth, I wouldn't worry about it normally. If you have a severely resource limited device, I would consider moving the processing to another device like a server.
I think you might be optimizing your code before actually seeing it running. The JVM is very good and fast to recover unused memory.
But answering your question String xml = request.getParameter("..."); doesn't double the memory, it just allocates an extra 4 or 8 bytes (depending if the JVM is using compressed pointers) for the pointer.
Parsing the xml is different the SAX parser is very memory efficient, so it won't use too much memory, I think around 20 bytes per Handler plus any instance variables that you have... and obviously any extra objects that you might generate in the handler.
So the code you have looks like it's as memory efficient as it can get (depending of what you have in your handlers, of course).
Unless you're working on embedding that code in a device or running it 100k times a second, I would suggest you not to optimize anything unless you're sure you need to optimize it. The JVM has some crazy advanced logic to optimize code and the garbage collector is very fast to recover short lived objects.
If users can post massive files back to your servlet, then it is best not to use the getParameter() methods and handle the stream directly - Apache File Upload Library.
That way you can use the SAX Parser on the InputStream (and the whole text does not need to be loaded into memory before processing) - as you would have to do with the String based solution.
This approach scales well and requires only a tiny amount of memory per request compared to the String xml = getParameter(...) solution.
You will code like this:
saxParser.parse(new InputSource(new StringReader(xml));
You first need to create StringReader around xml. This won't double your memory usage, StringReader class merely wraps xml variable and return it character by character when requested.
InputSource is even thinner - it simply wraps provided Reader or InputStream. So in short: no, your String won't be copied, your implementation is pretty good.
No, you won't get 2 copies of the string, doubling your memory. Other things might double that memory, but the string itself won't be duplicated.
Yes, you should connect visualVm and jconsole to see what happens to memory and thread processing.

Parsing very large XML documents (and a bit more) in java

(All of the following is to be written in Java)
I have to build an application that will take as input XML documents that are, potentially, very large. The document is encrypted -- not with XMLsec, but with my client's preexisting encryption algorithm -- will be processed in three phases:
First, the stream will be decrypted according to the aforementioned algorithm.
Second, an extension class (written by a third party to an API I am providing) will read some portion of the file. The amount that is read is not predictable -- in particular it is not guaranteed to be in the header of the file, but might occur at any point in the XML.
Lastly, another extension class (same deal) will subdivide the input XML into 1..n subset documents. It is possible that these will in some part overlap the portion of the document dealt with by the second operation, ie: I believe I will need to rewind whatever mechanism I am using to deal with this object.
Here is my question:
Is there a way to do this without ever reading the entire piece of data into memory at one time? Obviously I can implement the decryption as an input stream filter, but I'm not sure if it's possible to parse XML in the way I'm describing; by walking over as much of the document is required to gather the second step's information, and then by rewinding the document and passing over it again to split it into jobs, ideally releasing all of the parts of the document that are no longer in use after they have been passed.
Stax is the right way. I would recommend looking at Woodstox
This sounds like a job for StAX (JSR 173). StAX is a pull parser, which means that it works more or less like an event based parser like SAX, but that you have more control over when to stop reading, which elements to pull, ...
The usability of this solution will depend a lot on what your extension classes are actually doing, if you have control over their implementation, etc...
The main point is that if the document is very large, you probably want to use an event based parser and not a tree based, so you will not use a lot of memory.
Implementations of StAX can be found from SUN (SJSXP), Codehaus or a few other providers.
You could use a BufferedInputStream with a very large buffer size and use mark() before the extension class works and reset() afterwards.
If the parts the extension class needs is very far into the file, then this might become extremely memory intensive, 'though.
A more general solution would be to write your own BufferedInputStream-workalike that buffers to the disk if the data that is to be buffered exceeds some preset threshold.
I would write a custom implementation of InputStream that decrypts the bytes in the file and then use SAX to parse the resulting XML as it comes off the stream.
SAXParserFactory.newInstance().newSAXParser().parse(
new DecryptingInputStream(),
new MyHandler()
);
You might be interested by XOM:
XOM is fairly unique in that it is a
dual streaming/tree-based API.
Individual nodes in the tree can be
processed while the document is still
being built. The enables XOM programs
to operate almost as fast as the
underlying parser can supply data. You
don't need to wait for the document to
be completely parsed before you can
start working with it.
XOM is very memory efficient. If you
read an entire document into memory,
XOM uses as little memory as possible.
More importantly, XOM allows you to
filter documents as they're built so
you don't have to build the parts of
the tree you aren't interested in. For
instance, you can skip building text
nodes that only represent boundary
white space, if such white space is
not significant in your application.
You can even process a document piece
by piece and throw away each piece
when you're done with it. XOM has been
used to process documents that are
gigabytes in size.
Look at the XOM library. The example you are looking for is StreamingExampleExtractor.java in the samples directory of the source distribution. This shows a technique for performing a streaming parse of a large xml document only building specific nodes, processing them and discarding them. It is very similar to a sax approach, but has a lot more parsing capability built in so a streaming parse can be achieved pretty easily.
If you want to work at higher level look at NUX. This provides a high level streaming xpath API that only reads the amount of data into memory needed to evaluate the xpath.

Categories