I am working on an application with the following requirements:
Download a ZIP file from a server.
Uncompress the ZIP file, get the content (which is in XML format) from this file into a String.
Pass this content into another method for parsing and further processing.
My concern here is that the XML file may be huge, say 100 MB, while my JVM has only 512 MB of memory. How can I read this content in chunks, pass it on for parsing, and then insert the data into PL/SQL tables?
Since there can be multiple requests running at the same time within the 512 MB limit, what would be the best possible way to process this?
How can I read the data in chunks and pass it as a stream for XML parsing?
Java's XMLReader is a SAX2 parser. Whereas a DOM parser reads the whole XML file and creates an (often large) data structure (usually a tree) to represent its contents, a SAX parser lets you register a handler that is called when pieces of the XML document are recognized. In that callback code, you save only enough data to do what you need -- e.g. you might collect the fields that will end up as a single row in the database, insert that row, and then discard the data. With this type of design, your program's memory consumption depends less on the file size than on the complexity and size of a single logical data item (in your case, the data that will become one row in the database).
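A minimal sketch of that approach, under some assumptions: the element names ("record", "name", "amount") and the insertRow(...) call are placeholders for your real XML structure and JDBC code.

import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RowHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String name;
    private String amount;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // reset the buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("name".equals(qName))   name = text.toString();
        if ("amount".equals(qName)) amount = text.toString();
        if ("record".equals(qName)) {        // one logical row is complete
            insertRow(name, amount);         // hypothetical: JDBC insert or batch add
            name = null; amount = null;      // discard the data, keeping memory flat
        }
    }

    private void insertRow(String name, String amount) { /* JDBC insert goes here */ }

    public static void parse(InputStream xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(xml, new RowHandler());
    }
}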
Even if you did use a DOM-style parser, things might not be quite as bad as you expect. XML is pretty verbose, so (depending on how it's structured and such) a 100 MB file will often represent only 10-20 MB of data, and as little as 5 MB of data wouldn't be particularly rare or unbelievable.
Any SAX parser should work since it won't load the entire XML file into memory like a DOM parser.
Related
I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load all the file to memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (an evolution of SAX, of sorts), which is included in the JDK (package javax.xml.stream); a minimal usage sketch follows the links below.
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
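A minimal StAX sketch, assuming the input file path comes from the command line and the element of interest is called "record" (a placeholder for whatever your documents actually contain):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT && "record".equals(reader.getLocalName())) {
                // pull out what you need here, e.g. reader.getAttributeValue(...) or reader.getElementText()
            }
        }
        reader.close();
    }
}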
Use a SAX-based parser that presents you with the contents of the document as a stream of events.
The StAX API is easier to deal with than SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then store it on the fly someplace else (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current path in the document, identify which paths hold the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
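A small sketch of that path-tracking idea (the path "order/item/price" is invented for illustration): keep a stack of element names and buffer characters() yourself, since the parser may deliver the text of one element in several pieces.

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder chars = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        path.addLast(qName);
        chars.setLength(0);                 // start collecting text for this element
    }
    @Override public void characters(char[] ch, int start, int len) {
        chars.append(ch, start, len);       // may be called several times per element
    }
    @Override public void endElement(String uri, String local, String qName) {
        String current = String.join("/", path);
        if (current.endsWith("order/item/price")) {
            // hand off chars.toString() to whatever processes this chunk
        }
        path.removeLast();
    }
}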
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark-and-sweep garbage collection results in pages being accessed in random order and in objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle a sequential write of the 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit on the size of the files it can process.
I am writing a piece of software which has a part dealing with read and write operations. I am wondering how costly these operations are on a CSV file. Are there any other file formats that consume less time? I have to read and write CSV files at the end of every cycle.
Read and write operations depend on the file system, hardware, software configuration, memory, memory setup and the size of the file to read, but not on the format. A related but separate cost is parsing the file, which should be relatively low since CSV is very simple.
The point is that CSV is a good format for tables of data but not for nested data. If your data has a lot of nested information, you can separate it into different CSV files, or you will have some information redundancy that will penalize your performance. But other formats may have other kinds of redundancy.
And do not optimize prematurely. If you are reading and writing the file very frequently, it will surely be kept in RAM. JSON or a zipped file might save size and be read faster, but would have a higher parsing time and could even end up being slower. The parsing time also depends on the implementation of the library (Gson vs. Jackson) and its version.
It would be nice to know the reasons behind your problem in order to give better answers.
The cost of reading / writing to a CSV file, and whether it is suitable for your application, depend on the details of your use case. Specifically, if you are simply reading from the beginning of the file and writing to the end of the file, then the CSV format is likely to work fine. However, if you need to access particular records in the middle of your file then you probably wish to choose another format.
The main issue with a CSV file is that it is not a good format for random access: each record (row) has a variable size, so you cannot simply seek to a particular record offset in the file; instead you need to read every row (well, you could still jump and sample, but you cannot seek directly by record offset). Formats with fixed-size records allow you to seek directly to a particular record in the file, making it possible to update an entry in the middle of the file without re-reading and re-writing the entire file.
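For contrast, here is a sketch of what fixed-size records make possible; the 64-byte record length and the file path argument are arbitrary assumptions.

import java.io.RandomAccessFile;

public class FixedRecordExample {
    static final int RECORD_SIZE = 64; // assumed fixed record length in bytes

    public static byte[] readRecord(String path, long recordIndex) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(recordIndex * RECORD_SIZE);   // jump straight to the record
            byte[] record = new byte[RECORD_SIZE];
            raf.readFully(record);
            return record;
        }
    }
}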
My application creates a very big XML file (of about 300K transactions). Each transaction has about 20 XML elements, so it creates a huge XML file. We did not use JAXB, SAX or DOM for creation of the XML file, as memory is the constraint. Now I need to replace certain tag values in the XML file once it has been created. I know what is to be replaced and the value to replace it with. How can I replace those values without loading the entire file into memory? For 300K transactions, the file size comes to about 600 MB, so we do not want to load the entire file into memory just to replace a few values.
We are using Java 5. Is there a way we can do it?
You can try VTD-XML:
Memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.
Fastest XML parser: on a Core 2 2.5 GHz desktop, VTD-XML outperforms DOM parsers by 5x~12x, delivering 150~250 MB/sec per core sustained throughput.
Incremental-update capable: it can cut, paste, split and assemble XML documents with maximum efficiency.
Available in C, C++, C# and Java.
Example modifying XML.
Everything I've ever read on this topic indicates that you can't do this without loading the file into memory or streaming it to another file. That's probably what you'll end up needing to do - stream your source into a new file, modifying as you go.
More info about that process - http://docs.oracle.com/javaee/5/tutorial/doc/bnbfl.html#bnbgq
I like the way Stephen C addresses your problem in an answer here - How to modify a huge XML file by StAX?
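That stream-and-rewrite approach can be done with StAX from the JDK (bundled since Java 6; on Java 5 you would need a StAX implementation such as Woodstox on the classpath). A rough sketch, where the file names and the "old value"/"new value" strings are placeholders:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class ReplaceValues {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("in.xml"));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("out.xml"));
        XMLEventFactory events = XMLEventFactory.newInstance();

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Replace matching text nodes on the fly; everything else is copied verbatim.
            if (event.isCharacters() && "old value".equals(event.asCharacters().getData())) {
                writer.add(events.createCharacters("new value"));
            } else {
                writer.add(event);
            }
        }
        writer.close();
        reader.close();
    }
}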
You could try a streaming transformation using XSLT 3.0 (specifically, Saxon-EE).
I'm not sure what you mean by "tag values" (it's so much easier if people use the correct terminology...) but if you mean the values of text nodes, then you could write a streaming transformation something like this:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="xyz/text()[.='old value']">
    <xsl:text>new value</xsl:text>
  </xsl:template>
</xsl:stylesheet>
with further rules for additional substitutions. You can also, of course, have rules that rename or delete selected elements, etc.
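If it helps, here is a hedged sketch of how such a stylesheet might be run from Java with Saxon's s9api (streaming requires Saxon-EE; the class and method names below reflect the s9api interface as I recall it, so verify them against the Saxon documentation for your version, and the file names are placeholders):

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.Xslt30Transformer;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;

public class StreamingTransform {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(true); // true = licensed (EE) features
        XsltCompiler compiler = processor.newXsltCompiler();
        XsltExecutable executable = compiler.compile(new StreamSource(new File("replace.xsl")));
        Xslt30Transformer transformer = executable.load30();
        Serializer out = processor.newSerializer(new File("output.xml"));
        // applyTemplates streams the input through the stylesheet's template rules
        transformer.applyTemplates(new StreamSource(new File("huge-input.xml")), out);
    }
}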
I have a large corpus of XML files (~20,000 files). When I load the entire corpus, it takes around 1 second to load each document. The XMLs are pretty large (> 10,000 lines). Each XML represents a document with nodes for sentences, tokens in the sentences and other similar attributes.
I am using DocumentBuilder in java to load the xml. After loading the xml, I also need to extract some relevant xml nodes (around 100 sentences). For this I used getElementsByTagName().
Is there a faster way to load xml documents in java?
You can consider a SAX implementation. SAX is typically 2 to 5 times faster, based on this link: http://dublintech.blogspot.be/2011/12/jaxb-sax-dom-performance.html. It makes a lot of sense when you only need to process part of your document rather than all of its contents.
You can also use faster disks like SSDs or maybe a virtual file system with a caching strategy.
If you have slow disks it might even make sense to zip all of them into one big zip, which will reduce disk access by 80 to 90%. The unzipping overhead should be offset by the gain in disk access performance.
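A sketch of that one-big-zip idea using the JDK's java.util.zip (no third-party library assumed); the actual parsing is left as a stub:

import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZippedCorpusReader {
    public static void readAll(String zipPath) throws Exception {
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    // parse the XML from 'in' with SAX/StAX here
                }
            }
        }
    }
}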
But the fact that you are also considering Lucene implies that we are missing some crucial information about your use case, because it suggests that the action you are optimizing is more or less a one-off anyway.
If you really need only a relatively small part of the content of your documents, you could also consider storing that information in one data structure and serializing it. That way you only need to deserialize one file instead of processing 20,000 XML documents. In case the documents change, you could also store document paths and a hash such as MD5 to detect modified documents.
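A sketch of that extract-once-and-serialize idea; the map layout (document path to extracted sentences) and the MD5 check are assumptions based on the description above.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;

public class CorpusCache {
    static void save(Map<String, List<String>> extracted, String cacheFile) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(cacheFile))) {
            out.writeObject(extracted);            // one file instead of 20,000 documents
        }
    }

    @SuppressWarnings("unchecked")
    static Map<String, List<String>> load(String cacheFile) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(cacheFile))) {
            return (Map<String, List<String>>) in.readObject();
        }
    }

    static byte[] md5(String xmlPath) throws Exception {
        // compare against a stored digest to detect modified documents
        return MessageDigest.getInstance("MD5").digest(Files.readAllBytes(Paths.get(xmlPath)));
    }
}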
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use an in-memory database to store the intermediate files (the XML files). This gives you the speed of RAM together with database functionality; a minimal H2 sketch follows the links below.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as an in-memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
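A hedged sketch of the H2 idea (it requires the H2 driver on the classpath; the table layout, storing a document name plus its XML content, is an assumption):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InMemoryStore {
    public static void main(String[] args) throws Exception {
        // DB_CLOSE_DELAY=-1 keeps the in-memory database alive until the JVM exits
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:xmlcache;DB_CLOSE_DELAY=-1")) {
            conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS docs(name VARCHAR(255) PRIMARY KEY, content CLOB)");
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO docs VALUES (?, ?)")) {
                ps.setString(1, "example.xml");
                ps.setString(2, "<root>...</root>");
                ps.executeUpdate();
            }
        }
    }
}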
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system, and instances of this class support both reading and writing to a random access file.
Also, I would suggest using a memory-mapped file (MappedByteBuffer), so that the OS pages the file in on demand rather than your code copying the whole file onto the heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r"); // opened read-only
FileChannel channel = file.getChannel();
// map the first 50 KB of the file; READ_ONLY matches the "r" mode used above
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, 1024 * 50);
And then you can read the buffer as usual.
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML. This again assumes that these 500 or so XML files don't get modified between reads.
Here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, then consider the ByteArrayInputStream and ByteArrayOutputStream classes, or even ByteBuffer; these can hold the bytes in memory.
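A small sketch of that byte-array caching idea (keying the cache by file name is an assumption): read each file once into memory and hand out a fresh ByteArrayInputStream per read, so parsers never hit the disk again.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class XmlCache {
    private final Map<String, byte[]> cache = new HashMap<>();

    public void preload(Path xmlFile) throws IOException {
        cache.put(xmlFile.getFileName().toString(), Files.readAllBytes(xmlFile));
    }

    public InputStream open(String fileName) {
        return new ByteArrayInputStream(cache.get(fileName)); // in-memory, re-readable
    }
}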
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, that 30-40 seconds to load the XML files is way out of line - you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (for example by using Java 8 parallel streams).
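A sketch of that parallel-reading suggestion (the directory name "xml-dir" is a placeholder, and the parsing loop is just a stub for your real StAX processing):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class ParallelLoader {
    public static void main(String[] args) throws Exception {
        List<Path> files;
        try (Stream<Path> paths = Files.list(Paths.get("xml-dir"))) {
            files = paths.collect(Collectors.toList());
        }
        files.parallelStream().forEach(ParallelLoader::parseOneFile);
    }

    static void parseOneFile(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                reader.next(); // pull out whatever you need here
            }
            reader.close();
        } catch (Exception e) {
            throw new RuntimeException(file.toString(), e);
        }
    }
}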
It looks like your main issue is the large number of files, and RAM is not a constraint. Can you confirm?
Is it possible that you do a preprocessing step where you append all these files using some kind of separator and create a big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% the size of the original or better. You can uncompress it when it is visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
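A sketch of the compression idea using the JDK's java.util.zip (zip4j, linked above, offers a richer API; GZIP is shown here because it is built in):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipXml {
    static void compress(Path xml, Path gz) throws Exception {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            Files.copy(xml, out);                        // store the XML compressed
        }
    }

    static InputStream openCompressed(Path gz) throws Exception {
        return new GZIPInputStream(Files.newInputStream(gz)); // decompress on read
    }
}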
It all depends on whether you read the data more than once or not.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it. You have to read the data at least once either way, so it would make no difference whether you read it straight from disk or first copied it from disk into memory and then processed it from memory.
If you read a file more than once, you could read all the files into memory (various options: buffers, byte arrays, a custom file system, ...).
In case processing takes longer than reading (which seems not to be the case), you could pre-fetch the files from disk using a separate thread - and process the data from memory using another thread.