In an app I am working on, I have to process very large XML files (as large as 2GB)... I want to run some XQuery commands against those files using the Saxon Java library.
How do I do this so that only a small set of records from the file is kept in memory at a time, and the file is processed in such small sets of data (rather than the whole file at once), while the XQuery command's output is still correct? I would prefer to run the XQuery commands on machines with only 0.5GB of RAM, so it's just not possible to load the entire XML into memory at once.
Saxon's support for streamed processing is actually stronger in XSLT than in XQuery, largely because the XSLT working group has been addressing this issue in designing XSLT 3.0. You can find information on the streaming capabilities of the product at
http://www.saxonica.com/documentation9.4-demo/index.html#!sourcedocs/streaming
Note these are available only in the commercial edition, Saxon-EE.
For simple "burst mode" streaming you can do things like:
for $e in saxon:stream(doc('big.xml')/*/record[@field='234']) return $e/name
By "burst mode" I essentially mean a query that operates over a large number of small disjoint subtrees of the source document.
The best (though complicated) way to reach such functionality is to limit the possible XQuery commands (i.e. enumerate all possible use cases). Then, once per file, process it with SAX or StAX to build an internal "index" for the whole XML file that maps search keys to offsets (start and end) in the XML file. Those offsets should point to some small but well-formed part of the XML file that can be loaded standalone and analyzed to check whether it matches the specified XQuery.
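A rough StAX sketch of that indexing pass might look like this (the record element name and id attribute are invented, records are assumed not to be nested, and the character offsets StAX reports are parser-dependent, so they usually need adjusting before you can cut well-formed fragments out of the file):

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;
    import java.util.HashMap;
    import java.util.Map;

    public class OffsetIndexer {
        public static void main(String[] args) throws Exception {
            Map<String, long[]> index = new HashMap<String, long[]>();
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("big.xml"), "UTF-8");

            String key = null;
            long start = -1;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    key = reader.getAttributeValue(null, "id");        // the search key
                    start = reader.getLocation().getCharacterOffset();
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    long end = reader.getLocation().getCharacterOffset();
                    index.put(key, new long[] { start, end });          // start/end of this record
                }
            }
            reader.close();
            System.out.println("Indexed " + index.size() + " records");
        }
    }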
An alternative is to parse the XML file (again with SAX or StAX) into some disk-based temporary database (like Apache Derby) and create your own XQuery-to-SQL translator or interpreter to access the data. You won't get an OutOfMemoryError, but the performance of this method may not be the best for files that are only processed once.
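To illustrate the database route, here is a bare-bones sketch of pushing parsed records into an embedded Derby table over JDBC (the table layout, column names, and sample values are all made up; in practice they would come from your SAX/StAX handler):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class DerbyLoader {
        public static void main(String[] args) throws Exception {
            // Embedded Derby database created on disk; only the current row is held in memory
            Connection conn = DriverManager.getConnection("jdbc:derby:xmlcache;create=true");
            conn.createStatement().execute(
                "CREATE TABLE record (id VARCHAR(64) PRIMARY KEY, name VARCHAR(256))");

            PreparedStatement insert =
                conn.prepareStatement("INSERT INTO record (id, name) VALUES (?, ?)");
            insert.setString(1, "234");        // key attribute of the current <record>
            insert.setString(2, "example");    // a child element you want to query later
            insert.executeUpdate();

            insert.close();
            conn.close();
        }
    }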
I have to write a very large XLS file. I have tried Apache POI, but it simply takes up too much memory for me to use.
I had a quick look through StackOverflow and I noticed some references to the Cocoon project and, specifically, the HSSFSerializer. It seems that this is a more memory-efficient way to write XLS files to disk (from what I've read; please correct me if I'm wrong!).
I'm interested in the use case described here: http://cocoon.apache.org/2.1/userdocs/xls-serializer.html . I've already written the code to write out the file in the Gnumeric format, but I can't seem to find how to invoke the HSSFSerializer to convert it to XLS.
On further reading it seems like the Cocoon project is a web framework of sorts. I may very well be barking up the wrong tree, but:
Could you provide an example of reading in a file, running the HSSFSerializer on it and writing that output to another file? It's not clear how to do so from the documentation.
My friend, the HSSF serializer is part of POI. You are just setting certain attributes in the XML to be serialized (but you need a whole process to create it). Also, setting up a whole pipeline with this framework just to create an XLS seems odd, as it changes the app's architecture. Is that your decision?
From the docs:
An alternate way of generating a spreadsheet is via the Cocoon serializer (yet you'll still be using HSSF indirectly). With Cocoon you can serialize any XML datasource (which might be a ESQL page outputting in SQL for instance) by simply applying the stylesheet and designating the serializer.
If memory is an issue, try XSSF or SXSSF in POI.
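For example, a minimal SXSSF sketch (note that SXSSF writes the newer .xlsx format rather than the old binary .xls; the window size and file name are arbitrary):

    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.xssf.streaming.SXSSFWorkbook;
    import java.io.FileOutputStream;

    public class BigSheetWriter {
        public static void main(String[] args) throws Exception {
            // Keep only 100 rows in memory; older rows are flushed to a temp file on disk
            SXSSFWorkbook workbook = new SXSSFWorkbook(100);
            Sheet sheet = workbook.createSheet("data");

            for (int r = 0; r < 1000000; r++) {
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue("value " + r);
            }

            FileOutputStream out = new FileOutputStream("big.xlsx");
            workbook.write(out);
            out.close();
            workbook.dispose();   // delete the temporary files backing the flushed rows
        }
    }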
I don't know whether by "XLS" you mean the specific, pre-Office-2007 version of the "Horrible SpreadSheet Format" (which is what HSSF stands for), or just anything you can open with a recent version of MS Office, OpenOffice, ...
So depending on your client requirements (i.e. those that will open your Excel file), another option might be available : generating a .XLSX file.
It comes down to producing an XML file in the proper grammar, which seems to fit your situation, as you seem to have already done that with the Gnumeric XML-based file format without technical trouble and without hitting memory-efficiency issues.
Please note other XML-based spreadsheet formats exist, that Excel and other clients would be able to use. You might want to dig into the open document file formats.
As to whether to use Apache Cocoon or something else:
Cocoon can certainly host the XSL processing; batch (Cocoon CLI) processing is available if you need Cocoon but need it not to run as a webapp (though as far as I remember, the CLI feature was broken in the latest builds of the 2.1 series); and Cocoon comes with a load of features and technologies that could address further requirements.
Cocoon might be overkill if it just comes down to running an XSL transformation, for which there are plenty of well-known, lighter tools you can pick from.
I used Lucene's ExtractWikipedia tool to extract a bz2 dump of the latest English wiki pages. The resulting .txt files still have the Wikipedia markup language in them. Is there a tool or Python script that I can run over the directory to parse out only the content from each file? (i.e. modify the files so that they only contain content, no markup)
Alternatively, is there a java library or package which can accomplish this? I'm hoping to integrate it into the Lucene class, ExtractWikipedia.
You can try wikiprep, a ready-made Perl script (you will need to install Perl first). It:
removes the wiki markup language
generates hierarchical categories
removes redirections
generates an XML format that's easy to parse
http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/
It may take a couple of hours to run over the whole Wikipedia dump,
and may need a lot of memory, about 6GB of RAM.
I need to parse a large (>800MB) XML file from Jython. The XML is not deeply nested, containing about a million relevant elements. I need to convert these elements into real objects.
I've used nu.xom.* successfully before, but now that I've switched from Java to Jython, the library fails with the following message:
The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.
I have not found a way to fix this, so I probably have to look for another XML library. It could be either Java or Jython-compatible Python, and it should be efficient. Pythonic would be great; nu.xom.* is simple but not very Pythonic. Do you have any suggestions?
SAX is the best way to parse large documents.
Sounds like you're hitting the default expansion limit.
See this note:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4843787
You need to set the System property "entityExpansionLimit" to change the default.
(added) see also the answer to this question.
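The property can be raised either with a JVM flag or programmatically before the parser is created; the value below is just an example:

    // Either pass it as a JVM flag:
    //     java -DentityExpansionLimit=1000000 ...
    // or set it in code before the document is parsed:
    System.setProperty("entityExpansionLimit", "1000000");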
Try using the SAX parser; it is great for streaming large XML files.
Does jython support xml.etree.ElementTree? If so, use the iterparse method to keep your memory size down. Read this and use elem.clear() as described.
There is the lxml Python library, which can parse large files without loading the data into memory,
but I don't know if it is Jython-compatible.
I have a large XML file which contains many sub-elements. I want to be able to run some XPath queries against it. I tried using vtd-xml in Java, but I sometimes get an OutOfMemoryError because the XML is too large to fit into memory. Is there an alternative way of processing such large XML files?
Try http://code.google.com/p/jlibs/wiki/XMLDog
It executes XPaths using SAX without creating an in-memory representation of the XML documents.
SAXParser is very efficient when working with large files
What are you trying to do right now? By the sound of it you are trying to use a DOM-based parser, which essentially loads the entire XML file into memory as a DOM representation. If you are dealing with a large file, you'll be better off using a SAX parser, which processes the XML document in a streaming fashion.
I personally recommend StAX for this.
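For instance, a small StAX sketch that walks a big file event by event (the element name is a placeholder, and getElementText() assumes the elements you care about contain only text):

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream("large.xml"));

            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag,
                    // so only one element is ever held in memory
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }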
Did you use standard VTD or extended VTD-XML? If you use the extended version then you have the option of using memory mapping... did you try that?
Using XPath might not be a very good idea if you plan on compiling many expressions dynamically in a long lived application.
I'm not entirely sure how the Java version of XPath works, but in .NET, XPath compiles a dynamic assembly and then adds it to the app domain. Subsequent uses of the expression look at the assembly now loaded into memory.
In one case where I was using XPath, it led to a situation where, I think, this same mechanism was slowly filling up memory, similar to a memory leak.
My theory is that because each expression was compiled using values from the user, each compiled expression was likely unique, so a new expression was compiled and added to the app domain every time.
Since you can't remove an assembly from the app domain without restarting the entire app domain, memory was consumed each time an expression was evaluated and could not be recovered. As a result, the code was leaking memory in the form of assemblies, and after a while, well, you know the results.
I need an indexed file format that can hold a few hundred large, variable-sized binary blobs.
Blobs are around 1-5MB and the file could be as large as 1GB. I need to be able to quickly find, read, add, and remove blobs without recreating the entire file. I have no need to compress the blobs, but if blobs are removed, I'd like to reclaim or reuse the space.
Ideally there would be a Java API.
I'm currently doing this with a ZIP format, but there's no known way to update a ZIP file without recreating it, and performance is bad.
I've looked into SQLite, but its blob performance was slow, and it's overkill for my needs.
Any thoughts, or should I roll my own?
And if I do roll my own, any book or web page suggestions?
Berkeley DB Java Edition does what you need. It's free.
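A rough sketch of what storing, fetching, and deleting blobs with Berkeley DB JE looks like (the directory, database name, and key are made up):

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import java.io.File;

    public class BlobStore {
        public static void main(String[] args) throws Exception {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("blobstore"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "blobs", dbConfig);

            byte[] blob = new byte[5 * 1024 * 1024];               // stand-in for a real 5MB blob
            DatabaseEntry key = new DatabaseEntry("blob-42".getBytes("UTF-8"));
            db.put(null, key, new DatabaseEntry(blob));            // add or replace

            DatabaseEntry value = new DatabaseEntry();
            db.get(null, key, value, LockMode.DEFAULT);            // fast keyed lookup
            db.delete(null, key);                                  // space is reclaimed by JE's cleaner

            db.close();
            env.close();
        }
    }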
You need some kind of virtual file system. Our SolFS is one of the options, but we have only a JNI layer, as the engine is written in C. There is one more option, CodeBase, but as they don't provide an evaluation version of their file system, I know little about it.
SolFS is ideally suited for your task, because it lets you have alternative streams for files and associate searchable metadata with each file or even each alternative stream.