Efficient parser for large XML files - Java

I have very large XML files to process. I want to convert them to readable PDFs with colors, borders, images, tables and fonts. I don't have a lot of resources on my machine, so I need my application to be very efficient with memory and CPU.
I did some humble research to make up my mind about the technology to use, but I could not decide on the best programming language and API for my requirements. I believe DOM is not an option because it consumes a lot of memory, but would Java with a SAX parser fulfill my requirements?
Some people also recommended Python for XML parsing. Is it that good?
I would appreciate your kind advice.

SAX is a very good parser, but it is getting dated.
Oracle provides a newer parser called StAX (Streaming API for XML) that parses XML files efficiently:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
The linked tutorial also shows a comparison of all the parsers, along with their memory utilization and features.
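To give you a feel for it, here is a minimal sketch of cursor-style StAX parsing (the file name and element name are made up):

    import java.io.FileInputStream;

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxDemo {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The cursor API pulls one event at a time, so only the current
            // event is held in memory, regardless of document size.
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("large.xml"));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "record".equals(reader.getLocalName())) {
                        System.out.println("Found a record element");
                    }
                }
            } finally {
                reader.close();
            }
        }
    }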
Thanks,
Pavan

Yes, I think SAX will work for you. DOM is not good for large XML files, as it keeps the whole XML file in memory. You can see a comparison I wrote on my blog here
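As a rough sketch of the SAX approach (the file name and element name are made up), the handler receives events as the parser streams through the file, so memory use stays flat:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxDemo {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler callbacks fire as the parser streams the file;
            // nothing is retained unless you store it yourself.
            parser.parse("large.xml", new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if ("record".equals(qName)) {
                        System.out.println("Found a record element");
                    }
                }
            });
        }
    }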

Not sure if you're interested in using Perl, but if you're open to it, the following are all good options: LibXML, LibXSLT and XML-Twig, which is good for files too large to fit in memory (as is LibXML::Reader). Of course SAX is there as well, but it can be slow. Most people recommend the first two options. Finally, CPAN is an amazing resource with a very active community.

If you want the best of DOM without its memory overhead, vtd-xml is the best bet; here is the proof:
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
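For a feel of the API, a minimal sketch with vtd-xml (the file name and XPath are made up); it indexes the document in place instead of building an object tree, which keeps the footprint close to the raw file size:

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;

    public class VtdDemo {
        public static void main(String[] args) throws Exception {
            VTDGen gen = new VTDGen();
            // parseFile builds a lightweight token index over the document.
            if (gen.parseFile("large.xml", true)) {
                VTDNav nav = gen.getNav();
                AutoPilot pilot = new AutoPilot(nav);
                pilot.selectXPath("/catalog/item");
                int index;
                while ((index = pilot.evalXPath()) != -1) {
                    System.out.println("Matched element: " + nav.toString(index));
                }
            }
        }
    }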

Related

A replacement for XSLT transformation

I am working on a project that currently uses a bunch of very big XSLT files.
We use those XSLTs to translate XML from our system into XML that the other system can read.
Our system actually receives JSON, which we save as XML just for those XSLTs.
We are now thinking about a way to replace the XSLT with something simpler, but we have a restriction:
Those XSLTs are modified by outside people (who work on the other system), so just refactoring them is not an option, since that is only a temporary solution until they become ugly again. Also, we still need a way to let those people change how we transform the XML, preferably without teaching them how to code.
Since our system is written in Java, we would also like our solution to be supported by one of the major Java frameworks.
I was thinking about a sort of rule engine with XQuery for customization, but I am not sure if that is a valid solution.
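(To make the XQuery idea concrete, here is a rough sketch of how a query could be run with Saxon's s9api; the query, file name and class are purely illustrative, not our actual code. Non-coders could maintain queries like this in external files:)

    import java.io.File;

    import javax.xml.transform.stream.StreamSource;

    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.XQueryCompiler;
    import net.sf.saxon.s9api.XQueryEvaluator;
    import net.sf.saxon.s9api.XdmValue;

    public class XQueryDemo {
        public static void main(String[] args) throws Exception {
            Processor processor = new Processor(false); // false = Saxon-HE
            XQueryCompiler compiler = processor.newXQueryCompiler();
            // A hypothetical query kept outside the codebase.
            XQueryEvaluator evaluator = compiler
                    .compile("for $o in /orders/order return $o/@id/string()")
                    .load();
            evaluator.setSource(new StreamSource(new File("input.xml")));
            XdmValue result = evaluator.evaluate();
            System.out.println(result);
        }
    }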
Another idea I found was to just use Ruby, since many people say it does the job better, but I fear that the teaching overhead would be too great.
I would really appreciate any ideas you might have for solving this problem.
Thanks :)

Write large PDFs with Java sequentially

I am looking for a Java library that lets you write large PDFs sequentially with a minimal amount of memory. Most of the libraries I have looked at have to build up the document in memory before you can actually write it.
The problem I have to deal with is OutOfMemoryErrors. It would be great if I could flush the writer programmatically whenever needed, e.g. for each page.
Does anyone have any recommendations? I need something with a license along the lines of the LGPL (so not the GPL or the Affero GPL that iText uses).
You can do that with iText. It supports writing to OutputStreams.
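A minimal sketch with the iText 5 API (the file name and content are made up); finished pages are flushed to the stream as you go rather than kept in memory:

    import java.io.FileOutputStream;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;

    public class SequentialPdfDemo {
        public static void main(String[] args) throws Exception {
            Document document = new Document();
            // PdfWriter streams completed pages to the OutputStream.
            PdfWriter.getInstance(document, new FileOutputStream("large.pdf"));
            document.open();
            for (int page = 1; page <= 1000; page++) {
                document.add(new Paragraph("Content of page " + page));
                document.newPage();
            }
            document.close();
        }
    }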
The free version of Docmosis has a fairly open license, so it might suit you. It uses a template approach, which is different from building documents from code. Docmosis processes all documents in a stream-based fashion, since it's intended for significant parallel use and for large documents. It also allows you to offload the most CPU-intensive part of the processing to another server. Hope that helps.
I actually had the same issue as you do. A friend helped me out, but he did it in C# using an API called GhostScriptSharp; you should check it out.
I can't give you a copy of the code, since it's copyrighted, but I'm sure it would help you out, since I think the underlying tool is built on Java.
jPod can swap indirect objects and supports incremental writing.
This is still not optimal, as you need an "increment" on each flush, but it's better than nothing...
EDIT
Öhhh - this is one of the famous examples of self-describing code :-) You're right, there's not much of a tutorial or the like - but the Javadoc is quite good.
jPod writes incrementally by default. See "CosDocument.setWriteModeHint" to switch to full mode.
The examples "CreateDoc" and "AppendPage" are simple examples of how to add pages. You may do the same and call "save" every 10 or 100 pages. This should "soften" all references to pages in memory, and if they are not held by some other references of yours, they can be garbage collected.
There's still the question of how you fill the pages. There are examples dealing with content streams, too (DrawText, ...). But jPod is not like iText, Jasper or whatever; there are only PDF model abstractions. You have no "layouter" or "renderer" that creates page content from text, HTML or something like that. How do you do this?

Speed of Groovy XML Slurping

We're starting to investigate a project that requires a tricky bit of XML parsing.
I like the look of Groovy's XmlSlurper (Groovy appears to be my Golden Hammer of choice at the moment). We'll need to process a pretty wide range of XML inputs and Groovy's dynamic nature might just let us work out a neat, concise solution. We'll see.
A concern is the cost of that flexibility and dynamism in terms of speed, though I've done no testing of that yet. Does anyone have any experience with this? Are Groovy and XmlSlurper particularly fast or slow compared to some of the Java alternatives for parsing XML?
I did not see serious performance problems with XmlSlurper, but you should use it carefully:
If you need to parse a few large XML files, you should have no performance problems. According to this article, XmlSlurper was written to process large XML files.
If you need to parse many small XML files, you should use it in "a Groovy way" and with pre-populated XML parser instance(s).
In my experience, the speed with which you can get something up and running in Groovy far outweighs any slowdown caused by its dynamic nature...
And in the rare instances where it is severely impacting your application, you can always drop the Groovy code and write a Java class which adheres to the same interface; it should plug straight in...
Hmmm... not really an answer, this. I guess you could see it more as words of encouragement from the touchline ;-)

Streamlined XML builder/parser in Android?

I'm learning the Android API from a book, and it seems like there isn't any mention of a streamlined API for dealing with raw XML (reading and writing). The author's suggestion for parsing is XmlPullParser, and his examples look horrendous compared to the kinds of APIs I'm spoiled by on other platforms (LINQ to XML especially).
Is this the best available technique on the Android platform?
Obviously I can write a wrapper to avoid the repetitive stuff, but I'd be surprised if no such thing already exists.
Also, he doesn't even mention creating XML structures in code. What are my options for both?
On a side note, do any Java devs who are familiar with LINQ to XML in .NET know of anything equivalent in Java?
Since you probably don't want to load any substantially sized DOM into Android's memory, pull and SAX parsers are the preferred way of dealing with XML on Android. I think it pays to invest in understanding how SAX works and to write a custom handler rather than rely on some generic library that may be incompatible or bloated. I parse XML in my apps using SAX all the time, and I'm very pleased with the speed (most of the time).
Well, I'm pretty new to Java, but here's what I've gleaned so far about XML parsing on Android:
The XmlPullParser approach is recommended for Android due to resource constraints. There is a DOM parser available on Android, which would let you use XPath to navigate an XML document. Using the DOM means you have to load the entire document into memory at once, however. The XmlPullParser method is much more efficient in terms of memory used.
The XmlPullParser method takes a little getting used to after being comfortable with LINQ to XML or XPath, but it's really not too bad IMHO (at least with the documents I was parsing). If you're working with small XML documents, you could certainly use the DOM with XPath.
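For what it's worth, a bare-bones pull-parser loop looks something like this (the element name and input are made up; on Android you can also obtain a parser via android.util.Xml.newPullParser()):

    import java.io.StringReader;

    import org.xmlpull.v1.XmlPullParser;
    import org.xmlpull.v1.XmlPullParserFactory;

    public class PullParserDemo {
        public static void main(String[] args) throws Exception {
            XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
            parser.setInput(new StringReader("<items><item>hello</item></items>"));
            // You drive the parser: ask for the next event when you're ready.
            for (int event = parser.getEventType();
                    event != XmlPullParser.END_DOCUMENT;
                    event = parser.next()) {
                if (event == XmlPullParser.START_TAG
                        && "item".equals(parser.getName())) {
                    System.out.println("Item text: " + parser.nextText());
                }
            }
        }
    }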
There's a decent article about the different methods for reading and writing XML with Android here:
http://www.ibm.com/developerworks/opensource/library/x-android/index.html
I had the same issues with parsing XML and XHTML and ended up writing a web service to do it for me:
Android device -> (request URL) -> web service fetches and parses -> (data) -> Android device
You can transmit the data as JSON to work with it on the device.
The advantage of this is that you can minimize the traffic over the slow mobile network and change the parsing without releasing a new Android app.
Maybe this will work for you too.
Regards

Parsing RTF Documents with Java/JavaCC

Is anybody familiar with the RTF document format and parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:
Swing RTFEditorKit API
but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:
The RTF support was not written by the Swing team. In the future we hope to improve the support provided.
I don't think I'm going to wait for this to happen :)
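For reference, extracting the plain text with RTFEditorKit only takes a few lines (the file name is made up); it's the fidelity on more complex documents that's the problem:

    import java.io.FileInputStream;

    import javax.swing.text.DefaultStyledDocument;
    import javax.swing.text.rtf.RTFEditorKit;

    public class RtfTextDemo {
        public static void main(String[] args) throws Exception {
            RTFEditorKit kit = new RTFEditorKit();
            DefaultStyledDocument document = new DefaultStyledDocument();
            // read() parses the RTF into a styled document model.
            try (FileInputStream in = new FileInputStream("sample.rtf")) {
                kit.read(in, document, 0);
            }
            System.out.println(document.getText(0, document.getLength()));
        }
    }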
The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:
PMD Applied JavaCC Grammar
which is OK, and the following (which is the best so far):
Koders RTFParserDelegate and ETranslate Grammar
There are various implementations of the ETranslate grammar around (I know the Nutch API may use this). Does anybody know which is the most accurate grammar, or whether there is a better approach to this?
I could start ploughing through the JavaCC docs to understand the .jj files and test them against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated.
Does anybody know which is the most accurate grammar or whether there is a better approach to this?
Many years ago I spent some time reading RTF (Wikipedia) with C#. I say reading because, if you understand RTF in detail and use it the way it was designed, you will realize that RTF is not meant to be read and parsed as a whole over and over again while editing. In the documentation you will find the syntax for RTF, but don't be misled into believing that you should use a lexer/parser. The documentation includes a sample reader for RTF.
Remember that RTF was created ages ago, when memory was measured in KB rather than MB, and editing long documents of several hundred pages in a conventional way would tax system resources. So RTF has the ability to be edited in smaller subsections without loading or modifying the entire document. This is what gives it the ability to handle such large documents with limited memory. It is also why the syntax may seem odd at first.
Presumably, the source of OpenOffice contains what you're looking for.
