How to transform huge XML files in Java?

As the title says, I have a huge XML file (several GB):
<root>
<keep>
<stuff> ... </stuff>
<morestuff> ... </morestuff>
</keep>
<discard>
<stuff> ... </stuff>
<morestuff> ... </morestuff>
</discard>
</root>
and I'd like to transform it into a much smaller one which retains only a few of the elements.
My parser should do the following:
1. Parse through the file until a relevant element starts.
2. Copy the whole relevant element (with children) to the output file, then go back to step 1.
Step 1 is easy with SAX and impossible for DOM parsers.
Step 2 is annoying with SAX, but easy with a DOM parser or XSLT.
So, is there a neat way to combine SAX and a DOM parser to do the task?

StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)

Yes, just write a SAX content handler, and when it encounters a certain element, you build a dom tree on that element. I've done this with very large files, and it works very well.
It's actually very easy: As soon as you encounter the start of the element you want, you set a flag in your content handler, and from there on, you forward everything to the DOM builder. When you encounter the end of the element, you set the flag to false, and write out the result.
(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)
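A minimal sketch of the handler described above, assuming the JDK's built-in SAX and DOM APIs. The class name KeepExtractor and the method extract are made up for illustration, the element is matched by its qualified name, and each copied element is collected as a string where a real job would write to the output file. A counter takes care of nested elements with the same name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams with SAX, builds a small DOM only for wanted elements.
class KeepExtractor extends DefaultHandler {
    private final String targetName;
    private final List<String> fragments = new ArrayList<>();
    private final DocumentBuilder builder;
    private final Transformer serializer;
    private Document doc;   // DOM of the element currently being copied
    private Node current;   // node that receives the next child
    private int depth;      // 0 while outside a wanted element

    KeepExtractor(String targetName) throws Exception {
        this.targetName = targetName;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        this.serializer = TransformerFactory.newInstance().newTransformer();
        this.serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0) {
            if (!qName.equals(targetName)) {
                return; // still skipping
            }
            doc = builder.newDocument();
            current = doc;
        }
        depth++; // also counts nested elements that share the target's name
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(element);
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth == 0) {
            return;
        }
        current = current.getParentNode();
        if (--depth == 0) { // the wanted element is complete: write it out
            try {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(doc), new StreamResult(out));
                fragments.add(out.toString());
            } catch (TransformerException e) {
                throw new SAXException(e);
            }
            doc = null; // let the small DOM be garbage collected
        }
    }

    static List<String> extract(String xml, String targetName) throws Exception {
        KeepExtractor handler = new KeepExtractor(targetName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fragments;
    }
}
```

Only one wanted element is held in memory at a time, so the footprint depends on the largest kept element rather than on the file size. Namespaces, comments and processing instructions are ignored in this sketch.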

I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streamed version of XSLT, well suited to parsing huge amounts of data with a minimal memory footprint. It has an implementation in Java named Joost.
It should be easy to come up with a STX transform that ignores all elements until the element matches a given XPath, copies that element and all its children (using an identity template within a template group), and continues to ignore elements until the next match.
UPDATE
I hacked together a STX transform that does what I understand you want. It mostly depends on STX-only features like template groups and configurable default templates.
<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
version="1.0" pass-through="none" output-method="xml">
<stx:template match="element/child">
<stx:process-self group="copy" />
</stx:template>
<stx:group name="copy" pass-through="all">
</stx:group>
</stx:transform>
The pass-through="none" on stx:transform configures the default templates (for nodes, attributes etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your match expression) and "processes self" in the "copy" group, meaning that the matching template from the group named "copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template next matches.
The following is an example input file:
<root>
<child attribute="no-parent, so no copy">
</child>
<element id="id1">
<child attribute="value1">
text1<b>bold</b>
</child>
</element>
<element id="id2">
<child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" ></b>
</x:childX>
</child>
</element>
</root>
This is the corresponding output file:
<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
text1<b>bold</b>
</child><child attribute="value2">
text2
<x:childX xmlns:x="http://x.example.com/x">
<!-- comment -->
yet more<b i="i" x:i="x-i" />
</x:childX>
</child>
The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.

Since you're talking about gigabytes, I would prioritize memory usage. SAX needs about twice as much memory as the size of the document, while DOM needs at least 5 times as much. So if your XML file is 1 GB, DOM would require a minimum of 5 GB of free memory. That's not funny anymore. So SAX (or any variant of it, like StAX) is the best option here.
If you want the most memory-efficient approach, look at VTD-XML. It requires only a little more memory than the size of the file.

Have a look at StAX; this might be what you need. There's a good introduction on IBM developerWorks.

For such a large XML document something with a streaming architecture, like Omnimark, would be ideal.
It wouldn't have to be anything complex either. An Omnimark script like what's below could give you what you need:
process
submit #main-input
macro upto (arg string) is
((lookahead not string) any)*
macro-end
find (("<keep") upto ("</keep>") "</keep>")=>keep
output keep
find any

You can do this quite easily with an XMLEventReader and several XMLEventWriters from the javax.xml.stream package.
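As a rough sketch of that idea, assuming a single output and the javax.xml.stream classes that ship with the JDK: the class StaxSubtreeCopier and its copy method are invented names, and the input is a string only to keep the example short (a real job would pass file streams). The root element is kept so that the result is still one well-formed document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies the root element and every <keepName> subtree.
class StaxSubtreeCopier {
    static String copy(String xml, String keepName) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        int level = 0;     // nesting depth in the input document
        int keepDepth = 0; // greater than 0 while inside a subtree being copied
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                level++;
                String name = event.asStartElement().getName().getLocalPart();
                if (keepDepth > 0 || name.equals(keepName)) {
                    keepDepth++;
                }
                if (keepDepth > 0 || level == 1) {
                    writer.add(event); // kept subtree, or the root start tag
                }
            } else if (event.isEndElement()) {
                if (keepDepth > 0 || level == 1) {
                    writer.add(event); // kept subtree, or the root end tag
                }
                if (keepDepth > 0) {
                    keepDepth--;
                }
                level--;
            } else if (keepDepth > 0) {
                writer.add(event); // text, comments etc. inside a kept subtree
            }
            // Everything else, including the XML declaration, is skipped here.
        }
        writer.close();
        reader.close();
        return out.toString();
    }
}
```

For the example in the question, copy(xml, "keep") would return the root element containing only the keep elements. To split the result across several files, one XMLEventWriter per output could be opened in the same loop.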

Related

Validate and remove any extraneous closing tags in xml in java

Example:
<Module name="IOWData">
*</VERSION>*
<ACQ> PAR </ACQ>
<RECON> PUP </RECON>
<Group name="PAR">
<HEALTHSTATUS> OK </HEALTHSTATUS>
</Group>
</Module>
I want to remove any extraneous closing tag, i.e. a closing tag which hasn't been opened in the XML (as shown in the example: the VERSION tag).
Note: it can be any tag anywhere in the XML. Also, the XML is huge, and I don't really wish to load the entire XML into memory.
I have the following ideas:
Regex: can I use a regular expression to solve this? I need help with how to match the tag names of opening and closing tags.
Using XSD. But how?
I hope I'm clear; I'm looking for an efficient solution.
Thanks!
First, don't call it XML. It isn't XML. If you start by calling it non-XML, that will help to establish the mindset that tools designed for processing XML aren't going to be any use to you.
Given that you have to parse a language that isn't XML, and that no parser for that language currently exists, you're going to have to learn about writing parsers[*]. It's a topic that is covered in every computer science course and in any compiler textbook, but it's not something to attempt until you have read a bit about the theory.
Once you know how to start writing a parser, the best thing to do is to write down the BNF of the grammar you want to parse, which is basically the XML grammar plus the option of stray end-tags. You will have a lexical analyser which identifies the tags (including the strays) and pushes them across to a syntax analyzer, which can do the job of matching tag names (although this is technically, in the jargon of compiler writing, semantics rather than syntax). Then you just have to identify the strays and drop them from the event stream passed to the next stage of processing, which can be a standard SAX ContentHandler.
I hope that gives you an accurate feel for the size of the mountain you are wanting to climb.
[*]I guessed that you don't know much about this from the fact that you thought regular expressions might do the job.
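To give a feel for the two stages, here is a deliberately tiny sketch, not a real parser. It assumes the input fits in a String, that attribute values never contain < or >, and that there are no comments, CDATA sections or processing instructions. The class StrayEndTagFilter and its method are hypothetical. A pattern plays the lexical analyser that finds tags, and a stack does the name matching that a pattern alone cannot do.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: drops end tags that do not close the innermost open element.
class StrayEndTagFilter {
    // Group 1: "/" for an end tag. Group 2: tag name. Group 3: "/" for a self-closing tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w:.\\-]*)[^<>]*?(/?)>");

    static String dropStrayEndTags(String input) {
        Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(input);
        int pos = 0;
        while (tag.find()) {
            out.append(input, pos, tag.start()); // copy the text between tags
            pos = tag.end();
            boolean isEndTag = !tag.group(1).isEmpty();
            boolean isSelfClosing = !tag.group(3).isEmpty();
            String name = tag.group(2);
            if (!isEndTag) {
                if (!isSelfClosing) {
                    open.push(name);
                }
                out.append(tag.group());
            } else if (name.equals(open.peek())) {
                open.pop();
                out.append(tag.group());
            }
            // Otherwise the end tag matches nothing that is open: drop it.
        }
        out.append(input, pos, input.length());
        return out.toString();
    }
}
```

A production version would have to read the input as a stream and cope with everything the XML grammar allows, which is where the real work lies.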

Is there any way to force Java to print empty CDATA within Elements?

I know it's improper, but the company "on the other side", for some reason, forces us to change code like this:
<POZ />
into
<POZ><![CDATA[]]></POZ>
I found Forcing Empty CDATA Elements but it needs an additional XSL file to be attached and this is something I would like to avoid.
I tried trivial
Element poz = new Element("POZ");
CDATA cdataContent = new CDATA("");
poz.addContent(cdataContent);
but it - of course - generates simple
<POZ />
not the extended version I expect...
Is there a way to force Java to generate these CDATA sections within empty Elements?
Ideally, no XML application should ever care about the difference between CDATA escapes and normal text, or the difference between a self-terminated empty element and one expressed with separate open and close tags. I'd strongly recommend beating up whoever is consuming this document to make them drop this requirement.
The DOM doesn't consider the concept of an empty text node meaningful, whether it's a CDATA Section or not. I'm not sure offhand whether any of the other XML APIs will let you generate an empty CDATA, but I wouldn't be surprised if the answer is no.
If you absolutely require this, I would recommend that you write a postprocessing stage which does a simple string replacement on the file to force it into the form you require.
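Such a postprocessing stage could look roughly like this. It is only a sketch: the class EmptyCdataPostProcessor and its method are made-up names, and it assumes the serialized output writes the empty element as <POZ/> or <POZ />, without attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical postprocessor: rewrites empty elements as elements with an empty CDATA section.
class EmptyCdataPostProcessor {
    static String forceEmptyCdata(String xml, String elementName) {
        // Matches <NAME/> and <NAME /> for the given element name only.
        Pattern empty = Pattern.compile("<" + Pattern.quote(elementName) + "\\s*/>");
        String replacement = "<" + elementName + "><![CDATA[]]></" + elementName + ">";
        return empty.matcher(xml).replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

Calling forceEmptyCdata(xml, "POZ") on the serialized document, just before it is sent, leaves every other element untouched.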
If you can't persuade the recipient that using XML properly is a good idea, then emit
<POZ>[[[[[[CDATA]]]]]]</POZ>
from the Java app, and post-process it with sed or similar.

Java: Given a List of file names, make sure the corresponding XML only contains information about these files

I have a List of files (20,000 to 50,000 files) and a large XML file. I want the XML file to contain information only about the files in the List.
For example, let's say we have only file XYZ on our list, and the XML file looks as below.
<?xml version="1.0" encoding="ISO-8859-1"?>
<index>
<document>
<entry number="1">
<commentfield>
<name>FileName</name>
<value>XYZ</value>
</commentfield>
</entry>
<entry number="2">
<commentfield>
<name>Note</name>
<value>03-000</value>
</commentfield>
</entry>
</document>
<document>
<entry number="1">
<commentfield>
<name>FileName</name>
<value>ABC</value>
</commentfield>
</entry>
</document>
...
</index>
The XML contains information on two files, XYZ and ABC. I do not want the final XML to contain the last <document> ... ABC ... </document>, because ABC is not on our list. I have this working in a KSH script, but it runs too slowly (over 4 hours for 22,000 files, although it also does something else), so I decided to port it to Java for better performance. What I have done is read line by line into a String; when I hit </document>, I parse out the name of the file and check whether it exists on our list. If so, I write the whole <document> ... </document> to another XML file, then read the next <document>. Is there a better way?
I was already able to write code that accomplishes this using a DOM parser. The code is long, so if you need it, please PM me. Thank you very much for your help.
'Parsing' an XML input yourself using regex or whatever is a brittle solution that will place unnecessary restrictions on the format of the input text (around whitespace and such). There's no need for it when the Java library comes with several XML parsers.
Using DOM might be the easiest way to go, if you can guarantee that your input XML won't grow too large to slurp into memory at once. You can:
Read the XML into a DOM structure
Traverse the DOM and modify it, removing the unwanted nodes
Write the modified DOM to a new file using a Transformer. Example here.
A more efficient option might be StAX, which doesn't require the entire input to be read in at once. I haven't used it, but it has the ability to read as well as write documents. You could read a <document> element at a time, and write it back to an output file if it's in the list. A bit of a tutorial here.
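A sketch of how the StAX variant might look, under a few assumptions: the structure is the one shown in the question, the file name is the value that follows a name of FileName, and no single document element is too big to buffer. DocumentFilter and filter are hypothetical names, and strings stand in for the real input and output files.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: keeps only the <document> elements whose FileName is wanted.
class DocumentFilter {
    static String filter(String xml, Set<String> wanted) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> buffer = null;             // non-null while inside a <document>
        StringBuilder text = new StringBuilder(); // text of the element being read
        String lastName = null;                   // content of the most recent <name>
        boolean keep = false;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartDocument() || event.isEndDocument()) {
                continue; // declaration handling is left out of this sketch
            }
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals("document")) {
                buffer = new ArrayList<>();
                keep = false;
                lastName = null;
            }
            if (buffer == null) {
                writer.add(event); // outside any <document>: copy straight through
                continue;
            }
            buffer.add(event);
            if (event.isStartElement()) {
                text.setLength(0);
            } else if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            } else if (event.isEndElement()) {
                String name = event.asEndElement().getName().getLocalPart();
                if (name.equals("name")) {
                    lastName = text.toString().trim();
                } else if (name.equals("value") && "FileName".equals(lastName)
                        && wanted.contains(text.toString().trim())) {
                    keep = true;
                } else if (name.equals("document")) {
                    if (keep) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                    }
                    buffer = null; // drop or flush, then go back to pass-through
                }
            }
        }
        writer.close();
        reader.close();
        return out.toString();
    }
}
```

Memory use is bounded by the largest single document element plus the set of file names, not by the size of the whole file.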
Ignoring, for the moment, details of the best way to parse and re-write the XML, the basic strategy of reading once through the XML file and looking for each file name in the list seems sound.
However, you might be able to improve the way you check for presence in the list of filenames (you don't specify how you're doing that). A couple of possibilities:
Put the filenames in a Set, and check for presence in the set, which will be an O(1) or O(log N) operation
Sort the list of filenames and perform a binary search, which will be an O(log N) operation.
Either way would be an improvement over a simple linear search through an unsorted list.
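Both options take only a few lines with the standard collections. In this sketch the class FileNameIndex and its method names are invented for illustration; HashSet gives the O(1) lookup and Collections.binarySearch the O(log N) one.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper showing the two lookup strategies side by side.
class FileNameIndex {
    private final Set<String> hashed;
    private final List<String> sorted;

    FileNameIndex(Collection<String> fileNames) {
        hashed = new HashSet<>(fileNames);   // expected O(1) per lookup
        sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted);            // sort once, O(N log N)
    }

    boolean containsHashed(String name) {
        return hashed.contains(name);
    }

    boolean containsSorted(String name) {
        // binarySearch returns a negative value when the key is absent
        return Collections.binarySearch(sorted, name) >= 0;
    }
}
```

With 20,000 to 50,000 names either lookup is cheap; the HashSet is usually the simpler choice because nothing needs to stay sorted.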
There are multiple ways to approach this:
XSLT would make this very simple. If you have a fixed input list, you can write a transform that selects only the valid elements and outputs them. This way you don't have to write any code, and you can use something like xsltproc, which is very fast!
This is what I would try first, because XSLT was created specifically for transforming XML into other XML; it is less code, and less code is less maintenance.
Here is an idea of how to get started. This outputs all the <document/> elements whose <value/> element is not equal to ABC.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml"/>
<!-- this matches ALL nodes and ALL attributes -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:copy>
</xsl:template>
<!-- this matches the entire document element where value = 'ABC' -->
<xsl:template match="document[entry[commentfield[value[(text()='ABC')]]]]"/>
</xsl:stylesheet>
There are plenty of resources and good books on XSLT. All you need to do is provide a whitelist of supported <value/> elements and reverse the logic in my example.
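If you would rather run the transform from Java than from xsltproc, the JDK's javax.xml.transform API can apply the same stylesheet. This is a sketch with hypothetical names (XsltRunner, transform) that embeds the stylesheet as a string for brevity. Bear in mind that the JDK's XSLT processor reads the whole input into memory, so this only suits inputs that fit there.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSLT 1.0 stylesheet to an XML string.
class XsltRunner {
    // The stylesheet from the answer: identity copy, minus documents whose value is ABC.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\"/>"
        + "<xsl:template match=\"node()|@*\">"
        + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"document[entry[commentfield[value[(text()='ABC')]]]]\"/>"
        + "</xsl:stylesheet>";

    static String transform(String stylesheet, String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

XsltRunner.transform(XsltRunner.STYLESHEET, xml) returns the input with the ABC document removed; for real files, StreamSource and StreamResult also accept File arguments.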
If you have an .xsd or can create one (your input file doesn't look very complicated), you can use JAXB to automatically generate an object hierarchy to parse the input file. You can then walk the resulting object graph, remove anything that doesn't meet your criteria, and marshal it back to a file.
JAXB isn't very viable if the file is larger than what will fit into memory.
You can use XPath to get the elements; if you know the structure of the XML, you can then remove those elements. Depending on how you are processing your XML, you can either use DOM (probably not a good idea for large XML files)

Parse an XML structure as a simple, generic set of maps and lists

Somewhat related to How to convert XML to java.util.Map and vice versa, only even more generic.
I have an XML document, and I'd like to convert it to a very generic set of key/value pairs (in Java, that is). The basic idea is that we can parse pretty much every XML document and pass it on directly to a JSP file, which can read the values and display them.
Say one has an XML structure as follows:
<root>
<items>
<item id="10">Some item here</item>
</items>
<things>
<thing awesome="true">
<orly-owl hoot="woot" />
</thing>
</things>
</root>
The output would be a set of Map objects that contain values, lists, and other maps. Here's how, ideally, it'd be read in a (pseudo) JSP file:
<c:forEach var="item" items="${root.items}">
${item.id}
${item.text}
</c:forEach>
<c:forEach var="thing" items="${root.things}">
Is it awesome? ${thing.awesome}
orly? ${thing.orly-owl.hoot}
</c:forEach>
Basically, it'd be an xml parser of sorts that has a simple set of rules.
For each XML entity:
Does it have subnodes?
add entry to map with node name as key and List (of maps) as value
Does it have attributes or value?
add entry to map with attribute name as key and attribute value as value
...or something to that degree. I don't really have the data structure in mind properly yet.
So my question is: Is there a ready-made parser that can do this or something like this?
The ones I've found and tried out today all map to a fixed object hierarchy, i.e. you have to create a root object with a List of Item objects, each with its own properties. This isn't bad per se (and it can be auto-generated based on a (to be written / designed) DTD object), but it's my current assignment to try out both options. I tried the first; it'll work once those mapping XML files make sense to me and the error messages start telling me what I'm doing wrong. I haven't been able to figure out how to do the second, though (read: write a recursive XML parser (DOM or SAX) that recurses recursively).
Coherency may be absent in this question; it's five o'clock.
Edit, after thinking it through some more. It will work (that is, sending Objects to JSP that can contain values, Maps and Lists), however it'll be terribly problematic while parsing, for example in the next example:
<root thing="thine mother">
<thing mabob="yus" />
<thing mabob="nay" />
<items>
<item id="1" />
</items>
</root>
In this particular instance, there are two same-named thing elements under the root. Same-named elements should go into a List. However, at the same level there's an items element, which is a singular element and should go in as a map entry. Add to that a third item named 'thing' on the root element itself (its thing attribute), and the whole thing's buggered.
Without analyzing the structure beforehand (and setting a flag like 'there's both same-named and unique-named elements under this particular element'), you cannot assume this. And the last thing I want to do is to force the XML to be according to a particular structure.
My colleague actually suggested running the XML through an XSL so that it'd be 'flatter' (more like database rows), or having the xml output have a maximum depth of one. Not an option, really.
Anyways. Thanks for the suggestions all, it seems this isn't a very plausible solution to the problem - at least not without screwing up basic rules and conventions of XML and Common Sense.
On to the next ideas - having JSP render a Document directly using the XML JSTL library.
JDOM can certainly provide you with Lists built from the elements. The library has been around for quite some time and is pretty easy to use. http://jdom.org/
It seems like the JSTL XML bindings will do exactly what you want.
And the reason that you're unlikely to find anything that exactly meets your requirements using lists and maps is because XML does not neatly translate into lists and maps (mostly because of the question "how do you treat attributes differently than content?").
Java Architecture for XML Binding (JAXB) should be on your short list. Here's a brief tutorial introduction.
The Apache Commons Digester could do this. It is a wrapper around a SAX parser that lets you create rules for unmarshalling data into objects.
OTOH if you want to know how to do recursive parsing you could check out this article for an interesting approach (using a recursive transition network). The idea is you create a network of objects that shows the relationship between the xml elements, and you keep track of where you are in this network as you parse using a stack.

What xml parser should I use?

I have a somewhat large file (~500KiB) with a lot of small elements (~3000). I want to pick one element out of this and parse it to a java class.
Attributes Simplified
<xml>
<attributes>
<attribute>
<id>4</id>
<name>Test</name>
</attribute>
<attribute>
<id>5</id>
<name>Test2</name>
</attribute>
<!--3000 more go here-->
</attributes>
</xml>
class Simplified
public class Attribute{
private int id;
private String name;
//Mutators and accessors
}
I kinda like XPath, but people suggested StAX and even VTD-XML. What should I do?
500 kb is not that large. If you like XPath, go for it.
I kinda like XPath, but people suggested StAX and even VTD-XML. What should I do?
DOM, SAX and VTD-XML are three different ways to parse an XML document, listed roughly in order of increasing memory efficiency. DOM needs over 5 times as much memory as the size of the XML file. SAX is only a bit more efficient. VTD-XML uses only a little more memory than the size of the XML file, about 1.2 times.
XPath is just a way to select elements and/or data from a (parsed) XML document.
In other words, you can use XPath in combination with any of the XML parsers, so this is a non-concern after all. If you just want the best memory efficiency and performance, go for VTD-XML.
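For a file of this size, XPath on top of the JDK's own DOM parser is probably the least effort. The sketch below uses that combination rather than VTD-XML, and assumes the simplified structure from the question. AttributeFinder and findById are hypothetical names, and the Attribute class gets a constructor and getters in place of the mutators mentioned in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Simplified value class from the question.
class Attribute {
    private final int id;
    private final String name;

    Attribute(int id, String name) {
        this.id = id;
        this.name = name;
    }

    int getId() { return id; }
    String getName() { return name; }
}

// Hypothetical helper: picks one <attribute> element by id using XPath over a DOM.
class AttributeFinder {
    static Attribute findById(String xml, int id) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select the <attribute> whose <id> child has the requested numeric value.
        Node node = (Node) xpath.evaluate(
                "/xml/attributes/attribute[id=" + id + "]", doc, XPathConstants.NODE);
        if (node == null) {
            return null; // no such element
        }
        // Evaluate relative to the matched element to read its <name> child.
        return new Attribute(id, xpath.evaluate("name", node));
    }
}
```

The expression stays the same whichever parser produced the tree, which is the point made above: XPath selects, the parser parses.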
I have commented above as well, because there are a few options to consider, but by the sound of your initial description I think you could get away with a simple SAX processor here. It will probably run faster than other mechanisms, although it might not look as pretty when it comes to mapping to the Java class.
There is an example here, which matches quite closely with your example:
http://www.informit.com/articles/article.aspx?p=26351&seqNum=6
Avoid anything that is a DOM parser - no need for that, especially with a large-ish file and relatively simple XML syntax.
Which specific one to use, sorry, I haven't used them, so I can't give you any more guidance than to look at your licensing, performance, and support (for questions).
My favorite XML library is Dom4j
Whenever I have to deal with XML I just use XMLBeans. It may be overkill for what you are after, but it makes life easy (once you know how to use it).
If you don't care about performance at all, Apache Digester may be useful for you, as it will already initialize the Java objects for you after you define the rules.
