Java parser to read nested and lengthy XML

I have gone through many Stack Overflow pages and other sites to decide on a parser that fits my requirement.
I need to read large, nested XML files in Java, so a DOM parser would not be a good fit. My XML looks like this (snippet):
<products>
<product>
<productCode></productCode>
<Code>3002191</Code>
<anotherCode></anotherCode>
<entityName>entityName value</entityName>
<entityName2>entityName value</entityName2>
<Type>value</Type>
<List>1</List>
<SecondCode>124</SecondCode>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>045</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
<docInfo>
<name>value1</name>
<docName>value</docName>
<docId>1219</docId>
<type>Full Name</type>
<class>value</class>
<docCode>123</docCode>
<date>07/12/2016</date>
<countries>
<country>India</country>
</countries>
<language>EN</language>
</docInfo>
</product>
<product>
..
</product>
</products>
Requirement: I need to store the product information in a list of HashMaps for further processing with other XML files. At first I thought of using the StAX API for this. However, the docInfo element contains a countries element, so one document can apply to several countries, and I can't move backward in the stream to save another document (one with the same document info but a different country). Please let me know if this is clear enough.
Please let me know which parser would handle this situation well. I do not have a schema for this XML.
Thanks a lot.

For parsing a large amount of XML, SAX is a good choice:
https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
You implement the ContentHandler interface and put whatever logic you need in the callbacks that fire while parsing docInfo and the countries nested inside it.
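As a rough sketch of that idea, the handler below buffers the fields of the current product and docInfo and, when a docInfo closes, emits one HashMap per country. This sidesteps the "can't parse backward" problem, because nothing is written out until all countries of the docInfo are known. The class name DocInfoHandler, the "docInfo." key prefix, and the choice to emit one row per country are assumptions made for illustration, not something the question specifies. It extends DefaultHandler (which implements ContentHandler) and assumes product-level fields appear before the first docInfo, as in the snippet above.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one map per (docInfo, country) pair.
class DocInfoHandler extends DefaultHandler {
    final List<Map<String, String>> rows = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Map<String, String> productFields; // leaf children of the current <product>
    private Map<String, String> docFields;     // leaf children of the current <docInfo>
    private List<String> countries;            // <country> values of the current <docInfo>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element
        if (qName.equals("product")) {
            productFields = new HashMap<>();
        } else if (qName.equals("docInfo")) {
            docFields = new HashMap<>();
            countries = new ArrayList<>();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        switch (qName) {
            case "products":
            case "countries":
                break; // containers only, nothing to store
            case "product":
                productFields = null;
                break;
            case "country":
                if (countries != null) {
                    countries.add(value);
                }
                break;
            case "docInfo":
                if (countries.isEmpty()) {
                    countries.add(""); // still emit one row when no country is listed
                }
                for (String country : countries) {
                    Map<String, String> row = new HashMap<>(productFields);
                    for (Map.Entry<String, String> e : docFields.entrySet()) {
                        row.put("docInfo." + e.getKey(), e.getValue());
                    }
                    row.put("docInfo.country", country);
                    rows.add(row);
                }
                docFields = null;
                countries = null;
                break;
            default: // a leaf element: file it under docInfo or product
                if (docFields != null) {
                    docFields.put(qName, value);
                } else if (productFields != null) {
                    productFields.put(qName, value);
                }
        }
    }

    // Parses the stream and returns the collected rows.
    static List<Map<String, String>> parse(InputStream in) {
        try {
            DocInfoHandler handler = new DocInfoHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.rows;
        } catch (Exception e) { // ParserConfigurationException, SAXException, IOException
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Calling DocInfoHandler.parse with a FileInputStream over the XML file would return the list of maps. Only one product and one docInfo are held in memory at a time, although the returned list itself still grows with the number of documents.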

Related

Transform xml with data from different sources to another xml using java

I have an XML file in which some elements contain certain values. For example:
<item>
<origin>
<![CDATA[KWI]]>
</origin>
<destination>
<![CDATA[DOH]]>
</destination>
</item>
I have an Excel sheet that maps country codes to port codes:
COUNTRY_CODE PORT_CODE MANAGING_PORT_STATION
KW KWI MPS1
QA DOH MPS2
In the output XML, I need to produce something like:
<itemOut>
<country><![CDATA[KW]]></country>
<managingPortStation>MPS1</managingPortStation>
<dest><![CDATA[DOH]]></dest>
</itemOut>
In short, I need to combine data from non-XML sources with the input XML file to produce the output XML file.
What should I use to accomplish this? Is it possible with XSLT? Or what APIs are available in Java? I have only skimmed through JAXP. Is it worth spending more time on it for my case? I would prefer to do this in Java rather than XSLT, since I am more familiar with Java.

Lucene xml retrieval

I have an XML file with about 500 fields.
I have looked into the Apache Digester option,
but it involves a lot of work: I would have to create many classes to cover these 500 fields.
Is there any other way to do this?
Suppose I search for 10013. I can get the name of the matched document, but can I also get the name field from the XML below?
<contact type="individual"> <name>name feild</name>
<address>999 W. Prince St.</address>
<city>New York</city>
<province>NY</province>
<postalcode>10013</postalcode>
<country>USA</country>
<telephone>1-212-345----</telephone>
</contact>
Or does Apache Solr give me a better API for this?

Xstream append to existing XML file

First let me say I am very new to Java. I've been trying to figure out how to append a chunk of XML to an existing XML file using Xstream.
Example XML:
<root>
<first>
<a>Some Value</a>
<b>Some Value B</b>
</first>
<second>
<a>Another Value</a>
<b>Another Value B</b>
</second>
</root>
How would I go about appending the following using Xstream?
<third>
<a>More A</a>
<b>More B</b>
</third>
Have you followed the XStream two-minute tutorial yet? This can be found here.
You should address some implementation choices first: which way do you want to go with XStream?
For example: is the XML document large or small? (If small, you could use a DomDriver. If large, use a StaxDriver.)
Does your XML document use namespaces? If so, be aware that not all XStream parsers implement namespace awareness; see the XStream FAQ.
More information can be found here to get you started.
Please provide us your SSCCE so that users can try out your code example.
See details on how to write a SSCCE here and here
Also include a small valid XML file in your SSCCE.

Using org.apache.commons.digester.Digester in XML with attributes

I'm going to extract values from this XML/RDF:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:j.0="urn:turismoculturale.itdc.filas-1.0.0-RC1#">
<j.0:Chiesa rdf:ID="turismoCulturale_POI_880">
<j.0:title xml:lang="en">Church of S. Giuda Taddeo or S. Onofrio - Gaeta</j.0:title>
<j.0:title xml:lang="it">Chiesa S. Giuda Taddeo o S. Onofrio - Gaeta</j.0:title>
</j.0:Chiesa>
</rdf:RDF>
I would like to get the "en" title when the current language is "en", and the "it" title otherwise. I am able to set the title value in the Poi bean by using:
Digester digester = new Digester();
digester.setNamespaceAware( true );
digester.setRuleNamespaceURI( "urn:turismoculturale.itdc.filas-1.0.0-RC1#" );
digester.addObjectCreate( "*/Chiesa", Poi.class);
digester.addBeanPropertySetter("*/title", "title");
...
but I don't know whether it is the English title or the Italian one.
Ok - first and foremost, don't try to parse RDF/XML with an XML parser. It's never going to work because the semantics of the XML document are irrelevant with respect to RDF/XML and it is a bad idea (if you know how RDF/XML works), especially in your case where the RDF/XML is being generated dynamically (you can tell by the namespaces). You need to use an RDF parser to parse RDF.
So that means don't use an XML to Java object mapping tool, use an RDF to Java Object mapping tool.
Here is a great link explaining how to do this:
http://answers.semanticweb.com/questions/859/best-way-to-convert-rdfxml-file-to-pojos
And another:
http://answers.semanticweb.com/questions/3251/experience-using-java-based-frameworks-for-rdf-to-pojo-and-vice-versa-mapping
Along with links to all the tools in the aforementioned resource:
Jenabean
Empire
AliBaba
RDFReactor
For an out and out RDF parser, look at Jena:
http://incubator.apache.org/jena
It's an Apache project that is also nicely Maven'ed up.
The Commons Digester FAQ says:
Occasionally, people ask how they can fire a rule for an element based on the value of an attribute
There is no simple way to do this with Digester; the built-in rule-matching engines only provide the ability to match on element name. There is no support available for XPath expressions
It might be possible to create a custom "filtering" rule that has a child rule, and fires that child rule only when the appropriate conditions are set. There are no examples of such a solution, however.
Digester isn't a very good tool. It's too simplistic. Consider using a more comprehensive event-based API such as StAX.
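To illustrate the StAX suggestion, here is a minimal sketch that picks a title by its xml:lang attribute using the JDK's XMLStreamReader. The class and method names (TitleByLanguage, findTitle) are made up for this example, and the fallback to the "it" title follows the behaviour described in the question. Note that it treats the input as plain XML, so the earlier caveat about RDF/XML having several equivalent serializations still applies.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns the title in the requested language,
// or the Italian title when no title in that language exists.
class TitleByLanguage {
    static String findTitle(String xml, String lang) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String fallback = null;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // A null namespace means the attribute namespace is not checked,
                    // so this matches xml:lang by its local name.
                    String titleLang = reader.getAttributeValue(null, "lang");
                    String text = reader.getElementText(); // text-only element content
                    if (lang.equals(titleLang)) {
                        return text;
                    }
                    if ("it".equals(titleLang)) {
                        fallback = text;
                    }
                }
            }
            return fallback;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
    }
}
```

Because the attribute is read at the same event as the element, the language check and the value extraction happen together, which is the part Digester's name-only rule matching cannot express directly.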

Best approach to serialize XML to stream with Java?

We serialize/deserialize XML using XStream... and just got an OutOfMemory exception.
Firstly I don't understand why we're getting the error as we have 500MB allocated to the server.
Question is - what changes should we make to stay out of trouble? We want to ensure this implementation scales.
Currently we have ~60K objects, each ~50 bytes. We load the 60K POJO's in memory, and serialize them to a String which we send to a web service using HttpClient. When receiving, we get the entire String, then convert to POJO's. The XML/object hierarchy is like:
<root>
<meta>
<date>10/10/2009</date>
<type>abc</type>
</meta>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
I gather the best approach is to not store the POJO's in memory and not write the contents to a single String. Instead we should write the individual <data> POJO's to a stream. XStream supports this but seems like the <meta> element wouldn't be supported. Data would need to be in form:
<root>
<data>
<field>x</field>
</data>
[thousands of <data>]
</root>
So what approach is easiest to stream the entire tree?
You definitely want to avoid serializing your POJOs into a humongous String and then writing that String out. Use the XStream APIs to serialize the POJOs directly to your OutputStream. I ran into the same situation earlier this year when I found that I was generating 200-300 MB XML documents and getting OutOfMemoryErrors. It was very easy to make the switch.
And ditto of course for the reading side. Don't read the XML into a String and ask XStream to deserialize from that String: deserialize directly from the InputStream.
You mention a second issue regarding not being able to serialize the <meta> element and the <data> elements. I don't think this is an XStream problem or limitation as I routinely serialize much more complex structures on the order of:
<myobject>
<item>foo</item>
<anotheritem>foo</anotheritem>
<alist>
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
...
<alistitem>
<value1>v1</value1>
<value2>v2</value2>
<value3>v3</value3>
...
</alistitem>
</alist>
<anotherlist>
<anotherlistitem>
<valA>A</valA>
<valB>B</valB>
<valC>C</valC>
...
</anotherlistitem>
...
</anotherlist>
</myobject>
I've successfully serialized and deserialized nested lists too.
Not sure what the problem is here...you've found your answer on that webpage.
The example code on the link you provided suggests:
Writer someWriter = new FileWriter("filename.xml");
ObjectOutputStream out = xstream.createObjectOutputStream(someWriter, "root");
out.writeObject(dataObject);
// iterate over your objects...
out.close();
Reading is nearly identical, with Reader in place of Writer and Input in place of Output:
Reader someReader = new FileReader("filename.xml");
ObjectInputStream in = xstream.createObjectInputStream(someReader);
DataObject foo = (DataObject)in.readObject();
// do some stuff here while there's more objects...
in.close();
I'd suggest using tools like Visual VM or Eclipse Memory Analyzer to make sure you don't have a memory leak/problem.
Also, how do you know each object is 50 bytes? That doesn't sound likely.
Use XMLStreamWriter (or XStream) to serialize it; you can write whatever you want to it. If you can get an input stream instead of the entire String, use a SAXParser. It is event-based and, although the implementation may be a little clumsy, you will be able to read any XML that is thrown at you, even if the XML is huge (I have parsed XML files larger than 2 GB with SAXParser).
As a side note, you should pass the binary data, not a String, to an XML parser. The parser reads the encoding of the bytes that follow from the XML declaration at the beginning of the document:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
A String has already been decoded using some encoding. It's better practice to let the XML parser read the original stream than to create a String from it first.
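A minimal sketch of the XMLStreamWriter route, assuming the <root>/<meta>/<data> layout from the question: the meta element is written once, then each data element is written as it is produced, so neither the full object list nor the full document has to exist as a String. The class name StreamingXmlWriter and the idea of passing the field values as an Iterable are illustrative assumptions, not part of any of the answers above.

```java
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical writer for the <root><meta/><data/>...</root> layout in the question.
class StreamingXmlWriter {
    static void write(OutputStream out, String date, String type, Iterable<String> fields) {
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("root");

            // <meta> is written once, before the repeated <data> elements.
            writer.writeStartElement("meta");
            writeLeaf(writer, "date", date);
            writeLeaf(writer, "type", type);
            writer.writeEndElement();

            // Each value goes straight to the stream; nothing is accumulated.
            for (String field : fields) {
                writer.writeStartElement("data");
                writeLeaf(writer, "field", field);
                writer.writeEndElement();
            }

            writer.writeEndElement(); // </root>
            writer.writeEndDocument();
            writer.flush();
            writer.close(); // does not close the underlying OutputStream
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    private static void writeLeaf(XMLStreamWriter writer, String name, String value)
            throws XMLStreamException {
        writer.writeStartElement(name);
        writer.writeCharacters(value); // escapes &, < and > as needed
        writer.writeEndElement();
    }
}
```

The OutputStream could be the request body of the HTTP call, and the Iterable could be backed by a database cursor or similar source, so memory use stays roughly constant regardless of how many data elements are sent.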
