Extract XML blocks as string in Java - java

I have an XML as below
<accountProducts>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
<accountProduct>...</accountProduct>
</accountProducts>
Now I want to extract each accountProduct block as a string. Is there an XML parsing technique for that, or do I need to do string manipulation?
Any help please.

Using the DOM as suggested above, you will need to parse your XML with a DocumentBuilder.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
//if your document has namespaces, you can specify that in your builder.
DocumentBuilder db = dbf.newDocumentBuilder();
Using this object, you can call the parse() method.
Your XML input can be provided to a DOM parser as a file or as a stream.
As a file...
File f = new File("MyXmlFile.xml");
Document d = db.parse(f);
As a string...
String myXmlString = "...";
InputSource ss = new InputSource(new StringReader(myXmlString));
Document d = db.parse(ss);
Once you have a Document object, you can traverse the document with DOM functions or with XPATH. This example illustrates the DOM methods.
In your example, assuming that accountProduct nodes contain only text, the following should work.
NodeList nl = d.getElementsByTagName("accountProduct");
for(int i=0; i<nl.getLength(); i++) {
Element elem = (Element)nl.item(i);
System.out.println(elem.getTextContent());
}
If accountProduct contains mixed content (text and elements), you would need more code to extract what you need.
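If you need the whole accountProduct block back as a string (markup included) rather than just its text, one option is to serialize each matched node with an identity Transformer. This is only a minimal sketch: the class name AccountProductBlocks and the method extractBlocks are made up for illustration, and the child element id in the sample input is an assumption, since the question elides the real content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative, not from the original answer.
final class AccountProductBlocks {

    /** Returns every element named tagName, serialized back to an XML string. */
    static List<String> extractBlocks(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // A transformer created without a stylesheet copies its input unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> blocks = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                StringWriter out = new StringWriter();
                transformer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
                blocks.add(out.toString());
            }
            return blocks;
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not extract XML blocks", e);
        }
    }

    public static void main(String[] args) {
        // Sample input; the <id> children are assumed for the demo.
        String xml = "<accountProducts>"
                + "<accountProduct><id>1</id></accountProduct>"
                + "<accountProduct><id>2</id></accountProduct>"
                + "</accountProducts>";
        extractBlocks(xml, "accountProduct").forEach(System.out::println);
    }
}
```

Each returned string then holds one complete accountProduct element, including nested elements and mixed content.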

Use JAXP for this.
The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language.

Related

How to extract XML data using Java in Android?

I have the following XML document which I'm trying to get the inner text. I have tried numerous ways, using Xpath, DOM, SAX but no success.
This is my XML, I'm not sure if it's the XML structure which is causing a problem or my code.
<?xml version="1.0"?>
<ArrayOfPurchaseEntitites xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema">
<PurchaseEntitites>
<rInstalmentAmt>634.0</rInstalmentAmt>
<rAnnualRate>12.0</rAnnualRate>
<rInterestAmt>2670.0</rInterestAmt>
<dFirstInstalment>3/31/2016 12:00:00 AM</dFirstInstalment>
<dLastInstalment>8/31/2018 12:00:00 AM</dLastInstalment>
<rInsurancePremium>1350.0</rInsurancePremium>
<sResponseCode>00</sResponseCode>
</PurchaseEntitites>
</ArrayOfPurchaseEntitites>
InputStream stream = connect.getInputStream();
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(stream);
doc.normalize();
System.out.println("===============================================================");
String g = doc.getDocumentElement().getTextContent();
System.out.println(g);
NodeList rootNodes = doc.getElementsByTagName("ArrayOfPurchaseEntitites");
Node rootnode =rootNodes.item(0);
Element rootElement = (Element) rootnode;
NodeList noteslist = rootElement.getElementsByTagName("PurchaseEntitites");
for(int i = 0; i < noteslist.getLength(); i++)
{
Node theNote = noteslist.item(i);
Element noteElement =(Element) theNote;
Node theExpiryDate = noteElement.getElementsByTagName("dLastInstalment").item(0);
Element dateElement = (Element) theExpiryDate;
System.out.println(dateElement.getTextContent());
}
stream.close();
I had a similar problem where I wanted to call getElementsByTagName for the first item in a NodeList. The trick - which you already utilize - is to cast the Node to Element. However, just to be sure, I suggest you add if (rootnode instanceof Element).
Assuming you use packages javax.xml.parsers and org.w3c.dom (no wild guess) your code works nicely when the xml is read from a file.
So if there is still a problem with the code (it's been a while since this question was asked), I suggest you update the question with more info regarding connect.getInputStream();.
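For what it's worth, the instanceof guard could look roughly like this. It is just a sketch under assumptions: it reads the XML from a String instead of connect.getInputStream(), and the class and method names (LastInstalmentReader, readLastInstalments) are invented for the example. The element names are the ones from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; reads from a String so the example is self-contained.
final class LastInstalmentReader {

    /** Collects the text of the first dLastInstalment under each PurchaseEntitites element. */
    static List<String> readLastInstalments(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> dates = new ArrayList<>();
            NodeList entries = doc.getElementsByTagName("PurchaseEntitites");
            for (int i = 0; i < entries.getLength(); i++) {
                Node entry = entries.item(i);
                if (!(entry instanceof Element)) {
                    continue; // guard before casting
                }
                Node date = ((Element) entry).getElementsByTagName("dLastInstalment").item(0);
                // instanceof is false for null, so a missing element is skipped too.
                if (date instanceof Element) {
                    dates.add(date.getTextContent());
                }
            }
            return dates;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read instalment dates", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ArrayOfPurchaseEntitites><PurchaseEntitites>"
                + "<dLastInstalment>8/31/2018 12:00:00 AM</dLastInstalment>"
                + "</PurchaseEntitites></ArrayOfPurchaseEntitites>";
        System.out.println(readLastInstalments(xml));
    }
}
```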

Handle invalid xml character while serializing

I have requirement where I need to serialize a document which contains a string like ンᅧᅭ%ンᅨ&. While serializing it throws the following exception:
java.io.IOException: The character '' is an invalid XML character
Is there a way we can serialize this String as is with any workaround?
StringWriter stringOut = new StringWriter();
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("company");
doc.appendChild(rootElement);
String xml = "ンᅧᅭ%ンᅨ&";
//String xml = "ンᅧᅭ%ンᅨ&";
Element junk = doc.createElement("replyToQ");
junk.appendChild(doc.createCDATASection(xml));
//junk.appendChild(doc.createTextNode(stripNonValidXMLCharacters(xml)));
rootElement.appendChild(junk);
//org.w3c.dom.Document doc = this.toDOM();
//Serialize DOM
OutputFormat format = new OutputFormat(doc,"UTF-8",true);
format.setIndenting(false);
format.setLineSeparator("");
format.setPreserveSpace(true);
format.setOmitXMLDeclaration(false);
XMLSerializer serial = new XMLSerializer( stringOut, format );
// As a DOM Serializer
serial.asDOMSerializer();
serial.serialize( doc.getDocumentElement() );
EDIT: I read your question as a deserialisation question, not serialization. Sorry.
The answer is that you need to escape them using Unicode entity escape strings.
For example, the character ン becomes &#x30F3;. See Japanese Katakana chart
Also see here XML Escaping
You need to pre-process the file to correctly escape the xml characters.
read each character in the file
if the character is invalid xml, escape it appropriately
write character to temporary file
at the end of the original file, overwrite original with temporary file.
Your file is now valid xml and can be parsed by standard means. It will most likely be bigger. Give the supplier of your file a telling off for writing a buggy xml writer ;)
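The pre-processing steps above could be sketched roughly as follows. One caveat that changes the technique slightly: XML 1.0 does not allow characters outside its Char range even when written as numeric character references, so this sketch replaces such characters with a caller-supplied placeholder instead of escaping them. It also works on a String rather than on a temporary file to keep it short. The names XmlCharFilter, isValidXmlChar and replaceInvalid are made up for illustration.

```java
// Hypothetical helper; filters a String against the XML 1.0 Char production.
final class XmlCharFilter {

    /** True if the code point is allowed in an XML 1.0 document. */
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 forbids with the given replacement. */
    static String replaceInvalid(String input, String replacement) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement); // pass "" to simply drop the character
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0001 is a control character that XML 1.0 does not permit.
        System.out.println(replaceInvalid("abc\u0001def", "?"));
    }
}
```

Running the cleaned string through createTextNode or createCDATASection should then avoid the invalid-character exception, at the cost of losing the original offending characters.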

Java DOM parser to filter out SOAP namespace-prefixes?

I am parsing a SOAP that has elements with namespace-prefixed names:
<ns1:equipmentType>G</ns1:equipmentType>
So the parser faithfully creates elements with namespace-prefixed names:
ns1:equipmentType
Can I somehow tell the parser to filter out all the namespace-prefixes? so the element names will be like:
equipmentType
My code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(inputStream);
I don’t see that you set the parser to be namespace-aware so that’s most probably the missing thing here.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(inputStream);
Then invoking getLocalName() on a node will give you the name without any prefix.
However if that’s not enough for you and you really want to get rid of the name-spaces at all you can use an XML transformation to create a new DOM tree without the name-spaces:
Transformer trans = TransformerFactory.newInstance().newTransformer(
new StreamSource(new StringReader("<?xml version='1.0'?>"
+"<stylesheet xmlns='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
+ "<template match='*' priority='1'>"
+ "<element name='{local-name()}'><apply-templates select='@*|node()'/></element>"
+ "</template>"
+ "<template match='@*' priority='0'>"
+ "<attribute name='{local-name()}'><value-of select='.'/></attribute>"
+ "</template>"
+ "<template match='node()' priority='-1'>"
+ "<copy><apply-templates select='@*|node()'/></copy>"
+ "</template>"
+"</stylesheet>")));
DOMResult result=new DOMResult();
trans.transform(new DOMSource(document), result);
document=(Document)result.getNode();
Make sure to do setNamespaceAware(true) on the DocumentBuilderFactory.
On the resulting Document instance you can then use getLocalName() to get element names without the prefix.
You can also use for example getElementsByTagNameNS() where you provide the namespace URI and the local tag name to find the elements you need. This decouples your code from the actual prefix that is being used for that namespace inside that particular document. I'm thinking this is actually what you're after.
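A small sketch of that lookup, in case it helps. The namespace URI urn:example:equipment and the wrapper element are assumptions made up for the demo (the question does not show the real namespace), and NamespaceLookup / findValues are illustrative names only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; finds elements by namespace URI and local name, ignoring the prefix.
final class NamespaceLookup {

    /** Returns "localName=text" for every element matching the URI and local name. */
    static List<String> findValues(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getLocalName() and the *NS methods
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, localName);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                Node node = matches.item(i);
                values.add(node.getLocalName() + "=" + node.getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The namespace URI below is an assumption for the example.
        String xml = "<ns1:root xmlns:ns1='urn:example:equipment'>"
                + "<ns1:equipmentType>G</ns1:equipmentType></ns1:root>";
        System.out.println(findValues(xml, "urn:example:equipment", "equipmentType"));
    }
}
```

The same call keeps working if the sender switches from ns1 to some other prefix, as long as the namespace URI stays the same.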

Convert a String to w3c.dom.Element: XMLParseException:Start of root element expected

I found the following piece of code from a blog, when running it I get an exception
XMLParseException:Start of root element expected. at 9th line.
Can any one explain why I get the Exception and suggest any other way for converting String to an element?
String s = "Hello DOM Parser";
java.io.InputStream sbis = new java.io.StringBufferInputStream(s);
javax.xml.parsers.DocumentBuilderFactory b = javax.xml.parsers.DocumentBuilderFactory.newInstance();
b.setNamespaceAware(false);
org.w3c.dom.Document doc = null;
javax.xml.parsers.DocumentBuilder db = null;
db = b.newDocumentBuilder();
doc = db.parse(sbis);
org.w3c.dom.Element e = doc.getDocumentElement();
To create a DOM Element with a custom tag (which I assume is what you want, but can't be sure), you can use the following approach:
String customTag = "HelloDOMParser";
Document doc = documentBuilder.newDocument();
String fullName = nameSpacePrefix + ":" + customTag;
Element customElement = doc.createElementNS(namespaceUri, fullName);
doc.appendChild(customElement);
I am assuming you know the namespace URI and your prefix (if any). If you don't use namespaces, just use the createElement() method instead.
As said in the comment, "Hello DOM Parser" is not an XML element, so the parser doesn't know what to do with it. I don't know what kind of document you are building, but if you want HTML you can embed the text in an HTML tag, for example:
<div>Hello DOM Parser</div>
<span>Hello DOM Parser</span>
If you are building XML, you can embed the text in any tag of your choosing:
<mytag>Hello DOM Parser</mytag>
Some explanation on DOM;
http://www.w3.org/DOM
To answer your question, to convert a String to a w3c Element, you can use createElement;
Element hello = document.createElement("hello");
hello.appendChild(document.createTextNode("Hello DOM Parser"));
This results in;
<hello>Hello DOM Parser</hello>
The parse method of DocumentBuilder accepts an input stream that contains the XML content. The following change will work for you:
String s = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root><elem1>Java</elem1></root>";
Try to avoid using the deprecated classes such as StringBufferInputStream. You can refer to the document below to know more about Java XML parsing.
http://www.java-samples.com/showtutorial.php?tutorialid=152
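As one possible replacement for StringBufferInputStream, the string can be wrapped in a StringReader and an InputSource. A minimal sketch, assuming the input is well-formed XML; the class and method names (StringToElement, parse) are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; turns a well-formed XML string into its root Element.
final class StringToElement {

    static Element parse(String xml) {
        try {
            // StringReader works on characters, so no byte encoding is involved.
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // A bare string such as "Hello DOM Parser" ends up here: it has no root element.
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        Element root = parse("<root><elem1>Java</elem1></root>");
        System.out.println(root.getTagName() + " -> " + root.getTextContent());
    }
}
```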

Xml document to DOM object using DocumentBuilderFactory

I am currently modifying a piece of code and I am wondering if the way the XML is formatted (tabs and spacing) will affect the way in which it is parsed into the DocumentBuilderFactory class.
In essence the question is...can I pass a big long string with no spacing into the DocumentBuilderFactory or does it need to be formatted in some way?
Thanks in advance. Included below is the class definition from Oracle's website.
Class DocumentBuilderFactory
"Defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. "
The documents will be different. Tabs and new lines will be converted into text nodes. You can eliminate these using the following method on DocumentBuilderFactory:
http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setIgnoringElementContentWhitespace(boolean)
But in order for it to work you must set up your DOM parser to validate the content against a DTD or xml schema.
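Here is a rough sketch of that setup using an inline DTD, so the parser can tell which whitespace sits in element-only content. The DTD, the sample document and the names IgnorableWhitespaceDemo / countRootChildren are all assumptions made up for the demo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo; counts the root's child nodes with and without whitespace stripping.
final class IgnorableWhitespaceDemo {

    // Sample document with an internal DTD (assumed for the example).
    static final String XML =
            "<!DOCTYPE root [\n"
            + "  <!ELEMENT root (A, B)>\n"
            + "  <!ELEMENT A EMPTY>\n"
            + "  <!ELEMENT B (#PCDATA)>\n"
            + "]>\n"
            + "<root>\n<A/>\n<B>tagB</B>\n</root>\n";

    static int countRootChildren(String xml, boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Validation against the DTD is what lets the parser identify ignorable whitespace.
            factory.setValidating(ignoreWhitespace);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);

            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // surface validation errors instead of ignoring them
                }
            });

            Element root = builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keeping whitespace:  " + countRootChildren(XML, false));
        System.out.println("ignoring whitespace: " + countRootChildren(XML, true));
    }
}
```

With the whitespace kept, the root should report the text nodes between the elements as extra children; with validation and the ignore flag on, only A and B should remain.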
Alternatively you could programmatically remove the extra whitespace yourself using something like the following:
public static void removeEmptyTextNodes(Node node) {
NodeList nodeList = node.getChildNodes();
Node childNode;
for (int x = nodeList.getLength() - 1; x >= 0; x--) {
childNode = nodeList.item(x);
if (childNode.getNodeType() == Node.TEXT_NODE) {
if (childNode.getNodeValue().trim().equals("")) {
node.removeChild(childNode);
}
} else if (childNode.getNodeType() == Node.ELEMENT_NODE) {
removeEmptyTextNodes(childNode);
}
}
}
It should not affect the ability of the parser as long as the string is valid XML. Tabs and newlines are stripped out or ignored by parsers and are really for the aesthetics of the human reader.
Note you will have to pass in an input stream (StringBufferInputStream for example) to the DocumentBuilder as the string version of parse assumes it is a URI to the XML.
The DocumentBuilder builds different DOM objects for xml string with line feeds and xml string without line feeds. Here is the code I tested:
StringBuilder sb = new StringBuilder();
sb.append("<root>").append(newlineChar).append("<A>").append("</A>").append(newlineChar).append("<B>tagB").append("</B>").append("</root>");
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputStream xmlInput = new ByteArrayInputStream(sb.toString().getBytes());
Element documentRoot = builder.parse(xmlInput).getDocumentElement();
NodeList nodes = documentRoot.getChildNodes();
System.out.println("How many children does the root have? => " + nodes.getLength());
for(int index = 0; index < nodes.getLength(); index++){
System.out.println(nodes.item(index).getLocalName());
}
Output:
How many children does the root have? => 4
null
A
null
B
But if newlineChar is removed from the StringBuilder,
the output is:
How many children does the root have? => 2
A
B
This demonstrates that the DOM objects generated by DocumentBuilder are different.
The format of the XML string shouldn't have any effect, but I remember a strange problem when I passed a long string to an XML parser: the parser was unable to parse an XML file that was written as one long line.
It may be better to insert line breaks so that no line is longer than, let's say, 1000 bytes.
Sadly, I remember neither why that error occurred nor which parser I used.
