How to remove CDATA from XML in Java and do some conversion?

I am trying to create a Java Servlet that will modify existing XML.
This is a part of my original XML:
<customfieldvalues>
<div id="errorDiv" style="display:none;"/>
<![CDATA[
Vinduer, dører
]]>
</customfieldvalues>
I want to get the following result:
<customfieldvalues>
<div id="errorDiv" style="display:none;"/>
Vinduer, dører
</customfieldvalues>
I iterate over the XML structure with:
Document doc = parseXML(connection.getInputStream());
NodeList descNodes = doc.getElementsByTagName("customfieldvalues");
for (int i=0; i<descNodes.getLength();i++) {
Node node = descNodes.item(i);
// how to ?
}
So, I need to remove CDATA and convert the content.
I saw that I can use this for the conversion.

From the javax.xml.parsers.DocumentBuilderFactory.setCoalescing API documentation:
Specifies that the parser produced by this code will convert CDATA nodes to Text nodes and append it to the adjacent (if any) text node. By default the value of this is set to false.
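A minimal sketch of how that setting could be applied, assuming the XML is available as a string. The class name CoalescingDemo, the helper countCdataChildren and the sample string are made up for illustration; only setCoalescing and the standard DOM calls come from the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CoalescingDemo {

    // Sample modelled on the fragment in the question.
    static final String SAMPLE =
            "<customfieldvalues><div id=\"errorDiv\" style=\"display:none;\"/>"
            + "<![CDATA[\nVinduer, d\u00f8rer\n]]></customfieldvalues>";

    // Hypothetical helper: parses the XML and counts the CDATA section nodes
    // directly under the first <customfieldvalues> element.
    static int countCdataChildren(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // true: the parser turns CDATA sections into ordinary Text nodes
            factory.setCoalescing(coalescing);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getElementsByTagName("customfieldvalues")
                    .item(0).getChildNodes();
            int count = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.CDATA_SECTION_NODE) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("CDATA nodes by default: " + countCdataChildren(SAMPLE, false));
        System.out.println("CDATA nodes with coalescing: " + countCdataChildren(SAMPLE, true));
    }
}
```

The flag has to be set on the factory before newDocumentBuilder() is called, so in the question's code it would belong inside parseXML().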

Related

Why is DOM doing this? (Wrong nodeName XML)

I have this XML (just a little part.. the complete xml is big)
<Root>
<Products>
<Product ID="307488">
<ClassificationReference ClassificationID="AR" Type="AgencyLink"/>
<ClassificationReference ClassificationID="AM" Type="AgencyLink">
<MetaData>
<Value AttributeID="tipoDeCompra" ID="C">Compra Centralizada</Value>
</MetaData>
</ClassificationReference>
</Product>
</Products>
</Root>
Well... I want to get the data from the line
<Value AttributeID="tipoDeCompra" ID="C">Compra Centralizada</Value>
I'm using DOM. When I call nodoValue.getTextContent() I get "Compra Centralizada", which is fine.
But when I call nodoValue.getNodeName() I get "MetaData", although I was expecting "Value".
What is the explanation for this behaviour?
Thanks!
Your nodoValue variable most likely points to the MetaData node, so the returned name is correct.
Note that for an element node Node.getTextContent() returns the concatenation of the text content of all child nodes. Therefore in your example the text content of the MetaData element is equal to the text content of the Value element, namely Compra Centralizada.
I guess you are getting the Node object using getElementsByTagName("MetaData"). In this case nodoValue.getTextContent() returns the text content correctly, but to get the node name "Value" you need to move to the child node.
Your current node must be MetaData, and getTextContent() gives all the text between its opening and closing tags. That is why you are getting
Compra Centralizada
as the value. To reach the Value tag, walk the list returned by getChildNodes() and pick the first element node; when the XML is indented, the very first child is a whitespace text node, not the Value element.
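The difference between the two nodes can be sketched as follows. This is an illustration only: NodeNameDemo, firstByTag and firstChildElementName are hypothetical names, and the sample is reduced to the MetaData fragment from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeNameDemo {

    // Indented on purpose, so that MetaData has whitespace text nodes as children.
    static final String SAMPLE =
            "<MetaData>\n"
            + "  <Value AttributeID=\"tipoDeCompra\" ID=\"C\">Compra Centralizada</Value>\n"
            + "</MetaData>";

    // Hypothetical helper: returns the first element with the given tag name.
    static Node firstByTag(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    // Skips text nodes and returns the name of the first child element, or null.
    static String firstChildElementName(Node parent) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return child.getNodeName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node metaData = firstByTag(SAMPLE, "MetaData");
        System.out.println(metaData.getNodeName());               // name of the parent
        System.out.println(metaData.getTextContent().trim());     // text of all descendants
        System.out.println(firstChildElementName(metaData));      // name of the child element
    }
}
```

Calling getElementsByTagName("Value") directly would avoid the child lookup altogether.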

Reading XML tag from MediaWiki using Java

I need to read the output of the 'search' tag from the following URL using Java.
First I need to read the XML into a string from this URL:
http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother
I should end up having this:
<api>
<query-continue>
<search sroffset="1"/>
</query-continue>
<query>
<searchinfo totalhits="55180"/>
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
</query>
</api>
Then once I have the XML, I need to get content of the search tag:
Output of 'search' tag looks like this and I need to get two parts from the code in the middle:
<search>
<p ns="0" title="Big Brothers Big Sisters of America" snippet="<span class='searchmatch'>Big</span> <span class='searchmatch'>Brothers</span> <span class='searchmatch'>Big</span> Sisters of America is a 501(c)(3) non-profit organization whose goal is to help all children reach their potential through <b>...</b> " size="13008" wordcount="1906" timestamp="2014-04-15T06:46:01Z"/>
</search>
In the end, all I need is two strings with these values:
String title = Big Brothers Big Sisters of America
String snippet = "<span class='searchmatch'>Big..."
Can someone please help me amend this code? I am not sure what I am doing wrong. I don't think it is even retrieving the XML from the URL, much less the tags inside the XML.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("http://en.wikipedia.org/w/api.php?format=xml&action=query&list=search&srlimit=1&srsearch=big+brother");
doc.getDocumentElement().normalize();
XPathFactory xFactory = XPathFactory.newInstance();
XPath xpath = xFactory.newXPath();
XPathExpression expr = xpath.compile("//query/search/text()");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i=0; i<nodes.getLength();i++){
System.out.println(nodes.item(i).getNodeValue());
}
Sorry, I am a newbie and can't find the answer to this anywhere.
The main problem here is that you're asking for text nodes that are children of <search>, but in fact the <p ..> that you want is not a text node: it's an element. (In fact, the <search> element has no text node children, as you can tell when you view the response from that URL using "View Source".)
So what you want to do is change your XPath expression to
//query/search/p
which will give you the p element node. Then ask for the value of this node's two attributes title and snippet in your Java code:
Element e = (Element)(nodes.item(i));
String title = e.getAttribute("title");
String snippet = e.getAttribute("snippet");
Or, you could do two XPath queries, one for each attribute:
//query/search/p/@title
and
//query/search/p/@snippet
assuming there will only be one <p> element. If you were doing this over multiple <p> elements, you'd probably want to keep each pair of attributes together instead of having two separate lists of results.
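Put together, the element-based approach might look like the sketch below. It parses a shortened local copy of the response instead of calling the URL, and SearchResultDemo and firstTitleAndSnippet are hypothetical names; the markup inside the snippet attribute is escaped, as it has to be in well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SearchResultDemo {

    // Shortened stand-in for the API response shown in the question.
    static final String SAMPLE =
            "<api><query><searchinfo totalhits=\"55180\"/><search>"
            + "<p ns=\"0\" title=\"Big Brothers Big Sisters of America\""
            + " snippet=\"&lt;span class='searchmatch'&gt;Big&lt;/span&gt; Brothers\"/>"
            + "</search></query></api>";

    // Returns {title, snippet} of the first <p> element, or null if there is none.
    static String[] firstTitleAndSnippet(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select the <p> elements themselves, not text nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//query/search/p", doc, XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return null;
            }
            Element p = (Element) nodes.item(0);
            return new String[] { p.getAttribute("title"), p.getAttribute("snippet") };
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String[] result = firstTitleAndSnippet(SAMPLE);
        System.out.println("title = " + result[0]);
        System.out.println("snippet = " + result[1]);
    }
}
```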

Does the org.dom4j.io.SAXReader.read(Reader reader) method preserve the order of elements and attributes of XML?

My XML file is:
<XYZ>
<A name="one">
<label>I am A one</label>
</A>
<B name="two">
<label>I am B two</label>
</B>
<A name="three">
<label>I am A three</label>
</A>
</XYZ>
My Code is:
String myXmlAsString = //Read the above xml as String
Document document = new SAXReader().read(new StringReader(myXmlAsString ));
List<Element> dataElements = document.selectNodes("/XYZ");
My Question is:
If I read my XML file with the code above, will the dataElements List returned by selectNodes(String xPathExpr) have the same order as the original XML file?
If yes, does this hold true even if the XML has deep nesting and I call selectNodes(String xPathExpr) on any Element object from this document?
XPath does not change the order of elements when returning results, so the elements are exactly in the same order as in your input xml.
Lists are ordered structures. There is no reason for the SAXReader to remove that order.

XPath - Get id attribute from parent element

I have the following XML file:
<diagnostic version="1.0">
<!-- diagnostic panel 1 -->
<panel xml:id="0">
<!-- list controls -->
<control xml:id="0_0">
<settings description="text 1"/>
</control>
<control xml:id="0_1">
<settings description="text 2"/>
</control>
</panel>
<panel xml:id="1">
<!-- list controls -->
<control xml:id="1_0">
<settings description="text 3"/>
</control>
<control xml:id="1_1">
<settings description="text 4"/>
</control>
</panel>
</diagnostic>
and this XPath expression:
//*[not(@description='-')]/@description
and Java code:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("diagnostic.xml");
XPath xpath = XPathFactory.newInstance().newXPath();
// XPath Query for showing all nodes value
XPathExpression expr = xpath.compile("//*[not(@description='-')]/@description");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(i + ": " + nodes.item(i).getParentNode() + nodes.item(i).getNodeValue());
}
This XPath expression returns the values of all description attributes whose value is not '-'.
Output:
text 1
text 2
text 3
text 4
But along with each description attribute I also need the xml:id attribute of the enclosing control element.
Output:
0_0 text 1
0_1 text 2
1_0 text 3
1_1 text 4
How can I also get the xml:id of the control element for each description? I need to know which control element a given description belongs to.
Someone correct me if I'm wrong, but I don't think this can be done with a single XPath expression. The concat function returns a single text result, not a list. I suggest you run multiple XPath expressions and construct your results from that, or run a single XPath expression to get the settings elements you need, then take the description attribute from it and concatenate it with the xml:id attribute from the parent element if that's a control one.
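Nodes keep references to their parents; use getParentNode() to obtain them. Attribute nodes are the exception: getParentNode() returns null for an Attr, so use Attr.getOwnerElement() to reach the element that carries the attribute.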
Here's an alternative: run this XPath expression...
//control[settings[@description!='-']]/@xml:id | //control/settings[@description!='-']/@description
... and then concatenate the text of the alternating results in the returned node list. In other words, text from item 0 + item 1, text from item 2 + item 3 etc.
The above XPath expression will return this node list:
0_0
text 1
0_1
text 2
1_0
text 3
1_1
text 4
You can then parse through that list and construct your results.
Be careful. This will only work if there's at most 1 settings element per control element. Also, you may find that on evaluation the XPath engine throws an error for that xml: prefix. It may say that it's unknown. You might have to bind that prefix to the correct namespace first. Since the xml prefix is reserved and bound by default to a specific namespace, this might not be needed. I'm not certain as I haven't used it before.
I've tested the expression in XMLSpy. It's not entirely impossible that the XPath engine used in Java (or the one you set for use) returns the nodes in another order. It might evaluate both parts of the "or" (the pipe symbol) separately and then dump the results into a single node list. I don't know what the XPath spec mandates regarding result ordering.
I may be just as wrong, but the nodes you traverse in the result are the XML nodes themselves. Your code sample is almost there:
- nodes.item(i) points to the attribute "description".
- ((Attr) nodes.item(i)).getOwnerElement() points to the tag "settings". (getParentNode() returns null for attribute nodes.)
- ((Attr) nodes.item(i)).getOwnerElement().getParentNode() would point to the tag "control" (class Element). You could then use getAttribute() or getAttributeNS() on that node to get the attribute you need.
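The navigation from the attribute up to the control element can be sketched like this. It is a rough illustration under a few assumptions: the XML is parsed from a string, the XPath is narrowed to settings elements, and OwnerElementDemo and idsWithDescriptions are invented names. The step from the attribute to its element uses Attr.getOwnerElement(), because DOM attribute nodes have no parent node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OwnerElementDemo {

    // Same structure as the question's diagnostic.xml, without the comments.
    static final String SAMPLE =
            "<diagnostic version=\"1.0\">"
            + "<panel xml:id=\"0\">"
            + "<control xml:id=\"0_0\"><settings description=\"text 1\"/></control>"
            + "<control xml:id=\"0_1\"><settings description=\"text 2\"/></control>"
            + "</panel>"
            + "<panel xml:id=\"1\">"
            + "<control xml:id=\"1_0\"><settings description=\"text 3\"/></control>"
            + "<control xml:id=\"1_1\"><settings description=\"text 4\"/></control>"
            + "</panel>"
            + "</diagnostic>";

    // Returns one "id description" string per matching description attribute.
    static List<String> idsWithDescriptions(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//settings[@description!='-']/@description", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr description = (Attr) attrs.item(i);
                // Attribute nodes have no parent; the owner element is <settings>.
                Element settings = description.getOwnerElement();
                Node parent = settings.getParentNode();
                if (parent instanceof Element && "control".equals(parent.getNodeName())) {
                    // Looked up by qualified name; getAttributeNS is the alternative.
                    String id = ((Element) parent).getAttribute("xml:id");
                    result.add(id + " " + description.getValue());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the sample XML", e);
        }
    }

    public static void main(String[] args) {
        for (String line : idsWithDescriptions(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Keeping each id next to its description in one loop avoids relying on how two separate result lists line up.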

How to insert XmlCursor content to DOM Document

Some API returns an XmlCursor pointing at the root of an XML document. I need to insert all of it into another document represented as an org.w3c.dom Document.
At start:
XmlCursor pointing at
<a>
<b>
some text
</b>
</a>
DOM Document:
<foo>
</foo>
At the end I want to have original DOM document changed like this:
<foo>
<someOtherInsertedElement>
<a>
<b>
some text
</b>
</a>
</someOtherInsertedElement>
</foo>
NOTE: document.importNode(cursor.getDomNode()) doesn't work - Exception is thrown: NOT_SUPPORTED_ERR: The implementation does not support the requested type of object or operation.
Try something like this:
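Node originalNode = cursor.getDomNode();
// importNode takes a second argument; true copies the whole subtree
Node importNode = document.importNode(originalNode.getFirstChild(), true);
Node otherNode = document.createElement("someOtherInsertedElement");
otherNode.appendChild(importNode);
// append under the existing root <foo>; a Document can hold only one root element
document.getDocumentElement().appendChild(otherNode);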
So in other words:
Get the DOM Node from the cursor. In this case, it's a DOMDocument, so do getFirstChild() to get the root node.
Import it into the DOMDocument.
Do other stuff with the DOMDocument.
Append the imported node to the right Node.
The reason to import is that a node always "belongs" to a given DOMDocument. Just adding the original node would cause exceptions.
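The behaviour can be reproduced with plain JDK DOM documents, without XmlBeans. In the sketch below a second parsed Document stands in for the node returned by cursor.getDomNode(), on the assumption stated above that this node is a Document; ImportNodeDemo and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ImportNodeDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample XML", e);
        }
    }

    // Importing a Document node itself is not supported by the DOM.
    static boolean importingDocumentNodeIsRejected() {
        Document source = parse("<a><b>some text</b></a>");
        Document target = parse("<foo/>");
        try {
            target.importNode(source, true);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.NOT_SUPPORTED_ERR;
        }
    }

    // Imports the root element instead and returns the resulting element path.
    static String insertWrapped() {
        Document source = parse("<a><b>some text</b></a>");
        Document target = parse("<foo/>");
        // true requests a deep copy of the whole <a> subtree
        Node imported = target.importNode(source.getDocumentElement(), true);
        Element wrapper = target.createElement("someOtherInsertedElement");
        wrapper.appendChild(imported);
        target.getDocumentElement().appendChild(wrapper);

        // Walk down the first children to describe the new structure.
        StringBuilder path = new StringBuilder();
        for (Node n = target.getDocumentElement(); n instanceof Element; n = n.getFirstChild()) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(n.getNodeName());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println("document node rejected: " + importingDocumentNodeIsRejected());
        System.out.println("resulting path: " + insertWrapped());
    }
}
```

getDocumentElement() is used here instead of getFirstChild() because the first child of a Document can be a comment or processing instruction instead of the root element.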
I was having the same issue.
This was failing:
Node importNode = document.importNode(originalNode, true);
This fixed the problem:
Node importNode = document.importNode(originalNode.getFirstChild(), true);
