stax xml confusion with getname function - java

I have an XML file like this:
<comment type="PTM">
<text evidence="19">Sumoylated following its interaction with PIAS1 and UBE2I.</text>
</comment>
<comment type="PTM">
<text evidence="17">Ubiquitinated, leading to proteasomal degradation.</text>
</comment>
<comment type="disease">
<text>A chromosomal aberration involving ZMYND11 is a cause of acute poorly differentiated myeloid leukemia. Translocation (10;17)(p15;q21) with MBTD1.</text>
</comment>
<comment type="disease" evidence="23">
<disease id="DI-04257">
<name>Mental retardation, autosomal dominant 30</name>
<acronym>MRD30</acronym>
<description>A disorder characterized by significantly below average general intellectual functioning associated with impairments in adaptive behavior and manifested during the developmental period. MRD30 patients manifest mild intellectual disability and subtle facial dysmorphisms, including hypertelorism, ptosis, and a wide mouth.</description>
<dbReference type="MIM" id="616083"/>
</disease>
<text>The disease is caused by mutations affecting the gene represented in this entry.</text>
</comment>
<comment type="similarity">
<text evidence="8">Contains 1 bromo domain.</text>
</comment>
<comment type="similarity">
<text evidence="9">Contains 1 MYND-type zinc finger.</text>
</comment>
I use StAX to extract the disease information. This is part of my code:
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader( new FileReader(p));
while(eventReader.hasNext()){
XMLEvent event = eventReader.nextEvent();
switch(event.getEventType()){
case XMLStreamConstants.START_ELEMENT:
StartElement startElement = event.asStartElement();
String qName = startElement.getName().getLocalPart();
if (qName.equalsIgnoreCase("comment")) {
System.out.println("Start Element : comment");
Iterator<Attribute> attributes = startElement.getAttributes();
Attribute a = attributes.next();
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
System.out.println("Roll No : " + type);
} else if(qName.equalsIgnoreCase("text") && type.equals("disease")){ text = true; }
break;
case XMLStreamConstants.CHARACTERS:
Characters characters = event.asCharacters();
if(text){ res = res + " " + characters.getData();
//System.out.println("TEXT: " + res);
text = false;
}
break;
case XMLStreamConstants.END_ELEMENT:
EndElement endElement = event.asEndElement();
if(endElement.getName().getLocalPart().equalsIgnoreCase("comment")){
//System.out.println("End Element : comment");
//System.out.println();
}
break;
For this type of line:
<comment type="disease">
I can extract the info correctly, but when I try to find comment type "disease" in this line:
<comment type="disease" evidence="23">
it gives me type=evidence and not type=disease as it should be. Therefore it doesn't save anything from this kind of line.

First of all, can we please get in the habit of using useful variable names? You have the following variables: a (an Attribute), text (a boolean), qName (a String). These names leave me scratching my head and wondering what they are:
a - Just not a useful name. It should be something like typeAttr, noting that it holds the type="" attribute.
text - It's a boolean?! Maybe collectText would be more appropriate, since it signals that you should collect the value of the next text event.
qName - It's a String holding the local part of a QName. If it's not a QName, don't name it as one.
But that's enough ranting, you get the idea. Your problem lies in where you get the attribute. In XML, attributes have no specific order, and you should not expect them to be returned in the order in which they are defined. In your code you have the following:
Iterator<Attribute> attributes = startElement.getAttributes();
Attribute a = attributes.next();
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
Here you get the first attribute from the element and set the type equal to its value. As I mentioned, XML attributes have no specific order, so you are getting the evidence attribute. You should be getting the attribute by name:
Attribute a = startElement.getAttributeByName(QName.valueOf("type"));
System.out.println("ATRIBUTES " + a.getName());
type = a.getValue();
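For completeness, here is a minimal, self-contained sketch of the lookup by name. The class name CommentTypes, the method commentTypes and the <entry> wrapper element are made up for illustration; only the javax.xml.stream calls come from the standard API. It also guards against a comment that has no type attribute at all, since getAttributeByName returns null in that case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class CommentTypes {

    /** Returns the "type" attribute of every <comment> element, in document order (null if absent). */
    public static List<String> commentTypes(String xml) throws XMLStreamException {
        List<String> types = new ArrayList<>();
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()) {
                continue;
            }
            StartElement start = event.asStartElement();
            if (!"comment".equals(start.getName().getLocalPart())) {
                continue;
            }
            // Look the attribute up by name instead of relying on iteration order.
            Attribute typeAttr = start.getAttributeByName(new QName("type"));
            types.add(typeAttr == null ? null : typeAttr.getValue());
        }
        reader.close();
        return types;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample: attribute order differs between the comments on purpose.
        String xml = "<entry>"
                + "<comment type=\"disease\" evidence=\"23\"><text>a</text></comment>"
                + "<comment evidence=\"8\" type=\"similarity\"><text>b</text></comment>"
                + "<comment><text>c</text></comment>"
                + "</entry>";
        System.out.println(commentTypes(xml)); // [disease, similarity, null]
    }
}
```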

Sorry, no direct answer, but a comment on how to use StAX or XmlPull effectively: streaming XML parsers are designed to be friendly to recursive descent parsing (avoiding explicit state modeling, something you'd often need with a SAX parser). In your case I'd expect the following methods (rejecting or ignoring all unexpected content):
Comment parseComment(XMLEventReader eventReader) {
// call parseText and parseDisease for the corresponding element starts
}
Text parseText(XMLEventReader eventReader) {
}
Disease parseDisease(XMLEventReader eventReader) {
}
That said, there is a tradeoff: if you don't need the streaming aspect (performance), you may be better off just parsing to a DOM and then extracting the information as needed by walking or peeking into the DOM, avoiding a low-level XML API altogether.
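To make the recursive descent idea concrete, here is a rough sketch under a few assumptions: the class DiseaseComments and its helper names are invented, the result is simplified to a list of strings instead of Comment/Text/Disease objects, and each <text> element is assumed to contain only character data (getElementText throws otherwise). Each method is called right after a start tag and consumes everything up to the matching end tag, so no boolean flags are needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class DiseaseComments {

    /** Reads the whole document and returns the <text> content of every disease comment. */
    public static List<String> parseDocument(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement() && isNamed(event.asStartElement(), "comment")) {
                parseComment(reader, event.asStartElement(), texts);
            }
        }
        reader.close();
        return texts;
    }

    /** Called right after a <comment> start tag; consumes events up to its end tag. */
    static void parseComment(XMLEventReader reader, StartElement comment, List<String> texts)
            throws XMLStreamException {
        Attribute typeAttr = comment.getAttributeByName(new QName("type"));
        boolean isDisease = typeAttr != null && "disease".equals(typeAttr.getValue());
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isEndElement()) {
                return; // children consume their own end tags, so this is </comment>
            }
            if (!event.isStartElement()) {
                continue; // whitespace between child elements
            }
            if (isDisease && isNamed(event.asStartElement(), "text")) {
                texts.add(reader.getElementText()); // also consumes </text>
            } else {
                skipElement(reader); // e.g. the nested <disease> block
            }
        }
    }

    /** Called right after a start tag; discards everything up to the matching end tag. */
    static void skipElement(XMLEventReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
        }
    }

    static boolean isNamed(StartElement element, String name) {
        return name.equals(element.getName().getLocalPart());
    }
}
```

Calling DiseaseComments.parseDocument(xml) on a document shaped like the one in the question would return one string per disease comment, regardless of whether the comment also carries an evidence attribute or a nested <disease> element.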

By using StAX I assume you are dealing with a large document, or a platform with limited resources... the fact is that memory overhead is largely a DOM-related issue. VTD-XML, on the other hand, is far more efficient than DOM while retaining virtually all benefits of the DOM style of coding... please read this research paper for more info:
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
import com.ximpleware.*;
public class queryAttr {
public static void main(String[] s) throws VTDException{
VTDGen vg = new VTDGen();
vg.selectLcDepth(5);// improve XPath performance for deep document
if (!vg.parseFile("input.xml", false))
return;
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/root/comment[@type='disease' and @evidence='23']");
int i=0,j=0;
while((i=ap.evalXPath())!=-1){
if (vn.toElement(VTDNav.FIRST_CHILD)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText(); // index of the text node, or -1 if there is none
if (j!=-1)
System.out.println(""+vn.toString(j));
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
if (vn.toElement(VTDNav.NS)){
System.out.println(" element name: "+ vn.toString(vn.getCurrentIndex()));
j=vn.getText();
if (j!=-1)
System.out.println("text node==>"+vn.toString(j));
}
vn.toElement(VTDNav.PARENT);
}
}
}
}

Related

compare xml files by choosing specific tag name for comparison, list attribute difference of that specific tag

Consider:
<ParentTag>
<Firstchild ID="id1" Title="title1">
<secondchild ID="" Title="">
<TagOfInterest ID="value1" Title=""/>
<TagOfInterest ID="value2" Title=""/>
<TagOfInterest ID="value3" Title=""/>
</secondchild>
</Firstchild>
</ParentTag>
And:
<secondxml>
<something ID="id1" Title="title1">
<anotherthing ID="" Title="">
<TagOfInterest ID="value1" Title=""/>
<TagOfInterest ID="dinosaur" Title=""/>
<TagOfInterest ID="nomore" Title=""/>
</anotherthing>
</something>
</secondxml>
I'm using XMLUnit.
Req 1: The comparison engine should compare only elements with the tag name "TagOfInterest".
Req 2: Within that tag, if a difference exists, compare by attribute.
My implementation below only printed tag names and didn't give much control over the tag of interest or over attribute differences within it. Is there a better way to do this with XMLUnit?
fr1 = new FileReader(expectedXML);
fr2 = new FileReader(actualXML);
Diff diff = new Diff(fr1, fr2);
DetailedDiff detDiff = new DetailedDiff(diff);
detDiff.overrideMatchTracker(new MatchTrackerImpl());
detDiff.overrideElementQualifier(new ElementNameQualifier());
detDiff.getAllDifferences();
class MatchTrackerImpl implements MatchTracker {
public void matchFound(Difference difference) {
if (difference != null) {
NodeDetail controlNode = difference.getControlNodeDetail();
NodeDetail testNode = difference.getTestNodeDetail();
System.out.println(printNode(controlNode.getNode()));
System.out.println(printNode(testNode.getNode()));
}
}
private static String printNode(Node node) {
if (node != null && node.getNodeType() == Node.ELEMENT_NODE) {
StringWriter sw = new StringWriter();
try {
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(node), new StreamResult(sw));
} catch (TransformerException te) {
System.out.println("nodeToString Transformer Exception");
}
return sw.toString();
}
return null;
}
}
From your example I understand that the "TagOfInterest" elements appear as part of a forest and not as a single element tree inside the document. Otherwise you could use XMLUnit 2.x and apply the DifferenceEngine only to the elements of interest.
There is no built-in way to do this in XMLUnit; you will have to provide your own implementation of one of XMLUnit's extension points.
The interface you are looking for is DifferenceListener in XMLUnit 1.x and DifferenceEvaluator in 2.x. Those are responsible for determining whether a difference detected by the engine is supposed to be reported (and if so, how severe it is).
You could provide an implementation of your own that downgrades all differences detected for nodes that are not named "TagOfInterest". If you are interested in children of "TagOfInterest" as well, it may become a bit more complex, as you'd need to travel up the tree to see whether there is a "TagOfInterest" parent.
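If pulling in a custom XMLUnit extension feels heavy for this, the two requirements can also be approximated without XMLUnit at all. The sketch below deliberately swaps in plain JAXP DOM, not the XMLUnit API: the class TagOfInterestDiff and its method names are invented, elements are paired purely by document order, and only attributes present on the control element are checked.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagOfInterestDiff {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Pairs elements named tagName by document order and lists their attribute differences. */
    public static List<String> diff(String controlXml, String testXml, String tagName)
            throws Exception {
        NodeList control = parse(controlXml).getElementsByTagName(tagName);
        NodeList test = parse(testXml).getElementsByTagName(tagName);
        List<String> differences = new ArrayList<>();
        if (control.getLength() != test.getLength()) {
            differences.add("count: " + control.getLength() + " vs " + test.getLength());
        }
        int common = Math.min(control.getLength(), test.getLength());
        for (int i = 0; i < common; i++) {
            Element c = (Element) control.item(i);
            Element t = (Element) test.item(i);
            // Only attributes that exist on the control element are compared.
            NamedNodeMap attrs = c.getAttributes();
            for (int a = 0; a < attrs.getLength(); a++) {
                String name = attrs.item(a).getNodeName();
                String expected = attrs.item(a).getNodeValue();
                String actual = t.hasAttribute(name) ? t.getAttribute(name) : null;
                if (!expected.equals(actual)) {
                    differences.add(tagName + "[" + i + "]/@" + name + ": "
                            + expected + " vs " + actual);
                }
            }
        }
        return differences;
    }
}
```

For the two documents in the question, diff(expected, actual, "TagOfInterest") would report the ID mismatches on the second and third elements (indices 1 and 2) and nothing about the differently named parent elements.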

Java regular expression to remove empty xml nodes and childrens completely

I am struggling to find the best solution. Below is my XML :
<Dbtr>
<Nm>John doe</Nm>
<Id>
<OrgId>
<Othr>
<Id/>
</Othr>
</OrgId>
</Id>
</Dbtr>
This should be replaced with the following:
<Dbtr>
<Nm>John doe</Nm>
</Dbtr>
So all the empty nodes and children without any values should be left out.
I am using the following expression, and it doesn't work the way I want:
docStr = docStr.replaceAll("<(\\w+)></\\1>|<\\w+/>", "");
Any help would be really appreciated.
Edit :
I am creating this XML (not parsing it). It will be sent to a clearing house, which will reject the message because of these empty tags. How the XML is created is out of my hands: I just provide the values from the DB and, as you can see, some of the values are empty. The code that writes the XML (which I have no control over) writes out the tag first and then the value; all I can control is not writing "null".
My best bet for now is to take the output XML as it is, clean it up with some regex logic, and end up with XML without empty tags that can pass schema validation.
String xml = ""
+ "<Dbtr>"
+ " <Nm>John doe</Nm>"
+ " <Id>"
+ " <OrgId>"
+ " <Othr>"
+ " <Id/>"
+ " </Othr>"
+ " </OrgId>"
+ " </Id>"
+ "</Dbtr>";
while (true) {
String repl = xml.replaceAll("<(\\w+)>\\s*</\\1>|<\\w+/>", "");
if (repl.length() == xml.length())
break;
xml = repl;
}
System.out.println(xml);
// -> <Dbtr> <Nm>John doe</Nm> </Dbtr>
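The regex loop above only recognises tags without attributes or namespace prefixes. If that ever becomes a limitation, one alternative (not what the code above does, and only a sketch) is to parse the string into a DOM and prune empty elements bottom-up. The class EmptyElementPruner and its method names are hypothetical, and treating an element with attributes as non-empty is an assumption about what the schema expects.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EmptyElementPruner {

    /** Removes empty descendants bottom-up; returns true if the element itself ends up empty. */
    static boolean pruneAndCheckEmpty(Element element) {
        boolean hasContent = element.hasAttributes(); // assumption: attributes count as content
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            short type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                if (pruneAndCheckEmpty((Element) child)) {
                    element.removeChild(child);
                } else {
                    hasContent = true;
                }
            } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                if (!child.getNodeValue().trim().isEmpty()) {
                    hasContent = true;
                }
            }
            child = next;
        }
        return !hasContent;
    }

    /** Parses the XML string, prunes empty elements below the root, and serializes it again. */
    public static String stripEmptyElements(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        pruneAndCheckEmpty(doc.getDocumentElement());
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Note that whitespace-only text nodes between removed elements are left in place, so indented input may come out with blank lines; the root element is never removed, even if everything below it was empty.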

How to read namespace as it is in a xml using XMLStreamReader?

I have an XML file which I read using an XMLStreamReader object.
I'll keep it simple.
Let's take this XML example:
<mySample xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" attribute1="value1"/>
What I need is to get the name "xmlns:xsi" (as a String) and the value "http://www.w3.org/2001/XMLSchema-instance" (also as a String).
I tried a test like this:
if (reader.getEventType() != XMLStreamConstants.NAMESPACE){
attributeName = reader.getAttributeLocalName(i);
attributeValue = reader.getAttributeValue(i);
}
else{
attributeName = reader.getNamespacePrefix(i) + reader.getNamespaceURI(i);
attributeValue = reader.getAttributeValue(i);
}
But it did not work.
Obviously I missed something, being new to this API, so any help would be very welcome.
The JSR-173 specification (StAX API for Java) states the following regarding the NAMESPACE event:
Namespace
Namespace declarations can also exist outside of a StartElement and may be reported as a
standalone information item. In general Namespaces are reported as part of a StartElement
event. When namespaces are the result of an XQuery or XPath expression they may be
reported as standalone events.
So if you are looking at namespace events, you should most probably be checking StartElement events, and inspect them. Once again, from the spec :
Namespaces can be accessed using the following methods:
int getNamespaceCount();
String getNamespacePrefix(int index);
String getNamespaceURI(int index);
Only the namespaces declared on the current StartElement are available. The list does
not contain previously declared namespaces and does not remove redeclared namespaces.
At any point during the parsing, you can get the current complete namespace context :
The namespace context of the current state is available by calling
XMLStreamReader.getNamespaceContext() or
StartElement.getNamespaceContext(). These methods return an instance of the
javax.xml.namespace.NamespaceContext interface.
That's the theory: most namespace declarations come from START_ELEMENT, some may come independently.
In practice, I have never come across a NAMESPACE event reported by the API when reading from a file. It's almost always reported as part of a START_ELEMENT (and repeated in the corresponding END_ELEMENT), so you must check START_ELEMENT if you are interested in namespace declarations. For example, starting with your document:
String xml = "<?xml version=\"1.0\" encoding=\"utf-8\" ?><mySample xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" attribute1=\"value1\"/>";
XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
while (reader.hasNext()) {
int event = reader.next();
if (XMLStreamConstants.START_ELEMENT == event) {
if (reader.getNamespaceCount() > 0) {
// This happens
System.out.println("ELEMENT START: " + reader.getLocalName() + " , namespace count is: " + reader.getNamespaceCount());
for (int nsIndex = 0; nsIndex < reader.getNamespaceCount(); nsIndex++) {
String nsPrefix = reader.getNamespacePrefix(nsIndex);
String nsId = reader.getNamespaceURI(nsIndex);
System.out.println("\tNamespace prefix: " + nsPrefix + " associated with URI " + nsId);
}
}
} else if(XMLStreamConstants.NAMESPACE == event) {
// This almost never happens
System.out.println("NAMESPACE EVENT");
}
}
Will produce :
ELEMENT START: mySample , namespace count is: 1
Namespace prefix: xsi associated with URI http://www.w3.org/2001/XMLSchema-instance
Bottom line: you should check for both NAMESPACE and START_ELEMENT events, even if most of the time only START_ELEMENT will report namespace declarations. It is not one or the other; it's both.
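To get back the literal strings asked for in the question ("xmlns:xsi" and its URI), the prefix reported on START_ELEMENT can be glued back onto "xmlns:". This is only a sketch: the class NamespaceDeclarations and the "name=uri" output format are made up for illustration, and it assumes the default namespace-aware factory, where namespace declarations are not reported through the getAttribute* methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class NamespaceDeclarations {

    /** Returns the namespace declarations of the first element as "xmlns:prefix=uri" strings. */
    public static List<String> declarations(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                continue;
            }
            for (int i = 0; i < reader.getNamespaceCount(); i++) {
                String prefix = reader.getNamespacePrefix(i);
                // A default namespace declaration (xmlns="...") has no prefix.
                String name = (prefix == null || prefix.isEmpty()) ? "xmlns" : "xmlns:" + prefix;
                result.add(name + "=" + reader.getNamespaceURI(i));
            }
            break; // only the first element is inspected in this sketch
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<mySample xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
                + " attribute1=\"value1\"/>";
        System.out.println(declarations(xml));
        // [xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance]
    }
}
```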

Why did qName work and LocalName did not?

I was learning the Java SAX API. I made my own XML feed using php. Here is the XML Document
When I ran my application, nothing was printed to the console. I pinpointed the problem to the endElement method in my XMLHandler, which extends DefaultHandler. Here is my implementation of it.
public void endElement(String uri, String localName, String qName) throws SAXException {
//I added the next three lines for debugging
System.out.println("Found End Element " + count + " times");
System.out.println("Localname = " + localName);
System.out.println("QName = " + qName);
super.endElement(uri, localName, qName);
if (this.currentItem != null){
if (localName.equalsIgnoreCase(me.osama.XMLParsing.BaseFeedParser.ITEMNAME)){
currentItem.setItemName(builder.toString());
} else if (localName.equalsIgnoreCase(me.osama.XMLParsing.BaseFeedParser.ITEMSITE)){
currentItem.setItemSite(builder.toString());
} else if (localName.equalsIgnoreCase(me.osama.XMLParsing.BaseFeedParser.ITEMNO)){
currentItem.setItemNo(builder.toString());
} else if (localName.equalsIgnoreCase(me.osama.XMLParsing.BaseFeedParser.ITEM)){
System.out.println(currentItem);
items.add(currentItem);
}
builder.setLength(0);
}
count++;
}
It turned out that localName kept coming back empty, so the conditions never held true and the code never entered the if block. qName, on the other hand, contained all the names correctly, and once I switched the comparisons to qName, the List<Item> items collection filled up and everything worked.
I am here to ask: why did qName work and not localName? The tutorial from IBM's developerWorks used an RSS feed, and localName worked perfectly there.
P.S. This is the feed the IBM tutorial used: http://www.androidster.com/android_news.rss
As per the SAX namespace for Java API,
By default, an XML reader will report a Namespace URI and a localName
for every element that belongs in a namespace, in both the start and
end handler.
Perhaps, if you add a namespace to the XML and define your elements in that namespace, it would return a valid localName. The article also mentions that with namespace processing, some implementations will return empty qName.
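One common cause of an empty localName, at least with the JAXP parser bundled with the JDK, is that SAXParserFactory is not namespace-aware by default; whether that is what happened in the question's setup is an assumption. The sketch below (class and method names invented) parses the same one-element document with and without namespace awareness and records what endElement receives.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LocalNameDemo {

    /** Parses xml and returns "localName|qName" as seen by the first endElement call. */
    public static String firstEndNames(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // the JAXP default is false
        StringBuilder seen = new StringBuilder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void endElement(String uri, String localName, String qName) {
                if (seen.length() == 0) {
                    seen.append(localName).append('|').append(qName);
                }
            }
        });
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        // With namespace awareness on, both names are filled in.
        System.out.println(firstEndNames("<item/>", true));
        // With it off, only qName is reliable; localName is typically empty.
        System.out.println(firstEndNames("<item/>", false));
    }
}
```

If switching the factory to namespace-aware is not an option, comparing against qName (as the question ended up doing) is the usual workaround for documents that do not use namespaces.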

How to get node contents from JDOM

I'm writing an application in Java using import org.jdom.*;
My XML is valid, but sometimes it contains HTML tags. For example, something like this:
<program-title>Anatomy &amp; Physiology</program-title>
<overview>
<content>
For more info click here
<p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&#160; Online studies options are available.</p>
</content>
</overview>
<key-information>
<category>Health &amp; Human Services</category>
So my problem is with the <p> tags inside the overview.content node.
I was hoping that this code would work :
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
System.out.println(content.getText());
but it returns blank.
How do I return all the text (nested tags and all) from the overview.content node?
Thanks
content.getText() returns only the immediate text, which is only useful for leaf elements with text content.
The trick is to use org.jdom.output.XMLOutputter (with the CompactFormat text mode):
public static void main(String[] args) throws Exception {
SAXBuilder builder = new SAXBuilder();
String xmlFileName = "a.xml";
Document doc = builder.build(xmlFileName);
Element root = doc.getRootElement();
Element overview = root.getChild("overview");
Element content = overview.getChild("content");
XMLOutputter outp = new XMLOutputter();
outp.setFormat(Format.getCompactFormat());
//outp.setFormat(Format.getRawFormat());
//outp.setFormat(Format.getPrettyFormat());
//outp.getFormat().setTextMode(Format.TextMode.PRESERVE);
StringWriter sw = new StringWriter();
outp.output(content.getContent(), sw);
StringBuffer sb = sw.getBuffer();
System.out.println(sb.toString());
}
Output
For more info clickhere<p>Learn more about the human body. Choose from a variety of Physiology (A&P) designed for complementary therapies.&#160; Online studies options are available.</p>
Do explore other formatting options and modify above code to your need.
"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization). "
You could try using the getValue() method for the closest approximation, but what this does is concatenate all text within the element and its descendants. This won't give you the <p> tag in any form. If that tag is in your XML like you've shown, it has become part of the XML markup. It would need to be included as &lt;p&gt; or embedded in a CDATA section to be treated as text.
Alternatively, if you know all elements that either may or may not appear in your XML, you could apply an XSLT transformation that turns stuff which isn't intended as markup into plain text.
Well, maybe that's what you need:
import java.io.StringReader;
import java.util.List;
import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.Content;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;
public class HowToGetNodeContentsJDOM extends XMLTestCase
{
private static final String XML = "<root>\n" +
" <program-title>Anatomy &amp; Physiology</program-title>\n" +
" <overview>\n" +
" <content>\n" +
" For more info click here\n" +
" <p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&#160; Online studies options are available.</p>\n" +
" </content>\n" +
" </overview>\n" +
" <key-information>\n" +
" <category>Health &amp; Human Services</category>\n" +
" </key-information>\n" +
"</root>";
private static final String EXPECTED = "For more info click here\n" +
"<p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&#160; Online studies options are available.</p>";
@Test
public void test() throws Exception
{
XMLUnit.setIgnoreWhitespace(true);
Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
String out = new XMLOutputter().outputString(content);
assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
}
}
Output:
PASSED: test on instance null(HowToGetNodeContentsJDOM)
===============================================
Default test
Tests run: 1, Failures: 0, Skips: 0
===============================================
I am using JDom with generics: http://www.junlu.com/list/25/883674.html
Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...
If you're also generating the XML file you should be able to encapsulate your html data in <![CDATA[]]> so that it isn't parsed by the XML parser.
The problem is that the <content> node doesn't have a text child; it has a <p> child that happens to contain text.
Try this:
Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());
If you want all the immediate child nodes, call p.getChildren(). If you want to get ALL the child nodes, you'll have to call it recursively.
Not particularly pretty but works fine (using JDOM API):
public static String getRawText(Element element) {
if (element.getContent().size() == 0) {
return "";
}
StringBuffer text = new StringBuffer();
for (int i = 0; i < element.getContent().size(); i++) {
final Object obj = element.getContent().get(i);
if (obj instanceof Text) {
text.append( ((Text) obj).getText() );
} else if (obj instanceof Element) {
Element e = (Element) obj;
text.append( "<" ).append( e.getName() );
// dump all attributes
for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
}
text.append(">");
text.append( getRawText( e )).append("</").append(e.getName()).append(">");
}
}
return text.toString();
}
Prashant Bhate's solution is nicer though!
If you want to output the content of some JDOM node, just use
System.out.println(new XMLOutputter().outputString(node));
