How to skip well-formedness checks in the Java DOM parser - java

I know this has been asked multiple times here, but I have a different issue with it. In my case, the app receives a DOM structure that is not well-formed, passed as a string. Here's a sample:
<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>
As you can see, the content is not well-formed. If I try to parse it with a normal SAX or DOM parser, it throws an exception, which is understandable.
org.xml.sax.SAXParseException: The reference to entity "feature" must end with the ';' delimiter.
As per the requirement, I need to read this document, add a few additional div tags and send the content back as a string. A DOM parser suits this well, as I can walk the input structure and add the additional tags at the required positions.
I tried using tools like JTidy for pre-processing before parsing, but that converts the document into a full-blown HTML document, which I don't want. Here's some sample code:
StringWriter writer = new StringWriter();
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true);
tidy.parse(new ByteArrayInputStream(content.getBytes()), writer);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new ByteArrayInputStream(writer.toString().getBytes()));
// Traverse thru the content and add new tags
....
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);
This completely converts the input to a well-formed HTML document. It then becomes hard to remove the HTML tags manually. The other option I tried was SAX2DOM, which also creates an HTML document. Here's some sample code:
ByteArrayInputStream is = new ByteArrayInputStream(content.getBytes());
Parser p = new Parser();
p.setFeature(IContentExtractionConstant.SAX_NAMESPACE,true);
SAX2DOM sax2dom = new SAX2DOM();
p.setContentHandler(sax2dom);
p.parse(new InputSource(is));
Document doc = (Document)sax2dom.getDOM();
I'll appreciate if someone can share their ideas.
Thanks

The simplest way is replacing xml reserved characters with the corresponding xml entities. You can do this manually:
content = content.replaceAll("&", "&amp;");
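A blanket replacement would also double-escape any ampersand that already starts a valid entity. The following is a minimal sketch, assuming the only defect in the input is bare ampersands; the class and method names are hypothetical, and only the five predefined XML entities and numeric character references are treated as already escaped.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of any library.
final class AmpersandEscaper {
    // Matches '&' unless it already starts a predefined entity
    // or a numeric character reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)");

    static String escapeBareAmpersands(String content) {
        return BARE_AMP.matcher(content).replaceAll("&amp;");
    }

    // Escapes bare ampersands, then parses the result with the standard DOM parser.
    static Document parse(String content) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(escapeBareAmpersands(content))));
    }

    public static void main(String[] args) throws Exception {
        String content = "<div class='video yt'><div class='yt_url'>"
                + "http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>";
        Document doc = parse(content);
        // The DOM holds the unescaped text, so the URL comes back with a plain '&'.
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the DOM stores the decoded text, serializing the document afterwards writes the ampersand back in escaped form.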
If you don't want to modify your string before parsing it, I can propose another way using a SAX parser, but this solution is more complicated. Basically you have to:
write a LexicalHandler in combination with a ContentHandler
tell the parser to continue its execution after a fatal error (the ErrorHandler alone isn't enough)
treat undeclared entities as simple text
UPDATE
Following your comment, here are some details on the second solution. I've written a class which extends DefaultHandler (the default implementation of EntityResolver, DTDHandler, ContentHandler and ErrorHandler) and implements LexicalHandler. I've overridden ErrorHandler's fatalError method (my implementation does nothing instead of throwing the exception) and ContentHandler's characters method, which works in combination with the startEntity method of LexicalHandler.
public class MyHandler extends DefaultHandler implements LexicalHandler {
    // Name of the entity reported by startEntity, kept until characters() consumes it.
    private String currentEntity = null;

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Deliberately empty: swallow the fatal error instead of rethrowing it.
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String content = new String(ch, start, length);
        if (currentEntity != null) {
            // Put the undeclared entity back into the output as literal text.
            content = "&" + currentEntity + content;
            currentEntity = null;
        }
        System.out.print(content);
    }

    @Override
    public void startEntity(String name) throws SAXException {
        currentEntity = name;
    }

    // The remaining LexicalHandler callbacks are not needed here.
    @Override
    public void endEntity(String name) throws SAXException {
    }

    @Override
    public void startDTD(String name, String publicId, String systemId)
            throws SAXException {
    }

    @Override
    public void endDTD() throws SAXException {
    }

    @Override
    public void startCDATA() throws SAXException {
    }

    @Override
    public void endCDATA() throws SAXException {
    }

    @Override
    public void comment(char[] ch, int start, int length) throws SAXException {
    }
}
This is my main method, which parses your not well-formed XML. The setFeature call is essential: without it the parser throws the SAXParseException despite the empty ErrorHandler implementation.
public static void main(String[] args) throws ParserConfigurationException,
SAXException, IOException {
String xml = "<div class='video yt'><div class='yt_url'>http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata</div></div>";
SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
MyHandler myHandler = new MyHandler();
xmlReader.setContentHandler(myHandler);
xmlReader.setErrorHandler(myHandler);
xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler",
myHandler);
xmlReader.setFeature(
"http://apache.org/xml/features/continue-after-fatal-error",
true);
xmlReader.parse(new InputSource(new StringReader(xml)));
}
This main prints out the content of your div element which contains the error:
http://www.youtube.com/watch?v=U_QLu_Twd0g&feature=abcde_gdata
Keep in mind that this is an example which works with your input, and you may have to complete it. For instance, if some characters are already correctly escaped, you should add code to handle that situation.
Hope this helps.

Related

How to improve performance of querying xml file with VTD-XML and XPath?

I am querying XML files of around 1 MB (20k+ lines). I am using XPath to describe what I want to get and the VTD-XML library to get it. I think I have a performance problem.
The problem is that I am making about 5k+ queries against the XML file. It takes approximately 16-17 seconds to retrieve all the values. Is this normal performance for such a task? How can I improve it?
I am using the VTD-XML library with the AutoPilot navigation approach, which lets me use XPath. The implementation is as follows:
private VTDGen vg = new VTDGen();
private VTDNav vn;
private AutoPilot ap = new AutoPilot();
public void init(String xml) {
log.info("Creating document");
xml = xml.replace("<?xml version=\"1.0\"?>", "<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
vg.setDoc(bytes);
try {
vg.parse(true);
vn = vg.getNav();
} catch (ParseException e) {
e.printStackTrace();
}
log.info("Document created");
}
public String parseXmlOrReturnNull(String query) {
String xPathStringVal = null;
try {
ap.selectXPath(query);
ap.bind(vn);
int i = -1;
while ((i = ap.evalXPath()) != -1) {
xPathStringVal = vn.getXPathStringVal();
}
}catch (XPathEvalException e) {
e.printStackTrace();
} catch (NavException e) {
e.printStackTrace();
} catch (XPathParseException e) {
e.printStackTrace();
}
return xPathStringVal;
}
My XML files have a specific format: they are divided into many parts (segments), and my queries are the same for every segment (I run them in a loop). For example, part of the XML:
<segment>
<a>
<b>value1</b>
<c>
<d>value2</d>
<e>value3</e>
</c>
</a>
</segment>
<segment>
<a>
<b>value4</b>
<c>
<d>value5</d>
<e>value6</e>
<f>value6</f>
</c>
</a>
</segment>
...
If I want to get value1 in first segment I am using query:
//segment[1]/a/b
for value 4 in second segment
//segment[2]/a/b
etc.
My intuition says that in my approach every query is independent (it knows nothing about the other queries), which means that AutoPilot, my iterator, always starts at the beginning of the file when I run a query.
My question is: is there any way to position AutoPilot at the beginning of the segment being processed, and then move it to the next segment once I have finished querying? I think that if my method started searching from a specified point rather than from the beginning, it would be much faster.
Another way is to divide the XML file into small XML files (one XML file = one segment) and query those small files.
What do you think? Thanks in advance.
Minor: the replace is not needed, as UTF-8 is the default encoding for XML; only when a different encoding is declared would you need to patch it to UTF-8.
The XPath should be evaluated only once, rather than restarting from the beginning of the document for every segment index.
If you need a List representation, you could use JAXB with annotations.
Event-based parsing without building a DOM object (SAXParser) is probably best.
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri,
            String localName, String qName, Attributes attributes) throws SAXException {
        // Called for every opening tag, e.g. <segment>.
    }

    @Override
    public void endElement(String uri,
            String localName, String qName) throws SAXException {
        // Called for every closing tag.
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Called with the text content between tags.
    }
};
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
InputStream in = new ByteArrayInputStream(bytes);
parser.parse(in, handler);
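The advice to evaluate the segment XPath only once can be sketched as follows. This is a rough illustration that swaps in the JDK's own javax.xml.xpath API instead of VTD-XML, so it shows the shape of the loop rather than VTD-XML's performance; the class name, the method name and the root wrapper element in the sample are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; uses JDK XPath, not VTD-XML.
final class SegmentQueries {
    // Selects all segments once, then evaluates a relative path against each one.
    static List<String> valuesPerSegment(String xml, String relativePath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Both expressions are compiled a single time and reused.
        XPathExpression segments = xpath.compile("//segment");
        XPathExpression value = xpath.compile(relativePath);

        NodeList nodes = (NodeList) segments.evaluate(doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // The segment node is the context, so the search does not restart at the root.
            result.add(value.evaluate(nodes.item(i)));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<segment><a><b>value1</b><c><d>value2</d></c></a></segment>"
                + "<segment><a><b>value4</b><c><d>value5</d></c></a></segment>"
                + "</root>";
        System.out.println(valuesPerSegment(xml, "a/b"));
    }
}
```

The same pattern (one pass over the segments, relative queries inside each) is what the question is asking AutoPilot to do.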

How to read < as < from an XML? [duplicate]

I am new to XML. I want to read the following XML on the basis of request name. Please help me on how to read the below XML in Java -
<?xml version="1.0"?>
<config>
<Request name="ValidateEmailRequest">
<requestqueue>emailrequest</requestqueue>
<responsequeue>emailresponse</responsequeue>
</Request>
<Request name="CleanEmail">
<requestqueue>Cleanrequest</requestqueue>
<responsequeue>Cleanresponse</responsequeue>
</Request>
</config>
If your XML is a String, then you can do the following:
String xml = ""; //Populated XML String....
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));
Element rootElement = document.getDocumentElement();
If your XML is in a file, then Document document will be instantiated like this:
Document document = builder.parse(new File("file.xml"));
document.getDocumentElement() returns the document's root element (in your case <config>).
Once you have rootElement, you can read an element's attributes (by calling getAttribute(String name) on it), and so on. For more methods, see the Javadoc for org.w3c.dom.Element.
See also the Javadoc for DocumentBuilder and DocumentBuilderFactory. Bear in mind that the example builds an XML DOM tree in memory, so if you have a large amount of XML data, the tree can be huge.
Related question.
Update: here's an example that gets the value of the <requestqueue> element:
protected String getString(String tagName, Element element) {
NodeList list = element.getElementsByTagName(tagName);
if (list != null && list.getLength() > 0) {
NodeList subList = list.item(0).getChildNodes();
if (subList != null && subList.getLength() > 0) {
return subList.item(0).getNodeValue();
}
}
return null;
}
You can effectively call it as,
String requestQueueName = getString("requestqueue", element);
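To read the queues on the basis of the request name, as the question asks, the DOM approach can be extended to pick the Request element by its name attribute. A minimal sketch, assuming the <config> layout shown in the question; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration.
final class RequestConfigReader {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of <tagName> inside the <Request> whose name attribute
    // equals requestName, or null if there is no such request or child element.
    static String queueFor(Document doc, String requestName, String tagName) {
        NodeList requests = doc.getElementsByTagName("Request");
        for (int i = 0; i < requests.getLength(); i++) {
            Element request = (Element) requests.item(i);
            if (requestName.equals(request.getAttribute("name"))) {
                NodeList children = request.getElementsByTagName(tagName);
                return children.getLength() > 0 ? children.item(0).getTextContent() : null;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><config>"
                + "<Request name=\"ValidateEmailRequest\">"
                + "<requestqueue>emailrequest</requestqueue>"
                + "<responsequeue>emailresponse</responsequeue></Request>"
                + "<Request name=\"CleanEmail\">"
                + "<requestqueue>Cleanrequest</requestqueue>"
                + "<responsequeue>Cleanresponse</responsequeue></Request>"
                + "</config>";
        Document doc = parse(xml);
        System.out.println(queueFor(doc, "CleanEmail", "requestqueue"));
    }
}
```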
In case you just need one (first) value to retrieve from xml:
public static String getTagValue(String xml, String tagName){
return xml.split("<"+tagName+">")[1].split("</"+tagName+">")[0];
}
In case you want to parse whole xml document use JSoup:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
for (Element e : doc.select("Request")) {
System.out.println(e);
}
If you are just looking to get a single value from the XML you may want to use Java's XPath library. For an example see my answer to a previous question:
How to use XPath on xml docs having default namespace
It would look something like:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
public class Demo {
public static void main(String[] args) {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse("E:/test.xml");
XPath xPath = XPathFactory.newInstance().newXPath();
Node node = (Node) xPath.evaluate("/config/Request/@name", dDoc, XPathConstants.NODE);
System.out.println(node.getNodeValue());
} catch (Exception e) {
e.printStackTrace();
}
}
}
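For the <config> document in the question, an XPath predicate can select a Request by its name attribute directly. A minimal sketch, assuming request names never contain a single quote (the expression is built by string concatenation); the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration.
final class RequestXPathLookup {
    // Returns the <requestqueue> text for the named request,
    // or an empty string when nothing matches.
    static String requestQueue(String xml, String requestName) throws XPathExpressionException {
        if (requestName.indexOf('\'') >= 0) {
            // Assumption of this sketch: names are simple and need no quoting.
            throw new IllegalArgumentException("request name must not contain a single quote");
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expression = "/config/Request[@name='" + requestName + "']/requestqueue";
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config>"
                + "<Request name=\"ValidateEmailRequest\"><requestqueue>emailrequest</requestqueue></Request>"
                + "<Request name=\"CleanEmail\"><requestqueue>Cleanrequest</requestqueue></Request>"
                + "</config>";
        System.out.println(requestQueue(xml, "ValidateEmailRequest"));
    }
}
```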
There are a number of different ways to do this. You might want to check out XStream or JAXB; both have tutorials and examples.
If the XML is well formed then you can convert it to Document. By using the XPath you can get the XML Elements.
String xml = "<stackusers><name>Yash</name><age>30</age></stackusers>";
From the XML string, create a Document and find elements by their XPath.
Document doc = getDocument(xml, true);
public static Document getDocument(String xmlData, boolean isXMLData) throws Exception {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setNamespaceAware(true);
dbFactory.setIgnoringComments(true);
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc;
if (isXMLData) {
InputSource ips = new org.xml.sax.InputSource(new StringReader(xmlData));
doc = dBuilder.parse(ips);
} else {
doc = dBuilder.parse( new File(xmlData) );
}
return doc;
}
Use org.apache.xpath.XPathAPI to get Node or NodeList.
System.out.println("XPathAPI:"+getNodeValue(doc, "/stackusers/age/text()"));
NodeList nodeList = getNodeList(doc, "/stackusers");
System.out.println("XPathAPI NodeList:"+ getXmlContentAsString(nodeList));
System.out.println("XPathAPI NodeList:"+ getXmlContentAsString(nodeList.item(0)));
public static String getNodeValue(Document doc, String xpathExpression) throws Exception {
Node node = org.apache.xpath.XPathAPI.selectSingleNode(doc, xpathExpression);
String nodeValue = node.getNodeValue();
return nodeValue;
}
public static NodeList getNodeList(Document doc, String xpathExpression) throws Exception {
NodeList result = org.apache.xpath.XPathAPI.selectNodeList(doc, xpathExpression);
return result;
}
Using javax.xml.xpath.XPathFactory
System.out.println("javax.xml.xpath.XPathFactory:"+getXPathFactoryValue(doc, "/stackusers/age"));
static XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
public static String getXPathFactoryValue(Document doc, String xpathExpression) throws XPathExpressionException, TransformerException, IOException {
Node node = (Node) xpath.evaluate(xpathExpression, doc, XPathConstants.NODE);
String nodeStr = getXmlContentAsString(node);
return nodeStr;
}
Using Document Element.
System.out.println("DocumentElementText:"+getDocumentElementText(doc, "age"));
public static String getDocumentElementText(Document doc, String elementName) {
return doc.getElementsByTagName(elementName).item(0).getTextContent();
}
Get value in between two strings.
String nodeValue = org.apache.commons.lang.StringUtils.substringBetween(xml, "<age>", "</age>");
System.out.println("StringUtils.substringBetween():"+nodeValue);
Full Example:
public static void main(String[] args) throws Exception {
String xml = "<stackusers><name>Yash</name><age>30</age></stackusers>";
Document doc = getDocument(xml, true);
String nodeValue = org.apache.commons.lang.StringUtils.substringBetween(xml, "<age>", "</age>");
System.out.println("StringUtils.substringBetween():"+nodeValue);
System.out.println("DocumentElementText:"+getDocumentElementText(doc, "age"));
System.out.println("javax.xml.xpath.XPathFactory:"+getXPathFactoryValue(doc, "/stackusers/age"));
System.out.println("XPathAPI:"+getNodeValue(doc, "/stackusers/age/text()"));
NodeList nodeList = getNodeList(doc, "/stackusers");
System.out.println("XPathAPI NodeList:"+ getXmlContentAsString(nodeList));
System.out.println("XPathAPI NodeList:"+ getXmlContentAsString(nodeList.item(0)));
}
public static String getXmlContentAsString(Node node) throws TransformerException, IOException {
StringBuilder stringBuilder = new StringBuilder();
NodeList childNodes = node.getChildNodes();
int length = childNodes.getLength();
for (int i = 0; i < length; i++) {
stringBuilder.append( toString(childNodes.item(i), true) );
}
return stringBuilder.toString();
}
Output:
StringUtils.substringBetween():30
DocumentElementText:30
javax.xml.xpath.XPathFactory:30
XPathAPI:30
XPathAPI NodeList:<stackusers>
<name>Yash</name>
<age>30</age>
</stackusers>
XPathAPI NodeList:<name>Yash</name><age>30</age>
The following links might help:
http://labe.felk.cvut.cz/~xfaigl/mep/xml/java-xml.htm
http://developerlife.com/tutorials/?p=25
http://www.java-samples.com/showtutorial.php?tutorialid=152
There are two general ways of doing that. The first is to build a Document Object Model (DOM) of the XML file.
The second is event-driven parsing (for example SAX), which is an alternative to the DOM representation. Of course there is much more to know about processing XML; for instance, if you are given an XML schema definition (XSD), you could use JAXB.
There are various APIs available to read and write XML files in Java.
I would prefer using StAX.
This can also be useful: Java XML APIs
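As a rough illustration of the StAX suggestion, the sketch below pulls events from an XMLStreamReader and collects the child elements of the Request with a given name. It assumes the <config> layout from the question and that each child of Request contains only text; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration.
final class StaxConfigReader {
    // Returns child element name -> text for the <Request> whose name attribute matches.
    static Map<String, String> queuesFor(String xml, String requestName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        Map<String, String> result = new LinkedHashMap<>();
        boolean insideMatch = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("Request".equals(reader.getLocalName())) {
                        insideMatch = requestName.equals(reader.getAttributeValue(null, "name"));
                    } else if (insideMatch) {
                        // getElementText() reads up to the matching end tag of this child.
                        result.put(reader.getLocalName(), reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && insideMatch && "Request".equals(reader.getLocalName())) {
                    break; // the matching request is complete, stop reading
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config>"
                + "<Request name=\"ValidateEmailRequest\"><requestqueue>emailrequest</requestqueue>"
                + "<responsequeue>emailresponse</responsequeue></Request>"
                + "<Request name=\"CleanEmail\"><requestqueue>Cleanrequest</requestqueue>"
                + "<responsequeue>Cleanresponse</responsequeue></Request>"
                + "</config>";
        System.out.println(queuesFor(xml, "CleanEmail"));
    }
}
```

Because the reader stops as soon as the matching Request ends, the rest of the file is never parsed.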
You can write a class that extends org.xml.sax.helpers.DefaultHandler and, for each element, calls a method named
start_<tag_name>(Attributes attrs);
and
end_<tag_name>();
For the <requestqueue> element, for example, that is:
start_requestqueue(attrs);
etc.
Then extend that class and implement whichever XML configuration file parsers you want. Example:
...
public void startElement(String uri, String name, String qname,
        org.xml.sax.Attributes attrs)
        throws org.xml.sax.SAXException {
    // Parameter types of the start_<tag_name>(String, Attributes) methods.
    Class<?>[] args = new Class<?>[] { String.class, org.xml.sax.Attributes.class };
    try {
        // '-' is not allowed in Java method names, so strip it from the element name.
        String mname = name.replace("-", "");
        java.lang.reflect.Method m =
                getClass().getDeclaredMethod("start_" + mname, args);
        m.invoke(this, uri, attrs);
    }
    catch (IllegalAccessException | NoSuchMethodException e) {
        throw new RuntimeException(e);
    }
    catch (java.lang.reflect.InvocationTargetException e) {
        // Rethrow whatever the handler method threw as a SAXException.
        org.xml.sax.SAXException se =
                new org.xml.sax.SAXException(e.getTargetException().toString());
        se.setStackTrace(e.getTargetException().getStackTrace());
        throw se;
    }
}
and in a particular configuration parser:
public void start_Request(String uri, org.xml.sax.Attributes attrs) {
// make sure to read attributes correctly
System.err.println("Request, name=" + attrs.getValue(0));
}
Since you are using this for configuration, your best bet is apache commons-configuration. For simple files it's way easier to use than "raw" XML parsers.
See the XML how-to

Handle illegal URI characters in xslt inclusion

In an XSL transformation I have an XSLT file that includes some other XSLT files. The problem is that the URIs of these files contain illegal characters, in particular '##'. The XSLT looks like this:
<xsl:include href="/appdm/tomcat/webapps/sentys##1.0.0/WEB-INF/classes/xslt/release_java/xslt/gen.xslt" />
and when I try to instantiate a java Transformer I get the error:
javax.xml.transform.TransformerConfigurationException: javax.xml.transform.TransformerConfigurationException: javax.xml.transform.TransformerException: org.xml.sax.SAXException: org.apache.xml.utils.URI$MalformedURIException: Fragment contains invalid character:#
This is the java code:
public String xslTransform2String(String sXml, String sXslt) throws Exception {
String sResult = null;
try {
Source oStrSource = createStringSource(sXml);
DocumentBuilderFactory oDocFactory = DocumentBuilderFactory.newInstance();
oDocFactory.setNamespaceAware(true);
//sXslt is the xslt content with the inclusions
//<xsl:include href="/appdm/tomcat/webapps/sentys##1.0.0/WEB-INF/classes/xslt/release_java/xslt/gen.xslt" />"
Document oDocXslt = oDocFactory.newDocumentBuilder().parse(new InputSource(new StringReader(sXslt)));
Source oXsltSource = new DOMSource(oDocXslt);
StringWriter oStrOut = new StringWriter();
Result oTransRes = createStringResult(oStrOut);
Transformer oTrans = createXsltTransformer(oXsltSource);
oTrans.transform(oStrSource, oTransRes);
sResult = oStrOut.toString();
} catch (Exception oEx) {
throw new BddException(oEx, XmlProvider.ERR_XSLT, null);
}
return sResult;
}
private Transformer createXsltTransformer(Source oXsltSource) throws Exception {
Transformer transformer = getXsltTransformerFactory().newTransformer(
oXsltSource);
ErrorListener errorListener = new DefaultErrorListener();
transformer.setErrorListener(errorListener);
return transformer;
}
Is there a way I can use relative paths instead of absolute paths?
Thank you
To avoid the MalformedURIException, replace the second or both # with %23.
See https://stackoverflow.com/a/5007362/4092205
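The replacement can be sketched with java.net.URI to check the result. This is a minimal illustration, assuming the href is a plain file path with no intended fragment part, so every '#' can be percent-encoded; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration.
final class HrefEscaper {
    // Percent-encodes every '#' so it is read as part of the path, not as a fragment marker.
    static String escapeHashes(String path) {
        return path.replace("#", "%23");
    }

    public static void main(String[] args) throws URISyntaxException {
        String path = "/appdm/tomcat/webapps/sentys##1.0.0/WEB-INF/classes/xslt/release_java/xslt/gen.xslt";
        String href = escapeHashes(path);
        System.out.println(href);
        // Parsing the escaped form succeeds, and getPath() decodes it back to the original path.
        URI uri = new URI(href);
        System.out.println(uri.getPath());
    }
}
```

The escaped string is what goes into the href attribute of xsl:include.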

How to get Embeded/nested XML from a big XML file using SAX parser

We are performing some operations on embedded/nested XML. I am using SAXParser to parse the entire XML file. I want to get the entire nested XML, with tags and values. For example, my XML looks like the sample below.
I want the entire XML within the <ANY-ELEMENT>.....</ANY-ELEMENT> tag.
<?xml version="1.0" encoding="UTF-8"?>
<x:xMessage xmlns:x="http://www.connecture.com/integration/x" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.connecture.com/integration/x xMessageWrapper.xsd
">
<x:xMessageHeader>
<Version>850</Version>
<Source>Source</Source>
<Target>target</Target>
<Timestamp>2013-12-31T12:00:00</Timestamp>
<RequestID>123456</RequestID>
<ResponseID>54321</ResponseID>
<Priority>3</Priority>
<Username>Deepak</Username>
<Password>Kumar</Password>
</x:xMessageHeader>
<x:xMessageBody>
<ANY-ELEMENT>
<xEnveloped_834A1 xsi:schemaLocation="....." xmlns="......."
..........................
..........................
some Complex XML
..........................
..........................
..........................
</ANY-ELEMENT>
</x:xMessageBody>
</x:xMessage>
Handler class Sample code:
public class MessageWrapperHandler extends DefaultHandler {
private boolean bActualMessage = false;
private String actualMessage = null;
private long lengthActualMessage=0;
public void startElement(String uri, String localName, String qName, Attributes attributes) {
if (qName.equalsIgnoreCase("ANY-ELEMENT")) {
bActualMessage = true;
//lengthActualMessage=How to know the length of Child XML
}
}
public void characters(char ch[], int start, int length) {
if (bActualMessage) {
actualMessage = new String(ch, start, length);
//trying to get embedded XML
bActualMessage = false;
}
}
}
But since what follows <ANY-ELEMENT> is XML content rather than plain text, this gives me nothing useful. How can I achieve this?
EDIT: You are free to modify XML after <ANY-ELEMENT> like adding contents into CDATA
Instead of SAX, I would recommend using StAX (a StAX implementation is included in the JDK/JRE since Java SE 6). StAX is similar to SAX except instead of having the events pushed to you, you pull (request) them.
In the code below the XMLStreamReader is advanced to the ANY-ELEMENT element. Once it is at the correct position you can interact with it as you wish.
import javax.xml.stream.*;
import javax.xml.transform.stream.StreamSource;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xmlSource = new StreamSource("src/forum19559825/input.xml");
XMLStreamReader xsr = xif.createXMLStreamReader(xmlSource);
Demo demo = new Demo();
demo.positionXMLStreamReaderAtAnyElement(xsr);
demo.processAnyElement(xsr);
}
private void positionXMLStreamReaderAtAnyElement(XMLStreamReader xsr) throws Exception {
while(xsr.hasNext()) {
if(xsr.getEventType() == XMLStreamReader.START_ELEMENT && "ANY-ELEMENT".equals(xsr.getLocalName())) {
break;
}
xsr.next();
}
}
private void processAnyElement(XMLStreamReader xmlStreamReaderAtAnyElement) {
// TODO: Stuff
System.out.println("FOUND IT");
}
}
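One possible way to fill in the processing step is to hand the positioned XMLStreamReader to an identity Transformer through a StAXSource, which copies the current element and its subtree to a string. A minimal sketch, assuming the wrapper element has an element (not text) as its first child; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical helper for illustration.
final class NestedXmlExtractor {
    // Returns the first child element of <wrapperName> serialized as a string,
    // or null if the wrapper is missing or has no child element.
    static String extractFirstChild(String xml, String wrapperName) throws Exception {
        XMLStreamReader xsr =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && wrapperName.equals(xsr.getLocalName())) {
                    // Skip whitespace and move to the first child element, if any.
                    if (xsr.nextTag() != XMLStreamConstants.START_ELEMENT) {
                        return null;
                    }
                    // Identity transform: copies the subtree at the reader's position.
                    Transformer t = TransformerFactory.newInstance().newTransformer();
                    t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                    StringWriter out = new StringWriter();
                    t.transform(new StAXSource(xsr), new StreamResult(out));
                    return out.toString();
                }
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><header>h</header><ANY-ELEMENT>"
                + "<inner a=\"1\"><leaf>text</leaf></inner>"
                + "</ANY-ELEMENT></root>";
        System.out.println(extractFirstChild(xml, "ANY-ELEMENT"));
    }
}
```

Only the nested subtree is buffered as a string; the rest of the document is streamed.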

Jdom parser not recognizing special character attributes

I have a problem parsing attributes whose names contain special characters using JDOM.
For example:
< tag xml:lang="123" >
In this case the getAttributes() method returns null.
Is there any solution to fix this?
Works without problems for me:
public class TestJdom
{
public static void main(String[] args) throws JDOMException, IOException {
String xmlString = "<test><tag xml:lang=\"123\"></tag></test>";
SAXBuilder builder = new SAXBuilder();
StringReader stringReader = new StringReader(new String(xmlString
.getBytes()));
Document doc = builder.build(stringReader);
List<?> attrs = doc.getRootElement().getChild("tag").getAttributes();
System.out.println(attrs);
}
}
You probably need to set namespace, check http://cs.au.dk/~amoeller/XML/programming/jdomexample.html
