Failing to extract xml value element using JDOM & Xpath

I have a method (getSingleNodeValue()) which, when passed an XPath expression, extracts the value of the specified element from the XML document referred to by 'doc'. Assume doc has been initialised as shown below, and xmlInput is the buffer containing the XML content.
SAXBuilder builder = new SAXBuilder();
Document doc = null;
XPath xpathInstance = null;
doc = builder.build(new StringReader(xmlInput));
When I call the method, I pass the following XPath expression:
/TOP4A/PERLODSUMDEC/TINPLD1/text()
Here is the method. It basically just takes an xml buffer and uses xpath to extract the value:
public static String getSingleNodeValue(String xpathExpr) throws Exception{
Text list = null;
try {
xpathInstance = XPath.newInstance(xpathExpr);
list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
throw new Exception(e);
}catch (Exception e){
throw new Exception(e);
}
return list==null ? "?" : list.getText();
}
The above method always returns "?", i.e. nothing is found, so 'list' is null.
The xml document it looks at is
<TOP4A xmlns="http://www.testurl.co.uk/enment/gqr/3232/1">
<HEAD>
<Doc>ABCDUK1234</Doc>
</HEAD>
<PERLODSUMDEC>
<TINPLD1>10109000000000000</TINPLD1>
</PERLODSUMDEC>
</TOP4A>
The same method works with other XML documents, so I am not sure what is special about this one. No exception is thrown, so the XML is well-formed. It's just that the method always sets 'list' to null. Any ideas?
Edit
Ok as suggested, here is a simple running program that demonstrates the above
import org.jdom.*;
import org.jdom.input.*;
import org.jdom.xpath.*;
import java.io.IOException;
import java.io.StringReader;
public class XpathTest {
public static String getSingleNodeValue(String xpathExpr, String xmlInput) throws Exception{
Text list = null;
SAXBuilder builder = null;
Document doc = null;
XPath xpathInstance = null;
try {
builder = new SAXBuilder();
doc = builder.build(new StringReader(xmlInput));
xpathInstance = XPath.newInstance(xpathExpr);
list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
throw new Exception(e);
}catch (Exception e){
throw new Exception(e);
}
return list==null ? "Nothing Found" : list.getText();
}
public static void main(String[] args){
String xmlInput1 = "<TOP4A xmlns=\"http://www.testurl.co.uk/enment/gqr/3232/1\"><HEAD><Doc>ABCDUK1234</Doc></HEAD><PERLODSUMDEC><TINPLD1>10109000000000000</TINPLD1></PERLODSUMDEC></TOP4A>";
String xpathExpr = "/TOP4A/PERLODSUMDEC/TINPLD1/text()";
XpathTest xp = new XpathTest();
try {
System.out.println(xp.getSingleNodeValue(xpathExpr, xmlInput1));
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
When I run the above, the output is
Nothing Found
Edit
I have run some further tests, and it appears that if I remove the namespace URL it does work. I am not sure why yet. Is there any way I can tell it to ignore the namespace?
Edit
Please also note that the above is implemented on JDK 1.4.1, so I don't have the options available in later versions of the JDK. This is why I had to stick with JDOM.

The problem is with XML namespaces: XPath treats an unprefixed name as belonging to no namespace, so your query starts by selecting a 'TOP4A' element in no namespace. Your XML file, however, declares a default namespace, so its 'TOP4A' element is in the 'http://www.testurl.co.uk/enment/gqr/3232/1' namespace instead.
Is it an option to remove the xmlns from the XML?
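If removing the xmlns is not an option, the usual alternative is to bind a prefix to the namespace URI and use that prefix in the expression. The sketch below shows the idea with the JDK's own javax.xml.xpath API (Java 5 and later) rather than JDOM, so it does not run on JDK 1.4 as written. The class name NamespaceXPathDemo and the prefix "t" are made up for illustration. In JDOM 1.x the same binding can be made with XPath.addNamespace(prefix, uri) before calling selectSingleNode.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceXPathDemo {
    // Namespace URI taken from the document in the question.
    static final String NS = "http://www.testurl.co.uk/enment/gqr/3232/1";
    // Hypothetical prefix; any name works as long as the expression uses the same one.
    static final String PREFIX = "t";

    /** Returns the string value of the first node matched by expr, or "" if nothing matches. */
    public static String evaluate(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the DOM parser ignores namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Map the prefix used in the expression to the document's namespace URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return PREFIX.equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return NS.equals(uri) ? PREFIX : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return NS.equals(uri)
                            ? Collections.singleton(PREFIX).iterator()
                            : Collections.<String>emptySet().iterator();
                }
            });
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<TOP4A xmlns=\"" + NS + "\"><PERLODSUMDEC>"
                + "<TINPLD1>10109000000000000</TINPLD1></PERLODSUMDEC></TOP4A>";
        // Prefixed steps match the namespaced elements.
        System.out.println(evaluate(xml, "/t:TOP4A/t:PERLODSUMDEC/t:TINPLD1/text()"));
        // Unprefixed steps look for elements in no namespace and match nothing.
        System.out.println(evaluate(xml, "/TOP4A/PERLODSUMDEC/TINPLD1/text()").isEmpty());
    }
}
```

The prefix in the expression does not have to match anything in the document; only the namespace URI is compared.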

Related

Convert HTML to PDF with Header and Footer

We have the header and footer as HTML strings, but how do we append both to every page?
The Java method below takes the output file name plus three HTML strings (htmlContent, headerHtml, footerHtml) and returns the number of pages created. Where do we attach the header and footer content?
public static int generatePDF(String strFileName, String htmlContent,String headerHtml,String footerHtml) throws PDFNetException {
PDFDoc doc = new PDFDoc();
HTML2PDF converter = new HTML2PDF();
int nPages = 0;
try {
converter = new HTML2PDF();
doc = new PDFDoc();
converter.insertFromHtmlString(htmlContent);
try {
if (converter.convert(doc)) {
doc.save(strFileName, SDFDoc.e_linearized, null);
nPages = doc.getPageCount();
}
} catch (Exception ex) {
ex.printStackTrace();
}
} catch (Exception e) {
e.printStackTrace();
} finally {
converter.destroy();
doc.close();
}
return nPages;
}

One option is to post-process the PDF with the Stamper class to add the headers and footers.
See the following sample code for how to use Stamper:
https://www.pdftron.com/documentation/samples/#stamper
The HTML2PDF converter appends pages to the PDFDoc object passed in, so you can do the following:
1. Call HTML2PDF.InsertFromURL(url).
2. Call HTML2PDF.Convert(pdfdoc).
3. Run Stamper to stamp pages x-y.
4. Repeat to keep appending pages to pdfdoc.

Query on xml file with special case

I have two large files gathered from Stack Overflow, posts.xml and questions.txt, with the following structure:
posts.xml:
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..."/>
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
...
</posts>
A post can be a question or an answer (the file contains both).
questions.txt:
Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...
I want to pass over posts.xml only once and index the selected rows (those whose Id appears in questions.txt) with Lucene. Since the XML file is very large (about 50GB), the time taken for querying and indexing matters to me.
Now the question is: how can I find all the rows in posts.xml whose Id also appears in questions.txt?
This is my approach so far:
SAXParserDemo.java:
public class SAXParserDemo {
public static void main(String[] args){
try {
File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
Handler userhandler = new Handler();
saxParser.parse(inputFile, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Handler.java:
public class Handler extends DefaultHandler {
public void getQuestiondId() {
ArrayList<String> qIDs = new ArrayList<String>();
BufferedReader br = null;
try {
String qId;
br = new BufferedReader(new FileReader("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt"));
while ((qId = br.readLine()) != null) {
qId = qId.split(",")[0]; //this is question id
findAndIndexOnPost(qId); //find this id on posts.xml then index it!
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
private void findAndIndexOnPost(String qID) {
}
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes)
throws SAXException {
if (qName.equalsIgnoreCase("row")) {
System.out.println(attributes.getValue("Id"));
switch (attributes.getValue("PostTypeId")) {
case "1":
String id = attributes.getValue("Id");
break;
case "2":
break;
default:
break;
}
}
}
}
UPDATE:
I need to keep a pointer into the XML file across iterations, but I don't know how to do this with SAX.

What you have to do is:
read the TXT file (probably a simple stream will do).
add all Id values to a List<Integer> questionIds - one by one. You will have to parse them manually (with a regex or String.indexOf()).
in your Handler implementation simply compare if questionIds.contains(givenId).
send the received object (from XML) to Elastic Search with a simple REST request (POST/PUT).
Ta-da! Your data is now indexed with lucene.
Also, change the way you pass data to SAX Parser. Instead of giving it a File, create an implementation of InputStream for it which you can give to saxParser.parse(inputStream, userhandler);. Info on getting position in a stream here: Given a Java InputStream, how can I determine the current offset in the stream?.
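The lookup part of the steps above can be sketched as follows. This is only an illustration under assumptions: the class and method names (QuestionFilterDemo, readQuestionIds, RowHandler, findMatches) are hypothetical, small in-memory strings stand in for the two real files, and the indexing call is replaced by collecting the matching Ids in a list. It uses a HashSet instead of a List, because contains() on a list scans every element and would be slow with many question Ids.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class QuestionFilterDemo {

    /** Collects the first comma-separated field of every line after the header line. */
    static Set<String> readQuestionIds(Reader source) {
        Set<String> ids = new HashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            br.readLine(); // skip the "Id,CreationDate,..." header
            String line;
            while ((line = br.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    ids.add(line.split(",")[0].trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    /** SAX handler that keeps only rows whose Id is in the wanted set. */
    static class RowHandler extends DefaultHandler {
        private final Set<String> wanted;
        final List<String> matched = new ArrayList<>();

        RowHandler(Set<String> wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("row".equals(qName) && wanted.contains(attrs.getValue("Id"))) {
                // A real program would hand the row's attributes to the indexer here.
                matched.add(attrs.getValue("Id"));
            }
        }
    }

    /** Streams postsXml once and returns the Ids that also occur in questionsCsv. */
    static List<String> findMatches(String postsXml, String questionsCsv) {
        try {
            RowHandler handler = new RowHandler(readQuestionIds(new StringReader(questionsCsv)));
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(postsXml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.matched;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String posts = "<posts><row Id=\"4\" PostTypeId=\"1\"/><row Id=\"6\" PostTypeId=\"1\"/></posts>";
        String questions = "Id,CreationDate,CreationDatesk,Score\n4,2008-08-01 16:08:52,20080801,48\n";
        System.out.println(findMatches(posts, questions));
    }
}
```

For the real 50GB file, the ByteArrayInputStream would be replaced by a stream over the file. Because SAX visits each row exactly once, the XML is read only one time no matter how many Ids are in the set.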

Parse XML escaped in CDATA mixed with invalid HTML

I have the below element in a web service response. As you can see, it's escaped XML dumped as CDATA, so the XML parser just looks at it as a string and I'm unable to get the data I need from it through the usual means of XSLT and XPath. I need to turn this ugly string back into XML so that I can read it properly.
I have tried a search and replace that simply converts all &lt; to < and &gt; to >, and this works great, but there is a problem: the message.body element can actually contain HTML which is not valid XML. It might not even be valid HTML for all I know. So if I just replace everything, this will probably crash when I try to turn the string back into an XML document.
How can I unescape this safely? Is there a good way to do the replacement in the whole string except between the message.body open and closing tags for example?
<output><item type="object">
<ticket.id type="string">171</ticket.id>
<ticket.title type="string">SoapUI Test</ticket.title>
<ticket.created_at type="string">2013-12-03 12:50:54</ticket.created_at>
<ticket.status type="string">Open</ticket.status>
<updated type="string">false</updated>
<message type="object">
<message.id type="string">520</message.id>
<message.created_at type="string">2013-12-03 12:50:54.000</message.created_at>
<message.author type="string"/>
<message.body type="string">Just a test message...</message.body>
</message>
<message type="object">
<message.id type="string">521</message.id>
<message.created_at type="string">2013-12-03 13:58:32.000</message.created_at>
<message.author type="string"/>
<message.body type="string">Another message!</message.body>
</message>
</item>
</output>

This is actually lifted from the project I'm working on right now.
private Node stringToNode(String textContent) {
Element node = null;
try {
node = DocumentBuilderFactory.newInstance().newDocumentBuilder()
.parse(new ByteArrayInputStream(textContent.getBytes()))
.getDocumentElement();
} catch (SAXException e) {
logger.error(e.getMessage(), e);
} catch (IOException e) {
logger.error(e.getMessage(), e);
} catch (ParserConfigurationException e) {
logger.error(e.getMessage(), e);
}
return node;
}
This gives you the root element of the document parsed from the string. I use the following to put it back into the original document:
if (textContent.contains(XML_HEADER)) {
textContent = textContent.substring(textContent.indexOf(XML_HEADER) + XML_HEADER.length());
}
Node newNode = stringToNode(textContent);
if (newNode != null) {
Node importedNode = soapBody.getOwnerDocument().importNode(newNode, true);
nextChild.setTextContent(null);
nextChild.appendChild(importedNode);
}

This is my current solution. You give it an XPath for the nodes that are messed up and a set of element names that might contain messed-up HTML and other problems. It works roughly as follows:
1. Pull out the text content of the nodes matched by the XPath.
2. Run a regex to wrap the problematic child elements in CDATA.
3. Wrap the text in a temporary element (otherwise it crashes if there are multiple root nodes).
4. Parse the text back to DOM.
5. Add the child nodes of the temporary node back in place of the previous text content.
The regex solution in step 2 is probably not fool-proof, but I don't really see a better solution at the moment. If you do, let me know!
CDataFixer
import java.util.*;
import java.util.regex.Pattern;
import javax.xml.xpath.*;
import org.w3c.dom.*;
public class CDataFixer
{
private final XmlHelper xml = XmlHelper.getInstance();
public Document fix(Document document, String nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
{
return fix(document, xml.newXPath().compile(nodesToFix), excludes);
}
private Document fix(Document document, XPathExpression nodesToFix, Set<String> excludes) throws XPathExpressionException, XmlException
{
Document wc = xml.copy(document);
NodeList nodes = (NodeList) nodesToFix.evaluate(wc, XPathConstants.NODESET);
int nodeCount = nodes.getLength();
for(int n=0; n<nodeCount; n++)
parse(nodes.item(n), excludes);
return wc;
}
private void parse(Node node, Set<String> excludes) throws XmlException
{
String text = node.getTextContent();
for(String exclude : excludes)
{
String regex = String.format("(?s)(<%1$s\\b[^>]*>)(.*?)(</%1$s>)", Pattern.quote(exclude));
text = text.replaceAll(regex, "$1<![CDATA[$2]]>$3");
}
String randomNode = "tmp_"+UUID.randomUUID().toString();
text = String.format("<%1$s>%2$s</%1$s>", randomNode, text);
NodeList parsed = xml
.parse(text)
.getFirstChild()
.getChildNodes();
node.setTextContent(null);
for(int n=0; n<parsed.getLength(); n++)
node.appendChild(node.getOwnerDocument().importNode(parsed.item(n), true));
}
}
XmlHelper
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.*;
public final class XmlHelper
{
private static final XmlHelper instance = new XmlHelper();
public static XmlHelper getInstance()
{
return instance;
}
private final SAXTransformerFactory transformerFactory;
private final DocumentBuilderFactory documentBuilderFactory;
private final XPathFactory xpathFactory;
private XmlHelper()
{
documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
xpathFactory = XPathFactory.newInstance();
TransformerFactory tf = TransformerFactory.newInstance();
if (!tf.getFeature(SAXTransformerFactory.FEATURE))
throw new RuntimeException("Failed to create SAX-compatible TransformerFactory.");
transformerFactory = (SAXTransformerFactory) tf;
}
public DocumentBuilder newDocumentBuilder()
{
try
{
return documentBuilderFactory.newDocumentBuilder();
}
catch (ParserConfigurationException e)
{
throw new RuntimeException("Failed to create new "+DocumentBuilder.class, e);
}
}
public XPath newXPath()
{
return xpathFactory.newXPath();
}
public Transformer newIdentityTransformer(boolean omitXmlDeclaration, boolean indent)
{
try
{
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitXmlDeclaration ? "yes" : "no");
return transformer;
}
catch (TransformerConfigurationException e)
{
throw new RuntimeException("Failed to create Transformer instance: "+e.getMessage(), e);
}
}
public Templates newTemplates(String xslt) throws XmlException
{
try
{
return transformerFactory.newTemplates(new DOMSource(parse(xslt)));
}
catch (TransformerConfigurationException e)
{
throw new RuntimeException("Failed to create templates: "+e.getMessage(), e);
}
}
public Document parse(String xml) throws XmlException
{
return parse(new InputSource(new StringReader(xml)));
}
public Document parse(InputSource xml) throws XmlException
{
try
{
return newDocumentBuilder().parse(xml);
}
catch (SAXException e)
{
throw new XmlException("Failed to parse xml: "+e.getMessage(), e);
}
catch (IOException e)
{
throw new XmlException("Failed to read xml: "+e.getMessage(), e);
}
}
public String toString(Node node)
{
return toString(node, true, false);
}
public String toString(Node node, boolean omitXMLDeclaration, boolean indent)
{
try
{
StringWriter writer = new StringWriter();
newIdentityTransformer(omitXMLDeclaration, indent)
.transform(new DOMSource(node), new StreamResult(writer));
return writer.toString();
}
catch (TransformerException e)
{
throw new RuntimeException("Failed to transform XML into string: " + e.getMessage(), e);
}
}
public Document copy(Document document)
{
DOMSource source = new DOMSource(document);
DOMResult result = new DOMResult();
try
{
newIdentityTransformer(true, false)
.transform(source, result);
return (Document) result.getNode();
}
catch (TransformerException e)
{
throw new RuntimeException("Failed to copy XML: " + e.getMessage(), e);
}
}
}

Xpath expression for getting an attribute value fails in Java

I am trying to get an attribute value from an XML file, but my code fails with the exception below:
11-15 16:34:42.270: DEBUG/XpathUtil(403): exception = javax.xml.xpath.XPathExpressionException: javax.xml.transform.TransformerException: Extra illegal tokens: '@', 'source'
Here is the code I use to get the node list:
private static final String XPATH_SOURCE = "array/extConsumer@source";
mDocument = XpathUtils.createXpathDocument(xml);
NodeList fullNameNodeList = XpathUtils.getNodeList(mDocument,
XPATH_FULLNAME);
And here is my XpathUtils class:
public class XpathUtils {
private static XPath xpath = XPathFactory.newInstance().newXPath();
private static String TAG = "XpathUtil";
public static Document createXpathDocument(String xml) {
try {
Log.d(TAG , "about to create document builder factory");
DocumentBuilderFactory docFactory = DocumentBuilderFactory
.newInstance();
Log.d(TAG , "about to create document builder ");
DocumentBuilder builder = docFactory.newDocumentBuilder();
Log.d(TAG , "about to create document with parsing the xml string which is: ");
Log.d(TAG ,xml );
Document document = builder.parse(new InputSource(
new StringReader(xml)));
Log.d(TAG , "If i see this message then everythings fine ");
return document;
} catch (Exception e) {
e.printStackTrace();
Log.d(TAG , "EXCEPTION OCCURED HERE " + e.toString());
return null;
}
}
public static NodeList getNodeList(Document doc, String expr) {
try {
Log.d(TAG , "inside getNodeList");
XPathExpression pathExpr = xpath.compile(expr);
return (NodeList) pathExpr.evaluate(doc, XPathConstants.NODESET);
} catch (Exception e) {
e.printStackTrace();
Log.d(TAG, "exception = " + e.toString());
}
return null;
}
// extracts the String value for the given expression
public static String getNodeValue(Node n, String expr) {
try {
Log.d(TAG , "inside getNodeValue");
XPathExpression pathExpr = xpath.compile(expr);
return (String) pathExpr.evaluate(n, XPathConstants.STRING);
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
}
I get an exception thrown in the getNodeList method.
Now, according to http://www.w3schools.com/xpath/xpath_syntax.asp, to get an attribute value, you use the "@" sign. But for some reason, Java is complaining about this symbol.

Try
array/extConsumer/@source
as your XPath expression. This selects the source attribute of the extConsumer element.

Put a slash before the attribute spec:
array/extConsumer/@source

The w3schools page you linked to also says "Predicates are always embedded in square brackets." You just appended the @source. Try
private static final String XPATH_SOURCE = "array/extConsumer[@source]";
EDIT:
To be clear, this is if you're looking for a single item, which is what your original wording led me to believe. If you want to collect a bunch of source attributes, see the answers by vanje and Anon that suggest using a slash instead of square brackets.
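The difference between the two suggested forms can be seen in a small sketch. The XML here is made up, since the question never shows its input, and the class name AttributeXPathDemo is hypothetical. The slash form returns the attribute nodes themselves, while the bracket form is a predicate that returns the extConsumer elements which have a source attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AttributeXPathDemo {
    // Hypothetical input: three extConsumer elements, two of which carry a source attribute.
    static final String XML = "<array>"
            + "<extConsumer source=\"a\"/>"
            + "<extConsumer/>"
            + "<extConsumer source=\"b\"/>"
            + "</array>";

    /** Evaluates expr against the sample document and returns the matching nodes. */
    public static NodeList select(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Attribute step: each result is an attribute node whose value is the attribute text.
        NodeList attrs = select("array/extConsumer/@source");
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println("attribute value: " + attrs.item(i).getNodeValue());
        }
        // Predicate: each result is an element that has the attribute.
        NodeList elems = select("array/extConsumer[@source]");
        for (int i = 0; i < elems.getLength(); i++) {
            System.out.println("element: " + elems.item(i).getNodeName());
        }
    }
}
```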

XML validation in Java - why does this fail?

This is my first time dealing with XML, so please be patient. The code below is probably evil in a million ways (I'd be very happy to hear about all of them), but the main problem is of course that it doesn't work :-)
public class Test {
private static final String JSDL_SCHEMA_URL = "http://schemas.ggf.org/jsdl/2005/11/jsdl";
private static final String JSDL_POSIX_APPLICATION_SCHEMA_URL = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix";
public static void main(String[] args) {
System.out.println(Test.createJSDLDescription("/bin/echo", "hello world"));
}
private static String createJSDLDescription(String execName, String args) {
Document jsdlJobDefinitionDocument = getJSDLJobDefinitionDocument();
String xmlString = null;
// create the elements
Element jobDescription = jsdlJobDefinitionDocument.createElement("JobDescription");
Element application = jsdlJobDefinitionDocument.createElement("Application");
Element posixApplication = jsdlJobDefinitionDocument.createElementNS(JSDL_POSIX_APPLICATION_SCHEMA_URL, "POSIXApplication");
Element executable = jsdlJobDefinitionDocument.createElement("Executable");
executable.setTextContent(execName);
Element argument = jsdlJobDefinitionDocument.createElement("Argument");
argument.setTextContent(args);
//join them into a tree
posixApplication.appendChild(executable);
posixApplication.appendChild(argument);
application.appendChild(posixApplication);
jobDescription.appendChild(application);
jsdlJobDefinitionDocument.getDocumentElement().appendChild(jobDescription);
DOMSource source = new DOMSource(jsdlJobDefinitionDocument);
validateXML(source);
try {
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
transformer.transform(source, result);
xmlString = result.getWriter().toString();
} catch (Exception e) {
e.printStackTrace();
}
return xmlString;
}
private static Document getJSDLJobDefinitionDocument() {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = null;
try {
builder = factory.newDocumentBuilder();
} catch (Exception e) {
e.printStackTrace();
}
DOMImplementation domImpl = builder.getDOMImplementation();
Document theDocument = domImpl.createDocument(JSDL_SCHEMA_URL, "JobDefinition", null);
return theDocument;
}
private static void validateXML(DOMSource source) {
try {
URL schemaFile = new URL(JSDL_SCHEMA_URL);
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(schemaFile);
Validator validator = schema.newValidator();
DOMResult result = new DOMResult();
validator.validate(source, result);
System.out.println("is valid");
} catch (Exception e) {
e.printStackTrace();
}
}
}
It spits out a somewhat odd message:
org.xml.sax.SAXParseException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'JobDescription'. One of '{"http://schemas.ggf.org/jsdl/2005/11/jsdl":JobDescription}' is expected.
Where am I going wrong here?
Thanks a lot

I think you are missing the namespace on your elements. Rather than calling createElement(), you can try
document.createElementNS(JSDL_SCHEMA_URL, elementName)
If necessary, you may need to use a prefix, e.g.
document.createElementNS(JSDL_SCHEMA_URL, "jsdl:"+elementName)
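A minimal sketch of why this matters, assuming only the JSDL namespace URI from the question (the class name JsdlNamespaceDemo and its helper methods are hypothetical): an element made with createElement() has no namespace at all, even when its parent is in the JSDL namespace, and this is why the validator reports that it expected the namespaced JobDescription. The sketch only inspects the namespace of each element and does not run schema validation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class JsdlNamespaceDemo {
    // Namespace URI from the question.
    static final String JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl";

    /** Creates an empty document whose root element is JobDefinition in the JSDL namespace. */
    public static Document newJobDefinition() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().getDOMImplementation()
                    .createDocument(JSDL, "JobDefinition", null);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }
    }

    /** Describes an element as {namespace}localName, the notation the validator message uses. */
    public static String describe(Element e) {
        String local = e.getLocalName() != null ? e.getLocalName() : e.getNodeName();
        return "{" + e.getNamespaceURI() + "}" + local;
    }

    public static void main(String[] args) {
        Document doc = newJobDefinition();

        // As in the question: createElement() yields an element with a null namespace.
        Element plain = doc.createElement("JobDescription");
        // As suggested above: createElementNS() puts the element in the JSDL namespace.
        Element namespaced = doc.createElementNS(JSDL, "JobDescription");

        System.out.println(describe(plain));
        System.out.println(describe(namespaced));

        doc.getDocumentElement().appendChild(namespaced);
    }
}
```

The same applies to every element the schema defines in a namespace, so the Application, Executable and Argument elements would each need createElementNS() with the namespace their schema assigns them.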
