Find elements in a Node without the proper namespace, in Java


So I have an XML document that I parse here:
DocumentBuilder dBuilder = dbFactory_.newDocumentBuilder();
StringReader reader = new StringReader(s);
InputSource inputSource = new InputSource(reader);
doc_ = dBuilder.parse(inputSource);
Then I have a function that takes a string, and I want to match that string to an element in my XML:
void foo(String str)
{
NodeList nodelist = doc_.getDocumentElement().getElementsByTagName(str);
}
The problem is that str comes in without any namespace prefix, so the XML I would be testing looks like this:
<Random>
<tns:node />
</Random>
and str will be node. So nodelist comes back empty, because the document contains tns:node but I passed in node. I know it's not good to ignore the namespace, but in this instance it's fine. My problem is that I don't know how to search the Node for an element while ignoring the namespace. I also thought about adding the namespace prefix to the incoming str, but I have no idea how to do that either.
Any help would be greatly appreciated,
Thanks, -Josh

To match all elements whose local name is str, regardless of namespace, use the following:
NodeList nodes = doc.getDocumentElement().getElementsByTagNameNS("*", str);
The wildcard "*" will match any namespace. See Element.getElementsByTagNameNS(...).
Edit: in addition, as @Wheezil correctly stated in a comment, you have to call DocumentBuilderFactory.setNamespaceAware(true) for this to work; otherwise namespaces will not be detected.
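A minimal sketch of the two pieces together (the namespace-aware factory and the "*" wildcard). The class name, the helper method and the http://example.com/tns URI are made up for illustration. Note that the tns prefix has to be declared somewhere in the document, otherwise a namespace-aware parser will reject it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WildcardNamespaceLookup {

    // Counts descendants of the root element whose local name matches, in any namespace.
    static int countByLocalName(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Needed so that the parser records namespace URIs and local names.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getDocumentElement().getElementsByTagNameNS("*", localName);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical namespace URI; the prefix must be bound for the document to parse.
        String xml = "<Random xmlns:tns=\"http://example.com/tns\"><tns:node/></Random>";
        System.out.println("Matches for 'node': " + countByLocalName(xml, "node"));
    }
}
```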

Related

Navigating XML using XPATH with a different namespace

I am struggling to find out how to navigate into the part of the XML that uses namespaces. Using basic XPath I can navigate to the MessageDetail node fine, but I am not sure what I need to do to get into that block, as everything inside it uses namespaces. Could someone please help?
Thanks
<?xml version="1.0" encoding="UTF-8"?>
<Message>
<MessageList>
<MessageCount>2</MessageCount>
<DateTimeStamp>2016-02-11T12:50:26</DateTimeStamp>
<MessageDetail>
<MessageID>2332445456767</MessageID>
<Env:MessageContainer xmlns:Env="http://www.somesite.com/schema/v1.0/envelope" xmlns:BS="http://www.somesite.com/schema/v1.0/BusinessServices">
<Env:MessageParties>
public List<String> getRefs(String xmlMessageToSend)
{
try
{
Document doc = createDocument(xmlMessageToSend.getBytes());
XPath xpath = xPathFactory.newXPath();
xpath.setNamespaceContext(new NamespaceContext() {
@Override
public String getNamespaceURI(String prefix)
{
if (prefix == null)
throw new NullPointerException("Null prefix");
else if ("Env".equals(prefix))
return "http://www.om.com/schema/v1.0/envelope";
else if ("BS".equals(prefix))
return "http://www.o.com/schema/v1.0/BusinessServices";
return XMLConstants.NULL_NS_URI;
}
@Override
public Iterator getPrefixes(String namespaceURI)
{
throw new UnsupportedOperationException();
}
@Override
public String getPrefix(String namespaceURI)
{
throw new UnsupportedOperationException();
}
});
XPathExpression exp = xpath
.compile("/Message/MessageList/MessageDetail/Env:MessageContainer");
Node result = (Node)exp.evaluate(doc, XPathConstants.NODE);
System.out.println(result.getTextContent());
}
catch (XPathExpressionException | SAXException | IOException | ParserConfigurationException e)
{
e.printStackTrace();
}
return new ArrayList<String>();
}

You don't say what you're using to navigate the document, but generally, there should be a way in the API of whatever you're using for you to declare a namespace prefix that matches the one on Env:MessageContainer. Then you can use that prefix in your XPath, e.g. //e:MessageContainer (assuming you mapped 'e' to "http://www.somesite.com/schema/v1.0/envelope").

You need to use a namespace prefix in the XPath expression. For example,
//foo:MessageContainer
This prefix need not be the same prefix used for the namespace URI as in the original document. Here I've used the prefix foo even though it was Env in your document. As long as both prefixes map to the same URI (http://www.somesite.com/schema/v1.0/envelope in this example) the XPath will match.
How exactly you bind this prefix to the desired namespace URI varies depending on the host language in which the XPath expression is embedded. In XSLT, for example, you simply declare the relevant prefixes in the XSLT stylesheet, much as you would in any other XML document. In XOM, by contrast, you have to supply an XPathContext object that maps the namespace prefixes accordingly. And so on for other languages and APIs.
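For the standard Java XPath API (javax.xml.xpath), the binding is supplied through a NamespaceContext. The following is only a sketch under that assumption: the class and method names are invented, and the prefix foo is deliberately different from the Env prefix used in the document.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrefixBindingDemo {

    static final String ENVELOPE_NS = "http://www.somesite.com/schema/v1.0/envelope";

    // Counts MessageContainer elements in the envelope namespace using the prefix "foo".
    static int countMessageContainers(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    if (prefix == null) {
                        throw new IllegalArgumentException("Null prefix");
                    }
                    // Only the URI has to agree with the document, not the prefix.
                    return "foo".equals(prefix) ? ENVELOPE_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String namespaceURI) {
                    throw new UnsupportedOperationException();
                }

                @Override
                public Iterator<String> getPrefixes(String namespaceURI) {
                    throw new UnsupportedOperationException();
                }
            });

            NodeList hits = (NodeList) xpath.evaluate(
                    "//foo:MessageContainer", doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Message><MessageDetail>"
                + "<Env:MessageContainer xmlns:Env=\"" + ENVELOPE_NS + "\"/>"
                + "</MessageDetail></Message>";
        System.out.println("Matches: " + countMessageContainers(xml));
    }
}
```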

You can access your elements via e.g.
//Env:MessageContainer
But to achieve this, your xmlns:Env="http://www.somesite.com/schema/v1.0/envelope" xmlns:BS="http://www.somesite.com/schema/v1.0/BusinessServices" should be defined in the <root> element instead of (or in addition to) the <Env:MessageContainer> element.
But if you cannot change your source XML, you can instead write the same selection in a longer form that needs no prefix at all:
//*[local-name()='MessageContainer'][namespace-uri()='http://www.somesite.com/schema/v1.0/envelope']
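A short sketch of evaluating that expression with javax.xml.xpath, with no NamespaceContext registered. The class and method names are illustrative only. The document still has to be parsed with a namespace-aware factory so that local-name() and namespace-uri() have something to report.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LocalNameXPathDemo {

    static final String EXPRESSION =
            "//*[local-name()='MessageContainer']"
            + "[namespace-uri()='http://www.somesite.com/schema/v1.0/envelope']";

    // Counts the elements selected by EXPRESSION; no prefix binding is involved.
    static int countMatches(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Message><MessageDetail>"
                + "<Env:MessageContainer "
                + "xmlns:Env=\"http://www.somesite.com/schema/v1.0/envelope\"/>"
                + "</MessageDetail></Message>";
        System.out.println("Matches: " + countMatches(xml));
    }
}
```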

namespace-unaware XPath expression fails if Saxon is on the CLASSPATH

I have the following sample XML file:
<a xmlns="http://www.foo.com">
<b>
</b>
</a>
Using the XPath expression /foo:a/foo:b (with 'foo' properly configured in the NamespaceContext) I can correctly count the number of b nodes and the code works both when Saxon-HE-9.4.jar is on the CLASSPATH and when it's not.
When, however, I parse the same file with a namespace-unaware DocumentBuilderFactory, the XPath expression "/a/b" correctly counts the number of b nodes only when Saxon-HE-9.4.jar is not on the CLASSPATH.
Code below:
import java.io.*;
import java.util.*;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import javax.xml.namespace.NamespaceContext;
public class FooMain {
public static void main(String args[]) throws Exception {
String xmlSample = "<a xmlns=\"http://www.foo.com\"><b></b></a>";
{
XPath xpath = namespaceUnawareXpath();
System.out.printf("[NS-unaware] Number of 'b' nodes is: %d\n",
((NodeList) xpath.compile("/a/b").evaluate(stringToXML(xmlSample, false),
XPathConstants.NODESET)).getLength());
}
{
XPath xpath = namespaceAwareXpath("foo", "http://www.foo.com");
System.out.printf("[NS-aware ] Number of 'b' nodes is: %d\n",
((NodeList) xpath.compile("/foo:a/foo:b").evaluate(stringToXML(xmlSample, true),
XPathConstants.NODESET)).getLength());
}
}
public static XPath namespaceUnawareXpath() {
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
return xpath;
}
public static XPath namespaceAwareXpath(final String prefix, final String nsURI) {
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
NamespaceContext ctx = new NamespaceContext() {
@Override
public String getNamespaceURI(String aPrefix) {
if (aPrefix.equals(prefix))
return nsURI;
else
return null;
}
@Override
public Iterator getPrefixes(String val) {
throw new UnsupportedOperationException();
}
@Override
public String getPrefix(String uri) {
throw new UnsupportedOperationException();
}
};
xpath.setNamespaceContext(ctx);
return xpath;
}
private static Document stringToXML(String s, boolean nsAware) throws Exception {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(nsAware);
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse(new ByteArrayInputStream(s.getBytes("UTF-8")));
}
}
Running the above with:
java -classpath dist/foo.jar FooMain
.. produces:
[NS-unaware] Number of 'b' nodes is: 1
[NS-aware ] Number of 'b' nodes is: 1
Running with:
java -classpath Saxon-HE-9.4.jar:dist/foo.jar FooMain
... produces:
[NS-unaware] Number of 'b' nodes is: 0
[NS-aware ] Number of 'b' nodes is: 1

Correct observation. Saxon doesn't work with a namespace-unaware DOM. There's no reason why it should. If you can find an XSLT/XPath processor that works with a namespace-unaware DOM, then go ahead and use it if you want, but its behaviour isn't defined by any standard.
If it were possible for Saxon to detect that the DOM is namespace-unaware, then it would throw an error rather than giving spurious results. Sadly, one of DOM's many design failings is that if you didn't create the DOM yourself, you can't tell whether it's namespace-aware or not.
Your comment "I need to be lenient on namespaces since I have to handle 3rd-party XML instances that are not always XSD valid." is a complete non-sequitur. It's true that a document can't be XSD-valid unless it is namespace-valid, but the converse is not true; loads of documents are namespace-valid without being XSD-valid.
Finally, as your experience shows, relying on the JAXP mechanism to load whatever XPath processor happens to be lying around on the classpath is very error-prone. You can't even control whether you get an XPath 1.0 or 2.0 processor by this mechanism (and again, you can't find out easily which you have got). If your code is dependent on the quirks of a particular XPath implementation then you need to load that implementation explicitly rather than relying on the JAXP search.
UPDATE (Sep 2015): Saxon 9.6 no longer includes the META-INF services file that advertises it as a JAXP XPath provider. This means you will never pick up Saxon as your XPath processor simply because it is on the classpath: you have to ask for it explicitly.
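One way to avoid the classpath search, sketched here on the assumption that Java 9 or later is available: XPathFactory.newDefaultInstance() returns the JDK's built-in implementation regardless of what else is on the classpath. The sketch also parses namespace-aware and matches on local-name(), which is one option for being lenient about namespaces without relying on a namespace-unaware DOM. The class and method names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PinnedXPathFactoryDemo {

    // Counts /a/b elements by local name, ignoring whatever namespace they are in.
    static int countB(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Java 9+: always the platform default, never a provider found on the classpath.
            XPath xpath = XPathFactory.newDefaultInstance().newXPath();
            Double count = (Double) xpath.evaluate(
                    "count(/*[local-name()='a']/*[local-name()='b'])",
                    doc, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xmlSample = "<a xmlns=\"http://www.foo.com\"><b></b></a>";
        System.out.println("Number of 'b' nodes is: " + countB(xmlSample));
    }
}
```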

The XPath language is only defined on namespace-well-formed XML, so the behaviour of different processors on a non-namespace-aware DOM tree (even one like <a><b/></a> that, had it been parsed in a namespace-aware manner, would not actually use any namespaces) is at best implementation-specific and at worst completely undefined.

Saxon 10 now supports XPath expressions without namespace prefixes. You can configure it like this:
XPath xPath = new net.sf.saxon.xpath.XPathFactoryImpl().newXPath();
((XPathEvaluator)xPath).getStaticContext().setUnprefixedElementMatchingPolicy(UnprefixedElementMatchingPolicy.ANY_NAMESPACE);

JAXB: Get Tag as String

This question may have been answered before in some dark recess of the Interwebs, but I couldn't even figure out how to form a meaningful Google query to search for it.
So: Suppose I have a (simplified) XML document like so:
<root>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
</tag3>
</root>
I know how to use JAXB to unmarshal this into a Java Object in the standard use cases.
What I don't know how to do is unmarshal tag3's contents wholesale into a String. By which I mean:
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
as a String, tags and all.

Use the @XmlAnyElement annotation.
I've been looking for the same solution and expected to find some annotation that prevents the DOM from being parsed and leaves it as it is, but I did not find one.
Detail at:
Using JAXB to extract inner text of XML element
and
http://blog.bdoughan.com/2011/04/xmlanyelement-and-non-dom-properties.html
I added one check to the getElement() method; otherwise we could get an IndexOutOfBoundsException:
if (xml.indexOf(START_TAG) < 0) {
return "";
}
To me, this solution behaves quite strangely. The getElement() method is called for every tag of your XML. The first call receives "Value", the second "ValueWord", and so on: each call appends the next tag's content to the previous one.
update:
I noticed that this approach works only for ONE occurrence of the tag that we want to parse into a String. It's impossible to correctly parse the following example:
<root>
<parent1>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
</tag3>
</parent1>
<parent2>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>TheSecondFoo</something1>
<something2>TheSecondBar</something2>
<something3>TheSecondBaz</something3>
</tag3>
</parent2>
"tag3" with parent tag "parent2" will contain parameters from the first tag (Foo, Bar, Baz) instead of (TheSecondFoo, TheSecondBar, TheSecondBaz)
Any suggestions are appreciated.
Thanks.

I have a utility method that might come in handy for you in that case. See if it helps. I wrote some sample code using your example:
public static void main(String[] args){
String text= "<root><tag1>Value</tag1><tag2>Word</tag2><tag3><something1>Foo</something1><something2>Bar</something2><something3>Baz</something3></tag3></root>";
System.out.println(extractTag(text, "<tag3>"));
}
public static String extractTag(String xml, String tag) {
String value = "";
String endTag = "</" + tag.substring(1);
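// Quote the tags so they are matched literally; DOTALL lets the content span several lines.
Pattern p = Pattern.compile(Pattern.quote(tag) + "(.*?)" + Pattern.quote(endTag), Pattern.DOTALL);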
Matcher m = p.matcher(xml);
if (m.find()) {
value = m.group(1);
}
return value;
}
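A regular expression is fragile once the tag carries attributes or occurs more than once. As an alternative sketch, and not the JAXB route the question asks about, the same extraction can be done with the JDK's own DOM parser and an identity Transformer that serializes each child of the target element. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerXmlExtractor {

    // Returns the serialized children of the first element named tagName, or "" if absent.
    static String innerXml(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagName(tagName);
            if (matches.getLength() == 0) {
                return "";
            }
            // Identity transformer: copies each node to the output unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            NodeList children = matches.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract inner XML", e);
        }
    }

    public static void main(String[] args) {
        String text = "<root><tag1>Value</tag1><tag2>Word</tag2><tag3>"
                + "<something1>Foo</something1><something2>Bar</something2>"
                + "<something3>Baz</something3></tag3></root>";
        System.out.println(innerXml(text, "tag3"));
    }
}
```

To handle several tag3 elements, the same loop could be run over every item of matches instead of only the first.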

StreamingPathFilter trims spaces

I use the XOM library to parse and process .docx documents. MS Word stores text content in runs (<w:r>) inside paragraph tags (<w:p>), and often breaks the text into several runs. Sometimes every word and every space between them is in a separate run. When I load a run containing only a space, the parser removes that space and treats the run as an empty tag. As a result, the output contains the text without spaces. How can I force the parser to keep all the spaces? I would prefer to keep this parser, but if there is no solution, could you recommend an alternative?
This is how I call the parser:
StreamingPathFilter filter = new StreamingPathFilter("/w:document/w:body/*:*", prefixes);
Builder builder = new Builder(filter.createNodeFactory(null, contentTransform));
builder.build(documentFile);
...
StreamingTransform contentTransform = new StreamingTransform() {
@Override
public Nodes transform(nu.xom.Element node){
<...process XML and output text...>
}
}

Meanwhile, I found the solution to this issue, thanks to the hint of Elliotte Rusty Harold on the XOM mailing list.
First, the StreamingPathFilter is in fact not part of the nu.xom package, it belongs to nux.xom.
Second, the issue was caused by StreamingPathFilter. When I changed the code to use the default Builder constructor, the missing spaces appeared in the output.
Just for documentation, the new code looks like the following:
Builder builder = new Builder();
nu.xom.Document doc = builder.build(documentFile);
context = XPathContext.makeNamespaceContext(doc.getRootElement());
Nodes nodes = doc.getRootElement().query("w:body/*", context);
for (int i = 0; i < nodes.size(); i++) {
transform((nu.xom.Element) nodes.get(i));
}
...
private void transform(nu.xom.Element node){
//process nodes
...
}

Can I put all namespace definitions on the top-level element with JAXB

Using handcrafted code my xml was like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#"
xmlns:ext="http://musicbrainz.org/ns/ext-1.0#">
<artist-list offset="0" count="8">
<artist type="Person" id="00ed154e-8679-42f0-8f42-e59bd7e185af"
ext:score="100">
Now I am using JAXB, which is much better. Although the XML it produces is perfectly valid, I need to force it to put the xmlns:ext="http://musicbrainz.org/ns/ext#-1.0" declaration on the metadata element rather than the artist element, for compatibility with client code that I have no control over.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#">
<artist-list offset="0" count="4">
<artist type="Person" id="00ed154e-8679-42f0-8f42-e59bd7e185af"
ext:score="100" xmlns:ext="http://musicbrainz.org/ns/ext#-1.0">
Can this be done?
EDIT: Worked around it with a String replace, because I only have to deal with one specific case:
String xml = sw.toString();
//Remove extension namespace definition
xml=xml.replace("xmlns:ext=\"http://musicbrainz.org/ns/ext#-1.0\"","");
//Add it to the top instead
xml=xml.replace("<metadata xmlns=\"http://musicbrainz.org/ns/mmd-1.0#\">",
"<metadata xmlns=\"http://musicbrainz.org/ns/mmd-1.0#\" xmlns:ext=\"http://musicbrainz.org/ns/ext-1.0#\">");
//Now write out to the proper output stream
out.write(xml);

I don't think there's a way to do it using JAXB, but here's a quick post-processor using Dom4J:
public static void moveNameSpacesToRoot(Document document) {
final Element rootElement = document.getRootElement();
moveNameSpacesToRootElement(rootElement, rootElement);
}
@SuppressWarnings("unchecked")
private static void moveNameSpacesToRootElement(
Element thisElement, Element rootElement) {
if (!thisElement.equals(rootElement)) {
Namespace namespace = thisElement.getNamespace();
if (!namespace.equals(Namespace.NO_NAMESPACE)) {
Namespace existingRootNamespace =
rootElement.getNamespaceForPrefix(namespace.getPrefix());
if (existingRootNamespace == null) {
rootElement.add(namespace);
}
thisElement.remove(namespace);
}
}
for (Element child : (List<Element>) thisElement.elements()) {
moveNameSpacesToRootElement(child, rootElement);
}
}
Oh, I just realized that you need attributes, not elements. However, the change is trivial, so I'll leave that for you.

There is, at least, no documented feature in JAXB to control which element the namespace prefix declaration is placed on. You should, however, be aware that the two XML snippets are semantically identical (it does not matter whether the namespace prefix is declared on the same node or on any ancestor), so you should opt to fix the broken client code or get someone with control of the client code to fix it.
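The equivalence can be checked with a namespace-aware DOM parse, sketched below. The class and method names are invented, and a single ext URI is used in both variants for the sake of the comparison. A client that looks the attribute up by namespace URI rather than by where the prefix is declared reads the same value either way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespacePlacementDemo {

    static final String MMD_NS = "http://musicbrainz.org/ns/mmd-1.0#";
    static final String EXT_NS = "http://musicbrainz.org/ns/ext-1.0#";

    // Reads ext:score from the first artist element, addressing it by namespace URI.
    static String readScore(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element artist = (Element) doc.getElementsByTagNameNS(MMD_NS, "artist").item(0);
            return artist.getAttributeNS(EXT_NS, "score");
        } catch (Exception e) {
            throw new IllegalStateException("Could not read score", e);
        }
    }

    public static void main(String[] args) {
        // Prefix declared on the root element.
        String declaredOnRoot = "<metadata xmlns=\"" + MMD_NS + "\" xmlns:ext=\"" + EXT_NS + "\">"
                + "<artist-list><artist ext:score=\"100\"/></artist-list></metadata>";
        // Prefix declared on the element that uses it.
        String declaredOnArtist = "<metadata xmlns=\"" + MMD_NS + "\">"
                + "<artist-list><artist ext:score=\"100\" xmlns:ext=\"" + EXT_NS + "\"/>"
                + "</artist-list></metadata>";
        System.out.println(readScore(declaredOnRoot).equals(readScore(declaredOnArtist)));
    }
}
```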
