Xpath compile Regex not showing xml comments - java

Hey all I am terrible at Regex stuff and wondering what this //[not(*)] means exactly when placed into an XML XPath compile? Only thing I can find is (https://regex101.com/r/Kjodlj/1)
Match a single character [not(*)].
not() matches a single character not() (case sensitive)
NodeList nodeList = (NodeList) xPath.compile("//*[not(*)]").evaluate(document, XPathConstants.NODESET);
The above code does not seem to give me any of the comments that are throughout my XML file. Doing something like this:
NodeList nodeList = (NodeList) xPath.compile("//*").evaluate(document, XPathConstants.NODESET);
Does show the comments but also messes up the page parsing.
Is there a Regex that does both so that it still formats it correctly and also includes the comments as well? Or perhaps doing this in another form that's easier than using regex?

XPath.compile compiles XPath expressions, not Regex expressions. They are quite unrelated.
The XPath expression //*[not(*)] selects all elements in the document that do not have a child element (that is, all leaf elements). The way it works is:
// expands to /descendant-or-self::node()/
* expands to child::element()
not(X), where X is a node-set, tests whether the node-set is empty.
So the expression means
/descendant-or-self::node()/child::element()[empty(child::element())]
Which selects all elements that are a child of something in the document (actually, all elements are a child of something), and then filters this set to retain only those where child::element() returns nothing, that is, those that have no child elements.
But first you need to get it out of your head that this has anything to do with regular expressions. If you search a Regex tutorial hoping to get insights about XPath, you are going to get very confused.
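For the comments part of the question, here is a minimal sketch, assuming the standard javax.xml.xpath API and a made-up sample document. The class name LeafElementsDemo and the helper select are hypothetical. It relies on the fact that * matches only element nodes, so comment nodes need the comment() node test, and the two can be combined with the | union operator.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LeafElementsDemo {

    // Hypothetical helper: evaluates an XPath 1.0 expression against an XML
    // string and describes each selected node as "nodeName=trimmed text".
    static List<String> select(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .compile(expression)
                    .evaluate(document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                result.add(node.getNodeName() + "=" + node.getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample document invented for this sketch.
        String xml = "<root><!-- top note --><a><b>1</b><!-- inner note --><c>2</c></a><d>3</d></root>";
        System.out.println(select(xml, "//*[not(*)]"));               // leaf elements only
        System.out.println(select(xml, "//comment()"));               // comment nodes only
        System.out.println(select(xml, "//*[not(*)] | //comment()")); // leaf elements and comments
    }
}
```

Whether the union fits your page parsing depends on how the rest of your code treats comment nodes, so treat this as a starting point.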

Related

Why am I getting no spaces between text node values?

I am using Xpath expression to get text nodes from a XML document like below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPATH expression to get the text values :
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
I actually want the output value to be in this way:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using JAVA SE 10 here. So, the Xpath version is 1.0
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.
You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child elements of emp except the gender element.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and normalizes each run of internal whitespace to a single space character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input, it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
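One way to get space-separated values without depending on the whitespace between elements is to select the text nodes themselves and join them in Java. This is only a sketch, assuming the javax.xml.xpath API from the question. The class name JoinTextNodesDemo and the helper joinText are hypothetical, and the XML is the question's sample written without indentation. Note that it only selects nodes, so it stays within what XPath 1.0 can do.

```java
import java.io.StringReader;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JoinTextNodesDemo {

    // Hypothetical helper: selects text nodes with the given expression and
    // joins their trimmed values with a single space.
    static String joinText(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList textNodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);
            StringJoiner joined = new StringJoiner(" ");
            for (int i = 0; i < textNodes.getLength(); i++) {
                joined.add(textNodes.item(i).getNodeValue().trim());
            }
            return joined.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company><emp><dept>Acct</dept><salary>1000</salary>"
                + "<proj><under>E01</under><under>E02</under></proj>"
                + "<name>John Doe</name><gender>male</gender></emp></company>";
        // The predicate [normalize-space()] skips whitespace-only text nodes.
        System.out.println(joinText(xml, "//emp//text()[normalize-space()]"));
        // Additionally skip the text that sits directly inside <gender>.
        System.out.println(joinText(xml, "//emp//text()[normalize-space() and not(parent::gender)]"));
    }
}
```

Joining in Java keeps the result independent of whether the parser retained the whitespace between elements.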

Too many results when finding within element with Selenium WebDriver

I did the following search
parts.get(i).findElements(By.xpath("//li[starts-with(@class, '_lessons--row-')]"))
and it returned dozens of results, while I see in Developer Tools, that there are no more than 3 of them.
parts.get(i) returns single WebElement.
Looks like it searches not children of a given element, but over entire page. Can double slash cause this? What double slash means in XPath?
Your XPath is the problem here.
"//li[starts-with(@class, '_lessons--row-')]"
An expression that starts with // searches from the document root. To search from the current node, prepend a dot:
".//li[starts-with(@class, '_lessons--row-')]"
Try your XPath with .// instead. When you call findElements on an element, you should normally start the XPath with "." so that the search does not begin at the document root.
.//li[starts-with(@class, '_lessons--row-')]
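The difference between // and .// can be sketched without a browser. The example below uses the JDK's javax.xml.xpath instead of Selenium, on the assumption that the same XPath rule applies to an element-scoped findElements call. The class name RelativeXPathDemo, the helper count and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical helper: picks a context node with contextExpr, then counts
    // the nodes that expr selects when evaluated against that context node.
    static int count(String xml, String contextExpr, String expr) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node context = (Node) xpath.evaluate(contextExpr, document, XPathConstants.NODE);
            NodeList hits = (NodeList) xpath.evaluate(expr, context, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<ul id='a'><li class='_lessons--row-1'/><li class='_lessons--row-2'/></ul>"
                + "<ul id='b'><li class='_lessons--row-3'/></ul>"
                + "</body>";
        String context = "//ul[@id='a']";
        // Absolute path: ignores the context node and scans the whole document.
        System.out.println(count(xml, context, "//li[starts-with(@class, '_lessons--row-')]"));
        // Relative path: only descendants of the context node.
        System.out.println(count(xml, context, ".//li[starts-with(@class, '_lessons--row-')]"));
    }
}
```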
A leading // matches descendants starting at the document root, regardless of which element you call findElements on. In your case you are trying to locate using
//li[starts-with(@class, '_lessons--row-')]
so it will return every match in your HTML. If you want to locate only the li elements whose class starts with _lessons--row- in a specific portion of the page, you have to make your XPath more specific,
e.g.
//div[@id='someid']//li[starts-with(@class, '_lessons--row-')]
I had a similar case, but . before // didn't help me. Just added findElements(By.xpath("your_xpath")).stream().filter(WebElement::isDisplayed).toList() as a workaround.

Keep elements in text with XPath

I have XML files, and each file contains some information. It also contains a description of itself enclosed in the element <namespace:description></namespace:description>. This description will be inserted into an HTML web page and uploaded to the web.
The problem is that the description element contains other HTML elements and I want to keep them there, so that the text can be formatted, but XPath strips all those elements and returns only their text.
<namespace:descr>Some <i>nice</i> description</namespace:descr>
I tried variations on this XPath query: //*[local-name()='descr']
(I'm not really skilled with XPath)
Also tried something like //*[local-name()='descr']//*[not(descendant::*[self::p or self::i])] found in this answer, but it doesn't work for me.
So my question: is there some way to keep XML/HTML elements in text after using XPath query?
The return value of an XPath expression can either be a string, number, boolean or a node-set. Each of these types can be converted to one of the primitive types.
The expression //*[local-name()='descr'] returns a node-set but you then obviously convert it to a string which returns the concatenated text content of the first node in the node-set, stripping off all markup.
To print the content of the result node as markup you would need to do the following:
Retrieve the expression result as node-set. The implementation type of the node-set depends on the XPath engine, and for instance could be a DOM nodelist.
Serialize the nodes as XML fragment. This of course depends on the API node-set and the XPath engine. XSLT could be used for that but it may also be as simple as calling toString() on the node implementation.
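The two steps above can be sketched in Java as follows, assuming a DOM-based XPath engine (javax.xml.xpath) and the JDK's identity Transformer for serialization. The class name SerializeNodeDemo, the helper selectAsMarkup and the namespace URI urn:example are placeholders, not anything from the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SerializeNodeDemo {

    // Hypothetical helper: selects the first node matching the expression and
    // serializes it, markup included, instead of converting it to a string.
    static String selectAsMarkup(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 1: retrieve the result as a node, not as a string.
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODE);

            // Step 2: serialize the node with an identity transformation.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // urn:example is a placeholder namespace URI for this sketch.
        String xml = "<doc xmlns:namespace='urn:example'>"
                + "<namespace:descr>Some <i>nice</i> description</namespace:descr></doc>";
        System.out.println(selectAsMarkup(xml, "//*[local-name()='descr']"));
    }
}
```

This serializes the descr element itself. If only its content is wanted, the same transformer could be applied to each of its child nodes in turn.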

Standard xpath syntax to find all nodes with a given name (e.g. "//nodeName") fails

I've loaded up an XML document and I'm attempting to use xpath to find all nodes with the name "CodeList". For whatever reason, the xpath expression //CodeList provides 0 nodes, but the xpath expression /.//CodeList provides me with the list of correctly identified nodes. Reading through various tutorials on the Internet, //CodeList should be the correct syntax to do what I want.
I'm not certain as to why this is happening. The xpath expression . and /. return the same node, which seems to be the document (getNodeName returns "#document").
Someone suggested that the libraries in my classpath could be the source of the problem.
So far, the only XML-related libraries that are dependencies are:
xmlbeans-2.3.0
xml-apis-1.3.04
xalan-2.7.1
xercesImpl-2.9.1
//CodeList and /.//CodeList should both return exactly the same result. If they don't, it's a bug. Both should return all the CodeList elements in no namespace. If your elements are all in a (default) namespace, both expressions should return nothing.
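The namespace point can be checked with a small sketch, assuming a namespace-aware DOM parse and the JDK's javax.xml.xpath. The class name NamespaceMatchDemo, the helper count, the sample documents and the URI urn:example are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceMatchDemo {

    // Hypothetical helper: counts the nodes an expression selects in a
    // namespace-aware parse of the given XML string.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<Root><CodeList/><Group><CodeList/></Group></Root>";
        String namespaced = "<Root xmlns='urn:example'><CodeList/><Group><CodeList/></Group></Root>";

        // No namespace: the unprefixed name test matches.
        System.out.println(count(plain, "//CodeList"));
        // Default namespace: the unprefixed name test matches nothing.
        System.out.println(count(namespaced, "//CodeList"));
        // Matching on the local name ignores the namespace.
        System.out.println(count(namespaced, "//*[local-name()='CodeList']"));
    }
}
```

If the second count is what you see with your real document, a default namespace is the likely cause. Registering a NamespaceContext on the XPath object and using a prefix is the stricter alternative to local-name().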
Try
"//CodeList/*/text()"
and you'll get the text nodes of every child element of each CodeList.

How do I change part of a org.w3c.dom.Node using an xpath expression?

I'm using Java 6. Given an org.w3c.dom.Node, how do I change the contents of one of its child elements (or potentially the node itself), given an XPath String expression representing one of those elements? Note that by "contents", I'm always referring to text. If the child element represented by the path expression contains other child elements, those should go away and be replaced with the text I want to substitute.
Thanks, - Dave
First, you find the element that the XPath expression points to (using the simple XPathAPI)
// `node` is your node
// `xpathExpr` is a String with the XPath expression
// `elem` is the element pointed to by the XPath expression
Node elem = XPathAPI.selectSingleNode(node, xpathExpr);
then you use Node#setTextContent (javadoc):
On setting, any possible children this node may have are removed and, if it the new string is not empty or null, replaced by a single Text node containing the string this attribute is set to.
elem.setTextContent("This is the new content. Old content, go away!");
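If you would rather not depend on Xalan's XPathAPI, roughly the same thing can be done with the javax.xml.xpath API that ships with the JDK. This is a sketch only. The class name ReplaceTextDemo, the helpers parse and replaceText, and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ReplaceTextDemo {

    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates xpathExpr relative to `node` and replaces the content of the
    // first matching node (child elements included) with newText.
    static void replaceText(Node node, String xpathExpr, String newText) {
        Node elem;
        try {
            elem = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, node, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + xpathExpr, e);
        }
        if (elem == null) {
            throw new IllegalArgumentException("No node matches: " + xpathExpr);
        }
        elem.setTextContent(newText);
    }

    public static void main(String[] args) {
        Document doc = parse("<order><item><name>Pen</name><qty>2</qty></item></order>");
        // "item" is resolved relative to the <order> element.
        replaceText(doc.getDocumentElement(), "item", "This is the new content.");
        System.out.println(doc.getElementsByTagName("item").item(0).getTextContent());
        System.out.println(doc.getElementsByTagName("name").getLength());
    }
}
```

Passing "." as the expression targets the node itself, which covers the "or potentially the node itself" case from the question.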
