I am using an XPath expression to get text nodes from an XML document like the one below:
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
<proj>
<under>E01</under>
<under>E02</under>
</proj>
<name>John Doe</name>
<gender>male</gender>
</emp>
</company>
I have written the following XPath expression to get the text values:
normalize-space(string(//emp))
It is extracting the correct values and the output is like below:
Acct1000E01E02John Doemale
Notice that there are no spaces between the text node values from different nodes.
What I actually want is output like this:
`Acct 1000 E01 E02 John Doe`
I have used javax.xml.xpath to parse and build the tree as follows:
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document document = builder.parse(new File("/employees.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "normalize-space(string(//emp))";
String output = (String) xpath.compile(expression).evaluate(document, XPathConstants.STRING);
I am using Java SE 10 here, so the XPath version is 1.0.
Is there a better way to extract the text values?
I am pretty new to XPath so any suggestions would be helpful.
You are almost right here.
Picking the not operator is the right way to go.
It should be something like this:
/html/body/company/emp/*[not(self::gender)]
That is, all child elements of emp except the gender element.
Here is a full example in JavaScript:
let xpathExpression = '/html/body/company/emp/*[not(self::gender)]';
let contextNode = window.document;
let xpathResult = document.evaluate(xpathExpression, contextNode,
null, XPathResult.ANY_TYPE, null);
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
console.log(xpathResult.iterateNext());
Oh dear, this one is complicated...
First of all, you haven't tagged your question with an XPath version. Usually people who aren't aware of XPath versions are using the ancient version 1.0, so I'll make that assumption: sorry if it's wrong.
In XPath 1.0, a function that is given a node-set and that expects a string uses the string value of the first node in the node-set, taken in document order.
In your query
normalize-space(string(//emp))
//emp selects a node-set, which happens to contain a single node, so string() takes the string value of that node. The string value of an element node is the concatenation of all its descendant text nodes. The normalize-space function removes leading and trailing whitespace, and collapses each internal run of whitespace to a single space character.
You have shown your XML in indented form as
<company>
<emp>
<dept>Acct</dept>
<salary>1000</salary>
etc, so it's reasonable to expect that the whitespace between elements forms part of the string value of the <emp> element. But you haven't told us how the document was parsed and turned into a node tree. Parsers often provide multiple options on how to do this, in particular, on how to handle the whitespace between element nodes. Most retain the whitespace by default, unless perhaps there is a schema or DTD that tells the parser that the whitespace is insignificant. Microsoft's MSXML parser, notoriously, drops the whitespace by default, which causes considerable problems when you are using XML to represent narrative documents, but actually makes life easier for people using XML for this kind of non-document data.
Your parser, for one reason or another (we can't tell) seems to have deleted the whitespace between element nodes. No XPath query is going to bring it back again. You may have options when building the document to retain the whitespace; that depends on the tools you are using.
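To see how much the parsed tree matters, here is a minimal sketch using the JAXP classes from the question. It assumes the XML is held in a string rather than read from a file, and the class and method names (WhitespaceDemo, empText) are made up for illustration. With the default DocumentBuilderFactory settings the whitespace between elements is kept, so the indented input yields spaces and the compact input does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class WhitespaceDemo {
    // Evaluates the question's expression against XML held in a string.
    static String empText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate("normalize-space(string(//emp))", doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        String indented = "<company>\n  <emp>\n    <dept>Acct</dept>\n"
                + "    <salary>1000</salary>\n  </emp>\n</company>";
        String compact = "<company><emp><dept>Acct</dept><salary>1000</salary></emp></company>";
        System.out.println(empText(indented)); // Acct 1000
        System.out.println(empText(compact));  // Acct1000
    }
}
```

If the second form matches what you see, the source file (or the way it was loaded) most likely has no whitespace between the elements.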
Your second question asks about dropping one of the elements in the input. That's beyond the scope of XPath. XPath can only select nodes from the input; it cannot modify them in any way. To modify the tree, you need XSLT or XQuery.
Your attempt to solve the problem with //emp[not(descendant::gender)] is hopelessly doomed because this will only select employees that have no descendant element named gender. You appear to be guessing the semantics rather than using a specification or tutorial.
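If the tree really has no whitespace between elements, one possible workaround is to let XPath select the individual text nodes and do the joining in Java. The following is only a sketch under that assumption: the class name EmpTextJoiner, the method joinEmpText and the exact expression are illustrative choices, not something from the question. Nothing in the tree is modified; the gender text is simply not selected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EmpTextJoiner {
    // Every non-blank text node under <emp>, skipping anything inside <gender>.
    private static final String TEXT_NODES =
            "//emp//text()[normalize-space() and not(ancestor::gender)]";

    static String joinEmpText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList texts = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(TEXT_NODES, doc, XPathConstants.NODESET);
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < texts.getLength(); i++) {
                // Trim each value so indentation in the source does not leak through.
                parts.add(texts.item(i).getNodeValue().trim());
            }
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company><emp><dept>Acct</dept><salary>1000</salary>"
                + "<proj><under>E01</under><under>E02</under></proj>"
                + "<name>John Doe</name><gender>male</gender></emp></company>";
        System.out.println(joinEmpText(xml)); // Acct 1000 E01 E02 John Doe
    }
}
```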
Hey all, I am terrible at Regex stuff and am wondering what //*[not(*)] means exactly when passed to XPath.compile for an XML document. The only thing I can find is (https://regex101.com/r/Kjodlj/1):
Match a single character [not(*)].
not() matches a single character not() (case sensitive)
NodeList nodeList = (NodeList) xPath.compile("//*[not(*)]").evaluate(document, XPathConstants.NODESET);
The above code does not seem to give me any of the comments that are throughout my XML file. Doing something like this:
NodeList nodeList = (NodeList) xPath.compile("//*").evaluate(document, XPathConstants.NODESET);
Does show the comments but also messes up the page parsing.
Is there a Regex that does both so that it still formats it correctly and also includes the comments as well? Or perhaps doing this in another form that's easier than using regex?
XPath.compile compiles XPath expressions, not Regex expressions. They are quite unrelated.
The XPath expression //*[not(*)] selects all elements in the document that do not have a child element (that is, all leaf elements). The way it works is:
// expands to /descendant-or-self::node()/
* expands to child::element()
not(X), where X is a node-set, tests whether the node-set is empty.
So the expression means
/descendant-or-self::node()/child::element()[empty(child::element())]
Which selects all elements that are a child of something in the document (actually, all elements are a child of something), and then filters this set to retain only those where child::element() returns nothing, that is, those that have no child elements.
But first you need to get it out of your head that this has anything to do with regular expressions. If you search a Regex tutorial hoping to get insights about XPath, you are going to get very confused.
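As a small illustration of the leaf-element behaviour described above, here is a sketch with the standard JAXP API. The sample XML and the names LeafElements and leafNames are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LeafElements {
    // Returns the names of all elements that have no child elements.
    static List<String> leafNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList leaves = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[not(*)]", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < leaves.getLength(); i++) {
                names.add(leaves.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<company><emp><dept>Acct</dept>"
                + "<proj><under>E01</under></proj></emp></company>";
        // company, emp and proj have child elements, so only the leaves remain.
        System.out.println(leafNames(xml)); // [dept, under]
    }
}
```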
I have some XML files. Each file contains some information, including a description of itself enclosed in the element <namespace:description></namespace:description>. This description will be inserted into an HTML web page and uploaded to the web.
The problem is that the description element contains other HTML elements, and I want to keep them so that the text can be formatted, but XPath strips all those elements and returns only their text.
<namespace:descr>Some <i>nice</i> description</namespace:descr>
I tried variations on this XPath query: //*[local-name()='descr']
(I'm not really skilled with XPath)
Also tried something like //*[local-name()='descr']//*[not(descendant::*[self::p or self::i])] found in this answer, but it doesn't work for me.
So my question: is there some way to keep XML/HTML elements in text after using XPath query?
The return value of an XPath expression can either be a string, number, boolean or a node-set. Each of these types can be converted to one of the primitive types.
The expression //*[local-name()='descr'] returns a node-set but you then obviously convert it to a string which returns the concatenated text content of the first node in the node-set, stripping off all markup.
To print the content of the result node as markup you would need to do the following:
Retrieve the expression result as node-set. The implementation type of the node-set depends on the XPath engine, and for instance could be a DOM nodelist.
Serialize the nodes as an XML fragment. This of course depends on the node-set API and the XPath engine. XSLT could be used for that, but it may also be as simple as calling toString() on the node implementation.
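With the JDK's own JAXP classes, the two steps above could look roughly like the sketch below. It assumes a DOM-based XPath engine, uses an identity Transformer for the serialization step, and binds the namespace prefix to a made-up URI (urn:example) so that the sample parses; the names MarkupKeeper and descrAsXml are hypothetical as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class MarkupKeeper {
    // Returns the first descr element serialized as markup instead of plain text.
    static String descrAsXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for local-name() to see "descr"
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 1: ask for a node, not a string, so the markup is not stripped.
            Node descr = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='descr']", doc, XPathConstants.NODE);

            // Step 2: serialize that node with an identity transformation.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(descr), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse, evaluate or serialize", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xmlns:namespace=\"urn:example\">"
                + "<namespace:descr>Some <i>nice</i> description</namespace:descr></doc>";
        // The printed fragment still contains the <i> element.
        System.out.println(descrAsXml(xml));
    }
}
```

The result includes the descr wrapper element itself; if only its content is wanted, the child nodes would have to be serialized one by one.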
What is the most efficient way to search for an element?
Would I have to traverse the complete DOM4j document? Should I use XPath here?
I am actually comparing two XML documents. I will iterate through the first XML document element by element and search for each one in the second XML document.
It is not a straightforward comparison. I would be comparing the name attribute value with the second XML document's elements. If the first XML document has a name such as name="xx.yy", then I need to look for <xx>
<yy></yy>
</xx> in second xml.
Maybe you could use Jsoup for this? I don't know what kind of comparison you are planning, but with Jsoup you could simply select all nodes from both XML documents and iterate over both collections in one loop.
Jsoup is very effective and easy to use if you need to select an arbitrary node by its attribute (any attribute), tag name, or content.
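For the dotted-name lookup described in the question, an XPath-based variant might look like the sketch below. It uses the JAXP classes from the JDK rather than Jsoup or DOM4j, and it assumes the name attribute contains only plain element names separated by dots. DottedLookup, toXPath and exists are hypothetical names for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DottedLookup {
    // Turns "xx.yy" into the XPath "//xx/yy": a yy element directly inside an xx element.
    static String toXPath(String dottedName) {
        if (!dottedName.matches("[A-Za-z_][\\w-]*(\\.[A-Za-z_][\\w-]*)*")) {
            throw new IllegalArgumentException("not a plain dotted name: " + dottedName);
        }
        return "//" + dottedName.replace('.', '/');
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse", e);
        }
    }

    // True if the second document contains the nested elements named by dottedName.
    static boolean exists(Document second, String dottedName) {
        try {
            String expression = "boolean(" + toXPath(dottedName) + ")";
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, second, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate", e);
        }
    }

    public static void main(String[] args) {
        Document second = parse("<root><xx><yy></yy></xx></root>"); // parse once, query many times
        System.out.println(exists(second, "xx.yy")); // true
        System.out.println(exists(second, "xx.zz")); // false
    }
}
```

Whether this beats a manual traversal depends on document size and the number of lookups, so it is worth measuring on real data.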
I'd like to query a HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.
But I'd also like to make modifications to the original source string based on the results of the queries.
Is there a Java HTML parser around that retains indexes to the original source string, so I can locate a node and modify the correct part of the original string?
Cheers.
It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.
While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.
Jericho has begin and end attributes for every element, and even for parts of the element like the tag name or even an attribute name, so you can edit the document yourself with that information, but where Jericho really shines is the OutputDocument class, which lets you specify replacements directly by calling the appropriate methods with the Jericho elements that match your query instead of having to explicitly call getBegin() and getEnd() on them and pass that to some replacement method.
We use the Jericho HTML parser to do the parsing and HtmlCleaner to do the actual cleanup.
We had problems with Jericho's behavior within a server app (memory management, logging) that we fixed. (The original developer didn't think our issues were important enough to put in the main code branch.) Our fork is on GitHub.
We also made fixes to htmlcleaner.
I don't know about the "retain indexes to the original text" part but Jericho is a very good HTML parser library.
Here is an example of how to remove every span tag from an HTML string:
public static String removeSpans(String html) {
Source source = new Source(html);
source.fullSequentialParse();
OutputDocument outputDocument = new OutputDocument(source);
List<Tag> tags = source.getAllTags();
for (Tag tag : tags) {
String tagname = tag.getName().toLowerCase();
if (tagname.equals("span")) {
//remove the <span>
outputDocument.remove(tag);
}
}
return outputDocument.toString();
}
I guess you could use HTML Parser.
You can get indexes to original Page using getStartPosition() and getEndPosition() from class Node.
As others have suggested, you probably want to render the DOM. This basically just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like JTidy. Then you have easy access to the document and can modify it as required. I would suggest DOM4J; it has a good API and XPath support too.
Re your "indexing" requirement, during your traversal/querying of the document you can cache in a list or map any elements or nodes that you wish to modify the text of at a later point.
this works great
http://jtidy.sourceforge.net/
EXAMPLE
Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setXHTML(true); // set desired config options using tidy setters
... // (equivalent to command line options)
tidy.parse(inputStream, System.out);
For crawling the DOM, I recommend using JDOM; it's way faster than simple XML.
http://www.jdom.org/
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root");
root.appendChild(text);
doc.appendChild(root);
As far as implementation is concerned, I would make a new document and add nodes to it from the source.
You could try ANTLR with an HTML grammar.
You could take (at least) 2 approaches - try and use it as an actual HTML parser, and then get the indexes into the original string that you are interested in.
Or, it also has built-in support for doing in-place transformations on source text, where you define the transformations that you want to perform on the text as part of the grammar.
I'm doing some tests, but I see no difference whether or not I use the normalize() method.
But the examples on the ExampleDepot website use it.
So, what is it for? (The documentation wasn't clear to me either.)
You can programmatically build a DOM tree that has extraneous structure not corresponding to actual XML structures - specifically things like multiple nodes of type text next to each other, or empty nodes of type text. The normalize() method removes these, i.e. it combines adjacent text nodes and removes empty ones.
This can be useful when you have other code that expects DOM trees to always look like something built from an actual XML document.
This basically means that the following XML element
<foo>Hello
wor
ld</foo>
could be represented like this in a denormalized node:
Element foo
Text node: ""
Text node: "Hello "
Text node: "wor"
Text node: "ld"
When normalized, the node will look like this
Element foo
Text node: "Hello world"
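The same situation can be reproduced with the standard org.w3c.dom API. This is only a sketch: the class name NormalizeDemo and the helper buildDenormalizedFoo are made up, and the tree is built by hand rather than parsed, since that is the usual way to end up with empty or adjacent Text nodes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NormalizeDemo {
    // Builds <foo> with one empty Text node followed by three adjacent Text nodes.
    static Element buildDenormalizedFoo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element foo = doc.createElement("foo");
            doc.appendChild(foo);
            foo.appendChild(doc.createTextNode(""));
            foo.appendChild(doc.createTextNode("Hello "));
            foo.appendChild(doc.createTextNode("wor"));
            foo.appendChild(doc.createTextNode("ld"));
            return foo;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    public static void main(String[] args) {
        Element foo = buildDenormalizedFoo();
        System.out.println(foo.getChildNodes().getLength()); // 4
        foo.normalize(); // drops the empty node and merges the adjacent ones
        System.out.println(foo.getChildNodes().getLength()); // 1
        System.out.println(foo.getFirstChild().getNodeValue()); // Hello world
    }
}
```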
It cleans the tree of adjacent Text nodes and empty Text nodes.
There are a lot of possible DOM trees that correspond to the same XML structure, and each XML structure has at least one corresponding DOM tree. So the conversion from DOM to XML is surjective, but not injective.
So it may happen that:
dom_tree_1 != dom_tree_2
# but:
dom_tree_1.save_DOM_as_XML() == dom_tree_2.save_DOM_as_XML()
And there is no way to ensure:
dom_tree == dom_tree.save_DOM_as_XML().load_DOM_from_XML()
But we would like to have it bijective. That means each XML structure corresponds to one particular DOM tree.
So you can define a subset of all possible DOM trees that is bijective to the set of all possible XML structures.
# still:
dom_tree.save_DOM_as_XML() == dom_tree.normalize().save_DOM_as_XML()
# but with:
dom_tree_n = dom_tree.normalize()
# we now even have:
dom_tree_n == dom_tree_n.save_DOM_as_XML().load_DOM_from_XML().normalize()
So normalized DOM trees can be perfectly reconstructed from their XML representation. There is no information loss.
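In Java terms, the pseudocode above could be checked roughly as follows. This is a sketch with the standard DOM and an identity Transformer; RoundTripDemo, fooWith and toXml are hypothetical helper names, and Node.isEqualNode stands in for the == comparison.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class RoundTripDemo {
    // Builds a document whose <foo> root holds one Text node per chunk.
    static Document fooWith(String... chunks) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element foo = doc.createElement("foo");
            doc.appendChild(foo);
            for (String chunk : chunks) {
                foo.appendChild(doc.createTextNode(chunk));
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not create document", e);
        }
    }

    // Serializes the document without an XML declaration.
    static String toXml(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize", e);
        }
    }

    public static void main(String[] args) {
        Document split = fooWith("Hello ", "wor", "ld");
        Document merged = fooWith("Hello world");
        Element splitRoot = split.getDocumentElement();
        Element mergedRoot = merged.getDocumentElement();

        System.out.println(splitRoot.isEqualNode(mergedRoot));    // false: 3 Text nodes vs 1
        System.out.println(toXml(split).equals(toXml(merged)));   // true: same XML text
        splitRoot.normalize();
        System.out.println(splitRoot.isEqualNode(mergedRoot));    // true after normalizing
    }
}
```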
Normalize the root element of the XML document. This ensures that all Text nodes under the root node are put into a "normal" form, which means that there are neither adjacent Text nodes nor empty Text nodes in the document.