Parse xml nodes having text with any namespace using jsoup - java

I am trying to parse XML from a URL using Jsoup.
Some nodes in this XML have a namespace prefix, for example <wsdl:types>.
I want to get all nodes whose name is "types", whatever their namespace.
I can get these nodes with the expression "wsdl|types", but how can I match "types" under any namespace?
I tried the expression "*|types", but it didn't work.
Please help.

There is no such selector (yet), but you can use a workaround. It is not as easy to read as a selector, but it works.
/*
* Connect to the url and parse the document; a XML Parser is used
* instead of the default one (html)
*/
final String url = "http://www.consultacpf.com/webservices/producao/cdc.asmx?wsdl";
Document doc = Jsoup.connect(url).parser(Parser.xmlParser()).get();
// Elements of any tag, but with 'types' are stored here
Elements withTypes = new Elements();
// Select all elements
for( Element element : doc.select("*") )
{
// Split the tag by ':'
final String s[] = element.tagName().split(":");
/*
* If there's a namespace (otherwise s.length == 1) use the 2nd
* part and check if the element has 'types'
*/
if( s.length > 1 && s[1].equals("types") )
{
// Add this element to the found elements list
withTypes.add(element);
}
}
You can put the essential parts of this code into a method, so you get something like this:
Elements nsSelect(Document doc, String value)
{
// Code from above
}
...
Elements withTypes = nsSelect(doc, "types");
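The tag-name test at the heart of nsSelect can be kept free of jsoup so it is easy to check on its own. The following is only a sketch: the class and method names (NamespaceNames, localName, hasLocalName) are made up for illustration, and unlike the loop above it also accepts a tag with no prefix at all, which is an assumption about what "any namespace" should mean.

```java
final class NamespaceNames {
    /** Returns the part of a tag name after the prefix, or the whole name if there is no prefix. */
    static String localName(String tagName) {
        int colon = tagName.indexOf(':');
        return colon < 0 ? tagName : tagName.substring(colon + 1);
    }

    /** True if the tag's local name equals the wanted value, whatever its prefix. */
    static boolean hasLocalName(String tagName, String wanted) {
        return localName(tagName).equals(wanted);
    }
}
```

Inside nsSelect the condition would then read if( NamespaceNames.hasLocalName(element.tagName(), value) ).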

Related

reading xml file with multiple child node

Suppose I have an XML file like the one below.
<top>
<CRAWL>
<NAME>div[class=name],attr=0</NAME>
<PRICE>span[class~=(?i)(price-new|price-old)],attr=0</PRICE>
<DESC>div[class~=(?i)(sttl dyn|bin)],attr=0</DESC>
<PROD_IMG>div[class=image]>a>img,attr=src</PROD_IMG>
<URL>div[class=name]>a,attr=href</URL>
</CRAWL>
<CRAWL>
<NAME>img[class=img],attr=alt</NAME>
<PRICE>div[class=g-b],attr=0</PRICE>
<DESC>div[class~=(?i)(sttl dyn|bin)],attr=0</DESC>
<PROD_IMG>img[itemprop=image],attr=src</PROD_IMG>
<URL>a[class=img],attr=href</URL>
</CRAWL>
</top>
What I want is to first take all the values under the first <CRAWL> and, once that operation is finished, go on to the next <CRAWL> and repeat, even if there are more than two <CRAWL> tags. I have managed to get the values when only one <CRAWL> is present. I use the values inside the tags for some other function: each <CRAWL> holds different values, and I use them for different operations. Everything else is fine; I just don't know how to loop over the <CRAWL> entries inside the XML file.
regards
If I'm understanding this correctly, you're trying to extract data from ALL <CRAWL> tags that exist within your XML fragment. There are multiple solutions to this; I'm listing them below.
XPath: If you know exactly what your XML structure is, you can use one XPath expression per tag under CRAWL to find the data:
// Instantiate XPath variable
XPath xpath = XPathFactory.newInstance().newXPath();
// Define the exact XPath expressions you want to get data for:
XPathExpression name = xpath.compile("//top/CRAWL/NAME/text()");
XPathExpression price = xpath.compile("//top/CRAWL/PRICE/text()");
XPathExpression desc = xpath.compile("//top/CRAWL/DESC/text()");
XPathExpression prod_img = xpath.compile("//top/CRAWL/PROD_IMG/text()");
XPathExpression url = xpath.compile("//top/CRAWL/URL/text()");
At this point, each of the variables above holds a compiled expression, not the data itself. Evaluating one against the parsed document with XPathConstants.NODESET returns the matching text nodes from every <CRAWL> element, which you could copy into one array per tag.
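To process one CRAWL at a time rather than one tag at a time, one option is to select the CRAWL nodes first and then evaluate relative expressions against each of them. This is a sketch under assumptions: the class name CrawlReader and the choice of NAME, PRICE and URL as the returned fields are illustrative, and the XML is passed in as a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CrawlReader {
    /** Returns one String[] {name, price, url} per CRAWL element, in document order. */
    static List<String[]> readCrawls(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            NodeList crawls = (NodeList) xpath.evaluate("/top/CRAWL", doc, XPathConstants.NODESET);
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < crawls.getLength(); i++) {
                Node crawl = crawls.item(i);
                // Relative expressions are evaluated against the current CRAWL only
                String name = xpath.evaluate("NAME", crawl);
                String price = xpath.evaluate("PRICE", crawl);
                String url = xpath.evaluate("URL", crawl);
                rows.add(new String[] {name, price, url});
            }
            return rows;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped to keep the sketch short
            throw new IllegalStateException(e);
        }
    }
}
```

Each row then carries the selectors of exactly one CRAWL, so the follow-up operation can run once per row.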
The other (more efficient) solution is to read the data with DOM-based parsing:
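// Instantiate the doc builder
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder xmlDocBuilder = domFactory.newDocumentBuilder();
Document xmlDoc = xmlDocBuilder.parse("xmlFile.xml");
// Create NodeList of element tag "CRAWL"
NodeList crawlNodeList = xmlDoc.getElementsByTagName("CRAWL");
// Now iterate through each item in the NodeList and get the values of
// each of the elements in Name, Price, Desc etc.
// (NodeList is not Iterable, so an index-based loop is needed)
for (int c = 0; c < crawlNodeList.getLength(); c++) {
    Node node = crawlNodeList.item(c);
    // getChildNodes() returns a NodeList, not a NamedNodeMap
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        // Skip the whitespace text nodes between the elements
        if (child.getNodeType() != Node.ELEMENT_NODE) {
            continue;
        }
        String tag = child.getNodeName();      // e.g. NAME, PRICE, DESC, etc.
        String value = child.getTextContent(); // the text inside that tag
        // Do something with these values
    }
}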
Hope this helps!

JAXP XPath 1.0 or 2.0 - how to distinguish empty strings from non-existent values

Given the following XML instance:
<entities>
<person><name>Jack</name></person>
<person><name></name></person>
<person></person>
</entities>
I am using the following code to: (a) iterate over the persons and (b) obtain the name of each person:
XPathExpression expr = xpath.compile("/entities/person");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0 ; i < nodes.getLength() ; i++) {
Node node = nodes.item(i);
String innerXPath = "name/text()";
String name = xpath.compile(innerXPath).evaluate(node);
System.out.printf("%2d -> name is %s.\n", i, name);
}
The code above is unable to distinguish between the 2nd person case (empty string for name) and the 3rd person case (no name element at all) and simply prints:
0 -> name is Jack.
1 -> name is .
2 -> name is .
Is there a way to distinguish between these two cases using a different innerXPath expression? In this SO question it seems that the XPath way would be to return an empty list, but I've tried that too:
String innerXPath = "if (name) then name/text() else ()";
... and the output is still the same.
So, is there a way to distinguish between these two cases with a different innerXPath expression? I have Saxon HE on my classpath so I can use XPath 2.0 features as well.
Update
So the best I could do based on the accepted answer is the following:
XPathExpression expr = xpath.compile("/entities/person");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0 ; i < nodes.getLength() ; i++) {
Node node = nodes.item(i);
String innerXPath = "name";
NodeList names = (NodeList) xpath.compile(innerXPath).evaluate(node, XPathConstants.NODESET);
String nameValue = null;
if (names.getLength()>1) throw new RuntimeException("impossible");
if (names.getLength()==1)
nameValue = names.item(0).getFirstChild()==null?"":names.item(0).getFirstChild().getNodeValue();
System.out.printf("%2d -> name is [%s]\n", i, nameValue);
}
The above code prints:
0 -> name is [Jack]
1 -> name is []
2 -> name is [null]
In my view this is not very satisfactory as logic is spread in both XPath and Java code and limits the usefulness of XPath as a host language and API-agnostic notation. My particular use case was to just keep a collection of XPaths in a property file and evaluate them at runtime in order to obtain the information I need without any ad-hoc extra handling. Apparently that's not possible.
The JAXP API, being based on XPath 1.0, is pretty limited here. My instinct would be to return the name element (as a NodeList). So the XPath expression required is simply "name". Then cases 1 and 2 will return a nodelist of length 1, while case 3 will return a nodelist of length 0. Cases 1 and 2 can then easily be distinguished within the application by getting the value of the node and testing whether it is zero-length.
Using /text() is always best avoided anyway, since it causes your query to be sensitive to the presence of comments in the XML.
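A small sketch of the comment sensitivity, assuming the default JAXP parser settings (comments are kept in the DOM). The class name TextNodeDemo and the sample document are made up for illustration. With a comment in the middle of the name, the text() query typically returns only the first text node, while selecting the element gives its full string value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class TextNodeDemo {
    /** Returns {result of name/text(), result of name} for the given document. */
    static String[] evaluateBoth(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // String conversion of a node-set uses its first node only
            String viaText = xpath.evaluate("/person/name/text()", doc);
            // The string value of an element concatenates all its text descendants
            String viaElement = xpath.evaluate("/person/name", doc);
            return new String[] {viaText, viaElement};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling evaluateBoth("<person><name>Ja<!-- note -->ck</name></person>") shows the difference between the two expressions.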
As a long-time user of Saxon XSLT, I'm pleased to find once again that I like Michael Kay's recommendation here. Generally, I like the pattern of returning a collection for queries, even for queries that are expected to return only at most one instance.
What I don't like doing is having to open a bundled interface to try to solve a particular need and then finding that one has to reimplement much of what the original interface handled.
Therefore, here's a method that uses Michael's recommendation while avoiding the cost of having to reimplement a Node-to-String transformation that is recommended in other comments in this thread.
@Nonnull
public Optional<String> findString( @Nonnull final String expression )
{
try
{
// for XpathConstants.STRING XPath returns an empty string for both values of no length
// and for elements that are not present.
// therefore, ask for a NODESET and then retrieve the first Node if any
final FluentIterable<Node> matches =
IterableNodeList.from( (NodeList) xpath.evaluate( expression, node, XPathConstants.NODESET ) );
if ( matches.isEmpty() )
{
return Optional.absent();
}
final Node firstNode = matches.first().get();
// now let XPath process a known-to-exist Node to retrieve its String value
return Optional.fromNullable( (String) xpath.evaluate( ".", firstNode, XPathConstants.STRING ) );
}
catch ( XPathExpressionException xee )
{
return Optional.absent();
}
}
Here, XPath.evaluate is called a second time to do whatever it usually does to transform the first found Node into the requested String value. Without this, there is a risk that a re-implementation will yield a different result than a direct call for XPathConstants.STRING over the same source node and for the same expression.
Of course, this code is using Guava Optional and FluentIterable to make the intention more explicit. If you don't want Guava, use Java 8 or refactor the implementation using nulls and NodeList's own collection methods.
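For reference, here is roughly what the same idea could look like with java.util.Optional and a plain NodeList instead of Guava. It is a sketch, not a drop-in replacement: the class name XPathStrings, the explicit context-node parameter and the parse helper are assumptions added so the example is self-contained.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathStrings {
    /** Empty Optional if nothing matches; otherwise the string value of the first match (possibly ""). */
    static Optional<String> findString(Node context, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Ask for a NODESET so a missing element is visible as an empty list
            NodeList matches = (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
            if (matches.getLength() == 0) {
                return Optional.empty();
            }
            // Let XPath itself convert the known-to-exist node to a String
            return Optional.of(xpath.evaluate(".", matches.item(0)));
        } catch (XPathExpressionException e) {
            return Optional.empty();
        }
    }

    /** Helper for the example only: parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For the three persons in the question, findString(person, "name") gives a present "Jack", a present empty string, and an empty Optional respectively.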

Retrieving Reviews from Amazon using JSoup

I'm using JSoup to retrieve reviews from a particular webpage on Amazon and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of html which has the reviews but I want only the text without all the tags div etc. I want to then write all this information into a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll want to first use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
List<Element> revList = new ArrayList<Element>();
Elements eles = reviews.select("div[style=margin-left:0.5em;]");
for (Element element : eles) {
revList.add(element);
}
return revList;
}
If you analyze each element, you should see how Amazon further subdivides the information it holds, including the title of the review, the date of the review and the body of the text.
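As for writing the result to a file, once each review element has been turned into a string (for example with element.text()), the standard library is enough. A minimal sketch, assuming one review per line and UTF-8 output; the class name ReviewWriter and the method name writeReviews are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class ReviewWriter {
    /** Writes one review per line as UTF-8, creating the file or replacing its contents. */
    static void writeReviews(List<String> reviewTexts, Path target) {
        try {
            Files.write(target, reviewTexts, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as ReviewWriter.writeReviews(texts, Paths.get("reviews.txt")) would then store the collected texts, where texts is the list built from the review elements.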

How to retrieve specific element (or nodelist) within an existing element from DOM in Java

I am trying to parse an XML file which contains multiple records of name A. Each A has multiple group records with name B . The various records within B have names x, y and z.
My questions are:
How do I navigate to B and
how do I obtain all values of x in loop.
The DOM is set to the document (i.e. elements of name "A")
I am using a DOM parser in Java.
Sample record:
<A>
<B><x>123</x><y>asdf</y><z>A345</z></B>
<B><x>987</x><y>ytre</y><z>Z959</z></B>
</A>
Document yourDom = ....;
XPathFactory xpf = XPathFactory.newInstance();
XPath xp = xpf.newXPath();
XPathExpression xpe = xp.compile("//A/B/*");
NodeList nodes = (NodeList) xpe.evaluate(yourDom, XPathConstants.NODESET);
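To loop over only the x values, the expression can be narrowed to "//A/B/x" and the resulting NodeList walked by index. A self-contained sketch, assuming the XML is available as a string; the class name XValues is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XValues {
    /** Returns the text of every x element under A/B, in document order. */
    static List<String> xValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate("//A/B/x", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            // NodeList is not Iterable, so iterate by index
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```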
Apart from using the standard DOM API directly, which is usually a bit verbose for these tasks, you could also use jOOX as a jQuery-like wrapper for DOM. Here's an example of how to use it:
// Loop over all x element values within B using css-style selectors
for (String x : $(document).find("B x").texts()) {
// ...
}
// Loop over all x element values within B using XPath
for (String x : $(document).xpath("//B/x").texts()) {
// ...
}
// Loop over all x element values within B using the jOOX API
for (String x : $(document).find("B").children("x").texts()) {
// ...
}

Jsoup: Optimal way of checking whether a <div> has an ID

I am able to iterate through all div elements in a document, using getElementsByTag("div").
Now I want to build a list of only div elements that have the attribute "id" (i.e. div elements with attribute "class" shouldn't be in the list).
Intuitively, I was thinking of checking something like this:
if (divElement.attr("id") != "")
add_to_list(divElement);
Is my approach correct at all?
Is there a more optimal way of testing for having the "id" attribute? (the above uses string comparison for every element in the DOM document)
You can do it like this:
Elements divsWithId = doc.select("div[id]");
for(Element element : divsWithId){
// do something
}
Reference:
JSoup > Selector Syntax
Try this (note that it is browser JavaScript using the DOM API, not Java with Jsoup):
var all_divs = document.getElementsByTagName("div");
var divs_with_id = [];
for (var i = 0; i < all_divs.length; i++)
if (all_divs[i].hasAttribute("id"))
divs_with_id.push(all_divs[i]);
