Deal with whitespaces in CSS class names with Jsoup

I want to select some supermarket product info from this page:
http://www.angeloni.com.br/super/index?grupo=15022
To do that, I need to select the <ul> tags with class "lstProd ".
If the class name were "lstProd" it would be easy, but the problem is the whitespace at the end of the name. I couldn't make Jsoup deal with it.
I tried the code below and other approaches, but I always get an empty list.
org.jsoup.nodes.Document document = Jsoup.connect("http://www.angeloni.com.br/super/index?grupo=15022").get();
org.jsoup.select.Elements list = document.select("ul.lstProd ");
The snippet from the HTML page that I want to get:
<ul class="lstProd ">
<li>
<span class="cod">CÓD. 1341372</span>
<span class="lnkImgProd">
<a href="/super/produto?grupo=15022&idProduto=1341372">
<img src="http://assets.angeloni.com.br/files/images/7/1B/C6/1341372_1_V.jpg" width="120" height="120"
alt="Creme Dental SORRISO Super Refrescante Tubo 90g">
</a>
</span>
<div class="RgtDetProd">
<div class="boxInfoProd">
<span class="descr">
<a href="/super/produto?grupo=15022&idProduto=1341372">Creme Dental SORRISO Super Refrescante
Tubo 90g</a>
</span>
<ul class="lstProdFlags after">
</ul>
</div>
...

I think you are facing two completely separate problems:
Jsoup does not load the site you think it loads. The website you specified renders its contents via JavaScript and loads some content through AJAX after the initial page load. Jsoup can't deal with this. You either need to investigate the AJAX calls and fetch them directly with Jsoup, or use something like Selenium WebDriver to load the page in a real browser, which will render everything as you expect.
CSS class names can't contain spaces for practical purposes [1]. In HTML, spaces are used as separators between class names. Hence <ul class="lstProd "> is the same as <ul class="lstProd">. In CSS selectors, however, a class name is specified by .className, i.e. a dot followed by the class name. You can concatenate several classes like this: element.select(".className1.className2")
[1] Technically you can put spaces in CSS classes, but you need to escape them with '\ '. See https://mathiasbynens.be/notes/css-escapes or the question "Which characters are valid in CSS class names/selectors?"
edit: be more precise about CSS class names
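The point about spaces acting as separators can be illustrated with a small plain-Java sketch. This is not Jsoup's implementation, only an assumed approximation of how a class attribute value breaks down into class names; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ClassTokens {
    // Splits a raw class attribute value into its individual class names,
    // treating any run of whitespace as a separator.
    static List<String> tokens(String classAttribute) {
        return Arrays.stream(classAttribute.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The trailing space does not become part of the class name.
        System.out.println(tokens("lstProd "));           // [lstProd]
        // Two class names, as on the nested <ul> in the question.
        System.out.println(tokens("lstProdFlags after")); // [lstProdFlags, after]
    }
}
```

Under that reading, the selector to try with Jsoup would be "ul.lstProd", without the trailing space.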

CSS class names CAN contain whitespaces.
And <ul class="lstProd "> is NOT same as <ul class="lstProd">.
And I can see that you have multiple <ul> with same class name.
A better way to inspect or traverse such elements is with nth-child.
So to find the element you need, you can use the selector #abaProd > ul:nth-child(4)
For more details, see the documentation on nth-child.

Related

Is there a way to parse an entire HTML tag in JSoup?

Hi I'm wondering if there's a way to parse an entire HTML tag using JSoup? In my example pictures below, the five elements (4 images and 1 string) are all inside the "li" container. However, when you open the "li" tag, there are multiple nested containers. Is there a way to parse it so that I have access to all 5 elements contained in this "li" tag? I'm thinking of using getElementsMatchingOwnText("Collins") but that seems to only get me "span class="text text_14 mix-text_color7">Panorama". Any help would be appreciated, thanks!

Yes, you can iterate over the children of your <li> tag using jsoup.
Here is a simplified version of the HTML in your screenshot, showing the 5 elements:
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
Assuming you have selected this specific <li> tag in your document, you can use the following approach:
String html = "<li><span class=\"foo\"><img src=\"bar\" class=\"img\"></span><span class=\"bar\">Collins</span><i class=\"baz1\"><img src=\"baz1\" class=\"img\"></i><i class=\"baz2\"><img src=\"baz2\" class=\"img\"></i><i class=\"baz3\"><img src=\"baz3\" class=\"img\"></i></li>";
Document document = Jsoup.parse(html);
Element element = document.selectFirst("li");
element.children().forEach(child -> {
// do your processing here - this is just an example:
if (child.hasText()) {
System.out.println(child.text());
} else {
System.out.println(child.html());
}
});
The above code prints the following output:
<img src="bar" class="img">
Collins
<img src="baz1" class="img">
<img src="baz2" class="img">
<img src="baz3" class="img">
UPDATE
If the starting point is a URL, then you would need to start with this:
Document document = Jsoup.connect("https://www...").get();
Then the exercise is about identifying a unique way to find your specific element. So, if we update my earlier example, let's assume your web page is like this:
<html>
<head>...</head>
<body>
<div>
<ul class="vList_4">
<li>
<span class="foo"><img src="bar" class="img"></span>
<span class="bar">Collins</span>
<i class="baz1"><img src="baz1" class="img"></i>
<i class="baz2"><img src="baz2" class="img"></i>
<i class="baz3"><img src="baz3" class="img"></i>
</li>
</ul>
</div>
</body>
</html>
Here we have a class in a <ul> tag called vList_4. If that is a unique class name, we can use it to jump to that section of the HTML page (IDs are better than class names because they are guaranteed to be unique - but I did not see any ID names in your screenshot).
Now, instead of my previous selector:
Element element = document.selectFirst("li");
We can use this more specific selector:
Element element = document.selectFirst("ul.vList_4 li");
The same results will be printed as before.
So, it's all about you looking at the page structure and figuring out how to jump to the relevant section of the page.
See here for technical details describing how selectors are constructed.

How to display a Java String with an HTML tag appended, with the HTML behavior, in an AngularJS front end

I have a string in Java, and I need to append an HTML tag to it dynamically so that, when it is displayed in the front end, the tag takes effect.
E.g.:
String content="Hello World,this is a test <em>content</em> to demonstrate the requirement";
In the string above, the word "content" is wrapped in an <em> tag. But when I try to display it in the AngularJS front end, the tag is not applied and the string is displayed literally as "Hello World,this is a test <em>content</em> to demonstrate the requirement".

Use angular-sanitize.js for this, for example:
<div ng-controller="testCtrl">
<div ng-bind-html="stringTest"></div>
</div>

You can use ng-bind-html:
<div ng-controller="testCtrl">
<div ng-bind-html="stringTest"></div>
</div>
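However, if you find this directive too restrictive and you absolutely trust the source of the content you are binding to, you can also use ng-bind-html-unsafe (available only before AngularJS 1.2, where it was removed in favor of marking the value as trusted with $sce.trustAsHtml and binding it with ng-bind-html):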
<div ng-controller="testCtrl">
<div ng-bind-html-unsafe="stringTest"></div>
</div>

element.getText() method is not working in java selenium

<span class="label label-danger" style="font-size : 13px; font-weight : 400;">Critical</span>
Below is the XPath I am using:
.//tr[@data-index='0']/td/span
I have a line in the HTML source like the one above, so I used the corresponding XPath and the getText() method to get the text, i.e. Critical. That worked.
But I have another line on another page, like this:
<div class="col-xs-12">
<div id="project-update-success-information" class="panel-confirmation success" style="display: none;">
<span class="fa fa-check"/>
Project Updated
</div>
Below is the XPath I am using:
.//*[@id='project-update-success-information']/span
I used the corresponding XPath and getText(), but unfortunately it doesn't retrieve the text. I suspected that the missing </span> closing tag in the second snippet causes the problem. Is there any other way to get the text?

This question has many answers already, but none of them really explains the problem. First, let us get your initial confusion about self-closing elements out of the way, before moving on to the real problem: No, it is not a problem that an element like
<span class="fa fa-check"/>
does not have a </span> tag. There is no need to indicate where it ends because the /> already tells you that this element does not contain anything and closes at this point.
Then let's look at only the fragment of the document that you show:
<div class="col-xs-12">
<div id="project-update-success-information" class="panel-confirmation success" style="display: none;">
<span class="fa fa-check"/>
Project Updated
</div>
</div>
An XPath expression like (note that most likely you do not need the . at the very beginning of the expression):
//*[@id='project-update-success-information']
will return the inner div element with all that it contains. What it does contain is, exactly in this order:
a whitespace-only text node
a self-closing span element with no content other than an attribute
the text node that contains "Project Updated"
So, it is not at all surprising that when you select the inner div and use .getText(), you end up with 2 text nodes in the result. Another way to get at the text content of an element is by using text() in the XPath expression:
//*[@id='project-update-success-information']/text()
which will return (individual elements separated by --------):
[whitespace-only text node]
-----------------------
Project Updated
The solutions are either
use getText() to retrieve all text nodes and later exclude those that only contain whitespace or
use an XPath expression that targets text nodes directly and excludes the ones that only contain whitespace. The standard way of doing this is with [normalize-space()]:
//*[@id='project-update-success-information']/text()[normalize-space()]
Note that, in general, there is no guarantee that the text content of an element will be in one single text node. It is very likely that you will sometimes encounter HTML or XML where elements have several text nodes, all of them containing non-whitespace characters, e.g.:
<div>
Project
<span/>
Updated
</div>
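The effect of the [normalize-space()] predicate can be checked outside the browser. Selenium's findElement expects an expression that selects elements rather than text nodes, so the sketch below uses the JDK's own javax.xml.xpath instead and treats the fragment as well-formed XML (an assumption; real pages are often not). The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NonBlankTextNodes {
    // The fragment from the question, written as well-formed XML.
    static final String FRAGMENT =
            "<div class=\"col-xs-12\">\n"
          + "<div id=\"project-update-success-information\" class=\"panel-confirmation success\" style=\"display: none;\">\n"
          + "<span class=\"fa fa-check\"/>\n"
          + "Project Updated\n"
          + "</div>\n"
          + "</div>";

    // Evaluates the expression and returns the trimmed value of every selected node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String base = "//*[@id='project-update-success-information']/text()";
        // Without the predicate, the whitespace-only node is selected too.
        System.out.println(select(FRAGMENT, base).size());
        // With the predicate, only the non-blank text node remains.
        System.out.println(select(FRAGMENT, base + "[normalize-space()]"));
    }
}
```

For an element whose text is split across several nodes, as in the last fragment above, the same expression returns one entry per non-blank text node, so the caller still has to decide how to join them.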

Try this text() method like below:-
//span[@class='fa fa-check']/text()
Hope it will help you :)

The element is empty and thus contains no text
<span class="fa fa-check"/>
If on the other hand it was like
<span class="fa fa-check">Some content</span>
then it would, as in your first attempt, contain some text.
Without knowing more of the content I would try another xpath method: following-sibling.

Try:
driver.findElement(By.cssSelector(".panel-confirmation.success")).getText();

differentiate two html elements with same class

I have this html code below and I want to differentiate between these two PagePostsSectionPagelet as I only want to find web elements from the first PagePostsSectionPagelet. Is there any way I can do it without using <div id="PagePostsSectionPagelet-183102686112-0" as the value will not always be the same?
<div id="PagePostsSectionPagelet-183102686112-0" data-referrer="PagePostsSectionPagelet-183102686112-0">
<div class="_1k4h _5ay5">
<div class="_5sem">
</div>
</div>
<div id="PagePostsSectionPagelet-183102686112-1" class="" data-referrer="PagePostsSectionPagelet-183102686112-1" style="">
<div class="_1k4h _5ay5">
<div class="_5dro _5drq">
<div class="clearfix">
<span class="_5em9 lfloat _ohe _50f4 _50f7">Earlier in 2015</span>
<div id="u_jsonp_3_4e" class="_6a uiPopover rfloat _ohf">
</div>
</div>
<div id="u_jsonp_3_4j" class="_5sem">
<div id="u_jsonp_3_4g" class="_5t6j">
<div class="_1k4h _5ay5">
<div class="_5sem">
</div>
</div>
Tried using //div[@class='_1k4h _5ay5']//div[@class ='_5sem'] but it will return both.
Using //div[@class='_5dro _5drq']//span[contains(@class,'_5em9 lfloat _ohe _50f4 _50f7') and contains(text(), '')] will help me find the second PagePostsSectionPagelet instead.

You need to use the following XPath:
//div[contains(@class,'_1k4h') and contains(@class,'_5ay5')]
because Selenium does not handle several class names in a single attribute well.
I mean that By.className("_1k4h _5ay5") will find nothing in any case, and By.xpath("//div[@class='_1k4h _5ay5']") will also find nothing if the class attribute is "_5ay5 _1k4h" or " _5ay5 _1k4h" (the class names are probably generated automatically, so their order may change when the page reloads).
But for the best performance and correctness, I think the following XPath is preferable:
".//div[contains(@id, 'PagePostsSectionPagelet')][1]" -- for first div
".//div[contains(@id, 'PagePostsSectionPagelet')][2]" -- for second div
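The difference between an exact class comparison and the contains() form can be tried with the JDK's javax.xml.xpath, with no browser involved. This is a minimal sketch on simplified, hypothetical markup: the two pagelet divs are assumed to be siblings, and the second inner div has its class names reversed on purpose. The parenthesized (//div[...])[1] form is an addition here, not part of the original suggestion; it takes the first match in the whole document rather than the first match under each parent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassMatchDemo {
    // Simplified stand-in for the page: same ids, class names in different order.
    static final String XML =
            "<root>"
          + "<div id='PagePostsSectionPagelet-183102686112-0'><div class='_1k4h _5ay5'/></div>"
          + "<div id='PagePostsSectionPagelet-183102686112-1'><div class='_5ay5 _1k4h'/></div>"
          + "</root>";

    static final String EXACT = "//div[@class='_1k4h _5ay5']";
    static final String TOLERANT = "//div[contains(@class,'_1k4h') and contains(@class,'_5ay5')]";
    static final String FIRST_PAGELET_ONLY =
            "(//div[contains(@id,'PagePostsSectionPagelet')])[1]" + TOLERANT;

    // Returns how many nodes the expression selects in the given XML.
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(XML, EXACT));              // misses the reordered class value
        System.out.println(count(XML, TOLERANT));           // matches regardless of order
        System.out.println(count(XML, FIRST_PAGELET_ONLY)); // restricted to the first pagelet
    }
}
```

Note that contains() is a substring test, so it would also match a longer class name that merely contains '_1k4h'.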

I see that the only dynamic part of the div id is the number, so you can use something like:
WebElement element = driver.findElements(By.xpath("//div[contains(@id,'PagePostsSectionPagelet')]")).get(0);
This will take only the first web element.

Try using a css selector as below and refine further if required.
The code below returns a List of matching WebElements and then you grab the first one in the List.
List<WebElement> listOfElements = driver.findElements(By.cssSelector("div[data-referrer]"));
WebElement myElement = listOfElements.get(0);
Hint: use the Chrome console to test your css and xpath selectors directly. e.g. use
$$("div[data-referrer]") in the console to reveal what will get selected.

Selenium CSS selector syntax for checking class and text both

Question is for JAVA + Selenium:
My HTML is:
<section class="d-menu d-outclass-bootstrap unclickable d-apps d-app-list">
<section class="standard-component image-sequence-button" tabindex="0" role="link">
<div class="image-region">
<div class="core-component image">...
</div>
<div class="sequence-region">
<div class="core-component section">
<div>
<section class="standard-component text hide-section-separator-line">
<div class="text-region">
<div class="core-component text">
<span class="main-text">BART Times</span>
<span class="sub-text">Provider</span>
</div>
</div>
</section>
<section class="standard-component speech-bubble hide-section-separator-line">...
<section class="standard-component text">...
</div>
</div>
</div>
<div class="button-region">
<div class="core-component button" tabindex="0" role="link">...
</div>
</section>
<section class="standard-component image-sequence-button" tabindex="0" role="link">...
<section class="standard-component image-sequence-button" tabindex="0" role="link">...
<section class="standard-component image-sequence-button" tabindex="0" role="link">...</section>
EDIT:
All <section class="standard-component image-sequence-button"... have exact same structure and hierarchy (same attributes for all tags). The only thing that changes are the TEXT values of the tags(e.g. span)
PART1:
I'm looking for various elements inside the second section tag. So, What I'm trying to do is get the <span class="main-text"> which has a value BART Times because of the business requirement.
I already know how to get it via xpath:
My xpath (verified via firebug):
"//section//div[@class = 'sequence-region']//section[@class = 'standard-component text hide-section-separator-line']//span[@class = 'main-text' and text() = '%s']"
I can get the span tag via checking for %s values (e.g. BART Times).
However, due to design considerations, we've been told to use CSS only. So, I tried to come up with a CSS counterpart for the above xpath but did not find it.
The following CSS
"section div.sequence-region section.standard-component.text.hide-section-separator-line span[class=main-text]"
returns all the span tags under all the section tags.
Question1: How do I get the span tag which has a certain TEXT value (the %s part of xpath)?
Things I've tried for that last span tag which did not work (according to Firebug):
span.main-text[text='BART Times']
span[class=main-text][text='BART Times']
span.main-text:contains('BART Times')
span[class=main-text]:contains('BART Times')
span.main-text[text="BART Times"]
span[class=main-text][text="BART Times"]
span.main-text[text=\"BART Times\"]
span[class=main-text][text=\"BART Times\"]
span[text="BART Times"]
span[text=\"BART Times\"]
span:contains('BART Times')
span:contains("BART Times")
span:contains(\"BART Times\")
So, basically I want to put a check on BOTH class and TEXT value of the span tag in CSS selector.
Part 2:
Then I want to get the <section class="standard-component image-sequence-button"... element where I found the <span class="main-text"> and then find other elements inside that specific section tag
Question 2:
Assuming I found the span tag in question 1 via CSS, how do I get the section tag (which is a distant ancestor of the span tag)?
If CSS is not possible, please provide an xpath counterpart for this as a workaround for a while.

CSS selectors can't select based on text. The answers to Is there a CSS selector for elements containing certain text? go into detail on why.
To select based on class and text in XPath: //span[contains(@class, 'main-text') and text() = 'BART Times']

Regarding question 1, it is not possible, as stated in the other answer here. This is another thread about the topic : CSS selector based on element text?
Regarding question 2, once again there is no such parent selector in CSS: Is there a CSS parent selector?. Now for the XPath counterpart, you can use the parent axis (parent::*), its shortcut notation (..), or put the span selector as a predicate on the parent (the third example below):
....//span[@class = 'main-text' and text() = '%s']/parent::*
....//span[@class = 'main-text' and text() = '%s']/..
....//*[span[@class = 'main-text' and text() = '%s']]
See the following thread for a better (yet more complicated) alternative for matching an element by CSS class using XPath, in case you haven't come across it yet: How can I find an element by CSS class with XPath?
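To climb from the matched span to the enclosing section rather than only to its direct parent, the ancestor axis is one option. Below is a minimal sketch using the JDK's javax.xml.xpath on simplified markup; the id attributes and the "Other App" text are hypothetical, added only to tell the sections apart, and the class and method names are made up for illustration. The same expression strings should be usable with Selenium's By.xpath, though that is not exercised here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AncestorDemo {
    // Simplified stand-in for the menu: two sections with the same structure.
    static final String XML =
            "<section class='d-menu'>"
          + "<section id='first' class='standard-component image-sequence-button'>"
          + "<div class='sequence-region'><span class='main-text'>BART Times</span></div>"
          + "</section>"
          + "<section id='second' class='standard-component image-sequence-button'>"
          + "<div class='sequence-region'><span class='main-text'>Other App</span></div>"
          + "</section>"
          + "</section>";

    // Step appended to the span expression: nearest enclosing button section.
    static final String OWNING_SECTION =
            "/ancestor::section[contains(@class,'image-sequence-button')][1]";

    // Builds the span expression from the question, filling in the %s placeholder.
    static String spanByText(String text) {
        return String.format("//span[@class='main-text' and text()='%s']", text);
    }

    // Returns the first element selected by the expression, or null if none matches.
    static Element find(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String span = spanByText("BART Times");
        // The direct parent is the wrapping div, not the section.
        System.out.println(find(XML, span + "/..").getTagName());
        // The ancestor axis reaches the section that owns the span.
        System.out.println(find(XML, span + OWNING_SECTION).getAttribute("id"));
    }
}
```

Once the section element is found, further lookups can be made relative to it, which covers the second part of the question.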
