I have a web page that I have saved in an HtmlPage object. I applied an XPath expression to it and stored the result in a list.
List<?> items = null;
items = page.getByXPath("//div[contains(@class,'search-result-cards')]/div[contains(@class,'listContainer')]");
What I observed is that when I iterate through these items as HtmlElement objects and print them, I get just the opening tag of the div with the class listContainer, but not its child nodes. However, when I call he.asXml(), I get the complete information, including the subnodes.
for(HtmlElement he : (List<HtmlElement>) items)
{
br.write("Printing just the element ::: "+he);
br.write(he.asXml());
}
Here, br is a BufferedWriter object that writes the output to a file.
The issue is that I want all the information returned by he.asXml() to be available from the HtmlElement object itself. Is that possible? I tried casting a String directly to an HtmlElement object, which didn't work. Can anyone please help?
Output
Printing just the element ::: HtmlDivision[<div class="listContainer" data-ptitle="3139847000" data-reactid="402">]
he.asXml() Output
<div class="listContainer" data-ptitle="3139847000" data-reactid="402">
<div class="imageContainer" data-reactid="403">
<div class="prodInfoContainer" data-reactid="406">
.
.
.
The dots indicate that the nodes keep going; the full output is very large.
Let me know if any other information is needed that I may have not mentioned.
.toString() prints only the current DomElement, not the children.
You need to get the children, either by using XPath, something like:
List<HtmlElement> items = page.getByXPath("//div[contains(@class,'listContainer')]");
for (HtmlElement item : items) {
List<HtmlElement> children = item.getByXPath(".//div");
for (HtmlElement child : children) {
System.out.println(child);
}
}
Or
for (HtmlElement child : item.getHtmlElementDescendants()) {
System.out.println(child);
}
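The relative XPath idea in this answer (first select the container, then query ".//div" from it) can be tried without HtmlUnit. The following is a minimal sketch that uses the JDK's built-in DOM and XPath classes instead of HtmlUnit, so it runs without extra dependencies. The class name DescendantDemo and the SAMPLE markup are made up for illustration and only mimic the structure shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DescendantDemo {
    // Hypothetical sample markup, shaped like the listContainer div in the question.
    static final String SAMPLE =
          "<div class='search-result-cards'>"
        + "<div class='listContainer' data-ptitle='3139847000'>"
        + "<div class='imageContainer'/>"
        + "<div class='prodInfoContainer'><div class='price'/></div>"
        + "</div></div>";

    /** Returns the class attribute of every div nested below each listContainer div. */
    static List<String> descendantClasses(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList containers = (NodeList) xpath.evaluate(
                "//div[contains(@class,'listContainer')]", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < containers.getLength(); i++) {
            // The leading "." makes the query relative to the current container.
            NodeList nested = (NodeList) xpath.evaluate(
                    ".//div", containers.item(i), XPathConstants.NODESET);
            for (int j = 0; j < nested.getLength(); j++) {
                result.add(((Element) nested.item(j)).getAttribute("class"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(descendantClasses(SAMPLE));
    }
}
```

Note that ".//div" returns descendants at every depth, so the nested price div is included as well; "./div" would restrict the result to direct children.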
Related
I am trying to read all the heading tag on a page and need to click only one heading tag named "dropdown". The sample structure of HTML is as follows
<div> <ul> <li>
<a href="submit_button_clicked.php">
<h2>Submit Button Clicked</h2>
<figure>
</a>
</li>
<li>
<a href="dropdown.php">
<h2>Dropdown</h2>
<figure>
What I did was create a custom XPath and store the result in a List, then iterate through the list with a for loop, but I am unable to read the text of each tag and write it to the console.
List l = ff.findElements(By.xpath("//div/ul/li/a/h2"));
To retrieve the text value of an element use:
element.getText();
In your case with your list it would look something like this:
for(WebElement element : l) {
System.out.println(element.getText());
}
Since you want to click on an element, it would be better to use an xpath such as the following:
ff.findElement(By.xpath("//h2[text()='Dropdown']")).click();
This XPath selector looks for an h2 element with the exact text 'Dropdown', and the call then clicks on it. Note that it uses findElement (singular), because findElements returns a List, which has no click() method.
Reading all <h2> tags can look something like:
List<WebElement> elements = ff.findElements(By.xpath("//h2"));
for(WebElement element : elements) {
System.out.println(element.getText()); // just to show that it prints text
}
Note that I defined list as List<WebElement> which is to avoid usage of raw types, and changed xpath to match any <h2>.
But when you need to click, usually you are required to click on parent <a> element, not on <h2> itself, i.e. the following should click on a correct link
ff.findElement(By.xpath("//a[@href='dropdown.php']")).click();
But if you want to find a link from header, in the above loop:
List<WebElement> elements = ff.findElements(By.xpath("//h2"));
for(WebElement element : elements) {
if ("Dropdown".equals(element.getText())) {
// get the parent <a> element and click on it
element.findElement(By.xpath("..")).click();
break; // the click may navigate away, which would leave the remaining elements stale
}
}
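The two XPath ideas from these answers (list every h2, and step from a heading up to its parent link with "..") can be checked outside a browser. This is a minimal sketch that uses the JDK's XPath classes rather than Selenium. The class name HeadingLinkDemo is hypothetical, and SAMPLE is a well-formed version of the markup from the question with the figure elements closed. The heading text is concatenated into the expression, so the sketch assumes it contains no single quote.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HeadingLinkDemo {
    // Well-formed variant of the sample markup from the question (assumption).
    static final String SAMPLE =
          "<div><ul>"
        + "<li><a href='submit_button_clicked.php'><h2>Submit Button Clicked</h2><figure/></a></li>"
        + "<li><a href='dropdown.php'><h2>Dropdown</h2><figure/></a></li>"
        + "</ul></div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the text of every h2 element in document order. */
    static List<String> headings(String xml) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//h2", parse(xml), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    /** Returns the href of the element enclosing the h2 with the given text, or null. */
    static String hrefForHeading(String xml, String headingText) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "/.." steps from the matching h2 to its parent, the <a> element.
        Element link = (Element) xpath.evaluate(
                "//h2[text()='" + headingText + "']/..", parse(xml), XPathConstants.NODE);
        return link == null ? null : link.getAttribute("href");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(headings(SAMPLE));
        System.out.println(hrefForHeading(SAMPLE, "Dropdown"));
    }
}
```

In Selenium the same expressions would be passed to By.xpath, with click() called on the element that is found.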
You can do it like this:
WebDriver driver = new FirefoxDriver();
driver.get("http://www.seleniumhq.org");
driver.manage().timeouts().implicitlyWait(15, TimeUnit.SECONDS);
// collect all h2 tags into a list
List<WebElement> myH2Tags = driver.findElements(By.tagName("h2")); // you can put any tag name as per your requirement
for(int i=0;i<myH2Tags.size();i++){
System.out.println("Value of My H2 Tags are : " + myH2Tags.get(i).getText());
if(myH2Tags.get(i).getText().equals("Selenium News")){ // you can replace this with drop down value
myH2Tags.get(i).click();
}
// re-locate the elements after each pass to avoid a StaleElementReferenceException
myH2Tags = driver.findElements(By.tagName("h2"));
}
I have to fetch two labels, 'TEXT 1' and 'TEXT 2', which belong to the same class 'xyz' and are located in two separate divs. The structure is shown below.
<div class='xyz'>TEXT 1</div>
<div class='xyz'>TEXT 2</div>
Can anyone please help me to solve this ?
You find elements by className and then use getText() to get the text:
List<WebElement> elements = driver.findElements(By.className("xyz"));
for(WebElement element:elements) {
System.out.println(element.getText());
}
Use the findElements method and then access the div you need by index, e.g.:
List<WebElement> elements = driver.findElements(By.cssSelector(".xyz"));
// get the text of the first element
elements.get(0).getText();
// and of the second
elements.get(1).getText(); // etc.
In jsoup, Element.children() seems to return all children (descendants) of an Element. But I want only the Element's first-level (direct) children.
Which method can I use?
Element.children() returns direct children only. Since you get them bound to a tree, they have children too.
If you need the direct child elements without the underlying tree structure, you need to create detached copies of them as follows:
public static void main(String... args) {
Document document = Jsoup
.parse("<div><ul><li>11</li><li>22</li></ul><p>ppp<span>sp</span></p></div>");
Element div = document.select("div").first();
Elements divChildren = div.children();
Elements detachedDivChildren = new Elements();
for (Element elem : divChildren) {
Element detachedChild = new Element(Tag.valueOf(elem.tagName()),
elem.baseUri(), elem.attributes().clone());
detachedDivChildren.add(detachedChild);
}
System.out.println(divChildren.size());
for (Element elem : divChildren) {
System.out.println(elem.tagName());
}
System.out.println("\ndivChildren content: \n" + divChildren);
System.out.println("\ndetachedDivChildren content: \n"
+ detachedDivChildren);
}
Output
2
ul
p
divChildren content:
<ul>
<li>11</li>
<li>22</li>
</ul>
<p>ppp<span>sp</span></p>
detachedDivChildren content:
<ul></ul>
<p></p>
This should give you the desired list of direct descendants of the parent node:
Elements firstLevelChildElements = doc.select("parent-tag > *");
Alternatively, you can retrieve the parent element, get its first child via child(int index), and then retrieve the siblings of that child via siblingElements().
This gives you the first-level children excluding the child you started from, so you have to add that child back yourself.
Elements firstLevelChildElements = doc.child(0).siblingElements();
You could also use Element.child(index), where the index selects which direct child you want.
Here you can get the value of first-level children
Element addDetails = doc.select("div.container > div.main-content > div.clearfix > div.col_7.post-info > ul.no-bullet").first();
Elements divChildren = addDetails.children();
for (Element elem : divChildren) {
System.out.println(elem.text());
}
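The difference between direct children and all descendants, which the question above turns on, can be verified on the same sample markup. The sketch below uses the JDK's DOM and XPath classes rather than jsoup so that it runs without dependencies; the class name ChildrenDemo is hypothetical. The expression "./*" plays the role of jsoup's "parent-tag > *" selector, while ".//*" matches every nested element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildrenDemo {
    // Same markup as in the jsoup answer above.
    static final String SAMPLE =
        "<div><ul><li>11</li><li>22</li></ul><p>ppp<span>sp</span></p></div>";

    /** Evaluates expr relative to the root div and returns the matched tag names. */
    static List<String> names(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element div = doc.getDocumentElement();
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, div, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeName());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(names(SAMPLE, "./*"));   // direct children: [ul, p]
        System.out.println(names(SAMPLE, ".//*"));  // all descendants: [ul, li, li, p, span]
    }
}
```

The two direct children match the "2, ul, p" output shown in the jsoup answer, which supports the point that children() is not returning descendants.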
I'm using jsoup to retrieve reviews from a particular Amazon web page, and what I have now is this:
Document doc = Jsoup.connect("http://www.amazon.com/Presto-06006-Kitchen-Electric-Multi-Cooker/product-reviews/B002JM202I/ref=sr_1_2_cm_cr_acr_txt?ie=UTF8&showViewpoints=1").get();
String title = doc.title();
Element reviews = doc.getElementById("productReviews");
System.out.println(reviews);
This gives me the block of HTML that contains the reviews, but I want only the text, without the div and other tags. I then want to write all this information to a file. How can I do this? Thanks!
Use the text() method:
System.out.println(reviews.text());
While text() will get you a bunch of text, you'll want to first use jsoup's select(...) methods to subdivide the problem into individual review elements. I'll give you the first big division, but it will be up to you to subdivide it further:
public static List<Element> getReviewList(Element reviews) {
List<Element> revList = new ArrayList<Element>();
Elements eles = reviews.select("div[style=margin-left:0.5em;]");
for (Element element : eles) {
revList.add(element);
}
return revList;
}
If you analyze each element, you should see how amazon further subdivides the information held including the title of the review, the date of the review and the body of the text it holds.
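The question also asks how to write the extracted text to a file, which the answers do not show. Below is a minimal sketch of that step, assuming the text has already been extracted as a String. It uses the JDK's DOM getTextContent() as a stand-in for jsoup's text() so that it runs without dependencies; the class name TextToFileDemo, the SAMPLE markup, and the temp-file name are all made up for illustration. Unlike jsoup's text(), getTextContent() does not normalize whitespace, so the sample keeps a space between its elements.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TextToFileDemo {
    // Hypothetical stand-in for the productReviews block.
    static final String SAMPLE =
        "<div id='productReviews'><div><b>Great cooker</b> <p>Works well.</p></div></div>";

    /** Returns the concatenated text of the root element, without any tags. */
    static String plainText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    /** Writes the text to a new temporary file as UTF-8 and returns its path. */
    static Path writeText(String text) throws Exception {
        Path out = Files.createTempFile("reviews", ".txt");
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) throws Exception {
        String text = plainText(SAMPLE);
        System.out.println(text + " -> " + writeText(text));
    }
}
```

With jsoup, the String passed to writeText would simply be reviews.text(), or the text() of each element returned by getReviewList.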
I have an element like this :
<td> TextA <br/> TextB </td>
How can I extract TextA and TextB separately?
Several ways. That really depends on the document itself and whether the given HTML markup is consistent or not. In this particular example you could get the td's child nodes by Element#childNodes() and then test every node individually if it's a TextNode or not.
E.g.
Element td = getItSomehow();
for (Node child : td.childNodes()) {
if (child instanceof TextNode) {
System.out.println(((TextNode) child).text());
}
}
which results in
TextA
TextB
I think it would be nice if jsoup offered an Element#textNodes() method or something similar to get the child text nodes, just as Element#children() gets the child elements (which would have returned the <br /> element in your example).
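The child-node filtering described in this answer can be sketched with the JDK's DOM API, where a text node is identified by Node.TEXT_NODE instead of jsoup's TextNode class. This is an illustration under that substitution, not the jsoup code itself; the class name TextNodeDemo is hypothetical. More recent jsoup releases do provide an Element#textNodes() method, so it is worth checking the API documentation of the version you use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TextNodeDemo {
    /** Returns the trimmed, non-empty text of each direct text-node child of the root. */
    static List<String> directTexts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip element children such as <br/>; keep only text nodes.
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    texts.add(text);
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(directTexts("<td> TextA <br/> TextB </td>")); // [TextA, TextB]
    }
}
```

Because the br element sits between the two runs of text, the parser produces two separate text nodes, which is what makes it possible to extract TextA and TextB individually.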