I am working on a web scraper using Jsoup and want to pull a link out of a table.
This is what I'm looking at:
<ul class="inline-list indent>
<li>
::marker
Some Other Text
(Date & Time Stamp)
</li>
I want www.linkhere.com and Some Other Text. I have already figured out how to get Some Other Text, but I can't get www.linkhere.com.
This is what I tried:
Document results = Jsoup.connect(url).get();
String tTable = "li:nth-of-type(1)";
for (Element row : results.select("ul.indent.inline-list:nth-of-type(1)")) {
    Element link = results.select("ul.indent.inline-list:nth-of-type(1) > a").first();
    String tName = row.select(tTable).text();
    String articleLink = link.attr("href");
    System.out.println(tName);
    System.out.println(articleLink);
}
This gives me the error:
NullPointerException: Cannot invoke "org.jsoup.nodes.Element.attr(String)" because "link" is null
You're using this selector:
"ul.indent.inline-list:nth-of-type(1) > a"
The first part, ul.indent.inline-list:nth-of-type(1), selects the first <ul> element. The second part, > a, expects the <a> to be a direct child of the <ul>. That will not match what you want, because there's an <li> element between them, so the solution would be to use:
"ul.indent.inline-list:nth-of-type(1) > li > a"
or if your idea was to match the first <li> you have to use:
"ul.indent.inline-list > li:nth-of-type(1) > a"
Related
How can I get the first div after the h1 tag?
The html:
<h1> Shalom </h1>
<b> Tov </b>
<div> ddd </div> <!-- I need to take this div -->
My Java Jsoup code:
Elements apresh = doc.select("h1 ~ div");
String csdsdsdf = apresh.html();
System.out.printf(csdsdsdf);
But it doesn't work. Can you help me?
From what you have mentioned in the comments, I believe you are trying to extract the first element from the matching elements based on your selector "h1 ~ div".
You can use the below method provided by the API.
public Element first(): Get the first matched element.
I've found two ways to do this:
Document doc = Jsoup.parse("<h1> Shalom </h1>" +
        "<b> Tov </b>" +
        "<div> ddd </div>");

// 1. Select the DIV which is after the B which is after the H1.
System.out.println(doc.select("h1 + b + div"));

// 2. A more flexible solution which involves going one level up to the parent
//    and then selecting the first DIV.
System.out.println(doc.select("h1").first().parent().select("div").first());
The following list represents page navigation buttons:
<div class="list">
<ul class="pageNav">
<li class="paginate_button ">
1</li>
<li class="paginate_button ">
2</li>
<li class="paginate_button ">
3</li>
</ul>
</div>
To go to the second page for instance, I am using this Selenium Java code:
//after setting up webdriver
List<WebElement> li = driver.findElements(By.className("pageNav"));
System.out.println(li.get(2).getText());
li.get(2).click();
It's printing the text correctly "2", but not clicking or navigating correctly as if I was manually doing it on the actual website. I also tried replacing the link with an actual link like:
Visit our page
But still no luck. What am I doing wrong?
Thank you in advance!
Try any of the below code snippets.
In the code you tried, I noticed that you were using the class locator to click on the link element. But your <ul> tag does not contain the link. Inside the <ul> tag there are <li> tags, and each <li> tag contains a separate <a> tag.
So here you should go with an xpath or cssSelector locator.
Method 1) By using xpath locator
List<WebElement> links = driver.findElements(By.xpath("//ul[@class='pageNav']/li/a"));
System.out.println(links.size());
links.get(1).click(); // indexing starts from 0, so to click on the second link pass index 1.
Suggestion: instead of using an absolute xpath, use a relative xpath.
Method 2) By using cssSelector locator
List<WebElement> links = driver.findElements(By.cssSelector("ul.pageNav>li>a"));
System.out.println(links.size());
links.get(1).click(); // indexing starts from 0, so to click on the second link pass index 1.
Try the below code:
//getting all the anchor tag elements and storing in a list
List<WebElement> links = driver.findElements(By.xpath("//ul[@class='pageNav']//li[starts-with(@class,'paginate_button')]/a"));
System.out.println(links.size());
// performs a click on the second link
links.get(1).click();
If you're facing any unusual difficulty which you are not able to handle directly, then you can first try to move to that element using the Actions class and then click it, as below:
WebElement we = driver.findElement(By.cssSelector("div.list > ul.pageNav li:nth-child(2)"));
Actions action = new Actions(driver);
action.moveToElement(we).click().build().perform();
I would like to get the text (which is null right now but will have some text in the future, so printing null should be fine for now) from the second "109-top-dark-grey-block ng-binding" class. I tried tabIndex and nth-child; both are not working.
"
<div class="122-top-section-btm-half">
<div class="108-top-grey-m12x3"></div>
<div class="109-top-dark-grey-block ng-binding">ab ab xyz</div>
</div>
"
"
<div class="d122-top-section-btm-half">
<div class="108-top-grey-m12x4"></div>
<div class="109-top-dark-grey-block ng-binding"></div>
"
Update
To get the text of the second div block, nth-child should work. I tested the selector locally in Chrome dev tools.
So in your Java:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half:nth-child(2) .ng-binding")).getText();
That should do the trick. As the CSS spec says, nth-child is 1-indexed, not 0-indexed, so :nth-child(2) targets the 2nd child.
Old Answer
Based on the HTML snippet you provided, you could use a CSS selector. So you could do:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half .109-top-dark-grey-block")).getText();
Or if you are just after the element with the ng-binding class within your first div, then this would be cleaner:
String elementText = driver.findElement(By.cssSelector(".d122-top-section-btm-half .ng-binding")).getText();
Both would return the element text - maybe take a look at CSS Selectors Guide to learn more.
For example:
<div>
this is first
<div>
second
</div>
</div>
I am working on Natural Language Processing and I have to translate a website (not by using Google Translate), for which I have to extract both sentences "this is first" and "second" separately, so that I can replace them with text in another language in the respective divs. If I extract the text for the first div it will show "this is first second", and if I use recursion to dig deeper, it will only extract "second".
Help me out please!
EDIT
Using the ownText() method will create a problem with the following HTML code:
<div style="top:+0.2em; font-size:95%;">
the
<a href="/wiki/Free_content" title="Free content">
free
</a>
<a href="/wiki/Encyclopedia" title="Encyclopedia">
encyclopedia
</a>
that
<a href="/wiki/Wikipedia:Introduction" title="Wikipedia:Introduction">
anyone can edit
</a>
.
</div>
It will print:
the that.
free
encyclopedia
anyone can edit
But it must be:
the
that
.
encyclopedia
anyone can edit
If i extract text for first it will show "this is first second"
Use ownText() instead of text() and you'll get only the text that the element contains directly.
Here's an example:
final String html = "<div>\n"
+ " this is first\n"
+ " <div>\n"
+ " second\n"
+ " </div>\n"
+ "</div>";
Document doc = Jsoup.parse(html); // Get your Document from somewhere
Element first = doc.select("div").first(); // Select 1st element - take the first found
String firstText = first.ownText(); // Get own text
Element second = doc.select("div > div").first(); // Same as above, but with 2nd div
String secondText = second.ownText();
System.out.println("1st: " + firstText);
System.out.println("2nd: " + secondText);
You can use an XML parser in whatever language you are using. Here is one for Java: http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
It seems like you're using textContent on the divs to extract the content, which will get you the content of that element and of all descendant elements. (In Java this would be the getTextContent method on the Element.)
Instead, examine the childNodes (Java: the getChildNodes method on the Element). The nodes have a property "nodeType" (Java: getNodeType) which you can look at to work out whether the node is a text node (Java: Node.TEXT_NODE) or an element (Java: Node.ELEMENT_NODE). So, to take your example, you have a tree of nodes which looks like this:
div (Element)
    this is first (TextNode)
    div (Element)
        second (TextNode)
The outer div directly contains only two nodes - the first piece of text, and the inner div. That inner div then contains the text "second".
So loop over the nodes in the outer div: if a node is a text node, translate it; otherwise recurse into the element. Note that there are other kinds of nodes, comments and the like, but for your purposes you can probably ignore those.
Assuming you're using the w3c DOM API
http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html
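A minimal sketch of that traversal with the w3c DOM API might look like this (the translate helper is a hypothetical placeholder for whatever translation step you plug in):
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TextNodeTranslator {

    // Walk every child node: translate text nodes in place, recurse into elements,
    // and skip other node types (comments and the like).
    static void translateTextNodes(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                child.setNodeValue(translate(child.getNodeValue()));
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                translateTextNodes(child); // dig into nested divs
            }
        }
    }

    // Hypothetical translation hook - replace with your own translation code.
    static String translate(String text) {
        return text;
    }
}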
Elements divs=doc.getElementsByTag("div");
for (Element element : divs) {
System.out.println(element.text());
}
This should work if you are using jsoup HTML parser.
The following is a bunch of link (<a>) elements. ONLY one of them has the substring "long" in the value of its href attribute:
<a class="c1" href= "very_lpng string" > name1 </a>
<a class="g2" href= "verylong string" > name2 </a> // The one that I need
<a class="g4" href= "very ling string" > name3 </a>
<a class="g5g" href= "very ng string" > name4 </a>
...................
I need to click the link whose href has substring "long" in it. How can I do this?
PS: driver.findElement(By.partialLinkText("long")).click(); doesn't work, because it matches by the link text (the name), not the href.
I need to click the link whose href has the substring "long" in it. How can I do this?
With the beauty of CSS selectors.
Your statement would be:
driver.findElement(By.cssSelector("a[href*='long']")).click();
This means, in English:
Find me any 'a' elements that have the href attribute, where that attribute contains 'long'.
You can find a useful article about formulating your own selectors for automation effectively, as well as a list of all the other equality operators (contains, starts with, etc.), at: http://ddavison.io/css/2014/02/18/effective-css-selectors.html
use driver.findElement(By.partialLinkText("long")).click();
You can do this:
//first get all the <a> elements
List<WebElement> linkList = driver.findElements(By.tagName("a"));
//now traverse over the list and check
for (int i = 0; i < linkList.size(); i++)
{
    if (linkList.get(i).getAttribute("href").contains("long"))
    {
        linkList.get(i).click();
        break;
    }
}
What we are doing here is: first we find all the <a> tags and store them in a list. After that we iterate over the list one by one to find the <a> tag whose href attribute contains the string "long". Then we click on that particular <a> tag and come out of the loop.
You can also achieve the same with the help of an xpath locator.
Your statement would be:
driver.findElement(By.xpath(".//a[contains(#href,'long')]")).click();
And for clicking all the links containing 'long' in the URL, you can use:
List<WebElement> linksList = driver.findElements(By.xpath(".//a[contains(@href,'long')]"));
for (WebElement webElement : linksList){
webElement.click();
}