How to use Jsoup to find element by ID? - java

I am trying to scrape the Top Stories section of Google News for all the titles. To get only the titles in the Top Stories section, I must narrow down to this tag:
<div class="section top-stories-section" id=":2r">..</div>
This is the code I use (in Eclipse):
public static void main(String[] args) throws IOException {
// fetches & parses HTML
String url = "http://news.google.com";
Document document = Jsoup.connect(url).get();
// Extract data
Element topStories = document.getElementById(":2r");
Elements titles = topStories.select("span.titletext");
// Output data
for (Element title : titles) {
System.out.println("Title: " + title.text());
}
}
I always seem to be getting a NullPointerException. It also fails when I try to reach the Top Stories section like this:
Element topStories = document.select("#:2r").first();
Am I missing something? Shouldn't this be working? I am relatively new to this, please help and thank you!

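Judging from the error message (and from actually looking at the page), that div tag doesn't contain an id attribute. getElementById then returns null, so the following call to topStories.select(...) throws the NullPointerException. Instead, you could select based on the CSS class: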
Element topStories = document.select("div.section.top-stories-section").first();
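A minimal sketch of a null-safe version of this lookup might look as follows. The HTML string is made-up sample markup standing in for the real page (which may differ), and the class name TopStoriesSketch and the method extractTitles are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TopStoriesSketch {

    // Returns the text of every span.titletext inside the top-stories div,
    // or an empty list when that div is not present in the parsed HTML.
    static List<String> extractTitles(String html) {
        Document document = Jsoup.parse(html);
        // first() returns null when nothing matches, so check before using it
        Element topStories = document.select("div.section.top-stories-section").first();
        List<String> titles = new ArrayList<>();
        if (topStories == null) {
            return titles;
        }
        for (Element title : topStories.select("span.titletext")) {
            titles.add(title.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Assumed sample markup, not copied from the real page
        String html = "<div class=\"section top-stories-section\">"
                + "<span class=\"titletext\">First story</span>"
                + "<span class=\"titletext\">Second story</span>"
                + "</div>"
                + "<div class=\"section\"><span class=\"titletext\">Other story</span></div>";
        System.out.println(extractTitles(html));
        System.out.println(extractTitles("<p>no such section</p>"));
    }
}
```

Checking for null before calling select(...) turns a missing section into an empty result instead of an exception, which makes it easier to see that the selector, not the loop, is the problem.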

Related

JSoup, how to return data from a dynamic <a href> tag

Very new to JSoup, trying to retrieve a changeable value that is stored within an <a> tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and depend on the user's input. I am able to retrieve the h2 tags themselves but not the information within them. At the moment, the best return I can get is just tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
String constit = "";
String h2 = "h2";
String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
//Switch to try catch if time
Document doc = Jsoup.connect(url)
.timeout(6000).get();
//Scrape elements from relevant section
Elements body = doc.select("well.col-md-4.h2");
Element e = new Element("well.col-md-4.h2");
constit = e.toString();
return constit;
}
I am extremely new to JSoup and scraping in general. Would appreciate any input from someone who knows what they're doing or any alternate ways to try and get the desired result
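Your code never uses the result of doc.select(...). Instead, new Element("well.col-md-4.h2") creates a brand-new, empty element with that tag name, which is why only empty tags are printed. The selector itself also looks for a tag named well rather than a div with the class well. Change the code under your "Scrape elements from relevant section" comment as follows: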
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import java.io.IOException;

@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
    private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";

    @Test
    void findSomeText() throws IOException {
        String expected = "Dublin Bay South";
        Document document = Jsoup.connect(URL).get();
        // Select the link by its exact href value and read its text
        String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
        Assertions.assertEquals(expected, actual);
    }
}

How to get data from a URL in new lines using JSOUP?

I'm scraping the IMDb chart of the top 250 movies. I want to store each movie name in an array, but I don't know why it puts all the movie names into the first index, i.e. Array[0].
Below is my code.
Can anyone please help me out. I've to complete another project and this is the main thing that is needed.
If you can direct me any website or tutorial I'll be very thankful to you.
try {
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").userAgent("Mozilla").get();
int counterVariable = 0;
for (Element el : doc.select(".lister-list")) {
mString[counterVariable] = el.select(".titleColumn").text();
totalNumberOfLines++;
counterVariable++;
}
} catch (Exception e) {
System.out.println("Sorry website couldn't be opened");
System.out.println(e);
}
System.out.println(mString[0]);// It's putting all the names into this index
The problem is that only one element matches the selector .lister-list, so iterating over it runs the loop body just once. When you call el.select(".titleColumn").text(), Jsoup concatenates the text of all matching elements, which is why all the results end up in a single array entry. Instead, select every td with the class titleColumn that is a child of a tr, which in turn is a child of .lister-list:
for (Element el : doc.select(".lister-list > tr > td.titleColumn")) {
mString[counterVariable] = el.text();
totalNumberOfLines++;
counterVariable++;
}
You can learn more about jsoup CSS selectors in the jsoup selector documentation.
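The difference between the two selectors can be sketched on a small inline table. The markup below is an assumed, simplified stand-in for the IMDb chart, and the class name TitleColumnSketch and the method titlesPerCell are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TitleColumnSketch {

    // Collects one string per td.titleColumn cell instead of one per list
    static List<String> titlesPerCell(Document doc) {
        List<String> titles = new ArrayList<>();
        for (Element cell : doc.select(".lister-list > tr > td.titleColumn")) {
            titles.add(cell.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Assumed sample markup, much smaller than the real chart
        String html = "<table><tbody class=\"lister-list\">"
                + "<tr><td class=\"titleColumn\">Movie A</td></tr>"
                + "<tr><td class=\"titleColumn\">Movie B</td></tr>"
                + "</tbody></table>";
        Document doc = Jsoup.parse(html);

        // Only one element carries the class lister-list, so a loop over it runs once
        System.out.println("lister-list matches: " + doc.select(".lister-list").size());
        // text() on the whole match set joins the text of all cells into one string
        System.out.println("joined: " + doc.select(".lister-list").first().select(".titleColumn").text());
        // One entry per cell
        System.out.println("per cell: " + titlesPerCell(doc));
    }
}
```

A List is used here instead of a fixed-size array so that no counter variable or pre-sized buffer is needed; the same loop works with the array from the question.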

How to extract WebElements using linkText from fields and click on it

I would like to catch a text within the field and be able to click on that element. It extracts all the elements' texts into log when I use the following:
String text;
text = HomePageFields.TableOneColumn(driver).getText();
System.out.println("Table One Column contains following:\n" + text);
The TableOneColumn XPath is defined in a different class:
public static WebElement TableOneColumn(WebDriver driver) throws IOException {
    element = driver.findElement(By.xpath("//div[contains(@eventproxy,'isc_QMetricsView_0')]/div[1]/div[1]/div[1]/div[1]/div[1]/div[contains(@style,'position')]/div"));
    return element;
}
I tried to use:
HomePageFields.TableOneColumn(driver).findElement(By.linkText("RFI Overview")).click();
But it fails with an error saying the element cannot be found.
Here is the HTML for that particular text. Other texts are contained in the same tag, but at different locations within that main tag.
By.linkText() locates <a> elements by the exact text they display, while the desired text is not inside any <a> tag. That is why the element cannot be found.
You should try using By.xpath() with this text, as below:
WebElement el = driver.findElement(By.xpath(".//div[descendant::td[text() = 'RFI Overview']]"));
System.out.println("Table One Column contains following:\n" + el.getText());
el.click();
Or
WebElement el = driver.findElement(By.xpath(".//div[normalize-space(.) = 'RFI Overview']"));
System.out.println("Table One Column contains following:\n" + el.getText());
el.click();
Or, as I can see in the provided screenshot, the desired <div> has an id attribute whose value looks unique. If this attribute value is fixed for this element, you can also try using By.id():
WebElement el = driver.findElement(By.id("isc_3BL"));
System.out.println("Table One Column contains following:\n" + el.getText());
el.click();
driver.findElement(By.xpath("//td[.='RFI Overview']")).click();
I'd suggest using the div with id isc_3BL as well, but I am not certain that it is a static id. If it is, you could definitely use it to isolate this cell from any other table containing the exact same td with the text "RFI Overview".

JSoup parsing a text file containing a html table with Java

I am really unsure how I can get the information I need to place into a database, the code below just prints the whole file.
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
System.out.println(doc.toString());
My HTML is here, from line 61. I need to get the items under the column headings and also grab the MMSI number, which is not under a column heading but in the href attribute. I haven't used JSoup other than to get the HTML from a web page. I can only really find tutorials that use PHP, and I'd rather not use it.
To get that information, the best way is to use Jsoup's selector API. Using selectors, your code will look something like this (pseudocode!):
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements matches = doc.select("<your selector here>");
for( Element element : matches )
{
// do something with found elements
}
Good documentation is available here: Use selector-syntax to find elements. If you still get stuck, please describe your problem.
Here are some hints for that selector that you can use:
// Select the table with class 'shipinfo'
Elements tables = doc.select("table.shipinfo");
// Iterate over all tables found (since there is only one, you can use first() instead)
for (Element element : tables) {
    // Select all 'td' tags of that table
    Elements tdTags = element.select("td");
    // Iterate over all 'td' tags found
    for (Element td : tdTags) {
        // Print its text if not empty
        final String text = td.text();
        if (!text.isEmpty()) {
            System.out.println(text);
        }
    }
}
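The hints above cover the cell text; the value stored in the href can be read with attr("href"). The following is only a sketch: the question's HTML is not shown, so the sample markup, the href format and the names ShipTableSketch and linkTargets are all assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ShipTableSketch {

    // Returns "link text -> href" for every link inside table.shipinfo
    static List<String> linkTargets(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("table.shipinfo a[href]")) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup: the real file and its href format may look different
        String html = "<table class=\"shipinfo\"><tr>"
                + "<td><a href=\"/ship?mmsi=123456789\">Example Vessel</a></td>"
                + "<td>Berth 4</td>"
                + "</tr></table>";
        Document doc = Jsoup.parse(html);
        for (String line : linkTargets(doc)) {
            System.out.println(line);
        }
    }
}
```

How the MMSI number is then cut out of the href depends on the actual link format in the file, so that step is left open here.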

Jsoup get href within a class

I have this html code that I need to parse
<a class="sushi-restaurant" href="/greatSushi">Best Sushi in town</a>
I know there's a jsoup example that gets all the links in a page, e.g.
Elements links = doc.select("a[href]");
for (Element link : links) {
print(" * a: <%s> (%s)", link.attr("abs:href"),
trim(link.text(), 35));
}
but I need a piece of code that can return me the href for that specific class.
Thanks guys
You can select elements by class. This example finds elements with the class sushi-restaurant, then gets the absolute URL of the first result.
Make sure that when you parse the HTML, you specify the base URL (where the document was fetched from) to allow jsoup to determine what the absolute URL of a link is.
public static void main(String[] args) {
String html = "<a class=\"sushi-restaurant\" href=\"/greatSushi\">Best Sushi in town</a>";
Document doc = Jsoup.parse(html, "http://example.com/");
// find all <a class="sushi-restaurant">...
Elements links = doc.select("a.sushi-restaurant");
Element link = links.first();
// 'abs:' makes "/greatSushi" = "http://example.com/greatSushi":
String url = link.attr("abs:href");
System.out.println("url = " + url);
}
Shorter version:
String url = doc.select("a.sushi-restaurant").first().attr("abs:href");
Hope this helps!
Elements links = doc.select("a");
for (Element link : links) {
    String attribute = link.attr("class");
    if (attribute.equalsIgnoreCase("sushi-restaurant")) {
        // Print the href of the matching link
        System.out.println(link.attr("href"));
    }
}
