Java jsoup link extracting - java

I am trying to extract the links within a given element in jsoup. Here is what I have done, but it's not working:
Document doc = Jsoup.connect(url).get();
Elements element = doc.select("section.row");
Element s = element.first();
Elements se = s.getElementsByTag("article");
for(Element link : se){
System.out.println("link :" + link.select("href"));
}
Here is the html:
What I am trying to do is get all the links within the article elements. I thought that I should first select the section with class="row" and then somehow derive the links from the article elements inside it, but I could not make it work.

Try this:
Document doc = Jsoup.connect(url).get();
Elements section = doc.select("#main"); // select the section with id="main"
Elements allArtTags = section.select("article"); // select all article tags in that section
for (Element artTag : allArtTags) {
    Elements atags = artTag.select("a"); // select all a tags in each article tag
    for (Element atag : atags) {
        System.out.println(atag.text()); // print the link text, or
        System.out.println(atag.attr("href")); // print the link target
    }
}
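For what it's worth, the original attempt prints nothing useful because link.select("href") looks for elements named href rather than reading the href attribute. Below is a minimal sketch of the same idea collapsed into a single selector. The class name ArticleLinks, the helper extractLinks, the inline HTML and the example.com base URI are all made up for illustration, since the real page markup is not shown in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ArticleLinks {

    // Collects the absolute URL of every <a href> inside an <article>
    // that sits within <section class="row">.
    static List<String> extractLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        // a[href] skips anchors that have no href attribute at all
        for (Element a : doc.select("section.row article a[href]")) {
            links.add(a.absUrl("href")); // resolves relative links against baseUri
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the page from the question
        String html = "<section class=\"row\">"
                + "<article><a href=\"/first\">First</a></article>"
                + "<article><a href=\"https://example.com/second\">Second</a><a>no href</a></article>"
                + "</section>";
        for (String link : extractLinks(html, "https://example.com/")) {
            System.out.println("link: " + link);
        }
    }
}
```

When the document comes from Jsoup.connect(url).get(), the base URI is already set, so absUrl("href") should work on it without passing one explicitly.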

I'm using this in one of my projects:
final Elements elements = doc.select("div.item_list_section.item_description");
First select the elements you want to extract links from, as above, then inspect each one:
private static ... inspectElement(Element e) {
    try {
        final String name = getAttr(e, "a[href]");
        final String link = e.select("a").first().attr("href");
        //final String price = getAttr(e, "span.item_price");
        //final String category = getAttr(e, "span.item_category");
        //final String spec = getAttr(e, "span.item_specs");
        //final String datetime = e.select("time").attr("datetime");
        ...
    }
    catch (Exception ex) { return null; }
}
private static String getAttr(Element e, String what) {
    try {
        return e.select(what).first().text();
    }
    catch (Exception ex) { return ""; }
}

Related

How to get Ticker symbol from table using jsoup?

I'm trying to get the symbols from the table at Yahoo Finance, but can't figure out why my code doesn't detect the table.
This is what I tried:
public String[] getTrendingTickers() {
String[] trendingTickers = new String[30];
int numTickers = 0;
String url = "https://finance.yahoo.com/trending-tickers/";
try {
Document document = Jsoup.connect(url).get();
for (Element row : document.select("table.W(100%) tr")) {
String ticker = row.select(
".Fz\\(s\\).Ta\\(start\\)\\!.Bgc\\(\\$lv2BgColor\\).Z\\(1\\).Bgc\\(\\$lv3BgColor\\).Pos\\(st\\).simpTblRow\\:h_Bgc\\(\\$hoverBgColor\\).Pend\\(10px\\).Start\\(0\\).Pend\\(15px\\).Pstart\\(6px\\).Ta\\(start\\).Va\\(m\\)")
.text();
System.out.println(ticker);
trendingTickers[numTickers] = ticker;
numTickers++;
}
} catch (Exception e) {
System.out.println(e);
}
return trendingTickers;
}
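It fails with the error org.jsoup.select.Selector$SelectorParseException: Could not parse query 'table.W(100%).tr': unexpected token at '(100%).tr'
The selector parser stops at the unescaped parentheses in the class name W(100%). Selecting by table structure instead of by those generated class names avoids the problem. Here is some sample code that creates a list of all the symbols in the table on the page you reference: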
Document document = Jsoup.connect("https://finance.yahoo.com/trending-tickers/").get();
Element table = document.select("table tbody").first();
List<String> symbols = new ArrayList<>();
for (Element row : table.select("tr")) {
    symbols.add(row.select("td").first().text());
}
System.out.println(symbols);
See https://jsoup.org/apidocs/org/jsoup/select/Selector.html for details on the selector syntax.

Having trouble webscraping Premier League results in Java with JSoup

I am a complete beginner to web scraping. I have followed a couple of tutorials online, but I can't seem to get it to work with Premier League results.
Here is the exact link I've tried scraping from: https://www.premierleague.com/results
My goal is to read all the home and away teams as well as their results (1-1 etc.). If anyone could help I would really appreciate it! Below is the code I've tried so far:
First attempt
String element = doc.select("div.fixtures__matches-list span.competitionLabel1").first().text();
Second attempt
Elements elements = doc.select("div.fixtures__matches-list");
Elements matches = doc.getElementsByClass("matchList");
Element ULElement = matches.get(0);
Elements childElements = ULElement.children();
for (Element e : childElements) {
String first = e.select("ul.matchList").select("li.matchFixtureContainer data-home").text();
System.out.println(e.text());
}
Third attempt
Elements test = doc.getElementsByClass("fixtures");
Element firstE = test.get(0);
System.out.println(firstE.text());
for (Element e : test) {
System.out.println(e.text());
}
Fourth attempt
Elements names = doc.select("data-home");
for (Element name : names) {
System.out.println(name.text());
}
Fifth attempt
String webUrl = "https://www.premierleague.com/results";
Document doc = null;
try {
doc = Jsoup.connect(webUrl).timeout(6000).get();
}
catch(IOException e) {
e.printStackTrace();
}
Elements body = doc.select("div.tabbedContent");
for (Element e : body) {
String data = e.select("div.col-12 section.fixtures div.fixtures__matches-list ul.matchList").text();
}
I really can't figure it out.

Getting a block of text using Jsoup

What I'm attempting to do is put the song and artist into the URL, which brings me to the page with the song's lyrics, and then extract those lyrics. I'm new to Jsoup, and so far I can't figure out the correct way to get the lyrics. I've tried getting the first "div" after the "b", but it doesn't work the way I planned.
public static void search() throws MalformedURLException {
Scanner search = new Scanner(System.in);
String artist;
String song;
artist = search.nextLine();
artist = artist.toLowerCase();
System.out.println("Artist saved");
song = search.nextLine();
song = song.toLowerCase();
System.out.println("Song saved");
artist = artist.replaceAll(" ", "");
System.out.println(artist);
song = song.replaceAll(" ", "");
System.out.println(song);
try {
Document doc;
doc = Jsoup.connect("http://www.azlyrics.com/lyrics/"+artist+"/"+song+".html").get();
System.out.println(doc.title());
for(Element element : doc.select("div")) {
if(element.hasText()) {
System.out.println(element.text());
break;
}
}
} catch (IOException e){
e.printStackTrace();
}
}
I don't know whether this is consistent across all song pages, but on the page you have shown, the lyrics appear in the div element whose style attribute starts with margin. If this is consistent, you could try something along the lines of...
Elements eles = doc.select("div[style^=margin]");
System.out.println(eles.html());
Or if the lyrics are always in the 6th div element, you could use that (indexes are zero-based, so the 6th element is at index 5):
Elements eles = doc.select("div");
if (eles.size() >= 6) {
    System.out.println(eles.get(5).html());
}

JSoup parsing data from within a tag

I am managing to parse most of the data I need except for one item, which is contained within the href of an a tag: I need the number that appears after "mmsi=".
Sunsail 4013
My current parser, shown below, fetches all the other data I need. I tried a few things; the commented-out line occasionally returns "unspecified" for an entry. Is there any way to extend my code so that the number "235083844" is printed before the name "Sunsail 4013"?
try {
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements tables = doc.select("table.shipInfo");
for( Element element : tables )
{
Elements tdTags = element.select("td");
//Elements mmsi = element.select("a[href*=/showship.php?mmsi=]");
// Iterate over all 'td' tags found
for( Element td : tdTags ){
// Print it's text if not empty
final String text = td.text();
if( text.isEmpty() == false )
{
System.out.println(td.text());
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Example of data parsed and html file here
You can use attr on an Element object to retrieve a particular attribute's value.
Use substring to get the required value if the String pattern is consistent.
Code:
// Using just your anchor HTML tag (href reconstructed from the selector in the question;
// the real link may carry a longer prefix)
String html = "<a href=\"/showship.php?mmsi=235083844\">Sunsail 4013</a>";
Document doc = Jsoup.parse(html);
// Just selecting the anchor tag, for your implementation use a generic one
Element link = doc.select("a").first();
// Get the attribute value
String url = link.attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " " + link.text());
Gives,
235083844 Sunsail 4013
Modified condition in your for loop from your code:
...
for (Element td : tdTags) {
    // Print its text if not empty
    final String text = td.text();
    if (text.isEmpty() == false) {
        if (td.getElementsByTag("a").first() != null) {
            // Get the attribute value
            String url = td.getElementsByTag("a").first().attr("href");
            // Check for nulls here and take the substring from '=' onwards
            String id = url.substring(url.indexOf('=') + 1);
            System.out.println(id + " " + td.text());
        }
        else {
            System.out.println(td.text());
        }
    }
}
...
The above code would print the desired output.
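One caveat with taking everything after the first '=': if the href ever carries more than one query parameter, the substring would include the extra text. Here is a small sketch of a stricter alternative that uses a regular expression from the standard library. The class name MmsiParser, the helper extractMmsi and the exact href shape are assumptions for illustration; only the "mmsi=" parameter and the sample number come from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MmsiParser {

    // Matches "mmsi=" preceded by '?' or '&' and captures the digits after it.
    private static final Pattern MMSI = Pattern.compile("[?&]mmsi=(\\d+)");

    // Returns the mmsi number found in the href, or empty if there is none.
    static Optional<String> extractMmsi(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher m = MMSI.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String href = "/showship.php?mmsi=235083844"; // assumed href shape
        String name = "Sunsail 4013";
        // Prefix the name with the id when one is present, otherwise print the name alone
        System.out.println(extractMmsi(href).map(id -> id + " " + name).orElse(name));
    }
}
```

In the loop above, you would pass td.getElementsByTag("a").first().attr("href") to extractMmsi instead of calling substring directly.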
If you need the value of an attribute, you should use the attr() method.
for (Element td : tdTags) {
    Elements aList = td.select("a");
    for (Element a : aList) {
        String val = a.attr("href");
        // StringUtils is from Apache Commons Lang
        if (StringUtils.isNotBlank(val)) {
            String yourId = val.substring(val.indexOf("=") + 1);
        }
    }
}

jSoup extract Text out of DIV tag to String

I want to extract some text from a website and store it in a String.
<div class="textclass" id="textid" itemprop="itemtext">I want to get this Text</div>
What goes into the question marks?
protected Void doInBackground(Void... params) {
try {
Document document = Jsoup.connect(url).get();
Elements text = document.select("???");
desc = text.attr("???");
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
Use the following:
Elements text = document.select("div");
String desc = text.text();
Log.i(".........", desc);
The log output when I tried it:
01-31 04:45:15.272: I/.........(1233): I want to get this Text
Edit:
You can use
Elements text = document.select("div[class=textclass]");
or using id
Elements text = document.select("div[id=textid]");
or
Elements text = document.select("div[itemprop=itemtext]");
You can try this:
Document doc1 = Jsoup.connect(url).get();
Element contentDiv = doc1.select("div[id=textid]").first();
String text=contentDiv.getElementsByTag("div").text();
System.out.println(text); // The result
This saves the text of the div with the id "textid" in the variable text.
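Note that first() returns null when nothing matches, so the snippet above would throw a NullPointerException on a page without that div. Below is a minimal sketch of a null-safe variant that looks the element up by id. The class name DivText and the helper textById are made up for illustration; the HTML is the div from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivText {

    // Returns the text of the element with the given id, or "" if no such element exists.
    static String textById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id); // null when the id is not present
        return el == null ? "" : el.text();
    }

    public static void main(String[] args) {
        String html = "<div class=\"textclass\" id=\"textid\" itemprop=\"itemtext\">"
                + "I want to get this Text</div>";
        System.out.println(textById(html, "textid"));
    }
}
```

With a live page you would call getElementById on the Document returned by Jsoup.connect(url).get() instead of parsing a string.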
