I am a complete beginner to webscraping. I have followed a couple tutorials online, but I can't seem to get it to work with Premiere League results.
Here is the exact link I've tried scraping from: https://www.premierleague.com/results
My goal is to read all the home-team and away teams as well as get their results (1-1 etc.). If anyone could help I would really appreicate it! Below is code I've tried so far:
First attempt
String element = doc.select("div.fixtures__matches-list span.competitionLabel1").first().text();
Second attempt
Elements elements = doc.select("div.fixtures__matches-list");
Elements matches = doc.getElementsByClass("matchList");
Element ULElement = matches.get(0);
Elements childElements = ULElement.children();
for (Element e : childElements) {
String first = e.select("ul.matchList").select("li.matchFixtureContainer data-home").text();
System.out.println(e.text());
}
Third attempt
Elements test = doc.getElementsByClass("fixtures");
Element firstE = test.get(0);
System.out.println(firstE.text())
for (Element e : test) {
System.out.println(e.text());
}
Fourth attempt
Elements names = doc.select("data-home");
for (Element name : names) {
System.out.println(name.text());
}
Fifth attempt
String webUrl = "https://www.premierleague.com/results";
Document doc = null;
try {
doc = Jsoup.connect(webUrl).timeout(6000).get();
}
catch(IOException e) {
e.printStackTrace();
}
Elements body = doc.select("div.tabbedContent");
for (Element e : body) {
String data = e.select("div.col-12 section.fixtures div.fixtures__matches-list ul.matchList").text();
}
I really can't figure it out.
Related
I prepare the program and I wrote this code with helping but the first 10 times it works then it gives me NULL values,
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
Elements curreny = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(i);
currencyStr[i] = curreny.text();
buyStr[i] = buy.text();
sellStr[i] = sell.text();
System.out.println(String.format("%s [buy=%s, sell=%s]",
curreny.text(), buy.text(), sell.text()));
i++;
}
for(i = 0; i < 11; i++){
System.out.println("currency: " + currencyStr[i]);
System.out.println("buy: " + buyStr[i]);
System.out.println("sell: " + sellStr[i]);
}
here is the code, I guess it is a connection problem but I could not solve it I use Netbeans, Do I have to change the connection properties of Netbeans or should I have to add something more in the code
can you help me?
There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page, there's an element with class borsaMain, that has a direct child with class detL. And then somewhere in the descendants tree of detL, there is your table. You can write this as the following CSS element selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
// your existing loop code
}
I will start off by stating that working with HTML and JSoup for that matter is very foreign to me so if this comes across as a stupid question, I apologize.
What I am trying to achieve with my code is to print the contents from the table on this link https://www.stormshield.one/pve/stats/daviddean/sch into my console in a format like this for each entry:
Wall Launcher
50
grade grade grade grade grade
15% ImpactKnockback
42% Reload Speed
15% Impact Knockback
42% Reload Speed
15% ImpactKnockback
42% Durability
My main issue is pretty much supplying the correct name for the table and the rows, once I can do that the formatting isn't really an issue for me.
This is the code I have tried to use to no avail:
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
for (Element table : doc.select("table schematics")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
You need to find your table element, and it's head and rows.
Be careful, it is not always the first() element, I add it as an example.
Here is what you need to do:
Document doc = null;
try {
doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
} catch (IOException e) {
e.printStackTrace();
}
Element table = doc.body().getElementsByTag("table").first();
Element thead = table.getElementsByTag("thead").first();
StringBuilder headBuilder = new StringBuilder();
for (Element th : thead.getElementsByTag("th")) {
headBuilder.append(th.text());
headBuilder.append(" ");
}
System.out.println(headBuilder.toString());
Element tbody = table.getElementsByTag("tbody").first();
for (Element tr : tbody.getElementsByTag("tr")) {
StringBuilder rowBuilder = new StringBuilder();
for (Element td : tr.getElementsByTag("td")) {
rowBuilder.append(td.text());
rowBuilder.append(" ");
}
System.out.println(rowBuilder.toString());
}
The output is :
I am trying to extract the links within a given element in jsoup. Here what I have done but its not working:
Document doc = Jsoup.connect(url).get();
Elements element = doc.select("section.row");
Element s = element.first();
Elements se = s.getElementsByTag("article");
for(Element link : se){
System.out.println("link :" + link.select("href"));
}
Here is the html:
The thing I am trying to do is get all the links withing the article classes. I thought that maybe first I must select the section class ="row", and then after that derive somehow the links from the article class but I could not make it work.
Try out this.
Document doc = Jsoup.connect(url).get();
Elements section = doc.select("#main"); //select section with the id = main
Elements allArtTags = section.select("article"); // select all article tags in that section
for (Element artTag : allArtTags ){
Elements atags = artTag.select("a"); //select all a tags in each article tag
for(Element atag : atags){
System.out.println(atag.text()); //print the link text or
System.out.println(atag.attr("href"));//print link
}
}
I'm using this in one of my projects:
final Elements elements = doc.select("div.item_list_section.item_description");
you'll have to get the elements you want to extract links from.
private static ... inspectElement(Element e) {
try {
final String name = getAttr(e, "a[href]");
final String link = e.select("a").first().attr("href");
//final String price = getAttr(e, "span.item_price");
//final String category = getAttr(e, "span.item_category");
//final String spec = getAttr(e, "span.item_specs");
//final String datetime = e.select("time").attr("datetime");
...
}
catch (Exception ex) { return null; }
}
private static String getAttr(Element e, String what) {
try {
return e.select(what).first().text();
}
catch (Exception ex) { return ""; }
}
Basically what I'm attempting to do is input the song and artist in the url which will then bring me to the page with the song's lyrics I'm then going to find the correct way to get those lyrics. I'm new to using Jsoup. So far the issue I've been having is I can't figure out the correct way to get the lyrics. I've tried getting the first "div" after the "b" but it doesn't seem to work out the way I plan.
public static void search() throws MalformedURLException {
Scanner search = new Scanner(System.in);
String artist;
String song;
artist = search.nextLine();
artist = artist.toLowerCase();
System.out.println("Artist saved");
song = search.nextLine();
song = song.toLowerCase();
System.out.println("Song saved");
artist = artist.replaceAll(" ", "");
System.out.println(artist);
song = song.replaceAll(" ", "");
System.out.println(song);
try {
Document doc;
doc = Jsoup.connect("http://www.azlyrics.com/lyrics/"+artist+"/"+song+".html").get();
System.out.println(doc.title());
for(Element element : doc.select("div")) {
if(element.hasText()) {
System.out.println(element.text());
break;
}
}
} catch (IOException e){
e.printStackTrace();
}
}
I don't know if this is consistent or not in all song pages, but in the page you have shown, the lyrics appear with the div element whose first attribute is margin. If this is consistent, you could try something on the order of...
Elements eles = doc.select("div[style^=margin]");
System.out.println(eles.html());
Or if it's always the 6th div element with lyrics, you could use that:
Elements eles = doc.select("div");
if (eles.size() >= 6) {
System.out.println(eles.get(6).html());
}
I am managing to parse most of the data I need except for one as it is contained within the a href tag and I am needing the number that appears after "mmsi="
Sunsail 4013
my current parser fetches all the other data I need and is below. I tried a few things out the code commented out returns unspecified occasionally for an entry. Is there any way I can add to my code below so that when the data is returned the number "235083844" returns before the name "Sunsail 4013"?
try {
File input = new File("shipMove.txt");
Document doc = Jsoup.parse(input, null);
Elements tables = doc.select("table.shipInfo");
for( Element element : tables )
{
Elements tdTags = element.select("td");
//Elements mmsi = element.select("a[href*=/showship.php?mmsi=]");
// Iterate over all 'td' tags found
for( Element td : tdTags ){
// Print it's text if not empty
final String text = td.text();
if( text.isEmpty() == false )
{
System.out.println(td.text());
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Example of data parsed and html file here
You can use attr on an Element object to retrieve a particular attribute's value
Use substring to get the required value if the String pattern is consistent
Code
// Using just your anchor html tag
String html = "Sunsail 4013";
Document doc = Jsoup.parse(html);
// Just selecting the anchor tag, for your implementation use a generic one
Element link = doc.select("a").first();
// Get the attribute value
String url = link.attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ link.text());
Gives,
235083844 Sunsail 4013
Modified condition in your for loop from your code:
...
for (Element td : tdTags) {
// Print it's text if not empty
final String text = td.text();
if (text.isEmpty() == false) {
if (td.getElementsByTag("a").first() != null) {
// Get the attribute value
String url = td.getElementsByTag("a").first().attr("href");
// Check for nulls here and take the substring from '=' onwards
String id = url.substring(url.indexOf('=') + 1);
System.out.println(id + " "+ td.text());
}
else {
System.out.println(td.text());
}
}
}
...
The above code would print the desired output.
If you need value of attribute, you should use attr() method.
for( Element td : tdTags ){
Elements aList = td.select("a");
for(Element a : aList){
String val = a.attr("href");
if(StringUrils.isNotBlank(val)){
String yourId = val.substring(val.indexOf("=") + 1);
}
}