How can I print the contents of this HTML table using JSoup? - java

I will start off by stating that working with HTML and JSoup for that matter is very foreign to me so if this comes across as a stupid question, I apologize.
What I am trying to achieve with my code is to print the contents from the table on this link https://www.stormshield.one/pve/stats/daviddean/sch into my console in a format like this for each entry:
Wall Launcher
50
grade grade grade grade grade
15% ImpactKnockback
42% Reload Speed
15% Impact Knockback
42% Reload Speed
15% ImpactKnockback
42% Durability
My main issue is pretty much supplying the correct name for the table and the rows, once I can do that the formatting isn't really an issue for me.
This is the code I have tried to use to no avail:
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
for (Element table : doc.select("table schematics")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}

You need to find your table element, and its head and rows.
Be careful: the table you want is not always the first() element on the page; I use first() here only as an example.
Here is what you need to do:
Document doc = null;
try {
doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
} catch (IOException e) {
e.printStackTrace();
return; // doc is still null here, so stop instead of hitting a NullPointerException below
}
Element table = doc.body().getElementsByTag("table").first();
Element thead = table.getElementsByTag("thead").first();
StringBuilder headBuilder = new StringBuilder();
for (Element th : thead.getElementsByTag("th")) {
headBuilder.append(th.text());
headBuilder.append(" ");
}
System.out.println(headBuilder.toString());
Element tbody = table.getElementsByTag("tbody").first();
for (Element tr : tbody.getElementsByTag("tr")) {
StringBuilder rowBuilder = new StringBuilder();
for (Element td : tr.getElementsByTag("td")) {
rowBuilder.append(td.text());
rowBuilder.append(" ");
}
System.out.println(rowBuilder.toString());
}
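As a rough sketch of the same idea using a CSS selector instead of first(): the selector in the question, "table schematics", looks for a <schematics> element inside a <table>, whereas a class is written table.schematics. The class name schematics and the sample HTML below are assumptions for illustration (the real page's markup is not shown here), and readRows is a hypothetical helper, not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TableRowsSketch {

    // Hypothetical helper: returns one line per <tr> of the first table that
    // matches tableSelector, with the cell texts joined by " | ".
    // Returns an empty list when no table matches.
    static List<String> readRows(Document doc, String tableSelector) {
        List<String> lines = new ArrayList<>();
        Element table = doc.selectFirst(tableSelector);
        if (table == null) {
            return lines;
        }
        for (Element row : table.select("tr")) {
            // eachText() skips cells that have no text at all.
            List<String> cells = row.select("th, td").eachText();
            if (!cells.isEmpty()) {
                lines.add(String.join(" | ", cells));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample markup, parsed from a string so no network access is needed.
        String html = "<table class=\"schematics\">"
                + "<thead><tr><th>Name</th><th>Level</th></tr></thead>"
                + "<tbody><tr><td>Wall Launcher</td><td>50</td></tr></tbody>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        for (String line : readRows(doc, "table.schematics")) {
            System.out.println(line);
        }
    }
}
```

Returning an empty list when the selector matches nothing makes a wrong table name show up as "no rows" rather than as an exception.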

Related

Having trouble webscraping Premier League results in Java with JSoup

I am a complete beginner to web scraping. I have followed a couple of tutorials online, but I can't seem to get it to work with Premier League results.
Here is the exact link I've tried scraping from: https://www.premierleague.com/results
My goal is to read all the home and away teams as well as their results (1-1 etc.). If anyone could help I would really appreciate it! Below is the code I've tried so far:
First attempt
String element = doc.select("div.fixtures__matches-list span.competitionLabel1").first().text();
Second attempt
Elements elements = doc.select("div.fixtures__matches-list");
Elements matches = doc.getElementsByClass("matchList");
Element ULElement = matches.get(0);
Elements childElements = ULElement.children();
for (Element e : childElements) {
String first = e.select("ul.matchList").select("li.matchFixtureContainer data-home").text();
System.out.println(e.text());
}
Third attempt
Elements test = doc.getElementsByClass("fixtures");
Element firstE = test.get(0);
System.out.println(firstE.text());
for (Element e : test) {
System.out.println(e.text());
}
Fourth attempt
Elements names = doc.select("data-home");
for (Element name : names) {
System.out.println(name.text());
}
Fifth attempt
String webUrl = "https://www.premierleague.com/results";
Document doc = null;
try {
doc = Jsoup.connect(webUrl).timeout(6000).get();
}
catch(IOException e) {
e.printStackTrace();
}
Elements body = doc.select("div.tabbedContent");
for (Element e : body) {
String data = e.select("div.col-12 section.fixtures div.fixtures__matches-list ul.matchList").text();
}
I really can't figure it out.

Parsing currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I put this program together with some help. It works the first 10 times, but after that it gives me null values:
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
//Document doc = Jsoup.parse(url);
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
int i = 0;
String[] currencyStr = new String[11];
String[] buyStr = new String[11];
String[] sellStr = new String[11];
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table.table-markets");
for (Element element : elements) {
Elements curreny = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(i);
currencyStr[i] = curreny.text();
buyStr[i] = buy.text();
sellStr[i] = sell.text();
System.out.println(String.format("%s [buy=%s, sell=%s]",
curreny.text(), buy.text(), sell.text()));
i++;
}
for(i = 0; i < 11; i++){
System.out.println("currency: " + currencyStr[i]);
System.out.println("buy: " + buyStr[i]);
System.out.println("sell: " + sellStr[i]);
}
That is the code. I guess it is a connection problem, but I could not solve it. I use NetBeans. Do I have to change the connection properties of NetBeans, or do I need to add something more to the code?
Can you help me?
There's nothing wrong with the connection. Your query simply doesn't match the page structure.
Somewhere on your page, there's an element with class borsaMain, that has a direct child with class detL. And then somewhere in the descendants tree of detL, there is your table. You can write this as the following CSS element selector query:
.borsaMain > .detL table
There will be two tables in the result, but I suspect you are looking for the first one.
So basically, you want something like:
Element table = doc.selectFirst(".borsaMain > .detL table");
for (Element row : table.select("tr:has(td)")) {
// your existing loop code
}
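A minimal sketch of how that selector and row loop could fit together, run against a small in-memory document. The HTML structure and the rate values are made-up sample data, readRates is a hypothetical helper, and the column positions (currency, buy, sell in cells 2 to 4) are taken from the td:nth-child indexes in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class CurrencyTableSketch {

    // Hypothetical helper: one formatted line per data row of the first table
    // found under ".borsaMain > .detL". Header rows (no <td>) are skipped.
    static List<String> readRates(Document doc) {
        List<String> lines = new ArrayList<>();
        Element table = doc.selectFirst(".borsaMain > .detL table");
        if (table == null) {
            return lines;
        }
        for (Element row : table.select("tr:has(td)")) {
            Elements tds = row.select("td");
            if (tds.size() < 4) {
                continue; // not a full data row
            }
            lines.add(String.format("%s [buy=%s, sell=%s]",
                    tds.get(1).text(), tds.get(2).text(), tds.get(3).text()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample markup with invented values, so no network access is needed.
        String html = "<div class=\"borsaMain\"><div class=\"detL\"><div><table>"
                + "<tr><th>#</th><th>Currency</th><th>Buy</th><th>Sell</th></tr>"
                + "<tr><td>1</td><td>USD</td><td>5.10</td><td>5.12</td></tr>"
                + "<tr><td>2</td><td>EUR</td><td>5.90</td><td>5.93</td></tr>"
                + "</table></div></div></div>";
        for (String line : readRates(Jsoup.parse(html))) {
            System.out.println(line);
        }
    }
}
```

Collecting into a List rather than fixed-size arrays avoids both leftover null slots and index errors when the number of rows changes.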

Select a particular HTML table with JSOUP

I have my code as:
public static void main(String[] args) throws IOException {
org.jsoup.nodes.Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
org.jsoup.select.Elements rows = doc.select("tr");
for (org.jsoup.nodes.Element row : rows) {
org.jsoup.select.Elements columns = row.select("td");
for (org.jsoup.nodes.Element column : columns) {
System.out.print(column.text());
}
System.out.println();
}
}
It prints every table row on the webpage. Is it possible to print only one selected table from the page?
Try to select a particular table element first and then loop over its nested elements.
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
Element table = doc.select("table.wikitable").get(1);
Elements body = table.select("tbody");
Elements rows = body.select("tr");
for (Element row : rows) {
System.out.print(row.select("th").text());
System.out.print(row.select("td").text());
System.out.println();
}
}
Output:
Ibu negaraKuala Lumpur
Pusat pentadbiranPutrajaya
Tarikh Hari Kebangsaan31 Ogos 1957
Cogan Kata NegaraBersekutu Bertambah Mutu
BenuaAsia, Asia Tenggara
Koordinat Geografi2 30 U, 112 30 T
Jumlah hujan tahunan2000mm ~ 2500mm
IklimTropika dengan suhu 24–35 Darjah Celsius
Bunga kebangsaanBunga Raya
Binatang rasmiHarimau
Puncak tertinggiGunung Kinabalu, Banjaran Crocker (4175m)
Puncak tertinggi SemenanjungGunung Tahan, Banjaran Tahan (2187 m)
Banjaran terpanjangBanjaran Titiwangsa (500 km)
Sungai terpanjangSungai Rajang, Sarawak (563 km)
Sungai terpanjang di SemenanjungSungai Pahang (475 km)
Jambatan terpanjangJambatan Pulau Pinang (13.5 km)
Gua terbesarGua Niah, Sarawak
Bangunan tertinggiMenara Berkembar Petronas (452m)
Negeri terbesarSarawak (124,450 km persegi)
Negeri terkecilPerlis (810 km persegi)
Tempat paling lembapBukit Larut (lebih 5080 mm)
Tempat paling keringJelebu (kurang daripada 1500 mm)
Kawasan paling padatKuala Lumpur (6074/km², 15,543/batu persegi)
Penanaman eksport utamaKelapa sawit dan getah
Read more documentation here about JSOUP.
The best way to do this is grab the table by its title. Since the title is embedded in a cousin element of the table, and CSS has no parent selector, you can use a combination of CSS and Jsoup API calls to achieve this.
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://ms.wikipedia.org/wiki/Malaysia").get();
Element table = doc.select("span#Trivia").parents().first().nextElementSibling();
Elements rows = table.select("tr");
for (Element row : rows) {
String header = row.select("th").text();
String value = row.select("td").text();
System.out.println(header + ": " + value);
}
}
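One caveat with the chained call above is that it throws a NullPointerException if the span is missing or is not followed by a table. A possible null-safe variant is sketched below; readTableAfter is a hypothetical helper, and the sample HTML assumes the heading wraps the span and is immediately followed by the table, which may not match the live page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class TableAfterHeadingSketch {

    // Hypothetical helper: finds the element matching anchorSelector, steps to
    // its parent's next sibling, and reads th/td pairs if that sibling is a table.
    // Returns an empty map when any step of the lookup fails.
    static Map<String, String> readTableAfter(Document doc, String anchorSelector) {
        Map<String, String> result = new LinkedHashMap<>();
        Element anchor = doc.selectFirst(anchorSelector);
        if (anchor == null || anchor.parent() == null) {
            return result;
        }
        Element next = anchor.parent().nextElementSibling();
        if (next == null || !"table".equals(next.tagName())) {
            return result;
        }
        for (Element row : next.select("tr")) {
            String header = row.select("th").text();
            String value = row.select("td").text();
            if (!header.isEmpty()) {
                result.put(header, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample markup; the two rows reuse values from the output shown earlier.
        String html = "<h2><span id=\"Trivia\">Trivia</span></h2>"
                + "<table class=\"wikitable\">"
                + "<tr><th>Ibu negara</th><td>Kuala Lumpur</td></tr>"
                + "<tr><th>Benua</th><td>Asia, Asia Tenggara</td></tr>"
                + "</table>";
        Map<String, String> rows = readTableAfter(Jsoup.parse(html), "span#Trivia");
        rows.forEach((header, value) -> System.out.println(header + ": " + value));
    }
}
```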

Getting a block of text using Jsoup

What I'm attempting to do is put the song and artist into the URL, which brings me to the page with the song's lyrics, and then find the correct way to extract those lyrics. I'm new to Jsoup, and so far I can't figure out how to get the lyrics. I've tried getting the first "div" after the "b", but it doesn't work the way I planned.
public static void search() throws MalformedURLException {
Scanner search = new Scanner(System.in);
String artist;
String song;
artist = search.nextLine();
artist = artist.toLowerCase();
System.out.println("Artist saved");
song = search.nextLine();
song = song.toLowerCase();
System.out.println("Song saved");
artist = artist.replaceAll(" ", "");
System.out.println(artist);
song = song.replaceAll(" ", "");
System.out.println(song);
try {
Document doc;
doc = Jsoup.connect("http://www.azlyrics.com/lyrics/"+artist+"/"+song+".html").get();
System.out.println(doc.title());
for(Element element : doc.select("div")) {
if(element.hasText()) {
System.out.println(element.text());
break;
}
}
} catch (IOException e){
e.printStackTrace();
}
}
I don't know whether this is consistent across all song pages, but on the page you have shown, the lyrics appear in the div element whose style attribute starts with margin. If this is consistent, you could try something along these lines:
Elements eles = doc.select("div[style^=margin]");
System.out.println(eles.html());
Or, if the lyrics are always in the div at index 6 (indexes are zero-based), you could use that:
Elements eles = doc.select("div");
if (eles.size() > 6) { // index 6 only exists when there are at least 7 divs
System.out.println(eles.get(6).html());
}

Remove White Space From Text that i scraped from website

I am trying to scrape a list of medicines from a website.
I am using JSOUP to parse the Html.
Here is my code :
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text();
if(!(htm.equals("View Price")||htm.contains("Show Details"))) {
System.out.println(htm);
System.out.println();
}
}
Here is the Output that I am getting:
P.S. This is not the complete output; I couldn't take a screenshot of all of it, so I only show part of it.
I need to know two things:
Question 1. Why am I getting an extra space in front of each drug name, and why am I getting an extra new line after some drug names?
Question 2. How do I resolve this issue?
A few things:
It's not the complete output because there's more than one page. I put a for loop that fixes that for you.
You should probably trim the output using htm.trim().
You should probably skip empty strings so that you don't print blank entries (!htm.isEmpty()).
That website uses a non-breaking space (character code 160, U+00A0, which is outside the ASCII range). trim() does not remove it, so I replace it with a regular space first (with .replace).
Here's the fixed code:
for(char page='a'; page <= 'z'; page++) {
String urlString = String.format("http://www.medindia.net/drug-price/index.asp?alpha=%c", page);
URL url = new URL(urlString);
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text().replace((char) 160, ' ').trim();
if(!(htm.equals("View Price")||htm.contains("Show Details"))&& !htm.isEmpty())
{
System.out.println(htm.trim());
System.out.println();
}
}
}
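The non-breaking space point can be checked in isolation with plain Java, no Jsoup needed. This is only a sketch: clean is a hypothetical helper name and the sample cell text is made up. It shows that String.trim() leaves U+00A0 in place, because trim() only strips characters up to U+0020, so the replace has to happen before the trim.

```java
public class NbspTrimSketch {

    // Hypothetical helper: turns non-breaking spaces (U+00A0, code 160) into
    // regular spaces, then trims leading and trailing whitespace.
    static String clean(String raw) {
        return raw.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        // Made-up sample cell text wrapped in non-breaking spaces.
        String cell = "\u00A0Abacavir\u00A0";

        // trim() alone does not shorten the string: both U+00A0 characters survive.
        System.out.println(cell.trim().length() == cell.length());

        // Replacing first, then trimming, leaves just the name.
        System.out.println("[" + clean(cell) + "]");

        // A cell holding only a non-breaking space becomes empty, so !isEmpty() can filter it.
        System.out.println(clean("\u00A0").isEmpty());
    }
}
```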
Trim the string before printing it:
System.out.println(htm.trim());
UPDATED:
After a lot of effort I was able to parse all 80 medicines like this:
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.select("td.ta13blue");
Elements rows1 = doc1.select("td.ta13black.tbold");
int cnt=0;
for(Element row : rows){
cnt++;
String htm = row.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}
for(Element row1 : rows1){
cnt++;
String htm = row1.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}
1) Selecting elements by their style attribute is quite fragile;
2) Naming a variable rows when it actually holds a list of fields (cells) is even more dangerous :)
3) If you open the page, you can see that the extra lines are added ONLY after "black names", that is, item names not wrapped in an anchor link.
Your problem is that the second field in those rows is neither Show Details nor View Price, and it is not strictly empty either. It is:
<td bgcolor="#FFFFDB" align="center"
style="padding-left:5px;border-right:1px solid #A5A5A5;">
</td>
It is a string containing only whitespace, so it is empty once trimmed. Modify your code like this:
for(Element row : rows){
String htm = row.text().trim(); // <!-- This one
if(!
(htm.equals("View Price")
|| htm.contains("Show Details")
|| htm.isEmpty()) // <!-- And this one (after trim(), a whitespace-only cell is empty)
) {
System.out.println(htm);
System.out.println();
}
}
