I am trying to scrape text from https://in-the-sky.org/data/object.php?id=A216&day=17&month=6&year=2022
so I wrote the following code:
import java.util.Iterator;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main {
public static void main(String args[]) {
int num = 216;
int day = 17;
int month = 6;
int year = 2022;
String url ="https://in-the-sky.org/data/object.php?id=A"+Integer.toString(num)+"&day="+Integer.toString(day)+"&month="+Integer.toString(month)+"&year="+Integer.toString(year);
System.out.println(url);
Document doc = null;
try {
doc = Jsoup.connect(url).get();
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
System.out.println("=======================================================");
Elements element = doc.select("div.col-md-6 col-md-pull-6");
String output = element.select("p").text();
System.out.println(output);
System.out.println("=======================================================");
}
}
But it doesn't work well. Could someone please help me?
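I believe that you can use Elements element = doc.select("div.col-md-6 > p"); to get your desired output. In a CSS selector a space means "descendant", so "div.col-md-6 col-md-pull-6" looks for a tag named col-md-pull-6 inside div.col-md-6 and matches nothing. To require both classes on the same div, chain them with dots: div.col-md-6.col-md-pull-6.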
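As a rough illustration of how the selector from the question is read, here is a toy model that splits a selector on spaces and lists the tag and classes of each part. It is not jsoup's parser, it only understands tag names and .class parts, and SelectorParts and describe are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a CSS selector string is read; NOT jsoup's parser.
// It only understands tag names, ".class" parts, and spaces.
class SelectorParts {

    // Splits a selector on whitespace (the descendant combinator) and
    // describes each compound part as "tag=<name> classes=[...]".
    static List<String> describe(String selector) {
        List<String> parts = new ArrayList<>();
        for (String compound : selector.trim().split("\\s+")) {
            String[] pieces = compound.split("\\.");
            // Text before the first dot is the tag name; empty means "any tag".
            String tag = pieces[0].isEmpty() ? "*" : pieces[0];
            List<String> classes = new ArrayList<>();
            for (int i = 1; i < pieces.length; i++) {
                classes.add(pieces[i]);
            }
            parts.add("tag=" + tag + " classes=" + classes);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Selector from the question: two parts, the second one is a TAG name.
        System.out.println(describe("div.col-md-6 col-md-pull-6"));
        // Chained form: one part, a div that carries both classes.
        System.out.println(describe("div.col-md-6.col-md-pull-6"));
    }
}
```

With the question's selector the second part comes out as a tag named col-md-pull-6 with no classes, while the dotted form yields a single div part carrying both classes.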
I am trying to get some data from a website. I copied and pasted the code from an old program of mine, but it is not working. My code is below.
import java.io.IOException;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Veri {
public static void main(String[] args) {
Veri();
}
public static void Veri() {
try {
String url = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/Sayfalar/default.aspx";
Response res = Jsoup.connect(url).timeout(6000).execute();
Document doc = res.parse();
Element ele = doc.select("table[class=dataTable hover nowrap excelexport data-tables no-footer]").first();
for (int i = 0; i < 100; i++) {
System.out.println(ele.select("td").iterator().next().text());
}
} catch (IOException c) {
JOptionPane.showMessageDialog(null, "Veriler Alınırken Bir Harta Oluştu!");
c.printStackTrace();
}
}
}
I got the error below:
Exception in thread "main" java.lang.NullPointerException
at Veri.Veri(Veri.java:37)
at Veri.main(Veri.java:20)
The page has probably changed a little bit since you last used your program.
Try this:
import java.io.IOException;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.Connection.Response;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Veri {
public static void main(String[] args) {
Veri();
}
public static void Veri() {
try {
String url = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/Sayfalar/default.aspx";
Response res = Jsoup.connect(url).timeout(6000).execute();
Document doc = res.parse();
Element ele = doc.select("table[class=dataTable hover nowrap excelexport]").first();
Elements lines = ele.select("tr");
for (Element elt : lines) {
System.out.println(elt.text());
System.out.println("------------------------");
}
} catch (IOException c) {
JOptionPane.showMessageDialog(null, "Veriler Alınırken Bir Harta Oluştu!");
c.printStackTrace();
}
}
}
I think you get all the information needed this way.
<span class="c-city__hrMin" data-bind="{attr:{id:'p'+id()}}" id="p64">10:52</span>
How do I get this to print out just 10:52?
So far I have tried:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.w3c.dom.Node;
import org.jsoup.select.*;
public class Main
{
public static void main(String [] args) {
Document doc = null;
try {
doc = Jsoup.connect("https://www.timeanddate.com/worldclock/personal.html").get();
String title = doc.title();
Elements elements = doc.select(".c-city__hrMin");
System.out.println("Website : " + title + elements.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
The output is "Website : The Personal World Clock" with no time after it, even though there isn't any syntax error.
Simply
doc.select(".c-city__hrMin") should work.
But if the class c-city__hrMin is present on other elements too, then try
doc.select("span[class=c-city__hrMin]"). It will select every span element whose class attribute is exactly that value.
NB: For more reference on jsoup CSS selectors, see the jsoup selector documentation. You can also try selectors against a document online.
I wrote a program which reads the name and the rating of the top 250 movies on IMDb and returns the mean of the ratings. I have the following program:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class da {
/**
* @param args
*/
public static void main(String[] args) {
try {
Document doc=Jsoup.connect("http://www.imdb.com/chart/top").get();
Elements e=doc.getElementsByClass("titleColumn");
Elements t=doc.getElementsByClass("imdbRating");
float suma=0;
for(int i=0;i<e.size();i++)
suma=suma+Float.parseFloat(t.get(i).text());
System.out.println(suma/250);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
My question is why 't' needs "imdbRating": when I look at the HTML of the page, the element where the rating is located has the class "ratingColumn imdbRating". (I wrote this program by accident and I don't know why it works this way and not the other way.)
You don't need the element e in this program. The titleColumn in the webpage just contains the title of the movie. Considering you only need the ratings, this is unnecessary. You can just use the t element, which I renamed to ratings, and I cleaned up your code a little bit:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class da {
/**
* @param args
*/
public static void main(String[] args) {
try {
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").get();
Elements ratings = doc.select(".ratingColumn.imdbRating");
float suma = 0;
for(int i = 0; i < ratings.size(); i++)
suma = suma + Float.parseFloat(ratings.get(i).child(0).text());
System.out.println(suma/250);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
EDIT: To select elements with multiple classes, you must use doc#select and pass it a CSS query like above.
nicholas79171 has a good answer, but I would just like to point out that you can use CSS selectors to target the ratings directly, without all of the DOM traversal methods.
Document doc = Jsoup.connect("http://www.imdb.com/chart/top").get();
float ratingSum = 0;
Elements ratings = doc.select("td.ratingColumn.imdbRating > strong");
for (Element rating : ratings)
ratingSum += Float.parseFloat(rating.ownText());
System.out.println(ratingSum / ratings.size());
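getElementsByClass takes a single class name and returns every element that has that name among its classes, which is why "imdbRating" alone finds the cells marked "ratingColumn imdbRating". If you want to require several classes at once, use select on your Document with a chained selector such as ".ratingColumn.imdbRating". You can read more about how select works in the jsoup selector documentation.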
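To make the idea of "one class name among several" concrete, here is a simplified model that treats a class attribute as a whitespace-separated list of names. It is only a sketch on plain strings, not jsoup's actual code (for one thing it compares names case-sensitively), and ClassTokens, hasClass and hasAllClasses are made-up names.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of class-name matching on plain strings; not jsoup's code.
class ClassTokens {

    // An HTML class attribute is a whitespace-separated list of names.
    static List<String> tokens(String classAttr) {
        return Arrays.asList(classAttr.trim().split("\\s+"));
    }

    // Models a lookup by ONE class name: true if that name is in the list.
    static boolean hasClass(String classAttr, String name) {
        return tokens(classAttr).contains(name);
    }

    // Models a chained selector such as ".a.b": every name must be present.
    static boolean hasAllClasses(String classAttr, String... names) {
        return tokens(classAttr).containsAll(Arrays.asList(names));
    }

    public static void main(String[] args) {
        String attr = "ratingColumn imdbRating";
        System.out.println(hasClass(attr, "imdbRating"));                       // true
        System.out.println(hasClass(attr, "rating"));                           // false
        System.out.println(hasAllClasses(attr, "ratingColumn", "imdbRating"));  // true
        System.out.println(hasAllClasses(attr, "imdbRating", "titleColumn"));   // false
    }
}
```

In this model a lookup by "imdbRating" succeeds on the attribute "ratingColumn imdbRating" because the name is one of the two tokens, and a partial name such as "rating" does not match.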
I have created a web scraper which fetches share-rate market data from the stock exchange website, www.psx.com.pk. On that site there is a Market Summary hyperlink, and I have to scrape the data from that link. I have created a program which is as follows.
package com.market_summary;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ComMarket_summary {
boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();
public static void createConnection() throws IOException {
System.setProperty("http.proxyHost", "191.1.1.202");
System.setProperty("http.proxyPort", "8080");
String tempUrl = "http://www.psx.com.pk/index.php";
doc = Jsoup.connect(tempUrl).get();
System.out.println("Successfully Connected");
}
public static void parsingHTML() throws Exception {
File fold = new File("C:\\market_smry.csv");
fold.delete();
File fnew = new File("C:\\market_smry.csv");
for (Element table : doc.getElementsByTag("table")) {
for (Element trElement : table.getElementsByTag("tr")) {
trElement2 = trElement.getElementsByTag("td");
tdElements = trElement.getElementsByTag("td");
FileWriter sb = new FileWriter(fnew, true);
if (trElement.hasClass("marketData")) {
for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
if (it.hasNext()) {
sb.append("\r\n");
}
for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
Element tdElement2 = it.next();
final String content = tdElement2.text();
if (it2.hasNext()) {
sb.append(formatData(content));
sb.append(" | ");
}
}
System.out.println(sb.toString());
sb.flush();
sb.close();
}
}
System.out.println(sampleList.add(tdElements));
}
}
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-YYYY", Locale.US);
public static String formatData(String text) {
String tmp = null;
try {
Date d = FORMATTER_MMM_d_yyyy.parse(text);
tmp = FORMATTER_dd_MMM_yyyy.format(d);
} catch (ParseException pe) {
tmp = text;
}
return tmp;
}
public static void main(String[] args) throws IOException, Exception {
createConnection();
parsingHTML();
}
}
Now, the problem is that when I execute this program it should create a .csv file, but it does not create any file. When I debugged the code I found that the program does not enter the loop, and I don't understand why. When I run the same program on another website with a slightly different page structure, it runs fine.
As I understand it, this data sits inside the #document node, which is a virtual element and doesn't mean anything, so the program can't read it, while the other website has no such thing. Kindly help me read the data inside the #document element.
Long Story Short
Change your tempUrl to http://www.psx.com.pk/phps/index1.php
Explanation
There is no table in the document at http://www.psx.com.pk/index.php.
Instead, it shows its content in two frames inside a frameset.
One is a dummy frame with the URL http://www.psx.com.pk/phps/blank.php.
The other is the real page, which shows the actual data; its URL is
http://www.psx.com.pk/phps/index1.php
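If you would rather not hard-code the inner URL, one option is to read each frame's src from the outer page and resolve it against the outer page's address. With jsoup the frames can be selected with doc.select("frame"), and the abs: attribute prefix should give an already resolved URL. The sketch below shows only the resolution step with the standard library. The relative src values in it are assumptions for illustration, not copied from the real page, and FrameUrl is a made-up name.

```java
import java.net.URI;

// Resolves a frame's src value against the URL of the page that contains it.
class FrameUrl {

    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.psx.com.pk/index.php";
        // Hypothetical src values; the real frameset may spell them differently.
        System.out.println(resolve(page, "phps/index1.php"));
        // prints http://www.psx.com.pk/phps/index1.php
        System.out.println(resolve(page, "/phps/blank.php"));
        // prints http://www.psx.com.pk/phps/blank.php
    }
}
```

The resolved address can then be passed to Jsoup.connect in place of the original tempUrl.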
How can I get a link's web address not by its title but by the link text (in this case "następna strona", which means "next page") from the HTML code?
More specifically, I want to extract the address of the link whose text is
następna strona
package outerDictionary;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class adressWWW {
public static void main(String[] args) {
Document doc;
List<String> wikiWords = new ArrayList<String>();
String addresWWW="http://pl.wiktionary.org/w/index.php?title=Kategoria:angielski_(indeks)&pagefrom=abducent#mw-pages";
try {
doc = Jsoup .connect(addresWWW).get();
String title = doc.title();
System.out.println(title);
//Element inDiv = doc.select("a[title=Kategoria:angielski (indeks)]").first();
Element inDiv = doc.select("a[title=Kategoria:angielski (indeks)]następna strona").first();
System.out.println(inDiv);
String row = inDiv.attr("abs:href");
System.out.println("xxx "+row);
// System.out.println(row.text());}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
for (String x : wikiWords)
System.out.println(x);
System.out.println(wikiWords.size());
}}
You can test the text of each link:
Document doc = Jsoup.connect("http://pl.wiktionary.org/w/index.php?title=Kategoria:angielski_(indeks)&pagefrom=abducent#mw-pages").get();
for( Element element : doc.select("a") )
{
if( element.text().equalsIgnoreCase("następna strona") )
{
System.out.println(element);
}
}
Or using the selector syntax:
// ...
for( Element element : doc.select("a:contains(następna strona)") )
{
System.out.println(element);
}
In both cases, the result is:
następna strona
następna strona
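One difference between the two approaches is worth keeping in mind: the loop with equalsIgnoreCase accepts only links whose whole text equals the phrase, while :contains is a case-insensitive "contains" test, so it also accepts longer texts that include the phrase. The sketch below models both rules on plain strings, with no jsoup involved. The sample link texts and the names LinkTextMatch, exactIgnoreCase and containsIgnoreCase are made up for illustration, and \u0119 is the Polish letter e with ogonek, written as an escape to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Models the two text tests from the answer on plain strings; no jsoup involved.
class LinkTextMatch {

    // Like the explicit loop: the whole link text must equal the wanted text.
    static List<String> exactIgnoreCase(List<String> texts, String wanted) {
        List<String> hits = new ArrayList<>();
        for (String text : texts) {
            if (text.equalsIgnoreCase(wanted)) {
                hits.add(text);
            }
        }
        return hits;
    }

    // Like a "contains" test: the wanted text may be only part of the link text.
    static List<String> containsIgnoreCase(List<String> texts, String wanted) {
        String needle = wanted.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String text : texts) {
            if (text.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(text);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical link texts, not taken from the real page.
        List<String> texts = Arrays.asList(
                "poprzednia strona",
                "nast\u0119pna strona",
                "Nast\u0119pna strona (200)");
        String wanted = "nast\u0119pna strona";
        System.out.println(exactIgnoreCase(texts, wanted).size());    // 1
        System.out.println(containsIgnoreCase(texts, wanted).size()); // 2
    }
}
```

Once the right element is found, element.attr("abs:href"), as used in the question, returns the link's address.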