When I scrape with the following code, it does not show any elements within the body tag, but when I check manually with view-source, the elements are there in the body. How can I scrape the hyperlinks from the following URL?
public static void main(String[] args) throws SQLException, IOException {
    String search_url = "http://www.manta.com/search?search=geico";
    Document doc = Jsoup.connect(search_url).userAgent("Mozilla").get();
    System.out.println(doc);
    Elements links = doc.select("a[href]");
    System.out.println(links);
    for (Element a : links) {
        System.out.println(a);
        String linkhref = a.attr("href");
        System.out.println(linkhref);
    }
}
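One thing worth checking (this is a guess, not something confirmed for this site): if the search results are built by JavaScript, or the server sends a different page to non-browser clients, Jsoup will only ever see the static HTML in the response. A minimal diagnostic sketch, using only standard Jsoup calls, to look at what the server actually returns for this request; the user agent and timeout values are arbitrary examples:

Connection.Response response = Jsoup.connect("http://www.manta.com/search?search=geico")
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .execute();
// HTTP status of the response as Jsoup saw it
System.out.println("Status: " + response.statusCode());
// If the anchors are missing from this raw body, the links are not in the
// static HTML and Jsoup alone cannot scrape them.
System.out.println(response.body());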
I'm working on code for parsing a weather site.
I found the CSS class that holds the data I need on the website. How do I pull the date out of it as a string, e.g. "Tue, Oct 12"?
public class Pars {
    private static Document getPage() throws IOException {
        String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
        Document page = Jsoup.parse(new URL(url), 3000);
        return page;
    }

    public static void main(String[] args) throws IOException {
        Document page = getPage();
        Element Nameday = page.select("div [class=date date-2]").first();
        String date = Nameday.select("div [class=date date-2").text();
        System.out.println(Nameday);
    }
}
The code is meant to parse the weather site. On the page I found the right class, which contains only the date and day of the week that I need. But at the step where I try to convert the element's data to a string, the code throws an error.
The problem is with the class selector; it should look like this: div.date.date-2
Working code example:
public class Pars {
    private static Document getPage() throws IOException {
        String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
        return Jsoup.parse(new URL(url), 3000);
    }

    public static void main(String[] args) throws IOException {
        Document page = getPage();
        Element dateDiv = page.select("div.date.date-2").first();
        if (dateDiv != null) {
            String date = dateDiv.text();
            System.out.println(date);
        }
    }
}
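As a side note, the same element can also be looked up by a single class name, which sidesteps selector-syntax mistakes like the missing bracket above. A small sketch, assuming the live markup really carries date-2 as one of its class tokens:

// Alternative lookup by class name instead of a CSS selector.
Element dateDiv = page.getElementsByClass("date-2").first();
if (dateDiv != null) {
    System.out.println(dateDiv.text());
}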
Here is an answer to your problem: Jsoup select div having multiple classes
In future, please make sure your question is more detailed and well structured. Here is the "asking questions" guideline: https://stackoverflow.com/help/how-to-ask
I need to read a value on https://new.ppy.sh/u/9889129 from the div with class profile-header-extra__rank-box (the pp score), but it returns nothing. How can I do it?
public class mainClass {
    public static void main(String[] args) throws Exception {
        String url = "https://new.ppy.sh/u/9889129";
        Document document = Jsoup.connect(url).get();
        String ppValue = document.select(".profile-header-extra__rank-global").text();
        System.out.println("PP: " + ppValue);
    }
}
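A likely cause, though not verified against this site: profile pages like this are often rendered client-side with JavaScript, and Jsoup only parses the static HTML the server sends, so the rank box may simply not be in that document. A small diagnostic sketch, using only standard Jsoup calls, to check whether the class exists in what Jsoup actually downloaded:

// Fetch the page and check whether the target class is present in the static HTML.
Document document = Jsoup.connect("https://new.ppy.sh/u/9889129").get();
Elements rankBoxes = document.getElementsByClass("profile-header-extra__rank-box");
if (rankBoxes.isEmpty()) {
    // The element is not in the server-rendered HTML; Jsoup alone cannot read
    // values that are injected later by JavaScript.
    System.out.println("rank box not present in the static HTML");
} else {
    System.out.println("PP: " + rankBoxes.first().text());
}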
So I have started with some Java; I'm not that good, I'm still a beginner.
What I'm trying to do is grab specific information from Yahoo Finance with Jsoup.
public class WebScraping {
    public static void main(String[] args) throws Exception {
        String url = "https://in.finance.yahoo.com/q/is?s=AAPL&annual";
        Document document = Jsoup.connect(url).get();
        String information = document.select(".yfnc_tabledata1").text();
        System.out.println("Information: " + information);
    }
}
But I get the whole table; I want specific information, like the Net Income, and only for the year 2015.
So I found the solution:
public static void main(String[] args) throws Exception {
    String url = "https://in.finance.yahoo.com/q/is?s=AAPL&annual";
    Document document = Jsoup.connect(url).get();
    String information = document.select("table tr:eq(7) > td:eq(2)").text();
    System.out.println("Information: " + information);
}
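For context, jsoup's :eq(n) pseudo-selector matches by 0-based sibling index, so table tr:eq(7) > td:eq(2) means "the third cell of the eighth row" and silently breaks if the table layout shifts. A sketch of a selector keyed on the row label instead; the td index is an assumption about which column holds the 2015 figure:

// Select the first row whose text contains "Net Income", then take its third cell.
// :contains does a case-insensitive text match in jsoup.
Element cell = document.select("table tr:contains(Net Income) td:eq(2)").first();
if (cell != null) {
    System.out.println("Net Income: " + cell.text());
}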
I'm using jsoup to clean an HTML page. The problem is that when I save the HTML locally, the images do not show, because they are all relative links.
Here's some example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class so2 {
    public static void main(String[] args) {
        String html = "<html><head><title>The Title</title></head>"
                + "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
        Document doc = Jsoup.parse(html, "https://whatever.com"); // baseUri seems to be ignored??
        System.out.println(doc);
    }
}
Output:
<html>
<head>
<title>The Title</title>
</head>
<body>
<p><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></p>
</body>
</html>
The output still shows the relative URL, src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
I would like it converted to src="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
Can anyone show me how to get jsoup to convert all the links (href and src attributes) to absolute links?
You can select all the links and transform their hrefs to absolute using Element.absUrl()
Example in your code:
EDIT (added processing of images)
public static void main(String[] args) {
    String html = "<html><head><title>The Title</title></head>"
            + "<body><p><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></p></body></html>";
    Document doc = Jsoup.parse(html, "https://whatever.com");
    Elements select = doc.select("a");
    for (Element e : select) {
        // baseUri will be used by absUrl
        String absUrl = e.absUrl("href");
        e.attr("href", absUrl);
    }
    // now we process the imgs
    select = doc.select("img");
    for (Element e : select) {
        e.attr("src", e.absUrl("src"));
    }
    System.out.println(doc);
}
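The same idea can be generalized with attribute selectors so that any element carrying an href or src gets rewritten, not just a and img tags. A short sketch along those lines; which attributes matter is an assumption about your pages:

// Rewrite every href and src attribute to its absolute form, resolved against
// the document's baseUri. absUrl returns an empty string when it cannot
// resolve a value, so malformed URLs end up blank.
for (Element e : doc.select("[href]")) {
    e.attr("href", e.absUrl("href"));
}
for (Element e : doc.select("[src]")) {
    e.attr("src", e.absUrl("src"));
}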
I am trying to get the data from a website with this code:
@WebServlet(description = "get content from teamforge", urlPatterns = { "/JsoupEx" })
public class JsoupEx extends HttpServlet {
    private static final long serialVersionUID = 1L;
    private static final String URL = "http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505";

    public JsoupEx() {
        super();
    }

    protected void doGet(HttpServletRequest request,
            HttpServletResponse response) throws ServletException, IOException {
        Document doc = Jsoup.connect(URL).get();
        for (Element table : doc.select("table.DataTbl")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                if (tds.size() > 1) {
                    System.out.println(tds.get(0).text() + ":"
                            + tds.get(2).text());
                }
            }
        }
    }
}
I am using the jsoup parser. When I run it, I do not get any errors, just no output. Please help with this.
With the following code
public class Tester {
    private static final String URL = "http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505";

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(URL).get();
        System.out.println(doc);
    }
}
I get a java.net.SocketTimeoutException: Read timed out. I think the particular URL you are trying to crawl is too slow for Jsoup's default timeout. Being in Europe, my connection might be slower than yours. However, you might want to check for this exception in the log of your application server.
By setting the timeout to 10 seconds, I was able to download and parse the document:
Connection connection = Jsoup.connect(URL);
connection.timeout(10000);
Document doc = connection.get();
System.out.println(doc);
With the rest of your code I get:
Population:78,413
Population Change Since 1990:53.00%
Population Density:6,897
Male:41,137
Female:37,278
.....
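As a minor style note (assuming a reasonably recent jsoup version), the Connection setters return the Connection itself, so the timeout can be set inline:

// Equivalent one-liner with the timeout chained onto the connect call.
Document doc = Jsoup.connect(URL).timeout(10000).get();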
Thanks Julien, I tried with the following code and am still getting a SocketTimeoutException. The code is:
Connection connection = Jsoup.connect("http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505");
connection.timeout(10000);
Document doc = connection.get();
System.out.println(doc);
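If the timeout still trips at 10 seconds, the server may just be very slow or throttling the client. A hedged sketch that raises the timeout further and sends a browser-like user agent; the concrete values are arbitrary guesses, not known requirements of that site:

// Retry with a longer timeout and an explicit user agent; both are plain
// Jsoup Connection settings.
Document doc = Jsoup.connect("http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505")
        .userAgent("Mozilla/5.0")
        .timeout(30000)
        .get();
System.out.println(doc);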