I want to remove the script elements when reading from a URL rather than a file. Please help me. Here is what I tried:
Document connect = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm");
Elements selects = connect.select("div.middle-col");
System.out.println(selects.removeAttr("script").html());
This is how you can remove the script elements:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String args[]) throws IOException {
        Document doc = Jsoup.connect("http://www.tutorialspoint.com/ant/ant_deploying_applications.htm").get();
        Elements selects = doc.select("div.middle-col");
        // Remove every <script> element inside each selected div.
        for (Element el : selects) {
            Elements scripts = el.select("script");
            scripts.remove();
        }
        System.out.println(selects.html());
    }
}
Additionally, you can use Jsoup.clean(html, whitelist) to strip unwanted tags such as <script>.
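For example, a minimal sketch of the clean() approach (Whitelist.relaxed() is one of Jsoup's built-in presets; none of the presets allow <script>, so it gets stripped):

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {
    public static void main(String[] args) {
        String html = "<div><script>alert('hi');</script><p>Some text</p></div>";
        // clean() parses the fragment and drops every tag not in the whitelist.
        String clean = Jsoup.clean(html, Whitelist.relaxed());
        System.out.println(clean); // the <script> element is gone
    }
}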
I am trying to get data from a webpage (http://steamcommunity.com/id/Winning117/games/?tab=all) using a specific tag but I keep getting null. My desired result is to get the "hours played" for a specific game - Cluckles' Adventure in this case.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();
        Element playTime = document.select("div#game_605250").first();
        System.out.println(playTime);
    }
}
Edit: How can I tell if a webpage is using JavaScript and is therefore unable to be parsed by Jsoup?
To execute JavaScript from Java code, there is Selenium:
Selenium-WebDriver makes direct calls to the browser using each
browser’s native support for automation.
To include it with Maven, use this dependency:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-server</artifactId>
    <version>3.4.0</version>
</dependency>
Next, here is the code for a simple JUnit test that creates a WebDriver instance, goes to the given URL, and executes a simple script to get rgGames.
You have to download the chromedriver executable from https://sites.google.com/a/chromium.org/chromedriver/downloads.
package SeleniumProject.selenium;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import junit.framework.TestCase;

@RunWith(JUnit4.class)
public class ChromeTest extends TestCase {
    private static ChromeDriverService service;
    private WebDriver driver;

    @BeforeClass
    public static void createAndStartService() {
        service = new ChromeDriverService.Builder()
                .usingDriverExecutable(new File("D:\\Downloads\\chromedriver_win32\\chromedriver.exe"))
                .withVerbose(false).usingAnyFreePort().build();
        try {
            service.start();
        } catch (IOException e) {
            System.out.println("service didn't start");
            e.printStackTrace();
        }
    }

    @AfterClass
    public static void createAndStopService() {
        service.stop();
    }

    @Before
    public void createDriver() {
        ChromeOptions chromeOptions = new ChromeOptions();
        DesiredCapabilities capabilities = DesiredCapabilities.chrome();
        capabilities.setCapability(ChromeOptions.CAPABILITY, chromeOptions);
        driver = new RemoteWebDriver(service.getUrl(), capabilities);
    }

    @After
    public void quitDriver() {
        driver.quit();
    }

    @Test
    public void testJS() {
        JavascriptExecutor js = (JavascriptExecutor) driver;
        // Load the page in the current browser window.
        driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");
        // Execute JavaScript in the context of the currently selected frame or window.
        ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
        // Each Map holds the properties of one game.
        for (Map map : list) {
            for (Object key : map.keySet()) {
                // Find the key "name" and compare its value to "Cluckles' Adventure".
                if (key instanceof String && key.equals("name") && map.get(key).equals("Cluckles' Adventure")) {
                    // Print all properties of the game Cluckles' Adventure.
                    map.forEach((key1, value) -> {
                        System.out.println(key1 + " : " + value);
                    });
                }
            }
        }
    }
}
As you can see, Selenium loads the page at
driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");
and, to get the data for all games owned by Winning117, it returns the rgGames variable:
ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
The page you want to scrape is rendered by JavaScript, so there is no #game_605250 element in the HTML that Jsoup receives. All the data is written into the page by JavaScript.
But when I print the document to a file, I see some data like this:
<script language="javascript">
var rgGames = [{"appid":224260,"name":"No More Room in Hell","logo":"http:\/\/cdn.steamstatic.com.8686c.com\/steamcommunity\/public\/images\/apps\/224260\/670e9aba35dc53a6eb2bc686d302d357a4939489.jpg","friendlyURL":224260,"availStatLinks":{"achievements":true,"global_achievements":true,"stats":false,"leaderboards":false,"global_leaderboards":false},"hours_forever":"515","last_played":1492042097},{"appid":241540,"name":"State of Decay","logo":"http:\/\/....
Then you can extract rgGames with some string manipulation and parse it into a JSON object.
It's not a clever method, but it works.
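A minimal sketch of that extraction (the marker strings below are assumptions based on the dump above and may need adjusting if Steam changes its markup):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractRgGames {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://steamcommunity.com/id/Winning117/games/?tab=all").get();
        for (Element script : doc.select("script")) {
            String data = script.data(); // raw script text, not rendered HTML
            int start = data.indexOf("var rgGames = ");
            if (start == -1) continue;
            // The array literal runs from after "var rgGames = " to the closing "]".
            int end = data.indexOf("];", start);
            String json = data.substring(start + "var rgGames = ".length(), end + 1);
            System.out.println(json); // parse with any JSON library (e.g. Gson or org.json)
        }
    }
}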
Try this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();
        // select() returns Elements; first() takes the first match (null if none).
        Element playTime = document.select("div#game_605250").first();
        Elements val = playTime.select(".hours_played");
        System.out.println(val.text());
    }
}
I am trying to learn the basic methods of Jsoup. I tried to get all the hyperlinks of a particular web page. With the javatpoint link it works, but when I use the stackoverflow link I am unable to get any hyperlinks from that page.
Can someone explain why?
Here is the code.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class Repo {
    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            System.out.println("doc");
            Elements link = doc.select("a[href]");
            for (Element el : link) {
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
        }
    }
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
        .connect("http://www.stackoverflow.com")
        .userAgent("Mozilla/5.0")
        .get();
Side note:
You should never catch exceptions and then silently ignore the failure case. At least do some logging there; otherwise your programs will be very hard to debug.
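For example, a minimal sketch of the same fetch with the failure at least logged:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchWithLogging {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.stackoverflow.com")
                    .userAgent("Mozilla/5.0")
                    .get();
            System.out.println(doc.title());
        } catch (IOException e) {
            // Log the failure instead of swallowing it silently.
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}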
I have searched everywhere for the correct solution but still couldn't fix this. Please take a look and help me.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewClass {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
        Elements section = doc.select("section#content");
        Elements article = section.select("article");
        for (Element a : article) {
            System.out.println("Title : \n" + a.select("a").text());
            System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
        }
    }
}
I have the above code for getting an article and its contents from a single page.
Document doc = Jsoup.connect("http://tamilblog.ishafoundation.org").get();
I want to do this for several websites. Starting from this line, or using some iteration, I want to apply my code to several web pages, say 500+. And I want to save each article in a separate text document, under its article title, together with its contents.
I am new to coding, so I could not find the correct code. I have been working on this for the past two months.
For starters, you can do something like this:

String[] urls = { "http://tamilblog.ishafoundation.org", "url2", "url3" }; // your 500 urls will be stored here
for (String url : urls) {
    Document doc = Jsoup.connect(url).get();
    Elements section = doc.select("section#content");
    Elements article = section.select("article");
    for (Element a : article) {
        System.out.println("Title : \n" + a.select("a").text());
        System.out.println("Article summary: \n" + a.select("div.entry-summary").text());
    }
}
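To also save each article to its own text file named after its title, something like this could work (a minimal sketch; the filename sanitization is a simplistic assumption and the selectors are taken from the question):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SaveArticles {
    public static void main(String[] args) throws IOException {
        String[] urls = { "http://tamilblog.ishafoundation.org" }; // add your 500+ urls here
        for (String url : urls) {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("section#content article")) {
                String title = a.select("a").text();
                String summary = a.select("div.entry-summary").text();
                // Replace characters that are unsafe in file names.
                String fileName = title.replaceAll("[^A-Za-z0-9 .-]", "_") + ".txt";
                Files.write(Paths.get(fileName),
                        (title + "\n\n" + summary).getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}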
I have the following code that is supposed to extract data from an HTML document. I used Eclipse. It gives me two errors (even though this code is copied and pasted from the Jsoup site as a tutorial). The errors are on 1) File and 2) Elements. I can't see any problem with these two types.
import java.io.IOException;
import java.net.MalformedURLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestClass {
    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/tmp/input.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            Element content = doc.getElementById("content");
            Elements links = content.getElementsByTag("a");
            for (Element link : links) {
                String linkHref = link.attr("href");
                String linkText = link.text();
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}
You forgot to import them.
import java.io.File;
import org.jsoup.select.Elements;
See also:
Java tutorial - Using package members
Hint: read the "Quick Fix" options suggested by Eclipse. Importing java.io.File is already the first option it offers for File.
Can anyone help me figure out why I can't use Jsoup to read the table in the link below?
http://data.fpt.vn/InfoDNS.aspx?domain=google.com
I use it to get the DNS records of a host. Here is the code that I used:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class dnsjava {
    public static void main(String... args) throws Exception {
        String fpt = "http://data.fpt.vn/InfoDNS.aspx?domain=google.com";
        String espn = "http://espn.go.com/mens-college-basketball/conferences/standings/_/id/2/year/2012/acc-conference";
        Document doc = Jsoup.connect(fpt).get();
        Elements table = doc.select("table.tabular");
        for (Element row : table.select("tr")) {
            Elements tds = row.select("td");
            System.out.println(tds.text());
        }
    }
}
It works with the espn URL and doc.select("table.tablehead"), but with the fpt URL nothing happens!
Thank you for your help!
It looks like the response you are seeking is not present. When I did "view source" (in the browser) on that link, the table that
doc.select("table.tabular");
looks for is not there: "tabular" does not appear anywhere in the response.