I have a project that requires me to use Jsoup for web scraping. I was able to get the data from the main page of the website I want to scrape, but as I scrape deeper by looping over the hyperlinks and opening each one, I get the following error:
java.io.IOException: Input is binary and unsupported
at org.jsoup.UncheckedIOException.<init>(UncheckedIOException.java:11)
at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:38)
at org.jsoup.parser.CharacterReader.<init>(CharacterReader.java:43)
at org.jsoup.parser.TreeBuilder.initialiseParse(TreeBuilder.java:38)
at org.jsoup.parser.HtmlTreeBuilder.initialiseParse(HtmlTreeBuilder.java:65)
at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:46)
at org.jsoup.parser.Parser.parseInput(Parser.java:35)
at org.jsoup.helper.DataUtil.parseInputStream(DataUtil.java:169)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:835)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:285)
When I inspect the website, parts of it contain commented-out binary data, and I think that caused the problem. I've tried using this code:
Document docs2 = Jsoup.connect("https://www.kiatravels.co.id/group_tour/index?TOUR_ID=1467&ID=15803").ignoreContentType(true).get();
but it still didn't work.
Here's hoping some brainy code master can help!
It looks like you followed the "Download Itinerary" link, which opens a PDF. Before parsing a link with Jsoup, you'll want to check the Content-Type of the response.
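// ignoreContentType(true) lets execute() return for non-HTML responses so the type can be inspected
Connection.Response res = Jsoup.connect(url).ignoreContentType(true).execute();
String contentType = res.contentType();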
You'll probably want to ignore MIME types that are not text/html.
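A minimal sketch of that check, using only the standard library. The class name ContentTypes and the method isHtml are hypothetical helpers for illustration, not part of Jsoup; the header value is assumed to come from res.contentType().

```java
import java.util.Locale;

/** Hypothetical helper: decides whether a Content-Type header value describes an HTML page. */
final class ContentTypes {
    private ContentTypes() {}

    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no header at all: treat as unknown and do not parse
        }
        // Drop parameters such as "; charset=UTF-8" and compare case-insensitively.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/html") || mime.equals("application/xhtml+xml");
    }
}
```

In the crawl loop, you would call res.parse() only when ContentTypes.isHtml(res.contentType()) returns true, and skip (or download separately) everything else, such as the PDF.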
According to the HTML, the source code is:
{result.data}
When I request the URL in a browser, result.data is set to 100 and I can see the value 100 on the page. But when I request the same URL from a Java program, I don't see the value I saw in the browser.
URL url = new URL(site)
url.openConnection() etc..
I want to get the same content through a Java program as I see in the browser.
Your question is not very descriptive, but I guess you are trying to scrape data from the site.
You can use the following libraries for this task:
Jaunt (http://jaunt-api.com)
Jsoup (http://jsoup.org/cookbook/extracting-data/dom-navigation)
HTMLUnit
From what I understand, you want to do one of the following:
Instead of reading the result line by line, you want to parse it like XML so that you can traverse the div(s) and other HTML tags.
For this purpose I would suggest the jsoup library.
When you open the URL www.abcd.com/number=500 in a browser, it loads an empty div and then fetches data from somewhere on load, and you want to fetch that same data from Java?
For this, there must be some JavaScript in the page that fetches the data by calling a service on page load. You will need to look through the page to find the service details, and then call that service instead of this URL (www.abcd.com/number=500) to get the data.
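As a rough sketch of the second case, assuming Java 11 or later and assuming the page's script calls a JSON service: the class ServiceFetch is a hypothetical name, and the service URL is something you would have to find yourself in the page source or the browser's network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that calls the backing service directly instead of the HTML page. */
final class ServiceFetch {
    private ServiceFetch() {}

    /** Builds a GET request for the data endpoint (the URL is an assumption supplied by the caller). */
    static HttpRequest buildRequest(String serviceUrl) {
        return HttpRequest.newBuilder(URI.create(serviceUrl))
                .header("Accept", "application/json") // assumes the service returns JSON
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body as text. */
    static String fetch(String serviceUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(serviceUrl), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

The body returned by fetch is whatever the service sends (often JSON), so it would still need to be parsed to pull out the value the browser displays.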
I asked this question before and Evgeniy Dorofeev answered it. His answer only worked for direct links, but I accepted it. He told me to check the content type of a direct link like this:
String requestUrl = "https://dl-ssl.google.com/android/repository/android-14_r04.zip";
URL url = new URL(requestUrl);
URLConnection c = url.openConnection();
String contentType = c.getContentType();
As far as I know, there are two types of download URL:
Direct link. For example: https://dl-ssl.google.com/android/repository/android-14_r04.zip. From this link we can download the data directly and read the file name, including the file extension (.zip in this link), so we know what kind of file will be downloaded. You can try downloading from that link.
Indirect link. For example: http://www.example.com/directory/download?file=52378. Have you ever downloaded data from Google Drive? It gives you an indirect link like the one above. We never know whether such a link points to a file or a web page, and we don't know the file name or extension either, because this type of link is opaque and arbitrary.
I need to check whether it is a file or webpage. I must download it if the content type is a file.
So my question:
How do I check the content type of an indirect link?
As suggested in the comments on this question, can HTTP redirects solve the problem?
Thanks for your help.
After you open a URLConnection, the response headers are available. They contain some information about the file, and you can pull what you need from them. For example:
URLConnection u = url.openConnection();
// getContentLengthLong() returns -1 when the Content-Length header is missing or unparsable
long length = u.getContentLengthLong();
String type = u.getHeaderField("Content-Type");
length is the size of the file in bytes (or -1 if the server did not report it), and type is something like application/x-dosexec or application/x-rar.
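Such links send browsers to the actual content using HTTP redirects. To get the correct content type, tell HttpURLConnection to follow the redirects: call the static HttpURLConnection.setFollowRedirects(true), or setInstanceFollowRedirects(true) on a single connection. Redirects are followed by default, but HttpURLConnection does not follow a redirect that switches between http and https.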
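Here is a sketch of following the redirects by hand, which also covers redirects that switch protocol. RedirectProbe, isRedirect and contentTypeOf are hypothetical names; the sketch assumes an http or https URL and a server that answers HEAD requests (some do not, in which case GET would be needed).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical helper that resolves an indirect link and reports the final Content-Type. */
final class RedirectProbe {
    private RedirectProbe() {}

    /** True for the HTTP status codes that carry a Location header to follow. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307 || status == 308;
    }

    /** Follows up to maxHops redirects and returns the final Content-Type header (may be null). */
    static String contentTypeOf(String address, int maxHops) throws IOException {
        URL url = new URL(address);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(false); // we handle redirects ourselves
            conn.setRequestMethod("HEAD");          // ask for headers only, not the body
            try {
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                if (!isRedirect(status) || location == null) {
                    return conn.getContentType();
                }
                url = new URL(url, location); // resolves relative Location values too
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("Too many redirects for " + address);
    }
}
```

A call such as RedirectProbe.contentTypeOf(requestUrl, 5) would then return the type of whatever the link finally resolves to, which you can compare against text/html to decide whether to download it as a file.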
MimeTypeMap.getFileExtensionFromUrl(url)
This worked for me: use Retrofit to check the response headers. First define an endpoint that you can call with the URL you want to check:
@GET
suspend fun getContentType(@Url url: String): Response<Unit>
Then you call it like this to get the content type header:
api.getContentType(url).headers()["content-type"]
If I have an HTML String object, how can I get the browser to open that String as an HTML page using Selenium in Java? I have seen this done before, but I don't remember the format the URL needs to have.
For this example, let's say the string is :
<h2>This is a <i>test</i></h2>
I looked through this page and couldn't find the answer but I might be overlooking it. For example I tried this URL and it didn't work for me:
data:<h2>This is a <i>test</i></h2>
Here is the documentation: http://en.wikipedia.org/wiki/Data_URI_scheme. You need to specify the MIME type of the data. Try data:text/html,<h2>This is a <i>test</i></h2>
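For arbitrary HTML, characters such as # or % can break a plain data: URL, so one option is to Base64-encode the content. The following is a minimal sketch; DataUrls and htmlToDataUrl are hypothetical names, not part of Selenium.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that wraps an HTML string in a Base64 data: URL. */
final class DataUrls {
    private DataUrls() {}

    static String htmlToDataUrl(String html) {
        // Base64 avoids having to percent-encode reserved characters in the HTML.
        String encoded = Base64.getEncoder().encodeToString(html.getBytes(StandardCharsets.UTF_8));
        return "data:text/html;charset=utf-8;base64," + encoded;
    }
}
```

With a Selenium WebDriver instance named driver (an assumption here), the page could then be opened with driver.get(DataUrls.htmlToDataUrl("<h2>This is a <i>test</i></h2>")).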
Jsoup.connect("http://www.design.cmu.edu/community.php?s=3").get();
Could someone please show me why the code gave me the error:
java.nio.charset.IllegalCharsetNameException: 'ISO-8859-1'
The problem is in the target page. It is not well-formed at all.
When parsing the page, Jsoup tries to fix it and, among other things, reads the content type as "text/html; charset='iso-8859-1'" (single quotes included).
It then takes the charset name, still wrapped in single quotes, and uses it to look up the charset:
Charset.forName("'ISO-8859-1'");
which fails.
Maybe you can use this alternative instead, which doesn't parse the charset from the page, because you explicitly pass it along:
String url = "http://www.design.cmu.edu/community.php?s=3";
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
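If you would rather keep whatever charset the page declares, another option is to clean the name before looking it up. This is only a sketch of that idea; CharsetNames and lenientForName are hypothetical names, and this is not what Jsoup itself does internally.

```java
import java.nio.charset.Charset;

/** Hypothetical helper that tolerates quoted or unknown charset names. */
final class CharsetNames {
    private CharsetNames() {}

    static Charset lenientForName(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        // Strip leading and trailing single or double quotes, e.g. 'ISO-8859-1' -> ISO-8859-1.
        String cleaned = name.trim().replaceAll("^[\"']+|[\"']+$", "");
        try {
            return Charset.forName(cleaned);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }
}
```

The resulting Charset's name() could then be passed as the second argument of Jsoup.parse(InputStream, String, String), as in the snippet above.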
I am trying to scrape data from a website which uses javascript to load much of their content. Right now I am using jSoup to parse html pages, however since much of the content is loaded using javascript I haven't been able to parse the data I want.
How should I go about getting this javascript content? Should I first save the page then load and parse it using jSoup? If so, what should I use to load javascript content before I save? Is there an API which you would recommend that could output html?
Currently using java.
You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context - among other things, you can define a "ready" function for the page and wait to scrape until the function (which might check for the existence of certain DOM elements, etc) returns true.
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
If the data is generated by JavaScript embedded in the page (rather than fetched by a separate request), then the data is already in the downloaded page.
It is better to parse it directly, as you would with plain HTML or text.
If you cannot isolate the tokens with the Jsoup API, parse them with plain String operations, treating the page as text.
I tried using HtmlUnit, but I found it very slow.
I ended up calling the curl command-line tool from Java, which worked for my purposes.
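// Requires java.io.BufferedReader and java.io.InputStreamReader.
// Pass the arguments as an array so the URL is not split on spaces; -s hides curl's progress meter.
Process p = Runtime.getRuntime().exec(new String[] {"curl", "-s", url});
StringBuilder html = new StringBuilder();
try (BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String s;
    while ((s = stdInput.readLine()) != null) {
        html.append(s).append("\n");
    }
}
return html.toString();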