Java - Reading URL Response

I want to grab the source code of a page which contains the word "true" or "false". Those are the only two words on that page, with no other formatting.
So I just need Java to connect to that page's URL "http://example.com/example.php" and grab the contents.

Try this tutorial: Java URL example - Download the contents of a URL

There is an easy way to do that using Apache Commons IO. For example, for the action you mention, you can just write:
URL url = new URL("http://example.com/example.php");
// IOUtils.toString reads the whole response body ("true" or "false") into a String
String content = IOUtils.toString(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
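If you would rather not add a dependency, the same thing can be done with the plain JDK. A minimal sketch, assuming the placeholder URL from the question and that the page really contains only "true" or "false":
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TrueFalseCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/example.php");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }
        // The page contains only the word "true" or "false"
        boolean value = Boolean.parseBoolean(body.toString().trim());
        System.out.println(value);
    }
}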

Related

Read Twitter page from Java

I would like to read the content of a tweet from Java, by using the public link of that tweet, e.g.:
http://twitter.com/FoodRhythms/status/461201880354271232/photo/1
I am using the same procedure used for reading content from other types of pages:
String XMLstring = "";
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
XMLstring += inputLine;
in.close();
However, while this works with other pages, when reading from Twitter links the returned content is empty (the BufferedReader does not yield any lines).
Any hint on this?
You are approaching the problem in an unnecessarily complicated way. Use an HTML parser library, such as Jsoup, to fetch the URL and parse its content.
An example would be as follows:
String url = "https://twitter.com/FoodRhythms/status/461201880354271232/photo/1";
Document doc = Jsoup.connect(url).get();
Element tweetText = doc.select("p.js-tweet-text.tweet-text").first();
System.out.println(tweetText.text());
which would output
Miso-Glazed Japanese Eggplant
In a similar fashion, you can select any element you want!
Reading from Twitter requires OAuth authentication; you will need to use a Twitter library for Java (e.g. Twitter4J) to get the required data.
Hope this helps!
http://www.javacodegeeks.com/2012/03/twitter-api-on-your-java-application.html
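For reference, a minimal sketch of reading that tweet through the official API with Twitter4J, assuming Twitter4J is the library meant above and that OAuth credentials are already configured (e.g. in twitter4j.properties); the numeric status ID is taken from the URL in the question:
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class ReadTweet {
    public static void main(String[] args) throws Exception {
        // Credentials are picked up from twitter4j.properties
        Twitter twitter = TwitterFactory.getSingleton();
        // The numeric ID is the part after /status/ in the tweet URL
        Status status = twitter.showStatus(461201880354271232L);
        System.out.println(status.getText());
    }
}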
The problem with your code right now is that you are trying to access Twitter over http, where an implicit redirect to https is happening. Using your exact code, change the URL to https and all should work fine.
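In other words, the only change needed in the reading loop above is the scheme; the connection follows redirects within a protocol but not from http to https, which is why the body comes back empty:
// Same BufferedReader loop as in the question, just over https
URL url = new URL("https://twitter.com/FoodRhythms/status/461201880354271232/photo/1");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));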

Web Crawler specifically for downloading images and files

I am doing an assignment for one of my classes.
I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice pieces of software, but they are not perfect.
I used the default Java URLConnection to check the content type before processing the URL, but it becomes really slow as the number of links grows.
Question: does anyone know of a specialized parser API for images and links?
I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
I need to check the content type while looping through the links (to tell whether a link points to a file) in an efficient way, but Jsoup does not have what I need. Here's what I have:
HttpConnection mimeConn = null;
Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        if (DownloadRepository.curlExists(link.absUrl("href"))) {
            continue;
        }
        mimeConn = (HttpConnection) Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = (Response) mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        String contentType = mimeResponse.contentType();
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think this API is good enough for your purpose. You can also find a good cookbook on that site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method which walks through the links on a page that contain the necessary domain name or relative links. Use this approach to grab all the links and find all the images on them. Writing it yourself is not bad practice (a rough sketch of such a recursion is included after the update below).
You don't need to use the URLConnection class; Jsoup has a wrapper for it.
For example, you can get the DOM object with a single line of code:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);
in.close();
Update 1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
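Putting the pieces together, here is a rough sketch of the recursion mentioned above. It only prints what it finds and makes no attempt at politeness or de-duplication; in the real assignment you would plug in your own WebPage/WebImage/WebFile handling and visited-URL tracking:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {
    // Crawl a page, dispatching each link by content type, up to maxDepth
    public static void crawl(String url, int depth, int maxDepth) throws Exception {
        if (depth > maxDepth) {
            return;
        }
        Document doc = Jsoup.connect(url).get();
        for (Element link : doc.select("a[href]")) {
            String linkUrl = link.absUrl("href");
            if (linkUrl.isEmpty() || linkUrl.contains("#")) {
                continue;
            }
            Connection.Response res = Jsoup.connect(linkUrl)
                    .ignoreContentType(true)
                    .ignoreHttpErrors(true)
                    .execute();
            String contentType = res.contentType();
            if (contentType != null && contentType.contains("html")) {
                crawl(linkUrl, depth + 1, maxDepth);      // recurse into HTML pages
            } else if (contentType != null && contentType.contains("image")) {
                System.out.println("image: " + linkUrl);  // e.g. page.addToImages(...)
            } else {
                System.out.println("file:  " + linkUrl);  // e.g. page.addToFiles(...)
            }
        }
    }

    public static void main(String[] args) throws Exception {
        crawl("http://en.wikipedia.org/", 0, 1);
    }
}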

Get the source code of a URL

I need to get the source code of a particular URL using Java code. I was able to get the source code for a UTF-8 encoded web page, but not for an ISO-8859-1 encoded one. My question: is it possible to get the source code of a website with ISO-8859-1 encoding using a Java program? Please help.
If you are reading with the following approach, you need to specify the character set explicitly:
URL url = new URL(URL_TO_READ);
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), "ISO-8859-1"));
However, if there is any parsing involved in your requirement, I would suggest you use Jsoup; it will read the character set from the server's response, and you can also set the charset explicitly.
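For illustration, a minimal Jsoup sketch of both options (the URL is a placeholder): Jsoup.connect() detects the charset from the response headers or meta tags, while Jsoup.parse() lets you force one explicitly:
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Let Jsoup detect the charset from the HTTP headers / <meta> tags
Document autoDetected = Jsoup.connect("http://example.com/latin1-page.html").get();

// Or force ISO-8859-1 explicitly when parsing the raw stream
URL url = new URL("http://example.com/latin1-page.html");
Document forced = Jsoup.parse(url.openStream(), "ISO-8859-1", url.toString());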

Returned URL as String is not valid in JSF

I'm trying to make use of the Google API as text-to-speech. So I build a String and then pass it as a URL to a component to obtain an MP3 with the spoken words.
So, this is my code:
URI uri = new URI("http://translate.google.com/translate_tts?tl=es&q="+ URLEncoder.encode((String)this.text.getValue(), "UTF-8"));
When I call uri.toString() it returns a well-formed URL. If I copy and paste that output into the browser it works perfectly.
But if I assign the returned String to the source property of an ice:outputMedia it does not work. Inspecting the HTML generated in the page, the String in the src property is:
http://translate.google.com/translate_tts?tl=es&amp;q=Bobby+need+peanuts
The & symbol has been replaced by &amp;.
How can I avoid this to make a valid URL?
You need to decode the url on the client side using Javascript.
var decoded = decodeURI(URI)

Java getting source code from a website

I have a problem once again where I can't find the source code because it's hidden or something... When my Java program indexes the page it finds everything but the info I need... I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
I really have no idea how to attempt to get the info that i need...
The reason for this behavior is probably that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is all you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
If they don't show up in the page source, they're likely being added dynamically by Javascript code. There's no way to get them from your server-side script short of including a javascript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try using Jsoup:
// Jsoup.parse(URL, timeoutMillis) fetches and parses the page
Document doc = Jsoup.parse(new URL("http://your-url-here"), 10000);
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?
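One tool commonly suggested for this in Java is HtmlUnit, a headless browser that executes the page's JavaScript before you read the DOM. A minimal sketch, assuming HtmlUnit is acceptable for your use case (the URL is a placeholder):
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedSourceFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            // getPage runs the page's scripts, so dynamically injected tr/td rows are present
            HtmlPage page = webClient.getPage("http://example.com/page-with-dynamic-table");
            webClient.waitForBackgroundJavaScript(5000); // give AJAX calls time to finish
            System.out.println(page.asXml());
        }
    }
}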
