Need help in getting HTML of a website in Java


I took some code from the question "java httpurlconnection cutting off html", and I am using pretty much the same code to fetch HTML from websites in Java.
Except for one particular website that I am unable to make this code work with:
I am trying to get HTML from this website:
http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289
But I keep getting junk characters, although the same code works very well with other websites such as http://www.google.com.
And this is the code that I am using:
public static String PrintHTML(){
URL url = null;
try {
url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
} catch (MalformedURLException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
HttpURLConnection connection = null;
try {
connection = (HttpURLConnection) url.openConnection();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
try {
System.out.println(connection.getResponseCode());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = null;
try {
reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
try {
while ((line = reader.readLine()) != null) {
builder.append(line);
builder.append("\n");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
String html = builder.toString();
System.out.println("HTML " + html);
return html;
}
I don't understand why it doesn't work with the URL that I mentioned above.
Any help will be appreciated.

That site is incorrectly gzipping the response regardless of the client's capabilities. Normally a server should only gzip the response when the client announces support for it (by sending Accept-Encoding: gzip). You need to ungzip it using GZIPInputStream.
reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));
Note that I also passed the right charset to the InputStreamReader constructor. Normally you would extract it from the Content-Type header of the response.
For more hints, see also How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.
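As a rough sketch of the two points above (un-gzip only when needed, and take the charset from Content-Type), something like the following could work. The class and method names (HtmlFetcher, charsetOf, decode, fetch) are made up for illustration, the UTF-8 fallback is an assumption, and it assumes the server labels its gzipped body with Content-Encoding: gzip.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of any library.
final class HtmlFetcher {

    // Extracts the charset from a value like "text/html; charset=ISO-8859-1".
    // Falls back to UTF-8 (an assumption) if none is declared or the name is unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads the whole body, un-gzipping it only when Content-Encoding says gzip.
    static String decode(InputStream body, String contentEncoding, String contentType)
            throws IOException {
        InputStream raw = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body) : body;
        try (InputStream in = raw) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charsetOf(contentType));
        }
    }

    // Fetches a page and decodes it according to the response headers.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        try {
            return decode(conn.getInputStream(),
                    conn.getContentEncoding(), conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling HtmlFetcher.fetch(...) with the geni.com URL from the question would then replace the reader loop. Keeping decode separate from the network call makes it easy to try with canned bytes.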

Related

Can't download html

I'm trying to download this html
I'm using this code:
Document doc = null;
try {
doc =Jsoup.connect(link).userAgent("Mozilla").get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Log.i ("html", doc.toString());
UPDATE:
I also tried this:
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet(link);
HttpResponse response = null;
try {
response = client.execute(request);
} catch (ClientProtocolException e1) {
//
e1.printStackTrace();
} catch (IOException e1) {
//
e1.printStackTrace();
}
InputStream in = null;
try {
in = response.getEntity().getContent();
} catch (IllegalStateException e1) {
//
e1.printStackTrace();
} catch (IOException e1) {
//
e1.printStackTrace();
}
BufferedReader reader = null;
try {
reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
} catch (UnsupportedEncodingException e) {
//
e.printStackTrace();
}
StringBuilder str = new StringBuilder();
String line = null;
try {
while((line = reader.readLine()) != null)
{
str.append(line);
}
} catch (IOException e1) {
//
e1.printStackTrace();
}
try {
in.close();
} catch (IOException e1) {
//
e1.printStackTrace();
}
String html = str.toString();
Log.e("html", html);
Again I get a response like this one:
<html>
<body>
<script>document.cookie="BPC=f563534535121d5a1ba5bd1e153b";
document.location.href="http://...link.../all?attempt=1";</script>
</body>
</html>
I can't find any solution. Maybe the page cannot be downloaded because I don't have the cookie, or is it something else?

The script tag contains this statement:
document.location.href="....link..../all?attempt=1";
Normally this forces the browser to reload the page from that location. I think the page you actually want to download is "....link...?attempt=1".
It may still fail if you don't send the cookie that the script sets, but it is worth a try.
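To make that concrete, here is a minimal sketch that pulls the cookie and the target location out of a stub page like the one in the question and then requests the target with the cookie attached. ScriptRedirect and its members are hypothetical names, and the regular expressions assume the script looks exactly like the sample (double-quoted values); a differently written script would need different patterns.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the cookie-setting stub page shown in the question.
final class ScriptRedirect {
    private static final Pattern COOKIE =
            Pattern.compile("document\\.cookie\\s*=\\s*\"([^\";]+)");
    private static final Pattern LOCATION =
            Pattern.compile("document\\.location\\.href\\s*=\\s*\"([^\"]+)\"");

    final String cookie;   // name=value pair set by the script
    final String location; // URL the script sends the browser to

    private ScriptRedirect(String cookie, String location) {
        this.cookie = cookie;
        this.location = location;
    }

    // Returns null when the HTML is not a stub of the expected shape.
    static ScriptRedirect parse(String html) {
        Matcher c = COOKIE.matcher(html);
        Matcher l = LOCATION.matcher(html);
        if (c.find() && l.find()) {
            return new ScriptRedirect(c.group(1), l.group(1));
        }
        return null;
    }

    // Prepares a request for the target page that presents the cookie.
    HttpURLConnection open() throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(location).openConnection();
        conn.setRequestProperty("Cookie", cookie);
        conn.setRequestProperty("User-Agent", "Mozilla");
        return conn;
    }
}
```

If parse returns non-null for the first response, read the real page from open().getInputStream(). With Jsoup, the same idea should be expressible by passing the cookie through the connection's header("Cookie", ...) call before get().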

How to know a webpage last-modified date & time in Android Java

How can I find out when a webpage was last modified, using Java on Android? Alternatively, how can I send the request header
If-Modified-Since: Allows a 304 Not Modified to be returned if content is unchanged
with my HTTP request?

I could do it this way:
URL url = null;
try {
url = new URL("http://www.example.com/example.pdf");
} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
HttpURLConnection httpCon = null;
try {
httpCon = (HttpURLConnection) url.openConnection();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long date = httpCon.getLastModified();
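For the If-Modified-Since part of the question, a small sketch along the same lines, using only java.net.HttpURLConnection. The class and method names are invented for illustration, and it assumes the server actually sends Last-Modified and honours If-Modified-Since (not every server does).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of any library.
final class ModifiedCheck {

    // Returns the Last-Modified time in epoch milliseconds, or 0 if the header is absent.
    static long lastModified(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD"); // ask for headers only, no body
        try {
            return conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }

    // Sends If-Modified-Since and reports whether the server says the content changed.
    static boolean changedSince(URL url, long sinceMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setIfModifiedSince(sinceMillis); // becomes the If-Modified-Since header
        try {
            // 304 Not Modified means nothing changed since 'sinceMillis'.
            return conn.getResponseCode() != HttpURLConnection.HTTP_NOT_MODIFIED;
        } finally {
            conn.disconnect();
        }
    }
}
```

A typical use would be to store the value from lastModified after a download and pass it to changedSince before downloading again. On Android both calls have to run off the main thread.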

Facebook Graph API request returning IOException "Hostname <fbcdn-profile-a.akamaihd.net> was not verified"

So I'm trying to simply fetch the user's profile photo from facebook but I'm getting a null response from facebook.request(path) and the IOException "Hostname fbcdn-profile-a.akamaihd.net was not verified".
Anyone know what could be causing this exception? Here's my method to call the facebook.request:
public Bitmap getUserPic(String path){
URL picURL = null;
try {
responsePic = facebook.request(path);
picURL = new URL(responsePic);
HttpURLConnection conn = (HttpURLConnection)picURL.openConnection();
conn.setDoInput(true);
conn.connect();
InputStream is = conn.getInputStream();
userPic = BitmapFactory.decodeStream(is);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (FacebookError e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return userPic;
}
The string "path" is "me/picture"
Edit:
Also tried setting picURL to "https://fbcdn-profile-a.akamaihd.net/hprofile-ak-snc4/260885_608260639_822979518_q.jpg" which is the url that the request should return. Still no photo :(
Thanks for any help

It sounds like an issue with the HTTPS connection used to get the image from the Facebook CDN. What happens if you request the regular HTTP version of the image?
E.g. http://fbcdn-profile-a.akamaihd.net/hprofile-ak-snc4/260885_608260639_822979518_q.jpg

401 response when reading tweets

I'm new to mobile app development. I'm developing a BlackBerry application that reads tweets from the user's timeline. So far I have managed to get the OAuth access token. The problem is that when I use this access token to read the tweets, I get a 401 response with the message "Unauthorized". I'm not using any libraries; I'm doing everything on my own. Could anyone help me with this?
Thanks,
Here's the code:
HttpConnectionFactory factory = new HttpConnectionFactory( url,
HttpConnectionFactory.TRANSPORT_WIFI |
HttpConnectionFactory.TRANSPORT_WAP2 |
HttpConnectionFactory.TRANSPORT_BIS |
HttpConnectionFactory.TRANSPORT_BES |
HttpConnectionFactory.TRANSPORT_DIRECT_TCP);
httpConn = factory.getNextConnection();
httpConn.setRequestMethod(HttpProtocolConstants.HTTP_METHOD_GET);
httpConn.setRequestProperty("WWW-Authenticate","OAuth realm=http://twitter.com/");
httpConn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
httpConn.setRequestProperty("Content-Length", Integer.toString(header.getBytes().length));
os = httpConn.openOutputStream();
os.write(header.getBytes());
os.close();
os = null;
input = httpConn.openDataInputStream();
int resp = httpConn.getResponseCode();
// Dialog.alert(httpConn.getDate()+" : "+System.currentTimeMillis());
if (resp == HttpConnection.HTTP_OK) {
XMLReader parser;
try {
parser = XMLReaderFactory.createXMLReader();
parser.setContentHandler(this);
parser.parse(new InputSource(input));
for(int i=0 ; i<2 ; i++)
{
tweets.addElement( parser.getProperty("text").toString());
Dialog.alert(parser.getProperty("text").toString());
}
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Dialog.alert("your tweet was posted successfully :)");
}
Dialog.alert(httpConn.getResponseCode()+": "+httpConn.getResponseMessage());
return (httpConn.getResponseCode()+": "+httpConn.getResponseMessage());
} catch (IOException e) {
return "exception";
} catch (NoMoreTransportsException nc) {
return "noConnection";
} finally {
try {
httpConn.close();
input.close();
} catch (IOException e) {
e.printStackTrace();
}
}

I'm not an expert in OAuth, however just a note:
This:
httpConn.setRequestMethod(HttpProtocolConstants.HTTP_METHOD_GET);
and this:
httpConn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
are mutually exclusive. You are posting data to the server, so the request should be a POST (not a GET).
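To illustrate the difference, here is a hedged sketch using plain java.net.HttpURLConnection rather than the BlackBerry HttpConnection API from the question (so method names will differ on the device). FormPost, encodeForm and post are hypothetical names. The point is only that a form-encoded body and its Content-Type belong on a POST.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class FormPost {

    // Builds an application/x-www-form-urlencoded body such as "a=1&b=two+words".
    static String encodeForm(Map<String, String> fields) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    // Sends the fields as the body of a POST and returns the HTTP status code.
    static int post(URL url, Map<String, String> fields) throws IOException {
        byte[] body = encodeForm(fields).getBytes("UTF-8");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST"); // a request body implies POST, not GET
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setFixedLengthStreamingMode(body.length); // sets Content-Length
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

For a read-only call such as fetching a timeline, the opposite fix applies: keep GET and drop the body, Content-Type and Content-Length. Separately, WWW-Authenticate is defined as a response header; OAuth credentials are normally sent in the Authorization request header, which may also be relevant to the 401.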

Problem fetching XML file in Java

I am trying to use Google's unofficial weather API in an Android Application.
I use this code:
//get the text from the edit text
userZip = zipCode.getText().toString();
//create a link using the zip code
//TODO sanitize input
System.out.println(userZip);
link = "http://www.google.com/ig/api?weather=" + userZip;
System.out.println(link);
//connect to the link
URL googleWeatherService = null;
try {
googleWeatherService = new URL(link);
} catch (MalformedURLException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
SAXBuilder parser = new SAXBuilder();
try {
doc = parser.build(googleWeatherService);
} catch (JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
But I get the error java.io.IOException Couldn't open http://www.google.com/ig/api?weather=08003 (just using 08003 as an example).
If you go to the link in FF you get a nice XML file of weather, so what am I doing wrong?

I think you need to open the connection to the URL and read from its input stream for this to work. I would try this:
URL googleWeatherService = null;
URLConnection conn = null;
try {
googleWeatherService = new URL(link);
conn = googleWeatherService.openConnection();
} catch (MalformedURLException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// openConnection() can throw IOException
e1.printStackTrace();
}
SAXBuilder parser = new SAXBuilder();
try {
doc = parser.build(conn.getInputStream());
} catch (JDOMException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
Hopefully this does the trick for you!
If this fails, it sounds like you have to deal with URL redirects, which is a problem I used to have. In that case you would need to do the following:
URL googleWeatherService = null;
URLConnection conn = null;
try {
googleWeatherService = new URL(link);
HttpURLConnection ucon = (HttpURLConnection) googleWeatherService.openConnection();
// do not follow the redirect automatically; read its target instead
ucon.setInstanceFollowRedirects(false);
URL secondURL = new URL(ucon.getHeaderField("Location"));
conn = secondURL.openConnection();
} catch (MalformedURLException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
} catch (IOException e1) {
// openConnection() can throw IOException
e1.printStackTrace();
}
SAXBuilder parser = new SAXBuilder();
try {
doc = parser.build(conn.getInputStream());
} catch (JDOMException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
Hope this solves it!
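The redirect approach above handles exactly one hop and assumes an absolute Location value. A slightly more defensive sketch, with hypothetical names (RedirectFollower, open, resolve) and an arbitrary hop limit chosen by the caller, might look like this:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of any library.
final class RedirectFollower {

    // Status codes that carry a Location header pointing somewhere else.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Location may be relative, so resolve it against the URL that was requested.
    static URL resolve(URL base, String location) throws IOException {
        return new URL(base, location);
    }

    // Follows redirects by hand, up to maxHops, and returns the final connection.
    static HttpURLConnection open(URL start, int maxHops) throws IOException {
        URL current = start;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn; // caller reads conn.getInputStream()
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting at " + start);
    }
}
```

The stream from RedirectFollower.open(googleWeatherService, 5).getInputStream() could then be handed to parser.build(...) as in the snippets above.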

This worked perfectly for me:
package weather;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
/**
* GoogleWeather
* @author Michael
* @since 2/12/11
*/
public class GoogleWeather
{
public static void main(String[] args)
{
for (String userZip : args)
{
BufferedReader br = null;
try
{
String link = "http://www.google.com/ig/api?weather=" + userZip;
System.out.println(link);
URL googleWeatherService = new URL(link);
br = new BufferedReader(new InputStreamReader(googleWeatherService.openStream()));
SAXReader reader = new SAXReader();
Document document = reader.read(googleWeatherService);
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(System.out, format);
writer.write(document);
}
catch (Exception e)
{
e.printStackTrace();
}
finally
{
close(br);
}
}
}
private static void close(BufferedReader br)
{
try
{
if (br != null)
{
br.close();
}
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
Here's the result it brought back:
<?xml version="1.0" encoding="UTF-8"?>
<xml_api_reply version="1">
<weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
<forecast_information>
<city data="Hebron, CT"/>
<postal_code data="06248"/>
<latitude_e6 data=""/>
<longitude_e6 data=""/>
<forecast_date data="2011-02-12"/>
<current_date_time data="2011-02-13 03:00:47 +0000"/>
<unit_system data="US"/>
</forecast_information>
<current_conditions>
<condition data="Partly Cloudy"/>
<temp_f data="28"/>
<temp_c data="-2"/>
<humidity data="Humidity: 45%"/>
<icon data="/ig/images/weather/partly_cloudy.gif"/>
<wind_condition data="Wind: NW at 14 mph"/>
</current_conditions>
<forecast_conditions>
<day_of_week data="Sat"/>
<low data="16"/>
<high data="36"/>
<icon data="/ig/images/weather/partly_cloudy.gif"/>
<condition data="Partly Cloudy"/>
</forecast_conditions>
<forecast_conditions>
<day_of_week data="Sun"/>
<low data="30"/>
<high data="38"/>
<icon data="/ig/images/weather/snow.gif"/>
<condition data="Snow Showers"/>
</forecast_conditions>
<forecast_conditions>
<day_of_week data="Mon"/>
<low data="23"/>
<high data="46"/>
<icon data="/ig/images/weather/cloudy.gif"/>
<condition data="Cloudy"/>
</forecast_conditions>
<forecast_conditions>
<day_of_week data="Tue"/>
<low data="12"/>
<high data="29"/>
<icon data="/ig/images/weather/cloudy.gif"/>
<condition data="Windy"/>
</forecast_conditions>
</weather>
</xml_api_reply>

Are you able to retrieve other URIs successfully?
You could be hitting problems with the JVM configuration. In most environments I've come across, if your machine is configured so that the web browser can make HTTP requests successfully, then Java will also be able to make them successfully. But I've heard of special JVM configuration being needed when you're behind a proxy server, and I've no idea whether anything similar might be needed in an Android environment.
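If a proxy does turn out to be the cause, one way to test the theory on a desktop JVM is to open the connection through an explicit java.net.Proxy. This is only a sketch: ProxyExample and openVia are made-up names, the proxy host and port are placeholders you would have to supply, and I have not checked how this behaves on Android.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of any library.
final class ProxyExample {

    // Opens (but does not yet connect) a connection routed through an HTTP proxy.
    static URLConnection openVia(URL url, String proxyHost, int proxyPort) throws IOException {
        // createUnresolved defers the DNS lookup of the proxy host until connect time.
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
        return url.openConnection(proxy);
    }
}
```

The JVM-wide alternative is to set the standard http.proxyHost and http.proxyPort system properties, for example with -Dhttp.proxyHost=... on the command line.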

I've run into this problem with JDOM2, and it was really a NetworkOnMainThreadException in disguise. I moved the code off the main thread and everything worked.
new Thread(new Runnable() {
@Override
public void run() {
<your code>
}
}).start();
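A bare Thread gives you no way to get the parsed document back. As a rough sketch of one alternative, the hypothetical BackgroundFetch class below runs a task on a single worker thread and hands the result (or the exception) to a callback. The Callback interface is invented here; nothing in it is Android-specific.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of any library.
final class BackgroundFetch {

    // Receives the outcome of the background task.
    interface Callback<T> {
        void onResult(T value);
        void onError(Exception e);
    }

    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Runs 'task' on the worker thread; the callback is invoked on that same thread.
    <T> void run(final Callable<T> task, final Callback<T> callback) {
        worker.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    callback.onResult(task.call());
                } catch (Exception e) {
                    callback.onError(e);
                }
            }
        });
    }

    // Lets the worker thread exit once queued tasks have finished.
    void shutdown() {
        worker.shutdown();
    }
}
```

In the weather example, the Callable would wrap the parser.build(...) call. Because the callback runs on the worker thread, any UI update on Android would still have to be posted back to the main thread, for instance via Activity.runOnUiThread.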
