Java HttpURLConnection cutting off HTML

Hey, I'm trying to get the HTML from a Twitter profile page, but HttpURLConnection is only returning a small snippet of the HTML. My code:
for (int i = 0; i < urls.size(); i++)
{
    URL url = new URL(urls.get(i));
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    System.out.println(connection.getResponseCode());
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    while ((line = reader.readLine()) != null)
    {
        builder.append(line);
    }
    String html = builder.toString();
}
I always get 200 as the response code for each call. However, about a third of the time the entire HTML document is returned, and the rest of the time only the first few hundred lines. The amount returned when the HTML is cut off is not always the same.
Any ideas? Thanks for any help!
Additional info: after viewing the headers, it seems I'm getting duplicate Content-Length headers. The first is the full length; the other is much shorter (and probably matches the truncated length I'm getting some of the time). How can I handle duplicate headers?
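
As a first diagnostic step, you can dump the raw header map from Java itself: getHeaderFields() returns a Map<String, List<String>>, so a repeated Content-Length shows up as a two-element list. A minimal sketch (the profile URL is a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    public static void main(String[] args) throws Exception {
        // placeholder URL; substitute the profile page you are fetching
        URL url = new URL("https://twitter.com/someprofile");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        // getHeaderFields() keeps repeated headers as multi-valued entries,
        // so a duplicate Content-Length is visible as a list of two values
        for (Map.Entry<String, List<String>> e : connection.getHeaderFields().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}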

This worked fine for me. I added a newline after builder.append(line); to make the output more readable in the console, but other than that it returned all the HTML for this page:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RetrieveHTML {
    public static void main(String[] args) throws IOException {
        List<String> urls = new ArrayList<String>();
        urls.add("http://stackoverflow.com/questions/3285077/java-httpurlconnection-cutting-off-html");
        for (int i = 0; i < urls.size(); i++) {
            URL url = new URL(urls.get(i));
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
            System.out.println(connection.getResponseCode());
            String line;
            StringBuilder builder = new StringBuilder();
            BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            while ((line = reader.readLine()) != null) {
                builder.append(line);
                builder.append("\n");
            }
            String html = builder.toString();
            System.out.println("HTML " + html);
        }
    }
}

Check out my HTTP class:
https://stackoverflow.com/questions/9349378/java-net-httpurlconnection-returning-your-browsers-cookie-functionality-has-be
It's based on this API. Feel free to change some stuff.

Related

Get page content from URL in Java

I'm not able to access the content of this page, "kissanime.com" (it returns nothing), with this code:
String a="http://kissanime.com";
url = new URL(a);
URLConnection conn = url.openConnection();
try ( // open the stream and put it into BufferedReader
BufferedReader br = new BufferedReader(
new InputStreamReader(conn.getInputStream()))) {
String inputLine;
while ((inputLine = br.readLine()) != null) {
System.out.println(inputLine);
}
}
As mentioned in my comment above, you need to set a User-Agent header with the setRequestProperty method, as below.
String a = "http://kissanime.com";
URLConnection connection = new URL(a).openConnection();
connection
.setRequestProperty("User-Agent",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();
BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(),
Charset.forName("UTF-8")));
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
sb.append(line);
}
System.out.println(sb.toString());
Now you will get something!

How to detect encoding when I'm using BufferedReader

I know this question has been asked many times, but I'm stuck with this problem and nothing I've read has helped me.
I have this code:
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    content += line + "\r\n";
}
reader.close();
I'm trying to get the content of this webpage, http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/, and all non-Latin symbols are displayed wrong.
I tried setting the encoding like this:
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "WINDOWS-1251"));
and at that point everything worked! But I can't hard-code the encoding for every website I try to parse, so I need a general solution.
So, I know detecting an encoding is not as easy as it seems, but I really need it. If someone has had this problem, please explain how you solved it!
Any help appreciated!
This is the entire code of the function I'm using to get the content:
protected Map<String, String> getFromUrl(String url) {
    Map<String, String> mp = new HashMap<String, String>();
    String newCookie = "", redirect = null;
    try {
        String host = this.getHostName(url), content = "", header = "",
               UA = this.getUA(), cookie = this.getCookie(host, UA),
               referer = "http://" + host + "/";
        URL U = new URL(url);
        URLConnection conn = U.openConnection();
        conn.setRequestProperty("Host", host);
        conn.setRequestProperty("User-Agent", UA);
        conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3");
        conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
        conn.setRequestProperty("Accept-Charset", "utf-8;q=0.7,*;q=0.7");
        conn.setRequestProperty("Keep-Alive", "115");
        conn.setRequestProperty("Connection", "keep-alive");
        if (referer != null) conn.setRequestProperty("Referer", referer);
        if (cookie != null && !cookie.contentEquals("")) conn.setRequestProperty("Cookie", cookie);
        for (int i = 0; ; i++) {
            String name = conn.getHeaderFieldKey(i);
            String value = conn.getHeaderField(i);
            if (name == null && value == null) break;
            else if (name != null) {
                if (name.contentEquals("Set-Cookie")) newCookie += value + " ";
                else if (name.toLowerCase().trim().contentEquals("location")) redirect = value;
            }
            header += name + ": " + value + "\r\n";
        }
        if (!newCookie.contentEquals("") && !newCookie.contentEquals(cookie)) this.setCookie(host, UA, newCookie.trim());
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) content += line + "\r\n";
            reader.close();
        } catch (Exception e) { /* System.out.println(url + "\r\n" + e); */ }
        mp.put("url", url);
        mp.put("header", header);
        mp.put("content", content);
    } catch (Exception e) {
        mp.put("url", "");
        mp.put("header", "");
        mp.put("content", "");
    }
    if (redirect != null && this.redirectCount < 3) {
        mp = getFromUrl(redirect);
        this.redirectCount++;
    }
    return mp;
}
Use jsoup, for example. Detecting the character encoding of a random website is a complex problem because of lying or missing headers and two different kinds of meta tags. For example, the page you linked doesn't send a charset in its Content-Type header.
And you're going to need an HTML parser anyway; you didn't think of going with a regex, did you?
Here's an example usage:
Connection connection = Jsoup.connect("http://www.garazh.com.ua/tires/catalog/Marangoni/E-COMM/description/");
connection
        .header("Host", host)
        .header("User-Agent", UA)
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        .header("Accept-Language", "ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3")
        .header("Accept-Encoding", "gzip,deflate")
        .header("Accept-Charset", "utf-8;q=0.7,*;q=0.7")
        .header("Keep-Alive", "115")
        .header("Connection", "keep-alive");
connection.followRedirects(true);

Document doc = connection.get();
Map<String, String> cookies = connection.response().cookies();

Elements titles = doc.select(".title");
for (Element title : titles) {
    System.out.println(title.ownText());
}
Output:
Шины Marangoni E-COMM
Описание шины Marangoni E-COMM
You want to look for the 'Content-Type' header:
Content-Type: text/html; charset=utf-8
The "charset" part there is what you're looking for.

403 Forbidden with Java but not web browser?

I am writing a small Java program to get the number of results for a given Google search term. For some reason, in Java I get a 403 Forbidden, while a web browser gets the right results. Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class DataGetter {
    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new URL("https://www.google.com/search?q=" + query).openConnection().getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }
}
And the error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at DataGetter.getResultAmount(DataGetter.java:15)
    at DataGetter.main(DataGetter.java:10)
Why is it doing this?
You just need to set the User-Agent header for it to work:
URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
System.out.println(sb.toString());
The SSL was handled transparently for you, as you can see from your exception stack trace.
Getting the result amount is not really this simple, though; after this you have to pretend to be a browser by fetching the cookie and parsing the redirect token link.
String cookie = connection.getHeaderField("Set-Cookie").split(";")[0];
Pattern pattern = Pattern.compile("content=\\\"0;url=(.*?)\\\"");
Matcher m = pattern.matcher(response);
if (m.find()) {
    String url = m.group(1);
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie);
    connection.connect();

    r = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();

    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if (m.find()) {
        long amount = Long.parseLong(m.group(1).replaceAll(",", ""));
        return amount;
    }
}
Running the full code I get 2930000000L as a result.
For me it worked by adding the header:
"Accept": "*/*"
You probably aren't setting the correct headers. Use LiveHttpHeaders (or equivalent) in the browser to see what headers the browser is sending, then emulate them in your code.
It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to the actual security.

Java HttpURLConnection threaded

Simple question:
Is it possible to make several HttpURLConnection requests at the same time? I'm creating a tool to check whether pages exist on a certain server, and at the moment Java seems to wait for each HttpURLConnection to finish before starting a new one. Here's my code:
public static String GetSource(String url) {
    String results = "";
    try {
        URL SourceCode = new URL(url);
        URLConnection connect = SourceCode.openConnection();
        connect.setRequestProperty("Host", "www.someserver.com");
        connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0");
        connect.setRequestProperty("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
        connect.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
        connect.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        connect.setRequestProperty("Keep-Alive", "115");
        connect.setRequestProperty("Connection", "keep-alive");
        BufferedReader in = new BufferedReader(new InputStreamReader(connect.getInputStream(), "UTF-8"));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            results += inputLine;
        }
        return results;
    } catch (Exception e) {
        // Something's wrong
    }
    return results;
}
Thanks a lot!
Yes, it is possible; the code you posted can be called from multiple threads at the same time.
You need to create a thread for each hit. Create a class that implements Runnable, then put all of your connection code inside the run method.
Then run it with something like this...
for (int i = 0; i < /* thread count */; i++) {
    Thread currentThread = new Thread(/* instance of your Runnable class */);
    currentThread.start();
}
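
Filling in those placeholders, a minimal sketch of the pattern (the URLs are made up, and GetSource is stubbed so the class compiles on its own; paste in the real method from the question):

import java.util.Arrays;
import java.util.List;

public class PageChecker implements Runnable {
    private final String url;

    public PageChecker(String url) {
        this.url = url;
    }

    public void run() {
        // each thread opens its own connection, so the requests overlap
        // in time instead of running one after another
        String source = GetSource(url);
        System.out.println(url + " -> " + source.length() + " chars");
    }

    public static void main(String[] args) {
        // placeholder URLs; substitute the pages you want to check
        List<String> urls = Arrays.asList(
                "http://www.someserver.com/page1",
                "http://www.someserver.com/page2",
                "http://www.someserver.com/page3");
        for (String url : urls) {
            new Thread(new PageChecker(url)).start();
        }
    }

    // stub so the sketch is self-contained; replace with the real GetSource
    static String GetSource(String url) {
        return "";
    }
}

For more than a handful of URLs, an ExecutorService with a fixed-size pool is the usual way to cap how many connections run at once.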

HttpsURLConnection and POST

Some time ago I wrote a program in Python that logged in to a website over HTTPS, grabbed some info, and logged out.
The program was quite simple:
class Richiesta(object):
    def __init__(self, url, data):
        self.url = url
        self.data = ""
        self.content = ""
        for k, v in data.iteritems():
            self.data += str(k) + "=" + str(v) + "&"
        if self.data == "":
            self.req = urllib2.Request(self.url)
        else:
            self.req = urllib2.Request(self.url, self.data)
        self.req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6')
        self.req.add_header('Referer', baseurl + '/www/')
        self.req.add_header('Cookie', cookie)

    def leggi(self):
        while self.content == "":
            try:
                r = urllib2.urlopen(self.req)
            except urllib2.HTTPError, e:
                print("Server error, retrying in 15 seconds")
                time.sleep(15)
            except urllib2.URLError, e:
                print("Network problem, trying to reconnect in 20 seconds")
                time.sleep(20)
            else:
                self.content = r.read().decode('utf-8')

def login(username, password):
    global cookie
    print("Starting the login procedure")
    url = "https://example.com/auth/Authenticate"
    data = {"login": "1", "username": username, "password": password}
    f = Richiesta(url, data)
    f.leggi()
Now, for some reason, I have to translate it into Java. So far, this is what I've written:
import java.net.*;
import java.security.Security.*;
import java.io.*;
import javax.net.ssl.*;

public class SafeReq {
    String baseurl = "http://www.example.com";
    String useragent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0b6) Gecko/20100101 Firefox/4.0b6";
    String content = "";

    public SafeReq(String s, String sid, String data) throws MalformedURLException {
        try {
            URL url = new URL(s);
            HttpsURLConnection request = (HttpsURLConnection) url.openConnection();
            request.setUseCaches(false);
            request.setDoOutput(true);
            request.setDoInput(true);
            request.setFollowRedirects(true);
            request.setInstanceFollowRedirects(true);
            request.setRequestProperty("User-Agent", useragent);
            request.setRequestProperty("Referer", "http://www.example.com/www/");
            request.setRequestProperty("Cookie", "sid=" + sid);
            request.setRequestProperty("Origin", "http://www.example.com");
            request.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            request.setRequestProperty("Content-length", String.valueOf(data.length()));
            request.setRequestMethod("POST");

            OutputStreamWriter post = new OutputStreamWriter(request.getOutputStream());
            post.write(data);
            post.flush();

            BufferedReader in = new BufferedReader(new InputStreamReader(request.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                content += inputLine;
            }
            post.close();
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public String leggi() {
        return content;
    }
}
The problem is, the login doesn't work, and when I try to get a page that requires me to be logged in, I get the "Login Again" message.
The two classes seem to do much the same thing, and I can't understand why I can't make the second one work... any ideas?
Where do you get your sid from? From the symptoms, I would guess that your session cookie is not being passed correctly to the server.
See this question for a possible solution: Cookies turned off with Java URLConnection.
In general, I recommend using HttpClient for implementing HTTP conversations in Java (anything more complicated than a simple one-time GET or POST). See its code examples (I guess the "Form based logon" example is the appropriate one in your case).
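
For reference, the fix from that linked question amounts to installing a JVM-wide cookie store before the first connection is opened; from then on, HttpURLConnection stores and replays cookies automatically. A minimal sketch:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

public class CookieSetup {
    public static void main(String[] args) {
        // install once, before any connection is opened; every subsequent
        // HttpURLConnection/HttpsURLConnection in the JVM will then store
        // and resend cookies, including the session cookie from a login POST
        CookieManager manager = new CookieManager();
        manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
    }
}
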
For anyone looking for this in the future, take a look at HtmlUnit.
This answer has a nice example.
