Optimized option for getting text from a web page

I used url.openConnection() to get the text of a web page, but execution is slow when I call it in a loop.
I also tried httpUrl.disconnect(), but it did not make much difference.
Can anyone suggest a better option for this?
I used the following code:
for (int i = 0; i < 10; i++) {
    URL google = new URL(array[i]); // array of links
    HttpURLConnection yc = (HttpURLConnection) google.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        source = source.concat(inputLine);
    }
    in.close();
    yc.disconnect();
}

A couple of issues I can see.
in.readLine() doesn't retain the newline, so when you use concat all the newlines are lost.
Using concat in a loop like this builds a longer and longer String, and each call copies everything read so far. This will get slower and slower with each line you add.
Instead you might find IOUtils (from Apache Commons IO) useful.
URL google = new URL("http://123newyear.com/2011/calendars/");
String text = IOUtils.toString(google.openConnection().getInputStream());
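If you would rather not add a library, the two issues above can also be addressed with a StringBuilder. This is only a minimal sketch: the class name PageText and the method readAll are made up for illustration, and a StringReader stands in for the InputStreamReader around yc.getInputStream() so the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the original question or answer.
final class PageText {

    // Reads everything from the reader and puts back one '\n' after each line,
    // because readLine() strips the line terminator.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // append() grows one buffer instead of copying the whole String each time
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // In the question's loop this would be:
        // readAll(new InputStreamReader(yc.getInputStream()))
        System.out.print(readAll(new StringReader("first line\nsecond line")));
    }
}
```

Whether this removes most of the delay depends on how much of it comes from the network rather than from the string handling, so it is worth timing both parts separately.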

See Reading Directly from a URL for details on how to get a stream from which you can read the contents of the URL.
Basically, you
Create a URL: URL url = new URL("http://123newyear.com/2011/calendars/");
Call openStream() on the URL object:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
Read from the stream (like you did).

Related

JAVA image request from multipart-form

I'm trying to redesign my code, which originally took in JSON via POST, to now take in images from POST requests. The problem is that when I debug it, the image data shows up as weird Unicode characters. I've tried looking into this problem from scratch, but all the examples I've found deal with static images already on the hard drive.
I've tried the code below inside my JSP file, and it works fine for JSON data posts. Can someone tell me if I'm on the right path at all for this?
try {
    BufferedReader in = new BufferedReader(
            new InputStreamReader(request.getInputStream()));
    String inputLine;
    StringBuffer resp = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        resp.append(inputLine);
    }
    System.out.println("Images request line: \"" + resp.toString().substring(0, 200) + "\"");
    in.close();
} catch (Exception e) {}

Getting incomplete HTML source on url.openConnection()

I am trying to get the HTML page source for a website, but I am not able to get some image links, which I think are populated dynamically on the webpage.
I am using Java as follows:
url = new URL(firstLevelURL);
connection = (HttpURLConnection) url.openConnection();
// Read all the text returned by the server
try (BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
    // Read each line from "br" until the tile image is found, storing that line in "response"
    while ((str = br.readLine()) != null) {
        // str is one line of text; readLine() strips newline characters
        // I am not able to get this image, as it is loaded dynamically using JavaScript/AJAX or something.
        if (str.contains("<img id=\"tileImage")) {
            response = str;
            break;
        }
    }
}
I tried using:
connection.setReadTimeout(15*1000);
But the page is still not loading completely.
Is there any way to wait for the page to load completely before fetching the HTML source?

HttpUrlConnection's response omits the word 'http'

I create the URL object using a string like "http://www.example.com/a?s=12". I read the HTML response in the string serverResponse. This string is expected to have the entire HTML of a page, which has JavaScript and CSS includes. But strangely, the word "http:" is missing from all the URLs present in the response, eg in place of "http://example.com" I get "//asd.com". Any ideas?
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("GET");
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer serverResponse = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
    serverResponse.append(inputLine);
    System.out.println(inputLine);
}
in.close();
System.out.println(serverResponse);

See here: Protocol-relative URLs

This string is expected to have the entire HTML of a page, which has javascript and CSS includes.
Why? A properly-constructed site will use relative URLs as much as possible. This seems to be one of them. Well done them, or you if it's your work.
But strangely, the word "http:" is missing from all the URLs present in the response, eg in place of "http://example.com" I get "//asd.com". Any ideas?
It's called a protocol-relative URL.
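If you need absolute URLs, one option is to resolve each link against the URL of the page it came from, which fills in the scheme. A minimal sketch using java.net.URI follows; the class name ProtocolRelative, the method absolutize, and the path /style.css are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the original question or answers.
final class ProtocolRelative {

    // Resolves a link found in the page against the page's own URL.
    // A protocol-relative link ("//host/path") inherits the page's scheme.
    static String absolutize(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Page fetched over http, so the link resolves to http as well
        System.out.println(absolutize("http://www.example.com/a?s=12", "//asd.com/style.css"));
        // The same link on a page fetched over https resolves to https
        System.out.println(absolutize("https://www.example.com/a?s=12", "//asd.com/style.css"));
    }
}
```

The server response itself is not missing anything; the browser performs this same resolution when it loads the page.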

Wrong encoding with Java HttpURLConnection

Trying to read a generated XML from a MS Webservice
URL page = new URL(address);
StringBuffer text = new StringBuffer();
HttpURLConnection conn = (HttpURLConnection) page.openConnection();
conn.connect();
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
BufferedReader buff = new BufferedReader(in);
box.setText("Getting data ...");
String line;
do {
    line = buff.readLine();
    text.append(line + "\n");
} while (line != null);
box.setText(text.toString());
or
URL u = new URL(address);
URLConnection uc = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
    System.out.println(inputLine);
}
in.close();
Any page reads fine except the web service output.
It reads the greater-than and less-than signs strangely:
it reads < as "&lt;" and > as "&gt;".
Please help.
Thanks

First, there seems to be some confusion on this line:
inputLine = java.net.URLDecoder.decode(inputLine, "UTF-8");
This effectively says that you expect every row in the document that your server is providing to be URL encoded. URL encoding is not the same as document encoding.
http://en.wikipedia.org/wiki/Percent-encoding
http://en.wikipedia.org/wiki/Character_encoding
Looking at your code snippet, I think URL encoding (percent encoding) is not what you're after.
In terms of document character encoding, you are making a conversion on this line:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent());
conn.getContent() returns an InputStream that operates on bytes, whilst the reader operates on chars, so the character encoding conversion is done here. Check out the other constructors of InputStreamReader, which take the encoding as a second argument. Without the second argument you are falling back on whatever your platform default is in Java.
InputStreamReader(InputStream in, String charsetName)
for instance, lets you change your code to:
InputStreamReader in = new InputStreamReader((InputStream) conn.getContent(), "utf-8");
But the real question will be "what encoding is your server providing the content in?" If you own the server code too, you may just hard-code it to something reasonable such as utf-8. But if it can vary, you need to look at the HTTP header Content-Type to figure it out.
String contentType = conn.getHeaderField("Content-Type");
The contents of contentType will look like
text/plain; charset=utf-8
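A shorthand way of getting this field is:
String contentType = conn.getContentType();
(Note that conn.getContentEncoding() is not the same thing: it returns the Content-Encoding header, e.g. gzip, not the charset.)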
Notice that it's entirely possible that no charset is provided, or no Content-Type header, in which case you must fall back on reasonable defaults.
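To illustrate the fallback, here is a minimal sketch of pulling the charset out of a Content-Type value. The class name ContentTypeCharset and the method charsetOf are made up for illustration, and using UTF-8 as the default is an assumption; pick whatever default suits your server.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/plain; charset=utf-8", or UTF-8 (an assumed default) when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=utf-8"));
        System.out.println(charsetOf("text/xml; charset=ISO-8859-1"));
        System.out.println(charsetOf(null));
    }
}
```

The result can be passed to the InputStreamReader(InputStream, Charset) constructor, for example with charsetOf(conn.getContentType()).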

Mark Rotteveel is correct: the web service is the culprit here. For some reason it is sending the greater-than and less-than signs in the &lt; and &gt; format.
Thanks, Martin Algesten, but as I already stated, I worked around it; I was just looking for why it was this way.

Java URL Connection Time Out

I am attempting to connect to a website where I'd like to extract its HTML contents. My application will never connect to the site - only time out.
Here is my code:
URL url = new URL("www.website.com");
URLConnection connection = url.openConnection();
connection.setConnectTimeout(2000);
connection.setReadTimeout(2000);
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    // do stuff with line
}
reader.close();
Any ideas would be greatly appreciated. Thanks!

I believe the URL should be (i.e. you need a protocol):
URL url = new URL("http://www.website.com");
If that doesn't help, then post a real SSCCE that demonstrates the problem so we don't have to guess what you are really doing. As it stands, we can't tell whether you are using your try/catch block correctly or just ignoring exceptions.
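A quick way to check the protocol point is to see whether the URL constructor accepts the string at all. This is only a sketch; the class name UrlProtocolCheck and the method parses are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of the original question or answer.
final class UrlProtocolCheck {

    // Returns true if java.net.URL accepts the string, false if it is rejected.
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // Thrown, for example, when the string has no protocol such as "http://"
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("www.website.com"));        // no protocol
        System.out.println(parses("http://www.website.com")); // with protocol
    }
}
```

If an exception like this is being swallowed by an empty catch block, the program can look as though it never connects, which is why seeing the real try/catch matters.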
