I am trying to get the HTML page source for a website, but I am not able to get some image links, which I think are populated dynamically on the page.
I am using Java as follows:
url = new URL(firstLevelURL);
connection = (HttpURLConnection) url.openConnection();
try ( // Read all the text returned by the server
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
// Read each line of "in" until done, adding each to "response"
while ((str = br.readLine()) != null) {
// str is one line of text readLine() strips newline characters
//I am not able to get this image as it is loaded dynamically using javascript/ajax or something.
if(str.contains("<img id=\"tileImage")) {
response = str;
break;
}
}
}
I tried using:
connection.setReadTimeout(15*1000);
But the page is still not loading completely.
Is there any way to wait for the page to load completely before fetching the HTML source?
Related
I am trying to download the latest HTML code from this website. Until recently, the URL displayed all the information I needed. Recently the web designer changed the format, so only a portion of the data is displayed and the user must hit the 'next' button to display the next portion.
The URL doesn't change, though.
Does anyone know how I can download all the information using Java?
Thanks. This is my current code:
[code]
URL url = null;
InputStream is = null;
BufferedReader br;
String line;
try {
url = new URL("HTTP://...../..../...");
is = url.openStream();
br = new BufferedReader(new InputStreamReader(is));
while ( (line = br.readLine() ) != null)
System.out.println(line);
} catch(IOException e) {
e.printStackTrace(); // report I/O errors instead of silently swallowing them
}
....
[/code]
I create the URL object using a string like "http://www.example.com/a?s=12". I read the HTML response in the string serverResponse. This string is expected to have the entire HTML of a page, which has JavaScript and CSS includes. But strangely, the word "http:" is missing from all the URLs present in the response, eg in place of "http://example.com" I get "//asd.com". Any ideas?
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
con.setRequestMethod("GET");
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer serverResponse = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
serverResponse.append(inputLine);
System.out.println(inputLine);
}
in.close();
System.out.println(serverResponse);
See here: Protocol-relative URLs
This string is expected to have the entire HTML of a page, which has javascript and CSS includes.
Why? A properly-constructed site will use relative URLs as much as possible. This seems to be one of them. Well done them, or you if it's your work.
But strangely, the word "http:" is missing from all the URLs present in the response, eg in place of "http://example.com" I get "//asd.com". Any ideas?
It's called a protocol-relative URL.
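If absolute URLs are needed, a protocol-relative link can be resolved against the URL the page was fetched from. The following is a minimal sketch using java.net.URI from the standard library; the class name, the helper name and the paths "/lib.js" and "img/logo.png" are made up for illustration.

```java
import java.net.URI;

class ProtocolRelativeDemo {
    // Resolves a link found in the page against the URL the page was fetched from.
    // A link that starts with "//" inherits only the scheme of the page URL.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/a?s=12";
        // Protocol-relative link: host comes from the link, scheme from the page.
        System.out.println(toAbsolute(page, "//asd.com/lib.js"));
        // Ordinary relative link: scheme and host both come from the page.
        System.out.println(toAbsolute(page, "img/logo.png"));
    }
}
```

The same page served over https would yield https links, which is the reason sites use this form.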
I need to make a Struts 2 application. In one view of this app, I have to show the view of another application through a provided URL, for example (http://localhost:8080/hudson/)...
Now:
1. How do I connect to the other application? (Can it be done with Apache HttpClient, or is there another way? Please guide.)
2. If it can be done with Apache HttpClient, how do I render the Response object in the Struts 2 framework?
Please help. Many thanks in advance.
You can use the java.net package to resolve the issue. Example code:
URL urlApi = new URL(requestUrl);
HttpURLConnection httpURLConnection = (HttpURLConnection) urlApi.openConnection();
httpURLConnection.setRequestMethod("GET"); // or "POST"
//for the HTTP POST method, un-comment the following to write the request body
//httpURLConnection.setDoOutput(true); // only needed when a body is sent
//DataOutputStream ds = new DataOutputStream(httpURLConnection.getOutputStream());
//ds.writeBytes(body);
InputStream content = (InputStream) httpURLConnection.getInputStream();
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(content));
StringBuilder stringBuilder = new StringBuilder(100);
String line = null;
while ((line = bufferedReader.readLine()) != null) {
stringBuilder.append(line);
}
String serverResult = stringBuilder.toString();
//now you have a string representation of the server result page
//do what you need
Hope this helps
I have to do the same. I'm using Struts 2, but I am doing it with a servlet. In struts.xml you have to put ... You can get the content of the URL with HttpURLConnection or with HttpClient (Apache) and write it to the servlet response.
I have this working, but I have problems with the relative links in the HTML, because the browser tries to resolve them against my domain name (the name of the servlet that does the work).
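One possible way around the relative-link problem is to insert a base element into the fetched HTML before writing it to the servlet response, so the browser resolves relative links against the original application. This is only a sketch under assumptions: the class and method names are hypothetical, the page is assumed to have a head tag, the URL is assumed to contain no double quotes, and the original application must be reachable from the user's browser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseHrefInjector {
    // Matches the opening <head> tag (with or without attributes), but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Returns the HTML with <base href="..."> placed right after the opening <head> tag.
    // If no <head> tag is found, the HTML is returned unchanged.
    static String withBaseHref(String html, String originalUrl) {
        Matcher m = HEAD_OPEN.matcher(html);
        if (!m.find()) {
            return html;
        }
        String baseTag = "<base href=\"" + originalUrl + "\">";
        return html.substring(0, m.end()) + baseTag + html.substring(m.end());
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title></head><body><a href=\"job/1\">job</a></body></html>";
        System.out.println(withBaseHref(html, "http://localhost:8080/hudson/"));
    }
}
```

Links that scripts build at runtime are not guaranteed to honour the base element, so this covers plain href and src attributes at best.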
I used url.openConnection() to get text from a webpage,
but I get a noticeable delay when I run it in a loop.
I also tried httpUrl.disconnect(),
but it did not change much.
Can anyone suggest a better option for this?
I used the following code:
for(int i=0;i<10;i++){
URL google = new URL(array[i]);//array of links
HttpURLConnection yc =(HttpURLConnection)google.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
source=source.concat(inputLine);
}
in.close();
yc.disconnect();
}
A couple of issues I can see.
in.readLine() doesn't retain the newline so when you use concat, all the newlines have been removed.
Using concat in a loop like this builds a longer and longer String. This will get slower and slower with each line you add.
Instead you might find IOUtils useful.
URL google = new URL("http://123newyear.com/2011/calendars/"); // the protocol is required, otherwise MalformedURLException is thrown
String text = IOUtils.toString(google.openConnection().getInputStream());
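If adding Commons IO is not an option, both issues can also be addressed with the standard library alone. The sketch below (class and method names are hypothetical) accumulates lines in a StringBuilder instead of calling concat, and puts back the newline that readLine() strips. It takes a Reader so it can be tried without a network connection; in the loop from the question it would be passed an InputStreamReader over yc.getInputStream().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PageReader {
    // Reads everything from the given reader, keeping one '\n' after each line.
    static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // append() grows one buffer; concat() would copy the whole text each time
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readAll(new StringReader("<html>\n<body>hi</body>\n</html>")));
    }
}
```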
See Reading Directly from a URL for details on how to get a stream from which you can read the contents of the URL.
Basically, you:
1. Create a URL object: URL url = new URL("http://123newyear.com/2011/calendars/");
2. Call openStream() on the URL object: BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
3. Read from the stream (like you did).
I've followed a tutorial and came up with the following method to read the webpage content into a CharSequence
public static CharSequence getURLContent(URL url) throws IOException {
URLConnection conn = url.openConnection();
String encoding = conn.getContentEncoding();
if (encoding == null) {
encoding = "ISO-8859-1";
}
BufferedReader br = new BufferedReader(new
InputStreamReader(conn.getInputStream(),encoding));
StringBuilder sb = new StringBuilder(16384);
try {
String line;
while ((line = br.readLine()) != null) {
sb.append(line);
sb.append('\n');
}
} finally {
br.close();
}
return sb;
}
It returns a representation of the webpage specified by the URL.
However, this representation is hugely different from what I see with "view page source" in Firefox, and since I need to scrape data from the original webpage (some data segment in the original "view page source" output), it always fails to find the required text in this Java representation.
Did I go wrong somewhere? I need your advice, guys. Thanks a lot for helping!
You need to use an HTML-parsing library to build a data structure representing the HTML text on this webpage. My recommendation is to use this library: http://htmlparser.sourceforge.net.
Things like the request's User-Agent header and cookies can change what the server returns in the response. So the problem is more likely in the details of the request you are sending than in how you are reading the response.
Things like HttpClient will allow you to more easily simulate the request being sent from a browser.
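Even without HttpClient, the request can be made to look a little more like a browser's by setting a User-Agent header on the URLConnection. The sketch below is one way this might look; the class name, method names and User-Agent string are hypothetical placeholders. It also takes the charset from the Content-Type header, because getContentEncoding() in the question returns the Content-Encoding header (for example gzip), not the character set. It assumes Java 9 or later for readAllBytes().

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class BrowserLikeFetch {
    // Extracts the charset from a Content-Type value such as "text/html; charset=UTF-8".
    // Falls back to ISO-8859-1 when the header is missing or the charset is unusable.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Fetches the page while sending a User-Agent header (placeholder value).
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleFetcher/1.0)");
        Charset charset = charsetFromContentType(conn.getContentType());
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), charset);
        }
    }
}
```

Content that the page builds with JavaScript after loading will still be absent, since nothing here executes scripts.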