Get URL from InputStream - java

I'm wondering how InputStream in Java is implemented from a low-level perspective.
Suppose I write the Java code below to make a connection to a website.
URL url = new URL("[some url info]");
URLConnection urlcon = url.openConnection();
InputStream in = urlcon.getInputStream();
byte[] buffer = new byte[4096];
int readcount;
while ((readcount = in.read(buffer)) != -1) {
    fos.write(buffer, 0, readcount);   // fos: a FileOutputStream opened elsewhere
}
Could I find out the URL from the InputStream ("in" in the above code block) directly by casting its type and calling an appropriate method, like below? Are there any other ways to get the URL from an InputStream?
NewType casted = (NewType) in;
String url = casted.appropriateMethod();
I searched all subclasses of InputStream, but I couldn't find any class with an interface that exposes its URL.
(https://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html)
However, I think the InputStream must somehow hold information about the URL it receives data from.
I might have a big misunderstanding about "Stream".
Thank you for reading my question. :)

An InputStream is just a stream, not a URLConnection. When you write InputStream in = urlcon.getInputStream(); you get an input stream, not the URL connection itself. When you call in.read(...) you are reading from the stream, not from the URLConnection, so the stream has no need to know where its bytes come from.
An InputStream is a reference to a source of data (be it a file, network connection, etc.) that we want to process as follows:
we generally want to read the data as "raw bytes" and then write our own code to do something interesting with the bytes;
we generally want to read the data in sequential order: that is, to get to the nth byte of data, we have to read all the preceding bytes first, and we're not guaranteed to be able to "jump back" again once we've read them.
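If you need the URL later, the usual approach is simply to keep a reference to the URL or the URLConnection alongside the stream; URLConnection exposes getURL(). A minimal sketch (the example URL is a placeholder, not from the question):

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class UrlAwareRead {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/");          // placeholder URL
        URLConnection urlcon = url.openConnection();

        try (InputStream in = urlcon.getInputStream()) {
            // The stream itself does not know its origin, but the connection
            // it came from does.
            System.out.println("Reading from: " + urlcon.getURL());

            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // process the bytes here
            }
        }
    }
}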

Related

How to Write Image File from Request Body Without Writing to Memory Nodejs

I'm trying to (HTTP) POST an image to a Nodejs server that is configured using Express. I have been able to accomplish this successfully using JSON, but unless I am mistaken, there is no way to obtain the image string without loading the entire request body into a new variable before parsing it as JSON. Since images are quite large and the image should already be stored in the request body anyway, is there a way to immediately pipe the image contents into fs.writeFile()? The content type for the request does not have to be JSON. I have tried using a querystring as well, but that was unsuccessful. The content type cannot be just an image though because I have to include a tag for the image too (in this case the user's email address).
Here is the code for when I attempted to use a query string. It is located in a post route method for the express app:
fs.writeFile('profiles/images/user.png', new Buffer(req.body.image, 'base64'),
    function (error) {
        if (error)
            res.end(error);
    }
);
No error occurs, and the code creates the .png file, but the file is somehow corrupted and is larger than it should be.
All of this is actually for an Android app, so here is also the Java code that I am using to send the request:
URLConnection connection = new URL(UPLOAD_PICTURE_URL).openConnection();
connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;");
connection.setDoOutput(true);
String image = Base64.encodeToString(
IOUtils.toByteArray(new FileInputStream(filePath)),
Base64.NO_WRAP
);
OutputStreamWriter out = new OutputStreamWriter(connection.getOutputStream());
out.write("email=" + email + "&image=" + image);
out.close();
Perhaps this belongs in another topic, but along the same lines, does anybody know a way to pipe the file input stream in the Android code directly into the URLConnection's output stream with Base64 encoding? I have tried writing the string literal (the out.write() line above) and then creating a Base64OutputStream to write the image before piping that stream into the URLConnection's output stream, but calling req.body.image in the Node app after doing that just returns undefined. And finally, does anybody know whether IOUtils.toByteArray() (from Apache Commons), when used as the input argument for an input/output stream constructor, loads the entire byte array into memory on the Android side anyway? If so, is there a way of avoiding that too?
Thanks in advance.
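As far as the follow-up about piping the Android file stream straight into the URLConnection goes, a rough sketch is possible with android.util.Base64OutputStream, assuming the server is adjusted to read the raw request body rather than a form field (the query-parameter email below is purely illustrative):

// Sketch only: streams the image through a Base64 encoder into the request body,
// so the whole file is never held in memory at once. Assumes android.util.Base64,
// android.util.Base64OutputStream, and a server that reads the raw body.
void uploadPicture(String uploadUrl, String email, String filePath) throws IOException {
    HttpURLConnection connection = (HttpURLConnection) new URL(
            uploadUrl + "?email=" + URLEncoder.encode(email, "UTF-8")).openConnection();
    connection.setRequestMethod("POST");
    connection.setDoOutput(true);
    connection.setChunkedStreamingMode(8192);   // don't buffer the whole body client-side

    InputStream fileIn = new FileInputStream(filePath);
    OutputStream b64Out = new Base64OutputStream(connection.getOutputStream(), Base64.NO_WRAP);
    byte[] buf = new byte[8192];
    int n;
    while ((n = fileIn.read(buf)) != -1) {
        b64Out.write(buf, 0, n);
    }
    fileIn.close();
    b64Out.close();                              // flushes the final Base64 block
    connection.getInputStream().close();         // actually sends the request
}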

How to programmatically read the source code of a webpage after JavaScript has altered the DOM?

I want to view the source code of a web page, but JavaScript changes it.
E.g. https://delicious.com/search/ali : when I press CTRL+U on this page it shows the source as the server sent it, not the DOM after JavaScript has changed it. If I look with Inspect Element, it shows the complete, modified source, and that is what I want to get.
Kindly let me know whether there is any technique to get the source code that Inspect Element shows. I am building a piece of software and this is one of its requirements. It would be good if the technique or API you refer me to is in Java.
I am going to build software which gets URLs from this site, but because of the changes made by JavaScript I can't get the actual source code.
I'm not sure, but this might be what you are asking for. The code takes a URL object, gets the server's response, and returns its body, which should be an HTML document in your case.
String getSource(URL url) throws IOException {
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    byte[] bytes = new byte[512];
    try (BufferedInputStream bis = new BufferedInputStream(connection.getInputStream())) {
        // getInputStream() returns only the response body; the headers are not
        // part of it, so there is nothing to split off.
        StringBuilder response = new StringBuilder(500);
        int in;
        while ((in = bis.read(bytes)) != -1) {
            response.append(new String(bytes, 0, in));
        }
        return response.toString();
    }
}
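Note that the plain URLConnection approach above only gives you the HTML exactly as the server sent it; it never executes the page's JavaScript. If what you want really is the DOM after the scripts have run, you need something that behaves like a browser. A rough sketch with HtmlUnit (my own suggestion, not part of the thread; assumes a recent HtmlUnit version where WebClient is AutoCloseable):

// Sketch only: let the page's JavaScript execute, then read the resulting DOM.
try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = webClient.getPage("https://delicious.com/search/ali");
    webClient.waitForBackgroundJavaScript(10000);   // give background scripts time to finish
    String renderedSource = page.asXml();           // the DOM after JavaScript has run
    System.out.println(renderedSource);
}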

BlackBerry UTF-8 InputStreamReader on Socket issue

I'm trying to read the response from a server using a socket and the information is UTF-8 encoded. I'm wrapping the InputStream from the socket in an InputStreamReader with the encoding set to "UTF-8".
For some reason it seems like only part of the response is read and then the reading just hangs for about a minute or two and then it finishes. If I set the encoding on the InputStreamReader to "ISO-8859-1" then I can read all of the data right away, but obviously not all of the characters are displayed correctly.
The code looks something like the following:
socketConn = (SocketConnection)Connector.open(url);
InputStreamReader is = new InputStreamReader(socketConn.openInputStream(), "UTF-8");
Then I read through the headers and the content. The content is chunked, and I read the line with the size of each chunk (converting it from hex to decimal) to know how much to read.
I don't understand what difference the encoding should make: reading works without issue with ISO-8859-1, and it eventually works with UTF-8 as well; there is just the long delay.
It's hard to tell the reason for the delay.
You may try another way of getting the data from the network:
byte[] data = IOUtilities.streamToBytes(socketConn.openInputStream());
I believe the above should complete without any delay. Then, having got the bytes from the network, you can start processing the data. Note you can always get a String from bytes representing text in UTF-8 encoding:
String stringInUTF8 = new String(bytes, "UTF-8");
UPDATE: see the second comment to this post.
I was already removing the chunk sizes on the fly so I ended up doing something somewhat similar to the IOUtilities answer. Instead of using an InputStreamReader I just used the InputStream. InputStream has a read method that can fill an array of bytes, so for each chunk the code looks something like this
byte[] buf = new byte[size];            // size = chunk size parsed from the hex header
int n = is.read(buf);                   // note: read() may return fewer than size bytes
return new String(buf, 0, n, "UTF-8");
This seems to work, doesn't cause any delays and I can remove the extra information about the chunks on the fly.
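Since read() is free to return fewer bytes than requested, a slightly more defensive sketch loops until the whole chunk has arrived (the method name and signature are just for illustration):

// Illustration only: keep reading until the chunk is complete.
private static String readChunk(InputStream is, int size) throws IOException {
    byte[] buf = new byte[size];
    int off = 0;
    while (off < size) {
        int n = is.read(buf, off, size - off);
        if (n == -1) {
            throw new EOFException("stream ended mid-chunk");
        }
        off += n;
    }
    return new String(buf, "UTF-8");
}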

URLConnection and content length: how much data is downloaded?

I've created a servlet which reads the content of a file to a byte array which subsequently is written to the OutputStream of the response:
// set headers
resp.setHeader("Content-Disposition","attachment; filename=\"file.txt\"");
resp.setHeader("Content-Length", "" + fileSize);
// output file content.
OutputStream out = resp.getOutputStream();
out.write(fileBytes);
out.close();
Now, I've also written a "client" which needs to find out how big the file is. This should be easy enough as I've added the "Content-Length" header.
URLConnection conn = url.openConnection();
long fileSize = conn.getContentLength();
However, I am a little uncertain about the big picture. As I understand my own servlet, the entire file content is dumped to the OutputStream of the response. However, does calling getContentLength() also result in the actual file data somehow partially or fully being downloaded? In other words, when I invoke conn.getContentLength(), how much of the file will be returned from the server? Do the headers come "separate" from the content?
All input highly appreciated!
However, does calling getContentLength() also result in the actual file data somehow partially or fully being downloaded?
No, the getContentLength() method just returns the value of the Content-Length header, parsed as an int.
In other words, when I invoke conn.getContentLength(), how much of the file will be returned from the server?
None of the file will be downloaded.
Do the headers come "separate" from the content?
Yes, the headers come "separate" from the content.
Now you're certain :D
See the javadocs
Returns the value of the content-length header field.
So a call to getContentLength() merely reads the header value and does not cause the body to be downloaded. You have to call getContent() or read from getInputStream() for that.
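If you only ever want the size and never the body, one option (not mentioned above, just a suggestion) is to issue a HEAD request, so the server sends headers only:

// Sketch: HEAD asks for the headers only, so no body is transferred at all.
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");
long fileSize = conn.getContentLengthLong();   // -1 if the header is missing
conn.disconnect();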

What is the best way to extract the entire content from a BufferedReader object in Java?

I'm trying to get an entire web page through a URLConnection.
What's the most efficient way to do this?
I'm doing this already:
URL url = new URL("http://www.google.com/");
URLConnection connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while (line != null) {
    html.append(line);
    line = bf.readLine();
}
bf.close();
html has the entire HTML page.
I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.
Anything you do to parallelize the reading would overly complicate the solution.
I'd recommend using a StringBuilder with an initial capacity of 2048 or 4096.
Why do you think the code you posted isn't sufficient? You sound like you're guilty of premature optimization.
Run with what you have and sleep at night.
What do you want to do with the obtained HTML? Parse it? It may be good to know that any halfway decent HTML parser already has a constructor or method argument which takes a URL or InputStream directly, so you don't need to worry about streaming performance like that.
Assuming that all you want to do is what you described in your previous question, then with Jsoup, for example, you could obtain all those news links extraordinarily easily, as follows:
Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
    System.out.println(newsLink.attr("href"));
}
This yields the following after only a few seconds:
http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html
Did someone already say that regexes are absolutely the wrong tool for parsing HTML? ;)
See also:
Pros and cons of HTML parsers in Java
Your approach looks pretty good; however, you can make it somewhat more efficient by avoiding the creation of an intermediate String object for each line.
The way to do this is to read directly into a temporary char[] buffer.
Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):
URL url = new URL("http://www.google.com/");
URLConnection connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
char[] charBuffer = new char[4096];
int count = 0;
do {
    count = bf.read(charBuffer, 0, 4096);
    if (count >= 0) html.append(charBuffer, 0, count);
} while (count > 0);
bf.close();
For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
You can try using commons-io from apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html)
new String(IOUtils.toCharArray(connection.getInputStream()))
There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection.
HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks rather than buffering all of the content before you start doing work. This can lead to an improved user experience.
Also, HttpURLConnection supports persistent connections. Why close the connection if you're going to request another resource right away? Keeping the TCP connection open to the web server allows your application to quickly download multiple resources without paying the overhead (latency) of establishing a new TCP connection for each one.
Tell the server that you support gzip and wrap a BufferedReader around GZIPInputStream if the response header says the content is compressed.
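A minimal sketch of that last point, assuming the response body is UTF-8 encoded:

// Advertise gzip support, then unwrap the stream only if the server actually used it.
HttpURLConnection connection = (HttpURLConnection) new URL("http://www.google.com/").openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");

InputStream in = connection.getInputStream();
if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
    in = new GZIPInputStream(in);
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));   // charset is an assumption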
