I have this, but I was wondering if there is a faster way:
URL url=new URL(page);
InputStream is = new BufferedInputStream(url.openConnection().getInputStream());
BufferedReader in=new BufferedReader(new InputStreamReader(is));
String tmp="";
StringBuilder sb=new StringBuilder();
while((tmp=in.readLine())!=null){
sb.append(tmp);
}
The network is probably the biggest overhead, so there isn't much you can do on the Java side. But using IOUtils from Apache Commons IO is at least much quicker to write:
String page = IOUtils.toString(url.openConnection().getInputStream());
Remember to close the underlying stream.
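One way to honor that reminder without extra libraries is try-with-resources. This is only a minimal sketch using the JDK: the class name PageReader and the method readAndClose are made up for illustration, and UTF-8 is an assumed encoding that may not match the page you fetch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class PageReader {

    // Reads the stream to the end, decodes it with the given charset,
    // and closes the stream even if reading fails.
    static String readAndClose(InputStream source, Charset charset) throws IOException {
        try (InputStream in = source) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
            return new String(bytes.toByteArray(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PageReader <url>");
            return;
        }
        // Assumption: the page is UTF-8 encoded.
        URL url = new URL(args[0]);
        System.out.println(readAndClose(url.openStream(), StandardCharsets.UTF_8));
    }
}
```

Decoding the bytes once at the end, rather than line by line, also keeps the original line breaks intact.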
If you need to manipulate the HTML, use a library for it, for example jsoup.
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
Example:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
If you're using Apache Commons IO's IOUtils as Tomasz suggests, there's an even simpler method: toString(URL), or its preferred cousins that take a charset (of course that requires knowing the resource's charset in advance).
String string = IOUtils.toString( new URL( "http://some.url" ));
or
String string = IOUtils.toString( new URL( "http://some.url" ), "US-ASCII" );
Related
Given a string:
String exampleString = "example";
How do I convert it to an InputStream?
Like this:
InputStream stream = new ByteArrayInputStream(exampleString.getBytes(StandardCharsets.UTF_8));
Note that this assumes that you want an InputStream that is a stream of bytes that represent your original string encoded as UTF-8.
For versions of Java less than 7, replace StandardCharsets.UTF_8 with "UTF-8".
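To illustrate why the charset matters, here is a small sketch (the class name StringStreams and the method toStream are hypothetical) showing that the same string yields a different number of bytes depending on the encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class StringStreams {

    // Encodes the text with the given charset and exposes the bytes as a stream.
    static InputStream toStream(String text, Charset charset) {
        return new ByteArrayInputStream(text.getBytes(charset));
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // "café": three ASCII letters plus one accented letter
        // UTF-8 encodes the accented letter as two bytes.
        System.out.println(toStream(text, StandardCharsets.UTF_8).available());      // prints 5
        // ISO-8859-1 encodes it as a single byte.
        System.out.println(toStream(text, StandardCharsets.ISO_8859_1).available()); // prints 4
    }
}
```

Whoever reads the stream has to decode it with the same charset that was used here to encode it.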
I find that using Apache Commons IO makes my life much easier.
String source = "This is the source of my input stream";
InputStream in = org.apache.commons.io.IOUtils.toInputStream(source, "UTF-8");
You may find that the library also offers many other shortcuts for common tasks that you can use in your project.
You could use a StringReader and convert the reader to an input stream using the solution in this other Stack Overflow post.
There are two ways to convert a String to an InputStream in Java:
Using ByteArrayInputStream
Example:
String str = "String contents";
InputStream is = new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8));
Using Apache Commons IO
Example:
String str = "String contents";
InputStream is = IOUtils.toInputStream(str, StandardCharsets.UTF_8);
You can try cactoos for that.
final InputStream input = new InputStreamOf("example");
The object is created with new and not a static method for a reason.
I'm trying to parse a URL with Jsoup whose page contains the following text: Ætterni.
After parsing the document, the same string looks like this: &AElig;tterni.
How do I prevent this from happening? I want the document exactly 1:1 as it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding=doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink),docEncoding);
writer.write(doc.html());
writer.close();
To avoid the conversion to entities, use:
doc.outputSettings().escapeMode(EscapeMode.xhtml);
You don't seem to be using any of Jsoup's capabilities here. I'd just stream the raw HTML using java.net.URL. That way you get a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
Do not use a Reader/Writer for this: unless you specify the charset explicitly, the platform default encoding is used, which can corrupt characters when the source's encoding is different or unknown.
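The "usual Java IO way" of copying could look roughly like the sketch below. It works on raw bytes only, so no charset is involved at any point. The class name RawCopy and the method copy are hypothetical names chosen for this example, and the URL and target file are taken from the command line.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

// Hypothetical helper, for illustration only.
class RawCopy {

    // Copies all bytes from in to out and returns the number of bytes copied.
    // The caller remains responsible for closing both streams.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RawCopy <url> <target-file>");
            return;
        }
        // try-with-resources closes both streams, even on failure.
        try (InputStream input = new URL(args[0]).openStream();
             OutputStream output = new FileOutputStream(args[1])) {
            System.out.println(copy(input, output) + " bytes copied");
        }
    }
}
```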
I'm trying to get an entire web page through a URLConnection.
What's the most efficient way to do this?
I'm doing this already:
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while(line!=null){
html.append(line);
line = bf.readLine();
}
bf.close();
html has the entire HTML page.
I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.
Anything you do to parallelize the reading would overly complicate the solution.
I'd recommend using a StringBuilder with an initial capacity of 2048 or 4096.
Why are you thinking that the code you posted isn't sufficient? You sound like you're guilty of premature optimization.
Run with what you have and sleep at night.
What do you want to do with the obtained HTML? Parse it? It may be good to know that any decent HTML parser already has a constructor or method that takes a URL or InputStream directly, so you don't need to worry about streaming performance yourself.
Assuming that all you want to do is what you described in your previous question, with Jsoup for example you could obtain all those news links very easily, as follows:
Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
System.out.println(newsLink.attr("href"));
}
This yields the following after only a few seconds:
http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html
Did someone already say that regex is absolutely the wrong tool to parse HTML? ;)
See also:
Pros and cons of HTML parsers in Java
Your approach looks pretty good, however you can make it somewhat more efficient by avoiding the creation of intermediate String objects for each line.
The way to do this is to read directly into a temporary char[] buffer.
Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
char[] charBuffer = new char[4096];
int count=0;
while ((count = bf.read(charBuffer, 0, charBuffer.length)) != -1) {
html.append(charBuffer, 0, count);
}
bf.close();
For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
You can try using Apache Commons IO (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html):
new String(IOUtils.toCharArray(connection.getInputStream()))
There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection (for an http URL, the object returned by url.openConnection() can be cast to it).
HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks, rather than buffering all of the content before you start doing work. This can lead to an improved user experience.
Also, HttpURLConnection supports persistent connections. Why close that connection if you're going to request another resource right away? Keeping the TCP connection open with the web server allows your application to quickly download multiple resources without paying the overhead (latency) of establishing a new TCP connection for each resource.
Tell the server that you support gzip and wrap a BufferedReader around GZIPInputStream if the response header says the content is compressed.
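A rough sketch of that gzip suggestion, using only JDK classes. The class name GzipAwareReader and the method open are invented for this example, and UTF-8 is an assumed page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, for illustration only.
class GzipAwareReader {

    // Wraps the response body in a GZIPInputStream only when the
    // Content-Encoding header value says the body is gzip-compressed.
    static BufferedReader open(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: GzipAwareReader <url>");
            return;
        }
        URLConnection connection = new URL(args[0]).openConnection();
        // Tell the server that a gzip-compressed response is acceptable.
        connection.setRequestProperty("Accept-Encoding", "gzip");
        // Assumption: the page is UTF-8 encoded.
        try (BufferedReader reader = open(connection.getInputStream(),
                connection.getContentEncoding(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The server is free to ignore the Accept-Encoding header, which is why the sketch checks the response's Content-Encoding before decompressing.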