I am writing a program in Java that has to make around 6.5 million calls to various pages on the same server (the URL is altered slightly by appending a user name read from a text file). First, I want to know the most time-efficient way of doing this; second, can anybody estimate how much time this might take? Currently I read each URL in a separate task submitted to an ExecutorService, something like this:
ExecutorService executor = Executors.newFixedThreadPool(10);
Runnable worker = new MyRunnable(allUsers[n]);
executor.execute(worker);
and the run method looks like:
is = url.openStream(); // throws an IOException
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
page = page + line;
// More code follows
}
Any suggestions will be highly appreciated
No, nobody can guess how much "time" this will consume. We don't know if the server will take a millisecond or an hour to complete a request.
The most efficient way is to use a more efficient API, one that allows bulk requests.
At 10 threads your program will likely be IO bound. You will need to profile the number of threads you need to ensure full CPU use. You can avoid this using Java 7 features / an NIO framework, like Netty or MINA, so one thread can service many requests concurrently. (I'm not sure if these are client side).
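As a rough sketch of the "one thread can service many requests" idea: the JDK has shipped its own asynchronous client, java.net.http.HttpClient, since Java 11. That is newer than the Java 7 / Netty / MINA options mentioned above, so treat it as a swapped-in technique. The base URL, the class name AsyncFetcher and the cap on in-flight requests are assumptions for illustration, not anything taken from the question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.BiConsumer;

final class AsyncFetcher {
    // Hypothetical base URL; the real one is not given in the question.
    static final String BASE = "http://example.com/users/";

    /** Builds the per-user URL, percent-encoding the user name. */
    static URI userUri(String base, String user) {
        // URLEncoder does form encoding (space -> '+'), so turn '+' into %20 for a path.
        String encoded = URLEncoder.encode(user, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(base + encoded);
    }

    /** Fetches every user's page, keeping at most maxInFlight requests open at once. */
    static void fetchAll(List<String> users, int maxInFlight,
                         BiConsumer<String, String> onPage) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        Semaphore permits = new Semaphore(maxInFlight);
        for (String user : users) {
            permits.acquire(); // blocks the submitting thread once the cap is reached
            HttpRequest request = HttpRequest.newBuilder(userUri(BASE, user)).GET().build();
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .whenComplete((response, error) -> {
                      permits.release();
                      // Failed requests are silently skipped here; a real run would log or retry.
                      if (error == null) {
                          onPage.accept(user, response.body());
                      }
                  });
        }
        permits.acquire(maxInFlight); // wait until the remaining requests have finished
    }
}
```

The semaphore is only there so that 6.5 million requests are not all queued at once; the right cap depends on what the server tolerates and has to be measured.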
I concur with the other comments and answers that it is impossible to predict how long this will take, and that a "bulk transfer" request will most likely give better performance.
A couple more points:
If you can use a RESTful API that returns JSON or XML instead of web pages ... that will be faster.
On the client side, this is going to be inefficient if the documents you are fetching are large:
while ((line = br.readLine()) != null) {
page = page + line;
}
That is going to do an excessive amount of copying. A better approach is this:
StringBuilder sb = new StringBuilder(...);
while ((line = br.readLine()) != null) {
sb.append(line);
}
page = sb.toString();
If you can get a good estimate of the page size, then create the StringBuilder that big.
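Put together as a self-contained helper, that could look like the sketch below. The class name PageReader and the expectedChars parameter are made up for illustration; like the loop in the question, it drops line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class PageReader {
    /** Reads everything from source into one String, pre-sizing the builder. */
    static String readAll(Reader source, int expectedChars) {
        // expectedChars is only a capacity hint; the builder still grows if the page is larger.
        StringBuilder sb = new StringBuilder(expectedChars);
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line); // readLine() strips the line terminator, as in the question's loop
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With a URL you would call it as PageReader.readAll(new InputStreamReader(url.openStream()), 16 * 1024), where 16 * 1024 is just a guess at the page size.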
I'm trying to parse a huge CSV (56595 lines) into a JSONArray but it's taking a considerable amount of time. This is what my code looks like and it takes ~17 seconds to complete. I'm limiting my results based on one of the columns but the code still has to go through the entire CSV file.
Is there a more efficient way to do this? I've excluded the catch's, finally's and throws to save space.
File
Code
...
BufferedReader reader = null;
String line = "";
//jArray is retrieved by an ajax call and used in a graph
JSONArray jArray = new JSONArray();
HttpClient httpClient = new DefaultHttpClient();
try {
//url = CSV file
HttpGet httpGet = new HttpGet(url);
HttpResponse response = httpClient.execute(httpGet);
int responseCode = response.getStatusLine().getStatusCode();
if (responseCode == 200) {
try {
reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
while (((line = reader.readLine()) != null)) {
JSONObject json = new JSONObject();
String[] row = line.split(",");
//skips first three rows
if(row.length > 2){
//map = 4011
if(row[1].equals(map)) {
json.put("col0", row[0]);
json.put("col1", row[1]);
json.put("col2", row[2]);
json.put("col3", row[3]);
json.put("col4", row[4]);
json.put("col5", row[5]);
json.put("col6", row[6]);
jArray.put(json);
}
}
} // end of while loop
return jArray;
}
...
Unfortunately, the main delay will most likely be in downloading the file over HTTP, which you cannot control, so any gains will have to come from optimizing your own code. Based on the info you provided, I can suggest some enhancements to your algorithm:
It was a good idea to process the input file in streaming mode, reading line by line with a BufferedReader. Usually it is good practice to set an explicit buffer size (BufferedReader's default is 8 KB), but since the source is a network connection, I doubt it will make much difference in this case. Anyway, you could try 16 KB, for instance.
Since the number of output items is very low (49, you said), it doesn't matter that you store them in an array (for a larger number, I would have recommended choosing another collection, like LinkedList), but it is always useful to pre-size it with an estimated size. In JSONArray, I suppose it would be enough to put a null item at position 100 (for example) at the beginning of your method.
The biggest cost I can think of is the line line.split(","), because it makes the program go through the whole line and copy its contents character by character into an array and, worst of all, the result is eventually used in only 0.1% of cases.
And there might be an even worse drawback: merely splitting by comma might not be a good way to properly parse a CSV line. I mean: are you sure the CSV values cannot contain a comma as part of user data?
To solve this problem, I suggest you code your own custom CSV parsing algorithm, which might be a little hard, but it will be worth the effort. You must code a state machine in which you detect the second value and, only if it matches the filtering value ("4011"), continue parsing the rest of the line. In this way, you will save a large amount of time and memory.
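A much simpler stand-in for that state machine is to look only at the second field before deciding whether to split at all. The sketch below assumes plain comma-separated lines without quoted fields (so it does not address the embedded-comma concern above); the class and method names are made up for illustration.

```java
final class CsvFilter {
    /** Returns true when the second comma-separated field of line equals wanted. */
    static boolean secondFieldEquals(String line, String wanted) {
        int first = line.indexOf(',');
        if (first < 0) {
            return false; // no comma at all, so there is no second field
        }
        int start = first + 1;
        int second = line.indexOf(',', start);
        int end = (second < 0) ? line.length() : second;
        // Compare in place, without allocating a substring or an array.
        return end - start == wanted.length()
                && line.regionMatches(start, wanted, 0, wanted.length());
    }
}
```

In the loop from the question you would call CsvFilter.secondFieldEquals(line, map) first and only run line.split(",") and build the JSONObject for the lines that pass.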
We are streaming data from server to client due to a huge volume. It's basically a browser requesting for search results (these records are streamed to the browser). In the performance test, it streamed 150,00+ records in batches of 1,000.
Option 1: use WebSocket, sending each batch of records via org.springframework.messaging.simp.SimpMessageSendingOperations, which always completed within 30-40 seconds.
Option 2: use a normal output stream, via the java.io.Writer which Spring MVC injects into the controller method. However, it performed poorly, taking 3-4 minutes!
The reason why we are considering option 2 is due to some inconsistent behaviour with Websockets (as admittedly we are using it for the first time). Hence, I coded our fallback to normal streaming via outputstream but the slowdown is not acceptable. Does anyone have any idea why there is a huge performance difference between the 2 options? Is it still possible to make option 2 faster (currently I will be playing around with the batch size)?
The snippet for the batch writing process (where the app reads the search results from a NoSQL input stream and writes to the browser's output stream) is as follows:
public class DefaultStreamProcessor implements StreamProcessor {
public void process(InputStream searchResponse) {
try(BufferedReader br = new BufferedReader(new InputStreamReader(searchResponse))) {
String line = null;
while ((line = br.readLine()) != null) {
addToBatch(line);
if (isBatchFull()) {
//option 1: use websocket
//simpMessageSendingOperations.convertAndSend("/topic/result/"+searchId, new ResultObject(getBatchAsMessage()));
//option 2: use normal printwriter
//writer.write(getBatchAsMessage());
}
}
}
}
}
I've been trying to get information from a webpage, specifically this site: http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D (among other similar ones). I'm using the URL and URLConnection packages to do so. I'm trying to get a certain number from the webpage - on this page, I want the total number of articles (16428).
It says this near the top of the page: "Results: 1 to 20 of 16428" and when I look at the page source manually I can find this. However, when I try to use the java connection to obtain this number from the page source, for some reason the number it gets is "863399" instead of "16428".
Code:
URL connection = new URL("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D");
URLConnection yc = connection.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String html = "";
String inputLine;
while ((inputLine = in.readLine()) != null) html += inputLine;
in.close();
int startMarker = html.indexOf("ncbi_resultcount");
int endMarker = html.indexOf("ncbi_op");
System.out.println(html.substring(startMarker, endMarker));
When I run this code, I get:
ncbi_resultcount" content="863399" />
rather than:
ncbi_resultcount" content="16428" />
Does anyone know why this is / how I can fix it?
Thanks!
I can't reproduce your problem and I have no idea why this is happening. Perhaps the site is sniffing specific Java user agent versions. If so, try setting the User-Agent header to something else so that the request looks like it comes from a "real" web browser.
yc.setRequestProperty("User-Agent", "Mozilla");
Unrelated to the concrete problem, I'd suggest using a real HTML parser for this job, such as Jsoup. It's then as easy as:
Document document = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed?term=%22pulmonary%20disease%2C%20chronic%20obstructive%22%5BMesh%5D").get();
Element nbci_resultcount = document.select("meta[name=ncbi_resultcount]").first();
System.out.println(nbci_resultcount.attr("content")); // 16433
I'm trying to get an entire web page through a URLConnection.
What's the most efficient way to do this?
I'm doing this already:
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while(line!=null){
html.append(line);
line = bf.readLine();
}
bf.close();
html has the entire HTML page.
I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.
Anything you do to parallelize the reading would overly complicate the solution.
I'd recommend using a StringBuilder with a default size of 2048 or 4096.
Why are you thinking that the code you posted isn't sufficient? You sound like you're guilty of premature optimization.
Run with what you have and sleep at night.
What do you want to do with the obtained HTML? Parse it? It may be good to know that a decent HTML parser usually has a constructor or method that takes a URL or InputStream directly, so you don't need to worry about streaming performance like that.
Assuming that all you want to do is described in your previous question, with Jsoup, for example, you could obtain all those news links extraordinarily easily as follows:
Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
System.out.println(newsLink.attr("href"));
}
This yields the following after only a few seconds:
http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html
Did someone already say that regex is absolutely the wrong tool to parse HTML? ;)
See also:
Pros and cons of HTML parsers in Java
Your approach looks pretty good, however you can make it somewhat more efficient by avoiding the creation of intermediate String objects for each line.
The way to do this is to read directly into a temporary char[] buffer.
Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
char[] charBuffer = new char[4096];
int count=0;
do {
count=bf.read(charBuffer, 0, 4096);
if (count>=0) html.append(charBuffer,0,count);
} while (count>0);
bf.close();
For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
You can try using commons-io from apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html)
new String(IOUtils.toCharArray(connection.getInputStream()))
There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection.
HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks, rather than buffering all of the content before you start doing work. This can lead to an improved user experience.
Also, HttpURLConnection supports persistent connections. Why close that connection if you're going to request another resource right away? Keeping the TCP connection open with the web server allows your application to quickly download multiple resources without paying the overhead (latency) of establishing a new TCP connection for each resource.
Tell the server that you support gzip and wrap a BufferedReader around GZIPInputStream if the response header says the content is compressed.
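A minimal sketch of the gzip part, assuming the body is UTF-8 text. The class name GzipBody and both method names are made up for illustration; gzip() exists only to build sample input for trying the reader without a server.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBody {
    /** Reads the whole body, decompressing it when Content-Encoding is gzip. */
    static String readBody(InputStream raw, String contentEncoding) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(raw)
                    : raw;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) { // UTF-8 is an assumption
                StringBuilder sb = new StringBuilder();
                char[] buffer = new char[4096];
                int count;
                while ((count = reader.read(buffer)) != -1) {
                    sb.append(buffer, 0, count);
                }
                return sb.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper that compresses text, used only to produce sample input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Against a real connection you would first call connection.setRequestProperty("Accept-Encoding", "gzip") and then GzipBody.readBody(connection.getInputStream(), connection.getContentEncoding()).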
I want to use curl in Java. Is curl built into Java, or do I have to install it from a third-party source to use it with Java? If it needs to be installed separately, how can that be done?
You can make use of java.net.URL and/or java.net.URLConnection.
URL url = new URL("https://stackoverflow.com");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
for (String line; (line = reader.readLine()) != null;) {
System.out.println(line);
}
}
See also Oracle's simple tutorial on the subject. It is a bit verbose, however. To end up with less verbose code, you may want to consider Apache HttpClient instead.
By the way: if your next question is "How to process HTML result?", then the answer is "Use a HTML parser. No, don't use regex for this.".
See also:
How to use java.net.URLConnection to fire and handle HTTP requests?
What are the pros and cons of the leading Java HTML parsers?
Some people have already mentioned HttpURLConnection, URL and URLConnection. If you need all the control and extra features that the curl library provides you (and more), I'd recommend Apache's httpclient.
The Runtime object allows you to execute external command-line applications from Java and would therefore let you use cURL. However, as the other answers indicate, there is probably a better way to do what you are trying to do. If all you want to do is download a file, the URL object will work great.
Using standard Java libraries, I suggest looking at the HttpURLConnection class
http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html
It can handle most of what curl can do with setting up the connection.
What you do with the stream is up to you.
Curl is a non-java program and must be provided outside your Java program.
You can easily get much of the functionality using Jakarta Commons Net, unless there is some specific functionality like "resume transfer" you need (which is tedious to code on your own)
Use Runtime to call Curl. This code works for both Ubuntu and Windows.
String[] commands = new String[] {"curl", "-X", "GET", "http://checkip.amazonaws.com"};
Process process = Runtime.getRuntime().exec(commands);
BufferedReader reader = new BufferedReader(new
InputStreamReader(process.getInputStream()));
String line;
// A plain String has no append(); collect the output in a StringBuilder instead
StringBuilder response = new StringBuilder();
while ((line = reader.readLine()) != null) {
response.append(line);
}
Paste your curl command into curlconverter.com/java/ and it'll convert it into Java code using java.net.URL and java.net.URLConnection.