I want to use curl in Java. Is curl built into Java, or do I have to install it from a third-party source to use it with Java? If it needs to be installed separately, how can that be done?
You can make use of java.net.URL and/or java.net.URLConnection.
URL url = new URL("https://stackoverflow.com");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
    for (String line; (line = reader.readLine()) != null;) {
        System.out.println(line);
    }
}
See also Oracle's simple tutorial on the subject. It is a bit verbose, however. For less verbose code, you may want to consider Apache HttpClient instead.
By the way: if your next question is "How to process the HTML result?", then the answer is "Use an HTML parser. No, don't use regex for this.".
See also:
How to use java.net.URLConnection to fire and handle HTTP requests?
What are the pros and cons of the leading Java HTML parsers?
Some people have already mentioned HttpURLConnection, URL and URLConnection. If you need all the control and extra features that the curl library provides you (and more), I'd recommend Apache's httpclient.
The Runtime object allows you to execute external command-line applications from Java, so it would let you use cURL. However, as the other answers indicate, there is probably a better way to do what you are trying to do. If all you want to do is download a file, the URL object will work great.
Using the standard Java libraries, I suggest looking at the HttpURLConnection class:
http://java.sun.com/javase/6/docs/api/java/net/HttpURLConnection.html
It can handle most of what curl can do with setting up the connection.
What you do with the stream is up to you.
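As a rough illustration of what "setting up the connection" can look like, the sketch below configures an HttpURLConnection in roughly the way `curl -X POST -H "Accept: application/json"` would. The class name CurlLikeRequest, the method name prepare, the timeouts, and the header value are assumptions made for illustration; they do not come from the answer above. Nothing is sent until the caller opens a stream or asks for the response code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: names, timeouts, and header values are illustrative only.
final class CurlLikeRequest {

    // Prepares (but does not send) a request, similar to:
    //   curl -X <method> -H "Accept: <accept>" <url>
    static HttpURLConnection prepare(String url, String method, String accept) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod(method);              // like curl -X
            conn.setRequestProperty("Accept", accept);  // like curl -H
            conn.setConnectTimeout(5_000);              // like curl --connect-timeout 5
            conn.setReadTimeout(10_000);                // give up if the server stalls
            conn.setInstanceFollowRedirects(true);      // roughly curl -L, same protocol only
            return conn;
        } catch (IOException e) {
            // MalformedURLException and ProtocolException are both IOExceptions.
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then read `conn.getResponseCode()` and `conn.getInputStream()`, which is the point where the request actually goes out.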
Curl is a non-java program and must be provided outside your Java program.
You can easily get much of the functionality using Jakarta Commons Net, unless there is some specific functionality like "resume transfer" you need (which is tedious to code on your own)
Use Runtime to call Curl. This code works for both Ubuntu and Windows.
// needs: import java.io.BufferedReader; import java.io.InputStreamReader;
String[] commands = new String[] {"curl", "-X", "GET", "http://checkip.amazonaws.com"};
Process process = Runtime.getRuntime().exec(commands);
// Collect curl's standard output line by line
StringBuilder response = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        response.append(line);
    }
}
Paste your curl command into curlconverter.com/java/ and it'll convert it into Java code using java.net.URL and java.net.URLConnection.
Related
Hey, I am having a little trouble here. I am doing file writing at school and we got the challenge of reading a webpage. How is it possible to do it? I had a go with Jsoup and an Apache plugin, but neither worked, and I have to use the java.net import anyway.
I am a bit of a noob at coding, so there will probably be a couple of errors!
Here is my code:
URL oracle = new URL("http://www.oracle.com/");
BufferedReader br = new BufferedReader(new InputStreamReader(oracle.openStream()));
String inputLine;
while ((inputLine = br.readLine()) != null){
System.out.println(inputLine);
}
br.close();
There is no output from the program, and earlier I managed output but it was in the form of HTML, however I deleted that code, ironically looking for a fix for that issue.
Any help or solutions would be greatly appreciated! Thank you all very much!
The code example is from Reading Directly from a URL, but the tutorial is old. The URL http://www.oracle.com now redirects to https://www.oracle.com/, and your code does not follow that redirect.
If you use a URL that does not redirect, like http://www.google.com, you will see that the code works.
If you want a more robust program that handles redirects, you'll probably want to use an HttpURLConnection instead of the basic URL, as it has more features for you to use.
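A minimal sketch of handling the redirect by hand follows. It assumes the cause is what the answer describes: HttpURLConnection follows redirects by default, but not when the protocol changes from http to https, so the Location header has to be followed manually. The class name RedirectFollower and the hop limit are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper for following redirects, including http -> https.
final class RedirectFollower {

    // True for the HTTP status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a Location header (absolute or relative) against the URL that produced it.
    static String resolve(String base, String location) {
        return URI.create(base).resolve(location).toString();
    }

    // Follows up to maxHops redirects and returns the connection for the final URL.
    static HttpURLConnection open(String url, int maxHops) throws IOException {
        String current = url;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // redirects are handled below
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            if (!isRedirect(status) || location == null) {
                return conn; // final response; the caller reads conn.getInputStream()
            }
            conn.disconnect();
            current = resolve(current, location);
        }
        throw new IOException("Too many redirects starting from " + url);
    }
}
```

With this in place, the loop from the question could read from `RedirectFollower.open("http://www.oracle.com/", 5).getInputStream()` instead of `oracle.openStream()`.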
I am writing a program in Java which has to make around 6.5 million calls to various pages on the same server (the URL will be slightly altered by appending a user name that is read from a text file). Firstly, I want to know the most time-efficient way of doing this. Secondly, can anybody give a guess as to how much time this may consume? Currently I am reading each URL in a separate thread of an ExecutorService object, something like this:
ExecutorService executor = Executors.newFixedThreadPool(10);
Runnable worker = new MyRunnable(allUsers[n]);
executor.execute(worker);
and the run method looks like:
is = url.openStream(); // throws an IOException
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
page = page + line;
// More code follows
}
Any suggestions will be highly appreciated
No, nobody can guess how much "time" this will consume. We don't know if the server will take a millisecond or an hour to complete a request.
The most efficient way is to use a more efficient API, one that allows bulk requests.
At 10 threads your program will likely be IO bound. You will need to profile the number of threads you need to ensure full CPU use. You can avoid this using Java 7 features / an NIO framework, like Netty or MINA, so one thread can service many requests concurrently. (I'm not sure if these are client side).
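One way to make the thread count easy to profile is to keep it as a parameter and collect results through futures. The sketch below is only an outline under assumptions: the class name BulkFetcher, the base URL layout, and the fetcher function are hypothetical, and it assumes Java 11 or newer for `URLEncoder.encode(String, Charset)`. The fetcher is passed in as a function so that different download strategies (or a stub) can be timed with different pool sizes.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: runs one fetch per URL on a pool of configurable size.
final class BulkFetcher {

    // Appends the URL-encoded user name to a base URL (the URL layout is an assumption).
    static String urlFor(String baseUrl, String userName) {
        return baseUrl + URLEncoder.encode(userName, StandardCharsets.UTF_8);
    }

    // Applies fetcher to every URL on a fixed pool; returns url -> result in input order.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetcher.apply(url)));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                results.put(urls.get(i), futures.get(i).get()); // blocks until that fetch is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For 6.5 million URLs the inputs would need to be submitted in batches rather than held in one list; that part is left out here.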
I concur with the other comments and answers that it is impossible to predict how long this will take, and that a "bulk transfer" request will most likely give better performance.
A couple more points:
If you can use a RESTful API that returns JSON or XML instead of web pages ... that will be faster.
On the client side, this is going to be inefficient if the documents you are fetching are large:
while ((line = br.readLine()) != null) {
page = page + line;
}
That is going to do an excessive amount of copying. A better approach is this:
StringBuilder sb = new StringBuilder(...);
while ((line = br.readLine()) != null) {
sb.append(line);
}
page = sb.toString();
If you can get a good estimate of the page size, then create the StringBuilder that big.
I am doing an assignment for one of my classes.
I am supposed to write a webcrawler that download files and images from a website given a specified crawl depth.
I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice pieces of software, but they are not perfect.
I used the default Java URLConnection to check the content type before processing each URL, but it becomes really slow as the number of links grows.
Question: Does anyone know of a specialized parser API for images and links?
I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
I need to check the content type while looping through the links, to find out efficiently whether a link points to a file, but Jsoup does not have what I need. Here's what I have:
HttpConnection mimeConn = null;
Response mimeResponse = null;
for (Element link : links) {
    String linkurl = link.absUrl("href");
    if (!linkurl.contains("#")) {
        if (DownloadRepository.curlExists(link.absUrl("href"))) {
            continue;
        }
        mimeConn = (HttpConnection) Jsoup.connect(linkurl);
        mimeConn.ignoreContentType(true);
        mimeConn.ignoreHttpErrors(true);
        mimeResponse = (Response) mimeConn.execute();
        WebUrl webUrl = new WebUrl(linkurl, currentDepth + 1);
        String contentType = mimeResponse.contentType();
        if (contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }
        DownloadRepository.addCrawledURL(linkurl);
    }
}
UPDATE
Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java
Use Jsoup; I think this API is good enough for your purpose. You can also find a good Cookbook on its site.
Several steps:
Jsoup: how to get an image's absolute url?
how to download image from any web page in java
You can write your own recursive method that walks through the links on a page, following those that contain the necessary domain name or are relative. Use it to collect all the links and find all the images on each page. Writing it yourself is not bad practice.
You don't need to use the URLConnection class; Jsoup has a wrapper for it.
For example, you can get the DOM object with a single line of code:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
Update1
Try adding the following lines to your code:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
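Since the question is mostly about the cost of checking the content type for many links, one possible alternative (not part of the answer above) is to ask the server for headers only with a HEAD request, so the body is never downloaded. The sketch below uses only java.net; the class name LinkClassifier, the Kind enum, and the rule that an unknown type counts as a file are assumptions for illustration. Some servers do not answer HEAD correctly, so a fallback to GET may still be needed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper: sorts links into pages, images, and other files.
final class LinkClassifier {

    enum Kind { PAGE, IMAGE, FILE }

    // Maps a Content-Type header value to the three buckets used in the question.
    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.FILE; // assumption: unknown types are treated as plain files
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        if (type.contains("html")) {
            return Kind.PAGE;
        }
        if (type.startsWith("image/")) {
            return Kind.IMAGE;
        }
        return Kind.FILE;
    }

    // Sends a HEAD request, so only the headers travel over the network.
    static Kind probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            return classify(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```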
I have a problem once again where I can't find the source code because it's hidden or something. When my Java program indexes the page, it finds everything but the info I need. I assume it's hidden for a reason, but is there any way around this?
It's just a bunch of tr/td tags that show up in Firebug but don't show up when viewing the page source or when I do the below:
URL url = new URL("my url");
URLConnection yc = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
I really have no idea how to attempt to get the info that i need...
The probable reason for this behavior is that those tags are dynamically injected into the DOM using JavaScript and are not part of the initial HTML, which is all you can fetch with a URLConnection. They might even be created using AJAX. You will need a JavaScript interpreter on your server if you want to fetch those.
If they don't show up in the page source, they're likely being added dynamically by Javascript code. There's no way to get them from your server-side script short of including a javascript interpreter, which is rather high-overhead.
The information in the tags is presumably coming from somewhere, though. Why not track that down and grab it straight from there?
Try using Jsoup.
Document doc = Jsoup.parse(new URL("http://"), 10000); // put the page URL here; 10000 is the timeout in milliseconds
System.out.print(doc.toString());
Assuming that the issue is that the "missing" content is being injected using javascript, the following SO Question is pertinent:
What's a good tool to screen-scrape with Javascript support?
I'm trying to get an entire web page through a URLConnection.
What's the most efficient way to do this?
I'm doing this already:
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while(line!=null){
html.append(line);
line = bf.readLine();
}
bf.close();
html has the entire HTML page.
I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.
Anything you do to parallelize the reading would overly complicate the solution.
I'd recommend using a StringBuilder with an initial capacity of 2048 or 4096.
Why are you thinking that the code you posted isn't sufficient? You sound like you're guilty of premature optimization.
Run with what you have and sleep at night.
What do you want to do with the obtained HTML? Parse it? It may be good to know that a decent HTML parser often already has a constructor or method that takes a URL or InputStream directly, so that you don't need to worry about streaming performance like that.
Assuming that all you want to do is described in your previous question, with Jsoup, for example, you could obtain all those news links extraordinarily easily, as follows:
Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
System.out.println(newsLink.attr("href"));
}
This yields the following after only a few seconds:
http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html
Did someone already say that regex is absolutely the wrong tool to parse HTML? ;)
See also:
Pros and cons of HTML parsers in Java
Your approach looks pretty good, however you can make it somewhat more efficient by avoiding the creation of intermediate String objects for each line.
The way to do this is to read directly into a temporary char[] buffer.
Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
char[] charBuffer = new char[4096];
int count;
// read() returns -1 once the end of the stream is reached
while ((count = bf.read(charBuffer, 0, charBuffer.length)) != -1) {
    html.append(charBuffer, 0, count);
}
bf.close();
For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
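The same idea can be packaged as a small reusable method. This is a sketch rather than a drop-in replacement: the class name PageReader and the expectedChars parameter (for example, a guess based on the Content-Length header) are assumptions, and it uses a StringBuilder because no synchronization is needed here. Unlike a readLine() loop, it keeps the line breaks of the page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper: reads everything from a Reader through a char[] buffer.
final class PageReader {

    // expectedChars is only a sizing hint; the builder still grows if the page is larger.
    static String readAll(Reader reader, int expectedChars) {
        StringBuilder sb = new StringBuilder(Math.max(expectedChars, 16));
        char[] buffer = new char[4096];
        try (Reader in = reader) { // closes the reader when done
            int count;
            while ((count = in.read(buffer)) != -1) {
                sb.append(buffer, 0, count); // copy only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With the code above, the call would look like `PageReader.readAll(new InputStreamReader(connection.getInputStream()), 4096)`.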
You can try using commons-io from apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html)
new String(IOUtils.toCharArray(connection.getInputStream()))
There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection.
HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks, rather than buffering all of the content before you start doing work. This can lead to an improved user experience.
Also, HttpURLConnection supports persistent connections. Why close that connection if you're going to request another resource right away? Keeping the TCP connection open with the web server allows your application to quickly download multiple resources without spending the overhead (latency) of establishing a new TCP connection for each resource.
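Tell the server that you support gzip (via the Accept-Encoding request header), and if the response header says the content is compressed, wrap the response stream in a GZIPInputStream before handing it to an InputStreamReader and BufferedReader.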
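A minimal sketch of the gzip step, under a few assumptions: the class name GzipAwareReader is made up, the page is assumed to be UTF-8, and only the "gzip" value of the Content-Encoding header is handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: unwraps a gzip-compressed response body when needed.
final class GzipAwareReader {

    // contentEncoding is the value of the Content-Encoding response header (may be null).
    static BufferedReader open(InputStream body, String contentEncoding) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(body) // decompress on the fly
                    : body;                     // plain body, use as is
            // Assumption: UTF-8; a real client would take the charset from Content-Type.
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the request side this would be paired with `connection.setRequestProperty("Accept-Encoding", "gzip")`, and the reader would be created with `GzipAwareReader.open(connection.getInputStream(), connection.getContentEncoding())`.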