How to read/fetch an XML file from a URL using Java? - java

I want to read an XML file from a URL and parse it. How can I do this in Java?

Reading from a URL is no different from reading from any other input source. There are several different Java tools for XML parsing.

You can use XStream; it supports this.
URL url = new URL("yoururl");
BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream()));
Object parsed = xStreamObj.fromXML(in); // returns the parsed object
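For completeness, a hedged sketch of how the surrounding XStream setup might look; the Person class and the alias are illustrative assumptions, not part of the original answer:
import com.thoughtworks.xstream.XStream;

XStream xStreamObj = new XStream();
xStreamObj.alias("person", Person.class);    // hypothetical mapping: root element <person> -> Person
Person p = (Person) xStreamObj.fromXML(in);  // 'in' is the BufferedReader built above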

Three steps (a sketch follows the list):
1. Get the bytes from the server.
2. Create a suitable XML source for it, perhaps even a Transformer.
3. Connect the two and get e.g. a DOM for further processing.
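A minimal sketch of those three steps with the JDK's built-in DOM parser; the URL is a placeholder:
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomFromUrl {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/data.xml");
        try (InputStream in = url.openStream()) { // step 1: bytes from the server
            DocumentBuilder db = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder();        // step 2: a suitable XML source
            Document doc = db.parse(in);          // step 3: a DOM for further processing
            System.out.println(doc.getDocumentElement().getNodeName());
        }
    }
}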

I use JDOM:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.HttpURLConnection;
import java.net.URL;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;

StringBuilder responseBuilder = new StringBuilder();
try {
    // Open an HTTP connection to the URL
    URL url = new URL("http://127.0.0.1");
    HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
    BufferedReader rd = new BufferedReader(new InputStreamReader(httpConn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        responseBuilder.append(line).append('\n');
    }
} catch (Exception e) {
    System.out.println(e);
}

SAXBuilder sb = new SAXBuilder();
Document d = null;
try {
    d = sb.build(new StringReader(responseBuilder.toString()));
} catch (Exception e) {
    System.out.println(e);
}
Of course, you could cut out the whole read-the-URL-into-a-string step and feed the stream to the builder directly, but I've cut/pasted from two different areas, so this was easier.
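For what it's worth, SAXBuilder can also build straight from the URL, assuming the server returns well-formed XML:
Document d = new SAXBuilder().build(new URL("http://127.0.0.1"));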

This is a good candidate for a streaming parser: StAX.
StAX was designed to deal with XML streams serially, in contrast to DOM APIs, which need the entire document model in memory at once. StAX also assumes that the content is dynamic and that the exact shape of the XML is not known in advance. StAX use cases include processing pipelines as well.
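A minimal StAX sketch along those lines; the URL is a placeholder:
import java.io.InputStream;
import java.net.URL;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxFromUrl {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/feed.xml");
        try (InputStream in = url.openStream()) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) { // pull events one at a time, never holding the whole document
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("Element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}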

Related

How to get all the data from Solr

I have to write some logic in Java which should retrieve all the index data from Solr.
As of now I am doing it like this:
String confSolrUrl = "http://localhost/solr/master/select?q=*%3A*&wt=json&indent=true";
LOG.info(confSolrUrl);
URL url = new URL(confSolrUrl);
URLConnection conn = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
// save to this filename
String fileName = "/qwertyuiop.html";
File file = new File(fileName);
if (!file.exists()) {
    file.createNewFile();
}
FileWriter fw = new FileWriter(file.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
while ((inputLine = br.readLine()) != null) {
    bw.write(inputLine);
}
bw.close();
br.close();
System.out.println("Done");
In my file I will get the whole HTML page, which I can then parse to extract my JSON.
Is there any better way to do it than fetching the resource from the URL and parsing it by hand?
I just wrote an application to do this; take a look at GitHub: https://github.com/freedev/solr-import-export-json
If you want to read all the data from a Solr collection, the first problem you're facing is pagination; in this case we are talking about deep paging.
A direct HTTP request like yours will return a relatively small number of documents, while a Solr collection can hold millions or even billions of documents.
So you should use the proper API, i.e. SolrJ. That is what I did in my project.
I would also suggest this reading:
https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
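A hedged sketch of deep paging with SolrJ's cursorMark, roughly what that blog post describes; the Solr URL, core name, and unique-key field "id" are assumptions:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class SolrDumpAll {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/master").build();
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(500);
        query.setSort(SolrQuery.SortClause.asc("id")); // cursorMark requires a sort on the unique key
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = client.query(query);
            rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
            String next = rsp.getNextCursorMark();
            done = cursor.equals(next); // an unchanged cursor means every document has been read
            cursor = next;
        }
        client.close();
    }
}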

Garbage when loading XML content with URLConnection

I'm trying to load the content of an XML page using URLConnection, but I'm getting back garbage characters. The same code works for me on pretty much any other site, so I'm not sure what the issue is.
Here's the relevant code:
String urlString = "http://myUrl";
URL url = new URL(urlString);
URLConnection conn = url.openConnection();
conn.setConnectTimeout(60 * 1000); // wait only 60 seconds for a response
conn.setReadTimeout(60 * 1000);
InputStreamReader isr = new InputStreamReader(conn.getInputStream(), encoding);
BufferedReader in = new BufferedReader(isr);
String wholeDocument = "";
String inputLine;
while ((inputLine = in.readLine()) != null) {
    wholeDocument += inputLine;
}
Printing out wholeDocument produces a bunch of characters like this: er���;�pI.���$6
I am using encoding = 'UTF-8'.
I also tried using XML libraries, for example:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new URL(urlString).openStream());
System.out.println("doc = " + doc);
But the result is the same. When using curl in a terminal app (I'm on a mac) the result is similar although the characters look like this: ???0??KZV??????0N6?aH:$?X9v???$>???`
Any idea how to solve this?
If you check the headers of your response you will see Content-Encoding: gzip, indicating that the body of the response has been compressed. You need to uncompress it first; that's why you get those weird characters. More details about HTTP compression.
A good way to check the headers with curl is to use the verbose option -v. In this case, thanks to curl -v http://sites.one.co.il/XML/VOD/ | more, I could quickly see the response headers.
Expanding on the other answer, you can check whether the received body is gzip-encoded, and decode it if so:
InputStream body = conn.getInputStream();
if ("gzip".equals(conn.getHeaderField("Content-Encoding"))) {
    body = new GZIPInputStream(body); // java.util.zip.GZIPInputStream
}
InputStreamReader isr = new InputStreamReader(body, encoding);
Alternatively, you can specify that you don't want gzip-encoded data:
conn.setRequestProperty("Accept-Encoding", "identity");
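Putting it together, a small sketch that requests an uncompressed body and feeds it straight to a DOM parser; the URL is the one from the curl example, and note that servers are free to ignore the header:
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FetchXmlUncompressed {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://sites.one.co.il/XML/VOD/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept-Encoding", "identity"); // ask the server not to gzip
        try (InputStream in = conn.getInputStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        }
    }
}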

HTTP URL connection response

I am trying to hit a URL and get the response from my Java code.
I am using URLConnection to get this response and writing it to an HTML file.
When I open this HTML file in a browser after executing the Java class, I get only the Google home page and not the search results.
What's wrong with my code? Here it is:
// Base64 here is org.apache.commons.codec.binary.Base64
FileWriter fWriter = null;
BufferedWriter writer = null;
URL url = new URL("https://www.google.co.in/?gfe_rd=cr&ei=aS-BVpPGDOiK8Qea4aKIAw&gws_rd=ssl#q=google+post+request+from+java");
byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
String encoding = new String(encodedBytes);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("User-Agent", "Mozilla/5.0");
connection.setRequestProperty("Accept-Charset", "UTF-8");
connection.setDoInput(true);
connection.setRequestProperty("Authorization", "Basic " + encoding);
connection.connect();
InputStream content = connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
try {
    fWriter = new FileWriter(new File("f:\\fileName.html"));
    writer = new BufferedWriter(fWriter);
    while ((line = in.readLine()) != null) {
        writer.write(line);
    }
    writer.close();
} catch (Exception e) {
    e.printStackTrace();
}
The same code worked a couple of days back, but not now.
The reason is that this URL does not return search results itself. You have to understand how Google works to see why. Open this URL in your browser and view its source: you will only see lots of JavaScript there.
In short, Google uses Ajax requests to process search queries.
To perform the required task you either have to use a headless browser that can execute JavaScript/Ajax (the hard way), OR better, use the Google search API as directed by anand.
This method of searching is not advised and is bound to fail; you should use the Google search APIs for this kind of work.
Note: Google uses redirection and tokens, so even if you find a clever way to handle it, it is bound to fail in the long run.
Edit:
This is a sample of how you can get your work done reliably using the Google search API; please refer to the source for more information.
public static void main(String[] args) throws Exception {
    String google = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=";
    String search = "stackoverflow";
    String charset = "UTF-8";

    URL url = new URL(google + URLEncoder.encode(search, charset));
    Reader reader = new InputStreamReader(url.openStream(), charset);
    // GoogleResults is a small POJO mapping the JSON response; see the linked source
    GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);

    // Show title and URL of 1st result.
    System.out.println(results.getResponseData().getResults().get(0).getTitle());
    System.out.println(results.getResponseData().getResults().get(0).getUrl());
}

Optimized option for getting text from a web page

I used url.openConnection() to get text from a web page,
but I got a noticeable delay in execution when I tried it in loops.
I also tried httpUrl.disconnect(),
but the change is not that much.
Can anyone give me a better option for this?
I used the following code:
String source = "";
for (int i = 0; i < 10; i++) {
    URL google = new URL(array[i]); // array of links
    HttpURLConnection yc = (HttpURLConnection) google.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        source = source.concat(inputLine);
    }
    in.close();
    yc.disconnect();
}
A couple of issues I can see:
in.readLine() doesn't retain the newline, so when you use concat all the newlines are removed.
Using concat in a loop like this builds a longer and longer String, and each call copies the whole string so far, so it gets slower with every line you add.
Instead you might find Commons IO's IOUtils useful:
URL google = new URL("http://123newyear.com/2011/calendars/");
String text = IOUtils.toString(google.openConnection().getInputStream(), "UTF-8");
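If you'd rather not add the Commons IO dependency, a plain StringBuilder sketch avoids the quadratic concat cost just as well; 'in' is the reader from the question:
StringBuilder sb = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null) {
    sb.append(inputLine).append('\n'); // append is amortized O(1), unlike String.concat
}
source = sb.toString();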
See Reading Directly from a URL for details on how to get a stream from which you can read the contents of the URL.
Basically, you (a compact version follows the list):
1. Create a URL: URL url = new URL("http://123newyear.com/2011/calendars/");
2. Call openStream() on the URL object: BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
3. Read from the stream (like you did).
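A compact, self-contained version of those steps using try-with-resources; the URL is the same placeholder:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadUrlText {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://123newyear.com/2011/calendars/");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n'); // readLine() strips newlines, so restore them
            }
            System.out.println(sb);
        }
    }
}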

How to (simply) generate a POST HTTP request from Java to do a file upload

I would like to upload files from a Java application/applet using a POST HTTP request. I would like to avoid using any library not included in SE, unless there is no other feasible option.
So far I have come up with only a very simple solution:
- Create a String (Buffer) and fill it with a compatible header (http://www.ietf.org/rfc/rfc1867.txt)
- Open a connection to the server with URL.openConnection() and write the content of this file to the OutputStream.
I would also need to manually convert the binary file for the POST body.
I hope there is some better, simpler way to do this?
You need to use the java.net.URL and java.net.URLConnection classes.
There are some good examples at http://java.sun.com/docs/books/tutorial/networking/urls/readingWriting.html
Here's some quick and nasty code:
public void post(String url) throws Exception {
    URL u = new URL(url);
    URLConnection c = u.openConnection();
    c.setDoOutput(true);
    if (c instanceof HttpURLConnection) {
        ((HttpURLConnection) c).setRequestMethod("POST");
    }
    OutputStreamWriter out = new OutputStreamWriter(c.getOutputStream());
    // output your data here
    out.close();
    BufferedReader in = new BufferedReader(
            new InputStreamReader(c.getInputStream()));
    String s;
    while ((s = in.readLine()) != null) {
        System.out.println(s);
    }
    in.close();
}
Note that you may still need to URL-encode your POST data before writing it to the connection.
You should also learn about the chunked transfer encoding used in newer versions of HTTP. The Apache HttpClient library is a good reference implementation to learn from.
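For reference, a hand-rolled multipart/form-data sketch in the spirit of RFC 1867, using only java.net; the endpoint, field name, file name, and boundary are illustrative:
import java.io.DataOutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MultipartPost {
    public static void main(String[] args) throws Exception {
        String boundary = "----JavaUploadBoundary"; // arbitrary, must not occur in the payload
        URL url = new URL("http://example.com/upload");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + boundary);
        try (DataOutputStream out = new DataOutputStream(conn.getOutputStream())) {
            out.writeBytes("--" + boundary + "\r\n");
            out.writeBytes("Content-Disposition: form-data; name=\"file\"; filename=\"data.bin\"\r\n");
            out.writeBytes("Content-Type: application/octet-stream\r\n\r\n");
            out.write(Files.readAllBytes(Paths.get("data.bin"))); // raw bytes, no manual text conversion
            out.writeBytes("\r\n--" + boundary + "--\r\n");       // closing boundary
        }
        System.out.println("Response code: " + conn.getResponseCode());
    }
}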
