How to get whole data from Solr - java

I have to write some logic in Java which should retrieve all the index data from Solr.
As of now I am doing it like this
String confSolrUrl = "http://localhost/solr/master/select?q=*%3A*&wt=json&indent=true";
LOG.info(confSolrUrl);
URL url = new URL(confSolrUrl);
URLConnection conn = url.openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
// save to this filename
String fileName = "/qwertyuiop.html";
File file = new File(fileName);
if (!file.exists()) {
    file.createNewFile();
}
FileWriter fw = new FileWriter(file.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
while ((inputLine = br.readLine()) != null) {
    bw.write(inputLine);
}
bw.close();
br.close();
System.out.println("Done");
In my file I get the whole response, which I can then parse to extract my JSON.
Is there a better way to do this than fetching the resource from the URL and parsing it myself?

I just wrote an application to do this; take a look at it on GitHub: https://github.com/freedev/solr-import-export-json
If you want to read all the data from a Solr collection, the first problem you face is pagination; in this case we are talking about deep paging.
A direct HTTP request like yours returns only a relatively small number of documents, while a Solr collection can hold millions or even billions of documents.
So you should use the correct API, i.e. SolrJ.
That is exactly what my project does.
I would also suggest this reading:
https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
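For reference, here is a minimal sketch of what deep paging with SolrJ can look like, using the cursorMark API described in that post. This is not the exact code from the linked project; the Solr URL, core name, page size, and the unique key field id are assumptions you would adjust.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class SolrExportSketch {
    public static void main(String[] args) throws Exception {
        // Assumed Solr base URL and core name; adjust to your installation.
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/master").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(500);                                  // page size
            query.setSort(SolrQuery.SortClause.asc("id"));       // cursor paging requires a sort on the unique key
            String cursorMark = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse response = client.query(query);
                for (SolrDocument doc : response.getResults()) {
                    System.out.println(doc);                     // export/process each document here
                }
                String nextCursorMark = response.getNextCursorMark();
                done = cursorMark.equals(nextCursorMark);        // no more pages when the mark stops changing
                cursorMark = nextCursorMark;
            }
        }
    }
}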

Related

How can I write my exported XML to an XML File?

My task is to search for a book on WikiBooks and export the corresponding XML file, which is to be edited later. However, the error occurs before I get that far.
My idea was to read the page and write it line by line to an XML file. Here is my code:
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Which book do you want to search for?");
book = reader.readLine();
book = replaceSpace(book);
URL url = new URL("https://de.wikibooks.org/wiki/Spezial:Exportieren/" + book);
URLConnection uc = url.openConnection();
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
uc.connect();
BufferedReader xmlReader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
File file = new File("wiki.xml");
FileWriter writer = new FileWriter(file);
String inputLine;
while ((inputLine = xmlReader.readLine()) != null) {
    writer.write(inputLine + "\n");
}
xmlReader.close();
The code is executed without an error message, but the saved file ends in the middle of a word and is therefore incomplete.
How can I work around this problem?
As the comment suggested, the problem is that the content of the stream is never flushed to the file. Calling close() on your writer flushes the remaining buffered content automatically.
Here is your code with the added statement at the end:
BufferedReader xmlReader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
File file = new File("wiki.xml");
FileWriter writer = new FileWriter(file);
String inputLine;
while ((inputLine = xmlReader.readLine()) != null) {
    writer.write(inputLine + "\n");
}
writer.close();
xmlReader.close();
A much easier alternative built into Java is the Files class. My suggestion is to replace the code above with the following single statement, which copies your InputStream directly into a file.
Files.copy(uc.getInputStream(), Paths.get("wiki.xml"), StandardCopyOption.REPLACE_EXISTING);
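For completeness, a hedged sketch of the whole download using that one-liner (the book title in the URL is just a placeholder). Note that Files.copy does not close the input stream for you, so the try-with-resources takes care of that:
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WikiExport {
    public static void main(String[] args) throws Exception {
        // "Beispielbuch" is a placeholder book title.
        URLConnection uc = new URL("https://de.wikibooks.org/wiki/Spezial:Exportieren/Beispielbuch").openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (InputStream in = uc.getInputStream()) {
            // Copies the stream straight into wiki.xml, replacing any previous file.
            Files.copy(in, Paths.get("wiki.xml"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}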

Google Dataflow: how to parse big file with valid JSON array from FileIO.ReadableFile

In my pipeline, a FileIO.readMatches() transform reads a big JSON file (around 300-400 MB) containing a valid JSON array and returns a FileIO.ReadableFile object to the next transform. My task is to read each JSON object from that array, add new properties, and output the result to the next transform.
At the moment my code to parse the JSON file looks like this:
// file is a FileIO.ReadableFile object
InputStream bis = new ByteArrayInputStream(file.readFullyAsBytes());
// I'm using the Gson library to parse JSON
JsonReader reader = new JsonReader(new InputStreamReader(bis, "UTF-8"));
JsonParser jsonParser = new JsonParser();
reader.beginArray();
while (reader.hasNext()) {
    JsonObject jsonObject = jsonParser.parse(reader).getAsJsonObject();
    jsonObject.addProperty("Somename", "Somedata");
    // processContext is a ProcessContext object
    processContext.output(jsonObject.toString());
}
reader.close();
In this case the whole content of the file ends up in memory, which risks a java.lang.OutOfMemoryError. I'm looking for a way to read the JSON objects one by one without keeping the whole file in memory.
A possible solution is the open() method of FileIO.ReadableFile, which returns a ReadableByteChannel, but I'm not sure how to use that channel to read exactly one JSON object at a time.
Updated solution
This is my updated solution, which reads the file line by line:
ReadableByteChannel readableByteChannel = null;
InputStream inputStream = null;
BufferedReader bufferedReader = null;
try {
    // file is a FileIO.ReadableFile
    readableByteChannel = file.open();
    inputStream = Channels.newInputStream(readableByteChannel);
    bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        if (line.length() > 1) {
            // my final output should contain both the filename and the line
            processContext.output(fileName + line);
        }
    }
} catch (IOException ex) {
    logger.error("Exception during reading the file: {}", ex);
} finally {
    IOUtils.closeQuietly(bufferedReader);
    IOUtils.closeQuietly(inputStream);
}
This solution doesn't work with Dataflow running on an n1-standard-1 machine (it throws java.lang.OutOfMemoryError: GC overhead limit exceeded), but it works correctly on an n1-standard-2 machine.
ReadableByteChannel is a Java NIO API (java.nio.channels, available since Java 1.4). Java provides a way to convert it to an InputStream: InputStream bis = Channels.newInputStream(file.open()); - I believe this is the only change you need to make.
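To tie that into the original parsing loop, here is a hedged sketch that streams the array element by element instead of calling readFullyAsBytes(). The class and method names are made up for illustration; the Somename/Somedata property and the output consumer mirror the question's code.
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import org.apache.beam.sdk.io.FileIO;

public class StreamingJsonArrayReader {
    // Emits each element of the top-level JSON array without loading the whole file into memory.
    public static void readElements(FileIO.ReadableFile file, Consumer<String> output) throws IOException {
        try (JsonReader reader = new JsonReader(
                new InputStreamReader(Channels.newInputStream(file.open()), StandardCharsets.UTF_8))) {
            JsonParser parser = new JsonParser();
            reader.beginArray();
            while (reader.hasNext()) {
                JsonObject obj = parser.parse(reader).getAsJsonObject();
                obj.addProperty("Somename", "Somedata");
                output.accept(obj.toString()); // e.g. processContext::output inside the DoFn
            }
            reader.endArray();
        }
    }
}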

reading remote csv file without downloading

I have a requirement to read a big remote CSV file line by line (basically streaming). After each read I want to persist the record in a DB. Currently I am achieving this with the code below, but I am not sure whether it downloads the complete file and keeps it in JVM memory. I assume it does not. Can I write this code in a better way using some Java 8 stream features?
URL url = new URL(baseurl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
if (connection.getResponseCode() == 200) {
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    String current;
    while ((current = in.readLine()) != null) {
        persist(current);
    }
}
First, you should use a try-with-resources statement to automatically close your streams when reading is done.
Next, BufferedReader has a method BufferedReader::lines which returns a Stream<String>.
Your code could then look like this:
URL url = new URL(baseurl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
if (connection.getResponseCode() == 200) {
    try (InputStreamReader streamReader = new InputStreamReader(connection.getInputStream());
         BufferedReader br = new BufferedReader(streamReader);
         Stream<String> lines = br.lines()) {
        lines.forEach(s -> persist(s)); // should be a method reference
    }
}
Now it's up to you to decide whether this code is better, and whether your assumption is right that the whole file is not kept in the JVM.

Optimized option for getting text from a web page

I used url.openConnection() to get text from a webpage, but I got a noticeable delay when running it in a loop. I also tried httpUrl.disconnect(), but that didn't change much. Can anyone give me a better option? I used the following code:
for (int i = 0; i < 10; i++) {
    URL google = new URL(array[i]); // array of links
    HttpURLConnection yc = (HttpURLConnection) google.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        source = source.concat(inputLine);
    }
    in.close();
    yc.disconnect();
}
A couple of issues I can see:
in.readLine() doesn't retain the newline, so when you use concat all the newlines are removed.
Using concat in a loop like this builds a longer and longer String, which gets slower with every line you add.
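As a minimal sketch (assuming in is the BufferedReader from your loop), a StringBuilder fixes both points:
StringBuilder source = new StringBuilder();
String inputLine;
while ((inputLine = in.readLine()) != null) {
    source.append(inputLine).append('\n'); // re-add the newline that readLine() strips
}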
Instead you might find IOUtils useful.
URL google = new URL("http://123newyear.com/2011/calendars/");
String text = IOUtils.toString(google.openConnection().getInputStream());
See Reading Directly from a URL for details on how to to get a stream from which you can read the contents of the URL.
Basically, you
Create a URL: URL url = new URL("http://123newyear.com/2011/calendars/");
Call openStream() on the URL object
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
Read from the stream (like you did).

how to read/fetch the XML file from an URL using Java?

I want to read an XML file from a URL and parse it. How can I do this in Java?
Reading from a URL is no different from reading from any other input source. There are several different Java tools for XML parsing.
You can use XStream; it supports this.
URL url = new URL("yoururl");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
xStreamObj.fromXML(in); // returns the parsed object
Two steps:
Get the bytes from the server.
Create a suitable XML source for it, perhaps even a Transformer.
Connect the two and get e.g. a DOM for further processing.
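A minimal sketch of those steps using the JDK's built-in DOM parser (the URL passed in is a placeholder; this is one of several ways to wire the pieces together):
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class UrlToDom {
    public static Document parse(String address) throws Exception {
        // Step 1: get the bytes from the server.
        try (InputStream in = new URL(address).openStream()) {
            // Step 2: feed the stream to an XML parser and get a DOM back.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(in);
        }
    }
}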
I use JDOM:
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.*;
StringBuilder responseBuilder = new StringBuilder();
try {
    // Create a URLConnection object for a URL
    URL url = new URL("http://127.0.0.1");
    URLConnection conn = url.openConnection();
    HttpURLConnection httpConn = (HttpURLConnection) conn;
    BufferedReader rd = new BufferedReader(new InputStreamReader(httpConn.getInputStream()));
    String line;
    while ((line = rd.readLine()) != null) {
        responseBuilder.append(line + '\n');
    }
} catch (Exception e) {
    System.out.println(e);
}
SAXBuilder sb = new SAXBuilder();
Document d = null;
try {
    d = sb.build(new StringReader(responseBuilder.toString()));
} catch (Exception e) {
    System.out.println(e);
}
Of course, you could skip reading the URL into a String first and then wrapping a StringReader around it, but I've cut/pasted this from two different places, so this was easier.
This is a good candidate for a streaming parser: StAX.
StAX was designed to deal with XML streams serially, as opposed to DOM APIs, which need the entire document model built up front. StAX also assumes that the contents are dynamic and that the exact structure of the XML is not known in advance. Processing pipelines are among StAX's use cases as well.
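A minimal StAX sketch reading directly from a URL stream (the URL is a placeholder; this just prints element names as they stream past):
import java.io.InputStream;
import java.net.URL;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxUrlReader {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL("http://127.0.0.1/feed.xml").openStream()) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("Element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }
}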
