Convert XML to JSON efficiently for Huge Files - java

I have several XML files (each several GB in size) that need to be converted to JSON. I can easily convert small files (a few kilobytes) using the JSON library (org.json - https://mvnrepository.com/artifact/org.json/json/20180813).
Here's the code that I am using:
static String line="",str="";
BufferedReader br = new BufferedReader(new FileReader(link));
FileWriter fw = new FileWriter(outputlink);
JSONObject jsondata = null;
while ((line = br.readLine()) != null)
{
str+=line;
}
jsondata = XML.toJSONObject(str);
But large files (even those under 100 MB) take too long to process, and the larger ones throw java.lang.OutOfMemoryError: Java heap space. How can I optimize the code to process large files, or is there another approach or library I should use?
UPDATE
I have updated the code and now write the XML out as JSON segment by segment.
My XML:
<PubmedArticleSet>
<PubmedArticle>
</PubmedArticle>
<PubmedArticle>
</PubmedArticle>
...
</PubmedArticleSet>
So I ignore the root node <PubmedArticleSet> (I will add it back later), convert each <PubmedArticle> </PubmedArticle> element to JSON, and write them out one at a time:
br = new BufferedReader(new FileReader(link));
fw = new FileWriter(outputlink,true);
StringBuilder str = new StringBuilder();
br.readLine(); // to skip the first three lines and the root
br.readLine();
br.readLine();
while ((line = br.readLine()) != null) {
JSONObject jsondata = null;
str.append(line);
System.out.println(str);
if (line.trim().equals("</PubmedArticle>")) { // split here
jsondata = XML.toJSONObject(str.toString());
String jsonPrettyPrintString = jsondata.toString(PRETTY_PRINT_INDENT_FACTOR);
fw.append(jsonPrettyPrintString.toString());
System.out.println("One done"); // One section done
str= new StringBuilder();
}
}
fw.close();
I no longer get the heap error, but processing still takes hours for files in the ~300 MB range. Please suggest ways to speed this up.

This statement is the main thing killing your performance:
str+=line;
It allocates, copies, and discards a new String object on every iteration, and each copy is as long as everything read so far.
You need to use a StringBuilder:
StringBuilder builder = new StringBuilder();
while ( ... ) {
builder.append(line);
}
It may also help (to a lesser extent) to read the file in larger chunks and not line by line.
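As a rough illustration of reading in larger chunks, here is a minimal sketch that fills a StringBuilder from a Reader using a char buffer. The class name ChunkedRead and the 64 KiB buffer size are assumptions chosen for the example, not something from the question.

```java
import java.io.IOException;
import java.io.Reader;

final class ChunkedRead {
    /** Reads everything from the reader in fixed-size chunks instead of line by line. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[64 * 1024]; // chunk size is an arbitrary choice
        int n;
        // read(char[]) returns the number of chars read, or -1 at end of stream
        while ((n = reader.read(buffer)) != -1) {
            builder.append(buffer, 0, n);
        }
        return builder.toString();
    }
}
```

Unlike the readLine() loop, this keeps the line terminators. It still holds the whole document in memory, so it only removes the concatenation cost and does not help once the file no longer fits in the heap.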

The I/O involved in reading a large file is very time-consuming. Try using a library to handle it for you, for example Apache Commons IO:
File xmlFile= new File("D:\\path\\file.xml");
String xmlStr= FileUtils.readFileToString(xmlFile, "UTF-8");
JSONObject xmlJson = XML.toJSONObject(xmlStr);

Related

Not able to upload 1 GB file in Minio using java?

I have a Java program that uploads files from the local disk to the Minio browser. The file size is around 900 MB. When I execute the program I get:
Java.lang.OutOfMemoryError - Java heap Size
I tried increasing the heap size to -Xms4096M -Xmx8192M, both in eclipse.ini and under Run-->Configurations-->Project.
After increasing the heap size, I receive this when I execute the program:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
How can I upload large files to Minio using Java?
This is what my Java program looks like:
StringBuilder stringBuilder = new StringBuilder();
File[] files = new File(path).listFiles();
showFiles(files);
System.out.println(pathList);
ListIterator<String> itr=pathList.listIterator();
while(itr.hasNext()){
String relativePath=itr.next();
if(relativePath!=null) {
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
System.out.println(absolutePath);
System.out.println(relativePath);
File f =new File(absolutePath);
BufferedReader reader = new BufferedReader(new FileReader(f));
String line = null;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if(stringBuilder.length()!=0) {
// delete the last new line separator
stringBuilder.deleteCharAt(stringBuilder.length() - 1);
}
reader.close();
// Create a InputStream for object upload.
ByteArrayInputStream bais = new ByteArrayInputStream(stringBuilder.toString().getBytes("UTF-8"));
Do you absolutely need to remove the trailing line separator from your text file?
If this is not absolutely required, you could let the Minio client library handle the upload transparently:
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
File f =new File(absolutePath);
minio.putObject("bucketName", f.getName(), absolutePath);
According to the Minio docs, this allows uploads of up to 5 GB. It is easier to implement and faster than any other solution.
If you absolutely need to remove the trailing line separator, you should at least pre-size the StringBuilder (and use the correct code to remove the trailing line separator):
File f = new File(absolutePath);
stringBuilder.ensureCapacity((int) f.length()+2);
try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
String line;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if (stringBuilder.length() != 0) {
// delete the last new line separator
stringBuilder.setLength(stringBuilder.length() - ls.length());
}
}
Be aware that this code can never upload files larger than about 2 GB:
arrays in Java cannot be larger than about Integer.MAX_VALUE-5 elements (the exact limit depends on the VM)
therefore a StringBuilder cannot be used to create strings with more than about Integer.MAX_VALUE-5 characters
transforming the string into a UTF-8 encoded byte array cannot produce a byte array longer than about Integer.MAX_VALUE-5 bytes
since UTF-8 is a multibyte encoding, transforming a string with Integer.MAX_VALUE-5 characters into a byte array might not be possible
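To make the last point concrete, the following sketch computes how many bytes a string needs in UTF-8 without allocating the byte array, so the result can be compared against the array limit before encoding. The class and method names are hypothetical, and the treatment of unpaired surrogates assumes the behaviour of String.getBytes, which substitutes a single '?' byte.

```java
final class Utf8Size {
    /** Number of bytes the text occupies in UTF-8, returned as a long so it cannot overflow. */
    static long utf8Length(CharSequence text) {
        long bytes = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                bytes += 1; // ASCII
            } else if (c < 0x800) {
                bytes += 2; // e.g. accented Latin letters
            } else if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                bytes += 4; // a surrogate pair encodes one supplementary code point
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1; // unpaired surrogate: assumed to be replaced by '?'
            } else {
                bytes += 3; // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }
}
```

A string of n characters can need up to 3n bytes, so a text that fits in a StringBuilder may still be too large for a single byte array.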

Java out-of-memory error occurs while reading a file that has a single, enormously long line

Our application needs to read a file that consists of a single line containing a large amount of data. We read the line from the file into a string, tokenize the string on - and store the tokens in a list. Some entries from that list are then checked.
The method is as follows:
public boolean checkMessage(String filename){
boolean retBool = true;
LinkedList tokenList;
int size;
String line = "";
try {
File file = new File(filename);
FileInputStream fs = new FileInputStream(file);
InputStreamReader is = new InputStreamReader(fs);
BufferedReader br = new BufferedReader(is);
while ((line = br.readLine()) != null) {
line = line.trim(); // trim() returns a new String, so the result must be assigned
tokenList = tokenizeString(line, "-");
if (tokenList == null) {
retBool = false;
resultMsg = "Error in File.java ";
}
if (retBool) {
retBool = checkMessagePart(tokenList);
}
}
}
The error occurs at the line while ((line = br.readLine()) != null).
The error is:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.BufferedReader.readLine(BufferedReader.java:363)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
Increasing the heap size didn't work. The file being read is larger than 1 GB. I also tried reading it in chunks of bytes, but adding the data to a StringBuilder or a list produces the OutOfMemoryError again.
If the problem is that you cannot read the file into a String, then don't do it. Read it token by token using some other method. The easy one is a Scanner with the right delimiter ("-" in your case). If you find its performance lacking, you could resort to implementing your own version of BufferedReader in which the "lines" are split by that character instead of by the usual line terminators.
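The second suggestion could look roughly like the sketch below: a helper that reads from a Reader up to the next delimiter, much as readLine() reads up to the next line terminator. The names DelimitedTokens and nextToken are made up for this illustration.

```java
import java.io.IOException;
import java.io.Reader;

final class DelimitedTokens {
    /**
     * Returns the next token up to (not including) the delimiter,
     * or null once the reader is exhausted.
     */
    static String nextToken(Reader reader, char delimiter) throws IOException {
        int c = reader.read();
        if (c == -1) {
            return null; // nothing left to read
        }
        StringBuilder token = new StringBuilder();
        while (c != -1 && c != delimiter) {
            token.append((char) c);
            c = reader.read();
        }
        return token.toString();
    }
}
```

Because it reads one character at a time, the Reader passed in should be a BufferedReader. Only one token is in memory at any moment, so the checking logic would need to consume tokens as they arrive rather than from a complete list.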

Java Buffered reader running out of heap space

I'm trying to parse a very large file (~1.2 GB). Some lines of the file are bigger than the maximum allowed String size.
FileReader fileReader = new FileReader(filePath);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null) {
//Do something
}
bufferedReader.close();
Error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at sax.parser.PrettyPrintXML.format(PrettyPrintXML.java:30)
line 30 :
while ((line = bufferedReader.readLine()) != null) {
Can anyone suggest an alternative approach for this case?
You are using readLine() on a file that doesn't have lines, so it tries to read the entire file as a single line. This does not scale.
Solution: don't. Read a chunk at a time, or maybe even a character at a time: whatever is dictated by the unstated structure of your file.
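A minimal sketch of chunk-at-a-time processing, assuming the work can be done on each chunk independently. Counting a character stands in for the real processing here, and the class name and buffer size are arbitrary.

```java
import java.io.IOException;
import java.io.Reader;

final class ChunkScan {
    /** Counts occurrences of target while holding only one 8 KiB chunk in memory. */
    static long count(Reader reader, char target) throws IOException {
        char[] chunk = new char[8192];
        long hits = 0;
        int n;
        // Each pass overwrites the same buffer, so memory use stays constant
        while ((n = reader.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                if (chunk[i] == target) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

Processing that spans chunk boundaries (for example matching a multi-character tag) needs extra state carried from one chunk to the next, which this sketch does not attempt.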
I believe the maximum String length is 2^31-1 (2,147,483,647) characters, and a 1.2 GB text file (assuming it is a text file) can hold about 1,200,000,000 characters. Why do you need to read all the data? What are you using it for? Can you split the file into several files, or read and parse it as smaller strings? More information is needed.
You can use Apache Commons IO:
https://commons.apache.org/proper/commons-io/description.html
example:
InputStream in = new URL( "http://commons.apache.org" ).openStream();
try {
System.out.println( IOUtils.toString( in, "UTF-8" ) );
} finally {
IOUtils.closeQuietly(in);
}

How to split a very long string

I have a big file (about 30 MB), and here is the code I use to read data from it:
BufferedReader br = new BufferedReader(new FileReader(file));
try {
String line = br.readLine();
while (line != null) {
sb.append(line).append("\n");
line = br.readLine();
}
Then I need to split the content I read, so I use:
String[] inst = sb.toString().split("GO");
The problem is that sometimes the substring exceeds the maximum String length and I can't get all the data inside the string. How can I work around this?
Thanks
Use Scanner s = new Scanner(input).useDelimiter("GO"); and then call s.next() to read one piece at a time.
Why it happens: the erroneous result may be caused by non-contiguous heap segments, as the CMS collector doesn't defragment memory.
(This does not answer the how-to-solve part, though.)
You may opt to load the whole string piece by piece, i.e. using substring.
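The Scanner suggestion above might be fleshed out as in this sketch. The class name GoSplitter is hypothetical, and Pattern.quote is used only to make clear that the delimiter is a literal string, since useDelimiter takes a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

final class GoSplitter {
    /** Splits the input on the literal text "GO" without first building one large string. */
    static List<String> batches(Readable input) {
        List<String> result = new ArrayList<>();
        // Closing the Scanner also closes the underlying input if it is Closeable
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(Pattern.quote("GO"));
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }
}
```

Passing a FileReader or BufferedReader as the input avoids the intermediate StringBuilder entirely. Note that this splits on every occurrence of "GO", including inside longer words, so a stricter pattern may be needed if the separator only counts when it stands on its own line.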

Capture data read from file into string stream Java

I'm coming from a C++ background, so be kind on my n00bish queries...
I'd like to read data from an input file and store it in a stringstream. I can accomplish this in an easy way in C++ using stringstreams. I'm a bit lost trying to do the same in Java.
The following is a crude approach I've developed, where I store the data read line by line in a string array. I need to capture my data in a string stream (rather than a string array). Any help?
char dataCharArray[] = new char[2];
int marker=0;
String inputLine;
String temp_to_write_data[] = new String[100];
// Now, read from output_x into stringstream
FileInputStream fstream = new FileInputStream("output_" + dataCharArray[0]);
// Convert our input stream to a BufferedReader
BufferedReader in = new BufferedReader (new InputStreamReader(fstream));
// Continue to read lines while there are still some left to read
while ((inputLine = in.readLine()) != null )
{
// Print file line to screen
// System.out.println (inputLine);
temp_to_write_data[marker] = inputLine;
marker++;
}
EDIT:
I think what I really wanted was a StringBuffer.
I need to read data from a file (into a StringBuffer, probably) and write/transfer all the data back to another file.
In Java, the first preference should always be to use existing library code:
http://commons.apache.org/io/api-1.4/org/apache/commons/io/IOUtils.html
http://commons.apache.org/io/api-1.4/org/apache/commons/io/FileUtils.html
In short, what you need is this:
FileUtils.readFileToString(File file)
StringBuffer is one answer, but if you're just writing it to another file, then you can just open an OutputStream and write it directly out to the other file. Holding a whole file in memory is probably not a good idea.
If you simply want to read a file and write another one:
BufferedInputStream in = new BufferedInputStream( new FileInputStream( "in.txt" ) );
BufferedOutputStream out = new BufferedOutputStream( new FileOutputStream( "out.txt" ) );
int b;
while ( (b = in.read()) != -1 ) {
out.write( b );
}
If you want to read a file into a string:
StringWriter out = new StringWriter();
BufferedReader in = new BufferedReader( new FileReader( "in.txt" ) );
int c;
while ( (c = in.read()) != -1 ) {
out.write( c );
}
StringBuffer buf = out.getBuffer();
This can be made more efficient if you read into byte arrays. But I recommend that you use the excellent Apache Commons IO. IOUtils (http://commons.apache.org/io/api-1.4/org/apache/commons/io/IOUtils.html) will do the loop for you.
Also, you should remember to close the streams.
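For the byte-array variant and the point about closing streams, a sketch along these lines may help. It uses try-with-resources so both streams are closed even if an exception is thrown; the class name and the 8 KiB buffer size are arbitrary choices for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class BufferedCopy {
    /** Copies source to target in 8 KiB blocks and returns the number of bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        long total = 0;
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int n;
            // read(byte[]) returns the number of bytes read, or -1 at end of file
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }
}
```

If no processing is needed between reading and writing, the standard library call Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING) does the same job in one line.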
I also come from C++, and I was looking for a class similar to the C++ 'StringStreamReader', but I couldn't find it. In my case (which I think was very simple), I was trying to read a file line by line and then read a String and an Integer from each of these lines. My final solution was to use two objects of the class java.util.Scanner, so that I could use one of them to read the lines of the file directly to a String and use the second one to re-read the content of each line (now in the String) to the variables (a new String and a positive 'int'). Here's my code:
try {
//"path" is a String containing the path of the file we want to read
Scanner sc = new Scanner(new BufferedReader(new FileReader(new File(path))));
while (sc.hasNextLine()) { //while the file isn't over
Scanner scLine = new Scanner(sc.nextLine());
//sc.nextLine() returns the next line of the file into a String
//scLine will now proceed to scan (i.e. analyze) the content of the string
//and identify the string and the positive 'int' (what in C++ would be an 'unsigned int')
String s = scLine.next(); //this returns the string wanted
int x;
if (!scLine.hasNextInt() || (x = scLine.nextInt()) < 0) return false;
//scLine.hasNextInt() analyzes if the following pattern can be interpreted as an int
//scLine.nextInt() reads the int, and then we check if it is positive or not
//AT THIS POINT, WE ALREADY HAVE THE VARIABLES WANTED AND WE CAN DO
//WHATEVER WE WANT WITH THEM
//in my case, I put them into a HashMap called 'hm'
hm.put(s, x);
}
sc.close();
//we finally close the scanner to point out that we won't need it again 'till the next time
} catch (Exception e) {
return false;
}
return true;
Hope that helped.
