Java heap space error while writing to a file

I am trying to write a huge amount of data, around 64,000 records at a time, to a file. I am getting the exception that I attached below.
The code that I use to write is:
Path outputpath = Paths.get("file1.json");
try (BufferedWriter writer = Files.newBufferedWriter(outputpath, StandardCharsets.UTF_8, WRITE)) {
    writer.write(jsonObject.toString());
} catch (Exception e) {
    // error msg
}
Here my "jsonObject" is just a JSON array which contains 65,000 rows.
Can you please help me write this to my file in an efficient way, so that I can avoid that heap space error?

You've cut your stacktrace a little bit short, but I'll assume the exception happens in jsonObject.toString().
Basically, you have to decide between two things: either allocate more memory or break the big operation into several smaller ones. Adding memory is quick and simple, but if you expect even more data in the future, it won't solve your problem forever. As others have mentioned, use -Xmx and/or -Xms on the java command line.
The next thing you could try is to use a different JSON library. Perhaps the one you are using now is not particularly suited for large JSON objects. Or there might be a newer version.
As the last resort, you can always construct the JSON yourself. It's a string after all, and you already have the data in memory. To be as efficient as possible, you don't even need to build the entire string at once, you could just go along and write bits and pieces to your BufferedWriter.
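If you go that route, a rough sketch could look like the following, assuming "jsonObject" is actually an org.json.JSONArray (as the question suggests); otherwise the same idea applies to whatever container you have:
Path outputpath = Paths.get("file1.json");
try (BufferedWriter writer = Files.newBufferedWriter(outputpath, StandardCharsets.UTF_8, WRITE)) {
    writer.write('[');
    for (int i = 0; i < jsonObject.length(); i++) {
        if (i > 0) {
            writer.write(',');
        }
        // only one element is stringified at a time, so the whole
        // 65,000-row array is never materialized as a single String
        writer.write(jsonObject.get(i).toString());
    }
    writer.write(']');
} catch (Exception e) {
    // error handling
}
The trade-off is that you are now responsible for the surrounding array syntax yourself, but the peak memory use drops to roughly the size of the largest single element.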

You can try to iterate through your json object:
Iterator<String> keys = (Iterator<String>) jsonObject.keys();
while (keys.hasNext()) {
    String key = keys.next();
    JSONObject value = jsonObject.getJSONObject(key);
    writer.write(value.toString());
}
PS. You need to check your json object's structure.

Related

Performance issues reading CSV files in a Java (Spring Boot) application

I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON.
It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each.
I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide a mapping for the CSV headers.
Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine, but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines.
To improve processing speed, I tried to implement multithreading with Callable(s), but I am not familiar with that kind of concept, so the implementation might be wrong.
Other than that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callable(s) are responsible for consuming a large amount of heap memory.
So I actually have several questions:
#1. How could I improve the speed of the CSV reading ?
#2. Is the multithread implementation with Callable correct ?
#3. How could I reduce the amount of heap memory used in the process ?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here below is the CSV method:
public static final int NUMBER_OF_THREADS = 10;
public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        // log Error reading csv file
    }
    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
            rowList.add(row);
        } catch (ExecutionException | InterruptedException e) {
            // log Error CSV processing interrupted during execution
        }
    }
    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {
    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
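For illustration only, a minimal sketch of such a consumer, assuming the Jakarta JSON Processing API (jakarta.json) is on the classpath and that a flat array-of-arrays output is acceptable (the file name and layout are placeholders; exception handling omitted):
try (Stream<List<String>> rows = readCsv(inputStream);
     JsonGenerator generator = Json.createGenerator(
             Files.newBufferedWriter(Paths.get("output.json"), StandardCharsets.UTF_8))) {
    generator.writeStartArray();
    rows.forEach(row -> {
        generator.writeStartArray();     // one JSON array per CSV line
        row.forEach(generator::write);   // each cell as a JSON string
        generator.writeEnd();
    });
    generator.writeEnd();
}
Each row is written out as soon as it is read, so neither the full CSV nor the full JSON document ever needs to sit in memory.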
Instead of trying out a different approach, try to run with a profiler first and see where time is actually being spent. And use this information to change the approach.
Async-profiler is a very solid profiler (and free!) and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see the ratio of CPU utilization caused by garbage collection. It also has the ability to do allocation profiling to figure out which objects are being created (and where).
For a tutorial see the following link.
Try using Spring batch and see if it helps your scenario.
Ref : https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/

Java Strings: how the memory works with immutable Strings

I have a simple question.
byte[] responseData = ...;
String str = new String(responseData);
String withKey = "{\"Abcd\":" + str + "}";
In the above code, are these three lines taking 3x the memory? For example, if responseData is 1 MB, then will line 2 take an extra 1 MB in memory, and line 3 an extra 1 MB + xx? Is this true? If not, how does it work? If yes, what is the optimal way to fix this? Will StringBuffer help here?
Yes, that sounds about right. Probably even more because your 1MB byte array needs to be turned into UTF-16, so depending on the encoding, it may be even bigger (2MB if the input was ASCII).
Note that the garbage collector can reclaim memory as soon as the variables that use it go out of scope. You could set them to null as early as possible to help it make this as timely as possible (for example responseData = null; after you constructed your String).
if yes, then what is the optimal way to fix this
"Fix" implies a problem. If you have enough memory there is no problem.
the problem is that I am getting OutOfMemoryException as the byte[] data coming from server is quite big,
If you don't, you have to think about a better alternative to keeping a 1MB string in memory. Maybe you can stream the data off a file? Or work on the byte array directly? What kind of data is this?
The problem is that I am getting an OutOfMemoryException as the byte[] data coming from the server is quite big; that's why I need to figure out first whether I am doing something wrong ....
Yes. Well basically your fundamental problem is that you are trying to hold the entire string in memory at one time. This is always going to fail for a sufficiently large string ... even if you code it in the most optimal memory efficient fashion possible. (And that would be complicated in itself.)
The ultimate solution (i.e. the one that "scales") is to do one of the following:
stream the data to the file system, or
process it in such a way that you don't need ever need the entire "string" to be represented.
You asked if StringBuffer will help. It might help a bit ... provided that you use it correctly. The trick is to make sure that you preallocate the StringBuffer (actually a StringBuilder is better!!) to be big enough to hold all of the characters required. Then copy data into it using a charset decoder (directly or using a Reader pipeline).
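A rough sketch of that idea, assuming the bytes are UTF-8 text (the 8 KB buffer and the extra capacity for the wrapper characters are arbitrary choices, and the surrounding method would need to handle IOException):
// for UTF-8, the char count is at most the byte count, so this avoids internal resizing
StringBuilder sb = new StringBuilder(responseData.length + 16);
sb.append("{\"Abcd\":");
try (Reader reader = new InputStreamReader(new ByteArrayInputStream(responseData), StandardCharsets.UTF_8)) {
    char[] buffer = new char[8192];
    int n;
    while ((n = reader.read(buffer)) != -1) {
        sb.append(buffer, 0, n);   // decode and copy in chunks, without an intermediate String
    }
}
sb.append('}');
String withKey = sb.toString();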
But even with optimal coding, you are likely to need a peak of 3 times the size of your input byte[].
Note that your OOME problem is probably nothing to do with GC or storage leaks. It is actually about the fundamental space requirements of the data types you are using ... and the fact that Java does not offer a "string of bytes" data type.
There is no such OutOfMemoryException in my API docs. If it's OutOfMemoryError, especially on the server side, you've definitely got a problem.
When you receive big requests from clients, those String-related statements are not the first problem. Reducing 3x to 1x is not the solution.
I'm sorry, I can't help without any further code.
Use back-end storage
You should not store the whole request body in a byte[]. You can store it directly in any back-end storage, such as a local file, a remote database, or cloud storage.
I would copy the stream from the request to the back-end with a small chunked buffer.
Use streams
If you can, use streams rather than objects.
I would do something like:
response.getWriter().write("{\"Abcd\":");
// copy <your back-end stored data as a stream> to the response here
response.getWriter().write("}");
Yes, if you use a StringBuffer for the code you have, you would save 1 MB of heap space in the last step. However, considering the size of data you have, I recommend an external-memory algorithm where you bring only part of your data into memory, process it, and put it back to storage.
As others have mentioned, you should really try not to have such a big object in your mobile app, and streaming should be your best solution.
That said, there are some techniques to reduce the amount of memory your app is using now:
Remove byte[] responseData entirely if possible, so the memory it uses can be released ASAP (assuming it is not used anywhere else).
Create the largest String first, and then substring() it. Android uses Apache Harmony for its standard Java library implementation. If you check its String class implementation, you'll see that substring() is implemented simply by creating a new String object with the proper start and end offsets into the original data, so no duplicate copy is created. Doing the following cuts the overall memory consumption by at least 1/3:
String withKey = new StringBuilder().append("{\"Abcd\":").append(str).append("}").toString();
str = withKey.substring("{\"Abcd\":".length(), withKey.length() - "}".length());   // shares withKey's backing char array on Harmony
Never ever use something like "{\"Abcd\":" + str + "}" for large Strings. Under the hood, "string_a" + "string_b" is implemented as new StringBuilder().append("string_a").append("string_b").toString(), so implicitly you are creating one or two StringBuilders (depending on how smart the compiler is). For large Strings it's better that you take over this process yourself, as you have deep domain knowledge about your program that the compiler doesn't, and know how best to manipulate the strings.

Java: processing a file fails with java.lang.OutOfMemoryError: GC overhead limit exceeded

I have the following Java class to read from a file containing many lines of tab-delimited strings. An example line is like the following:
GO:0085044 GO:0085044 GO:0085044
The code reads each line and uses the split function to put the three substrings into an array, then it puts them into a two-level hash.
public class LCAReader {
    public static void main(String[] args) {
        Map<String, Map<String, String>> termPairLCA = new HashMap<String, Map<String, String>>();
        File ifile = new File("LCA1.txt");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(ifile));
            String line = null;
            while ((line = reader.readLine()) != null) {
                String[] arr = line.split("\t");
                if (termPairLCA.containsKey(arr[0])) {
                    if (termPairLCA.get(arr[0]).containsKey(arr[1])) {
                        System.out.println("Error: Duplicate term in LCACache");
                    } else {
                        termPairLCA.get(arr[0]).put(new String(arr[1]), new String(arr[2]));
                    }
                } else {
                    Map<String, String> tempMap = new HashMap<String, String>();
                    tempMap.put(new String(arr[1]), new String(arr[2]));
                    termPairLCA.put(new String(arr[0]), tempMap);
                }
            }
            reader.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
When I ran the program, I got the following run time error after some time of running. I noticed the memory usage kept increasing.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Pattern.compile(Pattern.java:1469)
    at java.util.regex.Pattern.<init>(Pattern.java:1150)
    at java.util.regex.Pattern.compile(Pattern.java:840)
    at java.lang.String.split(String.java:2304)
    at java.lang.String.split(String.java:2346)
    at LCAReader.main(LCAReader.java:17)
The input file is almost 2G and the machine I ran the program on has 8G of memory. I also tried the -Xmx4096m parameter when running the program, but that did not help. So I guess there is some memory leak in my code, but I cannot find it.
Can anyone help me on this? Thanks in advance!
There's no memory leak; you're just trying to store too much data. 2GB of text will take 4GB of RAM as Java characters; plus there's about 48 bytes per String object overhead. Assuming the text is in 100 character lines, there's about another GB, for a total of 5GB -- and we haven't even counted the Map.Entry objects yet! You'd need a Java heap of at least, conservatively, 6GB to run this program on your data, and maybe more.
There are a couple of easy things you can do to improve this. First, lose the new String() constructors -- they're useless and just make the garbage collector work harder. Strings are immutable so you never need to copy them. Second, you could use the intern pool to share duplicate strings -- this may or may not help, depending on what the data actually looks like. But you could try, for example,
tempMap.put(arr[1].intern(), arr[2].intern());
These simple steps might help a lot.
I don't see any leak; you simply need a huge amount of memory to store your map.
There is a very good tool for verifying this: make a heap dump with the option -XX:+HeapDumpOnOutOfMemoryError and import it into Eclipse Memory Analyzer, which comes in a standalone version. It can show you the biggest retained objects and the reference trees that could prevent the garbage collector from doing its job.
In addition, a profiler such as the NetBeans Profiler can give you a lot of interesting real-time information (for instance, to check the number of String and char[] instances).
Also, it is good practice to split your code into different classes, each having a different responsibility: the "two keys map" class (TreeMap) on one side and a "parser" class on the other; it should make debugging easier.
It is definitely not a good idea to store this huge map in RAM... or you need to run a benchmark with some smaller files and extrapolate to estimate the RAM you need on your system to fit your worst case, and set Xmx to the proper value.
Why don't you use a key-value store such as Berkeley DB? It is simpler than a relational DB and should fit exactly your need for two-level indexing.
Check this post for the choice of the store: key-value store suggestion
Good luck
You probably shouldn't use String.split and store the information as plain Strings, as this generates lots of String objects on the fly.
Try a char-based approach instead: your format seems rather fixed, so you know the exact indices of the different data points on a line.
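For instance, a rough sketch of parsing one line with indexOf instead of split, assuming each line really has exactly three tab-separated fields:
int firstTab = line.indexOf('\t');
int secondTab = line.indexOf('\t', firstTab + 1);
String first = line.substring(0, firstTab);
String second = line.substring(firstTab + 1, secondTab);
String third = line.substring(secondTab + 1);
// no regex Pattern is compiled and no String[] array is allocated per line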
If you're a bit more into experimenting, you could try a NIO-backed approach with a memory-mapped DirectByteBuffer or a CharBuffer that is used to traverse the file. There you could just mark the indices of the different data points in marker objects and only load the real String data later in the process, when needed.

getting Java OutOfMemoryError: Java heap space error that I can't debug

I am struggling to figure out what's causing this OutOfMemoryError. Making more memory available isn't the solution, because my system doesn't have enough memory. Instead I have to figure out a way of re-writing my code.
I've simplified my code to try to isolate the error. Please take a look at the following:
File[] files = new File(args[0]).listFiles();
int filecnt = 0;
LinkedList<String> urls = new LinkedList<String>();
for (File f : files) {
    if (filecnt > 10) {
        System.exit(1);
    }
    System.out.println("Doing File " + filecnt + " of " + files.length + " :" + f.getName());
    filecnt++;
    FileReader inputStream = null;
    StringBuilder builder = new StringBuilder();
    try {
        inputStream = new FileReader(f);
        int c;
        char d;
        while ((c = inputStream.read()) != -1) {
            d = (char) c;
            builder.append(d);
        }
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
    }
    inputStream.close();
    String mystring = builder.toString();
    String temp[] = mystring.split("\\|NEWandrewLINE\\|");
    for (String s : temp) {
        String temp2[] = s.split("\\|NEWandrewTAB\\|");
        if (temp2.length == 22) {
            urls.add(temp2[7].trim());
        }
    }
}
I know this code is probably pretty confusing :) I have loads of text files in the directory specified in args[0]. These text files were created by me. I used |NEWandrewLINE| to indicate a new row in the text file, and |NEWandrewTAB| to indicate a new column. In this code snippet, I am trying to access the URL of each stored row (which is in the 8th column of each row). So I read in the whole text file, split the string on |NEWandrewLINE|, and then split the resulting substrings on |NEWandrewTAB|. I add the URL to the LinkedList (called "urls") with the line urls.add(temp2[7].trim());
Now, the output of running this code is:
Doing File 0 of 973 :results1322453406319.txt
Doing File 1 of 973 :results1322464193519.txt
Doing File 2 of 973 :results1322337493419.txt
Doing File 3 of 973 :results1322347332053.txt
Doing File 4 of 973 :results1322330379488.txt
Doing File 5 of 973 :results1322369464720.txt
Doing File 6 of 973 :results1322379574296.txt
Doing File 7 of 973 :results1322346981999.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at Twitter.main(Twitter.java:86)
Where main line 86 relates to the line builder.append(d); in this example.
But the thing I don't understand is that if I comment out the line urls.add(temp2[7].trim()); I don't get any error. So the error seems to be caused by the linkedlist "urls" overfilling. But why then does the reported error relate to the StringBuilder?
Try to replace urls.add(temp2[7].trim()); with urls.add(new String(temp2[7].trim()));.
I suppose that your problem is that you are in fact storing the entire file content and not just the extracted URL field in your urls list, although that's not really obvious. It is actually an implementation specific issue with the String class, but usually String#split and String#trim return new String objects, which contain the same internal char array as the original string and only differs in their offset and length fields. Using the new String(String) constructor makes sure that you only keep the relevant part of the original data.
The linked list is using more memory each time you add a string. This means you can be left with not enough memory to build your StringBuilder.
The way to avoid this issue is to write the results to a file instead of to a List, as you don't appear to have enough memory to keep the List in memory.
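A small sketch of that change (the output file name is arbitrary): open one writer around the whole file loop and append each URL as it is found, instead of collecting them in the list:
try (BufferedWriter urlWriter = Files.newBufferedWriter(Paths.get("urls.txt"), StandardCharsets.UTF_8)) {
    for (File f : files) {
        // ... read and split the file exactly as before, then instead of urls.add(temp2[7].trim()):
        urlWriter.write(temp2[7].trim());
        urlWriter.newLine();
    }
}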
Because this is out of memory and not out of heap, you have LOTS of small temporary objects.
I would suggest you give your JVM a -Xmx maximum heap size limit that fits in your RAM.
To use less memory, I would use a buffered reader to pull in an entire line at a time and save on the temporary object creation.
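A minimal sketch of that change, assuming the rows in your files also end with real line breaks (if they do not, read into a char[] buffer in chunks instead of one character at a time):
try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
    String line;
    while ((line = reader.readLine()) != null) {
        builder.append(line);   // appends a whole line at once instead of one char per call
    }
}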
The simple answer is: you should not load all the URLs from the text files into memory. You are surely doing this because you want to process them in a next step. So instead of adding them to a List in memory do the next step (maybe storing in a database or check if it is reachable) and forget that URL.
How many URLs do you have? It looks like you're just storing more of them than you can handle.
As far as I can see, the linked list is the only object that is not scoped inside the loop, so cannot be collected.
For an OOM error, it doesn't really matter where it is thrown.
To check this properly, use a profiler (look at JVisualVM for a free one, and you probably already have it). You'll see which objects are in the heap. You can also have the JVM dump its memory into a file when it crashes, then analyse that file with visualvm. You should see that one thing is grabbing all of your memory. I'm suspecting it's all the URLs.
There are several experts in here already, so I'll be brief about the problems:
Inappropriate use of StringBuilder:
StringBuilder builder = new StringBuilder();
try {
    inputStream = new FileReader(f);
    int c;
    char d;
    while ((c = inputStream.read()) != -1) {
        d = (char) c;
        builder.append(d);
    }
}
Java is beautiful when you process small amounts of data at a time, remember the garbage collector.
Instead, I would recommend that you read the file (a text file) one line at a time, process the line, and move on, never creating a huge memory ball of StringBuilder just to get a String.
Imagine your text file is 1 GB in size; you are done, mate.
Add the real processing while reading the file (as in item #1).
You don't need to close the InputStream again; the code in the finally block is good enough.
regards
If the LinkedList eats your memory, every command which allocates memory afterwards may fail with an OOM error. So this looks like your problem.
You're reading the files into memory. At least one file is simply too big to fit into the default JVM heap. You can allow it use a lot more memory with an arg like -Xmx1g on the command line after java.
By the way, it is really inefficient to read a file one character at a time!
Instead of trying to split the string (which basically creates an array of substrings based on the split), thereby using more than double the memory each time you split, you should try to do regex-based matching of the start and end patterns, extract individual substrings one by one, and then extract the URL from that.
Also, if your file is large, I would suggest that you not even load all of it into memory at once ... stream its contents into a buffer of manageable size and use the pattern-based search on that (removing from and adding to the buffer as you progress through the file contents).
This implementation will slow the program down a bit but will use a considerably smaller amount of memory.
One major problem in your code is that you read the whole file into a StringBuilder, then convert it into a String, and then split it into smaller parts. So if the file size is large, you will get into trouble. As suggested by others, process the file line by line, as that should save a lot of memory.
Also, you should check the size of your list after processing each file. If the size is very large, you may want to use a different approach or increase the memory for your process via the -Xmx option.

How to avoid out of memory in StringBuilder or String in Java

I am getting a lot of data from a web service containing XML entity references. While replacing those with the respective characters, I am getting an out-of-memory error. Can anybody give an example of how to avoid that? I have been stuck on this problem for two days.
This is my code:
public String decodeXMLData(String s)
{
    s = s.replaceAll("&gt;", ">");
    System.out.println("string value is " + s);
    s = s.replaceAll("&lt;", "<");
    System.out.println("string value1 is " + s);
    s = s.replaceAll("&amp;", "&");
    s = s.replaceAll("&quot;", "\"");
    s = s.replaceAll("&apos;", "'");
    s = s.replaceAll("&nbsp;", " ");
    return s;
}
You should use a SAX parser, not parse it on your own.
Just look into these resources; they have code samples too:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
http://www.java-samples.com/showtutorial.php?tutorialid=152
http://www.totheriver.com/learn/xml/xmltutorial.html
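For illustration, a minimal SAX sketch, assuming the payload (here called xml) is a well-formed XML string; the parser decodes entity references such as &lt; and &amp; for you, so no manual replaceAll is needed (imports from javax.xml.parsers and org.xml.sax, exception handling omitted):
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
    @Override
    public void characters(char[] ch, int start, int length) {
        // the text arrives here already decoded, in manageable chunks
        System.out.print(new String(ch, start, length));
    }
});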
Take a look at Apache Commons Lang | StringEscapeUtils.unescapeHtml.
Each call to replaceAll creates a new String object, so you are creating a whole chain of intermediate Strings just to decode one value. This is not an efficient way to XML-decode a string.
I recommend using a more robust implementation of XML encoding/decoding methods, like those contained in the Commons Lang libraries. In particular, StringEscapeUtils may help you get the job done.
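A one-line sketch of that, assuming Commons Lang 3 is on the classpath; for XML entities specifically, unescapeXml is the closer fit (newer versions move these helpers to Commons Text):
String decoded = org.apache.commons.lang3.StringEscapeUtils.unescapeXml(s);   // handles &lt;, &gt;, &amp;, &quot;, &apos; in one pass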
The method as shown would not be a source of out of memory errors (unless the string you are handling is as big as the remaining free heap).
What you could be running into is the fact that String.substring() calls do not allocate a new string, but create a string object which re-uses the one that substring is called on. If your code consists of reading large buffers and creating strings from those buffers, you might need to use new String(str.substring(index)) to force reallocation of the string values into new, small char arrays.
You can try increasing JVM memory, but that will only delay the inevitable if the problem is serious (i.e. if you're trying to claim gigabytes for example).
If you've got a single String that causes you to run out of memory trying to do this, it must be humongous :) The suggestion to use a SAX parser to handle it and print it in bits and pieces is a good one.
Or split it up into smaller bits yourself and send each of those to a routine that does what you want and discard the result afterwards.
