I am developing a system which loads a huge CSV file (more than 1 million lines) and saves it into a database. Every line also has more than one thousand fields. A CSV file is considered one batch, and each line is considered its child object. While adding objects, every object is stored in the List of a single batch, and at some point I run out of memory because the List ends up holding more than 1 million objects. I cannot split the file into N parts, since there are dependencies between lines that are not in serial order (any line can have a dependency on other lines).
Following is the general logic:
Batch batch = new Batch();
String csvLine;
while ((csvLine = reader.readLine()) != null) {   // reader is the BufferedReader over the CSV file
    String[] values = csvLine.split(",", -1);
    Transaction txn = new Transaction();
    txn.setType(values[0]);
    txn.setAmount(values[1]);
    /*
     * There are more than one thousand transaction fields in one line.
     */
    batch.addTransaction(txn);
}
batch.save();
Is there any way to handle this kind of situation when the server has low memory?
In the old times, we used to process large quantities of data stored on sequential tapes with little memory and disk. But it took a looong time!
Basically, you build a buffer of lines that can fit in your memory, scan the whole file to resolve the dependencies, and fully process those lines. Then you iterate on the next buffer until you have processed the whole file. It requires a full read of the file per buffer, but it saves memory.
There may be another problem here, because you want to store all records in a single batch. The batch will require enough memory to hold all the records, so here again you risk exhausting memory. But you can again use the good old methods and save several batches of smaller size.
If you want to make sure that everything will either be fully inserted into the database or fully rejected, you can simply use a transaction:
declare the transaction at the beginning of your job
save all your batches inside this single transaction
commit the transaction when everything is done
Professional-grade databases (MySQL, PostgreSQL, Oracle, etc.) can use rollback segments on disk to process one transaction without exhausting memory. Of course it is far slower than in-memory operations (not to mention the case where, for any reason, you have to roll back such a transaction!), but at least it works, unless you exhaust the available physical disk...
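A minimal JDBC sketch of that approach (just an illustration: the txn_import table, its columns, and the reader over the CSV are assumptions, not something from the original question): auto-commit is turned off, rows are inserted in small batches, and a single commit happens at the end.
import java.io.BufferedReader;
import java.io.IOException;
import java.sql.*;

// Hypothetical sketch: one database transaction spanning many small insert batches.
static void importCsv(BufferedReader reader, String jdbcUrl, String user, String pass)
        throws SQLException, IOException {
    try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass)) {
        con.setAutoCommit(false);                  // one transaction for the whole job
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO txn_import (type, amount) VALUES (?, ?)")) {
            int pending = 0;
            String csvLine;
            while ((csvLine = reader.readLine()) != null) {
                String[] values = csvLine.split(",", -1);
                ps.setString(1, values[0]);
                ps.setString(2, values[1]);
                ps.addBatch();
                if (++pending == 1_000) {          // flush a small batch, keep memory bounded
                    ps.executeBatch();
                    pending = 0;
                }
            }
            ps.executeBatch();                     // remaining rows
            con.commit();                          // everything goes in...
        } catch (SQLException | IOException e) {
            con.rollback();                        // ...or everything is rejected
            throw e;
        }
    }
}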
Dedicate a separate database table just for the CSV import, maybe with additional fields for those cross-references you mentioned.
If you need to analyze CSV fields in Java, limit the number of distinct value instances by caching them:
public class SharedStrings {

    private final Map<String, String> sharedStrings = new HashMap<>();

    public String share(String s) {
        if (s.length() <= 15) {
            String t = sharedStrings.putIfAbsent(s, s); // since Java 8
            if (t != null) {
                s = t;
            }
            /*
            // Older Java:
            String t = sharedStrings.get(s);
            if (t == null) {
                sharedStrings.put(s, s);
            } else {
                s = t;
            }
            */
        }
        return s;
    }
}
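A possible way to apply it to the split fields (a sketch; csvLine stands for the current line being parsed):
// Sketch: intern the repetitive field values of each parsed line.
SharedStrings cache = new SharedStrings();
String[] values = csvLine.split(",", -1);
for (int i = 0; i < values.length; i++) {
    values[i] = cache.share(values[i]);   // duplicates across lines now share one instance
}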
In your case, with long records, it might even make sense to compress the line you just read, as bytes, with a GZIPOutputStream into a shorter byte array.
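A rough sketch of that compression idea (it assumes the raw line is only needed again later; readAllBytes requires Java 9+):
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch only: compress a long CSV line to a byte array, decompress it when it is needed again.
static byte[] compressLine(String line) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(line.getBytes(StandardCharsets.UTF_8));
    }
    return bos.toByteArray();
}

static String decompressLine(byte[] compressed) throws IOException {
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
        return new String(gz.readAllBytes(), StandardCharsets.UTF_8);   // readAllBytes: Java 9+
    }
}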
But then a database seems more logical.
The following will possibly not apply if you are using all fields of a csvLine.
In older Java versions (before 7u6), String#split relied on String#substring, which did not create a new string but kept the original string's char array in memory and referenced the respective portion of it.
So this code would keep the original string in memory:
String a = "...very long and comma separated";
String[] split = a.split(",");
String b = split[1];
a = null;
So if you are not using all of the data in the csvLine, you should wrap every entry of values in a new String; i.e. in the above example you would do
String b = new String(split[1]);
otherwise the GC is unable to free string a.
I ran into this while I was extracting one column of a CSV file with millions of lines.
Related
I am currently working on a Spring-based API which has to transform CSV data and expose it as JSON.
It has to read big CSV files which will contain more than 500 columns and 2.5 million lines each.
I am not guaranteed to have the same header between files (each file can have a completely different header than another), so I have no way to create a dedicated class which would provide mapping with the CSV headers.
Currently the API controller is calling a CSV service which reads the CSV data using a BufferedReader.
The code works fine on my local machine but it is very slow: it takes about 20 seconds to process 450 columns and 40,000 lines.
To improve processing speed, I tried to implement multithreading with Callable(s), but I am not familiar with that kind of concept, so the implementation might be wrong.
Apart from that, the API is running out of heap memory when running on the server. I know that a solution would be to increase the amount of available memory, but I suspect that the replace() and split() operations on strings made in the Callable(s) are responsible for consuming a large amount of heap memory.
So I actually have several questions :
#1. How could I improve the speed of the CSV reading ?
#2. Is the multithread implementation with Callable correct ?
#3. How could I reduce the amount of heap memory used in the process ?
#4. Do you know of a different approach to split at commas and replace the double quotes in each CSV line? Would StringBuilder be of any help here? What about StringTokenizer?
Here is the CSV method:
public static final int NUMBER_OF_THREADS = 10;

public static List<List<String>> readCsv(InputStream inputStream) {
    List<List<String>> rowList = new ArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
    List<Future<List<String>>> listOfFutures = new ArrayList<>();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        String line = null;
        while ((line = reader.readLine()) != null) {
            CallableLineReader callableLineReader = new CallableLineReader(line);
            Future<List<String>> futureCounterResult = pool.submit(callableLineReader);
            listOfFutures.add(futureCounterResult);
        }
        reader.close();
        pool.shutdown();
    } catch (Exception e) {
        // log: Error reading csv file
    }
    for (Future<List<String>> future : listOfFutures) {
        try {
            List<String> row = future.get();
            rowList.add(row);
        } catch (ExecutionException | InterruptedException e) {
            // log: Error CSV processing interrupted during execution
        }
    }
    return rowList;
}
And the Callable implementation
public class CallableLineReader implements Callable<List<String>> {

    private final String line;

    public CallableLineReader(String line) {
        this.line = line;
    }

    @Override
    public List<String> call() throws Exception {
        return Arrays.asList(line.replace("\"", "").split(","));
    }
}
I don't think that splitting this work onto multiple threads is going to provide much improvement, and may in fact make the problem worse by consuming even more memory. The main problem is using too much heap memory, and the performance problem is likely to be due to excessive garbage collection when the remaining available heap is very small (but it's best to measure and profile to determine the exact cause of performance problems).
The memory consumption would be less from the replace and split operations, and more from the fact that the entire contents of the file need to be read into memory in this approach. Each line may not consume much memory, but multiplied by millions of lines, it all adds up.
If you have enough memory available on the machine to assign a heap size large enough to hold the entire contents, that will be the simplest solution, as it won't require changing the code.
Otherwise, the best way to deal with large amounts of data in a bounded amount of memory is to use a streaming approach. This means that each line of the file is processed and then passed directly to the output, without collecting all of the lines in memory in between. This will require changing the method signature to use a return type other than List. Assuming you are using Java 8 or later, the Stream API can be very helpful. You could rewrite the method like this:
public static Stream<List<String>> readCsv(InputStream inputStream) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
    return reader.lines().map(line -> Arrays.asList(line.replace("\"", "").split(",")));
}
Note that this throws unchecked exceptions in case of an I/O error.
This will read and transform each line of input as needed by the caller of the method, and will allow previous lines to be garbage collected if they are no longer referenced. This then requires that the caller of this method also consume the data line by line, which can be tricky when generating JSON. The JakartaEE JsonGenerator API offers one possible approach. If you need help with this part of it, please open a new question including details of how you're currently generating JSON.
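For illustration, here is a minimal sketch of such a caller using the jakarta.json (JSON-P) JsonGenerator mentioned above; the method name and the choice to emit an array of arrays are assumptions, not part of the original code:
import jakarta.json.Json;                   // javax.json in older JSON-P versions
import jakarta.json.stream.JsonGenerator;
import java.io.Writer;
import java.util.List;
import java.util.stream.Stream;

// Sketch only: stream each CSV row out as a JSON array of arrays,
// so only one row needs to be held in memory at a time.
static void writeCsvAsJson(Stream<List<String>> rows, Writer out) {
    try (JsonGenerator json = Json.createGenerator(out)) {
        json.writeStartArray();
        rows.forEach(row -> {
            json.writeStartArray();
            row.forEach(json::write);       // JsonGenerator.write(String)
            json.writeEnd();
        });
        json.writeEnd();
    }
}
Note that closing the generator also closes the underlying Writer, so wrap the response writer this way only if that is acceptable in your setup.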
Instead of trying out a different approach, try to run with a profiler first and see where time is actually being spent. And use this information to change the approach.
Async-profiler is a very solid profiler (and free!) and will give you a very good impression of where time is being spent. It will also show the time spent on garbage collection, so you can easily see the ratio of CPU utilization caused by garbage collection. It also has the ability to do allocation profiling to figure out which objects are being created (and where).
For a tutorial see the following link.
Try using Spring batch and see if it helps your scenario.
Ref : https://howtodoinjava.com/spring-batch/flatfileitemreader-read-csv-example/
I have the following problem:
In a loop, each time I need to write a large string into a file (or temporary file), and then a process takes that file as an argument for the next step.
Something along the lines of:
for (int i = 0; i < n; i++) {
    File f = File.createTempFile("xxx", "xxx");
    // write into f, etc.
    String result = func(f);
}
Since creating a File and writing a string into it each time seems quite costly, is there any alternative method?
If these Strings do not need to be immediately persisted to a File, you could store them in memory, some sort of Collection, e.g. an ArrayList. And when the list gets "large", say, every tenth time, write all ten at once to a file. This cuts file creation by 10X.
The danger is that if there is a crash you may lose up to 9 values.
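A rough sketch of that idea; buildLargeString and func are hypothetical stand-ins for whatever currently produces the string and consumes the file, and the batch size of ten is arbitrary:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Sketch only: buffer strings in memory and write them out ten at a time.
static void runBatched(int n) throws IOException {
    List<String> pending = new ArrayList<>();
    final int flushEvery = 10;                 // arbitrary batch size
    for (int i = 0; i < n; i++) {
        pending.add(buildLargeString(i));      // hypothetical producer of the large string
        if (pending.size() == flushEvery || i == n - 1) {
            File f = File.createTempFile("xxx", "xxx");
            Files.write(f.toPath(), pending);  // one file write for the whole batch
            String result = func(f);           // same downstream call as in the question
            pending.clear();
        }
    }
}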
I have a program which writes some 8 million lines of data to a flat file. As of now, the program calls BufferedWriter.write for each record, and I was planning to write in bulk with the following strategy:
Keep a data structure (I used an array) to hold a specific number of records.
Write the details to a file using the array. Here is the code snippet (array is the name of the array which stores the records, and thresholdCount is the kick-off for the writing process):
if (array.length == thresholdCount) {
    writeBulk(array);
}

public void writeBulk(String[] inpArray) {
    for (String line : inpArray) {
        if (line != null) {
            try {
                writer.write(line + "\n");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
However, I am not seeing much performance improvement. I want to know if there is a way to determine the optimal threshold count.
I was also planning to further tune the code so that each element in the array is a concatenation of some n records before calling the bulk method. For example, an array of length 5000 would actually contain 50,000 records, with each index in the array holding 10 records. However, before doing so, I need an expert opinion.
Writes to files are already buffered in a similar fashion before they are pushed to disk (unless you flush -- which actually doesn't always do exactly that either). Thus pre-buffering the writes will not speed up the overall process. Note that some IO classes try to do immediate writes by inserting flush requests after each write. For those special cases pre-buffering can sometimes help, but usually you just use a Buffered version of the class in the first place rather than buffering manually yourself.
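In other words, something like the following already coalesces the small writes into large disk writes; records and the 64 KB buffer size are only placeholders:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// Sketch: let BufferedWriter do the batching; no manual array of pending lines is needed.
static void writeAll(List<String> records, String path) throws IOException {
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(path), 1 << 16)) {
        for (String line : records) {
            writer.write(line);
            writer.newLine();
        }
    }   // close() flushes whatever is still buffered
}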
If you were writing to somewhere other than the end of the file, then you could see an improvement as writing to the middle of a file wouldn't need to copy the contents of the already flushed entries sitting on your hard-disk.
I am struggling to figure out what's causing this OutofMemory Error. Making more memory available isn't the solution, because my system doesn't have enough memory. Instead I have to figure out a way of re-writing my code.
I've simplified my code to try to isolate the error. Please take a look at the following:
File[] files = new File(args[0]).listFiles();
int filecnt = 0;
LinkedList<String> urls = new LinkedList<String>();
for (File f : files) {
    if (filecnt > 10) {
        System.exit(1);
    }
    System.out.println("Doing File " + filecnt + " of " + files.length + " :" + f.getName());
    filecnt++;
    FileReader inputStream = null;
    StringBuilder builder = new StringBuilder();
    try {
        inputStream = new FileReader(f);
        int c;
        char d;
        while ((c = inputStream.read()) != -1) {
            d = (char) c;
            builder.append(d);
        }
    }
    finally {
        if (inputStream != null) {
            inputStream.close();
        }
    }
    inputStream.close();
    String mystring = builder.toString();
    String temp[] = mystring.split("\\|NEWandrewLINE\\|");
    for (String s : temp) {
        String temp2[] = s.split("\\|NEWandrewTAB\\|");
        if (temp2.length == 22) {
            urls.add(temp2[7].trim());
        }
    }
}
I know this code is probably pretty confusing :) I have loads of text files in the directory that is specified in args[0]. These text files were created by me. I used |NEWandrewLINE| to indicate a new row in the text file, and |NEWandrewTAB| to indicate a new column. In this code snippet, I am trying to access the URL of each stored row (which is in the 8th column of each row). So, I read in the whole text file, split the string on |NEWandrewLINE|, and then split each of the resulting substrings on |NEWandrewTAB|. I add the URL to the LinkedList (called "urls") with the line urls.add(temp2[7].trim()).
Now, the output of running this code is:
Doing File 0 of 973 :results1322453406319.txt
Doing File 1 of 973 :results1322464193519.txt
Doing File 2 of 973 :results1322337493419.txt
Doing File 3 of 973 :results1322347332053.txt
Doing File 4 of 973 :results1322330379488.txt
Doing File 5 of 973 :results1322369464720.txt
Doing File 6 of 973 :results1322379574296.txt
Doing File 7 of 973 :results1322346981999.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuilder.append(StringBuilder.java:203)
at Twitter.main(Twitter.java:86)
Where main line 86 relates to the line builder.append(d); in this example.
But the thing I don't understand is that if I comment out the line urls.add(temp2[7].trim()); I don't get any error. So the error seems to be caused by the linkedlist "urls" overfilling. But why then does the reported error relate to the StringBuilder?
Try to replace urls.add(temp2[7].trim()); with urls.add(new String(temp2[7].trim()));.
I suppose that your problem is that you are in fact storing the entire file content and not just the extracted URL field in your urls list, although that's not really obvious. It is actually an implementation-specific issue with the String class: in older Java versions (before 7u6), String#split and String#trim return new String objects which share the same internal char array as the original string and only differ in their offset and length fields. Using the new String(String) constructor makes sure that you only keep the relevant part of the original data.
The linked list is using more memory each time you add a string. This means you can be left with not enough memory to build your StringBuilder.
The way to avoid this issue is to write the results to a file instead of to a List, as you don't appear to have enough memory to keep the List in memory.
Because this is "out of memory" and not "out of heap", you have LOTS of small temporary objects.
I would suggest you give your JVM an -Xmx maximum heap size limit that fits in your RAM.
To use less memory I would use a buffered reader to pull in an entire line at a time and save on the temporary object creation.
The simple answer is: you should not load all the URLs from the text files into memory. You are surely doing this because you want to process them in a next step. So instead of adding them to a List in memory, do the next step right away (maybe storing them in a database or checking if they are reachable) and forget that URL.
How many URLS do you have? Looks like you're just storing more of them than you can handle.
As far as I can see, the linked list is the only object that is not scoped inside the loop, so cannot be collected.
For an OOM error, it doesn't really matter where it is thrown.
To check this properly, use a profiler (look at JVisualVM for a free one, and you probably already have it). You'll see which objects are in the heap. You can also have the JVM dump its memory into a file when it crashes, then analyse that file with visualvm. You should see that one thing is grabbing all of your memory. I'm suspecting it's all the URLs.
There are several experts in here already, so I'll be brief about the problems:
Inappropriate use of StringBuilder:
StringBuilder builder = new StringBuilder();
try {
    inputStream = new FileReader(f);
    int c;
    char d;
    while ((c = inputStream.read()) != -1) {
        d = (char) c;
        builder.append(d);
    }
}
Java is beautiful when you process small amounts of data at a time; remember the garbage collector.
Instead, I would recommend that you read the text file one line at a time, process the line, and move on, never building a huge in-memory ball of StringBuilder just to get a String (see the sketch below).
Imagine your text file is 1 GB in size; you are done, mate.
Do the real processing while reading the file (as in item #1).
You don't need to close the InputStream again; the code in the finally block is good enough.
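Here is a sketch of that line-by-line approach, adapted to the markers from the question; it assumes each record sits on its own physical line, which may need adjusting if |NEWandrewLINE| is the only row separator in the file:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;

// Sketch only: process each chunk of text as it is read instead of building one huge String.
static void collectUrls(File f, LinkedList<String> urls) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(f))) {
        String chunk;
        while ((chunk = in.readLine()) != null) {
            for (String row : chunk.split("\\|NEWandrewLINE\\|")) {
                String[] cols = row.split("\\|NEWandrewTAB\\|");
                if (cols.length == 22) {
                    urls.add(new String(cols[7].trim()));   // copy, so the rest of the row can be GC'd
                }
            }
        }
    }
}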
regards
If the LinkedList eats your memory, every subsequent command which allocates memory may fail with an OOM error. So this looks like your problem.
You're reading the files into memory. At least one file is simply too big to fit into the default JVM heap. You can allow it to use a lot more memory with an argument like -Xmx1g on the command line after java.
By the way, reading a file one character at a time is really inefficient!
Instead of trying to split the string (which basically creates an array of substrings based on the split), thereby using more than double the memory each time you use split, you should try to do regex-based matching of the start and end patterns, extract the individual sub-strings one by one, and then extract the URL from each of them.
Also, if your file is large, I would suggest that you not even load all of it into memory at once: stream its contents into a buffer (of manageable size) and use the pattern-based search on that (removing from / adding to the buffer as you progress through the file contents).
This implementation will slow down the program a bit but will use considerably less memory.
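A rough illustration of that idea, reusing the delimiters from the question: the Matcher walks the row boundaries one at a time, so only one row's worth of substrings exists at any moment (a sketch, not a drop-in replacement for the streamed-buffer variant):
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: walk the row delimiters with a Matcher instead of splitting the whole text at once.
static final Pattern ROW = Pattern.compile("\\|NEWandrewLINE\\|");
static final Pattern COL = Pattern.compile("\\|NEWandrewTAB\\|");

static void extractUrls(String content, List<String> urls) {
    int rowStart = 0;
    Matcher rows = ROW.matcher(content);
    while (true) {
        int rowEnd = rows.find() ? rows.start() : content.length();
        String[] cols = COL.split(content.substring(rowStart, rowEnd));
        if (cols.length == 22) {
            urls.add(new String(cols[7].trim()));
        }
        if (rowEnd == content.length()) {
            break;                       // last row handled
        }
        rowStart = rows.end();
    }
}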
One major problem in your code is that you read the whole file into a StringBuilder, then convert it into a String, and then split it into smaller parts. So if the file size is large you will get into trouble. As suggested by others, process the file line by line, as that should save a lot of memory.
Also, you should check the size of your list after processing each file. If the size is very large you may want to use a different approach or increase the memory for your process via the -Xmx option.
I am trying to improve an external sort implementation in java.
I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java's Heap.
I would like a more scalable method of doing this without losing speed because of a bunch of constructor calls.
One solution is to only open files when they are needed, then read the first line and then delete it. But I am afraid that this will be significantly slower.
So, using Java libraries, what is the most efficient way of doing this?
--Edit--
For external sort, the usual method is to break a large file up into several chunk files and sort each of the chunks. Then treat the sorted files like buffers: pop the top item from each file; the smallest of all of those is the global minimum. Continue until all items have been processed.
http://en.wikipedia.org/wiki/External_sorting
My temporary files (buffers) are basically BufferedReader objects. The operations performed on these files are the same as stack/queue operations (peek and pop, no push needed).
I am trying to make these peek and pop operations more efficient. This is because using many BufferedReader objects takes up too much space.
I'm away from my compiler at the moment, but I think this will work. Edit: works fine.
I urge you to profile it and see. I bet the constructor calls are going to be nothing compared to the file I/O and your comparison operations.
public class FileStack {

    private File file;
    private long position = 0;
    private String cache = null;

    public FileStack(File file) {
        this.file = file;
    }

    public String peek() throws IOException {
        if (cache != null) {
            return cache;
        }
        BufferedReader r = new BufferedReader(new FileReader(file));
        try {
            r.skip(position);
            cache = r.readLine();
            return cache;
        } finally {
            r.close();
        }
    }

    public String pop() throws IOException {
        String r = peek();
        if (r != null) {
            // if you have \r\n line endings, you may need +2 instead of +1
            // if lines could end either way, you'll need something more complicated
            position += r.length() + 1;
            cache = null;
        }
        return r;
    }
}
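A possible way to use it for the merge phase, keyed on each stack's current head line (a sketch only, assuming natural String ordering; peek() is cheap inside the comparator because the value is cached after the first read):
import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: k-way merge over the sorted chunk files using FileStack's peek/pop.
static void merge(List<File> chunks, Writer out) throws IOException {
    PriorityQueue<FileStack> heap = new PriorityQueue<>((a, b) -> {
        try {
            return a.peek().compareTo(b.peek());   // cached after the first call
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    });
    for (File c : chunks) {
        FileStack fs = new FileStack(c);
        if (fs.peek() != null) {
            heap.add(fs);                          // skip empty chunk files
        }
    }
    while (!heap.isEmpty()) {
        FileStack smallest = heap.poll();
        out.write(smallest.pop());
        out.write('\n');
        if (smallest.peek() != null) {
            heap.add(smallest);                    // re-insert with its new head line
        }
    }
}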
If heap space is the main concern, use the second form of the BufferedReader constructor (http://java.sun.com/j2se/1.5.0/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)) and specify a small buffer size.
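For example (tempFile and the 1024-char figure are just placeholders; the default buffer is 8192 chars):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Sketch: cap each reader's buffer at 1024 chars instead of the 8192-char default.
static BufferedReader smallBufferReader(File tempFile) throws IOException {
    return new BufferedReader(new FileReader(tempFile), 1024);
}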
I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java's Heap.
This is a really surprising claim. Unless you have thousands of files open at the same time, there is no way that this should stress the heap. The default buffer size for a BufferedReader is 8192 chars, and there should be little extra space required per reader. A thousand of those buffers only comes to around 16 MB, which is tiny compared with a typical Java application's memory usage.
Consider the possibility that something else is causing the heap problems. For example, if your program retained references to each line that it read, THAT would lead to heap problems.
(Or maybe your notion of what is "too much space" is unrealistic.)
One solution is to only open files when they are needed, then read the first line and then delete it. But I am afraid that this will be significantly slower.
There is no doubt that it would be significantly slower! There is simply no efficient way to delete the first line from a file. Not in Java, or in any other language. Deleting characters from the beginning or middle of a file entails copying the file to a new one while skipping over the characters that need to be removed. There is no faster alternative.