Read large file (Java Heap Space)

I want to read a CSV file, create an object from every row, and then save these objects to a database.
When I read all lines from the file and store all the objects inside an ArrayList, I get a Java Heap Space error.
I tried saving every record immediately after reading it, but then saving the records with Hibernate's save() method takes a lot of time.
I also tried checking the size of my ArrayList and saving the data whenever it reaches 100k entries (the commented-out part of the code).
Question: Is there any way to read the file in parts, or a better way to store the data in Java?
String[] colNames;
String[] values;
String line;
Map<Object1, Object1> newObject1Objects = new HashMap<Object1, Object1>();
Map<Object1, Integer> objIdMap = objDao.createObjIdMap();
StringBuilder raportBuilder = new StringBuilder();
Long lineCounter = 1L;

BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(filename), "UTF-8"));
colNames = reader.readLine().split(";");
int columnLength = colNames.length;

while ((line = reader.readLine()) != null) {
    lineCounter++;
    line = line.replace("\"", "").replace("=", "");
    values = line.split(";", columnLength);

    // Object1
    Object1 object1 = createObject1Object(values);
    if (objIdMap.containsKey(object1)) {
        object1.setObjId(objIdMap.get(object1));
    } else if (newObject1Objects.containsKey(object1)) {
        object1 = newObject1Objects.get(object1);
    } else {
        newObject1Objects.put(object1, object1);
    }

    // ==============================================
    // Object2
    Object2 object2 = createObject2Object(values, object1,
            lineCounter, raportBuilder);
    listOfObject2.add(object2);

    /*
    logger.error("listOfObject2.size():" + listOfObject2.size());
    if (listOfObject2.size() % 100000 == 0) {
        object2Dao.performImportOperation(listOfObject2);
        listOfObject2.clear();
    }
    */
}
object2Dao.performImportOperation(listOfObject2);

Increasing the max heap size won't help you if you want to process really large files. Your friend is batching.
Hibernate doesn't employ JDBC batching implicitly; each INSERT and UPDATE statement is executed separately. Read "How do you enable batch inserts in hibernate?" to get information on how to enable it.
Pay attention to IDENTITY generators, as they disable JDBC batching for inserts.
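A minimal sketch of what that batching can look like (the sessionFactory, the batch size of 50, and the reuse of createObject1Object/createObject2Object from the question are assumptions, and the Object1 de-duplication is left out for brevity):

// Hibernate configuration (standard property names):
//   hibernate.jdbc.batch_size = 50
//   hibernate.order_inserts   = true

int batchSize = 50;
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
try {
    long lineCounter = 1L;
    String line;
    while ((line = reader.readLine()) != null) {
        lineCounter++;
        String[] values = line.replace("\"", "").replace("=", "").split(";", columnLength);
        Object2 object2 = createObject2Object(values, createObject1Object(values),
                lineCounter, raportBuilder);
        session.save(object2);
        if (lineCounter % batchSize == 0) {
            session.flush();  // sends the queued INSERTs to the database as one JDBC batch
            session.clear();  // detaches the saved entities so the heap stays flat
        }
    }
    tx.commit();
} finally {
    session.close();
}

With this loop there is never more than one batch of unsaved entities in the session at a time, so heap usage no longer grows with the file size.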

How to sort N files

Following this answer -->
How do I sort very large files
I only need the merge step on N files that are already sorted on disk. I want to merge them into one big file, and my limitation is memory: no more than K lines in memory (K < N), so I cannot read them all in and then sort. Preferably in Java.
So far I have tried the code below, but I need a good way to iterate over all N files line by line (with no more than K lines in memory) and write the sorted final file to disk.
public void run() {
    try {
        System.out.println(file1 + " Started Merging " + file2);
        FileReader fileReader1 = new FileReader(file1);
        FileReader fileReader2 = new FileReader(file2);
        //......TODO with N ?? ......
        FileWriter writer = new FileWriter(file3);
        BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
        BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
        String line1 = bufferedReader1.readLine();
        String line2 = bufferedReader2.readLine();
        // Merge the 2 files based on which line compares greater.
        while (line1 != null || line2 != null) {
            if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                writer.write(line2 + "\r\n");
                line2 = bufferedReader2.readLine();
            } else {
                writer.write(line1 + "\r\n");
                line1 = bufferedReader1.readLine();
            }
        }
        System.out.println(file1 + " Done Merging " + file2);
        new File(file1).delete();
        new File(file2).delete();
        writer.close();
    } catch (Exception e) {
        System.out.println(e);
    }
}
You can use something like this:
public static void mergeFiles(String target, String... input) throws IOException {
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String, BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());
    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new InputStreamReader(
                Files.newInputStream(Paths.get(file), StandardOpenOption.DELETE_ON_CLOSE)));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String, BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t;
    }
}
Note that this code, unlike the code in your question, handles the header line. Like the original code, it will delete the input files. If that's not intended, you can remove the DELETE_ON_CLOSE option and simplify the entire reader construction to
BufferedReader br = new BufferedReader(new FileReader(file));
It holds exactly as many lines in memory as you have files.
While it is possible in principle to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for a questionably small saving. For example, you already have N strings in memory when calling this method, due to the fact that you have N file names.
However, if you want to reduce the number of lines held at the same time at all costs, you can simply use the method shown in your question: merge the first two files into a temporary file, merge that temporary file with the third into another temporary file, and so on, until merging the temporary file with the last input file into the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering while trying to mitigate the horrible performance of this approach.
Likewise, you can use the method shown above to merge K files into a temporary file, then merge the temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files into the final result, to have memory consumption scaling with K < N. This approach allows you to tune K to a reasonable ratio to N, trading memory for speed. I think in most practical cases, K == N will work just fine.
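A minimal sketch of that batched strategy, reusing the mergeFiles method above (the temporary-file naming is an assumption; since mergeFiles deletes its inputs, the intermediate files clean themselves up):

// Merges the inputs K at a time until only one output remains.
public static void mergeInBatches(String target, int k, List<String> inputs) throws IOException {
    List<String> pending = new ArrayList<>(inputs);
    int tempCount = 0;
    while (pending.size() > k) {
        // merge the first K pending files into a temporary file
        List<String> batch = pending.subList(0, k);
        String temp = target + ".tmp" + (tempCount++);
        mergeFiles(temp, batch.toArray(new String[0]));
        batch.clear();           // drop the merged inputs from the pending list
        pending.add(0, temp);    // the temporary file joins the next round
    }
    mergeFiles(target, pending.toArray(new String[0]));
}

Each round after the first merges the previous temporary file with the next K-1 inputs, so at most K lines are held in memory at any time.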
@Holger gave a nice answer assuming that K >= N.
You can extend it to the K < N case by using the mark(int) and reset() methods of the BufferedInputStream.
The parameter of mark is how many bytes a single line can have.
The idea is as follows:
Instead of putting all the N lines in the TreeMap, you can only have K of them. Whenever you put a new line into the set and it is already 'full', you evict the smallest one from it. Additionally, you reset the stream from which it came, so when you read it again the same data can pop up.
You have to keep track of the maximum line not kept in the TreeSet; let's call it the lower bound. Once there are no elements in the TreeSet greater than the maintained lower bound, you scan all the files once again and repopulate the set.
I'm not sure whether this approach is optimal, but it should be OK.
Moreover, you have to be aware that BufferedInputStream has an internal buffer of at least the size of a single line, so that will consume a lot of your memory; perhaps it would be better to maintain the buffering on your own.
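For illustration, a minimal sketch of the mark/reset mechanics (shown here on a BufferedReader, which supports the same methods; the file name and the 1 MB read-ahead limit are assumptions standing in for "the maximum length of a single line"):

BufferedReader br = new BufferedReader(new FileReader("part1.txt"));
br.mark(1 << 20);              // remember this position; valid while we read less than the limit
String peeked = br.readLine(); // consume the line once
br.reset();                    // rewind to the marked position
String again = br.readLine();  // the same line pops up again
br.close();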

Garbage Collection for Strings

I have a text file which I need to read line by line and do some processing on each line.
ConcurrentMap<String, String> hm = new ConcurrentHashMap<>();
InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream("filename.txt");
InputStreamReader stream = new InputStreamReader(is, StandardCharsets.UTF_8);
BufferedReader reader = new BufferedReader(stream);
while(true)
{
line = reader.readLine();
if (line == null) {
break;
}
String text = line.substring(0, line.lastIndexOf(",")).trim();
String id = line.substring(line.lastIndexOf(",") + 1).trim();
hm.put(text,id);
}
I need to know when the strings created during the substring() and trim() operations will be garbage collected.
Also, what about the String line?
The intermediate strings become eligible for garbage collection as soon as they are no longer reachable, which for the temporary results of substring() and trim() is at the end of each iteration of the while loop (and line becomes unreachable when it is reassigned by the next readLine()). But from a memory-usage point of view this is a moot point, because the strings you actually keep are stored in a map that will not go out of scope.
If you include information about how you are using this map, maybe a solution can be given which avoids having to store everything in memory.

Java GC overhead limit exceeded

I am trying to preprocess a large txt file (10 GB) and store it in a binary file for future use. As the code runs it slows down and ends with:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead
limit exceeded
The input file has the following structure
200020000000008;0;2
200020000000004;0;2
200020000000002;0;2
200020000000007;1;2
This is the code I am using:
String strLine;
FileInputStream fstream = new FileInputStream(args[0]);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
// Read the file line by line
HMbicnt map = new HMbicnt("-1");
ObjectOutputStream outputStream = null;
outputStream = new ObjectOutputStream(new FileOutputStream(args[1]));
int sepIndex = 15;
int sepIndex2 = 0;
String str_i = "";
String bb = "";
String bbBlock = "init";
int cnt = 0;
lineCnt = 0;
while ((strLine = br.readLine()) != null) {
    // parse the line
    str_i = strLine.substring(0, sepIndex);
    sepIndex2 = strLine.substring(sepIndex + 1).indexOf(';');
    bb = strLine.substring(sepIndex + 1, sepIndex + 1 + sepIndex2);
    cnt = Integer.parseInt(strLine.substring(sepIndex + 1 + sepIndex2 + 1));
    if (!bb.equals(bbBlock)) {
        outputStream.writeObject(map);
        outputStream.flush();
        map = new HMbicnt(bb);
        map.addNew(str_i + ";" + bb, cnt);
        bbBlock = bb;
    } else {
        map.addNew(str_i + ";" + bb, cnt);
    }
}
outputStream.writeObject(map);
// Close the input stream
br.close();
outputStream.writeObject(map = null);
outputStream.close();
Basically, it goes through the input file and stores data in the object HMbicnt (which is a hash map). Once it encounters a new value in the second column, it should write the object to the output file, free the memory, and continue.
Thanks for any help.
I think the problem is not that 10 GB is in memory, but that you are creating too many HashMaps. Maybe you could clear the HashMap instead of re-creating it once you don't need it anymore.
There seems to have been a similar problem in "java.lang.OutOfMemoryError: GC overhead limit exceeded"; it is also about HashMaps.
Simply put, you're using too much memory. Since, as you said, your file is 10 GB, there is no way you're going to be able to fit it all into memory (unless, of course, you happen to have over 10 GB of RAM and have configured Java to use it).
From what I can tell from your code and description of it, you're reading the entire file into memory and adding it to one huge in-RAM map as you're doing so, then writing your result to output. This is not feasible. You'll need to redesign your code to work in-place (i.e. only keep a small portion of the file in memory at any given time).
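A minimal sketch of that redesign, reusing br, outputStream and sepIndex from the question but substituting a plain HashMap for the custom HMbicnt class (an assumption, since that class isn't shown). The ObjectOutputStream.reset() call matters because the stream otherwise keeps a back-reference to every object it has written:

// Keep only the current block in memory, write each finished block immediately,
// and reset the ObjectOutputStream so it does not retain previously written objects.
Map<String, Integer> block = new HashMap<>();
String currentKey = "init";
String strLine;
while ((strLine = br.readLine()) != null) {
    String id = strLine.substring(0, sepIndex);
    int sep2 = strLine.substring(sepIndex + 1).indexOf(';');
    String bb = strLine.substring(sepIndex + 1, sepIndex + 1 + sep2);
    int cnt = Integer.parseInt(strLine.substring(sepIndex + 1 + sep2 + 1));
    if (!bb.equals(currentKey)) {
        outputStream.writeObject(block);  // the block is serialized right away
        outputStream.reset();             // drop the stream's back-references
        block.clear();                    // reuse one map instead of allocating new ones
        currentKey = bb;
    }
    block.put(id + ";" + bb, cnt);        // stand-in for map.addNew(...)
}
outputStream.writeObject(block);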

File Handling - Reading from the beginning again [duplicate]

This question already has answers here:
Reset buffer with BufferedReader in Java?
I need to create an array of objects to store the records from a file. I don't know the size of the array in advance, so I first have to find the number of lines in the file; using that line count, the size of the array can be determined. Then I need to read the file again from the beginning to store the records from the file in the array. This is where I am struggling: I don't know how to read the file again from the beginning. Please advise.
/**
 * Loads the game records from a text file.
 * A GameRecord array is constructed to store all the game records.
 * The size of the GameRecord array should be the same as the number of non-empty records in the text file.
 * The GameRecord array contains no null/empty entries.
 *
 * @param reader The java.io.Reader object that points to the text file to be read.
 * @return A GameRecord array containing all the game records read from the text file.
 */
public GameRecord[] loadGameRecord(java.io.Reader reader) {
    // write your code after this line
    String[] parts;
    GameRecord[] gameRecord = null;
    FileReader fileReader = (FileReader) reader;
    java.io.BufferedReader bufferedReader = new java.io.BufferedReader(fileReader);
    try {
        int linenumber = 0;
        String sCurrentLine;
        while ((sCurrentLine = bufferedReader.readLine()) != null) {
            System.out.println(sCurrentLine);
            linenumber++;
        }
        gameRecord = new GameRecord[linenumber]; // creating space for the total number of lines
        // How to read the file from the beginning again, and why am I getting bufferedReader.readLine() as null?
        bufferedReader = new java.io.BufferedReader(fileReader);
        int i = 0;
        while ((sCurrentLine = bufferedReader.readLine()) != null) {
            System.out.println(sCurrentLine);
            parts = sCurrentLine.split("\t");
            gameRecord[i++] = new GameRecord(parts[0], Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
        }
    } catch (IOException exe) {
        System.err.println("IOException: " + exe.getMessage());
        exe.printStackTrace();
    } finally {
        try {
            if (bufferedReader != null)
                bufferedReader.close();
            if (fileReader != null)
                fileReader.close();
        } catch (IOException exe) {
            System.err.println("IOException: " + exe.getMessage());
        }
    }
    return gameRecord;
}
Note: I will get the reference to a file as an argument, and the Reader class has been used. Can I use this reference for a FileInputStream?
Your current approach is very limiting because you are using an array, which by definition has a fixed size. A better approach would be to use an ArrayList of GameRecord objects. With this approach, you can simply make a single pass through the file and add elements to the ArrayList as necessary.
Sample code:
List<GameRecord> grList = new ArrayList<GameRecord>();
bufferedReader = new java.io.BufferedReader(fileReader);
while ((sCurrentLine = bufferedReader.readLine()) != null) {
    parts = sCurrentLine.split("\t");
    GameRecord gameRecord = new GameRecord(parts[0],
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2]));
    grList.add(gameRecord); // add the GameRecord object to the ArrayList
                            // and let the JVM worry about sizing problems
}
// finally, convert the ArrayList to an array
GameRecord[] grArray = new GameRecord[grList.size()];
grArray = grList.toArray(grArray);
If you must reset the BufferedReader then have a look at this SO article which discusses this.
If you used org.apache.commons.io.IOUtils, you could employ its readLines method, as long as your file never contains "too many" records. This method returns a List, which is easy to convert to an array of Strings.
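A minimal sketch of that suggestion, assuming commons-io is on the classpath and reusing the reader parameter and GameRecord class from the question (IOUtils.readLines(Reader) is the relevant overload):

// Reads every line into memory in one call; only suitable while the file stays small.
List<String> lines = IOUtils.readLines(reader);
GameRecord[] records = new GameRecord[lines.size()];
for (int i = 0; i < lines.size(); i++) {
    String[] parts = lines.get(i).split("\t");
    records[i] = new GameRecord(parts[0],
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2]));
}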

What is the efficient way to process large text files?

I have two files:
1- with 1400000 lines (records) --- 14 MB
2- with 16000000 lines --- 170 MB
I want to find out whether each record or line in file 1 is also in file 2 or not.
I developed a Java app that does the following: read file 1 line by line and pass each line to a method that loops through file 2.
Here is my code:
public boolean hasIDin(String bioid) throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
    long bid = Long.parseLong(bioid);
    String thisLine;
    while ((thisLine = br.readLine()) != null) {
        if (Long.parseLong(thisLine) == bid)
            return true;
    }
    return false;
}

public void getMBD() throws Exception {
    BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
    OutputStream os = new FileOutputStream("C://MBD.txt");
    PrintWriter pr = new PrintWriter(os);
    String thisLine;
    int count = 1;
    while ((thisLine = br.readLine()) != null) {
        String bioid = thisLine;
        System.out.println(count);
        if (!hasIDin(bioid))
            pr.println(bioid);
        count++;
    }
    pr.close();
}
When I run it, it seems it will take more than 1944 hours to complete, as processing every line takes about 5 seconds. That is about three months!
Are there any ideas for getting it done in much, much less time?
Thanks in advance.
Why don't you:
read all the lines of one file into a set (a Set is fine, but a TLongHashSet would be more efficient);
then, for each line in the other file, see if it is in the set.
Here is a tuned implementation which prints the following and uses < 64 MB.
Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids
Code
public static void main(String... args) throws IOException {
    generateFile("/tmp/DID.txt", 1400000);
    generateFile("/tmp/AllIDs.txt", 16000000);

    long start = System.currentTimeMillis();
    TLongHashSet did = readLongs("/tmp/DID.txt");
    TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt", did);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took " + time + " ms to find " + validIDS.size() + " valid ids");
}

private static TLongHashSet readLongs(String filename) throws IOException {
    System.out.println("Reading ids in " + filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128 * 1024);
    TLongHashSet ids = new TLongHashSet();
    for (String line; (line = br.readLine()) != null; )
        ids.add(Long.parseLong(line));
    br.close();
    return ids;
}

private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
    System.out.println("Reading ids in " + filename);
    BufferedReader br = new BufferedReader(new FileReader(filename), 128 * 1024);
    TLongHashSet ids = new TLongHashSet();
    for (String line; (line = br.readLine()) != null; ) {
        long val = Long.parseLong(line);
        if (validSet.contains(val))
            ids.add(val);
    }
    br.close();
    return ids;
}

private static void generateFile(String filename, int number) throws IOException {
    System.out.println("Generating " + number + " ids to " + filename);
    PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128 * 1024));
    Random rand = new Random();
    for (int i = 0; i < number; i++)
        pw.println(rand.nextInt(1 << 26));
    pw.close();
}
170 MB + 14 MB are not such huge files.
My suggestion is to load the smaller file into a java.util.Map, parse the bigger file line by line (record by record), and check whether the current line is present in this Map.
P.S. The question looks like something trivial in terms of an RDBMS - maybe it's worth using one?
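A minimal sketch of that idea, using a HashSet of the parsed ids instead of a Map (a slight simplification) and the file paths from the question:

// Load the smaller file into memory once, then stream the bigger file and
// cross off every id that it contains; whatever is left was never found.
Set<Long> smallIds = new HashSet<>();
BufferedReader small = new BufferedReader(new FileReader("C://DIDs.txt"));
for (String line; (line = small.readLine()) != null; )
    smallIds.add(Long.parseLong(line.trim()));
small.close();

BufferedReader big = new BufferedReader(new FileReader("C://AllIDs.txt"));
for (String line; (line = big.readLine()) != null; )
    smallIds.remove(Long.parseLong(line.trim()));
big.close();

PrintWriter out = new PrintWriter(new FileWriter("C://MBD.txt"));
for (Long id : smallIds)
    out.println(id);   // ids from DIDs.txt that never appeared in AllIDs.txt
out.close();

This turns the O(N^2) scan into one pass over each file plus constant-time set lookups.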
You can't do an O(N^2) algorithm when each iteration takes this long; that's completely unacceptable.
If you have enough RAM, you simply parse file 1, create a map of all numbers, then parse file 2 and check your map.
If you don't have enough RAM, parse file 1, create a map and store it to a file, then parse file 2 and read the map. The key is to make the map as easy to parse as possible - make it a binary format, maybe with a binary tree or something where you can quickly skip around and search. (EDIT: I have to add Michael Borgwardt's Grace Hash Join link, which shows an even better way: http://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)
If there is a limit to the size of your files, option 1 is easier to implement - unless you're dealing with huuuuuuuge files (I'm talking lots of GB), you definitely want to do that.
Usually, memory-mapping is the most efficient way to read large files. You'll need to use java.nio.MappedByteBuffer and java.io.RandomAccessFile.
But your search algorithm is the real problem. Building some sort of index or hash table is what you need.
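For reference, a minimal sketch of memory-mapping a file with those two classes (the path is the one from the question; a single mapping is limited to 2 GB, which is plenty for a 170 MB file):

RandomAccessFile raf = new RandomAccessFile("C://AllIDs.txt", "r");
FileChannel channel = raf.getChannel();
// Map the whole file into memory; the OS pages it in on demand.
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
while (buffer.hasRemaining()) {
    byte b = buffer.get();
    // ... accumulate bytes into lines / numbers here instead of going through a Reader
}
channel.close();
raf.close();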
