I am trying to preprocess a large text file (10 GB) and store it in a binary file for future use. As the code runs, it slows down and ends with
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
The input file has the following structure
200020000000008;0;2
200020000000004;0;2
200020000000002;0;2
200020000000007;1;2
This is the code I am using:
String strLine;
FileInputStream fstream = new FileInputStream(args[0]);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
//Read File Line By Line
HMbicnt map = new HMbicnt("-1");
ObjectOutputStream outputStream = null;
outputStream = new ObjectOutputStream(new FileOutputStream(args[1]));
int sepIndex = 15;
int sepIndex2 = 0;
String str_i = "";
String bb = "";
String bbBlock = "init";
int cnt = 0;
lineCnt = 0;
while ((strLine = br.readLine()) != null) {
// parse the line
str_i = strLine.substring(0, sepIndex);
sepIndex2 = strLine.substring(sepIndex+1).indexOf(';');
bb = strLine.substring(sepIndex+1, sepIndex+1+sepIndex2);
cnt = Integer.parseInt(strLine.substring(sepIndex+1+sepIndex2+1));
if(!bb.equals(bbBlock)){
outputStream.writeObject(map);
outputStream.flush();
map = new HMbicnt(bb);
map.addNew(str_i + ";" + bb, cnt);
bbBlock = bb;
}
else{
map.addNew(str_i + ";" + bb, cnt);
}
}
outputStream.writeObject(map);
//Close the input stream
br.close();
outputStream.writeObject(map = null);
outputStream.close();
Basically, it goes through the input file and stores the data in the object HMbicnt (which is a hash map). Once it encounters a new value in the second column, it should write the object to the output file, free the memory and continue.
Thanks for any help.
I think the problem is not that 10G is in memory, but that you are creating too many HashMaps. Maybe you could clear the HashMap instead of re-creating it after you don't need it anymore.
A similar problem seems to have come up in the question "java.lang.OutOfMemoryError: GC overhead limit exceeded", which is also about HashMaps.
Simply put, you're using too much memory. Since, as you said, your file is 10 GB, there is no way you're going to be able to fit it all into memory (unless, of course, you happen to have over 10 GB of RAM and have configured Java to use it).
From what I can tell from your code and description of it, you're reading the entire file into memory and adding it to one huge in-RAM map as you're doing so, then writing your result to output. This is not feasible. You'll need to redesign your code to work in-place (i.e. only keep a small portion of the file in memory at any given time).
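A minimal sketch of keeping only one block in memory at a time. It assumes the input is grouped by the second column, as in the question, and it uses a plain HashMap<String, Integer> as a hypothetical stand-in for HMbicnt, whose addNew behaviour is not shown. One detail worth checking, which the posts above do not mention: an ObjectOutputStream keeps a reference to every object it has written so that it can emit back-references, and those objects stay reachable until reset() or close() is called. The sketch therefore calls reset() after each block.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.StringReader;
import java.util.HashMap;

public class BlockWriter {

    // Writes one map per run of equal second-column values; returns the number of maps written.
    // HashMap<String, Integer> is a hypothetical stand-in for the question's HMbicnt class.
    static int writeBlocks(BufferedReader in, ObjectOutputStream out) throws IOException {
        HashMap<String, Integer> block = new HashMap<>();
        String currentKey = null;
        int written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split(";"); // id;block;count
            if (currentKey != null && !parts[1].equals(currentKey)) {
                out.writeObject(block);
                out.reset(); // drop the stream's references to objects already written
                written++;
                block = new HashMap<>();
            }
            currentKey = parts[1];
            block.put(parts[0] + ";" + parts[1], Integer.parseInt(parts[2]));
        }
        if (!block.isEmpty()) {
            out.writeObject(block); // last block
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample data instead of the 10 GB file.
        String sample = "200020000000008;0;2\n200020000000004;0;2\n200020000000007;1;2\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample));
             ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            System.out.println("blocks written: " + writeBlocks(in, out));
        }
    }
}
```

For the real file, the StringReader and ByteArrayOutputStream would be replaced by a FileReader and FileOutputStream on args[0] and args[1].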
I have a java program that uploads files from local to Minio browser. The file size is around 900 MB. When I'm executing the java program I get -
Java.lang.OutOfMemoryError - Java heap Size
I tried increasing heap size both in eclipse.ini as well as under Run-->Configurations-->Project to -Xms4096M -Xmx8192M.
After increasing the heap size, when I executed the program I received:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
How can I upload large files to Minio using Java?
This is what my Java program looks like:
StringBuilder stringBuilder = new StringBuilder();
File[] files = new File(path).listFiles();
showFiles(files);
System.out.println(pathList);
ListIterator<String> itr=pathList.listIterator();
while(itr.hasNext()){
String relativePath=itr.next();
if(relativePath!=null) {
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
System.out.println(absolutePath);
System.out.println(relativePath);
File f =new File(absolutePath);
BufferedReader reader = new BufferedReader(new FileReader(f));
String line = null;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if(stringBuilder.length()!=0) {
// delete the last new line separator
stringBuilder.deleteCharAt(stringBuilder.length() - 1);
}
reader.close();
// Create a InputStream for object upload.
ByteArrayInputStream bais = new ByteArrayInputStream(stringBuilder.toString().getBytes("UTF-8"));
Do you absolutely need to remove a trailing line separator from your text file?
If this is not absolutely required, you could let the minio client libraries handle the upload transparently:
String absolutePath=path+(relativePath).replaceFirst("minio_files", "");
File f =new File(absolutePath);
minio.putObject("bucketName", f.getName(), absolutePath);
According to the minio docs this allows uploads of up to 5 GB. This is easier to implement and faster than any other solution.
If you absolutely need to remove a trailing line separator, you should at least pre-size the StringBuilder (and use the correct code to remove the trailing line separator):
File f = new File(absolutePath);
stringBuilder.ensureCapacity((int) f.length()+2);
try (BufferedReader reader = new BufferedReader(new FileReader(f))) {
String line;
String ls = System.getProperty("line.separator");
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
stringBuilder.append(ls);
}
if (stringBuilder.length() != 0) {
// delete the last new line separator
stringBuilder.setLength(stringBuilder.length() - ls.length());
}
}
Please be aware that this code can never upload files larger than about 2 GB:
arrays in java cannot be larger than Integer.MAX_VALUE-5
therefore StringBuilder cannot be used to create strings with more than Integer.MAX_VALUE-5 characters
transforming the string into a UTF-8 encoded byte array cannot produce a byte array longer than Integer.MAX_VALUE-5 bytes
since UTF-8 is a multibyte encoding, transforming a string with Integer.MAX_VALUE-5 characters into a byte array might not be possible
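To illustrate the last point, here is a small sketch (the class and method names are made up for this example) showing that the UTF-8 byte count can exceed the character count, which is why a string close to the maximum length may not fit into a single byte array:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String accented = "\u00e9\u00e9\u00e9"; // three times the letter e with acute accent
        System.out.println(ascii.length() + " chars -> " + utf8Bytes(ascii) + " bytes");
        System.out.println(accented.length() + " chars -> " + utf8Bytes(accented) + " bytes");
    }
}
```

The first line reports 3 bytes for 3 characters, the second 6 bytes for 3 characters.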
I have a problem with the Java heap space of BlueJ.
I have written a program which reads a .txt file into a String, goes through all the characters of the string and does some processing (I guess the details are not really important). Some of the .txt files are really large (around 200 million characters).
If I try to execute the program with these files, I get the error "Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space". I increased the bluej.windows.vm.args and bluej.windows.vm.args in the bluej.def to 8 GB, and it still does not work. I would have guessed that even a 200-million-character String would not exceed this limit.
Here is the code I use to read in the .txt file:
try
{
FileReader reader = new FileReader(input.getText());
BufferedReader bReader = new BufferedReader(reader);
String parcour = "";
String line = bReader.readLine();
while(line != null)
{
parcour += line;
line = bReader.readLine();
}
input.getText() returns the file path.
I would be really grateful for an answer. Thanks :)
- Cyaena
The explanation below only covers the plain memory needed for the data. All additional memory needed for the data structures is left out. It is more of an overview than an in-depth view.
The memory is consumed on these lines:
String parcour = "";
...
String line = bReader.readLine();
...
parcour += line;
The line parcour += line is compiled into the class file as
new StringBuilder().append(parcour).append(line).toString()
Assume parcour contains a string of size 10 MB and line would be of size 2 MB. Then the memory allocated during parcour += line; would be (roughly)
// creates a StringBuilder object of size 12 MB
new StringBuilder().append(parcour).append(line)
// the `.toString()` would generate a String object of size 12 MB
new StringBuilder().append(parcour).append(line).toString()
Before the newly created String is assigned to parcour, your code needs around 34 MB:
parcour = 10 MB
the temporary StringBuilder object = 12 MB
the String from StringBuilder = 12 MB
------------------------------------------
total 34 MB
A small demo snippet to show that the OutOfMemoryError is thrown much earlier than you currently expect.
OOMString.java
class OOMString {
public static void main(String[] args) throws Exception {
String parcour = "";
char[] chars = new char[1_000];
String line = new String(chars);
while(line != null)
{
System.out.println("length = " + parcour.length());
parcour += line;
}
}
}
OOMStringBuilder.java
class OOMStringBuilder {
public static void main(String[] args) throws Exception {
StringBuilder parcour = new StringBuilder();
char[] chars = new char[1_000];
String line = new String(chars);
while(line != null)
{
System.out.println("length = " + parcour.length());
parcour.append(line);
}
}
}
Both snippets do the same thing. They add a 1,000-character string to parcour until the OutOfMemoryError is thrown. To speed it up, we limit the heap size to 10 MB.
output of java -Xmx10m OOMString
length = 1048000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
output of java -Xmx10m OOMStringBuilder
length = 2052000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
When you execute the code you will notice that OOMString needs much more time to fail (even at a shorter length) than OOMStringBuilder.
You also need to keep in mind that a single Java char is two bytes long. If your file contains 100 ASCII characters, they consume 200 bytes in memory.
Maybe this small demonstration could explain it a little bit for you.
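Applied to the reading loop from the question, the same idea could look like the sketch below. It only swaps the += for a StringBuilder and does not change what is read. The method name readAllLines and the StringReader in main are assumptions for the demo; in the original program a FileReader on input.getText() would be passed instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadAll {

    // Concatenates all lines (without line separators, as in the question) into one String.
    static String readAllLines(Reader source) throws IOException {
        StringBuilder parcour = new StringBuilder();
        try (BufferedReader bReader = new BufferedReader(source)) {
            String line;
            while ((line = bReader.readLine()) != null) {
                parcour.append(line); // grows one buffer instead of copying the whole text per line
            }
        }
        return parcour.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAllLines(new StringReader("ab\ncd\nef")));
    }
}
```

If the characters are only inspected one after another, it may not be necessary to build the String at all; each line could be processed as soon as it is read.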
I've had some problems with BlueJ and Heap Space errors as well. In my case, opening the terminal crashed the entire application. I suspect this had something to do with generating a lot of output, similar to your large String. In my case, I accidentally had created an endless loop somewhere which broke the terminal window.
I had to remove all property files. Now BlueJ works again and gives no more OutOfMemoryErrors.
I hope this might be helpful in other cases as well.
Our application needs to read a file consisting of a single line that contains a large amount of data. We read the line from the file, store it in a String, tokenize the string on - and store the tokens in a list. Some entries of that list are then checked.
The method is as follows:
public boolean checkMessage(String filename){
boolean retBool = true;
LinkedList tokenList;
int size;
String line = "";
try {
File file = new File(filename);
FileInputStream fs = new FileInputStream(file);
InputStreamReader is = new InputStreamReader(fs);
BufferedReader br = new BufferedReader(is);
while ((line = br.readLine()) != null) {
line = line.trim(); // trim() returns a new String, so the result must be assigned
tokenList = tokenizeString(line, "-");
if (tokenList == null) {
retBool = false;
resultMsg = "Error in File.java ";
}
if (retBool) {
retBool = checkMessagePart(tokenList);
}
}
}
The error occurs at the line while ((line = br.readLine()) != null).
The error is:
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.BufferedReader.readLine(BufferedReader.java:363)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
Increasing the heap size didn't work. The file being read is larger than 1 GB. I also tried reading it in chunks of bytes, but adding the data read to a StringBuilder or a list again produces the OutOfMemoryError.
If the problem is that you cannot read the file to a String, then don't do it. Read it token by token by using some other method. The easy one is using Scanner with the right delimiter ("-" in your case). If you find its performance lacking, you could resort to implementing your own version of BufferedReader in which the "lines" are split by that character instead of the normal values.
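A minimal sketch of the Scanner approach, so that only one token is held at a time. Counting the tokens stands in for whatever per-token check the application needs; the names TokenScan and countTokens are made up for this example, and in the real code a FileReader would replace the StringReader.

```java
import java.io.StringReader;
import java.util.Scanner;

public class TokenScan {

    // Counts non-empty tokens separated by "-" without holding the whole input in one String.
    static int countTokens(Readable source) {
        int count = 0;
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter("-");
            while (scanner.hasNext()) {
                String token = scanner.next().trim();
                if (!token.isEmpty()) {
                    count++; // a real program would check the token here
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTokens(new StringReader("alpha-beta-gamma\n")));
    }
}
```

Note that this changes the structure of the original method, which expects the complete token list; the check would have to be rewritten to work token by token.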
I have two files:
1. one with 1,400,000 lines (records), 14 MB
2. one with 16,000,000 lines, 170 MB
I want to find out whether each record (line) in file 1 is also in file 2.
I developed a Java app that does the following: it reads file 1 line by line and passes each line to a method that loops over file 2.
Here is my code:
public boolean hasIDin(String bioid) throws Exception {
BufferedReader br = new BufferedReader(new FileReader("C://AllIDs.txt"));
long bid = Long.parseLong(bioid);
String thisLine;
while((thisLine = br.readLine( )) != null)
{
if (Long.parseLong(thisLine) == bid)
return true;
}
return false;
}
public void getMBD() throws Exception{
BufferedReader br = new BufferedReader(new FileReader("C://DIDs.txt"));
OutputStream os = new FileOutputStream("C://MBD.txt");
PrintWriter pr = new PrintWriter(os);
String thisLine;
int count=1;
while ((thisLine = br.readLine( )) != null){
String bioid = thisLine;
System.out.println(count);
if(! hasIDin(bioid))
pr.println(bioid);
count++;
}
pr.close();
}
When I run it, it looks like it will take more than 1944.44444444444 hours to complete, as processing each line takes 5 seconds. That is about three months!
Are there any ideas for getting it done in much less time?
Thanks in advance.
Why don't you:
read all the lines in file 1 into a set. Set is fine, but TLongHashSet would be more efficient.
for each line in file 2, see if it is in the set.
Here is a tuned implementation which prints the following and uses less than 64 MB.
Generating 1400000 ids to /tmp/DID.txt
Generating 16000000 ids to /tmp/AllIDs.txt
Reading ids in /tmp/DID.txt
Reading ids in /tmp/AllIDs.txt
Took 8794 ms to find 294330 valid ids
Code
public static void main(String... args) throws IOException {
generateFile("/tmp/DID.txt", 1400000);
generateFile("/tmp/AllIDs.txt", 16000000);
long start = System.currentTimeMillis();
TLongHashSet did = readLongs("/tmp/DID.txt");
TLongHashSet validIDS = readLongsUnion("/tmp/AllIDs.txt",did);
long time = System.currentTimeMillis() - start;
System.out.println("Took "+ time+" ms to find "+ validIDS.size()+" valid ids");
}
private static TLongHashSet readLongs(String filename) throws IOException {
System.out.println("Reading ids in "+filename);
BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
TLongHashSet ids = new TLongHashSet();
for(String line; (line = br.readLine())!=null;)
ids.add(Long.parseLong(line));
br.close();
return ids;
}
private static TLongHashSet readLongsUnion(String filename, TLongHashSet validSet) throws IOException {
System.out.println("Reading ids in "+filename);
BufferedReader br = new BufferedReader(new FileReader(filename), 128*1024);
TLongHashSet ids = new TLongHashSet();
for(String line; (line = br.readLine())!=null;) {
long val = Long.parseLong(line);
if (validSet.contains(val))
ids.add(val);
}
br.close();
return ids;
}
private static void generateFile(String filename, int number) throws IOException {
System.out.println("Generating "+number+" ids to "+filename);
PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(filename), 128*1024));
Random rand = new Random();
for(int i=0;i<number;i++)
pw.println(rand.nextInt(1<<26));
pw.close();
}
170 MB + 14 MB are not such huge files.
My suggestion is to load the smaller file into a java.util.Map, then parse the bigger file line by line (record by record) and check whether the current line is present in this Map.
P.S. The question looks trivial in terms of an RDBMS; maybe it's worth using one?
You can't do an O(N^2) when each iteration is so long, that's completely unacceptable.
If you have enough RAM, you simply parse file 1, create a map of all numbers, then parse file 2 and check your map.
If you don't have enough RAM, parse file 1, create a map and store it to a file, then parse file 2 and read the map. The key is to make the map as easy to parse as possible - make it a binary format, maybe with a binary tree or something where you can quickly skip around and search. (EDIT: I have to add Michael Borgwardt's Grace Hash Join link, which shows an even better way: http://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)
If there is a limit to the size of your files, option 1 is easier to implement - unless you're dealing with huuuuuuuge files (I'm talking lots of GB), you definitely want to do that.
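Option 1 could be sketched with the standard library alone, as below. This is an illustration, not the tuned implementation from the other answer: it loads the IDs of the small file into a HashSet and removes every ID that shows up while streaming the large file, so what remains are the IDs of file 1 that are missing from file 2 (which is what the question's code prints). The class and method names are made up, and the sketch assumes one numeric ID per line with no blank lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class MissingIds {

    // Returns the IDs from the small input that never occur in the large input.
    static Set<Long> missing(BufferedReader small, BufferedReader large) throws IOException {
        Set<Long> candidates = new HashSet<>();
        String line;
        while ((line = small.readLine()) != null) {
            candidates.add(Long.parseLong(line.trim()));
        }
        // Stream the large file once; every ID seen there is no longer "missing".
        while (!candidates.isEmpty() && (line = large.readLine()) != null) {
            candidates.remove(Long.parseLong(line.trim()));
        }
        return candidates;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for DIDs.txt (small) and AllIDs.txt (large).
        BufferedReader small = new BufferedReader(new StringReader("1\n2\n3\n"));
        BufferedReader large = new BufferedReader(new StringReader("3\n4\n1\n"));
        System.out.println(missing(small, large)); // prints [2]
    }
}
```

Each file is read exactly once, so the work grows with the sum of the file sizes rather than their product.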
Usually, memory-mapping is the most efficient way to read large files. You'll need to use java.nio.MappedByteBuffer and java.io.RandomAccessFile.
But your search algorithm is the real problem. Building some sort of index or hash table is what you need.
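For reference, a minimal sketch of the memory-mapping part, using the classes named above. Counting newline bytes is only a placeholder for real processing, and the class name is made up for the example. A single mapping cannot exceed Integer.MAX_VALUE bytes, so larger files would have to be mapped in several regions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedLineCount {

    // Counts '\n' bytes by mapping the whole file read-only (files up to 2 GB only).
    static long countNewlines(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long count = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("ids", ".txt");
        temp.toFile().deleteOnExit();
        Files.write(temp, "1\n2\n3\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(countNewlines(temp)); // prints 3
    }
}
```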
Is there a way to prepend a line to the File in Java, without creating a temporary file, and writing the needed content to it?
No, there is no way to do that SAFELY in Java. (Or AFAIK, any other programming language.)
No filesystem implementation in any mainstream operating system supports this kind of thing, and you won't find this feature supported in any mainstream programming languages.
Real world file systems are implemented on devices that store data as fixed sized "blocks". It is not possible to implement a file system model where you can insert bytes into the middle of a file without significantly slowing down file I/O, wasting disk space or both.
The solutions that involve an in-place rewrite of the file are inherently unsafe. If your application is killed or the power dies in the middle of the prepend / rewrite process, you are likely to lose data. I would NOT recommend using that approach in practice.
Use a temporary file and renaming. It is safer.
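A minimal sketch of the temporary-file approach with java.nio.file. The class and method names are made up for the example, and UTF-8 is an assumption for the encoding of the new line. The temporary file is created next to the target so that the final move stays on the same file system.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Prepend {

    // Writes the new line plus the old content to a temp file, then swaps it in.
    static void prependLine(Path target, String line) throws IOException {
        Path temp = Files.createTempFile(target.toAbsolutePath().getParent(), "prepend", ".tmp");
        try (OutputStream out = Files.newOutputStream(temp)) {
            out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            Files.copy(target, out); // stream the old content after the new line
        }
        // The original is replaced only after the copy has finished.
        Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "second\n".getBytes(StandardCharsets.UTF_8));
        prependLine(file, "first");
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8)); // prints [first, second]
        Files.delete(file);
    }
}
```

If the copy fails, the original file is untouched and only the leftover temporary file needs cleaning up.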
There is a way, it involves rewriting the whole file though (but no temporary file). As others mentioned, no file system supports prepending content to a file. Here is some sample code that uses a RandomAccessFile to write and read content while keeping some content buffered in memory:
public static void main(final String args[]) throws Exception {
File f = File.createTempFile(Main.class.getName(), "tmp");
f.deleteOnExit();
System.out.println(f.getPath());
// put some dummy content into our file
BufferedWriter w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f)));
for (int i = 0; i < 1000; i++) {
w.write(UUID.randomUUID().toString());
w.write('\n');
}
w.flush();
w.close();
// prepend "some uuids" to our file
int bufLength = 4096;
byte[] appendBuf = "some uuids\n".getBytes();
byte[] writeBuf = appendBuf;
byte[] readBuf = new byte[bufLength];
int writeBytes = writeBuf.length;
RandomAccessFile rw = new RandomAccessFile(f, "rw");
int read = 0;
int write = 0;
while (true) {
// seek to read position and read content into read buffer
rw.seek(read);
int bytesRead = rw.read(readBuf, 0, readBuf.length);
// seek to write position and write content from write buffer
rw.seek(write);
rw.write(writeBuf, 0, writeBytes);
// no bytes read - end of file reached
if (bytesRead < 0) {
// end of
break;
}
// update seek positions for write and read
read += bytesRead;
write += writeBytes;
writeBytes = bytesRead;
// reuse buffer, create new one to replace (short) append buf
byte[] nextWrite = writeBuf == appendBuf ? new byte[bufLength] : writeBuf;
writeBuf = readBuf;
readBuf = nextWrite;
};
rw.close();
// now show the content of our file
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
}
You could store the file content in a String and prepend the desired line by using a StringBuilder-Object. You just have to put the desired line first and then append the file-content-String.
No extra temporary file needed.
No. There are no "intra-file shift" operations, only read and write of discrete sizes.
It would be possible to do so by reading a chunk of the file of equal length to what you want to prepend, writing the new content in place of it, reading the next chunk and replacing it with what you read before, and so on, rippling down to the end of the file.
However, don't do that, because if anything stops (out-of-memory, power outage, rogue thread calling System.exit) in the middle of that process, data will be lost. Use the temporary file instead.
private static void addPrependText(File fileName) {
String newFileName = fileName.getAbsolutePath() + "#";
// try-with-resources closes both streams even if an exception is thrown
try (BufferedReader br = new BufferedReader(new FileReader(fileName));
FileOutputStream fileOutputStream = new FileOutputStream(newFileName)) {
fileOutputStream.write("preappendTextDataHere".getBytes());
String sCurrentLine;
while ((sCurrentLine = br.readLine()) != null) {
fileOutputStream.write(("\n" + sCurrentLine).getBytes());
}
} catch (IOException e) {
e.printStackTrace();
return; // leave the original file untouched if the copy failed
}
// replace the original only after the copy has completed successfully
if (!new File(newFileName).renameTo(fileName)) {
System.err.println("Could not rename " + newFileName + " to " + fileName);
}
}