GC overhead limit exceeded while loading big file - java

Hi, I am writing a Java program to load a 2 GB file into memory. The data is a graph in the format:
node_number: edge_point_1 edge_point_2 ... edge_point_k
I want to import it into memory as an adjacency list, but I get a "GC overhead limit exceeded" error.
I noticed that the file itself loads into memory fine; the problem occurs while building the linked lists. Here is my code:
while ((line = reader.readLine()) != null) {
    Integer n1 = line.indexOf(":"), n2;
    Integer k = Integer.parseInt(line.substring(0, n1));
    n1 = n1 + 2;
    lists[k] = new LinkedList<Integer>();
    do {
        n2 = line.indexOf(" ", n1);
        if (n2 == -1)
            lists[k].add(Integer.parseInt(line.substring(n1, line.length())));
        else
            lists[k].add(Integer.parseInt(line.substring(n1, n2)));
        n1 = n2 + 1;
    } while (n2 != -1);
}
Does anybody have any idea what's wrong with my code? I am compiling with the latest NetBeans build.

You simply consume too much memory. Reduce your memory usage and increase your memory limit.
Reduce memory
You're using LinkedList<Integer>, which needs maybe 50 bytes per int instead of 10. As an easy step, switch to ArrayList<Integer> to save about half of that. As a harder step, use int[] arrays and resize them yourself as needed, as sketched below.
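A minimal sketch of the int[] approach (the field names and starting capacity are made up for illustration): keep one growable int[] per node and track how many slots are in use.

int[][] edges = new int[nodeCount][];   // edges[k] holds node k's neighbours
int[] edgeCount = new int[nodeCount];   // how many slots of edges[k] are filled

void addEdge(int k, int neighbour) {
    if (edges[k] == null)
        edges[k] = new int[4];                                               // small initial capacity
    else if (edgeCount[k] == edges[k].length)
        edges[k] = java.util.Arrays.copyOf(edges[k], edges[k].length * 2);   // double when full
    edges[k][edgeCount[k]++] = neighbour;
}

This stores each edge in 4 bytes plus a little growth slack, instead of a whole list node per value.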
Increase your memory limit
Start your JVM with
java -Xmx8G
when you have 8 GB of free memory.

Related

Java program slows down abruptly when indexing corpus for k-grams

I have a problem which is puzzling me. I'm indexing a corpus (17 000 files) of text files, and while doing this, I'm also storing all the k-grams (k-long parts of words) for each word in a HashMap to be used later:
public void insert( String token ) {
    // For example, car should result in "^c", "ca", "ar" and "r$" for a 2-gram index
    // Check if token has already been seen. If it has, all the
    // k-grams for it have already been added.
    if (term2id.get(token) != null) {
        return;
    }
    id2term.put(++lastTermID, token);
    term2id.put(token, lastTermID);

    // Is the word long enough? For example, "a" can be bigrammed and trigrammed but not four-grammed.
    // K must be <= token.length + 2. "ab": K must be <= 4
    List<KGramPostingsEntry> postings = null;
    if (K > token.length() + 2) {
        return;
    } else if (K == token.length() + 2) {
        // insert the one K-gram "^<String token>$" into index
        String kgram = "^" + token + "$";
        postings = index.get(kgram);
        SortedSet<String> kgrams = new TreeSet<String>();
        kgrams.add(kgram);
        term2KGrams.put(token, kgrams);
        if (postings == null) {
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
            newList.add(newEntry);
            index.put("^" + token + "$", newList);
        }
        // No need to do anything if the posting already exists, so no else clause.
        // There is only one possible term in this case.
        // Return since we are done.
        return;
    } else {
        // We get here if there is more than one k-gram in our term.
        // Insert all k-grams in token into index.
        int start = 0;
        int end = start + K;
        // add ^ and $ to token
        String wrappedToken = "^" + token + "$";
        int noOfKGrams = wrappedToken.length() - end + 1;
        // get K-grams
        String kGram;
        int startCurr, endCurr;
        SortedSet<String> kgrams = new TreeSet<String>();
        for (int i = 0; i < noOfKGrams; i++) {
            startCurr = start + i;
            endCurr = end + i;
            kGram = wrappedToken.substring(startCurr, endCurr);
            kgrams.add(kGram);
            postings = index.get(kGram);
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            // if this k-gram has been seen before
            if (postings != null) {
                // Add this token to the existing postings list.
                // We can be sure that the list doesn't contain the token
                // already, else we would previously have terminated the
                // execution of this function.
                int lastTermInPostings = postings.get(postings.size() - 1).tokenID;
                if (lastTermID == lastTermInPostings) {
                    continue;
                }
                postings.add(newEntry);
                index.put(kGram, postings);
            }
            // if this k-gram has not been seen before
            else {
                ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
                newList.add(newEntry);
                index.put(kGram, newList);
            }
        }
        Clock c = Clock.systemDefaultZone();
        long timestart = c.millis();
        System.out.println(token);
        term2KGrams.put(token, kgrams);
        long timestop = c.millis();
        System.out.printf("time taken to put: %d\n", timestop - timestart);
        System.out.print("put ");
        System.out.println(kgrams);
        System.out.println();
    }
}
The insertion into the HashMap happens on the lines term2KGrams.put(token, kgrams); (there are two of them in the code snippet). When indexing, everything works fine until, suddenly, at around 15 000 indexed files, things go bad. Everything slows down immensely, and the program doesn't finish in a reasonable time, if at all.
To try to understand this problem, I've added some prints at the end of the function. This is the output they generate:
http://soccer.org
time taken to put: 0
put [.or, //s, /so, ://, ^ht, cce, cer, er., htt, occ, org, p:/, r.o, rg$, soc, tp:, ttp]
aysos
time taken to put: 0
put [^ay, ays, os$, sos, yso]
http://www.davisayso.org/contacts.htm
time taken to put: 0
put [.da, .ht, .or, //w, /co, /ww, ://, ^ht, act, avi, ays, con, cts, dav, g/c, htm, htt, isa, nta, o.o, ont, org, p:/, rg/, s.h, say, so., tac, tm$, tp:, ts., ttp, vis, w.d, ww., www, yso]
playsoccer
time taken to put: 0
put [^pl, ays, cce, cer, er$, lay, occ, pla, soc, yso]
This looks fine to me: the puts don't seem to take a long time, and the k-grams (in this case trigrams) are correct.
But one can see strange behaviour in the pace at which my computer prints this information. In the beginning, everything prints at a very high speed. But at around 15 000 files, that speed drops, and instead my computer starts printing a few lines at a time, which of course means that indexing the remaining 2 000 files of the corpus will take an eternity.
Another interesting thing I observed happened when I did a keyboard interrupt (Ctrl+C) after it had been printing erratically and slowly for a while, as described. It gave me this message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub#Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Does this mean I'm out of memory? Is that the issue? If so, that's surprising, because I've been storing quite a lot of things in memory already, such as a HashMap containing the document IDs of every single word in the corpus, a HashMap containing every single word in which every single k-gram appears, etc.
Please let me know what you think and what I can do to fix this problem.
To understand this, you must first understand that Java does not grow its memory indefinitely. The JVM is by default configured to start with a minimum heap size and a maximum heap size. When an allocation would exceed the maximum heap size, you get an OutOfMemoryError.
You can change the minimum and maximum heap size for your execution with the VM parameters -Xms and -Xmx respectively. An example of an execution with at least 2 GB but at most 4 GB of heap would be
java -Xms2g -Xmx4g ...
You can find more options on the man page for java.
Before changing the heap memory, however, take a close look at your system resources, especially whether your system starts swapping. If your system swaps, a larger heap size may let the program run longer, but with equally bad performance. The only thing possible then would be to optimize your program in order to use less memory or to upgrade the RAM of your machine.
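If you want to check at runtime what ceiling the JVM is actually running with, here is a quick sketch using the standard Runtime API:

// Prints the configured maximum heap and the amount currently in use.
Runtime rt = Runtime.getRuntime();
System.out.printf("max heap: %d MB, used: %d MB%n",
        rt.maxMemory() / (1 << 20),
        (rt.totalMemory() - rt.freeMemory()) / (1 << 20));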

javaml java.lang.OutOfMemoryError: Java heap space

I'm using javaml to train a classifier. The instances in my data contain vectors in a format like this:
1 0:5 1:9 24:2 ......
So when I read these from a file, I'm using String.split(), then putting the values into a SparseInstance, which then gets added to the classifier.
However, I'm getting a heap space OutOfMemoryError. I've read about String.split() causing memory leaks, so I've used new String(...) to avoid that, but I'm still facing the heap space problem.
The code is as follows:
////////////////////////////////////////
BufferedReader br = new BufferedReader(new FileReader("Repository\\IMDB Data\\Train.feat"));
Dataset data = new DefaultDataset();
String TrainLine;
int j = 0;
while ((TrainLine = br.readLine()) != null && j < 20000) {
    //TrainLine.replaceAll(":", " ");
    String[] arr = TrainLine.split("\\D+");
    double[] nums = new double[arr.length];
    for (int i = 0; i < nums.length; i++) {
        nums[i] = Double.parseDouble(new String(arr[i]));
    }
    // vector has one less element than arr 85527
    String label;
    if (nums[0] == 1) {
        label = "positive";
    } else {
        label = "negative";
    }
    System.out.println(label);
    Instance instance = new SparseInstance(85527, label);
    int i;
    for (i = 1; i < arr.length; i = i + 2) {
        instance.put((int) nums[i], nums[i + 1]);
        // Strings have been converted to new strings to overcome memory leak
    }
    data.add(instance);
    j++;
}
knn = new KNearestNeighbors(5);
knn.buildClassifier(data);
svm = new LibSVM();
svm.buildClassifier(data);
////////////////////////////////////////
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.util.TreeMap.put(Unknown Source)
at java.util.TreeSet.add(Unknown Source)
at java.util.AbstractCollection.addAll(Unknown Source)
at java.util.TreeSet.addAll(Unknown Source)
at net.sf.javaml.core.SparseInstance.keySet(SparseInstance.java:144)
at net.sf.javaml.core.SparseInstance.keySet(SparseInstance.java:27)
at libsvm.LibSVM.transformDataset(LibSVM.java:80)
at libsvm.LibSVM.buildClassifier(LibSVM.java:127)
at backend.ShubhamKNN.<init>(ShubhamKNN.java:55)
I also get this error; it happens when the dataset is too big.
If you run your code with only 1000 records, I guess it runs fine. High memory consumption is a known problem with LibSVM; it frequently produces the error:
java.lang.OutOfMemoryError: Java heap space
If your computer has enough memory (mine has 8 GB), you can adjust the memory settings for the class in Eclipse:
Choose the class which calls the libsvm library in the Package Explorer view.
In the menu, go to Run -> Run Configurations... -> Arguments tab; in the VM arguments field, type -Xmx1024M. This means the class may use at most 1024 MB of memory. I set the parameter to 3072M, and my class runs fine.
Rerun the class.
That is my solution; for more detail see:
http://blog.csdn.net/felomeng/article/details/4688414
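If you run outside Eclipse, the same limit can be passed directly on the command line; the class name below is a placeholder, and the value should match what your machine can spare:
java -Xmx3072M your.package.MainClass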

How to parse a huge file line by line, serialize & deserialize a huge object efficiently?

I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers and some additional integer info and update my custom data structure. My class for holding this information looks like:
class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in the arr & meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case,
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader and its readLine method to read each line, and character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete this entire operation.
I used both Java Serialization and Externalizable (writing meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, the time to serialize is almost half an hour, and deserializing also takes almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem, and would definitely love to hear your experiences, if any.
P.S. Main memory is not a problem; I have almost 50 GB of RAM in my machine. I have also increased the BufferedReader size to 40 MB (of course, I could increase this up to 100 MB, considering that disk access runs at approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing a few details):
public class BigFileParser {

    private int parsePositiveInt(final String s) {
        int num = 0;
        int sign = -1;
        final int len = s.length();
        final char ch = s.charAt(0);
        if (ch == '-')
            sign = 1;
        else
            num = '0' - ch;
        int i = 1;
        while (i < len)
            num = num * 10 + '0' - s.charAt(i++);
        return sign * num;
    }

    private void loadBigFile() {
        long startTime = System.nanoTime();
        Holder holder = new Holder();
        String line;
        try {
            Reader fReader = new FileReader("/path/to/BIG/file");
            // 40 MB buffer size
            BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
            String tempTerm;
            int i, meta, ascii, len;
            boolean consumeNextInteger;
            // GNU Trove primitive int array list
            TIntArrayList arr;
            char c;
            while ((line = bufferedReader.readLine()) != null) {
                consumeNextInteger = true;
                tempTerm = "";
                arr = new TIntArrayList(5);
                for (i = 0, len = line.length(); i < len; i++) {
                    c = line.charAt(i);
                    ascii = c - 0;
                    // 95 is the ascii value of _ char
                    if (consumeNextInteger && ascii == 95) {
                        arr.add(parsePositiveInt(tempTerm));
                        tempTerm = "";
                    } else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
                        tempTerm += c;
                    } else if (ascii == 9) { // '\t'
                        arr.add(parsePositiveInt(tempTerm));
                        consumeNextInteger = false;
                        tempTerm = "";
                    }
                }
                meta = parsePositiveInt(tempTerm);
                holder.update(arr, meta);
            }
            bufferedReader.close();
            long endTime = System.nanoTime();
            System.out.println("#time -> " + (endTime - startTime) * 1.0
                    / 1000000000 + " seconds");
        } catch (IOException exp) {
            exp.printStackTrace();
        }
    }
}

public class Holder {
    private static final int SIZE = 500000000;
    private TIntArrayList[] arrs;
    private TIntArrayList metas;
    private int idx;

    public Holder() {
        arrs = new TIntArrayList[SIZE];
        metas = new TIntArrayList(SIZE);
        idx = 0;
    }

    public void update(TIntArrayList arr, int meta) {
        arrs[idx] = arr;
        metas.add(meta);
        idx++;
    }
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)
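To illustrate why small integers shrink so much, here is a rough sketch of the variable-length (varint) encoding that protocol buffers use for unsigned values; this shows the idea only, it is not the protobuf API:

// Writes v as a varint: 7 data bits per byte, high bit set means "more bytes follow".
// Values below 128 take one byte, values below 16384 take two, and so on.
static void writeVarint(java.io.OutputStream out, int v) throws java.io.IOException {
    while ((v & ~0x7F) != 0) {
        out.write((v & 0x7F) | 0x80);
        v >>>= 7;
    }
    out.write(v);
}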
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col into a single index. Depending on how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, surely not half an hour.
Because of the data volume this doesn't work directly, as you can't have an array with 6e9 entries. But you can use a couple of big arrays instead, and all of the above still applies (compute a long index from row and col and split it into two ints for accessing the 2D array).
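A rough sketch of that chunked, row-major layout (the chunk size, chunk count and the 5+1 columns per row are illustrative assumptions):

// Row-major layout with N ints per row (5 data columns plus the meta value),
// split across fixed-size chunks because one array can't hold ~6e9 ints.
static final int N = 6;
static final int CHUNK = 1 << 27;          // ints per chunk (~128M, illustrative)
int[][] chunks = new int[64][];            // chunks are allocated lazily in set()

int get(int row, int col) {
    long idx = (long) row * N + col;       // logical flat index
    return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
}

void set(int row, int col, int v) {
    long idx = (long) row * N + col;
    int c = (int) (idx / CHUNK);
    if (chunks[c] == null) chunks[c] = new int[CHUNK];
    chunks[c][(int) (idx % CHUNK)] = v;
}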
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. The author reads about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course, my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
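For instance, if the file were written as fixed-size little-endian binary records, it could be read almost straight into the arrays with NIO. This is only a sketch; the path, the record layout (5 data ints plus 1 meta int per line) and the method name are assumptions:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Reads 24-byte records (5 data ints + 1 meta int) straight into the given arrays.
static void slurp(String path, int[][] arr, int[] meta) throws IOException {
    try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20).order(ByteOrder.LITTLE_ENDIAN);
        int row = 0;
        while (ch.read(buf) != -1) {
            buf.flip();
            while (buf.remaining() >= 24) {
                for (int j = 0; j < 5; j++) arr[row][j] = buf.getInt();
                meta[row] = buf.getInt();
                row++;
            }
            buf.compact();                  // keep any partial record for the next read
        }
    }
}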

Calculating memory usage of a B-Tree in Java

I've implemented a simple B-Tree which maps longs to ints. Now I wanted to estimate its memory usage using the following method (applies to a 32-bit JVM only):
class BTreeEntry {
    int entrySize;
    long keys[];
    int values[];
    BTreeEntry children[];
    boolean isLeaf;
    ...

    /** @return used bytes */
    long capacity() {
        long cap = keys.length * (8 + 4) + 3 * 12 + 4 + 1;
        if (!isLeaf) {
            cap += children.length * 4;
            for (int i = 0; i < children.length; i++) {
                if (children[i] != null)
                    cap += children[i].capacity();
            }
        }
        return cap;
    }
}

/** @return memory usage in MB */
public int memoryUsage() {
    return Math.round(rootEntry.capacity() / (1 << 20));
}
But when I tried it, e.g. for 7 million entries, the memoryUsage method reports much higher values than the -Xmx setting would allow! E.g. it says 1040 (MB) and I set -Xmx300! Is the JVM somehow able to optimize the memory layout, e.g. for empty arrays, or what could be my mistake?
Update 1: OK, introducing the isLeaf boolean reduces memory usage a lot, but it is still unclear why I observed higher values than Xmx. (You can still reproduce this by using isLeaf == false in all constructors.)
Update 2: Hmm, something is very wrong. When increasing the entries per leaf, one would assume that the memory usage decreases (when compacting in both cases), because less reference overhead is involved for larger arrays (and the B-tree has a smaller height). But the memoryUsage method reports an increased value if I use 500 instead of 100 entries per leaf.
Ohh sh... a bit of fresh air solved this issue ;)
When an entry is full it will be split. In my original split method checkSplitEntry (where I wanted to avoid wasting memory), I made a big memory-wasting mistake:
// left child: just copy pointer and decrease size to index
BTreeEntry newLeftChild = this;
newLeftChild.entrySize = splitIndex;
The problem here is that the old children pointers are still reachable. So, in my memoryUsage method I'm counting some children twice (especially when I did not compact!). Without this trick all should be fine, and my B-Tree will be even more memory efficient, as the garbage collector can do its work!
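A hedged sketch of the fix implied above (the exact boundary index depends on the split convention used): after reusing this as the left child, null out the child slots that moved to the right child so they can be collected and are no longer counted twice by capacity().

// Clear references to children that now belong to the right sibling.
for (int i = splitIndex + 1; i < children.length; i++) {
    children[i] = null;
}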

Java : linear algorithm but non-linear performance drop, where does it come from?

I am currently having heavy performance issues with an application I'm developing for natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.
And for every sentence, it does EXACTLY the same. The algorithms applied to gather the statistics do not evolve with previously read data and therefore stay the same.
The issue is that the processing time does not evolve linearly at all: 1 min for 10k sentences, 1 hour for 100k and days for 1M...
I tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, nor by IO waiting, nor by anything else! Java starts to be sluggish as the data grows, and I feel totally helpless.
If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. By doing this, it takes me 3 seconds over 100k sentences and 326 seconds over 1.6M... so a multiplier of 110 instead of 16. As the data grows, it just gets worse...
Here is a code sample:
Note that I compare strings by reference (for efficiency reasons); this can be done thanks to the String.intern() method, which returns a unique reference per string. And the map is never re-hashed during the whole process for the numbers given above.
public class DataGathering
{
    SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

    private void makeCounts(String path) throws IOException
    {
        BufferedReader file_src = new BufferedReader(new FileReader(path));
        String line_src;
        int n = 0;
        while (file_src.ready())
        {
            n++;
            if (n % 10000 == 0)
                System.out.print(".");
            if (n % 100000 == 0)
                System.out.println("");
            line_src = file_src.readLine();
            String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
            for (int i = 0; i < src_tokens.length; i++)
            {
                String src = src_tokens[i].intern();
                counts.bump(src);
            }
        }
        file_src.close();
    }

    public static void main(String[] args) throws IOException
    {
        String path = "some_big_file.txt";
        long timestamp = System.currentTimeMillis();
        DataGathering dg = new DataGathering();
        dg.makeCounts(path);
        long time = (System.currentTimeMillis() - timestamp) / 1000;
        System.out.println("\nElapsed time: " + time + "s.");
    }
}

public class SimpleRefCounter<K>
{
    static final double GROW_FACTOR = 2;
    static final double LOAD_FACTOR = 0.5;
    private int capacity;
    private Object[] keys;
    private int[] counts;

    public SimpleRefCounter()
    {
        this(1000);
    }

    public SimpleRefCounter(int capacity)
    {
        this.capacity = capacity;
        keys = new Object[capacity];
        counts = new int[capacity];
    }

    public synchronized int increase(K key, int n)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
        {
            key_count++;
            keys[id] = key;
            if (key_count > LOAD_FACTOR * capacity)
            {
                resize((int) (GROW_FACTOR * capacity));
            }
        }
        counts[id] += n;
        total += n;
        return counts[id];
    }

    public synchronized void resize(int capacity)
    {
        System.out.println("Resizing counters: " + this);
        this.capacity = capacity;
        Object[] new_keys = new Object[capacity];
        int[] new_counts = new int[capacity];
        for (int i = 0; i < keys.length; i++)
        {
            Object key = keys[i];
            int count = counts[i];
            int id = System.identityHashCode(key) % capacity;
            while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
                id = (id + 1) % capacity;
            new_keys[id] = key;
            new_counts[id] = count;
        }
        this.keys = new_keys;
        this.counts = new_counts;
    }

    public int bump(K key)
    {
        return increase(key, 1);
    }

    public int get(K key)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
            return 0;
        else
            return counts[id];
    }
}
Any explanations? Ideas? Suggestions?
...and, as said in the beginning, it is not for this toy example in particular but for the more general case. This same exploding behavior occurs for no reason in the more complex and larger program.
Rather than feeling helpless, use a profiler! That would tell you exactly where in your code all this time is spent.
Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.
For String.intern you might want to do your own single-threaded implementation.
However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.
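A minimal sketch of that change (the mask keeps the index non-negative, since String.hashCode, unlike the identityHashCode values observed here, can be negative):

// In increase(), resize() and get(): hash the key's contents instead of its identity.
int id = (key.hashCode() & 0x7fffffff) % capacity;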
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder if this is also contributing to a ton of objects that need to be garbage collected?
I would create the Pattern once, probably as a static field:
final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");
And then change the split line do this:
String[] src_tokens = TOKEN_PATTERN.split(line_src);
Or, if you don't want to create it as a static field, at least create it only once as a local variable at the beginning of the method, before the while loop.
In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.
My advice: if you want a HashMap, just use a HashMap. They got it right for you.
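For comparison, on Java 8+ the whole counter collapses to a few lines with a plain HashMap (a sketch using the src_tokens array from the snippet above):

// Straightforward word counting with a standard HashMap.
java.util.Map<String, Integer> counts = new java.util.HashMap<>();
for (String src : src_tokens) {
    counts.merge(src, 1, Integer::sum);   // insert 1, or add 1 to the existing count
}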
You are filling up the Perm Gen with the string intern. Have you tried viewing the -Xloggc output?
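On JVMs of that era, GC activity (including PermGen collections) can be logged with flags like the following, where MainClass is a placeholder (Java 9+ replaces these with the unified -Xlog:gc option):
java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps MainClass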
I would guess it's just memory filling up, growing outside the processor cache, memory fragmentation and the garbage collection pauses kicking in. Have you checked memory use at all? Tried to change the heap size the JVM uses?
Try to do it in Python, and run the Python module from Java.
Enter all the keys into a database, and then execute the following query:
select key, count(*)
from keys
group by key
Have you tried to only iterate through the keys without doing any calculations? Is it faster? If yes, then go with option (2).
Can't you do this? You can get your answer in no time.
It's me, the original poster; something went wrong during registration, so I'm posting separately. I'll try the various suggestions given.
P.S. For Tom Hawtin: thanks for the hints. Perhaps String.intern() takes more and more time as the vocabulary grows; I'll check that tomorrow, along with everything else.
