I have a problem that is puzzling me. I'm indexing a corpus of 17 000 text files, and while doing this, I'm also storing all the k-grams (k-character-long parts of words) for each word in a HashMap to be used later:
public void insert( String token ) {
    //For example, car should result in "^c", "ca", "ar" and "r$" for a 2-gram index
    // Check if token has already been seen. if it has, all the
    // k-grams for it have already been added.
    if (term2id.get(token) != null) {
        return;
    }
    id2term.put(++lastTermID, token);
    term2id.put(token, lastTermID);
    // is word long enough? for example, "a" can be bigrammed and trigrammed but not four-grammed.
    // K must be <= token.length + 2. "ab". K must be <= 4
    List<KGramPostingsEntry> postings = null;
    if (K > token.length() + 2) {
        return;
    } else if (K == token.length() + 2) {
        // insert the one K-gram "^<String token>$" into index
        String kgram = "^"+token+"$";
        postings = index.get(kgram);
        SortedSet<String> kgrams = new TreeSet<String>();
        kgrams.add(kgram);
        term2KGrams.put(token, kgrams);
        if (postings == null) {
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
            newList.add(newEntry);
            index.put("^"+token+"$", newList);
        }
        // No need to do anything if the posting already exists, so no else clause. There is only one possible term in this case
        // Return since we are done
        return;
    } else {
        // We get here if there is more than one k-gram in our term
        // insert all k-grams in token into index
        int start = 0;
        int end = start+K;
        //add ^ and $ to token.
        String wrappedToken = "^"+token+"$";
        int noOfKGrams = wrappedToken.length() - end + 1;
        // get K-Grams
        String kGram;
        int startCurr, endCurr;
        SortedSet<String> kgrams = new TreeSet<String>();
        for (int i=0; i<noOfKGrams; i++) {
            startCurr = start + i;
            endCurr = end + i;
            kGram = wrappedToken.substring(startCurr, endCurr);
            kgrams.add(kGram);
            postings = index.get(kGram);
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            // if this k-gram has been seen before
            if (postings != null) {
                // Add this token to the existing postingsList.
                // We can be sure that the list doesn't contain the token
                // already, else we would previously have terminated the
                // execution of this function.
                int lastTermInPostings = postings.get(postings.size()-1).tokenID;
                if (lastTermID == lastTermInPostings) {
                    continue;
                }
                postings.add(newEntry);
                index.put(kGram, postings);
            }
            // if this k-gram has not been seen before
            else {
                ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
                newList.add(newEntry);
                index.put(kGram, newList);
            }
        }
        Clock c = Clock.systemDefaultZone();
        long timestart = c.millis();
        System.out.println(token);
        term2KGrams.put(token, kgrams);
        long timestop = c.millis();
        System.out.printf("time taken to put: %d\n", timestop-timestart);
        System.out.print("put ");
        System.out.println(kgrams);
        System.out.println();
    }
}
The insertion into the HashMap happens on the lines term2KGrams.put(token, kgrams); (there are two of them in the code snippet). When indexing, everything works fine until, at around 15 000 indexed files, things suddenly go bad. Everything slows down immensely, and the program doesn't finish in a reasonable time, if at all.
To try to understand this problem, I've added some prints at the end of the function. This is the output they generate:
http://soccer.org
time taken to put: 0
put [.or, //s, /so, ://, ^ht, cce, cer, er., htt, occ, org, p:/, r.o, rg$, soc, tp:, ttp]
aysos
time taken to put: 0
put [^ay, ays, os$, sos, yso]
http://www.davisayso.org/contacts.htm
time taken to put: 0
put [.da, .ht, .or, //w, /co, /ww, ://, ^ht, act, avi, ays, con, cts, dav, g/c, htm, htt, isa, nta, o.o, ont, org, p:/, rg/, s.h, say, so., tac, tm$, tp:, ts., ttp, vis, w.d, ww., www, yso]
playsoccer
time taken to put: 0
put [^pl, ays, cce, cer, er$, lay, occ, pla, soc, yso]
This looks fine to me: the puts don't seem to be taking a long time, and the k-grams (in this case trigrams) are correct.
But there is strange behaviour in the pace at which my computer prints this information. In the beginning, everything prints at very high speed. But at 15 000 files, that speed drops, and instead my computer starts printing a few lines at a time, which of course means that indexing the remaining 2 000 files of the corpus will take an eternity.
Another interesting thing happened when I did a keyboard interrupt (Ctrl+C) after it had been printing erratically and slowly, as described, for a while. It gave me this message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub#Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Does this mean I'm out of memory? Is that the issue? If so, that's surprising, because I've already been storing quite a lot of things in memory, such as a HashMap containing the document IDs of every single word in the corpus, a HashMap mapping every single k-gram to every word in which it appears, etc.
Please let me know what you think and what I can do to fix this problem.
To understand this, you must first understand that Java does not allocate memory dynamically (or, at least, not indefinitely). By default, the JVM is configured with a minimum heap size and a maximum heap size. When some allocation would exceed the maximum heap size, you get an OutOfMemoryError.
You can change the minimum and maximum heap size for your execution with the VM parameters -Xms and -Xmx respectively. An example for an execution with at least 2 GB, but at most 4 GB, would be
java -Xms2g -Xmx4g ...
You can find more options on the man page for java.
Before changing the heap memory, however, take a close look at your system resources, especially whether your system starts swapping. If your system swaps, a larger heap size may let the program run longer, but with equally bad performance. The only options then are to optimize your program to use less memory or to upgrade the RAM of your machine.
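If it is unclear whether the heap limit is actually being approached, here is a minimal sketch (not part of the original answer) that prints the JVM's configured and current heap figures through the standard Runtime API; calling something like this periodically while indexing shows whether used memory keeps climbing towards the -Xmx limit:

public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        // Maximum heap the JVM may grow to (the value controlled by -Xmx)
        System.out.println("max heap:   " + rt.maxMemory() / mb + " MB");
        // Heap currently reserved by the JVM
        System.out.println("total heap: " + rt.totalMemory() / mb + " MB");
        // Part of the reserved heap that is not in use right now
        System.out.println("free heap:  " + rt.freeMemory() / mb + " MB");
    }
}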
Hi, I'm reading data from an .xls sheet that contains 8500 rows of data, and I'm trying to store it in a double[][], but I'm getting an error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Code
public static double[][] getData_DoubleArray(String path, int sheetNo, int rowIndex1, int rowIndex2) {
    double[][] doubleArray = null;
    try {
        HSSFSheet sheet = PS_ExcelReader.getWorkSheet(path, sheetNo);
        System.out.println("sheet" + sheet);
        List<Object> data1 = PS_ExcelReader.getFullColumnByIndex(sheet, rowIndex1);
        List<Object> data2 = PS_ExcelReader.getFullColumnByIndex(sheet, rowIndex2);
        doubleArray = new double[data1.size()][data2.size()];
        for (int i = 0; i < data1.size(); i++) {
            for (int j = 0; j < data2.size(); j++) {
                doubleArray[i][0] = (Double) data1.get(i);
                doubleArray[i][1] = (Double) data2.get(j);
            }
        }
        System.out.println("array " + Arrays.deepToString(doubleArray));
    }
    catch (IOException ioe) {
        log.error("data mis match");
    }
    return doubleArray;
}
Currently this line:
doubleArray = new double[data1.size()][data2.size()];
is creating 8500 × 8500 doubles, which is over 500 MB.
You are basically allocating enough space for 8500 rows and 8500 columns.
But seeing that you are only using 2 of these columns in your algorithm:
doubleArray[i][0] = (Double)data1.get(i);
doubleArray[i][1] = (Double)data2.get(j);
I doubt that you really want to create that many columns.
Given the rest of your algorithm, this allocation should be enough for your needs:
doubleArray = new double[data1.size()][2];
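Building on that, a small sketch of how the fill loop could then look; it assumes (the question does not say so explicitly) that data1 and data2 have the same length and that row i should pair data1.get(i) with data2.get(i):

// Sketch only: two columns, one row per element pair.
doubleArray = new double[data1.size()][2];
for (int i = 0; i < data1.size(); i++) {
    doubleArray[i][0] = (Double) data1.get(i);
    doubleArray[i][1] = (Double) data2.get(i);
}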
In your Java file you have to use a double[] array. I think you are not declaring the proper size for the array.
You are short of heap space. The default heap space is not enough for your program. You need to increase your heap space using the -Xms or -Xmx flag.
-Xms flag to specify minimum heap space.
-Xmx flag to specify maximum heap space.
For example:
java -Xms512m -Xmx1024m YourMainClassName
will set minimum heap space to 512 MB and maximum heap space to 1024 MB.
More reference here: http://www.mkyong.com/java/find-out-your-java-heap-memory-size/
I am writing list objects into a CSV file using a StringBuffer object. When the list contains little data, our logic works perfectly, but when there is a large amount of data in the list, there's a problem and I get the error java.lang.OutOfMemoryError: Java heap space.
Code snippet as follows:
StringBuffer report = new StringBuffer();
String[] column = null;
StringReader stream = null;
for (MassDetailReportDto dto : newList.values()) {
    int i = 0;
    column = new String[REPORT_INDEX];
    column[i++] = dto.getCommodityCode() == null ? " " : dto.getCommodityCode();
    column[i++] = dto.getOaId() == null ? " " : dto.getOaId();
    //like this we are calling some other getter methods
    //After all getter methods we are appending columns to stringBuffer object
    report.append(StringUtils.join(column, PIPE));
    report.append(NEW_LINE);
    //now stringbuffer object we are writing to file
    stream = new StringReader(report.toString());
    int count;
    char buffer[] = new char[4096];
    while ((count = stream.read(buffer)) > -1) {
        //writing into file
        writer.write(buffer, 0, count);
    }
    writer.flush();
    //clearing the buffer
    report.delete(0, report.length());
}
Error is:
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
at java.lang.StringBuilder.append(StringBuilder.java:120)
Could you please look into the above code snippet and help me? It would be a great help.
Where does column get initialized? I don't see it, but it seems that's a likely culprit. You are building a string array (column[i++]) without clearing it out. Where do you clear out that array? It should be scoped to the loop body, not outside of it. So declare your String[] column inside the loop and use it within that scope.
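For illustration, a minimal sketch of the scoping this answer suggests, reusing the names from the question's snippet (the rest of the loop body stays unchanged):

for (MassDetailReportDto dto : newList.values()) {
    // Declared inside the loop, so the array does not outlive the iteration.
    String[] column = new String[REPORT_INDEX];
    int i = 0;
    column[i++] = dto.getCommodityCode() == null ? " " : dto.getCommodityCode();
    column[i++] = dto.getOaId() == null ? " " : dto.getOaId();
    report.append(StringUtils.join(column, PIPE));
    report.append(NEW_LINE);
    // ... rest of the loop body unchanged ...
}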
It seems logical to get an out-of-memory error when the list is big enough. Increasing the JVM heap size (using the -Xmx and -Xms JVM args) would resolve the issue temporarily. However, ideally you should use paged access to the source of the items in the list. If the list is populated from a database or a web service, it can easily be accessed in a paged way.
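As a rough sketch of what paged access could look like: fetchPage(offset, limit) is a hypothetical method standing in for whatever paged query the real data source offers, and the page size is arbitrary:

int pageSize = 1000;
int offset = 0;
List<MassDetailReportDto> page;
// Hypothetical paging loop: only one page of rows is ever held in memory.
while (!(page = fetchPage(offset, pageSize)).isEmpty()) {
    for (MassDetailReportDto dto : page) {
        String[] column = new String[REPORT_INDEX];
        int i = 0;
        column[i++] = dto.getCommodityCode() == null ? " " : dto.getCommodityCode();
        column[i++] = dto.getOaId() == null ? " " : dto.getOaId();
        writer.write(StringUtils.join(column, PIPE));
        writer.write(NEW_LINE);
    }
    writer.flush();
    offset += pageSize;
}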
Hi, I am writing a Java program to load a 2 GB file into memory. The data is a graph in the format:
node_number: edge_point_1 edge_point_2 ... edge_point_k
and I want to import it into memory as an adjacency list, but I get the "GC overhead limit exceeded" error.
I noticed that the file is loaded into memory fine, but the problem arises while building the linked lists. Here is my code:
while ((line = reader.readLine()) != null) {
    Integer n1 = line.indexOf(":"), n2;
    Integer k = Integer.parseInt(line.substring(0, n1));
    n1 = n1 + 2;
    lists[k] = new LinkedList<Integer>();
    do {
        n2 = line.indexOf(" ", n1);
        if (n2 == -1)
            lists[k].add(Integer.parseInt(line.substring(n1, line.length())));
        else
            lists[k].add(Integer.parseInt(line.substring(n1, n2)));
        n1 = n2 + 1;
    } while (n2 != -1);
}
Does anybody have any idea what's wrong with my code? I am compiling with the latest NetBeans build.
You simply consume too much memory. Reduce it, and increase your memory limit.
Reduce memory
You're using LinkedList<Integer>, which requires maybe 50 bytes per int instead of 10. As the easy step, switch to ArrayList<Integer> to save about half of that. As the harder step, use int[] arrays and resize them yourself as needed.
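A minimal sketch of the harder step (one possible shape, not code from the question): a tiny growable int array that doubles its backing array when it fills up, avoiding both boxing and per-element node objects:

class IntList {
    private int[] data = new int[4];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Double the backing array when it is full.
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }
    int size() { return size; }
}

// In the parsing loop, lists[k] = new IntList(); and lists[k].add(...)
// would replace the LinkedList<Integer> calls.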
Increase your memory limit
Start your JVM with
java -Xmx8G
when you have 8 GB of free memory.
I am using Java to read data from a file, copy the data to smaller arrays, and put these arrays in HashMaps. I noticed that the HashMap consumes more memory (about double) than what is in the original file! Any idea why?
Here is my code:
public static void main(final String[] args) throws IOException {
    final PrintWriter writer = new PrintWriter(new FileWriter("test.txt", true));
    for (int i = 0; i < 1000000; i++)
        writer.println("This is just a dummy text!");
    writer.close();

    final BufferedReader reader = new BufferedReader(new FileReader("test.txt"));
    final HashMap<Integer, String> testMap = new HashMap<Integer, String>();
    String line = reader.readLine();
    int k = 0;
    while (line != null) {
        testMap.put(k, line);
        k++;
        line = reader.readLine();
    }
}
This is not a problem of HashMap, it's a problem of Java objects in general. Each object has a certain memory overhead, including the arrays and the entries in your HashMap.
But more importantly: Character data consumes double the space in memory. The reason for this is that Java uses 16 bits for each character, whereas the file is probably encoded in ASCII or UTF-8, which only uses 7 or 8 bits per character.
Update: There is not much you can do about this. The code you posted is fine in principle. It just doesn't work with huge files. You might be able to do a little better if you tune your HashMap carefully, or you might use a byte array instead of a String to store your characters (assuming everything is ASCII or one-byte UTF-8).
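To illustrate the byte-array idea, a sketch only: it assumes the lines really are ASCII or one-byte UTF-8, and note that newer JVMs with compact strings already store Latin-1 strings with one byte per character, so the saving mainly applies to older versions:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;

public class ByteArrayMapSketch {
    public static void main(String[] args) {
        // Store each line as byte[] (1 byte per ASCII character) instead of a
        // String, which uses 2 bytes per char on JVMs without compact strings.
        HashMap<Integer, byte[]> map = new HashMap<Integer, byte[]>();
        map.put(0, "This is just a dummy text!".getBytes(StandardCharsets.UTF_8));

        // Decode back to a String only when a line is actually needed.
        String line = new String(map.get(0), StandardCharsets.UTF_8);
        System.out.println(line);
    }
}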
But in the end, to solve your out-of-memory problems, the right way to go is to rethink your program so that you don't have to read the whole file into memory at once.
Whatever it is you're doing with the content of that file, think about whether you can do it while reading the file from disk (this is called streaming), or maybe extract the relevant parts and only store those. You could also try random access on the file.
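A minimal streaming sketch, assuming the work can be done one line at a time; processLine is a hypothetical placeholder for whatever you actually need to compute per line:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("test.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line); // handle the line immediately; nothing is stored
            }
        }
    }

    // Hypothetical placeholder: e.g. count matches or accumulate statistics.
    private static void processLine(String line) {
        // ...
    }
}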
I suggest you read up on those things a bit, try something, and come back and ask a new question specific to your application, because this thread is getting too long.
A map is an "extendable" structure - when it reaches its capacity, it gets resized. So it is possible that, say, 40% of the space used by your map is actually empty. If you know how many entries will be in your map, you can use the ad hoc constructors to size your map in an optimal way:
Map<xx,yy> map = new HashMap<> (length, 1);
Even if you do that, the map will still use more space than the actual size of the contained items.
In more detail: a HashMap's capacity gets doubled when its size reaches (capacity * loadFactor). The default load factor for a HashMap is 0.75.
Example:
Imagine your map has a capacity of 10,000 entries.
You then put 7,501 entries in the map. Capacity * loadFactor = 10,000 * 0.75 = 7,500.
So your HashMap has reached its resize threshold and gets resized to (capacity * 2) = 20,000, although you are only holding 7,501 entries. That wastes a lot of space.
EDIT
This simple code gives you an idea of what happens in practice - the output is:
threshold of empty map = 8192
size of empty map = 35792
threshold of filled map = 8192
size of filled map = 1181712
threshold with one more entry = 16384
size with one more entry = 66640
which shows that if the last item you add happens to force the map to resize, it can artificially increase the size of your map. Admittedly, that does not account for the whole effect that you are observing.
public static void main(String[] args) throws java.lang.Exception {
    Field f = HashMap.class.getDeclaredField("threshold");
    f.setAccessible(true);

    long mem = Runtime.getRuntime().freeMemory();
    Map<String, String> map = new HashMap<>(2 << 12, 1); // 8,192
    System.out.println("threshold of empty map = " + f.get(map));
    System.out.println("size of empty map = " + (mem - Runtime.getRuntime().freeMemory()));

    mem = Runtime.getRuntime().freeMemory();
    for (int i = 0; i < 8192; i++) {
        map.put(String.valueOf(i), String.valueOf(i));
    }
    System.out.println("threshold of filled map = " + f.get(map));
    System.out.println("size of filled map = " + (mem - Runtime.getRuntime().freeMemory()));

    mem = Runtime.getRuntime().freeMemory();
    map.put("a", "a");
    System.out.println("threshold with one more entry = " + f.get(map));
    System.out.println("size with one more entry = " + (mem - Runtime.getRuntime().freeMemory()));
}
There are lots of things internal to the implementation of HashMap (and arrays) that need to be stored. Array lengths would be one such example. Not sure if this would account for double, but it could certainly account for some.