Hash Table Memory Usage in Java - java

I am using Java to read data from a file, copy the data to smaller arrays and put these arrays in hash tables. I noticed that the HashMap consumes more memory (about double) than the size of the original file! Any idea why?
Here is my code:
public static void main(final String[] args) throws IOException {
final PrintWriter writer = new PrintWriter(new FileWriter("test.txt",
true));
for(int i = 0; i < 1000000; i++)
writer.println("This is just a dummy text!");
writer.close();
final BufferedReader reader = new BufferedReader(new FileReader(
"test.txt"));
final HashMap<Integer, String> testMap = new HashMap<Integer, String>();
String line = reader.readLine();
int k = 0;
while(line != null) {
testMap.put(k, line);
k++;
line = reader.readLine();
}
}

This is not a problem of HashMap, it's a property of Java objects in general. Each object has a certain memory overhead, including the arrays and the entries in your HashMap.
But more importantly: Character data can consume double the space in memory. The reason for this is that a Java char is 16 bits wide, whereas the file is probably encoded in ASCII or UTF-8, which only uses 7 or 8 bits per character. (Since Java 9 the JVM can store Latin-1-only strings with one byte per character, so this factor applies mainly to older JVMs or to strings containing other characters.)
Update: There is not much you can do about this. The code you posted is fine in principle. It just doesn't work with huge files. You might be able to do a little better if you tune your HashMap carefully, or you might use a byte array instead of a String to store your characters (assuming everything is ASCII or one-byte UTF-8).
But in the end, to solve your out-of-memory problems, the right way to go is to rethink your program so that you don't have to read the whole file into memory at once.
Whatever it is you're doing with the content of that file, think about whether you can do it while reading the file from disk (this is called streaming) or maybe extract the relevant parts and only store those. You could also try to random access the file.
I suggest you read up on those things a bit, try something, and come back and ask a new question specific to your application, because this thread is getting too long.
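A minimal sketch of the streaming idea mentioned above, assuming the program only needs aggregates (here a line count and a character count) rather than the lines themselves. The class and method names are hypothetical, and a StringReader stands in for the real file so the sketch runs anywhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StreamingLineStats {
    /** Reads lines one at a time and keeps only running totals, never the lines themselves. */
    static long[] countLinesAndChars(Reader source) {
        long lines = 0;
        long chars = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines++;
                chars += line.length();
                // 'line' is unreachable after this iteration, so heap use stays flat.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {lines, chars};
    }

    public static void main(String[] args) {
        // Assumption: in the real program this would be new FileReader("test.txt").
        long[] stats = countLinesAndChars(
                new StringReader("This is just a dummy text!\nThis is just a dummy text!\n"));
        System.out.println(stats[0] + " lines, " + stats[1] + " characters");
    }
}
```

Because nothing is retained between iterations, the memory footprint does not grow with the file size, unlike the HashMap version in the question.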

A map is an "extendable" structure - when it reaches its capacity it gets resized. So it is possible that say 40% of the space used by your map is actually empty. If you know how many entries will be in your map, you can use the ad hoc constructors to size your map in an optimal way:
Map<xx,yy> map = new HashMap<> (length, 1);
Even if you do that, the map will still use more space than the actual size of the contained items.
In more detail: HashMap's capacity gets doubled when the number of entries exceeds (capacity * loadFactor). The default load factor for a HashMap is 0.75.
Example:
Imagine your map has a capacity (size) of 10,000 entries
You then put 7,501 entries in the map. Capacity * loadFactor = 10,000 * 0.75 = 7,500
So your hashmap has reached its resize threshold and gets resized to (capacity * 2) = 20,000, although you are only holding 7,501 entries. That wastes a lot of space.
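The arithmetic above can be turned around to pick an initial capacity that avoids the resize. This is only a sketch: the helper name initialCapacityFor is hypothetical, and note that HashMap rounds the requested capacity up to a power of two internally, so the real table is usually somewhat larger than what you ask for.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMap {
    /** Smallest initial capacity that lets 'expectedEntries' fit without crossing the resize threshold. */
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 7_501;
        int capacity = initialCapacityFor(expected, 0.75f);
        // With this capacity the 7,501 entries from the example fit without a resize.
        Map<Integer, String> map = new HashMap<>(capacity, 0.75f);
        for (int i = 0; i < expected; i++) {
            map.put(i, "value");
        }
        System.out.println("requested capacity = " + capacity + ", entries = " + map.size());
    }
}
```

Presizing only removes the cost and the temporary waste of resizing; the per-entry object overhead stays the same.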
EDIT
This simple code gives you an idea of what happens in practice - the output is:
threshold of empty map = 8192
size of empty map = 35792
threshold of filled map = 8192
size of filled map = 1181712
threshold with one more entry = 16384
size with one more entry = 66640
which shows that if the last item you add happens to force the map to resize, it can artificially increase the size of your map. Admittedly, that does not account for the whole effect that you are observing.
public static void main(String[] args) throws java.lang.Exception {
Field f = HashMap.class.getDeclaredField("threshold");
f.setAccessible(true);
long mem = Runtime.getRuntime().freeMemory();
Map<String, String> map = new HashMap<>(2 << 12, 1); // 8,192
System.out.println("threshold of empty map = " + f.get(map));
System.out.println("size of empty map = " + (mem - Runtime.getRuntime().freeMemory()));
mem = Runtime.getRuntime().freeMemory();
for (int i = 0; i < 8192; i++) {
map.put(String.valueOf(i), String.valueOf(i));
}
System.out.println("threshold of filled map = " + f.get(map));
System.out.println("size of filled map = " + (mem - Runtime.getRuntime().freeMemory()));
mem = Runtime.getRuntime().freeMemory();
map.put("a", "a");
System.out.println("threshold with one more entry = " + f.get(map));
System.out.println("size with one more entry = " + (mem - Runtime.getRuntime().freeMemory()));
}

There are lots of things internal to the implementation of HashMap (and arrays) that need to be stored. Array lengths would be one such example. Not sure if this would account for double, but it could certainly account for some.

Related

Java: Most efficient way to loop through CSV and sum values of one column for each unique value in another Column

I have a CSV file with 500,000 rows of data and 22 columns. This data represents all commercial flights in the USA for one year. I am being tasked with finding the tail number of the plane that flew the most miles in the data set. Column 5 contains the airplane's tail number for each flight. Column 22 contains the total distance traveled.
Please see my extractQ3 method below. First, I created a HashMap for the whole CSV using the createHashMap() method. Then, I ran a for loop to identify every unique tail number in the dataset and stored them in a list called tailNumbers. Then for each unique tail number, I looped through the entire HashMap to calculate the total miles of distance for that tail number.
The code runs fine on smaller datasets, but once the size increased to 500,000 rows the code became horribly inefficient and takes an eternity to run. Can anyone provide me with a faster way to do this?
public class FlightData {
HashMap<String,String[]> dataMap;
public static void main(String[] args) {
FlightData map1 = new FlightData();
map1.dataMap = map1.createHashMap();
String answer = map1.extractQ3(map1);
}
public String extractQ3(FlightData map1) {
ArrayList<String> tailNumbers = new ArrayList<String>();
ArrayList<Integer> tailMiles = new ArrayList<Integer>();
//Filling the Array with all tail numbers
for (String[] value : map1.dataMap.values()) {
if(Arrays.asList(tailNumbers).contains(value[4])) {
} else {
tailNumbers.add(value[4]);
}
}
for (int i = 0; i < tailNumbers.size(); i++) {
String tempName = tailNumbers.get(i);
int miles = 0;
for (String[] value : map1.dataMap.values()) {
if(value[4].contentEquals(tempName) && value[19].contentEquals("0")) {
miles = miles + Integer.parseInt(value[21]);
}
}
tailMiles.add(miles);
}
Integer maxVal = Collections.max(tailMiles);
Integer maxIdx = tailMiles.indexOf(maxVal);
String maxPlane = tailNumbers.get(maxIdx);
return maxPlane;
}
public HashMap<String,String[]> createHashMap() {
File flightFile = new File("flights_small.csv");
HashMap<String,String[]> flightsMap = new HashMap<String,String[]>();
try {
Scanner s = new Scanner(flightFile);
while (s.hasNextLine()) {
String info = s.nextLine();
String [] piecesOfInfo = info.split(",");
String flightKey = piecesOfInfo[4] + "_" + piecesOfInfo[2] + "_" + piecesOfInfo[11]; //Setting the Key
String[] values = Arrays.copyOfRange(piecesOfInfo, 0, piecesOfInfo.length);
flightsMap.put(flightKey, values);
}
s.close();
}
catch (FileNotFoundException e)
{
System.out.println("Cannot open: " + flightFile);
}
return flightsMap;
}
}
The answer depends on what you mean by "most efficient", "horribly inefficient" and "takes an eternity". These are subjective terms. The answer may also depend on specific technical factors (speed vs. memory consumption; the number of unique flight keys compared to the number of overall records; etc.).
I would recommend applying some basic streamlining to your code, to start with. See if that gets you a better (acceptable) result. If you need more, then you can consider more advanced improvements.
Whatever you do, take some timings to understand the broad impacts of any changes you make.
Focus on going from "horrible" to "acceptable" - and then worry about more advanced tuning after that (if you still need it).
Consider using a BufferedReader instead of a Scanner. See here. Although the scanner may be just fine for your needs (i.e. if it's not a bottleneck).
Consider using logic within your scanner loop to capture tail numbers and accumulated mileage in one pass of the data. The following is deliberately basic, for clarity and simplicity:
// The string is a tail number.
// The integer holds the accumulated miles flown for that tail number:
Map<String, Integer> planeMileages = new HashMap<>();
if (planeMileages.containsKey(tailNumber)) {
// add miles to existing total:
int accumulatedMileage = planeMileages.get(tailNumber) + flightMileage;
planeMileages.put(tailNumber, accumulatedMileage);
} else {
// capture new tail number:
planeMileages.put(tailNumber, flightMileage);
}
After that, once you have completed the scanner loop, you can iterate over your planeMileages to find the largest mileage:
String maxMilesTailNumber = null;
int maxMiles = 0;
for (Map.Entry<String, Integer> entry : planeMileages.entrySet()) {
int planeMiles = entry.getValue();
if (planeMiles > maxMiles) {
maxMilesTailNumber = entry.getKey();
maxMiles = planeMiles;
}
}
WARNING - This approach is just for illustration. It will only capture one tail number. There could be multiple planes with the same maximum mileage. You would have to adjust your logic to capture multiple "winners".
The above approach removes the need for several of your existing data structures, and related processing.
If you still face problems, put in some timers to see which specific areas of your code are slowest - and then you will have more specific tuning opportunities you can focus on.
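Putting the two fragments above together, a one-pass version might look like the following sketch. It assumes the column positions from the question (index 4 for the tail number, index 21 for the distance); the class name, the row helper and the sample tail numbers are hypothetical, and it uses Map.merge in place of the containsKey/put pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MileageByTail {
    /** Sums the distance column (index 21) per tail number (index 4) in one pass over the rows. */
    static Map<String, Integer> totalMiles(List<String[]> rows) {
        Map<String, Integer> planeMileages = new HashMap<>();
        for (String[] row : rows) {
            // merge() inserts the miles for a new tail number or adds them to the existing total.
            planeMileages.merge(row[4], Integer.parseInt(row[21]), Integer::sum);
        }
        return planeMileages;
    }

    /** Returns one tail number with the largest total, or null for an empty map; ties are not reported. */
    static String tailWithMostMiles(Map<String, Integer> totals) {
        String best = null;
        int bestMiles = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            if (entry.getValue() > bestMiles) {
                best = entry.getKey();
                bestMiles = entry.getValue();
            }
        }
        return best;
    }

    /** Hypothetical helper that builds a 22-column row for the demo. */
    static String[] row(String tail, int miles) {
        String[] columns = new String[22];
        columns[4] = tail;
        columns[21] = String.valueOf(miles);
        return columns;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(row("N100AA", 500));
        rows.add(row("N200BB", 900));
        rows.add(row("N100AA", 700));
        System.out.println(tailWithMostMiles(totalMiles(rows)));
    }
}
```

In the real program the merge call would sit inside the file-reading loop, so the full CSV never has to be held in a map at all.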
I suggest you use the Java 8 Stream API, so that you can take advantage of parallel streams.
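As a rough sketch of what the Stream API suggestion could look like, assuming the rows are already split into String arrays with the tail number at index 4 and the distance at index 21 as in the question. The class name and the row helper are hypothetical. Whether parallelStream() actually helps depends on the data size and on how much of the time goes into reading the file, so it is worth timing both variants.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class MileageStreams {
    /** Groups rows by tail number, sums the miles per group, and picks one group with the largest sum. */
    static Optional<String> tailWithMostMiles(List<String[]> rows) {
        Map<String, Integer> totals = rows.parallelStream()
                .collect(Collectors.groupingBy(
                        (String[] row) -> row[4],
                        Collectors.summingInt((String[] row) -> Integer.parseInt(row[21]))));
        return totals.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    /** Hypothetical helper that builds a 22-column row for the demo. */
    static String[] row(String tail, int miles) {
        String[] columns = new String[22];
        columns[4] = tail;
        columns[21] = String.valueOf(miles);
        return columns;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(row("N100AA", 500), row("N200BB", 900), row("N100AA", 700));
        System.out.println(tailWithMostMiles(rows).orElse("no flights"));
    }
}
```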

Java program slows down abruptly when indexing corpus for k-grams

I have a problem which is puzzling me. I'm indexing a corpus (17 000 files) of text files, and while doing this, I'm also storing all the k-grams (k-long parts of words) for each word in a HashMap to be used later:
public void insert( String token ) {
//For example, car should result in "^c", "ca", "ar" and "r$" for a 2-gram index
// Check if token has already been seen. if it has, all the
// k-grams for it have already been added.
if (term2id.get(token) != null) {
return;
}
id2term.put(++lastTermID, token);
term2id.put(token, lastTermID);
// is word long enough? for example, "a" can be bigrammed and trigrammed but not four-grammed.
// K must be <= token.length + 2. "ab". K must be <= 4
List<KGramPostingsEntry> postings = null;
if(K > token.length() + 2) {
return;
}else if(K == token.length() + 2) {
// insert the one K-gram "^<String token>$" into index
String kgram = "^"+token+"$";
postings = index.get(kgram);
SortedSet<String> kgrams = new TreeSet<String>();
kgrams.add(kgram);
term2KGrams.put(token, kgrams);
if (postings == null) {
KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
newList.add(newEntry);
index.put("^"+token+"$", newList);
}
// No need to do anything if the posting already exists, so no else clause. There is only one possible term in this case
// Return since we are done
return;
}else {
// We get here if there is more than one k-gram in our term
// insert all k-grams in token into index
int start = 0;
int end = start+K;
//add ^ and $ to token.
String wrappedToken = "^"+token+"$";
int noOfKGrams = wrappedToken.length() - end + 1;
// get K-Grams
String kGram;
int startCurr, endCurr;
SortedSet<String> kgrams = new TreeSet<String>();
for (int i=0; i<noOfKGrams; i++) {
startCurr = start + i;
endCurr = end + i;
kGram = wrappedToken.substring(startCurr, endCurr);
kgrams.add(kGram);
postings = index.get(kGram);
KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
// if this k-gram has been seen before
if (postings != null) {
// Add this token to the existing postingsList.
// We can be sure that the list doesn't contain the token
// already, else we would previously have terminated the
// execution of this function.
int lastTermInPostings = postings.get(postings.size()-1).tokenID;
if (lastTermID == lastTermInPostings) {
continue;
}
postings.add(newEntry);
index.put(kGram, postings);
}
// if this k-gram has not been seen before
else {
ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
newList.add(newEntry);
index.put(kGram, newList);
}
}
Clock c = Clock.systemDefaultZone();
long timestart = c.millis();
System.out.println(token);
term2KGrams.put(token, kgrams);
long timestop = c.millis();
System.out.printf("time taken to put: %d\n", timestop-timestart);
System.out.print("put ");
System.out.println(kgrams);
System.out.println();
}
}
The insertion into the HashMap happens on the rows term2KGrams.put(token, kgrams); (There are 2 of them in the code snippet). When indexing, everything works fine until things suddenly, at 15 000 indexed files, go bad. Everything slows down immensely, and the program doesn't finish in a reasonable time, if at all.
To try to understand this problem, I've added some prints at the end of the function. This is the output they generate:
http://soccer.org
time taken to put: 0
put [.or, //s, /so, ://, ^ht, cce, cer, er., htt, occ, org, p:/, r.o, rg$, soc, tp:, ttp]
aysos
time taken to put: 0
put [^ay, ays, os$, sos, yso]
http://www.davisayso.org/contacts.htm
time taken to put: 0
put [.da, .ht, .or, //w, /co, /ww, ://, ^ht, act, avi, ays, con, cts, dav, g/c, htm, htt, isa, nta, o.o, ont, org, p:/, rg/, s.h, say, so., tac, tm$, tp:, ts., ttp, vis, w.d, ww., www, yso]
playsoccer
time taken to put: 0
put [^pl, ays, cce, cer, er$, lay, occ, pla, soc, yso]
This looks fine to me, the putting doesn't seem to be taking long time and the k-grams (in this case trigrams) are correct.
But one can see strange behaviour in the pace at which my computer is printing this information. In the beginning, everything is printing at a super high speed. But at 15 000 files, that speed drops, and instead my computer starts printing a few lines at a time, which of course means that indexing the other 2000 files of the corpus will take an eternity.
Another interesting thing I observed was when doing a keyboard interrupt (ctrl+c) after it had been printing erratically and slowly as described for a while. It gave me this message:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub#Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Does this mean I'm out of memory? Is that the issue? If so, that's surprising, because I've been storing quite a lot of things in memory before, such as a HashMap containing the document ID's of every single word in the corpus, a HashMap containing every single word where every single k-gram appears, etc.
Please let me know what you think and what I can do to fix this problem.
To understand this, you must first understand that Java does not allocate memory dynamically (or, at least, not indefinitely). The JVM is by default configured to start with a minimum heap size and a maximum heap size. When an allocation would exceed the maximum heap size, you get an OutOfMemoryError.
You can change the minimum and maximum heap size for your execution with the JVM options -Xms and -Xmx respectively. An example for an execution with at least 2 GB, but at most 4 GB, would be
java -Xms2g -Xmx4g ...
You can find more options on the man page for java.
Before changing the heap memory, however, take a close look at your system resources, especially whether your system starts swapping. If your system swaps, a larger heap size may let the program run longer, but with equally bad performance. The only thing possible then would be to optimize your program in order to use less memory or to upgrade the RAM of your machine.
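To check which limits the JVM is actually running with, a small sketch like the following can be run with and without the -Xms/-Xmx options. The class name HeapInfo and the toMiB helper are made up for illustration; the Runtime methods are standard.

```java
class HeapInfo {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the ceiling the heap may grow to (controlled by -Xmx).
        System.out.println("max heap:       " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is what the JVM has reserved from the OS so far.
        System.out.println("reserved heap:  " + toMiB(rt.totalMemory()) + " MiB");
        // Reserved minus free is roughly what live and not-yet-collected objects occupy.
        System.out.println("used heap:      " + toMiB(rt.totalMemory() - rt.freeMemory()) + " MiB");
    }
}
```

Printing the used figure periodically during indexing would show whether the slowdown coincides with the heap approaching its maximum, which is when the garbage collector starts running almost continuously.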

Creating dense matrix using org.javatuples.Pair and HashMap is too slow

I have a dense symmetric matrix of size about 30000 X 30000 that contains distances between strings. Since the distance is symmetric, the upper triangle of the matrix is stored in a tab-separated 3-column file of the form
stringA<tab>stringB<tab>distance
I am using HashMap and org.javatuples.Pair to create a map to quickly look up distances for given pairs of string as follows:
import org.javatuples.Pair;
HashMap<Pair<String,String>,Double> pairScores = new HashMap<Pair<String,String>,Double>();
BufferedReader bufferedReader = new BufferedReader(new FileReader("data.txt"));
String line = null;
while((line = bufferedReader.readLine()) != null) {
String [] parts = line.split("\t");
String d1 = parts[0];
String d2 = parts[1];
Double score = Double.parseDouble(parts[2]);
Pair<String,String> p12 = new Pair<String,String>(d1,d2);
Pair<String,String> p21 = new Pair<String,String>(d2,d1);
pairScores.put(p12, score);
pairScores.put(p21, score);
}
data.txt is very big (~400M lines) and the process eventually slows down to a crawl with most time being spent in java.util.HashMap.put.
I don't think there should be (m)any hash code collisions on pairs but I might be wrong. How can I verify this? Is it enough to simply look at how unique p12.hashCode() and p21.hashCode() are?
If there are no collisions, what else could be causing it to slow down?
Is there a better way to construct this matrix for quick lookup?
I am now using Guava's Table<Integer, Integer, Double> after also realizing that my strings are unique enough that I could use their hashes, instead of the strings themselves, as keys to reduce memory requirements. The creation of the table runs in reasonable time. However, there are issues with serializing and deserializing the resulting objects: I ran into out-of-memory errors even with the move from String to Integer. It seems to be working after I decided not to store both a-b and b-a pairs, but I might be balancing on the edge of what my machine can handle.
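One way to store each symmetric pair only once is to normalise the key so that (a, b) and (b, a) map to the same entry. The sketch below is not the Guava Table approach described above; it is a plain HashMap illustration with hypothetical names, and it assumes every string has already been assigned a unique non-negative int id (using raw hash codes as ids risks silent collisions).

```java
import java.util.HashMap;
import java.util.Map;

class SymmetricDistances {
    private final Map<Long, Double> scores = new HashMap<>();

    /** Packs two non-negative int ids into one long, smaller id first, so (a,b) and (b,a) share a key. */
    static long key(int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xFFFFFFFFL);
    }

    void put(int a, int b, double distance) {
        scores.put(key(a, b), distance);
    }

    /** Returns the stored distance, or null if the pair is unknown. */
    Double get(int a, int b) {
        return scores.get(key(a, b));
    }

    int size() {
        return scores.size();
    }

    public static void main(String[] args) {
        SymmetricDistances distances = new SymmetricDistances();
        distances.put(7, 3, 0.25);
        System.out.println(distances.get(3, 7)); // same entry as (7, 3)
    }
}
```

This halves the number of entries, but each entry still costs a boxed Long, a boxed Double and a map node, so for a dense 30000 x 30000 matrix a primitive array indexed by the two ids would be far more compact.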

java.lang.OutOfMemoryError: Java heap space?

I am writing list objects into a CSV file using a StringBuffer object. When the list contains little data our logic works perfectly, but when the list contains a large amount of data I get the error: java.lang.OutOfMemoryError: Java heap space
Code snippet as follows :
StringBuffer report = new StringBuffer();
String[] column = null;
StringReader stream = null;
for (MassDetailReportDto dto: newList.values()) {
int i = 0;
column = new String[REPORT_INDEX];
column[i++] = dto.getCommodityCode() == null ? " " : dto.getCommodityCode();
column[i++] = dto.getOaId() == null ? " " : dto.getOaId();
//like this we are calling some other getter methods
//After all getter methods we are appending columns to stringBuffer object
report.append(StringUtils.join(column, PIPE));
report.append(NEW_LINE);
//now stringbuffer object we are writing to file
stream = new StringReader(report.toString());
int count;
char buffer[] = new char[4096];
while ((count = stream.read(buffer)) > -1) {
//writing into file
writer.write(buffer, 0, count);
}
writer.flush();
//clearing the buffer
report.delete(0, report.length());
}
Error is :
java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
at java.lang.StringBuilder.append(StringBuilder.java:120)
Could you please look at the above code snippet and help me? It would be a great help.
Where does column get initialized? I don't see it. But it seems that's a likely culprit. You are building a string array without clearing it out. column[i++] . Where do you clear out that array? It should be scoped to the loop body, not outside of it. So inside loop, declare your String[] column and use it within that scope.
It seems logical to get an out-of-memory error when the list is big enough. Increasing the JVM heap size (using the -Xmx and -Xms JVM arguments) would resolve the issue temporarily. However, ideally you should use paged access to the source of the items in the list. If the list is populated from a database or web service, it can easily be accessed in a paged way.
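A rough sketch of the paged idea, with an in-memory list standing in for the real data source. Everything here is hypothetical (class name, method name, page size); in a real application each page would come from a separate database or web-service query, so that only one page of rows is held in memory at a time.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

class PagedReportWriter {
    /** Writes the rows one page at a time and returns the number of pages written. */
    static int writeInPages(List<String> rows, int pageSize, Writer out) {
        int pages = 0;
        try {
            for (int from = 0; from < rows.size(); from += pageSize) {
                int to = Math.min(from + pageSize, rows.size());
                // Assumption: in real code this subList would be a fresh query for rows [from, to).
                List<String> page = rows.subList(from, to);
                for (String row : page) {
                    out.write(row);
                    out.write('\n');
                }
                out.flush(); // the page can be discarded once it has been written
                pages++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        int pages = writeInPages(Arrays.asList("a|1", "b|2", "c|3", "d|4", "e|5"), 2, out);
        System.out.println(pages + " pages written");
        System.out.print(out);
    }
}
```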

How to parse a huge file line by line, serialize & deserialize a huge object efficiently?

I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers plus an additional integer and update my custom data structure. My class to hold such information looks like
class Holder {
private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in the arr & meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case,
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader object and its readLine method to read each line, and character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete this entire operation.
I used both java Serialization & Externalizable(write the meta and arr) to serialize and deserialize this HUGE Holder instance. And with both of them, the time to serialize is almost half an hour and to deserialize is also almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem & would definitely love to hear your part of story if any.
P.S. Main memory is not a problem. I have almost 50 GB of RAM in my machine. I have also increased the BufferedReader size to 40 MB (of course, I can increase this up to 100 MB, considering that disk access takes approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing a few details):
public class BigFileParser {
private int parsePositiveInt(final String s) {
int num = 0;
int sign = -1;
final int len = s.length();
final char ch = s.charAt(0);
if (ch == '-')
sign = 1;
else
num = '0' - ch;
int i = 1;
while (i < len)
num = num * 10 + '0' - s.charAt(i++);
return sign * num;
}
private void loadBigFile() {
long startTime = System.nanoTime();
Holder holder = new Holder();
String line;
try {
Reader fReader = new FileReader("/path/to/BIG/file");
// intended as a 40 MB buffer; note that 40960 is only 40 K chars
BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
String tempTerm;
int i, meta, ascii, len;
boolean consumeNextInteger;
// GNU Trove primitive int array list
TIntArrayList arr;
char c;
while ((line = bufferedReader.readLine()) != null) {
consumeNextInteger = true;
tempTerm = "";
arr = new TIntArrayList(5);
for (i = 0, len = line.length(); i < len; i++) {
c = line.charAt(i);
ascii = c - 0;
// 95 is the ascii value of _ char
if (consumeNextInteger && ascii == 95) {
arr.add(parsePositiveInt(tempTerm));
tempTerm = "";
} else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
tempTerm += c;
} else if (ascii == 9) { // '\t'
arr.add(parsePositiveInt(tempTerm));
consumeNextInteger = false;
tempTerm = "";
}
}
meta = parsePositiveInt(tempTerm);
holder.update(arr, meta);
}
bufferedReader.close();
long endTime = System.nanoTime();
System.out.println("#time -> " + (endTime - startTime) * 1.0
/ 1000000000 + " seconds");
} catch (IOException exp) {
exp.printStackTrace();
}
}
}
public class Holder {
private static final int SIZE = 500000000;
private TIntArrayList[] arrs;
private TIntArrayList metas;
private int idx;
public Holder() {
arrs = new TIntArrayList[SIZE];
metas = new TIntArrayList(SIZE);
idx = 0;
}
public void update(TIntArrayList arr, int meta) {
arrs[idx] = arr;
metas.add(meta);
idx++;
}
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is to reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which will encode small integers generally in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col onto a single index. According to how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, surely not half an hour.
Because of the data amount, the above doesn't work as you can't have a 6e9 entries array. But you can use a couple of big arrays instead and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D-array).
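A sketch of that layout, under the assumption of a row-major order and a fixed number of columns per row (for the question's data that would be 6: five array slots plus the meta value). The class name and the chunk size are arbitrary choices, not something taken from the answer.

```java
class ChunkedIntMatrix {
    private static final int CHUNK_BITS = 20;             // 2^20 ints (4 MiB) per chunk, an arbitrary choice
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final int cols;
    private final int[][] chunks;

    ChunkedIntMatrix(long rows, int cols) {
        this.cols = cols;
        long total = rows * cols;
        int chunkCount = (int) ((total + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new int[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = total - ((long) i << CHUNK_BITS);
            chunks[i] = new int[(int) Math.min(CHUNK_SIZE, remaining)];
        }
    }

    /** Row-major long index, split into a chunk number (high bits) and an offset (low bits). */
    int get(long row, int col) {
        long index = row * cols + col;
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long row, int col, int value) {
        long index = row * cols + col;
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    public static void main(String[] args) {
        ChunkedIntMatrix matrix = new ChunkedIntMatrix(300_000, 6);
        matrix.set(299_999, 5, 99); // last column used as the meta slot
        System.out.println(matrix.get(299_999, 5));
    }
}
```

Compared with an int[][] that has one small array per row, this allocates a handful of large arrays, which avoids a billion object headers and keeps neighbouring rows adjacent in memory.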
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.
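To illustrate the allocation-free, C-like parsing described above, here is a sketch that scans a byte range and writes every run of ASCII digits into a caller-supplied int array. It assumes non-negative numbers and the line shape from the question; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LineParser {
    /** Extracts every run of ASCII digits in buf[start, end) into out and returns how many were found. */
    static int parseInts(byte[] buf, int start, int end, int[] out) {
        int count = 0;
        int value = 0;
        boolean inNumber = false;
        for (int i = start; i < end; i++) {
            int digit = buf[i] - '0';
            if (digit >= 0 && digit <= 9) {
                value = value * 10 + digit;
                inNumber = true;
            } else if (inNumber) {
                // Any non-digit ('_', ')', tab, ...) terminates the current number.
                out[count++] = value;
                value = 0;
                inNumber = false;
            }
        }
        if (inNumber) {
            out[count++] = value;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] line = "(1_23_4_55)\t99".getBytes(StandardCharsets.US_ASCII);
        int[] scratch = new int[6]; // allocated once and reused for every line
        int n = parseInts(line, 0, line.length, scratch);
        // The last number found is the meta value; the ones before it are the array.
        System.out.println(Arrays.toString(Arrays.copyOf(scratch, n)));
    }
}
```

No String, substring or list is created per line here, which is the kind of garbage the tempTerm += c concatenation in the question produces for every digit.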
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
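As an illustration of the binary idea, the sketch below writes ints as raw 4-byte big-endian values and reads them back in one bulk operation. The class name and the temp-file round trip are only for demonstration; a multi-gigabyte file would have to be read in several buffers or memory-mapped regions, since a single ByteBuffer is limited to about 2 GB.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class BinaryInts {
    /** Writes each int as 4 big-endian bytes, with no separators and no text encoding. */
    static void write(Path file, int[] values) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file into a buffer and views it as ints; no per-number parsing is needed. */
    static int[] read(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer bytes = ByteBuffer.allocate((int) channel.size());
            while (bytes.hasRemaining() && channel.read(bytes) != -1) {
                // keep reading until the buffer is full or the file ends
            }
            bytes.flip();
            int[] values = new int[bytes.remaining() / Integer.BYTES];
            bytes.asIntBuffer().get(values); // ByteBuffer is big-endian by default, matching writeInt
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes to a temporary file, reads it back, and deletes the file. */
    static int[] roundTrip(int[] values) {
        try {
            Path file = Files.createTempFile("ints", ".bin");
            try {
                write(file, values);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundTrip(new int[] {1, 23, 4, 55, 99})));
    }
}
```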
