Why is my hashset so memory-consuming?

I found that my program's memory growth comes from the code below. I am currently reading a file of about 7GB, and I believe the data that ends up stored in the HashSet is less than 10M, but my program's memory keeps increasing to 300MB and then it crashes with an OutOfMemoryError. If the HashSet is the problem, which data structure should I choose?
if (tagsStr != null) {
    if (tagsStr.contains("a") || tagsStr.contains("b") || tagsStr.contains("c")) {
        maTable.add(postId);
    }
} else {
    if (maTable.contains(parentId)) {
        // do something else; no memory is added here
    }
}

You haven't really told us what you're doing, but:
If your file is currently in something like ASCII, each character you read will be one byte in the file but two bytes in memory.
Each string will have an object overhead - this can be significant if you're storing lots of small strings
If you're reading lines with BufferedReader (or taking substrings from large strings), each one may have a large backing buffer - you may want to use maTable.add(new String(postId)) to avoid this
Each entry in the hash set needs a separate object to keep the key/hashcode/value/next-entry values. Again, with a lot of entries this can add up
In short, it's quite possible that you're doing nothing wrong, but a combination of memory-increasing factors are working against you. Most of these are unavoidable, but the third one may be relevant.

You've either got a memory leak or your understanding of the amount of string data that you are storing is incorrect. We can't tell which without seeing more of your code.
The scientific solution is to run your application using a memory profiler, and analyze the output to see which of your data structures is using an unexpectedly large amount of memory.
If I was to guess, it would be that your application (at some level) is doing something like this:
String line;
while ((line = br.readLine()) != null) {
    // search for tag in line
    String tagStr = line.substring(pos1, pos2);
    // code as per your example
}
This uses a lot more memory than you'd expect. The substring(...) call creates a tagStr object that refers to the backing array of the original line string. Your tag strings that you expect to be short actually refer to a char[] object that holds all characters in the original line.
The fix is to do this:
String tagStr = new String(line.substring(pos1, pos2));
This creates a String object that does not share the backing array of the argument String.
UPDATE - this or something similar is an increasingly likely explanation ... given your latest data.
To expand on another of Jon Skeet's points, the overheads of a small String are surprisingly high. For instance, on a typical 32-bit JVM, the memory usage of a one-character String is:
String object header for String object: 2 words
String object fields: 3 words
Padding: 1 word (I think)
Backing array object header: 3 words
Backing array data: 1 word
Total: 10 words - 40 bytes - to hold one char of data ... or one byte of data if your input is in an 8-bit character set.
(This is not sufficient to explain your problem, but you should be aware of it anyway.)
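The arithmetic above can be written out as a small sketch. The word counts are the ones listed in this answer for a 32-bit JVM; the count of 10 million strings is only an assumption for illustration (the question does not say whether "10M" means entries or bytes), the class name is made up, and the figure excludes the per-entry objects of the hash set itself.

```java
final class StringOverheadEstimate {
    static final int WORD_BYTES = 4; // 32-bit JVM, as assumed in the answer above

    // Words used by a one-character String, following the breakdown listed above.
    static int wordsPerOneCharString() {
        int stringHeader = 2;
        int stringFields = 3;
        int padding = 1;
        int arrayHeader = 3;
        int arrayData = 1;
        return stringHeader + stringFields + padding + arrayHeader + arrayData;
    }

    // Rough total for 'stringCount' one-character strings.
    static long estimateBytes(long stringCount) {
        return stringCount * wordsPerOneCharString() * WORD_BYTES;
    }

    public static void main(String[] args) {
        long count = 10_000_000L; // hypothetical number of stored strings
        System.out.println(estimateBytes(count) + " bytes for " + count + " one-char strings");
    }
}
```

Real numbers depend on the JVM, its pointer size, and its alignment rules, so this is an order-of-magnitude check rather than a measurement.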

Couldn't it be that the data read into memory (from the 7GB file) is somehow not freed? Something like what Jon describes, i.e. since strings are immutable, every string read requires a new String object, which might lead to running out of memory if the GC is not quick enough...
If that is the case, then you might insert some 'breakpoints' into your code/iteration, i.e. at some defined points, request a GC and wait until it finishes.

Run your program with -XX:+HeapDumpOnOutOfMemoryError. You'll then be able to use a memory analyser like MAT to see what is using up all of the memory - it may be something completely unexpected.

Related

from InputStream to List<String>, why java is allocating space twice in the JVM?

I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
            .collect(Collectors.toList());
}
The problem is, the file itself is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory:
So what I understand is that Java is allocating twice the memory needed for the operation, once to put the content of the file in a byte array and once more to instantiate the string list.
My question is: can we optimize that and avoid using twice the memory needed?

tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there are a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, thus never having more than a handful of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
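As a minimal sketch of the "don't keep the whole thing in memory" option: the loop below handles each line as it is read and keeps only a running count, so no List of lines is ever built. The class name, the method name, and the "count lines containing a marker" task are assumptions made for illustration; a StringReader stands in for the InputStreamReader over zipInputStream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LineByLine {
    // Counts lines containing 'marker' without keeping earlier lines reachable.
    static long countMatching(Reader source, String marker) {
        long matches = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(marker)) {
                    matches++;
                }
                // 'line' is eligible for collection after this iteration
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    public static void main(String[] args) {
        // In the question this would be new InputStreamReader(zipInputStream)
        System.out.println(countMatching(new StringReader("a1\nb2\na3\n"), "a"));
    }
}
```

Memory use here is bounded by the longest line plus the reader's buffer, not by the file size.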

Best way to optimize string data in an application that allocates quite a bit of it

I have an application that uses a ton of String objects. One of my objects (let's call it Person) contains 9 of them. The data that is written to each String object is never written more than once, but will be read several times after. There will be several hundred thousand or so Person objects at a given time, and many of these Person objects will share first name, last name, etc...
I am trying to think of immediate ways to reduce the amount of memory that is consumed by the Person object, but I am no expert when it comes to how Java manages its memory underneath.
Before I go down this rabbit hole, I would like to know what drawbacks there would be if I went down these paths, and if it even makes sense in the first place:
Using StringBuilder or StringBuffer solely because of the trimToSize() method which would allow me to reduce the number of allocated bytes used in the string.
Store the strings as byte[] arrays and provide a getter that would convert the byte[] to String and a setter that would accept String and convert to byte[] - data is being read quite a bit, so would this be too expensive?
Create a hash table for (let's just say) "names" that would prevent duplicate allocations (using a pointer) for the same name over and over (there could be thousands of names with 10+ characters).
Before I pointlessly head down any of these roads, does it make sense to do? Maybe Java is already reducing String allocations and checking for duplicates?
I don't mind a good read either. I have found some documentation but nothing that explores to this depth.

StringBuilder and StringBuffer won't help in this case. String is an immutable object, and these two classes were introduced for building Strings, not for storing them. You may (in most cases, must) still use StringBuilder if you concatenate Strings, insert chars in the middle, or delete chars.
In my opinion, the second option could lead to increased memory consumption, because a new String will be created every time the byte[] is converted to a String.
A handwritten StringDeduplicator is a very reasonable solution, especially if you are stuck with Java 5, 6, or 7.
Java 8/9 has a String Deduplication option. By default, this option is disabled. To use it in Java 8, you must enable the G1 garbage collector, while in Java 9 G1 is the default.
-XX:+UseStringDeduplication
Regarding String Deduplication, see:
JEP 192: String Deduplication in G1
Java 8 Update 20 Release Notes
Other Stack Overflow posts
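A handwritten deduplicator of the kind mentioned above could look roughly like this. It is only a sketch: the class and method names are invented here, it is not thread-safe, and the map holds strong references, so it grows with the number of distinct values.

```java
import java.util.HashMap;
import java.util.Map;

final class StringDeduplicator {
    // Maps each distinct value to the first instance seen with that value.
    private final Map<String, String> canonical = new HashMap<String, String>();

    // Returns a shared instance equal to 'value', so equal strings end up as one object.
    String dedupe(String value) {
        if (value == null) {
            return null;
        }
        String existing = canonical.get(value);
        if (existing == null) {
            canonical.put(value, value);
            return value;
        }
        return existing;
    }

    int size() {
        return canonical.size();
    }
}
```

A Person setter would then store dedupe(firstName) instead of firstName, so several hundred thousand Person objects sharing a few thousand names reference only a few thousand String objects.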

Reading large files for a simulation (Java crashes with out of heap space)

For a school assignment, I need to create a Simulation for memory accesses. First I need to read 1 or more trace files. Each contains memory addresses for each access. Example:
0 F001CBAD
2 EEECA89F
0 EBC17910
...
Where the first integer indicates a read/write etc. then the hex memory address follows. With this data, I am supposed to run a simulation. So the idea I had was parse these data into an ArrayList<Trace> (for now I am using Java) with trace being a simple class containing the memory address and the access type (just a String and an integer). After which I plan to loop through these array lists to process them.
The problem is that even during parsing, it is running out of heap space. Each trace file is ~200MB, and I have up to 8, meaning a minimum of ~1.6 GB of data I am trying to "cache". What baffles me is that I am only parsing 1 file and Java is using 2GB according to my task manager ...
What is a better way of doing this?
A code snippet can be found at Code Review

The answer I gave on codereview is the same one you should use here .....
But, because duplication appears to be OK, I'll duplicate the answer here.
The issue is almost certainly in the structure of your Trace class, and its memory efficiency. You should ensure that the instrType and hexAddress are stored as memory-efficient structures. The instrType appears to be an int, which is good, but just make sure that it is declared as an int in the Trace class.
The more likely problem is the size of the hexAddress String. You may not realise it but Strings are notorious for 'leaking' memory. In this case, you have a line and you think you are just getting the hexString from it... but in reality, the hexString contains the entire line.... yeah, really. For example, look at the following code:
import java.util.StringTokenizer;

public class SToken {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("99 bottles of beer");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        String hexAddr = tokenizer.nextToken();
        System.out.println(instrType + hexAddr);
    }
}
Now set a breakpoint in your IDE (I use Eclipse) and run it. You will see that hexAddr contains a char[] array for the entire line, and it has an offset of 3 and a count of 7.
Because of the way that String substring and other constructs work, they can consume huge amounts of memory for short strings... (in theory that memory is shared with other strings though). As a consequence, you are essentially storing the entire file in memory!!!!
At a minimum, you should change your code to:
hexAddr = new String(tokenizer.nextToken().toCharArray());
But even better would be:
long hexAddr = parseHexAddress(tokenizer.nextToken());
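The answer does not show parseHexAddress, so the version below is a hypothetical helper, not the author's code. It assumes the addresses are plain hex digits with no "0x" prefix, as in the trace example, and relies on Long.parseLong with radix 16, which comfortably covers 32-bit addresses.

```java
final class HexAddress {
    // Hypothetical helper (not from the original post): parses a hex string such as "F001CBAD".
    static long parseHexAddress(String hex) {
        return Long.parseLong(hex, 16);
    }

    public static void main(String[] args) {
        String line = "0 F001CBAD"; // one line from the trace example
        int space = line.indexOf(' ');
        int instrType = Integer.parseInt(line.substring(0, space));
        long address = parseHexAddress(line.substring(space + 1));
        System.out.println(instrType + " " + Long.toHexString(address));
    }
}
```

Storing the address as a long means each Trace holds 8 bytes of data for it rather than a reference to a String and its backing array.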

Like rolfl I answered your question in the code review. The biggest issue, to me, is the reading everything into memory first and then processing. You need to read a fixed amount, process that, and repeat until finished.

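Try using java.nio.ByteBuffer instead of java.util.ArrayList<Trace>. It should also reduce the memory usage.
import java.nio.ByteBuffer;

class TraceList {
    // Each record is 1 byte of operation type followed by a 4-byte address.
    private static final int RECORD_SIZE = 1 + 4;
    private final ByteBuffer buffer;

    public TraceList(int maxTraces) {
        // allocate the byte buffer up front; its capacity is fixed
        buffer = ByteBuffer.allocate(maxTraces * RECORD_SIZE);
    }

    public void put(byte operationType, int address) {
        // append one record at the buffer's current position
        buffer.put(operationType).putInt(address);
    }

    public Trace get(int index) {
        // read one record by index using absolute reads
        byte type = buffer.get(index * RECORD_SIZE);
        int address = buffer.getInt(index * RECORD_SIZE + 1);
        return new Trace(type, address);
    }
}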

Java Strings : how the memory works with immutable Strings

I have a simple question.
byte[] responseData = ...;
String str = new String(responseData);
String withKey = "{\"Abcd\":" + str + "}";
In the above code, are these three lines taking 3X memory? For example, if responseData is 1MB, then line 2 will take an extra 1MB in memory, and line 3 will take an extra 1MB + xx. Is this true? If not, how does it work? If yes, what is the optimal way to fix this? Will StringBuffer help here?

Yes, that sounds about right. Probably even more because your 1MB byte array needs to be turned into UTF-16, so depending on the encoding, it may be even bigger (2MB if the input was ASCII).
Note that the garbage collector can reclaim memory as soon as the variables that use it go out of scope. You could set them to null as early as possible to help it make this as timely as possible (for example responseData = null; after you constructed your String).
if yes, then what is the optimal way to fix this
"Fix" implies a problem. If you have enough memory there is no problem.
the problem is that I am getting OutOfMemoryException as the byte[] data coming from server is quite big,
If you don't, you have to think about a better alternative to keeping a 1MB string in memory. Maybe you can stream the data off a file? Or work on the byte array directly? What kind of data is this?

The problem is that I am getting OutOfMemoryException as the byte[] data coming from server is quite big, thats why I need to figure it out first that am I doing something wrong ....
Yes. Well basically your fundamental problem is that you are trying to hold the entire string in memory at one time. This is always going to fail for a sufficiently large string ... even if you code it in the most optimal memory efficient fashion possible. (And that would be complicated in itself.)
The ultimate solution (i.e. the one that "scales") is to do one of the following:
stream the data to the file system, or
process it in such a way that you don't ever need the entire "string" to be represented.
You asked if StringBuffer will help. It might help a bit ... provided that you use it correctly. The trick is to make sure that you preallocate the StringBuffer (actually a StringBuilder is better!!) to be big enough to hold all of the characters required. Then copy data into it using a charset decoder (directly or using a Reader pipeline).
But even with optimal coding, you are likely to need a peak of 3 times the size of your input byte[].
Note that your OOME problem is probably nothing to do with GC or storage leaks. It is actually about the fundamental space requirements of the data types you are using ... and the fact that Java does not offer a "string of bytes" data type.
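The preallocation advice above might be sketched as follows. The class and method names are made up, and the capacity estimate assumes an encoding such as ASCII or UTF-8, where the decoded char count never exceeds the byte count; if the estimate were too small, the builder would still work but would have to grow.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class PresizedDecode {
    // Builds {"Abcd":<payload>} in one pre-sized builder, decoding through a Reader.
    static String wrap(byte[] responseData, Charset charset) {
        String prefix = "{\"Abcd\":";
        String suffix = "}";
        // Upper bound on chars needed, under the encoding assumption stated above
        int capacity = prefix.length() + responseData.length + suffix.length();
        StringBuilder sb = new StringBuilder(capacity);
        sb.append(prefix);
        char[] chunk = new char[8192];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(responseData), charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n); // no intermediate String for the payload
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        sb.append(suffix);
        return sb.toString(); // this final copy is still unavoidable
    }
}
```

This avoids the intermediate str object and any builder regrowth, but the byte[], the builder, and the final String still coexist at the end, which is consistent with the peak estimate given above.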

There is no OutOfMemoryException in my API docs. If it's OutOfMemoryError, especially on the server side, you definitely have a problem.
When you receive big requests from clients, those String related statements are not the first problem. Reducing 3X to 1X is not the solution.
I'm sorry I can't help without any further code.
Use back-end storage
You should not store the whole request body in a byte[]. You can store it directly in any back-end storage such as a local file, a remote database, or cloud storage.
I would
copy stream from request to back-end with small chunked buffer
Use streams
If you can, use streams, not objects.
I would
response.getWriter().write("{\"Abcd\":");
copy <your back-end stored data as stream>);
response.getWriter().write("}");
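The "copy" step in the outline above could be a plain chunked loop, sketched here against generic streams rather than a servlet response. The class and method names are invented, and the sketch assumes the stored bytes are already in the encoding the response will use (UTF-8 here).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamWrap {
    // Writes {"Abcd":<body>} to 'out', copying the body through a small fixed buffer.
    static void writeWrapped(InputStream body, OutputStream out) {
        byte[] buffer = new byte[8192];
        try {
            out.write("{\"Abcd\":".getBytes(StandardCharsets.UTF_8));
            int n;
            while ((n = body.read(buffer)) != -1) {
                out.write(buffer, 0, n); // at most 8 KB of the body is held at a time
            }
            out.write("}".getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape the heap cost is the 8 KB buffer regardless of how large the stored body is.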

Yes, if you use a StringBuffer for the code you have, you would save 1MB of heap space in the last step. However, considering the size of data you have, I recommend an external memory algorithm where you bring only part of your data to memory, process it and put it back to storage.

As others have mentioned, you should really try not to have such a big Object in your mobile app, and that streaming should be your best solution.
That said, there are some techniques to reduce the amount of memory your app is using now:
Remove byte[] responseData entirely if possible, so the memory it used can be released ASAP (assuming it is not used anywhere else)
Create the largest String first, and then substring() it. Android uses Apache Harmony for its standard Java library implementation. If you check its String class implementation, you'll see that substring() is implemented simply by creating a new String object with the proper start and end offsets into the original data, and no duplicate copy is created. So doing the following would cut the overall memory consumption by at least 1/3:
String withKey = new StringBuilder().append("{\"Abcd\":").append(str).append("}").toString();
str = withKey.substring("{\"Abcd\":".length(), withKey.length() - "}".length());
Never ever use something like "{\"Abcd\":" + str + "}" for large Strings. Under the hood, "string_a" + "string_b" is implemented as new StringBuilder().append("string_a").append("string_b").toString(), so implicitly you are creating two (or at least one if the compiler is smart) StringBuilders. For large Strings, it's better to take over this process yourself, as you have deep domain knowledge about your program that the compiler doesn't, and you know how best to manipulate the strings.

Why does reading a file into memory takes 4x the memory in Java?

I have the following code, which reads in a file, appends a \r\n to the end of each line, and puts the result in a string buffer:
public InputStream getInputStream() throws Exception {
    StringBuffer holder = new StringBuffer();
    try {
        FileInputStream reader = new FileInputStream(inputPath);
        BufferedReader br = new BufferedReader(new InputStreamReader(reader));
        String strLine;
        // Read File Line By Line
        boolean start = true;
        while ((strLine = br.readLine()) != null) {
            if (!start)
                holder.append("\r\n");
            holder.append(strLine);
            start = false;
        }
        // Close the input stream
        reader.close();
    } catch (Throwable e) { // this is where the heap error is caught up to 2Gb
        System.err.println("Error: " + e.getMessage());
    }
    return new StringBufferInputStream(holder.toString());
}
I tried reading in a 400MB file, and I changed the max heap space to 2GB, yet it still gives the out-of-memory heap exception. Any ideas?

It may be to do with how the StringBuffer resizes when it reaches capacity - This involves creating a new char[] double the size of the previous one and then copying the contents across into the new array. Together with the points already made about characters in Java being stored as 2 bytes this will definitely add to your memory usage.
To resolve this you could create a StringBuffer with sufficient capacity to begin with, given that you know the file size (and hence approximate number of characters to read in). However, be warned that the array allocation will also occur if you then attempt to convert this large StringBuffer into a String.
Another point: You should typically favour StringBuilder over StringBuffer as the operations on it are faster.
You could consider implementing your own "CharBuffer", using for example a LinkedList of char[] to avoid expensive array allocation / copy operations. You could make this class implement CharSequence and perhaps avoid converting to a String altogether. Another suggestion for more compact representation: If you're reading in English text containing large numbers of repeated words you could read and store each word, using the String.intern() function to significantly reduce storage.
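The String.intern() suggestion can be illustrated with a short sketch. The class and method names are made up for this example; whether interning pays off depends on how repetitive the text is and on the JVM's string pool implementation.

```java
final class InternDemo {
    // Splits on whitespace and interns each word so repeated words share one String instance.
    static String[] internedWords(String text) {
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            words[i] = words[i].intern();
        }
        return words;
    }

    public static void main(String[] args) {
        String[] words = internedWords("the cat and the dog");
        // Both occurrences of "the" now refer to the same object
        System.out.println(words[0] == words[3]);
    }
}
```

The array still holds one reference per word, so the saving comes only from the character data that is no longer duplicated.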

To begin with Java strings are UTF-16 (i.e. 2 bytes per character), so assuming your input file is ASCII or a similar one-byte-per-character format then holder will be ~2x the size of the input data, plus the extra \r\n per line and any additional overhead. There's ~800MB straight away, assuming a very low storage overhead in StringBuffer.
I could also believe that the contents of your file is buffered twice - once at the I/O level and once in the BufferedReader.
However, to know for sure, it's probably best to look at what's actually on the heap - use a tool like HPROF to see exactly where your memory has gone.
In terms of solving this, I suggest you process a line at a time, writing out each line after you have added the line termination. That way your memory usage should be proportional to the length of a line, instead of the entire file.
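A line-at-a-time version might look like the sketch below. It assumes the caller can accept a Writer as the destination instead of the InputStream the question returns; the class and method names are invented. Like the question's loop, it writes "\r\n" between lines rather than after the last one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

final class CrlfCopy {
    // Copies text line by line, writing "\r\n" between consecutive lines.
    static void copyWithCrlf(Reader in, Writer out) {
        try {
            BufferedReader br = new BufferedReader(in);
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (!first) {
                    out.write("\r\n");
                }
                out.write(line); // only the current line is held in memory
                first = false;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the question's setting, the Reader would wrap the FileInputStream and the Writer would wrap whatever ultimately consumes the data.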

It's an interesting question, but rather than stress over why Java is using so much memory, why not try a design that doesn't require your program to load the entire file into memory?

You have a number of problems here:
Unicode: characters take twice as much space in memory as on disk (assuming a 1 byte encoding)
StringBuffer resizing: could double (permanently) and triple (temporarily) the occupied memory, though this is the worst case
StringBuffer.toString() temporarily doubles the occupied memory since it makes a copy
All of these combined mean that you could require temporarily up to 8 times your file's size in RAM, i.e. 3.2G for a 400M file. Even if your machine physically has that much RAM, it has to be running a 64bit OS and JVM to actually get that much heap for the JVM.
All in all, it's simply a horrible idea to keep such a huge String in memory, and it's totally unnecessary as well: since your method returns an InputStream, all you really need is a FilterInputStream that adds the line breaks on the fly.

It's the StringBuffer. The empty constructor creates a StringBuffer with an initial capacity of 16 characters. If you append something and the capacity is not sufficient, it copies the internal char array to a new, larger buffer.
So in fact, with each line appended the StringBuffer has to create a copy of the complete internal array, which nearly doubles the required memory when appending the last line. Together with the UTF-16 representation this results in the observed memory demand.
Edit
Michael is right in saying that the internal buffer is not incremented in small portions; it roughly doubles in size each time you need more memory. But still, in the worst case, say the buffer needs to expand capacity just with the very last append, it creates a new array twice the size of the current one, so in this case, for a moment you need roughly three times the amount of memory.
Anyway, I've learned the lesson: StringBuffer (and Builder) may cause unexpected OutOfMemory errors and I'll always initialize it with a size, at least when I have to store large Strings. Thanks for the question :)
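The growth behaviour discussed here can be observed through StringBuilder.capacity(). The sketch below uses invented names and only reports what the running JVM does; the exact growth step is an implementation detail, while the default capacity of 16 is documented.

```java
final class BuilderCapacity {
    // Capacity of a builder created with the no-argument constructor.
    static int defaultCapacity() {
        return new StringBuilder().capacity();
    }

    // Capacity after appending one char more than the default capacity holds.
    static int capacityAfterOverflow() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 17; i++) {
            sb.append('x');
        }
        return sb.capacity();
    }

    // Capacity after filling a builder that was sized for the expected content.
    static int presizedCapacityAfterFill(int expectedChars) {
        StringBuilder sb = new StringBuilder(expectedChars);
        for (int i = 0; i < expectedChars; i++) {
            sb.append('x');
        }
        return sb.capacity();
    }

    public static void main(String[] args) {
        System.out.println(defaultCapacity());
        System.out.println(capacityAfterOverflow());
        System.out.println(presizedCapacityAfterFill(1000));
    }
}
```

A pre-sized builder that is filled exactly to its capacity never reallocates, which removes the temporary old-plus-new array peak described above.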

At the last insert into the StringBuffer, you need three times the memory allocated, because the StringBuffer always expands by (size + 1) * 2 (which is already double because of Unicode). So a 400MB file could require an allocation of 800MB * 3 == 2.4GB at the end of the inserts. It may be somewhat less, depending on exactly when the threshold is reached.
The suggestion to concatenate Strings rather than using a Buffer or Builder is in order here. There will be a lot of garbage collection and object creation (so it will be slow), but a much lower memory footprint.
[At Michael's prompting, I investigated this further, and concat wouldn't help here, as it copies the char buffer, so while it wouldn't require triple, it would require double the memory at the end.]
You could continue to use the Buffer (or better yet Builder in this case) if you know the maximum size of the file and initialize the size of the Buffer on creation and you are sure this method will only get called from one thread at a time.
But really such an approach of loading such a large file into memory at once should only be done as a last resort.

I would suggest you use the OS file cache instead of copying the data into Java memory via characters and back to bytes again. If you re-read the file as required (perhaps transforming it as you go), it will be faster and very likely simpler.
You need over 2 GB because 1-byte letters use char (2 bytes) in memory, and when your StringBuffer resizes you need double that (to copy the old array to the larger new array). The new array is typically about twice as large, so you need up to 6x the original file size. If the performance wasn't bad enough, you are using StringBuffer instead of StringBuilder, which synchronizes every call when it is clearly not needed. (This only slows you down; it uses the same amount of memory.)

Others have explained why you're running out of memory. As to how to solve this problem, I'd suggest writing a custom FilterInputStream subclass. This class would read one line at a time, append the "\r\n" characters and buffer the result. Once the line has been read by the consumer of your FilterInputStream, you'd read another line. This way you'd only ever have one line in memory at a time.
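A rough sketch of that idea follows. For brevity it extends InputStream directly and takes a Reader, rather than subclassing FilterInputStream as suggested; the class name is made up, and readAll exists only to demonstrate the output (it buffers everything, which a real consumer would not do).

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class CrlfLineInputStream extends InputStream {
    private final BufferedReader reader;
    private final Charset charset;
    private byte[] current = new byte[0]; // encoded bytes of the line being served
    private int pos = 0;

    CrlfLineInputStream(Reader source, Charset charset) {
        this.reader = new BufferedReader(source);
        this.charset = charset;
    }

    @Override
    public int read() throws IOException {
        // Refill from the next line once the current one is exhausted
        while (pos >= current.length) {
            String line = reader.readLine();
            if (line == null) {
                return -1; // end of input
            }
            current = (line + "\r\n").getBytes(charset);
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Demo helper only: drains the stream into a String.
    static String readAll(Reader source, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new CrlfLineInputStream(source, charset)) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), charset);
    }
}
```

Only one encoded line is held at a time. A production version would probably also override read(byte[], int, int) so that consumers are not limited to byte-at-a-time reads.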

I also recommend checking out Commons IO FileUtils class for this. Specifically: org.apache.commons.io.FileUtils#readFileToString. You can also specify the encoding if you know you only are using ASCII.
