Why does appending "" to a String save memory? - java

I used a variable with a lot of data in it, say String data.
I wanted to use a small part of this string in the following way:
this.smallpart = data.substring(12,18);
After some hours of debugging (with a memory visualizer) I found out that the object's field smallpart retained all the data from data, although it only contained the substring.
When I changed the code into:
this.smallpart = data.substring(12,18)+"";
..the problem was solved! My application uses very little memory now!
How is that possible? Can anyone explain this? I think this.smallpart kept referencing towards data, but why?
UPDATE:
How can I clear the big String then? Will data = new String(data.substring(0,100)) do the thing?

Doing the following:
data.substring(x, y) + ""
creates a new (smaller) String object, and throws away the reference to the String created by substring(), thus enabling garbage collection of this.
The important thing to realise is that substring() gives a window onto an existing String - or rather, the character array underlying the original String. Hence it will consume the same memory as the original String. This can be advantageous in some circumstances, but problematic if you want to get a substring and dispose of the original String (as you've found out).
Take a look at the substring() method in the JDK String source for more info.
EDIT: To answer your supplementary question, constructing a new String from the substring will reduce your memory consumption, provided you bin any references to the original String.
NOTE (Jan 2013). The above behaviour has changed in Java 7u6. The flyweight pattern is no longer used and substring() will work as you would expect.
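A minimal sketch of the fix described above, assuming a pre-7u6 JVM where substring() shares the parent's array. The class and method names are made up for illustration. On Java 7u6 and later the extra new String(...) is harmless but no longer needed.

```java
class SubstringCopyDemo {
    // Returns characters 12..17 of data as a String with its own small backing array.
    static String smallPart(String data) {
        // On pre-7u6 JVMs, new String(String) trims the backing array down to
        // just these six characters, so the result does not pin data's array.
        return new String(data.substring(12, 18));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String data = sb.toString();
        String small = smallPart(data);
        data = null; // drop the only reference to the big string
        System.out.println(small);
    }
}
```

Once data is no longer referenced anywhere, the large array becomes eligible for garbage collection while small stays alive.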

If you look at the source of substring(int, int), you'll see that it returns:
new String(offset + beginIndex, endIndex - beginIndex, value);
where value is the original char[]. So you get a new String but with the same underlying char[].
When you do data.substring() + "", you get a new String with a new underlying char[].
Actually, your use case is the only situation where you should use the String(String) constructor:
String tiny = new String(huge.substring(12,18));

When you use substring, it doesn't actually create a new string. It still refers to your original string, with an offset and size constraint.
So, to allow your original string to be collected, you need to create a new string (using new String, or what you've got).

I think this.smallpart kept referencing towards data, but why?
Because Java strings consist of a char array, a start offset and a length (and a cached hashCode). Some String operations like substring() create a new String object that shares the original's char array and simply has different offset and/or length fields. This works because the char array of a String is never modified once it has been created.
This can save memory when many substrings refer to the same basic string without replicating overlapping parts. As you have noticed, in some situations, it can keep data that's not needed anymore from being garbage collected.
The "correct" way to fix this is the new String(String) constructor, i.e.
this.smallpart = new String(data.substring(12,18));
BTW, the overall best solution would be to avoid having very large Strings in the first place, and to process any input in smaller chunks, a few KB at a time.
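The layout described in this answer (char array, offset, length) can be modelled with a small toy class. This is a hypothetical sketch, not the real JDK source: SharedString and its methods are invented names, and bounds checking is left out to keep it short.

```java
import java.util.Arrays;

// Hypothetical model of the pre-7u6 String layout; not the real JDK class.
final class SharedString {
    final char[] value; // backing array, possibly shared with other instances
    final int offset;   // index of the first visible character in value
    final int count;    // number of visible characters

    SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    static SharedString of(String s) {
        return new SharedString(s.toCharArray(), 0, s.length());
    }

    // Like the old substring(): new object, same array, different window.
    SharedString substring(int begin, int end) {
        return new SharedString(value, offset + begin, end - begin);
    }

    // Like new String(String) on a substring: copy only the visible window.
    SharedString trimmedCopy() {
        char[] copy = Arrays.copyOfRange(value, offset, offset + count);
        return new SharedString(copy, 0, count);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

As long as any substring() result is reachable, the whole value array is reachable too; only trimmedCopy() breaks that link.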

In Java, strings are immutable objects, and once a string is created it remains in memory until it's cleaned up by the garbage collector (and this cleaning is not something you can take for granted).
When you call the substring method, Java does not create a truly new string, but just stores a range of characters inside the original string.
So, when you created a new string with this code:
this.smallpart = data.substring(12, 18) + "";
you actually created a new string when you concatenated the result with the empty string.
That's why.

As documented by jwz in 1997:
If you have a huge string, pull out a substring() of it, hold on to the substring and allow the longer string to become garbage (in other words, the substring has a longer lifetime) the underlying bytes of the huge string never go away.

Just to sum up, if you create lots of substrings from a small number of big strings, then use
String substring = string.substring(5,23);
since you only use the space to store the big strings. But if you are extracting just a handful of small strings from lots of big strings, then
String substring = new String(string.substring(5,23));
will keep your memory use down, since the big strings can be reclaimed when no longer needed.
That you call new String is a helpful reminder that you really are getting a new string, rather than a reference to the original one.

Firstly, calling java.lang.String.substring creates a new window onto the original String by using an offset and length, instead of copying the relevant part of the underlying array.
If we take a closer look at the substring method, we will notice a call to the String(int, int, char[]) constructor that passes it the whole char[] representing the string. That means the substring will occupy as much memory as the original string.
OK, but why does + "" result in less memory being needed than without it?
Doing a + on strings is implemented via a StringBuilder.append method call. A look at the implementation of this method in the AbstractStringBuilder class tells us that it ultimately does an arraycopy of only the part we really need (the substring).
Any other workarounds?
this.smallpart = new String(data.substring(12,18));
this.smallpart = data.substring(12,18).intern();
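To make the StringBuilder explanation concrete, here is roughly what the compiler used to generate for s + "" before Java 9, written out by hand. This is an approximation for illustration (the class name is invented); newer compilers emit a different mechanism with the same visible result.

```java
class ConcatCopyDemo {
    // Hand-written equivalent of `s + ""` as older javac versions compiled it.
    static String viaBuilder(String s) {
        return new StringBuilder()
                .append(s)   // copies only the visible characters of s
                .append("")  // appends nothing
                .toString(); // builds a new String from the builder's contents
    }

    public static void main(String[] args) {
        String data = "0123456789abcdefghijklmnop";
        System.out.println(viaBuilder(data.substring(12, 18)));
    }
}
```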

Appending "" to a string will sometimes save memory.
Let's say I have a huge string containing a whole book, one million characters.
Then I create 20 strings containing the chapters of the book as substrings.
Then I create 1000 strings containing all paragraphs.
Then I create 10,000 strings containing all sentences.
Then I create 100,000 strings containing all the words.
I still only use 1,000,000 characters. If you add "" to each chapter, paragraph, sentence and word, you use 5,000,000 characters.
Of course it's entirely different if you only extract one single word from the whole book, and the whole book could be garbage collected but isn't because that one word holds a reference to it.
And it's again different if you have a one million character string and remove tabs and spaces at both ends, making say 10 calls to create a substring. The way Java works or worked avoids copying a million characters each time. There is compromise, and it's good if you know what the compromises are.
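The 1,000,000 versus 5,000,000 figures in the book example can be checked with a little arithmetic. This sketch assumes each of the four levels (chapters, paragraphs, sentences, words) covers roughly the whole text, ignoring whitespace and punctuation; the class and constant names are invented.

```java
class BookCharCount {
    static final long BOOK_CHARS = 1_000_000L;
    static final int LEVELS = 4; // chapters, paragraphs, sentences, words

    // With shared backing arrays, every substring points into the one book array.
    static long sharedChars() {
        return BOOK_CHARS;
    }

    // With copies, each level stores roughly the whole book again,
    // on top of the original book string.
    static long copiedChars() {
        return BOOK_CHARS * (1 + LEVELS);
    }

    public static void main(String[] args) {
        System.out.println(sharedChars() + " vs " + copiedChars());
    }
}
```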

Related

Java concatenate strings vs static strings

I try to get a better understanding of Strings. I am basically making a program that requires a lot of strings. However, a lot of the strings are very, very similar and merely require a different word at the end of the string.
E.g.
String one = "I went to the store and bought milk";
String two = "I went to the store and bought eggs";
String three = "I went to the store and bought cheese";
So my question is, what approach would be best suited to take when dealing with strings? Would concatenating 2 strings together have any benefits over just having static strings in, say for example, performance or memory management?
E.g.
String one = "I went to the store and bought ";
String two = "milk";
String three = "cheese";
String four = one + two;
String five = one + three;
I am just trying to figure out the most optimal way of dealing with all these strings. (If it helps to put a number on the strings I am using, I currently have 50, but the number could grow by a huge amount.)
As spooky has said the main concern with the code is readability. Unless you are working on a program for a phone you do not need to manage your resources. That being said, it really doesn't matter whether you create a lot of Strings that stand alone or concatenate a base String with the small piece that varies. You won't really notice better performance either way.
You may set the opening sentence in a string like this
String openingSentence = "I went to the store and bought";
and define each varying word separately, in an array of strings like the following:
String[] thingsToBeBought = { "milk", "water", "cheese" .... };
then you can do foreach loop and concatenate each element in the array with the opening sentence.
In Java, if you concatenate two Strings (e.g. using '+') a new String is created, so the old memory needs to be garbage collected. If you want to concatenate strings, the correct way to do this is to use a StringBuilder or StringBuffer.
Given your comment about these strings really being URLs, you probably want to have a StringBuilder/StringBuffer that is the URL base, and then append the suffixes as needed.
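A rough sketch of the "URL base plus suffix" idea with a single reused StringBuilder. The base URL and suffixes are placeholders, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class UrlSuffixDemo {
    // Builds base + suffix for every suffix, reusing one StringBuilder.
    static List<String> urls(String base, String... suffixes) {
        StringBuilder sb = new StringBuilder(base);
        List<String> out = new ArrayList<>(suffixes.length);
        for (String suffix : suffixes) {
            sb.setLength(base.length()); // rewind to the shared prefix
            sb.append(suffix);
            out.add(sb.toString());      // each result is still its own String
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(urls("http://host/res/", "milk", "eggs"));
    }
}
```

For a handful of strings this is unlikely to beat plain + in any measurable way; it mainly avoids re-copying the prefix into a fresh builder on every iteration.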
Performance wise final static strings are always better as they are generated during compile time. Something like this
final static String s = "static string";
Non-static strings and strings concatenated as shown in the other example are generated at runtime. So even though performance will hardly matter for such a small thing, the second example is not as good as the first one performance-wise, as in your code:
// not as good performance wise since they are generated at runtime
String four = one + two;
String five = one + three;
Since you are going to use this string as a URL, I would recommend using StringJoiner (in case you are using Java 8). It will be as efficient as StringBuilder (it will not create a new string every time you perform concatenation) and will automatically add "/" between strings.
StringJoiner myJoiner = new StringJoiner("/");
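A short usage sketch for the StringJoiner suggestion. The path segments are placeholders and the class name is invented; java.util.StringJoiner itself is available from Java 8.

```java
import java.util.StringJoiner;

class JoinerDemo {
    // Joins the given segments with "/" between them.
    static String path(String... segments) {
        StringJoiner joiner = new StringJoiner("/");
        for (String segment : segments) {
            joiner.add(segment);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(path("host", "res", "milk"));
    }
}
```

For a fixed list of segments, String.join("/", segments) gives the same result in one call.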
There will be no discernible difference in performance, so the manner in which you go about this is more a matter of preference. I would likely declare the first part of the sentence as a String and store the individual purchase items in an array.
Example:
String action = "I went to the store and bought ";
String [] items = {"milk", "eggs", "cheese"};
for (int x = 0; x < items.length; x++) {
    System.out.println(action + items[x]);
}
Whether you declare every possible String or separate Strings to be concatenated isn't going to have any measurable impact on memory or performance in the example you give. In the extreme case of declaring truly large numbers of String literals, Java's native hash table of interned Strings will use more memory if you declare every possible String, because the table's cached values will be longer.
If you are concatenating Strings with the + operator across several statements (for example inside a loop), you will be creating extra String objects to be GC'd. For example if you have Strings a = "1" and b = "2", and do String s = "s"; s += a; s += b;, Java will first create the String "s1" and then a second String "s12". Avoid the intermediate String by using something like StringBuilder. (A single expression such as "s" + a + b is already compiled into one builder chain or an equivalent, and compile-time constants are folded by the compiler, so this only applies to runtime concatenations spread over several statements.)
If you happen to be formatting a String rather than simply concatenating, use a MessageFormat or String.format(). It's prettier and avoids the intermediate Strings created when using the + operator. So something like, String urlBase = "http://host/res?a=%s&b=%s"; String url = String.format(urlBase, a, b); where a and b are the query parameter String values.
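Both formatting options side by side, as a small sketch. The sentence pattern and URL template are taken from this thread; the class and method names are invented, and no URL-encoding of the parameters is attempted here.

```java
import java.text.MessageFormat;

class FormatDemo {
    // MessageFormat uses numbered placeholders such as {0}.
    static String sentence(String item) {
        return MessageFormat.format("I went to the store and bought {0}", item);
    }

    // String.format uses printf-style placeholders such as %s.
    static String url(String a, String b) {
        return String.format("http://host/res?a=%s&b=%s", a, b);
    }

    public static void main(String[] args) {
        System.out.println(sentence("milk"));
        System.out.println(url("1", "2"));
    }
}
```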

Creating unnecessary string objects

This is a bit of Java String 101. I came across this recently in some existing code. My initial reaction is that this is redundant
car.setDetails(new String(someStringBufferObj.toString()));
In my opinion even this would be redundant...
car.setDetails(new String(someOtherStringObj));
because String is immutable so there is never a risk that the car details would be changed accidentally (by changing someOtherStringObj) in a later line of code
Or am I wrong?
The first snippet above looks unnecessary. However the second may be necessary. Consider the following.
The constructor String(String) is useful since it'll take a copy of the underlying character array of the original string.
Why is this useful? You have to understand that a string object has a character array underlying it, and getting a substring() of an existing string actually uses that original character array. This is a flyweight pattern. Consider the following:
String s = longstring.substring(2,4);
The string s points to the character array underlying longstring (somewhat unintuitively). If you want to bin longstring (using garbage collection) the underlying character array won't be binned since s still references it, and you end up consuming a potentially huge amount of memory for a 2 character string.
The String(String) constructor resolves this by creating a new character array from that referenced by the string being used to construct from. When the original string is removed via garbage collection its character array won't be referenced by the substring() result and hence that will be removed too.
Note that this behaviour has changed very recently in Java (release 7u6, I think) and strings don't support the above mode of operation anymore.
You're absolutely right Rob, there's no need to new up a String in this instance. Just providing a call to someStringBufferObj.toString() should be sufficient!
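A quick way to convince yourself that toString() alone is enough: the String it returns is a snapshot and does not follow later changes to the buffer. The values and the class name below are made up for the demonstration.

```java
class ToStringSnapshotDemo {
    static String snapshotThenMutate() {
        StringBuffer buf = new StringBuffer("red car");
        String details = buf.toString(); // already an independent, immutable String
        buf.append(" (sold)");           // later changes to the buffer...
        return details;                  // ...do not show up in details
    }

    public static void main(String[] args) {
        System.out.println(snapshotThenMutate());
    }
}
```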

optimizing the java code for better memory

I need to write a function which will do the following functionalities
Note that this:
fqField.substring(quoteEnd+1, fqField.length());
uses the character array of the referenced string, rather than create a new string. That is, if I have a 100,000 character array and I take a 2 character substring of that, the substring will reference the original 100,000 chars. This is true even if you dispose of the reference to the original string.
If you do this:
new String(fqField.substring(quoteEnd+1, fqField.length()));
then this will create a new String, with a new underlying character array. You can then dispose of the original and you won't be consuming memory for the original.
The ArrayList "prefixes" which you're creating has the default size for a list. You could add a sensible size to it.
What about using char instead of String, is it an option for you to pass that as params?
How about making "prefixes" an array of String (or char) from the start, instead of making it an ArrayList first and converting it later.
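Since the original function isn't shown, here is only a hypothetical sketch of the list-sizing suggestion. The splitting rule (text before the first ':') and all names are invented purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixListDemo {
    // Collects the part before the first ':' of each field (hypothetical rule).
    static String[] collectPrefixes(String[] fields) {
        // Presize the list: there can be at most one prefix per field.
        List<String> prefixes = new ArrayList<>(fields.length);
        for (String field : fields) {
            int colon = field.indexOf(':');
            if (colon > 0) {
                // new String(...) keeps the prefix from pinning the whole
                // field on JVMs where substring() shares the backing array.
                prefixes.add(new String(field.substring(0, colon)));
            }
        }
        return prefixes.toArray(new String[0]);
    }
}
```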

Why is my hashset so memory-consuming?

I found out that my program's growing memory use is caused by the code below. I am currently reading a file that is about 7GB, and I believe the data stored in the HashSet is less than 10M, but my program's memory keeps increasing to 300MB and then it crashes with an OutOfMemoryError. If the HashSet is the problem, which data structure should I choose?
if (tagsStr != null) {
    if (tagsStr.contains("a") || tagsStr.contains("b") || tagsStr.contains("c")) {
        maTable.add(postId);
    }
} else {
    if (maTable.contains(parentId)) {
        //do sth else, no memories added here
    }
}
You haven't really told us what you're doing, but:
If your file is currently in something like ASCII, each character you read will be one byte in the file but two bytes in memory.
Each string will have an object overhead - this can be significant if you're storing lots of small strings
If you're reading lines with BufferedReader (or taking substrings from large strings), each one may have a large backing buffer - you may want to use maTable.add(new String(postId)) to avoid this
Each entry in the hash set needs a separate object to keep the key/hashcode/value/next-entry values. Again, with a lot of entries this can add up
In short, it's quite possible that you're doing nothing wrong, but a combination of memory-increasing factors are working against you. Most of these are unavoidable, but the third one may be relevant.
You've either got a memory leak or your understanding of the amount of string data that you are storing is incorrect. We can't tell which without seeing more of your code.
The scientific solution is to run your application using a memory profiler, and analyze the output to see which of your data structures is using an unexpectedly large amount of memory.
If I was to guess, it would be that your application (at some level) is doing something like this:
String line;
while ((line = br.readLine()) != null) {
// search for tag in line
String tagStr = line.substring(pos1, pos2);
// code as per your example
}
This uses a lot more memory than you'd expect. The substring(...) call creates a tagStr object that refers to the backing array of the original line string. Your tag strings that you expect to be short actually refer to a char[] object that holds all characters in the original line.
The fix is to do this:
String tagStr = new String(line.substring(pos1, pos2));
This creates a String object that does not share the backing array of the argument String.
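Put together, the read loop with the copy applied might look like the sketch below. The input format (a fixed-width key at the start of each line) is an assumption made up for the example, as are the class and method names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

class KeyReaderDemo {
    // Stores the first `width` characters of every line as a compact key.
    static Set<String> readKeys(String text, int width) {
        Set<String> keys = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.length() < width) {
                    continue; // too short to contain a key
                }
                // Copy so the stored key cannot keep the whole line alive
                // on JVMs where substring() shares the backing array.
                keys.add(new String(line.substring(0, width)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }
}
```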
UPDATE - this or something similar is an increasingly likely explanation ... given your latest data.
To expand on another of Jon Skeet's points, the overheads of a small String are surprisingly high. For instance, on a typical 32 bit JVM, the memory usage of a one character String is:
String object header: 2 words
String object fields: 3 words
Padding: 1 word (I think)
Backing array object header: 3 words
Backing array data: 1 word
Total: 10 words - 40 bytes - to hold one char of data ... or one byte of data if your input is in an 8-bit character set.
(This is not sufficient to explain your problem, but you should be aware of it anyway.)
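The tally above, written out as arithmetic. The per-part word counts are this answer's estimates (including the "I think" padding word), and the 4-byte word assumes a 32-bit JVM; the class name is invented.

```java
class SmallStringOverhead {
    static final int WORD_BYTES = 4; // 32-bit JVM

    static int words() {
        int stringHeader = 2;
        int stringFields = 3;
        int padding = 1;
        int arrayHeader = 3;
        int arrayData = 1;
        return stringHeader + stringFields + padding + arrayHeader + arrayData;
    }

    static int bytes() {
        return words() * WORD_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(words() + " words, " + bytes() + " bytes");
    }
}
```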
Couldn't it be possible that the data read into memory (from the 7G file) is somehow not freed? Something like what Jon says, i.e. since strings are immutable, every string read requires a new String object to be created, which might lead to running out of memory if the GC is not quick enough...
If the above is the case, then you might insert some 'breakpoints' into your code/iteration, i.e. at some defined points, request a GC and wait until it completes.
Run your program with -XX:+HeapDumpOnOutOfMemoryError. You'll then be able to use a memory analyser like MAT to see what is using up all of the memory - it may be something completely unexpected.

How to avoid out of memory in StringBuilder or String in Java

I am getting a lot of data from a webservice containing xml entity references. While replacing those with the respective characters I am getting an out of memory error. Can anybody give an example of how to avoid that? I have been stuck for two days on this problem.
This is my code:
public String decodeXMLData(String s)
{
    s = s.replaceAll("&gt;", ">");
    System.out.println("string value is"+s);
    s = s.replaceAll("&lt;", "<");
    System.out.println("string value1 is"+s);
    s = s.replaceAll("&amp;", "&");
    s = s.replaceAll("&quot;", "\"");
    s = s.replaceAll("&apos;", "'");
    s = s.replaceAll("&nbsp;", " ");
    return s;
}
You should use a SAX parser, not parse it on your own.
Just look in to these resources, they have code samples too:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
http://www.java-samples.com/showtutorial.php?tutorialid=152
http://www.totheriver.com/learn/xml/xmltutorial.html
Take a look at Apache Commons Lang | StringEscapeUtils.unescapeHtml.
Calling replaceAll six times creates up to six new String objects, so in total you are working with up to seven Strings. This is not an efficient way to XML-decode a string.
I recommend using a more robust implementation of XML encoding/decoding methods, like those contained in the Commons Lang library. In particular, StringEscapeUtils may help you to get your job done.
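If adding a library isn't an option, a single pass over the input avoids the chain of intermediate Strings. The following is only a sketch under stated assumptions: it handles just the five predefined XML entities, leaves anything else untouched, and is not a substitute for StringEscapeUtils or a real parser. The class and method names are invented.

```java
class EntityDecoder {
    // Decodes &lt; &gt; &amp; &quot; &apos; in one pass; everything else is copied as-is.
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '&') {
                int semi = s.indexOf(';', i);
                // The longest supported name has 4 letters, so cap the lookahead.
                if (semi != -1 && semi - i <= 5) {
                    String replacement = lookup(s.substring(i + 1, semi));
                    if (replacement != null) {
                        out.append(replacement);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Returns the character for a known entity name, or null if unknown.
    private static String lookup(String name) {
        switch (name) {
            case "lt":   return "<";
            case "gt":   return ">";
            case "amp":  return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default:     return null;
        }
    }
}
```

Unlike a sequence of replaceAll calls, this decodes each entity exactly once, so an input such as &amp;lt; comes out as the literal text &lt; rather than being decoded a second time.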
The method as shown would not be a source of out of memory errors (unless the string you are handling is as big as the remaining free heap).
What you could be running into is the fact that String.substring() calls do not allocate a new character array, but create a String object which re-uses the array of the string that substring is called on. If your code consists of reading large buffers and creating strings from those buffers, you might need to use new String(str.substring(index)) to force the string values to be reallocated into new small char arrays.
You can try increasing JVM memory, but that will only delay the inevitable if the problem is serious (i.e. if you're trying to claim gigabytes for example).
If you've got a single String that causes you to run out of memory trying to do this, it must be humongous :) Suggestion to use a SAX parser to handle it and print it in bits and pieces is a good one.
Or split it up into smaller bits yourself and send each of those to a routine that does what you want and discard the result afterwards.
