Memory usage of a large substring?

Reading the source code for String#substring (Java 1.7), it looks like it reuses the character array, but with a different offset and length. This means that if I have a giant String that I substring, the initial string will never be reclaimed by GC (right?).
What's the easiest way to make sure that the giant String is reclaimed? I am running JavaSE-1.7.
(For the curious, I'll be writing a radix tree implementation in Java to reduce memory usage. The answer to this question is essential to avoid the radix tree using more memory than necessary.)

For versions before JDK 7u6
You should use the String(String) constructor in that case:
public String(String original) {
int size = original.count;
char[] originalValue = original.value;
char[] v;
if (originalValue.length > size) {
// The array representing the String is bigger than the new
// String itself. Perhaps this constructor is being called
// in order to trim the baggage, so make a copy of the array.
int off = original.offset;
v = Arrays.copyOfRange(originalValue, off, off+size);
} else {
// The array representing the String is the same
// size as the String, so no point in making a copy.
v = originalValue;
}
this.offset = 0;
this.count = size;
this.value = v;
}
String s = "some really looooong text";
String s2 = new String(s.substring(0,3));
When you pass the result of s.substring() to the String constructor, the new String does not use the char[] of the original String, so the original String can be garbage collected. This is actually one of the few use cases for the String constructor; in most other cases we should simply assign String literals.
For JDK 7u6 and later
In JDK 7u6, the implementation of String.substring() was changed. It now calls the String(char value[], int offset, int count) constructor internally (which we had to use manually in older versions to avoid the memory leak). This constructor copies only the requested range into a new array instead of sharing the original String's value[] array. So from JDK 7u6 on, String.substring() does not cause this memory leak. Have a look at the source code of String.substring().
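
As a minimal sketch of the idiom described above: the class SubstringTrim and the method detach are hypothetical names used only for illustration. The helper wraps the substring in the String(String) constructor, which is the part that matters on pre-7u6 JVMs.

```java
// Hypothetical helper for illustration; not part of the JDK.
final class SubstringTrim {

    // Returns the characters in [begin, end) as a String that does not
    // keep a reference to the (possibly huge) source String.
    static String detach(String source, int begin, int end) {
        // Before JDK 7u6, substring() shares the source's char[]; the
        // String(String) constructor then copies only the needed range.
        // From 7u6 on, substring() already copies, so the extra
        // constructor call is redundant but harmless.
        return new String(source.substring(begin, end));
    }

    public static void main(String[] args) {
        String s = "some really looooong text";
        String s2 = detach(s, 0, 3);
        System.out.println(s2); // prints "som"
    }
}
```

On 7u6 and later a plain s.substring(0, 3) should give the same memory behavior, so the wrapper is only worth keeping if the code must also run on older JVMs.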

The original String will be garbage collected once nothing references it; the substring does not keep it alive. Here is part of the substring() method (JDK 1.7.0_51):
return ((beginIndex == 0) && (endIndex == value.length)) ? this
: new String(value, beginIndex, subLen);
So this method returns either a brand-new String object or, if beginIndex is 0 and endIndex equals the length, the original String itself. I guess you are concerned about the first case. There, the new String has nothing to do with the old one once it is created.

Related

Char array or String Builder

For character concatenation in Java, I'd like to know which of the methods below is better for readability, maintenance and performance: a char array or a StringBuilder.
The method has to take the first letter from each of the two strings, join them and return the result.
Eg:
Input 1: ABC Input 2: DEF -> method should return AD.
using string builder:
private String getString(String str1, String str2) {
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append(str1.charAt(0));
stringBuilder.append(str2.charAt(0));
return stringBuilder.toString();
}
using char array:
private String getString(String str1, String str2) {
char[] charArray = new char[2];
charArray[0] = str1.charAt(0);
charArray[1] = str2.charAt(0);
return String.valueOf(charArray);
}

StringBuilder is just a wrapper around a char[], adding functionality like resizing the array as necessary; and moving elements when you insert/delete etc.
It might be marginally faster to use the char[] directly for some things, but you'd lose (or have to reimplement) a lot of the useful functionality.

The char array is good in terms of performance and readability, but code like this is harder to maintain. It can fail with errors such as a NullPointerException, so you need to add a null check to the char[] code.
On the other hand, StringBuilder also uses a char[] internally. So the char array is better here, and by using it we also avoid creating an extra object, which is good from a memory point of view.

If you review the source code for StringBuilder, you will find that internally it uses a char[] to represent the buffered string. So both versions of your code are doing very similar things. However, I would vote for using StringBuilder, because it offers an API which can do much more than the plain char[] which sits inside its implementation.
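
A minimal sketch of the StringBuilder variant with the input checks mentioned in the answers above. The class name FirstLetters and the choice of IllegalArgumentException are assumptions made for illustration; the original method did no validation.

```java
// Illustrative sketch; class name and exception type are assumptions.
final class FirstLetters {

    // Returns the first character of each argument, concatenated.
    // Rejects null or empty input up front instead of failing later with
    // a NullPointerException or StringIndexOutOfBoundsException.
    static String firstLetters(String str1, String str2) {
        if (str1 == null || str2 == null || str1.isEmpty() || str2.isEmpty()) {
            throw new IllegalArgumentException("both strings must be non-empty");
        }
        // Capacity 2 is enough for the two characters we append.
        return new StringBuilder(2)
                .append(str1.charAt(0))
                .append(str2.charAt(0))
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(firstLetters("ABC", "DEF")); // prints "AD"
    }
}
```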

How to reverse a string which occupies 2/3 of the heap?

I recently had an interview where I was asked a strange (at least to me) question:
I should write a method which reverses a string.
public static String myReverse(String str){
...
}
The problem is that str is a very, very large object (2/3 of memory).
I can think of only one solution:
create a new String where I will store the result, then reverse 1/2 of the source String. After that, using reflection, clear the second half (already reversed) of the source string's underlying array and then continue reversing.
Am I right?
Are there any other solutions?

If you are using reflection anyway, you could access the underlying character array of the string and reverse it in place, by traversing from both ends and swapping the chars at each end.
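public static String myReverse(String str) throws ReflectiveOperationException {
// Assumes a JVM (e.g. Java 7u6 through 8) where String keeps its
// characters in a private char[] field named "value"
java.lang.reflect.Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
char[] content = (char[]) field.get(str);
// Swap characters from both ends, moving towards the middle
for (int a = 0, b = content.length - 1; a < b; a++, b--) {
char temp = content[b];
content[b] = content[a];
content[a] = temp;
}
return str;
}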
I unfortunately can't think of a way that doesn't use reflection.

A String is internally a 16-bit char array. If we know the character set to be ASCII, meaning each char maps to a single byte, we can encode the string to an 8-bit byte array at only 50% of the memory cost. This fully utilizes the available memory during the transition. Then we let go of the input string to reclaim 2/3 of the memory, reverse the byte array and reconstruct the string.
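public static String myReverse(String str) {
// StandardCharsets (Java 7+) avoids the checked UnsupportedEncodingException
byte[] bytes = str.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
// memory at full capacity
// dropping the reference only helps if the caller holds no other reference
str = null;
// memory at 1/3 capacity
for (int i = 0; i < bytes.length / 2; i++) {
byte tmp = bytes[i];
bytes[i] = bytes[bytes.length - i - 1];
bytes[bytes.length - i - 1] = tmp;
}
return new String(bytes, java.nio.charset.StandardCharsets.US_ASCII);
}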
This, of course, assumes you have a little extra memory available for temporary objects created by the encoding process, array headers, etc.

It's unlikely that you can do that without tricks like reflection, or without assuming that the String is stored in an efficient way (for example, knowing that it contains only ASCII characters). One problem in your way is that Strings in Java are immutable. The other is the likely implementation of garbage collection.
The problem with garbage collection is that memory is reclaimed only after the object can no longer be accessed. This means there would be a brief period in which both the input and the output of a transformation need to occupy memory.
For example, one could try to reverse the string by successively building the result and cutting down the original string:
rev = orig.substring(0,1) + rev;
orig = orig.substring(1);
But this relies on the previous incarnation of rev or orig being collected while the new incarnation is being created, so that together they never occupy more than 2/3 of the memory at the same time.
To be more general, one would study such a process. During the process there would be a set of objects that evolves, both the set itself and (some of) the objects in it. At the start the original string would be in the set, and at the end the reversed string would be there. It's clear that, because of the information content, the total size of the objects in the set can never be lower than that of the original. The crucial point is that the original string has to be deleted at some point. Before that time, at most 50% of the information may exist in the other objects. So we need a construct that deletes a String object at the same moment as it retains more than half of the information in it.
Such a construct would basically require you to call a method on an object that returns another object and, in the process, removes the first object while the result is being constructed. It's unlikely that the implementation works that way.
Your approach seems to rely on Strings being mutable somehow, and then there would be no problem in just reversing the string in place without using a lot of memory. You don't need to copy anything out; you can do the whole thing in place: swap [j] and [len-1-j] for all j < len/2.
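
The in-place swap described above can be sketched on a plain char[], which is mutable, unlike String. The class name InPlaceReverse is hypothetical; the sketch deliberately sidesteps how one would obtain the array from a String in the first place.

```java
// Illustrative sketch of the in-place swap on a mutable char[].
final class InPlaceReverse {

    // Reverses the array without allocating a second array.
    static void reverse(char[] chars) {
        int len = chars.length;
        // Swap position j with its mirror len-1-j; stopping at len/2
        // leaves the middle element of an odd-length array untouched.
        for (int j = 0; j < len / 2; j++) {
            char tmp = chars[j];
            chars[j] = chars[len - 1 - j];
            chars[len - 1 - j] = tmp;
        }
    }

    public static void main(String[] args) {
        char[] data = "abcde".toCharArray();
        reverse(data);
        System.out.println(new String(data)); // prints "edcba"
    }
}
```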

How can I efficiently use StringBuilder?

In the past, I've always used printf to format printing to the console but the assignment I currently have (creating an invoice report) wants us to use StringBuilder, but I have no idea how to do so without simply using " " for every gap needed. For example... I'm supposed to print this out
Invoice Customer Salesperson Subtotal Fees Taxes Discount Total
INV001 Company Eccleston, Chris $ 2357.60 $ 40.00 $ 190.19 $ -282.91 $ 2304.88
But I don't know how to get everything to line up using the StringBuilder. Any advice?

StringBuilder aims to reduce the overhead associated with creating strings.
As you may or may not know, strings are immutable. What this means is that something like
String a = "foo";
String b = "bar";
String c = a + b;
String d = c + c;
creates a new string for each line. If all we are concerned about is the final string d, the line with string c is wasting space because it creates a new String object when we don't need it.
StringBuilder simply delays actually building the String object until you call .toString(). At that point, it converts an internal char[] to an actual string.
Let's take another example.
String foo() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 100; i++)
sb.append(i);
return sb.toString();
}
Here, we only create one string. StringBuilder will keep track of the chars you have added to your string in its internal char[] value. Note that value.length will generally be larger than the total chars you have added to your StringBuilder, but value might run out of room for what you're appending if the string you are building gets too big. When that happens, it'll resize, which just means replacing value with a larger char[], and copying over the old values to the new array, along with the chars of whatever you appended.
Finally, when you call sb.toString(), the StringBuilder will call a String constructor that takes an argument of a char[].
That means only one String object was created, and we only needed enough memory for our char[] and to resize it.
Compare with the following:
String foo() {
String toReturn = "";
for (int i = 0; i < 100; i++)
toReturn += "" + i;
return toReturn;
}
Here, we have 101 string objects created (maybe more, I'm unsure). We only needed one though! This means that at every iteration, we're discarding the string that toReturn used to refer to and creating another string.
With a large string, especially, this is very expensive, because at every iteration you need to first acquire as much memory as the new string needs, and dispose of as much memory as the old string had. It's not a big deal when things are kept short, but when you're working with entire files this can easily become a problem.
In a nutshell: if you're appending or removing information before finalizing an output, use a StringBuilder. If your strings are very short, I think it is OK to just concatenate normally for convenience, but it is up to you to define what "short" is.
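
For the alignment part of the question, one possible approach is to format each row with String.format and append the result to the StringBuilder. This is only a sketch: the class name InvoiceReport, the reduced set of columns and the column widths are all assumptions chosen for illustration, not the layout the assignment requires.

```java
import java.util.Locale;

// Illustrative sketch; columns and widths are assumptions.
final class InvoiceReport {

    // %-8s pads a string to 8 characters, left-aligned; %11s right-aligns.
    private static final String HEADER = "%-8s %-10s %-18s %11s\n";
    // "$" plus a 10-wide number takes the same 11 characters as the header cell.
    private static final String ROW = "%-8s %-10s %-18s $%10.2f\n";

    static String render() {
        StringBuilder sb = new StringBuilder();
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        sb.append(String.format(Locale.ROOT, HEADER,
                "Invoice", "Customer", "Salesperson", "Total"));
        sb.append(String.format(Locale.ROOT, ROW,
                "INV001", "Company", "Eccleston, Chris", 2304.88));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

Because every cell is padded to a fixed width, the rows line up without counting spaces by hand; the remaining columns (Subtotal, Fees, Taxes, Discount) would follow the same pattern.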

Best way to modify an existing string? StringBuilder or convert to char array and back to string?

I'm learning Java and am wondering what's the best way to modify strings here (both for performance and to learn the preferred method in Java). Assume you're looping through a string and checking each character/performing some action on that index in the string.
Do I use the StringBuilder class, or convert the string into a char array, make my modifications, and then convert the char array back to a string?
Example for StringBuilder:
StringBuilder newString = new StringBuilder(oldString);
for (int i = 0; i < oldString.length() ; i++) {
newString.setCharAt(i, 'X');
}
Example for char array conversion:
char[] newStringArray = oldString.toCharArray();
for (int i = 0; i < newStringArray.length; i++) {
newStringArray[i] = 'X';
}
String myString = String.valueOf(newStringArray);
What are the pros/cons to each different way?
I take it that StringBuilder is going to be more efficient since the converting to a char array makes copies of the array each time you update an index.

I say do whatever is most readable/maintainable until you know that String "modification" is slowing you down. To me, this is the most readable:
String s = "foo";
s += "bar";
s += "baz";
If that's too slow, I'd use a StringBuilder. You may want to compare this to StringBuffer. If performance matters and synchronization does not, StringBuilder should be faster. If synchronization is needed, then you should use StringBuffer.
Also it's important to know that these strings are not being modified. In Java, Strings are immutable.
This is all context specific. If you optimize this code and it doesn't make a noticeable difference (and this is usually the case), then you just thought longer than you had to and you probably made your code more difficult to understand. Optimize when you need to, not because you can. And before you do that, make sure the code you're optimizing is the cause of your performance issue.

What are the pros/cons to each different way? I take it that StringBuilder is going to be more efficient since the converting to a char array makes copies of the array each time you update an index.
As written, the code in your second example will create just two arrays: one when you call toCharArray(), and another when you call String.valueOf() (String stores data in a char[] array). The element manipulations you are performing should not trigger any object allocations. There are no copies being made of the array when you read or write an element.
If you are going to be doing any sort of String manipulation, the recommended practice is to use a StringBuilder. If you are writing very performance-sensitive code, and your transformation does not alter the length of the string, then it might be worthwhile to manipulate the array directly. But since you are learning Java as a new language, I am going to guess that you are not working in high frequency trading or any other environment where latency is critical. Therefore, you are probably better off using a StringBuilder.
If you are performing any transformations that might yield a string of a different length than the original, you should almost certainly use a StringBuilder; it will resize its internal buffer as necessary.
On a related note, if you are doing simple string concatenation (e.g, s = "a" + someObject + "c"), the compiler will actually transform those operations into a chain of StringBuilder.append() calls, so you are free to use whichever you find more aesthetically pleasing. I personally prefer the + operator. However, if you are building up a string across multiple statements, you should create a single StringBuilder.
For example:
public String toString() {
return "{field1 =" + this.field1 +
", field2 =" + this.field2 +
...
", field50 =" + this.field50 + "}";
}
Here, we have a single, long expression involving many concatenations. You don't need to worry about hand-optimizing this, because the compiler will use a single StringBuilder and just call append() on it repeatedly.
String s = ...;
if (someCondition) {
s += someValue;
}
s += additionalValue;
return s;
Here, you'll end up with two StringBuilders being created under the covers, but unless this is an extremely hot code path in a latency-critical application, it's really not worth fretting about. Given similar code, but with many more separate concatenations, it might be worth optimizing. Same goes if you know the strings might be very large. But don't just guess--measure! Demonstrate that there's a performance problem before you try to fix it. (Note: this is just a general rule for "micro optimizations"; there's rarely a downside to explicitly using a StringBuilder. But don't assume it will make a measurable difference: if you're concerned about it, you should actually measure.)
String s = "";
for (final Object item : items) {
s += item + "\n";
}
Here, we're performing a separate concatenation operation on each loop iteration, which means a new StringBuilder will be allocated on each pass. In this case, it's probably worth using a single StringBuilder since you may not know how large the collection will be. I would consider this an exception to the "prove there's a performance problem before optimizing" rule: if the operation has the potential to explode in complexity based on input, err on the side of caution.
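
As a sketch of the single-StringBuilder version of that loop (the class and method names JoinLines and joinLines are hypothetical, and items is assumed to be any Iterable):

```java
import java.util.Arrays;

// Illustrative sketch; names are assumptions.
final class JoinLines {

    // Appends every item followed by a newline, using one builder for
    // the whole loop instead of one hidden builder per iteration.
    static String joinLines(Iterable<?> items) {
        StringBuilder sb = new StringBuilder();
        for (Object item : items) {
            // append(Object) calls String.valueOf(item), like "+" does.
            sb.append(item).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinLines(Arrays.asList("first", 2, 3.5)));
    }
}
```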

Which option will perform the best is not an easy question.
I did a benchmark using Caliper:
METHOD            RUNTIME (NS)
array             88
builder           126
builderTillEnd    76
concat            3435
Benchmarked methods:
public static String array(String input)
{
char[] result = input.toCharArray(); // COPYING
for (int i = 0; i < input.length(); i++)
{
result[i] = 'X';
}
return String.valueOf(result); // COPYING
}
public static String builder(String input)
{
StringBuilder result = new StringBuilder(input); // COPYING
for (int i = 0; i < input.length(); i++)
{
result.setCharAt(i, 'X');
}
return result.toString(); // COPYING
}
public static StringBuilder builderTillEnd(String input)
{
StringBuilder result = new StringBuilder(input); // COPYING
for (int i = 0; i < input.length(); i++)
{
result.setCharAt(i, 'X');
}
return result;
}
public static String concat(String input)
{
String result = "";
for (int i = 0; i < input.length(); i++)
{
result += 'X'; // terrible COPYING, COPYING, COPYING... same as:
// result = new StringBuilder(result).append('X').toString();
}
return result;
}
Remarks
If we want to modify a String, we have to do at least 1 copy of that input String, because Strings in Java are immutable.
java.lang.StringBuilder extends java.lang.AbstractStringBuilder. StringBuilder.setCharAt() is inherited from AbstractStringBuilder and looks like this:
public void setCharAt(int index, char ch) {
if ((index < 0) || (index >= count))
throw new StringIndexOutOfBoundsException(index);
value[index] = ch;
}
AbstractStringBuilder internally uses the simplest char array: char value[]. So, result[i] = 'X' is very similar to result.setCharAt(i, 'X'), however the second will call a polymorphic method (which probably gets inlined by JVM) and check bounds in if, so it will be a bit slower.
Conclusions
If you can operate on StringBuilder until the end (you don't need String back) - do it. It's the preferred way and also the fastest. Simply the best.
If you want a String in the end and this is the bottleneck of your program, then you might consider using a char array. In the benchmark, the char array version took about 30% less time than StringBuilder (88 ns vs. 126 ns). Be sure to properly measure the execution time of your program before and after the optimization, because there is no guarantee you will see the same gain.
Never concatenate Strings in a loop with + or += unless you really know what you are doing. Usually it's better to use an explicit StringBuilder and append().

I'd prefer to use the StringBuilder class when the original string is being modified.
For String manipulation, I like the StringUtils class. You'll need to add the Apache Commons Lang dependency to use it.

Please explain to me this snippet of code for String constructor in Java?

Here is one of the constructor for String object in Java:
public String(String original) {
int size = original.count;
char[] originalValue = original.value;
char[] v;
if (originalValue.length > size) {
// The array representing the String is bigger than the new
// String itself. Perhaps this constructor is being called
// in order to trim the baggage, so make a copy of the array.
int off = original.offset;
v = Arrays.copyOfRange(originalValue, off, off+size);
} else {
// The array representing the String is the same
// size as the String, so no point in making a copy.
v = originalValue;
}
this.offset = 0;
this.count = size;
this.value = v;
}
The line if (originalValue.length > size) is what I care about. I don't think this condition can ever be true, so the code inside the if block would never be executed. The String is in fact an array of characters. original.count should be equal to its value's length (its value is an array of characters), so the condition wouldn't happen.
I may be wrong, so I need your explanation. Thanks for your help.
VipHaLong.

The String is in fact an array of characters
No it's not. It's an object which internally has a reference to an array of characters.
original.count should be equal to its value's length (its value is an array of characters)
Not necessarily. It depends on the exact version of Java you're looking at, but until recently several strings could refer to the same char[], each using a different portion of the array.
For example, if you have:
String longString = "this is a long string";
String shortString = longString.substring(0, 2);
... the object referred to by shortString would use the same char[] that the original string referred to, but with a start offset of 0 and a count of 2. So if you then called:
String copyOfShortString = new String(shortString);
that would indeed go into the if block you were concerned about in your question.
As of Java 7 update 6, the Oracle JRE has changed to make substring always take a copy. (The pros and cons behind this can get quite complicated, but it's worth being aware of both systems.)
It looks like the version of code you're looking at is an older version where string objects could share an underlying array but view different portions.

The String implementation that you are looking at does not copy character data when you create a substring. Instead, multiple String objects can refer to the same character array but have different offset and count (and therefore length).
Therefore, the if condition can, in fact, be true.
Note that this sharing of character arrays has been removed in recent versions of the Oracle JDK.
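
To make the shared-array layout concrete, here is a toy model of it. SharedString is a hypothetical class written for this illustration and is not the real java.lang.String; it only mimics the value, offset and count fields shown in the question.

```java
import java.util.Arrays;

// Toy model of the old String layout; NOT the real java.lang.String.
final class SharedString {
    final char[] value; // may be shared with other instances
    final int offset;   // index of the first character in value
    final int count;    // number of characters in this string

    SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Mimics the old substring(): same array, different offset and count.
    SharedString substring(int begin, int end) {
        return new SharedString(value, offset + begin, end - begin);
    }

    // Mimics the String(String) constructor from the question.
    static SharedString copyOf(SharedString original) {
        char[] v = original.value;
        if (v.length > original.count) {
            // The array is bigger than the string: copy only the used part.
            v = Arrays.copyOfRange(v, original.offset, original.offset + original.count);
        }
        return new SharedString(v, 0, original.count);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        char[] chars = "this is a long string".toCharArray();
        SharedString longString = new SharedString(chars, 0, chars.length);
        SharedString shortString = longString.substring(0, 2);
        // The condition from the question is true for shortString.
        System.out.println(shortString.value.length > shortString.count); // prints "true"
        System.out.println(copyOf(shortString).value.length); // prints "2"
    }
}
```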
