About the String#substring() method - java

If we take a look at the implementation of the String#substring method:
new String(offset + beginIndex, endIndex - beginIndex, value);
We see that a new String is created that shares the original content (the char[] value parameter).
So the workaround is to use new String(toto.substring(...)) to drop the reference to the original char[] value and make it eligible for GC (if no more references exist).
I would like to know if there is a special reason that explains this implementation. Why doesn't the method itself create the new, shorter String, and why does it keep the full original value instead?
The other related question is: should we always use new String(...) when dealing with substring?

I would like to know if there is a special reason that explains this implementation. Why doesn't the method itself create the new, shorter String, and why does it keep the full original value instead?
Because in most use-cases it is faster for substring() to work this way. At least, that's what Sun / Oracle's empirical measurements would have shown. By doing this, the implementation avoids allocating a backing array and copying characters to the array.
This is only a non-optimization if you have to then copy the String to avoid a memory leakage problem. In the vast majority of cases, the substrings become garbage in a relatively short period of time, and there is no long-term leakage of memory.
Hypothetically, the Java designers could have provided two versions of substring, one which behaved as currently, and the other that created a String with its own backing array. But that would encourage the developer to waste brain-cycles thinking about which version to use. And then there's the problem of utility methods that build on substrings ... like the Pattern / Matcher classes for instance. So I think it is a good thing that they didn't.
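To make the trade-off concrete, here is a simplified model of a string that shares its backing array. It is only a sketch, not the JDK source: the class SharedSlice and its methods compact() and backingLength() are hypothetical names invented for illustration. substring() here allocates no new array, while compact() plays the role of new String(...) by copying just the used range.

```java
import java.util.Arrays;

// Hypothetical model of a string that shares its backing array (not the JDK class).
final class SharedSlice {
    private final char[] value;  // possibly shared with other slices
    private final int offset;    // index of the first used char in value
    private final int count;     // number of used chars

    SharedSlice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // No copying: the new slice points into the same array.
    SharedSlice substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedSlice(value, offset + beginIndex, endIndex - beginIndex);
    }

    // Copies only the used range, so the large array is no longer referenced.
    SharedSlice compact() {
        return new SharedSlice(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    // Length of the array this slice keeps alive.
    int backingLength() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

A five-character slice of an eleven-character original still reports a backingLength() of 11 until compact() is called, which is exactly the retention problem described in the question.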

Because String is an immutable class.
See also
http://javarevisited.blogspot.it/2010/10/why-string-is-immutable-in-java.html (Courtesy: Luca Geretti )

The reason for this implementation is efficiency. By pointing to the same char[] as the original string, no data needs to be copied.
This does have a downside though, as you've already hinted at yourself. If the original string is long and you just want to get a small part of it, and you don't need the original string anymore after that, then the complete original array is still referenced and can't be garbage collected. You already know how to avoid that - do new String(original.substring(...)).
should we always use new String(...) when dealing with substring?
No, not always. Only when you know it might cause a problem. In many cases, referring to the original char[] instead of copying the data is more efficient.
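As a small sketch of when the copy is worth it: a short token is kept for a long time while the long line it came from is discarded. The class and method names are hypothetical. Note that this only matters on JDK versions where substring shares the array; as far as I know, from JDK 7u6 onward substring copies the characters itself, which makes the extra new String(...) redundant there.

```java
// Hypothetical helper: keep a short token without pinning the long source line.
final class SubstringCopyDemo {

    // Returns the text before the first space, backed by its own array.
    static String firstToken(String line) {
        int space = line.indexOf(' ');
        String token = (space < 0) ? line : line.substring(0, space);
        // Explicit copy: on old JDKs this drops the reference to line's char[].
        return new String(token);
    }

    public static void main(String[] args) {
        String longLine = "GET /some/very/long/request/path?with=many&query=parameters HTTP/1.1";
        String method = firstToken(longLine);
        System.out.println(method); // GET
    }
}
```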

Related

What's the difference between StringBuilder and ArrayList<String>?

In an interview, I wanted to build up a new String from some substrings. I argued that ArrayList<String> is almost the same as StringBuilder, but the interviewer said I should always use StringBuilder if I need to deal with Strings. I think the time complexity of their add/remove operations is the same.
They aren't the same thing at all. StringBuilder builds a single string, while ArrayList<String> is just that--an array of separate strings. Of course, you can concatenate all of the array's strings with String.join("", list), where the first argument is the separator that you want to use, but why would you go that route instead of just using the class that was designed to do exactly what you're trying to do in the first place?
It all comes down to memory consumption. Every String is an object, so an ArrayList<String> holds many separate objects, while a StringBuilder holds only one.
StringBuilder has a member function to return the whole built string, whereas in ArrayList, you have to concatenate the strings yourself.
Unless you continue to need the separate elements you are adding to the list, you should use a StringBuilder.
After all, you can't directly get a concatenated string from the contents of the list: you have to put it in, say, a StringBuilder.
But in the specific case of building up a string of substrings, StringBuilder provides methods to allow you to append portions of Strings without using substring: the append(CharSequence, int, int) method is an optimization to avoid creating that extra string.
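A minimal sketch of that append(CharSequence, int, int) overload follows. The class name, method name, and the way the ranges are passed in are assumptions made for illustration.

```java
// Hypothetical example: copy several ranges of one string without calling substring.
final class AppendRangeDemo {

    // Each range is {start, end}, with end exclusive, as in substring.
    static String joinRanges(String source, int[][] ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] range : ranges) {
            // Appends source[start, end) directly; no intermediate String is created.
            sb.append(source, range[0], range[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] ranges = { {0, 5}, {6, 11} };
        System.out.println(joinRanges("hello world", ranges)); // helloworld
    }
}
```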
It should be mentioned that, at least when I have written Python, it has been considered better to build up a list and then use ''.join(theList) at the end, which is basically the analog of ArrayList<String>.
I don't know enough about Python to know why this is considered particularly better.
You can "build" strings using both. However, StringBuilder is a class specializing in building strings, with its append, insert, delete, charAt, etc. methods. An ArrayList is a general-purpose collection which lacks most of this functionality. Consider implementing the following (contrived example) with an ArrayList:
StringBuilder sb = new StringBuilder().append("time: ")
        .append(System.currentTimeMillis())
        .deleteCharAt(4)
        .reverse();
System.err.println(sb); // 3153067310451 emit
Ergonomics and readability aside, there are performance considerations but those are largely irrelevant on trivially sized examples.
If you need a single String at the end, performance and memory consumption are some differences for sure. Whenever you build a String from parts, in the good case you end up using StringBuilder, or in a slightly worse case StringBuffer, and in the worst case you end up concatenating two strings, then throw them away, and repeat - lots of allocations and garbage collection in this case.
JLS 12 still mentions StringBuffer by name for this optimization (but hopefully StringBuilder is used internally, as a similar technique):
An implementation may choose to perform conversion and concatenation in one step to avoid creating and then discarding an intermediate String object. To increase the performance of repeated string concatenation, a Java compiler may use the StringBuffer class or a similar technique to reduce the number of intermediate String objects that are created by evaluation of an expression.
In the particular case of having a List<String> and later using String.join() on it, StringJoiner contains the particular StringBuilder object which is going to be used.
So there will be a builder anyway, and then it may be more efficient to use it from the beginning.
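As a rough sketch of the two routes discussed above, both methods below produce the same String; the second simply skips the intermediate list-to-joiner step. The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical comparison of String.join on a list versus a StringBuilder.
final class JoinDemo {

    // Collect first, concatenate at the end.
    static String viaJoin(List<String> parts) {
        return String.join("", parts);
    }

    // Append each part to a single builder directly.
    static String viaBuilder(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }
}
```

If the parts are needed individually later, keeping the list still makes sense; otherwise the builder avoids holding every part until the end.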

Are Strings *really* immutable in Java?

Everyone knows that Java's String object is immutable, which essentially means that if you take String object a and concatenate it with another String object, say b, a totally new String object is created instead of an in-place concatenation.
However, I recently read a question on SO which says that the concatenation operator + is replaced with StringBuilder instances by the compiler.
At this point, I get a bit confused since, as far as I know and think, StringBuilder objects are mutable due to their internal structure. In this case, wouldn't this essentially make String objects in Java mutable?
Not really.
The actual Strings are still immutable, but at compile time the compiler can detect some situations where the creation of additional String objects can be replaced by a StringBuilder.
So if you declare a String a and concatenate it with another String, your a object doesn't change (since it's immutable), but the compiler optimizes this by replacing the concatenation with the instantiation of a StringBuilder, appending both Strings to the builder, and finally assigning the resulting String.
Let's say you have:
String a = "banana";
String d = a + "123" + "xpto";
Before this optimization, you would essentially be creating a relatively large number of Strings for something so simple, namely:
String a
String "123"
String "xpto"
String a + "123"
String a+"123"+"xpto"
With the optimization of transforming concatenation into a StringBuilder, the intermediate results of the concatenation no longer need to be created, so only the individual Strings and the resulting one are needed.
This is done basically for performance reasons, but keep in mind that in certain situations, you'll pay a huge penalty for this if you aren't careful. For instance:
String a = "";
for (String str : listOfStrings) {
    a += str;
}
If you were doing something like this, a new StringBuilder would be instantiated in each iteration, and this would be extremely costly if listOfStrings has a lot of elements. In this case, you should use a StringBuilder explicitly and do appends inside the loop instead of concatenating.
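A minimal sketch of the explicit-builder version of that loop follows. The wrapper class and method name are assumptions; listOfStrings is the name used in the snippet above.

```java
import java.util.List;

// Hypothetical rewrite of the loop with a single StringBuilder.
final class LoopConcatDemo {

    static String concat(List<String> listOfStrings) {
        // One builder for the whole loop instead of one per iteration.
        StringBuilder sb = new StringBuilder();
        for (String str : listOfStrings) {
            sb.append(str);
        }
        // A single String is created at the end.
        return sb.toString();
    }
}
```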
Strings are immutable objects. Once you concatenate it with another String, it becomes a new object. So remember - you don't change the existing String, you just create a new one.
Strings really are immutable. The compiler may generate code involving StringBuilder, but that is just an optimization which does not change how the code behaves (apart from performance). If there was some case where you could observe mutation (e.g. by keeping a reference to one of the intermediate results), the compiler would have to optimize in a way that still gives you an immutable string for that intermediate reference.
So if StringBuilder is used under the covers, even if you can't directly see the difference, doesn't that still mean that mutability is involved? Well, yes, but if you get down to it all the RAM in your PC is mutable. The memory of immutable objects can be moved around by the garbage collector, which definitely involves mutation as well. In the end, the important thing to you as a programmer is that this mutation is hidden from you, and you get a big promise that your program will behave the way you expect, i.e. you'll never see mutation in an immutable object (except in cases of severe problems, e.g. faulty RAM).
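The promise can be checked with a short sketch: a reference taken before the concatenation still sees the old value afterwards. The class and method names are hypothetical.

```java
// Hypothetical check that += rebinds the variable instead of mutating the String.
final class ImmutabilityDemo {

    // Returns {value seen through the old reference, value after concatenation}.
    static String[] concatKeepsOriginal() {
        String a = "banana";
        String before = a;   // second reference to the same object
        a += "123";          // a now refers to a new String; the old one is untouched
        return new String[] { before, a };
    }

    public static void main(String[] args) {
        String[] result = concatKeepsOriginal();
        System.out.println(result[0]); // banana
        System.out.println(result[1]); // banana123
    }
}
```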

No reverse method in String class in Java?

Why is there no reverse method in the String class in Java? Why is reverse() provided in StringBuilder instead? Is there a reason for this? String has split(), regionMatches(), etc., which are more complex than reverse().
When they added these methods, why not add reverse()?
Since you have it in StringBuilder, there's no need for it in String, right? :-)
Seriously, when designing an API there's lots of things you could include. The interfaces are however intentionally kept small for simplicity and clarity. Google on "API design" and you'll find tons of pages agreeing on this.
Here's how you do it if you actually need it:
str = new StringBuilder(str).reverse().toString();
Theoretically, String could offer it and just return the correct result as a new String. It's just a design choice, when you get down to it, on the part of the Java base libraries.
If you want a historical reason: Strings are immutable in Java, that is, you cannot change a given String without creating another String.
While this is not bad "per se", initial versions of Java missed classes like StringBuilder. Instead, String itself contained (and still contains) a lot of methods to "alter" the String, but since String is immutable, each of these methods actually creates and returns a NEW String object.
This caused simple expressions like:
String s = "a" + anotherString.substring(10, 15).trim().toLowerCase();
to actually create something like 5 strings in RAM, 4 of which are absolutely useless, with obvious performance problems (even though there have since been some optimizations regarding the underlying char[] arrays).
To solve this, Sun introduced StringBuilder and other classes that ARE NOT immutable. These classes freely modify a single char[] array, so that calling methods does not need to produce many intermediate String instances.
They added "reverse" quite late, so they added it to StringBuilder instead of String, because that's now the preferred way to manipulate strings.
As a side note, in Scala you use the same java.lang.String class and you do get a reverse method (along with all kinds of other handy stuff). The way it does it is with implicit conversions, so that your String gets automatically converted into a class that does have a reverse method. It's really quite clever, and removes the need to bloat the base class with hundreds of methods.
String is immutable, meaning it can't be changed.
When you reverse a String one letter at a time, each letter is moved on its own, which means a new object is created at every step.
For example, Hello goes through the following values:
elloH lloeH loleH olleH
and you end up with 4 new String objects on the heap.
So if you have strings with thousands of letters or more, think how many objects would be created. It would be very expensive, and too much memory would be occupied.
This is why the String class does not have a reverse() method.
Well I think it could be because it is an immutable class so if we had a reverse method it would actually create a new object.
reverse() acts on this, modifying the current object, and String objects are immutable - they can't be modified.
It's peculiarly efficient to do reverse() in situ - the size is known to be the same, so no allocation is necessary, there are half as many loop iterations as there would be in a copy, and, for large strings, memory locality is optimal. From looking at the code, one can see that a lot of care was taken to make it fast. I suspect the author(s) had a particular use case in mind that demanded high performance.
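The "half as many loop iterations" point can be sketched with a plain in-place swap loop. This is a simplified illustration with hypothetical names, not the JDK's code; in particular it does not treat surrogate pairs specially.

```java
// Hypothetical in-place reversal of a char[]: length / 2 swaps, no allocation.
final class InPlaceReverse {

    static void reverse(char[] chars) {
        // Walk inwards from both ends, swapping until the indices meet.
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
    }

    public static void main(String[] args) {
        char[] chars = "Hello".toCharArray();
        reverse(chars);
        System.out.println(new String(chars)); // olleH
    }
}
```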

Why isn't String() constructor private?

Is using a String() constructor rather than a string literal beneficial in any scenario?
Using string literals enables reuse of existing objects, so why do we need the public constructor? Is there any real-world use?
For example, both of these literals point to the same object.
String name1 = "name"; // new String("name") creates a new object.
String name2 = "name";
One example where the constructor has a useful purpose: Strings created by String.substring() share the underlying char[] of the String they're created from. So if you have a String of length 10,000,000 (taking up 20 MB of memory), take its first 5 characters as a substring, and then discard the original String, that substring will still keep the 20 MB object from being eligible for garbage collection. Using the String constructor on it will avoid that, as it makes a copy of only the part of the underlying char array that's actually used by the String instance.
Of course, if you create and use a lot of substrings of the same String, especially if they overlap, then you'd very much want them to share the underlying char[], and using the constructor would be counterproductive.
Since String is immutable, operations like substring keep a reference to the original string, which might be long. Using the constructor to create a new string from the substring makes a copy and allows the old string to be dropped (and garbage collected). This can free up needed memory.
Example:
String longS = "very long";
String shortS = new String(longS.substring(4));
Because sometimes you might want to create a copy and not just have a new reference to the same string.
All good answers here, but I think it's worth pointing out that Java treats literals quite differently from Strings constructed the traditional way.
This is a good Q/A about it.
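The difference between literals and constructed Strings can be seen in a short sketch. The class and method names are hypothetical; the == comparisons are deliberate here because the point is object identity, not content.

```java
// Hypothetical demonstration of literal pooling versus the String constructor.
final class LiteralVsNewDemo {

    // Two identical literals refer to the same pooled object.
    static boolean literalsShareObject() {
        String name1 = "name";
        String name2 = "name";
        return name1 == name2;
    }

    // The constructor always creates a distinct object with equal content.
    static boolean constructorMakesNewObject() {
        String literal = "name";
        String copy = new String("name");
        return literal != copy && literal.equals(copy);
    }

    // intern() maps a constructed String back to the pooled instance.
    static boolean internReturnsPooled() {
        return new String("name").intern() == "name";
    }

    public static void main(String[] args) {
        System.out.println(literalsShareObject());       // true
        System.out.println(constructorMakesNewObject()); // true
        System.out.println(internReturnsPooled());       // true
    }
}
```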

String indexed collection in Java

Using Java, assuming v1.6.
I have a collection where the unique index is a string and the non-unique value is an int.
I need to perform thousands of lookups against this collection as fast as possible.
I currently am using a HashMap<String, Integer> but I am worried that the boxing/unboxing of the Integer to int is making this slower.
I had thought of using an ArrayList<String> coupled with an int[].
i.e. instead of:
int value = (int) map.get("key"); // map is a HashMap<String, Integer>
I could do
int value = values[keys.indexOf("key")]; // keys is an ArrayList<String>, values is an int[]
Any thoughts? Is there a faster way to do this?
p.s. I will only build the collection once and maybe modify it once but each time I will know the size so I can use String[] instead of ArrayList but not sure there's a faster way to replicate indexOf...
Unboxing is fast - no allocations are required. Boxing is potentially slower, as it needs to allocate a new object (unless it uses a pooled one).
Are you sure you've really got a problem though? Don't complicate your code until you've actually proved that this is a significant hit. I very much doubt that it is.
There are collection libraries available for primitive types, but I would stick to the normal HashMap from the JRE until you've profiled and checked that this is causing a problem. If it really only is thousands of lookups, I very much doubt it'll be a problem at all. Likewise if you're lookup-based rather than addition-based (i.e. you fetch more often than you add) then the boxing cost won't be particularly significant, just unboxing, which is cheap.
I would suggest using intValue() rather than the cast to convert the value to an int though - it makes it clearer (IMO) what's going on.
EDIT: To answer the question in the comment, HashMap.get(key) will be faster than ArrayList.indexOf(key) when the collection is large enough. If you've actually only got five items, the list may well be faster. I assume that's not really the case though.
If you really, really don't want the boxing/unboxing, try Trove (TObjectIntHashMap). There's also COLT to consider, but I couldn't find the right type in there.
Any performance gain that you get from not having to box/unbox is more than erased by the for loop that comes with the indexOf method.
Use the HashMap. Also you don't need the (int) cast, the compiler will take care of it for you.
The array thing would be ok with a small number of items in the array, but then so is the HashMap...
The only way you could make it fast to look up in an array (and this is not a real suggestion as it has too many issues) is if you use the hashCode of the String to work with as the index into the array - don't even think about doing that though! (I only mention it because you might find something via google that talks about it... if they don't explain why it is bad don't read any more about it!)
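One of those issues can be shown in a few lines: distinct strings can have the same hashCode, so two keys would land on the same array slot (hash codes can also be negative or far larger than any practical array). The class and method names below are hypothetical; the values follow from the hashCode formula documented for String.

```java
// Hypothetical demonstration that String.hashCode() is not a unique index.
final class HashCollisionDemo {

    // True when two different strings share a hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());     // 2112
        System.out.println("BB".hashCode());     // 2112
        System.out.println(collide("Aa", "BB")); // true
    }
}
```

A HashMap deals with this by comparing keys with equals() inside each bucket, which a bare array indexed by hashCode does not do.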
I would guess that the HashMap would give a much faster lookup, but I think this needs some benchmarking to answer correctly.
EDIT: Furthermore, There is no boxing involved, merely unboxing of the already-stored objects, which should be pretty fast, since no object allocation is done in that step. So, I don't think this would give you any more speed, but you should run benchmarks nonetheless.
I think scanning your ArrayList to find the match for your "key" is going to be much slower than your boxing/unboxing concerns.
Since you say it is indeed a bottleneck, I'll suggest Primitive Collections for Java; in particular, the ObjectKeyIntMap looks like exactly what you want.
If the cost of building the map once and once only doesn't matter, you might want to look at perfect hashing, for example Bob Jenkins' code.
One slight problem here: You can have duplicate elements in a List. If you really want to do it the second way, consider using a Set instead.
Having said that, have you done a performance test on the two to see if one is faster than the other?
Edit: Of course, the most popular Set type (HashSet) is itself backed by a HashMap, so switching to a set may not be such a wise change after all.
List.indexOf will do a linear scan of the list - O(n) typically. A binary search will do the job in O(log n). A hash table will do it in O(1).
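The three lookup strategies can be put side by side in a small sketch. Everything here is hypothetical: the sample keys and values, the class and method names, and the convention of returning -1 for a missing key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical comparison of linear, binary, and hashed lookup of an int by String key.
final class LookupDemo {
    // Keys are kept sorted so that binary search is valid; VALUES[i] belongs to KEYS[i].
    static final String[] KEYS = { "apple", "banana", "cherry" };
    static final int[] VALUES = { 10, 20, 30 };

    // O(n): scans the keys from the start.
    static int linear(String key) {
        int i = Arrays.asList(KEYS).indexOf(key);
        return i < 0 ? -1 : VALUES[i];
    }

    // O(log n): requires the keys to be sorted.
    static int binary(String key) {
        int i = Arrays.binarySearch(KEYS, key);
        return i < 0 ? -1 : VALUES[i];
    }

    // O(1) on average: the stored Integer is unboxed on the way out.
    static int hashed(Map<String, Integer> map, String key) {
        Integer value = map.get(key);
        return value == null ? -1 : value;
    }

    // Builds the map once from the parallel arrays.
    static Map<String, Integer> buildMap() {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < KEYS.length; i++) {
            map.put(KEYS[i], VALUES[i]);
        }
        return map;
    }
}
```

All three return the same answers; only the cost per lookup differs as the collection grows, so benchmarking on realistic data is still the deciding step.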
Having large numbers of Integer objects in memory could be a problem. But then the same is true for Strings (both the String and char[]). You could do your own custom DB-style implementation, but I suggest benchmarking first.
The map access does not do unboxing for the lookup, only the later access to the result makes it slow.
I suggest introducing a small wrapper with a getter for the int, such as SimpleInt. It holds the int without conversion. The constructor is not expensive and overall it is cheaper than an Integer.
public class SimpleInt
{
    private final int data;

    public SimpleInt(int i)
    {
        data = i;
    }

    // getter here
    public int get()
    {
        return data;
    }
}
