Pattern.compile.split vs StringBuilder iteration and substring

Pattern.compile.split vs StringBuilder iteration and substring - java

I have to split a very large string in the fastest way possible and from what research i did i narrow it down to 2 possibilities:
1.Pattern.compile("[delimiter]").split("[large_string]");
2. Iterate through StringBuilder and call substring
StringBuilder sb = new StringBuilder("[large_string]");
ArrayList<String> pieces = new ArrayList<String>();
int pos = 0;
int currentPos;
while((currentPos = sb.indexOf("[delimiter]", pos)) != -1){
pieces.add(sb.substring(pos, currentPos));
pos = currentPos+"[delimiter]".length();
}
Any help is appreciated , i will benchmark them but i'm more interested in the theoretic part : why is one faster then the other .
Furthermore if you have other suggestions please post them.
UPDATE: So as I said I've done the benchmark , generated 5 mil strings each having 32 chars , these were put in a single string delimited by ~~ :
StringBuilder approach , surprisingly , was the slowest with an avg of 2.50-2.55 sec
Pattern.compile.split come on 2nd place with an avg of 2.47-2.49 sec
Splitter by Guava was the undisputed winner with an avg of 1.12-1.18 sec half the time of others (special thanks to fge who suggested it)
Thank you all for the help!

If your string is large, something to consider is whether any copies are made. If you don't use StringBuilder but use the plain String#substring(from,to), then no copies will be made of the contents of the string. There will be 1 instance of the whole String, and it will stick around as long as at least 1 substring persists.
Hmm... Source perusal of the Pattern class shows that split does the same thing, while the source of the StringBuilder shows that copies are made for each substring.

If this is a fixed pattern, and you do not need a regex, you might want to consider Guava's Splitter. It is very well written and performs admirably:
private static final Splitter SPLITTER = Splitter.on("myDelimiterHere");
Also, unlike .split(), you don't get nasty surprises with empty strings at the end... (you must pass a negative integer as an argument in order for it to do a "real" split)
You will also see that this class' .split() method returns an Iterable<CharSequence>; when the string is REALLY large, it only makes the necessary copies you ask it to make!

If you have to use it multiple times, a static object of your Pattern would be the choice. Look into the StringBuilder. The method indexOf is doing the same, iterating through all characters. Internally the String.split() method is also using Pattern to compile and split the string. Use the given methods and you should have the best performance...

Related

Why would you use a StringBuilder method over a String in Java? [duplicate]

This question already has answers here:
Why StringBuilder when there is String?
(9 answers)
Closed 1 year ago.
What are the benefits of using a StringBuilder method over a String? Why not just amend the content within a String?
I understand that a StringBuilder is mutable, but if you have to write more lines of code to append onto it, why not change the original String?
I'd appreciate it if someone would provide an example of where a StringBuilder would be more beneficial.

You cannot change the original string because it is immutable therefore having String s = ""; every operation like
s += "something";
will create and reassign new object (probably it will also add a little bit of work for GC in near future). On he other hand modifying StringBuilder is (usually) not creating new object (indeed it is happening just once at the very end when calling toString() method on builder instance)
Because of this it is common to use StringBuilder when you are modifying string many many times (for example in some long loops).
Still it is common error to overuse StringBuilder - it may be example of premature optimization
Read also:
Is it better to reuse a StringBuilder in a loop?

You're on the right track by understanding the immutability of the String class.
Based on [1] and [2], here are some cases where each type of implementation is recommended:
1. Simple String Concatenation
String answer = firstPart + "." + secondPart;
This is syntactic sugar for
String answer = new StringBuilder(firstPart).append("."). append(secondPart).toString();
This is actually quite performant and is the recommended approach for simple string concatenation [1].
2. Stepwise Construction
String answer = firstPart;
answer += ".";
answer += secondPart;
Under the hood, this translates to
String answer = new StringBuilder(firstPart).toString();
answer = new StringBuilder(answer).append(".").toString();
answer = new StringBuilder(answer).append(secondPart).toString();
This creates a temporary StringBuilder and intermediate String objects which are inefficient [1]. Especially if the intermediate results are not used.
Use StringBuilder in this case.
3. For Loop Construction and Scaling For Larger Collections
String result = "";
for(int i = 0; i < numItems(); i++)
result += lineItem(i);
return result;
The above code is O(n^2), where n is number of strings. This is due to the immutability of the String class and due to the the fact that when concatenating two strings, the contents of both are copied [2].
So it may be fine for a few fixed length items, but it will not scale.
In such cases, use StringBuilder.
StringBuilder sb = new StringBuilder(numItems() * LINE_SIZE);
for(int i = 0; i < numItems(); i++)
sb.append(lineItem(i));
return b.toString();
This code is O(n) time, where n is number of items or strings.
So as the number of strings gets larger, you will see the difference in performance [2].
This code pre-allocates an array in the initialization of StringBuilder, but even if a default size array is used, it will be significantly faster than the previous code for a large number of items [2].
Summary
Use string concatenation if you are concatenating only a few strings or if performance is not of importance (i.e. a demonstration/toy-application). Otherwise, use StringBuilder or consider processing the string as a character array [2].
References:
[1] Java Performance: The Definitive Guide by Scott Oaks: Link
[2] Effective Java 3rd Edition by Joshua Bloch: Link

String.substring vs String[].split

I have a comma delaminated string that when calling String.split(",") it returns an array size of about 60. In a specific use case I only need to get the value of the second value that would be returned from the array. So for example "Q,BAC,233,sdf,sdf," all I want is the value of the string after the first ',' and before the second ','. The question I have for performance am I better off parsing it myself using substring or using the split method and then get the second value in the array? Any input would be appreciated. This method will get called hundreds of times a second so it's important I understand the best approach regarding performance and memory allocation.
-Duncan

Since String.Split returns a string[], using a 60-way Split would result in about sixty needless allocations per line. Split goes through your entire string, and creates sixty new object plus the array object itself. Of these sixty one objects you keep exactly one, and let garbage collector deal with the remaining sixty.
If you are calling this in a tight loop, a substring would definitely be more efficient: it goes through the portion of your string up to the second comma ,, and then creates one new object that you keep.
String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);
The above prints brown
When you run this multiple times, the substring wins on time hands down: 1,000,000 iterations of split take 3.36s, while 1,000,000 iterations of substring take only 0.05s. And that's with only eight components in the string! The difference for sixty components would be even more drastic.

ofcourse why iterate through whole string, just use substring() and indexOf()

You are certainly better off doing it by hand for two reasons:
.split() takes a string as an argument, but this string is interpreted as a Pattern, and for your use case Pattern is costly;
as you say, you only need the second element: the algorithm to grab that second element is simple enough to do by hand.

I would use something like:
final int first = searchString.indexOf(",");
final int second = searchString.indexOf(",", first+1);
String result= searchString.substring(first+1, second);

My first inclination would be to find the index of the first and second commas and take the substring.
The only real way to tell for sure, though, is to test each in your particular scenario. Break out the appropriate stopwatch and measure the two.

String concatenation in Java - when to use +, StringBuilder and concat [duplicate]

This question already has answers here:
StringBuilder vs String concatenation in toString() in Java
(20 answers)
Closed 8 years ago.
When should we use + for concatenation of strings, when is StringBuilder preferred and When is it suitable to use concat.
I've heard StringBuilder is preferable for concatenation within loops. Why is it so?
Thanks.

Modern Java compiler convert your + operations by StringBuilder's append. I mean to say if you do str = str1 + str2 + str3 then the compiler will generate the following code:
StringBuilder sb = new StringBuilder();
str = sb.append(str1).append(str2).append(str3).toString();
You can decompile code using DJ or Cavaj to confirm this :)
So now its more a matter of choice than performance benefit to use + or StringBuilder :)
However given the situation that compiler does not do it for your (if you are using any private Java SDK to do it then it may happen), then surely StringBuilder is the way to go as you end up avoiding lots of unnecessary String objects.

I tend to use StringBuilder on code paths where performance is a concern. Repeated string concatenation within a loop is often a good candidate.
The reason to prefer StringBuilder is that both + and concat create a new object every time you call them (provided the right hand side argument is not empty). This can quickly add up to a lot of objects, almost all of which are completely unnecessary.
As others have pointed out, when you use + multiple times within the same statement, the compiler can often optimize this for you. However, in my experience this argument doesn't apply when the concatenations happen in separate statements. It certainly doesn't help with loops.
Having said all this, I think top priority should be writing clear code. There are some great profiling tools available for Java (I use YourKit), which make it very easy to pinpoint performance bottlenecks and optimize just the bits where it matters.
P.S. I have never needed to use concat.

From Java/J2EE Job Interview Companion:
String
String is immutable: you can’t modify a String object but can replace it by creating a new instance. Creating a new instance is rather expensive.
//Inefficient version using immutable String
String output = "Some text";
int count = 100;
for (int i = 0; i < count; i++) {
output += i;
}
return output;
The above code would build 99 new String objects, of which 98 would be thrown away immediately. Creating new objects is not efficient.
StringBuffer/StringBuilder
StringBuffer is mutable: use StringBuffer or StringBuilder when you want to modify the contents. StringBuilder was added in Java 5 and it is identical in all respects to StringBuffer except that it is not synchronised, which makes it slightly faster at the cost of not being thread-safe.
//More efficient version using mutable StringBuffer
StringBuffer output = new StringBuffer(110);
output.append("Some text");
for (int i = 0; i < count; i++) {
output.append(i);
}
return output.toString();
The above code creates only two new objects, the StringBuffer and the final String that is returned. StringBuffer expands as needed, which is costly however, so it would be better to initialise the StringBuffer with the correct size from the start as shown.

If all concatenated elements are constants (example : "these" + "are" + "constants"), then I'd prefer the +, because the compiler will inline the concatenation for you. Otherwise, using StringBuilder is the most effective way.
If you use + with non-constants, the Compiler will internally use StringBuilder as well, but debugging becomes hell, because the code used is no longer identical to your source code.

My recommendation would be as follows:
+: Use when concatenating 2 or 3 Strings simply to keep your code brief and readable.
StringBuilder: Use when building up complex String output or where performance is a concern.
String.format: You didn't mention this in your question but it is my preferred method for creating Strings as it keeps the code the most readable / concise in my opinion and is particularly useful for log statements.
concat: I don't think I've ever had cause to use this.

Use StringBuilder if you do a lot of manipulation. Usually a loop is a pretty good indication of this.
The reason for this is that using normal concatenation produces lots of intermediate String object that can't easily be "extended" (i.e. each concatenation operation produces a copy, requiring memory and CPU time to make). A StringBuilder on the other hand only needs to copy the data in some cases (inserting something in the middle, or having to resize because the result becomes to big), so it saves on those copy operations.
Using concat() has no real benefit over using + (it might be ever so slightly faster for a single +, but once you do a.concat(b).concat(c) it will actually be slower than a + b + c).

Use + for single statements and StringBuilder for multiple statements/ loops.

The performace gain from compiler applies to concatenating constants.
The rest uses are actually slower then using StringBuilder directly.
There is not problem with using "+" e.g. for creating a message for Exception because it does not happen often and the application si already somehow screwed at the moment. Avoid using "+" it in loops.
For creating meaningful messages or other parametrized strings (Xpath expressions e.g.) use String.format - it is much better readable.

I suggest to use concat for two string concatination and StringBuilder otherwise, see my explanation for concatenation operator (+) vs concat()

Why are most string manipulations in Java based on regexp?

In Java there are a bunch of methods that all have to do with manipulating Strings.
The simplest example is the String.split("something") method.
Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s). Which makes then all very powerful building blocks.
Now there are two effects you'll see in many of those methods:
They recompile the expression each time the method is invoked. As such they impose a performance impact.
I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: It's usually called with a single char (usually a ' ', a ';' or a '&') to split by.
So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if it was known to be a single char. Both are significantly faster than the "standard" split method.
So I was wondering: why was the Java API chosen the way it is now?
What was the good reason to go for this instead of having a something like split(char) and split(String) and a splitRegex(String) ??
Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.
Short summary: It makes a big difference!
I did 10000000 iterations for each test case, always using the input
"aap,noot,mies,wim,zus,jet,teun"
and always using ',' or "," as the split argument.
This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):
fastSplit STRING
Test 1 : 11405 milliseconds: Split in several pieces
Test 2 : 3018 milliseconds: Split in 2 pieces
Test 3 : 4396 milliseconds: Split in 3 pieces
homegrown fast splitter based on char
Test 4 : 9076 milliseconds: Split in several pieces
Test 5 : 2024 milliseconds: Split in 2 pieces
Test 6 : 2924 milliseconds: Split in 3 pieces
homegrown splitter based on char that always splits in 2 pieces
Test 7 : 1230 milliseconds: Split in 2 pieces
String.split(regex)
Test 8 : 32913 milliseconds: Split in several pieces
Test 9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces
String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces
StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces
As you can see it makes a big difference if you have a lot of "fixed char" splits to do.
To give you guys some insight; I'm currently in the Apache logfiles and Hadoop arena with the data of a big website. So to me this stuff really matters :)
Something I haven't factored in here is the garbage collector. As far as I can tell compiling a regular expression into a Pattern/Matcher/.. will allocate a lot of objects, that need to be collected some time. So perhaps in the long run the differences between these versions is even bigger .... or smaller.
My conclusions so far:
Only optimize this if you have a LOT of strings to split.
If you use the regex methods always precompile if you repeatedly use the same pattern.
Forget the (obsolete) StringTokenizer
If you want to split on a single char then use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).
P.S. I'm giving you all my homegrown split by char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them .. yet. Have fun.
private static String[]
stringSplitChar(final String input,
final char separator) {
int pieces = 0;
// First we count how many pieces we will need to store ( = separators + 1 )
int position = 0;
do {
pieces++;
position = input.indexOf(separator, position + 1);
} while (position != -1);
// Then we allocate memory
final String[] result = new String[pieces];
// And start cutting and copying the pieces.
int previousposition = 0;
int currentposition = input.indexOf(separator);
int piece = 0;
final int lastpiece = pieces - 1;
while (piece < lastpiece) {
result[piece++] = input.substring(previousposition, currentposition);
previousposition = currentposition + 1;
currentposition = input.indexOf(separator, previousposition);
}
result[piece] = input.substring(previousposition);
return result;
}
private static String[]
stringSplitChar(final String input,
final char separator,
final int maxpieces) {
if (maxpieces <= 0) {
return stringSplitChar(input, separator);
}
int pieces = maxpieces;
// Then we allocate memory
final String[] result = new String[pieces];
// And start cutting and copying the pieces.
int previousposition = 0;
int currentposition = input.indexOf(separator);
int piece = 0;
final int lastpiece = pieces - 1;
while (currentposition != -1 && piece < lastpiece) {
result[piece++] = input.substring(previousposition, currentposition);
previousposition = currentposition + 1;
currentposition = input.indexOf(separator, previousposition);
}
result[piece] = input.substring(previousposition);
// All remaining array elements are uninitialized and assumed to be null
return result;
}
private static String[]
stringChop(final String input,
final char separator) {
String[] result;
// Find the separator.
final int separatorIndex = input.indexOf(separator);
if (separatorIndex == -1) {
result = new String[1];
result[0] = input;
}
else {
result = new String[2];
result[0] = input.substring(0, separatorIndex);
result[1] = input.substring(separatorIndex + 1);
}
return result;
}

Note that the regex need not be recompiled each time. From the Javadoc:
An invocation of this method of the form str.split(regex, n) yields the same result as the expression
Pattern.compile(regex).split(str, n)
That is, if you are worried about performance, you may precompile the pattern and then reuse it:
Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1);
String[] tokens2 = p.split(str2);
...
instead of
String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...
I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.
My feeling (which I can't back with any statistical evidence) is that most of the cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO rare are the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense.
It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.

I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.
Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.
It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no builtin method to do a plain string replace or split. You have to use a regex replace with a Pattern.quoted string. And you only even get that from Java 5 onwards. Hopeless.
#Tim Pietzcker:
Are there other languages that do the same?
JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:
s.split(findstr).join(replacestr)
Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.
(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)

I imagine a good reason is that they can simply pass the buck on to the regex method, which does all the real heavy lifting for all of the string methods. Im guessing they thought if they already had a working solution it was less efficient, from a development and maintenance standpoint, to reinvent the wheel for each string manipulation method.

Interesting discussion!
Java was not originally intended as a batch programming language. As such the API out of the box are more tuned towards doing one "replace" , one "parse" etc. except on Application initialization when the app may be expected to be parsing a bunch of configuration files.
Hence optimization of these APIs was sacrificed in the altar of simplicity IMO. But the question brings up an important point. Python's desire to keep the regex distinct from the non regex in its API, stems from the fact that Python can be used as an excellent scripting language as well. In UNIX too, the original versions of fgrep did not support regex.
I was engaged in a project where we had to do some amount of ETL work in java. At that time, I remember coming up with the kind of optimizations that you have alluded to, in your question.

I suspect that the reason why things like String#split(String) use regexp under the hood is because it involves less extraneous code in the Java Class Library. The state machine resulting from a split on something like , or space is so simple that it is unlikely to be significantly slower to execute than a statically implemented equivalent using a StringCharacterIterator.
Beyond that the statically implemented solution would complicate runtime optimization with the JIT because it would be a different block of code that also requires hot code analysis. Using the existing Pattern algorithms regularly across the library means that they are more likely candidates for JIT compilation.

Very good question..
I suppose when the designers sat down to look at this (and not for very long, it seems), they came at it from a point of view that it should be designed to suit as many different possibilities as possible. Regular Expressions offered that flexibility.
They didn't think in terms of efficiencies. There is the Java Community Process available to raise this.
Have you looked at using the java.util.regex.Pattern class, where you compile the expression once and then use on different strings.
Pattern exp = Pattern.compile(":");
String[] array = exp.split(sourceString1);
String[] array2 = exp.split(sourceString2);

In looking at the Java String class, the uses of regex seem reasonable, and there are alternatives if regex is not desired:
http://java.sun.com/javase/6/docs/api/java/lang/String.html
boolean matches(String regex) - A regex seems appropriate, otherwise you could just use equals
String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that take CharSequence instead, preventing regex.
String[] split(String regex, int limit) - A powerful but expensive split, you can use StringTokenizer to split by tokens.
These are the only functions I saw that took regex.
Edit: After seeing that StringTokenizer is legacy, I would defer to Péter Török's answer to precompile the regex for split instead of using the tokenizer.

The answer to your question is that the Java core API did it wrong. For day to day work you can consider using Guava libraries' CharMatcher which fills the gap beautifully.

...why was the Java API chosen the way it is now?
Short answer: it wasn't. Nobody ever decided to favor regex methods over non-regex methods in the String API, it just worked out that way.
I always understood that Java's designers deliberately kept the string-manipulation methods to a minimum, in order to avoid API bloat. But when regex support came along in JDK 1.4, of course they had to add some convenience methods to String's API.
So now users are faced with a choice between the immensely powerful and flexible regex methods, and the bone-basic methods that Java always offered.

Why does appending "" to a String save memory?

I used a variable with a lot of data in it, say String data.
I wanted to use a small part of this string in the following way:
this.smallpart = data.substring(12,18);
After some hours of debugging (with a memory visualizer) I found out that the objects field smallpart remembered all the data from data, although it only contained the substring.
When I changed the code into:
this.smallpart = data.substring(12,18)+"";
..the problem was solved! Now my application uses very little memory now!
How is that possible? Can anyone explain this? I think this.smallpart kept referencing towards data, but why?
UPDATE:
How can I clear the big String then? Will data = new String(data.substring(0,100)) do the thing?

Doing the following:
data.substring(x, y) + ""
creates a new (smaller) String object, and throws away the reference to the String created by substring(), thus enabling garbage collection of this.
The important thing to realise is that substring() gives a window onto an existing String - or rather, the character array underlying the original String. Hence it will consume the same memory as the original String. This can be advantageous in some circumstances, but problematic if you want to get a substring and dispose of the original String (as you've found out).
Take a look at the substring() method in the JDK String source for more info.
EDIT: To answer your supplementary question, constructing a new String from the substring will reduce your memory consumption, provided you bin any references to the original String.
NOTE (Jan 2013). The above behaviour has changed in Java 7u6. The flyweight pattern is no longer used and substring() will work as you would expect.

If you look at the source of substring(int, int), you'll see that it returns:
new String(offset + beginIndex, endIndex - beginIndex, value);
where value is the original char[]. So you get a new String but with the same underlying char[].
When you do, data.substring() + "", you get a new String with a new underlying char[].
Actually, your use case is the only situation where you should use the String(String) constructor:
String tiny = new String(huge.substring(12,18));

When you use substring, it doesn't actually create a new string. It still refers to your original string, with an offset and size constraint.
So, to allow your original string to be collected, you need to create a new string (using new String, or what you've got).

I think this.smallpart kept
referencing towards data, but why?
Because Java strings consist of a char array, a start offset and a length (and a cached hashCode). Some String operations like substring() create a new String object that shares the original's char array and simply has different offset and/or length fields. This works because the char array of a String is never modified once it has been created.
This can save memory when many substrings refer to the same basic string without replicating overlapping parts. As you have noticed, in some situations, it can keep data that's not needed anymore from being garbage collected.
The "correct" way to fix this is the new String(String) constructor, i.e.
this.smallpart = new String(data.substring(12,18));
BTW, the overall best solution would be to avoid having very large Strings in the first place, and processing any input in smaller chunks, aa few KB at a time.

In Java strings are imutable objects and once a string is created, it remains on memory until it's cleaned by the garbage colector (and this cleaning is not something you can take for granted).
When you call the substring method, Java does not create a trully new string, but just stores a range of characters inside the original string.
So, when you created a new string with this code:
this.smallpart = data.substring(12, 18) + "";
you actually created a new string when you concatenated the result with the empty string.
That's why.

As documented by jwz in 1997:
If you have a huge string, pull out a substring() of it, hold on to the substring and allow the longer string to become garbage (in other words, the substring has a longer lifetime) the underlying bytes of the huge string never go away.

Just to sum up, if you create lots of substrings from a small number of big strings, then use
String subtring = string.substring(5,23)
Since you only use the space to store the big strings, but if you are extracting a just handful of small strings, from losts of big strings, then
String substring = new String(string.substring(5,23));
Will keep your memory use down, since the big strings can be reclaimed when no longer needed.
That you call new String is a helpful reminder that you really are getting a new string, rather than a reference to the original one.

Firstly, calling java.lang.String.substring creates new window on the original String with usage of the offset and length instead of copying the significant part of underlying array.
If we take a closer look at the substring method we will notice a string constructor call String(int, int, char[]) and passing it whole char[] that represents the string. That means the substring will occupy as much amount of memory as the original string.
Ok, but why + "" results in demand for less memory than without it??
Doing a + on strings is implemented via StringBuilder.append method call. Look at the implementation of this method in AbstractStringBuilder class will tell us that it finally do arraycopy with the part we just really need (the substring).
Any other workaround??
this.smallpart = new String(data.substring(12,18));
this.smallpart = data.substring(12,18).intern();

Appending "" to a string will sometimes save memory.
Let's say I have a huge string containing a whole book, one million characters.
Then I create 20 strings containing the chapters of the book as substrings.
Then I create 1000 strings containing all paragraphs.
Then I create 10,000 strings containing all sentences.
Then I create 100,000 strings containing all the words.
I still only use 1,000,000 characters. If you add "" to each chapter, paragraph, sentence and word, you use 5,000,000 characters.
Of course it's entirely different if you only extract one single word from the whole book, and the whole book could be garbage collected but isn't because that one word holds a reference to it.
And it's again different if you have a one million character string and remove tabs and spaces at both ends, making say 10 calls to create a substring. The way Java works or worked avoids copying a million characters each time. There is compromise, and it's good if you know what the compromises are.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.