String.substring vs String[].split - java

I have a comma-delimited string for which String.split(",") returns an array of about 60 elements. In one specific use case I only need the second value in that array. For example, given "Q,BAC,233,sdf,sdf," all I want is the text after the first ',' and before the second ','. For performance, am I better off parsing it myself using substring, or using split and then taking the second element of the array? Any input would be appreciated. This method will be called hundreds of times a second, so it's important that I understand the best approach regarding performance and memory allocation.
-Duncan

Since String.split returns a String[], a 60-way split results in about sixty needless allocations per line. split goes through your entire string and creates sixty new objects plus the array object itself. Of these sixty-one objects you keep exactly one and let the garbage collector deal with the remaining sixty.
If you are calling this in a tight loop, a substring would definitely be more efficient: it goes through the portion of your string up to the second comma, and then creates one new object that you keep.
String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);
System.out.println(brown);
The above prints brown
When you run this multiple times, the substring wins on time hands down: 1,000,000 iterations of split take 3.36s, while 1,000,000 iterations of substring take only 0.05s. And that's with only eight components in the string! The difference for sixty components would be even more drastic.
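For real input the same idea probably wants a guard for lines with fewer than two commas. Here is a minimal sketch; the class and method names are made up for illustration, and the fallback behaviour (rest of the line, or null when there is no comma at all) is an assumption, not something the question specifies:

```java
final class SecondFieldExtractor {
    /**
     * Returns the text between the first and second comma.
     * If there is only one comma, returns everything after it;
     * if there is no comma at all, returns null.
     */
    static String secondField(String line) {
        int first = line.indexOf(',');
        if (first < 0) {
            return null; // no delimiter, so there is no second field
        }
        int second = line.indexOf(',', first + 1);
        if (second < 0) {
            return line.substring(first + 1); // second field runs to the end
        }
        return line.substring(first + 1, second);
    }

    public static void main(String[] args) {
        System.out.println(secondField("Q,BAC,233,sdf,sdf,")); // BAC
        System.out.println(secondField("no commas here"));     // null
    }
}
```

Only one String is allocated per call, and the scan stops at the second comma no matter how long the line is.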

Of course. Why iterate through the whole string? Just use substring() and indexOf().

You are certainly better off doing it by hand for two reasons:
.split() takes a string as an argument, but this string is interpreted as a Pattern, and for your use case Pattern is costly;
as you say, you only need the second element: the algorithm to grab that second element is simple enough to do by hand.

I would use something like:
final int first = searchString.indexOf(",");
final int second = searchString.indexOf(",", first+1);
String result= searchString.substring(first+1, second);
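If you would rather keep split for readability, the two-argument overload String.split(regex, limit) is a possible middle ground: with a limit of 3 the delimiter is applied at most twice, so the array has at most three entries and the last one holds the unsplit remainder. A small sketch (class and method names are hypothetical):

```java
final class SplitWithLimit {
    static String secondField(String line) {
        // limit = 3: stop after two delimiters; parts[2] is the untouched rest of the line
        String[] parts = line.split(",", 3);
        // throws ArrayIndexOutOfBoundsException if the line contains no comma
        return parts[1];
    }

    public static void main(String[] args) {
        System.out.println(secondField("Q,BAC,233,sdf,sdf,")); // BAC
    }
}
```

This still allocates an array and up to three strings per call, so the indexOf/substring version above remains the leaner of the two.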

My first inclination would be to find the index of the first and second commas and take the substring.
The only real way to tell for sure, though, is to test each in your particular scenario. Break out the appropriate stopwatch and measure the two.
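A rough stopwatch along those lines might look like the sketch below. It is only a naive System.nanoTime harness (the class name, iteration count and warm-up pass are my assumptions); a proper benchmarking tool would give more trustworthy numbers, and the timings will vary by machine and JVM:

```java
final class SplitVsSubstringTimer {
    // Each method returns a checksum so the JIT cannot discard the work.
    static long runSplit(String line, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += line.split(",")[1].length();
        }
        return checksum;
    }

    static long runSubstring(String line, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            int from = line.indexOf(',');
            int to = line.indexOf(',', from + 1);
            checksum += line.substring(from + 1, to).length();
        }
        return checksum;
    }

    public static void main(String[] args) {
        String line = "quick,brown,fox,jumps,over,the,lazy,dog";
        int n = 1_000_000;
        // Warm-up pass so both code paths are compiled before timing
        runSplit(line, n);
        runSubstring(line, n);

        long t0 = System.nanoTime();
        long a = runSplit(line, n);
        long t1 = System.nanoTime();
        long b = runSubstring(line, n);
        long t2 = System.nanoTime();
        System.out.printf("split: %d ms, substring: %d ms (checksums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```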

Related

What is more efficient? Storing a split string in an array, or calling the split method every time you need it

What is going to be faster, storing a split string into an array and using this array within my program, or could I call the .split() method on the string whenever I needed an array to iterate through?
String main = "1,2,3,4,5,6";
String[] array = main.split(",");
vs
main.split(",");
whenever I need to use the input values?
I realise it will be way more readable if I store the split string in an array. I'd just like to know whether calling .split() takes more computing time than using a stored array, since the split method returns an array containing the split strings.
A simple example loop to go with the question:
for (int i = array.length - 1; i >= 0; i--){}
vs
for (int i = main.split(",").length - 1; i >= 0; i--){}
It's a trade off, like most such things in programming. If you split just once and use the array directly from then on, you'll save processing time at the expense of memory. If you split every time, you'll save memory at the expense of processing time.
One is more time efficient, the other is more space efficient.
As you can see, split() returns an array, so behind the scenes main.split(",") iterates through the whole main String to extract the values every time you call it. So it's faster to call it only once and reuse the result.
I would prefer to split once and keep the tokens around regardless of the size of the array. If the resulting array is large, it will be more expensive to split each time. If it is small, the resultant storage is probably not going to be a factor.
If you're worried about storage for a large array, then the last time you split should also be a concern. To mitigate that, simply assign null to the array when you're done and let the garbage collector do its thing.
If I were going to iterate through an array of tokens, I would probably do it like this.
for (String token : main.split(",")) {
// do some stuff.
}
which creates the array once.
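To make the split-once advice concrete for the index-based loop from the question, here is a small sketch (the class and the summing are just for illustration): the array is created once before the loop, and the loop only reads from it.

```java
final class SplitOnce {
    static int sum(String csv) {
        String[] tokens = csv.split(","); // split exactly once
        int total = 0;
        // Iterate backwards over the stored array, as in the question's loop
        for (int i = tokens.length - 1; i >= 0; i--) {
            total += Integer.parseInt(tokens[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum("1,2,3,4,5,6")); // 21
    }
}
```

Calling main.split(",") inside the loop body or condition instead would rescan the string and allocate a fresh array on every evaluation.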

Efficiency of String matching, Equals vs. Matches methods

Say I have to compare some string objects in Java, and I have to do it like a million times for a high volume program. The strings will either be completely identical or they should not count as a match. Which method is more efficient to use, equals (object equality) or matches (regex)? An example:
String a = "JACK", b = "JACK", c = "MEG";
a.equals(b);//True
a.equals(c);//False
a.matches(b);//True
a.matches(c);//False
Both methods give me the results I want, but I'd like to know which one would be more efficient given the high volume processing.
You can check this yourself by taking a big pool of strings and comparing them in a loop. Before and after the loop, take the current system time and compute the difference between the start and end times. Be careful, though: results may differ depending on your hardware, and the JVM may apply optimizations in the background. That is why you should compare many strings and take an average.
List<String> bigList = new ArrayList<>(); // needs java.util.List and java.util.ArrayList; put many, many values in this list
String pattern = "pattern";
long start = System.nanoTime();
for (int i = 0; i < bigList.size(); i++) {
bigList.get(i).equals(pattern); // in another program, check for matches(pattern)
}
long end = System.nanoTime();
System.out.println((end - start) / bigList.size()); // average time per comparison in nanoseconds; the list must not be empty
matches will probably be slower since it uses a java.util.regex.Pattern and java.util.regex.Matcher in the background. Both equals and compareTo use a simple loop, and should therefore be faster.
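Besides speed, the two methods do not mean the same thing: matches treats its argument as a regular expression, so it only agrees with equals when the argument has no regex metacharacters. A short sketch (class and method names are made up for illustration) that also shows a precompiled Pattern, for the case where you really do need a regex in a hot loop:

```java
import java.util.regex.Pattern;

final class EqualsVsMatches {
    // Compiled once and reused; String.matches compiles its argument on every call.
    private static final Pattern JACK = Pattern.compile("JACK");

    static boolean viaEquals(String a, String b) {
        return a.equals(b);
    }

    static boolean viaMatches(String a, String regex) {
        return a.matches(regex);
    }

    static boolean viaPrecompiled(String a) {
        return JACK.matcher(a).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaEquals("JACK", "JACK")); // true
        System.out.println(viaMatches("JACK", "JACK")); // true
        System.out.println(viaPrecompiled("MEG"));      // false
        // '.' is a regex wildcard, so these two disagree:
        System.out.println(viaEquals("axb", "a.b"));    // false
        System.out.println(viaMatches("axb", "a.b"));   // true
    }
}
```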
Answer found here: http://www.coderanch.com/t/487350/Performance/java/compare-strings

Java concatenate strings vs static strings

I am trying to get a better understanding of Strings. I am basically making a program that requires a lot of strings. However, a lot of the strings are very, very similar and merely require a different word at the end of the string.
E.g.
String one = "I went to the store and bought milk";
String two = "I went to the store and bought eggs";
String three = "I went to the store and bought cheese";
So my question is, what approach would be best suited to take when dealing with strings? Would concatenating 2 strings together have any benefits over just having static strings in, say for example, performance or memory management?
E.g.
String one = "I went to the store and bought ";
String two = "milk";
String three = "cheese";
String four = one + two;
String five = one + three;
I am just trying to figure out the most optimal way of dealing with all these strings. (If it helps to know how many strings I am using: I currently have 50, but the number could grow far beyond that.)
As spooky has said the main concern with the code is readability. Unless you are working on a program for a phone you do not need to manage your resources. That being said, it really doesn't matter whether you create a lot of Strings that stand alone or concatenate a base String with the small piece that varies. You won't really notice better performance either way.
You may set the opening sentence in a string like this
String openingSentence = "I went to the store and bought";
and, instead of defining each word separately, define one array of strings like the following:
String[] thingsToBeBought = { "milk", "water", "cheese" .... };
Then you can use a for-each loop to concatenate each element of the array with the opening sentence (with a space in between).
In Java, if you concatenate two Strings (e.g. using '+') a new String is created, so the old memory needs to be garbage collected. If you want to concatenate strings, the correct way to do this is to use a StringBuilder or StringBuffer.
Given your comment about these strings really being URLs, you probably want to have a StringBuilder/StringBuffer that is the URL base, and then append the suffixes as needed.
Performance wise final static strings are always better as they are generated during compile time. Something like this
final static String s = "static string";
Non-static strings and strings concatenated as shown in the other example are generated at runtime. So even though performance will hardly matter for such a small thing, the second example is not as good as the first one performance-wise, as in your code:
// not as good performance wise since they are generated at runtime
String four = one + two;
String five = one + three;
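The compile-time versus runtime distinction can be observed directly. In this sketch (class and method names are hypothetical) the == comparisons are deliberate identity checks, used only to show which strings are the same interned object; normal code should compare strings with equals:

```java
final class ConstantFolding {
    static final String BASE = "I went to the store and bought ";
    // Constant expression: the compiler folds it into a single literal.
    static final String MILK = BASE + "milk";

    static boolean foldedIsSameLiteral() {
        // Both sides are compile-time constants, so they are the same interned object
        return MILK == "I went to the store and bought milk";
    }

    static boolean runtimeIsSameObject(String item) {
        // item is not a constant, so this concatenation runs at runtime
        String built = BASE + item;
        return built == MILK;
    }

    static boolean runtimeHasSameText(String item) {
        return (BASE + item).equals(MILK);
    }

    public static void main(String[] args) {
        System.out.println(foldedIsSameLiteral());       // true
        System.out.println(runtimeIsSameObject("milk")); // false
        System.out.println(runtimeHasSameText("milk"));  // true
    }
}
```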
Since you are going to use this string as a URL, I would recommend using StringJoiner (if you are using Java 8). It will be as efficient as StringBuilder (it will not create a new string every time you perform a concatenation) and will automatically add "/" between strings.
StringJoiner myJoiner = new StringJoiner("/");
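A slightly fuller sketch of how that could be used, assuming Java 8 or later; the helper name and the example host are placeholders, not anything from the question:

```java
import java.util.StringJoiner;

final class UrlJoin {
    // Joins a base and any number of path segments with "/" between them.
    static String join(String base, String... segments) {
        StringJoiner joiner = new StringJoiner("/");
        joiner.add(base);
        for (String segment : segments) {
            joiner.add(segment);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(join("http://host", "res", "items")); // http://host/res/items
    }
}
```

Note that StringJoiner does no URL escaping, and a base that already ends in "/" will produce a doubled slash.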
There will be no discernable difference in performance, so the manner in which you go about this is more a matter of preference. I would likely declare the first part of the sentence as a String and store the individual purchase items in an array.
Example:
String action = "I went to the store and bought ";
String [] items = {"milk", "eggs", "cheese"};
for (int x = 0; x< items.length; x++){
System.out.println(action + items[x]);
}
Whether you declare every possible String or separate Strings to be concatenated isn't going to have any measurable impact on memory or performance in the example you give. In the extreme case of declaring truly large numbers of String literals, Java's native hash table of interned Strings will use more memory if you declare every possible String, because the table's cached values will be longer.
If you build up a String with the + operator across several statements or inside a loop, you will be creating extra String objects to be GC'd. For example, if you have Strings a = "1" and b = "2", and do String s = "s"; s += a; s += b;, Java will first create the String "s1" and then concatenate again to form a second String "s12". Avoid the intermediate String by using something like StringBuilder. (This doesn't apply to compile-time constant expressions, and a single expression such as "s" + a + b is typically compiled into one builder chain with no intermediate String.)
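As a sketch of the StringBuilder approach for the shopping sentences (the class and method names are invented for illustration), one builder is reused for the whole loop and only the final toString() creates a String:

```java
final class SentenceBuilder {
    // Builds one line per item using a single StringBuilder.
    static String buildAll(String prefix, String[] items) {
        StringBuilder sb = new StringBuilder();
        for (String item : items) {
            sb.append(prefix).append(item).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] items = {"milk", "eggs", "cheese"};
        System.out.print(buildAll("I went to the store and bought ", items));
    }
}
```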
If you happen to be formatting a String rather than simply concatenating, use a MessageFormat or String.format(). It's prettier and avoids the intermediate Strings created when using the + operator. So something like, String urlBase = "http://host/res?a=%s&b=%s"; String url = String.format(urlBase, a, b); where a and b are the query parameter String values.

Pattern.compile.split vs StringBuilder iteration and substring

I have to split a very large string in the fastest way possible, and from the research I did I narrowed it down to 2 possibilities:
1. Pattern.compile("[delimiter]").split("[large_string]");
2. Iterate through StringBuilder and call substring
StringBuilder sb = new StringBuilder("[large_string]");
ArrayList<String> pieces = new ArrayList<String>();
int pos = 0;
int currentPos;
while((currentPos = sb.indexOf("[delimiter]", pos)) != -1){
pieces.add(sb.substring(pos, currentPos));
pos = currentPos+"[delimiter]".length();
}
pieces.add(sb.substring(pos)); // keep the text after the last delimiter
Any help is appreciated. I will benchmark them, but I'm more interested in the theoretical part: why is one faster than the other?
Furthermore if you have other suggestions please post them.
UPDATE: As I said, I've done the benchmark: I generated 5 million strings of 32 chars each and put them in a single string delimited by ~~:
The StringBuilder approach was, surprisingly, the slowest, with an average of 2.50-2.55 sec
Pattern.compile.split came in 2nd place with an average of 2.47-2.49 sec
Splitter by Guava was the undisputed winner with an average of 1.12-1.18 sec, half the time of the others (special thanks to fge who suggested it)
Thank you all for the help!
If your string is large, something to consider is whether any copies are made. If you don't use StringBuilder but use the plain String#substring(from,to), then no copies will be made of the contents of the string (this holds for JDKs before Java 7u6; later versions copy the characters). There will be 1 instance of the whole String, and it will stick around as long as at least 1 substring persists.
Hmm... Source perusal of the Pattern class shows that split does the same thing, while the source of the StringBuilder shows that copies are made for each substring.
If this is a fixed pattern, and you do not need a regex, you might want to consider Guava's Splitter. It is very well written and performs admirably:
private static final Splitter SPLITTER = Splitter.on("myDelimiterHere");
Also, unlike .split(), you don't get nasty surprises with empty strings at the end... (you must pass a negative integer as an argument in order for it to do a "real" split)
You will also see that this class' .split() method returns an Iterable<String> that can be evaluated lazily; when the string is REALLY large, it only makes the copies you actually ask it to make!
If you have to use it multiple times, a static object of your Pattern would be the choice. Look into the StringBuilder source: its indexOf method does the same thing, iterating through all the characters. Internally, the String.split() method also uses Pattern to compile and split the string. Use the given methods and you should have the best performance...
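Putting the two JDK-only options side by side, here is a sketch with a precompiled static Pattern and a hand-rolled indexOf loop on a plain String. The ~~ delimiter comes from the benchmark in the question; the class and method names are assumptions. The -1 limit keeps trailing empty strings so that both methods return the same pieces:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DelimiterSplit {
    // Compiled once and shared by every call
    private static final Pattern DELIMITER = Pattern.compile("~~");

    static List<String> viaPattern(String text) {
        // limit -1: do not drop trailing empty strings
        return Arrays.asList(DELIMITER.split(text, -1));
    }

    static List<String> viaIndexOf(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> pieces = new ArrayList<>();
        int pos = 0;
        int next;
        while ((next = text.indexOf(delimiter, pos)) != -1) {
            pieces.add(text.substring(pos, next));
            pos = next + delimiter.length();
        }
        pieces.add(text.substring(pos)); // text after the last delimiter
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(viaPattern("a~~b~~c"));       // [a, b, c]
        System.out.println(viaIndexOf("a~~b~~c", "~~")); // [a, b, c]
    }
}
```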

Why does appending "" to a String save memory?

I used a variable with a lot of data in it, say String data.
I wanted to use a small part of this string in the following way:
this.smallpart = data.substring(12,18);
After some hours of debugging (with a memory visualizer) I found out that the object's field smallpart remembered all the data from data, although it only contained the substring.
When I changed the code into:
this.smallpart = data.substring(12,18)+"";
..the problem was solved! Now my application uses very little memory!
How is that possible? Can anyone explain this? I think this.smallpart kept referencing towards data, but why?
UPDATE:
How can I clear the big String then? Will data = new String(data.substring(0,100)) do the thing?
Doing the following:
data.substring(x, y) + ""
creates a new (smaller) String object, and throws away the reference to the String created by substring(), thus enabling garbage collection of this.
The important thing to realise is that substring() gives a window onto an existing String - or rather, the character array underlying the original String. Hence it will consume the same memory as the original String. This can be advantageous in some circumstances, but problematic if you want to get a substring and dispose of the original String (as you've found out).
Take a look at the substring() method in the JDK String source for more info.
EDIT: To answer your supplementary question, constructing a new String from the substring will reduce your memory consumption, provided you bin any references to the original String.
NOTE (Jan 2013). The above behaviour has changed in Java 7u6. The flyweight pattern is no longer used and substring() will work as you would expect.
If you look at the source of substring(int, int), you'll see that it returns:
new String(offset + beginIndex, endIndex - beginIndex, value);
where value is the original char[]. So you get a new String but with the same underlying char[].
When you do, data.substring() + "", you get a new String with a new underlying char[].
Actually, your use case is the only situation where you should use the String(String) constructor:
String tiny = new String(huge.substring(12,18));
When you use substring, it doesn't actually create a new string. It still refers to your original string, with an offset and size constraint.
So, to allow your original string to be collected, you need to create a new string (using new String, or what you've got).
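A small sketch of that workaround wrapped in a helper (the class and method names are hypothetical). On JDKs from 7u6 onwards, where substring already copies, the extra new String is redundant but harmless:

```java
final class DetachedSubstring {
    // Returns a substring that holds its own characters,
    // so the huge source string can be garbage collected.
    static String detach(String huge, int begin, int end) {
        return new String(huge.substring(begin, end));
    }

    public static void main(String[] args) {
        String data = "0123456789abcdefghij";
        System.out.println(detach(data, 12, 18)); // cdefgh
    }
}
```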
"I think this.smallpart kept referencing towards data, but why?"
Because Java strings consist of a char array, a start offset and a length (and a cached hashCode). Some String operations like substring() create a new String object that shares the original's char array and simply has different offset and/or length fields. This works because the char array of a String is never modified once it has been created.
This can save memory when many substrings refer to the same basic string without replicating overlapping parts. As you have noticed, in some situations, it can keep data that's not needed anymore from being garbage collected.
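To make the sharing visible, here is a toy model of that layout. It is not the JDK's code: WindowedText and all of its members are invented for illustration, and it only mimics the array, offset and length fields described above.

```java
import java.util.Arrays;

final class WindowedText {
    private final char[] value; // possibly shared backing array
    private final int offset;   // where this window starts
    private final int count;    // how many chars this window covers

    WindowedText(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private WindowedText(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Like the old substring(): new object, same backing array
    WindowedText window(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new IndexOutOfBoundsException();
        }
        return new WindowedText(value, offset + begin, end - begin);
    }

    // Like new String(substring): copies only the visible chars
    WindowedText copy() {
        return new WindowedText(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    // Number of chars kept alive by this object
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        WindowedText small = new WindowedText("0123456789abcdefghij").window(12, 18);
        System.out.println(small + " retains " + small.retainedChars());               // cdefgh retains 20
        System.out.println(small.copy() + " retains " + small.copy().retainedChars()); // cdefgh retains 6
    }
}
```

The window is six characters long but keeps all twenty alive, which is the effect the memory visualizer showed in the question.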
The "correct" way to fix this is the new String(String) constructor, i.e.
this.smallpart = new String(data.substring(12,18));
BTW, the overall best solution would be to avoid having very large Strings in the first place, and to process any input in smaller chunks, a few KB at a time.
In Java, strings are immutable objects, and once a string is created it remains in memory until it's cleaned up by the garbage collector (and this cleanup is not something you can take for granted).
When you call the substring method, Java does not create a truly new string, but just stores a range of characters inside the original string.
So, when you created a new string with this code:
this.smallpart = data.substring(12, 18) + "";
you actually created a new string when you concatenated the result with the empty string.
That's why.
As documented by jwz in 1997:
If you have a huge string, pull out a substring() of it, hold on to the substring and allow the longer string to become garbage (in other words, the substring has a longer lifetime) the underlying bytes of the huge string never go away.
Just to sum up, if you create lots of substrings from a small number of big strings, then use
String substring = string.substring(5,23);
since you only use the space needed to store the big strings. But if you are extracting just a handful of small strings from lots of big strings, then
String substring = new String(string.substring(5,23));
will keep your memory use down, since the big strings can be reclaimed when no longer needed.
That you call new String is a helpful reminder that you really are getting a new string, rather than a reference to the original one.
Firstly, calling java.lang.String.substring creates a new window onto the original String, using an offset and length instead of copying the relevant part of the underlying array.
If we take a closer look at the substring method, we will notice a call to the String(int, int, char[]) constructor that passes it the whole char[] representing the string. That means the substring will occupy as much memory as the original string.
OK, but why does + "" result in less memory being used than without it?
Doing a + on strings is implemented via a StringBuilder.append method call. Looking at the implementation of this method in the AbstractStringBuilder class tells us that it ultimately does an arraycopy of just the part we really need (the substring).
Any other workarounds?
this.smallpart = new String(data.substring(12,18));
this.smallpart = data.substring(12,18).intern();
Appending "" to a string will sometimes save memory.
Let's say I have a huge string containing a whole book, one million characters.
Then I create 20 strings containing the chapters of the book as substrings.
Then I create 1000 strings containing all paragraphs.
Then I create 10,000 strings containing all sentences.
Then I create 100,000 strings containing all the words.
I still only use 1,000,000 characters. If you add "" to each chapter, paragraph, sentence and word, you use 5,000,000 characters.
Of course it's entirely different if you only extract one single word from the whole book, and the whole book could be garbage collected but isn't because that one word holds a reference to it.
And it's different again if you have a one-million-character string and remove tabs and spaces at both ends, making say 10 calls to create a substring. The way Java works, or worked, avoids copying a million characters each time. There is a compromise, and it's good to know what the trade-offs are.
