Strange behavior of Java String split() method - java

I have a method that takes a string parameter, splits it on #, and then prints the length of the resulting array along with its elements. My code is below:
public void StringSplitTesting(String inputString) {
    String tokenArray[] = inputString.split("#");
    System.out.println("tokenArray length is " + tokenArray.length
            + " and array elements are " + Arrays.toString(tokenArray));
}
Case I : Now when my input is abc# the output is tokenArray length is 1 and array elements are [abc]
Case II : But when my input is #abc the output is tokenArray length is 2 and array elements are [, abc]
But I was expecting the same output in both cases. What is the reason behind this implementation? Why does the split() method behave like this? Could someone give me a proper explanation?

One aspect of the behavior of the one-argument split method can be surprising: trailing empty strings are discarded from the returned array. As the Javadoc puts it:
Trailing empty strings are therefore not included in the resulting array.
To get a length of 2 for each case, you can pass in a negative second argument to the two-argument split method, which means that the length is unrestricted and no trailing empty strings are discarded.
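As a minimal sketch of the difference, the snippet below runs both inputs from the question through the one-argument split and through the two-argument split with a negative limit. The class name SplitLimitDemo and the helper splitKeepingTrailing are made up for this illustration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: splits on '#' and keeps trailing empty strings
    // by passing a negative limit.
    static String[] splitKeepingTrailing(String input) {
        return input.split("#", -1);
    }

    public static void main(String[] args) {
        // One-argument split: trailing empty strings are dropped.
        System.out.println(Arrays.toString("abc#".split("#")));            // [abc]
        System.out.println(Arrays.toString("#abc".split("#")));            // [, abc]

        // Negative limit: nothing is dropped, so both inputs give two elements.
        System.out.println(Arrays.toString(splitKeepingTrailing("abc#"))); // [abc, ]
        System.out.println(Arrays.toString(splitKeepingTrailing("#abc"))); // [, abc]
    }
}
```

Note that only trailing empty strings are affected by the limit; the leading empty string in Case II is kept either way.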

Just take a look in the documentation:
Trailing empty strings are therefore not included in the resulting
array.
So in case 1, the output would be {"abc", ""} but Java cuts the trailing empty String.
If you don't want the trailing empty String to be discarded, you have to use split("#", -1).

The observed behavior is due to the inherently asymmetric nature of the substring() method in Java:
This is the core of the implementation of split():
while ((next = indexOf(ch, off)) != -1) {
    if (!limited || list.size() < limit - 1) {
        list.add(substring(off, next));
        off = next + 1;
    } else {    // last one
        //assert (list.size() == limit - 1);
        list.add(substring(off, value.length));
        off = value.length;
        break;
    }
}
The key to understanding the behavior of the above code is to understand the behavior of the substring() method:
From the Javadocs:
String java.lang.String.substring(int beginIndex, int endIndex)
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at index
endIndex - 1. Thus the length of the substring is endIndex-beginIndex.
Examples:
"hamburger".substring(4, 8) returns "urge" (not "urger")
"smiles".substring(1, 5) returns "mile" (not "miles")
Hope this helps.

Related

How to find if an array index Exists?

Basically I need to check if a String contains 2 indexes.
Based on my googling, I found that I could use either part[0].length() > 0 or part[0] != null, but neither helps me here.
My code:
String[] parts = datareceived.split("&");
if(!(parts[0].length()>0) && parts[0] == null){
    out.print("String is null");
    return;
}
if(!(parts[1].length()>0) && parts[1] == null){
    out.print("String is null");
    return;
}
But here in parts[1] i'm getting an exception which says:
java.lang.ArrayIndexOutOfBoundsException: 1
at pack.reg.pack.serv.doPost(serv.java:10)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:648)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:729)
Thanks in advance!
Basically I need to check if a String contains 2 indexes
If you are using split(), it returns an array, and you can use .length to check the number of tokens returned:
if(parts.length >= 2)
But that is not gonna tell if the second index is empty, right?
If you are afraid of getting empty string, you can trim() the String first:
String[] parts = datareceived.trim().split("&");
This means split returned an array of strings whose size is 1. That's why you are getting java.lang.ArrayIndexOutOfBoundsException.
If you want to check whether the array has at least 2 elements, you can do the following:
if (parts.length >= 2) {
    // write your logic here
}
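To make the length check concrete, here is a small sketch that guards every array access behind parts.length. The class SplitBoundsDemo and the helper tokenOrNull are hypothetical names, and passing -1 as the limit (so that a trailing empty token is kept rather than dropped) is an assumption about what the asker wants.

```java
class SplitBoundsDemo {
    // Hypothetical helper: returns the token at the given index,
    // or null when split() produced fewer tokens than that.
    static String tokenOrNull(String data, int index) {
        String[] parts = data.split("&", -1); // -1 keeps trailing empty tokens
        if (index < 0 || index >= parts.length) {
            return null; // the index does not exist in the array
        }
        return parts[index];
    }

    public static void main(String[] args) {
        System.out.println(tokenOrNull("a&b", 1));   // b
        System.out.println(tokenOrNull("a", 1));     // null (no '&', so only one token)
        System.out.println(tokenOrNull("a&", 1).isEmpty()); // true (second token exists but is empty)
    }
}
```

With this shape, "index exists" and "token is empty" are two separate checks: the first is answered by the array length, the second by isEmpty() on the returned string.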
When you're getting an ArrayIndexOutOfBoundsException when calling parts[1], but not when calling parts[0], we can conclude that parts.length == 1. Hence, since indices start at 0, index 0 contains the whole string, and index 1 in the split array (and up) doesn't exist.
--> Problem found: Your datareceived doesn't contain a &, hence it won't be split anywhere.
Solution: Hoping I understood your problem correctly (checking whether a string datareceived contains a value at certain indices), I wrote you a piece of code that should work.
String datareceived = "01234";
int index1 = 2;
int index2 = 5;
if (datareceived.length() > index1) {
//the string has a value at index1. With the given example, this if-query would be true.
}
if (datareceived.length() > index2) {
//the string has a value at index2. With the given example, this if-query would be false.
}
Side-note when using .split:
I found that, when using String.split("expression"), the "expression" argument is interpreted as a regular expression. Therefore, if your delimiter contains a regex metacharacter (such as . or $), it will not split on the literal symbol unless you escape it. For example, in regex, . means "any character", so every character is treated as a delimiter, all the resulting tokens are empty, and you essentially get an empty array.
(Note: I'm no regex expert, but "&" doesn't appear to be a regex metacharacter, so it can be used as-is.)
Example:
String s = "a,b";
String[] strings = s.split(",");
for (String str : strings) {
    System.out.println(str + "_");
}
//will print out "a_" and "b_" on separate lines
Example (not working as desired):
String s = "a.b";
String[] strings = s.split(".");
//"strings" is an empty array, since "a", ".", and "b" are "any character".
Solution:
instead of .split("."), use .split("\\.")
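An alternative to escaping by hand is java.util.regex.Pattern.quote, which turns any string into a regex that matches it literally. The sketch below compares the unescaped call with a quoted one; the class LiteralSplitDemo and the helper splitLiteral are names invented for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplitDemo {
    // Hypothetical helper: splits on the delimiter taken literally,
    // regardless of any regex metacharacters it contains.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // Unescaped "." matches every character, so only empty tokens remain
        // and all of them are discarded.
        System.out.println("a.b".split(".").length);                  // 0

        // Escaped by hand and quoted via Pattern.quote give the same result.
        System.out.println(Arrays.toString("a.b".split("\\.")));      // [a, b]
        System.out.println(Arrays.toString(splitLiteral("a.b", "."))); // [a, b]
    }
}
```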

indexOf() of StringBuilder doesn't return anything

StringBuilder builder = new StringBuilder();
builder.setLength(10);
builder.append("d");
System.out.println(builder.length() + "\t" + builder.toString() + "\t" + builder.indexOf("d"));
Output:
11
Problem:
Why doesn't indexOf() return anything?
My Understanding:
As per my understanding, it should return 10, since StringBuilder is counting the "d" as part of its length in length().
But if "d" is not part of the string held by the StringBuilder, then it should return -1 and the length should be 10.
If you look at the docs of StringBuilder#setLength(int newLength)
If the newLength argument is greater than or equal to the current length, sufficient null characters ('\u0000') are appended so that length becomes the newLength argument.
That is why when you append "d" after setting the length, it is placed after the 10 null characters.
Q. Why indexOf() doesn't return anything.
It does return a value and that is 10, since the indexing is 0-based. This is the output of your code.
11 d 10 // the 10 represents the index of d
^-length ^-10 null chars followed by d
The reason you're not seeing the full output may be that your console does not support null characters. When it encounters the null character \u0000, it may simply stop printing the rest of the line. Try using Eclipse, whose console supports printing such characters.
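One way to confirm this without depending on how a particular console treats \u0000 is to print each value on its own line and swap the null characters for something visible. This is only a sketch; the class SetLengthDemo and the helper visible are made-up names, and the '.' placeholder is an arbitrary choice.

```java
class SetLengthDemo {
    // Hypothetical helper: replaces null characters with '.' so the
    // padding added by setLength() shows up in any console.
    static String visible(StringBuilder sb) {
        return sb.toString().replace('\u0000', '.');
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        builder.setLength(10);   // pads with ten '\u0000' characters
        builder.append("d");     // lands at index 10

        System.out.println(builder.length());     // 11
        System.out.println(builder.indexOf("d")); // 10
        System.out.println(visible(builder));     // ..........d
    }
}
```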

Why does "||".split("\\|").length return 0 and not 3?

When there are adjacent separators in the string being split, I expect null or an empty string for each gap, not to have them eliminated.
The Java code is below:
public class splitter {
    public static void main(String args[]) {
        int size = "||".split("\\|").length;
        assert size == 3 : "size should be 3 and not " + size;
    }
}
I expected to get either { "", "", "" } or { null, null, null }. Either would be fine.
Perhaps there's a regular expression that will not be fooled by empty words?
According to the javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
The javadoc for split(String, int) elaborates:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
(emphasis mine)
So to return an array of empty strings, call "||".split("\\|", -1)
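The three cases in the quoted javadoc (zero, negative, and positive limit) can be compared side by side. The sketch below uses the sample input "a|b||", which is not from the question, and a hypothetical wrapper named splitWithLimit.

```java
import java.util.Arrays;

class LimitComparisonDemo {
    // Hypothetical wrapper around the two-argument split, splitting on a literal '|'.
    static String[] splitWithLimit(String input, int limit) {
        return input.split("\\|", limit);
    }

    public static void main(String[] args) {
        String input = "a|b||";
        // limit 0: as many splits as possible, trailing empty strings discarded
        System.out.println(Arrays.toString(splitWithLimit(input, 0)));  // [a, b]
        // limit -1: as many splits as possible, nothing discarded
        System.out.println(Arrays.toString(splitWithLimit(input, -1))); // [a, b, , ]
        // limit 2: at most one split, the rest stays in the last element
        System.out.println(Arrays.toString(splitWithLimit(input, 2)));  // [a, b||]

        // The case from the question:
        System.out.println("||".split("\\|").length);     // 0
        System.out.println("||".split("\\|", -1).length); // 3
    }
}
```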
I need to take a closer look at Paul's answer (his looks simpler), but I was able to find something about look-ahead expressions that solve the assertions (I apologize that the code is in Apex--it just wraps Java).
static testMethod void testPatternStringSplit() {
Pattern aPattern = Pattern.Compile('(?=\\|)');
system.assertEquals(3, aPattern.split('||').size());
system.assertEquals(3, aPattern.split(' | | ').size());
system.assertEquals(3, aPattern.split('a|b|c').size());
system.assertEquals(3, aPattern.split('a|b|').size());
system.assertEquals(3, aPattern.split('|b|c').size());
system.assertEquals(3, aPattern.split('|b|').size());
}
I need to write some code to test Paul's ...

Split by first found String in Java

Is it possible to tell the String.split("(") function to split only at the first occurrence of the string "("?
Example:
String test = "A*B(A+B)+A*(A+B)";
test.split("(") should result to ["A*B" ,"A+B)+A*(A+B)"]
test.split(")") should result to ["A*B(A+B" ,"+A*(A+B)"]
Yes, absolutely:
test.split("\\(", 2);
As the documentation for String.split(String,int) explains:
The limit parameter controls the number of times the
pattern is applied and therefore affects the length of the resulting
array. If the limit n is greater than zero then the pattern
will be applied at most n - 1 times, the array's
length will be no greater than n, and the array's last entry
will contain all input beyond the last matched delimiter.
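Applied to the string from the question, a limit of 2 gives the two expected results. This is a small sketch; the class SplitOnceDemo and the helper splitOnce are illustrative names, and note that both ( and ) have to be escaped because split takes a regular expression.

```java
import java.util.Arrays;

class SplitOnceDemo {
    // Hypothetical helper: applies the regex at most once, so the result
    // has at most two elements.
    static String[] splitOnce(String input, String regex) {
        return input.split(regex, 2);
    }

    public static void main(String[] args) {
        String test = "A*B(A+B)+A*(A+B)";
        System.out.println(Arrays.toString(splitOnce(test, "\\("))); // [A*B, A+B)+A*(A+B)]
        System.out.println(Arrays.toString(splitOnce(test, "\\)"))); // [A*B(A+B, +A*(A+B)]
    }
}
```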
test.split("\\(",2);
See javadoc for more info
EDIT: Escaped bracket, as per @Pedro's comment below.
Try with this solution, it's generic, faster and simpler than using a regular expression:
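public static String[] splitOnFirst(String str, char c) {
    int idx = str.indexOf(c);
    if (idx < 0) {
        // delimiter not found: return the whole string as the only element
        return new String[] { str };
    }
    String head = str.substring(0, idx);
    String tail = str.substring(idx + 1);
    return new String[] { head, tail };
}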
Test it like this:
String test = "A*B(A+B)+A*(A+B)";
System.out.println(Arrays.toString(splitOnFirst(test, '(')));
System.out.println(Arrays.toString(splitOnFirst(test, ')')));

Java String.indexOf and empty Strings

I'm curious why the String.indexOf is returning a 0 (instead of -1) when asking for the index of an empty string within a string.
The Javadocs only say this method returns the index in this string of the specified string, -1 if the string isn't found.
To me this behavior seems highly unexpected, I would have expected a -1. Any ideas why this unexpected behavior is going on? I would at the least think this is worth a note in the method's Javadocs...
System.out.println("FOO".indexOf("")); // outputs 0 wtf!!!
System.out.println("FOO".indexOf("bar")); // outputs -1 as expected
System.out.println("FOO".indexOf("F")); // outputs 0 as expected
System.out.println("".indexOf("")); // outputs 0 as expected, I think
The empty string is everywhere, and nowhere. It is within all strings at all times, permeating the essence of their being, yet as you seek it you shall never catch a glimpse.
How many empty strings can you fit at the beginning of a string? Mu
The student said to the teacher,
Teacher, I believe that I have found the nature of the empty string. The empty string is like a particle of dust, and it floats freely through a string as dust floats freely through the room, glistening in a beam of sunlight.
The teacher responded to the student,
Hmm. A fine notion. Now tell me, where is the dust, and where is the sunlight?
The teacher struck the student with a strap and instructed him to continue his meditation.
Well, if it helps, you can think of "FOO" as "" + "FOO".
int number_of_empty_strings_in_string_named_text = text.length() + 1
All characters are separated by an empty String. Additionally empty String is present at the beginning and at the end.
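The length() + 1 count can be checked with the two-argument indexOf(String, int), which starts searching at a given position. The sketch below counts the positions at which the empty string is found; the class EmptyStringPositions and the method countEmptyMatches are names invented for this illustration.

```java
class EmptyStringPositions {
    // Counts the start positions i (0 through text.length(), inclusive)
    // at which the empty string is found when searching from i.
    static int countEmptyMatches(String text) {
        int count = 0;
        for (int i = 0; i <= text.length(); i++) {
            if (text.indexOf("", i) == i) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEmptyMatches("FOO")); // 4, i.e. "FOO".length() + 1
        System.out.println(countEmptyMatches(""));    // 1
    }
}
```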
By using the expression "", you are actually referring to a null string. A null string is an ethereal tag placed on something that exists only to show that there is a lack of anything at this location.
So, by saying "".indexOf( "" ), you are really asking the interpreter:
Where does a string value of null exist in my null string?
It returns a zero, since the null is at the beginning of the non-existent null string.
To add anything to the string would now make it a non-null string... null can be thought of as the absence of everything, even nothing.
Using an algebraic approach, "" is the neutral element of string concatenation: x + "" == x and "" + x == x (although + is non commutative here).
Then it must also be:
x.indexOf ( y ) == i and i != -1
<==> x.substring ( 0, i ) + y + x.substring ( i + y.length () ) == x
when y = "", this holds if i == 0 and x.substring ( 0, 0 ) == "".
I didn't design Java, but I guess mathematicians participated in it...
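The identity above can be tried out directly. This is only a sketch of the check for a few sample strings, not a proof; the class IndexOfIdentityDemo and the method identityHolds are hypothetical names.

```java
class IndexOfIdentityDemo {
    // Checks that splicing y back in at position x.indexOf(y) reproduces x.
    static boolean identityHolds(String x, String y) {
        int i = x.indexOf(y);
        if (i == -1) {
            return false; // y does not occur in x, so there is nothing to splice
        }
        String rebuilt = x.substring(0, i) + y + x.substring(i + y.length());
        return rebuilt.equals(x);
    }

    public static void main(String[] args) {
        System.out.println(identityHolds("FOO", ""));    // true, with i == 0
        System.out.println(identityHolds("FOO", "O"));   // true, with i == 1
        System.out.println(identityHolds("", ""));       // true, with i == 0
        System.out.println(identityHolds("FOO", "bar")); // false, "bar" is absent
    }
}
```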
If we look inside the String implementation to see what "foo".indexOf("") does, we arrive at this method:
public int indexOf(String str) {
    byte coder = coder();
    if (coder == str.coder()) {
        return isLatin1() ? StringLatin1.indexOf(value, str.value)
                          : StringUTF16.indexOf(value, str.value);
    }
    if (coder == LATIN1) {  // str.coder == UTF16
        return -1;
    }
    return StringUTF16.indexOfLatin1(value, str.value);
}
If we look inside of any of the called indexOf(value, str.value) methods we find a condition that says:
if the second parameter (string we are searching for) length is 0 return 0:
public static int indexOf(byte[] value, byte[] str) {
if (str.length == 0) {
return 0;
}
...
This is just defensive coding for an edge case. It is necessary because the next method that is called, which does the actual searching by comparing the bytes of the string (a string is backed by a byte array), would otherwise throw an ArrayIndexOutOfBoundsException when it reads str[0]:
public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
byte first = str[0];
...
This question is actually two questions:
Why should a string contain the empty string?
Why should the empty string be found specifically at index zero?
Answering #1:
A string contains the empty string in order to be in accordance with Set Theory, according to which:
The empty set is a subset of every set including itself.
This also means that even the empty string contains the empty string, and the following statement proves it:
assert "".indexOf( "" ) == 0;
I am not sure why mathematicians have decided that it should be so, but I am pretty sure they have their reasons, and it appears that these reasons can be explained in layman's terms, as various youtube videos seem to do, (for example, https://www.youtube.com/watch?v=1nBKadtFViM) although I have not actually viewed any of those videos, because #AintNoBodyGotNoTimeFoDat.
Answering #2:
The empty string can be found specifically at index zero of any string, because why not? In other words, if not at index zero, then at which index? Index zero is as good as any other index, and index zero is guaranteed to be a valid index for all strings except for the trifling exception of the empty string.
