Java String split regex not working as expected

The following Java code will print "0". I would expect that this would print "4". According to the Java API String.split "Splits this string around matches of the given regular expression". And from the linked regular expression documentation:
Predefined character classes
. Any character (may or may not match line terminators)
Therefore I would expect "Test" to be split on each character. I am clearly misunderstanding something.
System.out.println("Test".split(".").length); //0

You're right: it is split on each character. However, the "split" character is not returned by this function, hence the resulting 0.
The important part in the javadoc: "Trailing empty strings are therefore not included in the resulting array. "
I think you want "Test".split(".", -1).length, but this will return 5 and not 4 (there are 5 "spaces": one before T, one between T and e, others for e-s and s-t, and the last one after the final t).
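To make the counts above concrete, here is a minimal sketch. The class name TrailingEmptyDemo and the helper countPieces are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

public class TrailingEmptyDemo {
    // Hypothetical helper: number of pieces produced by split with a given limit.
    static int countPieces(String input, String regex, int limit) {
        return input.split(regex, limit).length;
    }

    public static void main(String[] args) {
        // "." is a regex metacharacter, so every character of "Test" is a delimiter.
        // All five pieces are empty, and limit 0 discards trailing empty strings.
        System.out.println(countPieces("Test", ".", 0));   // 0

        // A negative limit keeps the trailing empty strings.
        System.out.println(countPieces("Test", ".", -1));  // 5

        // Escaping the dot splits on a literal '.' instead.
        System.out.println(Arrays.toString("a.b.c".split("\\."))); // [a, b, c]
    }
}
```

Note that split(regex) behaves like split(regex, 0), which is why the one-argument call in the question prints 0.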

If you want to split on a literal dot, you have to escape it with two backslashes.
Here is an example:
String[] parts = string.split("\\.",-1);

Everything is OK. "Test" is split on each character, and between those characters there is nothing left but empty strings. If you want to iterate over each character of your string, you can use the charAt and length methods.

Related

Why does Kotlin's split("") function result in a leading and trailing empty string?

I'm a Java developer who tried Kotlin and found a counterintuitive case between these two languages.
Consider the given code in Kotlin:
"???".split("") // gives ["", "?", "?", "?", ""]
and the same in Java:
"???".split("") // gives ["?", "?", "?"]
Why does Kotlin produce a leading and a trailing empty string in the resulting array? Or does Java always remove these empty strings, and I just wasn't aware of that?
I know that there is the toCharArray() method on each Kotlin String, but still, it's not very intuitive (maybe the Java developers should give up old habits from Java which they were hoping to reuse in a new language?).

This is because the Java split(String regex) method explicitly removes them:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
split(String regex, int limit) mentions:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
"" is a zero-width match. Not sure why you consider toCharArray() to not be intuitive here, splitting by an empty string to iterate over all characters is a roundabout way of doing things. split() is intended to pattern match and get groups of Strings.
PS: I checked JDK 8, 11 and 17, behavior seems to be consistent for a while now.
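The two documented rules can be checked with a short sketch, assuming Java 8 or later. The class name LeadingEmptyDemo and the helper show are hypothetical.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    // Hypothetical helper: render the result of split as a string.
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        // Zero-width match at the beginning: no leading empty string.
        // Limit 0 also drops the trailing empty string.
        System.out.println(show("???", "", 0));   // [?, ?, ?]

        // A negative limit keeps the trailing empty string.
        System.out.println(show("???", "", -1));  // [?, ?, ?, ]

        // Positive-width match at the beginning: leading empty string is kept.
        System.out.println(show(",a", ",", 0));   // [, a]
    }
}
```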

You need to filter out the first and last element:
"???".split("").drop(1).dropLast(1)
Check out this example:
"".split("") // [, ]
Splits this char sequence to a list of strings around occurrences of the specified delimiters.
See https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/split.html

Where does the split() method in Java begin matching regex to a string?

I was messing around with the split() method in Java when I came across a problem which I couldn't seem to understand. I was curious as to where exactly the split method starts to search for regex matches: at the first character, before, or after?
Given String "test":
If the split method starts before the first character then there should be an empty string before the string "test", and splitting at an empty string should return an array of length 6, but it is of length 5.
System.out.println("test".split("",-1).length);
So clearly the split method does not start before the given string.
If the split method starts at the first character given string then shouldn't splitting with a regex of "Z*" return an array of length 6 with a leading empty string as the first character is indeed not Z (hence 0 or more times)? However it returns an array of length 5.
System.out.println("test".split("Z*",-1).length);
So by induction the split method starts after the first character...
but clearly it does not since the following code works as expected:
System.out.println("test".split("t",-1).length);
Output: 3
So where exactly does the split method start searching for regex matches? Or what exactly is the gap in my reasoning?

You can read the JDK source code online. Here is split from OpenJDK 8.
String.split has a fast-path optimization for single-character delimiters, but most of the work is delegated to Pattern.split. Pattern.split has a special case for a zero-width match at the beginning of the string.
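The special case can be illustrated with a simplified re-implementation of the splitting loop. This is a sketch of the limit = -1 behavior on Java 8 or later, not the actual JDK source; the class SplitSketch and the method splitKeepAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSketch {
    // Simplified sketch of splitting with a negative limit; not the JDK code.
    static List<String> splitKeepAll(String input, String regex) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0; // start of the piece currently being collected
        while (m.find()) {
            if (index == 0 && m.start() == 0 && m.end() == 0) {
                continue; // zero-width match at the very beginning: no leading ""
            }
            parts.add(input.substring(index, m.start()));
            index = m.end();
        }
        parts.add(input.substring(index)); // remainder after the last match
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepAll("test", ""));   // [t, e, s, t, ]
        System.out.println(splitKeepAll("test", "Z*")); // [t, e, s, t, ]
        System.out.println(splitKeepAll("test", "t"));  // [, es, ]
    }
}
```

So matching does start at index 0 in all three cases. The only difference is that a zero-width match at index 0 is skipped, while the positive-width match of "t" at index 0 produces a leading empty string.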

Java String split regexp returns empty strings with multiple delimiters

I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts which included information about the second parameter of the split method. I tried negative, zero, and positive numbers, and I didn't get the result that I'm looking for (one of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't).

You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in a character class; otherwise it is treated as a range, like A-Z or 0-9.
You must use the quantifier + to match one or more delimiters.

I think your problem is just the regex itself. You should use a greedy quantifier (and escape the hyphen so that it is not read as a range):
"[\\s,.\\-'\\[\\]()]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times

Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
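The effect of the quantifier can be seen in a small sketch. To keep the word "isn't" intact, it uses a reduced delimiter class (whitespace and comma only), which is an assumption made for the example; QuantifierDemo is a hypothetical class name.

```java
import java.util.Arrays;

public class QuantifierDemo {
    public static void main(String[] args) {
        String s = "nice day, isn't it";

        // Without '+': the comma and the following space are two separate
        // delimiters, so an empty string appears between them.
        System.out.println(Arrays.toString(s.split("[\\s,]")));
        // [nice, day, , isn't, it]

        // With '+': a run of delimiters is treated as a single separator.
        System.out.println(Arrays.toString(s.split("[\\s,]+")));
        // [nice, day, isn't, it]
    }
}
```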

If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter.on(',')
        .trimResults()
        .omitEmptyStrings()
        .split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
    Splitter
        .onPattern("([,.|])")
        .trimResults()
        .omitEmptyStrings()
        .split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
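If adding Guava is not an option, a similar result can be sketched with the JDK alone by trimming and filtering the output of String.split. This is a swapped-in alternative, not the Splitter API; the class OmitEmptyDemo and the method splitNonEmpty are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OmitEmptyDemo {
    // Hypothetical helper: split, trim each token, and drop empty tokens.
    static List<String> splitNonEmpty(String input, String regex) {
        return Arrays.stream(input.split(regex))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitNonEmpty("foo,bar,, qux", ","));        // [foo, bar, qux]
        System.out.println(splitNonEmpty("foo|bar,, qux.hi", "[,.|]")); // [foo, bar, qux, hi]
    }
}
```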

Java split by alphabetic char creates an empty value in array

I want to split my string on every occurrence of an alphabetic character.
for example:
"s1l1e13" to an array of: ["s1","l1","e13"]
when trying to use this simple split by regex i get some weird results:
testStr = "s1l1e13"
Arrays.toString(testStr.split("(?=[a-z])"))
gives me the array of:
["","s1","l1","e13"]
how can i create the split without the empty array element?
I tried a couple more things:
testStr = "s1"
Arrays.toString(testStr.split("(?=[a-z])"))
does return the correct array: ["s1"]
but when trying to use substring
testStr = "s1l1e13"
Arrays.toString(testStr.substring(1).split("(?=[a-z])"))
I get in return ["1","l1","e13"]
what am i missing?

Your Lookahead marks each position before any character of a to z; marking the following positions:
s1 l1 e13
^ ^ ^
So by splitting using just the Lookahead, it returns ["", "s1", "l1", "e13"]
You can use a Negative Lookbehind here. It looks behind to check that the position is not the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]

Your problem is that (?=[a-z]) means "place before [a-z]" and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately, correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
1. Make sure that there is something before the place you want to split on (so that it is not at the start of your string). You can use, for instance, (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter.
2. (PREFERRED SOLUTION) Start using Java 8, which automatically removes the empty string at the start of the result array if the regex used in split has a zero-length match there (look-arounds are zero-length).
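Both options can be tried with the sketch below. The second call assumes Java 8 or later; on Java 7 and earlier it would include a leading empty string. LookaroundDemo is a hypothetical class name.

```java
import java.util.Arrays;

public class LookaroundDemo {
    public static void main(String[] args) {
        String s = "s1l1e13";

        // Option 1: require a digit before the split position, so the start
        // of the string can never be a split point.
        System.out.println(Arrays.toString(s.split("(?<=\\d)(?=[a-z])")));
        // [s1, l1, e13]

        // Option 2 (Java 8+): the zero-width match at index 0 does not
        // produce a leading empty string.
        System.out.println(Arrays.toString(s.split("(?=[a-z])")));
        // [s1, l1, e13]
    }
}
```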

The first match is at the very start of the string. (?=[a-z]) is a zero-width lookahead: it does not consume any characters, it only checks that a letter follows. The "s" at the beginning is a letter, so the regex matches at position 0, and splitting there produces the empty string "".
If you want the regex to match something always, use ".+(?=[a-z])"

The problem is that the initial "s" counts as an alphabetic character. So, the regex is trying to split at s.
The issue is that there is nothing before the s, so the regex engine instead decides to show that there is nothing by adding an empty element. It'll do the same thing at the end if you ended with "s" (or any other letter).
If this is the only string you're splitting, or if every array you had starts with a letter but does not end with one, just truncate the array to omit the first element. Otherwise, you'll probably need to loop through each array as you make it so that you can drop empty elements.

So it seems your matches have the pattern x###, where x is a letter and # is a digit.
I'd make the following Regex:
([a-z][0-9]+)
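Matching the tokens instead of splitting around them could look like the sketch below. It uses the regex without the capturing group, since the whole match is what is collected; the class MatchDemo and the method tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchDemo {
    // Hypothetical helper: collect every "letter followed by digits" token.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z][0-9]+").matcher(input);
        while (m.find()) {
            result.add(m.group()); // the whole match, e.g. "s1"
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Because nothing is split, there is no empty leading element to worry about on any Java version.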

leading whitespace while using string.split()

Here's the code that I am using to split the string:
String[] str = "1234".split("");
System.out.println(str.length); // this gives 5
There is an extra whitespace added just before 1, i.e. str[0]=" "
How can I split this string without the leading whitespace?

Try this:
char[] str = "1234".toCharArray();
System.out.println(str.length) ;

You are not actually getting a leading whitespace element, you are getting a leading empty string, as can be seen from printing the whole array:
System.out.println(java.util.Arrays.toString("1234".split("")));
Output:
[, 1, 2, 3, 4]
The rationale for that behavior is the fact that "1234".indexOf("") is 0. I.e., the delimiter string matches at the beginning of the searched string, and thus it creates a split there, giving an empty initial element in the returned array. The delimiter also matches at the end of the string, but you don't get an extra empty element there, because (as the String.split documentation says) "trailing empty strings will be discarded".
However, the behavior was changed for Java 8. Now the documentation also says:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
(Compare String.split documentation for Java 7 and Java 8.)
Thus, on Java 8 and above, the output from the above example is:
[1, 2, 3, 4]
For more information about the change, see: Why in Java 8 split sometimes removes empty strings at start of result array?
Anyway, it is overkill to use String.split in this particular case. The other answers are correct that to get an array of all characters you should use str.toCharArray(), which is faster and more straightforward. If you really need an array of one-character strings rather than an array of characters, then see: Split string into array of character strings.
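For comparison, a minimal sketch of both approaches, assuming Java 8 or later for the split output; CharsDemo is a hypothetical class name.

```java
import java.util.Arrays;

public class CharsDemo {
    public static void main(String[] args) {
        String s = "1234";

        // Array of characters: no regex involved.
        char[] chars = s.toCharArray();
        System.out.println(chars.length);            // 4
        System.out.println(Arrays.toString(chars));  // [1, 2, 3, 4]

        // Array of one-character strings via split (Java 8+ result).
        String[] parts = s.split("");
        System.out.println(parts.length);            // 4
        System.out.println(Arrays.toString(parts));  // [1, 2, 3, 4]
    }
}
```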

Use toCharArray() instead.

If you want to split on "", then you should use the toCharArray method of String.
String str = "1234";
System.out.println(str.toCharArray().length);

If you are using Java 7 or below, use this:
"1234".split("(?!^)")

I would just do:
str = "1 2 3 4".split(" ");
Unless of course there's a specific reason why it has to be "1234"
