leading whitespace while using string.split()

Here's the code that I am using to split the string:
String[] str = "1234".split("");
System.out.println(str.length); // this gives 5
There is an extra whitespace added just before 1, i.e. str[0] = " ".
How can I split this string without the leading whitespace?

Try this:
char[] str = "1234".toCharArray();
System.out.println(str.length) ;

You are not actually getting a leading whitespace element, you are getting a leading empty string, as can be seen from printing the whole array:
System.out.println(java.util.Arrays.toString("1234".split("")));
Output:
[, 1, 2, 3, 4]
The rationale for that behavior is the fact that "1234".indexOf("") is 0. I.e., the delimiter string matches at the beginning of the searched string, and thus it creates a split there, giving an empty initial element in the returned array. The delimiter also matches at the end of the string, but you don't get an extra empty element there, because (as the String.split documentation says) "trailing empty strings will be discarded".
However, the behavior was changed for Java 8. Now the documentation also says:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
(Compare String.split documentation for Java 7 and Java 8.)
Thus, on Java 8 and above, the output from the above example is:
[1, 2, 3, 4]
For more information about the change, see: Why in Java 8 split sometimes removes empty strings at start of result array?
Anyway, it is overkill to use String.split in this particular case. The other answers are correct that to get an array of all characters you should use str.toCharArray(), which is faster and more straightforward. If you really need an array of one-character strings rather than an array of characters, then see: Split string into array of character strings.
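As a rough sketch of the two version-independent options mentioned in this thread: going through toCharArray() and wrapping each char, or splitting on a pattern that cannot match at the start of the input. The class and method names (CharStrings, toSingleCharStrings, splitNotAtStart) are made up for illustration and do not come from any answer.

```java
import java.util.Arrays;

class CharStrings {
    // Hypothetical helper: turns "1234" into {"1", "2", "3", "4"} without split.
    // It works per UTF-16 char, so surrogate pairs are not kept together.
    static String[] toSingleCharStrings(String s) {
        char[] chars = s.toCharArray();
        String[] result = new String[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = String.valueOf(chars[i]);
        }
        return result;
    }

    // Hypothetical helper: (?!^) matches at every position except the start,
    // so no leading empty string is produced on Java 7 or on Java 8 and later.
    static String[] splitNotAtStart(String s) {
        return s.split("(?!^)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toSingleCharStrings("1234"))); // [1, 2, 3, 4]
        System.out.println(Arrays.toString(splitNotAtStart("1234")));     // [1, 2, 3, 4]
    }
}
```

If a char[] is enough, calling toCharArray() directly remains the simplest choice.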

Use toCharArray() instead.

If you want to split on "", then you should use the toCharArray method of String.
String str = "1234";
System.out.println(str.toCharArray().length);

If you are using Java 7 or below, use this:
"1234".split("(?!^)")

I would just do:
str = "1 2 3 4".split(" ");
Unless of course there's a specific reason why it has to be "1234"

Related

Why does Kotlin's split("") function result in a leading and trailing empty string?

I'm a Java developer who tried Kotlin and found a counterintuitive case between these two languages.
Consider the given code in Kotlin:
"???".split("") // gives ["", "?", "?", "?", ""]
and the same in Java:
"???".split("") // gives ["?", "?", "?"]
Why does Kotlin produce a leading and a trailing empty string in the resulting list? Or does Java always remove these empty strings, and I just wasn't aware of that?
I know that there is the toCharArray() method on each Kotlin String, but still, it's not very intuitive (maybe the Java developers should give up old habits from Java which they were hoping to reuse in a new language?).

This is because the Java split(String regex) method explicitly removes them:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
split(String regex, int limit) mentions:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
"" is a zero-width match. I'm not sure why you consider toCharArray() unintuitive here: splitting on an empty string to iterate over all characters is a roundabout way of doing things. split() is intended to match a pattern and return the groups of Strings between the matches.
PS: I checked JDK 8, 11 and 17, behavior seems to be consistent for a while now.
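A small sketch of how the limit argument changes what Java returns, assuming Java 8 or later (on Java 7 a leading empty string would also appear). The class and method names (SplitLimitDemo, pieces) are only illustrative.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: splits on the empty pattern with the given limit.
    static String[] pieces(String input, int limit) {
        return input.split("", limit);
    }

    public static void main(String[] args) {
        // Limit 0 is what split(String regex) uses: trailing empty strings are dropped.
        System.out.println(Arrays.toString(pieces("???", 0)));  // [?, ?, ?]
        // A negative limit keeps trailing empty strings; the zero-width match
        // at the start still produces no leading empty string on Java 8+.
        System.out.println(Arrays.toString(pieces("???", -1))); // [?, ?, ?, ]
    }
}
```

So even with a negative limit, Java yields one empty string fewer than the Kotlin result quoted in the question.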

You need to filter out the first and last element:
"???".split("").drop(1).dropLast(1)
Check out this example:
"".split("") // [, ]
Splits this char sequence to a list of strings around occurrences of the specified delimiters.
See https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/split.html

Where does the split() method in Java begin matching regex to a string?

I was messing around with the split() method in Java when I came across a problem which I couldn't seem to understand. I was curious as to where exactly the split method starts to search for regex matches: at the first character, before, or after?
Given String "test":
If the split method starts before the first character then there should be an empty string before the string "test", and splitting at an empty string should return an array of length 6, but it is of length 5.
System.out.println("test".split("",-1).length);
So clearly the split method does not start before the given string.
If the split method starts at the first character given string then shouldn't splitting with a regex of "Z*" return an array of length 6 with a leading empty string as the first character is indeed not Z (hence 0 or more times)? However it returns an array of length 5.
System.out.println("test".split("Z*",-1).length);
So by induction the split method starts after the first character...
but clearly it does not since the following code works as expected:
System.out.println("test".split("t",-1).length);
Output: 3
So where exactly does the split method start searching for regex matches? Or what exactly is the gap in my reasoning?

You can read the jdk source code online. Here is split from OpenJdk 8.
String.split has a fast path for simple single-character delimiters, but most of the work is delegated to Pattern.split. Pattern.split has a special case for a zero-width match at the beginning of the string.
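To make that visible, here is a hedged sketch that lists where the regex engine actually finds matches, using Matcher.find() directly. The helper name matchStarts is invented for this example. It suggests that matching does start at index 0, and that the missing leading element comes from split skipping a zero-width match there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitStartDemo {
    // Hypothetical helper: returns the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("", "test"));   // [0, 1, 2, 3, 4]
        System.out.println(matchStarts("Z*", "test")); // [0, 1, 2, 3, 4]
        System.out.println(matchStarts("t", "test"));  // [0, 3]

        // The match at index 0 is zero-width for "" and "Z*", so Java 8+ drops
        // the leading empty string; for "t" it has positive width and is kept.
        System.out.println("test".split("", -1).length);  // 5 on Java 8 and later
        System.out.println("test".split("t", -1).length); // 3
    }
}
```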

take characters after spaces in java

I have a string containing two words with 3 spaces between them, for example "Hello   Java". I need to extract "Java" from this string. I searched for regex and tokenizer approaches, but can't find a solution.

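All you have to do is split the string on the run of spaces and take the element at index 1, since Java comes right after the spaces. Because there are 3 spaces, split on one or more spaces rather than on a single space:
string.split(" +")[1];
In your case, if string is equal to Hello   Java with three spaces, this will work. (Splitting on a single " " would leave empty strings at indexes 1 and 2.)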

There are many ways you can do this, but which you choose depends on how you may expect your input to vary. If you can assume there will always be exactly 3 spaces in the string, all sequential, then just use the indexOf method to locate the first space, add 3 to that index, and take a substring with the resulting value. If you're unsure how many sequential spaces there will be, use lastIndexOf and add 1. You can also use the split method mentioned in another solution.
For instance:
String s = "Hello   Java";
s = s.substring(s.lastIndexOf(" ")+1);
System.out.println(s);
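The approaches above can be compared side by side in a short sketch. The helper names (afterLastSpace, secondToken) are hypothetical, and the input is assumed to contain exactly two words.

```java
import java.util.Arrays;

class LastWord {
    // Hypothetical helper: returns everything after the last space.
    static String afterLastSpace(String s) {
        return s.substring(s.lastIndexOf(' ') + 1);
    }

    // Hypothetical helper: splits on runs of whitespace and returns the second token.
    static String secondToken(String s) {
        return s.trim().split("\\s+")[1];
    }

    public static void main(String[] args) {
        String s = "Hello   Java"; // three spaces between the words
        System.out.println(afterLastSpace(s)); // Java
        System.out.println(secondToken(s));    // Java
        // Splitting on a single space leaves empty strings between the spaces.
        System.out.println(Arrays.toString(s.split(" "))); // [Hello, , , Java]
    }
}
```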

Java split by alphabetic char creates an empty value in array

I want to split my string on every occurrence of an alphabetic character.
For example:
"s1l1e13" to an array of: ["s1","l1","e13"]
When I try to use this simple split by regex, I get some weird results:
String testStr = "s1l1e13";
Arrays.toString(testStr.split("(?=[a-z])"));
gives me the array of:
["","s1","l1","e13"]
How can I create the split without the empty array element?
I tried a couple more things:
testStr = "s1";
Arrays.toString(testStr.split("(?=[a-z])"));
does return the correct array: ["s1"]
But when trying to use substring:
testStr = "s1l1e13";
Arrays.toString(testStr.substring(1).split("(?=[a-z])"));
I get in return ["1","l1","e13"]
What am I missing?

Your lookahead matches at each position that is followed by a character from a to z, which marks the following positions:
 s1 l1 e13
^  ^  ^
So by splitting using just the lookahead, it returns ["", "s1", "l1", "e13"]
You can use a negative lookbehind here. It looks behind to check that the position is not the beginning of the string.
String s = "s1l1e13";
String[] parts = s.split("(?<!\\A)(?=[a-z])");
System.out.println(Arrays.toString(parts)); //=> [s1, l1, e13]

Your problem is that (?=[a-z]) means "place before [a-z]", and in your text
s1l1e13
you have 3 such places. I will mark them with |
|s1|l1|e13
so split (unfortunately, correctly) produces "" "s1" "l1" "e13" and doesn't automatically remove the leading empty element for you.
To solve this problem you have at least two options:
Make sure that there is something before the place you need to split on (so it is not at the start of your string). You can use, for instance, (?<=\\d)(?=[a-z]) if you want to split after a digit but before a letter.
(PREFERRED SOLUTION) Start using Java 8, which automatically removes the empty string at the start of the result array if the regex match there is zero-length (look-arounds are zero-length).
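A minimal sketch of the first option, which should behave the same on Java 7 and on later versions because the pattern can never match at index 0. The class and method names are invented for illustration.

```java
import java.util.Arrays;

class LetterBoundarySplit {
    // Hypothetical helper: splits only where a digit is followed by a lowercase letter.
    static String[] splitBeforeLetters(String s) {
        return s.split("(?<=\\d)(?=[a-z])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBeforeLetters("s1l1e13"))); // [s1, l1, e13]
        System.out.println(Arrays.toString(splitBeforeLetters("s1")));      // [s1]
    }
}
```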

The first match finds "" to be okay because it's looking ahead for any alphabetic character. This is a zero-width lookahead, so it doesn't need to consume anything. The "s" at the beginning is alphabetic, so the regex matches at the position just before it.
If you want the regex to match something always, use ".+(?=[a-z])"

The problem is that the initial "s" counts as an alphabetic character. So, the regex is trying to split at s.
The issue is that there is nothing before the s, so split reports that nothing as an empty string at index 0 (an empty string, not null). The same thing does not happen at the end of the input: the lookahead cannot match after the last character, and split discards trailing empty strings anyway.
If every string you split starts with a letter, just drop the first element of the resulting array. Otherwise, check each result and drop the leading element only when it is empty.

So it seems your matches have the pattern x###, where x is a letter and each # is a digit.
I'd make the following Regex:
([a-z][0-9]+)
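One way this regex could be applied is to collect the matches with a Matcher instead of splitting. This is only a sketch; the names TokenMatcher and tokens are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatcher {
    // One lowercase letter followed by one or more digits.
    private static final Pattern TOKEN = Pattern.compile("[a-z][0-9]+");

    // Hypothetical helper: collects every token found in the input, in order.
    static List<String> tokens(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("s1l1e13")); // [s1, l1, e13]
    }
}
```

Matching the tokens directly avoids the leading empty string entirely, since nothing is being split.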

Java String split regex not working as expected

The following Java code will print "0". I would expect that this would print "4". According to the Java API String.split "Splits this string around matches of the given regular expression". And from the linked regular expression documentation:
Predefined character classes
. Any character (may or may not match line terminators)
Therefore I would expect "Test" to be split on each character. I am clearly misunderstanding something.
System.out.println("Test".split(".").length); //0

You're right: it is split on each character. However, the "split" character is not returned by this function, hence the resulting 0.
The important part in the javadoc: "Trailing empty strings are therefore not included in the resulting array."
I think you want "Test".split(".", -1).length, but this will return 5 and not 4 (there are 5 "spaces": one before T, one between T and e, others for e-s and s-t, and the last one after the final t).

You have to escape the dot with two backslashes so that it is treated as a literal dot, and it should work fine.
Here is an example:
String[] parts = string.split("\\.",-1);
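The difference between the unescaped and the escaped dot can be seen in a short sketch. The input "a.b.c" and the method names are assumptions chosen for illustration. Pattern.quote is a standard alternative to escaping by hand.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DotSplitDemo {
    // Hypothetical helper: splits on a literal dot using a hand-written escape.
    static String[] splitOnLiteralDot(String s) {
        return s.split("\\.");
    }

    // Hypothetical helper: same result, letting Pattern.quote do the escaping.
    static String[] splitOnQuotedDot(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        // Unescaped "." matches every character, so every piece is empty.
        System.out.println("Test".split(".").length);     // 0
        System.out.println("Test".split(".", -1).length); // 5
        System.out.println(Arrays.toString(splitOnLiteralDot("a.b.c"))); // [a, b, c]
        System.out.println(Arrays.toString(splitOnQuotedDot("a.b.c")));  // [a, b, c]
    }
}
```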

Everything is OK. "Test" is split on each character, so the pieces between the characters are empty. If you want to iterate over each character of your string, you can use the charAt and length methods.
