How to split a string with any whitespace chars as delimiters - java

What regex pattern would I need to pass to java.lang.String.split() to split a String into an array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?

Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because the Java compiler first interprets backslash escapes in the string literal, before the regex engine ever sees the pattern. What you want the regex engine to receive is the literal \s, which means you need to write "\\s" in your source code. It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
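A minimal sketch of the approach above. The class and method names are illustrative and not part of the original answer. One detail worth knowing: when the input starts with whitespace, split("\\s+") returns an empty first element, so this sketch trims the input first.

```java
import java.util.Arrays;

final class WhitespaceSplitDemo {
    // Hypothetical helper: trims first so that leading whitespace does not
    // produce an empty first element, and returns an empty array for blank input.
    static String[] splitOnWhitespace(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // A space followed by a tab is treated as one delimiter.
        System.out.println(Arrays.toString(splitOnWhitespace("Hello \tWorld"))); // [Hello, World]
        // Without trimming, leading whitespace yields an empty first element.
        System.out.println(Arrays.toString(" Hello World".split("\\s+")));       // [, Hello, World]
    }
}
```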

In most regex dialects there is a set of convenient shorthand character classes you can use for this kind of thing. These are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.

To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)

"\\s+" should do the trick

You may also have a Unicode non-breaking space (\xA0) in the input, which \s does not match by default:
String[] elements = s.split("[\\s\\xA0]+"); // also split on the Unicode non-breaking space
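To illustrate the difference, here is a small sketch comparing the two patterns on an input that contains a non-breaking space. The class and method names are made up for this example.

```java
import java.util.Arrays;

final class NonBreakingSpaceDemo {
    // Plain \s: by default this does not treat U+00A0 as whitespace.
    static String[] splitPlain(String s) {
        return s.split("\\s+");
    }

    // \s plus an explicit non-breaking space, as in the answer above.
    static String[] splitWithNbsp(String s) {
        return s.split("[\\s\\xA0]+");
    }

    public static void main(String[] args) {
        String s = "one\u00A0two three";
        System.out.println(splitPlain(s).length);                  // 2
        System.out.println(Arrays.toString(splitWithNbsp(s)));     // [one, two, three]
    }
}
```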

String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");

Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.

All you need is to split on one of the predefined character classes of the Java regex engine,
namely the whitespace character class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space, a tab character, or anything similar.
So, if you try something like this:
String theString = "Java<a space><a tab>Programming";
String[] allParts = theString.split("\\s+");
You will get the desired output.
Some Very Useful Links:
Split() method Best Examples
Regexr
split-Java 11
RegularExpInfo
PatternClass
Hope, this might help you the best!!!

To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS that enables \s shorthand character class to match any characters from the whitespace Unicode category.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
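If the same split runs many times, the flag can also be passed to a precompiled Pattern instead of being embedded as (?U). This is only a sketch; the class, field and method names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class UnicodeWhitespaceDemo {
    // Equivalent to "(?U)\\s+", compiled once and reused.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] split(String s) {
        return UNICODE_WS.split(s);
    }

    public static void main(String[] args) {
        // Tab, space and non-breaking space are all treated as delimiters.
        System.out.println(Arrays.toString(split("Hello\t World\u00A0end"))); // [Hello, World, end]
    }
}
```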

Since split() takes a regular expression, and I'm assuming you also don't want non-alphanumeric characters such as commas and dots that may be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split("[\\s\\W]+")

You can split a string by line break using the following statement:
String[] textStr = yourString.split("\\r?\\n");
You can split a string by whitespace using the following statement:
String[] textStr = yourString.split("\\s+");
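A short sketch of the line-break split. The class and method names are illustrative. Note that "\\r?\\n" does not split on a lone \r; on Java 8 and later the \R linebreak matcher is one way to cover that case as well, shown here as an alternative rather than as part of the original answer.

```java
import java.util.Arrays;

final class LineSplitDemo {
    // Splits on \n and \r\n, as in the answer above.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    // Alternative (Java 8+): \R matches \r\n, \n, \r and other line terminators.
    static String[] splitAnyLineBreak(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLines("a\r\nb\nc")));          // [a, b, c]
        System.out.println(Arrays.toString(splitAnyLineBreak("a\rb\r\nc")));   // [a, b, c]
    }
}
```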

String str = "Hello World";
String res[] = str.split("\\s+");

Study this code.. good luck
import java.util.*;

class Demo {
    public static void main(String args[]) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}

Related

How to extract and replace a String with specific format?

I have input String like;
(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)
What I want to do is find all words starting with "rm" and replace them with remove function.
(remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
I am trying to use replaceAll function but I don't know how to extract parts after "rm" literal.
statement.replaceAll("\\(rm*.,", "remove($1)");
Is there any way to get these parts?
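You have not captured any substring with a capturing group, so there is no Group 1 for $1 to refer to. (In Java, referencing a group that does not exist throws an IndexOutOfBoundsException as soon as a match is found.)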
You may use
.replaceAll("\\brm(\\w*)", "remove($1)")
See the regex demo
Details
\b - a word boundary (to start matching only at the start of a word)
rm - a literal part
(\w*) - Group 1: 0+ word chars (letters, digits or underscores)
The $1 in the replacement pattern stands for Group 1 value.
If you mean to match any chars other than a comma and whitespace after rm, use "\\brm([^\\s,]*)", see this regex demo.
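Applied to the input from the question, the suggested pattern can be sketched like this. The class and method names are hypothetical.

```java
final class RmToRemoveDemo {
    // Wraps the part after a leading "rm" in remove(...), using Group 1 ($1).
    static String wrapRmWords(String statement) {
        return statement.replaceAll("\\brm(\\w*)", "remove($1)");
    }

    public static void main(String[] args) {
        String statement =
                "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
        System.out.println(wrapRmWords(statement));
        // (remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
    }
}
```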
Use "Replace" with an empty string.
E.g. (C#):
string str = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
Console.WriteLine(str.Replace("rm", ""));
Output : (01ADS21212, 'adfffddd', Adssssss, '1231232131', 2321312322)

Java Regex does not match newline

My code is as follows:
public class Test {
static String REGEX = ".*([ |\t|\r\n|\r|\n]).*";
static String st = "abcd\r\nefgh";
public static void main(String args[]){
System.out.println(st.matches(REGEX));
}
}
The code outputs false. In any other cases it matches as expected, but I can't figure out what the problem here is.
You need to remove the character class.
static String REGEX = ".*( |\t|\r\n|\r|\n).*";
You can't put \r\n inside a character class as a sequence. Inside a class it is treated as two separate items, \r and \n, so the class matches either one of them, but only a single character. As you already know, .* won't match line breaks, so the first .* matches "abcd" and the character class then matches a single character, i.e. \r. The next character is \n, which the trailing .* can't match, so the regex fails.
UPDATE:
Based on your comments, you need something like this:
.*(?:[ \r\n\t].*)+
EXPLANATION:
In plain words, it is a regex that matches a line, then 1 or more lines. Or, just a multiline text.
.* - 0 or more characters other than a newline
(?:[ \r\n\t].*)+ - a non-capturing group that matches 1 or more times a sequence of
[ \r\n\t] - either a space, or a \r or \n or \t
.* - 0 or more characters other than a newline
See demo
Original answer
You can fix your pattern 2 ways:
String REGEX = ".*(?:\r\n|[ \t\r\n]).*";
This way we match either \r\n sequence, or any character in the character class.
Or, since the character class only matches 1 character, we can add + after it to match 1 or more:
String REGEX = ".*[ \t\r\n]+.*";
See IDEONE demo
Note that it is not a good idea to use single characters in alternations, because it decreases performance; use a character class instead.
Also note that capturing groups should not be overused. If you do not plan to use the contents of the group, use non-capturing groups ((?:...)), or remove them.
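The behaviour of the three patterns discussed above can be checked side by side. This is a sketch with made-up constant names; the input string is the one from the question.

```java
final class NewlineMatchDemo {
    static final String INPUT = "abcd\r\nefgh";

    // Original pattern: the character class consumes only ONE character (\r),
    // and the trailing .* cannot match the remaining \n.
    static final String BROKEN = ".*([ |\t|\r\n|\r|\n]).*";

    // Fix 1: try the \r\n sequence first, then fall back to a single character.
    static final String ALTERNATION = ".*(?:\r\n|[ \t\r\n]).*";

    // Fix 2: let the character class match one or more characters.
    static final String CLASS_PLUS = ".*[ \t\r\n]+.*";

    public static void main(String[] args) {
        System.out.println(INPUT.matches(BROKEN));      // false
        System.out.println(INPUT.matches(ALTERNATION)); // true
        System.out.println(INPUT.matches(CLASS_PLUS));  // true
    }
}
```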

Regular expression for splitting a String while preserving whitespace

I am working on an Android project that needs to split a String into tokens while preserving whitespace, and without splitting at non-word characters like #, & etc.
Using \b splits at any non-word character, so I need a way to split the string in the following way.
Input: (. indicates whitespace)
A.A#..A##
Desired output:
A
.
A#
..
A##
So these 5 lines are the 5 values I would like in an array or similar. That means the 4th element of the result-array contains 2 spaces.
I think this is what you want:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
Debuggex Demo
Basically I'm saying "if the previous character is a non-space and the next is a space or if the previous is a space and the next is a non-space, then split".
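In Java, the pattern can be passed straight to String.split(). The sketch below uses real spaces instead of the "." placeholder; the class and method names are illustrative only.

```java
import java.util.Arrays;

final class KeepWhitespaceSplitDemo {
    // Splits at every boundary between a whitespace run and a non-whitespace run.
    // The lookarounds are zero-width, so no characters are consumed.
    static String[] splitKeepingWhitespace(String s) {
        return s.split("(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)");
    }

    public static void main(String[] args) {
        String[] tokens = splitKeepingWhitespace("A A#  A##");
        System.out.println(tokens.length);          // 5
        System.out.println(Arrays.toString(tokens));
    }
}
```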
Use StringTokenizer:
StringTokenizer st = new StringTokenizer("A.A#..A##", "."); // first argument is the string to split, second is the set of delimiter characters
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
output will be:
A
A#
A##
Try:
String s = "A.A#..A##";
s = s.replaceAll("\\.+", "."); // collapse each run of placeholders; Strings are immutable, so keep the result
String[] out = s.split("\\."); // split() takes a regex, so the dot must be escaped
This gives you the array [A, A#, A##]. Note that the whitespace tokens themselves are dropped.
Don't forget to replace the "." with actual spaces :)

what \\s matches in Java

In all the tutorials I have read, they always say that \s matches a whitespace. So why does this statement
System.out.println("line1 \n line2".replaceAll("\\s\\s*", " "));
produce this output:
line1 line2
Thanks for your response.
The string literal "\\s\\s*" is equivalent to the regular expression syntax \s\s* which matches "a whitespace character followed by zero or more whitespace characters".
A whitespace character is defined as [ \t\n\x0B\f\r], which includes spaces and newlines.
\\s matches a whitespace character, where the whitespace characters are - [ \t\n\x0B\f\r]. It's not just a space. I suspect this is what you inferred from whitespace. See Pattern class documentation.
Also, you can replace your regex \\s\\s* with just \\s+.
"\\s\\s*" is the escaped version of \s\s* which is the same of \s+
It matches one or more of any whitespace character. Whitespace characters are [ \t\n\x0B\f\r]. So it will replace each run of multiple whitespace characters with a single space.
First, this regex is a bit silly: \\s\\s* will match one or more whitespace characters, since the \\s character class matches all whitespace.
But, it could be expressed easier as \\s+, which accomplishes the exact same thing.
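A quick sketch showing that both spellings collapse the " \n " run in the question's string to a single space. The class and method names are assumptions for the example.

```java
final class CollapseWhitespaceDemo {
    // Replaces every run of whitespace (spaces, tabs, newlines, ...) with one space.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "line1 \n line2";
        System.out.println(collapse(input));                      // line1 line2
        // The original pattern from the question behaves the same way.
        System.out.println(input.replaceAll("\\s\\s*", " "));     // line1 line2
    }
}
```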

What does regular expression \\s*,\\s* do?

I am wondering what this line of code does to a url that is contained in a String called surl?
String[] stokens = surl.split("\\s*,\\s*");
Let's pretend this is the surl = "http://myipaddress:8080/Map/MapServer.html"
What will stokens be?
That regex "\\s*,\\s*" means:
\s* any number of whitespace characters
a comma
\s* any number of whitespace characters
which will split on commas and consume any spaces either side
\s stands for "whitespace character".
It includes [ \t\n\x0B\f\r]. That is, \s matches a space ( ), a tab (\t), a line break (\n), a vertical tab (\x0B, sometimes referred to as \v), a form feed (\f), or a carriage return (\r).
\\s*,\\s*
It says zero or more occurrence of whitespace characters, followed by a comma and then followed by zero or more occurrence of whitespace characters.
These are called short hand expressions.
You can find similar regex in this site: http://www.regular-expressions.info/shorthand.html
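For the URL in the question, which contains no comma, the split simply returns the whole string as a single element. The sketch below shows that alongside a comma-separated input; the class and method names are illustrative.

```java
import java.util.Arrays;

final class CommaSplitDemo {
    // Splits on commas and swallows any whitespace on either side of each comma.
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String surl = "http://myipaddress:8080/Map/MapServer.html";
        System.out.println(splitOnCommas(surl).length);                   // 1
        System.out.println(Arrays.toString(splitOnCommas("a , b,c")));    // [a, b, c]
    }
}
```

Whitespace that is not next to a comma, such as leading spaces before the first element, is left in place.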
