What does regular expression \\s*,\\s* do? - java

I am wondering what this line of code does to a URL that is contained in a String called surl.
String[] stokens = surl.split("\\s*,\\s*");
Let's pretend that surl = "http://myipaddress:8080/Map/MapServer.html".
What will stokens be?

That regex "\\s*,\\s*" means:
\s* any number of whitespace characters
a comma
\s* any number of whitespace characters
which will split on commas and consume any spaces either side
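To answer the question directly: the example URL contains no comma, so the pattern never matches and split returns a one-element array holding the whole string. A minimal sketch of that behaviour (the class name CommaSplitDemo and the second, comma-separated input are made up for illustration):

```java
import java.util.Arrays;

class CommaSplitDemo {
    // Splits on commas and discards any whitespace around each comma.
    static String[] splitOnCommas(String input) {
        return input.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String surl = "http://myipaddress:8080/Map/MapServer.html";
        // No comma in the URL, so the array has exactly one element.
        System.out.println(Arrays.toString(splitOnCommas(surl)));
        // prints: [http://myipaddress:8080/Map/MapServer.html]

        // Hypothetical comma-separated input to show the whitespace being consumed.
        System.out.println(Arrays.toString(splitOnCommas("a , b,c ,  d")));
        // prints: [a, b, c, d]
    }
}
```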

\s stands for "whitespace character".
It includes [ \t\n\x0B\f\r]. That is, \s matches a space ( ), a tab (\t), a line break (\n), a vertical tab (\x0B, written \v in some regex flavors), a form feed (\f), or a carriage return (\r).
\\s*,\\s*
It says: zero or more whitespace characters, followed by a comma, followed by zero or more whitespace characters.
\s is one of the so-called shorthand character classes.
You can find similar shorthands on this site: http://www.regular-expressions.info/shorthand.html

Related

Regex expression to match on hyphens in words within sentence based on occurrences of hyphen

I am trying to match hyphens in a word, but only if the hyphen occurs in that word more than once.
So in the phrase "Step-By-Step" the hyphens would be matched whereas in the phrase "Coca-Cola", the hyphens would not be matched.
In a full sentence combining phrases "Step-By-Step and Coca-Cola" only the hyphens within "Step-By-Step" would be expected to match.
I currently have the following expression, but it matches all hyphens separated by non-digit characters regardless of the number of occurrences:
((?=\D)-(?<=\D))
I can't seem to get the quantifiers to work with this expression, any ideas?
Java Regex Solution:
(?<=-[^\s-]{0,999})-|-(?=[^\s-]*-)
PCRE Regex Solution:
Here is a way to match all hyphens in a line with more than one hyphen in PCRE:
(?:(?:^|\s)(?=(?:[^\s-]*-){2})|(?!^)\G)[^\s-]*\K-
Explanation:
[^\s-]* matches zero or more characters that are neither whitespace nor a hyphen
(?=(?:[^\s-]*-){2}) is lookahead to make sure there are at least 2 hyphens in a non-whitespace substring
\G asserts position at the end of the previous match or the start of the string for the first match
\K resets the start of the reported match, so only the hyphen that follows it is returned
This matches at least two words each followed by hyphen, followed by another word (I'm assuming you don't want to allow hyphen at the very beginning or end, only between words).
(\w+-){2,}\w+
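As a quick check of the Java lookaround pattern from the first answer, here is a small sketch that replaces every matched hyphen with an underscore so the matches are visible. The class and method names are made up for illustration.

```java
class HyphenDemo {
    // A hyphen that has another hyphen earlier or later in the same
    // whitespace-delimited word. Java needs a bounded lookbehind, hence {0,999}.
    static final String REPEATED_HYPHEN = "(?<=-[^\\s-]{0,999})-|-(?=[^\\s-]*-)";

    // Replaces each matched hyphen with '_' so the matches stand out.
    static String markRepeatedHyphens(String text) {
        return text.replaceAll(REPEATED_HYPHEN, "_");
    }

    public static void main(String[] args) {
        System.out.println(markRepeatedHyphens("Step-By-Step and Coca-Cola"));
        // prints: Step_By_Step and Coca-Cola
    }
}
```

Only the two hyphens in Step-By-Step are replaced; the single hyphen in Coca-Cola has no neighbouring hyphen in the same word, so neither alternative matches it.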

RegEx for combining multiple sequences

Like many people, I am struggling with what seems to be a "trivial" regex issue.
In a given text, whenever I encounter a word within {} brackets, I need to extract it. At first I used
"\\{-?(\\w{3,})\\}"
and it worked fine, as long as the word didn't contain any whitespace or a special character such as '.
For example, {Project} returns Project, but {Project Test} or {Project D'arce} don't return anything.
I know that I need to use \s for whitespace characters, but it is not at all clear to me how to add it to the pattern above. I tried:
"%\\{-?(\\w(\\s{3,})\\)\\}"))
but it does not work. Also, what if I want to match words containing special characters like '? It's really frustrating.
How about matching any character inside {..} which is not }?
To do so, you can use a negated character class [^..], such as [^}]. So your regex can look like
"\\{[^}]{3,}\\}"
But if you want to limit your regex only to some specific alphabet you can also use character class to combine many characters and even predefined shorthand character classes like \w \s \d and so on.
So if you want to accept any word character \w or whitespace \s or ' your regex can look like
"\\{[\\w\\s']{3,}\\}"
You could use a character class [\w\s'] and add to it whatever else you want to allow:
\{-?([\w\s']{3,})}
In Java
String regex = "\\{-?([\\w\\s']{3,})}";
If you want to prevent a match that consists only of whitespace chars (for example three spaces between the braces), you could use a repeating group:
\{-?\h*([\w']{3,}(?:\h+[\w']+)*)\h*}
About the pattern
\{ Match { char
-? Optional hyphen
\h* Match 0+ times a horizontal whitespace char
( Start the capture group
[\w']{3,} Match 3 or more times either a word char or '
(?:\h+[\w']+)* Repeat 0+ times matching 1+ horizontal whitespace chars followed by 1+ chars listed in the character class
) Close the capture group
\h* Match 0+ times a horizontal whitespace char
} Match }
In Java
String regex = "\\{-?\\h*([\\w']{3,}(?:\\h+[\\w']+)*)\\h*}";
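To see the character-class approach in action, here is a minimal sketch that collects the captured text of every braced word. It uses the simpler [\w\s'] pattern with the closing brace escaped (equivalent in Java, just more explicit); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceExtractDemo {
    // Group 1 captures 3+ word chars, whitespace chars or apostrophes between { and }.
    private static final Pattern BRACED = Pattern.compile("\\{-?([\\w\\s']{3,})\\}");

    // Returns the content of every {...} group that the pattern accepts.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACED.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("{Project} {Project Test} {Project D'arce} {ab}"));
        // prints: [Project, Project Test, Project D'arce]
    }
}
```

{ab} is skipped because it has fewer than three characters between the braces.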

what \\s matches in Java

In all the tutorials I have read, they always say that \s matches a whitespace. So why does this statement
System.out.println("line1 \n line2".replaceAll("\\s\\s*", " "));
produce this output:
line1 line2
Thanks for your response.
The string literal "\\s\\s*" is equivalent to the regular expression syntax \s\s* which matches "a whitespace character followed by zero or more whitespace characters".
A whitespace character is defined as [ \t\n\x0B\f\r], which includes spaces and newlines.
\\s matches a whitespace character, where the whitespace characters are - [ \t\n\x0B\f\r]. It's not just a space. I suspect this is what you inferred from whitespace. See Pattern class documentation.
Also, you can replace your regex \\s\\s* with just \\s+.
"\\s\\s*" is the escaped version of \s\s*, which is the same as \s+
It matches one or more of any whitespace char. Whitespace chars are [ \t\n\x0B\f\r]. So it will replace each run of multiple whitespace chars with a single space.
First, this regex is a bit silly: \\s\\s* will match one or more whitespace characters, since the \\s character class matches all whitespace.
But it could be expressed more simply as \\s+, which accomplishes the exact same thing.
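A small sketch confirming that both spellings collapse the space, newline, space run from the question into a single space. The class and method names are made up for illustration.

```java
class CollapseWhitespaceDemo {
    // Replaces every match of the given regex with a single space.
    static String collapse(String text, String regex) {
        return text.replaceAll(regex, " ");
    }

    public static void main(String[] args) {
        String input = "line1 \n line2"; // space, newline, space between the words
        System.out.println(collapse(input, "\\s\\s*")); // prints: line1 line2
        System.out.println(collapse(input, "\\s+"));    // prints: line1 line2
    }
}
```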

word boundary that rejects leading/end non-alphanumeric character

Right now I'm learning regular expressions in Java and I have a question about word boundaries. While looking into word boundaries in Java regular expressions, I found \b, which accepts a word bordered by non-word characters, so this regex
\b123\b
will accept the string 123 456 but will reject 456123456. Now I have found that a string like !$###%123^^%$# or "123" is still accepted by the regex above. Is there a word boundary or pattern that rejects a word bordered by non-alphanumeric characters (other than a space), as in the examples above?
You want to use \s instead of \b. That will look for a whitespace character rather than a word boundary.
If you want your first example of 123 456 to be a match, however, then you will also need to use anchors to accept 123 at the immediate start or end of the string. This can be accomplished via (\s|^)123(\s|$). The caret ^ matches the start of the string and $ matches the end of the string.
(?<!\S)123(?!\S)
(?<!\S) matches a position that is not preceded by a non-whitespace character. (negative lookbehind)
(?!\S) matches a position that is not followed by a non-whitespace character. (negative lookahead)
I know this seems gratuitously complicated, but that's because \b conceals a lot of complexity. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...meaning a position that's preceded by a word character and not followed by one, or a position that's followed by a word character and not preceded by one.
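To compare the two approaches on the strings from the question, here is a minimal sketch. The class, constant, and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b123\\b");
    // 123 must not touch any non-whitespace character on either side.
    static final Pattern SPACE_DELIMITED = Pattern.compile("(?<!\\S)123(?!\\S)");

    // True if the pattern matches anywhere in the string.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = { "123 456", "456123456", "!$###%123^^%$#", "\"123\"" };
        for (String s : samples) {
            System.out.println(s + " -> \\b: " + found(WORD_BOUNDARY, s)
                    + ", lookarounds: " + found(SPACE_DELIMITED, s));
        }
        // 123 456        -> \b: true,  lookarounds: true
        // 456123456      -> \b: false, lookarounds: false
        // !$###%123^^%$# -> \b: true,  lookarounds: false
        // "123"          -> \b: true,  lookarounds: false
    }
}
```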

How to split a string with any whitespace chars as delimiters

What regex pattern would I need to pass to java.lang.String.split() to split a String into an array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash must be escaped, because Java first processes escape sequences in the string literal and only then passes the result on to be parsed as a regex. What you want is the literal \s, which means you need to write "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
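A short sketch of "\\s+" in use, including one thing to watch for: if the input starts with whitespace, split keeps an empty first element, so trimming first is a common workaround. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Trims first so that leading whitespace does not produce an empty first token.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // A space followed by a tab is treated as one delimiter.
        System.out.println(Arrays.toString("Hello \tWorld".split("\\s+")));
        // prints: [Hello, World]

        // Leading whitespace yields an empty first element.
        System.out.println(Arrays.toString(" Hello World".split("\\s+")));
        // prints: [, Hello, World]

        System.out.println(Arrays.toString(splitOnWhitespace(" Hello World\n")));
        // prints: [Hello, World]
    }
}
```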
In most regex dialects there is a set of convenient shorthand character classes you can use for this kind of thing. These are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
Also, you may have a Unicode non-breaking space (\xA0)...
String[] elements = s.split("[\\s\\xA0]+"); // also split on the Unicode non-breaking space
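A minimal sketch of why the extra \xA0 matters: by default, \s does not match the non-breaking space, so a string whose words are separated only by U+00A0 is not split at all. The class and method names are made up for illustration.

```java
class NbspSplitDemo {
    // Number of tokens produced by splitting text on the given regex.
    static int countTokens(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String s = "a\u00A0b"; // 'a', a non-breaking space, 'b'
        System.out.println(countTokens(s, "\\s+"));         // prints: 1
        System.out.println(countTokens(s, "[\\s\\xA0]+"));  // prints: 2
    }
}
```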
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using one of the predefined character classes of the Java regex engine,
namely the whitespace character class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space [ ], a tab character [\t], or anything similar.
So, if you try something like this:
String theString = "Java \tProgramming"; // a space followed by a tab
String[] allParts = theString.split("\\s+");
You will get the desired output.
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS, which makes the \s shorthand character class match any character with the Unicode White_Space property.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
Since it is a regular expression, and I'm assuming you also don't want non-alphanumeric chars like commas, dots, etc. that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)
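The line above is JavaScript syntax. A rough Java equivalent, as a sketch (the class and method names are made up for illustration): since every whitespace character is also a non-word character, [\s\W]+ behaves the same as \W+.

```java
import java.util.Arrays;

class NonWordSplitDemo {
    // Splits on runs of whitespace and other non-word characters.
    static String[] splitOnNonWord(String text) {
        return text.split("[\\s\\W]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnNonWord("one , two")));
        // prints: [one, two]
        System.out.println(Arrays.toString("one , two".split("\\W+")));
        // prints: [one, two]
    }
}
```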
You can split a string on line breaks with the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string on whitespace with the following statement:
String textStr[] = yourString.split("\\s+");
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code.. good luck
import java.util.*;

class Demo {
    public static void main(String args[]) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}
