What's the difference between [\s\S]*? and .*? in Java regular expressions? - java

I have developed a regular expression to identify a block of xml inside a text file. The expression looks like this (I have removed all java escape slashes to make it read easy):
<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>
Then I optimised it and replaced [\s\S]*? with .*? It suddenly stopped recognising the xml.
As far as I know, \s means all whitespace characters and \S means all non-whitespace characters (i.e. [^\s]), so [\s\S] should logically be equivalent to . I didn't use greedy quantifiers, so what could be the difference?

The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.
According to the Oracle documentation, . matches
Any character (may or may not match line terminators)
while a line terminator is any of the following:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
The two expressions are not equivalent unless the necessary flag (DOTALL) is set. Again quoting the Oracle documentation:
If UNIX_LINES mode is activated, then the only line terminators
recognized are newline characters.
The regular expression . matches any character except a line
terminator unless the DOTALL flag is specified.
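The quoted rules can be checked with a small sketch. The class and method names below are made up for illustration; only Pattern and its DOTALL and UNIX_LINES flags come from the documentation quoted above.

```java
import java.util.regex.Pattern;

public class DotLineTerminators {
    // Hypothetical helper: true if the whole input matches the regex under the given flags.
    public static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // By default, . does not match a newline, but [\s\S] does.
        System.out.println(matches("a.b", 0, "a\nb"));                  // false
        System.out.println(matches("a[\\s\\S]b", 0, "a\nb"));           // true

        // With DOTALL, . matches line terminators as well.
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));     // true

        // With UNIX_LINES, only the newline counts as a line terminator,
        // so . matches a carriage return but still not a newline.
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\rb")); // true
        System.out.println(matches("a.b", Pattern.UNIX_LINES, "a\nb")); // false

        // The line separator U+2028 is also a line terminator by default.
        System.out.println(matches("a.b", 0, "a\u2028b"));              // false
    }
}
```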

Here is a sheet explaining all the regex commands.
Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (the DOTALL flag needs to be set for that).

It is the same as in JavaScript, where . also does not match line terminators by default, although I am not used to Java myself.

Related

Difficulties finding a Java regex equivalent to a JavaScript regex

So, what I am trying to do is:
I have a string:
Special Skills:
someText
could range
through multiple lines
Special Abilities:
another
someText
Background:
multiline
text
I've already managed to come up with the following regex. It works perfectly in JavaScript according to regexr.com, but not in Java, according to Intellij's built-in Check-Regex and freeformatter.com.
Special Abilities:\n(.*\n)+?(Special Skills:|Background:)
The expression should, first off, extract
Special Skills:
someText
could range
through multiple lines
Mind that both sections, "Special Abilities" and "Background", are optional.
Since I am kind of stuck here, any help would be greatly appreciated!
You may add the end-of-string (or end-of-line) anchor $ as an additional alternative in the alternation group at the end of the pattern, make sure the . matches carriage returns by using the (?d) embedded flag (Pattern.UNIX_LINES), and wrap (.*\n)+? in a capturing group so that all the text it matches lands in Group 1 (the inner (.*\n) can then become a non-capturing group):
(?d)Special Abilities:\r?\n((?:.*\n)*?)(Special Skills:|Background:|$)
See this regex demo.
Details
(?d) - . now matches any char but a newline
Special Abilities: - a literal text
\r?\n - a CRLF or LF line ending
((?:.*\n)*?) - Group 1: zero or more, but as few as possible, repetitions of 0+ chars other than an LF symbol and then an LF symbol
(Special Skills:|Background:|$) - either of the three alternatives: Special Skills:, Background: or end of string ($).
An alternative expression:
(?ms)Special Abilities:\r?\n(.*?)(^Special Skills:|^Background:|\Z)
See this regex demo
Here, (?ms) defines the multiline and dotall modes (^ will match start of a line here and . will match all symbols). Instead of $, we need to use \Z - end of string anchor.
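As a rough sketch of how the first pattern might be used from Java code (the class name, method name and sample input are assumptions for illustration; the regex itself is the (?d) variant from the answer):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {
    // (?d) = UNIX_LINES: only \n is a line terminator, so . also matches \r.
    private static final Pattern ABILITIES = Pattern.compile(
        "(?d)Special Abilities:\\r?\\n((?:.*\\n)*?)(Special Skills:|Background:|$)");

    // Hypothetical helper: returns the body of the "Special Abilities" section,
    // or null if the section is not found.
    public static String extractAbilities(String text) {
        Matcher m = ABILITIES.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Special Skills:\nsomeText\ncould range\nthrough multiple lines\n"
                    + "Special Abilities:\nanother\nsomeText\n"
                    + "Background:\nmultiline\ntext";
        // Prints the two lines between "Special Abilities:" and "Background:".
        System.out.print(extractAbilities(text));
    }
}
```

Group 1 keeps the trailing line break of each captured line, so callers may want to trim the result.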

Regular expression to match '\n' character

I have a string "<?xml version=2.0><rss>Feed</rss>" and wrote a regex to match it:
"<?xml.*<rss.*</rss>"
But if the input string contains \n, like "\nFeed", the above regex doesn't match.
How can I modify my regex to allow a \n character between the strings?
The matching behavior of a dot can be controlled with a flag. In Java the default matching behavior for the dot is any character except line terminators, which include \r and \n (and also \u0085, \u2028 and \u2029).
I'm not a Java programmer, but usually using (?s) at beginning of a search string changes the matching behavior for a dot to any character including line terminators. So perhaps "(?s)<?xml.*<rss.*</rss>" works.
But better would be here to use "<?xml.*?<rss[\s\S]*?</rss>" as search string.
\s matches any whitespace character which includes line terminators and \S matches any non whitespace character. Both in square brackets results in matching any character.
For completeness: [\w\W] also always matches any character.
You can combine it with (\\n)*. It is necessary to add an extra \ because it is a special character.
Another option is to execute replaceAll("\\n","") before executing the regex.
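A quick way to compare the suggestions is sketched below. The class and helper names are invented for the example, and the ? after < is escaped here on the assumption that a literal "<?xml" was intended (unescaped, the ? just makes the < optional).

```java
import java.util.regex.Pattern;

public class RssMatch {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String feed = "<?xml version=2.0><rss>\nFeed</rss>";

        // Plain dot cannot cross the newline inside <rss>...</rss>.
        System.out.println(found("<\\?xml.*<rss.*</rss>", feed));          // false

        // (?s) turns on DOTALL, so the dot matches the newline too.
        System.out.println(found("(?s)<\\?xml.*<rss.*</rss>", feed));      // true

        // [\s\S] matches any character regardless of flags.
        System.out.println(found("<\\?xml.*?<rss[\\s\\S]*?</rss>", feed)); // true
    }
}
```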

Using Unicode regular expressions in Java to match any Unicode character

I am trying to use the Java regex matcher to search and replace. However, after it failed to match a certain string, I noticed that the expression ".*" seems to fail to match certain Unicode characters (in my case it was a \u2028 LINE SEPARATOR character).
This is what I have at the moment (match an XML element with any text in between):
String segSourceSearch = "<source(.?)>(.*?)</source>";
String segSourceReplace = "<source$1>$2</source><target$1>$2</target>";
myString = myString.replaceAll(segSourceSearch, segSourceReplace);
Basically, what this is supposed to do is duplicate the element.
But how can I modify the regex (.*?) to match any Unicode character between <source> and </source>? Is there a built-in pattern in Java? If not, is there anything in ICU4J that I could use? (I haven't been able to find a regex matcher in ICU4J).
Pattern.DOTALL:
Enables dotall mode.
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
Dotall mode can also be enabled via the embedded flag expression (?s).
So the pattern you are looking for is (?s).*?. For capturing you still have to enclose it in parentheses, ((?s).*?), but you can also place the (?s) at the beginning of the entire expression to enable DOTALL mode for the whole regex.
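Applied to the code from the question, that might look like the sketch below. The class and method names and the sample input are assumptions; the search and replace strings are the question's own, with (?s) added at the front.

```java
public class DuplicateSource {
    // (?s) enables DOTALL for the whole expression, so (.*?) can span
    // line terminators such as the line separator U+2028.
    private static final String SEARCH = "(?s)<source(.?)>(.*?)</source>";
    private static final String REPLACE = "<source$1>$2</source><target$1>$2</target>";

    // Hypothetical helper: copies every <source> element into a <target> element.
    public static String duplicate(String xml) {
        return xml.replaceAll(SEARCH, REPLACE);
    }

    public static void main(String[] args) {
        String input = "<source>line one\u2028line two</source>";
        // Without (?s) the input would come back unchanged, because the dot
        // stops at the line separator and the pattern never matches.
        System.out.println(duplicate(input));
    }
}
```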

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"
\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
Also, this link could be useful. Regexes are powerful and are applicable across virtually any language, so it is worth becoming comfortable with them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based; in many languages, including Java, they match only the start and end of the whole string by default, and you have to specify a 'multiline' mode to make them match at the start and end of each line instead.
Will this string be matched on a line-by-line basis, or will it be searched for within a larger text? If line by line, you can add anchors to ensure that the pattern matches the whole string.
^(\".*\sSPECIFICWORD\")$
This says: at the start of the line, look for a double quote followed by zero or more arbitrary characters, followed by a single whitespace character, followed by the specific word, followed by a double quote at the end of the line.
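In Java specifically, ^ and $ anchor to the start and end of the whole input by default, and Pattern.MULTILINE makes them match at line boundaries. A minimal sketch (class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

public class AnchorModes {
    // Hypothetical helper: true if the regex is found in the input under the given flags.
    public static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "a\nb\nc";
        // Default: ^ only matches at the very start of the input.
        System.out.println(found("^b$", 0, text));                 // false
        // MULTILINE: ^ and $ also match around each line terminator.
        System.out.println(found("^b$", Pattern.MULTILINE, text)); // true
    }
}
```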
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1
Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote
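To check the suggestions against the examples in the question, something like the following could be used. The class and method names are hypothetical; the pattern is the [^"] variant from the first answer, and String.matches requires the whole string to match, so no explicit ^ and $ are needed here.

```java
public class QuotedWord {
    // Opening quote, any non-quote characters, one whitespace,
    // the literal word, closing quote.
    private static final String REGEX = "\"[^\"]*\\sSPECIFICWORD\"";

    // Hypothetical helper: true if the entire string has the required form.
    public static boolean isMatch(String s) {
        return s.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD\""));                 // true
        System.out.println(isMatch("RANDOMSTUFF SPECIFICWORD"));                     // false
        System.out.println(isMatch("\"RANDOMSTUFF NOTTHESPECIFICWORD\""));           // false
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\"")); // false
    }
}
```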

problem understanding a string pattern

I'm learning GWT by following this tutorial but there's something I don't quite fully understand in step 4. The following line's checking that a string matches a pattern:
if (!str.matches("^[0-9A-Z\\.]{1,10}$")) {...}
After checking the documentation for the Pattern class I understand that the characters ^ and $ represent the beginning and the end of the line, and that [...]{1,10} means that the part in brackets [...] has to be present at least once but not more than 10 times. What I don't understand is the final characters of the part in brackets. 0-9A-Z means a range of characters from 0 to 9 or from A to Z. But what does \\. mean?
It matches a dot character. Since dot has a special meaning in regexp, it must be escaped with a backslash. And because backslash has a special meaning in Java strings, it must be escaped with another backslash.
dot .
As it is a special character in regexp syntax.
Also it has two escapes as \ is a special character in java strings.
The dot "." in regex means "any character". An escaped dot "\." (written "\\." in a Java string literal) means the dot character itself (without any special regex behaviour like the unescaped dot).
So, for example, "123.ABC" could be a line that matches the given regex (line breaks etc. not included).
It matches a dot character. A double backslash '\\' simply means a single '\', as you have to escape backslashes in Java strings. So '\\.' is translated to '\.', which means match just a '.' character. If you just used '.' by itself, without escaping, it would match any character. So you have to escape it to match a literal '.' character.
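A short sketch of both points, using the pattern from the tutorial (the class and method names are invented for the example):

```java
public class SymbolCheck {
    // 1 to 10 characters, each a digit, an uppercase letter or a literal dot.
    private static final String REGEX = "^[0-9A-Z\\.]{1,10}$";

    // Hypothetical helper mirroring the check from the question.
    public static boolean isValid(String str) {
        return str.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isValid("123.ABC"));       // true
        System.out.println(isValid("abc"));           // false (lowercase)
        System.out.println(isValid("ABCDEFGHIJK"));   // false (11 characters)

        // Outside a character class the difference is easy to see:
        System.out.println("abc".matches("a.c"));     // true, . matches any character
        System.out.println("abc".matches("a\\.c"));   // false, \. matches only a dot
        System.out.println("a.c".matches("a\\.c"));   // true
    }
}
```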
