Using Unicode regular expressions in Java to match any Unicode character - java

I am trying to use the Java regex matcher to search and replace. However, after it failed to match a certain string, I noticed that the expression ".*" seems to fail to match certain Unicode characters (in my case it was a \u2028 LINE SEPARATOR character).
This is what I have at the moment (match an XML element with any text in between):
String segSourceSearch = "<source(.?)>(.*?)</source>";
String segSourceReplace = "<source$1>$2</source><target$1>$2</target>";
myString = myString.replaceAll(segSourceSearch, segSourceReplace);
Basically, what this is supposed to do is duplicate the element.
But how can I modify the regex (.*?) to match any Unicode character between <source> and </source>? Is there a built-in pattern in Java? If not, is there anything in ICU4J that I could use? (I haven't been able to find a regex matcher in ICU4J).

Pattern.DOTALL:
Enables dotall mode.
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
Dotall mode can also be enabled via the embedded flag expression (?s).
So the pattern you are looking for is (?s).*?. To capture it, you still have to enclose it in parentheses, ((?s).*?), but you can also place the (?s) at the beginning of the entire expression to enable DOTALL mode for the whole regex.
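As a minimal sketch of that suggestion, the snippet below prepends (?s) to the pattern from the question and runs it against a string containing a U+2028 LINE SEPARATOR. The class and method names (DotallSourceDemo, duplicateAsTarget) are made up for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

class DotallSourceDemo {
    // (?s) at the start enables DOTALL for the whole pattern, so "." also
    // matches line terminators such as LF or U+2028.
    static final Pattern SOURCE =
            Pattern.compile("(?s)<source(.?)>(.*?)</source>");

    // Duplicates every <source> element as a <target> element.
    static String duplicateAsTarget(String xml) {
        return SOURCE.matcher(xml)
                .replaceAll("<source$1>$2</source><target$1>$2</target>");
    }

    public static void main(String[] args) {
        String input = "<source>first\u2028second</source>";
        System.out.println(duplicateAsTarget(input));
    }
}
```

Passing Pattern.DOTALL as the second argument of Pattern.compile should behave the same as the embedded (?s) flag here.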

Related

Difficulties finding a Java regex equivalent to a JavaScript regex

So, what I am trying to do is:
I have a string:
Special Skills:
someText
could range
through multiple lines
Special Abilities:
another
someText
Background:
multiline
text
I've already managed to come up with the following regex. It works perfectly in JavaScript according to regexr.com, but not in Java, according to Intellij's built-in Check-Regex and freeformatter.com.
Special Abilities:\n(.*\n)+?(Special Skills:|Background:)
The expression should, first off, extract
Special Skills:
someText
could range
through multiple lines
Mind that both the sections "Special Abilities" and "Background" are optional.
Since I am kind of stuck here, any help would be greatly appreciated!
You may add the end-of-string (or end-of-line) anchor $ as one more alternative in the alternation group at the end of the pattern, make sure the . matches carriage returns by using the (?d) embedded flag (Pattern.UNIX_LINES), and wrap (.*\n)+? in a capturing group so that all the text it matches lands in Group 1 (the inner (.*\n) can then be changed into a non-capturing group):
(?d)Special Abilities:\r?\n((?:.*\n)*?)(Special Skills:|Background:|$)
See this regex demo.
Details
(?d) - . now matches any char but a newline
Special Abilities: - a literal text
\r?\n - a CRLF or LF line ending
((?:.*\n)*?) - Group 1: zero or more, but as few as possible, repetitions of 0+ chars other than an LF symbol followed by an LF symbol
(Special Skills:|Background:|$) - any of the three alternatives: Special Skills:, Background: or end of string ($).
An alternative expression:
(?ms)Special Abilities:\r?\n(.*?)(^Special Skills:|^Background:|\Z)
See this regex demo
Here, (?ms) defines the multiline and dotall modes (^ will match the start of a line here and . will match all symbols). Instead of $, we need to use \Z, the end-of-string anchor, because in multiline mode $ also matches at the end of every line.
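Both patterns can be tried side by side with a small sketch like the one below. The sample text is the one from the question; the class name SectionExtractDemo and the helper sectionBody are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionExtractDemo {
    // First pattern: (?d) is UNIX_LINES, so "." stops only at LF.
    static final Pattern UNIX_LINES_VERSION = Pattern.compile(
            "(?d)Special Abilities:\\r?\\n((?:.*\\n)*?)(Special Skills:|Background:|$)");

    // Alternative: (?m) lets ^ match at line starts, (?s) lets "." match anything.
    static final Pattern MULTILINE_DOTALL_VERSION = Pattern.compile(
            "(?ms)Special Abilities:\\r?\\n(.*?)(^Special Skills:|^Background:|\\Z)");

    // Returns the text captured in Group 1, or null when nothing matches.
    static String sectionBody(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Special Skills:\nsomeText\nSpecial Abilities:\nanother\nsomeText\n"
                + "Background:\nmultiline\ntext";
        System.out.println(sectionBody(UNIX_LINES_VERSION, text));
        System.out.println(sectionBody(MULTILINE_DOTALL_VERSION, text));
    }
}
```

Note that the first pattern consumes whole lines ending in LF, so a final section without a trailing newline would only be picked up by the second pattern.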

What's the difference between [\s\S]*? and .*? in Java regular expressions?

I have developed a regular expression to identify a block of xml inside a text file. The expression looks like this (I have removed all java escape slashes to make it read easy):
<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>
Then I optimised it and replaced [\s\S]*? with .*? It suddenly stopped recognising the xml.
As far as I know, \s means all white-space symbols and \S means all non white-spaced symbols or [^\s] so [\s\S] logically should be equivalent to . I didn't use greedy filters, so what could be the difference?
The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (like a newline) by default.
According to the oracle website, . matches
Any character (may or may not match line terminators)
while a line terminator is any of the following:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
The two expressions are not equivalent unless the necessary flags are set. Again quoting the oracle website:
If UNIX_LINES mode is activated, then the only line terminators
recognized are newline characters.
The regular expression . matches any character except a line
terminator unless the DOTALL flag is specified.
Here is a sheet explaining all the regex commands.
Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for that).
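A quick way to see the difference is to test a single character against each variant. This is only an illustrative sketch; DotVersusAnyCharDemo and matchesWhole are names invented for the example.

```java
import java.util.regex.Pattern;

class DotVersusAnyCharDemo {
    // True when the regex, compiled with the given flags, matches the whole input.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The single-character line terminators listed in the Pattern documentation.
        String[] terminators = {"\n", "\r", "\u0085", "\u2028", "\u2029"};
        for (String t : terminators) {
            System.out.printf("U+%04X  dot=%b  [\\s\\S]=%b  DOTALL=%b  UNIX_LINES=%b%n",
                    (int) t.charAt(0),
                    matchesWhole(".", 0, t),
                    matchesWhole("[\\s\\S]", 0, t),
                    matchesWhole(".", Pattern.DOTALL, t),
                    matchesWhole(".", Pattern.UNIX_LINES, t));
        }
    }
}
```

With UNIX_LINES only the newline character should still be rejected by the dot, which matches the quoted documentation.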
It is the same in JavaScript, although I am not used to Java.

Ignore line breaks and spaces after the >(html end tag) in the java regular expression [duplicate]

I thought it may be [.\n]+ but that doesn't seem to work?
The dot cannot be used as a wildcard inside character classes; there it only matches a literal dot.
See the option Pattern.DOTALL.
Pattern.DOTALL Enables dotall mode. In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
If you need it on just a portion of the regular expression, you use e.g. [\s\S].
Edit: While my original answer is technically correct, as ThorSummoner pointed out, it can be done more efficiently like so
[\s\S]
as compared to (.|\n) or (.|\n|\r)
Try this
((.|\n)*)
It matches all characters multiple times
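To illustrate the answers above, here is a small sketch comparing [.\n]+ from the question with the suggested alternatives. The class name CharClassDotDemo and the helper matchesWhole are assumptions made for this example.

```java
class CharClassDotDemo {
    // True when the regex matches the whole input (String.matches semantics).
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Inside [...] the dot is a literal ".", so letters are rejected.
        System.out.println(matchesWhole("[.\\n]+", "a\nb"));
        // Only dots and newlines are accepted by that class.
        System.out.println(matchesWhole("[.\\n]+", "..\n."));
        // [\s\S] and (.|\n) both accept letters and the newline.
        System.out.println(matchesWhole("[\\s\\S]+", "a\nb"));
        System.out.println(matchesWhole("(.|\\n)+", "a\nb"));
    }
}
```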

Regular expression to match '\n' character

I have a string "<?xml version=2.0><rss>Feed</rss>". I wrote a regex to match this string:
"<?xml.*<rss.*</rss>"
But if the input string contains \n, like "\nFeed", the above regex doesn't work.
How can I modify my regex to allow a \n character between the strings?
The matching behavior of the dot can be controlled with a flag. It looks like in Java the dot by default matches any character except line terminators such as \r and \n.
I'm not a Java programmer, but usually using (?s) at beginning of a search string changes the matching behavior for a dot to any character including line terminators. So perhaps "(?s)<?xml.*<rss.*</rss>" works.
But it would be better here to use "<?xml.*?<rss[\s\S]*?</rss>" as the search string.
\s matches any whitespace character, which includes line terminators, and \S matches any non-whitespace character. Putting both in square brackets results in matching any character.
For completeness: [\w\W] also always matches any character.
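The two suggestions above can be checked with a sketch like this one. One deviation from the quoted patterns, which is my assumption rather than part of the answer: the ? after < is escaped as \\? so that it matches a literal question mark, since an unescaped <? means "an optional <" and would not match the whole string. RssNewlineDemo and matchesWhole are hypothetical names.

```java
import java.util.regex.Pattern;

class RssNewlineDemo {
    // Sample input with a newline inside the <rss> element.
    static final String INPUT = "<?xml version=2.0><rss>\nFeed</rss>";

    // True when the regex matches the whole sample input.
    static boolean matchesWhole(String regex) {
        return Pattern.matches(regex, INPUT);
    }

    public static void main(String[] args) {
        // Default dot cannot cross the newline.
        System.out.println(matchesWhole("<\\?xml.*<rss.*</rss>"));
        // (?s) lets the dot match line terminators.
        System.out.println(matchesWhole("(?s)<\\?xml.*<rss.*</rss>"));
        // [\s\S] matches any character without needing a flag.
        System.out.println(matchesWhole("<\\?xml.*?<rss[\\s\\S]*?</rss>"));
    }
}
```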
You can combine it with (\\n)*. It is necessary to add an extra \ because it is a special character.
Another option is to execute replaceAll("\\n","") before executing the regex.

regular expressions using java.util.regex API- java

How can I create a regular expression to search strings with a given pattern? For example I want to search all strings that match pattern '*index.tx?'. Now this should find strings with values index.txt,mainindex.txt and somethingindex.txp.
Pattern pattern = Pattern.compile("*.html");
Matcher m = pattern.matcher("input.html");
This code is obviously not working.
You need to learn regular expression syntax. It is not the same as using wildcards. Try this:
Pattern pattern = Pattern.compile("^.*index\\.tx.$");
There is a lot of information about regular expressions here. You may find the program RegexBuddy useful while you are learning regular expressions.
The code you posted does not work because:
dot . is a special regex character. It means one instance of any character.
* means any number of occurrences of the preceding character.
therefore, .* means any number of occurrences of any character.
so you would need something like
Pattern pattern = Pattern.compile(".*\\.html.*");
The reason for the \\ is that we want to match a literal dot, even though the dot is a special regex character.
This means: match a string that starts with any number of arbitrary characters, followed by a dot, followed by html, followed by anything.
* matches zero or more occurrences of the preceding token, so if you want to match zero or more of any character, use .* instead (. matches any char).
Modified regex should look something like this:
Pattern pattern = Pattern.compile("^.*\\.html$");
^ matches the start of the string
.* matches zero or more of any char
\\. matches the dot char (if not escaped it would match any char)
$ matches the end of the string
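Putting the answers together, the wildcard *index.tx? from the question could be tested roughly as follows. This is a sketch only; WildcardToRegexDemo, isIndexFile and compiles are names invented here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class WildcardToRegexDemo {
    // Regex counterpart of the wildcard *index.tx? from the question.
    static final Pattern INDEX_FILE = Pattern.compile("^.*index\\.tx.$");

    static boolean isIndexFile(String name) {
        return INDEX_FILE.matcher(name).matches();
    }

    // Reports whether a string compiles as a regex at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"index.txt", "mainindex.txt",
                "somethingindex.txp", "index.html"}) {
            System.out.println(name + " -> " + isIndexFile(name));
        }
        // A leading * has nothing to repeat, so the wildcard is not a valid regex.
        System.out.println("*.html compiles: " + compiles("*.html"));
    }
}
```

Since Matcher.matches() already requires the whole input to match, the ^ and $ anchors are redundant in this usage; they matter when the pattern is used with find().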
