Regular expression to match '\n' character

I have a string "<?xml version=2.0><rss>Feed</rss>" and wrote this regex to match it:
"<?xml.*<rss.*</rss>"
But if the input string contains a \n, like "\nFeed", the above regex doesn't work.
How can I modify my regex to allow a \n character between the strings?

The matching behavior of the dot can be controlled with a flag. In Java, the dot by default matches any character except line terminators such as \r and \n.
I'm not a Java programmer, but putting (?s) at the beginning of a search string usually makes the dot match any character, including line terminators. So perhaps "(?s)<?xml.*<rss.*</rss>" works.
A better option here would be "<?xml.*?<rss[\s\S]*?</rss>" as the search string.
\s matches any whitespace character, which includes line terminators, and \S matches any non-whitespace character. Putting both in square brackets matches any character.
For completeness: [\w\W] also matches any character.
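As a minimal sketch of the options above, the following Java snippet compares the default dot, the (?s) flag, Pattern.DOTALL, and [\s\S]. The class name DotAllDemo and the sample input are made up for illustration. The question mark is written as <\?xml here, since an unescaped ? after < makes the < optional instead of matching a literal question mark.

```java
import java.util.regex.Pattern;

final class DotAllDemo {
    // Hypothetical sample input with a line break inside the <rss> element.
    static final String INPUT = "<?xml version=2.0><rss>\nFeed</rss>";

    // Returns true if the whole sample input matches the regex under the given flags.
    static boolean matches(String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(INPUT).matches();
    }

    public static void main(String[] args) {
        // Default mode: '.' stops at '\n', so this does not match.
        System.out.println(matches("<\\?xml.*<rss.*</rss>", 0));
        // Embedded flag (?s): '.' also matches line terminators.
        System.out.println(matches("(?s)<\\?xml.*<rss.*</rss>", 0));
        // Same effect through the compile-time flag.
        System.out.println(matches("<\\?xml.*<rss.*</rss>", Pattern.DOTALL));
        // [\s\S] matches any character without changing what '.' means.
        System.out.println(matches("<\\?xml.*?<rss[\\s\\S]*?</rss>", 0));
    }
}
```

The [\s\S] variant only relaxes the one place where a line break is expected, which is probably why it was suggested as the better choice.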

You can combine it with (\\n)*. The extra \ is needed because the backslash is a special character in Java string literals.
Another option is to execute replaceAll("\\n","") before executing the regex.
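A rough sketch of both suggestions, assuming the newline sits right after the <rss> tag as in the question. The class and method names are hypothetical. Note that replaceAll("\\n", "") changes the input and leaves any \r characters in place, and that (\\n)* only tolerates newlines at the exact position where it is inserted.

```java
final class StripNewlinesDemo {
    // Option 1: allow optional newlines at one known position in the pattern.
    static boolean matchesWithOptionalNewlines(String input) {
        return input.matches("<\\?xml.*<rss.*(\\n)*.*</rss>");
    }

    // Option 2: remove all '\n' characters first, then match with the plain pattern.
    static boolean matchesAfterStripping(String input) {
        String flattened = input.replaceAll("\\n", "");
        return flattened.matches("<\\?xml.*<rss.*</rss>");
    }

    public static void main(String[] args) {
        String input = "<?xml version=2.0><rss>\nFeed</rss>";
        System.out.println(matchesWithOptionalNewlines(input));
        System.out.println(matchesAfterStripping(input));
    }
}
```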

Related

How to put [] in my regex [duplicate]

I have comma separated list of regular expressions:
.{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]
I have split it on the comma. Now I'm trying to match these regexes against a generated password. The problem is that Pattern.compile does not like square brackets that are not escaped.
Can someone please give me a simple function that takes a string such as [0-9] and returns the escaped string \[0-9\]?

For some reason, the above answer didn't work for me. For those like me who come after, here is what I found.
I was expecting a single backslash to escape the bracket; however, you must use two if the pattern is stored in a Java string literal. The first backslash escapes the second one in the string literal, so what the regex engine sees is \]. Since the engine sees just one backslash, it uses it to escape the square bracket.
\\]
In regex, that will match a single closing square bracket.
If you're trying to match a newline, on the other hand, a single backslash is enough: "\n" is a string escape sequence that inserts a newline character into the string, so the regex engine doesn't see \n; it sees the newline character itself and matches that. The bracket needs two backslashes because \] is not a string escape sequence; it is a regex escape sequence.
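The two escaping levels can be checked with a small sketch like the one below (the class and method names are illustrative only). It also shows that "\\n" works for a newline, because the regex engine understands \n as an escape of its own.

```java
import java.util.regex.Pattern;

final class EscapeLevelsDemo {
    // Java literal "\\]" is the two-character regex \] which matches a literal ']'.
    static boolean closingBracket() {
        return Pattern.matches("\\]", "]");
    }

    // Java literal "\n" puts a real newline character into the pattern.
    static boolean newlineViaStringEscape() {
        return Pattern.matches("\n", "\n");
    }

    // Java literal "\\n" passes backslash + n to the regex engine, which also means newline.
    static boolean newlineViaRegexEscape() {
        return Pattern.matches("\\n", "\n");
    }

    public static void main(String[] args) {
        System.out.println(closingBracket());
        System.out.println(newlineViaStringEscape());
        System.out.println(newlineViaRegexEscape());
    }
}
```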

You can use Pattern.quote(String).
From the docs:
public static String quote(String s)
Returns a literal pattern String for the specified String.
This method produces a String that can be used to create a Pattern that would match the string s as if it were a literal pattern.
Metacharacters or escape sequences in the input sequence will be given no special meaning.
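A short sketch of what Pattern.quote does with the [0-9] example from the question. The wrapper class QuoteDemo and its method names are hypothetical; the result is wrapped in \Q...\E rather than escaped bracket by bracket.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Returns the quoted form of the given text.
    static String quoted(String text) {
        return Pattern.quote(text);
    }

    // True if the input is exactly the literal text, with no metacharacters interpreted.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(quoted("[0-9]"));                    // \Q[0-9]\E
        System.out.println(matchesLiterally("[0-9]", "[0-9]")); // literal text matches
        System.out.println(matchesLiterally("[0-9]", "5"));     // a digit no longer matches
    }
}
```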

You can use the \Q and \E special characters...anything between \Q and \E is automatically escaped.
\Q[0-9]\E

Pattern.compile() likes square brackets just fine. If you take the string
".{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]"
and split it on commas, you end up with five perfectly valid regexes: the first one matches eight non-line-separator characters, the second matches an ASCII digit, and so on. Unless you really want to match strings like ".{8}" and "[0-9]", I don't see why you would need to escape anything.
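To illustrate the point, here is a hedged sketch that splits the list and checks a password against each rule without escaping anything. The class name, the method name, and the use of find() (so that each rule only has to match somewhere in the password) are assumptions, not something stated in the question. Splitting on commas would break a rule that itself contains a comma, such as .{8,}.

```java
import java.util.regex.Pattern;

final class PasswordRulesDemo {
    // The comma-separated rule list from the question.
    static final String RULES = ".{8},[0-9],[^0-9A-Za-z ],[A-Z],[a-z]";

    // True if every rule finds a match somewhere in the password.
    static boolean satisfiesAll(String password) {
        for (String rule : RULES.split(",")) {
            if (!Pattern.compile(rule).matcher(password).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesAll("Abcdef1!")); // has all five kinds of content
        System.out.println(satisfiesAll("abcdefgh")); // no digit, symbol, or upper case
    }
}
```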

What's the difference between [\s\S]? and .? in Java regular expressions?

I have developed a regular expression to identify a block of xml inside a text file. The expression looks like this (I have removed all java escape slashes to make it read easy):
<\?xml\s+version="[\d\.]+"\s*\?>\s*<\s*rdf:RDF[^>]*>[\s\S]*?<\s*\/\s*rdf:RDF\s*>
Then I optimised it and replaced [\s\S]*? with .*? and it suddenly stopped recognising the xml.
As far as I know, \s means all whitespace symbols and \S means all non-whitespace symbols, or [^\s], so [\s\S] should logically be equivalent to . (the dot). I didn't use greedy quantifiers, so what could be the difference?

The regular expressions . and [\s\S] are not equivalent, since . doesn't match line terminators (such as a newline) by default.
According to the oracle website, . matches
Any character (may or may not match line terminators)
while a line terminator is any of the following:
A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
The two expressions are not equivalent, as long as the necessary flags are not set. Again quoting the oracle website:
If UNIX_LINES mode is activated, then the only line terminators
recognized are newline characters.
The regular expression . matches any character except a line
terminator unless the DOTALL flag is specified.
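The quoted rules can be tried out with a small sketch like this one, which tests a single dot against individual line terminators under different flags. The class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

final class LineTerminatorDemo {
    // True if a single '.' matches the given one-character string under the flags.
    static boolean dotMatches(String s, int flags) {
        return Pattern.compile(".", flags).matcher(s).matches();
    }

    // True if [\s\S] matches the given one-character string.
    static boolean anyCharClassMatches(String s) {
        return Pattern.matches("[\\s\\S]", s);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("\n", 0));                  // default: not matched
        System.out.println(dotMatches("\u2028", 0));              // default: not matched
        System.out.println(dotMatches("\r", Pattern.UNIX_LINES)); // only '\n' is a terminator now
        System.out.println(dotMatches("\n", Pattern.UNIX_LINES)); // still not matched
        System.out.println(dotMatches("\n", Pattern.DOTALL));     // matched
        System.out.println(anyCharClassMatches("\n"));            // matched without any flag
    }
}
```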

Here is a sheet explaining all the regex commands.
Basically, [\s\S] will pick up all characters, including newlines, whereas . does not pick up line terminators by default (certain flags need to be set for that).

It is like in JavaScript, although I am not used to Java. Java is a programming language, and it is very useful in real life.

Escape each literal in regex string instead of quote the entire string

The answers here suggest using Pattern.quote to escape the special regex characters.
The problem with Pattern.quote is that it escapes the string as a whole, not each special character on its own.
This is my case:
I receive a string from the user, and need to search for it in a document.
Since the user can't pass newline characters (it's a bug in a 3rd party API I have no access to), I decided to treat any whitespace sequence as "\s+" and use a regex to search the document. This way the user can send a simple whitespace instead of a newline character.
For instance, if the document is:
The \s metacharacter is used to find a whitespace character.
A whitespace character can be:
A space character
A tab character
A carriage return character
A new line character
A vertical tab character
A form feed character
Then the received string
String receivedStr = "The \\s metacharacter is used to find a whitespace character. A whitespace character can be:";
should be found in the document.
To achieve this I want to quote the string, and then replace any whitespace sequence with the string "\s+".
Using the following code:
receivedStr = Pattern.quote(receivedStr).replaceAll("\\s+", "\\\\s+");
yields the regex:
\QThe\s+\s\s+metacharacter\s+is\s+used\s+to\s+find\s+a\s+whitespace\s+character.\s+A\s+whitespace\s+character\s+can\s+be:\E
that will of course treat my added "\s+"s as literal text (they sit between \Q and \E), instead of the expected:
The\s+\\s\s+metacharacter\s+is\s+used\s+to\s+find\s+a\s+whitespace\s+character.\s+A\s+whitespace\s+character\s+can\s+be:
that only escapes the "\s" literal and not the entire string.
Is there an alternative to Pattern.quote that escapes single literals instead of the whole string?

I would suggest something like this:
String re = Stream.of(input.split("\\s+"))
.map(Pattern::quote)
.collect(Collectors.joining("\\s+"));
This makes sure everything gets quoted (including stuff that otherwise would be interpreted as look-arounds and could cause exponential blowup in match finding), and any user entered whitespace ends up as unquoted \s+.
Example input:
Lorem \\b ipsum \\s dolor (sit) amet.
Output:
\QLorem\E\s+\Q\b\E\s+\Qipsum\E\s+\Q\s\E\s+\Qdolor\E\s+\Q(sit)\E\s+\Qamet.\E
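A self-contained version of the suggestion, with the imports it needs and a check against a multi-line document. The class name WhitespaceTolerantQuote and the method name toRegex are made up for this sketch.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class WhitespaceTolerantQuote {
    // Quotes every whitespace-separated token and joins the tokens with an unquoted \s+.
    static String toRegex(String input) {
        return Stream.of(input.split("\\s+"))
                .map(Pattern::quote)
                .collect(Collectors.joining("\\s+"));
    }

    // True if the user's text occurs in the document, treating any whitespace run as \s+.
    static boolean foundIn(String document, String userText) {
        return Pattern.compile(toRegex(userText)).matcher(document).find();
    }

    public static void main(String[] args) {
        String document = "The \\s metacharacter is used to find a whitespace character.\n"
                + "A whitespace character can be:";
        String received = "The \\s metacharacter is used to find a whitespace character. "
                + "A whitespace character can be:";
        System.out.println(toRegex("Lorem \\b ipsum \\s dolor (sit) amet."));
        System.out.println(foundIn(document, received));
    }
}
```

Because each token is quoted separately, the \s+ separators stay outside the \Q...\E sections and keep their regex meaning.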

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"

\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
Also, this link could be useful. Regexes are powerful expressions and are applicable across virtually any language, so it would be a good thing to become comfortable with using them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
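As a rough check of the two patterns from this answer against the examples in the question, assuming each candidate is tested as a whole string (the class, constant, and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class QuotedWordDemo {
    // Allows any characters, including quotes, before the word.
    static final String LOOSE = "\".*\\sSPECIFICWORD\"";
    // Disallows further quotes between the opening quote and the word.
    static final String STRICT = "\"[^\"]*\\sSPECIFICWORD\"";

    // True if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch(LOOSE, "\"RANDOMSTUFF SPECIFICWORD\""));
        System.out.println(wholeMatch(LOOSE, "RANDOMSTUFF SPECIFICWORD"));
        System.out.println(wholeMatch(LOOSE, "\"RANDOMSTUFF NOTTHESPECIFICWORD\""));
        System.out.println(wholeMatch(LOOSE, "\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\""));
        // The two patterns differ once a second quoted section appears.
        System.out.println(wholeMatch(LOOSE, "\"a\" \"b SPECIFICWORD\""));
        System.out.println(wholeMatch(STRICT, "\"a\" \"b SPECIFICWORD\""));
    }
}
```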

^".*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
" matches the opening quote
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based. By default they often match only at the start and end of the whole string; most languages let you specify a 'multiline' mode that makes them match at the start and end of each line instead.
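In Java, the multiline behaviour is switched on with Pattern.MULTILINE. The following sketch searches a three-line text with and without that flag, using a pattern that also requires the opening quote; the class name, method name, and sample text are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class MultilineAnchorDemo {
    static final String REGEX = "^\".*\\s+SPECIFICWORD\"$";

    // Searches the text for the pattern, optionally with ^ and $ matching at every line.
    static boolean found(String text, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        return Pattern.compile(REGEX, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "foo\n\"X SPECIFICWORD\"\nbar";
        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(found(text, false));
        // With MULTILINE, ^ and $ also match around each line break.
        System.out.println(found(text, true));
    }
}
```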

Will this string be matched on a line-by-line basis, or will it be searched for within a larger text? If it is line by line, you can add anchors to ensure that the pattern matches the whole line.
^(\".*\sSPECIFICWORD\")$
This says: at the start of the line, look for a double quote, followed by zero or more arbitrary characters, followed by a single whitespace character, followed by the specific word, followed by a double quote at the end of the line.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1

Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote
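One thing worth checking with this pattern: since [\w\s]* can end in word characters, it also accepts a longer word that merely ends in SPECIFICWORD. The sketch below shows this, together with a variant that requires whitespace before the word; the variant and all names are assumptions, not part of the answer above.

```java
import java.util.regex.Pattern;

final class WordClassDemo {
    // Pattern as given in the answer.
    static final String AS_GIVEN = "\"[\\w\\s]*SPECIFICWORD\"";
    // Hypothetical variant: require a whitespace character right before the word.
    static final String WITH_SPACE = "\"[\\w\\s]*\\sSPECIFICWORD\"";

    // True if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String wanted = "\"RANDOMSTUFF SPECIFICWORD\"";
        String unwanted = "\"RANDOMSTUFF NOTTHESPECIFICWORD\"";
        System.out.println(wholeMatch(AS_GIVEN, wanted));
        System.out.println(wholeMatch(AS_GIVEN, unwanted));   // also accepted
        System.out.println(wholeMatch(WITH_SPACE, wanted));
        System.out.println(wholeMatch(WITH_SPACE, unwanted)); // rejected
    }
}
```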

problem understanding a string pattern

I'm learning GWT by following this tutorial but there's something I don't quite fully understand in step 4. The following line's checking that a string matches a pattern:
if (!str.matches("^[0-9A-Z\\.]{1,10}$")) {...}
After checking the documentation for the Pattern class I understand that the characters ^ and $ represent the beginning and the end of the line, and that [...]{1,10} means that the part in brackets [...] has to be present at least once but not more than 10 times. What I don't understand is the final characters of the part in brackets. 0-9A-Z means a range of characters from 0 to 9 or from A to Z. But what does \\. mean?

It matches a dot character. Since dot has a special meaning in regexp, it must be escaped with a backslash. And because backslash has a special meaning in Java strings, it must be escaped with another backslash.
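A small sketch of the difference, using the pattern from the question (the class and method names are illustrative). Inside a character class the dot is already literal, so [0-9A-Z.] behaves the same as [0-9A-Z\\.]; the escape matters outside a class.

```java
final class EscapedDotDemo {
    // The pattern from the tutorial: 1 to 10 digits, upper-case letters, or dots.
    static boolean isValidSymbol(String s) {
        return s.matches("^[0-9A-Z\\.]{1,10}$");
    }

    // Same check without the escape; the dot is literal inside [...].
    static boolean isValidSymbolUnescaped(String s) {
        return s.matches("^[0-9A-Z.]{1,10}$");
    }

    public static void main(String[] args) {
        System.out.println(isValidSymbol("123.ABC"));     // digits, a dot, upper-case letters
        System.out.println(isValidSymbol("abc"));         // lower case is not in the class
        System.out.println(isValidSymbol("12345678901")); // 11 characters is too long
        // Outside a character class the escape changes the meaning.
        System.out.println("abc".matches("a.c"));   // unescaped dot matches any character
        System.out.println("abc".matches("a\\.c")); // escaped dot needs a literal '.'
        System.out.println("a.c".matches("a\\.c"));
    }
}
```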

It is a dot (.).
It is escaped because the dot is a special character in regexp syntax.
It has two backslashes because \ is itself a special character in Java strings.

The dot "." in regex means "any character". An escaped dot "\." (written "\\." in a Java string) means the dot character itself (without any special regex behaviour like the unescaped dot).
So, for example, "123.ABC" could be a line that matches the given regex (line breaks etc. not included).

It matches a dot character. A double backslash '\\' simply means a single '\', as you have to escape '\'s in Java strings. So '\\.' is translated to '\.', which means match just a '.' character. If you just used '.' by itself, without escaping, it would match any character. So you have to escape it to match a '.' character.
