Regular expression does not match newline when escaped with backslash [duplicate] - java

I am trying out the following code, and it prints false.
I expected it to print true.
In addition, the Pattern.compile() statement gives a warning: 'redundant escape character'.
Can someone please explain why this does not return true and why I see the warning?
public static void main(String[] args) {
    String s = "\\n";
    System.out.println(s);
    Pattern p = Pattern.compile("\\\n");
    Matcher mm = p.matcher(s);
    System.out.println(mm.matches());
}

The s="\\n" means you assign a backslash and n to the variable s, and it contains a sequence of two chars, \ and n.
The Pattern.compile("\\\n") means you define a regex pattern \<LF> (a backslash and a newline, line feed, char) that matches a newline (LF) char, because escaped non-word non-special chars match themselves. \, matches a ,, \; matches a ;. Thus, this pattern won't match the string in variable s.
The redundant escape warning is thrown because \<LF> matches the same newline char that can be matched with mere <LF>.
More examples:
Regex    Regex string literal    Matching text    Matching string literal
<LF>     "\n"                    <LF>             "\n"
\n       "\\n"                   <LF>             "\n"
\\n      "\\\\n"                 \n               "\\n"

Because "\\n" evaluates to a backslash (\\) and the letter n, while "\\\n" evaluates to a backslash (\\) followed by a newline (\n).
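The combinations discussed above can be checked directly. The following is a minimal sketch, not part of the original answers; the class name NewlineEscapeDemo and the helper fullMatch are made up for illustration.

```java
import java.util.regex.Pattern;

final class NewlineEscapeDemo {
    // Hypothetical helper: true when the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Regex is a raw line feed; input is a line feed.
        System.out.println(fullMatch("\n", "\n"));      // true
        // Regex is the two-char escape \n; input is a line feed.
        System.out.println(fullMatch("\\n", "\n"));     // true
        // Regex is \\n (escaped backslash, then n); input is backslash + n.
        System.out.println(fullMatch("\\\\n", "\\n"));  // true
        // The question's case: regex is backslash + line feed, which matches
        // only a line feed, so the input backslash + n does not match.
        System.out.println(fullMatch("\\\n", "\\n"));   // false
        // The same pattern does match a lone line feed.
        System.out.println(fullMatch("\\\n", "\n"));    // true
    }
}
```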

Backslashes within string literals in Java source code are interpreted as required by The Java™ Language Specification as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6). It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
Reference: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
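The "\b" versus "\\b" example from the quoted documentation can be sketched as follows. The class name BackspaceVsBoundary and the helper found are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

final class BackspaceVsBoundary {
    // Hypothetical helper: true when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\b" is turned into a backspace character (U+0008) by the compiler,
        // so the regex engine looks for a literal backspace.
        System.out.println(found("\b", "a\bb"));   // true
        System.out.println(found("\b", "word"));   // false
        // "\\b" reaches the regex engine as \b, the word-boundary construct.
        System.out.println(found("\\b", "word"));  // true
        System.out.println(found("\\b", "\b"));    // false: no word characters
    }
}
```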

Your source string s has two characters, '\' and 'n'. If you meant it to be \ followed by a line break, it should be "\\\n".
The pattern has two characters, '\' and '\n' (line break), and the \ escape character is not needed, hence the warning. If you meant \ followed by a line break, the pattern should be "\\\\\n" (two backslashes to escape it for the regex, then \n).
String s = "\\\n";
System.out.println(s);
Pattern p = Pattern.compile("\\\\\n");
Matcher mm = p.matcher(s);
System.out.println(mm.matches());

Related

Matcher.replaceAll() removes backslash even when I escape it. Java

I have functionality in my app that should replace some text in JSON (I have simplified it in the example). The replacement may contain escape sequences like \n, \b, \t, etc., which can break the JSON string when I try to build JSON with Jackson. So I decided to use Apache's solution, StringEscapeUtils.escapeJava(), to escape all escape sequences. But Matcher.replaceAll() removes the backslashes that were added by escapeJava().
Here is the code:
public static void main(String[] args) {
    String json = "{\"test2\": \"Hello toReplace \\\"test\\\" world\"}";
    String replacedJson = Pattern.compile("toReplace")
            .matcher(json)
            .replaceAll(StringEscapeUtils.escapeJava("replacement \n \b \t"));
    System.out.println(replacedJson);
}
Expected Output:
{"test2": "Hello replacement \n \b \t \"test\" world"}
Actual Output:
{"test2": "Hello replacement n b t \"test\" world"}
Why does Matcher.replaceAll() remove the backslashes, while System.out.println(StringEscapeUtils.escapeJava("replacement \n \b \t")); prints the correct output: replacement \n \b \t?
StringEscapeUtils.escapeJava("\n") allows you to transform the single newline character \n into two characters: \ and n.
\ is a special character in pattern replacements though, from https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String):
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
To have them taken as literal characters, you need to escape them via Matcher.quoteReplacement, from https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#quoteReplacement(java.lang.String):
Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement s in the appendReplacement method of the Matcher class. The String produced will match the sequence of characters in s treated as a literal sequence. Slashes (\) and dollar signs ($) will be given no special meaning.
So in your case:
.replaceAll(Matcher.quoteReplacement(StringEscapeUtils.escapeJava("replacement \n \b \t")))
If you want a literal backslash in replaceAll, you need to escape it. This is described in the Matcher.replaceAll documentation.
StringEscapeUtils.escapeJava will escape a string suitable for use in Java source code - but it won't allow you to use unescaped strings in your source code.
"replacement \n \b \t"
             ^ new line
                ^ backspace
                   ^ tab
If you want literal backslashes in a regular Java string, you need:
"replacement \\n \\b \\t"
Because this Java string is used as the replacement argument of replaceAll, where the backslash is itself an escape character, you need:
"replacement \\\\n \\\\b \\\\t"
Try:
String replacedJson = Pattern.compile("toReplace")
        .matcher(json)
        .replaceAll("replacement \\\\n \\\\b \\\\t");
You have to escape \ as well using Matcher.quoteReplacement().
public static String replaceAll(String json, String regex, String replace) {
    return Pattern.compile(regex)
            .matcher(json)
            .replaceAll(Matcher.quoteReplacement(StringEscapeUtils.escapeJava(replace)));
}
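To see the effect of Matcher.quoteReplacement without the Apache dependency, here is a minimal sketch. It assumes the replacement already contains backslash + letter pairs (as escapeJava would produce); the class name QuoteReplacementDemo and both helper methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteReplacementDemo {
    // Passes the replacement to replaceAll as-is, so \ and $ keep their
    // special meaning in the replacement string.
    static String replaceRaw(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    // Quotes the replacement first, so it is inserted character for character.
    static String replaceLiteral(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Contents: replacement \n \b \t (each escape is backslash + letter).
        String escaped = "replacement \\n \\b \\t";
        String input = "Hello toReplace world";

        // Backslashes are consumed as escapes: Hello replacement n b t world
        System.out.println(replaceRaw(input, "toReplace", escaped));
        // Backslashes survive: Hello replacement \n \b \t world
        System.out.println(replaceLiteral(input, "toReplace", escaped));
    }
}
```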

Why escaping double quote with single and triple backslashes in a Java regular expression yields identical results

I want to escape " (double quotes) in a regex.
I found out that there is no difference whether I use \\\ or \; both yield the same correct result.
Why is that? How can the first one give the correct result?
To define a " char in a string literal in Java, you need to escape it for the string parsing engine, like "\"".
The " char is not a special regex metacharacter, so you needn't escape this character for the regex engine. However, you may do it:
A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
To define a regex escape, a literal backslash is used, and it is written as a double backslash in a Java string literal, "\\":
It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler.
So, both "\"" (a literal " string) and "\\\"" (a literal \" string) form a regex pattern that matches a single " char.
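A quick way to confirm this is to compile both literals and match them against a single double quote. This is only a sketch; the class name QuoteEscapeDemo and the helper matchesQuote are made up for illustration.

```java
import java.util.regex.Pattern;

final class QuoteEscapeDemo {
    // Hypothetical helper: true when the regex matches exactly one " character.
    static boolean matchesQuote(String regex) {
        return Pattern.compile(regex).matcher("\"").matches();
    }

    public static void main(String[] args) {
        // One backslash: the string holds just ", an ordinary regex character.
        System.out.println("\"".length());          // 1
        System.out.println(matchesQuote("\""));     // true
        // Three backslashes: the string holds \", a regex-escaped quote.
        System.out.println("\\\"".length());        // 2
        System.out.println(matchesQuote("\\\""));   // true
    }
}
```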
Try to use this:
String regex = "(\"\\w+\")";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher("Some \"test\" string. And \"another\" quoted word.");
while (matcher.find()) {
    System.out.println(matcher.group());
}
Prints:
"test"
"another"

Regular expression to match escaped sequences in java

I am looking for a regex to check for all escape sequences in Java:
\b backspace
\t horizontal tab
\n linefeed
\f form feed
\r carriage return
\" double quote
\' single quote
\\ backslash
How do I write a regex and perform validation to allow words, text areas, strings, or sentences containing valid escape sequences?
This regex will match all the escape sequences you listed:
\\[btnfr"'\\]
In Java you need to double each backslash (and escape the double quote for the string literal), so the code becomes:
Pattern p = Pattern.compile("\\\\[btnfr\\\"\\'\\\\]");
if (p.matcher("\\b backspace").find()) {
    System.out.println("Contains escape sequence");
}
The following regex should meet your need:
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
as in
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
String[] strings = new String[]{"\\b", "\\t", "\\n", "\\f", "\\r", "\\\'", "\\\"", "\\\\"};
for (String s : strings) {
    System.out.println(s + " - " + pattern.matcher(s).matches());
}
To match a single \, you have to write four backslashes in a regex string literal.
In a plain Java string, "\\" stands for a single \.
When "\\" is used as a regex string, the regex engine sees a single \, which is a special character in regex and must be followed by another character to form an escape sequence.
So we need "\\\\" to match a single \, that is, the string "\\".
EDIT: There is no need to escape the single quote in the regex string. So "\\\\[\\\\btnfr\'\"]" can be replaced with "\\\\[\\\\btnfr'\"]".
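The four-backslash rule and the escape-sequence pattern above can be verified with a short sketch. The class name BackslashRegexDemo and its helper methods are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class BackslashRegexDemo {
    // Pattern from the answer above, without the unnecessary \' escape.
    static final Pattern ESCAPE = Pattern.compile("\\\\[\\\\btnfr'\"]");

    // True when the regex is syntactically valid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True when the regex matches a string holding exactly one backslash.
    static boolean matchesBackslash(String regex) {
        return Pattern.compile(regex).matcher("\\").matches();
    }

    // True when the input is exactly one of the listed two-char escapes.
    static boolean isEscapeSequence(String input) {
        return ESCAPE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\"));            // false: dangling backslash
        System.out.println(compiles("\\\\"));          // true
        System.out.println(matchesBackslash("\\\\"));  // true
        System.out.println(isEscapeSequence("\\n"));   // true: backslash + n
        System.out.println(isEscapeSequence("\\x"));   // false: x is not listed
    }
}
```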
You'll need DOTALL if you want . to match line terminators. You might also find \s handy, as it represents all whitespace, including line terminators. E.g.:
Pattern p = Pattern.compile("([\\s\"'\\\\]+)", Pattern.DOTALL);
Matcher m = p.matcher("foo '\r\n\t bar");
assertTrue(m.find());
assertEquals(" '\r\n\t ", m.group(1));
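As far as the Pattern documentation describes it, DOTALL changes only what the . metacharacter matches, while \s matches line terminators with or without the flag. The sketch below checks that; the class name DotallDemo and the helper matches are hypothetical.

```java
import java.util.regex.Pattern;

final class DotallDemo {
    // Hypothetical helper: full match of input against regex with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without DOTALL, . does not match a line feed.
        System.out.println(matches(".", 0, "\n"));               // false
        // With DOTALL, . matches any character, including a line feed.
        System.out.println(matches(".", Pattern.DOTALL, "\n"));  // true
        // \s matches a line feed regardless of the flag.
        System.out.println(matches("\\s", 0, "\n"));             // true
    }
}
```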

Escape sequence in regex parsed by Pattern.LITERAL

Given the following snippet:
Pattern pt = Pattern.compile("\ndog", Pattern.LITERAL);
Matcher mc = pt.matcher("\ndogDoG");
while (mc.find()) {
    System.out.printf("I have found %s starting at the " +
            "index %s and ending at the index %s%n", mc.group(), mc.start(), mc.end());
}
The output will be:
I have found
dog starting at the index 0 and ending at the index 4.
So the \n was interpreted, even though I have specified Pattern.LITERAL, whose documentation says:
Pattern.LITERAL Enables literal parsing of the pattern. When this flag
is specified then the input string that specifies the pattern is
treated as a sequence of literal characters. Metacharacters or escape
sequences in the input sequence will be given no special meaning.
However, the output of the above snippet does interpret the \n escape sequence; it does not treat it as a literal.
Why does this happen, given that the tutorial says it should not?
I know \n is a line terminator, however it's still an escape sequence character.
however it's still an escape sequence character.
No it's not. It's a newline character. You can do:
char c = '\n';
Your output is therefore expected.
Note that if you compile a pattern with:
Pattern.compile("\n")
then the regex engine receives an actual line feed character, which it treats as a literal.
BUT if you compile with:
Pattern.compile("\\n")
then it is an escape sequence. And they happen to match the same thing.
Pattern.LITERAL cares about regex literals, not string literals.
Therefore, it treats \\n as a backslash plus n (instead of the regex token for newline), but "\n" has already been turned into a line feed character by the Java compiler, so Pattern.LITERAL has nothing to reinterpret there.
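The difference between string-literal escapes and regex escapes under Pattern.LITERAL can be sketched like this. The class name LiteralFlagDemo and the helper literalFind are hypothetical.

```java
import java.util.regex.Pattern;

final class LiteralFlagDemo {
    // Hypothetical helper: searches input for the pattern text taken literally.
    static boolean literalFind(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.LITERAL).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\ndog" already contains a real line feed before Pattern sees it,
        // so it is found in an input that starts with a line feed.
        System.out.println(literalFind("\ndog", "\ndogDoG"));   // true
        // "\\ndog" is backslash, n, d, o, g; with LITERAL the backslash is
        // not a regex escape, so a line feed in the input does not match.
        System.out.println(literalFind("\\ndog", "\ndogDoG"));  // false
        // It matches only input that really contains backslash + n.
        System.out.println(literalFind("\\ndog", "\\ndog"));    // true
        // Without LITERAL, the regex escape \n matches the line feed.
        System.out.println(Pattern.compile("\\ndog").matcher("\ndogDoG").find()); // true
    }
}
```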
