Regular expression to match escape sequences in Java

I am looking for a regex to check for all escape sequences in Java:
\b backspace
\t horizontal tab
\n linefeed
\f form feed
\r carriage return
\" double quote
\' single quote
\\ backslash
How do I write a regex and perform validation that allows words, textarea content, strings, or sentences containing valid escape sequences?

This regex will match all the escape sequences you have listed:
\\[btnfr"'\\]
In a Java string literal you need to double each backslash, so the code becomes:
Pattern p = Pattern.compile("\\\\[btnfr\\\"\\'\\\\]");
if (p.matcher("\\b backspace").find()) {
    System.out.println("Contains escape sequence");
}

The following regex should meet your need:
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
For example:
Pattern pattern = Pattern.compile("\\\\[\\\\btnfr\'\"]");
String[] strings = new String[]{"\\b","\\t","\\n","\\f","\\r","\\\'","\\\"", "\\\\"};
for (String s : strings) {
    System.out.println(s + " - " + pattern.matcher(s).matches());
}
To match a single \, you need four backslashes in the Java string literal that holds the regex.
In a Java string literal, "\\" stands for a single \.
If you use "\\" as a regex string, the regex engine sees a single \, which is a special character in regex and must be followed by another character to form an escape sequence.
So the regex needs "\\\\": the regex engine sees \\, which matches a single \, that is, the string "\\".
EDIT: There is no need to escape the single quote in the regex string. So "\\\\[\\\\btnfr\'\"]" can be replaced with "\\\\[\\\\btnfr'\"]".
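The answers above detect a single escape sequence. For the validation part of the question, one possible approach is to accept a text only if every backslash in it starts one of the listed sequences. This is a minimal sketch under that assumption; the class name EscapeValidator and the method hasOnlyValidEscapes are made up for illustration.

```java
import java.util.regex.Pattern;

public class EscapeValidator {
    // Regex seen by the engine: (?:[^\\]|\\[btnfr"'\\])*
    // Each item is either one character that is not a backslash,
    // or a backslash followed by one of b t n f r " ' \.
    private static final Pattern VALID =
            Pattern.compile("(?:[^\\\\]|\\\\[btnfr\"'\\\\])*");

    // Hypothetical helper: true if every backslash starts a listed escape sequence.
    static boolean hasOnlyValidEscapes(String text) {
        return VALID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyValidEscapes("col1\\tcol2\\n")); // true
        System.out.println(hasOnlyValidEscapes("bad \\q escape")); // false
        System.out.println(hasOnlyValidEscapes("dangling \\"));    // false
    }
}
```

Because the two alternatives can never match at the same position, the pattern does not backtrack heavily. For very long inputs a hand-written loop over the characters may still be the safer choice.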

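DOTALL is only needed when the pattern uses . and the dot should match line terminators too; character classes such as \s match them without any flag. You might find \s handy, as it matches whitespace including \t, \n, \f and \r. Note that the backslash inside the character class must be written as \\\\ in the string literal, otherwise it escapes the closing ] and the pattern fails to compile. E.g.
Pattern p = Pattern.compile("([\\s\"'\\\\]+)", Pattern.DOTALL);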
Matcher m = p.matcher("foo '\r\n\t bar");
assertTrue(m.find());
assertEquals(" '\r\n\t ", m.group(1));
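To show what the DOTALL flag changes, here is a small sketch comparing the dot with and without the flag against a whitespace class. The class name DotallDemo and its helper are illustrative only.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical helper: full match of text against regex compiled with the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "a\nb"; // a, line feed, b

        // By default the dot does not match a line terminator.
        System.out.println(matches("a.b", 0, text));              // false
        // With DOTALL the dot matches any character, including the line feed.
        System.out.println(matches("a.b", Pattern.DOTALL, text)); // true
        // \s matches the line feed regardless of the flag.
        System.out.println(matches("a\\sb", 0, text));            // true
    }
}
```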

Related

Matcher.replaceAll() removes backslash even when I escape it. Java

I have functionality in my app that should replace some text in JSON (I have simplified it in the example). The replacement may contain escape sequences like \n \b \t etc., which can break the JSON string when I try to build JSON with Jackson. So I decided to use Apache's solution, StringEscapeUtils.escapeJava(), to escape all escape sequences. But
Matcher.replaceAll() removes the backslashes that were added by escapeJava().
Here is the code:
public static void main(String[] args) {
    String json = "{\"test2\": \"Hello toReplace \\\"test\\\" world\"}";
    String replacedJson = Pattern.compile("toReplace")
            .matcher(json)
            .replaceAll(StringEscapeUtils.escapeJava("replacement \n \b \t"));
    System.out.println(replacedJson);
}
Expected Output:
{"test2": "Hello replacement \n \b \t \"test\" world"}
Actual Output:
{"test2": "Hello replacement n b t \"test\" world"}
Why does Matcher.replaceAll() remove the backslashes, while System.out.println(StringEscapeUtils.escapeJava("replacement \n \b \t")); prints the correct output: replacement \n \b \t?
StringEscapeUtils.escapeJava("\n") allows you to transform the single newline character \n into two characters: \ and n.
\ is a special character in pattern replacements though, from https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String):
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
To have them taken as literal characters, you need to escape them via Matcher.quoteReplacement, from https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#quoteReplacement(java.lang.String):
Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement s in the appendReplacement method of the Matcher class. The String produced will match the sequence of characters in s treated as a literal sequence. Slashes (\) and dollar signs ($) will be given no special meaning.
So in your case:
.replaceAll(Matcher.quoteReplacement(StringEscapeUtils.escapeJava("replacement \n \b \t")))
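The effect can be reproduced without the Apache dependency. In this sketch the string "replacement \\n \\b \\t" is a hand-written stand-in for what escapeJava returns for that input (an assumption made to keep the example self-contained), and the class and helper names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteReplacementDemo {
    // Hypothetical helper: replaces every regex match with the replacement taken literally.
    static String replaceLiterally(String input, String regex, String replacement) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Stand-in for the escaped text: backslash + n, backslash + b, backslash + t.
        String escaped = "replacement \\n \\b \\t";
        String input = "Hello toReplace world";

        // Unquoted: replaceAll treats each backslash as an escape and drops it.
        System.out.println(input.replaceAll("toReplace", escaped));
        // prints: Hello replacement n b t world

        // Quoted: the backslashes survive.
        System.out.println(replaceLiterally(input, "toReplace", escaped));
        // prints: Hello replacement \n \b \t world

        // If the search text is not a regex at all, String.replace is literal on both sides.
        System.out.println(input.replace("toReplace", escaped));
        // prints: Hello replacement \n \b \t world
    }
}
```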
If you want a literal backslash in replaceAll, you need to escape it. This is described in the documentation of Matcher.replaceAll.
StringEscapeUtils.escapeJava escapes a string so that it is suitable for use in Java source code, but it does not let you use unescaped strings in your source code.
"replacement \n \b \t"
             ^ new line
                ^ backspace
                   ^ tab
If you want literal backslashes in a regular Java string, you need:
"replacement \\n \\b \\t"
Because this is a Java string literal for the replacement part of replaceAll, where \ is itself an escape character, you need:
"replacement \\\\n \\\\b \\\\t"
Try:
String replacedJson = Pattern.compile("toReplace")
        .matcher(json)
        .replaceAll("replacement \\\\n \\\\b \\\\t");
You also have to escape \ using Matcher.quoteReplacement().
public static String replaceAll(String json, String regex, String replace) {
    return Pattern.compile(regex)
            .matcher(json)
            .replaceAll(Matcher.quoteReplacement(StringEscapeUtils.escapeJava(replace)));
}

Why escaping double quote with single and triple backslashes in a Java regular expression yields identical results

I want to escape " (double quotes) in a regex.
I found out that there is no difference whether I use \\\ or \, both yield the same correct result.
Why is it so? How can the first one give correct result?
To define a " char in a string literal in Java, you need to escape it for the string parsing engine, like "\"".
The " char is not a special regex metacharacter, so you needn't escape this character for the regex engine. However, you may do it:
A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
A regex escape starts with a literal backslash, and a literal backslash is written as a double backslash in a Java string literal, "\\":
It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler.
So, both "\"" (a literal " string) and "\\\"" (a literal \" string) form a regex pattern that matches a single " char.
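A quick way to check this claim is to compile both literals and match them against a one-character string holding a double quote. The class name QuoteEscapeDemo and its helper are illustrative only.

```java
import java.util.regex.Pattern;

public class QuoteEscapeDemo {
    // Hypothetical helper: does the regex fully match a string consisting of one " char?
    static boolean matchesQuote(String regex) {
        return Pattern.matches(regex, "\"");
    }

    public static void main(String[] args) {
        String plain = "\"";     // regex seen by the engine: "
        String escaped = "\\\""; // regex seen by the engine: \"

        System.out.println(plain.length() + " char(s): " + matchesQuote(plain));     // 1 char(s): true
        System.out.println(escaped.length() + " char(s): " + matchesQuote(escaped)); // 2 char(s): true
    }
}
```

The two pattern strings differ in length, but the regex engine treats the escaped quote as the same literal character.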
Try to use this:
String regex = "(\"\\w+\")";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher("Some \"test\" string. And \"another\" quoted word.");
while (matcher.find()) {
    System.out.println(matcher.group());
}
Prints:
"test"
"another"

Why does this regex match online but not in my environment? [duplicate]

I am trying out the following code and it prints false.
I expected it to print true.
In addition, the Pattern.compile() statement gives a warning: 'redundant escape character'.
Can someone please explain why this does not return true and why I see the warning?
public static void main(String[] args) {
    String s = "\\n";
    System.out.println(s);
    Pattern p = Pattern.compile("\\\n");
    Matcher mm = p.matcher(s);
    System.out.println(mm.matches());
}
The s="\\n" means you assign a backslash and n to the variable s, and it contains a sequence of two chars, \ and n.
The Pattern.compile("\\\n") means you define a regex pattern \<LF> (a backslash and a newline, line feed, char) that matches a newline (LF) char, because escaped non-word non-special chars match themselves. \, matches a ,, \; matches a ;. Thus, this pattern won't match the string in variable s.
The redundant escape warning is thrown because \<LF> matches the same newline char that can be matched with mere <LF>.
More examples:
Regex | Regex string literal | Matching text | Matching string literal
<LF>  | "\n"                 | <LF>          | "\n"
\n    | "\\n"                | <LF>          | "\n"
\\n   | "\\\\n"              | \n            | "\\n"
Because "\\n" evaluates to a backslash (\) and the letter n, while "\\\n" evaluates to a backslash (\) followed by a newline character.
Backslashes within string literals in Java source code are interpreted as required by The Java™ Language Specification as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6) It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
Reference: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Your source s has two characters, '\' and 'n'. If you meant a \ followed by a line break, it should be "\\\n".
The pattern has two characters, '\' and '\n' (line break), and the \ escape character is not needed, hence the warning. If you meant a \ followed by a line break, it should be "\\\\\n" (four backslashes to produce an escaped backslash for the regex, and then \n).
String s = "\\\n";
System.out.println(s);
Pattern p = Pattern.compile("\\\\\n");
Matcher mm = p.matcher(s);
System.out.println(mm.matches());
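The combinations discussed in this thread can be checked side by side. This is a small sketch; the class name NewlineLayersDemo and the helper fullMatch are made up for illustration.

```java
import java.util.regex.Pattern;

public class NewlineLayersDemo {
    // Hypothetical helper: full match of text against the regex held in regexLiteral.
    static boolean fullMatch(String regexLiteral, String text) {
        return Pattern.matches(regexLiteral, text);
    }

    public static void main(String[] args) {
        // Regex <LF> against a line feed.
        System.out.println(fullMatch("\n", "\n"));      // true
        // Regex \n (the regex escape for a line feed) against a line feed.
        System.out.println(fullMatch("\\n", "\n"));     // true
        // Regex \\n (escaped backslash, then n) against backslash + n.
        System.out.println(fullMatch("\\\\n", "\\n"));  // true
        // Regex \<LF> from the question against backslash + n.
        System.out.println(fullMatch("\\\n", "\\n"));   // false
        // The same regex against a lone line feed.
        System.out.println(fullMatch("\\\n", "\n"));    // true
    }
}
```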

REGEXP - how to read " character?

I'm using Hadoop Pig with a regexp (REGEX_EXTRACT_ALL), which uses Java regex parsing.
I have a string:
"DYN_USER_ID=32753477; $Path=\"/\"; DYN_USER_CONFIRM=e6d2a0a7b7715cb10d1dca504e3c5e80; $Path=\"/\"" "Nokia6070/2.0 (03.20) Profile/MIDP-2.0 Configuration/CLDC-1.1"
I'm expecting two groups:
First: DYN_USER_ID=32753477; $Path=\"/\"; DYN_USER_CONFIRM=e6d2a0a7b7715cb10d1dca504e3c5e80; $Path=\"/\"
Second: Nokia6070/2.0 (03.20) Profile/MIDP-2.0 Configuration/CLDC-1.1
As you can see, the first string contains " characters, but each is preceded by the escape character \.
The simplest solution is:
"(.*)" "(.*)"
But is it the best one?
"(.*)(?<!\\\\)" "(.*)"
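This uses a negative lookbehind, (?<!☀), where ☀ is some string; here it is the backslash character, represented by a regex-escaped and String-escaped backslash. It keeps the closing " of the first group from being a quote that is preceded by a backslash.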
Ideally, you should be using the negated character class [^"] so that it matches from the first delimiter " to the last delimiter ", but the problem is that it ignores escaped " characters. If you can have escaped " and escaped \ in your strings, it will be better if you use something like this:
"((?:\\.|[^"\\])+)" "((?:\\.|[^"\\])+)"
The group (?:\\.|[^"\\])+ matches one or more items, each of which is either an escaped character (a backslash followed by any character) or a single character that is neither " nor \.
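Since the question says the Pig function relies on Java parsing, the pattern can be tried in plain Java first. This is a sketch only: the shortened input line, the class name QuotedFieldsDemo and the helper extract are made up for illustration, and the backslashes are doubled for Java string literals rather than for a Pig script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedFieldsDemo {
    // Regex for one quoted field: "((?:\\.|[^"\\])+)"
    private static final String FIELD = "\"((?:\\\\.|[^\"\\\\])+)\"";
    // Two quoted fields separated by a single space.
    private static final Pattern TWO_FIELDS = Pattern.compile(FIELD + " " + FIELD);

    // Hypothetical helper: returns the two field contents, or null if the line does not match.
    static String[] extract(String line) {
        Matcher m = TWO_FIELDS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Text of the line: "a=1; $Path=\"/\"" "Nokia6070/2.0"
        String line = "\"a=1; $Path=\\\"/\\\"\" \"Nokia6070/2.0\"";
        String[] fields = extract(line);
        System.out.println(fields[0]); // a=1; $Path=\"/\"
        System.out.println(fields[1]); // Nokia6070/2.0
    }
}
```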
