Regex - escape or character block? - java

What is the best approach if, for instance, a literal question mark is expected in a String?
...[?]...
or
...\?...
Example:
The text bla?bla will match both the pattern bla[?]bla and the pattern bla\?bla (but not bla?bla, obviously). Is there any reason to use one over the other?

There is no technical reason to prefer one over the other: they are equivalent expressions. The character class is only used to avoid entering a backslash, so IMHO the escaped version is "cleaner".
However, the reason may be to avoid double-escaping the backslash on input. In languages like Java, the string literal for the escaped version would look like this:
// in java you need to escape a backslash with another backslash :(
String regex = "...\\?...";
It could be that wherever the regexes are coming from has a similar issue and it's easier to read [?] than \\?
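To check the equivalence claimed above, here is a minimal sketch (the class and field names are made up for illustration). It compiles both spellings, adds Pattern.quote as a third way to get a literal match, and contrasts them with the unescaped pattern:

```java
import java.util.regex.Pattern;

class QuestionMarkDemo {
    // Character-class spelling: no backslash needed in the Java literal.
    static final Pattern CLASS_FORM = Pattern.compile("bla[?]bla");
    // Escaped spelling: the regex bla\?bla needs a doubled backslash in Java source.
    static final Pattern ESCAPED_FORM = Pattern.compile("bla\\?bla");
    // Pattern.quote treats its whole argument as literal text.
    static final Pattern QUOTED_FORM = Pattern.compile(Pattern.quote("bla?bla"));
    // Unescaped: here ? is a quantifier that makes the preceding 'a' optional.
    static final Pattern UNESCAPED_FORM = Pattern.compile("bla?bla");

    public static void main(String[] args) {
        String input = "bla?bla";
        System.out.println(CLASS_FORM.matcher(input).matches());     // true
        System.out.println(ESCAPED_FORM.matcher(input).matches());   // true
        System.out.println(QUOTED_FORM.matcher(input).matches());    // true
        System.out.println(UNESCAPED_FORM.matcher(input).matches()); // false
    }
}
```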

Related

Is there any need to escape the slash('/') character for regular expressions in Java

I have the following code snippet:
Pattern patternOfSlashContainingBackSlash=Pattern.compile("\\/");
Pattern patternOfSlashNotContainingBackSlash=Pattern.compile("/");
String slash = "/";
Matcher matcherOfSlashContainingBackSlash = patternOfSlashContainingBackSlash.matcher(slash);
Matcher matcherOfSlashNotContainingBackSlash = patternOfSlashNotContainingBackSlash.matcher(slash);
//both patterns match the slash
System.out.println(matcherOfSlashContainingBackSlash.matches());
System.out.println(matcherOfSlashNotContainingBackSlash.matches());
My questions:
What is the difference (from Java perspective) between the two patterns, or is there any difference?
Is the '/' character just a plain character for regex (not a special character like ']' is), from the Java perspective?
The java version on which I run this is 1.8
This question is different from the others, since it makes it clear that the patterns "\\/" and "/" are the same for Java programming language.
Thank you very much!
/ is not special in Java regex, it is in JavaScript where we have syntax /regex/flags. Java allows escaping characters even if they don't require it, probably to make such regex more portable. So Pattern.compile("\\/") and Pattern.compile("/") will behave the same.
BTW, ] by itself is not a special character. If it is not part of a [...] construct you don't need to escape it, but you are allowed to (at least in Java; I am not sure about other regex flavors).
/ is not a special character in Java regexes. "\\/" is equivalent to "/".
The forward slash / character is not a command character in Java Pattern representations (nor in normal Strings), and does not require escaping.
The back-slash character \ requires escaping in Java Strings, as it is used to encode special sequences such as newline ("\n").
E.g. String singleBackslash = "\\";
As the back-slash is also used to signify special constructs in Java Pattern representations, it may require double escaping in Pattern definitions.
E.g. Pattern singleBackSlash = Pattern.compile("\\\\");
See API for examples of Pattern constructs involving the back-slash.
There is no need to escape the slash character. The reference for Java's regex syntax is the Javadoc for class Pattern. There, it lists all the characters that have special meaning. The slash is not present in that list.
I think that the reason why people escape the / character is that they come from different regex flavours (like Javascript) where you can define a pattern enclosing it between two slashes and if you want to match one slash you need to escape it.
var pattern = /\d\/\d/; // Valid pattern of 2 digits divided by a slash, which must be escaped.
From a Java perspective, though, the two are equivalent and thus there's no need to escape it.
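The answers above can be checked with a short sketch (class and method names are illustrative only). It shows that the escaped and unescaped slash behave identically, while a literal backslash needs four backslashes in Java source:

```java
import java.util.regex.Pattern;

class SlashDemo {
    // Regex \/ : an escaped forward slash, which Java accepts although it is unnecessary.
    static boolean escapedSlashMatches(String s) {
        return Pattern.matches("\\/", s);
    }

    // Regex / : the forward slash is an ordinary character.
    static boolean plainSlashMatches(String s) {
        return Pattern.matches("/", s);
    }

    // Regex \\ : matches exactly one backslash character.
    static boolean backslashMatches(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(escapedSlashMatches("/")); // true
        System.out.println(plainSlashMatches("/"));   // true
        System.out.println(backslashMatches("\\"));   // true
        System.out.println(backslashMatches("/"));    // false
    }
}
```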

Add Dash to Java Regex

I am trying to modify an existing Regex expression being pulled in from a properties file from a Java program that someone else built.
The current Regex expression used to match an email address is -
RR.emailRegex=^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
That matches email addresses such as abc.xyz@example.com, but now some email addresses have dashes in them such as abc-def.xyz@example.com and those are failing the Regex pattern match.
What would my new Regex expression be to add the dash to that regular expression match or is there a better way to represent that?
Based on the regex you are using, you can add the dash to your character classes:
RR.emailRegex=^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
becomes
RR.emailRegex=^[a-zA-Z0-9_\\.-]+@[a-zA-Z0-9_-]+\\.[a-zA-Z0-9_-]+$
Btw, you can shorten your regex like this:
RR.emailRegex=^[\\w.-]+@[\\w-]+\\.[\\w-]+$
Anyway, I would use Apache EmailValidator instead like this:
if (EmailValidator.getInstance().isValid(email)) ....
The meaning of - inside a character class is different from its meaning elsewhere. Inside a character class, - denotes a range, e.g. 0-9. If you want to include a literal -, put it at the beginning or end of the character class, like [-0-9] or [0-9-].
You also don't need to escape . inside a character class, because it is treated as a literal . there.
Your regex can be simplified further. \w denotes [A-Za-z0-9_]. So you can use
^[-\w.]+@[\w]+\.[\w]+$
In Java, this can be written as
^[-\\w.]+@[\\w]+\\.[\\w]+$
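As a rough check of the points about dash placement, the sketch below compares the original pattern with a variant that puts the dash at the edge of each character class. It assumes the usual @ separator in the addresses; the class, field, and method names are invented for the example:

```java
import java.util.regex.Pattern;

class DashInClassDemo {
    // Original pattern: no dash allowed anywhere.
    static final Pattern OLD_EMAIL =
            Pattern.compile("^[a-zA-Z0-9_.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$");
    // Dash placed first or last in each class, so it is a literal, not a range.
    static final Pattern NEW_EMAIL =
            Pattern.compile("^[-\\w.]+@[\\w-]+\\.[\\w-]+$");

    static boolean oldMatches(String s) {
        return OLD_EMAIL.matcher(s).matches();
    }

    static boolean newMatches(String s) {
        return NEW_EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(oldMatches("abc.xyz@example.com"));     // true
        System.out.println(oldMatches("abc-def.xyz@example.com")); // false
        System.out.println(newMatches("abc-def.xyz@example.com")); // true
        // Between two characters the dash forms a range; at the edge it is literal.
        System.out.println(Pattern.matches("[a-c]", "b")); // true
        System.out.println(Pattern.matches("[ac-]", "b")); // false
        System.out.println(Pattern.matches("[ac-]", "-")); // true
    }
}
```

This only demonstrates the character-class behaviour; it is not a complete email validator.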
^[a-zA-Z0-9_\\.\\-]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
Should solve your problem. In regex you need to escape anything that has meaning in the Regex engine (eg. -, ?, *, etc.).
The correct Regex fix is below.
OLD Regex Expression
^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
NEW Regex Expression
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
Actually, I read a post that covers all the special cases, so the best one that works correctly with Java is
String pattern ="(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])";

Java Scanner backslash delimiter

I am trying to use a series of delimiters for an input. It's for a homework assignment. They said that we should use the backslash (\) too. If I use it like this (it's at the end):
scanner.useDelimiter("\\;|\\:|\\?|\\~|/|\\.|,|\\<|\\>|\\`|\\[|\\]|\\{|\\}|\\(|\\)|\\!|\\@|\\#|\\$|\\%|\\^|\\&|\\-|\\_|\\+|\\'|\\=|\\*|\"|\\||\n|\t|\r|\\");
It won't work. It says unsupported escape sequence. If I add another backslash it says Illegal line end in string literal. If I add another it will escape to a double backslash and that's not what I need.
I couldn't find any solution for this and that's why I'm asking. I already finished the homework using Scanner, and changing that now is not an option (a lot to re-implement).
Thank you.
You should use four backslashes at the end, like:
scanner.useDelimiter("\\;|\\:| ... |\r|\\\\");
This is the way it should work. You said if you tried it would match double backslashes. Have you tried it? If you did, and it still matches double backslashes, I suspect your input is escaped too somewhere. (maybe it is a string literal somewhere in your code?)
The reason behind this is that your string is de-escaped twice. Once at compile time as every other string literal in the Java language, and once compiling the regex. That means, after the first step it is escaped once, so the regex compiler gets two backslashes \\. The regex compiler will de-escape that too (just like \r), and will match a single \ character.
If you would like to match two backslashes this way, then you have to use eight backslashes (\\\\\\\\ or \\\\{2}) in your literal. Yeah, pretty ugly.
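The two rounds of de-escaping can be tried out with a small sketch. It uses a much shorter delimiter list than the question's and a hypothetical helper named split; only the backslash handling is the point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class BackslashDelimiterDemo {
    // Collects all tokens that Scanner produces for the given delimiter regex.
    static List<String> split(String input, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter(delimiterRegex);
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The literal "a;b\\c:d" is the text a;b\c:d (one backslash).
        // Four backslashes in source become the regex \\ and match one backslash.
        System.out.println(split("a;b\\c:d", ";|:|\\\\")); // [a, b, c, d]

        // The literal "x\\\\y" is the text x\\y (two backslashes).
        // Eight backslashes in source become the regex \\\\ and match two.
        System.out.println(split("x\\\\y", "\\\\\\\\")); // [x, y]
    }
}
```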
I think you are using the delimiter the wrong way.
There is a related topic.
Check this first
How do I use a delimiter in Java Scanner?

Java: Replace all ' in a string with \'

I need to escape all quotes (') in a string, so it becomes \'
I've tried using replaceAll, but it doesn't do anything. For some reason I can't get the regex to work.
I'm trying with
String s = "You'll be totally awesome, I'm really terrible";
String shouldBecome = "You\'ll be totally awesome, I\'m really terrible";
s = s.replaceAll("'","\\'"); // Doesn't do anything
s = s.replaceAll("\'","\\'"); // Doesn't do anything
s = s.replaceAll("\\'","\\'"); // Doesn't do anything
I'm really stuck here, hope somebody can help me here.
Thanks,
Iwan
You have to first escape the backslash because it's in a string literal (yielding \\), and then escape it again because replaceAll gives backslashes a special meaning in the replacement (yielding \\\\). So try:
s = s.replaceAll("'", "\\\\'");
output:
You\'ll be totally awesome, I\'m really terrible
Use replace()
s = s.replace("'", "\\'");
output:
You\'ll be totally awesome, I\'m really terrible
Let's take a tour of String#replaceAll(String regex, String replacement)
You will see that:
An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression
Pattern.compile(regex).matcher(str).replaceAll(repl)
So let's take a look at the Matcher.html#replaceAll(java.lang.String) documentation
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
You can see that in replacement we have special character $ which can be used as reference to captured group like
System.out.println("aHellob,aWorldb".replaceAll("a(\\w+?)b", "$1"));
// result Hello,World
But sometimes we don't want $ to be special, because we want to use it as a plain dollar character, so we need a way to escape it.
This is where \ comes in: since it is already used to escape metacharacters in regexes, Strings and probably other places, it is a natural convention to use it to escape $ here too.
So now \ is also a metacharacter in the replacement part, and if you want a literal \ in the replacement you need to escape it somehow. And guess what? You escape it the same way as in a regex or a String: you place another \ before the one you are escaping.
So to get a \ in the replacement part you need to add another \ before it. But remember that to write a \ in a String literal you need to write "\\", so to create the two backslashes \\ in the replacement you need to write "\\\\".
So try
s = s.replaceAll("'", "\\\\'");
Or even better
to reduce explicit escaping in the replacement part (and also in the regex part, which I forgot to mention earlier), just use replace instead of replaceAll; it does the escaping for us
s = s.replace("'", "\\'");
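The difference between the variants discussed in these answers can be seen side by side in a short sketch (class and method names are illustrative):

```java
class QuoteEscapeDemo {
    static final String INPUT = "You'll be totally awesome, I'm really terrible";

    // Replacement \\' : the escaped backslash survives, so each ' becomes \'.
    static String viaReplaceAll(String s) {
        return s.replaceAll("'", "\\\\'");
    }

    // replace() treats both arguments literally, so one level of escaping is enough.
    static String viaReplace(String s) {
        return s.replace("'", "\\'");
    }

    // Replacement \' : the backslash merely escapes the quote, so nothing changes.
    static String tooFewBackslashes(String s) {
        return s.replaceAll("'", "\\'");
    }

    public static void main(String[] args) {
        System.out.println(viaReplaceAll(INPUT));     // You\'ll be totally awesome, I\'m really terrible
        System.out.println(viaReplace(INPUT));        // You\'ll be totally awesome, I\'m really terrible
        System.out.println(tooFewBackslashes(INPUT)); // You'll be totally awesome, I'm really terrible
    }
}
```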
This doesn't say how to "fix" the problem - that's already been done in other answers; it exists to draw out the details and applicable documentation references.
When using String.replaceAll or any of the applicable Matcher replacers, pay attention to the replacement string and how it is handled:
Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
As pointed out by isnot2bad in a comment, Matcher.quoteReplacement may be useful here:
Returns a literal replacement String for the specified String. .. The String produced will match the sequence of characters in s treated as a literal sequence. Slashes (\) and dollar signs ($) will be given no special meaning.
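A minimal sketch of Matcher.quoteReplacement in use follows. The class name, method names, and the PRICE placeholder are invented for the example:

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // quoteReplacement adds the extra backslash, so the replacement is taken literally.
    static String escapeQuotes(String s) {
        return s.replaceAll("'", Matcher.quoteReplacement("\\'"));
    }

    // Without quoting, $5 would be read as a reference to capture group 5.
    static String literalDollar(String s) {
        return s.replaceAll("PRICE", Matcher.quoteReplacement("$5"));
    }

    public static void main(String[] args) {
        System.out.println(escapeQuotes("I'm here"));    // I\'m here
        System.out.println(literalDollar("PRICE each")); // $5 each
    }
}
```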
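You could also try using something like StringEscapeUtils to make your life even easier: http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html
s = StringEscapeUtils.escapeJava(s);
Note, though, that escapeJava escapes double quotes, backslashes and control characters but leaves single quotes alone; in that class, escapeJavaScript is the variant that also escapes '.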
You can use Apache's commons-text library (instead of commons-lang):
Example code:
org.apache.commons.text.StringEscapeUtils.escapeJava(escapedString);
Dependency:
compile 'org.apache.commons:commons-text:1.8'
OR
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.8</version>
</dependency>

Why String.replaceAll() in java requires 4 slashes "\\\\" in regex to actually replace "\"?

I recently noticed that String.replaceAll(regex, replacement) behaves very strangely when it comes to the escape character "\" (backslash).
For example consider there is a string with filepath - String text = "E:\\dummypath"
and we want to replace the "\\" with "/".
text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.
If we want to implement the same functionality with replaceAll() we need to write it as,
text.replaceAll("\\\\","/")
One notable difference is replaceAll() has its arguments as reg-ex whereas replace() has arguments character-sequence!
But text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/")
Digging Deeper:
Even more weird behaviors can be observed when we try some other inputs.
Lets assign text="Hello\nWorld\n"
Now,
text.replaceAll("\n","/"), text.replaceAll("\\n","/") and text.replaceAll("\\\n","/") all give the same output: Hello/World/
Java had really messed up with the reg-ex in its best possible way I feel! No other language seems to have these playful behaviors in reg-ex. Any specific reason, why Java messed up like this?
You need to escape twice, once for Java, once for the regex.
Java code is
"\\\\"
makes a regex string of
"\\" - two chars
but the regex needs an escape too so it turns into
\ - one symbol
@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character both in Java string literals and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
But why is it like that?
It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. Awkwardness of double escaping didn't become apparent in Java until they added regex support in the form of the Pattern class ... in Java 1.4.
So how do other languages manage to avoid this?
They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. C# and Python provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)
Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?
The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.
The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.
The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
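The three cases described above, plus the rejected lone backslash from the question, can be reproduced with a small sketch (class and method names are made up for illustration):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class NewlinePatternsDemo {
    static final String TEXT = "Hello\nWorld\n";

    static String[] results() {
        return new String[] {
            TEXT.replaceAll("\n", "/"),   // regex is a raw newline character
            TEXT.replaceAll("\\n", "/"),  // regex is backslash + n, the regex escape for newline
            TEXT.replaceAll("\\\n", "/")  // regex is backslash + raw newline, i.e. an escaped newline
        };
    }

    // The regex consisting of a single backslash is an incomplete escape.
    static boolean loneBackslashIsRejected() {
        try {
            Pattern.compile("\\");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (String result : results()) {
            System.out.println(result); // Hello/World/ each time
        }
        System.out.println(loneBackslashIsRejected()); // true
    }
}
```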
1) Let's say you want to replace a single \ using Java's replaceAll method:
\
˪--- 1) the final backslash
2) Java's replaceAll method takes a regex as first argument. In a regex literal, \ has a special meaning, e.g. in \d which is a shortcut for [0-9] (any digit). The way to escape a metachar in a regex literal is to precede it with a \, which leads to:
\ \
| ˪--- 1) the final backslash
|
˪----- 2) the backslash needed to escape 1) in a regex literal
3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/). But in a string literal, \ also has a special meaning, e.g. in \n (a new line) or \t (a tab). The way to escape a metachar in a string literal is to precede it with a \, which leads to:
\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
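The layering in the diagram can be made visible by counting characters. In this sketch (names are illustrative) the four backslashes in source turn out to be a two-character String, which the regex engine then reads as one literal backslash:

```java
class FourBackslashesDemo {
    // Four backslashes in source, two characters at runtime, one backslash matched.
    static final String REGEX_LITERAL = "\\\\";

    static String toForwardSlashes(String path) {
        return path.replaceAll(REGEX_LITERAL, "/");
    }

    public static void main(String[] args) {
        System.out.println(REGEX_LITERAL.length());            // 2
        System.out.println("E:\\dummypath".length());          // 12, the path holds one backslash
        System.out.println(toForwardSlashes("E:\\dummypath")); // E:/dummypath
        // replace() takes literal text, so only the string-literal layer applies.
        System.out.println("E:\\dummypath".replace("\\", "/")); // E:/dummypath
    }
}
```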
This is because Java tries to give \ a special meaning in the replacement string, so that \$ will be a literal $ sign, but in the process they seem to have removed the actual special meaning of \
While text.replaceAll("\\\\","/"), at least can be considered to be okay in some sense (though it itself is not absolutely right), all the three executions, text.replaceAll("\n","/"), text.replaceAll("\\n","/"), text.replaceAll("\\\n","/") giving same output seem even more funny. It is just contradicting as to why they have restricted the functioning of text.replaceAll("\\","/") for the same reason.
Java didn't mess up with regular expressions. It is because, Java likes to mess up with coders by trying to do something unique and different, when it is not at all required.
One way around this problem is to replace backslash with another character, use that stand-in character for intermediate replacements, then convert it back into backslash at the end. For example, to convert "\r\n" to "\n":
String out = in.replace('\\','#').replaceAll("#r#n","#n").replace('#','\\');
Of course, that won't work very well if you choose a replacement character that can occur in the input string.
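If the goal is only to swap one literal sequence for another, String.replace may avoid the stand-in character altogether. The sketch below assumes the input contains the two-character sequences backslash-r and backslash-n rather than real control characters; the class and method names are hypothetical:

```java
class LiteralReplaceDemo {
    // The stand-in approach from the answer above.
    static String viaStandIn(String in) {
        return in.replace('\\', '#').replaceAll("#r#n", "#n").replace('#', '\\');
    }

    // Literal replacement: no regex involved, so no stand-in is needed.
    static String viaReplace(String in) {
        return in.replace("\\r\\n", "\\n");
    }

    public static void main(String[] args) {
        String in = "line1\\r\\nline2"; // the text line1\r\nline2 with literal backslashes
        System.out.println(viaStandIn(in)); // line1\nline2
        System.out.println(viaReplace(in)); // line1\nline2
    }
}
```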
I think java really messed with regular expression in String.replaceAll();
Other than java I have never seen a language parse regular expression this way. You will be confused if you have used regex in some other languages.
In case of using the "\\" in replacement string, you can use java.util.regex.Matcher.quoteReplacement(String)
String result = text.replaceAll("/", Matcher.quoteReplacement("\\"));
By using this Matcher class you can get the expected result.
