I was wondering about regex in Java and stumbled upon the use of backslashes. For instance, if I wanted to look for occurrences of the words "this regex" in a text, I would do something like this:
Pattern.compile("this regex");
Nonetheless, I could also do something like this:
Pattern.compile("this\\sregex");
My question is: what is the difference between the two of them? And why do I have to type the backslash twice, I mean, why isn't \s an escape sequence in Java? Thanks in advance!
\s means any whitespace character, including tab, line feed and carriage return.
Java string literals already use \ to escape special characters. To put the character \ in a string literal, you need to write "\\". However, regex patterns also use \ as their escape character, so a regex escape such as \s has to be written with two backslashes in a string literal, because the text goes through two separate escaping processes. If you read your regex pattern from a plain text file, for example, you won't need double escaping.
The reason you need two backslashes is that when you enter a regex string in Java code you are actually dealing with two parsers:
The first is the Java compiler, which is converting your string literal to a Java String.
The second is the regex parser, which interprets your regex after it has been converted to a Java String and passed to Pattern.compile.
So when you input "this\\sregex", it will be converted to the Java string "this\sregex" by the Java compiler. Then when you call Pattern.compile with the string, the backslash will be interpreted by the regex compiler as a special character.
The difference is that \s denotes a whitespace character, which can be more than just a blank space. It can be a tab, line feed or carriage return, to name a few.
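A minimal sketch of the difference between the two patterns from the question. The class and method names (WhitespaceEscapeDemo, found) are made up for illustration; only Pattern.compile and Matcher.find come from the standard library.

```java
import java.util.regex.Pattern;

class WhitespaceEscapeDemo {
    // Matches only a single literal space between the two words.
    static final Pattern LITERAL_SPACE = Pattern.compile("this regex");

    // In source code "\\s" is the two characters \s, which the regex
    // engine reads as "any whitespace character".
    static final Pattern ANY_WHITESPACE = Pattern.compile("this\\sregex");

    // Hypothetical helper: true if the pattern occurs anywhere in the text.
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String withSpace = "use this regex here";
        String withTab = "use this\tregex here";
        System.out.println(found(LITERAL_SPACE, withSpace));  // true
        System.out.println(found(ANY_WHITESPACE, withSpace)); // true
        System.out.println(found(LITERAL_SPACE, withTab));    // false
        System.out.println(found(ANY_WHITESPACE, withTab));   // true
    }
}
```

Both patterns find the words when they are separated by a space, but only the \s version also finds them when they are separated by a tab.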
I'm trying to convert the String \something\ into the String \\something\\ using replaceAll, but I keep getting all kinds of errors. I thought this was the solution:
theString.replaceAll("\\", "\\\\");
But this gives the below exception:
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
The String#replaceAll() interprets the argument as a regular expression. The \ is an escape character in both String and regex. You need to double-escape it for regex:
string.replaceAll("\\\\", "\\\\\\\\");
But you don't necessarily need regex for this, simply because you want an exact character-by-character replacement and you don't need patterns here. So String#replace() should suffice:
string.replace("\\", "\\\\");
Update: as per the comments, you appear to want to use the string in JavaScript context. You'd perhaps better use StringEscapeUtils#escapeEcmaScript() instead to cover more characters.
TLDR: use theString = theString.replace("\\", "\\\\"); instead.
Problem
replaceAll(target, replacement) uses regular expression (regex) syntax for target and partially for replacement.
The problem is that \ is a special character both in regex (as in \d, which represents a digit) and in String literals (as in "\n", which represents a line separator, or \", which escapes the double quote that would otherwise end the literal).
In both cases we can produce a literal \ by escaping it, that is, by placing an additional \ before it (just as we escape " in string literals via \").
So the target regex representing a \ symbol needs to hold \\, and the string literal representing that text needs to look like "\\\\".
So we escaped \ twice:
once in regex \\
once in String literal "\\\\" (each \ is represented as "\\").
The replacement string treats \ as special too. There it escapes the other special character, $, which through the $x notation refers to the text captured by group x. For example, "012".replaceAll("(\\d)", "$1$1") matches each digit, stores it in capturing group 1, and replaces it with two copies of itself, resulting in "001122".
So again, to let replacement represent \ literal we need to escape it with additional \ which means that:
replacement must hold two backslash characters \\
and String literal which represents \\ looks like "\\\\"
BUT since we want replacement to hold two backslashes we will need "\\\\\\\\" (each \ represented by one "\\\\").
So version with replaceAll can look like
replaceAll("\\\\", "\\\\\\\\");
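The replacement-string rules described above can be checked with a small sketch. The class and method names here are hypothetical; the behaviour shown is that of String.replaceAll.

```java
class ReplacementSyntaxDemo {
    // $1 in the replacement refers to capturing group 1.
    static String duplicateDigits(String s) {
        return s.replaceAll("(\\d)", "$1$1");
    }

    // "\\$" is the two characters \$ at runtime, which the replacement
    // syntax reads as a literal dollar sign.
    static String digitsToDollar(String s) {
        return s.replaceAll("\\d", "\\$");
    }

    // Eight backslashes in source are four characters at runtime, which
    // the replacement syntax reads as two literal backslashes.
    static String twoBackslashes() {
        return "x".replaceAll("x", "\\\\\\\\");
    }

    public static void main(String[] args) {
        System.out.println(duplicateDigits("012"));    // 001122
        System.out.println(digitsToDollar("a1b2"));    // a$b$
        System.out.println(twoBackslashes().length()); // 2
    }
}
```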
Easier way with replaceAll
To make our life easier, Java provides tools to automatically escape text for the target and replacement parts. So now we can focus only on strings, and forget about regex syntax:
replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement))
which in our case can look like
replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"))
Even better: use replace
If we don't really need regex syntax support, let's not involve replaceAll at all. Instead, let's use replace. Both methods will replace all targets, but replace doesn't involve regex syntax. So you could simply write
theString = theString.replace("\\", "\\\\");
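As a rough check, the three variants discussed in this answer can be put side by side. The wrapper class and method names are invented for the example; Pattern.quote, Matcher.quoteReplacement, String.replaceAll and String.replace are the standard calls named above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackslashDoublingDemo {
    // Hand-escaped regex and replacement.
    static String viaRegex(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    // Let the library do the regex-level escaping.
    static String viaQuote(String s) {
        return s.replaceAll(Pattern.quote("\\"), Matcher.quoteReplacement("\\\\"));
    }

    // No regex involved at all.
    static String viaReplace(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        String input = "\\something\\";       // runtime value: \something\
        System.out.println(viaRegex(input));   // \\something\\
        System.out.println(viaQuote(input));   // \\something\\
        System.out.println(viaReplace(input)); // \\something\\
    }
}
```

All three produce the same result; the replace version only has to deal with string-literal escaping.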
To avoid this sort of trouble, you can use replace (which takes a plain string) instead of replaceAll (which takes a regular expression). You will still need to escape backslashes, but not in the wild ways required with regular expressions.
You'll need to escape the (escaped) backslash in the first argument, as it is a regular expression. The replacement (2nd argument, see Matcher#replaceAll(String)) also gives backslashes a special meaning, so you'll have to escape those too:
theString.replaceAll("\\\\", "\\\\\\\\");
Yes... by the time the regex compiler sees the pattern you've given it, it sees only a single backslash (since Java's lexer has turned the double backwhack into a single one). You need to replace "\\\\" with "\\\\\\\\", believe it or not! Java really needs a good raw string syntax.
I'm trying to write a regex that matches valid Java string literals. Take these strings for example:
"hello world\n" (correct - regex should match this)
"I'm happy \ here" (this is incorrect as the escape character is not used correctly - regex should not match this one)
I've tried searching on google but didn't find anything helpful.
I want this one to be used in a parser which only parses string literals from a java code file.
Here is the regex I used:
"\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\""
what am I doing wrong?
I guess you gave us the regex in Java String literal form, like
String regex = "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"";
Unpacking that from Java's String escaping syntax gives the raw regex:
\"(\[tbnrf'"\])*[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]\"
That consists of:
\" matching a double-quote character (Java String literal begins here). Escaping the double quotes with backslash isn't necessary: " on its own is ok as well.
(\[tbnrf'"\])*: a group, repeated 0...n times. I guess you want that to match against the various Java backslash escapes, but that should read (\\[tbnrf'"\\])* with a double backslash in front and inside the character class. And maybe you want to cover the Java octal escapes as well (see the language specification), giving (\\[tbnrf01234567'"\\])*
[a-zA-Z0-9\`\~\!\#\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]: a character class matching one character from a selected list of alphabetic and punctuation characters. I'd replace that with [^"\\], meaning anything but double quote or backslash.
\" matching a double-quote character (string literal ends here). Once again, no need to escape the double quote.
Besides the individual elements, the overall structure of the regex probably isn't what you want: You allow only strings beginning with any number of backslash escapes, followed by exactly one non-escape character, and this enclosed in a pair of double quotes.
The overall structure should instead be "(backslash_escape|simple_character)*"
So, the complete regex would be:
"(\\[tbnrf01234567'"\\]|[^"\\])*"
or, expressed in a Java literal:
String regex = "\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"";
And, although this is shorter than your original attempt, I'd still not call it readable and opt for a different implementation, not using regular expressions.
P.S. Although I did some testing with my regex, I'm not at all sure that it covers all relevant cases correctly.
P.P.S. There are the \uxxxx escapes, not yet covered by the regex.
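A quick way to try the proposed regex against the two sample strings from the question is sketched below. The class and method names are hypothetical, and, as noted above, the pattern does not cover \uxxxx escapes.

```java
import java.util.regex.Pattern;

class StringLiteralRegexDemo {
    // Raw regex: "(\\[tbnrf01234567'"\\]|[^"\\])*"
    static final Pattern STRING_LITERAL =
            Pattern.compile("\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"");

    // True if the whole source text is one string literal
    // according to the pattern above.
    static boolean isStringLiteral(String sourceText) {
        return STRING_LITERAL.matcher(sourceText).matches();
    }

    public static void main(String[] args) {
        // Source text as it would appear in a .java file, quotes included.
        String good = "\"hello world\\n\"";     // "hello world\n"
        String bad = "\"I'm happy \\ here\"";   // "I'm happy \ here"
        System.out.println(isStringLiteral(good)); // true
        System.out.println(isStringLiteral(bad));  // false
    }
}
```

The second string is rejected because a backslash followed by a space fits neither alternative of the group.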
I recently noticed that String.replaceAll(regex, replacement) behaves very strangely when it comes to the escape character "\" (backslash).
For example consider there is a string with filepath - String text = "E:\\dummypath"
and we want to replace the "\\" with "/".
text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.
If we want to implement the same functionality with replaceAll() we need to write it as,
text.replaceAll("\\\\","/")
One notable difference is that replaceAll() takes its arguments as regular expressions, whereas replace() takes plain character sequences.
But text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/")
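The behaviour described so far can be reproduced with a short sketch. The class and method names are made up for the example; the exception type is the one named in the question.

```java
import java.util.regex.PatternSyntaxException;

class PathSeparatorDemo {
    // Plain character-sequence replacement.
    static String withReplace(String path) {
        return path.replace("\\", "/");
    }

    // Regex replacement: the pattern must be an escaped backslash.
    static String withReplaceAll(String path) {
        return path.replaceAll("\\\\", "/");
    }

    // A lone backslash is an incomplete regex escape, so compiling it fails.
    static boolean singleBackslashPatternFails(String path) {
        try {
            path.replaceAll("\\", "/");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "E:\\dummypath";
        System.out.println(withReplace(text));                 // E:/dummypath
        System.out.println(withReplaceAll(text));              // E:/dummypath
        System.out.println(singleBackslashPatternFails(text)); // true
    }
}
```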
Digging Deeper:
Even more weird behaviors can be observed when we try some other inputs.
Lets assign text="Hello\nWorld\n"
Now,
text.replaceAll("\n","/"), text.replaceAll("\\n","/"), text.replaceAll("\\\n","/") all these three gives the same output Hello/World/
Java had really messed up with the reg-ex in its best possible way I feel! No other language seems to have these playful behaviors in reg-ex. Any specific reason, why Java messed up like this?
You need to escape twice, once for Java, once for the regex.
Java code is
"\\\\"
makes a regex string of
"\\" - two chars
but the regex needs an escape too so it turns into
\ - one symbol
@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character both in Java string literals and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
But why is it like that?
It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. The awkwardness of double escaping didn't become apparent in Java until regex support was added in the form of the Pattern class ... in Java 1.4.
So how do other languages manage to avoid this?
They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. C# and Python provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)
Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?
The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.
The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.
The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
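The three cases can be run side by side; this is only a sketch with invented class and method names, using the same text as the question.

```java
class NewlineVariantsDemo {
    static final String TEXT = "Hello\nWorld\n";

    // Pattern is one character: an actual newline.
    static String rawNewline() {
        return TEXT.replaceAll("\n", "/");
    }

    // Pattern is two characters: backslash, 'n' (the regex escape for newline).
    static String regexEscape() {
        return TEXT.replaceAll("\\n", "/");
    }

    // Pattern is two characters: backslash, actual newline
    // (a quoted non-alphabetic character).
    static String quotedNewline() {
        return TEXT.replaceAll("\\\n", "/");
    }

    public static void main(String[] args) {
        System.out.println(rawNewline());    // Hello/World/
        System.out.println(regexEscape());   // Hello/World/
        System.out.println(quotedNewline()); // Hello/World/
    }
}
```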
1) Let's say you want to replace a single \ using Java's replaceAll method:
\
˪--- 1) the final backslash
2) Java's replaceAll method takes a regex as first argument. In a regex literal, \ has a special meaning, e.g. in \d which is a shortcut for [0-9] (any digit). The way to escape a metachar in a regex literal is to precede it with a \, which leads to:
\ \
| ˪--- 1) the final backslash
|
˪----- 2) the backslash needed to escape 1) in a regex literal
3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/). But in a string literal, \ also has a special meaning, e.g. in \n (a new line) or \t (a tab). The way to escape a metachar in a string literal is to precede it with a \, which leads to:
\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
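The layering above can be verified by counting characters at each stage. The names below are hypothetical; the sketch only relies on String.length and Pattern.matches.

```java
import java.util.regex.Pattern;

class EscapeLayersDemo {
    // Four backslashes in source code...
    static final String REGEX_SOURCE = "\\\\";

    // ...are two characters once the compiler has processed the literal.
    static int charsSeenByRegexEngine() {
        return REGEX_SOURCE.length();
    }

    // The regex engine reads those two characters as one literal backslash.
    static boolean matchesSingleBackslash() {
        return Pattern.matches(REGEX_SOURCE, "\\");
    }

    // Pattern.matches requires the whole input to match, so two
    // backslashes are one too many.
    static boolean matchesTwoBackslashes() {
        return Pattern.matches(REGEX_SOURCE, "\\\\");
    }

    public static void main(String[] args) {
        System.out.println(charsSeenByRegexEngine()); // 2
        System.out.println(matchesSingleBackslash()); // true
        System.out.println(matchesTwoBackslashes());  // false
    }
}
```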
This is because Java tries to give \ a special meaning in the replacement string, so that \$ will be a literal $ sign, but in the process they seem to have removed the actual special meaning of \
While text.replaceAll("\\\\","/"), at least can be considered to be okay in some sense (though it itself is not absolutely right), all the three executions, text.replaceAll("\n","/"), text.replaceAll("\\n","/"), text.replaceAll("\\\n","/") giving same output seem even more funny. It is just contradicting as to why they have restricted the functioning of text.replaceAll("\\","/") for the same reason.
Java didn't mess up with regular expressions. It is because, Java likes to mess up with coders by trying to do something unique and different, when it is not at all required.
One way around this problem is to replace backslash with another character, use that stand-in character for intermediate replacements, then convert it back into backslash at the end. For example, to convert "\r\n" to "\n":
String out = in.replace('\\','#').replaceAll("#r#n","#n").replace('#','\\');
Of course, that won't work very well if you choose a replacement character that can occur in the input string.
I think java really messed with regular expression in String.replaceAll();
Other than java I have never seen a language parse regular expression this way. You will be confused if you have used regex in some other languages.
If you need a "\\" in the replacement string, you can use java.util.regex.Matcher.quoteReplacement(String):
text.replaceAll("/", Matcher.quoteReplacement("\\"));
By using this Matcher class you can get the expected result.
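A small sketch of what quoteReplacement does, assuming a made-up helper class; the only library call involved is Matcher.quoteReplacement.

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // Replace every forward slash with a single backslash.
    static String slashesToBackslashes(String s) {
        return s.replaceAll("/", Matcher.quoteReplacement("\\"));
    }

    // Expose the quoted form so it can be inspected.
    static String quoted(String s) {
        return Matcher.quoteReplacement(s);
    }

    public static void main(String[] args) {
        System.out.println(slashesToBackslashes("a/b/c")); // a\b\c
        System.out.println(quoted("\\"));                  // \\
        System.out.println(quoted("$1"));                  // \$1
    }
}
```

quoteReplacement prefixes each backslash and dollar sign with a backslash, which is exactly the escaping the replacement syntax expects.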
I have Java string:
String b = "/feedback/com.school.edu.domain.feedback.Review$0/feedbackId";
I also have generated pattern against which I want to match this string:
String pattern = "/feedback/com.school.edu.domain.feedback.Review$0(.)*";
When I call b.matches(pattern) it returns false. I know the dollar sign is part of Java regex syntax, but I don't know what my pattern should look like. I am assuming that the $ in the pattern needs to be escaped somehow, but I don't know with how many characters. The $ sign is important to me as it helps me distinguish elements in a list (the numbers after the dollar), and I can't go without it.
Use
String escapedString = java.util.regex.Pattern.quote(myString)
to automatically escape all special regex characters in a given string.
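Applied to the string from the question, this might look as follows. The class and method names are invented, and splitting the pattern into a quoted literal prefix plus a ".*" suffix is an assumption about how the generated pattern could be built.

```java
import java.util.regex.Pattern;

class QuotedPrefixDemo {
    static final String PREFIX = "/feedback/com.school.edu.domain.feedback.Review$0";

    // Original attempt: $ is read as an end-of-input anchor, so nothing matches.
    static boolean matchesUnescaped(String s) {
        return s.matches(PREFIX + "(.)*");
    }

    // Quote the literal part, then append the wildcard as real regex.
    static boolean matchesQuoted(String s) {
        return s.matches(Pattern.quote(PREFIX) + ".*");
    }

    public static void main(String[] args) {
        String b = PREFIX + "/feedbackId";
        System.out.println(matchesUnescaped(b)); // false
        System.out.println(matchesQuoted(b));    // true
    }
}
```

Note that only the literal part is passed to Pattern.quote; quoting the whole pattern would also turn the trailing wildcard into literal text.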
You need to escape $ in the regex with a back-slash (\), but as a back-slash is an escape character in strings you need to escape the back-slash itself.
You will need to escape any special regex char the same way, for example with ".".
String pattern = "/feedback/com\\.navteq\\.lcms\\.common\\.domain\\.poi\\.feedback\\.Review\\$0(.)*";
In Java regex both . and $ are special. You need to escape each of them with 2 backslashes, i.e.
"/feedback/com\\.navtag\\.etc\\.Review\\$0(.*)"
(1 backslash is for the Java string, and 1 is for the regex engine.)
Escape the dollar with \
String pattern =
"/feedback/com.navteq.lcms.common.domain.poi.feedback.Review\\$0(.)*";
I advise you to escape . as well, since . represents any character.
String pattern =
"/feedback/com\\.navteq\\.lcms\\.common\\.domain\\.poi\\.feedback\\.Review\\$0(.)*";
The answer by @Colin Hebert, edited by @theon, is correct. The explanation is as follows. @azec-pdx
It is a regex as a string literal (within double quotes).
period (.) and dollar-sign ($) are special regex characters (metacharacters).
To make the regex engine interpret them as normal regex characters period(.) and dollar-sign ($), you need to prefix a single backslash to each. The single backslash ( itself a special regex character) quotes the character following it and thus escaping it.
Since the given regex is a string literal, another backslash is required to be prefixed to each to avoid confusion with the usual visible-ASCII escapes(character, string and Unicode escapes in string literals) and thus avoid compiler error.
Even if you use, within a string literal, a special regex construct that happens to be defined as a string escape sequence, it needs to be prefixed with another backslash. For example, the regex construct \b (word boundary) would clash with the string escape \b (backspace). Another backslash is therefore prefixed to avoid the clash, and \\b is then read by the regex engine as a word boundary.
To be always safe, all single backslash escapes (quotes) within string literals are prefixed with another backslash. For example, the string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
The last period (.)* is supposed to be interpreted as special regex character and thus it needs no quoting by a backslash, let alone prefixing a second one.
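The points above can be sketched with manually escaped patterns. The class and method names are hypothetical, and the pattern uses the package name from the question rather than the one in the answers.

```java
import java.util.regex.Pattern;

class ManualEscapeDemo {
    // "\\." and "\\$" reach the regex engine as \. and \$ (literal . and $);
    // the trailing (.)* is left unescaped on purpose.
    static final String PATTERN =
            "/feedback/com\\.school\\.edu\\.domain\\.feedback\\.Review\\$0(.)*";

    static boolean matches(String s) {
        return s.matches(PATTERN);
    }

    // "\\b" reaches the regex engine as \b, a word boundary.
    static boolean wordBoundaryFinds(String text) {
        return Pattern.compile("\\bcat\\b").matcher(text).find();
    }

    // "\b" is turned into a backspace character by the compiler,
    // so this pattern looks for backspace, "cat", backspace.
    static boolean backspaceFinds(String text) {
        return Pattern.compile("\bcat\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String b = "/feedback/com.school.edu.domain.feedback.Review$0/feedbackId";
        System.out.println(matches(b));                     // true
        System.out.println(wordBoundaryFinds("a cat sat")); // true
        System.out.println(backspaceFinds("a cat sat"));    // false
    }
}
```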