How to escape from escape sequence in a string variable - java

I tried to initialize a string variable with the path of a file, but the compiler reports that the escape sequence is not valid. Is there a solution?
String s="F:\abc\xyz.txt";

Converting @Hank D and @Seige's comments to an answer:
In Java and C# (it's hard to tell which language you're using here, but it's likely one of those two), the backslash character \ is used to start escape sequences you can use to include special characters in your string that you can't normally type on the keyboard or that would otherwise cause problems. For example, you can put a newline in a string by writing \n:
String multiline = "This String\nSpans Multiple\nLines!";
You can include Unicode characters with the \u sequence:
String heart = "I \u2764 Escape Sequences!";
And you can include nested quotes with the \" sequence:
String quotation = "Quoth the raven, \"Nevermore.\"";
In your case, you're trying to use the \ character as a path separator, but Java/C# is interpreting what you're doing as trying to build invalid escape sequences. That is, the string
F:\abc\xyz.txt
is getting interpreted as
F:(\a)bc(\x)yz.txt
To fix this, you can use the fact that the escape sequence \\ stands for a backslash and write the string like this:
String s = "F:\\abc\\xyz.txt";
Fun fact: The reason that the backslash was chosen as the escape character in Java/C# is that it was chosen that way in C because that character was so rarely used... and then DOS/Windows came along, used it as the path separator, and broke everything. :-)
Alternatively, in C#, you can write
String s = @"F:\abc\xyz.txt";
The @ prefix disables escape sequences in the string, which makes things a lot easier to read.
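For the Java side of this answer, here is a minimal sketch of the two usual ways to write the path. The class and method names are made up for illustration, and the claim that forward slashes work is an assumption about the file APIs you end up calling (the standard java.io and java.nio classes generally accept them on Windows).

```java
final class PathLiteralDemo {
    // Each backslash is doubled in source, so the compiler emits one backslash per pair.
    static String escapedPath() {
        return "F:\\abc\\xyz.txt";
    }

    // Forward slashes need no escaping at all.
    static String forwardSlashPath() {
        return "F:/abc/xyz.txt";
    }

    public static void main(String[] args) {
        System.out.println(escapedPath());          // F:\abc\xyz.txt
        System.out.println(escapedPath().length()); // 14
        System.out.println(forwardSlashPath());     // F:/abc/xyz.txt
    }
}
```

The length check shows that the doubled backslashes exist only in the source code; the string in memory holds single backslashes.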

Related

Using regex to only match those Strings which use escape character correctly (according to Java syntax)?

take these strings for example:
"hello world\n" (correct - regex should match this)
"I'm happy \ here" (this is incorrect as the escape character is not
used correctly - regex should not match this one)
I've tried searching on google but didn't find anything helpful.
I want this one to be used in a parser which only parses string literals from a java code file.
Here is the regex I used:
"\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\""
what am I doing wrong?
I guess you gave us the regex in Java String literal form, like
String regex = "\\\"(\\[tbnrf\'\"\\])*[a-zA-Z0-9\\`\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)\\_\\-\\+\\=\\|\\{\\[\\}\\]\\;\\:\\'\\/\\?\\>\\.\\<\\,]\\\"";
Unpacking that from Java's String escaping syntax gives the raw regex:
\"(\[tbnrf'"\])*[a-zA-Z0-9\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]\"
That consists of:
\" matching a double-quote character (Java String literal begins here). Escaping the double quotes with backslash isn't necessary: " on its own is ok as well.
(\[tbnrf'"\])*: a group, repeated 0...n times. I guess you want that to match against the various Java backslash escapes, but that should read (\\[tbnrf'"\\])* with a double backslash in front and inside the character class. And maybe you want to cover the Java octal escapes as well (see the language specification), giving (\\[tbnrf01234567'"\\])*
[a-zA-Z0-9\`\~\!\@\#\$\%\^\&\*\(\)\_\-\+\=\|\{\[\}\]\;\:\'\/\?\>\.\<\,]: a character class matching one character from a selected list of alphabetic and punctuation characters. I'd replace that with [^"\\], meaning anything but double quote or backslash.
\" matching a double-quote character (string literal ends here). Once again, no need to escape the double quote.
Besides the individual elements, the overall structure of the regex probably isn't what you want: You allow only strings beginning with any number of backslash escapes, followed by exactly one non-escape character, and this enclosed in a pair of double quotes.
The overall structure should instead be "(backslash_escape|simple_character)*"
So, the complete regex would be:
"(\\[tbnrf01234567'"\\]|[^"\\])*"
or, expressed in a Java literal:
String regex = "\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"";
And, although this is shorter than your original attempt, I'd still not call it readable and opt for a different implementation, not using regular expressions.
P.S. Although I did some testing with my regex, I'm not at all sure that it covers all relevant cases correctly.
P.P.S. There are the \uxxxx escapes, not yet covered by the regex.
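A quick way to check the suggested regex against the two strings from the question is sketched below. The class and method names are hypothetical, and, as the P.P.S. notes, the \uXXXX escapes are still not handled.

```java
import java.util.regex.Pattern;

final class StringLiteralRegexDemo {
    // Raw regex: "(\\[tbnrf01234567'"\\]|[^"\\])*"
    static final Pattern LITERAL =
            Pattern.compile("\"(\\\\[tbnrf01234567'\"\\\\]|[^\"\\\\])*\"");

    // True when the whole input is one string literal with only known escapes.
    static boolean isValidLiteral(String source) {
        return LITERAL.matcher(source).matches();
    }

    public static void main(String[] args) {
        String good = "\"hello world\\n\"";    // source text: "hello world\n"
        String bad = "\"I'm happy \\ here\"";  // source text: "I'm happy \ here"
        System.out.println(isValidLiteral(good)); // true
        System.out.println(isValidLiteral(bad));  // false
    }
}
```

The second string is rejected because a backslash followed by a space fits neither alternative: it is not a listed escape, and the simple-character class excludes the backslash itself.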

Regular expression that matches "{$" AND NOT matches "\{$"

I am working on a project with lexical analysis and basically I have to generate tokens that are text and that are not text.
Tokens that are text are considered all characters until the "{$" sequence.
Tokens that are not text are considered all characters inside the "{$" and "$}" sequences.
Note that the "{$" character sequence can be escaped by writing "\{$" so this also becomes a part of text.
My job is to read a String of text, and for that I am using Regular expressions.
I am using the Java Scanner and Pattern classes and this is my work so far:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Scanner sc = new Scanner(text);
Pattern textPattern = Pattern.compile("{\\$"); // insert working regex here
sc.useDelimiter(textPattern);
System.out.println(sc.next());
This is what should be printed out:
This is \{$ just text$}
This is
How do I make a regex for the following logical statement:
match "{$" AND NOT match "\{$"
You can use Negative Look-Behind (?<!\\) in front of \{\$ to ensure that escaped curly braces are not matched:
(?<!\\)\{\$
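Plugged into the Scanner code from the question, the look-behind could be used roughly like this. It is only a sketch: the class and method names are invented here, and it handles just the simple case where a single backslash escapes the brace.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class LookbehindDelimiterDemo {
    // Raw regex: (?<!\\)\{\$  i.e. "{$" not immediately preceded by a backslash
    static final Pattern UNESCAPED_OPEN = Pattern.compile("(?<!\\\\)\\{\\$");

    // Returns everything up to the first unescaped "{$".
    static String firstTextToken(String input) {
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(UNESCAPED_OPEN);
            return sc.next();
        }
    }

    public static void main(String[] args) {
        String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
        System.out.println(firstTextToken(text));
    }
}
```

Note that the Java literal needs four backslashes inside the look-behind: two for the compiler and two for the regex engine.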
Possible solution:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Pattern textPattern = Pattern.compile(
        "(?<text>(?:\\\\.|(?!\\{\\$).)+)"  // text - `\x` or non-start-of `{$`
        + "|"                              // OR
        + "(?<nonText>\\{\\$.*?\\$\\})");  // non-text
Matcher m = textPattern.matcher(text);
while (m.find()) {
    if (m.group("text") != null) {
        System.out.println("text : " + m.group("text"));
    } else {
        System.out.println("non-text : " + m.group("nonText"));
    }
}
System.out.println("\01234");
Explanation:
From what I see, you want \ to be special character used for escaping.
Problem now is to determine where \ is meant to escape character/sequence after it, and when it should be treated as simple printable character (literal).
(possible problem)
Lets say that you have text dir1\dir2\ and you want to add after it non-text foo. How would you write it?
You could try writing dir1\dir2\{$foo$} but this could mean that you just escaped {$ which would prevent foo from being seen as non-text.
In Java, String literals faced same problem since \ can be used to create other special characters using
pairs \n \r \t \"
Unicode codepoints \uFFFF
octal format \012.
The solution used in Java (and many other languages) was to make \ always a special character, so that producing a literal \ requires escaping it with another \ (there was no real need to add yet another special character for that). So to represent \ we need to write it as \\.
So if we have text dir1\dir2\ we would need to write it as dir1\\dir2\\. This would allow us to concatenate to it {$non-text$} without fear that this last \\ placed right before {$ will be causing misinterpretation of it and prevent seeing it as non-text sequence.
So now when we see dir1\\dir2\\{$foo$} we can interpret {$ properly.
From this point I am assuming you are also using this approach which ensures proper interpretation of \.
Now, lets try to create rule which will let us find/separate text and non-text characters.
Based on our example we know that dir1\\dir2\\{$foo$} is: text dir1\\dir2\\ and non-text {$foo$}.
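So as you see, splitting on {$ which is not preceded by \ can fail you sometimes: when the number of preceding \ is even (as in \\{$), the backslashes escape each other and {$ is really unescaped, yet the look-behind still rejects it.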
Probably simpler solution is to accept
for text:
\\. - regex representing characters which are preceded by \ (this will handle the \\ literal and the escaped \{, which will also allow us to accept the rest of the $..$} part)
(?!\{\$). - regex representing character which isn't { which would start {$ area.
for non-text:
\{\$.*?\$\} - regex representing {$...$} - we know that it will be unescaped because all escaped characters will be accepted by \\..
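To make the dir1\\dir2\\{$foo$} case concrete, here is a small sketch that runs both approaches on it. The class name, the tokenize helper and the "text:" / "non-text:" labels are assumptions added for illustration; the two patterns are the ones discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeAwareTokenizerDemo {
    static final Pattern TOKEN = Pattern.compile(
            "(?<text>(?:\\\\.|(?!\\{\\$).)+)"   // escaped pair, or any char not starting "{$"
            + "|(?<nonText>\\{\\$.*?\\$\\})");  // unescaped {$ ... $}

    // The simpler look-behind pattern, kept here only for comparison.
    static final Pattern LOOKBEHIND = Pattern.compile("(?<!\\\\)\\{\\$");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String kind = m.group("text") != null ? "text" : "non-text";
            tokens.add(kind + ": " + m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "dir1\\\\dir2\\\\{$foo$}"; // characters: dir1\\dir2\\{$foo$}
        System.out.println(LOOKBEHIND.matcher(input).find()); // false: the "{$" is missed
        tokenize(input).forEach(System.out::println);
        // text: dir1\\dir2\\
        // non-text: {$foo$}
    }
}
```

The look-behind sees a backslash right before {$ and gives up, while the alternation consumes backslashes in pairs and therefore reaches {$ with nothing left to escape it.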

java regex escape sequences

I was wondering about regex in Java and stumbled upon the use of backslashes. For instance, if I wanted to look for occurences of the words "this regex" in a text, I would do something like this:
Pattern.compile("this regex");
Nonetheless, I could also do something like this:
Pattern.compile("this\\sregex");
My question is: what is the difference between the two of them? And why do I have to type the backslash twice, I mean, why isn't \s an escape sequence in Java? Thanks in advance!
\s means any whitespace character, including tab, line feed and carriage return.
Java string literals already use \ to escape special characters. To put the character \ in a string literal, you need to write "\\". However regex patterns also use \ as their escape character, and the way to put that into a string literal is to use two, because it goes through two separate escaping processes. If you read your regex pattern from a plain text file for example, you won't need double escaping.
The reason you need two backslashes is that when you enter a regex string in Java code you are actually dealing with two parsers:
The first is the Java compiler, which is converting your string literal to a Java String.
The second is the regex parser, which interprets your regex after it has been converted to a Java string and passed to the regex parser when you call Pattern.compile.
So when you input "this\\sregex", it will be converted to the Java string "this\sregex" by the Java compiler. Then when you call Pattern.compile with the string, the backslash will be interpreted by the regex compiler as a special character.
The difference is that \s denotes a whitespace character, which can be more than just a blank space. It can be a tab, a newline or a carriage return, to name a few.
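The difference between the two patterns can be seen with a short sketch like the following (the class name and the found helper are made up for the example):

```java
import java.util.regex.Pattern;

final class WhitespaceEscapeDemo {
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // 12 characters between the quotes in source, 11 in memory: this\sregex
        System.out.println("this\\sregex".length());               // 11
        System.out.println(found("this regex", "this regex"));     // true
        System.out.println(found("this regex", "this\tregex"));    // false: a tab is not a space
        System.out.println(found("this\\sregex", "this regex"));   // true
        System.out.println(found("this\\sregex", "this\tregex"));  // true: \s also matches a tab
    }
}
```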

Why String.replaceAll() in java requires 4 slashes "\\\\" in regex to actually replace "\"?

I recently noticed that String.replaceAll(regex,replacement) behaves very weirdly when it comes to the escape character "\" (backslash).
For example consider there is a string with filepath - String text = "E:\\dummypath"
and we want to replace the "\\" with "/".
text.replace("\\","/") gives the output "E:/dummypath" whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.
If we want to implement the same functionality with replaceAll() we need to write it as,
text.replaceAll("\\\\","/")
One notable difference is replaceAll() has its arguments as reg-ex whereas replace() has arguments character-sequence!
But text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/")
Digging Deeper:
Even more weird behaviors can be observed when we try some other inputs.
Lets assign text="Hello\nWorld\n"
Now,
text.replaceAll("\n","/"), text.replaceAll("\\n","/"), text.replaceAll("\\\n","/") all these three gives the same output Hello/World/
Java had really messed up with the reg-ex in its best possible way I feel! No other language seems to have these playful behaviors in reg-ex. Any specific reason, why Java messed up like this?
You need to escape twice, once for Java, once for the regex.
Java code is
"\\\\"
makes a regex string of
"\\" - two chars
but the regex needs an escape too so it turns into
\ - one symbol
@Peter Lawrey's answer describes the mechanics. The "problem" is that backslash is an escape character in both Java string literals, and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider ... depending on what you want the regex to mean.
But why is it like that?
It is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. Awkwardness of double escaping didn't become apparent in Java until they added regex support in the form of the Pattern class ... in Java 1.4.
So how do other languages manage to avoid this?
They do it by providing direct or indirect syntactic support for regexes in the programming language itself. For instance, in Perl, Ruby, Javascript and many other languages, there is a syntax for patterns / regexs (e.g. '/pattern/') where string literal escaping rules do not apply. In C# and Python, they provide an alternative "raw" string literal syntax in which backslashes are not escapes. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)
Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?
The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.
The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.
The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However in the regex language, a backslash followed by any non-alphabetic character means the latter character. So, a backslash followed by a newline character ... means the same thing as a newline.
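The three cases, plus the failing single-backslash call from the question, can be reproduced with a sketch along these lines (the class and helper names are hypothetical):

```java
import java.util.regex.PatternSyntaxException;

final class NewlineRegexDemo {
    // A regex consisting of one backslash is an incomplete escape.
    static boolean loneBackslashIsRejected() {
        try {
            "E:\\dummypath".replaceAll("\\", "/");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "Hello\nWorld\n";
        System.out.println(text.replaceAll("\n", "/"));    // regex is a newline character
        System.out.println(text.replaceAll("\\n", "/"));   // regex is backslash + n
        System.out.println(text.replaceAll("\\\n", "/"));  // regex is backslash + newline character
        System.out.println(loneBackslashIsRejected());     // true
        System.out.println("E:\\dummypath".replaceAll("\\\\", "/")); // E:/dummypath
    }
}
```

All three replaceAll calls on text print Hello/World/, for the three different reasons given above.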
1) Let's say you want to replace a single \ using Java's replaceAll method:
\
˪--- 1) the final backslash
2) Java's replaceAll method takes a regex as first argument. In a regex literal, \ has a special meaning, e.g. in \d which is a shortcut for [0-9] (any digit). The way to escape a metachar in a regex literal is to precede it with a \, which leads to:
\ \
| ˪--- 1) the final backslash
|
˪----- 2) the backslash needed to escape 1) in a regex literal
3) In Java, there is no regex literal: you write a regex in a string literal (unlike JavaScript for example, where you can write /\d+/). But in a string literal, \ also has a special meaning, e.g. in \n (a new line) or \t (a tab). The way to escape a metachar in a string literal is to precede it with a \, which leads to:
\\\\
|||˪--- 1) the final backslash
||˪---- 3) the backslash needed to escape 1) in a string literal
|˪----- 2) the backslash needed to escape 1) in a regex literal
˪------ 3) the backslash needed to escape 2) in a string literal
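The layers can be checked by looking at string lengths, and if counting backslashes feels error-prone, Pattern.quote can build the regex layer for you. The sketch below assumes a helper named toForwardSlashes, which is not part of any answer here.

```java
import java.util.regex.Pattern;

final class BackslashLayersDemo {
    static String toForwardSlashes(String path) {
        // Pattern.quote returns a regex that matches its argument literally,
        // so only the string-literal layer of escaping is left to write by hand.
        return path.replaceAll(Pattern.quote("\\"), "/");
    }

    public static void main(String[] args) {
        System.out.println("\\\\".length()); // 2: the regex text is two backslashes
        System.out.println("\\".length());   // 1: the single character that regex matches
        System.out.println(toForwardSlashes("E:\\dummypath")); // E:/dummypath
    }
}
```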
This is because Java tries to give \ a special meaning in the replacement string, so that \$ will be a literal $ sign, but in the process they seem to have removed the actual special meaning of \
While text.replaceAll("\\\\","/"), at least can be considered to be okay in some sense (though it itself is not absolutely right), all the three executions, text.replaceAll("\n","/"), text.replaceAll("\\n","/"), text.replaceAll("\\\n","/") giving same output seem even more funny. It is just contradicting as to why they have restricted the functioning of text.replaceAll("\\","/") for the same reason.
Java didn't mess up with regular expressions. It is because, Java likes to mess up with coders by trying to do something unique and different, when it is not at all required.
One way around this problem is to replace backslash with another character, use that stand-in character for intermediate replacements, then convert it back into backslash at the end. For example, to convert "\r\n" to "\n":
String out = in.replace('\\','#').replaceAll("#r#n","#n").replace('#','\\');
Of course, that won't work very well if you choose a replacement character that can occur in the input string.
I think java really messed with regular expression in String.replaceAll();
Other than java I have never seen a language parse regular expression this way. You will be confused if you have used regex in some other languages.
In case of using the "\\" in replacement string, you can use java.util.regex.Matcher.quoteReplacement(String)
text.replaceAll("/", Matcher.quoteReplacement("\\"));
By using this Matcher class you can get the expected result.
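A minimal sketch of that suggestion, going from forward slashes to backslashes (the class and method names are made up for the example):

```java
import java.util.regex.Matcher;

final class QuoteReplacementDemo {
    static String toBackslashes(String path) {
        // quoteReplacement escapes backslashes and dollar signs for the replacement syntax.
        return path.replaceAll("/", Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        System.out.println(Matcher.quoteReplacement("\\").length()); // 2: the backslash got escaped
        System.out.println(toBackslashes("E:/dummypath"));           // E:\dummypath
        System.out.println("a/b".replaceAll("/", "\\\\"));           // a\b, escaping by hand instead
    }
}
```

When neither the pattern nor the replacement needs regex features, plain replace("/", "\\") avoids the issue altogether.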

How to replace a special character with single slash

I have a question about strings in Java. Let's say, I have a string like so:
String str = "The . startup trace ?state is info?";
As the string contains the special character like "?" I need the string to be replaced with "\?" as per my requirement. How do I replace special characters with "\"? I tried the following way.
str.replace("?","\?");
But it gives a compilation error. Then I tried the following:
str.replace("?","\\?");
When I do this it replaces the special characters with "\\". But when I print the string, it prints with single slash. I thought it is taking single slash only but when I debugged I found that the variable is taking "\\".
Can anyone suggest how to replace the special characters with single slash ("\")?
On escape sequences
A declaration like:
String s = "\\";
defines a string containing a single backslash. That is, s.length() == 1.
This is because \ is a Java escape character for String and char literals. Here are some other examples:
"\n" is a String of length 1 containing the newline character
"\t" is a String of length 1 containing the tab character
"\"" is a String of length 1 containing the double quote character
"\/" contains an invalid escape sequence, and therefore is not a valid String literal
it causes compilation error
Naturally you can combine escape sequences with normal unescaped characters in a String literal:
System.out.println("\"Hey\\\nHow\tare you?");
The above prints (tab spacing may vary):
"Hey\
How are you?
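To see that each escape sequence really becomes a single character, one can print the lengths. This is just a sketch; the class name and the sample helper are invented for it.

```java
final class EscapeLengthDemo {
    // The same literal as in the println example above.
    static String sample() {
        return "\"Hey\\\nHow\tare you?";
    }

    public static void main(String[] args) {
        System.out.println("\\".length());     // 1
        System.out.println("\n".length());     // 1
        System.out.println("\"".length());     // 1
        System.out.println(sample().length()); // 18, although the literal body has 22 characters
    }
}
```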
References
JLS 3.10.6 Escape Sequences for Character and String Literals
See also
Is the char literal '\"' the same as '"' ?(backslash-doublequote vs only-doublequote)
Back to the problem
Your problem definition is very vague, but the following snippet works as it should:
System.out.println("How are you? Really??? Awesome!".replace("?", "\\?"));
The above snippet replaces ? with \?, and thus prints:
How are you\? Really\?\?\? Awesome!
If instead you want to replace a char with another char, then there's also an overload for that:
System.out.println("How are you? Really??? Awesome!".replace('?', '\\'));
The above snippet replaces ? with \, and thus prints:
How are you\ Really\\\ Awesome!
String API links
replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
replace(char oldChar, char newChar)
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
On how regex complicates things
If you're using replaceAll or any other regex-based methods, then things become somewhat more complicated. It can be greatly simplified if you understand some basic rules.
Regex patterns in Java are given as String values
Metacharacters (such as ? and .) have special meanings, and may need to be escaped by preceding with a backslash to be matched literally
The backslash is also a special character in replacement String values
The above factors can lead to the need for numerous backslashes in patterns and replacement strings in a Java source code.
It doesn't look like you need regex for this problem, but here's a simple example to show what it can do:
System.out.println(
"Who you gonna call? GHOSTBUSTERS!!!"
.replaceAll("[?!]+", "<$0>")
);
The above prints:
Who you gonna call<?> GHOSTBUSTERS<!!!>
The pattern [?!]+ matches one-or-more (+) of any characters in the character class [...] definition (which contains a ? and ! in this case). The replacement string <$0> essentially puts the entire match $0 within angled brackets.
Related questions
Having trouble with Splitting text. - discusses common mistakes like split(".") and split("|")
Regular expressions references
regular-expressions.info
Character class and Repetition with Star and Plus
java.util.regex.Pattern and Matcher
In case you want to replace ? with \?, there are 2 possibilities: replace and replaceAll (for regular expressions):
str.replace("?", "\\?")
str.replaceAll("\\?","\\\\?");
The result is "The . startup trace \?state is info\?"
If you want to replace ? with \, just remove the ? character from the second argument.
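Both variants can be compared side by side with a sketch like this one (the class and method names are hypothetical):

```java
final class QuestionMarkEscapeDemo {
    // Literal replacement: only the Java string-literal layer of escaping applies.
    static String viaReplace(String s) {
        return s.replace("?", "\\?");
    }

    // Regex replacement: ? is escaped in the pattern, the backslash in the replacement.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\?", "\\\\?");
    }

    public static void main(String[] args) {
        String str = "The . startup trace ?state is info?";
        System.out.println(viaReplace(str));    // The . startup trace \?state is info\?
        System.out.println(viaReplaceAll(str)); // The . startup trace \?state is info\?
    }
}
```

Strings are immutable, so the result has to be assigned or used; calling str.replace(...) on its own line, as in the question, leaves str unchanged.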
But when I print the string, it prints
with single slash.
Good. That's exactly what you want, isn't it?
There are two simple rules:
A backslash inside a String literal has to be specified as two to satisfy the compiler, i.e. "\\". Otherwise it is taken as a special-character escape.
A backslash in a regular expression has to be specified as two to satisfy regex, otherwise it is taken as a regex escape. Because of the first rule, this means you have to write 2x2=4 of them: "\\\\" (and because of the forum software I actually had to write 8!).
String str="\\";
str=str.replace(str,"\\\\");
System.out.println("New String="+str);
Output: New String=\\
In Java source, "\\" denotes a single backslash. So the above code replaces the single backslash with two backslashes, which is what gets printed.
