Avoid ignoring next character after "\" - java

I am converting unicode characters stored in a String into unicode text.
For example, here is a String -
String unicode = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064";
Now from this string, I want to get separate unicode character -
u0041 u006e u0064 u0072 u006f u0069 u0064
So for that, I use the following code -
String[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".split("\");
But now since the " after \ is ignored in split("\"), I am getting an error.
How to not ignore a character after \?

The \ character is an escape character. You are getting a syntax error because \" is the escape sequence for placing a " character in a String literal. To place a \ inside a String literal, you need to use \\ (the first \ escapes the special meaning of the second \). So a syntactically correct statement would be:
String[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".split("\\");
But that is not going to give you what you want, because the string you are calling split() on does not contain any \ characters. The compiler has already decoded each Unicode escape, so the string contains seven characters with code points U+0041, etc. (Also, the split() method expects a regular expression, and a lone \ is not a valid regular expression.) Perhaps you want:
String[] parts = "\\u0041\\u006e\\u0064\\u0072\\u006f\\u0069\\u0064".split("\\\\");
or perhaps you want
char[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".toCharArray();
and you can then convert each element of parts to a Unicode code point string.
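As a minimal sketch of that last step (the class and method names are made up for illustration, and it assumes every character fits in a single char, i.e. no surrogate pairs), each element can be formatted as "u" plus four hex digits:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDemo {
    // Hypothetical helper: formats every char (UTF-16 code unit) as "u" plus four hex digits.
    static List<String> toCodePointStrings(String text) {
        List<String> parts = new ArrayList<>();
        for (char c : text.toCharArray()) {
            parts.add(String.format("u%04x", (int) c));
        }
        return parts;
    }

    public static void main(String[] args) {
        // The compiler has already decoded the escapes, so this literal is simply "Android".
        String unicode = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064";
        System.out.println(unicode);                      // Android
        System.out.println(toCodePointStrings(unicode));  // [u0041, u006e, u0064, u0072, u006f, u0069, u0064]
    }
}
```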

You need to escape the backslash. You also need to escape the backslash again because split() treats the string as a regular expression. Use .split("\\\\");
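Here is a small sketch of what that split returns when the string really does contain backslashes (the class name and the shortened sample text are only for illustration). Note the empty first element, because the text starts with the delimiter:

```java
import java.util.Arrays;

class SplitOnBackslashDemo {
    // Splits on a literal backslash: the regex is \\ and the Java literal for it is "\\\\".
    static String[] splitOnBackslash(String text) {
        return text.split("\\\\");
    }

    public static void main(String[] args) {
        // Doubled backslashes keep the escapes as plain text instead of letting the compiler decode them.
        String escaped = "\\u0041\\u006e\\u0064";
        // The text starts with a backslash, so the first element is an empty string.
        System.out.println(Arrays.toString(splitOnBackslash(escaped))); // [, u0041, u006e, u0064]
    }
}
```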

Related

Using regex, with reserved characters in Java [duplicate]

How to write a regular expression to match this \" (a backslash then a quote)? Assume I have a string like this:
click to search
I need to replace all the \" with a ", so the result would look like:
click to search
This one does not work: str.replaceAll("\\\"", "\"") because it only matches the quote. Not sure how to get around with the backslash. I could have removed the backslash first, but there are other backslashes in my string.
If you don't need any regex mechanisms (predefined character classes like \d, quantifiers, etc.), then instead of replaceAll, which expects a regex, use replace, which expects literals:
str = str.replace("\\\"","\"");
Both methods will replace all occurrences of targets, but replace will treat targets literally.
BUT if you really must use regex you are looking for
str = str.replaceAll("\\\\\"", "\"")
\ is a special character in regex (used for instance to create \d, the character class representing digits). To make regex treat \ as a normal character you need to place another \ before it to turn off its special meaning (you need to escape it). So the regex for a literal backslash is \\.
But to create a string literal representing the text \\, so that you can pass it to the regex engine, you need to write it with four \ ("\\\\"), because \ is also a special character in String literals (the parts of code written using "...") since it can be used, for instance, as \t to represent a tab.
That is why you also need to escape \ there.
In short you need to escape \ twice:
in regex \\
and then in String literal "\\\\"
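To compare the two approaches side by side, here is a rough sketch; the class name and the sample text are invented for the example. Both calls remove the backslash in front of each quote and leave other backslashes alone:

```java
class UnescapeQuotesDemo {
    // Literal replacement: the target is the two characters \ and ".
    static String withReplace(String s) {
        return s.replace("\\\"", "\"");
    }

    // Regex replacement: the pattern is \\" (an escaped backslash, then a quote).
    static String withReplaceAll(String s) {
        return s.replaceAll("\\\\\"", "\"");
    }

    public static void main(String[] args) {
        // Made-up sample text: say \"hi\" from C:\temp
        String input = "say \\\"hi\\\" from C:\\temp";
        System.out.println(input);
        System.out.println(withReplace(input));     // say "hi" from C:\temp
        System.out.println(withReplaceAll(input));  // say "hi" from C:\temp
    }
}
```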
You don't need a regular expression.
str.replace("\\\"", "\"")
should work just fine.
The replace method takes two substrings and replaces all non-overlapping occurrences of the first with the second. Per the javadoc:
public String replace(CharSequence target,
CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end, for example, replacing "aa" with "b" in the string "aaa" will result in "ba" rather than "ab".
try this: str.replaceAll("\\\\\"", "\\\"")
because Java will replace \ twice:
(1) \\\\\" --> \\" (for string)
(2) \\" --> \" (for regex)

Difference between '\' and '\\' when using them as escape characters

I know that we use escape characters like \n for next line and \t for tab.
But today, while working on a few strings, I came across \\$.
I had to print "nike$" so to print it I had to modify the string as "nike\\$".
I want to know what is the exact difference between \ and \\.
Inside a string literal, \ is an escape: The next character that follows tells us what it will do, as in your \n example for newline.
This means you can't put \ in a string on its own, since it's half of an escape sequence. Instead, to have a \ actually in a string, you use \\.
I had to print "nike$" so to print it I had to modify the string as "nike\\$"
"nike\\$" will result in a string that outputs (for instance, via System.out.println) as nike\$, not nike$.
Your use of \\$ suggests to me that you were feeding a regular expression pattern into something, e.g.:
p = Pattern.compile("nike\\$");
In that situation, we have two levels of escaping going on: The string literal, and the regular expression. To have a literal $ in a regular expression, it has to be escaped by \ because otherwise it's an end-of-input assertion. To get that \$ actually to the regular expression parser when using a string literal, we have to escape the backslash in the literal so we actually have a backslash in the string for the regular expression engine to see, thus \\$.
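A quick sketch of both levels, assuming the pattern is meant to match the whole input (the class and method names are made up here). Pattern.quote is shown as an alternative when the entire text should be taken literally:

```java
import java.util.regex.Pattern;

class DollarEscapeDemo {
    // The regex is nike\$ ; the backslash must be doubled inside the string literal.
    static boolean isNikeDollar(String input) {
        return Pattern.compile("nike\\$").matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("nike\\$");             // nike\$ (the string holds one backslash)
        System.out.println(isNikeDollar("nike$")); // true
        System.out.println(isNikeDollar("nike"));  // false
        // Pattern.quote builds a pattern that treats the whole text literally.
        System.out.println("nike$".matches(Pattern.quote("nike$"))); // true
    }
}
```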

Why should I use a different number of escape characters in different situations?

With regular expressions in Java, why should I write "\n" to define a new line character and "\\s" to define a whitespace character?
Why does the number of backslashes differ?
Java does its own string parsing, converting the literal in your code to an internal string in memory before it sends that string to the regex parser.
Java converts the 2 characters \n to a linefeed (ASCII code 0x0A) and the first 2 (!) characters in \\s to a single backslash: \s. Now this string is sent to the regex parser, and since regular expressions recognize their own special escaped characters, it treats the \s as "any whitespace".
At this point, the code \n is already stored as a single character "linefeed", and the regular expression does not process it again.
Since regular expressions also recognize the set \n as "a linefeed", you can also use \\n in your Java string -- Java converts the escaped \\ to a single \, and the regular expression module then finds \n, which (again) gets translated into a linefeed.
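A Java string literal has a certain set of allowed escape sequences, of which "\n" is one, but "\s" traditionally is not (since Java 15 "\s" is accepted in a literal, but it means a single space, not the regex whitespace class). A string literal doesn't understand the regexp shorthand for whitespace. You're probably passing a Java string to Pattern.compile() or to a method that takes a regex, so in order to pass \s through, you have to escape the "\" by doubling it.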
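The difference is easy to see by checking string lengths and then matching. This is only an illustrative sketch (the class and helper names are invented):

```java
class EscapeLevelsDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println("\n".length());   // 1: the compiler already produced a linefeed
        System.out.println("\\n".length());  // 2: a backslash followed by n
        System.out.println("\\s".length());  // 2: a backslash followed by s

        System.out.println(matchesWhole("\n", "\n"));   // true: regex is a raw linefeed character
        System.out.println(matchesWhole("\n", "\\n"));  // true: regex engine decodes \n itself
        System.out.println(matchesWhole("\t", "\\s"));  // true: \s is the regex whitespace class
    }
}
```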
\ is a special character in many languages (in Java it is special in String and char literals) and in tools like regex.
In a String or char literal it is used to create other special characters which you normally couldn't write. By using \x, where x represents that special character, you are able to create
\t tab
\b backspace
\n newline
\r carriage return
\f formfeed
or to escape other special characters
\' single quote (' is special in char because it marks where the char starts and ends, so to actually write the ' character you need to escape it). In '\'' the first ' starts the char literal, the \' in the middle is the ' literal itself, and the last ' ends the char literal.
\" double quote - similarly to \' in char, in a String " marks where it starts and ends, so to put a " literal into a string (to actually be able to write it) you need to escape it. In "\"" the first " starts the String, the \" in the middle is the " literal itself, and the last " ends the String.
\\ backslash - since \ is the special character used to create other special characters, there has to be a way to un-special it so we can actually print \ as a simple literal.
Problem: how do you write a string representing day\night? If you write such a string as "day\night", it will be interpreted as day[newline]ight.
So in many languages, to represent a \ literal, another \ is added before it to escape it. So a String which represents day\night needs to be written as "day\\night" (now the \ in \n is escaped, so it no longer represents \n, a newline, but the concatenation of the \ and n characters).
In the case of regex, to represent the character class which accepts any whitespace you need to actually pass \s.
But a string which represents \s needs to be written as "\\s" because, as mentioned earlier, \ is special in a String and needs escaping.
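If you wrote \s as "\s", the compiler would either reject it as an illegal escape sequence (before Java 15) or turn it into a single space (Java 15 and later), so the regex engine would never see \s.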
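The day\night problem can be checked directly. A small sketch (the class and method names are only illustrative):

```java
class BackslashLiteralDemo {
    // "\n" in the middle is decoded by the compiler into one linefeed character.
    static String newlineVersion() {
        return "day\night";
    }

    // "\\" is decoded into one backslash, so the n after it stays an ordinary n.
    static String backslashVersion() {
        return "day\\night";
    }

    public static void main(String[] args) {
        System.out.println(newlineVersion());            // prints "day" and "ight" on two lines
        System.out.println(backslashVersion());          // day\night
        System.out.println(newlineVersion().length());   // 8
        System.out.println(backslashVersion().length()); // 9

        char singleQuote = '\'';   // escaped ' inside a char literal
        String doubleQuote = "\""; // escaped " inside a String literal
        System.out.println(singleQuote + " " + doubleQuote); // ' "
    }
}
```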

How to replace one or more \ in string with just \?

Consider the string,
this\is\\a\new\\string
The output should be:
this\is\a\new\string
So basically one or more \ character should be replaced with just one \.
I tried the following:
str = str.replace("[\\]+","\")
but it was no use. The reason I used two \ in [\\]+ was because internally \ is stored as \\. I know this might be a basic regex question, but I am able to replace one or more ordinary letters, just not the \ character. Any help is really appreciated.
str.replace("[\\]+", "\") has a few problems:
replace doesn't use regex (replaceAll does), so "[\\]" will represent the literal text [\], not \ nor \\ (depending on what you think it would represent)
even if it did accept regex, "[\\]" would not be a correct regex, because \\] reaches the regex engine as \], which escapes ], so you would end up with an unclosed character class [...
it will not compile (your replacement String is not closed)
It will not compile because \ is the start of an escape sequence \X, where X is a character that needs to be either
changed from being a String special character to a simple literal, like in your case, where \" escapes " to be a literal (so you could print it, for instance) instead of being the start/end of the String,
changed from being a normal character to a special one, like in the case of the line separators \n and \r or the tab \t.
Now we know that \ is special and is used to escape other characters. So what do you think we need to do to make \ represent a literal (when we want to print \)? If you guessed that it needs to be escaped with another \, then you are right. To create a \ literal we need to write it in a String as "\\".
Since you now know how to create a String containing a \ literal (an escaped \), you can start thinking about how to create your replacements.
Regex which represents one or more \ can look like
\\+
But that is its native form, and we need to create it using a String. I used \\ here because in regex \ is also a special character (for instance \d represents digits, not a \ literal followed by d), so it also needs to be escaped first to represent a \ literal. Just like in a String, we can escape it with another \.
So String representing this regex will need to be written as
"\\\\+" (we escaped \ twice, once in regex \\+ and once in string)
You can use it as the first argument of replaceAll (because replace, as mentioned earlier, doesn't accept regex).
Now the last problem you will face is the second argument of the replaceAll method. If you write
replaceAll("\\\\+", "\\")
and it finds a match for the regex, you will see the exception
java.lang.IllegalArgumentException: character to be escaped is missing
It is because in the replacement part (the second argument of replaceAll) we can also use the special form $x, which stands for the current match of the group with index x. So to be able to turn $ into a literal we need some escape mechanism, and again \ is used for that purpose. So \ is also special in the replacement part of our method.
So again, to create a \ literal we need to escape it with another \, and the string literal representing the expression \\ is "\\\\".
But let's get back to the earlier exception: the message "character to be escaped is missing" refers to the X part of the \X form (X is the character we want to be escaped). The problem is that your earlier replacement "\\" contained only the \ part, so the method expected either $ after it, to form \$, or another \, to form \\ (a literal \). So valid replacements would be "\\$" or "\\\\".
To make things work you need to write your replacing method as
str = str.replaceAll("\\\\+", "\\\\")
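Putting the pieces together, here is a sketch using the string from the question (the class and method names are invented). The second helper shows the too-short replacement failing with IllegalArgumentException on a current JDK:

```java
class CollapseBackslashesDemo {
    // Regex \\+ (one or more backslashes), replacement \\ (one literal backslash).
    static String collapse(String s) {
        return s.replaceAll("\\\\+", "\\\\");
    }

    // Returns true if a replacement consisting of a lone backslash is rejected.
    static boolean loneBackslashReplacementFails(String s) {
        try {
            s.replaceAll("\\\\+", "\\");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String input = "this\\is\\\\a\\new\\\\string";
        System.out.println(input);                                // this\is\\a\new\\string
        System.out.println(collapse(input));                      // this\is\a\new\string
        System.out.println(loneBackslashReplacementFails(input)); // true
    }
}
```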
You can use:
str = str.replace("\\\\", "\\");
Remember that String#replace doesn't take a regex, so this replaces each pair of backslashes with a single one; it does not collapse longer runs.
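A short sketch of what the literal version does with runs of different lengths (the names are made up for the example). Doubled backslashes become one, but a run of three still leaves two behind:

```java
class ReplacePairsDemo {
    // Literal replacement: every non-overlapping pair of backslashes becomes one backslash.
    static String replacePairs(String s) {
        return s.replace("\\\\", "\\");
    }

    public static void main(String[] args) {
        System.out.println(replacePairs("a\\\\b"));    // a\b   (two backslashes in, one out)
        System.out.println(replacePairs("a\\\\\\b"));  // a\\b  (three backslashes in, two out)
    }
}
```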
try this
str = str.replaceAll("\\\\+", "\\\\");
When writing regular expressions, you typically need to double-escape backslashes. So you would do this:
str = str.replaceAll("\\\\+", "\\\\");
I'd use Matcher.quoteReplacement() and String.replaceAll() here.
Like this:
String s;
[...]
s = s.replaceAll("\\\\+", Matcher.quoteReplacement("\\"));
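For anyone wondering what Matcher.quoteReplacement() actually hands to replaceAll, here is a small sketch (the class name and sample strings are only illustrative). It escapes \ and $ so the replacement is taken literally:

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // Same collapse as above, but the replacement is escaped by the library.
    static String collapse(String s) {
        return s.replaceAll("\\\\+", Matcher.quoteReplacement("\\"));
    }

    public static void main(String[] args) {
        System.out.println(Matcher.quoteReplacement("\\"));  // \\
        System.out.println(Matcher.quoteReplacement("$1"));  // \$1
        System.out.println(collapse("this\\is\\\\a"));       // this\is\a
    }
}
```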

Removing backslash and newline character (occurring together) in Java

I have stream of data coming from different feeds which I need to clean up.
The data is in a specific format, and if a sentence spans multiple lines, the lines are separated using "\" (backslash), which I want to remove. \ is also present in other parts of the text for escaping quotes etc., and I don't want to remove those backslashes. So eventually I want to remove "\\n".
I have tried the following regex for removing \ and \n, but it didn't work:
singleLine.replaceAll("(\\\\n|\\\\r)", "");
I am not sure what regex would work in this case.
Regex isn't really necessary for this; If I were you, I would use...
singleLine=singleLine.replace("\\\\n", "");
Many people think the replace method only replaces one occurrence, but in fact the only difference is that replaceAll uses regex, while replace simply replaces exact matches of the String.
If you do want to use regex though, I believe you have to do \\\\ (you have to 'nullify' the escape character in Java, and in regex, so x4, not just x2).
The only other issue is that in your example you never set singleLine equal to anything; I'm not sure if you hid that, or missed that.
Edit:
Explaining the reasoning for \\\\ some more: Java requires that you write "\\" to represent one \. Regex also has a use for the \ character, and requires you to do the same again for it. If you just write "\\" in Java, the regex parser essentially receives "\", its escape character for certain things. You need to give the regex parser two of them, to escape it, so in Java you need to write "\\\\" just to represent a match for a single "\".
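If the data contains a backslash immediately followed by a real line break character (an assumption based on the question title, not something the question confirms), a regex-free sketch could look like this. The helper name and the sample text are invented:

```java
class JoinContinuedLinesDemo {
    // Hypothetical helper: removes a backslash that is immediately followed by a line break,
    // together with that line break. "\\\n" is two characters: a backslash and a linefeed.
    static String joinContinuedLines(String text) {
        return text.replace("\\\r\n", "").replace("\\\n", "");
    }

    public static void main(String[] args) {
        // A continued line plus an escaped quote that must survive.
        String text = "first part \\\nsecond part with a \\\"quoted\\\" word";
        System.out.println(joinContinuedLines(text));
        // first part second part with a \"quoted\" word
    }
}
```

Backslashes that are not followed by a line break, such as the ones escaping quotes, are left untouched.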
You'll need 5 backslash characters for each pattern in that regexp.
Use:
singleLine.replaceAll("(\\\\\n|\\\\\r)", "");
The backslash character is both an escape character in your string and an escape character in the regexp. So to represent a literal \ in a regexp you'll need to use 4 \ characters (your regexp needs \\ to get an escaped backslash, and each of those needs to be escaped in the Java String), and then another to represent either \n or \r.
String str = "string with \\\n newline and \\\n newline ...";
String repl = str.replaceAll("(\\\\\n|\\\\\r)", "");
System.out.println("str: " + str);
System.out.println("repl: " + repl);
Output:
str: string with \
 newline and \
 newline ...
repl: string with  newline and  newline ...
You need to assign the return value to another String object, or the same object, because of String immutability.
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");
Remember that Strings are immutable. This means that replaceAll() does not change the String in singleLine. You must use the return value to get the modified String. For example, you can do
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");
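A minimal sketch of the immutability point, using the five-backslash pattern from the earlier answer so that it matches a backslash followed by a real line break (the class, constant, and method names are made up here):

```java
class ImmutabilityDemo {
    // Regex: a backslash followed by a linefeed, or a backslash followed by a carriage return.
    static final String PATTERN = "(\\\\\n|\\\\\r)";

    static String clean(String s) {
        return s.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        String singleLine = "one \\\ntwo";
        singleLine.replaceAll(PATTERN, "");             // result is thrown away
        System.out.println(singleLine.contains("\n"));  // true: the original is unchanged
        singleLine = clean(singleLine);                 // keep the returned String
        System.out.println(singleLine);                 // one two
    }
}
```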
