I have a stream of data coming from different feeds which I need to clean up.
The data is in a specific format: if a sentence spans multiple lines, each line break is preceded by a "\" (backslash), which I want to remove. \ also appears in other parts of the text, for escaping quotes etc., and I don't want to remove those backslashes. So ultimately I want to remove "\\n".
I have tried the following regex for removing \ and \n, but it didn't work:
singleLine.replaceAll("(\\\\n|\\\\r)", "");
I am not sure what regex would work in this case.
Regex isn't really necessary for this; if I were you, I would use...
singleLine=singleLine.replace("\\\\n", "");
Many people think the replace method only replaces the first occurrence, but in fact both methods replace every occurrence. The only difference is that replaceAll treats its first argument as a regex, while replace simply replaces exact matches of the String.
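To make the literal-replace approach concrete, here is a minimal sketch. It assumes the feed contains a real backslash character followed by a real line break (CRLF, LF or CR); the class and method names are made up for illustration.

```java
final class ContinuationCleaner {
    // Removes a backslash that is immediately followed by a line break.
    // Other backslashes (for example the ones escaping quotes) are left alone.
    static String joinContinuedLines(String input) {
        return input
                .replace("\\\r\n", "")  // backslash + CRLF
                .replace("\\\n", "")    // backslash + LF
                .replace("\\\r", "");   // backslash + CR
    }

    public static void main(String[] args) {
        String raw = "first part \\\nsecond part with \\\"quotes\\\"";
        // Prints: first part second part with \"quotes\"
        System.out.println(joinContinuedLines(raw));
    }
}
```

Because replace works on literal text, each backslash only has to be escaped once, for the Java string literal.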
If you do want to use regex, though, I believe you have to write \\\\\\\\ to match the two backslashes (each backslash has to be escaped once for Java and once for regex, so four per backslash, not just two).
The only other issue in your example is that you never assign the result back to singleLine; I'm not sure if you left that out on purpose or missed it.
Edit:
To explain the reasoning for \\\\\\\\ some more: Java requires "\\" to represent one \. Regex also uses \ as its escape character and requires the same doubling. If you write just "\\" in Java, the regex parser receives a single \, which it treats as its escape character. You need to give the regex parser two of them, so in Java you need to write "\\\\" just to match a single \.
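The two layers of escaping can be checked directly. This is only a small sketch; the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

final class BackslashLayers {
    // The Java literal "\\\\" holds two characters, the regex \\ ,
    // and that regex matches exactly one literal backslash.
    static final String REGEX_FOR_ONE_BACKSLASH = "\\\\";

    static boolean matchesSingleBackslash(String candidate) {
        return Pattern.matches(REGEX_FOR_ONE_BACKSLASH, candidate);
    }

    public static void main(String[] args) {
        System.out.println("\\".length());                    // 1
        System.out.println(REGEX_FOR_ONE_BACKSLASH.length()); // 2
        System.out.println(matchesSingleBackslash("\\"));     // true
        System.out.println(matchesSingleBackslash("\\\\"));   // false (two backslashes)
    }
}
```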
You'll need 5 backslash characters for each pattern in that regexp.
Use:
singleLine = singleLine.replaceAll("(\\\\\n|\\\\\r)", "");
The backslash is an escape character both in your string literal and in the regexp. So to represent a literal \ in a regexp you need 4 \ characters: the regexp needs \\ to get an escaped backslash, and each of those needs to be escaped in the Java String. The fifth backslash belongs to the \n or \r escape.
String str = "string with \\\n newline and \\\n newline ...";
String repl = str.replaceAll("(\\\\\n|\\\\\r)", "");
System.out.println("str: " + str);
System.out.println("repl: " + repl);
Output:
str: string with \
 newline and \
 newline ...
repl: string with  newline and  newline ...
You need to assign the return value to another String object, or the same object, because of String immutability.
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");
More info is here
Remember that Strings are immutable. This means that replaceAll() does not change the String in singleLine. You must use the return value to get the modified String. For example, you can do
singleLine = singleLine.replaceAll("(\\\\n|\\\\r)", "");
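A short sketch of the difference between discarding and assigning the result. It reuses the pattern from the question, which matches a backslash followed by the letter n or r; the class and method names are invented for this example.

```java
final class ImmutableReplaceDemo {
    // Same pattern as in the question: a literal backslash followed by n or r.
    static String stripEscapedBreakMarkers(String line) {
        return line.replaceAll("(\\\\n|\\\\r)", "");
    }

    public static void main(String[] args) {
        String singleLine = "one\\ntwo";   // backslash + 'n', not a newline

        singleLine.replaceAll("(\\\\n|\\\\r)", ""); // result is thrown away
        System.out.println(singleLine);             // one\ntwo (unchanged)

        singleLine = stripEscapedBreakMarkers(singleLine);
        System.out.println(singleLine);             // onetwo
    }
}
```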
How to write a regular expression to match this \" (a backslash then a quote)? Assume I have a string like this:
click to search
I need to replace all the \" with a ", so the result would look like:
click to search
This one does not work: str.replaceAll("\\\"", "\"") because it only matches the quote. Not sure how to get around with the backslash. I could have removed the backslash first, but there are other backslashes in my string.
If you don't need any regex mechanisms, such as predefined character classes (\d), quantifiers, etc., then instead of replaceAll, which expects a regex, use replace, which expects literals:
str = str.replace("\\\"","\"");
Both methods will replace all occurrences of targets, but replace will treat targets literally.
BUT if you really must use regex you are looking for
str = str.replaceAll("\\\\\"", "\"");
\ is a special character in regex (used, for instance, to create \d, the character class representing digits). To make regex treat \ as a normal character you need to place another \ before it to turn off its special meaning (you need to escape it). So the regex we are trying to create for the backslash is \\ (two characters).
But to create a string literal representing the text \\ so you can pass it to the regex engine, you need to write it as four \ ("\\\\"), because \ is also a special character in String literals (the parts of code written using "..."), where it is used, for instance, in \t to represent a tab.
That is why you also need to escape \ there.
In short, you need to escape \ twice:
in regex: \\
and then in the String literal: "\\\\"
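If counting backslashes by hand feels error-prone, one alternative (not mentioned above, so treat it as a suggestion) is to let the standard library do the regex-level escaping with Pattern.quote and Matcher.quoteReplacement. Only the Java string-literal escaping then remains. The class and method names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedRegexReplace {
    static String unescapeQuotes(String s) {
        String target = "\\\"";   // the two characters: backslash, double quote
        // Pattern.quote makes the target literal for the regex engine;
        // Matcher.quoteReplacement does the same for the replacement string.
        return s.replaceAll(Pattern.quote(target), Matcher.quoteReplacement("\""));
    }

    public static void main(String[] args) {
        String input = "say \\\"hi\\\" in C:\\temp";
        // Prints: say "hi" in C:\temp
        System.out.println(unescapeQuotes(input));
    }
}
```

The backslash in C:\temp survives because only the backslash-quote pair is matched.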
You don't need a regular expression.
str.replace("\\\"", "\"")
should work just fine.
The replace method takes two substrings and replaces all non-overlapping occurrences of the first with the second. Per the javadoc:
public String replace(CharSequence target,
CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence. The replacement proceeds from the beginning of the string to the end, for example, replacing "aa" with "b" in the string "aaa" will result in "ba" rather than "ab".
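The javadoc example is easy to verify, along with the fact that regex metacharacters have no special meaning for replace. A tiny sketch with a hypothetical wrapper method:

```java
final class LiteralReplaceDemo {
    // Thin wrapper around String.replace(CharSequence, CharSequence).
    static String replaceLiteral(String s, String target, String replacement) {
        return s.replace(target, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("aaa", "aa", "b"));   // ba
        System.out.println(replaceLiteral("a.b.c", ".", "-"));  // a-b-c ('.' is literal here)
    }
}
```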
Try this: str.replaceAll("\\\\\"", "\\\"")
because the backslashes are unescaped twice:
(1) \\\\\" --> \\" (by the Java string literal)
(2) \\" --> \" (by the regex engine)
I am working on a project involving lexical analysis, and I have to generate two kinds of tokens: text and non-text.
A text token consists of all characters up to the "{$" sequence.
A non-text token consists of all characters between the "{$" and "$}" sequences.
Note that the "{$" sequence can be escaped by writing "\{$", in which case it becomes part of the text.
My job is to read a String of text, and for that I am using regular expressions.
I am using the Java Scanner and Pattern classes and this is my work so far:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Scanner sc = new Scanner(text);
Pattern textPattern = Pattern.compile("{\\$"); // insert working regex here
sc.useDelimiter(textPattern);
System.out.println(sc.next());
This is what should be printed out:
This is \{$ just text$}
This is
How do I make a regex for the following logical statement:
match "{$" AND NOT match "\{$"
You can use a negative look-behind, (?<!\\), in front of \{\$ to ensure that an escaped {$ is not matched:
(?<!\\)\{\$
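Plugged into the Scanner code from the question, the look-behind could be used roughly like this. The regex is written as a Java string literal, so every regex backslash is doubled; the class and method names are made up for the sketch.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class UnescapedDelimiterDemo {
    // Regex (?<!\\)\{\$ : "{$" that is not immediately preceded by a backslash.
    static final Pattern OPEN_TAG = Pattern.compile("(?<!\\\\)\\{\\$");

    // Returns everything before the first unescaped "{$".
    static String firstTextToken(String text) {
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(OPEN_TAG);
            return sc.next();
        }
    }

    public static void main(String[] args) {
        String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
        System.out.println(firstTextToken(text));
    }
}
```

For the sample input this yields the two lines the question asks for: "This is \{$ just text$}" followed by "This is ".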
Possible solution:
String text = "This is \\{$ just text$}\nThis is {$not_text$}.";
Pattern textPattern = Pattern.compile(
        "(?<text>(?:\\\\.|(?!\\{\\$).)+)"    // text - `\x` or non-start-of `{$`
      + "|"                                  // OR
      + "(?<nonText>\\{\\$.*?\\$\\})");      // non-text
Matcher m = textPattern.matcher(text);
while (m.find()) {
    if (m.group("text") != null) {
        System.out.println("text : " + m.group("text"));
    } else {
        System.out.println("non-text : " + m.group("nonText"));
    }
}
System.out.println("\01234");
Explanation:
From what I see, you want \ to be a special character used for escaping.
The problem now is to determine when \ is meant to escape the character/sequence after it, and when it should be treated as a simple printable character (a literal).
(possible problem)
Let's say that you have the text dir1\dir2\ and you want to add the non-text foo after it. How would you write it?
You could try writing dir1\dir2\{$foo$}, but this could mean that you just escaped {$, which would prevent foo from being seen as non-text.
In Java, String literals faced the same problem, since \ can be used to create other special characters using
pairs \n \r \t \"
Unicode codepoints \uFFFF
octal format \012.
The solution used in Java (and many other languages) was to make \ always a special character, so that creating a \ literal requires escaping it with another \ (there was no real need to add yet another special character for that). So to represent \ we need to write it as \\.
So if we have the text dir1\dir2\ we would need to write it as dir1\\dir2\\. This allows us to append {$non-text$} to it without fear that the last \\ placed right before {$ will cause it to be misinterpreted and prevent it from being seen as a non-text sequence.
So now when we see dir1\\dir2\\{$foo$} we can interpret {$ properly.
From this point on I am assuming you are also using this approach, which ensures proper interpretation of \.
Now, let's try to create a rule which will let us find/separate text and non-text characters.
Based on our example we know that dir1\\dir2\\{$foo$} is: text dir1\\dir2\\ and non-text {$foo$}.
So as you see, splitting on {$ which is not preceded by \ can sometimes fail (namely when {$ is preceded by an even, non-zero number of \).
A probably simpler solution is to accept
for text:
\\. - regex representing any character preceded by \ (this handles the \\ literal and the escaped \{, which also lets us accept the rest of the $..$} part as text)
(?!\{\$). - regex representing any character which isn't a { that would start a {$ area.
for non-text:
\{\$.*?\$\} - regex representing {$...$} - we know that it will be unescaped because all escaped characters will already be accepted by \\..
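To see how these rules behave on the dir1\\dir2\\{$foo$} example, here is a small sketch that collects the tokens into a list. It uses the same pattern as the solution above; the class name, method name and the "kind:value" output format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeAwareTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
            "(?<text>(?:\\\\.|(?!\\{\\$).)+)"      // escaped pair, or any char not starting "{$"
          + "|(?<nonText>\\{\\$.*?\\$\\})");       // unescaped {$ ... $}

    // Returns tokens formatted as "text:..." or "non-text:...".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String kind = m.group("text") != null ? "text" : "non-text";
            tokens.add(kind + ":" + m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Escaped backslashes directly before an unescaped "{$":
        System.out.println(tokenize("dir1\\\\dir2\\\\{$foo$}"));
        // An escaped "{$" stays part of the text:
        System.out.println(tokenize("a\\{$b"));
    }
}
```

In the first call the input is dir1\\dir2\\{$foo$}, which should split into the text dir1\\dir2\\ and the non-text {$foo$}. A plain look-behind for "not preceded by \" would miss that {$ because a backslash sits right in front of it.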
I am converting Unicode characters stored in a String into Unicode text.
For example, here is a String -
String unicode = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064";
Now from this string, I want to get each Unicode character separately -
u0041 u006e u0064 u0072 u006f u0069 u0064
So for that, I use the following code -
String[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".split("\");
But since the " after \ is treated as escaped in split("\"), I am getting an error.
How can I stop the \ from escaping the character after it?
The \ character is an escape character. You are getting a syntax error because \" is the escape sequence for placing a " character in a String literal. To place a \ inside a String literal, you need to use \\ (the first \ escapes the special meaning of the second \). So a syntactically correct statement would be:
String[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".split("\\");
But that is not going to give you what you want. First, the split() method expects a regular expression, and a lone \ is not a valid regular expression. Second, the string being split does not contain any \ characters; it contains seven characters with code points U+0041, etc. Perhaps you want:
String[] parts = "\\u0041\\u006e\\u0064\\u0072\\u006f\\u0069\\u0064".split("\\\\");
or perhaps you want
char[] parts = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064".toCharArray();
and you can then convert each element of parts to a Unicode code point string.
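One way to do that last conversion is String.format with a hex format specifier. This sketch works per char (UTF-16 code unit), which is enough for the sample string; characters outside the Basic Multilingual Plane would need String.codePoints() instead. The class and method names are hypothetical.

```java
final class CodePointLabels {
    // Formats each char as 'u' followed by four lowercase hex digits.
    static String[] toLabels(String s) {
        char[] parts = s.toCharArray();
        String[] labels = new String[parts.length];
        for (int i = 0; i < parts.length; i++) {
            labels[i] = String.format("u%04x", (int) parts[i]);
        }
        return labels;
    }

    public static void main(String[] args) {
        String unicode = "\u0041\u006e\u0064\u0072\u006f\u0069\u0064"; // "Android"
        // Prints: u0041 u006e u0064 u0072 u006f u0069 u0064
        System.out.println(String.join(" ", toLabels(unicode)));
    }
}
```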
You need to escape the backslash. You also need to escape the backslash again because split() treats the string as a regular expression. Use .split("\\\\");
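A minimal sketch of that split, assuming the input really contains backslash characters (that is, it was written with doubled backslashes in the source). Since the string starts with the delimiter, split() returns an empty first element, which is filtered out here. The names are made up for the example.

```java
import java.util.Arrays;

final class SplitOnBackslash {
    static String[] splitEscapes(String escaped) {
        // "\\\\" is the regex \\ , which matches one literal backslash.
        return Arrays.stream(escaped.split("\\\\"))
                .filter(part -> !part.isEmpty())   // drop the leading empty element
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String escaped = "\\u0041\\u006e\\u0064";
        // Prints: [u0041, u006e, u0064]
        System.out.println(Arrays.toString(splitEscapes(escaped)));
    }
}
```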