Java regular expression to remove all non alphanumeric characters EXCEPT spaces - java

I'm trying to write a regular expression in Java which removes all non-alphanumeric characters from a paragraph, except the spaces between the words.
This is the code I've written:
paragraphInformation = paragraphInformation.replaceAll("[^a-zA-Z0-9\s]", "");
However, the compiler gave me an error message pointing to the s saying it's an illegal escape character. The program compiled OK before I added the \s to the end of the regular expression, but the problem with that was that the spaces between words in the paragraph were stripped out.
How can I fix this error?

You need to double-escape the \ character: "[^a-zA-Z0-9\\s]"
Java interprets \s as a Java string escape sequence, which is indeed invalid in your Java version (it only became a valid escape, meaning a plain space, in Java 15). By writing \\, you escape the \ character, so a single \ is sent to the regex. This \ then becomes part of the regex escape sequence \s.
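As a minimal sketch of the fix (the class and method names are illustrative, not from the question), the following keeps ASCII letters, digits, and whitespace and drops everything else:

```java
class StripNonAlphanumeric {
    // Hypothetical helper: removes every character that is not an ASCII
    // letter, a digit, or whitespace. "\\s" in the Java source becomes the
    // two characters \s in the string that the regex engine receives.
    static String strip(String text) {
        return text.replaceAll("[^a-zA-Z0-9\\s]", "");
    }

    public static void main(String[] args) {
        // prints: Hello world Its 2 oclock
        System.out.println(strip("Hello, world! It's 2 o'clock."));
    }
}
```

Because \s covers tabs and line breaks as well as spaces, those survive too. If only plain spaces should be kept, a literal space inside the brackets would do instead.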

You need to escape the \ so that the regular expression recognizes \s:
paragraphInformation = paragraphInformation.replaceAll("[^a-zA-Z0-9\\s]", "");

Generally whenever you see that error, it means you only have a single backslash where you need two:
paragraphInformation = paragraphInformation.replaceAll("[^a-zA-Z0-9\\s]", "");

Victoria, you must write \\s not \s here.

Please take a look at this site, where you can test Java regexes online and get well-formatted regex string patterns back:
http://www.regexplanet.com/advanced/java/index.html

Related

difference between '\' and '\\' when used as escape characters

I know that we use escape characters like \n for next line and \t for tab.
But today, while working on a few strings, I came across \\$.
I had to print "nike$" so to print it I had to modify the string as "nike\\$".
I want to know what is the exact difference between \ and \\.
Inside a string literal, \ is an escape: The next character that follows tells us what it will do, as in your \n example for newline.
This means you can't put \ in a string on its own, since it's half of an escape sequence. Instead, to have a \ actually in a string, you use \\.
I had to print "nike$" so to print it I had to modify the string as "nike\\$"
"nike\\$" will result in a string that outputs (for instance, via System.out.println) as nike\$, not nike$.
Your use of \\$ suggests to me that you were feeding a regular expression pattern into something, e.g.:
p = Pattern.compile("nike\\$");
In that situation, we have two levels of escaping going on: The string literal, and the regular expression. To have a literal $ in a regular expression, it has to be escaped by \ because otherwise it's an end-of-input assertion. To get that \$ actually to the regular expression parser when using a string literal, we have to escape the backslash in the literal so we actually have a backslash in the string for the regular expression engine to see, thus \\$.
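The two levels can be seen side by side in a short sketch. The class and helper names below are made up for illustration:

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Level 1, the string literal: "\\" in source is one backslash,
        // so this string holds the six characters n i k e \ $.
        String literal = "nike\\$";
        System.out.println(literal);          // prints nike\$
        System.out.println(literal.length()); // prints 6

        // Level 2, the regex: the engine sees \$ and matches a literal $.
        System.out.println(matchesWhole(literal, "nike$")); // true

        // Unescaped, $ is an end-of-input assertion, so the pattern nike$
        // matches the text "nike" but not the text "nike$".
        System.out.println(matchesWhole("nike$", "nike$")); // false
        System.out.println(matchesWhole("nike$", "nike"));  // true
    }
}
```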

RegEx special char "|" escaping in Java

I am trying to split a string like: abc|aa||
When I use the regular string.split I am required to provide a regular expression.
I tried to do the following :
string.split("|")
string.split("\|")
string.split("/|")
string.split("\Q|\E")
None of them work.
Does anyone know how to make it work?
I don't know how you tried, but
// requires: import java.util.Arrays; import java.util.regex.Pattern;
public static void main(String[] args) {
    String a = "abc|aa||";
    String split = Pattern.quote("|"); // wraps the pipe as \Q|\E
    System.out.println(split);
    System.out.println(Arrays.toString(a.split(split)));
}
prints out
\Q|\E
[abc, aa]
effectively splitting on |. The \Q ... \E is a regex quote. Anything inside it will be matched as a literal pattern.
string.split("\|"); // won't compile: \| is not a valid Java escape sequence
string.split("/|"); // compiles, but splits on / or the empty string, i.e. between every character
string.split("|"); // compiles, but splits on the empty string, i.e. between every character
// a true alternative to the quoted solution above
string.split("\\|"); // \\ puts a single \ in the string, so the regex sees \| (a literal pipe)
A double backslash is required because the backslash is also a special character, so you need to escape the escape character, i.e. \\|
| is a special character, hence you need to escape it using backslashes. Try using
string.split("\\|")
| is a special character for the regular expression, thus it must be escaped e.g. \|
The backslash \ is a special character in Java, thus it must also be escaped
As a result, you must do the following to achieve the desired effect:
string.split("\\|")
All of the following patterns split it correctly: "\\Q|\\E", "\\|", and "[|]". Of course, the latter two are preferable.
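The working variants from the answers above can be compared in one place. This is only a sketch: the class and method names are illustrative, and the last helper uses the two-argument String.split with a negative limit, which the answers do not mention, to show how to keep the trailing empty fields of "abc|aa||".

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnPipe {
    // Backslash-escaped pipe: the regex engine receives \|
    static String[] splitEscaped(String s) { return s.split("\\|"); }

    // Pattern.quote wraps the pipe as \Q|\E, a quoted literal
    static String[] splitQuoted(String s) { return s.split(Pattern.quote("|")); }

    // Inside a character class the pipe has no special meaning
    static String[] splitClass(String s) { return s.split("[|]"); }

    // A negative limit keeps trailing empty strings
    static String[] splitKeepTrailing(String s) { return s.split("\\|", -1); }

    public static void main(String[] args) {
        String input = "abc|aa||";
        System.out.println(Arrays.toString(splitEscaped(input)));      // [abc, aa]
        System.out.println(Arrays.toString(splitQuoted(input)));       // [abc, aa]
        System.out.println(Arrays.toString(splitClass(input)));        // [abc, aa]
        System.out.println(Arrays.toString(splitKeepTrailing(input))); // [abc, aa, , ]
    }
}
```

The one-argument split silently drops trailing empty strings, which is why the first three calls return only two elements.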

matching { regular expression java

I am new to regular expressions (and to java), so this is probably a simple question.
I am trying to match the character { at the end of a line. My attempts are simply this:
row.matches("{$")
row.matches("\{$")
But both just give
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal repetition
What am I doing wrong?
row.matches("^.*\\{$");
You simply need to escape the {, since it's a metacharacter. Because Java reserves a single backslash for special contexts (\n, \r, etc.), two backslashes are required to generate one backslash for the Pattern. Therefore,
\\{
will properly evaluate to
\{
Not only this, but the matches method checks whether the entire string matches the pattern, not just a substring. Hence the ^.* part.
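The difference between a whole-string match and a search can be sketched as follows. The class and method names are illustrative, and Matcher.find is used here as one way to test only the end of the row:

```java
import java.util.regex.Pattern;

class BraceAtEnd {
    private static final Pattern TRAILING_BRACE = Pattern.compile("\\{$");

    // matches() must consume the whole row, so this is true only for "{"
    static boolean wholeMatch(String row) {
        return row.matches("\\{$");
    }

    // The leading .* lets matches() accept any text before the final {
    static boolean wholeMatchWithPrefix(String row) {
        return row.matches("^.*\\{$");
    }

    // find() looks for the pattern anywhere, and $ pins it to the end
    static boolean endsWithBrace(String row) {
        return TRAILING_BRACE.matcher(row).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("hello world{"));           // false
        System.out.println(wholeMatch("{"));                      // true
        System.out.println(wholeMatchWithPrefix("hello world{")); // true
        System.out.println(endsWithBrace("whatever{"));           // true
        System.out.println(endsWithBrace("hello{dontmatch"));     // false
    }
}
```

If no other regex features are needed, row.endsWith("{") expresses the same check for a single line without any escaping.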
You must escape the { character as it is a special char for regex
row.matches("\\{$")
Did escaping the curly brace work, as in \\{$?
Tried it against
hello world{
whatever{
hello{dontmatch
}
}
It matched world{ and whatever{ but not hello{dontmatch
You need to escape the { with a \, but to prevent \{ from being read as a Java string escape (like \n for line feed) you also need to escape the \ with an additional \, resulting in:
row.matches("\\{$");

java regex pattern unclosed character class

I need some help. I'm getting:
Caused by: java.util.regex.PatternSyntaxException: Unclosed character class near index 24
^[a-zA-Z└- 0-9£µ /.'-\]*$
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.clazz(Pattern.java:2254)
at java.util.regex.Pattern.sequence(Pattern.java:1818)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
Here is my code:
String testString = value.toString();
Pattern pattern = Pattern.compile("^[a-zA-Z\300-\3770-9\u0153\346 \u002F.'-\\]*$");
Matcher m = pattern.matcher(testString);
I have to use the unicode value for some because I'm working with xhtml.
Any help would be great!
Assuming that you want to match \ and - and not ]:
Pattern pattern = Pattern.compile("^[a-zA-Z\300-\3770-9\u0153\346 \u002F.'\\\\-]*$");
You need to double escape your backslashes, as \ is also an escape character in regex. Thus \\] escapes the backslash for java but not for regex. You need to add another java-escaped \ in order to regex-escape your second java-escaped \.
So \\\\ after java escaping becomes \\ which is then regex escaped to \.
Moving - to the end of the sequence means that it is used as a character, instead of a range operator as pointed out by Pshemo.
It is hard to say what you are trying to achieve, but I can see a few strange things in your regex:
You opened a character class but never closed it. Instead you used \\], which makes ] an ordinary character.
If you want to include ] in your character class, then you need an additional ] at the end, like "^[a-zA-Z\300-\3770-9\u0153\346 \u002F.'-\\]]*$"
If you want to include \ in your character class, then you need to write \\\\, because you have to escape its special meaning twice: once for the regex engine and once for the Java string.
You placed - between ' and \\] ('-\\]). Inside a character class, - specifies a range of characters, like a-z or A-Z. To escape its special meaning you need to use \\-
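The points above can be checked with a small sketch. To keep it short, the character class below is a simplified stand-in for the one in the question (the accented ranges are left out), and the class and method names are illustrative:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackslashInClass {
    // Four backslashes in source -> two in the string -> one literal
    // backslash for the regex. The hyphen comes last in the class, so it
    // is a literal hyphen rather than a range operator.
    static final Pattern ALLOWED = Pattern.compile("^[a-zA-Z0-9 /.'\\\\-]*$");

    static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    // Hypothetical helper: reports whether a regex compiles at all.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("a\\b-c")); // true: backslash and hyphen allowed
        System.out.println(isAllowed("a]b"));    // false: ] is not in the class
        // Two backslashes escape the ], so the class is never closed
        System.out.println(compiles("^[a-z'-\\]*$")); // false
    }
}
```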

problem understanding a string pattern

I'm learning GWT by following this tutorial but there's something I don't quite fully understand in step 4. The following line's checking that a string matches a pattern:
if (!str.matches("^[0-9A-Z\\.]{1,10}$")) {...}
After checking the documentation for the Pattern class I understand that the characters ^ and $ represent the beginning and the end of the line, and that [...]{1,10} means that the part in brackets [...] has to be present at least once but not more than 10 times. What I don't understand is the final characters of the part in brackets. 0-9A-Z means a range of characters from 0 to 9 or from A to Z. But what does \\. mean?
It matches a dot character. Since dot has a special meaning in regexp, it must be escaped with a backslash. And because backslash has a special meaning in Java strings, it must be escaped with another backslash.
It matches a literal dot (.). The dot is escaped because it is a special character in regexp syntax, and the escape is doubled because \ is itself a special character in Java strings.
The dot "." in regex means "any character". An escaped dot ("\\." in a Java string, or "\." in the regex itself) means the dot character itself (without any special regex behaviour like the unescaped dot).
So, for example, "123.ABC" could be a line that matches the given regex (line breaks etc. not included).
It matches a dot character. A double backslash '\\' simply means a single '\', as you have to escape '\'s in Java strings. So '\\.' is translated to '\.', which means match just a '.' character. If you just used '.' by itself, without escaping, it would match any character. So you have to escape it to match a '.' character.
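A quick way to confirm this reading is to run the pattern against a few inputs. The sketch below wraps the tutorial's check in a helper whose name is made up for illustration:

```java
class SymbolCheck {
    // Same pattern as in the question: 1 to 10 characters, each of which
    // is a digit, an uppercase ASCII letter, or a literal dot.
    static boolean isValid(String str) {
        return str.matches("^[0-9A-Z\\.]{1,10}$");
    }

    public static void main(String[] args) {
        System.out.println(isValid("123.ABC"));     // true
        System.out.println(isValid("BRK.A"));       // true
        System.out.println(isValid(""));            // false: at least one character
        System.out.println(isValid("goog"));        // false: lowercase not allowed
        System.out.println(isValid("ABCDEFGHIJK")); // false: 11 characters
        System.out.println(isValid("AB-C"));        // false: - is not in the class
    }
}
```

One detail worth knowing: inside square brackets the dot is already a literal, so "[0-9A-Z.]" behaves the same way. The escape in the tutorial's pattern is harmless but not required there.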
