How to escape a character in Regex expression in Java

I have a regular expression that removes all non-alphanumeric characters. It works for every special character except ^. This is the expression I am using:
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "").toUpperCase();
I tried modifying it to
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]\\^", "").toUpperCase();
and
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}\\^]", "").toUpperCase();
But neither of these removes the symbol.
Can someone please help me with this?

The first ^ inside [^...] is a negation mark, making the character class a negated one (it matches any character other than those listed inside).
The second ^ inside the class is treated as a literal caret. Because it is listed in the negated class, the caret is excluded from the match. Remove it, and the regex will match a caret:
"[^\\p{IsAlphabetic}\\p{IsDigit}]"
or even shorter:
"(?U)\\P{Alnum}"
The \P{Alnum} class stands for any character other than an alphanumeric character, where \p{Alnum} is [\p{Alpha}\p{Digit}] (see the Java regex reference). When you pass (?U), \p{Alnum} becomes Unicode-aware, so \P{Alnum} no longer matches Unicode letters and digits. See this IDEONE demo.
Add a + at the end if you want to remove whole runs of characters other than \p{IsAlphabetic} and \p{IsDigit} in one replacement.
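As a quick check of the explanation above, here is a minimal sketch that compares the question's pattern with the corrected one on an input containing a caret. The class and method names are made up for illustration.

```java
class StripNonAlnum {
    // Question's pattern: the second ^ is a literal inside the negated class,
    // so carets are excluded from the match and survive the replacement.
    static String original(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "").toUpperCase();
    }

    // Corrected pattern: only the leading ^ remains; + removes whole runs at once.
    static String corrected(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]+", "").toUpperCase();
    }

    public static void main(String[] args) {
        String input = "ab^c-12 3!";
        System.out.println(original(input));  // AB^C123
        System.out.println(corrected(input)); // ABC123
    }
}
```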

This works as well.
System.out.println("Text 尖酸[刻薄 ^, More _0As text °ÑÑ"".replaceAll("(?U)[^[\\W_]]+", " "));
Output
Text 尖酸 刻薄 More 0As text Ñ Ñ
I'm not sure, but the word class (\w) might be the more comprehensive list of alphanumeric characters.
[\\W_] is a class containing non-word characters and an underscore.
When put into a negated Java class construct, [^[\\W_]] becomes a negated class of a union between nothing and a class containing non-word characters and an underscore.

Related

Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

I need to validate a name, stored in a String, which can be in any language and may contain spaces, using \p{L}:
You can match a single character belonging to the "letter" category with \p{L}
I tried to use String.matches, but it failed to match non-English characters, even a single one, for example:
String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+"); // returns false
I tried with and without brackets, and with + for multiple letters, but it always fails to match non-English characters.
Is there an issue with using String.matches with \p{L}?
It also failed with [\\x00-\\x7F]+, as suggested in the Pattern documentation:
\p{ASCII} All ASCII:[\x00-\x7F]

You should bear in mind that \p{L} matches a single code point in the Unicode Letter category; it does not match the combining marks (diacritics) that may follow a base letter.
Since your input can contain letters and diacritics you should at least use both \p{L} and \p{M} Unicode property classes in your character class:
String regex = "[\\p{L}\\p{M}]+";
If the input string can contain words separated by whitespace, you may add the \s shorthand class; to make it match any kind of Unicode whitespace, compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag (or its embedded form (?U)):
String regex = "(?U)[\\p{L}\\p{M}\\s]+";
Note that this regex allows entering diacritics, letters and whitespaces in any order. If you need a more precise regex (e.g. diacritics only allowed after a base letter) you may consider something like
String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";
Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters each followed with zero or more diacritics, \s* matches zero or more whitespaces and \s+ matches 1 or more whitespaces.
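The two patterns above can be compared with a small sketch like the following. The class and method names are illustrative only; the regexes are the ones from this answer.

```java
import java.util.regex.Pattern;

class NameCheck {
    // Loose form: letters, marks and whitespace in any order.
    private static final Pattern LOOSE =
            Pattern.compile("(?U)[\\p{L}\\p{M}\\s]+");

    // Strict form: every mark must follow a base letter.
    private static final Pattern STRICT = Pattern.compile(
            "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");

    static boolean looseMatch(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strictMatch(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String thai = "\u0e2d\u0e31";      // letter followed by a combining mark
        String markOnly = "\u0e31";        // combining mark with no base letter
        System.out.println(strictMatch(thai));      // true
        System.out.println(looseMatch(markOnly));   // true
        System.out.println(strictMatch(markOnly));  // false
        System.out.println(strictMatch("John3"));   // false
    }
}
```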
\p{IsAlphabetic} vs. [\p{L}\p{M}]
If you check the source code, \p{Alphabetic} checks if Character.isAlphabetic(ch) is true. It is true if the char belongs to any of the following classes: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER or it has contributory property Other_Alphabetic. It is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.
While all those L subclasses form the general L class, note that the derivation also includes the letter-number class Nl, and that Other_Alphabetic is not the same set as \p{M}: it covers some characters outside \p{M}, while many combining marks in \p{M} (for example U+0301 COMBINING ACUTE ACCENT) are not Alphabetic. See this reference (although it is in German, the categories and char names are in English).
So, \p{IsAlphabetic} and [\p{L}\p{M}] overlap without being identical, and you should make the right decision based on the languages you want to support.
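One way to see a difference between the two classes is to test a letter-number character. The sketch below uses U+2160 (ROMAN NUMERAL ONE, category Nl) as an example of my own choosing; the class and method names are hypothetical.

```java
class AlphabeticCheck {
    static boolean isAlphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}+");
    }

    static boolean isLetterOrMark(String s) {
        return s.matches("[\\p{L}\\p{M}]+");
    }

    public static void main(String[] args) {
        String romanOne = "\u2160";   // ROMAN NUMERAL ONE, general category Nl
        String thai = "\u0e2d\u0e31"; // the string from the question
        System.out.println(isAlphabetic(romanOne));   // true
        System.out.println(isLetterOrMark(romanOne)); // false
        System.out.println(isAlphabetic(thai));       // true
        System.out.println(isLetterOrMark(thai));     // true
    }
}
```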

The only solution I found is using \p{IsAlphabetic}:
\p{Alpha} An alphabetic character:\p{IsAlphabetic}
boolean isMatch = name.matches("[ \\p{IsAlphabetic}]+");
However, this does not work on sites such as https://regex101.com/.

There are two characters there. The first is a letter, the second is a non-letter mark.
String name = "\u0e2d";
boolean isMatch = name.matches("[\\p{L}]+"); // true
works, but
String name = "\u0e2d\u0e31";
boolean isMatch = name.matches("[\\p{L}]+"); // false
does not, because ั (U+0E31) is a Non-Spacing Mark [NSM], not a letter.
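If you want to inspect the categories yourself, a minimal sketch using java.lang.Character could look like this (the class and method names are illustrative):

```java
class CodePointTypes {
    static boolean isLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // Walk the string from the question code point by code point.
        "\u0e2d\u0e31".codePoints().forEach(cp ->
                System.out.printf("U+%04X letter=%b nonSpacingMark=%b%n",
                        cp, isLetter(cp), isNonSpacingMark(cp)));
    }
}
```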

I googled that character to find the language. It seems to be Thai. The Thai Unicode character range is 0E00 to 0E7F.
When you are working with Unicode characters you can use \u. So the regex should look like this:
[\u0E00-\u0E7F]
This matches your character in this regex test.
If you want to match letters in any language, use this:
[\p{L}]
This matches your example characters in this regex test.
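In Java, the range approach can be tried with String.matches, which requires the whole string to match (online testers often report partial matches instead). The sketch below is illustrative; the class and method names are not from the original answer.

```java
class ThaiRange {
    // Whole-string match against the Thai block range U+0E00..U+0E7F.
    static boolean inThaiBlock(String s) {
        return s.matches("[\\u0E00-\\u0E7F]+");
    }

    // Whole-string match against letters only.
    static boolean lettersOnly(String s) {
        return s.matches("[\\p{L}]+");
    }

    public static void main(String[] args) {
        String name = "\u0e2d\u0e31";
        System.out.println(inThaiBlock(name)); // true: both code points are in the block
        System.out.println(lettersOnly(name)); // false: the second code point is a mark
    }
}
```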

Try including more categories:
[\p{L}\p{Mn}\p{Mc}\p{Nl}\p{Pc}\p{Pd}\p{Po}\p{Sk}]+
Note that it might be best to simply not validate names. People can't really complain if they entered it wrong but your system didn't catch it. However, it's much more of a problem if someone is unable to enter their name. If you do insist on adding validation, please make it overridable: that should have the advantages of each method without their disadvantages.

Regex that get rid of all the punctuations at the top and the end of a string

I am trying to come up with a regular expression that strips all punctuation (if there is any) from both the start and the end of a string.
The regex I am using now looks like this (word is the string I want to convert):
word = word.replaceAll("['?:!.,;]*([a-z]+)['?:!.,;]*", "$1").toLowerCase();
However, I still get some weird cases. For example, 'Amen' goes to 'amen' and ''tis goes to 'tis. Can anyone help me modify it so that 'Amen' will go to amen and ''tis to tis. Thanks in advance!

Replace the following pattern:
^\p{P}+|\p{P}+$
With an empty string.
\p{P} means any punctuation character. The first part of the regex will remove punctuation at the start, and the second will do it at the end.

In Java you can use:
\\p{Punct}
to identify a punctuation character.
To remove punctuation character from start or end use this:
word = word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
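A minimal sketch of this approach, applied to the words from the question, might look as follows. The class and method names are made up. It also shows one difference worth knowing: Java's \p{Punct} (ASCII punctuation, including symbols such as $) is not the same set as the Unicode category \p{P}.

```java
class TrimPunct {
    // Strip ASCII punctuation from both ends, then lower-case the word.
    static String trim(String word) {
        return word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").toLowerCase();
    }

    // Same idea with the Unicode punctuation category \p{P}.
    static String trimUnicode(String word) {
        return word.replaceAll("^\\p{P}+|\\p{P}+$", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(trim("'Amen'"));      // amen
        System.out.println(trim("''tis"));       // tis
        System.out.println(trim("don't!"));      // don't
        System.out.println(trim("$5!"));         // 5
        System.out.println(trimUnicode("$5!"));  // $5  ($ is a currency symbol, not \p{P})
    }
}
```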

I couldn't reproduce the problem with ''tis becoming 'tis, but the problem with 'Amen' is that your regex doesn't accept upper-case characters, because [a-z] accepts only lower-case characters. You can fix that by adding A-Z to your character class or by making your regex case-insensitive with the (?i) flag.
So maybe try
replaceAll("['?:!.,;]*([a-zA-Z]+)['?:!.,;]*", "$1")
or
replaceAll("(?i)['?:!.,;]*([a-z]+)['?:!.,;]*", "$1")
You can also change your strategy to just removing punctuations at start of the string or at end of the string. In that case you could just use
replaceAll("^\\p{Punct}+|\\p{Punct}+$","");
where
^ represents start of the string
$ represents end of the string
\\p{Punct} is a character class representing punctuation characters (one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) but you can use your own ['?:!.,;] class if you want

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz then = sign then some chars then , then some chars then ' then some chars and finally ]. This means if there is characters after xyz then they must be in format ="XXX","XXX"'"XXX".
Or only #[xyz]. No character after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.

You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
See it
Expressed as a Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match a literal [ or ] you need to escape it by preceding it with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit. But you want one or more alphanumeric characters, so you need [a-zA-Z0-9]+. Note also that the range A-z should be A-Z: A-z additionally matches [, \, ], ^, _ and `.
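The following sketch runs such a pattern against the six sample strings from the question. It assumes the range A-Z was intended (rather than A-z) and uses Matcher.find() so that surrounding text such as "asdadad #[xyz] adadad" is allowed; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class XyzTag {
    // Either "#[xyz]" or "#[xyz=" followed by three quoted alphanumeric parts.
    private static final Pattern TAG = Pattern.compile(
            "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // find() looks for the tag anywhere in the input.
    static boolean containsTag(String s) {
        return TAG.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "#[xyz=\"1\",\"2\"'\"4\"]",       // valid
            "#[xyz]",                         // valid
            "#[xyz=\"a5\",\"4r\"'\"8dsa\"]",  // valid
            "#[xyz=\"asd\"]",                 // invalid
            "#[xyz\"asd\"]",                  // invalid
            "#[xyz=\"8s\"'\"4\"]"             // invalid
        };
        for (String s : samples) {
            System.out.println(s + " -> " + containsTag(s));
        }
    }
}
```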

Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects a single letter followed by a single digit. You also have to escape the first and last square brackets, because square brackets have a special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore) one or more times. Note the upper-case Z: the range A-z in your pattern would also match [, \, ], ^, _ and `.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
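The range A-z, which appears in the question's pattern, is easy to mistype for A-Z. A small sketch (names are illustrative) shows the practical difference: the ASCII characters between Z and a, including the underscore, fall inside A-z.

```java
class RangeTypo {
    static boolean typoRange(String s) {
        return s.matches("[a-zA-z0-9]+");
    }

    static boolean fixedRange(String s) {
        return s.matches("[a-zA-Z0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(typoRange("a_b"));   // true: _ lies between 'Z' and 'a'
        System.out.println(fixedRange("a_b"));  // false
        System.out.println(fixedRange("a5"));   // true
    }
}
```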

Since square brackets have a special meaning in regex (you used them yourself: they define character classes), you need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is with [a-zA-z][0-9]: it defines "first a letter, then a digit". You need to join those classes and add a quantifier (and the range should end in an upper-case Z):
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
See it here on Regexr

there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
Regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any string before and after the pattern, as you mentioned in your edit. As @ademiban said, the group (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? occurs once or not at all. The other mistakes are well explained in the other answers; +1 to all of them.

How to replace a special character with single slash

I have a question about strings in Java. Let's say, I have a string like so:
String str = "The . startup trace ?state is info?";
Because the string contains special characters such as "?", I need each one to be replaced with "\?" as per my requirement. How do I replace special characters with "\"? I tried the following:
str.replace("?","\?");
But it gives a compilation error. Then I tried the following:
str.replace("?","\\?");
When I do this, it replaces the special characters with "\\". When I print the string, it prints a single slash. I thought it was storing a single slash only, but when I debugged I found that the variable contains "\\".
Can anyone suggest how to replace the special characters with a single slash ("\")?

On escape sequences
A declaration like:
String s = "\\";
defines a string containing a single backslash. That is, s.length() == 1.
This is because \ is a Java escape character for String and char literals. Here are some other examples:
"\n" is a String of length 1 containing the newline character
"\t" is a String of length 1 containing the tab character
"\"" is a String of length 1 containing the double quote character
"\/" contains an invalid escape sequence, and therefore is not a valid String literal; it causes a compilation error
Naturally you can combine escape sequences with normal unescaped characters in a String literal:
System.out.println("\"Hey\\\nHow\tare you?");
The above prints (tab spacing may vary):
"Hey\
How are you?
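The lengths mentioned above are easy to confirm. Here is a minimal sketch; the class and constant names are made up for illustration.

```java
class EscapeLengths {
    static final String BACKSLASH = "\\"; // one character: \
    static final String NEWLINE = "\n";   // one character: line feed
    static final String TAB = "\t";       // one character: tab
    static final String QUOTE = "\"";     // one character: "

    public static void main(String[] args) {
        System.out.println(BACKSLASH.length()); // 1
        System.out.println(NEWLINE.length());   // 1
        System.out.println(TAB.length());       // 1
        System.out.println(QUOTE.length());     // 1
        System.out.println(BACKSLASH);          // \
    }
}
```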
References
JLS 3.10.6 Escape Sequences for Character and String Literals
See also
Is the char literal '\"' the same as '"' ?(backslash-doublequote vs only-doublequote)
Back to the problem
Your problem definition is very vague, but the following snippet works as it should:
System.out.println("How are you? Really??? Awesome!".replace("?", "\\?"));
The above snippet replaces ? with \?, and thus prints:
How are you\? Really\?\?\? Awesome!
If instead you want to replace a char with another char, then there's also an overload for that:
System.out.println("How are you? Really??? Awesome!".replace('?', '\\'));
The above snippet replaces ? with \, and thus prints:
How are you\ Really\\\ Awesome!
String API links
replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
replace(char oldChar, char newChar)
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
On how regex complicates things
If you're using replaceAll or any other regex-based method, then things become somewhat more complicated. It can be greatly simplified if you understand some basic rules.
Regex patterns in Java are given as String values
Metacharacters (such as ? and .) have special meanings, and may need to be escaped by preceding them with a backslash to be matched literally
The backslash is also a special character in replacement String values
The above factors can lead to the need for numerous backslashes in patterns and replacement strings in Java source code.
It doesn't look like you need regex for this problem, but here's a simple example to show what it can do:
System.out.println(
"Who you gonna call? GHOSTBUSTERS!!!"
.replaceAll("[?!]+", "<$0>")
);
The above prints:
Who you gonna call<?> GHOSTBUSTERS<!!!>
The pattern [?!]+ matches one-or-more (+) of any characters in the character class [...] definition (which contains a ? and ! in this case). The replacement string <$0> essentially puts the entire match $0 within angled brackets.
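When a regex-based method is needed but both the target and the replacement should be taken literally, the standard helpers Pattern.quote and Matcher.quoteReplacement can do the escaping for you. The sketch below is one possible way to apply them to the question's string; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteHelpers {
    // Replace every literal "?" with a backslash followed by "?".
    static String escapeQuestionMarks(String s) {
        return s.replaceAll(
                Pattern.quote("?"),               // treat ? as a literal, not a quantifier
                Matcher.quoteReplacement("\\?")); // treat \ in the replacement as a literal
    }

    public static void main(String[] args) {
        String str = "The . startup trace ?state is info?";
        System.out.println(escapeQuestionMarks(str));
        // The . startup trace \?state is info\?
    }
}
```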
Related questions
Having trouble with Splitting text. - discusses common mistakes like split(".") and split("|")
Regular expressions references
regular-expressions.info
Character class and Repetition with Star and Plus
java.util.regex.Pattern and Matcher

In case you want to replace ? with \?, there are 2 possibilities: replace and replaceAll (for regular expressions):
str.replace("?", "\\?")
str.replaceAll("\\?","\\\\?");
The result is "The . startup trace \?state is info\?"
If you want to replace ? with \, just remove the ? character from the second argument.

But when I print the string, it prints
with single slash.
Good. That's exactly what you want, isn't it?
There are two simple rules:
1. A backslash inside a String literal has to be specified as two to satisfy the compiler, i.e. "\\". Otherwise it is taken as a special-character escape.
2. A backslash in a regular expression has to be specified as two to satisfy regex, otherwise it is taken as a regex escape. Because of (1) this means you have to write 2x2=4 of them: "\\\\" (and because of the forum software I actually had to write 8!).
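The two rules can be seen at work in a short sketch like this one (class and method names are illustrative). Four backslashes in source code become two characters in the String, which the regex engine reads as one literal backslash.

```java
class FourBackslashes {
    // Pattern "\\\\" in source = regex \\ = one literal backslash.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    // The replacement string follows the same rule: eight backslashes in
    // source produce two literal backslashes in the result.
    static String doubleBackslashes(String s) {
        return s.replaceAll("\\\\", "\\\\\\\\");
    }

    public static void main(String[] args) {
        System.out.println("\\\\".length());                  // 2
        System.out.println(toForwardSlashes("C:\\temp\\x"));  // C:/temp/x
        System.out.println(doubleBackslashes("a\\b"));        // a\\b
    }
}
```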

String str="\\";
str=str.replace(str,"\\\\");
System.out.println("New String="+str);
Output: New String=\\
In a Java string literal, "\\" denotes a single backslash, so str initially holds \. The code above replaces that single backslash with two backslashes, which is what gets printed.

Regexp match in Java

I want to write a regexp that verifies whether a word has the form [0-9A-Za-z][._-'][0-9A-Za-z].
Examples of valid words:
A21a_c32
daA.da2
das'2
dsada
ASDA
12SA89
Invalid words:
dsa#da2
34$
Thanks

^[0-9A-Za-z]+[._'-]?[0-9A-Za-z]+$ (see matches on rubular.com)
Key points:
^ is the start of the string anchor
$ is the end of string anchor
+ is "one-or-more repetition of"
? is "zero-or-one repetition of" (i.e. "optional")
- in a character class definition is special (range definition), unless it's escaped, or first, or last
. unescaped outside of a character class definition is special, but in a character class definition it's just a period
References
regular-expressions.info/Anchors, Repetition, Dot, Character Class

If [._'-] are optional, put the ? with the next characters, like this:
[0-9A-Za-z]+([._'-][0-9A-Za-z]+)?
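A minimal sketch that checks this pattern against the words from the question is shown below. It assumes whole-word validation via Matcher.matches(); the class and method names are made up.

```java
import java.util.regex.Pattern;

class WordCheck {
    // Alphanumerics, optionally followed by one delimiter and more alphanumerics.
    private static final Pattern WORD =
            Pattern.compile("[0-9A-Za-z]+([._'-][0-9A-Za-z]+)?");

    static boolean isValid(String word) {
        return WORD.matcher(word).matches(); // the whole word must match
    }

    public static void main(String[] args) {
        String[] words = {
            "A21a_c32", "daA.da2", "das'2", "dsada", "ASDA", "12SA89", // expected valid
            "dsa#da2", "34$"                                           // expected invalid
        };
        for (String w : words) {
            System.out.println(w + " -> " + isValid(w));
        }
    }
}
```

Unlike the anchored pattern with two + quantifiers, this form also accepts a single-character word such as "a".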

"(\\p{Alnum})*([.'_-])?(\\p{Alnum})*"
In this solution I assume that the delimiter is optional, the empty string is also legal, and that the string may start/end with the delimiter, or be composed only of the delimiter.
