Java: ignore special characters in string matching

I want to match two strings in Java, e.g.:
text: János
searchExpression: Janos
Since I don't want to replace all special characters, I thought I could just make the á a wildcard, so everything would match for this character. For instance if I search in János with Jxnos, it should find it. Of course there could be multiple special characters in the text. Does anyone have an idea how I could achieve this via any pattern matcher, or do I have to compare char by char?

Use the Pattern and Matcher classes with J\\Snos as the regex. \\S matches any non-whitespace character.
String str = "foo János bar Jxnos";
Matcher m = Pattern.compile("J\\Snos").matcher(str);
while(m.find())
{
System.out.println(m.group());
}
Output:
János
Jxnos
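The regex above hard-codes the position of the wildcard. A minimal sketch of the idea from the question (turning every special character of the text into a wildcard) could look like the following. The class and method names are hypothetical, "special" is assumed to mean "any non-ASCII character", and the text is normalized to NFC first so that an accented letter is a single character.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class WildcardMatch {

    // Hypothetical helper: every non-ASCII character of the text becomes ".",
    // every ASCII character is quoted so that it only matches itself.
    static Pattern toWildcardPattern(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        StringBuilder regex = new StringBuilder();
        composed.codePoints().forEach(cp -> {
            if (cp > 127) {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(new String(Character.toChars(cp))));
            }
        });
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        // "J\u00e1nos" is "Janos" with a-acute; the escape keeps the source ASCII-only
        Pattern p = toWildcardPattern("J\u00e1nos");
        System.out.println(p.matcher("Janos").matches()); // true
        System.out.println(p.matcher("Jxnos").matches()); // true
        System.out.println(p.matcher("Jones").matches()); // false
    }
}
```

Use find() instead of matches() if the search expression should be located inside a longer string.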

A possible solution would be to strip the accents with the Apache Commons Lang StringUtils.stripAccents(input) method:
String input = StringUtils.stripAccents("János");
System.out.println(input); //Janos
Make sure to also read up on the more elaborate approaches based on the Normalizer class: Is there a way to get rid of accents and convert a whole string to regular letters?

Related

Regex to replace All turkish symbols to regular latin symbols

I have a class that replaces all Turkish symbols with similar Latin symbols and passes the result to a searcher.
These are the methods for symbol replacement:
@Override
String replaceTurkish(String words) {
if (checkWithRegExp(words)) {
return words.toLowerCase().replaceAll("ç", "c").replaceAll("ğ", "g").replaceAll("ı", "i").
replaceAll("ö", "o").replaceAll("ş", "s").replaceAll("ü", "u");
} else return words;
}
public static boolean checkWithRegExp(String word){
Pattern p = Pattern.compile("[öçğışü]");
Matcher m = p.matcher(word);
return m.matches();
}
But this always returns the words unmodified.
What am I doing wrong?
Thanks in advance!
Per the Java 7 API, Matcher.matches():
Attempts to match the entire region against the pattern.
Your pattern is "[öçğışü]", which regex101.com (an awesome resource) says will match
a single character in the list öçğışü literally
Perhaps you may see the problem already. Your regex is not going to match anything except a single Turkish character, since you are attempting to match the entire region against a regex which will only ever accept one character.
I recommend either using find(), per suggestion by Andreas in the comments, or using a regex like this:
".*[öçğışü].*"
which should actually find words that contain any Turkish-specific characters.
Additionally, I'll point out that regex is case-sensitive, so if there are upper-case variants of these letters, you should include those as well and modify your replace statements.
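Finally (edit): you can make your Pattern case-insensitive, but your replaceAll calls will still need to change to handle both cases. Note that Pattern.CASE_INSENSITIVE on its own only covers US-ASCII characters; for letters such as Ö or Ş it has to be combined with Pattern.UNICODE_CASE. I am unsure how well this works for every Turkish letter (the dotted and dotless i in particular), so you should test the flags before relying on them.
Pattern p = Pattern.compile(".*[öçğışü].*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);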
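Putting the find() suggestion together, a minimal sketch of the corrected methods might look like this. It is not the asker's class: the class name is made up, the upper-case letters are listed explicitly in the character class rather than handled through flags, and lower-casing with the Turkish locale is an assumption chosen so that İ and I both end up as a plain i. Non-ASCII letters are written as \u escapes so the example does not depend on the source-file encoding.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class TurkishReplace {

    // Assumption: Turkish rules are wanted when lower-casing (I -> dotless i, dotted capital I -> i)
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // o-umlaut, c-cedilla, g-breve, dotless i, s-cedilla, u-umlaut and their upper-case forms
    private static final Pattern TURKISH_CHARS = Pattern.compile(
            "[\u00f6\u00e7\u011f\u0131\u015f\u00fc\u00d6\u00c7\u011e\u0130\u015e\u00dc]");

    static boolean checkWithRegExp(String word) {
        // find() looks for the pattern anywhere in the input;
        // matches() would require the whole input to be one such character
        return TURKISH_CHARS.matcher(word).find();
    }

    static String replaceTurkish(String words) {
        if (!checkWithRegExp(words)) {
            return words;
        }
        // String.replace(char, char) is enough here, no regex needed
        return words.toLowerCase(TURKISH)
                .replace('\u00e7', 'c')
                .replace('\u011f', 'g')
                .replace('\u0131', 'i')
                .replace('\u00f6', 'o')
                .replace('\u015f', 's')
                .replace('\u00fc', 'u');
    }

    public static void main(String[] args) {
        System.out.println(replaceTurkish("\u00e7a\u011fr\u0131 g\u00fcne\u015f")); // cagri gunes
        System.out.println(replaceTurkish("Hello")); // Hello (no Turkish letters, returned unchanged)
    }
}
```

As in the original code, input without any Turkish-specific letters is returned as-is, including its case.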

java Regex - split but ignore text inside quotes?

Using only regular expression methods, the method String.replaceAll and ArrayList,
how can I split a String into tokens, but ignore delimiters that appear inside quotes?
The delimiter is any character that is not alphanumeric or part of quoted text.
for example:
The string :
hello^world'this*has two tokens'
should output:
hello
worldthis*has two tokens
I know there is a damn good accepted answer already, but I would like to add another regex-based (and, may I say, simpler) approach: split the given text on any non-alphanumeric delimiter that is not inside single quotes, using
Regex:
/(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+/
This basically means: match non-alphanumeric text if it is followed by an even number of single quotes. In other words, match non-alphanumeric text only if it is outside single quotes.
Code:
String string = "hello^world'this*has two tokens'#2ndToken";
System.out.println(Arrays.toString(
string.split("(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+"))
);
Output:
[hello, world'this*has two tokens', 2ndToken]
Use a Matcher to identify the parts you want to keep, rather than the parts you want to split on:
String s = "hello^world'this*has two tokens'";
Pattern pattern = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
You cannot in any reasonable way. You are posing a problem that regular expressions aren't good at.
Do not use a regular expression for this. It won't work. Use / write a parser instead.
You should use the right tool for the right task.
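For comparison with the regex answers, a hand-written tokenizer along the lines this answer suggests can be quite short. This is only a sketch under assumptions: the class name is made up, quotes are single quotes that toggle a "quoted" mode and are dropped from the output (matching the output asked for in the question), and "alphanumeric" is taken to mean Character.isLetterOrDigit, which also accepts non-ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\'') {
                // A quote toggles quoted mode and is not copied to the token
                inQuotes = !inQuotes;
            } else if (inQuotes || Character.isLetterOrDigit(c)) {
                // Inside quotes everything is kept, outside only letters and digits
                current.append(c);
            } else if (current.length() > 0) {
                // A delimiter outside quotes ends the current token
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the two tokens from the question, one per line
        for (String token : tokenize("hello^world'this*has two tokens'")) {
            System.out.println(token);
        }
    }
}
```

An unterminated quote is treated as running to the end of the input; whether that is acceptable depends on the data.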

Find string with special char using regex

I need to iterate over a List and remove all strings that contain certain special characters. Using a regex I'm able to remove all strings that start with these special characters, but how can I find out whether such a character appears in the middle of the string?
For instance:
Pattern.matches("[()<>/;\\*%$].*", "(123)")
returns true and I can remove this string
but it doesn't work with this kind of string: 12(3).
Is it correct to use \* to find an occurrence of the "*" char in the string?
Thanks for the help!
Andrea
You are yet another victim of Java's ill-named .matches(), which tries to match the whole input and contradicts the very definition of regex matching.
What you want is matching one character among ()<>/;\\*%$. With Java, you need to create a Pattern, a Matcher from this Pattern and use .find() on this matcher:
final Pattern p = Pattern.compile("[()<>/;\\*%$]");
final Matcher m = p.matcher(yourinput);
if (m.find()) {
    // match, proceed
}
Try the following:
!Pattern.matches("^[^()<>/;\\*%$]*$", "(123)")
This uses a negated character class to ensure that all the characters in the string are not any of the characters in the class.
You then obviously negate the expression since you are testing for a string that does not match.
Is it correct to use \* to find an occurrence of the "*" char in the string?
Yes.
Pattern.matches() tries to match the whole input. So since your regex says that the input has to start with a "special" char, 12(3) doesn't match.
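Applied to the list from the question, the find()-based check could be used roughly as follows. The class and method names are made up for illustration; the character class is the one from the answers above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SpecialCharFilter {

    // Compiled once and reused; matches a single "special" character
    private static final Pattern SPECIAL = Pattern.compile("[()<>/;\\*%$]");

    // Returns a new list without the strings that contain a special character anywhere
    static List<String> withoutSpecialChars(List<String> input) {
        List<String> result = new ArrayList<>(input);
        result.removeIf(s -> SPECIAL.matcher(s).find());
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("(123)", "12(3)", "123", "a*b", "abc");
        System.out.println(withoutSpecialChars(data)); // [123, abc]
    }
}
```

Copying into a new ArrayList keeps the sketch working even when the input list is unmodifiable.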

regex to find substring between special characters

I am running into this problem in Java.
I have data strings that contain entities enclosed between & and ;. For example:
&Text.ABC;, &Links.InsertSomething;
These entities can be anything from the ini file we have.
I need to find these strings in the input string and remove them. There can be none, one or more occurrences of these entities in the input string.
I am trying to use regex to pattern match and failing.
Can anyone suggest the regex for this problem?
Thanks!
Here is the regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;"
It starts by matching the character &, followed by one or more letters (both uppercase and lower case) ([A-Za-z]+). Then it matches a dot followed by one or more letters (\\.[A-Za-z]+). There can be any number of this, including zero. Finally, it matches the ; character.
You can use this regex in java like this:
Pattern p = Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;"); // java.util.regex.Pattern
String subject = "foo &Bar; baz\n";
String result = p.matcher(subject).replaceAll("");
Or just
"foo &Bar; baz\n".replaceAll("&[A-Za-z]+(\\.[A-Za-z]+)*;", "");
If you want to remove whitespace after the matched tokens as well, you can use this regex:
"&[A-Za-z]+(\\.[A-Za-z]+)*;\\s*" // the "\\s*" matches any number of whitespace
And there is a nice online regular expression tester which uses the java regexp library.
http://www.regexplanet.com/simple/index.html
You can try:
input=input.replaceAll("&[^.]+\\.[^;]+;(,\\s*&[^.]+\\.[^;]+;)*","");
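To see how the &[A-Za-z]+(\\.[A-Za-z]+)*; pattern suggested above behaves on the sample entities from the question, here is a small self-contained sketch. The class and method names are hypothetical, and it assumes entity names consist only of ASCII letters separated by dots, as in the examples given.

```java
import java.util.regex.Pattern;

class EntityStripper {

    // "&", one or more letters, optional ".letters" groups, then ";"
    private static final Pattern ENTITY =
            Pattern.compile("&[A-Za-z]+(\\.[A-Za-z]+)*;");

    static String removeEntities(String input) {
        return ENTITY.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeEntities("&Text.ABC;, &Links.InsertSomething;")); // ", "
        System.out.println(removeEntities("no entities here"));                    // unchanged
        System.out.println(removeEntities("Tom & Jerry; done"));                   // unchanged
    }
}
```

A lone ampersand followed by a space, as in the last call, is left alone because at least one letter must directly follow the &.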

How do I match unicode characters in Java

I'm trying to match Unicode characters in Java.
Input String: informa
String to match : informátion
So far I've tried this:
Pattern p= Pattern.compile("informa[\u0000-\uffff].*", (Pattern.UNICODE_CASE|Pattern.CANON_EQ|Pattern.CASE_INSENSITIVE));
String s = "informátion";
Matcher m = p.matcher(s);
if(m.matches()){
System.out.println("Match!");
}else{
System.out.println("No match");
}
It comes out as "No match". Any ideas?
The term "Unicode characters" is not specific enough. It would match every character which is in the Unicode range, thus also "normal" characters. This term is however very often used when one actually means "characters which are not in the printable ASCII range".
In regex terms that would be [^\x20-\x7E].
boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
Depending on what you'd like to do with this information, here are some useful follow-up answers:
Get rid of special characters
Get rid of diacritical marks
Is it because informa isn't a substring of informátion at all?
How would your code work if you removed the last a from informa in your regex?
It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD form, strip out the diacritical marks, and then do your search.
String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
To learn more about NFD:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms
http://unicode.org/faq/normalization.html
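The "search code goes here" step could be filled in along these lines. This is a sketch, not the answerer's code: the class and method names are made up, both the text and the search term are folded the same way, and lower-casing with Locale.ROOT is added on the assumption that the search should also be case-insensitive, as the flags in the question suggest.

```java
import java.text.Normalizer;
import java.util.Locale;

class DiacriticInsensitiveSearch {

    // Decompose to NFD, drop the combining marks, then lower-case
    static String fold(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFD);
        return normalized
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .toLowerCase(Locale.ROOT);
    }

    static boolean containsIgnoringDiacritics(String text, String search) {
        return fold(text).contains(fold(search));
    }

    public static void main(String[] args) {
        // "inform\u00e1tion" is "information" with a-acute instead of the second a
        String text = "inform\u00e1tion";
        System.out.println(containsIgnoringDiacritics(text, "informa")); // true
        System.out.println(containsIgnoringDiacritics(text, "INFORMA")); // true
        System.out.println(containsIgnoringDiacritics(text, "informe")); // false
    }
}
```

Letters that have no canonical decomposition, such as ø or the Turkish dotless ı, are not changed by this folding and would need explicit handling.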
