Regex to replace All turkish symbols to regular latin symbols - java

I have a class that replaces all Turkish symbols with similar Latin symbols and passes the result to a searcher.
These are the methods for symbol replacement:
@Override
String replaceTurkish(String words) {
if (checkWithRegExp(words)) {
return words.toLowerCase().replaceAll("ç", "c").replaceAll("ğ", "g").replaceAll("ı", "i").
replaceAll("ö", "o").replaceAll("ş", "s").replaceAll("ü", "u");
} else return words;
}
public static boolean checkWithRegExp(String word){
Pattern p = Pattern.compile("[öçğışü]");
Matcher m = p.matcher(word);
return m.matches();
}
But this always returns the unmodified words string.
What am I doing wrong?
Thanks in advance!

Per the Java 7 API, Matcher.matches():
Attempts to match the entire region against the pattern.
Your pattern is "[öçğışü]", which regex101.com (an awesome resource) says will match
a single character in the list öçğışü literally
Perhaps you may see the problem already. Your regex is not going to match anything except a single Turkish character, since you are attempting to match the entire region against a regex which will only ever accept one character.
I recommend either using find(), per suggestion by Andreas in the comments, or using a regex like this:
".*[öçğışü].*"
which should actually find words that contain any Turkish-specific characters.
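The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the asker's actual class: the class name `MatchesVersusFind`, the field names, and the sample word are assumptions made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the two patterns come from the discussion above.
class MatchesVersusFind {
    static final Pattern SINGLE = Pattern.compile("[öçğışü]");
    static final Pattern WRAPPED = Pattern.compile(".*[öçğışü].*");

    public static void main(String[] args) {
        String word = "çiçek";

        // matches() needs the whole input to fit a pattern that is one character long.
        System.out.println(SINGLE.matcher(word).matches());  // false

        // find() succeeds as soon as the pattern matches somewhere in the input.
        System.out.println(SINGLE.matcher(word).find());     // true

        // Wrapping the class in .* lets matches() cover the whole word.
        System.out.println(WRAPPED.matcher(word).matches()); // true
    }
}
```

Using find() keeps the pattern short; the .* variant is mainly useful when the code has to stay with matches().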
Additionally, I'll point out that regex is case-sensitive, so if there are upper-case variants of these letters, you should include those as well and modify your replace statements.
Finally (edit): you can make your Pattern case-insensitive, but your replaceAll calls will still need to change to be case-insensitive as well. By default, Java's CASE_INSENSITIVE flag only folds US-ASCII letters (Unicode-aware folding also needs UNICODE_CASE), so you should test the flag with these characters before relying on it.
Pattern p = Pattern.compile(".*[öçğışü].*", Pattern.CASE_INSENSITIVE);
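One way to sidestep the case question is to lower-case the input first and only then test and replace. The sketch below assumes that approach; the class name `TurkishFolding` and the use of `Locale.ROOT` are choices made for this example, not part of the original code. The `main` method also shows, for the single letter ç, how CASE_INSENSITIVE behaves with and without UNICODE_CASE.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class TurkishFolding {
    private static final Pattern TURKISH = Pattern.compile("[öçğışü]");

    static String replaceTurkish(String words) {
        // Lower-case with a fixed locale so the result does not depend on the JVM default.
        String lower = words.toLowerCase(Locale.ROOT);
        if (!TURKISH.matcher(lower).find()) {
            return words; // nothing to replace, keep the input untouched
        }
        // Plain character replacement; no regex is needed for single characters.
        return lower.replace('ç', 'c').replace('ğ', 'g').replace('ı', 'i')
                    .replace('ö', 'o').replace('ş', 's').replace('ü', 'u');
    }

    public static void main(String[] args) {
        Pattern asciiOnly = Pattern.compile("ç", Pattern.CASE_INSENSITIVE);
        Pattern unicode = Pattern.compile("ç",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        System.out.println(asciiOnly.matcher("Ç").find()); // false: ASCII-only folding
        System.out.println(unicode.matcher("Ç").find());   // true: Unicode-aware folding
        System.out.println(replaceTurkish("Şükrü Çağrı")); // sukru cagri
    }
}
```

Note that the dotted capital İ lower-cases to an i followed by a combining dot under Locale.ROOT, so input containing it may need extra handling.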

Related

How to make user enter a certain format using regex java

public static boolean usernameValidator(String username) {
String condition = "^[a-zA-Z]+.[a-zA-z]";
Pattern p = Pattern.compile(condition);
Matcher m = p.matcher(username);
return m.matches();
}
I want the user to enter a certain format, e.g. name.surname: only a-z or A-Z, then a ., then a-z or A-Z again. When I enter input in that format, the method returns false. Am I using the syntax wrong?
A few things going on here.
As I understand it, you want username to contain something of the form: <letters>.<letters>.
There are a few things wrong with your regular expression if that is what you are aiming for.
In the second set of square brackets ([]), you have written A-z rather than A-Z, and there should be a + afterwards. A + indicates you want one or more characters. Without it, the [a-zA-Z] only matches a single character.
The period is also a special character in regex (meaning any character) so you need to escape it with a back-slash \, but that is a special character used to escape other characters so you need a double backslash \\.
Hence, I think you are aiming for:
public static boolean usernameValidator(String username) {
String condition = "^[A-Za-z]+\\.[A-Za-z]+$";
Pattern p = Pattern.compile(condition);
Matcher m = p.matcher(username);
return m.matches();
}
I've added the $ to indicate that you want to match to the end of a line, since you have already included the ^ to match the start.
I don't believe either of these is necessary in this case, because matches() already requires the whole input to match, so you could reduce the regex to [A-Za-z]+\\.[A-Za-z]+.
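To see the effect of the two fixes side by side, here is a small sketch comparing the original pattern with the corrected one. The class name `UsernameCheck` and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical comparison class; the two regexes are the ones discussed above.
class UsernameCheck {
    // Original: unescaped dot (any character) and a single trailing character.
    static final Pattern ORIGINAL = Pattern.compile("^[a-zA-Z]+.[a-zA-z]");
    // Corrected: literal dot and one or more letters on both sides.
    static final Pattern FIXED = Pattern.compile("[A-Za-z]+\\.[A-Za-z]+");

    static boolean usernameValidator(String username) {
        return FIXED.matcher(username).matches();
    }

    public static void main(String[] args) {
        System.out.println(ORIGINAL.matcher("john.smith").matches()); // false
        System.out.println(ORIGINAL.matcher("abc").matches());        // true, although there is no dot
        System.out.println(usernameValidator("john.smith"));          // true
        System.out.println(usernameValidator("johnsmith"));           // false
    }
}
```

The original pattern accepts "abc" because the unescaped dot matches the letter "b", which is probably not what was intended.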
If you are new to regular expressions, maybe have a read of one of the following pages:
JavaDocs for Pattern
W3Schools Java Regex
TutorialsPoint Java Regex

Java Regex pattern to match String from all languages that end with a whitespace

Basically, I need to match words that start with a character from a string. The following is an example:
I am trying to match #this_word but ignore the rest.
I also need the regex to match characters from different languages. I tried this:
#\\s*(\\w+)
but err, it only includes English words.
When I try a regex such as the following:
#(?>\\p{L}\\p{M}*+)+
I get an IndexOutOfBoundsException.
Edit
Apparently the reason I used to get that error was because I wrote:
matcher.group(1);
Instead of:
matcher.group(0);
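The difference between the two calls can be sketched as follows. The pattern has no capturing group, because (?>...) is an atomic, non-capturing group, so only group 0 (the whole match) exists. The class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of group(0) versus group(1) on a pattern without capturing groups.
class GroupIndexDemo {
    static final Pattern TAG = Pattern.compile("#(?>\\p{L}\\p{M}*+)+");

    // Returns the whole match, or null when nothing is found.
    static String wholeMatch(String text) {
        Matcher m = TAG.matcher(text);
        return m.find() ? m.group(0) : null;
    }

    // Returns true when asking for group 1 fails with IndexOutOfBoundsException.
    static boolean groupOneThrows(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return false;
        }
        try {
            m.group(1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "I am trying to match #this_word but ignore the rest.";
        System.out.println(TAG.matcher(text).groupCount()); // 0
        System.out.println(wholeMatch(text));               // #this
        System.out.println(groupOneThrows(text));           // true
    }
}
```

The match stops at the underscore because _ is neither a letter nor a combining mark.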
If you do not care about digits, just add a (?U) flag before the pattern:
UNICODE_CHARACTER_CLASS
public static final int UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.
The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).
The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
Regex:
Pattern ptrn = Pattern.compile("(?U)#\\w+");
See IDEONE demo
You can actually subtract digits from \w with [\\w&&[^\\d]] to only match underscores and Unicode letters:
Pattern ptrn = Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);
Another demo
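As a rough check of how the flag changes \w, the sketch below runs three variants against sample text. The class name `UnicodeWordTags`, the helper `firstMatch`, and the "#tag2024" input are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of ASCII and Unicode-aware \w.
class UnicodeWordTags {
    static final Pattern ASCII_WORD = Pattern.compile("#\\w+");
    static final Pattern UNICODE_WORD = Pattern.compile("(?U)#\\w+");
    static final Pattern NO_DIGITS =
            Pattern.compile("#[\\w&&[^\\d]]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first match of the pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "I am trying to match #эту_строку but ignore the rest.";
        System.out.println(firstMatch(ASCII_WORD, text));        // null
        System.out.println(firstMatch(UNICODE_WORD, text));      // #эту_строку
        System.out.println(firstMatch(NO_DIGITS, "#tag2024 x")); // #tag
    }
}
```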
As an alternative, to match any Unicode letter you may use \p{L}\p{M}*+ subpattern (\p{L} is a base letter and \p{M} matches diacritics). So, to match only letters after # you can use #(?>\p{L}\p{M}*+)+.
To also match an underscore, add it as an alternative: #(?>\p{L}\p{M}*+|_)+.
If you do not care about where the diacritic is, use just a character class: #[\p{L}\p{M}_]+.
See this IDEONE demo:
String str = "I am trying to match #эту_строку but ignore the rest.";
Pattern ptrn = Pattern.compile("#(?>\\p{L}\\p{M}*+|_)+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
Use this pattern:
#[^\s]+
This might work. It will match every run of non-whitespace characters that follows a # in the given String.
You can use the following code to capture Unicode letters (matched by the \p{L} class) together with underscores, so that #this_word is matched as a whole:
String ss = "I am trying to match #this_word but ignore the rest.";
Matcher m = Pattern.compile("#[\\p{L}_]+").matcher(ss);
while (m.find()) {
System.out.println(m.group());
}

Java ignore special characters in string matching

I want to match two strings in java eg.
text: János
searchExpression: Janos
Since I don't want to replace all special characters, I thought I could just make the á a wildcard, so everything would match for this character. For instance if I search in János with Jxnos, it should find it. Of course there could be multiple special characters in the text. Does anyone have an idea how I could achieve this via any pattern matcher, or do I have to compare char by char?
Use the Pattern and Matcher classes with J\\Snos as the regex. \\S matches any non-space character.
String str = "foo János bar Jxnos";
Matcher m = Pattern.compile("J\\Snos").matcher(str);
while(m.find())
{
System.out.println(m.group());
}
Output:
János
Jxnos
A possible solution would be to strip the accent with the help of Apache Commons StringUtils.stripAccents(input) method:
String input = StringUtils.stripAccents("János");
System.out.println(input); //Janos
Make sure to also read up on the more elaborate approaches based on the Normalizer class: Is there a way to get rid of accents and convert a whole string to regular letters?
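For readers who prefer to avoid the Apache Commons dependency, a Normalizer-based approach might look like the sketch below. It decomposes each accented letter into a base letter plus combining marks (form NFD) and then removes the marks. The class name `AccentStripper` and both method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper built on java.text.Normalizer from the JDK.
class AccentStripper {
    static String stripAccents(String input) {
        // NFD splits "á" into "a" followed by a combining acute accent.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches combining marks, which are dropped here.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Compares two strings after removing accents from both sides.
    static boolean matchesIgnoringAccents(String text, String searchExpression) {
        return stripAccents(text).equalsIgnoreCase(stripAccents(searchExpression));
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("János"));                    // Janos
        System.out.println(matchesIgnoringAccents("János", "Janos")); // true
    }
}
```

Letters that have no canonical decomposition, such as ø, ł, or the Turkish dotless ı, pass through unchanged with this approach, so they would need an explicit replacement if they matter.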

Java Regex to check "=number", ex "=5455"?

I want to check a string that matches the format "=number", ex "=5455".
As long as the first char is "=" and the subsequent chars are digits in [0-9] (a dot is not allowed), it will pop up a "correct" message.
if(str.matches("^[=][0-9]+")){
Window.alert("correct");
}
So, is ^[=][0-9]+ the correct one?
If it is not correct, can you provide a correct solution?
If it is correct, can you find a better solution?
I'm no big regex expert and more knowledgeable people than me might correct this answer, but:
I don't think there's a point in using [=] rather than simply = - the [...] block is used to declare multiple choices, why declare a multiple choice of one character?
I don't think you need to use ^ (if your input string contains any character before =, it won't match anyway). I'm unsure as to whether its presence makes your regex faster, slower or has no effect.
In conclusion, I'd use =[0-9]+
That should be correct: it looks for an = sign anchored at the beginning, followed by one or more digits between 0 and 9.
Your regex will work, even though it can be simplified:
.matches() does not search within the input the way regex matching usually does; it tries to match the entire input against the regex, so the beginning-of-input anchor is not needed;
you don't need the character class around the =.
Therefore:
if (str.matches("=[0-9]+")) { ... }
If you want to match a string which only begins with that regex, you have to use a Pattern, a Matcher and .find():
final Pattern p = Pattern.compile("^=[0-9]+");
final Matcher m = p.matcher(str);
if (m.find()) { ... }
And finally, Matcher also has .lookingAt() which anchors the regex only at the beginning of the input.
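The three methods mentioned above can be compared on the same pattern. This is a minimal sketch; the class name `EqualsNumberCheck` and its method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper comparing matches(), lookingAt() and find() on "=[0-9]+".
class EqualsNumberCheck {
    private static final Pattern EQ_NUMBER = Pattern.compile("=[0-9]+");

    // Whole input must be "=" followed by digits.
    static boolean isExactly(String s) {
        return EQ_NUMBER.matcher(s).matches();
    }

    // Input must start with "=" and digits; anything may follow.
    static boolean startsWith(String s) {
        return EQ_NUMBER.matcher(s).lookingAt();
    }

    // "=" and digits may appear anywhere in the input.
    static boolean containsAnywhere(String s) {
        return EQ_NUMBER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("=5455"));         // true
        System.out.println(isExactly("=54.55"));        // false
        System.out.println(startsWith("=54.55"));       // true
        System.out.println(startsWith("x=5455"));       // false
        System.out.println(containsAnywhere("x=5455")); // true
    }
}
```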

Match word in String in Java

I'm trying to match Strings that contain the word "#SP" (sans quotes, case insensitive) in Java. However, I'm finding using Regexes very difficult!
Strings I need to match:
"This is a sample #sp string",
"#SP string text...",
"String text #Sp"
Strings I do not want to match:
"Anything with #Spider",
"#Spin #Spoon #SPORK"
Here's what I have so far: http://ideone.com/B7hHkR. Could someone guide me through building my regexp?
I've also tried: "\\w*\\s*#sp\\w*\\s*" to no avail.
Edit: Here's the code from IDEone:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("\\b#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
(edit: positive lookbehind not needed, only matching is done, not replacement)
You are yet another victim of Java's misnamed regex matching methods.
.matches(), quite unfortunately, tries to match the whole input, which is a clear violation of the definition of "regex matching" (a regex can match anywhere in the input). The method you need to use is .find().
This is a braindead API, and unfortunately Java is not the only language having such misguided method names. Python also pleads guilty.
Also, you have the problem that \\b will detect on word boundaries and # is not part of a word. You need to use an alternation detecting either the beginning of input or a space.
Your code would need to look like this (non fully qualified classes):
Pattern p = Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
You're doing fine, but the \b in front of the # is misleading. \b is a word boundary, but # is already not a word character (i.e. it isn't in the set [0-9A-Za-z_]). Therefore, the space before the # isn't considered a word boundary. Change to:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(^|\\s)#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
The (^|\s) means: match either ^ OR \s, where ^ means the beginning of your string (e.g. "#SP String"), and \s means a whitespace character.
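To check the pattern against the sample strings from the question, a small harness might look like this. The class name `SpTagMatcher` and the method `containsSpTag` are hypothetical; the pattern and the sample strings are taken from the discussion above.

```java
import java.util.regex.Pattern;

// Hypothetical harness for the "(^|\\s)#SP\\b" pattern.
class SpTagMatcher {
    private static final Pattern SP_TAG =
            Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);

    static boolean containsSpTag(String text) {
        // find() looks for the tag anywhere in the text.
        return SP_TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "This is a sample #sp string",
            "#SP string text...",
            "String text #Sp",
            "Anything with #Spider",
            "#Spin #Spoon #SPORK"
        };
        // The first three samples should print true, the last two false.
        for (String sample : samples) {
            System.out.println(containsSpTag(sample) + "  " + sample);
        }
    }
}
```

The trailing \b is what rejects #Spider: after "#Sp" comes another word character, so there is no word boundary at that position.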
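The regular expression "\\w*\\s*#sp\\w*\\s*" will match 0 or more word characters, followed by 0 or more whitespace characters, followed by #sp, followed by 0 or more word characters, followed by 0 or more whitespace characters. Because the \\w* after #sp also consumes the rest of a tag such as #Spider, it cannot tell the two apart. My suggestion is to not use \\s* to break words up in your expression; instead, use \\b after the tag. A \\b in front of the # does not help, because # is not a word character, so use a start-of-input-or-whitespace alternative there:
"(^|\\s)#sp\\b"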
