Regex for validating alphabetics and numbers in the localized string - java

I have a localized input field. I need to validate with a regex that it accepts only letters and digits. If the input were English only, I could have used [a-z0-9].
As of now, I am using the method Character.isLetterOrDigit(name.charAt(i)) (yes, I am iterating through each character) to accept letters from various languages.
Are there any better ways of doing it? Any regex or other libraries available for this?

Since Java 7 you can use Pattern.UNICODE_CHARACTER_CLASS
String s = "Müller";
Pattern p = Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group());
} else {
System.out.println("not found");
}
Without the option, the pattern will not recognize the word "Müller". Using Pattern.UNICODE_CHARACTER_CLASS
Enables the Unicode version of Predefined character classes and POSIX character classes.
See here for more details
You can also have a look here for more Unicode information in Java 7.
and here on regular-expressions.info for an overview of the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will change in Java 8).

boolean foundMatch = name.matches("[\\p{L}\\p{Nd}]*");
should work.
[\p{L}\p{Nd}] matches a character that is either a Unicode letter or digit. The regex .matches() method ensures that the entire string matches the pattern.
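A minimal sketch of how this check could be wrapped for reuse. The class and method names are made up for illustration, and rejecting the empty string (by using + instead of *) is an assumption about what the form needs, not something the question states.

```java
import java.util.regex.Pattern;

class LocalizedNameValidator {
    // Compiled once and reused: \p{L} is any Unicode letter, \p{Nd} any decimal digit.
    // Uses + rather than *, so an empty string is rejected (an assumption).
    private static final Pattern LETTERS_AND_DIGITS =
            Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Returns true only when the whole string consists of letters and decimal digits.
    static boolean isValid(String name) {
        return name != null && LETTERS_AND_DIGITS.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u00fcller")); // "Müller": prints true
        System.out.println(isValid("a b"));         // contains a space: prints false
    }
}
```

Note that a letter written as a base character plus a combining mark (a decomposed "ü") would not pass, because combining marks are not in \p{L}; normalizing the input first may be needed if that matters.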

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
-- Jamie Zawinski
I say this in jest, but iterating through the String like you are doing will have runtime performance at least as good as any regex — there's no way a regex can do what you want any faster; and you don't have the overhead of compiling a pattern in the first place.
So as long as:
the validation doesn't need to do anything else regex-like (nothing was mentioned in the question)
the intention of the code looping through the String is clear (and if not, refactor until it is)
then why replace it with a regex just because you can?
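If you keep the loop, one way to make its intent obvious is to put it behind a small named method. This is only a sketch: the names are hypothetical, it assumes Java 8 for String.codePoints(), and treating an empty string as invalid is an assumption. Iterating by code point rather than by char also covers characters outside the Basic Multilingual Plane.

```java
class LetterOrDigitCheck {
    // Walks the string by code point so that supplementary characters
    // (encoded as surrogate pairs) are tested as a single unit.
    static boolean isLettersAndDigits(String name) {
        if (name == null || name.isEmpty()) {
            return false; // assumption: empty input is not valid
        }
        return name.codePoints().allMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndDigits("M\u00fcller1")); // prints true
        System.out.println(isLettersAndDigits("a-b"));          // prints false
    }
}
```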

Finding whole word only in Java string search

I'm running into the problem of finding a searched pattern within a larger pattern in my Java program. For example, I'll try and find all for loops, but will stumble upon formula. Most of the suggestions I've found talk about using regular expression searches like
String regex = "\\b"+keyword+"\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(searchString);
or some variant of this. The issue I'm running into is that I'm crawling through code, not a book-like text where there are spaces on either side of every word. For example, this will miss for(, which I would like to find. Is there another clever way to find whole words only?
Edit: Thanks for the suggestions. How about cases in which the keyword starts at the very beginning of the string? For example,
class Vec {
public:
...
};
where I'm searching for class (or alternatively public). The patterns suggested by Thanga, Austin Lee, npinti, and Kai Iskratsch do not work in this case. Any ideas?
In your case, the issue is that the \b flag will look for punctuation marks, white spaces and the beginning or end of the string. An opening bracket does not fall within any of these categories, and is thus omitted.
The easiest way to fix this would be to replace "\\b"+keyword+"\\b" with "[\\b(]"+keyword+"[\\b)]".
In regex syntax, the square brackets denote a set of which the regex engine will attempt to match any character it contains.
As per this previous SO question, it would seem that \b and [\b] are not the same. Whilst \b represents a word boundary, [\b] represents a backspace character. To fix this, simply replace "\\b"+keyword+"\\b" with "(\\b|\\()"+keyword+"(\\b|\\))".
Regex should match 0 or more chars. The below code change will fix the issue
String regex = ".*("+keyword+").*";
You could modify your regex to search for multiple characters afterwards, for example
[^\w]+"for"+[^\w] using the Pattern class in Java.
For your reference:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Basically you will have to adapt your regex to all the possible patterns it can find. But considering you're actually dealing with code, you are better off building a parser/tokenizer for that language, or using one that already exists. Then all you have to do is run through the tokens to find the ones you want.
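It may be worth testing the plain \b version before replacing it. As far as the Pattern documentation describes it, \b matches at a position between a word character and a non-word character (or the start/end of input), so a keyword followed by ( or located at index 0 should still be found. Below is a small sketch to check this; the class and method names are invented for the example, and Pattern.quote is added so that keywords containing regex metacharacters are taken literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {
    // Returns the start offset of every whole-word occurrence of keyword in text.
    static List<Integer> findWholeWord(String text, String keyword) {
        // Pattern.quote makes the keyword literal; \b anchors it at word boundaries.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b");
        Matcher m = p.matcher(text);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // "formula" is skipped, "for(" is found at offset 9.
        System.out.println(findWholeWord("formula; for(int i = 0;;) {}", "for")); // prints [9]
        // A keyword at the very start of the input is found at offset 0.
        System.out.println(findWholeWord("class Vec {", "class"));               // prints [0]
    }
}
```

This still cannot tell a keyword from the same word inside a comment or string literal, which is where a tokenizer remains the more robust option.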

Regex extract string in java

I'm trying to extract a string from a String in Regex Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where I'm going wrong, but I don't get any output from
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a new line, unimportant how often. It is capturing, so your first matching group will always be everything before <_1:InsurerId>, or an empty string. You can match any character instead, it will include new lines: .*. You can even leave it away as it isn't actually part of the String you want to match - using anything here will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing, just leave it away
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Using back reference the matches() fails, for example, on this text
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is correct
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.
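As a rough illustration of the parser route, here is a sketch using the JDK's built-in DOM parser. The wrapper element and the urn:example namespace URI are invented so that the fragment is a well-formed document; the real file will declare its own namespace for the _1 prefix. The class and method names are hypothetical as well.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class InsurerIdReader {
    // Returns the text of the first InsurerId element, or null if there is none.
    static String firstInsurerId(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // treat "_1" as a prefix, not part of the name
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "*" accepts any namespace; "InsurerId" is the local element name.
            NodeList nodes = doc.getElementsByTagNameNS("*", "InsurerId");
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical wrapper and namespace URI, added only to make the sample parseable.
        String xml = "<root xmlns:_1=\"urn:example\">"
                + "<_1:InsurerId>F2021633_V1</_1:InsurerId></root>";
        System.out.println(firstInsurerId(xml)); // prints F2021633_V1
    }
}
```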

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example. I want all matches for "face" or "faces" but not "facebook". I can accept whitespace around the word, so matches like " face", " faces", "face " or "faces " are fine too. However, I cannot accept "duckface" or "duckface ", etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is taken from my list of keywords, but I am not getting the desired results. Can you read my description and please suggest what the issue might be and how I can fix it?
Also if a pointer to a really good regex for Java page is shared I would appreciate that as well.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
where keyword is not plural. \\b means that keyword must appear as a complete word in the searched string. s? means that the keyword may optionally end with s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because it has examples and explanations.
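To see how the word-boundary pattern behaves on the cases from the question, here is a short sketch. The class and method names are made up for the example, and Pattern.quote is an addition that keeps keywords containing regex metacharacters from being misread.

```java
import java.util.regex.Pattern;

class KeywordMatcher {
    // True if text contains keyword (optionally followed by "s") as a whole word.
    static boolean containsKeyword(String text, String keyword) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsKeyword("two faces here", "face")); // prints true
        System.out.println(containsKeyword("my facebook page", "face")); // prints false
        System.out.println(containsKeyword("a duckface selfie", "face")); // prints false
    }
}
```

For a long input and many keywords, it may be cheaper to compile each pattern once and reuse it rather than compiling inside the method on every call.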

How do I translate this Perl regular expression into Java?

How would you translate this Perl regex into Java?
/pattern/i
While this compiles, it does not match "PattErn" for me; it fails:
Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");
System.out.println(m.matches()); // prints "false"
How would you translate this Perl regex into Java?
/pattern/i
You can't.
There are a lot of reasons for this. Here are a few:
Java doesn't support as expressive a regex language as Perl does. It lacks grapheme support (like \X) and full property support (like \p{Sentence_Break=SContinue}), is missing Unicode named characters, doesn't have a (?|...|...|) branch reset operator, doesn’t have named capture groups or a logical \x{...} escape before Java 7, has no recursive regexes, etc etc etc. I could write a book on what Java is missing here: Get used to going back to a very primitive and awkward to use regex engine compared with what you’re used to.
Another even worse problem is that you have lookalike faux amis like \w and \b and \s, and even \p{alpha} and \p{lower}, which behave differently in Java compared with Perl; in some cases the Java versions are completely unusable and buggy. That’s because Perl follows UTS#18 but before Java 7, Java did not. You must add the UNICODE_CHARACTER_CLASS flag from Java 7 to get these to stop being broken. If you can’t use Java 7, give up now, because Java had many many many other Unicode bugs before Java 7 and it just isn’t worth the pain of dealing with them.
Java handles linebreaks via ^ and $ and ., but Perl expects Unicode linebreaks to be \R. You should look at UNIX_LINES to understand what is going on there.
Java does not by default apply any Unicode casefolding whatsoever. Make sure to add the UNICODE_CASE flag to your compilation. Otherwise you won’t get things like the various Greek sigmas all matching one another.
Finally, it is different because at best Java only does simple casefolding, while Perl always does full casefolding. That means that you won’t get \xDF to match "SS" case insensitively in Java, and similar related issues.
In summary, the closest you can get is to compile with the flags
CASE_INSENSITIVE | UNICODE_CASE | UNICODE_CHARACTER_CLASS
which is equivalent to an embedded "(?iuU)" in the pattern string.
And remember that match in Java doesn’t mean match, perversely enough.
EDIT
And here’s the rest of the story...
While this compiles, it does not match "PattErn" for me; it fails:
Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");
System.out.println(m.matches()); // prints "false"
You shouldn’t have slashes around the pattern.
The best you can do is to translate
$line = "I have your PaTTerN right here";
if ($line =~ /pattern/i) {
print "matched.\n";
}
this way
import java.util.regex.*;
String line = "I have your PaTTerN right here";
String pattern = "pattern";
// The flag constants live on Pattern, so they are qualified here
Pattern regcomp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE
| Pattern.UNICODE_CASE
// comment next line out for legacy Java \b\w\s breakage
| Pattern.UNICODE_CHARACTER_CLASS
);
Matcher regexec = regcomp.matcher(line);
if (regexec.find()) {
System.out.println("matched");
}
There, see how much easier that isn’t? :)
Java regexes do not have delimiters, and use a separate argument for modifiers:
Pattern p = Pattern.compile("pattern", Pattern.CASE_INSENSITIVE);
The Perl equivalent of:
/pattern/i
in Java would be:
Pattern p = Pattern.compile("(?i)pattern");
Or simply do:
System.out.println("PattErn".matches("(?i)pattern"));
Note that "string".matches("pattern") validates the pattern against the entire input string. In other words, the following would return false:
"foo pattern bar".matches("pattern")

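To make the difference between matching the whole input and searching inside it concrete, here is a small sketch. The class and method names are illustrative only; find() is the closer analogue of Perl's $line =~ /pattern/i, while matches() requires the entire string to match.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // No delimiters in the pattern string; the /i modifier becomes a flag.
    private static final Pattern P = Pattern.compile(
            "pattern", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Searches anywhere in the input, like Perl's =~ operator.
    static boolean containsPattern(String s) {
        return P.matcher(s).find();
    }

    // Succeeds only if the entire input matches.
    static boolean isExactlyPattern(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsPattern("foo PattErn bar"));  // prints true
        System.out.println(isExactlyPattern("foo PattErn bar")); // prints false
        System.out.println(isExactlyPattern("PattErn"));         // prints true
    }
}
```
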
Simple regex required

I've never used regexes in my life and by jove it looks like a deep pool to dive into. Anyway,
I need a regex for this pattern (AN is alphanumeric (a-z or 0-9), N is numeric (0-9) and A is alphabetic (a-z)):
AN,AN,AN,AN,AN,N,N,N,N,N,N,AN,AN,AN,A,A
That's five AN's, followed by six N's, followed by three AN's, followed finally by two A's.
If it makes a difference, the language I'm using is Java.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
should work in most RE dialects for the tasks as you specified it -- most of them will also support abbreviations such as \d (digit) in lieu of [0-9] (but if alphabetics need to be lowercase, as you appear to be requesting, you'll probably need to spell out the a-z parts).
Replace each AN by [a-z0-9], each N by [0-9], and each A by [a-z].
30 seconds in Expresso:
[a-zA-Z0-9]{5}[0-9]{6}[a-zA-Z0-9]{3}[a-zA-Z]{2}
Case insensitive, but you can probably define that in Java instead of the regex.
For the example you posted, the following should work fine.
(([A-Za-z\d])*,){5}+(([\d])*,){6}+(([A-Za-z\d])*,){3}+([\d])*,[\d]*
In Java you should be able use it like this:
boolean foundMatch = subjectString.matches("(([A-Za-z\\d])*,){5}+(([\\d])*,){6}+(([A-Za-z\\d])*,){3}+([\\d])*,[\\d]*");
I used this tool to help me learn regex; it also makes this kind of task really easy.
http://www.regexbuddy.com/
Try looking at some simple java regex tutorials such as this
They'll tell you how you form regular expressions and also how to use it in java.
This should match the pattern you request.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
In addition, you could add Beginning of String / End of String matches, if your string match should fail if any other chars are in it:
^[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}$
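In Java this could look roughly like the sketch below. The class and method names and the sample codes are invented for illustration. Because matches() already requires the whole input to match, the ^ and $ anchors are optional when it is used.

```java
import java.util.regex.Pattern;

class CodeFormatValidator {
    // 5 alphanumerics, 6 digits, 3 alphanumerics, 2 letters (lowercase only).
    private static final Pattern FORMAT =
            Pattern.compile("[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}");

    static boolean isValid(String code) {
        // matches() checks the entire string, so no ^...$ is needed here.
        return code != null && FORMAT.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ab12c123456x9zqq"));  // prints true
        System.out.println(isValid("ab12c12345ax9zqq"));  // letter in the digit run: prints false
        System.out.println(isValid("ab12c123456x9zqq!")); // trailing character: prints false
    }
}
```

If uppercase input should also pass, compiling with Pattern.CASE_INSENSITIVE is one option instead of widening every character class.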
