Using regex in Java

I am checking for precedence in a string in my function. Is there a way to change my code to incorporate regex -- I am quite unfamiliar with regex, and though I have read some tutorials online, still not getting 'it' too well. So any guidance would really be appreciated.
For example in a string XLYZ, if the char after X is not L or C, the 'violation' statement gets printed. Here's the code below:
if (subtnString[cnt]=='X' && cnt+1<subtnString.length){
if(subtnString[cnt+1]!= 'L' || subtnString[cnt+1]!= 'C'){
System.out.println("Violation: X can be subtracted from L and C only");
return false;
}
}
Is there a way I can use regex to replace this code?

You can use something like this:
Pattern regex = Pattern.compile("^X[LC]");
Matcher regexMatcher = regex.matcher(subjectString);
if(regexMatcher.find() ) { // it matched!
}
else { // nasty message
}
In the demo, see the strings that match.
Explanation
The ^ anchor asserts that we are at the beginning of the string
X matches the literal X
[LC] is a character class that matches either one L or one C
Reference
Java Class String
Using Regular Expressions in Java

To match text that violates your rule, use this regex:
X[^LC]
See Demo.
To match text that does not violate your rule, use this:
X[LC]
See Demo.
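As a minimal sketch of how the X[^LC] pattern above could stand in for the original loop (the class and method names are made up for illustration, and the input is assumed to be a String rather than a char array):

```java
import java.util.regex.Pattern;

class SubtractionRule {
    // An X followed by any character other than L or C is a violation.
    private static final Pattern VIOLATION = Pattern.compile("X[^LC]");

    /** Returns true when no X in the string is followed by something other than L or C. */
    static boolean isValid(String s) {
        // find() scans the whole string, so every X is checked, not only the first one.
        return !VIOLATION.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("XLYZ")); // true
        System.out.println(isValid("XAYZ")); // false
    }
}
```

Like the original loop, this does not treat an X at the very end of the string as a violation, because [^LC] needs a character to match.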

Related

How to check if a string contains only digits in Java

The Java String class has a method called matches. How can I use it with a regular expression to check whether my string contains only digits? I tried the examples below, but both return false.
String regex = "[0-9]";
String data = "23343453";
System.out.println(data.matches(regex));
String regex = "^[0-9]";
String data = "23343453";
System.out.println(data.matches(regex));
Try
String regex = "[0-9]+";
or
String regex = "\\d+";
As per Java regular expressions, the + means "one or more times" and \d means "a digit".
Note: the "double backslash" is an escape sequence to get a single backslash - therefore, \\d in a java String gives you the actual result: \d
References:
Java Regular Expressions
Java Character Escape Sequences
Edit: due to some confusion in other answers, I am writing a test case and will explain some more things in detail.
Firstly, if you are in doubt about the correctness of this solution (or others), please run this test case:
String regex = "\\d+";
// positive test cases, should all be "true"
System.out.println("1".matches(regex));
System.out.println("12345".matches(regex));
System.out.println("123456789".matches(regex));
// negative test cases, should all be "false"
System.out.println("".matches(regex));
System.out.println("foo".matches(regex));
System.out.println("aa123bb".matches(regex));
Question 1:
Isn't it necessary to add ^ and $ to the regex, so it won't match "aa123bb" ?
No. In java, the matches method (which was specified in the question) matches a complete string, not fragments. In other words, it is not necessary to use ^\\d+$ (even though it is also correct). Please see the last negative test case.
Please note that if you use an online "regex checker" then this may behave differently. To match fragments of a string in Java, you can use the find method instead, described in detail here:
Difference between matches() and find() in Java Regex
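A small sketch of that difference, using the same \\d+ pattern (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** matches() succeeds only if the whole input is consumed by the pattern. */
    static boolean wholeStringIsDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    /** find() succeeds if any fragment of the input fits the pattern. */
    static boolean containsDigits(String s) {
        return DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeStringIsDigits("aa123bb")); // false
        System.out.println(containsDigits("aa123bb"));      // true
    }
}
```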
Question 2:
Won't this regex also match the empty string, ""?
No. A regex \\d* would match the empty string, but \\d+ does not. The star * means zero or more, whereas the plus + means one or more. Please see the first negative test case.
Question 3:
Isn't it faster to compile a regex Pattern?
Yes. It is indeed faster to compile a regex Pattern once, rather than on every invocation of matches, and so if performance implications are important then a Pattern can be compiled and used like this:
Pattern pattern = Pattern.compile(regex);
System.out.println(pattern.matcher("1").matches());
System.out.println(pattern.matcher("12345").matches());
System.out.println(pattern.matcher("123456789").matches());
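You can also use NumberUtils.isNumber(String str) from Apache Commons Lang. Note that it accepts any valid Java number (signs, decimals, hex notation), not only strings of digits; NumberUtils.isDigits(String str) is the digits-only check.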
Using regular expressions is costly in terms of performance. Trying to parse the string as a long value is inefficient and unreliable, and may not be what you need.
What I suggest is to simply check whether each character is a digit, which can be done efficiently using Java 8 lambda expressions:
boolean isNumeric = someString.chars().allMatch(x -> Character.isDigit(x));
One more solution that hasn't been posted yet:
String regex = "\\p{Digit}+"; // uses POSIX character class
You must allow for more than a digit (the + sign) as in:
String regex = "[0-9]+";
String data = "23343453";
System.out.println(data.matches(regex));
Use Long.parseLong(data)
and catch the NumberFormatException; this also handles a leading minus sign.
Although the number of digits is limited to the range of a long, this gives you the parsed value in a variable that can be used, which is, I would imagine, the most common use case.
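A hedged sketch of that parse-and-catch approach; the class name, the method name, and the choice of OptionalLong as the return type are assumptions made for the example:

```java
import java.util.OptionalLong;

class ParseNumber {
    /** Returns the parsed value, or an empty OptionalLong if s is not a valid long. */
    static OptionalLong tryParse(String s) {
        try {
            return OptionalLong.of(Long.parseLong(s));
        } catch (NumberFormatException e) {
            // Thrown for non-numeric text, for null, and for values outside the long range.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("23343453").isPresent()); // true
        System.out.println(tryParse("12a").isPresent());      // false
    }
}
```

Keep in mind that this accepts "-42", so it is not a strict "digits only" check.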
We can use either Pattern.compile("[0-9]+(\\.[0-9]+)?") or Pattern.compile("\\d+(\\.\\d+)?"). They have the same meaning here.
[0-9] means a digit, the same as \d.
+ means one or more times.
\. matches a literal dot (an unescaped . would match any character), and the optional group (...)? allows either an integer or a decimal number.
Try the following code:
import java.util.regex.Pattern;
public class PatternSample {
    // Matches an integer or a decimal number; "\\d+(\\.\\d+)?" is equivalent.
    private static final Pattern NUMBER = Pattern.compile("[0-9]+(\\.[0-9]+)?");
    public boolean containNumbersOnly(String source) {
        // matches() requires the whole string to fit the pattern.
        boolean result = NUMBER.matcher(source).matches();
        if (result) {
            System.out.println("\"" + source + "\"" + " is a number");
        } else {
            System.out.println("\"" + source + "\"" + " is a String");
        }
        return result;
    }
    public static void main(String[] args) {
        PatternSample obj = new PatternSample();
        obj.containNumbersOnly("123456.a");
        obj.containNumbersOnly("123456 ");
        obj.containNumbersOnly("123456");
        obj.containNumbersOnly("0123456.0");
        obj.containNumbersOnly("0123456a.0");
    }
}
Output:
"123456.a" is a String
"123456 " is a String
"123456" is a number
"0123456.0" is a number
"0123456a.0" is a String
Adapted from the regular expression given in Oracle's Javadoc for Double.valueOf(String):
private static final Pattern NUMBER_PATTERN = Pattern.compile(
"[\\x00-\\x20]*[+-]?(NaN|Infinity|((((\\p{Digit}+)(\\.)?((\\p{Digit}+)?)" +
"([eE][+-]?(\\p{Digit}+))?)|(\\.((\\p{Digit}+))([eE][+-]?(\\p{Digit}+))?)|" +
"(((0[xX](\\p{XDigit}+)(\\.)?)|(0[xX](\\p{XDigit}+)?(\\.)(\\p{XDigit}+)))" +
"[pP][+-]?(\\p{Digit}+)))[fFdD]?))[\\x00-\\x20]*");
boolean isNumber(String s) {
    return NUMBER_PATTERN.matcher(s).matches();
}
Refer to org.apache.commons.lang3.StringUtils
public static boolean isNumeric(CharSequence cs) {
if (cs == null || cs.length() == 0) {
return false;
} else {
int sz = cs.length();
for(int i = 0; i < sz; ++i) {
if (!Character.isDigit(cs.charAt(i))) {
return false;
}
}
return true;
}
}
The Java String class has a method called matches(). With it you can check your string against a regex.
String regex = "^[\\d]{4}$";
String value = "1234";
System.out.println(value.matches(regex));
The explanation of the above regex is:
^ - Anchors the match at the start of the string.
[] - A character class; list the characters you want to allow inside it.
\\d - Allows only digits. Inside the brackets, \\d and 0-9 are equivalent.
{4} - Requires exactly 4 digits. You can change the number according to your need.
$ - Anchors the match at the end of the string.
Note: You can remove the {4} and specify + which means one or more times, or * which means zero or more times, or ? which means once or none.
For more reference please go through this website: https://www.rexegg.com/regex-quickstart.html
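To make the quantifiers concrete, here is a small sketch; the helper isDigits and the class name are hypothetical, and the helper simply appends the given quantifier to \d:

```java
class QuantifierDemo {
    /** Checks whether s consists of digits repeated as the quantifier demands. */
    static boolean isDigits(String s, String quantifier) {
        // String.matches tests the whole string, so ^ and $ are not needed here.
        return s.matches("\\d" + quantifier);
    }

    public static void main(String[] args) {
        System.out.println(isDigits("1234", "{4}")); // exactly four digits: true
        System.out.println(isDigits("123", "{4}"));  // only three digits: false
        System.out.println(isDigits("", "*"));       // zero or more: true
        System.out.println(isDigits("", "+"));       // one or more: false
        System.out.println(isDigits("77", "?"));     // at most one: false
    }
}
```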
Official regex way
I would use this regex for integers:
^[-1-9]\d*$
This will also work in other programming languages because it's more specific and doesn't make any assumptions about how different programming languages may interpret or handle regex. Be aware that, as written, it also accepts a lone "-" and rejects "0"; ^-?\d+$ avoids both.
Also works in Java
\\d+
Questions regarding ^ and $
As @vikingsteve has pointed out, in Java the matches method matches a complete string, not parts of a string. In other words, it is unnecessary to use ^\d+$ (even though it is the conventional way to write the regex).
Online regex checkers typically search for a match anywhere in the input, so they behave differently from Java's matches().
Try this code:
void containsOnlyNumbers(String str) {
    try {
        // Throws NumberFormatException unless str is a valid int (a leading sign is allowed).
        Integer.parseInt(str);
        System.out.println("is a number");
    } catch (NumberFormatException e) {
        System.out.println("is not a number");
    }
}

java regular expression not working

I am trying to match input data from the user and search for matches of this input.
For example, if the user types: A*B*C*
I want to find all words which start with A and contain B and C.
I tried this code and it's not working (the output is false):
public static void main(String[] args)
{
String envVarRegExp = "^A[^\r\n]B[^\r\n]C[^\r\n]";
Pattern pattern = Pattern.compile(envVarRegExp);
Matcher matcher = pattern.matcher("AmBmkdCkk");
System.out.println(matcher.find());
}
Thanks.
You don't really need Regex here. Simple String class methods will work: -
String str = "AfasdBasdfCa";
if (str.startsWith("A") && str.contains("B") && str.contains("C")) {
System.out.println("true");
}
Note that this will not ensure that your B and C are in specific order, which I assume you don't need as you have not mentioned anything about that.
If you want them in a specific order (for example, B before C), use this regex:
if (str.matches("^A.*B.*C.*$")) {
System.out.println("true");
}
Note that . matches any character except line terminators, so you can use it instead of [^\r\n]; it's clearer. You also need the quantifier *, because any number of characters may appear before B or C is found.
Also, String.matches matches the complete string, and hence the anchors at the ends.
I think you should use the * quantifier in your regex like this (for 0 or more characters between A and B, and then between B and C):
String envVarRegExp = "^A[^\r\n]*B[^\r\n]*C";
EDIT: It appears that you're working off the input coming from your user where user can use asterisk * in inputs. If that is the case consider this:
String envVarRegExp = userInput.replace("*", ".*?");
Where userInput is String like this:
String userInput = "a*b*c*d*e";
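One caveat with a plain replace is that any other regex metacharacter the user types (a dot, a bracket) keeps its special meaning. A sketch that quotes the literal parts, assuming * is the only wildcard the user may enter; the class and method names are invented for the example:

```java
import java.util.regex.Pattern;

class WildcardMatcher {
    /** Converts user input where '*' means "any run of characters" into a regex. */
    static Pattern toRegex(String userInput) {
        StringBuilder regex = new StringBuilder();
        // A limit of -1 keeps trailing empty strings, so a trailing '*' is preserved.
        String[] parts = userInput.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' in the input becomes ".*"
            }
            regex.append(Pattern.quote(parts[i])); // everything else is matched literally
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String userInput, String candidate) {
        return toRegex(userInput).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("A*B*C*", "AmBmkdCkk")); // true
        System.out.println(matches("a.c*", "abcd"));        // false: the dot is literal
    }
}
```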
You need to add quantifiers to your character classes:
String envVarRegExp = "^A[^\r\n]*B[^\r\n]*C[^\r\n]*$";

Java regex: check if word has non alphanumeric characters

This is my code to determine if a word contains any non-alphanumeric characters:
String term = "Hello-World";
boolean found = false;
Pattern p = Pattern.Compile("\\W*");
Matcher m = p.Matcher(term);
if(matcher.find())
found = true;
I am wondering if the regex is wrong. I know that "\W" matches any non-word character. Any idea what I am missing?
Change your regex to:
.*\\W+.*
This is the expression you are looking for:
"^[a-zA-Z0-9]+$"
When it evaluates to false, the string is not purely alphanumeric, which means you found what you were looking for.
It's 2016 or later, and you should think about international strings from alphabets other than Latin. The frequently cited [^a-zA-Z] treats every non-Latin letter as a non-letter. There are better ways in Java now:
[^\\p{IsAlphabetic}\\p{IsDigit}]
See the reference (section "Classes for Unicode scripts, blocks, categories and binary properties"). There's also this answer that I found helpful.
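A sketch of a Unicode-aware check along those lines. Note that inside a character class only a leading ^ negates; a ^ later in the class is a literal caret, so the sketch lists both properties after a single ^. The class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class UnicodeAlnumCheck {
    // Any character that is neither alphabetic nor a digit, in any script.
    private static final Pattern NON_ALNUM =
            Pattern.compile("[^\\p{IsAlphabetic}\\p{IsDigit}]");

    static boolean hasNonAlphanumeric(String s) {
        return NON_ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasNonAlphanumeric("Hello-World")); // true
        // Cyrillic word written with Unicode escapes to avoid source-encoding issues.
        System.out.println(hasNonAlphanumeric("\u041f\u0440\u0438\u0432\u0435\u0442")); // false
    }
}
```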
Methods are in the wrong case.
The matcher was declared as m but used as matcher.
The repetition should be "one or more" (+) instead of "zero or more" (*).
This works correctly:
String term = "Hello-World";
boolean found = false;
Pattern p = Pattern.compile("\\W+");//<-- compile( not Compile(
Matcher m = p.matcher(term); //<-- matcher( not Matcher
if(m.find()) { //<-- m not matcher
found = true;
}
By the way, it would be enough to write:
boolean found = m.find();
:)
The problem is the '*'. '*' matches ZERO or more characters. You want to match at least one non word character, so you must use '+' as the quantity modifier. Hence match \W+ (Capital W there for NON word)
Your expression does not take account of possible non-English letters. It's also more complicated than it needs to be. Unless you are using regexes for some reason other than need (such as your professor having told you to), you are much better off with:
boolean found = false;
for (int i=0;i<mystring.length();++i) {
if (!Character.isLetterOrDigit(mystring.charAt(i))) {
found=true;
break;
}
}
When I had to do this same thing, the regex I used was "(\w)*", with parentheses. Note that the capital \W is not the same: lowercase \w matches word characters, while \W matches non-word characters.
If you are okay with using Apache Commons StringUtils, it's as simple as the following:
StringUtils.isAlphanumeric(inp)
if (value.matches(".*[^a-zA-Z0-9].*")) { // tested, seems to work.
System.out.println("match");
} else {
System.out.println("no match");
}

How do I know if a regexp has more than one possible match?

I am writing Java code that has to distinguish regular expressions with more than one possible match from regular expressions that have only one possible match.
For example:
"abc." can have several matches ("abc1", "abcf", ...),
while "abcd" can only match "abcd".
Right now my best idea was to look for all unescaped regexp special characters.
I am convinced that there is a better way to do it in Java. Ideas?
(Late addition):
To make things clearer - there is NO specific input to test against. A good solution for this problem will have to test the regex itself.
In other words, I need a method whose signature may look something like this:
boolean isSingleResult(String regex)
This method should return true if and only if there is exactly one possible String s1 for which s1.matches(regex) returns true. (See examples above.)
This sounds dirty, but it might be worth having a look at the Pattern class in the Java source code.
Taking a quick peek, it seems like it 'normalize()'s the given regex (Line 1441), which could turn the expression into something a little more predictable. I think reflection can be used to tap into some private resources of the class (use caution!). It could be possible that, while tokenizing the regex pattern, there are specific indications if it has reached some kind of "multi-matching" element in the pattern.
Update
After having a closer look, there is some data within package scope that you can use to leverage the work of the Pattern tokenizer to walk through the nodes of the regex and check for multiple-character nodes.
After compiling the regular expression, iterate through the compiled "Node"s starting at Pattern.root. Starting at line 3034 of the class, there are the generalized types of nodes. For example class Pattern.All is multi-matching, while Pattern.SingleI or Pattern.SliceI are single-matching, and so on.
All these token classes appear to be in package scope, so it should be possible to do this without using reflection, but instead creating a java.util.regex.PatternHelper class to do the work.
Hope this helps.
If it can only have one possible match it isn't reeeeeally an expression, now, is it? I suspect your best option is to use a different tool altogether, because this does not at all sound like a job for regular expressions, but if you insist, well, no, I'd say your best option is to look for unescaped special characters.
The only regular expression that can ONLY match one input string is one that specifies the string exactly. So you need to match expressions with no wildcard characters or character groups AND that specify a start "^" and end "$" anchor.
"the quick" matches:
"the quick brownfox"
"the quick brown dog"
"catch the quick brown fox"
"^the quick brown fox$" matches ONLY:
"the quick brown fox"
Now I understand what you mean. I live in Belgium...
So this is something that works on most expressions. I wrote it myself, so I may have forgotten some rules.
public static final boolean isSingleResult(String regexp) {
    // Predefined character classes are escaped yet still match many characters, so reject them first.
    String[] exconexc = "\\d \\D \\w \\W \\s \\S".split(" ");
    for (String s : exconexc) {
        if (regexp.indexOf(s) != -1) { // multi-matching class found
            return false;
        }
    }
    // Then remove all remaining escaped characters; they are plain literals.
    String regex = regexp.replaceAll("\\\\.", "");
    // Now look for every token that can make the regex match more than one string.
    String[] mtom = "+ . ? | * { [:alnum:] [:word:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]".split(" ");
    for (String s : mtom) {
        if (regex.indexOf(s) != -1) { // multi-matching token found
            return false;
        }
    }
    return true;
}
Martijn
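A more conservative variant of the same idea, as a sketch only: scan the regex once, skip escaped punctuation, and reject anything else that could be special. It assumes the regex is compiled without flags, and it can answer false for regexes that do have a single match (for example "^abc$" or "a\tb"); the intent is that it should not answer true for one that has several. The class name is hypothetical.

```java
class SingleMatchCheck {
    // Characters with a special meaning outside a character class.
    private static final String META = ".[]{}()*+?|^$";

    static boolean isSingleResult(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                if (i + 1 >= regex.length()) {
                    return false; // a trailing backslash is not a valid regex
                }
                char next = regex.charAt(i + 1);
                // Backslash plus letter or digit may be a class (\d), an anchor (\b),
                // a back-reference (\1) or a control escape (\t): treated as "not single".
                if (Character.isLetterOrDigit(next)) {
                    return false;
                }
                i++; // escaped punctuation is a plain literal, skip it
            } else if (META.indexOf(c) >= 0) {
                return false; // unescaped metacharacter
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSingleResult("abcd"));   // true
        System.out.println(isSingleResult("abc."));   // false
        System.out.println(isSingleResult("abc\\.")); // true
    }
}
```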
The only way I see is to check whether the regex matches multiple times for a particular input.
package com;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class AAA {
public static void main(String[] args) throws Exception {
String input = "123 321 443 52134 432";
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher(input);
int i = 0;
while (matcher.find()) {
++i;
}
System.out.printf("Matched %d times%n", i);
}
}

Find ASCII "arrows" in text

I'm trying to find all the occurrences of "Arrows" in text, so in
"<----=====><==->>"
the arrows are:
"<----", "=====>", "<==", "->", ">"
This works:
String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
for (String p : patterns) {
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
}
but this doesn't:
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
No idea why. It often reports "<" instead of "<====" or similar.
What is wrong?
Solution
The following program is one possible solution to the question:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class A {
public static void main( String args[] ) {
String p = "<=+|<-+|=+>|-+>|<|>";
Matcher m = Pattern.compile(p).matcher(args[0]);
while (m.find()) {
System.out.println(m.group());
}
}
}
Run #1:
$ java A "<----=====><<---<==->>==>"
<----
=====>
<
<---
<==
->
>
==>
Run #2:
$ java A "<----=====><=><---<==->>==>"
<----
=====>
<=
>
<---
<==
->
>
==>
Explanation
An asterisk will match zero or more of the preceding characters. A plus (+) will match one or more of the preceding characters. Thus <-* matches < whereas <-+ matches <- and any extended version (such as <--------).
When you match "<=*|<-*|=*>|-*>" against the string "<---", it matches the first part of the pattern, "<=*", because * includes zero or more. Java matching is greedy, but it isn't smart enough to know that there is another possible longer match, it just found the first item that matches.
Your first solution will match everything that you are looking for because you send each pattern into matcher one at a time and they are then given the opportunity to work on the target string individually.
Your second attempt will not work in the same manner because you are putting in a single pattern with multiple expressions OR'ed together, and alternation is ordered: the leftmost alternative is attempted first. If it matches, no matter how minimal the match, find() will accept that match and continue on from there.
See Thangalin's response for a solution that will make the second work like the first.
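The effect of alternative order can be seen in a small sketch (the helper findAll and the class name are made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    /** Collects every match that find() reports, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // <=* is tried first and succeeds with zero '=' signs, so only "<" is reported.
        System.out.println(findAll("<=*|<-*", "<---")); // [<]
        // With the alternatives swapped, <-* is tried first and takes the whole arrow.
        System.out.println(findAll("<-*|<=*", "<---")); // [<---]
    }
}
```

Using + instead of * sidesteps the ordering problem, because an alternative can then no longer succeed on a bare < or >.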
For <======= you need <=+ as the regex. <=* matches zero or more ='s, so it also succeeds on a bare <, which is what you get whenever the < is followed by something other than =. The same goes for the other cases you have. You should read up a bit on regexes. This book is FANTASTIC:
Mastering Regular Expressions
Your regex pattern handles the = arrows in your example, "<----=====><==->>", but it reports "<" for the leading "<----":
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
A minimal case shows where it breaks: the input "<-" yields "<", whereas "<=" yields "<=" as it should. The first alternative, <=*, already succeeds on the bare "<", so <-* is never tried.
