Negative Lookaround Regex - Only one occurrence - Java

I am trying to check whether a string contains exactly one occurrence of a word,
e.g.
String : `jjdhfoobarfoo` , Regex : `foo` --> false
String : `wewwfobarfoo` , Regex : `foo` --> true
String : `jjfffoobarfo` , Regex : `foo` --> true
Multiple foo's may occur anywhere in the string, so they can be non-consecutive.
I tested the following regex in Java with the string foobarfoo, but it doesn't work; it returns true:
static boolean testRegEx(String str){
return str.matches(".*(foo)(?!.*foo).*");
}
I know this topic may seem like a duplicate, but I am surprised because when I use this regex: (foo)(?!.*foo).* it works!
Any idea why this happens?

Use two anchored look-aheads:
static boolean testRegEx(String str){
return str.matches("^(?=.*foo)(?!.*foo.*foo.*$).*");
}
The key points are that the negative look-ahead, which checks for two foo's, is anchored to the start and, importantly, contains an end-of-input anchor.
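A minimal sketch of how the same two look-aheads could be built for an arbitrary literal word. The class and method names are made up for illustration, and it assumes the word should be matched literally (hence Pattern.quote) and that the input has no line breaks (otherwise the DOTALL flag would be needed, since . does not match line terminators by default).

```java
import java.util.regex.Pattern;

final class ExactlyOnceLookahead {
    // Hypothetical helper: true when `word` occurs exactly once in `text`.
    static boolean containsExactlyOnce(String text, String word) {
        String q = Pattern.quote(word); // treat the word as a literal, not as a regex
        // (?=.*q)       : at least one occurrence somewhere ahead
        // (?!.*q.*q.*$) : but not two occurrences
        String regex = "^(?=.*" + q + ")(?!.*" + q + ".*" + q + ".*$).*";
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(containsExactlyOnce("jjdhfoobarfoo", "foo")); // false
        System.out.println(containsExactlyOnce("wewwfobarfoo", "foo"));  // true
        System.out.println(containsExactlyOnce("jjfffoobarfo", "foo"));  // true
    }
}
```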

If you want to check whether a string contains another string exactly once, here are two possible solutions (one with regex, one without):
static boolean containsRegexOnlyOnce(String string, String regex) {
Matcher matcher = Pattern.compile(regex).matcher(string);
return matcher.find() && !matcher.find();
}
static boolean containsOnlyOnce(String string, String substring) {
int index = string.indexOf(substring);
if (index != -1) {
return string.indexOf(substring, index + substring.length()) == -1;
}
return false;
}
All of them work fine. Here's a demo of your examples:
String str1 = "jjdhfoobarfoo";
String str2 = "wewwfobarfoo";
String str3 = "jjfffoobarfo";
String foo = "foo";
System.out.println(containsOnlyOnce(str1, foo)); // false
System.out.println(containsOnlyOnce(str2, foo)); // true
System.out.println(containsOnlyOnce(str3, foo)); // true
System.out.println(containsRegexOnlyOnce(str1, foo)); // false
System.out.println(containsRegexOnlyOnce(str2, foo)); // true
System.out.println(containsRegexOnlyOnce(str3, foo)); // true

You can use this pattern:
^(?>[^f]++|f(?!oo))*foo(?>[^f]++|f(?!oo))*$
It's a bit long but performant.
The same with the classical example of the ashdflasd string:
^(?>[^a]++|a(?!shdflasd))*ashdflasd(?>[^a]++|a(?!shdflasd))*$
details:
(?> # open an atomic group
[^f]++ # all characters but f, one or more times (possessive)
| # OR
f(?!oo) # f not followed by oo
)* # close the group, zero or more times
The possessive quantifier ++ is like the greedy quantifier + but doesn't allow backtracking.
The atomic group (?>..) is like a non-capturing group (?:..) but doesn't allow backtracking either.
These features are used here for performance (memory and speed), but the subpattern can be replaced by:
(?:[^f]+|f(?!oo))*
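To make the "doesn't allow backtracking" part concrete, here is a small sketch on a toy pattern (not the foo pattern itself; the class name is invented for the example). The greedy version can give a character back so the trailing a can match; the possessive and atomic versions cannot.

```java
final class PossessiveDemo {
    // Greedy: a+ first takes "aaa", then gives one 'a' back so the final 'a' can match.
    static boolean greedy(String s) {
        return s.matches("a+a");
    }

    // Possessive: a++ takes "aaa" and never gives anything back, so the final 'a' fails.
    static boolean possessive(String s) {
        return s.matches("a++a");
    }

    // Atomic group: same effect as the possessive quantifier here.
    static boolean atomic(String s) {
        return s.matches("(?>a+)a");
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));     // true
        System.out.println(possessive("aaa")); // false
        System.out.println(atomic("aaa"));     // false
    }
}
```

In the foo pattern this costs nothing, because the alternatives inside the group can never match a different split of the same text; giving characters back would only waste time.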

The problem with your regex is that the first .* initially consumes the whole string, then backs off until it finds a spot where the rest of the regex can match. That means, if there's more than one foo in the string, your regex will always match the last one. And from that position, the lookahead will always succeed as well.
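A quick way to see this is to ask the Matcher where the captured foo starts. This is only an illustrative sketch; the class and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyLeadDemo {
    // Returns the start index of the "foo" captured by group 1,
    // or -1 if the pattern does not match the whole input.
    static int matchedFooStart(String input) {
        Matcher m = Pattern.compile(".*(foo)(?!.*foo).*").matcher(input);
        return m.matches() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        // The leading .* backs off only as far as the LAST foo (index 6),
        // so the look-ahead has nothing left to reject.
        System.out.println(matchedFooStart("foobarfoo")); // 6
    }
}
```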
Regexes that you use for validating have to be more precise than the ones you use for matching. Your regex is failing because the .* can match the sentinel string, 'foo'. You need to actively prevent matches of foo before and after the one you're trying to match. Casimir's answer shows one way to do that; here's another:
"^(?>(?!foo).)*+foo(?>(?!foo).)*+$"
It's not quite as efficient, but I think it's a lot easier to read. In fact, you could probably use this regex:
"^(?!.*foo.*foo).*foo.*$"
It's a great deal more inefficient, but a complete regex n00b would probably figure out what it does.
Finally, notice that none of these regexes (mine or Casimir's) use lookbehinds. I know it seems like the perfect tool for the job, but no. In fact, lookbehind should never be the first tool you reach for. And not just in Java. Whatever regex flavor you use, it's almost always easier to match the whole string in the normal way than it is to use lookbehinds. And usually much more efficient, too.

Someone answered the question but then deleted the answer.
The following short code works correctly:
static boolean testRegEx(String str){
return !str.matches("(.*?foo.*){0}|(.*?foo.*){2,}");
}
Any idea how to invert the result inside the regex itself?

Related

Java Regex ? (expr){num} confusion?

I'm trying to identify strings which contain exactly one integer.
That is exactly one string of contiguous digits e.g. "1234" (no dots, no commas).
So I thought this should do it (this is with the Java string escapes included):
(\\d+){1,}
So the "\d+" correctly matches a string of contiguous digits (right?).
I wrapped this expression in "(" and ")" as a sub-expression, and then I'm trying to say "only one of these sub-expressions".
Here's the result of matcher.find() for various strings
(note that the regex from now on is 'raw', NOT Java string escaped):
Pattern:(\d+){1,}
Input String Result
1 true
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 true
It seems the '1' in the pattern is applying to the "\d+" string, rather than to the number of those contiguous strings.
Because if I change the number from 1 to 4, I can see the result change to the following:
Pattern:(\d+){4,}
Input String Result
1 false
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 false
What am I missing here?
Out of interest: if I take off the "(" and ")" altogether, I get a different result again:
Pattern:\d+{4,}
Input String Result
1 true
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 true
Matcher.find() will try to find a match anywhere inside the String. Use Matcher.matches() instead to check whether the pattern matches the entire String.
In this way, the pattern you need is \d+
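A small sketch of the difference, using the inputs from the question (class and method names are made up). It also shows why (\d+){4,} behaved the way it did: each repetition of the group has to follow the previous one directly, so {4,} effectively asks for a run of at least four digits rather than four separate integers.

```java
import java.util.regex.Pattern;

final class FindVsMatches {
    private static final Pattern DIGITS = Pattern.compile("\\d+");
    private static final Pattern FOUR_REPS = Pattern.compile("(\\d+){4,}");

    // find(): is there a run of digits anywhere in the input?
    static boolean findsDigits(String s) {
        return DIGITS.matcher(s).find();
    }

    // matches(): does the whole input consist of digits?
    static boolean isAllDigits(String s) {
        return DIGITS.matcher(s).matches();
    }

    // find() with the repeated group: needs four back-to-back repetitions.
    static boolean findsFourReps(String s) {
        return FOUR_REPS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(findsDigits("XX-1234"));                  // true
        System.out.println(isAllDigits("XX-1234"));                  // false
        System.out.println(findsFourReps("do-not-match-123-456"));   // false
        System.out.println(findsFourReps("do-not-match-1234-567"));  // true
    }
}
```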
EDIT:
It seems I misunderstood the question. One way to find out whether the String has only one integer, using the same pattern, is:
int matchCounter = 0;
// matcher is a Matcher for \d+ on the input; stop as soon as a second match is seen
while (matcher.find() && matchCounter < 2) {
    matchCounter++;
}
return matchCounter == 1;
This is the regex:
^[^\d]*\d+[^\d]*$
That's zero or more non digits, followed by a substring of digits and then zero or more non digits again until the end of the string. Here is the java code (with escaped slashes):
class MainClass {
public static void main(String[] args) {
String regex="^[^\\d]*\\d+[^\\d]*$";
System.out.println("1".matches(regex)); // true
System.out.println("XX-1234".matches(regex)); // true
System.out.println("XX-1234-YY".matches(regex)); // true
System.out.println("do-not-match-no-integers".matches(regex)); // false
System.out.println("do-not-match-1234-567".matches(regex)); // false
System.out.println("do-not-match-123-456".matches(regex)); // false
}
}
You can use the RegEx ^\D*?(\d+)\D*?$
^\D*? makes sure there are no digits between the start of your line and your first group
(\d+) matches your digits
\D*?$ makes sure there are no digits between your first group and the end of your line
So, for your Java String, it would be : ^\\D*?(\\d+)\\D*?$
I think you will have to make sure your regex considers the entire string, using ^ and $.
To do that, you could match zero or more non-digits, followed by 1 or more digits, and then zero or more non-digits.
The following should do the trick:
^[^\d]*(\d+)[^\d]*$
Here it is on regex101.com: https://regex101.com/r/CG0RiL/2
Edit: As pointed out by Veselin Davidov my regex isn't correct.
If I understand you correctly, you want it to return true only when the entire String matches the pattern, yes?
Then you have to call matcher.matches();
Also, I think your pattern should be just \d+.
If you have problems with regex, I can recommend https://regex101.com/; it explains why something matches and gives you a quick preview.
I use it every time I have to write a regex.

How to match the first character in a String with a regexp?

I need a regular expression to evaluate if the first character of a word is a lowercase letter or not.
I have this java code: Character.toString(charcter).matches("[a-z?]")
For example if I have those words the result would be:
a13 => true
B54 => false
&32 => false
I want to match only one letter and I don't know if I need to use "?", "." or "{1}" after or inside "[a-z]"
There is a built in way to do this without regexes.
Character.isLowerCase(string.charAt(0))
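A slightly more defensive sketch of the same idea, with invented helper names. Note one assumption worth checking against your requirements: Character.isLowerCase also accepts non-ASCII lowercase letters, whereas [a-z] does not, so a regex counterpart is included for comparison.

```java
final class FirstCharCheck {
    // Returns false for null or empty input instead of throwing.
    static boolean startsWithLowercase(String s) {
        return s != null && !s.isEmpty() && Character.isLowerCase(s.charAt(0));
    }

    // Regex counterpart restricted to ASCII a-z; (?s) lets . match line breaks too.
    static boolean startsWithAsciiLowercase(String s) {
        return s != null && s.matches("(?s)[a-z].*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithLowercase("a13")); // true
        System.out.println(startsWithLowercase("B54")); // false
        System.out.println(startsWithLowercase("&32")); // false
    }
}
```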
Please use this for your needs: /^[a-z]/
You want to match if there's exactly one lowercase letter. As @Luiggi Medonza stated, you really should not need regular expressions for this, but if you want to use them, you most likely want this pattern:
[a-z]{1}
A ? makes the preceding element optional. You want a strict match of length 1, so you need {1}.
@Ted Hopp mentioned that you don't need the {1}. Your entire match should look like this:
entire_string.matches("^[a-z].*$")
Again, using built-in string methods will be much faster and simpler.
I had a similar requirement: the first character of a string must be a letter (a-z or A-Z); after that, the user can type anything from a limited set, such as digits or a few symbols.
Solution
public static boolean designationValidate(String n) {
int l = n.length();
if (l >= 4) {
Pattern pattern = Pattern.compile("^[a-zA-Z][a-zA-Z0-9-() ]*$");
Matcher matcher = pattern.matcher(n);
return (matcher.find() && matcher.group().equals(n));
} else
return false;
}
In the example above, I validate that the string is at least 4 characters long and starts with a letter. If you want to allow other symbols, add them to the character class.
The method returns true if the expression matches and false otherwise.
I hope this is helpful.

How to find the exact word using a regex in Java?

Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
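A minimal sketch of that loop, collecting every match into a list (the class and method names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllWords {
    // Collects every non-overlapping match of `regex` in `input`, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {          // each call resumes after the previous match
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "thistle" is skipped because \b requires a word boundary after "this".
        System.out.println(findAll("\\bthis\\b", "this and this, but not thistle")); // [this, this]
    }
}
```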
Full example method for matcher:
public static String REGEX_FIND_WORD="(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
String regex=String.format(REGEX_FIND_WORD, Pattern.quote(word));
return text.matches(regex);
}
Explanation:
(?i) - ignorecase
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be changed by String.format (quoted to avoid regex errors)
\b - word boundary
.*? - allow (optionally) any characters after
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This expression consists of three groups, which are indexed from 0 (group 0 is the whole match).
0th group - ABCD
1st group - BC
2nd group - C
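The numbering can be checked with a short sketch like the following (helper name invented). One detail to be aware of: Matcher.groupCount() does not count group 0, so it reports 2 for this expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumbering {
    // Returns groups 0..groupCount() of a full match, or an empty array if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1]; // groupCount() excludes group 0
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String g : groups("A(B(C))D", "ABCD")) {
            System.out.println(g); // ABCD, then BC, then C
        }
    }
}
```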
So if you need to find a specific word, you can use two methods of the Matcher class: find() to locate the text described by the regex, and then group() to get the String captured by a given group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The code above prints "beautiful".
Is your searchString going to be a regular expression? If not, simply use String.contains(CharSequence s).
System.out.println(input.matches(".*\\bthis$"));
This also works. Here the .* matches everything before the word, and then this must appear as a whole word at the end of the string.

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)
Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.
Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time
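A minimal sketch of that approach, assuming values never contain a comma and every pair has the form key=value. The class name and the choice of a LinkedHashMap (to keep the original order) are illustrative, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyValueSplit {
    // Splits "k1=v1,k2=v2" into an ordered map; rejects pairs without '='.
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : data.split(",")) {
            // Limit 2 keeps any further '=' characters inside the value.
            String[] kv = pair.split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Not a key=value pair: " + pair);
            }
            result.put(kv[0], kv[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=value,Key2=value,Key3=value"));
        // {Key1=value, Key2=value, Key3=value}
    }
}
```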
Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));
It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy quantifier changes things so that it starts off trying to match nothing and then builds up from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, and so on. It can do its job in one forward pass over the input.
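On the two-key input above both variants capture the same values, so the difference only shows in the amount of work done. A contrived input in which the text ",Key2=" appears twice makes the two strategies visibly diverge; the sketch below (invented names, made-up input) uses matches() so the whole string has to be consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns what group 1 captured for a full match, or null if there is no match.
    static String firstValue(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "Key1=a,Key2=b,Key2=c"; // contrived: ",Key2=" occurs twice

        // Greedy: grabs everything, then backs off to the LAST ",Key2=".
        System.out.println(firstValue("Key1=(.*),Key2=(.*)", input));   // a,Key2=b

        // Reluctant: grows from nothing and stops at the FIRST ",Key2=".
        System.out.println(firstValue("Key1=(.*?),Key2=(.*?)", input)); // a
    }
}
```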
I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to an non-capturing group comprised of word characters followed by the equal sign (the next value)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which is a run of characters that make up a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
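The steps above could be sketched roughly as follows. This assumes, as stated, that neither keys nor values contain "=" and that keys contain no commas; the class name and the error handling are illustrative choices.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SplitOnEquals {
    // Parses "k1=v1,k2=v2" where values may themselves contain commas.
    static Map<String, String> parse(String data) {
        String[] parts = data.split("=", -1); // -1 keeps a trailing empty value
        if (parts.length < 2) {
            throw new IllegalArgumentException("No '=' found in: " + data);
        }
        Map<String, String> result = new LinkedHashMap<>();
        String key = parts[0]; // the first substring is just the first key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                result.put(key, part); // the last substring is just the last value
            } else {
                int comma = part.lastIndexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("Missing ',' before next key in: " + part);
                }
                result.put(key, part.substring(0, comma)); // value, commas included
                key = part.substring(comma + 1);           // key of the next pair
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=a,b,Key2=c")); // {Key1=a,b, Key2=c}
    }
}
```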
If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(). I would recommend that, as it lets you move through the string one key=value pair at a time, though it does lead you into the whole "greedy" evaluation discussion.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}

Java regex return after first match

How do I return after the first match of a regular expression? (Does the Matcher.find() method do that?)
Say I have a string "abcdefgeee". I want the regex engine to stop searching immediately after it finds the first match of "e", for example. I am writing a method that returns true/false depending on whether the pattern is found, and I don't want to search the whole string for "e". (I am looking for a regex solution.)
Another question: sometimes when I use matches(), it doesn't return the result I expect. For example, if I compile my pattern as "[a-z]" and then use matches(), it doesn't match. But when I compile the pattern as ".*[a-z].*", it matches. Is that the behaviour of the matches() method of the Matcher class?
Edit: here's what I actually want to do. For example, I want to search for a $ sign AND a # sign in a string. So I would define 2 compiled patterns (since I can't find any logical AND for regex, as I only know the basics).
pattern1 = Pattern.compile("\\$");
pattern2 = Pattern.compile("#");
then i would just use
if ( match1.find() && match2.find() ){
return true;
}
in my method.
I only want the matchers to search the string for the first occurrence and return.
Thanks
For your second question: matches() does work correctly; your example uses two different regular expressions.
.*[a-z].* will match a String that contains at least one lowercase letter. [a-z] will only match a one-character String that is a lowercase letter a-z. I think you might mean to use something like [a-z]+
Another question: sometimes when I use matches(), it doesn't return the result I expect. For example, if I compile my pattern as "[a-z]" and then use matches(), it doesn't match. But when I compile the pattern as ".*[a-z].*", it matches. Is that the behaviour of the matches() method of the Matcher class?
Yes, matches(...) tests the entire target string.
... here's actually what i want to do. For example I want to search for a $ sign AND a # sign in a string. So i would define 2 compiled patterns (since i can't find any logical AND for regex as I know the basics).
I know you said you wanted to use regex, but all your examples seem to suggest you have no need for it: those are all single characters that can be handled with a couple of indexOf(...) calls.
Anyway, using regex, you could do it like this:
public static boolean containsAll(String text, String... patterns) {
for(String p : patterns) {
Matcher m = Pattern.compile(p).matcher(text);
if(!m.find()) return false;
}
return true;
}
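One caveat with the regex version when the things being searched for are literal characters such as $: an unescaped $ is the end-of-input anchor, so find() succeeds on any string. A small sketch of the pitfall and of quoting the literal with Pattern.quote (class and method names are invented for the example):

```java
import java.util.regex.Pattern;

final class LiteralFind {
    // Treats `p` as a regular expression.
    static boolean findsAsRegex(String text, String p) {
        return Pattern.compile(p).matcher(text).find();
    }

    // Treats `s` as plain text by quoting it first.
    static boolean findsAsLiteral(String text, String s) {
        return Pattern.compile(Pattern.quote(s)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(findsAsRegex("abc", "$"));    // true: "$" matches the end of input
        System.out.println(findsAsLiteral("abc", "$"));  // false: no dollar sign in "abc"
        System.out.println(findsAsLiteral("a$b#", "$")); // true
    }
}
```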
But, again: indexOf(...) would do the trick as well:
public static boolean containsAll(String text, String... subStrings) {
for(String s : subStrings) {
if(text.indexOf(s) < 0) return false;
}
return true;
}
