Negative lookahead regex not working - java

input1="caused/VBN by/IN thyroid disorder"
Requirement: find the word "caused" followed by a slash and one or more capital letters, but not followed by whitespace + "by/IN".
In the example above "caused/VBN" is followed by " by/IN", so 'caused' should not match.
input2="caused/VBN thyroid disorder"
"by/IN" doesn't follow caused, so it should match
regex="caused/[A-Z]+(?![\\s]+by/IN)"
caused/[A-Z]+ -- word 'caused' + / + one or more capital letters
(?![\\s]+by/IN) -- negative lookahead: not followed by whitespace and "by/IN"
Below is a simple method that I used to test
// needs java.util.regex.Pattern and java.util.regex.Matcher
public static void main(String[] args){
    String input = "caused/VBN by/IN thyroid disorder";
    String regex = "caused/[A-Z]+(?![\\s]+by/IN)";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    while(matcher.find()){
        System.out.println(matcher.group());
    }
}
Output: caused/VB
I don't understand why my negative lookahead regex is not working.

You need to include a word boundary in your regular expression:
String regex = "caused/[A-Z]+\\b(?![\\s]+by/IN)";
Without it you can get a match, but not what you were expecting:
"caused/VBN by/IN thyroid disorder"
 ^^^^^^^^^
 this part ("caused/VB") matches because "N by" doesn't match "[\\s]+by"
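A minimal sketch of the word-boundary fix, run against both inputs from the question. The class and method names (WordBoundaryLookahead, firstMatch) are made up for illustration; only the regex comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryLookahead {
    // The question's pattern plus \b, so [A-Z]+ cannot stop in the middle of the tag.
    static final Pattern TAGGED =
            Pattern.compile("caused/[A-Z]+\\b(?![\\s]+by/IN)");

    // Returns the first match, or null when nothing matches (hypothetical helper).
    static String firstMatch(String input) {
        Matcher m = TAGGED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("caused/VBN by/IN thyroid disorder")); // null
        System.out.println(firstMatch("caused/VBN thyroid disorder"));       // caused/VBN
    }
}
```

With \b in place, backing off to "caused/VB" is no longer allowed, because there is no word boundary between B and N.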

The [A-Z]+ match is shortened (via backtracking) until the negative lookahead succeeds.
What you have to do is stop the backtracking so that [A-Z]+ always matches the full run of capital letters.
This can be done a couple of different ways.
A positive lookahead, followed by a consumption
"caused(?=(/[A-Z]+))\\1(?!\\s+by/IN)"
A standalone sub-expression
"caused(?>/[A-Z]+)(?!\\s+by/IN)"
A possessive quantifier
"caused/[A-Z]++(?!\\s+by/IN)"
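As a quick check, the three variants can be run side by side against the two inputs from the question. This is only a sketch; the class name, the PATTERNS array and the found helper are invented here, the regexes are the ones listed above.

```java
import java.util.regex.Pattern;

class NoBacktrackVariants {
    static final String[] PATTERNS = {
        "caused(?=(/[A-Z]+))\\1(?!\\s+by/IN)", // lookahead capture, then backreference
        "caused(?>/[A-Z]+)(?!\\s+by/IN)",      // standalone (atomic) group
        "caused/[A-Z]++(?!\\s+by/IN)"          // possessive quantifier
    };

    // True when the regex is found anywhere in the input (hypothetical helper).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String followedByBy = "caused/VBN by/IN thyroid disorder";
        String notFollowed = "caused/VBN thyroid disorder";
        for (String p : PATTERNS) {
            // Each variant should print: false true
            System.out.println(found(p, followedByBy) + " " + found(p, notFollowed));
        }
    }
}
```

All three work the same way: once the run of capitals has been matched, the engine is not allowed to give any of it back.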

Related

java regex char sequence

I have a string with multiple "message" inside it. "message" starts with certain char sequence. I've tried:
String str = "ab message1ab message2ab message3";
Pattern pattern = Pattern.compile("(?<record>ab\\p{ASCII}+(?!ab))");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    handleMessage(matcher.group("record"));
}
but \p{ASCII}+ is greedy and consumes everything.
The characters a and b can each appear inside a message; only the sequence "ab" marks the start of the next message.
\p{ASCII}+ is a greedy match for one or more ASCII characters, meaning that it takes the longest possible match. You can use the reluctant quantifier if you want the shortest possible match instead: \p{ASCII}+?. In that case, you also need a positive lookahead assertion to tell the match where to stop.
The regex could become:
Pattern pattern = Pattern.compile("(?<record>ab\\p{ASCII}+?)(?=(ab)|\\z)");
Please note the (ab)|\z alternative: the \z branch lets the last message match at the end of the input.
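Here is a small sketch of that pattern collecting the messages into a list. The class and method names are made up, and the capturing parentheses around ab in the lookahead are dropped because the group is not used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageSplitter {
    // Reluctant match, stopped by the next "ab" or by the end of input.
    static final Pattern RECORD =
            Pattern.compile("(?<record>ab\\p{ASCII}+?)(?=ab|\\z)");

    // Collects every record found in the input (hypothetical helper).
    static List<String> records(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            out.add(m.group("record"));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ab message1, ab message2, ab message3]
        System.out.println(records("ab message1ab message2ab message3"));
    }
}
```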

Regular expression java to extract the balance from a string

I have a String which contains " Dear user BAL= 1,234/ ".
I want to extract 1,234 from the String using the regular expression. It can be 1,23, 1,2345, 5,213 or 500
final Pattern p=Pattern.compile("((BAL)=*(\\s{1}\\w+))");
final Matcher m = p.matcher(text);
if(m.find())
return m.group(3);
else
return "";
This returns 3.
What regular expression should I make? I am new to regular expressions.
You search in your regex for word characters \w+ but you should search for digits with \d+.
Additionally there is the comma, so you need to match that as well.
I'd use
/.*BAL=\s([\d,]+(?=/)).*/
as pattern and get only the number in the resulting group.
Explanation:
.* match anything before
BAL= match the string "BAL="
\s match a whitespace
( start matching group
[\d,]+ matches every digit or comma one or more times
(?=/) match the former only if followed by a slash
) end matching group
.* matches anything thereafter
This is untested, but it should work like this:
final Pattern p=Pattern.compile(".*BAL=\\s([\\d,]+(?=/)).*");
final Matcher m = p.matcher(text);
if(m.find())
return m.group(1);
else
return "";
According to an online tester, the pattern above matches the text:
BAL= 1,234/
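If it doesn't have to be extracted by a regular expression, you could simply do:
// trim first so the leading space does not produce an empty first token,
// then split on whitespace: ["Dear", "user", "BAL=", "1,234/"]
String[] foo = text.trim().split("\\s+");
// the fourth token still carries the trailing slash, so strip it
return foo[3].replace("/", "");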
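For reference, a self-contained sketch of the regex approach. It assumes the balance is always written as "BAL=", one whitespace character, digits and commas, then a slash. The class and method names are invented, and the leading and trailing .* are left out because find() already searches inside the string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BalanceExtractor {
    // "BAL=", one whitespace, then digits/commas that must be followed by a slash.
    static final Pattern BALANCE = Pattern.compile("BAL=\\s([\\d,]+)(?=/)");

    // Returns the balance text, or "" when the input has no balance (hypothetical helper).
    static String extract(String text) {
        Matcher m = BALANCE.matcher(text);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(extract(" Dear user BAL= 1,234/ ")); // 1,234
        System.out.println(extract(" Dear user BAL= 500/ "));   // 500
    }
}
```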

How to get all integers before hyphen from java String

I want to extract the integers that come before each hyphen; for the first group the answer should be 0 0 1 (as integers). What is the best way to parse this in Java?
public static String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
Please help me out.
Use the below regex with Pattern and matcher classes.
Pattern.compile("\\d+(?=-)");
\\d+ - Matches one or more digits. The + repeats the previous token \\d (which matches a digit character) one or more times.
(?=-) - Matches only if it's followed by a hyphen. This is a positive lookahead assertion: it asserts that the match must be followed by a - symbol, without consuming it.
String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
Matcher m = Pattern.compile("\\d+(?=-)").matcher(str);
while(m.find())
{
System.out.println(m.group());
}
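If you need the values as integers rather than printed text, the same pattern can feed a list. A rough sketch, with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitsBeforeHyphen {
    // One or more digits that are immediately followed by a hyphen.
    static final Pattern BEFORE_HYPHEN = Pattern.compile("\\d+(?=-)");

    // Parses every matched digit run into an Integer (hypothetical helper).
    static List<Integer> parse(String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = BEFORE_HYPHEN.matcher(input);
        while (m.find()) {
            out.add(Integer.parseInt(m.group()));
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
        System.out.println(parse(str)); // [0, 0, 1, 1, 0, 0, 0, 1, 0]
    }
}
```

The digits in S1, S2 and S3 are skipped because none of them is followed by a hyphen.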
One lazy way: if you already know the layout of the string, use substring and indexOf to locate each number.
String str ="[0-S1|0-S2|1-S3, 1-S1|0-S2|0-S3, 0-S1|1-S2|0-S3]";
// start one past the "[" so only the digits are parsed
int int1 = Integer.parseInt(str.substring(str.indexOf("[") + 1, str.indexOf("-S1")));
and so on.

Java RegEx negative lookbehind

I have the following Java code:
Pattern pat = Pattern.compile("(?<!function )\\w+");
Matcher mat = pat.matcher("function example");
System.out.println(mat.find());
Why does mat.find() return true? I used negative lookbehind and example is preceded by function. Shouldn't it be discarded?
See what it matches:
public static void main(String[] args) throws Exception {
Pattern pat = Pattern.compile("(?<!function )\\w+");
Matcher mat = pat.matcher("function example");
while (mat.find()) {
System.out.println(mat.group());
}
}
Output:
function
xample
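So first it finds function, which isn't preceded by "function ". A match cannot start at the e of example, because that position is preceded by "function ". Then it finds xample, which is preceded by "function e" and therefore not by "function ".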
Presumably you want the pattern to match the whole text, not just find matches in the text.
You can either do this with Matcher.matches() or you can change the pattern to add start and end anchors:
^(?<!function )\\w+$
I prefer the second approach, as it means that the pattern itself defines its match region rather than the region being defined by its usage. That's just a matter of preference, however.
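To make the difference between find() and matches() concrete, here is a small sketch using the pattern from the question. The class and helper names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindFindVsMatches {
    static final Pattern NOT_AFTER_FUNCTION =
            Pattern.compile("(?<!function )\\w+");

    // Every substring that find() reports (hypothetical helper).
    static List<String> allMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = NOT_AFTER_FUNCTION.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True only when the pattern covers the entire input (hypothetical helper).
    static boolean matchesWhole(String input) {
        return NOT_AFTER_FUNCTION.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(allMatches("function example"));   // [function, xample]
        System.out.println(matchesWhole("function example")); // false
        System.out.println(matchesWhole("example"));          // true
    }
}
```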
Your string has the word "function" that matches \w+, and is not preceded by "function ".
Notice two things here:
You're using find(), which also returns true for a substring match.
Because of the above, "function" matches, as it is not preceded by "function ".
The whole string would never have matched anyway, because your regex doesn't allow spaces.
Use Matcher#matches(), or ^ and $ anchors with a negative lookahead, instead:
Pattern pat = Pattern.compile("^(?!function)[\\w\\s]+$"); // added \s for whitespaces
Matcher mat = pat.matcher("function example");
System.out.println(mat.find()); // false

Java Regex : How to match one or more space characters

How do you match more than one space character in Java regex?
I have a regex I am trying to match. The regex fails when I have two or more space characters.
public static void main(String[] args) {
String pattern = "\\b(fruit)\\s+([^a]+\\w+)\\b"; //Match 'fruit' not followed by a word that begins with 'a'
String str = "fruit apple"; //One space character will not be matched
String str_fail = "fruit  apple"; //Two space characters will be matched
System.out.println(preg_match(pattern,str)); //False (Thats what I want)
System.out.println(preg_match(pattern,str_fail)); //True (Regex fail)
}
public static boolean preg_match(String pattern,String subject) {
Pattern regex = Pattern.compile(pattern);
Matcher regexMatcher = regex.matcher(subject);
return regexMatcher.find();
}
The problem is actually because of backtracking. Your regex:
"\\b(fruit)\\s+([^a]+\\w+)\\b"
Says "fruit, followed by one or more whitespace characters, followed by one or more non-'a' characters, followed by one or more 'word' characters". The reason this fails with two spaces is that \s+ first matches both spaces, but when [^a]+ then fails on the 'a', \s+ gives back the second space. That second space satisfies [^a]+, while the first space still satisfies \s+.
I think you can fix it by simply using the possessive quantifier instead, which would be \s++. This tells \s not to give back the second space character. Java's quantifiers are documented in the Javadoc for java.util.regex.Pattern.
As an illustration, here are two examples at Rubular:
Using the possessive quantifier on \s (gives the expected results, from what you describe)
Your current regex with separate groups around [^a]+ and \w+. Notice that the second match group (representing [^a]+) captures the second space character.
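The same comparison can be reproduced locally in Java. This is just a sketch; the class name, constants and helper are invented, and the two patterns are the question's regex and the \s++ variant suggested above.

```java
import java.util.regex.Pattern;

class PossessiveWhitespace {
    // Original pattern: \s+ may give back whitespace to [^a]+.
    static final Pattern GREEDY =
            Pattern.compile("\\b(fruit)\\s+([^a]+\\w+)\\b");
    // Suggested fix: \s++ keeps every whitespace character it matched.
    static final Pattern POSSESSIVE =
            Pattern.compile("\\b(fruit)\\s++([^a]+\\w+)\\b");

    static boolean found(Pattern p, String subject) {
        return p.matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(found(GREEDY, "fruit apple"));       // false
        System.out.println(found(GREEDY, "fruit  apple"));      // true (the unwanted match)
        System.out.println(found(POSSESSIVE, "fruit  apple"));  // false
        System.out.println(found(POSSESSIVE, "fruit  banana")); // true
    }
}
```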
