pattern.matcher() vs pattern.matches()

I am wondering why the results of Java's regex pattern.matcher() and pattern.matches() differ when given the same regular expression and the same string:
String str = "hello+";
Pattern pattern = Pattern.compile("\\+");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println("I found the text " + matcher.group() + " starting at "
+ "index " + matcher.start() + " and ending at index " + matcher.end());
}
System.out.println(java.util.regex.Pattern.matches("\\+", str));
The results of the above are:
I found the text + starting at index 5 and ending at index 6
false
I found that using an expression to match the full string works fine in case of matches(".*\\+").

pattern.matcher(String s) returns a Matcher that can find patterns in the String s. pattern.matches(String str) tests whether the entire String (str) matches the pattern.
In brief (just to remember the difference):
pattern.matcher - test if the string contains-a pattern
pattern.matches - test if the string is-a pattern
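To make the contains-a / is-a distinction concrete, here is a minimal sketch using the string and pattern from the question. The class and method names (FindVsMatches, containsPattern, isPattern) are made up for illustration and are not part of the regex API.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    // "contains-a": true if the pattern occurs anywhere in the input
    static boolean containsPattern(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // "is-a": true only if the whole input matches the pattern
    static boolean isPattern(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String str = "hello+";
        System.out.println(containsPattern("\\+", str)); // true
        System.out.println(isPattern("\\+", str));       // false
        System.out.println(isPattern(".*\\+", str));     // true
        System.out.println(isPattern("\\+", "+"));       // true
    }
}
```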

Matcher.find() attempts to find the next subsequence of the input sequence that matches the pattern.
Pattern.matches(String regex, CharSequence input) compiles the regex into a Pattern, creates a Matcher for the input, and returns the result of Matcher.matches().
Matcher.matches() attempts to match the entire region (string) against the pattern (regex).
So, in your case, Pattern.matches("\\+", str) returns false since str.equals("+") is false.

From the Javadoc (note the "if, and only if, the entire region" part):
/**
* Attempts to match the entire region against the pattern.
*
* <p> If the match succeeds then more information can be obtained via the
* <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
*
* @return <tt>true</tt> if, and only if, <b>the entire region</b> sequence
* matches this matcher's pattern
*/
public boolean matches() {
return match(from, ENDANCHOR);
}
So if your String was just "+", you'd get a true result.

matches() tries to match the expression against the entire string. That is, it checks whether the entire string is an instance of the pattern or not.
Conceptually, think of it like this: it implicitly adds a ^ at the start and a $ at the end of your pattern.
For String str = "hello+", if you want matches() to return true, you need a pattern like ".*\\+.*".
I hope this answered your question.
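The implicit anchoring described above can be sketched as follows. The helper anchoredFind is hypothetical; it wraps the regex in explicit \A and \z anchors and uses find(), which for default matcher settings should give the same answer as matches(). Treat it as an illustration of the idea rather than how the library is implemented.

```java
import java.util.regex.Pattern;

class ImplicitAnchors {
    // Hypothetical helper: emulates matches() by wrapping the regex in explicit
    // start-of-input (\A) and end-of-input (\z) anchors and calling find().
    static boolean anchoredFind(String regex, String input) {
        return Pattern.compile("\\A(?:" + regex + ")\\z").matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "hello+";
        System.out.println(str.matches("\\+"));           // false
        System.out.println(anchoredFind("\\+", str));     // false
        System.out.println(str.matches(".*\\+.*"));       // true
        System.out.println(anchoredFind(".*\\+.*", str)); // true
    }
}
```

The non-capturing group (?:...) keeps the anchors applied to the whole expression even when the regex contains a top-level alternation.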

Pattern.matches is testing the whole String, in your case you should use:
System.out.println(java.util.regex.Pattern.matches(".*\\+", str));
That is, any string followed by a + symbol.

I think your question should really be "When should I use the Pattern.matches() method?", and the answer is "Never." Were you expecting it to return an array of the matched substrings, like .NET's Matches methods do? That's a perfectly reasonable expectation, but no, Java has nothing like that.
If you just want to do a quick-and-dirty match, adorn the regex with .* at either end, and use the string's own matches() method:
System.out.println(str.matches(".*\\+.*"));
If you want to extract multiple matches, or access information about a match afterward, create a Matcher instance and use its methods, like you did in your question. Pattern.matches() is nothing but a wasted opportunity.
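As a rough sketch of the "extract multiple matches" case, the loop below collects every match into a list. The class name AllMatches and the helper findAll are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatches {
    // Collects the text of every non-overlapping match, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22 c333")); // [1, 22, 333]
        System.out.println(findAll("\\+", "hello+"));       // [+]
    }
}
```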

Matcher matcher = pattern.matcher(text);
In this case, a Matcher instance is returned, which performs match operations on the input text by interpreting the pattern. We can then call matcher.find() repeatedly to find each occurrence of the pattern in the input text.
(java.util.regex.Pattern.matches("\\+", str))
Here, the Matcher object is created implicitly, and a boolean is returned indicating whether the whole text matches the pattern. This works the same way as the str.matches(regex) method in String.

The code equivalent to java.util.regex.Pattern.matches("\\+", str) would be:
Pattern.compile("\\+").matcher(str).matches();
The find() method, by contrast, finds the next occurrence of the pattern in the string (the first occurrence on its first call).

Related

Regular expressions in multi-line text code in Java [duplicate]

I am trying to match a multi line text using java. When I use the Pattern class with the Pattern.MULTILINE modifier, I am able to match, but I am not able to do so with (?m).
The same pattern with (?m) and using String.matches does not seem to work.
I am sure I am missing something, but no idea what. Am not very good at regular expressions.
This is what I tried
String test = "User Comments: This is \t a\ta \n test \n\n message \n";
String pattern1 = "User Comments: (\\W)*(\\S)*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find()); //true
String pattern2 = "(?m)User Comments: (\\W)*(\\S)*";
System.out.println(test.matches(pattern2)); //false - why?

First, you're using the modifiers under an incorrect assumption.
Pattern.MULTILINE or (?m) tells Java to allow the anchors ^ and $ to match at the start and end of each line (otherwise they only match at the start/end of the entire string).
Pattern.DOTALL or (?s) tells Java to allow the dot to match newline characters, too.
Second, in your case, the regex fails because you're using the matches() method which expects the regex to match the entire string - which of course doesn't work since there are some characters left after (\\W)*(\\S)* have matched.
So if you're simply looking for a string that starts with User Comments:, use the regex
^\s*User Comments:\s*(.*)
with the Pattern.DOTALL option:
Pattern regex = Pattern.compile("^\\s*User Comments:\\s*(.*)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group(1);
}
ResultString will then contain the text after User Comments:

This has nothing to do with the MULTILINE flag; what you're seeing is the difference between the find() and matches() methods. find() succeeds if a match can be found anywhere in the target string, while matches() expects the regex to match the entire string.
Pattern p = Pattern.compile("xyz");
Matcher m = p.matcher("123xyzabc");
System.out.println(m.find()); // true
System.out.println(m.matches()); // false
m = p.matcher("xyz"); // reuse the variable; declaring m a second time would not compile
System.out.println(m.matches()); // true
Furthermore, MULTILINE doesn't mean what you think it does. Many people seem to jump to the conclusion that you have to use that flag if your target string contains newlines--that is, if it contains multiple logical lines. I've seen several answers here on SO to that effect, but in fact, all that flag does is change the behavior of the anchors, ^ and $.
Normally ^ matches the very beginning of the target string, and $ matches the very end (or before a newline at the end, but we'll leave that aside for now). But if the string contains newlines, you can choose for ^ and $ to match at the start and end of any logical line, not just the start and end of the whole string, by setting the MULTILINE flag.
So forget about what MULTILINE means and just remember what it does: changes the behavior of the ^ and $ anchors. DOTALL mode was originally called "single-line" (and still is in some flavors, including Perl and .NET), and it has always caused similar confusion. We're fortunate that the Java devs went with the more descriptive name in that case, but there was no reasonable alternative for "multiline" mode.
In Perl, where all this madness started, they've admitted their mistake and gotten rid of both "multiline" and "single-line" modes in Perl 6 regexes. In another twenty years, maybe the rest of the world will have followed suit.
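A small sketch of what the two flags actually change, using a made-up two-line string. The class FlagDemo and its helper found are illustrative names only.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // True if the regex, compiled with the given flags, occurs somewhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";

        // MULTILINE only changes where ^ and $ may match.
        System.out.println(found("^second", 0, text));                 // false
        System.out.println(found("^second", Pattern.MULTILINE, text)); // true
        System.out.println(found("(?m)^second", 0, text));             // true (inline flag)

        // DOTALL only changes whether the dot may match a line terminator.
        System.out.println(found("first.*second", 0, text));              // false
        System.out.println(found("first.*second", Pattern.DOTALL, text)); // true
    }
}
```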

str.matches(regex) behaves like Pattern.matches(regex, str) which attempts to match the entire input sequence against the pattern and returns
true if, and only if, the entire input sequence matches this matcher's pattern
Whereas matcher.find() attempts to find the next subsequence of the input sequence that matches the pattern and returns
true if, and only if, a subsequence of the input sequence matches this matcher's pattern
Thus the problem is with the regex. Try the following.
String test = "User Comments: This is \t a\ta \ntest\n\n message \n";
String pattern1 = "User Comments: [\\s\\S]*^test$[\\s\\S]*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find()); //true
String pattern2 = "(?m)User Comments: [\\s\\S]*^test$[\\s\\S]*";
System.out.println(test.matches(pattern2)); //true
Thus in short, the (\\W)*(\\S)* portion in your first regex matches an empty string as * means zero or more occurrences and the real matched string is User Comments: and not the whole string as you'd expect. The second one fails as it tries to match the whole string but it can't as \\W matches a non word character, ie [^a-zA-Z0-9_] and the first character is T, a word character.

The multiline flag only changes where ^ and $ can match (at each line rather than only at the ends of the entire string); for your purposes a wildcard will suffice.

Regular Expression always returns false

I am having trouble getting a regular expression to work.
I use an XML-RPC library to get information from a wiki.
So far, so good.
After retrieving the data into a String variable, I would like to search through it with a regular expression, but the matcher always returns "false".
But if I ask the String ....contains("xyz"), the answer is true.
The String looks something like this:
====== Datensicherheit ====== ''Kriterium von Sicherheit'' Typ: technisch Definition: \ //Allgemein.........
String regex = "Definition";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
System.out.println(matcher.matches());
Does anybody know what I'm doing wrong?

This is an issue with your regular expression. If you want to know whether the string contains "Definition" using matches(), your regex needs to be:
String regex = ".*Definition.*";

Note that matches() returns true if, and only if, the entire region sequence matches this matcher's pattern. See the Javadoc: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#matches()
So, it will only be true if the entire "text" region matches "Definition", which is unlikely :).
Try find() instead, which returns true if, and only if, a subsequence of the input sequence matches this matcher's pattern.
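Applied to the question, the difference might look like the sketch below. The text is a shortened stand-in for the wiki content, and DefinitionSearch and firstIndex are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefinitionSearch {
    // Returns the start index of the first occurrence of the regex, or -1 if there is none.
    static int firstIndex(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the wiki text in the question.
        String text = "====== Datensicherheit ====== Typ: technisch Definition: Allgemein";
        Pattern pattern = Pattern.compile("Definition");

        System.out.println(pattern.matcher(text).matches()); // false: the whole text is not "Definition"
        System.out.println(pattern.matcher(text).find());    // true: "Definition" occurs inside the text
        System.out.println(firstIndex("Definition", text) == text.indexOf("Definition")); // true
    }
}
```

A fresh Matcher is created for each call here so that the result of one call cannot influence where the next one starts searching.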

How to get the value of a wildcard in a string Java?

This code works in javascript, is it possible to do something similar in java? (get the value of a wildcard in a string)
var a = "HI MY NAME IS BOB"
var b = /HI MY NAME IS (.*)/
alert("HI " + b.exec(a)[1])

Probably you need to find captured group #1:
Pattern p = Pattern.compile("(?i)Hi MY NAME IS (.*)");
Matcher m = p.matcher("Hi MY NAME IS BOB");
if (m.find()) {
System.out.println( "Group #1: " + m.group(1) ); // BOB
}
(?i) is for ignore case match
m.group(1) will give value of first captured group from your regex i.e. (.*)

What you're looking for is the java.util.regex package.
The syntax of regular expressions is a whole different answer, but I'll assume you're somewhat familiar with it here.
To use a regex in Java, you'll need to create two objects, a Pattern and a Matcher.
Quoting the documentation, a Pattern object is "A compiled representation of a regular expression", and a Matcher object is "An engine that performs match operations on a character sequence by interpreting a Pattern."
In other words, you use a Pattern to define your regex, and a Matcher to apply it.
So let's take this line-by-line:
import java.util.regex.*;
String a = "HI MY NAME IS BOB";
Obviously, you need to import the package and define the string you're going to apply the regex to.
Pattern wildcard = Pattern.compile("HI MY NAME IS (.*)");
Pattern.compile takes a String representing a regex and returns a Pattern.
Matcher match = wildcard.matcher(a);
Pattern objects have an instance method, matcher, that takes the string you want to apply the Pattern to, and returns a Matcher.
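if (match.find()) { // a match must be attempted first; otherwise group() throws IllegalStateException
    System.out.println(match.group(1));
}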
Calling match.group(n) returns the string matching the nth group of parentheses (to be more precise, the nth capturing group) in your regex. match.group(0), which is equivalent to match.group(), returns the string representing your entire match. In this case, we're using match.group(1) because we want to match the only set of parentheses in our regex - the (.*) at the end.
Putting it all together, we get:
import java.util.regex.*;
String a = "HI MY NAME IS BOB";
Pattern wildcard = Pattern.compile("HI MY NAME IS (.*)");
Matcher match = wildcard.matcher(a);
if (match.find()) { // group() is only valid after a successful match attempt
    System.out.println(match.group(1)); // BOB
}

The class you need is Pattern. It is similar to regex in its functions, but the mechanics are different. Here is a link to the documentation.

This is not exactly what you asked for, but I want to try to be a little helpful. You can print out "Bob" using substring(). Here is what worked for me:
String a = "Hi my name is Bob";
System.out.printf("This print as Bob\n%s", a.substring(14));
The substring starts at index 14 and runs to the end of the string.

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it (using groups)) from the string and check that the entire string matches K* (as in it consists of 0 or more K's with no other characters).
But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.
How would I do this?
I already found a solution (and posted an answer), but would like to know if there is specific regex or Matcher functionality that addresses / can address this issue, or simply if there are better / different ways of doing it. But, even if not, I still think it's an interesting question.
Example:
Pattern: <[0-9]> (a single digit in <>)
Valid input: <1><2><3>
Invalid inputs:
<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3
Code to do it in 2 passes with matches:
boolean products(String products)
{
String regex = "(<[0-9]>)";
Pattern pAll = Pattern.compile(regex + "*");
if (!pAll.matcher(products).matches())
return false;
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(products);
while (matcher.find())
System.out.println(matcher.group());
return true;
}

1. Defining the problem
Since it is not clear what to output when the whole string does not match pattern K*, I will redefine the problem to make it clear what to output in such case.
Given any pattern K:
Check that the string has the pattern K*.
If the string has pattern K*, then split the string into non-overlapping tokens that match K.
If the string only has a prefix that matches pattern K*, then pick the prefix that is chosen by K*+ [1], and split the prefix into tokens that match K.
[1] I don't know if there is any way to get the longest prefix that matches K. Of course, you can always remove the last character one by one and test against K* until it matches, but it is obviously inefficient.
Unless specified otherwise, whatever I write below follows my problem description above. Note that the 3rd bullet point of the problem is there to resolve the ambiguity about which prefix string to take.
2. Repeated capturing group in .NET
The problem above can be solved if we have the solution to the problem:
Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.
In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
In the case where the string only has prefix that matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.
This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.
However, since we are using Java, we don't have access to such feature.
3. Solution in Java
Checking that the string has the pattern K* can always be done with Matcher.matches()/String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard part is splitting the input string into tokens that match pattern K.
If K* is equivalent to K*+
If the pattern K has the property:
For all strings [2], K* is equivalent to K*+, i.e. how the input string is split up into tokens that match pattern K is the same.
[2] You can define this condition for only the input strings you are operating on, but ensuring this pre-condition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.
Then a one-pass solution that solves the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and check that the last match found ends right at the end of the string. This is similar to your current solution, except that you do the boundary check with code.
The + after the quantifier * in K*+ makes the quantifier possessive. A possessive quantifier prevents the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has equivalent meaning, since it will also return the first possible match for the pattern K.
If K* is NOT equivalent to K*+
Without the property above, we need 2 passes to solve the problem. The first pass calls Matcher.matches()/String.matches() on the pattern K*. On the second pass:
If the string does not match pattern K*, we repeatedly use Matcher.find() on the pattern \GK until no more matches can be found. This can be done because of how we defined which prefix string to take when the input string does not match pattern K*.
If the string matches pattern K*, one solution is to repeatedly use Matcher.find() on the pattern \GK(?=K*$). This results in redundant work matching the rest of the input string, though.
Note that this solution is universally applicable for any K. In other words, it also applies for the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).
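The one-pass \GK idea can be sketched as below, under the stated assumption that K* and K*+ split the input the same way (which holds for K = <[0-9]>) and that K never matches the empty string. The class ContiguousTokens and the method tokenize are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContiguousTokens {
    // Returns the tokens if the whole input has the form K*, otherwise Optional.empty().
    // Assumes K* and K*+ split the input identically and K cannot match an empty string.
    static Optional<List<String>> tokenize(String k, String input) {
        // \G anchors each match at the end of the previous one (or at the start of the input).
        Matcher m = Pattern.compile("\\G(?:" + k + ")").matcher(input);
        List<String> tokens = new ArrayList<>();
        int lastEnd = 0;
        while (m.find()) {
            tokens.add(m.group());
            lastEnd = m.end();
        }
        // Boundary check in code: the last match must end exactly at the end of the input.
        return lastEnd == input.length() ? Optional.of(tokens) : Optional.empty();
    }

    public static void main(String[] args) {
        String k = "<[0-9]>";
        System.out.println(tokenize(k, "<1><2><3>").isPresent());  // true
        System.out.println(tokenize(k, "<1><2>a<3>").isPresent()); // false
        System.out.println(tokenize(k, "<1><2><3").isPresent());   // false
    }
}
```

Buffering the tokens in a list means nothing is reported for invalid input, at the cost of holding all tokens in memory until the end.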

Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one pass solution, but will not parse non-matching lines.
import java.util.regex.*;
class test{
public static void main(String args[]){
String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
Matcher m = pat.matcher(t);
while (m.find()) {
System.out.println("Matches!");
System.out.println(m.group());
}
}
}
The regex explained:
<\\d> --This is your k pattern as defined above
?= -- positive lookahead (check what is ahead of K)
<\\d>* -- Match k 0 or more times
$ -- End of line
?<= -- positive lookbehind (check what is behind K)
^ -- beginning of line
<\\d>* -- followed by 0 or more Ks
Regular expressions are beautiful things.
Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the answer. In fact, the implementation above can be improved with the knowledge in the answer: (<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).

Below is a one-pass solution using Matcher.start and Matcher.end.
boolean products(String products)
{
String regex = "<[0-9]>";
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(products);
int lastEnd = 0;
while (matcher.find())
{
if (lastEnd != matcher.start())
return false;
System.out.println(matcher.group());
lastEnd = matcher.end();
}
if (lastEnd != products.length())
return false;
return true;
}
The only disadvantage is that it will print out (or process) all values prior to finding invalid data.
For example, products("<1><2>a<3>"); will print out:
<1>
<2>
prior to returning false (because up until there the string is valid).
Either having this happen or having to store all of them temporarily seems to be unavoidable.

String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)*");
Matcher m = pat.matcher(t);
if (m.matches()) {
//String[] tt = t.split("(?<=>)"); // Look behind on '>'
String[] tt = t.split("(?<=(<\\d>))"); // Look behind on K
}

How to find the exact word using a regex in Java?

Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.

When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.

Full example method for matcher:
public static String REGEX_FIND_WORD="(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
String regex=String.format(REGEX_FIND_WORD, Pattern.quote(word));
return text.matches(regex);
}
Explanation:
(?i) - ignorecase
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be changed by String.format (quoted to avoid regex errors)
\b - word boundary
.*? - allow (optionally) any characters after
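For comparison, here is a hedged variant of the same check that uses find() instead of padding the regex with .*? on both sides. The class name WordSearch is invented for this sketch; Pattern.quote and Pattern.CASE_INSENSITIVE are standard.

```java
import java.util.regex.Pattern;

class WordSearch {
    // True if the text contains the word as a whole word, ignoring case.
    static boolean containsWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("Print this", "this"));     // true
        System.out.println(containsWord("Print THIS now", "this")); // true
        System.out.println(containsWord("Print thistle", "this"));  // false
        System.out.println(containsWord("Print\nthis", "this"));    // true
    }
}
```

One design difference: because the dot does not match line terminators by default, the matches()-based version returns false for text containing a newline, while the find()-based version does not depend on the dot at all.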

For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));

You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This statement consists of three groups, which are indexed from 0.
0th group - ABCD
1st group - BC
2nd group - C
So if you need to find a specific word, you can use two methods of the Matcher class: find() to locate the text specified by the regex, and then group() to get the String captured by a given group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The above code prints "beautiful".
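The group numbering for A(B(C))D described above can be checked with a short sketch. GroupNumbers and groups are hypothetical names; groupCount() and group(int) are the standard Matcher methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbers {
    // Returns group 0 (the whole match) followed by each capturing group,
    // or an empty array if the input does not match the regex as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1]; // groupCount() excludes group 0
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("A(B(C))D", "ABCD");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]); // 0: ABCD, then 1: BC, then 2: C
        }
    }
}
```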

Is your searchString going to be a regular expression? If not, simply use String.contains(CharSequence s).

System.out.println(input.matches(".*\\bthis$"));
This also works. Here the .* matches everything before the final word, and then this is required to be the last word in the input.
