I have a problem to get a regular expression to get work.
I use an XMLRPC Library to get information from an wiki.
so far so good.
After retrieving the data into a String Variable I would like to search through with a regular expression but the matcher will always return "false".
But if I asking the String ....contains("xyz"); the Answer is true.
The String looks something like this:
====== Datensicherheit ====== ''Kriterium von Sicherheit'' Typ: technisch Definition: \ //Allgemein.........
String regex = "Definition";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
System.out.println(matcher.matches());
Does anybody know what I'm doing wrong?
This is an issue with your regex expression. If you are wanting to know if the string contains "Definition", your regex needs to be:
String regex = ".*Definition.*";
Note that matches() returns true if, and only if, the entire region sequence matches this matcher's pattern. see the java doc # https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#matches()
So, it will only be true if the entire "text" region matches "Definition", which is unlikely :).
Try find() instead which is true if, and only if, a subsequence of the input sequence starting at the given index matches this matcher's pattern.
TL;DR
What are the design decisions behind Matcher's API?
Background
Matcher has a behaviour that I didn't expect and for which I can't find a good reason. The API documentation says:
Once created, a matcher can be used to perform three different kinds of match operations:
[...]
Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
What the API documentation further says is:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
Example
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
System.out.println(matcher.group("foo")); // (1)
System.out.println(matcher.group("bar"));
This code throws a
java.lang.IllegalStateException: No match found
at (1). To get around this, it is necessary to call matches() or other methods that bring the Matcher into a state that allows group(). The following works:
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
matcher.matches(); // (2)
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
Adding the call to matches() at (2) sets the Matcher into the proper state to call group().
Question, probably not constructive
Why is this API designed like this? Why not automatically match when the Matcher is build with Patter.matcher(String)?
Actually, you misunderstood the documentation. Take a 2nd look at the statement you quoted: -
attempting to query any part of it before a successful match will cause an
IllegalStateException to be thrown.
A matcher may throw IllegalStateException on accessing matcher.group() if no match was found.
So, you need to use following test, to actually initiate the matching process: -
- matcher.matches() //Or
- matcher.find()
The below code: -
Matcher matcher = pattern.matcher();
Just creates a matcher instance. This will not actually match a string. Even if there was a successful match.
So, you need to check the following condition, to check for successful matches: -
if (matcher.matches()) {
// Then use `matcher.group()`
}
And if the condition in the if returns false, that means nothing was matched. So, if you use matcher.group() without checking this condition, you will get IllegalStateException if the match was not found.
Suppose, if Matcher was designed the way you are saying, then you would have to do a null check to check whether a match was found or not, to call matcher.group(), like this: -
The way you think should have been done:-
// Suppose this returned the matched string
Matcher matcher = pattern.matcher(s);
// Need to check whether there was actually a match
if (matcher != null) { // Prints only the first match
System.out.println(matcher.group());
}
But, what if, you want to print any further matches, since a pattern can be matched multiple times in a String, for that, there should be a way to tell the matcher to find the next match. But the null check would not be able to do that. For that you would have to move your matcher forward to match the next String. So, there are various methods defined in Matcher class to serve the purpose. The matcher.find() method matches the String till all the matches is found.
There are other methods also, that match the string in a different way, that depends on you how you want to match. So its ultimately on Matcher class to do the matching against the string. Pattern class just creates a pattern to match against. If the Pattern.matcher() were to match the pattern, then there has to be some way to define various ways to match, as matching can be in different ways. So, there comes the need of Matcher class.
So, the way it actually is: -
Matcher matcher = pattern.matcher(s);
// Finds all the matches until found by moving the `matcher` forward
while(matcher.find()) {
System.out.println(matcher.group());
}
So, if there are 4 matches found in the string, your first way, would print only the first one, while the 2nd way will print all the matches, by moving the matcher forward to match the next pattern.
I Hope that makes it clear.
The documentation of Matcher class describes the use of the three methods it provides, which says: -
A matcher is created from a pattern by invoking the pattern's matcher
method. Once created, a matcher can be used to perform three different
kinds of match operations:
The matches method attempts to match the entire input sequence
against the pattern.
The lookingAt method attempts to match the input sequence, starting
at the beginning, against the pattern.
The find method scans the input sequence looking for the next
subsequence that matches the pattern.
Unfortunately, I have not been able find any other official sources, saying explicitly Why and How of this issue.
My answer is very similar to Rohit Jain's but includes some reasons why the 'extra' step is necessary.
java.util.regex implementation
The line:
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
causes a new Pattern object to be allocated, and it internally stores a structure representing the RE - information such as a choice of characters, groups, sequences, greedy vs. non-greedy, repeats and so on.
This pattern is stateless and immutable, so it can be reused, is multi-theadable and optimizes well.
The lines:
String s = "foo=23,bar=42";
Matcher matcher = p.matcher(s);
returns a new Matcher object for the Pattern and String - one that has not yet read the String. Matcher is really just a state machine's state, where the state machine is the Pattern.
The matching can be run by stepping the state machine through the matching process using the following API:
lookingAt(): Attempts to match the input sequence, starting at the beginning, against the pattern
find(): Scans the input sequence looking for the next subsequence that matches the pattern.
In both cases, the intermediate state can be read using the start(), end(), and group() methods.
Benefits of this approach
Why would anyone want to do step through the parsing?
Get values from groups that have quantification greater than 1 (i.e. groups that repeat and end up matching more than once). For example in the trivial RE below that parses variable assignments:
Pattern p = new Pattern("([a-z]=([0-9]+);)+");
Matcher m = p.matcher("a=1;b=2;x=3;");
m.matches();
System.out.println(m.group(2)); // Only matches value for x ('3') - not the other values
See the section on "Group name" in "Groups and capturing" the JavaDoc on Pattern
The developer can use the RE as a lexer and the developer can bind the lexed tokens to a parser. In practice, this would work for simple domain languages, but regular expressions are probably not the way to go for a full-blown computer language. EDIT This is partly related to the previous reason, but it can frequently be easier and more efficient to create the parse tree processing the text than lexing all the input first.
(For the brave-hearted) you can debug REs and find out which subsequence is failing to match (or incorrectly matching).
However, on most occasions you do not need to step the state machine through the matching, so there is a convenience method (matches) which runs the pattern matching to completion.
If a matcher would automatically match the input string, that would be wasted effort in case you wish to find the pattern.
A matcher can be used to check if the pattern matches() the input string, and it can be used to find() the pattern in the input string (even repeatedly to find all matching substrings). Until you call one of these two methods, the matcher does not know what test you want to perform, so it cannot give you any matched groups. Even if you do call one of these methods, the call may fail - the pattern is not found - and in that case a call to group must fail as well.
This is expected and documented.
The reason is that .matches() returns a boolean indicating if there was a match. If there was a match, then you can call .group(...) meaningfully. Otherwise, if there's no match, a call to .group(...) makes no sense. Therefore, you should not be allowed to call .group(...) before calling matches().
The correct way to use a matcher is something like the following:
Matcher m = p.matcher(s);
if (m.matches()) {
...println(matcher.group("foo"));
...
}
My guess is the design decision was based on having queries that had clear, well defined semantics that didn't conflate existence with match properties.
Consider this: what would you expect Matcher queries to return if the matcher has not successfully matched something?
Let's first consider group(). If we haven't successfully matched something, Matcher shouldn't return the empty string, as it hasn't matched the empty string. We could return null at this point.
Ok, now let's consider start() and end(). Each return int. What int value would be valid in this case? Certainly no positive number. What negative number would be appropriate? -1?
Given all this, a user is still going to have to check return values for every query to verify if a match occurred or not. Alternatively, you could check to see if it matches successfully outright, and if successful, the query semantics all have well-defined meaning. If not, the user gets consistent behaviour no matter which angle is queried.
I'll grant that re-using IllegalStateException may not have resulted in the best description of the error condition. But if we were to rename/subclass IllegalStateException to NoSuccessfulMatchException, one should be able to appreciate how the current design enforces query consistency and encourages the user to use queries that have semantics that are known to be defined at the time of asking.
TL;DR: What is value of asking the specific cause of death of a living organism?
You need to check the return value of matcher.matches(). It will return true when a match was found, false otherwise.
if (matcher.matches()) {
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
}
If matcher.matches() does not find a match and you call matcher.group(...), you'll still get an IllegalStateException. That's exactly what the documentation says:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
When matcher.match() returns false, no successful match has been found and it doesn't make a lot of sense to get information on the match by calling for example group().
Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
Full example method for matcher:
public static String REGEX_FIND_WORD="(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
String regex=String.format(REGEX_FIND_WORD, Pattern.quote(word));
return text.matches(regex);
}
Explain:
(?i) - ignorecase
.*? - allow (optionally) any characters before
\b - word boundary
%s - variable to be changed by String.format (quoted to avoid regex
errors)
\b - word boundary
.*? - allow (optionally) any characters after
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
You may use groups to find the exact word. Regex API specifies groups by parentheses. For example:
A(B(C))D
This statement consists of three groups, which are indexed from 0.
0th group - ABCD
1st group - BC
2nd group - C
So if you need to find some specific word, you may use two methods in Matcher class such as: find() to find statement specified by regex, and then get a String object specified by its group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The above code result will be "beautiful"
Is your searchString going to be regular expression? if not simply use String.contains(CharSequence s)
System.out.println(input.matches(".*\\bthis$"));
Also works. Here the .* matches anything before the space and then this is matched to be word in the end.
I am getting a text from the DB which contains Strings of the form
CO<sub>2</sub>
In order to recognize this I wrote the following code
String footText = "... some text containing CO<sub>2</sub>";
String co2HTML = "CO<sub>2</sub>";
Pattern pat = Pattern.compile(co2HTML);
Matcher mat = pat.matcher(footText);
final boolean hasCO2 = mat.matches();
The problem is that hasCO2 is always false although the inout text has that substring.
What is wrong hete?
Thanks!
You should use find() instead of matches(), since the latter tries to match the entire string against the pattern rather than perform a search.
From the Javadoc:
The matches method attempts to match the entire input sequence against the pattern.
The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
The find method scans the input sequence looking for the next subsequence that matches the pattern.
Also, the pattern in question doesn't really require regular expressions; you could use String.indexOf() to perform the search.
I can't figure out why this regex doesn't work, I've tested it in php and other regex engines where it works fine and matches ",AA,".
Pattern p = Pattern.compile("(^|,)AA(,|$)");
Matcher m = p.matcher("A,B,AA,C,D");
//assigns as false
boolean matches = m.matches();
Side note: I have a split/array binary search method for doing an IN_SET / NOT_IN_SET search against the string. This is just an example I need to get working before implementing regex as another comparing option.
matches() validates the entire string. You want to use find() instead.
From the API:
matches()
Attempts to match the entire region against the pattern.
-- http://download.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#matches()
and:
find()
Attempts to find the next subsequence of the input sequence that matches the pattern.
-- http://download.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#find()
Matcher matches the entire region against the pattern. Use find().