How do I avoid the implicit "^" and "$" in Java regular expression matching?


I've been struggling with doing some relatively straightforward regular expression matching in Java 1.4.2. I'm much more comfortable with the Perl way of doing things. Here's what's going on:
I am attempting to match /^<foo>/ from "<foo><bar>"
I try:
Pattern myPattern= Pattern.compile("^<foo>");
Matcher myMatcher= myPattern.matcher("<foo><bar>");
System.out.println(myMatcher.matches());
And I get "false"
I am used to saying:
print "<foo><bar>" =~ /^<foo>/;
which does indeed return true.
After much searching and experimentation, I discovered this which said:
"The String method further optimizes its search criteria by placing an invisible ^ before the pattern and a $ after it."
When I tried:
Pattern myPattern= Pattern.compile("^<foo>.*");
Matcher myMatcher= myPattern.matcher("<foo><bar>");
System.out.println(myMatcher.matches());
then it returns the expected true. I do not want that pattern though. The terminating .* should not be necessary.
Then I discovered the Matcher.useAnchoringBounds(boolean) method. I thought that expressly telling it to not use the anchoring bounds would work. It did not. I tried issuing a
myMatcher.reset();
in case I needed to flush it after turning the attribute off. No luck. Subsequently calling .matches() still returns false.
What have I overlooked?
Edit:
Well, that was easy, thanks.

Use the Matcher find method (instead of the matches method)

Matcher.useAnchoringBounds() was added in JDK 1.5, so if you are using 1.4, I'm not sure that it would help you even if it did work (notice the @since 1.5 in the Javadocs).
The Javadocs for Matcher also state that the matches() method:
Attempts to match the entire region against the pattern.
(emphasis mine)
Which explains why you only got .matches() == true when you changed the pattern to end with .*.
To match against the region starting at the beginning, but not necessarily requiring that the entire region be matched, use either the find() or lookingAt() methods.
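As a rough sketch of the difference between the three methods, using the pattern and input from the question (the class name AnchorDemo and its helper methods are made up for illustration):

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    // The pattern from the question, with its explicit ^ anchor.
    static final Pattern FOO = Pattern.compile("^<foo>");

    // matches(): the pattern must consume the entire input.
    static boolean wholeInput(String input) {
        return FOO.matcher(input).matches();
    }

    // lookingAt(): the match must start at the beginning but may stop early.
    static boolean prefix(String input) {
        return FOO.matcher(input).lookingAt();
    }

    // find(): the match may start anywhere; here the ^ pins it to the start.
    static boolean anywhere(String input) {
        return FOO.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "<foo><bar>";
        System.out.println(wholeInput(input)); // false
        System.out.println(prefix(input));     // true
        System.out.println(anywhere(input));   // true
    }
}
```

All three methods exist in Java 1.4, so this should not depend on useAnchoringBounds().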

If you examine the "match", what part of the input string do you expect to find?
In other words,
Matcher myMatcher= myPattern.matcher("<foo><bar>");
if (myMatcher.matches()) {
    System.out.println(myMatcher.group(0));
}
… should print what?
If you are expecting it to print just "<foo>", use the find() method on Matcher instead of matches(). If you really want to find matches when the input starts with "<foo>", then you need to explicitly indicate that with a '^'.
If you are expecting it to match "<foo><bar>", you need to include the trailing ".*".

Related

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?

The issue you are facing with (command1|command2|command3).+; is that the + is greedy, meaning that it matches as much as it can, all the way to the last semi-colon.
To fix this, make it non-greedy by adding the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies to the * operator. Adding a ? makes it non-greedy.

Tell it to find only non-semicolons.
[^;]+
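To compare the greedy pattern with the two suggested fixes, here is a small sketch. The sample input and the helper firstMatch are assumptions for illustration; the question does not show its actual data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstSemicolon {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "command1 a b; command2 c;"; // made-up sample input

        // Greedy: runs on to the last semi-colon.
        System.out.println(firstMatch("(command1|command2|command3).+;", input));
        // Reluctant: stops at the first semi-colon.
        System.out.println(firstMatch("(command1|command2|command3).+?;", input));
        // Negated class: cannot cross a semi-colon at all.
        System.out.println(firstMatch("(command1|command2|command3)[^;]+;", input));
    }
}
```

On this input the first call returns the whole string, while the other two return only "command1 a b;".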

What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier makes it match as little as possible, instead of as much as possible, which is the default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

Using Regex in Android Studio can't get the output [duplicate]

TL;DR
What are the design decisions behind Matcher's API?
Background
Matcher has a behaviour that I didn't expect and for which I can't find a good reason. The API documentation says:
Once created, a matcher can be used to perform three different kinds of match operations:
[...]
Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
What the API documentation further says is:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
Example
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
System.out.println(matcher.group("foo")); // (1)
System.out.println(matcher.group("bar"));
This code throws a
java.lang.IllegalStateException: No match found
at (1). To get around this, it is necessary to call matches() or other methods that bring the Matcher into a state that allows group(). The following works:
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
matcher.matches(); // (2)
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
Adding the call to matches() at (2) sets the Matcher into the proper state to call group().
Question, probably not constructive
Why is this API designed like this? Why not automatically match when the Matcher is built with Pattern.matcher(String)?

Actually, you misunderstood the documentation. Take a 2nd look at the statement you quoted: -
attempting to query any part of it before a successful match will cause an
IllegalStateException to be thrown.
A matcher may throw IllegalStateException on accessing matcher.group() if no match was found.
So, you need to use following test, to actually initiate the matching process: -
- matcher.matches() //Or
- matcher.find()
The below code: -
Matcher matcher = pattern.matcher(s);
Just creates a matcher instance. This does not actually attempt a match against the string, even if the string would match.
So, you need to check the following condition to see whether the match succeeded: -
if (matcher.matches()) {
    // Then use `matcher.group()`
}
If the condition in the if returns false, nothing was matched. So, if you call matcher.group() without checking this condition, you will get an IllegalStateException whenever no match was found.
Suppose Matcher were designed the way you suggest. Then you would need a null check to find out whether a match was found before calling matcher.group(), like this: -
The way you think it should have been done: -
// Suppose this returned the matched string
Matcher matcher = pattern.matcher(s);
// Need to check whether there was actually a match
if (matcher != null) { // Prints only the first match
    System.out.println(matcher.group());
}
But what if you want to print further matches? A pattern can match several times in a String, so there has to be a way to tell the matcher to find the next match, and a null check cannot do that. You have to move the matcher forward to the next match, and the Matcher class defines methods for exactly that purpose. Each call to matcher.find() looks for the next match, until no more are found.
There are other methods that match the string in different ways, depending on how you want to match. So it is ultimately up to the Matcher class to do the matching against the string; the Pattern class just creates a pattern to match against. If Pattern.matcher() itself performed the match, there would have to be some way to choose among the different kinds of matching. Hence the need for the Matcher class.
So, the way it actually is: -
Matcher matcher = pattern.matcher(s);
// Finds every match by moving the `matcher` forward until none is left
while (matcher.find()) {
    System.out.println(matcher.group());
}
So, if there are 4 matches in the string, your first way would print only the first one, while the second way prints all of them by moving the matcher forward to the next match.
I hope that makes it clear.
The documentation of Matcher class describes the use of the three methods it provides, which says: -
A matcher is created from a pattern by invoking the pattern's matcher
method. Once created, a matcher can be used to perform three different
kinds of match operations:
The matches method attempts to match the entire input sequence
against the pattern.
The lookingAt method attempts to match the input sequence, starting
at the beginning, against the pattern.
The find method scans the input sequence looking for the next
subsequence that matches the pattern.
Unfortunately, I have not been able to find any other official source that explicitly explains the why and how of this design.

My answer is very similar to Rohit Jain's but includes some reasons why the 'extra' step is necessary.
java.util.regex implementation
The line:
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
causes a new Pattern object to be allocated, which internally stores a structure representing the RE: information such as character choices, groups, sequences, greedy vs. non-greedy quantifiers, repeats and so on.
This pattern is stateless and immutable, so it can be reused, is safe to share between threads and optimizes well.
The lines:
String s = "foo=23,bar=42";
Matcher matcher = p.matcher(s);
returns a new Matcher object for the Pattern and String - one that has not yet read the String. Matcher is really just a state machine's state, where the state machine is the Pattern.
The matching can be run by stepping the state machine through the matching process using the following API:
lookingAt(): Attempts to match the input sequence, starting at the beginning, against the pattern
find(): Scans the input sequence looking for the next subsequence that matches the pattern.
In both cases, the intermediate state can be read using the start(), end(), and group() methods.
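A minimal sketch of reading that state after each step, assuming a modern Java version (the class MatchPositions and its describe helper are invented for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositions {
    // Collects "start-end:text" for every match of the regex in the input.
    static List<String> describe(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each successful find() advances the matcher and refreshes its state.
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [4-6:23, 11-13:42]
        System.out.println(describe("[0-9]+", "foo=23,bar=42"));
    }
}
```

Note that end() is exclusive, so "23" occupies indices 4 and 5.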
Benefits of this approach
Why would anyone want to step through the parsing?
Get values from groups that have quantification greater than 1 (i.e. groups that repeat and end up matching more than once). For example in the trivial RE below that parses variable assignments:
Pattern p = Pattern.compile("([a-z]=([0-9]+);)+");
Matcher m = p.matcher("a=1;b=2;x=3;");
m.matches();
System.out.println(m.group(2)); // Only matches value for x ('3') - not the other values
See the section on "Group name" under "Groups and capturing" in the JavaDoc for Pattern.
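To make the point concrete, here is a sketch that contrasts the repeated group with a find() loop over a single assignment. The class RepeatedGroups and the simplified inner pattern are my own additions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroups {
    // With a repeated group, only the last repetition's capture survives.
    static String lastValue(String input) {
        Matcher m = Pattern.compile("([a-z]=([0-9]+);)+").matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    // Stepping with find() over one assignment at a time keeps every value.
    static List<String> allValues(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = Pattern.compile("[a-z]=([0-9]+);").matcher(input);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "a=1;b=2;x=3;";
        System.out.println(lastValue(input)); // 3
        System.out.println(allValues(input)); // [1, 2, 3]
    }
}
```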
The developer can use the RE as a lexer and the developer can bind the lexed tokens to a parser. In practice, this would work for simple domain languages, but regular expressions are probably not the way to go for a full-blown computer language. EDIT This is partly related to the previous reason, but it can frequently be easier and more efficient to create the parse tree processing the text than lexing all the input first.
(For the brave-hearted) you can debug REs and find out which subsequence is failing to match (or incorrectly matching).
However, on most occasions you do not need to step the state machine through the matching, so there is a convenience method (matches) which runs the pattern matching to completion.

If a matcher automatically matched the input string, that would be wasted effort whenever you only want to find the pattern.
A matcher can be used to check if the pattern matches() the input string, and it can be used to find() the pattern in the input string (even repeatedly to find all matching substrings). Until you call one of these two methods, the matcher does not know what test you want to perform, so it cannot give you any matched groups. Even if you do call one of these methods, the call may fail - the pattern is not found - and in that case a call to group must fail as well.

This is expected and documented.
The reason is that .matches() returns a boolean indicating if there was a match. If there was a match, then you can call .group(...) meaningfully. Otherwise, if there's no match, a call to .group(...) makes no sense. Therefore, you should not be allowed to call .group(...) before calling matches().
The correct way to use a matcher is something like the following:
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println(m.group("foo"));
    // ...
}

My guess is the design decision was based on having queries that had clear, well defined semantics that didn't conflate existence with match properties.
Consider this: what would you expect Matcher queries to return if the matcher has not successfully matched something?
Let's first consider group(). If we haven't successfully matched something, Matcher shouldn't return the empty string, as it hasn't matched the empty string. We could return null at this point.
Ok, now let's consider start() and end(). Each returns an int. What int value would be valid in this case? Certainly no positive number. What negative number would be appropriate? -1?
Given all this, a user is still going to have to check return values for every query to verify if a match occurred or not. Alternatively, you could check to see if it matches successfully outright, and if successful, the query semantics all have well-defined meaning. If not, the user gets consistent behaviour no matter which angle is queried.
I'll grant that re-using IllegalStateException may not have resulted in the best description of the error condition. But if we were to rename/subclass IllegalStateException to NoSuccessfulMatchException, one should be able to appreciate how the current design enforces query consistency and encourages the user to use queries that have semantics that are known to be defined at the time of asking.
TL;DR: What is the value of asking for the specific cause of death of a living organism?

You need to check the return value of matcher.matches(). It will return true when a match was found, false otherwise.
if (matcher.matches()) {
    System.out.println(matcher.group("foo"));
    System.out.println(matcher.group("bar"));
}
If matcher.matches() does not find a match and you call matcher.group(...), you'll still get an IllegalStateException. That's exactly what the documentation says:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
When matcher.matches() returns false, no successful match has been found, and it doesn't make much sense to ask for information about the match by calling, for example, group().
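For illustration, a small sketch of both situations: guarding group() with matches(), and calling group() on a fresh matcher. The class GuardedGroups and its method names are hypothetical; named groups need Java 7 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedGroups {
    // The pattern from the question.
    static final Pattern P =
            Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");

    // Returns the "foo" group, or null when the input does not match.
    static String fooOrNull(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group("foo") : null;
    }

    // Reports whether group() throws when no match has been attempted yet.
    static boolean throwsBeforeMatch(String input) {
        Matcher m = P.matcher(input);
        try {
            m.group("foo");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(fooOrNull("foo=23,bar=42"));         // 23
        System.out.println(fooOrNull("something else"));        // null
        System.out.println(throwsBeforeMatch("foo=23,bar=42")); // true
    }
}
```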

String matches() not able to pick ^ [duplicate]

trivial regex question (the answer is most probably Java-specific):
"#This is a comment in a file".matches("^#")
This returns false. As far as I can see, ^ means what it always means and # has no special meaning, so I'd translate ^# as "A '#' at the beginning of the string". Which should match. And so it does, in Perl:
perl -e "print '#This is a comment'=~/^#/;"
prints "1". So I'm pretty sure the answer is something Java specific. Would somebody please enlighten me?
Thank you.

Matcher.matches() checks to see if the entire input string is matched by the regex.
Since your regex only matches the very first character, it returns false.
You'll want to use Matcher.find() instead.
Granted, it can be a bit tricky to find the concrete specification, but it's there:
String.matches() is defined as doing the same thing as Pattern.matches(regex, str).
Pattern.matches() in turn is defined as Pattern.compile(regex).matcher(input).matches().
Pattern.compile() returns a Pattern.
Pattern.matcher() returns a Matcher
Matcher.matches() is documented like this (emphasis mine):
Attempts to match the entire region against the pattern.
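The chain above can be checked with a short sketch (the class MatchesEquivalents and its method names are made up for this example):

```java
import java.util.regex.Pattern;

public class MatchesEquivalents {
    // The three documented, equivalent ways to ask for a whole-string match.
    static boolean viaString(String regex, String input) {
        return input.matches(regex);
    }

    static boolean viaPattern(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    static boolean viaMatcher(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() only needs some part of the input to match.
    static boolean viaFind(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "#This is a comment in a file";
        System.out.println(viaString("^#", line));  // false
        System.out.println(viaPattern("^#", line)); // false
        System.out.println(viaMatcher("^#", line)); // false
        System.out.println(viaFind("^#", line));    // true
    }
}
```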

The matches method matches your regex against the entire string.
So try adding a .* to match rest of the string.
"#This is a comment in a file".matches("^#.*")
which returns true. One can even drop all anchors (both start and end) from the regex, because the matches method already requires the whole string to match. So in the above case we could also have used "#.*" as the regex.

This should meet your expectations:
"#This is a comment in a file".matches("^#.*$")
Now the input String matches the pattern "First char shall be #, the rest shall be any char"
Following Joachim's comment, the following is equivalent:
"#This is a comment in a file".matches("#.*")

string.matches(regex) returns false, although I think it should be true

I am working with Java regular expressions.
Oh, I really miss Perl!! Java regular expressions are so hard.
Anyway, below is my code.
oneLine = "{\"kind\":\"list\",\"items\"";
System.out.println(oneLine.matches("kind"));
I expected "true" to be shown on the screen, but I could only see "false".
What's wrong with the code? And how can I fix it?
Thank you in advance!!

String#matches() takes a regex as its parameter and treats the anchors as implicit. So your pattern has to match from the beginning to the end of the string.
Since your string is not exactly "kind", it returns false.
Now, as per your current problem, I think you don't need to use regex here. Simply using String#contains() method will work fine: -
oneLine.contains("kind");
Or, if you want to use matches, then build the regex to match complete string: -
oneLine.matches(".*kind.*");
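A quick sketch comparing the options (the class ContainsKind and its helpers are invented here). One caveat worth knowing: by default . does not match line terminators, so the ".*kind.*" form can fail on multi-line input where contains() and find() still succeed.

```java
import java.util.regex.Pattern;

public class ContainsKind {
    // Plain substring test, no regex involved.
    static boolean viaContains(String s) {
        return s.contains("kind");
    }

    // Whole-string regex match, padded with .* on both sides.
    static boolean viaMatches(String s) {
        return s.matches(".*kind.*");
    }

    // Regex search for the pattern anywhere in the input.
    static boolean viaFind(String s) {
        return Pattern.compile("kind").matcher(s).find();
    }

    public static void main(String[] args) {
        String oneLine = "{\"kind\":\"list\",\"items\"";
        System.out.println(viaContains(oneLine)); // true
        System.out.println(viaMatches(oneLine));  // true
        System.out.println(viaFind(oneLine));     // true

        // With a line break in the input, the .* version no longer matches.
        System.out.println(viaMatches("a\nkind\nb")); // false
    }
}
```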

The .matches method is intended to match the entire string. So you need something like:
.*kind.*
Demo: http://ideone.com/Gb5MQZ

matches() tries to match the whole string (implicit ^ and $ anchors); to check for part of the string, use contains().
