Why does the regex ".*" give two matches against "abcd 1234 abcd"? - java

Why do I get two matches when using the regular expression .* on the string abcd 1234 abcd? See https://regex101.com/r/rV8jfz/1.
From the explanation given by regex101, I can see that the second match happens at position 14-14 and the matched value is null (empty). But why is there a second match? Is there a way to avoid it?
I understand that .* means zero or more of any character, so it also tries to match zero occurrences. But I don't understand why this empty match is needed.
The problem is that in any language (e.g. Java), when I do while(matcher.find()) { ... }, the loop runs twice, while I want it to run only once.
I know this may not be a real-world matching situation, but I see it as a good case for understanding and exploring regex.
Edit - following @terdon's response.
I'd like to keep the /g option in regex101; I am aware of it. I would like to know the total number of possible matches.
https://regex101.com/r/EvOoAr/1 -> the pattern abcd against the string abcd 1234 abcd gives two matches, and I want to know this information.
The problem I find is when dealing with this in a language like Java -
Ref - https://onecompiler.com/java/3xnax494k
String str = "abcd 1234 abcd";
Pattern p = Pattern.compile(".*");
Matcher matcher = p.matcher(str);
int matchCount=0;
while(matcher.find()) {
matchCount++;
System.out.println("match number: " + matchCount);
System.out.println("matcher.groupCount(): " + matcher.groupCount());
System.out.println("matcher.group(): " + matcher.group());
}
The output is -
match number: 1
matcher.groupCount(): 0 //you can ignore this
matcher.group(): abcd 1234 abcd
match number: 2
matcher.groupCount(): 0
matcher.group(): //this is my concern. The program has to deal with this "nothing" match somehow.
It would be nice for me as a programmer if find() did not match against "nothing". As it is, I have to add extra code in the loop to catch this "nothing" case.
This empty-match problem (in code) gets even worse with this regex case - https://regex101.com/r/5HuJ0R/1 -> [0-9]* against abcd 1234 abcd gives 12 matches.

The reason you get two matches is that you are using the g (global) flag. If you remove it from your regex101 example, you will get only one match.
This happens because the global flag makes the regex engine try to find as many matches in the string as possible. Since the expression .* matches zero or more characters, it also matches the empty string. Therefore, the first match is the entire string, and the second match is the empty string that remains at the end of the input. Removing the g makes the engine stop at the first match, the entire string, and not look for others.
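Java has no g flag; calling find() in a loop plays the same role, which is why the loop in the question runs twice. Below is a minimal sketch of two ways to avoid the empty match: skip zero-length matches inside the loop, or use + instead of * so the pattern cannot match an empty string. The class and method names are illustrative assumptions, not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EmptyMatchDemo {

    // Collects every match that find() reports, including empty ones.
    static List<String> allMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Same loop, but zero-length matches are skipped.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                continue; // empty match, e.g. the one at the end of the input
            }
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "abcd 1234 abcd";
        System.out.println(allMatches(".*", str).size());      // includes the empty match
        System.out.println(nonEmptyMatches(".*", str).size()); // empty match filtered out
        System.out.println(allMatches(".+", str).size());      // + cannot match "nothing"
        System.out.println(allMatches("[0-9]+", str));
    }
}
```

Which option fits depends on whether the pattern can be changed: filtering keeps the original regex, while + (or [0-9]+) removes the empty matches at the source.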

Related

Can't get rid of the empty first line when I use a regex for polynomials

here is my code
String a = "X^5+2X^2+3X^3+4X^4";
String exp[]=a.split("(|\\+\\d)[xX]\\^");
for(int i=0;i<exp.length;i++) {
System.out.println("exp: "+exp[i]+" ");
}
I'm trying to get the output 5,2,3,4,
but instead I got this:
exp:
exp:5
exp:2
exp:3
exp:4
I don't know where the empty first line comes from, and I cannot find a way to get rid of it. I tried other regexes and also tried using compile, but I still can't get rid of the first line. When I try a new string, "X+X^5+2X^2+3X^3+4X^4", the first line shows exp:X.
I also tried an online regex tester, and its answer is 5,2,3,4, but Eclipse gives an empty line and then 5,2,3,4. I need help figuring this out.
Try to use regex, e.g:
String input = "X^5+2X^2+3X^3+4X^4";
Pattern pattern = Pattern.compile("\\^([0-9]+)");
Matcher matcher = pattern.matcher(input);
for (int i = 1; matcher.find(); i++) {
System.out.println("exp: " + matcher.group(1));
}
It gives output:
exp: 5
exp: 2
exp: 3
exp: 4
How does it work:
Pattern used: \^([0-9]+)
It matches a ^ followed by one or more digits (note the + sign). The caret (^) is prefixed with a backslash (\) because it has a special meaning in regular expressions (beginning of a string), but in your case you just want an exact match of a ^ character.
We wrap the part we want in a group so that we can refer to it later. That means marking it with parentheses ( and ).
Then we put the pattern into a Java String. In a String literal, the \ character has a special meaning: it starts an escape sequence, e.g. "\n" represents a new line. So if we put our pattern into a String literal, we need to escape the \, and the pattern becomes "\\^([0-9]+)". Note the double \.
Next we iterate through all matches, getting group 1, which is our number. Note that the ^ character is not included in group 1 even though it is part of the pattern. That is because we used parentheses to mark the group we are looking for, which in our case contains only the digits.
The empty first element appears because you are using the split method, which looks for occurrences of the regex and splits the string at each one. Your string starts with X^, which matches your regex, so the empty text before that match becomes the first element.
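If keeping split() is preferred, one option is to drop the empty elements afterwards. This is only a sketch under that assumption; the class and method names are made up for illustration, and the regex is the one from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class SplitDemo {

    // Splits with the regex from the question and discards empty pieces,
    // such as the one produced by the match at the very start of the input.
    static List<String> exponents(String polynomial) {
        return Arrays.stream(polynomial.split("(|\\+\\d)[xX]\\^"))
                .filter(piece -> !piece.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String a = "X^5+2X^2+3X^3+4X^4";
        // Raw split result: the first element is the empty string before "X^".
        System.out.println(Arrays.asList(a.split("(|\\+\\d)[xX]\\^")));
        // Filtered result.
        System.out.println(exponents(a));
    }
}
```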

How to create a Regex to find exact String length?

Having these cases:
12345678901234
123456789012345
1234567890123456
12345678901234567
I need to find the String which is exactly 15 characters long.
So far I have written this code:
String pattern = "(([0-9]){15})";
Matcher m = Pattern.compile(pattern).matcher(input); // input is one of the cases above
if (m.find()) {
System.out.println(m.group(1));
}
The results were like this:
12345678901234 (not found which is GOOD)
123456789012345 (found which is GOOD)
1234567890123456 (found which is NOT GOOD)
12345678901234567 (found which is NOT GOOD)
How can I create a regex that matches exactly 15 digits, as I thought this one would? More than 15 is not acceptable.
Mark the start and the end of the string using the ^ and $ anchors:
String pattern = "^([0-9]{15})$";
^ matches the position at the beginning of the string
$ matches the position at the end of the string
Without these anchors, you're only looking for 15 consecutive digits anywhere within the string. Matching strings can additionally have more digits (or even contain letters), though, and still match.
(Also, your inner pair of parentheses is superfluous, so I've removed it. If you're accessing the value of the entire match rather than the value captured by the first group, you can even omit the other parentheses: "^[0-9]{15}$")
Just add a start and end to your regex:
^(([0-9]){15})$
The ^ means "beginning of string"
The $ means "end of string"
Therefore, the string must consist of exactly 15 digits.
For more regex operators in Java, see the Pattern documentation
Simply use matches() instead of find(); matches() succeeds only if the entire input matches the pattern.
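The difference between the two calls can be sketched as follows, using the four inputs from the question. The class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ExactLengthDemo {

    private static final Pattern FIFTEEN_DIGITS = Pattern.compile("[0-9]{15}");

    // matches() requires the whole input to match, so no ^ and $ are needed.
    static boolean isExactly15Digits(String s) {
        return FIFTEEN_DIGITS.matcher(s).matches();
    }

    // find() only requires 15 consecutive digits somewhere in the input.
    static boolean contains15Digits(String s) {
        return FIFTEEN_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] cases = {
            "12345678901234",
            "123456789012345",
            "1234567890123456",
            "12345678901234567"
        };
        for (String s : cases) {
            System.out.println(s + " matches=" + isExactly15Digits(s)
                    + " find=" + contains15Digits(s));
        }
    }
}
```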

Java regexp grouping and + operator (Obtaining multiple values of a group)

I was wondering if it is possible to obtain all the matches of a group with a + operator in a Java regular expression.
Example code:
public static void main(String[] args) {
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Regular expression Match: "+ m.group(0));
System.out.println("Group 1: "+ m.group(1));
System.out.println("Group 2: "+ m.group(2));
}
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Although group 2 matched three times ("First match", "second match", "third match") because of the second + operator in the regexp, we can access only the last one through m.group(2).
My question is:
Is there a way to access the other hits of group 2 in that expression, or when a + operator causes a group to match multiple times, can only the last one be accessed?
Thanks.
As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*|\\s*,\\s*"); // the delimiters also swallow surrounding whitespace
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.
You asked:
Is there a way to access the other hits of group 2 in that expression, or when a + operator causes a group to match multiple times, can only the last one be accessed?
The answer is no: if the same group matches text multiple times, you can access only the last matched text.
There are of course other ways to return multiple matches.
I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
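The documentation's "aba" example can be reproduced with a short sketch. The class and method names here are assumptions for illustration; only the regex and the input come from the quoted text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class RepeatedGroupDemo {

    // Returns group 0..n of a full match, or an empty array if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Group 1 iterates over "ab" and then "a"; only the last value is kept.
        // Group 2 keeps "b" from the first iteration because it does not
        // participate in the second one.
        String[] g = groups("(a(b)?)+", "aba");
        System.out.println("group 1: " + g[1]);
        System.out.println("group 2: " + g[2]);
    }
}
```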
Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as you might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but still it's usually easier to use multiple matches, like the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in group(1).
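A minimal sketch of how that regex could be used with a find() loop on the input from the question. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ContinuedMatchDemo {

    // Either "Start:" or a comma that begins exactly where the previous match ended.
    private static final Pattern ITEM =
            Pattern.compile("(?:Start:|\\G,)\\s*([\\w\\s]+)");

    // Collects group 1 of every successive match.
    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Start: First match, second match, third match.";
        for (String item : items(input)) {
            System.out.println("Matched: " + item);
        }
    }
}
```

Each call to find() yields one item, so all three values are available even though a single match only ever exposes one value per group.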

Regular expression to match pattern

I'm trying to get matches for commands like this:
[AUTR| <version_software> | <version_protocol> | <msg> ]
[PING]
What regular expression finds these matches for the first command?
AUTR
version_software
version_protocol
msg
This is the code that parses it:
String[] tokens = msg.replace('<',' ').replace('>',' ').replace('[', ' ').replace(']', ' ').split("\\|");
for (int i=0; i<tokens.length; i++) tokens[i] = tokens[i].trim();
I'm only wondering how it can be done with a regex solution.
EDIT:
I'm trying to match groups with simpler expressions. With this code the call to m.groupCount() returns one, but when I try to print the group it throws this exception: "java.lang.IllegalStateException: No match found"
Pattern pattern = Pattern.compile("([\\w+])");
Matcher m = pattern.matcher("[AUTR]");
for (int i=0; i<m.groupCount();i++)
{
System.out.println(m.group(i));
}
EDIT:
http://fiddle.re/6ykc
Regular Expression:
\[([\w]+)(\s*\|\s*<([\w. ]+)>\s*)*\]
Java Regex String:
"\\[([\\w]+)(\\s*\\|\\s*<([\\w. ]+)>\\s*)*\\]"
Note that this is for variable commands now and that all extra parameters must match the following character set [a-zA-Z_0-9. ] (Includes periods and spaces).
Issue: with variable-length commands, a repeated group cannot capture more than one value.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
EDIT 2:
In order to get all of them you can use two regular expressions, one to grab the command:
String command_regex = "\\[([\\w]+)";
Find that first, then find the parameters, using the <> as the key characters to select them:
String parameters = "<([\\w. ]+)>";
Pattern pattern = Pattern.compile(parameters);
Matcher matcher = pattern.matcher(string_to_match);
while (matcher.find()) {
System.out.println(matcher.group(1)); // group 1 is the text between < and >
}
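Put together, the two-regex approach might look like the sketch below. The class and method names are assumptions for illustration; the two patterns are the ones given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CommandParser {

    private static final Pattern COMMAND = Pattern.compile("\\[([\\w]+)");
    private static final Pattern PARAMETER = Pattern.compile("<([\\w. ]+)>");

    // Returns the command name after the opening bracket, or null if none is found.
    static String command(String msg) {
        Matcher m = COMMAND.matcher(msg);
        return m.find() ? m.group(1) : null;
    }

    // Returns every value enclosed in < and >, in order of appearance.
    static List<String> parameters(String msg) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAMETER.matcher(msg);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String msg = "[AUTR| <version_software> | <version_protocol> | <msg> ]";
        System.out.println(command(msg));
        System.out.println(parameters(msg));
        System.out.println(command("[PING]") + " " + parameters("[PING]"));
    }
}
```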
Hope that helps.
ORIGINAL:
Not exactly sure about the format: are the "<", ">" and "|" required? And what are the formats of the command, version_software, version_protocol and message? Here is my attempt at a regular expression, though (tested in Python):
\[(\w+)\s*\|\s*<([\w.]+)>\s*\|\s*<(\w+)>\s*\|\s*<([\w\s]+)>\s*\]
You need to make sure to escape the brackets and the pipe symbols. (I added \s* between the parts because I don't know whether there will be spaces or not.) If you do:
>>> m = re.search(r"expression above", line)
>>> m.groups()
It should give all tokens, in Python at least. I left it more hardcoded to allow room for adjustments on each token you want to grab; otherwise you could shorten the last three parts by making them a group and repeating it three times. Let me know the results.

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use the regex \\d{2}\\:\\d{2} I'm only able to extract the first hour, 06:30:
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: No group 1
matcher.group(1) throws an exception.
Also, I don't know how to extract 1234. This string can change, but it always comes after 'XX~ '.
Do you have any idea how to match these strings with regular expressions?
UPDATE
Thanks to Adam's suggestion I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I get the number and the 2 hours with matcher.group(1), matcher.group(2) and matcher.group(3).
The matcher.group() function takes a single integer argument: the capturing group index, starting from 1. The index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything matched within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by their opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
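For the two times in the question, the repeated find() approach could be sketched like this. The class and method names are illustrative assumptions; the pattern is the asker's own, without the unnecessary escape before the colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class TimesDemo {

    private static final Pattern TIME = Pattern.compile("\\d{2}:\\d{2}");

    // Calls find() until it returns false and collects each whole match.
    static List<String> times(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TIME.matcher(input);
        while (m.find()) {
            result.add(m.group(0)); // group(0) is the text matched this time
        }
        return result;
    }

    public static void main(String[] args) {
        String myString =
                "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(times(myString));
    }
}
```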
If you want to build one big regular expression that matches everything at once (which I believe is what you want), put a pair of capturing parentheses around each of the three things that you want to capture, call Matcher.matches(), and then Matcher.group(n), where n is 1, 2 and 3 respectively. Of course Matcher.matches() might return false, in which case the pattern did not match and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match the digits, end the capturing group, and so on. I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantities and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if(m.matches()) {
System.out.println("The quantity is " + m.group(1));
System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2)". The time assumes that the hour is always 0-padded to 2 digits. I would give an example closer to what you are looking for, but the description of the possible input is a little vague.
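Applied to the string from the question, the same idea might look like the sketch below. The pattern is an assumption about the input format (a number right after "XX~ ", followed later by two zero-padded hh:mm times), and the class and method names are hypothetical. It uses find() rather than matches() because the pattern does not describe the text before "XX~ ".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ExtractDemo {

    // Assumed format: "XX~ <number>", then two hh:mm times later in the string.
    // The lazy .*? stops at the first time it can find instead of the last.
    private static final Pattern FIELDS =
            Pattern.compile("XX~ (\\d+).*?(\\d{2}:\\d{2}).*?(\\d{2}:\\d{2})");

    // Returns {number, firstTime, secondTime}, or null if the input does not fit.
    static String[] extract(String input) {
        Matcher m = FIELDS.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String myString =
                "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        String[] fields = extract(myString);
        if (fields != null) {
            System.out.println("number: " + fields[0]);
            System.out.println("first hour: " + fields[1]);
            System.out.println("second hour: " + fields[2]);
        }
    }
}
```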
