Java repetitive pattern matching (3)

I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example of such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
All the text groups, with the underscore included in the end, from Regex 1
All the text groups apart from the last, with no underscore but 2 empty groups in the end, from Regex 2.
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, neither of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true, even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
Is there a correct (fully working) regex that excludes the trailing underscore?
Is there any way of checking that only the correct regex has matched?
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.

The Matcher#matches method attempts to match the entire input sequence against the pattern, which is why you are getting the result Input NOT MATCHED. See the documentation here: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore you can use this regex (slight modification of what you already have)
(?:\\||^)([^\\\\|_]*)
This works only if _ appears immediately before a | and nowhere else; an underscore inside a group would end that group early.
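As a rough sketch of the difference between the two calls (the class and method names are made up for illustration), the per-group pattern above can be driven with find() alone. matches() returns false on the same input because a single delimiter plus one group never spans the whole string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipeGroups {
    // Per-group pattern from the answer: a '|' (or the start of input),
    // then a run of characters that are not a backslash, '|' or '_'.
    static final Pattern GROUP = Pattern.compile("(?:\\||^)([^\\\\|_]*)");

    // Collects group 1 of every successive match found by find().
    static List<String> groups(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = GROUP.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
        // matches() must consume the entire input, so this prints false.
        System.out.println(GROUP.matcher(input).matches());
        // Prints the eight groups; empty groups show up as [].
        for (String g : groups(input)) {
            System.out.println("[" + g + "]");
        }
    }
}
```

Note that calling find() after a failed matches() on the same Matcher is safe, but after a successful matches() the find() loop would start from the end of the input, so it is simpler to use separate matchers for validation and extraction.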

RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added.
In Java 6 this prints each group (the find() loop).
public static void main(String[] args)
{
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher;
Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
matcher = pattern.matcher(input);
if (matcher.matches())
{
Matcher groupMatcher;
System.out.println("matcher.matches() is true");
int groupCount = matcher.groupCount();
for (int index = 1; index <= groupCount; ++index)
{
System.out.print("group (pattern)[");
System.out.print(index);
System.out.print("]: ");
System.out.println(matcher.group(index));
}
groupMatcher = groupPattern.matcher(input);
while (groupMatcher.find())
{
System.out.print("group (groupPattern):");
System.out.println(groupMatcher.group());
System.out.println(groupMatcher.group(1));
}
}
else
{
System.out.println("No match");
}
}

Related

Can't get rid of the first blank line when I use regex for polynomials

Here is my code:
String a = "X^5+2X^2+3X^3+4X^4";
String exp[]=a.split("(|\\+\\d)[xX]\\^");
for(int i=0;i<exp.length;i++) {
System.out.println("exp: "+exp[i]+" ");
}
I'm trying to get the output 5,2,3,4,
but instead I got this:
exp:
exp:5
exp:2
exp:3
exp:4
I don't know where the empty first line comes from, and I cannot find a way to get rid of it. I tried other regexes and also tried compile, but I still can't get rid of the first line. When I try the new string "X+X^5+2X^2+3X^3+4X^4", the first line shows exp:X.
I also tried my problem in an online regex tester, which gives 5,2,3,4, but Eclipse gives an empty line and then 5,2,3,4. I need help figuring this out.

Try matching the exponents with a regex instead of splitting, e.g.:
String input = "X^5+2X^2+3X^3+4X^4";
Pattern pattern = Pattern.compile("\\^([0-9]+)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("exp: " + matcher.group(1));
}
It gives output:
exp: 5
exp: 2
exp: 3
exp: 4
How it works:
Pattern used: \^([0-9]+)
It matches any substring that starts with ^ followed by one or more digits (note the + sign). The caret (^) is prefixed with a backslash (\) because it has a special meaning in regular expressions (the beginning of the input), but in your case you just want a literal ^ character.
We wrap the part we want in a group so that we can refer to it later during the matching process. That means we mark it with parentheses ( and ).
Then we put our pattern into a Java String. In a String literal, the \ character has a special meaning: it starts an escape sequence, e.g. "\n" represents a new line. So when we put our pattern into a String literal we need to escape the \, and our pattern becomes "\\^([0-9]+)". Note the double \.
Next we iterate through all matches, getting group 1, which is our number match. Note that the ^ character is not part of group 1 even though it is part of the pattern, because the parentheses mark only the part we are looking for, which in our case is the digits.

Because you are using the split method, which looks for occurrences of the regex and splits the string at those positions. Your string starts with X^, which matches your regex, so the first piece (everything before that first match) is an empty string.
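To make that concrete, here is a small sketch (the class and method names are illustrative only) that shows the leading empty element produced by split and, as one possible alternative, collects the exponents with find():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExponentSplit {
    // Collects the digits that follow each '^' instead of splitting.
    static List<String> exponents(String polynomial) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\^(\\d+)").matcher(polynomial);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String a = "X^5+2X^2+3X^3+4X^4";
        String[] parts = a.split("(|\\+\\d)[xX]\\^");
        // The separator "X^" matches at index 0, so parts[0] is "".
        System.out.println(parts.length);        // prints 5
        System.out.println(parts[0].isEmpty());  // prints true
        System.out.println(exponents(a));        // prints [5, 2, 3, 4]
    }
}
```

If split has to stay, skipping empty elements in the loop is another option.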

Java Regular Expression: matching a customized Hashtag pattern with a lookahead/lookbehind condition

I am currently learning how to write regular expressions in Java by trying to match simple Hashtag pattern. The Hashtags obey the following conditions:
It starts with a hashtag: #
It has to contain at least 1 letter: [a-zA-Z]
It can contain any of the characters from the class [a-zA-Z0-9_]
It cannot be preceded by a character of the class [a-zA-Z0-9_]
Based on this, I thought that the correct regular expression is:
PATTERN = "(?<![a-zA-Z0-9_])#(?=.*[a-zA-Z])[a-zA-Z0-9_]+"
Here I'm using a lookahead (?=.*[a-zA-Z]) to make sure Condition 2 holds and using a lookbehind (?<![a-zA-Z0-9_]) to make sure Condition 4 holds. I'm less certain about ending with a +.
This works on simple test cases but fails on complicated ones such as:
String text = "####THIS_IS_A_HASHTAG; ;#This_1_2...#12_and_this but not #123 or #this# #or#that";
where it does not match #THIS_IS_A_HASHTAG, #This_1_2 and #12_and_this.
Could someone explain what I'm doing wrong?

This lookahead:
(?=.*[a-zA-Z])
may produce wrong results for the cases when input is like this:
####12345...#12_and_this
by giving you 2 matches, #12345 and #12_and_this, whereas per your rules only the 2nd should be a valid match.
To fix this you can use this regex:
(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+
Here the lookahead (?=[0-9_]*[a-zA-Z]) asserts that a letter follows the #, optionally with digits or underscores in between.
Here is a regex demo for you
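A minimal Java sketch of the corrected pattern, run against both inputs discussed here (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagDemo {
    // Lookbehind: no word character before '#'.
    // Lookahead: a letter must follow, with only digits/underscores before it.
    static final Pattern HASHTAG =
            Pattern.compile("(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+");

    static List<String> hashtags(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [#12_and_this]
        System.out.println(hashtags("####12345...#12_and_this"));
        // prints [#THIS_IS_A_HASHTAG, #This_1_2, #12_and_this, #this, #or]
        System.out.println(hashtags(
                "####THIS_IS_A_HASHTAG; ;#This_1_2...#12_and_this but not #123 or #this# #or#that"));
    }
}
```

With this pattern #123 is rejected because no letter follows the digits, and #that is rejected because it is preceded by a word character.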

How about this?
(example here)
String text = "####THIS_IS_A_HASHTAG;;;#This_1_2...#12_and_this ";
String regex = "#[A-Za-z0-9_]+";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
System.out.println(m.group());
}
It looks like it meets your criteria as stated:
#THIS_IS_A_HASHTAG
#This_1_2
#12_and_this

Java regexp grouping and + operator (Obtaining multiple values of a group)

I was wondering if it is possible to obtain all the matches of a group with a + operator in a Java regular expression.
Example code:
public static void main(String[] args) {
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:\\s*(([\\w\\s]+),?\\s*)+.");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println("Regular expression Match: "+ m.group(0));
System.out.println("Group 1: "+ m.group(1));
System.out.println("Group 2: "+ m.group(2));
}
}
OUTPUT:
Regular expression Match: Start: First match, second match, third match.
Group 1: third match
Group 2: third match
Although group 2 matched 3 times ("First match", "second match" and "third match") because of the second + operator in the regexp, only the last one is accessible through m.group(2).
My question is:
Is there a way to access the other hits of group 2 in that expression, or, when a + operator causes a group to match multiple times, can only the last one be accessed?
thanks.

As mentioned in other answers, you can't match n groups using + like this.
However, if you are looking to solve this problem in Java then using a Scanner to break on the delimiters may help:
String input = "Start: First match, second match, third match.";
Pattern p = Pattern.compile("Start:|\\s*,");
Scanner s = new Scanner(input).useDelimiter(p);
while (s.hasNext()) {
System.out.println("Matched: " + s.next());
}
This prints out:
Matched: First match
Matched: second match
Matched: third match.

You asked:
Is there a way to access the other hits of group 2 in that expression, or, when a + operator causes a group to match multiple times, can only the last one be accessed?
The answer is no: if the same group matches text multiple times, you can only access the last matched text.
There are of course other ways to return multiple matches.

I think this may not be possible with your regular expression.
As per the docs:
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second
time because of quantification then its previously-captured value, if
any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves
group two set to "b". All captured input is discarded at the beginning
of each match.
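The documentation's "aba" example can be checked with a short sketch like the following (the class and helper names are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Returns the contents of groups 1..n after a full match, or null if there is no match.
    static String[] captures(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = captures("(a(b)?)+", "aba");
        // Group 1 holds only the last repetition; group 2 keeps "b"
        // from the first repetition because the second evaluation failed.
        System.out.println(groups[0]); // prints a
        System.out.println(groups[1]); // prints b
    }
}
```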

Like most other regex flavors, Java doesn't save the intermediate captures of a repeated group. But that feature isn't really as useful as you might think. For example, the .NET flavor provides the CaptureCollection class for that purpose, but you still have to write the code to loop through it. Not that big a deal, but still it's usually easier to use multiple matches, like the other responders suggested. Try it with this regex:
"(?:Start:|\\G,)\\s*([\\w\\s]+)"
\G is a kind of anchor that causes the regex to reject any match that doesn't start exactly where the last match ended. If there was no previous match (i.e., this is the first match attempt), it acts like \A and matches only at the very beginning of the string. That's partly why I placed the , in that part of the regex; I think it's safe to assume the string doesn't start with a comma.
Note that the first group is non-capturing; the part you're looking for will always be in group(1).
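Applied to the input from the question, that regex could be used roughly like this (a sketch; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContinuedMatchDemo {
    // Either the "Start:" prefix, or a comma exactly where the previous match ended.
    static final Pattern ITEM = Pattern.compile("(?:Start:|\\G,)\\s*([\\w\\s]+)");

    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [First match, second match, third match]
        System.out.println(items("Start: First match, second match, third match."));
    }
}
```

Each call to find() yields one item, so every repetition is available without needing a repeated capturing group.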

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it (using groups)) from the string and check that the entire string matches K* (as in it consists of 0 or more K's with no other characters).
But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.
How would I do this?
I already found a solution (and posted an answer), but would like to know if there is specific regex or Matcher functionality that addresses / can address this issue, or simply if there are better / different ways of doing it. But, even if not, I still think it's an interesting question.
Example:
Pattern: <[0-9]> (a single digit in <>)
Valid input: <1><2><3>
Invalid inputs:
<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3
Code to do it in 2 passes with matches:
boolean products(String products)
{
String regex = "(<[0-9]>)";
Pattern pAll = Pattern.compile(regex + "*");
if (!pAll.matcher(products).matches())
return false;
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(products);
while (matcher.find())
System.out.println(matcher.group());
return true;
}

1. Defining the problem
Since it is not clear what to output when the whole string does not match pattern K*, I will redefine the problem to make it clear what to output in such case.
Given any pattern K:
Check that the string has the pattern K*.
If the string has pattern K*, then split the string into non-overlapping tokens that match K.
If the string only has a prefix that matches pattern K*, then pick the prefix that is chosen by K*+ [1], and split the prefix into tokens that match K.
[1] I don't know if there is any way to get the longest prefix that matches K*. Of course, you can always remove the last character one by one and test against K* until it matches, but that is obviously inefficient.
Unless specified otherwise, whatever I write below follows my problem description above. Note that the 3rd bullet point of the problem is there to resolve the ambiguity about which prefix string to take.
2. Repeated capturing group in .NET
The problem above can be solved if we have the solution to the problem:
Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.
In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
In the case where the string only has prefix that matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.
This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.
However, since we are using Java, we don't have access to such feature.
3. Solution in Java
Checking that the string has the pattern K* can always be done with Matcher.matches()/String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard part is splitting the input string into tokens that match pattern K.
If K* is equivalent to K*+
If the pattern K has the property:
For all strings [2], K* is equivalent to K*+, i.e. how the input string is split up into tokens that match pattern K is the same.
[2] You can define this condition for only the input strings you are operating on, but ensuring this pre-condition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.
Then a one-pass solution that solves the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and check that the last match found ends right at the end of the string. This is similar to your current solution, except that you do the boundary check with code.
The + after the quantifier * in K*+ makes the quantifier possessive. A possessive quantifier prevents the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has an equivalent meaning, since it will also return the first possible match for the pattern K.
If K* is NOT equivalent to K*+
Without the property above, we need 2 passes to solve the problem. First pass to call Matcher.matches()/String.matches() on the pattern K*. On second pass:
If the string does not match pattern K*, we will repeatedly use Matcher.find() on the pattern \GK until no more match can be found. This can be done due to how we define which prefix string to take when the input string does not match pattern K*.
If the string matches pattern K*, repeatedly using Matcher.find() on the pattern \GK(?=K*$) is one solution. This results in redundant work matching the rest of the input string, though.
Note that this solution is universally applicable for any K. In other words, it also applies for the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).
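For the case where K* is equivalent to K*+, the one-pass idea might look like the sketch below. It assumes K is <[0-9]> from the question and that K never matches the empty string; the class and method names, and the choice to return an Optional, are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnePassTokenizer {
    // \G forces each match to begin exactly where the previous one ended.
    static final Pattern TOKEN = Pattern.compile("\\G<[0-9]>");

    // Returns the tokens if the whole input has the form K*, otherwise Optional.empty().
    static Optional<List<String>> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            tokens.add(m.group());
            lastEnd = m.end();
        }
        // Boundary check in code: the last match must end at the end of the input.
        return lastEnd == input.length() ? Optional.of(tokens) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<1><2><3>"));  // prints Optional[[<1>, <2>, <3>]]
        System.out.println(tokenize("<1><2>a<3>")); // prints Optional.empty
    }
}
```

Collecting the tokens into a list before returning means nothing is processed for invalid input, at the cost of storing the tokens temporarily.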

Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one pass solution, but will not parse non-matching lines.
import java.util.regex.*;
class test{
public static void main(String args[]){
String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
Matcher m = pat.matcher(t);
while (m.find()) {
System.out.println("Matches!");
System.out.println(m.group());
}
}
}
The regex explained:
<\\d> --This is your k pattern as defined above
?= -- positive lookahead (check what is ahead of K)
<\\d>* -- Match k 0 or more times
$ -- End of line
?<= -- positive lookbehind (check what is behind K)
^ -- beginning of line
<\\d>* -- followed by 0 or more Ks
Regular expressions are beautiful things.
Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the answer. In fact, the implementation above can be improved with the knowledge in the answer: (<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).

Below is a one-pass solution using Matcher.start and Matcher.end.
boolean products(String products)
{
String regex = "<[0-9]>";
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(products);
int lastEnd = 0;
while (matcher.find())
{
if (lastEnd != matcher.start())
return false;
System.out.println(matcher.group());
lastEnd = matcher.end();
}
if (lastEnd != products.length())
return false;
return true;
}
The only disadvantage is that it will print out (or process) all values prior to finding invalid data.
For example, products("<1><2>a<3>"); will print out:
<1>
<2>
prior to returning false (because up until there the string is valid).
Either having this happen or having to store all of them temporarily seems to be unavoidable.

String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)*");
Matcher m = pat.matcher(t);
if (m.matches()) {
//String[] tt = t.split("(?<=>)"); // Look behind on '>'
String[] tt = t.split("(?<=(<\\d>))"); // Look behind on K
}

Regular expression to match pattern

I'm trying to get matches for commands like this;
[AUTR| <version_software> | <version_protocol> | <msg> ]
[PING]
What regular expression finds these matches for the first command?
AUTR
version_software
version_protocol
msg
This is the code that currently parses it:
String[] tokens = msg.replace('<',' ').replace('>',' ').replace('[', ' ').replace(']', ' ').split("\\|");
for (int i=0; i<tokens.length; i++) tokens[i] = tokens[i].trim();
I'm only wondering how it can be done with a regex solution.
EDIT:
I'm trying to match groups with easier expressions, and with this code the call to m.groupCount returns one... but when I try to print it... it throws this exception "java.lang.IllegalStateException: No match found"
Pattern pattern = Pattern.compile("([\\w+])");
Matcher m = pattern.matcher("[AUTR]");
for (int i=0; i<m.groupCount();i++)
{
System.out.println(m.group(i));
}

EDIT:
http://fiddle.re/6ykc
Regular Expression:
\[([\w]+)(\s*\|\s*<([\w. ]+)>\s*)*\]
Java Regex String:
"\\[([\\w]+)(\\s*\\|\\s*<([\\w. ]+)>\\s*)*\\]"
Note that this is for variable commands now and that all extra parameters must match the following character set [a-zA-Z_0-9. ] (Includes periods and spaces).
Issue: with variable-length commands, a repeated (quantified) group cannot capture more than one value; only the last repetition is kept.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
EDIT 2:
In order to get all of them you can do 2 regular expressions, one to grab the command:
String command_regex = "\\[([\\w]+)";
Find that first, and then find the parameters, using the <> characters as the markers to select them:
String parameters = "<([\\w. ]+)>";
Pattern pattern = Pattern.compile(parameters);
Matcher matcher = pattern.matcher(string_to_match);
while (matcher.find()) {
System.out.println(matcher.group());
}
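Put together, the two-regex approach could look roughly like this (a sketch; the class and method names are hypothetical). It also avoids the IllegalStateException from the question, which occurs because group() may only be called after a successful find() or matches(), and because groupCount() reports the number of capturing groups in the pattern, not the number of matches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommandParser {
    static final Pattern COMMAND = Pattern.compile("\\[([\\w]+)");
    static final Pattern PARAMETER = Pattern.compile("<([\\w. ]+)>");

    // Returns the command name followed by its parameters, or an empty list if no command is found.
    static List<String> parse(String message) {
        List<String> tokens = new ArrayList<>();
        Matcher command = COMMAND.matcher(message);
        if (!command.find()) {
            return tokens; // group() would throw IllegalStateException here
        }
        tokens.add(command.group(1));
        Matcher parameter = PARAMETER.matcher(message);
        while (parameter.find()) {
            tokens.add(parameter.group(1)); // group 1 excludes the surrounding < >
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [AUTR, version_software, version_protocol, msg]
        System.out.println(parse("[AUTR| <version_software> | <version_protocol> | <msg> ]"));
        // prints [PING]
        System.out.println(parse("[PING]"));
    }
}
```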
Hope that helps.
ORIGINAL:
Not exactly sure on the formatting, are the "<" and ">" and "|" required? And what are the formats for the command, version_software, version_protocol and message? This is my attempt though for regular expressions (tested in Python)
\[(\w+)\s*\|\s*<([\w.]+)>\s*\|\s*<(\w+)>\s*\|\s*<([\w\s]+)>\s*\]
You need to make sure to escape the brackets and the pipe symbols (I added \s* conditions in between because I don't know if there will be spaces or not). If you do:
>> m = re.search("expression above", line)
>> m.groups()
It should give all tokens in python at least. I left it more hardcoded to allow room for adjustments on each token you wanted to grab, otherwise you could reduce the last 3 parts by making it a group and saying to repeat 3 times. Let me know results?
