I have been using the java.util.regex.* classes for regular expressions in Java, and all has been good so far. But today I have a different requirement. For example, consider the pattern "aabb". If the input String is aa, it will definitely not match; however, there is still a possibility that if I append bb it becomes aabb and matches. If I had started with cc, on the other hand, it will never match no matter what I append.
I have explored the Pattern and Matcher class but didn't find any way of achieving this.
The input will come from the user, and the system has to wait until the pattern matches or until it can never match regardless of any further input.
Any clue?
Thanks.
You should have looked more closely at the Matcher API; the hitEnd() method works exactly as you described:
import java.util.regex.*;
public class Test
{
public static void main(String[] args) throws Exception
{
String[] ss = { "aabb", "aa", "cc", "aac" };
Pattern p = Pattern.compile("aabb");
Matcher m = p.matcher("");
for (String s : ss) {
m.reset(s);
if (m.matches()) {
System.out.printf("%-4s : match%n", s);
}
else if (m.hitEnd()) {
System.out.printf("%-4s : partial match%n", s);
}
else {
System.out.printf("%-4s : no match%n", s);
}
}
}
}
output:
aabb : match
aa : partial match
cc : no match
aac : no match
As far as I know, Java is the only language that exposes this functionality. There's also the requireEnd() method, which tells you if more input could turn a match into a non-match, but I don't think it's relevant in your case.
Both methods were added to support the Scanner class, so it can apply regexes to a stream without requiring the whole stream to be read into memory.
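For the "wait for more user input" case, the same hitEnd() check can be wrapped in a small helper and applied to the text typed so far. This is only a sketch: the class name, the Status enum and the simulated keystrokes are made up for illustration, and it assumes the whole input so far is re-checked after each keystroke.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartialMatchDemo {
    enum Status { MATCH, PARTIAL, NO_MATCH }

    // Classifies the text typed so far against the pattern.
    static Status classify(Pattern pattern, CharSequence typedSoFar) {
        Matcher m = pattern.matcher(typedSoFar);
        if (m.matches()) {
            return Status.MATCH;
        }
        // hitEnd() is true when the failed attempt ran into the end of the
        // input, i.e. more characters could still lead to a match.
        return m.hitEnd() ? Status.PARTIAL : Status.NO_MATCH;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("aabb");
        StringBuilder typed = new StringBuilder();
        for (char c : "aabb".toCharArray()) {   // simulated keystrokes
            typed.append(c);
            System.out.println(typed + " -> " + classify(p, typed));
        }
        System.out.println("cc -> " + classify(p, "cc"));
    }
}
```

As soon as classify returns NO_MATCH the caller can stop waiting, since no further input will help.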
Pattern p = Pattern.compile(expr);
Matcher m = p.matcher(string);
m.find();
So you want to know not whether a String s matches the regex, but whether there might be a longer String starting with s that would match? Sorry, Regexes can't help you there because you get no access to the internal state of the matcher; you only get the boolean result and any groups you have defined, so you never know why a match failed.
If you're willing to hack the JDK libraries, you can extend (or probably fork) java.util.regex and give out more information about the matching process. If the match failed because the input was 'used up' the answer would be true; if it failed because of character discrimination or other checks it would be false. That seems like a lot of work though, because your problem is completely the opposite of what regexes are supposed to do.
Another option: maybe you can simply redefine the task so that you treat the input as the regexp, i.e. match aabb against aa.* instead. You have to be careful about regex metacharacters, though.
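A rough sketch of that inversion, assuming the thing to be matched is a fixed string such as aabb (it does not carry over to a real pattern with quantifiers or alternation). Pattern.quote takes care of the metacharacter problem; the class and method names are made up. For a fixed target this is of course the same as target.startsWith(input).

```java
import java.util.regex.Pattern;

class InputAsRegexDemo {
    // True if the fixed target text starts with what the user typed so far.
    static boolean couldStillMatch(String input, String target) {
        // Quote the input so characters like '.' or '*' are taken literally,
        // then allow anything after it.
        String regex = Pattern.quote(input) + ".*";
        return Pattern.matches(regex, target);
    }

    public static void main(String[] args) {
        System.out.println(couldStillMatch("aa", "aabb"));
        System.out.println(couldStillMatch("cc", "aabb"));
        // Without quoting, "a.b" would be read as a regex and match "aabb".
        System.out.println(couldStillMatch("a.b", "aabb"));
    }
}
```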
For the example you give you could try to use an anti-pattern to disqualify invalid results. For example "^[^a]" would tell you your input "c..." can't match your example pattern of "aabb".
Depending on your pattern you may be able to break it up into smaller patterns to check and use multiple matchers and then set their bounds as one match occurs and you move to the next. This approach may work, but if your pattern is complex and can have variable-length sub-parts you may end up reimplementing part of the matcher in your own code to adjust the possible bounds of the match to make it more or less greedy. A pseudo-code general idea of this would be:
boolean match(String input, Matcher[] subpatterns, int matchStart, int matchEnd){
    matcher = next matcher in list;
    int stop = matchEnd;
    while(stop >= matchStart){
        if matcher.matches input from matchStart -> stop{
            if match(input, subpatterns, end of current match, end of string){
                return true;
            }else{
                //make this match less greedy
                stop--;
            }
        }else{
            //no match
            return false;
        }
    }
    //ran out of shorter candidates
    return false;
}
You could then merge this idea with the anti-patterns, and have anti-subpatterns and after each subpattern match you check the next anti-pattern, if it matches you know you have failed, otherwise continue the matching pattern. You would likely want to return something like an enum instead of a boolean (i.e. ALL_MATCHED, PARTIAL_MATCH, ANTI_PATTERN_MATCH, ...)
Again depending on the complexity of your actual pattern that you are trying to match writing the appropriate sub patterns / anti-pattern may be difficult if not impossible.
One way to do this is to parse your regex into a sequence of sub-regexes, and then reassemble them in a way that allows you to do partial matches; e.g. "abc" has 3 sub-regexes "a", "b" and "c" which you can then reassemble as "a(b(c)?)?".
Things get more complicated when the input regex contains alternation and groups, but the same general approach should work.
The problem with this approach is that the resulting regex is more complicated, and could potentially lead to excessive backtracking for complex input regexes.
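For the simplest case, where the original regex is just a literal string, the reassembly can be done mechanically. The sketch below is an assumption about how one might do it, not a general regex rewriter: prefixRegex and isPrefix are hypothetical names, it uses non-capturing groups instead of the plain parentheses above, and it works on single chars (so it ignores surrogate pairs).

```java
import java.util.regex.Pattern;

class PrefixRegexDemo {
    // Turns the literal "abc" into the equivalent of a(?:b(?:c)?)?,
    // which matches "a", "ab" and "abc" but nothing else.
    static String prefixRegex(String literal) {
        StringBuilder sb = new StringBuilder();
        int open = 0;
        for (int i = 0; i < literal.length(); i++) {
            if (i > 0) {
                sb.append("(?:");   // every later character is nested and optional
                open++;
            }
            sb.append(Pattern.quote(String.valueOf(literal.charAt(i))));
        }
        for (int i = 0; i < open; i++) {
            sb.append(")?");
        }
        return sb.toString();
    }

    static boolean isPrefix(String literal, String input) {
        return Pattern.matches(prefixRegex(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(prefixRegex("abc"));
        for (String s : new String[] { "a", "ab", "abc", "ac", "b" }) {
            System.out.println(s + " -> " + isPrefix("abc", s));
        }
    }
}
```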
If you make each character of the regex optional and relax the multiplicity constraints, you kinda get what you want. Example if you have a matching pattern "aa(abc)+bbbb", you can have a 'possible match' pattern 'a?a?(a?b?c?)*b?b?b?b?'.
This mechanical way of producing possible-match pattern does not cover advanced constructs like forward and backward refs though.
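One thing worth checking before relying on this: the loosened pattern accepts more than the true prefixes, so it can only rule inputs out, not confirm them. A small sketch using the example pattern above (class and method names are made up):

```java
import java.util.regex.Pattern;

class LoosePatternDemo {
    // The 'possible match' pattern derived from "aa(abc)+bbbb".
    static final Pattern LOOSE = Pattern.compile("a?a?(a?b?c?)*b?b?b?b?");

    static boolean possible(String input) {
        return LOOSE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // A genuine prefix of e.g. "aaabcbbbb": accepted.
        System.out.println("aaab -> " + possible("aaab"));
        // Contains a character the pattern never uses: rejected.
        System.out.println("aad  -> " + possible("aad"));
        // Not a prefix of anything "aa(abc)+bbbb" matches, yet accepted,
        // because every element became optional.
        System.out.println("cc   -> " + possible("cc"));
    }
}
```

So a false from the loose pattern is reliable, while a true still needs a stricter check.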
You might be able to accomplish this with a state machine (http://en.wikipedia.org/wiki/State_machine). Have your states/transitions represent valid input and one error state. You can then feed the state machine one character (possibly substring depending on your data) at a time. At any point you can check if your state machine is in the error state. If it is not in the error state then you know that future input may still match. If it is in the error state then you know something previously failed and any future input will not make the string valid.
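A minimal sketch of such a state machine, assuming the pattern is a fixed string like aabb; a real regex would need proper states and transitions built from the expression. The class is hypothetical: the state is simply the number of characters matched so far, plus one error state.

```java
class LiteralStateMachine {
    private static final int ERROR = -1;

    private final String target;
    private int state = 0;   // number of target characters matched so far

    LiteralStateMachine(String target) {
        this.target = target;
    }

    // Feeds one character; once in the error state the machine stays there.
    void feed(char c) {
        if (state == ERROR) {
            return;
        }
        if (state < target.length() && target.charAt(state) == c) {
            state++;
        } else {
            state = ERROR;
        }
    }

    boolean isError()    { return state == ERROR; }
    boolean isAccepted() { return state == target.length(); }

    // Convenience for trying a whole input at once.
    static LiteralStateMachine run(String target, String input) {
        LiteralStateMachine sm = new LiteralStateMachine(target);
        for (char c : input.toCharArray()) {
            sm.feed(c);
        }
        return sm;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "aa", "aabb", "cc" }) {
            LiteralStateMachine sm = run("aabb", s);
            System.out.println(s + " error=" + sm.isError() + " accepted=" + sm.isAccepted());
        }
    }
}
```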
TL;DR
What are the design decisions behind Matcher's API?
Background
Matcher has a behaviour that I didn't expect and for which I can't find a good reason. The API documentation says:
Once created, a matcher can be used to perform three different kinds of match operations:
[...]
Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
What the API documentation further says is:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
Example
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
System.out.println(matcher.group("foo")); // (1)
System.out.println(matcher.group("bar"));
This code throws a
java.lang.IllegalStateException: No match found
at (1). To get around this, it is necessary to call matches() or other methods that bring the Matcher into a state that allows group(). The following works:
String s = "foo=23,bar=42";
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
Matcher matcher = p.matcher(s);
matcher.matches(); // (2)
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
Adding the call to matches() at (2) sets the Matcher into the proper state to call group().
Question, probably not constructive
Why is this API designed like this? Why not match automatically when the Matcher is built with Pattern.matcher(String)?
Actually, you misunderstood the documentation. Take a 2nd look at the statement you quoted: -
attempting to query any part of it before a successful match will cause an
IllegalStateException to be thrown.
A matcher may throw IllegalStateException on accessing matcher.group() if no match was found.
So, you need to use following test, to actually initiate the matching process: -
- matcher.matches() //Or
- matcher.find()
The code below: -
Matcher matcher = pattern.matcher(s);
just creates a matcher instance. It does not actually match the string yet, even if the string would match successfully.
So, you need to check the following condition, to check for successful matches: -
if (matcher.matches()) {
// Then use `matcher.group()`
}
And if the condition in the if returns false, that means nothing was matched. So, if you use matcher.group() without checking this condition, you will get IllegalStateException if the match was not found.
Suppose, if Matcher was designed the way you are saying, then you would have to do a null check to check whether a match was found or not, to call matcher.group(), like this: -
The way you think should have been done:-
// Suppose this returned the matched string
Matcher matcher = pattern.matcher(s);
// Need to check whether there was actually a match
if (matcher != null) { // Prints only the first match
System.out.println(matcher.group());
}
But what if you want to print any further matches? A pattern can match multiple times in a String, so there has to be a way to tell the matcher to find the next match, and a null check cannot do that. You would have to move the matcher forward to the next match. So there are various methods defined in the Matcher class to serve that purpose. Each call to matcher.find() finds the next match, until no more matches are left.
There are other methods also, that match the string in a different way, that depends on you how you want to match. So its ultimately on Matcher class to do the matching against the string. Pattern class just creates a pattern to match against. If the Pattern.matcher() were to match the pattern, then there has to be some way to define various ways to match, as matching can be in different ways. So, there comes the need of Matcher class.
So, the way it actually is: -
Matcher matcher = pattern.matcher(s);
// Finds all the matches until found by moving the `matcher` forward
while(matcher.find()) {
System.out.println(matcher.group());
}
So, if there are 4 matches found in the string, your first way, would print only the first one, while the 2nd way will print all the matches, by moving the matcher forward to match the next pattern.
I Hope that makes it clear.
The documentation of Matcher class describes the use of the three methods it provides, which says: -
A matcher is created from a pattern by invoking the pattern's matcher
method. Once created, a matcher can be used to perform three different
kinds of match operations:
The matches method attempts to match the entire input sequence
against the pattern.
The lookingAt method attempts to match the input sequence, starting
at the beginning, against the pattern.
The find method scans the input sequence looking for the next
subsequence that matches the pattern.
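To make the difference between the three operations concrete, here is a small sketch that applies the same pattern with each of them. The pattern \d+ and the helper names are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeOperationsDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // matches(): the entire input must match.
    static boolean wholeInput(String input) {
        return DIGITS.matcher(input).matches();
    }

    // lookingAt(): the match must start at the beginning, but may end early.
    static boolean atStart(String input) {
        return DIGITS.matcher(input).lookingAt();
    }

    // find(): the match may be anywhere; returns the first one or null.
    static String firstOccurrence(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "123", "123abc", "abc123" }) {
            System.out.printf("%-6s matches=%b lookingAt=%b find=%s%n",
                    s, wholeInput(s), atStart(s), firstOccurrence(s));
        }
    }
}
```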
Unfortunately, I have not been able find any other official sources, saying explicitly Why and How of this issue.
My answer is very similar to Rohit Jain's but includes some reasons why the 'extra' step is necessary.
java.util.regex implementation
The line:
Pattern p = Pattern.compile("foo=(?<foo>[0-9]*),bar=(?<bar>[0-9]*)");
causes a new Pattern object to be allocated, and it internally stores a structure representing the RE - information such as a choice of characters, groups, sequences, greedy vs. non-greedy, repeats and so on.
This pattern is stateless and immutable, so it can be reused, is thread-safe and optimizes well.
The lines:
String s = "foo=23,bar=42";
Matcher matcher = p.matcher(s);
return a new Matcher object for the Pattern and String - one that has not yet read the String. Matcher is really just a state machine's state, where the state machine is the Pattern.
The matching can be run by stepping the state machine through the matching process using the following API:
lookingAt(): Attempts to match the input sequence, starting at the beginning, against the pattern
find(): Scans the input sequence looking for the next subsequence that matches the pattern.
In both cases, the intermediate state can be read using the start(), end(), and group() methods.
Benefits of this approach
Why would anyone want to step through the parsing?
Get values from groups that have quantification greater than 1 (i.e. groups that repeat and end up matching more than once). For example in the trivial RE below that parses variable assignments:
Pattern p = Pattern.compile("([a-z]=([0-9]+);)+");
Matcher m = p.matcher("a=1;b=2;x=3;");
m.matches();
System.out.println(m.group(2)); // Only matches value for x ('3') - not the other values
See the section on "Group name" in "Groups and capturing" in the JavaDoc on Pattern
The developer can use the RE as a lexer and the developer can bind the lexed tokens to a parser. In practice, this would work for simple domain languages, but regular expressions are probably not the way to go for a full-blown computer language. EDIT This is partly related to the previous reason, but it can frequently be easier and more efficient to create the parse tree processing the text than lexing all the input first.
(For the brave-hearted) you can debug REs and find out which subsequence is failing to match (or incorrectly matching).
However, on most occasions you do not need to step the state machine through the matching, so there is a convenience method (matches) which runs the pattern matching to completion.
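To illustrate the repeated-group point from the list above: with matches() the repeating group only keeps its last capture, whereas stepping through with find() on the un-repeated pattern yields every value. This is a sketch with made-up helper names, using the same assignment syntax as the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // One matches() call: the repeated group retains only its last capture.
    static String lastValueOnly(String input) {
        Matcher m = Pattern.compile("([a-z]=([0-9]+);)+").matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    // Stepping with find(): one assignment per call, so no value is lost.
    static List<String> allValues(String input) {
        Matcher m = Pattern.compile("[a-z]=([0-9]+);").matcher(input);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "a=1;b=2;x=3;";
        System.out.println(lastValueOnly(input));
        System.out.println(allValues(input));
    }
}
```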
If a matcher would automatically match the input string, that would be wasted effort in case you wish to find the pattern.
A matcher can be used to check if the pattern matches() the input string, and it can be used to find() the pattern in the input string (even repeatedly to find all matching substrings). Until you call one of these two methods, the matcher does not know what test you want to perform, so it cannot give you any matched groups. Even if you do call one of these methods, the call may fail - the pattern is not found - and in that case a call to group must fail as well.
This is expected and documented.
The reason is that .matches() returns a boolean indicating if there was a match. If there was a match, then you can call .group(...) meaningfully. Otherwise, if there's no match, a call to .group(...) makes no sense. Therefore, you should not be allowed to call .group(...) before calling matches().
The correct way to use a matcher is something like the following:
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println(m.group("foo"));
...
}
My guess is the design decision was based on having queries that had clear, well defined semantics that didn't conflate existence with match properties.
Consider this: what would you expect Matcher queries to return if the matcher has not successfully matched something?
Let's first consider group(). If we haven't successfully matched something, Matcher shouldn't return the empty string, as it hasn't matched the empty string. We could return null at this point.
Ok, now let's consider start() and end(). Each return int. What int value would be valid in this case? Certainly no positive number. What negative number would be appropriate? -1?
Given all this, a user is still going to have to check return values for every query to verify if a match occurred or not. Alternatively, you could check to see if it matches successfully outright, and if successful, the query semantics all have well-defined meaning. If not, the user gets consistent behaviour no matter which angle is queried.
I'll grant that re-using IllegalStateException may not have resulted in the best description of the error condition. But if we were to rename/subclass IllegalStateException to NoSuccessfulMatchException, one should be able to appreciate how the current design enforces query consistency and encourages the user to use queries that have semantics that are known to be defined at the time of asking.
TL;DR: What is the value of asking the specific cause of death of a living organism?
You need to check the return value of matcher.matches(). It will return true when a match was found, false otherwise.
if (matcher.matches()) {
System.out.println(matcher.group("foo"));
System.out.println(matcher.group("bar"));
}
If matcher.matches() does not find a match and you call matcher.group(...), you'll still get an IllegalStateException. That's exactly what the documentation says:
The explicit state of a matcher is initially undefined; attempting to query any part of it before a successful match will cause an IllegalStateException to be thrown.
When matcher.matches() returns false, no successful match has been found and it doesn't make a lot of sense to get information on the match by calling for example group().
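The three situations (no match attempted yet, a failed match, a successful match) can be checked directly. A small sketch, with a simplified pattern and hypothetical helper names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupStateDemo {
    static final Pattern P = Pattern.compile("foo=([0-9]*)");

    // True if querying the matcher's state raises IllegalStateException.
    static boolean groupThrows(Matcher m) {
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    static boolean throwsBeforeAnyMatch() {
        return groupThrows(P.matcher("foo=23"));   // no match operation run yet
    }

    static boolean throwsAfterFailedMatch() {
        Matcher m = P.matcher("something else");
        m.matches();                               // returns false
        return groupThrows(m);
    }

    static boolean throwsAfterSuccessfulMatch() {
        Matcher m = P.matcher("foo=23");
        m.matches();                               // returns true
        return groupThrows(m);
    }

    public static void main(String[] args) {
        System.out.println("before any match: " + throwsBeforeAnyMatch());
        System.out.println("after failed match: " + throwsAfterFailedMatch());
        System.out.println("after successful match: " + throwsAfterSuccessfulMatch());
    }
}
```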
I want to check a string that matches the format "=number", ex "=5455".
As long as the first char is "=" and the rest consists only of digits in [0-9] (a dot is not allowed), it should pop up a "correct" message.
if(str.matches("^[=][0-9]+")){
Window.alert("correct");
}
So, is this ^[=][0-9]+ the correct one?
If it is not correct, can you provide a correct solution?
If it is correct, can you find a better solution?
I'm no big regex expert and more knowledgeable people than me might correct this answer, but:
I don't think there's a point in using [=] rather than simply = - the [...] block is used to declare multiple choices, why declare a multiple choice of one character?
I don't think you need to use ^ (if your input string contains any character before =, it won't match anyway). I'm unsure as to whether its presence makes your regex faster, slower or has no effect.
In conclusion, I'd use =[0-9]+
That should be correct: it looks for an = sign anchored at the beginning, followed by one or more digits between 0 and 9.
Your regex will work, even though it can be simplified:
.matches() does not really do regex matching in the usual sense, since it tries to match the entire input against the regex; therefore the beginning-of-input anchor is not needed;
you don't need the character class around the =.
Therefore:
if (str.matches("=[0-9]+")) { ... }
If you want to match a string which only begins with that regex, you have to use a Pattern, a Matcher and .find():
final Pattern p = Pattern.compile("^=[0-9]+");
final Matcher m = p.matcher(str);
if (m.find()) { ... }
And finally, Matcher also has .lookingAt() which anchors the regex only at the beginning of the input.
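Put together, the two variants might look like the sketch below (plain Java, without the GWT Window.alert call from the question; the method names are invented here):

```java
import java.util.regex.Pattern;

class EqualsNumberDemo {
    private static final Pattern EQUALS_NUMBER = Pattern.compile("=[0-9]+");

    // The whole string must be '=' followed by one or more digits.
    static boolean isEqualsNumber(String s) {
        return s.matches("=[0-9]+");
    }

    // Only the beginning must look like "=digits"; anything may follow.
    static boolean startsWithEqualsNumber(String s) {
        return EQUALS_NUMBER.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "=5455", "=54.5", "x=5455", "=54abc" }) {
            System.out.printf("%-7s whole=%b prefix=%b%n",
                    s, isEqualsNumber(s), startsWithEqualsNumber(s));
        }
    }
}
```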
I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)
Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.
Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time
Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));
It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy qualifier changes things so that it starts off trying to match nothing and then building from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, .... It can do its job in one forward pass over the input.
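One caveat to keep in mind with the reluctant version: if the last group is also reluctant and the regex is applied with find(), that last group is satisfied with matching nothing. With matches() the whole input has to be consumed, so it expands as needed. A sketch using a shortened two-key version of the regex (helper and constant names are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantPitfallDemo {
    static final String GREEDY = "Key1=(.*),Key2=(.*)";
    static final String RELUCTANT = "Key1=(.*?),Key2=(.*?)";

    // Returns what the second group captured, or null if there was no match.
    static String secondValue(String regex, String input, boolean wholeInput) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean ok = wholeInput ? m.matches() : m.find();
        return ok ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "Key1=x,Key2=y";
        System.out.println("greedy    + find():    [" + secondValue(GREEDY, input, false) + "]");
        // The trailing (.*?) stops as early as it can, i.e. it captures "".
        System.out.println("reluctant + find():    [" + secondValue(RELUCTANT, input, false) + "]");
        System.out.println("reluctant + matches(): [" + secondValue(RELUCTANT, input, true) + "]");
    }
}
```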
I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to an non-capturing group comprised of word characters followed by the equal sign (the next value)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which is a run of characters that comprises a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
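Here is roughly how that could look. It is a sketch under the stated assumption that neither keys nor values contain '=', while values may contain commas; the class name and the choice of a LinkedHashMap are mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitOnEqualsDemo {
    static Map<String, String> parse(String data) {
        // Limit -1 keeps a trailing empty string, so "Key=" yields an empty value.
        String[] parts = data.split("=", -1);
        Map<String, String> result = new LinkedHashMap<>();
        String key = parts[0];                       // first part: only a key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                result.put(key, part);               // last part: only a value
            } else {
                int comma = part.lastIndexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("no key after value: " + part);
                }
                result.put(key, part.substring(0, comma));   // value, commas included
                key = part.substring(comma + 1);             // key of the next pair
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=a,b,Key2=c"));
    }
}
```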
If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(), which I would recommend, as it allows you to move down the string one key=value pair at a time. It does bring you into the whole "greedy" evaluation discussion, though.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}
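A slightly more defensive variant of the same idea, in case a value is empty or contains '=' itself: splitting each pair with a limit of 2 avoids an ArrayIndexOutOfBoundsException on keyValue[1]. This is a sketch; the map type and the decision to store "" for a missing value are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueSplitDemo {
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : data.split(",")) {
            // Limit 2: split at the first '=' only and keep an empty value.
            String[] keyValue = pair.split("=", 2);
            String value = keyValue.length > 1 ? keyValue[1] : "";
            result.put(keyValue[0], value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=value,Key2=a=b,Key3=,Flag"));
    }
}
```

This still assumes that values never contain a comma.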
Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
<amy> is a match, \1 captures amy, \2 fails
(bob) is a match, \2 captures bob, \1 fails
This regex does most of what we want, which are:
It matches the open and close brackets properly (i.e. no mixing)
It captures the part we're interested in
However, it does have a few drawbacks:
The capturing pattern (i.e. the "main" part) is repeated
It's only \w+ in this case, but generally speaking this can be quite complex,
If it involves backreferences, then they must be renumbered for each alternate!
Repetition makes maintenance a nightmare! (what if it changes?)
The groups are essentially duplicated
Depending on which alternate matches, we must query different groups
It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of their own!
Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
The situation quickly worsens if we also want to match {...}, [...], etc.
So the question is obvious: how can we do this without repeating the "main" pattern?
Note: for the most part I'm interested in java.util.regex flavor, but other flavors are welcomed.
Appendix
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy) # non-matching bracket
<amy=bob> # left hand side not equal to right hand side
Using the alternate technique, we have the following that works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
As explained above:
The main pattern can't simply be repeated; backreferences must be renumbered
Repetition also means maintenance nightmare if it ever changes
Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6
You can use a lookahead to "lock in" the group number before doing the real match.
String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
"(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);
while(m.find())
{
System.out.printf("found %s in %s%n", m.group(2), m.group());
}
output:
found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:
"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"
In preg (Perl Regex library), this will match your example, and \3 will catch the insides:
((<)|\()(\w+)(?(2)>|\))
It will not work in JS, though - you did not specify the dialect...
It depends on the conditional operator (?(2)...|...) which basically says if 2 is a non-null capture, then match before the pipe, else match after the pipe. In this form, pipe is not alternation ("or").
UPDATE Sorry, I completely missed the Java bit :) Anyway, apparently Java does not support the conditional construct; and I have no idea how else I'd go about it :(
Also, for your Appendix (even though it's the wrong dialect):
(?:(<)|(\()|\[)(\w+)=\3(?(1)>|(?(2)\)|]))
The name is again in \3 (I got rid of the first capturing paren, but I had to add another one for one extra opening paren check)
The only solution that I was able to come up with is inspired by technique of capturing an empty string on different alternates; backreferencing to these groups later can serve as pseudo-conditionals.
Thus, this pattern works for the second example (as seen on rubular.com):
                  __main__
                 /        \
(?:<()|\(()|\[())((\w+)=\5)(\1>|\2\)|\3\])
\_______________/          \_____________/
    \1   \2   \3
So essentially for each opening bracket, we assign a group that captures an empty string. Then when we try to match the closing bracket, we see which group was successful, and match the corresponding closing bracket.
The "main" part does not have to be repeated, but in Java, backreferences may have to be renumbered. This won't be a problem in flavors that support named groups.
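For java.util.regex, the same pattern written as a Java string literal might be used as in the sketch below. It relies on the fact that in Java a backreference to a group that did not take part in the match fails. The class and helper names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyGroupConditionalDemo {
    // Groups 1-3 capture "" after the opening bracket that was seen,
    // group 4 is the main part, group 5 the name, group 6 the closer.
    static final Pattern P = Pattern.compile(
            "(?:<()|\\(()|\\[())((\\w+)=\\5)(\\1>|\\2\\)|\\3\\])");

    // Returns the name if the whole input is a well-formed bracketed pair.
    static String nameIn(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(5) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "<amy=amy>", "(bob=bob)", "[carol=carol]", "<amy=amy)", "<amy=bob>"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + nameIn(s));
        }
    }
}
```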
Maybe this example in Perl will interest you:
$str = q/<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>/;
$re = qr/(?:<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\])+/;
@list = ($str =~ /$re/g);
for(@list) {
say $i++," = ",$_;
}
I just surround your regex by (?:regex)+
When you get things like this, using a single regex is a silly restriction, and I simply don't agree with your "maintenance nightmare" to using more than one - repeating a similar-but-different expression several times is likely to be more maintainable (well, less unmaintainable), and maybe even better performance too, than a single overly-complex regex.
But anyway, there's no repetition if you just use variables to compose your regex.
Here's some pseudo-code:
Brackets = "<>,(),[]"
CoreRegex = "(\w+)=\1"
loop CurBracket in Brackets.split(',')
{
Input.match( Regex.quote(CurBracket.left(1)) & CoreRegex & Regex.quote(CurBracket.right(1)) )
}
(p.s.that's just to give the general idea - I'd probably use already-escaped arrays for the bracket sets in actual implementation).
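Translated to Java, the pseudo-code could look something like this sketch. The bracket pairs are kept as plain strings and escaped with Pattern.quote; the names CORE, BRACKETS and nameIn are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComposedBracketRegexDemo {
    // The "main" part is written once; its backreference never needs renumbering
    // because each composed regex contains only this one group.
    static final String CORE = "(\\w+)=\\1";
    static final String[][] BRACKETS = { { "<", ">" }, { "(", ")" }, { "[", "]" } };

    // Tries one small regex per bracket pair; returns the name or null.
    static String nameIn(String input) {
        for (String[] pair : BRACKETS) {
            Pattern p = Pattern.compile(
                    Pattern.quote(pair[0]) + CORE + Pattern.quote(pair[1]));
            Matcher m = p.matcher(input);
            if (m.matches()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "<amy=amy>", "[carol=carol]", "<amy=amy)" }) {
            System.out.println(s + " -> " + nameIn(s));
        }
    }
}
```

In real code the patterns would probably be compiled once up front rather than on every call.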
Assuming there is no easy way to manually write this regular expression, why not leave it to the computer?
You could have a function, maybe like below (I am using C# syntax here, as I am a bit more familiar with regexes here than in Java, but it should not be too difficult to adapt it to Java).
Note that I left the function AdaptBackreferences() more or less unimplemented as an exercise to the reader. It should just adapt the backreference numbering.
struct BracketPair {public string Open; public string Close;};
static string[] MatchTextInBrackets(string text, string innerPattern, BracketPair[] bracketPairs) {
StringBuilder sb = new StringBuilder();
// count number of catching parentheses of innerPattern here:
int numberOfInnerCapturingParentheses = Regex.Match("", innerPattern).Groups.Count - 1;
bool firstTime = true;
foreach (BracketPair pair in bracketPairs) {
// apply logic to change backreference numbering:
string adaptedInnerPattern = AdaptBackreferences(innerPattern);
if (firstTime) { firstTime = false; } else { sb.Append('|'); }
sb.Append(pair.Open).Append("(").Append(adaptedInnerPattern).Append(")").Append(pair.Close);
}
string myPattern = sb.ToString();
MatchCollection matches = Regex.Matches(text, myPattern);
string[] result = new string[matches.Count];
for(int i=0; i < matches.Count; i++) {
StringBuilder mb = new StringBuilder();
for(int j=0; j < bracketPairs.Length; j++) {
mb.Append(matches[i].Groups[1 + j * (numberOfInnerCapturingParentheses + 1)]); // append them all together, assuming all except one are empty
}
result[i] = mb.ToString();
}
return result;
}
static string AdaptBackreferences(string pattern) { return pattern; } // to be written
I'm using the Java matcher to try and match the following:
#tag TYPE_WITH_POSSIBLE_SUBTYPE -PARNAME1=PARVALUE1 -PARNAME2=PARVALUE2: MESSAGE
The TYPE_WITH_POSSIBLE_SUBTYPE consists of letters with periods.
Every parameter has to consist of letters, and every value has to consist of numerics/letters. There can be 0 or more parameters.
Immediately after the last parameter value comes the colon, a space, and the remainder is considered the message.
Everything needs to be grouped.
My current regexp (as a Java literal) is:
(#tag)[\\s]+?([\\w\\.]*?)[\\s]*?(-.*=.*)*?[\\s]*?[:](.*)
However, I keep getting all the parameters as one group. How do I get each as a separate group, if it is even possible?
I don't work that much with regexps, so I always mess something up.
If you want to capture each parameter separately, you have to have a capture group for each one. Of course, you can't do that because you don't know how many parameters there will be. I recommend a different approach:
Pattern p = Pattern.compile("#tag\\s+([^:]++):\\s*(.*)");
Matcher m = p.matcher(s);
if (m.find())
{
    String[] parts = m.group(1).split("\\s+");
    for (String part : parts)
    {
        System.out.println(part);
    }
    // group(2) is only valid after a successful find()
    System.out.printf("message: %s%n", m.group(2));
}
The first element in the array is your TYPE name and the rest (if there are any more) are the parameters.
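If you would rather get name/value pairs than raw strings, a variation is to capture the whole parameter section as one group and then run a second, small pattern over it with find(). This is a sketch, not a drop-in solution: the two patterns are my own reading of the format described in the question (using \w for names and values), and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagLineDemo {
    // group 1: type, group 2: all parameters as one blob, group 3: message
    static final Pattern LINE = Pattern.compile(
            "#tag\\s+([\\w.]+)((?:\\s+-\\w+=\\w+)*):\\s(.*)");
    // Applied repeatedly to the blob: one parameter per find().
    static final Pattern PARAM = Pattern.compile("-(\\w+)=(\\w+)");

    static Map<String, String> params(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            Matcher pm = PARAM.matcher(m.group(2));
            while (pm.find()) {
                result.put(pm.group(1), pm.group(2));
            }
        }
        return result;
    }

    static String message(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String line = "#tag some.type -P1=V1 -P2=V2: hello world";
        System.out.println(params(line));
        System.out.println(message(line));
    }
}
```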
Try this out (you may need to add extra '\' to make it work within a string).
(#tag)\s*(\w*)\s*(-[\w\d]*=[\w\d]*\s*)*:(.*)
By the way, I highly recommend this site to help you build regular expressions: RegexPal. Or even better is RegexBuddy; it's well worth the $40 if you plan on doing a lot of regular expressions in the future.