Regular expression to parse option string - java

I'm using the Java matcher to try and match the following:
#tag TYPE_WITH_POSSIBLE_SUBTYPE -PARNAME1=PARVALUE1 -PARNAME2=PARVALUE2: MESSAGE
The TYPE_WITH_POSSIBLE_SUBTYPE consists of letters with periods.
Every parameter has to consist of letters, and every value has to consist of numerics/letters. There can be 0 or more parameters.
Immediately after the last parameter value comes a colon, a space, and the remainder is considered the message.
Everything needs to be grouped.
My current regexp (as a Java literal) is:
(#tag)[\\s]+?([\\w\\.]*?)[\\s]*?(-.*=.*)*?[\\s]*?[:](.*)
However, I keep getting all the parameters as one group. How do I get each as a separate group, if it is even possible?
I don't work that much with regexps, so I always mess something up.

If you want to capture each parameter separately, you have to have a capture group for each one. Of course, you can't do that because you don't know how many parameters there will be. I recommend a different approach:
Pattern p = Pattern.compile("#tag\\s+([^:]++):\\s*(.*)");
Matcher m = p.matcher(s);
if (m.find())
{
    // group(1) holds the type followed by any parameters
    String[] parts = m.group(1).split("\\s+");
    for (String part : parts)
    {
        System.out.println(part);
    }
    // only read the message once the match has succeeded
    System.out.printf("message: %s%n", m.group(2));
}
The first element in the array is your TYPE name and the rest (if there are any more) are the parameters.
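If you would rather end up with named pieces than a flat array, the same two-step idea can be sketched as follows. This is only an illustration: the class name TagLine and its fields are made up for this example, and the character classes assume the type is letters and periods, parameter names are letters, and values are letters or digits, as described in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TagLine {
    // Overall shape: "#tag TYPE [-NAME=VALUE ...]: MESSAGE"
    private static final Pattern LINE = Pattern.compile(
        "#tag\\s+([A-Za-z.]+)((?:\\s+-[A-Za-z]+=[A-Za-z0-9]+)*):\\s(.*)");
    // One parameter inside the block captured by group 2 of LINE.
    private static final Pattern PARAM = Pattern.compile("-([A-Za-z]+)=([A-Za-z0-9]+)");

    final String type;
    final Map<String, String> params = new LinkedHashMap<>();
    final String message;

    private TagLine(String type, String message) {
        this.type = type;
        this.message = message;
    }

    /** Returns the parsed line, or null if the input does not have the expected shape. */
    static TagLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        TagLine result = new TagLine(m.group(1), m.group(3));
        // Second pass: walk the parameter block one NAME=VALUE pair at a time.
        Matcher p = PARAM.matcher(m.group(2));
        while (p.find()) {
            result.params.put(p.group(1), p.group(2));
        }
        return result;
    }
}
```

The first pattern validates the whole line and isolates the parameter block; the find() loop then extracts as many parameters as happen to be present, which a fixed number of capture groups cannot do.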

Try this out (you may need to add an extra '\' before each backslash to make it work within a Java string):
(#tag)\s*([\w.]*)\s*(-\w*=\w*\s*)*:(.*)
By the way, I highly recommend this site to help you build regular expressions: RegexPal. Or even better is RegexBuddy; it's well worth the $40 if you plan on doing a lot of regular expressions in the future.

Related

Regex function to find specific depth in recursive

I have the following scenario where I am supposed to use a regex (Java/PCRE) on a line of code to strip off certain function calls and keep only the argument of those functions, as in the example below:
Input
ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))
Output : Replace Regex
ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)
Here CInt, Math.Truncate, and ObjectToNumber are removed and only the innermost argument is retained, as shown above.
The functions CInt and Math.Truncate may change to CStr, Math.Random, etc., so the regex cannot be hardcoded.
I tried a lot of options from Stack Overflow but most did not work.
It would also be nice if the query were customizable, e.g. given CInt it returns everything the CInt call wraps (find the name, then everything between its first ( and the matching ), allowing for balanced parenthesis pairs in between).
I know it's not pretty, but it's your fault to use raw regex for this :)
@Test
void unwrapCIntCall() {
String input = "ArrayNew(1) = adjustalpha(shadowcolor, CInt(Math.Truncate (ObjectToNumber (Me.bezierviewshadow.getTag))))";
String expectedOutput = "ArrayNew(1) = adjustalpha(shadowcolor, Me.bezierviewshadow.getTag)";
String output = input.replaceAll("CInt\\s*\\(\\s*Math\\.Truncate\\s*\\(\\s*ObjectToNumber\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
assertEquals(expectedOutput, output);
}
Now some explanation: the \\s* parts allow any amount of whitespace wherever they appear. In the pattern, I used (.*) in the middle, which means I match anything there, but it's fine*. I used (.*) instead of .* so that this section gets captured as capturing group $1 (because $0 is always the whole match). With the interesting part captured, I can refer to it in the replacement string.
*as long as you don't have multiple of such assignments within one string. Otherwise, you should break up the string into parts which contain only one such assignment and apply this replacement for each of those strings. Or, try (.*?) instead of (.*), it compiles for me - AFAIK that makes the .* match as few characters as possible.
If the methods actually being called vary, then replace their names in the regex with the variations you expect, e.g. replace CInt with (?:CInt|CStr), Math\\.Truncate with Math\\.(?:Truncate|Random), etc. (Using (?: instead of ( makes the group non-capturing, so it won't take up a $1, $2, etc. slot.)
If that gets too complicated, then you should really think about whether you want to do it with regex at all, or whether it'd be easier to just write a somewhat longer function with plain string methods, like indexOf and substring :)
Bonus; if absolutely everything varies, but the call depth, then you might try this one:
String output = input.replaceAll("[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*[\\w\\d.]+\\s*\\(\\s*(.*)\\s*\\)\\s*\\)\\s*\\)", "$1");
Yes, it's definitely a nightmare to read, but as far as I understand, you are after this monster :)
You can use ([^()]*) instead of (.*) to prevent deeper nested expressions. Note, that fine control of depth is a real weakness of everyday regular expressions.
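For comparison, here is a rough sketch of the plain-string-methods route (indexOf, substring and a parenthesis counter) for the case where the nesting depth is not known in advance. The class and method names are invented for this example. It assumes the named function is followed directly by its opening parenthesis, that parentheses are balanced, and that every wrapper to be peeled takes a single argument; it also peels every wrapper it finds, so a genuine call at the innermost level would be unwrapped as well.

```java
// Hypothetical helper; names are illustrative only.
class CallUnwrapper {
    /** Index of the ')' matching the '(' at openIdx, or -1 if unbalanced. */
    static int matchingParen(String s, int openIdx) {
        int depth = 0;
        for (int i = openIdx; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    /** Replaces the first call to fn with its innermost wrapped argument. */
    static String unwrap(String input, String fn) {
        int start = input.indexOf(fn);
        if (start < 0) return input;
        int open = input.indexOf('(', start + fn.length());
        if (open < 0) return input;
        int close = matchingParen(input, open);
        if (close < 0) return input;

        String inner = input.substring(open + 1, close).trim();
        // Peel further wrappers such as "Math.Truncate ( ... )" while the
        // parentheses of the wrapper enclose the whole remaining argument.
        while (true) {
            int o = inner.indexOf('(');
            if (o <= 0 || matchingParen(inner, o) != inner.length() - 1) break;
            if (!inner.substring(0, o).trim().matches("[\\w.]+")) break;
            inner = inner.substring(o + 1, inner.length() - 1).trim();
        }
        return input.substring(0, start) + inner + input.substring(close + 1);
    }
}
```

Calling CallUnwrapper.unwrap(input, "CInt") on the input from the question should give the expected output from the question, without hardcoding Math.Truncate or ObjectToNumber.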

get the last portion of the link using java regex

I have an ArrayList of links. All links have the same format: abc.([a-z]*)/\\d{4}/
List<String > links= new ArrayList<>();
links.add("abc.com/2012/aa");
links.add("abc.com/2014/dddd");
links.add("abc.in/2012/aa");
I need to get the last portion of every link. ie, the part after domain name. Domain name can be anything(.com, .in, .edu etc).
/2012/aa
/2014/dddd
/2012/aa
This is the output I want. How can I get this using regex?
Thanks
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
(see here for background)
Why use regex ? Perhaps a simpler solution is to use String.split("/") , which gives you an array of substrings of the original string, split by /. See this question for more info.
Note that String.split() does in fact take a regex to determine the boundaries upon which to split. However you don't need a regex in this case and a simple character specification is sufficient.
Try the regex below and use a capturing group, which is delimited by parentheses ().
\.[a-zA-Z]{2,3}(/.*)
Pattern description :
dot followed by two or three letters followed by forward slash then any characters
Sample code:
Pattern pattern = Pattern.compile("\\.[a-zA-Z]{2,3}(/.*)");
Matcher matcher = pattern.matcher("abc.com/2012/aa");
if (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
/2012/aa
Note:
You can make it more precise by using \\.[a-zA-Z]{2,3}(/\\d{4}/.*) if there are always 4 digits in the pattern.
String result = s.replaceAll("^[^/]*","");
s would be the string in your list.
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
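Why not just use the URI class? (The link needs a scheme such as http:// for this to work; without one, the whole string is parsed as a relative path and getPath() returns it unchanged.)
output = new URI(link).getPath();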
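A small sketch of the scheme caveat, assuming that links without a scheme are meant as http links (that assumption, and the LinkPaths name, are mine and not from the question). It uses URI.create, which parses the same way as new URI(...) but throws an unchecked exception on malformed input.

```java
import java.net.URI;

// Hypothetical helper; assumes scheme-less links are http links.
class LinkPaths {
    static String pathOf(String link) {
        // Without a scheme, "abc.com/2012/aa" is parsed as a relative path,
        // so prepend one before asking for the path component.
        String withScheme = link.contains("://") ? link : "http://" + link;
        return URI.create(withScheme).getPath();
    }
}
```

With this, LinkPaths.pathOf("abc.com/2012/aa") should return /2012/aa, whereas URI.create("abc.com/2012/aa").getPath() on its own returns the whole string.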
Try this one and use the second capturing group
(.*?)(/.*)
Use foreach loop to iterate over list.
Use substring and indexOf('/').
FOR EXAMPLE
String s="abc.com/2014/dddd";
System.out.println(s.substring(s.indexOf('/')));
OUTPUT
/2014/dddd
Or you can go for the split method; note that split drops the leading /, so add it back:
System.out.println("/" + s.split("/", 2)[1]); // OUTPUT: /2014/dddd

Regex which matches a string containing at least the specified characters

I have a huge dictionary which I'm trying to look through using a regex. What I would like to do is to find all the words in the dictionary which contain at least one occurrence of each character I provide, in no particular order.
Right now I can find words which only contain the specified characters but like I said that is not exactly what I want.
Example:
I want at least one occurrence of each of the following characters {b, a, d}
astring.matches(regex)
I would expect words like:
badder,
baddest,
baffled
Notice they all contain at least one occurrence of each character, but in no particular order, and other characters are present in the strings.
Anyone know how to do this? Other suggestions are also welcome!
You need a series of look-aheads:
^(?=.*b)(?=.*a)(?=.*d).*
which is a pain to construct. However, you can ease the pain by using regex to build it:
String regex = "^" + "bad".replaceAll(".", "(?=.*$0)") + ".*";
If using repeatedly with String.matches(), you would be better to use the following code, because every call to String.matches() compiles the regex again (there is no caching):
// do this once
Pattern pattern = Pattern.compile(regex);
// reuse the pattern many times
if (pattern.matcher(input).matches())
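Putting the pieces together, a sketch of the look-ahead approach applied to a word list might look like this. The class and method names are made up for the example; unlike the replaceAll one-liner, it quotes each character with Pattern.quote so that a required character such as . or * is treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class ContainsAllRegex {
    /** Builds ^(?=.*c1)(?=.*c2)....* for the given required characters. */
    static Pattern containsAll(String chars) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : chars.toCharArray()) {
            sb.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(')');
        }
        return Pattern.compile(sb.append(".*").toString());
    }

    /** Returns the words that contain every required character at least once. */
    static List<String> filter(List<String> words, String chars) {
        Pattern p = containsAll(chars); // compiled once, reused for every word
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (p.matcher(w).matches()) {
                hits.add(w);
            }
        }
        return hits;
    }
}
```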
You can use a lookahead to do this if it's available
(?=.*b)(?=.*a)(?=.*d)
However this is quite inefficient. Any reason you can't use multiple String.indexOf checks?
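The indexOf alternative is short enough to sketch; the helper name is invented for this example, and it treats every character of the second argument as required.

```java
// Hypothetical helper; names are illustrative only.
class ContainsAllChars {
    /** True if word contains every character of required at least once. */
    static boolean containsAll(String word, String required) {
        for (int i = 0; i < required.length(); i++) {
            if (word.indexOf(required.charAt(i)) < 0) {
                return false; // this required character is missing
            }
        }
        return true;
    }
}
```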

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)
Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.
Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time
Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));
It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy quantifier changes things so that it starts off trying to match nothing and then builds from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, and so on. It can do its job in one forward pass over the input.
I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in the behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to an non-capturing group comprised of word characters followed by the equal sign (the next value)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each consisting of the characters of a value, then the last comma in the substring, then the next key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
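The split-on-"=" idea described above could be sketched roughly like this. The class name is invented, and the code assumes well-formed input in which neither keys nor values contain an equals sign and keys contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes '=' never appears inside a key or value.
class CommaTolerantParser {
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] parts = data.split("=", -1); // -1 keeps a trailing empty value
        if (parts.length < 2) {
            return result; // no '=' at all, nothing to parse
        }
        String key = parts[0]; // the first part is a bare key
        for (int i = 1; i < parts.length - 1; i++) {
            // Middle parts look like "value,NextKey"; the value may contain commas.
            int comma = parts[i].lastIndexOf(',');
            if (comma < 0) {
                throw new IllegalArgumentException("missing ',' before key in: " + parts[i]);
            }
            result.put(key, parts[i].substring(0, comma));
            key = parts[i].substring(comma + 1);
        }
        result.put(key, parts[parts.length - 1]); // the last part is a bare value
        return result;
    }
}
```

For example, "Key1=a,b,Key2=c,Key3=d" splits into Key1, "a,b,Key2", "c,Key3" and d, so Key1 ends up mapped to a,b even though the value contains a comma.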
If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure that there is a better way to parse this that goes through .find() rather than .matches(), which I think I would recommend as it allows you to move down the string one key=value pair at a time. The single big pattern above is what moves you into the whole "greedy" evaluation discussion.
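A minimal sketch of that find()-based loop, assuming keys are word characters and values contain no commas (the class name is made up for the example):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes values never contain a comma.
class PairScanner {
    // One key=value pair: a word-character key, '=', then everything up to the next comma.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^,]*)");

    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(data);
        while (m.find()) { // each find() advances to the next pair
            result.put(m.group(1), m.group(2));
        }
        return result;
    }
}
```

Because [^,]* cannot run past a comma, no backtracking across pairs is needed, and the number of keys does not have to be known up front.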
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
    // limit 2 keeps an empty value ("Key=") as "" instead of dropping it
    final String[] keyValue = pair.split("=", 2);
    final String key = keyValue[0];
    final String value = keyValue[1];
}

How can I perform a partial match with java.util.regex.*?

I have been using the java.util.regex.* classes for Regular Expression in Java and all good so far. But today I have a different requirement. For example consider the pattern to be "aabb". Now if the input String is aa it will definitely not match, however there is still possibility that if I append bb it becomes aabb and it matches. However if I would have started with cc, no matter what I append it will never match.
I have explored the Pattern and Matcher class but didn't find any way of achieving this.
The input will come from user and system have to wait till pattern matches or it will never match irrespective of any input further.
Any clue?
Thanks.
You should have looked more closely at the Matcher API; the hitEnd() method works exactly as you described:
import java.util.regex.*;
public class Test
{
    public static void main(String[] args) throws Exception
    {
        String[] ss = { "aabb", "aa", "cc", "aac" };
        Pattern p = Pattern.compile("aabb");
        Matcher m = p.matcher("");
        for (String s : ss) {
            m.reset(s);
            if (m.matches()) {
                System.out.printf("%-4s : match%n", s);
            }
            else if (m.hitEnd()) {
                System.out.printf("%-4s : partial match%n", s);
            }
            else {
                System.out.printf("%-4s : no match%n", s);
            }
        }
    }
}
output:
aabb : match
aa : partial match
cc : no match
aac : no match
As far as I know, Java is the only language that exposes this functionality. There's also the requireEnd() method, which tells you if more input could turn a match into a non-match, but I don't think it's relevant in your case.
Both methods were added to support the Scanner class, so it can apply regexes to a stream without requiring the whole stream to be read into memory.
Pattern p = Pattern.compile(expr);
Matcher m = p.matcher(string);
m.find();
So you want to know not whether a String s matches the regex, but whether there might be a longer String starting with s that would match? Sorry, Regexes can't help you there because you get no access to the internal state of the matcher; you only get the boolean result and any groups you have defined, so you never know why a match failed.
If you're willing to hack the JDK libraries, you can extend (or probably fork) java.util.regex and give out more information about the matching process. If the match failed because the input was 'used up' the answer would be true; if it failed because of character discrimination or other checks it would be false. That seems like a lot of work though, because your problem is completely the opposite of what regexes are supposed to do.
Another option: maybe you can simply redefine the task so that you can treat the input as the regexp and match aabb against the regex "aa.*"? You have to be careful about regex metacharacters, though.
For the example you give you could try to use an anti-pattern to disqualify invalid results. For example "^[^a]" would tell you your input "c..." can't match your example pattern of "aabb".
Depending on your pattern you may be able to break it up into smaller patterns to check and use multiple matchers and then set their bounds as one match occurs and you move to the next. This approach may work, but if your pattern is complex and can have variable-length sub-parts you may end up reimplementing part of the matcher in your own code to adjust the possible bounds of the match to make it more or less greedy. A pseudo-code general idea of this would be:
boolean match(String input, Matcher[] subpatterns, int matchStart, int matchEnd){
    matcher = next matcher in list;
    int stop = matchEnd;
    while(stop >= matchStart){
        if matcher.matches input from matchStart -> stop{
            if match(input, subpatterns, stop, end of string){
                return true;
            }
        }
        //make this match less greedy and try again
        stop--;
    }
    //no match for any candidate length
    return false;
}
You could then merge this idea with the anti-patterns, and have anti-subpatterns and after each subpattern match you check the next anti-pattern, if it matches you know you have failed, otherwise continue the matching pattern. You would likely want to return something like an enum instead of a boolean (i.e. ALL_MATCHED, PARTIAL_MATCH, ANTI_PATTERN_MATCH, ...)
Again depending on the complexity of your actual pattern that you are trying to match writing the appropriate sub patterns / anti-pattern may be difficult if not impossible.
One way to do this is to parse your regex into a sequence of sub-regexes, and then reassemble them in a way that allows you to do partial matches; e.g. "abc" has 3 sub-regexes "a", "b" and "c" which you can then reassemble as "a(b(c)?)?".
Things get more complicated when the input regex contains alternation and groups, but the same general approach should work.
The problem with this approach is that the resulting regex is more complicated, and could potentially lead to excessive backtracking for complex input regexes.
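For the simplest case, where the pattern is a plain literal such as "aabb", the reassembly can be sketched as below. The class name is invented, and this deliberately ignores alternation, groups and quantifiers in the input; it only shows the nested-optional shape.

```java
import java.util.regex.Pattern;

// Hypothetical helper; handles literal patterns only.
class PrefixPattern {
    /** For "abc" this builds the equivalent of a(?:b(?:c)?)? with each character quoted. */
    static Pattern forLiteral(String literal) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            if (i > 0) {
                sb.append("(?:"); // every later character is nested and optional
            }
            sb.append(Pattern.quote(String.valueOf(literal.charAt(i))));
        }
        for (int i = 1; i < literal.length(); i++) {
            sb.append(")?");
        }
        return Pattern.compile(sb.toString());
    }
}
```

With PrefixPattern.forLiteral("aabb"), the inputs a, aa, aab and aabb match in full, while cc and aac do not, so matches() on the rebuilt pattern answers the "could this still match?" question for literal patterns.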
If you make each character of the regex optional and relax the multiplicity constraints, you kinda get what you want. Example if you have a matching pattern "aa(abc)+bbbb", you can have a 'possible match' pattern 'a?a?(a?b?c?)*b?b?b?b?'.
This mechanical way of producing possible-match pattern does not cover advanced constructs like forward and backward refs though.
You might be able to accomplish this with a state machine (http://en.wikipedia.org/wiki/State_machine). Have your states/transitions represent valid input and one error state. You can then feed the state machine one character (possibly substring depending on your data) at a time. At any point you can check if your state machine is in the error state. If it is not in the error state then you know that future input may still match. If it is in the error state then you know something previously failed and any future input will not make the string valid.
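As a toy illustration of the state-machine idea, here is a sketch for a literal target such as "aabb", fed one character at a time. The class is invented for this example; a real pattern with alternation or repetition would need a proper automaton with more than one transition per state.

```java
// Hypothetical state machine for a literal target string.
class LiteralStateMachine {
    private static final int ERROR = -1;

    private final String target;
    private int state = 0; // number of target characters matched so far

    LiteralStateMachine(String target) {
        this.target = target;
    }

    /** Feeds one character; once in the error state, the machine stays there. */
    void feed(char c) {
        if (state == ERROR) {
            return;
        }
        boolean expected = state < target.length() && target.charAt(state) == c;
        state = expected ? state + 1 : ERROR;
    }

    /** True if no future input can lead to a match. */
    boolean isError() {
        return state == ERROR;
    }

    /** True if the input seen so far is exactly the target. */
    boolean isAccepted() {
        return state == target.length();
    }
}
```

After feeding a, a the machine is neither accepted nor in error, so the caller keeps waiting; after feeding c as the first character it is in the error state and the caller can give up immediately.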
