This RegEx captures wrong number of groups - java

I have to parse a string and capture some values:
FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE
I want to capture 2 groups:
grp 1: 2, 2
grp 2: TU, WE
The numbers represent intervals; TU and WE represent weekdays. I need both.
I'm using this code:
private final static java.util.regex.Pattern regBYDAY = java.util.regex.Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");
String rrule = "FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE";
java.util.regex.Matcher result = regBYDAY.matcher(rrule);
if (result.matches())
{
int grpCount = result.groupCount();
for (int i = 1; i < grpCount; i++)
{
String g = result.group(i);
...
}
}
grpCount == 2 - why? If I read the java documentation correctly (that little bit) I should get 5? 0 = the whole expression, 1,2,3,4 = my captures 2,2,TU and WE.
result.group(1) == "2";
I'm a C# Programmer with very little java experience so I tested the RegEx in the "Regular Expression Workbench" - a great C# Program for testing RegEx. There my RegEx works fine.
https://code.msdn.microsoft.com/RegexWorkbench
RegExWB:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
Matching:
FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR
1 => 22
1 => -2
1 => +223
2 => TU
2 => WE
2 => FR

You can also split the work into two simpler patterns. This improves readability and, up to a point, keeps the code independent of the regex implementation, because it relies only on a more common regexp subset:
final Pattern re1 = Pattern.compile(".*;BYDAY=(.*)");
final Pattern re2 = Pattern.compile("(?:([+-]?[0-9]*)([A-Z]{2}),?)");
final Matcher matcher1 = re1.matcher(rrule);
if ( matcher1.matches() ) {
final String group1 = matcher1.group(1);
Matcher matcher2 = re2.matcher(group1);
while(matcher2.find()) {
System.out.println("group: " + matcher2.group(1) + " " +
matcher2.group(2));
}
}

Your regex works the same in Java as it does in C#; it's just that in Java you can only access the final capture for each group. In fact, .NET is one of only two regex flavors I know of that let you retrieve intermediate captures (Perl 6 being the other).
This is probably the simplest way to do what you want in Java:
String s= "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Pattern p = Pattern.compile("(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
Matcher m = p.matcher(s);
while (m.find())
{
System.out.printf("Interval: %5s, Day of Week: %s%n",
m.group(1), m.group(2));
}
Here's the equivalent C# code, in case you're interested:
string s = "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Regex r = new Regex(@"(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("Interval: {0,5}, Day of Week: {1}",
m.Groups[1], m.Groups[2]);
}
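To see the "final capture only" behaviour concretely, here is a minimal Java sketch. The class and method names (LastCaptureDemo, inspect) are made up for illustration; the pattern and the input are the ones from the question. groupCount() reports how many capturing groups the pattern contains (two here), not how many times they matched, and each group keeps only its last capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question or answers.
final class LastCaptureDemo {
    // The pattern from the question: two capturing groups inside a repeated non-capturing group.
    static final Pattern WHOLE =
        Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");

    // Returns {groupCount, group(1), group(2)}, or null when the input does not match.
    static String[] inspect(String rrule) {
        Matcher m = WHOLE.matcher(rrule);
        if (!m.matches()) {
            return null;
        }
        return new String[] { String.valueOf(m.groupCount()), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] r = inspect("FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR");
        // Only the last iteration of the repeated group survives.
        System.out.println(String.join(" | ", r)); // prints "2 | +223 | FR"
    }
}
```

This is why the find() loop above is needed in Java: every call to find() yields one interval/weekday pair instead of overwriting the previous one.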

I'm a bit rusty, but I'll propose two caveats. First of all, regexps come in various dialects. There is a fantastic O'Reilly book about this, but there is a chance that your C# utility applies slightly different rules.
As an example, I used a similar (but different) tool and discovered that it parsed things differently...
First of all, it rejected your regexp (maybe a typo?): the initial "*" does not make sense unless you put a dot (.) in front of it, like this:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
Now it was accepted, but it "matched" only the 2/WE part, and "skipped" the 2/TU pair.
(I suggest you read about greedy and non-greedy matching to understand this a bit better.)
Therefore I updated your pattern as follows:
.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?),(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*
And now it works and correctly captures 2,TU,2 and WE.
Maybe this helps?

Related

Java Regex needed

I need regex that will fail only for below patterns and pass for everything else.
RXXXXXXXXXX (X are digits)
XXX.XXX.XXX.XXX (IP address)
I have basic knowledge of regex but not sure how to achieve this one.
For the first part, I know how to write a regex that does not start with R, but I'm not sure how to make it allow any number of digits except 10.
^[^R][0-9]{10}$ - it will do the !R thing but not sure how to pull off the not 10 digits part.
Well, simply define a regex:
Pattern p = Pattern.compile("R[0-9]{10} ((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\\.((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}");
Matcher m = p.matcher(theStringToMatch);
if(!m.matches()) {
//do something, the test didn't pass thus ok
}
Or a jdoodle.
EDIT:
Since you actually wanted two possible patterns to filter out, change the pattern to:
Pattern p = Pattern.compile("(R[0-9]{10})|(((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\\.((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3})");
If you want to match the entire string (so that the string starts and ends with the pattern), place ^ in front of the pattern and $ at the end.
This should work:
!string.matches("R\\d{10}|(\\d{3}\\.){3}\\d{3}");
The \d means any digit, the braces say how many times the preceding item is repeated, and the \. means a literal period character. Parentheses indicate a grouping. Inside a Java string literal each backslash has to be doubled.
Here's a good reference on java regex with examples.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
Regex is not meant to validate every kind of input. You could, but sometimes it is not the right approach (like using a wrench as a hammer: it can do the job, but it is not meant for it).
Split the string in two parts, by the space, then validate each:
String foo = "R1234567890 255.255.255.255";
String[] stringParts = foo.split(" ");
Pattern p = Pattern.compile("^[^R][0-9]{10}$");
Matcher m = p.matcher(stringParts[0]);
if (m.matches()) {
//the first part is valid
//start validating the IP
String[] ipParts = stringParts[1].split("\\.");
for (String ip : ipParts) {
int ipPartValue = Integer.parseInt(ip);
if (!(ipPartValue >= 0 && ipPartValue <= 255)) {
//error...
}
}
}
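The two ideas above (a regex for the overall shape, plain integer checks for the octet range) can be combined. The following is only a sketch: the class name InputFilter and the method isRejected are invented for illustration, and it assumes the input is a single token that is either an R-number, an IP address, or something else, with octets allowed to have one to three digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from any of the answers above.
final class InputFilter {
    private static final Pattern R_NUMBER = Pattern.compile("R[0-9]{10}");
    private static final Pattern DOTTED_QUAD = Pattern.compile("[0-9]{1,3}(\\.[0-9]{1,3}){3}");

    // Returns true for the two shapes the question wants to fail:
    // 'R' followed by exactly ten digits, or an IPv4-style address with octets 0..255.
    static boolean isRejected(String s) {
        if (R_NUMBER.matcher(s).matches()) {
            return true;
        }
        if (!DOTTED_QUAD.matcher(s).matches()) {
            return false;
        }
        // The regex guarantees one to three digits per part, so parseInt cannot throw here.
        for (String part : s.split("\\.")) {
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }
}
```

Everything for which isRejected returns false passes, including an R followed by nine or eleven digits.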

Need to match a string with following regex format

A String will be of format [( 1.0N)-195( 1.0E)-195(28)-769.7(NESW)-1080.8(U)-617.9(43-047-30127)]
I need a regex to match to see if the string contains -XXX-XXX (where X is a digit)
Pattern p = Pattern.compile("(?=.*?(?:-[0-9][0-9][0-9]-[0-9][0-9][0-9]))");
if(p.matcher(a).matches())
{
System.out.println("Matched");
}
Also I've tried -[0-9][0-9][0-9]-[0-9][0-9][0-9] and (?=.*?-[0-9][0-9][0-9]-[0-9][0-9][0-9])
Nothing worked
A substring would be much easier, but (?:\\d{2})(-\\d{3}-\\d{5}) will match -XXX-XXXXX as group 1.
I'm assuming the 3 digits in the last number was a mistake. If not just change the 5 to a 3.
If you want to check if the string contains -3digits-3digits
String a = "43-037-30149";
// matches() must consume the whole input, so allow any text after the group as well
Pattern p = Pattern.compile(".*(-[0-9]{3}-[0-9]{3}).*");
if(p.matcher(a).matches())
{
System.out.println("Matched");
}
Why don't you use substring?
String b = a.substring(2, 10);
or this one:
String c = a.substring(a.indexOf('-'), a.indexOf('-') + 8);
Taking just a substring would also be much more efficient! ;)
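For a pure "does the string contain this pattern" check, Matcher.find() is the usual tool: matches() only succeeds when the pattern covers the entire input, while find() looks for the pattern anywhere in it. A minimal sketch follows; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the containment check asked about above.
final class ContainsCheck {
    // -XXX-XXX where X is a digit
    private static final Pattern DASHED = Pattern.compile("-[0-9]{3}-[0-9]{3}");

    static boolean containsDashedDigits(String s) {
        // find() scans for the pattern anywhere in s; no leading or trailing .* is needed.
        return DASHED.matcher(s).find();
    }

    public static void main(String[] args) {
        String a = "[( 1.0N)-195( 1.0E)-195(28)-769.7(NESW)-1080.8(U)-617.9(43-047-30127)]";
        if (containsDashedDigits(a)) {
            System.out.println("Matched");
        }
    }
}
```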

Find substring that surrounds a pattern using java

I have a long string variable X and another string (a word or two in length) Y. I want to find 50 words before and after Y where it appears in X. How can I achieve this using regex?
Why does this have to be a regexp? What if there aren't 50 words surrounding it, because the match is at the beginning of the string?
Consider just locating the match, then separately finding an appropriate "snippet" surrounding it, without trying to cram it all into one magic, unmaintainable regular expression.
There is nothing wrong with doing it explicitly: find the match, grow the snippet to the desired size, return the snippet. Make that a well-documented method "extractSnippet" instead of trying to do it in a single regular expression.
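A rough sketch of what such an extractSnippet method could look like is given here. The signature, the class name and the words helper are assumptions made for illustration. It locates Y with a plain indexOf, so it also finds Y inside a longer word, and it treats any run of whitespace as a word separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical implementation of the "extractSnippet" idea described above.
final class SnippetExtractor {
    // Splits on whitespace and drops empty tokens ("".split(...) would yield one empty string).
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        for (String w : s.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Returns up to `radius` words before the first occurrence of `needle`, the needle itself,
    // and up to `radius` words after it, joined by single spaces; null if the needle is absent.
    static String extractSnippet(String text, String needle, int radius) {
        int start = text.indexOf(needle);
        if (start < 0) {
            return null;
        }
        List<String> before = words(text.substring(0, start));
        List<String> after = words(text.substring(start + needle.length()));

        int from = Math.max(0, before.size() - radius);
        List<String> parts = new ArrayList<>(before.subList(from, before.size()));
        parts.add(needle);
        parts.addAll(after.subList(0, Math.min(radius, after.size())));
        return String.join(" ", parts);
    }
}
```

Calling extractSnippet(x, y, 50) simply returns fewer words when the match sits near the start or end of X, which is the edge case raised above.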
This code generates a string of 300 words (Word0 .. Word299), defines the target to search for as "Word12 Word13" and then finds up to 50 words before that string and up to 50 words after it.
final StringBuilder b = new StringBuilder();
final String matchWords = "Word12 Word13";
for (int i = 0; i < 300; i++) b.append("Word").append(i).append(" ");
final Matcher m =
Pattern.compile(
"((?:\\S+\\s+){0,50})" + Pattern.quote(matchWords) + "((?:\\s+\\S+){0,50})"
).matcher(b.toString());
if (m.find()) System.out.println("Words before: " + m.group(1) +
"\nAfter: " + m.group(2));
Check this PHP regex out, I'm pretty sure it'll work for Java too:
php > preg_match_all("/([a-z]+ ){4}donkey( [a-z]+){4}/","summer donna summer donna summer donkey hop hop hop hop bzzp",$matches); print_r($matches);
Array
(
[0] => Array
(
[0] => donna summer donna summer donkey hop hop hop hop
)
[1] => Array
(
[0] => summer
)
[2] => Array
(
[0] => hop
)
)
Java needs the java.util.regex package to do that (import java.util.regex.* pulls in both Pattern and Matcher). Import it and create an instance such as:
Pattern p = Pattern.compile("(\\d+)");
Matcher m = p.matcher(name);
StringBuffer sb = new StringBuffer();
while(m.find()){
sb.append(m.group()); // appends the text matched by Pattern p to sb
}
Inside the Pattern you can use ordinary regex syntax.
I would think you could run into issues where there are not 50 words before or after the Y string.
Roughly, I would first check for existence by running a pattern built from Y against X. Then go to the expense of counting words with a split operation and a " " space delimiter. From there, it's a counting problem.

Match exactly N repetitions of the same character

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.
Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken.
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said...I suspect the simplest cross-language solution is just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.
It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
next unless length $1 == 4;
my $substr = $1;
print "$substr\n";
}
output
aaaa
cccc
Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
my($n,$str) = @_;
my $n_minus_1 = $n - 1;
my $_run_pattern = qr/
(?:
# In the middle of the string, we have to force the
# run being matched to start on a new character.
# Otherwise, the regex engine will give a false positive
# by starting in the middle of a run.
(.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
#$1 $2 $3
# Don't forget about a potential run that starts at
# the front of the target string.
^(.) (\4{$n_minus_1}) (?!\4)
# $4 $5
)
/x;
my @runs;
while ($str =~ /$_run_pattern/g) {
push @runs, defined $4 ? "$4$5" : "$2$3";
}
@runs;
}
A few test cases:
my @tests = (
"xxaaaayyybbbbbzzccccxx",
"aaaayyybbbbbzzccccxx",
"xxaaaa",
"aaaa",
"",
);
$" = "][";
for (@tests) {
my @runs = runs_of_length 4, $_;
print qq<"$_":\n>,
" - [@runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.
How about this in python?
def match(string, n):
    parts = []
    current = None
    for c in string:
        if not current:
            current = c
        else:
            if c == current[-1]:
                current += c
            else:
                parts.append(current)
                current = c
    if current:
        parts.append(current)  # keep the final run as well
    result = []
    for part in parts:
        if len(part) == n:
            result.append(part)
    return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ['bbbbb']
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ['yyy']
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz', 'xx']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the requested length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!
Why not leave to the regexp engine what it does best (finding the longest run of the same symbol) and then check the length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while($str =~ /(.)\1{3,}/g){
if(($+[0] - $-[0]) == 4){ # insert here full match length counting specific to language
print (($1 x 4), "\n")
}
}
>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and check for occasions when N==4 very easily, like this:
>>> [item for item in z if len(item)==4]
['aaaa', 'cccc']
In Java we can do it like this:
String test ="xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLegth = 4; // length of the same characters
Pattern p = Pattern.compile("(\\w)\\1+",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find())
{
if(m.group().length()==trimLegth) {
System.out.println("Same Characters String " + m.group());
}
}
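A slightly more general variant of the same idea, as a sketch: let the regex find every maximal run, then filter by length. The names ExactRuns and runsOfLength are invented here. Unlike the snippet above it uses (.)\1* rather than (\w)\1+, so it also handles N = 1 and non-word characters, and it compares characters case-sensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collect maximal runs of one repeated character, keep those of length n.
final class ExactRuns {
    // (.)\1* is greedy, so each match is a whole run; DOTALL lets '.' match line breaks too.
    private static final Pattern RUN = Pattern.compile("(.)\\1*", Pattern.DOTALL);

    static List<String> runsOfLength(String text, int n) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            // length() counts UTF-16 code units, which is fine for the ASCII examples here.
            if (m.group().length() == n) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runsOfLength("xxaaaayyybbbbbzzccccxx", 4)); // prints [aaaa, cccc]
    }
}
```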

Java Regular Expression code to process {Item1}.Item2 into array or list

Can anyone give me some advice into how to use Java RegEx to process:
{Item1}.Item2
so that I get an array or list containing
Item1
Item2
I was thinking of a RegEx like:
Pattern p = Pattern.compile("\\{(.+?)\\}\\.(.*?)");
Matcher match = p.matcher(mnemonicExpression);
while(match.find()) {
System.out.println(match.group());
}
But this does not seem to work.
Any help would be much appreciated.
Kind Regards
jcstock74
You need to grab the individual match groups 1 and 2. By using group(), you're effectively doing group(0), which is the entire match. Also, the last .*? shouldn't be reluctant, otherwise, it matches just an empty string.
Try this:
Pattern p = Pattern.compile("^\\{(.+?)\\}\\.(.*)$");
//                                \ /        \/
//                                 1         2
Matcher match = p.matcher("{Item1}.Item2");
while(match.find()) {
System.out.println("1 = " + match.group(1));
System.out.println("2 = " + match.group(2));
}
which produces:
1 = Item1
2 = Item2
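To get the two items back as a list, and to see why the trailing reluctant .*? came back empty, something along these lines should work. It is a sketch only; ItemParser, parse and reluctantSecondGroup are names made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the corrected pattern from the answer above.
final class ItemParser {
    private static final Pattern GREEDY = Pattern.compile("\\{(.+?)\\}\\.(.*)");
    private static final Pattern RELUCTANT = Pattern.compile("\\{(.+?)\\}\\.(.*?)");

    // Returns [group 1, group 2], or an empty list when the input has a different shape.
    static List<String> parse(String expr) {
        Matcher m = GREEDY.matcher(expr);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1), m.group(2));
    }

    // With find(), nothing forces the final .*? to grow, so it settles for zero characters.
    static String reluctantSecondGroup(String expr) {
        Matcher m = RELUCTANT.matcher(expr);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("{Item1}.Item2")); // prints [Item1, Item2]
    }
}
```

Using matches() in parse makes the ^ and $ anchors unnecessary, because matches() already requires the whole input to fit the pattern.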
Bonus answer: This web page has a very nice regular expression tester using java.util.regex. This is the best way to test your expressions and it even provides the escaped java String you would use in Pattern.compile():
http://www.regexplanet.com/simple/index.html
