Capturing <thisPartOnly> and (thisPartOnly) with the same group - java

Let's say we have the following input:
<amy>
(bob)
<carol)
(dean>
We also have the following regex:
<(\w+)>|\((\w+)\)
Now we get two matches (as seen on rubular.com):
<amy> is a match, \1 captures amy, \2 fails
(bob) is a match, \2 captures bob, \1 fails
This regex does most of what we want, which are:
It matches the open and close brackets properly (i.e. no mixing)
It captures the part we're interested in
However, it does have a few drawbacks:
The capturing pattern (i.e. the "main" part) is repeated
It's only \w+ in this case, but generally speaking this can be quite complex,
If it involves backreferences, then they must be renumbered for each alternate!
Repetition makes maintenance a nightmare! (what if it changes?)
The groups are essentially duplicated
Depending on which alternate matches, we must query different groups
It's only \1 or \2 in this case, but generally the "main" part can have capturing groups of their own!
Not only is this inconvenient, but there may be situations where this is not feasible (e.g. when we're using a custom regex framework that is limited to querying only one group)
The situation quickly worsens if we also want to match {...}, [...], etc.
So the question is obvious: how can we do this without repeating the "main" pattern?
Note: for the most part I'm interested in java.util.regex flavor, but other flavors are welcomed.
Appendix
There's nothing new in this section; it only illustrates the problem mentioned above with an example.
Let's take the above example to the next step: we now want to match these:
<amy=amy>
(bob=bob)
[carol=carol]
But not these:
<amy=amy) # non-matching bracket
<amy=bob> # left hand side not equal to right hand side
Using the alternate technique, we have the following that works (as seen on rubular.com):
<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\]
As explained above:
The main pattern can't simply be repeated; backreferences must be renumbered
Repetition also means maintenance nightmare if it ever changes
Depending on which alternate matches, we must query either \1 \2, \3 \4, or \5 \6

You can use a lookahead to "lock in" the group number before doing the real match.
String s = "<amy=amy>(bob=bob)[carol=carol]";
Pattern p = Pattern.compile(
"(?=[<(\\[]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\])");
Matcher m = p.matcher(s);
while(m.find())
{
System.out.printf("found %s in %s%n", m.group(2), m.group());
}
output:
found amy in <amy=amy>
found bob in (bob=bob)
found carol in [carol=carol]
It's still ugly as hell, but you don't have to recalculate all the group numbers every time you make a change. For example, to add support for curly brackets, it's just:
"(?=[<(\\[{]((\\w+)=\\2))(?:<\\1>|\\(\\1\\)|\\[\\1\\]|\\{\\1\\})"

In preg (Perl Regex library), this will match your example, and \3 will catch the insides:
((<)|\()(\w+)(?(2)>|\))
It will not work in JS, though - you did not specify the dialect...
It depends on the conditional operator (?(2)...|...) which basically says if 2 is a non-null capture, then match before the pipe, else match after the pipe. In this form, pipe is not alternation ("or").
UPDATE Sorry, I completely missed the Java bit :) Anyway, apparently Java does not support the conditional construct; and I have no idea how else I'd go about it :(
Also, for your Appendix (even though it's the wrong dialect):
(?:(<)|(\()|\[)(\w+)=\3(?(1)>|(?(2)\)|]))
The name is in again in \3 (I got rid of the first capturing paren, but I had to add another one for one extra opening paren check)

The only solution that I was able to come up with is inspired by technique of capturing an empty string on different alternates; backreferencing to these groups later can serve as pseudo-conditionals.
Thus, this pattern works for the second example (as seen on rubular.com):
__main__
/ \
(?:<()|\(()|\[())((\w+)=\5)(\1>|\2\)|\3\])
\_______________/ \_____________/
\1 \2 \3
So essentially for each opening bracket, we assign a group that captures an empty string. Then when we try to match the closing bracket, we see which group was succesful, and match the corresponding closing bracket.
The "main" part does not have to be repeated, but in Java, backreferences may have to be renumbered. This won't be a problem in flavors that support named groups.

May be this example in Perl will interest you :
$str = q/<amy=amy> (bob=bob) [carol=carol] <amy=amy) <amy=bob>/;
$re = qr/(?:<((\w+)=\2)>|\(((\w+)=\4)\)|\[((\w+)=\6)\])+/;
#list = ($str =~ /$re/g);
for(#list) {
say $i++," = ",$_;
}
I just surround your regex by (?:regex)+

When you get things like this, using a single regex is a silly restriction, and I simply don't agree with your "maintenance nightmare" to using more than one - repeating a similar-but-different expression several times is likely to be more maintainable (well, less unmaintainable), and maybe even better performance too, than a single overly-complex regex.
But anyway, there's no repetition if you just use variables to compose your regex.
Here's some pseudo-code:
Brackets = "<>,(),[]"
CoreRegex = "(\w+)=\1"
loop CurBracket in Brackets.split(',')
{
Input.match( Regex.quote(CurBracket.left(1)) & CoreRegex & Regex.quote(CurBracket.right(1)) )
}
(p.s.that's just to give the general idea - I'd probably use already-escaped arrays for the bracket sets in actual implementation).

Assuming there is no easy way to manually write this regular expression, why not leave it to the computer?
You could have a function, maybe like below (I am using C# syntax here, as I am a bit more familiar with regexes here than in Java, but it should not be too difficult to adapt it to Java).
Note that I left the function AdaptBackreferences() more or less unimplemented as an exercise to the reader. It should just adapt the backreference numbering.
struct BracketPair {public string Open; public string Close;};
static string[] MatchTextInBrackets(string text, string innerPattern, BracketPair[] bracketPairs) {
StringBuilder sb = new StringBuilder();
// count number of catching parentheses of innerPattern here:
int numberOfInnerCapturingParentheses = Regex.Match("", innerPattern).Groups.Count - 1;
bool firstTime = true;
foreach (BracketPair pair in bracketPairs) {
// apply logic to change backreference numbering:
string adaptedInnerPattern = AdaptBackreferences(innerPattern);
if (firstTime) { firstTime = false; } else { sb.Append('|'); }
sb.Append(pair.Open).Append("(").Append(adaptedInnerPattern).Append(")").Append(pair.Close);
}
string myPattern = sb.ToString();
MatchCollection matches = Regex.Matches(text, myPattern);
string[] result = new string[matches.Count];
for(int i=0; i < matches.Count; i++) {
StringBuilder mb = new StringBuilder();
for(int j=0; j < bracketPairs.Length; j++) {
mb.Append(matches[i].Groups[1 + j * (numberOfInnerCapturingParentheses + 1)]); // append them all together, assuming all exept one are empty
}
result[i] = mb.ToString();
}
return result;
}
static string AdaptBackreferences(string pattern) { return pattern; } // to be written

Related

Porting Twemoji regex to extract Unicode emojis in Java

I'm trying to identify the same emojis in a String for extraction that Twemoji would, using Java. A straight up port isn't working for a great deal of emojis - I think I've identified the issue, so I'll give it in an example below:
Suppose we have the emoji 🪔 (Codeunits being \ud83e\ude94). In Javascript regex, this is captured by, \ud83e[\ude94-\ude99] which will first match the \ude83e then find subsequent \ude94 within the range indicated inside the brackets. The same expression in Java regex, however, fails to match at all. If I modify the Java pattern to [\ud83e[\ude94-\ude99]], according to an online engine, the 2nd half is captured, but not the 1st.
My working theory is that Java encounters the brackets and treats everything inside as a single codepoint and when combined with the outside codeunit, thinks it's looking for two codepoints instead of one. Is there an easy way to fix this or the regex pattern to work around it? The obvious fix would be to use something like [\ud83e\ude94-\ud83e\ude99], the actual regex pattern is quite lengthy. I wonder if there might be an easy encoding fix somewhere here as well.
Toy sample below:
public static void main(String[] args) {
String emojiPattern = "\ud83e[\ude94-\ude99]";
String raw = "\ud83e\ude94";
Pattern pattern = Pattern.compile(emojiPattern);
Matcher matcher = pattern.matcher(raw);
System.out.println(matcher.matches());
}
If you're trying to match a single specific codepoint, don't mess with surrogate pairs; refer to it by number:
String emojiPattern = "\\x{1FA94}";
or by name:
String emojiPattern = "\\N{DIYA LAMP}"
If you want to match any codepoint in the block U+1FA94 is in, use the name of the block in a property atom:
String emojiPattern = "\\p{blk=Symbols and Pictographs Extended-A}";
If you switch out any of these three regular expressions your example program will print 'true'.
The problem you're running into is a UTF-16 surrogate pair is a single codepoint, and the RE engine matches codepoints, not code units; you can't match just the low or high half - just the pattern "\ud83e" will fail to match too (When used with Matcher#find instead of Matcher#matches of course), for example. It's all or none.
To do the kind of ranged matching you want, you have to turn away from regular expressions and look at the code units directly. Something like
char[] codeUnits = raw.toCharArray();
for (int i = 0; i < codeUnits.length - 1; i++) {
if (codeUnits[i] == 0xD83E &&
(codeUnits[i + 1] >= 0xDE94 && codeUnits[i + 1] <= 0xDE99)) {
System.out.println("match");
}
}

Regex in Java: match groups until first symbol occurrence

My string looks like this:
"Chitkara DK, Rawat DJY, Talley N. The epidemiology of childhood recurrent abdominal pain in Western countries: a systematic review. Am J Gastroenterol. 2005;100(8):1868-75. DOI."
What I want is to get letters in uppercase (as separate words only) until first dot, to get: DK DJY N. But not other characters after, like J DOI.
Here`s my part of code for Java class Pattern:
\\b[A-Z]{1,3}\\b
Is there a general option in regex to stop matching after certain character?
You can make use of the contionous matching using \G and extract your desired matches from the first capturing group:
(?:\\G|^)[^.]+?\\b([A-Z]{1,3})\\b
You need to use the MULTILINE flag to use this in a multiline context. If your content is always a single line you may drop the |^ from your pattern.
See https://regex101.com/r/JXIu21/3
Note that regex101 uses a PCRE pattern, but all features used are also available in Java regex.
Sebastian Proske's answer is great, but it's often easier (and more readable) to split complex parsing tasks into separate steps. We can split your goal into two separate steps and thereby create a much simpler and more clearly-correct solution, using your original pattern.
private static final Pattern UPPER_CASE_ABBV_PATTERN =
Pattern.compile("\\b[A-Z]{1,3}\\b");
public static List<String> getAbbreviationsInFirstSentence(String input) {
// isolate the first sentence, since that's all we care about
String firstSentence = input.split("\\.")[0];
// then look for matches in the first sentence
Matcher m = UPPER_CASE_ABBV_PATTERN.matcher(firstSentence);
List<String> results = new ArrayList<>();
while (m.find()) {
results.add(m.group());
}
return results;
}

Java Regex to check "=number", ex "=5455"?

I want to check a string that matches the format "=number", ex "=5455".
As long as the fist char is "=" & the subsequence is any number in [0-9] (dot is not allowed), then it will popup "correct" message.
if(str.matches("^[=][0-9]+")){
Window.alert("correct");
}
So, is this ^[=][0-9]+ the correct one?
if it is not correct, can u provide a correct solution?
if it is correct, then can u find a better solution?
I'm no big regex expert and more knowledgeable people than me might correct this answer, but:
I don't think there's a point in using [=] rather than simply = - the [...] block is used to declare multiple choices, why declare a multiple choice of one character?
I don't think you need to use ^ (if your input string contains any character before =, it won't match anyway). I'm unsure as to whether its presence makes your regex faster, slower or has no effect.
In conclusion, I'd use =[0-9]+
That should be correct it is looking for an anchored at the beginning = sign and then 1 or more digits between 0-9
Your regex will work, even though it can be simplified:
.matches() does not really do regex matching, since it tries and matches all the input against the regex; therefore the beginning of input anchor is not needed;
you don't need the character class around the =.
Therefore:
if (str.matches("=[0-9]+")) { ... }
If you want to match a string which only begins with that regex, you have to use a Pattern, a Matcher and .find():
final Pattern p = Pattern.compile("^=[0-9]+");
final Matcher m = p.matcher(str);
if (m.find()) { ... }
And finally, Matcher also has .lookingAt() which anchors the regex only at the beginning of the input.

Java repetitive pattern matching (3)

I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
All the text groups, with the underscore included in the end, from Regex 1
All the text groups apart from the last, with no underscore but 2 empty groups in the end, from Regex 2.
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, none of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
Is there a correct (fully working) regex that excludes the trailing underscore?
Is there any way of checking that only the correct regex has matched?
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.
Matcher#matches method attempts to match the entire input sequence against the pattern, that is why you are getting the result Input NOT MATCHED. See the documentation here http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore you can use this regex (slight modification of what you already have)
(?:\\||^)([^\\\\|_]*)
This would work if you are sure that _ comes just before |.
RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added.
In java 6 this prints each group (the find() loop).
public static void main(String[] args)
{
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher;
Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
matcher = pattern.matcher(input);
if (matcher.matches())
{
Matcher groupMatcher;
System.out.println("matcher.matches() is true");
int groupCount = matcher.groupCount();
for (int index = 1; index <= groupCount; ++index)
{
System.out.print("group (pattern)[");
System.out.print(index);
System.out.print("]: ");
System.out.println(matcher.group(index));
}
groupMatcher = groupPattern.matcher(input);
while (groupMatcher.find())
{
System.out.print("group (groupPattern):");
System.out.println(groupMatcher.group());
System.out.println(groupMatcher.group(1));
}
}
else
{
System.out.println("No match");
}
}

How can I perform a partial match with java.util.regex.*?

I have been using the java.util.regex.* classes for Regular Expression in Java and all good so far. But today I have a different requirement. For example consider the pattern to be "aabb". Now if the input String is aa it will definitely not match, however there is still possibility that if I append bb it becomes aabb and it matches. However if I would have started with cc, no matter what I append it will never match.
I have explored the Pattern and Matcher class but didn't find any way of achieving this.
The input will come from user and system have to wait till pattern matches or it will never match irrespective of any input further.
Any clue?
Thanks.
You should have looked more closely at the Matcher API; the hitEnd() method works exactly as you described:
import java.util.regex.*;
public class Test
{
public static void main(String[] args) throws Exception
{
String[] ss = { "aabb", "aa", "cc", "aac" };
Pattern p = Pattern.compile("aabb");
Matcher m = p.matcher("");
for (String s : ss) {
m.reset(s);
if (m.matches()) {
System.out.printf("%-4s : match%n", s);
}
else if (m.hitEnd()) {
System.out.printf("%-4s : partial match%n", s);
}
else {
System.out.printf("%-4s : no match%n", s);
}
}
}
}
output:
aabb : match
aa : partial match
cc : no match
aac : no match
As far as I know, Java is the only language that exposes this functionality. There's also the requireEnd() method, which tells you if more input could turn a match into a non-match, but I don't think it's relevant in your case.
Both methods were added to support the Scanner class, so it can apply regexes to a stream without requiring the whole stream to be read into memory.
Pattern p = Pattern.compile(expr);
Matcher m = p.matcher(string);
m.find();
So you want to know not whether a String s matches the regex, but whether there might be a longer String starting with s that would match? Sorry, Regexes can't help you there because you get no access to the internal state of the matcher; you only get the boolean result and any groups you have defined, so you never know why a match failed.
If you're willing to hack the JDK libraries, you can extend (or probably fork) java.util.regex and give out more information about the matching process. If the match failed because the input was 'used up' the answer would be true; if it failed because of character discrimination or other checks it would be false. That seems like a lot of work though, because your problem is completely the opposite of what regexes are supposed to do.
Another option: maybe you can simply redefine the task so that you can treat the input as the regexp and match aabb against *aa.**? You have to be careful about regex metacharacters, though.
For the example you give you could try to use an anti-pattern to disqualify invalid results. For example "^[^a]" would tell you you're input "c..." can't match your example pattern of "aabb".
Depending on your pattern you may be able to break it up into smaller patterns to check and use multiple matchers and then set their bounds as one match occurs and you move to the next. This approach may work but if you're pattern is complex and can have variable length sub-parts you may end up reimplementing part of the matcher in your own code to adjust the possible bounds of the match to make it more or less greedy. A pseudo-code general idea of this would be:
boolean match(String input, Matcher[] subpatterns, int matchStart, int matchEnd){
matcher = next matcher in list;
int stop = matchend;
while(true){
if matcher.matches input from matchstart -> matchend{
if match(input, subpatterns, end of current match, end of string){
return true;
}else{
//make this match less greedy
stop--;
}
}else{
//no match
return false;
}
}
}
You could then merge this idea with the anti-patterns, and have anti-subpatterns and after each subpattern match you check the next anti-pattern, if it matches you know you have failed, otherwise continue the matching pattern. You would likely want to return something like an enum instead of a boolean (i.e. ALL_MATCHED, PARTIAL_MATCH, ANTI_PATTERN_MATCH, ...)
Again depending on the complexity of your actual pattern that you are trying to match writing the appropriate sub patterns / anti-pattern may be difficult if not impossible.
One way to do this is to parse your regex into a sequence of sub-regexes, and then reassemble them in a way that allows you to do partial matches; e.g. "abc" has 3 sub-regexes "a", "b" and "c" which you can then reassemble as "a(b*(c)?)?".
Things get more complicated when the input regex contains alternation and groups, but the same general approach should work.
The problem with this approach is that the resulting regex is more complicated, and could potentially lead to excessive backtracking for complex input regexes.
If you make each character of the regex optional and relax the multiplicity constraints, you kinda get what you want. Example if you have a matching pattern "aa(abc)+bbbb", you can have a 'possible match' pattern 'a?a?(a?b?c?)*b?b?b?b?'.
This mechanical way of producing possible-match pattern does not cover advanced constructs like forward and backward refs though.
You might be able to accomplish this with a state machine (http://en.wikipedia.org/wiki/State_machine). Have your states/transitions represent valid input and one error state. You can then feed the state machine one character (possibly substring depending on your data) at a time. At any point you can check if your state machine is in the error state. If it is not in the error state then you know that future input may still match. If it is in the error state then you know something previously failed and any future input will not make the string valid.

Categories