Recursive group capturing regex with backreference in Java

I am trying to capture multiple groups recursively in a string, using a backreference to a group within the regex. Even though I am using a Pattern and a Matcher and a "while(matcher.find())" loop, it only captures the last instance instead of all the instances. In my case the only possible tags are <sm>,<po>,<pof>,<pos>,<poi>,<pol>,<poif>,<poil>. Since these are formatting tags, I need to capture:
any text outside of a tag, so that I can format it as "normal" text (I am going about this by capturing any text before a tag in one group while I capture the tag itself in another group; as I iterate through the occurrences I remove everything that has been captured from the original String, and if I have any text left over in the end I format that as "normal" text)
the "name" of the tag, so that I know how I will have to format the text inside the tag
the text contents of the tag, which will be formatted according to the tag name and its associated rules
Here is my sample code:
String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
String remainingText = currentText;
//first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
{
//an opening or closing tag has been found, so let us start our pattern captures
//I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher1 = pattern1.matcher(currentText);
int iteration = 0;
while(matcher1.find()){
System.out.print("Iteration ");
System.out.println(++iteration);
System.out.println("group1:"+matcher1.group(1));
System.out.println("group2:"+matcher1.group(2));
System.out.println("group3:"+matcher1.group(3));
System.out.println("group4:"+matcher1.group(4));
if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
{
m_xText.insertString(xTextRange, matcher1.group(1), false);
remainingText = remainingText.replaceFirst(matcher1.group(1), "");
}
if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
{
switch (matcher1.group(2)) {
case "pof": [...]
case "pos": [...]
case "poif": [...]
case "po": [...]
case "poi": [...]
case "pol": [...]
case "poil": [...]
case "sm": [...]
}
remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
}
}
The System.out.println is only outputting once in my console, with these results:
Iteration 1:
group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>;
group2:poil
group3:po
group4:for out of man this one has been taken.”
Group 3 is to be ignored, the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why is this only capturing the last tag instance "poil", while it is not capturing the preceding "pof", "poi", and "po" tags?
The output I would like to see would be like this:
Iteration 1:
group1:the man said:
group2:pof
group3:po
group4:“This one, at last, is bone of my bones
Iteration 2:
group1:
group2:poi
group3:po
group4:and flesh of my flesh;
Iteration 3:
group1:
group2:po
group3:po
group4:This one shall be called ‘woman,’
Iteration 4:
group1:
group2:poil
group3:po
group4:for out of man this one has been taken.”

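I just found the answer to this problem: it simply needed a non-greedy quantifier in the first capture group, just like the one I already had in the fourth capture group. With the greedy (.*), the first group swallowed everything up to the last opening tag, so only one match was found. This works exactly as needed: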
Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
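For anyone who wants to try the fix in isolation, here is a minimal sketch. The class name TagScanner and the helper scan are hypothetical, and the sample string is shortened. It also assumes two small changes to the pattern: the character classes are written without the | characters (inside [...] a | is matched literally, it does not mean "or"), and (sm|po) is made non-capturing, so the tag content ends up in group 3 instead of group 4. Text after the last closing tag is not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScanner {
    // Lazy (.*?) before the tag, so each find() stops at the first opening tag it meets.
    // \\2 is a backreference to the tag name captured by group 2.
    private static final Pattern TAGGED = Pattern.compile(
            "(.*?)<((?:sm|po)[flsi3]?[fl]?)>(.*?)</\\2>",
            Pattern.DOTALL | Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns one {leadingText, tagName, tagContent} triple per tagged segment. */
    static List<String[]> scan(String text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = TAGGED.matcher(text);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "the man said:<pof>bone of my bones</pof><poi>and flesh of my flesh;</poi>";
        for (String[] part : scan(sample)) {
            System.out.println("before=[" + part[0] + "] tag=" + part[1] + " content=[" + part[2] + "]");
        }
    }
}
```

Each find() call resumes where the previous match ended, so the leading-text group is empty for tags that directly follow another tag.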

Related

Regex to detect parentheses containing a weight

I'm having difficulties with getting my regex right.
I used this link, for detecting weight:
regex to get weight
This was the term to only find the weight, which worked:
([\d.]+)\s+(lbs?|oz|g|kg)
I wrote a Java method to color the dosage of medicaments on an HTML page. It should color all the text in parentheses if it contains at least one indication of weight (e.g. below 18: 5.5mg, over 18: 10mg).
Currently it sometimes colors the right part, but most of the time the regex matches too much or ignores a parenthesis that should be colored.
Current problem: the match also includes every word after the closing parenthesis until the end of the line.
Here is my current regex:
(\\(.[^\\(]*.\\d*\\,?\\d+)\\s?+(µg|mg|g|kg).*.\\)
Here the entire method:
private static String addDosageHighlight(String htmltext) {
String dosage = "";
Pattern pattern = Pattern.compile("(\\(.[^\\(]*.\\d*\\,?\\d+)\\s?+(µg|mg|g|kg).*.\\)");
Matcher matcher = pattern.matcher(htmltext);
// Check all occurrences
if (matcher.find()) {
dosage = matcher.group();
htmltext = htmltext.replace(dosage, "<span style=\"color:magenta;\">" + dosage +"</span>");
}
return htmltext;
}
Examples:
medicament b (under 18: 10 g, over 18: 15 g) works well
medicament c (sometimes 15g if needed) can help
(sometimes 10 g)
Those all get detected, but the coloring continues past the closing parenthesis until the end of the line. I couldn't manage to get a parenthesis that won't be colored, which should be good.

You didn't specify if you accept decimals, but from your regex, I assume you allow decimal numbers with a comma as decimal mark.
So, I believe that this regex will do what you are looking for:
"\\([^\\)]*\\d+(,\\d+)?\\s*(µg|mg|g|kg)[^\\)]*\\)"

In your regex, your .* is too greedy and wants to eat as many characters as possible. Instead you could use something like [^)]* which will attempt to match all characters which are not a ) symbol.
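As a rough sketch of how the suggestions above could be combined in the original method: the class name DosageHighlighter is made up, the groups are made non-capturing, and Matcher.replaceAll with "$0" (the whole match) is used so that every occurrence is colored, not only the first one. The unit list and the comma as decimal mark are taken from the answer above; treat them as assumptions about the data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DosageHighlighter {
    // A parenthesised text containing at least one number followed by a weight unit.
    // [^)]* cannot cross a closing parenthesis, so the match ends at the first ")".
    // \u00b5 is the micro sign used in the unit "µg".
    private static final Pattern DOSAGE = Pattern.compile(
            "\\([^)]*\\d+(?:,\\d+)?\\s*(?:\u00b5g|mg|g|kg)[^)]*\\)");

    static String addDosageHighlight(String htmlText) {
        Matcher matcher = DOSAGE.matcher(htmlText);
        // "$0" stands for the whole match; replaceAll handles every occurrence.
        return matcher.replaceAll("<span style=\"color:magenta;\">$0</span>");
    }

    public static void main(String[] args) {
        System.out.println(addDosageHighlight(
                "medicament b (under 18: 10 g, over 18: 15 g) works well"));
        System.out.println(addDosageHighlight(
                "medicament c (sometimes 15g if needed) can help"));
    }
}
```

Parentheses without a number followed by a unit are left untouched, and text after the closing parenthesis is no longer swallowed.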

Regex - Words containing multiple underscores

I'm making a Lexer, and have chosen to use Regex to split my tokens.
I'm working on all different tokens, except the one that really bugs me is words and identifiers.
You see, the rules I have in place are the following:
Words cannot start with or end with an underscore.
Words can be one or more characters in length.
Underscores can only be used between letters, and can appear multiple times.
Example of what I want:
_foo <- Invalid.
foo_ <- Invalid.
_foo_ <- Invalid.
foo_foo <- Valid.
foo_foo_foo <- Valid.
foo_foo_ <- Partially Valid. Only "foo_foo" should be picked up.
_foo_foo <- Partially Valid. Only "foo_foo" should be picked up.
I'm getting close, as this is what I currently have:
([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)
Except, it only detects the first occurrence of an underscore. I want all of them.
Personal Request:
I would rather the answer be contained inside of a single group, as I have structured my tokeniser around them, except I would be more than happy to change my design if you can think of a better way of handling it. This is what I currently use:
private void tokenise(String regex, String[] data) {
Set<String> tokens = new LinkedHashSet<String>();
Pattern pattern = Pattern.compile(regex);
// First pass. Uses regular expressions to split data and catalog token types.
for (String line : data) {
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
for (int i = 1; i < matcher.groupCount() + 1; i++) {
if (matcher.group(i) != null) {
switch(i) {
case (1):
// Example group.
// Normally I would structure like:
// 0: Identifiers
// 1: Strings
// 2-?: So on so forth.
tokens.add("FOO:" + matcher.group());
break;
}
}
}
}
}
}

Try ([a-zA-Z]+(?:_[a-zA-Z]+)*)
The first part of the pattern, [a-zA-Z]+, matches one or more letters.
The second part of the pattern, (?:_[a-zA-Z]+), matches an underscore if it is followed by one or more letters.
The * at the end means the second part can be repeated zero or more times.
The (?: ) is like plain (), but doesn't return the matched group.
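A small sketch to check the pattern against the examples from the question. The class name IdentifierFinder and the helper findWords are invented for illustration; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierFinder {
    // Letters, then zero or more repetitions of "underscore + letters".
    // Leading and trailing underscores can never be part of a match.
    private static final Pattern WORD = Pattern.compile("([a-zA-Z]+(?:_[a-zA-Z]+)*)");

    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            words.add(matcher.group(1)); // the whole word sits in the single capturing group
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("foo_foo_foo")); // one word with two underscores
        System.out.println(findWords("_foo_foo_"));   // only the inner "foo_foo" is picked up
    }
}
```

Because the repeated part is a non-capturing group, the complete word stays in group 1, which fits a tokeniser that dispatches on the group index.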

Matching and sorting a Bukkit ChatColor expression

I'm splitting up a String by spaces and then checking each piece if it contains a code (&a, &l, etc). If it matches, I have to grab the codes that are beside each other and then order them alphanumerically (0, 1, 2... a, b, c...).
Here is what I tried so far:
String message = "&l&aCheckpoint&m&6doreime";
String[] parts = message.split(" "); // This may not be needed for the example, but I'm only using one word for simplicity here
List<String> orderedMessage = new ArrayList<>();
Pattern pattern = Pattern.compile("((?:&|\u00a7)[0-9A-FK-ORa-fk-or])(.*?)"); // Completely matches the entire pattern, not what i want
for (String part : parts) {
if (pattern.matcher(part).matches()) {
List<String> orderedParts = new ArrayList<>();
// what do i do?
}
}
I need to change the pattern value so it matches groups like this:
Match: &l&aCheckpoint
Groups that I need: [&l, &a, Checkpoint]
Match: &m&6doreime
Groups that I need: [&m, &6, doreime]
How can I match each Match shown above and split it into the 3 groups, i.e. each code section (&[0-9A-FK-ORa-fk-or]) and the remaining text until another code section?
Info: For anyone who is wondering why, when you submit color/format coded text to Minecraft, colors have to come first, or the format ([a-fk-or]) codes are ignored because of how Minecraft has parsed color codes since 1.5. By sorting them and rebuilding the message, it won't rely on users or developers getting the order correct.

You can get what you are after by using a slightly more complicated regex
(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)
Breaking it down we have two important capturing groups
(((?:&|§)[0-9A-FK-ORa-fk-or])+)
This will capture one or more code sections, each consisting of an & followed by a character
([^&]*)
The second grabs any number of non-& characters, which gets you the remainder of that section. (This is slightly different behavior from the regex you provided; things get more complicated if & is a legal character in the string.)
Putting that regex into use with a Matcher you can do the following,
String input = "&l&aCheckpoint&m&6doreime";
Pattern pattern = Pattern.compile("(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)");
Matcher patternMatcher = pattern.matcher(input);
while(patternMatcher.find()){
String[] codes = patternMatcher.group(1).split("(?=&)");
String rest = patternMatcher.group(3);
}
Which will loop twice, giving you
codes = ["&l", "&a"]
rest = "Checkpoint"
on the first loop and the following on the second
codes = ["&m", "&6"]
rest = "doreime"
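To go one step further and actually reorder the codes, something like the following might work. It is only a sketch: the class name ColorCodeSorter and the method reorder are hypothetical, it assumes lowercase codes written with & (so that plain String ordering puts 0-9 and a-f before k-o and r), and it uses appendReplacement/appendTail so that text before the first code is kept.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorCodeSorter {
    // Group 1: a run of adjacent codes, group 3: the text up to the next "&".
    private static final Pattern SECTION = Pattern.compile(
            "(((?:&|\u00a7)[0-9A-FK-ORa-fk-or])+)([^&]*)");

    static String reorder(String input) {
        Matcher m = SECTION.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Split the run before each "&" without consuming it.
            String[] codes = m.group(1).split("(?=&)");
            // Assumption: lowercase codes, so natural ordering gives 0-9, a-f, k-o, r.
            Arrays.sort(codes);
            String rebuilt = String.join("", codes) + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reorder("&l&aCheckpoint&m&6doreime"));
    }
}
```

Mixed-case codes or codes written with § would need a custom comparator and a wider split pattern; that is left out here.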

Java, Regex, Nested optional groups

I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
word
word [01/01/1900, ]
word [, 01/01/2000]
word [01/01/1900, 01/01/2000]
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I came up with.
public class Parser {
public static void main(String[] args) {
Parser parser = new Parser();
String s = "word [01/01/1900, 01/01/2000]";
parser.parse(s);
}
public void parse(String s) {
String date = "\\d{2}/\\d{2}/\\d{4}";
String interval = "\\[("+date+")?, ("+date+")?\\]";
String keyword = "(.+)( "+interval+")?";
Pattern p = Pattern.compile(keyword);
Matcher m = p.matcher(s);
if (m.matches()) {
for (int i = 0; i <= m.groupCount(); ++i) {
System.out.println(i + ": " + m.group(i));
}
}
}
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-matching group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I want to match repeated keywords, i.e. the regex keyword(, keyword)*. I tried this out, but only the first and the last instance are captured.
For simplicity, suppose I want to match the following a, b, c, d with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve back the first and last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done. Capture group multiple times
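A repeated capturing group only keeps the text of its last iteration, which is why group 2 holds d. One common workaround, sketched here with the invented names RepeatedKeywords and keywords, is to validate the whole input once and then walk over the individual keywords with find():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedKeywords {
    // Validates the overall shape: keyword(, keyword)*
    private static final Pattern WHOLE = Pattern.compile("[a-z](?:, [a-z])*");
    // Picks out one keyword at a time.
    private static final Pattern KEYWORD = Pattern.compile("[a-z]");

    static List<String> keywords(String input) {
        List<String> found = new ArrayList<>();
        if (!WHOLE.matcher(input).matches()) {
            return found; // input does not have the expected shape
        }
        Matcher m = KEYWORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(keywords("a, b, c, d"));
    }
}
```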

Change the first part of keyword from (.+) to (.+?).
Without the ?, the (.+) is a greedy quantifier. That means it will try to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case, what it's doing is setting some counter N to the number of characters remaining in the source. If it can use up that many characters and get the whole regex to match, it will. Otherwise, it tries N-1, N-2, etc., until the entire regex matches. I also think it goes from left to right when trying this; that is, since (.+) is the leftmost "part" of the pattern (for some definition of "part"), it loops on that part before it tries any looping on parts that are to the right. Thus, it's more important to make (.+) greedy than to make any other part of the pattern greedy; the (.+) takes precedence.
In your case, since (.+) is followed by an optional part, the regex matcher starts by trying the entire remainder of the string--and it succeeds, because the rest of the string, which is empty, is a fine match for an optional substring. That should also explain why the problem goes away if your substring isn't optional--the empty substring no longer matches, so the matcher has to back off.
Adding ? makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts by seeing if it can make a match with the fewest characters allowed (one, for .+?), then 2, 3, ..., instead of starting with N and going downward. So when it gets up to 4, matching "word", and it finds that the rest of the string matches your optional part, it completes and gives the results you were expecting.
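Here is a short sketch of the question's parser with that change applied. The class name IntervalParser is made up, and the interval is wrapped in a non-capturing group so that the dates are groups 2 and 3. Note that this relies on matches(), which forces the whole input to be consumed; with find(), the reluctant (.+?) would stop after a single character.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntervalParser {
    private static final String DATE = "\\d{2}/\\d{2}/\\d{4}";
    // Reluctant (.+?) for the keyword, then an optional " [date?, date?]" part.
    private static final Pattern KEYWORD = Pattern.compile(
            "(.+?)(?: \\[(" + DATE + ")?, (" + DATE + ")?\\])?");

    /** Returns {keyword, from, to}; missing dates are null. Returns null if nothing matches. */
    static String[] parse(String s) {
        Matcher m = KEYWORD.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("word [01/01/1900, 01/01/2000]")));
        System.out.println(Arrays.toString(parse("word [, 01/01/2000]")));
        System.out.println(Arrays.toString(parse("word")));
    }
}
```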

Can't retrieve data from matched * group in Java

I'm having trouble figuring out the proper regex.
Here is some sample code:
@Test
public void testFindEasyNaked() {
System.out.println("Naked_find");
String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
System.out.println(nakedPat);
Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE + Pattern.DOTALL );
Matcher m = naked.matcher(arg);
if (m.find()) {
System.out.println("found naked");
for (int i = 0; i <= m.groupCount(); i++) {
System.out.printf("%d: %s\n", i, m.group(i));
}
} else {
System.out.println("can't find naked either");
}
System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat

Even if using an XML parser would be sound, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the attributes in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \\s* ("zero or more whitespace characters") followed by [\\S&&[^>]] ("one non-whitespace character that is not >"). It means that each iteration of that group matches zero or more spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. Every time the regular expression engine matches it, it assigns the value to group 3, which means the values matched previously are lost. That's why you end up with the last character of the tag, that is ".
You can improve it a bit by adding a + after [\\S&&[^>]], so that it matches a complete sequence of non-spaces, but you would still only obtain the last tag attribute in your group. You should instead use a better and simpler way:
Your goal being to pull out seq="002" image="D000001" in a group, what you should do is simply match the sequence of all characters that are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq=\"002\" image=\"D000001\"
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6
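For a quick local check without the online tester, a sketch along these lines could be used. The class name TagAttributeDemo and the method extract are hypothetical. One detail to be aware of: since [^>]* is greedy, it also takes the space before >, so the sketch trims group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAttributeDemo {
    // Group 1: tag name, group 2: everything up to ">", group 3: the element body.
    private static final Pattern TAG = Pattern.compile(
            "<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>", Pattern.DOTALL);

    /** Returns {tagName, attributes, body}, or null if the tag is not found. */
    static String[] extract(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return null;
        }
        // trim() drops the trailing space that the greedy [^>]* picks up before ">".
        return new String[] { m.group(1), m.group(2).trim(), m.group(3) };
    }

    public static void main(String[] args) {
        String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
        for (String part : extract(arg)) {
            System.out.println(part);
        }
    }
}
```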
