Java, Regex, Nested optional groups

I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
word
word [01/01/1900, ]
word [, 01/01/2000]
word [01/01/1900, 01/01/2000]
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I've come up with.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Parser {
public static void main(String[] args) {
Parser parser = new Parser();
String s = "word [01/01/1900, 01/01/2000]";
parser.parse(s);
}
public void parse(String s) {
String date = "\\d{2}/\\d{2}/\\d{4}";
String interval = "\\[("+date+")?, ("+date+")?\\]";
String keyword = "(.+)( "+interval+")?";
Pattern p = Pattern.compile(keyword);
Matcher m = p.matcher(s);
if (m.matches()) {
for (int i = 0; i <= m.groupCount(); ++i) {
System.out.println(i + ": " + m.group(i));
}
}
}
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-capturing group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I want to match repeated keywords, i.e. the regex keyword(, keyword)*. I tried this out, but only the first and the last instances are captured.
For simplicity, suppose I want to match the following a, b, c, d with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve back the first and last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done; see the question "Capture group multiple times".
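A repeated capturing group keeps only its last iteration, so a common workaround is to validate the whole list with one pattern and then collect the items with a find() loop. Here is a minimal sketch of that idea, assuming single lowercase-letter keywords as in the example above; the class, field, and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Validates the whole list; group 2 only ever holds its last iteration.
    static final Pattern LIST = Pattern.compile("([a-z])(?:, ([a-z]))*");
    // Matches a single item; used with find() to collect every item.
    static final Pattern ITEM = Pattern.compile("[a-z]");

    static List<String> items(String s) {
        List<String> result = new ArrayList<>();
        if (!LIST.matcher(s).matches()) {
            return result; // not a valid list, nothing to collect
        }
        Matcher m = ITEM.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(items("a, b, c, d")); // [a, b, c, d]
    }
}
```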

Change the first part of keyword from (.+) to (.+?).
Without the ?, the (.+) is a greedy quantifier. That means it will try to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case, what it's doing is setting some counter N to the number of characters remaining in the source. If it can use up that many characters and get the whole regex to match, it will. Otherwise, it tries N-1, N-2, etc., until the entire regex matches. I also think it goes from left to right when trying this; that is, since (.+) is the leftmost "part" of the pattern (for some definition of "part"), it loops on that part before it tries any looping on parts that are to the right. Thus, it's more important to make (.+) greedy than to make any other part of the pattern greedy; the (.+) takes precedence.
In your case, since (.+) is followed by an optional part, the regex matcher starts by trying the entire remainder of the string--and it succeeds, because the rest of the string, which is empty, is a fine match for an optional substring. That also explains why your pattern works when the interval isn't optional: the empty substring no longer matches, so the engine has to backtrack until the interval fits.
Adding ? makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts by seeing if it can make a match with 1 character (the minimum for +), then 2, 3, ..., instead of starting with N and going downward. So when it gets up to 4, matching "word", and it finds that the rest of the string (the space and the interval) matches your optional part, it completes and gives the results you were expecting.
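To see the effect on all four sample inputs, here is a small sketch that reuses the question's pattern with (.+?) in place of (.+). The class name and the parse helper returning an array are my own choices for illustration, not part of the original code.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantKeywordDemo {
    static final String DATE = "\\d{2}/\\d{2}/\\d{4}";
    static final String INTERVAL = "\\[(" + DATE + ")?, (" + DATE + ")?\\]";
    // (.+?) is reluctant, so the optional interval gets a chance to match.
    static final Pattern KEYWORD = Pattern.compile("(.+?)( " + INTERVAL + ")?");

    // Returns {keyword, firstDate, secondDate}; missing dates are null.
    static String[] parse(String s) {
        Matcher m = KEYWORD.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "word",
            "word [01/01/1900, ]",
            "word [, 01/01/2000]",
            "word [01/01/1900, 01/01/2000]"
        };
        for (String s : samples) {
            System.out.println(Arrays.toString(parse(s)));
        }
    }
}
```

Groups that did not participate in the match come back as null, which is what the question asked for.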

Related

How regex lookaround works when used alone

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args){
Pattern a = Pattern.compile("(?=\\.)|(?<=\\.)");
Matcher b = a.matcher(".");
while (b.find()) System.out.print("+");
}
}
I've been reading the lookaround section on Regular-Expressions.info and trying to figure out how it works, and I'm stuck on this. When I run the code above the result is ++, which I don't understand: "." is the only token to match the pattern against, and apparently there's nothing behind or ahead of the ".", so how can it match twice?

As the regex engine advances through the input, it considers both characters and positions before and after characters as distinct positions within the input.
Your input has 3 positions:
1. Just before the first character
2. The first character
3. Just after the first character
Position 1 matches (?=\\.).
Position 3 matches (?<=\\.).
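One way to check this is to print where each zero-length match starts. The sketch below does that; the class and method names are hypothetical. In Java's indexing, "just before the first character" is index 0 and "just after" is index 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundPositionsDemo {
    // Collects the start index of every match that find() reports.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("(?=\\.)", "."));           // [0]
        System.out.println(matchStarts("(?<=\\.)", "."));          // [1]
        System.out.println(matchStarts("(?=\\.)|(?<=\\.)", "."));  // [0, 1]
    }
}
```

Both matches are empty strings: the lookahead succeeds before the dot, and the lookbehind succeeds after it.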

Recursive group capturing regex with backreference in JAVA

I am trying to capture multiple groups recursively in a string using also a backreference to a group within the regex. Even though I am using a Pattern and a Matcher and a "while(matcher.find())" loop, it is still only capturing the last instance instead of all the instances. In my case the only possible tags are <sm>,<po>,<pof>,<pos>,<poi>,<pol>,<poif>,<poil>. Since these are formatting tags, I need to capture:
any text outside of a tag (so that I can format it as "normal" text, and I am going about this by capturing any text before a tag in one group while I capture the tag itself in another group, and as I iterate through the occurrences I remove everything that has been captured from the original String; if I have any text left over in the end I format that as "normal" text)
the "name" of the tag so that I know how I will have to format the text inside the tag
the text contents of the tag that will be formatted accordingly to the tag name and its associated rules
Here is my sample code:
String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
String remainingText = currentText;
//first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
{
//an opening or closing tag has been found, so let us start our pattern captures
//I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher1 = pattern1.matcher(currentText);
int iteration = 0;
while(matcher1.find()){
System.out.print("Iteration ");
System.out.println(++iteration);
System.out.println("group1:"+matcher1.group(1));
System.out.println("group2:"+matcher1.group(2));
System.out.println("group3:"+matcher1.group(3));
System.out.println("group4:"+matcher1.group(4));
if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
{
m_xText.insertString(xTextRange, matcher1.group(1), false);
remainingText = remainingText.replaceFirst(matcher1.group(1), "");
}
if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
{
switch (matcher1.group(2)) {
case "pof": [...]
case "pos": [...]
case "poif": [...]
case "po": [...]
case "poi": [...]
case "pol": [...]
case "poil": [...]
case "sm": [...]
}
remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
}
}
The System.out.println is only outputting once in my console, with these results:
Iteration 1:
group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>;
group2:poil
group3:po
group4:for out of man this one has been taken.”
Group 3 is to be ignored, the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why is this only capturing the last tag instance "poil", while it is not capturing the preceding "pof", "poi", and "po" tags?
The output I would like to see would be like this:
Iteration 1:
group1:the man said:
group2:pof
group3:po
group4:“This one, at last, is bone of my bones
Iteration 2:
group1:
group2:poi
group3:po
group4:and flesh of my flesh;
Iteration 3:
group1:
group2:po
group3:po
group4:This one shall be called ‘woman,’
Iteration 4:
group1:
group2:poil
group3:po
group4:for out of man this one has been taken.”

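I just found the answer to this problem: it simply needed a non-greedy quantifier in the first capture group, just like I had in the fourth capture group. With a greedy (.*), the first group swallows everything up to the last tag that still lets the pattern match, so find() only succeeds once. This is working exactly as needed: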
Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
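For reference, here is a stripped-down sketch of the find() loop with the corrected pattern. It uses shortened placeholder text instead of the real passage, and the class and method names are invented for the example; it only collects the captured groups rather than formatting anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCaptureDemo {
    // Same pattern as above: reluctant (.*?) in both group 1 and group 4.
    static final Pattern TAG = Pattern.compile(
        "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
        Pattern.UNICODE_CHARACTER_CLASS);

    // Returns one "textBefore|tagName|tagContent" entry per tag found.
    static List<String> captures(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            out.add(m.group(1) + "|" + m.group(2) + "|" + m.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "the man said:<pof>A</pof><poi>B</poi><po>C</po><poil>D</poil>";
        for (String c : captures(text)) {
            System.out.println(c);
        }
        // the man said:|pof|A
        // |poi|B
        // |po|C
        // |poil|D
    }
}
```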

Can't retrieve data from matched * group in Java

I'm having trouble figuring out the proper regex.
Here is some sample code:
@Test
public void testFindEasyNaked() {
System.out.println("Naked_find");
String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
System.out.println(nakedPat);
Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE + Pattern.DOTALL );
Matcher m = naked.matcher(arg);
if (m.find()) {
System.out.println("found naked");
for (int i = 0; i <= m.groupCount(); i++) {
System.out.printf("%d: %s\n", i, m.group(i));
}
} else {
System.out.println("can't find naked either");
}
System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat

Even if using an XML parser would be sound, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the parameters in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \\s* ("zero or more whitespace characters") followed by [\\S&&[^>]] ("one non-whitespace character that is not >"). This means each iteration of that group matches zero or more spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. Every time the regular expression engine repeats the group, it assigns the new value to group 3, so the values matched previously are lost. That's why you end up with the last character of the tag, that is ".
You can improve it a bit by adding a + after [\\S&&[^>]], so it will match at least a complete sequence of non-spaces, but you would only obtain the last tag attribute in your group. You should instead use a better and simpler way:
Since your goal is to pull out seq="002" image="D000001" in one group, simply match the sequence of all characters that are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq="002" image="D000001" (with a trailing space, because the greedy [^>]* also consumes the space before >)
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6
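As a quick check in Java, the sketch below runs the simplified pattern against the string from the question. The class and method names are placeholders, and calling trim() on group 2 is my own addition to drop the trailing space that [^>]* picks up before the closing >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeCaptureDemo {
    // One group for the tag name, one for everything up to '>', one for the body.
    static final Pattern GRANT = Pattern.compile(
        "<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>", Pattern.DOTALL);

    // Returns {tagName, attributes (trimmed), body}, or null if nothing matches.
    static String[] groups(String input) {
        Matcher m = GRANT.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim(), m.group(3) };
    }

    public static void main(String[] args) {
        String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >"
            + "foo<name>Fred</name></us-patent-grant> extra stuff";
        for (String g : groups(arg)) {
            System.out.println(g);
        }
    }
}
```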

pattern search using regex in java

public static void main(String args[]) {
Pattern p = Pattern.compile("ab"); // Case 1
// Pattern p = Pattern.compile("bab"); // Case 2 (swap with the line above)
Matcher m = p.matcher("abababa");
while(m.find()){
System.out.print(m.start());
}
}
When I use Case 1, the output is 024 as expected. But when I use Case 2, the output is 1, whereas I expected 13. Can anyone explain whether there is some special rule in regex that causes this output and, if not, why I'm getting it?
Help appreciated !!
Note : Case 1 and Case 2 are independently used.

The match consumes the input, so the next match is searched for after the end of the previous match.
The position of the "bab" matcher's pointer before each search would be:
|abababa
abab|aba

For Case 2:
It's because, after it finds bab, it won't reconsider characters that were already consumed (the b at index 3 in this case), so you get only 1.
Input: abababa
Search for bab:
it finds a match starting at index 1 and ending at index 3; the next search then starts at index 4 (aba).
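If overlapping matches are what you want, one option (not mentioned above, so treat it as a suggestion) is to wrap the pattern in a lookahead. A lookahead consumes nothing, so find() moves forward one position at a time. A small sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatchDemo {
    // Collects the start index of every match that find() reports.
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starts("ab", "abababa"));      // [0, 2, 4]
        System.out.println(starts("bab", "abababa"));     // [1]  (matches consume input)
        System.out.println(starts("(?=bab)", "abababa")); // [1, 3]  (zero-length, overlapping)
    }
}
```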

RegEx to split camelCase or TitleCase (advanced)

I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.
(?<!^)(?=[A-Z])
It works as expected:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
For example with Java:
String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words now equals new String[]{"lorem", "Ipsum"}
My problem is that it does not work in some cases:
Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
To my mind, the result should be:
Case 1: VALUE
Case 2: eclipse / RCP / Ext
In other words, given n uppercase chars:
if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
if the n chars are at the end, the group should be: (n chars).
Any idea on how to improve this regex?

The following regex works for all of the above examples:
public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}
It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".
The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
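To make the role of each clause concrete, this sketch splits with the first clause alone and then with both clauses combined. The constant and class names are mine, chosen only for the illustration.

```java
import java.util.Arrays;

class CamelSplitClausesDemo {
    // Clause 1: before a capital that is not at the start and not preceded by a capital.
    static final String FIRST = "(?<!(^|[A-Z]))(?=[A-Z])";
    // Clause 2: before a capital followed by a lowercase letter, except at the start.
    static final String SECOND = "(?<!^)(?=[A-Z][a-z])";
    static final String BOTH = FIRST + "|" + SECOND;

    public static void main(String[] args) {
        System.out.println(Arrays.toString("eclipseRCPExt".split(FIRST))); // [eclipse, RCPExt]
        System.out.println(Arrays.toString("eclipseRCPExt".split(BOTH)));  // [eclipse, RCP, Ext]
        System.out.println(Arrays.toString("VALUE".split(BOTH)));          // [VALUE]
    }
}
```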

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:
(?<=[a-z])(?=[A-Z])
Here is how this regex splits your example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt
The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.
Addendum - Improved version
Note: This answer recently got an upvote and I realized that there is a better way...
By adding a second alternative to the above regex, all of the OP's test cases are correctly split.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
Here is how the improved regex splits the example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext
Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
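Since the question uses Java, here is a short sketch that applies the improved regex through String.split and joins the parts with " / " to mirror the table above. The class and helper names are assumptions made for the example.

```java
class CamelCaseSplitDemo {
    // Split between lower and upper, or between an upper and an upper+lower pair.
    static final String SPLIT = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static String split(String s) {
        return String.join(" / ", s.split(SPLIT));
    }

    public static void main(String[] args) {
        String[] samples = { "value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt" };
        for (String s : samples) {
            System.out.println(s + " -> " + split(s));
        }
    }
}
```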

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:
((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
and here's an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)
Here I'm separating each word with a space, so here are some examples of how the string is transformed:
ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE
This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:
((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
and an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
; (^[a-z]+) Match against any lower-case letters at the start of the command.
; ([0-9]+) Match against one or more consecutive numbers (anywhere in the string, including at the start).
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)
And here are some examples of how a string with numbers is transformed with this regex:
myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too

To handle more letters than just A-Z:
s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");
Either:
Split after any lowercase letter, that is followed by uppercase letter.
E.g. parseXML -> parse, XML.
or
Split after any letter, that is followed by upper case letter and lowercase letter.
E.g. XMLParser -> XML, Parser.
In more readable form:
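import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals; // assumes JUnit 4
import java.util.regex.Pattern;
import org.junit.Test; // assumes JUnit 4
public class SplitCamelCaseTest {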
static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";
static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
);
public static String splitCamelCase(String s) {
return SPLIT_CAMEL_CASE.splitAsStream(s)
.collect(joining(" "));
}
@Test
public void testSplitCamelCase() {
assertEquals("Camel Case", splitCamelCase("CamelCase"));
assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
assertEquals("XML Parser", splitCamelCase("XMLParser"));
assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
assertEquals("VALUE", splitCamelCase("VALUE"));
}
}

Brief
Both top answers here provide code using positive lookbehinds, which are not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.
Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.
Code
See this regex in use here
([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)
Results
Sample Input
eclipseRCPExt
SomethingIsWrittenHere
TEXTIsWrittenHERE
VALUE
loremIpsum
Sample Output
eclipse
RCP
Ext
Something
Is
Written
Here
TEXT
Is
Written
HERE
VALUE
lorem
Ipsum
Explanation
Match one or more uppercase alpha character [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b
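In Java this regex would be used with a find() loop rather than split(), because it matches the words themselves instead of the gaps between them. A minimal sketch, with class and method names invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseComponentsDemo {
    // Matches each word; the lookahead keeps the next word's capital out of the match.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```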

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

You can use the expression below for Java:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)

Instead of looking for separators that aren't there you might also consider finding the name components (those are certainly there):
String test = "_eclipse福福RCPExt";
Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);
Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between
// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}
if (endOfLastMatch != test.length()) {
// take the rest of the string; start() would throw here because the last find() failed
String end = test.substring(endOfLastMatch);
components.add(end);
}
System.out.println(components);
This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.

I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.
I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).
This is able to split strings such as:
DrivingB2BTradeIn2019Onwards
to
Driving B2B Trade In 2019 Onwards

A JavaScript Solution
/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};
