How does \G work in .split? - java

I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which means completing a certain challenge in as few bytes as possible. In one of my answers I had the following piece of code:
for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";"))
Which basically loops over the 2-char Strings after we converted it into a String-array with .split. Someone suggested I could golf it to this instead to save 4 bytes:
for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))
The functionality is still the same. It loops over the 2-char Strings.
However, neither of us was 100% sure how this works, hence this question.
What I know:
I know .split("(?<= ... )") is used to split, but keep the trailing delimiter.
There is also a way to keep a leading delimiter, or delimiter as separated item:
"a;b;c;d".split("(?<=;)") // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)") // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))") // Results in ["a", ";", "b", ";", "c", ";", "d"]
I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.
int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
    count++;
System.out.println(count); // Results in 5
count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
    count++;
System.out.println(count); // Results in 3
But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?
Here is a "Try it online"-link for all code-snippets described above to see them in action.

how does .split("(?<=\\G..)") work
(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.
So let's walk through ABCDEF:
\G matches beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: That is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
\G marks the location just after AB so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
Same again: \G marks the location just after CD so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
Create an array with all of the pieces except the empty one at the end (because this is split with an implicit limit of 0, which discards empty strings at the end).
Result { "AB", "CD", "EF" }.
And why does .split("(?=\\G..)") not work?
Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
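To see the two cases side by side, here is a minimal sketch. The class and method names are made up for illustration, and the lookahead result assumes Java 8 or later, where a zero-width match at index 0 does not produce a leading empty string.

```java
import java.util.Arrays;

class GSplitDemo {
    // Lookbehind: split wherever the previous match end (\G) plus two characters lies behind us.
    static String[] chunkWithLookbehind(String s) {
        return s.split("(?<=\\G..)");
    }

    // Lookahead: \G can only be satisfied at index 0, so no useful split position is found.
    static String[] chunkWithLookahead(String s) {
        return s.split("(?=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(chunkWithLookbehind("ABCDEF"))); // [AB, CD, EF]
        System.out.println(Arrays.toString(chunkWithLookahead("ABCDEF")));  // [ABCDEF]
    }
}
```

With the lookahead, the only position where \G holds is the very start of the input, and a zero-width split there changes nothing, so the whole string comes back as a single element.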

First off, the definition of \G: it's an anchor that matches at the beginning of the string or at the end of the previous match. It's a position: it neither consumes a character nor changes the cursor position. Alan Moore wrote in an earlier answer that the behavior of \G inside lookbehinds is engine-specific. This splits at the desired length in Java but doesn't produce the same result in PCRE.
So how does \G in (?<=\G..) work? Look at below step-by-step demonstration of where dot and \G match:
↓A4
\G..↓B8
\G..↓CU
\G..
.
.
\G matches the beginning of the input string, then the dots match A and 4 in order. The engine continues traversing and stops right between 8 and C. Here the lookbehind matches:
A 4 B 8
\G . . (?<=\G..)
\G matches where the previous dots ended matching, i.e. the position right after 4 and before B. This process continues to the end of the input string. It splits the string into units of 2 (each unit being a character here). It shouldn't be relied on for multi-line input strings: it either splits only partially, since the dot . doesn't match a newline character, or doesn't split at all, since \G doesn't match the start of a line (only the start of the input string).
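The multi-line remark can be checked with a small sketch. The class name and sample input are hypothetical, and the (?s) variant is not part of the original answer; it is shown only for contrast.

```java
import java.util.Arrays;

class GSplitNewlineDemo {
    // Default mode: the dot does not match '\n', so the chain of \G matches breaks at the newline.
    static String[] chunk(String s) {
        return s.split("(?<=\\G..)");
    }

    // DOTALL mode (?s): the dot also matches '\n', so chunking continues across the line break.
    static String[] chunkDotAll(String s) {
        return s.split("(?s)(?<=\\G..)");
    }

    public static void main(String[] args) {
        String input = "AB\nCDEF";
        System.out.println(chunk(input).length);       // 2 pieces: "AB" and "\nCDEF"
        System.out.println(chunkDotAll(input).length); // 4 pieces: "AB", "\nC", "DE", "F"
    }
}
```

Once one attempt fails at the newline, \G stays pinned at the last successful split, so no later position can satisfy the lookbehind.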
And why does .split("(?=\\G..)") not work?
Because of a lookahead's nature, which is to look forward, there is no possibility for it to meet the position where the previous match ended. It just keeps walking to the end.

Related

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number with a period, but the problem is that it is only a substring, so matching 1. Must have experience in Java 2. Team leader with \d*\.$ returns false.
Code
See regex in use here
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring that either of the following matches
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
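In Java, the pattern and replacement above could be applied roughly as follows. This is a sketch: the class and method names are invented for illustration, and the backslashes are doubled because the regex lives in a Java string literal.

```java
class ListItemDemo {
    // Wraps each "N. text" item in <li>...</li>, one item per line.
    static String toListItems(String s) {
        return s.replaceAll("\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)", "<li>$1</li>\n");
    }

    public static void main(String[] args) {
        String input = "1. Must have experience in Java 2. Team leader...";
        System.out.print(toListItems(input));
        // <li>Must have experience in Java</li>
        // <li>Team leader...</li>
    }
}
```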
But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need $ at the end of your regex.
You have to escape the dot . because it means "any character" in regex.
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more.
We want
<ol>
<li>one</li>
<li>two</li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>"); // lazy .*? so each item ends at its own first period
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit at a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it's going to split on each of them individually. You can match more than one digit at a time in several ways; here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
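As a quick comparison, here is a sketch of how the single-digit and one-or-more-digits patterns behave in split. The class and method names are hypothetical.

```java
import java.util.Arrays;

class DigitSplitDemo {
    // Splits on every single digit: adjacent digits leave an empty string between them.
    static String[] splitOnEachDigit(String s) {
        return s.split("\\d");
    }

    // Splits on runs of digits: the whole run "12" is one delimiter.
    static String[] splitOnDigitRuns(String s) {
        return s.split("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnEachDigit("a12ij"))); // [a, , ij]
        System.out.println(Arrays.toString(splitOnDigitRuns("a12ij"))); // [a, ij]
    }
}
```

Neither pattern can match the empty string, so unlike \d? they never split between i and j.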
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed by the greedy quantifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So the matches found are the substrings marked by the x's below, where an x under a blank means the match is an empty string:
  a   1   2   i   j
x     x   x x   x   x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
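The sequence of find() results described above can be reproduced with a short sketch. The class name, method name, and output format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTraceDemo {
    // Records every match as "start-end:'text'" in the order find() reports them.
    static List<String> trace(String input, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":'" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trace("a12ij", "\\d?"));
        // [0-0:'', 1-2:'1', 2-3:'2', 3-3:'', 4-4:'', 5-5:'']
    }
}
```

The six matches line up with the list above: an empty match before a, the two digits, and empty matches before i, before j, and at the end of the string.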

Regex to find all possible occurrences of text starting and ending with ~

I would like to find all possible occurrences of text enclosed between two ~s.
For example: For the text ~*_abc~xyz~ ~123~, I want the following expressions as matching patterns:
~*_abc~
~xyz~
~123~
Note it can be an alphabet or a digit.
I tried with the regex ~[\w]+?~ but it is not giving me ~xyz~. I want ~ to be reconsidered. But I don't want just ~~ as a possible match.
Use capturing inside a positive lookahead with the following regex:
Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:
(?=(\w+))
At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
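A minimal Java sketch of this capture-inside-lookahead technique, using the ABCD example from the quoted text (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Collects group 1 of every zero-width lookahead match, so results may overlap.
    static List<String> suffixWords(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(\\w+))").matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixWords("ABCD")); // [ABCD, BCD, CD, D]
    }
}
```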
So, use
(?=(~[^\s~]+~))
See the regex demo
The pattern (?=(~[^\s~]+~)) checks each position inside a string and searches for ~ followed with 1+ characters other than whitespace and ~ and then followed with another ~. Since the index is moved only after a position is checked, and not when the value is captured, overlapping substrings get extracted.
Java demo:
String text = " ~*_abc~xyz~ ~123~";
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
    res.add(m.group(1));
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]
Just in case someone needs a Python demo:
import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']
Try this [^~\s]*
This pattern doesn't consider the characters ~ and space (referred to as \s).
I have tested it, it works on your string, here's the demo.

How to make Java String split greedy with lookahead?

Code is basically:
String[] result = "T&&T&T".split("(?=\\w|&+)");
I was expecting the lookahead to be greedy but instead it is returning the array:
T, &, &, T, &, T
What I am aiming for is:
T, &&, T, &, T
Is this possible for split and lookahead?
I have tried the following split regex values but the result is still not greedy for the ampersand:
"(?=\\w|&&?)"
"(?=\\w|&{1,2})"
It is already greedy, but I think you are misunderstanding how your split is working. The problem is that you are thinking of the characters but not the space between them (this is one of the places where regexes can get away from you).
You are asking to split at the places in the string where the next character is either a word character or a series of ampersands. In your string, let's mark the places that satisfy that:
T|&|&|T|&|T
In the space between the first T and the first ampersand, the next character is an ampersand (matches (?=&) which is valid in your regex), the space between the two ampersands also matches for this same reason. The space between the ampersands and the second T also matches (matches (?=\w)), and so on.
The split function will test each space in the string to determine if it is a candidate for a split position. To do what you want, you have to be careful about using the lookahead, so that it doesn't allow splits in the middle of a string of ampersands.
There are multiple ways you may overcome this; Wiktor Stribiżew provides a suggestion that works in his comment.
Usually using a look-behind to check that you are not repeating an undesired character will work, or if possible you can use a look-behind to identify the matching places, and a look-ahead to avoid the undesired repetitions. For example, if we wish to split at all characters keeping repeated characters together, you could do (?<=(.))(?!\\1) which splits your example as T, &&, T, &, T.
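As a sketch of the look-behind plus negative look-ahead idea, the following splits wherever the next character differs from the previous one. The class name, method name, and the second sample string are assumptions for illustration.

```java
import java.util.Arrays;

class RunSplitDemo {
    // Split after a character (captured in group 1) unless the same character follows.
    static String[] splitIntoRuns(String s) {
        return s.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoRuns("T&&T&T"))); // [T, &&, T, &, T]
        System.out.println(Arrays.toString(splitIntoRuns("aaabbc"))); // [aaa, bb, c]
    }
}
```

The pattern also matches at the very end of the input, but split drops the resulting trailing empty string.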
Lookarounds cannot be greedy or reluctant, they just check if the adjoining text to the left (lookbehind) and to the right (lookahead) matches the lookaround subpattern. If there is a match, and the lookaround is positive, the empty location is matched. If the lookaround is not anchored, each location in string is tested against the pattern in the lookaround, even the beginning and end. See this screenshot showing that (with your (?=\w|&&?)):
Since the lookaround is a zero-width assertion and it does not consume characters, all locations (before each character and at the end) are tested. Thus, you get matches between each character.
The (?=\w|&&?) checks the first location before T: it gets matched with \w, so this location is matched (see the first |). Then comes the next location, after the first T and before the &. It is matched as it is followed with &&. Then the regex engine goes on to check the location between the first & and the second &. It is matched as there is a & after it. This way, we match up to the end. The end location is not matched as it is not followed with & or a word character.
You may restrict the pattern inside a lookaround with another lookaround to avoid matching specific locations inside the input string.
(?=\w|(?<!&)&)
      ^^^^^^
The (?<!&)& pattern will match a & that is not preceded with another &. See the regex demo.
IDEONE demo:
String[] result = "T&&T&T".split("(?=\\w|(?<!&)&)");
System.out.println(Arrays.toString(result));
// => [T, &&, T, &, T]
The lookaround solution is a generic one. If we are to consider the current case, you can surely "shorten" the pattern to \b (which will also find a match at the end of the string, though Java String#split will safely remove trailing empty elements from the resulting array) that matches all locations between a non-word and a word character and also at the start/end of the string if there is a word character at its start/end. This won't work if the alternatives (like \w and & in your regex) belong to the same type (say, both are word characters).
How about this:
"(?=\\w)|(?<=\\w)"
or allowing repeat of T:
"(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w)"
or the best form here
It looks like you want to split between different chars, so generally:
String[] parts = input.split("(?<=T)(?=&)|(?<=&)(?=T)");
But in this case, you can split on word boundaries except at start/end:
String[] parts = input.split("(?<=.)\\b(?=.)");
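Both variants can be tried with a small sketch. The class and method names are hypothetical. Note that in a Java string literal the word boundary has to be written as \\b, since a single \b is the backspace escape.

```java
import java.util.Arrays;

class BoundarySplitDemo {
    // Split only where a T meets an & or an & meets a T.
    static String[] splitBetweenKinds(String s) {
        return s.split("(?<=T)(?=&)|(?<=&)(?=T)");
    }

    // Split on word boundaries that have a character on both sides (not at start/end).
    static String[] splitOnInnerBoundaries(String s) {
        return s.split("(?<=.)\\b(?=.)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBetweenKinds("T&&T&T")));      // [T, &&, T, &, T]
        System.out.println(Arrays.toString(splitOnInnerBoundaries("T&&T&T"))); // [T, &&, T, &, T]
    }
}
```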

validate string in java

I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
Well, your regular expression is certainly garbled: there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape ,s. Let's first describe our requirements; you seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us say match the whole sub-expression six times, without it, the {6} would only apply to the preceding character.
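A short sketch of using this pattern for validation. The class and method names are made up, and String#matches is used here because it requires the whole string to match, so no explicit ^ and $ anchors are needed.

```java
class SixFieldsDemo {
    // True when the input is exactly six non-empty fields, each followed by a comma.
    static boolean isValid(String s) {
        return s.matches("(?:[^,]+,){6}");
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,")); // true
        System.out.println(isValid("a,b,c,"));                                        // false: only three fields
        System.out.println(isValid("a,,b,c,d,e,f,"));                                 // false: contains an empty field
    }
}
```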
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
    "([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if(m.matches()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F
