What Java regular expression do I need to match this text? - java

I'm trying to match the following using a regular expression in Java - I have some data separated by the two characters 'ZZ'. Each record starts with 'ZZ' and finishes with 'ZZ' - I want to match a record with no ending 'ZZ' for example, I want to match the trailing 'ZZanychars' below (Note: the *'s are not included in the string - they're just marking the bit I want to match).
ZZanycharsZZZZanycharsZZZZanychars
But I don't want the following to match because the record has ended:
ZZanycharsZZZZanycharsZZZZanycharsZZ
EDIT: To clarify things - here are the 2 testcases I am using:
// This should match and in one of the groups should be 'ZZthree'
String testString1 = "ZZoneZZZZtwoZZZZthree";
// This should not match
String testString2 = "ZZoneZZZZtwoZZZZthreeZZ";
EDIT: Adding a third test:
// This should match and in one of the groups should be 'threeZee'
String testString3 = "ZZoneZZZZtwoZZZZthreeZee";

(Edited after the post of the 3rd example)
Try:
(?!ZZZ)ZZ((?!ZZ).)++$
Demo:
import java.util.regex.*;
public class Main {
public static void main(String[] args) {
String[] tests = {
"ZZoneZZZZtwoZZZZthree",
"ZZoneZZZZtwoZZZZthreeZZ",
"ZZoneZZZZtwoZZZZthreeZee"
};
Pattern p = Pattern.compile("(?!ZZZ)ZZ((?!ZZ).)++$");
for(String tst : tests) {
Matcher m = p.matcher(tst);
System.out.println(tst+" -> "+(m.find() ? m.group() : "no!"));
}
}
}

To match only the final, unterminated record:
(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$
The starting delimiter is two Z's, but there can be a third Z that's considered part of the data. The lookbehind ensures that you don't match a Z that's part of the previous record's ending delimiter (since an ending delimiter can not be preceded by a non-delimiter Z). However, this assumes there will never be empty records (or records containing only a single Z), which could lead to eight or more Z's in a row:
ZZabcZZZZdefZZZZZZZZxyz
If that were possible, I would forget about trying to match the final record by itself, and instead match all of them from the beginning:
(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)
The final, unterminated record is now captured in group #1.

I'd suggest something like...
/ZZ(.*?)(ZZ|$)/
This will match:
ZZ — the literal string
(.*?) — anychars
(ZZ|$) — either another ZZ literal, or the end of the string

^ZZ.*(?<!ZZ)$
Assert position at the beginning of the string «^»
Match the characters “ZZ” literally «ZZ»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!ZZ)»
Match the characters “ZZ” literally «ZZ»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
Created with RegexBuddy

There's one tricky part to this: The ZZ being both the start token and the end token.
There's one start case (ZZ, not followed by another ZZ which would signify that the first ZZ was actually an end token), and two end cases (ZZ end of string, ZZ followed by ZZ). The goal is to match the start case and NOT either of the end cases.
To that end, here's what I suggest:
/ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)/
For string ZZfooZZZZbarZZbazZZ:
This will NOT match ZZfooZZ, a legitimate record: ZZ, not followed by ZZ, followed by any combination of characters (here "foo"), followed by ZZ, but that ZZ is followed by ZZ, which opens the next record.
The next part examined is the ZZ after foo. This fails because the ZZ cannot be followed by another ZZ, yet in this case it is. This is as we want because the ZZ right after foo does not start a new record anyway.
The ZZ right before bar is not followed by another ZZ, so it's a legitimate start of record. "bar" is consumed by the .*?. Then there is a ZZ, but it is NOT followed by another ZZ or the end of string, which means that the ZZbar token is no good.
(It COULD be interpreted by a human as ZZbarZZ with bazZZ not being valid, but in either case there's something wrong, so I just wrote the regex to consider the wrongly-formatted record to occur here)
So ZZbar will be caught/matched by the regex, as illegitimate.
The ZZ after the bar isn't followed by ZZ, is followed by baz, followed by a ZZ that fails the lookahead assertion stating it can't be followed by the end of the string. So ZZbazZZ is a legitimate record and is not captured in the regex.
One more case: For ZZfoo, the beginning ZZ is okay, the foo is captured, then the regex notes that it's the end of the string, and no ZZ has occurred. Thus, ZZfoo is captured as an illegitimate match.
Let me know if this doesn't make sense, so I can make it more clear.

How about trying to remove all matches for ZZallcharsZZ and what you have left is what you want.
ZZ.*?ZZ

Related

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit it's going to split on both of them individually. You can easily match more than one digit at a time more than a few different ways, here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy qualifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.

Regex to find all possible occurrences of text starting and ending with ~

I would like to find all possible occurrences of text enclosed between two ~s.
For example: For the text ~*_abc~xyz~ ~123~, I want the following expressions as matching patterns:
~*_abc~
~xyz~
~123~
Note it can be an alphabet or a digit.
I tried with the regex ~[\w]+?~ but it is not giving me ~xyz~. I want ~ to be reconsidered. But I don't want just ~~ as a possible match.
Use capturing inside a positive lookahead with the following regex:
Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:
(?=(\w+))
At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
So, use
(?=(~[^\s~]+~))
See the regex demo
The pattern (?=(~[^\s~]+~)) checks each position inside a string and searches for ~ followed with 1+ characters other than whitespace and ~ and then followed with another ~. Since the index is moved only after a position is checked, and not when the value is captured, overlapping substrings get extracted.
Java demo:
String text = " ~*_abc~xyz~ ~123~";
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group(1));
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]
Just in case someone needs a Python demo:
import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']
Try this [^~\s]*
This pattern doesn't consider the characters ~ and space (refered to as \s).
I have tested it, it works on your string, here's the demo.

Regex matcher to handle a character or end of line

I would like to create a matching pattern for a situation like this
DOMAIN+("Y|A")?
I would like the matching options to be only
DOMAIN
DOMAINY
DOMAINA
but seems like DOMAINX, DOMAINY etc. are matching as well.
Yes, they are matching because you did not specify that the String needed to end with this. DOMAIN(Y|A)? is matching DOMAINX because it rightfully contains DOMAIN followed by nothing (which is accepted since ? validates 0 or 1 occurence).
You can add this restriction by specifying $ at the end of the regular expression.
Sample code that shows the result of matches. In your full code, you probably want to compile a Pattern instead of doing it each time.
public static void main(String[] args) {
String regex = "DOMAIN(Y|A)?$";
System.out.println("DOMAIN".matches(regex)); // prints true
System.out.println("DOMAINX".matches(regex)); // prints false
System.out.println("DOMAINY".matches(regex)); // prints true
System.out.println("DOMAINA".matches(regex)); // prints true
}
You could use word boundaries, \b, in order to prevent strings such as "DOMAINX" from being matched.
If you just want to handle cases where there are characters after the word, add \b to the end:
DOMAIN(?:Y|A)?\b
Otherwise, you could place \b around the expression to handle cases where there may be characters at the start/end:
\bDOMAIN(?:Y|A)?\b
I also made (?:Y|A) a non-capturing group and I removed the quotes.
See the matches here.
However, as your title implies, if you only want to handle characters at the end of a line, use the $ anchor at the end of your expression:
DOMAIN(?:Y|A)?$
You may have to add the m (multi-line) flag so that the anchor matches at the start/end of a line rather than at the start/end of the string:
(?m)DOMAIN(?:Y|A)?$
You need this
DOMAIN(Y|A)?
If you need it to be a word in text you should anchor it with \b as Josh shows.
Your regex does the following
DOMAIN+("Y|A")?
DOMAIN+("Y|A")?
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
[Match the character string “DOMAI” literally (case sensitive)][1] DOMAI
[Match the character “N” literally (case sensitive)][1] N+
[Between one and unlimited times, as many times as possible, giving back as needed (greedy)][2] +
[Match the regex below and capture its match into backreference number 1][3] ("Y|A")?
[Between zero and one times, as many times as possible, giving back as needed (greedy)][4] ?
[Match this alternative (attempting the next alternative only if this one fails)][5] "Y
[Match the character string “"Y” literally (case sensitive)][1] "Y
[Or match this alternative (the entire group fails if this one fails to match)][5] A"
[Match the character string “A"” literally (case sensitive)][1] A"

Programming error leads to inexplanable regex

for a test I created following regex by mistake:
|(\\w+)|
I was puzzled that this regex really works and I can't explain the result:
public static void main(String[] args) {
String toReplace="Hey I'm a lovely String an I'm giving my |value| worth!";
// String replacement1="2 cent"; // I planned to replace |value| with 2 cent
String replacement1="#"; // to produce a better Output
String regex="|(\\w+)|"; // I forgot to escape the |
replacement1="#";
result=toReplace.replaceAll(regex,replacement1);
System.out.println(result);
}
the result is:
#H#e#y# #I#'#m# #a# #l#o#v#e#l#y# #S#t#r#i#n#g# #a#n# #I#'#m# #g#i#v#i#n#g# #m#y# #|#v#a#l#u#e#|# #w#o#r#t#h#!#
My ideas so far are that java tries to replace "nothing" between the characters but why not the characters itself?
\\w+ should match the 'H'
I would expect that every char is replaced by 3 # signs or only by one but that the characters are not replaced puzzles me.
You're right, this regex matches the empty string between each character.
Since the first alternative (the empty string left of |) matches, the rest of the pattern isn't even tried, so the \w+ isn't even reached by the matching engine. You could have written any (valid) pattern to the right of that first |, it wouldn't ever be reached.
The engine works the following way: It has a current position cursor in the subject string. It tries to match starting at that current position. Since your regex is a match, it will perform the replacement at this point, and then move the current position cursor after the found match.
But since the match is zero-width, it simply advances to the next character, because not doing so would result in an infinite loop.

Split string of varying length using regex

I don't know if this is possible using regex. I'm just asking in case someone knows the answer.
I have a string ="hellohowareyou??". I need to split it like this
[h, el, loh, owar, eyou?, ?].
The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.
public ArrayList<String> splitString(String s)
{
int cnt=0,i;
ArrayList<String> sList=new ArrayList<String>();
for(i=0;i+cnt<s.length();i=i+cnt)
{
cnt++;
sList.add(s.substring(i,i+cnt));
}
sList.add(s.substring(i,s.length()));
return sList;
}
I was just curious whether such a thing can be done using regex.
Solution
The following snippet generates the pattern that does the job (see it run on ideone.com):
// splits at indices that are triangular numbers
class TriangularSplitter {
// asserts that the prefix of the string matches pattern
static String assertPrefix(String pattern) {
return "(?<=(?=^pattern).*)".replace("pattern", pattern);
}
// asserts that the entirety of the string matches pattern
static String assertEntirety(String pattern) {
return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
}
// repeats an assertion as many times as there are dots behind current position
static String forEachDotBehind(String assertion) {
return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
public static void main(String[] args) {
final String TRIANGULAR_SPLITTER =
"(?x) (?<=^.) | measure (?=(.*)) check"
.replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
.replace("notGyet", assertPrefix("(?! \\1 \\G)"))
.replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
.replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
.replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
;
String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
System.out.println(
java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
);
// [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
}
}
Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.
References
This is a brief description of the basic regex constructs used:
(?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
| denotes alternation (i.e. "or").
? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
(?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern.(?<=…) is a positive lookbehind; it looks to the left.
(?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
Related questions
Articles in the [nested-reference] series:
How does this regex find triangular numbers?
How can we match a^n b^n with Java regex?
How does this Java regex detect palindromes?
How does the regular expression (?<=#)[^#]+(?=#) work?
Explanation
The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:
(?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
This matches at index 1, and is a crucial starting point to the rest of the process
Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.
Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:
\1 captures the string "before" \G
\2 captures some string "after" \G
If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
Hence \2 . has length k+1 and should be the next part in our split!
\3 captures the string to the right of our current position
Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G
You can use mathematical induction to rigorously prove the correctness of this algorithm.
To help illustrate how this works, let's work through an example. Let's take abcdefghijklm as input, and say that we've already partially splitted off [a, bc, def].
\G we now need to match here!
↓ ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/ <--- \1 G \2 . \3
L=1+2+3 L=3
Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occured at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.
Thus in our example, given where \G is where we just splitted off def, we measured that k=3, and the next match will split off ghij as expected.
To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:
+NBefore, i.e. we extend \1 by one forEachDotBehind
+1After, i.e. we extend \2 by just one
Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
Conclusion
Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).
There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not the designed for this kind of pattern matching.
That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation has been instructive.
Regex purpose is to recognize patterns. Here you doesn't search for patterns but for a length split. So regex are not appropriate.
It is propably possible, but not with a single regex : to find the first n characters using a regex, you use: "^(.{n}).*"
So, you can search with that regex for the 1st character.
Then, you make a substring, and you search for the 2 next characters.
Etc.
Like #splash said, it will make the code more complicated, and unefficient, since you use regex for something outside of their purpose.
String a = "hellohowareyou??";
int i = 1;
while(true) {
if(i >= a.length()) {
System.out.println(a);
break;
}
else {
String b = a.substring(i++);
String[] out = a.split(Pattern.quote(b) + "$");
System.out.println(out[0]);
a = b;
if(b.isEmpty())
break;
}
}

Categories