Split string of varying length using regex


I don't know if this is possible using regex. I'm just asking in case someone knows the answer.
I have a string s = "hellohowareyou??". I need to split it like this:
[h, el, loh, owar, eyou?, ?].
The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.
public ArrayList<String> splitString(String s)
{
int cnt=0,i;
ArrayList<String> sList=new ArrayList<String>();
for(i=0;i+cnt<s.length();i=i+cnt)
{
cnt++;
sList.add(s.substring(i,i+cnt));
}
sList.add(s.substring(i,s.length()));
return sList;
}
I was just curious whether such a thing can be done using regex.

Solution
The following snippet generates the pattern that does the job (see it run on ideone.com):
// splits at indices that are triangular numbers
class TriangularSplitter {
// asserts that the prefix of the string matches pattern
static String assertPrefix(String pattern) {
return "(?<=(?=^pattern).*)".replace("pattern", pattern);
}
// asserts that the entirety of the string matches pattern
static String assertEntirety(String pattern) {
return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
}
// repeats an assertion as many times as there are dots behind current position
static String forEachDotBehind(String assertion) {
return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
public static void main(String[] args) {
final String TRIANGULAR_SPLITTER =
"(?x) (?<=^.) | measure (?=(.*)) check"
.replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
.replace("notGyet", assertPrefix("(?! \\1 \\G)"))
.replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
.replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
.replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
;
String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
System.out.println(
java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
);
// [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
}
}
Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.
References
This is a brief description of the basic regex constructs used:
(?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
| denotes alternation (i.e. "or").
? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
(?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.
(?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
Related questions
Articles in the [nested-reference] series:
How does this regex find triangular numbers?
How can we match a^n b^n with Java regex?
How does this Java regex detect palindromes?
How does the regular expression (?<=#)[^#]+(?=#) work?
Explanation
The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:
(?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
This matches at index 1, and is a crucial starting point to the rest of the process
Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.
Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:
\1 captures the string "before" \G
\2 captures some string "after" \G
If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
Hence \2 . has length k+1 and should be the next part in our split!
\3 captures the string to the right of our current position
Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G
You can use mathematical induction to rigorously prove the correctness of this algorithm.
To help illustrate how this works, let's work through an example. Let's take abcdefghijklmn as input, and say that we've already split off [a, bc, def].
            \G      we now need to match here!
            ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 G \2 . \3
  L=1+2+3    L=3
Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occurred at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.
Thus in our example, \G is where we just split off def, so we measure k=3, and the next match will split off ghij as expected.
To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:
+NBefore, i.e. we extend \1 by one forEachDotBehind
+1After, i.e. we extend \2 by just one
Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
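As a loose analogy only (this is not the regex itself), the "measure" loop can be sketched with two integer counters standing in for the lengths of \1 and \2. The class and method names below are hypothetical, chosen for illustration.

```java
class TriangularMeasure {
    // Given the index of the previous split (a triangular number 1+2+...+k),
    // rebuild k step by step and return the index of the next split.
    static int nextSplitIndex(int lastSplit) {
        int before = 0; // stands in for the length of \1
        int after = 0;  // stands in for the length of \2
        while (before < lastSplit) { // "notGyet": \1 has not reached \G
            after++;                 // extend \2 by one
            before += after;         // extend \1 by the current step count
        }
        // Now before == lastSplit and after == k; the next part has length k+1.
        return lastSplit + after + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextSplitIndex(1)); // 3
        System.out.println(nextSplitIndex(3)); // 6
        System.out.println(nextSplitIndex(6)); // 10
    }
}
```

The regex has to do the same bookkeeping with strings instead of integers, which is where the extra cost comes from.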
Conclusion
Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).
There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not designed for this kind of pattern matching.
That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation have been instructive.

The purpose of regex is to recognize patterns. Here you aren't searching for a pattern but splitting by length, so regex is not appropriate.
It is probably possible, but not with a single regex: to find the first n characters using a regex, you use "^(.{n}).*"
So, you can search with that regex for the 1st character.
Then, you make a substring, and you search for the 2 next characters.
Etc.
As #splash said, it will make the code more complicated and inefficient, since you are using regex for something outside its purpose.
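The steps above could be sketched roughly as follows. This is only one way to read the suggestion; the class and method names are made up for illustration, and the pattern is recompiled for each n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedPrefixSplit {
    // Repeatedly takes the first n characters of the remaining text,
    // with n growing by one each round.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        String rest = s;
        for (int n = 1; !rest.isEmpty(); n++) {
            Matcher m = Pattern.compile("^(.{" + n + "})", Pattern.DOTALL).matcher(rest);
            if (m.find()) {
                parts.add(m.group(1));    // exactly n characters
                rest = rest.substring(n); // continue with what is left
            } else {
                parts.add(rest);          // fewer than n characters remain
                rest = "";
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("hellohowareyou??")); // [h, el, loh, owar, eyou?, ?]
    }
}
```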

String a = "hellohowareyou??";
int i = 1;
while(true) {
if(i >= a.length()) {
System.out.println(a);
break;
}
else {
String b = a.substring(i++);
String[] out = a.split(Pattern.quote(b) + "$");
System.out.println(out[0]);
a = b;
if(b.isEmpty())
break;
}
}

Related

How to find a last occurrence of set of characters in string using regex in java?

I need to find the last index of a set of characters in a string. Consider the set of characters to be x, y, z and the string to be Vereador Luiz Pauly Home; then I need the index 18.
So for finding the index I have created a pattern with DOTALL flag and greedy quantifier as (?s).*(x|y|z). When the pattern is applied to that string(multiline), I can find out index from the start group. The code:
int findIndex(String str){
int index = -1;
Pattern p = Pattern.compile("(?s).*(x|y|z)");
Matcher m = p.matcher(str);
if(m.find()){
index = m.start(1);
}
return index;
}
As expected it is returning the values correctly, if there is match.
But if there is no match, then it takes far too long (17 minutes for 600000 characters), as it is a greedy match.
I tried other quantifiers, but couldn't get the desired output. Can anyone suggest a better regex?
PS: I could also traverse the content from the end to find the index. But I hope there is a better way in regex that can do the job quickly.

There are a few ways to solve the problem, and the best one will depend on the size of the input and the complexity of the pattern:
1. Reverse the input string and possibly the pattern; this might work for non-complex patterns. Unfortunately java.util.regex doesn't allow matching a pattern from right to left.
2. Instead of using a greedy quantifier, simply match the pattern and loop Matcher.find() until the last occurrence is found.
3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
If option 2 is not efficient enough for your case I'd suggest to try RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
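The option of looping Matcher.find() until the last occurrence could look roughly like this. It is a minimal sketch with hypothetical names, using the x, y, z set from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchByLooping {
    private static final Pattern XYZ = Pattern.compile("[xyz]");

    // Walks through every match from left to right and remembers the last start index.
    static int lastIndex(String str) {
        Matcher m = XYZ.matcher(str);
        int index = -1;
        while (m.find()) {
            index = m.start();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(lastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(lastIndex("no match here"));            // -1
    }
}
```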

Performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then backtracking occurs to find x, y or z. If there is no match, or the match is at the start of the string, and the string is very large, this might take a really long time.
The ([xyz])(?=[^xyz]*$) pattern seems a little bit better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string, but it also is somewhat resource-consuming due to each lookahead check after a match is found.
The fastest regex to get your job done is
^(?:[^xyz]*+([xyz]))+
It matches
^ - start of string
(?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
[^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
([xyz]) - Group 1: x, y or z.
The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).
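A small sketch of how this pattern might be used from Java follows. The class and method names are assumptions for illustration; the pattern itself is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveLastIndex {
    private static final Pattern LAST_XYZ = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    // Group 1 holds the x, y or z from the last successful iteration of the repeated group.
    static int lastIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(lastIndex("abc"));                      // -1
    }
}
```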

StringBuilder both has a reverse and is a CharSequence, so searching is possible.
Pattern p = Pattern.compile("[xyz]");
StringBuilder sb = new StringBuilder(str).reverse();
Matcher m = p.matcher(sb);
return m.find() ? sb.length() - m.end() : -1;
Unfortunately reversal is costly.
A solution without regex is probably faster.
(BTW surrogate pairs are handled correctly by the reversal.)
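For comparison, a solution without regex might be a plain backwards scan. This is a minimal sketch; the helper name and the idea of passing the character set as a String are assumptions made for illustration.

```java
class LastIndexScan {
    // Scans from the end and returns the index of the last char contained in `chars`, or -1.
    static int lastIndexOfAny(String str, String chars) {
        for (int i = str.length() - 1; i >= 0; i--) {
            if (chars.indexOf(str.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndexOfAny("Vereador Luiz Pauly Home", "xyz")); // 18
    }
}
```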

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?

The pattern you're using only matches one digit at a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it's going to split on each of them individually. You can match more than one digit at a time in several ways; here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
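As a quick check of the \d+ variant, here is a minimal sketch using the string from the question (the class and method names are only illustrative):

```java
import java.util.Arrays;

class SplitOnDigitRuns {
    // Splits on runs of one or more digits, so "12" acts as a single delimiter.
    static String[] splitOnDigits(String s) {
        return s.split("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDigits("a12ij"))); // [a, ij]
    }
}
```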
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy quantifier ?. I recommend visiting one of the popular regex sites that allow you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here

The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
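The sequence of find() matches described above can be observed directly. The following is a small sketch with a hypothetical helper that records each match as start-end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTrace {
    // Records every successive find() match as "start-end"; an empty match has start == end.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchSpans("\\d?", "a12ij")); // [0-0, 1-2, 2-3, 3-3, 4-4, 5-5]
    }
}
```

The six spans correspond to the six matches listed in the answer: four empty matches and the two digits.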

Java regular expression: A-Z and - or _, but only once

I've only dabbled in regular expressions and was wondering if someone could help me make a Java regex, which matches a string with these qualities:
It is 1-14 characters long
It consists only of A-Z, a-z and the characters _ or -
The symbol - and _ must be contained only once (together) and not at the start
It should match
Hello-Again
ThisIsValid
AlsoThis_
but not
-notvalid
Not-Allowed-This
Nor-This_thing
VeryVeryLongStringIndeed
I've tried the following regex string
[a-zA-Z^\\-_]+[\\-_]?[a-zA-Z^\\-_]*
and it seems to work. However, I'm not sure how to do the total character limiting part with this approach. I've also tried
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]{1,14}
but it matches (for example) abc-cde_aa which it shouldn't.

This ought to work:
(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}
The regex is quite complex, so let me try to explain it.
(?![_-]) is a negative lookahead. At the start of the string it asserts that the first character is not _ or -. The negative lookahead "peeks" ahead from the current position and checks that what follows doesn't match [_-], which is a character class containing _ and -.
(?!(?:.*[_-]){2,}) is another negative lookahead, this time matching (?:.*[_-]){2,}, which is a non-capturing group repeated at least two times. The group is .*[_-], i.e. any sequence of characters followed by the same character class as before. So we don't want to see some characters followed by _ or - more than once.
[A-Za-z_-]{1,14} is the simple bit. It just says the characters in the group [A-Za-z_-] between 1 and 14 times.
The second part of the pattern is the most tricky, but is a very common trick. If you want to see a character A repeated at some point in the pattern at least X times you want to see the pattern .*A at least X times because you must have
zzzzAzzzzAzzzzA....
You don't care what else is there. So what you arrive at is (.*A){X,}. Now, you don't need to capture the group - this just slows down the engine. So we make the group non-capturing - (?:.*A){X,}.
What you have is that you only want to see the pattern once, so you want not to find the pattern repeated two or more times. Hence it slots into a negative lookahead.
Here is a testcase:
public static void main(String[] args) {
final String pattern = "(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}";
final String[] tests = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"_NotThis_",
"-notvalid",
"Not-Allow-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (final String test : tests) {
System.out.println(test.matches(pattern));
}
}
Output:
true
true
true
false
false
false
false
false
Things to note:
the character - is special inside character groups. It must go at the start or end of a group otherwise it specifies a range
lookaround is tricky and often counter-intuitive. It will check for matches without consuming, allowing you to test multiple conditions on the same data.
the repetition quantifier {} is very useful. It has 3 forms. {X} is repeated exactly X times. {X,} is repeated at least X times. And {X,Y} is repeated between X and Y times.

To check if string is in form XXX-XXX where -XXX or _XXX part is optional you can use
[a-zA-Z]+([-_][a-zA-Z]*)?
which is similar to what you already had
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]
but you made crucial mistake and wrapped it entirely in [...] which makes it character class, and that is not what you wanted.
To check if matched part has only 1-14 length you can use look-ahead mechanism. Just place
(?=.{1,14}$)
at the start of your regex to make sure that the part from the start of the match to the end of the string (represented by $) consists of 1-14 characters.
So your final regex can look like
String regex = "(?=.{1,14}$)[a-zA-Z]+([-_][a-zA-Z]*)?";
Demo
String [] data = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"-notvalid",
"Not-Allowed-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (String s : data)
System.out.println(s + " : " + s.matches(regex));
Output:
Hello-Again : true
ThisIsValid : true
AlsoThis_ : true
-notvalid : false
Not-Allowed-This : false
Nor-This_thing : false
VeryVeryLongStringIndeed : false

regular expression to allow only 1 dash

I have a textbox where I get the last name of a user. How do I allow only one dash (-) in a regular expression? And it's not supposed to be in the beginning or at the end of the string.
I have this code:
Pattern p = Pattern.compile("[^a-z-']", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(name);

Try to rephrase the question in more regexy terms. Rather than "allow only one dash, and it can't be at the beginning" you could say, "the string's beginning, followed by at least one non-dash, followed by one dash, followed by at least one non-dash, followed by the string's end."
the string's beginning: ^
at least one non-dash: [^-]+
followed by one dash: -
followed by at least one non-dash: [^-]+
followed by the string's end: $
Put those all together, and there you go. If you're using this in a context that matches against the complete string (not just any substring within it), you don't need the anchors -- though it may be good to put them in anyway, in case you later use that regex in a substring-matching context and forget to add them back in.
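Putting the pieces together gives ^[^-]+-[^-]+$. Here is a minimal sketch of using it in Java; the class and method names are assumptions. Note that, as described, this pattern requires exactly one dash, so a name with no dash at all does not match.

```java
import java.util.regex.Pattern;

class OneDash {
    private static final Pattern ONE_INNER_DASH = Pattern.compile("^[^-]+-[^-]+$");

    // True when the string has exactly one dash and it is neither first nor last.
    static boolean hasSingleInnerDash(String s) {
        return ONE_INNER_DASH.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasSingleInnerDash("last-name")); // true
        System.out.println(hasSingleInnerDash("-name"));     // false
        System.out.println(hasSingleInnerDash("a-b-c"));     // false
    }
}
```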

Why not just use indexOf() in String?
String s = "last-name";
int first = s.indexOf('-');
int last = s.lastIndexOf('-');
if(first == 0 || last == s.length()-1) // Checks if a dash is at the beginning or end
System.out.println("BAD");
if(first != last) // Checks if there is more than one dash
System.out.println("BAD");
Even if it were slower than using a regex, with the usually small size of last names the difference should not be noticeable in the least. Also, it will make debugging and future maintenance MUCH easier.

It looks like your regex represents a fragment of an invalid value, and you're presumably using Matcher.find() to find if any part of your value matches that regex. Is that correct? If so, you can change your pattern to:
Pattern p = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");
which will match a non-letter-non-hyphen-non-apostrophe character, or a sequence of characters that both starts and ends with hyphens (thereby detecting a value that contains two hyphens), or a leading hyphen, or a trailing hyphen.
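Assuming the pattern is indeed used with Matcher.find() to detect an invalid value, a sketch might look like this. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class LastNameCheck {
    // Matches any fragment that makes the value invalid.
    private static final Pattern INVALID = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");

    // A name is accepted when no invalid fragment can be found anywhere in it.
    static boolean isValid(String name) {
        return !INVALID.matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("O'Neil-Smith")); // true
        System.out.println(isValid("-Smith"));       // false
        System.out.println(isValid("a-b-c"));        // false
    }
}
```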

This regex represents one or more non-hyphens, followed by a single hyphen, followed by one or more non-hyphens.
^[^\-]+\-[^\-]+$
I'm not sure if the hyphen in the middle needs to be escaped with a backslash... That probably depends on what platform you're using for regex.

Try pattern something like [a-z]-[a-z].
Pattern p = Pattern.compile("[a-z]-[a-z]");

What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?

From the Pattern javadocs:
Greedy quantifiers:
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
Reluctant quantifiers:
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
The description of what they do is the same...so, what is the difference?
I would really appreciate some examples.
I am coding in Java, but I hear this concept is the same for most modern regex implementations.

A greedy quantifier always tries to "grab" as much of the input as possible, while a reluctant quantifier will match as little of the input as possible and still create a match.
Example:
"The red fox jumped over the red fence"
/(.*)red/ => \1 = "The red fox jumped over the "
/(.*?)red/ => \1 = "The "
"aaa"
/(a?)(a*)/ => \1 = "a", \2 = "aa"
/(a??)(a*)/ => \1 = "", \2 = "aaa"
"Mr. Doe, John"
/^(?:Mrs?.)?.*\b(.+)$/ => \1 = "John"
/^(?:Mrs?.)?.*?\b(.+)$/ => \1 = "Doe, John"
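In Java, the greedy and reluctant difference can be checked with a small helper like the one below. This is a sketch with made-up names; the "aaa" patterns are written here with explicit capturing groups so that \1 and \2 exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the capturing groups of the first match, or null if there is no match.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            result.add(m.group(i));
        }
        return result;
    }

    public static void main(String[] args) {
        String fox = "The red fox jumped over the red fence";
        System.out.println(groups("(.*)red", fox));       // [The red fox jumped over the ]
        System.out.println(groups("(.*?)red", fox));      // [The ]
        System.out.println(groups("(a?)(a*)", "aaa"));    // [a, aa]
        System.out.println(groups("(a??)(a*)", "aaa"));   // [, aaa]
    }
}
```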

From this link, where the tutorial author acknowledges the spirit of your question:
At first glance it may appear that
the quantifiers X?, X?? and X?+ do
exactly the same thing, since they all
promise to match "X, once or not at
all". There are subtle implementation
differences which will be explained
near the end of this section.
They go on to put together examples and offer the explanation:
Greedy quantifiers are considered
"greedy" because they force the
matcher to read in, or eat, the entire
input string prior to attempting the
first match. If the first match
attempt (the entire input string)
fails, the matcher backs off the input
string by one character and tries
again, repeating the process until a
match is found or there are no more
characters left to back off from.
Depending on the quantifier used in
the expression, the last thing it will
try matching against is 1 or 0
characters.
The reluctant quantifiers, however,
take the opposite approach: They start
at the beginning of the input string,
then reluctantly eat one character at
a time looking for a match. The last
thing they try is the entire input
string.
And for extra credit, the possessive explanation:
Finally, the possessive quantifiers
always eat the entire input string,
trying once (and only once) for a
match. Unlike the greedy quantifiers,
possessive quantifiers never back off,
even if doing so would allow the
overall match to succeed.

A greedy quantifier will match as much as possible while still getting a match.
A reluctant quantifier will match the smallest amount possible.
For example, given the string
abcdef
the greedy quantifier
ab[a-z]*[a-z] would match abcdef
the reluctant quantifier
ab[a-z]*?[a-z] would match abc

Say you have the regex "a\w*b" and use it on "abab".
Greedy matching will match "abab" (it looks for an a, as many occurrences of \w as possible, and a b), while reluctant matching, i.e. "a\w*?b", will match just "ab" (as few \w as possible).

There is documentation on how Perl handles these quantifiers in perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
{n}? Match exactly n times, not greedily
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
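java.util.regex supports possessive quantifiers as well, so the 'aaaa' example can be reproduced in Java. The following is a minimal sketch with a hypothetical helper name.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a+a", "aaaa"));  // true: a+ gives back one 'a'
        System.out.println(found("a++a", "aaaa")); // false: a++ keeps every 'a'
    }
}
```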
