How to make Java String split greedy with lookahead? - java

Code is basically:
String[] result = "T&&T&T".split("(?=\\w|&+)");
I was expecting the lookahead to be greedy but instead it is returning the array:
T, &, &, T, &, T
What I am aiming for is:
T, &&, T, &, T
Is this possible for split and lookahead?
I have tried the following split regex values but the result is still not greedy for the ampersand:
"(?=\\w|&&?)"
"(?=\\w|&{1,2})"

It is already greedy, but I think you are misunderstanding how your split is working. The problem is that you are thinking of the characters but not the space between them (this is one of the places where regexes can get away from you).
You are asking to split at the places in the string where the next character is either a word character or a series of ampersands. In your string, let's mark the places that satisfy that:
T|&|&|T|&|T
In the space between the first T and the first ampersand, the next character is an ampersand (matches (?=&) which is valid in your regex), the space between the two ampersands also matches for this same reason. The space between the ampersands and the second T also matches (matches (?=\w)), and so on.
The split function will test each space in the string to determine if it is a candidate for a split position. To do what you want, you have to be careful about using the lookahead, so that we don't allow splits in the middle of a string of ampersands.
There are multiple ways you may overcome this; Wiktor Stribiżew provides a suggestion that works in his comment.
Usually using a look-behind to check that you are not repeating an undesired character will work, or if possible you can use a look-behind to identify the matching places, and a look-ahead to avoid the undesired repetitions. For example, if we wish to split at all characters keeping repeated characters together, you could do (?<=(.))(?!\\1) which splits your example as T, &&, T, &, T.
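As a minimal sketch of the difference, the following compares the pattern from the question with the run-based pattern above. The class name RunSplitDemo and the helper splitRuns are illustrative names, not part of the original question.

```java
import java.util.Arrays;

class RunSplitDemo {
    // Splits wherever the character on the left differs from the character on the right,
    // so runs of the same character stay in one piece.
    static String[] splitRuns(String input) {
        return input.split("(?<=(.))(?!\\1)");
    }

    public static void main(String[] args) {
        // Pattern from the question: every position before '&' or a word character is a split point.
        System.out.println(Arrays.toString("T&&T&T".split("(?=\\w|&+)")));
        // prints [T, &, &, T, &, T]

        // Run-based pattern: no split between two identical neighbours.
        System.out.println(Arrays.toString(splitRuns("T&&T&T")));
        // prints [T, &&, T, &, T]
    }
}
```

The match at the very end of the string only produces a trailing empty element, which split drops.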

Lookarounds cannot be greedy or reluctant; they just check if the adjoining text to the left (lookbehind) or to the right (lookahead) matches the lookaround subpattern. If there is a match, and the lookaround is positive, the empty location is matched. If the lookaround is not anchored, each location in the string is tested against the pattern in the lookaround, even the beginning and end.
Since the lookaround is a zero-width assertion and it does not consume characters, all locations (before each character and at the end) are tested. Thus, you get matches between each character.
The (?=\w|&&?) checks the first location before T: it gets matched with \w, so this location is matched (see the first |). Then comes the next location, after the first T and before the &. It is matched as it is followed by &&. Then the regex engine goes on to check the location between the first & and the second &. It is matched as there is a & after it. This way, we match up to the end. The end location is not matched as it is not followed by & or a word character.
You may restrict the pattern inside a lookaround with another lookaround to avoid matching specific locations inside the input string.
(?=\w|(?<!&)&)
^^^^^^
The (?<!&)& pattern will match a & that is not preceded with another &. See the regex demo.
IDEONE demo:
String[] result = "T&&T&T".split("(?=\\w|(?<!&)&)");
System.out.println(Arrays.toString(result));
// => [T, &&, T, &, T]
The lookaround solution is a generic one. If we are to consider the current case, you can surely "shorten" the pattern to \b (which will also find a match at the end of the string, though Java String#split will safely remove trailing empty elements from the resulting array) that matches all locations between a non-word and a word character and also at the start/end of the string if there is a word character at its start/end. This won't work if the alternatives (like \w and & in your regex) belong to the same type (say, both are word characters).

How about this:
"(?=\\w)|(?<=\\w)"
or allowing repeat of T:
"(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w)"
or the best form here

It looks like you want to split between different chars, so generally:
String[] parts = input.split("(?<=T)(?=&)|(?<=&)(?=T)");
But in this case, you can split on word boundaries except at start/end:
String[] parts = input.split("(?<=.)\\b(?=.)");
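A small sketch of both variants follows; the class and method names are made up for illustration. Note that the word boundary has to be written as \\b in a Java string literal, because a single \b in a string literal is the backspace character.

```java
import java.util.Arrays;

class BoundarySplitDemo {
    // Splits only where a T meets an & (in either order).
    static String[] splitBetweenTAndAmp(String input) {
        return input.split("(?<=T)(?=&)|(?<=&)(?=T)");
    }

    // Splits on word boundaries that have a character on both sides,
    // which excludes the start and the end of the input.
    static String[] splitOnInnerWordBoundaries(String input) {
        return input.split("(?<=.)\\b(?=.)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitBetweenTAndAmp("T&&T&T")));
        // prints [T, &&, T, &, T]
        System.out.println(Arrays.toString(splitOnInnerWordBoundaries("T&&T&T")));
        // prints [T, &&, T, &, T]
    }
}
```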

Related

Regex return true if even a substring follows the pattern

I was just practicing regex and found something intriguing
for a string
"world9 a9$ b6$" my regular expression "^(?=.*[\\d])(?=\\S+\\$).{2,}$"
will return false, as there is a space in between before the lookahead finds the $ sign with at least one digit and non-space character.
As a whole, the string doesn't match the pattern.
What should be the regular expression if I want to return true even if a substring follows a pattern?
as in this one a9$ and b6$ both follow the regular expression.
You can use
^(?=\D*\d)(?=.*\S\$).{2,}$
See the regex demo. As The fourth bird mentions, since \S\$ matches two chars, you may simply move the pattern to the consuming part, and use ^(?=\D*\d).*\S\$.*$, see this regex demo.
Details
^ - start of string (implicit if used in .matches())
(?=\D*\d) - a positive lookahead that requires zero or more non-digit chars followed with a digit char immediately to the right of the current location
(?=.*\S\$) - a positive lookahead that requires zero or more chars other than line break chars, as many as possible, followed with a non-whitespace char and a $ char immediately to the right of the current location
.{2,} - any two or more chars other than line break chars, as many as possible
$ - end of string (implicit if used in .matches())
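The three patterns discussed so far can be compared with String.matches in a short sketch. The class and constant names are illustrative only.

```java
class LookaheadMatchDemo {
    // Pattern from the question: \S+ must reach a '$' from the very start of the string.
    static final String ORIGINAL = "^(?=.*[\\d])(?=\\S+\\$).{2,}$";
    // Lookahead version: a digit somewhere, and a non-whitespace char followed by '$' somewhere.
    static final String RELAXED = "^(?=\\D*\\d)(?=.*\\S\\$).{2,}$";
    // Same idea with \S\$ moved into the consuming part of the pattern.
    static final String CONSUMING = "^(?=\\D*\\d).*\\S\\$.*$";

    public static void main(String[] args) {
        String input = "world9 a9$ b6$";
        System.out.println(input.matches(ORIGINAL));  // prints false
        System.out.println(input.matches(RELAXED));   // prints true
        System.out.println(input.matches(CONSUMING)); // prints true
    }
}
```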
Mostly, knock out the ^ and $ bits, as those force this into a full string match, and you want substring matches. In general, look-ahead seems like a mistake here, what are you trying to accomplish by using that? (Look-ahead/look-behind is rarely needed in general). All you need is:
Pattern.compile("\\S+\\$");
Possibly, if you want an element (such as a9$) to stand on its own, use \b, which is regexpese for word boundary: the position between a word character (think [a-zA-Z0-9_]) and a non-word character such as whitespace or punctuation. \b also matches at the start/end of input when a word character sits there. Because $ is itself a non-word character, a \b placed right after \$ would only match when a word character follows, so use (?!\S) to mark the end of the element instead. Thus:
Pattern.compile("\\b\\S+\\$(?!\\S)")
still matches foo a9$ bar, or a9$ just fine.
If you MUST put this in terms of a full match, e.g. because matches() (which always does a full string match) is run and you can't change that, well, put ^.* in front and .*$ at the back of it, simple as that.
Absolutely nothing about this says "This can only be needed with lookahead".
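To illustrate the substring approach, here is a sketch that collects every match of \S+\$ with Matcher.find(). The class name and the findTokens helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringFindDemo {
    private static final Pattern TOKEN = Pattern.compile("\\S+\\$");

    // Collects every substring that consists of non-whitespace characters ending in '$'.
    static List<String> findTokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("world9 a9$ b6$"));
        // prints [a9$, b6$]

        // matches() still demands that the whole input fits the pattern.
        System.out.println(TOKEN.matcher("world9 a9$ b6$").matches());
        // prints false
    }
}
```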

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may have a comma ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.matches(regex, String), but I am not able to find the correct regex.
Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
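A sketch of how the whitelisted-operator pattern could be used to pull the two numbers and the operator apart. The class name and the parse helper are assumptions for illustration, and the numbers are returned as raw strings without any validation of separator placement.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EquationDemo {
    private static final Pattern EQUATION =
            Pattern.compile("((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // Returns {left number, operator, right number}, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = EQUATION.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Group 3 includes the optional whitespace around the operator, so trim it.
        return new String[] { m.group(1), m.group(3).trim(), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("500+400")));         // prints [500, +, 400]
        System.out.println(Arrays.toString(parse("400,021+213.443"))); // prints [400,021, +, 213.443]
        System.out.println(Arrays.toString(parse("300 - -200")));      // prints [300, -, -200]
        System.out.println(Arrays.toString(parse("500 abc 400")));     // prints null
    }
}
```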
You can see this regular expression here
In your regex you have to escape the dot \. to match it literally and escape the \+ or else it would make the ? a possessive quantifier. To match 1+ digits you have to use a quantifier [0-9]+
For your example data, you could match 1+ digits followed by an optional part which matches either a dot or a comma at the start and at the end. If you want to match 1 time any character you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
Regex demo
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. Match any character (use .+ to repeat 1+ times)
Same as the first pattern
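A brief sketch of that pattern with String.matches, next to a variant that swaps the dot for the character class [-+*] mentioned above. The class and constant names are illustrative. One thing worth seeing in practice: since the dot matches any character, including a digit, a plain number such as 5000 also satisfies the first pattern.

```java
class SimpleEquationDemo {
    // Number, any single character, number.
    static final String ANY_SEPARATOR = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
    // Number, one of the listed operators, number.
    static final String OPERATOR_ONLY = "\\d+(?:[.,]\\d+)?[-+*]\\d+(?:[.,]\\d+)?";

    public static void main(String[] args) {
        System.out.println("500+400".matches(ANY_SEPARATOR));         // prints true
        System.out.println("400,021+213.443".matches(ANY_SEPARATOR)); // prints true
        // The dot can also consume a digit, so a plain number passes as well.
        System.out.println("5000".matches(ANY_SEPARATOR));            // prints true
        System.out.println("5000".matches(OPERATOR_ONLY));            // prints false
        System.out.println("400,021+213.443".matches(OPERATOR_ONLY)); // prints true
    }
}
```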

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it's going to split on both of them individually. To treat a run of digits as a single delimiter, use a quantifier that matches one or more digits:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy; note that in split() a lazy quantifier stops after the first digit, so this variant still splits on each digit separately)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy quantifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
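For comparison, here is a sketch that runs the three patterns discussed in this answer through split. It assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string; the class name is illustrative.

```java
import java.util.Arrays;

class DigitSplitDemo {
    public static void main(String[] args) {
        String s = "a12ij";

        // Zero or one digit: each digit is its own delimiter, and the empty matches
        // between the remaining letters split them apart too.
        System.out.println(Arrays.toString(s.split("\\d?")));
        // prints [a, , , i, j]

        // One or more digits: the run "12" is a single delimiter.
        System.out.println(Arrays.toString(s.split("\\d+")));
        // prints [a, ij]

        // Optional run of digits: "12" is one delimiter, empty matches still separate the letters.
        System.out.println(Arrays.toString(s.split("(\\d+)?")));
        // prints [a, , i, j]
    }
}
```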
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this is new behavior as of Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
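The sequence of matches listed above can be checked directly with a Matcher loop. This is only a sketch; the class name and the matchSpans helper are invented for the example, and each match is reported as start-end with an empty match having equal start and end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSpansDemo {
    // Records the start and end offsets of every successive find() result.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        System.out.println(matchSpans("\\d?", "a12ij"));
        // prints [0-0, 1-2, 2-3, 3-3, 4-4, 5-5]
    }
}
```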

Regex for First word and last word of a string separates with

I'm trying to get a regex for the following expression but can't make it:
String have 4 words separated with dots(.).
First word matches a given one (HELLO for example).
Second and third words could have any character but dot itself (.).
Last word matches a given one again(csv for example).
So:
HELLO.something.Somethi#gElse.csv should match.
something.HELLO.?.csv shouldn't match.
HELLO.something...csv shouldn't match.
HELLO.something.somethingelse.notcsv shouldn't match
I can do it with split(.) and then check for individual words, but I'm trying to get it working with Regex and Pattern class.
Any help would be really appreciated.
This is relatively straightforward, as long as you understand character classes. A regex with square brackets [xyz] matches any character from the list {x, y, z}; a regex [^xyz] matches any character except {x, y, z}.
Now you can construct your expression:
^HELLO\.[^.]+\.[^.]+\.csv$
+ means "one or more of the preceding expression"; \. means "dot itself". ^ means "the beginning of the string"; $ means "the end of the string". These anchors prevent regex from matching
blahblahHELLO.world.world.csvblahblah
Demo.
A common goal for writing regular expressions like that is to capture some content, for example, the string between the first and the second dot, and the string between the second and the third dot. Use capturing groups to bring the content of these strings into your Java program:
^HELLO\.([^.]+)\.([^.]+)\.csv$
Each pair of parentheses defines a capturing group, indexed from 1 (group at index zero represents the capture of the entire expression). Once you obtain a match object from the pattern, you can query it for the groups, and extract the corresponding strings.
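A minimal sketch of querying the groups from Java, with the backslashes doubled for the string literal. The class name and the middleWords helper are illustrative names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvNameDemo {
    private static final Pattern NAME =
            Pattern.compile("^HELLO\\.([^.]+)\\.([^.]+)\\.csv$");

    // Returns the second and third words, or null when the input does not match.
    static String[] middleWords(String input) {
        Matcher m = NAME.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(middleWords("HELLO.something.Somethi#gElse.csv")));
        // prints [something, Somethi#gElse]
        System.out.println(Arrays.toString(middleWords("HELLO.something...csv")));
        // prints null
    }
}
```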
Note that backslashes in Java regex need to be doubled.
(^HELLO\.[^.]+\.[^.]+\.csv$)
Here is the same regex with token explanation on regex101.

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] by the Unicode property for letters \p{L}). Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression ^(?!\s*$) is that the lookahead only fails if there is nothing but whitespace until the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, look behind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think a lookaround is not needed for this task.
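The behaviour of the three expressions can be compared in a short sketch. The class and constant names are illustrative. One difference that shows up: the lookaround variant accepts the empty string, while the \p{L} pattern requires at least one letter.

```java
class NameValidationDemo {
    // Expression from the question: only rejects input that is entirely whitespace (or empty).
    static final String ORIGINAL = "^(?!\\s*$)[-a-zA-Z ]*$";
    // Lookahead and lookbehind reject whitespace at the start and at the end.
    static final String LOOKAROUND_FIX = "^(?!\\s)[-a-zA-Z ]*(?<!\\s)$";
    // Letters separated by single spaces, no lookaround needed.
    static final String LETTERS_ONLY = "^\\p{L}+(?: \\p{L}+)*$";

    public static void main(String[] args) {
        System.out.println(" John".matches(ORIGINAL));            // prints true
        System.out.println(" John".matches(LOOKAROUND_FIX));      // prints false
        System.out.println("John ".matches(LOOKAROUND_FIX));      // prints false
        System.out.println("John Smith".matches(LOOKAROUND_FIX)); // prints true
        System.out.println("John Smith".matches(LETTERS_ONLY));   // prints true
        // The empty string passes the lookaround variant but not the \p{L} pattern.
        System.out.println("".matches(LOOKAROUND_FIX));           // prints true
        System.out.println("".matches(LETTERS_ONLY));             // prints false
    }
}
```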
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13), space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word besides the first must have a space before it.
Hope I helped.
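A quick sketch of the two variants, written with [a-zA-Z] for the letter class (the range A-z would also admit characters such as [ and _). The class and constant names are made up for illustration.

```java
class WordSpacingDemo {
    // Words separated by exactly one space.
    static final String SINGLE_SPACE = "^[a-zA-Z]+( [a-zA-Z]+)*$";
    // Words separated by one or more spaces.
    static final String MULTI_SPACE = "^[a-zA-Z]+(( )+[a-zA-Z]+)*$";

    public static void main(String[] args) {
        System.out.println("Hello World".matches(SINGLE_SPACE));  // prints true
        System.out.println("Hello  World".matches(SINGLE_SPACE)); // prints false
        System.out.println("Hello  World".matches(MULTI_SPACE));  // prints true
        System.out.println(" Hello".matches(MULTI_SPACE));        // prints false
        System.out.println("Hello ".matches(MULTI_SPACE));        // prints false
    }
}
```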
