Regex look-behind without obvious maximum length in Java


I always thought that a look-behind assertion in Java's regex API (and in many other languages, for that matter) must have an obvious maximum length, so the STAR and PLUS quantifiers are not allowed inside look-behinds.
The excellent online resource regular-expressions.info seems to confirm (some of) my assumptions:
"[...] Java takes things a step further by
allowing finite repetition. You still
cannot use the star or plus, but you
can use the question mark and the
curly braces with the max parameter
specified. Java recognizes the fact
that finite repetition can be
rewritten as an alternation of strings
with different, but fixed lengths.
Unfortunately, the JDK 1.4 and 1.5
have some bugs when you use
alternation inside lookbehind. These
were fixed in JDK 1.6. [...]"
-- http://www.regular-expressions.info/lookaround.html
Using the curly brackets works as long as the total maximum length of the characters inside the look-behind is smaller than or equal to Integer.MAX_VALUE. So these regexes are valid:
"(?<=a{0," +(Integer.MAX_VALUE) + "})B"
"(?<=Ca{0," +(Integer.MAX_VALUE-1) + "})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-2) + "})B"
But these aren't:
"(?<=Ca{0," +(Integer.MAX_VALUE) +"})B"
"(?<=CCa{0," +(Integer.MAX_VALUE-1) +"})B"
However, I don't understand the following:
When I run a test using the * and + quantifier inside a look-behind, all goes well (see output Test 1 and Test 2).
But, when I add a single character at the start of the look-behind from Test 1 and Test 2, it breaks (see output Test 3).
Making the greedy * from Test 3 reluctant has no effect, it still breaks (see Test 4).
Here's the test harness:
public class Main {
private static String testFind(String regex, String input) {
try {
boolean returned = java.util.regex.Pattern.compile(regex).matcher(input).find();
return "testFind : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testFind : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testReplaceAll(String regex, String input) {
try {
String returned = input.replaceAll(regex, "FOO");
return "testReplaceAll : Valid -> regex = "+regex+", input = "+input+", returned = "+returned;
} catch(Exception e) {
return "testReplaceAll : Invalid -> "+regex+", "+e.getMessage();
}
}
private static String testSplit(String regex, String input) {
try {
String[] returned = input.split(regex);
return "testSplit : Valid -> regex = "+regex+", input = "+input+", returned = "+java.util.Arrays.toString(returned);
} catch(Exception e) {
return "testSplit : Invalid -> "+regex+", "+e.getMessage();
}
}
public static void main(String[] args) {
String[] regexes = {"(?<=a*)B", "(?<=a+)B", "(?<=Ca*)B", "(?<=Ca*?)B"};
String input = "CaaaaaaaaaaaaaaaBaaaa";
int test = 0;
for(String regex : regexes) {
test++;
System.out.println("********************** Test "+test+" **********************");
System.out.println(" "+testFind(regex, input));
System.out.println(" "+testReplaceAll(regex, input));
System.out.println(" "+testSplit(regex, input));
System.out.println();
}
}
}
The output:
********************** Test 1 **********************
testFind : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a*)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 2 **********************
testFind : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = true
testReplaceAll : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = CaaaaaaaaaaaaaaaFOOaaaa
testSplit : Valid -> regex = (?<=a+)B, input = CaaaaaaaaaaaaaaaBaaaa, returned = [Caaaaaaaaaaaaaaa, aaaa]
********************** Test 3 **********************
testFind : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testReplaceAll : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
testSplit : Invalid -> (?<=Ca*)B, Look-behind group does not have an obvious maximum length near index 6
(?<=Ca*)B
^
********************** Test 4 **********************
testFind : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testReplaceAll : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
testSplit : Invalid -> (?<=Ca*?)B, Look-behind group does not have an obvious maximum length near index 7
(?<=Ca*?)B
^
My question may be obvious, but I'll still ask it: Can anyone explain to me why Test 1 and 2 work, and Test 3 and 4 fail? I would have expected them all to fail, not half of them to work and half of them to fail.
Thanks.
PS. I'm using: Java version 1.6.0_14

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,
a*
is implemented as
a{0,0x7FFFFFFF}
and
a+
is implemented as
a{1,0x7FFFFFFF}
which is why you see exactly the same behaviors for curlies and stars.
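This also suggests why a single leading character changes the outcome. The sketch below is a simplified model of how a maximum length could be accumulated with plain int arithmetic. It is an assumption about the mechanism, not the actual code in Pattern.java, and the names LookBehindLengthSketch and maxLengthIsValid are made up for illustration.

```java
public class LookBehindLengthSketch {
    // Hypothetical helper: adds the maximum length of whatever precedes the
    // quantified atom (prefixMax) to atomMax * maxRepetitions using int
    // arithmetic, and reports whether the sum survived without wrapping around.
    static boolean maxLengthIsValid(int prefixMax, int atomMax, int maxRepetitions) {
        int total = atomMax * maxRepetitions + prefixMax; // may overflow
        return total >= prefixMax; // a wrapped (negative) total means "no obvious maximum"
    }

    public static void main(String[] args) {
        // (?<=a*): no prefix, 'a' repeated up to 0x7FFFFFFF times
        System.out.println(maxLengthIsValid(0, 1, Integer.MAX_VALUE)); // true
        // (?<=Ca*): prefix 'C' of length 1, then the same repetition
        System.out.println(maxLengthIsValid(1, 1, Integer.MAX_VALUE)); // false
    }
}
```

Under this model, a{0,Integer.MAX_VALUE} on its own still fits into an int, while any extra character in front pushes the sum past Integer.MAX_VALUE, which matches the valid and invalid patterns listed in the question.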

It's a bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6695369
Pattern.compile() is always supposed to throw an exception if it can't determine the maximum possible length of a lookbehind match.

Related

Find the longest section of repeating characters

Imagine a string like this:
#*****~~~~~~**************~~~~~~~~***************************#
I am looking for an elegant way to find the indices of the longest contiguous section that contains a specific character. Let's assume we are searching for the * character; then I expect the method to return the start and end index of the last long section of *.
I am looking for an elegant way. I know I could just brute-force this by checking something like
indexOf(*)
lastIndexOf(*)
//Check if in between the indices is something else if so, remember length start from new
//substring and repeat until lastIndex reached
//Return saved indices
This is ugly brute force. Is there a more elegant way of doing this? I thought about regular expression groups and comparing their lengths. But how do I get the indices with that?

Regex-based solution
If you don't want to hardcode a specific character like * and instead want to "Find the longest section of repeating characters", as the title of the question states, then the proper regular expression for a section of repeated characters would be:
"(.)\\1*"
Here (.) is a group that consists of a single character, and \\1 is a backreference that refers to that group. * is a greedy quantifier, which means that the preceding backreference may be repeated zero or more times.
Finally, "(.)\\1*" captures a sequence of subsequent identical characters.
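To see what this pattern matches, here is a minimal sketch that lists every run of identical characters in a string with a plain find() loop. The class name RepeatedRuns and the helper runs are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedRuns {
    // One character followed by zero or more copies of itself.
    private static final Pattern RUN = Pattern.compile("(.)\\1*");

    // Collects each maximal run of identical characters, in order of appearance.
    static List<String> runs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RUN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(runs("aabbbc")); // [aa, bbb, c]
    }
}
```

Note that . does not match line terminators by default, so a run of newlines would be skipped by this sketch.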
Now to use it, we need to compile the regex into a Pattern. This action has a cost, hence if the regex is used multiple times it would be wise to declare a constant:
public static final Pattern REPEATED_CHARACTER_SECTION =
Pattern.compile("(.)\\1*");
Using features of modern Java, the longest sequence that matches the above pattern could be found literally with a single line of code.
Since Java 9 we have the method Matcher.results(), which returns a stream of MatchResult objects, each describing one match.
MatchResult.start() and MatchResult.end() give access to the start and end indices of the match. To extract the matched text itself, we need to invoke MatchResult.group().
This is how an implementation might look:
public static void printLongestRepeatedSection(String str) {
String longestSection = REPEATED_CHARACTER_SECTION.matcher(str).results() // Stream<MatchResult>
.map(MatchResult::group) // Stream<String>
.max(Comparator.comparingInt(String::length)) // find the longest string in the stream
.orElse(""); // or orElseThrow() if you don't want to allow an empty string to be received as an input
System.out.println("Longest section:\t" + longestSection);
}
To get the start and end indices as well, keep the MatchResult itself instead of mapping it to a string:
public static void printLongestRepeatedSection(String str) {
MatchResult longestSection = REPEATED_CHARACTER_SECTION.matcher(str).results() // Stream<MatchResult>
.max(Comparator.comparingInt(m -> m.group().length())) // find the longest match in the stream
.orElseThrow(); // throws an exception if an empty string was received as input
System.out.println("Section start: " + longestSection.start());
System.out.println("Section end: " + longestSection.end());
System.out.println("Longest section: " + longestSection.group());
}
Output:
Section start: 34
Section end: 61
Longest section: ***************************
Links:
Official tutorials on Lambda expressions and Stream API provided by Oracle
A quick tutorial on Regular expressions
Simple and Performant Iterative solution
You can do it without regular expressions by manually iterating over the indices of the given string and checking if the previous character matches the current one.
You just need to maintain a couple of variables denoting the start and the end of the longest previously encountered section, and a variable to store the starting index of the section that is being currently examined.
That's how it might be implemented:
public static void printLongestRepeatedSection(String str) {
if (str.isEmpty()) throw new IllegalArgumentException();
int maxStart = 0;
int maxEnd = 1;
int curStart = 0;
for (int i = 1; i < str.length(); i++) {
if (str.charAt(i) != str.charAt(i - 1)) { // current and previous characters are not equal
if (maxEnd - maxStart < i - curStart) { // current repeated section is longer than the maximum section discovered previously
maxStart = curStart;
maxEnd = i;
}
curStart = i;
}
}
if (str.length() - curStart > maxEnd - maxStart) { // checking the very last section
maxStart = curStart;
maxEnd = str.length();
}
System.out.println("Section start: " + maxStart);
System.out.println("Section end: " + maxEnd);
System.out.println("Section: " + str.substring(maxStart, maxEnd));
}
main()
public static void main(String[] args) {
String source = "#*****~~~~~~**************~~~~~~~~***************************#";
printLongestRepeatedSection(source);
}
Output:
Section start: 34
Section end: 61
Section: ***************************

Use methods of class Matcher.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Solution {
public static void main(String args[]) {
String str = "#*****~~~~~~**************~~~~~~~~***************************#";
Pattern pattern = Pattern.compile("\\*+");
Matcher matcher = pattern.matcher(str);
int max = 0;
while (matcher.find()) {
int length = matcher.end() - matcher.start();
if (length > max) {
max = length;
}
}
System.out.println(max);
}
}
The regular expression searches for occurrences of one or more asterisk (*) characters.
Method end returns the index of the first character after the last character matched and method start returns the index of the first character matched. Hence the length is simply the value returned by method end minus the value returned by method start.
Each subsequent call to method find starts searching from the end of the previous match.
The only thing left is to get the longest string of asterisks.
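One way to finish this is to remember the start and end of the best match while looping. The following is only a sketch along those lines; the class name LongestAsteriskRun and the method longestRun are hypothetical, and returning {-1, -1} when there is no asterisk is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestAsteriskRun {
    // Returns {start, end} of the longest run of '*' (end is exclusive),
    // or {-1, -1} if the string contains no asterisk.
    static int[] longestRun(String str) {
        Matcher matcher = Pattern.compile("\\*+").matcher(str);
        int bestStart = -1;
        int bestEnd = -1;
        while (matcher.find()) {
            // Strictly greater: on ties the earlier run is kept.
            if (matcher.end() - matcher.start() > bestEnd - bestStart) {
                bestStart = matcher.start();
                bestEnd = matcher.end();
            }
        }
        return new int[] {bestStart, bestEnd};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(longestRun("#**~***#"))); // [4, 7]
    }
}
```

The longest string of asterisks is then str.substring(start, end).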

The solution based on the regular expression may use some features from Stream API to get an array of indexes of the longest sequence of a given character:
Pattern.quote should be used to safely wrap the input character for the search within a regular expression
Stream<MatchResult> returned by Matcher::results provides necessary information about the start and end of the match
Stream::max selects the longest matching sequence
Optional::map and Optional::orElseGet help convert the match into the desired array of indexes
public static int[] getIndexes(char c, String str) {
return Pattern.compile(Pattern.quote(Character.toString(c)) + "+")
.matcher(str)
.results()
.max(Comparator.comparing(mr -> mr.end() - mr.start()))
.map(mr -> new int[]{mr.start(), mr.end()})
.orElseGet(() -> new int[]{-1, -1});
}
// Test:
System.out.println(Arrays.toString(getIndexes('*', "#*****~~~~~~**************~~~~~~~~***************************#")));
// -> [34, 61]

How can I split a string without knowing the split characters a-priori?

For my project I have to read various input graphs. Unfortunately, the input edges do not all have the same format. Some of them are comma-separated, others are tab-separated, etc. For example:
File 1:
123,45
67,89
...
File 2
123 45
67 89
...
Rather than handling each case separately, I would like to automatically detect the split characters. Currently I have developed the following solution:
String str = "123,45";
String splitChars = "";
for(int i=0; i < str.length(); i++) {
if(!Character.isDigit(str.charAt(i))) {
splitChars += str.charAt(i);
}
}
String[] endpoints = str.split(splitChars);
Basically I pick the first row and select all the non-numeric characters, then I use the generated substring as split characters. Is there a cleaner way to perform this?

Split requires a regexp, so your code can fail for several reasons. If the separator has a special meaning in a regexp (say, +), it will fail. If there is more than one non-digit character, it can also fail, because the collected characters are used together as one sequence. If your input contains more than exactly 2 numbers, it will fail as well: for 1,2,3 your splitChars string becomes ",,", which never occurs in the input, so the split does nothing. Likewise, a splitChars of " , " would split the string "test , abc" into two, and nothing else.
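The first failure mode is easy to reproduce. The sketch below shows that a lone + is not a valid regex, and that Pattern.quote turns a separator into a literal. The class and method names are illustrative only, and quoting addresses only the metacharacter problem, not the other two.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SplitSeparatorSketch {
    // Splits on the separator taken literally, whatever characters it contains.
    static String[] splitLiteral(String line, String separator) {
        return line.split(Pattern.quote(separator));
    }

    // Reports whether the given string compiles as a regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("+"));                            // false
        System.out.println(Arrays.toString(splitLiteral("123+45", "+"))); // [123, 45]
    }
}
```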
Why not make a regexp to fetch digits, and then find all sequences of digits, instead of focussing on the separators?
You're using regexps whether you want to or not, so let's make it official and use Pattern, while we are at it.
private static final Pattern ALL_DIGITS = Pattern.compile("\\d+");
// then in your split method..
Matcher m = ALL_DIGITS.matcher(str);
List<Integer> numbers = new ArrayList<Integer>();
// Don't use arrays, generally. List is better.
while (m.find()) {
numbers.add(Integer.parseInt(m.group(0)));
}
\\d+ means: one or more digits.
m.find() finds the next match (so, the next block of digits), returning false if there aren't any more.
m.group(0) retrieves the entire matched string.

Split the string on \\D+ which means one or more non-digit characters.
Demo:
import java.util.Arrays;
public class Main {
public static void main(String[] args) {
// Test strings
String[] arr = { "123,45", "67,89", "125 89", "678 129" };
for (String s : arr) {
System.out.println(Arrays.toString(s.split("\\D+")));
}
}
}
Output:
[123, 45]
[67, 89]
[125, 89]
[678, 129]
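One caveat worth knowing: if a line starts with a separator, split returns a leading empty string. A small sketch, assuming such lines can occur in the input files, filters those out. The names LeadingSeparator and digitRuns are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class LeadingSeparator {
    // Splits on runs of non-digits and drops empty pieces,
    // such as the one produced by a separator at the very start.
    static List<String> digitRuns(String line) {
        List<String> out = new ArrayList<>();
        for (String part : line.split("\\D+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(",12,34".split("\\D+").length); // 3, the first element is ""
        System.out.println(digitRuns(",12,34"));           // [12, 34]
    }
}
```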

Why not split with [^\d]+ (every run of non-digits):
for (String n : "123,456 789".split("[^\\d]+")) {
System.out.println(n);
}
Result:
123
456
789

Check if String has specific repetitive special characters only

I have a string of format ^%^%^%^%. I need to check if the string has nothing other than repetitive patterns of ^%
For example
1. ^%^%^%^% > Valid
2. ^%^%aa^%^% > Invalid
3. ^%^%^%^%^%^% > Valid
4. ^%^%^^%^% > Invalid
5. %^%^%^%^% > Invalid
How do I do this in Java?
I tried :
String text = "^%^%^%^%^%";
if (Pattern.matches(("[\\^\\%]+"), text)==true) {
System.out.println("Valid");
} else {
System.out.println("Invalid");
}
However it gives me Valid for cases 4 and 5.

In your pattern you use a character class which matches only 1 of the listed characters and then repeats that 1+ times.
You could use ^ to anchor the start of the string and $ to assert the end of the string.
Then match \\^% repeated 1+ times in between:
^(?:\\^%)+$

Try this pattern ^(?:\^%)+$
Explanation:
^ - match beginning of the string
(?:...) - non-capturing group
\^% - match ^% literally
(?:\^%)+ - match ^% one or more times
$ - match end of the string
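Applied to the five cases from the question, a minimal sketch could look like this. The class name CaretPercentCheck and the method isValid are illustrative only.

```java
import java.util.regex.Pattern;

public class CaretPercentCheck {
    // The whole string must be one or more repetitions of the pair "^%".
    private static final Pattern ONLY_PAIRS = Pattern.compile("^(?:\\^%)+$");

    static boolean isValid(String s) {
        return ONLY_PAIRS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] cases = {"^%^%^%^%", "^%^%aa^%^%", "^%^%^%^%^%^%", "^%^%^^%^%", "%^%^%^%^%"};
        for (String c : cases) {
            System.out.println(c + " > " + (isValid(c) ? "Valid" : "Invalid"));
        }
    }
}
```

Since matches() already requires the entire input to match, the ^ and $ anchors are redundant here, but they keep the pattern correct if it is later used with find().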

You can simply do:
if (str.replace("^%", "").isEmpty()) {
…
}
The replace method replaces the string as often as possible, therefore it fits exactly what you need.
It also matches the empty string, which, according to the specification, "contains nothing else than this pattern". In cases like these, you should always ask whether the empty string is meant as well.
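If the empty string should be rejected, one extra check is enough. The sketch below contrasts the two behaviours; both method names are hypothetical.

```java
public class EmptyStringCase {
    // Accepts the empty string, because nothing is left after removing "^%".
    static boolean onlyPairs(String s) {
        return s.replace("^%", "").isEmpty();
    }

    // Same check, but requires at least one "^%" pair.
    static boolean onlyPairsNonEmpty(String s) {
        return !s.isEmpty() && onlyPairs(s);
    }

    public static void main(String[] args) {
        System.out.println(onlyPairs(""));          // true
        System.out.println(onlyPairsNonEmpty(""));  // false
        System.out.println(onlyPairs("^%^%"));      // true
        System.out.println(onlyPairs("^%^^%"));     // false
    }
}
```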

String[] text = {"^%^%^%^%","^%^%aa^%^%","^%^%^%^%^%^%","^%^%^^%^%","%^%^%^%^%" };
for (String t: text) {
if (Pattern.matches("(?:\\^%)+", t)) { // one or more repetitions of the pair ^% and nothing else
System.out.println("Valid");
} else {
System.out.println("Invalid");
}
}

Regular Expression for Extracting Operands from Mathematical Expression

No question on SO addresses my particular problem. I know very little about regular expressions. I am building an expression parser in Java using the regex classes for that purpose. I want to extract operands, arguments, operators, symbols and function names from an expression and then save them to an ArrayList. Currently I am using this logic:
String string = "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)"; // This is just for testing; later on it will be provided by the user
List<String> res = new ArrayList<>();
Pattern pattern = Pattern.compile("(\\Q^\\E|\\Q/\\E|\\Q-\\E|\\Q-\\E|\\Q+\\E|\\Q*\\E|\\Q)\\E|\\Q)\\E|\\Q(\\E|\\Q(\\E|\\Q%\\E|\\Q!\\E)"); // This string was built in a function where operator names were provided, which means that the user can add custom operators and custom functions
Matcher m = pattern.matcher(string);
int pos = 0;
while (m.find())
{
if (pos != m.start())
{
res.add(string.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != string.length())
{
addToTokens(res, string.substring(pos));
}
for(String s : res)
{
System.out.println(s);
}
Output:
2
!
+
atan2
(
3
+
9
,
2
+
3
)
-
2
*
PI
+
3
/
3
-
9
-
12
%
3
*
sin
(
9
-
9
)
+
(
2
+
6
/
2
)
The problem is that the expression can now contain a matrix in a user-defined format. I want to treat every matrix as an operand, or as an argument in the case of functions.
Input 1:
String input_1 = "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6"
Output Should be:
2
+
3
-
9
*
[{2+3,2,6},{7,2+3,2+3i}]
+
9
*
6
Input 2:
String input_2 = "{[2,5][9/8,func(2+3)]}+9*8/5"
Output Should be:
{[2,5][9/8,func(2+3)]}
+
9
*
8
/
5
Input 3:
String input_3 = "<[2,9,2.36][2,3,2!]>*<[2,3,9][23+9*8/8,2,3]>"
Output Should be:
<[2,9,2.36][2,3,2!]>
*
<[2,3,9][23+9*8/8,2,3]>
I want the ArrayList to contain every operand, operator, argument, function and symbol at its own index. How can I achieve my desired output using regular expressions? Expression validation is not required.

I think you can try with something like:
(?<matrix>(?:\[[^\]]+\])|(?:<[^>]+>)|(?:\{[^\}]+\}))|(?<function>\w+(?=\())|(\d+[eE][-+]\d+)|(?<operand>\w+)|(?<operator>[-+\/*%])|(?<symbol>.)
DEMO
Elements are captured in named capturing groups. If you don't need them, you can use the short version:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\}|\d+[eE][-+]\d+|\w+(?=\()|\w+|[-+\/*%]|.
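The \[[^\]]+\]|<[^>]+>|\{[^\}]+\} part matches an opening bracket ({, [ or <), non-closing-bracket characters, and a closing bracket (}, ] or >), so if there are no nested same-type brackets, there is no problem.
Implementation in Java:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {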
public static void main(String[] args) {
String[] expressions = {"2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)", "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6",
"{[2,5][9/8,func(2+3)]}+9*8/5","<[2,9,2.36][2,3,2!]>*<[2,3,9][23 + 9 * 8 / 8, 2, 3]>"};
Pattern pattern = Pattern.compile("(?<matrix>(?:\\[[^]]+])|(?:<[^>]+>)|(?:\\{[^}]+}))|(?<function>\\w+(?=\\())|(?<operand>\\w+)|(?<operator>[-+/*%])|(?<symbol>.)");
for(String expression : expressions) {
List<String> elements = new ArrayList<String>();
Matcher matcher = pattern.matcher(expression);
while (matcher.find()) {
elements.add(matcher.group());
}
for (String element : elements) {
System.out.println(element);
}
System.out.println("\n\n\n");
}
}
}
Explanation of alternatives:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\} - match an opening bracket of a given
type, characters which are not the closing bracket of that type
(everything but the closing bracket), and the closing bracket of that
type,
\d+[eE][-+]\d+ - digits, followed by e or E, followed by the sign +
or -, followed by digits, to capture elements like 2e+3,
\w+(?=\() - match one or more word characters (A-Za-z0-9_) if they are
followed by (, for matching functions like sin,
\w+ - match one or more word characters (A-Za-z0-9_) for matching
operands,
[-+\/*%] - match one character from the character class, to match
operators,
. - match any other character, to match other symbols.
The order of the alternatives is quite important: the last alternative . will match any character, so it needs to be the last option. The case of \w+(?=\() and \w+ is similar: the second one matches everything the first one does. However, if you don't want to distinguish between functions and operands, \w+ alone is enough for both.
In the longer example, the (?<name> ... ) part in every alternative is a named capturing group, and you can see in the demo how it groups the matched fragments into groups like operand, operator, function, etc.

With Java's regular expressions you cannot match arbitrarily deep levels of nested balanced parentheses.
For example, in your second example {[2,5][9/8,func(2+3)]} you need to match the opening brace with the close brace, but you need to keep track of how many opening and closing inner braces/parens/etc there are. That cannot be done with regular expressions.
If, on the other hand, you simplify your problem to remove any requirement for balancing, then you probably can handle with regular expressions.
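If balancing does matter, a small hand-written scanner with a depth counter is one alternative to a regex. The following is only a sketch under several assumptions: [, { and < always open a matrix and ], } and > always close one (so < and > cannot also be comparison operators), the operator set is fixed, and no validation is done. The class name MatrixTokenizer and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MatrixTokenizer {
    private static final String OPEN = "[{<";
    private static final String CLOSE = "]}>";
    private static final String OPERATORS = "+-*/%^!(),";

    // Splits an expression into tokens, keeping everything between a
    // top-level opening bracket and its matching close as one token.
    static List<String> tokenize(String expr) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open matrix brackets
        for (char c : expr.toCharArray()) {
            if (OPEN.indexOf(c) >= 0) {
                if (depth == 0) flush(current, tokens); // a matrix starts a new token
                depth++;
                current.append(c);
            } else if (CLOSE.indexOf(c) >= 0) {
                if (depth > 0) depth--;
                current.append(c);
                if (depth == 0) flush(current, tokens); // matrix is complete
            } else if (depth == 0 && OPERATORS.indexOf(c) >= 0) {
                flush(current, tokens);
                tokens.add(String.valueOf(c));
            } else {
                current.append(c); // operand character, or anything inside a matrix
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Moves the pending token, if any, into the list.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        for (String token : tokenize("2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6")) {
            System.out.println(token);
        }
    }
}
```

Because the counter only tracks depth and not the bracket type, mismatched pairs such as [ closed by } are not detected, which is acceptable here only because validation is not required.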

Remove the intersections of multiple regular expressions?

Pattern[] a =new Pattern[2];
a[0] = Pattern.compile("[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d");
a[1] = Pattern.compile("Rs[.]?\\s*[\\d,]*[.]?\\d*\\d");
Example: Rs.150 is detected by a[1], and the 150 inside it is detected by a[0].
How can I remove such intersections, so that the input is detected only by a[1] and not by a[0]?

You can use the | operator inside your regular expression. Then call the method Matcher#group(int) to see which pattern your input matched. This method returns null if the given group did not take part in the match.
Sample code
public static void main(String[] args) {
// Build regexp
final String MONEY_REGEX = "[$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d";
final String RS_REGEX = "Rs[.]?\\s*[\\d,]*[.]?\\d*\\d";
// Separate them with '|' operator and wrap them in two distinct matching groups
final String MONEY_OR_RS = String.format("(%s)|(%s)", MONEY_REGEX, RS_REGEX);
// Prepare some sample inputs
String[] inputs = new String[] { "$100", "Rs.150", "foo" };
Pattern p = Pattern.compile(MONEY_OR_RS);
// Test each input
Matcher m = null;
for (String input : inputs) {
if (m == null) {
m = p.matcher(input);
} else {
m.reset(input);
}
if (m.matches()) {
System.out.println(String.format("m.group(1) => %s\nm.group(2) => %s\n", m.group(1), m.group(2)));
} else {
System.out.println(input + " doesn't match regexp.");
}
}
}
Output
m.group(1) => $100
m.group(2) => null
m.group(1) => null
m.group(2) => Rs.150
foo doesn't match regexp.
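The same alternation can also be used with find() to scan free text, since the matcher reports the leftmost match and Rs.150 is then consumed as a whole before its digits can be seen separately. This is a sketch only: the class name AmountScanner, the method classify and the RS:/MONEY: labels are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmountScanner {
    // The two expressions from the question; the pound and euro signs
    // are written as Unicode escapes to avoid source-encoding issues.
    private static final String MONEY_REGEX = "[$\u00a3\u20ac]?\\s*\\d*[\\.]?[pP]?\\d*\\d";
    private static final String RS_REGEX = "Rs[.]?\\s*[\\d,]*[.]?\\d*\\d";
    private static final Pattern MONEY_OR_RS =
            Pattern.compile("(" + MONEY_REGEX + ")|(" + RS_REGEX + ")");

    // Labels every amount found in the text by the group that matched it.
    static List<String> classify(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MONEY_OR_RS.matcher(text);
        while (m.find()) {
            if (m.group(2) != null) {
                found.add("RS:" + m.group(2));
            } else {
                found.add("MONEY:" + m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(classify("Paid Rs.150 and $20")); // [RS:Rs.150, MONEY:$20]
    }
}
```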

Use an initial test to switch between expressions. How fast and/or smart this initial test is is up to you.
In this case you could do something like:
if (input.startsWith("Rs.") && a[1].matcher(input).matches()) {
return true;
}
and put it in front of your method that does the testing.
Simply putting the most common regular expressions in front of the array may help as well of course.

Description
Use a negative lookahead to match the a[1] Rs.150 format while at the same time preventing the a[0] 150 format.
Generic expression: (?! the a[0] regex goes here ) followed by the a[1] expression
Basic regex statement: (?![$£€]?\s*\d*[\.]?[pP]?\d*\d)Rs[.]?\s*[\d,]*[.]?\d*\d
Escaped for Java: (?![$£€]?\\s*\\d*[\\.]?[pP]?\\d*\\d)Rs[.]?\\s*[\\d,]*[.]?\\d*\\d

