A regular expression captures more text than it needs

I want to capture the whole value 97.47, but the regular expression splits it into 9 and 7.47 and assigns them to different groups.
This is the regular expression in use:
private static final Pattern COMMISSION_PATTERN =
Pattern.compile(
"(total\\[((?:(?<totalFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
"((?<totalPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
"(\\s*min\\s*(?<totalMin>\\d+))?" +
"(\\s*max\\s*(?<totalMax>\\d+))?" +
"(\\s*round\\s*(?<totalRound>\\d+))?)?\\])?(\\s*)" +
"(partner\\[(?:(\\s*negative:\\s*(?<partnerNegative>(true|false))?\\s*,\\s*)?" +
"((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
"((?<partnerPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
"(\\s*min\\s*(?<partnerMin>\\d+))?" +
"(\\s*max\\s*(?<partnerMax>\\d+))?" +
"(\\s*round\\s*(?<partnerRound>\\d+))?" +
"(\\s*mode\\s*(?<partnerMode>\\w+))?)?\\])?");
The method receives the following value:
"total[0] partner[97.47%]"
It is parsed like this:
String sCommission = "total[0] partner[97.47%]";
for (String comm : sCommission.split("\n")) {
Matcher matcher = COMMISSION_PATTERN.matcher(comm.trim());
if (matcher.matches()) {
String sPartnerFixed = matcher.group("partnerFixed");//9
String sPartnerPercent = matcher.group("partnerPercent"); //7.47
And it should be:
String sPartnerFixed = matcher.group("partnerFixed"); //null
String sPartnerPercent = matcher.group("partnerPercent"); //97.47
I can't figure out where the error is in the regular expression

The (\s*(\+)\s*)? part of ((?<partnerFixed>\d+)(\s*(\+)\s*)?)? is optional, so the \d+ in the partnerFixed group ends up directly adjacent to the (?<partnerPercent>\d+(?:\.\d{1,2})?) part of the regex, which also requires one or more digits. With 97.47%, partnerFixed first takes 97, the rest of the pattern cannot match starting at the ., and the engine backtracks: partnerFixed gives up the 7 and keeps 9, and partnerPercent then matches 7.47. So the behavior you see is expected, unless you tell the regex engine that something obligatory must stand between these two number-matching parts.
A possible solution is a word boundary after \d+ in the (?<partnerFixed>\d+) part, i.e. replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+\\b)(\\s*(\\+)\\s*)?)?".
A more precise way to solve this is to make some part of the (\s*(\+)\s*)? pattern obligatory. That is, you do not expect a match for partnerFixed if there is a single run of digits optionally followed by . and one or two digits. If there is a partnerFixed number, what should separate it from the next value? Judging from the pattern, I think it should be either whitespace or a + surrounded by optional whitespace.
In this latter case, you can replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+)(\\s+|\\s*\\+\\s*))?".
See this regex demo.
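To see the difference between the optional and the obligatory separator, here is a minimal sketch. It uses a reduced pattern that keeps only the fixed and percent pieces of the partner[...] part, so it is not the full COMMISSION_PATTERN; the class name, the group names fixed and percent, and the helper parse are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartnerFixedDemo {
    // Reduced form of the original: the separator after the fixed amount is optional.
    static final Pattern LOOSE = Pattern.compile(
        "partner\\[((?<fixed>\\d+)(\\s*(\\+)\\s*)?)?((?<percent>\\d+(\\.\\d{1,2})?)\\s*%)?\\]");

    // Same pattern, but a fixed amount must be followed by whitespace or a "+".
    static final Pattern STRICT = Pattern.compile(
        "partner\\[((?<fixed>\\d+)(\\s+|\\s*\\+\\s*))?((?<percent>\\d+(\\.\\d{1,2})?)\\s*%)?\\]");

    // Returns {fixed, percent}, or null if the input does not match at all.
    static String[] parse(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("fixed"), m.group("percent") };
    }

    public static void main(String[] args) {
        String input = "partner[97.47%]";
        System.out.println(Arrays.toString(parse(LOOSE, input)));   // [9, 7.47]
        System.out.println(Arrays.toString(parse(STRICT, input)));  // [null, 97.47]
        System.out.println(Arrays.toString(parse(STRICT, "partner[5 + 10%]"))); // [5, 10]
    }
}
```

The strict variant still accepts a fixed amount followed by a percentage, as long as the two are separated by whitespace or a plus sign.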

Related

Pattern Matching to find trailing spaces outside of text fields in a line

I have to validate the lines from a text file. The line would be something like below.
"Field1" "Field2" "Field3 Field_3.1 Field3.2" 23 3445 "Field5".
The delimiter here is a single space (\s). If more than one space is present outside of the text fields, the line should be rejected.
Note: in the examples below, \s stands for a literal space character in the line; I wrote it as \s only for readability. For example:
Invalid:
"Field1"\\s\\s"Field2" "Field3 Field_3.1 Field3.2" 23\\s\\s3445 "Field5". //two or more spaces between "Field1" and "Field2" or numeric fields 23 3445. \s would be present as literal space and not as \s
Valid
"Field1\\s\\s" "\\s\\sField2" "Field3\\s\\sField_3.1\\s\\sField3.2" 23 3445 "Field5". //two or more spaces within third field "Field3 Field_3.1 Field3.2" or at the end/beginning of any field as in first two fields.
I created the pattern below to validate the spaces in between. But it doesn't work as expected when a field wrapped in double quotes contains more than two strings and a number, like "Field3 Field_3.1 123"
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceValidation
{
    public static void main(String ar[])
    {
        String spacePattern_1 = "[\"^\\n]\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s";
        String line1 = "Field3 Field_3.1 "; // valid, and the pattern doesn't flag it as invalid - works as expected
        String line2 = "Field3 Field_3.1 123"; // valid, but the pattern flags it as invalid - not working as expected
        Pattern pattern = Pattern.compile(spacePattern_1);
        Matcher matLine1 = pattern.matcher(line1);
        Matcher matLine2 = pattern.matcher(line2);
        if (matLine1.find())
        {
            System.out.println("Invalid Line1");
        }
        if (matLine2.find())
        {
            System.out.println("Invalid Line2");
        }
    }
}
I have tried another pattern, given below. But due to reported backtracking issues I have to avoid it. Even this one does not work when a line contains more than two subfields separated by two or more spaces.
(\".*\")\\s{2,}?(\".*\")|\\s\\s\\d|\\d\\s\\s
// * or . shouldn't be present more than once in the same condition to prevent backtracking, hence I have to use negation of \\n in the above code
Kindly let me know how I could resolve this using pattern for fields such as "field3 field3.1 123", which is a valid field. Thanks in advance.
EDIT:
After a little bit of tinkering, I narrowed the issue down to digits. The line becomes invalid only if the third subfield is numeric ("Field 3 Field3.1 123"). For alphabetic subfields it works fine.
Here in the pattern \\s\\s\\d seems to be the culprit. It's that condition that flags the third subfield as invalid(numeric subfield 123). But I need that to validate numeric fields present outside of the DoubleQuotes.

You can use
^(?:\"[^\"]*\"|\d+)(?:\s(?:\"[^\"]*\"|\d+))*$
If you are using it to extract lines from a multiline document:
(?m)^(?:\"[^\"\n\r]*\"|\d+)(?:\h(?:\"[^\"\n\r]*\"|\d+))*\r?$
See the regex demo.
Details:
^ - start of a string (line, if you use (?m) or Pattern.MULTILINE)
(?:\"[^\"]*\"|\d+) - either " + zero or more chars other than " + ", or one or more digits
(?:\s(?:\"[^\"]*\"|\d+))* - zero or more sequences of
\s - a single whitespace
(?:\"[^\"]*\"|\d+) - either " + zero or more chars other than " + ", or one or more digits
$ - end of string
The second pattern contains \h instead of \s to only match horizontal whitespaces, [^\"\n\r] matches any char other than ", line feed and carriage return.
In Java:
String pattern = "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$";
String multilinePattern = "(?m)^(?:\"[^\"\n\r]*\"|\\d+)(?:\\h(?:\"[^\"\n\r]*\"|\\d+))*\r?$";
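As a rough illustration of how the first pattern could be used to accept or reject a line, here is a small sketch. The class name and the isValid helper are hypothetical, and the sample lines are made up to contain real double spaces.

```java
import java.util.regex.Pattern;

class SingleSpaceValidator {
    // A line is a sequence of quoted fields or numbers,
    // separated by exactly one whitespace character.
    private static final Pattern LINE = Pattern.compile(
        "^(?:\"[^\"]*\"|\\d+)(?:\\s(?:\"[^\"]*\"|\\d+))*$");

    static boolean isValid(String line) {
        return LINE.matcher(line).find();
    }

    public static void main(String[] args) {
        // Double spaces inside a quoted field are allowed.
        System.out.println(isValid("\"Field1\" \"Field3  Field_3.1  123\" 23 3445 \"Field5\""));
        // Double spaces between fields are rejected.
        System.out.println(isValid("\"Field1\"  \"Field2\" 23 3445"));
        System.out.println(isValid("\"Field1\" 23  3445"));
    }
}
```

Because the quoted alternative consumes everything up to the closing quote, digits and repeated spaces inside a field never reach the separator check.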

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers having the same issue. Probably, it's just because I can't phrase it properly that I've not been able to find anything yet to solve my issue... I engage in modifying the title, or just close the question once someone will have helped me to figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the whitespace in between is some \t characters first, then some spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (here reporting only relevant, please visit the fiddle link to have a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, for Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?

Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them is to trim() your string to get rid of potential trailing whitespace and then split on whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
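Applied to the sample inputs from the question, the trim-and-split idea could look like the sketch below. The class name and the parse helper are hypothetical, and it assumes the input always holds exactly two whitespace-separated tokens.

```java
class TrimSplitDemo {
    // Returns {version, buildId}; assumes exactly two whitespace-separated tokens.
    static String[] parse(String data) {
        return data.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] samples = {
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 "
        };
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Version: " + parts[0] + ", build id: " + parts[1]);
        }
    }
}
```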

(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It captures two groups: the first holds the first section (v3.1.build.dev.12345.team), the second holds the last section (12345-12345-cici-12345).
It breaks down like this: (^v\w.+) requires the string to start with a v followed by a word character; .+ then greedily matches the rest of the line (any characters except line terminators, whitespace included) and backtracks until the remainder of the pattern can match, so group 1 may end with whitespace that you need to trim. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) matches the build id and ensures that it conforms to your specified format. Note that the dashes are included in the capture, which makes it easy to split the string afterwards; if you prefer, you could give each part its own capture group.
The pattern ends with \s* so that trailing whitespace does not get in the way. It uses * instead of + because the match should not fail when there is no trailing whitespace.

I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
The capital S in [\\S] means: match any character that is not whitespace, i.e. the opposite of [\\s]. .+ worked well in your case, but all it really says is: match one or more characters of any kind (except line terminators), whitespace included. This is not necessarily bad, but it would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, they are numbered from left to right as group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace; it cannot begin with whitespace, because the greedy [\\s]{1,} before it has already consumed it.
If you have any questions, feel free to ask. For the record, your current code is fine if you just add a + to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match a single whitespace character followed by one or more characters other than line terminators. Since your whitespace is followed by more whitespace, the first match can consist of whitespace only, which turns into an empty string once you strip it with replaceAll.
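The effect of the missing + can be reproduced with a small sketch. It assumes that the version is followed by at least two spaces before the newline (the question mentions trailing spaces that are not visible); the class name and the matches helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdMatchDemo {
    // Collects every match of the regex, with all whitespace stripped,
    // the same way the code in the question prints its matches.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: two spaces after the version, then a newline.
        String sample = "v3.1.build.dev.12345.team  \n12345-12345-cici-12345  ";
        System.out.println(matches("[\\s].+", sample));  // [, 12345-12345-cici-12345]
        System.out.println(matches("[\\s]+.+", sample)); // [12345-12345-cici-12345]
    }
}
```

With the single [\\s], the first match is just the two spaces before the newline, because . stops at the line terminator; with [\\s]+, the whole whitespace run, newline included, is consumed before .+ starts.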

TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the regex engine interprets the dot: by default, . does NOT match line terminators, so it stops matching just before the \n. To make it match them, pass the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
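The default behavior of the dot and the effect of Pattern.DOTALL can be checked with a tiny sketch; the class and method names are made up, and the pattern a.b is just a toy example.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Default mode: the dot does not match line terminators.
    static boolean matchesDefault(String s) {
        return Pattern.compile("a.b").matcher(s).matches();
    }

    // DOTALL mode: the dot matches any character, including line terminators.
    static boolean matchesDotAll(String s) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault("a\nb")); // false
        System.out.println(matchesDotAll("a\nb"));  // true
        System.out.println(matchesDefault("a-b"));  // true
    }
}
```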
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Subtle Java Regular Expressions

String str = "1234545";
String regex = "\\d*";
Pattern p1 = Pattern.compile(regex);
Matcher m1 = p1.matcher(str);
while (m1.find()) {
System.out.print(m1.group() + " found at index : ");
System.out.print(m1.start());
}
The output of this program is 1234545 found at index : 0 found at index : 7.
My question is:
Why is a space printed when there is actually no space in str?

The space printed between "0" and "found at index : 7" comes from the string literal " found at index : " that you print. It was supposed to follow the matched string; in this case, however, the match is empty.
Here is what's going on: the first match consumes all digits in the string, leaving zero characters for the following match. However, the following match succeeds, because the asterisk * in your expression allows matching empty strings.
To avoid this confusion in the future, add delimiter characters around the actual match, like this:
System.out.print("'" + m1.group() + "' at index : ");
Now you would see an empty pair of single quotes, showing that the match was empty.
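A small sketch of this, with a hypothetical helper that quotes each match, also shows that switching from \d* to \d+ removes the empty match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Lists every match wrapped in single quotes together with its start index.
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("'" + m.group() + "' at index " + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // \d* also matches the empty string at the end of the input.
        System.out.println(describeMatches("\\d*", "1234545")); // ['1234545' at index 0, '' at index 7]
        // \d+ requires at least one digit, so there is no empty match.
        System.out.println(describeMatches("\\d+", "1234545")); // ['1234545' at index 0]
    }
}
```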

Java regex for matching multiple keys in a string

Consider an input string like
Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5
and the regular expression
\b(TWO|FOUR)=([^ ]*)\b
Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).
public static void main(String[] args) throws Exception {
String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
}
}
More specifically, the main() method above prints
TWO = 2
FOUR = 4
but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.
Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes
\b(O.*?|T.*?)=([^ ]*)\b
the above method yields
ONE = 1
TWO = 2
THREE = 3
If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been
ONE = 1
TWO = 2
So, two questions:
Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?

Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings
With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b
You can use it like this:
Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
    String theMatch = regexMatcher.group();
    String FoundToken = regexMatcher.group(1);
    int EndPosition = regexMatcher.end(); // end() returns an int offset, not a String
}
You could then:
Test the value contained by FoundToken
Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
Using EndPosition, apply that regex to the remainder of the string, i.e. the part after the previous match.
Discussion
This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
It also serves your goal of avoiding duplicates.
Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.
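Put together, the steps above could be sketched as follows. This is only one way to read the approach: the class name, the extract helper, and the extended input string are made up for illustration, and the regex is rebuilt from the keys that have not been found yet.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OncePerKeyExtractor {
    // Finds each key at most once; after a hit, the key is dropped from the alternation
    // and the search continues from the end of the previous match.
    static Map<String, String> extract(String input, List<String> keys) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Map<String, String> found = new LinkedHashMap<>();
        int position = 0;
        while (!remaining.isEmpty()) {
            String alternation = remaining.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
            Matcher m = Pattern.compile("\\b(" + alternation + ")=([^ ]*)\\b").matcher(input);
            if (!m.find(position)) {
                break; // none of the remaining keys occurs in the rest of the input
            }
            found.put(m.group(1), m.group(2));
            remaining.remove(m.group(1));
            position = m.end();
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and TWO=22";
        System.out.println(extract(input, List.of("TWO", "FOUR"))); // {TWO=2, FOUR=4}
    }
}
```

Recompiling the pattern on every hit has a cost of its own, so this is about avoiding duplicate keys rather than raw speed.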

Regex to match only commas not in parentheses?

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?

Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.
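For instance, the same lookahead (written without comments) can be used to split the sample string from the question. The class name and the split helper are hypothetical, and the sketch keeps the assumption that parentheses are not nested.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CommaLookaheadDemo {
    // A comma that is not followed by a closing parenthesis
    // without an opening parenthesis in between.
    private static final Pattern OUTSIDE_COMMA = Pattern.compile(",(?![^(]*\\))");

    static String[] split(String s) {
        return OUTSIDE_COMMA.split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("12,44,foo,bar,(23,45,200),6")));
        // [12, 44, foo, bar, (23,45,200), 6]
    }
}
```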

Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also, the existing solution checks that the comma is not followed by a closing parenthesis, but that does not guarantee that the comma is actually enclosed in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) throws java.lang.Exception {
        String subject = "12,44,foo,bar,(23,45,200),6";
        Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
        Matcher regexMatcher = regex.matcher(subject);
        List<String> group1Caps = new ArrayList<String>();
        // put Group 1 captures in a list
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                group1Caps.add(regexMatcher.group(1));
            }
        } // end of building the list
        // What are all the matches?
        System.out.println("\n" + "*** Matches ***");
        if (group1Caps.size() > 0) {
            for (String match : group1Caps) System.out.println(match);
        }
    } // end main
} // end Program
Here is a live demo
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
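As a rough sketch of the splitting variant (not taken from the referenced article; the class and method names are made up), you can walk through the matches and cut the string only where Group 1 is set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitOutsideParens {
    // Left side: a complete parenthesized block (ignored). Right side: a comma captured in Group 1.
    private static final Pattern TOKEN = Pattern.compile("\\(.*?\\)|(,)");

    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(subject);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {           // a comma outside parentheses
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
        }
        parts.add(subject.substring(start));    // text after the last separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("12,44,foo,bar,(23,45,200),6"));
        // [12, 44, foo, bar, (23,45,200), 6]
    }
}
```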

I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
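// Text before the opening parenthesis plus text after the closing one.
String beforeParen = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
    /* do something; note that the index refers to beforeParen, not to longString. */
    firstComma = beforeParen.indexOf(',', firstComma + 1);
}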
(Of course this assumes that there is always exactly one opening parenthesis and one matching closing parenthesis somewhere after it.)
