Subtle Java Regular Expressions

String str = "1234545";
String regex = "\\d*";
Pattern p1 = Pattern.compile(regex);
Matcher m1 = p1.matcher(str);
while (m1.find()) {
System.out.print(m1.group() + " found at index : ");
System.out.print(m1.start());
}
The output of this program is 1234545 found at index : 0 found at index : 7.
My question is:
Why is a space printed when there is actually no space in str?

The space printed after the first index and before the second "found" comes from the string literal " found at index : " that you print. It was supposed to follow the matched string; in this case, however, the match is empty.
Here is what's going on: the first match consumes all digits in the string, leaving zero characters for the following match. However, the following match succeeds, because the asterisk * in your expression allows matching empty strings.
To avoid this confusion in the future, add delimiter characters around the actual match, like this:
System.out.print("'" + m1.group() + "' at index : ");
Now you would see an empty pair of single quotes, showing that the match was empty.
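To make the empty match visible, here is a minimal, self-contained sketch of the idea above. The class and method names (EmptyMatchDemo, describeMatches) are made up for illustration; the second call uses \d+ only to contrast it with \d*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Returns one entry per match, with the matched text wrapped in single quotes.
    static List<String> describeMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add("'" + m.group() + "' at index " + m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        // \d* may match zero digits, so a second, empty match is reported at the end of the input.
        describeMatches("\\d*", "1234545").forEach(System.out::println);
        // \d+ needs at least one digit, so only the non-empty match is reported.
        describeMatches("\\d+", "1234545").forEach(System.out::println);
    }
}
```

If the empty match is not wanted at all, switching from * to + is usually the simpler fix.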

Related

A regular expression captures more text than it needs

I want to get the whole value 97.47 but the regular expression splits it by 9 and by 7.47 adding it to different fields
This is the regular expression that is used
private static final Pattern COMMISSION_PATTERN =
Pattern.compile(
"(total\\[((?:(?<totalFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
"((?<totalPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
"(\\s*min\\s*(?<totalMin>\\d+))?" +
"(\\s*max\\s*(?<totalMax>\\d+))?" +
"(\\s*round\\s*(?<totalRound>\\d+))?)?\\])?(\\s*)" +
"(partner\\[(?:(\\s*negative:\\s*(?<partnerNegative>(true|false))?\\s*,\\s*)?" +
"((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
"((?<partnerPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
"(\\s*min\\s*(?<partnerMin>\\d+))?" +
"(\\s*max\\s*(?<partnerMax>\\d+))?" +
"(\\s*round\\s*(?<partnerRound>\\d+))?" +
"(\\s*mode\\s*(?<partnerMode>\\w+))?)?\\])?");
The following value arrives in the method
"total[0] partner[97.47%]"
it is parsed in this way:
String sCommission = "total[0] partner[97.47%]";
for (String comm : sCommission.split("\n")) {
Matcher matcher = COMMISSION_PATTERN.matcher(comm.trim());
if (matcher.matches()) {
String sPartnerFixed = matcher.group("partnerFixed");//9
String sPartnerPercent = matcher.group("partnerPercent"); //7.47
And it should be:
String sPartnerFixed = matcher.group("partnerFixed"); //null
String sPartnerPercent = matcher.group("partnerPercent"); //97.47
I can't figure out where the error is in the regular expression

The (\s*(\+)\s*)? part in the ((?<partnerFixed>\d+)(\s*(\+)\s*)?)? part is optional, and \d+ in the partnerFixed group becomes "adjacent" (it can be backtracked into) to the (?<partnerPercent>\d+(?:\.\d{1,2})?) part of the regex (where \d+ also is required and matches one or more digits). So, this behavior you have is expected, unless you tell the regex engine to clearly have an obligatory pattern between these two number matching parts.
A possible solution would be a word boundary after \d+ in the (?<partnerFixed>\d+) part, i.e. replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+\\b)(\\s*(\\+)\\s*)?)?".
A more sophisticated and more precise way to solve this issue is to make some part of the (\s*(\+)\s*)? pattern obligatory. That is, you do not expect a match for partnerFixed if there is a single streak of digits optionally followed with . and one or two digits. If there is a partnerFixed number, what should it be separated with from the next value? I think there should be a whitespace or + enclosed with optional whitespaces, just deducing it from the pattern.
In this latter case, you can replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+)(\\s+|\\s*\\+\\s*))?".
See this regex demo.
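The effect is easier to see on a reduced pattern. The sketch below keeps only the partner[...] fixed and percent parts; the class name, the shortened group names (fixed, percent) and the three constants are hypothetical and are not the full pattern from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommissionGroupsDemo {
    // Reduced, hypothetical versions of the partner[...] part of the pattern.
    static final String ORIGINAL =
        "partner\\[((?<fixed>\\d+)(\\s*(\\+)\\s*)?)?((?<percent>\\d+(\\.\\d{1,2})?)\\s*%)?\\]";
    static final String WORD_BOUNDARY =
        "partner\\[((?<fixed>\\d+\\b)(\\s*(\\+)\\s*)?)?((?<percent>\\d+(\\.\\d{1,2})?)\\s*%)?\\]";
    static final String REQUIRED_SEPARATOR =
        "partner\\[((?<fixed>\\d+)(\\s+|\\s*\\+\\s*))?((?<percent>\\d+(\\.\\d{1,2})?)\\s*%)?\\]";

    // Returns "fixed=<value>, percent=<value>", or "no match" if the whole input does not match.
    static String parse(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        return "fixed=" + m.group("fixed") + ", percent=" + m.group("percent");
    }

    public static void main(String[] args) {
        String input = "partner[97.47%]";
        // The digits are split between the two adjacent \d+ parts.
        System.out.println(parse(ORIGINAL, input));
        // Both fixes leave "fixed" unset and give the whole number to "percent".
        System.out.println(parse(WORD_BOUNDARY, input));
        System.out.println(parse(REQUIRED_SEPARATOR, input));
        // A fixed amount followed by a percentage still works.
        System.out.println(parse(REQUIRED_SEPARATOR, "partner[5 + 10%]"));
    }
}
```

One thing to watch in this reduced pattern: with the mandatory separator, a lone fixed amount such as partner[5] no longer matches, while the word-boundary variant still accepts it. Whether that matters depends on which inputs you need to support.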

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers with the same issue. That is probably because I can't phrase it properly, which is also why I haven't found anything yet to solve my issue. I will update the title, or close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I am interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the gap in between consists of several \t characters followed by spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or is there a better way to write the regexes themselves (I'm open to improvements, I'm no regex expert)?

Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the remaining whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
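As a rough, runnable version of the trim-and-split idea, here it is applied to the four sample strings from the question. The class name and the length check are additions for illustration, not part of the answer above.

```java
import java.util.Arrays;
import java.util.List;

class TrimSplitDemo {
    // Splits "version <whitespace> build-id" into its two non-whitespace parts.
    static String[] parse(String data) {
        String[] parts = data.trim().split("\\s+");
        if (parts.length != 2) {
            // Guard against input that does not have exactly two fields.
            throw new IllegalArgumentException("Expected exactly two fields: " + data);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345",
            "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ");
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Version: \"" + parts[0] + "\", build-id: \"" + parts[1] + "\"");
        }
    }
}
```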

(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character; .+ then greedily matches any characters except line terminators (whitespace included) and backtracks until the rest of the pattern can match, so group 1 may end with trailing tabs or spaces that you will want to trim. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) matches the build id, ensuring that it conforms to your specified formatting. Note that this still captures the dashes, making it easy to split the string afterwards to get the information you need. If you want, you could even make these parts their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.

I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] means: match any character that [\\s] does not match, i.e. any non-whitespace character. .+ worked in your case, but all it really says is: match one or more characters of any kind (except line terminators), whitespace included. This is not necessarily bad, but it would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one occurrence.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the opening parentheses, counted from left to right, define group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace; it cannot begin with whitespace because group 2 has already consumed all of it.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match exactly one whitespace character followed by one or more characters that are not line terminators. When the version is followed by several spaces and then a \n, the first match is just that run of spaces (. stops before the \n), which becomes "" once you strip the whitespace; the build id is only found by the second match.
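To compare the two build-id patterns side by side, here is a small sketch. The input is an assumption: a sample-3-like string with three spaces before the \n, since the empty match only shows up when more than one whitespace character precedes the newline. Class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdMatchDemo {
    // Collects every match of the regex, with all whitespace stripped from it.
    static List<String> strippedMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group().replaceAll("\\s", ""));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed input: several spaces before the newline, as the question hints at.
        String input = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345  ";
        // One whitespace character, then .+ : the first match is only the run of spaces.
        System.out.println(strippedMatches("[\\s].+", input));
        // One or more whitespace characters: the newline is consumed too, leaving a single match.
        System.out.println(strippedMatches("[\\s]+.+", input));
    }
}
```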

TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how . is interpreted: by default, . DOES NOT match line terminators, so .+ stops matching just before the \n. To change that, you need to add the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
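A quick way to check the default behaviour of . against Pattern.DOTALL is a sketch like the following; the class name, method name and the two-line test string are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns the first match of ".+" in the input under the given flags, or null if none.
    static String firstMatch(String input, int flags) {
        Matcher m = Pattern.compile(".+", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "first\nsecond";
        // Default flags: the match ends just before the line terminator.
        System.out.println("[" + firstMatch(input, 0) + "]");
        // DOTALL: . also matches \n, so the match spans both lines.
        System.out.println("[" + firstMatch(input, Pattern.DOTALL) + "]");
    }
}
```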
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Java Regex. group excluding delimiters

I'm trying to split my string using regex. It should include even zero-length matches before and after every delimiter. For example, if the delimiter is ^ and my string is ^^^, I expect to get 4 zero-length groups.
I cannot use just regex = "([^\\^]*)" because it will include extra zero-length matches after every true match between delimiters.
So I have decided to match non-delimiter symbols that follow either the beginning of the line or a delimiter. It works perfectly on https://regex101.com/ (I'm sorry, I couldn't find a share option on this website to share my example), but in IntelliJ IDEA it skips one match.
So, now my code is:
final String regex = "(^|\\^)([^\\^]*)";
final String string = "^^^^";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find())
System.out.println("[" + matcher.start(2) + "-" + matcher.end(2) + "]: \"" + matcher.group(2) + "\"");
and I expect 5 empty-string matches. But I have only 4:
[0-0]: ""
[2-2]: ""
[3-3]: ""
[4-4]: ""
The question is why does it skip [1-1] match and how can I fix it?

Your regex matches either the start of the string or a ^ (capturing that into Group 1) and then any 0+ chars other than ^ into Group 2. When the first match is found (at the start of the string), Group 1 holds an empty string (as it matched the start of the string) and Group 2 also holds an empty string (as the first char is ^ and [^^]* can match an empty string before a non-matching char). The whole match is zero-length, so the regex engine advances the search position by one character to avoid finding the same empty match again. So, after the first match, the search resumes at the position after the first ^. Then the second match is found: the second ^ and the empty string after it. Hence, the first ^ is never matched; it is skipped.
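The skipped position can be observed directly by recording where each match starts. In the sketch below, the first call uses the regex from the question (without Pattern.MULTILINE, which makes no difference for a single-line input). The second call uses a lookbehind variant that is not part of this answer; it is shown only as an assumption-laden alternative for the case where you want to stay with Matcher instead of split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchAdvanceDemo {
    // Returns the start offset of the given group for every match found.
    static List<Integer> groupStarts(String regex, int group, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start(group));
        }
        return starts;
    }

    public static void main(String[] args) {
        // The delimiter is consumed as part of the match, so offset 1 is never reported.
        System.out.println(groupStarts("(^|\\^)([^\\^]*)", 2, "^^^^"));
        // A lookbehind checks for the delimiter without consuming it (illustrative variant).
        System.out.println(groupStarts("(?<=^|\\^)[^\\^]*", 0, "^^^^"));
    }
}
```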
The solution is a simple split one:
String[] result = string.split("\\^", -1);
The negative second argument (the limit) makes split keep all trailing empty strings in the resulting array; with the default limit of 0 they would be discarded.
See a Java demo:
String str = "^^^^";
String[] result = str.split("\\^", -1);
System.out.println("Number of items: " + result.length);
for (String s: result) {
System.out.println("\"" + s+ "\"");
}
Output:
Number of items: 5
""
""
""
""
""
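To see what the limit argument changes, the following sketch counts the fields returned for a limit of 0 (the same as calling split with one argument) and for -1. The class name, helper name and the second test string are made up for the example.

```java
class SplitLimitDemo {
    // Number of fields produced when splitting on a literal ^ with the given limit.
    static int countFields(String input, int limit) {
        return input.split("\\^", limit).length;
    }

    public static void main(String[] args) {
        // Limit 0 drops trailing empty strings; here every field is empty, so nothing is left.
        System.out.println(countFields("^^^^", 0));
        // A negative limit keeps them all.
        System.out.println(countFields("^^^^", -1));
        // Only trailing empty fields are affected by the limit.
        System.out.println(countFields("a^b^^", 0));
        System.out.println(countFields("a^b^^", -1));
    }
}
```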

How to use Substring when String length is not fixed everytime

I have string something like :
SKU: XP321654
Quantity: 1
Order date: 01/08/2016
The SKU length is not fixed, so my function sometimes also returns the first one or two characters of Quantity, which I do not want. I want to get only the SKU value.
My Code :
int index = Content.indexOf("SKU:");
String SKU = Content.substring(index, index+15);
If the SKU has one or two more digits, it gets cut off because I have hard-coded the limit of 15. If I use index + 16 to fit a long SKU, then for a short SKU it also returns some characters of Quantity.
How can I solve this? Is there a way to avoid using a fixed character count as the limit?
The last character of my SKU will always be a digit, so is there anything I can use to get only the SKU up to its last digit?

Using .substring is simply not the way to process such things. What you need is a regex (or regular expression):
Pattern pat = Pattern.compile("SKU\\s*:\\s*(\\S+)");
String sku = null;
Matcher matcher = pat.matcher(Content);
if(matcher.find()) { //we've found a match
sku = matcher.group(1);
}
//do something with sku
Unescaped the regex is something like:
SKU\s*:\s*(\S+)
You are thus looking for a pattern that starts with SKU, followed by zero or more \s (spacing characters like space and tab), followed by a colon (:), then zero or more spacing characters (\s) again, and finally the part you are interested in: one or more (that's the meaning of +) non-spacing characters (\S). Putting that part in parentheses makes it a capturing group. If the regex succeeds in finding the pattern (matcher.find()), you can extract the content of the group with matcher.group(1) and store it in a string.
You can potentially improve the regex further if you know more about what a SKU looks like. For instance, if it consists only of uppercase letters and digits, you can replace \S with [0-9A-Z], so the pattern becomes:
Pattern pat = Pattern.compile("SKU\\s*:\\s*([0-9A-Z]+)");
EDIT: for the quantity data, you could use:
Pattern pat2 = Pattern.compile("Quantity\\s*:\\s*(\\d+)");
int qt = -1;
Matcher matcher = pat2.matcher(Content);
if(matcher.find()) { //we've found a match
qt = Integer.parseInt(matcher.group(1));
}
or see this jdoodle.
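Put together as something you can run, the two patterns might be used like this. The class and method names are hypothetical, and returning null or -1 when nothing is found simply mirrors the defaults in the snippets above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderFieldsDemo {
    private static final Pattern SKU = Pattern.compile("SKU\\s*:\\s*(\\S+)");
    private static final Pattern QUANTITY = Pattern.compile("Quantity\\s*:\\s*(\\d+)");

    // Returns the SKU value, or null if the content has no SKU field.
    static String findSku(String content) {
        Matcher m = SKU.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    // Returns the quantity, or -1 if the content has no Quantity field.
    static int findQuantity(String content) {
        Matcher m = QUANTITY.matcher(content);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String content = "SKU: XP321654\nQuantity: 1\nOrder date: 01/08/2016";
        System.out.println("SKU = " + findSku(content));
        System.out.println("Quantity = " + findQuantity(content));
    }
}
```

Because \S+ stops at the first whitespace character, the length of the SKU no longer matters.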

You know you can just refer to the length of the string right ?
String s = "SKU: XP321654";
String sku = s.substring(4, s.length()).trim();
I think using a regex is clearly overkill in this case, it is way way simpler than this. You can even split the expression although it's a bit less efficient than the solution above, but please don't use a regex for this !
String sku = "SKU: XP321654".split(":")[1].trim();

1: You have to split your input into lines (for example, split on \n).
2: Once you have a line, search for the : and take the remainder of the line (using the string length, as mentioned in Dici's answer).
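The two steps above can be sketched as follows. The class and method names are invented, and splitting on \R (available since Java 8) is one possible choice that handles \n and \r\n alike.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LineFieldsDemo {
    // Parses "Name: value" lines into a map, keeping the order of the lines.
    static Map<String, String> parseFields(String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : content.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" line, skip it
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(name, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String content = "SKU: XP321654\r\nQuantity: 1\r\nOrder date: 01/08/2016";
        Map<String, String> fields = parseFields(content);
        System.out.println("SKU = " + fields.get("SKU"));
        System.out.println("Quantity = " + fields.get("Quantity"));
    }
}
```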

Depending on how exactly the string contains new lines, you could do this:
public static void main(String[] args) {
String s = "SKU: XP321654\r\n" +
"Quantity: 1\r\n" +
"Order date: 01/08/2016";
System.out.println(s.substring(s.indexOf(": ") + 2, s.indexOf("\r\n")));
}
Just note that this 1-liner has several restrictions:
The SKU property has to be first. If not, then modify the start index appropriately to search for "SKU: ".
The lines might be separated differently (for example by \n alone); \R is a regex construct (Java 8 and later) that matches any line break sequence.

Match only first and last character of a string

I had a look at other stackoverflow questions and couldn't find one that asked the same question, so here it is:
How do you match the first and last characters of a string (can be multi-line or empty).
So for example:
String = "this is a simple sentence"
Note that the string includes the beginning and ending quotation marks.
How do I match the first and last characters when the string begins and ends with a quotation mark (")?
I tried:
^"|$" and \A"\Z"
but these do not produce the desired result.
Thanks for your help in advance :)

Is this what you are looking for?
String input = "\"this is a simple sentence\"";
String result = input.replaceFirst("(?s)^\"(.*)\"$", " $1 ");
This will replace the first and last character of the input string with spaces if it starts and ends with ". It will also work across multiple lines since the DOTALL flag is specified by (?s).

The regex that matches the whole input is ".*". In Java, it looks like this:
String regex = "\".*\"";
System.out.println("\"this is a simple sentence\"".matches(regex)); // true
System.out.println("this is a simple sentence".matches(regex)); // false
System.out.println("this is a simple sentence\"".matches(regex)); // false
If you want to remove the quotes, use this:
String input = "\"this is a simple sentence\"";
input = input.replaceAll("(^\"|\"$)", ""); // this is a simple sentence (without any quotes)
If you want this to work over multiple lines, use this:
String input = "\"this is a simple sentence\"\n\"and another sentence\"";
System.out.println(input + "\n");
input = input.replaceAll("(?m)(^\"|\"$)", "");
System.out.println(input);
which produces output:
"this is a simple sentence"
"and another sentence"
this is a simple sentence
and another sentence
Explanation of regex (?m)(^"|"$):
(?m) means "Caret and dollar match after and before newlines for the remainder of the regular expression"
(^"|"$) means ^" OR "$, which means "start of line then a double quote" OR "double quote then end of line"

Why not use the simple logic of getting the first and last characters based on charAt method of String? Place a few checks for empty/incomplete strings and you should be done.
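A minimal sketch of that charAt approach could look like this. The class and method names are hypothetical, and treating null and strings shorter than two characters as "not quoted" is an assumption about the desired behaviour.

```java
class QuoteCheckDemo {
    // True if the string both starts and ends with a double quote (two separate characters).
    static boolean isQuoted(String s) {
        return s != null
            && s.length() >= 2
            && s.charAt(0) == '"'
            && s.charAt(s.length() - 1) == '"';
    }

    // Removes the surrounding quotes if present; otherwise returns the input unchanged.
    static String stripQuotes(String s) {
        return isQuoted(s) ? s.substring(1, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        String input = "\"this is a simple sentence\"";
        System.out.println(isQuoted(input));
        System.out.println(stripQuotes(input));
        // Line breaks inside the string need no special handling with charAt.
        System.out.println(isQuoted("\"two\nlines\""));
    }
}
```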

String regexp = "(?s)\".*\"";
String data = "\"This is some\n\ndata\"";
Matcher m = Pattern.compile(regexp).matcher(data);
if (m.find()) {
System.out.println("Match starts at " + m.start() + " and ends at " + m.end());
}
