Extracting segments of a semver version string via Java regex


Java 8 here. I am trying to parse a semver (or at least, my flavor of semver) string and extract out its main segments:
Major version number
Minor version number
Patch number
Qualifier (RC, SNAPSHOT, RELEASE, etc.)
Here is my code:
String version = "1.0.1-RC";
Pattern versionPattern = Pattern.compile("^[1-9]\\d*\\.\\d+\\.\\d+(?:-[a-zA-Z0-9]+)?$");
Matcher matcher = versionPattern.matcher(version);
if (matcher.matches()) {
System.out.println("\n\n\matching version is: " + matcher.group(0));
System.out.println("\nmajor #: " + matcher.group(1));
System.out.println("\nminor #: " + matcher.group(2));
System.out.println("\npatch #: " + matcher.group(3));
System.out.println("\nqualifier: " + matcher.group(4) + "\n\n\n");
}
When this runs, I get the following output on the console:
matching version is: 1.0.1-RC
2019-10-18 14:32:05,952 [main] 84b37cef-70f9-4ab8-bafb-005821699766 ERROR c.s.f.s.listeners.StartupListener - java.lang.IndexOutOfBoundsException: No group 1
What do I need to do to my regex and/or use of the Matcher API so that I can extract:
1 as the major number
0 as the minor number
1 as the patch number
RC as the qualifier
Any ideas?

NOTE:
Do not escape m in a string literal: \m is not a valid string escape sequence, so the code will not compile
Matcher#matches() requires a full string match, so the ^ and $ anchors are redundant
To be able to reference Matcher#group(n), you need to define the groups in the pattern in the first place. Wrap the parts you need with pairs of unescaped parentheses.
Use
String version = "1.0.1-RC";
Pattern versionPattern = Pattern.compile("([1-9]\\d*)\\.(\\d+)\\.(\\d+)(?:-([a-zA-Z0-9]+))?");
Matcher matcher = versionPattern.matcher(version);
if (matcher.matches()) {
System.out.println("matching version is: " + matcher.group(0));
System.out.println("major #: " + matcher.group(1));
System.out.println("minor #: " + matcher.group(2));
System.out.println("patch #: " + matcher.group(3));
System.out.println("qualifier: " + matcher.group(4) + "\n\n\n");
}
Output:
matching version is: 1.0.1-RC
major #: 1
minor #: 0
patch #: 1
qualifier: RC
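One detail of the pattern above is worth a small illustration: group 4 sits inside an optional non-capturing group, so for a version without a qualifier Matcher#group(4) returns null rather than an empty string. The following is a minimal sketch only; the class name SemverParts and the parse helper are hypothetical and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SemverParts {
    // Same pattern as in the answer above, compiled once and reused.
    private static final Pattern VERSION =
            Pattern.compile("([1-9]\\d*)\\.(\\d+)\\.(\\d+)(?:-([a-zA-Z0-9]+))?");

    // Hypothetical helper: returns {major, minor, patch, qualifier}.
    // The qualifier is null when absent; the result is null when nothing matches.
    static String[] parse(String version) {
        Matcher m = VERSION.matcher(version);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String[] rc = parse("1.0.1-RC");
        System.out.println(rc[0] + " " + rc[1] + " " + rc[2] + " " + rc[3]); // 1 0 1 RC
        String[] plain = parse("2.3.4");
        System.out.println(plain[3]); // null: group 4 did not take part in the match
    }
}
```

Callers that print or concatenate the qualifier may want to check for null first.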

I was looking around on the semver specification website and found a regex there that works. I adapted it slightly to use Java's named groups; here it is in case anyone needs it:
final static String version_regex = "^(?<major>0|[1-9]\\d*)\\.(?<minor>0|[1-9]\\d*)\\.(?<patch>0|[1-9]\\d*)(?:-(?<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+(?<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$";
// After a successful match, read the named groups
String major = matcher.group("major");
String minor = matcher.group("minor");
String patch = matcher.group("patch");
String prerelease = matcher.group("prerelease");
String buildmetadata = matcher.group("buildmetadata");
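The snippet above leaves out the compile-and-match step. As a rough sketch of how the pieces might fit together, the following wraps the same regex in a hypothetical SemverNamedGroups class whose parse method (also an assumption, not part of the answer) returns the named groups in a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SemverNamedGroups {
    // The regex from the answer above, unchanged.
    static final Pattern SEMVER = Pattern.compile(
            "^(?<major>0|[1-9]\\d*)\\.(?<minor>0|[1-9]\\d*)\\.(?<patch>0|[1-9]\\d*)"
            + "(?:-(?<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)"
            + "(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
            + "(?:\\+(?<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$");

    // Hypothetical helper: maps each group name to its value (null if the group is absent).
    static Map<String, String> parse(String version) {
        Matcher matcher = SEMVER.matcher(version);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a semver string: " + version);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"major", "minor", "patch", "prerelease", "buildmetadata"}) {
            parts.put(name, matcher.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("1.0.1-RC+build.5"));
    }
}
```

Unlike the first pattern in this thread, this one also accepts 0 as the major version and dotted pre-release identifiers such as rc.1.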

Related

How to capture multiline repeated groups using regular expression

I've been trying to write a regular expression in a Kotlin application that I can use to parse multiline journal entries that are delimited by means of a timestamp prefix like so:
28-03-2020 23:00:00 - This
is
line
1
28-03-2021 14:23:15 - This
is
line
2
Each repeating group should capture the timestamp (1) and all text that occurs until either the next timestamp pattern at the start of a line or the end of text (2).
So, in the example above I expect the following output:
Match 1
Group 1: 28-03-2020 23:00:00
Group 2: This\nis\nline\n1\n
Match 2
Group 1: 28-03-2021 14:23:15
Group 2: This\nis\nline\n2\n
So far, I've managed to conjure up a regular expression that can capture the first match using:
^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}) -([\s\S]*?)(?=^\d{2}.*?)
However, I haven't managed to capture every entry as a repeated match so far. Can someone help?
I've set up this regex101 session to test it.

If you want to match:
Each repeating group should capture the timestamp and all text
that occurs until either the next timestamp pattern at the start of a
line or the end of text.
you can capture the timestamp at the start of a line in group 1 (with Pattern.MULTILINE, ^ matches at the start of every line).
Instead of setting an end boundary such as a newline or a digit at the start of the line, use a negative lookahead so that group 2 captures every following line that does not start with a timestamp-like pattern.
^(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2})\h+-\h*(.*(?:\R(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d).*)*)
^ Start of a line (the pattern is compiled with Pattern.MULTILINE)
(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2}) Capture group 1, match a datetime like pattern
\h+-\h* Match - preceded by 1+ horizontal whitespace char and followed by optional ones
( Capture group 2
.* Match the whole line
(?: Non capture group
\R Match a newline
(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d) Negative lookahead, assert not a datetime like pattern directly to the right
.* If the assertion is true, match the whole line
)* Match a newline and the rest of the line if it does not start with a datetime like pattern
) Close group 2
For example
String regex = "^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\h+-\\h*(.*(?:\\R(?!\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d).*)*)";
String string = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n\n\n\n"
+ "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "12\n"
+ "line\n"
+ "1";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println("--------------------");
}
Output
28-03-2020 23:00:00
This
is
line
1
--------------------
28-03-2021 14:23:15
This
is
line
2
--------------------
28-03-2020 23:00:00
This
is
12
line
1
--------------------
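To work with the entries rather than print them, the same find() loop can collect each timestamp and body into a list. This is a sketch under assumptions: the JournalParser class and its parse method are hypothetical names, and calling trim() on the body is an added choice that drops blank lines trailing an entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JournalParser {
    // Same regex as in the example above, split across two literals for readability.
    private static final Pattern ENTRY = Pattern.compile(
            "^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\h+-\\h*"
            + "(.*(?:\\R(?!\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d).*)*)",
            Pattern.MULTILINE);

    // Hypothetical helper: each element is {timestamp, body}.
    static List<String[]> parse(String text) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            // trim() is an assumption: it removes trailing blank lines from the body.
            entries.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = "28-03-2020 23:00:00 - This\nis\nline\n1\n"
                + "28-03-2021 14:23:15 - This\nis\nline\n2\n";
        for (String[] entry : parse(text)) {
            System.out.println(entry[0] + " => " + entry[1].replace("\n", "\\n"));
        }
    }
}
```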

You can use Pattern.DOTALL like this, provided every entry is followed by a blank line (the pattern ends each match at "\n\n").
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern pat = Pattern.compile(
"(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\s*-\\s*(.*?)\n\n",
Pattern.DOTALL);
Matcher m = pat.matcher(s);
while (m.find()) {
System.out.println("Group 1 : " + m.group(1));
System.out.println("Group 2 : " + m.group(2));
}
}
output:
Group 1 : 28-03-2020 23:00:00
Group 2 : This
is
line
1
Group 1 : 28-03-2021 14:23:15
Group 2 : This
is
line
2

How about this.
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern r = Pattern.compile("^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}) -(?:[\\s\\D]*?)^(\\d{1,2})",Pattern.MULTILINE);
Matcher matcher = r.matcher(s);
while(matcher.find()) {
System.out.println("Group 1 : " + matcher.group(1));
System.out.println("Group 2 : " + matcher.group(2));
}
}
And the output is as below.
Group 1 : 28-03-2020 23:00:00
Group 2 : 1
Group 1 : 28-03-2021 14:23:15
Group 2 : 2

'return' character and new line use with regex in Java 8

I'm facing strange behaviour in Java 8 when using (\r?\n) inside a regex to parse a text file; the code runs from the Eclipse IDE under Java 8.
See the regex101 test demo: https://regex101.com/r/QHSsfQ/4
The regex works fine under Java 7 with the Eclipse IDE,
but with the IDE running on Java 8 it doesn't work (see the code below).
Can someone help me solve this?
String REGEX =
"\\s+NAME.*" + "\\r?\\n"
+ "INFO-\\d{1,2}\\s+(?<name>[$\\w]+).*" + "\\r?\\n"
+ ".*" + "\\r?\\n"
+ ".*VERAT2.*" + "\\r?\\n"
+ "\\s+\\w+\\s+(?<verat2>\\w+).*"
.......
.......
Matcher matcher = Pattern.compile( REGEX ).matcher( data );
if( matcher.find() )
{
System.out.println("LEVELINFO=DATA=" + matcher.group("name") + " &&NAME=" + matcher.group("name") +" &&VERAT2="+ matcher.group("verat2")+"\n");
}
}
sc.close();
the sample text file looks like this :
DATA NAME MAC1
INFO-0 EQUIP Q10
VL VER VERAT2
V22 V22
thanks

Alternative regex:
String regexName = "^DATA\\s+NAME\\s+.*?^\\S+\\s+(?<name>\\S+)";
String regexVerat2 = "\\s+VER\\s+VERAT2\\s+.*?^\\s+\\S+\\s+(?<verat2>\\S+)";
String regex = String.format("%s.*?%s", regexName, regexVerat2);
Matcher matcher = Pattern.compile(regex, Pattern.MULTILINE|Pattern.DOTALL).matcher(input);
Regex in context:
public static void main(String[] args) {
String input =
"DATA NAME MAC1 MAC2\n"
+ "INFO-0 EQUIP Q10 Q13\n"
+ " \n"
+ " VL VER VERAT2 MAP\n"
+ " V22 V22 SELF100\n"
+ " \n"
+ " CMD1 CMD2 CMD3 CMD4 CMD4 \n"
+ " NO 44 FAL BYTE\n";
String regexName = "^DATA\\s+NAME\\s+.*?^\\S+\\s+(?<name>\\S+)";
String regexVerat2 = "\\s+VER\\s+VERAT2\\s+.*?^\\s+\\S+\\s+(?<verat2>\\S+)";
String regex = String.format("%s.*?%s", regexName, regexVerat2);
Matcher matcher = Pattern.compile(regex, Pattern.MULTILINE|Pattern.DOTALL).matcher(input);
while(matcher.find()) {
System.out.println("Name: " + matcher.group("name"));
System.out.println("Verat2 : " + matcher.group("verat2"));
}
}
Output:
Name: EQUIP
Verat2 : V22

Java Regex OR operator not working properly

I have these Strings:
String test1=":test:block1:%a1%a2%a3%a4:block2:BL";
and
String test2=":test:block2:BL:block1:%a1%a2%a3%a4";
I've created a regex pattern in order to isolate this piece of the String
block1:%a1%a2%a3%a4:
from the rest of the String, giving results like this:
in the case of test1="block1:%a1%a2%a3%a4:"; (with ':' at the end)
in the case of test2=":block1:%a1%a2%a3%a4"; (with ':' at the beginning)
The regex I've created is:
"(block1:(.*?):|:block1:(.*))";
It works with test1, but with test2 it returns this:
block1:%a1%a2%a3%a4:block2:BL";
Can someone give me a hand with this?
Cheers!

You may use
block1:([^:]*)
It matches block1: text and then captures into Group 1 any 0 or more chars other than :.
See Java demo:
String patternString = "block1:([^:]*)";
String[] tests = {":test:block1:%a1%a2%a3%a4:block2:BL",
":test:block2:BL:block1:%a1%a2%a3%a4"};
for (int i=0; i<tests.length; i++)
{
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(tests[i]);
if(m.find())
{
System.out.println(tests[i] + " matched. Match: " +
m.group(0) + ", Group 1: " + m.group(1));
}
}
Output:
:test:block1:%a1%a2%a3%a4:block2:BL matched. Match: block1:%a1%a2%a3%a4, Group 1: %a1%a2%a3%a4
:test:block2:BL:block1:%a1%a2%a3%a4 matched. Match: block1:%a1%a2%a3%a4, Group 1: %a1%a2%a3%a4
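Some background on why the original alternation misbehaves may help. The regex engine tries start positions from left to right and, at each position, tries the alternatives in order. Because :block1: begins one character before block1:, the second alternative wins whenever a colon precedes block1, and its greedy (.*) then runs to the end of the input. The sketch below compares the two patterns on the first test string from the question; the AlternationDemo class and firstMatch helper are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Hypothetical helper: returns the first full match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String test1 = ":test:block1:%a1%a2%a3%a4:block2:BL";
        // The ':block1:(.*)' alternative matches first because it starts one character earlier.
        System.out.println(firstMatch("(block1:(.*?):|:block1:(.*))", test1));
        // The negated character class stops at the next ':' regardless of position.
        System.out.println(firstMatch("block1:([^:]*)", test1));
    }
}
```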

extracting two group number by Regular Expressions

I have the following string and want to extract 150 and 136 from it using regular expressions in Java (Android Studio). Both numbers come before MB (separated from it by a space), and sometimes the second number is missing. How can I extract them into separate groups?
"Your Day Traffic is 150 MB and your Night Traffic is 136 MB "
and get two groups like this:
group 1 ==> "150"
group 2 ==> "136"
Best Answer:
After some searching and experimenting on regex101.com I found my answer:
Pattern p = Pattern.compile("^[^\\d]*(\\d+(?:\\.\\d+)?) MB(?:[^\\d]+(\\d+(?:\\.\\d+)?) MB)?.*$");//. represents single character
Matcher m = p.matcher("Your Day Traffic is 150 MB and your Night Traffic is 136 MB");
while (m.find()) {
System.out.println("group 1 ==>" + m.group(1));
System.out.println("group 2 ==>" + m.group(2));
}
and I get this:
group 1 ==>150
group 2 ==>136

You can use the regex ((\d+)\sMB). If there can be more than one whitespace character between the number and MB, use \s+ instead of \s. You can do all this with Pattern and Matcher:
String text = "Your Day Traffic is 150 MB and your Night Traffic is 136 MB ";
String regex = "((\\d+)\\sMB)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int group = 1;
while (matcher.find()) {
System.out.println("group " + group++ + " ==> " + matcher.group(2));
}
In your case the output is:
group 1 ==> 150
group 2 ==> 136
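Because the find() loop handles one number at a time, it also covers the case from the question where the second number is missing. A small sketch that collects the numbers into a list; the TrafficNumbers class and extract method are hypothetical names, and \s+ is used here to tolerate more than one space before MB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrafficNumbers {
    // One or more digits followed by whitespace and the literal "MB".
    private static final Pattern MB = Pattern.compile("(\\d+)\\s+MB");

    // Hypothetical helper: returns every number that precedes "MB", in order.
    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = MB.matcher(text);
        while (matcher.find()) {
            numbers.add(matcher.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("Your Day Traffic is 150 MB and your Night Traffic is 136 MB "));
        System.out.println(extract("Your Day Traffic is 150 MB"));
    }
}
```

The size of the returned list tells the caller whether the night value was present.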

Please read the Java docs on regex here.
Essentially you have to ignore the characters between two "[number] MB" occurrences. In that situation you can use a regex like so -
/.*\s+(\d+)\s+MB.*\s+(\d+)\s+MB/
Full code is given here -
import java.util.regex.*;
public class MatchMB {
public static void main(String args[]){
Pattern p = Pattern.compile(".*\\s+(\\d+)\\s+MB.*\\s+(\\d+)\\s+MB");
Matcher m = p.matcher("Your Day Traffic is 150 MB and your Night Traffic is 136 MB");
while (m.find()) {
System.out.println("group 1 ==>" + m.group(1));
System.out.println("group 2 ==>" + m.group(2));
}
}
}

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
Here it works fine. I'm first grouping the part from To till the Message which is the required part. And then I have another grouping to extract the email ids.
Is there any better way to do this? Can we do it with one pattern matcher set?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com

Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Unfortunately, Java (at least up to Java 8) rejects lookbehinds that have no bounded maximum length, such as the .* above. A workaround is to use a bounded quantifier:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
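A rough sketch of how the bounded-lookbehind workaround might be used from Java follows. Several things here are assumptions rather than part of the answer: the BoundedLookbehindDemo class, the extract helper, the shortened sample headers, and the Pattern.DOTALL flag, which is added so that . can cross line breaks between To: and Message-ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // The lookbehind is capped at 1000 characters, as in the workaround above.
    // Pattern.DOTALL is an added assumption so that '.' also matches line breaks.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)", Pattern.DOTALL);

    // Hypothetical helper: collects every <...> value between "To:" and "Message-ID".
    static List<String> extract(String headers) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Shortened, made-up headers in the same style as the question's input.
        String headers = "From: A <a#abc.com>\r\n"
                + "To: B <b#abc.com>\r\n"
                + "CC: C <c#abc.com>,\r\n"
                + " D <d#abc.com>\r\n"
                + "Message-ID: <id#abc.com>\r\n";
        System.out.println(extract(headers));
    }
}
```

The cap is a trade-off: an address that starts more than 1000 characters after To: is silently skipped, so the limit should be chosen with the expected header size in mind.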

I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters (including line breaks, because of Pattern.DOTALL), as few as possible, up to the first
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.
