Nested/Repeated Group in Regex - java

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
Here it works fine. I'm first grouping the part from To till the Message which is the required part. And then I have another grouping to extract the email ids.
Is there any better way to do this? Can we do it with one pattern matcher set?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com

Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Visualization by Debuggex
Unfortunately, Java doesn't support variable length in lookbehinds. A workaround could be:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)

I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
See IDEONE demo and a regex demo
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, any number of occurrences up to the first
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.

Related

How to capture multiline repeated groups using regular expression

I've been trying to write a regular expression in a Kotlin application that I can use to parse multiline journal entries that are delimited by means of a timestamp prefix like so:
28-03-2020 23:00:00 - This
is
line
1
28-03-2021 14:23:15 - This
is
line
2
Each repeating group should capture the timestamp (1) and all text that occurs until either the next timestamp pattern at the start of a line or the end of text (2).
So, in the example above I expect the following output:
Match 1
Group 1: 28-03-2020 23:00:00
Group 2: This\nis\nline\n1\n
Match 2
Group 1: 28-03-2020 14:23:15
Group 2: This\nis\nline\n2\n
So far, I've managed to conjure up a regular expression that can capture the first match using:
^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}) -([\s\S]*?)(?=^\d{2}.*?)
However, I've been unsuccessful in capturing as repeated groups so far.. can someone help?
I've setup this regex101 session to test it.
If you want to match:
Each repeating group should capture the timestamp and all text
that occurs until either the next timestamp pattern at the start of a
line or the end of text.
you can capture the timestamp at the start of the string in group 1.
Without setting an end boundary like a newline or a digit at the start of the line, capture all lines that do not start with a timestamp like pattern using a negative lookahead in group 2.
^(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2})\h+-\h*(.*(?:\R(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d).*)*)
^ Start of string
(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2}) Capture group 1, match a datetime like pattern
\h+-\h* Match - preceded by 1+ horizontal whitespace char and followed by optional ones
( Capture group 2
.* Match the whole line
(?: Non capture group
\R Match a newline
(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d) Negative lookahead, assert not a datetime like pattern directly to the right
.* If the assertion in true, match the whole line
)* Match a newline and the rest of the line if it does not start with a datetime like pattern
) Close group 2
Regex demo | Java demo
For example
String regex = "^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\h+-\\h*(.*(?:\\R(?!\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d).*)*)";
String string = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n\n\n\n"
+ "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "12\n"
+ "line\n"
+ "1";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println("--------------------");
}
Output
28-03-2020 23:00:00
This
is
line
1
--------------------
28-03-2021 14:23:15
This
is
line
2
--------------------
28-03-2020 23:00:00
This
is
12
line
1
--------------------
You should use Pattern.DOTALL like this.
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern pat = Pattern.compile(
"(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\s*-\\s*(.*?)\n\n",
Pattern.DOTALL);
Matcher m = pat.matcher(s);
while (m.find()) {
System.out.println("Group 1 : " + m.group(1));
System.out.println("Group 2 : " + m.group(2));
}
}
output:
Group 1 : 28-03-2020 23:00:00
Group 2 : This
is
line
1
Group 1 : 28-03-2021 14:23:15
Group 2 : This
is
line
2
How about this.
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern r = Pattern.compile("^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}) -(?:[\\s\\D]*?)^(\\d{1,2})",Pattern.MULTILINE);
Matcher matcher = r.matcher(s);
while(matcher.find()) {
System.out.println("Group 1 : " + matcher.group(1));
System.out.println("Group 2 : " + matcher.group(2));
}
}
And the output is as below.
Group 1 : 28-03-2020 23:00:00
Group 2 : 1
Group 1 : 28-03-2021 14:23:15
Group 2 : 2

Extracting segments of a semver version string via Java regex

Java 8 here. I am trying to parse a semver (or at least, my flavor of semver) string and extract out its main segments:
Major version number
Minor version number
Patch number
Qualifier (RC, SNAPSHOT, RELEASE, etc.)
Here is my code:
String version = "1.0.1-RC";
Pattern versionPattern = Pattern.compile("^[1-9]\\d*\\.\\d+\\.\\d+(?:-[a-zA-Z0-9]+)?$");
Matcher matcher = versionPattern.matcher(version);
if (matcher.matches()) {
System.out.println("\n\n\matching version is: " + matcher.group(0));
System.out.println("\nmajor #: " + matcher.group(1));
System.out.println("\nminor #: " + matcher.group(2));
System.out.println("\npatch #: " + matcher.group(3));
System.out.println("\nqualifier: " + matcher.group(4) + "\n\n\n");
}
When this runs, I get the following output on the console:
matching version is: 1.0.1-RC
2019-10-18 14:32:05,952 [main] 84b37cef-70f9-4ab8-bafb-005821699766 ERROR c.s.f.s.listeners.StartupListener - java.lang.IndexOutOfBoundsException: No group 1
What do I need to do to my regex and/our use of the Matcher API so that I can extract:
1 as the major number
0 as the minor number
1 as the patch number
RC as the qualifier
Any ideas?
NOTE:
You should not escape m in a string literal, \m is not a valid string escape sequence and the code won't compile
Matcher#matches() requires a full string match, no need to add ^ and $ anchors
To be able to reference Matcher#group(n), you need to define the groups in the pattern in the first place. Wrap the parts you need with pairs of unescaped parentheses.
Use
String version = "1.0.1-RC";
Pattern versionPattern = Pattern.compile("([1-9]\\d*)\\.(\\d+)\\.(\\d+)(?:-([a-zA-Z0-9]+))?");
Matcher matcher = versionPattern.matcher(version);
if (matcher.matches()) {
System.out.println("matching version is: " + matcher.group(0));
System.out.println("major #: " + matcher.group(1));
System.out.println("minor #: " + matcher.group(2));
System.out.println("patch #: " + matcher.group(3));
System.out.println("qualifier: " + matcher.group(4) + "\n\n\n");
}
See the Java demo, output:
matching version is: 1.0.1-RC
major #: 1
minor #: 0
patch #: 1
qualifier: RC
I was looking around on the semver specification website and found a regex there that works, and i fixed it a little to work with java regex named groups and here is the regex if anyone needs it
final static String version_regex = "^(?<major>0|[1-9]\\d*)\\.(?<minor>0|[1-9]\\d*)\\.(?<patch>0|[1-9]\\d*)(?:-(?<prerelease>(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+(?<buildmetadata>[0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$";
// After matching to grab the groups
String mayor = matcher.group("major");
String minor = matcher.group("minor");
String patch = matcher.group("patch");
String prerelease = matcher.group("prerelease");
String buildmetadata = matcher.group("buildmetadata");

Regex only matches once

I have the following regex that matches only once:
Matcher m = Pattern.compile("POLYGON\\s\\(\\((([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+),?)+\\)\\)")
.matcher("POLYGON ((12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043))");
while (m.find()) {
System.out.println("-> " + m.group(2) + " - " + m.group(3));
}
But it only prints the first match:
-> 12.789754538957263 - 36.12443963532555
Why does it not match the other coordinates?
I want to print a new line for each pair of coordinates, e.g.
12.789754538957263 - 36.12443963532555
12.778550292768816 - 36.089875458584984
12.77760353347314 - 36.12427601168043
Your regex should look like this (\[0-9\]*\.\[0-9\]+)\s(\[0-9\]*\.\[0-9\]+)
String input = ...
Matcher m = Pattern.compile("([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
Outputs
-> 12.789754538957263 - 36.12443963532555
-> 12.778550292768816 - 36.089875458584984
-> 12.77760353347314 - 36.12427601168043
If you want to make sure that the input should between POLYGON (( .. )) you can use replaceAll to extract that inputs :
12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043
Your code should be :
.matcher(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1"));
Instead of :
.matcher(input);
Solution 2
After analysing your problem, I think you need just this :
Stream.of(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1").split(","))
.forEach(System.out::println);
You could still check if your input begins with a certain string like the following.
I'd use the following regex to do the check : (\[\\d.\]+)\\s(\[\\d.\]+)
It searches for sequences of digits or points separated by a space.
String input = ...
if (input.startsWith("POLYGON")) {
Matcher m = Pattern.compile("([\\d.]+)\\s([\\d.]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
}

Greedy matching in regex(java)

I'm trying to tokenize the input below using java regex. I believe my expression should greedily match the outer "exec" tokens in the program below.
#Test
public void test(){
String s = "exec(\n" +
" \"command #1\"\n" +
" ,\"* * * * *\" //cron string\n" +
" ,\"false\" eq exec(\"command #3\")) //condition\n" +
")\n" +
"\n" + //split here
"exec(\n" +
" \"command #2\" \n" +
" ,\"exec(\"command #4\") //condition\n" +
");";
List<String> matches = new ArrayList<String>();
Pattern pattern = Pattern.compile("exec\\s*\\(.*\\)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
matches.add(matcher.group());
}
System.out.println(matches);
}
I'm expecting output as
[exec(
"command #1"
,"* * * * *" //cron string
,"false" eq exec("command #3")) //condition
),exec(
"command #2"
,"exec("command #4") //condition
);]
but get
[exec("command #3")), exec("command #4")]
Could anyone please help me understand where I'm going wrong?
By default, The dot character . does not match on newline characters. Here, in this case, the "exec" pattern will only match if it occurs on the same line.
You can use Pattern.DOTALL to allow matching to be done on newline characters:
Pattern.compile("exec\\s*\\(.*\\)", Pattern.DOTALL);
Alternatively (?s) can be specified, which is equivalent:
Pattern.compile("(?s)exec\\s*\\(.*\\)");

Java how to setup regex for this string

So I'm trying to pull two strings via a matcher object from one string that is stored in my online databases.
Each string appears after s:64: and is in quotations
Example s:64:"stringhere"
I'm currently trying to get them as so but any regex that I've tried has failed,
Pattern p = Pattern.compile("I don't know what to put as the regex");
Matcher m = p.matcher(data);
So with that said, all I need is the regex that will return the two strings in the matcher so that m.group(1) is my first string and m.group(2) is my second string.
Try this regex:-
s:64:\"(.*?)\"
Code:
Pattern pattern = Pattern.compile("s:64:\"(.*?)\"");
Matcher matcher = pattern.matcher(YourStringVar);
// Check all occurance
int count = 0;
while (matcher.find() && count++ < 2) {
System.out.println("Group : " + matcher.group(1));
}
Here group(1) returns the each match.
OUTPUT:
Group : First Match
Group : Second Match
Refer LIVE DEMO
String data = "s:64:\"first string\" random stuff here s:64:\"second string\"";
Pattern p = Pattern.compile("s:64:\"([^\"]*)\".*s:64:\"([^\"]*)\"");
Matcher m = p.matcher(data);
if (m.find()) {
System.out.println("First string: '" + m.group(1) + "'");
System.out.println("Second string: '" + m.group(2) + "'");
}
prints:
First string: 'first string'
Second string: 'second string'
Regex you need should be compile("s:64:\"(.*?)\".*s:64:\"(.*?)\"")

Categories