How to capture multiline repeated groups using regular expression - java

I've been trying to write a regular expression in a Kotlin application that I can use to parse multiline journal entries that are delimited by means of a timestamp prefix like so:
28-03-2020 23:00:00 - This
is
line
1
28-03-2021 14:23:15 - This
is
line
2
Each repeating group should capture the timestamp (1) and all text that occurs until either the next timestamp pattern at the start of a line or the end of text (2).
So, in the example above I expect the following output:
Match 1
Group 1: 28-03-2020 23:00:00
Group 2: This\nis\nline\n1\n
Match 2
Group 1: 28-03-2021 14:23:15
Group 2: This\nis\nline\n2\n
So far, I've managed to conjure up a regular expression that can capture the first match using:
^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}) -([\s\S]*?)(?=^\d{2}.*?)
However, I've been unsuccessful in capturing the remaining entries as repeated matches. Can someone help?
I've set up this regex101 session to test it.

If you want to match:
Each repeating group should capture the timestamp and all text
that occurs until either the next timestamp pattern at the start of a
line or the end of text.
you can capture the timestamp at the start of a line in group 1 (the pattern is compiled with Pattern.MULTILINE).
Without setting an end boundary like a newline or a digit at the start of the line, capture all lines that do not start with a timestamp like pattern using a negative lookahead in group 2.
^(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2})\h+-\h*(.*(?:\R(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d).*)*)
^ Start of a line (because Pattern.MULTILINE is set)
(\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d{2}) Capture group 1, match a datetime like pattern
\h+-\h* Match - preceded by 1+ horizontal whitespace char and followed by optional ones
( Capture group 2
.* Match the whole line
(?: Non capture group
\R Match a newline
(?!\d{2}-\d{2}-\d{4}\h+\d{2}:\d{2}:\d) Negative lookahead, assert not a datetime like pattern directly to the right
.* If the assertion is true, match the whole line
)* Match a newline and the rest of the line if it does not start with a datetime like pattern
) Close group 2
For example
String regex = "^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\h+-\\h*(.*(?:\\R(?!\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d).*)*)";
String string = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n\n\n\n"
+ "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "12\n"
+ "line\n"
+ "1";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println("--------------------");
}
Output
28-03-2020 23:00:00
This
is
line
1
--------------------
28-03-2021 14:23:15
This
is
line
2
--------------------
28-03-2020 23:00:00
This
is
12
line
1
--------------------
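The lazy pattern from the question can also be kept, as long as its lookahead accepts either the next timestamp at the start of a line or the end of the input (\z). The following is a minimal sketch of that variant, not the approach used in the answer above. The class and method names are made up for illustration, and it assumes exactly one space on each side of the hyphen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names are illustrative, not from the original post.
public class JournalEntries {
    // Timestamp shape taken from the question: dd-MM-yyyy HH:mm:ss
    private static final String TS = "\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}";

    // Group 1: the timestamp. Group 2: everything, matched lazily, up to the
    // next line that starts with a timestamp or up to the end of input (\z).
    private static final Pattern ENTRY = Pattern.compile(
            "^(" + TS + ") - ([\\s\\S]*?)(?=^" + TS + " - |\\z)",
            Pattern.MULTILINE);

    // Returns one {timestamp, text} pair per journal entry.
    public static List<String[]> parse(String journal) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(journal);
        while (m.find()) {
            entries.add(new String[] {m.group(1), m.group(2)});
        }
        return entries;
    }

    public static void main(String[] args) {
        String journal = "28-03-2020 23:00:00 - This\nis\nline\n1\n"
                + "28-03-2021 14:23:15 - This\nis\nline\n2\n";
        for (String[] entry : parse(journal)) {
            System.out.println("Group 1: " + entry[0]);
            // Show line breaks as a literal \n, as in the question.
            System.out.println("Group 2: " + entry[1].replace("\n", "\\n"));
        }
    }
}
```

Unlike the line-by-line negative lookahead, this version keeps the trailing newline of each entry in group 2, which is what the expected output in the question shows.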

You can use Pattern.DOTALL like this. Note that the pattern relies on every entry being followed by an empty line (\n\n).
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern pat = Pattern.compile(
"(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})\\s*-\\s*(.*?)\n\n",
Pattern.DOTALL);
Matcher m = pat.matcher(s);
while (m.find()) {
System.out.println("Group 1 : " + m.group(1));
System.out.println("Group 2 : " + m.group(2));
}
}
output:
Group 1 : 28-03-2020 23:00:00
Group 2 : This
is
line
1
Group 1 : 28-03-2021 14:23:15
Group 2 : This
is
line
2

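How about this? Note that group 2 here captures only the number at the start of a later line, not the whole entry text.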
public static void main(String[] args) {
String s = "28-03-2020 23:00:00 - This\n"
+ "is\n"
+ "line\n"
+ "1\n"
+ "\n"
+ "28-03-2021 14:23:15 - This\n"
+ "is\n"
+ "line\n"
+ "2\n"
+ "\n";
Pattern r = Pattern.compile("^(\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}) -(?:[\\s\\D]*?)^(\\d{1,2})",Pattern.MULTILINE);
Matcher matcher = r.matcher(s);
while(matcher.find()) {
System.out.println("Group 1 : " + matcher.group(1));
System.out.println("Group 2 : " + matcher.group(2));
}
}
And the output is as below.
Group 1 : 28-03-2020 23:00:00
Group 2 : 1
Group 1 : 28-03-2021 14:23:15
Group 2 : 2

Related

Java Regex OR operator not working properly

I have these strings:
String test1=":test:block1:%a1%a2%a3%a4:block2:BL";
and
String test2=":test:block2:BL:block1:%a1%a2%a3%a4";
I've created a regex pattern to isolate this piece of the string
block1:%a1%a2%a3%a4:
from the rest, so that the results look like this:
for test1: "block1:%a1%a2%a3%a4:" (with ':' at the end)
for test2: ":block1:%a1%a2%a3%a4" (with ':' at the beginning)
The regex I've created is:
"(block1:(.*?):|:block1:(.*))";
It works with test1, but with test2 it returns this:
block1:%a1%a2%a3%a4:block2:BL
Can someone give me a hand with this?
Cheers!
You may use
block1:([^:]*)
It matches block1: text and then captures into Group 1 any 0 or more chars other than :.
See Java demo:
String patternString = "block1:([^:]*)";
String[] tests = {":test:block1:%a1%a2%a3%a4:block2:BL",
":test:block2:BL:block1:%a1%a2%a3%a4"};
for (int i=0; i<tests.length; i++)
{
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(tests[i]);
if(m.find())
{
System.out.println(tests[i] + " matched. Match: " +
m.group(0) + ", Group 1: " + m.group(1));
}
}
Output:
:test:block1:%a1%a2%a3%a4:block2:BL matched. Match: block1:%a1%a2%a3%a4, Group 1: %a1%a2%a3%a4
:test:block2:BL:block1:%a1%a2%a3%a4 matched. Match: block1:%a1%a2%a3%a4, Group 1: %a1%a2%a3%a4

Regex only matches once

I have the following regex that matches only once:
Matcher m = Pattern.compile("POLYGON\\s\\(\\((([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+),?)+\\)\\)")
.matcher("POLYGON ((12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043))");
while (m.find()) {
System.out.println("-> " + m.group(2) + " - " + m.group(3));
}
But it only prints the first match:
-> 12.789754538957263 - 36.12443963532555
Why does it not match the other coordinates?
I want to print a new line for each pair of coordinates, e.g.
12.789754538957263 - 36.12443963532555
12.778550292768816 - 36.089875458584984
12.77760353347314 - 36.12427601168043
Your regex should look like this: ([0-9]*\.[0-9]+)\s([0-9]*\.[0-9]+)
String input = ...
Matcher m = Pattern.compile("([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
Outputs
-> 12.789754538957263 - 36.12443963532555
-> 12.778550292768816 - 36.089875458584984
-> 12.77760353347314 - 36.12427601168043
If you want to make sure that the coordinates are taken from between POLYGON (( .. )), you can use replaceAll to extract that part of the input:
12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043
Your code should be :
.matcher(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1"));
Instead of :
.matcher(input);
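One caveat with the replaceAll step: when the input does not contain the POLYGON (( .. )) wrapper, replaceAll returns the string unchanged, so the pair pattern would still run on it. A minimal sketch of a stricter two-step variant follows; the class and method names are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names are illustrative, not from the original post.
public class PolygonPairs {
    // Step 1: capture the text between "POLYGON ((" and "))".
    private static final Pattern BODY =
            Pattern.compile("POLYGON\\s*\\(\\((.*?)\\)\\)");
    // Step 2: one "x y" coordinate pair, as in the answer above.
    private static final Pattern PAIR =
            Pattern.compile("([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+)");

    // Returns "x - y" strings, or an empty list if there is no POLYGON wrapper.
    public static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher body = BODY.matcher(input);
        if (!body.find()) {
            return out;
        }
        Matcher pair = PAIR.matcher(body.group(1));
        while (pair.find()) {
            out.add(pair.group(1) + " - " + pair.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String p : pairs("POLYGON ((1.5 2.5,3.25 4.75))")) {
            System.out.println("-> " + p);
        }
    }
}
```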
Solution 2
After analysing your problem, I think you need just this :
Stream.of(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1").split(","))
.forEach(System.out::println);
You could still check if your input begins with a certain string like the following.
I'd use the following regex to do the check: ([\\d.]+)\\s([\\d.]+)
It searches for sequences of digits or points separated by a space.
String input = ...
if (input.startsWith("POLYGON")) {
Matcher m = Pattern.compile("([\\d.]+)\\s([\\d.]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
}

extracting two group number by Regular Expressions

I have the following string and want to extract 150 and 136 from it using regular expressions in Java (Android Studio). Both numbers come before MB (separated from it by a space), and sometimes the second number does not exist. How can I extract them
into separate groups?
"Your Day Traffic is 150 MB and your Night Traffic is 136 MB "
and I want to get two groups like this:
group 1 ==> "150"
group 2 ==> "136"
Best Answer:
After some searching and experimenting on regex101.com, I found my answer:
Pattern p = Pattern.compile("^[^\\d]*(\\d+(?:\\.\\d+)?) MB(?:[^\\d]+(\\d+(?:\\.\\d+)?) MB)?.*$");//. represents single character
Matcher m = p.matcher("Your Day Traffic is 150 MB and your Night Traffic is 136 MB");
while (m.find()) {
System.out.println("group 1 ==>" + m.group(1));
System.out.println("group 2 ==>" + m.group(2));
}
and I get this:
group 1 ==>150
group 2 ==>136
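The question mentions that the second number is sometimes missing. With the pattern above, the optional group then does not participate in the match and m.group(2) returns null, so it is worth checking before use. A minimal sketch of that check, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names are illustrative, not from the original post.
public class TrafficNumbers {
    // Same pattern as in the answer above; group 2 is optional.
    private static final Pattern MB = Pattern.compile(
            "^[^\\d]*(\\d+(?:\\.\\d+)?) MB(?:[^\\d]+(\\d+(?:\\.\\d+)?) MB)?.*$");

    // Returns the one or two numbers found before "MB", in order.
    public static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = MB.matcher(text);
        if (m.find()) {
            out.add(m.group(1));
            // group(2) is null when the second "<number> MB" is absent.
            if (m.group(2) != null) {
                out.add(m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "Your Day Traffic is 150 MB and your Night Traffic is 136 MB "));
        System.out.println(extract("Your Day Traffic is 150 MB"));
    }
}
```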
You can use the regex ((\d+)\sMB). If there can be more than one space between the number and MB, use \s+ instead of \s. You can do all this with Pattern and Matcher:
String text = "Your Day Traffic is 150 MB and your Night Traffic is 136 MB ";
String regex = "((\\d+)\\sMB)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int group = 1;
while (matcher.find()) {
System.out.println("group " + group++ + " ==> " + matcher.group(2));
}
In your case the output is:
group 1 ==> 150
group 2 ==> 136
Please read the Java docs on regex.
Essentially you have to ignore the characters between the two "[number] MB" occurrences. In that situation you can use a regex like so -
/.*\s+(\d+)\s+MB.*\s+(\d+)\s+MB/
Full code is given here -
import java.util.regex.*;
public class MatchMB {
public static void main(String args[]){
Pattern p = Pattern.compile(".*\\s+(\\d+)\\s+MB.*\\s+(\\d+)\\s+MB");
Matcher m = p.matcher("Your Day Traffic is 150 MB and your Night Traffic is 136 MB");
while (m.find()) {
System.out.println("group 1 ==>" + m.group(1));
System.out.println("group 2 ==>" + m.group(2));
}
}
}

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
This works fine. I first capture the part from To: up to Message-ID, which is the part I need, and then use a second pattern to extract the email IDs from it.
Is there any better way to do this? Can we do it with one pattern matcher set?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com
Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
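Unfortunately, Java doesn't support unbounded-length lookbehinds such as one containing .*. A workaround could be to give the quantifier an upper bound (and, since the headers span several lines, to compile with Pattern.DOTALL):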
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, any number of occurrences up to the first
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.
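The \G-based pattern described above can be tried on a much shorter header block. This is a minimal sketch under assumptions: the class name, the method name and the sample addresses are made up for illustration, and only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names and sample data are illustrative only.
public class HeaderEmails {
    // Start at "To:" or continue where the previous match ended (\G),
    // then take the next <...> as long as "Message-ID" is still ahead.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)",
            Pattern.DOTALL);

    // Returns the addresses found between "To:" and "Message-ID".
    public static List<String> extract(String headers) {
        List<String> out = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String headers = "From: Alice <alice@example.com>\r\n"
                + "To: Bob <bob@example.com>\r\n"
                + "CC: Carol <carol@example.com>,\r\n"
                + " <dave@example.com>\r\n"
                + "Message-ID: <id-1@example.com>\r\n"
                + "References: <ref-1@example.com>\r\n";
        for (String address : extract(headers)) {
            System.out.println("-->:" + address);
        }
    }
}
```

The From: address is skipped because no match can start before To:, and the Message-ID and References values are skipped because the lookahead no longer finds Message-ID after them.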

Period in regex using java

What I'm trying to do is making a valid mail id using regular expressions, from a given string. This is my code:
Pattern pat3 = Pattern.compile("[(a-z)+][(a-z\\d)]+{3,}\\#[(a-z)+]\\.[(a-z)+]");
Matcher mat3 = pat3.matcher("dasdsa#2 #ada. ss2#dad.2om p2# 2#2.2 fad2#yahoo.com 22#yahoo.com fad#yahoo.com");
System.out.println(mat3.pattern() + " ");
while(mat3.find()){
System.out.println("Position: " + mat3.start() + " ");
}
The problem is nothing is printed out. What I want to print, and what I really expect to print, but it doesn't, is: 39, 67.
Can someone explain to me why \\. doesn't work? Before adding \\., my regex was working fine up to that point.
Make your pattern as the following :
[a-z]+[a-z\\d]+{3,}\\#[a-z]+\\.[a-z]+
So, the code will be :
Pattern pat3 = Pattern.compile("[a-z]+[a-z\\d]+{3,}\\#[a-z]+\\.[a-z]+");
// Your Code
while(mat3.find()){
System.out.println("Position: " + mat3.start() + " --- Match: " + mat3.group());
}
This will give the following result :
Pattern :: [a-z]+[a-z\d]+{3,}\#[a-z]+\.[a-z]+
Position: 39 --- Match: fad2#yahoo.com
Position: 67 --- Match: fad#yahoo.com
Explanation:
You have put the pattern as
[(a-z)+][(a-z\\d)]+{3,}\\#[(a-z)+]\\.[(a-z)+]
The character class [(a-z)+] does not match one or more lower-case letters. It matches exactly one character out of these: (, a-z, ), +
To match one or more lower-case letters, the quantifier has to follow the character class, as in [a-z]+
So if you remove the \\. part from your pattern, then
while(mat3.find()){
System.out.println("Position: " + mat3.start() + " --- Match: " + mat3.group());
}
will give :
Pattern :: [(a-z)+][(a-z\d)]+{3,}\#[(a-z)+][(a-z)+]
Position: 15 --- Match: ss2#da // not ss2#dad
Position: 39 --- Match: fad2#ya // not fad2#yahoo
Position: 67 --- Match: fad#ya // not fad#yahoo
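The point about character classes can be checked in isolation. The following is a small sketch with a hypothetical class and helper name; it only uses Pattern.matches, which tests whether the whole input matches the regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the names are illustrative only.
public class CharClassDemo {
    // True if the whole input matches the given regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside [...] the characters ( ) + are literals, so this class
        // matches exactly ONE character: '(', ')', '+' or a letter a-z.
        System.out.println(fullMatch("[(a-z)+]", "+"));     // true
        System.out.println(fullMatch("[(a-z)+]", "("));     // true
        System.out.println(fullMatch("[(a-z)+]", "yahoo")); // false
        // The quantifier has to follow the class to allow repetition.
        System.out.println(fullMatch("[a-z]+", "yahoo"));   // true
    }
}
```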
