Greedy matching in regex(java)

I'm trying to tokenize the input below using Java regex. I believe my expression should greedily match the outer "exec" tokens in the program below.
@Test
public void test(){
String s = "exec(\n" +
" \"command #1\"\n" +
" ,\"* * * * *\" //cron string\n" +
" ,\"false\" eq exec(\"command #3\")) //condition\n" +
")\n" +
"\n" + //split here
"exec(\n" +
" \"command #2\" \n" +
" ,\"exec(\"command #4\") //condition\n" +
");";
List<String> matches = new ArrayList<String>();
Pattern pattern = Pattern.compile("exec\\s*\\(.*\\)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
matches.add(matcher.group());
}
System.out.println(matches);
}
I'm expecting output as
[exec(
"command #1"
,"* * * * *" //cron string
,"false" eq exec("command #3")) //condition
),exec(
"command #2"
,"exec("command #4") //condition
);]
but get
[exec("command #3")), exec("command #4")]
Could anyone please help me understand where I'm going wrong?

By default, the dot character . does not match line terminators. So in this case, the "exec" pattern can only match when the whole call sits on a single line.
You can use Pattern.DOTALL to let . match line terminators as well:
Pattern.compile("exec\\s*\\(.*\\)", Pattern.DOTALL);
Alternatively, the embedded flag (?s) can be specified, which is equivalent:
Pattern.compile("(?s)exec\\s*\\(.*\\)");
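To see the effect of the flag in isolation, here is a minimal sketch. The class name DotallDemo, the helper findAll and the shortened input are made up for illustration and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Collects every match of the given pattern in the input (hypothetical helper).
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Shortened, made-up input: two multi-line exec(...) calls, one nested call.
        String s = "exec(\n \"a\"\n ,exec(\"b\")\n)\n\nexec(\n \"c\"\n);";

        // Without DOTALL, '.' stops at line terminators, so only the
        // single-line inner call can match.
        List<String> perLine = findAll(Pattern.compile("exec\\s*\\(.*\\)"), s);

        // With DOTALL, the greedy '.*' runs to the end of the input and
        // backtracks to the last ')'.
        List<String> dotAll = findAll(Pattern.compile("exec\\s*\\(.*\\)", Pattern.DOTALL), s);

        System.out.println(perLine);
        System.out.println(dotAll.size());
    }
}
```

Note that with DOTALL the greedy .* spans from the first exec( to the last ) in the input, so this sketch yields one long match rather than one match per top-level call. Splitting the two calls apart would need a different delimiter rule, which depends on the grammar of the input.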

Related

Why matcher.find() for input parameter always return 'false'?

I have a strange situation with a regex matcher that I find difficult to understand.
When I pass the input parameter issueBody to the matcher, matcher.find() always returns false, while passing a hard-coded String with the same value as issueBody works as expected.
The regex function:
private Map<String, String> extractCodeSnippet(Set<String> resolvedIssueCodeLines, String issueBody) {
String codeSnippetForCodeLinePattern = "\\(Line #%s\\).*\\W\\`{3}\\W+(.*)(?=\\W+\\`{3})";
Map<String, String> resolvedIssuesMap = new HashMap<>();
for (String currentResolvedIssue : resolvedIssueCodeLines) {
String currentCodeLinePattern = String.format(codeSnippetForCodeLinePattern, currentResolvedIssue);
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(issueBody);
while (matcher.find()) {
resolvedIssuesMap.put(currentResolvedIssue, matcher.group());
}
}
return resolvedIssuesMap;
}
The following always returns false:
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(issueBody);
While the following always returns true:
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.MULTILINE);
Matcher matcher = pattern.matcher("**SQL_Injection** issue exists @ **VB_3845_112_lines/encode.frm** in branch **master**\n" +
"\n" +
"Severity: High\n" +
"\n" +
"CWE:89\n" +
"\n" +
"[Vulnerability details and guidance](https://cwe.mitre.org/data/definitions/89.html)\n" +
"\n" +
"[Internal Guidance](https://checkmarx.atlassian.net/wiki/spaces/AS/pages/79462432/Remediation+Guidance)\n" +
"\n" +
"[ppttx](http://WIN2K12-TEMP/bbcl/ViewerMain.aspx?planid=1010013&projectid=10005&pathid=1)\n" +
"\n" +
"Lines: 41 42 \n" +
"\n" +
"---\n" +
"[Code (Line #41):](null#L41)\n" +
"```\n" +
" user_name = txtUserName.Text\n" +
"```\n" +
"---\n" +
"[Code (Line #42):](null#L42)\n" +
"```\n" +
" password = txtPassword.Text\n" +
"```\n" +
"---\n");
My question is: why? What is the difference between the two statements?

TL;DR:
By using Pattern.UNIX_LINES, you tell the Java regex engine that . should match any char except a newline, LF. Use
Pattern pattern = Pattern.compile(currentCodeLinePattern, Pattern.UNIX_LINES);
In your hard-coded string, you have only LF (\n) line endings, while your issueBody most likely contains CRLF (\r\n) endings. Your pattern matches only a single non-word char with \W (see the \\W\\`{3} pattern part), but CRLF consists of two non-word chars. By default, . does not match line break chars, so it matches neither \r (CR) nor \n (LF). The \(Line #%s\).*\W\`{3} part fails precisely because of this:
\(Line #%s\) - matches (Line #<NUMBER>)
.* - matches 0 or more chars other than line break chars (so it stops before CR or LF)
\W - matches a single char other than a letter/digit/_ (here, only \r or \n)
\`{3} - 3 backticks - these are only matched if the line ended with \n, not \r\n (CRLF).
With Pattern.UNIX_LINES, . also matches \r, so .* consumes the CR and \W is left to match the LF.
BTW, Pattern.MULTILINE only makes ^ match at the start of each line and $ match at the end of each line. Since your pattern contains neither ^ nor $, you may safely discard this option.
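The line-ending difference can be reproduced with a small sketch. The class LineEndingDemo, the helper found and the two short inputs are made up for illustration, and the pattern is reduced to the part of the question's regex that fails. It assumes Java 11 or later because of String.repeat.

```java
import java.util.regex.Pattern;

public class LineEndingDemo {
    // Three backticks, built at runtime only to keep this sample readable.
    static final String FENCE = "`".repeat(3); // String.repeat needs Java 11+

    // Reduced form of the pattern from the question, fixed to line 41.
    static final String REGEX = "\\(Line #41\\).*\\W`{3}";

    // Hypothetical helper: does the pattern occur in the body with these flags?
    static boolean found(String body, int flags) {
        return Pattern.compile(REGEX, flags).matcher(body).find();
    }

    public static void main(String[] args) {
        String lf = "[Code (Line #41):](null#L41)\n" + FENCE;
        String crlf = "[Code (Line #41):](null#L41)\r\n" + FENCE;

        // LF only: '.*' stops before \n, '\W' consumes \n, then the backticks match.
        System.out.println(found(lf, 0));
        // CRLF with default flags: '\W' can consume only one of \r and \n.
        System.out.println(found(crlf, 0));
        // CRLF with UNIX_LINES: '.' also matches \r, leaving \n for '\W'.
        System.out.println(found(crlf, Pattern.UNIX_LINES));
    }
}
```

If the flag cannot be changed, replacing \W with \W+ in the pattern is another option that tolerates both line endings, though that goes beyond what the answer proposes.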

Regex only matches once

I have the following regex that matches only once:
Matcher m = Pattern.compile("POLYGON\\s\\(\\((([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+),?)+\\)\\)")
.matcher("POLYGON ((12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043))");
while (m.find()) {
System.out.println("-> " + m.group(2) + " - " + m.group(3));
}
But it only prints the first match:
-> 12.789754538957263 - 36.12443963532555
Why does it not match the other coordinates?
I want to print a new line for each pair of coordinates, e.g.
12.789754538957263 - 36.12443963532555
12.778550292768816 - 36.089875458584984
12.77760353347314 - 36.12427601168043

Your regex should look like this: ([0-9]*\.[0-9]+)\s([0-9]*\.[0-9]+)
String input = ...
Matcher m = Pattern.compile("([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
Outputs
-> 12.789754538957263 - 36.12443963532555
-> 12.778550292768816 - 36.089875458584984
-> 12.77760353347314 - 36.12427601168043
If you want to make sure the coordinates sit between POLYGON (( .. )), you can use replaceAll to extract that part of the input:
12.789754538957263 36.12443963532555,12.778550292768816 36.089875458584984,12.77760353347314 36.12427601168043
Your code should be:
.matcher(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1"));
Instead of:
.matcher(input);
Solution 2
After analysing your problem, I think you just need this:
Stream.of(input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1").split(","))
.forEach(System.out::println);
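Putting the two steps of this answer together (strip the POLYGON wrapper, then find each pair) might look like the following sketch. The class PolygonPairs and the method pairs are illustrative names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolygonPairs {
    // One coordinate pair: two decimal numbers separated by a whitespace char.
    private static final Pattern PAIR =
            Pattern.compile("([0-9]*\\.[0-9]+)\\s([0-9]*\\.[0-9]+)");

    // Hypothetical helper returning each pair formatted as "x - y".
    static List<String> pairs(String input) {
        // Keep only the text between "POLYGON ((" and "))".
        String body = input.replaceAll("POLYGON \\(\\((.*?)\\)\\)", "$1");
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(body);
        while (m.find()) { // each find() call advances to the next pair
            result.add(m.group(1) + " - " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "POLYGON ((12.789754538957263 36.12443963532555,"
                + "12.778550292768816 36.089875458584984,"
                + "12.77760353347314 36.12427601168043))";
        for (String pair : pairs(input)) {
            System.out.println("-> " + pair);
        }
    }
}
```

The loop works here because the pattern describes a single pair, so every find() call yields the next one. A repeated group inside one big match only exposes one iteration through group().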

You could still check whether your input begins with a certain string, as in the following.
I'd use the following regex to do the matching: ([\d.]+)\s([\d.]+)
It searches for sequences of digits or dots separated by a whitespace character.
String input = ...
if (input.startsWith("POLYGON")) {
Matcher m = Pattern.compile("([\\d.]+)\\s([\\d.]+)").matcher(input);
while (m.find()) {
System.out.println("-> " + m.group(1) + " - " + m.group(2));
}
}

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b@abc.com>\r\n"
+ "To: DDDDD dd <sssss.r@abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv@abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr@abc.com>, Psssss A <pssss.a@abc.com>, Logistics\r\n"
+ " <LOGISTICS@abc.com>, Gssss Bsss P <gdfddd.p@abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47@JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000@myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375@myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375@myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47@JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
Here it works fine. I first capture the part from To: up to Message-ID, which is the required part, and then use a second pattern to extract the email addresses.
Is there a better way to do this? Can it be done with a single pattern and matcher?
Update:
This is the expected output:
-->:sssss.r@abc.com
-->:sssss.rv@abc.com
-->:dsdsdsds.vr@abc.com
-->:pssss.a@abc.com
-->:LOGISTICS@abc.com
-->:gdfddd.p@abc.com

Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Unfortunately, Java doesn't support unbounded-length lookbehinds such as the one above. A workaround is to give the lookbehind an explicit upper bound:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
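A minimal sketch of the bounded-lookbehind workaround follows. The class LookbehindDemo, the method recipients and the short header block with example.com addresses are made up for illustration. Pattern.DOTALL is assumed so that . can cross the line breaks between headers, and the bound of 1000 is taken from the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    // "To:" must occur at most 1000 chars before the '<';
    // "Message-ID" must still lie somewhere ahead of the match.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)", Pattern.DOTALL);

    // Hypothetical helper collecting the text inside each matching <...>.
    static List<String> recipients(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up headers with a folded To: line.
        String headers = "From: A <a@example.com>\r\n"
                + "To: B <b@example.com>, C\r\n"
                + " <c@example.com>\r\n"
                + "Message-ID: <id@example.com>\r\n";
        System.out.println(recipients(headers));
    }
}
```

The bound is a trade-off: it must be large enough to cover the distance from To: to the last address, and larger bounds make each lookbehind check more expensive.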

I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, as few as possible, up to the first <
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.

Java how to setup regex for this string

So I'm trying to pull two strings via a matcher object from one string that is stored in my online databases.
Each string appears after s:64: and is in quotations
Example s:64:"stringhere"
I'm currently trying to get them as so but any regex that I've tried has failed,
Pattern p = Pattern.compile("I don't know what to put as the regex");
Matcher m = p.matcher(data);
So with that said, all I need is the regex that will return the two strings in the matcher so that m.group(1) is my first string and m.group(2) is my second string.

Try this regex:
s:64:\"(.*?)\"
Code:
Pattern pattern = Pattern.compile("s:64:\"(.*?)\"");
Matcher matcher = pattern.matcher(YourStringVar);
// Check the first two occurrences
int count = 0;
while (matcher.find() && count++ < 2) {
System.out.println("Group : " + matcher.group(1));
}
Here group(1) returns the quoted string of each match.
OUTPUT:
Group : First Match
Group : Second Match

String data = "s:64:\"first string\" random stuff here s:64:\"second string\"";
Pattern p = Pattern.compile("s:64:\"([^\"]*)\".*s:64:\"([^\"]*)\"");
Matcher m = p.matcher(data);
if (m.find()) {
System.out.println("First string: '" + m.group(1) + "'");
System.out.println("Second string: '" + m.group(2) + "'");
}
prints:
First string: 'first string'
Second string: 'second string'

The regex you need is: Pattern.compile("s:64:\"(.*?)\".*s:64:\"(.*?)\"")

How can I recognize the indefinite articles "a" or "an"?

My task is to devise a regular expression that recognizes the indefinite article in English, i.e. the word “a” or the word “an”. I must test the expression by writing a test driver that reads a file containing approximately ten lines of text. The program should count the occurrences of the words “a” and “an”. It must not match the characters a and an inside words such as than.
This is my code so far:
import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexeFindText {
public static void main(String[] args) throws IOException {
// Input for matching the regexe pattern
String file_name = "Testing.txt";
ReadFile file = new ReadFile(file_name);
String[] aryLines = file.OpenFile();
String asString = Arrays.toString(aryLines);
// Regexe to be matched
String regexe = ""; //<<--this is where the problem lies
int i;
for ( i=0; i < aryLines.length; i++ ) {
System.out.println( aryLines[ i ] ) ;
}
// Step 1: Allocate a Pattern object to compile a regexe
Pattern pattern = Pattern.compile(regexe);
//Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE);
// case- insensitive matching
// Step 2: Allocate a Matcher object from the compiled regexe pattern,
// and provide the input to the Matcher
Matcher matcher = pattern.matcher(asString);
// Step 3: Perform the matching and process the matching result
// Use method find()
while (matcher.find()) { // find the next match
System.out.println("find() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
}
// Use method matches()
if (matcher.matches()) {
System.out.println("matches() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("matches() found nothing");
}
// Use method lookingAt()
if (matcher.lookingAt()) {
System.out.println("lookingAt() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("lookingAt() found nothing");
}
}
}
What do I have to use to find those words within my text?

Here's the regex that will match "a" or "an":
String regex = "\\ban?\\b";
Let's break that regex down:
\b means word boundary (a single backslash is written as "\\" in a Java string literal)
a is simply a literal "a"
n? means zero or one literal "n"
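As a rough illustration of how this regex could drive the counting task, here is a small sketch. The class ArticleCounter, the method countArticles and the sample sentence are made up for illustration. Pattern.CASE_INSENSITIVE is added on the assumption that "A" and "An" at the start of a sentence should count too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArticleCounter {
    // Word boundary, "a", optional "n", word boundary; case-insensitive (assumption).
    private static final Pattern ARTICLE =
            Pattern.compile("\\ban?\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper counting standalone "a"/"an" words in the text.
    static int countArticles(String text) {
        Matcher matcher = ARTICLE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "than", "banana" and "away" contain a/an but are not counted,
        // because the surrounding letters prevent a word boundary.
        String line = "An apple a day keeps more than a banana away.";
        System.out.println(countArticles(line)); // prints 3
    }
}
```

For a file, the same method could be called once per line and the results summed.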
