Refactor regex Pattern into Java flavor pattern - java

I have a regex pattern created on regex101.com:
https://regex101.com/r/cMvHlm/7/codegen?language=java
However, that regex does not work in my Java program (I use Spring Tool Suite as my IDE):
@Test
public void testRegex() {
//Pattern referenceCodePattern = Pattern.compile("((\\h|\\:)+)(([\u00DFA-Za-z0-9-_#\\\\\\/])+)(([[:punct:]])?)");
Pattern pattern = Pattern.compile(""
+ "(?:\\s+|chiffre|job-id|job-nr[.]|job-nr|\\bjob id\\b|job nr[.]|jobnummer|jobnr[.]|jobid|jobcode|job nr.|ziffer|kennziffer|kennz.|referenz code|referenz-code|"
+ "referenzcode|ref[.] nr[.]|ref[.] id|ref id|ref[.]id|ref[.]-nr[.]|ref[.]- nr[.]|"
+ "referenz nummer|referenznummer|referenz nr[.]|stellenreferenz| referenz-nr[.]|referenznr[.]|referenz|referenznummer der stelle|id#|id #|stellenausschreibungen|"
+ "stellenausschreibungs\\s?nr[.]|stellenausschreibungs-nr[.]|stellenausschreibungsnr[.]|stellenangebots id|stellenangebots-id|stellenangebotsid|stellen id|stellen-id|stellenid|stellenreferenz|"
+ "stellen-referenz|ref[.]st[.]nr[.]|stellennumer|\\bst[.]-nr[.]\\b|\\bst[.] nr[.]\\b|kenn-nr[.]|positionsnummer|kennwort|stellenkey|stellencode|job-referenzcode|stellenausschreibung|"
+ "bewerbungskennziffer|projekt id|projekt-id|reference number|reference no[.]|reference code|job code|job id|job vacancy no[.]|job-ad-number|auto req id|job ref|\\bstellenausschreibung nr[.]\\b)"
+ ":?(?:\\w*)(?:\\s*)([A-Z]*\\s*)([!\"#$%&'()*+,\\-.\\/:;<=>?#[\\]^_`{|}~]*\\w*[!\"#$%&'()*+,\\-.\\/:;<=>?#[\\]^_`{|}~]*\\w*[!\"#$%&'()*+,\\-.\\/:;<=>?#[\\]^_`{|}~]*\\w*[!\"#$%&'()*+,\\-.\\/:;<=>?#[\\]^_`{|}~]*)?");
String line = "Referenznummer: INDUSTRY Kontakt: ZAsdfsdfS Herr Andrafgdh Neue Str. 7 21244 Buchholz +42341 22322 mdjob.bu44lz#zaqusssis.de Stellenanzeige teilen: Jetzt online bewerben! oder bewerben Sie sich mit\n" +
"Geben Sie bei Ihrer Bewerbung die Stellenreferenz und die Stellenbezeichnung an! \n" +
"Stellenreferenz: 21533448-JOtest\n\n" +
"Stellenausschreibung Nr. PD-666/19";
// Create a Pattern object
//Pattern r = Pattern.compile(pattern);
Matcher m = pattern.matcher(line);
if (m.find( )) {
System.out.println("Found value: " + m.group(0) );
System.out.println("Found value: " + m.group(1) );
System.out.println("Found value: " + m.group(2) );
}else {
System.out.println("NO MATCH");
}
}
I get the following error:
java.util.regex.PatternSyntaxException: Unclosed character class near index 1337
at java.util.regex.Pattern.error(Pattern.java:1957)
at java.util.regex.Pattern.clazz(Pattern.java:2550)
at java.util.regex.Pattern.clazz(Pattern.java:2506)
at java.util.regex.Pattern.clazz(Pattern.java:2506)
at java.util.regex.Pattern.clazz(Pattern.java:2506)
at java.util.regex.Pattern.sequence(Pattern.java:2065)
at java.util.regex.Pattern.expr(Pattern.java:1998)
at java.util.regex.Pattern.group0(Pattern.java:2907)
at java.util.regex.Pattern.sequence(Pattern.java:2053)
at java.util.regex.Pattern.expr(Pattern.java:1998)
at java.util.regex.Pattern.compile(Pattern.java:1698)
at java.util.regex.Pattern.<init>(Pattern.java:1351)
at java.util.regex.Pattern.compile(Pattern.java:1028)
Is there a way to find out where index 1337 is?

The main problem with the regex is that both [ and ] must be escaped inside a character class in a Java regex: there they are "special" because they are used to form character class unions and intersections.
Another issue is that the [.]\b patterns won't work as expected: a word boundary after a non-word char requires a word char immediately to the right of the current position. You need \B there, not \b.
You do not need to escape the / char in a Java regex pattern.
You do not have to repeat the pattern at the end of the regex; you may "repeat" it with a limiting {0,3} quantifier after wrapping the repeated pattern in a non-capturing group, (?:...).
Consider a while block to get all matches. You may use a boolean flag to record whether there were any matches.
Also, you probably want to move the \\s+ alternative to the end of the first group, since it is too generic, but I will leave it at the start for the time being.
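To make the \b versus \B point concrete, here is a minimal sketch. The class name BoundaryAfterDot and the helper found are made up for illustration, and the inputs are shortened versions of the text in the question.

```java
import java.util.regex.Pattern;

class BoundaryAfterDot {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // A dot followed by a space: both are non-word chars, so no word boundary exists.
        System.out.println(found("nr[.]\\b", "nr. PD-666/19")); // false
        System.out.println(found("nr[.]\\B", "nr. PD-666/19")); // true
        // A dot followed by a digit: a word boundary exists, so the results swap.
        System.out.println(found("nr[.]\\b", "nr.666"));        // true
        System.out.println(found("nr[.]\\B", "nr.666"));        // false
    }
}
```

If both "nr. 666" and "nr.666" should be accepted, dropping the trailing boundary altogether may be the simpler choice.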
Use
Pattern pattern = Pattern.compile(""
+ "(?:\\s+|chiffre|job-id|job-nr[.]|job-nr|\\bjob id\\b|job nr[.]|jobnummer|jobnr[.]|jobid|jobcode|job nr\\.|ziffer|kennziffer|kennz\\.|referenz code|referenz-code|"
+ "referenzcode|ref[.] nr[.]|ref[.] id|ref id|ref[.]id|ref[.]-nr[.]|ref[.]- nr[.]|"
+ "referenz nummer|referenznummer|referenz nr[.]|stellenreferenz| referenz-nr[.]|referenznr[.]|referenz|referenznummer der stelle|id#|id #|stellenausschreibungen|"
+ "stellenausschreibungs\\s?nr[.]|stellenausschreibungs-nr[.]|stellenausschreibungsnr[.]|stellenangebots id|stellenangebots-id|stellenangebotsid|stellen id|stellen-id|stellenid|stellenreferenz|"
+ "stellen-referenz|ref[.]st[.]nr[.]|stellennumer|\\bst[.]-nr[.]\\B|\\bst[.] nr[.]\\B|kenn-nr[.]|positionsnummer|kennwort|stellenkey|stellencode|job-referenzcode|stellenausschreibung|"
+ "bewerbungskennziffer|projekt id|projekt-id|reference number|reference no[.]|reference code|job code|job id|job vacancy no[.]|job-ad-number|auto req id|job ref|\\bstellenausschreibung nr[.]\\B)"
+ ":?\\w*\\s*([A-Z]*\\s*)([!\"#$%&'()*+,\\-./:;<=>?#\\[\\]^_`{|}~]*(?:\\w*[!\"#$%&'()*+,\\-./:;<=>?#\\[\\]^_`{|}~]*){0,3})?");
String line = "Referenznummer: INDUSTRY Kontakt: ZAsdfsdfS Herr Andrafgdh Neue Str. 7 21244 Buchholz +42341 22322 mdjob.bu44lz#zaqusssis.de Stellenanzeige teilen: Jetzt online bewerben! oder bewerben Sie sich mit\n" +
"Geben Sie bei Ihrer Bewerbung die Stellenreferenz und die Stellenbezeichnung an! \n" +
"Stellenreferenz: 21533448-JOtest\n\n" +
"Stellenausschreibung Nr. PD-666/19";
Matcher m = pattern.matcher(line);
boolean found = false;
while (m.find()) {
found = true;
System.out.println("Found value: " + m.group(0) );
System.out.println("Found value: " + m.group(1) );
System.out.println("Found value: " + m.group(2) );
System.out.println(" ----------------------- " );
}
if (!found) {
System.out.println("NO MATCH");
}
See this Java demo.
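As for finding out where index 1337 is: PatternSyntaxException exposes getIndex(), getPattern() and getDescription(), so the surrounding pattern text can be printed. The following is only a sketch; the class RegexErrorLocator, the method locate and the <<HERE>> marker are invented for illustration, and the short pattern stands in for the long one from the question.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexErrorLocator {
    // Compiles the regex; returns null if it is valid, otherwise a message that
    // shows the pattern text on both sides of the reported error index.
    static String locate(String regex, int radius) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            String p = e.getPattern();
            int idx = e.getIndex();                 // -1 when the position is unknown
            if (idx < 0) return e.getDescription();
            int at = Math.min(idx, p.length());
            int from = Math.max(0, at - radius);
            int to = Math.min(p.length(), at + radius);
            return e.getDescription() + " near index " + idx + ": "
                    + p.substring(from, at) + " <<HERE>> " + p.substring(at, to);
        }
    }

    public static void main(String[] args) {
        // The unescaped [ opens a nested class, so the outer class never closes.
        System.out.println(locate("x[!#[\\]^_]*", 8));
        // With [ escaped the pattern compiles and locate returns null.
        System.out.println(locate("x[!#\\[\\]^_]*", 8));
    }
}
```

Note that the reported index is where the parser gave up, which can be well after the bracket that actually caused the problem.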

In Java, an unescaped [ inside a character class is always treated as the start of a nested class, never as a literal.
This is why some recommend always escaping the literal class metacharacters [ and ],
which works across almost all engines.
Converting
[!"#$%&'()*+,\-.\/:;<=>?@[\]^_`{|}~]
to [!-/:-@\[\]-`{-~]
then refactoring the regex.
(Note there may be usability problems with the regex as well.)
Before refactor :
(?:\s+|chiffre|job-id|job-nr[.]|job-nr|\bjob[ ]id\b|job[ ]nr[.]|jobnummer|jobnr[.]|jobid|jobcode|job[ ]nr.|ziffer|kennziffer|kennz.|referenz[ ]code|referenz-code|referenzcode|ref[.][ ]nr[.]|ref[.][ ]id|ref[ ]id|ref[.]id|ref[.]-nr[.]|ref[.]-[ ]nr[.]|referenz[ ]nummer|referenznummer|referenz[ ]nr[.]|stellenreferenz|[ ]referenz-nr[.]|referenznr[.]|referenz|referenznummer[ ]der[ ]stelle|id\#|id[ ]\#|stellenausschreibungen|stellenausschreibungs\s?nr[.]|stellenausschreibungs-nr[.]|stellenausschreibungsnr[.]|stellenangebots[ ]id|stellenangebots-id|stellenangebotsid|stellen[ ]id|stellen-id|stellenid|stellenreferenz|stellen-referenz|ref[.]st[.]nr[.]|stellennumer|\bst[.]-nr[.]\b|\bst[.][ ]nr[.]\b|kenn-nr[.]|positionsnummer|kennwort|stellenkey|stellencode|job-referenzcode|stellenausschreibung|bewerbungskennziffer|projekt[ ]id|projekt-id|reference[ ]number|reference[ ]no[.]|reference[ ]code|job[ ]code|job[ ]id|job[ ]vacancy[ ]no[.]|job-ad-number|auto[ ]req[ ]id|job[ ]ref|\bstellenausschreibung[ ]nr[.]\b):?(?:\w*)(?:\s*)([A-Z]*\s*)([!"#$%&'()*+,\-.\/:;<=>?#\[\]^_`{|}~]*(?:\w*[!"#$%&'()*+,\-.\/:;<=>?#\[\]^_`{|}~]*){3})?
After refactor :
(?:\s+|chiffre|job(?:-(?:id|nr[.]?|referenzcode|ad-number)|[ ](?:(?:nr|vacancy[ ]no)[.]|code|id|ref)|n(?:ummer|r[.])|id|code)|\b(?:job[ ]id|st(?:[.][ \-]|ellenausschreibung[ ])nr[.])\b|(?:bewerbungskenn)?ziffer|kenn(?:z(?:iffer|.)|-nr[.]|wort)|ref(?:eren(?:z(?:[ ](?:code|n(?:ummer|r[.]))|-?code|n(?:ummer|r[.]|ummer[ ]der[ ]stelle))?|ce[ ](?:n(?:umber|o[.])|code))|[.](?:[ ](?:nr[.]|id)|id|(?:-[ ]?|st[.])nr[.])|[ ]id)|stellen(?:referenz|a(?:usschreibung(?:en|s(?:\s?|-)?nr[.])?|ngebots[ \-]?id)|[ ]?id|-(?:id|referenz)|numer|key|code)|[ ]referenz-nr[.]|id[ ]?\#|p(?:ositionsnummer|rojekt[ \-]id)|auto[ ]req[ ]id):?\w*\s*[A-Z]*\s*(?:[!-/:-@\[\]-`{-~]*(?:\w*[!-/:-@\[\]-`{-~]*){3})?
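One way to gain confidence in the range form of the punctuation class is to compare it with the explicit list over all ASCII characters. This is a small sketch under the assumption that only ASCII input matters; the class PunctClassCheck and its members are hypothetical names.

```java
import java.util.regex.Pattern;

class PunctClassCheck {
    // Explicit list, with [ and ] escaped as Java requires inside a class.
    static final Pattern LISTED =
            Pattern.compile("[!\"#$%&'()*+,\\-./:;<=>?@\\[\\]^_`{|}~]");
    // The same set written as ASCII ranges.
    static final Pattern RANGES = Pattern.compile("[!-/:-@\\[\\]-`{-~]");

    // Returns the first ASCII code on which the two classes disagree, or -1 if none.
    static int firstDifference() {
        for (int c = 0; c < 128; c++) {
            String s = String.valueOf((char) c);
            if (LISTED.matcher(s).matches() != RANGES.matcher(s).matches()) {
                return c;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstDifference()); // -1: the two classes agree
    }
}
```

Neither form contains the backslash itself, so both differ from \p{Punct} by exactly that one character.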

Related

Regex not working properly in all cases

I am using a regex to find a word in a string. It works fine for alphanumeric keys but returns the wrong answer when the key contains arithmetic operators.
Matcher oMatcher;
Pattern oPattern;
String key = "a++";
oPattern = Pattern.compile("\\b" + key + "\\b");
oMatcher = oPattern.matcher("max winzer® build-a-chair cocktailsessel »luisa« in runder form, zum selbstgestalten");
if (oMatcher.find()) {
System.out.println("True");
}
You have to escape any potential regex special characters in key with Pattern.quote; otherwise a++ is parsed as the letter a followed by a possessive + quantifier:
oPattern = Pattern.compile("\\b" + Pattern.quote(key) + "\\b");
^^^^^^^^^^^^^
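One caveat with keys such as a++ is that \b after a + only matches when a word character follows, so "a++ " would still not be found. A possible variation, shown here as a sketch rather than as part of the original answer, replaces the \b pair with lookarounds that only forbid adjacent word characters. QuotedKeySearch and containsToken are hypothetical names.

```java
import java.util.regex.Pattern;

class QuotedKeySearch {
    // Finds key as a literal token that is not glued to a word character on either side.
    static boolean containsToken(String text, String key) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(key) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String key = "a++";
        // \b after the last + needs a word character next, so this prints false.
        System.out.println(Pattern.compile("\\b" + Pattern.quote(key) + "\\b")
                .matcher("use a++ here").find());
        System.out.println(containsToken("use a++ here", key));   // true
        System.out.println(containsToken("build-a-chair", key));  // false
    }
}
```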

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
This works fine. I first capture the part from To: up to Message-ID, which is the part I need, and then run a second pattern over it to extract the email ids.
Is there any better way to do this? Can we do it with one pattern matcher set?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com
Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Unfortunately, Java doesn't support variable length in lookbehinds. A workaround could be:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
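A minimal sketch of how the bounded lookbehind could be used from Java follows. Because the headers span several lines, it assumes Pattern.DOTALL so that . also matches line breaks; the class name, the extract method and the short sample headers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // "To:" must occur at most 1000 chars before the <, and "Message-ID" somewhere after the >.
    static final Pattern ADDRESS = Pattern.compile(
            "(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)", Pattern.DOTALL);

    // Hypothetical, shortened header block.
    static final String SAMPLE = "From: A <a@x.com>\r\n"
            + "To: B <b@x.com>\r\n"
            + "CC: C <c@x.com>,\r\n"
            + " D <d@x.com>\r\n"
            + "Message-ID: <id@x.com>\r\n"
            + "References: <r@x.com>";

    static List<String> extract(String headers) {
        List<String> found = new ArrayList<String>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE)); // [b@x.com, c@x.com, d@x.com]
    }
}
```

The 1000 limit is arbitrary: addresses further than that from To: are silently skipped, which is the main weakness of this workaround.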
I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
See IDEONE demo and a regex demo
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, as few as possible, up to the first occurrence of the next subpattern
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.

Intelligent String Parsing in java

I have an email subject line that I need to parse. I need to find the first occurrence of any word from a given list of words and get the next word, which can be separated by
('=' or ',' or ';' or 'blank' or '.').
For example:
list of word for customer ["customer","client","kunden","kd.nr."]
list of word for Order ["order","auftrag","auftragsnummer","auftragnr."]
separator : [= , ; .]
subjectline: Customer 2013ABC has send an Auftrag 2056899A for Motif=A
I need to parse the information like
customer=2013ABC
order=2056899A
Motif=A
I am using Java 7 so Scanner class can be used as well.
Thanks for any tips in advance
You can achieve this by using regular expressions. Here is some sample code:
Pattern p = Pattern.compile(".*(customer|client|kunden|kd\\.nr\\.)[=,;\\. ]*(\\w*).*(order|auftrag|auftragsnummer|auftragnr\\.)[=,;\\. ]*(\\w*).*[ ](.*)$", Pattern.CASE_INSENSITIVE);
String subject = "subjectline: kd.nr. 2013ABC has send an Auftrag 2056899A for Motif=A";
Matcher m = p.matcher(subject);
if(m.matches()) {
System.out.println(m.group(1) + " : " + m.group(2) );
System.out.println(m.group(3) + " : " + m.group(4));
System.out.println(m.group(5));
}
Hope this helps.
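A possible variation, sketched here under the assumption that each field can be looked up independently, builds one pattern per keyword list instead of one large pattern for the whole line. SubjectFieldParser and valueAfter are hypothetical names, and the code avoids lambdas so it stays within Java 7.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectFieldParser {
    // Returns the word following the first keyword found in the subject, or null.
    static String valueAfter(String subject, List<String> keywords) {
        StringBuilder alternatives = new StringBuilder();
        for (String keyword : keywords) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            // quote() keeps the dots in "kd.nr." literal
            alternatives.append(Pattern.quote(keyword));
        }
        // keyword not preceded by a word char, then one or more separators, then the value
        Pattern p = Pattern.compile(
                "(?<!\\w)(?:" + alternatives + ")[=,;. ]+(\\w+)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(subject);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String subject = "Customer 2013ABC has send an Auftrag 2056899A for Motif=A";
        List<String> customer = Arrays.asList("customer", "client", "kunden", "kd.nr.");
        List<String> order = Arrays.asList("order", "auftrag", "auftragsnummer", "auftragnr.");
        System.out.println("customer=" + valueAfter(subject, customer)); // customer=2013ABC
        System.out.println("order=" + valueAfter(subject, order));       // order=2056899A
    }
}
```

Because a separator is required after the keyword, the engine backtracks from auftrag to auftragsnummer when needed, so the order of the keywords in the list should not matter.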

Greedy matching in regex(java)

I'm trying to tokenize the input below using java regex. I believe my expression should greedily match the outer "exec" tokens in the program below.
@Test
public void test(){
String s = "exec(\n" +
" \"command #1\"\n" +
" ,\"* * * * *\" //cron string\n" +
" ,\"false\" eq exec(\"command #3\")) //condition\n" +
")\n" +
"\n" + //split here
"exec(\n" +
" \"command #2\" \n" +
" ,\"exec(\"command #4\") //condition\n" +
");";
List<String> matches = new ArrayList<String>();
Pattern pattern = Pattern.compile("exec\\s*\\(.*\\)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
matches.add(matcher.group());
}
System.out.println(matches);
}
I'm expecting output as
[exec(
"command #1"
,"* * * * *" //cron string
,"false" eq exec("command #3")) //condition
),exec(
"command #2"
,"exec("command #4") //condition
);]
but get
[exec("command #3")), exec("command #4")]
Could anyone please help me understand where I'm going wrong?
By default, the dot character . does not match newline characters, so here the "exec" pattern only matches when the opening exec( and a closing ) occur on the same line.
You can use Pattern.DOTALL to allow matching to be done on newline characters:
Pattern.compile("exec\\s*\\(.*\\)", Pattern.DOTALL);
Alternatively (?s) can be specified, which is equivalent:
Pattern.compile("(?s)exec\\s*\\(.*\\)");
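The effect of the flag can be seen on a shortened input. One thing to be aware of: with DOTALL the greedy .* runs up to the last ) in the whole input, so the two blocks come back as a single match. The sketch below also shows one hypothetical way to get the blocks separately, by splitting on the blank line between them; DotAllDemo, findAll, splitOnBlankLine and SAMPLE are illustrative names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Shortened stand-in for the input in the question: two blocks, one blank line between.
    static final String SAMPLE = "exec(\n  \"a\"\n)\n\nexec(\n  \"b\"\n);";

    static List<String> findAll(String regex, int flags, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Assumption: blocks are always separated by a blank line.
    static List<String> splitOnBlankLine(String input) {
        return Arrays.asList(input.split("\\n\\s*\\n"));
    }

    public static void main(String[] args) {
        String regex = "exec\\s*\\(.*\\)";
        System.out.println(findAll(regex, 0, SAMPLE).size());              // 0
        System.out.println(findAll(regex, Pattern.DOTALL, SAMPLE).size()); // 1
        System.out.println(splitOnBlankLine(SAMPLE).size());               // 2
    }
}
```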

How can I recognize the indefinite articles "a" or "an"?

My task is to devise a regular expression that recognizes the English indefinite article, i.e. the word “a” or the word “an”. I must test the expression by writing a test driver which reads a file containing approximately ten lines of text. The program should count the occurrences of the words “a” and “an”. It must not match the characters a and an inside words such as than.
This is my code so far:
import java.io.IOException;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexeFindText {
public static void main(String[] args) throws IOException {
// Input for matching the regexe pattern
String file_name = "Testing.txt";
ReadFile file = new ReadFile(file_name);
String[] aryLines = file.OpenFile();
String asString = Arrays.toString(aryLines);
// Regexe to be matched
String regexe = ""; //<<--this is where the problem lies
int i;
for ( i=0; i < aryLines.length; i++ ) {
System.out.println( aryLines[ i ] ) ;
}
// Step 1: Allocate a Pattern object to compile a regexe
Pattern pattern = Pattern.compile(regexe);
//Pattern pattern = Pattern.compile(regexe, Pattern.CASE_INSENSITIVE);
// case- insensitive matching
// Step 2: Allocate a Matcher object from the compiled regexe pattern,
// and provide the input to the Matcher
Matcher matcher = pattern.matcher(asString);
// Step 3: Perform the matching and process the matching result
// Use method find()
while (matcher.find()) { // find the next match
System.out.println("find() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
}
// Use method matches()
if (matcher.matches()) {
System.out.println("matches() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("matches() found nothing");
}
// Use method lookingAt()
if (matcher.lookingAt()) {
System.out.println("lookingAt() found the pattern \"" + matcher.group()
+ "\" starting at index " + matcher.start()
+ " and ending at index " + matcher.end());
} else {
System.out.println("lookingAt() found nothing");
}
}
}
What do I have to use to find those words within my text?
Here's the regex that will match "a" or "an":
String regex = "\\ban?\\b";
Let's break that regex down:
\b means word boundary (a single backslash is written as "\\" in a Java string literal); the one at each end keeps the a inside words such as than from matching
a is simply a literal "a"
n? means zero or one literal "n"
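For the counting part of the task, a minimal sketch could look like this. It assumes the text is already available as a String and that "An" at the start of a sentence should count too, hence the CASE_INSENSITIVE flag; ArticleCounter and count are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleCounter {
    static final Pattern ARTICLE =
            Pattern.compile("\\ban?\\b", Pattern.CASE_INSENSITIVE);

    // Counts standalone "a" / "an" words; letters inside longer words are ignored.
    static int count(String text) {
        Matcher m = ARTICLE.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "An", "a" and "an" count; "apple", "and", "banana" and "than" do not.
        System.out.println(count("An apple and a banana cost more than an orange.")); // 3
    }
}
```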
