Extract certain words from predefined sentence using regular expression

I have a seemingly simple task, but I have no experience with regular expressions.
I have to parse an SMS body with a predefined message text to extract certain information.
Here is one example:
Täname! {FirstName} {LastName} isikukoodiga {PersonCode} on sõlminud EMT Reisikindlustuse lepingu numbriga {PolicyNumber}, mis kehtib alates {CoverStartDate} kell {CoverStartTime} kuni {CoverEndDate} kell {CoverEndTime} (Eesti aja järgi). Hind: {PremiumEur} eurot. Tutvu tingimustega ({Terms}) http://emt.ee/kindlustus. Kahjukäsitluse number +3727330700.
I have to parse out everything that is in curly braces.
I came up with something like this in Java:
public static final String REGEX_CONFIRMATION = "Täname! (.*) (.*) isikukoodiga (.*) on sõlminud EMT Reisikindlustuse lepingu numbriga (.*), mis kehtib alates (.*) kell (.*) kuni (.*) kell (.*) \\(Eesti aja järgi\\). Hind: (.*) eurot. Tutvu tingimustega \\((.*)\\) http://emt.ee/kindlustus. Kahjukäsitluse number \\+3727330700.";
But it parses out only the following groups:
{MARIS}, {PLOTS}, {17204046521}, {22414152}, {01.10.2002}, {13:07},
{02.10.2002}, {23:59}.
As you can see, {Terms} is missing, and I can't figure out where the problem is.

How about using this pattern?
\{.*?\}

Wouldn't it make more sense to simply use
\{[^{}]*\}
as your regex? In a string, you would need to write that as
"\\{[^{}]*\\}"
Explanation:
\{ # Match an opening brace
[^{}]* # Match any number of characters except braces
\} # Match a closing brace
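A minimal Java sketch of how this pattern could be applied, assuming the goal is to list the placeholder names in a template. The class name, the method name and the shortened template text are made up for illustration, and a capturing group has been added around the brace contents so that the braces themselves are not returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracePlaceholders {
    // Same pattern as above, with a capturing group around the brace contents.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]*)\\}");

    // Returns the text found between each pair of braces, in order of appearance.
    static List<String> extract(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the template from the question.
        String template = "Täname! {FirstName} {LastName} isikukoodiga {PersonCode} on sõlminud lepingu.";
        System.out.println(extract(template)); // [FirstName, LastName, PersonCode]
    }
}
```

Note that this only finds text that is literally wrapped in braces. It does not recover the values from a real SMS in which the placeholders have already been filled in.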

http://www.java2s.com/Code/Java/Regular-Expressions/Findallmatches.htm
along with the following regex
\{(.*?)\}

Seems correct to me. Use the DOTALL (and in other cases maybe MULTILINE) options. DOTALL can be added as "(?s)Täname!...". Then ".*" also matches newline characters.
Since the earlier groups were found, a line break later in the message might be the cause.

Does it work when you include the parentheses in your {Terms} part?
Instead of:
String regex = "...Tutvu tingimustega \\((.*)\\) http://emt.ee/kindlustus. ...";
You could try:
String regex = "...Tutvu tingimustega (.*) http://emt.ee/kindlustus. ...";
Or, depending on what you have in the {Terms} string, you could change .* to [^)]*
This way you would match zero or more characters that are not a closing parenthesis.
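A small sketch of the [^)]* variant, restricted to the tail of the message so that it stays short. The class and method names are hypothetical, and "TK-1" and "3.20" are made-up sample values, since the question does not show what {Terms} or {PremiumEur} actually contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TermsGroup {
    // Tail of the confirmation pattern only; [^)]* cannot run past the closing parenthesis.
    private static final Pattern TERMS = Pattern.compile(
            "Tutvu tingimustega \\(([^)]*)\\) http://emt\\.ee/kindlustus\\.");

    // Returns the text inside the parentheses, or null if the sentence is not found.
    static String extractTerms(String sms) {
        Matcher m = TERMS.matcher(sms);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "TK-1" is a made-up value standing in for {Terms}.
        String sms = "Hind: 3.20 eurot. Tutvu tingimustega (TK-1) http://emt.ee/kindlustus. "
                + "Kahjukäsitluse number +3727330700.";
        System.out.println(extractTerms(sms)); // TK-1
    }
}
```

The literal dots are escaped here as well, so that "emt.ee" does not accidentally match other characters.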

Related

Removing Hashtag using Java WebFilter

I have the following configuration in the urlrewrite.xml:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE urlrewrite PUBLIC "-//tuckey.org//DTD UrlRewrite 4.0//EN" "http://www.tuckey.org/res/dtds/urlrewrite4.0.dtd">
<urlrewrite use-query-string="true">
<rule>
<from>^(/event/showEventList)(\.{1})(\bhtm\b|\bhtml\b)(\?{0,1})([a-zA-Z0-9-_=&]{0,}+)(#{0,1})([a-zA-Z0-9-_=&]{0,}+)$</from>
<to type="redirect" last="true">/events$4$5</to>
</rule>
</urlrewrite>
The regex ^(/event/showEventList)(\.{1})(\bhtm\b|\bhtml\b)(\?{0,1})([a-zA-Z0-9-_=&]{0,}+)(#{0,1})([a-zA-Z0-9-_=&]{0,}+)$ has 7 groups, which are:
(/event/showEventList): matches /event/showEventList
(\.{1}): matches a single dot (.)
(\bhtm\b|\bhtml\b): matches only htm or html
(\?{0,1}): matches a question mark (?), which may occur zero or one time
([a-zA-Z0-9-_=&]{0,}+): matches the query string, which may contain zero or more characters
(#{0,1}): matches a hash sign (#), which may occur zero or one time
([a-zA-Z0-9-_=&]{0,}+): matches the fragment, which may contain zero or more characters
If I test this configuration with a test URL: /event/showEventList.html?pageNumber=1#key=val, I am expecting that the redirected URL would be /events?pageNumber=1, but I am getting /events?pageNumber=1#key=val
I have a code snippet to test it, which is:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class UrlRewriterRegexTest {
public static void main(String[] args) {
String input = "/event/showEventList.html?pageNumber=1#key=val";
String regex = "^(/event/showEventList)(\\.{1})(\\bhtm\\b|\\bhtml\\b)(\\?{0,1})([a-zA-Z0-9-_=&]{0,}+)(#{0,1})([a-zA-Z0-9-_=&]{0,}+)$";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
System.out.println(matcher.replaceFirst("/events$4$5"));
}
}
It outputs to: /events?pageNumber=1.
Any pointer would be very helpful.

I'd simplify the expression a bit.
Escape slashes, as they are typically used as delimiters for the regex in other tools (\/event\/showEventList); Java itself does not require this
Remove superfluous quantifier (\.)
Shorten the html string test (htm(l)?) - careful, this messes with your capturing group numbers
Remove word boundary checks around html
Use ? instead of {0,1}
Use * instead of {0,}
Remove possessive quantifier (I don't see why you'd need it)
Ignore everything after #, you don't seem to need it in your replacement
This gives us ^(\/event\/showEventList)(\.)(htm(l)?)(\??)([a-zA-Z0-9-_=&]+)*#(.+)$ which substitutes your example to /events?pageNumber=1
To play around, see https://regexr.com/4otp7

I've simplified the expression and here is the working solution
<from>^(\/event\/showEventList\.html?)(\?[a-zA-Z0-9-_=&]*)\#.*$</from>
<to type="redirect" last="true">/events$2</to>
This matches the URL and keeps everything from the start of the query string up to the first occurrence of #.
Explanation:
Group 1: matches the URL /event/showEventList.html or /event/showEventList.htm
Group 2: matches the query string (zero or more characters after the ?) up to the first occurrence of #
Group 2 is the string to use in the redirect; everything from # onward, including the # itself, is ignored.
Example:
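A minimal Java sketch of the same idea, so it can be tried outside the filter. It is not the rule from this answer verbatim: the class and method names are hypothetical, and the pattern is a variation that makes both the query string and the fragment optional and moves the hyphen to the end of the character class.

```java
import java.util.regex.Pattern;

class EventUrlRewrite {
    // Group 1: optional query string; an optional "#fragment" is matched but not captured.
    private static final Pattern RULE = Pattern.compile(
            "^/event/showEventList\\.html?((?:\\?[a-zA-Z0-9_=&-]*)?)(?:#.*)?$");

    // Rewrites a matching URL to /events plus its query string; other URLs are returned unchanged.
    static String rewrite(String url) {
        return RULE.matcher(url).replaceFirst("/events$1");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/event/showEventList.html?pageNumber=1#key=val")); // /events?pageNumber=1
        System.out.println(rewrite("/event/showEventList.htm")); // /events
    }
}
```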

I am answering my own question so that, if someone else stumbles upon the same problem in the future, this answer can help them.
The problem has nothing to do with the UrlRewriteFilter framework. By enabling the debug log for this framework, I saw that the URL it receives before applying the defined rules does not contain the URL hash (#). From other SO answers and by analyzing the browser's network traffic, I saw that the browser does not send the URL fragment to the server, so it is not available in the HttpServletRequest. This is why the regular expressions are not working.
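To illustrate which part of the URL is affected, here is a small sketch using java.net.URI. It only splits a URL string into its components; the host localhost:8080 and the class and method names are assumptions for the example, and the code does not involve the servlet API at all.

```java
import java.net.URI;

class FragmentDemo {
    // Rebuilds the part of a URL that goes into the HTTP request: path plus query, never the fragment.
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getRawPath() + (query == null ? "" : "?" + query);
    }

    public static void main(String[] args) {
        // Hypothetical host; only the path, query and fragment matter here.
        String url = "http://localhost:8080/event/showEventList.html?pageNumber=1#key=val";
        System.out.println(URI.create(url).getFragment()); // key=val
        System.out.println(requestTarget(url)); // /event/showEventList.html?pageNumber=1
    }
}
```

The second line of output corresponds to what a server-side filter gets to see, which is why a rule that looks for # never fires.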
Since the hash is available in the client browser, I was able to solve the problem with JavaScript, thanks to the HTML5 History API:
<script type="text/javascript">
window.addEventListener('DOMContentLoaded', (event) => {
const url = new URL(window.location);
url.hash = '';
history.replaceState(null, document.title, url);
});
</script>

Regular Expression : Multiline check problem

Hello, I have a problem with a regexp for this text:
!
interface TenGigabitEthernet 1/49
description Uplink
no ip address
switchport
no shutdown
!
interface TenGigabitEthernet 1/50
no ip address
shutdown
!
interface TenGigabitEthernet 1/51
no ip address
shutdown
!
I tried the regexp (interface) ((.\s.)+), but it does not work because it matches "interface" and all the rest of the text.
I need to capture "interface" in the first group and, in the second, everything up to the first occurrence of "!".
For example:
first group:
interface
second group:
TenGigabitEthernet 1/51
no ip address
shutdown
How can I do this?

Try this:
(interface)\s+([^!]+)

Use this:
(interface)\s*([^!]+) /g
The first group captures the hard-coded interface. The second group skips any leading whitespace and then captures everything up to the next !. The global flag /g makes the regex return all matches.

If the content itself can contain a !, you could check for a ! at the start of the line and repeat matching all lines until you encounter a ! at the start.
^(interface)\s*(.*(?:\n(?!!).*)*)
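In Java (the (?m) flag is needed so that ^ matches at the start of every line, not only at the start of the input):
String regex = "(?m)^(interface)\\s*(.*(?:\\n(?!!).*)*)";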
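A runnable sketch of this approach, assuming the configuration uses \n line endings. The class and method names are made up for the example, and the pattern carries the (?m) flag, because in Java ^ otherwise matches only at the very start of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InterfaceBlocks {
    // (?m) lets ^ match at the start of every line; the lookahead (?!!) stops
    // the repetition before a line that begins with "!".
    private static final Pattern BLOCK =
            Pattern.compile("(?m)^(interface)\\s*(.*(?:\\n(?!!).*)*)");

    // Returns the second group (everything after "interface") for each block.
    static List<String> bodies(String config) {
        List<String> result = new ArrayList<>();
        Matcher m = BLOCK.matcher(config);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String config = "!\ninterface TenGigabitEthernet 1/50\nno ip address\nshutdown\n!\n"
                + "interface TenGigabitEthernet 1/51\nno ip address\nshutdown\n!\n";
        for (String body : bodies(config)) {
            System.out.println("[" + body + "]");
        }
    }
}
```

The find() loop plays the role of the /g flag used in the other answers.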

Looking for Correct Java REGEX for this kind of payload

I have the following two payloads, for which I am trying to write a Java regex:
Input Payload 1
ISA*00* *00* *ZZ*EXDO *ZZ*047336389 *150327*1007*U*00401*900063730*0*P*>~
GS*QM*EXDO*047336389*20150327*1007*900063730*X*004010~
ST*214*900063730~
B10*326GENT15173**EXDO~
L11*019*TN~
Input Payload 2
ISA*00* *00* *02*HJBT *01*047336389 *140103*1751*U*00401*000012003*0*P*>\
GS*QM*HJBT*047336389*20140103*1751*12003*X*004010\
ST*214*0001\
B10*117094*B065199*HJBT\
N1*SH*INTEVA PRODUCTS LLC-\
I have the following regex:
.*(ST\*214|ST\*210).*
I evaluated the regex at http://www.regexplanet.com/advanced/java/index.html
matches() reports NO for the first payload and YES for the second. I am looking for an updated regex that works for both.
My purpose is to validate the payload, just as String.contains would with the following approach:
payload.toString().contains("ST*214") || payload.toString().contains("ST*210")
I want to use a regex instead of String.contains here.

"(?s).*(ST\\*214|ST\\*210).*"
In Java you need to enable DOTALL mode so that . also matches line terminators. You can do this by prefixing the pattern with the (?s) modifier. Without it, your pattern could match only within the single line ST*214*900063730~ of the first payload, while matches() requires the whole input to match.
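A short sketch comparing the two ways of getting contains-like behaviour from a regex. The class and method names are hypothetical, the payload is a three-line excerpt of the first input with \n line endings assumed, and the alternation is written in the compact form ST\*21[04].

```java
import java.util.regex.Pattern;

class PayloadCheck {
    // Unanchored search: find() succeeds if the segment occurs anywhere, like String.contains.
    private static final Pattern ST_SEGMENT = Pattern.compile("ST\\*21[04]");

    // Whole-input match: needs (?s) so that .* can cross line terminators.
    private static final Pattern WHOLE = Pattern.compile("(?s).*ST\\*21[04].*");

    static boolean containsSegment(String payload) {
        return ST_SEGMENT.matcher(payload).find();
    }

    static boolean matchesWhole(String payload) {
        return WHOLE.matcher(payload).matches();
    }

    public static void main(String[] args) {
        String payload = "GS*QM*EXDO*047336389*20150327*1007*900063730*X*004010~\n"
                + "ST*214*900063730~\n"
                + "B10*326GENT15173**EXDO~\n";
        System.out.println(payload.matches(".*(ST\\*214|ST\\*210).*")); // false
        System.out.println(matchesWhole(payload));    // true
        System.out.println(containsSegment(payload)); // true
    }
}
```

With find() there is no need for the leading and trailing .* at all, which also avoids the line-terminator issue.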

Use the following regexp:
".*(ST\*214|ST\*210).*"
I tested your two strings with the following code:
public class RegTest {
public static void main (String[] args) {
String test1 = "ISA*00* 00 ZZEXDO *ZZ*047336389*150414*1108*U*00401*979863647*0*P*>~ GSQMEXDO*047336389*20150414*1108*979863647*X*004010~ ST*214*979863647~ B10*186143**EXDO~";
String test2 = "ISA*00* 00 *02*HJBT *01*047336389*140103*1751*U*00401*000012003*0*P*>\\GSQMHJBT*047336389*20140103*1751*12003*X*004010\\ST*214*0001\\B10*117094*B065199*HJBT\\N1*SH*INTEVA PRODUCTS LLC-\\";
if (test1.matches(".*(ST\\*214|ST\\*210).*")) {
System.out.println("String1 matches");
}
if (test2.matches(".*(ST\\*214|ST\\*210).*")) {
System.out.println("String2 matches");
}
}
}
A small correction: the regexp shown at the top of this answer lost two '\' characters in formatting. As a Java string literal it must be written with \\*, so use the regexp from the code.

I think you are trying to match the literal '*' character, so you should escape it with a backslash:
.*(ST\*214|ST\*210).*
or
.*ST\*(214|210).*
or
.*ST\*21(4|0).*
or
.*ST\*21[40].*
Are the line feeds part of your payload, or just formatting?

Replacing a space and some other character in Java

Why does this code not work?
public static void main(String[] args) {
String s = "You need the new version for this. Please update app ...";
System.out.println(s.replaceAll(". ", ".\\\\n").replaceAll(" ...", "..."));
}
This is my wanted output:
You need the new version for this.\nPlease update app...
Thanks for the information

String.replaceAll takes a regex as its first argument.
So you need to escape the dot (.), which has a special meaning in a regex: it matches any character.
System.out.println(s.replaceAll("\\. ", ".\\\\n").replaceAll(" \\.\\.\\.", "..."));
However, for your given input you can simply use String.replace, which treats its arguments literally rather than as a regex, so no escaping is needed.

. is a special regex character that matches any character. You need to escape it like this: \\.
So to match three dots you must use the following regex: "\\.\\.\\."
What you want is
s.replaceAll("\\. ", ".\n").replaceAll(" \\.\\.\\.", "...")

You shouldn't be using replaceAll - use replace instead. replaceAll takes a regular expression when it is not needed here (and hence it will be unnecessarily inefficient).
String s = "You need the new version for this. Please update app ...";
System.out.println(s.replace(". ", ".\\n").replace(" ...", "..."));
(Also note that I've replaced ".\\\\n" with ".\\n" here, which produces the desired output.)
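A minimal sketch putting the literal and the regex-based variants side by side. The class and method names are made up; Pattern.quote and Matcher.quoteReplacement are used in the second method to make replaceAll treat both arguments literally. The desired output is assumed to contain a literal backslash followed by n, as in the question, not a real line break.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    // Literal replacement: String.replace does not interpret regex syntax.
    static String withReplace(String s) {
        return s.replace(". ", ".\\n").replace(" ...", "...");
    }

    // Same result through replaceAll, quoting both the pattern and the replacement.
    static String withQuotedReplaceAll(String s) {
        return s.replaceAll(Pattern.quote(". "), Matcher.quoteReplacement(".\\n"))
                .replaceAll(Pattern.quote(" ..."), Matcher.quoteReplacement("..."));
    }

    public static void main(String[] args) {
        String s = "You need the new version for this. Please update app ...";
        // Both lines print: You need the new version for this.\nPlease update app...
        System.out.println(withReplace(s));
        System.out.println(withQuotedReplaceAll(s));
    }
}
```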

Try:
System.out.println(s.replace(". ", ".\n").replace(" ...", "..."));
This gives:
You need the new version for this.
Please update app...

Capturing dot and comma in Java RegExp

I have the following code in Java:
Pattern fieldsPattern = Pattern.compile("(\"([^\"]+)\")|"
+"("+this.field_tag+"([0-9a-zA-Z_]+))");
Matcher fieldsMatcher = fieldsPattern.matcher(field);
while(fieldsMatcher.find())
{
//...
}
This code should capture expressions like "expression" and :expression (field_tag is just ":"). The problem occurs when I try to capture an expression like "10.1" or "10,1": it doesn't work.
But the expressions
"10-1",
"10+1"
work as expected.
I also tried this regexp on regexpal.com, a site that uses the JavaScript implementation of RegExp. There, expressions like "10.1" and "10,1" work fine.
Is there any difference between Java and JavaScript in capturing dots? What am I doing wrong?

This works for me
Pattern fieldsPattern = Pattern.compile("(\"[^\"]+\")");
String field =" aa \"10\" \"10.1\" and \"10,1\"";
Matcher fieldsMatcher = fieldsPattern.matcher(field);
while(fieldsMatcher.find()) {
System.out.println(fieldsMatcher.group());
}
prints
"10"
"10.1"
"10,1"
The second set of parentheses in the regex appears to be redundant, but is harmless.
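For completeness, a sketch that keeps both alternatives from the question's pattern. FIELD_TAG is a hypothetical constant standing in for this.field_tag, wrapped in Pattern.quote in case the tag ever contains regex metacharacters; the class and method names are likewise made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldCapture {
    // Hypothetical stand-in for this.field_tag from the question.
    private static final String FIELD_TAG = ":";

    // Group 2: contents of a quoted expression; group 4: name after the tag.
    private static final Pattern FIELDS = Pattern.compile(
            "(\"([^\"]+)\")|(" + Pattern.quote(FIELD_TAG) + "([0-9a-zA-Z_]+))");

    // Collects the captured expression of every match, whichever alternative matched.
    static List<String> capture(String field) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELDS.matcher(field);
        while (m.find()) {
            out.add(m.group(2) != null ? m.group(2) : m.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(capture("\"10.1\" :price \"10,1\" \"10-1\"")); // [10.1, price, 10,1, 10-1]
    }
}
```

In this sketch the dot and the comma inside the quotes are captured like any other character, which suggests the problem in the question lies outside the pattern itself.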
