Why isn't this regexp backtrack working

I have tried to use the following kind of regex:
([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))|(FakeEmail:)|(Email:)|(\1\2)|(\1\3)
(Pretend that \1 is the email regex group, \2 is FakeEmail: and \3 is Email:; I didn't count the parentheses to work out the real group numbers.)
What I am trying to do is say "Find the word Email: and, if you find it, pick up any email address following the word."
I got that email regex from another question on Stack Overflow.
My test string could be something like:
"This guy is spamming me from
FakeEmail: fakeemailAdress#someplace.com
but here is is real info:
Email: testemail#someplace.com"
Any tips? Thanks

I'm either quite confused about what you're trying to do, or your regex is just very wrong. In particular:
Why do you have Email: at the end instead of at the beginning, where it appears in your example?
Why are Email: and \1\2 separated by pipe characters, almost as if they were fields? The pipes compile the pattern as alternatives (ORs): find the email pattern, OR the word "Email:", OR whatever \1\2 ends up meaning, which is unclear out of context.
If all you're trying to do is match something like Email: testemail#someplace.com, you don't need any backreferences.
Something like this is probably all you need:
Email:\s+([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
Also, I'd strongly advise against trying to validate an email address so strictly. You may want to read http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx . I'd simplify the pattern to something more along the lines of:
Email:\s+(\S+)#(\S+\.\S+)

Try:
(Fake)?Email: *([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
Captured group \1 will be unset (null in Java) if it's a real email and contain "Fake" if it's a fake email, while \2 will be the email itself.
Do you actually want to capture it if it's FakeEmail, though? If you want to capture every Email but ignore every FakeEmail, use:
\bEmail: *([_a-z0-9-]+(\.[_a-z0-9-]+)*#[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4}))
The word boundary prevents the Email part from matching inside "FakeEmail".
UPDATE: note that your regex only matches lowercase letters, since the character classes contain a-z but not A-Z. Make sure you compile the regex in Java with the case-insensitive flag, i.e.:
Pattern.compile("(Fake)?Email: .....", Pattern.CASE_INSENSITIVE)
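The two patterns above can be tried side by side. This is only a minimal sketch: the class and method names (EmailLabelDemo, findAll, findReal) are made up for illustration, and it keeps the # separator used in the question's sample addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailLabelDemo {
    // Address sub-pattern copied from the answer; '#' separates user and host as in the question.
    private static final String ADDRESS =
            "([_a-z0-9-]+(\\.[_a-z0-9-]+)*#[a-z0-9-]+(\\.[a-z0-9-]+)*(\\.[a-z]{2,4}))";

    // Group 1 is "Fake" or null (group did not participate); group 2 is the address.
    private static final Pattern LABELLED =
            Pattern.compile("(Fake)?Email: *" + ADDRESS, Pattern.CASE_INSENSITIVE);

    // \b keeps "Email" from matching inside "FakeEmail"; group 1 is the address.
    private static final Pattern REAL_ONLY =
            Pattern.compile("\\bEmail: *" + ADDRESS, Pattern.CASE_INSENSITIVE);

    /** Returns every labelled address, prefixed with "fake:" or "real:". */
    static List<String> findAll(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LABELLED.matcher(text);
        while (m.find()) {
            String kind = (m.group(1) == null) ? "real" : "fake";
            out.add(kind + ":" + m.group(2));
        }
        return out;
    }

    /** Returns only the addresses labelled with a plain "Email:". */
    static List<String> findReal(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = REAL_ONLY.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This guy is spamming me from\n"
                + "FakeEmail: fakeemailAdress#someplace.com\n"
                + "but here is is real info:\n"
                + "Email: testemail#someplace.com";
        System.out.println(findAll(text));  // [fake:fakeemailAdress#someplace.com, real:testemail#someplace.com]
        System.out.println(findReal(text)); // [testemail#someplace.com]
    }
}
```

Checking group(1) against null, rather than against an empty string, is what distinguishes the two labels here.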

You can use the following code to match these kinds of email addresses:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "This guy is spamming me from\n" +
"FakeEmail: fakeemail+Adress#someplace.com\n" +
"fakeEmail: \n" +
"fakeemail#someplace.com\n" +
"but here is is real info:\n" +
"Email: test.email+info#someplace.com\n";
// (?i) ignores case; \\s* also matches the line break between a label and its address
Matcher m = Pattern.compile("(?i)Email:\\s*([_a-z\\d\\+-]+(\\.[_a-z\\d\\+-]+)*#[a-z\\d-]+(\\.[a-z\\d-]+)*(\\.[a-z]{2,4}))").matcher(text);
while (m.find())
    System.out.printf("Email is [%s]%n", m.group(1));
This will match email text:
appearing on a different line from its label, because \s* also matches line breaks (a (?s) flag is not needed: it only changes how . behaves, and this pattern has no unescaped .)
in any letter case, by using (?i)
with a period . in the address
with a plus sign + in the address
OUTPUT of the above code:
Email is [fakeemail+Adress#someplace.com]
Email is [fakeemail#someplace.com]
Email is [test.email+info#someplace.com]

Related

Regex Replacing issue understanding

I'm trying to program replacement logic for invalid phone numbers, which I provide with a Map.
I read through a few regex threads, but I don't know if this is actually possible.
Example:
Input phone number: +410712345678
Regex I'm trying to use:
"^\\+(?:[0-9] ?){6,14}[0-9]$"
The number after regex filtering should be +41712345678, i.e. with the first instance of 0 removed.
Second example:
Input phone number: +41(071)2345678
Regex I'm trying to use:
"^\\+(?:[0-9] ?)\\({0,3}\\){3,11}[0-9]$"
The number after regex filtering should be +41712345678, i.e. with the first instance of 0 and the parentheses removed.
I'm trying to use some kind of pattern to automatically remove those invalid pieces from the phone numbers. The numbers need to be formatted that way to work with my VoIP application.
Is there any way to create a filter pattern like that with regex?

It seems you should apply that rule only to Swiss phone numbers, i.e. to +41 numbers, because simply removing the first 0 from any international number is wrong.
So: ph = ph.replaceFirst("^(\\+41)\\(?0?([0-9]{2})\\)?", "$1$2");
See regex101 for how it works.
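As a rough, self-contained check of that replacement, the sketch below wraps it in a helper. The class name SwissNumberFilter and the method normalize are hypothetical; only the pattern and the "$1$2" replacement come from the answer.

```java
import java.util.regex.Pattern;

final class SwissNumberFilter {
    // ^(\+41)      keep the country code (group 1)
    // \(?0?        drop an optional "(" and an optional trunk 0
    // ([0-9]{2})   keep the two-digit area code (group 2)
    // \)?          drop an optional ")"
    private static final Pattern TRUNK_ZERO =
            Pattern.compile("^(\\+41)\\(?0?([0-9]{2})\\)?");

    /** Returns the number with the trunk 0 and parentheses removed; other numbers are returned unchanged. */
    static String normalize(String phone) {
        return TRUNK_ZERO.matcher(phone).replaceFirst("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(normalize("+410712345678"));   // +41712345678
        System.out.println(normalize("+41(071)2345678")); // +41712345678
        System.out.println(normalize("+41712345678"));    // +41712345678 (already clean)
        System.out.println(normalize("+4915112345678"));  // +4915112345678 (not +41, untouched)
    }
}
```

Because the pattern is anchored on +41, numbers from other countries pass through unchanged.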

Thank you for your answer.
I applied the Regex to my TestImport with the following code:
//...
log.debug("Applying Regex :" + SearchString + " with Replace: " + ReplaceString);
log.debug("Applying Regex for Number:" + Person.get(EPerson.Rufnummer));
Person.put(EPerson.Rufnummer, Person.get(EPerson.Rufnummer).replaceFirst(SearchString, ReplaceString));
log.debug("New Number is:" +Person.get(EPerson.Rufnummer));
log.debug("Applying Regex for Number:" + Person.get(EPerson.RufnummerMobil));
Person.put(EPerson.RufnummerMobil, Person.get(EPerson.RufnummerMobil).replaceFirst(SearchString, ReplaceString));
log.debug("New Number is:" +Person.get(EPerson.RufnummerMobil));
//...
DEBUG [AddressbookFactory] Applying Numberfilter to: {Vorname=Testinator, Nachname=Test, Rufnummer=+410717271818, RufnummerMobil=, RufnummerPrivat=+41(071)7271818, Fax=, Strasse=, PLZ=, Stadt=, Bundesland=, Email=, Firma=, URL=}
DEBUG [AddressbookFactory] Regex Detected
DEBUG [AddressbookFactory] Applying Regex :^(\+41)\(?0?([0-9]{2})\)? with Replace: $1$2
DEBUG [AddressbookFactory] Applying Regex for Number:+410717271818
DEBUG [AddressbookFactory] New Number is: +41717271818
DEBUG [AddressbookFactory] Applying Regex for Number:+41(071)7271818
DEBUG [AddressbookFactory] New Number is: +41717271818
...
And it worked!
Thank you so much for your quick response!
I marked your answer as useful, but because of my "newbie" reputation the vote is not shown.
This question is resolved.
Sincerely, Fabian95qw

Set RegEx in Java to be non-greedy by default

I have Strings like the following:
"parameter: param0=true, param1=401230 param2=asset client: desktop"
"parameter: param0=false, param1=15230 user: user213 client: desktop"
"parameter: param0=false, param1=51235 param2=asset result: ERROR"
The pattern is parameter:, then the params, and after the params client: and/or user: and/or result:.
I want to match the stuff between parameter: and the first occurrence of either client:, user: or result:
So for the 2nd String it should match param0=false, param1=15230.
My regex is:
parameter:\s+(.*)\s+(result|client|user):
But now if I match the 2nd String it captures param0=false, param1=15230 user: user213 (looks like regex is matching greedy)
How to fix this? parameter:\s+(.*)\s+(result|client|user)+?: won't fix it
With this regex tester I can add the modifier U to the regex to make regex lazy by default, is this possible in Java too?

Try putting the ? character inside the first captured group (the subpattern you intend to extract):
parameter:\\s+(.*?)\\s+(result|client|user):

No, Java has no "ungreedy" flag. You have to put a ? after each quantifier that you want to be lazy (reluctant).
The following pattern marks every quantifier with a ?, although for these inputs only the .* has to be lazy:
"parameter:\\s+?(.*?)\\s+?(result|client|user):"
See the Pattern documentation: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
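To see the difference on the sample lines from the question, here is a small sketch that runs the greedy and the lazy version next to each other. The class and method names (LazyParams, extract, extractGreedy) are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyParams {
    // Lazy: (.*?) stops at the first " client:", " user:" or " result:".
    private static final Pattern LAZY =
            Pattern.compile("parameter:\\s+(.*?)\\s+(result|client|user):");

    // Greedy: (.*) runs to the end of the line and backtracks to the LAST keyword.
    private static final Pattern GREEDY =
            Pattern.compile("parameter:\\s+(.*)\\s+(result|client|user):");

    static String extract(String line) {
        Matcher m = LAZY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    static String extractGreedy(String line) {
        Matcher m = GREEDY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "parameter: param0=false, param1=15230 user: user213 client: desktop";
        System.out.println(extract(line));       // param0=false, param1=15230
        System.out.println(extractGreedy(line)); // param0=false, param1=15230 user: user213
    }
}
```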

Regex to find hostname from Jdbc url

I am new to regex. I would like to retrieve the hostname from a PostgreSQL JDBC URL using regex.
Assume the PostgreSQL URL is jdbc:postgresql://production:5432/dbname. I need to retrieve "production", which is the hostname. I want to do this with a regex and not with Java's split function. I tried:
Pattern PortFinderPattern = Pattern.compile("[//](.*):*");
final Matcher match = PortFinderPattern.matcher(url);
if (match.find()) {
System.out.println(match.group(1));
}
But it's matching all the string from hostname till the end.

Pattern PortFinderPattern = Pattern.compile(".*://([^:]+).*");

regex without grouping :
"(?<=//)[^:]*"

[//]([\\w\\d\\-\\.]+):
Should be enough to find it reliably. Though this is probably a better regex:
The Hostname Regex

There are some errors in your regex:
[//] - This matches only one character, because [] denotes a character class, so it will not match the full //. To match both slashes, write [/][/] or \/\/ (in Java a plain // also works, since / is not a special character).
(.*) - This matches all characters to the end of the line. You need to be more specific if you want to stop at a certain character. For example, you can stop at the colon by matching only characters that are not colons: ([^:]*).
:* - This matches zero or more colons, so the colon is optional. I guess you forgot to put a dot (any character) after the colon, like this: :.*.
So here is your regex corrected: \/\/([^:]*):.*.
Hope this helps.
BTW, if the port number after production (:5432) is optional, I suggest the following regex, which excludes both : and / from the hostname:
\/\/([^/:]*)(?::\d+)?\/
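Put together, the points above amount to matching both slashes literally and then capturing everything up to the next colon or slash. A minimal sketch, assuming a URL of the jdbc:postgresql://host[:port][/db] shape; JdbcHost and hostOf are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdbcHost {
    // "//" followed by the host: one or more characters that are neither ':' nor '/'.
    private static final Pattern HOST = Pattern.compile("//([^/:]+)");

    /** Returns the hostname, or null if the URL contains no "//host" part. */
    static String hostOf(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hostOf("jdbc:postgresql://production:5432/dbname")); // production
        System.out.println(hostOf("jdbc:postgresql://production/dbname"));      // production
        System.out.println(hostOf("jdbc:postgresql://production"));             // production
    }
}
```

Using find() instead of matches() means the pattern does not have to describe the rest of the URL.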

To also capture Oracle and MySQL JDBC URL variants with their quirks (e.g. Oracle allows @ instead of //, or even @//), I use this regexp to get the hostname: [/@]+([^:/@]+)([:/]+|$). The hostname is then in group 1.
Code, e.g.:
String jdbcURL = "jdbc:oracle:thin:@//hostname:1521/service.domain.local";
Pattern hostFinderPattern = Pattern.compile("[/@]+([^:/@]+)([:/]+|$)");
final Matcher match = hostFinderPattern.matcher(jdbcURL);
if (match.find()) {
    System.out.println(match.group(1));
}
This works for all these URLs (and other variants):
jdbc:oracle:thin:@//hostname:1521/service.domain.local
jdbc:oracle:thin:@hostname:1521/service.domain.local
jdbc:oracle:thin:@hostname/service.domain.local
jdbc:mysql://localhost:3306/sakila?profileSQL=true
jdbc:postgresql://production:5432/dbname
jdbc:postgresql://production/
jdbc:postgresql://production
This assumes that:
The hostname comes after // or @ or a combination thereof (a single / would also work, but I don't think JDBC allows that).
After the hostname, either : or / or the end of the string follows.
Note that the + quantifiers are greedy; this is especially important for the middle one.

java regexp for reluctant matching

I need to find an expression for the following problem:
String given = "{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"answer 4\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"answer 5\"}";
What I want to get: "{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"*******\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"******\"}";
What I am trying:
String regex = "(.*answer\"\\s:\"){1}(.*)(\"[\\s}]?)";
String rep = "$1*****$3";
System.out.println(given.replaceAll(regex, rep));
What I am getting:
"{ \"questionID\" :\"4\", \"question\":\"What is your favourite hobby?\",\"answer\" :\"answer 4\"},{ \"questionID\" :\"5\", \"question\" :\"What was the name of the first company you worked at?\",\"answer\" :\"******\"}";
Because of the greedy behaviour, the first group catches both "answer" parts, whereas I want it to stop after finding enough, perform replacement, and then keep looking further.

The pattern
("answer"\s*:\s*")(.*?)(")
Seems to do what you want. Here's the escaped version for Java:
(\"answer\"\\s*:\\s*\")(.*?)(\")
The key here is to use (.*?) to match the answer and not (.*). The latter matches as many characters as possible, the former will stop as soon as possible.
The above pattern won't work if there are escaped double quotes (\") in the answer. Here's a more complex version that allows them by requiring that the character before the closing quote is not a backslash:
("answer"\s*:\s*")((.*?)[^\\])?(")
You'll have to use $4 instead of $3 in the replacement pattern.
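For reference, the simpler three-group pattern can be exercised like this. It is a sketch only: AnswerMask and mask are made-up names, the sample input is shortened, and it assumes the answers contain no double quotes.

```java
import java.util.regex.Pattern;

final class AnswerMask {
    // Group 1: the key and opening quote; group 2: the answer (lazy); group 3: the closing quote.
    private static final Pattern ANSWER =
            Pattern.compile("(\"answer\"\\s*:\\s*\")(.*?)(\")");

    /** Replaces every answer value with five asterisks, keeping the surrounding text. */
    static String mask(String json) {
        return ANSWER.matcher(json).replaceAll("$1*****$3");
    }

    public static void main(String[] args) {
        String given = "{ \"questionID\" :\"4\",\"answer\" :\"answer 4\"},"
                + "{ \"questionID\" :\"5\",\"answer\" :\"answer 5\"}";
        // Prints: { "questionID" :"4","answer" :"*****"},{ "questionID" :"5","answer" :"*****"}
        System.out.println(mask(given));
    }
}
```

Because (.*?) is lazy, each match ends at the first closing quote, so replaceAll handles every record separately instead of only the last one.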

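The following regex works for me:
String regex = "(?<=answer\"\\s:\")(answer.*?)(?=\"})";
String rep = "*****";
String masked = given.replaceAll(regex, rep);
Note that it only matches values that begin with answer, as in the sample data. The \ and " might be incorrectly escaped, since I tested without Java.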
http://regexr.com?303mm

match a string of characters between tags:

I have the following strings:
<PAUL SAINT-KARL 1997-05-07>
<BOB DEAN 2001-05-07>
<GUY JEDDY 2007-05-07>
I want a java regex that would match this type of pattern "name and date" and then extract the name and date separately.
I am able to match them separately with the following Java regexes:
1) (\d{4}-\d{2}-\d{2})>
2) <([ A-Z&#;0-9-]*+)
What I'm looking for is one regex that would identify the full text pattern as provided, and then extract the subsections, such as the actual name, and the date.
I'm looking to use Matcher.group() to retrieve the complete match from the target string.
Thanks

Try this:
"<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>"
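I changed the *+ to *? so that the * matches lazily. In a combined pattern, the possessive *+ would never backtrack: because the character class also contains the space, the digits and the hyphen, it would swallow the date and leave nothing for the rest of the pattern to match.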
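A short sketch of how the combined pattern might be used with Matcher.group(), which the question asks about. TagParser and parse are hypothetical names; the pattern is the one from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagParser {
    // Group 1: the name (lazy, so it stops before the date); group 2: the date.
    private static final Pattern TAG =
            Pattern.compile("<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>");

    /** Returns {whole match, name, date}, or null if the text contains no such tag. */
    static String[] parse(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("<PAUL SAINT-KARL 1997-05-07>");
        System.out.println(parts[0]); // <PAUL SAINT-KARL 1997-05-07>
        System.out.println(parts[1]); // PAUL SAINT-KARL
        System.out.println(parts[2]); // 1997-05-07
    }
}
```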
