Exclude regex in java - java

I have this line, which takes a regex
and matches it against the value from the response:
Matcher m = Pattern.compile("(" + elem.get("urlRegex").getAsString() + ")").matcher(response);
And here is the value of elem.get("urlRegex").getAsString():
https?://(www\.)?facebook\.com/(?!(i|bussiness|legal|dialog|sharer|share\.phpr|tr|business|platform|help|ads|policies|selfxss|audiencenetwork)$)([a-zA-Z0-9_\-]|(\.))+
And response is an HTTPS response.
This regex should match anything like
https://www.facebook.com/testaksdflasfjasldf
https://www.facebook.com/rqwerpoiqwern
https://www.facebook.com/gbjkdasjasdfuiew
And it shouldn't match anything like
https://www.facebook.com/i
https://www.facebook.com/bussiness
https://www.facebook.com/legal
https://www.facebook.com/sharer
But it matches both groups, so the exclusion doesn't work.
I debugged it on regex101, and it works there.
Edit 1:
I removed $ from the exclusion and it works.
But because i is in the exclusion group,
the regex no longer matches anything like
https://www.facebook.com/intel
https://www.facebook.com/inscanasdas
https://www.facebook.com/iasdasdasd
Edit 2:
I tested code similar to mine with this regex on https://www.jdoodle.com/online-java-compiler/
and the regex works.

You have a few mistakes in your regex:
Escaping only needs a single backslash, not two.
All characters with special meaning in regex (like ?, (, ), .) need to be escaped.
The last part of your regex was wrong.
Use this:
https\?://\(www\.\)\?facebook\.com/(?!(i|bussiness|legal|dialog|sharer|share\.phpr|tr|business|platform|help|ads|policies|selfxss|audiencenetwork)$)[a-zA-Z0-9_\-]+
Demo
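One likely reason the exclusion appears to do nothing in the Java code: the pattern is searched inside a whole response, where each URL is followed by more text, so the $ inside the lookahead never holds and nothing gets excluded. Dropping $ overcorrects, because i then also rejects intel. A minimal sketch of a middle path follows, in which a reserved word is rejected only when it is not followed by another path character. The class name, the method name, and the shortened reserved-word list are illustrative assumptions, not the asker's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FacebookUrlFinder {
    // Shortened, illustrative subset of the reserved first path segments.
    private static final String RESERVED = "i|legal|sharer|business|help";

    // Characters allowed in the profile part of the URL.
    private static final String PATH_CHAR = "[a-zA-Z0-9_.\\-]";

    // A reserved word is rejected only when it is the whole segment,
    // i.e. when it is NOT followed by another path character.
    private static final Pattern PROFILE_URL = Pattern.compile(
            "https?://(?:www\\.)?facebook\\.com/"
            + "(?!(?:" + RESERVED + ")(?!" + PATH_CHAR + "))"
            + PATH_CHAR + "+");

    // Collects every non-reserved profile URL found anywhere in the text.
    static List<String> findProfileUrls(String response) {
        List<String> urls = new ArrayList<>();
        Matcher m = PROFILE_URL.matcher(response);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Calling FacebookUrlFinder.findProfileUrls(response) on text that contains .../i, .../legal, and .../intel should keep only the .../intel URL, because only there is the reserved word followed by further path characters.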

Related

Regex pattern matching is getting timed out

I want to split an input string based on a regex pattern using the Pattern.split(String) API. The regex uses both positive and negative lookarounds. It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
The pattern also does not detect the first delimiter at [STIFFENER]","QH20426AD3 in the string above. But if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex detects it.
I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid timeouts?
Any pointers, article links are appreciated!
If you want to split on a comma only when it is directly followed by a string that runs from an opening to a closing double quote:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ to escape any char using the . and again match any chars other than " and \
) Close lookahead
Regex demo | Java demo
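As a hedged alternative to splitting, the same quoted-field sub-pattern can be used with Matcher.find() to collect the fields directly, which needs no lookaround at all. This is only a sketch under the assumption that every field is wrapped in double quotes, as in the sample input. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFieldScanner {
    // One quoted field: an opening quote, then runs of ordinary characters
    // separated by backslash-escaped characters, then the closing quote.
    private static final Pattern QUOTED_FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Returns each quoted field (quotes included) in order of appearance.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED_FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

Each alternative inside the field pattern starts with a different character (an ordinary character or a backslash), so the matcher has very little to backtrack over. The split-based version from the answer follows.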
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts) {
    System.out.println(part);
}
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

Regex match for string literal including escape sequence

This works just fine for normal string literal ("hello").
"([^"]*)"
But I also want my regex to match literal such as "hell\"o".
This is what I have been able to come up with, but it doesn't work.
("(?=(\\")*)[^"]*")
Here I have tried to look ahead for \".
How about
Pattern.compile("\"((\\\\\"|[^\"])*)\"")
where, inside the Java string literal,
\" matches a literal ",
\\\\ matches a literal \, and
\\\\\" matches the two-character sequence \".
or
Pattern.compile("\"((?:\\\\\"|[^\"])*)\"")
if you don't want to add more capturing groups.
This regex accepts \" or any character other than " between the quotation marks.
Demo:
String input = "ab \"cd\" ef \"gh \\\"ij\"";
Matcher m = Pattern.compile("\"((?:\\\\\"|[^\"])*)\"").matcher(input);
while (m.find()) {
    System.out.println(m.group(1));
}
Output:
cd
gh \"ij
Use this method:
"((?:[^"\\\\]*|\\\\.)*)"
[^"\\\\]* no longer matches \ either, while the other alternative matches any escaped character.
Try with this one:
Pattern pattern = Pattern.compile("\"((?:\\\\\"|[^\"])*)\"");
\\\\\" matches the two-character sequence \", and
[^\"] matches any character except ".
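To make the escaped-character idea from the answers above concrete, here is a minimal sketch that treats a backslash followed by any character as a single unit, so both \" and \\ inside a literal are handled. The class and method names are illustrative assumptions, and the pattern is one possible variant rather than any answerer's exact regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringLiteralFinder {
    // Opening quote, then any mix of ordinary characters (not a quote, not a
    // backslash) or backslash + any character, then the closing quote.
    private static final Pattern LITERAL =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the contents of every string literal, without the outer quotes.
    static List<String> contents(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LITERAL.matcher(source);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }
}
```

Because a backslash can only be consumed together with the character after it, an escaped quote can never be mistaken for the closing quote.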

Java regex: need one regex to match all the formats specified

A log file has these patterns appearing more than once in a line.
For example, the file may look like
dsads utc-hour_of_year:2013-07-30T17 jdshkdsjhf utc-week_of_year:2013-W31 dskjdskf
utc-week_of_year:2013-W31 dskdsld fdsfd
dshdskhkds utc-month_of_year:2013-07 gfdkjlkdf
I want to replace all date-specific info with "Y".
I tried:
replaceAll("_year:.*\\s", "_year:Y "); but it removes everything that occurs after the first replacement, due to the greedy match of .*
dsads utc-hour_of_year:Y
utc-week_of_year:Y
dshdskhkds utc-month_of_year:Y
but the expected result is:
dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf
utc-week_of_year:Y dskdsld fdsfd
dshdskhkds utc-month_of_year:Y gfdkjlkdf
Try using a reluctant quantifier: _year:.*?\s.
.replaceAll("_year:.*?\\s", "_year:Y ")
System.out.println("utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf"
        .replaceAll("_year:.*?\\s", "_year:Y "));
Output:
utc-hour_of_year:Y dsfsdgfsgf utc-week_of_year:Y dsfsdgfsdgf
I am not sure what you are really trying to do, and this answer is based only on your example. If you want something else, leave a comment below or edit your question with more specific information or an example.
It removes everything after _year: because you are using .*\\s, which means
.* zero or more of any character (except line terminators),
\\s followed by one whitespace character,
so in the sentence
utc-hour_of_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 dsfsdgfsdgf
it will match everything from the first _year: through the last space, that is
_year:2013-07-30T17 dsfsdgfsgf utc-week_of_year:2013-W31 (plus the space that follows)
because by default the * quantifier is greedy. To make it reluctant, add ? after *, so try
"_year:.*?\\s"
or, even better, instead of .*? match only non-whitespace characters using \\S, which is the negation of \\s and can also be written as [^\\s]. Also, if your data can appear at the end of the input, you probably shouldn't require \\s at the end of your regex or add a space to the replacement, so try one of these:
.replaceAll("_year:\\S*", "_year:Y")
.replaceAll("_year:\\S*\\s", "_year:Y ")
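The \\S* variant above can be wrapped in a small helper, sketched here with illustrative class and method names. It consumes only the non-whitespace run after _year:, so it works both in the middle of a line and at the very end of the input.

```java
final class DateMasker {
    // Replaces the non-whitespace run that follows "_year:" with "Y".
    // No trailing whitespace is consumed, so the surrounding text is untouched.
    static String maskDates(String line) {
        return line.replaceAll("_year:\\S*", "_year:Y");
    }
}
```

For the first sample line from the question, DateMasker.maskDates(line) should give dsads utc-hour_of_year:Y jdshkdsjhf utc-week_of_year:Y dskjdskf.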

Regex to find hostname from Jdbc url

I am new to regex. I would like to retrieve the hostname from a PostgreSQL JDBC URL using regex.
Assume the PostgreSQL URL is jdbc:postgresql://production:5432/dbname. I need to retrieve "production", which is the hostname. I want to use a regex rather than Java's split function. I tried
Pattern PortFinderPattern = Pattern.compile("[//](.*):*");
final Matcher match = PortFinderPattern.matcher(url);
if (match.find()) {
    System.out.println(match.group(1));
}
But it matches the whole string from the hostname to the end.
Pattern PortFinderPattern = Pattern.compile(".*://([^:]+).*");
Regex without grouping:
"(?<=//)[^:]*"
[//]([\\w\\d\\-\\.]+)\:
Should be enough to find it reliably. Though this is probably a better regex:
The Hostname Regex
There are some errors in your regex:
[//] - This matches only one character, because [] denotes a character class, so it will not match the full //. To match it, write [/][/] or \/\/.
(.*) - This matches all characters to the end of the line. You need to be more specific if you want to stop at a certain character. For example, you could stop at the colon by matching only characters that are not colons, like this: ([^:]*).
:* - This makes the colon optional (zero or more colons). I guess you forgot to put a dot (any character) after the colon, like this: :.*.
So here is your regex corrected: \/\/([^:]*):.*.
Hope this helps.
BTW, if the port number after production (:5432) is optional, I suggest the following regex, which keeps : out of the host group so that the port is not captured with it:
\/\/([^/:]*)(?::\d+)?\/
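A minimal sketch of the optional-port idea in Java, using a host character class that excludes both / and : so that the port never ends up in group 1. The class and method names are illustrative, and accepting the end of the string after the host is an added assumption beyond the regex above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JdbcHost {
    // "//", then the host (no '/' or ':'), then an optional ":port",
    // then either '/' or the end of the string.
    private static final Pattern HOST =
            Pattern.compile("//([^/:]+)(?::\\d+)?(?:/|$)");

    // Returns the hostname if the URL contains a "//host" part.
    static Optional<String> of(String jdbcUrl) {
        Matcher m = HOST.matcher(jdbcUrl);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

JdbcHost.of("jdbc:postgresql://production:5432/dbname") and JdbcHost.of("jdbc:postgresql://production/dbname") should both yield production.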
To capture also Oracle and MySQL JDBC URL variants with their quirks (e.g. Oracle allowing to use # instead of // or even #//), I use this regexp to get the hostname: [/#]+([^:/#]+)([:/]+|$) Then the hostname is in group 1.
Code e.g.
String jdbcURL = "jdbc:oracle:thin:#//hostname:1521/service.domain.local";
Pattern hostFinderPattern = Pattern.compile("[/#]+([^:/#]+)([:/]+|$)");
final Matcher match = hostFinderPattern.matcher(jdbcURL);
if (match.find()) {
System.out.println(match.group(1));
}
This works for all these URLs (and other variants):
jdbc:oracle:thin:#//hostname:1521/service.domain.local
jdbc:oracle:thin:#hostname:1521/service.domain.local
jdbc:oracle:thin:#hostname/service.domain.local
jdbc:mysql://localhost:3306/sakila?profileSQL=true
jdbc:postgresql://production:5432/dbname
jdbc:postgresql://production/
jdbc:postgresql://production
This assumes that
The hostname is after // or # or a combination thereof (single / would also work, but I don't think JDBC allows that).
After the hostname either : or / or the end of the string follows.
Note that the + quantifiers are greedy; this is especially important for the middle one.

Multiline Regex Matching Issue

I have the following string that I am trying to match via regex:
;IF TEST_DATE <= 200112 THEN E>=90 AND S>=90
OR P = "25" ENDIF
IF TEST_DATE >= 200201 AND TEST_DATE < 200407 THEN E>=89
AND S>=90 OR P = "25" ENDIF
I am using the following regex in an attempt to match from the semicolon (intended to be a comment) until the first ENDIF:
;\s*IF (\d|\D)+ ENDIF
Unfortunately, this pattern matches all the way until the second ENDIF. I've tried various solutions using the Java Pattern.DOTALL, as well as the (?s) flag, with no luck.
You are using a greedy quantifier, so your pattern (\d|\D)+ matches everything up to the last ENDIF.
You need a reluctant quantifier, +?, if you want your regex to stop matching at the first ENDIF:
;\s*IF (\d|\D)+? ENDIF
Use the non-greedy quantifier.
;\s*IF (\d|\D)*? ENDIF
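The reluctant quantifier can be sketched in Java as follows. As an assumption on my part, the sketch swaps (\d|\D) for a plain dot with Pattern.DOTALL, which also matches across line breaks and avoids an alternation inside the repetition. The class and method names are illustrative.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentBlockFinder {
    // DOTALL lets '.' cross line breaks; "+?" expands one character at a
    // time and stops as soon as " ENDIF" can match.
    private static final Pattern FIRST_BLOCK =
            Pattern.compile(";\\s*IF .+? ENDIF", Pattern.DOTALL);

    // Returns the text from the semicolon through the first ENDIF, if any.
    static Optional<String> firstBlock(String text) {
        Matcher m = FIRST_BLOCK.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

On the two-block sample from the question, the returned text should end at the first ENDIF and contain nothing from the second IF block.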
