Java regex to extract param from function in URL string

My string is something like:
http://url:port?param1=foo&param2=EDIT[a,b]&param3=fuu&param4=EDIT[c,33]
I need a regex to extract:
http://url:port?param1=foo&param2=
a
b
&param3=fuu&param4=
c
33
For a single repetition of EDIT[] I'm able to use this regex:
(.*)EDIT\\[(.+)\\,(.+)\\](.*)
But I can't find a working one for an unlimited number of repetitions. Something like:
((.*)EDIT\\[(.+)\\,(.+)\\](.*)){1,}

One solution is to make your very generic quantifiers non-greedy with the ? sign. In the current version, your .+ and .* capture more characters than necessary, so they consume the second EDIT.
The regex:
(.*?)EDIT\[(.+?)\,(.+?)\](.*?)
to see it in action: https://regex101.com/r/PuL3SW/1
EDIT: according to the comment, to capture the last part of the URL if needed, you can use an alternation. (Second edit: I also removed the fourth capturing group (.*?), which never captured anything. Note: you can always be sure that the last capture (.+) will be the end of your URL, because anything that comes before an EDIT will already have been captured.)
(.*?)EDIT\[(.+?)\,(.+?)\]|(.+)
updated regex here: https://regex101.com/r/PuL3SW/4
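A minimal Java sketch of how the alternation regex above could be applied with repeated find() calls. The class name EditParamExtractor and the method extract are hypothetical, chosen only for illustration; the pattern is the one from the answer (with the unnecessary backslash before the comma dropped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class EditParamExtractor {
    // Either "lazy prefix + EDIT[x,y]" or, when no EDIT remains, the rest of the string.
    private static final Pattern P =
            Pattern.compile("(.*?)EDIT\\[(.+?),(.+?)\\]|(.+)");

    static List<String> extract(String url) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(url);
        while (m.find()) {
            if (m.group(4) != null) {
                // Second alternative: trailing text after the last EDIT[...]
                parts.add(m.group(4));
            } else {
                parts.add(m.group(1)); // text before EDIT[
                parts.add(m.group(2)); // first argument
                parts.add(m.group(3)); // second argument
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String url = "http://url:port?param1=foo&param2=EDIT[a,b]&param3=fuu&param4=EDIT[c,33]";
        // Prints the six pieces listed in the question, one per line.
        extract(url).forEach(System.out::println);
    }
}
```

Each find() call resumes where the previous match ended, which is what makes the single-repetition pattern work for any number of EDIT[...] occurrences.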

Related

Regex Pattern in Java

I have a regular expression as defined
AAA_BBB_CCCC_(.*)_[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9].
Given the string AAA_BBB_CCCC_DDD_EEEE_19710101T123456, the code calls matcher.group(1) to extract the desired part (DDD_EEEE). Now a new string is coming in: AAA_BBB_ATCCCC_DDD_EEEE_19710101T123456. Is there a way to change the regex so that it matches both the old and the new string? I tried a few solutions from other Stack Overflow questions, but they didn't work quite right for me.
You just need to add an optional group, (?:AT)?, before CCCC:
AAA_BBB_(?:AT)?CCCC_(.*)_[0-9]{4}[0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9]
        ^^^^^^^
See the regex demo
I also contracted the four [0-9] to [0-9]{4} to make the pattern shorter.
The (?:AT)? is a non-capturing group to which a ? quantifier is applied. The ? quantifier makes the whole sequence of letters match 1 or 0 times, making it optional in the end.
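As a rough illustration, the optional-group pattern can be checked against both strings from the question. The class OptionalPrefixDemo and the method middle are hypothetical names; the sketch assumes the whole input should match, so it uses matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class OptionalPrefixDemo {
    // (?:AT)? makes the "AT" prefix optional without adding a capturing group,
    // so the wanted text stays in group 1.
    private static final Pattern P = Pattern.compile(
            "AAA_BBB_(?:AT)?CCCC_(.*)_[0-9]{4}[0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9]");

    // Returns the text captured by group 1, or null if the input does not match.
    static String middle(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(middle("AAA_BBB_CCCC_DDD_EEEE_19710101T123456"));   // DDD_EEEE
        System.out.println(middle("AAA_BBB_ATCCCC_DDD_EEEE_19710101T123456")); // DDD_EEEE
    }
}
```

Because the group is non-capturing, existing code that reads matcher.group(1) does not need to change.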
Please give the following regex a try.
AAA_BBB_(ATCCCC|CCCC)_(.*)_[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9].
It would only match ATCCCC or CCCC. It won't be able to support dynamic characters preceding CCCC. You would need to use wildcards for that.
Also, you would need to change your matcher.group(1) statement to matcher.group(2)

How to find optional group with some prefix using Regex

This is my pattern regex:
"subcategory.html?.*id=(.*?)&.*title=(.+)?"
for below input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale
I want to capture the groups below:
group one (id) : 3000080292
group two (title) : BabySale
This works fine. The problem is that I want to make the second group (the value of title) optional, so that even if title is not present, the regex still matches and gives me the value of group 1 (id). But for the input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&
the regex fails to match even though group one is present. So my question is: how do I make the second group optional here?
Maybe make the entire substring optional?
Try subcategory.html?.*id=(.*?)&.*(?:title=(.+)?)?
Also note that your (and my) regex might be matching too much. For example, the dot here should probably be escaped: subcategory\.html instead of subcategory.html or you will match subcategory€html, too. Your question mark says the l of html is optional; you are probably saved by the .* ("match anything"), that follows.
Last but not least, the final .* means that even this will match (which you probably don't want to match):
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale&Lorem Ipsum Sit Atem http://&%$
It's usually a bad idea to match .* as it will nearly always match too much. Consider using character classes instead of the dot, and anchoring the beginning (^) and end ($) of the string... :)
One of the possible ways is to use something like:
subcategory\.html\?.*id=(.*?)&(.*title=(.+)?)?
(.*title=(.+)?)? is optional now.
please see an example here.
As suggested by @Christian, it is better to make .*title a non-capturing group so that it won't be part of the result.
subcategory\.html\?.*id=(.*?)&(?:.*title=(.+)?)?
If you know that parameter id comes before optional title then you can use this regex to capture id and optional title parameters:
subcategory\.html\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?
RegEx Demo
In Java use this regex:
final String regex = "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?";
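A short sketch of how that Java string might be used; the class SubcategoryParams and the method idAndTitle are hypothetical names introduced for illustration. It relies on find() so the scheme and host in front of subcategory.html are ignored, and on group(2) being null when the optional title group does not participate in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class SubcategoryParams {
    private static final Pattern P = Pattern.compile(
            "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?");

    // Returns {id, title}; title is null when the parameter was not captured.
    // Returns null when the URL does not contain the expected prefix at all.
    static String[] idAndTitle(String url) {
        Matcher m = P.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] a = idAndTitle(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale");
        System.out.println(a[0] + " / " + a[1]); // 3000080292 / BabySale

        String[] b = idAndTitle(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&");
        System.out.println(b[0] + " / " + b[1]); // 3000080292 / null
    }
}
```

One caveat with this pattern: the greedy (?:.*&)? runs to the last & in the string, so title is only captured when it is the final parameter.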

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)). My thinking was that *? would non-greedily match any number of not(string) sequences, and that the lookahead would stop the match at anything that is not not(string), so that I could extract only the part I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I am still missing or misusing something. Any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parentheses are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
There is no need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
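A minimal sketch of the non-nested variant applied to the input from the question. The class NotChainMatcher and the method leadingChain are hypothetical names; the pattern is the simple [^)]* version, so it assumes no parentheses appear inside the arguments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NotChainMatcher {
    // $(...) followed by zero or more .not(...) calls; arguments may not contain ')'.
    private static final Pattern P =
            Pattern.compile("\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*");

    // Returns the first $(...).not(...)... chain found, or null if there is none.
    static String leadingChain(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Prints $(string).not(string).not(string)
        System.out.println(leadingChain("$(string).not(string).not(string).append(string)"));
    }
}
```

The greedy * stops on its own at .append(string) because that text does not start with .not(, so group() already holds only the wanted prefix.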
From the Matcher documentation: "group zero denotes the entire pattern". Use group(1) instead.
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work. It gives you two groups: one containing the $(...) part, and another for the .not(...) part (a repeated capturing group keeps only its last repetition).
Please note: you will have to double the backslashes to use it in a Java string.

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on the string "MM:";d it matches, as I wanted,
and on the string "MM:";c it does not match, as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
then the string "MM:";c matches, whereas I expected it not to match, as before.
I tried this on Java and then on Javascript using Regex tool debuggex:
This link contains a snippet of the above
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's response about using possessive quantifiers,
\"(\S\S+)\";?+(?!c)
it worked, here is a link to it on RegexPlanet
I believe the regex engine will do whatever it can to find a match. Since your expression says the semicolon is optional, the engine found a way to match the entire expression: if the semicolon is not consumed by the second group, the next character is ; rather than c, so the negative lookahead succeeds. This has to do with the backtracking way that regex matching works: it keeps trying alternatives until it finds a match.
In other words, the process goes like this:
"MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops, the next character is 'c', so the negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - the next character is ';', not 'c', so the negative lookahead succeeds. We have a match
An update (based on your comment). The following code may work better, since both lookaheads must hold at the same time:
\"(\S\S+)\"(;|)(?!c)(?!;c)
Debuggex Demo
The problem is that you don’t want to make the semicolon optional in the sense of regular expression. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there the matcher can ignore it creating an empty match for the group letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
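To make the difference concrete, here is a small sketch comparing the optional-group pattern from the question with the possessive version. The class PossessiveDemo and its two method names are hypothetical; both methods use find(), as an online tester would.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answers.
class PossessiveDemo {
    // Pattern from the question: the engine may backtrack and leave ';' unconsumed.
    private static final Pattern OPTIONAL = Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");
    // Possessive ;?+ : once the ';' is consumed it is never given back.
    private static final Pattern POSSESSIVE = Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    static boolean optionalMatches(String s) {
        return OPTIONAL.matcher(s).find();
    }

    static boolean possessiveMatches(String s) {
        return POSSESSIVE.matcher(s).find();
    }

    public static void main(String[] args) {
        String withC = "\"MM:\";c";
        String withD = "\"MM:\";d";
        System.out.println(optionalMatches(withC));   // true  (the unwanted match)
        System.out.println(possessiveMatches(withC)); // false
        System.out.println(possessiveMatches(withD)); // true
    }
}
```

Possessive quantifiers are supported by java.util.regex; JavaScript's built-in RegExp does not offer them, so the same fix may not carry over directly there.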

Only capture digits instead of the other text in input for a regex like below

Currently this regular expression:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$
captures 418—FINAL in group number 2 for an input like:
String text="H.B. 418—FINAL VERSION";
How do I change this regular expression to only capture the number (digits) of "418" in group2 ?
EDIT:
I'd still like to capture "H.B." in a preceding group.
Just change the boundaries of the second group to only include the digits. To also save the "H.B." part, add parentheses around that part too:
^(?:(\\S+)\\s+)*?(\\d+)\\S+\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$
I'm not entirely sure what your exact requirements are (your regex is looking for an optional "No." but you haven't given any examples). But this will work on the example you give:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\d+).*(?:\\s+\\(.*?\\))?$
assuming you don't need the text following the digits. That is, just change the second \S to \d. I also added .* after this to match any remaining characters following the digits up to an optional parenthesized part (without capturing them, but you can capture them if you want to).
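As a quick check, the first suggested pattern (the one that also captures "H.B.") can be run against the sample input. The class BillNumberDemo and the method prefixAndNumber are hypothetical names; the dash in the input is written as \u2014 so the sketch does not depend on the source file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answers.
class BillNumberDemo {
    // Group 1: the token before the number; group 2: the digits only.
    private static final Pattern P = Pattern.compile(
            "^(?:(\\S+)\\s+)*?(\\d+)\\S+\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$");

    // Returns {prefix, digits}, or null when the text does not match.
    static String[] prefixAndNumber(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] r = prefixAndNumber("H.B. 418\u2014FINAL VERSION");
        System.out.println(r[0] + " / " + r[1]); // H.B. / 418
    }
}
```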
