Any suggestions to match and extract the pattern?

I want to match something like this:
$(string).not(string).not(string)
The not(string) part can repeat zero or more times after $(string).
Note that the string can be anything, except a nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)). My thinking was that *? would non-greedily match any number of not(string) sequences, and that the lookahead would stop the match at whatever is not a not(string), so that I could extract only the part I want.
However, when I tested it on input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I am still missing or misusing something. Any suggestions?

Try this one (escaped for Java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for Java, though).

If we assume that parentheses are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
There is no need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
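As a minimal sketch of how the pattern above could be applied, assuming parentheses are never nested: the class name NotChainExtractor and the method extract are hypothetical names chosen for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotChainExtractor {
    // $(...) followed by zero or more .not(...) calls.
    // Assumption: no ')' ever appears inside the parentheses.
    private static final Pattern CHAIN =
            Pattern.compile("\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*");

    /** Returns the first $(...).not(...) chain found in the input, if any. */
    static Optional<String> extract(String input) {
        Matcher m = CHAIN.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints $(string).not(string).not(string)
        System.out.println(
                extract("$(string).not(string).not(string).append(string)").orElse("no match"));
    }
}
```

Because the greedy group stops as soon as the next segment is not .not(...), group(0) already ends before .append(string).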

From the Matcher documentation: "group zero denotes the entire pattern". Use group(1).

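(\$\([\w ]+\))((?:\.not\([\w ]+\))*)
This will also work. It gives you two groups: the first contains the $(...) part, and the second contains the whole run of .not(...) calls. The repetition is wrapped in an outer group because a repeated capturing group such as (\.not\([\w ]+\))* keeps only its last iteration.
Please note: you will have to escape the backslashes for Java.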
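A minimal Java sketch of the two-group idea, with the backslashes doubled for a Java string literal. It assumes the repetition is wrapped in an outer group so that group 2 holds the whole run of .not(...) calls (a repeated capturing group on its own keeps only its last iteration). The names NotChainGroups and split are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotChainGroups {
    // Group 1: the $(...) part. Group 2: every .not(...) call that follows it.
    private static final Pattern P =
            Pattern.compile("(\\$\\([\\w ]+\\))((?:\\.not\\([\\w ]+\\))*)");

    /** Returns {head, notCalls}, or null when the input contains no match. */
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("$(string).not(string).not(string).append(string)");
        System.out.println(parts[0]); // prints $(string)
        System.out.println(parts[1]); // prints .not(string).not(string)
    }
}
```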

Related

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?

The issue you are facing with (command1|command2|command3).+; is that the + is greedy, meaning that it will match everything up to the last semicolon in the input.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
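The difference can be seen in a small Java sketch. The sample input and the names GreedyVsLazy and findAll are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsLazy {
    /** Collects every match of the given regex in the input, in order. */
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "command1 foo; command2 bar;"; // hypothetical sample input
        // Greedy: a single match that runs to the last semicolon.
        System.out.println(findAll("(command1|command2|command3).+;", input));
        // Lazy: each match stops at the first semicolon after the command word.
        System.out.println(findAll("(command1|command2|command3).+?;", input));
    }
}
```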

Tell it to find only non-semicolons.
[^;]+
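Dropped into the original pattern, the negated class might look like the sketch below. The sample input and the names CommandScanner and commands are hypothetical. Note that, unlike the dot, [^;] also matches line breaks, so a command may span several lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommandScanner {
    // A command word, then one or more non-semicolon characters, then the semicolon.
    private static final Pattern COMMAND =
            Pattern.compile("(command1|command2|command3)[^;]+;");

    /** Returns every command found in the input, including its closing semicolon. */
    static List<String> commands(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = COMMAND.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [command1 foo;, command2 bar;]
        System.out.println(commands("command1 foo; command2 bar;"));
    }
}
```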

What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier makes it match as little as possible, instead of as much as possible, which is the default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

How to find optional group with some prefix using Regex

This is my pattern regex:
"subcategory.html?.*id=(.*?)&.*title=(.+)?"
for below input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale
I want to capture the groups below:
group one (id) : 3000080292
group two (title) : BabySale
This works fine. The problem is that I want to make the second group, i.e. the value of title, optional, so that even if title is not present, the regex still matches and gives me the value of group 1 (id). But for the input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&
the match fails even though group one is present. So my question is: how do I make the second group optional here?

Maybe make the entire substring optional?
Try subcategory.html?.*id=(.*?)&(?:.*title=(.+)?)?
Also note that your (and my) regex might be matching too much. For example, the dot here should probably be escaped: subcategory\.html instead of subcategory.html or you will match subcategory€html, too. Your question mark says the l of html is optional; you are probably saved by the .* ("match anything"), that follows.
Last but not least, the final .* means that even this will match (which you probably don't want to match):
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale&Lorem Ipsum Sit Atem http://&%$
It's usually a bad idea to match .* as it will nearly always match too much. Consider using character classes instead of the dot, and anchoring the beginning (^) and end ($) of the string... :)

One of the possible ways is to use something like:
subcategory\.html\?.*id=(.*?)&(.*title=(.+)?)?
(.*title=(.+)?)? is optional now.
As suggested by @Christian, it is better to make .*title a non-capturing group, so that it won't be part of the result:
subcategory\.html\?.*id=(.*?)&(?:.*title=(.+)?)?

If you know that parameter id comes before optional title then you can use this regex to capture id and optional title parameters:
subcategory\.html\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?
In Java use this regex:
final String regex = "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?";
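A short sketch of how this regex could be used with Matcher, assuming id always comes before title. The names SubcategoryParams and parse are hypothetical. When the optional title group does not take part in the match, group(2) returns null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubcategoryParams {
    private static final Pattern PARAMS = Pattern.compile(
            "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?");

    /** Returns {id, title}; title is null when absent. Returns null when nothing matches. */
    static String[] parse(String url) {
        Matcher m = PARAMS.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] withTitle = parse(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale");
        System.out.println(withTitle[0] + " / " + withTitle[1]); // prints 3000080292 / BabySale

        String[] withoutTitle = parse(
                "http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&");
        System.out.println(withoutTitle[0] + " / " + withoutTitle[1]); // prints 3000080292 / null
    }
}
```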

Matches A but not B in Java Regex?

I have a big document. Let's scale it down to
location=State-City-House
location=City-House
What I want to do is replace all lines not starting with State with some other string, say "NY". Lines starting with State must remain untouched.
So my end result would be
location=State-City-House
location=NY-City-House
1. Obviously I can't use String.replaceAll().
2. Using Pattern.matcher() is tricky, since there are two different patterns, one of which must be found and one of which must not be found.
3. I tried a dirty way: replacing "location=State" with "bocation=State" first, then replacing the others, and then re-replacing.
So, is there a neat and simple way to do it?

You can definitely use replaceAll with a negative lookahead:
String repl = input.replaceAll( "(?m)^(location=)(?!State)", "$1NY-" );
(?m) sets MULTILINE modifier so that we match anchors ^ and $ in each line
(location=) matches location= and captures the value in group #1
(?!State) is the negative lookahead to fail the match when State appears after the captured group #1 i.e. location=
In replacement we use $1NY- to make it location=NY- at start.
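Put together, the replacement can be sketched as a small helper. The names LocationPrefixer and addDefaultState are hypothetical; the regex and replacement string are the ones from this answer.

```java
final class LocationPrefixer {
    /** Inserts "NY-" after "location=" on every line whose value does not start with "State". */
    static String addDefaultState(String input) {
        // (?m) makes ^ match at the start of each line, not only at the start of the input.
        return input.replaceAll("(?m)^(location=)(?!State)", "$1NY-");
    }

    public static void main(String[] args) {
        String input = "location=State-City-House\nlocation=City-House";
        // prints:
        // location=State-City-House
        // location=NY-City-House
        System.out.println(addDefaultState(input));
    }
}
```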

If I understand your intention correctly, you don't actually have the string "State" in your input, but varying strings that represent states.
But some of your text lines are missing the state altogether and only have the name of the City and House. Is that correct? In that case, the defining characteristic between the 2 kinds of lines is the number of dashes.
^location=([^-]+)-([^-]+)$
The above regex matches only full lines with only 1 dash.
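If that reading of the task is right, a replacement based on the dash count might look like the sketch below. Several things here are assumptions: the sample state name CA, the replacement value NY, the names DashCountPrefixer and addMissingState, the (?m) flag, and the \n added to each character class so that a segment cannot run across a line break.

```java
final class DashCountPrefixer {
    /** Prefixes "NY-" to the value of every line that contains exactly one dash. */
    static String addMissingState(String input) {
        // Assumption: lines are separated by \n, so [^-\n] keeps each segment on one line.
        return input.replaceAll("(?m)^location=([^-\\n]+)-([^-\\n]+)$", "location=NY-$1-$2");
    }

    public static void main(String[] args) {
        String input = "location=CA-LA-House\nlocation=LA-House"; // hypothetical sample data
        // prints:
        // location=CA-LA-House
        // location=NY-LA-House
        System.out.println(addMissingState(input));
    }
}
```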
I might have misunderstood the task. It would be easier if you would post some of the actual input.

Regex why does negative lookahead not work when there are two groups here

When I tried this regex
\"(\S\S+)\"(?!;c)
on the string "MM:";d it matches, as I wanted,
and on the string "MM:";c it does not match, as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
the string "MM:";c matches, when I expected it not to match, as before.
I tried this in Java and then in JavaScript using the regex tool Debuggex.
What am I doing wrong?
Note that the | is there so that the semicolon is not required. Also, the c in the examples is just a stand-in for a word, which is why I am using a negative lookahead.
After following Holger's response and using a possessive quantifier,
\"(\S\S+)\";?+(?!c)
it worked.

I believe the regex engine does whatever it can to find a match. Since your expression makes the semicolon optional, the engine can still match the entire expression: if the semicolon is not consumed by the second group, the next character is ; rather than c, so the negative lookahead succeeds. This is a consequence of backtracking: the engine keeps trying alternatives until one matches.
In other words, the process goes like this:
"MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops - next character is c, so the negative lookahead fails. Go back
(;|) - try nothing. We still have ';c' left in the input
(?!c) - next character is ;, so the negative lookahead succeeds. We have a match
An update (based on your comment). The following pattern may work better, because it tests the lookahead before the optional semicolon is consumed:
\"(\S\S+)\"(?!;?c)(;|)

The problem is that you don’t want to make the semicolon optional in the sense of regular expression. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there the matcher can ignore it creating an empty match for the group letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.
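The two behaviours can be compared side by side in a small sketch. The names PossessiveSemicolon, OPTIONAL, POSSESSIVE and found are hypothetical; the patterns are the ones discussed above, escaped for a Java string literal.

```java
import java.util.regex.Pattern;

final class PossessiveSemicolon {
    // Optional semicolon: the engine may backtrack and leave the ';' unconsumed.
    static final Pattern OPTIONAL = Pattern.compile("\"(\\S\\S+)\"(;|)(?!c)");
    // Possessive semicolon: once the ';' is consumed it is never given back.
    static final Pattern POSSESSIVE = Pattern.compile("\"(\\S\\S+)\";?+(?!c)");

    /** Reports whether the pattern matches anywhere in the input. */
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(OPTIONAL, "\"MM:\";c"));   // prints true (the unwanted match)
        System.out.println(found(POSSESSIVE, "\"MM:\";c")); // prints false
        System.out.println(found(POSSESSIVE, "\"MM:\";d")); // prints true
    }
}
```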

How to make negative lookahead work with end of line text

I have a regex like the following:
.{0,1000}(?!(xa7|para(graf))$)
using Java.
I was expecting it to reject the following text:
blaparagraf
because paragraf is found at the end, but the text still matches.

That's because .{0,1000} will match the entire subject, hence it's not followed by xa7 or paragraf (it's followed by $ only).
You want negative lookbehind:
.{0,1000}(?<!xa7|paragraf)$
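A quick way to check both patterns in Java is sketched below. The names SuffixLookbehind, misplacedLookahead and lookbehind are hypothetical, and String.matches is used so that each pattern has to cover the whole input.

```java
final class SuffixLookbehind {
    /** The pattern from the question: the lookahead is only tested after .{0,1000} has consumed the text. */
    static boolean misplacedLookahead(String s) {
        return s.matches(".{0,1000}(?!(xa7|para(graf))$)");
    }

    /** The lookbehind version: at the end of the input, look back for a forbidden suffix. */
    static boolean lookbehind(String s) {
        return s.matches(".{0,1000}(?<!xa7|paragraf)$");
    }

    public static void main(String[] args) {
        System.out.println(misplacedLookahead("blaparagraf")); // prints true (not rejected)
        System.out.println(lookbehind("blaparagraf"));         // prints false
        System.out.println(lookbehind("blapara"));             // prints true
    }
}
```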

It is a common mistake to misplace assertions. If you want to use lookahead, the pattern is something like this:
^(?!.*paragraph$).*$
This matches (as seen on rubular.com):
something something para
paragraph something something
But doesn't match:
something paragraph
So the key difference here is that we start looking ahead at the beginning of the string, before we match .* (or .{0,1000} in your case). Of course, what we're looking for isn't simply paragraph$, but rather .*paragraph$.
That said, to check that a string doesn't end with something of finite length, lookbehind when supported is the most natural solution.
^.*$(?<!paragraph)
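Both variants from this answer can be tried in Java with the sketch below. The names SuffixChecks, viaLookahead and viaLookbehind are hypothetical; the test strings are the ones listed above.

```java
import java.util.regex.Pattern;

final class SuffixChecks {
    // Lookahead placed at the start: reject the input up front if it ends with "paragraph".
    private static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*paragraph$).*$");
    // Lookbehind placed at the end: after matching the input, check what precedes the end.
    private static final Pattern LOOKBEHIND = Pattern.compile("^.*$(?<!paragraph)");

    static boolean viaLookahead(String s) {
        return LOOKAHEAD.matcher(s).matches();
    }

    static boolean viaLookbehind(String s) {
        return LOOKBEHIND.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaLookahead("something something para"));      // prints true
        System.out.println(viaLookahead("paragraph something something")); // prints true
        System.out.println(viaLookahead("something paragraph"));           // prints false
        System.out.println(viaLookbehind("something paragraph"));          // prints false
    }
}
```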
