Regex Pattern in Java - java

I have a regular expression as defined
AAA_BBB_CCCC_(.*)_[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9].
There is a string defined as --> **AAA_BBB_CCCC_DDD_EEEE_19710101T123456** and in the code, we have matcher.group(1) which can filter out what is desired as (DDD_EEEE). Now, I've a new string coming in as --> **AAA_BBB_ATCCCC_DDD_EEEE_19710101T123456**. Is there a way that I can change the regex to satisfy both old and new string? I tried few solutions that came up from Stackoverflow questions like this and others but that didn't work quite right for me.

You just need to add an optional group, (?:AT)?, before CCCC:
AAA_BBB_(?:AT)?CCCC_(.*)_[0-9]{4}[0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9]
^^^^^^^
See the regex demo
I also contracted the four [0-9] to [0-9]{4} to make the pattern shorter.
The (?:AT)? is a non-capturing group to which a ? quantifier is applied. The ? quantifier makes the whole sequence of letters match 1 or 0 times, making it optional in the end.

Please give the following regex a try.
AAA_BBB_(ATCCCC|CCCC)_(.*)_[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9].
It would only match ATCCCC or CCCC. It won't be able to support dynamic characters preceding CCCC. You would need to use wildcards for that.
Also, you would need to change your matcher.group(1) statement to matcher.group(2)

Related

Java Regex Match Pattern Groups unexpectedly matched [duplicate]

I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match everything till the last value.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as less as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.
This isn't handy to use the split method in this kind situations. A workaround consists to describe all that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* #
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
You can also use the split method but the pattern to do it isn't efficient since it needs to look backward for each position (and is limited since the lookbehind in java allows only limited quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')
Have you tried positive lookbehind
(?<=.')
Regex101
This regex will do it:
[A-Z]+(\?\?)*'
If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. Lookbehinds must be fixed-width, but some engines allow an OR of the entire lookbehind. Others do not, and would require it be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.

How to find optional group with some prefix using Regex

This is my pattern regex:
"subcategory.html?.*id=(.*?)&.*title=(.+)?"
for below input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale
I want to capturebelow group
group one (id) : 3000080292
group two (title) : BabySale
For which it is working fine. The problem is I want to make second group i.e. value of title to be optional, so that even if title is not present, regex should match and get me value of group 1(id). But for input
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&
Regex match is failing even if group one is present. So my question is how to make second group optional here?
Maybe make the entire substring optional?
Try subcategory.html?.*id=(.*?)&.*(?:title=(.+)?)?
Also note that your (and my) regex might be matching too much. For example, the dot here should probably be escaped: subcategory\.html instead of subcategory.html or you will match subcategory€html, too. Your question mark says the l of html is optional; you are probably saved by the .* ("match anything"), that follows.
Last but not least, the final .* means that even this will match (which you probably don't want to match):
http://example.com/xyz/subcategory.html?id=3000080292&backTitle=Back&title=BabySale&Lorem Ipsum Sit Atem http://&%$
It's usually a bad idea to match .* as it will nearly always match too much. Consider using character classes instead of the dot, and to anchor he beginning (^) and end ($) of the string... :)
One of the possible ways is to use something like:
subcategory\.html\?.*id=(.*?)&(.*title=(.+)?)?
(.*title=(.+)?)? is optional now.
please see an example here.
As suggested by #Christian it is better to make .*title non capturing group and it won't be part of the result.
subcategory\.html\?.*id=(.*?)&(?:.*title=(.+)?)?
If you know that parameter id comes before optional title then you can use this regex to capture id and optional title parameters:
subcategory\.html\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?
RegEx Demo
In Java use this regex:
final String regex = "subcategory\\.html\\?id=([^&]*)(?:.*&)?(?:title=([^&]*))?";

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)), I think the *? is to non-greedily match any number of sequence of not(string), and use the lookahead to stop the match that is not not(string), so that I can extract only the part that I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
the group(0) returns the whole string, which I only need $(string).not(string).not(string).
Obviously I still miss something or misuse of anything, any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parenthesis are not nested, you can write something like this:
string p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
Not need to add a lookahead since the non-capturing group has a greedy quantifier (so the group is repeated as possible).
if what you called string in your question may be a quoted string with parenthesis inside like in Pshemo example: $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
From here, "group zero denotes the entire pattern". Use group(1).
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work, it would give you two groups, One consisting of the word with $ sign, another would give you the set of all ".not" strings.
Please note: You might have to add escape characters for java.

Regex why does negative lookahead not work when there are two groups here

when I tried this regex
\"(\S\S+)\"(?!;c)
on this string "MM:";d it comes as matched as I wanted
and on this string "MM:";c it comes as not matched as desired.
But when I add a second group, by moving the semicolon inside that group and making it optional using |
\"(\S\S+)\"(;|)(?!c)
for this string "MM:";c it comes as matched when I expected it to not like before.
I tried this on Java and then on Javascript using Regex tool debuggex:
This link contains a snippet of the above
What am I doing wrong?
note the | is so it is not necessary to have a semicolon.Also in the examples I put c, it is just a substitute in the example for a word, that's why I am using negative lookahead.
After following Holgers response of using Possessive Quantifiers,
\"(\S\S+)\";?+(?!c)
it worked, here is a link to it on RegexPlanet
I believe that the regex will do what it can to find a match; since your expression said the semicolon could be optional, it found that it could match the entire expression (since if the semicolon is not consumed by the first group, it becomes a "no-match" for the negative lookahead. This has to do with the recursive way that regex works: it keeps trying to find a match...
In other words, the process goes like this:
MM:" - matched
(;|) - try semicolon? matched
(?!c) - oops - negative lookahead fails. No match. Go back
(;|) - try nothing. We still have ';c' left to match
(?!c) - negative lookahead not matched. We have a match
An update (based on your comment). The following code may work better:
\"(\S\S+)\"(;|)((?!c)|(?!;c))
Debuggex Demo
The problem is that you don’t want to make the semicolon optional in the sense of regular expression. An optional semicolon implies that the matcher is allowed to try both, matching with or without it. So even if the semicolon is there the matcher can ignore it creating an empty match for the group letting the lookahead succeed.
But you want to consume the semicolon if it’s there, so it is not allowed to be used to satisfy the negative look-ahead. With Java’s regex engine that’s pretty easy: use ;?+
This is called a “possessive quantifier”. Like with the ? the semicolon doesn’t need to be there but if it’s there it must match and cannot be ignored. So the regex engine has no alternatives any more.
So the entire pattern looks like \"(\S\S+)\";?+(?!c) or \"(\S\S+)\"(;?+)(?!c) if you need the semicolon in a group.

Categories