How to add wildcards in lookahead? - java

I'd like to identify certain values in the following string, specially the values inside CVC and Number:
CreditCard Number="123" CVC="213" Date="2015-12"
(?<=CVC=\").*(?=") matches 213" Date="2015-12. How can I modify the regex to look for the first doublequote match after something was found, and not to look for the last doublequote as it does now?
Further: how can I define wildcards in lookaheads? Ideally I'd like to have an expression:
(?<=CreditCard.*CVC=\").*(?=") which means that a CVC statement must be preceded with "CreditCard" String, but between them there could by any values.

You can simply make the .* not greedy .*?
(?<=CVC=\").*?(?=")
RegExr
In answer to your 2nd question, java regex (and most other engines) don't allow variable length lookbehinds. Usually though, you can solve a problem that would require a variable length lookbehind by using capture groups:
(?<=CreditCard.*CVC=\").*?(?=")
becomes:
CreditCard.*?CVC=\"(.*?)"
And then you can take the relevant information from capture group 1.
RegExr (.* added on RegExr so that output replaces the entire input, its not required for your case though.)

You could skip using lookbehinds, and instead use clustering to pull out just the portions of the string you want:
CreditCard Number="(/d*)".*\sCVC="(/d*)"
And then the "match groups" numbered 1 and 2 will correspond to your credit card number and CVC, respectively. (You can use Matcher.group(int) to retrieve the values of the various groups) Notice that by using \d to specifically match digits, you don't have to make the * non-greedy. In this case it works because you only want to match on digits. In the general case (let's say a credit card number could consist of any non-quote character), you can use a custom character class to match anything but your delimiter (quote in this case):
CreditCard Number="([^"]*)".*\sCVC="([^"]*)"

Related

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.
This isn't handy to use the split method in this kind situations. A workaround consists to describe all that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* #
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
You can also use the split method but the pattern to do it isn't efficient since it needs to look backward for each position (and is limited since the lookbehind in java allows only limited quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')
Have you tried positive lookbehind
(?<=.')
Regex101
This regex will do it:
[A-Z]+(\?\?)*'
If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. Lookbehinds must be fixed-width, but some engines allow an OR of the entire lookbehind. Others do not, and would require it be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.

Java regex lookbehind issue with quantifiers

I'm using a Java regex pattern in an application that only allows access to the whole match value (that is, I cannot use capturing groups).
I am trying to extract values from my sample text:
C02 SURVEY : 2010 F10446P BONAPARTE 2D
In the above example I need to check for the keyword SURVEY and have to extract value after that :. And I wanted my output to be:
2010 F10446P BONAPARTE 2D
I used the pattern (?<=(?i)survey\s{2}[:])(?:(?![\n]).)*
In this pattern, I have hardcoded the spaces to be 2 (\s{2}) which may vary and not constant value.
I need to use quantifiers with lookbehind operation.
If any other option is there please let me know.
You may leverage a feature in a Java regex engine that is called "constrained width lookbehind":
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
That means, you may replace the {2} limiting quantifier with a limiting quantifier with both minimum and maximum values, e.g. {0,100} to allow zero to a hundred whitespace symbols. Adjust them as you see fit.
Besides, you needn't use a tempered greedy token (?:(?![\n]).)* as the dot in Java regex does not match a newline. Just replace it with .* to match any zero or more chars other than newline. So, your pattern might look as simple as (?i)(?<=survey\s{0,100}:).*.

Only capture digits instead of the other text in input for a regex like below

Currently this regular expression:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$
captures 418—FINAL in group number 2 for an input like:
String text="H.B. 418—FINAL VERSION";
How do I change this regular expression to only capture the number (digits) of "418" in group2 ?
EDIT:
I'd still like to capture "H.B." in a preceding group.
Just change the boundaries of the second group to only include the digits. To also save the "H.B." part, add paranthesis around that part too:
^(?:(\\S+)\\s+)*?(\\d+)\\S+\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$
I'm not entirely sure what your exact requirements are (your regex is looking for an optional "No." but you haven't given any examples). But this will work on the example you give:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\d+).*(?:\\s+\\(.*?\\))?$
assuming you don't need the text following the digits. That is, just change the second \S to \d. I also added .* after this to match any remaining characters following the digits up to an optional parenthesized part (without capturing them, but you can capture them if you want to).

How to combine these regex for javascript

Hi I am trying to use regEx in JS for identifying 3 identical consecutive characters (could be alphabets,numbers and also all non alpha numeric characters)
This identifies 3 identical consecutive alphabets and numbers : '(([0-9a-zA-Z])\1\1)'
This identifies 3 identical consecutive non alphanumerics : '(([^0-9a-zA-Z])\1\1)'
I am trying to combine both, like this : '(([0-9a-zA-Z])\1\1)|(([^0-9a-zA-Z])\1\1)'
But I am doing something wrong and its not working..(returns true for '88aa3BBdd99##')
Edit : And to find NO 3 identical characters, this seems to be wrong /(^([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1)/ --> RegEx in JS to find No 3 Identical consecutive characters
thanks
Nohsib
The problem is that backreferences are counted from left to right throughout the whole regex. So if you combine them your numbers change:
(([0-9a-zA-Z])\2\2)|(([^0-9a-zA-Z])\4\4)
You could also remove the outer parens:
([0-9a-zA-Z])\1\1|([^0-9a-zA-Z])\2\2
Or you could just capture the alternatives in one set of parens together and append one back-reference to the end:
([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1
But since your character classes match all characters anyway you can have that like this as well:
([\s\S])\1\1
And if you activate the DOTALL or SINGLELINE option, you can use a . instead:
(.)\1\1
It's actually much simpler:
(.)\1\1
The (.) matches any character, and each \1 is a back reference that matches the exact string that was matched by the first capturing group. You should be aware of what the . actually matches and then modify the group (in the parentheses) to fit your exact needs.

regex for specific digit prefix

I am trying to have the following regx rule, but couldn't find solution.
I am sorry if I didn't make it clear. I want for each rule different regx. I am using Java.
rule should fail for all digit inputs start with prefix '1900' or '1901'.
(190011 - fail, 190111 - fail, 41900 - success...)
rule should success for all digit inputs with the prefix '*'
different regex for each rule (I am not looking for the combination of both of them together)
Is this RE fitting the purpose ? :
'\A(\*|(?!190[01])).*'
\A means 'the beginning of string' . I think it's the same in Java's regexes
.
EDIT
\A : "from the very beginning of the string ....". In Python (which is what I know, in fact) this can be omitted if we use the function match() that always analyzes from the very beginning, instead of search() that search everywhere in a string. If you want the regex able to analyze lines from the very beginning of each line, this must be replaced by ^
(...|...) : ".... there must be one of the two following options : ....."
\* : "...the first option is one character only, a star; ..." . As a star is special character meaning 'zero, one or more times what is before' in regex's strings, it must be escaped to strictly mean 'a star' only.
(?!190[01]) : "... the second option isn't a pattern that must be found and possibly catched but a pattern that must be absent (still after the very beginning). ...". The two characters ?! are what says 'there must not be the following characters'. The pattern not to be found is 4 integer characters long, '1900' or '1901' .
(?!.......) is a negative lookahead assertion. All kinds of assertion begins with (? : the parenthese invalidates the habitual meaning of ? , that's why all assertions are always written with parentheses.
If \* have matched, one character have been consumed. On the contrary, if the assertion is verified, the corresponding 4 first characters of the string haven't been consumed: the regex motor has gone through the analysed string until the 4th character to verify them, and then it has come back to its initial position, that is to say, presently, at the very beginning of the string.
If you want the bi-optional part (...|...) not to be a capturing group, you will write ?: just after the first paren, then '\A(?:\*|(?!190[01])).*'
.* : After the beginning pattern (one star catched/matched, or an assertion verified) the regex motor goes and catch all the characters until the end of the line. If the string has newlines and you want the regex to catch all the characters until the end of the string, and not only of a line, you will specify that . must match the newlines too (in Python it is with re.MULTILINE), or you will replace .* with (.|\r|\n)*
I finally understand that you apparently want to catch strings composed of digits characters. If so the RE must be changed to '\A(?:\*|(?!190[01]))\d*' . This RE matches with empty strings. If you want no-match with empty strings, put \d+ in place of \d* . If you want that only strings with at least one digit, even after the star when it begins with a star, match, then do '\A(?:\*|(?!190[01]))(?=\d)\d*'
For the first rule, you should use a combo regex with two captures, one to capture the 1900/1901-prefixed case, and one the capture the rest. Then you can decide whether the string should succeed or fail by examining the two captures:
(190[01]\d+)|(\d+)
Or just a simple 190[01]\d+ and negate your logic.
Regex's are not really very good at excluding something.
You may exclude a prefix using negative look-behind, but it won't work in this case because the prefix is itself a stream of digits.
You seem to be trying to exclude 1-900/901 phone numbers in the US. If the number of digits is definite, you can use a negative look-behind to exclude this prefix while matching the remaining exact number digits.
For the second rule, simply:
\*\d+

Categories