Java regex lookbehind issue with quantifiers

I'm using a Java regex pattern in an application that only allows access to the whole match value (that is, I cannot use capturing groups).
I am trying to extract values from my sample text:
C02 SURVEY : 2010 F10446P BONAPARTE 2D
In the above example, I need to check for the keyword SURVEY and extract the value after the colon that follows it. The output I want is:
2010 F10446P BONAPARTE 2D
I used the pattern (?<=(?i)survey\s{2}[:])(?:(?![\n]).)*
In this pattern, I have hardcoded the number of whitespace characters as 2 (\s{2}), but the actual number may vary.
I need to use a quantifier inside the lookbehind.
If there is any other option, please let me know.

You may leverage a feature in a Java regex engine that is called "constrained width lookbehind":
Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.
That means, you may replace the {2} limiting quantifier with a limiting quantifier with both minimum and maximum values, e.g. {0,100} to allow zero to a hundred whitespace symbols. Adjust them as you see fit.
Besides, you needn't use a tempered greedy token (?:(?![\n]).)* as the dot in a Java regex does not match line break characters by default. Just replace it with .* to match any zero or more chars other than line breaks. So, your pattern might look as simple as (?i)(?<=survey\s{0,100}:).*. Note that the match starts right after the colon, so it includes any whitespace that follows it.
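A minimal sketch of that pattern in use, assuming the application calls find() and reads only the whole match. The class and method names (SurveyValue, extract, extractTrimmed) are made up for illustration, and the second pattern is my own variant rather than part of the answer above: it allows whitespace after the colon inside the lookbehind and starts the match at the first non-whitespace character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurveyValue {
    // Bounded quantifier {0,100} keeps the lookbehind width finite, which Java requires.
    private static final Pattern AFTER_COLON =
            Pattern.compile("(?i)(?<=survey\\s{0,100}:).*");

    // Hypothetical variant: also skips whitespace after the colon.
    private static final Pattern AFTER_COLON_TRIMMED =
            Pattern.compile("(?i)(?<=survey\\s{0,100}:\\s{0,100})\\S.*");

    // Returns the whole match (no capturing groups), or null if the keyword is absent.
    static String extract(String line) {
        Matcher m = AFTER_COLON.matcher(line);
        return m.find() ? m.group() : null;
    }

    static String extractTrimmed(String line) {
        Matcher m = AFTER_COLON_TRIMMED.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "C02 SURVEY : 2010 F10446P BONAPARTE 2D";
        System.out.println("[" + extract(line) + "]");        // keeps the leading space
        System.out.println("[" + extractTrimmed(line) + "]"); // starts at the first non-space
    }
}
```

If the leading space is acceptable, or can be trimmed later, the first pattern is enough.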

Related

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may contain a decimal separator ("." or ",").
For instance, the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work.
I know that I have to use the method Pattern.matches(regex, String), but I am not able to find the correct regex.

Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
You can see this regular expression here
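As a rough check of the two patterns above, here is a small sketch that runs both against the question's inputs and the 300 - -200 case. The class name EquationPatterns and the constant names are made up for illustration.

```java
import java.util.regex.Pattern;

public class EquationPatterns {
    // First pattern: anything that is not a digit or '-' may sit between the numbers.
    static final Pattern ANY_SEPARATOR =
            Pattern.compile("((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)");

    // Second pattern: only whitelisted operators, with optional surrounding whitespace.
    static final Pattern WHITELISTED =
            Pattern.compile("((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // True only if the whole input matches the pattern.
    static boolean matches(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"500+400", "400,021+213.443", "300 - -200"};
        for (String s : inputs) {
            System.out.println(s + " any=" + matches(ANY_SEPARATOR, s)
                    + " whitelisted=" + matches(WHITELISTED, s));
        }
    }
}
```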

In your regex, you have to escape the dot (\.) to match it literally, and escape the plus (\+), or else ?+ is read as a possessive quantifier. To match 1+ digits, you have to add a quantifier: [0-9]+
For your example data, you could match 1+ digits followed by an optional part that consists of a dot or a comma and 1+ more digits, both before and after the middle character. If you want to match any single character in the middle, you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
Regex demo
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. Match any character (use .+ to repeat 1+ times)
Same as the first pattern
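A short sketch of both variants with Pattern.matches, which requires the whole input to match. It also shows why the character class may be the safer choice: the dot accepts a digit as the middle character. The class and method names are made up for illustration, and the operator class [-+*] is the one suggested above.

```java
import java.util.regex.Pattern;

public class NumberOperatorNumber {
    // Middle character is a dot: any single character, including a digit.
    static final String ANY_CHAR = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";

    // Middle character restricted to a small set of operators.
    static final String OPERATOR = "\\d+(?:[.,]\\d+)?[-+*]\\d+(?:[.,]\\d+)?";

    static boolean matchesAnyChar(String input) {
        return Pattern.matches(ANY_CHAR, input);
    }

    static boolean matchesOperator(String input) {
        return Pattern.matches(OPERATOR, input);
    }

    public static void main(String[] args) {
        String[] inputs = {"500+400", "400,021+213.443", "500400"};
        for (String s : inputs) {
            System.out.println(s + " anyChar=" + matchesAnyChar(s)
                    + " operator=" + matchesOperator(s));
        }
    }
}
```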

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.

The split method isn't handy in this kind of situation. A workaround is to describe everything that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* # zero or more characters that aren't a `?` or a `'`
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
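A minimal sketch of the find-based approach in Java, collecting each match into a list. The class and method names (QuoteSplitter, segments) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteSplitter {
    // Each match is one segment: a "?" always consumes the character after it,
    // so a quote preceded by an odd number of "?" stays inside the segment.
    private static final Pattern SEGMENT =
            Pattern.compile("[^?']+(?:\\?.[^?']*)*|(?:\\?.[^?']*)+");

    static List<String> segments(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the four segments from the question, one per line.
        for (String s : segments("ABC??'DEF?'GHI'JKL????'MNO'")) {
            System.out.println(s);
        }
    }
}
```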
You can also use the split method, but the pattern for it isn't efficient since it needs to look backward at each position (and it is limited, since lookbehind in Java only allows bounded quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')

Have you tried a positive lookbehind?
(?<=.')
Regex101

This regex will do it:
[A-Z]+(\?\?)*'

If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. In many engines lookbehinds must be fixed-width, but some allow an alternation of fixed-width branches inside a single lookbehind. Others do not, and would require it to be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.

word range or \w in negative lookbehind

I was trying to write a regex for extracting the word in the place of Delhi in the text
sending to: GK Delhi, where sending to: is fixed and I don't want to capture whatever is in the place of GK. GK will always be one word in my case. What I wrote, and expected to work, is (?<=sending to: \w )Delhi, meaning: if the text starts with sending to: and ends with Delhi, then return Delhi.
Please help me fix this.

Three points,
\w matches a single word character. Use \w+ to match one or more or \w* to match zero or more word characters.
Don't forget about the whitespace between GK and Delhi: \s+.
Just a note: The (?<= construct is the positive lookbehind, not negative one.
So the regex could look like this:
(?<=sending to:\s*\w+\s+)Delhi
Please also note that arbitrary-length lookbehind is only supported by very few regex engines, but you didn't say anything about the tool you are using.
Update:
Java doesn't support arbitrary-length lookbehind expressions.
The possibilities you have are:
The matched text will always be Delhi (on successful match). So if you are only checking for a match, then you could just use the regex: sending to:\s*\w+\s+Delhi.
If you want to extend the regex to other towns in future, then you could use a capturing group. The regex would be, for example, sending to:\s*\w+\s+(Delhi|Mumbai) and in Java code you would get the city name via matcher.group(1).
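The capturing-group option could be sketched like this. The class and method names (CityExtractor, city) are made up, and Mumbai is only the example alternative mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CityExtractor {
    // No lookbehind needed: match the whole phrase and capture only the city.
    private static final Pattern SENDING_TO =
            Pattern.compile("sending to:\\s*\\w+\\s+(Delhi|Mumbai)");

    // Returns the captured city name, or null if the text does not match.
    static String city(String text) {
        Matcher m = SENDING_TO.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(city("sending to: GK Delhi")); // Delhi
    }
}
```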
Please post your actual Java code showing how you are using the regex if you want more detailed advice.

Java Regular Expression for number of exactly 5 digits anywhere in the string

I'm trying to create a regular expression to parse a 5 digit number out of a string no matter where it is but I can't seem to figure out how to get the beginning and end cases.
I've used the pattern as follows \\d{5} but this will grab a subset of a larger number...however when I try to do something like \\D\\d{5}\\D it doesn't work for the end cases. I would appreciate any help here! Thanks!
For a few examples (55555 is what should be extracted):
At the beginning of the string
"55555blahblahblah123456677788"
In the middle of the string
"2345blahblah:55555blahblah"
At the end of the string
"1234567890blahblahblah55555"

Since you are using a language that supports them, use negative lookarounds:
"(?<!\\d)\\d{5}(?!\\d)"
These will assert that your \\d{5} is neither preceded nor followed by a digit. Whether that is due to the edge of the string or a non-digit character does not matter.
Note that these assertions themselves are zero-width matches. So those characters will not actually be included in the match. That is why they are called lookbehind and lookahead. They just check what is there, without actually making it part of the match. This is another disadvantage of using \\D, which would include the non-digit character in your match (or require you to use capturing groups).
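A small sketch applying the pattern to the three examples from the question. The class and method names (FiveDigits, first) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FiveDigits {
    // Exactly five digits, neither preceded nor followed by another digit.
    private static final Pattern FIVE_DIGITS =
            Pattern.compile("(?<!\\d)\\d{5}(?!\\d)");

    // Returns the first standalone 5-digit number, or null if there is none.
    static String first(String input) {
        Matcher m = FIVE_DIGITS.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("55555blahblahblah123456677788")); // 55555
        System.out.println(first("2345blahblah:55555blahblah"));    // 55555
        System.out.println(first("1234567890blahblahblah55555"));   // 55555
    }
}
```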

Regex matching capital characters, numbers and period

I'm trying to check whether an input contains only capital letters, numbers and periods using a regex. What would the regex pattern be for this in Java?
Are there any guides on how I can build this regex, or even some online tools?
Also, is it possible to check that the length of the string is no more than 50 using a regex?

This is the Unicode answer:
^[\p{Lu}\p{Nd}.]{0,50}$
From regular-expressions.info
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
^ and $ are the start and the end of the string

Regex pattern:
Pattern.compile("^[A-Z\\d.]*$")
To check the length of a string:
Pattern.compile("^.{0,50}$")
Both combined:
Pattern.compile("^[A-Z\\d.]{0,50}$")
Although I wouldn't use regular expressions to check for length if I were you, just call .length() on the string.
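A minimal sketch of the combined pattern in use. The class and method names (UpperDigitDot, isValid) are made up for illustration. Note that without extra flags, \d in Java matches only the ASCII digits 0-9.

```java
import java.util.regex.Pattern;

public class UpperDigitDot {
    // Capital letters, digits and periods only, at most 50 characters in total.
    private static final Pattern ALLOWED =
            Pattern.compile("^[A-Z\\d.]{0,50}$");

    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ABC.123")); // true
        System.out.println(isValid("abc"));     // false: lowercase letters
        System.out.println(isValid(""));        // true: {0,50} allows the empty string
    }
}
```

If the empty string should be rejected, {1,50} would be the quantifier to use instead.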

This website is really handy for building and testing regular expressions

Regular expressions in Java have a lot in common with other languages when it comes to the simple syntax, with some predefined character classes that add more than you'd find in Perl for example. The Java API docs on Pattern show the various patterns that are supported. A friendlier introduction to regexes in Java is http://www.regular-expressions.info/java.html.
Some very quick Googling shows there are many tools online for testing Java regular expressions against input strings. Here is one.
To check for the type of input you are interested in, the following regex should work:
^[A-Z0-9.]{0,50}$
Broken down, this is saying:
^: start matching from the start of the input; do not allow the first character(s) to be skipped
[]: match one of the characters in this character class
A-Z: within a character class, - means to accept all values between the first and last character inclusive, so in this case all characters from A to Z.
0-9: add all digits to the class
.: periods are special in regexes, but all special characters become simple again within a character class ([])
{0,50}: require between 0 and 50 occurrences of the character class just defined. Java does not accept the {,50} shorthand for an omitted minimum, so the 0 has to be written out.
$: the match must reach the end of the input; do not allow the last character(s) to be skipped

This returns true for strings containing at most 50 characters, each of which is a digit, a capital letter or a dot.
string.matches("[0-9A-Z\\.]{0,50}")

In response to what tools you can use, I prefer Regex Coach
