Java Regex validation - java

I may be missing something, but I'd like to know why this pattern is valid:
Pattern.compile("([0-9]{1,})");
The compile call does not throw an exception, even though the occurrence count does not look valid.
Thanks a lot.

despite that the occurrence is not valid
quantifiers can be represented using {n,m} syntax where:
{n} - exactly n times
{n,} - at least n times
{n,m} - at least n but not more than m times
Source: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
(Notice that there is no {,m} quantifier representing "not more than m times", because we don't really need one: we can express it as {0,m}.)
So {1,} is valid and simply means "at least once".
To avoid confusion, we can simplify it by replacing the {1,} quantifier with its better-known form, +, like this:
Pattern.compile("([0-9]+)");
Most probably you also don't need capturing group 1, which is created by surrounding [0-9]+ with parentheses. We can access what the regex matched through group 0, so group 1 is redundant in most cases.
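A minimal sketch of how one might check this, assuming a hypothetical class name QuantifierCheck and a made-up sample input. It compiles both forms and reads the result from group 0, so no capturing group is needed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the sample input are assumptions.
class QuantifierCheck {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null; // group() is group 0, the whole match
    }

    public static void main(String[] args) {
        String input = "order 12345 shipped";
        System.out.println(firstMatch("([0-9]{1,})", input)); // "at least once", long form
        System.out.println(firstMatch("[0-9]+", input));      // same meaning, short form
    }
}
```

Both calls should report the same run of digits, since {1,} and + describe the same repetition.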

Related

Regex to validate exact 8 or exact 14 character [duplicate]

Consider the following regular expression, where X is any regex.
X{n}|X{m}
This regex would test for X occurring exactly n or m times.
Is there a regex quantifier that can test for an occurrence X exactly n or m times?
There is no single quantifier that means "exactly m or n times". The way you are doing it is fine.
An alternative is:
X{m}(X{k})?
where m < n and k is the value of n-m.
Here is the complete list of quantifiers (ref. http://www.regular-expressions.info/reference.html):
?, ?? - 0 or 1 occurrences (?? is lazy, ? is greedy)
*, *? - any number of occurrences
+, +? - at least one occurrence
{n} - exactly n occurrences
{n,m} - n to m occurrences, inclusive
{n,m}? - n to m occurrences, lazy
{n,}, {n,}? - at least n occurrences
To get "exactly N or M", you need to write the quantified regex twice, unless m,n are special:
X{n,m} if m = n+1
(?:X{n}){1,2} if m = 2n
...
No, there is no such quantifier. But I'd restructure it to /X{n}(X{m-n})?/ (assuming n < m) to prevent problems in backtracking.
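As a rough illustration of the optional-tail form suggested in the answers above, here is a sketch for the "exactly 8 or exactly 14 characters" case from the question title. The class name, the choice of \w for X, and the sample strings are assumptions:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; \w stands in for X and is an assumption.
class ExactCounts {
    // X{8}(?:X{6})? accepts exactly 8 or exactly 8 + 6 = 14 repetitions of X.
    private static final Pattern EIGHT_OR_FOURTEEN = Pattern.compile("\\w{8}(?:\\w{6})?");

    static boolean isValid(String s) {
        // matches() requires the whole string to fit the pattern.
        return EIGHT_OR_FOURTEEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("abcd1234"));       // 8 characters
        System.out.println(isValid("abcd1234efgh56")); // 14 characters
        System.out.println(isValid("abcd123456"));     // 10 characters
    }
}
```

Because matches() anchors at both ends, no extra ^ or $ is needed in this sketch.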
This is a very old post, but I'd like to contribute something that might be of help.
I've tried it exactly the way stated in the question and it does work, but there's a catch:
The order of the alternatives matters. Consider this:
#[a-f0-9]{6}|#[a-f0-9]{3}
This will find all occurrences of hex colour codes (they're either 3 or 6 digits long). But when I flip it around like this
#[a-f0-9]{3}|#[a-f0-9]{6}
it will only find the 3-digit ones or the first 3 digits of the 6-digit ones. This makes sense, because the engine takes the first alternative that succeeds, and a regex pro might spot it right away, but for many this might seem like peculiar behaviour. There are some advanced regex features that avoid this trap regardless of the order, but not everyone is knee-deep in regex patterns.
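The ordering effect can be reproduced with a small sketch like the following. The class name, the helper findAll, and the sample input are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample input is an assumption.
class AlternationOrder {
    // Collects every match of the regex in the input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String css = "#ffffff #abc";
        System.out.println(findAll("#[a-f0-9]{6}|#[a-f0-9]{3}", css)); // long alternative first
        System.out.println(findAll("#[a-f0-9]{3}|#[a-f0-9]{6}", css)); // short alternative first
    }
}
```

With the short alternative first, the six-digit code is cut down to its first three digits, which is the trap described above.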
TL;DR: (?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
It looks like you want "x n times" or "x m times"; I think a literal translation to regex would be (x{n}|x{m}).
Like this: https://regex101.com/r/vH7yL5/1
Or, in a case where you can have a sequence of more than m "x"s (assuming m > n), you can add 'following no "x"' and 'followed by no "x"', translating to [^x](x{n}|x{m})[^x]. But that assumes there is always a character before and after your "x"s, as you can see here: https://regex101.com/r/bB2vH2/1
You can change it to (?:[^x]|^)(x{n}|x{m})(?:[^x]|$), translating to "following no 'x' or following line start" and "followed by no 'x' or followed by line end". But still, it won't match two sequences with only one character between them (because the first match would require a character after, and the second a character before), as you can see here: https://regex101.com/r/oC5oJ4/1
Finally, to match sequences that are only one character apart, you can use a positive lookahead (?=) for the "no 'x' after" part or a positive lookbehind (?<=) for the "no 'x' before" part, like this: https://regex101.com/r/mC4uX3/1
(?<=[^x]|^)(x{n}|x{m})(?:[^x]|$)
This way you will match only the exact number of 'x's you want.
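A sketch of the lookbehind variant in Java, with n = 2 and m = 4 chosen arbitrarily for illustration. The class name and the sample input are assumptions as well:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; n = 2 and m = 4 are assumed values.
class ExactRuns {
    // Lookbehind: preceded by a non-x or by the start of the input.
    // Trailing group: followed by a non-x or by the end of the input.
    private static final Pattern RUNS =
            Pattern.compile("(?<=[^x]|^)(x{2}|x{4})(?:[^x]|$)");

    // Returns every run of exactly 2 or exactly 4 x's (capturing group 1).
    static List<String> runs(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = RUNS.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The run of three x's should be skipped; runs one character apart are both found.
        System.out.println(runs("xx-xxxx-xxx-xx"));
    }
}
```

The lookbehind does not consume the separator, so a single character between two runs can serve both the previous match and the next one.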
Taking a look at Enhardened's answer, they state that their penultimate expression won't match sequences with only one character between them. There is an easy way to fix this without using look ahead/look behind, and that's to replace the start/end character with the boundary character. This lets you match against word boundaries which includes start/end. As such, the appropriate expression should be:
(?:[^x]|\b)(x{n}|x{m})(?:[^x]|\b)
As you can see here: https://regex101.com/r/oC5oJ4/2.

Java Regex how to get all the matching occurrences of a pattern [duplicate]

What do the terms "greedy" and "lazy" mean, in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
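The same HTML example can be tried in Java with a short sketch. The class name and the helper findAll are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the <em> example.
class GreedyVsLazyTags {
    // Collects every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<em>Hello World</em>";
        System.out.println(findAll("<.+>", html));  // greedy: one match spanning both tags
        System.out.println(findAll("<.+?>", html)); // lazy: each tag on its own
    }
}
```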
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier    Lazy quantifier    Description
*                    *?                 Star quantifier: 0 or more
+                    +?                 Plus quantifier: 1 or more
?                    ??                 Optional quantifier: 0 or 1
{n}                  {n}?               Quantifier: exactly n
{n,}                 {n,}?              Quantifier: n or more
{n,m}                {n,m}?             Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy, i.e. lazy.
Example:
test string: stackoverflow
greedy regular expression: s.*o output: stackoverflo
lazy regular expression: s.*?o output: stacko
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engines are greedy by default. Adding a question mark at the end of a quantifier enables lazy matching.
As @Andre S mentioned in a comment:
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String money = "100000000999";

        // Greedy: 0* grabs as many zeros as it can.
        String greedyRegex = "100(0*)";
        Pattern pattern = Pattern.compile(greedyRegex);
        Matcher matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
        }

        // Lazy: 0*? grabs as few zeros as it can (here, none).
        String lazyRegex = "100(0*?)";
        pattern = Pattern.compile(lazyRegex);
        matcher = pattern.matcher(money);
        while (matcher.find()) {
            System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
        }
    }
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: A greedy quantifier first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: A lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? after the quantifier.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS suddenly becomes non-greedy and returns as little as possible, i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-greedy and only matches $5, so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything, only as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will keep consuming input that fits your pattern until none is left and it can look no further.
Lazy will stop as soon as it encounters the first match for the pattern you requested.
One common example that I often encounter is \s*-\s*? in a regex such as ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is greedy because of the * and will consume as many whitespace characters as possible after the digits before looking for the dash character "-". The second \s*? is lazy because of the *?, which means it starts by matching no whitespace at all and takes one more whitespace character only when the following [0-9]{7} cannot match yet.
Best shown by example: the string 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the first octet, but it actually matches the whole string. Why? Because the .+ is greedy, and a greedy match consumes every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Only then does it backtrack, one character at a time if necessary, until it finds a match for the third token (\b).
If the string were a 4GB text file and 192.168.1.1 was at the start, you could easily see how this backtracking would cause an issue.
To make a regex non-greedy (lazy), put a question mark after your greedy quantifier, e.g.
*?
??
+?
What happens now is that token 2 (.+?) finds a match, the regex moves along one character and then tries the next token (\b) rather than token 2 (.+?). So it creeps along gingerly.
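A small Java sketch of the octet example, with an assumed class name and helper. Note that backslashes are doubled inside Java string literals:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the 192.168.1.1 example.
class OctetDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String ip = "192.168.1.1";
        System.out.println(firstMatch("\\b.+\\b", ip));  // greedy: runs to the last word boundary
        System.out.println(firstMatch("\\b.+?\\b", ip)); // lazy: stops at the first word boundary
    }
}
```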
To give extra clarification on laziness, here is an example that may not be intuitive at first look but illustrates the idea of "gradually expands the match" from Suganthan Madhavan Pillai's answer.
input -> some.email#domain.com#
regex -> ^.*?#$
The regex will match this input. At first glance, somebody could say that the lazy part (".*?#") will stop at the first #, after which the regex checks that the input string ends ("$"). Following this logic, someone would conclude there is no match, because the input string doesn't end after the first #.
But as you can see, this is not the case: even though we are using a non-greedy (lazy) search, the regex keeps going forward until it hits the second #, and produces the MINIMAL match that satisfies the whole pattern.
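To see the lazy quantifier expand in practice, one could run a sketch like this in Java. The class name and helper are assumptions; the input and the regex are the ones from the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the "lazy still expands" example.
class LazyExpands {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "some.email#domain.com#";
        System.out.println(firstMatch("^.*?#", input));  // no end anchor: stops at the first #
        System.out.println(firstMatch("^.*?#$", input)); // end anchor forces .*? to keep expanding
    }
}
```

Without the $ the lazy part is satisfied at the first #; with it, the only way to complete the pattern is to grow up to the last #.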
Try to understand the following behavior (C#):
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2"; // one leading space
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = "  0014.2"; // two leading spaces
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " "

Is there a wildcard for integers to use in regex?

I want to know if there is a way to check if a string contains a certain pattern for a regex.
For example:
string.matches("something[0-9]x") would check if the string contains a substring of "something" with any single-digit integer following it, followed by "x". But let's say I want to check the same thing with no limit on that int, i.e. it could be 1000000. Is there something like a wildcard for an int that I can use?
Just use the + quantifier after your character class; it matches the preceding token one or more times:
string.matches("something[0-9]+x")
Regular expressions work on characters; they have no semantic understanding of those characters. So it doesn't make sense to talk about "integers" here; the best that you can do is to talk about "digits". The number "1" is one digit; "1234" is four.
In a regular expression, you can match one or more of the preceding pattern using "+", so the regex "something[0-9]+x" should do what you want. If you want an upper bound on the number of digits, then you can try something like "something[0-9]{1,5}x".
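A sketch of both suggestions in Java, with hypothetical class and method names. It also illustrates a point the question glosses over: matches() tests the whole string, while find() looks for the pattern anywhere inside it:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample strings are assumptions.
class DigitRun {
    private static final Pattern ANY_LENGTH = Pattern.compile("something[0-9]+x");
    private static final Pattern UP_TO_FIVE = Pattern.compile("something[0-9]{1,5}x");

    // True when the whole string is "something", one or more digits, then "x".
    static boolean wholeString(String s) {
        return ANY_LENGTH.matcher(s).matches();
    }

    // True when the whole string has between one and five digits in that position.
    static boolean wholeStringUpToFive(String s) {
        return UP_TO_FIVE.matcher(s).matches();
    }

    // True when the pattern occurs anywhere inside the string.
    static boolean containsIt(String s) {
        return ANY_LENGTH.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeString("something1000000x"));          // seven digits, no upper bound
        System.out.println(wholeStringUpToFive("something1000000x"));  // seven digits, bound is five
        System.out.println(wholeString("prefix something42x suffix")); // extra text around the pattern
        System.out.println(containsIt("prefix something42x suffix"));  // substring search
    }
}
```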
Yes, simply use +, so in your example string.matches("something[0-9]+x")
It would match the string something followed by any digit from 0 to 9, which has to occur at least once. * means zero or more times, while + means it has to occur at least once but can occur more often.
If you write [0-9]{n,m}, you can specify with n and m the range in which it can occur, for example:
[0-9]{2,3} will match any digit that occurs 2 or 3 times. If you use only one number in the braces, as in [0-9]{2}, it has to occur exactly 2 times.
But at last: simply learn to use google ... there are so many regexp sites with tutorials and stuff.

This regex line exceeds my understanding "(?=(?:\d{3})++(?!\d))"

I am pretty OK with basic regex, but this line of code, used to insert thousands separators in large numbers, exceeds my knowledge, and googling it quite a bit did not satisfy my curiosity either. Can one of you please take a minute to explain the following line of code to me?
someString.replaceAll("(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))", "$1,");
I especially don't understand the regex structure "(?=(?:\d{3})++(?!\d))".
Thanks a lot in advance.
"(?=(?:\d{3})++(?!\d))" is a lookahead assertion.
It means "only match if followed by ((three digits that we don't need to capture) repeated one or more times (and again repeated one or more times) (not followed by a digit))". See this explanation about (?:...) notation. It's called non-capturing group and means you don't need to reference this group after the match.
"(\\G-?\\d{1,3})" is the part that should actually match (but only if the above-described conditions are met).
Edit: I think + must be a special character, otherwise it's just a plus. If it's a special character (and quick search suggests that it is in Java, too), the second one is redundant.
Edit 2: Thanks to Alan Moore, it's now clear. The second + means possessive matching: if, after matching as many 3-digit groups as possible, the engine finds that they are followed by another digit, it will immediately give up instead of stepping back one 3-digit group.
This expression has some advanced stuff in it.
First, the easiest: \d{3} means exactly three digits. These are your thousands.
Then: the ++ is a variant of + (which means one or more), but possessive, which means it will eat all of the thousands groups and never give any back. I'm not completely sure why this is necessary.
?: means it is a non-capturing group. I think this is just there for performance reasons and could be omitted.
?= is a positive lookahead. I think this means it only checks whether that group exists but does not count towards the matched string, meaning it won't be replaced.
?! is a negative lookahead. I don't quite understand that, but I think it means it must NOT match, which in turn means there cannot be another digit at the end of the matched sequence. This makes sure the first group gets the right digits. E.g. 10000 can only be matched as 10(000) but not 1(000)0, if you see what I mean.
Through the lookaheads, if I understand it correctly (I haven't tested it), only the first group would actually be replaced, as it is the one that matches.
To me, the most interesting part of that regex is the \G. It took me a while to remember what it's for: to prevent adding commas to the fraction part if there is one. If the regex were simply:
(-?\d{1,3})(?=(?:\d{3})++(?!\d))
...this number:
12345.67890
...would end up as:
12,345.67,890
But adding \G to the beginning means a match can only start at the beginning of the string or at the position where the previous match ended. So it doesn't match 345 because of the . following it, and it doesn't match 67 because it would have to skip over some of the string to do so. And so it correctly returns:
12,345.67890
I know this isn't an answer to the question, but I thought it was worth a mention.
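The effect of \G can be compared side by side with a short sketch. The class and method names are hypothetical; the two patterns are the ones discussed above:

```java
// Hypothetical demo class comparing the pattern with and without \G.
class ThousandsSeparator {
    private static final String WITH_G = "(\\G-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))";
    private static final String WITHOUT_G = "(-?\\d{1,3})(?=(?:\\d{3})++(?!\\d))";

    // Each match must start where the previous one ended (or at the start of the input).
    static String group(String number) {
        return number.replaceAll(WITH_G, "$1,");
    }

    // Matches may start anywhere, so the fraction part gets commas too.
    static String groupUnanchored(String number) {
        return number.replaceAll(WITHOUT_G, "$1,");
    }

    public static void main(String[] args) {
        System.out.println(group("1234567"));
        System.out.println(group("12345.67890"));
        System.out.println(groupUnanchored("12345.67890"));
    }
}
```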

What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?

From the Pattern javadocs:
Greedy quantifiers:
X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
Reluctant quantifiers:
X?? X, once or not at all
X*? X, zero or more times
X+? X, one or more times
X{n}? X, exactly n times
X{n,}? X, at least n times
X{n,m}? X, at least n but not more than m times
The description of what they do is the same...so, what is the difference?
I would really appreciate some examples.
I am coding in Java, but I hear this concept is the same for most modern regex implementations.
A greedy quantifier always tries to "grab" as much of the input as possible, while a reluctant quantifier will match as little of the input as possible and still create a match.
Example:
"The red fox jumped over the red fence"
/(.*)red/ => \1 = "The red fox jumped over the "
/(.*?)red/ => \1 = "The "
"aaa"
/(a?)(a*)/ => \1 = "a", \2 = "aa"
/(a??)(a*)/ => \1 = "", \2 = "aaa"
"Mr. Doe, John"
/^(?:Mrs?.)?.*\b(.+)$/ => \1 = "John"
/^(?:Mrs?.)?.*?\b(.+)$/ => \1 = "Doe, John"
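Since the question is about Java, here is a sketch of the "red fox" example using java.util.regex. The class name and the helper firstGroup are assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the "red fox" example.
class GreedyVsReluctant {
    // Returns capturing group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "The red fox jumped over the red fence";
        // Greedy: group 1 runs up to the last "red".
        System.out.println("[" + firstGroup("(.*)red", text) + "]");
        // Reluctant: group 1 stops before the first "red".
        System.out.println("[" + firstGroup("(.*?)red", text) + "]");
    }
}
```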
From this link, where the tutorial author acknowledges the spirit of your question:
At first glance it may appear that
the quantifiers X?, X?? and X?+ do
exactly the same thing, since they all
promise to match "X, once or not at
all". There are subtle implementation
differences which will be explained
near the end of this section.
They go on to put together examples and offer the explanation:
Greedy quantifiers are considered
"greedy" because they force the
matcher to read in, or eat, the entire
input string prior to attempting the
first match. If the first match
attempt (the entire input string)
fails, the matcher backs off the input
string by one character and tries
again, repeating the process until a
match is found or there are no more
characters left to back off from.
Depending on the quantifier used in
the expression, the last thing it will
try matching against is 1 or 0
characters.
The reluctant quantifiers, however,
take the opposite approach: They start
at the beginning of the input string,
then reluctantly eat one character at
a time looking for a match. The last
thing they try is the entire input
string.
And for extra credit, the possessive explanation:
Finally, the possessive quantifiers
always eat the entire input string,
trying once (and only once) for a
match. Unlike the greedy quantifiers,
possessive quantifiers never back off,
even if doing so would allow the
overall match to succeed.
A greedy quantifier will match as much as possible and still get a match
A reluctant quantifier will match the smallest amount possible.
for example given the string
abcdef
the greedy quantifier
ab[a-z]*[a-z] would match abcdef
the reluctant quantifier
ab[a-z]*?[a-z] would match abc
Say you have a regex "a\w*b" and use it on "abab".
Greedy matching will match "abab" (it looks for an a, as many occurrences of \w as possible, and a b), and reluctant matching (a\w*?b) will match just "ab" (as few \w as possible).
There is documentation on how Perl handles these quantifiers in perldoc perlre:
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
{n}? Match exactly n times, not greedily
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
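Java's java.util.regex also supports possessive quantifiers, so the 'aaaa' example from the Perl documentation can be sketched in Java as well. The class name and helper are assumptions:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the possessive 'aaaa' example.
class PossessiveDemo {
    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Greedy a+ gives one 'a' back so the trailing a can match.
        System.out.println(found("a+a", "aaaa"));
        // Possessive a++ keeps every 'a', leaving nothing for the trailing a.
        System.out.println(found("a++a", "aaaa"));
    }
}
```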
