Understanding `+` in regular expression - java

I have a regular expression to remove all usernames from tweets. It looks like this:
regexFinder = "(?:\\s|\\A)[#]+([A-Za-z0-9-_]+):";
I'm trying to understand what each component does. So far, I've got:
( Used to begin a “group” element
?: Starts non-capturing group (this means one that will be removed from the final result)
\\s Matches against shorthand characters
| or
\\A Matches at the start of the string and matches a position as opposed to a character
[#] Matches against this symbol (which is used for Twitter usernames)
+ Match the previous followed by
([A-Za-z0-9- ] Match against any capital or small characters and numbers or hyphens
I'm a bit lost with the last bit though. Could somebody tell me what the +): means? I'm assuming the bracket is ending the group, but I don't get the colon or the plus sign.
If I've made any mistakes in my understanding of the regex please feel free to point it out!

The + actually means "one or more" of whatever it follows.
In this case [#]+ means "one or more # symbols" and [A-Za-z0-9-_]+ means "one or more of a letter, number, dash, or underscore". The + is one of several quantifiers, learn more here.
The colon at the end is just making sure the match has a colon at the end of the match.
Sometimes it helps to see a visualization, here is one generated by debuggex:

The + sign means "the previous character can be repeated 1 or more times". This is in contrast to the * symbol, which means "the previous character can be repeated 0 or more times". The colon, as far as I can tell, is literal—it matches a literal : in the string.

The plus sign in regular expressions means "one or more occurrences of the previous character or group of characters." Since the second plus sign is within the second set of parentheses, it basically means that the second set of parentheses matches any string comprised of at least one lowercase or uppercase letter, number, or hyphen.
As for the colon, it doesn't have any meaning in Java's regex class. If you're not sure, someone else already found out.

Well, we shall see..
[#]+ any character of: '#' (1 or more times)
( group and capture to \1:
[A-Za-z0-9-_]+ any character of: (a-z A-Z), (0-9), '-', '_' (1 or more times)
) end of capture group \1
: look for and match ':'
The following quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times

Related

Having difficulty understanding Java regex interpretation [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
Can someone help me with the following Java regex expression? I've done some research but I'm having a hard time putting everything together.
The regex:
"^-?\\d+$"
My understandning of what each symbol does:
" = matches the beginning of the line
- = indicates a range
? = does not occur or occurs once
\\d = matches the digits
+ = matches one or more of the previous thing.
$ = matches end of the line
Is the regex saying it only want matches that start or end with digits? But where do - and ? come in?
- only indicates a range if it's within a character class (i.e. square brackets []). Otherwise, it's a normal character like any other. With that in mind, this regex matches the following examples:
"-2"
"3"
"-700"
"436"
That is, a positive or negative integer: at least one digit, optionally preceded by a minus sign.
Some regex is composed, as you have now, the correct way to read your regex is :
^ start of word
-? optional minus character
\\d+ one or more digits
$ end of word
This regex match any positive or negative numbers, like 0, -15, 558, -19663, ...
Fore details check this good post Reference - What does this regex mean?
"^-?\\d+$" is not a regex, it's a Java string literal.
Once the compiler has parsed the string literal, the string value is ^-?\d+$, which is a regex matching like this:
^ Matches beginning of input
- Matches a minus sign
? Makes previous match (minus sign) optional
\d Matches a digit (0-9)
+ Makes previous match (digit) match repeatedly (1 or more times)
$ Matches end of input
All-in-all, the regex matches a positive or negative integer number of unlimited length.
Note: A - only denotes a range when inside a [] character class, e.g. [4-7] is the range of characters between '4' and '7', while [3-] and [-3] are not ranges since the start/end value is missing, so they both just match a 3 or - character.

Java Regex Quantifiers in String Split

The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit it's going to split on both of them individually. You can easily match more than one digit at a time more than a few different ways, here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
Which would likely be the closest to what I would think you would want, although it's unclear.
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy qualifier ?. I recommend visiting one of the popular regex sites that allows you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.

Can you fix this Java Regex to match currency such as -10 USD, 12.35 AUD ... (Java)?

I have a need to validate the Currency String as followings:
1. The Currency Unit must be in Uppercase and must contain 3 characters from A to Z
2. The number can contain negative (-) or positive (+) sign.
3. The number can contain the decimal fraction, but if the number contain
the decimal fraction then the fraction must be 2 Decimal only.
4. There is no space in the number part
So see this example:
10 USD ------> match
+10 USD ------> match
-10 USD ------> match
10.23 AUD ------> match
-12.11 FRC ------> match
- 11.11 USD ------> NOT match because there is space between negative sign and the number
10 AUD ------> NOT match because there is 2 spaces between the number and currency unit
135.1 AUD ------> NOT match because there is only 1 Decimal in the fraction
126.33 YE ------> NOT match because the currency unit must contain 3 Uppercase characters
So here is what I tried but failed
if(text != null && text.matches("^[+-]\\d+[\\.\\d{2}] [A-Z]{3}$")){
return true;
}
The "^\\d+ [A-Z]{3}$" only match number without any sign and decimal part.
So Can you fix this Java Regex to match currency that meets the above requirements?
Some other questions in the internet do not match my requirements.
It seems you don't know about ? quantifier which means that element which this quantifier describes can appear zero times or once, making it optional.
So to say that string can contain optional - or + at start just add [-+]?.
To say that it can contain optional decimal part in form .XX where X would be digit just add (\\.\\d{2})?
So try with "^[-+]?\\d+(\\.\\d{2})? [A-Z]{3}$"
BTW If you are using yourString.matches(regex) then you don't have to add ^ or $ to regex. This method will match only if entire string will match regex so these metacharacters are not necessary.
BTW2 Normally you should escape - in character class [...] because it represents range of characters like [A-Z] but in this case - can't be used this way because it is at start of character class so there is no "first" range character, so you don't have to escape - here. Same goes if - is last character in [..-]. Here it also can't represent range so it is simple literal.
Try with:
text.matches("[+-]?\\d+(\\.\\d\\d)? [A-Z]{3}")
Note that since you use .matches(), the regex is automatically anchored (blame the Java API desingers for that: .matches() is woefully misnamed)
you could start your regex with
^(\\+|\\-)?
Which means that it will accept either one + sign, one - sign or nothing at all before the digit. But that's only one of your problems.
Now the decimal point:
"3. The number can contain the decimal fraction, but if the number contain
the decimal fraction then the fraction must be 2 Decimal only."
so after the digit \\d+ the next part should be in ( )? to indicate that it is optional (meaning 1 time or never). So either there are exactly one dot and two digits or nothing
(\\.\\d{2})?
Here you can find a reference for regex and test them. Just have a look at what else you could use to identify the 3 Letters for the currency. E.g. the \s could help you to identify a whitespace
This will match all your cases:
^[-+]?\d+(\.\d{2})?\s[A-Z]{3}$
(Demo # regex101)
To use it in Java you have to escape the \:
text.matches("^[-+]?\\d+(\\.\\d{2})?\\s[A-Z]{3}$")
Your regex wasn't far from the goal, but it contains several mistakes.
The most important one is: [] denotes a character class while () is a capturing group. So when you specify a character group like [\\.\\d{2}] it will match on the characters \,.,d,{,2, and}, while you want to match on the pattern .\d{2}.
The other answers already taught you the ? quantifier, so I won't repeat this.
On a sidenote: regular-expressions.info is a great source to learn these things!
Explanation of the regex used above:
^ #start of the string/line
[-+]? #optionally a - or a + (but not both; only one character)
\d+ #one or more numbers
( #start of optional capturing group
\.\d{2} #the character . followed by exactly two numbers (everything optional)
)? #end of optional capturing group
\s #a whitespace
[A-Z]{3} #three characters in the range from A-Z (no lowercase)
$ #end of the string/line

How to make a regular expression that matches tokens with delimiters and separators?

I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can either my 'D' or 'E' in capital letters and only either of these letters once.
The two numbers you see must always be a 2 digit decimal number, not 1 or 3 numbers.
The string must start and end with '<' and '>' and contain '-' to seperate parts within.
The message in the middle 'hello' can be any character but must not be more than 99 characters in length. It can contain white spaces.
Also this pattern will be repeated, so the expression needs to recognise the different individual patterns within a logn string of these pattersn and ensure they follow this pattern structure. E.g
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*) which sees anything after it as part of any character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making it non-greedy or negation:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
You should not wrap -, < and > in [ ]
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.

Analysing a more complex regex

In a previous question that i asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described on the above link)
but i never managed to fully understand it. Can somebody help me? The regex i was given
is this"
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that even after
thoroughly searching google i could not find, like the question mark preceding the s in the
start, or how exactly the second parenthesis works with the question mark and the equation in the start. Is it possible also to expand it and make it able to work with other types of quotes, like “ ” for example?
Any help is really appreciated.
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead, it checks ahead in the regex, but matches "an empty string"(1) read more about that [here][1]
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it it looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even amount of `"`.
[^\"]* # Matches any character which is not a double quote sign. This means literally _any_ character, including newline characters without enabling the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
That complicated group up there that is repeated twice will match in whitespaces in this string which is not in between two ";
text that has a "string in it".
When used with String.split, splitting the string into; [text, that, has, a, "string in it".]
It will only match if there are an even number of ", so the following will match on all spaces;
text that nearly has a "string in it.
Splitting the string into [text, that, nearly, has, a, "string, in, it.]
(1) When I say that a capture group matches "an empty string", I mean that it actually captures nothing, it only looks ahead from the point in the regex you are, and check a condition, nothing is actually captured. The actual capture is done by \\s+ which follows the lookahead.
The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurence, that is followed by any even number of ", then followed by non-" until the end ($). In other words, it checks that there are an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
This is fairly simple..
Concept
It split's at \s+ whenever there are even number of " ahead.
For example:
Hello hi "Hi World"
^ ^ ^
| | |->will not split here since there are odd number of "
----
|
|->split here because there are even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches previous character or group 1 to many times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match if a is followed by bc
(?=ab)a would first check for ab from current position and then return back to its position.It then matches a.(?=ab)c would not match c
With (?s)(singleline mode) . would match newlines.So,In this case no need of (?s) since there are no .
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)

Categories