Match anything before a certain pattern - java

I have a string that can contain almost any character, including (, _, -, % and so forth. The string ends with (\d{1,2}), i.e. parentheses containing 1 or 2 digits. I now want 2 capturing groups: the digits, and everything before the parentheses.
Currently I have:
final Pattern pattern = Pattern.compile("^([-%\\(\\)_/= a-zA-Z\\d]+)\\((\\d{1,2})\\)$");
But this does not match everything. I want to replace the char group with .* but not have it match the (\d{1,2}) at end of string. How can I achieve this?

If I understand your question, you can use the reluctant quantifier .*? instead of the greedy quantifier .* so that everything before your second capture group is matched reluctantly.
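A minimal sketch of that suggestion, assuming the whole input must match and contains no line breaks (by default . does not match line terminators). The class name TrailingCounter and the helper split are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingCounter {
    // Group 1: everything before the final "(d)" or "(dd)"; group 2: the digits.
    private static final Pattern PATTERN =
            Pattern.compile("^(.*?)\\((\\d{1,2})\\)$");

    /** Returns {prefix, digits}, or null when the input does not end in "(d)" or "(dd)". */
    static String[] split(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("a_b-(c) 50% (12)");
        System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
    }
}
```

Because the pattern is anchored with $, the reluctant group keeps expanding until the remaining text is exactly the parenthesized digits, so earlier parentheses such as (c) end up in group 1.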

Related

Replace URL String with Integer characters located in the end of that String

I have a URL and tried to strip everything except the integer at the end of the link using a regex.
The URL is something like
https://some.storage.com/test123456.bucket.com/folder/80.png
Regex i tried to use:
Integer.parseInt(string.replaceAll(".*[^\\d](\\d+)", "$1"))
The output for that regex is "80.png", and I need only "80". I also tried this tool: https://regex101.com. As far as I can see, the main problem is that ".png" is not matched by my regex, so after substitution this part is left in the result.
I'm a total noob in regex, so I kindly ask you for help.
You may use
String result = string.replaceAll("(?:.*\\D)?(\\d+).*", "$1");
See the regex demo.
NOTE: If there is no match, the result will be equal to the string value. If you do not want this behavior, instead of "(?:.*\\D)?(\\d+).*", use "(?:.*\\D)?(\\d+).*|.+".
Details
(?:.*\D)? - an optional (it must be optional because the Group 1 pattern might also be matched at the start of the string) sequence of
.* - any 0+ chars other than line break chars, as many as possible
\D - a non-digit
(\d+) - Group 1: any one or more digits
.* - any 0+ chars other than line break chars, as many as possible
The replacement is $1, the backreference to the Group 1 value, i.e. the last chunk of 1+ digits in a string that has no line breaks.
Line breaks can be supported if you prepend the pattern with the (?s) inline DOTALL modifier, i.e. "(?s)(?:.*\\D)?(\\d+).*|.+".
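As a quick check of the pattern above, here is a small sketch applied to the URL from the question. The class name LastNumber and the helper lastDigits are hypothetical.

```java
class LastNumber {
    /** Keeps only the last run of digits; returns the input unchanged if it has none. */
    static String lastDigits(String s) {
        return s.replaceAll("(?:.*\\D)?(\\d+).*", "$1");
    }

    public static void main(String[] args) {
        String url = "https://some.storage.com/test123456.bucket.com/folder/80.png";
        int number = Integer.parseInt(lastDigits(url));
        System.out.println(number); // prints 80
    }
}
```

Note that Integer.parseInt throws a NumberFormatException when the input has no digits at all, since the string is then returned unchanged.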

How to add a quantity condition to a regex without including the quantity into the matching group?

I want to create a match only if an additional quantity condition is true.
Example (which is fine):
Regex: -(START.*?)_\d+(?=-END)
Input: test-START_one_two_three_4-END
Match Group1: START_one_two_three
I also want to add a check that the group match contains 3 to 4 _ characters that do not directly follow each other.
So I'd have to create an additional non-capturing group with (?:...).
What I tried: looking 4 times for _* until the -END:
(?:(?:_[^_]*){4}-END)
But adding this into the regex won't create a match anymore. Why?
https://regex101.com/r/MHzWBr/2
You may use a lookahead here:
-(START(?=(?:_[^_]*){3,4}-END).*?)_+\d+(?=-END)
See the regex demo
Now, (?=(?:_[^_]*){3,4}-END) is a positive lookahead that makes sure that, immediately to the right of the current location, there is
(?:_[^_]*){3,4} - three or four repetitions of _ followed with any 0+ chars other than _
-END - a literal -END string.
.*?
Note that if you want to match the closest window between -START and -END you need to exclude the . and [^_] from matching the start of the -START and -END patterns:
-(START(?=(?:_(?:(?!-(?:END|START))[^_])*){3,4}-END)(?:(?!-(?:END|START)).)*)_+\d+(?=-END)
See this regex demo
The (?:(?!-(?:END|START)).)* pattern is a tempered greedy token.
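The simpler lookahead version (without the tempered greedy token) can be tried out in Java with a sketch like the following. The class name UnderscoreCount and the helper capture are hypothetical, and the second input is an assumed counterexample that is not in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreCount {
    // The lookahead right after START requires 3 or 4 "_..." chunks before -END.
    private static final Pattern PATTERN = Pattern.compile(
            "-(START(?=(?:_[^_]*){3,4}-END).*?)_+\\d+(?=-END)");

    /** Returns group 1, or null when the underscore condition is not met. */
    static String capture(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("test-START_one_two_three_4-END")); // START_one_two_three
        System.out.println(capture("test-START_one_4-END"));           // null (only 2 underscores)
    }
}
```

The lookahead consumes nothing, so the count check and the capture of group 1 both start from the same position right after START.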
Another option might be to do this without a positive lookahead and repeat 2 - 3 times an underscore followed by 1+ times not an underscore.
You could also turn the positive lookahead at the end into a match.
-(START(?:_[^_]+){2,3})_\d+(?=-END)
Regex demo
That will match:
-                      Match -
(                      Capturing group
START(?:_[^_]+){2,3}   Match START and repeat 2-3 times an underscore followed by 1+ chars other than an underscore
)_\d+                  Close group, match _ and 1+ digits
(?=-END)               Assert what is on the right is -END (or match -END without the lookahead)
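A short sketch of this lookahead-free variant in Java, under the assumption that every chunk between underscores is non-empty. The class name RepeatedChunks and the helper capture are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedChunks {
    // 2-3 "_chunk" repetitions inside the group plus the final "_digits"
    // gives 3-4 underscores in total.
    private static final Pattern PATTERN =
            Pattern.compile("-(START(?:_[^_]+){2,3})_\\d+(?=-END)");

    /** Returns group 1, or null when there is no match. */
    static String capture(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capture("test-START_one_two_three_4-END")); // START_one_two_three
    }
}
```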

Java Pattern regex search between strings

Given the following strings (stringToTest):
G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
And the Pattern:
Pattern p = Pattern.compile("G2:\\S+RQ:3,4");
if (p.matcher(stringToTest).find())
{
// Match
}
For string 1 I DON'T want to match, because RQ:3,4 is associated with the G3 section, not G2, and I want string 2 to match, as RQ:3,4 is associated with G2 section.
The problem with the current regex is that it's searching too far and reaching the RQ:3,4 eventually in case 1 even though I don't want to consider past the G2 section.
It's also possible that the stringToTest might be (just one section):
G2:7JAPjGdnGy8jxR8[RQ:3,4]
The strings 7JAPjGdnGy8jxR8 and jRo6pN8ZW9aglYz are variable length hashes.
Can anyone help me with the correct regex to use, to start looking at G2 for RQ:3,4 but stopping if it reaches the end of the string or -G (the start of the next section).
You may use this regex with a negative lookahead in between:
G2:(?:(?!G\d+:)\S)*RQ:3,4
RegEx Demo
RegEx Details:
G2:: Match literal text G2:
(?: Start a non-capture group
(?!G\d+:): Assert that we don't have a G<digit>: ahead of us
\S: Match a non-whitespace character
)*: End non-capture group. Match 0 or more of this
RQ:3,4: Match literal text RQ:3,4
In Java use this regex:
String re = "G2:(?:(?!G\\d+:)\\S)*RQ:3,4";
The problem is that \S matches any non-whitespace char and the regex engine parses the text from left to right. Once it finds G2: it grabs all non-whitespace chars to the right (since \S+ is a greedy subpattern) and then backtracks to find the rightmost occurrence of RQ:3,4.
In a general case, you may use
String regex = "G2:(?:(?!-G)\\S)*RQ:3,4";
See the regex demo. (?:(?!-G)\S)* is a tempered greedy token that will match 0+ occurrences of a non-whitespace char that does not start a -G substring.
If the hyphen is only possible in front of the next section, you may subtract - from \S:
String regex = "G2:[^\\s-]*RQ:3,4"; // using a negated character class
String regex = "G2:[\\S&&[^-]]*RQ:3,4"; // using character class subtraction
See this regex demo. [^\\s-]* will match 0 or more chars other than whitespace and -.
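To compare the original pattern with two of the suggested fixes, here is a small sketch run against the strings from the question. The class name SectionMatch, its constants, and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

class SectionMatch {
    // Original pattern: \S+ runs past the -G3 boundary.
    static final Pattern GREEDY = Pattern.compile("G2:\\S+RQ:3,4");
    // Tempered greedy token: each char must not start a "-G" substring.
    static final Pattern TEMPERED = Pattern.compile("G2:(?:(?!-G)\\S)*RQ:3,4");
    // Negated class: assumes a hyphen only appears before the next section.
    static final Pattern NEGATED = Pattern.compile("G2:[^\\s-]*RQ:3,4");

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String first = "G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]";
        String second = "G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]";
        System.out.println(found(GREEDY, first));    // true (the unwanted match)
        System.out.println(found(TEMPERED, first));  // false
        System.out.println(found(TEMPERED, second)); // true
        System.out.println(found(NEGATED, first));   // false
    }
}
```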
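Try using [^\[] instead of \S in this regex: G2:[^\[]*\[RQ:3,4
[^\[] means any character but [ (the inner bracket is escaped because Java treats an unescaped [ inside a character class as the start of a nested class)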
Demo
(considering that strings like this: G2:7JAP[jGd]nGy8[]R8[RQ:3,4] are not possible)

What exactly does .*? do in regex? ".*?([a-m/]*).*" [duplicate]

For ".*?([a-m/]*).*" matching the string "fall/2005", I thought the ".*" will match any character 0 or more times. However, since there is a ? following .*, it only matches for 0 or 1 repetitions. So I thought .*? will match 'f' but I'm wrong.
What is wrong in my logic?
The ? here acts as a 'modifier' if I can call it like that and makes .* match the least possible match (termed 'lazy') until the next match in the pattern.
In fall/2005, the first .*? will match up to the first match in ([a-m/]*), which is just before f. Hence, .*? matches 0 characters so that ([a-m/]*) will match fall/ and since ([a-m/]*) cannot match anymore, the next part of the pattern .* matches what's left in the string, meaning 2005.
In contrast, with .*([a-m/]*).* the first .* matches as much as possible (meaning the whole string) and would backtrack only if the other terms needed it to in order to match. But since the other quantifiers can match 0 characters as well, .* alone matches the whole string (termed 'greedy').
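The difference can be observed directly in Java by printing group 1 for both patterns. This is only an illustrative sketch; the class name LazyVsGreedy and the helper group1 are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    /** Returns group 1 when the whole input matches the regex, otherwise null. */
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Lazy prefix: group 1 gets "fall/".
        System.out.println("[" + group1(".*?([a-m/]*).*", "fall/2005") + "]");
        // Greedy prefix: .* takes everything, group 1 is empty.
        System.out.println("[" + group1(".*([a-m/]*).*", "fall/2005") + "]");
    }
}
```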
Maybe a different example will help.
.*ab
In:
aaababaaabab
Here, .* will match as many characters as possible and then try to match ab. Thus, .* will match aaababaaab and the remainder will be matched by ab.
.*?ab
In:
aaababaaabab
Here, .*? will match as little as possible until it can match ab in that regex. The first occurrence of ab is here:
aaababaaabab
  ^^
And so, .*? matches aa while ab will match ab.
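A small Java sketch of the same comparison, with a capture group added around the quantified part so that what it consumed can be printed. The class name PrefixBeforeAb and the helper prefix are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixBeforeAb {
    /** Returns what group 1 captured on the first match, or null if nothing matched. */
    static String prefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "aaababaaabab";
        System.out.println(prefix("(.*)ab", input));  // aaababaaab (stops at the last ab)
        System.out.println(prefix("(.*?)ab", input)); // aa (stops at the first ab)
    }
}
```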
In regex:
? : Occurs no or one times, ? is short for {0,1}
*? : ? after a quantifier makes it a reluctant quantifier, it tries to find the smallest match.
Suppose you have an input string like this
this is stackoverflow
and you use the regex
.*
The output will be
this is stackoverflow
But if you use the regex
.*?(?= )
the output will be
this
So from the above example it is clear that .* gives you the whole string.
To prevent this and get only the characters before the first space, use the lazy .*? followed by something that marks where to stop, such as the lookahead (?= ) here. On its own, .*? matches the empty string.
For more practical knowledge you can check http://regexpal.com/
The ? (question mark) makes the quantifier lazy here, also called non-greedy.
Read Greedy vs. reluctant vs. possessive quantifiers
Your regular expression:
.*?        any character except newline \n (0 or more times, matching the least amount possible)
(          group and capture to \1:
[a-m/]*    any character of: 'a' to 'm', '/' (0 or more times, matching the most amount possible)
)          end of \1
.*         any character except newline \n (0 or more times, matching the most amount possible)

Why does my regex containing \d{1,} together with a negative lookahead still match, where it shouldn't?

I'm trying to match a coordinate pair in a String using a Regex in Java. I explicitly want to exclude strings using negative lookahead.
to be matched:
558,228
558,228,
558,228,589
558,228,A,B,C
NOT to be matched:
558,228,<Text>
The Regex ^558,228(?!,<).* does the job, while ^\d{1,},\d{1,}(?!,<).* doesn't. It's the same regex with the metacharacter \d instead of values. Any ideas why?
The reason is the .* part at the end. It matches everything that wasn't matched earlier.
In combination with \d{1,}, which is allowed to match fewer than 3 digits, it will go like this:
^\d{1,},\d{1,}(?!,<) will match 558,22 and .* will match the remaining part 8,<Text>.
The problem is the \d{1,} part in combination with the .* at the end.
In your case
558,228,<Text>
The ^\d{1,},\d{1,}(?!,<) matches "558,22" and the .* matches the rest "8,<Text>"
You can solve this using the possessive quantifier ++
^\d+,\d++(?!,<)(.*)
See it here online on Regexr
\d++ is a seldom used possessive quantifier, which is here useful. ++ means match at least once as many as you can and do not backtrack. That means it will not give back the digits once it has found them.
Java Quantifier tutorial
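To see both behaviours side by side, here is a sketch that runs the backtracking and the possessive pattern against the inputs from the question. The class name CoordinateFilter, its constants, and the helper accepts are hypothetical.

```java
import java.util.regex.Pattern;

class CoordinateFilter {
    // \d{1,} may give back a digit, which lets the lookahead succeed one position earlier.
    static final Pattern BACKTRACKING = Pattern.compile("^\\d{1,},\\d{1,}(?!,<).*");
    // \d++ never gives digits back, so the lookahead is only tested after the full number.
    static final Pattern POSSESSIVE = Pattern.compile("^\\d+,\\d++(?!,<)(.*)");

    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = { "558,228", "558,228,", "558,228,589", "558,228,A,B,C", "558,228,<Text>" };
        for (String input : inputs) {
            System.out.println(input
                    + " backtracking=" + accepts(BACKTRACKING, input)
                    + " possessive=" + accepts(POSSESSIVE, input));
        }
    }
}
```

Only the last input differs: the backtracking pattern accepts it, while the possessive one rejects it.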
