Optional substring in capturing group

Optional substring in capturing group - java

I have a regex that correctly captures a slash followed by a number in a string. The capturing group portion of the regex looks like this:
\(\d)+\\??
(some digits after a slash up to, but not including, a question mark) and there is more to the regex before and after this capturing group. Now I want to also include in my capturing group a optional specific prefix (call it "abc_"):
The entire prefix (all four characters) must be there to be included in the captured group
If no prefix is present then the digit portion of the capturing group is still captured
if the prefix is partially there or some other prefix is there then the string does not match the regex.
Some examples:
abc_12345 is captured
12345 is captured
ab_12345 fails to match the regex
abc_ fails to match the regex
abcd_ fails to match the regex
How do I construct this?

If I understand you correctly, you want this:
((?:abc_)?\\d+)[?]?
The ?: operator transforms the group into a non-catching group. I don't understand the part with the partial prefix. If you allow any content in front of the regex, you cannot deny a certain optional prefix. You need to have a clear separator in front of the pattern, like a white space in order to deny a prefix.

Your regex does not even seem to work for the case you already described. It captures only one digit and not the full number. Also your escaping is inconsistent.
However, this should do what you intend to do:
((?:abc_)?\\d+)\\??
Your last requirement, that a different prefix should not be matched, can only be answered if you give use the preceding part of the regex. (If this capturing group is preceded by \w+ for instance, any prefix would match, but only the full and correct prefix would be captured)

Related

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may have a comma ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.match(regex, String), but I am not being able to find the correct regex.

Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
You can see this regular expression here

In your regex you have to escape the dot \. to match it literally and escape the \+ or else it would make the ? a possessive quantifier. To match 1+ digits you have to use a quantifier [0-9]+
For your example data, you could match 1+ digits followed by an optional part which matches either a dot or a comma at the start and at the end. If you want to match 1 time any character you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
Regex demo
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. Match any character (Use .+) to repeat 1+ times
Same as the first pattern

Regex to match first/last names with optional titles

I created the following regex (Java):
(Lord |Lady |Ser )?(Agatha|John)?([ ]??Cain)?
It's working fine except in one situation (and maybe others I didn't take into account during my tests):
As you can see, when you only have the family name, the regex is also taking the whitespace behind the word. I totally understand why, but I don't know how to fix it.
This regex is used to find persons into a big text file which represents the content of a book. And, of course, it must be compatible with my current working environment (Java).

You can use regex lookback to accomplish your goal.
\b(?<!\S)(?:(Lord|Lady|Ser)\s+)?(Agatha|John)?(?:\s*(?<=\b)(Cain))?(?<=\S)\b # regex101
It has these qualities which seem to match (possibly exceed) your criteria:
The regex match is forced to start with a non-whitespace character.
The first capture will be the title (or empty).
The second capture will be the first name (or empty).
The third capture will be the last name (or empty).
All matches have no leading or trailing whitespace.
Additionally, it will even match through line wraps (shown in additional text in the linked regex test sample).
Title, first, and last names are in singleton groups making additions to the match sets as simple as adding an additional alternation to their respective groups.
A trailing lookbehind insisting on the match ending with a non-whitespace was also added to avoid matching just "Lord " of an otherwise non-matching "Lord X".
A regex101 fiddle with example data is linked to the regex.

Replace multiple capture groups using regexp with java

I have this requirement - for an input string such as the one shown below
8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs
I would like to strip the matched word boundaries (where the matching pair is 8 or & or % etc) and will result in the following
This is really a test of repl%acing %mul%tiple matched 9pairs
This list of characters that is used for the pairs can vary e.g. 8,9,%,# etc and only the words matching the start and end with each type will be stripped of those characters, with the same character embedded in the word remaining where it is.
Using Java I can do a pattern as \\b8([^\\s]*)8\\b and replacement as $1, to capture and replace all occurrences of 8...8, but how do I do this for all the types of pairs?
I can provide a pattern such as \\b8([^\\s]*)8\\b|\\b9([^\\s]*)9\\b .. and so on that will match all types of matching pairs *8,9,..), but how do I specify a 'variable' replacement group -
e.g. if the match is 9...9, the the replacement should be $2.
I can of course run it through multiple of these, each replacing a specific type of pair, but I am wondering if there is a more elegant way.
Or is there a completely different way of approaching this problem?
Thanks.

You could use the below regex and then replace the matched characters by the characters present inside the group index 2.
(?<!\S)(\S)(\S+)\1(?=\s|$)
OR
(?<!\S)(\S)(\S*)\1(?=\s|$)
Java regex would be,
(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)
DEMO
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)", "$2"));
Output:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
Explanation:
(?<!\\S) Negative lookbehind, asserts that the match wouldn't be preceded by a non-space character.
(\\S) Captures the first non-space character and stores it into group index 1.
(\\S+) Captures one or more non-space characters.
\\1 Refers to the character inside first captured group.
(?=\\s|$) And the match must be followed by a space or end of the line anchor.
This makes sure that the first character and last character of the string must be the same. If so, then it replaces the whole match by the characters which are present inside the group index 2.
For this specific case, you could modify the above regex as,
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)([89&#%])(\\S+)\\1(?=\\s|$)", "$2"));
DEMO

(?<![a-zA-Z])[8&#%9](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[8&#%9](?![a-zA-Z])
Try this.Replace with $1 or \1.See demo.
https://regex101.com/r/qB0jV1/15
(?<![a-zA-Z])[^a-zA-Z](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[^a-zA-Z](?![a-zA-Z])
Use this if you have many delimiters.

Only capture digits instead of the other text in input for a regex like below

Currently this regular expression:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$
captures 418—FINAL in group number 2 for an input like:
String text="H.B. 418—FINAL VERSION";
How do I change this regular expression to only capture the number (digits) of "418" in group2 ?
EDIT:
I'd still like to capture "H.B." in a preceding group.

Just change the boundaries of the second group to only include the digits. To also save the "H.B." part, add paranthesis around that part too:
^(?:(\\S+)\\s+)*?(\\d+)\\S+\\s+(?:No\\.\\s+)?(\\S+)(?:\\s+\\(.*?\\))?$

I'm not entirely sure what your exact requirements are (your regex is looking for an optional "No." but you haven't given any examples). But this will work on the example you give:
^(?:\\S+\\s+)*?(\\S+)\\s+(?:No\\.\\s+)?(\\d+).*(?:\\s+\\(.*?\\))?$
assuming you don't need the text following the digits. That is, just change the second \S to \d. I also added .* after this to match any remaining characters following the digits up to an optional parenthesized part (without capturing them, but you can capture them if you want to).

regex for specific digit prefix

I am trying to have the following regx rule, but couldn't find solution.
I am sorry if I didn't make it clear. I want for each rule different regx. I am using Java.
rule should fail for all digit inputs start with prefix '1900' or '1901'.
(190011 - fail, 190111 - fail, 41900 - success...)
rule should success for all digit inputs with the prefix '*'
different regex for each rule (I am not looking for the combination of both of them together)

Is this RE fitting the purpose ? :
'\A(\*|(?!190[01])).*'
\A means 'the beginning of string' . I think it's the same in Java's regexes
.
EDIT
\A : "from the very beginning of the string ....". In Python (which is what I know, in fact) this can be omitted if we use the function match() that always analyzes from the very beginning, instead of search() that search everywhere in a string. If you want the regex able to analyze lines from the very beginning of each line, this must be replaced by ^
(...|...) : ".... there must be one of the two following options : ....."
\* : "...the first option is one character only, a star; ..." . As a star is special character meaning 'zero, one or more times what is before' in regex's strings, it must be escaped to strictly mean 'a star' only.
(?!190[01]) : "... the second option isn't a pattern that must be found and possibly catched but a pattern that must be absent (still after the very beginning). ...". The two characters ?! are what says 'there must not be the following characters'. The pattern not to be found is 4 integer characters long, '1900' or '1901' .
(?!.......) is a negative lookahead assertion. All kinds of assertion begins with (? : the parenthese invalidates the habitual meaning of ? , that's why all assertions are always written with parentheses.
If \* have matched, one character have been consumed. On the contrary, if the assertion is verified, the corresponding 4 first characters of the string haven't been consumed: the regex motor has gone through the analysed string until the 4th character to verify them, and then it has come back to its initial position, that is to say, presently, at the very beginning of the string.
If you want the bi-optional part (...|...) not to be a capturing group, you will write ?: just after the first paren, then '\A(?:\*|(?!190[01])).*'
.* : After the beginning pattern (one star catched/matched, or an assertion verified) the regex motor goes and catch all the characters until the end of the line. If the string has newlines and you want the regex to catch all the characters until the end of the string, and not only of a line, you will specify that . must match the newlines too (in Python it is with re.MULTILINE), or you will replace .* with (.|\r|\n)*
I finally understand that you apparently want to catch strings composed of digits characters. If so the RE must be changed to '\A(?:\*|(?!190[01]))\d*' . This RE matches with empty strings. If you want no-match with empty strings, put \d+ in place of \d* . If you want that only strings with at least one digit, even after the star when it begins with a star, match, then do '\A(?:\*|(?!190[01]))(?=\d)\d*'

For the first rule, you should use a combo regex with two captures, one to capture the 1900/1901-prefixed case, and one the capture the rest. Then you can decide whether the string should succeed or fail by examining the two captures:
(190[01]\d+)|(\d+)
Or just a simple 190[01]\d+ and negate your logic.
Regex's are not really very good at excluding something.
You may exclude a prefix using negative look-behind, but it won't work in this case because the prefix is itself a stream of digits.
You seem to be trying to exclude 1-900/901 phone numbers in the US. If the number of digits is definite, you can use a negative look-behind to exclude this prefix while matching the remaining exact number digits.
For the second rule, simply:
\*\d+

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Optional substring in capturing group - java

Related

How to match a string in this way?

Regex to match first/last names with optional titles

Replace multiple capture groups using regexp with java

Only capture digits instead of the other text in input for a regex like below

regex for specific digit prefix

Categories

Resources