Programming error leads to inexplicable regex

For a test, I created the following regex by mistake:
|(\\w+)|
I was puzzled that this regex really works and I can't explain the result:
public static void main(String[] args) {
    String toReplace = "Hey I'm a lovely String an I'm giving my |value| worth!";
    // String replacement1 = "2 cent"; // I planned to replace |value| with 2 cent
    String replacement1 = "#"; // to produce a clearer output
    String regex = "|(\\w+)|"; // I forgot to escape the |
    String result = toReplace.replaceAll(regex, replacement1);
    System.out.println(result);
}
The result is:
#H#e#y# #I#'#m# #a# #l#o#v#e#l#y# #S#t#r#i#n#g# #a#n# #I#'#m# #g#i#v#i#n#g# #m#y# #|#v#a#l#u#e#|# #w#o#r#t#h#!#
My idea so far is that Java replaces the "nothing" between the characters, but why not the characters themselves?
\\w+ should match the 'H'.
I would expect every character to be replaced by three # signs, or by just one, but the fact that the characters are not replaced at all puzzles me.

You're right, this regex matches the empty string between each character.
Since the first alternative (the empty string left of |) matches, the rest of the pattern isn't even tried, so the \w+ isn't even reached by the matching engine. You could have written any (valid) pattern to the right of that first |, it wouldn't ever be reached.
The engine works the following way: It has a current position cursor in the subject string. It tries to match starting at that current position. Since your regex is a match, it will perform the replacement at this point, and then move the current position cursor after the found match.
But since the match is zero-width, it simply advances to the next character, because not doing so would result in an infinite loop.
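The behavior described above can be reproduced with a short sketch. It assumes nothing beyond the JDK; the class name, method names and the short sample inputs are made up for illustration. It contrasts the unescaped pattern, whose empty first alternative matches at every position, with the escaped pattern that was presumably intended.

```java
// Hypothetical demo class, for illustration only.
class EmptyAlternativeDemo {
    // Unescaped pipes: the first alternative is empty, so it matches
    // (with zero width) at every position and \w+ is never tried.
    static String withUnescapedPipes(String input) {
        return input.replaceAll("|(\\w+)|", "#");
    }

    // Escaped pipes: they are literal characters, so only |word| is matched.
    static String withEscapedPipes(String input, String replacement) {
        return input.replaceAll("\\|(\\w+)\\|", replacement);
    }

    public static void main(String[] args) {
        System.out.println(withUnescapedPipes("abc"));                       // #a#b#c#
        System.out.println(withEscapedPipes("my |value| worth!", "2 cent")); // my 2 cent worth!
    }
}
```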

Related

Java Regex with "Joker" characters

I'm trying to write a regex that validates an input field.
What I call "joker" chars are '?' and '*'.
Here is my Java regex:
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alphanumeric characters (other than '?' and '*')
The '*' can only appear once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example:
abcd = OK
?bcd = OK
ab?? = OK
ab* = OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
 abcd = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost. Can you help me?

^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear twice or more:
\?*[a-zA-Z\d]\?*
which requires one character from the class [a-zA-Z\d], with any number of question marks (zero or more) on either side of it.
Then the regex matches \*?, which means zero or one asterisk character, at the end of the string.
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
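To check both patterns against the examples from the question, a small harness along these lines could be used. This is only a sketch: the class name and helper are hypothetical, and the samples are the ones listed in the question.

```java
import java.util.regex.Pattern;

// Hypothetical test harness, for illustration only.
class JokerPatternCheck {
    // First pattern from the answer above.
    static final Pattern FIRST =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");
    // Alternative pattern from the answer above.
    static final Pattern FASTER =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]){2}[a-zA-Z\\d?]*\\*?$");

    static boolean isValid(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd",
                            "*ab", "???", "ab cd", " abcd"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(FIRST, s)
                    + " / " + isValid(FASTER, s));
        }
    }
}
```

The first six samples are accepted by both patterns and the last four are rejected, which agrees with the OK / NOT OK list in the question.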

Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Both described and demonstrated at Regex101.
^ is the start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 word characters (note that \w also matches the underscore, not only alphanumerics)
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) asserts that the character just before the end is not whitespace (negative look-behind)
$ is the end of the String

Another way to solve this problem is the look-ahead mechanism (?=subregex). It is zero-length (it resets the regex cursor to the position it had before executing subregex), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...), so that the regex engine can move the cursor past the tested part (to "consume" it) and go on to test whatever follows. For that to work, conditionN must match precisely the section we want to "consume" (the earlier conditions don't have that limitation; they can match substrings of any length, for example just the first few characters).
So now we need to think about what our conditions are.
We want to match only alphanumeric characters, ? and *, where * can appear (optionally) only at the end. We can write that as ^[a-zA-Z0-9?]*[*]?$. This also rules out whitespace, because we didn't include it among the accepted characters.
The second requirement is "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we like shorter regexes). Since that condition doesn't test the whole text but only some part of it, we can place it in a look-ahead.
The conditions above seem to cover everything we wanted, so we can combine them into a regex that looks like this:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
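As a rough check of the combined look-ahead regex, here is a minimal sketch; the class and method names are hypothetical and the inputs are taken from the question, plus one extra input ("a?b") added here for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical test harness, for illustration only.
class LookaheadJokerCheck {
    // Look-ahead: at least two alphanumerics somewhere in the text.
    // Consuming part: only alphanumerics and '?', then an optional final '*'.
    static final Pattern JOKER =
            Pattern.compile("^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$");

    static boolean isValid(String input) {
        return JOKER.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "ab?*", "??cd", "a?b", "*ab", "???", "ab cd", "a"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

With this pattern "a?b" is accepted, because the two alphanumerics counted by the look-ahead do not have to be adjacent.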

Regular expression to select strings with a character appearing on odd occurrences

I'm trying to create a regex that will select words in which the quote character appears an odd number of times. And I'm stuck...
Let's say I have these 4 strings :
hello'
pl'pl'op
'heger
qwe'rty
I should get this list in return :
hello'
'heger
qwe'rty
I'm running around in circles and I don't even know if it is possible to do that with a regex. I'm not very good at regex.
Should I just loop over the characters of each string, count the number of quotes, and do a modulo operation to check whether the count is odd?

Code
^(?!(?:\w*'\w*'\w*)+$)[\w']+$
As per the comments below my question, an improvement can be made by changing the non-capture group to an atomic group, as the following pattern demonstrates. This optimization is thanks to @Thefourthbird:
^(?!(?>\w*'\w*'\w*)+$)[\w']+$
Results
Input
hello'
pl'pl'op
'heger
qwe'rty
q'
q'q'
q'q'q'
q'q'q'q'
q'q'q'q'q'
q'q'q'q'q'q'
q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'q'
Output
Only matches are shown below
hello'
'heger
qwe'rty
q'
q'q'q'
q'q'q'q'q'
q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'q'
Explanation
^ Assert position at the start of the line
(?!(?:\w*'\w*'\w*)+$) Negative lookahead ensuring what follows doesn't match
(?:\w*'\w*'\w*)+ Match any combination of apostrophes and word characters where the apostrophe character appears exactly twice, one or more times (this means 2,4,6,8,10,... times)
$ Assert position at the end of the line
[\w']+ Match one or more word characters or apostrophes '
$ Assert position at the end of the line
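The pattern can be tried from Java roughly as follows. This is a sketch with hypothetical class and method names; it uses the atomic-group variant and some of the inputs from the Results section.

```java
import java.util.regex.Pattern;

// Hypothetical test harness, for illustration only.
class OddQuoteRegex {
    // The negative look-ahead rejects words whose quotes can be grouped in pairs.
    static final Pattern ODD_QUOTES =
            Pattern.compile("^(?!(?>\\w*'\\w*'\\w*)+$)[\\w']+$");

    static boolean matches(String word) {
        return ODD_QUOTES.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"hello'", "pl'pl'op", "'heger", "qwe'rty", "q'q'", "q'q'q'", "hello"};
        for (String w : words) {
            System.out.println(w + " -> " + matches(w));
        }
    }
}
```

One thing to be aware of: a word with no quote at all (such as "hello") also passes, because the look-ahead only rejects words containing at least one pair of quotes. If zero quotes should be excluded, an extra check would be needed.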

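You don't need a regular expression. Just check whether countMatches returns an odd number or not:
import org.apache.commons.lang3.StringUtils; // Apache Commons Lang

public class Main {
    public static void main(String[] args) {
        String check = "pl'pl'op";
        int count = StringUtils.countMatches(check, "'");
        System.out.println("Occurrences: " + count);
        System.out.println("Odd: " + (count % 2 == 1));
    }
}
Output:
Occurrences: 2
Odd: false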
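If pulling in Apache Commons Lang is not an option, the same count can be done with the JDK alone. The following is a sketch; the class and method names are hypothetical.

```java
// Hypothetical helper class, for illustration only (JDK only, no Commons Lang).
class QuoteParity {
    // Counts the apostrophes in the given string.
    static long countQuotes(String s) {
        return s.chars().filter(c -> c == '\'').count();
    }

    // True when the number of apostrophes is odd.
    static boolean hasOddQuotes(String s) {
        return countQuotes(s) % 2 == 1;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello'", "pl'pl'op", "'heger", "qwe'rty"}) {
            System.out.println(word + " -> " + hasOddQuotes(word));
        }
    }
}
```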

Try this:
([^']*'[^']*'[^']*)*[^']*'[^']*
The idea is to capture in the group an even (possibly 0) number of quotes and the text between them, and then match one more quote.

Regex matcher to handle a character or end of line

I would like to create a matching pattern for a situation like this
DOMAIN+("Y|A")?
I would like the matching options to be only
DOMAIN
DOMAINY
DOMAINA
but it seems like DOMAINX, DOMAINY, etc. are matching as well.

Yes, they are matching because you did not specify that the String needs to end there. DOMAIN(Y|A)? matches DOMAINX because it rightfully contains DOMAIN followed by nothing (which is accepted since ? allows 0 or 1 occurrence).
You can add this restriction by specifying $ at the end of the regular expression.
Sample code that shows the result of matches. In your full code, you probably want to compile a Pattern instead of doing it each time.
public static void main(String[] args) {
String regex = "DOMAIN(Y|A)?$";
System.out.println("DOMAIN".matches(regex)); // prints true
System.out.println("DOMAINX".matches(regex)); // prints false
System.out.println("DOMAINY".matches(regex)); // prints true
System.out.println("DOMAINA".matches(regex)); // prints true
}
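One detail worth keeping in mind: String.matches already requires the whole input to match, so in the sample above the $ is not what rejects DOMAINX. The anchor matters when searching with Matcher.find(). The sketch below (hypothetical class and method names) shows the difference.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class FindVersusMatches {
    // True if the regex matches anywhere inside the input.
    static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the regex matches the entire input, like String.matches.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(foundAnywhere("DOMAIN(Y|A)?", "DOMAINX"));  // true: "DOMAIN" is found inside
        System.out.println(foundAnywhere("DOMAIN(Y|A)?$", "DOMAINX")); // false: $ forbids the trailing X
        System.out.println(matchesWhole("DOMAIN(Y|A)?", "DOMAINX"));   // false: whole input must match
        System.out.println(foundAnywhere("DOMAIN(Y|A)?$", "DOMAINY")); // true
    }
}
```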

You could use word boundaries, \b, in order to prevent strings such as "DOMAINX" from being matched.
If you just want to handle cases where there are characters after the word, add \b to the end:
DOMAIN(?:Y|A)?\b
Otherwise, you could place \b around the expression to handle cases where there may be characters at the start/end:
\bDOMAIN(?:Y|A)?\b
I also made (?:Y|A) a non-capturing group and I removed the quotes.
However, as your title implies, if you only want to handle characters at the end of a line, use the $ anchor at the end of your expression:
DOMAIN(?:Y|A)?$
You may have to add the m (multi-line) flag so that the anchor matches at the start/end of a line rather than at the start/end of the string:
(?m)DOMAIN(?:Y|A)?$
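A short sketch of how the word-boundary and multi-line variants behave when searching inside longer text; the class name, helper name and sample inputs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class BoundaryAndMultilineDemo {
    // Collects every match of the regex found in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String words = "DOMAIN DOMAINX DOMAINY DOMAINA";
        String lines = "DOMAINY\nDOMAINX\nDOMAIN";
        System.out.println(findAll("\\bDOMAIN(?:Y|A)?\\b", words)); // [DOMAIN, DOMAINY, DOMAINA]
        System.out.println(findAll("(?m)DOMAIN(?:Y|A)?$", lines));  // [DOMAINY, DOMAIN]
        System.out.println(findAll("DOMAIN(?:Y|A)?$", lines));      // [DOMAIN]
    }
}
```

Without the (?m) flag, $ only matches at the very end of the input (or before a final line terminator), which is why the last call finds just the final line.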

You need this
DOMAIN(Y|A)?
If you need it to be a word in text you should anchor it with \b as Josh shows.
Your regex does the following:
DOMAIN+("Y|A")?
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the character string “DOMAI” literally (case sensitive) «DOMAI»
Match the character “N” literally (case sensitive) «N+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 1 «("Y|A")?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match this alternative (attempting the next alternative only if this one fails) «"Y»
Match the character string “"Y” literally (case sensitive) «"Y»
Or match this alternative (the entire group fails if this one fails to match) «A"»
Match the character string “A"” literally (case sensitive) «A"»

Removing every other character in a string using Java regex

I have this homework problem where I need to use regex to remove every other character in a string.
In one part, I have to delete characters at index 1,3,5,... I have done this as follows:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "$1"));
This prints 12345 which is what I want. Essentially I match two characters at a time, and replacing with the first character. I used group capturing to do this.
The problem is, I'm having trouble with the second part of the homework, where I need to delete characters at index 0,2,4,...
I have done the following:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll(".(.)", "$1"));
This prints abcd5, but the correct answer must be abcd. My regex is only incorrect if the input string length is odd. If it's even, then my regex works fine.
I think I'm really close to the answer, but I'm not sure how to fix it.

You are indeed very close to the answer: just make matching the second char optional.
String s = "1a2b3c4d5";
System.out.println(s.replaceAll(".(.)?", "$1"));
// prints "abcd"
This works because:
Regex is greedy by default, it will take the second character if it's there
When the input is of odd length, the second char won't be there at the last replacement, but you'd still match one char (i.e. last char in input)
You can still use backreferences in substitution even if the group fails to match
It will substitute in the empty string, not "null"
This is different from Matcher.group(int), which returns null for failed groups
References
regular-expressions.info/Optional
A closer look at the first part
Let's take a closer look at the first part of the homework:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "$1"));
// prints "12345"
Here you didn't have to use ? for the second char, but it "works" because even though you didn't match the last char, you didn't have to! The last char can remain unmatched, unreplaced, due to the problem specification.
Now suppose that we want to delete chars at index 1,3,5..., and put the chars at index 0,2,4... in brackets.
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).", "($1)"));
// prints "(1)(2)(3)(4)5"
A-ha!! Now you're experiencing the exact same problem with odd-length input! You couldn't match the last char with your regex, because your regex needs two chars, but there's only one char at the end for odd-length input!
The solution, again, is to make matching the second char optional:
String s = "1a2b3c4d5";
System.out.println(s.replaceAll("(.).?", "($1)"));
// prints "(1)(2)(3)(4)(5)"
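Both halves of the homework can be wrapped into small helpers and tried on odd-length and even-length input. This is a sketch; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EveryOtherChar {
    // Deletes the characters at index 0, 2, 4, ... (keeps the odd indices).
    static String dropEvenIndices(String s) {
        return s.replaceAll(".(.)?", "$1");
    }

    // Deletes the characters at index 1, 3, 5, ... (keeps the even indices).
    static String dropOddIndices(String s) {
        return s.replaceAll("(.).?", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dropEvenIndices("1a2b3c4d5")); // abcd
        System.out.println(dropEvenIndices("1a2b"));      // ab
        System.out.println(dropOddIndices("1a2b3c4d5"));  // 12345
        System.out.println(dropOddIndices("1a2b"));       // 12

        // A group that did not take part in the match is null in Matcher.group(int),
        // even though "$1" substitutes the empty string in replaceAll.
        Matcher m = Pattern.compile(".(.)?").matcher("5");
        System.out.println(m.matches() + " " + m.group(1)); // true null
    }
}
```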

my regex is only incorrect if the input string length is odd. if it's even, then my regex works fine.
Change your expression to .(.)? - the question mark makes the second character optional, which means it doesn't matter whether the input length is odd or even.

Your regex needs 2 chars to match, so fails on the final char.
This regex:
".(.{0,1})"
will make the second char optional, so it will match your final '5' as well.

What Java regular expression do I need to match this text?

I'm trying to match the following using a regular expression in Java - I have some data separated by the two characters 'ZZ'. Each record starts with 'ZZ' and finishes with 'ZZ' - I want to match a record with no ending 'ZZ' for example, I want to match the trailing 'ZZanychars' below (Note: the *'s are not included in the string - they're just marking the bit I want to match).
ZZanycharsZZZZanycharsZZ*ZZanychars*
But I don't want the following to match because the record has ended:
ZZanycharsZZZZanycharsZZZZanycharsZZ
EDIT: To clarify things - here are the 2 testcases I am using:
// This should match and in one of the groups should be 'ZZthree'
String testString1 = "ZZoneZZZZtwoZZZZthree";
// This should not match
String testString2 = "ZZoneZZZZtwoZZZZthreeZZ";
EDIT: Adding a third test:
// This should match and in one of the groups should be 'threeZee'
String testString3 = "ZZoneZZZZtwoZZZZthreeZee";

(Edited after the post of the 3rd example)
Try:
(?!ZZZ)ZZ((?!ZZ).)++$
Demo:
import java.util.regex.*;
public class Main {
public static void main(String[] args) {
String[] tests = {
"ZZoneZZZZtwoZZZZthree",
"ZZoneZZZZtwoZZZZthreeZZ",
"ZZoneZZZZtwoZZZZthreeZee"
};
Pattern p = Pattern.compile("(?!ZZZ)ZZ((?!ZZ).)++$");
for(String tst : tests) {
Matcher m = p.matcher(tst);
System.out.println(tst+" -> "+(m.find() ? m.group() : "no!"));
}
}
}

To match only the final, unterminated record:
(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$
The starting delimiter is two Z's, but there can be a third Z that's considered part of the data. The lookbehind ensures that you don't match a Z that's part of the previous record's ending delimiter (since an ending delimiter can not be preceded by a non-delimiter Z). However, this assumes there will never be empty records (or records containing only a single Z), which could lead to eight or more Z's in a row:
ZZabcZZZZdefZZZZZZZZxyz
If that were possible, I would forget about trying to match the final record by itself, and instead match all of them from the beginning:
(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)
The final, unterminated record is now captured in group #1.
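A sketch of how the "match everything from the beginning" pattern could be used from Java, reading the unterminated record from group 1. The class and method names are hypothetical, and Matcher.matches() is used here on the assumption that the whole input consists of records.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class UnterminatedRecordFinder {
    // Zero or more complete records, then one final record without a closing ZZ.
    private static final Pattern RECORDS =
            Pattern.compile("(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)");

    // Returns the trailing unterminated record, or null when every record is closed.
    static String trailingRecord(String input) {
        Matcher m = RECORDS.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(trailingRecord("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(trailingRecord("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(trailingRecord("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
        System.out.println(trailingRecord("ZZabcZZZZdefZZZZZZZZxyz"));  // ZZxyz
    }
}
```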

I'd suggest something like...
/ZZ(.*?)(ZZ|$)/
This will match:
ZZ — the literal string
(.*?) — anychars
(ZZ|$) — either another ZZ literal, or the end of the string

^ZZ.*(?<!ZZ)$
Assert position at the beginning of the string «^»
Match the characters “ZZ” literally «ZZ»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!ZZ)»
Match the characters “ZZ” literally «ZZ»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
Created with RegexBuddy

There's one tricky part to this: The ZZ being both the start token and the end token.
There's one start case (ZZ, not followed by another ZZ which would signify that the first ZZ was actually an end token), and two end cases (ZZ end of string, ZZ followed by ZZ). The goal is to match the start case and NOT either of the end cases.
To that end, here's what I suggest:
/ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)/
For string ZZfooZZZZbarZZbazZZ:
This will NOT match ZZfooZZ, a legitimate record: ZZ, not followed by ZZ, followed by any combination of characters (here "foo"), followed by ZZ, but that ZZ is followed by ZZ, which opens the next record.
The next part examined is the ZZ after foo. This fails because the ZZ cannot be followed by another ZZ, yet in this case it is. This is as we want because the ZZ right after foo does not start a new record anyway.
The ZZ right before bar is not followed by another ZZ, so it's a legitimate start of record. "bar" is consumed by the .*?. Then there is a ZZ, but it is NOT followed by another ZZ or the end of string, which means that the ZZbar token is no good.
(It COULD be interpreted by a human as ZZbarZZ with bazZZ not being valid, but in either case there's something wrong, so I just wrote the regex to consider the wrongly-formatted record to occur here)
So ZZbar will be caught/matched by the regex, as illegitimate.
The ZZ after the bar isn't followed by ZZ, is followed by baz, followed by a ZZ that fails the lookahead assertion stating it can't be followed by the end of the string. So ZZbazZZ is a legitimate record and is not captured in the regex.
One more case: For ZZfoo, the beginning ZZ is okay, the foo is captured, then the regex notes that it's the end of the string, and no ZZ has occurred. Thus, ZZfoo is captured as an illegitimate match.
Let me know if this doesn't make sense, so I can make it more clear.

How about trying to remove all matches for ZZallcharsZZ and what you have left is what you want.
ZZ.*?ZZ
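A minimal sketch of that removal approach, using the three test strings from the question; the class and method names are hypothetical. An empty result means there was no unterminated record.

```java
// Hypothetical helper class, for illustration only.
class RemoveTerminatedRecords {
    // Strips every complete ZZ...ZZ record; whatever remains is the unterminated tail.
    static String leftover(String input) {
        return input.replaceAll("ZZ.*?ZZ", "");
    }

    public static void main(String[] args) {
        System.out.println(leftover("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(leftover("ZZoneZZZZtwoZZZZthreeZZ"));  // (empty string)
        System.out.println(leftover("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
    }
}
```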
