Why doesn't [[a-z]*&&[^a]] catch "bc", but "b"? - java

Ok, so I have tried to become more familiar with the intersection in regex (&&).
On the java.util.regex.Pattern page all the regex constructs are explained, and && is only ever used next to a range (like [a-z&&[^e]]). But I tried to use it like this: [[a-z]*&&[^a]]. To me it seemed logical that this would match all lowercase strings except the string "a", but instead it seems to be equivalent to [a-z&&[^a]].
So the actual question is: Where did the * operator go? How does this only catch single character strings?

I think your approach is wrong to use an intersection: To match all lowercase strings except "a":
^(?!a$)[a-z]+$
And you can drop the wrapping ^ and $ when calling matches():
if (input.matches("(?!a$)[a-z]+")) {
// it's an all-lowercase string, but not "a"
}
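Of course you don't need regex, although the alternative is a little long-winded:
if (input.equals(input.toLowerCase()) && !input.equals("a"))
It is easier to read, but note that it is not strictly equivalent: it also accepts strings that contain digits or punctuation, which the regex rejects.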
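A minimal sketch that runs both checks side by side. The class and method names are made up for illustration; the last input shows where the two checks disagree.

```java
final class LowercaseNotADemo {
    // Regex version: one or more of a-z, but not exactly "a".
    // matches() anchors the pattern at both ends, so ^ and $ are not needed.
    static boolean regexCheck(String input) {
        return input.matches("(?!a$)[a-z]+");
    }

    // Non-regex version: "contains no uppercase letters and is not the string a".
    static boolean plainCheck(String input) {
        return input.equals(input.toLowerCase()) && !input.equals("a");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "abc", "b1"}) {
            System.out.println(s + " -> regex=" + regexCheck(s) + ", plain=" + plainCheck(s));
        }
    }
}
```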

Inside a character class (marked by []) the * character has no special meaning. It simply represents the character itself.
So the regular expression
[[a-z]*&&[^a]]
allows exactly one character being one of the following:
b, c, d, ..., z, *
The [a-z] and the following * are unioned, and the resulting character class is intersected with [^a] which simply removes the a character.
Valid strings are (for example):
b
*
c
But
a
is not, as well as each string that contains more than one character.
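The behaviour described above can be checked with a short sketch (the class and method names are only for illustration):

```java
import java.util.regex.Pattern;

final class CharClassStarDemo {
    // The pattern from the question: a single character class, no quantifier.
    private static final Pattern CLASS_ONLY = Pattern.compile("[[a-z]*&&[^a]]");

    // True only if the whole input is exactly one character from the class.
    static boolean matchesClass(String input) {
        return CLASS_ONLY.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "*", "c", "a", "bc"}) {
            System.out.println("\"" + s + "\" -> " + matchesClass(s));
        }
    }
}
```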
Now to the solution for what you want. You want to have strings (allowing more than one character, I assume) that could also contain the letter 'a' but not the string "a" alone. The easiest is a group that does this distinction:
(?!a$)[a-z]*
The group (?!a$) is called a zero-width negative lookahead. Zero-width means that the characters it looks at are not consumed, and negative means that the match fails if the lookahead pattern is found. The '$' anchors the lookahead to the end of the input, so only the string "a" itself is rejected. Without it, every word beginning with 'a' would be rejected too.

Character class intersection is supported in Java. The problem is that inside a character class, * loses its special meaning and the literal star "*" will be matched instead. Your regex should be:
[a-z&&[^a]]*
Now it'll match all characters in the range "a-z" except the "a" character.
Example:
Pattern p = Pattern.compile("[a-z&&[^a]]");
Matcher m = p.matcher("a");
System.out.println(m.matches()); // false

Try using * outside of the class:
[[a-z]&&[^a]]*
The intersection of two character classes gives you another character class.
And as said in other answers, * is not a quantifier inside a class. So use it outside.
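One thing worth checking is that moving the * outside the class excludes every string that contains an 'a', not just the string "a" itself. A small sketch comparing it with the lookahead version (the class and method names are illustrative only):

```java
final class IntersectionVsLookaheadDemo {
    // Zero or more characters from b..z: any 'a' anywhere makes the match fail.
    static boolean intersection(String input) {
        return input.matches("[a-z&&[^a]]*");
    }

    // One or more characters from a..z, rejecting only the exact string "a".
    static boolean lookahead(String input) {
        return input.matches("(?!a$)[a-z]+");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "bc", "abc", ""}) {
            System.out.println("\"" + s + "\" -> intersection=" + intersection(s)
                    + ", lookahead=" + lookahead(s));
        }
    }
}
```

Which of the two is right depends on whether strings such as "abc" should be accepted.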

Related

Can we replace a specific part of a string literal using any predefined function and regex

I want to replace "&" with a random word "$d" in a given sentence.
Can we replace only those words which start with & and are followed by a single character and a space?
Example:-
Input:-
Two literals are &a and &b and also check &abc and &bac here.
Output:-
Two literals are $da and $db and also check &abc and &bac here.
In the input above, the only words that should be affected are &a and &b, because they start with & and are followed by a single character and a space. Only the '&' in each of them should be replaced, not the complete word.
In the case of the replaceAll() function, it replaces the entire word when I used regex:-
String str="Two literals are &a and &b and also check &abc and &bac here.";
str = str.replaceAll("\\&[a-zA-Z]{1}\\s", "\\$d");
System.out.println(str);
//output for this:-Two literals are $d and $d and also check &abc and &bac here.
//expected output:-Two literals are $da and $db and also check &abc and &bac here.
The correct code for this would be
str.replaceAll("&([a-zA-Z]\\s)", "\\$d$1")
This is an example of backreferencing a captured group in the replacement string.
Essentially, the part inside the parentheses, ([a-zA-Z]\\s), matches a single letter and a whitespace character. The value of this match can be referenced with $1 since it is capturing group 1.
So we replace &(a ) with $d(a ) (brackets here to demonstrate what is captured). Credit to u/rzwitserloot for reminding me that OP wants $ not &.
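A runnable sketch of that replacement, using the sentence from the question (the class and method names are made up for illustration):

```java
final class BackrefReplaceDemo {
    // Replace '&' with the literal text "$d", but only when it is followed by
    // exactly one letter and a whitespace character. Group 1 is put back via $1.
    // The "\\$" in the replacement is an escaped, literal dollar sign.
    static String replaceSingleLetterRefs(String input) {
        return input.replaceAll("&([a-zA-Z]\\s)", "\\$d$1");
    }

    public static void main(String[] args) {
        String str = "Two literals are &a and &b and also check &abc and &bac here.";
        System.out.println(replaceSingleLetterRefs(str));
    }
}
```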
You presumably want a concept called look-ahead: You can match on things being there without 'consuming' it. You can even match on things NOT being there. That's what you want here: Match &[a-z], but only if looking ahead past that, we do NOT see another letter:
for (String test : List.of("Two literals are &a and &bcd", "A literal is &a", "How about &a?")) {
    // Replace on the loop variable, not on an unrelated string.
    System.out.println(test.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d"));
}
Perhaps instead you want the single letter to end on a word break (i.e. &z00 should NOT turn into $dz00, even though there is no letter after the z). Then I suggest:
"&(?=[a-zA-Z]\\b)"
That's a lot simpler to read!
A few notes:
(?=x) is 'positive lookahead'. It doesn't itself match anything but makes the match fail if x is not immediately following the match.
(?!x) is 'negative lookahead'. It doesn't itself match anything but makes the match fail if x is immediately following the match.
$ has special meaning in the replacement part so we need to escape it.
\\b is regexpese for 'word break': Doesn't match any characters, but fails if we aren't on a 'word break'. Spaces, dots, end-of-input, end-of-line, a dash, an ampersand - many things are word breaks.
We don't want to match those letters because if we do, they would be replaced.
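To see how the two lookahead variants differ, here is a small sketch that applies both to the same inputs. The class and method names are hypothetical; the patterns are the ones given above.

```java
final class LookaheadReplaceDemo {
    // '&' followed by one letter that is NOT followed by another letter.
    static String notFollowedByLetter(String input) {
        return input.replaceAll("&(?=[a-zA-Z](?![a-zA-Z]))", "\\$d");
    }

    // '&' followed by one letter that sits directly before a word boundary.
    static String atWordBoundary(String input) {
        return input.replaceAll("&(?=[a-zA-Z]\\b)", "\\$d");
    }

    public static void main(String[] args) {
        String[] tests = {"Two literals are &a and &bcd", "A literal is &a", "How about &a?", "&z00"};
        for (String test : tests) {
            System.out.println(notFollowedByLetter(test) + " | " + atWordBoundary(test));
        }
    }
}
```

The two only disagree on inputs like &z00, where the letter is followed by a digit: a digit is not a letter, but it is a word character, so there is no word boundary after the z.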

Understanding regex in java

My program works how I want it to but I stumbled upon something that I don't understand.
String problem = "4 - 2";
problem = problem.replaceAll("[^-?+?0-9]+", " ");
System.out.println(Arrays.asList(problem.trim().split(" ")));
prints [4, -, 2]
but
String problem = "4 - 2";
problem = problem.replaceAll("[^+?-?0-9]+", " ");
System.out.println(Arrays.asList(problem.trim().split(" ")));
doesn't even do anything with the minus sign and prints [4, 2]
Why does it do that, it seems like both should work.
The hyphen has a special meaning inside a character class: it is used to define a character range (like a-z or 0-9), except when:
it is at the start of the character class or immediately after the negation character ^
it is escaped with a backslash
it is at the end of the character class
with some regex engines, when it comes after a shorthand character class like \w, \s, \d, \p{thing}, ... (in that case the situation isn't ambiguous, since it can't be a range)
In the first example, it is seen as a literal hyphen (since it is at the beginning).
In your second example, I assume that ?-? defines a range between ? and ? (that is nothing more than the character ?)
Note: ? doesn't have a special meaning inside a character class (it is no longer a quantifier but a simple literal character).
If you want to match a literal - in the middle of a [ ] class, escape it as \-. In the first case, the - comes immediately after the negating ^, so it is already a literal hyphen and there is nothing to escape. In the second case, ?-? is read as a range from ? to ?, which makes the regular expression behave in a way you did not expect.
PS: To escape in Java, you need \\ instead of \.
In the second example, +?-? means "a plus sign, or any chars between ? and ?, inclusive". Of course, that means just ?, so the whole regex is equivalent to [^+?0-9]+.
The only time within a character class (between the square brackets) that - doesn't mean "between, inclusive" is at the start of the character class, or immediately following a ^ that starts it, or at the end of the character class, or when it's escaped (\-).
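A short sketch that reproduces both results from the question, plus the escaped form (the class and method names are illustrative only):

```java
import java.util.Arrays;
import java.util.List;

final class HyphenRangeDemo {
    // Replace every run of characters matched by classRegex with one space,
    // then split the result on single spaces, as in the question.
    static List<String> tokens(String classRegex, String problem) {
        String cleaned = problem.replaceAll(classRegex, " ");
        return Arrays.asList(cleaned.trim().split(" "));
    }

    public static void main(String[] args) {
        // '-' right after '^' is a literal hyphen, so the minus sign is kept.
        System.out.println(tokens("[^-?+?0-9]+", "4 - 2"));
        // '?-?' is a range from '?' to '?', so '-' is not in the class and gets replaced.
        System.out.println(tokens("[^+?-?0-9]+", "4 - 2"));
        // Escaping the hyphen makes it literal regardless of position.
        System.out.println(tokens("[^+\\-0-9]+", "4 - 2"));
    }
}
```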

Why does string.matches("[+-*/]") throw a PatternSyntaxException?

I have this code:
public static void main(String[] args) {
String et1 = "test";
String et2 = "test";
et1.matches("[-+*/]"); //works fine
et2.matches("[+-*/]"); //java.util.regex.PatternSyntaxException, why?
}
Is it because '-' is an escape character? But then why does it work fine if '-' is swapped with '+'?
It is because - is used to define a range of characters in a character class. Since + comes after * in the ASCII table, the range +-* makes no sense, and you get an error.
To have a literal - in the middle of a character class, you must escape it. There is no problem if the - is at the beginning or at the end of the class because that is unambiguous.
Another situation where you don't need to escape the - is when it follows a character class shortcut, for example:
[\\d-abc]
(Other regex engines like PCRE allow the same when the character class shortcut is placed after the hyphen, as in [abc-\d], but Java doesn't seem to allow this.)
- inside a character class (the [xxx]) is used to define a range, for example: [a-z] for all lower case characters. If you want to actually mean "dash", it has to be in first or last position. I generally place it first to avoid any confusions.
Alternatively you can escape it: [+\\-*/].
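A minimal sketch that checks which of these classes compile. The helper name compiles is made up for illustration; Pattern.compile and PatternSyntaxException are the standard java.util.regex API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class HyphenSyntaxDemo {
    // Returns true if the regex compiles, false if it is rejected as invalid.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[-+*/]"));    // hyphen first: literal
        System.out.println(compiles("[+*/-]"));    // hyphen last: literal
        System.out.println(compiles("[+\\-*/]"));  // hyphen escaped: literal
        System.out.println(compiles("[+-*/]"));    // range from '+' down to '*': invalid
        System.out.println("-".matches("[+\\-*/]"));
    }
}
```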
Just FYI, the Java regular expression meta characters are defined here:
The metacharacters supported by this API are: <([{\^-=$!|]})?*+.>
As a general rule, to save myself from regexp debugging headaches, if I want to use any of these characters as a literal then I precede them with a \ (Or \\ inside of a Java String expression).
Either:
et2.matches("[\\+\\-\\*/]");
Or:
et2.matches("[\\-\\+\\*/]");
Will work regardless of order.
I think you should use: [\-\+\*/]
This is because '-' defines a range inside a character class, e.g. [a-d] means a, b, c, d.

How to make a regular expression that matches tokens with delimiters and separators?

I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can be either 'D' or 'E' in capital letters, and only one of these letters, once.
The two numbers you see must always be 2-digit decimal numbers, not 1 or 3 digits.
The string must start and end with '<' and '>' and contain '-' to separate the parts within.
The message in the middle ('hello') can be any characters but must not be more than 99 characters in length. It can contain white space.
Also, this pattern will be repeated, so the expression needs to recognise the individual patterns within a long string of them and ensure each follows this structure.
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*) which sees anything after it as part of any character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making it non-greedy or negation:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
You don't need to wrap -, < and > in [ ].
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.
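A sketch of how the lazy version could be used to pull each token out of a longer string with Matcher.find(). The class name, method name and sample input are assumptions for illustration; in particular the sample tokens end in `-NN>` without a trailing dash, which is what the patterns in these answers expect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenMatchDemo {
    // <, D or E, -, two digits, -, 1 to 99 characters (lazy), -, two digits, >
    private static final Pattern TOKEN =
            Pattern.compile("<([DE])-(\\d{2})-(.{1,99}?)-(\\d{2})>");

    // Collects the message part (group 3) of every token found in the input.
    static List<String> messages(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(messages("<D-05-hello-87><E-12-good bye-01>"));
        System.out.println(messages("<X-05-hello-87>")); // wrong letter: no match
    }
}
```

Note that find() silently skips text that does not fit the pattern, so validating that the whole string consists only of well-formed tokens would need an extra check.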

Using Scanner.useDelimiter() in Java to isolate tokens in an expression

I am trying to isolate the words, brackets and => and <=> from the following input:
(<=>A B) OR (C AND D) AND(A AND C)
So far I've managed to isolate just the words (see Scanner#useDelimiter()):
sc.useDelimiter("[^a-zA-Z]");
Upon using:
sc.useDelimiter("[\\s+a-zA-Z]");
the output is just the brackets,
which is not what I want: I also want tokens such as AND and ).
How do I do that? Doing \\s+ gives the same result.
Also, how is a delimiter different from regex? I'm familiar with regex in PHP. Is the notation used the same?
Output I want:
(
<=>
A
(and so on)
You need a delimiting regex that can be zero width (because you have adjacent terms), so look-arounds are the only option. Try this:
sc.useDelimiter("((?<=[()>])\\s*)|(\\s*\\b\\s*)");
This regex says "after a bracket or greater-than or at a word boundary, discarding spaces"
Also note that the character class [\\s+a-zA-Z] includes the + character - most characters lose any special regex meaning when inside a character class. It seems you were trying to say "one or more spaces", but that's not how you do that.
Inside [] the ^ means 'not', so the first regex, [^a-zA-Z], says 'give me everything that's not a-z or A-Z'
The second regex, [\\s+a-zA-Z], says 'give me everything that is space, +, a-z or A-Z'. Note that "+" is a literal plus sign when in a character class.
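As an alternative to a delimiter regex, the tokens themselves can be described and collected with Matcher.find(). This is a different technique from the Scanner approach in the answer above, shown only as a sketch; the class name, method name and token pattern are assumptions based on the input in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpressionTokenizerDemo {
    // A token is "<=>", "=>", a single bracket, or a run of letters.
    // "<=>" is listed before "=>" so the longer operator is tried first.
    private static final Pattern TOKEN = Pattern.compile("<=>|=>|[()]|[A-Za-z]+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("(<=>A B) OR (C AND D) AND(A AND C)")) {
            System.out.println(token);
        }
    }
}
```

Anything that is not described by the token pattern, such as whitespace, is simply skipped.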
