Understanding regex in Java

My program works how I want it to but I stumbled upon something that I don't understand.
String problem = "4 - 2";
problem = problem.replaceAll("[^-?+?0-9]+", " ");
System.out.println(Arrays.asList(problem.trim().split(" ")));
prints [4, -, 2]
but
String problem = "4 - 2";
problem = problem.replaceAll("[^+?-?0-9]+", " ");
System.out.println(Arrays.asList(problem.trim().split(" ")));
doesn't even do anything with the minus sign and prints [4, 2]
Why does it do that? It seems like both should work.

The hyphen has a special meaning inside a character class: it is used to define a character range (like a-z or 0-9), except when:
it is at the start of the character class or immediately after the negation character ^
it is escaped with a backslash
it is at the end of the character class
with some regex engines, it comes after a shorthand character class like \w, \s, \d, \p{thing},... (in these cases the situation isn't ambiguous, since it can't be a range)
In the first example, it is seen as a literal hyphen (since it is at the beginning).
In your second example, I assume that ?-? defines a range between ? and ? (that is nothing more than the character ?)
Note: ? doesn't have a special meaning inside a character class (it is no longer a quantifier, just a literal character)

If you want to match a literal - inside [ and ], the safest option is to escape it: \-. In the first case, the ^ negates the character class and the - comes immediately after it, so the - is taken literally and there is nothing to escape. In the second case, ?-? is read as a range, which makes the regular expression behave in a way you did not expect.
PS: To escape in Java, you need \\ instead of \.

In the second example, +?-? means "a plus sign, or any chars between ? and ?, inclusive". Of course, that means just ?, so the whole regex is equivalent to [^+?0-9]+.
The only time within a character class (between the square brackets) that - doesn't mean "between, inclusive" is at the start of the character class, or immediately following a ^ that starts it, or at the end of the character class, or when it's escaped (\-).
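The rules above can be checked with a small sketch. The class and method names (HyphenDemo, tokens) are made up for illustration; the first two patterns are the ones from the question, and the last two show the escaped and the trailing hyphen.

```java
import java.util.Arrays;
import java.util.List;

class HyphenDemo {
    // Replaces every run of characters matched by the pattern with one space,
    // then splits the trimmed result on single spaces.
    static List<String> tokens(String pattern) {
        String cleaned = "4 - 2".replaceAll(pattern, " ");
        return Arrays.asList(cleaned.trim().split(" "));
    }

    public static void main(String[] args) {
        // Hyphen right after ^: literal, so the minus sign survives.
        System.out.println(tokens("[^-?+?0-9]+")); // [4, -, 2]
        // ?-? is a range from ? to ?, so the minus sign is not in the class.
        System.out.println(tokens("[^+?-?0-9]+")); // [4, 2]
        // Escaped hyphen: literal wherever it appears.
        System.out.println(tokens("[^+\\-0-9]+")); // [4, -, 2]
        // Hyphen at the end of the class: literal.
        System.out.println(tokens("[^+0-9-]+"));   // [4, -, 2]
    }
}
```

Escaping the hyphen is probably the most robust choice, because the pattern keeps working if someone later reorders the characters in the class.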

Related

Why doesn't [[a-z]*&&[^a]] catch "bc", but "b"?

Ok, so I have tried to become more familiar with the intersection in regex (&&).
On the java.util.regex.Pattern page all the regex constructs are explained, and && is only ever used next to a range (like [a-z&&[^e]]). But I tried to use it like this: [[a-z]*&&[^a]]. To me it seemed logical that this would match all lowercase strings except the string "a", but instead it seems to be equivalent to [a-z&&[^a]].
So the actual question is: Where did the * operator go? How does this only catch single character strings?
I think your approach is wrong to use an intersection: To match all lowercase strings except "a":
^(?!a$)[a-z]+$
And you can drop the wrapping ^ and $ when calling matches(), because matches() always tests the whole input:
if (input.matches("(?!a$)[a-z]+")) {
// it's an all-lowercase string, but not "a"
}
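Of course, you don't need regex, although the alternative is a little long-winded:
if (input.equals(input.toLowerCase()) && !input.equals("a"))
It is easier to read, but note that it is not strictly equivalent: it also accepts the empty string and strings containing digits or punctuation.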
Inside a character class (marked by []) the * character has no special meaning. It simply represents the character itself.
So the regular expression
[[a-z]*&&[^a]]
allows exactly one character being one of the following:
b, c, d, ..., z, *
The [a-z] and the following * are unioned, and the resulting character class is intersected with [^a] which simply removes the a character.
Valid strings are (for example):
b
*
c
But
a
is not valid, and neither is any string that contains more than one character.
Now to the solution for what you want. You want to have strings (allowing more than one character, I assume) that could also contain the letter 'a' but not the string "a" alone. The easiest is a group that does this distinction:
(?!a$)[a-z]*
The group (?!a$) is called a zero-width negative lookahead. It means that the characters it looks at are not consumed (zero-width) and that the match fails if they are present (negative). The $ anchors the lookahead at the end of the input, so only the string "a" by itself is rejected; without it, every word beginning with 'a' would be rejected too.
Character class intersection is supported in Java. The problem is that inside a character class, * loses its special meaning and the literal star "*" will be matched instead. Your regex should be:
[a-z&&[^a]]*
Now it'll match all characters in the range "a-z" except the "a" character.
Example:
Pattern p = Pattern.compile("[a-z&&[^a]]");
Matcher m = p.matcher("a");
System.out.println(m.matches()); // false
Try to use * outside of class:
[[a-z]&&[^a]]*
The intersection of two character classes gives you another character class.
And as said in other answers, * doesn't mean quantity inside class. So, use it outside.
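The answers above propose two different fixes, and they do not accept the same strings, so a short comparison may help. This is only a sketch; the class and method names are invented for the example. Note that [a-z&&[^a]]* rejects every string containing an 'a', whereas the lookahead version rejects only the string "a" itself.

```java
class IntersectionDemo {
    // The pattern from the question: a character class, so exactly one character.
    static boolean original(String s) {
        return s.matches("[[a-z]*&&[^a]]");
    }

    // Quantifier moved outside the class: any number of letters b..z.
    static boolean noLetterA(String s) {
        return s.matches("[a-z&&[^a]]*");
    }

    // Negative lookahead: one or more lowercase letters, but not "a" alone.
    static boolean notJustA(String s) {
        return s.matches("(?!a$)[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(original("b"));    // true
        System.out.println(original("bc"));   // false
        System.out.println(noLetterA("bc"));  // true
        System.out.println(noLetterA("abc")); // false
        System.out.println(notJustA("abc"));  // true
        System.out.println(notJustA("a"));    // false
    }
}
```

Which one is right depends on the intent: if "all lowercase strings except the string a" is the goal, the lookahead version appears to be the closer fit.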

Replacing multiple occurrences of special characters by a single special character

I want to replace multiple occurrences of special characters like " ", "-", "!", "_" in my Java string with a single underscore "_".
I tried
replaceAll("([\\s\\-\\!])\\1+","_")
and it seems to replace consecutive special characters of the same type with an underscore, but doesn't work otherwise.
for eg:
Hello!!! World
becomes
Hello__World
(2 underscores), but it should be Hello_World.
Also for cases like Hello - World it fails.
I also tried working with regex and made a regular expression like
replaceAll("([^a-zA-Z0-9])\\1+","_")
but it still doesn't help. How can I achieve it?
Note that \1 is a backreference to the text matched by the first capturing group, so your pattern only matches a run of the same character repeated. To match one or more of any characters from the character class, just use a + quantifier:
[\\s!-]+
So, use
str = str.replaceAll("[\\s!-]+","_");
See IDEONE demo
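A minimal sketch of the suggested replacement, using the inputs from the question. The class and method names are hypothetical. The second method adds _ to the class, on the assumption that existing underscores should be merged into the run as well (the question lists "_" among the special characters, but the answer's pattern leaves it out).

```java
class CollapseDemo {
    // The pattern from the answer: whitespace, '!' or '-' (the trailing hyphen is literal).
    static String collapse(String s) {
        return s.replaceAll("[\\s!-]+", "_");
    }

    // Assumed variant that also folds existing underscores into the run.
    static String collapseWithUnderscore(String s) {
        return s.replaceAll("[\\s!_-]+", "_");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Hello!!! World"));               // Hello_World
        System.out.println(collapse("Hello - World"));                // Hello_World
        System.out.println(collapse("Hello_ -World"));                // Hello__World
        System.out.println(collapseWithUnderscore("Hello_ -World"));  // Hello_World
    }
}
```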

What is the meaning of [...] regex?

I am new to regex. Going through a tutorial, I found that the regex [...] is described as "Matches any single character in brackets", so I tried
System.out.println(Pattern.matches("[...]","[l]"));
I also tried escaping brackets
System.out.println(Pattern.matches("[...]","\\[l\\]"));
But it gives me false. I expected true because l is inside brackets.
It would be helpful if anybody could clear up my doubts.
Characters that are inside [ and ] (called a character class) are treated as a set of characters to choose from, except leading ^ which negates the result and - which means range (if it's between two characters). Examples:
[-123] matches -, 1, 2 or 3
[1-3] matches a single digit in the range 1 to 3
[^1-3] matches any character except any of the digits in the range 1 to 3
. matches any character
[.] matches the dot .
If you want to match the string [l] you should change your regex to:
System.out.println(Pattern.matches("...", "[l]"));
Now it prints true.
The regex [...] is equivalent to the regexes \. and [.].
The tutorial is a little misleading, it says:
[...] Matches any single character in brackets.
However, what it means is that the regex will match a single character against any of the characters inside the brackets. The ... means "insert characters you want to match here". So you need to replace the ... with the characters that you want to match against.
For example, [AP]M will match against "AM" and "PM".
If your regex is literally [...] then it will match against a literal dot. Note there is no point repeating characters inside the brackets.
The tutorial is saying:
Matches any single character in brackets.
It means you replace ... with the characters you want to match, for example [l]
These will print true:
System.out.println(Pattern.matches("[l]","l"));
System.out.println(Pattern.matches("[.]","."));
System.out.println(Pattern.matches("[.]*","."));
System.out.println(Pattern.matches("[.]*","......"));
System.out.println(Pattern.matches("[.]+","......"));

Using Scanner.useDelimiter() in Java to isolate tokens in an expression

I am trying to isolate the words, brackets and => and <=> from the following input:
(<=>A B) OR (C AND D) AND(A AND C)
So far I've managed to isolate just the words (see Scanner#useDelimiter()):
sc.useDelimiter("[^a-zA-Z]");
Upon using:
sc.useDelimiter("[\\s+a-zA-Z]");
I get just the brackets as output, which is not what I want; I want tokens such as AND and ) as well.
How do I do that? Doing \\s+ gives the same result.
Also, how is a delimiter different from regex? I'm familiar with regex in PHP. Is the notation used the same?
Output I want:
(
<=>
A
(and so on)
You need a delimiting regex that can be zero-width (because you have adjacent terms), so look-arounds are the only option. Try this:
sc.useDelimiter("((?<=[()>])\\s*)|(\\s*\\b\\s*)");
This regex says "after a bracket or greater-than or at a word boundary, discarding spaces"
Also note that the character class [\\s+a-zA-Z] includes the + character - most characters lose any special regex meaning when inside a character class. It seems you were trying to say "one or more spaces", but that's not how you do that.
Inside [] the ^ means 'not', so the first regex, [^a-zA-Z], says 'give me everything that's not a-z or A-Z'
The second regex, [\\s+a-zA-Z], says 'give me everything that is space, +, a-z or A-Z'. Note that "+" is a literal plus sign when in a character class.
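As an alternative to a delimiter, the tokens themselves can be described and pulled out with Matcher.find(). This is a different technique from the Scanner approach above, sketched here under the assumption that the only tokens are <=>, =>, single brackets, and runs of letters; the class name, method name, and token pattern are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // Longer operators come first so that "<=>" is not split into "<" and "=>".
    private static final Pattern TOKEN = Pattern.compile("<=>|=>|[()]|[A-Za-z]+");

    // Collects every token in order; anything else (such as spaces) is skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [(, <=>, A, B, ), OR, (, C, AND, D, ), AND, (, A, AND, C, )]
        System.out.println(tokenize("(<=>A B) OR (C AND D) AND(A AND C)"));
    }
}
```

Describing the tokens rather than the gaps between them tends to be easier to reason about when tokens can be adjacent, as in AND( here.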

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz, then an = sign, then some chars, then a comma, then some chars, then ', then some chars, and finally ]. This means that if there are characters after xyz, they must be in the format ="XXX","XXX"'"XXX".
Or only #[xyz], with no characters after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-z0-9]+","[a-zA-z0-9]+"'"[a-zA-z0-9]+")?\]
See it
Expressed as Java string it'll be:
String regex = "#\\[xyz=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\"\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match literal [ and ] you need to escape it by preceding with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit. But you want one or more alphanumeric characters. So you need [a-zA-Z0-9]+
Use this:
String regex = "#\\[xyz(=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects a letter character and a digit after it. And you also have to escape first and last square braces because square braces have special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore) one or more times.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
Square brackets have a special meaning in regex: they define character classes, as you used them yourself. You need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is that with [a-zA-z][0-9] you define "first a letter, second a digit". You need to join those classes and add a quantifier. Also note that the range should be A-Z; A-z additionally admits characters such as [, ], ^ and _.
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
See it here on Regexr
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
The regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any string before and after the pattern, as you mentioned in your edit. As @ademiban said, the group (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? may appear once or not at all. The other mistakes are explained well in the other answers.
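The corrected pattern can be checked against the six sample strings from the question. This is a sketch with made-up names (TagValidator, isValid); it uses find() rather than matches(), on the assumption that surrounding text should be allowed without wrapping the pattern in (.)*.

```java
import java.util.regex.Pattern;

class TagValidator {
    // #[xyz] optionally followed by ="...","..."'"..." before the closing bracket.
    private static final Pattern TAG = Pattern.compile(
        "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // True when the input contains a well-formed tag anywhere inside it.
    static boolean isValid(String s) {
        return TAG.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("#[xyz=\"1\",\"2\"'\"4\"]"));      // true
        System.out.println(isValid("#[xyz]"));                        // true
        System.out.println(isValid("#[xyz=\"a5\",\"4r\"'\"8dsa\"]")); // true
        System.out.println(isValid("#[xyz=\"asd\"]"));                // false
        System.out.println(isValid("#[xyz\"asd\"]"));                 // false
        System.out.println(isValid("#[xyz=\"8s\"'\"4\"]"));           // false
        System.out.println(isValid("asdadad #[xyz] adadad"));         // true
    }
}
```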
