Java regexp error: \( is not a valid character - java

I was using java regexp today and found that you are not allowed to use the following regexp sequence
String pattern = "[a-zA-Z\\s\\.-\\)\\(]*";
if I do use it it will fail and tell me that \( is not a valid character.
But if I change the regexp to
String pattern = "[[a-zA-Z\\s\\.-]|[\\(\\)]]*";
Then it will work. Is this a bug in the regxp engine or am I not understanding how to work with the engine?
EDIT: I've had an error in my string: there shouldnt be 2 starting [[, it should be only one. This is now corrected

Your regex has two problems.
You've not closed the character class.
The - is acting as a range operator with . on LHS and ( on RHS. But ( comes before . in unicode, so this results in an invalid range.
To fix problem 1, close the char class or if you meant to not include [ in the allowed characters delete one of the [.
To fix problem 2, either escape the - as \\- or move the - to the beginning or to the end of the char class.
So you can use:
String pattern = "[a-zA-Z\\s\\.\\-\\)\\(]*";
or
String pattern = "[a-zA-Z\\s\\.\\)\\(-]*";
or
String pattern = "[-a-zA-Z\\s\\.\\)\\(]*";

You should only use the dash - at the end of the character class, since it is normally used to show a range (as in a-z). Rearrange it:
String pattern = "[[a-zA-Z\\s\\.\\)\\(-]*";
Also, I don't think you have to escape (.) characters inside brackets.
Update: As others pointed out, you must also escape the [ in a java regex character class.

The problem here is that \.-\) ("\\.-\\)" in a Java string literal) tries to define a range from . to ). Since the Unicode codepoint of . (U+002E) is higher than that of ) (U+0029) this is an error.
Try using this pattern and you'll see: [z-a].
The correct solution is to either put the dash - at the end of the character group (at which point it will lose its special meaning) or to escape it.
You also need to close the unclosed open square bracket or escape it, if it was not intended for grouping.
Also, escaping the fullstop . is not necessary inside a character group.

You have to escape the dash and close the unmatched square bracket. So you are going to get two errors with this regex:
java.util.regex.PatternSyntaxException: Illegal character range near index 14
because the dash is used to specify a range, and \) is obviously a not valid range character. If you escape the dash, making it [[a-zA-Z\s\.\-\)\(]* you'll get
java.util.regex.PatternSyntaxException: Unclosed character class near index 19
which means that you have an extra opening square bracket that is used to specify character class. I don't know what you meant by putting an extra bracket here, but either escaping or removing it will make it a valid regex.

Related

Regarding regex whitespace

I have a small query regarding representing the space in java regular Expression.
I want to restrict the name and for that i have defined an pattern as
Pattern DISPLAY_NAME_PATTERN = compile("^[a-zA-Z0-9_\\.!~*()=+$,-\s]{3,20}$");
but eclipse indicating it as error "Invalid escape sequence".It is saying it for "\s" which according to
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
is a valid predefined class.
What am i missing.Could anyone help me withit.
Thanks in advance.
You need to escape the \ in \s one more time. And also, you don't need to escape the . inside a character class. . and \\. inside a character class matches a literal dot.
Pattern DISPLAY_NAME_PATTERN = Pattern.compile("^[a-zA-Z0-9_.!~*()=+$,\\s-]{3,20}$");
And also put the - at the first or at the last inside the character class. Because - at the center of character class may act as a range operator. regex.PatternSyntaxException: Illegal character range exception is mainly because of this issue, that there isn't a range exists between the , and \\s
If you want to do a backslash match, then you need to escape it exactly three times.
Pattern DISPLAY_NAME_PATTERN = Pattern.compile("^[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}$");
Example:
System.out.println("foo-bar bar8998~*foo".matches("[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}")); // true
System.out.println("fo".matches("[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}")); // false

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
[] are used to match any of the things you specify inside the brackets, and are to be followed by a quantifier.
The second regexp will work: ^\(\d+\), so check your code.
Check also so there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, java can be tricky with escapes depending on if the regexp you type is directly translated to a regexp or is first a string literal. You may need to double escape your escapes.
In Java you have to escape parenthesis, so "\\(\\d+\\)" should match (1) in case one and two. Adding ^ as you did "^\\(\\d+\\)" will match only case1.
You have to use double back slashes within java string. Consider this
"\n" give you [line break]
"\\n" give you [backslash][n]
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
I believe Java's Regex Engine supports Positive Lookbehind, in which case you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz then = sign then some chars then , then some chars then ' then some chars and finally ]. This means if there is characters after xyz then they must be in format ="XXX","XXX"'"XXX".
Or only #[xyz]. No character after xyz.
I have tried following regex, but it did not worked:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-z0-9]+","[a-zA-z0-9]+"'"[a-zA-z0-9]+")?\]
See it
Expressed as Java string it'll be:
String regex = "#\\[xyz=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\"\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match literal [ and ] you need to escape it by preceding with a \.
[a-zA-z][0-9] match a single letter followed by a single digit. But you want one or more alphanumeric characters. So you need [a-zA-Z0-9]+
Use this:
String regex = "#\\[xyz(=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects a letter character and a digit after it. And you also have to escape first and last square braces because square braces have special meaning in regexes.
Explanation:
[a-zA-z0-9]+ means alphanumeric character (but not an underline) one or more times.
(=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\")? means that expression in parentheses can be one time or not at all.
Since square brackets have a special meaning in regex, you used it by yourself, they define character classes, you need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is with '"[a-zA-z][0-9]' you define "first a letter, second a digit", you need to join those classes and add a quantifier:
String regex = "#\\[xyz=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\"\\]";
See it here on Regexr
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
Regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\")?\\](.)*";
The First and last (.)* will allow any string before the pattern as you have mentioned in your edit. As said by #ademiban this (=\"[a-zA-z0-9]+\",\"[a-zA-z0-9]+\"'\"[a-zA-z0-9]+\")? will come one time or not at all. Other mistakes are also very well explained by Others +1 to all other.

Regex Dark Corners in Java... order of chars alters regex meaning?

I recently came across some odd behavior that involves Java's regex engine.
When writing some validation, I needed to add square brackets to my regex, like so:
"[^a-zA-Z0-9_/.# ]" // original expression
"[^a-zA-Z0-9_/.# /]/[]" // first modificiation
However... this implementation failed. After experimentation, I discovered that it would then work if I moved the space char to the end.
"[^a-zA-Z0-9_/.#/]/[ ]" // final working modification
Now the calling code that used this expression used the String.replaceAll(String, String) method, as listed here.
My question is... does anyone have any good technical idea on why the placing of the space alters the meaning of this regex? It really shouldn't matter.
[EDITED]
From the comments and answers--this is an example where using the built-in String method leads to incorrect behavior that is NOT caught. My Runtime environment does NOT complain at all even though if you read the documentation on String.replaceAll(String, String) it clearly states that it is the same functionality as Pattern.compile(regex).matcher(str).replaceAll(repl) I think I will file a bug.
You use the wrong escaping character, it's \ and not /.
Also, I'm not sure if you wanted your character group to include / and . or if you thought that . needs to be escaped in character groups (it doesn't need to be escaped: it always represents the literal . in character groups).
When trying to compile [^a-zA-Z0-9_/.# /]/[] it gives this exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 20
[^a-zA-Z0-9_/.# /]/[]
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.clazz(Pattern.java:2254)
at java.util.regex.Pattern.sequence(Pattern.java:1818)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
This indicates that there is a problem with the character class at that point. And in fact: you've got an empty character class [] which is not valid!
[^a-zA-Z0-9_/.# /]/[] means "a character not matching (a-z, A-Z, 0-9, _, /, ., #, or /), followed by a slash / followed by <fails to compile because it is malformed>".
What you want is probably [^a-zA-Z0-9_.# \]\[] which is "a character not matching a-z, A-Z, 0-9, _, ., #, , ] or [".
If you write it in a String literal remember to double the \ (because they have special meanings in String literals as well!):
Pattern regex = Pattern.compile("[^a-zA-Z0-9_.# \\]\\[]");

What does this Regular expression do?

I'm in the processing of converting a program from Perl to Java. I have come across the line
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
I'm not very good with regular expressions but from what I can tell this is matching something in the string $info{$host} to the regular expression ^\s*(([^)]+))\s*$ and assigning the match to $title.
My problem is that I have no clue what the regular expression is doing and what it will match. Any help would be appreciated.
Thanks
The regular expression matches a string that contains exactly one pair of matching parentheses (actually, one opening and one matching closing parenthesis, but inside any number of further opening parentheses may occur).
The string may begin and end with whitespace characters, but no others. Inside the parantheses, however, arbitrary characters may occur (at least one).
The following strings should match it:
(abc)
(()
(ab)
By the way, you may simply use the regular expression as-is in Java (after escaping the backslashes), using the Pattern class.
It will match a bunch of leading whitespace, followed by a left paren, followed by some text not including a right paren, followed by a right paren, followed by some more whitespace.
Matches:
(some stuff)
Fails:
(some stuff
some stuff)
(some stuff) asadsad
Ok step by step
/ - quote the regex
^ - the begining of the string
\s* - zero or more of any spacelike character
( - an actual ( character
( - begin a capture group
[^)]+ any of the characters ^ or ) the + indicating at least one
) -end the capture group
) and actual ) character
\s* zero or more space like characters
$ - the end of the string
/ - close the regex quote
So as far as I can work out we are looking for strings like " (^) " or "())"
methinks I am missing something here.
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
First, m// in list context returns the captured matches. my ($title) puts the right hand side in list context. Second, $info{$host} is matched against the following pattern:
/^ \s* \( ( [^\)]+) \) \s* $/x
Yes, used the x flag so I could insert some spaces. ^\s* skips any leading whitespace. Then we have an escaped paranthesis (therefore no capture group is created. Then we have a capture group containing [^\)]. That character class can be better written as [^)] because the right parenthesis is not special in a character class and means anything but a left parenthesis.
If there are one or more characters other than a closing parenthesis following the opening parenthesis followed by a closing parenthesis optionally surrounded on either side by whitespace, that sequence of characters is captured and put in to $x.

Categories