Regex first character not matching - java

I am having some Java Pattern problems. This is my pattern:
"^[\\p{L}\\p{Digit}~._-]+$"
It matches any US-ASCII letter, numerals, and some special characters; basically anything that wouldn't scramble a URL.
What I would like is to find the first letter in a word that does not match this pattern. Basically the user sends a text as an input and I have to validate it and to throw an exception if I find an illegal character.
I tried negating this pattern, but it wouldn't compile properly. Also find() didn't help out much.
A legal input would be hello while ?hello should not be, and my exception should point out that ? is not proper.
I would prefer a suggestion using Java's Matcher, Pattern or something else from util.regex. It's not a necessity, but checking each character in the string individually is not a solution.
Edit: I came up with a better regex to match unreserved URI characters

Try this:
^[\\p{L}\\p{Digit}.'-.'_]*([^\\p{L}\\p{Digit}.'-.'_]).*$
The first non-matching character is captured in group 1.
I made a few tries here: http://fiddle.re/gkkzm61
Explanation :
I negated your pattern, so I built this:
[^\\p{L}\\p{Digit}.'-.'_]
[^...] means every character except the ones listed inside the brackets (here, the contents of your pattern).
The pattern has 3 parts :
^[\\p{L}\\p{Digit}.'-.'_]*
Consumes characters from the start of the string until it meets a non-matching character
([^\\p{L}\\p{Digit}.'-.'_])
The non-matching character (negation) inside a capturing group
.*$
Any character until the end of the string.
Hope it helps you
EDIT :
The correct regex should be:
^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$
It is the same method; I only changed the contents of the first and second parts.
I tried it and it seems to work.
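As a rough sketch of how the corrected pattern could drive the validation the question asks for, the snippet below throws an exception that names the first illegal character. The class name UrlTextValidator, the method validate, and the choice of IllegalArgumentException are illustrative assumptions, not part of the answer above; Pattern.DOTALL is added so that .* also spans line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative only.
class UrlTextValidator {
    // Part 1: leading run of legal characters.
    // Part 2 (group 1): the first illegal character.
    // Part 3: the rest of the input.
    private static final Pattern FIRST_ILLEGAL = Pattern.compile(
            "^[\\p{L}\\p{Digit}~._-]*([^\\p{L}\\p{Digit}~._-]).*$",
            Pattern.DOTALL);

    // Throws if the input contains a character outside the allowed set.
    // Note: unlike the question's "+" pattern, an empty string passes here.
    static void validate(String input) {
        Matcher m = FIRST_ILLEGAL.matcher(input);
        if (m.matches()) {
            throw new IllegalArgumentException(
                    "Illegal character '" + m.group(1) + "' at index " + m.start(1));
        }
    }

    public static void main(String[] args) {
        validate("hello"); // legal input, no exception
        try {
            validate("?hello");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If empty input should also be rejected, a separate isEmpty() check before the regex is probably the simplest addition.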

The "^[\\p{L}\\p{Digit}.'-.'_]+$" pattern matches any string containing 1+ characters defined inside the character class. Note that double ' and . are suspicious and you might be unaware of the fact that '-. creates a range and matches '()*+,-.. If it is not on purpose, I think you meant to use .'_-.
To check if a string starts with a character other than the ones defined in the character class, you can negate the character class and check only the first character in the string:
if (str.matches("[^\\p{L}\\p{Digit}.'_-].*")) {
/* String starts with the disallowed character */
}
I also think you can shorten the regex to "(?U)[^\\w.'-].*". At any rate, \\p{Digit} can be replaced with \\d.
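To make the range pitfall concrete, here is a small sketch comparing the two character classes. The class and method names are made up for the demonstration.

```java
// Hypothetical demo class contrasting the two character classes.
class HyphenRangeDemo {
    // '-. in the middle of the class is a range from ' (U+0027) to . (U+002E),
    // so it also admits ( ) * + , and -.
    static boolean matchesWithRange(String s) {
        return s.matches("[\\p{L}\\p{Digit}.'-.'_]+");
    }

    // A hyphen placed last in the class is a literal hyphen.
    static boolean matchesWithLiteralHyphen(String s) {
        return s.matches("[\\p{L}\\p{Digit}.'_-]+");
    }

    public static void main(String[] args) {
        System.out.println(matchesWithRange("a*b"));         // true: * falls inside the range
        System.out.println(matchesWithLiteralHyphen("a*b")); // false
    }
}
```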

Try out this one to find the first non valid char:
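// Lazily skip valid characters, then capture the first run of invalid ones.
// (Inside a negated class, extra ^ characters would be literal carets, so they are omitted.)
Pattern negPattern = Pattern.compile(".*?([^\\p{L}\\p{Digit}.'-.'_]+).*");
Matcher matcher = negPattern.matcher("hel?lo");
if (matcher.matches()) {
    // Prints '?', the first character outside the allowed set.
    System.out.println("'" + matcher.group(1).charAt(0) + "'");
}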
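An alternative sketch, assuming the allowed set from the question (~._- plus letters and digits): search for the negated class directly with find(), so no surrounding .* is needed and the position of the offending character comes for free. The names FirstInvalidFinder and indexOfFirstInvalid are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; returns the position of the first disallowed character.
class FirstInvalidFinder {
    // Negated version of the question's character class.
    private static final Pattern INVALID =
            Pattern.compile("[^\\p{L}\\p{Digit}~._-]");

    // Returns the index of the first invalid character, or -1 if there is none.
    static int indexOfFirstInvalid(String input) {
        Matcher m = INVALID.matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfFirstInvalid("hel?lo")); // 3
        System.out.println(indexOfFirstInvalid("hello"));  // -1
    }
}
```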

Related

RegExp pattern for a String which contain 0 and 4-9(4 ,5,6,7,8,9)

I am dealing with a string. The use case is that I don't want a String that contains a 0 or any digit from 4 to 9.
Example:
ABC0123 -> Not valid
XYZ002456789 -> Not valid
ABC123 -> Valid
ABC1 -> Valid
I have tried the pattern below, but without success.
String pattern = "^[0,4-9]+$";
if(str.matches(pattern)){
//do something.
}
First, remove the comma from the character class. You're not looking for commas.
Since you're disallowing, don't anchor the expression, allow the match anywhere in the string. In fact, matches anchors the expression for you, so we have to intentionally allow characters before and after the disallowed character class:
String pattern = ".*[04-9].*";
if(str.matches(pattern)){
// disallow
}
Alternately, you can avoid having those .* in there by using Pattern.compile and then using the resulting Pattern instead of matches, since it won't automatically anchor the pattern like matches does.
It is much easier to match strings that contain 0 or 4-9 than to match those that don't. So you should just write a regex like this:
[4-90]
And call find, then invert the result:
if (!Pattern.compile("[4-90]").matcher(someString).find()) {
// ...
}
Another option could be to use a negated character class and add what you don't want to match. In this case you could add 0 and a range from 4-9 and if you don't want to match a carriage return or a newline you could add those as well.
^[^04-9\\r\\n]+$
Note that if you add a comma to the character class, it matches a literal comma.
String pattern = "^[^04-9\\r\\n]+$";
if(str.matches(pattern)){
//do something.
}
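The three approaches above can be put side by side in a short sketch. The class and method names are illustrative; each method returns true when the string is acceptable, that is, when it contains no 0 and no digit from 4 to 9.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the three suggestions.
class DigitFilterDemo {
    private static final Pattern FORBIDDEN = Pattern.compile("[04-9]");

    // Approach 1: matches() anchors the pattern, so .* is needed on both sides.
    static boolean validByMatches(String s) {
        return !s.matches(".*[04-9].*");
    }

    // Approach 2: find() searches anywhere in the input; invert the result.
    static boolean validByFind(String s) {
        return !FORBIDDEN.matcher(s).find();
    }

    // Approach 3: negated class; this one also rejects empty strings and line breaks.
    static boolean validByNegatedClass(String s) {
        return s.matches("^[^04-9\\r\\n]+$");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ABC0123", "XYZ002456789", "ABC123", "ABC1"}) {
            System.out.println(s + " -> " + validByFind(s));
        }
    }
}
```

On the four sample inputs all three methods agree; they differ only on edge cases such as the empty string.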

Java Regular Expression: matching a customized Hashtag pattern with a lookahead/lookbehind condition

I am currently learning how to write regular expressions in Java by trying to match a simple hashtag pattern. The hashtags obey the following conditions:
It starts with a hashtag: #
It has to contain at least 1 letter: [a-zA-Z]
It can contain any of the characters from the class [a-zA-Z0-9_]
It cannot be preceded by a character of the class [a-zA-Z0-9_]
Based on this, I thought that the correct regular expression is:
PATTERN = "(?<![a-zA-Z0-9_])#(?=.*[a-zA-Z])[a-zA-Z0-9_]+"
Here I'm using a lookahead (?=.*[a-zA-Z]) to make sure Condition 2 holds and using a lookbehind (?<![a-zA-Z0-9_]) to make sure Condition 4 holds. I'm less certain about ending with a +.
This works on simple test cases but fails on complicated ones such as:
String text = "####THIS_IS_A_HASHTAG; ;#This_1_2...#12_and_this but not #123 or #this# #or#that";
where it does not match #THIS_IS_A_HASHTAG, #This_1_2 and #12_and_this
Could someone explain what I'm doing wrong?
This lookahead:
(?=.*[a-zA-Z])
may produce wrong results for the cases when input is like this:
####12345...#12_and_this
by giving you 2 matches, #12345 and #12_and_this, whereas per your rules only the 2nd should be a valid match.
To fix this you can use this regex:
(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+
Where lookahead (?=[0-9_]*[a-zA-Z]) means assert presence of a letter after # with optional presence of a digit or underscore in between.
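A minimal sketch of running the fixed regex with find() and collecting every match; the class HashtagFinder and the method findHashtags are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that collects all hashtags matched by the fixed regex.
class HashtagFinder {
    // Lookbehind: no word character directly before '#'.
    // Lookahead: only digits/underscores may precede the first letter.
    private static final Pattern HASHTAG = Pattern.compile(
            "(?<![a-zA-Z0-9_])#(?=[0-9_]*[a-zA-Z])[a-zA-Z0-9_]+");

    static List<String> findHashtags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Only #12_and_this is reported; #12345 has no letter.
        System.out.println(findHashtags("####12345...#12_and_this"));
    }
}
```

On the longer test string from the question this yields #THIS_IS_A_HASHTAG, #This_1_2, #12_and_this, #this and #or; #123 is skipped for lack of a letter, and the second # in #this# as well as #that are skipped because a word character precedes them.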
How about this?
String text = "####THIS_IS_A_HASHTAG;;;#This_1_2...#12_and_this ";
String regex = "#[A-Za-z0-9_]+";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
System.out.println(m.group());
}
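For this input it prints the three expected hashtags, although this simpler pattern does not enforce the at-least-one-letter rule or the rule about the preceding character: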
#THIS_IS_A_HASHTAG
#This_1_2
#12_and_this

Match word in String in Java

I'm trying to match Strings that contain the word "#SP" (sans quotes, case insensitive) in Java. However, I'm finding using Regexes very difficult!
Strings I need to match:
"This is a sample #sp string",
"#SP string text...",
"String text #Sp"
Strings I do not want to match:
"Anything with #Spider",
"#Spin #Spoon #SPORK"
Here's what I have so far: http://ideone.com/B7hHkR. Could someone guide me through building my regexp?
I've also tried: "\\w*\\s*#sp\\w*\\s*" to no avail.
Edit: Here's the code from IDEone:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("\\b#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
(edit: positive lookbehind not needed, only matching is done, not replacement)
You are yet another victim of Java's misnamed regex matching methods.
.matches() unfortunately tries to match the whole input, which contradicts the usual meaning of "regex matching" (a regex can match anywhere in the input). The method you need to use is .find().
This is a braindead API, and unfortunately Java is not the only language having such misguided method names. Python also pleads guilty.
Also, \\b only matches at word boundaries, and # is not a word character, so there is no boundary between a space (or the beginning of input) and #. You need to use an alternation detecting either the beginning of input or a whitespace character.
Your code would need to look like this (non fully qualified classes):
Pattern p = Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("s #SP s");
if (m.find()) {
System.out.println("Match!");
}
You're doing fine, but the \b in front of the # is misleading. \b is a word boundary, but # is already not a word character (i.e. it isn't in the set [0-9A-Za-z_]). Therefore, the space before the # isn't considered a word boundary. Change to:
java.util.regex.Pattern p =
java.util.regex.Pattern.compile("(^|\\s)#SP\\b",
java.util.regex.Pattern.CASE_INSENSITIVE);
The (^|\s) means: match either ^ OR \s, where ^ means the beginning of your string (e.g. "#SP String"), and \s means a whitespace character.
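The sketch below checks both patterns against the sample strings from the question. The class and method names are assumptions made for the demonstration.

```java
import java.util.regex.Pattern;

// Hypothetical demo comparing the original and the corrected pattern.
class SpTagDemo {
    // Original attempt: \b before '#' needs a word character next to it.
    private static final Pattern ORIGINAL =
            Pattern.compile("\\b#SP\\b", Pattern.CASE_INSENSITIVE);

    // Corrected: start of input or whitespace before '#', word boundary after.
    private static final Pattern FIXED =
            Pattern.compile("(^|\\s)#SP\\b", Pattern.CASE_INSENSITIVE);

    static boolean containsTagOriginal(String s) {
        return ORIGINAL.matcher(s).find();
    }

    static boolean containsTag(String s) {
        return FIXED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTagOriginal("s #SP s"));  // false
        System.out.println(containsTag("s #SP s"));          // true
        System.out.println(containsTag("Anything with #Spider")); // false
    }
}
```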
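The regular expression "\\w*\\s*#sp\\w*\\s*" will match 0 or more word characters, followed by 0 or more whitespace characters, followed by #sp, followed by 0 or more word characters, followed by 0 or more whitespace characters. My suggestion is not to use \\s* to break words up in your expression; instead, use \\b after the tag. Before the tag, \\b does not help, because # is not a word character, so anchor on the start of the input or a whitespace character:
"(^|\\s)#sp\\b"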

Find string with special char using regex

I need to iterate over a List and remove all strings that contain certain special characters. Using a regex I'm able to remove all strings that start with these special characters, but how can I find out whether a special character is in the middle of the string?
For instance:
Pattern.matches("[()<>/;\\*%$].*", "(123)")
returns true, so I can remove this string,
but it doesn't work with this kind of string: 12(3).
Is it correct to use \* to find the occurrence of "*" char into the string?
Thanks for the help!
Andrea
You are yet another victim of Java's ill-named .matches(), which tries to match the whole input and contradicts the very definition of regex matching.
What you want is matching one character among ()<>/;\\*%$. With Java, you need to create a Pattern, a Matcher from this Pattern and use .find() on this matcher:
final Pattern p = Pattern.compile("[()<>/;\\*%$]");
final Matcher m = p.matcher(yourinput);
if (m.find()) // match, proceed
Try the following:
!Pattern.matches("^[^()<>/;\\*%$]*$", "(123)")
This uses a negated character class to ensure that all the characters in the string are not any of the characters in the class.
You then obviously negate the expression since you are testing for a string that does not match.
Is it correct to use \* to find the occurrence of "*" char into the string?
Yes.
Pattern.matches() tries to match the whole input. So since your regex says that the input has to start with a "special" char, 12(3) doesn't match.
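Putting the find() approach to work on the list from the question might look like the sketch below. The class SpecialCharFilter and the method withoutSpecialChars are hypothetical names; the character class is the one from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper that drops every string containing a special character.
class SpecialCharFilter {
    private static final Pattern SPECIAL = Pattern.compile("[()<>/;\\*%$]");

    // Returns a copy of the input without the strings that contain
    // any of the special characters, wherever they occur.
    static List<String> withoutSpecialChars(List<String> input) {
        List<String> result = new ArrayList<>(input);
        result.removeIf(s -> SPECIAL.matcher(s).find());
        return result;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("(123)", "12(3)", "abc", "a*b");
        System.out.println(withoutSpecialChars(data)); // [abc]
    }
}
```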

regex help in java

I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz then = sign then some chars then , then some chars then ' then some chars and finally ]. This means that if there are characters after xyz, they must be in the format ="XXX","XXX"'"XXX".
Or only #[xyz], with no characters after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quotations (in part after xyz) are optional and number of characters between quotes are also not fixed and there could also be some characters before and after this pattern like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
Expressed as Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match literal [ and ] you need to escape it by preceding with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit. But you want one or more alphanumeric characters. So you need [a-zA-Z0-9]+
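One more detail worth checking is the range A-z in the question's regex: it covers more than letters, because [, \, ], ^, _ and ` sit between Z and a in ASCII. The small sketch below (class and method names are made up) shows the difference from A-Z.

```java
// Hypothetical demo of the A-z versus A-Z range.
class LetterRangeDemo {
    // A-z spans U+0041..U+007A, which includes [ \ ] ^ _ and `.
    static boolean looseLetter(String s) {
        return s.matches("[a-zA-z]");
    }

    // A-Z spans only the upper-case ASCII letters.
    static boolean strictLetter(String s) {
        return s.matches("[a-zA-Z]");
    }

    public static void main(String[] args) {
        System.out.println(looseLetter("_"));  // true
        System.out.println(strictLetter("_")); // false
    }
}
```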
Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9] it expects a letter character and a digit after it. And you also have to escape first and last square braces because square braces have special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore) one or more times.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
Since square brackets have a special meaning in regex (you used them yourself to define character classes), you need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is with [a-zA-z][0-9]: it defines "first a letter, second a digit". You need to join those classes and add a quantifier:
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
The regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any string before and after the pattern, as you mentioned in your edit. As said by #ademiban, the group (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? occurs once or not at all. The other mistakes are also very well explained by the others; +1 to all of them.
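As a sketch, the optional-group regex can be checked against the six examples from the question. Using find() instead of matches() is an assumption made here so that surrounding text is allowed without the leading and trailing (.)*; the class XyzTagDemo and the method containsValidTag are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the optional-group pattern.
class XyzTagDemo {
    // "#[xyz" followed by an optional ="..","..'".." part and a closing "]".
    private static final Pattern TAG = Pattern.compile(
            "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // find() looks for the tag anywhere, so text before and after is allowed.
    static boolean containsValidTag(String s) {
        return TAG.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsValidTag("#[xyz=\"1\",\"2\"'\"4\"]")); // true
        System.out.println(containsValidTag("asdadad #[xyz] adadad"));     // true
        System.out.println(containsValidTag("#[xyz=\"asd\"]"));            // false
    }
}
```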
