Regex Pattern to negate a character

I have a statement like the one below:
#sv_q = " INSERT INTO alertuser.REALTIME
I'm trying to create a regex that matches lines from a file whenever the insert keyword is used. Because this line begins with '#', I don't want it to be matched. I tried the regex below, but this line is still matched. Can anyone suggest how to achieve this?
static Pattern tracePattern = Pattern.compile("(?!\\#).*insert\\s*into",Pattern.CASE_INSENSITIVE);
Matcher localMatcher = tracePattern.matcher(line);
if (localMatcher.find()) {
    // doing some checks
}

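If you want to use .find(), you need to anchor the pattern at the beginning with ^ or \A. Without an anchor, the lookahead fails at index 0, but .find() retries at index 1, where the next character is no longer #, so the line still matches. You should also use word boundaries to match only the whole words insert and into, and use \s+ instead of \s* to require at least one whitespace character between insert and into: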
"^(?!#).*\\binsert\\s+into\\b"
You could shorten your solution to
if (s.matches("(?i)(?!#).*\\binsert\\s+into\\b.*")) { /* doing some checks */ }
The matches() method requires a full string match, and thus, you need to add the .* at the end. Also, the Pattern.CASE_INSENSITIVE option can be used inline with the help of the embedded flag option (?i). If your input can contain line breaks, use (?si) instead of (?i).
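A minimal sketch of the anchored lookahead used with find(), assuming each line of the file is tested on its own. The class name CommentedInsertFilter and the method isActiveInsert are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class CommentedInsertFilter {
    // ^ pins the negative lookahead to the start of the line, so a leading '#'
    // rejects the whole line instead of letting find() retry at index 1.
    private static final Pattern INSERT_INTO =
            Pattern.compile("^(?!#).*\\binsert\\s+into\\b", Pattern.CASE_INSENSITIVE);

    public static boolean isActiveInsert(String line) {
        return INSERT_INTO.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isActiveInsert("#sv_q = \" INSERT INTO alertuser.REALTIME")); // false
        System.out.println(isActiveInsert("sv_q = \" INSERT INTO alertuser.REALTIME"));  // true
    }
}
```

Note that a line with whitespace before the '#' would still be accepted by this pattern; handling that case would need something like ^(?!\s*#), which goes beyond what the question asked.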

Related

RegExp pattern for a String which contain 0 and 4-9(4 ,5,6,7,8,9)

I am dealing with a string. The use case: I don't want to accept a string that contains 0 or any digit from 4 to 9.
Example:
ABC0123 -> Not valid
XYZ002456789 -> Not valid
ABC123 -> Valid
ABC1 -> Valid
I have tried the pattern below, without success.
String pattern = "^[0,4-9]+$";
if (str.matches(pattern)) {
    // do something.
}

First, remove the comma from the character class. You're not looking for commas.
Since you're disallowing characters, don't anchor the expression; allow the match anywhere in the string. In fact, matches anchors the expression for you, so you have to intentionally allow characters before and after the disallowed character class:
String pattern = ".*[04-9].*";
if (str.matches(pattern)) {
    // disallow
}
Alternatively, you can avoid those .* by using Pattern.compile and then calling find() on the resulting Matcher instead of using matches, since find() doesn't automatically anchor the pattern the way matches does.

It is much easier to match the strings that contain 0 or 4-9 than to match those that don't. So you should just write a regex like this:
[4-90]
And call find, then invert the result:
if (!Pattern.compile("[4-90]").matcher(someString).find()) {
    // ...
}
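As a rough sketch, the find-and-invert approach can be checked against the four example strings from the question. The class name DigitFilter and the method isValid are hypothetical; the pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

public class DigitFilter {
    // Matches any single forbidden digit: 0 or 4 through 9.
    private static final Pattern FORBIDDEN = Pattern.compile("[04-9]");

    // A string is valid when no forbidden digit occurs anywhere in it.
    public static boolean isValid(String s) {
        return !FORBIDDEN.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ABC0123"));      // false
        System.out.println(isValid("XYZ002456789")); // false
        System.out.println(isValid("ABC123"));       // true
        System.out.println(isValid("ABC1"));         // true
    }
}
```

With this approach an empty string also counts as valid, because it contains no forbidden digit; add a separate check if that is not wanted.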

Another option is to use a negated character class that lists what you don't want to match. In this case, add 0 and the range 4-9; if you also don't want to match a carriage return or a newline, add those as well.
^[^04-9\r\n]+$
Note that a comma inside the character class would match a literal comma.
String pattern = "^[^04-9\\r\\n]+$";
if (str.matches(pattern)) {
    // do something.
}

Regex matcher to handle a character or end of line

I would like to create a matching pattern for a situation like this
DOMAIN+("Y|A")?
I would like the matching options to be only
DOMAIN
DOMAINY
DOMAINA
but it seems that DOMAINX, DOMAINY, etc. are matching as well.

Yes, they are matching because you did not specify that the String needs to end there. DOMAIN(Y|A)? matches DOMAINX because the string does contain DOMAIN followed by nothing (which is accepted, since ? allows 0 or 1 occurrence).
You can add this restriction by specifying $ at the end of the regular expression.
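Sample code that shows the result of matches. Note that String.matches already requires the whole string to match, so the $ only makes a difference when you use find(). In your full code, you probably want to compile a Pattern once instead of doing it each time.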
public static void main(String[] args) {
    String regex = "DOMAIN(Y|A)?$";
    System.out.println("DOMAIN".matches(regex)); // prints true
    System.out.println("DOMAINX".matches(regex)); // prints false
    System.out.println("DOMAINY".matches(regex)); // prints true
    System.out.println("DOMAINA".matches(regex)); // prints true
}

You could use word boundaries, \b, in order to prevent strings such as "DOMAINX" from being matched.
If you just want to handle cases where there are characters after the word, add \b to the end:
DOMAIN(?:Y|A)?\b
Otherwise, you could place \b around the expression to handle cases where there may be characters at the start/end:
\bDOMAIN(?:Y|A)?\b
I also made (?:Y|A) a non-capturing group and I removed the quotes.
However, as your title implies, if you only want to handle characters at the end of a line, use the $ anchor at the end of your expression:
DOMAIN(?:Y|A)?$
You may have to add the m (multi-line) flag so that the anchor matches at the start/end of a line rather than at the start/end of the string:
(?m)DOMAIN(?:Y|A)?$
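To make the difference concrete, here is a small sketch comparing an unanchored find(), a word-bounded find(), and a whole-string matches(). The class and method names are hypothetical, chosen only for this illustration.

```java
import java.util.regex.Pattern;

public class DomainMatch {
    private static final Pattern UNANCHORED = Pattern.compile("DOMAIN(?:Y|A)?");
    private static final Pattern BOUNDED = Pattern.compile("\\bDOMAIN(?:Y|A)?\\b");

    // find() succeeds if the pattern occurs anywhere in the input.
    public static boolean findsUnanchored(String s) {
        return UNANCHORED.matcher(s).find();
    }

    // The word boundaries reject a match that is glued to other word characters.
    public static boolean findsBounded(String s) {
        return BOUNDED.matcher(s).find();
    }

    // matches() requires the entire input to match the pattern.
    public static boolean matchesWhole(String s) {
        return UNANCHORED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(findsUnanchored("DOMAINX"));       // true
        System.out.println(findsBounded("DOMAINX"));          // false
        System.out.println(findsBounded("see DOMAINY here")); // true
        System.out.println(matchesWhole("DOMAINX"));          // false
        System.out.println(matchesWhole("DOMAINA"));          // true
    }
}
```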

You need this
DOMAIN(Y|A)?
If you need it to be a word in text you should anchor it with \b as Josh shows.
Your regex does the following:
DOMAIN+("Y|A")?
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the character string “DOMAI” literally (case sensitive): DOMAI
Match the character “N” literally (case sensitive): N+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
Match the regex below and capture its match into backreference number 1: ("Y|A")?
    Between zero and one times, as many times as possible, giving back as needed (greedy): ?
    Match this alternative (attempting the next alternative only if this one fails): "Y
        Match the character string “"Y” literally (case sensitive): "Y
    Or match this alternative (the entire group fails if this one fails to match): A"
        Match the character string “A"” literally (case sensitive): A"

Java RegEx pattern is invalid when trying to exclude commas

I'm building a function to validate usernames, and in this case I want to accept alphabetic characters only. I'm matching the provided user input against this regex:
[1-9!##$%&*()_+=|<>?{}\\[\\]~-,]
This is the method that makes use of the regex:
public static String purgeInvalidLogin(String failedLogin, String pattern) {
    Pattern special = Pattern.compile(pattern);
    String purgedLogin = failedLogin.replaceAll(special.pattern(), ""); // remove any special characters before moving on
    purgedLogin = StringUtils.deleteWhitespace(purgedLogin);
    return purgedLogin;
}
However when trying to run this I get this message:
Illegal character range near index 25 [!##$%&*()_+=|<>?{}[]~-,] ^
which only happened once I added the comma. I've also tried the expression [!##$%&*()_+=|<>?{}[]~-\,] (escaping the comma) to no avail. I'm wondering how I can use the regex properly to exclude commas making use of my method above.
Thanks in advance.

Escape the hyphen just before the comma. As soon as you add another character (the comma) after it, the hyphen is interpreted as defining a range of characters.
[1-9!##$%&*()_+=|<>?{}\\[\\]~\\-,]
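A short sketch that checks which variants compile. The helper compiles and the class name HyphenInClass are hypothetical; the last case relies on the fact that a hyphen placed at the very end of a character class is taken literally.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class HyphenInClass {
    // Returns true when the regex compiles, false on a syntax error.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[~-,]"));    // false: '~' to ',' is a reversed range
        System.out.println(compiles("[~\\-,]"));  // true: the hyphen is escaped
        System.out.println(compiles("[~,-]"));    // true: the hyphen is the last character
    }
}
```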

You want to accept only alphabetic characters, and you are doing this by listing every possible illegal character. I think you have this backwards: it would be better to look for what you do want (which would be a much shorter regex) and flag non-matches.
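One way to sketch this whitelist idea, assuming "alphabetic" means the ASCII letters A-Z and a-z (for other alphabets, \p{L} would be the usual choice). The class name LettersOnly and both method names are hypothetical.

```java
public class LettersOnly {
    // True when the login consists of one or more ASCII letters and nothing else.
    public static boolean isAlphabetic(String login) {
        return login.matches("[A-Za-z]+");
    }

    // Removes everything that is not an ASCII letter, including whitespace.
    public static String keepLetters(String login) {
        return login.replaceAll("[^A-Za-z]", "");
    }

    public static void main(String[] args) {
        System.out.println(isAlphabetic("john"));        // true
        System.out.println(isAlphabetic("jo,hn"));       // false
        System.out.println(keepLetters("j0hn, d-oe!"));  // jhndoe
    }
}
```

Because the negated class also removes whitespace, a separate whitespace-stripping step is not needed with this variant.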

How to match a string's end using a regex pattern in Java?

I want a regular expression pattern that will match with the end of a string.
I'm implementing a stemming algorithm that will remove suffixes of a word.
E.g. for a word 'Developers' it should match 's'.
I can do it using following code :
Pattern p = Pattern.compile("s");
Matcher m = p.matcher("Developers");
m.replaceAll(" "); // it will replace all 's' with ' '
I want a regular expression that will match only a string's end something like replaceLast().

You need to match "s", but only if it is the last character of the word. Since the input is a single word, this is achieved with the end-of-input anchor $:
input.replaceAll("s$", " ");
If you enhance the regular expression, you can replace multiple suffixes with one call to replaceAll:
input.replaceAll("(ed|s)$", " ");

Use $:
Pattern p = Pattern.compile("s$");

public static void main(String[] args)
{
    String message = "hi this message is a test message";
    message = message.replaceAll("message$", "email");
    System.out.println(message);
}
Check this,
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

When matching a character at the end of string, mind that the $ anchor matches either the very end of string or the position before the final line break char if it is present even when the Pattern.MULTILINE option is not used.
That is why it is safer to use \z as the very end of string anchor in a Java regex.
For example:
Pattern p = Pattern.compile("s\\z");
will match s at the end of string.
See the related post "What's the difference between \z and \Z in a regular expression and when and how do I use it?".
NOTE: Do not put a pattern that can match an empty string before \z or $, because String.replaceAll then makes the replacement twice. That is, do not use input.replaceAll("s*\\z", " ");, since you will get two spaces at the end, not one. Either use "s\\z" to replace one s, or use "s+\\z" to replace one or more.
If you still want to use replaceAll with a zero-length pattern anchored at the end of string to replace with a single occurrence of the replacement, you can use a workaround similar to the one in the How to make a regular expression for this seemingly simple case? post (writing "a regular expression that works with String replaceAll() to remove zero or more spaces from the end of a line and replace them with a single period (.)").
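The behaviors described above can be observed with a small sketch. The class name EndAnchors and the two helper methods are hypothetical; the inputs are chosen only to show the trailing-newline case and the double replacement.

```java
public class EndAnchors {
    // $ also matches just before a final line terminator.
    public static String dropWithDollar(String s) {
        return s.replaceAll("s$", "");
    }

    // \z matches only at the very end of the input.
    public static String dropWithZ(String s) {
        return s.replaceAll("s\\z", "");
    }

    public static void main(String[] args) {
        String withNewline = "Developers\n";
        System.out.println(dropWithDollar(withNewline).equals("Developer\n"));  // true
        System.out.println(dropWithZ(withNewline).equals("Developers\n"));      // true

        // s* can match the empty string at the end, so two replacements happen.
        System.out.println("[" + "Developers".replaceAll("s*\\z", " ") + "]");  // [Developer  ]
        System.out.println("[" + "Developers".replaceAll("s+\\z", " ") + "]");  // [Developer ]
    }
}
```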

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
    String value1 = m.group(1);
    String value2 = m.group(2);
    // etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy quantifier changes things so that the engine starts off trying to match nothing and then builds up from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, and so on. It can do its job in one forward pass over the input.
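A small sketch comparing the captures of the greedy and reluctant versions on the two-key example. The class name GreedyVsReluctant and the helper capture are hypothetical. One caveat worth seeing in action: with find(), a reluctant group at the very end of the pattern is free to match nothing, so the reluctant version is paired with matches() here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Returns the two captured values, or null when the regex does not apply.
    // wholeInput = true uses matches(); false uses find().
    public static String[] capture(String regex, String input, boolean wholeInput) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean matched = wholeInput ? m.matches() : m.find();
        return matched ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String input = "Key1=x,Key2=y";
        String greedy = "Key1=(.*),Key2=(.*)";
        String reluctant = "Key1=(.*?),Key2=(.*?)";

        System.out.println(String.join("|", capture(greedy, input, false)));    // x|y
        System.out.println(String.join("|", capture(reluctant, input, true)));  // x|y
        // With find(), the trailing reluctant group stops as early as it can.
        System.out.println(String.join("|", capture(reluctant, input, false))); // x|
    }
}
```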

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to a non-capturing group composed of word characters followed by the equal sign (the next key)...and this last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which consists of the characters that make up a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
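The steps above could be sketched as follows, assuming (as the answer does) that values never contain an equal sign and that keys never contain a comma. The class name KeyValueSplit and the method parse are hypothetical. The -1 limit passed to split keeps a trailing empty value from being dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyValueSplit {
    public static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] parts = input.split("=", -1);
        if (parts.length < 2) {
            return result; // no '=' at all, so there is nothing to parse
        }
        String key = parts[0]; // the first substring is just the first key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                result.put(key, part); // the last substring is just the last value
            } else {
                // value, then the last comma, then the next key
                int comma = part.lastIndexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("Missing comma before key in: " + part);
                }
                result.put(key, part.substring(0, comma));
                key = part.substring(comma + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=a,b,Key2=c")); // {Key1=a,b, Key2=c}
    }
}
```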

If you know you always have 7, the hack of least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that uses .find() rather than .matches(), and I think I would recommend it, as it lets you move down the string one key=value pair at a time. It does bring you into the whole "greedy" evaluation discussion, though.
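A possible sketch of the find() approach, walking the string one pair at a time. It assumes keys are made of word characters and values contain no commas; the class name PairScanner, the method scan, and the pattern itself are illustrative choices, not something taken from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PairScanner {
    // One key=value pair: a word-character key, then everything up to the next comma.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^,]*)");

    public static Map<String, String> scan(String data) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(data);
        // Each successful find() advances past one pair.
        while (m.find()) {
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(scan("Key1=value,Key2=value,Key3=value"));
        // {Key1=value, Key2=value, Key3=value}
    }
}
```

Using [^,]* instead of .* sidesteps the greedy-versus-reluctant question, because the value can never run past a comma.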

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
    final String[] keyValue = pair.split("=");
    final String key = keyValue[0];
    final String value = keyValue[1];
}
