Java regex predefined character class nested inside character class - java

I need to use a regular expression that contains all the \b characters except the dot, .
Something like [\b&&[^.]]
For example, in the following test string:
"somewhere deep down in some org.argouml.swingext classes and"
I want the string org.argouml.swingext to match but the string org.argouml not to match (using the Matcher.find() method).
If I use: \b(package_name)>\b they both match, which is not what I want.
If I use: \b(package_name)[\b&&[^\.]] I get a PatternSyntaxException
If I use: \b(package_name)(\b&&[^\.]) nothing matches.
I use this link to test my regexes.
Context: I have a list of package names from a project and I have to search them inside some texts. Obviously if a nested package is found, I don't want the outer package to match as well, as seen from the above example.
I am not using the \s character class at the end because the package may be at the end of a line, or it may be followed by other non-word characters such as : , ) etc., characters that are contained in the \b class. I just want to subtract the . from the \b class.
If anybody knows how to do this, I would be very grateful :)
Thanks

A negative lookahead would work here:
\borg\.argouml(?!\.)\b
Remember that in Java string literals the backslashes in regular expressions must be escaped (the dot in the package name is escaped too, so that it only matches a literal dot):
"\\borg\\.argouml(?!\\.)\\b"
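A minimal sketch of how the lookahead could be applied to a whole list of package names. The class and method names (PackageFinder, containsPackage) are made up for illustration, and Pattern.quote is used here as an assumption about how the package names would be made safe to embed in a regex.

```java
import java.util.regex.Pattern;

class PackageFinder {
    // Returns true if pkg occurs in text as a complete package name,
    // i.e. not immediately followed by a dot and a deeper package.
    static boolean containsPackage(String text, String pkg) {
        // Pattern.quote turns the dots in the package name into literals.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(pkg) + "\\b(?!\\.)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "somewhere deep down in some org.argouml.swingext classes and";
        System.out.println(containsPackage(text, "org.argouml.swingext")); // true
        System.out.println(containsPackage(text, "org.argouml"));          // false
    }
}
```

One consequence of this choice: a package name that ends a sentence and is followed by a full stop would also be rejected by the lookahead.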

Why not simply use:
\b\w+(\.\w+)+\b
FYI, the PatternSyntaxException pops up because \b matches a position, not a character. A character class always matches 1 character so putting \b (a word boundary) inside a character class will cause the exception to be thrown.
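As a rough illustration of the \b\w+(\.\w+)+\b suggestion, the sketch below collects every dotted name in a text; the results could then be compared against the list of known package names. DottedNames and find are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNames {
    // Collects every maximal run of word characters joined by dots.
    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("\\b\\w+(\\.\\w+)+\\b").matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "somewhere deep down in some org.argouml.swingext classes and";
        System.out.println(find(text)); // [org.argouml.swingext]
    }
}
```

Because the quantifier is greedy, the longest dotted name is returned, so the outer package never shows up as a separate match.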

Related

How to escape a character in Regex expression in Java

I have a regex expression which removes all non alphanumeric characters. It is working fine for all special characters apart from ^. Below is the regex expression I am using.
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "").toUpperCase();
I tried modifying it to
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]\\^", "").toUpperCase();
and
String strRefernce = strReference.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}\\^]", "").toUpperCase();
But these are also not able to remove this symbol.
Can someone please help me with this.
The first ^ inside [^...] is a negation mark that makes the character class a negated one (matching characters other than those listed inside).
The second ^ inside the class is treated as a literal caret, so it becomes one of the excluded characters and the regex never matches it. Remove it, and the caret will be matched:
"[^\\p{IsAlphabetic}\\p{IsDigit}]"
or even shorter:
"(?U)\\P{Alnum}"
The \P{Alnum} class stands for any character other than an alphanumeric character, i.e. the negation of [\p{Alpha}\p{Digit}] (see Java regex reference). When you pass (?U), the \P{Alnum} class will not match Unicode letters. See this IDEONE demo.
Add a + at the end if you want to remove whole chunks of symbols other than \\p{IsAlphabetic} and \\p{IsDigit}.
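A small sketch comparing the original class (with the stray second ^) against the corrected one. The class and method names are invented for the example.

```java
class StripNonAlnum {
    // Original pattern: the second ^ is a literal, so carets are excluded from removal.
    static String withStrayCaret(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}^\\p{IsDigit}]", "").toUpperCase();
    }

    // Corrected pattern: only the leading ^ remains, so carets are removed too.
    static String fixed(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]+", "").toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(withStrayCaret("ab^c-12 3")); // AB^C123
        System.out.println(fixed("ab^c-12 3"));          // ABC123
    }
}
```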
This works as well.
System.out.println("Text 尖酸[刻薄 ^, More _0As text °ÑÑ"".replaceAll("(?U)[^[\\W_]]+", " "));
Output
Text 尖酸 刻薄 More 0As text Ñ Ñ
I am not sure, but the word class might be the more comprehensive list of alphanumeric characters.
[\\W_] is a class containing non-word characters and an underscore.
When put into a negated Java class construct, it becomes [^[\\W_]], which is a negated class of a union between nothing and a class containing non-word characters and an underscore.

Add Dash to Java Regex

I am trying to modify an existing Regex expression being pulled in from a properties file from a Java program that someone else built.
The current Regex expression used to match an email address is -
RR.emailRegex=^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
That matches email addresses such as abc.xyz@example.com, but now some email addresses have dashes in them, such as abc-def.xyz@example.com, and those are failing the regex pattern match.
What would my new Regex expression be to add the dash to that regular expression match or is there a better way to represent that?
Based on the regex you are using, you can add the dash to your character classes, so that
RR.emailRegex=^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
becomes
RR.emailRegex=^[a-zA-Z0-9_\\.-]+@[a-zA-Z0-9_-]+\\.[a-zA-Z0-9_-]+$
Btw, you can shorten your regex like this:
RR.emailRegex=^[\\w.-]+@[\\w-]+\\.[\\w-]+$
Anyway, I would use Apache EmailValidator instead like this:
if (EmailValidator.getInstance().isValid(email)) ....
The meaning of - inside a character class is different from its meaning elsewhere. Inside a character class, - denotes a range, e.g. 0-9. If you want to include a literal -, write it at the beginning or end of the character class, like [-0-9] or [0-9-].
You also don't need to escape . inside a character class, because it is treated as a literal . there.
Your regex can be simplified further. \w denotes [A-Za-z0-9_]. So you can use
^[-\w.]+@[\w]+\.[\w]+$
In Java, this can be written as
^[-\\w.]+@[\\w]+\\.[\\w]+$
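A minimal sketch of the dash-at-the-end placement, written as a Java string literal rather than a properties entry and using @ as the separator between local part and domain. EmailDash and looksLikeEmail are illustrative names only, and the pattern is the simple one from this thread, not a full address validator.

```java
import java.util.regex.Pattern;

class EmailDash {
    // The dash is placed last in each class, so it is a literal dash, not a range.
    static final Pattern EMAIL = Pattern.compile("^[\\w.-]+@[\\w-]+\\.[\\w-]+$");

    static boolean looksLikeEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("abc.xyz@example.com"));     // true
        System.out.println(looksLikeEmail("abc-def.xyz@example.com")); // true
        System.out.println(looksLikeEmail("abc def@example.com"));     // false
    }
}
```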
^[a-zA-Z0-9_\\.\\-]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
Should solve your problem. In regex you need to escape anything that has meaning in the Regex engine (eg. -, ?, *, etc.).
The correct Regex fix is below.
OLD Regex Expression
^[a-zA-Z0-9_\\.]+@[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+$
NEW Regex Expression
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
Actually, I read this post, which covers all the special cases, so the one that works best with Java is
String pattern ="(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])";

Regarding regex whitespace

I have a small question about representing whitespace in a Java regular expression.
I want to restrict the name, and for that I have defined a pattern as
Pattern DISPLAY_NAME_PATTERN = compile("^[a-zA-Z0-9_\\.!~*()=+$,-\s]{3,20}$");
but Eclipse flags it as an error, "Invalid escape sequence". It reports this for "\s", which according to
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
is a valid predefined class.
What am I missing? Could anyone help me with it?
Thanks in advance.
You need to escape the \ in \s one more time. And also, you don't need to escape the . inside a character class. . and \\. inside a character class matches a literal dot.
Pattern DISPLAY_NAME_PATTERN = Pattern.compile("^[a-zA-Z0-9_.!~*()=+$,\\s-]{3,20}$");
Also put the - first or last inside the character class, because a - in the middle of a character class may act as a range operator. The regex.PatternSyntaxException: Illegal character range error arises mainly from this issue: no range exists between , and \\s.
If you want to match a literal backslash, you need four backslashes in the Java string literal: two for the regex escape, each doubled for the string literal.
Pattern DISPLAY_NAME_PATTERN = Pattern.compile("^[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}$");
Example:
System.out.println("foo-bar bar8998~*foo".matches("[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}")); // true
System.out.println("fo".matches("[a-zA-Z0-9_.\\\\!~*()=+$,\\s-]{3,20}")); // false
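To make the range problem concrete, here is a short sketch that tries to compile both variants. DashPlacement and compiles are hypothetical helper names; the expectation that the first variant is rejected follows the explanation above about , and \\s not forming a valid range.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DashPlacement {
    // Reports whether the regex compiles instead of letting the exception escape.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Dash between ',' and \s is read as a range and is rejected.
        System.out.println(compiles("^[a-zA-Z0-9_.!~*()=+$,-\\s]{3,20}$")); // false
        // Dash moved to the end of the class is a literal dash.
        System.out.println(compiles("^[a-zA-Z0-9_.!~*()=+$,\\s-]{3,20}$")); // true
    }
}
```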

Regular Expression Pattern to Match Words in All Caps That Are Followed By a colon

I need a pattern to match words like APPLE: or PEAR:
[A-Z][:] will match the R: but not the whole word and thus gives me a false when I try to match.
Can anybody help?
You want to match one or more capital letters, which means you need to use a +. Also, your : doesn't need to be in a character class:
[A-Z]+:
Just add a "quantifier":
/[A-Z]+:/
Note you don't need a character class for a single character.
How about \b[A-Z]+:? The \b is for checking a word boundary btw.
\b matches a position at a word boundary, i.e. the start or end of a word; it does not consume any characters.
[A-Z] indicates a range of characters, and specifying A-Z matches the range of characters from capital A to capital Z. In other words, only upper case letters.
End the expression by matching a colon and you'll find matches of a capital-letter word immediately followed by a single colon.
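You can use the regular expression in Java like below.
import java.util.regex.*;
public class RegexExample {
    public static void main(String[] args) {
        // "\\b" is the regex word boundary; a bare "\b" in a Java string is a backspace character
        Matcher m = Pattern.compile("\\b[A-Z]+:").matcher("data: stuff, MIX!: of, APPLE: or, PEAR: or, PineAPPLes: yay!");
        // find() scans for matching substrings; Pattern.matches() would require the whole input to match
        while (m.find()) System.out.println(m.group());
    }
}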
I recommend finding an online playground for regular expressions. Iterating and experimenting with regexes for a project can be a fast way to learn the limitations and find ways to simplify or improve an expression.
You need to use the + quantifier so that the match covers all the characters of the word, not just the last one.
Try with this regex:
[A-Z]+\:

Regexp match in Java

Regexp in Java
I want to write a regexp that does this:
verify that a word looks like [0-9A-Za-z][._-'][0-9A-Za-z]
example for valid words
A21a_c32
daA.da2
das'2
dsada
ASDA
12SA89
non valid words
dsa#da2
34$
Thanks
^[0-9A-Za-z]+[._'-]?[0-9A-Za-z]+$ (see matches on rubular.com)
Key points:
^ is the start of the string anchor
$ is the end of string anchor
+ is "one-or-more repetition of"
? is "zero-or-one repetition of" (i.e. "optional")
- in a character class definition is special (range definition)...
unless it's escaped, or first, or last
. unescaped outside of a character class definition is special...
but in a character class definition it's just a period
References
regular-expressions.info/Anchors, Repetition, Dot, Character Class
If [._'-] are optional, put the ? with the next characters, like this:
[0-9A-Za-z]+([._'-][0-9A-Za-z]+)?
"(\\p{Alnum})*([.'_-])?(\\p{Alnum})*"
In this solution I assume that the delimiter is optional, the empty string is also legal, and that the string may start/end with the delimiter, or be composed only of the delimiter.
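As a quick check, the second suggestion above (the optional group containing the delimiter and the characters after it) can be run against the sample words from the question. WordCheck and isValidWord are names invented for this sketch; String.matches is used so the whole word has to match.

```java
import java.util.regex.Pattern;

class WordCheck {
    // Alphanumerics, optionally followed by one delimiter and more alphanumerics.
    static final Pattern WORD = Pattern.compile("[0-9A-Za-z]+([._'-][0-9A-Za-z]+)?");

    static boolean isValidWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A21a_c32", "daA.da2", "das'2", "dsada", "ASDA", "12SA89", "dsa#da2", "34$"};
        for (String s : samples) {
            // The first six samples print true, the last two print false.
            System.out.println(s + " -> " + isValidWord(s));
        }
    }
}
```

Unlike the anchored pattern with two + quantifiers, this variant also accepts a single-character word such as "a", since the second run of characters lives inside the optional group.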
