Determine if a string has inner word boundaries - java

I use following g to determine if word appears in a text, enforcing word boundaries:
if ( Pattern.matches(".*\\b" + key + "\\b.*", text) ) {
//matched
}
This would match book on text-book but not on facebook.
Now, I would like to to do the reverse: determine if the input text has a word boundary inside.
E.g. mutually-collaborative (CORRECT, there is a word boundary inside) and mutuallycollaborative (WRONG, as there is no word boundary inside).
If the boundary was a punctuation this will work:
if( Pattern.matches("\\p{Punct}", text) ) { //check punctuations
//has punctuation
}
I would like to check for word boundaries in general , e.g. '-', etc.
Any idea?

You want to check if a given string contains a word boundary inside the string. Note that \b matches at the beginning and end of a non-empty string. Thus, you need to exclude those alternatives. Just use
"(?U)(?:\\W\\w|\\w\\W)"
This way, you will make sure a string contains a combination of a word and a non-word characters.
See IDEONE demo:
String s = "mutuallyexclusive";
Pattern pattern = Pattern.compile("(?U)(?:\\W\\w|\\w\\W)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group() + " word boundary found!");
} else {
System.out.println("Word boundary NOT found in " + s);
}
Just some reference on what a word boundary can match:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, with \w\W|\W\w, we exclude the first 2 situations.

Related

How to write valid regex groups?

I try to write regex in order to validate that :
not lowercase characters
not only whitespace
I have write a regex (^[^a-z]+$)(^[^ ]+$) but when i test it, SFSFSDis incorrect.
How can i do that ?
Thanks ! :)
You can also use the regex (^(?!\\s+$)[^a-z]+)
(?!\\s+$) part checks if the regex does not contain only whitespaces till the end
[^a-z]+ then check for the lowercase characters
Tested with a few samples:
List<String> wordList = Arrays.asList(" ", "SFSFSD", "as DDdkj", "AB CD", " k", "l l");
String regex = "(^(?!\\s+$)[^a-z]+)";
for(String word : wordList) {
if(word.matches(regex)) {
System.out.println(word + " :valid");
} else {
System.out.println(word + " :not valid");
}
}
Output:
:not valid
SFSFSD :valid
as DDdkj :not valid
AB CD :valid
k :not valid
l l :not valid
You could use:
^[^a-z]*[^a-z\s][^a-z]*$
This would assert that there be at least one non whitespace, non lowercase letter, character, with the remaining being either whitespace or non lowercase.
I use the regex to validate that:
/^(?!^ +$)([^a-z]+)$/
Explanation
^ matches the position before the first character in the string.
(?!^ +$) negative lookahead, that matches if the string not only whitespace.
([^a-z]+) matches all characters except lowercase characters.
$ matches the right after character position of the string.

How to find and skip special characters at the start and end of the word

New to regex and using following code to find if a word contains special characters at the end/start.
String s = "K-factor:";
String regExp = "^[^<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]*$";
Matcher matcher = Pattern.compile(regExp).matcher(s);
while (matcher.find()) {
System.out.println("Start: "+ matcher.start());
System.out.println("End: "+ matcher.end());
System.out.println("Group: "+ matcher.group());
s = s.substring(0, matcher.start());
}
Would like to find if there's any special character(: in this sample code) at the start or end of the string. Trying to skip the character.
Neither compile time error nor output.
Note that your regex matches a whole string that does not contain the chars you defined in the character class. The string in question does not match that pattern since it contains :.
You might consider splitting the pattern into two parts to check for the unwanted chars at the start or end using an alternation group:
String regExp = "^[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]|[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]$";
Here, the pattern has a ^<special_char_class>|<special_char_class>$ structure, ^ anchors the match at start, $ anchors the match at the string end, and | is the alternation operator. Note I removed the ^ from the start of the character class to make them positive rather than negated, so that they could match those chars/ranges defined in the class.
Alternatively, since you seem to just match a string if it contains a non-letter at the start/end, you may use a
String regExp = "^\\P{L}|\\P{L}$";
that is Unicode letter aware or - ASCII only:
String regExp = "^\\P{Alpha}|\\P{Alpha}$";

Matching a word with pound (#) symbol in a regex

I have regexp for check if some text containing word (with ignoring boundary)
String regexp = ".*\\bSOME_WORD_HERE\\b.*";
but this regexp return false when "SOME_WORD" starts with # (hashtag).
Example, without #
String text = "some text and test word";
String matchingWord = "test";
boolean contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// now contains == true;
But with hashtag `contains` was false. Example:
text = "some text and #test word";
matchingWord = "#test";
contains = text.matches(".*\\b" + matchingWord + "\\b.*");
//contains == fasle; but I expect true
The \b# pattern matches a # that is preceded with a word character: a letter, digit or underscore.
If you need to match # that is not preceded with a word char, use a negative lookbehind (?<!\w). Similarly, to make sure the trailing \b matches if a non-word char is there, use (?!\w) negative lookahead:
text.matches("(?s).*(?<!\\w)" + matchingWord + "(?!\\w).*");
Using Pattern.quote(matchingWord) is a good idea if your matchingWord can contain special regex metacharacters.
Alternatively, if you plan to match your search words in between whitespace or start/end of string, you can use (?<!\S) as the initial boundary and (?!\S) as the trailing one
text.matches("(?s).*(?<!\\S)" + matchingWord + "(?!\\S).*");
And one more thing: the .* in the .matches is not the best regex solution. A regex like "(?<!\\S)" + matchingWord + "(?!\\S)" with Matcher#find() will be processed in a much more optimized way, but you will need to initialize the Matcher object for that.
If you are looking for words with leading '#', just simple remove the leading '#' from the searchword and use following regex.
text.matches("#\\b" + matchingWordWithoutLeadingHash + "\\b");

Java match whole word in String

I have an ArrayList<String> which I iterate through to find the correct index given a String. Basically, given a String, the program should search through the list and find the index where the whole word matches. For example:
ArrayList<String> foo = new ArrayList<String>();
foo.add("AAAB_11232016.txt");
foo.add("BBB_12252016.txt");
foo.add("AAA_09212017.txt");
So if I give the String AAA, I should get back index 2 (the last one). So I can't use the contains() method as that would give me back index 0.
I tried with this code:
String str = "AAA";
String pattern = "\\b" + str + "\\b";
Pattern p = Pattern.compile(pattern);
for(int i = 0; i < foo.size(); i++) {
// Check each entry of list to find the correct value
Matcher match = p.matcher(foo.get(i));
if(match.find() == true) {
return i;
}
}
Unfortunately, this code never reaches the if statement inside the loop. I'm not sure what I'm doing wrong.
Note: This should also work if I searched for AAA_0921, the full name AAA_09212017.txt, or any part of the String that is unique to it.
Since word boundary does not match between a word char and underscore you need
String pattern = "(?<=_|\\b)" + str + "(?=_|\\b)";
Here, (?<=_|\b) positive lookbehind requires a word boundary or an underscore to appear before the str, and the (?=_|\b) positive lookahead requires an underscore or a word boundary to appear right after the str.
See this regex demo.
If your word may have special chars inside, you might want to use a more straight-forward word boundary:
"(?<![^\\W_])" + Pattern.quote(str) + "(?![^\\W_])"
Here, the negative lookbehind (?<![^\\W_]) fails the match if there is a word character except an underscore ([^...] is a negated character class that matches any character other than the characters, ranges, etc. defined inside this class, thus, it matches all characters other than a non-word char \W and a _), and the (?![^\W_]) negative lookahead fails the match if there is a word char except the underscore after the str.
Note that the second example has a quoted search string, so that even AA.A_str.txt could be matched well with AA.A.
See another regex demo

Non-greedy Regular Expression in Java

I have next code:
public static void createTokens(){
String test = "test is a word word word word big small";
Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
while (mtch.find()){
for (int i = 1; i <= mtch.groupCount(); i++){
System.out.println(mtch.group(i));
}
}
}
And have next output:
word
w
But in my opinion it must be:
word
word
Somebody please explain me why so?
Because your patterns are non-greedy, so they matched as little text as possible while still consisting of a match.
Remove the ? in the second group, and you'll get
word
word word big small
Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);
By using \\s* it will match any number of spaces including 0 spaces. w matches (\\s*.+?\\s*). To make sure it matches a word separated by spaces try (\\s+.+?\\s+)

Categories