I use the following to determine whether a word appears in a text, enforcing word boundaries:
if ( Pattern.matches(".*\\b" + key + "\\b.*", text) ) {
//matched
}
This matches book in text-book but not in facebook.
Now I would like to do the reverse: determine whether the input text has a word boundary inside it.
E.g. mutually-collaborative (CORRECT, there is a word boundary inside) and mutuallycollaborative (WRONG, as there is no word boundary inside).
If the boundary were a punctuation character, this would work:
if( Pattern.matches("\\p{Punct}", text) ) { //check punctuations
//has punctuation
}
I would like to check for word boundaries in general, e.g. '-', etc.
Any idea?
You want to check if a given string contains a word boundary inside the string. Note that \b matches at the beginning and end of a non-empty string. Thus, you need to exclude those alternatives. Just use
"(?U)(?:\\W\\w|\\w\\W)"
This way, you make sure the string contains a word character directly next to a non-word character.
See IDEONE demo:
String s = "mutuallyexclusive";
Pattern pattern = Pattern.compile("(?U)(?:\\W\\w|\\w\\W)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group() + " word boundary found!");
} else {
    System.out.println("Word boundary NOT found in " + s);
}
Just some reference on what a word boundary can match:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, with \w\W|\W\w, we exclude the first 2 situations.
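As a minimal sketch of the check above, applied to the two strings from the question: the class name InteriorBoundaryDemo and the helper hasInteriorBoundary are made up for this illustration, only the regex comes from the answer.

```java
import java.util.regex.Pattern;

class InteriorBoundaryDemo {
    // (?U) turns on UNICODE_CHARACTER_CLASS, so \w and \W are Unicode-aware.
    private static final Pattern INTERIOR_BOUNDARY =
            Pattern.compile("(?U)(?:\\W\\w|\\w\\W)");

    // Hypothetical helper: true if a word char sits next to a non-word char
    // somewhere in the text, i.e. there is a boundary that is not at an edge.
    static boolean hasInteriorBoundary(String text) {
        return INTERIOR_BOUNDARY.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasInteriorBoundary("mutually-collaborative")); // true
        System.out.println(hasInteriorBoundary("mutuallycollaborative"));  // false
    }
}
```

Note that find() is used rather than matches(), since the pattern only describes the two characters around the boundary, not the whole string.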
Related
I am trying to write a regex that validates that a string:
contains no lowercase characters
is not only whitespace
I wrote the regex (^[^a-z]+$)(^[^ ]+$), but when I test it, SFSFSD is reported as incorrect.
How can I do that?
Thanks ! :)
You can also use the regex (^(?!\\s+$)[^a-z]+)
The (?!\\s+$) part checks that the string does not consist only of whitespace up to the end.
[^a-z]+ then matches one or more characters that are not lowercase letters.
Tested with a few samples:
List<String> wordList = Arrays.asList(" ", "SFSFSD", "as DDdkj", "AB CD", " k", "l l");
String regex = "(^(?!\\s+$)[^a-z]+)";
for (String word : wordList) {
    if (word.matches(regex)) {
        System.out.println(word + " :valid");
    } else {
        System.out.println(word + " :not valid");
    }
}
Output:
:not valid
SFSFSD :valid
as DDdkj :not valid
AB CD :valid
k :not valid
l l :not valid
You could use:
^[^a-z]*[^a-z\s][^a-z]*$
This asserts that there is at least one character that is neither whitespace nor a lowercase letter, with every remaining character being anything other than a lowercase letter.
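A small sketch of that pattern run against the same sample strings used in the previous answer. The class name UppercaseValidatorDemo and the helper isValid are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class UppercaseValidatorDemo {
    // At least one char that is neither lowercase nor whitespace,
    // surrounded by any number of non-lowercase chars.
    private static final Pattern VALID =
            Pattern.compile("^[^a-z]*[^a-z\\s][^a-z]*$");

    // Hypothetical helper wrapping the whole-string match.
    static boolean isValid(String input) {
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {" ", "SFSFSD", "as DDdkj", "AB CD", " k", "l l"};
        for (String s : samples) {
            System.out.println("[" + s + "] " + (isValid(s) ? "valid" : "not valid"));
        }
    }
}
```

Only SFSFSD and AB CD should be reported as valid here; the empty string is rejected as well, because the middle character class must match one character.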
I use the regex to validate that:
/^(?!^ +$)([^a-z]+)$/
Explanation
^ matches the position before the first character in the string.
(?!^ +$) is a negative lookahead that succeeds only if the string does not consist solely of spaces.
([^a-z]+) matches one or more characters that are not lowercase letters.
$ matches the position right after the last character in the string.
I am new to regex and am using the following code to find whether a word has special characters at its start or end.
String s = "K-factor:";
String regExp = "^[^<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]*$";
Matcher matcher = Pattern.compile(regExp).matcher(s);
while (matcher.find()) {
System.out.println("Start: "+ matcher.start());
System.out.println("End: "+ matcher.end());
System.out.println("Group: "+ matcher.group());
s = s.substring(0, matcher.start());
}
I would like to find whether there is any special character (: in this sample code) at the start or end of the string, and to skip that character.
There is neither a compile-time error nor any output.
Note that your regex matches a whole string that does not contain the chars you defined in the character class. The string in question does not match that pattern since it contains :.
You might consider splitting the pattern into two parts to check for the unwanted chars at the start or end using an alternation group:
String regExp = "^[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]|[<>{}\"/|;:.,~!?##$%^=&*\\]\\\\()\\[0-9_+]$";
Here, the pattern has a ^<special_char_class>|<special_char_class>$ structure, ^ anchors the match at start, $ anchors the match at the string end, and | is the alternation operator. Note I removed the ^ from the start of the character class to make them positive rather than negated, so that they could match those chars/ranges defined in the class.
Alternatively, since you seem to just match a string if it contains a non-letter at the start/end, you may use
String regExp = "^\\P{L}|\\P{L}$";
which is Unicode-letter aware, or, for ASCII only:
String regExp = "^\\P{Alpha}|\\P{Alpha}$";
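A rough sketch of how the ^\P{L}|\P{L}$ pattern could be used both to detect and to skip an edge character, which is what the question was after. The class EdgeCharDemo and both helper methods are hypothetical, and the stripping behaviour (at most one character per end) is an assumption about what "skip" should mean.

```java
import java.util.regex.Pattern;

class EdgeCharDemo {
    // A non-letter at the very start, or a non-letter at the very end.
    private static final Pattern EDGE_NON_LETTER =
            Pattern.compile("^\\P{L}|\\P{L}$");

    // Hypothetical helper: is there a non-letter at either edge?
    static boolean hasEdgeNonLetter(String s) {
        return EDGE_NON_LETTER.matcher(s).find();
    }

    // Hypothetical helper: drops at most one non-letter from each end.
    static String stripEdges(String s) {
        return EDGE_NON_LETTER.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "K-factor:";
        System.out.println(hasEdgeNonLetter(s)); // true
        System.out.println(stripEdges(s));       // K-factor
    }
}
```

The hyphen inside K-factor is left alone because neither alternative can match in the middle of the string.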
I have a regexp for checking whether some text contains a word (matched as a whole word, using word boundaries):
String regexp = ".*\\bSOME_WORD_HERE\\b.*";
but this regexp returns false when SOME_WORD starts with # (a hashtag).
Example without #:
String text = "some text and test word";
String matchingWord = "test";
boolean contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// now contains == true;
But with a hashtag, `contains` is false. Example:
text = "some text and #test word";
matchingWord = "#test";
contains = text.matches(".*\\b" + matchingWord + "\\b.*");
// contains == false, but I expect true
The \b# pattern matches a # only when it is preceded by a word character: a letter, digit or underscore.
If you need to match a # that is not preceded by a word character, use a negative lookbehind (?<!\w). Similarly, to make sure the trailing boundary still works when the search word ends with a non-word character, use a (?!\w) negative lookahead:
text.matches("(?s).*(?<!\\w)" + matchingWord + "(?!\\w).*");
Using Pattern.quote(matchingWord) is a good idea if your matchingWord can contain special regex metacharacters.
Alternatively, if you plan to match your search words only between whitespace or at the start/end of the string, you can use (?<!\S) as the initial boundary and (?!\S) as the trailing one:
text.matches("(?s).*(?<!\\S)" + matchingWord + "(?!\\S).*");
And one more thing: the .* in the .matches is not the best regex solution. A regex like "(?<!\\S)" + matchingWord + "(?!\\S)" with Matcher#find() will be processed in a much more optimized way, but you will need to initialize the Matcher object for that.
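The Matcher#find() variant mentioned above could look roughly like this. HashtagSearchDemo and containsToken are names invented for the sketch, and Pattern.quote is applied on the assumption that the search word should always be treated literally.

```java
import java.util.regex.Pattern;

class HashtagSearchDemo {
    // Hypothetical helper: true if 'word' occurs in 'text' delimited by
    // whitespace or by the start/end of the string.
    static boolean containsToken(String text, String word) {
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
        return p.matcher(text).find(); // no leading/trailing .* needed
    }

    public static void main(String[] args) {
        System.out.println(containsToken("some text and #test word", "#test"));    // true
        System.out.println(containsToken("some text and #testing word", "#test")); // false
        System.out.println(containsToken("some text and test word", "test"));      // true
    }
}
```

If the same word is searched for many times, compiling the Pattern once and reusing it would avoid repeated compilation.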
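If you are looking for words with a leading '#', simply remove the leading '#' from the search word and use the following regex. Since matches() requires the whole string to match, the pattern is wrapped in .* on both sides:
text.matches("(?s).*#\\b" + matchingWordWithoutLeadingHash + "\\b.*");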
I have an ArrayList<String> which I iterate through to find the correct index given a String. Basically, given a String, the program should search through the list and find the index where the whole word matches. For example:
ArrayList<String> foo = new ArrayList<String>();
foo.add("AAAB_11232016.txt");
foo.add("BBB_12252016.txt");
foo.add("AAA_09212017.txt");
So if I give the String AAA, I should get back index 2 (the last one). So I can't use the contains() method as that would give me back index 0.
I tried with this code:
String str = "AAA";
String pattern = "\\b" + str + "\\b";
Pattern p = Pattern.compile(pattern);
for(int i = 0; i < foo.size(); i++) {
// Check each entry of list to find the correct value
Matcher match = p.matcher(foo.get(i));
if(match.find() == true) {
return i;
}
}
Unfortunately, this code never reaches the if statement inside the loop. I'm not sure what I'm doing wrong.
Note: This should also work if I searched for AAA_0921, the full name AAA_09212017.txt, or any part of the String that is unique to it.
Since a word boundary does not match between a word character and an underscore (the underscore is itself a word character), you need
String pattern = "(?<=_|\\b)" + str + "(?=_|\\b)";
Here, (?<=_|\b) positive lookbehind requires a word boundary or an underscore to appear before the str, and the (?=_|\b) positive lookahead requires an underscore or a word boundary to appear right after the str.
See this regex demo.
If your word may have special chars inside, you might want to use a more straight-forward word boundary:
"(?<![^\\W_])" + Pattern.quote(str) + "(?![^\\W_])"
Here, the negative lookbehind (?<![^\W_]) fails the match if there is a word character other than an underscore immediately before the str. [^...] is a negated character class that matches any character other than the characters, ranges, etc. defined inside it, so [^\W_] matches every character that is neither a non-word character (\W) nor a _, i.e. a letter or digit. Likewise, the (?![^\W_]) negative lookahead fails the match if there is a word character other than an underscore immediately after the str.
Note that the second example has a quoted search string, so that even AA.A_str.txt could be matched well with AA.A.
See another regex demo
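Putting the second pattern into the loop from the question might look like the sketch below. FileIndexDemo and indexOfWholeToken are hypothetical names, and returning -1 when nothing matches is an assumption, since the question does not say what should happen in that case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class FileIndexDemo {
    // Hypothetical helper: index of the first entry that contains 'token'
    // not directly preceded or followed by a letter or digit; -1 if none.
    static int indexOfWholeToken(List<String> names, String token) {
        Pattern p = Pattern.compile(
                "(?<![^\\W_])" + Pattern.quote(token) + "(?![^\\W_])");
        for (int i = 0; i < names.size(); i++) {
            if (p.matcher(names.get(i)).find()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> foo = Arrays.asList(
                "AAAB_11232016.txt", "BBB_12252016.txt", "AAA_09212017.txt");
        System.out.println(indexOfWholeToken(foo, "AAA"));              // 2
        System.out.println(indexOfWholeToken(foo, "AAA_09212017.txt")); // 2
    }
}
```

A partial key that ends in the middle of a run of digits, such as AAA_0921, would not match with this pattern, because the next character is a digit and the lookahead rejects it.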
I have the following code:
public static void createTokens() {
    String test = "test is a word word word word big small";
    Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
    while (mtch.find()) {
        for (int i = 1; i <= mtch.groupCount(); i++) {
            System.out.println(mtch.group(i));
        }
    }
}
And I get the following output:
word
w
But in my opinion it should be:
word
word
Could somebody please explain why this happens?
Because your patterns are non-greedy (lazy), they match as little text as possible while still producing a match.
Remove the ? in the second group, and you'll get
word
word word big small
Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);
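Because \\s* matches any number of whitespace characters, including zero, the single character w is enough to satisfy (\\s*.+?\\s*). To make sure each group captures a word delimited by whitespace, try (\\s+.+?\\s+); the literal spaces around the groups then have to be dropped from the pattern, because \\s+ needs to consume them itself.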
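To see the difference side by side, here is a small sketch that runs the lazy and the greedy version of the second group on the string from the question. LazyVsGreedyDemo and secondGroup are names made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    // Hypothetical helper: returns capture group 2 of the first match, or null.
    static String secondGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";
        // Lazy .+? at the end of the pattern stops after one character.
        System.out.println(secondGroup(
                "test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)", test)); // w
        // Greedy .+ runs to the end of the input.
        System.out.println(secondGroup(
                "test is a (\\s*.+?\\s*) word (\\s*.+\\s*)", test));  // word word big small
    }
}
```

The first group is "word" in both cases, because the literal " word " that follows it forces the lazy quantifier to expand until the rest of the pattern can match.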