Non-greedy Regular Expression in Java - java

I have the following code:
public static void createTokens() {
    String test = "test is a word word word word big small";
    Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
    while (mtch.find()) {
        for (int i = 1; i <= mtch.groupCount(); i++) {
            System.out.println(mtch.group(i));
        }
    }
}
And I get this output:
word
w
But I expected:
word
word
Can somebody please explain why?

Because your patterns are non-greedy, they match as little text as possible while still producing a match.
Remove the ? in the second group, and you'll get
word
word word big small
Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);

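Because \\s* matches any number of whitespace characters, including zero, the single character w is enough to satisfy (\\s*.+?\\s*). To make each group capture a whole word, match a run of non-whitespace characters instead, e.g. (\\S+), and let the literal spaces in the pattern act as the separators. (Substituting (\\s+.+?\\s+) directly would not match this input, because the literal spaces around each group already consume the only space between words.)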
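To see the difference side by side, here is a small sketch that runs three variants of the pattern against the string from the question. The class name ReluctantDemo and the helper groups are made up for illustration, and the \\S+ variant is just one possible way to capture whole words, not necessarily what the answerers had in mind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class ReluctantDemo {
    // Returns the captured groups of the first match, or an empty list if nothing matches.
    static List<String> groups(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String test = "test is a word word word word big small";
        // Original pattern: the trailing reluctant group stops after one character.
        System.out.println(groups("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)", test));
        // Greedy second group: it runs to the end of the input.
        System.out.println(groups("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)", test));
        // Runs of non-whitespace: each group is exactly one word.
        System.out.println(groups("test is a (\\S+) word (\\S+)", test));
    }
}
```

The first call prints [word, w], the second [word, word word big small], and the third [word, word].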

Related

Is there a way to find a word by searching for only part of it?

I need to take a phrase that contains a specific word and, if that word appears even as part of another word, print the entire containing word.
I know how to find the word "apple", but I can't figure out how to find the word "appletree".
So far, I have some code that finds the word apple and prints that out.
String phrase = "She's sitting under an appletree!";
if (phrase.contains("apple")) {
System.out.println("apple");
} else {
System.out.println("none");
}
How do I print "appletree"?
Use regex for a 1-liner:
String target = phrase.replaceAll(".*?(\\w*apple\\w*).*", "$1");
This works by matching (and thus replacing) the entire input while capturing the target; the backreference $1 in the replacement then leaves just the captured target.
The word in which apple appears is matched using \\w* (i.e. any number of word characters) on either side of apple. The leading characters outside the target are matched with the reluctant quantifier .*? so that as few as possible are consumed; a greedy .* would match all the way up to apple and miss the start of words like dapple.
Test code:
String phrase = "She's sitting under an appletree!";
String target = phrase.replaceAll(".*?(\\w*apple\\w*).*", "$1");
System.out.println(target);
Output:
appletree
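If you'd rather collect the matches than rewrite the whole string, a Matcher loop is an alternative. This is only a sketch: the class WordFinder and the method wordsContaining are hypothetical names, and it assumes a "word" means a run of \w characters. One difference worth knowing: replaceAll returns the input unchanged when nothing matches, whereas this returns an empty list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class WordFinder {
    // Returns every run of word characters in the text that contains the given part.
    static List<String> wordsContaining(String text, String part) {
        List<String> found = new ArrayList<>();
        // Pattern.quote keeps any regex metacharacters in 'part' literal.
        Matcher m = Pattern.compile("\\w*" + Pattern.quote(part) + "\\w*").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(wordsContaining("She's sitting under an appletree!", "apple"));
        System.out.println(wordsContaining("dapple and pineapples", "apple"));
    }
}
```

For the phrase from the question this prints [appletree]; the second call prints [dapple, pineapples].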
You could use a Scanner to read the phrase. Call scanner.next() to read each token (here, each whitespace-separated word) into a String variable s, then test if (s.contains("apple")) and print it with System.out.println(s). Note that tokens keep any attached punctuation, so for the sample phrase this prints appletree! rather than appletree.
Hope this helps!
Robin.
Without a regex match, you could simply split the sentence into words, loop through them, and check whether each one contains the requested word, old-school style:
String[] arr = phrase.split("\\s+");
for (String word : arr) {
    if (word.contains("apple")) return word;
}

Extracting words with - included upper lowercase not working for words it only extracts chars

I'm trying to extract several words from a string with a regex Matcher and Pattern. I spent some time building the regular expression below, but it doesn't work as expected: I'm able to extract the individual characters of the words I want, but not the entire words. Any help would be much appreciated.
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main (String[] args){
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
Pattern pattern = Pattern.compile("[((a-zA-Z1-9-0)/W)]");
Matcher matcher = pattern.matcher(mebo);
while (matcher.find()) {
System.out.printf("Word is %s %n",matcher.group(0));
}
}
}
This is current output:
Word is 1 Word is 3 Word is 2 Word is 3 Word is 9 Word is 9 Word
is B Word is I Word is M Word is C Word is P Word is 1 Word is 2
Word is B Word is M Word is W Word is Q Word is - Word is C Word
is S Word is P Word is S Word is - Word is D Word is 1 Word is 0
Word is 1 Word is 9 Word is 2 Word is 2 Word is 9 Word is 2 Word
is 2 Word is 9
============
My expectation is to iterate entire words for example:
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'"
word is 1323 word is 99BIMCP word is 1 word is 2 word is BMWQ-CSPS-D1
word is 0192 word is 29229
Judging by your regex, you want to include letters, digits and - in each match, so you can use:
[\w-]+
[\w-]+ matches one or more characters from a-z, A-Z, 0-9, _ and -. In a Java string literal, write it as "[\\w-]+".
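Plugged into the Matcher loop from the question, the [\w-]+ pattern could look roughly like this. The class TokenExtractor and the method tokens are made-up names for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class TokenExtractor {
    // Collects every maximal run of word characters and hyphens.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("[\\w-]+").matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
        for (String word : tokens(mebo)) {
            System.out.printf("Word is %s %n", word);
        }
    }
}
```

The key change from the question's code is the + after the character class: without it, each match is a single character.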
The easiest solution here seems to be to skip the Matcher loop and just split the string instead. You want to allow digits, alphabetic characters and - in your words. Consider the following code:
for (String word : mebo.split("[^\\d\\w-]+")) {
System.out.printf("Word is %s %n", word);
}
This should exhibit the desired behaviour. Note that without the + in the splitting pattern, this would generate some empty strings.
This splits the input string on every run of characters that are not in your desired set, using a negated character class.
I would suggest a regex split, followed by a regex replacement:
String mebo = "1323 99BIMCP 1 2 BMWQ-CSPS-D1, 0192, '29229'";
String[] parts = mebo.split("\\s*,?\\s+");
for (String part : parts) {
System.out.println(part.replaceAll("[']", ""));
}
1323
99BIMCP
1
2
BMWQ-CSPS-D1
0192
29229
The logic here is to split on whitespace, possibly including a comma separator. Then, we can do a regex replacement cleanup to remove stray characters such as single quotes. Double quotes and any other unwanted characters can easily be added to the character class used for replacement.
In general, regex alone may not suffice here, and you may need a parser to cover every edge case. Case in point, consider the following input line:
One, "Two or more", Three
My answer fails here, because it blindly splits on whitespace and does not know that whitespace inside quotes is not a separator. A simple regex would also fail here.

Java match whole word in String

I have an ArrayList<String> which I iterate through to find the correct index given a String. Basically, given a String, the program should search through the list and find the index where the whole word matches. For example:
ArrayList<String> foo = new ArrayList<String>();
foo.add("AAAB_11232016.txt");
foo.add("BBB_12252016.txt");
foo.add("AAA_09212017.txt");
So if I give the String AAA, I should get back index 2 (the last one). So I can't use the contains() method as that would give me back index 0.
I tried with this code:
String str = "AAA";
String pattern = "\\b" + str + "\\b";
Pattern p = Pattern.compile(pattern);
for(int i = 0; i < foo.size(); i++) {
// Check each entry of list to find the correct value
Matcher match = p.matcher(foo.get(i));
if(match.find() == true) {
return i;
}
}
Unfortunately, the condition of the if statement inside the loop is never true. I'm not sure what I'm doing wrong.
Note: This should also work if I searched for AAA_0921, the full name AAA_09212017.txt, or any part of the String that is unique to it.
Since an underscore is a word character, \b does not match between a letter or digit and an underscore, so you need
String pattern = "(?<=_|\\b)" + str + "(?=_|\\b)";
Here, (?<=_|\b) positive lookbehind requires a word boundary or an underscore to appear before the str, and the (?=_|\b) positive lookahead requires an underscore or a word boundary to appear right after the str.
If your word may have special chars inside, you might want to use a more straight-forward word boundary:
"(?<![^\\W_])" + Pattern.quote(str) + "(?![^\\W_])"
Here, [^\\W_] is a negated character class that matches any character that is neither a non-word character (\W) nor an underscore, i.e. a letter or digit. The negative lookbehind (?<![^\\W_]) therefore fails the match if a letter or digit appears immediately before str, and the negative lookahead (?![^\\W_]) fails the match if a letter or digit appears immediately after str.
Note that the second example has a quoted search string, so that even AA.A_str.txt could be matched well with AA.A.
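Put together with the loop from the question, the quoted variant might be used like this. It is a sketch only: WholeWordIndex and indexOfWholeWord are hypothetical names, and returning -1 when nothing matches is an assumption about what the caller wants.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class WholeWordIndex {
    // Returns the index of the first entry containing 'key' with no letter or digit
    // directly before or after it, or -1 if there is no such entry.
    static int indexOfWholeWord(List<String> names, String key) {
        Pattern p = Pattern.compile("(?<![^\\W_])" + Pattern.quote(key) + "(?![^\\W_])");
        for (int i = 0; i < names.size(); i++) {
            if (p.matcher(names.get(i)).find()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> foo = Arrays.asList(
                "AAAB_11232016.txt", "BBB_12252016.txt", "AAA_09212017.txt");
        System.out.println(indexOfWholeWord(foo, "AAA"));
        System.out.println(indexOfWholeWord(foo, "AAA_09212017.txt"));
    }
}
```

Both calls print 2. Keep in mind that a key ending in the middle of a run of letters or digits, such as AAA_0921, is rejected by the lookahead, so partial keys like that need a looser pattern.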

Determine if a string has inner word boundaries

I use the following to determine if a word appears in a text, enforcing word boundaries:
if ( Pattern.matches(".*\\b" + key + "\\b.*", text) ) {
//matched
}
This would match book in text-book but not in facebook.
Now, I would like to do the reverse: determine if the input text has a word boundary inside.
E.g. mutually-collaborative (CORRECT, there is a word boundary inside) and mutuallycollaborative (WRONG, as there is no word boundary inside).
If the boundary were always a punctuation character, this would work:
if (Pattern.compile("\\p{Punct}").matcher(text).find()) { //check punctuations
//has punctuation
}
I would like to check for word boundaries in general, e.g. '-', etc.
Any idea?
You want to check if a given string contains a word boundary inside the string. Note that \b matches at the beginning and end of a non-empty string. Thus, you need to exclude those alternatives. Just use
"(?U)(?:\\W\\w|\\w\\W)"
This way, you make sure the string contains a word character directly next to a non-word character.
Demo:
String s = "mutuallyexclusive";
Pattern pattern = Pattern.compile("(?U)(?:\\W\\w|\\w\\W)");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group() + " word boundary found!");
} else {
System.out.println("Word boundary NOT found in " + s);
}
Just some reference on what a word boundary can match:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
So, with \w\W|\W\w, we exclude the first 2 situations.
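Wrapped up as a reusable check, the same pattern could look like this. The class InnerBoundary and the method hasInnerWordBoundary are names invented for the sketch; the regex is the one from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class InnerBoundary {
    // (?U) switches \w and \W to their Unicode definitions.
    private static final Pattern INNER = Pattern.compile("(?U)(?:\\W\\w|\\w\\W)");

    // True if a word character sits directly next to a non-word character anywhere in the text.
    static boolean hasInnerWordBoundary(String text) {
        return INNER.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasInnerWordBoundary("mutually-collaborative"));
        System.out.println(hasInnerWordBoundary("mutuallycollaborative"));
    }
}
```

This prints true for mutually-collaborative and false for mutuallycollaborative.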

Get number of exact substrings

I want to get the number of substrings out of a string.
The inputs are excel formulas like IF(....IF(...))+IF(...)+SUM(..) as a string. I want to count all IF( substrings. It's important that SUMIF(...) and COUNTIF(...) will not be counted.
I thought to check that there is no capital letter before the "IF", but this is giving (certainly) index out of bound. Can someone give me a suggestion?
My code:
for(int i = input.indexOf("IF(",input.length());
i != -1;
i= input.indexOf("IF(",i- 1)){
if(!isCapitalLetter(tmpFormulaString, i-1)){
ifStatementCounter++;
}
}
You can do the parsing yourself as you were doing (that's possibly better for learning to debug, so you understand what your problem is).
However, it can be done easily with a regular expression:
String s = "FOO()FOOL()SOMEFOO()FOO";
Pattern p = Pattern.compile("\\bFOO\\b");
Matcher m = p.matcher(s);
int count = 0;
while (m.find()) {
count++;
}
// count= 2
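The main trick here is \b in the regex, which means word boundary. In short, \b matches only where a word character (letter, digit or underscore) meets a non-word character or the start or end of the input, so FOO does not match when another word character sits directly before or after it.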
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
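Applied to the formulas from the question, the same idea might look like this. The class IfCounter, the method countIf and the sample formula are made up for illustration; the only real change from the FOO example is a leading \b plus an escaped opening parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class IfCounter {
    // Counts occurrences of "IF(" that are not preceded by a letter, digit or underscore.
    static int countIf(String formula) {
        Matcher m = Pattern.compile("\\bIF\\(").matcher(formula);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String formula = "IF(A1>0,IF(B1>0,1,2),3)+SUMIF(A:A,1)+COUNTIF(B:B,2)+IF(C1,1,0)";
        System.out.println(countIf(formula));
    }
}
```

For the sample formula this prints 3: the IF( inside SUMIF( and COUNTIF( is skipped because a letter sits directly before it.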
I think you can solve your problem by searching for the string IF(.
Try doing the same thing another way.
For example:
inputString = IF(hello)IF(hello)....IF(helloIF(hello))....
inputString.indexOf("IF(");
Does that solve your problem?
You can also use a regular expression.
