StringUtils.countMatches words starting with a string? - java

I'm using StringUtils.countMatches to count word frequencies. Is there a way to search text for words that start with some characters?
Example:
searching for art in "artificial art in my apartment" will return 3! I need it to return 2 for words starting with art only.
My solution was to replace \r and \n in the text with a space and modify the code to be:
text = text.replaceAll("(\r\n|\n)"," ").toLowerCase();
searchWord = " "+searchWord.toLowerCase();
StringUtils.countMatches(text, searchWord);
I also tried the following Regex:
patternString = "\\b(" + searchWord.toLowerCase().trim() + "([a-zA-Z]*))";
pattern = Pattern.compile(patternString);
matcher = pattern.matcher(text.toLowerCase());
Questions:
- Does my first solution make sense, or is there a better way to do this?
- Is my second solution faster? I'm working with large text files and a decent number of search words.
Thanks

text = text.replaceAll("(\r\n|\n)", " ").toLowerCase();
searchWord = searchWord.toLowerCase();
String[] words = text.split(" ");
int count = 0;
for (String word : words) {
    // Count only the words that begin with the search word
    if (word.startsWith(searchWord)) {
        count++;
    }
}
A loop over the split words gives the same effect.

Use a regular expression to count examples of art.... The pattern to use is:
\b<search-word>
Here, \b matches a word boundary. Of course, the backslash in \b needs to be escaped (written as \\b) inside a Java string literal. Below is an example:
String input = "artificial art in my apartment";
Matcher matcher = Pattern.compile("\\bart").matcher(input);
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
Output: 2
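The same idea can be wrapped in a small helper for arbitrary search words. This is only a sketch: the class and method names (PrefixCounter, countWordsStartingWith) are made up for illustration, and it assumes the search word starts with a letter or digit so that \b behaves as a word-start anchor. Pattern.quote guards against regex metacharacters in the search word, and CASE_INSENSITIVE avoids lowercasing a large text first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixCounter {
    // Hypothetical helper: counts the words in text that start with prefix.
    static int countWordsStartingWith(String text, String prefix) {
        // \b anchors the match at a word boundary; quote() treats the prefix literally.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(prefix),
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWordsStartingWith("artificial art in my apartment", "art"));
    }
}
```

With many search words, compiling each Pattern once and reusing it across files is probably worth it, since compilation is the part that does not depend on the text.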

Related

How do I replace the same word but different case in the same sentence separately?

For example, replace "HOW do I replace different how in the same sentence by using Matcher?" with "LOL do I replace different lol in the same sentence?"
If HOW is all caps, replace it with LOL. Otherwise, replace it with lol.
I only know how to find them:
String source = "HOW do I replace different how in the same " +
        "sentence by using Matcher?";
Pattern pattern = Pattern.compile("how", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(source);
while (m.find()) {
    if (m.group().matches("^[A-Z]*$"))
        System.out.println("I am uppercase");
    else
        System.out.println("I am lowercase");
}
But I don't know how to replace them by using matcher and pattern.
Here's one way to achieve your goal: (not necessarily the most efficient, but it works and is simply understood)
String source = "HOW do I replace different how in the same sentence by using Matcher?";
String[] split = source.replaceAll("HOW", "LOL").split(" ");
String newSource = "";
for(int i = 0; i < split.length; i++) {
String at = split[i];
if(at.equalsIgnoreCase("how")) at = "lol";
newSource+= " " + at;
}
newSource = newSource.substring(1);
// The output string is newSource
Replace all uppercase, then iterate over each word and replace the remaining "how"s with "lol". That substring at the end is simply to remove the extra space.
I came up with a really dumb solution:
String result = source;
result = result.replaceAll(old_Word, new_Word);
result = result.replaceAll(old_Word.toUpperCase(),
new_Word.toUpperCase());
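Since the question asks specifically about Matcher, here is a sketch that does the replacement in a single pass with Matcher.appendReplacement and Matcher.appendTail. The class and method names (CaseAwareReplace, replaceKeepingCase) are hypothetical, and the rule "all caps stays all caps, anything else becomes lowercase" is taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseAwareReplace {
    // Hypothetical helper: replaces oldWord with newWord, upper-casing the
    // replacement whenever the matched text is entirely upper case.
    static String replaceKeepingCase(String source, String oldWord, String newWord) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(oldWord) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String found = m.group();
            String replacement = found.equals(found.toUpperCase())
                    ? newWord.toUpperCase()
                    : newWord.toLowerCase();
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "HOW do I replace different how in the same sentence?";
        System.out.println(replaceKeepingCase(source, "how", "lol"));
    }
}
```

StringBuffer is used because appendReplacement accepts it on every Java version; a StringBuilder overload only exists from Java 9 on.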

Regular expression to remove everything but words. java

This code doesn't seem to do the right job. It removes the spaces between the words!
input = scan.nextLine().replaceAll("[^A-Za-z0-9]", "");
I want to remove all extra spaces and all numbers or abbreviations from a string, except words and this character: '.
For Example:
input: 34 4fF$##D one 233 r # o'clock 329riewio23
returns: one o'clock
public static String filter(String input) {
    return input.replaceAll("[^A-Za-z' ]", "").replaceAll(" +", " ").trim();
}
The first replace removes every character except letters, the single quote, and spaces. The second replace collapses each run of one or more spaces into a single space, and trim() drops any space left at either end. Note that the letters inside mixed tokens such as 4fF survive as fragments.
Your solution doesn't work because you don't replace numbers and you also replace the ' character.
Check out this solution:
Pattern pattern = Pattern.compile("(?:^| )([A-Za-z']{2,})(?= |$)");
String input = scan.nextLine();
Matcher matcher = pattern.matcher(input);
StringBuilder result = new StringBuilder();
while (matcher.find()) {
    if (result.length() > 0) {
        result.append(' ');
    }
    result.append(matcher.group(1));
}
System.out.println(result.toString());
It looks for the beginning of the string or a space ((?:^| )) and then captures all the following characters ([A-Za-z']). However, it only takes the word if there are 2 or more characters ({2,}), and the word has to be followed by a space or the end of the input ((?= |$)).
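An alternative that avoids the boundary handling altogether is to split on whitespace and keep only the tokens made up entirely of letters and apostrophes. This is a sketch, not the answerer's code: the names WordFilter and keepWords are made up, and the minimum length of two is carried over from the {2,} rule above.

```java
import java.util.ArrayList;
import java.util.List;

class WordFilter {
    // Hypothetical helper: keeps whitespace-separated tokens that consist only
    // of letters and apostrophes and are at least two characters long.
    static String keepWords(String input) {
        List<String> kept = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            // String.matches tests the whole token, so mixed tokens are rejected
            if (token.matches("[A-Za-z']{2,}")) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(keepWords("34 4fF$##D one 233 r # o'clock 329riewio23"));
    }
}
```

For the sample input from the question this yields one o'clock, because every other token either contains a non-letter or is a single character.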
If you want to just extract that time information use this regex group match:
input = scan.nextLine();
Pattern p = Pattern.compile("([a-zA-Z]{3,})\\s.*?(o'clock)");
Matcher m = p.matcher(input);
if (m.find()) {
input = m.group(1) + " " + m.group(2);
}
The regex is quite naive though, and will only work if the input is always of a similar format.

How to find if words are close to each and with the same ending letters, and put comma? (JTextArea)

First of all, I don't know if this is even possible. I need code that checks whether a JTextArea contains two or more adjacent words with the same ending (the same last two or more letters) and automatically puts a comma between them. For example, "I walked played with my dog" should be fixed to "I walked, played with my dog": a comma goes between walked and played because they are next to each other and their last two letters are the same. Can anyone help me? Thanks very much.
Regex based solution:
String inputString = "I walked played with bobby robby my dog";
Pattern p = Pattern.compile("([a-z]{2})\\s([a-z]{0,})\\1");
Matcher m = p.matcher(inputString);
while (m.find()) {
inputString = inputString.substring(0, m.start(2) - 1) + ", " + inputString.substring(m.start(2));
m = p.matcher(inputString);
}
The pattern searches for places where there are 2 letters, a space, then some more letters, then the first 2 letters again.
I tweaked the input string to prove it was working, and my output was as expected:
'I walked played with bobby robby my dog'
becomes:
'I walked, played with bobby, robby my dog'
addition: In order to increase the number of characters matched, increase the number in {2} to the desired value. If there is one specific pair you are looking for (e.g. ed) then change [a-z]{2} to be your desired characters. e.g.
Pattern p = Pattern.compile("(ed)\\s([a-z]{0,})\\1");
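A variation on the same pattern can do the whole job in one replaceAll call by moving the second word into a lookahead, so the text is not rebuilt inside a loop. This is a sketch under assumptions: the names CommaInserter and insertCommas are hypothetical, only lowercase endings are considered (as in the pattern above), and a trailing \b is added so the repeated letters must sit at the end of the second word.

```java
class CommaInserter {
    // Hypothetical helper: inserts ", " between two adjacent words whose
    // last two lowercase letters are identical.
    static String insertCommas(String text) {
        // ([a-z]{2})  last two letters of the first word
        // \s+         the whitespace between the words (replaced by ", ")
        // (?=...)     lookahead: the next word ends with the same two letters
        return text.replaceAll("([a-z]{2})\\s+(?=[a-z]*\\1\\b)", "$1, ");
    }

    public static void main(String[] args) {
        System.out.println(insertCommas("I walked played with bobby robby my dog"));
    }
}
```

Because the lookahead does not consume the second word, a run of three or more matching words is handled in the same pass.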
This should do it:
// First read the text from the text area:
String text = textArea1.getText();
// Split the string around spaces (that's enough based on your specification above)
String[] words = text.split(" ");
String lastLetters = "";
StringBuilder result = new StringBuilder();
// Go through building up the result by looking at one word at a time
for (String str : words) {
    if (lastLetters.length() == 2 && str.endsWith(lastLetters)) {
        result.append(", ");
    } else if (result.length() > 0) {
        result.append(" ");
    }
    result.append(str);
    // Remember the last two letters of this word (or the whole word if it is shorter)
    int start = Math.max(0, str.length() - 2);
    lastLetters = str.substring(start);
}
// Set the result into the other text area
textArea2.setText(result.toString());
You might need to tweak the parameters passed to substring to get the exact range of letters you need.

Retrieving Regex matched pattern

I need to retrieve the strings that match a regex pattern from a given input.
Let's say the pattern I need to get looks like this:
"http://mysite.com/<somerandomvalues>/images/<againsomerandomvalues>.jpg"
Now I created the following regex pattern for this,
http:\/\/.*\.mysite\.com\/.*\/images\/.*\.jpg
Can anybody illustrate how to retrieve all the matches for this regular expression using Java?
You don't need to escape slashes, only literal dots:
String regex = "http://(.*)\\.mysite\\.com/(.*)/images/(.*)\\.jpg";
String url = "http://www.mysite.com/work/images/cat.jpg";
Pattern pattern = Pattern.compile (regex);
Matcher matcher = pattern.matcher (url);
if (matcher.matches ())
{
int n = matcher.groupCount ();
for (int i = 0; i <= n; ++i)
System.out.println (matcher.group (i));
}
Result (group 0 is the whole match):
http://www.mysite.com/work/images/cat.jpg
www
work
cat
Some simple Java example:
String my_regex = "http://.*\\.mysite\\.com/.*/images/.*\\.jpg";
Pattern pattern = Pattern.compile(my_regex);
Matcher matcher = pattern.matcher(string_to_be_matched);
// Check all occurrences
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
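To collect every matching URL from a longer piece of text, the find() loop can be combined with capture groups. The sketch below is an assumption-laden illustration: the names ImageUrlFinder and findAll are made up, it targets the http://mysite.com/... form from the question (no subdomain), and it uses [^/\s]+ instead of .* so that one match cannot run across several URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageUrlFinder {
    // Group 1 is the folder segment, group 2 the image name without ".jpg"
    private static final Pattern IMAGE_URL = Pattern.compile(
            "http://mysite\\.com/([^/\\s]+)/images/([^/\\s]+)\\.jpg");

    // Hypothetical helper: returns every URL in text that matches the pattern.
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = IMAGE_URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group()); // group() is the whole match
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "see http://mysite.com/abc/images/cat.jpg and "
                + "http://mysite.com/x1/images/dog.jpg.";
        for (String url : findAll(text)) {
            System.out.println(url);
        }
    }
}
```

Inside the loop, matcher.group(1) and matcher.group(2) give the two variable parts if only those are needed.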
In fact, it is not clear if you want the whole matching string or only the groups.
Bogdan Emil Mariesan's answer can be reduced to
if ( matcher.matches () ) System.out.println(string_to_be_matched);
because you know it is matched and there are no groups.
IMHO, user unknown's answer is correct if you want to get matched groups.
I just want to add additional information (for others) that if you need matched group you can use replaceFirst() method too:
String firstGroup = string.replaceFirst("http://mysite\\.com/(.*)/images/.*", "$1");
The pattern has to cover the whole string, because replaceFirst() leaves any unmatched text in place. The Pattern.compile approach performs better if there are two or more groups or if you need to do this many times (on the other hand, in programming contests, for example, replaceFirst() is faster to write).

How can I filter out non letters from a text file using the scanner delimiter including the single quote or apostrophe in Java

I want to keep a count of every word in a file, and this count should not include non-letters such as the apostrophe, comma, full stop, question mark, exclamation mark, etc., i.e. just letters of the alphabet.
I tried to use a delimiter like this, but it didn't include the apostrophe.
Scanner fileScanner = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
int totalWordCount = 0;
//Firstly to count all the words in the file without the restricted characters
while (fileScanner.hasNext()) {
fileScanner.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
//Then later I create an array to store each individual word in the file for counting their lengths.
Scanner fileScanner2 = new Scanner("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt");
String[] words = new String[totalWordCount];
for (int i = 0; i < totalWordCount; ++i) {
words[i] = fileScanner2.useDelimiter(("[.,:;()?!\" \t\n\r]+")).next();
}
This doesn't seem to work!
How can I go about this?
Seems to me that you don't want to filter using anything but spaces and end lines. For example the word "they're" would return as two words if you're using a ' to filter your number of words. Here's how you could change your original code to make it work.
Scanner fileScanner = new Scanner(new File("C:\\MyJavaFolder\\JavaAssignment1\\TestFile.txt"));
int totalWordCount = 0;
ArrayList<String> words = new ArrayList<String>();
//Firstly to count all the words in the file without the restricted characters
while (fileScanner.hasNext()) {
//Add words to an array list so you only have to go through the scanner once
words.add(fileScanner.next());//This defaults to whitespace
totalWordCount++;
}
System.out.println("There are " + totalWordCount + " word(s)");
fileScanner.close();
Using the Pattern.compile() turns your string into a regular expression. The '\s' character is predefined in the Pattern class to match all white space characters.
There is more information in the Pattern documentation.
Also, make sure to close your Scanner objects when you're done. Leaving the first one open could prevent your second scanner from opening the file.
Edit
If you want to count the letters per word you can add the following code to the above code
int totalLetters = 0;
int[] lettersPerWord = new int[words.size()];
for (int wordNum = 0; wordNum < words.size(); wordNum++)
{
String word = words.get(wordNum);
word = word.replaceAll("[.,:;()?!\" \t\n\r\']+", "");
lettersPerWord[wordNum] = word.length();
totalLetters += word.length();
}
I have tested this code and it appears to work for me. The replaceAll, according to the JavaDoc uses a regular expression to match so it should match any of those characters and essentially remove it.
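Scanner.useDelimiter(String) does interpret its argument as a regular expression, so "[.,:;()?!\" \t\n\r]+" splits on runs of those characters. The more likely problem is that new Scanner(String) scans the path text itself rather than the file; pass a File instead.
Alternatively, you can use the regex classes directly instead of the delimiter.
Pattern and Matcher with the group method may be what you're looking for.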
String pattern = "(.*)[.,:;()?!\" \t\n\r]+(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find( )) {
System.out.println("Found value: " + m.group(1) );
}
Play with those classes and you will see it is much more similar to what you need
You could try this regex in your delimiter:
fileScanner.useDelimiter("[^a-zA-Z']+").next();
This will use any run of characters that are neither letters nor apostrophes as a delimiter. That way your words will include the apostrophe but not any other non-letter character.
Then you'll have to loop through each word and check for apostrophes and account for them if you want the length to be accurate. You could just remove each apostrophe and the length will match the number of letters in the word, or you could create word objects with their own length fields, so that you can print the word as is, and know the number of letter characters in that word.
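Putting the delimiter and the apostrophe handling together might look like the sketch below. The names WordLetterCounter, readWords and letterCount are hypothetical, and the Scanner here reads from an in-memory string only so the example runs without a file; for the real file it would be built with new Scanner(new File(...)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordLetterCounter {
    // Hypothetical helper: reads words made of letters and apostrophes.
    static List<String> readWords(Scanner scanner) {
        // Any run of characters that are not letters or apostrophes separates words
        scanner.useDelimiter("[^a-zA-Z']+");
        List<String> words = new ArrayList<>();
        while (scanner.hasNext()) {
            String word = scanner.next();
            // Skip tokens that consist of apostrophes only
            if (letterCount(word) > 0) {
                words.add(word);
            }
        }
        return words;
    }

    // Number of letters in a word, ignoring apostrophes
    static int letterCount(String word) {
        return word.replace("'", "").length();
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner("They're here, aren't they? Yes!");
        List<String> words = readWords(scanner);
        scanner.close();
        System.out.println("There are " + words.size() + " word(s)");
        for (String word : words) {
            System.out.println(word + ": " + letterCount(word));
        }
    }
}
```

Keeping the words in a list means the input only has to be scanned once, and the per-word letter counts can be computed afterwards.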
