Scanning string for keywords of various lengths

I want to scan my document, split into an array of words, for certain keywords such as 'Fuel', 'Vehicle', 'Vehicle Leasing', 'Asset Type Maintenance', etc. The problem is that the keywords have different lengths: one is a single-word keyword, another is a four-word keyword. At the moment I'm scanning word by word, but that approach can't match multi-word keywords such as 'Vehicle Leasing'.
What can I do to improve my code so that it works with multi-word keywords?
This is how it looks now:
public void findKeywords(POITextExtractor te, ArrayList<HashMap<String,Integer>> listOfHashMaps, ArrayList<Integer> KeywordsFound, ArrayList<Integer> existingTags) {
String document = te.getText().toString();
String[] words = document.split("\\s+");
int wordsNo = 0;
int keywordsMatched = 0;
try {
for(String word : words) {
wordsNo++;
for(HashMap<String, Integer> hashmap : listOfHashMaps) {
if(hashmap.containsKey(word) && !KeywordsFound.contains(hashmap.get(word)) && !existingTags.contains(hashmap.get(word))) {
KeywordsFound.add(hashmap.get(word));
keywordsMatched++;
System.out.println(word);
}
}
}
System.out.println("New keywords found: " + KeywordsFound);
System.out.println("Number of words in document = " + wordsNo);
System.out.println("Number of keywords matched: " + keywordsMatched);
} catch (IllegalArgumentException e) {
e.printStackTrace();
}
}
I have included my method. If there's anything else required to understand my code, leave a comment please.
UPDATE
public void findKeywords(POITextExtractor te, ArrayList<HashMap<String,Integer>> listOfHashMaps, ArrayList<Integer> KeywordsFound, ArrayList<Integer> existingTags) {
    String document = te.getText().toString();
    String[] words = document.split("\\s+");
    int wordsNo = 0;
    int keywordsMatched = 0;
    for (HashMap<String, Integer> hashmap : listOfHashMaps) {
        // Iterate read-only: the keyword maps are left untouched so they can be reused
        for (Map.Entry<String, Integer> pair : hashmap.entrySet()) {
            if (document.contains(pair.getKey()) && !KeywordsFound.contains(pair.getValue()) && !existingTags.contains(pair.getValue())) {
                System.out.println(pair.getKey());
                KeywordsFound.add(pair.getValue());
                keywordsMatched++;
            }
        }
    }
    System.out.println("New keywords found: " + KeywordsFound);
    System.out.println("Number of keywords matched: " + keywordsMatched);
}

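Another way of doing it would be to split the document by each search string.
For example:
List<String> searchStrings = new ArrayList<>();
searchStrings.add("Fuel");
searchStrings.add("Asset Type Maintenance");
searchStrings.add("Vehicle Leasing");
String document = ""; // Assuming that your complete string is initialized here.
for (String str : searchStrings) {
    // Pattern.quote (java.util.regex.Pattern) treats the keyword literally;
    // the limit -1 keeps trailing empty strings so a match at the very end is counted too
    String[] tempDoc = document.split(Pattern.quote(str), -1);
    System.out.println(str + " is repeated " + (tempDoc.length - 1) + " times");
}
Note that this allocates a new array of substrings for every keyword, so on large documents it might thrash the JVM in garbage collection.
You can compare the performance on your own.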
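If the temporary arrays created by split are a concern, occurrences can also be counted with String.indexOf, which creates no substrings at all. A minimal sketch, assuming non-overlapping, case-sensitive matches are what is wanted (OccurrenceCounter and countOccurrences are made-up names for illustration):

```java
public class OccurrenceCounter {

    // Counts non-overlapping occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // an empty search string would otherwise loop forever
        }
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            // continue searching right after the match that was just found
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String document = "Fuel and Vehicle Leasing; more Fuel";
        System.out.println("Fuel: " + countOccurrences(document, "Fuel"));
        System.out.println("Vehicle Leasing: " + countOccurrences(document, "Vehicle Leasing"));
    }
}
```

Like document.contains, this matches raw substrings, so 'Fuel' would also be counted inside a longer word such as 'Fuels'.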

I assume this is a kind of homework. Therefore:
Have a look at string search algorithms that search for a substring (the pattern) in a larger string.
Then assume that you use one of these algorithms, but instead of searching for a sequence of chars (the pattern) in a larger sequence of chars, you search for a sequence of strings (the pattern) in a larger sequence of strings. In other words, you just have a different, much larger alphabet.
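One way to read this suggestion is to treat each keyword as a sequence of words and slide it over the word array, exactly as a naive substring search slides a pattern over a string. A rough sketch, assuming exact, case-sensitive word matches and a single keyword-to-tag-id map (KeywordScanner, findTags and occursIn are hypothetical names, not part of the question's code):

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class KeywordScanner {

    // Returns the tag ids of all keywords whose word sequence occurs in the document.
    static Set<Integer> findTags(String document, Map<String, Integer> keywords) {
        String[] words = document.trim().split("\\s+");
        Set<Integer> found = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : keywords.entrySet()) {
            // The keyword itself becomes the pattern: a short array of words
            String[] pattern = entry.getKey().split("\\s+");
            if (occursIn(words, pattern)) {
                found.add(entry.getValue());
            }
        }
        return found;
    }

    // Naive search: true if pattern appears as a contiguous run inside words.
    static boolean occursIn(String[] words, String[] pattern) {
        for (int start = 0; start + pattern.length <= words.length; start++) {
            int k = 0;
            while (k < pattern.length && words[start + k].equals(pattern[k])) {
                k++;
            }
            if (k == pattern.length) {
                return true;
            }
        }
        return false;
    }
}
```

Because whole words are compared, 'Fuel' no longer matches inside 'Fuels', but punctuation attached to a word ('Fuel,') also prevents a match unless it is stripped first. A faster algorithm such as Knuth-Morris-Pratt could replace occursIn without changing the rest.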

Related

Find words in String consisting of all distinct characters without using Java Collection Framework

I need your help. I have been stuck on one problem for several hours.
/**
 * Find word containing only of various characters. Return first word if there are a few of such words.
 * @param words Input array of words
 * @return First word that containing only of various characters
 */
public String findWordConsistingOfVariousCharacters(String[] words) {
    throw new UnsupportedOperationException("You need to implement this method");
}
@Test
public void testFindWordConsistingOfVariousCharacters() {
String[] input = new String[] {"aaaaaaawe", "qwer", "128883", "4321"};
String expectedResult = "qwer";
StringProcessor stringProcessor = new StringProcessor();
String result = stringProcessor.findWordConsistingOfVariousCharacters(input);
assertThat(String.format("Wrong result of method findWordConsistingOfVariousCharacters (input is %s)", Arrays.toString(input)), result, is(expectedResult));
}
Thank you in advance

Just go through the data and check whether each string is made up of only distinct characters:
public static boolean hasDistinctChars(String str) {
    char[] chars = str.toCharArray();
    Arrays.sort(chars); // After sorting, equal characters sit next to each other
    for (int i = 1; i < chars.length; i++) {
        if (chars[i] == chars[i - 1]) {
            return false; // Same character appeared twice
        }
    }
    return true; // There is no repeating character
}
The method above checks whether a string is made up of distinct characters. Now loop through the data:
for (int i = 0; i < input.length; i++) {
    if (hasDistinctChars(input[i])) {
        System.out.println("The answer is " + input[i] + " at index " + i);
        break; // Found it, so stop the loop
    }
}

Assuming the strings are all ASCII characters, use a boolean[] to mark if you have encountered that character in the word already:
boolean [] encountered = new boolean[256];
for (char c : word.toCharArray()) {
if (encountered[(int)c]) {
// not unique
} else {
encountered[(int)c] = true;
}
}
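Put together with the method from the question, the boolean[] idea could look like the sketch below. It assumes every character code is below 256 and, as arbitrary choices made here, treats any other character as disqualifying and returns null when no word qualifies (the exercise does not say what to return in that case); hasOnlyDistinctChars is an illustrative helper name.

```java
public class StringProcessor {

    // Returns the first word whose characters are all different, or null if there is none.
    public String findWordConsistingOfVariousCharacters(String[] words) {
        for (String word : words) {
            if (hasOnlyDistinctChars(word)) {
                return word;
            }
        }
        return null; // assumption: the exercise does not define this case
    }

    private static boolean hasOnlyDistinctChars(String word) {
        boolean[] encountered = new boolean[256];
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            // Characters outside the table are rejected rather than risking an out-of-bounds index
            if (c >= encountered.length || encountered[c]) {
                return false;
            }
            encountered[c] = true;
        }
        return true;
    }
}
```

Each word is scanned once with no sorting and no collections, which fits the constraint in the title.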

Find the most common word from user input

I'm very new to Java and am creating a software application that lets a user input text into a field; the program then runs through all of the text and identifies the most common word. At the moment, my code looks like this:
JButton btnMostFrequentWord = new JButton("Most Frequent Word");
btnMostFrequentWord.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
String text = textArea.getText();
String[] words = text.split("\\s+");
HashMap<String, Integer> occurrences = new HashMap<String, Integer>();
for (String word : words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
}
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + occurrences.values());
}
});
This just prints what the values of the words are, but I would like it to tell me what the number one most common word is instead. Any help would be really appreciated.

Just after your for loop, you can sort the map entries by value in descending order and select the first one.
for (String word: words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
}
Map.Entry<String,Integer> tempResult = occurrences.entrySet().stream()
.sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
.findFirst().get();
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + tempResult.getKey());
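A full sort is not strictly required just to pick the top entry. As a sketch of the same lookup using Collections.max with the same comparator (MostFrequentWord and mostFrequent are illustrative names, and the counting loop is condensed with Map.merge):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class MostFrequentWord {

    // Returns the word with the highest count; ties are resolved arbitrarily.
    static String mostFrequent(String text) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            // merge inserts 1 for a new word, otherwise adds 1 to the existing count
            occurrences.merge(word, 1, Integer::sum);
        }
        return Collections.max(occurrences.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

In the action listener this would be called as mostFrequent(textArea.getText()).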

For anyone who is more familiar with Java, here is a very easy way to do it with Java 8:
List<String> words = Arrays.asList(text.split("\\s+"));
Collections.sort(words, Comparator.comparingInt(word -> {
return Collections.frequency(words, word);
}).reversed());
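After sorting, the most common word is stored in words.get(0). Note that Collections.frequency rescans the whole list on every comparison, so this is best kept to short inputs.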

I would do something like this
int max = 0;
String a = null;
for (String word : words) {
int value = 0;
if(occurrences.containsKey(word)){
value = occurrences.get(word);
}
occurrences.put(word, value + 1);
if(max < value+1){
max = value+1;
a = word;
}
}
System.out.println(a);
You could sort instead, and the solution would be much shorter, but this version finds the maximum in the same single pass that counts the words, so I think it runs faster.

You can either iterate through the occurrences map afterwards and find the max, or track the maximum while counting, as below:
String text = textArea.getText();
String[] words = text.split("\\s+");
HashMap<String, Integer> occurrences = new HashMap<>();
int mostFreq = -1;
String mostFreqWord = null;
for (String word : words) {
int value = 0;
if (occurrences.containsKey(word)) {
value = occurrences.get(word);
}
value = value + 1;
occurrences.put(word, value);
if (value > mostFreq) {
mostFreq = value;
mostFreqWord = word;
}
}
JOptionPane.showMessageDialog(null, "Most Frequent Word: " + mostFreqWord);

Split string after every 2 words and store into list

I have a string of words as follows:
String words = "disaster kill people action scary seriously world murder loose world";
Now, I wish to split every 2 words and store them into a list so that it will produce something like:
[disaster kill, people action, scary seriously,...]
The problem with my code is that it splits whenever it encounters a space. How do I modify it so that an entry is added to the list only at every 2nd space, preserving the space after each word?
My code:
ArrayList<String> wordArrayList = new ArrayList<String>();
for(String word : joined.split(" ")) {
wordArrayList.add(word);
}
Thanks.

Use this regular expression: (?<!\\G\\S+)\\s.
PROOF:
String words = "disaster kill people action scary seriously world murder loose world";
String[] result = words.split("(?<!\\G\\S+)\\s");
System.out.printf("%s%n", Arrays.toString(result));
And the result:
[disaster kill, people action, scary seriously, world murder, loose world]

Your loop should leave you with an ArrayList<String> that holds each word, right? All you need to do now is iterate through that list and combine the words in pairs.
ArrayList<String> finalList = new ArrayList<String>();
for (int i = 0; i < wordArrayList.size(); i += 2) {
    // Only complete pairs are added; a trailing unpaired word is skipped
    if (i + 1 < wordArrayList.size()) {
        finalList.add(wordArrayList.get(i) + " " + wordArrayList.get(i + 1));
    }
}
This takes your split words and adds them to the list with spaces so that they look like your desired output.

I was looking for a way to split a string after every 'n' words, so I modified the above solution:
private List<String> splitParagraph(int splitAfterWords, String someLargeText) {
    String[] para = someLargeText.split(" ");
    List<String> data = new ArrayList<>();
    for (int i = 0; i < para.length; i += splitAfterWords) {
        // Only full groups are kept; leftover words at the end are dropped
        if (i + (splitAfterWords - 1) < para.length) {
            StringBuilder compiledString = new StringBuilder();
            for (int f = i; f <= i + (splitAfterWords - 1); f++) {
                compiledString.append(para[f]).append(" ");
            }
            data.add(compiledString.toString());
        }
    }
    return data;
}
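A variation on the method above that also keeps the leftover words in a final, shorter group might look like this. It is only a sketch: WordChunker and chunk are made-up names, words are assumed to be separated by whitespace, and no trailing space is appended to each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordChunker {

    // Splits text into groups of at most wordsPerChunk words, joined by single spaces.
    static List<String> chunk(String text, int wordsPerChunk) {
        if (wordsPerChunk < 1) {
            throw new IllegalArgumentException("wordsPerChunk must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks; // no words, no groups
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i += wordsPerChunk) {
            // The last group may be shorter than wordsPerChunk
            int end = Math.min(i + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }
}
```

With the sentence from the question and wordsPerChunk set to 2, this yields the pairs the asker listed.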

I ran into this problem today, with the extra difficulty of having to write the solution in Scala. So, I needed to write a recursive solution that looks like:
val stringToSplit = "THIS IS A STRING THAT WE NEED TO SPLIT EVERY 2 WORDS"
@tailrec
def obtainCombinations(
value: String,
elements: List[String],
res: List[String]
): List[String] = {
if (elements.isEmpty)
res
else
obtainCombinations(elements.head, elements.tail, res :+ value + ' ' + elements.head)
}
obtainCombinations(
stringToSplit.split(' ').head,
stringToSplit.split(' ').toList.tail,
List.empty
)
The output, a list of overlapping consecutive pairs, will be:
res0: List[String] = List(THIS IS, IS A, A STRING, STRING THAT, THAT WE, WE NEED, NEED TO, TO SPLIT, SPLIT EVERY, EVERY 2, 2 WORDS)
Porting this to Java would be:
String stringToSplit = "THIS IS A STRING THAT WE NEED TO SPLIT EVERY 2 WORDS";
public ArrayList<String> obtainCombinations(String value, List<String> elements, ArrayList<String> res) {
if (elements.isEmpty()) {
return res;
} else {
res.add(value + " " + elements.get(0));
return obtainCombinations(elements.get(0), elements.subList(1, elements.size()), res);
}
}
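List<String> tokens = Arrays.asList(stringToSplit.split(" "));
// Pass the tail (everything after the first word), as the Scala version does
ArrayList<String> result =
    obtainCombinations(tokens.get(0),
        tokens.subList(1, tokens.size()),
        new ArrayList<>());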

remove repeated words from String Array

Good morning,
I wrote a function that calculates the frequency of a term:
public static int tfCalculator(String[] totalterms, String termToCheck) {
int count = 0; //to count the overall occurrence of the term termToCheck
for (String s : totalterms) {
if (s.equalsIgnoreCase(termToCheck)) {
count++;
}
}
return count;
}
After that, I use it in the code below to calculate the frequency of every word in a String[] words:
for(String word:words){
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
The problem is that the words are repeated. Here is an example of the result:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
cytoskeletal|2
...
...
Can someone help me remove the repeated words so that the result looks like this:
cytoskeletal|2
network|1
enable|1
equal|1
spindle|1
...
...
Thank you very much!

Java 8 solution:
words = Arrays.stream(words).distinct().toArray(String[]::new);
The distinct method removes duplicates, and words is replaced with a new array that no longer contains them.

I think you want to print the frequency of each string in the array totalterms. A Map is an easier solution here: a single traversal of the array stores the frequency of every string. Check the following implementation.
public static void printFrequency(String[] totalterms)
{
    Map<String, Integer> frequencyMap = new HashMap<String, Integer>();
    for (String string : totalterms) {
        if (frequencyMap.containsKey(string))
        {
            Integer count = frequencyMap.get(string);
            frequencyMap.put(string, count + 1);
        }
        else
        {
            frequencyMap.put(string, 1);
        }
    }
    Set<Map.Entry<String, Integer>> elements = frequencyMap.entrySet();
    for (Map.Entry<String, Integer> entry : elements) {
        System.out.println(entry.getKey() + "|" + entry.getValue());
    }
}
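On Java 8 and later the same single-pass count can be written more compactly with Map.merge. The sketch below makes two assumptions that go beyond the answer above: terms are lower-cased so that counting ignores case (mirroring the equalsIgnoreCase in the question), and a LinkedHashMap is used so the output keeps the order of first appearance. TermFrequency and count are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TermFrequency {

    // Counts each term once, ignoring case, and keeps the order of first appearance.
    static Map<String, Integer> count(String[] terms) {
        Map<String, Integer> frequency = new LinkedHashMap<>();
        for (String term : terms) {
            // merge inserts 1 for a new key, otherwise adds 1 to the stored count
            frequency.merge(term.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        String[] words = {"cytoskeletal", "network", "enable", "Cytoskeletal"};
        count(words).forEach((term, n) -> System.out.println(term + "|" + n));
    }
}
```

Each word is printed exactly once, and the array is traversed a single time instead of once per word as with tfCalculator.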

You can just use a HashSet and that should take care of the duplicates issue:
words = new HashSet<String>(Arrays.asList(words)).toArray(new String[0]);
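This will take your array, convert it to a List, feed that to the constructor of HashSet<String>, and then convert it back to an array for you. Note that HashSet does not preserve the original order of the words; use LinkedHashSet if the order matters.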

Sort the array, then you can just count equal adjacent elements:
Arrays.sort(totalterms);
int i = 0;
while (i < totalterms.length) {
int start = i;
while (i < totalterms.length && totalterms[i].equals(totalterms[start])) {
++i;
}
System.out.println(totalterms[start] + "|" + (i - start));
}

In two lines:
String s = "cytoskeletal|2 - network|1 - enable|1 - equal|1 - spindle|1 - cytoskeletal|2";
System.out.println(new LinkedHashSet<>(Arrays.asList(s.split(" - "))).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", " - "));

Your code is fine; you just need to keep track of which words were encountered already. For that you can keep a running set:
Set<String> prevWords = new HashSet<>();
for(String word:words){
// proceed if word is new to the set, otherwise skip
if (prevWords.add(word)) {
int freq = tfCalculator(words, word);
System.out.println(word + "|" + freq);
mm+=word + "|" + freq+"\n";
}
}

More efficient way of getting frequency of words

I want to count the words in an ArrayList by their first letter, e.g. [cat, cog, mouse] means there are 2 words beginning with c and one word beginning with m. The code I have works fine, but there are 26 letters in the alphabet, which would require a lot more ifs. Is there any other way of doing this?
public static void countAlphabeticalWords(ArrayList<String> arrayList) throws IOException
{
int counta =0, countb=0, countc=0, countd=0,counte=0;
String word = "";
for(int i = 0; i<arrayList.size();i++)
{
word = arrayList.get(i);
if (word.charAt(0) == 'a' || word.charAt(0) == 'A'){ counta++;}
if (word.charAt(0) == 'b' || word.charAt(0) == 'B'){ countb++;}
}
System.out.println("The number of words begining with A are: " + counta);
System.out.println("The number of words begining with B are: " + countb);
}

Use a Map:
public static void countAlphabeticalWords(List<String> arrayList) {
    Map<Character, Integer> counts = new HashMap<Character, Integer>();
    for (String word : arrayList) {
        Character c = Character.toUpperCase(word.charAt(0));
        if (counts.containsKey(c)) {
            counts.put(c, counts.get(c) + 1);
        }
        else {
            counts.put(c, 1);
        }
    }
    for (Map.Entry<Character, Integer> entry : counts.entrySet()) {
        System.out.println("The number of words beginning with " + entry.getKey() + " are: " + entry.getValue());
    }
}
Or use a Map and AtomicInteger (as per Jarrod Roberson):
public static void countAlphabeticalWords(List<String> arrayList) {
    Map<Character, AtomicInteger> counts = new HashMap<Character, AtomicInteger>();
    for (String word : arrayList) {
        Character c = Character.toUpperCase(word.charAt(0));
        if (counts.containsKey(c)) {
            counts.get(c).incrementAndGet();
        }
        else {
            counts.put(c, new AtomicInteger(1));
        }
    }
    for (Map.Entry<Character, AtomicInteger> entry : counts.entrySet()) {
        System.out.println("The number of words beginning with " + entry.getKey() + " are: " + entry.getValue());
    }
}
Best Practices
Avoid list.get(i) loops; use for (element : list) instead. And don't use ArrayList in a signature; use the List interface instead so that you can change the implementation.

How about this? Considering that the words start only with [a-zA-Z]:
public static int[] getCount(List<String> arrayList) {
int[] data = new int[26];
final int a = (int) 'a';
for(String s : arrayList) {
data[((int) Character.toLowerCase(s.charAt(0))) - a]++;
}
return data;
}
edit:
Just out of curiosity, I made a very simple test comparing my method and Steph's method with map.
List with 236 items, 10000000 iterations (without printing the result): my code took ~10000ms and Steph's took ~65000ms.
Test: http://pastebin.com/HNBgKFRk
Data: http://pastebin.com/UhCtapZZ
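The array approach above throws an ArrayIndexOutOfBoundsException as soon as a word starts with anything other than a-z or A-Z, and charAt(0) fails on an empty string. A guarded sketch, where skipping such words is a choice made here rather than something the question specifies (InitialLetterCounter and countByInitial are hypothetical names):

```java
import java.util.Arrays;
import java.util.List;

public class InitialLetterCounter {

    // counts[0] holds the words starting with 'a' or 'A', counts[25] those starting with 'z' or 'Z'.
    static int[] countByInitial(List<String> words) {
        int[] counts = new int[26];
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // nothing to inspect
            }
            char first = Character.toLowerCase(word.charAt(0));
            if (first >= 'a' && first <= 'z') {
                counts[first - 'a']++;
            }
            // other initials (digits, punctuation, accented letters) are skipped
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countByInitial(Arrays.asList("cat", "Cog", "mouse", "42"));
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println("The number of words beginning with " + (char) ('a' + i) + " is: " + counts[i]);
            }
        }
    }
}
```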

Every character can be cast to an integer, which for ASCII characters is its ASCII code. For example, (int) 'a' is 97 and (int) 'z' is 122. http://www.asciitable.com/
You can create a lookup table for the characters:
int[] characters = new int[128];
Then, in your algorithm's loop, use the ASCII code of the lowercased first character as the index and increment the value:
word = arrayList.get(i);
characters[Character.toLowerCase(word.charAt(0))]++;
In the end, you can print the occurrences for each letter:
for (int i = 97; i <= 122; i++) {
    System.out.println(String.format("The number of words beginning with %s are: %d", (char) i, characters[i]));
}
