Counting the number of occurrences of words in a text (Java)

So I'm building a TreeMap from scratch and I'm trying to count the number of occurrences of every word in a text using Java. The text is read from a text file, but I can easily read it from there. I really don't know how to count every word, can someone help?
Imagine the text is something like:
Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.
Output:
Over 1
time 1
computer 1
algorithms 5
...
If possible, I want to ignore case and count upper-case and lower-case forms of a word together.
EDIT: I don't want to use any sort of Map (e.g. HashMap) or something similar to do this.

Break down the problem as follows (this is one potential solution - not THE solution):
Split the text into words (create a list or array of words).
Remove punctuation marks.
Create your map to collect results.
Iterate over your list of words and add "1" to the value of each encountered key
Display results (Iterate over the map's EntrySet)
Split the text into words
My preference is to split words using a space as the delimiter. The reason is that if you split on non-word characters, you may miss some hyphenated words. Although hyphenation is becoming less common, plenty of words still follow this rule, for example, middle-aged. If a word such as this is encountered, it MIGHT have to be treated as one word and not two.
Remove punctuation marks
Because of the decision above, you will first need to remove punctuation characters that might be attached to your words. Keep in mind that if you use a regular expression to split the words, you might be able to accomplish this step at the same time as the step above. In fact, that would be preferred, so that you don't have to iterate over the text twice: do both in a single pass. While you are at it, call toLowerCase() on the input string to eliminate the ambiguity between capitalized and lowercase words.
Create your map to collect results
This is where you are going to collect your counts, using the TreeMap implementation of the Java Map interface. One thing to be aware of with this particular implementation is that the map is sorted according to the natural ordering of its keys. In this case, since the keys are the words from the input text, the keys will be arranged in alphabetical order, not by the magnitude of the count. If sorting the entries by count is important, there is a technique where you "reverse" the map, making the values the keys and the keys the values. However, since two or more words could have the same count, you will need to create a new map of <Integer, Set<String>>, so that you can group together words with the same count.
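The "reversed" map described above could look something like the following. This is only a sketch: the class name CountInverter and the method invert are made up for illustration, and it assumes the word counts have already been collected into a Map<String, Integer>.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CountInverter {
    // Turns word -> count into count -> words, with the highest count first.
    public static Map<Integer, Set<String>> invert(Map<String, Integer> wordCount) {
        Map<Integer, Set<String>> byCount = new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            // Words that share a count end up in the same (alphabetically sorted) set.
            byCount.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        wordCount.put("of", 2);
        wordCount.put("time", 1);
        System.out.println(invert(wordCount)); // {5=[algorithms], 2=[for, of], 1=[time]}
    }
}
```

Using a TreeSet for the grouped words keeps ties in alphabetical order; a plain HashSet would also work if that order does not matter.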
Iterate over your list of words
At this point, you should have a list of words and a map structure to collect the count. Using a lambda expression, you should be able to perform a count() of your words very easily. But if you are not familiar or comfortable with lambda expressions, you can use a regular loop to iterate over your list, do a containsKey() check to see if the word was encountered before, get() the value if the map already contains the word, and then add "1" to the previous value. Lastly, put() the new count in the map.
Display results
Again, you can use a Lambda Expression to print out the EntrySet key value pairs or simply iterate over the entry set to display the results.
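For readers who are comfortable with lambdas, the counting and display steps might be condensed as below. This is a sketch rather than the solution: the class name MergeCount is invented, and it relies on Map.merge and Map.forEach, which exist since Java 8.

```java
import java.util.Map;
import java.util.TreeMap;

public class MergeCount {
    public static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new TreeMap<>();
        for (String word : words) {
            // Stores 1 for a new word, otherwise adds 1 to the stored count.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] words = {"other", "algorithms", "to", "other", "algorithms"};
        // Map.forEach takes a lambda that receives each key and value.
        count(words).forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```

merge() replaces the containsKey()/get()/put() sequence with a single call, which also avoids looking the key up twice.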
Based on all of the above points, a potential solution could look like this (not using lambdas, for the OP's sake):
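// Needs: import java.util.Iterator; import java.util.Map; import java.util.Map.Entry; import java.util.TreeMap;
public static void main(String[] args) {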
String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
text = text.replaceAll("\\p{P}", ""); // replace all punctuations
text = text.toLowerCase(); // turn all words into lowercase
String[] wordArr = text.split(" "); // create list of words
Map<String, Integer> wordCount = new TreeMap<>();
// Collect the word count
for (String word : wordArr) {
if(!wordCount.containsKey(word)){
wordCount.put(word, 1);
} else {
int count = wordCount.get(word);
wordCount.put(word, count + 1);
}
}
Iterator<Entry<String, Integer>> iter = wordCount.entrySet().iterator();
System.out.println("Output: ");
while(iter.hasNext()) {
Entry<String, Integer> entry = iter.next();
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
This produces the following output
Output:
advantage: 1
algorithms: 5
and: 1
combine: 1
computer: 1
each: 1
engineers: 1
even: 1
for: 2
in: 1
invent: 1
more: 1
new: 1
of: 2
other: 2
others: 1
over: 1
producing: 1
results: 2
take: 1
the: 1
things: 1
time: 1
to: 1
turn: 1
utilize: 1
with: 1
work: 1
Why did I break down the problem like this for such a mundane task? Simple. I believe each of those discrete steps should be extracted into functions to improve code reusability. Yes, it is cool to use a lambda expression to do everything at once and make your code look much simpler. But what if you need some intermediate step over and over? Most of the time, code gets duplicated to accomplish this. In reality, a better solution is often to break these tasks into methods. Some of these tasks, like transforming the input text, can be done in a single method since those activities are related in nature. (There is such a thing as a method doing "too little.")
public String[] createWordList(String text) {
return text.replaceAll("\\p{P}", "").toLowerCase().split(" ");
}
public Map<String, Integer> createWordCountMap(String[] wordArr) {
Map<String, Integer> wordCountMap = new TreeMap<>();
for (String word : wordArr) {
if(!wordCountMap.containsKey(word)){
wordCountMap.put(word, 1);
} else {
int count = wordCountMap.get(word);
wordCountMap.put(word, count + 1);
}
}
return wordCountMap;
}
public void displayCount(Map<String, Integer> wordCountMap) {
Iterator<Entry<String, Integer>> iter = wordCountMap.entrySet().iterator();
while(iter.hasNext()) {
Entry<String, Integer> entry = iter.next();
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
Now, after doing that, your main method looks more readable and your code is more reusable.
public static void main(String[] args) {
WordCount wc = new WordCount();
String text = "...";
String[] wordArr = wc.createWordList(text);
Map<String, Integer> wordCountMap = wc.createWordCountMap(wordArr);
wc.displayCount(wordCountMap);
}
UPDATE:
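One small detail I forgot to mention: if a HashMap is used instead of a TreeMap, the output is no longer sorted alphabetically. Note that HashMap makes no guarantee about iteration order at all; the order depends on the hash codes of the keys (the words) and on the table size, not on the counts. Any resemblance to a descending-by-count order is a coincidence of this particular input, so if you need that order you still have to sort by value or "reverse" the map as described earlier. After switching to HashMap, the output I got was as follows: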
Output:
algorithms: 5
other: 2
for: 2
turn: 1
computer: 1
producing: 1
...
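If the output really needs to be ordered by count, one option that does not depend on any map's iteration order is to copy the entries into a list and sort that list explicitly. The sketch below assumes the counts are already in a Map<String, Integer>; the class name SortByCount is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortByCount {
    public static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> wordCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());
        entries.sort((a, b) -> {
            // Descending by count; ties are broken alphabetically so the result is deterministic.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("time", 1);
        wordCount.put("other", 2);
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        for (Map.Entry<String, Integer> e : sortedByCount(wordCount)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

This works the same way for a TreeMap or a HashMap, because the order comes from the sort and not from the map.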

My suggestion is to use a regular expression with split, and a stream with grouping (see example 3).
EX1: this solution does not use a collection (List/Map), only an array. In my opinion it is not optimal.
@Test
public void testApp2() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
final String lowerText = text.toLowerCase();
final String[] split = lowerText.split("\\W+");
System.out.println("Output: ");
for (String s : split) {
if (s == null) {
continue;
}
int count = 0;
for (int i = 0; i < split.length; i++) {
final boolean sameWord = s.equals(split[i]);
if (sameWord) {
count = count + 1;
split[i] = null;
}
}
System.out.println(s + " " + count);
}
}
EX2: I think this is what you mean, but I'm not sure whether using a List here is already too much.
@Test
public void testApp() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
final String[] split = text.split("\\W+");
final List<String> list = new ArrayList<>();
System.out.println("Output: ");
for (String s : split) {
if(!list.contains(s.toUpperCase())){ // compare in the same case that is stored below
list.add(s.toUpperCase());
final long count = Arrays.stream(split).filter(s::equalsIgnoreCase).count();
System.out.println(s+" "+count);
}
}
}
EX3: below is a test for your example, but it uses a Map.
@Test
public void test() {
final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
Map<String, Long> result = Arrays.stream(text.split("\\W+")).collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
assertEquals(Long.valueOf(5), result.get("algorithms")); // expected value first
System.out.println("Output: ");
result.entrySet().stream().forEach(x -> System.out.println(x.getKey() + " " + x.getValue()));
}

Related

Looping through an ArrayList with another Arraylist in Java

I have a large array list of sentences and another array list of words.
My program loops through the array list and removes an element from that array list if the sentence contains any of the words from the other.
The sentences array list can be very large and I coded a quick and dirty nested for loop. While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
for (int i = 0; i < SENTENCES.size(); i++) {
for (int k = 0; k < WORDS.size(); k++) {
if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {
//Do something
}
}
}
Is there a more efficient way of doing this than a nested for loop?
There are a few inefficiencies in your code, but at the end of the day, if you've got to search for sentences containing words then there's no getting away from loops.
That said, there are a couple of things to try.
First, make WORDS a HashSet; its contains method will be far quicker than an ArrayList's because it does a hash look-up to find the value.
Second, switch the logic about a bit like this:
Iterator<String> sentenceIterator = SENTENCES.iterator();
sentenceLoop:
while (sentenceIterator.hasNext())
{
String sentence = sentenceIterator.next();
for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
{
if (WORDS.contains(word))
{
sentenceIterator.remove();
continue sentenceLoop;
}
}
}
This code (which assumes you're trying to remove sentences that contain certain words) uses Iterators and avoids the string concatenation and parsing logic you had in your original code (replacing it with a single regex), both of which should make it quicker.
But bear in mind, as with all things performance, you'll need to test these changes to see whether they improve the situation.
What you must change is the way you handle the removal of the data. This is noted by this part of the explanation of your problem:
The sentences array list can be very large (...). While this works when there are not many sentences, in cases where there are, the time it takes to finish this operation is ridiculously long.
The cause of this is that removing an element from an ArrayList takes O(N) time, and since you're doing this inside a loop, the whole operation can take O(N^2).
I recommend using a LinkedList rather than an ArrayList to store the sentences, and using an Iterator rather than your naive List#get, since Iterator#remove runs in O(1) time for a LinkedList.
In case you cannot change the design to use a LinkedList, I recommend storing the sentences that are valid in a new List and, at the end, replacing the contents of your original List with this new List, thus saving a lot of time.
Apart from this big improvement, you can improve the algorithm even more by using a Set to store the words to lookup rather than using another List since the lookup in a Set is O(1).
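The two recommendations above (collect the valid sentences in a new List, and keep the words in a Set) could be combined roughly as follows. This is a sketch under assumptions: the class and method names are invented, matching is whole-word and case-insensitive, and sentences are tokenized by splitting on non-word characters, which may not fit every input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceFilter {
    // Returns the sentences that contain none of the given words.
    public static List<String> withoutWords(List<String> sentences, List<String> words) {
        Set<String> wordSet = new HashSet<>();
        for (String w : words) {
            wordSet.add(w.toLowerCase());
        }
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            boolean hit = false;
            for (String token : sentence.toLowerCase().split("\\W+")) {
                if (wordSet.contains(token)) { // average O(1) lookup
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                kept.add(sentence); // appending is cheap, unlike removing from the middle
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("The quick brown fox.", "A lazy dog", "Foxes run");
        System.out.println(withoutWords(sentences, Arrays.asList("fox"))); // [A lazy dog, Foxes run]
    }
}
```

The caller can then replace the contents of the original list with the returned one.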
What you could do is put all your words into a HashSet. This allows you to check if a word is in the set very quickly. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.
HashSet<String> wordSet = new HashSet<>(); // diamond operator instead of a raw type
for (String word : WORDS) {
wordSet.add(word);
}
Then it's just a matter of splitting each sentence into the words that make it up, and checking if any of those words are in the set.
for (String sentence : SENTENCES) {
String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
for (String word : sentenceWords) {
if (wordSet.contains(word)) {
// The sentence contains one of the special words.
// DO SOMETHING
break;
}
}
}
I would create a set of words from the second ArrayList:
Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");
I would then iterate over the set and the first ArrayList and use contains:
for (String word : listOfWords) {
for(String sentence : Sentences) {
if (sentence.contains(word)) {
// do something
}
}
}
Also, if you are free to use any open source jar, check this out:
searching string in another string
First, your program has a bug: it would not count words at the beginning and at the end of a sentence.
Your current program has runtime complexity of O(s*w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.
If WORDS is relatively small (a few hundred items or so), you could use a regex to speed things up considerably: construct a pattern like this, and use it in a loop:
StringBuilder regex = new StringBuilder();
boolean first = true;
// Let's say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
if (!first) {
regex.append('|');
} else {
first = false;
}
regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b's on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
if (p.matcher(SENTENCES.get(i)).find()) {
// Do something
}
}
Since regex gets pre-compiled into a structure more suitable for fast searches, your program would run in O(s*max(w)), where s is the length, in characters, of all sentences, and w is the length of the longest word. Given that the number of words in your collection is about 200 or 300, this could give you an order of magnitude decrease in running time.
If you have enough memory, you can tokenize SENTENCES and put the tokens in a Set. That would perform better and also be more correct than the current implementation.
Well, looking at your code, I would suggest two things that will improve the performance of each iteration:
Remove " == true". The contains operation already returns a boolean, so it is enough for the if; comparing it with true adds one extra operation per iteration that is not needed.
Do not concatenate Strings inside a loop (" " + WORDS.get(k) + " "), as it is quite an expensive operation because the + operator creates new objects. Better to use a string buffer / builder and clear it after each iteration with stringBuffer.setLength(0);.
Besides that, I do not know any other approach for this case. Maybe you can use regular expressions, if you can abstract a pattern out of the words you want to remove, and then have only one loop.
Hope it helps!
If you are concerned about efficiency, I think the most effective way to do this is the Aho-Corasick algorithm. You currently have two nested loops plus a contains() call (which I think takes, at best, time proportional to the length of the sentence plus the length of the word). Aho-Corasick needs only one loop over the sentences, and checking one sentence against all words takes time proportional to the length of the sentence, so it is faster by roughly a factor of the length of the word list (plus the preprocessing time to build the finite state machine, which is relatively small).
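To make the idea concrete, here is a compact sketch of an Aho-Corasick matcher. It is a minimal illustration, not a tuned implementation: the class name AhoCorasickSketch is made up, nodes use HashMaps for simplicity, and it only answers "does the text contain any of the words?" as a substring match (like String.contains), not as a whole-word match.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // longest proper suffix of this node's path that is also in the trie
        boolean terminal; // true if some word ends here, directly or via failure links
    }

    private final Node root = new Node();

    public AhoCorasickSketch(List<String> words) {
        // Phase 1: build a trie of all words.
        for (String w : words) {
            if (w.isEmpty()) continue; // an empty word would match everything
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
        // Phase 2: breadth-first traversal to set the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.remove();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.terminal |= child.fail.terminal; // inherit matches ending at a suffix
                queue.add(child);
            }
        }
    }

    // One pass over the text, regardless of how many words were added.
    public boolean containsAny(String text) {
        Node n = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (n != root && !n.next.containsKey(c)) {
                n = n.fail;
            }
            Node t = n.next.get(c);
            n = (t != null) ? t : root;
            if (n.terminal) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasickSketch matcher = new AhoCorasickSketch(Arrays.asList("quick", "brown", "fox"));
        System.out.println(matcher.containsAny("the quick dog")); // true
        System.out.println(matcher.containsAny("a lazy dog"));    // false
    }
}
```

The automaton is built once from the word list and then reused for every sentence, which is where the saving over the nested loop comes from.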
I'll approach this from a more theoretical angle. If you have no memory limitation, you can try to mimic the logic of counting sort.
Say M1 = sentences.size, M2 = number of words per sentence, and N = words.size.
Assume all sentences have the same number of words, just for simplicity.
Your current approach's complexity is O(M1.M2.N).
We can create a mapping from each word to its positions in the sentences.
Loop through your ArrayList of sentences and convert it into a two-dimensional jagged array of words. Loop through the new array and create a HashMap where key = word and value = ArrayList of word positions (say with length X). That's O(2M1.M2.X) = O(M1.M2.X).
Then loop through your words ArrayList, look each word up in the HashMap, and loop through its list of positions, removing each one. That's O(N.X).
If you need to return the result as an ArrayList of strings, we need another loop to concatenate everything. That's O(M1.M2).
Total complexity is O(M1.M2.X) + O(N.X) + O(M1.M2).
Assuming X is much smaller than N, you'll probably get better performance.

Most efficient way to find unique entries in a large data set

Before anything, I am making it clear that this is an assignment and I do not expect full coded answers. All I seek is advice and maybe snippets of code that helps me.
So, I am reading in about 900,000 words, all stored in an ArrayList. I need to count unique words using a sorted array (or ArrayList) in Java.
So far, I am simply looping over the given arrayList and use
Collections.sort(words);
and Collections.binarySearch(words, wordToLook); to achieve it like the following:
OrderedSet set = new OrderedSet();
for(String a : words){
if(!set.contains(a)){
set.add(a);
}
}
and
public boolean contains(String word) {
Collections.sort(uniqueWords);
int result = Collections.binarySearch(uniqueWords, word);
if(result<0){
return false;
}else{
return true;
}
}
This code has a running time of about 60 seconds, but I was wondering if there is any better way to do this, because running a sort every time an element is added seems very inefficient (but of course necessary if I were to use binary search).
Any sort of feedback would be greatly appreciated. Thanks.
So, you are required to use a sorted array. That is ok, since you are (not yet) programming in the real world.
I will suggest two alternatives:
The first uses binary search (which you are using in your current code).
I would create a class that contains two fields: the word (a String) and the count for that word (an int). You will build a sorted array of these classes.
Start with an empty array and add to it as you read each word. For each word, do a binary search for the word in the array you are building. The search will either find the entry containing the word (and you will increment the count), or you will determine that the word is not yet in the array.
When your binary search ends without finding the word, you will create a new object to hold the word+count and add it to the array in the location where your search ended (be careful to make sure that your logic really puts it in the right spot to keep your list sorted). Of course, your count is set to 1 for new words.
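A sketch of this first alternative is shown below. It leans on the fact that Collections.binarySearch returns -(insertion point) - 1 when the key is missing, so the list never has to be re-sorted. The class name SortedWordCounts and its members are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortedWordCounts {
    static final class WordCount {
        final String word;
        int count;
        WordCount(String word) { this.word = word; this.count = 1; }
    }

    private static final Comparator<WordCount> BY_WORD = Comparator.comparing(wc -> wc.word);

    private final List<WordCount> entries = new ArrayList<>(); // always kept sorted by word

    public void add(String word) {
        WordCount probe = new WordCount(word);
        int idx = Collections.binarySearch(entries, probe, BY_WORD);
        if (idx >= 0) {
            entries.get(idx).count++;     // word already present
        } else {
            entries.add(-idx - 1, probe); // insert at the position that keeps the list sorted
        }
    }

    public int uniqueCount() {
        return entries.size();
    }

    public int countOf(String word) {
        int idx = Collections.binarySearch(entries, new WordCount(word), BY_WORD);
        return idx >= 0 ? entries.get(idx).count : 0;
    }

    public static void main(String[] args) {
        SortedWordCounts counts = new SortedWordCounts();
        for (String w : new String[]{"b", "a", "b", "c"}) {
            counts.add(w);
        }
        System.out.println(counts.uniqueCount()); // 3
        System.out.println(counts.countOf("b"));  // 2
    }
}
```

Inserting into the middle of an ArrayList still shifts elements, but that only happens once per unique word, not once per word read.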
Another alternative:
Read all of your words into a list and sort it. After sorting, all duplicates will be next to each other in the list.
You will walk down this sorted list once and create a list of word+count as you go. If the next word you see is the same as the last word+count, increment the count. If it is a new word, add a new word+count to your result list with count=1.
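The second alternative might be sketched like this: sort once, then walk the list a single time and measure each run of equal words. The class name SortedRunCounter is hypothetical, and each result is returned as a simple "word count" string to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedRunCounter {
    public static List<String> countSorted(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted); // duplicates are now adjacent
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String word = sorted.get(i);
            int j = i;
            // Advance j to the end of the run of equal words.
            while (j < sorted.size() && sorted.get(j).equals(word)) {
                j++;
            }
            result.add(word + " " + (j - i));
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countSorted(Arrays.asList("b", "a", "b", "c"))); // [a 1, b 2, c 1]
    }
}
```

The size of the returned list is the number of unique words.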
I would not use a sorted array. I would create a Map<String, Integer> where the key is your word and the value is the count of the number of occurrences of the word. As you read each word, do something like this:
Integer count = map.get(word);
if (count == null) {
count = 0;
}
map.put(word, count + 1);
Then just iterate over the map's entry set and do whatever you need to do with the counts.
If you know, or can estimate, the number of unique words then you should use this number in the HashMap constructor (so you don't grow the map many times).
If you use a sorted array, your run time cannot be better than proportional to NlogN (where N is the number of words in your list). If you use a HashMap, you can achieve a runtime that grows linearly with N (you save yourself the factor of logN).
Another advantage of using a Map is the memory used is proportional to the number of unique words, rather than the total number of words (assuming that you build the map while reading the words, rather than reading all words into a collection and then adding them to the map).
public static int countUnique(String[] array) {
    if (array.length == 0) return 0;
    int count = 1;
    for (int i = 1; i < array.length; i++) {
        // a new group starts wherever an element differs from its predecessor
        if (!array[i].equals(array[i - 1])) count++;
    }
    return count;
}
This is an O(N) algorithm for counting the number of unique entries in a sorted array. The idea behind it is that we count the number of transitions between groups of equal elements. Then, the number of unique entries is the number of transitions plus one (for the first entry).
Hopefully you see how to apply this algorithm to your array after the elements are sorted.
You could always use a comparator to get the unique values. ArrayList has no constructor that takes a Comparator, but TreeSet does:
Set<String> unique = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
unique.addAll(words);
Now count:
words.size() - unique.size() = no. of repeated values.
Hope this helps!!!!

Java - Search performantly for subset of String in String list

I want to search through a list of Strings and return the values which contain the search string.
The list could look like this (it can have up to 1000 entries), although it is not guaranteed that an entry is always letters followed by a digit. It could be digits only, words only, or both mixed up:
entry 1
entry 2
entry 3
entry 4
test 1
test 2
test 3
tst 4
If the user does search for 1, these should be returned:
entry 1
test 1
The situation is that the user has a search bar and can enter a search string. This search string is used to search through the list.
How can this be done performantly?
Currently, I have got:
for (String s : strings) {
if (s.contains(searchedText)) result.add(s);
}
It is O(N) and really slow, especially if the user types many characters at a time.
Maybe I don't understand your question, but as you know, in Java String objects are immutable and can also be treated as a collection (array) of chars. So one thing you can do is search with a better algorithm, such as binary search, Aho-Corasick, Rabin–Karp, Boyer–Moore string search, or StringSearch. You may also consider abstract data types with better lookup performance (hashing, trees, etc.).
This is very simple if you use streams:
final List<String> items = Arrays.asList("entry 1", "entry 2", "entry 3", "test 1", "test 2", "test 3");
final String searchString = "1";
final List<String> results = items.parallelStream() // work in parallel
.filter(s -> s.contains(searchString)) // pick out items that match
.collect(Collectors.toList()); // and turn those into a result list
results.forEach(System.out::println);
Notice the parallelStream() which will cause the list to be filtered and traversed using all available CPUs.
In your case you can use the results when the user expands the search term (while typing) to reduce the amount of items that need to be filtered, because if 's' matches all items in result, all those that match 'se' will be a sub-list of result.
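The narrowing idea in the last paragraph could be sketched as follows. The class name IncrementalSearch is invented; the only assumption is that when the new query contains the previous query, every new match must already be among the previous results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalSearch {
    private final List<String> allItems;
    private String lastQuery = "";
    private List<String> lastResults;

    public IncrementalSearch(List<String> items) {
        this.allItems = new ArrayList<>(items);
        this.lastResults = this.allItems; // the empty query matches everything
    }

    public List<String> search(String query) {
        // If the user only extended the previous query, filter the previous results;
        // otherwise fall back to the full list.
        List<String> candidates = query.contains(lastQuery) ? lastResults : allItems;
        List<String> results = new ArrayList<>();
        for (String s : candidates) {
            if (s.contains(query)) {
                results.add(s);
            }
        }
        lastQuery = query;
        lastResults = results;
        return results;
    }

    public static void main(String[] args) {
        IncrementalSearch search = new IncrementalSearch(
                Arrays.asList("entry 1", "entry 2", "test 1", "test 2", "tst 4"));
        System.out.println(search.search("t"));  // all five items contain "t"
        System.out.println(search.search("te")); // [test 1, test 2], filtered from the previous result
        System.out.println(search.search("1"));  // [entry 1, test 1], searched from scratch
    }
}
```

With at most 1000 entries the gain is small, but the candidate list shrinks with every extra character typed.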
If you don't use any additional structures, you cannot do better than looking through all of your data, which takes O(N).
If you can do some preparation, like building a text index, you can increase search performance. General information: http://en.wikipedia.org/wiki/Full_text_search. If you can make some assumptions about your data (for example, that the last symbol is a number and you are only going to search by it), it'll be easy to create such an index.
Depending on the upper limit of the number in the string and if you have no concerns about space, use an Array of ArrayLists where the array index is the number of the string:
@SuppressWarnings("unchecked") // Java does not allow creating a generic array directly
ArrayList<String>[] data = new ArrayList[1000];
for ( int i = 0; i < 1000; i++ )
data[i] = new ArrayList<String>();
//inserting data
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
data[num].add(datastring); // index by the parsed number
//getting all data that has a 1
for ( String s: data[1] )
result.add(s);
Using a HashMap that maps each number to a single string would overwrite the previously mapped value whenever you put a new value in for the same number,
i.e. if 1 maps to entry and you then try to add 1 mapping to test, entry would get replaced with test.
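The overwrite problem goes away if each key maps to a list of entries instead of a single string. Below is a sketch of that variant; the class name SuffixDigitIndex is made up, and it keeps the same assumption as the code above that entries end in a single decimal digit (entries that do not are simply skipped).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SuffixDigitIndex {
    private final Map<Integer, List<String>> byNumber = new HashMap<>();

    public void add(String entry) {
        if (entry.isEmpty()) return;
        // Character.digit returns -1 when the last character is not a decimal digit.
        int num = Character.digit(entry.charAt(entry.length() - 1), 10);
        if (num < 0) return;
        // Create the list on first use, then append; nothing is overwritten.
        byNumber.computeIfAbsent(num, k -> new ArrayList<>()).add(entry);
    }

    public List<String> get(int num) {
        return byNumber.getOrDefault(num, Collections.emptyList());
    }

    public static void main(String[] args) {
        SuffixDigitIndex index = new SuffixDigitIndex();
        for (String s : new String[]{"entry 1", "entry 2", "test 1", "tst 4"}) {
            index.add(s);
        }
        System.out.println(index.get(1)); // [entry 1, test 1]
    }
}
```

Unlike the fixed-size array, the map only allocates lists for numbers that actually occur.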
As another idea, you could just keep a count of the number of strings with each number, so when you're searching, you know how many to look for, so as soon as you find all of them, you stop searching:
int[] str_count = new int[1000];
for ( int i = 0; i < 1000; i++ )
str_count[i] = 0;
//when storing data into the list:
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
str_count[num]++; // count by the parsed number
//when searching the list for 1s:
int count = str_count[1];
for (String s : strings) {
if (s.contains(searchedText))
result.add(s);
if (result.size() == count)
break;
}
The first idea would be much faster but takes up more space. The second idea takes up less space, but its worst case is still an O(N) search.

Faster String Matching/Iteration Method?

In the program I'm currently working on, there's one part that's taking a bit long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word (of), I want to extract all Strings from the list that contains one of the three words: "inventory", "finished", and "goods". Right now, I implemented the idea as follows:
String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();
for (int i = 0; i < listOfWords.size(); i++) {
String[] words = listOfWords.get(i).split(" ");
outerloop:
for (int j = 0; j < words.length; j++) {
for (int k = 0; k < targetWords.length; k++) {
if (words[j].equalsIgnoreCase(targetWords[k])) {
extractedStrings.add(listOfWords.get(i));
break outerloop;
}
}
}
}
The list contains over 100k words, and with this it takes roughly .4 to .8 seconds to complete the task for each target phrase. The thing is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knew of a more efficient way to complete this task? Thanks for the help in advance!
Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for this, so not affected by the size of the list.
You can take your giant list of words and add them to a hash map and then when your phrase comes in, just loop over the words in your phrase and check against the hash map. Currently you are doing a linear search and what I'm proposing would cut it down to a constant time search.
The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.
You are passing through each of the elements of targetWords one by one, instead of checking for all words from targetWords simultaneously. In addition, you are splitting every string in your list on each pass without really needing to, which creates overhead.
I would suggest that you combine your targetWords into one (compiled) regular expression:
(?xi) # turn on comments, use case insensitive matching
\b # word boundary, i.e. start/end of string, whitespace
( # begin of group containing 'inventory' or 'finished' or 'goods'
inventory|finished|goods # bar separates alternatives
) # end of group
\b # word boundary
Don't forget to escape the backslashes (by doubling them) in your regular expression string.
import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
if (targetPattern.matcher(singleString).find()) {
extractedStrings.add(singleString);
}
}
If you are not satisfied with the speed of regular expressions - although regular expression engines are usually optimized for performance - you need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.
I'm a little confused as to whether you want the whole phrase or just single words from listOfWords. If you are trying to get a string from listOfWords whenever one of your target words is in that string, this should work for you.
String[] targetWords= new String[]{"inventory", "finished", "goods"};
List<String> listOfWords = new ArrayList<String>();
// build lookup map
Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
for(String words : listOfWords) {
for(String word : words.split(" ")) {
if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
lookupMap.get(word).add(words);
}
}
// find phrases
Set<String> extractedStrings = new HashSet<String>();
for(String target : targetWords) {
if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
}
I would try to implement it with ExecutorService to parallelize search for each word.
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html
For example with fixed thread pool size:
Executors.newFixedThreadPool(20);
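One way to read this suggestion is sketched below: split the list into chunks, let each worker thread scan one chunk, and join the partial results in order. The class name ParallelExtract and the chunking scheme are assumptions made for illustration; whether this is actually faster than a single thread has to be measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

public class ParallelExtract {
    public static List<String> extract(List<String> strings, Pattern target, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (strings.size() + threads - 1) / threads; // ceiling division
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < strings.size(); start += chunk) {
                List<String> slice = strings.subList(start, Math.min(start + chunk, strings.size()));
                Callable<List<String>> task = () -> {
                    List<String> hits = new ArrayList<>();
                    for (String s : slice) {
                        // Pattern is thread-safe; each call creates its own Matcher.
                        if (target.matcher(s).find()) hits.add(s);
                    }
                    return hits;
                };
                futures.add(pool.submit(task));
            }
            List<String> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.addAll(f.get()); // futures are read in submission order
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?i)\\b(inventory|finished|goods)\\b");
        List<String> list = Arrays.asList("Inventory list", "raw goods", "nothing here", "finished");
        System.out.println(extract(list, p, 3)); // [Inventory list, raw goods, finished]
    }
}
```

In a real program the pool would normally be created once and reused across target phrases, instead of being created per call as in this sketch.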

Optimizing a simple search algorithm

I have been playing around a bit with a fairly simple, home-made search engine, and I'm now twiddling with some relevancy sorting code.
It's not very pretty, but I'm not very good when it comes to clever algorithms, so I was hoping I could get some advice :)
Basically, I want each search result to get scoring based on how many words match the search criteria. 3 points per exact word and one point for partial matches
For example, if I search for "winter snow", these would be the results:
winter snow => 6 points
winter snowing => 4 points
winterland snow => 4 points
winter sun => 3 points
winterland snowing => 2 points
Here's the code:
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = 0;
for (String resultWord : resultWords) {
for (String searchWord : searchWords) {
if (resultWord.equalsIgnoreCase(searchWord))
score += 3;
else if (resultWord.toLowerCase().contains(searchWord.toLowerCase()))
score++;
}
}
Your code seems OK to me. I suggest a few small changes:
Since you are going through all possible combinations, you might as well move the toLowerCase() calls to the start, so they run only once.
Also, if an exact match has already occurred, you don't need to perform another equals.
result = result.toLowerCase();
searchStr = searchStr.toLowerCase();
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = 0;
for (String resultWord : resultWords) {
boolean exactMatch = false;
for (String searchWord : searchWords) {
if (!exactMatch && resultWord.equals(searchWord)) {
exactMatch = true;
score += 3;
} else if (resultWord.contains(searchWord))
score++;
}
}
Of course, this is a very basic level. If you are really interested in this area of computer science and want to learn more about implementing search engines start with these terms:
Natural Language Processing
Information retrieval
Text mining
stemming
For acronyms, case sensitivity is important (e.g. SUN); should any word that matches both content and case be weighted more than 3 points (5 or 7)?
Use the strategy design pattern.
For example, consider this naive score model:
interface ScoreModel {
int startingScore();
int partialMatch();
int exactMatch();
}
...
int search(String result, String searchStr, ScoreModel model) {
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = model.startingScore();
for (String resultWord : resultWords) {
for (String searchWord : searchWords) {
if (resultWord.equalsIgnoreCase(searchWord)) {
score += model.exactMatch();
} else if (resultWord.toLowerCase().contains(searchWord.toLowerCase())) {
score += model.partialMatch();
}
}
}
return score;
}
A basic optimization is to preprocess your database: don't split entries into words every time.
Build the word list for every entry when it is added to the DB (prefer a hash or a binary tree to speed up searching the list), remove all words that are too short, convert to lower case, and store this data for later use.
Do the same with the search string when a search starts (split, lower case, clean up) and compare the resulting word list with each entry's word list.
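As a rough illustration of this preprocessing, the sketch below splits and lower-cases each entry once, when it is created, and reuses the stored words for every search. The class name IndexedEntry is hypothetical, the scoring rule (3 points for an exact match, 1 for a partial match) is taken from the question, and the "remove too short words" step is left out.

```java
public class IndexedEntry {
    public final String original;
    private final String[] words; // lower-cased once, at insertion time

    public IndexedEntry(String text) {
        this.original = text;
        this.words = text.trim().toLowerCase().split("\\s+");
    }

    // searchWords must already be lower-cased (done once per search, not once per entry).
    public int score(String[] searchWords) {
        int score = 0;
        for (String resultWord : words) {
            for (String searchWord : searchWords) {
                if (resultWord.equals(searchWord)) {
                    score += 3;
                } else if (resultWord.contains(searchWord)) {
                    score++;
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String[] search = "Winter Snow".toLowerCase().split("\\s+");
        System.out.println(new IndexedEntry("winter snow").score(search));        // 6
        System.out.println(new IndexedEntry("Winter Snowing").score(search));     // 4
        System.out.println(new IndexedEntry("winterland snowing").score(search)); // 2
    }
}
```

The scoring loop is the same as in the question; only the repeated split() and toLowerCase() calls have been moved out of it.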
1) You can sort searchWords first. You could break out of the loop once your result word was alphabetically after your current search word.
2) Even better, sort both, then walk along both lists simultaneously to find where any matches occur.
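Point 2 could be sketched like this for the exact matches only. Partial (contains) matches are not ordered alphabetically, so they cannot be found by this walk and would still need a separate pass. The class name SortedMatch is made up, and each word is paired with at most one partner, so duplicates are counted pairwise.

```java
import java.util.Arrays;

public class SortedMatch {
    // Counts words present in both arrays by walking two sorted copies in step.
    public static int exactMatches(String[] resultWords, String[] searchWords) {
        String[] a = resultWords.clone();
        String[] b = searchWords.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            int cmp = a[i].compareTo(b[j]);
            if (cmp == 0) {        // same word in both lists
                matches++;
                i++;
                j++;
            } else if (cmp < 0) {  // a[i] sorts first, so it cannot match any later b[j]
                i++;
            } else {
                j++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] result = {"winter", "sun"};
        String[] search = {"winter", "snow"};
        System.out.println(exactMatches(result, search) * 3); // 3 points for "winter"
    }
}
```

For the short word lists in the question the nested loop is likely just as fast; the sorted walk only pays off when both lists are long.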
You can use regular expressions to find patterns and the lengths of the matched patterns (for later classification/scoring).
