Ranking text based on keywords provided - java

I have a set of Strings and a set of keywords.
Example
String 1 : Oracle and Samsung Electronics have reportedly forged a new partnership through which they will work together to deliver mobile cloud services. In a meeting last Thursday, Oracle co-CEO Mark Hurd and Shin Jong-kyun, head of Samsung Electronics’ mobile
String 2 : This is some random string.
Keywords : Oracle,Samsung
The function should return String 1 as the one with the highest rank. I could search each string for each keyword, but that would take too much time, as there will be a lot of strings and a huge set of keywords.

Create a data structure that maps each term that appears in any of the strings to all strings it appears in.
Map<String,List<Integer>> keyword2stringId;
If a string contains the same keyword multiple times, you could simply add it to the List multiple times, or -- if you prefer -- use a slightly different map which allows you to also keep a count:
Map<String,List<Pair<Integer,Integer>>> keyword2pair; // pair = id + count
Then for each keyword, you can look up the relevant strings and find the ones with the highest overlap, for instance like so:
// count the occurrences of all keywords in the different strings
int[] counts = new int[strings.length];
for (String keyword : keywords) {
    List<Integer> ids = keyword2stringId.get(keyword);
    if (ids != null) { // the keyword may not occur in any string
        for (int index : ids) {
            counts[index]++;
        }
    }
}
// find the string that has the highest number of keywords
int maxCount = 0;
int maxIndex = -1;
for (int i = 0; i < counts.length; i++) {
    if (counts[i] > maxCount) {
        maxCount = counts[i];
        maxIndex = i;
    }
}
// return the highest ranked string or
// 'null' if no matching document was found
if (maxIndex == -1) {
    return null;
} else {
    return strings[maxIndex];
}
The advantage of this approach is that you can compute your map offline (that is, only once) and then use it again and again for different queries.
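The answer above does not show how the map is filled. A minimal sketch of one way to build it and run a query follows; the class name KeywordIndex, the method best, and the tokenization rule (lower-casing and splitting on anything that is not a letter or digit) are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeywordIndex {
    // term -> ids of the strings containing it (one entry per occurrence)
    private final Map<String, List<Integer>> keyword2stringId = new HashMap<>();
    private final String[] strings;

    // Builds the index once; this is the "offline" step.
    KeywordIndex(String[] strings) {
        this.strings = strings;
        for (int id = 0; id < strings.length; id++) {
            // assumption: terms are runs of letters/digits, compared case-insensitively
            for (String term : strings[id].toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    keyword2stringId.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
                }
            }
        }
    }

    // Returns the string with the most keyword occurrences, or null if nothing matches.
    String best(String... keywords) {
        int[] counts = new int[strings.length];
        for (String keyword : keywords) {
            List<Integer> ids = keyword2stringId.getOrDefault(
                    keyword.toLowerCase(), Collections.<Integer>emptyList());
            for (int id : ids) {
                counts[id]++;
            }
        }
        int maxCount = 0;
        int maxIndex = -1;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > maxCount) {
                maxCount = counts[i];
                maxIndex = i;
            }
        }
        return maxIndex == -1 ? null : strings[maxIndex];
    }
}
```

Each query then only touches the lists of the keywords it contains, rather than scanning every string.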

It looks like you should try a search engine or search library such as Lucene or Solr:
Lucene Core, our flagship sub-project, provides Java-based indexing
and search technology, as well as spellchecking, hit highlighting and
advanced analysis/tokenization capabilities.
Solr is the popular, blazing-fast, open source enterprise search
platform built on Apache Lucene™.
Both of them support what you need to do: searching for keywords and ranking the results.

This program can't have less than O(n) complexity; that is, you have to check each word of each string against each keyword.
Now, the only thing you can do is perform the check over each string all at once:
public int getRank(String string, String[] keyword) {
int rank = 0;
for (String word : string.split(" "))
for (String key : keyword)
if (word.equals(key))
rank++;
return rank;
}
In this easy example, rank is an int increased each time a keyword occurs in the string. Then fill an array of ranks for each string:
String[] strings = new String[]{"...", "...", "...", "...", ...};
String[] keyword = new String[]{"...", "...", "...", "...", ...};
int[] ranks = new int[strings.length];
for (int i = 0; i < strings.length; i++)
    ranks[i] = getRank(strings[i], keyword);

I believe what you're really looking for is TF/IDF (Term Frequency/Inverse Document Frequency). The link provided should give you the information you need, or alternatively, as @Mysterion has pointed out, Lucene will do this for you. You don't necessarily need to deploy a complete Lucene/Solr/ElasticSearch installation; you could just make use of the classes you need to roll your own.
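For readers who want to see the idea without pulling in Lucene, here is a rough sketch of TF/IDF scoring. It uses one common variant (raw term count multiplied by ln(N/df) + 1); the class name TfIdfRanker, the tokenizer, and the exact weighting are assumptions for illustration and do not reproduce Lucene's scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfRanker {
    // Assumed tokenizer: lower-case, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one score per document: the sum over all keywords of tf * idf.
    static double[] score(List<String> docs, List<String> keywords) {
        List<Map<String, Integer>> termCounts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : tokenize(doc)) {
                counts.merge(t, 1, Integer::sum);      // term frequency in this document
            }
            for (String t : counts.keySet()) {
                docFreq.merge(t, 1, Integer::sum);     // number of documents containing t
            }
            termCounts.add(counts);
        }
        double[] scores = new double[docs.size()];
        for (String keyword : keywords) {
            String k = keyword.toLowerCase();
            int df = docFreq.getOrDefault(k, 0);
            if (df == 0) {
                continue;                              // keyword occurs nowhere
            }
            double idf = Math.log((double) docs.size() / df) + 1.0;
            for (int i = 0; i < scores.length; i++) {
                scores[i] += termCounts.get(i).getOrDefault(k, 0) * idf;
            }
        }
        return scores;
    }
}
```

Keywords that appear in few documents contribute more to the score than keywords that appear almost everywhere, which is the point of the IDF factor.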

Related

Find match with regex in arraylist

I'm trying to develop a function that reads an ArrayList of strings and can determine whether there exist at least two tuples that have the same values at a set of indices but differ at a supplementary index. I've developed a version of this function using a RegEx comparison, as follows:
public boolean checkMatching(){
ArrayList<String> rows = new ArrayList<String>();
rows.add("7,2,2,1,1");
rows.add("7,3,2,1,1");
rows.add("7,8,1,1,1");
rows.add("8,2,1,3,1");
rows.add("8,2,1,4,1");
rows.add("8,4,5,1,1");
int[] indices = new int[] {2,3};
int supplementaryIndex = 1;
String regex = "";
for(String r : rows){
String[] rt = r.split(",");
regex = "[a-zA-Z0-9,-.]*[,][a-zA-Z0-9,-.]*[,][" + rt[indices[0]] + "][,][" + rt[indices[1]] + "][,][a-zA-Z0-9,-.]*";
for(String r2 : rows){
if(r.equals(r2) == false){
if(Pattern.matches(regex, r2)){
String[] rt2 = r2.split(",");
if(rt[supplementaryIndex].equals(rt2[supplementaryIndex]) == false){
return true;
}
}
}
}
}
return false;
}
However, it is very expensive, especially if there are many rows. I've thought of creating a more complex RegEx that considers multiple choices (with the '|' condition), as follows:
public boolean checkMatching(){
ArrayList<String> rows = new ArrayList<String>();
rows.add("7,2,2,1,1");
rows.add("7,3,2,1,1");
rows.add("7,8,1,1,1");
rows.add("8,2,1,3,1");
rows.add("8,2,1,4,1");
rows.add("8,4,5,1,1");
int[] indices = new int[] {2,3};
int supplementaryIndex = 1;
String regex = "";
for(String r : rows){
String[] rt = r.split(",");
regex += "[a-zA-Z0-9,-.]*[,][a-zA-Z0-9,-.]*[,][" + rt[indices[0]] + "][,][" + rt[indices[1]] + "][,][a-zA-Z0-9,-.]*";
regex += "|"; //or
}
for(String r2 : rows){
if(Pattern.matches(regex, r2)){
//String rt2 = r.split(",");
//if(rt[supplementaryIndex].equals(rt2[supplementaryIndex]) == false){
return true;
//}
}
}
return false;
}
But the problem is that this way I can't compare the supplementary index values. Do you have any suggestions on how to define a regex that can directly satisfy this condition? Or, is it possible to leverage java streams to do this efficiently?
The main problem of your first approach is that you have two nested loops over the same list, which gets you a quadratic time complexity. To recall, that implies that the inner loop’s body gets executed 10,000 times for a list with 100 elements and 1,000,000 times for a list of 1,000 elements, and so on.
It doesn’t help to call Pattern.matches(regex, r2) in the inner loop’s body. That method exists only to support (as delegation target) the String operation r2.matches(regex), a convenience method that performs Pattern.compile(regex).matcher(input).matches() in one go. If you have to apply the same regex multiple times, you should keep and re-use the result of Pattern.compile(regex).
But here, there is no point in using a regex at all. You have already decomposed the string using split and can access each component via a plain array access. Using this starting point to compose a regex to be applied on the string again, is complicated and expensive at the same time.
Just use something like
// return true when at least one string has the same values for indices
// but different value for supplementaryIndex
Map<List<String>,String> map = new HashMap<>();
for(String r : rows) {
String[] rt = r.split(",");
List<String> key = List.of(rt[indices[0]], rt[indices[1]]);
String old = map.putIfAbsent(key, rt[supplementaryIndex]);
if(old != null && !old.equals(rt[supplementaryIndex])) return true;
}
return false;
This loops over the list a single time, extracts the key elements from the array and composes a key for a HashMap. There are various ways to do this. But while it’s tempting to just concatenate these elements like rt[indices[0]] + "," + rt[indices[1]], which would work, using a List is preferable, as it avoids expensive string concatenation.
The code puts the value to check into the map which will return a previous value if this key has been encountered before. If so, the old and new values can be compared and the method can return immediately if they don’t match.
When you are using Java 8, you have to use Arrays.asList(rt[indices[0]], rt[indices[1]]) instead of List.of(rt[indices[0]], rt[indices[1]]).
This can be easily expanded to support variable lengths for indices, by changing
List<String> key = List.of(rt[indices[0]], rt[indices[1]]);
to
List<String> key = Arrays.stream(indices).mapToObj(i -> rt[i]).toList();
or, if you are using a Java version older than 16:
List<String> key
= Arrays.stream(indices).mapToObj(i -> rt[i]).collect(Collectors.toList());
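Putting the pieces together, a self-contained version might look like the following sketch. The class name RowMatcher and the choice to pass rows, indices, and supplementaryIndex as parameters are assumptions for illustration; the key is built with a plain loop so the code also compiles on Java 8.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowMatcher {
    // Returns true when two rows share the values at 'indices'
    // but differ at 'supplementaryIndex'.
    static boolean checkMatching(List<String> rows, int[] indices, int supplementaryIndex) {
        Map<List<String>, String> seen = new HashMap<>();
        for (String r : rows) {
            String[] rt = r.split(",");
            List<String> key = new ArrayList<>(indices.length);
            for (int i : indices) {
                key.add(rt[i]);
            }
            // putIfAbsent returns the value already stored for this key, or null
            String old = seen.putIfAbsent(key, rt[supplementaryIndex]);
            if (old != null && !old.equals(rt[supplementaryIndex])) {
                return true;
            }
        }
        return false;
    }
}
```

With the six rows from the question, indices {2,3} and supplementaryIndex 1, the first two rows share the key [2, 1] but have different values at index 1, so the method returns true after reading the second row.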

Java - Search performantly for subset of String in String list

I want to search through a list of Strings and return the values which contain the search string.
The list could look like this (it can have up to 1000 entries). It is not guaranteed, though, that an entry is always letters followed by a digit. It could be digits only, words only, or both mixed up:
entry 1
entry 2
entry 3
entry 4
test 1
test 2
test 3
tst 4
If the user does search for 1, these should be returned:
entry 1
test 1
The situation is that the user has a search bar and can enter a search string. This search string is used to search through the list.
How can this be done performantly?
Currently, I have got:
for (String s : strings) {
if (s.contains(searchedText)) result.add(s);
}
It is O(N) and really slow. Especially if the user types many characters at a time.
Maybe I don't understand your question, but as you know, in Java String objects are immutable, yet they can also be treated as a collection (array) of chars. So one thing you can do is perform the search with better algorithms such as binary search, Aho-Corasick, Rabin–Karp, Boyer–Moore string search, StringSearch or one of these. You may also consider using abstract data types with better performance (hashing, trees etc.).
This is very simple if you use streams:
final List<String> items = Arrays.asList("entry 1", "entry 2", "entry 3", "test 1", "test 2", "test 3");
final String searchString = "1";
final List<String> results = items.parallelStream() // work in parallel
.filter(s -> s.contains(searchString)) // pick out items that match
.collect(Collectors.toList()); // and turn those into a result list
results.forEach(System.out::println);
Notice the parallelStream() which will cause the list to be filtered and traversed using all available CPUs.
In your case you can use the results when the user expands the search term (while typing) to reduce the amount of items that need to be filtered, because if 's' matches all items in result, all those that match 'se' will be a sub-list of result.
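The narrowing idea in the last paragraph can be sketched as follows. The class name IncrementalSearch and its fields are hypothetical; the sketch assumes the search bar calls search() with the full current text after every keystroke.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class IncrementalSearch {
    private final List<String> allItems;
    private String lastQuery = "";
    private List<String> lastResults;

    IncrementalSearch(List<String> items) {
        this.allItems = new ArrayList<>(items);
        this.lastResults = this.allItems;
    }

    List<String> search(String query) {
        // If the new query contains the previous one, every item matching the new
        // query also matched the previous one, so only the last results need filtering.
        List<String> candidates = query.contains(lastQuery) ? lastResults : allItems;
        List<String> results = candidates.stream()
                .filter(s -> s.contains(query))
                .collect(Collectors.toList());
        lastQuery = query;
        lastResults = results;
        return results;
    }
}
```

When the user deletes characters or types something unrelated, the sketch simply falls back to the full list.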
If you don't use any additional structures, you cannot do better than looking through all of your data, which takes O(N).
If you can do some preparations, like building text index, you can increase performance of search. General information: http://en.wikipedia.org/wiki/Full_text_search. If you can make some assumptions about your data (like the last symbol is number and you are going to search only by it), it'll be easy to create such index.
Depending on the upper limit of the number in the string and if you have no concerns about space, use an Array of ArrayLists where the array index is the number of the string:
@SuppressWarnings("unchecked") // Java does not allow creating generic arrays directly
ArrayList<String>[] data = new ArrayList[1000];
for (int i = 0; i < 1000; i++)
    data[i] = new ArrayList<String>();
// inserting data: the number is taken from the last character of the string
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
data[num].add(datastring);
// getting all data that has a 1
for (String s : data[1])
    result.add(s);
Using a Hashmap would overwrite previous mapped values when trying to put new values into it.
i.e. if 1 maps to entry, then you try to add 1 mapping to test, the entry would get replaced with test.
As another idea, you could just keep a count of the number of strings with each number, so when you're searching, you know how many to look for, so as soon as you find all of them, you stop searching:
int[] str_count = new int[1000];
for ( int i = 0; i < 1000; i++ )
str_count[i] = 0;
// when storing data into the list:
int num = Integer.parseInt(datastring.substring(datastring.length() - 1));
str_count[num]++;
//when searching the list for 1s:
int count = str_count[1];
for (String s : strings) {
if (s.contains(searchedText))
result.add(s);
if (result.size() == count)
break;
}
While the first idea would be much faster, it takes up more space. The second idea takes up less space, but its worst case is still an O(N) search.

Faster String Matching/Iteration Method?

In the program I'm currently working on, there's one part that's taking a bit long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word (of), I want to extract all Strings from the list that contains one of the three words: "inventory", "finished", and "goods". Right now, I implemented the idea as follows:
String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();
for (int i = 0; i < listOfWords.size(); i++) {
String[] words = listOfWords.get(i).split(" ");
outerloop:
for (int j = 0; j < words.length; j++) {
for (int k = 0; k < targetWords.length; k++) {
if (words[j].equalsIgnoreCase(targetWords[k])) {
extractedStrings.add(listOfWords.get(i));
break outerloop;
}
}
}
}
The list contains over 100k words, and with this it takes roughly .4 to .8 seconds to complete the task for each target phrase. The thing is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knew of a more efficient way to complete this task? Thanks for the help in advance!
Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for this, so not affected by the size of the list.
You can take your giant list of words and add them to a hash map and then when your phrase comes in, just loop over the words in your phrase and check against the hash map. Currently you are doing a linear search and what I'm proposing would cut it down to a constant time search.
The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.
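As a variation on the hashing idea, and assuming each list entry is a space-separated string as in the question's code, the innermost loop over targetWords can be replaced by a HashSet lookup. In this sketch the set holds the (few) target words rather than the big list; the class name TargetWordFilter is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TargetWordFilter {
    static List<String> extract(List<String> listOfWords, String[] targetWords) {
        // lower-case once so that lookups are case-insensitive
        Set<String> targets = new HashSet<>();
        for (String t : targetWords) {
            targets.add(t.toLowerCase());
        }
        List<String> extracted = new ArrayList<>();
        for (String entry : listOfWords) {
            for (String word : entry.split(" ")) {
                if (targets.contains(word.toLowerCase())) { // constant-time lookup
                    extracted.add(entry);
                    break; // one hit is enough for this entry
                }
            }
        }
        return extracted;
    }
}
```

This still visits every entry once per phrase; indexing the list itself, as suggested above, avoids even that when many phrases are processed.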
You are passing through each of the elements of targetWords, instead of checking for all words from targetWords simultaneously. In addition, you are splitting each string from your list in every iteration without really needing to, which creates overhead.
I would suggest that you combine your targetWords into one (compiled) regular expression:
(?xi) # turn on comments, use case insensitive matching
\b # word boundary, i.e. start/end of string, whitespace
( # begin of group containing 'inventory' or 'finished' or 'goods'
inventory|finished|goods # bar separates alternatives
) # end of group
\b # word boundary
Don't forget to double the backslashes in your regular expression string.
import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
if (targetPattern.matcher(singleString).find()) {
extractedStrings.add(singleString);
}
}
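If the target words change from phrase to phrase, the pattern can be assembled at run time. The following is a sketch under that assumption; the class name TargetPatternBuilder is hypothetical, and Pattern.quote is used so that target words containing regex metacharacters are matched literally.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class TargetPatternBuilder {
    // Builds \b(?:word1|word2|...)\b with case-insensitive matching.
    static Pattern build(String... targetWords) {
        StringJoiner alternatives = new StringJoiner("|");
        for (String w : targetWords) {
            alternatives.add(Pattern.quote(w)); // treat each word as a literal
        }
        return Pattern.compile("\\b(?:" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }
}
```

The returned Pattern is compiled once per phrase and can then be reused for every string in the list.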
If you are not satisfied with the speed of regular expressions - although regular expression engines are usually optimized for performance - you need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.
I'm a little confused as to whether you want the whole phrase or just single words from listOfWords. If you are trying to get each string from listOfWords that contains one of your target words, this should work for you.
String[] targetWords= new String[]{"inventory", "finished", "goods"};
List<String> listOfWords = new ArrayList<String>();
// build lookup map
Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
for(String words : listOfWords) {
for(String word : words.split(" ")) {
if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
lookupMap.get(word).add(words);
}
}
// find phrases
Set<String> extractedStrings = new HashSet<String>();
for(String target : targetWords) {
if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
}
I would try to implement it with ExecutorService to parallelize search for each word.
http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html
For example with fixed thread pool size:
Executors.newFixedThreadPool(20);
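A possible shape for this is sketched below. Note that it splits the list into chunks and searches each chunk on its own thread, rather than running one task per target word; that partitioning, the class name ParallelExtractor, and the use of a precompiled Pattern are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

class ParallelExtractor {
    static List<String> extract(List<String> listOfWords, Pattern target, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (listOfWords.size() + threads - 1) / threads; // ceiling division
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < listOfWords.size(); start += chunk) {
                List<String> slice =
                        listOfWords.subList(start, Math.min(start + chunk, listOfWords.size()));
                futures.add(pool.submit(() -> {
                    List<String> hits = new ArrayList<>();
                    for (String s : slice) {
                        // Pattern is thread-safe; each call creates its own Matcher
                        if (target.matcher(s).find()) {
                            hits.add(s);
                        }
                    }
                    return hits;
                }));
            }
            List<String> result = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                result.addAll(f.get()); // futures are read in order, so list order is kept
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In real code the pool would normally be created once and shared across phrases; whether the parallel version is actually faster depends on list size and core count, so it is worth measuring.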

Best open source in memory java application for implementing Trie

I want to implement a high speed in memory implementation of Trie to create backend to auto suggestion / spell checker.
Is there already a good implementation based on in-memory solutions like Hazelcast?
Also, which Java open source tool is best suited for this kind of usage?
I would use a plain NavigableSet like TreeSet. It's built in and supports range searches.
NavigableSet<String> words = new TreeSet<String>();
// add words.
String startsWith = ...
SortedSet<String> matching = words.subSet(startsWith, startsWith + '\uFFFF');
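As a runnable illustration of the range search, here is a small sketch; the class name PrefixSuggester and its methods are hypothetical. It uses the four-argument subSet of NavigableSet to make the bounds explicit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class PrefixSuggester {
    private final NavigableSet<String> words = new TreeSet<>();

    void add(String word) {
        words.add(word);
    }

    // Returns all stored words that start with the prefix, in sorted order.
    List<String> suggest(String prefix) {
        // every word starting with 'prefix' sorts between prefix and prefix + '\uFFFF'
        return new ArrayList<>(words.subSet(prefix, true, prefix + '\uFFFF', false));
    }
}
```

Each lookup costs O(log n) to find the start of the range plus the number of matches returned.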
If you want something more memory efficient, you can use a sorted, array-backed list.
List<String> words = new ArrayList<String>();
words.add("aa");
words.add("ab");
words.add("ac");
words.add("ba");
Collections.sort(words);
String startsWith = "a";
int first = Collections.binarySearch(words, startsWith);
int last = Collections.binarySearch(words, startsWith.concat("\uFFFF"));
if (first < 0) first = ~first;
if (last < 0) last = ~last - 1;
for (int i = first; i <= last; i++) {
System.out.println(words.get(i));
}

Optimizing a simple search algorithm

I have been playing around a bit with a fairly simple, home-made search engine, and I'm now twiddling with some relevancy sorting code.
It's not very pretty, but I'm not very good when it comes to clever algorithms, so I was hoping I could get some advice :)
Basically, I want each search result to get scoring based on how many words match the search criteria. 3 points per exact word and one point for partial matches
For example, if I search for "winter snow", these would be the results:
winter snow => 6 points
winter snowing => 4 points
winterland snow => 4 points
winter sun => 3 points
winterland snowing => 2 points
Here's the code:
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = 0;
for (String resultWord : resultWords) {
for (String searchWord : searchWords) {
if (resultWord.equalsIgnoreCase(searchWord))
score += 3;
else if (resultWord.toLowerCase().contains(searchWord.toLowerCase()))
score++;
}
}
Your code seems ok to me. I suggest little changes:
Since you are going through all possible combinations, you might move the toLowerCase() calls to the start.
Also, if an exact match has already occurred, you don't need to perform another equals.
result = result.toLowerCase();
searchStr = searchStr.toLowerCase();
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = 0;
for (String resultWord : resultWords) {
boolean exactMatch = false;
for (String searchWord : searchWords) {
if (!exactMatch && resultWord.equals(searchWord)) {
exactMatch = true;
score += 3;
} else if (resultWord.contains(searchWord))
score++;
}
}
Of course, this is a very basic level. If you are really interested in this area of computer science and want to learn more about implementing search engines start with these terms:
Natural Language Processing
Information retrieval
Text mining
stemming
for acronyms, case sensitivity is important, e.g. SUN; should any word that matches both content and case be weighted more than 3 points (5 or 7)?
use the strategy design pattern
For example, consider this naive score model:
interface ScoreModel {
int startingScore();
int partialMatch();
int exactMatch();
}
...
int search(String result, String searchStr, ScoreModel model) {
String[] resultWords = result.split(" ");
String[] searchWords = searchStr.split(" ");
int score = model.startingScore();
for (String resultWord : resultWords) {
for (String searchWord : searchWords) {
if (resultWord.equalsIgnoreCase(searchWord)) {
score += model.exactMatch();
} else if (resultWord.toLowerCase().contains(searchWord.toLowerCase())) {
score += model.partialMatch();
}
}
}
return score;
}
Basic optimization can be done by preprocessing your database: don't split entries into words every time.
Build a word list for every entry when adding it to the DB (prefer a hash or binary tree to speed up searching the list): remove all words that are too short, convert to lower case, and store this data for further use.
Do the same with the search string when a search starts (split, lower-case, clean up) and compare this word list against every entry's word list.
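A sketch of this preprocessing, keeping the scoring rules from the question, could look like this. The class name PreprocessedEntry and the cutoff MIN_WORD_LENGTH are hypothetical; a plain list is used instead of a hash or tree because partial matches still require looking at every word.

```java
import java.util.ArrayList;
import java.util.List;

class PreprocessedEntry {
    static final int MIN_WORD_LENGTH = 2; // hypothetical cutoff for "too short" words

    final String original;
    final List<String> words; // split and lower-cased once, when the entry is added

    PreprocessedEntry(String original) {
        this.original = original;
        this.words = normalize(original);
    }

    // Used for entries and for the search string alike.
    static List<String> normalize(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (w.length() >= MIN_WORD_LENGTH) {
                out.add(w);
            }
        }
        return out;
    }

    // 3 points per exact word match, 1 point per partial match.
    int score(List<String> searchWords) {
        int score = 0;
        for (String word : words) {
            for (String searchWord : searchWords) {
                if (word.equals(searchWord)) {
                    score += 3;
                } else if (word.contains(searchWord)) {
                    score++;
                }
            }
        }
        return score;
    }
}
```

The search string is normalized once per search with normalize(), and the resulting list is passed to score() for every entry, so no splitting or lower-casing happens inside the scoring loops.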
1) You can sort searchWords first. You could break out of the loop once your result word was alphabetically after your current search word.
2) Even better, sort both, then walk along both lists simultaneously to find where any matches occur.
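The second suggestion can be sketched as a merge-style walk over the two sorted arrays. This only finds exact matches (partial matches via contains() do not follow sort order), and unlike the nested loops it pairs each word at most once; the class name SortedMatchCounter is hypothetical.

```java
import java.util.Arrays;

class SortedMatchCounter {
    static int countExactMatches(String result, String searchStr) {
        String[] a = result.toLowerCase().split(" ");
        String[] b = searchStr.toLowerCase().split(" ");
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0;
        int j = 0;
        int matches = 0;
        while (i < a.length && j < b.length) {
            int cmp = a[i].compareTo(b[j]);
            if (cmp == 0) {        // same word in both lists
                matches++;
                i++;
                j++;
            } else if (cmp < 0) {  // a[i] sorts first: it cannot match anything later in b
                i++;
            } else {               // b[j] sorts first: it cannot match anything later in a
                j++;
            }
        }
        return matches;
    }
}
```

After sorting, the walk itself is linear in the combined number of words, compared with the product of the two lengths for the nested loops.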
You can use regular expressions for finding patterns and lengths of matched patterns (for latter classification/scoring).
