I want to implement a high-speed, in-memory Trie to serve as the backend for an auto-suggestion / spell checker.
Is there already a good implementation based on in-memory technologies like Hazelcast?
Also, which open source Java tool is best suited for this kind of usage?
I would use a plain NavigableSet like TreeSet. It's built in and supports range searches.
NavigableSet<String> words = new TreeSet<String>();
// add words.
String startsWith = ...
SortedSet<String> matching = words.subSet(startsWith, startsWith + '\uFFFF');
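For example, a complete runnable sketch (the word list is invented for illustration):

```java
import java.util.Collections;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixSearchDemo {
    // All words in the half-open range [prefix, prefix + '\uFFFF') start
    // with the prefix, because '\uFFFF' sorts after every character that
    // appears in normal text.
    public static SortedSet<String> withPrefix(NavigableSet<String> words, String prefix) {
        return words.subSet(prefix, prefix + '\uFFFF');
    }

    public static void main(String[] args) {
        NavigableSet<String> words = new TreeSet<>();
        Collections.addAll(words, "car", "cart", "cat", "dog");
        System.out.println(withPrefix(words, "ca")); // [car, cart, cat]
    }
}
```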
If you want something more memory efficient, you can use a sorted ArrayList.
List<String> words = new ArrayList<String>();
words.add("aa");
words.add("ab");
words.add("ac");
words.add("ba");
Collections.sort(words);
String startsWith = "a";
int first = Collections.binarySearch(words, startsWith);
int last = Collections.binarySearch(words, startsWith.concat("\uFFFF"));
// binarySearch returns (-(insertion point) - 1) on a miss;
// ~x turns that back into the insertion point.
if (first < 0) first = ~first;
if (last < 0) last = ~last - 1;
for (int i = first; i <= last; i++) {
    System.out.println(words.get(i));
}
I am writing a profanity filter. I have two nested for loops, as shown below. Is there a better way to avoid the nested loop and improve the time complexity?
boolean isProfane = false;
final String phraseInLowerCase = phrase.toLowerCase();
for (int start = 0; start < phraseInLowerCase.length(); start++) {
    if (isProfane) {
        break;
    }
    for (int offset = 1; offset < (phraseInLowerCase.length() - start + 1); offset++) {
        String subGeneratedCode = phraseInLowerCase.substring(start, start + offset);
        // blacklistPhraseSet is a HashSet which contains all profane words
        if (blacklistPhraseSet.contains(subGeneratedCode)) {
            isProfane = true;
            break;
        }
    }
}
Consider a Java 8 version of @Mad Physicist's implementation:
boolean isProfane = Stream.of(phrase.split("\\s+"))
.map(String::toLowerCase)
.anyMatch(w -> blacklistPhraseSet.contains(w));
or
boolean isProfane = Stream.of(phrase
.toLowerCase()
.split("\\s+"))
.anyMatch(w -> blacklistPhraseSet.contains(w));
If you want to check every possible run of consecutive characters, then your algorithm requires O(n^2) substring lookups, assuming you use a Set with O(1) lookup characteristics, like a HashSet. You would probably be able to reduce this by breaking the data and the blacklist into Trie structures and walking along each possibility that way.
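For example, a minimal sketch of that trie approach (class and method names here are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class TrieFilter {
    // One trie node: children keyed by character, plus an end-of-word flag.
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.isWord = true;
    }

    // From each start position, walk the trie instead of materializing
    // every substring; the work per position is bounded by the longest
    // blacklisted word, not by the phrase length.
    public boolean containsProfanity(String phrase) {
        String s = phrase.toLowerCase();
        for (int start = 0; start < s.length(); start++) {
            Node n = root;
            for (int i = start; i < s.length(); i++) {
                n = n.children.get(s.charAt(i));
                if (n == null) break;
                if (n.isWord) return true;
            }
        }
        return false;
    }
}
```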
A simpler approach might be to use a heuristic like "profanity always starts and ends at a word boundary". Then you can do
isProfane = false;
for (String word : phrase.toLowerCase().split("\\s+")) {
    if (blacklistPhraseSet.contains(word)) {
        isProfane = true;
        break;
    }
}
You won't improve the time complexity much, because splitting still iterates under the hood, but you could split the phrase on spaces and iterate over the resulting array of words.
Something like:
String[] arrayWords = phrase.toLowerCase().split(" ");
for (String word : arrayWords) {
    if (blacklistPhraseSet.contains(word)) {
        isProfane = true;
        break;
    }
}
The problem with this code is that it won't match compound words, whereas your code, as I understand it, will: with "f**k" in the blacklist, my code won't match "f**kwit", but yours will.
I have a set of Strings and a set of keywords.
Example
String 1 : Oracle and Samsung Electronics have reportedly forged a new partnership through which they will work together to deliver mobile cloud services. In a meeting last Thursday, Oracle co-CEO Mark Hurd and Shin Jong-kyun, head of Samsung Electronics’ mobile
String 2 : This is some random string.
Keywords : Oracle,Samsung
The function should return String 1 as the one with the highest rank. I could search each string for each keyword, but it would take too much time, as there will be a lot of strings and a huge set of keywords.
Create a data structure that maps each term that appears in any of the strings to all strings it appears in.
Map<String,List<Integer>> keyword2stringId;
If a string contains the same keyword multiple times, you could simply add it to the List multiple times, or -- if you prefer -- use a slightly different map which allows you to also keep a count:
Map<String,List<Pair<Integer,Integer>>> keyword2pair; // pair = id + count
Then for each keyword, you can look up the relevant strings and find the ones with the highest overlap, for instance like so:
// count the occurrences of all keywords in the different strings
int[] counts = new int[strings.length];
for (String keyword : keywords) {
    List<Integer> ids = keyword2stringId.get(keyword);
    if (ids == null) {
        continue; // keyword appears in no string
    }
    for (int index : ids) {
        counts[index]++;
    }
}
// find the string that has the highest number of keywords
int maxCount = 0;
int maxIndex = -1;
for (int i = 0; i < counts.length; i++) {
    if (counts[i] > maxCount) {
        maxCount = counts[i];
        maxIndex = i;
    }
}

// return the highest ranked string, or
// 'null' if no matching document was found
if (maxIndex == -1) {
    return null;
} else {
    return strings[maxIndex];
}
The advantage of this approach is that you can compute your map offline (that is, only once) and then use it again and again for different queries.
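For illustration, building the map offline could look like the sketch below; the `\\W+` tokenization and the lowercasing are my assumptions, not something the answer specifies:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // Map each term to the ids of the strings it appears in;
    // duplicates in the list count repeated occurrences.
    public static Map<String, List<Integer>> build(String[] strings) {
        Map<String, List<Integer>> keyword2stringId = new HashMap<>();
        for (int id = 0; id < strings.length; id++) {
            for (String term : strings[id].toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                keyword2stringId
                        .computeIfAbsent(term, k -> new ArrayList<>())
                        .add(id);
            }
        }
        return keyword2stringId;
    }
}
```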
It looks like you should try a search engine or search library like Lucene or Solr:
Lucene Core, our flagship sub-project, provides Java-based indexing
and search technology, as well as spellchecking, hit highlighting and
advanced analysis/tokenization capabilities.
Solr is the popular, blazing-fast, open source enterprise search
platform built on Apache Lucene™.
Both of these have support for what you need to do: search for some keywords and rank the results.
This program can't be less than O(n) complexity; that is, you have to check each word of the string against the keywords. Now, the only thing you can do is perform the check over each string all at once:
public int getRank(String string, String[] keyword) {
    int rank = 0;
    for (String word : string.split(" "))
        for (String key : keyword)
            if (word.equals(key))
                rank++;
    return rank;
}
In this easy example, rank is an int increased each time a keyword occurs in the string. Then fill an array of ranks for each string:
String[] strings = new String[]{"...", "...", "...", "...", ...};
String[] keyword = new String[]{"...", "...", "...", "...", ...};
int[] ranks = new int[stringsNumber];
for (int i = 0; i < stringsNumber; i++)
ranks[i] = getRank(strings[i], keyword);
I believe what you're really looking for is TF-IDF (Term Frequency / Inverse Document Frequency). The link provided should give you the information you need, or alternatively, as @Mysterion has pointed out, Lucene will do this for you. You don't necessarily need to deploy a complete Lucene/Solr/Elasticsearch installation; you could just make use of the classes you need to roll your own.
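As a rough illustration of the idea (this is not Lucene's API; the method and its parameters are invented, and a real system would precompute document frequencies from the index):

```java
public class TfIdf {
    // Minimal TF-IDF score of one query term against one document,
    // given precomputed document frequencies.
    public static double tfIdf(String term, String[] docTerms,
                               int docsContainingTerm, int totalDocs) {
        int tf = 0; // term frequency in this document
        for (String t : docTerms) {
            if (t.equalsIgnoreCase(term)) tf++;
        }
        if (tf == 0 || docsContainingTerm == 0) return 0.0;
        // Terms that occur in every document carry no signal (idf = 0).
        double idf = Math.log((double) totalDocs / docsContainingTerm);
        return tf * idf;
    }
}
```

Summing this score over all query terms gives each document a rank; the string with the highest total wins.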
I want to print a tree to the console in Java using a recursive function; one of the parameters is the depth of the tree, and I want to use it as the number of tab characters before the node names.
public void print(TreeNode node, int depth) {
    //something ...
    String prefix = "";
    for (int i = 0; i < depth; i++) {
        prefix += "\t";
    }
    //....
    List<TreeNode> subnodes = node.getNodes();
    for (int i = 0; i < subnodes.size(); i++) {
        System.out.println(prefix + subnodes.get(i).getTitle()); // title is the name of the node
    }
}
Is there any better solution for building the prefix than concatenating in a for loop?
For example, if depth is 2 I want "\t" twice, i.e. "\t\t"; in general, depth × "\t".
My solution uses a for loop, but is there anything better for this simple thing?
You may want to change your program in order to use StringBuilder class:
StringBuilder prefix = new StringBuilder();
for (int i = 0; i < depth; ++i) {
    prefix.append("\t");
}
....
System.out.println(prefix.toString() + subnodes.get(i).getTitle());
String in Java is immutable, and that's why when you modify it, a new copy of the String is actually created. If your tree is really big and deep (or tall :) StringBuilder should work faster and consume less memory.
If the for loop is a problem, you may consider solutions such as storing a map of depth level -> pre-calculated prefix, like this:
Map<Integer, String> prefixes = new HashMap<Integer, String>();

private void fillPrefixes(int maxTreeDepth) {
    StringBuilder prefix = new StringBuilder();
    for (int i = 0; i < maxTreeDepth; ++i) {
        prefixes.put(i, prefix.toString());
        prefix.append("\t");
    }
}
This may be useful for the case of huge trees, when you really do recalculate that prefix over 9000 times. Technically, the for loop is still there, but you do not recalculate the prefix each time you need it. The other side of the coin is increased memory consumption. So, to make the right decision you need to avoid premature optimization and do it only when you really need it, and also decide what is more critical: memory or execution time.
You may use Apache Commons Lang for that.
Maven dependency for it:
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.0</version>
</dependency>
And sample usage:
import org.apache.commons.lang3.StringUtils;
...
prefix = StringUtils.repeat("\t", depth);
I am sorting 1 million strings (each 50 chars) in an ArrayList with:
final Comparator<String> comparator = new Comparator<String>() {
    public int compare(String s1, String s2) {
        if (s2 == null || s1 == null)
            return 0;
        return s1.compareTo(s2);
    }
};
Collections.sort(list, comparator);
The average time for this is: 1300 millisec
How can I speed it up?
If you're using Java 6 or below you might get a speedup by switching to Java 7. In Java 7 they changed the sort algorithm to TimSort which performs better in some cases (in particular, it works well with partially sorted input). Java 6 and below used MergeSort.
But let's assume you're using Java 6. I tried three versions:
Collections.sort(): Repeated runs of the comparator you provided take about 3.0 seconds on my machine (including reading the input of 1,000,000 randomly generated lowercase ascii strings).
Radix Sort: Other answers suggested a Radix sort. I tried the following code (which assumes the strings are all the same length, and only lowercase ascii):
String[] A = list.toArray(new String[0]);
for (int i = stringLength - 1; i >= 0; i--) {
    int[] buckets = new int[26];
    int[] starts = new int[26];
    for (int k = 0; k < A.length; k++) {
        buckets[A[k].charAt(i) - 'a']++;
    }
    for (int k = 1; k < buckets.length; k++) {
        starts[k] = buckets[k - 1] + starts[k - 1];
    }
    String[] temp = new String[A.length];
    for (int k = 0; k < A.length; k++) {
        temp[starts[A[k].charAt(i) - 'a']] = A[k];
        starts[A[k].charAt(i) - 'a']++;
    }
    A = temp;
}
It takes about 29.0 seconds to complete on my machine. I don't think this is the best way to implement radix sort for this problem. For example, if you did a most-significant-digit sort then you could terminate early on unique prefixes. And there'd also be some benefit in using an in-place sort instead (there's a good quote about this: “The troubles with radix sort are in implementation, not in conception”). I'd like to write a better radix-sort-based solution that does this; if I get time I'll update my answer.
Bucket Sort: I also implemented a slightly modified version of Peter Lawrey's bucket sort solution. Here's the code:
Map<Integer, List<String>> buckets = new TreeMap<Integer, List<String>>();
for (String s : l) {
    int key = s.charAt(0) * 256 + s.charAt(1);
    List<String> list = buckets.get(key);
    if (list == null) buckets.put(key, list = new ArrayList<String>());
    list.add(s);
}
l.clear();
for (List<String> list : buckets.values()) {
    Collections.sort(list);
    l.addAll(list);
}
It takes about 2.5 seconds to complete on my machine. I believe this win comes from the partitioning.
So, if switching to Java 7's TimSort doesn't help you, then I'd recommend partitioning the data (using something like bucket sort). If you need even better performance, then you can also multi-thread the processing of the partitions.
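A sketch of what sorting the partitions in parallel might look like, reusing the two-character bucketing from above (it assumes strings of length ≥ 2 with small char values so keys don't collide; `parallelStream` is just one way to spread the work across cores):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ParallelBucketSort {
    public static List<String> sort(List<String> input) {
        // Partition by the first two characters, as in the bucket sort above.
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String s : input) {
            int key = s.charAt(0) * 256 + s.charAt(1);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
        }
        // Sort the buckets concurrently; each bucket is a distinct list, so
        // this is safe. TreeMap iteration order keeps the concatenation
        // globally sorted.
        buckets.values().parallelStream().forEach(Collections::sort);
        List<String> out = new ArrayList<>(input.size());
        for (List<String> bucket : buckets.values()) {
            out.addAll(bucket);
        }
        return out;
    }
}
```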
You didn't specify the sort algorithm you use; some are quicker than others (quick/merge vs. bubble).
Also, if you are running on a multi-core/multi-processor machine you can divide the sort between multiple threads (again, exactly how depends on the sort algorithm, but here's an example).
You can use a radix sort on the first two characters. If your first two characters are distinctive you can use something like:
List<String> strings = ... // your input
Map<Integer, List<String>> radixSort = new HashMap<Integer, List<String>>();
for (String s : strings) {
    int key = (s.charAt(0) << 16) + s.charAt(1);
    List<String> list = radixSort.get(key);
    if (list == null) radixSort.put(key, list = new ArrayList<String>());
    list.add(s);
}
strings.clear();
for (List<String> list : new TreeMap<Integer, List<String>>(radixSort).values()) {
    Collections.sort(list);
    strings.addAll(list);
}
which of the following is an efficient way to reverse words in a string ?
public String Reverse(StringTokenizer st) {
    String[] words = new String[st.countTokens()];
    int i = 0;
    while (st.hasMoreTokens()) {
        words[i] = st.nextToken();
        i++;
    }
    String output = "";
    for (int j = words.length - 1; j >= 0; j--) {
        output += words[j] + " ";
    }
    return output;
}
OR
public String Reverse(StringTokenizer st, String output) {
    if (!st.hasMoreTokens()) return output;
    output = st.nextToken() + " " + output;
    return Reverse(st, output);
}

public String ReverseMain(StringTokenizer st) {
    return Reverse(st, "");
}
While the first way seems more readable and straightforward, there are two loops in it. In the second method, I've tried doing it in a tail-recursive way, but I am not sure whether Java optimizes tail-recursive code.
you could do this in just one loop
public String Reverse(StringTokenizer st) {
    int length = st.countTokens();
    String[] words = new String[length];
    int i = length - 1;
    while (i >= 0) {
        words[i] = st.nextToken();
        i--;
    }
    return String.join(" ", words);
}
But I am not sure whether java does optimize tail-recursive code.
It doesn't. Or at least the Sun/Oracle Java implementations don't, up to and including Java 7.
References:
"Tail calls in the VM" by John Rose # Oracle.
Bug 4726340 - RFE: Tail Call Optimization
I don't know whether this makes one solution faster than the other. (Test it yourself ... taking care to avoid the standard micro-benchmarking traps.)
However, the fact that Java doesn't implement tail-call optimization means that the 2nd solution is liable to run out of stack space if you give it a string with a large (enough) number of words.
Finally, if you are looking for a more space efficient way to implement this, there is clever way that uses just a StringBuilder.
Create a StringBuilder from your input String
Reverse the characters in the StringBuilder using reverse().
Step through the StringBuilder, identifying the start and end offset of each word. For each start/end offset pair, reverse the characters between the offsets. (You have to do this using a loop.)
Turn the StringBuilder back into a String.
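A sketch of that approach (the method name is invented; it assumes words are separated by single spaces):

```java
public class ReverseWords {
    // Reverse the whole string, then re-reverse each word in place,
    // so the word order is reversed but each word reads normally.
    public static String reverseWords(String input) {
        StringBuilder sb = new StringBuilder(input).reverse();
        int start = 0;
        for (int i = 0; i <= sb.length(); i++) {
            if (i == sb.length() || sb.charAt(i) == ' ') {
                // Reverse the characters of the word in [start, i).
                for (int lo = start, hi = i - 1; lo < hi; lo++, hi--) {
                    char tmp = sb.charAt(lo);
                    sb.setCharAt(lo, sb.charAt(hi));
                    sb.setCharAt(hi, tmp);
                }
                start = i + 1;
            }
        }
        return sb.toString();
    }
}
```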
You can compare the two by timing both on a large number of inputs, e.g. reverse 100,000,000 strings and see how many seconds each takes. You could also compare start and end system timestamps to get the exact difference between the two functions.
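A rough way to do that timing (the warm-up and run counts are arbitrary; a serious micro-benchmark would use a harness such as JMH to avoid the usual pitfalls):

```java
public class TimingDemo {
    // Time a task: run it a number of times first so the JIT has
    // compiled it, then average the elapsed time over many runs.
    public static long avgNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) task.run();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) task.run();
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        long perCall = avgNanos(
                () -> new StringBuilder("the string to reverse").reverse(),
                10_000, 100_000);
        System.out.println("avg ns per call: " + perCall);
    }
}
```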
StringTokenizer is not deprecated but if you read the current JavaDoc...
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String[] strArray = str.split(" ");
StringBuilder sb = new StringBuilder();
for (int i = strArray.length - 1; i >= 0; i--)
    sb.append(strArray[i]).append(" ");
String reversedWords = sb.substring(0, sb.length() - 1); // strip trailing space