I have a large amount of data saved in interfaces (55 to be exact). Combined, they contain almost 170,000 strings in string arrays.
What I want to do is match the input element against the strings in a given interface and return all the elements that match the pattern of the input element.
I have saved the elements in string arrays, e.g.
String[] array = {"element 1", "element 2".....};
For every input I have to iterate over the whole string array in order to find all the matching elements in the list.
I am currently doing this:
for (int innerIterator = 0; innerIterator < dictionaryArrayLength; innerIterator++) {
    if (dictionaryArray[innerIterator].matches("" + input[iterator] + "\\D*")) {
        matchedWordList.add(dictionaryArray[innerIterator]);
    }
}
As the length of the array is in the thousands, it's taking quite a bit of time to answer.
I would like this code to perform better. I am currently thinking of changing the data structure I have used for the dictionaries. But is there a better way to iterate through the list and find all the matching elements?
If you are using Java 8 you could try the new Stream API:
Pattern inputPattern = Pattern.compile(input[iterator] + "\\D*");
List<String> matchingWords = Arrays.stream(dictionaryArray).parallel()
        .filter(word -> inputPattern.matcher(word).matches())
        .collect(Collectors.toList());
matchedWordList.addAll(matchingWords);
With the first line (Pattern inputPattern = Pattern.compile(input[iterator] + "\\D*");) you avoid:
the creation of the regex string at every loop iteration;
the creation of the corresponding Pattern object at every iteration (see the source code of String.matches).
With Arrays.stream you create a stream from an array.
With .parallel() you parallelize the stream in order to distribute the following operations among multiple threads (with huge sets of data it can improve performance considerably).
With .filter(word -> inputPattern.matcher(word).matches()) you select only the words matching your pattern.
With .collect(Collectors.toList()) you collect your results in a list.
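Putting it together, here is a minimal self-contained sketch; the sample dictionary and input values are just placeholders for the arrays described in the question:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DictionaryMatcher {
    public static void main(String[] args) {
        String[] dictionaryArray = {"element 1", "element 1a", "element 12", "element 2"};
        String[] input = {"element 1"};
        List<String> matchedWordList = new ArrayList<>();

        for (String in : input) {
            // Compile the pattern once per input, not once per dictionary word
            Pattern inputPattern = Pattern.compile(in + "\\D*");
            matchedWordList.addAll(Arrays.stream(dictionaryArray).parallel()
                    .filter(word -> inputPattern.matcher(word).matches())
                    .collect(Collectors.toList()));
        }
        System.out.println(matchedWordList);   // [element 1, element 1a]
    }
}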
Related
I am trying to implement an autocomplete method that receives an input string and searches through a given array of words for the words that have the input string as a prefix.
I have successfully implemented this by looping through the array of words and returning all the elements with the prefix of given input string. However, I believe that there should be an optimal way of doing this as I realized that I am looping through all elements of the array after sorting the elements.
I want to select only elements in the array that begin with the first letter of the given string and then search that sub-array for the words that have the string as their prefix. This will cut the processing time considerably and provide an optimal solution.
What data structure can I use that will not require me to loop through the entire given array, or dictionary key (I tried implementing this using a map), but only choose the sub-array containing elements close to the solution?
PS: I also played with ArrayUtils.subArray() of Apache Commons but could not retrieve the sub-array. Any ideas on how to implement this?
You could use a TreeSet. Once populated, given a String prefix you can get all matching words with:
Set<String> findWords(String prefix) {
    // Increment the last character of the prefix to get an exclusive upper bound
    // (see https://stackoverflow.com/questions/4002021/increment-last-letter-of-a-string)
    int len = prefix.length();
    String allButLast = prefix.substring(0, len - 1);
    String endPrefix = allButLast + (char) (prefix.charAt(len - 1) + 1);
    return treeSet.subSet(prefix, endPrefix);
}
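A quick usage sketch (the word list here is made up; treeSet is the field used by findWords above):

import java.util.Arrays;
import java.util.TreeSet;

TreeSet<String> treeSet = new TreeSet<>(
        Arrays.asList("apple", "application", "apply", "banana"));

System.out.println(findWords("app"));   // prints [apple, application, apply]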
Question: Is there an effective and efficient way to return a list of Strings that show up in a message given a list of words using Stream/Parallel Stream?
Let's say I have 'ArrayList banWords' which contains a list of words players cannot say. Now let's assume 'message' represents the message a player types. How would I check to see if 'message' contains any words in 'banWords' and if so, return all the words that appear in 'message' using Stream?
I'm asking this since I'm not very familiar with Stream and haven't found a suitable question that has been asked in the past. Currently, the code loops through every word in 'banWords' and checks if 'message' contains that word. If so, it gets added to a separate ArrayList.
for (String word : banWords)
    if (message.contains(word))
        // Adds word to a separate arraylist
However, I'm trying to see if there's a way I can use Stream or Parallel Stream to return the words. This is the closest I've found:
if (banWords.parallelStream().anyMatch(message::contains)) {
    // Adds the word to another list using banWords.parallelStream().filter(message::contains).findAny().get()
}
However, that only returns the last word that appears in banWords. For example, if banWords contains 'hello' and 'hey' and the message is 'hello hey,' instead of adding "hello" and "hey" as two separate words, it just adds "hey."
Any ideas on how I can effectively get a list of words in message? At this point, I'm looking for the most effective or quickest way to do this so if you have another way that doesn't use Streams, I would be happy to hear.
Suppose you have the String "ArrayList banWords". You can create a stream of strings and use filter to keep only the ones that contain "banWords":
List<String> list = Stream.of("ArrayList banWords")
        .filter(s -> s.contains("banWords"))
        .collect(Collectors.toList());
You can also create a stream from multiple strings:
List<String> list = Stream.of("ArrayList banWords", "Set banWords", "map")
        .filter(s -> s.contains("banWords"))
        .collect(Collectors.toList());
So in your case you would do:
List<String> list = banWords.stream()
        .filter(s -> message.contains(s))
        .collect(Collectors.toList());
List<String> found = Arrays.stream(message.split("\\s+"))   // split the message into words (whitespace-separated is assumed)
        .filter(bannedWordSet::contains)
        .collect(Collectors.toList());
Something to note: it's important to store your banned words in a Set (bannedWordSet above) instead of a List; the contains check will be much more efficient.
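A small self-contained sketch of that set-based approach, using the 'hello'/'hey' example from the question (splitting on whitespace is an assumption; adjust the tokenization as needed):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

Set<String> bannedWordSet = new HashSet<>(Arrays.asList("hello", "hey"));
String message = "hello hey friend";

// Split the message into words and keep only the banned ones
List<String> found = Arrays.stream(message.split("\\s+"))
        .filter(bannedWordSet::contains)
        .collect(Collectors.toList());
// found -> [hello, hey]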
You can collect to a list after filter():
List<String> foundWords = banWords.parallelStream().filter(message::contains).collect(Collectors.toList());
I have a large collection of Strings. I want to be able to find the Strings that begin with "Foo" or the Strings that end with "Bar". What would be the best Collection type to get the fastest results? (I am using Java)
I know that a HashSet is very fast for complete matches, but not for partial matches, I would think. So, what could I use instead of just looping through a List? Should I look into LinkedLists or similar types? Are there any Collection types that are optimized for this kind of query?
The best collection type for this problem is SortedSet. You would need two of them in fact:
Words in regular order.
Words with their characters inverted.
Once these SortedSets have been created, you can use method subSet to find what you are looking for. For example:
Words starting with "Foo":
forwardSortedSet.subSet("Foo","Fop");
Words ending with "Bar":
backwardSortedSet.subSet("raB","raC");
The reason we are "adding" 1 to the last search character is to obtain the whole range. The "ending" word is excluded from the subSet, so there is no problem.
EDIT: Of the two concrete classes that implement SortedSet in the standard Java library, use TreeSet. The other (ConcurrentSkipListSet) is oriented to concurrent programs and thus not optimized for this situation.
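A minimal sketch of the two-set approach (the word list is made up; each word is reversed with StringBuilder before being inserted into the backward set):

import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

List<String> words = Arrays.asList("FooFighter", "FooBar", "CrowBar", "Fop");

SortedSet<String> forwardSortedSet = new TreeSet<>(words);
SortedSet<String> backwardSortedSet = new TreeSet<>();
for (String w : words) {
    backwardSortedSet.add(new StringBuilder(w).reverse().toString());
}

// Words starting with "Foo"
System.out.println(forwardSortedSet.subSet("Foo", "Fop"));   // [FooBar, FooFighter]

// Words ending with "Bar" (results come back reversed; reverse them again to recover the originals)
System.out.println(backwardSortedSet.subSet("raB", "raC"));  // [raBooF, raBworC]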
It's been a while but I needed to implement this now and did some testing.
I already have a HashSet<String> as source so generation of all other datastructures is included in search time. 100 different sources are used and each time the data structures need to be regenerated. I only need to match a few single Strings each time. These tests ran on Android.
Methods:
1. Simple loop through the HashSet and call endsWith() on each string.
2. Simple loop through the HashSet and perform a precompiled Pattern match (regex) on each string.
3. Convert the HashSet to a single String joined by \n and do a single match on the whole String.
4. Generate a TreeSet with reversed Strings from the HashSet, then match with subSet() as explained by Mario Rossi.
Results:
Duration for method 1: 173ms (data setup:0ms search:173ms)
Duration for method 2: 6909ms (data setup:0ms search:6909ms)
Duration for method 3: 3026ms (data setup:2377ms search:649ms)
Duration for method 4: 2111ms (data setup:2101ms search:10ms)
Conclusion:
A SortedSet (TreeSet) is extremely fast at searching, much faster than just looping through all Strings. However, creating the structure takes a lot of time. Regexes are much slower, but generating a single large String out of hundreds of Strings is more of a bottleneck on Android/Java.
If only a few matches need to be made, you are better off looping through your collection. If you have many more matches to make, it may be very useful to use a TreeSet!
If the list of words is stable (not many words are added or deleted), a very good second alternative is to create 2 lists:
One with the words in normal order.
The second with the characters in each word reversed.
For speed purposes, make them ArrayLists, never LinkedLists or other variants, which perform extremely badly on random access (the core of binary search; see below).
After the lists are created, they can be sorted with method Collections.sort (only once each) and then searched with Collections.binarySearch. For example:
Collections.sort(forwardList);
Collections.sort(backwardList);
And then to search for words starting in "Foo":
int i = Collections.binarySearch(forwardList, "Foo");
if (i < 0) i = -i - 1;   // "Foo" itself not present: start at the insertion point
while (i < forwardList.size() && forwardList.get(i).startsWith("Foo")) {
    // Process String forwardList.get(i)
    i++;
}
And words ending in "Bar":
int i = Collections.binarySearch(backwardList, "raB");
if (i < 0) i = -i - 1;   // "raB" itself not present: start at the insertion point
while (i < backwardList.size() && backwardList.get(i).startsWith("raB")) {
    // Process String backwardList.get(i)
    i++;
}
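For completeness, a short sketch of how the two lists could be built from a source collection (the source data and the StringBuilder-based reversal are just example choices):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

List<String> words = Arrays.asList("FooFighter", "FooBar", "CrowBar");  // made-up source data

List<String> forwardList = new ArrayList<>(words);
List<String> backwardList = new ArrayList<>(words.size());
for (String w : words) {
    backwardList.add(new StringBuilder(w).reverse().toString());
}

Collections.sort(forwardList);
Collections.sort(backwardList);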
This is an interesting algorithm question I would like to get the community's opinion on. I am looking to loop through a sorted ArrayList<String> to get a boolean result indicating whether a String exists in the array that begins with certain characters.
Ex. Array {"he", "help", "helpless", "hope"}
search character = h
Result: true
search character = he
Result: true
search character = hea
Result: false
Now my first impression was that I should combine binary search with regex, but let me know if I am way off. While a trie would be the best implementation, I need a solution that minimizes heap memory (developing on Android), as this array in practice will contain ~10,000-20,000 entries (words).
I have a db that contains ~200,000 words. I am taking a subset beginning with a set letter (in my example h), which will contain ~20,000 entries, and inserting these into an array. I am then performing ~100-1,000 lookups/contains using this subset. The thought behind my approach was to improve performance (instead of querying the db each time) while minimizing the hit to memory (an array instead of a trie).
Perhaps a DAWG would optimize lookups; however, I'm not sure whether the size requirements for this structure would be significantly larger than an ArrayList's.
If you really want to avoid a trie, this should fit your needs:
NavigableSet<String> tree = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
tree.addAll(Arrays.asList("he", "help", "helpless", "hope"));
String[] queries = {"h", "he", "hea"};
for (String query : queries) {
    String higher = tree.ceiling(query);
    System.out.println(query + ": " + (higher != null && higher.startsWith(query)));
}
prints
h: true
he: true
hea: false
You should consider a skip list (http://en.wikipedia.org/wiki/Skip_list) as an option. Many Java implementations are readily available.
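For example, the JDK's own ConcurrentSkipListSet is a skip-list-backed NavigableSet, so the same ceiling-based prefix check shown above works with it; a sketch using the example words from the question:

import java.util.Arrays;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

NavigableSet<String> skipList = new ConcurrentSkipListSet<>(
        Arrays.asList("he", "help", "helpless", "hope"));

String query = "hea";
String higher = skipList.ceiling(query);
System.out.println(query + ": " + (higher != null && higher.startsWith(query)));  // hea: false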
How do I construct a tree from a file? I want to be able to read words from a file and then add them at the appropriate level.
It seems to me that you are trying to implement a trie.
Look here for a nice implementation in java: http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java
If you have only two levels in the tree before the leaves (the actual words), you can simply start with arrays of 28 elements and translate the letters to an index (i.e. a==1, b==2, etc.). The elements of the array can be some set/list that contains the full words. You can lazily create the arrays and lists (i.e. create the root array but leave nulls for the other arrays and word lists, then create an array/list when/if needed).
Am I reading the rules you need to follow correctly?
P.S. I think that using full-size arrays would not be too wasteful on space, and it should be very fast to address.
Update: #user1747976, well, each array would take around 28*4 or 28*8 bytes + 12 bytes of overhead. Hopefully you use compressed oops, so it is 28*4 + 12 = 124 bytes per array. Now it depends on whether you want to be memory efficient or processing efficient. To be memory efficient, you could use some kind of hashmap instead of arrays, but I'm not sure the additional overhead would be less than what you use with arrays. Processing will be worse for sure, though.
You need to loop a number of times depending on the required tree depth. Some pseudo code for inserting into the tree:
Object[] root = new Object[28];
String word = "something";
Object[] pos = root;
int wordInd = 1;
for (int i = 1; i <= TREE_DEPTH; i++) {
    int targetpos = letterInd(letter(wordInd, word));  // map the current letter to an array index
    if (i == TREE_DEPTH) {
        // Leaf level: store the full word in a set
        if (pos[targetpos] == null) pos[targetpos] = new HashSet<String>();
        ((Set<String>) pos[targetpos]).add(word);
        break;
    } else {
        if (pos[targetpos] == null) pos[targetpos] = new Object[28];
        wordInd++;
        pos = (Object[]) pos[targetpos];
    }
}
You can use a similar loop for retrieving words.
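For illustration, a retrieval loop along the same lines might look like this (letterInd, letter, TREE_DEPTH, root and word are the same assumed helpers and variables as in the insertion snippet above):

// Walk down the letter arrays for a given word/prefix and fetch the bucket at the leaf level
Object[] pos = root;
int wordInd = 1;
Set<String> result = Collections.emptySet();
for (int i = 1; i <= TREE_DEPTH; i++) {
    int targetpos = letterInd(letter(wordInd, word));
    if (pos[targetpos] == null) break;              // nothing stored under this path
    if (i == TREE_DEPTH) {
        result = (Set<String>) pos[targetpos];      // the words stored at this leaf
    } else {
        wordInd++;
        pos = (Object[]) pos[targetpos];
    }
}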
Adding
Starting at the root, search for the first (or current) letter. If that letter is found, move to that node and search for the next letter. If the letter is not found, search for a word that matches the current letter; if there is a similar word, add the current letter as a new node and move both words under it, otherwise add the word.
Note: This will result in a tree that is more optimized for searches than the tree shown in the example (adamant and adapt will be grouped under another 'a' node).
Update: Take a look at the Wikipedia article for Trie
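A minimal sketch of a node-based insert along the lines described above (a standard trie insert; the class and field names are just placeholders):

import java.util.HashMap;
import java.util.Map;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;   // true if a complete word ends at this node

    // Insert a word one character at a time, reusing nodes for shared prefixes
    void insert(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }
}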