trie building in java [duplicate]

trie building in java [duplicate] - java

This question already has answers here:
Trie implementation
(6 answers)
Closed 9 years ago.
How do I construct a tree from a file ? I want to be able to read them from a file and then add to appropriate level

It seems to me that you are trying to implement trie.
Look here for a nice implementation in java: http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java

If you have only two levels in the tree before leafs (the actual words), you can simply start with arrays with 28 elements being and translate the letters to the index (i.e. a==1, b==2, etc.). Elements of array can be some set/list that contains the full words. You can lazily create arrays and lists (i.e. create the root array but have nulls for the other arrays and list of words, then you create an array/list when/if needed).
Am I reading rules you should follow correctly?
P.S. I think that using arrays with full size each would not be too wasteful on space as well it should be very fast to address
Update: #user1747976, well each array would take around 28*4 or 28*8 bits + 12 bytes overhead. Hopefully you use compressed ops so it is 28*4+12=116bytes per array. Now it depends if you want to be memory efficient or processing efficient. To be memory efficient, you can use some kind of hashmap instead of arrays but I'm not sure the additional overhead will be less than what you use with arrays. Processing will be for sure worse though.
You need to use some clever loop a number of times depending on tree dept requirement. Some ugly pseudo code for inserting into tree:
root=new Object[28];
word="something";
pos = root;
wordInd=1;
for (int i=1; i<=TREE_DEPTH ; i++) {
targetpos = letterInd(letter(wordInd,word));
if (i==TREE_DEPTH) {
if (pos[targetpos] == null) pos[targetpos] = new HashSet<String>();
(Set) pos[targetpos].add(word);
break;
} else {
if (pos[targetpos] == null) pos[targetpos] = new Object[28];
wordInd++;
pos = pos[targetpos];
}
}
Similar loop you can use for retrieving words.

Adding
Starting at the root, search for the first (or current) letter. If that letter is found then move to that node and search for the next letter. If the letter is not found, search for a word that matches the current letter, if there is a similar word then add the current letter as a new node and move both words under that, otherwise add the word.
Note: This will result in a tree that is more optimized for searches then the tree shown in the example. (adamant and adapt will be grouped under another 'a' node)
Update: Take a look at the Wikipedia article for Trie

Related

Efficient way for checking if a string is present in an array of strings [duplicate]

This question already has answers here:
How do I determine whether an array contains a particular value in Java?
(30 answers)
Closed 2 years ago.
I'm working on a little project in java, and I want to make my algorithm more efficient.
What I'm trying to do is check if a given string is present in an array of strings.
The thing is, I know a few ways to check if a string is present in an array of strings, but the array I am working with is pretty big (around 90,000 strings) and I am looking for a way to make the search more efficient, and the only ways I know are linear search based, which is not good for an array of this magnitude.
Edit: So I tried implementing the advices that were given to me, but the code i wrote accordingly is not working properly, would love to hear your thoughts.`
public static int binaryStringSearch(String[] strArr, String str) {
int low = 0;
int high = strArr.length -1;
int result = -1;
while (low <= high) {
int mid = (low + high) / 2;
if (strArr[mid].equals(str)) {
result = mid;
return result;
}else if (strArr[mid].compareTo(str) < 0) {
low = mid + 1;
}else {
high = mid - 1;
}
}
return result;
}
Basically what it's supposed to do is return the index at which the string is present in the array, and if it is not in the array then return -1.

So you have a more or less fixed array of strings and then you throw a string at the code and it should tell you if the string you gave it is in the array, do I get that right?
So if your array pretty much never changes, it should be possible to just sort them by alphabet and then use binary search. Tom Scott did a good video on that (if you don't want to read a long, messy text written by someone who isn't a native english speaker, just watch this, that's all you need). You just look right in the middle and then check - is the string you have before or after the string in the middle you just read? If it is already precisely the right one, you can just stop. But in case it isn't, you can eliminate every string after that string in case it's after the string you want to find, otherwise every string that's before the just checked string. Of course, you also eliminate the string itself if it's not equal because - logic. And then you just do it all over again, check the string in the middle of the ones which are left (btw you don't have to actually delete the array items, it's enough just to set a variable for the lower and upper boundary because you don't randomly delete elements in the middle) and eliminate based on the result. And you do that until you don't have a single string in the list left. Then you can be sure that your input isn't in the array. So this basically means that by checking and comparing one string, you can't just eliminate 1 item like you could with checking one after the other, you can remove more then half of the array, so with a list of 256, it should only take 8 compares (or 9, not quite sure but I think it takes one more if you don't want to find the item but know if it exists) and for 65k (which almost matches your number) it takes 16. That's a lot more optimised.
If it's not already sorted and you can't because that would take way too long or for some reason I don't get, then I don't quite know and I think there would be no way to make it faster if it's not ordered, then you have to check them one by one.
Hope that helped!
Edit: If you don't want to really sort all the items and just want to make it a bit (26 times (if language would be random)) faster, just make 26 arrays for all letters (in case you only use normal letters, otherwise make more and the speed boost will increase too) and then loop through all strings and put them into the right array matching their first letter. That way it is much faster then sorting them normally, but it's a trade-off, since it's not so neat then binary search. You pretty much still use linear search (= looping through all of them and checking if they match) but you already kinda ordered the items. You can imagine that like two ways you can sort a buncha cards on a table if you want to find them quicker, the lazy one and the not so lazy one. One way would be to sort all the cards by number, let's just say the cards are from 1-100, but not continuously, there are missing cards. But nicely sorting them so you can find any card really quickly takes some time, so what you can do instead is making 10 rows of cards. In each one you just put your cards in some random order, so when someone wants card 38, you just go to the third row and then linearly search through all of them, that way it is much faster to find items then just having them randomly on your table because you only have to search through a tenth of the cards, but you can't take shortcuts once you're in that row of cards.

Depending on the requirements, there can be so many ways to deal with it. It's better to use a collection class for the rich API available OOTB.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and the insertion order does not matter: Use Set<String> set = new HashSet<>() and then you can use Set#contains to check the presence of a particular string.
Are the strings supposed to be unique i.e. the duplicate strings need to be discarded automatically and also the insertion order needs to be preserved: Use Set<String> set = new LinkedHashSet<>() and then you can use Set#contains to check the presence of a particular string.
Can the list contain duplicate strings. If yes, you can use a List<String> list = new ArrayList<>() to benefit from its rich API as well as get rid of the limitation of fixed size (Note: the maximum number of elements can be Integer.MAX_VALUE) beforehand. However, a List is navigated always in a sequential way. Despite this limitation (or feature), the can gain some efficiency by sorting the list (again, it's subject to your requirement). Check Why is processing a sorted array faster than processing an unsorted array? to learn more about it.

You could use a HashMap which stores all the strings if
Contains query is very frequent and lookup strings do not change frequently.
Memory is not a problem (:D) .

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number or unique strings to a data structure. Once this is done, I later need to constantly check for existence of the string in it.
If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).
However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?

If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found
Correct, checking a list for an item is linear in the number of entries of the list.
However, I am not keen on filling a HashMap where the value is completely arbitrary
You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.
You can put all your strings there, and then check for presence or absence of other strings in constant time;
Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings
if (knownString.contains(myString)) {
...
}

It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number by advance, or have a basic idea?), and what you expect the hit/miss ratio to be.
A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the wikipedia entry (a followup to the radix tree definition is in this page). There are Java implementations (one of them is here; however I have a fixed set of strings to inject, which is why I use a builder).
If your number of strings is really huge and you don't expect a minimal miss ratio then you might also consider using a bloom filter; the problem however is that it is probabilistic; but you can get very quick answers to "not there". Here also, there are implementations in Java (Guava has an implementation for instance).
Otherwise, well, a HashSet...

A HashSet is probably the right answer, but if you choose (for simplicity, eg) to search a list it's probably more efficient to concatenate your words into a String with separators:
String wordList = "$word1$word2$word3$word4$...";
Then create a search argument with your word between the separators:
String searchArg = "$" + searchWord + "$";
Then search with, say, contains:
bool wordFound = wordList.contains(searchArg);
You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.

As others mentioned HashSet is the way to go. But if the size is going to be large and you are fine with false positives (checking if the username exists) you can use BloomFilters (probabilistic data structure) as well.

Split an array of common English words into separate lists/arrays based on word length in Java

I'm trying to search an array of common English words to see if a specific word is contained in it, based on a text file. Since this array has >700,000 words and around 1000 words need to be checked if in the array multiple times, I thought it would be more efficient to separate the words into separate arrays or lists based on length. Is there an easy way to do this without using a switch or lots of if statements? Like so:
for(int i = 0; i < commonWordArray.length; i++) {
if(commonWordArray[i].length == 2) {
twoLetterList.add(commonWordArray[i]);
else if(commonWordArray[i].length == 3) {
threeLetterList.add(commonWordArray[i]);
else if(commonWordArray[i].length == 4) {
fourLetterList.add(commonWordArray[i]);
}
...etc
}
Then doing the same thing when checking the words:
for(int i = 0; i < checkWords.length; i++) {
if(checkWords[i].length == 2) {
if(twoLetterList.contains(checkWords[i])) {
...etc
}

Step 1
Create word buckets.
ArrayList<ArrayList<String>> buckets = new ArrayList<>();
for(int i = 0; i < maxWordLength; i++) {
buckets.add(new ArrayList<String>());
}
Step 2
Add words to your buckets.
buckets.get(word.length()).add(word);
This approach has the downside that some of your buckets may go unused. This is not an issue if you are only filtering common English words, as they do not exceed 30 characters in length. Creating 10-15 extra lists is a trivial overhead for a computer. The largest uncommon but non-technical word is 183 characters. Technical words exceed 180,000 characters, by which point this approach is clearly not practical.
The upside of this approach is that ArrayList.get() and ArrayList.add() both run in constant (O(1)) time.

Use a List<Set<String>> sets. That is, given a String word, find first the proper set (Set<String> set = sets.get(word.length)) - create the set if needed, extend the list if needed. Then just do a set.add(word). Done!
Edit/Hint: a (good) programmer should be lazy - if you need to do/write the same thing twice, you're doing something wrong.

Assuming you've got memory for it (which your current approach relies on), why not just a single Set<String>? Simpler, faster.

If you want to use multiple strings to search, you may want to try something like the Aho Corasick algorithm.
Alternatively, you may want to turn the problem around and check if a string from the 700k array is in the 1k array. To this, you won't have memory issues (imho) and you may do it with a simple dictionary (balanced tree). so you'd have 700k log2(1000).

Use a Trie, which is a memory-efficient storage mechanism which excels at storing words and checking for whether they exist or not.
Implementing one on your own is a fun exercise, or look at existing implementations.

Fastest way to find Strings in String collection that begin with certain chars

I have a large collection of Strings. I want to be able to find the Strings that begin with "Foo" or the Strings that end with "Bar". What would be the best Collection type to get the fastest results? (I am using Java)
I know that a HashSet is very fast for complete matches, but not for partial matches I would think? So, what could I use instead of just looping through a List? Should I look into LinkedList's or similar types? Are there any Collection Types that are optimized for this kind of queries?

The best collection type for this problem is SortedSet. You would need two of them in fact:
Words in regular order.
Words with their characters inverted.
Once these SortedSets have been created, you can use method subSet to find what you are looking for. For example:
Words starting with "Foo":
forwardSortedSet.subSet("Foo","Fop");
Words ending with "Bar":
backwardSortedSet.subSet("raB","raC");
The reason we are "adding" 1 to the last search character is to obtain the whole range. The "ending" word is excluded from the subSet, so there is no problem.
EDIT: Of the two concrete classes that implement SortedSet in the standard Java library, use TreeSet. The other (ConcurrentSkipListSet) is oriented to concurrent programs and thus not optimized for this situation.

It's been a while but I needed to implement this now and did some testing.
I already have a HashSet<String> as source so generation of all other datastructures is included in search time. 100 different sources are used and each time the data structures need to be regenerated. I only need to match a few single Strings each time. These tests ran on Android.
Methods:
Simple loop through HashSet and call endsWith() on
each string
Simple loop through HashSet and perform precompiled
Pattern match (regex) on each string.
Convert HashSet to single String joined by \n and
single match on whole String.
Generate SortedTree with reversed Strings from
HashSet. Then match with subset() as explained by #Mario Rossi.
Results:
Duration for method 1: 173ms (data setup:0ms search:173ms)
Duration for method 2: 6909ms (data setup:0ms search:6909ms)
Duration for method 3: 3026ms (data setup:2377ms search:649ms)
Duration for method 4: 2111ms (data setup:2101ms search:10ms)
Conclusion:
SortedSet/SortedTree is extremely fast in searching. Much faster than just looping through all Strings. However, creating the structure takes a lot of time. Regexes are much slower, but generating a single large String out of hundreds of Strings is more of a bottleneck on Android/Java.
If only a few matches need to be made, then you better loop through your collection. If you have much more matches to make it may be very useful to use a SortedTree!

If the list of words is stable (not many words are added or deleted), a very good second alternative is to create 2 lists:
One with the words in normal order.
The second with the characters in each word reversed.
For speed purposes, make them ArrayLists. Never LinkedLists or other variants which perform extremely bad on random access (the core of binary search; see below).
After the lists are created, they can be sorted with method Collections.sort (only once each) and then searched with Collections.binarySearch. For example:
Collections.sort(forwardList);
Collections.sort(backwardList);
And then to search for words starting in "Foo":
int i= Collections.binarySearch(forwardList,"Foo") ;
while( i < forwardList.size() && forwardList.get(i).startsWith("Foo") ) {
// Process String forwardList.get(i)
i++;
}
And words ending in "Bar":
int i= Collections.binarySearch(backwardList,"raB") ;
while( i < backwardList.size() && backwardList.get(i).startsWith("raB") ) {
// Process String backwardList.get(i)
i++;
}

What is a data structure that has O(1) for append, prepend, and retrieve element at any location?

I'm looking for Java solution but any general answer is also OK.
Vector/ArrayList is O(1) for append and retrieve, but O(n) for prepend.
LinkedList (in Java implemented as doubly-linked-list) is O(1) for append and prepend, but O(n) for retrieval.
Deque (ArrayDeque) is O(1) for everything above but cannot retrieve element at arbitrary index.
In my mind a data structure that satisfy the requirement above has 2 growable list inside (one for prepend and one for append) and also stores an offset to determine where to get the element during retrieval.

You're looking for a double-ended queue. This is implemented the way you want in the C++ STL, which is you can index into it, but not in Java, as you noted. You could conceivably roll your own from standard components by using two arrays and storing where "zero" is. This could be wasteful of memory if you end up moving a long way from zero, but if you get too far you can rebase and allow the deque to crawl into a new array.
A more elegant solution that doesn't really require so much fanciness in managing two arrays is to impose a circular array onto a pre-allocated array. This would require implementing push_front, push_back, and the resizing of the array behind it, but the conditions for resizing and such would be much cleaner.

A deque (double-ended queue) may be implemented to provide all these operations in O(1) time, although not all implementations do. I've never used Java's ArrayDeque, so I thought you were joking about it not supporting random access, but you're absolutely right — as a "pure" deque, it only allows for easy access at the ends. I can see why, but that sure is annoying...
To me, the ideal way to implement an exceedingly fast deque is to use a circular buffer, especially since you are only interested in adding removing at the front and back. I'm not immediately aware of one in Java, but I've written one in Objective-C as part of an open-source framework. You're welcome to use the code, either as-is or as a pattern for implementing your own.
Here is a WebSVN portal to the code and the related documentation. The real meat is in the CHAbstractCircularBufferCollection.m file — look for the appendObject: and prependObject: methods. There is even a custom enumerator ("iterator" in Java) defined as well. The essential circular buffer logic is fairly trivial, and is captured in these 3 centralized #define macros:
#define transformIndex(index) ((headIndex + index) % arrayCapacity)
#define incrementIndex(index) (index = (index + 1) % arrayCapacity)
#define decrementIndex(index) (index = ((index) ? index : arrayCapacity) - 1)
As you can see in the objectAtIndex: method, all you do to access the Nth element in a deque is array[transformIndex(N)]. Note that I make tailIndex always point to one slot beyond the last stored element, so if headIndex == tailIndex, the array is full, or empty if the size is 0.
Hope that helps. My apologies for posting non-Java code, but the question author did say general answers were acceptable.

If you treat append to a Vector/ArrayList as O(1) - which it really isn't, but might be close enough in practice -
(EDIT - to clarify - append may be amortized constant time, that is - on average, the addition would be O(1), but might be quite a bit worse on spikes. Depending on context and the exact constants involved, this behavior can be deadly).
(This isn't Java, but some made-up language...).
One vector that will be called "Forward".
A second vector that will be called "Backwards".
When asked to append -
Forward.Append().
When asked to prepend -
Backwards.Append().
When asked to query -
if ( Index < Backwards.Size() )
{
return Backwards[ Backwards.Size() - Index - 1 ]
}
else
{
return Forward[ Index - Backwards.Size() ]
}
(and also check for the index being out of bounds).

Your idea might work. If those are the only operations you need to support, then two Vectors are all you need (call them Head and Tail). To prepend, you append to head, and to append, you append to tail. To access an element, if the index is less than head.Length, then return head[head.Length-1-index], otherwise return tail[index-head.Length]. All of these operations are clearly O(1).

Here is a data structure that supports O(1) append, prepend, first, last and size. We can easily add other methods from AbstractList<A> such as delete and update
import java.util.ArrayList;
public class FastAppendArrayList<A> {
private ArrayList<A> appends = new ArrayList<A>();
private ArrayList<A> prepends = new ArrayList<A>();
public void append(A element) {
appends.add(element);
}
public void prepend(A element) {
prepends.add(element);
}
public A get(int index) {
int i = prepends.size() - index;
return i >= 0 ? prepends.get(i) : appends.get(index + prepends.size());
}
public int size() {
return prepends.size() + appends.size();
}
public A first() {
return prepends.isEmpty() ? appends.get(0) : prepends.get(prepends.size());
}
public A last() {
return appends.isEmpty() ? prepends.get(0) : appends.get(prepends.size());
}

What you want is a double-ended queue (deque) like the STL has, since Java's ArrayDeque lacks get() for some reason. There were some good suggestions and links to implementations here:
Java equivalent of std::deque?

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.