I'm new to Java and, as a learning project, I would like to write a little vocabulary application, so that the user can test himself but also search for entries. However, I'm struggling to find the right data structure for this, and even after spending the last few days googling, I'm still at a loss.
Here is what I have in mind for my vocabulary object:
import java.io.*;
class Vocab implements Serializable {
String lang1;
String lang2;
int rightAnswersInARow; // to influence what to ask during testing
int numberOfTimesSearched; // to influence search suggestions
// ... plus the appropriate setter and getter methods.
}
Now for the testing, at first glance an ArrayList seems to be the most appropriate (choosing a random number and then selecting that object to test). But what if I also want to factor in rightAnswersInARow and ask entries with a low value more often? My approach would be to count the number of objects for each value, give each value an interval (e.g. the interval for rightAnswersInARow = 0 would be inflated by a factor of 3) and then randomly select from there.
But even if I go through the ArrayList each time, get the rightAnswersInARow and determine the intervals... how would I then map the calculated number to the right index, since the elements are not sorted? So would a TreeSet be more appropriate?
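For what it's worth, here is a rough sketch of the weighted pick I have in mind (the factor-3 weighting rule and the getter name are just placeholders, not final code):

import java.util.List;
import java.util.Random;

// Rough sketch only: the weighting rule (factor 3 for rightAnswersInARow == 0,
// tapering down to 1) and the getter name are assumptions.
class Trainer {
    private final Random random = new Random();

    private int weightOf(Vocab v) {
        // the fewer right answers in a row, the more often we want to ask this word
        return Math.max(1, 3 - v.getRightAnswersInARow());
    }

    Vocab pickWeighted(List<Vocab> vocabs) {
        int total = 0;
        for (Vocab v : vocabs) {
            total += weightOf(v);
        }
        int r = random.nextInt(total); // 0 <= r < total
        for (Vocab v : vocabs) {
            r -= weightOf(v);          // walk through the "intervals"
            if (r < 0) {
                return v;              // r landed inside this entry's interval
            }
        }
        throw new IllegalStateException("list must not be empty");
    }
}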
To search for entries in both languages, and maybe even add a drop-down list with suggested words (like Google's search), I would need to find the strings quickly (HashMap?). Or maybe go through two TreeSets (one for each language) to reach the first element that starts with those letters, then select the next few elements from there? But that would mean the search would always suggest the same words, ignoring which words were searched for the most.
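To illustrate the TreeSet idea for one language, something like the sketch below is what I picture (the class and method names are made up); re-sorting the collected suggestions by numberOfTimesSearched afterwards might then cover the popularity part:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Rough sketch: one TreeSet per language, prefix lookup via tailSet.
class SuggestionIndex {
    private final TreeSet<String> words = new TreeSet<String>();

    void add(String word) {
        words.add(word.toLowerCase());
    }

    // Return up to 'limit' words starting with the given prefix.
    List<String> suggest(String prefix, int limit) {
        prefix = prefix.toLowerCase();
        List<String> result = new ArrayList<String>();
        // tailSet starts at the first word >= prefix; stop once the prefix no longer matches
        for (String word : words.tailSet(prefix)) {
            if (!word.startsWith(prefix) || result.size() >= limit) {
                break;
            }
            result.add(word);
        }
        return result;
    }
}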
What would you suggest? Have a HashMap with each value pair and manually implement something like a relational database?
Thank you in advance! :)
Related
I need some structure where to store N enums, some of them repeated, and be able to easily extract them. So far I've tried to use an EnumSet like this.
cards = EnumSet.of(
BEST_OF_THREE,
BEST_OF_THREE,
SIMPLE_QUESTION,
SIMPLE_QUESTION,
STAR);
But now I see it can only have one of each. Conceptually, which one would be the best structure to use for this problem?
Regards
jose
You can use a Map from the enum type to Integer, where the integer indicates how many of each there are. The Google Guava Multiset does this for you, and handles the edge cases of adding an enum when there is no entry for it yet, and of removing an enum when there are none left.
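If you don't want the Guava dependency, a plain counting map works too. A minimal sketch (the CardEnum name and its constants are taken from the snippets in this thread):

import java.util.EnumMap;
import java.util.Map;

enum CardEnum { BEST_OF_THREE, SIMPLE_QUESTION, STAR }

class CardCounts {
    private final Map<CardEnum, Integer> counts =
            new EnumMap<CardEnum, Integer>(CardEnum.class);

    void add(CardEnum card) {
        Integer current = counts.get(card);
        counts.put(card, current == null ? 1 : current + 1);
    }

    void remove(CardEnum card) {
        Integer current = counts.get(card);
        if (current == null) {
            return;                  // nothing to remove
        }
        if (current == 1) {
            counts.remove(card);     // drop the entry rather than keeping a zero
        } else {
            counts.put(card, current - 1);
        }
    }

    int count(CardEnum card) {
        Integer current = counts.get(card);
        return current == null ? 0 : current;
    }
}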
Another strategy is to use the enum's ordinal() index. Because this index is unique, you can use it to index into an int array sized to the number of enum constants, where the count in each slot indicates how many of each constant you have. Like this:
// initialize array for counting each enumeration type
// Java guarantees a new int[] is zero-filled, so every count starts at 0
int[] cardCount = new int[CardEnum.values().length];
...
// incrementing the count for an enumeration (when we add)
cardCount[BEST_OF_THREE.ordinal()]++;
...
// decrementing the count for an enumeration (when we remove)
cardCount[BEST_OF_THREE.ordinal()]--;
assert cardCount[BEST_OF_THREE.ordinal()] >= 0; // the count should never go negative
...
// getting the count for an enumeration
int count = cardCount[BEST_OF_THREE.ordinal()];
... Some time later
Having read the clarifying comments underneath the original post, it is clear that you're best off with a linear structure with one entry per element. I hadn't realized that you don't need detailed counts of how many of each you have. Storing them in a Multiset or an equivalent counting structure makes random picking awkward, because you have to map an index picked at random from [0, size) back to a particular bucket, which takes log time.
Sets don't allow duplicates, so if you want repeats you'll need either a List or a Map.
If you just need the number of duplicates, an EnumMap with Integer values is probably your best bet.
If the order is important, and you need quick access to the number of each type, you'll probably need to roll your own data structure.
If the order is important (but the count of each is not), then a List is the way to go, which implementation depends on how you will use it.
LinkedList - Best when there will be many inserts/removals from the beginning of the List. Indexing into a LinkedList is very expensive, and should be avoided whenever possible. If a List is built by shifting data onto the front of the list, but any later additions are at the end, conversion to an ArrayList once the initial List is built is a good idea - especially if indexing into the List is anticipated at any point.
ArrayList - When in doubt, this is a good place to start. Inserting or removing items requires shifting, so if this is a common operation look elsewhere.
TreeList - This is a good all-around option, and insertions and removals anywhere in the List are inexpensive. This does require the Apache commons library, and uses a bit more memory than the others.
Benchmarks, and the code used to generate them, can be found in this gist.
I have an assignment that I am working on, and I can't get a hold of the professor to get clarity on something. The idea is that we are writing an anagram solver, using a given set of words, that we store in 3 different dictionary classes: Linear, Binary, and Hash.
So we read in the words from a text file, and for the first two dictionary objects (linear and binary) we store the words in an ArrayList... easy enough.
But for the HashDictionary, he wants us to store the words in a Hashtable. I'm just not sure what the values are going to be for the Hashtable, or why you would do that. The instructions say we store the words in a Hashtable for quick retrieval, but I just don't get what the point of that is. It makes sense to store words in an ArrayList, but I'm just not sure how key/value pairing helps with a dictionary.
Maybe I'm not giving enough details, but I figured someone might have seen something like this and it's obvious to them.
Each of our classes has a contains method that returns a boolean representing whether or not a word passed in is in the dictionary. The linear dictionary does a linear search of the ArrayList, the binary one does a binary search of the ArrayList, and I'm not sure about the hash one...
The difference is speed. Both methods work, but the hash table is fast.
When you use an ArrayList, or any sort of List, to find an element, you must inspect each list item, one by one, until you find the desired word. If the word isn't there, you've looped through the entire list.
When you use a HashTable, you perform some "magic" on the word you are looking up, known as calculating the word's hash. Using that hash value, instead of looping through a list of values, you can immediately deduce where to find your word, or, if it doesn't exist in the hash, that it isn't there.
I've oversimplified here, but that's the general idea. You can find another question here with a variety of explanations on how a hash table works.
Here is a small code snippet utilizing a HashMap.
// We will map our words to their definitions; word is the key, definition is the value
Map<String, String> dictionary = new HashMap<String, String>();
map.put("hello","A common salutation");
map.put("chicken","A delightful vessel for protein");
// Later ...
map.get("chicken"); // Returns "A delightful vessel for protein";
The problem you describe asks that you use a HashMap as the basis for a dictionary that fulfills three requirements:
Adding a word to the dictionary
Removing a word from the dictionary
Checking if a word is in the dictionary
It seems counter-intuitive to use a map, which stores a key and a value, since all you really want to do is store just a key (or just a value). However, as I described above, a HashMap makes it extremely quick to find the value associated with a key. Similarly, it makes it extremely quick to see whether the HashMap knows about a key at all. We can leverage this quality by storing each of the dictionary words as a key in the HashMap, and associating it with a garbage value (since we don't care about it), such as null.
You can see how to fulfill the three requirements, as follows.
Map<String, Object> map = new HashMap<String, Object>();
// Add a word
map.put("word", null);
// Remove a word
map.remove("word");
// Check for the presence of a word
map.containsKey("word");
I don't want to overload you with information, but the requirements we have here align with a data structure known as a Set. In Java, a commonly used Set is the HashSet, which is almost exactly what you are implementing with this bit of your homework assignment. (In fact, if this weren't a homework assignment explicitly instructing you to use a HashMap, I'd recommend you instead use a HashSet.)
Arrays are hard to find stuff in. If I gave you array[0] = "cat"; array[1] = "dog"; array[2] = "pikachu";, you'd have to check each element just to know whether jigglypuff is a word. If I gave you hash["cat"] = 1; hash["dog"] = 1; hash["pikachu"] = 1;, the check is instant: you just look the key up directly. The value 1 doesn't matter in this particular case, although you can put useful information there, such as how many times you've looked up a word, or maybe 1 will mean "real word" and 2 will mean "name of a Pokémon", or for a real dictionary it could contain a sentence-long definition. Less relevant.
It sounds like you don't really understand hash tables then. Even Wikipedia has a good explanation of this data structure.
Your hash table is just going to be a large array of strings (initially all empty). You compute a hash value using the characters in your word, and then insert the word at that position in the table.
There are issues when the hash value for two words is the same. And there are a few solutions. One is to store a list at each array position and just shove the word onto that list. Another is to step through the table by a known amount until you find a free position. Another is to compute a secondary hash using a different algorithm.
The point of this is that hash lookup is fast. It's very quick to compute a hash value, and then all you have to do is check that the word at that array position exists (and matches the search word). You follow the same rules for hash value collisions (in this case, mismatches) that you used for the insertion.
You want your table size to be a prime number that is larger than the number of elements you intend to store. You also need a hash function that spreads similar words far apart, so that your data is more likely to be dispersed widely through your hash table (rather than being clustered heavily in one region).
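To make that concrete, here is a minimal sketch of a chained hash table of words; the class name, table size handling and hash function are all simplified for illustration:

import java.util.LinkedList;

// Simplified sketch: fixed-size table, separate chaining for collisions, no resizing.
class WordHashTable {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    WordHashTable(int size) {
        buckets = new LinkedList[size]; // ideally a prime larger than the expected word count
    }

    private int indexFor(String word) {
        int h = 0;
        for (int i = 0; i < word.length(); i++) {
            h = 31 * h + word.charAt(i); // spread the characters over the table
        }
        int index = h % buckets.length;
        return index < 0 ? index + buckets.length : index;
    }

    void add(String word) {
        int i = indexFor(word);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<String>();
        }
        if (!buckets[i].contains(word)) {
            buckets[i].add(word); // colliding words just share the bucket's list
        }
    }

    boolean contains(String word) {
        int i = indexFor(word);
        return buckets[i] != null && buckets[i].contains(word);
    }
}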
Hope this is a help and points you in the right direction.
I have a class along the lines of:
public class Observation {
private String time;
private double x;
private double y;
//Constructors + Setters + Getters
}
I can choose to store these objects in any type of collection (Standard class or 3rd party like Guava). I have stored some example data in an ArrayList below, but like I said I am open to any other type of collection that will do the trick. So, some example data:
ArrayList<Observation> ol = new ArrayList<Observation>();
ol.add(new Observation("08:01:23",2.87,3.23));
ol.add(new Observation("08:01:27",2.96,3.17));
ol.add(new Observation("08:01:27",2.93,3.20));
ol.add(new Observation("08:01:28",2.93,3.21));
ol.add(new Observation("08:01:30",2.91,3.23));
The example assumes a matching constructor in Observation. The timestamps are stored as String objects as I receive them as such from an external source but I am happy to convert them into something else. I receive the observations in chronological order so I can create and rely on a sorted collection of observations. The timestamps are NOT unique (as can be seen in the example data) so I cannot create a unique key based on time.
Now to the problem. I frequently need to find one (1) observation with a time equal to or nearest a certain time, e.g. if my time was 08:01:29 I would like to fetch the 4th observation in the example data, and if the time is 08:01:27 I want the 3rd observation.
I can obviously iterate through the collection until I find the time that I am looking for, but I need to do this frequently and at the end of the day I may have millions of observations so I need to find a solution where I can locate the relevant observations in an efficient manner.
I have looked at various collection types, including ones where I can filter the collection with Predicates, but I have failed to find a solution that would return one value, as opposed to a subset of the collection that fulfills the "<=" condition. I am essentially looking for the SQL equivalent of SELECT * FROM ol WHERE time <= t ORDER BY time DESC LIMIT 1.
I am sure there is a smart and easy way to solve my problem so I am hoping to be enlightened. Thank you in advance.
Try a TreeSet, providing a comparator that compares the time. It maintains an ordered set, and you can call TreeSet.floor(E) to get the greatest element less than or equal to the one you pass in (provide a dummy Observation with the time you are looking for). You also have headSet and tailSet for ordered subsets.
It has O(log n) time for adding and retrieving. I think it is very suitable for your needs.
If you prefer a Map you can use a TreeMap with similar methods.
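For example (Java 6, assuming a getTime() getter; note that a map keyed only by the time string would overwrite observations that share a timestamp, so use a List value or a tie-breaking key if you need to keep them all):

import java.util.Map;
import java.util.TreeMap;

// "HH:mm:ss" strings sort correctly as plain text, so they can be used as keys directly.
TreeMap<String, Observation> byTime = new TreeMap<String, Observation>();
for (Observation o : ol) {
    byTime.put(o.getTime(), o);
}

// Nearest observation at or before the requested time:
Map.Entry<String, Observation> floor = byTime.floorEntry("08:01:29");
Observation nearestBefore = (floor == null) ? null : floor.getValue(); // the 08:01:28 entry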
Sort your collection (an ArrayList will probably work best here) and use Collections.binarySearch, which returns an integer index of either a match or the "closest" possible match, i.e. it returns the...
index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size() if all elements in the list are less than the specified key.
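A sketch of that approach, reusing the list from the question (the Comparator, the getTime() getter and the dummy Observation constructor call are assumptions):

import java.util.Collections;
import java.util.Comparator;

// Keep the list sorted by time, then binary search with the same comparator.
Comparator<Observation> byTime = new Comparator<Observation>() {
    public int compare(Observation a, Observation b) {
        return a.getTime().compareTo(b.getTime());
    }
};
Collections.sort(ol, byTime);

int i = Collections.binarySearch(ol, new Observation("08:01:29", 0, 0), byTime);
if (i < 0) {
    i = -i - 2; // (insertion point - 1) = index of the last observation before the search time
}
Observation nearest = (i >= 0) ? ol.get(i) : null; // null if every observation is later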
Have the Observation class implement Comparable and use a TreeSet to store the objects, which will keep the elements sorted. TreeSet implements SortedSet, so you can use headSet or tailSet to get a view of the set before or after the element you're searching for. Use the first or last method on the returned set to get the element you're seeking.
If you are stuck with ArrayList, but can keep the elements sorted yourself, use Collections.binarySearch to search for the element. It returns a positive number if the exact element is found, or a negative number that can be used to determine the closest element. http://download.oracle.com/javase/1.4.2/docs/api/java/util/Collections.html#binarySearch(java.util.List,%20java.lang.Object)
If you are lucky enough to be using Java 6, and the performance overhead of keeping a SortedSet is not a big deal for you, take a look at the TreeSet ceiling, floor, higher and lower methods.
I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system relies heavily on the occurrences of terms in the indexed documents having a realistic distribution (and consequently a realistic IDF).
That said, I would like to use a small index to return results, but compute document scores using the IDF from a much larger index (or even an external source).
The Similarity API doesn't seem to allow me to do this: the idf method does not receive the term being scored as a parameter.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. Again, I'd rather not maintain a larger index, and this kind of query is not cheap either.
You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
// multiply before dividing (and widen to long) so integer division doesn't truncate the result to zero
int myNewDocFreq = (int) ((long) bigIndexReader.docFreq(t) * smallIndexReader.maxDoc() / bigIndexReader.maxDoc());
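A rough sketch of what that wrapper might look like, assuming a Lucene 3.x-style API where FilterIndexReader delegates everything to a wrapped reader (the class and field names here are illustrative):

import java.io.IOException;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Serves everything from the small index, but takes document frequencies
// from the big index and scales them down to the small index's size.
class ScaledDocFreqReader extends FilterIndexReader {
    private final IndexReader bigReader;

    ScaledDocFreqReader(IndexReader smallReader, IndexReader bigReader) {
        super(smallReader); // "in" (the delegate) is the small index
        this.bigReader = bigReader;
    }

    @Override
    public int docFreq(Term t) throws IOException {
        // use floating point so the ratio doesn't truncate to zero
        double ratio = (double) in.maxDoc() / bigReader.maxDoc();
        return (int) Math.round(bigReader.docFreq(t) * ratio);
    }
}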
I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.
First, the terminology is basically just a number of names (or terms). For each word, we split it into tokens by space and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store in a list an index to the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so traw will never match strawberry). After that, we go through children of each successive node. If there is a child with a character that matches, we continue the search with the next character of the search token. If a child does not match the given character, we look at the children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
When we reach the end of the search token, we search through the rest of the trie structure at that node to get all potential matches (since the indexes into the master term list are only on the terminal nodes). We call this the roll-up. We store the indices by setting their bits in a BitSet. Then we simply AND together the BitSets from each search token's results. We then take, say, the first 1000 or 5000 indices from the ANDed BitSet and find the actual terms they correspond to. We use Levenshtein distance to score each term and then sort by score to get our final results.
This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.
For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should be at least two). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since it is two words. Ideally, we'd want it to match something like Cardiac Catheterization.
Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map of the index to a Terminfo object. This object stores the index of the query phrase and the term phrase and the score. So if the search is "car cat" and a term matched is "Catheterization procedure" the term phrase indices will be 1 as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2 as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices and if they aren't at least equal to the search word count, they can be discarded.
After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on a one-word search, now it can be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations and the fact that we have to return a large number of matches. We could limit each search to only return 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1 million terms).
Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.
So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the speed slow down that it introduced? I can include some selected code if it might make things clearer but I didn't want to post a giant wall of code.
Wow... tough one.
Well, why don't you use Lucene? It is the current state of the art when it comes to problems like yours, AFAIK.
However I want to share some thoughts...
Fuzziness isn't something like straw*; it's rather the mistyping of some words, where every missing/wrong character adds 1 to the distance.
It's generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!
Tokenizing is generally a good idea.
Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?
I have seen some pretty nice implementations using multi dimensional range trees.
But I really think if you want to accomplish all of the above you need a pretty neat combination of a graph set and a nice indexing algorithm.
You could, for example, use a semantic database like Sesame and, when importing your documents, import every token and document as a node. Then, depending on position in the document etc., you can add a weighted relation.
Then you need the tokens in some structure where you can do efficient fuzzy matches, such as BK-trees.
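For reference, a BK-tree is small enough to sketch. This is a generic illustration (not tied to your trie code), using plain Levenshtein distance:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree sketch: children are keyed by their edit distance to the parent word.
class BkTree {
    private Node root;

    private static class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<Integer, Node>();
        Node(String word) { this.word = word; }
    }

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) return; // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // Collect every stored word within maxDist edits of the query.
    List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<String>();
        if (root != null) search(root, query, maxDist, results);
        return results;
    }

    private void search(Node node, String query, int maxDist, List<String> results) {
        int d = distance(query, node.word);
        if (d <= maxDist) results.add(node.word);
        // Triangle inequality: only children keyed in [d - maxDist, d + maxDist] can contain matches.
        for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
            Node child = node.children.get(i);
            if (child != null) search(child, query, maxDist, results);
        }
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    private static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}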
I think you could index the tokens in a MySQL database and use bit-wise comparison functions to get differences. There's a function that returns all matching bits; if you transliterate your strings to ASCII and group the bits, you could achieve something pretty fast.
However, if you matched the tokens to the string, you can construct a hypothetical perfect-match entity and query your semantic database for the nearest neighbours.
You would have to break the words apart into partial words when tokenizing to achieve partial matches.
However, you can also do wildcard matches (prefix, suffix or both), but no fuzziness then.
You can also index the whole word or different concatenations of tokens.
However, there may be special BK-tree implementations that support this, but I have never seen one.
I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used a repeated depth-first trie walk, bounded by Levenshtein distance, because each additional increment of distance results in much more of the trie being walked; the cost, for small distances, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
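For illustration only, here is a rough sketch of that kind of bounded walk (not the original code, and the names are made up): each step down the trie computes the next row of the Levenshtein matrix against the query, and a branch is abandoned as soon as every cell in its row exceeds the allowed distance.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a dictionary trie searched depth-first, pruning any branch
// whose whole Levenshtein row already exceeds the allowed distance.
class FuzzyTrie {
    private final Node root = new Node();

    private static class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        String word; // set only on terminal nodes
    }

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
            }
            node = child;
        }
        node.word = word;
    }

    // Return every dictionary word within maxDist edits of the query.
    List<String> search(String query, int maxDist) {
        List<String> results = new ArrayList<String>();
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) {
            firstRow[j] = j; // distance from the empty prefix to each query prefix
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        }
        return results;
    }

    private void walk(Node node, char c, String query, int[] prevRow, int maxDist, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = (query.charAt(j - 1) == c) ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) {
            results.add(node.word); // this terminal node is close enough
        }
        if (rowMin <= maxDist) { // bound: no deeper word can do better than the row minimum
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, row, maxDist, results);
            }
        }
    }
}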
(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)
I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.
REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structure, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.