Java find nearest (or equal) value in collection - java

I have a class along the lines of:
public class Observation {
private String time;
private double x;
private double y;
//Constructors + Setters + Getters
}
I can choose to store these objects in any type of collection (Standard class or 3rd party like Guava). I have stored some example data in an ArrayList below, but like I said I am open to any other type of collection that will do the trick. So, some example data:
ArrayList<Observation> ol = new ArrayList<Observation>();
ol.add(new Observation("08:01:23",2.87,3.23));
ol.add(new Observation("08:01:27",2.96,3.17));
ol.add(new Observation("08:01:27",2.93,3.20));
ol.add(new Observation("08:01:28",2.93,3.21));
ol.add(new Observation("08:01:30",2.91,3.23));
The example assumes a matching constructor in Observation. The timestamps are stored as String objects as I receive them as such from an external source but I am happy to convert them into something else. I receive the observations in chronological order so I can create and rely on a sorted collection of observations. The timestamps are NOT unique (as can be seen in the example data) so I cannot create a unique key based on time.
Now to the problem. I frequently need to find one (1) observation with a time equal or nearest to a certain time, e.g if my time was 08:01:29 I would like to fetch the 4th observation in the example data and if the time is 08:01:27 I want the 3rd observation.
I can obviously iterate through the collection until I find the time that I am looking for, but I need to do this frequently and at the end of the day I may have millions of observations so I need to find a solution where I can locate the relevant observations in an efficient manner.
I have looked at various collection-types including ones where I can filter the collections with Predicates but I have failed to find a solution that would return one value, as opposed to a subset of the collection that fulfills the "<="-condition. I am essentially looking for the SQL equivalent of SELECT * FROM ol WHERE time <= t LIMIT 1.
I am sure there is a smart and easy way to solve my problem so I am hoping to be enlightened. Thank you in advance.

Try TreeSet providing a comparator that compares the time. It mantains an ordered set and you can ask for TreeSet.floor(E) to find the greatest min (you should provide a dummy Observation with the time you are looking for). You also have headSet and tailSet for ordered subsets.
It has O(log n) time for adding and retrieving. I think is very suitable for your needs.
If you prefer a Map you can use a TreeMap with similar methods.

Sort your collection (ArrayList will probably work best here) and use BinarySearch which returns an integer index of either a match of the "closest" possible match, ie it returns an...
index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(),

Have the Observation class implement Comparable and use a TreeSet to store the objects, which will keep the elements sorted. TreeSet implements SortedSet, so you can use headSet or tailSet to get a view of the set before or after the element you're searching for. Use the first or last method on the returned set to get the element you're seeking.
If you are stuck with ArrayList, but can keep the elements sorted yourself, use Collections.binarySearch to search for the element. It returns a positive number if the exact element is found, or a negative number that can be used to determine the closest element. http://download.oracle.com/javase/1.4.2/docs/api/java/util/Collections.html#binarySearch(java.util.List,%20java.lang.Object)

If you are lucky enough to be using Java 6, and the performance overhead of keeping a SortedSet is not a big deal for you. Take a look at TreeSet ceiling, floor, higher and lower methods.

Related

Reducing time complexity

I have list of objects which contains a statusEnum. Now, I want to return all those objects which falls under specific list of provided statuses.
A simple solution is to loop on list of objects and then another for loop on provided list of statusEnums ... This would work however, It would make the time complexity of O(n)^2. Is there a way i could reduce it to O(n) ?
I can't change the map. The only other solution i could think of is maintaining another map based on statusEnums as the key but then it would increase the space complexity a lot.
EDIT
I had hashMap of objects (which i said as a list)
Here is the code which i came up with for others ...
public List<MyObjects> getObjectsBasedOnCriteria (List<ObjectStatus> statuses, String secondCriteria){
EnumSet<ObjectStatus> enumSet = EnumSet.copyOf(statuses);
for (Map.Entry<Long, MyObject> objEntry : myObjs.entrySet()){
MyObjects obj = objEntry.getValue();
if (enumSet.contains(obj.getStatus()) && obj.equals(secondCriteria)){
...
}
}
}
Use an Set to hold statusEnums (probably an EnumSet), and check if each instance's status is in that set using set.contains(object.getStatus()), or whatever.
Lookups in EnumSet and HashSet are O(1), so the solution is linear (assuming just one status per object). EnumSet.contains is more efficient than HashSet.contains with enum values; however, the choice is irrelevant to overall time complexity.
Assuming you have a sane number of statuses, esp if you have a enum of statuses you can use an EnumSet to match the status or a HashMap.
Even if you don't do this the time complexity is O(n * m) where n is the number of entries and m if the number of statuses you are looking for. In general it is assumed that you will have much more records than you have statuses you are checking for.
The number of possible enum values is limited to a few thousand due to a limitation in way Java is compiled so this is always an upper bound for enums.

Why using Hashmap.containsKey run faster considerably than Arrays.binarySearch?

I have two lists of phone numbers. 1st list is a subset of 2nd list. I ran two different algorithms below to determine which phone numbers are contained in both of two lists.
Way 1:
Sortting 1st list: Arrays.sort(FirstList);
Looping 2nd list to find matched element: If Arrays.binarySearch(FistList, 'each of 2nd list') then OK
Way 2:
Convert 1st list into HashMap with key/valus is ('each of 1st list', Boolean.TRUE)
Looping 2nd list to find matched element: If FirstList.containsKey('each of 2nd list') then OK
It results in Way 2 ran within 5 seconds is faster considerably than Way 1 with 39 seconds. I can't understand the reason why.
I appreciate your any comments.
Because hashing is O(1) and binary searching is O(log N).
HashMap relies on a very efficient algorithm called 'hashing' which has been in use for many years and is reliable and effective. Essentially the way it works is to split the items in the collection into much smaller groups which can be accessed extremely quickly. Once the group is located a less efficient search mechanism can be used to locate the specific item.
Identifying the group for an item occurs via an algorithm called a 'hashing function'. In Java the hashing method is Object.hashCode() which returns an int representing the group. As long as hashCode is well defined for your class you should expect HashMap to be very efficient which is exactly what you've found.
There's a very good discussion on the various types of Map and which to use at Difference between HashMap, LinkedHashMap and TreeMap
My shorthand rule-of-thumb is to always use HashMap unless you can't define an appropriate hashCode for your keys or the items need to be ordered (either natural or insertion).
Look at the source code for HashMap: it creates and stores a hash for each added (key, value) pair, then the containsKey() method calculates a hash for the given key, and uses a very fast operation to check if it is already in the map. So most retrieval operations are very fast.
Way 1:
Sorting: around O(nlogn)
Search: around O(logn)
Way 2:
Creating HashTable: O(n) for small density (no collisions)
Contains: O(1)

Structure like Java's EnumSet that can hold repeated elements

I need some structure where to store N Enums, some of them repeated. And be able to easily extract them. So far I've try to use the EnumSet like this.
cards = EnumSet.of(
BEST_OF_THREE,
BEST_OF_THREE,
SIMPLE_QUESTION,
SIMPLE_QUESTION,
STAR);
But now I see it can only have one of each. Conceptually, which one would be the best structure to use for this problem.
Regards
jose
You can use a Map of type Enumeration -> Integer, where the integer indicates how many of each there are. The google guava "MultiSet" does this for you, and handles the edge cases of adding an enum to the set when there is not already an entry, and removing an enum when it leaves none left.
Another strategy is to use the Enumeration ordinal index. Because this index is unique, you can use this to index into an int array that is sized to the Enumeration size, where the count in each array slot would indicate how many of each enumeration you have. Like this:
// initialize array for counting each enumeration type
// TODO: someone should double check every initial value will be zero
int[] cardCount = new int[CardEnum.values().length];
...
// incrementing the count for an enumeration (when we add)
cardCount[BEST_OF_THREE.ordinal()]++;
...
// decrementing the count for an enumeration (when we remove)
cardCount[BEST_OF_THREE.ordinal()]--;
// DEBUG: assert cardCount[BEST_OF_THREE.ordinal()] >= 0
...
// getting the count for an enumeration
int count = cardCount[BEST_OF_THREE.ordinal()];
... Some time later
Having read the clarifying comments underneath the original post that explained what the OP was asking, it is clear that you're best off with a linear structure with an entry per element. I didn't realize that you didn't need detailed information on how many of each you needed. Storing them in a MultiSet or an equivalent counting structure makes it hard to randomly pick, as you need to attribute an index picked at random from [0, size) to a particular container, which takes log time.
Sets don't allow duplicates, so if you want repeats you'll need either a List or a Map.
If you just need the number of duplicates, an EnumMap with Integer values is probably your best bet.
If the order is important, and you need quick access to the number of each type, you'll probably need to roll your own data structure.
If the order is important (but the count of each is not), then a List is the way to go, which implementation depends on how you will use it.
LinkedList - Best when there will be many inserts/removals from the beginning of the List. Indexing into a LinkedList is very expensive, and should be avoided whenever possible. If a List is built by shifting data onto the front of the list, but any later additions are at the end, conversion to an ArrayList once the initial List is built is a good idea - especially if indexing into the List is anticipated at any point.
ArrayList - When in doubt, this is a good place to start. Inserting or removing items requires shifting, so if this is a common operation look elsewhere.
TreeList - This is a good all-around option, and insertions and removals anywhere in the List are inexpensive. This does require the Apache commons library, and uses a bit more memory than the others.
Benchmarks, and the code used go generate them can be found in this gist.

Comparator for TreeBag to sort by the number of occurrences

I have a source of strings (let us say, a text file) and many strings repeat multiple times. I need to get the top X most common strings in the order of decreasing number of occurrences.
The idea that came to mind first was to create a sortable Bag (something like org.apache.commons.collections.bag.TreeBag) and supply a comparator that will sort the entries in the order I need. However, I cannot figure out what is the type of objects I need to compare. It should be some kind of an internal map that combines my object (String) and the number of occurrences, generated internally by TreeBag. Is this possible?
Or would I be better off by simply using a hashmap and sort it by value as described in, for example, Java sort HashMap by value
Why don't you put the strings in a map. Map of string to number of times they appear in text.
In step 2, traverse the items in the map and keep on adding them to a minimum heap of size X. Always extract min first if the heap is full before inserting.
Takes nlogx time.
Otherwise after step 1 sort the items by number of occurrences and take first x items. A tree map would come in helpful here :) (I'd add a link to the javadocs, but I'm in a tablet )
Takes nlogn time.
With Guava's TreeMultiset, just use Multisets.copyHighestCountFirst.

Java Collections - Effienct search for DateTime ranges

I have a case where I have a table (t1) which contains items like
| id | timestamp | att1 | att2 |
Now I have to iterate over a collection of elements of type att1 and get all records from t1 which are between two certain timestamps for this att1. I have to do this operation several times for a single att1.
So in order to go easy on the database queries, I intended to load every entry from t1 which has a certain att1 attribute once into a collection and perform the subsequent searches on this collection.
Is there a collection that could handle a search like between '2011-02-06 09:00:00' and '2011-02-06 09:00:30'? It's not guaranteed to contain entries for those two timestamps.
Before writing an implementation for that (most likely a very slow implementation ^^) I wanted to ask you guys if there might be some existing collections already or how I could tackle this problem.
Thanks!
Yes. Use TreeMap which is basically a sorted map of key=>value pairs and its method TreeMap::subMap(fromKey, toKey).
In your case you would use timestamps as keys to the map and for values att1 attribute or id or whatever else would be most convenient for you.
The closest I can think of, and this isn't really what I would consider ideal, is to write a comparator that will sort dates so that those within the range count as less than those outside the range (always return -1 when comparing in to out, 0 when comparing in to in or out to out, and always return +1 when comparing out to in.
Then, use this comparator to sort a collection (I suggest an ArrayList). The values within the range will appear first.
You might just be better off writing your own filter, though. Input a collection (I recommend a LinkedList), iterate over it, and remove anything not in the range. Keep a master copy around for spawning new ones to pass into the filter, if you need to.
You can make the object you want in your collection, which I think is att1, implement the Comparable interface and then have the compareTo method compare the timestamp field. With this in place it will work in any sorted collection, such as a treeSet, making it easy to iterate and pull out everything in a certain range.

Categories