Most efficient way to check if row exists in grid Java - java

All,
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair(as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foo's are not guaranteed to be in the same order in each Set (so they will have to be pre-sorted for the correct order somehow, like a TreeSet)
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks

If you use TreeSets for your Sets can't you just do list.contains(set) since a TreeSet will handle the equals check?
Also, consider using Guava's MultSet class.Multiset

I would recommend you use a less weird data structure. As for finding stuff: Generally Hashes or Sorting + Binary Searching or Trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: If this is not a purely academical question, Loop through the lists, and do the comparison. Most likely, that is acceptably fast. Even 100'000 entries will take a fraction of a second, and therefore not matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.

Related

Structure like Java's EnumSet that can hold repeated elements

I need some structure where to store N Enums, some of them repeated. And be able to easily extract them. So far I've try to use the EnumSet like this.
cards = EnumSet.of(
BEST_OF_THREE,
BEST_OF_THREE,
SIMPLE_QUESTION,
SIMPLE_QUESTION,
STAR);
But now I see it can only have one of each. Conceptually, which one would be the best structure to use for this problem.
Regards
jose
You can use a Map of type Enumeration -> Integer, where the integer indicates how many of each there are. The google guava "MultiSet" does this for you, and handles the edge cases of adding an enum to the set when there is not already an entry, and removing an enum when it leaves none left.
Another strategy is to use the Enumeration ordinal index. Because this index is unique, you can use this to index into an int array that is sized to the Enumeration size, where the count in each array slot would indicate how many of each enumeration you have. Like this:
// initialize array for counting each enumeration type
// TODO: someone should double check every initial value will be zero
int[] cardCount = new int[CardEnum.values().length];
...
// incrementing the count for an enumeration (when we add)
cardCount[BEST_OF_THREE.ordinal()]++;
...
// decrementing the count for an enumeration (when we remove)
cardCount[BEST_OF_THREE.ordinal()]--;
// DEBUG: assert cardCount[BEST_OF_THREE.ordinal()] >= 0
...
// getting the count for an enumeration
int count = cardCount[BEST_OF_THREE.ordinal()];
... Some time later
Having read the clarifying comments underneath the original post that explained what the OP was asking, it is clear that you're best off with a linear structure with an entry per element. I didn't realize that you didn't need detailed information on how many of each you needed. Storing them in a MultiSet or an equivalent counting structure makes it hard to randomly pick, as you need to attribute an index picked at random from [0, size) to a particular container, which takes log time.
Sets don't allow duplicates, so if you want repeats you'll need either a List or a Map.
If you just need the number of duplicates, an EnumMap with Integer values is probably your best bet.
If the order is important, and you need quick access to the number of each type, you'll probably need to roll your own data structure.
If the order is important (but the count of each is not), then a List is the way to go, which implementation depends on how you will use it.
LinkedList - Best when there will be many inserts/removals from the beginning of the List. Indexing into a LinkedList is very expensive, and should be avoided whenever possible. If a List is built by shifting data onto the front of the list, but any later additions are at the end, conversion to an ArrayList once the initial List is built is a good idea - especially if indexing into the List is anticipated at any point.
ArrayList - When in doubt, this is a good place to start. Inserting or removing items requires shifting, so if this is a common operation look elsewhere.
TreeList - This is a good all-around option, and insertions and removals anywhere in the List are inexpensive. This does require the Apache commons library, and uses a bit more memory than the others.
Benchmarks, and the code used go generate them can be found in this gist.

Efficient way to find the difference between two data sets

I have two copies of data, here 1 represents my volumes and 2 represent my issues. I have to compare COPY2 with COPY1 and find all the elements which are missing in COPY2 (COPY1 will always be a superset and COPY2 can be equal or will always be a subset).
Now, I have to get the missing volume and the issue in COPY2.
Such that from the following figure(scenario) I get the result as : -
Missing files – 1-C, 1-D, 2-C, 2-C, 3-A, 3-B, 4,E.
Question-
What data structure should I use to store the above values (volume and issue) in java?
How should I implement this scenario in java in the most efficient manner to find the difference between these 2 copies?
I suggest a flat HashSet<VolumeIssue>. Each VolumeIssue instance corresponds to one categorized issue, such as 1-C.
In that case all you will need to find the difference is a call
copy1.removeAll(copy2);
What is left in copy1 are all the issues present in copy1 and missing from copy2.
Note that your VolumeIssue class must properly implement equals and hashCode for this to work.
Since you've added the Guava tag, I'd go for a variation of Marco Topolnik's answer. Instead of removing one set from the other, use Sets.difference(left, right)
Returns an unmodifiable view of the difference of two sets. The
returned set contains all elements that are contained by set1 and not
contained by set2. set2 may also contain elements not present in set1;
these are simply ignored. The iteration order of the returned set
matches that of set1.
What data structure should I use to store the above values (volume and issue) in java?
You can have a HashMap's with key and value pairs.
key is Volume and Value is a List of Issues.
How should I implement this scenario in java in the most efficient manner to find the difference between these 2 copies?
By getting value from both the HashMap's so you get two List's of value. Then find the difference between those two lists.
consider you got two list of values with same key from two maps.
now
Collection<Issue> diff = list1.removeAll( list2 );

Looking for a table-like data structure

I have 2 sets of data.
Let say one is a people, another is a group.
A people can be in multiple groups while a group can have multiple people.
My operations will basically be CRUD on group and people.
As well as a method that makes sure a list of people are in different groups (which gets called alot).
Right now I'm thinking of making a table of binary 0's and 1's with horizontally representing all the people and vertically all the groups.
I can perform the method in O(n) time by adding each list of binaries and compare with the "and" operation of the list of binaries.
E.g
Group A B C D
ppl1 1 0 0 1
ppl2 0 1 1 0
ppl3 0 0 1 0
ppl4 0 1 0 0
check (ppl1, ppl2) = (1001 + 0110) == (1001 & 0110)
= 1111 == 1111
= true
check (ppl2, ppl3) = (0110 + 0010) == (0110+0010)
= 1000 ==0110
= false
I'm wondering if there is a data structure that does something similar already so I don't have to write my own and maintain O(n) runtime.
I don't know all of the details of your problem, but my gut instinct is that you may be over thinking things here. How many objects are you planning on storing in this data structure? If you have really large amounts of data to store here, I would recommend that you use an actual database instead of a data structure. The type of operations you are describing here are classical examples of things that relational databases are good at. MySQL and PostgreSQL are examples of large scale relational databases that could do this sort of thing in their sleep. If you'd like something lighter-weight SQLite would probably be of interest.
If you do not have large amounts of data that you need to store in this data structure, I'd recommend keeping it simple, and only optimizing it when you are sure that it won't be fast enough for what you need to do. As a first shot, I'd just recommend using java's built in List interface to store your people and a Map to store groups. You could do something like this:
// Use a list to keep track of People
List<Person> myPeople = new ArrayList<Person>();
Person steve = new Person("Steve");
myPeople.add(steve);
myPeople.add(new Person("Bob"));
// Use a Map to track Groups
Map<String, List<Person>> groups = new HashMap<String, List<Person>>();
groups.put("Everybody", myPeople);
groups.put("Developers", Arrays.asList(steve));
// Does a group contain everybody?
groups.get("Everybody").containsAll(myPeople); // returns true
groups.get("Developers").containsAll(myPeople); // returns false
This definitly isn't the fastest option available, but if you do not have a huge number of People to keep track of, you probably won't even notice any performance issues. If you do have some special conditions that would make the speed of using regular Lists and Maps unfeasible, please post them and we can make suggestions based on those.
EDIT:
After reading your comments, it appears that I misread your issue on the first run through. It looks like you're not so much interested in mapping groups to people, but instead mapping people to groups. What you probably want is something more like this:
Map<Person, List<String>> associations = new HashMap<Person, List<String>>();
Person steve = new Person("Steve");
Person ed = new Person("Ed");
associations.put(steve, Arrays.asList("Everybody", "Developers"));
associations.put(ed, Arrays.asList("Everybody"));
// This is the tricky part
boolean sharesGroups = checkForSharedGroups(associations, Arrays.asList(steve, ed));
So how do you implement the checkForSharedGroups method? In your case, since the numbers surrounding this are pretty low, I'd just try out the naive method and go from there.
public boolean checkForSharedGroups(
Map<Person, List<String>> associations,
List<Person> peopleToCheck){
List<String> groupsThatHaveMembers = new ArrayList<String>();
for(Person p : peopleToCheck){
List<String> groups = associations.get(p);
for(String s : groups){
if(groupsThatHaveMembers.contains(s)){
// We've already seen this group, so we can return
return false;
} else {
groupsThatHaveMembers.add(s);
}
}
}
// If we've made it to this point, nobody shares any groups.
return true;
}
This method probably doesn't have great performance on large datasets, but it is very easy to understand. Because it's encapsulated in it's own method, it should also be easy to update if it turns out you need better performance. If you do need to increase performance, I would look at overriding the equals method of Person, which would make lookups in the associations map faster. From there you could also look at a custom type instead of String for groups, also with an overridden equals method. This would considerably speed up the contains method used above.
The reason why I'm not too concerned about performance is that the numbers you've mentioned aren't really that big as far as algorithms are concerned. Because this method returns as soon as it finds two matching groups, in the very worse case you will call ArrayList.contains a number of times equal to the number of groups that exist. In the very best case scenario, it only needs to be called twice. Performance will likely only be an issue if you call the checkForSharedGroups very, very often, in which case you might be better off finding a way to call it less often instead of optimizing the method itself.
Have you considered a HashTable? If you know all of the keys you'll be using, it's possible to use a Perfect Hash Function which will allow you to achieve constant time.
How about having two separate entities for People and Group. Inside People have a set of Group and vice versa.
class People{
Set<Group> groups;
//API for addGroup, getGroup
}
class Group{
Set<People> people;
//API for addPeople,getPeople
}
check(People p1, People p2):
1) call getGroup on both p1,p2
2) check the size of both the set,
3) iterate over the smaller set, and check if that group is present in other set(of group)
Now, you can basically store People object in any data structure. Preferably a linked list if size is not fixed otherwise an array.

List implementation that is a view over multiple sublists?

I'm working on a piece of software that very frequently needs to return a single list that consists of the first (up to) N elements of a number of other lists. The return is not modified by its clients -- it's read-only.
Currently, I am doing something along the lines of (code simplified for readability):
List ret = new ArrayList<String>();
for (List aList : lists) {
// add the first N elements, if they exist
ret.addAll(aList.subList(0, Math.min(aList.size(), MAXMATCHESPERLIST)));
if (ret.size() >= MAXMATCHESTOTAL) {
break;
}
}
return ret;
I'd like to avoid the creation of the new list and the use of addAll() as I don't need to be returning a new list, and I'm dealing with thousands of elements per second. This method is a major bottleneck for my application.
What I'm looking for is an implementation of List that simply consists of the subList() results (those are cheap views, not actual copies) of each of the contained lists.
I've looked through the usual suspects including java.util, Commons Collections, Commons Lang, etc., but can't for the life of me find any such implementation. I'm pretty sure it has to have been implemented at some point though and hopefully I've missed something obvious.
So I'm turning to you, Stack Overflow -- is anyone aware of such an implementation? I can write one myself, but I hate re-inventing the wheel if the wheel is out there.
Suggestions for alternative more efficient approaches are very welcome!
Optional background detail (probably not all that relevant to my question, but just in case it helps you understand what I'm trying to do): this is for a program to fill crossword-style grids with words that revolve around a theme. Each theme may have any number of candidate word lists, ordered in decreasing order of theme relevancy. For instance, the "film" theme may start with a list of movie titles, then a list of actors, then a generic list of places that may or may not be film-relevant, then a generic list of english words. The lists are each stored in a wildcarded trie structure to allow fast lookups that meet the grid constraints (e.g. "CAT" would be stored in trie'd lists against the keys "CAT", "CA?", "C??", "?AT", ... "???" etc.) Lists vary from a few words to several tens of thousands of words.
For any given query, e.g. "C??", I want to return a list that contains up to N (say 50) matching words, ordered in the same order as the source lists. So if list 1 contains 3 matches for "C??", list 2 contains 7, and list 3 contains 100, I need a return list that contains first the 3 matches from list 1, then the 7 matches from list 2, then 40 of the matches from list 3. And I want that returned "conjoined list view" operation to be more efficient than having to continuously call addAll(), in a similar manner to the implementation of subList().
Caching the returned lists is not an option due to memory constraints -- my trie is already consuming the vast majority of my (32 bit) max-sized heap.
PS this isn't homework, it's for a real project. Any help much appreciated!
Do you need random access for the resulting list? Or you client code only iterates over the result?
If you only need to iterate over the result. Create a custom list implementation which will have list of the original lists :) as the instance field. Return custom iterator which will take items from every list one by one and stops when there are no more items in any of the underlying lists or you return MAXMATCHESTOTAL items already.
With some thoughts you can do the same for random access.
Use list.addAll() multiple times. Simple, does not require external jars and ineffective.
Jakarta collections framework has such list. it is effective but requires external jar and does not support generics.
Check Guava from Google. I think it has something that you are looking for.
What's wrong with returning the sublist? That is the fastest way, since the sublist is not a copy but uses a reference to the backing array, and clients are read-only - seems perfect to me.
EDIT:
I understand why you want to group up the contents of several lists to make a larger chunk, but can you change you clients to not need such a large chunk? See my other answer re BlockingQueue and producer/consumer approach.
Have you considered using a BlockingQueue and having consumers pull items from the queue one by one as they need them, rather than getting items in chunks (lists)? It seems you are attempting to reinvent the producer/consumer pattern here.

Java Collections - Effienct search for DateTime ranges

I have a case where I have a table (t1) which contains items like
| id | timestamp | att1 | att2 |
Now I have to iterate over a collection of elements of type att1 and get all records from t1 which are between two certain timestamps for this att1. I have to do this operation several times for a single att1.
So in order to go easy on the database queries, I intended to load every entry from t1 which has a certain att1 attribute once into a collection and perform the subsequent searches on this collection.
Is there a collection that could handle a search like between '2011-02-06 09:00:00' and '2011-02-06 09:00:30'? It's not guaranteed to contain entries for those two timestamps.
Before writing an implementation for that (most likely a very slow implementation ^^) I wanted to ask you guys if there might be some existing collections already or how I could tackle this problem.
Thanks!
Yes. Use TreeMap which is basically a sorted map of key=>value pairs and its method TreeMap::subMap(fromKey, toKey).
In your case you would use timestamps as keys to the map and for values att1 attribute or id or whatever else would be most convenient for you.
The closest I can think of, and this isn't really what I would consider ideal, is to write a comparator that will sort dates so that those within the range count as less than those outside the range (always return -1 when comparing in to out, 0 when comparing in to in or out to out, and always return +1 when comparing out to in.
Then, use this comparator to sort a collection (I suggest an ArrayList). The values within the range will appear first.
You might just be better off writing your own filter, though. Input a collection (I recommend a LinkedList), iterate over it, and remove anything not in the range. Keep a master copy around for spawning new ones to pass into the filter, if you need to.
You can make the object you want in your collection, which I think is att1, implement the Comparable interface and then have the compareTo method compare the timestamp field. With this in place it will work in any sorted collection, such as a treeSet, making it easy to iterate and pull out everything in a certain range.

Categories