Efficient way to find the difference between two data sets

Efficient way to find the difference between two data sets - java

I have two copies of data, here 1 represents my volumes and 2 represent my issues. I have to compare COPY2 with COPY1 and find all the elements which are missing in COPY2 (COPY1 will always be a superset and COPY2 can be equal or will always be a subset).
Now, I have to get the missing volume and the issue in COPY2.
Such that from the following figure(scenario) I get the result as : -
Missing files – 1-C, 1-D, 2-C, 2-C, 3-A, 3-B, 4,E.
Question-
What data structure should I use to store the above values (volume and issue) in java?
How should I implement this scenario in java in the most efficient manner to find the difference between these 2 copies?

I suggest a flat HashSet<VolumeIssue>. Each VolumeIssue instance corresponds to one categorized issue, such as 1-C.
In that case all you will need to find the difference is a call
copy1.removeAll(copy2);
What is left in copy1 are all the issues present in copy1 and missing from copy2.
Note that your VolumeIssue class must properly implement equals and hashCode for this to work.

Since you've added the Guava tag, I'd go for a variation of Marco Topolnik's answer. Instead of removing one set from the other, use Sets.difference(left, right)
Returns an unmodifiable view of the difference of two sets. The
returned set contains all elements that are contained by set1 and not
contained by set2. set2 may also contain elements not present in set1;
these are simply ignored. The iteration order of the returned set
matches that of set1.

What data structure should I use to store the above values (volume and issue) in java?
You can have a HashMap's with key and value pairs.
key is Volume and Value is a List of Issues.
How should I implement this scenario in java in the most efficient manner to find the difference between these 2 copies?
By getting value from both the HashMap's so you get two List's of value. Then find the difference between those two lists.
consider you got two list of values with same key from two maps.
now
Collection<Issue> diff = list1.removeAll( list2 );

Related

Data Structure choices based on requirements

I'm completely new to programming and to java in particular and I am trying to determine which data structure to use for a specific situation. Since I'm not familiar with Data Structures in general, I have no idea what structure does what and what the limitations are with each.
So I have a CSV file with a bunch of items on it, lets say Characters and matching Numbers. So my list looks like this:
A,1,B,2,B,3,C,4,D,5,E,6,E,7,E,8,E,9,F,10......etc.
I need to be able to read this in, and then:
1)display just the letters or just the numbers sorted alphabetically or numerically
2)search to see if an element is contained in either list.
3)search to see if an element pair (for example A - 1 or B-10) is contained in the matching list.
Think of it as an excel spreadsheet with two columns. I need to be able to sort by either column while maintaining the relationship and I need to be able to do an IF column A = some variable AND the corresponding column B contains some other variable, then do such and such.
I need to also be able to insert a pair into the original list at any location. So insert A into list 1 and insert 10 into list 2 but make sure they retain the relationship A-10.
I hope this makes sense and thank you for any help! I am working on purchasing a Data Structures in Java book to work through and trying to sign up for the class at our local college but its only offered every spring...

You could use two sorted Maps such as TreeMap.
One would map Characters to numbers (Map<Character,Number> or something similar). The other would perform the reverse mapping (Map<Number, Character>)
Let's look at your requirements:
1)display just the letters or just the numbers sorted alphabetically
or numerically
Just iterate over one of the maps. The iteration will be ordered.
2)search to see if an element is contained in either list.
Just check the corresponding map. Looking for a number? Check the Map whose keys are numbers.
3)search to see if an element pair (for example A - 1 or B-10) is
contained in the matching list.
Just get() the value for A from the Character map, and check whether that value is 10. If so, then A-10 exists. If there's no value, or the value is not 10, then A-10 doesn't exist.
When adding or removing elements you'd need to take care to modify both maps to keep them in sync.

Comparator for TreeBag to sort by the number of occurrences

I have a source of strings (let us say, a text file) and many strings repeat multiple times. I need to get the top X most common strings in the order of decreasing number of occurrences.
The idea that came to mind first was to create a sortable Bag (something like org.apache.commons.collections.bag.TreeBag) and supply a comparator that will sort the entries in the order I need. However, I cannot figure out what is the type of objects I need to compare. It should be some kind of an internal map that combines my object (String) and the number of occurrences, generated internally by TreeBag. Is this possible?
Or would I be better off by simply using a hashmap and sort it by value as described in, for example, Java sort HashMap by value

Why don't you put the strings in a map. Map of string to number of times they appear in text.
In step 2, traverse the items in the map and keep on adding them to a minimum heap of size X. Always extract min first if the heap is full before inserting.
Takes nlogx time.
Otherwise after step 1 sort the items by number of occurrences and take first x items. A tree map would come in helpful here :) (I'd add a link to the javadocs, but I'm in a tablet )
Takes nlogn time.

With Guava's TreeMultiset, just use Multisets.copyHighestCountFirst.

Most efficient way to check if row exists in grid Java

All,
I am wondering what's the most efficient way to check if a row already exists in a List<Set<Foo>>. A Foo object has a key/value pair(as well as other fields which aren't applicable to this question). Each Set in the List is unique.
As an example:
List[
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:2][Foo_Key:C, Foo_Value:4]
Set<Foo>[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:3]
]
I want to be able to check if a new Set (Ex: Set[Foo_Key:A, Foo_Value:1][Foo_Key:B, Foo_Value:3][Foo_Key:C, Foo_Value:4]) exists in the List.
Each Set could contain anywhere from 1-20 Foo objects. The List can contain anywhere from 1-100,000 Sets. Foo's are not guaranteed to be in the same order in each Set (so they will have to be pre-sorted for the correct order somehow, like a TreeSet)
Idea 1: Would it make more sense to turn this into a matrix? Where each column would be the Foo_Key and each row would contain a Foo_Value?
Ex:
A B C
-----
1 3 4
1 2 4
1 3 3
And then look for a row containing the new values?
Idea 2: Would it make more sense to create a hash of each Set and then compare it to the hash of a new Set?
Is there a more efficient way I'm not thinking of?
Thanks

If you use TreeSets for your Sets can't you just do list.contains(set) since a TreeSet will handle the equals check?
Also, consider using Guava's MultSet class.Multiset

I would recommend you use a less weird data structure. As for finding stuff: Generally Hashes or Sorting + Binary Searching or Trees are the ways to go, depending on how much insertion/deletion you expect. Read a book on basic data structures and algorithms instead of trying to re-invent the wheel.
Lastly: If this is not a purely academical question, Loop through the lists, and do the comparison. Most likely, that is acceptably fast. Even 100'000 entries will take a fraction of a second, and therefore not matter in 99% of all use cases.
I like to quote Knuth: Premature optimisation is the root of all evil.

Java Collections - Effienct search for DateTime ranges

I have a case where I have a table (t1) which contains items like
| id | timestamp | att1 | att2 |
Now I have to iterate over a collection of elements of type att1 and get all records from t1 which are between two certain timestamps for this att1. I have to do this operation several times for a single att1.
So in order to go easy on the database queries, I intended to load every entry from t1 which has a certain att1 attribute once into a collection and perform the subsequent searches on this collection.
Is there a collection that could handle a search like between '2011-02-06 09:00:00' and '2011-02-06 09:00:30'? It's not guaranteed to contain entries for those two timestamps.
Before writing an implementation for that (most likely a very slow implementation ^^) I wanted to ask you guys if there might be some existing collections already or how I could tackle this problem.
Thanks!

Yes. Use TreeMap which is basically a sorted map of key=>value pairs and its method TreeMap::subMap(fromKey, toKey).
In your case you would use timestamps as keys to the map and for values att1 attribute or id or whatever else would be most convenient for you.

The closest I can think of, and this isn't really what I would consider ideal, is to write a comparator that will sort dates so that those within the range count as less than those outside the range (always return -1 when comparing in to out, 0 when comparing in to in or out to out, and always return +1 when comparing out to in.
Then, use this comparator to sort a collection (I suggest an ArrayList). The values within the range will appear first.
You might just be better off writing your own filter, though. Input a collection (I recommend a LinkedList), iterate over it, and remove anything not in the range. Keep a master copy around for spawning new ones to pass into the filter, if you need to.

You can make the object you want in your collection, which I think is att1, implement the Comparable interface and then have the compareTo method compare the timestamp field. With this in place it will work in any sorted collection, such as a treeSet, making it easy to iterate and pull out everything in a certain range.

Java find nearest (or equal) value in collection

I have a class along the lines of:
public class Observation {
private String time;
private double x;
private double y;
//Constructors + Setters + Getters
}
I can choose to store these objects in any type of collection (Standard class or 3rd party like Guava). I have stored some example data in an ArrayList below, but like I said I am open to any other type of collection that will do the trick. So, some example data:
ArrayList<Observation> ol = new ArrayList<Observation>();
ol.add(new Observation("08:01:23",2.87,3.23));
ol.add(new Observation("08:01:27",2.96,3.17));
ol.add(new Observation("08:01:27",2.93,3.20));
ol.add(new Observation("08:01:28",2.93,3.21));
ol.add(new Observation("08:01:30",2.91,3.23));
The example assumes a matching constructor in Observation. The timestamps are stored as String objects as I receive them as such from an external source but I am happy to convert them into something else. I receive the observations in chronological order so I can create and rely on a sorted collection of observations. The timestamps are NOT unique (as can be seen in the example data) so I cannot create a unique key based on time.
Now to the problem. I frequently need to find one (1) observation with a time equal or nearest to a certain time, e.g if my time was 08:01:29 I would like to fetch the 4th observation in the example data and if the time is 08:01:27 I want the 3rd observation.
I can obviously iterate through the collection until I find the time that I am looking for, but I need to do this frequently and at the end of the day I may have millions of observations so I need to find a solution where I can locate the relevant observations in an efficient manner.
I have looked at various collection-types including ones where I can filter the collections with Predicates but I have failed to find a solution that would return one value, as opposed to a subset of the collection that fulfills the "<="-condition. I am essentially looking for the SQL equivalent of SELECT * FROM ol WHERE time <= t LIMIT 1.
I am sure there is a smart and easy way to solve my problem so I am hoping to be enlightened. Thank you in advance.

Try TreeSet providing a comparator that compares the time. It mantains an ordered set and you can ask for TreeSet.floor(E) to find the greatest min (you should provide a dummy Observation with the time you are looking for). You also have headSet and tailSet for ordered subsets.
It has O(log n) time for adding and retrieving. I think is very suitable for your needs.
If you prefer a Map you can use a TreeMap with similar methods.

Sort your collection (ArrayList will probably work best here) and use BinarySearch which returns an integer index of either a match of the "closest" possible match, ie it returns an...
index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size(),

Have the Observation class implement Comparable and use a TreeSet to store the objects, which will keep the elements sorted. TreeSet implements SortedSet, so you can use headSet or tailSet to get a view of the set before or after the element you're searching for. Use the first or last method on the returned set to get the element you're seeking.
If you are stuck with ArrayList, but can keep the elements sorted yourself, use Collections.binarySearch to search for the element. It returns a positive number if the exact element is found, or a negative number that can be used to determine the closest element. http://download.oracle.com/javase/1.4.2/docs/api/java/util/Collections.html#binarySearch(java.util.List,%20java.lang.Object)

If you are lucky enough to be using Java 6, and the performance overhead of keeping a SortedSet is not a big deal for you. Take a look at TreeSet ceiling, floor, higher and lower methods.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Efficient way to find the difference between two data sets - java

Related

Data Structure choices based on requirements

Comparator for TreeBag to sort by the number of occurrences

Most efficient way to check if row exists in grid Java

Java Collections - Effienct search for DateTime ranges

Java find nearest (or equal) value in collection

Categories

Resources