Java map content comparison - java

Here is a tricky data structure and data organization case.
I have an application that reads data from large files and produces objects of various types (e.g., Boolean, Integer, String) that are categorized in a few (less than a dozen) groups and then stored in a database.
Each object is currently stored in a single HashMap<String, Object> data structure. Each such HashMap corresponds to a single category (group). Each database record is built from the information in all the objects contained in all categories (HashMap data structures).
A requirement has appeared for checking whether subsequent records are "equivalent" in the number and type of columns, where equivalence must be verified across all maps by comparing the name (HashMap key) and the type (actual class) of each stored object.
I am looking for an efficient way of implementing this functionality, while maintaining the original object categorization, because listing objects by category in the fastest possible way is also a requirement.
An idea would be to just sort the keys (e.g., by replacing each HashMap with a TreeMap) and then walk over all maps. An alternative would be to just copy everything in a TreeMap for comparison purposes only.
What would be the most efficient way of implementing this functionality?
Also, how would you go about finding the difference (i.e., the fields added and those removed) between successive records?

Create a meta SortedSet in which you store all the created maps.
That is, a SortedSet<Map<String,Object>>, e.g. a TreeSet with a custom Comparator<Map<String,Object>> that checks exactly your requirements: the same number and names of keys and the same object type per value.
You can then use the contains() method of this meta set structure to find out if a similar record does already exist.
==== EDIT ====
Since I misunderstood the relation between database records and the maps at first, I have to change the semantics of my answer a little.
I would still use the mentioned SortedSet<Map<String,Object>>, but the Map<String,Object> would now refer to the map you and havexy suggested.
On the other hand, it could be a step forward to use a Set<Set<KeyAndType>> or SortedSet<Set<KeyAndType>>, where KeyAndType contains only the key and the type, with an appropriate Comparable implementation or equals() and hashCode().
Why? You asked how to find the differences between two records? If each record relates to one of those inner Set<KeyAndType> you can easily use retainAll() to form the intersection of two successive Sets.
Compared with the SortedSet<Map<String,Object>> idea, both approaches put the logic that differentiates between fields inside the comparator, in one case comparing inner sets and in the other comparing inner maps. Since this information is lost once the surrounding set is constructed, it will be hard to get the differences between two records later unless you keep another, reduced structure that makes such differences easy to find. A Set<KeyAndType> can act both as a key and as an easy basis for comparing two records, so it is a good candidate for both purposes.
If you also want to keep the relation between such a Set<KeyAndType> and your record or the group of Map<String,Object>, your meta structure could be something like:
Map<Set<KeyAndType>,DatabaseRecord> or Map<Set<KeyAndType>,GroupOfMaps> implemented by a simple LinkedHashMap which allows simple iteration in original order.
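To make the Set<KeyAndType> idea concrete, here is a minimal sketch. KeyAndType is only named in the answer, so its fields, the RecordShape helper and its method names are all assumptions for illustration. It also assumes that key names are unique across categories and that no value is null.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical name/type pair; the answer names this class but does not define it. */
final class KeyAndType {
    final String key;
    final Class<?> type;

    KeyAndType(String key, Class<?> type) {
        this.key = key;
        this.type = type;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof KeyAndType)) return false;
        KeyAndType other = (KeyAndType) o;
        return key.equals(other.key) && type.equals(other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(key, type);
    }

    @Override
    public String toString() {
        return key + ":" + type.getSimpleName();
    }
}

/** Hypothetical helper that reduces a record to its "shape". */
final class RecordShape {
    /** Flattens all category maps of one record into a set of (name, runtime class) pairs. */
    static Set<KeyAndType> shapeOf(List<Map<String, Object>> categories) {
        Set<KeyAndType> shape = new HashSet<>();
        for (Map<String, Object> category : categories) {
            for (Map.Entry<String, Object> e : category.entrySet()) {
                // Assumption: values are never null; null would need its own convention.
                shape.add(new KeyAndType(e.getKey(), e.getValue().getClass()));
            }
        }
        return shape;
    }

    /** Elements of a that are missing from b; neither input is modified. */
    static Set<KeyAndType> minus(Set<KeyAndType> a, Set<KeyAndType> b) {
        Set<KeyAndType> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

Two records are then equivalent when shapeOf(previous).equals(shapeOf(current)); minus(current, previous) gives the added fields and minus(previous, current) the removed ones. A field whose type changed shows up in both results.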

One solution is to keep both the category-based HashMaps and a combined TreeMap. This needs slightly more memory, but not much, since both maps hold references to the same objects.
So whenever you are adding/removing to HashMap you will do the same operation in the TreeMap too. This way both will always be in sync.
You can then use TreeMap for comparison, whether you want comparison of type of object or actual content comparison.
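A rough sketch of that arrangement follows. The class CategorizedRecord and its method names are made up for illustration, and it assumes field names are unique across categories and values are non-null.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical holder for one record: per-category maps plus a sorted combined map. */
final class CategorizedRecord {
    private final Map<String, Map<String, Object>> byCategory = new HashMap<>();
    private final SortedMap<String, Object> combined = new TreeMap<>();

    /** Adds a value to its category and to the combined map in one step. */
    void put(String category, String name, Object value) {
        byCategory.computeIfAbsent(category, c -> new HashMap<>()).put(name, value);
        combined.put(name, value); // same reference, so only the map entry is extra
    }

    /** Removes a value from both structures so they stay in sync. */
    void remove(String category, String name) {
        Map<String, Object> group = byCategory.get(category);
        if (group != null && group.containsKey(name)) {
            group.remove(name);
            combined.remove(name);
        }
    }

    /** Fast listing by category, as a read-only view. */
    Map<String, Object> category(String category) {
        Map<String, Object> group = byCategory.getOrDefault(category, Collections.emptyMap());
        return Collections.unmodifiableMap(group);
    }

    /** True when both records have the same names with the same runtime classes. */
    boolean sameShapeAs(CategorizedRecord other) {
        if (combined.size() != other.combined.size()) return false;
        Iterator<Map.Entry<String, Object>> a = combined.entrySet().iterator();
        Iterator<Map.Entry<String, Object>> b = other.combined.entrySet().iterator();
        while (a.hasNext()) {
            // Both maps iterate in key order, so a parallel walk is enough.
            Map.Entry<String, Object> x = a.next();
            Map.Entry<String, Object> y = b.next();
            if (!x.getKey().equals(y.getKey())) return false;
            if (x.getValue().getClass() != y.getValue().getClass()) return false;
        }
        return true;
    }
}
```

The comparison is a single pass over two sorted maps, and listing by category still goes straight to the HashMap for that group.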

Related

Problems that we use a BiMap to solve

I'm reviewing the capabilities of Google's Guava API and I ran into a data structure that I haven't seen used in my 'real world programming' experience, namely, the BiMap. Is the only benefit of this construct the ability to quickly retrieve a key for a given value? Are there any problems where the solution is best expressed using a BiMap?
Any time you want to be able to do a reverse lookup without having to populate two maps. For instance a phone directory where you would like to lookup the phone number by name, but would also like to do a reverse lookup to get the name from the number.
Louis mentioned the memory savings possible in a BiMap implementation. That's the only thing that you can't get by wrapping two Map instances. Still, if you let us wrap the Map instances for you, we can take care of a few edge cases. (You could handle all these yourself, but why bother? :))
If you call put(newKey, existingValue), we'll error out immediately to keep the two maps in sync, rather than adding the entry to one map before realizing that it conflicts with an existing mapping in the other. (We provide forcePut if you do want to override the existing value.) We provide similar safeguards for inserting null or other invalid values.
BiMap views keep the two maps in sync: If you remove an element from the entrySet of the original BiMap, its corresponding entry is also removed from the inverse. We do the same kind of thing in Entry.setValue.
We handle serialization: A BiMap and its inverse stay "connected," and the entries are serialized only once.
We provide a smart implementation of inverse() so that foo.inverse().inverse() returns foo, rather than a wrapper of a wrapper.
We override values() to return a Set. This set is identical to what you'd get from inverse().keySet() except that it maintains the same iteration order as the original BiMap.
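For comparison, this is roughly what the "wrap two Map instances yourself" route looks like. It is a hand-rolled sketch, not Guava's BiMap code: the class name TwoWayMap and its methods are invented here, it assumes non-null keys and values, and it covers only the put conflict check described above, not views, serialization or inverse().

```java
import java.util.HashMap;
import java.util.Map;

/** Hand-rolled two-map sketch; this is NOT Guava's BiMap implementation. */
final class TwoWayMap<K, V> {
    private final Map<K, V> forward = new HashMap<>();
    private final Map<V, K> backward = new HashMap<>();

    /** Rejects a value already bound to a different key before touching either map. */
    V put(K key, V value) {
        K owner = backward.get(value);
        if (owner != null && !owner.equals(key)) {
            throw new IllegalArgumentException("value already present: " + value);
        }
        V old = forward.put(key, value);
        if (old != null) {
            backward.remove(old); // drop the stale reverse entry
        }
        backward.put(value, key);
        return old;
    }

    V get(K key) {
        return forward.get(key);
    }

    /** Reverse lookup: the key currently bound to this value, or null. */
    K getKey(V value) {
        return backward.get(value);
    }

    V remove(K key) {
        V old = forward.remove(key);
        if (old != null) {
            backward.remove(old);
        }
        return old;
    }
}
```

Even this small version has to get the ordering of checks and updates right, which is the kind of edge case the answer suggests leaving to the library.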

MapReduce with "customized" key

I have the following problem: I have a lot of data in the form of key-value pairs. The key is some id and the value is a piece of text. My aim is to group those objects into clusters where the text pieces are "similar" in some way. It would look like a task for MapReduce if I took the text piece as the key and the id as the value. But such keys are not the traditional way of using MapReduce, and as I am not really aware of the internal implementation of MapReduce frameworks, I am not sure that this would work. So my idea in detail is:
1. take some MapReduce in Java (Hadoop, GridGain)
2. create special class for my text pieces (say TextKey)
3. Override equals() of the class, packing the text comparison logic here (say Levenshtein distance comparison, or whatever)
4. Override compareTo() for allowing the MapReduce to sort by key (say lexicographical order)
5. Probably override hashCode()
6. Map my data to key-value pairs: keys -> text pieces, packed in TextKey class, values -> ids
7. Simply reduce by collecting ids of every "equal" (actually similar) key
Can MapReduce work on that way?
In GridGain this can be easily solved by storing your text keys in partitioned data grid. GridGain Data Grid will automatically partition your data set across the cluster based on keys, so as long as you have your similar text pieces properly implement standard java hashCode() and equals(), you should be fine.
You can also send affinity-based MapReduce tasks in GridGain to make sure that your jobs end up on the same node as the data to avoid redundant data movements should you require to run some computations on your data going forward. This can be achieved by executing GridProjection.affinityRun(...) methods.
Right after the map phase, its output is partitioned using a Partitioner (HashPartitioner by default, but you can provide your own Partitioner). Your TextKey should implement an LSH hashCode so that similar Text values are likely to go to the same partition.
If the keys are Strings/Text objects the default sorter will work but I think this is not going to affect your result given the scenario you described.
The problem is at the Grouper, which passes each group within a partition to a single reduce call. By default this grouper iterates through the partition, which is already sorted at this point, and forms groups out of equal values. In your case you should make sure the grouping is done not by equality but by similarity. So your TextKey should also implement the compareTo() method and take care to return 0 if the LSH hashCodes are the same.
In conclusion, you can go with the default data path (i.e. default Partitioner, Sorter, Grouper), but your TextKey (which should implement WritableComparable) should do the magic in the hashCode() and compareTo() methods.
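Here is a plain-Java sketch of such a key, with several assumptions spelled out. It implements only Comparable; a real Hadoop key would also implement WritableComparable with its write() and readFields() methods, which are left out. The signature is a single MinHash over 3-character shingles, using String.hashCode() as the hash function, which is just one simple locality-sensitive choice picked for illustration and not something the answer prescribes.

```java
import java.util.Locale;

/** Sketch of a similarity-grouping key; the Hadoop serialization methods are omitted. */
final class TextKey implements Comparable<TextKey> {
    private static final int SHINGLE = 3; // assumed shingle length

    final String text;
    private final int signature;

    TextKey(String text) {
        this.text = text;
        this.signature = minHash(text.toLowerCase(Locale.ROOT));
    }

    /** Single MinHash: the smallest hash over all character shingles of the text. */
    private static int minHash(String s) {
        if (s.length() <= SHINGLE) {
            return s.hashCode(); // short text is treated as one shingle
        }
        int min = Integer.MAX_VALUE;
        for (int i = 0; i + SHINGLE <= s.length(); i++) {
            min = Math.min(min, s.substring(i, i + SHINGLE).hashCode());
        }
        return min;
    }

    /** Used for partitioning: texts with the same signature land together. */
    @Override
    public int hashCode() {
        return signature;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TextKey && ((TextKey) o).signature == signature;
    }

    /** Returns 0 exactly when the signatures match, so such keys form one group. */
    @Override
    public int compareTo(TextKey other) {
        return Integer.compare(signature, other.signature);
    }
}
```

Grouping by signature rather than by a pairwise distance keeps equals() and compareTo() consistent and transitive. A raw "Levenshtein distance below a threshold" test is not transitive, so it would not define stable groups.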

Java - how best to perform set-like operations (e.g. retainAll) based on custom comparison

I have two sets both containing the same object types. I would like to be able to access the following:
the intersection of the 2 sets
the objects contained in set 1 and not in set 2
the objects contained in set 2 and not in set 1
My question relates to how best to compare the two sets to acquire the desired views. The class in question has numerous id properties which can be used to uniquely identify that entity. However, there are also numerous properties in the class that describe the current status of the object. The two sets can contain objects that match according to the ids, but which are in a different state (and as such, not all properties are equal between the two objects).
So - how do I best implement my solution. To implement an equals() method for the class which does not take into account the status properties and only looks at the id properties would not seem to be very true to the name 'equals' and could prove to be confusing later on. Is there some way I can provide a method through which the comparisons are done for the set methods?
Also, I would like to be able to access the 3 views described above without modifying the original sets.
All help is much appreciated!
(Edit: My first suggestion has been removed because of an unfortunate implementation detail in TreeSet, as pointed out by Martin Konecny. Some collection classes (e.g. TreeSet) allow you to supply a Comparator that is to be used to compare elements, so you might want to use one of those classes - at least, if there is some natural way of ordering your objects.)
If not (i.e. if it would be difficult to implement compareTo(), while it would be simpler to implement hashCode() and equals()), you could create a wrapper class which implements those two methods by looking at the relevant fields of the objects it wraps, and create a regular HashSet of these wrapper objects.
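A minimal sketch of the wrapper approach, assuming a made-up Item class with one identifying field (id) and one state field (status); ItemById and ItemSets are hypothetical names too. All three views are built from copies, so the original sets are not modified.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical entity: id identifies it, status describes its current state. */
final class Item {
    final String id;
    final String status;

    Item(String id, String status) {
        this.id = id;
        this.status = status;
    }
}

/** Wrapper whose equality looks only at the identifying field. */
final class ItemById {
    final Item item;

    ItemById(Item item) {
        this.item = item;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ItemById && ((ItemById) o).item.id.equals(item.id);
    }

    @Override
    public int hashCode() {
        return item.id.hashCode();
    }
}

final class ItemSets {
    /** Copies the items into a new set of wrappers; the input set is left untouched. */
    static Set<ItemById> wrap(Set<Item> items) {
        Set<ItemById> wrapped = new HashSet<>();
        for (Item item : items) {
            wrapped.add(new ItemById(item));
        }
        return wrapped;
    }

    /** Items of a whose id also occurs in b (the wrapped objects come from a). */
    static Set<ItemById> intersection(Set<Item> a, Set<Item> b) {
        Set<ItemById> result = wrap(a);
        result.retainAll(wrap(b));
        return result;
    }

    /** Items of a whose id does not occur in b. */
    static Set<ItemById> difference(Set<Item> a, Set<Item> b) {
        Set<ItemById> result = wrap(a);
        result.removeAll(wrap(b));
        return result;
    }
}
```

The three views from the question are then intersection(set1, set2), difference(set1, set2) and difference(set2, set1), and Item itself keeps whatever equals() it already had.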
Short version: implement equals based on the entity's key, not state.
Slightly longer version: What the equals method should check depends on the type of object. For something that's considered a "value" object (say, an Integer or String or an Address), equality is typically based on all fields being the same. For an object with a set of fields that uniquely identify it (its primary key), equality is typically based on the fields of the primary key only. Equality doesn't necessarily need to (and often shouldn't) take in to consideration the state of an object. It needs to determine whether two objects are representations of the same thing. Also, for objects that are used in a Set or as keys in a Map, the fields that are used to determine equality should generally not be mutable, since changing them could cause a Set/Map to stop working as expected.
Once you've implemented equals like this, you can use Guava to view the differences between the two sets:
Set<Foo> notInSet2 = Sets.difference(set1, set2);
Set<Foo> notInSet1 = Sets.difference(set2, set1);
Both difference sets will be live views of the original sets, so changes to the original sets will automatically be reflected in them.
This is a requirement for which the Standard C++ Library fares better with its set type, which accepts a comparator for this purpose. In the Java library, your need is modeled better by a Map— one mapping from your candidate key to either the rest of the status-related fields, or to the complete object that happens to also contain the candidate key. (Note that the C++ set type is mandated to be some sort of balanced tree, usually implemented as a red-black tree, which means it's equivalent to Java's TreeSet, which does accept a custom Comparator.) It's ugly to duplicate the data, but it's also ugly to try to work around it, as you've already found.
If you have control over the type in question and can split it up into separate candidate key and status parts, you can eliminate the duplication. If you can't go that far, consider combining the candidate key fields into a single object held within your larger, complete object; that way, the Map key type will be the same as that candidate key type, and the only storage overhead will be the map keys' object references. The candidate key data would not be duplicated.
Note that most set types are implemented as maps under the covers; they map from the would-be set element type to something like a Boolean flag. Apparently there's too much code that would be duplicated in wholly disjoint set and map types. Once you realize that, backing up from using a set in an awkward way to using a map no longer seems to impose the storage overhead you thought it would.
It's a somewhat depressing realization, having chosen the mathematically correct idealized data structure, only to find it's a false choice down a layer or two, but even in your case your problem sounds better suited to a map representation than a set. Think of it as an index.

structure for holding data in this instance (Hashmap/ArrayList etc)?

Best way to describe this is explain the situation.
Imagine I have a factory that produces chairs. Now the factory is split into 5 sections. A chair can be made fully in one area or over a number of areas. The makers of the chairs add attributes of the chair to a chair object. At the end of the day these objects are collected by my imaginary program and added into X datatype(ArrayList etc).
When a chair is added, it must check whether the chair already exists and, if so, not replace the existing chair but append this chair's attributes to it. (Don't worry about this part, I've got this covered.)
So basically I want a structure in which I can easily check whether an object exists; if not, just insert it, else perform the append. So I need to find the chair matching a certain unique ID. Kind of like a set, except it's not matching the same object: if a chair is made in three areas it will be three distinct objects. In real life they all represent the same object, though, and I only want one object that will hold the entire attribute contents of all the chairs.
Once it has collected everything and performed the update for all areas of the factory, it needs to iterate over each object and add its contents to a DB. Again, don't worry about adding to the DB etc.; that's covered.
I just want to know what the best data structure in Java would be to match this spec.
Thank you in advance.
I'd say a HashMap: it lets you quickly check whether an object exists with a given unique ID, and retrieve that object if it does exist in the collection. Then it's simply a matter of performing your merge function to add attributes to the object that is already in the collection.
Unlike most other collections (ArrayList, e.g.), HashMaps are actually optimized for looking something up by a unique ID, and it will be just as fast at doing this regardless of how many objects you have in your collection.
This answer originally referred to the Hashtable class, but after further research (and some good comments), I found that you're always better off using a HashMap. If you need synchronization, you can call Collections.synchronizedMap() on it.
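As a rough illustration of the HashMap approach: the Chair class, its attributes map and the absorb() method below are placeholders, since the question says the real merge logic already exists.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical chair; the real class and merge logic belong to the asker. */
final class Chair {
    final String id;
    final Map<String, String> attributes = new HashMap<>();

    Chair(String id) {
        this.id = id;
    }

    /** Placeholder merge: copies the other chair's attributes into this one. */
    void absorb(Chair other) {
        attributes.putAll(other.attributes);
    }
}

/** Collects chairs from all factory sections, keeping one object per unique ID. */
final class ChairCollector {
    private final Map<String, Chair> byId = new HashMap<>();

    /** Inserts the chair if its ID is new, otherwise merges it into the existing one. */
    void add(Chair chair) {
        Chair existing = byId.get(chair.id);
        if (existing == null) {
            byId.put(chair.id, chair);
        } else {
            existing.absorb(chair);
        }
    }

    Chair find(String id) {
        return byId.get(id);
    }

    /** All merged chairs, ready to be iterated over and written to the DB. */
    Collection<Chair> chairs() {
        return byId.values();
    }
}
```

If the order in which chairs first appeared matters when writing to the DB, swapping HashMap for LinkedHashMap would preserve it; that is an optional tweak, not part of the answer above.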
I'd say use ArrayList. Override the hashCode() and equals() methods on your Chair object to use the unique ID. That way you can just use list.contains(chair) to check if it exists.
I'd say use an EnumMap. Define an enum of all possible part categories, so you can query the EnumMap for which part is missing.
public enum Category {
SEAT,REST,LEGS,CUSHION
}

Best Java data structure to store a 3 column oracle table? 3 column array? or double map?

What is the best data structure to store an oracle table that's about 140 rows by 3 columns. I was thinking about a multi dimensional array.
By best I do not necessarily mean most efficient (but I'd be curious to know your opinions), since the program will run as a job with plenty of time to run, but I do have some restrictions:
It is possible for multiple keys to be "null" at first. so the first column might have multiple null values. I also need to be able to access elements from the other columns. Anything better than a linear search to access the data?
So again, something like [][][] would work.. but is there something like a 3 column map where I can access by the key or the second column ? I know maps have only two values.
All data will probably be strings or cast as strings.
Thanks
A custom class with 3 fields, and a java.util.List of that class.
There's no benefit in shoe-horning data into arrays in this case, you get no improvement in performance, and certainly no improvement in code maintainability.
This is another example of people writing FORTRAN in an object-oriented language.
Java's about objects. You'd be much better off if you started using objects to abstract your problem, hide details away from clients, and reduce coupling.
What sensible object, with meaningful behavior, do those three items represent? I'd start with that, and worry about the data structures and persistence later.
"All data will probably be strings or cast as strings."
This is fine if they really are strings, but I'd encourage you to look deeper and see if you can do better.
For example, if you write an application that uses credit scores you might be tempted to persist it as a number column in a database. But you can benefit from looking at the problem harder and encapsulating that value into a CreditScore object. When you have that, you realize that you can add something like units ("FICO" versus "TransUnion"), scale (range from 0 to 850), and maybe some rich behavior (e.g., rules governing when to reorder the score). You encapsulate everything into a single object instead of scattering the logic for operating on credit scores all over your code base.
Start thinking less in terms of tables and columns and more about objects. Or switch languages. Python has the notion of tuples built in. Maybe that will work better for you.
If you need to access your data by key and by another key, then I would just use 2 maps for that and define a separate class to hold your record.
class Record {
String field1;
String field2;
String field3;
}
and
Map<String, Record> firstKeyMap = new HashMap<String, Record>();
Map<String, Record> secondKeyMap = new HashMap<String, Record>();
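Putting the pieces together, a sketch under a few assumptions: the class is called TableRow here (a hypothetical name, chosen to avoid clashing with java.lang.Record in newer Java versions), non-null values in each indexed column are unique, and rows with a null key are kept in a list but left out of that index, since the question says the first column may start out null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical row type for the three string columns. */
final class TableRow {
    final String first;
    final String second;
    final String third;

    TableRow(String first, String second, String third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }
}

/** Keeps every row in a list and indexes rows by the first and second column. */
final class TableIndex {
    private final List<TableRow> rows = new ArrayList<>();
    private final Map<String, TableRow> byFirst = new HashMap<>();
    private final Map<String, TableRow> bySecond = new HashMap<>();

    void add(TableRow row) {
        rows.add(row);
        // Rows with a null key stay in the list but are skipped by that index.
        if (row.first != null) {
            byFirst.put(row.first, row);
        }
        if (row.second != null) {
            bySecond.put(row.second, row);
        }
    }

    TableRow findByFirst(String key) {
        return byFirst.get(key);
    }

    TableRow findBySecond(String key) {
        return bySecond.get(key);
    }

    /** All rows in insertion order, including those with null keys. */
    List<TableRow> all() {
        return Collections.unmodifiableList(rows);
    }
}
```

Both lookups avoid a linear search, and with about 140 rows the extra map is negligible.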
I'd create a class that maps to your record and then create a collection of those objects.
