Efficiently finding duplicates in a constrained many-to-many dataset?

I have to write a bulk operation version of something our webapp
lets you do on a more limited basis from the UI. The desired
operation is to assign objects to a category. A category can have
multiple objects but a given object can only be in one category.
The workflow for the task is:
1) Using the browser, a file of the following form is uploaded:
# ObjectID, CategoryID
Oid1, Cid1
Oid2, Cid1
Oid3, Cid2
Oid4, Cid2
[etc.]
The file will most likely have tens to hundreds of lines, but
definitely could have thousands of lines.
In an ideal world a given object id would only occur once in the file
(reflecting the fact that an object can only be assigned to one category).
But since the file is created outside of our control, there's no guarantee
that's actually true and the processing has to deal with that possibility.
2) The server will receive the file, parse it, pre-process it
and show a page something like:
723 objects to be assigned to 126 categories
142 objects not found
42 categories not found
Do you want to continue?
[Yes] [No]
3) If the user clicks the Yes button, the server will
actually do the work.
Since I don't want to parse the file in both steps (2) and (3), as
part of (2), I need to build a container that will live across
requests and hold a useful representation of the data that will let me
easily provide the data to populate the "preview" page and will let me
efficiently do the actual work. (While obviously we have sessions, we
normally keep very little in-memory session state.)
There is an existing
assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)
function that is used when assignment is done through the UI. It is
highly desirable for the bulk operation to also use this API since it
does a bunch of other business logic in addition to the simple
assignment and we need that same business logic to run when this bulk
assign is done.
Initially, it was acceptable for a file to "illegally" specify
multiple categories for a given object: in that case the object
could be assigned arbitrarily to any one of the categories the file
associated it with.
So I was initially thinking that in step (2) as I went through the
file I would build up and put into the cross-request container a
Map<CategoryId, Set<ObjectId>> (specifically a HashMap for quick
lookup and insertion) and then when it was time to do the work I could
just iterate on the map and for each CategoryId pull out the
associated Set<ObjectId> and pass them into assignObjectsToCategory().
However, the requirement on how to handle duplicate ObjectIds changed.
And they are now to be handled as follows:
If an ObjectId appears multiple times in the file and
is associated with the same CategoryId every time, assign
the object to that category.
If an ObjectId appears multiple times in the file and
is associated with different CategoryIds, consider that
an error and make mention of it on the "preview" page.
That seems to mess up my Map<CategoryId, Set<ObjectId>> strategy
since it doesn't provide a good way to detect that the ObjectId I
just read out of the file is already associated with a CategoryId.
So my question is how to most efficiently detect and track these
duplicate ObjectIds?
What came to mind is to use both "forward" and "reverse" maps:
public class CrossRequestContainer
{
    ...
    Map<CategoryId, Set<ObjectId>> objectsByCategory; // HashMap
    Map<ObjectId, List<CategoryId>> categoriesByObject; // HashMap
    Set<ObjectId> illegalDuplicates;
    ...
}
Then as each (ObjectId, CategoryId) pair was read in, it would
get put into both maps. Once the file was completely read in, I
could do:
for (Map.Entry<ObjectId, List<CategoryId>> entry : categoriesByObject.entrySet()) {
    List<CategoryId> categories = entry.getValue();
    if (categories.size() > 1) {
        ObjectId object = entry.getKey();
        if (!all_categories_are_equal(categories)) {
            illegalDuplicates.add(object);
            // Since this is an "illegal" duplicate I need to remove it
            // from every category that it appeared with in the file.
            for (CategoryId category : categories) {
                objectsByCategory.get(category).remove(object);
            }
        }
    }
}
When this loop finishes, objectsByCategory will no longer contain any "illegal"
duplicates, and illegalDuplicates will contain all the "illegal" duplicates to
be reported back as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to do the assignments.
But while I think this will work, I'm worried about storing the data twice, especially
when the input file is huge. And I'm also worried that I'm missing something re: efficiency
and this will go very slowly.
Are there ways to do this that won't use double memory but can still run quickly?
Am I missing something that even with the double memory use will still run a lot
slower than I'm expecting?

Given the constraints you've given, I don't think there's a way to do this using a lot less memory.
One possible optimization, though, is to maintain lists of categories only for objects that are listed in multiple categories, and otherwise just map each object to its category, i.e.:
Map<CategoryId, Set<ObjectId>> objectsByCategory; // HashMap
Map<ObjectId, CategoryId> categoryByObject; // HashMap
Map<ObjectId, Set<CategoryId>> illegalDuplicates; // HashMap
Yes, this adds yet another container, but it will contain (hopefully) only a few entries; also, the memory requirements of the categoryByObject map are reduced (cutting out one list overhead per entry).
The logic is a little more complicated of course. When a duplicate is initially discovered, the object should be removed from the categoryByObject map and added into the illegalDuplicates map. Before adding any object into the categoryByObject map, you will need to first check the illegalDuplicates map.
Finally, it probably won't hurt performance to build the objectsByCategory map in a separate loop after building the other two maps, and it will simplify the code a bit.
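The bookkeeping described above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name BulkAssignment and its method names are made up, the type parameters O and C stand in for ObjectId and CategoryId, and ids are assumed to be non-null with sensible equals()/hashCode().

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical container; O stands in for ObjectId and C for CategoryId.
final class BulkAssignment<O, C> {
    // Objects that have been seen with exactly one category so far.
    private final Map<O, C> categoryByObject = new HashMap<>();
    // Objects seen with two or more different categories, with every category seen.
    private final Map<O, Set<C>> illegalDuplicates = new HashMap<>();

    // Record one (object, category) pair read from the file.
    void add(O object, C category) {
        Set<C> conflicts = illegalDuplicates.get(object);
        if (conflicts != null) {
            conflicts.add(category); // already known to be an illegal duplicate
            return;
        }
        C previous = categoryByObject.putIfAbsent(object, category);
        if (previous != null && !previous.equals(category)) {
            // First conflict for this object: move it out of the legal map.
            categoryByObject.remove(object);
            Set<C> seen = new HashSet<>();
            seen.add(previous);
            seen.add(category);
            illegalDuplicates.put(object, seen);
        }
    }

    // Built in a separate pass, once the whole file has been read.
    Map<C, Set<O>> objectsByCategory() {
        Map<C, Set<O>> result = new HashMap<>();
        for (Map.Entry<O, C> entry : categoryByObject.entrySet()) {
            result.computeIfAbsent(entry.getValue(), k -> new HashSet<>())
                  .add(entry.getKey());
        }
        return result;
    }

    // Objects to report on the "preview" page, with the categories they appeared with.
    Map<O, Set<C>> illegalDuplicates() {
        return illegalDuplicates;
    }
}
```

Each set returned by objectsByCategory() could then be passed to assignObjectsToCategory(). A repeated line with the same category is a no-op here, since putIfAbsent returns the category that is already stored.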

Related

is it possible to create a hash map where each value is a method that acts differently on an object

Problem
I want to know if this is possible if I could create a State machine that would contain all the methods and the Values of MethodById would be stated in the machine.
P.S. this is my first question ever on here. If I do it wrong I'm sorry but that is why.
Description (TL;DR)
I'm trying to cross reference data about Sales representatives. Each rep has territories specified by zip-codes.
One dataset has the reps, their territories and their company.
Another data set has their names, phone number and email.
I made a Sales-rep class that takes from the first data-set and needs to be updated with the second data-set.
I also need the Sales-reps to be put in a look-up table (I used a hashmap for this) of <key: zip code, value: Sales-rep object>.
What I want is for each Sales-rep object to have an ID that is standard across all my datasets. I can't use the data I'm provided with because it comes from many different sources and it's impossible to standardize any data field.
Names, for example, are listed so many different ways it would be impossible to reconcile them and use that as an ID.
If I can get an ID like this (something like an SSN but less sensitive) then I want to try what my question is about.
I want to iterate through all the elements in my <key: zip code, value: Sales-rep object> hashmap, we will call it RepsByZipCode. When I iterate through each Salesrep object I want to get an ID that I can use in a different hashmap called MethodById <key: ID, value: a method run on the Object with this ID>.
I want it to run a different method for each key on the Object with the matching key (AKA the ID). The point is to run a different method on each different object in linear time so that by the end of the for loop, each object in RepsByZipCode will have some method run on it that can update information (thus completing the cross-referencing).
This also makes the code very extendable because I can change the method for each key if I want to update things differently. Ex:
//SalesRep Object Constructor:
SalesRep(String name, String email, ..., String Id)
Map<String, SalesRep> RepsByZipCode = new HashMap<>(); // key: zip code, value: rep
//code fills in the above with the first dataset
Map<String, ???> MethodById = new HashMap<>(); // key: ID, value: method
//code fills in the above with the second dataset
for (String ZipKey : RepsByZipCode.keySet()) {
    SalesRep Rep = RepsByZipCode.get(ZipKey);
    String ID = Rep.getId();
    MethodById.get(ID); // look up the method for this ID and run it on Rep
    //each time this runs, one entry in RepsByZipCode is updated with one
    //method from MethodById.
}
//after this for loop, all of RepsByZipCode has been updated in linear time

You could put these methods into different classes that implement a common interface, and store an instance of each class in your map. If you're using at least Java 8 and your methods are simple enough, you could use lambdas to avoid some boilerplate.
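A minimal sketch of the lambda variant, assuming Java 8 or later. java.util.function.Consumer plays the role of the common interface; the SalesRep fields and the applyAll helper are hypothetical stand-ins for whatever the real class offers.

```java
import java.util.Map;
import java.util.function.Consumer;

final class SalesRepUpdates {
    // Minimal stand-in for the asker's SalesRep class; the fields are hypothetical.
    static final class SalesRep {
        final String id;
        String email;
        String phone;

        SalesRep(String id) {
            this.id = id;
        }
    }

    // Runs the action registered for each rep's ID, if there is one.
    // Returns the number of map entries for which an action was run.
    static int applyAll(Map<String, SalesRep> repsByZipCode,
                        Map<String, Consumer<SalesRep>> methodById) {
        int updated = 0;
        for (SalesRep rep : repsByZipCode.values()) {
            Consumer<SalesRep> action = methodById.get(rep.id);
            if (action != null) {
                action.accept(rep);
                updated++;
            }
        }
        return updated;
    }
}
```

An entry could be registered with something like methodById.put("r1", rep -> rep.email = "a@example.com"). Note that a rep who covers several zip codes appears several times in repsByZipCode, so the action for that rep would run once per zip code unless the reps are de-duplicated first.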

Issue iterating over custom writable component in reducer

I am using a custom writable class as VALUEOUT in the map phase of my MR job, where the class has two fields: an org.apache.hadoop.io.Text and an org.apache.hadoop.io.MapWritable. In my reduce function I iterate through the values for each key and perform two operations: 1. filter, 2. aggregate. In the filter, I have some rules to check whether certain values in the MapWritable (with key as Text and value as IntWritable or DoubleWritable) satisfy certain conditions, and then I simply add them to an ArrayList. At the end of the filter operation, I have a filtered list of my custom writable objects. In the aggregate phase, when I access the objects, it turns out that the last object that was successfully filtered in has overwritten all other objects in the ArrayList. After going through some similar issues with lists on SO where the last object overwrites all the others, I confirmed that I do not have static fields, nor am I reusing the same custom writable by setting different values (which were quoted as the possible reasons for such an issue). For each key in the reducer I have made sure that the CustomWritable, Text key and the MapWritable are new objects.
In addition, I also performed a simple test by eliminating the filter and aggregate operations in my reduce and just iterating through the values and adding them to an ArrayList using a for loop. In the loop, every time I added a CustomWritable to the list, I logged the values of all the contents of the list. I logged before and after adding the element to the list. Both logs showed that the previous set of elements had been overwritten. I am confused about how this could even happen. As soon as the next element in the iterable of values was accessed by the loop for ( CustomWritable result : values ), the list content was modified. I am unable to figure out the reason for this behaviour. If anyone can shed some light on this, it would be really helpful. Thanks.

The "values" iterator in the reducer reuses the value as you iterate. It's a technique used for performance and smaller memory footprint. Behind the scenes, Hadoop deserializes the next record into the same Java object. If you need to "remember" an object, you'll need to clone it.
You can take advantage of the Writable interface and use the raw bytes to populate a new object.
IntWritable first = WritableUtils.clone(values.next(), context.getConfiguration());
IntWritable second = WritableUtils.clone(values.next(), context.getConfiguration());
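The effect can be reproduced without Hadoop. The sketch below is plain Java and uses no Hadoop API: Box is a made-up mutable holder standing in for a Writable, and reusingIterator imitates an iterator that refills one shared object, which is the behaviour described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class ReusedValueDemo {
    // Mutable holder standing in for a Writable value (hypothetical).
    static final class Box {
        int value;

        Box(int value) {
            this.value = value;
        }

        Box copy() {
            return new Box(value);
        }
    }

    // Iterator that hands out the SAME Box every time, refilled with the next value.
    static Iterator<Box> reusingIterator(int... data) {
        final Box shared = new Box(0);
        return new Iterator<Box>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < data.length;
            }

            @Override
            public Box next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[index++];
                return shared;
            }
        };
    }

    // Collects the values into a list, optionally copying each one before storing it,
    // and returns what the stored elements contain once iteration has finished.
    static List<Integer> collect(boolean copyEachValue, int... data) {
        List<Box> kept = new ArrayList<>();
        Iterator<Box> it = reusingIterator(data);
        while (it.hasNext()) {
            Box current = it.next();
            kept.add(copyEachValue ? current.copy() : current);
        }
        List<Integer> seen = new ArrayList<>();
        for (Box box : kept) {
            seen.add(box.value);
        }
        return seen;
    }
}
```

Here collect(false, 1, 2, 3) yields [3, 3, 3], because the list holds three references to one object, while collect(true, 1, 2, 3) yields [1, 2, 3]. In a real reducer the copy step would be the WritableUtils.clone call shown above.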

How can I optimize an AppEngine Java/JDO datastore put() to use less writes

I'm tuning an app we run on App Engine and one of the largest costs is data store reads and writes. I have noticed one of the biggest offenders of the writes is when we persist an order.
Basic data is Order has many items - we store both separately and relate them like this:
@PersistenceCapable
public class Order implements Serializable {
    @Persistent(mappedBy="order")
    @Element(dependent = "true")
    private List<Item> orderItems;
    // other fields too obviously
}
@PersistenceCapable
public class Item implements Serializable {
    @Persistent(dependent = "true")
    @JsonIgnore
    private Order order;
    // more fields...
}
The appstats is showing two data store puts for an order with a single item - but both are using massive numbers of writes. I want to know the best way to optimize this from anyone who's got experience.
AppStats data:
real=34ms api=1695ms cost=6400 billed_ops=[DATASTORE_WRITE:64]
real=42ms api=995ms cost=3600 billed_ops=[DATASTORE_WRITE:36]
Some of the areas I know of that would probably help:
fewer indexes - there are implicit indexes on a number of order and item properties that I could tell App Engine not to index; for example, item.quantity is not something I need to query by. But is that what all these writes are for?
de-relate item and order, so that I just have a single entity OrderItem, removing the need for a relationship at all (but paying for it with extra storage).
In terms of explicit indexes, I only have one on the order table, by order date, and one on the order items, by SKU/date, plus the implicit one for the relationship.
If the items were a collection, not a list, would that remove the need for an index on the children _IDX entirely?
So, my question would be, are any of the above items going to herald big wins, or are there other options I've missed that would be better to focus on initially?
Bonus points: Is there a good 'guide to less datastore writes' article somewhere?

Billing docs clearly state:
New Entity Put (per entity, regardless of entity size): 2 writes + 2 writes per indexed property value + 1 write per composite index value
Existing Entity Put (per entity): 1 write + 4 writes per modified indexed property value + 2 writes per modified composite index value
Also relevant: App Engine predefines a simple index on each property of an entity.
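As a rough illustration, the two formulas quoted above can be turned into a small calculator. The class and method names are made up, and the property counts in the usage note are hypothetical, not taken from the asker's entities.

```java
final class DatastoreWriteCost {
    // New entity put: 2 writes + 2 per indexed property value + 1 per composite index value.
    static int newEntityPut(int indexedPropertyValues, int compositeIndexValues) {
        return 2 + 2 * indexedPropertyValues + compositeIndexValues;
    }

    // Existing entity put: 1 write + 4 per modified indexed property value
    // + 2 per modified composite index value.
    static int existingEntityPut(int modifiedIndexedValues, int modifiedCompositeValues) {
        return 1 + 4 * modifiedIndexedValues + 2 * modifiedCompositeValues;
    }
}
```

For example, a new entity with 10 indexed property values and 1 composite index value would cost newEntityPut(10, 1) = 23 writes under these formulas. Marking 6 of those properties as unindexed would bring it down to newEntityPut(4, 1) = 11.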
On to questions:
Yes, the number of write ops is related to the number of indexed properties. Make them unindexed to save write ops.
Combining two entities together would save you 1 write (or 2 in case of new entities).
You don't need to have "explicit" indexes for one property only. These are generated automatically by appengine. You just need to explicitly configure compound indexes, spanning more properties.
No. Collection or List (= Collection with order) is just a Java representation, Datastore API always uses list internally (= items added retain their order).
Update:
The number of indexes affects the cost of a write but not its speed. Writes are done in two phases: a commit phase where entity data is saved, and an apply phase where indexes are built. The put operation returns after the commit phase and is not affected by the number of indexes.
In your case you are calling two puts, one after another. As you can see from AppStats graph they happen consecutively. You might want to execute them in parallel as async operations (not sure if available in JDO).

How best to get List nodes for a cache implementation

Okay first I will preface this with "I am very very new to Java" (i.e., a few days in), but I am a programmer by trade.
I have come across a situation where I want to load data. However, I would like to cache that data to prevent extraneous calls to the API (or, whatever the data source may be). After thinking about it a bit, I have come up with a cache scheme which seems to be pretty reasonable to me: the idea is that the DataCache class has two collections. The first is a hash table with key type "string" and value type "CacheData". CacheData has 2 data members - the actual result of the API call in string form, and a ref (ListIterator?) to a node of a linked list. Which brings us to the 2nd collection - a linked list of keys. The idea is that when a request comes in for data, we see if it's in the hash. If not, we fetch from the API, add the resulting key to the front of the linked list, and store a Data object in the hash containing the result, along with a ref to the first node of the linked list (the one we just added). If the data IS found in the hash, we break the node out of the linked list, put it at the front, and return the data from CacheData. The benefit: every operation is guaranteed to execute in O(1), if I'm understanding correctly.
Can I store the integer hash value of the 'request' in the linked list instead of the string (request) as a whole? If so, how can I access the result in the hashmap given that integer? (none of the methods seem to take an 'int' as param). Also...is my approach to this situation sound? Or is there perhaps something in Java that would make this easier?
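For what it's worth, the hash-plus-recency-list scheme described above closely matches what java.util.LinkedHashMap does in its access-order mode. The sketch below assumes the recency list is meant for least-recently-used eviction with a fixed capacity, which the question does not actually state; the LruCache name and the capacity parameter are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache built on LinkedHashMap's access-order mode.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // The third argument (true) orders entries by most recent access
        // instead of by insertion order.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the
        // least recently used entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a", and then putting "c" would evict "b", because "a" was touched more recently.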

Java's Hashtable - How to get any entry

I'm working on a chat server, and I'm putting the Clients into a Hashtable.
This Hashtable is composed by <String name, Connection c>, where Connection has Socket and in-out flows.
I can send messages just looking for a nick in the Hashtable, but how can I send it to all the people?
Can I "Scout" (this was the unknown term) every Hashtable's entry? (like an array, I want to "SCOUT" each entry, so I'll do a loop and I'll send the message to everyone).
Thanks in advance.

You could answer your own question by reading the javadocs for HashMap. "Read the javadocs" is an important lesson that every beginner in Java should learn and remember.
In this case, the javadocs will show you 3 methods that could be useful:
The keySet() method returns a collection consisting of the keys in the table.
The values() method returns a collection consisting of the values in the table.
The entrySet() method returns a collection representing the key/value pairs in the table.
You can iterate these collections as any other collection. There are examples in the other answers.
However, I get the impression that your application is multi-threaded. If that is the case, then there are three other problems that you need to deal with to make your program reliable:
If two or more threads could use the same object or data structure, they need to take the necessary steps to ensure that they are properly synchronized. If they don't then there is a non-zero probability that some sequence of operations will result in the data structure being put into an inconsistent state, or that one or more threads will see an inconsistent state (due to memory caches, values saved in registers, etc).
If one thread is using one of a HashMap's collection iterators and another adds or removes an entry, then the first one is likely to get a ConcurrentModificationException.
If you solve the above two problems by locking out all other operations on the HashMap while your "send to all" operation is going on, you are unintentionally creating a performance bottleneck. Basically, everything else stops until the operation has finished. You get a similar effect (but on a finer scale) if you simply put a synchronization wrapper around the HashMap.
You need to read and learn about these things (there's far too much to explain in a single SO answer). A simple (but not universal) solution to all 3 problems that will probably work in your use case is to use a ConcurrentHashMap instead of a plain HashMap.
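A minimal sketch of the ConcurrentHashMap suggestion follows. ChatRegistry and its methods are hypothetical, and the Connection interface is only a stand-in for the asker's class, assumed to have a send(String) method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ChatRegistry {
    // Stand-in for the asker's Connection class (hypothetical).
    interface Connection {
        void send(String message);
    }

    private final Map<String, Connection> clients = new ConcurrentHashMap<>();

    void join(String name, Connection connection) {
        clients.put(name, connection);
    }

    void leave(String name) {
        clients.remove(name);
    }

    // Sends the message to every client currently in the map and returns how many
    // were reached. ConcurrentHashMap iterators are weakly consistent: they never
    // throw ConcurrentModificationException, but a client that joins or leaves
    // during the loop may or may not be included.
    int broadcast(String message) {
        int sent = 0;
        for (Connection connection : clients.values()) {
            connection.send(message);
            sent++;
        }
        return sent;
    }
}
```

No lock is held during the loop, so other threads can keep adding and removing clients while a broadcast is in progress. Whether Connection.send itself is safe to call from several threads is a separate question that this sketch does not address.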

I can send messages just looking for a nick in the Hashtable, but how can I send it to all the people?
Then do the same for all nicknames in the hash table:
for (String name : yourTable.keySet())
yourTable.get(name).send("your message");
or, alternatively:
for (Connection conn : yourTable.values())
conn.send("your message");

You can iterate over all the values in the Hashtable and do what you wish to all of them:
Map<String, Connection> users;
for (Connection connection : users.values()) {
// Send the message to each Socket here.
}

Hashtable has keySet(), which returns all the keys in that table. I am posting this from mobile, so I couldn't get you an example link. If you want the connections as well, you can use entrySet().
