Issue iterating over custom writable component in reducer - java

I am using a custom writable class as VALUEOUT in the map phase of my MR job. The class has two fields: an org.apache.hadoop.io.Text and an org.apache.hadoop.io.MapWritable. In my reduce function I iterate through the values for each key and perform two operations: 1. filter, 2. aggregate. In the filter, I have some rules to check whether certain values in the MapWritable (with key as Text and value as IntWritable or DoubleWritable) satisfy certain conditions, and if so I simply add them to an ArrayList. At the end of the filter operation, I have a filtered list of my custom writable objects. At the aggregate phase, when I access the objects, it turns out that the last object that was successfully filtered in has overwritten all other objects in the ArrayList. After going through some similar issues with lists on SO where the last object overwrites all the others, I confirmed that I do not have static fields, nor am I reusing the same custom writable by setting different values (which were quoted as the possible reasons for such an issue). For each key in the reducer I have made sure that the CustomWritable, the Text key and the MapWritable are new objects.
In addition, I performed a simple test by eliminating the filter and aggregate operations in my reduce and just iterating through the values, adding them to an ArrayList with a for loop. In the loop, every time I added a CustomWritable to the list, I logged the values of all the contents of the list, both before and after adding the element. Both logs showed that the previous set of elements had been overwritten. I am confused about how this could even happen. As soon as the next element in the iterable of values was accessed by the loop for (CustomWritable result : values), the list contents were modified. I am unable to figure out the reason for this behaviour. If anyone can shed some light on this, it would be really helpful. Thanks.

The "values" iterator in the reducer reuses the value object as you iterate. This is a technique used for performance and a smaller memory footprint: behind the scenes, Hadoop deserializes the next record into the same Java object. If you need to "remember" an object, you'll need to clone it.
You can take advantage of the Writable interface and use the raw bytes to populate a new object.
IntWritable first = WritableUtils.clone(values.next(), context.getConfiguration());
IntWritable second = WritableUtils.clone(values.next(), context.getConfiguration());
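For example, here is a minimal sketch of a reduce method that buffers cloned values, assuming the newer org.apache.hadoop.mapreduce API and the CustomWritable class from the question (the class name MyReducer is illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, CustomWritable, Text, CustomWritable> {

    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context)
            throws IOException, InterruptedException {
        List<CustomWritable> buffered = new ArrayList<CustomWritable>();
        for (CustomWritable value : values) {
            // Hadoop reuses 'value' on every iteration, so keep an independent copy
            // rather than the reference handed out by the iterator.
            buffered.add(WritableUtils.clone(value, context.getConfiguration()));
        }
        // 'buffered' now holds distinct objects; the filter and aggregate steps
        // can safely run over it.
    }
}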

Related

Java: Wrapping objects in some type of collection to store duplicates in a set

I want to make a set of some type of collection (not sure which one yet) as a way of "storing duplicates" in a set. For example, if I wanted to add the integer 5 with 39 additional copies, I could put it into an ArrayList at index 39. Thus, if I were to get the size of the ArrayList, I would know how many copies of 5 existed within the set.
There are a few other ways I could implement this but I have yet to decide on one. The main issue I'm having with implementing this is that I'm not sure how I can "dynamically" make arraylists (or whatever collection I may end up using) so that whenever someone were to call mySet.add(object), the object is first inserted into a unique arraylist then into the set itself.
Can anyone give me some ideas on how I could approach this?
EDIT:
Sorry, I should have been more clear in my question. The point of the code that I'm writing is that we have a set-like collection that allows duplicates. And yes, some of the associated methods will have to be re-written. Also, my code should be written under the assumption that we do not know what type of object is being inserted (only one data type per set though), nor how many instances of the same object will be added, nor how many different unique objects will be added.
I would rather go for using a Map like
HashMap<Object, Integer>
where the key is the Object that you want to count and the value is the count.
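As a small sketch of that idea (a fragment assuming java.util imports, with Integer elements chosen purely for illustration):

Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
counts.merge(5, 1, Integer::sum);   // first copy of 5 -> count becomes 1
counts.merge(5, 39, Integer::sum);  // 39 more copies  -> count becomes 40
System.out.println(counts.get(5));  // prints 40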
You could try Guava's Multiset; I think it's what you want.
It can store the count of each object. What you need to do is just
multiSet.add(object);
If the object is added for the first time, a new entry is created for it; otherwise its count is incremented by one.
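A minimal sketch of that approach, assuming Guava (com.google.common.collect) is on the classpath:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

public class MultisetDemo {
    public static void main(String[] args) {
        Multiset<Integer> multiSet = HashMultiset.create();
        multiSet.add(5);                       // first occurrence
        multiSet.add(5, 39);                   // add 39 more copies in one call
        System.out.println(multiSet.count(5)); // prints 40
    }
}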

How this particular Collection is suitable for the particular scenario?

I was reading some sample questions from the Enthuware exam simulator. I came across a question whose problem statement is like this:
You are designing a class that will cache objects. It should be able
to store and retrieve an object when supplied with an object
identifier. Further, this class should work by tracking the "last
accessed times" of the objects. Thus, if its capacity is full, it
should remove only the object that hasn't been accessed the longest.
Which collection class would you use to store the objects?
The possible options given were
HashSet
ArrayList
LinkedHashMap
LinkedList
TreeMap
The correct answer given by the simulator is LinkedHashMap. I quote the explanation given by the simulator:
The LinkedHashMap class maintains the elements in the order of their
insertion time. This property can be used to build the required cache
as follows:
Insert the key-value pairs as you do normally where key will be the object identifier and value will be the object to be cached.
When a key is requested, remove it from the LinkedHashMap and then insert it again. This will make sure that this pair is marked as inserted latest.
If the capacity is full, remove the first element.
Note that you cannot simply insert the key-value again (without first
removing it) because a reinsertion operation does not affect the
position of the pair.
I do understand the first point only. Still here are the following questions.
Point 1 states that the value will be the object to be cached. How does caching work like this?
I am not able to understand point 2 onwards.
Can someone explain this concept to me? Thanks.
I believe you should take the 'caching' in the example with a grain of salt: it's meant to provide some context, but it's not entirely relevant.
The caching here likely means retrieving a value from the collection instead of accessing a data source and getting it from there.
As to your second question:
When a key is requested, remove it from the LinkedHashMap and then insert it again. This will make sure that this pair is marked as inserted latest.
Consider the following Map:
ID | Value
1 | Jack
5 | John
3 | Jenny
In this situation Jack was entered first, then John and after that Jenny.
Now we want to retrieve the cached value for John. To do so, we first retrieve the value for his unique identifier (5) and we get the object John as the result. Right now we have our cached value, but the requirement to track the last access time hasn't been fulfilled yet. Therefore we delete him and add him again, essentially placing him at the end.
ID | Value
1 | Jack
3 | Jenny
5 | John
John stays cached, but now his access time has been updated. Whenever the map is full, you remove the first item in line (which will essentially be the item that's not been accessed for the longest time).
If the map has a maximum size of 3 and we try to add Jeff, we get the following situation:
ID | Value
3 | Jenny
5 | John
7 | Jeff
The first item (Jack) and thus the least-recently accessed object will be removed, making place for the new object (most-recently accessed).
Point 1 states that the value will be the object to be cached. How does caching work like this?
Caching an object here means storing the created objects in some collection so that they can be retrieved later. Now, as the requirement is to store and retrieve objects using their key, clearly a Map is the option here, which will store the mapping from an object's key to the object itself.
Also, LinkedHashMap is suitable because it maintains the insertion order. So, the first object you create will be the first in that map.
When a key is requested, remove it from the LinkedHashMap and then insert it again. This will make sure that this pair is marked as inserted latest.
Again, take a look at the requirement. It says the object that hasn't been accessed the longest should be removed. Now suppose an object at the first position hasn't been accessed for a long time. When you access it now, you wouldn't want it to still be in the first position, because in that case, when you remove the first element, you would be removing the element you just accessed.
That is why you should remove the element and insert it back, so that it is placed at the end.
If the capacity is full, remove the first element.
As is already clear, the first element is the one which was inserted first and has the oldest access time. So you should remove the first element only, as the requirement says:
if its capacity is full, it should remove only the object that hasn't been accessed the longest.
First step, determine if you need a Set, Map, or List.
Lists preserve order.
Maps allow fast, key based, look up of items.
Sets provide identity based membership, in other words, no duplicates.
You probably want lookup by key, so it's some sort of map. However, you also want to preserve order. At first glance, LinkedHashMap seems a winner, but it is not.
LinkedHashMap preserves insertion order, and you want to preserve access order. To twist one into the other, you would have to remove and add back each element as it is accessed. This is very wasteful, and subject to timing issues (between the would-be-atomic add and read).
You could simplify both by maintaining two internal data structures.
A HashMap for fast access.
A linked list to quickly reorder based on access times.
As you insert, the hashmap stores a linked-list node under the stored data object's key, and the data itself lives inside that node. The node is added at the "newer" end of the list.
As you access, the hashmap pulls up the linked list node, which is then removed and inserted into the head of the linked list. (and the data is returned).
As you delete, the hashmap pulls up the linked list node, and removes it from the linked list, and clears the hashmap entry.
When removing an expired entry, remove it from the old end of the linked list, and don't forget to clear out the hashmap entry.
By doing this, you have built your own kind of LinkedHashMap, but one that tracks according to access time instead of insertion order.
They are omitting three very important points:
Together with the LinkedHashMap, a mechanism to determine when to start removing objects is necessary. The simplest one is a counter availableCapacity, initialized to the maximum capacity and decremented/incremented accordingly. An alternative is to compare the size() of the LinkedHashMap with a maximumCapacity variable.
The LinkedHashMap (specifically its values()) is assumed to contain the only pointers to the cached objects/structures. If any other pointers are kept, they are assumed to be transient.
The cache is to be administered under a LRU regime.
This said, and to answer your questions:
Yes.
By definition, the first item in a LinkedHashMap is the first inserted ("oldest"). If every time a cache entry is used it is removed and re-inserted into the map, it is placed at the end of the list and thus becomes the "newest". The first will therefore always be the one that has not been used for the longest time, the second the next-longest, and so on. This is why elements are removed from the front.
LinkedHashMap stores items in the order they were inserted. They're using one to implement an LRU cache. The keys are the object identifiers. The values are the items to be cached. Maps have a very fast lookup time, which is what makes the map a cache: it's faster to look an object up in the map than to re-fetch or rebuild it from the original source.
Inserting items into the map puts them at the end of the map. So every time you read something, you take it out and put it back on the end. Then, when you need more room in your cache, you chop off the first element. That's the one that hasn't been used in the longest time, since everything that was used got re-inserted behind it.
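To make the remove-and-reinsert idea concrete, here is a minimal sketch built on LinkedHashMap; the class and method names are illustrative, not from the question:

import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleCache<K, V> {
    private final int capacity;
    private final Map<K, V> entries = new LinkedHashMap<K, V>();

    public SimpleCache(int capacity) {
        this.capacity = capacity;
    }

    public void put(K id, V value) {
        entries.remove(id);                                // drop any older position
        if (entries.size() >= capacity) {
            K eldest = entries.keySet().iterator().next(); // first = least recently used
            entries.remove(eldest);
        }
        entries.put(id, value);                            // reinsert at the "newest" end
    }

    public V get(K id) {
        V value = entries.remove(id);                      // remove...
        if (value != null) {
            entries.put(id, value);                        // ...and reinsert to refresh its position
        }
        return value;
    }
}

For completeness, LinkedHashMap also offers a constructor with an accessOrder flag plus a protected removeEldestEntry hook that can do this bookkeeping automatically, but the sketch above follows the manual remove-and-reinsert approach described in the answers.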

Why does java.util.Map.values() allow you to remove entries from the returned Collection

Why does java.util.Map.values() allow you to delete entries from the returned Collection when it makes no sense to remove a key-value pair based on the value? The code doing the removal would have no idea which key the value being removed is mapped from. Especially when there are duplicate values, calling remove on that Collection would result in an unexpected key being removed.
it makes no sense to remove a key-value pair based on the value
I don't think you're being imaginative enough. I'll admit there probably isn't wide use for it, but there will be valid cases where it would be useful.
As a sample use case, say you had a Map<Person, TelephoneNumber> called contactList. Now you want to filter your contact list by those that are local.
To accomplish this, you could make a copy of the map, localContacts = new HashMap<>(contactList) and remove all mappings where the TelephoneNumber starts with an area code other than your local area code. This would be a valid time where you want to iterate through the values collection and remove some of the values:
Map<Person, TelephoneNumber> contactList = getContactList();
Map<Person, TelephoneNumber> localContacts = new HashMap<Person, TelephoneNumber>(contactList);
for (Iterator<TelephoneNumber> valuesIt = localContacts.values().iterator(); valuesIt.hasNext(); ) {
    TelephoneNumber number = valuesIt.next();
    if (!number.getAreaCode().equals(myAreaCode)) {
        valuesIt.remove();
    }
}
Especially when there are duplicate values, calling remove on that Collection would result in an unexpected key being removed.
What if you wanted to remove all mappings with that value?
It has to have a remove method because that's part of Collection. Given that, it has the choice of allowing you to remove values or throwing an UnsupportedOperationException. Since there are legitimate reasons that you might want to remove values, why not choose to allow this operation?
Maybe there's a given value where you want to remove every instance of it from the Map.
Maybe you want to trim out every third key/value pair for some reason.
Maybe you have a map from hotel room number to occupancy count and you want to remove everything from the map where the occupancy count is greater than one in order to find a room for someone to stay in.
...if you think about it more closely, there are plenty more examples like this...
In short: there are plenty of situations where this might be useful and implementing it doesn't harm anyone who doesn't use it, so why not?
I think there is quite often a use for removing an entry based on its value; other answers show examples. Given that, if you want to remove a certain value, why would you only want one particular key of it removed? Even if you did, you'd have to know which key you wanted to remove (or not, as the case may be), and then you should just remove it by key anyway.
The Collection returned is a special Collection, and its semantics are such that it knows how values in it relate back to the Map it came from. The javadoc indicates which Collection operations the returned collection supports.
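A small sketch of that view behaviour, reusing the hotel-occupancy idea from above (the room numbers and counts are made up for illustration):

import java.util.HashMap;
import java.util.Map;

public class ValuesViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> occupancyByRoom = new HashMap<String, Integer>();
        occupancyByRoom.put("101", 1);
        occupancyByRoom.put("102", 3);
        occupancyByRoom.put("103", 2);

        // Removing through the values() view removes the backing entries (Java 8+).
        occupancyByRoom.values().removeIf(count -> count > 1);

        System.out.println(occupancyByRoom); // only {101=1} is left
    }
}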

Efficiently finding duplicates in a constrained many-to-many dataset?

I have to write a bulk operation version of something our webapp
lets you do on a more limited basis from the UI. The desired
operation is to assign objects to a category. A category can have
multiple objects but a given object can only be in one category.
The workflow for the task is:
1) Using the browser, a file of the following form is uploaded:
# ObjectID, CategoryID
Oid1, Cid1
Oid2, Cid1
Oid3, Cid2
Oid4, Cid2
[etc.]
The file will most likely have tens to hundreds of lines, but
definitely could have thousands of lines.
In an ideal world a given object id would only occur once in the file
(reflecting the fact that an object can only be assigned to one category)
But since the file is created outside of our control, there's no guarantee
that's actually true and the processing has to deal with that possibility.
2) The server will receive the file, parse it, pre-process it
and show a page something like:
723 objects to be assigned to 126 categories
142 objects not found
42 categories not found
Do you want to continue?
[Yes] [No]
3) If the user clicks the Yes button, the server will
actually do the work.
Since I don't want to parse the file in both steps (2) and (3), as
part of (2), I need to build a container that will live across
requests and hold a useful representation of the data that will let me
easily provide the data to populate the "preview" page and will let me
efficiently do the actual work. (While obviously we have sessions, we
normally keep very little in-memory session state.)
There is an existing
assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)
function that is used when assignment is done through the UI. It is
highly desirable for the bulk operation to also use this API since it
does a bunch of other business logic in addition to the simple
assignment and we need that same business logic to run when this bulk
assign is done.
Initially it was going to be OK that if the file "illegally" specified
multiple categories for a given object -- it would be OK to assign the
object arbitrarily to one of the categories the file associated it
with.
So I was initially thinking that in step (2) as I went through the
file I would build up and put into the cross-request container a
Map<CategoryId, Set<ObjectId>> (specifically a HashMap for quick
lookup and insertion) and then when it was time to do the work I could
just iterate on the map and for each CategoryId pull out the
associated Set<ObjectId> and pass them into assignObjectsToCategory().
However, the requirement on how to handle duplicate ObjectIds changed.
And they are now to be handled as follows:
If an ObjectId appears multiple times in the file and
all times is associated with the same CategoryId, assign
the object to that category.
If an ObjectId appears multiple times in the file and
is associated with different CategoryIds, consider that
an error and make mention of it on the "preview" page.
That seems to mess up my Map<CategoryId, Set<ObjectId>> strategy
since it doesn't provide a good way to detect that the ObjectId I
just read out of the file is already associated with a CategoryId.
So my question is how to most efficiently detect and track these
duplicate ObjectIds?
What came to mind is to use both "forward" and "reverse" maps:
public class CrossRequestContainer
{
    ...
    Map<CategoryId, Set<ObjectId>> objectsByCategory;     // HashMap
    Map<ObjectId, List<CategoryId>> categoriesByObject;   // HashMap
    Set<ObjectId> illegalDuplicates;
    ...
}
Then as each (ObjectId, CategoryId) pair was read in, it would
get put into both maps. Once the file was completely read in, I
could do:
for (Map.Entry<ObjectId, List<CategoryId>> entry : categoriesByObject.entrySet()) {
    List<CategoryId> categories = entry.getValue();
    if (categories.size() > 1) {
        ObjectId object = entry.getKey();
        if (!all_categories_are_equal(categories)) {
            illegalDuplicates.add(object);
            // Since this is an "illegal" duplicate I need to remove it
            // from every category that it appeared with in the file.
            for (CategoryId category : categories) {
                objectsByCategory.get(category).remove(object);
            }
        }
    }
}
When this loop finishes, objectsByCategory will no longer contain any "illegal"
duplicates, and illegalDuplicates will contain all the "illegal" duplicates to
be reported back as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to do the assignments.
But while I think this will work, I'm worried about storing the data twice, especially
when the input file is huge. And I'm also worried that I'm missing something re: efficiency
and this will go very slowly.
Are there ways to do this that won't use double memory but can still run quickly?
Am I missing something that even with the double memory use will still run a lot
slower than I'm expecting?
Given the constraints you've described, I don't think there's a way to do this using a lot less memory.
One possible optimization, though, is to only maintain lists of categories for objects which are listed in multiple categories, and otherwise just map object to category, i.e.:
Map<CategoryId, Set<ObjectId>> objectsByCategory; // HashMap
Map<ObjectId, CategoryId> categoryByObject; // HashMap
Map<ObjectId, Set<CategoryId>> illegalDuplicates; // HashMap
Yes, this adds yet another container, but it will (hopefully) contain only a few entries; also, the memory requirements of the categoryByObject map are reduced (cutting out one list overhead per entry).
The logic is a little more complicated of course. When a duplicate is initially discovered, the object should be removed from the categoryByObject map and added into the illegalDuplicates map. Before adding any object into the categoryByObject map, you will need to first check the illegalDuplicates map.
Finally, it probably won't hurt performance to build the objectsByCategory map in a separate loop after building the other two maps, and it will simplify the code a bit.
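A rough sketch of that logic, assuming the ObjectId and CategoryId types from the question, java.util imports, and a hypothetical parsedLines iterable whose ParsedLine elements expose getObjectId()/getCategoryId() accessors:

Map<ObjectId, CategoryId> categoryByObject = new HashMap<ObjectId, CategoryId>();
Map<ObjectId, Set<CategoryId>> illegalDuplicates = new HashMap<ObjectId, Set<CategoryId>>();

for (ParsedLine line : parsedLines) {
    ObjectId object = line.getObjectId();
    CategoryId category = line.getCategoryId();

    Set<CategoryId> conflicts = illegalDuplicates.get(object);
    if (conflicts != null) {
        // Already known to be in conflict; just record the extra category.
        conflicts.add(category);
    } else {
        CategoryId existing = categoryByObject.get(object);
        if (existing == null || existing.equals(category)) {
            // First sighting, or a harmless repeat of the same assignment.
            categoryByObject.put(object, category);
        } else {
            // Conflicting category: move the object into the illegal-duplicates map.
            categoryByObject.remove(object);
            Set<CategoryId> set = new HashSet<CategoryId>();
            set.add(existing);
            set.add(category);
            illegalDuplicates.put(object, set);
        }
    }
}

// Second pass: group the surviving assignments by category for assignObjectsToCategory().
Map<CategoryId, Set<ObjectId>> objectsByCategory = new HashMap<CategoryId, Set<ObjectId>>();
for (Map.Entry<ObjectId, CategoryId> e : categoryByObject.entrySet()) {
    objectsByCategory.computeIfAbsent(e.getValue(), c -> new HashSet<ObjectId>()).add(e.getKey());
}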

Difference between HashMap and ArrayList in Java?

In Java, ArrayList and HashMap are used as collections. But I couldn't understand in which situations we should use an ArrayList and when we should use a HashMap. What is the major difference between the two?
You are asking specifically about ArrayList and HashMap, but I think to fully understand what is going on you have to understand the Collections framework. So an ArrayList implements the List interface and a HashMap implements the Map interface. So the real question is when do you want to use a List and when do you want to use a Map. This is where the Java API documentation helps a lot.
List:
An ordered collection (also known as a
sequence). The user of this interface
has precise control over where in the
list each element is inserted. The
user can access elements by their
integer index (position in the list),
and search for elements in the list.
Map:
An object that maps keys to values. A
map cannot contain duplicate keys;
each key can map to at most one value.
So, as other answers have discussed, the List interface (ArrayList) is an ordered collection of objects that you access using an index, much like an array (and in the case of ArrayList, as the name suggests, it is just an array in the background, but a lot of the details of dealing with the array are handled for you). You would use an ArrayList when you want to keep things in order (the order they are added, or indeed the position within the list that you specify when you add the object).
A Map, on the other hand, takes one object and uses that as a key (index) to another object (the value). So let's say you have objects which have unique IDs, and you know you are going to want to access these objects by ID at some point; the Map will make this very easy on you (and quicker/more efficient). The HashMap implementation uses the hash value of the key object to locate where it is stored, so there is no guarantee of the order of the values anymore. There are however other classes in the Java API that can provide this, e.g. LinkedHashMap, which as well as using a hash table to store the key/value pairs, also maintains a linked list of the keys in the order they were added, so you can always access the items again in the order they were added (if needed).
If you use an ArrayList, you have to access the elements with an index (int type). With a HashMap, you can access them by a key of another type (for example, a String).
HashMap<String, Book> books = new HashMap<String, Book>();
// String is the type of the index (the key)
// and Book is the type of the elements (the values)
// Like with an arraylist: ArrayList<Book> books = ...;
// Now you have to store the elements with a string key:
books.put("Harry Potter III", new Book("JK Rowling", 456, "Harry Potter"));
// Now you can access the elements by using a String index
Book book = books.get("Harry Potter III");
This is impossible (or much more difficult) with an ArrayList. The only good way to access elements in an ArrayList is by getting the elements by their index-number.
So, this means that with a HashMap you can use every type of key you want.
Another helpful example is in a game: you have a set of images, and you want to flip them. So you write an image-flip method and then store the flipped results:
HashMap<BufferedImage, BufferedImage> flipped = new HashMap<BufferedImage, BufferedImage>();
BufferedImage player = ...; // On this image the player walks to the left.
BufferedImage flippedPlayer = flip(player); // On this image the player walks to the right.
flipped.put(player, flippedPlayer);
// Now you can access the flipped instance by doing this:
flipped.get(player);
You flip the player image once and then store it. You can access a BufferedImage with a BufferedImage as the key type of the HashMap.
I hope you understand my second example.
Not really a Java-specific question. It seems you need a "primer" on data structures. Try googling "What data structure should you use".
Try this link http://www.devx.com/tips/Tip/14639
From the link :
Following are some tips for matching the most commonly used data structures with particular needs.
When to use a Hashtable?
A hashtable, or a similar data structure, is a good candidate if the stored data is to be accessed in the form of key-value pairs. For instance, if you were fetching the name of an employee, the result can be returned in the form of a hashtable as a (name, value) pair. However, if you were to return the names of multiple employees, returning a hashtable directly would not be a good idea. Remember that the keys have to be unique or your previous value(s) will get overwritten.
When to use a List or Vector?
This is a good option when you desire sequential or even random access. Also, if data size is unknown initially, and/or is going to grow dynamically, it would be appropriate to use a List or Vector. For instance, to store the results of a JDBC ResultSet, you can use the java.util.LinkedList. Whereas, if you are looking for a resizable array, use the java.util.ArrayList class.
When to use Arrays?
Never underestimate arrays. Most of the time, when we have to use a list of objects, we tend to think about using vectors or lists. However, if the size of collection is already known and is not going to change, an array can be considered as the potential data structure. It's faster to access elements of an array than a vector or a list. That's obvious, because all you need is an index. There's no overhead of an additional get method call.
Combinations
Sometimes, it may be best to use a combination of the above approaches. For example, you could use a list of hashtables to suit a particular need.
Set Classes
And from JDK 1.2 onwards, you also have set classes like java.util.TreeSet, which is useful for sorted sets that do not have duplicates. One of the best things about these classes is that they all abide by certain interfaces, so you don't really have to worry about the specifics. For example, take a look at the following code.
// ...
List list = new ArrayList();
list.add("some element"); // coding to the List interface; the implementation can be swapped later
Use a list for an ordered collection of just values. For example, you might have a list of files to process.
Use a map for a (usually unordered) mapping from key to value. For example, you might have a map from a user ID to the details of that user, so you can efficiently find the details given just the ID. (You could implement the Map interface by just storing a list of keys and a list of values, but generally there'll be a more efficient implementation - HashMap uses a hash table internally to get amortised O(1) key lookup, for example.)
A Map vs a List.
In a Map, you have key/value pairs. To access a value you need to know the key. There is a relationship between the key and the value that persists and is not arbitrary; they are related somehow. For example: a person's DNA is unique (the key) and maps to the person's name (the value), or a person's SSN (the key) maps to the person's name (the value); there is a strong relationship.
In a List, all you have are values (the person's name), and to access one you need to know its position in the list (its index). But there is no permanent relationship between a value and its position; the index is arbitrary.
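A tiny sketch of that difference, using the SSN/name idea above (the names and numbers are made up for illustration):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapVsListDemo {
    public static void main(String[] args) {
        // Map: the key (SSN) is meaningfully related to the value (name).
        Map<String, String> nameBySsn = new HashMap<String, String>();
        nameBySsn.put("123-45-6789", "Alice");
        System.out.println(nameBySsn.get("123-45-6789")); // look up by key

        // List: you retrieve by position, and the position carries no meaning of its own.
        List<String> names = new ArrayList<String>();
        names.add("Alice");
        System.out.println(names.get(0)); // look up by arbitrary index
    }
}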
