I am learning Java now and I am learning about different kinds of collections; so far I have learned about LinkedList, ArrayList and Array[].
Now I've been introduced to Hash types of collections, HashSet and HashMap, and I didn't quite understand why they are useful, because the list of operations they support is quite limited; also, they are stored in a seemingly random order, and I need to override the equals and hashCode methods to make them work right with my own classes.
Now, what I don't understand is the benefit that outweighs the hassle of using these types instead of an ArrayList of a custom class.
I mean, what a Map does is connect 2 objects as 1, but wouldn't it just be better to create a class that contains these 2 objects as fields, and have getters and setters to use and modify them?
If the benefit is that these Hash objects can only contain 1 object with the same name, wouldn't it just be easier to make the ArrayList check that the element is not already there before adding it?
So far I have learned to choose between LinkedList, ArrayList and Array[] by the rule of: "if it's really simple, use Array[]; if it's a bit more complex, use ArrayList (for example, to hold a collection of a certain class); and if the list is dynamic, with a lot of objects inside that need to change order because items are removed or added in the middle, or you go back and forth within the list, then use LinkedList."
But I couldn't understand when to prefer HashMap or HashSet, and I would be really glad if you could explain it to me.
Let me help you out here...
Hashed collections are the most efficient for adding, searching and removing data, since they hash the key (in a HashMap) or the element (in a HashSet) to find the place where it belongs in a single step.
The concept of hashing is really simple. It is the process of representing an object as a number that can work as its ID.
For example, if you have a string in Java like String name = "Jeremy"; and you print its hash code with System.out.println(name.hashCode());, you will see a big number (-2079637766) that was computed from that string object's contents (its characters). That way, the number can be used as an ID for that object.
So the hashed collections, like the ones mentioned above, use this number as an array index to find the elements in no time. But obviously it is too big to use as an index into a possibly small array, so they need to reduce it so that it fits in the range of the array size. (HashMap and HashSet use arrays internally to store their elements.)
The operation that they use to reduce that number is called hashing, and it is something like this: Math.abs(-2079637766 % arrayLength);.
It's not like that exactly, it's a bit more complex, but this is to simplify.
Let's say that arrayLength = 16;
The % operator will reduce that big number to a number smaller than 16, so that it can fit in the array.
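Putting it together, a rough sketch of the idea (not the exact code the JDK uses) could look like this:
String name = "Jeremy";
int hash = name.hashCode();               // -2079637766, the big number from above
int arrayLength = 16;                     // size of the internal bucket array
int index = Math.abs(hash % arrayLength); // squeeze the big number into the range 0..15
System.out.println(index);                // the bucket where "Jeremy" would be stored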
That is why a hashed collection will not allow duplicates: if you try to add the same object or an equivalent one (like 2 strings with the same characters), it will produce the same hash code and will overwrite whatever value is at the resulting index.
In your question, you mentioned that if you are worried about duplicate items in an ArrayList, you can just check whether the item is there before inserting it, so that you don't need a HashSet. But that is not a good idea, because when you call list.contains(elem); on an ArrayList, it needs to go one by one, comparing the elements to see if it's there. If you have 1 million elements in the ArrayList and you check for an element that is not there, the ArrayList has to iterate over all 1 million elements, which is not good. With a HashSet, it only hashes the object, goes directly to where it is supposed to be in the array and checks there, doing it in roughly 1 step instead of 1 million. So you see how efficient a HashSet is compared to an ArrayList.
The same happens with a HashMap of size 1 million: it will only take a single step to check whether a key is there, not 1 million.
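A tiny illustration of the difference (the name "Jeremy" is just an example):
List<String> names = new ArrayList<>();
Set<String> nameSet = new HashSet<>();
// ... imagine both are filled with 1 million names ...
boolean inList = names.contains("Jeremy");   // compares element by element: up to 1,000,000 checks
boolean inSet  = nameSet.contains("Jeremy"); // hashes "Jeremy" and looks only in its bucket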
The same thing happens when you need to add, find or remove an element: the hashed collections do all of that in a single step (constant time, which doesn't depend on the size of the map), while that varies for other structures.
That's why it is really efficient and widely used.
Main Difference between an ArrayList and a LinkedList:
If you want to find the element at place 500 in an ArrayList of size 1000, you do list.get(500); and it will do that in a single step, because an ArrayList is implemented with an array, so with that 500 it goes directly to where the element is in the array.
But a LinkedList is not implemented with an array; it is implemented with objects (nodes) pointing to each other. So it needs to walk linearly, counting from 0 one by one until it gets to 500, which is not efficient compared to the single step of the ArrayList.
But when you need to add and remove elements in an ArrayList, sometimes the internal array needs to be recreated so that more elements fit in it, and existing elements have to be shifted, which increases the overhead.
That doesn't happen with a LinkedList, since no array has to be recreated; only the nodes have to be re-linked, which is done in a single step (once you are at the right position).
So an ArrayList is good when you won't be deleting or adding a lot of elements in the structure, but you are going to read from it a lot.
If you are going to add and remove a lot of elements, then a LinkedList is better, since it has less work to do for those operations.
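A small illustration of that trade-off (just a sketch):
List<Integer> arrayList = new ArrayList<>();
List<Integer> linkedList = new LinkedList<>();
for (int i = 0; i < 1000; i++) { arrayList.add(i); linkedList.add(i); }
arrayList.get(500);    // jumps straight to index 500 of the backing array
linkedList.get(500);   // walks node by node until it reaches position 500
arrayList.add(0, -1);  // has to shift all 1000 existing elements one slot to the right
linkedList.add(0, -1); // just links a new node in front of the old head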
Why do you need to implement the equals() and hashCode() methods for user-defined classes when you want to use those objects in HashMaps, and implement the Comparable interface when you want to use those objects with TreeMaps?
Based on what I mentioned earlier for HashMaps, it is possible that 2 different objects produce the same hash. If that happens, Java will not overwrite the previous one or remove it; it will keep them both at the same index (in the same bucket). That is why you need to implement hashCode(): so you make sure your objects don't have a really simple hash code that collides easily.
And the reason why it is recommended to override the equals() method is that if there is a collision (2 or more objects sharing the same hash in a HashMap), how do you tell them apart? By asking the equals() method of those objects whether they are the same. So if you ask the map whether it contains a certain key, and at that index it finds 3 elements, it asks equals() on each of those elements whether it is equal to the key that was passed, and if so, it returns that one. If you don't override the equals() method properly and specify which things you want to check for equality (like the properties name, age, etc.), then some unwanted replacements will happen inside the HashMap and you will not like it.
If you create your own class, say Person, with properties like name, age, lastName and email, you can use those properties in the equals() method: if 2 different objects have the same values in the properties you selected for equality, you return true to indicate that they are the same, or false otherwise. It's like the String class: if you do s1.equals(s2); with s1 = new String("John"); and s2 = new String("John");, even though they are different objects in the Java heap memory, the implementation of String.equals uses the characters to determine whether the objects are equal, and it returns true for this example.
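A minimal sketch of such a class (the chosen properties are just an example):
import java.util.Objects;

public class Person {
    private final String name;
    private final String lastName;
    private final int age;

    public Person(String name, String lastName, int age) {
        this.name = name;
        this.lastName = lastName;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        // two Person objects are "the same" if these properties match
        return age == other.age
                && Objects.equals(name, other.name)
                && Objects.equals(lastName, other.lastName);
    }

    @Override
    public int hashCode() {
        // built from the same properties used in equals(), as the contract requires
        return Objects.hash(name, lastName, age);
    }
}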
To use a TreeMap with user-defined classes as keys, you need to implement the Comparable interface: since the TreeMap will compare and sort the objects, you need to specify by which properties your objects will be sorted. Will your objects be sorted by age? By name? By ID? Or by any other property you like? Then, when you implement the Comparable interface and override the compareTo(UserDefinedClass o) method, you write your logic and return a positive number if the current object is greater than the o object passed, 0 if they are the same, and a negative number if the current object is smaller. That way, the TreeMap knows how to sort them, based on the number returned.
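For example, if you decide to sort Person objects by age, it could look like this (a sketch, continuing the Person example above):
public class Person implements Comparable<Person> {
    private String name;
    private int age;
    // constructor, equals() and hashCode() as in the sketch above

    @Override
    public int compareTo(Person other) {
        // sorted by age: negative if this one is younger, 0 if the same, positive if older
        return Integer.compare(this.age, other.age);
    }
}
// a TreeMap<Person, String> would then keep its keys ordered by age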
First, HashSet. With a HashSet, you can easily find out whether it contains a given element. Say you have the set of people in your class and you want to ask whether a particular guy is in it. You could make an ArrayList of strings, but then to answer the question you have to iterate through the whole list until you find him, which might be too slow for longer lists. If you use a HashSet instead, the operation is much faster: you calculate the hash of the searched string and go directly to that position, so you don't need to pass over so many elements to answer your question. Well, you could also build a workaround to make the ArrayList faster to access for this purpose, but with a HashSet this is already prepared for you.
And now HashMap. Imagine that you also want to store a score for each person. For that you can use a HashMap: you enter the name and you get the score in a short time, without needing to iterate through the whole data structure.
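Something like this (the names and scores are made up):
Set<String> classmates = new HashSet<>();
classmates.add("Alice");
classmates.add("Bob");
boolean isThere = classmates.contains("Bob");  // true, found by hashing "Bob", no scanning
Map<String, Integer> scores = new HashMap<>();
scores.put("Alice", 92);
scores.put("Bob", 78);
int bobScore = scores.get("Bob");              // 78, again located by hashing the key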
Does it make sense?
Concerning your question:
"But I couldn't understand when to prefer HashMap or HashSet, and I
would be really glad if you could explain it to me"
HashMap implements the Map interface, to be used for mapping a key (K) to a value (V) in constant time, where order doesn't matter, so you can put and retrieve that data efficiently if you know the key.
And HashSet implements the Set interface, but internally it uses a HashMap. Its role is to be used as a Set, meaning you're not supposed to retrieve an element; you just check whether it is in the set or not (mostly).
In a HashMap you can have identical values (under different keys), while in a Set you can't have identical elements (because that's a defining property of a Set).
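A small illustration (made-up data):
Map<String, Integer> ages = new HashMap<>();
ages.put("Alice", 30);
ages.put("Bob", 30);    // the same value twice is fine because the keys differ
ages.put("Alice", 31);  // the same key again: the old value 30 is replaced
Set<Integer> ageSet = new HashSet<>();
ageSet.add(30);
ageSet.add(30);         // ignored, 30 is already in the set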
Concerning this question:
"If the benefit is that these Hash objects can only contain 1 object with the same name, wouldn't it just be easier to make the ArrayList check that the element is not already there before adding it?"
When dealing with collections, you may base your choice of a particular one on the data representation, but also on the way you want to access and store the data: how do you access it? Do you need to sort it? Because each implementation may have a different time complexity (https://en.wikipedia.org/wiki/Time_complexity), this becomes important.
Quoting the docs:
For ArrayList:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).
For HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
So it's about the time complexity.
You may even choose a more unusual collection for certain problems :).
This has little to do with Java specifically, and the choice depends mostly on performance requirements, but there's a fundamental difference that must be highlighted. Conceptually, Lists are types of collections that keep the order of insertion and may have duplicates, while Sets are more like bags of items that have no specific order and no duplicates. Of course, different implementations may work around parts of that (like a TreeSet, which keeps its elements sorted).
First, let's check the difference between ArrayList and LinkedList. A linked list is a chain of nodes, where each node contains a value and a link to the next and previous nodes. This makes inserting an element into a linked list a matter of appending a node to the end of the list, which is a quick operation, since the memory does not have to be contiguous as long as each node keeps a reference to the next one. On the other hand, accessing a specific element requires traversing the list until you find it.
An ArrayList, as the name implies, wraps an array. Accessing an element in an array by its index is direct access, but inserting an element may imply resizing the array to make room for it, since the memory it occupies is contiguous, making writes a bit heavier in this case.
A HashMap works like a dictionary, where for each key there's a value. The behavior of an insertion mostly depends on how the hashCode and equals methods of the object used as the key are implemented. If the hashCode of two keys is the same, there's a hash collision, so equals is used to find out whether it's the same key or not. If equals says they are the same, it's the same key, so the value is replaced; if not, the new entry is added to the collection. Accessing and writing values mostly comes down to calculating the hash of the key followed by direct access to the value, making both operations really quick, O(1).
A Set is pretty much like a HashMap without the "values" part; thus, it follows the same rules regarding the implementation of the hashCode and equals methods for the added element.
It might be handy to study a bit about the Big-O notation and complexity of algorithms. If you are starting with Java, I'd strongly recommend the book Effective Java, by Joshua Bloch.
Hope it helps you dig further.
I am using a custom writable class as VALUEOUT in the map phase of my MR job, where the class has two fields, a org.apache.hadoop.io.Text and a org.apache.hadoop.io.MapWritable. In my reduce function I iterate through the values for each key and perform two operations: 1. filter, 2. aggregate. In the filter, I have some rules to check whether certain values in the MapWritable (with key as Text and value as IntWritable or DoubleWritable) satisfy certain conditions, and then I simply add them to an ArrayList. At the end of the filter operation, I have a filtered list of my custom writable objects. At the aggregate phase, when I access the objects, it turns out that the last object that was successfully filtered in has overwritten all other objects in the ArrayList. After going through some similar issues with lists on SO where the last object overwrites all the others, I confirmed that I do not have static fields, nor am I reusing the same custom writable by setting different values (which were quoted as possible reasons for such an issue). For each key in the reducer I have made sure that the CustomWritable, the Text key and the MapWritable are new objects.
In addition, I performed a simple test by eliminating the filter and aggregate operations in my reduce and just iterating through the values and adding them to an ArrayList using a for loop. In the loop, every time I added a CustomWritable to the list, I logged the values of all the contents of the list, both before and after adding the element. Both logs showed that the previous set of elements had been overwritten. I am confused about how this could even happen. As soon as the next element in the iterable of values was accessed by the loop for (CustomWritable result : values), the list content was modified. I am unable to figure out the reason for this behaviour. If anyone can shed some light on this, it would be really helpful. Thanks.
The"values" iterator in the reducer reuses the value as you iterate. It's a technique used for performance and smaller memory footprint. Behind the scenes, Hadoop deserializes the next record into the same Java object. If you need to "remember" an object, you'll need to clone it.
You can take advantage of the Writable interface and use the raw bytes to populate a new object.
IntWritable first = WritableUtils.clone(values.next(), context.getConfiguration());
IntWritable second = WritableUtils.clone(values.next(), context.getConfiguration());
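The same approach should work for your custom writable inside the reducer; something along these lines (CustomWritable is your class, so this is only a sketch):
@Override
protected void reduce(Text key, Iterable<CustomWritable> values, Context context)
        throws IOException, InterruptedException {
    List<CustomWritable> filtered = new ArrayList<>();
    for (CustomWritable value : values) {
        // copy before storing; otherwise every list entry ends up pointing to the same reused object
        CustomWritable copy = WritableUtils.clone(value, context.getConfiguration());
        filtered.add(copy);
    }
    // ... run the filter and aggregate steps over "filtered" ...
}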
I am new to programming in general and to Java in particular. I want to implement an LRU cache and would like to have O(1) complexity.
I have seen some implementations on the Internet using a manually implemented doubly linked list for the cache (two arrays, a Node class, previous, next, etc.) and a HashMap where the key is the item to be cached and the value is the timestamp.
I really don't see the reason to use timestamps: the inserted item goes to the head of the manually-implemented LinkedList, the evicted item is the cached item located at the tail, and in every insertion the previously cached items are shifted one position towards the tail.
The only problems that I see are the following:
(1) For the cache lookup (to find whether we have a cache hit or miss for the requested item), we have to "scan" the cache list, which implies a for loop of some type (conventional, for-each, etc.; I don't really care much at this point). Obviously, we don't want that. I believe this issue can be solved easily by using an array of boolean variables to indicate whether an item is in the cache or not (1: in, 0: out), let's call it lookupArray, as follows: say the items are distinguished by some numeric ID, i.e. an integer between 1 and N. Then this lookupArray of booleans will have size N+1 (because array indexing starts from zero) and will be initialized with all-zero values. When the item with numeric ID k, where 1<=k<=N, enters the cache, we set the boolean value at index k of lookupArray to 1. That way, cache lookup does not need any search in the cache list: to check whether the item with numeric ID k is in the cache or not, we simply check whether the value of lookupArray at index k is 1 or 0, respectively. (We already have the index, i.e. we know where to look, so there is no need for a for loop.)
(2) The second problem, though, is not easily solvable. Let's say that we have a cache hit for an item. Then, if this item is not located at the head (i.e. if it is not the most recently used item), we have to locate it in the cache list and then move it to the head. As far as I understand, this implies searching the cache list, i.e. a for loop. Then we can't achieve the O(1) objective.
Am I right about (2)? Is there any way to do this without using a HashMap and timestamps?
Since I am relatively new to programming, as I stated at the beginning of the post, I would really appreciate, if possible, code snippets demonstrating the implementation with a manually implemented doubly linked list.
Sorry for the long message, I hope it is not only detailed but also clear.
Thank you!
Consider using a queue. It allows you to remove an object and insert it at the beginning. It also has a size and can be used for caching.
http://docs.oracle.com/javase/7/docs/api/java/util/Queue.html
Or maybe you should not implement it yourself. There is an LRUMap available in the Apache Commons Collections library.
https://commons.apache.org/proper/commons-collections/javadocs/api-3.2.1/org/apache/commons/collections/map/LRUMap.html
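Roughly, its usage looks like this (just a sketch; the capacity of 100 and the key names are arbitrary):
import org.apache.commons.collections.map.LRUMap;
// LRUMap keeps a fixed maximum number of entries and evicts the least
// recently used one when a new entry is put in and the map is full
java.util.Map cache = new LRUMap(100);
cache.put("someRequest", "cached API response");
Object result = cache.get("someRequest");  // a get also marks the entry as most recently used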
Summary of this post: I have a set of ordered items whose order may change over time. I need to be able to iterate through this set from multiple threads, each of which may also want to update the order of the items.
For example, multiple threads need to access String keys in some arbitrary sorted order. The strings are not sorted according to their natural ordering, but by some values that may change (hence, a custom Comparator). My original implementation was to use a TreeSet and synchronize on it. If any of the keys needed to be reordered, a thread would remove the key from the set, update the comparison value, and reinsert the key. To implement this, the keys are plain Strings, but the comparator has access to the values. This is a weird arrangement where the order of keys may change over time, but since a changed key is always removed and reinserted when it changes, it seems to work. (I suppose it could also work if the Strings were wrapped inside another object.)
I recently became aware of the ConcurrentSkipListSet/ConcurrentSkipListMap implementations, which are basically thread-safe sorted sets (resp. maps). It seems like I can now iterate through the keys without having to lock the entire data structure. However, is there a way I can use them to atomically remove a key and replace it with another, like the operation I was doing above, so that other iterating threads don't miss the item, and without having to use synchronized blocks?
If anyone can suggest a better data structure for this type of operation, I'm all ears, too!
is there a way I can use them to atomically remove a key and replace it with another, like the operation I was doing above, so that other iterating threads don't miss the item, and without having to use synchronized blocks?
The short answer is no. If you need to remove and reinsert, there is no atomic way to do this with any collection that I know of.
That said, one possibility would be for you to insert the updated item before deleting the old one from the skip list. This would cause a temporary duplicate, but that may be easier to handle than a missing entry. You would insert it after you changed the object so that it sorts differently. This assumes that the changed object is non-equal to the old one as well. But if the other threads that are processing the list can't handle the duplicates, then I think you are SOL.
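As a sketch of that idea, assuming the keys are wrapped in a small object that carries its own comparison value (instead of plain Strings with an external comparator):
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

class RankedKey {
    final String key;
    final long rank;  // the value that determines the sort order
    RankedKey(String key, long rank) { this.key = key; this.rank = rank; }
}

ConcurrentSkipListSet<RankedKey> set = new ConcurrentSkipListSet<>(
        Comparator.comparingLong((RankedKey k) -> k.rank).thenComparing(k -> k.key));

// suppose the set currently contains the entry for "foo" with rank 10
RankedKey oldEntry = new RankedKey("foo", 10);
RankedKey newEntry = new RankedKey("foo", 42);  // the updated ordering value
set.add(newEntry);     // insert the updated entry first...
set.remove(oldEntry);  // ...then remove the old one, so iterators never miss "foo" entirely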
Okay, first I will preface this with "I am very, very new to Java" (i.e., a few days in), but I am a programmer by trade.
I have come across a situation where I want to load data. However, I would like to cache that data to prevent extraneous calls to the API (or whatever the data source may be). After thinking about it a bit, I have come up with a cache scheme which seems pretty reasonable to me. The idea is that the DataCache class has two collections: a hash table with key type String and value type CacheData. CacheData has 2 data members: the actual result of the API call in string form, and a ref (ListIterator?) to a node of a linked list. Which brings us to the 2nd collection: a linked list of keys. The idea is that when a request comes in for data, we see if it's in the hash. If not, we fetch from the API, add the resulting key to the front of the linked list, and store a CacheData object in the hash containing the result, along with a ref to the first node of the linked list (the one we just added). If the data IS found in the hash, we break the node out of the linked list, put it at the front, and return the data from CacheData. The benefit: every operation is guaranteed to execute in O(1), if I'm understanding correctly.
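To make the idea concrete, here is roughly the shape I have in mind (just a skeleton; all the names are made up):
import java.util.HashMap;

class DataCache {
    // node of a hand-rolled doubly linked list of keys; head = most recently used
    static final class Node {
        String key;
        Node prev, next;
        Node(String key) { this.key = key; }
    }

    static final class CacheData {
        final String result;  // raw result of the API call
        final Node node;      // where this key sits in the recency list
        CacheData(String result, Node node) { this.result = result; this.node = node; }
    }

    private final HashMap<String, CacheData> table = new HashMap<>();
    private Node head;  // most recently used
    private Node tail;  // least recently used, candidate for eviction
}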
Can I store the integer hash value of the 'request' in the linked list instead of the string (request) as a whole? If so, how can I access the result in the hashmap given that integer? (none of the methods seem to take an 'int' as param). Also...is my approach to this situation sound? Or is there perhaps something in Java that would make this easier?