What is the fastest method to find duplicates from a collection

This is what I have tried, but I get the feeling that it is either not right or not the best-performing approach. Is there a better way to search for and fetch the duplicate values from a Map, or for that matter from any collection? And is there a better way to traverse a collection?
public class SearchDuplicates{
public static void main(String[] args) {
Map<Integer, String> directory=new HashMap<Integer, String>();
Map<Integer, String> repeatedEntries=new HashMap<Integer, String>();
// adding data
directory.put(1,"john");
directory.put(2,"michael");
directory.put(3,"mike");
directory.put(4,"anna");
directory.put(5,"julie");
directory.put(6,"simon");
directory.put(7,"tim");
directory.put(8,"ashley");
directory.put(9,"john");
directory.put(10,"michael");
directory.put(11,"mike");
directory.put(12,"anna");
directory.put(13,"julie");
directory.put(14,"simon");
directory.put(15,"tim");
directory.put(16,"ashley");
for(int i=1;i<=directory.size();i++) {
String result=directory.get(i);
for(int j=1;j<=directory.size();j++) {
if(j<i && result.equals(directory.get(j))) {
repeatedEntries.put(j, result);
}
}
System.out.println(result);
}
for(Entry<Integer, String> entry : repeatedEntries.entrySet()) {
System.out.println("repeated "+entry.getValue());
}
}
}
Any help would be appreciated. Thanks in advance

You can use a Set to determine whether entries are duplicate. Also, repeatedEntries might as well be a Set, since the keys are meaningless:
Map<Integer, String> directory=new HashMap<Integer, String>();
Set<String> repeatedEntries=new HashSet<String>();
Set<String> seen = new HashSet<String>();
// ... initialize directory, then:
for(int j=1;j<=directory.size();j++){
String val = directory.get(j);
if (!seen.add(val)) {
// if add failed, then val was already seen
repeatedEntries.add(val);
}
}
At the cost of extra memory, this does the job in linear time (instead of the quadratic time of your current algorithm).
EDIT: Here's a version of the loop that doesn't rely on the keys being consecutive integers starting at 1:
for (String val : directory.values()) {
if (!seen.add(val)) {
// if add failed, then val was already seen
repeatedEntries.add(val);
}
}
That will detect duplicate values for any Map, regardless of the keys.

You can use this to count how many times each word occurs:
Map<String, Integer> repeatedEntries = new HashMap<String, Integer>();
for (String w : directory.values()) {
Integer n = repeatedEntries.get(w);
n = (n == null) ? 1 : n + 1;
repeatedEntries.put(w, n);
}
And this to print the counts:
for (Entry<String, Integer> e : repeatedEntries.entrySet()) {
System.out.println(e);
}
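If Java 8 or later is available, the same counting can be sketched more compactly with Map.merge. This is only an illustration: the class name WordCountSketch and the method countOccurrences are made up for the example and are not part of the answer above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named only for this example.
class WordCountSketch {
    // Counts how often each value occurs. merge() stores 1 for a new key
    // and otherwise adds 1 to the count that is already there.
    static Map<String, Integer> countOccurrences(Iterable<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("john", "michael", "john", "anna");
        Map<String, Integer> counts = countOccurrences(names);
        // Any value counted more than once is a duplicate.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                System.out.println("repeated " + e.getKey());
            }
        }
    }
}
```

Here only "john" is reported, since it is the only name that occurs twice.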

List and Vector have a contains(Object o) method, which returns a boolean indicating whether the object exists in the collection.
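One thing to keep in mind is that contains on a List or Vector is a linear scan, so calling it once per element gives quadratic behaviour, while contains on a HashSet takes constant time on average. A rough sketch of using contains for duplicate detection (the class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, named only for this example.
class ContainsSketch {
    // Returns each duplicated element once, in the order the repeats are found.
    static List<String> duplicatesOf(List<String> items) {
        Set<String> seen = new HashSet<>();          // fast membership checks
        List<String> duplicates = new ArrayList<>();
        for (String item : items) {
            if (seen.contains(item)) {
                // List.contains scans linearly; acceptable while this list stays short.
                if (!duplicates.contains(item)) {
                    duplicates.add(item);
                }
            } else {
                seen.add(item);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(duplicatesOf(Arrays.asList("a", "b", "a", "c", "b", "a")));
    }
}
```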

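You can use Collections.frequency to count how many times an element occurs in any collection:
Collections.frequency(list, "a")
Here is a more general example that prints the count of every distinct element (note that each call to Collections.frequency scans the whole list):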
Set<String> uniqueSet = new HashSet<String>(list);
for (String temp : uniqueSet) {
System.out.println(temp + ": " + Collections.frequency(list, temp));
}

Related

Iterating a Map<TypeA,Set<TypeB>> and convert it to a Map<TypeB,Set<TypeA>> in Java

I have a Map<String,Set<String>> followingMap, where the keys are usernames and the values are Sets of the usernames that the key user follows.
I have to create a followersMap in which the followed users from the value Sets become the keys, and each value is the Set of users who follow that key.
Not sure if this is clear enough, so as an example, an element in the followingMap would be: key="john", value=Set["robert","andrew","amanda"].
In the followersMap it would be:
key="robert", value=Set["john"]
key="andrew", value=Set["john"]
key="amanda", value=Set["john"]
If a second element in followingMap is key="alex", value=Set["amanda"], that would add "alex" to the value Set of the "amanda" key.
My code should do the trick; however, when testing, I'm getting keys whose value Sets are filled with users that don't belong there.
Take a look:
Map<String,Set<String>> followerGraph = new HashMap<String,Set<String>>();
for (Map.Entry<String, Set<String>> me : followsGraph.entrySet()) {
String key = me.getKey();
Set<String> tmp = new LinkedHashSet<>();
Set<String> valueSet = me.getValue();
for (String s : valueSet) {
if (followerGraph.containsKey(s)){
followerGraph.get(s).add(key);
} else {
tmp.add(key);
followerGraph.put(s, tmp);
}
}
}
So this is the print of the followsGraph:
{aliana=[#jake, #john, #erick], alyssa=[#john, #erick],
bbitdiddle=[#rock-smith, #john, #erick], casus=[#daniel, #jake, #john, #erick],
david=[#dude, #john]}
And this is the print of the followerGraph:
{#daniel=[casus], #rock-smith=[bbitdiddle], #jake=[aliana, alyssa, bbitdiddle, casus, david], #dude=[david], #john=[aliana, alyssa, bbitdiddle, casus, david], #erick=[aliana, alyssa, bbitdiddle, casus, david]}
As you can see, #erick shouldn't have david as a follower. Am I missing something?
Sorry if my code looks like a mess. I have just 6 months in Java, 4 hours learning how to iterate a map (tried the Java 8 streams but not sure how to add the if-else in there), and it's 6 am and my wife might kill me for staying up all night :S

You can do something like this:
Map<String, Set<String>> followerMap = new HashMap<>();
followingMap.forEach((name,followingSet)-> followingSet.forEach(
follower-> followerMap.computeIfAbsent(follower, f->new HashSet<>())
.add(name)));
followingMap.forEach processes every entry in followingMap. The Set of each entry is then processed with followingSet.forEach. The elements of this set are the followed users, which become the keys of the new map (the lambda parameter is called follower, but it actually holds the followed user). computeIfAbsent puts a new entry with an empty Set into the map if the key doesn't exist yet, and returns the Set stored for that key. After that, name (the key of the followingMap entry) is added to that Set.
And this is the same code using for loops instead of forEach, which is probably more readable:
Map<String, Set<String>> followerMap = new HashMap<>();
for (Entry<String, Set<String>> followingEntry : followingMap.entrySet()) {
for (String follower : followingEntry.getValue()) {
followerMap.computeIfAbsent(follower, s->new HashSet<>()).add(followingEntry.getKey());
}
}
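To see the computeIfAbsent approach end to end, here is a small runnable sketch that uses the john/alex example from the question. The class name and the invert method are invented for the example, and TreeMap/TreeSet are used only so that the printed order is predictable; a HashMap and HashSet work the same way otherwise.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, named only for this example.
class InvertFollowsSketch {
    // Turns "user -> users they follow" into "user -> users who follow them".
    static Map<String, Set<String>> invert(Map<String, Set<String>> followingMap) {
        Map<String, Set<String>> followersMap = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : followingMap.entrySet()) {
            for (String followed : entry.getValue()) {
                // A fresh Set is created per missing key, so no Set is ever shared.
                followersMap.computeIfAbsent(followed, k -> new TreeSet<>())
                            .add(entry.getKey());
            }
        }
        return followersMap;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> followingMap = new HashMap<>();
        followingMap.put("john", new LinkedHashSet<>(Arrays.asList("robert", "andrew", "amanda")));
        followingMap.put("alex", new LinkedHashSet<>(Arrays.asList("amanda")));
        // prints {amanda=[alex, john], andrew=[john], robert=[john]}
        System.out.println(invert(followingMap));
    }
}
```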

Try this.
for (Map.Entry<String, Set<String>> me : followsGraph.entrySet()) {
String key = me.getKey();
// Set<String> tmp = new LinkedHashSet<>(); // MOVE THIS TO ...
Set<String> valueSet = me.getValue();
for (String s : valueSet) {
if (followerGraph.containsKey(s)) {
followerGraph.get(s).add(key);
} else {
Set<String> tmp = new LinkedHashSet<>(); // HERE
tmp.add(key);
followerGraph.put(s, tmp);
}
}
}

Try something like this:
Map<String, Set<String>> newFollowerGraph = new HashMap<>();
for (Map.Entry<String, Set<String>> me : followsGraph.entrySet()) {
String key = me.getKey();
Set<String> valueSet = me.getValue();
for (String s : valueSet) {
if (newFollowerGraph.containsKey(s)){
newFollowerGraph.get(s).add(key);
} else {
Set<String> tmp = new LinkedHashSet<>();
tmp.add(key);
newFollowerGraph.put(s, tmp);
}
}
}
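The problem is that your tmp Set is created once per outer iteration, so every new key added during that iteration ends up sharing the same Set. Creating a new Set for each new key fixes it.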

How to calculate the sum of values of different hashmaps with the same key?

So I have this hashmap named "hm" which produces the following output (NOTE: this is just a selection):
{1=35, 2=52, 3=61, 4=68, 5=68, 6=70, 7=70, 8=70, 9=70, 10=72, 11=72}
{1=35, 2=52, 3=61, 4=68, 5=70, 6=70, 7=70, 8=68, 9=72, 10=72, 11=72}
{1=35, 2=52, 3=61, 4=68, 5=68, 6=70, 7=70, 8=70, 9=72, 10=72, 11=72}
This output was created with the following code(NOTE : the rest of the class code is not shown here) :
private int scores;
HashMap<Integer,Integer> hm = new HashMap<>();
for (int i = 0; i < fileLines.length(); i++) {
char character = fileLines.charAt(i);
this.scores = character;
int position = i +1;
hm.put(position,this.scores);
}
System.out.println(hm);
What I am trying to do is put all these hashmaps together into one hashmap in which the value for each key is the sum of the values for that key. I am familiar with Python's defaultdict, but could not find an equivalent working example. I have searched for an answer and found the questions below, but they do not solve my problem.
How to calculate a value for each key of a HashMap?
what java collection that provides multiple values for the same key
is there a Java equivalent of Python's defaultdict?
The desired output would be:
{1=105, 2=156, 3=183, 4=204, 5=206 ..... and so on}
Eventually the average per position (key) has to be calculated, but that is a problem I think I can fix on my own once I know how to do the above.
EDIT: The real output is much, much bigger! Think of 100+ hashmaps with more than 100 keys each.

Try something like this:
public Map<Integer, Integer> combine(List<Map<Integer, Integer>> maps) {
Map<Integer, Integer> result = new HashMap<Integer, Integer>();
for (Map<Integer, Integer> map : maps) {
for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
int newValue = entry.getValue();
Integer existingValue = result.get(entry.getKey());
if (existingValue != null) {
newValue = newValue + existingValue;
}
result.put(entry.getKey(), newValue);
}
}
return result;
}
Basically:
Create a new map for the result
Iterate over each map
Take each entry and, if its key is already present in the result, add its value to the existing one; if not, put it in the map
return the result
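The steps above can also be written with Map.merge (Java 8+), which comes close to what Python's defaultdict gives you. The following is only a sketch: the class and method names are made up, the averagePerKey helper assumes that every key occurs in every map, and a TreeMap is used just to keep the printed keys in order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, named only for this example.
class SumMapsSketch {
    // Adds up the values of all maps per key.
    static Map<Integer, Integer> sumPerKey(List<Map<Integer, Integer>> maps) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (Map<Integer, Integer> map : maps) {
            for (Map.Entry<Integer, Integer> e : map.entrySet()) {
                // Stores the value for a new key, otherwise adds it to the running total.
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    // Average per key, ASSUMING every key is present in every map.
    static Map<Integer, Double> averagePerKey(List<Map<Integer, Integer>> maps) {
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : sumPerKey(maps).entrySet()) {
            averages.put(e.getKey(), e.getValue() / (double) maps.size());
        }
        return averages;
    }

    public static void main(String[] args) {
        // First three positions of the three maps shown in the question.
        List<Map<Integer, Integer>> maps = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Map<Integer, Integer> hm = new HashMap<>();
            hm.put(1, 35);
            hm.put(2, 52);
            hm.put(3, 61);
            maps.add(hm);
        }
        System.out.println(sumPerKey(maps));     // prints {1=105, 2=156, 3=183}
        System.out.println(averagePerKey(maps)); // prints {1=35.0, 2=52.0, 3=61.0}
    }
}
```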

newHashMap.put(key1,map1.get(key1)+map2.get(key1)+map3.get(key1));

HashMap, return key and value as a list

I have a homework assignment to do. I have finished the program, but there is a problem with the values.
The main code is (I cannot change it, as it is part of the homework):
List<String> result = cw.getResult();
for (String wordRes : result) {
System.out.println(wordRes);
}
It has to print:
abc 2
def 2
ghi 1
I have no idea how to handle that.
Right now it only prints:
abc
def
ghi
I don't know how to change the getResult method so that it also returns the values of the hashmap, without changing the main code above.
public List<String> getResult() {
List<String> keyList = new ArrayList<String>(list.keySet());
return keyList;
}
The hashmap is: {abc=2, def=2, ghi=1}
And list is declared as: Map<String, Integer> list = new HashMap<String, Integer>();
Please help me if you know a solution.

I think that now that you have learned about keySet and values, your next task is to learn about entrySet. That's a collection of Map.Entry<K,V> items, which are in turn composed of the key and the value.
That's precisely what you need to complete your task: simply iterate over the entrySet of your Map while adding a concatenation of the key and the value to your result list:
result.add(entry.getKey() + " " + entry.getValue());
Note that if you use a regular HashMap, the items in the result would not be arranged in any particular order.
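Put together, getResult could look roughly like the sketch below. The class name WordCounter and its add method are assumptions made so that the example runs on its own (the question does not show the class), and a TreeMap is swapped in for the HashMap so the lines come out sorted by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class; the real homework class is not shown in the question.
class WordCounter {
    // Named "list" as in the question, although it is a Map.
    private final Map<String, Integer> list = new TreeMap<>();

    // Assumed helper that records one occurrence of a word.
    void add(String word) {
        list.merge(word, 1, Integer::sum);
    }

    public List<String> getResult() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : list.entrySet()) {
            result.add(entry.getKey() + " " + entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        WordCounter cw = new WordCounter();
        for (String w : new String[] {"abc", "def", "abc", "ghi", "def"}) {
            cw.add(w);
        }
        // Same loop as the unchangeable main code in the question.
        for (String wordRes : cw.getResult()) {
            System.out.println(wordRes);
        }
    }
}
```

With the input above this prints "abc 2", "def 2" and "ghi 1" on separate lines.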

You need to change this line:
List<String> keyList = new ArrayList<String>(list.keySet());
to:
// first create the new List
List<String> keyList = new ArrayList<String>();
// iterate through the map and insert the key + " " + value as text
for (String item : list.keySet()) {
keyList.add(item + " " + list.get(item));
}
return keyList;
I haven't written java in a while so compiler errors might appear, but the idea should work

Well, the simplest way is to make an ArrayList and add to it as @dasblinkenlight said...
// lista is assumed to be a List<String> field of the class
Iterator<Map.Entry<String, Integer>> it = list.entrySet().iterator();
while (it.hasNext()) {
Map.Entry<String, Integer> entry = it.next();
lista.add(entry.getKey() + " " + entry.getValue());
}
} // end of the enclosing method (not shown)
public List<String> getResult() {
return lista;
}

If you want to iterate over map entries in order of keys, use an ordered map:
Map<String, Integer> map = new TreeMap<String, Integer>();
Then add your entries, and to print:
for (Map.Entry<String, Integer> entry : map.entrySet()) {
System.out.println(entry.getKey() + " " + entry.getValue());
}

Using Java, how can I compare every entry in HashMap to every other entry in the same HashMap without duplicating comparisons?

I am currently using 2 for loops to compare all entries but I am getting duplicate comparisons. Because HashMaps aren't ordered, I can't figure out how to eliminate comparisons that have already been made. For example, I have something like:
for(Entry<String, String> e1: map.entrySet())
{
for(Entry<String, String> e2: map.entrySet())
{
if (e1.getKey() != e2.getKey())
{
//compare e1.getValue() to e2.getValue()
}
}
}
The problem with this is that the first entry will be compared to the second entry and then the third entry and so on. But then the second entry will again be compared to the first entry and so on. And then the third entry will be compared to the first entry, then the second entry, then the 4th entry, etc. Is there a better way to iterate through HashMaps to avoid doing duplicate comparisons?
Additional information:
To be more specific and hopefully answer your questions, the HashMap I have is storing file names (the keys) and file contents (the values) - just text files. The HashMap has been populated by traversing a directory that contains the files I will want to compare. Then what I am doing is running pairs of files through some algorithms to determine the similarity between each pair of files. I do not need to compare file 1 to file 2, and then file 2 to file 1 again, as I only need the 2 files to be compared once. But I do need every file to be compared to every other file once. I am brand new to working with HashMaps. agim’s answer below might just work for my purposes. But I will also try to wrap my brain around both Evgeniy Dorofeev and Peter Lawrey's solutions below. I hope this helps to explain things better.

If you are not careful, the cost of eliminating duplicates could be higher than the cost of the redundant comparisons, for the keys at least.
You can order the keys using System.identityHashCode(x):
for(Map.Entry<Key, Value> entry1: map.entrySet()) {
Key key1 = entry1.getKey();
int hash1 = System.identityHashCode(key1);
Value value1 = entry1.getValue();
for(Map.Entry<Key, Value> entry2: map.entrySet()) {
Key key2 = entry2.getKey();
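// skip the entry itself and pairs already handled in the other order;
// identity hash codes are not guaranteed to be unique, so two keys that collide would be skipped
if (hash1 >= System.identityHashCode(key2)) continue;
Value value2 = entry2.getValue();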
// compare value1 and value2;
}
}

How about this solution:
String[] values = map.values().toArray(new String[map.size()]);
for (int i = 0; i < values.length; i++) {
for (int j = i+1; j<values.length; j++) {
if (values[i].equals(values[j])) {
// ...
}
}
}
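If the keys are needed as well (file names, in the question), the same index trick can be applied to a list of entries instead of an array of values. A minimal sketch, with invented class and method names, that only records which pairs would be compared:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named only for this example.
class PairwiseSketch {
    // Returns "keyA|keyB" for every unordered pair of entries, each pair exactly once.
    static List<String> uniquePairs(Map<String, String> map) {
        // Copying the entries into a List gives them stable indices.
        List<Map.Entry<String, String>> entries = new ArrayList<>(map.entrySet());
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            for (int j = i + 1; j < entries.size(); j++) {
                // A real similarity check on the two values would go here.
                pairs.add(entries.get(i).getKey() + "|" + entries.get(j).getKey());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "first text");
        files.put("b.txt", "second text");
        files.put("c.txt", "third text");
        System.out.println(uniquePairs(files)); // prints [a.txt|b.txt, a.txt|c.txt, b.txt|c.txt]
    }
}
```

For n entries this makes n(n-1)/2 comparisons, and it works for a plain HashMap too, since the order of the copied list is fixed for the duration of the loops.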

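Try this (it removes entries as it goes, so it works on a copy of the map):
Map<String, String> copy = new HashMap<>(map);
Iterator<Entry<String, String>> i = copy.entrySet().iterator();
while (i.hasNext()) {
Entry<String, String> next = i.next();
String value = next.getValue(); // read the value before removing the entry
i.remove();
for (Entry<String, String> e : copy.entrySet()) {
// compare value with e.getValue() here
}
}
Note that there is no point in comparing the keys of a HashMap with each other, since they are never equal. That is, only the values need to be iterated over and compared.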

If I understand correctly, you just want to know if there are any duplicates in the map's values? If so:
Set<String> values = new HashSet<String>(map.values());
boolean hasDuplicates = values.size() != map.size();
This could be made more efficient if you kick out once you find the first duplicate:
Set<String> values = new HashSet<String>();
for (String value : map.values()) {
if (!values.add(value)) {
return true;
}
}
return false;

public static boolean compareStringHashMaps(Map<String, String> expectedMap, Map<String, String> actualMap) throws Exception
{
logger.info("## CommonFunctions | compareStringHashMaps() ## ");
Iterator iteratorExpectedMap = expectedMap.entrySet().iterator();
Iterator iteratorActualMap = actualMap.entrySet().iterator();
boolean flag = true;
while (iteratorExpectedMap.hasNext() && iteratorActualMap.hasNext()){
Map.Entry expectedMapEntry = (Map.Entry) iteratorExpectedMap.next();
Map.Entry actualMapEntry = (Map.Entry) iteratorActualMap.next();
if(!expectedMapEntry.getKey().toString().trim().equals(actualMapEntry.getKey().toString().trim()))
{
flag = false;
break;
}
else if (!expectedMapEntry.getValue().toString().trim().equals(actualMapEntry.getValue().toString().trim()))
{
flag = false;
break;
}
}
return flag;
}

Assuming the values of the HashMap are Integers, this finds the maximum value in the HashMap:
int maxNum = 0;
for (Object a: hashMap.keySet()) {
if ((int)hashMap.get(a) > maxNum) {
maxNum = (int)hashMap.get(a);
}
}

You could try using a 2D array of results. If the result is already populated, then don't perform the comparison again. This also has the benefit of storing the results for later use.
So for an int result you would be looking at something like this:
Integer[][] results = new Integer[map.entrySet().size()][map.entrySet().size()];
This initialises the array to nulls and allows you to check for existing results before comparing. One important thing to note here is that each comparison result should be stored in the array twice, with the exception of comparisons of an entry with itself; e.g. the comparison between index 1 and index 2 should be stored in both results[1][2] and results[2][1].
Hope this helps.
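A rough sketch of that idea follows. Everything named here is made up for illustration: similarity is just a stand-in measure (the number of positions at which two strings agree), and the comparisons counter exists only to show that each pair is compared once.

```java
// Hypothetical helper class, named only for this example.
class ResultMatrixSketch {
    static int comparisons = 0; // counts how many real comparisons were made

    // Stand-in similarity measure: number of positions where both strings agree.
    static int similarity(String a, String b) {
        comparisons++;
        int same = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    static Integer[][] compareAll(String[] contents) {
        int n = contents.length;
        Integer[][] results = new Integer[n][n]; // every cell starts out null
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j || results[i][j] != null) {
                    continue; // comparison with itself, or already computed
                }
                int r = similarity(contents[i], contents[j]);
                results[i][j] = r;
                results[j][i] = r; // store the mirrored cell as well
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Integer[][] results = compareAll(new String[] {"abc", "abd", "xyz"});
        System.out.println(results[0][1] + " " + results[1][0]); // prints 2 2
        System.out.println("comparisons: " + comparisons);       // prints comparisons: 3
    }
}
```

The array needs n × n cells, so for a large number of files the memory cost may matter more than the saved comparisons.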

Identifying duplicate elements in a list containing 300k+ Strings

I have a list containing 305899 Strings (which is the username for a website). After I remove all the duplicates, the number goes down to 172123 Strings.
I want to find how many times a particular String (the username) is repeated in that ArrayList. I wrote some simple bubble-sort-style logic, but it was too slow.
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
int duplicate = 0;
int size = userNameList.size();
for (int i = 0; i < size - 1; i++) {
duplicate = 0;
for (int j = i + 1; j < size; j++) {
if (userNameList.get(i).equals(userNameList.get(j))) {
duplicate++;
userNameList.remove(j);
j--;
size--;
}
}
numberOfPosts.put(userNameList.get(i), duplicate);
}
return numberOfPosts;
}
Then I changed it to this:
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
Set<String> unique = new HashSet<String>(userNameList);
for (String key : unique) {
numberOfPosts.put(key, Collections.frequency(userNameList, key));
}
return numberOfPosts;
}
This was really slow as well. By slow, I mean it would take 30+ minutes to go through the list.
Is there a more efficient way to handle this problem, that is, to reduce the time it takes to find and count duplicate elements?

Your findNumberOfPosts method is on the right track, but your implementation is doing loads of unnecessary work.
Try this:
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String userName : userNameList) {
Integer count = numberOfPosts.get(userName);
numberOfPosts.put(userName, count == null ? 1 : ++count);
}
return numberOfPosts;
}
This should execute in a couple of seconds on most machines.

See if this variation of your second method works faster:
private static Map<String, Integer> findNumberOfPosts(
List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String name : userNameList) {
Integer count = numberOfPosts.get(name);
numberOfPosts.put(name, count == null ? 1 : (1 + count));
}
return numberOfPosts;
}
It has some boxing/unboxing overhead, but should operate a lot faster than what you were doing, which required iterating over the entire list of names for each unique name.
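On Java 8 or later, the same single-pass counting can also be expressed with streams. This is just an alternative sketch, not what either answer above used; note that Collectors.counting() yields Long values, so the return type differs from the original method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, named only for this example.
class GroupingCountSketch {
    // Groups equal names together and counts each group in one pass over the list.
    static Map<String, Long> findNumberOfPosts(List<String> userNameList) {
        return userNameList.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("ann", "bob", "ann", "cid", "ann");
        System.out.println(findNumberOfPosts(names).get("ann")); // prints 3
    }
}
```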

You could build a Trie out of the usernames. It would then be trivial to find the number of distinct usernames. The code for a Trie is a little complicated, so it is best to look up resources on how it can be implemented.
On another note, from a practical standpoint, you should not have this duplicate list in the first place. If the system providing the usernames were properly designed, duplicates wouldn't exist.
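For reference, a very small trie that keeps a counter at the node where each name ends might look like the sketch below. It is a simplified illustration with invented names, not a tuned implementation, and for this particular task a plain HashMap of counts is likely simpler.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie, written only for this example.
class UsernameTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // how many inserted names end exactly at this node
    }

    private final Node root = new Node();
    private int distinct; // number of different names inserted so far

    void insert(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            // Walk down one character at a time, creating nodes as needed.
            node = node.children.computeIfAbsent(name.charAt(i), c -> new Node());
        }
        if (node.count == 0) {
            distinct++;
        }
        node.count++;
    }

    int countOf(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.get(name.charAt(i));
            if (node == null) {
                return 0; // no inserted name has this prefix
            }
        }
        return node.count;
    }

    int distinctCount() {
        return distinct;
    }

    public static void main(String[] args) {
        UsernameTrie trie = new UsernameTrie();
        for (String name : new String[] {"ann", "anna", "ann", "bob"}) {
            trie.insert(name);
        }
        System.out.println(trie.countOf("ann"));    // prints 2
        System.out.println(trie.distinctCount());   // prints 3
    }
}
```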

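This variant skips names that have already been counted, but it still calls Collections.frequency (a full scan of the list) once per distinct name, so it should not be expected to beat Bohemian's single-pass version: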
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String userName : userNameList) {
if (!numberOfPosts.containsKey(userName)) {
numberOfPosts.put(userName, Collections.frequency(userNameList, userName));
}
}
return numberOfPosts;
}

Another option is to add all the elements to an array and then sort that array.
Then you can just iterate over the array once, because the duplicates will be next to each other.
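A minimal sketch of the sort-then-scan idea, with made-up class and method names. Sorting costs O(n log n), so the single-pass HashMap counting is usually at least as fast, but this version needs no hashing and produces the counts in sorted order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, named only for this example.
class SortAndCountSketch {
    static Map<String, Integer> countBySorting(List<String> names) {
        // Sort a copy so that the caller's list is left untouched.
        String[] sorted = names.toArray(new String[0]);
        Arrays.sort(sorted);
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps sorted order
        int i = 0;
        while (i < sorted.length) {
            int j = i;
            // Advance j to the end of the run of names equal to sorted[i].
            while (j < sorted.length && sorted[j].equals(sorted[i])) {
                j++;
            }
            counts.put(sorted[i], j - i);
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=3, c=1}
        System.out.println(countBySorting(Arrays.asList("b", "a", "b", "c", "a", "b")));
    }
}
```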

You should try improving the first implementation: for each entry you're iterating through the rest of the list. How about something like:
Map<String, Integer> map = new HashMap<String, Integer>();
for (String username : usernames) {
if (!map.containsKey(username)) {
map.put(username, 0); // first occurrence: no duplicates yet
} else {
map.put(username, map.get(username) + 1); // one more duplicate
}
}
return map;

Use a data structure that was designed to support this natively. Store the user names in a Multiset (Guava provides one, for example) and let it maintain the frequency/count for you automatically.
Read this tutorial to understand how a multiset works.

The following is a convenient way to remove duplicates and count the number of duplicate elements in a List, with no extra logic needed.
List<String> userNameList = new ArrayList<String>();
// add elements to userNameList, including duplicates
userNameList.add("a");
userNameList.add("a");
userNameList.add("a");
userNameList.add("a");
userNameList.add("b");
userNameList.add("b");
userNameList.add("b");
userNameList.add("b");
userNameList.add("c");
userNameList.add("c");
userNameList.add("c");
userNameList.add("c");
int originalSize=userNameList.size();
Set<String> hs = new HashSet<String>(); // A Set handles the duplicates automatically.
hs.addAll(userNameList);
userNameList.clear();
userNameList.addAll(hs);
Collections.sort(userNameList); //Sort the List, if needed.
//Displays elements after removing duplicate entries.
for(Object element:userNameList)
{
System.out.println(element);
}
int duplicate=originalSize-userNameList.size();
System.out.println("Duplicate entries in the List:->"+duplicate); //Number of duplicate entries.
/*Map<String, Integer> numberOfPosts = new HashMap<String, Integer>(); //Store duplicate entries in your Map using some key.
numberOfPosts.put(userNameList.get(i), duplicate);
return(numberOfPosts);*/
