Identifying duplicate elements in a list containing 300k+ Strings


I have a list containing 305899 Strings (the usernames for a website). After I remove all the duplicates, the number goes down to 172123 Strings.
I want to find how many times a particular String (the username) is repeated in that ArrayList. I wrote a simple bubble-sort-style nested loop, but it was too slow.
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
int duplicate = 0;
int size = userNameList.size();
for (int i = 0; i < size - 1; i++) {
duplicate = 0;
for (int j = i + 1; j < size; j++) {
if (userNameList.get(i).equals(userNameList.get(j))) {
duplicate++;
userNameList.remove(j);
j--;
size--;
}
}
numberOfPosts.put(userNameList.get(i), duplicate);
}
return numberOfPosts;
}
Then I changed it to this:
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
Set<String> unique = new HashSet<String>(userNameList);
for (String key : unique) {
numberOfPosts.put(key, Collections.frequency(userNameList, key));
}
return numberOfPosts;
}
This was really slow as well. By slow, I mean it would take 30+ minutes to go through the list.
Is there a more efficient way to handle this problem, i.e. to reduce the time it takes to find and count duplicate elements?

Your findNumberOfPosts method is on the right track, but your implementation is doing loads of unnecessary work.
Try this:
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String userName : userNameList) {
Integer count = numberOfPosts.get(userName);
numberOfPosts.put(userName, count == null ? 1 : count + 1);
}
return numberOfPosts;
}
This should execute in a couple of seconds on most machines.
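The same single-pass counting can also be written with Map.merge. This is a minimal sketch that assumes Java 8 or later; the class name PostCounter and the sample names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PostCounter {
    // Counts how many times each user name occurs, in a single pass over the list.
    static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
        Map<String, Integer> numberOfPosts = new HashMap<>();
        for (String userName : userNameList) {
            // merge() stores 1 for a new key, otherwise adds 1 to the existing count.
            numberOfPosts.merge(userName, 1, Integer::sum);
        }
        return numberOfPosts;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<String> names = Arrays.asList("ann", "bob", "ann", "cid", "ann", "bob");
        Map<String, Integer> counts = findNumberOfPosts(names);
        System.out.println("ann: " + counts.get("ann")); // ann: 3
        System.out.println("bob: " + counts.get("bob")); // bob: 2
        System.out.println("cid: " + counts.get("cid")); // cid: 1
    }
}
```

Each name costs one hash lookup, so the running time grows linearly with the size of the list rather than with the list size times the number of unique names.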

See if this variation of your second method works faster:
private static Map<String, Integer> findNumberOfPosts(
List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String name : userNameList) {
Integer count = numberOfPosts.get(name);
numberOfPosts.put(name, count == null ? 1 : (1 + count));
}
return numberOfPosts;
}
It has some boxing/unboxing overhead, but should operate a lot faster than what you were doing, which required iterating over the entire list of names for each unique name.

You could build a trie out of the usernames. It would then be trivial to find the number of distinct usernames. The code for a trie is a little complicated, so you had better look up resources on how to implement one.
On second thought, in a practical scenario you should not have this list of duplicates in the first place. If the system providing the usernames were properly designed, duplicates wouldn't exist.
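The trie idea mentioned above could look roughly like the following. This is only a sketch: the class UserNameTrie and its methods are hypothetical names, and child nodes are kept in a HashMap per node for brevity rather than in a tuned array.

```java
import java.util.HashMap;
import java.util.Map;

class UserNameTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count; // how many inserted names end exactly at this node
    }

    private final Node root = new Node();
    private int distinct; // number of different names inserted so far

    // Inserts one name, creating nodes along its path as needed.
    void add(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent(name.charAt(i), c -> new Node());
        }
        if (node.count == 0) {
            distinct++;
        }
        node.count++;
    }

    // Returns how many times the exact name was inserted (0 if never).
    int count(String name) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.get(name.charAt(i));
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    int distinctCount() {
        return distinct;
    }

    public static void main(String[] args) {
        UserNameTrie trie = new UserNameTrie();
        for (String name : new String[] {"ann", "bob", "ann", "anna"}) {
            trie.add(name);
        }
        System.out.println(trie.count("ann"));    // 2
        System.out.println(trie.count("an"));     // 0, a prefix only
        System.out.println(trie.distinctCount()); // 3
    }
}
```

For plain counting, a HashMap keyed by the whole name is usually simpler; a trie mainly pays off when prefix queries are also needed.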

This goes even faster than Bohemian's:
private static Map<String, Integer> findNumberOfPosts(List<String> userNameList) {
Map<String, Integer> numberOfPosts = new HashMap<String, Integer>();
for (String userName : userNameList) {
if (!numberOfPosts.containsKey(userName)) {
numberOfPosts.put(userName, Collections.frequency(userNameList, userName));
}
}
return numberOfPosts;
}

The best solution is to add all the elements to an Array and then sort that array.
Then you can just iterate over the array and the duplicates will be placed next to each other in the array.
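A possible sketch of the sort-then-scan approach described above. The class name SortedRunCounter is made up, and the method sorts a copy so that the caller's list is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortedRunCounter {
    // Sorts a copy of the list, then counts each run of equal neighbours.
    static Map<String, Integer> countBySorting(List<String> userNameList) {
        List<String> sorted = new ArrayList<>(userNameList);
        Collections.sort(sorted);
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps the sorted order
        int i = 0;
        while (i < sorted.size()) {
            String current = sorted.get(i);
            int j = i;
            // Advance j to the first element that differs from current.
            while (j < sorted.size() && sorted.get(j).equals(current)) {
                j++;
            }
            counts.put(current, j - i); // length of the run
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<String> names = Arrays.asList("bob", "ann", "bob", "cid", "ann", "bob");
        System.out.println(countBySorting(names)); // {ann=2, bob=3, cid=1}
    }
}
```

Sorting costs O(n log n) comparisons, so for pure counting it is not faster than a single pass with a HashMap, but it also yields the names in sorted order.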

You should try improving the first implementation: for each entry you're iterating through the entire list. How about something like:
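Map<String, Integer> map = new HashMap<String, Integer>();
for (String username : usernames) {
if (!map.containsKey(username)) {
// first occurrence: no duplicates seen yet
map.put(username, Integer.valueOf(0));
} else {
// one more duplicate of a name that was already seen
map.put(username, Integer.valueOf(map.get(username) + 1));
}
}
return map;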

Use the data structure that was designed to support this natively. Store the user names in a Multiset and let it automatically maintain the frequency/count for you.
Read a tutorial on Multiset to understand how it works. (Multiset is not part of the JDK; libraries such as Guava provide one.)

The following is a convenient way to remove duplicates and count the number of duplicate elements in a List. No extra logic is needed.
List<String> userNameList = new ArrayList<String>();
// add elements to userNameList, including duplicates
userNameList.add("a");
userNameList.add("a");
userNameList.add("a");
userNameList.add("a");
userNameList.add("b");
userNameList.add("b");
userNameList.add("b");
userNameList.add("b");
userNameList.add("c");
userNameList.add("c");
userNameList.add("c");
userNameList.add("c");
int originalSize = userNameList.size();
Set<String> hs = new HashSet<String>(); // A Set discards duplicates automatically.
hs.addAll(userNameList);
userNameList.clear();
userNameList.addAll(hs);
Collections.sort(userNameList); // Sort the List, if needed.
// Display the elements after removing duplicate entries.
for (String element : userNameList)
{
System.out.println(element);
}
int duplicate = originalSize - userNameList.size();
System.out.println("Duplicate entries in the List:->" + duplicate); // Number of duplicate entries.
/*Map<String, Integer> numberOfPosts = new HashMap<String, Integer>(); //Store duplicate entries in your Map using some key.
numberOfPosts.put(userNameList.get(i), duplicate);
return(numberOfPosts);*/

Related

How to calculate the sum of values of different hashmaps with the same key?

So I have this hashmap named "hm" which produces the following output (NOTE: this is just a selection):
{1=35, 2=52, 3=61, 4=68, 5=68, 6=70, 7=70, 8=70, 9=70, 10=72, 11=72}
{1=35, 2=52, 3=61, 4=68, 5=70, 6=70, 7=70, 8=68, 9=72, 10=72, 11=72}
{1=35, 2=52, 3=61, 4=68, 5=68, 6=70, 7=70, 8=70, 9=72, 10=72, 11=72}
This output was created with the following code (NOTE: the rest of the class code is not shown here):
private int scores;
HashMap<Integer,Integer> hm = new HashMap<>();
for (int i = 0; i < fileLines.length(); i++) {
char character = fileLines.charAt(i);
this.scores = character;
int position = i +1;
hm.put(position,this.scores);
}
System.out.println(hm);
What I am trying to do is put all these hashmaps together into one hashmap whose value for each key is the sum of the values for that key. I am familiar with Python's defaultdict, but could not find an equivalent working example. I have searched for an answer and found the questions below, but they do not solve my problem.
How to calculate a value for each key of a HashMap?
what java collection that provides multiple values for the same key
is there a Java equivalent of Python's defaultdict?
The desired output would be:
{1=105, 2=156, 3=183, 4=204, 5=206, ... and so on}
Eventually the average per position (key) has to be calculated, but that is a problem I think I can fix on my own once I know how to do the above.
EDIT: The real output is much, much bigger! Think of 100+ hashmaps with more than 100 keys.

Try something like this:
public Map<Integer, Integer> combine(List<Map<Integer, Integer>> maps) {
Map<Integer, Integer> result = new HashMap<Integer, Integer>();
for (Map<Integer, Integer> map : maps) {
for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
int newValue = entry.getValue();
Integer existingValue = result.get(entry.getKey());
if (existingValue != null) {
newValue = newValue + existingValue;
}
result.put(entry.getKey(), newValue);
}
}
return result;
}
Basically:
Create a new map for the result.
Iterate over each map.
Take each entry and, if its key is already present in the result, add its value to the existing one; if not, put it in the map.
Return the result.
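The steps above can also be sketched with Map.merge, here applied to the first five positions of the three sample maps from the question. This assumes Java 8 or later; the class name ScoreCombiner and the helper scores(...) are hypothetical and exist only to build the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ScoreCombiner {
    // Hypothetical helper: builds a map keyed by position (1, 2, 3, ...) from the given scores.
    static Map<Integer, Integer> scores(int... values) {
        Map<Integer, Integer> map = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            map.put(i + 1, values[i]);
        }
        return map;
    }

    // Sums the values of all maps per key; a key missing from a map contributes nothing for that map.
    static Map<Integer, Integer> sumByKey(List<Map<Integer, Integer>> maps) {
        Map<Integer, Integer> totals = new TreeMap<>(); // TreeMap keeps the keys in ascending order
        for (Map<Integer, Integer> map : maps) {
            for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> maps = new ArrayList<>();
        maps.add(scores(35, 52, 61, 68, 68));
        maps.add(scores(35, 52, 61, 68, 70));
        maps.add(scores(35, 52, 61, 68, 68));

        Map<Integer, Integer> totals = sumByKey(maps);
        System.out.println(totals); // {1=105, 2=156, 3=183, 4=204, 5=206}

        // Average per position, assuming every map contains every key.
        for (Map.Entry<Integer, Integer> entry : totals.entrySet()) {
            System.out.println(entry.getKey() + " -> " + (double) entry.getValue() / maps.size());
        }
    }
}
```

If some keys are absent from some maps, dividing by maps.size() is no longer the right average; a second map counting the occurrences per key would be needed in that case.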

newHashMap.put(key1,map1.get(key1)+map2.get(key1)+map3.get(key1));

What is the fastest method to find duplicates from a collection

This is what I have tried, and somehow I get the feeling that it is not right or not the best-performing approach. Is there a better way to search for and fetch the duplicate values from a Map or, for that matter, any collection? And is there a better way to traverse a collection?
public class SearchDuplicates{
public static void main(String[] args) {
Map<Integer, String> directory=new HashMap<Integer, String>();
Map<Integer, String> repeatedEntries=new HashMap<Integer, String>();
// adding data
directory.put(1,"john");
directory.put(2,"michael");
directory.put(3,"mike");
directory.put(4,"anna");
directory.put(5,"julie");
directory.put(6,"simon");
directory.put(7,"tim");
directory.put(8,"ashley");
directory.put(9,"john");
directory.put(10,"michael");
directory.put(11,"mike");
directory.put(12,"anna");
directory.put(13,"julie");
directory.put(14,"simon");
directory.put(15,"tim");
directory.put(16,"ashley");
for(int i=1;i<=directory.size();i++) {
String result=directory.get(i);
for(int j=1;j<=directory.size();j++) {
if(j!=i && result==directory.get(j) &&j<i) {
repeatedEntries.put(j, result);
}
}
System.out.println(result);
}
for(Entry<Integer, String> entry : repeatedEntries.entrySet()) {
System.out.println("repeated "+entry.getValue());
}
}
}
Any help would be appreciated. Thanks in advance

You can use a Set to determine whether entries are duplicate. Also, repeatedEntries might as well be a Set, since the keys are meaningless:
Map<Integer, String> directory=new HashMap<Integer, String>();
Set<String> repeatedEntries=new HashSet<String>();
Set<String> seen = new HashSet<String>();
// ... initialize directory, then:
for(int j=1;j<=directory.size();j++){
String val = directory.get(j);
if (!seen.add(val)) {
// if add failed, then val was already seen
repeatedEntries.add(val);
}
}
At the cost of extra memory, this does the job in linear time (instead of the quadratic time of your current algorithm).
EDIT: Here's a version of the loop that doesn't rely on the keys being consecutive integers starting at 1:
for (String val : directory.values()) {
if (!seen.add(val)) {
// if add failed, then val was already seen
repeatedEntries.add(val);
}
}
That will detect duplicate values for any Map, regardless of the keys.

You can use this to find the word count:
Map<String, Integer> repeatedEntries = new HashMap<String, Integer>();
for (String w : directory.values()) {
Integer n = repeatedEntries.get(w);
n = (n == null) ? 1 : ++n;
repeatedEntries.put(w, n);
}
and this to print the stats
for (Entry<String, Integer> e : repeatedEntries.entrySet()) {
System.out.println(e);
}

List and Vector have a contains(Object o) method, which returns a boolean indicating whether the object exists in the collection.

You can use Collections.frequency to count the occurrences of an element in any collection:
Collections.frequency(list, "a")
Here is a more general example that reports the count for every distinct element:
Set<String> uniqueSet = new HashSet<String>(list);
for (String temp : uniqueSet) {
System.out.println(temp + ": " + Collections.frequency(list, temp));
}

Find most common value from hashmap of set in java?

What would be the fastest way to get the common values from all the sets within an hash map?
I have a
Map<String, Set<String>>
I check for the key and get all the sets that have the given key. But instead of getting all the sets from the hashmap, is there a better way to get the common elements (values) from all the sets?
For example, the hashmap contains,
abc:[ax1,au2,au3]
def:[ax1,aj5]
ijk:[ax1,au2]
I want to extract only ax1 and au2, as they are the most common values across the sets.

Note: I am not sure if this is the fastest, but this is one way to do it.
First, write a simple method to extract the frequencies for the Strings occurring across all value sets in the map. Here is a simple implementation:
Map<String, Integer> getFrequencies(Map<String, Set<String>> map) {
Map<String, Integer> frequencies = new HashMap<String, Integer>();
for(String key : map.keySet()) {
for(String element : map.get(key)) {
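int count;
if(frequencies.containsKey(element)) {
count = frequencies.get(element);
} else {
// a first occurrence starts at 1 and is stored as 2 below, so every stored
// count is one higher than the true number of occurrences; thresholds passed
// to getCommon are compared against these stored counts
count = 1;
}
frequencies.put(element, count + 1);
}
}
return frequencies;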
}
You can simply call this method like this: Map<String, Integer> frequencies = getFrequencies(map);
Second, in order to get the most "common" elements in the frequencies map, you simply sort the entries in the map by using the Comparator interface. It so happens that SO has an excellent community wiki that discusses just that: Sort a Map<Key, Value> by values (Java). The wiki contains multiple interesting solutions to the problem. It might help to go over them.
You can simply implement a class, call it FrequencyMap, as shown below.
Have the class implement the Comparator<String> interface and thus the int compare(String a, String b) method, so that the elements of the map are sorted in decreasing order of their Integer values (the compare method below returns -1 when a's value is greater than or equal to b's).
Third, implement another method, call it getCommon(int threshold) and pass it a threshold value. Any entry in the map that has a frequency value greater than threshold, can be considered "common", and will be returned as a simple List.
class FrequencyMap implements Comparator<String> {
Map<String, Integer> map;
public FrequencyMap(Map<String, Integer> map) {
this.map = map;
}
public int compare(String a, String b) {
if (map.get(a) >= map.get(b)) {
return -1;
} else {
return 1;
} // returning 0 would merge keys
}
public ArrayList<String> getCommon(int threshold) {
ArrayList<String> common = new ArrayList<String>();
for(String key : this.map.keySet()) {
if(this.map.get(key) >= threshold) {
common.add(key);
}
}
return common;
}
@Override public String toString() {
return this.map.toString();
}
}
So using FrequencyMap class and the getCommon method, it boils down to these few lines of code:
FrequencyMap frequencyMap = new FrequencyMap(frequencies);
System.out.println(frequencyMap.getCommon(2));
System.out.println(frequencyMap.getCommon(3));
System.out.println(frequencyMap.getCommon(4));
For the sample input in your question this is the output that you get:
// common values
[ax1, au6, au3, au2]
[ax1, au2]
[ax1]
Also, here is a gist containing the code I whipped up for this question: https://gist.github.com/VijayKrishna/5973268

Using Java, how can I compare every entry in HashMap to every other entry in the same HashMap without duplicating comparisons?

I am currently using 2 for loops to compare all entries but I am getting duplicate comparisons. Because HashMaps aren't ordered, I can't figure out how to eliminate comparisons that have already been made. For example, I have something like:
for(Entry<String, String> e1: map.entrySet())
{
for(Entry<String, String> e2: map.entrySet())
{
if (e1.getKey() != e2.getKey())
{
//compare e1.getValue() to e2.getValue()
}
}
}
The problem with this is that the first entry will be compared to the second entry and then the third entry and so on. But then the second entry will again be compared to the first entry and so on. And then the third entry will be compared to the first entry, then the second entry, then the 4th entry, etc. Is there a better way to iterate through HashMaps to avoid doing duplicate comparisons?
Additional information:
To be more specific and hopefully answer your questions, the HashMap I have is storing file names (the keys) and file contents (the values) - just text files. The HashMap has been populated by traversing a directory that contains the files I will want to compare. Then what I am doing is running pairs of files through some algorithms to determine the similarity between each pair of files. I do not need to compare file 1 to file 2, and then file 2 to file 1 again, as I only need the 2 files to be compared once. But I do need every file to be compared to every other file once. I am brand new to working with HashMaps. agim’s answer below might just work for my purposes. But I will also try to wrap my brain around both Evgeniy Dorofeev and Peter Lawrey's solutions below. I hope this helps to explain things better.

If you are not careful, the cost of eliminating duplicates could be higher than the cost of the redundant comparisons, for the keys at least.
You can order the keys using System.identityHashCode(x):
for(Map.Entry<Key, Value> entry1: map.entrySet()) {
Key key1 = entry1.getKey();
int hash1 = System.identityHashCode(key1);
Value value1 = entry1.getValue();
for(Map.Entry<Key, Value> entry2: map.entrySet()) {
Key key2 = entry2.getKey();
// skip pairs whose order is reversed, so each pair is handled from one side only
if (hash1 > System.identityHashCode(key2)) continue;
Value value2 = entry2.getValue();
// compare value1 and value2;
}
}

How about this solution:
String[] values = map.values().toArray(new String[map.size()]);
for (int i = 0; i < values.length; i++) {
for (int j = i+1; j<values.length; j++) {
if (values[i].equals(values[j])) {
// ...
}
}
}
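If the keys (for example the file names) are needed alongside the values, the same index trick works on a list of entries. The following is a sketch under that assumption; the class name PairwiseCompare, the "a|b" label format and the sample file names are invented for illustration, and the actual similarity computation is left as a comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairwiseCompare {
    // Visits every unordered pair of entries exactly once and returns "keyA|keyB" labels.
    static List<String> uniquePairs(Map<String, String> map) {
        // Copy the entries into a list so that they can be addressed by index.
        List<Map.Entry<String, String>> entries = new ArrayList<>(map.entrySet());
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            // Starting j at i + 1 skips self-comparisons and already visited pairs.
            for (int j = i + 1; j < entries.size(); j++) {
                // A real program would compare entries.get(i).getValue()
                // with entries.get(j).getValue() at this point.
                pairs.add(entries.get(i).getKey() + "|" + entries.get(j).getKey());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // LinkedHashMap is used here only to make the printed order predictable.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "first text");
        files.put("b.txt", "second text");
        files.put("c.txt", "third text");
        System.out.println(uniquePairs(files)); // [a.txt|b.txt, a.txt|c.txt, b.txt|c.txt]
    }
}
```

For n entries this performs n * (n - 1) / 2 comparisons, half of what the two nested entrySet loops do once self-comparisons are excluded.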

Try
HashMap<Object, Object> map = new HashMap<>();
Iterator<Entry<Object, Object>> i = map.entrySet().iterator();
while (i.hasNext()) {
Entry next = i.next();
i.remove();
for (Entry e : map.entrySet()) {
e.equals(next);
}
}
Note that there is no point in comparing keys in a HashMap, since they are never equal to one another. That is, only the values are worth iterating over and comparing.

If I understand correctly, you just want to know if there are any duplicates in the map's values? If so:
Set<String> values = new HashSet<String>(map.values());
boolean hasDuplicates = values.size() != map.size();
This could be made more efficient if you kick out once you find the first duplicate:
Set<String> values = new HashSet<String>();
for (String value : map.values()) {
if (!values.add(value)) {
return true;
}
}
return false;

public static boolean compareStringHashMaps(Map<String, String> expectedMap, Map<String, String> actualMap) throws Exception
{
logger.info("## CommonFunctions | compareStringHashMaps() ## ");
Iterator iteratorExpectedMap = expectedMap.entrySet().iterator();
Iterator iteratorActualMap = actualMap.entrySet().iterator();
boolean flag = true;
while (iteratorExpectedMap.hasNext() && iteratorActualMap.hasNext()){
Map.Entry expectedMapEntry = (Map.Entry) iteratorExpectedMap.next();
Map.Entry actualMapEntry = (Map.Entry) iteratorActualMap.next();
if(!expectedMapEntry.getKey().toString().trim().equals(actualMapEntry.getKey().toString().trim()))
{
flag = false;
break;
}
else if (!expectedMapEntry.getValue().toString().trim().equals(actualMapEntry.getValue().toString().trim()))
{
flag = false;
break;
}
}
return flag;
}

Assuming the values of the HashMap are Integers, this returns the maximum value in the HashMap:
int maxNum = 0;
for (Object a: hashMap.keySet()) {
if ((int)hashMap.get(a) > maxNum) {
maxNum = (int)hashMap.get(a);
}
}

You could try using a 2D array of results. If the result is already populated, then don't perform the comparison again. This also has the benefit of storing the results for later use.
So for an int result you would be looking at something like this: Integer[][] results = new Integer[map.entrySet().size()][map.entrySet().size()]; This initialises the array to nulls and allows you to check for existing results before comparing. One important thing to note is that each comparison result should be stored in the array twice, with the exception of comparisons of an entry with itself; e.g. the comparison between index 1 and index 2 should be stored in both results[1][2] and results[2][1].
Hope this helps.

Accessing the last entry in a Map

How to move a particular HashMap entry to Last position?
For Example, I have HashMap values like this:
HashMap<String,Integer> map = new HashMap<String,Integer>();
map= {Not-Specified 1, test 2, testtest 3};
"Not-Specified" may come in any position. It may come first or in the middle of the map. But I want to move "Not-Specified" to the last position.
How can I do that?

To answer your question in one sentence:
Per default, Maps don't have a last entry, it's not part of their contract.
And a side note: it's good practice to code against interfaces, not the implementation classes (see Effective Java by Joshua Bloch, Chapter 8, Item 52: Refer to objects by their interfaces).
So your declaration should read:
Map<String,Integer> map = new HashMap<String,Integer>();
(All maps share a common contract, so the client need not know what kind of map it is, unless he specifies a sub interface with an extended contract).
Possible Solutions
Sorted Maps:
There is a sub interface SortedMap that extends the map interface with order-based lookup methods and it has a sub interface NavigableMap that extends it even further. The standard implementation of this interface, TreeMap, allows you to sort entries either by natural ordering (if they implement the Comparable interface) or by a supplied Comparator.
You can access the last entry through the lastEntry method:
NavigableMap<String,Integer> map = new TreeMap<String, Integer>();
// add some entries
Entry<String, Integer> lastEntry = map.lastEntry();
Linked maps:
There is also the special case of LinkedHashMap, a HashMap implementation that stores the order in which keys are inserted. There is however no interface to back up this functionality, nor is there a direct way to access the last key. You can only do it through tricks such as using a List in between:
Map<String,Integer> map = new LinkedHashMap<String, Integer>();
// add some entries
List<Entry<String,Integer>> entryList =
new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
Entry<String, Integer> lastEntry =
entryList.get(entryList.size()-1);
Proper Solution:
Since you don't control the insertion order, you should go with the NavigableMap interface, i.e. you would write a comparator that positions the Not-Specified entry last.
Here is an example:
final NavigableMap<String,Integer> map =
new TreeMap<String, Integer>(new Comparator<String>() {
public int compare(final String o1, final String o2) {
int result;
if("Not-Specified".equals(o1)) {
result=1;
} else if("Not-Specified".equals(o2)) {
result=-1;
} else {
result =o1.compareTo(o2);
}
return result;
}
});
map.put("test", Integer.valueOf(2));
map.put("Not-Specified", Integer.valueOf(1));
map.put("testtest", Integer.valueOf(3));
final Entry<String, Integer> lastEntry = map.lastEntry();
System.out.println("Last key: "+lastEntry.getKey()
+ ", last value: "+lastEntry.getValue());
Output:
Last key: Not-Specified, last value: 1
Solution using HashMap:
If you must rely on HashMaps, there is still a solution, using a) a modified version of the above comparator, b) a List initialized with the Map's entrySet and c) the Collections.sort() helper method:
final Map<String, Integer> map = new HashMap<String, Integer>();
map.put("test", Integer.valueOf(2));
map.put("Not-Specified", Integer.valueOf(1));
map.put("testtest", Integer.valueOf(3));
final List<Entry<String, Integer>> entries =
new ArrayList<Entry<String, Integer>>(map.entrySet());
Collections.sort(entries, new Comparator<Entry<String, Integer>>(){
public int compareKeys(final String o1, final String o2){
int result;
if("Not-Specified".equals(o1)){
result = 1;
} else if("Not-Specified".equals(o2)){
result = -1;
} else{
result = o1.compareTo(o2);
}
return result;
}
@Override
public int compare(final Entry<String, Integer> o1,
final Entry<String, Integer> o2){
return this.compareKeys(o1.getKey(), o2.getKey());
}
});
final Entry<String, Integer> lastEntry =
entries.get(entries.size() - 1);
System.out.println("Last key: " + lastEntry.getKey() + ", last value: "
+ lastEntry.getValue());
Output:
Last key: Not-Specified, last value: 1

HashMap doesn't have "the last position", as it is not sorted.
You may use another Map that implements java.util.SortedMap; the most popular one is TreeMap.

A SortedMap is the logical/best choice; however, another option is to use a LinkedHashMap, which supports two ordering modes: insertion order (most recently added goes last) and access order (most recently accessed goes last). See the Javadocs for more details.
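With an insertion-ordered LinkedHashMap, one way to move an entry to the end is to remove it and put it back. This is a minimal sketch; the class and method name MoveToLast.moveToLast are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MoveToLast {
    // Moves the entry for the given key to the end of an insertion-ordered LinkedHashMap.
    static <K, V> void moveToLast(LinkedHashMap<K, V> map, K key) {
        if (map.containsKey(key)) {
            // Re-putting an existing key keeps its old position,
            // so the entry has to be removed first and then inserted again.
            V value = map.remove(key);
            map.put(key, value);
        }
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Integer> map = new LinkedHashMap<>();
        map.put("Not-Specified", 1);
        map.put("test", 2);
        map.put("testtest", 3);

        moveToLast(map, "Not-Specified");
        System.out.println(map); // {test=2, testtest=3, Not-Specified=1}

        // The last entry is simply the last one visited during iteration.
        Map.Entry<String, Integer> last = null;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            last = entry;
        }
        System.out.println(last.getKey()); // Not-Specified
    }
}
```

This only holds for the default insertion-order mode and as long as no later insertions follow; entries added afterwards would again come after "Not-Specified".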

When using numbers as the key, I suppose you could also try this:
Map<Long, String> map = new HashMap<>();
map.put(4L, "The First");
map.put(6L, "The Second");
map.put(11L, "The Last");
long lastKey = 0;
// the last entry visited wins; note that HashMap does not guarantee any iteration order
for (Map.Entry<Long, String> entry : map.entrySet()) {
lastKey = entry.getKey();
}
System.out.println(lastKey); // 11

Moving an entry does not make sense for a HashMap, since it is a dictionary that uses the key's hash code to pick a bucket and then a linked list for colliding hash codes, resolved via equals.
Use a TreeMap for sorted maps and pass in a custom comparator.

In such a scenario the last used key is usually known, so it can be used to access the last value (the one inserted with it):
class PostIndexData {
String _office_name;
Boolean _isGov;
public PostIndexData(String name, Boolean gov) {
_office_name = name;
_isGov = gov;
}
}
//-----------------------
class KgpData {
String _postIndex;
PostIndexData _postIndexData;
public KgpData(String postIndex, PostIndexData postIndexData) {
_postIndex = postIndex;
_postIndexData = postIndexData;
}
}
public class Office2ASMPro {
private HashMap<String,PostIndexData> _postIndexMap = new HashMap<>();
private HashMap<String,KgpData> _kgpMap = new HashMap<>();
...
private void addOffice(String kgp, String postIndex, String officeName, Boolean gov) {
if (_postIndexMap.get(postIndex) == null) {
_postIndexMap.put(postIndex, new PostIndexData(officeName, gov));
}
_kgpMap.put( kgp, new KgpData(postIndex, _postIndexMap.get(postIndex)) );
}

Find all missing elements from an array:
int[] array = {3,5,7,8,2,1,32,5,7,9,30,5};
TreeMap<Integer, Integer> map = new TreeMap<>();
for(int i=0;i<array.length;i++) {
map.put(array[i], 1);
}
int maxSize = map.lastKey();
for(int j=0;j<maxSize;j++) {
if(null == map.get(j))
System.out.println("Missing No:"+j);
}
