I want to iterate over two collections, each with roughly 600 records, and compare every element of collection one against every element of collection two. If I choose LinkedHashSet as the collection type, I have to call iterator() on each collection and nest two while loops (inner and outer).
With ArrayList, I would instead have two nested for loops to read the data from each collection.
I originally chose LinkedHashSet because I read that it has good performance, and I also preferred a set to remove duplicates. But after seeing it run very slowly, taking around 2 hours to finish, I started to wonder whether it would be better to copy each set into an ArrayList and iterate over the ArrayList instead of the LinkedHashSet.
Which choice would speed up the runtime?
public ArrayList<ArrayList<Object>> processDataSourcesV2(LinkedHashMap<RecordId, LinkedHashSet<String>> ppmsFinalResult, LinkedHashMap<RecordId, LinkedHashSet<String>> productDBFinalResult) {
    // each parameter is a map from key (id) to value (set of unique parameters)
    ArrayList<ArrayList<Object>> result = new ArrayList<ArrayList<Object>>();
    Iterator<Entry<RecordId, LinkedHashSet<String>>> ppmsIterator = ppmsFinalResult.entrySet().iterator();
    Iterator<Entry<RecordId, LinkedHashSet<String>>> productIdIterator = null;
    // pairs of ids, one from each list
    ArrayList<Pair> listOfIdPair = new ArrayList<Pair>();
    while (ppmsIterator.hasNext()) {
        // RecordId holds the id and which list the id belongs to
        Entry<RecordId, LinkedHashSet<String>> currentPpmsPair = ppmsIterator.next();
        RecordId currentPpmsIDObj = currentPpmsPair.getKey();
        // set of unique strings
        LinkedHashSet<String> currentPpmsCleanedTerms = currentPpmsPair.getValue();
        productIdIterator = productDBFinalResult.entrySet().iterator();
        while (productIdIterator.hasNext()) {
            Entry<RecordId, LinkedHashSet<String>> currentProductDBPair = productIdIterator.next();
            RecordId currentProductIDObj = currentProductDBPair.getKey();
            LinkedHashSet<String> currentProductCleanedTerms = currentProductDBPair.getValue();
            ArrayList<Object> listOfRowByRowProcess = new ArrayList<Object>();
            Pair currentIDPair = new Pair(currentPpmsIDObj.getIdValue(), currentProductIDObj.getIdValue());
            // skip self-pairs and pairs already seen in the opposite order
            if (currentPpmsIDObj.getIdValue().equals(currentProductIDObj.getIdValue()) || listOfIdPair.contains(currentIDPair.reverse())) {
                continue;
            } else {
                LinkedHashSet<String> commonTerms = getCommonTerms(currentPpmsCleanedTerms, currentProductCleanedTerms);
                listOfIdPair.add(currentIDPair.reverse());
                if (commonTerms.size() > 0) {
                    listOfRowByRowProcess.add(currentPpmsIDObj);
                    listOfRowByRowProcess.add(currentProductIDObj);
                    listOfRowByRowProcess.add(commonTerms);
                    result.add(listOfRowByRowProcess);
                }
            }
        }
    }
    return result;
}
public LinkedHashSet<String> getCommonTerms(LinkedHashSet<String> setOne, LinkedHashSet<String> setTwo) {
    // hard copy of setOne, then intersect it with setTwo
    LinkedHashSet<String> setOfCommon = new LinkedHashSet<String>(setOne);
    setOfCommon.retainAll(setTwo);
    return setOfCommon;
}
Arrays are faster than any other structure when it comes to iteration (all elements are stored sequentially in memory). On the other hand, they are slower at deleting and inserting elements because the sequential storage has to be maintained. Iterating over a linked list is slower because following node pointers can cause cache misses and page faults. So it's up to you which one to choose.
If you want to find which elements are in both collections, make one a Set and get its intersection with the other collection:
Collection<T> collection1, collection2; // given these
Set<T> intersection = new HashSet<T>(collection1);
intersection.retainAll(collection2);
This will execute in O(n) time, where n is the size of collection2, because finding elements in a HashSet performs in constant time.
My guess is you are checking every element of collection1 against every element of collection2, which has O(n²) time complexity.
Related
I'm currently trying to create a method that determine if an ArrayList(a2) contains an ArrayList(a1), given that both lists contain duplicate values (containsAll wouldn't work as if an ArrayList contains duplicate values, then it would return true regardless of the quantity of the values)
This is what I have: (I believe it would work however I cannot use .remove within the for loop)
public boolean isSubset(ArrayList<Integer> a1, ArrayList<Integer> a2) {
    Integer a1Size = a1.size();
    for (Integer integer2 : a2) {
        for (Integer integer1 : a1) {
            if (integer1 == integer2) {
                a1.remove(integer1);
                a2.remove(integer2);
                if (a1Size == 0) {
                    return true;
                }
            }
        }
    }
    return false;
}
Thanks for the help.
Updated
I think the clearest statement of your question is in one of your comments:
Yes, the example "Example: [dog,cat,cat,bird] is a match for containing [cat,dog] is false but containing [cat,cat,dog] is true?" is exactly what I am trying to achieve.
So really, you are not looking for a "subset", because these are not sets. They can contain duplicate elements. What you are really saying is you want to see whether a1 contains all the elements of a2, in the same amounts.
One way to get to that is to count all the elements in both lists. We can get such a count using this method:
private Map<Integer, Integer> getCounter(List<Integer> list) {
    Map<Integer, Integer> counter = new HashMap<>();
    for (Integer item : list) {
        counter.put(item, counter.containsKey(item) ? counter.get(item) + 1 : 1);
    }
    return counter;
}
We'll rename your method to be called containsAllWithCounts(), and it will use getCounter() as a helper. Your method will also accept List objects as its parameters, rather than ArrayList objects: it's a good practice to specify parameters as interfaces rather than implementations, so you are not tied to using ArrayList types.
With that in mind, we simply scan the counts of the items in a2 and see that they are the same in a1:
public boolean containsAllWithCounts(List<Integer> a1, List<Integer> a2) {
    Map<Integer, Integer> counterA1 = getCounter(a1);
    Map<Integer, Integer> counterA2 = getCounter(a2);
    boolean containsAll = true;
    for (Map.Entry<Integer, Integer> entry : counterA2.entrySet()) {
        Integer key = entry.getKey();
        Integer count = entry.getValue();
        containsAll &= counterA1.containsKey(key) && counterA1.get(key).equals(count);
        if (!containsAll) break;
    }
    return containsAll;
}
If you like, I can rewrite this code to handle arbitrary types, not just Integer objects, using Java generics. Also, all the code can be shortened using Java 8 streams (which I originally used - see comments below). Just let me know in comments.
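As a rough sketch of the Java 8 streams variant mentioned above (the helper name counter and the sample values here are mine, not from the original answer):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountCompare {
    // counts occurrences of each value, e.g. [1, 2, 2] -> {1=1, 2=2}
    static Map<Integer, Long> counter(List<Integer> list) {
        return list.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static boolean containsAllWithCounts(List<Integer> a1, List<Integer> a2) {
        Map<Integer, Long> countsA1 = counter(a1);
        // every value in a2 must appear in a1 with exactly the same count
        return counter(a2).entrySet().stream()
                .allMatch(e -> e.getValue().equals(countsA1.get(e.getKey())));
    }

    public static void main(String[] args) {
        System.out.println(containsAllWithCounts(
                Arrays.asList(1, 2, 2, 3), Arrays.asList(2, 2, 1))); // true
        System.out.println(containsAllWithCounts(
                Arrays.asList(1, 2, 3), Arrays.asList(2, 2)));       // false
    }
}
```

Note that countsA1.get() returns null for values absent from a1, and equals(null) is false, so missing values fail the match as intended.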
If you want to remove elements from a list you have two choices:
iterate over a copy
use a concurrent list implementation
see also:
http://docs.oracle.com/javase/8/docs/api/java/util/Collections.html#synchronizedList-java.util.List-
By the way, why don't you override the contains method? Here you use a simple type like Integer, but what happens when you are using List<SomeComplexClass>?
Example of removing with an iterator over a copy:
List<Integer> list1 = new ArrayList<Integer>();
List<Integer> list2 = new ArrayList<Integer>();
List<Integer> listCopy = new ArrayList<>(list1);
Iterator<Integer> iterator1 = listCopy.iterator();
while (iterator1.hasNext()) {
    Integer next1 = iterator1.next();
    Iterator<Integer> iterator2 = list2.iterator();
    while (iterator2.hasNext()) {
        Integer next2 = iterator2.next();
        if (next1.equals(next2)) list1.remove(next1);
    }
}
see also this answer about iterator:
Concurrent Modification exception
Also, don't use the == operator to compare objects :) use the equals method instead.
About removeAll() and similar methods: keep in mind that many classes implementing the List interface don't override all of its methods, so you can end up with an UnsupportedOperationException; that is why I prefer a "low level" binary/linear/mixed search in this case.
And for comparing objects of complex classes you will need to override the equals and hashCode methods.
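To illustrate that last point, here is a minimal sketch (the Term class is a made-up example, not from the question) of a value class whose equals/hashCode overrides make contains() work by value rather than by reference:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical value class; without the equals/hashCode overrides,
// HashSet lookups and List.contains fall back to reference identity.
final class Term {
    private final String id;

    Term(String id) { this.id = id; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Term && ((Term) o).id.equals(id);
    }

    @Override
    public int hashCode() { return id.hashCode(); }
}

public class EqualsDemo {
    public static void main(String[] args) {
        Set<Term> set = new HashSet<>(Arrays.asList(new Term("a"), new Term("b")));
        // A fresh instance with the same id is found, thanks to the overrides.
        System.out.println(set.contains(new Term("a"))); // true
    }
}
```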
If you want to remove the duplicate values, simply put the arraylist(s) into a HashSet. It will remove the duplicates based on equals() of your object.
- Olga
In Java, HashMap works by using hashCode to locate a bucket. Each bucket is a list of items residing in that bucket. The items are scanned, using equals for comparison. When adding items, the HashMap is resized once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally it's much closer to O(1) than O(n).
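A tiny sketch of why bucket quality matters (the constant hashCode here is deliberately pathological, and the class is mine, not from the thread):

```java
import java.util.HashSet;
import java.util.Set;

public class BucketDemo {
    // All instances hash to the same bucket, so lookups degrade toward a linear scan.
    static final class BadKey {
        final int value;
        BadKey(int value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value == value;
        }
        @Override public int hashCode() { return 42; } // pathological on purpose
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 10_000; i++) set.add(new BadKey(i));
        // Still correct, just slow: every contains() walks one huge bucket.
        System.out.println(set.contains(new BadKey(9_999))); // true
    }
}
```

With a well-distributed hashCode, each bucket holds only a few items and contains() stays close to O(1).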
In short, there is no need to use more resources (memory) and "harness" unnecessary classes: a hash map's get method gets expensive as the number of items in a bucket grows.
hashCode -> put into bucket [if many items in the bucket] -> get = linear scan
So what counts when removing items? The complexity of equals and hashCode, and use of a proper algorithm to iterate.
I know this is maybe amateur-ish, but...
There is no need to remove the items from both lists, so, just take it from the one list
public boolean isSubset(ArrayList<Integer> a1, ArrayList<Integer> a2) {
    for (Integer a1Int : a1) {
        for (int i = 0; i < a2.size(); i++) {
            if (a2.get(i).equals(a1Int)) {
                a2.remove(i);
                break;
            }
        }
        if (a2.size() == 0) {
            return true;
        }
    }
    return false;
}
I have two big arrays of strings. I want to remove the elements from the first array that do not exist in the second array.
First I create two arrays:
Array to modify:
String[] sarr = fdata.split(System.getProperty("line.separator"));
ArrayList<String> items = new ArrayList(Arrays.asList(sarr));
Filter array:
List<String> filter = Arrays.asList(voc.split(System.getProperty("line.separator")));
Then I create an Iterator to go through the elements of the items list and check whether each item exists in the filter list; if it does not, I remove it from items:
Iterator<String> it = items.iterator();
while (it.hasNext()) {
    String s = it.next();
    if (!filter.contains(s)) {
        it.remove();
    }
}
The items list contains 286,568 strings and filter contains 100,000 strings. The operation takes far too much time, so I must not be doing it efficiently.
Is there a faster way?
Just use different collection types. For the filter, use a HashSet for O(1) (instead of O(n) for ArrayList) search complexity, and for the items, use LinkedList instead of ArrayList, which will be more efficient for the remove operations.
I didn't test this code, but...
String[] sarr = fdata.split(System.getProperty("line.separator"));
LinkedList<String> items = new LinkedList<String>(Arrays.asList(sarr));
Set<String> filter = new HashSet<String>(Arrays.asList(voc.split(System.getProperty("line.separator"))));
items.retainAll(filter);
When you call collection.contains(element) often for a large collection, you should not use an ArrayList, but rather a HashSet.
Set<String> filter = new HashSet<>();
Collections.addAll(filter, voc.split(System.getProperty("line.separator")));
A HashSet is an optimized data structure for looking up things.
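Putting it together, a small runnable sketch of the filtering step (with stand-in data instead of the question's file contents):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FilterDemo {
    public static void main(String[] args) {
        // Stand-in data; the question builds these collections by splitting large strings.
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> filter = new HashSet<>(Arrays.asList("b", "d"));

        // Keep only the items present in the filter; contains() on a HashSet
        // is O(1) on average, so the whole pass is O(n).
        items.removeIf(s -> !filter.contains(s));
        System.out.println(items); // [b, d]
    }
}
```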
I'm currently working on a Java program that is required to handle large amounts of data. I have two Vectors...
Vector collectionA = new Vector();
Vector collectionB = new Vector();
...and both of them will contain around 900,000 elements during processing.
I need to find all items in collectionB that are not contained in collectionA. Right now, this is how I'm doing it:
for (int i = 0; i < collectionA.size(); i++) {
    if (!collectionB.contains(collectionA.elementAt(i))) {
        // do stuff if orphan is found
    }
}
But this causes the program to run for lots of hours, which is unacceptable.
Is there any way to tune this so that I can cut my running time significantly?
I think I've read once that using ArrayList instead of Vector is faster. Would using ArrayLists instead of Vectors help for this issue?
Use a HashSet for the lookups.
Explanation:
Currently your program has to test every item in collectionB to see if it is equal to the item in collectionA that it is currently handling (the contains() method will need to check each item).
You should do:
Set<String> set = new HashSet<String>(collectionB);
for (Iterator<String> i = collectionA.iterator(); i.hasNext(); ) {
    if (!set.contains(i.next())) {
        // handle
    }
}
Using the HashSet will help, because the set will calculate a hash for each element and store the element in a bucket associated with a range of hash values. When checking whether an item is in the set, the hash value of the item will directly identify the bucket the item should be in. Now only the items in that bucket have to be checked.
Using a SortedSet like TreeSet would also be an improvement over Vector, since to find an item only the position it would occupy has to be checked, instead of all positions. Which Set implementation performs best depends on the data.
If ordering of the elements doesn't matter, I would go for HashSets, and do it as follows:
Set<String> a = new HashSet<>();
Set<String> b = new HashSet<>();
// ...
b.removeAll(a);
So in essence, you're removing from set b all the elements that are in set a, leaving the asymmetric set difference. Note that the removeAll method does modify set b, so if that's not what you want, you would need to make a copy first.
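If the original sets must stay intact, a sketch of the copy-first variant looks like this (the variable names are mine):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DifferenceDemo {
    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("x", "y"));
        Set<String> b = new HashSet<>(Arrays.asList("x", "y", "z"));

        // Copy b first so removeAll does not mutate the original set.
        Set<String> onlyInB = new HashSet<>(b);
        onlyInB.removeAll(a);

        System.out.println(onlyInB); // [z]
        System.out.println(b);       // unchanged: still holds x, y, z
    }
}
```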
To find out whether HashSet or TreeSet is more efficient for this type of operation, I ran the below code with both types, and used Guava's Stopwatch to measure execution time.
@Test
public void perf() {
    Set<String> setA = new HashSet<>();
    Set<String> setB = new HashSet<>();
    for (int i = 0; i < 900000; i++) {
        String uuidA = UUID.randomUUID().toString();
        String uuidB = UUID.randomUUID().toString();
        setA.add(uuidA);
        setB.add(uuidB);
    }
    Stopwatch stopwatch = Stopwatch.createStarted();
    setB.removeAll(setA);
    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS));
}
On my modest development machine, using Oracle JDK 7, the TreeSet variant is about 4 times slower (~450ms) than the HashSet variant (~105ms).
I have several ArrayLists of Integer objects, stored in a HashMap.
I want to get a list (ArrayList) of all the numbers (Integer objects) that appear in each list.
My thinking so far is:
1. Iterate through each ArrayList and put all the values into a HashSet.
   This will give us a "listing" of all the values in the lists, but only once.
2. Iterate through the HashSet.
   2.1 With each iteration, perform ArrayList.contains().
   2.2 If none of the ArrayLists returns false for the operation, add the number to a "master list" which contains all the final values.
Funny thing is, as I wrote this I came up with a reasonably good solution, but I'll still post the question in case it is useful for someone else.
Of course, if you have a better way, please do let me know.
I am not sure that I understand your goal. But if you wish to find the intersection of a collection of List<Integer> objects, then you can do the following:
public static List<Integer> intersection(Collection<List<Integer>> lists) {
    if (lists.size() == 0)
        return Collections.emptyList();
    Iterator<List<Integer>> it = lists.iterator();
    HashSet<Integer> resSet = new HashSet<Integer>(it.next());
    while (it.hasNext())
        resSet.retainAll(new HashSet<Integer>(it.next()));
    return new ArrayList<Integer>(resSet);
}
This code runs in linear time in the total number of items. Actually this is average linear time, because of the use of HashSet.
Also, note that if you use ArrayList.contains() in a loop, it may result in quadratic complexity, since this method runs in linear time, unlike HashSet.contains() that runs in constant time.
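A rough timing sketch of that difference (the sizes and values here are illustrative, and actual timings depend on the machine):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ContainsCost {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) list.add(i);
        Set<Integer> set = new HashSet<>(list);

        // ArrayList.contains scans the list: O(n) per lookup.
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000; i++) list.contains(i * 99);
        long listNanos = System.nanoTime() - t0;

        // HashSet.contains hashes to a bucket: O(1) per lookup on average.
        t0 = System.nanoTime();
        for (int i = 0; i < 1_000; i++) set.contains(i * 99);
        long setNanos = System.nanoTime() - t0;

        System.out.println("ArrayList: " + listNanos + " ns, HashSet: " + setNanos + " ns");
    }
}
```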
You have to change step 1: use the shortest list instead of your HashSet (if a value isn't in the shortest list, it isn't in all lists...).
Then call contains on the other lists and remove a value as soon as one of them returns false (and skip further tests for that value).
At the end, the shortest list will contain the answer.
some code:
public class TestLists {
    private static List<List<Integer>> listOfLists = new ArrayList<List<Integer>>();

    private static List<Integer> filter(List<List<Integer>> listOfLists) {
        // find the shortest list
        List<Integer> shortestList = null;
        for (List<Integer> list : listOfLists) {
            if (shortestList == null || list.size() < shortestList.size()) {
                shortestList = list;
            }
        }
        // create the result list from the shortest list
        final List<Integer> result = new LinkedList<Integer>(shortestList);
        // remove elements not present in all lists from the result list
        for (Integer valueToTest : shortestList) {
            for (List<Integer> list : listOfLists) {
                // no need to compare to itself
                if (shortestList == list) {
                    continue;
                }
                // if one list doesn't contain the value, remove it from the result and break
                if (!list.contains(valueToTest)) {
                    result.remove(valueToTest);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> l1 = new ArrayList<Integer>() {{
            add(100);
            add(200);
        }};
        List<Integer> l2 = new ArrayList<Integer>() {{
            add(100);
            add(200);
            add(300);
        }};
        List<Integer> l3 = new ArrayList<Integer>() {{
            add(100);
            add(200);
            add(300);
        }};
        List<Integer> l4 = new ArrayList<Integer>() {{
            add(100);
            add(200);
            add(300);
        }};
        List<Integer> l5 = new ArrayList<Integer>() {{
            add(100);
            add(200);
            add(300);
        }};
        listOfLists.add(l1);
        listOfLists.add(l2);
        listOfLists.add(l3);
        listOfLists.add(l4);
        listOfLists.add(l5);
        System.out.println(filter(listOfLists));
    }
}
Create a Set (e.g. HashSet) from the first List.
For each remaining list:
call set.retainAll (list) if both list and set are small enough
otherwise call set.retainAll (new HashSet <Integer> (list))
I cannot say at which threshold the second variant of step 2 becomes faster, but I would guess somewhere above size 20 or so. If your lists are all small, you need not bother with this check.
As I remember, Apache Commons Collections has more efficient integer-only structures, if you care not only about the O(*) part but also about the constant factor.
Using the Google Collections Multiset makes this (representation-wise) a cakewalk (though I also like Eyal's answer). It's probably not as efficient time/memory-wise as some of the other's here, but it's very clear what's going on.
Assuming the lists contain no duplicates within themselves:
Multiset<Integer> counter = HashMultiset.create();
int totalLists = 0;
// for each of your ArrayLists
{
    counter.addAll(list);
    totalLists++;
}
List<Integer> inAll = Lists.newArrayList();
for (Integer candidate : counter.elementSet())
    if (counter.count(candidate) == totalLists) inAll.add(candidate);
If the lists might contain duplicate elements, they can be passed through a set first:
counter.addAll(list) => counter.addAll(Sets.newHashSet(list))
Finally, this is also ideal if you might want some additional data later (like how close some particular value was to making the cut).
Another approach that slightly modifies Eyal's (basically folding together the act of filtering a list through a set and then retaining all the overlapping elements), and is more lightweight than the above:
public List<Integer> intersection(Iterable<List<Integer>> lists) {
    Iterator<List<Integer>> listsIter = lists.iterator();
    if (!listsIter.hasNext()) return Collections.emptyList();
    Set<Integer> bag = new HashSet<Integer>(listsIter.next());
    while (listsIter.hasNext() && !bag.isEmpty()) {
        Iterator<Integer> itemIter = listsIter.next().iterator();
        Set<Integer> holder = new HashSet<Integer>(); // perhaps also pre-size it to the bag size
        Integer held;
        while (itemIter.hasNext() && !bag.isEmpty())
            if (bag.remove(held = itemIter.next()))
                holder.add(held);
        bag = holder;
    }
    return new ArrayList<Integer>(bag);
}
I'm looking to make a recursive method iterative.
I have a list of Objects I want to iterate over, and then check their subobjects.
Recursive:
doFunction(Object)
    while (iterator.hasNext())
    {
        //doStuff
        doFunction(Object.subObjects);
    }
I want to change it to something like this
doFunction(Object)
    iterator = hashSet.iterator();
    while (iterator.hasNext())
    {
        //doStuff
        hashSet.addAll(Object.subObjects);
    }
Sorry for the poor pseudo code, but basically I want to iterate over subobjects while appending new objects to the end of the list to check.
I could do this using a list, and do something like
while (list.size() > 0)
{
    //doStuff
    list.addAll(Object.subObjects);
}
But I would really like to not add duplicate subObjects.
Of course I could just check whether list.contains(each subObject) before I added it.
But I would love to use a Set to accomplish that cleaner.
So basically, is there any way to append to a set while iterating over it, or is there an easier way to make a List act like a set, rather than manually checking .contains()?
Any comments are appreciated.
Thanks
I would use two data structures --- a queue (e.g. ArrayDeque) for storing objects whose subobjects are to be visited, and a set (e.g. HashSet) for storing all visited objects without duplication.
Set visited = new HashSet();   // all visited objects
Queue next = new ArrayDeque(); // objects whose subobjects are to be visited
// NOTE: At all times, the objects in "next" are contained in "visited"

// add the first object
visited.add(obj);
Object nextObject = obj;
while (nextObject != null)
{
    // do stuff to nextObject
    for (Object o : nextObject.subobjects)
    {
        boolean fresh = visited.add(o);
        if (fresh)
        {
            next.add(o);
        }
    }
    nextObject = next.poll(); // removes the next object to visit, null if empty
}
// Now, "visited" contains all the visited objects
NOTES:
ArrayDeque is a space-efficient queue. It is implemented as a cyclic array, which means you use less space than a List that keeps growing when you add elements.
"boolean fresh = visited.add(o)" combines "boolean fresh = !visited.contains(o)" and "if (fresh) visited.add(o)".
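The add-returns-boolean idiom in that note can be seen directly in a two-line check:

```java
import java.util.HashSet;
import java.util.Set;

public class AddIdiomDemo {
    public static void main(String[] args) {
        Set<String> visited = new HashSet<>();
        // Set.add returns whether the set changed, combining contains() and add().
        System.out.println(visited.add("a")); // true: "a" was not yet in the set
        System.out.println(visited.add("a")); // false: duplicate, set unchanged
    }
}
```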
I think your problem is inherently a problem that needs to be solved via a List. If you think about it, your Set version of the solution is just converting the items into a List then operating on that.
Of course, List.contains() is a slow operation in comparison to Set.contains(), so it may be worth coming up with a hybrid if speed is a concern:
while (list.size() > 0)
{
    //doStuff
    for each subObject
    {
        if (!set.contains(subObject))
        {
            list.add(subObject);
            set.add(subObject);
        }
    }
}
This solution is fast and also conceptually sound - the Set can be thought of as a list of all items seen, whereas the List is a queue of items to work on. It does take up more memory than using a List alone, though.
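A self-contained sketch of that hybrid (the numeric "child" expansion here stands in for Object.subObjects, which the question leaves abstract):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WorklistDemo {
    public static void main(String[] args) {
        // Hypothetical expansion: each value n yields children 2n and 2n+1, capped at 20.
        List<Integer> queue = new ArrayList<>(Arrays.asList(1)); // items to work on
        Set<Integer> seen = new HashSet<>(queue);                // all items ever seen

        for (int i = 0; i < queue.size(); i++) { // an indexed loop tolerates appends
            int current = queue.get(i);
            // doStuff(current) would go here
            for (int child : new int[]{2 * current, 2 * current + 1}) {
                // seen.add returns false for duplicates, so each item is queued once
                if (child <= 20 && seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        System.out.println(queue); // visits 1..20, each exactly once
    }
}
```

Using an index-based loop instead of an Iterator sidesteps the ConcurrentModificationException that appending during iteration would otherwise cause.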
If you do not use a List, the iterator will throw a ConcurrentModificationException as soon as you read from it after modifying the set. I would recommend using a List and enforcing insertion limits, then using a ListIterator, as that will allow you to modify the list while iterating over it.
HashSet nextObjects = new HashSet();
HashSet currentObjects = new HashSet(firstObject.subObjects);
while (currentObjects.size() > 0)
{
    Iterator iter = currentObjects.iterator();
    while (iter.hasNext())
    {
        Object current = iter.next();
        //doStuff
        nextObjects.addAll(current.subObjects);
    }
    currentObjects = nextObjects;
    nextObjects = new HashSet();
}
I think something like this will do what I want. I'm not concerned that the first Set contains duplicates, only that the subObjects may point to the same objects.
Use more than one set and do it in "rounds":
/* very pseudo-code */
doFunction(Object o) {
    Set processed = new HashSet();
    Set toProcess = new HashSet();
    Set processNext = new HashSet();
    toProcess.add(o);
    while (toProcess.size() > 0) {
        for (it = toProcess.iterator(); it.hasNext();) {
            Object current = it.next();
            doStuff(current);
            processNext.addAll(current.subObjects);
        }
        processed.addAll(toProcess);
        toProcess = processNext;
        toProcess.removeAll(processed);
        processNext = new HashSet();
    }
}
}
Why not create an additional set that contains the entire set of objects? You can use that for lookups.