Java Collections containsAll Weird Behavior

I have the following code, in which I use superList and subList and want to check that subList is actually a sub-list of superList.
My objects do not implement hashCode or equals. I have recreated the situation in a test. When I run it, the results show a very big performance difference between the JDK collections and Commons Collections. After running the test I get the following output:
Time Lapsed with Java Collection API 8953 MilliSeconds & Result is true
Time Lapsed with Commons Collection API 78 MilliSeconds & Result is true
My question is: why is the Java collection so slow at the containsAll operation? Am I doing something wrong? I have no control over the collection types; I get them from legacy code. I know if I use HashSet for superList then I would get big performance gains using JDK containsAll operation, but unfortunately that is not possible for me.
package com.mycompany.tests;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import org.apache.commons.collections.CollectionUtils;
import org.junit.Before;
import org.junit.Test;
public class CollectionComparison_UnitTest {
    private Collection<MyClass> superList = new ArrayList<MyClass>();
    private Collection<MyClass> subList = new HashSet<MyClass>(50000);
    @Before
    public void setUp() throws Exception {
        // Both collections receive the same 50,000 instances.
        for (int i = 0; i < 50000; i++) {
            MyClass myClass = new MyClass(i + "A String");
            superList.add(myClass);
            subList.add(myClass);
        }
    }
    @Test
    public void testIt() {
        // JDK: ArrayList.containsAll does a linear search per element.
        long startTime = System.currentTimeMillis();
        boolean isSubList = superList.containsAll(subList);
        System.out.println("Time Lapsed with Java Collection API "
                + (System.currentTimeMillis() - startTime)
                + " MilliSeconds & Result is " + isSubList);
        // Commons Collections: isSubCollection(sub, super).
        startTime = System.currentTimeMillis();
        isSubList = CollectionUtils.isSubCollection(subList, superList);
        System.out.println("Time Lapsed with Commons Collection API "
                + (System.currentTimeMillis() - startTime)
                + " MilliSeconds & Result is " + isSubList);
    }
}
// No equals/hashCode overrides, so instances are compared by identity.
class MyClass {
    String myString;
    MyClass(String myString) {
        this.myString = myString;
    }
    String getMyString() {
        return myString;
    }
}

Different algorithms:
ArrayList.containsAll() runs in O(N*N), while CollectionUtils.isSubCollection() runs in O(N+N+N), which is linear in N.

ArrayList.containsAll is inherited from AbstractCollection.containsAll and is a simple loop that checks each element in turn. Each step is a slow linear search. I don't know how CollectionUtils works, but it's not hard to do much better than the simple loop. Converting the list being searched to a HashSet is a sure win. Sorting both lists and going through them in parallel could be even better.
EDIT:
The CollectionUtils source code makes it clear. They're converting both collections to "cardinality maps", which is a simple and general approach that works for many operations. In some cases it may not be a good idea, e.g., when the first list is empty or very short, you in fact lose time. In your case it's a huge win in comparison to AbstractCollection.containsAll, but you could do even better.
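To make the "cardinality map" idea concrete, here is a minimal sketch in plain Java. It is not the Commons Collections source; the class and method names are made up for illustration. It counts how often each element occurs in both collections and then compares the counts, so each collection is traversed only once.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CardinalitySketch {

    // Count how often each element occurs (relies on equals/hashCode of T).
    public static <T> Map<T, Integer> cardinalityMap(Collection<T> coll) {
        Map<T, Integer> counts = new HashMap<>();
        for (T item : coll) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // True if every element of sub occurs in sup at least as often as in sub.
    public static <T> boolean isSubCollection(Collection<T> sub, Collection<T> sup) {
        Map<T, Integer> subCounts = cardinalityMap(sub);
        Map<T, Integer> supCounts = cardinalityMap(sup);
        for (Map.Entry<T, Integer> e : subCounts.entrySet()) {
            if (e.getValue() > supCounts.getOrDefault(e.getKey(), 0)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> sup = Arrays.asList("a", "b", "b", "c");
        System.out.println(isSubCollection(Arrays.asList("b", "b"), sup));      // true
        System.out.println(isSubCollection(Arrays.asList("b", "b", "b"), sup)); // false
    }
}
```

Note that this check is stricter than containsAll when duplicates are involved: the second call fails because "b" occurs three times in the candidate but only twice in the larger list.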
Addendum years later
The OP wrote
I know if I use HashSet for superList then I would get big performance gains using JDK containsAll operation, but unfortunately that is not possible for me.
and that's wrong. Classes without hashCode and equals inherit them from Object, so they can be used with a HashSet and everything works perfectly. The only catch is that each object is then equal only to itself, which may be unintended and surprising, but the OP's test superList.containsAll(subList) behaves exactly the same way.
So the quick solution would be
new HashSet<>(superList).containsAll(subList)
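A small self-contained sketch of that one-liner follows. The Item class is a hypothetical stand-in for the question's MyClass (no equals or hashCode overrides), and containsAllFast is an illustrative helper name, not a library method.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;

public class HashSetContainsAllSketch {

    // Stand-in for the question's MyClass: inherits equals/hashCode from Object.
    public static class Item {
        final String name;
        public Item(String name) { this.name = name; }
    }

    // Copy superList into a HashSet once; each lookup is then O(1) on average.
    public static <T> boolean containsAllFast(Collection<T> superList, Collection<T> subList) {
        return new HashSet<>(superList).containsAll(subList);
    }

    public static void main(String[] args) {
        List<Item> superList = new ArrayList<>();
        List<Item> subList = new ArrayList<>();
        for (int i = 0; i < 50000; i++) {
            Item item = new Item(i + "A String");
            superList.add(item);
            if (i % 2 == 0) {
                subList.add(item); // same instance, so identity comparison succeeds
            }
        }
        System.out.println(containsAllFast(superList, subList)); // true

        // A new instance with the same text is a different object under identity equality.
        subList.add(new Item("0A String"));
        System.out.println(containsAllFast(superList, subList)); // false
    }
}
```

The second result shows the "each object is unique" caveat: without an equals override, an equal-looking copy is not found, exactly as with the original ArrayList.containsAll.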

You should at least try the tests in the opposite order. Your results may very well just show that the JIT compiler is doing its job well :-)

Related

HashMap performs better than array? [duplicate]

Is it (performance-wise) better to use arrays or HashMaps when the indexes of the array are known? Keep in mind that the 'objects array/map' in the example is just an example; in my real project it is generated by another class, so I can't use individual variables.
ArrayExample:
SomeObject[] objects = new SomeObject[2];
objects[0] = new SomeObject("Obj1");
objects[1] = new SomeObject("Obj2");
void doSomethingToObject(String Identifier) {
    SomeObject object = null;
    if (Identifier.equals("Obj1")) {
        object = objects[0];
    } else if (Identifier.equals("Obj2")) {
        object = objects[1];
    }
    // do stuff
}
HashMapExample:
HashMap objects = new HashMap(); // raw type, hence the cast below
objects.put("Obj1", new SomeObject());
objects.put("Obj2", new SomeObject());
void doSomethingToObject(String Identifier) {
    SomeObject object = (SomeObject) objects.get(Identifier);
    // do stuff
}
The HashMap one looks much much better but I really need performance on this so that has priority.
EDIT: Well, arrays it is then; suggestions are still welcome
EDIT: I forgot to mention, the size of the Array/HashMap is always the same (6)
EDIT: It appears that HashMaps are faster
Array: 128ms
Hash: 103ms
With fewer cycles the HashMap was even twice as fast
test code:
import java.util.HashMap;
import java.util.Random;
public class Optimizationsest {
private static Random r = new Random();
private static HashMap<String,SomeObject> hm = new HashMap<String,SomeObject>();
private static SomeObject[] o = new SomeObject[6];
private static String[] Indentifiers = {"Obj1","Obj2","Obj3","Obj4","Obj5","Obj6"};
private static int t = 1000000;
public static void main(String[] args){
CreateHash();
CreateArray();
long loopTime = ProcessArray();
long hashTime = ProcessHash();
System.out.println("Array: " + loopTime + "ms");
System.out.println("Hash: " + hashTime + "ms");
}
public static void CreateHash(){
for(int i=0; i <= 5; i++){
hm.put("Obj"+(i+1), new SomeObject());
}
}
public static void CreateArray(){
for(int i=0; i <= 5; i++){
o[i]=new SomeObject();
}
}
public static long ProcessArray(){
StopWatch sw = new StopWatch();
sw.start();
for(int i = 1;i<=t;i++){
checkArray(Indentifiers[r.nextInt(6)]);
}
sw.stop();
return sw.getElapsedTime();
}
private static void checkArray(String Identifier) {
SomeObject object;
if(Identifier.equals("Obj1")){
object=o[0];
}else if(Identifier.equals("Obj2")){
object=o[1];
}else if(Identifier.equals("Obj3")){
object=o[2];
}else if(Identifier.equals("Obj4")){
object=o[3];
}else if(Identifier.equals("Obj5")){
object=o[4];
}else if(Identifier.equals("Obj6")){
object=o[5];
}else{
object = new SomeObject();
}
object.kill();
}
public static long ProcessHash(){
StopWatch sw = new StopWatch();
sw.start();
for(int i = 1;i<=t;i++){
checkHash(Indentifiers[r.nextInt(6)]);
}
sw.stop();
return sw.getElapsedTime();
}
private static void checkHash(String Identifier) {
SomeObject object = (SomeObject) hm.get(Identifier);
object.kill();
}
}

HashMap uses an array underneath so it can never be faster than using an array correctly.
Random.nextInt() is many times slower than what you are testing. Even using an array to test an array is going to bias your results.
The reason your array benchmark is so slow is due to the equals comparisons, not the array access itself.
HashTable is usually much slower than HashMap because it does much the same thing but is also synchronized.
A common problem with micro-benchmarks is the JIT, which is very good at removing code that doesn't do anything. If you are not careful, you will only be testing whether you have confused the JIT enough that it cannot work out that your code doesn't do anything.
This is one of the reasons you can write micro-benchmarks which outperform C++ systems. Java is a simpler language and easier to reason about, which makes it easier to detect code that does nothing useful. This can lead to tests which show that Java does "nothing useful" much faster than C++ ;)
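As a rough illustration of these points, the sketch below keeps Random out of the timed loop, warms the code up first, and returns a checksum so that the lookups have an observable result the JIT cannot simply discard. The class and method names are hypothetical, and this is only a hand-rolled approximation; a harness such as JMH handles these pitfalls far more reliably.

```java
import java.util.HashMap;
import java.util.Map;

public class BenchmarkSinkSketch {

    // Performs the lookups and returns a checksum so the work is actually used.
    public static long sumLookups(Map<String, Integer> map, String[] keys, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            // Cycle through the keys instead of calling Random inside the timed loop.
            checksum += map.get(keys[i % keys.length]);
        }
        return checksum;
    }

    public static void main(String[] args) {
        String[] keys = {"Obj1", "Obj2", "Obj3", "Obj4", "Obj5", "Obj6"};
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i);
        }

        // Warm-up rounds give the JIT a chance to compile the hot loop before timing.
        long sink = 0;
        for (int round = 0; round < 5; round++) {
            sink += sumLookups(map, keys, 1_000_000);
        }

        long start = System.nanoTime();
        sink += sumLookups(map, keys, 1_000_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Printing the checksum keeps the result live.
        System.out.println("elapsed ms: " + elapsedMillis + " (checksum " + sink + ")");
    }
}
```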

Arrays are faster when the indexes are known (HashMap uses an array of linked lists behind the scenes, which adds a bit of overhead on top of the array accesses, not to mention the hashing operations that need to be done).
And FYI, HashMap<String,SomeObject> objects = new HashMap<String,SomeObject>(); makes it so you won't have to cast.

For the example shown, HashTable wins, I believe. The problem with the array approach is that it doesn't scale. I imagine you want to have more than two entries in the table, and the conditional branch tree in doSomethingToObject will quickly get unwieldy and slow.

Logically, HashMap is definitely a fit in your case. From a performance standpoint it also wins, since with arrays you need to do a number of string comparisons (in your algorithm), while with a HashMap you just use a hash code, provided the load factor is not too high. Both an array and a HashMap need to be resized if you add many elements, but a HashMap also needs to redistribute its elements; in that use case the HashMap loses.

Arrays will usually be faster than Collections classes.
PS. You mentioned HashTable in your post. HashTable has even worse performance than HashMap. I assume your mention of HashTable was a typo:
"The HashTable one looks much much better"

The example is strange. The key question is whether your data is dynamic. If it is, you could not write your program that way (as in the array case). In other words, the comparison between your array and hash implementations is not fair. The hash implementation works for dynamic data, but the array implementation does not.
If you only have static data (6 fixed objects), the array or hash just works as a data holder. You could even define static objects.

Java 8: get average of more than one attribute [duplicate]

This question already has answers here:
How to compute average of multiple numbers in sequence using Java 8 lambda
(7 answers)
Closed 6 years ago.
Given the following class, I want to get the average of foo and bar in List<HelloWorld> helloWorldList:
@Data
public class HelloWorld {
    private Long foo;
    private Long bar;
}
OPTION 1: JAVA
long fooSum = 0, barSum = 0; // locals must be initialized before +=
for (HelloWorld hw : helloWorldList) {
    fooSum += hw.getFoo();
    barSum += hw.getBar();
}
long fooAvg = fooSum / helloWorldList.size(); // integer division
long barAvg = barSum / helloWorldList.size();
OPTION 2 : JAVA 8
// OptionalDouble.orElse takes a double, so 0.0 is used as the fallback for an empty list
double fooAvg = helloWorldList.stream().mapToLong(HelloWorld::getFoo).average().orElse(0.0);
double barAvg = helloWorldList.stream().mapToLong(HelloWorld::getBar).average().orElse(0.0);
Which approach is better ?
Is there any better way to get these values ?
Answer edit: This question has been marked duplicate, but after reading comments from bradimus I ended up implementing this:
import java.util.function.Consumer;
import lombok.Getter;
public class HelloWorldSummaryStatistics implements Consumer<HelloWorld> {
    // long totals avoid silently narrowing the Long field values to int
    @Getter
    private long fooTotal = 0;
    @Getter
    private long barTotal = 0;
    @Getter
    private int count = 0;
    public HelloWorldSummaryStatistics() {
    }
    // Accumulator: called once per stream element.
    @Override
    public void accept(HelloWorld helloWorld) {
        fooTotal += helloWorld.getFoo();
        barTotal += helloWorld.getBar();
        count++;
    }
    // Combiner: merges partial results, e.g. from parallel streams.
    public void combine(HelloWorldSummaryStatistics other) {
        fooTotal += other.fooTotal;
        barTotal += other.barTotal;
        count += other.count;
    }
    public final double getFooAverage() {
        return getCount() > 0 ? (double) getFooTotal() / getCount() : 0.0d;
    }
    public final double getBarAverage() {
        return getCount() > 0 ? (double) getBarTotal() / getCount() : 0.0d;
    }
    @Override
    public String toString() {
        return String.format(
                "%s{count=%d, fooAverage=%f, barAverage=%f}",
                this.getClass().getSimpleName(),
                getCount(),
                getFooAverage(),
                getBarAverage());
    }
}
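Main Class (this needs an all-args constructor on HelloWorld, which @Data alone does not generate; Lombok's @AllArgsConstructor is one way to get it):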
HelloWorld a = new HelloWorld(5L, 1L);
HelloWorld b = new HelloWorld(5L, 2L);
HelloWorld c = new HelloWorld(5L, 4L);
List<HelloWorld> hwList = Arrays.asList(a, b, c);
HelloWorldSummaryStatistics helloWorldSummaryStatistics = hwList.stream()
.collect(HelloWorldSummaryStatistics::new, HelloWorldSummaryStatistics::accept, HelloWorldSummaryStatistics::combine);
System.out.println(helloWorldSummaryStatistics);
Note: As suggested by others, if you need high precision, BigInteger etc. can be used.

The answers/comments you got so far don't mention one advantage of a streams-based solution: just by changing stream() to parallelStream() you could turn the whole thing into a multi-threaded solution.
Try doing that with "option 1" and see how much work it would need.
Of course, that would mean even more "overhead" in terms of "things going on behind the covers costing CPU cycles", but if you are talking about large datasets it might actually benefit you.
At least you could very easily see how turning on parallelStream() would influence execution time!
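A minimal sketch of that switch is shown below. The nested HelloWorld class is a hand-written stand-in for the question's Lombok class, and the average helper with its boolean flag is purely illustrative; whether the parallel variant is actually faster depends on the data size and should be measured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

public class ParallelAverageSketch {

    // Hand-written stand-in for the question's Lombok-generated HelloWorld class.
    public static class HelloWorld {
        private final Long foo;
        private final Long bar;
        public HelloWorld(Long foo, Long bar) { this.foo = foo; this.bar = bar; }
        public Long getFoo() { return foo; }
        public Long getBar() { return bar; }
    }

    // The only difference between the two modes is stream() versus parallelStream().
    public static double average(List<HelloWorld> list,
                                 ToLongFunction<HelloWorld> getter,
                                 boolean parallel) {
        return (parallel ? list.parallelStream() : list.stream())
                .mapToLong(getter)
                .average()
                .orElse(0.0); // fallback for an empty list
    }

    public static void main(String[] args) {
        List<HelloWorld> list = Arrays.asList(
                new HelloWorld(5L, 1L), new HelloWorld(5L, 2L), new HelloWorld(5L, 4L));
        System.out.println(average(list, HelloWorld::getFoo, false)); // 5.0
        System.out.println(average(list, HelloWorld::getFoo, true));  // 5.0
        System.out.println(average(list, HelloWorld::getBar, true));
    }
}
```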

If you want the average of a list of integers, the classic approach of iterating is better.
Streams have some overhead, and the JVM has to load the classes used by streams. But the JVM also has a JIT with lots of optimizations.
Please beware of incorrect benchmarking. Use JMH.
Streams are good and effective when the per-element operation is more than something as simple as summing two integers.
Streams also allow you to parallelize code. There is no hard criterion for when parallel beats single-threaded. My rule of thumb: if a function call takes over 100ms, you can parallelize it.
So, if processing your dataset takes >100ms, try parallelStream.
If not, iterate.
P.S. Doug Lea - "When to use parallel streams"

Which approach is better ?
When you say "better", do you mean "closer to the sample's true average" or "more efficient" or what? If efficiency is your goal, streams entail a fair amount of overhead that is often ignored. However, they provide readability and more concise code. It depends upon what you're trying to maximize, how large your datasets are, etc.
Perhaps rephrase the question?

What is better for the performance CollectionUtils.isEmpty() or collection.isEmpty()

What is better for the performance if you already know that the collection isn’t null.
Using !collection.isEmpty() or CollectionUtils.isNotEmpty(collection) from the Apache Commons lib?
Or isn’t there any performance difference?

The code of CollectionUtils.isNotEmpty (assuming we are talking about Apache Commons here)...
public static boolean isEmpty(Collection coll)
{
return ((coll == null) || (coll.isEmpty()));
}
public static boolean isNotEmpty(Collection coll)
{
return (!(isEmpty(coll)));
}
...so, not really a difference, that one null check will not be your bottleneck ;-)

The other answers are correct, but just to be sure about it:
Why do you care at all? Does your application have a performance problem; and careful profiling pointed to that method; so you are looking for alternatives?
Because ... if not ... then it could be that we are looking at
PrematureOptimization.
And one other aspect: if the "java standard libraries" provide a feature, I would always prefer them over something coming from an "external library".
Of course, Apache Commons is quite common nowadays, but I would only add the dependency on it if all my other code is already using it.

The difference is negligible (an extra null check), and all the calls can easily be inlined even by the C1 compiler. In general you should not worry about the performance of such simple methods. Even if one of them is twice as slow, it's still blazingly fast compared to the rest of the code in your application.

CollectionUtils.isEmpty, which is defined in the Apache libs, indirectly uses the collection.isEmpty() method anyway.
There is no noticeable difference between the two; it's just that CollectionUtils.isEmpty is null-safe. Since you say you already know the collection is not null, both are (almost) equally good.
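To show what "null-safe" means here without pulling in the library, the following sketch re-implements the check in plain Java. The class name is hypothetical and the method merely mirrors the Commons code quoted earlier in this thread.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class NullSafeEmptySketch {

    // Plain-Java equivalent of the null-safe check: null counts as empty.
    public static boolean isEmpty(Collection<?> coll) {
        return coll == null || coll.isEmpty();
    }

    public static void main(String[] args) {
        List<String> none = null;
        System.out.println(isEmpty(none));                           // true
        System.out.println(isEmpty(Collections.emptyList()));        // true
        System.out.println(isEmpty(Collections.singletonList("x"))); // false

        // Calling isEmpty() directly on a null reference fails instead.
        try {
            none.isEmpty();
        } catch (NullPointerException e) {
            System.out.println("direct call on null throws NullPointerException");
        }
    }
}
```

If the collection is already known to be non-null, the extra branch buys nothing, which is why collection.isEmpty() is the simpler choice in that case.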

With the following program you can see clear results for a List of 1000 Integers.
Note: times are in milliseconds.
collection.isEmpty takes almost 0 milliseconds
CollectionUtils.isNotEmpty takes 78 milliseconds
public static void main(String[] args){
List<Integer> list = new ArrayList<Integer>();
for(int i = 0; i<1000; i++)
list.add(i);
long startTime = System.currentTimeMillis();
list.isEmpty();
long endTime = System.currentTimeMillis();
long totalTime = endTime - startTime;
System.out.println(totalTime);
long startTime2 = System.currentTimeMillis();
CollectionUtils.isNotEmpty(list);
long endTime2 = System.currentTimeMillis();
long totalTime2 = endTime2 - startTime2;
System.out.println(totalTime2);
}

Easiest way to test if two collections have the same contents?

Very often when writing tests I have to check whether two collections have the same contents, and sometimes even whether they have the same order. So I endlessly end up doing the same thing:
assertEquals(collection1.size(), collection2.size());
for (ItemType item : collection1){
if (!collection2.contains(item)) fail(); //This depends on the collection
}
//some more code is required to test ordering
Do you know of a good way to end this torment using some standard library?

Better to use the equals() method: with containsAll, two lists that hold the same elements in a different order are still reported as matching. So containsAll is not a good way to compare Lists when order matters.
Here is a demo:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
public class TwoButtons {
public static void main(String[] args){
Collection<Integer> c1 = new ArrayList<>(Arrays.asList(1,2,3));
Collection<Integer> c2 = new ArrayList<>(Arrays.asList(3,2,1));
System.out.println("equals " + c1.equals(c2));
System.out.println("containsAll " + c1.containsAll(c2));
}
}
output
equals false
containsAll true

You can use this as a condition:
collection1.size()==collection2.size() && collection1.containsAll(collection2)
Here you are checking that both collections are the same size and that the first contains all the elements of the second.
As per the comments by joachim-isaksson,
you can also do this, but it will not be efficient:
collection1.containsAll(collection2) && collection2.containsAll(collection1)
You should use
ReflextionAsserter
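One caveat with the size-plus-containsAll condition is that it can be fooled by duplicates. The sketch below, with hypothetical names, compares element counts instead, which gives an order-insensitive check that still respects how often each element occurs. It assumes the elements have sensible equals and hashCode implementations.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SameContentsSketch {

    // Count occurrences of each element.
    private static <T> Map<T, Long> counts(Collection<T> coll) {
        Map<T, Long> result = new HashMap<>();
        for (T item : coll) {
            result.merge(item, 1L, Long::sum);
        }
        return result;
    }

    // True if both collections hold the same elements the same number of times, in any order.
    public static <T> boolean sameContentsIgnoringOrder(Collection<T> a, Collection<T> b) {
        return a.size() == b.size() && counts(a).equals(counts(b));
    }

    public static void main(String[] args) {
        List<Integer> x = Arrays.asList(1, 1, 2);
        List<Integer> y = Arrays.asList(1, 2, 2);
        // Same size and mutual containsAll, yet the contents differ.
        System.out.println(x.size() == y.size() && x.containsAll(y) && y.containsAll(x)); // true
        System.out.println(sameContentsIgnoringOrder(x, y));                              // false
        System.out.println(sameContentsIgnoringOrder(
                Arrays.asList(1, 2, 3), Arrays.asList(3, 2, 1)));                         // true
    }
}
```

When order matters as well, List.equals already covers it, as the demo in the earlier answer shows.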

Is there a way to test for enum value in a list of candidates? (Java)

This is a simplified example. I have this enum declaration as follows:
public enum ELogLevel {
None,
Debug,
Info,
Error
}
I have this code in another class:
if ((CLog._logLevel == ELogLevel.Info) || (CLog._logLevel == ELogLevel.Debug) || (CLog._logLevel == ELogLevel.Error)) {
System.out.println(formatMessage(message));
}
My question is whether there is a way to shorten the test. Ideally I would like something to the tune of (this is borrowed from Pascal/Delphi):
if (CLog._logLevel in [ELogLevel.Info, ELogLevel.Debug, ELogLevel.Error])
Instead of the long list of comparisons. Is there such a thing in Java, or maybe a way to achieve it? I am using a trivial example, my intention is to find out if there is a pattern so I can do these types of tests with enum value lists of many more elements.
EDIT: It looks like EnumSet is the closest thing to what I want. The naive way of implementing it is via something like:
if (EnumSet.of(ELogLevel.Info, ELogLevel.Debug, ELogLevel.Error).contains(CLog._logLevel))
But under benchmarking, this performs two orders of magnitude slower than the long if/then statement, I guess because the EnumSet is being instantiated every time it runs. This is a problem only for code that runs very often, and even then it's a very minor problem, since over 100M iterations we are talking about 7ms vs 450ms on my box; a very minimal amount of time either way.
What I settled on for code that runs very often is to pre-instantiate the EnumSet in a static variable, and use that instance in the loop, which cuts down the runtime back down to a much more palatable 9ms over 100M iterations.
So it looks like we have a winner! Thanks guys for your quick replies.
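For reference, the pre-instantiated variant described above might look roughly like this. The constant name PRINTABLE_LEVELS and the shouldPrint helper are made up for illustration, and wrapping the set with Collections.unmodifiableSet is an extra precaution that the original post does not mention.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

public class LogLevelFilterSketch {

    public enum ELogLevel { None, Debug, Info, Error }

    // Built once when the class is loaded, instead of on every call.
    private static final Set<ELogLevel> PRINTABLE_LEVELS =
            Collections.unmodifiableSet(
                    EnumSet.of(ELogLevel.Info, ELogLevel.Debug, ELogLevel.Error));

    public static boolean shouldPrint(ELogLevel level) {
        return PRINTABLE_LEVELS.contains(level);
    }

    public static void main(String[] args) {
        for (ELogLevel level : ELogLevel.values()) {
            System.out.println(level + " -> " + shouldPrint(level));
        }
    }
}
```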

What you want is an EnumSet:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/EnumSet.html
Put the elements you want to test for in the set, and then use the Set method contains().
import java.util.EnumSet;
public class EnumSetExample
{
enum Level { NONE, DEBUG, INFO, ERROR };
public static void main(String[] args)
{
EnumSet<Level> subset = EnumSet.of(Level.DEBUG, Level.INFO);
for(Level currentLevel : EnumSet.allOf(Level.class))
{
if (subset.contains(currentLevel))
{
System.out.println("we have " + currentLevel.toString());
}
else
{
System.out.println("we don't have " + currentLevel.toString());
}
}
}
}

There's no way to do it concisely in Java. The closest you can come is to dump the values in a set and call contains(). An EnumSet is probably most efficient in your case. You can shorten the set initialization a little using the double brace idiom, though this has the drawback of creating a new inner class each time you use it, and hence increases the memory usage slightly.

In general, logging levels are implemented as integers:
public static final int LEVEL_NONE = 0;
public static final int LEVEL_DEBUG = 1;
public static final int LEVEL_INFO = 2;
public static final int LEVEL_ERROR = 3;
and then you can test for severity using simple comparisons:
if (CLog._logLevel >= LEVEL_DEBUG) {
    // log
}

You could use a list of required levels, ie:
List<ELogLevel> levels = Lists.newArrayList(ELogLevel.Info,
ELogLevel.Debug, ELogLevel.Error);
if (levels.contains(CLog._logLevel)) {
//
}
