List size vs Set size doesn't match - java

When creating large lists I ran into something odd. I created sublists, because the entire list was too large. But when checking the resulting sizes I found:
new ArrayList<>(rSet).size() != rSet.size();
where rSet is a HashSet.
When I pause Eclipse and investigate, I see that rSet contains 1000 items, yet .size() reports fewer (the size of the discrepancy fluctuates; sometimes rSet.size() is even higher than the number of items the set actually contains). I cannot reproduce this in a separate test case, and the real code has too many layers to post here. But the set is filled from separate threads, which have all finished by the time size() is called.
As I said, I fill it from threads: I pass the Set rSet as a parameter to all threads, and use the following method to add new items to the set:
public static void addSynchronized(final Set<?> c, final List<?> items) {
    c.addAll(items);
}
I must be doing something the code disagrees with... But what?

is filled from separate threads
I think there's your problem. HashSet is not thread-safe. When writing to it from multiple threads at the same time, anything could happen.
To make it synchronized (from the docs):
Set s = Collections.synchronizedSet(new HashSet(...));
Your addSynchronized method has a misleading name, because nothing in it is actually synchronized. (The terse parameter name c for the set doesn't help readability either.)
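To illustrate, here is a minimal sketch of the fix, reusing the question's names (rSet, addSynchronized); the thread count and item values are invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SafeAdd {
    // The question's helper, made genuinely safe by requiring callers to pass
    // a synchronized (or concurrent) set.
    public static <T> void addSynchronized(final Set<T> c, final List<? extends T> items) {
        c.addAll(items);
    }

    public static void main(String[] args) throws InterruptedException {
        // Wrapping the HashSet is the actual fix from the answer above.
        final Set<Integer> rSet = Collections.synchronizedSet(new HashSet<>());
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < 10; t++) {
            final int base = t * 100;
            Thread th = new Thread(() -> {
                List<Integer> items = new ArrayList<>();
                for (int i = 0; i < 100; i++) {
                    items.add(base + i);
                }
                addSynchronized(rSet, items);
            });
            threads.add(th);
            th.start();
        }
        for (Thread th : threads) {
            th.join(); // all writers are done before we read sizes
        }
        System.out.println(rSet.size());                  // 1000
        System.out.println(new ArrayList<>(rSet).size()); // 1000
    }
}
```

Without the Collections.synchronizedSet wrapper, concurrent addAll calls can corrupt the HashSet's internal table, producing exactly the fluctuating sizes described in the question.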

Related

Might introducing an intermediate list cause performance overhead?

List<UserData> dataList = new ArrayList<>();
List<UserData> dataList1 = dataRepository.findAllByProcessType(ProcessType.OUT);
List<UserData> dataList2 = dataRepository.findAllByProcessType(ProcessType.CORPORATE_OUT);
dataList.addAll(dataList1);
dataList.addAll(dataList2);
return dataList;
vs
List<UserData> dataList = new ArrayList<>();
dataList.addAll(dataRepository.findAllByProcessType(ProcessType.OUT));
dataList.addAll(dataRepository.findAllByProcessType(ProcessType.CORPORATE_OUT));
return dataList;
Will the first implementation cause any performance overhead (i.e. more garbage / memory allocation) than the second one?
P.S. Yes, it can be optimised into one round trip to the DB, as mentioned by @Tim, but that's not the answer I am looking for. In general I want to know whether this type of implementation causes overhead, because this style helps with debugging.
I'm going to say no, on the basis that I would be very surprised if the two code blocks produce different bytecode.
The first code does not "introduce an intermediate list". All it does is create new variables to reference lists that were created by the dataRepository call. I would expect the compiler to simply optimise those variables out.
Those lists are also created in the second code example, so there's no real difference.
Knowing that the compiler performs these sorts of optimisations frees us as programmers to write code that is well laid-out, clear, and maintainable, whilst still remaining confident that it will perform well.
The other consideration is debugging. In the first code block, it is easy to set breakpoints on the variable declaration lines and inspect the values of the variables. Those simple operations become a pain when the code is written as in the second block.
As addAll() just copies references to the same data, both of your versions should perform about the same. But the best thing to do here is to avoid the two unnecessary round trips to your database and use a single query:
List<ProcessType> types = Arrays.asList(ProcessType.OUT, ProcessType.CORPORATE_OUT);
List<UserData> dataList = dataRepository.findAllByProcessTypeIn(types);

Safely publish array elements

This question is concerned with safe publication of array contents within multithreaded Java programs.
Let's assume I have some arbitrary array of Objects:
Object[] myArray = ...
Now this array is handed over to another thread, maybe like this:
new Thread() {
    public void run() {
        // ...
        Object o = myArray[0];
        // ...
    }
}.start();
My question is, will the new Thread observe the values within the array as 'expected' if no further synchronization is in place? Does this depend on whether the array itself is a (final/volatile) field or a local variable? Are subsequent modifications of the array from the first thread immediately visible to the new thread?
What would be the most efficient way of safely publishing the array's elements?
The exact answer depends on the mutability of the array and its contents, the number of writers and readers, whether your only option is using an array (and not a thread-safe Collection of the same elements), etc.
If I was in your shoes, I'd probably spend an evening searching in StackOverflow and trying out some fancy combination of patterns (maybe final array reference plus immutable wrappers of the array elements, or a custom update scheme using Unsafe.putOrderedObject if there's a single writer).
Then after a short internal struggle, I'd copy all of it in my "Playground" folder and use an AtomicReferenceArray or a CopyOnWriteArrayList (or another appropriate off-the-shelf solution).
I'd also try to remind myself that I need to worry about performance when I've addressed all of my bigger concerns (like correctness), and when I have a proof that this specific part of my program needs to be optimized. Hopefully a similar approach will work for you too.
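For completeness, a minimal sketch of the off-the-shelf route mentioned above: AtomicReferenceArray gives volatile read/write semantics per element, so an element set in one thread is safely visible to another (the array size and value here are arbitrary):

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

public class PublishDemo {
    public static void main(String[] args) throws InterruptedException {
        // Each element behaves like a volatile field: a set() in one thread
        // happens-before a subsequent get() of that element in another.
        AtomicReferenceArray<String> arr = new AtomicReferenceArray<>(4);
        arr.set(0, "hello");

        Thread reader = new Thread(() -> System.out.println(arr.get(0)));
        reader.start();
        reader.join();
    }
}
```

A plain Object[] field offers none of these guarantees for subsequent writes, even if the reference itself is final; final only guarantees visibility of values written before the constructor finishes.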

ArrayList vs Vector performance in single-threaded application

I was just looking for an answer to the question of why ArrayList is faster than Vector, and I found that ArrayList is faster because it is not synchronized.
So my doubts are:
If ArrayList is not synchronized, why would we use it in a multithreaded environment and compare it with Vector?
If we are in a single-threaded environment, how does the performance of Vector decrease, given that there is no synchronization going on, as we are dealing with a single thread?
Why should we compare their performance, considering the above points?
Please guide me :)
a) Methods using ArrayList in a multithreaded program may be synchronized.
class X {
    List l = new ArrayList();
    synchronized void add(Object e) {
        l.add(e);
    }
    // ...
}
b) We can use ArrayList without exposing it to other threads, this is when ArrayList is referenced only from local variables
void x() {
    List l = new ArrayList(); // no other thread except the current one can access l
    // ...
}
Even in a single-threaded environment, entering a synchronized method takes a lock; this is where we lose performance:
public synchronized boolean add(E e) { // the current thread takes a lock here
    modCount++;
    // ...
}
You can use ArrayList in a multithread environment if the list is not shared between threads.
If the list is shared between threads you can synchronize the access to that list.
Alternatively, you can use Collections.synchronizedList() to get a List that can be used in a thread-safe way.
Vector is an old implementation of a synchronized List that is no longer used, because its internal implementation basically synchronizes every method. Generally you want to synchronize a sequence of operations; otherwise you can get a ConcurrentModificationException when another thread modifies the list while you are iterating over it. In addition, synchronizing every method is not good from a performance point of view.
Moreover, even in a single-threaded environment, calling a synchronized method still has to perform some extra work, so even in a single-threaded application Vector is not a good choice.
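A small sketch of the Collections.synchronizedList() approach: individual calls are synchronized for you, but a compound operation such as iteration still needs the list's own lock, as the Javadoc requires (the values are invented for the example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SyncListDemo {
    public static void main(String[] args) {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());
        list.add(1);
        list.add(2);

        // Each call above is synchronized internally, but iteration is a
        // sequence of calls, so the Javadoc requires holding the list's lock:
        int sum = 0;
        synchronized (list) {
            for (int n : list) {
                sum += n;
            }
        }
        System.out.println(sum); // 3
    }
}
```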
Just because a component is single-threaded doesn't mean that it cannot be used in a thread-safe context. Your application may have its own locking, in which case additional locking is redundant work.
Conversely, just because a component is thread-safe, it doesn't mean that you cannot use it in an unsafe manner. Typically thread safety extends to a single operation. E.g. if you take an Iterator of a collection and call next() on it, that is two operations, and they are no longer thread-safe when used in combination. You still have to use locking for Vector. Another simple example is:
private Vector<Integer> vec = new Vector<>();

vec.add(1);
int n = vec.remove(vec.size() - 1);
assert n == 1;
This is at least three operations; however, the number of things which can go wrong is far greater than you might suppose. This is why you end up doing your own locking, and why the locking inside Vector might be redundant, even unwanted.
For your own interest:
vec can change at any point to another Vector or to null.
vec.add(2) can happen between any two operations, changing the size and the last element.
vec.remove() can happen between any two operations.
vec.add(null) can happen between any two operations, resulting in a possible NullPointerException.
The vec can /* change */ in these places:
private Vector<Integer> vec = new Vector<>();

vec.add(1); /* change */
int n = vec.remove(vec.size() /* change */ - 1);
assert n == 1;
In short, assuming that just because you used a thread safe collection your code is now thread safe is a big assumption.
A common pattern which breaks is
for (int n : vec) {
    // do something.
}
Looks harmless enough, except that it is equivalent to:
for (Iterator<Integer> iter = vec.iterator(); /* change */ iter.hasNext(); ) {
    /* change */ int n = iter.next();
}
I have marked with /* change */ where another thread could change the collection, meaning this loop can get a ConcurrentModificationException (but might not).
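A sketch of the fix for the compound-operation example above: hold the Vector's own monitor (which its synchronized methods also use) across the whole check-then-act sequence:

```java
import java.util.Vector;

public class VectorCompound {
    public static void main(String[] args) {
        Vector<Integer> vec = new Vector<>();
        vec.add(1);

        // Vector locks each call on its own monitor, so synchronizing on vec
        // makes the size() + remove() pair atomic with respect to other
        // threads that use the Vector's methods.
        int n;
        synchronized (vec) {
            n = vec.remove(vec.size() - 1);
        }
        System.out.println(n); // 1
    }
}
```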
there is no Synchronization
The JVM doesn't know there is no need for synchronization and so it still has to do something. It has an optimisation to reduce the cost of uncontended locks, but it still has to do work.
You need to understand the basic concept to answer your questions.
When we say ArrayList is not synchronized and Vector is, we mean that the methods in Vector (like add(), get(), remove(), etc.) are synchronized, while the corresponding methods in ArrayList are not. These methods act upon the stored data.
So the data in a Vector cannot be read or written in parallel, because add, get, and remove are synchronized, while the same operations on an ArrayList can proceed without taking a lock.
This absence of locking makes ArrayList fast and Vector slow. That behavior remains the same whether you use them in a multithreaded or a single-threaded environment.
Hope this answers your question.

Java not giving stable output

I'm trying to make a method which would do transformation from initial frame of reference to Center of Mass frame.
The set used in the method contains a number of objects which among other properties have mass, position and velocity (if that helps).
/**
* Method used to set origin at the Center of Mass (COM) of the system. This
* method calculates position(r) and velocity(v) of the COM in initial
* frame, then uses them to transform v and r of each particle in the system
* to COM frame.
*
* @param astroObjSet
* a set containing massive particles
*/
private static void COM(Set<Particle> astroObjSet) {
    // temporary variables used in the method
    Vector velSum = new Vector(); // sum of velocities multiplied by mass
    Vector posSum = new Vector(); // sum of positions multiplied by mass
    double totalMass = 0;
    // This loop calculates the total mass of the given system, the sum of
    // v_i*m_i, and the sum of r_i*m_i for each particle i in the system.
    for (Particle element : astroObjSet) {
        totalMass = totalMass + element.getMass();
        velSum = Vector.add(velSum, element.getVelocity().times(element.getMass()));
        posSum = Vector.add(posSum, element.getPosition().times(element.getMass()));
    }
    // Calculate COM velocity and position in the initial frame of reference.
    Vector COMpos = posSum.times(1 / totalMass);
    Vector COMvel = velSum.times(1 / totalMass);
    // Transform position and velocity of each particle in the set to the COM
    // frame of reference.
    for (Particle element : astroObjSet) {
        Vector finPos = new Vector(Vector.Subtract(element.getPosition(), COMpos));
        Vector finVel = new Vector(Vector.Subtract(element.getVelocity(), COMvel));
        element.setPosition(finPos);
        element.setVelocity(finVel);
    }
}
But for some reason, unless I put a println("String") somewhere in the method, the method works only about one time in ten (I even counted). If I put a println("String") before or after the method is called, it also works, but not without one. Essentially, everything works only when I examine it. I never thought quantum mechanics would haunt me even in Java.
Does anybody have an idea what is going on?
Update 1
Fixed summation loop, thanks to Boann.
The Particle class does not have explicit implementations of equals and hashCode (I didn't do anything :/). Do you mean that I cannot modify the fields of objects inside a HashSet?
A Java Set enforces uniqueness of elements. Every object added is compared with existing objects to ensure it is not a duplicate.
HashSet does this using the hashCode and equals methods of objects to ask them whether they are the same as each other. You can modify fields of objects inside HashSets, but you must not modify any fields that you use in computing the hashCode or testing equality, because that will break/confuse the set.
Since you have not overridden the hashCode and equals methods, the default implementations simply test for object "identity": is this the same object? In that case, two created objects with equal values are treated as separate, and both can be added to a single HashSet. Since the fields are not even looked at, you can safely modify any of them without breaking the set.
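A short demonstration of why mutating hash-relevant fields breaks a HashSet; the Key class here is hypothetical, with a hashCode that depends on a mutable field:

```java
import java.util.HashSet;
import java.util.Set;

public class MutableKeyDemo {
    // Hypothetical class whose hashCode depends on a mutable field.
    static final class Key {
        int id;
        Key(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
        @Override public int hashCode() { return id; }
    }

    public static void main(String[] args) {
        Set<Key> set = new HashSet<>();
        Key k = new Key(1);
        set.add(k);
        System.out.println(set.contains(k)); // true

        k.id = 42; // mutating a field used by hashCode confuses the set
        System.out.println(set.contains(k)); // false: the wrong bucket is searched
    }
}
```

With default identity-based hashCode and equals, as in the Particle class, this problem cannot occur, which is why modifying particle fields inside the set is safe here.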
So it looks like HashSet gives out elements at random. Is that normal?
Yes. A HashSet gives no guarantee as to the order of the elements, and the order will not even remain stable over time. (Internally it orders objects by some of the bits of their hashCodes, giving a difficult-to-predict and effectively random order. When a set grows, it starts looking at more bits of the hashCodes to efficiently determine uniqueness, causing them to rearrange randomly again.) If you really need to enforce order of a Set you can use a LinkedHashSet instead.
However, using a Set at all is probably wasted effort in your application. Instead of Set and HashSet, use List and ArrayList. A List is much simpler and guarantees predictable iteration order: elements are iterated in the order they were added. A List is faster than a Set too, since it doesn't expend effort trying to prevent equal objects from being added twice.
Also, I have no idea what multiple threads are, so it's unlikely I'm doing that.
As a simple analogy, a thread is an instruction pointer that runs around through the code. (Imagine a finger pointing at a currently executing line of code.) It is what steps through methods, statements, loops, etc, and executes them. You always have at least one thread, otherwise your program wouldn't be running at all. You can create any number of threads and set them running around your program all at once. On a multi-CPU system, these threads can be executed truly simultaneously, one on each CPU. However, you can also have more threads than CPUs, in which case CPUs will take turns executing threads (changing to a different thread every 10 milliseconds or so) to make them seem simultaneous.
And why does adding a print make it more consistent?
The only reason I can think of is that you are modifying shared data from multiple threads. Threads are essentially blind to each other, by which I mean they do not naturally try to cooperate. This causes two problems: (1) if two threads try to change data simultaneously, the data can be corrupted and end up with a value that is not correct for either thread; (2) threads might not notice if another thread has changed some variables, because any thread can assume that variables have the same values they were previously set to by that thread.
To make multithreading safe, it is necessary to add some synchronization around any changing data which is shared between threads.
If your set of particles is being used from multiple threads without synchronization, it is possible for the set to be corrupted, and/or for changes in some of the values to not be seen by other threads. The effect of adding a print statement is that it accesses a shared, properly synchronized resource (the output stream to the console) so it causes a happens-before relationship between threads doing the printing, that ensures prior changes are seen, which fixes problem (2), although it is a very sloppy substitute for explicit synchronization.
You could be using multiple threads unintentionally. For example, creating a graphical user interface creates a thread which is used to handle input events. From then on, you should either switch to that thread and do all your program's work on it, or add synchronization around any shared changing data.
If you're sure you're not editing data from multiple threads, I can't see how the print statement could possibly make any difference. (Perhaps it didn't make a difference, and the only confusion was the mis-assumption about the stability of the HashSet.)
You probably have a memory visibility problem between threads. The print happens to synchronize over the output stream which creates a memory barrier and synchronizes with main memory masking the problem.

Is Java foreach loop an overkill for repeated execution

I agree the foreach loop reduces typing and is good for readability.
A little background: I work on low-latency application development and receive a million packets to process per second, iterating through those packets and sending the information across to listeners. I was using a foreach loop to iterate through the set of listeners.
While profiling, I found that a lot of Iterator objects were created to execute the foreach loops. Converting the foreach loops to index-based for loops, I observed a huge drop in the number of objects created, thereby reducing the number of GCs and increasing application throughput.
Edit: (Sorry for the confusion; making this question clearer.)
For example, I have a list of listeners (fixed size) and I loop over it a million times a second. Is foreach overkill in Java?
Example:
for (String s : listOfListeners)
{
    // logic
}
compared to
for (int i = 0; i < listOfListeners.size(); i++)
{
    // logic
}
The profiling screenshot was taken for this code:
for (int cnt = 0; cnt < 1_000_000; cnt++)
{
for (String string : list_of_listeners)
{
//No Code here
}
}
EDIT: Answering the vastly different question of:
For example, I have a list of listeners (fixed size) and I loop over it a million times a second. Is foreach overkill in Java?
That depends - does your profiling actually show that the extra allocations are significant? The Java allocator and garbage collector can do a lot of work per second.
To put it another way, your steps should be:
Set performance goals alongside your functional requirements
Write the simplest code you can to achieve your functional requirements
Measure whether that code meets your performance goals
If it doesn't:
Profile to work out where to optimize
Make a change
Run the tests again to see whether they make a significant difference in your meaningful metrics (number of objects allocated probably isn't a meaningful metric; number of listeners you can handle probably is)
Go back to step 3.
Maybe in your case, the enhanced for loop is significant. I wouldn't assume that it is though - nor would I assume that the creation of a million objects per second is significant. I would measure the meaningful metrics before and after... and make sure you have concrete performance goals before you do anything else, as otherwise you won't know when to stop micro-optimizing.
Size of list is around a million objects streaming in.
So you're creating one iterator object, but you're executing your loop body a million times.
While profiling, I found that a lot of Iterator objects were created to execute the foreach loop.
Nope. Only a single iterator object should be created per loop. As per the JLS:
The enhanced for statement is equivalent to a basic for statement of the form:
for (I #i = Expression.iterator(); #i.hasNext(); ) {
    VariableModifiers_opt TargetType Identifier = (TargetType) #i.next();
    Statement
}
As you can see, that calls the iterator() method once, and then calls hasNext() and next() on it on each iteration.
Do you think that extra object allocation will actually hurt your performance significantly?
How much do you value readability over performance? I take the approach of using the enhanced for loop wherever it helps readability, until it proves to be a performance problem - and my personal experience is that it's never hurt performance significantly in anything I've written. That's not to say that would be true for all applications, but the default position should be to only use the less readable code after proving it will improve things significantly.
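The desugaring quoted above can be checked by hand: the enhanced loop below and its hand-written iterator equivalent each allocate one Iterator per loop, not per element (the list contents are invented for illustration):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DesugarDemo {
    public static void main(String[] args) {
        List<String> listOfListeners = Arrays.asList("a", "b", "c");
        StringBuilder out = new StringBuilder();

        // Enhanced for: the compiler rewrites this into the iterator form below.
        for (String s : listOfListeners) {
            out.append(s);
        }

        // Hand-written equivalent: still one Iterator object for the whole loop.
        for (Iterator<String> i = listOfListeners.iterator(); i.hasNext(); ) {
            out.append(i.next());
        }
        System.out.println(out); // abcabc
    }
}
```

So a million executions of the outer loop create a million short-lived Iterators in total (one per loop execution), which is what the profiler was counting.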
The "foreach" loop creates just one Iterator object, while the second loop creates none. If you are executing many, many separate loops that execute just a few times each, then yes, "foreach" may be unnecessarily expensive. Otherwise, this is micro-optimizing.
EDIT: The question has changed so much since I wrote my answer that I'm not sure what I'm answering at the moment.
Looking up stuff with list.get(i) can actually be a lot slower if it's a linked list, since for each lookup, it has to traverse the list, while the iterator remembers the position.
Example:
list.get(0) will get the first element
list.get(1) will first visit the first element to find the pointer to the second
list.get(2) will first visit the first element, then the second, and then reach the third
etc.
So to do a full loop, you're actually looping over elements in this manner:
0
0->1
0->1->2
0->1->2->3
etc.
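A sketch of the two traversal styles over a LinkedList: both give the same result, but the indexed version re-walks the list from the head on every get(i) (the element values are arbitrary):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class LinkedListWalk {
    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>(Arrays.asList(10, 20, 30));

        // get(i) walks from the head each call: O(n) per lookup,
        // O(n^2) for the whole loop.
        int sumByIndex = 0;
        for (int i = 0; i < linked.size(); i++) {
            sumByIndex += linked.get(i);
        }

        // The iterator remembers its position: O(1) per step, O(n) overall.
        int sumByIterator = 0;
        for (int n : linked) {
            sumByIterator += n;
        }

        System.out.println(sumByIndex);    // 60
        System.out.println(sumByIterator); // 60
    }
}
```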
I do not think you should worry about efficiency here.
Most of time is consumed by your actual application logic (in this case - by what you do inside the loop).
So, I would not worry about the price you pay for convenience here.
Is this comparison even fair? You are comparing using an Iterator vs. using get(index).
Furthermore, each loop would only create one additional Iterator. Unless the Iterator is itself inefficient for some reason, you should see comparable performance.
