Unique tasks throughout a program - Java

In my program, I do some tasks, parametrized by a MyParameter object (I call doTask(MyParameter parameter) to run a task).
From the beginning to the end of the program, I can create a lot of tasks (a few million at least), BUT I want to run each of them only once (if a task has already been executed, the method does nothing).
Currently, I'm using a HashSet to store the MyParameter objects for the tasks already executed, but if a MyParameter object is 100 bytes and I run 10M tasks in my program, that is at least 1 GB in memory.
How can I optimize that, to use as little memory as possible?
Thanks a lot guys

If all you need to know is whether a particular MyParameter has been processed or not, ditch the HashSet and use a BitSet instead.
Basically, if all you need to know is whether a particular MyParameter is done or not, then storing the entire MyParameter in the set is overkill - you only need to store a single bit, where 0 means "not done" and 1 means "done". This is exactly what a BitSet is designed for.
The hashes of your MyParameter values are presumably unique; otherwise your current approach of using a HashSet is pointless. If so, then you can use the hashCode() of each MyParameter as an index into the bit set, using the corresponding bit as an indicator of whether the given MyParameter is done or not.
That probably doesn't make much sense as is, so the following is a basic implementation. (Feel free to substitute the for loop, numParameters, getParameter(), etc. with whatever it is that you're actually using to generate MyParameters.)
BitSet doneSet = new BitSet();
for (int i = 0; i < numParameters; ++i) {
    MyParameter parameter = getParameter(i);
    // hashCode() can be negative, but BitSet indices cannot, so mask off the sign bit
    int index = parameter.hashCode() & Integer.MAX_VALUE;
    if (!doneSet.get(index)) {
        doTask(parameter);
        doneSet.set(index);
    }
}
The memory usage of this approach is a bit contingent on how BitSet is implemented internally, but I would expect it to be significantly better than simply storing all your MyParameters in a HashSet.
If, in fact, you do need to hang onto your MyParameter objects once you've processed them because they contain the result of processing, then you can possibly save space by storing just the result portion of the MyParameter in the HashSet (if such a thing's possible - your question doesn't make this clear).
If, on the other hand, you really do need each MyParameter in its entirety once you're done processing it, then you're already doing pretty much the best you can do. You might be able to do a little better memory-wise by storing them as a vector (i.e. expandable array) of MyParameters (which avoids some of the memory overheads inherent in using a HashSet), but this will incur a speed penalty due to time needed to expand the vector and an O(n) search time.

A TreeSet will give you somewhat better memory performance than a HashSet, at the cost of log(n) lookups.
You can use a NoSql key-value store such as Cassandra or LevelDB, which are essentially external hash tables.
You may be able to compress the MyParameter representation, but if it's only at 100 bytes currently then I don't know how much smaller you'd be able to get it.

Related

Java: efficient way to remember which objects are processed

What is the most efficient way to remember which objects are processed?
Obviously one could use a hash set:
Set<Foo> alreadyProcessed = new HashSet<>();

void process(Foo foo) {
    if (!alreadyProcessed.contains(foo)) {
        // Do something
        alreadyProcessed.add(foo);
    }
}
This makes me wonder why I would store the whole object when I just want to check whether its hash exists in the set, assuming that any hash of a Foo is unique.
Is there any more performant way to do this?
Bear in mind that a very large number of objects will be processed and that the actual processing code will not always be very heavy. Also, it is not possible for me to have a precompiled worklist of objects; it will be built up dynamically during the processing.
Set#contains can be very fast. It depends on how your hashCode() and equals() methods are implemented. Try caching the hash code value to make it faster (like String.java does).
The other simple and fast option is to add a boolean member to your Foo class: foo.done = true;
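A minimal sketch of that String.java-style hash caching, assuming Foo's state is immutable (the key field and its type are illustrative):

class Foo {
    private final String key; // hypothetical immutable state
    private int hash;         // 0 means "not yet computed"

    Foo(String key) { this.key = key; }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {          // compute once, reuse on every later call
            h = key.hashCode();
            hash = h;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Foo && ((Foo) o).key.equals(key);
    }
}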
You can't use the hash code alone, since equality of the hash codes of two objects does not imply that the objects are equal.
Otherwise, depending on the use case, you want to remember whether you have already processed
a) the same object, tested by reference, or
b) an equal object, tested by a call to Object.equals(Object)
For b) you can use a standard Set implementation.
For a) you can also use a standard Set implementation if you know that the equals method tests reference equality; otherwise you would need something like an IdentityHashSet.
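The JDK has no ready-made IdentityHashSet, but one can be assembled from standard pieces; a minimal sketch:

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Membership is tested with ==, not equals()
Set<Foo> seen = Collections.newSetFromMap(new IdentityHashMap<Foo, Boolean>());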
No mention of performance in this answer; you need to address correctness first!
Write good code. Optimize it for performance only if you can show that you need to in your use case.
There is no performance advantage in storing a hash code rather than the object. If you doubt that, remember that what is being stored is a reference to the object, not a copy of it. In reality that's going to be 64 bits, pretty much the same as the hash code. You've already spent a substantial amount of time thinking about a problem that none of your users will ever notice. (If you are doing these calculations millions of times in a tight, mission-critical loop, that's another matter.)
Using the set is simple to understand. Doing anything else runs a risk that a future maintainer will not understand the code and introduce a bug.
Also don't forget that a hash code is not guaranteed to be unique for every different object. Every so often storing the hash code will give you a false positive, causing you to fail to process an object you wanted to process. (As an aside, you need to make sure that equals() only considers two objects equal if they are the same object. The default Object.equals() does this, so don't override it.)
Use the Set. If you are processing a very large number of objects, use a more efficient Set than HashSet. That is much more likely to give you a performance speedup than anything clever with hashing.

Perform arithmetic on number array without iterating

For example, I would like to do something like the following in java:
int[] numbers = {1,2,3,4,5};
int[] result = numbers*2;
//result now equals {2,4,6,8,10};
Is this possible to do without iterating through the array? Would I need to use a different data type, such as ArrayList? The current iterating step is taking up some time, and I'm hoping something like this would help.
No, you can't multiply each item in an array without iterating through the entire array. As pointed out in the comments, even if you could use the * operator in such a way the implementation would still have to touch each item in the array.
Further, a different data type would have to do the same thing.
I think a different answer from the obvious may be beneficial to others who have the same problem and don't mind a layer of complexity (or two).
In Haskell, there is something known as "Lazy Evaluation", where you could do something like multiply an infinitely large array by two, and Haskell would "do" that. When you accessed the array, it would try to evaluate everything as needed. In Java, we have no such luxury, but we can emulate this behavior in a controllable manner.
You will need to create or extend your own List class and add some new functions. You would need functions for each mathematical operation you wanted to support. I have examples below.
LazyList ll = new LazyList();
// Add a couple million objects
ll.multiplyList(2);
The internal implementation of this would be to create a Queue that stores all the primitive operations you need to perform, so that order of operations is preserved. Now, each time an element is read, you perform all operations in the Queue before returning the result. This means that reads are very slow (depending on the number of operations performed), but we at least get the desired result.
If you find yourself iterating through the whole array each time, it may be useful to de-queue at the end instead of preserving the original values.
If you find that you are making random accesses, I would preserve the original values and return modified results when called.
If you need to update entries, you will need to decide what that means. Are you replacing a value there, or are you replacing a value after the operations were performed? Depending on your answer, you may need to run backwards through the queue to get a "pre-operations" value to replace an older value. The reasoning is that on the next read of that same object, the operations would be applied again and then the value would be restored to what you intended to replace in the list.
There may be other nuances with this solution, and again the way you implement it would be entirely different depending on your needs and how you access this (sequentially or randomly), but it should be a good start.
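A minimal sketch of that queue-of-operations idea, assuming integer data (all names here are illustrative, not from any library):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntUnaryOperator;

class LazyIntList {
    private final int[] values;
    private final Deque<IntUnaryOperator> pendingOps = new ArrayDeque<>();

    LazyIntList(int[] values) { this.values = values; }

    // O(1): just remember the operation instead of touching every element
    void multiplyList(int factor) { pendingOps.add(x -> x * factor); }

    // Reads pay the cost: apply every pending operation, in insertion order
    int get(int index) {
        int result = values[index];
        for (IntUnaryOperator op : pendingOps) {
            result = op.applyAsInt(result);
        }
        return result;
    }
}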
With the introduction of Java 8 this task can be done using streams.
private long tileSize(int[] sizes) {
    return IntStream.of(sizes).reduce(1, (x, y) -> x * y);
}
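For the element-wise doubling in the original question, a map over an IntStream reads naturally; note that it still iterates internally, as the other answers point out:

import java.util.stream.IntStream;

int[] numbers = {1, 2, 3, 4, 5};
int[] result = IntStream.of(numbers).map(x -> x * 2).toArray();
// result now equals {2, 4, 6, 8, 10}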
No it isn't. If your collection is really big and you want to do it faster, you can try to operate on the elements in two or more threads, but you have to take care of synchronization (use a synchronized collection), or divide your collection into two (or more) collections and operate on one collection in each thread. I'm not sure whether it will be faster than just iterating through the array - it depends on the size of your collection and on what you want to do with each element. If you want to use this solution, you will have to check whether it is faster in your case - it might be slower, and it will definitely be much more complicated.
Generally, if it's not a critical part of the code and the execution time isn't too long, I would leave it as it is now.
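If the per-element work turns out to be heavy enough, a parallel stream is a simpler way to try the multi-threaded route than hand-rolled thread management; a sketch (whether it actually wins still has to be measured, as noted above):

import java.util.stream.IntStream;

// Splits the work across the common fork-join pool; total work is still O(n)
static int[] doubleAll(int[] numbers) {
    return IntStream.of(numbers).parallel().map(x -> x * 2).toArray();
}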

Reuse hashmaps in array

I am holding an array of hashmaps. I want to get maximum performance with minimal memory usage, so I would like to reuse the hashmaps inside the array.
So when there is a hashmap in the array that is not needed any more and I want to add a new hashmap to the array, I just clear the old hashmap and use put() to add new values.
I also need to copy values back when I retrieve a hashmap from the array.
I am not sure if this is better than creating new HashMap() every time.
What is better?
UPDATE
I need to cycle through about 50 million hashmaps, each holding about 10 key-value pairs. If the size of the array is 20,000, I need just 20,000 hashmaps instead of 50 million new HashMap() calls.
Be very careful with this approach. Although it may be better performance-wise to recycle objects, you may get into trouble by modifying the same reference several times, as illustrated in the following example:
public class A {
    public int counter = 0;

    public static void main(String[] args) {
        A a = new A();
        a.counter = 5;
        A b = a;        // I want to save a into b and then recycle a for other purposes
        a.counter = 10; // now b.counter is also 10
    }
}
I'm sure you get the point; however, if you are not copying around references to HashMaps from the array, then it should be OK.
Doesn't matter. Premature optimization. Come back when you have profiler results telling you where you're actually spending the most memory or CPU cycles.
It is entirely unclear why re-using maps in this manner would improve performance and/or memory usage. For all we know, it might make no difference, or might have the opposite effect.
You should do whatever results in the most readable code, then profile, and finally optimize the parts of the code that the profiler highlights as bottlenecks.
In most cases you will not feel any difference.
Typically the number of map entries is MUCH higher than the number of map objects. When you populate a map, you create an instance of Map.Entry per entry. This is a relatively lightweight object, but you invoke new all the same. The map itself without data is lightweight too, so you will not get any benefit from these tricks unless your map is supposed to hold only 1-2 entries.
Bottom line.
Forget about premature optimization. Implement your application. If you have performance problems, profile the application, find the bottlenecks, and fix them. I can 99% guarantee you that the bottleneck will never be in the new HashMap() call.
I think what you want is an object pool kind of thing, where you get an object (in your case, a HashMap) from the object pool, perform your operations, and when that object is no longer needed you put it back in the pool.
Check out the Object Pool design pattern; for further reference see this link:
http://sourcemaking.com/design_patterns/object_pool
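A minimal object-pool sketch along those lines, assuming single-threaded use (class and method names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class MapPool<K, V> {
    private final Deque<Map<K, V>> free = new ArrayDeque<>();

    Map<K, V> acquire() {
        Map<K, V> m = free.poll();
        if (m == null) {
            m = new HashMap<>(); // pool empty: fall back to allocating
        }
        return m;
    }

    // Clear the map and make it available again once nothing else references it
    void release(Map<K, V> m) {
        m.clear();
        free.push(m);
    }
}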
The problem you have is that most of the objects are the Map.Entry objects inside the HashMap. While you can recycle the HashMap itself (and its array), these are only a small portion of the objects. One way around this is to use FastMap from Javolution, which recycles everything and has support for managing the lifecycle (it's designed to minimise garbage this way).
I suspect the most efficient way is to use an EnumMap if possible (if you have known key attributes), or POJOs, even if most fields are not used.
There are a few problems with reusing HashMaps.
Even if the key and value data were to take no memory (shared from other places), the Map.Entry objects would dominate memory usage but not be reused (unless you did something a bit special).
Because of generational GC, having old objects point to new ones is generally expensive (and it is relatively difficult to see what's going on). This might not be an issue if you are keeping millions of these.
More complicated code is more difficult to optimise. So keep it simple, and then do the big optimisations, which probably involve changing the data structures.

Java: what is the overhead of using ConcurrentSkipList* when no concurrency is needed?

I need a sorted list in a scenario dominated by iteration (compared to insert/remove; no random get at all). For this reason I thought about using a skip list rather than a tree (the iterator should be faster).
The problem is that Java 6 only has a concurrent implementation of a skip list, so I was wondering whether it makes sense to use it in a non-concurrent scenario or whether the overhead makes it the wrong decision.
From what I know, the ConcurrentSkipList* classes are basically lock-free implementations based on CAS, so they should not carry (much) overhead, but I wanted to hear somebody else's opinion.
EDIT:
Some micro-benchmarking (running iteration multiple times on different-sized TreeSets, LinkedLists, ConcurrentSkipLists and ArrayLists) shows that there's quite an overhead. ConcurrentSkipList stores the elements in a linked list internally, so the only reason it would be slower on iteration than a LinkedList is the aforementioned overhead.
If thread-safety's not required, I'd say to skip package java.util.concurrent altogether.
What's interesting is that sometimes ConcurrentSkipList is slower than TreeSet on the same input, and I haven't worked out why yet.
I mean, have you seen the source code for ConcurrentSkipListMap? :-) I always have to smile when I look at it. It's 3000 lines of some of the most insane, scary, and at the same time beautiful code I've ever seen in Java. (Kudos to Doug Lea and co. for getting all the concurrency utils integrated so nicely with the collections framework!) Having said that, on modern CPUs the code and algorithmic complexity won't even matter so much. What usually makes more difference is having the data to iterate co-located in memory, so that the CPU cache can do its job better.
So in the end I'll wrap ArrayList with a new addSorted() method that does a sorted insert into the ArrayList.
Sounds good. If you really need to squeeze every drop of performance out of iteration you could also try iterating a raw array directly. Repopulate it upon each change, e.g. by calling TreeSet.toArray() or generating it then sorting it in-place using Arrays.sort(T[], Comparator<? super T>). But the gain could be tiny (or even nothing if the JIT does its job well) so it might not be worth the inconvenience.
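A minimal sketch of that addSorted() idea, using the JDK's binary search to find the insertion point (the method name comes from the edit above; the rest is illustrative):

import java.util.Collections;
import java.util.List;

// O(log n) to find the slot, O(n) to shift the elements after it
static <T extends Comparable<? super T>> void addSorted(List<T> list, T element) {
    int index = Collections.binarySearch(list, element);
    if (index < 0) {
        index = -index - 1; // binarySearch returns (-(insertion point) - 1) on a miss
    }
    list.add(index, element);
}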
As measured using Open JDK 6 on typical production hardware my company uses, you can expect all add and query operations on a skip-list map to take roughly double the time as the same operation on a tree map.
examples:
63 usec vs 31 usec to create and add 200 entries.
145 ns vs 77 ns for get() on that 200-element map.
And the ratio doesn't change all that much for smaller and larger sizes.
(The code for this benchmark will eventually be shared so you can review it and run it yourself; sorry we're not ready to do that yet.)
Well, you can use a lot of other structures to implement a skip list. It lives in the concurrent package because concurrent data structures are a lot more complicated, and because using a concurrent skip list costs less than using other concurrent data structures to mimic a skip list.
In a single-threaded world it is different: you can use a sorted set, a binary tree, or a custom data structure of your own that would perform better than a concurrent skip list.
The complexity of iterating a tree or a skip list will always be O(n), but if you instead use a linked list or an array list, you have a problem with insertion: inserting an item in the right position (sorted linked list) is O(n), instead of O(log n) for a binary tree or a skip list.
You can iterate over TreeMap.keySet() to obtain all inserted keys in order, and it will not be that slow.
There is also the TreeSet class, which is probably what you need, but since it is just a wrapper around TreeMap, direct use of TreeMap would be faster.
Without concurrency, it is usually more efficient to use a balanced binary search tree. In Java, this would be a TreeMap.
Skip lists are generally reserved for concurrent programming because of their ease of implementation and their speed in multithreaded applications.
You seem to have a good grasp of the trade-off here, so I doubt anyone can give you a definitive, principled answer. Fortunately, this is pretty straightforward to test.
I started by creating a simple Iterator<String> that loops indefinitely over a finite list of randomly generated strings. (That is: on initialization, it generates an array _strings of a random strings of length b out of a pool of c distinct characters. The first call to next() returns _strings[0], the next call returns _strings[1] … the (a+1)th call returns _strings[0] again.) The strings returned by this iterator were what I used in all calls to SortedSet<String>.add(...) and SortedSet<String>.remove(...).
I then wrote a test method that accepts an empty SortedSet<String> and loops d times. On each iteration, it adds e elements, then removes f elements, then iterates over the entire set. (As a sanity check, it keeps track of the set's size by using the return values of add() and remove(), and when it iterates over the entire set, it makes sure it finds the expected number of elements. Mostly I did that just so there would be something in the body of the loop.)
I don't think I need to explain what my main(...) method does. :-)
I tried various values for the various parameters, and I found that sometimes ConcurrentSkipListSet<String> performed better, and sometimes TreeSet<String> did, but the difference was never much more than twofold. In general, ConcurrentSkipListSet<String> performed better when:
a, b, and/or c were relatively large. (I mean, within the ranges I tested. My a's ranged from 1000 to 10000, my b's from 3 to 10, my c's from 10 to 80. Overall, the resulting set-sizes ranged from around 450 to exactly 10000, with modes of 666 and 6666 because I usually used e = 2f.) This suggests that ConcurrentSkipListSet<String> copes somewhat better than TreeSet<String> with larger sets, and/or with more-expensive string-comparisons. Trying specific values designed to tease apart these two factors, I got the impression that ConcurrentSkipListSet<String> coped noticeably better than TreeSet<String> with larger sets, and slightly less well with more-expensive string-comparisons. (That's basically what you'd expect; TreeSet<String>'s binary-tree approach aims to do the absolute minimum possible number of comparisons.)
e and f were small; that is, when I called add(...) and remove(...) only a small number of times per iteration. (This is exactly what you predicted.) The exact turnover point depended on a, b, and c, but to a first approximation, ConcurrentSkipListSet<String> performed better when e + f was less than around 10, and TreeSet<String> performed better when e + f was more than around 20.
Of course, this was on a machine that may look nothing like yours, using a JDK that may look nothing like yours, and using very artificial data that might look nothing like yours. I'd recommend that you run your own tests. Since Tree* and ConcurrentSkipList* both implement Sorted*, you should have no difficulty trying your code both ways and seeing what you find.
From what I know, the ConcurrentSkipList* classes are basically lock-free implementations based on CAS, so they should not carry (much) overhead, […]
My understanding is that this will depend on the machine. On some systems a lock-free implementation may not be possible, in which case these classes will have to use locks. (But since you're not actually multi-threading, even locks may not be all that expensive. Synchronization has overhead, of course, but its main cost is lock contention and forced single-threading. That isn't an issue for you. Again, I think you'll just have to test and see how the two versions perform.)
As noted, SkipList has a lot of overhead compared to TreeMap, and the TreeMap iterator isn't well suited to your use case because it just repeatedly calls the successor() method, which turns out to be very slow.
So one alternative that will be significantly faster than the previous two is to write your own TreeMap iterator. Actually, I would dump TreeMap altogether, since 3000 lines of code is a bit bulkier than you probably need, and just write a clean AVL tree implementation with the methods you need. The basic AVL logic is just a few hundred lines of code in any language; then add the iterator that works best in your case.

Java Performance - ArrayLists versus Arrays for lots of fast reads

I have a program where I need to make 100,000 to 1,000,000 random-access reads to a List-like object in as little time as possible (as in milliseconds) for a cellular-automata-like program. I think the update algorithm I'm using is already optimized (it keeps track of active cells efficiently, etc.). The Lists do need to change size, but that performance is not as important. So I am wondering if the performance gain from using Arrays instead of ArrayLists is enough to make a difference when dealing with that many reads in such short spans of time. Currently, I'm using ArrayLists.
Edit:
I forgot to mention: I'm just storing integers, so another factor is using the Integer wrapper class (in the case of ArrayLists) versus ints (in the case of arrays). Does anyone know if using ArrayList will actually require 3 pointer lookups (one for the ArrayList, one for the underlying array, and one for the Integer->int), whereas the array would only require 1 (array address + offset to the specific int)? Would HotSpot optimize the extra lookups away? How significant are those extra lookups?
Edit2:
Also, I forgot to mention I need to do random access writes as well (writes, not insertions).
Now that you've mentioned that your arrays are actually arrays of primitive types, consider using the collection-of-primitive-type classes in the Trove library.
@viking reports a significant (ten-fold!) speedup using Trove in his application - see comments. The flip side is that the Trove collection types are not type-compatible with Java's standard collection APIs. So Trove (or similar libraries) won't be the answer in all cases.
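For illustration, a tiny usage sketch (package and class names as in Trove 3.x; check the version you use):

import gnu.trove.list.array.TIntArrayList;

TIntArrayList cells = new TIntArrayList();
cells.add(42);            // stores the primitive int directly, no Integer boxing
int value = cells.get(0); // likewise returns an int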
Try both, but measure.
Most likely you could hack something together to make the inner loop use arrays without changing all that much code. My suspicion is that HotSpot will already inline the method calls and you will see no performance gain.
Also, try Java 6 update 14 and use -XX:+DoEscapeAnalysis
ArrayLists are slower than arrays, but most people consider the difference to be minor. In your case it could matter, though, since you're dealing with hundreds of thousands of them.
By the way, duplicate: Array or List in Java. Which is faster?
I would go with Kevin's advice.
Stay with the lists first and measure your performance. If your program is too slow, compare it to a version with an array. If that gives you a measurable performance boost, go with the arrays; if not, stay with the lists, because they will make your life much, much easier.
There will be an overhead from using an ArrayList instead of an array, but it is very likely to be small. In fact, the useful bit of data in the ArrayList can be stored in registers, although you will probably use more (List size for instance).
You mention in your edit that you are using wrapper objects. These do make a huge difference. If you are typically using the same value repeatedly, then a sensible cache policy may be useful (Integer.valueOf gives the same results for -128 to 127). For primitives, primitive arrays usually win comfortably.
As a refinement, you might want to make sure that adjacent cells tend to be adjacent in the array (you can do better than rows of columns with a space-filling curve).
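For instance, flattening the grid row-major keeps horizontal neighbours contiguous in memory; a simple sketch (a space-filling curve such as a Hilbert curve improves vertical locality as well):

// cells[y * width + x] holds the cell at column x, row y
static int cellAt(int[] cells, int width, int x, int y) {
    return cells[y * width + x];
}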
One possibility would be to re-implement ArrayList (it's not that hard), but expose the backing array via a lock/release call cycle. This gets you convenience for your writes, but exposes the array for a large series of read/write operations that you know in advance won't impact the array size. If the list is locked, add/delete is not allowed - just get/set.
for example:
SomeObj[] directArray = myArrayList.lockArray();
try {
    // myArrayList.add(), delete() would throw an illegal state exception
    for (int i = 0; i < 50000; i++) {
        directArray[i] += 1;
    }
} finally {
    myArrayList.unlockArray();
}
This approach continues to encapsulate the array growth/etc... behaviors of ArrayList.
Java uses double indirection for its objects so they can be moved about in memory and still have their references remain valid; this means every reference lookup is equivalent to two pointer lookups. These extra lookups cannot be optimised away completely.
Perhaps even worse, your cache performance will be terrible. Accessing values in cache is going to be many times faster (perhaps 10x) than accessing values in main memory. If you have an int[], you know the values will be consecutive in memory and thus load into cache readily. However, for Integer[] the individual Integer objects can appear randomly across your memory and are much more likely to be cache misses. Also, an Integer uses 24 bytes, which means Integers are much less likely to fit into your caches than 4-byte values.
If you update an Integer, this often results in a new object being created, which is many orders of magnitude more expensive than updating an int value.
If you're creating the list once, and doing thousands of reads from it, the overhead from ArrayList may well be slight enough to ignore. If you're creating thousands of lists, go with the standard array. Object creation in a loop quickly gets expensive, simply because of all the overhead of instantiating the member variables, calling the constructors up the inheritance chain, and so on.
Because of this -- and to answer your second question -- stick with standard ints rather than the Integer class. Profile both and you'll quickly (or, rather, slowly) see why.
If you're not going to be doing a lot more than reads from this structure, then go ahead and use an array as that would be faster when read by index.
However, consider how you're going to get the data in there, and if sorting, inserting, deleting, etc, are a concern at all. If so, you may want to consider other collection based structures.
Primitives are much (much much) faster. Always. Even with JIT escape analysis, etc. Skip wrapping things in java.lang.Integer. Also, skip the array bounds check most ArrayList implementations do on get(int). Most JITs can recognize simple loop patterns and remove the bounds check, but there isn't much reason to mess with it if you're worried about performance.
You don't have to code primitive access yourself - I'd bet you could cut over to using IntArrayList from the COLT library (see http://acs.lbl.gov/~hoschek/colt/ - "Colt provides a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java") in a few minutes of refactoring.
The options are:
1. To use an array
2. To use the ArrayList which internally uses an array
It is obvious that ArrayList introduces some overhead (look into the ArrayList source code). For 99% of use cases this overhead can be easily ignored. However, if you implement time-sensitive algorithms and do tens of millions of reads from a list by index, then using bare arrays instead of lists should bring noticeable time savings. USE COMMON SENSE.
Please take a look here: http://robaustin.wikidot.com/how-does-the-performance-of-arraylist-compare-to-array I would personally tweak the test to avoid compiler optimizations, e.g. I would change "j = " into "j += " with the subsequent use of "j" after the loop.
An array will be faster simply because, at a minimum, it skips a function call (i.e. get(i)).
If you have a static size, then Arrays are your friend.
