I am working on a game and have several customers stored in an ArrayList, and each customer has their own unique ID which is saved as a variable inside the object.
If I want to retrieve a customer from the list using their ID, which of these would be better practice?
Option 1: Iterate through all customers in the list until I find a match.
Option 2: Convert the list to a HashMap with the IDs as keys and the customers as values, and then simply use the .get() method. Or does this end up doing exactly the same work as option 1?
A HashMap will be more efficient: O(1) in most cases, O(n) in the worst case, whereas iterating over a list is always O(n).
Of course it depends on the size of your data. If you have lots of it, HashMap is the obvious choice; if you have only a few customers (say 5, maybe 10), a List might actually be more efficient, because the constant factors that Big-O notation ignores start to matter.
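As a rough sketch of the two options (the Customer class and its getId() accessor here are assumptions, not taken from your code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Customer {
    private final int id;
    Customer(int id) { this.id = id; }
    int getId() { return id; }
}

class CustomerLookup {
    // Option 1: linear scan over the list, O(n) per lookup.
    static Customer findById(List<Customer> customers, int id) {
        for (Customer c : customers) {
            if (c.getId() == id) {
                return c;
            }
        }
        return null;
    }

    // Option 2: build the index once, then each lookup is O(1) on average.
    static Map<Integer, Customer> indexById(List<Customer> customers) {
        Map<Integer, Customer> byId = new HashMap<>();
        for (Customer c : customers) {
            byId.put(c.getId(), c);
        }
        return byId;
    }
}

Once the map is built, byId.get(id) replaces the whole loop for every subsequent lookup.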
As others have written, collections with "hash" in their name typically provide improved performance for certain kinds of operations.
But please keep in mind the very old rule about not doing premature optimization. Seriously: unless we are talking about a real production setup with 10,000+ paying customers ... you should focus much more on good design and on creating readable, maintainable code than on potential performance issues.
Yes, one should avoid outright stupid designs; but the thing is: if you focus too much and too early on performance, you risk two things:
a) missing the real bottleneck. If you really encounter performance issues, you have to profile your application to understand where time is spent. Far too often people assume that their problem is X; they then spend a lot of time fixing X, only to find out later that the Just-in-Time compiler already handles X well enough, and that their real problem is some other Y they never thought of.
b) painting yourself into a corner. Good designs need to be able to evolve and change. If you assume you must do this or that to preserve some holy grail of performance ... chances are that this impacts your overall application in a negative way.
I know your question was specific, but I'm going to propose another solution.
Build your own set, backed by an array of Customers. When a Customer is inserted, set the Customer's ID to the index it gets when placed into the array.
That way you have direct, constant-time access to all your customers using their ID.
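A minimal sketch of that idea, assuming a Customer class that exposes a setId(int) mutator (unlike the immutable sketch in the first answer):

import java.util.Arrays;

// Array-backed store where a customer's ID is simply its index in the array.
class CustomerStore {
    private Customer[] customers = new Customer[16];
    private int size = 0;

    // Stores the customer, assigns the next free index as its ID, and returns that ID.
    int add(Customer c) {
        if (size == customers.length) {
            customers = Arrays.copyOf(customers, customers.length * 2);  // grow when full
        }
        customers[size] = c;
        c.setId(size);              // assumes Customer has a setId(int) method
        return size++;
    }

    // Direct O(1) access by ID.
    Customer get(int id) {
        return customers[id];
    }
}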
A sorted list can find the right item faster by using its indices: pick the index in the middle; if the value there is higher or lower than the one you're looking for, use that information to pick a new index in the lower or upper half, and repeat. This keeps halving the search space and reduces the time to find the item. It's called binary search (thx fabian). A HashMap avoids the search altogether by computing the bucket directly from the key's hash.
You can try both and compare the results!
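For comparison, a binary-search lookup over a list kept sorted by ID might look like this (reusing the assumed Customer sketch from the first answer):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Binary search over a list that is kept sorted by ID.
class SortedLookup {
    static Customer findById(List<Customer> sortedById, int id) {
        int pos = Collections.binarySearch(
                sortedById,
                new Customer(id),                          // probe object carrying only the ID
                Comparator.comparingInt(Customer::getId)); // O(log n) instead of O(n)
        return pos >= 0 ? sortedById.get(pos) : null;
    }
}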
EDIT:
I'm with Jägermeister on this one: you should treat performance tuning as a final step.
Related
I'm working on an activity in which I have to determine which sort is used in a set of given methods, which I can't actually view since they're run from a .jar file that only contains .class files. All I can see is timing information (so I can determine which sort it could be based on time complexity). However, the elements being sorted are randomly generated integers, and duplicates are allowed. In order to properly identify the sorts, I'll have to somehow identify which methods are not sorting in a stable manner, without being able to view the source code or the array of integers. For example, I could have an array [1,2,3,9,5,64,8,5,1], which when sorted would be [1,1,2,3,5,5,8,9,64]. However, if something like a non-stable selection sort were used to sort this, the two 1 values might be swapped in relative order to each other. How could I detect if something like this has occurred?
If you're constrained to using ints, there's no way you can test this as there is no way to distinguish two different 1s.
However, if you have the option of using Integers, and you specifically use new Integer() (rather than relying on Java's autoboxing), you can compare the references in the original array with those in the sorted one.
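A sketch of that idea (the explicit new Integer() is deliberate, as described above, so every element is a distinct object; Arrays.sort stands in for the black-box method under test):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StabilityProbe {
    public static void main(String[] args) {
        // Distinct Integer objects, so equal values can still be told apart by identity.
        Integer[] original = new Integer[1_000];
        Random rnd = new Random(42);
        for (int i = 0; i < original.length; i++) {
            original[i] = new Integer(rnd.nextInt(10));   // many duplicates on purpose
        }

        Integer[] sorted = original.clone();
        Arrays.sort(sorted);    // stand-in for the sort under test

        System.out.println("relative order of duplicates preserved: " + isStable(original, sorted));
    }

    // True if, among elements with equal value, the sorted array keeps the
    // same relative order (compared by reference) as the original array.
    static boolean isStable(Integer[] original, Integer[] sorted) {
        Map<Integer, List<Integer>> byValue = new HashMap<>();
        for (Integer x : original) {
            byValue.computeIfAbsent(x.intValue(), k -> new ArrayList<>()).add(x);
        }
        Map<Integer, Integer> nextIndex = new HashMap<>();
        for (Integer x : sorted) {
            List<Integer> group = byValue.get(x.intValue());
            int idx = nextIndex.getOrDefault(x.intValue(), 0);
            if (group.get(idx) != x) {            // reference comparison, not equals()
                return false;
            }
            nextIndex.put(x.intValue(), idx + 1);
        }
        return true;
    }
}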
As mentioned by Joe C, you would have to use some sort of wrapper class as there is no way to determine the difference between two primitive values. For example the Integer class could be used.
However there is a bigger issue here and it is an issue with logic. Just because a sorting algorithm is not guaranteed to be stable does not mean that given a particular input it will be unstable. Thus, measuring instability of an algorithm is ultimately an unreliable way of determining which algorithm is being used.
You could use instability as a way to narrow down your set of potential algorithms, but to begin with I would recommend focusing on the time it takes different algorithms to process different kinds of inputs. Start with the obvious, the time complexity of the algorithms: for example, a bogosort will on average take much longer than a quicksort. Then delve into the inner workings of the algorithms you are considering. Some algorithms with the same time complexity (e.g. mergesort and heapsort) behave differently depending on the type of input you pass them. For example, one may be very slow on a completely reversed input while another handles it very quickly.
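A rough sketch of the timing side of this approach, assuming you can hand each black-box method an int[] to sort (Arrays::sort is only a placeholder for the unknown .jar method):

import java.util.Arrays;
import java.util.Random;
import java.util.function.Consumer;

public class SortTiming {
    public static void main(String[] args) {
        Consumer<int[]> sortUnderTest = Arrays::sort;   // stand-in for the unknown method

        int n = 200_000;
        int[] random = new Random(1).ints(n).toArray();
        int[] ascending = random.clone();
        Arrays.sort(ascending);
        int[] descending = new int[n];
        for (int i = 0; i < n; i++) {
            descending[i] = ascending[n - 1 - i];
        }

        // Algorithms with the same average complexity often differ wildly on these inputs.
        for (int[] input : new int[][] { random, ascending, descending }) {
            int[] copy = input.clone();
            long start = System.nanoTime();
            sortUnderTest.accept(copy);
            System.out.printf("%.1f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }
}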
I understand that HashSet is based on HashMap, since they are pretty similar. It makes the code more flexible and minimizes implementation effort. However, one reference variable in the HashSet's Entry seems unnecessary to me if the class forbids null elements, so the whole Entry seems pointless. Despite this, an Entry takes 24 bytes of memory per element, whereas a plain array of the set's elements would take only 4 bytes per element, if my figures are correct (aside from the array's header).
If my argument is correct, do the advantages really outweigh this performance hit?
(If I am wrong, I would like to learn from that as well.)
Though this question is primarily opinion-based, I'll summarize a few points on the topic:
HashSet appeared in Java 1.2, many years ago. It's hard to guess now the exact reasons for design decisions made at that time, but clearly Java wasn't yet used for high-load applications; performance played a smaller role than simplicity.
You are right that HashSet is suboptimal in its memory consumption. The problem is known, the bug JDK-6624565 is registered, and discussions on core-libs-dev are held from time to time. But is this a blocker for many real-world applications? Probably not.
For those uncommon applications where HashSet memory usage is unacceptable, there are already good alternatives, like Trove's THashSet.
Note that open addressing algorithms have their disadvantages, e.g. significant performance degradation with load factors close to 1, and difficulties with element removal. See the related answer.
I'm looking for a high performing data structure that behaves like a set and where the elements will always be an array of ints. The data structure only needs to fulfill this interface:
trait SetX {
  def size: Int
  def add(element: Array[Int])
  def toArray: Array[Array[Int]]
}
The set should not contain duplicates, where duplicates can be detected using Arrays.equals(int[] a, int[] a2) - i.e. two elements are equal if their arrays hold the same values.
Before creating it I have a rough idea of how many elements there will be but need resizing behaviour in case there are more than initially thought. The elements will always be the same length and I know what that is at the time of creation.
Of course I could use a Java HashSet (wrapping the arrays of course) but this is being used in a tight loop and it is too slow. I've looked at Trove and that works nicely (by using arrays but providing a TObjectHashingStrategy) but I was hoping that since my requirements are so specific there might be a quicker/more efficient way to do this.
Has anyone ever come across this or have an idea how I could accomplish this?
The trait above is Scala but I'm very happy with Java libs or code.
I should really say what I am doing. I am basically generating a large number of int arrays in a tight loop and at the end of it I just want to see the unique ones. I never have to remove elements from the set or anything else. Just add lots of int arrays to the set and at the end get out the unique ones.
Look at prefix trees. You can follow the tree structure as you generate each array; by the end of generation you already know whether the generated array is present in the set. A prefix tree would also consume much less memory than an ordinary hash set.
If you are generating arrays and there is a non-negligible chance that some of them are equal, I suspect you are only drawing numbers from a very limited range. That would simplify the prefix tree implementation, too.
I'm sure that a proper implementation would be faster than using any set implementation that keeps the arrays whole.
The downside of this solution is that you need to implement the data structure yourself, because it would be deeply integrated with the logic of your code.
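A minimal sketch of such a prefix tree (the class name and API are mine, not from any library; it assumes you only need add and a count of unique arrays, and that all arrays have the same length):

import java.util.HashMap;
import java.util.Map;

// Trie over int arrays: each level of the tree corresponds to one array position.
class IntArrayTrie {
    private final Map<Integer, IntArrayTrie> children = new HashMap<>();
    private boolean terminal;   // an array ends at this node
    private int uniqueCount;    // number of distinct arrays added via this root

    // Returns true if the array was new, false if an equal array was already present.
    boolean add(int[] a) {
        IntArrayTrie node = this;
        for (int v : a) {
            node = node.children.computeIfAbsent(v, k -> new IntArrayTrie());
        }
        if (node.terminal) {
            return false;
        }
        node.terminal = true;
        uniqueCount++;
        return true;
    }

    int size() {
        return uniqueCount;
    }
}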
If you want high performance then write your own:
Call it ArraySetInt.
Sets are usually implemented as either trees or hash tables.
If you want an array-based set, this will slow down adding (and maybe deleting), but it will speed up iteration and keep memory usage low.
First look how ArrayList is implemented.
Remove the Object references and replace them with primitive ints.
Then rename add() to put() and change it into an insertion-sort-style insert: use Arrays.binarySearch() to find the insertion position (and whether the element already exists) in one step, and System.arraycopy() to make room for the new element.
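A rough sketch of that put() under the assumptions above (the growth strategy and class shape are mine):

import java.util.Arrays;

// Sorted, array-backed set of primitive ints.
class ArraySetInt {
    private int[] data = new int[16];
    private int size = 0;

    // Inserts the value keeping the array sorted; returns false if it was already present.
    boolean put(int value) {
        int pos = Arrays.binarySearch(data, 0, size, value);
        if (pos >= 0) {
            return false;                          // already in the set
        }
        int insertAt = -(pos + 1);                 // insertion point, per the javadoc
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        System.arraycopy(data, insertAt, data, insertAt + 1, size - insertAt);
        data[insertAt] = value;
        size++;
        return true;
    }

    int size() {
        return size;
    }
}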
Without knowing how much data you have, or whether you are doing more reads than writes:
You should probably try (i.e. benchmark) the naive case of an array of arrays, or an array of wrapper objects (a composite object holding the array plus its cached hash code). Generally, on small data sets not much beats looping through an array (e.g. a HashMap keyed by an enum can actually be slower than just looping).
If you have a really large amount of data and you're willing to make some compromises, you might consider a Bloom filter, but it sounds like you don't have that much data.
I'd go for some classic solution wrapping the array by a class providing faster equals and hashCode. The hashCode can be simply cached and equals can make use of it for quickly saying no in case of differing arrays.
I'd avoid Arrays.hashCode as it uses a weak multiplier (31), which might lead to unneeded collisions. For a really fast equals you could even make use of cryptography and say that two arrays are equal if and only if their SHA-1 hashes are equal (you'd be the first to find a collision :D).
The ArrayWrapper is rather simple and should be faster than using a TObjectHashingStrategy, as it never has to re-examine the data itself for hashing (fewer cache misses) and it has the fastest and best possible hashCode and equals.
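A sketch of such a wrapper class (the name ArrayWrapper is just this answer's suggestion; Arrays.hashCode is used here for brevity, and you could plug in a stronger hash):

import java.util.Arrays;

// Wraps an int[] and caches its hash code; equals() uses the cached hash
// to reject differing arrays quickly before comparing contents.
final class ArrayWrapper {
    private final int[] data;
    private final int hash;

    ArrayWrapper(int[] data) {
        this.data = data;
        this.hash = Arrays.hashCode(data);     // computed once, reused forever
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ArrayWrapper)) return false;
        ArrayWrapper other = (ArrayWrapper) o;
        if (hash != other.hash) return false;  // cheap rejection via cached hash
        return Arrays.equals(data, other.data);
    }
}

The wrappers can then go straight into an ordinary HashSet<ArrayWrapper>.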
You could also look for some CompactHashSet implementation as it can be faster due to better memory locality.
What is the best practice for initializing an ArrayList in Java?
If I initialize an ArrayList using the new operator then the ArrayList will by default have memory allocated for 10 elements. Which is a performance hit.
I don't know, maybe I am wrong, but it seems to me that I should create an ArrayList by specifying the size, if I am sure about the size!
Which is a performance hit.
I wouldn't worry about the "performance hit". Object creation in Java is very fast. The performance difference is unlikely to be measurable by you.
By all means use a size if you know it. If you don't, there's nothing to be done about it anyway.
The kind of thinking that you're doing here is called "premature optimization". Donald Knuth says it's the root of all evil.
A better approach is to make your code work before you make it fast. Optimize with data in hand that tells you where your code is slow. Don't guess - you're likely to be wrong. You'll find that you rarely know where the bottlenecks are.
If you know how many elements you will add, initialize the ArrayList with the correct capacity. If you don't, don't worry about it. The performance difference is probably insignificant.
This is the best advice I can give you:
Don't worry about it. Yes, you have several options for creating an ArrayList, but using new ArrayList(), the default provided by the library, isn't a bad choice; otherwise it would have been silly to make it the default for everyone without clarifying what's better.
If it turns out that this is a problem, you'll quickly discover it when you profile. That's the proper place to find problems, when you profile your application for performance/memory problems. When you first write the code, you don't worry about this stuff -- that's premature optimization -- you just worry about writing good, clean code, with good design.
If your design is good, you should be able to fix this problem in no time, with little impact on the rest of the system. Effective Java 2nd Edition, Item 52: Refer to objects by their interfaces. You may even be able to switch to a LinkedList, or any other kind of List out there, if that turns out to be a better data structure. Design for this kind of flexibility.
Finally, Effective Java 2nd Edition, Item 1: Consider static factory methods instead of constructors. You may even be able to combine this with Item 5: Avoid creating unnecessary objects, if in fact no new instances are actually needed (e.g. Integer.valueOf doesn't always create a new instance).
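As a tiny illustration of the Integer.valueOf point (the JDK caches small values, so the static factory can return an existing instance instead of allocating a new one):

public class ValueOfDemo {
    public static void main(String[] args) {
        // Static factory: small values come from an internal cache, no new allocation.
        Integer a = Integer.valueOf(42);
        Integer b = Integer.valueOf(42);
        System.out.println(a == b);   // true: same cached instance

        // Constructor: always creates a new instance (deprecated in recent Java versions).
        Integer c = new Integer(42);
        Integer d = new Integer(42);
        System.out.println(c == d);   // false: two distinct objects
    }
}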
Related questions
Java Generics Syntax - in-depth about type inferring static factory methods (also in Guava)
On ArrayList micromanagement
Here are some specific tips if you need to micromanage an ArrayList:
You can use ArrayList(int initialCapacity) to set the initial capacity of a list. The list will automatically grow beyond this capacity if needed.
When you're about to populate/add to an ArrayList and you know what the total number of elements will be, you can use ensureCapacity(int minCapacity) (or the constructor above directly) to reduce the number of intermediate growth operations. Each add will run in amortized constant time regardless of whether or not you do this (as guaranteed in the API), so this can only reduce the cost by a constant factor.
You can trimToSize() to minimize the storage usage.
This kind of micromanagement is generally unnecessary, but should you decide (justified by conclusive profiling results) that it's worth the hassle, you may choose to do so, as in the sketch below.
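A small sketch of those three hooks (the element counts are made-up numbers):

import java.util.ArrayList;

public class CapacityDemo {
    public static void main(String[] args) {
        int expected = 1_000_000;

        // Pre-size the backing array so no intermediate growth is needed.
        ArrayList<Integer> list = new ArrayList<>(expected);
        for (int i = 0; i < expected; i++) {
            list.add(i);                        // amortized O(1) either way
        }

        // Before another known bulk add, grow the backing array once.
        list.ensureCapacity(expected + 500_000);
        for (int i = 0; i < 500_000; i++) {
            list.add(i);
        }

        // After population, shrink the backing array to exactly size().
        list.trimToSize();
        System.out.println(list.size());
    }
}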
See also
Collections.singletonList - Returns an immutable list containing only the specified object.
If you already know the (approximate) size of your ArrayList, you should use the constructor that takes a capacity. But most of the time developers don't really know what will end up in the List, and the default capacity of 10 is sufficient for most cases.
The default capacity of 10 is only a starting point and isn't a performance hit unless you already know that your ArrayList will contain tons of elements; in that case, resizing the backing array over and over is the real performance hit.
You don't need to specify the initial size of an ArrayList. You can always add or remove elements from it easily.
If performance is a concern, please keep the following things in mind:
Initialization of ArrayList is very fast. Don't worry about it.
Adding/removing element from ArrayList is also very fast. Don't worry about it.
If you find your code runs too slowly, the first thing to blame is your algorithm, no offense. Machine specs, OS and language play a part too, but their contribution is usually insignificant compared to that of your algorithm.
If you don't know the size of the ArrayList, then you're probably better off using a LinkedList, since LinkedList.add() runs in constant time.
However, as most people here have said, you should not worry about speed before you do some kind of profiling.
You can use this old, but good (in my opinion) article for reference.
http://chaoticjava.com/posts/linkedlist-vs-arraylist/
Since ArrayList is backed by an array, an initial size has to be chosen for that array.
If you really care, you can call trimToSize() once you have constructed and populated the object. The javadoc states that the capacity will be at least as large as the list size. As previously stated, it's unlikely that you will find the memory allocated by an ArrayList to be a performance bottleneck, and if it were, I would recommend you use an array instead.
Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:
public class Record {
    final private int field1;
    final private int field2;
    final private long field3;
    /* constructor & accessors here */
}

List<Record> records = new ArrayList<Record>();
If I end up using a large number (> 10^6) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:
public class OptimizedRecordStore {
    final private int[] field1;
    final private int[] field2;
    final private long[] field3;

    Record getRecord(int i) { return new Record(field1[i], field2[i], field3[i]); }
    /* constructor and other accessors & methods */
}
edit:
assume the # of records is something that is changed infrequently or never
I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access to the records, e.g. if field1 and field2 are "time" and "position", it's important for me to be able to get those values as arrays for use with MATLAB, so I can graph/analyze them efficiently.
The answers giving the general "optimise when you have to" advice are unhelpful in this case because, IMHO, programmers should always be aware of the performance of different design choices when a choice leads to an order-of-magnitude performance penalty, particularly API writers.
The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.
The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.
Generally speaking, you should think about how you use the objects long before you think about performance. So you have a record with three fields, or three arrays. Which one more accurately depicts what you're modeling? By this I mean: when you insert or delete an item, are you operating on just one of the three arrays, or on all three as a block?
I suspect it's the latter in which case the former makes far more sense.
If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.
If you have millions of records, the second approach has several advantages:
Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e.g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; in any case, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability has to be checked (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in a real situation, one should run a benchmark oneself.
Serialization speed: the speed of (de)serializing a large array of primitives on disk is only limited by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).
Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.
How are you going to access the data? If accesses to the fields are always coupled, use the first option; if you are going to process each field on its own, the second option is better.
See this article in wikipedia: Parallel Array
A good example of when it's more convenient to have separate arrays is a simulation where the numerical data is packed together in one array, and other attributes like name, colour, etc., which are accessed only for presentation of the data, live in another array.
I was curious so I actually ran a benchmark. If you don't re-create the object like you are[1], then SoA beats AoS by 5-100% depending on workload[2]. See my code here:
https://gist.github.com/twolfe18/8168262c5420c7a62d39
[1] I didn't add that because if you are concerned enough about speed to consider this refactor, it would be silly to do that.
[2] This also doesn't account for re-allocation, but again, this is often something you can either amortize away or know statically. This is a reasonable assumption for a pure-speed benchmark.
Notice that the second approach might have negative impact on caching behaviour. If you want to access a single record at a time, you'd better have that record not scattered all across the place.
Also, the only memory you save with the second approach comes (possibly) from member alignment and from not having to allocate a separate object per record.
Otherwise, the two have exactly the same memory use, asymptotically. The first option is much better due to locality, IMO.
Whenever I have tried doing number crunching in Java, I have always had to revert to C-style coding (i.e. close to your option 2). It minimises the number of objects floating around in your system: instead of 1,000,000 objects, you only have 3. I was able to do a bit of FFT analysis of real-time sound data using the C style, and it was far too slow using objects.
I'd choose the first method (array of structures) unless you access the store relatively infrequently and are running into serious memory pressure issues.
The first version basically stores the objects in their "natural" form (+1 BTW for using immutable records). This uses a little more memory because of the per-object overhead (probably around 8-16 bytes depending on your JVM) but is very good for accessing and returning objects in a convenient and human-understandable form in one simple step.
The second version uses less memory overall, but allocating a new object on every "get" is a pretty ugly solution that will not perform well if accesses are frequent.
Some other possibilities to consider:
An interesting "extreme" variant would be to take the second version but write your algorithms / access methods to interact with the underlying arrays directly. This is clearly going to result in complex inter-dependencies and some ugly code, but would probably give you the absolute best performance if you really needed it. It's quite common to use this approach for intensive graphics applications such as manipulating a large array of 3D coordinates.
A "hybrid" option would be to store the underlying data in a structure of arrays as in the second version, but cache the accessed objects in a HashMap so that you only generate the object the first time a particular index is accessed. Might make sense if only a small fraction of objects are ever likely to accessed, but all data is needed "just in case".
(Not a direct answer, but one that I think should be given)
From your comment,
"cletus -- I greatly respect your thoughts and opinions, but you gave me the high-level programming & software design viewpoint which is not what I'm looking for. I cannot learn to ignore optimization until I can get an intuitive sense for the cost of different implementation styles, and/or the ability to estimate those costs. – Jason S Jul 14 '09 at 14:27"
You should always ignore optimization until it presents itself as a problem. Most important is to have the system be usable by a developer (so they can make it usable by a user). There are very few times that you should concern yourself with optimization, in fact in ~20 years of professional coding I have cared about optimization a total of two times:
Writing a program that had its primary purpose to be faster than another product
Writing a smartphone app with the intention of reducing the amount of data sent between the client and server
In the first case I wrote some code, then ran it through a profiler. When I wanted to do something and was not sure which approach was best (for speed/memory), I would code it one way and see the result in the profiler, then code it the other way and see the result. Then I would choose the faster of the two. This works, and you learn a lot about low-level decisions. I did not, however, allow it to impact the higher-level classes.
In the second case, there was no programming involved, but I did the same basic thing of looking at the data being sent and figuring out how to reduce the number of messages being sent as well as the number of bytes being sent.
If your code is clear then it will be easier to speed up once you find out it is slow. As Cletus said in his answer, you are resizing one time -vs- three times... one time will be faster than three. From a higher point of view the one time is simpler to understand than the three times, thus it is more likely to be correct.
Personally I'd rather get the right answer slowly than the wrong answer quickly. Once I know how to get the right answer, I can find out where the system is slow and replace those parts with faster implementations.
Because you are making the int[] fields final, you are stuck with a single initialization of each array and that is it. Thus, if you wanted 10^6 entries in field1, Java would need to set aside that much memory for each of those int[] up front, because you cannot resize those arrays. With an ArrayList, if you do not know the number of records beforehand and may be removing records, you save a lot of space up front and then later on as well when you remove records.
I would go for the ArrayList version too, so I don't need to worry about growing it. Do you need column-like access to the values? What is the scenario behind your question?
Edit: You could also use a common long[][] matrix.
I don't know how you pass the columns to MATLAB, but I guess you wouldn't gain much speed with column-based storage; more likely you'd lose speed in the Java computation.