I'm looking for a Java class with the characteristics of C++ std::map's usual implementation (as I understand it, a self-balancing binary search tree):
O(log n) performance for insertion/removal/search
Each element is composed of a unique key and a mapped value
Keys follow a strict weak ordering
I'm looking for implementations with open source or design documents; I'll probably end up rolling my own support for primitive keys/values.
This question's style is similar to: Java equivalent of std::deque, whose answer was "ArrayDeque from Primitive Collections for Java".
ConcurrentSkipListMap is a sorted map backed by a skip list (a probabilistic, tree-like structure with expected O(log n) performance). TreeMap, a red-black tree implementation, guarantees O(log n) in the worst case, so which of the two performs better in practice is something to measure rather than assume; the clear benefit of CSLM is that it is thread-safe and concurrent, which TreeMap is not. CSLM was added in JDK 1.6.
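Since both classes implement NavigableMap, they can be tried side by side. The following is only a small sketch; the class name SortedMapDemo and the helper fill are made up for illustration, and the Comparator plays roughly the role of std::map's Compare parameter.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo class, not part of any library.
final class SortedMapDemo {
    // Fills any NavigableMap the same way so both implementations can be compared.
    static NavigableMap<Integer, String> fill(NavigableMap<Integer, String> map) {
        map.put(30, "thirty"); // int keys are autoboxed to Integer
        map.put(10, "ten");
        map.put(20, "twenty");
        map.put(10, "TEN");    // keys are unique: this replaces the old value
        return map;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> tree = fill(new TreeMap<>());
        NavigableMap<Integer, String> skip = fill(new ConcurrentSkipListMap<>());

        System.out.println(tree);                    // {10=TEN, 20=twenty, 30=thirty}
        System.out.println(skip.firstKey());         // 10, the smallest key
        System.out.println(tree.floorKey(25));       // 20, the greatest key <= 25
        System.out.println(tree.headMap(20, false)); // {10=TEN}, keys strictly below 20

        // A Comparator supplies the ordering used for the keys.
        NavigableMap<Integer, String> descending =
                fill(new TreeMap<>(Comparator.<Integer>reverseOrder()));
        System.out.println(descending.firstKey());   // 30
    }
}
```

Note that both maps still box their keys and values, which is the part the question wants to avoid.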
Trove has a set of collections for primitive types and some other interesting variants of the common Java collection types.
Other collection libraries of interest include the Google Collections library (now part of Guava) and Apache Commons Collections.
The closest class to a binary tree in the standard Java libraries is java.util.TreeMap but it doesn't support primitive types, except by boxing (i.e. int is wrapped as an Integer, double as a Double, etc).
java.util.HashMap is likely to give better performance for large maps. Theoretically it is O(1) but its precise performance characteristics depend on the hash code generation algorithm(s) for the key class(es).
According to Introduction to Collections: "Arrays ... are the only collection that supports storing primitive data types."
You can take a look at commons-collections FastTreeMap as well.
I doubt you will find many collections that support primitive types without boxing, so you may simply have to live with it. Thanks to autoboxing, dedicated primitive support is often not needed anyway.
If you really want to use primitives (after benchmarks have shown that performance is insufficient!), you can look at the source of FastTreeMap and add methods for handling primitives.
Related
I'm trying to understand how the Java Collections Framework sorts its collections by default, and I got confused because I read that all collections are sorted using merge sort. But when I took a look at the Arrays class I saw this: «Implementors should feel free to substitute other algorithms, so long as the specification itself is adhered to. (For example, the algorithm used by sort(Object[]) does not have to be a mergesort, but it does have to be stable.)» Which means it may also use other sorting algorithms. So how exactly are the collections being sorted?
The code to sort collections is delivered with the JRE/JDK.
Anyone who implements the JRE/JDK can choose to implement it in any way they want, as long as the implementation conforms (i.e. it actually sorts the collection correctly and the sort is stable).
Some implementations might choose merge-sort, others might choose something else. No specific implementation is required.
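What "stable" means can be seen in a small sketch. The class name StableSortDemo and its helper are invented for this example; it relies only on the documented guarantee that Collections.sort does not reorder equal elements, whatever algorithm sits underneath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class StableSortDemo {
    // Sorts by length only; words of equal length keep their original
    // relative order because Collections.sort is specified to be stable.
    static List<String> sortByLength(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Comparator.comparingInt(String::length));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "fig", "kiwi", "plum", "nut");
        System.out.println(sortByLength(words)); // [fig, nut, pear, kiwi, plum]
    }
}
```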
I am doing a project in Scala, but am fairly new to the language and have a Java background. I see that Scala doesn't have ArrayList, so I am wondering what Scala's equivalent of Java's ArrayList is called, and if there are any important differences between the Java and Scala versions.
EDIT: I'm not looking for a specific behavior so much as an internal representation (data stored in an array, but the whole array isn't visible, only the part you use).
I can think of 3 more specific questions to address yours:
What is Scala's default collection?
What Scala collection has characteristics similar to ArrayList?
What's a good replacement for Array in Scala?
So here are the answers for these:
What is Scala's default collection?
Scala's equivalent of Java's List interface is the Seq. A more general interface exists as well, which is the GenSeq -- the main difference being that a GenSeq may have operations processed serially or in parallel, depending on the implementation.
Because Scala allows programmers to use Seq as a factory, they don't often bother with defining a particular implementation unless they care about it. When they do, they'll usually pick either Scala's List or Vector. They are both immutable, and Vector has good indexed access performance. On the other hand, List does very well the operations it does well.
What Scala collection has characteristics similar to ArrayList?
That would be scala.collection.mutable.ArrayBuffer.
What's a good replacement for Array in Scala?
Well, the good news is, you can just use Array in Scala! In Java, Array is often avoided because of its general incompatibility with generics: it is covariant whereas generics are invariant; it is mutable, which makes its covariance a danger; it accepts primitives where generics don't; and it has a pretty limited set of methods.
In Scala, Array -- which is still the same Array as in Java -- is invariant, which makes most problems go away. Scala accepts AnyVal (the equivalent of primitives) as types for its "generics", even though it will do auto-boxing. And through the "enrich my library" pattern, ALL of Seq methods are available to Array.
So, if you want a more powerful Array, just use an Array.
What about a collection that shrinks and grows?
The default methods available to all collections all produce new collections. For example, if I do this:
val ys = xs filter (x => x % 2 == 0)
Then ys will be a new collection, while xs will still be the same as before this command. This is true no matter what xs was: Array, List, etc.
Naturally, this has a cost -- after all, you are producing a new collection. Scala's immutable collections are much better at handling this cost because they are persistent, but it depends on what operation is executed.
No collection can do much about filter, but a List has excellent performance on generating a new collection by prepending an element or removing the head -- the basic operations of a stack, as a matter of fact. Vector has good performance on a bunch of operations, but it only pays if the collection isn't small. For collections of, say, up to a hundred elements, the overall cost might exceed the gains.
So you can actually add or remove elements to an Array, and Scala will produce a new Array for you, but you'll pay the cost of a full copy when you do that.
Scala mutable collections add a few other methods. In particular, the collections that can increase or decrease size -- without producing a new collection -- implement the Growable and Shrinkable traits. They don't guarantee good performance on these operations, though, but they'll point you to the collections you want to check out.
It's ArrayBuffer from scala.collection.mutable. You can find the scaladocs here.
It's hard to say exactly what you should do because you haven't said what behavior of ArrayList you're interested in using. It's more useful to think in terms of which scala traits you want to take advantage of. Here's a good explanation: http://grahamhackingscala.blogspot.com/2010/02/how-to-convert-java-list-to-scala-list.html.
That said, you probably want some sort of IndexedSeq.
I would like to find and reuse (if possible) a map implementation which has the following attributes:
While the number of entries is small, say < 32, underlying storage should be done in an array like this: [key0, val0, key1, val1, ...]. This storage scheme avoids many small Entry objects and provides extremely fast lookups (even though they are sequential scans!) on modern CPUs, because the CPU cache is not invalidated and there is no pointer indirection into the heap.
The map should maintain insertion order for key/value pairs regardless of the number of entries similar to LinkedHashMap
We are working on an in-memory representations of huge (millions of nodes/edges) graphs in Scala and having such a Map would allow us to store Node/Edge attributes as well as Edges per node in a much more efficient way for 99%+ of Nodes and Edges which have few attributes or neighbors while preserving chronological insertion order for both attributes and edges.
If anyone knows of a Scala or Java map with such characteristics I would be much obliged.
Thanx
While I'm not aware of any implementations that exactly fit your requirements, you may be interested in peeking at Flat3Map (source) in the Jakarta Commons library.
Unfortunately, the Jakarta libraries are rather outdated (e.g., no support for generics in the latest stable release, although it is promising to see that this is changing in trunk) and I usually prefer the Google Collections, but it might be worth your time to see how Apache implemented things.
Flat3Map doesn't preserve the order of the keys, unfortunately, but I do have a suggestion in regard to your original post. Instead of storing the keys and values in a single array like [key0, val0, key1, val1, ...], I recommend using parallel arrays; that is, one array with [key0, key1, ...] and another with [val0, val1, ...]. Normally I am not a proponent of parallel arrays, but at least this way you can have one array of type K, your key type, and another of type V, your value type. At the Java level, this has its own set of warts as you cannot use the syntax K[] keys = new K[32]; instead you'll need to use a bit of typecasting.
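To make the parallel-array idea and the typecasting concrete, here is a minimal sketch. SmallArrayMap is a made-up name; it is not Flat3Map and does not implement java.util.Map. It assumes non-null keys and does a plain linear scan, and new keys are appended at the end so insertion order is preserved.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical sketch of a small insertion-ordered map on parallel arrays.
final class SmallArrayMap<K, V> {
    private K[] keys;
    private V[] values;
    private int size;

    @SuppressWarnings("unchecked")
    SmallArrayMap(int capacity) {
        // "new K[capacity]" does not compile, so allocate Object[] and cast.
        keys = (K[]) new Object[Math.max(1, capacity)];
        values = (V[]) new Object[Math.max(1, capacity)];
    }

    // Sequential scan; returns the slot of the key, or -1 if it is absent.
    private int indexOf(K key) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return i;
            }
        }
        return -1;
    }

    V get(K key) {
        int i = indexOf(key);
        return i < 0 ? null : values[i];
    }

    // Returns the previous value, or null if the key was not present.
    V put(K key, V value) {
        Objects.requireNonNull(key, "key");
        int i = indexOf(key);
        if (i >= 0) {
            V old = values[i];
            values[i] = value; // existing key keeps its original position
            return old;
        }
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;      // new keys go at the end: insertion order
        values[size] = value;
        size++;
        return null;
    }

    int size() {
        return size;
    }

    // Key at the given insertion position.
    K keyAt(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return keys[index];
    }
}
```

A real implementation would presumably switch to a hash-based layout once the entry count passes some threshold; that part is left out here.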
Have you measured with a profiler whether LinkedHashMap is too slow for you? Maybe you don't need that new map; premature optimization is the root of all evil.
Anyway, when processing millions or more pieces of data per second, even the best-optimized map can be too slow, because in such cases every method call costs performance as well. Then all you can do is rewrite your algorithms from Java collections to arrays (i.e. int -> object maps).
Under Java you can maintain a 2D array (spreadsheet). I wrote a program which basically defines a 2D array with 3 columns of data and 3 columns for looking up the data. The three lookup columns are testID, SubtestID and Mode. This allows me to look up a value by testID and Mode or any combination, or I can also reference by static placement. The table is loaded into memory at startup and referenced by the program. It is infinitely expandable and new values can be added as needed.
If you are interested, I can post a code source example tonight.
Another idea may be to maintain a database in your program. Databases are designed to organize large amounts of data.
Well, it seems to me ArrayLists make it easier to expand the code later on both because they can grow and because they make using Generics easier. However, for multidimensional arrays, I find the readability of the code is better with standard arrays.
Anyway, are there some guidelines on when to use one or the other? For example, I'm about to return a table from a function (int[][]), but I was wondering if it wouldn't be better to return a List<List<Integer>> or a List<int[]>.
Unless you have a strong reason otherwise, I'd recommend using Lists over arrays.
There are some specific cases where you will want to use an array (e.g. when you are implementing your own data structures, or when you are addressing a very specific performance requirement that you have profiled and identified as a bottleneck) but for general purposes Lists are more convenient and will offer you more flexibility in how you use them.
Where you are able to, I'd also recommend programming to the abstraction (List) rather than the concrete type (ArrayList). Again, this offers you flexibility if you decide to change the implementation details in the future.
To address your readability point: if you have a complex structure (e.g. ArrayList of HashMaps of ArrayLists) then consider either encapsulating this complexity within a class and/or creating some very clearly named functions to manipulate the structure.
Choose a data structure implementation and interface based on primary usage:
Random Access: use List for variable type and ArrayList under the hood
Appending: use Collection for variable type and LinkedList under the hood
Loop and process: use Iterable and see the above for use under the hood based on producer code
Use the most abstract interface possible when handing around data. That said don't use Collection when you need random access. List has get(int) which is very useful when random access is needed.
Typed collections like List<String> make up for the syntactical convenience of arrays.
Don't use Arrays unless you have a qualified performance expert analyze and recommend them. Even then you should get a second opinion. Arrays are generally a premature optimization and should be avoided.
Generally speaking you are far better off using an interface rather than a concrete type. The concrete type makes it hard to rework the internals of the function in question. For example if you return int[][] you have to do all of the computation upfront. If you return List<List<Integer>> you can lazily do computation during iteration (or even concurrently in the background) if it is beneficial.
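As a rough sketch of the lazy variant, AbstractList only needs get and size, so a table can be exposed as a nested List whose cells are computed on access. LazyTable and multiplicationTable are names invented for this example, and the multiplication table simply stands in for whatever the function really computes.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical example: a read-only table whose cells are computed lazily.
final class LazyTable {
    static List<List<Integer>> multiplicationTable(int rows, int cols) {
        return new AbstractList<List<Integer>>() {
            @Override
            public List<Integer> get(int r) {
                if (r < 0 || r >= rows) {
                    throw new IndexOutOfBoundsException("row " + r);
                }
                // Each row is itself a lazy view; nothing is stored.
                return new AbstractList<Integer>() {
                    @Override
                    public Integer get(int c) {
                        if (c < 0 || c >= cols) {
                            throw new IndexOutOfBoundsException("column " + c);
                        }
                        return (r + 1) * (c + 1); // computed only when asked for
                    }

                    @Override
                    public int size() {
                        return cols;
                    }
                };
            }

            @Override
            public int size() {
                return rows;
            }
        };
    }

    public static void main(String[] args) {
        List<List<Integer>> table = multiplicationTable(3, 4);
        System.out.println(table.get(2).get(3)); // 12
        System.out.println(table.get(0));        // [1, 2, 3, 4]
    }
}
```

Callers see an ordinary List, so the function could later switch to an eagerly filled ArrayList without changing its signature.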
The List is more powerful:
You can resize the list after it has been created.
You can create a read-only view onto the data.
It can be easily combined with other collections, like Set or Map.
The array works on a lower level:
Its content can always be changed.
Its length can never be changed.
It uses less memory.
You can have arrays of primitive data types.
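A few of these points in code form, as a small sketch (the class name ListVersusArray and the helper viewRejectsWrites are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of any library.
final class ListVersusArray {
    // Returns true if the given list refuses to be modified.
    static boolean viewRejectsWrites(List<Integer> view) {
        try {
            view.add(99);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Integer> scores = new ArrayList<>();
        scores.add(3);
        scores.add(1);
        scores.add(3); // the list grows on demand

        // Read-only view onto the same data.
        List<Integer> readOnly = Collections.unmodifiableList(scores);
        System.out.println(viewRejectsWrites(readOnly)); // true
        scores.add(7);
        System.out.println(readOnly.size());             // 4, the view follows the list

        // Easy to combine with other collections.
        Set<Integer> distinct = new HashSet<>(scores);
        System.out.println(distinct.size());             // 3

        int[] raw = {3, 1, 3}; // primitives, no wrappers
        raw[0] = 42;           // content can always be changed
        System.out.println(raw.length);                  // 3, the length is fixed
    }
}
```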
I wanted to point out that Lists can hold the wrappers for the primitive data types that would otherwise need to be stored in an array (i.e. a class Double that has only one field: a double). The newer versions of Java convert to and from these wrappers implicitly, at least most of the time, so the ability to put primitives in your Lists should not be a consideration for the vast majority of use cases.
For completeness: the only time that I have seen Java fail to implicitly convert from a primitive wrapper was when those wrappers were composed in a higher order structure: It could not convert a Double[] into a double[].
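Because there is no implicit conversion between Double[] and double[], the elements have to be converted one by one. One way to do that, assuming Java 8 streams are available, is sketched below; the class name Unboxing and its two helpers are invented for this example.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class Unboxing {
    // Double[] -> double[]; throws NullPointerException if an element is null.
    static double[] toPrimitive(Double[] boxed) {
        return Arrays.stream(boxed).mapToDouble(Double::doubleValue).toArray();
    }

    // double[] -> Double[]
    static Double[] toBoxed(double[] raw) {
        return Arrays.stream(raw).boxed().toArray(Double[]::new);
    }

    public static void main(String[] args) {
        double[] raw = toPrimitive(new Double[] {1.5, 2.5});
        System.out.println(raw[0] + raw[1]);   // 4.0
        System.out.println(toBoxed(raw).length); // 2
    }
}
```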
It mostly comes down to flexibility/ease of use versus efficiency. If you don't know how many elements will be needed in advance, or if you need to insert in the middle, ArrayLists are a better choice. They use Arrays under the hood, I believe, so you'll want to consider using the ensureCapacity method for performance. Arrays are preferred if you have a fixed size in advance and won't need inserts, etc.
Java collections only store Objects, not primitive types; however we can store the wrapper classes.
Why this constraint?
It was a Java design decision, and one that some consider a mistake. Containers want Objects and primitives don't derive from Object.
This is one place that .NET designers learned from the JVM and implemented value types and generics such that boxing is eliminated in many cases. In CLR, generic containers can store value types as part of the underlying container structure.
Java opted to add generic support 100% in the compiler without support from the JVM. The JVM being what it is, doesn't support a "non-object" object. Java generics allow you to pretend there is no wrapper, but you still pay the performance price of boxing. This is IMPORTANT for certain classes of programs.
Boxing is a technical compromise, and I feel it is an implementation detail leaking into the language. Autoboxing is nice syntactic sugar, but it still carries a performance penalty. If anything, I'd like the compiler to warn me when it autoboxes. (For all I know, it may now; I wrote this answer in 2010.)
A good explanation on SO about boxing: Why do some languages need Boxing and Unboxing?
And criticism of Java generics: Why do some claim that Java's implementation of generics is bad?
In Java's defense, it is easy to look backwards and criticize. The JVM has withstood the test of time, and is a good design in many respects.
Makes the implementation easier. Since Java primitives are not considered Objects, you would need to create a separate collection class for each of these primitives (no template code to share).
You can do that, of course, just see GNU Trove, Apache Commons Primitives or HPPC.
Unless you have really large collections, the overhead for the wrappers does not matter enough for people to care (and when you do have really large primitive collections, you might want to spend the effort to look at using/building a specialized data structure for them).
It's a combination of two facts:
Java primitive types are not reference types (e.g. an int is not an Object)
Java does generics using type-erasure of reference types (e.g. a List<Integer> is really just a raw List holding Object references at run-time)
Since both of these are true, generic Java collections can not store primitive types directly. For convenience, autoboxing is introduced to allow primitive types to be automatically boxed as reference types. Make no mistake about it, though, the collections are still storing object references regardless.
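Both facts can be observed directly. This is just an illustrative sketch; ErasureDemo and its helper methods are names made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class ErasureDemo {
    // After erasure, both lists are instances of the very same class.
    static boolean sameRuntimeClass() {
        List<Integer> ints = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        return ints.getClass() == strings.getClass();
    }

    // Returns the stored element as a plain Object reference.
    static Object firstElementOf(List<?> list) {
        return list.get(0);
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());          // true

        List<Integer> ints = new ArrayList<>();
        ints.add(42);                                    // autoboxed to an Integer object
        Object stored = firstElementOf(ints);
        System.out.println(stored instanceof Integer);   // true: a reference, not an int

        int back = ints.get(0);                          // auto-unboxed on the way out
        System.out.println(back + 1);                    // 43
    }
}
```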
Could this have been avoided? Perhaps.
If an int is an Object, then there's no need for box types at all.
If generics aren't done using type-erasure, then primitives could've been used for type parameters.
There is the concept of auto-boxing and auto-unboxing. If you attempt to store an int in a List<Integer> the Java compiler will automatically convert it to an Integer.
It's not really a constraint, is it?
Consider if you wanted to create a collection that stored primitive values. How would you write a collection that can store either int, or float or char? Most likely you will end up with multiple collections, so you will need an intlist and a charlist etc.
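To illustrate what such a per-type collection looks like, here is a minimal sketch of an int-only list. IntList is a hypothetical class written for this example, not the API of any particular library; a charlist or floatlist would be a near copy with the element type swapped.

```java
import java.util.Arrays;

// Hypothetical minimal growable list of int, with no boxing involved.
final class IntList {
    private int[] data = new int[8];
    private int size;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // grow by doubling
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    // Sums the elements without creating any wrapper objects.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += data[i];
        }
        return total;
    }
}
```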
Taking advantage of the object oriented nature of Java when you write a collection class it can store any object so you need only one collection class. This idea, polymorphism, is very powerful and greatly simplifies the design of libraries.
The main reason is Java's design strategy:
1) Collections require objects for manipulation, and primitives are not derived from Object.
2) Java primitive data types are not reference types; for example, an int is not an object.
To overcome this, we have the concepts of auto-boxing and auto-unboxing: if you try to store a primitive value, the compiler will automatically convert it into an object of the corresponding wrapper class.
I think we might see progress in this space in the JDK possibly in Java 10 based on this JEP - http://openjdk.java.net/jeps/218.
If you want to avoid boxing primitives in collections today, there are several third party alternatives. In addition to the previously mentioned third party options there is also Eclipse Collections, FastUtil and Koloboke.
A comparison of primitive maps was also published a while ago with the title: Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove. The GS Collections (Goldman Sachs) library was migrated to the Eclipse Foundation and is now Eclipse Collections.