Tree construction from an ordered list - java

In java, im creating a SortedSet from a list which is always going to be ordered (but is only of type ArrayList). I figure adding them one by one is going to have pretty poor performance (in the case for example of an AVL tree), as it will have to reorder the tree a lot.
my question is, how should i be creating this set? in a way that it is as fast as possible to build a balanced tree?
the specific implementation i was planning on using was either IntRBTreeSet or IntAVLTreeSet from http://fastutil.dsi.unimi.it/docs/it/unimi/dsi/fastutil/ints/IntSortedSet.html
after writing this up, I think the poor performance wont affect me too much anyway (too small amount of data), but im still interested in how it would be done in a general case.

A set having a tree implementation would have the middle element from your list in the top. So the algorithm would be as following:
find the middle element of the List
insert it into set
repeat for both sub-lists to the left and to the right of the middle element

Red-Black trees are a good choice for the general case, and they have very fast inserts. See Chris Okasaki's paper for an elegant and fast implementation. The Functional Java library has a generic Set class that is backed by a red-black tree implemented according to this paper.

With all the discussion of using a Set, it occurs to me that maybe the problem could be re-stated. Why use a Set at all? If you just want to check for membership, and your source list is sorted, then do a binary search for the object - this will be as fast (and probably faster) than any n-tree you can envision, and it's not that tough to code.
So, envision a OrderedListSet interface that just wraps the underling List object. As long as the comparator used to order the list is also used for the binary search, this should be pretty straight-forward.
All Set operations will start with a getIndex(Object ob) call, then the appropriate action is taken on the List.

Do you have a performance problem with the simple approach of just inserting the elements as they come?
If not, don't optimize.

The built in TreeSet (http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html) class uses a red-black tree as it's backing tree (and, has been noted, red-black trees are quite fast for inserts). Here's good info on Red-Black trees (they don't have the problem of the typical binary tree implementation when inserting data that is mostly ordered already).
If you are dealing with huge data sets (big enough to require disk based backing, or significant paging file swap), then a B+Tree is a very good option (see JDBM for a Java based version of self-balancing B+Tree - it doesn't implement Set, but could be used that way if desired).
Depending on how your application is actually using this data, you might want to consider the GlazedLists library and make your lists 'live'. If all you are doing is static analysis, then this may be overkill, but it is an absolutely fantastic way of working with list based data. Definitely worth reading about.

Related

Binary Search Tree vs a MultiMap

The problem I have to solve is that I have to input IP address prefixes and that data associated with them in a tree so they can be queried later. I'm reading these addresses from a file and the file may contain as many as 16 million records and the file could have duplicates and i have to store those too.
I wrote my own binary search tree, but learned that a TreeMap in Java is implemented using a Red Black tree, but a TreeMap can't contain duplicates.
I want the query to take O(logn) time.
The data structure needs to be in Ram, so I'm also not sure how I'm going to store 16 million nodes.
I wanted to ask: Would it be too much of a performance hit using a library like guava to insert the Ips in Multi-maps? Or is there a better way to do this?
Using a built in library, which is tested documented and well maintained is usually a good practice.
It will also help you learn more about guava. Once you start using it "for just one thing", you will most likely realize there is much more you can use to make your life a bit easier.
Also, an alternative is using a TreeMap<Key,List<MyClass>> rather then TreeMap<Key,MyClass> as a custom implementation of a Multimap.
Regarding memory - you should try to minimize your data as much as possible (use efficient data structures, no need for "wasty" String, for example for storing IPs, there are cheaper alternatives, exploit them.
Also note - the OS will be able to offer you more memory then the RAM you have, by using virtual memory (practically for 64bits machine - it is most likely to be more then enough). However, it will most likely be less efficient then a DS dedicated for disk (such as B+ trees, for example).
Alternatives:
As alternatives to the TreeMap - you might be interested in other data structures (each with its advantages and disadvantages):
hash table - implemented as HashMap in java. Your type will then beHashMap<Key,List<Value>>. It allows O(1) average case query, but might decay to O(n) worst case. It also does not allow efficient range queries.
trie or its more space efficient version - radix tree. Allows O(1) access to each key, but is usually less space efficient then the alternatives. With this approach, you will implement the Map interface with the DS, and your type will be Map<Key,List<Value>>
B+ tree, which is much more optimized for disk - if your data is too large to fit in RAM after all.

Java Performance: Map vs List

I've building a tree pagination in JSF1.2 and Richfaces 3.3.2, because I have a lot of tree nodes (something like 80k), and it's slow..
So, as first attempt, I create a HashMap with the page and the list of nodes of the page.
But, the performance isn't good enough...
So I was wondering if is something faster than a HashMap, maybe a List of Lists or something.
Someone have some experience with this? What can I do?
Thanks in advance.
EDIT.
The big problem is that I have to validate permissions of users in the childnodes of the tree. I knew that this is the big problem: this validation is slow, because I have to go inside the nodes, I don't have a good way to know if the user have permission in a 10th level node without iterate all of them. Plus to this, the same three has used in more places...
The basic reason for why I was doing this pagination, is that the client side will be much slow, because of the structure generated by richfaces, a lot of tr's and td's, the browser just going crazy with this.
So, unfortunatelly, I have to load all the nodes, and paginate just client side, and I need to know what of them is faster to iterate...
Sorry my bad english.
A hash map is the fastest data structure if you want to get all nodes for a page. The list of nodes can be fetched in constant time (O(1)) while with lists the time is O(n) (n=number of pages, faster on sorted lists but never getting near O(1))
What operations on your datastructure are too slow. That's what you have to analyse before you start optimization.
It's probably more due to the fact that JSF is a performance pig than a data structure choice. The one attempt I've seen to create a JSF app could be timed with a sundial.
You're making a mistake by guessing about solutions without more knowledge about the root cause. I'd recommend that you profile your app to see where the time is being spent.
The data structure to use always depends on how you need to store the data and how you need to access it. HashMap<K, V> is supposed to have constant time complexity in accessing the value, provided the key. When you call get(key), the hashCode() for key is computed and it's used to retrieve the related value. Unless you've got different keys that have the same hashcode (in which case you may have been doing something wrong, as while is not mandatory different objects should have different hash codes, at least in the majority of cases), this is usually fast.
Searching an element in a plain list requires scanning of the list, which will (almost) always be slower than computing an hashcode.
If you need to associate values with keys, a Map is the way. And HashMap should be fast enough.
I don't know too much about JSF, but I think - if the data structure and access pattern is the one that a Map is designed for - the problem is not the HashMap itself.
I would solve this with a javascript/ajax calls method that fetches childnodes.

Which Java data object to use for multidimensional range matching?

Project Background:
I am writing a map tile overlay class for java that can use gdal2tile.py tiles. Basically I will end up with thousands of jpg files that are in a file structure like
"Zoom Level/X coordinate/Y coordinate"
The coordinates are ints but will not necessarily start at 0 or 1.
I will have to search for tiles that are within a certain range to find out which ones I need to render.
My Problem:
I tried iterating using the file structure itself but it is wicked slow (not surprising).
I tried iterating using an ArrayList of strings of the file structure and .contains() but it seems to be even slower (not too surprising).
Optimally I would like to use a data structure that would let me choose a range on multiple dimensions so that I can call something like.
Tiles.getWhere(Zoom Level,min X,max X,min Y,maxY);
I assume that some sort of Collection or TreeMap would be the right choice but I'm not experienced enough with Java to know for sure and I'd prefer not to have to benchmark a lot of different approaches.
I could use SQLite to do it but that seems like overkill.
My Question:
What is the most efficient way to check for the existence of datasets given multiple dimensional constraints?
May be you are looking for a map with multiple keys.
Commons-collections provides a map with multiple lookup keys:
http://commons.apache.org/collections/apidocs/org/apache/commons/collections/map/MultiKeyMap.html
a map guarantees a O(1) insertion and O(1) selection timings.
Thinking of your problem I could find out three directions to which you could aim your search next (this is not a hand-by-hand guide but rather a out-of-the-box brain opener for a stucked situation you have faced):
1) Usage of Java built in structures. Yes, indeed, a list is the worst case of a searching method. A Map, as the name suggests, is far more convenient for maps. It is not only the name, but the indexing to a Map is signifigantly less time consuming compared to a List. You can imagine your map as a cube, where you have to handle about half of the dots inside it, if you use List and probably only a narrow layer of it when you search by indexing a Map. There is a magnitude of difference. So, my answer here: Map is a key word towards the correct direction (assuming you want to do it in this way after reading on my answer).
2) Usage of a Map Server solution. This is probably too far from your approach, but entire frameworks are made for solving your type of question. An example is GeoServer. It has a ready made solution for the entire problem. It is a stable solution for the great big problem possibly in your hand: showing a map to a user from a source.
3) Sticking to the GDAL framework you were using, you could select slightly different py-file, like gdal_proximity.py and - wow! - you have a searching possibility in your hand! This particular one searches by a center point and a distance, but will do the stuff you need =)
There is a starting point, how I would make it. Could this serve for something?
Sounds to me like you are looking for something like an Interval Tree.
http://en.wikipedia.org/wiki/Interval_tree
I have implemented one of these in the past but only in one dimension. The Wikipedia reference mentions extensions to more dimensions.
Paul

What is the most efficent way to create a tree in java?

I am creating a tree in java to model the Extensive-form of a game for an AI. The tree will be an 25-ary tree (a tree in which each branch at most has 25 child branches) because at each turn of the game there are 25 different moves. Because the number of new branches that have to be created in each new layer of the tree is 25^n I'm very concerned with making this efficient. (I intend to remorselessly cut of branches to keep them from growing in order to keep things from getting bogged down). What is the best way to model such a tree when efficiency is such a concern? My first impression is to have a node object where each node has a parent node and an array of child nodes but this means creating a lot of objects. Ultimately these are my questions:
Is this the fastest way create and manage my tree?
What is a good way to figure out how much time any given algorithm or process in a program is going to take? (the only one I've thought of so far is to create a date before the process and then after and compare the # of milliseconds that have passed)
Any other thoughts are also welcome. My question implies and is related to a great number of other questions, i would expect. If i have been ambiguous or unclear please comment to let me know instead of down-voting and storming off as this isn't productive.
Realistically, the way you described is the best approach. It'll perform reasonably well compared to anything else you could do and will be straightforward to implement.
Time and again people are asking questions about how to do something "efficiently". The best answer is nearly always, "don't even bother trying". Unless your improvement is an algorithmic one, it's unlikely to make much difference anyway, and especially in a case like this, the extra effort and complexity isn't worth whatever miniscule gain you might be able to achieve.
Putting it another way, and to borrow a quote (though I can't remember the originator), the first rule of optimization is: don't.
Having said that, if you really feel the need to squeeze every last drop of speed, you could try caching and re-using objects (instead of discarding them completely, keep track of them in a free object store, and then when you need to create a new object, first check the free object store to check if there is an existing one). As always, you'll need to measure performance before and after to see if it really helps (chances are it won't help much, unless physical memory is really constrained, in which case garbage collection can become expensive).
I agree with the previous comment about only optimizing once you have implemented the rest of the application.
On the other hand, I do realize a few things that may be of importance:
Branching factor of 25: Although not ridiculously huge, it is still large with respect to other problems. For a tree, you will definitely have to have a list for each node to indicate the list of SubNodes. You can do this either by making a Node class which has a collection of nodes within it, or have an external Map that maps a given node to a list of children nodes.
Removing and adding of elements will be done: This lends itself to a LinkedList implementation of the stored children since you don't want to perform costly removes and adds. A HashSet may work also, but the problem is that you may need more memory.
Iteration of the elements may or may not be done: If you want to iterate over the entire list at each step, LinkedLists are fine. If you want to prioritize the nodes then you may be saving memory by using a priority queue data structure. Priority queues are especially helpful if you are going to implement a heuristic function and evaluate which child to move to at any given node.
Thats all I have so far, but I'll keep updating if I think of more things, or if you update your content.

In Java, which is the most recommended class for a dictionary data structure?

I need a data structure to store users which should be retrieved by id.
I noticed there are several classes that implement the Map interface. Which one should be my default choice? They all seem quite equivalent to me.
Probably it depends on how many users you plan to have and if you will need them ordered or just getting single items by id.
HashMap uses hash codes to store things so you have constant time for put and get operations but items are always unordered.
TreeMap instead uses a binary tree so you have log(n) time for basic operations but items are kept ordered in the tree.
I would use HashMap because it's the simpler one (remember to give it a suitable initial capacity). Remember that these datastructures are not synchronized by default, if you plan to use it from more than one thread take care of using ConcurrentHashMap.
A middle approach is the LinkedHashMap that uses same structure as HashMap (hashcode and equals method) but it also keeps a doubly linked list of element inserted in the map (mantaining the order of insertion). This hybrid has ordered items (ordered in sense of insertion order, as suggested by comments.. just to be precise but I had already specified that) without performance losses of TreeMap.
No concurrency: use java.util.HashMap
Concurrency: use java.util.concurrent.ConcurrentHashMap
If you want some control on the order used by iterators, use a TreeMap or a LinkedHashMap.
This is covered on the Java Collections Trail, Implementations page.
If they all seem equivalent then you haven't read the documentation. Sun's documentation is pretty much as terse as it gets and provides very important points for making your choices.
Start here.
Your choice could be modified by how you intend to use the data structure, and where you would rather have performance - reads, or writes?
In a user-login system, my guess is that you'll be doing more reads than writes.
(I know I already answered this once, but I feel this needs saying)
Have you considered using a database to store this information? Even if it's SQLite, it'd probably be easier than storing your user database in the program code or loading the entire dataset into memory each time.

Categories