I have looked at similar questions that detail the sorting of Maps and sorting of arrays of primitive data types, but no question directly details the difference between a one-time sort of a Java Map vs primitive data type array ([]).
Primary note* I know that 'TreeMap' is the sorted-by-key version of Map in Java, but I don't know much about the 'behind-the-scenes' of how TreeMap sorts the keys (either while data is being added, or after the data is FINISHED being added)?
Primary note 2* Dijkstra's algorithm in this case is not an EXACT implementation. We are just finding the shortest path of weighted edges in a graph G of size M nodes. This means that the adjacency matrix (format seen below) is of size M x M. This is not a SMART implementation. Pretty much just as base-line as you can get... sorry for the confusion!
We are given an adjacency matrix, where elements are related to each other ('connected') in the following example:
0,1,5 // 0 is connected to 1, and the weight of the edge is 5
0,2,7 // 0 is connected to 2, and the weight of the edge is 7
0,3,8 // 0 is connected to 3, and the weight of the edge is 8
1,2,10 // 1 is connected to 2, and the weight of the edge is 10
1,3,7 // 1 is connected to 3, and the weight of the edge is 7
2,3,3 // 2 is connected to 3, and the weight of the edge is 3
But never mind the input, just assume that we have a matrix of values to manipulate.
We are looking at storing all the possible paths in a "shortest path" algorithm (I'm sure 75% or more of people on SO know Dijkstra's algorithm). This IS for homework, but an implementation question, not a "solve this for me" question.
ASSUME that the size of the matrix is very large (size M x M), maybe more than 50x50. For 50 nodes that would put roughly 50!/2 ≈ 1.52 × 10^64 results in the result list, assuming our algorithm were smart enough to pick out duplicates and not compute the length of a duplicate path (which it is not, because we are noobs at Graph Theory and Java, so please don't suggest any algorithm to avoid duplicates...).
My friend says that a temp sort (using a temporary variable) over the last index of each int[] in a List, where that last element holds the value of the path (ALGORITHM_1), may be faster than a TreeMap (ALGORITHM_2) where the key of the Map is the value of the shortest path.
We were debating which implementation would be faster at finding ALL lengths of the shortest path. We can store the length as the last index of each path (an int[] where the last element is the sum of all the edges in that path) (ALGORITHM_1), OR we can store that sum as the KEY of the Map (ALGORITHM_2).
Because this is a shortest path algorithm (albeit not a great one...), we NEED to sort the results of each path by length, which is the sum of each edge in the graph, in order to find the shortest path.
So the real question is: what would be faster in sorting the results ONLY ONE TIME? Through a Map.sort() algorithm (built into the Java standard library) or through creating a temporary variable to hold the value of the most recent 'length' in each int[]? For example:
myMap.sort(); // Unless TreeMap in Java does 'behind-the-scenes' sorting on keys...
myMap.get(0); // This would return the first element of the map, which is the shortest path
OR
int temp = myList.get(0)[m];    // Store a temp variable that is the current 'shortest path'
for (int[] path : myList) {     // myList is a List<int[]>
    if (temp > path[m]) {       // Check if this path is shorter than the best so far
        temp = path[m];         // Replace temp if it is
    }
}
Note that I haven't actually tested the implementations yet, nor have I checked my own Java syntax, so I don't know if these statements are declared correctly. This is just a theoretical question. Which would run faster? This is my 3rd year of Java and I don't know the underlying data structures used in HashMap, nor the Big O notation of either implementation.
Perhaps someone who knows the Java standard could describe what kind of data structures or implementations are used in HashMap vs (Primitive data type)[], and what the differences in run times might be in a ONE-TIME-ONLY sort of the structures.
I hope that this inquiry makes sense, and I thank anyone who takes the time to answer my question; I always appreciate the time and effort generous people such as yourselves put into helping to educate the newbies!
Regards,
Chris
It may not be necessary to sort your data in order to find the shortest path. Instead, you could iterate through the data and keep track of the shortest path that you've encountered.
Assuming the data is stored in an array of Data objects, with data.pathLength giving the path length,
Data[] data; // array of data
Data shortest = data[0]; // initialize shortest variable
for(int i = 1; i < data.length; i++) {
if(data[i].pathLength < shortest.pathLength)
shortest = data[i];
}
That said, TreeMap is a Red-Black tree, which is a form of balanced binary tree. Unlike a standard binary tree, a balanced binary tree will rotate its branches in order to ensure that it stays approximately balanced, which guarantees O(log n) lookups and insertions. A red-black tree ensures that the longest branch is no more than twice the length of the shortest branch; an AVL tree is a balanced binary tree with even tighter restrictions. Long story short, a TreeMap sorts its data in O(n log n) time (O(log n) per insertion, times n data points). Your one-time array sort will also run in O(n log n) time, assuming you use Mergesort, Quicksort, Heapsort, etc. (as opposed to Bubblesort or another O(n^2) algorithm). You cannot do better than O(n log n) with a comparison sort; you can use a non-comparison sort like Radix Sort that runs in O(n), but such sorts are usually memory hogs and exhibit poor cache behavior, so you're usually better off with one of the O(n log n) comparison sorts.
Since TreeMap and your custom sort are both O(n log n), there's not much theoretical advantage to either one, so just use the one that's easier to implement. TreeMap's more complex data structure does not come for free, however, so your custom sorting approach will probably exhibit slightly better performance, e.g. maybe a factor of 2; that probably isn't worth the complexity of implementing a custom sort as opposed to using a TreeMap, especially for a one-shot sort, but that's your call. If you want to play around with boosting your program's performance, then implement a sorting algorithm that's amenable to parallelization (like Mergesort) and see how much of an improvement you get when you split the sorting work up among multiple threads.
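For example, here is a rough (and deliberately unscientific, no benchmark harness) sketch of that experiment using the JDK's built-in single-threaded sort and its fork/join parallel sort, on hypothetical random path lengths:

import java.util.Arrays;
import java.util.Random;

public class SortComparison {
    public static void main(String[] args) {
        // Hypothetical data: one total length per discovered path.
        int[] pathLengths = new Random(42).ints(1_000_000, 0, 1_000_000).toArray();
        int[] copy = Arrays.copyOf(pathLengths, pathLengths.length);

        long t0 = System.nanoTime();
        Arrays.sort(pathLengths);        // single-threaded dual-pivot quicksort, O(n log n)
        long t1 = System.nanoTime();
        Arrays.parallelSort(copy);       // fork/join parallel merge sort (Java 8+)
        long t2 = System.nanoTime();

        System.out.printf("sort: %d ms, parallelSort: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        System.out.println("shortest path length = " + pathLengths[0]);
    }
}

Timings will vary by machine and JIT warm-up, so treat the numbers as a ballpark rather than a proper benchmark.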
If you want the shortest path, sorting isn't necessary. Just track the shortest path as you finish each path, and update your shortest path when you encounter a shorter path. That'll wind up giving you an O(n), where n is the number of paths. You couldn't realistically store 10^64 paths anyway, so some truncation of the result set will be required.
how much about the 'behind-the-scenes' of how TreeMap sorts the keys (either while data
is being added, or after the data is FINISHED being added)?
TreeMap uses the Red-Black tree algorithm (a self-balancing variant of a BST), in which the containsKey, get, put and remove operations all take O(log n) time, which is very good. The keys are kept in sorted order as each element is added, as the TreeMap documentation explains. Sorting n elements this way will take O(n log n).
I am not sure why you are comparing a Map type - which stores key/value pairs - against an array. You mentioned using the lengths of the shortest paths as keys in the TreeMap, but what are you putting in as values? If you just want to store the path lengths, I would suggest putting them in an array and sorting them with Arrays.sort(), which will also sort in O(n log n) using a different algorithm, Dual-Pivot Quicksort.
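To illustrate the two options side by side, here's a minimal sketch with made-up variable names (note that a plain TreeMap keyed by length silently overwrites paths that happen to have the same length):

import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class ShortestByLength {
    public static void main(String[] args) {
        // Hypothetical paths: each int[] lists the visited nodes, and the last element is the total length.
        List<int[]> paths = List.of(new int[]{0, 1, 3, 12}, new int[]{0, 2, 3, 10}, new int[]{0, 3, 8});

        // ALGORITHM_2: TreeMap keyed by path length (kept sorted by key on every put).
        TreeMap<Integer, int[]> byLength = new TreeMap<>();
        for (int[] path : paths) {
            byLength.put(path[path.length - 1], path);
        }
        System.out.println("shortest length via TreeMap = " + byLength.firstKey());

        // Array alternative: just the lengths, sorted once with Arrays.sort.
        int[] lengths = new int[paths.size()];
        for (int k = 0; k < lengths.length; k++) {
            lengths[k] = paths.get(k)[paths.get(k).length - 1];
        }
        Arrays.sort(lengths);
        System.out.println("shortest length via Arrays.sort = " + lengths[0]);
    }
}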
Hope this helps!
Related
So I was thinking of a problem I find very interesting and I would like to share the concept of it. The problem starts off with a hypothetical data structure that you define (it can be a list, array, tree, binary search tree, red-black tree, B-tree, etc.). The goal is obviously to optimize insertion, search, delete and update (you can consider update as a search with replacement); the time complexity has to be as low as possible for every single type of operation (ideally O(1) or O(log(n)); try not to use an O(n) solution). The second part of the problem is that, during a normal day of work, this structure receives new elements with keys of increasing value starting from 1 up to N, where N can be Long.MAX_VALUE. Obviously, when a new key is given it has to be inserted immediately, so the keys will arrive as follows:
[1,2,3,4,...,N]
I think I am close to the solution of this problem but I am missing a little bit more optimization. I was thinking of using either a tree or a hashtable, but in the case of a hashtable there is a problem: when N becomes very high, the entire structure needs to be rehashed, or the complexity would become O(n). This is not a problem with a tree, but I think the tree may degenerate into a sequence of elements (keep in mind that we have to insert every new element as soon as it arrives), like this:
And in this case you can clearly see that this tree is not really a tree, it's a list; using a BST would give the same result.
I think the correct structure to use is a BST (or something like it, for example a red-black tree), together with a way to always keep it balanced, but I am missing something.
If the "key" is an integer and the key are generated by incrementing a counter starting from 1, then the obvious data structure for representing the key -> value mapping is a ValueType[]. Yes, an array.
There are two problems with this:
Arrays do not "grow" in Java.
Solutions:
Preallocate the array to be big enough to start with.
Use an ArrayList instead of an array.
"Borrow" the algorithm that ArrayList uses to grow a list and use it with a bare array.
Arrays cannot have more than Integer.MAX_VALUE elements. (And ArrayList has the same problem.)
Solution: use an array of arrays, and do some arithmetic to convert the long keys into a pair of ints for indexing the arrays.
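A minimal sketch of that last idea (the class name, chunk size and growth policy are all made up for illustration):

import java.util.Arrays;

public class LongKeyedStore<V> {
    private static final int CHUNK_SIZE = 1 << 20;   // about a million slots per chunk (arbitrary)
    private Object[][] chunks = new Object[16][];    // outer array grown on demand

    public void put(long key, V value) {
        long idx = key - 1;                          // keys start at 1
        int chunk = (int) (idx / CHUNK_SIZE);
        int offset = (int) (idx % CHUNK_SIZE);
        if (chunk >= chunks.length) {
            chunks = Arrays.copyOf(chunks, Math.max(chunk + 1, chunks.length * 2));
        }
        if (chunks[chunk] == null) {
            chunks[chunk] = new Object[CHUNK_SIZE];
        }
        chunks[chunk][offset] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(long key) {
        long idx = key - 1;
        int chunk = (int) (idx / CHUNK_SIZE);
        int offset = (int) (idx % CHUNK_SIZE);
        if (chunk >= chunks.length || chunks[chunk] == null) {
            return null;
        }
        return (V) chunks[chunk][offset];
    }
}

Both put and get are O(1), and the inner chunks are only allocated as the counter actually reaches them.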
Given an array with n elements, how do you find the number of elements greater than or equal to a given value x, in a given range from index i to index j, in O(log n) or better complexity?
My implementation is this, but it is O(n):
int count = 0;
for (int a = i; a <= j; a++)
    if (p[a] >= x)   // p[] is the array containing n elements
        count++;
If you are allowed to preprocess the array, then with O(n log n) preprocessing time, we can answer any [i,j] query in O(log n) time.
Two ideas:
1) Observe that it is enough to be able to answer [0,i] and [0,j] queries.
2) Use a persistent* balanced order statistics binary tree, which maintains n versions of the tree, version i is formed from version i-1 by adding a[i] to it. To answer query([0,i], x), you query the version i tree for the number of elements > x (basically rank information). An order statistics tree lets you do that.
*: persistent data structures are an elegant functional programming concept for immutable data structures and have efficient algorithms for their construction.
If the array is sorted you can locate the first value less than X with a binary search and the number of elements greater than X is the number of items after that element. That would be O(log(n)).
If the array is not sorted there is no way of doing it in less than O(n) time since you will have to examine every element to check if it's greater than or equal to X.
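For the sorted case, a minimal sketch of that binary search (ascending order assumed): find the first index whose value is at least x, and everything from that index onward counts.

// Number of elements >= x in an ascending sorted array, in O(log n).
static int countAtLeast(int[] sorted, int x) {
    int lo = 0, hi = sorted.length;      // invariant: the answer index is in [lo, hi]
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (sorted[mid] >= x) {
            hi = mid;                    // mid could be the first element >= x
        } else {
            lo = mid + 1;                // everything up to mid is < x
        }
    }
    return sorted.length - lo;           // elements from index lo to the end are all >= x
}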
Impossible in O(log N) because you have to inspect all the elements, so an O(N) method is expected.
The standard algorithm for this is based on quicksort's partition, sometimes called quick-select.
The idea is that you don't sort the array, but rather just partition the section containing x, stopping when x is your pivot element. After the procedure completes, all elements greater than or equal to x are to the right of x. This is the same procedure as when finding the k-th largest element.
Read about a very similar problem at How to find the kth largest element in an unsorted array of length n in O(n)?.
The requirement index i to j is not a restriction that introduces any complexity to the problem.
Given your requirements where the data is not sorted in advance and constantly changing between queries, O(n) is the best complexity you can hope to achieve, since there's no way to count the number of elements greater than or equal to some value without looking at all of them.
It's fairly simple if you think about it: you cannot avoid inspecting every element of a range for any type of search if you have no idea how it's represented/ordered in advance.
You could construct a balanced binary tree, or even radix sort on the fly, but you're just pushing the overhead elsewhere, to the same linear or worse, linearithmic O(N log N) complexity, since such algorithms once again have you inspecting every element in the range first in order to sort it.
So there's actually nothing wrong with O(N) here. That is the ideal, and you're looking at either changing the whole nature of the data involved outside to allow it to be sorted efficiently in advance or micro-optimizations (ex: parallel fors to process sub-ranges with multiple threads, provided they're chunky enough) to tune it.
In your case, your requirements seem rigid so the latter seems like the best bet with the aid of a profiler.
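As a small illustration of that last point, a parallel version of the brute-force count is a one-liner with Java streams (still O(n) work in total, just spread across threads):

import java.util.stream.IntStream;

// Counts elements >= x in p[i..j] (inclusive bounds) using the common fork/join pool.
static long countAtLeast(int[] p, int i, int j, int x) {
    return IntStream.rangeClosed(i, j)
                    .parallel()
                    .filter(a -> p[a] >= x)
                    .count();
}

Whether this actually pays off depends on the range size; for small ranges the thread-coordination overhead dominates, which is exactly what a profiler would tell you.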
I have a sorted array, let's say D={1,2,3,4,5,6}, and I want to add the number 5 in the middle. I can do that by inserting the value 5 in the middle and moving the other values one step to the right.
The problem is that I have an array of length 1,000 and I need to do that operation 10,000 times, so I need a faster way.
What options do I have? Can I use LinkedLists for better performance?
That depends on how you add said numbers. If only in ascending or descending order - then yes, LinkedList will do the trick, but only if you keep the node reference in between inserts.
If you're adding numbers in arbitrary order, you may want to deconstruct your array, add the new entries and reconstruct it again. This way you can use a data structure that's good at adding and removing entries while maintaining "sortedness". You have to relax one of your assumptions however.
Option 1
Assuming you don't need constant time random access while adding numbers:
Use a binary sorted tree.
The downside - while you're adding, you cannot read or reference an element by its position, not easily at least. Best case scenario - you're using a tree in which each node keeps track of how many elements its left subtree has, so you can get the ith element in log(n) time. You can still get pretty good performance if you're just iterating through the elements, though.
Total runtime is down to n * log(n) from n^2. Random access is log(n).
Option 2
Assuming you don't need the elements sorted while you're adding them.
Use a normal array, but add elements to the end of it, then sort it all when you're done.
Total runtime: n * log(n). Random access is O(1), however elements are not sorted.
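A short sketch of option 2, assuming all the numbers are known before you need the sorted order (the sizes and random values below are just placeholders):

import java.util.Arrays;
import java.util.Random;

public class AddThenSort {
    public static void main(String[] args) {
        int[] values = new int[1_000 + 10_000];  // room for the original array plus 10,000 insertions
        int size = 0;

        Random rnd = new Random(1);              // stand-in for wherever the numbers come from
        for (int k = 0; k < values.length; k++) {
            values[size++] = rnd.nextInt(100);   // just append; no shifting while adding
        }

        Arrays.sort(values, 0, size);            // one O(n log n) sort at the end
        System.out.println("smallest = " + values[0] + ", largest = " + values[size - 1]);
    }
}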
Option 3
(This is kinda cheating, but...)
If you have a limited number of values, then employing the idea of BucketSort will help you achieve great performance. Essentially - you would replace your array with a sorted map.
Runtime is O(n), random access is O(1), but it's only applicable to a very small number of situations.
TL;DR
Getting arbitrary values, quick adding and constant-time positional access, while maintaining sortedness is difficult. I don't know any such structure. You have to relax some assumption to have room for optimizations.
A LinkedList will probably not help you very much, if at all. Basically you are exchanging the cost of shifting every value on insert with the cost of having to traverse each node in order to reach the insertion point.
This traversal cost will also need to be paid whenever accessing each node. A LinkedList shines as a queue, but if you need to access the internal nodes individually it's not a great choice.
In your case, you want a sorted tree of some sort. A BST (Binary Search Tree, sometimes referred to as a sorted binary tree) is one of the simplest types and is probably a good place to start.
A good option is a TreeSet, which is likely functionally equivalent to how you were using an array, if you simply need to keep track of a set of sorted numbers.
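For example (keeping in mind that a TreeSet silently drops duplicates, so the second 5 below is simply ignored):

import java.util.TreeSet;

TreeSet<Integer> sorted = new TreeSet<>();
for (int v : new int[]{1, 2, 3, 4, 5, 6}) {
    sorted.add(v);                       // O(log n) insertion into a red-black tree
}
sorted.add(5);                           // duplicate: add() returns false and the set is unchanged
System.out.println(sorted);              // [1, 2, 3, 4, 5, 6], always iterated in sorted order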
My goal is a sorted data structure that can accomplish 2 things:
Fast insertion (at the location according to sort order)
I can quickly segment my data into the sets of everything greater than or less than or equal to an element. I need to know the size of each of these partitions, and I need to be able to "get" these partitions.
Currently, I'm implementing this in Java using an ArrayList, which provides #2 very easily since I can perform a binary search (Collections.binarySearch) and get an insertion index telling me at what point an element would be inserted. Then, based on the fact that indices range from 0 to the size of the array, I immediately know how many elements are greater than or smaller than my element, and I can easily get at those elements (as a sublist). However, this doesn't have property #1, and results in too much array copying.
This makes me want to use something like a SkipList or RedBlackTree that could perform the insertions faster, but then I can't figure out how to satisfy property #2 without making it take O(N) time.
Any suggestions would be appreciated. Thanks
EDIT: Thanks for the answers below that reference data structures that perform the insertion in O(log N) time and that can partition quickly as well, but I want to highlight the size() requirement - I need to know the size of these partitions without having to traverse the entire partition (which, according to this, is what TreeSet does). The reasoning behind this is that in my use case I maintain my data using several different copies of data structures, each using a different comparator, and then need to ask "according to which comparator is the set of all things larger than a particular element smallest?". In the ArrayList case, this is actually easy and takes only O(Y log N), where Y is the number of comparators, because I just binary search each of the Y arrays and return the ArrayList with the highest insertion index. It's unclear to me how I could do this with a TreeSet without taking O(YN).
I should also add that an approximate answer for the insertion index would still be valuable even if it couldn't be solved exactly.
Use a common Java TreeSet. Insertion takes O(log N), so #1 of your requirements is done. Here's a quote from the documentation:
This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains).
And as it implements the NavigableSet interface, you have #2 of your requirements with the following methods:
tailSet(someElem) returns a Set view starting from someElem till the last element
headSet(someElem) returns a Set view starting from the first element till someElem
subSet(fromElem, toElem) returns a Set view starting from fromElem till toElem
These operations are overloaded with versions that include/exclude the bounds provided.
TreeSet is quite flexible: it allows you to define a Comparator to order the Set in a custom way, or you can also rely on the natural ordering of the elements.
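A short sketch of those views in action:

import java.util.Collections;
import java.util.TreeSet;

TreeSet<Integer> set = new TreeSet<>();
Collections.addAll(set, 10, 20, 30, 40, 50);

System.out.println(set.headSet(30));         // [10, 20]       strictly less than 30
System.out.println(set.tailSet(30));         // [30, 40, 50]   greater than or equal to 30
System.out.println(set.subSet(20, 50));      // [20, 30, 40]   from 20 (inclusive) to 50 (exclusive)
System.out.println(set.tailSet(30, false));  // [40, 50]       exclusive-bound variant from NavigableSet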
EDIT:
As per the requirement that the returned subsets' size() operation not be O(n), I'm afraid there's no ready-made implementation in the Java API.
It is true that the set views returned by TreeSet's range operations implement size() by 'jumping' to the first element of the view in O(log n) time and then iterating over the subsequent elements, adding 1 in each iteration, until the end of the subset is reached.
I must say this is quite unfortunate, since it's not always needed to traverse the returned subset view, but sometimes, knowing the size of the subset in advance can be quite useful (as it's your use case).
So, in order to fulfil your requirement, you need another structure, or at least, an auxiliary structure. After some research, I suggest you use a Fenwick tree. A Fenwick tree is also known as a Binary Indexed Tree (BIT), and can be either immutable or mutable. The immutable version is implemented with an array, while the mutable version could be implemented with a balanced binary tree, i.e. a Red-Black tree (Java TreeSet is actually implemented as a Red-Black tree). Fenwick trees are mainly used to store frequencies and calculate the sum of all frequencies up to a given element in O(log n) time.
Please refer to this question here on Stack Overflow for a complete introduction to this quite unknown but yet incredibly useful structure. (As the explanation is here in Stack Overflow, I won't copy it here).
Here's another Stack Overflow question asking how to properly initialize a Fenwick tree, and here's actual Java code showing how to implement Fenwick tree's operations. Finally, here's a very good theoretic explanation about the structure and the underlying algorithms being used.
The problem with all the samples in the web is that they use the immutable version of the structure, which is not suitable to you, since you need to interleave queries with adding elements to the structure. However, they are all very useful to fully understand the structure and algorithms being used.
My suggestion is that you study Java TreeMap's implementation and see how to modify/extend it so that you can turn it into a Fenwick tree that keeps 1 as a value for every key. This 1 would be each key's frequency. So Fenwick tree's basic operation getSum(someElement) would actually return the size of the subset from first element up to someElement, in O(log n) time.
So the challenge is to implement a balanced tree (a descendant of Java's Red-Black TreeMap, actually) that implements all the Fenwick tree operations you need. I believe you'd be done with getSum(someElement), but maybe you could also extend the returned subtree range views so that they all refer to getSum(someElement) when implementing the size() operation for range views.
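For reference, here is a minimal array-based Fenwick tree sketch for the counting part alone, assuming the keys (or their ranks after coordinate compression) fit in the range 1..maxValue; wiring it into a TreeMap-like structure, as described above, is the part left to design:

// Minimal Fenwick tree (Binary Indexed Tree) over indices 1..maxValue.
// add(v, +1) records one occurrence of value v; countUpTo(v) returns how many
// recorded values are <= v; both run in O(log maxValue) time.
public class FenwickCounter {
    private final int[] tree;

    public FenwickCounter(int maxValue) {
        tree = new int[maxValue + 1];
    }

    public void add(int value, int delta) {
        for (int i = value; i < tree.length; i += i & -i) {
            tree[i] += delta;
        }
    }

    public int countUpTo(int value) {
        int sum = 0;
        for (int i = value; i > 0; i -= i & -i) {
            sum += tree[i];
        }
        return sum;
    }
}

After calling add(x, 1) for every inserted key x, countUpTo(x) gives the size of the partition "everything less than or equal to x" without traversing it.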
Hope this helps, at least I hope it's a good place to start. Please, let me know if you need clarifications, as well as examples.
If you don't need duplicate elements (or if you can make the elements look distinct), I'd use a java.util.TreeSet. It meets your stated requirements.
O(log n) sorted insertion due to binary tree structure
O(log n) segmentation time using in-place subsets
Unfortunately, the O(log n) segmentation time is effectively slowed to O(n) by your requirement to always know the size of the segment, due to the reason in the answer you linked. The in-place subsets don't know their size until you ask them, and then they count. The counted size is stored, but if the backing set is changed in any way, the subset has to count again.
I think the best data structure for this problem would be a B-Tree with a dense index. Such a B-Tree is built from:
- inner nodes containing only pointers to child nodes
- leafs containing pointers to paged arrays
- a number of equal-sized-arrays (pages)
Unfortunately there are few generic implementations of a B-Tree in Java, probably because so many variations exist.
The cost of insertion would be
O(log(n)) to find the position
O(p) to insert a new element into a page (where p is the constant page size)
Maybe this data structure also covers your segmentation problem. If not: The cost of extracting would be
O(log(n)) to find the borders
O(e) to copy the extract (where e is the size of the extract)
One easy way to get what you want involves augmenting your favourite binary search tree data structure (red-black trees, AVL trees, etc...) with left and right subtree sizes at each node --- call them L-size and R-size.
Assume that updating these fields in your tree data structures can be done efficiently (say constant time). Then here is what you get:
Insertion, deletion, and all the regular binary search tree operations remain as efficient as your choice of data structure --- O(log n) for red-black trees.
Given a key x, you can get the number of elements in your tree that have keys less than x in O(log n) time, by descending down the tree to find the appropriate location for x, summing up the L-sizes (plus one for the actual node you're traversing) each time you "go right". The "greater than" case is symmetrical.
Given a key x, you can get the sorted list x_L of elements that are less than x in O(log n + |x_L|) time by, again, descending down the tree to find the appropriate location for x, and each time you go right you tag the node you just traversed, appending it to a list h_L. Then doing in-order traversals of each of the nodes in h_L (in order of addition to h_L) will give you x_L (sorted). The "greater than" case is symmetrical.
Finally, for my answer to work, I need to guarantee you that we can maintain these L- and R-sizes efficiently for your choice of specific tree data structure. I'll consider the case of red-black trees.
Note that maintaining L-sizes and R-sizes is done in constant time for vanilla binary search trees (when you add a node starting from the root, just add one to L-sizes if the node should go in the left subtree, or one to the R-sizes if it goes in the right subtree). Now the additional balancing procedures of red-black trees only alter the tree structure through local rotations of nodes --- see Wikipedia's depiction of rotations in red-black trees. It's easy to see that the post-rotation L-size and R-size of P and Q can be recalculated from the L-sizes and R-sizes of A,B,C. This only adds a constant amount of work to the red-black tree's operations.
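To make the bookkeeping concrete, here is a sketch using a plain (unbalanced) BST that stores one subtree-size field per node instead of separate L-sizes and R-sizes; a production version would sit on top of a red-black tree and patch the sizes inside the rotations, as described above:

// Each node stores the size of its own subtree; countLessThan runs in O(height).
public class SizedBst {
    private static final class Node {
        int key, size = 1;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;

    private static int size(Node n) { return n == null ? 0 : n.size; }

    public void insert(int key) {
        root = insert(root, key);
    }

    private Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else             n.right = insert(n.right, key);
        n.size = 1 + size(n.left) + size(n.right);   // recompute subtree size on the way back up
        return n;
    }

    // Number of stored keys strictly less than key.
    public int countLessThan(int key) {
        int count = 0;
        Node n = root;
        while (n != null) {
            if (key <= n.key) {
                n = n.left;
            } else {                         // "going right": everything in the left subtree,
                count += size(n.left) + 1;   // plus this node, is smaller than key
                n = n.right;
            }
        }
        return count;
    }
}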
I have a problem with my assignment which requires me to solve a problem that is similar to range-minimum-query. The problem is roughly described below:
I am supposed to write a Java program which reads in a large bunch of integers (about 100,000) and stores them in some data structure. Then, my program must answer queries for the minimum number in a given range [i,j]. I have successfully devised an algorithm to solve this problem. However, it is just not fast enough.
The pseudo-code for my algorithm is as follows:
// Read all the integers into an ArrayList
// For each query,
// Read in range values [i,j] (note that i and j are "actual index" + 1 in this case)
// Push element at index i-1 into a Stack
// Loop from index i to j-1 in the ArrayList (tracking the current index with variable k)
[Begin loop]
// If the element at k is less than the one at the top of the stack, push the element at k onto the Stack.
[End of loop]
Could someone please advise me on what I could do so that my algorithm would be fast enough to solve this problem?
The assignment files can be found at this link: http://bit.ly/1bTfFKa
I have been stumped by this problem for days. Any help would be much appreciated.
Thanks.
Your problem is a static range minimum query (RMQ). Suppose you have N numbers. The simplest approach is square-root decomposition: store the numbers in an array of size N, plus an auxiliary array of size sqrt(N) that holds the minimum of each block of sqrt(N) consecutive elements. This should work since N is not very large, but if you have many queries you may want to use a different algorithm.
That being said, the fastest approach is to build a Sparse Table out of the numbers, which allows you to answer each query in O(1). Constructing the sparse table takes O(N log N), which, given N = 10^5, should be just fine.
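A minimal sparse table sketch (O(N log N) construction, O(1) per query, valid only because the array never changes between queries):

// Sparse table for range-minimum queries on a static array.
public class SparseTable {
    private final int[][] table;   // table[k][i] = min of a[i .. i + 2^k - 1]
    private final int[] log;       // log[len] = floor(log2(len))

    public SparseTable(int[] a) {
        int n = a.length;
        log = new int[n + 1];
        for (int i = 2; i <= n; i++) log[i] = log[i / 2] + 1;

        int levels = log[n] + 1;
        table = new int[levels][n];
        table[0] = a.clone();
        for (int k = 1; k < levels; k++) {
            for (int i = 0; i + (1 << k) <= n; i++) {
                table[k][i] = Math.min(table[k - 1][i], table[k - 1][i + (1 << (k - 1))]);
            }
        }
    }

    // Minimum of a[i..j], inclusive, in O(1) via two overlapping power-of-two windows.
    public int min(int i, int j) {
        int k = log[j - i + 1];
        return Math.min(table[k][i], table[k][j - (1 << k) + 1]);
    }
}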
Finally, the ultimate RMQ algorithm is a Segment Tree, which also supports updates (single elements as well as ranges); it takes O(N) to construct the Segment Tree, and O(log N) per query and update.
All of these algorithms are explained very well here.
For more information on Segment Trees, see these tutorials I wrote myself.
link
Good Luck!