So, .NET and Java have spoiled me by never requiring me to learn any sorting algorithms, but now I need to sort an array in a language that doesn't offer that luxury. I was able to pick up bubble sort with little issue. However, some sources discourage bubble sort because of its horrible performance: n^2 comparisons in the average and worst cases. Bubble sort gets the job done, but I'm about to tackle an array with 100,000+ elements, and I'm worried that performance could be an issue at that scale. On the other hand, some of the other algorithms look pretty intimidating in terms of complexity. My question is: what would be a good follow-up to bubble sort in terms of better performance, without wandering off into a complexity wasteland in the implementation?
As a side note, I'm an analyst who programs as needed, not a CS major. Needless to say, there are some holes I have yet to fill in my programming expertise. Thanks :)
There are many options, each with their own trade-offs. As you've discovered, Bubble Sort's trade-offs are that it's (a) simple, but (b) slow with even remotely large arrays.
Quicksort is a good one, but a naive implementation can run into memory (stack depth) issues on unlucky input.
I've used Heapsort with much success, but it's not guaranteed to be stable (though I've never had problems).
Bogosort is fun to implement and talk about, but entirely impractical.
and so on...
Having a good understanding of the data to be sorted helps one decide which algorithm is best. For example:
How large will the array be?
Is there a chance it's already sorted or partially sorted?
What kind of data does the array contain?
How difficult/expensive is it to compare two elements in the array?
How difficult/expensive is it to determine if the array is sorted?
and so on...
There is no one sorting algorithm that's better than all others. Choosing what fits your needs is something that you'll pick up over time and with practice.
Take your time to learn Quicksort, it's a great algorithm and not that complicated if you go slow.
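If it helps to see the shape of it, here is a minimal in-place quicksort sketch in Java (Lomuto partition, last element as pivot); this is only an illustration, and a production-quality version would choose pivots more carefully and fall back to insertion sort for tiny ranges:

// Minimal in-place quicksort sketch (Lomuto partition, last element as pivot).
static void quicksort(int[] a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi];
    int i = lo;                                  // boundary of the "less than pivot" region
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;   // put the pivot in its final position
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

Call it as quicksort(data, 0, data.length - 1).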
If you want some sorting algorithms just to get your feet wet(ter), I would recommend Insertion Sort and Selection Sort: they are generally better than Bubble Sort and are quick to understand and implement. Merge sort is also common in algorithm courses. You will get much more use out of Quicksort, though.
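For reference, here is a minimal insertion sort sketch in Java; it is short, easy to reason about, and fast on small or nearly sorted arrays:

// Minimal insertion sort sketch: grow a sorted prefix one element at a time.
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        // Shift larger elements of the sorted prefix one slot to the right.
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;                          // drop the element into place
    }
}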
You should also understand the difference between stable and non-stable sorting, if you don't already. A stable sort will not change the order of items with the same key, while a non-stable could.
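A quick illustration of stability, relying on the fact that Java's Arrays.sort for object arrays is documented to be stable (the records here are made up for the example):

import java.util.Arrays;
import java.util.Comparator;

public class StableDemo {
    public static void main(String[] args) {
        // (name, grade) records; "alice" and "carol" share the key 90.
        String[][] records = {
            {"alice", "90"}, {"bob", "85"}, {"carol", "90"}, {"dave", "70"}
        };
        // Arrays.sort on objects is stable, so "alice" stays ahead of "carol"
        // after sorting by grade alone.
        Arrays.sort(records, Comparator.comparingInt((String[] r) -> Integer.parseInt(r[1])));
        System.out.println(Arrays.deepToString(records));
        // [[dave, 70], [bob, 85], [alice, 90], [carol, 90]]
    }
}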
Related
I know that when the number of elements is doubled, the time to sort with selection sort and insertion sort quadruples.
How about merge sort and quick sort?
Let's say it takes 2 seconds to sort 100 items using merge sort.
How long would it take to sort 200 items using merge sort and quick sort?
Merge sort is O(n log n). Quicksort is typically O(n log n) as well, but in the worst case it ends up closer to O(n^2). I'll leave the math to you, as it's fairly simple. The nice thing about common sorting algorithms is that they are very well documented online; there are likely plenty of calculators that could give you specifics. As for how long it would actually take to run, I'm no expert, but I'm guessing that would depend largely on hardware. You should be more concerned with the big-O of whatever you're running, because that's the only thing you can really control as a programmer.
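For the merge-sort half of the question, a rough back-of-the-envelope calculation (assuming the running time scales as n log n and nothing else changes): t(200) / t(100) ≈ (200 × log 200) / (100 × log 100) = 2 × (log 200 / log 100) ≈ 2 × 1.15 ≈ 2.3, so the 2 seconds become roughly 4.6 seconds, compared with the 8 seconds a quadratic sort like selection or insertion sort would need. Quicksort would land in the same ballpark on average, but could approach the quadratic figure in its worst case.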
I always have a habit of creating lots of classes while solving graph theory problems, like:
class Node{
......
}
class Edge{
......
}
Often this runs me into performance and speed issues. Hence I feel that using arrays for storing graphs is faster than storing them in user-defined classes and structures like Lists and Maps, though the latter provide more flexibility and readability. So, does using arrays and language primitives for representing graphs really give any significant performance boost? If yes, which should be the general choice when coding in Java?
Measure it.
Build a solution, put it into a profiler and look where most of the computation time is used up. You cannot sensibly argue about this topic in general, you need experiments.
That said: in 98% of cases, you are better off writing readable OO code. If it turns out to be too slow, narrow down the method that causes the trouble (with a profiler) and try to make that method faster. Don't start writing ugly code in the hope that it might be faster than the nice version.
The problem with arrays is that they imply a huge waste of memory for big graphs, which usually have few links between nodes.
The performance boost you would get would depend not only on the data structure, but also on the type of graph and operations you perform on it.
E.g. deleting a node could be very expensive on an array implementation.
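To make the trade-off concrete, here is a hedged sketch of the two usual representations for a graph with n nodes (the class and method names are just for illustration):

import java.util.ArrayList;
import java.util.List;

class GraphRepresentations {
    // Adjacency matrix: O(n^2) memory no matter how few edges exist,
    // but O(1) edge lookup and no per-edge object overhead.
    static boolean[][] buildMatrix(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }

    // Adjacency list: memory proportional to the number of edges,
    // which is usually what you want for sparse graphs.
    static List<List<Integer>> buildLists(int n, int[][] edges) {
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
            adj.get(e[1]).add(e[0]);
        }
        return adj;
    }
}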
Well, yes, sometimes object-based graphs are faster than arrays; it basically depends on your requirements. Sometimes an array is the best choice, and sometimes an object structure is. There are many collections available in Java, like LinkedList, ArrayList, and Vector. All of them are reasonably fast. You just have to choose the one that best matches your requirements.
I have this small presentation about algorithms with a group of nerds, and I was randomly tasked to convince them that shell sort is better than merge sort. I have been reading for almost a week, but no matter how much I read about merge sort and shell sort, I find merge sort better than shell sort.
Are there any advantages of shell sort over merge sort? I mean, under what circumstances is shell sort better than merge sort? I might have missed something, but I don't know what.
Any tips would be fine, or if possible, can you link me to something helpful?
You have to remember the context in which shellsort was proposed: shellsort was published in 1959; quicksort, in 1961; mergesort, in 1948 (OK, that was a bit surprising). The computers of the day were slow and had small memories. Thus the asymptotic advantage of mergesort was hardly relevant compared to the increased complexity of implementation and code. In fact, shellsort gets the insertion-sort (quadratic) fallback of modern practical mergesorts for free, since a shellsort pass with a gap of 1 is insertion sort.
It was not known then how to do an efficient in-place merge (and even now, no one implements it, because it's wildly inefficient in practice).
Shellsort has an uncomplicated nonrecursive implementation. Recursion in higher-level languages was confined to LISP (impractical then, not to mention lacking an array type) and the as-yet unimplemented ALGOL 60 standard.
Shellsort's running time improves a lot on mostly sorted data. (It's no Timsort though.)
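For concreteness, here is a minimal nonrecursive shellsort sketch in Java, using the simple gap sequence n/2, n/4, ..., 1 (better gap sequences exist; this is only to show the structure):

// Minimal shellsort sketch: insertion sort over progressively smaller gaps.
static void shellSort(int[] a) {
    for (int gap = a.length / 2; gap > 0; gap /= 2) {
        // Gapped insertion sort: compare each element with the one 'gap' positions back.
        for (int i = gap; i < a.length; i++) {
            int key = a[i];
            int j = i;
            while (j >= gap && a[j - gap] > key) {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = key;
        }
    }
}

With gap = 1 the inner loop is exactly insertion sort, which is the free fallback mentioned above.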
Merge sort is normally faster than shell sort, but shell sort is in place. Quicksort is faster when sorting the data itself, but merge sort is usually faster when sorting an array of pointers or indices to the data: if the compare overhead of the elements is greater than the move overhead of the pointers or indices, merge sort wins, since it uses fewer compares but more moves than quicksort. If you're sorting an array of fairly random integers, then counting/radix sort is fastest.
As mentioned, merge sort was published in 1948. Merge sort on old mainframes was implemented on tape drives or disk drives. For tape drives, there were/are variations of merge sort:
http://en.wikipedia.org/wiki/Polyphase_merge_sort
http://en.wikipedia.org/wiki/Oscillating_merge_sort
Natural merge sort takes advantage of any existing natural ordering, but has the overhead of keeping track of variable-size runs. With tape drives, this can/could be done using single file marks for end of runs and double file marks for end of data. Early disk drives with variable-sized blocks could implement something similar (using small blocks to indicate end of run / end of data).
http://en.wikipedia.org/wiki/Merge_sort#Natural_merge_sort
An alternative to natural merge sort is Timsort, where natural and/or forced ordering using insertion sort is used to create runs of a minimum size during the initial pass:
http://en.wikipedia.org/wiki/Timsort
The "classic" merge sort is bottom up merge sort, and in the case of an external sort, using tape drives or disk drives, the initial pass sorts data in memory, to skip past the initial merge passes, similar to tim sort, except that the memory sort may not have been insertion sort, and generally an array of pointers or indices were sorted, and the data written according to those pointers or indices, as opposed to sorting data in memory before writing. On some systems, a single I/O with multiple pointers / lengths to data is/was used. SATA / IDE / SCSI PC controllers have a set of descriptors that hold address / length data to deal with paged memory, but I don't know if any high end sort programs for PC's use the descriptors to write a set of records for merge sort with a single I/O.
I'm not sure when top-down merge sort was first published. Rather than starting off with some fixed or variable run size and using iteration to advance indices or pointers while merging runs, it recursively splits via indices or pointers until they represent some small fixed run size, typically a run size of 1, and only then does any actual merging of data take place. Whatever advantage there might be from the cache locality of a depth-first / left-first ordering of run merges is offset by the overhead of recursion, and generally top-down merge sort is slightly slower (about 5%) than bottom-up merge sort.
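As an illustration of the iterative structure being described, here is a minimal bottom-up merge sort sketch in Java (no recursion; the run width doubles on every pass):

// Minimal bottom-up merge sort sketch: merge runs of width 1, 2, 4, ...
static void bottomUpMergeSort(int[] a) {
    int n = a.length;
    int[] buf = new int[n];
    for (int width = 1; width < n; width *= 2) {
        for (int lo = 0; lo < n; lo += 2 * width) {
            int mid = Math.min(lo + width, n);
            int hi = Math.min(lo + 2 * width, n);
            // Merge a[lo..mid) and a[mid..hi) into buf[lo..hi).
            int i = lo, j = mid, k = lo;
            while (i < mid && j < hi) buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
            while (i < mid) buf[k++] = a[i++];
            while (j < hi)  buf[k++] = a[j++];
        }
        System.arraycopy(buf, 0, a, 0, n);       // this pass's output is the next pass's input
    }
}

A real external sort would replace the in-memory runs with runs on tape or disk, but the merge pattern is the same.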
I need help with an optimized solution for the following problem: http://acm.ro/prob/probleme/B.pdf.
Depending on the cost, I traverse the graph using either only new edges or only old edges. Both of them work, but I need to pass the tests within a limited number of milliseconds, and the algorithm for the old edges is dragging me down.
I need a way to optimize this; any suggestions are welcome.
EDIT: for safety reasons I am taking the algorithm down. I'm sorry, I'm new, so I don't know what I need to do to delete the post now that it has answers.
My initial algorithmic suggestion relied on an incorrect reading of the problem. Further, a textbook breadth-first search or Dijkstra on a graph of this size is unlikely to finish in a reasonable amount of time. There's likely an early-termination trick that you can employ for large cases; the thread Niklas B. linked to suggests several (as well as some other approaches). I couldn't find an early-termination trick that I could prove worked.
These micro-optimisation notes are still relevant, however:
I'd suggest not using Java's built-in Queue container for this (or any other built-in Java containers for anything else in a programming contest). It turns your 4-byte int into a gargantuan Integer structure. This is very probably where your blowup is coming from. You can use a 500000-long int[] to store the data in your queue and two ints for the front and back of the queue instead. In general, you want to avoid instantiating Objects in Java contest programming because of their overhead.
Likewise, I'd suggest representing the edges of the graph as either a single big int[] or a 500000-long int[][] to cut down on that piece of overhead.
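As a concrete illustration of the first point, here is a minimal array-backed int queue sketch for a BFS (the capacity of 500000 just mirrors the bound mentioned above; adjust it to the problem's limits):

// Minimal array-backed int queue: no Integer boxing, no per-element object overhead.
class IntQueue {
    private final int[] data;
    private int head = 0;                        // index of the next element to dequeue
    private int tail = 0;                        // index where the next element is enqueued

    IntQueue(int capacity) { data = new int[capacity]; }

    boolean isEmpty() { return head == tail; }
    void add(int x)   { data[tail++] = x; }
    int poll()        { return data[head++]; }
}

In a BFS each node is enqueued at most once, so new IntQueue(500000) never needs to wrap around.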
I only saw one queue in your code. That means you are searching from one direction only.
You may want to take a look at
Bidirectional Search
In Java, I'm creating a SortedSet from a list which is always going to be ordered (but is only of type ArrayList). I figure adding the elements one by one is going to have pretty poor performance (in the case, for example, of an AVL tree), as it will have to rebalance the tree a lot.
My question is: how should I be creating this set, in a way that builds a balanced tree as fast as possible?
The specific implementation I was planning on using was either IntRBTreeSet or IntAVLTreeSet from http://fastutil.dsi.unimi.it/docs/it/unimi/dsi/fastutil/ints/IntSortedSet.html
After writing this up, I think the poor performance won't affect me too much anyway (it's too small an amount of data), but I'm still interested in how it would be done in the general case.
A set with a tree implementation will have the middle element of your list at the top (the root). So the algorithm would be as follows:
find the middle element of the List
insert it into the set
repeat for both sub-lists to the left and to the right of the middle element
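A minimal recursive sketch of that idea in Java, assuming a sorted int[] as input (the class and method names are just illustrative; with a self-balancing tree like TreeSet the insertion order matters much less):

import java.util.SortedSet;
import java.util.TreeSet;

class MiddleFirstInsert {
    // Insert the elements of a sorted array middle-first, so a plain
    // (non-self-balancing) binary search tree would come out balanced.
    static void addBalanced(int[] sorted, SortedSet<Integer> set, int lo, int hi) {
        if (lo > hi) return;
        int mid = lo + (hi - lo) / 2;
        set.add(sorted[mid]);                    // root of this sub-range
        addBalanced(sorted, set, lo, mid - 1);   // left sub-list
        addBalanced(sorted, set, mid + 1, hi);   // right sub-list
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11, 13};
        SortedSet<Integer> set = new TreeSet<>();
        addBalanced(data, set, 0, data.length - 1);
        System.out.println(set);                 // [1, 3, 5, 7, 9, 11, 13]
    }
}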
Red-Black trees are a good choice for the general case, and they have very fast inserts. See Chris Okasaki's paper for an elegant and fast implementation. The Functional Java library has a generic Set class that is backed by a red-black tree implemented according to this paper.
With all the discussion of using a Set, it occurs to me that maybe the problem could be re-stated. Why use a Set at all? If you just want to check for membership, and your source list is sorted, then do a binary search for the object - this will be at least as fast as (and probably faster than) any tree you can envision, and it's not that tough to code.
So, envision an OrderedListSet interface that just wraps the underlying List object. As long as the comparator used to order the list is also used for the binary search, this should be pretty straightforward.
All Set operations will start with a getIndex(Object ob) call, then the appropriate action is taken on the List.
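A hedged sketch of what that wrapper might look like (OrderedListSet is the hypothetical type described above, not an existing library class):

import java.util.Collections;
import java.util.List;

// A "set" view over a List that is already sorted in its natural ordering.
class OrderedListSet<T extends Comparable<? super T>> {
    private final List<T> sorted;                // must already be sorted

    OrderedListSet(List<T> sorted) { this.sorted = sorted; }

    // Every operation starts with the index lookup...
    private int getIndex(T ob) { return Collections.binarySearch(sorted, ob); }

    // ...then takes the appropriate action on the list.
    boolean contains(T ob) { return getIndex(ob) >= 0; }
}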
Do you have a performance problem with the simple approach of just inserting the elements as they come?
If not, don't optimize.
The built-in TreeSet (http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html) class uses a red-black tree as its backing structure (and, as has been noted, red-black trees are quite fast for inserts). There's good info available on red-black trees; they don't have the problem of the typical binary tree implementation when inserting data that is mostly ordered already.
If you are dealing with huge data sets (big enough to require disk based backing, or significant paging file swap), then a B+Tree is a very good option (see JDBM for a Java based version of self-balancing B+Tree - it doesn't implement Set, but could be used that way if desired).
Depending on how your application is actually using this data, you might want to consider the GlazedLists library and make your lists 'live'. If all you are doing is static analysis, then this may be overkill, but it is an absolutely fantastic way of working with list based data. Definitely worth reading about.