I was working on a playing cards shuffle problem and found two solutions for it.
The target is to shuffle all 52 playing cards stored in an array as Card objects. The Card class has an id and a name associated with it.
Now, one way is to iterate with a for loop and, with the help of a temporary Card holder and a random number generator, swap two objects. This continues until we reach half of the cards.
Another way is to implement Comparable, overriding the compareTo method with a random number generator, so we get a random response each time the method is called.
Which way do you think is better?
You should not do it by sorting with a comparator that returns random results, because then those random results can be inconsistent with one another (e.g., saying that a<b<c<a), and this can actually result in the distribution of orderings you get being far from uniform. See e.g. this demonstration by Mike Bostock. Also, it takes longer, not that that should matter for shuffling 52 objects.
The standard way to do it does involve a loop, but your description sounds peculiar and I suspect what you have in mind may also not produce the ideal results. (If the question is updated to make it clearer what the "iterate using for loop" approach is meant to be, I will update this.)
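For concreteness, here is a minimal sketch of that standard loop-based approach (the Fisher-Yates shuffle); the Card type and method name are just illustrative:

import java.util.Random;

// Fisher-Yates shuffle: walk the array from the end, swapping each
// element with a uniformly chosen element at or before it.
static void shuffle(Card[] deck, Random rng) {
    for (int i = deck.length - 1; i > 0; i--) {
        int j = rng.nextInt(i + 1); // uniform index in [0, i]
        Card tmp = deck[i];
        deck[i] = deck[j];
        deck[j] = tmp;
    }
}

Note that j ranges over [0, i] inclusive, not [0, i-1]; excluding i is the classic off-by-one mistake that makes the distribution non-uniform.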
(There is a way to get good shuffling by sorting: pair each element up with a random number -- e.g., a random floating-point number in the range 0..1 -- and then sort using that number as key. But this is slower than Fisher-Yates and requires extra memory. In lower-level languages it generally also takes more code; in higher-level languages it can be terser; I'd guess that for Java it ends up being about equal.)
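A sketch of that sort-by-key variant, assuming the cards are in a List&lt;Card&gt; (the IdentityHashMap pairing is just one way to attach the keys):

import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Pair each card with a fixed random key, then sort by that key.
static void shuffleBySortKey(List<Card> deck, Random rng) {
    Map<Card, Double> keys = new IdentityHashMap<>();
    for (Card c : deck) {
        keys.put(c, rng.nextDouble()); // one fixed key per card
    }
    deck.sort(Comparator.comparingDouble(keys::get));
}

Unlike the random comparator, the keys here are fixed before sorting, so the comparisons stay consistent with one another.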
[EDITED to add:] As Louis Wasserman very wisely says in comments, when your language's standard library has a ready-made function to do a thing, you should generally use it. Unless you're doing this for, e.g., a homework assignment that requires you to find and implement an algorithm to solve the problem.
First of all, the comparator you've described won't work. More on this here. TL;DR: comparisons must be reproducible, so if your comparator says that a is less than b, the next time it compares b to a it should return "greater", not a random value. The same goes for Comparable.
If I were you, I'd rather use the Collections#shuffle method, which "randomly permutes the specified list using a default source of randomness. All permutations occur with approximately equal likelihood". It's always better to rely on someone else's code than to write your own, especially if it is in the standard library.
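For example, assuming the 52 cards from the question are in a Card[] named cards:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

List<Card> deck = new ArrayList<>(Arrays.asList(cards));
Collections.shuffle(deck); // uniformly permutes the list in place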
So I am looking into the Collections.shuffle method and trying to come up with a list of what is and is not ensured when we run it. Here are some obvious cases I've come up with:
The list given will contain the same elements after shuffling as before
The list may or may not be the same after running the method (you could end up with the same order of elements)
The method will run in linear time (I think that this is true but am not 100% positive).
Does this list sum it up or am I missing some possible cases?
The official documentation of Collections.shuffle has a lot to say about what will happen. The list will be shuffled using what seems to be the Fisher-Yates shuffle algorithm, which (assuming that random access is available in O(1)) runs in time O(n) and space O(1). The implementation will use space O(n) if random access isn't available. Assuming that the underlying random source is totally unbiased, the probability of any particular ordering occurring is equal (that is, you get a uniformly-random distribution over possible permutations).
So, to answer your questions:
The list will contain the same elements.
They're probably in a different order, but there's a 1 / n! chance that they'll be in the same order.
The runtime is O(n), and the space usage is either O(1) or O(n) depending on whether your list has random access support.
Yes (and moreover, the list itself will remain the same object)
Correct; there's always a slight chance of randomly getting the exact same order of elements (not that slight for small lists)
This is implementation-dependent, but at least up to Java 7 it is linear (and there is no plausible reason for that to change)
I have an array whose values I need to sort in increasing order. The possible values inside the array are between 1 and 9, and there will be a lot of repeated values. (FYI: I'm working on a sudoku solver and trying to solve the puzzle starting with the box with the fewest possibilities, using backtracking.)
The first idea that comes to my mind is to use Shell Sort.
I did some looking up and found out that the Java collections use a "modified mergesort" (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist).
So I wish to know if the differences in performance will be noticeable if I implement my own sorting algorithm.
If you only have 9 possible values, you probably want counting sort - the basic idea is:
Create an array of counts of size 9.
Iterate through the array and increment the corresponding index in the count array for each element.
Go through the count array and recreate the original array.
The running time of this would be O(n + 9) = O(n), whereas the running time of the standard API sort will be O(n log n).
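In code, a minimal sketch of that idea (assuming an int[] whose values are known to be in 1..9) might look like:

// Counting sort for values in 1..9: count occurrences, then rewrite
// the array in order from the counts.
static void countingSort(int[] a) {
    int[] counts = new int[10]; // indices 1..9 used
    for (int v : a) {
        counts[v]++;
    }
    int i = 0;
    for (int v = 1; v <= 9; v++) {
        for (int c = 0; c < counts[v]; c++) {
            a[i++] = v;
        }
    }
}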
So yes, this will most likely be faster than the standard comparison-based sort that the Java API uses, but only a benchmark will tell you for sure (and it could depend on the size of your data).
In general, I'd suggest that you first try using the standard API sort, and see if it's fast enough - it's literally just 1 line of code (except if you have to define a comparison function), compared to quite a few more for creating your own sorting function, and quite a bit of effort has gone into making sure it's as fast as possible, while keeping it generic.
If that's not fast enough, try to find and implement a sort that works well with your data. For example:
Insertion sort works well on data that's already almost sorted (although the running time is pretty terrible if the data is far from sorted).
Distribution sorts are worth considering if you have numeric data.
As noted in the comment, Arrays.parallelSort (from Java 8) is also an option worth considering, since it multi-threads the work (which sort doesn't do, and is certainly quite a bit of effort to do yourself ... efficiently).
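For instance (illustrative data; both calls sort in place):

import java.util.Arrays;

int[] data = {5, 3, 9, 1, 7, 3};
Arrays.sort(data);         // single-threaded
Arrays.parallelSort(data); // Java 8+: splits the work across the common fork/join pool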
I have an ArrayList which I want to grab a random value from. To do this I thought of two simple methods:
Method 1: Uses Random to generate a random number between 0 and the size of the ArrayList and then use that number for arrayList.get(x)
Method 2: Use Collections.shuffle(arrayList) and then arrayList.get(0). (ArrayList itself has no shuffle method; shuffling lives in java.util.Collections.)
Is one method preferable to the other in terms of randomness? I know it is impossible for either to be truly random, but I want the result to be as random as possible.
EDIT: I only need one value from that ArrayList
It depends on the context.
Benefits of shuffling:
Shuffle once, then just grab values sequentially
No repeated values
Benefits of randomizing:
Great for a small amount of values
Can repeat values
To answer your direct question: neither one of these is "more random" than the other. The results of the two methods are statistically indistinguishable. After all, the first step in shuffling an array is (basically) picking a number between 0 and N-1 (where N is the length of the array) and moving that element into the first position.
That being said, there are valid reasons to pick one or the other, depending on your specific needs. Jeroen's answer summarizes those well.
I would say the random number option is the best (Method 1).
Shuffling takes up extra resources, because it has to move all of the objects around in the ArrayList, whereas generating a random number gives you the same effect without spending CPU time cycling through the array elements!
Also, be sure to generate a number between 0 and the size MINUS ONE. :)
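For example, a minimal sketch of Method 1, assuming an ArrayList&lt;String&gt; named list (note that Random.nextInt(n) already returns a value in 0..n-1, so the bound above is handled for you):

import java.util.Random;

Random rng = new Random();
String pick = list.get(rng.nextInt(list.size())); // nextInt(n) is 0..n-1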
If you just want one random selection, use Method 1. If you want to get a sequence of random selections with no duplicates, use Method 2.
Randomness depends on two factors, the algorithm (a.k.a the "generator") and the seed.
What generators does each method use?
The second overload of Collections.shuffle() actually accepts a seeded Random. If you choose the default overload, it uses a Random anyway, as specified in the Javadoc. You're using a Random no matter what.
Are the generators seeded differently?
Another look at Random in the Javadoc shows that it is seeded with a time-based value unless you specify a seed, and shuffle doesn't specify a seed if you look at the implementation. You're using the default seed unless you specify one.
Because both use Random and both use the same default seed, they are equally random.
Which one has a higher time complexity?
Shuffling a list is O(n) (the Javadoc for shuffle actually specifies linear time). The time complexity of Random.nextInt() is O(1). Obviously, the latter is faster in a case where only one value is needed.
I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.
The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?
For example, if one string returns a hashcode of -10 and another string returns a hashcode of 10 an error would occur. In my system we're talking about collections of objects that aren't more than 30 elements large typically so I don't think this would really be an issue, but I am curious as to what the math says.
Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.
Also, your idea to use Math.abs() alone is flawed: it does not always return a positive number! In two's complement arithmetic, the absolute value of Integer.MIN_VALUE is itself! Famously, the hash code of "polygenelubricants" is this value.
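A common workaround (a standard trick, not something from the question) is to clear the sign bit instead of calling Math.abs():

// Math.abs(Integer.MIN_VALUE) overflows and returns Integer.MIN_VALUE,
// so mask off the sign bit instead; the result is always non-negative.
int id = "polygenelubricants".hashCode() & 0x7fffffff;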
Hashes are not unique, hence they are not appropriate for a unique ID.
As to the probability of a hash collision, you could read about the birthday paradox. Actually (from what I recall), when drawing from a uniform distribution of N values, you should expect a collision after about $\sqrt{N}$ draws (and you could get a collision much earlier). The problem is that Java's implementation of hashCode (especially when hashing short strings) doesn't provide a uniform distribution, so you'll get a collision much earlier.
You already can get two strings with the same hashcode. This should be obvious if you think that you have an infinite number of strings and only 2^32 possible hashcodes.
You just make it a little more probable when taking the absolute value. The risk is small, but if you need a unique ID, this isn't the right approach.
What you can do when you only have 30-50 values, as you said, is register each String you get into a HashMap together with a running counter as the value:
HashMap<String, Integer> stringMap = new HashMap<>();
stringMap.put("Test", 1);
stringMap.put("AnotherTest", 2);
You can then get your unique ID by calling this:
StringMap.get("Test"); //returns 1
As an optional assignment, I'm thinking about writing my own implementation of the BigInteger class, where I will provide my own methods for addition, subtraction, multiplication, etc.
This will be for arbitrarily long integer numbers, even hundreds of digits long.
While doing the math on these numbers digit by digit isn't hard, what do you think would be the best data structure to represent my "BigInteger"?
At first I was considering using an Array but then I was thinking I could still potentially overflow (run out of array slots) after a large add or multiplication. Would this be a good case to use a linked list, since I can tack on digits with O(1) time complexity?
Is there some other data-structure that would be even better suited than a linked list? Should the type that my data-structure holds be the smallest possible integer type I have available to me?
Also, should I be careful about how I store my "carry" variable? Should it, itself, be of my "BigInteger" type?
Check out the book C Interfaces and Implementations by David R. Hanson. It has 2 chapters on the subject, covering the vector structure, word size and many other issues you are likely to encounter.
It's written for C, but most of it is applicable to C++ and/or Java. And if you use C++ it will be a bit simpler because you can use something like std::vector to manage the array allocation for you.
Always use the smallest int type that will do the job you need (bytes). A linked list should work well, since you won't have to worry about overflowing.
If you use binary trees (whose leaves are ints), you get all the advantages of a linked list (unbounded number of digits, etc.) with simpler divide-and-conquer algorithms. In this case you do not have a single base but many, depending on the level at which you're working.
If you do this, you need to use a BigInteger for the carry. You may consider it an advantage of the "linked list of ints" approach that the carry can always be represented as an int (and this is true for any base, not just base 10, which most answers seem to assume you should use... In any base, the carry is always a single digit)
I might as well say it: it would be a terrible waste to use base 10 when you can use 2^30 or 2^31.
Accessing elements of linked lists is slow. I think arrays are the way to go, with lots of bound checking and run time array resizing as needed.
Clarification: Traversing a linked list and traversing an array are both O(n) operations. But traversing a linked list requires dereferencing a pointer at each step. Just because two algorithms have the same complexity doesn't mean they take the same time to run. The overhead of allocating and deallocating n nodes in a linked list will also be much heavier than the memory management of a single array of size n, even if the array has to be resized a few times.
Wow, there are some… interesting answers here. I'd recommend reading a book rather than try to sort through all this contradictory advice.
That said, C/C++ is also ill-suited to this task. Big-integer arithmetic is a kind of extended-precision math. Most CPUs provide instructions to handle extended-precision math at comparable or the same speed (bits per instruction) as normal math. When you add 2^31+2^31 as unsigned 32-bit values, the answer is 0… but there is also a special carry output from the processor's ALU which a program can read and use.
C++ cannot access that flag, and there's no way in C either. You have to use assembler.
Just to satisfy curiosity, you can use the standard Boolean arithmetic to recover carry bits etc. But you will be much better off downloading an existing library.
I would say an array of ints.
An array is indeed a natural fit. I think it is acceptable to throw an OverflowException when you run out of space in memory. The teacher will see attention to detail.
A multiplication roughly doubles the number of digits; addition increases it by at most 1. It is easy to create a sufficiently big array to store the result of your operation.
The carry is at most a one-digit number in multiplication (9*9 = 81: write 1, carry 8). A single int will do.
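To illustrate why a plain int suffices for the carry, here is a sketch of grade-school addition on little-endian base-10 digit arrays (illustrative only):

// Add two little-endian digit arrays (a[0] is the ones digit).
static int[] add(int[] a, int[] b) {
    int n = Math.max(a.length, b.length);
    int[] sum = new int[n + 1]; // addition adds at most one digit
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int d = carry;
        if (i < a.length) d += a[i];
        if (i < b.length) d += b[i];
        sum[i] = d % 10;
        carry = d / 10; // always 0 or 1 for addition
    }
    sum[n] = carry;
    return sum;
}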
std::vector<bool> or std::vector<unsigned int> is probably what you want. You will have to push_back() or resize() as you need more space for multiplies, etc. Also, remember to push_back the correct sign bits if you're using two's complement.
I would say a std::vector of char (since it only has to hold 0-9), if you plan to work in BCD.
If not BCD, then use a vector of int (you didn't make that clear).
Much less space overhead than a linked list.
And all the advice says "use vector unless you have a good reason not to".
As a rule of thumb, use std::vector instead of std::list, unless you need to insert elements in the middle of the sequence very often. Vectors tend to be faster, since they are stored contiguously and thus benefit from better spatial locality (a major performance factor on modern platforms).
Make sure you use elements that are natural for the platform. If you want to be platform independent, use long. Remember that unless you have some special compiler intrinsics available, you'll need a type at least twice as large to perform multiplication.
I don't understand why you'd want carry to be a big integer. Carry is a single bit for addition and element-sized for multiplication.
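In Java terms, for example, you would widen two 32-bit limbs to a 64-bit product, which holds the result limb and the carry together (x and y here are assumed int limbs, treated as unsigned):

long product = (x & 0xffffffffL) * (y & 0xffffffffL);
int low  = (int) product;          // result limb
int high = (int) (product >>> 32); // carry into the next limb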
Make sure you read Knuth's Art of Computer Programming, algorithms pertaining to arbitrary precision arithmetic are described there to a great extent.