How to get 100 random elements from HashSet in Java?

I have a HashSet containing 10000 elements, and I want to extract 100 random elements from it. I thought I could use shuffle on the set, but it doesn't work.
Set<String> users = new HashSet<String>();
// for randomness, but this doesn't work
Collections.shuffle(users, new Random(System.nanoTime()));
// and use for loop to get 100 elements
Since I cannot use shuffle here, is there a better way to get 100 random elements from a HashSet in Java?

Without building a new list, you can implement the following algorithm:
n = 100
d = 10000  # length(users)
for user in users:
    generate a random number p between 0 and 1
    if p <= n / d:
        select user
        n -= 1
    d -= 1
As you iterate through the set, selecting an element decreases n, which lowers the probability of future elements being chosen, but at the same time the shrinking d raises it. Initially, you would have a 100/10000 chance of choosing the first element. If you decide to take that element, you would have a 99/9999 chance of choosing the second element; if you don't take the first one, you'll have a slightly better 100/9999 chance of picking the second element. The math works out so that in the end, every element has a 100/10000 chance of being selected for the output.
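A compact Java sketch of the algorithm above (this is Knuth's selection-sampling technique; the method and variable names are my own):
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

static List<String> sample(Set<String> users, int count) {
    List<String> selected = new ArrayList<>(count);
    Random rnd = new Random();
    int n = count;         // elements still needed
    int d = users.size();  // elements not yet examined
    for (String user : users) {
        if (rnd.nextDouble() * d < n) { // select with probability n / d
            selected.add(user);
            n--;
        }
        d--;
    }
    return selected;
}
This selects exactly count elements, because the selection probability rises to 1 once n equals d.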

Shuffling a collection implies that there is some defined order of elements within it, so that the elements can be reordered. A HashSet is not an ordered collection: there is no order of elements inside (or rather, the details of the ordering are not exposed to the user). Therefore, implementation-wise, it does not make much sense to shuffle a HashSet.
What you can do is add all elements from your set to an ArrayList, shuffle it, and take your results from the front:
List<String> usersList = new ArrayList<String>(users);
Collections.shuffle(usersList);
List<String> randomUsers = usersList.subList(0, 100); // the first 100 elements of the shuffled list

java.util.HashSet has no defined order, so you can't shuffle a Set. If you must use a Set, you can iterate over it and stop at random positions.
Pseudocode:
Set<String> randomUsers = new HashSet<String>();
Random r = new Random();
Iterator<String> it = users.iterator();
int numUsersNeeded = 100;
int numUsersLeft = users.size();
while (it.hasNext() && randomUsers.size() < 100) {
    String user = it.next();
    double prob = (double) numUsersNeeded / numUsersLeft;
    --numUsersLeft;
    if (prob > r.nextDouble() && randomUsers.add(user)) {
        --numUsersNeeded;
    }
}
This is actually guaranteed to fetch exactly 100 elements: once numUsersNeeded equals numUsersLeft, the selection probability reaches 1 and every remaining element is taken.
If memory is no issue you can create an array and pick 100 random elements:
Pseudocode II:
Object[] userArray = users.toArray();
Set<String> randoms = new HashSet<String>();
Random r = new Random();
while (randoms.size() < 100) {
    String randomUser = (String) userArray[r.nextInt(userArray.length)];
    randoms.add(randomUser);
}


Finding mode for every window of size k in an array

Given an array of size n and k, how do you find the mode for every contiguous subarray of size k?
For example
arr = 1 2 2 6 6 1 1 7
k = 3
ans = 2 2 6 6 1 1
I was thinking of having a HashMap where the key is the number and the value is its frequency, a TreeMap where the key is the frequency and the value is the number, and a queue to remove the oldest element when the size exceeds k. Here the time complexity is O(n log n). Can we do this in O(1) per window?
This can be done in O(n) time
I was intrigued by this problem in part because, as I indicated in the comments, I felt certain that it could be done in O(n) time. I had some time over this past weekend, so I wrote up my solution to this problem.
Approach: Mode Frequencies
The basic concept is this: the mode of a collection of numbers is the number(s) which occur with the highest frequency within that set.
This means that whenever you add a number to the collection, if the number added was not already one of the mode-values then the frequency of the mode would not change. So with the collection (8 9 9) the mode-values are {9} and the mode-frequency is 2. If you add say a 5 to this collection ((8 9 9 5)) neither the mode-frequency nor the mode-values change. If instead you add an 8 to the collection ((8 9 9 8)) then the mode-values change to {9, 8} but the mode-frequency is still unchanged at 2. Finally, if you instead added a 9 to the collection ((8 9 9 9)), now the mode-frequency goes up by one.
Thus in all cases when you add a single number to the collection, the mode-frequency is either unchanged or goes up by only one. Likewise, when you remove a single number from the collection, the mode-frequency is either unchanged or goes down by at most one. So all incremental changes to the collection result in only two possible new mode-frequencies. This means that if we had all of the distinct numbers of the collection indexed by their frequencies, then we could always find the new Mode in a constant amount of time (i.e., O(1)).
To accomplish this I use a custom data structure ("ModeTracker") that has a multiset ("numFreqs") to store the distinct numbers of the collection along with their current frequency in the collection. This is implemented with a Dictionary<int, int> (I think that this is a Map in Java). Thus given a number, we can use this to find its current frequency within the collection in O(1).
This data structure also has an array of sets ("freqNums") that given a specific frequency will return all of the numbers that have that frequency in the current collection.
I have included the code for this data structure class below. Note that this is implemented in C# as I do not know Java well enough to implement it there, but I believe that a Java programmer should have no trouble translating it.
(pseudo)Code:
class ModeTracker
{
    HashSet<int>[] freqNums;        //numbers at each frequency
    Dictionary<int, int> numFreqs;  //frequencies for each number
    int modeFreq_ = 0;              //frequency of the current mode

    public ModeTracker(int maxFrequency)
    {
        freqNums = new HashSet<int>[maxFrequency + 2];
        // populate all frequencies, so we don't have to check later
        for (int i = 0; i < maxFrequency + 2; i++)
        {
            freqNums[i] = new HashSet<int>();
        }
        numFreqs = new Dictionary<int, int>();
    }

    public int Mode { get { return freqNums[modeFreq_].First(); } }

    public void addNumber(int n)
    {
        adjustNumberCount(n, 1);
        // new mode-frequency is one greater or the same
        if (freqNums[modeFreq_ + 1].Count > 0) modeFreq_++;
    }

    public void removeNumber(int n)
    {
        adjustNumberCount(n, -1);
        // new mode-frequency is the same or one less
        if (freqNums[modeFreq_].Count == 0) modeFreq_--;
    }

    int adjustNumberCount(int num, int adjust)
    {
        // make sure we already have this number
        if (!numFreqs.ContainsKey(num))
        {
            // add entries for it
            numFreqs.Add(num, 0);
            freqNums[0].Add(num);
        }
        // now adjust this number's frequency
        int oldFreq = numFreqs[num];
        int newFreq = oldFreq + adjust;
        numFreqs[num] = newFreq;
        // remove the number from its old frequency set and add it to the new one
        freqNums[oldFreq].Remove(num);
        freqNums[newFreq].Add(num);
        return newFreq;
    }
}
Also, below is a small C# function that demonstrates how to use this data structure to solve the problem originally posed in the question.
int[] ModesOfSubarrays(int[] arr, int subLen)
{
    ModeTracker tracker = new ModeTracker(subLen);
    int[] modes = new int[arr.Length - subLen + 1];
    for (int i = 0; i < arr.Length; i++)
    {
        //add every number into the tracker
        tracker.addNumber(arr[i]);
        if (i >= subLen)
        {
            // remove the number that just rotated out of the window
            tracker.removeNumber(arr[i - subLen]);
        }
        if (i >= subLen - 1)
        {
            // add the new Mode to the output
            modes[i - subLen + 1] = tracker.Mode;
        }
    }
    return modes;
}
I have tested this and it does appear to work correctly for all of my tests.
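If it helps, here is a rough Java sketch of the same structure (untested; the names and the java.util collection choices are mine rather than the author's):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModeTracker {
    private final List<Set<Integer>> freqNums = new ArrayList<>();  // numbers at each frequency
    private final Map<Integer, Integer> numFreqs = new HashMap<>(); // frequency of each number
    private int modeFreq = 0;                                       // frequency of the current mode

    ModeTracker(int maxFrequency) {
        for (int i = 0; i < maxFrequency + 2; i++) {
            freqNums.add(new HashSet<>());
        }
    }

    int mode() {
        return freqNums.get(modeFreq).iterator().next();
    }

    void addNumber(int n) {
        adjustNumberCount(n, 1);
        if (!freqNums.get(modeFreq + 1).isEmpty()) modeFreq++;
    }

    void removeNumber(int n) {
        adjustNumberCount(n, -1);
        if (freqNums.get(modeFreq).isEmpty()) modeFreq--;
    }

    private void adjustNumberCount(int num, int adjust) {
        int oldFreq = numFreqs.getOrDefault(num, 0);
        int newFreq = oldFreq + adjust;
        numFreqs.put(num, newFreq);
        freqNums.get(oldFreq).remove(num);
        freqNums.get(newFreq).add(num);
    }

    static int[] modesOfSubarrays(int[] arr, int subLen) {
        ModeTracker tracker = new ModeTracker(subLen);
        int[] modes = new int[arr.length - subLen + 1];
        for (int i = 0; i < arr.length; i++) {
            tracker.addNumber(arr[i]);
            if (i >= subLen) tracker.removeNumber(arr[i - subLen]);
            if (i >= subLen - 1) modes[i - subLen + 1] = tracker.mode();
        }
        return modes;
    }
}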
Complexity Analysis
Going through the individual steps of the `ModesOfSubarrays()` function:
1. The new ModeTracker object is created in O(n) time or less.
2. The modes[] array is created in O(n) time.
3. The for loop runs n times:
   3a. the addNumber() function takes O(1) time
   3b. the removeNumber() function takes O(1) time
   3c. getting the new Mode takes O(1) time
So the total time is O(n) + O(n) + n*(O(1) + O(1) + O(1)) = O(n).
Please let me know of any questions that you might have about this code.

Pseudo code: Random Permutation

I have pseudocode that I translated into Java, but whenever I run the code I get an empty ArrayList as a result, when it is supposed to give me a random list of integers.
Here is the pseudo code:
Algorithm 1. RandPerm(N)
Input: Number of cities N
1) Let P = a list of length N (|P| = N), where p_i = i
2) Let T = an empty list
3) While |P| > 0
4)   Let i = UI(1, |P|)
5)   Add p_i to the end of T
6)   Delete the ith element (p_i) from P
7) End While
Output: Random tour T
Here is the java code:
public static ArrayList<Integer> RandPerm(int n)
{
    ArrayList<Integer> P = new ArrayList<>(n);
    ArrayList<Integer> T = new ArrayList<>();
    int i;
    while (P.size() > 0)
    {
        i = CS2004.UI(1, P.size()); // generate random numbers between 1 and the size of P
        T.add(P.get(i));
        P.remove(P.get(i));
    }
    return T;
}
I don't know what I am doing wrong.
ArrayList<Integer> p = new ArrayList<>(n);
... creates an empty list with an initial capacity of n.
All this does is tell the ArrayList what size array to initialise as backing store - most of the time you achieve nothing useful by specifying this.
So your while(p.size() > 0) runs zero times, because p.size() is zero at the start.
In the pseudocode "where pi=i" suggests to me that you want to initialise the list like this:
for (int i = 0; i < n; i++) {
    p.add(i);
}
(I have lowercased your variable name - in Java the convention is for variables to startWithALowerCaseLetter -- only class names StartWithUpperCase. It's also the Java convention to give variables descriptive names, so cityIdentifiers perhaps)
You may want to know that, even if you fix the problem that P is always empty, there are 2 more issues with your implementation.
One is that P.remove(P.get(i)) does not necessarily remove the ith item if the list has equal value items. It scans from the beginning and removes the first occurrence of the item. See ArrayList.remove(Object obj). You should use P.remove(i) instead for the correct results.
Then the performance is O(n^2). The reason is that ArrayList removes an item by shifting all the subsequent items one slot to the left, which is an O(n) operation. For much better performance, you can implement your own "remove" by swapping the chosen item to the end. When you generate the next random index, generate it within the range [0, beginning of the removed items at the end). Swapping is O(1), so the overall performance is O(n). This is called the Knuth shuffle, by the way.
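A sketch of that swap-based version (assuming 0-based indexing and java.util.Random in place of the CS2004.UI helper):
import java.util.ArrayList;
import java.util.Random;

public static ArrayList<Integer> randPerm(int n) {
    ArrayList<Integer> p = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
        p.add(i);
    }
    ArrayList<Integer> t = new ArrayList<>(n);
    Random rnd = new Random();
    for (int end = n; end > 0; end--) {
        int i = rnd.nextInt(end);   // random index in the not-yet-picked region [0, end)
        t.add(p.get(i));
        p.set(i, p.get(end - 1));   // swap the last unpicked item into the hole
    }
    return t;
}
Each pick is O(1), so building the whole permutation is O(n).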

How to efficiently generate a set of unique random numbers with a predefined distribution?

I have a map of items with some probability distribution:
Map<SingleObjectiveItem, Double> itemsDistribution;
Given a certain m I have to generate a Set of m elements sampled from the above distribution.
As of now I was using the naive way of doing it:
while (mySet.size() < m)
    mySet.add(getNextSample(itemsDistribution));
The getNextSample(...) method fetches an object from the distribution as per its probability. Now, as m increases the performance severely suffers. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the function remains in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?
UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, on the first iteration we'll definitely get a sample for the set; on the second it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., goes through the entire cumulative distribution to find the sample, then the run-time complexity of the loop is at least n*m(m+1)/2, where 'n' is the number of elements in the distribution. If m = cn, 0 < c <= 1, then the loop is at least Ω(n^3). And that is only the lower bound!
If we replace sequential search by binary search, the complexity would still be at least Ω(n^2 log n). More efficient, but perhaps not by a large margin.
Also, removing from the distribution is not possible since I call the above loop k times, to generate k such sets. These sets are part of a randomized 'schedule' of items. Hence a 'set' of items.
Start out by generating a number of random points in two dimensions.
Then apply your distribution: keep each point that falls under the distribution's curve and discard the rest.
Finally, take the x coordinates of the kept points, and you have your random numbers with the requested distribution.
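A minimal sketch of that idea, known as rejection sampling (the density() function here is a hypothetical example, not part of the question):
import java.util.Random;

public class RejectionSampling {
    // Hypothetical example density on [0, 1] (unnormalized): favors larger x.
    static double density(double x) {
        return x;
    }

    // Draw (x, y) uniformly in the bounding box and keep x whenever
    // the point falls under the density curve.
    static double sample(Random rnd, double x0, double x1, double maxDensity) {
        while (true) {
            double x = x0 + rnd.nextDouble() * (x1 - x0);
            double y = rnd.nextDouble() * maxDensity;
            if (y < density(x)) {
                return x;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sample(new Random(), 0.0, 1.0, 1.0));
    }
}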
The problem is unlikely to be the loop you show:
Let n be the size of the distribution, and I be the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations of getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i / n, and is therefore exponentially distributed with parameter λ. Hence E[C_i] = 1 / λ = 1 / (1 - i / n) <= 1 / (1 - m / n), and therefore E[I] < m / (1 - m / n).
That is, sampling a set of size m = n/2 will take, on average, less than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
The following should be faster (if m < 0.8 n)
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private double[] cumulativeWeight;
    private T[] item;
    private double totalWeight;

    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;
        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];
        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            index = -index - 1;
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem());
        }
        return set;
    }
}
public class Test {
    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }
        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}
The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.
As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:
Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}
To efficiently implement remove requires a different representation for the distribution, for instance a binary tree where each node stores the total weight of the subtree whose root it is.
But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
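For what it's worth, here is a rough sketch of such a weight-indexed structure, using a Fenwick (binary indexed) tree over the item weights rather than an explicit binary tree; items are assumed to be indexed 0..n-1, and all names here are mine:
import java.util.Random;

class WeightedSampler {
    private final double[] tree;   // Fenwick tree over weights, 1-based
    private final double[] weight; // current weight of each item
    private double totalWeight;
    private final Random rnd = new Random();

    WeightedSampler(double[] weights) {
        tree = new double[weights.length + 1];
        weight = weights.clone();
        for (int i = 0; i < weights.length; i++) {
            update(i, weights[i]);
            totalWeight += weights[i];
        }
    }

    private void update(int i, double delta) {
        for (int j = i + 1; j < tree.length; j += j & -j) {
            tree[j] += delta;
        }
    }

    // Smallest item index whose cumulative weight exceeds target, in O(log n).
    private int search(double target) {
        int idx = 0;
        for (int bit = Integer.highestOneBit(tree.length - 1); bit > 0; bit >>= 1) {
            int next = idx + bit;
            if (next < tree.length && tree[next] <= target) {
                target -= tree[next];
                idx = next;
            }
        }
        return idx; // 0-based item index
    }

    // Draw an item according to the current weights, then remove it.
    int drawAndRemove() {
        int i = search(rnd.nextDouble() * totalWeight);
        totalWeight -= weight[i];
        update(i, -weight[i]);
        weight[i] = 0;
        return i;
    }
}
Both drawing and removing are O(log n), so sampling a subset of size m costs O(m log n) regardless of how close m is to n.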
If you are not too concerned with randomness properties, I would do it like this:
create buffer for pseudo-random numbers
double buff[MAX]; // [edit1] double pseudo random numbers
MAX should be big enough ... 1024*128, for example
the type can be anything (float, int, DWORD, ...)
fill buffer with numbers
you have a range of numbers x in <x0, x1> and a probability function probability(x) defined by your probability distribution, so do this:
for (i=0,x=x0;x<=x1;x+=stepx)
 for (j=0,n=probability(x)*MAX,q=0.1*stepx/n;j<n;j++,i++) // [edit1] unique pseudo-random numbers
  buff[i]=x+(double(i)*q); // [edit1] ...
The stepx is your accuracy for items (for integral types = 1). Now the buff[] array has the distribution you need, but it is not pseudo-random. You should also check that the index does not reach MAX, to avoid array overruns; note that the real filled size of buff[] can be less than MAX due to rounding.
shuffle buff[]
do just a few loops of swapping buff[i] and buff[j], where i is the loop variable and j is pseudo-random in <0, MAX)
write your pseudo-random function
It just returns numbers from the buffer: the first call returns buff[0], the second buff[1], and so on. With standard generators, when you hit the end of buff[] you shuffle buff[] again and start from buff[0]. But as you need unique numbers, you cannot wrap around, so set MAX big enough for your task; otherwise uniqueness will not be assured.
[Notes]
MAX should be big enough to store the whole distribution you want. If it is not big enough then items with low probability can be missing completely.
[edit1] - tweaked the answer a little to match the question's needs (pointed out by meriton, thanks)
PS: the complexity of initialization is O(N), and getting a number is O(1).
You could implement your own random number generator (using a Monte Carlo method or any good uniform generator like the Mersenne Twister), based on the inversion method (here).
For example, the exponential law: generate a uniform random number u in [0,1]; your exponentially distributed random variable is then ln(1-u)/(-lambda), lambda being the exponential law's parameter and ln the natural logarithm.
Hope it'll help ;).
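A minimal sketch of that inversion for the exponential law (the names are mine):
import java.util.Random;

// Inversion method: apply the inverse CDF, F^-1(u) = ln(1 - u) / (-lambda),
// to a uniform sample u.
static double nextExponential(Random rnd, double lambda) {
    double u = rnd.nextDouble(); // uniform in [0, 1)
    return Math.log(1 - u) / (-lambda);
}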
I think you have two problems:
1. Your itemDistribution doesn't know you need a set, so when the set you are building gets large you will pick a lot of elements that are already in the set. If you start with the set full and remove elements, you will run into the same problem for very small sets. Is there a reason why you don't remove the element from the itemDistribution after you picked it? Then you wouldn't pick the same element twice.
2. The choice of data structure for itemDistribution looks suspicious to me. You want the getNextSample operation to be fast. Doesn't the map from values to probabilities force you to iterate through large parts of the map for each getNextSample? I'm no good at statistics, but couldn't you represent the itemDistribution the other way around: as a map from the accumulated sum of all smaller probabilities to an element of the set?
Your performance depends on how your getNextSample function works. If you have to iterate over all probabilities when you pick the next item, it might be slow.
A good way to pick several unique random items from a list is to first shuffle the list and then pop items off it. You can shuffle the list once with the given distribution. From then on, picking your m items is just popping off the list.
Here's an implementation of a probabilistic shuffle:
List<Item> prob_shuffle(Map<Item, int> dist)
{
    int n = dist.length;
    List<Item> a = dist.keys();
    int psum = 0;
    int i, j;
    for (i in dist) psum += dist[i];
    for (i = 0; i < n; i++) {
        int ip = rand(psum);  // 0 <= ip < psum
        int jp = 0;
        for (j = i; j < n; j++) {
            jp += dist[a[j]];
            if (ip < jp) break;
        }
        psum -= dist[a[j]];
        Item tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
    return a;
}
This is not Java but pseudocode, modelled after an implementation in C, so please take it with a grain of salt. The idea is to append items to the shuffled area by continuously picking items from the unshuffled area.
Here, I used integer probabilities. (The probabilities don't have to add up to a special value; it's just "bigger is better".) You can use floating-point numbers, but because of inaccuracies you might end up going beyond the array when picking an item; you should use item n - 1 then. If you add that safety net, you could even have items with zero probability that always get picked last.
There might be a method to speed up the picking loop, but I don't really see how. The swapping renders any precalculations useless.
Accumulate your probabilities in a table:

Item     Actual probability   Accumulated
Item1    0.10                 0.10
Item2    0.30                 0.40
Item3    0.15                 0.55
Item4    0.20                 0.75
Item5    0.25                 1.00
Generate a random number between 0.0 and 1.0 and do a binary search for the first item whose accumulated sum is greater than your generated number. That item is then chosen with the desired probability.
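A small sketch of that lookup, hard-coding the accumulated column from the table above for illustration:
import java.util.Arrays;

double[] accumulated = {0.10, 0.40, 0.55, 0.75, 1.00};
double u = Math.random(); // uniform in [0, 1)
int index = Arrays.binarySearch(accumulated, u);
if (index < 0) {
    index = -index - 1; // insertion point = first entry greater than u
}
// index now selects Item1..Item5 with probabilities 0.10, 0.30, 0.15, 0.20, 0.25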
Ebbe's method is called rejection sampling.
I sometimes use a simple method, using an inverse cumulative distribution function, which is a function that maps a number X between 0 and 1 onto the Y axis.
Then you just generate a uniformly distributed random number between 0 and 1, and apply the function to it.
That function is also called the "quantile function".
For example, suppose you want to generate a normally distributed random number.
Its cumulative distribution function is called Phi.
The inverse of that is called probit.
There are many ways to generate normal variates, and this is just one example.
You can easily construct an approximate cumulative distribution function for any univariate distribution you like, in the form of a table.
Then you can just invert it by table-lookup and interpolation.
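A sketch of such a table-based inversion (assuming xs[] holds the sample points and cdf[] the CDF values at those points, ascending from 0 to 1; both arrays and all names are mine):
import java.util.Arrays;

// Invert a tabulated CDF by binary search plus linear interpolation.
static double inverseCdf(double[] xs, double[] cdf, double u) {
    int idx = Arrays.binarySearch(cdf, u);
    if (idx >= 0) return xs[idx];      // exact hit in the table
    int hi = -idx - 1;                 // first entry greater than u
    if (hi == 0) return xs[0];
    if (hi == xs.length) return xs[xs.length - 1];
    int lo = hi - 1;
    double t = (u - cdf[lo]) / (cdf[hi] - cdf[lo]);
    return xs[lo] + t * (xs[hi] - xs[lo]); // interpolate between table entries
}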

Java how to add a Random Generator that excludes a certain number?

Suppose I wanted to generate random numbers taken from an ArrayList: (1,2,3,4,5,6,7,8,9,10).
The random generator produces 5.
The list gets updated: (1,2,3,4,6,7,8,9,10).
The next random number cannot be 5.
I am writing a program that generates random numbers from an ArrayList; once it generates a random number, the list removes that number, so the next randomly generated number cannot be it.
ArrayList<Integer> numsLeft = new ArrayList<Integer>(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
Random randomGenerator = new Random();
int number = 0;
String cont;
do
{
    number = randomGenerator.nextInt(numsLeft.size());
    numsLeft.remove(number);
    System.out.println(number + " continue (y/n)");
    cont = stdin.readLine();
}
while (cont.equalsIgnoreCase("y"));
But the only thing I can do here is lower the size...
http://docs.oracle.com/javase/7/docs/api/java/util/Random.html
The easier approach is to simply shuffle your list then use the numbers in the shuffled order:
List<Integer> nums = new ArrayList<Integer>();
for (int i = 1; i < 11; i++)
    nums.add(i);
Collections.shuffle(nums);
Now they are in random order, just use them one by one:
for (Integer i : nums) {
    // use i
}
You could make an array of the available numbers. Then, the random number generator gives you the position in that array for the number that you want.
Probably a linked list or something would be more efficient, but the concept is the same.
So, with your example, you'd pull 5 the first time. The second time, you'd have this in your list:
1, 2, 3, 4, 6, 7, 8, 9, 10
If your random number was 5 again, the fifth position is 6. Pop the 6 out, shift 7, 8, 9, 10 over one, and shrink your random number generator's range to 1-8 instead of 1-9. Continue on.
Of course, looking at your code, it looks like that is what you are trying to do already.
What seems to be the issue with your code? What results are you getting?
number = randomGenerator.nextInt(numsLeft.size());
numsLeft.remove(number);
You are now printing the random index that you are generating, not the number that was removed from the list. Is that what you wanted? I think you really meant this:
int index = randomGenerator.nextInt(numsLeft.size());
number = numsLeft.remove(index);
You could also do this by randomly shuffling the list and then just going through it:
List<Integer> numsLeft = new ArrayList<Integer>(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
// Shuffle the list randomly
Collections.shuffle(numsLeft);
do {
    // Remove the first number each time
    int number = numsLeft.remove(0);
    System.out.println(number + " continue (y/n)");
    cont = stdin.readLine();
} while (cont.equalsIgnoreCase("y"));
Why don't you create a hash map to take care of this? Your hash map could contain something like:
Map[(1,1), (2,2), (3,3), ...] or Map[(1,true), (2,true), (3,true), ...]
So when you generate a number, you can do something like:
Integer value = map.get(key); or Boolean present = map.get(key);
if (value != null) or if (present)
and then map.remove(key); or, instead of removing the key, you can update its value to mark the number as removed. Either way you can keep track of all the insertions and removals in your map for each of the key values, which would be your list of numbers.
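A minimal sketch of that bookkeeping (the names are hypothetical; retrying when a number was already used is left to the caller):
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

Map<Integer, Boolean> available = new HashMap<>();
for (int i = 1; i <= 10; i++) {
    available.put(i, true);
}

Random rnd = new Random();
int candidate = rnd.nextInt(10) + 1; // 1..10
if (Boolean.TRUE.equals(available.get(candidate))) {
    available.put(candidate, false); // mark as used instead of removing
    System.out.println(candidate);
}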
remove can be a pretty expensive operation when the list is long. Shuffle is too - especially if you only need a few numbers. Here is another algorithm (it is famous, but I can't find the source right now); a Java sketch follows the lists below.
1. Put your N (ordered) numbers in a list
2. Choose a random number m between 0 and N-1
3. Pick the element at location m. This is your unique random number
4. SWAP element m with the LAST element in the array
5. Decrement N by 1
6. Go to step 2
You "set aside" the numbers you have used in step 4 - but
Unlike shuffle, your initialization is fast
Unlike remove, your remove operation only takes moving one element (instead of, on average, N/2)
Unlike the "pick one and reject if you saw it before", your efficiency of picking a "new" number doesn't decrease as the number of elements picked increases.

How to remove elements in an arraylist start from an indicated index

As the picture shows, after one run of the method I want to remove the old items, ready for the next calculation. How can I remove the elements of an ArrayList starting from an indicated index, like a queue obeying FIFO?
You can use List#subList(int, int):
List<Integer> list = ...
list = list.subList(10, list.size()); // creates a new list from the old starting from the 10th element
or, since subList creates a view on which every modification affects the original list, this may be even better:
List<Integer> list = ...
list.subList(0, 10).clear(); // clears the first 10 elements of list
Just use the remove() method for this.
Let's assume you want to remove elements with indices from 20 to 30 of an ArrayList:
ArrayList<String> list = ...
for (int i = 0; i < 10; i++) { // 30 - 20 = 10
    list.remove(20);
}
Once the element at index 20 is removed, the element at index 21 moves to index 20. So you have to delete the element at index 20 ten times to remove the next 10 elements.
Since you're not writing a high-performance application, it is bad style to store so much semantics in an index variable.
A better approach would be the use of a map,
e.g. Map<Item, Integer> itemStock and Map<Item, Double> prices. You then also wouldn't have any problems with the remove operation.
