Huge sorted list of random numbers - java

I need to create a method which returns a number sampled from some random distribution, where every time I call the method the returned number is bigger than any previously returned number.
Or, in other words, I need an iterator for a sorted list of random values.
Unfortunately the list is too big to be created in memory as a whole. The first idea I came up with is to divide my value space into buckets, where each bucket contains the values in some range [a, b).
Say my list has N elements. To create a bucket I would sample my distribution N times and put each value in the range [a, b) into the bucket. Values outside that range would be discarded.
This way I could create a new bucket each time I have finished iterating over the previous one, and keep memory consumption low.
Yet, as I am not an expert in statistics, I am a little afraid this will somehow screw up the numbers I get. Is this an appropriate approach? Is it important to use the exact same distribution generator (an instance of org.apache.commons.math3.distribution.RealDistribution) for each bucket?
Update: It seems I did a bad job of explaining what kind of random numbers I am talking about.
My numbers form a sample of a random distribution, for example a normal distribution with a mean of m and a variance of v, a uniform distribution, or an exponential distribution.
I use those numbers to model some behavior in a simulation. Say I want to trigger events at certain times. I need to schedule billions of events, and the times at which those events are triggered must form a sample of a random distribution.
So if I derive my next number by adding a random number to my previous number, I do get a sequence of growing random numbers, but the numbers won't form a sample of my distribution.
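The bucket idea described in the question can be sketched roughly as follows. This is only an illustration under assumptions: `BucketedSample` and `bucket` are hypothetical names, and `java.util.Random.nextGaussian()` stands in for `RealDistribution.sample()`. The point it shows is that every bucket has to be filtered out of the same N draws, which is why the generator is recreated from one fixed seed on each pass.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: one conceptual sample of N values, materialised one bucket at a time.
final class BucketedSample {
    private final long seed;       // fixed seed, so every pass replays the same N draws
    private final long sampleSize; // N, the size of the conceptual sorted list

    BucketedSample(long seed, long sampleSize) {
        this.seed = seed;
        this.sampleSize = sampleSize;
    }

    /** Regenerates the whole sample and returns, sorted, only the values in [a, b). */
    List<Double> bucket(double a, double b) {
        Random rng = new Random(seed); // same seed -> identical sequence of draws
        List<Double> values = new ArrayList<>();
        for (long i = 0; i < sampleSize; i++) {
            double x = rng.nextGaussian(); // stand-in for dist.sample()
            if (x >= a && x < b) {
                values.add(x);
            }
        }
        Collections.sort(values);
        return values;
    }
}
```

Iterating over consecutive ranges and concatenating the buckets yields the sorted sample. The price is that each bucket costs N draws, so narrower buckets trade memory for CPU time.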

Only you can say what the requirements of your random generator are.
I need to create a method which returns a number sampled from some random distribution, where every time I call the method the returned number is bigger than any previously returned number.
You can do something like this:
private long previous = 0;
private final Random rand = new Random();

public long nextNumber() {
    return previous += rand.nextInt(10) + 1;
}
The details depend on how you want to model your random numbers.

If the list is too big to store in memory, you can use a database and read/write batches of list items to and from the database.
This way you only ever need to store one batch in memory at any one time.

I would start by storing your first random number in a variable. Then generate another random number and compare the two: if the new one is larger, save it both to large storage and to RAM. Repeat this for every further number, since each new random number only needs to be compared with the single value held in memory.

You could add a random number to the previously generated number. So you have to keep in memory only the number you generated in the iteration step before.

SamplePartitioner is a class which divides a sample of some distribution in several partitions of fixed size, which are returned, one by one, by nextPartition().
nextPartition() creates the whole sample on every call but stores only the smallest partitionSize values, which are bigger than the biggest value of the last partition. By using a fixed seed, nextPartition() creates the exact same sample each time it is called.
import scala.collection.immutable.SortedSet
import scala.util.Random
import org.apache.commons.math3.distribution.RealDistribution

class SamplePartitioner(sampleSize: Long, partitionSize: Int, dist: RealDistribution) {
  private val seed = Random.nextInt()
  private var remaining = sampleSize
  // Start below every possible sample so that negative values are not dropped.
  private var lastMax = Double.NegativeInfinity

  def nextPartition(): SortedSet[Double] = remaining.min(partitionSize) match {
    case 0 => SortedSet.empty[Double]
    case targetSize =>
      // Reseeding makes dist.sample() replay the exact same sample on every call.
      dist.reseedRandomGenerator(seed)
      val partition = fill(sampleSize, SortedSet.empty, targetSize)
      lastMax = partition.last
      remaining -= partition.size
      partition
  }

  private def fill(samples: Long, partition: SortedSet[Double], targetSize: Long): SortedSet[Double] =
    samples match {
      case 0 => partition
      case n =>
        val sample = dist.sample()
        val tmp = if (sample > lastMax) partition + sample else partition
        // Keep only the targetSize smallest candidates: drop the largest once tmp overflows.
        fill(n - 1, if (tmp.size > targetSize) tmp.init else tmp, targetSize)
    }
}

Related

Combination Algorithm from multiple sets

I am trying to write an algorithm that tells me how many pairs I could generate with items coming from multiple sets of values. For example, I have the following sets:
{1,2,3} {4,5} {6}
From these sets I can generate 11 pairs:
{1,4}, {1,5}, {1,6}, {2,4}, {2,5}, {2,6}, {3,4}, {3,5}, {3,6}, {4,6}, {5,6}
I wrote the following algorithm:
int result = 0;
for (int k = 0; k < numberOfSets; k++) { // map is a list where I store all my sets
    int size1 = map.get(k);
    for (int l = k + 1; l < numberOfSets; l++) {
        int size2 = map.get(l);
        result += size1 * size2;
    }
}
But as you can see the algorithm is not very scalable. If the number of sets increases the algorithm starts performing very poorly.
Am I missing something? Is there an algorithm that can help me with this? I have been looking at combination and permutation algorithms, but I am not sure that is the right path for this.
Thank you very much in advance.
First of all, if the order within the pairs matters, then starting with int l=k+1 in the inner loop is erroneous: e.g. you are missing {4,1}. If you consider it equal to {1,4}, then the result is correct; otherwise it isn't.
Second, to complicate the matter further, you don't say whether the pairs need to be unique or not. E.g. {1,2}, {2,3}, {4} will generate {2,4} twice. If you need to count it only once, the result of your code is incorrect (you will need to scan those sets, actually generate the pairs, and keep a Set<Pair<Integer,Integer>> to remove the duplicates).
The good news: while you can't do better than O(N^2) just for counting the pairs, even if you have thousands of sets, the millions of integer multiplications/additions are fast enough on today's computers. For comparison, Eigen deals quite well with O(N^3) floating-point multiplications (see matrix multiplication).
Assuming you only care about the number of pairs, and are counting duplicates, then there is a more efficient algorithm:
We will keep track of the current number of pairs, and the number of elements which we have encountered so far.
Go over the list from the end to the start.
For each new set, the number of new pairs we can make is the size of the set * the number of elements encountered so far. Add this to the current number of pairs.
Add the size of the new set to the number of elements encountered so far.
The code:
int numberOfPairs = 0;
int elementsEncountered = 0;
for (int k = numberOfSets - 1; k >= 0; k--) {
    int sizeOfCurrentSet = map.get(k);
    int numberOfNewPairs = sizeOfCurrentSet * elementsEncountered;
    numberOfPairs += numberOfNewPairs;
    elementsEncountered += sizeOfCurrentSet;
}
The key point to realize is that when we count the number of new pairs that each set contributes, it doesn't matter from which set we select the second element of the pair. That is, we don't need to keep track of any set which we have already analyzed.
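As a self-contained check of the single-pass idea, here is a minimal sketch. `PairCounter` and `countPairs` are hypothetical names, and the input is assumed to be just the list of set sizes. It walks the list front to back instead of back to front, which gives the same total because each pair of sets is still multiplied exactly once.

```java
import java.util.List;

// Hypothetical helper illustrating the O(n) pair count.
final class PairCounter {
    /** Counts pairs whose two elements come from different sets, given only the set sizes. */
    static long countPairs(List<Integer> setSizes) {
        long pairs = 0;
        long elementsSeen = 0; // total size of all sets processed so far
        for (int size : setSizes) {
            pairs += size * elementsSeen; // every new element pairs with every earlier element
            elementsSeen += size;
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sizes of {1,2,3} {4,5} {6} from the question.
        System.out.println(countPairs(List.of(3, 2, 1))); // prints 11
    }
}
```

Using long for the running totals is a design choice to delay overflow when there are many large sets.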

Random number generator without duplication

I am trying to write a piece of code in which a random number is generated and saved in a collection, so that the next time another random number is generated I can check whether the new number is already in the list.
The main point of this method is to generate numbers in the range 1 to 118, with no duplicates allowed.
Random rand = new Random();
randomNum2 = rand.nextInt(118) + 1;
if (!generated.contains(randomNum2))
{
    String strTemp = "whiteElements\\"+String.valueOf(randomNum2)+".JPG";
    btnPuzzlePiece2.setIcon(new ImageIcon(strTemp));
    generated.add(randomNum2);
    btnPuzzlePiece2.repaint();
}
else
    setPicForBtnGame1();
BUT the problem with this piece of code is that, as the program keeps generating numbers, the probability of getting a valid random number (in range and not a duplicate) keeps shrinking. Imagine the method has already run 110 times: the chance that the next attempt produces a valid number is down to 8 in 118, roughly 7%. That wastes a lot of processing on retries and, in principle, leaves a chance of never completing the list of numbers from 1 to 118.
So how can I write this correctly?
P.S. I thought of creating 118 objects and saving them in a collection, then picking a random object and removing it from the list afterwards, so that the next pick has no chance of being a duplicate.
Help me out please ...
Create a List, and populate it with the elements in your range. Then shuffle() the list, and the order is your random numbers. That is, the 0-th element is your first random number, the 1st element is your second random number, etc.
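A minimal sketch of the shuffle approach, assuming the range 1 to 118 from the question (`ShuffledRange` and `shuffledRange` are hypothetical names):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: every number in the range exactly once, in random order.
final class ShuffledRange {
    /** Returns the integers min..max (inclusive) in random order, each exactly once. */
    static List<Integer> shuffledRange(int min, int max) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = min; i <= max; i++) {
            numbers.add(i);
        }
        Collections.shuffle(numbers); // uniform random permutation
        return numbers;
    }

    public static void main(String[] args) {
        for (int pieceNumber : shuffledRange(1, 118)) {
            // Each pieceNumber appears once, so no duplicate check or retry is needed.
            System.out.println("whiteElements\\" + pieceNumber + ".JPG");
        }
    }
}
```

The work is done once up front, so the cost per number no longer grows as the list of used numbers fills up.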
Wouldn't it be better to just generate something that can never be a duplicate?
A random number with no duplicates is usually known as a UUID.
The easiest way to generate a UUID is to prefix your random number with the current system time in milliseconds.
Of course there's a chance that it could be a duplicate, but it's vanishingly small. It might also be long, so you may want to base64-encode it, for example, to reduce its size.
You can get a more or less guaranteed UUID down to about 8 characters using encoding.

Random.nextInt(int) is [slightly] biased

Namely, it will never generate more than 16 even numbers in a row with some specific upperBound parameters:
Random random = new Random();
int c = 0;
int max = 17;
int upperBound = 18;
while (c <= max) {
    int nextInt = random.nextInt(upperBound);
    boolean even = nextInt % 2 == 0;
    if (even) {
        c++;
    } else {
        c = 0;
    }
}
In this example the code will loop forever, while when upperBound is, for example, 16, it terminates quickly.
What can be the reason of this behavior? There are some notes in the method's javadoc, but I failed to understand them.
UPD1: The code seems to terminate with odd upper bounds, but may get stuck with even ones.
UPD2:
I modified the code to capture the statistics of c as suggested in the comments:
Random random = new Random();
int c = 0;
// Note: both operands are ints, so the shift distance is taken mod 32 and this is 1 << 26 = 67,108,864.
long trials = 1 << 58;
int max = 20;
int[] stat = new int[max + 1];
while (trials > 0) {
    while (c <= max && trials > 0) {
        int nextInt = random.nextInt(18);
        boolean even = nextInt % 2 == 0;
        if (even) {
            c++;
        } else {
            stat[c] = stat[c] + 1;
            c = 0;
        }
        trials--;
    }
}
System.out.println(Arrays.toString(stat));
Now it tries to reach 20 evens in a row, to get better statistics, and the upperBound is still 18.
The results turned out to be more than surprising:
[16776448, 8386560, 4195328, 2104576, 1044736,
518144, 264704, 132096, 68864, 29952, 15104,
12032, 1792, 3072, 256, 512, 0, 256, 0, 0]
At first it decreases as expected by the factor of 2, but note the last line! Here it goes crazy and the captured statistics seem to be completely weird.
[Figure: bar plot of these statistics in log scale]
How c reaches the value 17 a total of 256 times is yet another mystery.
http://docs.oracle.com/javase/6/docs/api/java/util/Random.html:
An instance of this class is used to generate a stream of
pseudorandom numbers. The class uses a 48-bit seed, which is modified
using a linear congruential formula. (See Donald Knuth, The Art of
Computer Programming, Volume 3, Section 3.2.1.)
If two instances of Random are created with the same seed, and the
same sequence of method calls is made for each, they will generate and
return identical sequences of numbers. [...]
It is a pseudo-random number generator. This means that you are not actually rolling dice but rather using a formula to calculate the next "random" value from the current one. To create the illusion of randomness, a seed is used. The seed is the first value fed into the formula to generate a random value.
Apparently Java's Random implementation (the "formula") does not generate more than 16 even numbers in a row.
This behaviour is the reason why the seed is usually initialized with the time. Depending on when you start your program you will get different results.
The benefit of this approach is that you can generate repeatable results. If you have a game generating "random" maps, you can remember the seed to regenerate the same map if you want to play it again, for instance.
For true random numbers some operating systems provide special devices that generate "randomness" from external events like mouse movements or network traffic. However, I do not know how to tap into those from Java.
From the Javadoc for SecureRandom:
Many SecureRandom implementations are in the form of a pseudo-random
number generator (PRNG), which means they use a deterministic
algorithm to produce a pseudo-random sequence from a true random seed.
Other implementations may produce true random numbers, and yet others
may use a combination of both techniques.
Note that SecureRandom does NOT guarantee true random numbers either.
Why changing the seed does not help
Let's assume random numbers only have the range 0-7.
Now we use the following formula to generate the next "random" number:
next = (current + 3) % 8
the sequence becomes 0 3 6 1 4 7 2 5.
If you now take the seed 3 all you do is to change the starting point.
In this simple implementation, which only uses the previous value, every value may occur only once before the sequence wraps around and starts again. Otherwise there would be an unreachable part.
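The toy formula can be tried out directly. In this small sketch, `ToyGenerator` and `firstEight` are hypothetical names and the seed is assumed to be non-negative. It prints the first eight values for the seeds 0 and 3, showing that the seed only moves the starting point within the same cycle.

```java
// Hypothetical toy generator for next = (current + 3) % 8.
final class ToyGenerator {
    private int current;

    ToyGenerator(int seed) {
        this.current = seed % 8; // assumes a non-negative seed
    }

    /** Returns the current value, then advances with next = (current + 3) % 8. */
    int next() {
        int value = current;
        current = (current + 3) % 8;
        return value;
    }

    /** First eight values for the given seed, separated by spaces. */
    static String firstEight(int seed) {
        ToyGenerator g = new ToyGenerator(seed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            if (i > 0) sb.append(' ');
            sb.append(g.next());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstEight(0)); // 0 3 6 1 4 7 2 5
        System.out.println(firstEight(3)); // 3 6 1 4 7 2 5 0
    }
}
```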
E.g. imagine the sequence 0 3 6 1 3 4 7 2 5. The numbers 0, 4, 7, 2 and 5 would never be generated more than once (depending on the seed they might never be generated at all), since once the generator reaches 3 a second time it loops 3, 6, 1, 3, 6, 1, ... forever.
Simplified pseudo-random number generators can be thought of as a permutation of all numbers in the range, with the seed as the starting point. For more advanced generators you would have to replace the permutation with a list in which the same number may occur multiple times.
More complex generators can have an internal state, allowing the same number to occur several times in the sequence, since the state lets the generator know where to continue.
The implementation of Random uses a simple linear congruential formula. Such formulae have a natural periodicity and all sorts of non-random patterns in the sequence they generate.
What you are seeing is an artefact of one of these patterns ... nothing deliberate. It is not an example of bias. Rather it is an example of auto-correlation.
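One such pattern can be made concrete. The sketch below does not call java.util.Random; it re-implements the state update with the constants given in the Javadoc of Random.next(int), and `LcgLowBits`, `step` and `bitRepeats` are hypothetical names. According to that Javadoc, next(31) returns the state shifted right by 17 bits, so the lowest bit of next(31) is bit 17 of the state. For an even, non-power-of-two bound such as 18, nextInt reduces next(31) modulo the bound, so that bit decides whether the result is even or odd. In a linear congruential generator modulo a power of two, bit k repeats with a period of at most 2^(k+1), so the even/odd pattern would be a fixed cycle of length 2^18 (ignoring the rare rejection step in nextInt), and a different seed only changes where in the cycle you start. This would also be consistent with every count in the statistics above being a multiple of 256: the int expression 1 << 58 evaluates to 1 << 26, and 2^26 draws run through a 2^18 cycle 256 times.

```java
// Hypothetical re-implementation of the documented LCG step, to inspect single state bits.
final class LcgLowBits {
    // Constants as documented for java.util.Random.next(int).
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    static long step(long state) {
        return (state * MULTIPLIER + ADDEND) & MASK;
    }

    /** True if bit `bit` of the state has the same value 2^(bit+1) steps later, for `checks` steps. */
    static boolean bitRepeats(long seed, int bit, int checks) {
        int period = 1 << (bit + 1);
        long a = seed & MASK;
        long b = a;
        for (int i = 0; i < period; i++) {
            b = step(b); // b now runs `period` steps ahead of a
        }
        for (int i = 0; i < checks; i++) {
            if (((a >>> bit) & 1L) != ((b >>> bit) & 1L)) {
                return false;
            }
            a = step(a);
            b = step(b);
        }
        return true;
    }

    public static void main(String[] args) {
        // Bit 17 of the state is the lowest bit of next(31).
        System.out.println(bitRepeats(12345L, 17, 100_000)); // prints true
    }
}
```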
If you need better (more "random") numbers, then you need to use SecureRandom rather than Random.
And the answer to "why was it implemented that way" is ... performance. A call to Random.nextInt can be completed in tens or hundreds of clock cycles. A call to SecureRandom is likely to be at least 2 orders of magnitude slower, possibly more.
For portability, Java specifies that implementations must use the inferior LCG method for java.util.Random. This method is completely unacceptable for any serious use of random numbers like complex simulations or Monte Carlo methods. Use an add-on library with a better PRNG algorithm, like Marsaglia's MWC or KISS. Mersenne Twister and Lagged Fibonacci Generators are often OK as well.
I'm sure there are Java libraries for these algorithms. I have a C library with Java bindings if that will work for you: ojrandlib.

How to be sure that random numbers are unique and not duplicated?

I have some simple code which generates random numbers:
SecureRandom random = new SecureRandom();
...
public int getRandomNumber(int maxValue) {
    return random.nextInt(maxValue);
}
The method above is called about 10 times (not in a loop). I want to ensure that all the numbers are unique (assuming that maxValue > 1000).
Can I be sure that I will get unique numbers every time I call it? If not, how can I fix it?
EDIT: I may have said it vaguely. I wanted to avoid manual checks if I really got unique numbers so I was wondering if there is a better solution.
There are different ways of achieving this and which is more appropriate will depend on how many numbers you need to pick from how many.
If you are selecting a small number of random numbers from a large range of potential numbers, then you're probably best just storing previously chosen numbers in a set and "manually" checking for duplicates. Most of the time, you won't actually get a duplicate and the test will have practically zero cost in practical terms. It might sound inelegant, but it's not actually as bad as it sounds.
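A minimal sketch of this "remember what was already returned" approach, assuming the same maxValue is passed on every call (`UniqueRandomNumbers` is a hypothetical class name):

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: rejects values that were already handed out.
final class UniqueRandomNumbers {
    private final SecureRandom random = new SecureRandom();
    private final Set<Integer> alreadyReturned = new HashSet<>();

    /** Returns a value in [0, maxValue) that this instance has not returned before. */
    int getRandomNumber(int maxValue) {
        // Assumes maxValue is the same on every call; otherwise this check is only approximate.
        if (alreadyReturned.size() >= maxValue) {
            throw new IllegalStateException("all values below " + maxValue + " are used up");
        }
        int candidate;
        do {
            candidate = random.nextInt(maxValue);
        } while (!alreadyReturned.add(candidate)); // add() returns false for a duplicate
        return candidate;
    }
}
```

With about 10 calls and more than 1000 possible values, the retry loop will almost never run more than once.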
Some underlying random number generation algorithms don't produce duplicates at their "raw" level. So for example, an algorithm called a XORShift generator can effectively produce all of the numbers within a certain range, shuffled without duplicates. So you basically choose a random starting point in the sequence then just generate the next n numbers and you know there won't be duplicates. But you can't arbitrarily choose "max" in this case: it has to be the natural maximum of the generator in question.
If the range of possible numbers is small-ish but the number of numbers you need to pick is within a couple of orders of magnitude of that range, then you could treat this as a random selection problem. For example, to choose 100,000 numbers within the range 10,000,000 without duplicates, I can do this:
Let m be the number of random numbers I've chosen so far (initially 0)
For i = 1 to 10,000,000
    Generate a random (floating point) number, r, in the range 0-1
    If (r < (100,000-m)/(10,000,000-i+1)), then add i to the list and increment m
Shuffle the list, then pick numbers sequentially from the list as required
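The steps above can be sketched in Java as follows. `SelectionSampling` and `sample` are hypothetical names. Each candidate i is accepted with probability (numbers still needed) / (candidates still available, including i), which makes the loop pick exactly the requested count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper implementing the selection-sampling steps.
final class SelectionSampling {
    /** Picks `count` distinct values from 1..rangeSize and returns them in random order. */
    static List<Integer> sample(int count, int rangeSize, Random random) {
        if (count < 0 || count > rangeSize) {
            throw new IllegalArgumentException("count must be between 0 and rangeSize");
        }
        List<Integer> chosen = new ArrayList<>(count);
        for (int i = 1; i <= rangeSize && chosen.size() < count; i++) {
            int stillNeeded = count - chosen.size();
            int stillAvailable = rangeSize - i + 1; // candidates i..rangeSize
            // Same test as r < stillNeeded / stillAvailable, written without a division.
            if (random.nextDouble() * stillAvailable < stillNeeded) {
                chosen.add(i);
            }
        }
        Collections.shuffle(chosen, random); // the loop yields ascending order, so shuffle
        return chosen;
    }

    public static void main(String[] args) {
        System.out.println(sample(10, 1000, new Random()));
    }
}
```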
But obviously, there's only much point in choosing the latter option if you need to pick some reasonably large proportion of the overall range of numbers. For choosing 10 numbers in the range 1 to a billion, you would be generating a billion random numbers when by just checking for duplicates as you go, you'd be very unlikely to actually get a duplicate and would only have ended up generating 10 random numbers.
A random sequence does not mean that all values are unique. The sequence 1,1,1,1 is exactly as likely as the sequence 712,4,22,424.
In other words, if you want to be guaranteed a sequence of unique numbers, generate 10 of them at once, check for the uniqueness condition of your choice and store them, then pick a number from that list instead of generating a random number in your 10 places.
Every time you call Random#nextInt(int) you will get
a pseudorandom, uniformly distributed int value between 0 (inclusive)
and the specified value (exclusive).
If you want x unique numbers, keep getting new numbers until you have that many, then select your "random" number from that list. However, since you are filtering the numbers generated, they won't truly be random anymore.
For such a small number of possible values, a trivial implementation would be to put your 1000 integers in a list and have a loop which, at each iteration, generates a random index between 0 (inclusive) and list.size() (exclusive), picks the number stored at this index, and removes it from the list.
This code is very efficient on the CPU at the cost of memory: the array needs one entry per potential value, i.e. (size of one entry) * maxValue bytes, for example 2000 bytes for 1000 values stored as 16-bit integers. A 16-bit unsigned index works for values up to 65535; a wider type can be used for larger ranges at the cost of a lot more memory.
The whole purpose of the array is to record whether a value has been used before or not.
The loop keeps generating random numbers until an unused value is found. It then marks the value as used and returns it.
Be careful with the scope of the array: if it goes out of scope, the record of used values is lost.
I have used this in C and it works. The version below is adapted to Java, with a boolean array instead of the integer flags.
// Assumes a field like: private final SecureRandom random = new SecureRandom();
private final boolean[] used = new boolean[1000]; // used[i] is true once i has been returned

public int getRandomNumber(int maxValue) {
    // maxValue must not exceed used.length, and the caller must not ask for
    // more than maxValue numbers, otherwise this loop never ends.
    while (true) {
        int candidate = random.nextInt(maxValue);
        if (!used[candidate]) {
            used[candidate] = true;
            return candidate;
        }
    }
}

random number with seed

Reference: link text
I cannot understand the following line. Can anybody give me an example for the statement below?
If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers
Since you asked for an example:
import java.util.Random;

public class RandomTest {
    public static void main(String[] s) {
        Random rnd1 = new Random(42);
        Random rnd2 = new Random(42);
        System.out.println(rnd1.nextInt(100)+" - "+rnd2.nextInt(100));
        System.out.println(rnd1.nextInt()+" - "+rnd2.nextInt());
        System.out.println(rnd1.nextDouble()+" - "+rnd2.nextDouble());
        System.out.println(rnd1.nextLong()+" - "+rnd2.nextLong());
    }
}
Both Random instances will always have the same output, no matter how often you run it, no matter what platform or what Java version you use:
30 - 30
234785527 - 234785527
0.6832234717598454 - 0.6832234717598454
5694868678511409995 - 5694868678511409995
The random generator is deterministic. Given the same input to Random and the same usage of the methods in Random, the sequence of pseudo-random numbers returned to your program will be the same even in different runs on different machines.
This is why it is pseudo-random - the numbers returned behave statistically like random numbers except they can be reliably predicted. True random numbers are unpredictable.
The Random class is basically a pseudorandom number generator (also known as a deterministic random bit generator) that generates a sequence of numbers approximating the properties of random numbers. It is not truly random but deterministic, as its output is determined by a small amount of initial state in the generator (such as the seed). Because of this deterministic nature, you get identical results from two generators if their seeds and the sequences of method calls are identical.
The numbers are not really random: given the same starting conditions (the seed) and the same sequence of operations, the same sequence of numbers will be generated. This is why it would not be a good idea to use the basic Random class as part of any cryptographic or security-related code, since it may be possible for an attacker to figure out which sequence is being generated and predict future numbers.
For a random number generator that emits non-deterministic values, take a look at SecureRandom.
See Random number generation, Computational methods on wikipedia for more info.
This means that when you create the Random object (e.g. at the start of your program), you will probably want to start with a new seed. Mostly people choose some time related value, such as the number of ticks.
The fact that the number sequences are the same given the same seed is actually very convenient if you want to debug your program: make sure you log the seed value and if something is wrong you can restart the program in the debugger using that same seed value. This means you can replay the scenario exactly. This would be impossible if you would (could) use a true random number generator.
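A minimal sketch of that debugging workflow. `ReplayableRun` and `run` are hypothetical names, and the "simulation" is just a handful of nextInt calls. The seed is logged on every run and can be passed back in as a command-line argument to replay the exact same sequence.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical example of logging a seed so that a run can be replayed.
final class ReplayableRun {
    /** Stand-in for a simulation: everything random derives from the one seed. */
    static int[] run(long seed, int draws) {
        Random random = new Random(seed);
        int[] values = new int[draws];
        for (int i = 0; i < draws; i++) {
            values[i] = random.nextInt(100);
        }
        return values;
    }

    public static void main(String[] args) {
        // Replay with a given seed if one is passed in, otherwise pick a fresh time-based one.
        long seed = args.length > 0 ? Long.parseLong(args[0]) : System.nanoTime();
        System.out.println("seed = " + seed); // log it so a failing run can be reproduced
        System.out.println(Arrays.toString(run(seed, 5)));
    }
}
```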
With the same seed value, separate instances of Random will return/generate the same sequence of random numbers; more on this here:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Tech/Chapter04/javaRandNums.html
Ruby Example:
class LCG
  def initialize(seed=Time.now.to_i, a=2416, b=374441, m=1771075)
    @x, @a, @b, @m = seed % m, a, b, m
  end

  def next()
    @x = (@a * @x + @b) % @m
  end
end
irb(main):004:0> time = Time.now.to_i
=> 1282908389
irb(main):005:0> r = LCG.new(time)
=> #<LCG:0x0000010094f578 @x=650089, @a=2416, @b=374441, @m=1771075>
irb(main):006:0> r.next
=> 45940
irb(main):007:0> r.next
=> 1558831
irb(main):008:0> r.next
=> 1204687
irb(main):009:0> f = LCG.new(time)
=> #<LCG:0x0000010084cb28 @x=650089, @a=2416, @b=374441, @m=1771075>
irb(main):010:0> f.next
=> 45940
irb(main):011:0> f.next
=> 1558831
irb(main):012:0> f.next
=> 1204687
Based on the values a/b/m, the result will be the same for a given seed. This can be used to generate the same "random" number in two places and both sides can depend on getting the same result. This can be useful for encryption; although obviously, this algorithm isn't cryptographically secure.
