How to create a multivariate uniform distribution? - Java

Goal
I would like to sample from a bi-variate uniform distribution with a specified correlation coefficient in Java.
Question
What method can I use to implement such a multivariate uniform distribution?
or
Is there an existing package that implements such a thing, so that I don't have to reinvent the wheel?
What I've got so far
The mvtnorm package in R allows sampling from a multivariate normal distribution with specified correlation coefficients. I thought that understanding their method might help me, either by doing something similar with uniform distributions or by repeating their work and using copulas to transform the multivariate normal into a multivariate uniform (as I did in R there).
The source code is written in Fortran, and I don't speak Fortran! The code is based on this paper by Genz and Bretz, but it is too math-heavy for me.

I have an idea. Typically, you generate U(0,1) by generating, say, 32 random bits and dividing by 2^32, getting back one float. For two U(0,1) values you generate two 32-bit integers, divide, and get two floats back. So far so good. Such a bi-variate generator would be uncorrelated and very simple to check.
Suppose you build your bi-variate generator in the following way. Inside, you get two random 32-bit integers, and then produce two U(0,1) values with shared parts. Say, you take 24 bits from the first integer and 24 bits from the second integer, but the upper (or lower, or middle, or ...) 8 bits are the same for both: taken from the first integer and copied into the second.
Clearly, those two U(0,1) would be correlated. We could write them as
U(0,1)_0 = a_0 + b
U(0,1)_1 = a_1 + b
I omit some coefficients etc. for simplicity. Each one is U(0,1) with mean 1/2 and variance 1/12. Now you have to compute the Pearson correlation as
r = ( E[U(0,1)_0 · U(0,1)_1] - 1/4 ) / (sqrt(1/12))^2
Using the expansion above, it should be easy after some algebra to compute r and compare it with the one you want. You may vary the size of the correlated part b, as well as its position (high bits, low bits, somewhere in the middle), to fit the desired r.
Realistically speaking, there are infinitely many ways to get the same r with different sampling code and different bi-variate distributions, so you might want to add more constraints in the future.
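For illustration, here is a minimal Java sketch of the shared-bits idea; the class and method names are my own, and the 24/8 bit split is just one possible choice.

import java.util.Random;

public class SharedBitsBivariate {
    private static final Random RNG = new Random();

    // Returns two correlated U(0,1) samples, each built from 24 random bits,
    // where the top 8 bits of the first draw are copied into the second.
    public static double[] samplePair() {
        int first = RNG.nextInt() >>> 8;          // 24 random bits
        int second = RNG.nextInt() >>> 8;         // 24 random bits
        int shared = first & 0xFF0000;            // top 8 of the 24 bits
        second = shared | (second & 0x00FFFF);    // overwrite them in the second draw
        double u0 = first / (double) (1 << 24);
        double u1 = second / (double) (1 << 24);
        return new double[] { u0, u1 };
    }
}

Note that sharing high bits gives an r close to 1, while sharing the same number of low bits gives an r close to 0, because the shared part then contributes almost nothing to the variance.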

Related

XGBoost Cross Validation with different cut point

I would like to create two models for binary prediction: one with a cut point strictly greater than 0.5 (in order to obtain fewer signals, but better ones) and a second with a cut point strictly less than 0.5.
When doing cross-validation, the reported test error corresponds to a cut point of 0.5. How can I do this with another cut value? I am talking about XGBoost for Java.
xgboost returns a list of scores, and you can do whatever you want with that list of scores.
I think that, particularly in Java, it returns a 2-D ArrayList of shape (1, n).
In binary prediction you probably used a logistic function, so your scores will be between 0 and 1.
Take your scores object and create a custom function that will calculate new predictions, by the rules you've described.
If you are using an automated/xgboost-implemented Cross Validation Function, you might want to build a customized evaluation function which will do as you bid, and pass it as an argument to xgb.cv
If you want to be smart when setting your threshold, I suggest reading about the AUC of the ROC curve and of the precision-recall curve.
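As a rough sketch of the scores-to-labels step (the 2-D shape of the scores is the assumption made above, and applyCutPoint is an illustrative name, not part of the xgboost4j API):

import java.util.ArrayList;
import java.util.List;

// Turns raw logistic scores into 0/1 predictions at an arbitrary cut point.
// Iterates over both dimensions, so the exact shape ((1, n) or (n, 1)) doesn't matter.
public static List<Integer> applyCutPoint(float[][] scores, double cutPoint) {
    List<Integer> labels = new ArrayList<>();
    for (float[] row : scores) {
        for (float score : row) {
            labels.add(score > cutPoint ? 1 : 0);
        }
    }
    return labels;
}

Run it once with, say, cutPoint = 0.7 for the "fewer but better" model and once with cutPoint = 0.3 for the other.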

How to test the correct implementation of special polynomials?

I need to implement the calculation of some special polynomials in Java (the language is not really important). These are calculated as a weighted sum of a number of base polynomials with fixed coefficients.
Each base polynomial has 2 to 10 coefficients and there are typically 10 base polynomials considered, giving a total of, say 20-50 coefficients.
Basically the calculation is no big deal, but I am worried about typos. I only have a printed document as a template, so I would like to implement unit tests for the calculations. The issue is: how do I get reliable testing data? I do have other software that is supposed to calculate these functions, but the process is complicated and error-prone - I would have to scale the input values, go through a number of menu selections in the software to produce the output, and then paste it into my testing code.
I guess there is no way around using the external software to generate some testing data, but maybe you have some recommendations for making this type of testing procedure safer, or for minimizing the required number of test cases.
I am also worried about providing suitable input values: Depending on the value of the independent variable, certain terms will only have a tiny contribution to the output, while for other values they might dominate.
The types of errors I expect (and need to avoid) are:
Typos in coefficients
Coefficients applied to wrong power (i.e. a_7*x^6 instead of a_7*x^7 - just for demonstration, I am not calculating this way but am using Horner's scheme)
Off-by-one errors (i.e. a missing zero-order or highest-order term)
Since you have a polynomial of degree 10, testing at 11 distinct points gives certainty: two distinct polynomials of degree 10 can agree on at most 10 points.
However, even a test at one well-randomized point, say x = 1.23004 (away from small fractions like 2/3 or 4/5), will with high probability reveal an error, because it is unlikely that the difference between the wrong and the true polynomial has a root at exactly that place.
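A minimal sketch of such a test, assuming the polynomial is evaluated with Horner's scheme; the coefficients and the reference value are placeholders to be filled in from the external software:

// Evaluates c[0] + c[1]*x + ... + c[n]*x^n via Horner's scheme.
public static double horner(double[] c, double x) {
    double result = 0.0;
    for (int i = c.length - 1; i >= 0; i--) {
        result = result * x + c[i];
    }
    return result;
}

// In a unit test, compare against a value produced by the reference software:
// double[] c = { /* coefficients typed in from the printed document */ };
// assertEquals(referenceValue, horner(c, 1.23004), 1e-9);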

Generating Random Numbers with Standard Deviation [duplicate]

This question already has answers here: Java normal distribution (2 answers). Closed 7 years ago.
I'm experimenting with AI for an extremely simple simulation game.
The game will contain people (instances of objects with random properties) who have a set amount of money to spend.
I'd like the distribution of "wealth" to be statistically valid.
How can I generate a random number (money) which adheres to a standard deviation (e.g. mean:50, standard deviation: 10), whereby a value closer to the mean is more likely to be generated?
I think you're focusing on the wrong end of the problem. The first thing you need to do is identify the distribution you want to use to model wealth. A normal distribution with a mean of 50 and standard deviation of 10 nominally meets your needs, but so does a uniform distribution in the range [32.67949, 67.32051]. There are lots of statistical distributions that can have the same mean and standard deviation but which have completely different shapes, and it is the shape that will determine the validity of your distribution.
Income and wealth turn out to have very skewed distributions: they are bounded below by zero, while a few people have such large amounts compared to the rest of us that they drag the mean upward by quite noticeable amounts. Consequently, you don't want a naive distribution choice such as uniform or Gaussian, or anything else that is symmetric or can dip into negative territory. Using an exponential would be far more realistic, but it still may not be sufficiently extreme to capture the actual wealth distribution we see in the real world.
Once you've picked a distribution, there are many software libraries or sources of info that will help you generate values from that distribution.
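For example, drawing from an exponential distribution with a chosen mean takes one line via inverse-transform sampling (a sketch; the method name is mine):

public static double sampleExponential(java.util.Random rng, double mean) {
    // Inverse CDF of the exponential distribution: F^-1(u) = -mean * ln(1 - u)
    return -mean * Math.log(1.0 - rng.nextDouble());
}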
Generating random numbers is a vast topic. But since you said it's a simple simulation, here's a simple approach to get going:
Generate several (say n) random numbers uniformly distributed on (0, 1). The built-in function Math.random can supply those numbers.
Add up those numbers. The sum has a distribution which is approximately normal, with mean n/2 and standard deviation sqrt(n/12). So if you subtract n/2 and then divide by sqrt(n/12), you'll have something which is approximately normal with mean 0 and standard deviation 1. Conveniently, if you pick n = 12, then sqrt(n/12) = 1, so all you have to do is subtract 6 from the sum and you're done.
Now to get any other mean and standard deviation, just multiply by the standard deviation you want, and add the mean you want.
There are many other ways to go about it, but this is perhaps the simplest. I assume that's OK given your description of the problem.
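Here is that recipe with n = 12, as a small sketch (approxGaussian is an illustrative name):

public static double approxGaussian(double mean, double stdDev) {
    double sum = 0.0;
    for (int i = 0; i < 12; i++) {
        sum += Math.random();          // 12 U(0,1) draws
    }
    // Sum of 12 U(0,1) draws: mean 6, standard deviation 1, roughly normal.
    return mean + stdDev * (sum - 6.0);
}

Calling approxGaussian(50, 10) then gives values centered at 50 with a standard deviation of 10. Note that this approximation can never produce a value more than 6 standard deviations from the mean; java.util.Random.nextGaussian() is the built-in alternative if you want a proper normal variate.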

Generating Random Hash Functions for LSH Minhash Algorithm

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment).
In order to do that, I've been generating random numbers a, b, and c (from the range 1 to 2001) for each of the 240 hash functions. Then my hash function returns h = ((a*x) + b) % c, where h is the return value and x is one of the integers run through it.
Is this an efficient implementation of random hashing, or is there a more common/acceptable way to do it?
This post was asking a similar question, but I'm still somewhat confused by the wording of the answer: Minhash implementation how to find hash functions for permutations
When I was working with Bloom filters a few years ago, I ran across an article that describes how to generate multiple hash functions very simply, with a minimum of code. The method it describes works very well. See Less Hashing, Same Performance: Building a Better Bloom Filter.
The basic idea is to create two hash functions, call them h1 and h2, with which you can then simulate multiple hash functions g_1 through g_k using the formula:
g_i(x) = h1(x) + i*h2(x)
where i varies from 1 to k (the number of hash functions you want).
The paper is well worth reading, even if you decide not to implement his idea. Although after reading it I can't imagine not wanting to implement it. It made my Bloom filter code a whole lot more tractable and didn't negatively impact performance.
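A sketch of that formula in Java; the two base hashes below are deliberately simple placeholders, and any two decent hash functions will do:

// Derives k bucket indices from two base hashes via g_i(x) = h1(x) + i*h2(x).
public static int[] kHashes(int x, int k, int numBuckets) {
    int h1 = x * 0x9E3779B1;            // placeholder base hash 1
    int h2 = (x * 0x85EBCA77) | 1;      // placeholder base hash 2, forced odd
    int[] g = new int[k];
    for (int i = 1; i <= k; i++) {
        g[i - 1] = Math.floorMod(h1 + i * h2, numBuckets);
    }
    return g;
}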
So the method that I described above was almost correct. The numbers a and b should be randomly generated. However, c needs to be a prime number slightly larger than the maximum possible value of x. Once those numbers have been chosen, finding the hash value with h = ((a*x) + b) % c is the standard, accepted way to generate hash functions.
Also, a and b should be random numbers from the range 1 to c-1.
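Put together, one such hash function might look like this sketch; p = 2003 is a prime just above the question's maximum x of about 2000, and the class name is mine:

import java.util.Random;

public class AffineHash {
    private final long a, b, p;

    public AffineHash(Random rng, long p) {
        this.p = p;
        this.a = 1 + rng.nextInt((int) p - 1);   // a in [1, p-1]
        this.b = 1 + rng.nextInt((int) p - 1);   // b in [1, p-1]
    }

    // h(x) = (a*x + b) mod p, with p prime and slightly larger than max x.
    public long hash(long x) {
        return (a * x + b) % p;
    }
}

// Build as many independent functions as you need, e.g. 240 of them:
// AffineHash[] fns = new AffineHash[240];
// for (int i = 0; i < 240; i++) fns[i] = new AffineHash(rng, 2003);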

Random level function in skip list

I am looking at a skip list implementation in Java, and I am wondering about the purpose of the following method:
public static int randomLevel() {
    // Inverse-transform sampling of a geometric distribution:
    // P(level = k) = (1-P)^k * P for k = 0, 1, 2, ...
    // Math.random() is in [0, 1), so 1 - Math.random() is in (0, 1]
    // and the logarithm is always defined.
    int lvl = (int) (Math.log(1. - Math.random()) / Math.log(1. - P));
    // Cap the result so the list never grows beyond MAX_LEVEL levels.
    return Math.min(lvl, MAX_LEVEL);
}
And what is the difference between the above method and
Random.nextInt(6);
Can anyone explain that? Thanks.
Random.nextInt should provide a random variable whose probability distribution is (approximately) a discrete uniform distribution over the interval [0, 6).
You can learn more about this here.
http://puu.sh/XMwn
Note that internally Random uses a linear congruential generator where m = 2^48, a = 25214903917, and c = 11.
randomLevel instead (approximately) uses a geometric distribution where p = 0.5. You can learn more about the distribution here.
http://puu.sh/XMwT
Essentially, randomLevel returns 0 with probability 0.5, 1 with probability 0.25, 2 with 0.125, and so on, up to 6 with probability 0.5^7, i.e. 0.0078125 -- far different from the roughly 0.17 each value gets from Random.nextInt(6).
Now the importance of this is that a skip list is an inherently probabilistic data structure. By utilizing multiple sparse levels of linked lists, it can achieve O(log n) average search time -- similar to a balanced binary search tree, but less complex and using less space. Using a uniform distribution here would not be appropriate, seeing as higher levels need to be far less densely populated than lower ones -- which is what makes the fast searches possible.
Just like the link says...
"This gives us a 50% chance of the random_level() function returning 0, a 25% chance of returning 1, a 12.5% chance of returning 2 and so on..." The distribution is therefore not even. Random.nextInt(), however, is: there is an equal likelihood that any number between 0 and 5 will be selected.
I haven't looked at the full implementation, but what probably happens is that randomLevel() is used to select a number, say n. Then the element that needs to be added to the skip list will have pointers 0, 1, ..., n. You can think of each level as a separate list.
Why use a distribution like this? An even distribution would require too much memory for the benefit it brings. By making higher levels exponentially less likely via a geometric distribution, a sweet spot is reached: values are still found quickly, but with a much smaller memory footprint.
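To see the difference concretely, here is a quick empirical sketch (my own) that tallies how often each level comes up under both schemes, using the same P = 0.5 and MAX_LEVEL = 6 as above:

import java.util.Random;

public class LevelHistogram {
    public static void main(String[] args) {
        final double P = 0.5;
        final int MAX_LEVEL = 6;
        final int trials = 1_000_000;
        int[] geo = new int[MAX_LEVEL + 1];
        int[] uni = new int[MAX_LEVEL + 1];
        Random rng = new Random();
        for (int t = 0; t < trials; t++) {
            int lvl = (int) (Math.log(1. - Math.random()) / Math.log(1. - P));
            geo[Math.min(lvl, MAX_LEVEL)]++;    // randomLevel()
            uni[rng.nextInt(6)]++;              // Random.nextInt(6): levels 0..5
        }
        for (int lvl = 0; lvl <= MAX_LEVEL; lvl++) {
            System.out.printf("level %d: randomLevel %.4f, nextInt %.4f%n",
                    lvl, geo[lvl] / (double) trials, uni[lvl] / (double) trials);
        }
    }
}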
