Percentile calculation mismatch using apache.math3.stat.descriptive - java

I am calculating the 95th percentile of the following list of numbers:
66,337.8,989.7,1134.6,1118.7,1097.9,1122.1,1121.3,1106.7,871,325.2,285.1,264.1,295.8,342.4
The Apache libraries use the NIST standard to calculate the percentile, which is the same method used by Excel. According to Excel, the 95th percentile of the list above should be 1125.85.
However, using the following code I get a different result:
DescriptiveStatistics shortList = new DescriptiveStatistics();

@BeforeTest
@Parameters("shortStatsList")
private void buildShortStatisticsList(String list) {
    StringTokenizer tokens = new StringTokenizer(list, ",");
    while (tokens.hasMoreTokens()) {
        shortList.addValue(Double.parseDouble(tokens.nextToken()));
    }
}

@Test
@Parameters("95thPercentileShortList")
public void percentileShortListTest(String percentile) {
    Assert.assertEquals(Double.toString(shortList.getPercentile(95)), percentile);
}
This fails with the following message:
java.lang.AssertionError: expected:<1125.85> but was:<1134.6>
at org.testng.Assert.fail(Assert.java:89)
at org.testng.Assert.failNotEquals(Assert.java:489)
1134.6 is the maximum value in the list, not the 95th percentile, so I don't know where this value is coming from.

According to the documentation of getPercentile(), it uses the percentile estimation algorithm recorded here.
Percentiles can be estimated from N measurements as follows: for the pth percentile, set p(N+1) equal to k + d, where k is an integer and d is a fraction greater than or equal to 0 and less than 1.
For 0 < k < N, Y(p) = Y[k] + d(Y[k+1] − Y[k])
For k = 0, Y(p) = Y[1]. Note that any p ≤ 1/(N+1) will simply be set to the minimum value.
For k ≥ N, Y(p) = Y[N]. Note that any p ≥ N/(N+1) will simply be set to the maximum value.
Basically this means multiplying the requested percentile (0.95) by (N+1). In your case N is 15, and N+1 is 16, so you get 15.2.
You split this into the whole part k (15) and the fraction d (0.2). Since k ≥ N, the third case above applies: the estimated percentile is the maximum value, which is exactly the 1134.6 you are seeing.
If you keep on reading the NIST article that I linked above, you'll see the part titled "Note that there are other ways of calculating percentiles in common use". It refers you to an article by Hyndman & Fan, which describes several alternative ways of calculating percentiles. It's a misconception to think that there is one NIST method. The methods in Hyndman & Fan are denoted by the labels R1 through R9. The article goes on to say:
Some software packages set 1+p(N−1) equal to k+d and then proceed as above. This is method R7 of Hyndman and Fan. This is the method used by Excel and is the default method for R (the R quantile function can optionally use any of the nine methods discussed in Hyndman & Fan).
The method used by default by Apache's DescriptiveStatistics is Hyndman & Fan's R6. The method used by Excel is R7. Both of them are "NIST methods", but for a small number of measurements, they can give different results.
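You can check this by hand: with R7 you set 1 + p(N−1) = 1 + 0.95×14 = 14.3, so k = 14 and d = 0.3. The two largest values in your sorted list are Y[14] = 1122.1 and Y[15] = 1134.6, so Y(p) = 1122.1 + 0.3×(1134.6 − 1122.1) = 1125.85 -- exactly the value Excel reports.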
Note that the Apache library does allow you to use the R7 algorithm or any of the others, by using the Percentile class. Something like this should do the trick:
DescriptiveStatistics shortList = new DescriptiveStatistics();
shortList.setPercentileImpl(new Percentile()
        .withEstimationType(Percentile.EstimationType.R_7));
(Note that I haven't tested this).
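Alternatively, if you only need the one value, you can evaluate a Percentile directly without DescriptiveStatistics. A minimal sketch (also untested, assuming the commons-math3 API):

import org.apache.commons.math3.stat.descriptive.rank.Percentile;

double[] values = {66, 337.8, 989.7, 1134.6, 1118.7, 1097.9, 1122.1,
        1121.3, 1106.7, 871, 325.2, 285.1, 264.1, 295.8, 342.4};
// Hyndman & Fan's R7 is the estimation type Excel uses
double p95 = new Percentile()
        .withEstimationType(Percentile.EstimationType.R_7)
        .evaluate(values, 95);
System.out.println(p95); // should print 1125.85 (give or take floating point)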

Related

Random level function in skip list

I am looking at a skip list implementation in Java, and I am wondering about the purpose of the following method:
public static int randomLevel() {
    int lvl = (int) (Math.log(1. - Math.random()) / Math.log(1. - P));
    return Math.min(lvl, MAX_LEVEL);
}
And what is the difference between the above method and
Random.nextInt(6);
Can anyone explain that? Thanks.
Random.nextInt(6) should provide a random variable whose probability distribution is (approximately) a discrete uniform distribution over the interval [0, 6).
You can learn more about this here.
http://puu.sh/XMwn
Note that internally Random uses a linear congruential generator where m = 2^48, a = 25214903917, and c = 11.
randomLevel instead (approximately) uses a geometric distribution where p = 0.5. You can learn more about the distribution here.
http://puu.sh/XMwT
Essentially, randomLevel returns 0 with probability 0.5, 1 with 0.25, 2 with 0.125, and so on, up to 6 with probability 0.5^7, i.e. 0.0078125 -- far different from the uniform ~0.17 (1/6) of Random.nextInt(6).
Now the importance of this is that a skip list is an inherently probabilistic data structure. By utilizing multiple sparse levels of linked lists, it can achieve an average runtime of O(log n) for search -- similar to a balanced binary search tree, but less complex and using less space. A uniform distribution would not be appropriate here, seeing as higher levels need to be less densely populated than lower ones (note: below, the levels grow downward) -- which is necessary for the fast searches.
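If you want to see the difference empirically, here's a minimal sketch (my own; it assumes P = 0.5 and MAX_LEVEL = 6, which aren't shown in the original snippet) that tallies both distributions:

import java.util.Random;

public class LevelDistributionDemo {
    static final double P = 0.5;     // promotion probability (assumed)
    static final int MAX_LEVEL = 6;  // level cap (assumed)

    static int randomLevel() {
        int lvl = (int) (Math.log(1. - Math.random()) / Math.log(1. - P));
        return Math.min(lvl, MAX_LEVEL);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int trials = 1_000_000;
        int[] geo = new int[MAX_LEVEL + 1];
        int[] uni = new int[MAX_LEVEL + 1];
        for (int i = 0; i < trials; i++) {
            geo[randomLevel()]++;
            uni[rnd.nextInt(6)]++;   // uniform over 0..5
        }
        for (int lvl = 0; lvl <= MAX_LEVEL; lvl++) {
            System.out.printf("level %d: geometric %.4f  uniform %.4f%n",
                    lvl, geo[lvl] / (double) trials, uni[lvl] / (double) trials);
        }
    }
}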
Just like the link says...
"This gives us a 50% chance of the random_level() function returning 0, a 25% chance of returning 1, a 12.5% chance of returning 2 and so on..." The distribution is therefore not even. However, Random.nextInt() is. There is an equal likelihood that any number between 0 and 5 will be selected.
I haven't looked at the full implementation, but what probably happens is that randomLevel() is used to select a number, say n. Then, the element that needs to be added to the skip list will have pointers 0, 1, ..., n. You can think of each level as a separate list.
Why use a distribution like this? Well, an even distribution would require too much memory for the benefit it brings. By reducing the chance of higher levels with a geometric distribution, the "sweet spot" is attained: lookups stay fast while the memory footprint stays small.

Distributedly "dumping"/"compressing" data samples

I'm not really sure what the right title for my question is, so here's the question:
Suppose I have N number of samples, eg:
1
2
3
4
.
.
.
N
Now I want to "reduce" the size of the sample from N to M, by dumping (N-M) data from the N samples.
I want the dumping to be as "distributed" as possible,
so if I have 100 samples and want to compress them to 50 samples, I would throw away every other sample. Another example: say the data is 100 samples and I want to compress it to 25 samples. I would throw away 1 sample in each group of 100/25 samples, meaning I iterate through the samples and count, and every time my count reaches 4 I throw away the sample and restart the count.
The problem is how do I do this if the 4 above were instead 2.333, for example. How do I treat the decimal part so the thrown-away samples stay evenly distributed?
Thanks a lot..
The terms you are looking for are resampling, downsampling and decimation. Note that in the general case you can't just throw away a subset of your data without risking aliasing. You need to low pass filter your data first, prior to decimation, so that there is no information above your new Nyquist rate which would be aliased.
When you want to downsample by a non-integer value, e.g. 2.333 as per your example above you would normally do this by upsampling by an integer factor M and then downsampling by a different integer factor N, where the fraction M/N gives you the required resampling factor. In your example M = 3 and N = 7, so you would upsample by a factor of 3 and then downsample by a factor of 7.
You seem to be talking about sampling rates and digital signal processing.
Before you reduce, you normally filter the data to make sure high frequencies in your sample are not aliased to lower frequencies. For instance, in your example (take every fourth value), a frequency that repeats every four samples will alias to the "DC" or zero-cycle frequency (for example "234123412341", taking the first of every grouping, gives "2,2,2,2"), which might not be what you want. (A 3-sample cycle would also alias to a cycle like itself: "231231231231" => "231...".) Filtering is a little beyond what I would like to discuss right now, as it's a pretty advanced topic.
If you can represent your "2.333" as some sort of fraction -- let's see, that's 7/3. You were talking about taking 1 out of every 4 samples (1/4), so here you would be taking 3 out of every 7 samples. So you might (take, drop, take, drop, take, drop, drop), but there might be other methods.
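Here's a quick Java sketch of that take/drop idea (my own illustration): a Bresenham-style accumulator spreads the kept samples as evenly as possible over each group of n:

// Build a take/drop pattern that keeps m out of every n samples
// (e.g. m = 3, n = 7 for the 2.333 factor above).
static boolean[] takeDropPattern(int m, int n) {
    boolean[] take = new boolean[n];
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += m;
        if (acc >= n) {   // the accumulator rolled over: keep this sample
            acc -= n;
            take[i] = true;
        }
    }
    return take;
}

For m = 3, n = 7 this yields (drop, drop, take, drop, take, drop, take) -- the same 3-of-7 selection, just phased differently.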
For audio data that you want to sound decent (as opposed to aliased and distorted in the frequency domain), see Paul R.'s answer involving resampling. One method of resampling is interpolation, such as using a windowed-Sinc interpolation kernel which will properly low-pass filter the data as well as allow creating interpolated intermediate values.
For non-sampled and non-audio data, where you just want to throw away some samples in a close-to-evenly distributed manner, and don't care about adding frequency domain noise and distortion, something like this might work:
float myRatio = (float)(N-1) / (float)(M-1); // check to make sure M > 1 beforehand
for (int i = 0; i < M; i++) {
    int j = (int)roundf(myRatio * (float)i); // nearest bin decimation
    myNewArrayLengthM[i] = myOldArrayLengthN[j];
}

Benford's Law in Java - how to make a math function into Java

I have a quick question. I am trying to make a fraud detection app in Java; the app will be primarily based on Benford's law. Benford's law is super cool: it basically says that in real financial transactions the first digit is commonly a 1, 2, or 3, and very rarely an 8 or 9. I haven't been able to get the Benford formula translated into code that can be run in Java.
http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.
I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.
Thanks so much!!
@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.
What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are used for comparing data to a known probability distribution, to determine whether the dataset is likely/unlikely to have that probability distribution.
Chi-squared is for discrete distributions, and K-S is for continuous.
For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N = 1, 2, ..., 9, iterate over the dataset to check the first digit, and count the number of samples for each of the 9 non-zero digits (or the first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].
For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:
E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] = 9.6910
E[5] = 7.9181
E[6] = 6.6946
E[7] = 5.7992
E[8] = 5.1152
E[9] = 4.5757
Then compute Χ2 = sum((H[k]-E[k])^2/E[k]) and compare it to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s = 0 and p = s+1 = 1, and the number of bins n is 9, so the number of degrees of freedom is n−p = 8*.) Then you go to your handy-dandy chi-squared table and see if the numbers look OK. For 8 degrees of freedom the confidence levels look like this:
Χ2 > 13.362: 10% chance the dataset still matches Benford's Law
Χ2 > 15.507: 5% chance the dataset still matches Benford's Law
Χ2 > 17.535: 2.5% chance the dataset still matches Benford's Law
Χ2 > 20.090: 1% chance the dataset still matches Benford's Law
Χ2 > 26.125: 0.1% chance the dataset still matches Benford's Law
Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ2 = 0.5585. That's very close to the expected distribution. (maybe even too close!)
Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ2 = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.
Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.
Looks like Apache Commons Math has a Java implementation of a chi-squared test:
ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)
*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.
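For example, here's a hedged sketch of that call using the commons-math3 version of the same API (org.apache.commons.math3.stat.inference.ChiSquareTest), with the numbers above:

import org.apache.commons.math3.stat.inference.ChiSquareTest;

// Expected counts E[1]..E[9] from Benford's Law for 100 samples (see above)
double[] expected = {30.1030, 17.6091, 12.4939, 9.6910, 7.9181, 6.6946, 5.7992, 5.1152, 4.5757};
// Observed first-digit histogram H
long[] observed = {29, 17, 12, 10, 8, 7, 6, 5, 6};

ChiSquareTest test = new ChiSquareTest();
double chi2 = test.chiSquare(expected, observed);       // ~0.5585 for this H
double pValue = test.chiSquareTest(expected, observed); // p-value for the fit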
Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:
You compute the cumulative distribution function (CDF) for your probability distribution.
You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
You find D = (approximately) the maximum vertical distance between the two curves.
Let's handle these more in depth for Benford's Law.
CDF for Benford's Law: this is just C = log10 x, where x is in the interval [1, 10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law and, instead of writing it as log(1+1/n), write it as log(n+1) − log(n) -- in other words, to get the probability of each bin, it subtracts successive differences of log(n), so log(n) must be the CDF.
ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that doesn't seem to lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the fraction of datapoints <= x for any valid x.
Calculate the maximum difference D = max(d[k]), where d[k] = max(CDF(y[k]) − (k−1)/N, k/N − CDF(y[k])).
Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then the ECDF is represented by the sorted array of mantissas [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:
D = max(
log10(1.99) - 0/5, 1/5 - log10(1.99),
log10(2.83) - 1/5, 2/5 - log10(2.83),
log10(3.02) - 2/5, 3/5 - log10(3.02),
log10(4.70) - 3/5, 4/5 - log10(4.70),
log10(6.10) - 4/5, 5/5 - log10(6.10)
)
which = 0.2988 (=log10(1.99) - 0).
Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.
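Here's a small Java sketch of steps 1-3 (my own illustration of the procedure described above; note that it doesn't handle zeros):

// Compute the K-S statistic D for Benford's Law from a dataset.
static double benfordKsStatistic(double[] data) {
    int n = data.length;
    double[] mantissa = new double[n];
    for (int i = 0; i < n; i++) {
        double x = Math.abs(data[i]);             // make the sign positive
        double exponent = Math.floor(Math.log10(x));
        mantissa[i] = x / Math.pow(10, exponent); // set the exponent to 0 => [1, 10)
    }
    java.util.Arrays.sort(mantissa);              // the ECDF is the sorted mantissas
    double d = 0;
    for (int k = 1; k <= n; k++) {
        double cdf = Math.log10(mantissa[k - 1]); // Benford CDF at y[k]
        d = Math.max(d, Math.max(cdf - (k - 1.0) / n, (double) k / n - cdf));
    }
    return d;
}

For the example dataset [3.02, 1.99, 28.3, 47, 0.61] this returns 0.2988, matching the calculation above.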
If I understand correctly, you want the Benford formula in Java syntax?
public static double probability(int i) {
    // P(first digit = i) = log10(1 + 1/i)
    return Math.log10(1 + 1 / (double) i);
}
No import is needed, by the way: java.lang.Math is always available without an explicit import.
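As a quick sanity check (my own addition), the nine first-digit probabilities should sum to 1, and probability(1) should be about 0.3010, matching E[1] above:

for (int i = 1; i <= 9; i++) {
    System.out.printf("P(first digit = %d) = %.4f%n", i, probability(i));
}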
I find it suspicious no one answered this yet.... >_>
I think what you are looking for is something like this:
// Generalized Benford: sum log10(1 + 1/(10*i + digit)) over the prefix range
double answer = 0;
for (int i = (int) Math.pow(10, position - 1); i <= Math.pow(10, position) - 1; i++) {
    answer += Math.log(1 + (1 / (i * 10 + (double) digit)));
}
answer *= 1 / Math.log(10); // convert the natural logs to base 10

How to generate MFCC Algorithm's triangular windows and how to use them?

I am implementing the MFCC algorithm in Java.
There is sample code here: http://www.ee.columbia.edu/~dpwe/muscontent/practical/mfcc.m (in Matlab). However I have some problems with the mel filter banking process. How do I generate the triangular windows and how do I use them?
PS1: An article which has a part that describes MFCC: http://arxiv.org/pdf/1003.4083
PS2: If there is a document that covers the steps of the MFCC algorithm at a basic level, that would be good.
PS3: My main question is related to this: MFCC with Java Linear and Logarithmic Filters. Some implementations use both linear and logarithmic filters and some of them don't. What are those filters, and what is the center frequency concept? I follow this code: MFCC Java. What is the difference between it and this code: MFCC Matlab?
Triangular windows as frequency band filters aren't hard to implement. You basically want to integrate the FFT data within each band (defined as the frequency space between center frequency i-1 and center frequency i+1).
You're basically looking for something like,
// Assumes centerFreqs holds numBands + 2 FFT bin indexes (the band edges plus
// the centers), so band bandIdx rises from centerFreqs[bandIdx] to
// centerFreqs[bandIdx + 1] and falls back to zero at centerFreqs[bandIdx + 2].
for (int bandIdx = 0; bandIdx < numBands; bandIdx++) {
    int startFreqIdx = centerFreqs[bandIdx];
    int centerFreqIdx = centerFreqs[bandIdx + 1];
    int stopFreqIdx = centerFreqs[bandIdx + 2];
    for (int freq = startFreqIdx; freq < centerFreqIdx; freq++) {
        double magnitudeScale = centerFreqIdx - startFreqIdx;   // rising slope
        bandData[bandIdx] += fftData[freq] * (freq - startFreqIdx) / magnitudeScale;
    }
    for (int freq = centerFreqIdx; freq <= stopFreqIdx; freq++) {
        double magnitudeScale = centerFreqIdx - stopFreqIdx;    // falling slope
        bandData[bandIdx] += fftData[freq] * (freq - stopFreqIdx) / magnitudeScale;
    }
}
If you do not understand the concept of a "center frequency" or a "band" or a "filter," pick up an elementary signals textbook--you shouldn't be implementing this algorithm without understanding what it does.
As for what the exact center frequencies are, it's up to you. Experiment and pick (or find in publications) values that capture the information you want to isolate from the data. The reason that there are no definitive values, or even scale for values, is because this algorithm tries to approximate a human ear, which is a very complicated listening device. Whereas one scale may work better for, say, speech, another may work better for music, etc. It's up to you to choose what is appropriate.
Answer for the second PS: I found this tutorial that really helped me computing the MFCCs.
As for the triangular windows and the filterbanks, from what I understood, they do overlap, they do not extend to negative frequencies, and the whole process of computing them from the FFT spectrum and applying them back to it goes something like this:
Choose a minimum and a maximum frequency for the filters (for example, min freq = 300Hz - the minimum voice frequency - and max freq = your sample rate / 2. Maybe this is where you should choose the 1000Hz limit you were talking about.)
Compute the mel values from the min and max chosen frequencies. Formula here.
Compute N equally distanced values between these two mel values. (I've seen examples of different values for N; you can even find an efficiency comparison for different values in this work. For my tests I picked 26.)
Convert these values back to Hz (you can find the formula on the same wiki page) => array of N + 2 filter values (see the sketch of steps 1-4 after this list).
Compute a filterbank (filter triangle) for each three consecutive values, either as Thomas suggested above (being careful with the indexes) or as in the tutorial recommended at the beginning of this post => an array of arrays, size NxM, assuming your FFT returned 2*M values and you only use M.
Pass the whole power spectrum (M values obtained from the FFT) through each triangular filter to get a "filterbank energy" for each filter (for each filterbank (N loop), multiply each magnitude obtained after the FFT by each value in the corresponding filterbank (M loop) and add up the M obtained values) => N-sized array of energies.
These are your filterbank energies, to which you can further apply a log, apply the DCT, and extract the MFCCs...
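Here's a rough Java sketch of steps 1-4 (my own illustration; the 2595/700 constants are the usual mel-scale formula from the wiki page mentioned above, and the parameters are just the example choices from the steps):

// Returns N + 2 mel-spaced filter frequencies in Hz between minFreq and maxFreq.
static double[] melFilterFreqs(double minFreq, double maxFreq, int n) {
    double melMin = 2595.0 * Math.log10(1.0 + minFreq / 700.0);
    double melMax = 2595.0 * Math.log10(1.0 + maxFreq / 700.0);
    double[] freqs = new double[n + 2];
    for (int i = 0; i < n + 2; i++) {
        double mel = melMin + (melMax - melMin) * i / (n + 1);    // equally spaced in mels
        freqs[i] = 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); // converted back to Hz
    }
    return freqs;
}

For example, melFilterFreqs(300, sampleRate / 2.0, 26) gives the array of 26 + 2 filter values from step 4; consecutive triples of these values (converted to FFT bin indexes) define the triangular filters.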

Is it possible to get k-th element of m-character-length combination in O(1)?

Do you know any way to get k-th element of m-element combination in O(1)? Expected solution should work for any size of input data and any m value.
Let me explain this problem by example (python code):
>>> import itertools
>>> data = ['a', 'b', 'c', 'd']
>>> k = 2
>>> m = 3
>>> result = [''.join(el) for el in itertools.combinations(data, m)]
>>> print result
['abc', 'abd', 'acd', 'bcd']
>>> print result[k-1]
abd
For the given data, the k-th (2nd in this example) element of the m-element combinations is abd. Is it possible to get that value (abd) without creating the whole combination list?
I'm asking because I have data of ~1,000,000 characters and it is impossible to create the full m-character-length combination list to get the k-th element.
The solution can be pseudo code, or a link the page describing this problem (unfortunately, I didn't find one).
Thanks!
http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Basically, express the index in the factorial number system, and use its digits as a selection from the original sequence (without replacement).
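The question asks about combinations rather than permutations, though; the analogous idea there is the combinatorial number system. Here's a hedged Java sketch (my own illustration, not from the linked article) of unranking the k-th (0-based) m-combination in lexicographic order:

// Binomial coefficient C(n, k) via the multiplicative formula.
static long binomial(int n, int k) {
    if (k < 0 || k > n) return 0;
    long r = 1;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
    return r;
}

// Unrank: the k-th (0-based) m-combination of elems, lexicographic order.
static <T> java.util.List<T> kthCombination(java.util.List<T> elems, int m, long k) {
    java.util.List<T> result = new java.util.ArrayList<>();
    int i = 0;
    while (m > 0) {
        long withThisFirst = binomial(elems.size() - i - 1, m - 1); // combos starting at elems[i]
        if (k < withThisFirst) {
            result.add(elems.get(i)); // elems[i] is the next element of the answer
            m--;
        } else {
            k -= withThisFirst;       // skip all combos starting at elems[i]
        }
        i++;
    }
    return result;
}

kthCombination(List.of("a", "b", "c", "d"), 3, 1) returns [a, b, d], matching the abd from the example, and does only O(n) work instead of building all C(n, m) combinations.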
Not necessarily O(1), but the following should be very fast:
Take the original combinations algorithm:
def combinations(elems, m):
    # The k-th element depends on what order you use for
    # the combinations. Assuming lexicographic order, something like this...
    if m == 0:
        return [[]]
    combs = []
    for i, e in enumerate(elems):
        for rest in combinations(elems[i+1:], m - 1):
            combs.append([e] + rest)
    return combs
For n initial elements and m combination length, we have n!/(n-m)!m! total combinations. We can use this fact to skip directly to our desired combination:
def kth_comb(elems, m, k):
    # High level pseudo code, k is 0-based
    # Walks the same lexicographic order as above
    if m == 0:
        return []
    for i, x in enumerate(elems):
        combs_per_set = ncombs(len(elems) - i - 1, m - 1)  # combinations starting with x
        if k < combs_per_set:
            return [x] + kth_comb(elems[i+1:], m - 1, k)
        k -= combs_per_set
First calculate r = n!/(m!(n-m)!) with n the number of elements.
Then floor(r/k) is the index of the first element in the result;
remove it (shift everything following to the left),
do m--, n-- and k = r%k,
and repeat until m is 0 (hint: when k is 0, just copy the following chars to the result).
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem appears to fall under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it too is faster than other published techniques.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
It should not be hard to convert this class to Java, Python, or C++.
