is there a faster way to search through cumulative distribution? - java

I have a List<Double> that holds probabilities (weights) for sampling an item. For example, the List holds 5 values as follows.
0.1, 0.4, 0.2, 0.1, 0.2
Each i-th Double value is the probability of sampling the i-th item of another List<Object>.
How can I construct an algorithm to perform the sampling according to these probabilities?
I tried something like this, where I first made the list of probabilities into a cumulative form.
0.1, 0.5, 0.7, 0.8, 1.0
Then my approach is as follows. I generate a random double, and iterate over the list to find the first item that is larger than the random double, and then return its index.
Random r = new Random();
double p = r.nextDouble();
int total = list.size();
for (int i = 0; i < total; i++) {
    double d = list.get(i);
    if (d > p) {
        return i;
    }
}
return total - 1;
This approach is slow as I am crawling through the list sequentially. In reality, my list is of 800,000 items associated with weights (probabilities) that I need to sample from. So, needless to say, this sequential approach is slow.
I'm not sure how binary search can help. Let's say I generated p = 0.01. Then, a binary search can use recursion as follows with the list.
compare 0.01 to 0.7, repeat with L = 0.1, 0.5
compare 0.01 to 0.1, stop
compare 0.01 to 0.5, stop
0.01 is smaller than 0.7, 0.5, and 0.1, but I obviously only want 0.1. So the stopping criteria is still not clear to me when using binary search.
If there's a library to help with this type of thing I'd also be interested.

Here is how you could do it using binary search, starting with the cumulative probabilities:
public static void main(String[] args) {
    double[] cdf = {0.1, 0.5, 0.7, 0.8, 1.0};
    double random = 0.75; // generate randomly between zero and one
    int el = Arrays.binarySearch(cdf, random);
    if (el < 0) {
        el = -(el + 1);
    }
    System.out.println(el);
}
P.S. When the list of probabilities is short, a simple linear scan might turn out to be as efficient as binary search.

This isn't the most memory-efficient approach, but use a NavigableMap whose keys are each item's cumulative starting value. Then you can just use floorEntry(random.nextDouble()). Like the binary search, it's O(log n) lookup time with O(n) memory.
So...
NavigableMap<Double, Object> pdf = new TreeMap<>();
pdf.put(0.0, "foo");
pdf.put(0.1, "bar");
pdf.put(0.5, "baz");
pdf.put(0.7, "quz");
pdf.put(0.8, "quuz");
Random random = new Random();
pdf.floorEntry(random.nextDouble()).getValue();
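If the weights come in as parallel lists, the map can be built with a running total; a sketch, where items and weights are assumed inputs (each key is the point where that item's range starts):
NavigableMap<Double, Object> pdf = new TreeMap<>();
double cumulative = 0.0;
for (int i = 0; i < items.size(); i++) {
    pdf.put(cumulative, items.get(i)); // key = start of this item's range
    cumulative += weights.get(i);
}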

Related

Best way to generate a List<Double> sequence of values given start, end, and step?

I'm actually very surprised I was unable to find the answer to this here, though maybe I'm just using the wrong search terms or something. Closest I could find is this, but they ask about generating a specific range of doubles with a specific step size, and the answers treat it as such. I need something that will generate the numbers with arbitrary start, end and step size.
I figure there has to be some method like this in a library somewhere already, but if so I wasn't able to find it easily (again, maybe I'm just using the wrong search terms or something). So here's what I've cooked up on my own in the last few minutes to do this:
import java.lang.Math;
import java.util.List;
import java.util.ArrayList;

public class DoubleSequenceGenerator {
    /**
     * Generates a List of Double values beginning with `start` and ending with
     * the last step from `start` which includes the provided `end` value.
     **/
    public static List<Double> generateSequence(double start, double end, double step) {
        Double numValues = (end - start) / step + 1.0;
        List<Double> sequence = new ArrayList<Double>(numValues.intValue());
        sequence.add(start);
        for (int i = 1; i < numValues; i++) {
            sequence.add(start + step * i);
        }
        return sequence;
    }

    /**
     * Generates a List of Double values beginning with `start` and ending with
     * the last step from `start` which includes the provided `end` value.
     *
     * Each number in the sequence is rounded to the precision of the `step`
     * value. For instance, if step=0.025, values will round to the nearest
     * thousandth value (0.001).
     **/
    public static List<Double> generateSequenceRounded(double start, double end, double step) {
        if (step != Math.floor(step)) {
            Double numValues = (end - start) / step + 1.0;
            List<Double> sequence = new ArrayList<Double>(numValues.intValue());
            double fraction = step - Math.floor(step);
            double mult = 10;
            while (mult * fraction < 1.0) {
                mult *= 10;
            }
            sequence.add(start);
            for (int i = 1; i < numValues; i++) {
                sequence.add(Math.round(mult * (start + step * i)) / mult);
            }
            return sequence;
        }
        return generateSequence(start, end, step);
    }
}
These methods run a simple loop multiplying the step by the sequence index and adding to the start offset. This mitigates compounding floating-point errors which would occur with continuous incrementation (such as adding the step to a variable on each iteration).
I added the generateSequenceRounded method for those cases where a fractional step size can cause noticeable floating-point errors. It does require a bit more arithmetic, so in extremely performance sensitive situations such as ours, it's nice to have the option of using the simpler method when the rounding is unnecessary. I suspect that in most general use cases the rounding overhead would be negligible.
Note that I intentionally excluded logic for handling "abnormal" arguments such as Infinity, NaN, start > end, or a negative step size for simplicity and desire to focus on the question at hand.
Here's some example usage and corresponding output:
System.out.println(DoubleSequenceGenerator.generateSequence(0.0, 2.0, 0.2));
System.out.println(DoubleSequenceGenerator.generateSequenceRounded(0.0, 2.0, 0.2));
System.out.println(DoubleSequenceGenerator.generateSequence(0.0, 102.0, 10.2));
System.out.println(DoubleSequenceGenerator.generateSequenceRounded(0.0, 102.0, 10.2));
[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2000000000000002, 1.4000000000000001, 1.6, 1.8, 2.0]
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
[0.0, 10.2, 20.4, 30.599999999999998, 40.8, 51.0, 61.199999999999996, 71.39999999999999, 81.6, 91.8, 102.0]
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
Is there an existing library that provides this kind of functionality already?
If not, are there any issues with my approach?
Does anyone have a better approach to this?
Sequences can be easily generated using the Java 9+ Stream API.
The straightforward approach is to use DoubleStream:
public static List<Double> generateSequenceDoubleStream(double start, double end, double step) {
    return DoubleStream.iterate(start, d -> d <= end, d -> d + step)
            .boxed()
            .collect(toList());
}
On ranges with a large number of iterations, double precision error can accumulate, resulting in a bigger error closer to the end of the range.
The error can be minimised by switching to IntStream, using integers and a single double multiplier:
public static List<Double> generateSequenceIntStream(int start, int end, int step, double multiplier) {
    return IntStream.iterate(start, i -> i <= end, i -> i + step)
            .mapToDouble(i -> i * multiplier)
            .boxed()
            .collect(toList());
}
To get rid of double precision error entirely, BigDecimal can be used:
public static List<Double> generateSequenceBigDecimal(BigDecimal start, BigDecimal end, BigDecimal step) {
    return Stream.iterate(start, d -> d.compareTo(end) <= 0, d -> d.add(step))
            .mapToDouble(BigDecimal::doubleValue)
            .boxed()
            .collect(toList());
}
Examples:
public static void main(String[] args) {
    System.out.println(generateSequenceDoubleStream(0.0, 2.0, 0.2));
    //[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2, 1.4, 1.5999999999999999, 1.7999999999999998, 1.9999999999999998]

    System.out.println(generateSequenceIntStream(0, 20, 2, 0.1));
    //[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2000000000000002, 1.4000000000000001, 1.6, 1.8, 2.0]

    System.out.println(generateSequenceBigDecimal(new BigDecimal("0"), new BigDecimal("2"), new BigDecimal("0.2")));
    //[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
}
Method iterate with this signature (3 parameters) was added in Java 9. So, for Java 8 the code looks like
DoubleStream.iterate(start, d -> d + step)
        .limit((int) (1 + (end - start) / step))
        .boxed()
        .collect(toList());
Personally, I would shorten the DoubleSequenceGenerator class a bit to make room for other goodies, and use a single sequence generator method with an option to use whatever precision is desired, or no precision at all:
In the generator method below, if nothing (or any value less than 0) is supplied to the optional setPrecision parameter, then no decimal precision rounding is carried out. If 0 is supplied as the precision value, then numbers are rounded to their nearest whole number (i.e. 89.674 is rounded to 90.0). If a precision value greater than 0 is supplied, then values are converted to that decimal precision.
BigDecimal is used here for...well....precision:
import java.util.List;
import java.util.ArrayList;
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DoubleSequenceGenerator {

    public static List<Double> generateSequence(double start, double end,
                                                double step, int... setPrecision) {
        int precision = -1;
        if (setPrecision.length > 0) {
            precision = setPrecision[0];
        }
        List<Double> sequence = new ArrayList<>();
        for (double val = start; val < end; val += step) {
            if (precision > -1) {
                sequence.add(BigDecimal.valueOf(val).setScale(precision, RoundingMode.HALF_UP).doubleValue());
            } else {
                sequence.add(BigDecimal.valueOf(val).doubleValue());
            }
        }
        if (sequence.get(sequence.size() - 1) < end) {
            sequence.add(end);
        }
        return sequence;
    }

    // Other class goodies here ....
}
And in main():
System.out.println(generateSequence(0.0, 2.0, 0.2));
System.out.println(generateSequence(0.0, 2.0, 0.2, 0));
System.out.println(generateSequence(0.0, 2.0, 0.2, 1));
System.out.println();
System.out.println(generateSequence(0.0, 102.0, 10.2));
System.out.println(generateSequence(0.0, 102.0, 10.2, 0));
System.out.println(generateSequence(0.0, 102.0, 10.2, 1));
And the console displays:
[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2, 1.4, 1.5999999999999999, 1.7999999999999998, 1.9999999999999998, 2.0]
[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
[0.0, 10.2, 20.4, 30.599999999999998, 40.8, 51.0, 61.2, 71.4, 81.60000000000001, 91.80000000000001, 102.0]
[0.0, 10.0, 20.0, 31.0, 41.0, 51.0, 61.0, 71.0, 82.0, 92.0, 102.0]
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
Is there an existing library that provides this kind of functionality already?
Sorry, I don't know, but judging by other answers, and their relative simplicity - no, there isn't. No need. Well, almost...
If not, are there any issues with my approach?
Yes and no. You have at least one bug, and some room for performance boost, but the approach itself is correct.
Your bug: rounding error (just change while (mult*fraction < 1.0) to while (mult*fraction < 10.0) and that should fix it)
All the other answers' sequences do not reach the end... well, maybe their authors just weren't observant enough to read the comments in your code.
All the others are slower.
Just changing the condition in the main loop from int < Double to int < int will noticeably increase the speed of your code.
Does anyone have a better approach to this?
Hmm... In what way?
Simplicity? generateSequenceDoubleStream of @Evgeniy Khyst looks quite simple. And should be used... but maybe not, because of the next two points.
Precise? generateSequenceDoubleStream is not! But it can still be saved with the pattern start + step*i.
And the start + step*i pattern is precise. Only BigDecimal and fixed-point arithmetic can beat it. But BigDecimal is slow, and manual fixed-point arithmetic is tedious and may be inappropriate for your data.
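For illustration, a tiny sketch of what manual fixed-point arithmetic could look like here - my own example, assuming steps expressible in tenths:
// Represent values as scaled longs (tenths), so the loop is exact
// integer arithmetic; convert to double only at the end.
static List<Double> fixedPointSequence(long startTenths, long endTenths, long stepTenths) {
    List<Double> out = new ArrayList<>();
    for (long v = startTenths; v <= endTenths; v += stepTenths)
        out.add(v / 10.0);
    return out;
}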
By the way, on the matters of precision, you can entertain yourself with this: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Speed... well now we are on shaky grounds.
Check out this repl https://repl.it/repls/RespectfulSufficientWorker
I do not have a decent test stand right now, so I used repl.it... which is totally inadequate for performance testing, but that's not the main point. The point is - there is no definite answer. Except that maybe in your case, which is not totally clear from your question, you definitely should not use BigDecimal (read further).
I've tried to play and optimize for big inputs. Your original code, with some minor changes, is the fastest. But maybe you need enormous amounts of small Lists? Then that can be a totally different story.
This code is quite simple to my taste, and fast enough:
public static List<Double> genNoRoundDirectToDouble(double start, double end, double step) {
    int len = (int) Math.ceil((end - start) / step) + 1;
    var sequence = new ArrayList<Double>(len);
    sequence.add(start);
    for (int i = 1; i < len; ++i) sequence.add(start + step * i);
    return sequence;
}
If you prefer a more elegant way (or we should call it idiomatic), I, personally, would suggest:
public static List<Double> gen_DoubleStream_precise(double start, double end, double step) {
    return IntStream.range(0, (int) Math.ceil((end - start) / step) + 1)
            .mapToDouble(i -> start + i * step)
            .boxed()
            .collect(Collectors.toList());
}
Anyway, possible performance boosts are:
Try switching from Double to double, and if you really need them, you can switch back again; judging by the tests, it still may be faster. (But don't trust me, try it yourself with your data in your environment. As I said - repl.it sucks for benchmarks.)
A little magic: a separate loop for Math.round()... maybe it has something to do with data locality. I do not recommend this - the result is very unstable. But it's fun.
// (fragment: `len`, `start`, `step`, and `mult` as defined earlier)
double[] sequence = new double[len];
for (int i = 1; i < len; ++i) sequence[i] = start + step * i;
List<Double> list = new ArrayList<Double>(len);
list.add(start);
for (int i = 1; i < len; ++i) list.add(Math.round(mult * sequence[i]) / mult);
return list;
You should definitely consider being lazier and generating numbers on demand, without storing them in Lists. A sketch of that lazy idea follows below.
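Here is a possible sketch, reusing the precise start + step*i pattern (the method name is mine):
// Values are computed on demand as the stream is consumed; nothing is
// stored unless the caller chooses to collect.
static DoubleStream lazySequence(double start, double end, double step) {
    int len = (int) Math.ceil((end - start) / step) + 1;
    return IntStream.range(0, len).mapToDouble(i -> start + i * step);
}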
I suspect that in most general use cases the rounding overhead would be negligible.
If you suspect something - test it :-) My answer is "Yes", but again... don't believe me. Test it.
So, back to the main question: Is there a better way?
Yes, of course!
But it depends.
Choose BigDecimal if you need very big numbers and very small numbers. But if you cast them back to Double, and more than that, use them with numbers of "close" magnitude - there's no need for it! Check out the same repl: https://repl.it/repls/RespectfulSufficientWorker - the last test shows that there will be no difference in results, but a big loss in speed.
Make some micro-optimizations based on your data properties, your task, and your environment.
Prefer short and simple code if there is not too much to gain from a performance boost of 5-10%. Don't waste your time.
Maybe use fixed-point arithmetic if you can and if it's worth it.
Other than that, you are fine.
PS. There's also a Kahan Summation Formula implementation in the repl... just for fun. https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1346 and it works - you can mitigate summation errors
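For reference, a minimal sketch of that compensated-summation idea, in case the repl disappears:
// Kahan summation: `c` carries the low-order bits that plain addition
// would discard on each step.
static double kahanSum(double[] xs) {
    double sum = 0.0, c = 0.0;
    for (double x : xs) {
        double y = x - c;   // apply the running compensation
        double t = sum + y; // low-order bits of y are lost here...
        c = (t - sum) - y;  // ...and recovered algebraically
        sum = t;
    }
    return sum;
}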
Try this.
public static List<Double> generateSequenceRounded(double start, double end, double step) {
    long mult = (long) Math.pow(10, BigDecimal.valueOf(step).scale());
    return DoubleStream.iterate(start, d -> (double) Math.round(mult * (d + step)) / mult)
            .limit((long) (1 + (end - start) / step))
            .boxed()
            .collect(Collectors.toList());
}
Here,
int java.math.BigDecimal.scale()
Returns the scale of this BigDecimal. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. For example, a scale of -3 means the unscaled value is multiplied by 1000.
In main()
System.out.println(generateSequenceRounded(0.0, 102.0, 10.2));
System.out.println(generateSequenceRounded(0.0, 102.0, 10.24367));
And Output:
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
[0.0, 10.24367, 20.48734, 30.73101, 40.97468, 51.21835, 61.46202, 71.70569, 81.94936, 92.19303]

Weighted sampling with replacement in Java

Is there a function in Java, or in a library such as Apache Commons Math which is equivalent to the MATLAB function randsample?
More specifically, I want to find a function randSample which returns a vector of Independent and Identically Distributed random variables according to the probability distribution which I specify.
For example:
int[] a = randSample(new int[]{0, 1, 2}, 5, new double[]{0.2, 0.3, 0.5});
//        { 0  w.p. 0.2
// a[i] = { 1  w.p. 0.3
//        { 2  w.p. 0.5
The output is the same as the MATLAB code randsample([0 1 2], 5, true, [0.2 0.3 0.5]) where the true means sampling with replacement.
If such a function does not exist, how do I write one?
Note: I know that a similar question has been asked on Stack Overflow but unfortunately it has not been answered.
I'm pretty sure one doesn't exist, but it's pretty easy to make a function that would produce samples like that. First off, Java does come with a random number generator, java.util.Random, whose nextDouble() method produces random doubles between 0.0 and 1.0.
import java.util.Random;

Random rng = new Random();
double someRandomDouble = rng.nextDouble();
// This will be a uniformly distributed
// random variable between 0.0 and 1.0.
If you have sampling with replacement, you can convert the pdf you have as an input into a cdf, and then use the random doubles Java provides to create random data by seeing into which part of the cdf each one falls. So first you need to convert the pdf into a cdf.
static int[] randsample(int[] values, int numsamples,
                        boolean withReplacement, double[] pdf) {
    if (withReplacement) {
        // Accumulate the pdf into a cdf.
        double[] cdf = new double[pdf.length];
        cdf[0] = pdf[0];
        for (int i = 1; i < pdf.length; i++) {
            cdf[i] = cdf[i - 1] + pdf[i];
        }
Then you make the properly-sized array of ints to store the result and start finding the random results:
        Random rng = new Random();
        int[] results = new int[numsamples];
        for (int i = 0; i < numsamples; i++) {
            double randomValue = rng.nextDouble();
            int currentPosition = 0;
            // Check the bound first so we never index past the end.
            while (currentPosition < cdf.length && randomValue > cdf[currentPosition]) {
                currentPosition++; // Check the next one.
            }
            if (currentPosition < cdf.length) { // It worked!
                results[i] = values[currentPosition];
            } else { // It didn't work (roundoff)... let's fail gracefully I guess.
                results[i] = values[cdf.length - 1];
                // And assign it the last value.
            }
        }
        // Now we're done and can return the results!
        return results;
    } else { // Without replacement.
        throw new UnsupportedOperationException("This is unimplemented!");
    }
}
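A quick usage sketch mirroring the MATLAB call from the question (assuming the pdf sums to 1.0; the actual output varies per run):
int[] a = randsample(new int[]{0, 1, 2}, 5, true, new double[]{0.2, 0.3, 0.5});
System.out.println(Arrays.toString(a)); // e.g. [2, 0, 2, 1, 2]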
There's some error checking (make sure value array and pdf array are the same size) and some other features you can implement by overloading this to provide the other functions, but hopefully this is enough for you to start. Cheers!

Java - Taking character frequencies, creating probabilities, and then generating pseudo-random characters

I'm creating a pseudo-random text generator using a Markov model. Basically, I use a hash table to store lists of substrings of order k (the order of the Markov model), then for each substring I have a TreeMap of the suffixes with their frequencies throughout the substring.
I'm struggling with generating the random suffix. For each substring, I have a TreeMap containing all of the possible suffixes and their frequencies. I'm having trouble with using this to create a probability for each suffix, and then generating a pseudo-random suffix based on the probabilities.
Any help on the concept of this and how to go about doing this is appreciated. If you have any questions or need clarification, please let me know.
I'm not sure that a TreeMap is really the best data-structure for this, but . . .
You can use the Math.random() method to obtain a random value between 0.0 (inclusive) and 1.0 (exclusive). Then, iterate over the elements of your map, accumulating their frequencies, until you surpass that value. The suffix that first surpasses this value is your result. Assuming that your map-elements' frequencies all add up to 1.0, this will choose all suffixes in proportion to their frequencies.
For example:
import java.util.Map;
import java.util.TreeMap;

public class Demo
{
    private final Map<String, Double> suffixFrequencies =
        new TreeMap<String, Double>();

    private String getRandomSuffix()
    {
        final double value = Math.random();
        double accum = 0.0;
        for(final Map.Entry<String, Double> e : suffixFrequencies.entrySet())
        {
            accum += e.getValue();
            if(accum > value)
                return e.getKey();
        }
        throw new AssertionError(); // or something
    }

    public static void main(final String... args)
    {
        final Demo demo = new Demo();
        demo.suffixFrequencies.put("abc", 0.3); // value in [0.0, 0.3)
        demo.suffixFrequencies.put("def", 0.2); // value in [0.3, 0.5)
        demo.suffixFrequencies.put("ghi", 0.5); // value in [0.5, 1.0)
        // Print "abc" approximately three times, "def" approximately twice,
        // and "ghi" approximately five times:
        for(int i = 0; i < 10; ++i)
            System.out.println(demo.getRandomSuffix());
    }
}
Notes:
Due to roundoff error, the throw new AssertionError() probably actually will happen every so often, albeit very rarely. So I recommend that you replace that line with something that just always chooses the first element or last element or something.
If the frequencies don't all add up to 1.0, then you should add a pass at the beginning of getRandomSuffix() that determines the sum of all frequencies. You can then scale value accordingly.
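A sketch of that scaling pass, assuming the frequencies are arbitrary positive weights (getRandomSuffixScaled is my name, not part of the original Demo):
private String getRandomSuffixScaled()
{
    // First pass: sum all frequencies so the random threshold can be scaled.
    double total = 0.0;
    for(final double f : suffixFrequencies.values())
        total += f;
    final double value = Math.random() * total;
    double accum = 0.0;
    String last = null;
    for(final Map.Entry<String, Double> e : suffixFrequencies.entrySet())
    {
        accum += e.getValue();
        last = e.getKey();
        if(accum > value)
            return e.getKey();
    }
    return last; // on roundoff, fall back to the last element
}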

Using precomputed kernel in libsvm causes it to get stuck

We are two students who want to use a one-class SVM for detection of summary-worthy sentences in text documents. We have already implemented sentence similarity functions, which we have used for another algorithm. We would now like to use the same functions as kernels for a one-class SVM in libsvm for Java.
We are using the PRECOMPUTED enum for the kernel_type field in our svm_parameter (param). In the x field of our svm_problem (prob) we have the kernel matrix on the form:
0:i 1:K(xi,x1) ... L:K(xi,xL)
where K(x,y) is the kernel value for the similarity of x and y, L is the number of sentences to compare, and i is the current row index (0 to L-1).
The training of the kernel (svm.svm_train(prob, param)) sometimes seems to get "stuck" in what looks like an infinite loop.
Have we misunderstood how to use the PRECOMPUTED enum, or does the problem lie elsewhere?
We solved this problem
It turns out that the "series numbers" in the first column needs to go from 1 to L, not 0 to L-1, which was our initial numbering. We found this out by inspecting the source in svm.java:
double kernel_function(int i, int j)
{
    switch(kernel_type)
    {
        /* ... snip ...*/
        case svm_parameter.PRECOMPUTED:
            return x[i][(int)(x[j][0].value)].value;
        /* ... snip ...*/
    }
}
The reason for starting the numbering at 1 instead of 0, is that the first column of a row is used as column index when returning the value K(i,j).
Example
Consider this Java matrix:
double[][] K = new double[][] {
    { 1, 1.0, 0.1, 0.0, 0.2 },
    { 2, 0.5, 1.0, 0.1, 0.4 },
    { 3, 0.2, 0.3, 1.0, 0.7 },
    { 4, 0.6, 0.5, 0.5, 1.0 }
};
Now, libsvm needs the kernel value K(i,j) for say i=1 and j=3. The expression x[i][(int)(x[j][0].value)].value will break down to:
x[i] -> x[1] -> second row in K -> [2, 0.5, 1.0, 0.1, 0.4]
x[j][0] -> x[3][0] -> fourth row, first column -> 4
x[i][(int)(x[j][0].value)].value -> x[1][4] -> 0.4
This was a bit messy to realize at first, but changing the indexing solved our problem. Hopefully this might help someone else with similar problems.
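To make the fix concrete, here is a sketch (mine, not from the original post) of filling svm_problem.x for a precomputed kernel, where kernel(i, j) is an assumed helper returning K(x_i, x_j):
// L rows; each row is [serial number, K(i,1), ..., K(i,L)].
svm_node[][] x = new svm_node[L][];
for (int i = 0; i < L; i++) {
    x[i] = new svm_node[L + 1];
    x[i][0] = new svm_node();
    x[i][0].index = 0;
    x[i][0].value = i + 1; // serial numbers must run 1..L, not 0..L-1
    for (int j = 1; j <= L; j++) {
        x[i][j] = new svm_node();
        x[i][j].index = j;
        x[i][j].value = kernel(i, j - 1); // assumed helper: K(x_i, x_j)
    }
}
prob.x = x;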

Finding a mode with decreasing precision

I feel like there should be an available library to more simply do two things: A) find the mode of an array, in this case of doubles, and B) gracefully degrade the precision until you reach a particular frequency.
So imagine an array like this:
double[] a = {1.12, 1.15, 1.13, 2.0, 3.4, 3.44, 4.1, 4.2, 4.3, 4.4};
If I was looking for a frequency of 3 then it would go from 2 decimal positions to 1 decimal, and finally return 1.1 as my mode. If I had a frequency requirement of 4 it would return 4 as my mode.
I do have a set of code that is working the way I want, and returning what I am expecting, but I feel like there should be a more efficient way to accomplish this, or an existing library that would help me do the same. Attached is my code, I'd be interested in thoughts / comments on different approaches I should have taken....I have the iterations listed to limit how far the precision can degrade.
public static double findMode(double[] r, int frequencyReq)
{
    double mode = 0d;
    int frequency = 0;
    int iterations = 4;
    HashMap<Double, BigDecimal> counter = new HashMap<Double, BigDecimal>();
    while(frequency < frequencyReq && iterations > 0){
        String roundFormatString = "#.";
        for(int j=0; j<iterations; j++){
            roundFormatString += "#";
        }
        DecimalFormat roundFormat = new DecimalFormat(roundFormatString);
        for(int i=0; i<r.length; i++){
            double element = Double.valueOf(roundFormat.format(r[i]));
            if(!counter.containsKey(element))
                counter.put(element, new BigDecimal(0));
            counter.put(element, counter.get(element).add(new BigDecimal(1)));
        }
        for(Double key : counter.keySet()){
            if(counter.get(key).compareTo(new BigDecimal(frequency)) > 0){
                mode = key;
                frequency = counter.get(key).intValue();
                log.debug("key: " + key + " Count: " + counter.get(key));
            }
        }
        iterations--;
    }
    return mode;
}
Edit
Another way to rephrase the question, per Paulo's comment: the goal is to locate a number where in the neighborhood are at least frequency array elements, with the radius of the neighborhood being as small as possible.
Here a solution to the reformulated question:
The goal is to locate a number where in the neighborhood are at least frequency array elements, with the radius of the neighborhood being as small as possible.
(I took the liberty of switching the order of 1.15 and 1.13 in the input array.)
The basic idea is: We have the input already sorted (i.e. neighboring elements are consecutive), and we know how many elements we want in our neighborhood. So we loop once over this array, measuring the distance between the left element and the element frequency elements more to the right. Between them are frequency elements, so this forms a neighbourhood. Then we simply take the minimum such distance. (My method has a complicated way to return the results, you may want to do it better.)
This is not completely equivalent to your original question (does not work by fixed steps of digits), but maybe this is more what you really want :-)
You'll have to find a better way of formatting the results, though.
package de.fencing_game.paul.examples;

import java.util.Arrays;

/**
 * Searching of dense points in a distribution.
 *
 * Inspired by http://stackoverflow.com/questions/5329628/finding-a-mode-with-decreasing-precision.
 */
public class InpreciseMode {

    /** our input data, should be sorted ascending. */
    private double[] data;

    public InpreciseMode(double ... data) {
        this.data = data;
    }

    /**
     * Searches the smallest neighbourhood (by diameter) which
     * contains at least minSize elements.
     *
     * @return an array of two arrays:
     *    { { the middle point of the neighborhood,
     *        the diameter of the neighborhood },
     *      all the elements of the neighborhood }
     *
     * TODO: better return an object of a class encapsulating these.
     */
    public double[][] findSmallNeighbourhood(int minSize) {
        int currentLeft = -1;
        int currentRight = -1;
        double currentMinDiameter = Double.POSITIVE_INFINITY;
        for(int i = 0; i + minSize-1 < data.length; i++) {
            double diameter = data[i+minSize-1] - data[i];
            if(diameter < currentMinDiameter) {
                currentMinDiameter = diameter;
                currentLeft = i;
                currentRight = i + minSize-1;
            }
        }
        return
            new double[][] {
                {
                    (data[currentRight] + data[currentLeft])/2.0,
                    currentMinDiameter
                },
                Arrays.copyOfRange(data, currentLeft, currentRight+1)
            };
    }

    public void printSmallNeighbourhoods() {
        for(int frequency = 2; frequency <= data.length; frequency++) {
            double[][] found = findSmallNeighbourhood(frequency);
            System.out.printf("There are %d elements in %f radius "+
                              "around %f:%n   %s.%n",
                              frequency, found[0][1]/2, found[0][0],
                              Arrays.toString(found[1]));
        }
    }

    public static void main(String[] params) {
        InpreciseMode m =
            new InpreciseMode(1.12, 1.13, 1.15, 2.0, 3.4, 3.44, 4.1,
                              4.2, 4.3, 4.4);
        m.printSmallNeighbourhoods();
    }
}
The output is
There are 2 elements in 0,005000 radius around 1,125000:
[1.12, 1.13].
There are 3 elements in 0,015000 radius around 1,135000:
[1.12, 1.13, 1.15].
There are 4 elements in 0,150000 radius around 4,250000:
[4.1, 4.2, 4.3, 4.4].
There are 5 elements in 0,450000 radius around 3,850000:
[3.4, 3.44, 4.1, 4.2, 4.3].
There are 6 elements in 0,500000 radius around 3,900000:
[3.4, 3.44, 4.1, 4.2, 4.3, 4.4].
There are 7 elements in 1,200000 radius around 3,200000:
[2.0, 3.4, 3.44, 4.1, 4.2, 4.3, 4.4].
There are 8 elements in 1,540000 radius around 2,660000:
[1.12, 1.13, 1.15, 2.0, 3.4, 3.44, 4.1, 4.2].
There are 9 elements in 1,590000 radius around 2,710000:
[1.12, 1.13, 1.15, 2.0, 3.4, 3.44, 4.1, 4.2, 4.3].
There are 10 elements in 1,640000 radius around 2,760000:
[1.12, 1.13, 1.15, 2.0, 3.4, 3.44, 4.1, 4.2, 4.3, 4.4].
I think there's nothing wrong with your code, and I doubt that you will find a library that does something so specific. But if you still want an idea, here is another approach, more OOP-flavored and reusing the Java collections:
Create a class to represent numbers with different numbers of decimals. It would have something like VariableDecimal(double d, int ndecimals) as a constructor.
In that class, override the Object methods equals and hashCode. Your implementation of equals will test if two instances of VariableDecimal are the same, taking into account the value d and the number of decimals. hashCode can simply return d*10^ndecimals cast to an int.
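A minimal sketch of what that class might look like - rounding via a scaled long is my choice; the answer only specifies the contract:
class VariableDecimal {
    private final double d;
    private final int ndecimals;

    VariableDecimal(double d, int ndecimals) {
        this.d = d;
        this.ndecimals = ndecimals;
    }

    // Round d at the requested precision so equality is precision-aware.
    private long scaled() {
        return Math.round(d * Math.pow(10, ndecimals));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof VariableDecimal)) return false;
        VariableDecimal other = (VariableDecimal) o;
        return ndecimals == other.ndecimals && scaled() == other.scaled();
    }

    @Override
    public int hashCode() {
        return (int) scaled();
    }
}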
In your logic use HashMaps so that they reuse your object:
HashMap<VariableDecimal, AtomicInteger> counters = new HashMap<VariableDecimal, AtomicInteger>();
for (double d : a) {
    VariableDecimal vd = new VariableDecimal(d, ndecimals);
    if (counters.get(vd) == null)
        counters.put(vd, new AtomicInteger(0));
    counters.get(vd).incrementAndGet();
}
/* at the end of this loop counters should hold a map with frequencies of
   each double for the selected precision so that you can simply traverse and
   get the max */
This piece of code doesn't show the iteration to decrement the number of decimals, which is trivial.
