Using precomputed kernel in libsvm causes it to get stuck - java

We are two students who want to use a one-class SVM for detection of summary-worthy sentences in text documents. We have already implemented sentence similarity functions, which we have used for another algorithm. We would now like to use the same functions as kernels for a one-class SVM in libsvm for Java.
We are using the PRECOMPUTED enum for the kernel_type field in our svm_parameter (param). In the x field of our svm_problem (prob) we have the kernel matrix in the form:
0:i 1:K(xi,x1) ... L:K(xi,xL)
where K(x,y) is the kernel value for the similarity of x and y, L is the number of sentences to compare, and i is the current row's serial number.
The training (svm.svm_train(prob, param)) sometimes seems to get "stuck" in what looks like an infinite loop.
Have we misunderstood how to use the PRECOMPUTED enum, or does the problem lie elsewhere?

We solved this problem ourselves.
It turns out that the "serial numbers" in the first column need to go from 1 to L, not 0 to L-1, which was our initial numbering. We found this out by inspecting the source in svm.java:
double kernel_function(int i, int j)
{
    switch(kernel_type)
    {
        /* ... snip ... */
        case svm_parameter.PRECOMPUTED:
            return x[i][(int)(x[j][0].value)].value;
        /* ... snip ... */
    }
}
The reason for starting the numbering at 1 instead of 0 is that the first column of a row is used as a column index when returning the value K(i,j).
Example
Consider this Java matrix:
double[][] K = new double[][] {
    { 1, 1.0, 0.1, 0.0, 0.2 },
    { 2, 0.5, 1.0, 0.1, 0.4 },
    { 3, 0.2, 0.3, 1.0, 0.7 },
    { 4, 0.6, 0.5, 0.5, 1.0 }
};
Now, suppose libsvm needs the kernel value K(i,j) for, say, i=1 and j=3. The expression x[i][(int)(x[j][0].value)].value breaks down as:
x[i] -> x[1] -> second row of K -> [2, 0.5, 1.0, 0.1, 0.4]
x[j][0] -> x[3][0] -> fourth row, first column -> 4
x[i][(int)(x[j][0].value)].value -> x[1][4] -> 0.4
This was a bit messy to realize at first, but changing the indexing solved our problem. Hopefully this helps someone else with a similar issue.
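For anyone who wants to verify the indexing, the lookup can be reproduced with plain arrays outside of libsvm (a minimal sketch: the double[][] below stands in for the svm_node matrix, so the .value accessors are dropped):

```java
public class PrecomputedKernelLookup {
    public static void main(String[] args) {
        // First column holds each row's 1-based serial number.
        double[][] x = {
            { 1, 1.0, 0.1, 0.0, 0.2 },
            { 2, 0.5, 1.0, 0.1, 0.4 },
            { 3, 0.2, 0.3, 1.0, 0.7 },
            { 4, 0.6, 0.5, 0.5, 1.0 }
        };
        int i = 1, j = 3;
        // Mirrors svm.java: x[i][(int)(x[j][0].value)].value
        double kij = x[i][(int) x[j][0]];
        System.out.println(kij); // 0.4
    }
}
```

With 0-based serial numbers the same expression would read one column too far to the left, returning K(i, j-1) instead of K(i, j).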


Best way to generate a List<Double> sequence of values given start, end, and step?

I'm actually very surprised I was unable to find the answer to this here, though maybe I'm just using the wrong search terms. The closest I could find is this question, but it asks about generating a specific range of doubles with a specific step size, and the answers treat it as such. I need something that will generate the numbers with an arbitrary start, end, and step size.
I figure there has to be a method like this in a library somewhere already, but if so I wasn't able to find it easily. So here's what I've cooked up on my own in the last few minutes:
import java.util.ArrayList;
import java.util.List;

public class DoubleSequenceGenerator {
    /**
     * Generates a List of Double values beginning with `start` and ending with
     * the last step from `start` which includes the provided `end` value.
     **/
    public static List<Double> generateSequence(double start, double end, double step) {
        Double numValues = (end - start) / step + 1.0;
        List<Double> sequence = new ArrayList<Double>(numValues.intValue());
        sequence.add(start);
        for (int i = 1; i < numValues; i++) {
            sequence.add(start + step * i);
        }
        return sequence;
    }

    /**
     * Generates a List of Double values beginning with `start` and ending with
     * the last step from `start` which includes the provided `end` value.
     *
     * Each number in the sequence is rounded to the precision of the `step`
     * value. For instance, if step=0.025, values will round to the nearest
     * thousandth value (0.001).
     **/
    public static List<Double> generateSequenceRounded(double start, double end, double step) {
        if (step != Math.floor(step)) {
            Double numValues = (end - start) / step + 1.0;
            List<Double> sequence = new ArrayList<Double>(numValues.intValue());
            double fraction = step - Math.floor(step);
            double mult = 10;
            while (mult * fraction < 1.0) {
                mult *= 10;
            }
            sequence.add(start);
            for (int i = 1; i < numValues; i++) {
                sequence.add(Math.round(mult * (start + step * i)) / mult);
            }
            return sequence;
        }
        return generateSequence(start, end, step);
    }
}
These methods run a simple loop, multiplying the step by the sequence index and adding the start offset. This mitigates the compounding floating-point error that would occur with repeated incrementation (such as adding the step to a variable on each iteration).
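This claim is easy to check (a minimal sketch; the exact drift depends on the step value):

```java
public class DriftDemo {
    public static void main(String[] args) {
        double step = 0.1;

        // Repeated addition accumulates a small representation error on every step...
        double incremental = 0.0;
        for (int i = 0; i < 1000; i++) {
            incremental += step;
        }

        // ...while multiplying the step by the index incurs at most one rounding per value.
        double multiplied = step * 1000;

        System.out.println(incremental); // drifts slightly below 100.0
        System.out.println(multiplied);  // exactly 100.0
    }
}
```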
I added the generateSequenceRounded method for those cases where a fractional step size can cause noticeable floating-point errors. It does require a bit more arithmetic, so in extremely performance-sensitive situations such as ours, it's nice to have the option of using the simpler method when the rounding is unnecessary. I suspect that in most general use cases the rounding overhead would be negligible.
Note that I intentionally excluded logic for handling "abnormal" arguments such as Infinity, NaN, start > end, or a negative step size, for simplicity and a desire to focus on the question at hand.
Here's some example usage and corresponding output:
System.out.println(DoubleSequenceGenerator.generateSequence(0.0, 2.0, 0.2));
System.out.println(DoubleSequenceGenerator.generateSequenceRounded(0.0, 2.0, 0.2));
System.out.println(DoubleSequenceGenerator.generateSequence(0.0, 102.0, 10.2));
System.out.println(DoubleSequenceGenerator.generateSequenceRounded(0.0, 102.0, 10.2));
[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2000000000000002, 1.4000000000000001, 1.6, 1.8, 2.0]
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
[0.0, 10.2, 20.4, 30.599999999999998, 40.8, 51.0, 61.199999999999996, 71.39999999999999, 81.6, 91.8, 102.0]
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
Is there an existing library that provides this kind of functionality already?
If not, are there any issues with my approach?
Does anyone have a better approach to this?
Sequences can be generated easily using the Java 9+ Stream API.
The straightforward approach is to use DoubleStream:
public static List<Double> generateSequenceDoubleStream(double start, double end, double step) {
    return DoubleStream.iterate(start, d -> d <= end, d -> d + step)
            .boxed()
            .collect(toList());
}
On ranges with a large number of iterations, double precision error can accumulate, resulting in a bigger error closer to the end of the range.
The error can be minimised by switching to IntStream, using integers and a single double multiplier:
public static List<Double> generateSequenceIntStream(int start, int end, int step, double multiplier) {
    return IntStream.iterate(start, i -> i <= end, i -> i + step)
            .mapToDouble(i -> i * multiplier)
            .boxed()
            .collect(toList());
}
To eliminate double precision error entirely, BigDecimal can be used:
public static List<Double> generateSequenceBigDecimal(BigDecimal start, BigDecimal end, BigDecimal step) {
    return Stream.iterate(start, d -> d.compareTo(end) <= 0, d -> d.add(step))
            .mapToDouble(BigDecimal::doubleValue)
            .boxed()
            .collect(toList());
}
Examples:
public static void main(String[] args) {
System.out.println(generateSequenceDoubleStream(0.0, 2.0, 0.2));
//[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2, 1.4, 1.5999999999999999, 1.7999999999999998, 1.9999999999999998]
System.out.println(generateSequenceIntStream(0, 20, 2, 0.1));
//[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2000000000000002, 1.4000000000000001, 1.6, 1.8, 2.0]
System.out.println(generateSequenceBigDecimal(new BigDecimal("0"), new BigDecimal("2"), new BigDecimal("0.2")));
//[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
}
The iterate method with this signature (three parameters) was added in Java 9, so for Java 8 the code would look like:
DoubleStream.iterate(start, d -> d + step)
        .limit((long) (1 + (end - start) / step))
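Assuming the same start/end/step parameters as in the examples above, a complete Java 8 sketch of that fragment might look like this (class and method names are just for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.DoubleStream;

public class SequenceJava8 {
    public static List<Double> generateSequence(double start, double end, double step) {
        // Java 8 has only the unbounded single-argument iterate, so the
        // element count is fixed with limit() instead of a predicate.
        return DoubleStream.iterate(start, d -> d + step)
                .limit((long) (1 + (end - start) / step))
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Double> seq = generateSequence(0.0, 2.0, 0.2);
        System.out.println(seq.size()); // 11
    }
}
```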
Personally, I would shorten the DoubleSequenceGenerator class a bit to leave room for other goodies, and use only a single sequence generator method with an option for whatever decimal precision is desired, or no rounding at all.
In the generator method below, if nothing (or any value less than 0) is supplied for the optional setPrecision parameter, no decimal rounding is carried out. If 0 is supplied as the precision value, numbers are rounded to the nearest whole number (e.g. 89.674 is rounded to 90.0). If a precision value greater than 0 is supplied, values are rounded to that many decimal places.
BigDecimal is used here for...well....precision:
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;

public class DoubleSequenceGenerator {
    public static List<Double> generateSequence(double start, double end,
                                                double step, int... setPrecision) {
        int precision = -1;
        if (setPrecision.length > 0) {
            precision = setPrecision[0];
        }
        List<Double> sequence = new ArrayList<>();
        for (double val = start; val < end; val += step) {
            if (precision > -1) {
                sequence.add(BigDecimal.valueOf(val).setScale(precision, RoundingMode.HALF_UP).doubleValue());
            }
            else {
                sequence.add(BigDecimal.valueOf(val).doubleValue());
            }
        }
        if (sequence.get(sequence.size() - 1) < end) {
            sequence.add(end);
        }
        return sequence;
    }

    // Other class goodies here ....
}
And in main():
System.out.println(generateSequence(0.0, 2.0, 0.2));
System.out.println(generateSequence(0.0, 2.0, 0.2, 0));
System.out.println(generateSequence(0.0, 2.0, 0.2, 1));
System.out.println();
System.out.println(generateSequence(0.0, 102.0, 10.2));
System.out.println(generateSequence(0.0, 102.0, 10.2, 0));
System.out.println(generateSequence(0.0, 102.0, 10.2, 1));
And the console displays:
[0.0, 0.2, 0.4, 0.6000000000000001, 0.8, 1.0, 1.2, 1.4, 1.5999999999999999, 1.7999999999999998, 1.9999999999999998, 2.0]
[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
[0.0, 10.2, 20.4, 30.599999999999998, 40.8, 51.0, 61.2, 71.4, 81.60000000000001, 91.80000000000001, 102.0]
[0.0, 10.0, 20.0, 31.0, 41.0, 51.0, 61.0, 71.0, 82.0, 92.0, 102.0]
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
Is there an existing library that provides this kind of functionality already?
Sorry, I don't know, but judging by the other answers and their relative simplicity: no, there isn't. No need. Well, almost...
If not, are there any issues with my approach?
Yes and no. You have at least one bug and some room for a performance boost, but the approach itself is correct.
Your bug: a rounding error (just change while (mult*fraction < 1.0) to while (mult*fraction < 10.0) and that should fix it).
All the others do not reach the end of the range... well, maybe they just weren't observant enough to read the comments in your code.
All the others are slower.
Just changing the condition in the main loop from int < Double to int < int will noticeably increase the speed of your code.
Does anyone have a better approach to this?
Hmm... In what way?
Simplicity? generateSequenceDoubleStream by @Evgeniy Khyst looks quite simple. And should be used... but maybe not, because of the next two points.
Precision? generateSequenceDoubleStream is not precise! But it can still be saved with the start + step*i pattern.
And the start + step*i pattern is precise. Only BigDecimal and fixed-point arithmetic can beat it. But BigDecimal is slow, and manual fixed-point arithmetic is tedious and may be inappropriate for your data.
By the way, on the matters of precision, you can entertain yourself with this: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Speed... well, now we are on shaky ground.
Check out this repl https://repl.it/repls/RespectfulSufficientWorker
I do not have a decent test stand right now, so I used repl.it... which is totally inadequate for performance testing, but that's not the main point. The point is: there is no definite answer. Except that maybe in your case, which is not totally clear from your question, you definitely should not use BigDecimal (read further).
I've tried to play around and optimize for big inputs. Your original code, with some minor changes, is the fastest. But maybe you need enormous amounts of small Lists? Then that can be a totally different story.
This code is quite simple to my taste, and fast enough:
public static List<Double> genNoRoundDirectToDouble(double start, double end, double step) {
    int len = (int) Math.ceil((end - start) / step) + 1;
    var sequence = new ArrayList<Double>(len);
    sequence.add(start);
    for (int i = 1; i < len; ++i) sequence.add(start + step * i);
    return sequence;
}
If you prefer a more elegant way (or should we call it idiomatic), I, personally, would suggest:
public static List<Double> gen_DoubleStream_precise(double start, double end, double step) {
    return IntStream.range(0, (int) Math.ceil((end - start) / step) + 1)
            .mapToDouble(i -> start + i * step)
            .boxed()
            .collect(Collectors.toList());
}
Anyway, possible performance boosts are:
Try switching from Double to double; if you really need the boxed values, you can convert back again, and judging by the tests it may still be faster. (But don't trust me, try it yourself with your data in your environment. As I said, repl.it sucks for benchmarks.)
A little magic: a separate loop for Math.round()... maybe it has something to do with data locality. I do not recommend this; the result is very unstable. But it's fun:
double[] sequence = new double[len];
// mult is computed from the step's fractional part, as in the original generateSequenceRounded
for (int i = 1; i < len; ++i) sequence[i] = mult * (start + step * i);
List<Double> list = new ArrayList<Double>(len);
list.add(start);
for (int i = 1; i < len; ++i) list.add(Math.round(sequence[i]) / mult);
return list;
You should definitely consider being lazier and generating numbers on demand, without storing them in Lists.
I suspect that in most general use cases the rounding overhead would be negligible.
If you suspect something, test it :-) My answer is "Yes", but again... don't believe me. Test it.
So, back to the main question: Is there a better way?
Yes, of course!
But it depends.
Choose BigDecimal if you need very big and very small numbers. But if you cast them back to Double, and moreover use them with numbers of "close" magnitude, there is no need for it! Check out the same repl: https://repl.it/repls/RespectfulSufficientWorker - the last test shows that there will be no difference in results, but a big loss in speed.
Make some micro-optimizations based on your data properties, your task, and your environment.
Prefer short and simple code if there is not too much to gain from a performance boost of 5-10%. Don't waste your time.
Maybe use fixed-point arithmetic if you can and if it's worth it.
Other than that, you are fine.
PS. There's also a Kahan summation implementation in the repl... just for fun: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1346 It works, and you can mitigate summation errors with it.
Try this.
public static List<Double> generateSequenceRounded(double start, double end, double step) {
    long mult = (long) Math.pow(10, BigDecimal.valueOf(step).scale());
    return DoubleStream.iterate(start, d -> (double) Math.round(mult * (d + step)) / mult)
            .limit((long) (1 + (end - start) / step))
            .boxed()
            .collect(Collectors.toList());
}
Here,
int java.math.BigDecimal.scale()
Returns the scale of this BigDecimal. If zero or positive, the scale is the number of digits to the right of the decimal point. If negative, the unscaled value of the number is multiplied by ten to the power of the negation of the scale. For example, a scale of -3 means the unscaled value is multiplied by 1000.
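For instance, applied to the step values used below (a quick check; note that BigDecimal.valueOf(double) goes through Double.toString, which is what makes the scale match the step's visible digits):

```java
import java.math.BigDecimal;

public class ScaleDemo {
    public static void main(String[] args) {
        // Double.toString(10.2) is "10.2", so the scale is 1.
        System.out.println(BigDecimal.valueOf(10.2).scale());     // 1
        // Double.toString(10.24367) is "10.24367", so the scale is 5.
        System.out.println(BigDecimal.valueOf(10.24367).scale()); // 5
        // The multiplier derived from the scale:
        System.out.println((long) Math.pow(10, BigDecimal.valueOf(10.2).scale())); // 10
    }
}
```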
In main()
System.out.println(generateSequenceRounded(0.0, 102.0, 10.2));
System.out.println(generateSequenceRounded(0.0, 102.0, 10.24367));
And Output:
[0.0, 10.2, 20.4, 30.6, 40.8, 51.0, 61.2, 71.4, 81.6, 91.8, 102.0]
[0.0, 10.24367, 20.48734, 30.73101, 40.97468, 51.21835, 61.46202, 71.70569, 81.94936, 92.19303]

Constrain direction with linear programming (simplex)

I want to make a modular 2D spaceship game. The user can use building blocks to create the ship; blocks can be squares, but also other shapes that do not fit in a grid. To calculate which thrusters should fire for a given user input, I use ojAlgo to solve a linear programming problem. The user input will be something like [x: 1, y: 0, z: 0], where 1 is the maximum thrust that can be given in a direction and -1 the same to the other side. Z is rotation here.
I have two problems:
The first has to do with the way I calculate x and y separately:
final ExpressionsBasedModel tmpModel = new ExpressionsBasedModel();
final Expression expressionX = tmpModel.addExpression("x").weight(1); // This is the user input.
final Expression expressionY = tmpModel.addExpression("y").weight(0).level(0); // Level means lower == 0 and upper == 0.
final Expression expressionZ = tmpModel.addExpression("z").weight(0).level(0);
for (int i = 0; i < size; i++) {
    ThrusterBlock thrusterBlock = blockStash.thrusterList.get(i);
    final Variable thrusterVar = Variable.make("th_" + i);
    thrusterVar.lower(0).upper(1);
    tmpModel.addVariable(thrusterVar);
    expressionX.setLinearFactor(thrusterVar, thrusterBlock.displacement.x);
    expressionY.setLinearFactor(thrusterVar, thrusterBlock.displacement.y);
    expressionZ.setLinearFactor(thrusterVar, thrusterBlock.displacement.z);
    // The thruster displacement is the calculated thrust in the given direction.
}
Optimisation.Result tmpResult = tmpModel.maximise(); // To get the maximum weight (& thrust) out.
Optimisation.Result tmpResult = tmpModel.maximise(); //To get the maximum weight (& thrust) out.
This works because I can set the weight to > 0 or < 0, and it will return a value between 0 and 1 for each thruster. With a layout of 9 thrusters this could be an output:
OPTIMAL 1.203 # [1.0, 1.0, 0.0, 0.0, 0.32, 0.32, 1.0, 1.0, 1.0]
[
0th [x: 0.401, y: 0.0 , z: -0.022]
1th [x: 0.401, y: 0.0 , z: -0.022]
2th [x: -0.401, y: 0.0 , z: -0.027]
3th [x: -0.401, y: 0.0 , z: -0.027]
4th [x: 0.0, y: 0.401 , z: 0.025]
5th [x: 0.0, y: -0.401 , z: 0.025]
6th [x: 0.0, y: -0.401 , z: 0.025]
7th [x: 0.0, y: 0.401 , z: 0.025]
8th [x: 0.401, y: 0.0 , z: -0.022]
]
The problem is that I can't use this system to move diagonally (e.g. at 40°), because
final Expression expressionX = tmpModel.addExpression("x").weight(1);
final Expression expressionY = tmpModel.addExpression("y").weight(1);
won't constrain the equations to a 1:1 x:y ratio, but will simply find the most rewarding solution. I can't use level here because I don't know the maximum thrust for both.
What would be the best way to constrain x and y so that I can use degrees?
The second question:
When I want the ship to turn, it turns about 45° and then turns back, while the control system shows the same thrusters firing. I am sure there is a problem with the way I calculate the force, but my experience with angle calculations is limited. I think the mistake is here:
public Vector2 getForce(float power, boolean relative) {
    float force = maxThrust * power;
    float angle = toRad(this.relativePosition.z) + (relative ? 0 : this.blockCluster.getPos().z);
    // relativePosition.z is the orientation of the block in degrees;
    // this.blockCluster.getPos().z is the ship rotation in radians.
    temp.x = force * ((float) Math.cos(angle)) * -1;
    temp.y = force * ((float) Math.sin(angle));
    return temp;
}
I think it has something to do with how cosine and sine behave beyond 90°, but I am not sure.
Lastly, any optimisation would be greatly appreciated, since this might run for multiple frames in a row. I will cache the results of the calculation if it turns out to be on the slow side.
Thanks for your time!
PS: Ported from 'Game Development' because the implementation is more general than the use case.

Liblinear usage format

I am using the .NET implementation of liblinear in my C# code via the following NuGet package:
https://www.nuget.org/packages/Liblinear/
But in the readme file of liblinear, the format for x is:
struct problem describes the problem:
struct problem
{
int l, n;
int *y;
struct feature_node **x;
double bias;
};
where `l` is the number of training data. If bias >= 0, we assume
that one additional feature is added to the end of each data
instance. `n` is the number of feature (including the bias feature
if bias >= 0). `y` is an array containing the target values. (integers
in classification, real numbers in regression) And `x` is an array
of pointers, each of which points to a sparse representation (array
of feature_node) of one training vector.
For example, if we have the following training data:
LABEL ATTR1 ATTR2 ATTR3 ATTR4 ATTR5
----- ----- ----- ----- ----- -----
1 0 0.1 0.2 0 0
2 0 0.1 0.3 -1.2 0
1 0.4 0 0 0 0
2 0 0.1 0 1.4 0.5
3 -0.1 -0.2 0.1 1.1 0.1
and bias = 1, then the components of problem are:
l = 5
n = 6
y -> 1 2 1 2 3
x -> [ ] -> (2,0.1) (3,0.2) (6,1) (-1,?)
[ ] -> (2,0.1) (3,0.3) (4,-1.2) (6,1) (-1,?)
[ ] -> (1,0.4) (6,1) (-1,?)
[ ] -> (2,0.1) (4,1.4) (5,0.5) (6,1) (-1,?)
[ ] -> (1,-0.1) (2,-0.2) (3,0.1) (4,1.1) (5,0.1) (6,1) (-1,?)
But in the following example (note that the code there is actually F#, not Java):
https://gist.github.com/hodzanassredin/6682771
problem.x <- [|
[|new FeatureNode(1,0.); new FeatureNode(2,1.)|]
[|new FeatureNode(1,2.); new FeatureNode(2,0.)|]
|]// feature nodes
problem.y <- [|1.;2.|] // target values
which means the data set is:
1 0 1
2 2 0
So the nodes are not being stored in liblinear's sparse format. Does anyone know the correct format of x for this liblinear implementation?
Though it doesn't address exactly the library you mentioned, I can offer you an alternative. The Accord.NET Framework has recently incorporated all of LIBLINEAR's algorithms in its machine learning namespaces. It is also available through NuGet.
In this library, the direct syntax to create a linear support vector machine from in-memory data is:
// Create a simple binary AND
// classification problem:
double[][] problem =
{
    //             a  b  a + b
    new double[] { 0, 0, 0 },
    new double[] { 0, 1, 0 },
    new double[] { 1, 0, 0 },
    new double[] { 1, 1, 1 },
};

// Get the first two columns as the problem
// inputs and the last column as the output:

// input columns
double[][] inputs = problem.GetColumns(0, 1);

// output column
int[] outputs = problem.GetColumn(2).ToInt32();

// However, SVMs expect the output value to be
// either -1 or +1. As such, we have to convert
// it so the vector contains { -1, -1, -1, +1 }:
outputs = outputs.Apply(x => x == 0 ? -1 : 1);
After the problem is created, one can learn a linear SVM using:
// Create a new linear-SVM for two inputs (a and b)
SupportVectorMachine svm = new SupportVectorMachine(inputs: 2);

// Create a L2-regularized L2-loss support vector classification
var teacher = new LinearDualCoordinateDescent(svm, inputs, outputs)
{
    Loss = Loss.L2,
    Complexity = 1000,
    Tolerance = 1e-5
};

// Learn the machine
double error = teacher.Run(computeError: true);

// Compute the machine's answers for the learned inputs
int[] answers = inputs.Apply(x => Math.Sign(svm.Compute(x)));
This assumes, however, that your data is already in memory. If you wish to load your data from disk, from a file in libsvm sparse format, you can use the framework's SparseReader class.
An example of how to use it can be found below:
// Suppose we are going to read a sparse sample file containing
// samples which have an actual dimension of 4. Since the samples
// are in a sparse format, each entry in the file will probably
// have a much smaller number of elements.
//
int sampleSize = 4;
// Create a new Sparse Sample Reader to read any given file,
// passing the correct dense sample size in the constructor
//
SparseReader reader = new SparseReader(file, Encoding.Default, sampleSize);
// Declare a vector to obtain the label
// of each of the samples in the file
//
int[] labels = null;
// Declare a vector to obtain the description (or comments)
// about each of the samples in the file, if present.
//
string[] descriptions = null;
// Read the sparse samples and store them in a dense vector array
double[][] samples = reader.ReadToEnd(out labels, out descriptions);
Afterwards, one can use the samples and labels vectors as the inputs and outputs of the problem, respectively.
I hope it helps.
Disclaimer: I am the author of this library. I am answering this question in the sincere hope it can be useful for the OP, since not long ago I also faced the same problems. If a moderator thinks this looks like spam, feel free to delete it. However, I am only posting this because I think it might help others. I even came across this question by mistake while searching for existing C# implementations of LIBSVM, not LIBLINEAR.

is there a faster way to search through cumulative distribution?

I have a List<Double> that holds probabilities (weights) for sampling an item. For example, the List holds 5 values as follows.
0.1, 0.4, 0.2, 0.1, 0.2
Each i-th Double value is the probability of sampling the i-th item of another List<Object>.
How can I construct an algorithm to perform the sampling according to these probabilities?
I tried something like this, where I first converted the list of probabilities into cumulative form.
0.1, 0.5, 0.7, 0.8, 1.0
Then my approach is as follows. I generate a random double, and iterate over the list to find the first item that is larger than the random double, and then return its index.
Random r = new Random();
double p = r.nextDouble();
int total = list.size();
for (int i = 0; i < total; i++) {
    double d = list.get(i);
    if (d > p) {
        return i;
    }
}
return total - 1;
This approach is slow because I am crawling through the list sequentially. In reality, my list has 800,000 items with associated weights (probabilities) that I need to sample from. So, needless to say, the sequential approach is slow.
I'm not sure how binary search can help. Let's say I generated p = 0.01. Then a binary search could recurse over the list as follows:
compare 0.01 to 0.7, recurse into the left half [0.1, 0.5]
compare 0.01 to 0.5, recurse into [0.1]
compare 0.01 to 0.1, stop
0.01 is smaller than 0.7, 0.5, and 0.1, but I obviously want only the first item, 0.1. So the stopping criterion is still not clear to me when using binary search.
If there's a library to help with this type of thing I'd also be interested.
Here is how you could do it using binary search, starting with the cumulative probabilities:
public static void main(String[] args) {
    double[] cdf = {0.1, 0.5, 0.7, 0.8, 1.0};
    double random = 0.75; // generate randomly between zero and one
    int el = Arrays.binarySearch(cdf, random);
    if (el < 0) {
        el = -(el + 1);
    }
    System.out.println(el);
}
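Applied to the p = 0.01 case from the question: 0.01 is not in the array, so binarySearch returns the encoded insertion point -(insertionPoint) - 1 = -1, and decoding it gives index 0, the first bucket, as desired:

```java
import java.util.Arrays;

public class CdfSearchDemo {
    public static void main(String[] args) {
        double[] cdf = {0.1, 0.5, 0.7, 0.8, 1.0};
        int el = Arrays.binarySearch(cdf, 0.01);
        if (el < 0) {
            el = -(el + 1); // decode the insertion point: first cumulative value > 0.01
        }
        System.out.println(el); // 0
    }
}
```

So the "stopping criterion" is handled by binarySearch itself: when the value is absent, the insertion point is exactly the index of the first cumulative value larger than it.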
P.S. When the list of probabilities is short, a simple linear scan might turn out to be as efficient as binary search.
This isn't the most memory-efficient approach, but you can use a NavigableMap where the keys are the cumulative probabilities at which each item's interval begins. Then you can just use floorEntry(random.nextDouble()). Like binary search, lookups take O(log n) time, with O(n) memory.
So...
NavigableMap<Double, Object> pdf = new TreeMap<>();
pdf.put(0.0, "foo");
pdf.put(0.1, "bar");
pdf.put(0.5, "baz");
pdf.put(0.7, "quz");
pdf.put(0.8, "quuz");
Random random = new Random();
pdf.floorEntry(random.nextDouble()).getValue();
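For instance, with the keys above, a draw of 0.75 falls in the bucket that begins at 0.7 (a small sketch using a fixed draw instead of a random one):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class FloorEntryDemo {
    public static void main(String[] args) {
        // Each key is the cumulative probability at which the item's bucket begins.
        NavigableMap<Double, String> pdf = new TreeMap<>();
        pdf.put(0.0, "foo");
        pdf.put(0.1, "bar");
        pdf.put(0.5, "baz");
        pdf.put(0.7, "quz");
        pdf.put(0.8, "quuz");
        System.out.println(pdf.floorEntry(0.75).getValue()); // quz
    }
}
```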

Compute a table in every possible way

I would like to compute, in "every possible way", the product of one value from each column of a table. I would preferably solve the problem in Java. The table is of size n*m. For example, it could be of size 3*5 and contain:
0.5, 3.0, 5.0, 4.0, 0.75
0.5, 3.0, 5.0, 4.0, 0.75
0.5, 9.0, 5.0, 4.0, 3.0
One way of getting the product would be:
0.5 * 3.0 * 5.0 * 4.0 * 0.75
How do I compute this in "every possible way" when the table is of size n*m? I would like to write one program (presumably containing loops) that works for any n*m table.
You could do it recursively, as the other answer mentions, but in general I find Java is somewhat unhappy with recursion. Another method is to keep track of a "signature" of where you are in the table (i.e., an array of length m where each value is 0 <= val < n, choosing one row per column). Each signature uniquely specifies a path through the table, and you can compute the product for a given signature pretty easily:
double val = 1.0;
for (int j = 0; j < m; j++)
    val *= table[signature[j]][j];
To iterate through all signatures, think of them as m-digit numbers in base n and simply increment through them, carrying whenever a digit reaches n. Here's some sample code:
int[] sig = new int[m];                  // sig[j] = which row to use in column j
List<Double> values = new ArrayList<>();
while (true) {
    values.add(getValue(table, sig));    // the product for this signature
    // increment the signature like an m-digit base-n counter
    int j = 0;
    while (j < m && ++sig[j] == n) {
        sig[j] = 0;                      // digit overflowed: reset and carry
        j++;
    }
    if (j == m) {
        break;                           // carried past the last digit: all n^m signatures visited
    }
}
Create a recursive method that makes two calls: one where you use a number from the current column in the final product, and one where you do not. In the call where you do not use it, you make two more calls, one where you use the next number in the column and one where you do not, and so on. When you do use a number, you move on to the next column, effectively building a recursion tree where each leaf is a different combination forming a product.
You would not need any data structure for this besides your table, and it would work for any size of table. If you do not understand the method I have described, I can provide some short example code, but it is fairly simple.
method findProducts(total, row, col)
    if (col == number of columns)
        print(total)                                       // one value has been taken from every column
    else if (row < number of rows)
        findProducts(total * table[row][col], 0, col + 1)  // use this value and move to the next column
        findProducts(total, row + 1, col)                  // or skip it and try the next row instead
Something like this, started with total = 1: printing only once a value has been taken from every column ensures each printed number is a complete product of m factors.
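A runnable sketch of the recursion described above (names are illustrative; it collects the products into a list instead of printing them):

```java
import java.util.ArrayList;
import java.util.List;

public class TableProducts {
    // Collects the product of one value per column, formed in every possible way.
    static void findProducts(double[][] table, int row, int col,
                             double product, List<Double> out) {
        if (col == table[0].length) {
            out.add(product);                // one value was taken from every column
        } else if (row < table.length) {
            // Use table[row][col] and move on to the next column...
            findProducts(table, 0, col + 1, product * table[row][col], out);
            // ...or skip it and try the next row in this column instead.
            findProducts(table, row + 1, col, product, out);
        }
    }

    public static void main(String[] args) {
        double[][] table = { { 2.0, 3.0 }, { 5.0, 7.0 } };
        List<Double> products = new ArrayList<>();
        findProducts(table, 0, 0, 1.0, products);
        System.out.println(products); // [6.0, 14.0, 15.0, 35.0]
    }
}
```

For an n*m table this visits all n^m combinations, one per leaf of the recursion tree.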
