Several months ago I had to implement a two-dimensional Fourier transformation in Java. While the results seemed sane for a few manual checks, I wondered what a good test-driven approach would look like.
Basically, what I did was check that the DC components had reasonable values and compare the AC components against the Mathematica output to see whether they roughly matched.
My question is: Which unit tests would you implement for a discrete Fourier transformation? How would you validate results returned by your calculation?
As for other unit tests, you should consider small fixed input vectors for which the results can easily be computed by hand and compared against. For more involved inputs, a direct DFT implementation is easy enough to write and can be used to cross-validate results (possibly on top of your own manual computations).
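A minimal sketch of such a direct O(N^2) reference DFT might look like the following; the (re[], im[]) array layout and the class name are assumptions here, so adapt them to whatever complex representation your FFT uses:

// Direct O(N^2) DFT used only as a test reference:
// X[k] = sum_{t=0}^{N-1} x[t] * e^(-2*pi*j*k*t/N)
public final class ReferenceDft {
    public static double[][] forward(double[] re, double[] im) {
        int n = re.length;
        double[] outRe = new double[n];
        double[] outIm = new double[n];
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                double c = Math.cos(angle), s = Math.sin(angle);
                outRe[k] += re[t] * c - im[t] * s;   // real part of x[t] * e^(j*angle)
                outIm[k] += re[t] * s + im[t] * c;   // imaginary part
            }
        }
        return new double[][] { outRe, outIm };
    }
}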
As far as specific test vectors for the one-dimensional FFT go, you can start with the following from dsprelated, selected to exercise common flaws:
Single FFT tests - N inputs and N outputs
Input random data
Inputs are all zeros
Inputs are all ones (or some other nonzero value)
Inputs alternate between +1 and -1.
Input is e^(8*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1); see the sketch after this list)
Input is cos(8*2*pi*i/N) for i = 0,1,2, ...,N-1.
Input is e^((43/7)*j*2*pi*i/N) for i = 0,1,2, ...,N-1. (j = sqrt(-1))
Input is cos((43/7)*2*pi*i/N) for i = 0,1,2, ...,N-1.
Multi FFT tests - run continuous sets of random data
Data sets start at times 0, N, 2N, 3N, 4N, ....
Data sets start at times 0, N+1, 2N+2, 3N+3, 4N+4, ....
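As a sketch of what the complex-exponential case above buys you: with an unscaled forward DFT, the input e^(8*j*2*pi*i/N) should produce a single spike of height N at bin 8 and roughly zero everywhere else, which makes the expected output trivial to assert. The snippet below only generates the input and states that expectation:

// Generate x[i] = e^(8*j*2*pi*i/N); with an unscaled forward DFT the
// expected spectrum is X[8] = (N, 0) and ~0 in every other bin.
int n = 64;
double[] re = new double[n], im = new double[n];
for (int i = 0; i < n; i++) {
    double angle = 8.0 * 2.0 * Math.PI * i / n;
    re[i] = Math.cos(angle);   // real part of e^(j*angle)
    im[i] = Math.sin(angle);   // imaginary part
}
// expected: X[8] == (n, 0) within floating-point tolerance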
For two-dimensional FFT, you can then build on the above. The first three cases are still directly applicable (random data, all zeros, all ones). Others require a bit more work but are still manageable for small input sizes.
Finally, Google searches should yield reference images (before and after the transform) for a few common cases such as black & white squares, rectangles, and circles, which can be used as references (see for example http://www.fmwconcepts.com/misc_tests/FFT_tests/).
99.9% of the numerical and coding issues you are likely to have will be found by testing with random complex vectors and comparing against a direct DFT to a tolerance on the order of floating-point precision.
Zero, constant, or sinusoidal vectors may help you understand a failure by letting your eye catch issues like initialization, clipping, folding, and scaling, but they will not typically find anything that the random case does not.
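A minimal sketch of that random-vector check, reusing the ReferenceDft sketched earlier (MyFft stands in for the implementation under test and is purely a placeholder):

// Compare the FFT under test with the direct DFT on random data.
import java.util.Random;

public final class RandomFftTest {
    public static void main(String[] args) {
        int n = 1024;
        Random rnd = new Random(42);                 // fixed seed keeps failures reproducible
        double[] re = new double[n], im = new double[n];
        for (int i = 0; i < n; i++) {
            re[i] = rnd.nextGaussian();
            im[i] = rnd.nextGaussian();
        }
        double[][] expected = ReferenceDft.forward(re, im);
        double[][] actual = new MyFft().transform(re, im);   // hypothetical code under test
        double tol = 1e-12 * n;                      // rounding error grows with N
        for (int k = 0; k < n; k++) {
            if (Math.abs(expected[0][k] - actual[0][k]) > tol
                    || Math.abs(expected[1][k] - actual[1][k]) > tol) {
                throw new AssertionError("mismatch at bin " + k);
            }
        }
        System.out.println("random-vector test passed");
    }
}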
My kissfft library does a few extra tests related to fixed point issues -- not an issue if you are working in floating point.
I would like to create two models of binary prediction: one with the cut point strictly greater than 0.5 (to obtain fewer but better signals) and a second with the cut point strictly less than 0.5.
Doing cross-validation, we get a test error for a cut point of 0.5. How can I do this with other cut values? I am talking about XGBoost for Java.
xgboost returns a list of scores. You can do whatever you want with that list of scores.
I think that in Java in particular, it returns a 2D ArrayList of shape (1, n).
In binary prediction you probably used a logistic function, so your scores will be between 0 and 1.
Take your scores object and write a custom function that calculates new predictions according to the rules you've described.
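For example, a sketch against the xgboost4j API (Booster.predict returns one score per row for binary:logistic; the ThresholdedPredictor class and the cut-off rule are illustrative, not part of xgboost):

// Apply a custom cut-off to raw xgboost scores.
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoostError;

public final class ThresholdedPredictor {
    public static int[] predict(Booster booster, DMatrix data, double cutoff)
            throws XGBoostError {
        float[][] scores = booster.predict(data);  // shape (nRows, 1) for binary:logistic
        int[] labels = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            labels[i] = scores[i][0] > cutoff ? 1 : 0;   // your rule goes here
        }
        return labels;
    }
}

Calling predict(booster, testData, 0.7) would then give you the "fewer but better" signals, and a cutoff below 0.5 the opposite.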
If you are using an automated/xgboost-implemented cross-validation function, you might want to build a custom evaluation function that does what you describe and pass it as an argument to xgb.cv.
If you want to be smart when setting your threshold, I suggest reading about the AUC of the ROC curve and the precision-recall curve.
I need to implement the calculation of some special polynomials in Java (the language is not really important). These are calculated as a weighted sum of a number of base polynomials with fixed coefficients.
Each base polynomial has 2 to 10 coefficients and there are typically 10 base polynomials considered, giving a total of, say 20-50 coefficients.
Basically the calculation is no big deal, but I am worried about typos, since I only have a printed document as a template. So I would like to implement unit tests for the calculations. The issue is: how do I get reliable testing data? I do have another piece of software that is supposed to calculate these functions, but the process is complicated and error-prone: I would have to scale the input values, go through a number of menu selections in the software to produce the output, and then paste it into my testing code.
I guess there is no way around using the external software to generate some testing data, but maybe you have recommendations for making this type of testing procedure safer or for minimizing the required number of test cases.
I am also worried about choosing suitable input values: depending on the value of the independent variable, certain terms will contribute only a tiny amount to the output, while for other values they might dominate.
The types of errors I expect (and need to avoid) are:
Typos in coefficients
Coefficients applied to the wrong power (i.e. a_7*x^6 instead of a_7*x^7; just for illustration, I am not computing powers this way but using Horner's scheme)
Off-by-one errors (i.e. a missing zero-order or highest-order term)
Since you have a polynomial of degree 10, testing at 11 distinct points gives certainty: two distinct polynomials of degree at most 10 can agree at no more than 10 points.
However, even a test at a single well-randomized point, say x = 1.23004 (away from simple fractions like 2/3 or 4/5), will with high probability expose an error, because it is unlikely that the difference between the wrong and the true polynomial has a root at exactly that point.
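A minimal sketch of that idea: evaluate the production Horner code against a naive power sum whose coefficients were transcribed independently from the printed document, at several randomized points. A typo is unlikely to appear identically in both transcriptions, and unlikely to cancel at a random point. All coefficients below are placeholders:

public final class PolynomialCheck {
    // Production path: Horner's scheme, coefficients transcribed once.
    static final double[] A = { 1.5, -0.25, 0.0, 3.125 };   // a0 + a1*x + a2*x^2 + a3*x^3
    static double horner(double x) {
        double y = 0.0;
        for (int i = A.length - 1; i >= 0; i--) y = y * x + A[i];
        return y;
    }
    // Test path: naive power sum, coefficients transcribed a second time.
    static final double[] B = { 1.5, -0.25, 0.0, 3.125 };
    static double naive(double x) {
        double y = 0.0;
        for (int i = 0; i < B.length; i++) y += B[i] * Math.pow(x, i);
        return y;
    }
    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(7);
        for (int t = 0; t < 11; t++) {               // 11 points pin down degree <= 10
            double x = 0.5 + rnd.nextDouble();       // randomized points away from 0 and 1
            double expected = naive(x);
            double tol = 1e-9 * Math.max(1.0, Math.abs(expected));
            if (Math.abs(horner(x) - expected) > tol)
                throw new AssertionError("mismatch at x=" + x);
        }
        System.out.println("all points agree");
    }
}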
(EDIT: Looking at where this question started, it really ended up in a much better place. It wound up being a nice resource on the limits of RDD sizes in Spark when set through SparkContext.parallelize() vs. the actual size limits of RDDs. Also uncovered some arguments to parallelize() not found in user docs. Look especially at zero323's comments and his accepted answer.)
Nothing new under the sun but I can't find this question already asked ... the question is about how wrong/inadvisable/improper it might be to run a cast inside a large for loop in Java.
I want to run a for loop to initialize an ArrayList before passing it to SparkContext.parallelize(). I have found that passing an uninitialized array to Spark can cause an empty-collection error.
I have seen many posts about why floats and doubles are bad ideas as counters, and I get that; it just seems like this is a bad idea too, like there must be a better way?
numListLen will be 10^6 * 10^3 for now, and maybe as large as 10^12 at some point.
List<Double> numList = new ArrayList<Double>(numListLen);
for (long i = 0; i < numListLen; i++) {
numList.add((double) i);
}
I would love to hear where specifically this code falls down and can be improved. I'm a junior-level CS student so I haven't seen all the angles yet haha. Here's a CMU page seemingly approving this approach in C using implicit casting.
Just for background, numList is going to be passed to Spark to tell it how many times to run a simulation and to create an RDD with the results, like this:
JavaRDD dataSet = jsc.parallelize(numList,SLICES_AKA_PARTITIONS);
// the function will be applied to each member of dataSet
Double count = dataSet.map(new Function<Double, Double>() {...
(Actually I'd love to run this ArrayList creation through Spark, but it doesn't seem to take enough time to warrant that: 5 seconds on my i5 dual-core, but if boosted to 10^12 then ... longer.)
davidstenberg and Konstantinos Chalkias already covered problems related to using Doubles as counters, and radiodef pointed out an issue with creating objects in the loop, but at the end of the day you simply cannot allocate an ArrayList larger than Integer.MAX_VALUE. On top of that, even with 2^31 elements this is a pretty large object, and serialization and network traffic can add substantial overhead to your job.
There are a few ways you can handle this:
using the SparkContext.range method:
range(start: Long, end: Long,
step: Long = 1, numSlices: Int = defaultParallelism)
initializing the RDD using a range object. In PySpark you can use range (xrange in Python 2), in Scala a Range:
val rdd = sc.parallelize(1L to Long.MaxValue)
It requires constant memory on the driver and constant network traffic per executor (all you have to transfer is a beginning and an end).
In Java 8, LongStream.range could work the same way, but it looks like JavaSparkContext doesn't provide the required constructors yet. If you're brave enough to deal with all the singletons and implicits you can use Scala's Range directly, and if not you can simply write a Java-friendly wrapper.
initializing the RDD using the emptyRDD method, or from a small number of seeds, and populating it using mapPartitions(WithIndex) / flatMap; a Java sketch of the seed-based approach follows this list. See for example Creating array per Executor in Spark and combine into RDD.
With a little bit of creativity you can actually generate an infinite number of elements this way (Spark FlatMap function for huge lists).
given your particular use case, you should also take a look at mllib.random.RandomRDDs. It provides a number of useful generators for different distributions.
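As a sketch of the seed-based option in Java, reusing jsc and numListLen from the question (written against the Spark 2.x Java API, where flatMap takes a function returning an Iterator; 1.x expected an Iterable): only the small seed integers are shipped from the driver, and each executor expands its seed into its share of the range locally.

// One small seed per partition; each executor generates its sub-range.
int slices = 8;                                    // a.k.a. partitions
long perSlice = numListLen / slices;               // assumes slices divides numListLen
List<Integer> seeds = new ArrayList<>();
for (int i = 0; i < slices; i++) seeds.add(i);

JavaRDD<Double> dataSet = jsc.parallelize(seeds, slices)
    .flatMap(s -> java.util.stream.LongStream
        .range(s * perSlice, (s + 1) * perSlice)   // this slice's sub-range
        .mapToObj(i -> (double) i)
        .iterator());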
The problem is using a double or float as the loop counter. In your case the loop counter is a long and does not suffer the same problem.
One problem with a double or float loop counter is that floating-point precision leaves gaps in the series of representable numbers. It is possible to reach a place within the valid range of a floating-point type where adding one falls below the precision of the value being represented (it would require 16 significant digits when the format only supports 15, for example). If your loop passed through such a point in normal execution, it would stop incrementing and spin in an infinite loop.
The other problem with doubles as loop counters is the difficulty of comparing two floating-point values. Rounding means that to compare variables successfully you need to check whether they fall within some range of each other: while you might consider 1.0000000 equal to 0.999999999, your computer would not. So rounding might also make you miss the loop's termination condition.
Neither of these problems occurs with your long as the loop counter. So enjoy having done it right.
Although I don't recommend using floating-point values (either single or double precision) as for-loop counters, in your case, where the step is not a fractional number (you use 1 as the step), everything depends on your largest expected number versus the fraction part of the double representation (52 bits).
Still, doubles from 2^52 to 2^53 represent the integer part correctly, but beyond 2^53 you cannot always achieve integer-part precision.
In practice, because your loop step is 1, you would not experience any problems up to 9,007,199,254,740,992 (2^53) if you used double as the counter, thus avoiding the cast (you can't avoid the boxing from double to Double, though).
Perform a simple increment-test; you will see that 9,007,199,254,740,995 is the first false positive!
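A minimal version of such a test, probing 2^53 itself, where adding 1.0 is already lost:

// At 2^53 a double can no longer distinguish consecutive integers.
double d = 9007199254740992.0;            // 2^53
System.out.println(d + 1.0 == d);         // true: the +1 rounds away
System.out.println(d - 1.0 == d);         // false: steps below 2^53 still resolve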
FYI: for float, you are safe incrementing up to 2^24 = 16,777,216 (the article you linked uses the number 100000001.0f > 16777216 to demonstrate the problem).
I would like to use the fast Fourier transform in order to identify patterns and predict future values of my monitoring metrics. What I'm trying to do is:
I monitor incoming traffic load, which repeats seasonally (high peaks during the daytime), with an additional trend over the week (lower traffic during weekends, also seasonally repeated).
I've tried some augmented regression algorithms, but I would also like to use the FFT to identify the most important coefficients, recognize these two dominant frequencies, and then extrapolate to predict the traffic in the near future.
I'm struggling with org.apache.commons.math3.transform.FastFourierTransformer, as my theoretical background in mathematics causes me some trouble.
Supposing that I use a double[] array for storing my latest traffic load over the observed timeframe, I use the following code:
double [] initialSignal = getMonitoringData(timeslide);
FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
Complex [] result = fft.transform(initialSignal, TransformType.FORWARD);
However, I'm not familiar with what the Complex[] array represents. Does the imaginary part of each Complex object in the array represent the corresponding sinusoidal coefficient?
So, if I want to obtain the denoised initial signal, do I only have to set the less significant coefficients of the Complex[] result array to zero?
But still, if I have the following
Complex [] denoised = fft.transform(importantCoefficients, TransformType.INVERSE);
The result will still be an array of Complex. How can I get the newly transformed x(t) values of the time series?
And how can I extrapolate in order to predict the x(t+1), x(t+2) ... x(t+n) values, after denoising the initial time series?
Well, I guess I found a solution last night, pretty similar to erickson's answer.
I calculate x^2 + y^2 (the squared magnitude of each coefficient) and keep only the most significant coefficients, setting the other elements of the array to zero, and then I perform an IFFT. My final question now is:
How I can extrapolate the given result in the Complex array so as to predict future values?
For instance, if I have n = 4096 samples (a Complex[4096] array) as my input, then I suppose the value of x(n+1) will be the value of array[0], the value of x(n+2) will be the value of array[1], etc.?
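For what it's worth, here is a sketch of the whole pipeline with commons-math3 (getMonitoringData and timeslide come from the question; keeping the k largest-magnitude coefficients and the wrap-around prediction at the end are illustrative choices, and the input length must be a power of two for this transformer):

// Forward transform, keep the k strongest coefficients, zero the rest,
// inverse-transform, and read the real parts as the denoised x(t).
double[] initialSignal = getMonitoringData(timeslide);   // length must be a power of two
FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
Complex[] spectrum = fft.transform(initialSignal, TransformType.FORWARD);

int k = 10;                                        // number of coefficients to keep (tuning choice)
double[] magnitudes = new double[spectrum.length]; // ranking by |X| orders the same as x^2 + y^2
for (int i = 0; i < spectrum.length; i++) magnitudes[i] = spectrum[i].abs();
double[] sorted = magnitudes.clone();
java.util.Arrays.sort(sorted);
double cutoff = sorted[sorted.length - k];
for (int i = 0; i < spectrum.length; i++) {
    if (spectrum[i].abs() < cutoff) spectrum[i] = Complex.ZERO;
}

Complex[] smoothed = fft.transform(spectrum, TransformType.INVERSE);
double[] denoised = new double[smoothed.length];
for (int t = 0; t < smoothed.length; t++) {
    denoised[t] = smoothed[t].getReal();           // imaginary parts should be ~0 for real input
}
int n = denoised.length;
double xNext = denoised[0];                        // periodic guess: x(n + m) ~ denoised[m % n]

Note that the wrap-around step leans entirely on the FFT's implicit assumption that the signal is periodic with period n, so treat it as a rough extrapolation rather than a forecast model.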
Hi, I am building a simple multilayer network that is trained using backpropagation. My problem at the moment is that some attributes in my dataset are nominal (non-numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking of counting how many distinct values there are for each attribute and assigning each one an equally spaced number between 0 and 1. For example, suppose one of my attributes had the values A to E; would the following be suitable?
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part of my question is about denormalizing the output to get it back to a nominal value. Would I first do the same as above for each distinct output attribute value in the dataset, to get a numerical representation? And after I get an output from the network, do I just see which number it is closest to? For example, if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the value nearest to the output (0.435), which here is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way: you can't say red is greater than yellow. Therefore, you should not assign numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra-small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as an output instead of an input, when the network outputs the vector [0.01 -0.01 0.5 0.0001 -0.0002], you take the largest entry as the winner and interpret that as size medium.
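A small sketch of both directions in Java (the class is illustrative; decode is just an argmax, which also answers the "which value is 0.435 closest to" question for one-of-N outputs):

// One-of-N encoding for inputs and argmax decoding for outputs.
public final class OneHot {
    public static double[] encode(int value, int numValues) {
        double[] v = new double[numValues];
        v[value] = 1.0;                 // e.g. "small" -> index 1 -> 0,1,0,0,0
        return v;
    }
    public static int decode(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) best = i;
        }
        return best;                    // index of the winning category
    }
}

Applied to the example vector above, decode returns index 2, i.e. medium.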
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalize the values of your 4 continuous inputs to the range [0, 1], because they may end up dominating your discrete variable.
It really depends on the meaning of the attributes you're trying to normalize and on the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to perform its function accurately, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see the inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different from what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any, but there's absolutely no guarantee that it will be a better one.