I am simulating my model using 29 real datasets. How can I calculate the accuracy of my model based on the results obtained from these datasets?
Consider using a typical five-number summary (min, Q1, median, Q3, max) of the 29 datasets' results. Q1 and Q3 are essentially the medians of the values below and above the overall median, respectively. Look at the numbers: if Q1 and Q3 are relatively far apart, the datasets may not have a clear pattern, or you may need more datasets to see the pattern. If they are close together, however, take the result of your model and see where it fits. The closer your result is to the median, the more accurate your model is. If your result falls outside the range between the min and max values, you will need to revise your model.
Note: if your dataset contains more than one variable, simply repeat the process above for each variable and see how they fit together.
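For illustration, here is a minimal sketch of that procedure in Java. The 29 result values and the model output are placeholders, and the quartile rule used (median of each half, excluding the overall median) is just one common convention:

import java.util.Arrays;

public class FiveNumberSummary {

    // Median of the sorted slice [from, to).
    static double median(double[] sorted, int from, int to) {
        int len = to - from;
        int mid = from + len / 2;
        return (len % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Placeholder values: replace with the per-dataset results of your 29 runs.
        double[] results = {12.1, 11.8, 12.4, 11.9, 12.0, 12.3, 11.7, 12.2, 12.1, 11.9,
                            12.0, 12.5, 11.6, 12.2, 12.0, 11.8, 12.1, 12.3, 12.0, 11.9,
                            12.2, 12.1, 11.8, 12.0, 12.4, 11.9, 12.1, 12.0, 12.2};
        double[] s = results.clone();
        Arrays.sort(s);

        int n = s.length;
        double min = s[0], max = s[n - 1];
        double med = median(s, 0, n);
        double q1  = median(s, 0, n / 2);        // lower half
        double q3  = median(s, (n + 1) / 2, n);  // upper half

        System.out.printf("min=%.2f Q1=%.2f median=%.2f Q3=%.2f max=%.2f%n", min, q1, med, q3, max);

        double modelResult = 12.05;  // placeholder for your model's output
        System.out.println("Q3 - Q1 spread:       " + (q3 - q1));
        System.out.println("Distance from median: " + Math.abs(modelResult - med));
        System.out.println("Inside [min, max]:    " + (modelResult >= min && modelResult <= max));
    }
}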
I have a huge set of long integer identifiers that need to be distributed into (n) buckets as uniformly as possible. The long integer identifiers might have pockets of missing identifiers.
With that being the criterion, is there a difference between using the long integer as is and doing a modulo (n) of it [long integer % n], or is it better to generate a hashCode for the string version of the long integer (to improve the distribution) and then do a modulo (n) of that [hashCode(string(long integer)) % n]? Is the additional string conversion necessary to get the uniform spread via the hash code?
Since I got feedback that my question does not have enough background information, I am adding some more.
The identifiers are basically auto-incrementing numeric row identifiers that are autogenerated in a database representing an item id. The reason for pockets of missing identifiers is because of deletes.
The identifiers themselves are long integers.
The identifiers (items) number in the tens to hundreds of millions in some cases and in the thousands in others.
Only in the case where the identifiers are in the order of millions do I really want to spread them out into buckets (identifier count >> bucket count) for storage in a NoSQL system (partitions).
I was wondering whether, because items get deleted, I should be resorting to (Long).toString().hashCode() to get the uniform spread instead of using the long numeric value directly. I had a feeling that doing toString().hashCode() is not going to gain me much, and I also did not like the fact that Java's hashCode does not guarantee the same value across Java revisions (though for String the hashCode implementation is documented and has been stable across releases for years).
There's no need to involve String.
new Integer(i).hashCode()
... gives you a hash - designed for the very purpose of evenly distributing into buckets.
new Integer(i).hashCode() % n
... will give you a number in the range you want.
However, Integer.hashCode() is just:
return value;
So new Integer(i).hashCode() % n is equivalent to i % n.
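As a hedged sketch of that conclusion applied to the long identifiers from the question (the bucket count and sample ids below are made up):

public class BucketAssignment {

    // Bucket for an auto-incrementing id: plain modulo, no hashing or String detour needed.
    static int bucketOf(long id, int n) {
        // floorMod keeps the result in [0, n) even if an id were ever negative.
        return (int) Math.floorMod(id, (long) n);
    }

    public static void main(String[] args) {
        int n = 16;  // placeholder bucket count
        long[] sampleIds = {1L, 2L, 5L, 1_000_003L, 1_000_007L};  // gaps from deletes
        for (long id : sampleIds) {
            System.out.println(id + " -> bucket " + bucketOf(id, n));
        }
    }
}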
Your question as it stands cannot be answered. @slim's attempt is the best you will get, because crucial information is missing from your question.
To distribute a set of items, you have to know something about their initial distribution.
If they are uniformly distributed and the range of the inputs is significantly larger than the number of buckets, then slim's answer is the way to go. If either of those conditions doesn't hold, it won't work.
If the range of inputs is not significantly larger than the number of buckets, you need to make sure the range of inputs is an exact multiple of the number of buckets; otherwise the last buckets won't get as many items. For instance, with range [0-999] and 400 buckets, the first 200 buckets get the items [0-199], [400-599] and [800-999], while the other 200 buckets get the items [200-399] and [600-799].
That is, half of your buckets end up with 50% more items than the other half.
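A quick sketch, using the numbers from the example above, that makes the imbalance visible:

public class ModuloImbalance {
    public static void main(String[] args) {
        int range = 1000;    // inputs 0..999, as in the example above
        int buckets = 400;

        int[] counts = new int[buckets];
        for (int i = 0; i < range; i++) {
            counts[i % buckets]++;
        }

        System.out.println("bucket 0 holds   " + counts[0] + " items");    // 3
        System.out.println("bucket 399 holds " + counts[399] + " items");  // 2
    }
}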
If they are not uniformly distributed, then since the modulo operator doesn't change the distribution except by wrapping it around, the output distribution will not be uniform either.
This is when you need a hash function.
But to build a hash function, you must know how to characterize the input distribution; the point of the hash function is to break the recurring, predictable aspects of your input.
To be fair, there are some hash functions that work fairly well on most datasets, for instance Knuth's multiplicative method (assuming not too large inputs). You might, say, compute
hash(input) = input * 2654435761 % 2^32
It is good at breaking clusters of values. However, it fails at divisibility. That is, if most of your inputs are divisible by 2, the outputs will be too. [credit to this answer]
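In Java that could look roughly like this (the constant is the one from the formula above; masking with 0xFFFFFFFF emulates the mod 2^32, and the bucket count is a placeholder):

public class KnuthHash {

    // Knuth's multiplicative method: hash(input) = input * 2654435761 mod 2^32.
    static long knuthHash(long input) {
        return (input * 2654435761L) & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        int buckets = 400;  // placeholder
        for (long id : new long[]{1, 2, 3, 1000, 1001, 1002}) {
            long h = knuthHash(id);
            System.out.println(id + " -> hash " + h + " -> bucket " + (h % buckets));
        }
    }
}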
I found that this gist has an interesting compilation of diverse hashing functions and their characteristics; you might pick the one that best matches the characteristics of your dataset.
I need to implement the calculation of some special polynomials in Java (the language is not really important). These are calculated as a weighted sum of a number of base polynomials with fixed coefficients.
Each base polynomial has 2 to 10 coefficients and there are typically 10 base polynomials considered, giving a total of, say 20-50 coefficients.
Basically the calculation is no big deal, but I am worried about typos. I only have a printed document as a template, so I would like to implement unit tests for the calculations. The issue is: how do I get reliable testing data? I do have another piece of software that is supposed to calculate these functions, but the process is complicated and error prone: I would have to scale the input values, go through a number of menu selections in the software to produce the output, and then paste it into my testing code.
I guess there is no way around using the external software to generate some testing data, but maybe you have some recommendations for making this type of testing procedure safer or for minimizing the required number of test cases.
I am also worried about providing suitable input values: Depending on the value of the independent variable, certain terms will only have a tiny contribution to the output, while for other values they might dominate.
The types of errors I expect (and need to avoid) are:
Typos in coefficients
Coefficients applied to wrong power (i.e. a_7*x^6 instead of a_7*x^7 - just for demonstration, I am not calculating this way but am using Horner's scheme)
Off-by-one errors (i.e. missing the zero-order or highest-order term)
Since you have a polynomial of degree 10, testing at 11 distinct points should give certainty.
However, already a test at one well-randomized point, say x=1.23004 to give an idea (away from small fractions like 2/3, 4/5), will with high probability show a difference if there is an error, because it is unlikely that the difference between the wrong and the true polynomial has a root exactly at that point.
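For what it's worth, here is a sketch of how such a check might look in Java; the coefficients, test points and reference values are placeholders (the real reference values would come from the external software), and evalHorner is a hypothetical name:

public class PolynomialCheck {

    // Horner's scheme: coeffs[0] + coeffs[1]*x + ... + coeffs[k]*x^k.
    static double evalHorner(double[] coeffs, double x) {
        double result = 0.0;
        for (int i = coeffs.length - 1; i >= 0; i--) {
            result = result * x + coeffs[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] coeffs = {1.0, -2.5, 0.0, 4.0};  // placeholder: 1 - 2.5x + 4x^3

        // Well-randomized points away from simple fractions, as suggested above,
        // paired with reference values produced once with the external software.
        double[] xs  = {1.23004, -0.77191, 2.41837};
        double[] ref = {0.0, 0.0, 0.0};           // placeholders, to be filled in

        for (int i = 0; i < xs.length; i++) {
            double got = evalHorner(coeffs, xs[i]);
            System.out.printf("x=%.5f computed=%.6f reference=%.6f diff=%.2e%n",
                              xs[i], got, ref[i], Math.abs(got - ref[i]));
            // In a real JUnit test: assertEquals(ref[i], got, 1e-9);
        }
    }
}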
I would like to use a fast Fourier transform (FFT) in order to identify patterns, so as to predict future values of my monitoring metrics. What I'm trying to do is:
I monitor incoming traffic load, which repeats seasonally (high peaks during daytime), with an additional trend over the period of a week (lower traffic during weekends, also seasonally repeated).
I've tried some augmented regression algorithms, but I would also like to use the FFT to identify the most important coefficients, so as to recognize these two dominant frequencies and then extrapolate to predict the traffic in the near future.
I'm struggling with org.apache.commons.math3.transform.FastFourierTransformer, and my limited theoretical background in mathematics is causing me some trouble.
Supposing that I use a double[] array for storing my latest traffic load over the observed timeframe, I use the following code:
double [] initialSignal = getMonitoringData(timeslide);
FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
Complex [] result = fft.transform(initialSignal, TransformType.FORWARD);
However, I'm not familiar with what the Complex[] array represents. Does the imaginary part of each Complex object in the array represent the relevant sinusoidal coefficient?
So, if I want to get the denoised initial signal, do I only have to set the less significant coefficients of the Complex[] result array to zero?
But still, if I have the following
Complex [] denoised = fft.transform(importantCoefficients, TransformType.INVERSE);
The result will still be an array of Complex. How can I get the newly transformed x(t) values of the time series?
And how can I extrapolate in order to predict the x(t+1), x(t+2) ... x(t+n) values, after denoising the initial time series?
Well, I guess I found a solution last night, pretty similar to erickson's answer.
I calculate x^2 + y^2 and then take into account only the most significant coefficients. I set the other elements of the array to zero and then perform an IFFT. My final question now is:
How can I extrapolate from the result in the Complex array so as to predict future values?
For instance, if I have n=4096 samples (a Complex [4096] array) as my input, then I suppose the value of x(n+1) will be the value of array[0], the value of x(n+2) will be the value of array[1], etc.?
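For completeness, a sketch of the denoising step described above with commons-math3. The synthetic signal, the two periods (288 and 2016 samples, i.e. assuming 5-minute sampling) and the 10% magnitude threshold are all placeholders of mine:

import org.apache.commons.math3.complex.Complex;
import org.apache.commons.math3.transform.DftNormalization;
import org.apache.commons.math3.transform.FastFourierTransformer;
import org.apache.commons.math3.transform.TransformType;

public class FftDenoise {

    public static void main(String[] args) {
        // Placeholder signal; the transformer requires a power-of-two length.
        int n = 4096;
        double[] signal = new double[n];
        for (int t = 0; t < n; t++) {
            signal[t] = 100 * Math.sin(2 * Math.PI * t / 288.0)    // "daily" component
                      + 20 * Math.sin(2 * Math.PI * t / 2016.0)    // "weekly" component
                      + 5 * Math.random();                         // noise
        }

        FastFourierTransformer fft = new FastFourierTransformer(DftNormalization.STANDARD);
        Complex[] spectrum = fft.transform(signal, TransformType.FORWARD);

        // Keep only the strongest coefficients, judged by magnitude sqrt(re^2 + im^2).
        double threshold = 0.1 * maxMagnitude(spectrum);  // arbitrary cut-off, tune it
        for (int k = 0; k < spectrum.length; k++) {
            if (spectrum[k].abs() < threshold) {
                spectrum[k] = Complex.ZERO;
            }
        }

        Complex[] denoisedComplex = fft.transform(spectrum, TransformType.INVERSE);

        // The inverse transform of the filtered spectrum is real again up to rounding,
        // so the denoised x(t) is just the real part of each element.
        double[] denoised = new double[n];
        for (int t = 0; t < n; t++) {
            denoised[t] = denoisedComplex[t].getReal();
        }
        System.out.println("denoised x(0) = " + denoised[0]);
    }

    static double maxMagnitude(Complex[] spectrum) {
        double max = 0;
        for (Complex c : spectrum) {
            max = Math.max(max, c.abs());
        }
        return max;
    }
}

One caveat regarding extrapolation: the inverse FFT only reconstructs the same n samples, so the usual route is to read off frequency, amplitude and phase of the retained coefficients and evaluate those sinusoids at t = n, n+1, ... (for a purely period-n model this coincides with wrapping around to array[0]).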
I have 500,000 unique 3D points which I want to insert into an R-tree. The constructor of the R-tree accepts two parameters:
the minimal number of children a node can have
the maximal number of children a node can have
I've read on wikipedia that: "... best performance has been experienced with a minimum fill of 30%–40% of the maximum number of entries."
What would be the optimal values for the two parameters, then?
Well, what Wikipedia states is:
minimum = approximately 0.3 * maximum to 0.4 * maximum
As for the maximum, this depends on your exact setup and implementation. In particular, the dimensionality of your data set plays a huge role, but so does the kind of queries you perform (think of the average number of points returned per query!). Therefore, there cannot be a general rule.
However, as R-trees are designed to be operated on disk, you should probably choose the maximum value such that a node optimally fills a single block on disk (8 KB?).
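As a rough back-of-the-envelope sketch (the 8 KB block size and the per-entry size are assumptions, not something from your setup):

public class RTreeFanout {
    public static void main(String[] args) {
        // Assumed: 8 KB disk block; each entry holds a 3D bounding box
        // (6 doubles = 48 bytes) plus a child pointer / record id (~8 bytes).
        int blockSizeBytes = 8 * 1024;
        int entrySizeBytes = 6 * 8 + 8;

        int maxChildren = blockSizeBytes / entrySizeBytes;       // ~146
        int minChildren = (int) Math.round(0.35 * maxChildren);  // 30%-40% fill -> ~51

        System.out.println("max children per node ~ " + maxChildren);
        System.out.println("min children per node ~ " + minChildren);
    }
}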
Hi, I am building a simple multilayer network which is trained using backpropagation. My problem at the moment is that some attributes in my dataset are nominal (non-numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting how many distinct values there are for each attribute and assigning each an equally spaced number between 0 and 1. For example, suppose one of my attributes had values A to E; would the following be suitable?
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part of my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also, after I get an output from the network, do I just see which number it is closest to? For example, if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the value nearest to the output (0.435), which would be y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as an output instead of an input, then when the network outputs the vector [0.01 -0.01 0.5 0.0001 -0.0002], you interpret that as size medium.
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes. You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because they may end up dominating your discrete variable.
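A small sketch of the encoding and the arg-max style decoding, using the A-E values from the question (the method names are my own):

import java.util.Arrays;
import java.util.List;

public class OneOfNEncoding {

    // Encode a nominal value as a one-of-N vector, e.g. "C" -> [0, 0, 1, 0, 0].
    static double[] encode(String value, List<String> categories) {
        double[] vec = new double[categories.size()];
        vec[categories.indexOf(value)] = 1.0;
        return vec;
    }

    // Decode a network output by picking the unit with the highest activation.
    static String decode(double[] output, List<String> categories) {
        int best = 0;
        for (int i = 1; i < output.length; i++) {
            if (output[i] > output[best]) best = i;
        }
        return categories.get(best);
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("A", "B", "C", "D", "E");

        System.out.println(Arrays.toString(encode("C", categories)));  // [0.0, 0.0, 1.0, 0.0, 0.0]

        // Placeholder network output, analogous to the vector in the answer above.
        double[] networkOutput = {0.01, -0.01, 0.5, 0.0001, -0.0002};
        System.out.println(decode(networkOutput, categories));         // C ("medium" in the sizes example)
    }
}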
It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to accurately perform its function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.