How many children (min, max) should an R-tree node have? - java

I have 500,000 unique 3D points that I want to insert into an R-tree. The constructor of the R-tree accepts two parameters:
the minimal number of children a node can have
the maximal number of children a node can have
I've read on Wikipedia that: "... best performance has been experienced with a minimum fill of 30%–40% of the maximum number of entries."
What would be the optimal values for the two parameters, then?

Well, what Wikipedia states is:
minimum = approximately 0.3 * maximum to 0.4 * maximum
As for the maximum, this depends on your exact setup and implementation. In particular, the dimensionality of your data set plays a huge role, but so does the kind of queries you perform (think of the average number of points returned per query!). Therefore, there cannot be a general rule.
However, as R-trees are designed to be operated on disk, you should maybe choose the maximum value so that it optimally fills a single block on disk (8 KB?).
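To make the block-size argument concrete, here is a rough back-of-the-envelope sketch. The entry layout is an assumption, not something taken from any particular R-tree implementation: each entry holds a bounding box (a min and a max coordinate per dimension) plus one child pointer, and any page header is ignored. `RTreeFanout` and its method names are made up for illustration.

```java
/** Rough fan-out estimate for an R-tree node that should fill one disk page. */
final class RTreeFanout {

    /** Largest number of entries that fit into one page (page header ignored). */
    static int maxEntries(int pageBytes, int dimensions, int coordBytes, int pointerBytes) {
        // Assumed entry layout: bounding box (min and max per dimension) + child pointer.
        int entryBytes = 2 * dimensions * coordBytes + pointerBytes;
        return pageBytes / entryBytes;
    }

    /** Minimum fill as a fraction of the maximum, e.g. 0.3 to 0.4; never below 2. */
    static int minEntries(int maxEntries, double fillFraction) {
        return Math.max(2, (int) Math.floor(maxEntries * fillFraction));
    }
}
```

With an 8 KB page, 3 dimensions, 8-byte doubles and an 8-byte pointer, `maxEntries(8192, 3, 8, 8)` gives 146, and a 40% minimum fill gives 58. If your tree lives purely in memory, the page size is not the deciding factor, so treat these numbers only as a starting point for benchmarking.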

Related

How to get the optimal cluster number using the elbow method for java?

I use haifengl/smile and I need to get the optimal cluster number.
I am using CLARANS, where I need to specify the number of clusters to create. I think there may be a solution along the lines of trying, for example, 2 to 10 clusters, looking at the results, and choosing the number of clusters with the best result. How can this be done with the Elbow method?
The appropriate number of clusters, such that elements within a cluster are similar to each other and dissimilar to elements in other clusters, can be found by applying a variety of techniques, such as:
Gap statistic: compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data.
Silhouette method: the optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.
Sum of squares method
For more details, read the sklearn documentation on this subject.
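Since the question is about Java, here is a minimal sketch of the average silhouette computed by hand. It assumes Euclidean distance and that you already have one cluster label per point (for example, taken from your CLARANS result for a given k; how to extract the labels from the library is not shown here). The class and method names are hypothetical, and the cost is O(n²) distance computations.

```java
/** Average silhouette of a clustering; higher is better, range [-1, 1]. */
final class Silhouette {

    static double meanSilhouette(double[][] points, int[] labels, int k) {
        int n = points.length;
        int[] size = new int[k];
        for (int label : labels) size[label]++;
        int nonEmpty = 0;
        for (int s : size) if (s > 0) nonEmpty++;
        if (nonEmpty < 2) throw new IllegalArgumentException("need at least two non-empty clusters");

        double total = 0.0;
        for (int i = 0; i < n; i++) {
            int own = labels[i];
            if (size[own] == 1) continue;          // singleton cluster: s(i) = 0 by convention
            double[] sum = new double[k];          // summed distance from point i to each cluster
            for (int j = 0; j < n; j++) {
                if (j != i) sum[labels[j]] += euclidean(points[i], points[j]);
            }
            double a = sum[own] / (size[own] - 1); // cohesion: mean distance within own cluster
            double b = Double.POSITIVE_INFINITY;   // separation: mean distance to nearest other cluster
            for (int c = 0; c < k; c++) {
                if (c != own && size[c] > 0) b = Math.min(b, sum[c] / size[c]);
            }
            double denom = Math.max(a, b);
            if (denom > 0) total += (b - a) / denom;
        }
        return total / n;
    }

    static double euclidean(double[] p, double[] q) {
        double s = 0.0;
        for (int d = 0; d < p.length; d++) s += (p[d] - q[d]) * (p[d] - q[d]);
        return Math.sqrt(s);
    }
}
```

You would run the clustering once for each k in your range, call `meanSilhouette` on each result, and keep the k with the largest value.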
The Elbow method is not automatic.
You compute the scores for the desired range of k, plot this, and then visually try to find an "elbow" - which may or may not work.
Because x and y have no "correct" relation to each other, beware that the interpretation of the plot (and any geometric attempt to automate this) depend on the scaling of the plot and are inherently subjective. In the end, the entire concept of an "elbow" likely is flawed and not sound in this form. I'd rather look for more advanced measures where you can argue for the maximum or minimum, although some notion of "significantly better k" would be desirable.
Ways to find clusters:
1- Silhouette method:
Using separation and cohesion, or just an existing implementation, the optimal number of clusters is the one with the maximum silhouette coefficient. The silhouette coefficient ranges over [-1, 1], and 1 is the best value.
Example of the silhouette method with scikit-learn.
2- Elbow method (You can use the elbow method automatically)
The elbow method plots the number of clusters against the average sum of squared distances.
To apply it automatically in Python, there is a library called Kneed that detects the knee in a graph (Kneed repository).
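For Java, a simple geometric heuristic can be sketched by hand. To be clear, this is not the algorithm the Kneed library implements; it is a common simplification that rescales both axes to [0, 1] and picks the point farthest from the straight line joining the first and last point. That rescaling is itself an arbitrary choice, so the result is only as trustworthy as the elbow idea itself. `ElbowFinder` is a made-up name, and the scores are assumed to have been computed beforehand, one per k.

```java
/** Picks an "elbow" as the point farthest from the chord between the first and last point. */
final class ElbowFinder {

    /** Returns the index into ks (sorted ascending) of the detected elbow. */
    static int elbowIndex(int[] ks, double[] scores) {
        int n = ks.length;
        if (n != scores.length || n < 3) {
            throw new IllegalArgumentException("need at least three (k, score) pairs");
        }
        double kSpan = ks[n - 1] - ks[0];
        double sSpan = scores[n - 1] - scores[0];
        if (kSpan == 0 || sSpan == 0) {
            throw new IllegalArgumentException("flat curve, no elbow to find");
        }
        int best = 0;
        double bestDist = -1.0;
        for (int i = 0; i < n; i++) {
            // Rescale so that the first point maps to (0, 0) and the last to (1, 1).
            double x = (ks[i] - ks[0]) / kSpan;
            double y = (scores[i] - scores[0]) / sSpan;
            // Distance to the diagonal y = x is proportional to |y - x|.
            double dist = Math.abs(y - x);
            if (dist > bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}
```

For k = 1..6 with scores 100, 40, 10, 8, 7, 6, `ks[ElbowFinder.elbowIndex(ks, scores)]` is 3. It is still worth plotting the curve and checking whether that point looks like an elbow at all.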

Java Structure that is able to determine approximate number of elements less than x in an ordered set which is updated concurrently

Suppose U is an ordered set of elements, S ⊆ U, and x ∈ U. S is being updated concurrently. I want to get an estimate of the number of elements in S that are less than x in O(log |S|) time.
S is being maintained by another software component that I cannot change. However, whenever e is inserted into (or deleted from) S, I get a message "e inserted" (or "e deleted"). I don't want to maintain my own version of S since memory is limited. I am looking for a structure, ES (perhaps using O(log |S|) space), from which I can get a reasonable estimate of the number of elements less than any given x. Assume that the entire set S can periodically be sampled to recreate or update ES.
Update: I think that this problem statement must include more specific values for U. One obvious case is where the elements of U are numbers (int, double, etc.). Another case is where they are strings ordered lexicographically.
In the case of numbers one could use a probability distribution (but how can that be determined?).
I am wondering if the set S can be scanned periodically. Place the entire set into an array and sort. Then pick the log(n) values at n/log(n), 2n/log(n) ... n where n = |S|. Then draw a histogram based on those values?
More generally how can one find the appropriate probability distribution from S?
I'm not sure what the unit of measure would be for lexicographically ordered strings.
By concurrently, I'm assuming you mean thread-safe. In that case, I believe what you're looking for is a ConcurrentSkipListSet, which is essentially a concurrent TreeSet. You can use ConcurrentSkipListSet#headSet(x).size() or ConcurrentSkipListSet#tailSet(x).size() to get the number of elements less than, or greater than or equal to, a given element. If the natural ordering doesn't fit, you can pass a custom Comparator to the constructor.
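A minimal sketch of that suggestion, assuming integer elements and a wrapper class (`SkipListRank`, a made-up name) that is fed by the insert/delete messages. Two caveats: this keeps a full copy of S, which the question wanted to avoid, and as far as I know `size()` on these views has to walk the elements, so a count is linear in the size of the result rather than O(log |S|).

```java
import java.util.concurrent.ConcurrentSkipListSet;

/** Thread-safe mirror of S that can count elements below or above a value. */
final class SkipListRank {
    private final ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>();

    void inserted(int e) { set.add(e); }

    void deleted(int e) { set.remove(e); }

    /** Number of elements strictly less than x (headSet excludes x itself). */
    int countLess(int x) { return set.headSet(x).size(); }

    /** Number of elements greater than or equal to x (tailSet includes x). */
    int countGreaterOrEqual(int x) { return set.tailSet(x).size(); }
}
```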
Is x constant? If so it seems easy to track the number less than x as they are inserted and deleted?
If x isn't constant you could still take a histogram approach. Divide up the range that values can take. As items are inserted / deleted, keep track of how many items are in each range bucket. When you get a query, sum up all the values from smaller buckets.
I accept your point that bucketing is tricky, especially if you know nothing about the underlying data. You could record the first 100 values, and use those to calculate a mean and a standard deviation. Then you could assume the values are normally distributed and calculate the buckets that way.
Obviously if you know more about the underlying data you can use a different distribution model. It would be easy enough to have a modular approach if you want it to be generic.
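Here is a minimal sketch of the histogram idea for numeric elements. It assumes equal-width buckets over a value range [lo, hi) that you know or guess in advance, clamps out-of-range values into the first or last bucket, and interpolates linearly inside the bucket that contains x. `BucketRankEstimator` is a hypothetical name, and a distribution-based bucketing would only change how the boundaries are chosen.

```java
/** Approximate "how many elements are less than x" using fixed-width bucket counts. */
final class BucketRankEstimator {
    private final double lo;
    private final double width;
    private final long[] counts;

    BucketRankEstimator(double lo, double hi, int buckets) {
        if (!(hi > lo) || buckets < 1) throw new IllegalArgumentException("bad range or bucket count");
        this.lo = lo;
        this.width = (hi - lo) / buckets;
        this.counts = new long[buckets];
    }

    private int bucketOf(double x) {
        int i = (int) Math.floor((x - lo) / width);
        return Math.max(0, Math.min(counts.length - 1, i)); // clamp out-of-range values
    }

    /** Call on every "e inserted" message. */
    synchronized void inserted(double e) { counts[bucketOf(e)]++; }

    /** Call on every "e deleted" message. */
    synchronized void deleted(double e) { counts[bucketOf(e)]--; }

    /** Estimated number of elements less than x. */
    synchronized double estimateLess(double x) {
        // Position of x measured in buckets, clamped to [0, number of buckets].
        double pos = Math.max(0.0, Math.min(counts.length, (x - lo) / width));
        int b = Math.min(counts.length - 1, (int) pos);
        double estimate = 0.0;
        for (int i = 0; i < b; i++) estimate += counts[i]; // buckets entirely below x
        estimate += counts[b] * (pos - b);                 // assumed-uniform share of x's bucket
        return estimate;
    }
}
```

Memory and query time are both proportional to the number of buckets, so with about log |S| buckets this stays within the budget from the question. The estimate is only as good as the assumption that values are spread evenly within each bucket.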

How to find the accuracy of a simulated model with different datasets

I am simulating my model using 29 real datasets. How can I calculate the accuracy of my model, based on the results obtained from these datasets?
Consider using a typical five-number summary (min, Q1, median, Q3, max) of the 29 datasets. Q1 and Q3 are basically the medians of the values below and above the median, respectively. Observe the numbers. If Q1 and Q3 differ by a relatively large amount, the datasets may not have a clear pattern, or you will need more datasets to see the pattern. If the difference is small, however, find the result of your model and see where it fits. The closer your result is to the median, the more accurate your model is. You will need to revise your model, though, if your result falls outside the range between the min and max values.
Note: if your dataset contains more than one variable, simply repeat the process above for each variable and see how they fit together.
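A small sketch of that summary in Java, assuming each dataset contributes one numeric value for the variable you are looking at. Quartile conventions differ between tools; this version takes the lower and upper quartile as the medians of the lower and upper half, leaving out the middle element when the count is odd. `FiveNumberSummary` and its field names are made up for illustration.

```java
import java.util.Arrays;

/** Five-number summary using the "median of each half" quartile convention. */
final class FiveNumberSummary {
    final double min, lowerQuartile, median, upperQuartile, max;

    FiveNumberSummary(double[] values) {
        if (values.length < 2) throw new IllegalArgumentException("need at least two values");
        double[] s = values.clone();
        Arrays.sort(s);
        int n = s.length;
        min = s[0];
        max = s[n - 1];
        median = medianOf(s, 0, n);
        lowerQuartile = medianOf(s, 0, n / 2);       // lower half, middle element excluded
        upperQuartile = medianOf(s, (n + 1) / 2, n); // upper half, middle element excluded
    }

    /** Median of sorted[from..to), where 'to' is exclusive. */
    private static double medianOf(double[] sorted, int from, int to) {
        int len = to - from;
        int mid = from + len / 2;
        return len % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** True if a model result lies between the observed min and max. */
    boolean withinRange(double result) {
        return result >= min && result <= max;
    }
}
```

For the values 1 to 9 this gives a lower quartile of 2.5, a median of 5 and an upper quartile of 7.5.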

Finding N nodes in a graph with maximum spread / distance from each other

Given a graph with N nodes (thousands), I need to find K nodes so that the average path length between each pair (K1,K2) of K is maximized. So basically, I want to place them as far away as possible from each other.
Which algorithm would I use for this / how could I program this without having to try out every single combination of K?
Also as an extension: if I now have N nodes and I need to place 2 groups of nodes K and L in the graph such that the average path length between each pair (L,K) is maximized, how would I do this?
My current attempt is to just do a couple of random placements and then calculate the average path length between the pairs of both K and L, but this calculation is starting to take a lot of time so I'd rather not spend that much time on just evaluating randomly chosen combinations. I'd rather spend time once on getting the REAL most spread combination.
Are there any algorithms out there for this?
The bad news is that this problem is NP-hard, by a reduction from Independent Set. (Assume unit weights. Add a new vertex connected to all other vertices; then we're looking for a collection of K that have average distance 2.)
If you're determined to get an exact solution (and I'm not sure that you shouldn't be), then I'd try branch and bound, using node is/is not one of the K as the branching decision and a crude bound (given a subset of K, find the two nodes that maximize the appropriate combination of the distance between them and the distance to the subset, then set the bound to the appropriate weighted average incorporating the known inter-node distances).
If the exact algorithm above chokes on thousand-node graphs as Evgeny fears it will, then use a farthest-point clustering (link goes to the Wikipedia page on Facility Location, which describes FPC) to cut the graph to a manageable size, incurring a hopefully small approximation error.
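As an illustration of the farthest-point step, here is a sketch of a greedy farthest-first selection on an unweighted graph. It assumes the graph is connected and given as adjacency lists, and the names (`FarthestFirst`, `pick`) are made up. Note that it greedily maximizes the distance to the nearest already chosen node, which is not the same objective as the average pairwise path length, so treat the result as a heuristic or as a way to thin out candidates, not as the exact optimum.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Greedy farthest-first selection of k nodes using BFS hop distances. */
final class FarthestFirst {

    /** BFS hop distances from source; unreachable nodes keep Integer.MAX_VALUE. */
    static int[] bfs(List<List<Integer>> adj, int source) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : adj.get(u)) {
                if (dist[v] == Integer.MAX_VALUE) {
                    dist[v] = dist[u] + 1;
                    queue.add(v);
                }
            }
        }
        return dist;
    }

    /** Picks up to k nodes, starting at 'start'; ties go to the lowest node index. */
    static List<Integer> pick(List<List<Integer>> adj, int k, int start) {
        int n = adj.size();
        int[] nearest = bfs(adj, start); // distance from each node to its closest chosen node
        List<Integer> chosen = new ArrayList<>();
        chosen.add(start);
        while (chosen.size() < k && chosen.size() < n) {
            int next = -1;
            for (int v = 0; v < n; v++) {
                // nearest[v] == 0 means v is already chosen.
                if (nearest[v] > 0 && (next < 0 || nearest[v] > nearest[next])) next = v;
            }
            chosen.add(next);
            int[] d = bfs(adj, next);
            for (int v = 0; v < n; v++) nearest[v] = Math.min(nearest[v], d[v]);
        }
        return chosen;
    }
}
```

This needs one BFS per chosen node, so roughly K * (N + E) work in total, which should be fine for graphs with thousands of nodes. On a path graph 0-1-2-3-4 with start node 0 and k = 3, it picks 0, then 4, then 2.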

Java - normalize and denormalize nominal attributes in neural networks

Hi I am building a simple multilayer network which is trained using back propagation. My problem at the moment is that some attributes in my dataset are nominal (non numeric) and I have to normalize them. I wanted to know what the best approach is. I was thinking along the lines of counting up how many distinct values there are for each attribute and assigning each an equal number between 0 and 1. For example suppose one of my attributes had values A to E then would the following be suitable?:
A = 0
B = 0.25
C = 0.5
D = 0.75
E = 1
The second part to my question is denormalizing the output to get it back to a nominal value. Would I first do the same as above to each distinct output attribute value in the dataset in order to get a numerical representation? Also after I get an output from the network, do I just see which number it is closer to? For example if I got 0.435 as an output and my output attribute values were assigned like this:
x = 0
y = 0.5
z = 1
Do I just find the nearest value to the output (0.435) which is y (0.5)?
You can only do what you are proposing if the variables are ordinal and not nominal, and even then it is a somewhat arbitrary decision. Before I suggest a solution, a note on terminology:
Nominal vs ordinal variables
Suppose A, B, etc. stand for colours. These are the values of a nominal variable and cannot be ordered in a meaningful way. You can't say red is greater than yellow. Therefore, you should not be assigning numbers to nominal variables.
Now suppose A, B, C, etc. stand for garment sizes, e.g. small, medium, large, etc. Even though we are not measuring these sizes on an absolute scale (i.e. we don't say that small corresponds to a chest circumference of 40), it is clear that small < medium < large. With that in mind, it is still somewhat arbitrary whether you set small=1, medium=2, large=3, or small=2, medium=4, large=8.
One-of-N encoding
A better way to go about this is to use the so-called one-out-of-N encoding. If you have 5 distinct values, you need five input units, each of which can take the value 1 or 0. Continuing with my garments example, size extra small can be encoded as 10000, small as 01000, medium as 00100, etc.
A similar principle applies to the outputs of the network. If we treat garment size as output instead of input, then when the network outputs the vector [0.01 -0.01 0.5 0.0001 -.0002], you interpret that as size medium.
In reply to your comment on @Daan's post: if you have 5 inputs, one of which takes 20 possible discrete values, you will need 24 input nodes (4 for the continuous inputs plus 20 for the discrete one). You might want to normalise the values of your 4 continuous inputs to the range [0, 1], because otherwise they may end up dominating your discrete variable.
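The encoding and decoding described above can be sketched in a few lines of Java. The helper names (`NominalCodec`, `oneHot`, `argMax`, `minMax`) are made up for illustration; the index passed to `oneHot` is simply the position of the value in whatever fixed list of categories you choose.

```java
/** One-out-of-N encoding for nominal values, plus min-max scaling for continuous ones. */
final class NominalCodec {

    /** Encodes category 'index' out of 'n' categories as a vector with a single 1. */
    static double[] oneHot(int index, int n) {
        if (index < 0 || index >= n) throw new IllegalArgumentException("index out of range");
        double[] v = new double[n];
        v[index] = 1.0;
        return v;
    }

    /** Decodes a network output vector by taking the position of the largest value. */
    static int argMax(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) best = i;
        }
        return best;
    }

    /** Scales a continuous input to [0, 1] given the observed min and max. */
    static double minMax(double value, double min, double max) {
        return max == min ? 0.0 : (value - min) / (max - min);
    }
}
```

With the sizes ordered as extra small, small, medium, ..., `oneHot(2, 5)` is the pattern 00100 for medium, and `argMax` applied to [0.01, -0.01, 0.5, 0.0001, -0.0002] returns 2, i.e. medium again.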
It really depends on the meaning of the attributes you're trying to normalize, and the functions used inside your NN. For example, if your attributes are non-linear, or if you're using a non-linear activation function, then linear normalization might not end up doing what you want it to do.
If the ranges of attribute values are relatively small, splitting the input and output into sets of binary inputs and outputs will probably be simpler and more accurate.
EDIT:
If the NN was able to accurately perform its function, one of the outputs will be significantly higher than the others. If not, you might have a problem, depending on when you see inaccurate results.
Inaccurate results during early training are expected. They should become less and less common as you perform more training iterations. If they don't, your NN might not be appropriate for the task you're trying to perform. This could be simply a matter of increasing the size and/or number of hidden layers. Or it could be a more fundamental problem, requiring knowledge of what you're trying to do.
If you've successfully trained your NN but are seeing inaccuracies when processing real-world data sets, then your training sets were likely not representative enough.
In all of these cases, there's a strong likelihood that your NN did something entirely different than what you wanted it to do. So at this point, simply selecting the highest output is as good a guess as any. But there's absolutely no guarantee that it'll be a better guess.
