What do we mean when we say that we are evaluating the clusters in the WEKA framework? Clustering is an unsupervised approach to grouping objects, so what do we mean when we say we want to evaluate the result? And, in addition to this, when we say that we are evaluating the clusters on the training data itself, what does that mean?
Thanks
Abhishek S
As written on this page:
Evaluation
The way Weka evaluates the clusterings depends on the cluster mode you select. Four different cluster modes are available (as buttons in the Cluster mode panel):
Use training set (default). After generating the clustering Weka classifies the training instances into clusters according to the cluster representation and computes the percentage of instances falling in each cluster. For example, the above clustering produced by k-means shows 43% (6 instances) in cluster 0 and 57% (8 instances) in cluster 1.
In Supplied test set or Percentage split Weka can evaluate clusterings on separate test data if the cluster representation is probabilistic (e.g. for EM).
Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error, based on this assignment and also shows the corresponding confusion matrix. An example of this for k-means is shown below.
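For illustration, here is a minimal sketch of running a classes-to-clusters evaluation through the Weka Java API (Weka 3.8 is assumed; the dataset path and the number of clusters are placeholders):

// Classes-to-clusters evaluation: build the clusterer on data without the
// class attribute, then evaluate against the data that still carries the class.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClustersDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.arff");    // placeholder dataset
    data.setClassIndex(data.numAttributes() - 1);

    // remove the class attribute before building the clusterer
    Remove remove = new Remove();
    remove.setAttributeIndices("" + (data.classIndex() + 1));
    remove.setInputFormat(data);
    Instances dataNoClass = Filter.useFilter(data, remove);

    SimpleKMeans kMeans = new SimpleKMeans();
    kMeans.setNumClusters(2);
    kMeans.buildClusterer(dataNoClass);

    // evaluating on the data with the class set triggers classes-to-clusters mode
    ClusterEvaluation eval = new ClusterEvaluation();
    eval.setClusterer(kMeans);
    eval.evaluateClusterer(data);
    System.out.println(eval.clusterResultsToString());
  }
}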
A small question regarding how to build Grafana dashboards that separate "boundedElastic" from "parallel", please.
Currently, with a Spring WebFlux app, I get very useful Reactor Core metrics out of the box:
executor_pool_size_threads
executor_pool_core_threads
executor_pool_max_threads
etc
The Reactor team even provides default dashboards so we can have visuals on the state:
https://github.com/reactor/reactor-monitoring-demo
Unfortunately, the current dashboards mix "boundedElastic" and "parallel". I am trying to build the same dashboards, but with "boundedElastic" and "parallel" separated.
I tried:
sum(executor_pool_size_threads{_ws_="my_workspace"}) by (reactor_scheduler_id, boundedElastic)
But no luck so far.
May I ask what the correct way to do it is, please?
Thank you
In the demo project, the metrics are stored in Prometheus and are queried using PromQL. Each metric can have several labels, and each label can have several values. Metrics can be selected by labels and values, e.g. my_metric{first_label="first_value", second_label="another_value"} selects my_metric where both labels match the corresponding values.
So in your example the metric executor_pool_size_threads has the label reactor_scheduler_id. However, the values contain more information than just the scheduler name. On my machine (because of the default pool sizes) the values are: parallel(8,"parallel") and boundedElastic("boundedElastic",maxThreads=80,maxTaskQueuedPerThread=100000,ttl=60s). So a regex match with the =~ operator is useful here for matching the values.
PromQL query only for parallel:
sum (executor_pool_size_threads{reactor_scheduler_id=~"parallel.*"}) by (reactor_scheduler_id)
PromQL query only for boundedElastic:
sum (executor_pool_size_threads{reactor_scheduler_id=~"boundedElastic.*"}) by (reactor_scheduler_id)
I am using the 10-fold cross-validation technique to train on 200K records. The target class attribute is
Status {PASS,FAIL}
Pass has ~144K and Fail has ~6K instances.
While training the model using J48, it is not able to find the failures. The accuracy is 95%, but in most cases it just predicts PASS, whereas in our case we need to find the failures that are actually happening.
So my question is mainly a hypothetical analysis.
Does the distribution of instances among the classes (in my case PASS, FAIL) really matter during training?
What parameter values could I use in the Weka J48 tree to train a better model, given that I see about 2% failures in every 1000 records I pass in? So there will be an increase in PASS instances if we increase the success scenarios.
What should the ratio between them be in order to train the model better?
There is nothing I could find in the API as far as the ratio is concerned.
I am not adding any code because this happens both with the Java API and with the Weka GUI tool.
Many Thanks.
The problem here is that your dataset is very unbalanced. You do have a few options on how to help your classification task:
Generate synthetic instances for your minority class using an algorithm like SMOTE. This should increase your performance.
It's not possible in every case, but you could maybe try splitting your majority class into a couple of smaller classes. This would help the balance.
I believe Weka has a One Class Classifier. This learns the decision boundary of the larger class and treats the minority class as outliers, hopefully allowing for better classification. See here for Weka's implementation.
Edit:
You could also use a cost-sensitive classifier that weights misclassifications differently depending on the class. Again, Weka has this as a meta classifier that can be applied to most base classifiers, see here again; a rough sketch follows below.
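As an illustration of that last option, here is a minimal sketch using Weka's CostSensitiveClassifier around J48 (Weka 3.8 assumed; the dataset path and the cost values are made up and would need tuning, and the row/column orientation of the cost matrix should be checked against the CostMatrix javadoc for your Weka version):

// Wrap J48 in a cost-sensitive meta classifier so that missing a FAIL
// (the minority class) is penalised much more heavily than missing a PASS.
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveJ48Demo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("status.arff");    // placeholder dataset
    data.setClassIndex(data.numAttributes() - 1);        // Status {PASS,FAIL}

    // off-diagonal cells are misclassification costs; the larger cost goes on
    // the cell for a FAIL being predicted as PASS (verify orientation in the javadoc)
    CostMatrix costs = new CostMatrix(2);
    costs.setCell(0, 1, 1.0);
    costs.setCell(1, 0, 25.0);    // made-up cost, tune for your use case

    CostSensitiveClassifier csc = new CostSensitiveClassifier();
    csc.setClassifier(new J48());
    csc.setCostMatrix(costs);
    csc.setMinimizeExpectedCost(false);   // reweight training data instead of predictions
    csc.buildClassifier(data);

    System.out.println(csc);
  }
}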
I have been reading several SO posts regarding K-D Trees vs. R-Trees but I still have some questions regarding my specific application.
For my Java application, I want to maintain a relatively small number of spatial data points (a few hundred thousand). The key is that data insertion will not be bulk loaded, but rather, frequently and incrementally inserted. I should also mention that I will be performing a good number of periodic range queries on sub-regions of the spatial domain.
I have read that K-D Trees do not typically support incremental building and that R-trees are more suitable for this since they maintain a balanced state.
However, after looking into the solutions suggested here:
Java commercial-friendly R-tree implementation?
I did not find that the implementations were easy to work with for returning a list of points in range searches. That said, I have found http://java-ml.sourceforge.net/ to have a very nice implementation of a K-D Tree that works quickly and outperforms standard array storage for a test set of points (~25K). Additionally, I have read that R-trees store redundant information when dealing with points (since a point is a rectangle with min=max).
Since I am working with a smaller number of points, are the differences between the two structures less important than, say, if I was working with a database application storing millions of points?
It is incorrect that R-trees can't store points. They are designed to support rectangles, and need to do so at the inner nodes. But a good implementation should store points at the leaf level, and roughly have double the data capacity there.
You can trivially store points, and expose them as "rectangles" with min=max to the tree management code.
Your data isn't small. Small would be something like 100 objects. For 100 objects, an R-tree won't make much sense, as it would likely consist of a single leaf only. For good performance, an R-tree needs a good fan-out. k-d-trees always have a fan-out of 2; they are binary trees. At 100k objects, a k-d-tree will be pretty deep. Assuming that you have a fan-out of 100 (for dynamic R-trees, you should then allow up to 200 objects per page), you can store 1 million points in a 3-level tree.
I've used the ELKI R*-tree, and it is really fast. But it's not commercially friendly unless you get a different license: it's AGPL-3 licensed, which is a copyleft license.
Furthermore, the API isn't designed for standalone use. If you want to use it, the best way is to work with the full ELKI framework instead of trying to rip out the R*-tree.
If your data is low dimensional (say, 3-dimensional) and has a finite bound, don't underestimate the performance of simple grid-based approaches, in particular for in-memory operations. In many cases I wouldn't even go to an octree, but would just define the optimal grid for my use case and then implement it using object lists. Keep each grid cell sorted by one coordinate to further accelerate performance.
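To make the grid idea concrete, here is a minimal sketch of such a grid for 2D points held in memory (the bounds and cell size are assumptions you would tune for your data; keeping each cell sorted is left out for brevity):

// A simple uniform grid over a bounded 2D domain: each cell holds a list of
// points, and a range query only visits the cells overlapping the query box.
import java.util.ArrayList;
import java.util.List;

public class PointGrid {
  private final double minX, minY, cellSize;
  private final int cols, rows;
  private final List<double[]>[] cells;

  @SuppressWarnings("unchecked")
  public PointGrid(double minX, double minY, double maxX, double maxY, double cellSize) {
    this.minX = minX;
    this.minY = minY;
    this.cellSize = cellSize;
    this.cols = (int) Math.ceil((maxX - minX) / cellSize);
    this.rows = (int) Math.ceil((maxY - minY) / cellSize);
    this.cells = new List[cols * rows];
  }

  private int index(double x, double y) {
    int cx = Math.min(cols - 1, Math.max(0, (int) ((x - minX) / cellSize)));
    int cy = Math.min(rows - 1, Math.max(0, (int) ((y - minY) / cellSize)));
    return cy * cols + cx;
  }

  public void insert(double x, double y) {
    int i = index(x, y);
    if (cells[i] == null) cells[i] = new ArrayList<double[]>();
    cells[i].add(new double[] {x, y});
  }

  // returns all points inside the axis-aligned box [lowX,highX] x [lowY,highY]
  public List<double[]> range(double lowX, double lowY, double highX, double highY) {
    List<double[]> result = new ArrayList<double[]>();
    int c0 = Math.max(0, (int) ((lowX - minX) / cellSize));
    int c1 = Math.min(cols - 1, (int) ((highX - minX) / cellSize));
    int r0 = Math.max(0, (int) ((lowY - minY) / cellSize));
    int r1 = Math.min(rows - 1, (int) ((highY - minY) / cellSize));
    for (int cy = r0; cy <= r1; cy++) {
      for (int cx = c0; cx <= c1; cx++) {
        List<double[]> cell = cells[cy * cols + cx];
        if (cell == null) continue;
        for (double[] p : cell) {
          if (p[0] >= lowX && p[0] <= highX && p[1] >= lowY && p[1] <= highY) {
            result.add(p);
          }
        }
      }
    }
    return result;
  }
}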
If you want to frequently add/remove/update data points, you may want to look at the PH-Tree. There is an open source Java version available: www.phtree.org
It works a bit like a quadtree, but is much more efficient by using binary hypercubes and prefix-sharing.
It has excellent update performance (no rebalancing required) and is quite memory efficient. It works better with larger datasets, but 100K should be fine for 2 or 3 dimensions.
I am trying to implement k-means in Hadoop 1.0.1 in Java, and I am getting frustrated. Although I found a GitHub link to a complete implementation of k-means, as a newbie in Hadoop I want to learn it without copying other people's code. I have basic knowledge of the map and reduce functions available in Hadoop. Can somebody give me an idea of how to implement the k-means mapper and reducer classes? Does it require iteration?
Okay, I'll give it a go and tell you what I thought about when implementing k-means in MapReduce.
This implementation differs from that of Mahout, mainly because it is meant to show how the algorithm could work in a distributed setup (and not for real production usage).
I also assume that you really know how k-means works.
That having been said, we have to divide the whole algorithm into three main stages:
Job level
Map level
Reduce level
The Job Level
The job level is fairly simple: it writes the input (key = the class called ClusterCenter, value = the class called VectorWritable), handles the iteration with the Hadoop job, and reads the output of the whole job.
VectorWritable is a serializable implementation of a vector, in this case from my own math library, but actually nothing more than a simple double array.
The ClusterCenter is mainly a VectorWritable, but with convenience functions that a center usually needs (averaging for example).
In k-means you have a seed set of k vectors that are your initial centers, and some input vectors that you want to cluster. That is exactly the same in MapReduce, but I am writing them to two different files. The first file contains only the vectors and some dummy key center, and the other file contains the real initial centers (namely cen.seq).
After all that is written to disk, you can start your first job. This will of course first launch a mapper, which is the next topic.
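A rough sketch of that job-level preparation, using Hadoop's SequenceFile API (ClusterCenter and VectorWritable are the classes described above; the paths, the dummy IntWritable value, and the seedCenters/inputVectors collections are assumptions for illustration; the Hadoop classes come from the usual org.apache.hadoop packages):

// Write the real initial centers and the input vectors to two sequence files
// before launching the first job.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Path centerPath = new Path("clustering/cen.seq");    // real initial centers
Path inputPath = new Path("clustering/data.seq");    // vectors to cluster

SequenceFile.Writer centerWriter = SequenceFile.createWriter(
    fs, conf, centerPath, ClusterCenter.class, IntWritable.class);
for (ClusterCenter center : seedCenters) {
  centerWriter.append(center, new IntWritable(0));   // dummy value
}
centerWriter.close();

SequenceFile.Writer dataWriter = SequenceFile.createWriter(
    fs, conf, inputPath, ClusterCenter.class, VectorWritable.class);
ClusterCenter dummyKey = new ClusterCenter();        // the dummy key center
for (VectorWritable vector : inputVectors) {
  dataWriter.append(dummyKey, vector);
}
dataWriter.close();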
The Map Level
In MapReduce it is always smart to know what is coming in and what is going out (in terms of objects).
So from the job level we know that we have ClusterCenter and VectorWritable as input, where the ClusterCenter is currently just a dummy. For sure we want to have the same as output, because the map stage is the famous assignment step from normal k-means.
You read the real centers file you created at the job level into memory, for comparison between the input vectors and the centers. Therefore you have a distance metric defined; in the mapper it is hardcoded to the Manhattan distance.
To be a bit more specific, you get a part of your input in the map stage, and then you iterate over each input key/value pair (a pair or tuple consisting of key and value), comparing the vector with each of the centers. Here you track which center is the nearest, and then assign the vector to that center by writing the nearest ClusterCenter object along with the input vector itself to disk.
Your output is then: n vectors along with their assigned center (as the key).
Hadoop now sorts and groups by your key, so you get every vector assigned to a single center in the reduce task.
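A rough sketch of such a mapper (ClusterCenter and VectorWritable are the author's own classes described above; their getVector() accessor and the center loading in setup() are assumptions):

// Assignment step: emit each input vector keyed by its nearest center.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper
    extends Mapper<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

  private final List<ClusterCenter> centers = new ArrayList<ClusterCenter>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // read the current centers file (cen.seq) into the centers list here,
    // e.g. with a SequenceFile.Reader; omitted for brevity
  }

  @Override
  protected void map(ClusterCenter key, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    ClusterCenter nearest = null;
    double nearestDistance = Double.MAX_VALUE;
    for (ClusterCenter center : centers) {
      double distance = manhattanDistance(center.getVector(), value.getVector());
      if (distance < nearestDistance) {
        nearestDistance = distance;
        nearest = center;
      }
    }
    context.write(nearest, value);   // the nearest center becomes the key
  }

  private static double manhattanDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }
}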
The Reduce Level
As mentioned above, you will have a ClusterCenter and its assigned VectorWritables in the reduce stage.
This is the usual update step you have in normal k-means. So you are simply iterating over all vectors, summing them up and averaging them.
Now you have a new "mean" which you can compare to the mean that was assigned before. Here you can measure the difference between the two centers, which tells us how much the center has moved. Ideally it wouldn't have moved and would have converged.
A counter in Hadoop is used to track this convergence; the name is a bit misleading, because it actually tracks how many centers have not converged to a final point, but I hope you can live with it.
Basically, you now write the new center and all the vectors to disk again for the next iteration. In addition, in the cleanup step you write all the newly gathered centers to the path used in the map step, so the next iteration has the new centers.
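A rough sketch of such a reducer (again, ClusterCenter and VectorWritable are the author's classes; the copy constructor, the ClusterCenter(double[]) constructor, and the convergence threshold are assumptions):

// Update step: average each cluster's vectors, count centers that still moved,
// and write the new centers plus their vectors out for the next iteration.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer
    extends Reducer<ClusterCenter, VectorWritable, ClusterCenter, VectorWritable> {

  // despite the name, this counts centers that have NOT converged yet
  public static enum Counter { CONVERGED }

  private final List<ClusterCenter> newCenters = new ArrayList<ClusterCenter>();

  @Override
  protected void reduce(ClusterCenter oldCenter, Iterable<VectorWritable> values,
      Context context) throws IOException, InterruptedException {
    List<VectorWritable> vectors = new ArrayList<VectorWritable>();
    double[] sum = null;
    for (VectorWritable value : values) {
      VectorWritable copy = new VectorWritable(value);  // Hadoop reuses the value object
      vectors.add(copy);
      double[] v = copy.getVector();
      if (sum == null) sum = new double[v.length];
      for (int i = 0; i < v.length; i++) sum[i] += v[i];
    }
    for (int i = 0; i < sum.length; i++) sum[i] /= vectors.size();
    ClusterCenter newCenter = new ClusterCenter(sum);

    newCenters.add(newCenter);
    for (VectorWritable vector : vectors) {
      context.write(newCenter, vector);                 // input for the next iteration
    }
    // if the center moved, it has not converged yet
    if (manhattanDistance(oldCenter.getVector(), newCenter.getVector()) > 1e-9) {
      context.getCounter(Counter.CONVERGED).increment(1);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // write newCenters back to the centers path (cen.seq) so the next
    // iteration's mappers read the updated centers; omitted for brevity
  }

  private static double manhattanDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
    return sum;
  }
}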
Now, back at the job level, the MapReduce job should be done. We then inspect the counter of that job to get the number of centers that haven't converged yet.
This counter is used in a while loop to determine whether the whole algorithm can come to an end or not.
If not, return to the Map Level paragraph again, but use the output from the previous job as the input.
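Back at the job level, the iteration loop could look roughly like this (Hadoop 1.0.1 API; the paths, class names, and the Counter enum match the sketches above and are assumptions):

// Keep launching jobs until no center moved anymore (the counter stays at 0).
long notConverged = 1;
int iteration = 1;
Path in = new Path("clustering/data.seq");
while (notConverged > 0) {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "k-means iteration " + iteration);
  job.setJarByClass(KMeansMapper.class);
  job.setMapperClass(KMeansMapper.class);
  job.setReducerClass(KMeansReducer.class);
  job.setOutputKeyClass(ClusterCenter.class);
  job.setOutputValueClass(VectorWritable.class);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);

  Path out = new Path("clustering/depth_" + iteration);
  SequenceFileInputFormat.addInputPath(job, in);
  SequenceFileOutputFormat.setOutputPath(job, out);

  job.waitForCompletion(true);
  notConverged = job.getCounters()
      .findCounter(KMeansReducer.Counter.CONVERGED).getValue();

  in = out;          // the previous output becomes the next input
  iteration++;
}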
Actually, that was the whole voodoo.
For obvious reasons this shouldn't be used in production, because its performance is horrible. Better to use the more tuned version from Mahout. But for educational purposes this algorithm is fine ;)
If you have any more questions, feel free to write me a mail or comment.
Cleo has several different types of lookahead searches, which are backed by some very clever indexing strategies. The GenericTypeahead is presumably for the largest of datasets.
From http://sna-projects.com/cleo/design.php:
"The GenericTypeahead is designed for large data sets, which may contain millions of elements..."
Unfortunately, the documentation doesn't go into how well, or how, the typeaheads scale up. Has anyone used Cleo for very large datasets who might have some insight?
Cleo is for a single instance/node (i.e. a single JVM) and does not have any routing or broker logic. Within a single Cleo instance, you can have multiple logical partitions to take advantage of multi-core CPUs. On a typical commodity box with 32G-64G of memory, you can easily support tens of millions of elements by setting up 2 or 3 Cleo GenericTypeahead instances.
To support billions of elements, you will have to use horizontal partitioning to set up many Cleo instances on many commodity boxes and then do scatter-and-gather.
Check out https://github.com/jingwei/cleo-primer to see how to set up a single Cleo GenericTypeahead instance within minutes.
Cheers.