I'm working on a Java project where I need to match user queries against several engines.
Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.
Example: user searches "Dragon Ball".
Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT result (similarity=-1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system queries several other engines (10+)
I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.
Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.
Thank you!
The obvious approach would be to use "1 - similarity" as a distance function, which will thus go from 0 to 2. Then add them up.
Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...
But since you apparently trust the first score more, you may also want to increase its influence. There is no mathematical solution for this, you habe to choose the weights depending on your data and preferences. If you have training data, you can optimize weights for your approach, and you may want to even discard some rankers if they don't work well or are correlated.
Related
Within a university course I have some features of images (as text files). I have to rank those images according to their diversity.#
The idea I have in mind is to feed a k-means classifier with the images and then compute the euclidian-distance from the images within a cluster to the cluster's centroïd. Then do a rotation between clusters and take always the (next) closest image to the centroïd. I.e., return closest to centroïd 1, then closest to centroïd 2, then 3.... then second closest to centroïd 1, 2, 3 and so on.
First question: would this be a clever approach? Or am I on the wrong path?
Second question: I'm a bit confused. I thought I'd feed the data to Weka and it'd tell me "hey, if I were you, I'd split this data into 7 clusters", or something like that. I mean, that it'd be able to give me some information about the clusters I need. Instead, to use simplekmeans I'm supposed to know a priori how many clusters I'll use... how could I possibly know that?
One example of what I mean: let's say I have 3 mono-color images: light-blue, blue, red.
I thought Weka would notice that the 2 blues are similar and cluster them together.
Btw I'm kind of new to Weka (as you might have seen) so if you could provide some information on which functions I miggt want to use (and why :P) I'd be grateful!
Thank you!
Simple K-means - is an algorithm where you have to specify a number of the possible clusters in the data set.
If you don't know how many clusters there might be, it's better to get different algorithm or find out a number of the clusters.
You can use X-means -there you don't need to specify k parameter. (http://weka.sourceforge.net/doc.packages/XMeans/weka/clusterers/XMeans.html)
X-Means is K-Means extended by an Improve-Structure part In this part of the algorithm the centers are attempted to be split in its region. The decision between the children of each center and itself is done comparing the BIC-values of the two structures.
or you can observe a cut point chart based on AHC - hierarchical clustering algorithm (https://en.wikipedia.org/wiki/Hierarchical_clustering)
and then deduct a number of the clusters
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I have about 3000 text documents which are related to a duration of time when the document was "interesting". So lets say document 1 has 300 lines of text with content, which led to a duration of interest of 5.5 days, whereas another document with 40 lines of text led to a duration of 6.7 days being "interesting", and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives one could try to find the 10 most similar documents in the past and simply compute the average of their duration and take that value as prediction for the duration of interest for the new document.
Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me, how I could compute a similar measure of uncertainty of my prediction. Also it is unclear to me which categories to chose to get the best results from a classifier.
So is there a rule of thumb how to build a systems to best predict a continuous value like time from text documents? Should one use a classifier or should one use an approach using average values on similar documents? I have no real experience in that area and would like to know, which approach you think would probably yield the best results. Bonus point are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.
Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.
Here's a skeleton script to fit a linear regression model using scikit-learn(*):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents) # documents: list of filenames
# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...
# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)
That's it. If you now have new documents for which you want to predict the duration of interest, do
Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)
This is a simple linear regression. In reality, I would expect interest duration to not be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.
(The following is based on my academic "experience", but seems informative enough to post it).
It looks like your task can be reformulated as:
Given a training set of scored documents, design a system for scoring
arbitrary documents based on their content.
"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous.
You could try to find a specific feature of those documents which seems to be responsible for the score. It's more of a human task until you can narrow it down, e.g. you know you're looking for certain "valuable" words which make up the score, or maybe groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).
You might also try developing a search-engine-like system, based on a similarity measure, sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so for every input document, similiar documents would have a chance to exist. Otherwise, the results would be inconslusive.
Depending on what values sim() would return, the measure should fullfill a relationship like:
sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
To test the quality of the measure, you could compute the similarity and score difference for each pair of ducuments, and check the correlation.
The first pick would be the cosine similarity using tf-idf
You've also mentioned categorizing the data. It seems to me like a method "justifying" a poor similarity measure. I.e. if the measure is good, it should be clear which category the document would fall into. As for classifiers, your documents should first have some "features" defined.
If you had a large corpus of the documents, you could try clustering to speed up the process.
Lastly, to determine the final score, I would suggest processing the scores of a few most similar documents. A raw average might not be the best idea in this case, because "less similar" would also mean "less accurate".
As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
(IMHO, 3000 documents is way too low number for doing anything reliable with it without further knowledge of their content or the relationship between the content and score.)
I am working on a project where I need to group sentences based on how similar they are.
For Example, these sentences need to be grouped into a single cluster:
Apple's monster Q1 earnings still fall short on Wall Street
Apple announces Q1 2013 earnings: record $54.5 billion in revenue.
Apple posts record revenue and profits; iPhone sales jump nearly 30%.
The titles keep coming in, so I might need to arrange and modify the clusters on the fly. Currently I am using the Monge-Elkan algorithm to identify how similar two strings are, but I don't know how to cluster them.
Searching on the internet leads me to believe that I need to use K-Means algorithm to group content, but I am not sure how to proceed with what I have.
What makes matters slightly complicated is the fact that I have hosted it on Google App Engine, so I can't use File System.
Edit distance metrics are unlikely to effectively model the similarity of the meaning of sentences, which I assume you are after. Same goes for the low-level representation of text as a string of characters.
A better approach is to use a higher-level representation, such as the vector-space model. Here you collect all the unique words in your sentence collection(corpus) and map each of them to a number. Each document(sentence) is then represented as a vector:
[w1_count, w2_count, ..., wN_count]
Where N'th element is the count of N'th word (the word mapped to number N) in given sentence.
Now you can run k-means on this dataset, but better:
Process the data so that the important words such as 'Apple' are given more weight that common words such as 'on' or 'in'. One such technique is TF-IDF. Then run standard k-means on this with euclidean distance.
Even better, use an even higher-level tool such as Latent Semantic Analysis or Latent Dirichlet Allocation.
If you want to use your existing approach, Simon G.'s answer points you in the right direction and similarity to distance coversion is answered in this question.
First, change your similarities into dissimlarities so that they can be thought of as distances
Second, use a multidimensional scaling library to change the distances into points in space.
Third, use regular k-means on the points in space.
I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high dimensional (in my case 32) dataset with some hundreds of thousands data points.
It's totally good enough to get all entries in a bucket for a query. Which ones i really need could then be processed in a different way under consideration of some filter parameters my problem include.
I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty requires a non-sparse array of booleans or int's, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating it's usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan). If you choose them too low, the index becomes too approximative and loses to many neighbors.
It's probably the biggest challenge with LSH: finding good parameters, that yield the desired speedup and getting a good enough accuracy out of the index...
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
This is a classification problem, where you essentially want to learn a function y(x) which predicts whether 'x', an unlabeled tweet, belongs in the class 'valuable' or in the class 'not valuable'.
The trickiest bits here are not the algorithm (Naive Bayes is just counting and multiplying and is easy to code!) but:
Gathering the training data
Defining the optimal feature set
For one, I suggest you track tweets that the user favorites, replies to, and retweets, and for the second, look at qualities like who wrote the tweet, the words in the tweet, and whether it contains a link or not.
Doing this well is not easy. Google would love to be able to do such things ("What links will the user value"), as would Netflix ("What movies will they value") and many others. In fact, you'd probably do well to read through the notes about the winning entry for the Netflix Prize.
Then you need to extract a bunch of features, as #hmason says. And then you need an appropriate machine learning algorithm; you either need a function approximator (where you try to use your features to predict a value between, say, 0 and 1, where 1 is "best tweet ever" and 0 is "omg who cares") or a classifier (where you use your features to try to predict whether it's a "good" or "bad" tweet).
If you go for the latter--which makes user-training easy, since they just have to score tweets with "like" (to mix social network metaphors)--then you typically do best with support vector machines, for which there exists a fairly comprehensive Java library.
In the former case, there are a variety of techniques that might be worth trying; if you decide to use the LIBSVM library, they have variants for regression (i.e. parameter estimation) as well.