I am working on a project where I need to group sentences based on how similar they are.
For example, these sentences need to be grouped into a single cluster:
Apple's monster Q1 earnings still fall short on Wall Street
Apple announces Q1 2013 earnings: record $54.5 billion in revenue.
Apple posts record revenue and profits; iPhone sales jump nearly 30%.
The titles keep coming in, so I might need to arrange and modify the clusters on the fly. Currently I am using the Monge-Elkan algorithm to identify how similar two strings are, but I don't know how to cluster them.
Searching on the internet leads me to believe that I need to use K-Means algorithm to group content, but I am not sure how to proceed with what I have.
What complicates matters slightly is that the project is hosted on Google App Engine, so I can't use the file system.
Edit distance metrics are unlikely to effectively model the similarity of the meaning of sentences, which I assume you are after. Same goes for the low-level representation of text as a string of characters.
A better approach is to use a higher-level representation, such as the vector-space model. Here you collect all the unique words in your sentence collection (corpus) and map each of them to a number. Each document (sentence) is then represented as a vector:
[w1_count, w2_count, ..., wN_count]
where the N-th element is the count of the N-th word (the word mapped to number N) in the given sentence.
Now you could run k-means on this dataset as is, but it is better to do the following:
Process the data so that important words such as 'Apple' are given more weight than common words such as 'on' or 'in'. One such technique is TF-IDF. Then run standard k-means on the weighted vectors with Euclidean distance, as in the sketch below.
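For instance, a minimal sketch of that pipeline with scikit-learn (the fourth title and the cluster count are made-up placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# the three Apple titles from the question plus one hypothetical unrelated title
sentences = [
    "Apple's monster Q1 earnings still fall short on Wall Street",
    "Apple announces Q1 2013 earnings: record $54.5 billion in revenue.",
    "Apple posts record revenue and profits; iPhone sales jump nearly 30%.",
    "Samsung unveils new Galaxy smartphone at trade show",  # hypothetical
]

# tf-idf weighting downplays common words like 'on' and 'in'
vect = TfidfVectorizer(stop_words="english")
X = vect.fit_transform(sentences)

# standard k-means (Euclidean distance on the tf-idf vectors)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(labels)  # the three Apple titles should end up sharing a label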
Even better, use an even higher-level tool such as Latent Semantic Analysis or Latent Dirichlet Allocation.
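As a sketch of the LSA route, reusing the sentences list from the previous snippet (the component count is a placeholder; it must stay well below your vocabulary size, and something like 100-300 is more typical on a real corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

X = TfidfVectorizer(stop_words="english").fit_transform(sentences)  # sentences as above

# LSA: project the tf-idf vectors onto their top singular vectors,
# then re-normalize so that k-means distances stay meaningful
X_lsa = Normalizer(copy=False).fit_transform(TruncatedSVD(n_components=2).fit_transform(X))
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_lsa)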
If you want to use your existing approach, Simon G.'s answer points you in the right direction, and similarity-to-distance conversion is answered in this question.
First, change your similarities into dissimilarities so that they can be thought of as distances.
Second, use a multidimensional scaling library to change the distances into points in space.
Third, use regular k-means on the points in space.
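A sketch of those three steps with scikit-learn, assuming you already have a symmetric matrix of Monge-Elkan similarities in [0, 1] (the matrix below is made up):

import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# S: hypothetical n x n matrix of pairwise similarities in [0, 1]
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

# first: similarities -> dissimilarities (distances)
D = 1.0 - S

# second: multidimensional scaling turns the distances into points in space
points = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

# third: regular k-means on those points
labels = KMeans(n_clusters=2, random_state=0).fit_predict(points)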
Related
I'm working on a Java project where I need to match user queries against several engines.
Each engine has a method similarity(Object a, Object b) which returns: +1 if the objects surely match; -1 if the objects surely DON'T match; any float in-between when there's uncertainty.
Example: user searches "Dragon Ball".
Engine 1 returns "Dragon Ball", "Dragon Ball GT", "Dragon Ball Z", and it claims they are DIFFERENT results (similarity=-1), no matter how similar their names look. This engine is accurate, so it has a high "weight" value.
Engine 2 returns 100 different results. Some of them relate to DBZ, others to DBGT, etc. The engine claims they're all "quite similar" (similarity between 0.5 and 1).
The system queries several other engines (10+).
I'm looking for a way to build clusters out of this system. I need to ensure that values with similarity near -1 will likely end up in different clusters, even if many other values are very similar to all of them.
Is there a well-known clustering algorithm to solve this problem? Is there a Java implementation available? Can I build it on my own, perhaps with the help of a support library? I'm good at Java (15+ years experience) but I'm completely new at clustering.
Thank you!
The obvious approach would be to use "1 - similarity" as a distance function, which will thus go from 0 to 2. Then add them up.
Or you could use 1 + similarity and take the product of these values, ... or, or, or, ...
But since you apparently trust the first score more, you may also want to increase its influence. There is no mathematical solution for this; you have to choose the weights depending on your data and preferences. If you have training data, you can optimize the weights for your approach, and you may even want to discard some rankers if they don't work well or are correlated.
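A sketch of that combination in Python with scikit-learn, using made-up matrices and weights (note that older scikit-learn versions call the metric parameter affinity):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# toy example: two engines' similarity matrices over 4 results, values in [-1, 1];
# engine 1 is trusted more, so it gets a larger weight
S1 = np.array([[ 1.0, -1.0, -1.0,  0.2],
               [-1.0,  1.0, -1.0,  0.1],
               [-1.0, -1.0,  1.0,  0.0],
               [ 0.2,  0.1,  0.0,  1.0]])
S2 = np.array([[1.0, 0.8, 0.7, 0.1],
               [0.8, 1.0, 0.9, 0.2],
               [0.7, 0.9, 1.0, 0.1],
               [0.1, 0.2, 0.1, 1.0]])
weights = [5.0, 1.0]

# 1 - similarity gives a distance in [0, 2]; combine as a weighted average
D = sum(w * (1.0 - S) for S, w in zip([S1, S2], weights)) / sum(weights)

# cluster directly on the precomputed distance matrix
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(D)
print(labels)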
Suppose we have 16 different categories, e.g. Computer, Science, Art, Business, etc. Under each category we have a set of words (synonyms, homonyms, etc.) that describe the possible meanings of the topic and its range. Consequently, similar or even identical words may fall into more than one category. Our aim is to submit a query (with a maximum length of 3, after stop-word removal) to the system and have it place the query into the category with the highest similarity. So my question is: besides cosine similarity, is there any good technique for doing this?
I already know about WordNet and its extended version, extjwnl; however, I would like to implement something that gives me enough flexibility for small use cases.
There are a few things you can do on top of this to improve results, such as stemming and lemmatization, so that similarity is calculated properly.
As far as similarity is concerned, you can use LDA (Latent Dirichlet Allocation) to treat each document as a combination of multiple topics.
LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you
Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word w_i in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
https://www.cs.princeton.edu/~blei/topicmodeling.html
Although this is unsupervised learning, where the topics (categories) are latent, you can use an extension of LDA called Labeled LDA (LLDA).
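For reference, a minimal sketch of fitting plain LDA with scikit-learn (the toy documents and topic count are placeholders; Labeled LDA itself is not in scikit-learn and would need a separate implementation):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy documents; in your case these would be the word lists describing each category
docs = [
    "computer software hardware programming code",
    "science physics research experiment theory",
    "art painting museum gallery sculpture",
    "business market finance company profit",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)              # LDA is defined over raw term counts, not tf-idf

lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(X)        # each row is a document's topic mixture

# a query can be mapped into the same topic space and compared to the categories
query_topics = lda.transform(vec.transform(["physics experiment"]))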
I would not recommend using WordNet and cosine similarity, as they don't consider co-occurrences of terms and therefore might not work well on all datasets.
Jaccard Similarity can also be used in your case.
Jaccard similarity converts each sentence into a set of terms and measures the overlap between the two documents as the size of the intersection of the sets divided by the size of their union.
For more information on Jaccard Similarity you could take a look at https://en.wikipedia.org/wiki/Jaccard_index
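A minimal sketch of Jaccard similarity between two sentences (using a naive whitespace tokenization):

def jaccard(a, b):
    # treat each sentence as a set of lowercased tokens
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0                        # convention for two empty sets
    return len(sa & sb) / len(sa | sb)    # |intersection| / |union|

print(jaccard("Apple posts record revenue", "Apple announces record revenue"))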
I am working on a document clustering project (in Java). There can be at most 1 million documents, and I want to cluster them in an unsupervised way. To do so, I am trying to implement the EM algorithm with a Gaussian Mixture Model.
But, I am not sure how to make the document vector.
I am thinking of something like this: first I will calculate TF-IDF for each word in the document (after stop-word removal and stemming).
Then I will normalize each vector. At this stage the question arises: how should I represent a vector as a point? Is that possible?
I have learned about EM algorithm from this (https://www.youtube.com/watch?v=iQoXFmbXRJA) video where 1-D points are used for GMM and to be used in EM.
Can anyone explain how to convert a vector into a 1-D point in order to implement EM for a GMM?
If my approach is wrong, can you explain how to do the whole thing in simple words? Sorry for my long question. Thanks for your help!
If you're going to be clustering that many documents, you might consider K-Medoids as well; it creates the initial medoids using randomization (basically). As for representing the vectors as a point: in my experience that is really sketchy. What I have done in the past is store term vectors in a SortedMap, remove irrelevant terms however you want, normalize the vectors into sparse representations, and then use something like cosine similarity or (inverted) Euclidean distance to gauge similarity. I have used JavaML, Weka, and rolled my own unsupervised clustering. The KMedoids in JavaML is pretty good; you will have to reduce your vectors to double[] data structures (normalized, of course) and use their dataset object.
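The term-vector-plus-cosine part of that approach, sketched in Python with plain dicts standing in for the SortedMap (all values here are made up):

import math

def cosine(u, v):
    # u, v: sparse term vectors as {term: weight} dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = {"apple": 0.8, "earnings": 0.5, "q1": 0.3}
doc_b = {"apple": 0.7, "revenue": 0.6, "record": 0.4}
print(cosine(doc_a, doc_b))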
HTH
I would start with something simpler than EM for GMM. If you know the number of clusters in advance, use K-Means. Otherwise, use Mean Shift.
If you must learn a GMM, then note that it can work with an N-D feature vector. If you really must reduce the features to a single dimension, you can use PCA (or some other dimensionality-reduction algorithm) to do that.
In any case, you can find implementations of these algorithms on the net and don't have to implement them yourself, which would slow down your project.
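If you do go the GMM route, a sketch with scikit-learn could look like this (load_documents is a hypothetical loader, and the component counts and covariance type are placeholders to tune):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

documents = load_documents()   # hypothetical loader returning a list of raw texts

X = TfidfVectorizer(stop_words="english").fit_transform(documents)

# TruncatedSVD works directly on the sparse tf-idf matrix; plain PCA would
# require densifying it first, which is impractical for ~1 million documents
X_red = TruncatedSVD(n_components=50).fit_transform(X)

# EM for a Gaussian mixture on the reduced N-dimensional vectors
gmm = GaussianMixture(n_components=20, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(X_red)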
I have about 3000 text documents which are related to a duration of time when the document was "interesting". So let's say document 1 has 300 lines of text with content, which led to a duration of interest of 5.5 days, whereas another document with 40 lines of text led to a duration of 6.7 days being "interesting", and so on.
Now the task is to predict the duration of interest (which is a continuous value) based on the text content.
I have two ideas to approach the problem:
Build a model of similar documents with a technology like http://radimrehurek.com/gensim/simserver.html. When a new document arrives one could try to find the 10 most similar documents in the past and simply compute the average of their duration and take that value as prediction for the duration of interest for the new document.
Put the documents into categories of duration (e.g. 1 day, 2 days, 3-5 days, 6-10 days, ...). Then train a classifier to predict the category of duration based on the text content.
The advantage of idea #1 is that I could also calculate the standard deviation of my prediction, whereas with idea #2 it is less clear to me how I could compute a similar measure of uncertainty for my prediction. It is also unclear to me which categories to choose to get the best results from a classifier.
So is there a rule of thumb for how to build a system that best predicts a continuous value like time from text documents? Should one use a classifier, or an approach based on average values of similar documents? I have no real experience in that area and would like to know which approach you think would probably yield the best results. Bonus points are given if you know a simple existing technology (Java or Python based) which could be used to solve this problem.
Approach (1) is called k-nearest neighbors regression. It's perfectly valid. So are myriad other approaches to regression, e.g. plain multiple regression using the documents' tokens as features.
Here's a skeleton script to fit a linear regression model using scikit-learn(*):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
# build a term-document matrix with tf-idf weights for the terms
vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents) # documents: list of filenames
# now set ytrain to a list of durations, such that ytrain[i] is the duration
# of documents[i]
ytrain = ...
# train a linear regression model using stochastic gradient descent (SGD)
regr = SGDRegressor()
regr.fit(Xtrain, ytrain)
That's it. If you now have new documents for which you want to predict the duration of interest, do
Xtest = vect.transform(new_documents)
ytest = regr.predict(Xtest)
This is a simple linear regression. In reality, I would expect interest duration to not be a linear function of a text's contents, but this might get you started. The next step would be to pick up any textbook on machine learning or statistics that treats more advanced regression models.
(*) I'm a contributor to this project, so this is not unbiased advice. Just about any half-decent machine learning toolkit has linear regression models.
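For completeness, approach (1) from the question (k-nearest neighbors regression) can be sketched with the same vectorizer; whether it beats the linear model is something to measure on your data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

vect = TfidfVectorizer(input="filename")
Xtrain = vect.fit_transform(documents)   # documents: list of filenames, as above
ytrain = ...                             # durations, as above

# predict a new document's duration as the distance-weighted average of the
# durations of its 10 most similar training documents
knn = KNeighborsRegressor(n_neighbors=10, metric="cosine", weights="distance")
knn.fit(Xtrain, ytrain)
ytest = knn.predict(vect.transform(new_documents))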
(The following is based on my academic "experience", but seems informative enough to post it).
It looks like your task can be reformulated as:
Given a training set of scored documents, design a system for scoring
arbitrary documents based on their content.
"based on their content" is very ambiguous. In fact, I'd say it's too ambiguous.
You could try to find a specific feature of those documents which seems to be responsible for the score. It's more of a human task until you can narrow it down, e.g. you know you're looking for certain "valuable" words which make up the score, or maybe groups of words (have a look at http://en.wikipedia.org/wiki/N-gram).
You might also try developing a search-engine-like system, based on a similarity measure sim(doc1, doc2). However, you'd need a large corpus featuring all possible scores (from the lowest to the highest, multiple times), so that for every input document, similar documents would have a chance to exist. Otherwise, the results would be inconclusive.
Depending on what values sim() returns, the measure should fulfill a relationship like:
sim(doc1,doc2) == 1.0 - |score(doc1) - score(doc2)|.
To test the quality of the measure, you could compute the similarity and score difference for each pair of documents and check the correlation.
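One way to run that check, assuming you have a sim(a, b) function and a known score(d) for each document (both names are hypothetical):

import itertools
import numpy as np

# docs: your scored documents; sim and score are the hypothetical functions above
pairs = list(itertools.combinations(docs, 2))
sims = np.array([sim(a, b) for a, b in pairs])
score_diffs = np.array([abs(score(a) - score(b)) for a, b in pairs])

# a useful measure should show a clear negative correlation here
print(np.corrcoef(sims, score_diffs)[0, 1])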
The first pick would be cosine similarity using tf-idf.
You've also mentioned categorizing the data. It seems to me like a method "justifying" a poor similarity measure. I.e. if the measure is good, it should be clear which category the document would fall into. As for classifiers, your documents should first have some "features" defined.
If you had a large corpus of the documents, you could try clustering to speed up the process.
Lastly, to determine the final score, I would suggest processing the scores of a few most similar documents. A raw average might not be the best idea in this case, because "less similar" would also mean "less accurate".
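For instance, a similarity-weighted average instead of a raw average (the neighbors list below is made up; it stands for the (similarity, score) pairs of the most similar documents):

def weighted_score(neighbors):
    # neighbors: [(similarity, score), ...] for the k most similar documents
    total_sim = sum(s for s, _ in neighbors)
    if total_sim == 0:
        return 0.0
    return sum(s * score for s, score in neighbors) / total_sim

print(weighted_score([(0.9, 5.5), (0.7, 6.7), (0.2, 1.0)]))  # closer documents count more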
As for implementation, have a look at: Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
(IMHO, 3000 documents is far too small a number to do anything reliable with, without further knowledge of their content or of the relationship between content and score.)
I'm working on a project where I am using genetic algorithms to generate word lists which best describe a text.
I'm presently using cosine similarity to do it, but it has two flaws: it's far too slow for my purpose, and if the two vectors being compared are all zeroes it ends up with an artificially high similarity and a word vector that isn't very good.
Any suggestions for other measures which would be faster/take less notice of words that aren't there?
Thanks.
Cosine similarity is dot-product over product-of-magnitudes, so minimizing number of dimensions is crucial.
To cull the herd a bit, you might want to apply stemming to collapse words with similar meaning into a single dimension, and toss out hapax legomena (words that only occur once in the corpus under consideration) from the dimension pool, since an algorithm isn't likely to be able to derive much useful information from them.
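A sketch of the hapax cull (stemming left out; documents is a made-up list of token lists):

from collections import Counter

documents = [["apple", "earnings", "record"], ["apple", "revenue"], ["broccoli"]]

counts = Counter(token for doc in documents for token in doc)

# keep only dimensions for words that occur more than once in the corpus
vocabulary = {t for t, c in counts.items() if c > 1}
culled = [[t for t in doc if t in vocabulary] for doc in documents]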
I'm not sure what would give rise to the zero vectors, though. Can you give an example?
EDIT: So what you're after is to create a word list that is selective for a particular document or cluster? In that case, you need some way to eliminate low-selectivity words.
You might want to treat the most common words as stop words to further cull your dimension set and get back a little bit more performance. Also, on the genetic algorithm side, your fitness function needs to penalize word lists that match documents outside of the target cluster, not just reward those that match documents within the cluster, so your word list doesn't get cluttered with terms that are merely frequent rather than selective.
If you need better semantic selectivity even after tuning the fitness function, you might want to consider using orthogonal sparse bigrams (OSBs) instead of individual words. I've no idea what that will do in terms of the number of dimensions, though, because while there will be O(kn²) distinct terms instead of n, a lot more of them will be hapaxes. This may cause a problem if you need individual words instead of OSBs in your term lists, though.
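If you experiment with OSBs, here is a hedged sketch of one common formulation (pair each word with each later word inside a small window, recording the gap; other OSB variants differ in the details):

def osb_features(tokens, window=4):
    # pair each token with each of the next (window - 1) tokens,
    # keeping the number of skipped positions in the feature name
    features = []
    for i, first in enumerate(tokens):
        for gap, second in enumerate(tokens[i + 1:i + window]):
            features.append(f"{first}<skip {gap}>{second}")
    return features

print(osb_features(["record", "revenue", "and", "profits"]))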