I'm working on a project where I am using genetic algorithms to generate word lists which best describe a text.
I'm presently using cosine similarity to do it, but it has two flaws: it's far too slow for the purpose, and when the two vectors being compared are all zeroes it ends up with an artificially high similarity and a word vector that isn't very good.
Any suggestions for other measures which would be faster/take less notice of words that aren't there?
Thanks.
Cosine similarity is dot-product over product-of-magnitudes, so minimizing the number of dimensions is crucial.
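For what it's worth, here is a minimal sketch of that computation over sparse term-weight maps in plain Java (the map-based representation is just an assumption about how you store your vectors); iterating only one map's entries means absent words cost nothing, and the zero-magnitude guard avoids spurious results for empty vectors:

    import java.util.Map;

    public final class CosineSimilarity {

        /** Cosine similarity of two sparse term-weight vectors (term -> weight). */
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            // Iterate the smaller map so absent terms cost nothing.
            if (a.size() > b.size()) {
                Map<String, Double> tmp = a; a = b; b = tmp;
            }
            double dot = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other;
                }
            }
            double magA = magnitude(a);
            double magB = magnitude(b);
            // Guard: empty/zero vectors get similarity 0 instead of NaN or a spurious match.
            if (magA == 0.0 || magB == 0.0) {
                return 0.0;
            }
            return dot / (magA * magB);
        }

        private static double magnitude(Map<String, Double> v) {
            double sumSq = 0.0;
            for (double x : v.values()) {
                sumSq += x * x;
            }
            return Math.sqrt(sumSq);
        }
    }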
To cull the herd a bit, you might want to apply stemming to collapse words with similar meaning into a single dimension, and toss out hapax legomena (words that only occur once in the corpus under consideration) from the dimension pool, since an algorithm isn't likely to be able to derive much useful information from them.
I'm not sure what would give rise to the zero vectors, though. Can you give an example?
EDIT: So what you're after is to create a word list that is selective for a particular document or cluster? In that case, you need some ways to eliminate low-selectivity words.
You might want to treat the most common words as stop words to further cull your dimension set and get back a little bit more performance. Also, on the genetic algorithm side, your fitness function needs to penalize word lists that match documents outside of the target cluster, not just reward those that match documents within the cluster, so your word list doesn't get cluttered with terms that are merely frequent rather than selective.
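To illustrate that kind of fitness function, here is a rough plain-Java sketch; the set-of-terms document representation, the matches() rule and the penalty weight are all made up for illustration, not taken from your setup:

    import java.util.List;
    import java.util.Set;

    public final class WordListFitness {

        /**
         * Fitness of a candidate word list: +1 per matched document inside the
         * target cluster, -penalty per matched document outside it, so merely
         * frequent (non-selective) words drag the score down.
         */
        public static double fitness(Set<String> wordList,
                                     List<Set<String>> targetClusterDocs,
                                     List<Set<String>> otherDocs,
                                     double penalty) {
            double score = 0.0;
            for (Set<String> doc : targetClusterDocs) {
                if (matches(wordList, doc)) {
                    score += 1.0;
                }
            }
            for (Set<String> doc : otherDocs) {
                if (matches(wordList, doc)) {
                    score -= penalty;
                }
            }
            return score;
        }

        /** Illustrative rule: a document "matches" if it contains at least half of the word list. */
        private static boolean matches(Set<String> wordList, Set<String> docTerms) {
            if (wordList.isEmpty()) {
                return false;
            }
            long hits = wordList.stream().filter(docTerms::contains).count();
            return hits * 2 >= wordList.size();
        }
    }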
If you need better semantic selectivity even after tuning the fitness function, you might want to consider using orthogonal sparse bigrams instead of individual words. I've no idea what it'll do in terms of number of dimensions, though, because while there will be O(kn²) distinct terms instead of n, a lot more of them will be hapaxes. This may cause a problem if you need individual words instead of OSBs in your term lists, though.
For a current project, I want to use genetic algorithms; so far I have had a look at the jenetics library.
How can I enforce that some genes are dependent on each other? I want to map CSS onto the genes, e.g. I have genes indicating whether an image is displayed and, if it is, the respective height and width. So I want to keep those genes together as a group, as it would make no sense if, after a crossover, the chromosome indicated something like "no image" - height 100px - width 0px.
Is there a method to do so? Or maybe another library (in java) which supports this?
Many thanks!
You want to embed more knowledge into your system to reduce the search space.
If it were knowledge about the structure of the solution, I would propose taking a look at grammatical evolution (GE). Your knowledge appears to be more about valid combinations of codons, so GE is not easily applicable.
It might be possible to combine a few features into a single codon, but this may be undesirable and/or infeasible (e.g. due to the great number of possible combinations).
But in fact you don't have an issue here:
it's fine to have meaningless genotypes — they will be removed due to the selection pressure
it's fine to have meaningless codon sequences; this is called "bloat". Bloat is quite common in some evolutionary algorithms (usually discussed in the context of genetic programming) and is not strictly bad; fighting bloat too aggressively can reduce search performance
If you know how your genome is encoded - that is, you know which sequences of chromosomes form groups - then you could extend (since you mention jenetics) io.jenetics.MultiPointCrossover to avoid splitting groups. (Source code available on GitHub.)
It could be as simple as storing the ranges of genes which form groups and, if one of the random cut indexes would split a group, adjusting the index to the nearest end of the group. (Of course this would cause a statistically higher likelihood of cuts at the ends of groups; it would probably be better to generate a new random location until it doesn't intersect any group.)
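Here is a small, library-independent sketch of that index adjustment; the GeneGroup type and adjustCut() are invented names, and you would call something like this from your own crossover before applying a cut:

    import java.util.List;

    public final class GroupAwareCuts {

        /** Inclusive range of gene indexes that must not be split by a cut. */
        public record GeneGroup(int start, int end) {}

        /**
         * A cut at index i splits the sequence between genes i-1 and i.
         * If the cut falls strictly inside a group, snap it to the nearest
         * group boundary so the group stays intact after crossover.
         */
        public static int adjustCut(int cutIndex, List<GeneGroup> groups) {
            for (GeneGroup g : groups) {
                if (cutIndex > g.start() && cutIndex <= g.end()) {
                    int toStart = cutIndex - g.start();
                    int toEnd = (g.end() + 1) - cutIndex;
                    return (toStart <= toEnd) ? g.start() : g.end() + 1;
                }
            }
            return cutIndex; // Not inside any group: keep as-is.
        }
    }

Re-rolling the random index until it falls outside every group, as suggested above, would just replace the snapping with a loop.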
But it's also valid (as Pete notes) to have genes which aren't meaningful (ignored) based on other genes; if the combination is anti-survival it will be selected out.
Suppose we have 16 different categories, e.g., Computer, Science, Art, Business, etc. Under each category we have some words (synonyms, homonyms, etc.) which describe the possible meaning of the topic and its range. Consequently, there might be similar or even identical words which fall into more than one category. Our aim is to submit a query (with a maximum length of 3, after stop-word removal) to a system and ask the system to put the query into the category with the highest similarity. So my question is: besides cosine similarity, is there any good technique for doing this?
I already know about WordNet and its extended Java library, extJWNL; however, I wish to implement something that gives me enough flexibility for small use cases.
There are a few things that can be done on top, like stemming and lemmatization, so that similarity is calculated properly.
As far as similarity is concerned, you can use LDA (Latent Dirichlet Allocation) to treat each document as a combination of multiple topics.
LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you
Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word w_i in the document by:
First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.
Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
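To make the generative story concrete, here is a toy plain-Java sketch; the two topics, their word probabilities, and the fixed topic mixture (standing in for the Dirichlet draw) are invented for illustration:

    import java.util.Random;

    public final class ToyLdaGenerator {

        // Invented topic-word distributions (probabilities sum to 1 per topic).
        private static final String[] FOOD_WORDS = {"broccoli", "bananas", "dinner", "eat"};
        private static final double[] FOOD_PROBS = {0.30, 0.15, 0.30, 0.25};
        private static final String[] ANIMAL_WORDS = {"kitten", "puppy", "cute", "hamster"};
        private static final double[] ANIMAL_PROBS = {0.35, 0.25, 0.25, 0.15};

        public static void main(String[] args) {
            Random rnd = new Random(42);
            int numWords = 10;                          // stand-in for the Poisson draw
            double[] topicMixture = {1.0 / 3, 2.0 / 3}; // stand-in for the Dirichlet draw

            StringBuilder doc = new StringBuilder();
            for (int i = 0; i < numWords; i++) {
                // Pick a topic according to the document's topic mixture...
                boolean food = rnd.nextDouble() < topicMixture[0];
                // ...then pick a word from that topic's word distribution.
                String word = food
                        ? sample(FOOD_WORDS, FOOD_PROBS, rnd)
                        : sample(ANIMAL_WORDS, ANIMAL_PROBS, rnd);
                doc.append(word).append(' ');
            }
            System.out.println(doc.toString().trim());
        }

        /** Sample one item from a discrete distribution. */
        private static String sample(String[] items, double[] probs, Random rnd) {
            double r = rnd.nextDouble();
            double cumulative = 0.0;
            for (int i = 0; i < items.length; i++) {
                cumulative += probs[i];
                if (r < cumulative) {
                    return items[i];
                }
            }
            return items[items.length - 1];
        }
    }

Inference (in LDA proper, or the labeled variant mentioned below) runs this process in reverse to recover the topic distributions from observed documents.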
https://www.cs.princeton.edu/~blei/topicmodeling.html
Although this is unsupervised training, where the topics (categories) are latent, you can use an extension of LDA called Labeled LDA (LLDA).
I would not recommend using WordNet and cosine similarity, as they don't consider co-occurrences of terms and therefore might not work well with all datasets.
Jaccard Similarity can also be used in your case.
Jaccard similarity treats each sentence as a set of terms and then measures the overlap between the two documents: the size of the intersection divided by the size of the union of the two sets.
For more information on Jaccard Similarity you could take a look at https://en.wikipedia.org/wiki/Jaccard_index
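A minimal Java sketch of that computation, assuming a naive whitespace tokenizer (a real implementation would share the same stemming/stop-word pipeline as the rest of the system):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    public final class JaccardSimilarity {

        /** Jaccard index: |A ∩ B| / |A ∪ B| over the two sentences' term sets. */
        public static double jaccard(String sentenceA, String sentenceB) {
            Set<String> a = terms(sentenceA);
            Set<String> b = terms(sentenceB);
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0; // Both empty: treat as identical.
            }
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        /** Naive tokenizer: lower-case and split on whitespace. */
        private static Set<String> terms(String sentence) {
            if (sentence.isBlank()) {
                return new HashSet<>();
            }
            return new HashSet<>(Arrays.asList(
                    sentence.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        }
    }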
I am trying to make a project on document clustering (in Java). There can be at most 1 million documents, and I want to build unsupervised clusters. To do this, I am trying to implement the EM algorithm with a Gaussian Mixture Model.
But, I am not sure how to make the document vector.
I am thinking of something like this: first I will calculate TF-IDF for each word in the document (after stop-word removal and stemming).
Then I will normalize each vector. At this stage the question arises: how shall I represent a vector by a point? Is that possible?
I have learned about the EM algorithm from this video (https://www.youtube.com/watch?v=iQoXFmbXRJA), where 1-D points are used for the GMM and the EM steps.
Can anyone explain how to convert a vector into a 1-D point to implement EM for a GMM?
If my approach is wrong, can you explain how to do the whole thing in simple words? Sorry for my long question. Thanks for your help!
If you're going to be clustering that many documents, you might consider K-Medoids as well; it creates the initial centroids using randomization (basically). As for representing the vectors as a single point, in my experience that is really sketchy.

What I have done in the past is store term vectors in a SortedMap, remove irrelevant terms however you want, and normalize the vectors into sparse representations; then you can use something like cosine similarity or (inverted) Euclidean distance to gauge similarity. I have used JavaML, Weka, and rolled my own unsupervised clustering. The KMedoids in JavaML is pretty good; you will have to reduce your vectors to double[] data structures (normalized, of course) and use their dataset object.
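As a rough sketch of that pipeline in plain Java (the fixed-vocabulary indexing is a simplifying assumption and independent of JavaML's own classes): store term weights in a SortedMap, L2-normalize, and only expand to a dense double[] when the clustering library needs it.

    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public final class SparseTermVectors {

        /** L2-normalize a sparse term-weight vector in place. */
        public static void normalize(SortedMap<String, Double> vector) {
            double sumSq = 0.0;
            for (double w : vector.values()) {
                sumSq += w * w;
            }
            double norm = Math.sqrt(sumSq);
            if (norm == 0.0) {
                return; // Empty vector: nothing to normalize.
            }
            vector.replaceAll((term, w) -> w / norm);
        }

        /** Expand a sparse vector to a dense double[] over a fixed vocabulary order. */
        public static double[] toDense(SortedMap<String, Double> vector, List<String> vocabulary) {
            double[] dense = new double[vocabulary.size()];
            for (int i = 0; i < vocabulary.size(); i++) {
                dense[i] = vector.getOrDefault(vocabulary.get(i), 0.0);
            }
            return dense;
        }

        public static void main(String[] args) {
            SortedMap<String, Double> doc = new TreeMap<>();
            doc.put("apple", 3.0);
            doc.put("earnings", 1.0);
            normalize(doc);
            double[] dense = toDense(doc, List.of("apple", "earnings", "revenue"));
            System.out.println(java.util.Arrays.toString(dense));
        }
    }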
HTH
I would start with something simpler than EM for GMM. If you know the number of clusters in advance, use K-Means. Otherwise, use Mean Shift.
If you must learn a GMM, then note that it can work with an N-D feature vector. If you absolutely must reduce the features to a single dimension, you can use PCA (or some other dimensionality reduction algorithm) to do that.
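If you do go the reduction route, here is a minimal plain-Java sketch of projecting (already vectorized) documents onto their first principal component via power iteration; in practice a library implementation of PCA would be preferable:

    public final class FirstPrincipalComponent {

        /** Project each row of x onto the first principal component, yielding one value per document. */
        public static double[] projectTo1D(double[][] x, int iterations) {
            int n = x.length, d = x[0].length;

            // Center the data (subtract the column means).
            double[] mean = new double[d];
            for (double[] row : x) {
                for (int j = 0; j < d; j++) mean[j] += row[j] / n;
            }
            double[][] c = new double[n][d];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < d; j++) c[i][j] = x[i][j] - mean[j];
            }

            // Power iteration on C^T C to find the dominant eigenvector (the first PC).
            double[] v = new double[d];
            java.util.Arrays.fill(v, 1.0 / Math.sqrt(d));
            for (int iter = 0; iter < iterations; iter++) {
                double[] next = new double[d];
                for (double[] row : c) {
                    double dot = 0.0;
                    for (int j = 0; j < d; j++) dot += row[j] * v[j];
                    for (int j = 0; j < d; j++) next[j] += dot * row[j];
                }
                double norm = 0.0;
                for (double value : next) norm += value * value;
                norm = Math.sqrt(norm);
                if (norm == 0.0) break; // Degenerate data: stop iterating.
                for (int j = 0; j < d; j++) v[j] = next[j] / norm;
            }

            // The 1-D coordinate of each document is its projection onto v.
            double[] projection = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < d; j++) projection[i] += c[i][j] * v[j];
            }
            return projection;
        }
    }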
In any case, you can find implementations of these algorithms on the net and don't have to implement them yourself, which would slow down your project.
I am working on a project where I need to group sentences based on how similar they are.
For Example, these sentences need to be grouped into a single cluster:
Apple's monster Q1 earnings still fall short on Wall Street
Apple announces Q1 2013 earnings: record $54.5 billion in revenue.
Apple posts record revenue and profits; iPhone sales jump nearly 30%.
The titles keep coming in, so I might need to arrange and modify the clusters on the fly. Currently I am using the Monge-Elkan algorithm to identify how similar two strings are, but I don't know how to cluster them.
Searching on the internet leads me to believe that I need to use K-Means algorithm to group content, but I am not sure how to proceed with what I have.
What makes matters slightly complicated is the fact that I have hosted it on Google App Engine, so I can't use File System.
Edit distance metrics are unlikely to effectively model the similarity of the meaning of sentences, which I assume you are after. Same goes for the low-level representation of text as a string of characters.
A better approach is to use a higher-level representation, such as the vector space model. Here you collect all the unique words in your sentence collection (corpus) and map each of them to a number. Each document (sentence) is then represented as a vector:
[w1_count, w2_count, ..., wN_count]
where the N-th element is the count of the N-th word (the word mapped to number N) in the given sentence.
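A short plain-Java sketch of building those count vectors; the word-character tokenization is a naive placeholder for a real tokenizer:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;

    public final class CountVectors {

        /** Build a [w1_count, w2_count, ..., wN_count] vector per sentence over a shared word index. */
        public static List<int[]> vectorize(List<String> sentences) {
            // Map each unique word in the corpus to a dimension index.
            Map<String, Integer> wordIndex = new LinkedHashMap<>();
            for (String s : sentences) {
                for (String w : tokenize(s)) {
                    if (!w.isEmpty()) {
                        wordIndex.putIfAbsent(w, wordIndex.size());
                    }
                }
            }
            // Count occurrences of each word per sentence.
            List<int[]> vectors = new ArrayList<>();
            for (String s : sentences) {
                int[] v = new int[wordIndex.size()];
                for (String w : tokenize(s)) {
                    if (!w.isEmpty()) {
                        v[wordIndex.get(w)]++;
                    }
                }
                vectors.add(v);
            }
            return vectors;
        }

        private static String[] tokenize(String sentence) {
            return sentence.toLowerCase(Locale.ROOT).split("\\W+");
        }
    }

Swapping the raw counts for TF-IDF weights, as suggested below, is a drop-in change.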
Now you could run k-means directly on this dataset, but it is better to:
Process the data so that important words such as 'Apple' are given more weight than common words such as 'on' or 'in'. One such technique is TF-IDF. Then run standard k-means on this with Euclidean distance.
Even better, use an even higher-level tool such as Latent Semantic Analysis or Latent Dirichlet Allocation.
If you want to use your existing approach, Simon G.'s answer points you in the right direction, and similarity-to-distance conversion is answered in this question.
First, change your similarities into dissimilarities so that they can be thought of as distances (see the sketch after this list).
Second, use a multidimensional scaling library to change the distances into points in space.
Third, use regular k-means on the points in space.
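A tiny sketch of the first step, using 1 - s as the conversion (which works if your Monge-Elkan scores fall in [0, 1], as they do with the usual secondary similarity measures); the MDS and k-means steps would come from a library:

    public final class SimilarityToDistance {

        /** Convert a symmetric similarity matrix (values in [0, 1]) into a dissimilarity matrix. */
        public static double[][] toDistances(double[][] similarity) {
            int n = similarity.length;
            double[][] distance = new double[n][n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    distance[i][j] = 1.0 - similarity[i][j]; // identical strings -> distance 0
                }
            }
            return distance;
        }
    }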
I'm looking for a lightweight Java library that supports Nearest Neighbor Searches by Locality Sensitive Hashing for nearly equally distributed data in a high-dimensional (in my case 32) dataset with some hundreds of thousands of data points.
It's good enough to get all the entries in a bucket for a query; which ones I really need could then be processed separately, taking into account some filter parameters of my problem.
I already found likelike but hope that there is something a bit smaller and without need of any other tools (like Apache Hadoop in the case of likelike).
Maybe this one:
"TarsosLSH is a Java library implementing Locality-sensitive Hashing (LSH), a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. It supports several Locality Sensitive Hashing (LSH) families: the Euclidean hash family (L2), city block hash family (L1) and cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done, and compact enough to serve as a demonstration on how LSH works."
Code can be found here
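If it helps to see the core idea without any dependency, here is a bare-bones plain-Java sketch of the cosine (random hyperplane) LSH family; a real library such as TarsosLSH adds multiple hash tables and tuning on top of this:

    import java.util.BitSet;
    import java.util.Random;

    public final class RandomHyperplaneLsh {

        private final double[][] hyperplanes; // one random hyperplane per hash bit

        public RandomHyperplaneLsh(int numBits, int dimensions, long seed) {
            Random rnd = new Random(seed);
            hyperplanes = new double[numBits][dimensions];
            for (int i = 0; i < numBits; i++) {
                for (int j = 0; j < dimensions; j++) {
                    hyperplanes[i][j] = rnd.nextGaussian();
                }
            }
        }

        /** Hash a vector to a bit signature; vectors with small cosine distance tend to collide. */
        public BitSet signature(double[] vector) {
            BitSet bits = new BitSet(hyperplanes.length);
            for (int i = 0; i < hyperplanes.length; i++) {
                double dot = 0.0;
                for (int j = 0; j < vector.length; j++) {
                    dot += hyperplanes[i][j] * vector[j];
                }
                bits.set(i, dot >= 0.0); // which side of the hyperplane the vector falls on
            }
            return bits;
        }
    }

Vectors that share a signature go into the same bucket (e.g. a HashMap keyed by the BitSet), which matches the "all entries in a bucket for a query" use case above.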
Apache Spark has an LSH implementation: https://spark.apache.org/docs/2.1.0/ml-features.html#locality-sensitive-hashing (API).
After having played with both the tdebatty and TarsosLSH implementations, I'll likely use Spark, as it supports sparse vectors as input. The tdebatty library requires a non-sparse array of booleans or ints, and the TarsosLSH Vector implementation is a non-sparse array of doubles. This severely limits the number of dimensions one can reasonably support.
This page provides links to more projects, as well as related papers and information: https://janzhou.org/lsh/.
There is this one:
http://code.google.com/p/lsh-clustering/
I haven't had time to test it but at least it compiles.
Here's another one:
https://github.com/allenlsy/knn
It uses LSH for KNN. I'm currently investigating its usability =)
The ELKI data mining framework comes with an LSH index. It can be used with most algorithms included (anything that uses range or nn searches) and sometimes works very well.
In other cases, LSH doesn't seem to be a good approach. It can be quite tricky to get the LSH parameters right: if you choose some parameters too high, runtime grows a lot (all the way to a linear scan); if you choose them too low, the index becomes too approximate and loses too many neighbors.
It's probably the biggest challenge with LSH: finding good parameters that yield the desired speedup while still getting good enough accuracy out of the index...