Do you know of any existing implementation in any language (preferably python) of any entity set expansion algorithms, such that the one from Google sets ? ( http://labs.google.com/sets )
I couldn't find any library implementing such algorithms and I'd like to play with some of those to see how they would perform on some specific task I would like to implement.
Any help is welcome !
Thanks a lot for your help,
Regards,
Nicolas.
I'm not aware of any ready to use open source libraries that implement the sort of clustering on demand of named entities provided by Google Sets. However, there are a few academic papers that describe in detail how to build similar systems, e.g.:
Language-Independent Set Expansion of Named Entities using the Web Wang and Cohen, in EMNLP 2009
Online Demo
Bayesian Sets Ghahramani and Heller, in NIPS, 2005
Below is a brief summary of Wang and Cohen's method. If you do end up implementing something like this yourself, it might be good to start with their method. I suspect most people will find it more intuitive than Ghahramani and Heller's formulation.
Wang and Cohen 2009
Wang and Cohen start by describing a method for automatically constructing extraction patterns that allow them to find lists of named entities in any sort of structured document. The method looks at the prefixes and suffixes bracketing known occurrences of named entities. These prefix and suffixes are then used to identify other named entities within the same document.
To complete a clusters of entities, they build a graph consisting of the interconnections between named entities, the extraction patterns associated with them, and the documents. Using this graph and starting at the nodes for the cluster's seed entities (i.e., the initial set of entities in the set to be completed), they perform numerous random walks on the graph up to 10 steps in length. They count how many times they reach the nodes corresponding to non-seed entities. Non-seed entities with high counts can then be used to complete the cluster.
Related
I'm working on the Wikipedia Category Graph (WCG). In the WCG, each article is associated to multiple categories.
For example, the article "Lists_of_Israeli_footballers" is linked to multiple categories, such as :
Lists of association football players by nationality - Israeli footballers - Association football in Israel lists
Now, if you climb back the category tree, you are likely to find a lot of paths climbing up to the "Football" category, but there is also at least one path leading up to "Science" for example.
This is problematic because my final goal is to be able to determinate whether or not an article belongs to a given Category using the list of categories it's linked with : right now a simple ancestor search gives false positives (for example : identifies "Israeli footballers" as part of the "Science" category - which is obviously not the expected result).
I want an algorithm able to find out what the most likely ancestor is.
I thought about two main solutions :
Count the number of distinct paths in the WCG linking article's category vertices to the candidate ancestor category (and use number of paths linking to other categories of same depth for comparison)
Use some kind of clustering algorithm and make ancestor search queries in isolated graph spaces
The issue with those options is that they seem to be very costly considering the size of the WCG (2 million vertices - even more edges). Eventually, I could work with a solution that uses a preprocessing algorithm in O(n) or more to achieve O(1) later, but I need the queries to be overall very fast.
Are there existing solutions to my problem ? Open to all suggestions.
Np, thanks for clarifying. anything like clustering is probably not a good idea, because those type of algorithms are meant to determine a category for an object that is not associated with a category yet. In your problem all objects (footballer article) is already associated to different categories.
You should probably do a complete search through all articles and save the matched categories with each article in a hash table so that you can then retrieve this category information when you need to know this for a new article.
Whether or not a category is relevant for an article seems totally arbitrary to me and seems to be something you should decide for yourself (e.g. determine a threshhold of 5 links to a category before it is determined part of the category).
If you're getting these articles from wikipedia you're probably going to have a pretty long run working through the entire tree, but in my opinion it seems like it's your only choice.
Search with DFS, and each time you find an arcticle-category match save the article in a hashtable (you need to be able to reduce an article to a unique identifier).
This is probably my most vague answer I've ever posted here, and your question might be too broad... if you're not helped with this please let me know so I can consider removing it in order to avoid confusion with future readers.
Write a program with the following objective -
be able to identify whether a word/phrase represents a thing/product. For example -
1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <-Be able to identify glove as a thing/product.
2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- be able to identify regulator as a thing.
Doing this tells me that the text is talking about a thing/product. as a contrast, the following text talks about a process instead of a thing/product -> "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."
I have millions of such texts; hence, manually doing it is not feasible. So far, with the help of using NLTK + Python, I have been able to identify some specific cases which use very similar keywords. But I have not been able to do the same with the kinds mentioned in the examples above. Any help will be appreciated!
What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:
create your own labelling algorithm, create training data, test, eval and finally tag your data
use an existing knowledge base (lexicon) to extract semantic labels for each target word
The first option is a complex research project in itself. Do it if you have the time and resources.
The second option will only give you the labels that are available in the knowledge base, and these might not match your wishes. I would give it a try with python, NLTK and Wordnet (interface already available), you might be able to use synset hypernyms for your problem.
This task is called named entity reconition problem.
EDIT: There is no clean definition of NER in NLP community, so one can say this is not NER task, but instance of more general sequence labeling problem. Anyway, there is still no tool that can do this out of the box.
Out of the box, Standford NLP can only recognize following types:
Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical
(MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION,
SET) entities
so it is not suitable for solving this task. There are some commercial solutions that possible can do the job, they can be readily found by googling "product name named entity recognition", some of them offer free trial plans. I don't know any free ready to deploy solution.
Of course, you can create you own model by hand-annotating about 1000 or so product name containing sentences and training some classifier like Conditional Random Field classifier with some basic features (here is documentation page that explains how to that with stanford NLP). This solution should work reasonable well, while it won't be perfect of course (no system will be perfect but some solutions are better then others).
EDIT: This is complex task per se, but not that complex unless you want state-of-the art results. You can create reasonable good model in just 2-3 days. Here is (example) step-by-step instruction how to do this using open source tool:
Download CRF++ and look at provided examples, they are in a simple text format
Annotate you data in a similar manner
a OTHER
glove PRODUCT
comprising OTHER
...
and so on.
Spilt you annotated data into two files train (80%) and dev(20%)
use following baseline template features (paste in template file)
U02:%x[0,0]
U01:%x[-1,0]
U01:%x[-2,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
4.Run
crf_learn template train.txt model
crf_test -m model dev.txt > result.txt
Look at result.txt. one column will contain your hand-labeled data and other - machine predicted labels. You can then compare these, compute accuracy etc. After that you can feed new unlabeled data into crf_test and get your labels.
As I said, this won't be perfect, but I will be very surprised if that won't be reasonable good (I actually solved very similar task not long ago) and certanly better just using few keywords/templates
ENDNOTE: this ignores many things and some best-practices in solving such tasks, won't be good for academic research, not 100% guaranteed to work, but still useful for this and many similar problems as relatively quick solution.
I have a requirement to associate math terms that come under a common topic. For e.g. angles, cos, tan, etc., should relate to trigonometry. So when a user searches for angles, triangles, etc. the search should present results related to trigonometry as well. Can anyone provide leads on how to do this in Apache Lucene?
There is a classification api which includes K-nearest neighbors and naive Bayes models.
You would first use the train() method with your training set. Once the classifier is trained use the assignClass() method to classify a given string.
For a training set you could use Wikipedia pages for your given classes.
After you give those two a try you could make use of the Classifier interface to build a competing model.
If you already know the associations, you can just add them to the index for the specific terms -- i.e. indexing 'cos' as 'cos', 'trigonometry'.
Also if you know the associations, you could index the parent term and all of the sibling terms -- i.e. indexing 'cos' as 'trigonometry', 'cos', 'sin', etc. This sounds more like what you want.
In addition to #Josh S.'s good answer, you can also take a more direct approach, of generating your own synonyms dictionary, e.g. see Match a word with similar words using Solr?
I have classified a set of documents with Lucene (fields: content, category). Each document has it's own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in java?
Classification is a broad problem in the field of Machine Learning/Statistics. After reading your question what I feel you have used kind of SQL group by clause (though in Lucene). If you want the machine to classify the documents than you need to know Machine Learning Algorithms like Neural Networks, Bayesian, SVM etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from data) on which you can train you Algorithm so that it may predict your classification label.
There are some good API's in Java (which allows you to concentrate on code without going in too much in understanding the mathematical theory behind those Algorithms, though if you know it would be very advantageous). Weka is good. I also came across a couple of books from Manning which have handled these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic material (for Java people) on classification particularly suited for people who just dont want to dive in in to the theory (though very essential :)) and just quickly want a working code.
Collective Intelligence in Action has solved the problem of classification using JDM and Weka. So have a look at these two for your tasks.
Yes you can use similarity queries such as implemented by the MoreLikeThisQuery class for this kind of things (assuming you have some large text field in the documents for your lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your lucene index into a text classifier you have two options:
For any new text to classifier, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors" and pick up the top 3 frequent categories among those similar documents (for instance).
Alternatively you can index a new set of aggregate documents, one for each category by concatenating (all or a sample of) the text of the documents of this category. Then run similarity query with you input text directly on those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
You might also find this blog post interesting.
If you want to use Solr, your need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for python is able to perform mlt queries. Here is a prototype python classifier that uses Solr for classification using an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a Perceptron based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.
I need to index a lot of text. The search results must give me the name of the files containing the query and all of the positions where the query matched in each file - so, I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?
update: Lucene has been suggested. Can you give me some info on how should I use Lucene to achieve this? (I have seen examples where the search query returned only the matching files)
For java try Lucene
I believe the lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in order to get the snippets you are looking for. The Token API may help.
It all depends on how you are going to access it. And of course, how many are going to access it. Read up on MapReduce.
If you are going to roll your own, you will need to create an index file which is sort of a map between unique words and a tuple like (file, line, offset). Of course, you can think of other in-memory data structures like a trie(prefix-tree) a Judy array and the like...
Some 3rd party solutions are listed here.
Have a look at http://www.compass-project.org/ it can be looked on as a wrapper on top of Lucene, Compass simplifies common usage patterns of Lucene such as google-style search, index updates as well as more advanced concepts such as caching and index sharding (sub indexes). Compass also uses built in optimizations for concurrent commits and merges.
The Overview can give you more info
http://www.compass-project.org/overview.html
I have integrated this into a spring project in no time. It is really easy to use and gives what your users will see as google like results.
Lucene - Java
It's open source as well so you are free to use and deploy in your application.
As far as I know, Eclipse IDE help file is powered by Lucene - It is tested by millions
Also take a look at Lemur Toolkit.
Why don't you try and construct a state machine by reading all files ? Transitions between states will be letters, and states will be either final (some files contain the considered word, in which case the list is available there) or intermediate.
As far as multiple-word lookups, you'll have to deal with them independently before intersecting the results.
I believe the Boost::Statechart library may be of some help for that matter.
I'm aware you asked for a library, just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).