How to boost some particular words in lucene index?
For e.g. I have a list of items:
"lucene in action"
"solr in action"
"solr in action book"
"building search applications"
"building search applications book"
I consider the word "book" as not important and would like to down vote it. I would not like to use filter to remove the word completely from search results as it is still might be useful. Some book might have a word book in it's name (for e.g. "book of mormon").
Currently, I use
new StandardAnalyzer(version)
and store fields as
new TextField("name", name, Field.Store.YES)
Ideally, I would like to have a dictionary with a list of terms to boost and to provide it to lucene. I know that I can boost on search if I break the request to terms (like "lucene" AND "book"^0.5), but it's not what I want.
In Apache Lucene, you can configure boosting in three different places: Document, Field and Query. Since you don't want to boost at the query level, then I think boosting during Field level might come in handy in your case. Method setBoost() of Field class.
Keep in mind that if you add the boost to your field, then you need to do that before adding the document to the index.
You need also to think about what to do when you delete a document from the index or your dictionary of words is changed (which I'm pretty sure it will).
Related
Is it possible to configure or otherwise alter how Elastic Search scores its results?
When running a search for "term" using the NativeSearchQueryBuilder documents that contain one instance of the term are all scored the same. This makes sense. However one of the documents contain just the term, where are the others contain term and other data. For example;
Doc1: Title : Space
Doc2: Title : Space Time
Doc3: Title : No Space
So when searching for Space is there anyway to make Doc1 score more highly?
-Edit
So, a little more detail following briarheart's response. I think the problem is the way we're implementing typeahead searches. If I run the Space query using our standard search the ranking is as outlined by briarheart, but our typeahead scores everything equally because we are using the wildcard request part and looking for "term*" so "Space" and "Space Lane" do both match that equally well.
So really I guess I'm asking the wrong question. Scoring is working as it should, I just need to figure out a better implementation of type ahead.
(The Suggest Request Part doesn't seem to fit the use case as this would involve picking and resubmitting the desired suggestion).
I do not know how exactly does look like your query but actually in case of using full text search of the term "Space" the document "Doc1" from your example will get the highest score because of length of its "Title" field. Shorter fields have more weight in terms of relevance.
i've store a lucene document with a single TextField contains words without stems.
I need to implement a search program that allow users to search words and exact words,
but if i've stored words without stemming, a stem search cannot be done.
There's a method to search both exact words and/or stemming words in Documents without
store Two fields ?
Thanks in advance.
Indexing two separate fields seems like the right approach to me.
Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (heavier scoring on terms that are not touched by the stemmer, particularly).
There is no need to store the text in each of these fields, but it only makes sense to index them in separate fields.
You can apply a different analyzer to different fields by using a PerFieldAnalyzerWrapper, by the way. Like:
Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
I can see a couple of possibilities to accomplish it though, if you really want to.
One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
The other (probably worse) possibility, would be to add the text to the field normally, then add it again to the same field, but this time after manually stemming. Two fields added with the same name will be effectively concatenated. You'dd want to store in a separate field, in this case. Expect wonky scoring.
Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.
I was wondering if you know any algorithm that can do an automatic assignment for the following situation: I have some papers with a some keywords defined, and some reviewers that have some specific keywords defined. How could I do an automatic mapping, so that the reviewer could review the papers from his/her area of interest?
If you are open to using external tools Lucene is a library that will allow you to search text based on (from their website)
phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g., title, author, contents)
date-range searching
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
You will basically need to design your own parser, or specialize an existing parser according to your needs. You need to scan the papers, and according to your keywords,search and match your tokens accordingly. Then the sentences with these keywords are to be separated and displayed to the reviewer.
I would suggest the Stanford NLP POS tagger. Every keyword that you would need, will fall under some part-of-speech. You can then just tag your complete document, and search for those tags and accordingly sort out the sentences.
Apache Lucene could be one solution.
It allows you to index documents either in a RAM directory, or within a real directory of your file system, and then to perform full-text searches.
Its proposes a lot of very interesting features like filters or analyzers. You can for example:
remove the stop words depending on the language of the documents (e.g. for english: a, the, of, etc.);
stem the tokens (e.g. function, functional, functionality, etc., are considered as a single instance);
perform complex queries (e.g. review*, keyw?rds, "to be or not to be", etc.);
and so on and so forth...
You should have a look! Don't hesitate to ask me some code samples if Lucene is the way you chose :)
I have classified a set of documents with Lucene (fields: content, category). Each document has it's own category, but some of them are labeled as uncategorized. Is there any way to classify these documents easily in java?
Classification is a broad problem in the field of Machine Learning/Statistics. After reading your question what I feel you have used kind of SQL group by clause (though in Lucene). If you want the machine to classify the documents than you need to know Machine Learning Algorithms like Neural Networks, Bayesian, SVM etc. There are excellent libraries available in Java for these tasks. For this to work you will need features (a set of attributes extracted from data) on which you can train you Algorithm so that it may predict your classification label.
There are some good API's in Java (which allows you to concentrate on code without going in too much in understanding the mathematical theory behind those Algorithms, though if you know it would be very advantageous). Weka is good. I also came across a couple of books from Manning which have handled these tasks well. Here you go:
Chapter 10 (Classification) of Collective Intelligence in Action: http://www.manning.com/alag/
Chapter 5 (Classification) of Algorithms of Intelligent Web: http://www.manning.com/marmanis/
These are absolutely fantastic material (for Java people) on classification particularly suited for people who just dont want to dive in in to the theory (though very essential :)) and just quickly want a working code.
Collective Intelligence in Action has solved the problem of classification using JDM and Weka. So have a look at these two for your tasks.
Yes you can use similarity queries such as implemented by the MoreLikeThisQuery class for this kind of things (assuming you have some large text field in the documents for your lucene index). Have a look at the javadoc of the underlying MoreLikeThis class for details on how it works.
To turn your lucene index into a text classifier you have two options:
For any new text to classifier, query for the top 10 or 50 most similar documents that have at least one category, sum the category occurrences among those "neighbors" and pick up the top 3 frequent categories among those similar documents (for instance).
Alternatively you can index a new set of aggregate documents, one for each category by concatenating (all or a sample of) the text of the documents of this category. Then run similarity query with you input text directly on those "fake" documents.
The first strategy is known in machine learning as k-Nearest Neighbors classification. The second is a hack :)
If you have many categories (say more than 1000) the second option might be better (faster to classify). I have not run any clean performance evaluation though.
You might also find this blog post interesting.
If you want to use Solr, your need to enable the MoreLikeThisHandler and set termVectors=true on the content field.
The sunburnt Solr client for python is able to perform mlt queries. Here is a prototype python classifier that uses Solr for classification using an index of Wikipedia categories:
https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py
As of Lucene 5.2.1, you can use indexed documents to classify new documents. Out of the box, Lucene offers a naive Bayes classifier, a k-Nearest Neighbor classifier (based on the MoreLikeThis class) and a Perceptron based classifier.
The drawback is that all of these classes are marked with experimental warnings and documented with links to Wikipedia.
I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.
Various search options are needed, some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the counts of hits for manually entered keywords and TermEnums in automated cases?
The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.
Now, I could do much / all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory, and pass the words and hit counts out so other programs can write them to database?
Or is it better to create a Lucene document per feed and add new feed data to it at polling time? So that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries which I'm not sure of yet.
Hope this makes sense.
Mr Morgan.
From the description you have given in the question, I think Lucene alone will be sufficient. (No need of MySQL or Solr). Lucene API is also easy to use and you won't need to change your frontend code.
From every RSS feed, you can create a Document having three fields; namely title, description and date. The date must preferably be a NumericField. You can then append every document to the lucene index as the feeds arrive.
How frequently do you want the system to automatically generate the popular terms? For eg. Do you want to show the users, "most popular terms last week", etc.? If so, then you can use the NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can then find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents (say by using the StopAnalyzer) or else the most popular terms will be the stopwords)
I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.
Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you're are describing.
If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/
Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.
You might also consider the full text indexing capabilities of MySQL, far easier the Lucene.
Regards