I want to use a single field to index the document's title and body, in an effort to improve performance.
The idea was to do something like this:
Field title = new Field("text", "alpha bravo charlie", Field.Store.NO, Field.Index.ANALYZED);
title.setBoost(3);
Field body = new Field("text", "delta echo foxtrot", Field.Store.NO, Field.Index.ANALYZED);
Document doc = new Document();
doc.add(title);
doc.add(body);
And then I could just do a single TermQuery instead of a BooleanQuery for two separate fields.
However, it turns out that a field's boost is the product of the boosts of all fields with the same name in the document. In my case, that means both fields end up with a boost of 3.
Is there a way I can get what I want without resorting to using two different fields? One way would be to add the title field several times to the document, which increases the term frequency. This works, but seems incredibly brain-dead.
I also know about payloads, but that seems like overkill for what I'm after.
Any ideas?
If you want to take a page out of Google's book (at least their old book), then you may want to create separate indexes: one for document bodies, another for titles. I'm assuming there is a stored field that points to a true UID for each actual document.
The alternative answer is to write a custom implementation of [Similarity][1] to get the behavior you want. Unfortunately, I find that Lucene often needs this kind of customization when unique problems arise.
[1]: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String, int)
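For what it's worth, a minimal sketch of such a customization against the Lucene 3.x API might look like this (the class name and the norm formula are illustrative assumptions, not a recommendation):

import org.apache.lucene.search.DefaultSimilarity;

// Favors short fields (such as titles) more aggressively than the
// default 1/sqrt(numTerms) length norm.
public class ShortFieldSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1f / numTerms : 1f;
    }
}

Since lengthNorm is baked into the norms at index time, the same Similarity has to be set on both the IndexWriter (writer.setSimilarity(...)) and the IndexSearcher before it takes full effect.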
You can index title and body separately, with the title field boosted by the desired value. Then you can use MultiFieldQueryParser to search multiple fields.
While searching multiple fields technically takes longer, Lucene tends to be extremely fast even with this overhead (on the order of a few tens or hundreds of milliseconds).
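For example, here is a sketch against the Lucene 3.0 API, using per-field query-time boosts instead of index-time field boosts (field names taken from the question, boost values illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// "title" is weighted three times as heavily as "body" at query time.
Map<String, Float> boosts = new HashMap<String, Float>();
boosts.put("title", 3f);
boosts.put("body", 1f);
MultiFieldQueryParser parser = new MultiFieldQueryParser(
        Version.LUCENE_30,
        new String[] { "title", "body" },
        new StandardAnalyzer(Version.LUCENE_30),
        boosts);
Query query = parser.parse("alpha bravo charlie"); // throws ParseException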
There is a multi-valued field in my schema named XXX, and there may be more than 100,000 documents in my Solr index. I want to find out how many distinct values exist in XXX.
For now, I use facet.field=XXX&facet.limit=-1 and take the size of the facet results. This takes a lot of time and sometimes hits a read timeout.
What I want from the facet results is only the size; I don't care about the contents.
By the way, I use Solr 5.0. Is there a better solution for this requirement?
The index does maintain a list of unique terms, since that is how an inverted index works, and that list is very fast to compute and return, unlike facets. If your values are single terms, this could be a way of getting what you want. You can retrieve unique terms as long as the TermsComponent is enabled in your solrconfig.xml. For example:
http://localhost:8983/solr/corename/terms?q=*%3A*&wt=json&indent=true&terms=true&terms.fl=XXX
This would return a list of all unique terms and their counts:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "terms": {
    "XXX": [
      "John Backus", 3,
      "Ada Lovelace", 3,
      "Charles Babbage", 2,
      "John Mauchly", 1,
      "Alan Turing", 1
    ]
  }
}
The length of this list is the number of unique terms; in the example, that would be 5. Unfortunately, the API doesn't provide a way to ask for just the count without returning the list of terms, so while it has a speed advantage in generating the list, the time required to return the full list gives it a drawback similar to the faceting approach. Also, the returned list may become quite long.
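One detail worth knowing: terms.limit defaults to 10, so to actually retrieve every term you need to pass terms.limit=-1 explicitly, e.g.:

http://localhost:8983/solr/corename/terms?wt=json&terms.fl=XXX&terms.limit=-1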
Check out https://wiki.apache.org/solr/TermsComponent for the API details.
I have several Lucene indexes (I call them shards), each indexing a different document set. They are independent, meaning I can search each of them without reading the others. When I get a query request, I want to run it over every index and combine the results to form the final top documents.
I know that when scoring documents, Lucene needs to know the idf of every term, and different indexes will give different idf values for the same term (because they hold different document sets). As I understand it, this means I cannot directly compare document scores from different indexes. How, then, should I generate the final result?
An obvious solution would be to first merge the indexes and then search over the one big index. However, this is far too time-consuming for me and thus unacceptable. Does anyone have a better solution?
P.S.: I don't want to use any packages or software (like Katta) other than Lucene and Hadoop.
I think MultiReader is what you are looking for. If you have multiple IndexReaders, say reader1 and reader2:
// Term statistics (including idf) are computed across all sub-readers,
// so scores from the combined searcher are directly comparable.
MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher searcher = new IndexSearcher(multiReader);
I need to make my Solr-based search return results if all of the search keywords appear anywhere in any of the search fields.
The current situation:
an example search query: keywords:"berlin house john" name:"berlin house john" author:"berlin house john"
Let's suppose that there is only one result, where keywords="house", name="berlin", and author="john" and there is no other possible permutation of these three words.
If the defaultOperator is OR, Solr returns a simple OR-ing of every keyword in every field, which yields an enormous list. The best-matching result is, of course, in first position, but the following results have very little relevance (perhaps only one field matching), and they simply confuse the user.
On the other hand, if I switch the defaultOperator to AND, I get absolutely no results. I guess it is trying to find a perfect match for all three words in all three fields, which, of course, does not exist.
The search terms come to the application from a search input in which the user writes free text; there are no specific language conventions (hashtags or anything similar).
I know that what I am asking about is possible because I have done it before with pure Lucene, and it worked. What am I doing wrong?
If you just need to make sure all words appear somewhere across the relevant fields, I would suggest copying those fields into one field at index time and querying it instead. To do so, introduce a new field and then use copyField for all source fields you want to copy over. To copy all fields, use:
<copyField source="*" dest="text"/>
See http://wiki.apache.org/solr/SchemaXml#Copy_Fields for details.
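Alternatively, to copy only the three fields from the question, the relevant schema.xml entries might look like this sketch (the text_general field type is an assumption; use whatever analyzed text type your schema defines):

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="keywords" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="author" dest="text"/>

Querying the text field with the AND operator then matches a document as soon as each word appears in at least one of the source fields.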
A similar approach is to use boolean algebra at query time. This is a bit different from the above solution.
Your query should look like
(keywords:"berlin" OR keywords:"house" OR keywords:"john") AND
(name:"berlin" OR name:"house" OR name:"john") AND
(author:"berlin" OR author:"house" OR author:"john")
which basically states: one or more terms must match in keywords, and one or more terms must match in name, and one or more terms must match in author.
As of Solr 4, defaultOperator is deprecated, so please don't use it.
Also, in my experience, defaultOperator behaves the same as an operator specified in the query; I can't say why, it is just what I have observed.
Please try the query with the {!q.op=AND} local param.
I am assuming you use the default query parser; correct me if I am wrong.
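For example, combined with the copied-together text field from the answer above as the default field (the field name is an assumption), the request parameters could look like:

q={!q.op=AND}berlin house john&df=text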
I am using the Weka SMO classifier to classify documents. There are many parameters available for SMO, such as the kernel, tolerance, etc. I have tested different parameters, but I do not get good results on a large data set.
With more than 90 categories, only 20% of the documents get correctly classified.
Can anyone tell me the best set of parameters to get the highest performance out of SMO?
The principal issue here is not classification itself, but rather selecting suitable features. Using raw HTML introduces a great deal of noise, which in turn makes classification results very poor. Thus, to get good results, do the following:
Extract relevant text. Don't just remove HTML tags; get exactly the text describing the item.
Create a dictionary of keywords, e.g. cappuccino, latte, white rice, etc.
Use stemming or lemmatization to get a word's base form and avoid counting, for example, "cotton" and "cottons" as 2 different words.
Make feature vectors from the text. Attributes (feature names) should be all words from your dictionary. Values may be: binary (1 if the word occurs in the text, 0 otherwise), integer (the number of occurrences of the word in the text), tf-idf (use this one if your texts have very different lengths), and others.
And only after all these steps can you use the classifier.
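To make these steps concrete, here is a rough sketch in Java using Weka's StringToWordVector, which covers the tokenization and tf-idf weighting steps (the file name, vocabulary cap, and other parameter values are illustrative assumptions, not tuning advice):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextClassificationSketch {
    public static void main(String[] args) throws Exception {
        // ARFF file with one string attribute (the extracted text)
        // and a nominal class attribute as the last attribute.
        Instances raw = DataSource.read("documents.arff"); // hypothetical file
        raw.setClassIndex(raw.numAttributes() - 1);

        // Turn raw text into tf-idf weighted word-count vectors.
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setOutputWordCounts(true);
        s2wv.setTFTransform(true);
        s2wv.setIDFTransform(true);
        s2wv.setLowerCaseTokens(true);
        s2wv.setWordsToKeep(2000); // rough vocabulary cap
        s2wv.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, s2wv);

        // Cross-validate SMO on the vectorized data.
        SMO smo = new SMO();
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(smo, vectors, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}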
Most probably the classifier type won't play a big role here: dictionary-based features normally lead to quite accurate results regardless of the classification technique in use. You can use SVM (SMO), Naive Bayes, an ANN, or even kNN. More sophisticated methods include creating a category hierarchy, where, for example, the category "coffee" is included in the category "drinks", which in turn is part of the category "food".
I have a List<string> in C#. It contains paragraphs that are read from an MS Word file. For example:
list 0 -> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Finally the image displayed in the header will be added to finalize the report.
list 1 -> The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Various other elements of WordprocessingML will also be handled. By moving the formatting information into styles a higher degree of re-use is made possible. The document will be marked using custom XML tags and the insertion of other advanced elements such as a table of contents is discussed. But before all the advanced features can be added, the base of the document needs to be built.
Something like that.
Now my search string is:
The picture above shows the main report which will be used for many of the markup samples in this chapter. There are several interesting elements in this sample document. First there are the basic text elements, the primary building blocks for your document. Next up is the table at the bottom of the report which will be discussed in full, including the handy styling effects such as row-banding. Before going over all the elements which make up the sample documents a basic document structure needs to be laid out. When you take a WordprocessingML document and use the Windows Explorer shell to rename the docx extension to zip you will find many different elements, especially in larger documents.
I want to compare my search string with those list elements.
My criterion is: if a list element contains an 85% match, or an exact match, of the search string, then I want to retrieve that list element.
In our case,
list 0 -> satisfies my search string better.
list 1 -> it also matches some text, but I think it falls short of my criterion...
How do I do this kind of criteria-based search on strings?
I am also somewhat confused about my own problem.
Your ideas and thoughts are welcome.
The keywords are "distance" or "string distance", and also "paragraph similarity".
You are seeking to implement a function that expresses as a scalar, say a percentage as suggested in the question, how similar one string is to another.
Plain string distance functions such as Hamming or Levenshtein may not be appropriate, for they work at the character level rather than at the word level, but generally these algorithms convey the idea of what is needed.
Working at the word level, you'll probably also want to take into account some common NLP features, for example ignoring (or giving less weight to) very common words (such as 'the', 'in', 'of', etc.) and maybe allowing for some form of stemming. The order of the words, or at least their proximity, may also be of import.
One key factor to remember is that even with relatively short strings, many distance functions can be quite expensive, computationally speaking. Before selecting one particular algorithm, you'll need to get an idea of the general parameters of the problem:
how many strings would have to be compared? (on average, maximum)
how many words/tokens do the strings contain? (on average, maximum)
is it possible to introduce a simple (quick) filter to reduce the number of strings to be compared?
how fancy do we need to get with linguistic features?
is it possible to pre-process the strings?
are all the records in a single language?
Comparing Methods for Single Paragraph Similarity Analysis, a scholarly paper, provides a survey of relevant techniques and considerations.
In a nutshell, the amount of design time and run time one can apply to this relatively open problem varies greatly, and is typically a compromise between the level of precision desired and the run-time resources and overall complexity of the solution that may be acceptable.
In its simplest form, when the order of the words matters little, computing the sum of factors based on the TF-IDF values of the matching words may be a very acceptable solution.
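A minimal sketch of that simplest form (shown in Java for illustration; every name here is invented for the example) might treat each paragraph as a set of lower-cased words and sum the idf weights of the words two paragraphs share:

import java.util.List;
import java.util.Set;

public class TfIdfOverlap {
    // idf of a word, given the corpus as a list of word sets
    static double idf(String word, List<Set<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(word)).count();
        return Math.log((double) corpus.size() / (1 + df));
    }

    // similarity = sum of idf weights of the shared words; rare shared
    // words count for more than common ones like "the" or "of"
    static double overlapScore(Set<String> a, Set<String> b,
                               List<Set<String>> corpus) {
        double score = 0.0;
        for (String w : a) {
            if (b.contains(w)) {
                score += idf(w, corpus);
            }
        }
        return score;
    }
}

Normalizing this score by the score of the search string against itself would yield the percentage-style figure the question asks for.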
Fancier solutions may introduce a pipeline of processes borrowed from NLP, for example part-of-speech tagging (say, to avoid false positives such as "saw" the noun (for cutting wood) versus "saw" the past tense of the verb "to see", or, more likely, to filter out some words outright based on their grammatical function), stemming, and possibly semantic substitutions, concept extraction, or latent semantic analysis.
You may want to look into Lucene for Java or Lucene.NET for C#. I don't think it'll meet the percentage requirement you want out of the box, but it's a great tool for text matching.
You could perhaps run a separate query for each word, and then work out the percentage of matches yourself.
Here's an idea (not a solution by any means, but something to get started with):
private IEnumerable<string> SearchList = GetAllItems(); // load your list

void Search(string searchPara)
{
    char[] delimiters = new char[] { ' ', '.', ',' };
    var wordsInSearchPara = searchPara.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
        .Select(a => a.ToLower()).Distinct().ToList();
    foreach (var item in SearchList)
    {
        var wordsInItem = item.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
            .Select(a => a.ToLower()).Distinct();
        // now that you know the common words, you can get the percentage
        int common = wordsInItem.Intersect(wordsInSearchPara).Count();
        double percentMatch = 100.0 * common / wordsInSearchPara.Count;
        // e.g. keep this item when percentMatch >= 85
    }
}
}