I'm new to Apache Solr and trying to make a query using search terms against a field called "normalizedContents" and of type "text".
All of the search terms must exist in the field. Problem is, I'm getting inconsistent results.
For example, the solr index has only one document with normalizedContents field with value = "EDOUARD SERGE WILFRID EDOS0004 UNE MENTION COMPLEMENTAIRE"
I tried these queries in solr's web interface:
normalizedContents:(edouard AND une) returns the result
normalizedContents:(edouar* AND une) returns the result
normalizedContents:(EDOUAR* AND une) doesn't return anything
normalizedContents:(edouar AND une) doesn't return anything
normalizedContents:(edouar* AND un) returns the result (although there's no "un" word)
normalizedContents:(edouar* AND uned) returns the result (although there's no "uned" word)
Here's the declaration of normalizedContents in schema.xml:
<field name="normalizedContents" type="text" indexed="true" stored="true" multiValued="false"/>
So, wildcards and AND operator do not follow the expected behavior. What am I doing wrong ?
Thanks.
By default the field type text does stemming on the content (solr.SnowballPorterFilterFactory). Thus 'un' and 'uned' match une. Then you might not have the solr.LowerCaseFilterFactory filter on both, query and index analyzer, therefore EDUAR* does not match. And the 4th doesnt match as edouard is not stemmed to edouar. If you want exact matches, you should copy the data in another field that has a type with a more limited set of filters. E.g. only a solr.WhitespaceTokenizerFactory
Posting the <fieldType name="text"> section from your schema might be helpful to understand everything.
Related
I have a query that when given a word that starts with a one-letter word followed by space character and then another word (ex: "T Distribution"), does not return results. While given "Distribution" alone returns results including the results for "T Distribution". It is the same behavior with all search terms beginning with a one-letter word followed by space character and then another word.
The problem appears when the search term is of this pattern:
"[one-letter][space][letter/word]". example: "o ring".
What would be the problem that the LIKE operator not working correctly in this case?
Here is my query:
#Cacheable(value = "filteredConcept")
#Query("SELECT NEW sina.backend.data.model.ConceptSummaryVer04(s.id, s.arabicGloss, s.englishGloss, s.example, s.dataSourceId,
s.synsetFrequnecy, s.arabicWordsCache, s.englishWordsCache, s.superId, s.categoryId, s.dataSourceCacheAr, s.dataSourceCacheEn,
s.superTypeCasheAr, s.superTypeCasheEn, s.area, s.era, s.rank, s.undiacritizedArabicWordsCache, s.normalizedEnglishWordsCache,
s.isTranslation, s.isGloss, s.arabicSynonymsCount, s.englishSynonymsCount) FROM Concept s
where s.undiacritizedArabicWordsCache LIKE %:searchTerm% AND data_source_id != 200 AND data_source_id != 31")
List<ConceptSummaryVer04> findByArabicWordsCacheAndNotConcept(#Param("searchTerm") String searchTerm, Sort sort);
the result of the query on the database itself:
link to screenshot
results on the database are returned no matter the letters case:
link to screenshot
I solved this problem.
It was due to the default configuration of the Full-text index on mysql database which is by default set to 2 (ft_min_word_len = 2).
I changed that and rebuilt the index. Then, one-letter words were returned by the query.
12.9.6 Fine-Tuning MySQL Full-Text Search
Use some quotes:
LIKE '%:searchTerm%';
Set searchTerm="%your_word%" and use it on query like this :
... s.undiacritizedArabicWordsCache LIKE :searchTerm ...
I am searching for String Kansas City in description field.
"q":"description: *Kansas City*", but I am getting the results for both Kansas and City. Also it is getting the results from content field as well. I am not sure why it is fetching results from content field. Please suggest me if I am doing any error in my query.
Your quoting is wrong
description:"kansas city"
for example
What are the stars for?
After tokenizing and parsing query it looks like kansas city is tokenized into "kansas" and "city" and filters are applied as per fieldtype definition.
then they are searched in fieldname specified.
description:*Kansas
after tokenizing/word splitting, "city" becomes
different word for which you didn't specify fieldname. so by default it is searched in defaultfield(which is may be content in your case)
defaultsearchfield:city*
in your case after parsing description:kansasandcontent:city you can see the same debugQuery=on with URL in your browser.
I'm using solr to search documents and matching on one field and then want to boost based on keywords appearing in other fields.
Ex.
<str name="qf">
name^1
</str>
<str name="pf">
keywords1^2
keywords2^1.2
description^0.2
</str>
So if I search for
q=foo+bar
and I have a result
name = "This is a foo bar"
keywords1 = "bar"
keywords2 = "foo cats dogs chicken"
description = "There is a foo in here with a bar"
The query gets a boost from description but not from keywords1 and keywords2. I know this is because pf searches keywords1 for "foo bar" (with phrase slop), not "foo" "bar". I would like to boost based on individual words. Is this possible without a plugin?
Things I have thought of:
I am aware of options like pf2 and pf3 but basically what I'm looking for is a pf1. I want to be able to boost on single words.
The reason I do not just add keywords1, keywords2 to qf is I do not want them to be matched. As keywords two has some terms that may not have anything to do with the document.
I could at query time break up the query and use bq
bq = keywords1:foo OR keywords1:bar etc...
but I would like to assign different weights to different fields, and that is cumbersome to build the query.
In summary I would like a parameter a la pf1.
I looked through the SOLR source code and it was fairly straight forward to add a pf1 parameter to the edismax query parser - pf2 and pf3 use more general functions that can be called for pf1. However I really didn't want to have to recompile solr from source. I've managed a work around using nested queries.
I now have
<str name="qf">
name^1
keywords1^2
keywords2^1.2
description^0.2
</str>
And my query now looks like
q=(foo bar) AND _query_:"{!edismax qf='name'}foo bar"
So I get all the boosting of the first query but I restrict on name.
Maybe I'm really missing something.
I have indexed a bunch of key/value pairs in Lucene (v4.1 if it matters). Say I have
key1=value1 and key2=value2, e.g. as read from a properties file.
They get indexed both as specific fields and into a catchall "ALL" field, e.g.
new Field("key1", "value1", aFieldTypeMimickingKeywords);
new Field("key2", "value2", aFieldTypeMimickingKeywords);
new Field("ALL", "key1=value1", aFieldTypeMimickingKeywords);
new Field("ALL", "key2=value2", aFieldTypeMimickingKeywords);
// then get added to the Document of course...
I can then do a wildcard search, using
new WildcardQuery(new Term("ALL", "*alue1"));
and it will find the hit.
But, it would be nice to get more info, like "what was complete value (e.g. "key1=value1") that goes with that hit?".
The best I can figure out it to get the Document, then get the list of IndexableFields, then loop over all of them and see if the field.stringValue().contains("alue1"). (I can look at the data structures in the debugger and all the info is there)
This seems completely insane cause isn't that what Lucene just did? Shouldn't the Hit information return some of the Fields?
Is Lucene missing what seems like "obvious" functionality? Google and starting at the APIs hasn't revealed anything straightforward, but I feel like I must be searching on the wrong stuff.
You might want to try with IndexSearcher.explain() method. Once you get the ID of the matching document, prepare a query for each field (using the same search keywords) and invoke Explanation.isMatch() for each query: the ones that yield true will give you the matched field. Example:
for (String field: fields){
Query query = new WildcardQuery(new Term(field, "*alue1"));
Explanation ex = searcher.explain(query, docID);
if (ex.isMatch()){
//Your query matched field
}
}
I have an Apache Solr 3.5 setup that has a SchemaXml like this:
<field name="appid" type="string" indexed="true" stored="true" required="true"/>
<field name="docid" type="string" indexed="true" stored="true" required="true"/>
What I would need is a field that concatenates them together and uses that as <uniqueKey>. There seems nothing built-in, short of creating a multi-valued id field and using <copyField>, but it seems uniqueKey requires a single-valued field.
The only reason I need this is to allow clients to blindly fire <add> calls and have Solr figure out if it's an addition or update. Therefore, I don't care too much how the ID looks like.
I assume I would have to write my own Analyzer or Tokenizer? I'm just starting out learning Solr, so I'm not 100% sure what I'd actually need and would appreciate any hints towards what I need to implement.
I would personally give that burden to the users, since it's pretty easy for them adding a field to each document.
Otherwise, you would have to write a few lines of code I guess. You could write your own UpdateRequestProcessorFactory which adds the new field automatically to every input document based on the value of other existing fields. You can use a separator and keep it single value.
On your UpdateRequestProcessor you should override the processAdd method like this:
#Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
SolrInputDocument doc = cmd.getSolrInputDocument();
String appid = (String)doc.getFieldValue( "appid" );
String docid = (String)doc.getFieldValue( "docid" );
doc.addField("uniqueid", appid + "-" + docid);
// pass it up the chain
super.processAdd(cmd);
}
Then you should add your UpdateProcessor to your customized updateRequestProcessorChain as the first processor in the chain (solrconfig.xml):
<updateRequestProcessorChain name="mychain" >
<processor class="my.package.MyUpdateRequestProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
<processor class="solr.LogUpdateProcessorFactory" />
</updateRequestProcessorChain>
Hope it works, I haven't tried it. I already did something like this but not with uniqueKey or required fields, that's the only problem you could find. But I guess if you put the updateProcessor at the beginning of the chain, it should work.