Solr dismax highlighting not respecting analyzer - java

In the schema of Solr 3.6.2 there are two field declarations, text and exact:
<field name="text" type="text" indexed="true" stored="true" />
<field name="exact" type="string" indexed="true" stored="true" />
The former uses StandardTokenizer and the latter KeywordTokenizer.
Solr queries describing the problem:
?hl=true
&hl.fl=text,exact
&defType=edismax
&qf=text+exact <-------- here
&q=a-b
Highlight output for field exact:
<em>a</em>-<em>b</em>.
The problem is the summary for field exact is produced using the analyzer from text.
?hl=true
&hl.fl=text,exact
&defType=edismax
&qf=exact <-------- here
&q=a-b
Highlight output for field exact:
<em>a-b</em>.
Simply removing text from qf gives the correct analyzer. Why?

With debugQuery on, the parsed query is:
+DisjunctionMaxQuery(((exact:a-b) | ((text:a text:b)~2)))
After finding a match in exact, the Solr highlighter also seems to match a and b there, based only on their presence in the query. Setting hl.requireFieldMatch=true does seem to prevent that.
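As a rough illustration of the second request with the extra parameter added, here is a minimal Java sketch that assembles the query string. The class and method names (HighlightQuery, build) are made up for this example, and it assumes Java 10 or later for URLEncoder.encode(String, Charset). Only the parameter names and values come from the question and the answer above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
class HighlightQuery {
    /** Builds the query string for an edismax search over text and exact with highlighting. */
    static String build(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("defType", "edismax");
        params.put("qf", "text exact");
        params.put("hl", "true");
        params.put("hl.fl", "text,exact");
        // The parameter suggested in the answer: highlight a field only with
        // the query clauses that target that same field.
        params.put("hl.requireFieldMatch", "true");

        StringJoiner joined = new StringJoiner("&", "?", "");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Keys are plain ASCII here, so only the values are URL-encoded.
            joined.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("a-b"));
    }
}
```

The resulting string can be appended to the select handler URL of whichever core is being queried.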

Related

Import Data to Solr using java

I am trying to upload data to a Solr server using Java.
Is it possible to do so, or to create a collection and upload data directly from Java? Is there any way to do this?
I found two options, DIH and Tika.
Any advice would be helpful.
You can try the SolrJ API: https://wiki.apache.org/solr/Solrj. It can be used to upload documents to and search against a Solr instance.
If you are running Solr in Cloud (ZooKeeper) mode, you can create the collection with SolrJ.
But upload the configuration to be used by SolrCloud before issuing the collection creation command.
If you are using standalone mode, create the collection manually.
Sample code to upload a document to a Solr server using SolrJ:
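import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// Point the client at the core (replace CORE_NAME with your core).
SolrServer server = new HttpSolrServer("http://localhost:8983/solr/CORE_NAME/");
// Field names must match the schema.xml entries.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
doc.addField("Name", "John");
doc.addField("RollNo", "101");
server.add(doc);
// The document becomes searchable after the commit; status 0 means success.
UpdateResponse updateResponse = server.commit();
System.out.println(updateResponse.getStatus());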
Make sure you have the following entries in schema.xml, which is in the conf folder of the core.
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Name" type="text_general" indexed="true" stored="true"/>
<field name="RollNo" type="text_general" indexed="true" stored="true"/>

Retrieving attribute values depending on the value of another attribute using xpath

I have the following xml doc:
<database>
<order>
<data>
<field name="time" value="10:10:10" />
</data>
<data>
<field name="product" value="product_type_1">
<field name="attributeA" value="Foo" />
<field name="attributeB" value="Bar" />
</field>
<field name="attributeC" value="Jeam" />
<field name="attributeD" value="Beam" />
<field name="attributeE" value="Deam" />
</data>
</order>
<order>
<data>
<field name="time" value="10:10:11" />
</data>
<data>
<field name="product" value="product_type_2">
<field name="attributeF" value="Bravo" />
<field name="attributeG" value="Echo" />
</field>
<field name="attributeC" value="Jeam2" />
<field name="attributeD" value="Beam2" />
<field name="attributeJ" value="Charlie" />
<field name="attributeK" value="Tango" />
<field name="attributeL" value="Zulu" />
</data>
</order>
</database>
It is a set of "order" elements, but the "field" elements (in both number and type) depend on the value of the element whose name is "product". I am interested in extracting info depending on the value of the product. More specifically, I would like to end up with something like this table:
Time      Product         AttributeA  AttributeB  AttributeC  AttributeD
10:10:10  product_type_1  Foo         Bar         Jeam        Beam
10:10:11  product_type_2                          Jeam2       Beam2
In other words, I am trying to "cut" unnecessary info depending on the value of a child element of "order". I am trying to achieve this using XPath (in Java), but I am stuck. I cannot find a way to emulate the "if" condition described above.
I am thinking of using an XPath query to retrieve one order element at a time, then querying for the product type, and then choosing the appropriate XPath to retrieve the corresponding attributes, but that sounds really inefficient and slow.
Is it possible to do it more efficiently? Is XPath not the right answer here?
Thanks in advance.
P.S.: The alignment and organization of the data you see above doesn't really matter. As long as I retrieve the correct data, I am sure I'll be able to print it somehow.
If you want to use XPath, you will need at least XPath 3.0 or XQuery (this code is valid in both of them). Have a look at XQuery engines if you want to use this in Java, for example Saxon, BaseX, eXist DB, ...
for $order in /database/order
return string-join((
$order//field[@name='time']/@value,
$order//field[@name='product']/@value,
($order//field[@name='attributeA']/@value, '')[1],
($order//field[@name='attributeB']/@value, '')[1],
($order//field[@name='attributeC']/@value, '')[1],
($order//field[@name='attributeD']/@value, '')[1]),
'&#9;')
The pattern used for the attributes makes sure that empty values do not break the table layout (so for the second product type, the values of attributes C and D do not slide into the columns for attributes A and B). &#9; is the tab character.
If you want to use Java for further processing of the output, I'd go with this: fetch all orders (/database/order) and loop over them. Then, for each order, use DOM (or XPath again) to fetch the nodes you need. Still, it seems that the question you asked is not your actual problem, so using XQuery might lead to a cleaner solution.
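The loop-over-orders approach can be sketched with the XPath 1.0 API that ships with the JDK (javax.xml.xpath), so no extra engine is needed. This is only one possible shape for it: the class name OrderTable, the rows method, and the shortened sample document in main are illustrative assumptions, not code from the question. It relies on the fact that an XPath 1.0 string result is empty when no node matches, which plays the same role as the ('...', '')[1] pattern above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
class OrderTable {
    // Columns of the desired table, in output order.
    static final String[] FIELDS =
        {"time", "product", "attributeA", "attributeB", "attributeC", "attributeD"};

    /** Returns one tab-separated row per order; missing fields become empty cells. */
    static List<String> rows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList orders =
            (NodeList) xpath.evaluate("/database/order", doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < orders.getLength(); i++) {
            Node order = orders.item(i);
            List<String> cells = new ArrayList<>();
            for (String name : FIELDS) {
                // Relative path, evaluated against the current order only.
                // The String overload returns "" when nothing matches.
                cells.add(xpath.evaluate(".//field[@name='" + name + "']/@value", order));
            }
            rows.add(String.join("\t", cells));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample in the same shape as the document in the question.
        String xml = "<database>"
            + "<order><data><field name='time' value='10:10:10'/></data>"
            + "<data><field name='product' value='product_type_1'>"
            + "<field name='attributeA' value='Foo'/></field>"
            + "<field name='attributeC' value='Jeam'/></data></order>"
            + "</database>";
        rows(xml).forEach(System.out::println);
    }
}
```

Each order costs a handful of small XPath evaluations on an already parsed DOM, so for documents of moderate size this is unlikely to be the bottleneck the question worries about.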

Solr search is returning partial string matches

Using Solr 3.6.1, I have this field in my schema.xml:
<field name="names" type="text_general" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="names_*" type="text_general" indexed="true" stored="true"/>
The documentation in schema.xml states that "text_general" should:
tokenize with StandardTokenizer
remove stop words using the case-insensitive "stopwords.txt" (which is currently empty)
lowercase the string
apply synonyms at query time only (the synonyms file is also empty at this time)
I have two documents indexed in Solr with this data for the field:
<!-- doc 1 -->
<str name="names_data">Name ABC Dev Loc</str>
<!-- doc 2 -->
<str name="names_data">Name ABC Dev Location</str>
When I execute the following query:
id:(doc1 OR doc2) AND names:Dev+Location
Both documents are returned. I would have expected that only doc2 would have been returned based on my understanding of how Solr's StandardTokenizer works.
Why does "Dev+Location" match "Dev Loc" and "Dev Location"?
The type text_general is probably configured to use a stemmer, which is treating Loc as a variant of Location.
You could configure the type to not use a stemmer, or try searching for the whole string using names:"Dev Location".
This might be why.
The names:Dev+Location part of the query only searches names for Dev. Because the Location term has no field name qualifier, it is searched against whatever <defaultSearchField> is set to in schema.xml.
So you could either quote the value, as in names:"Dev Location", or prefix each term, as in names:Dev AND names:Location.
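The two query shapes suggested above can be generated with a small helper. The following is a minimal Java sketch: the class NamesQuery and its methods are hypothetical names, and it only escapes double quotes rather than the full set of query parser special characters.

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of SolrJ.
class NamesQuery {
    /** Phrase query: the terms must appear together, in order, in the given field. */
    static String phrase(String field, String text) {
        // Escape embedded double quotes so the phrase stays one quoted unit.
        return field + ":\"" + text.replace("\"", "\\\"") + "\"";
    }

    /** Qualifies every whitespace-separated term with the field and requires all of them. */
    static String allTerms(String field, String text) {
        StringJoiner joined = new StringJoiner(" AND ");
        for (String term : text.trim().split("\\s+")) {
            joined.add(field + ":" + term);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(phrase("names", "Dev Location"));
        System.out.println(allTerms("names", "Dev Location"));
    }
}
```

Either string still needs URL encoding before it is placed in the q parameter of a request.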

How to get last indexed record in Solr?

I want to know how to get or search for the last indexed record in Apache Solr.
When an existing record is updated, it moves to the end of all the records, so I want to get that last indexed record.
Thanks.
You could add a 'timestamp' field to your Solr schema that puts the current date/time into the record when it is added.
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Then sort in descending order by this field, and the first record will be the latest one. A query like this should do it:
http://localhost:8080/solr/core-name/select?q=*%3A*&start=0&rows=1&sort=timestamp+desc
You can sort the documents by index order using one of the following queries:
http://localhost:8983/solr/select?q=*:*&sort=_docid_+asc
or
http://localhost:8983/solr/select?q=*:*&sort=_docid_+desc

Castor XML Mapping and java.util.Map

I've been using Castor these past couple of days to try to get a little serialization going between my Java program and XML in a readable way. Though it has a few faults, Castor's automatic xml generation via reflection is actually very functional. Unfortunately, one thing that seems to be fairly well left out of the examples is dealing with generics. It seems the reflection API does a wonderful job as it is, but as it is inadvertently grabbing a lot of redundant data just because methods start with get___(), I wanted to write my own mapping file to stave this off.
Firstly, it seems altogether fair that in the attributes to a "field" element, one should define "type". However, it does not specify what should be done if this type is abstract or simply an interface. What should I put as the type then?
Secondly, most "collection" type objects specified in Castor (List, Vector, Collection, Set, etc) only require 1 generic type, so specifying "type" as what's inside and "collection="true"" are enough. However, it does not specify what I should do in the case of a collection like a Map, where 2 types are necessary. How can I specify both the key type and value type?
Any help at all would be greatly appreciated!
For the second of my questions:
When specifying something with a Map or a Table, you need to redefine org.exolab.castor.mapping.MapItem within the bind-xml element within your field element. Example taken from here
<class name="some.example.Clazz">
<field name="a-map" get-method="getAMap" set-method="setAMap">
<bind-xml ...>
<class name="org.exolab.castor.mapping.MapItem">
<field name="key" type="java.lang.String">
<bind-xml name="id"/>
</field>
<field name="value" type="com.acme.Foo"/>
</class>
</bind-xml>
</field>
</class>
Also, omit the type attribute from the parent field element.
For my first question, the trick is to NOT specify the type in the field element and allow Castor to infer it by itself. If you have definitions for the classes that could appear there, then it will automatically use those. For example:
<class name="some.example.Clazz">
<!-- can contain condition1 or condition2 elements -->
<field name="condition" collection="arraylist" required="true">
<bind-xml name="condition" node="element" />
</field>
</class>
<class name="some.example.condition1">
<field name="oneField">
<bind-xml name="fieldOne" />
</field>
</class>
<class name="some.example.condition2">
<field name="anotherField">
<bind-xml name="fieldTwo" />
</field>
</class>
When marshalling to XML, Castor would write condition1-style or condition2-style XML into the "condition" field of Clazz while still referring to the proper instantiation type.
