How to create a custom AnalyzerFactory in GraphDB full text search?

(Using GraphDB 8.1 free).
http://graphdb.ontotext.com/documentation/free/full-text-search.html says that I can enable a custom AnalyzerFactory for GraphDB full-text search, using the luc:analyzer parameter, by implementing the interface com.ontotext.trree.plugin.lucene.AnalyzerFactory. However, I can't find this interface anywhere; it is not in the jar graphdb-free-runtime-8.1.0.jar.
I checked the feature matrix at http://ontotext.com/products/graphdb/editions/#feature-comparison-table and it seems the "Connectors Lucene" feature is available in the free edition of GraphDB.
In which jar is the com.ontotext.trree.plugin.lucene.AnalyzerFactory interface located? What do I need to import into my project to implement this interface?
Are there pre-existing AnalyzerFactories included with GraphDB for using other Lucene analyzers? (I am interested in using a FrenchAnalyzer.)
Thanks!

GraphDB offers two different Lucene-based plugins.
The Lucene FTS plugin indexes RDF molecules; the correct documentation link is: http://graphdb.ontotext.com/documentation/free/full-text-search.html
The Lucene Connector performs online synchronization between the RDF and Lucene document models using sequences of configurations like ?subject propertyPath ?object to id|field value. The correct documentation link is: http://graphdb.ontotext.com/documentation/free/lucene-graphdb-connector.html
I encourage you to use the Lucene Connector unless you have a special case for RDF molecules. Here is a simple example of how to configure the connector with the French analyzer and index all values of the rdfs:label predicate for resources of type urn:MyClass. Select a repository and from the SPARQL query view execute:
PREFIX :<http://www.ontotext.com/connectors/lucene#>
PREFIX inst:<http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
  inst:labelFR :createConnector '''
{
  "fields": [
    {
      "indexed": true,
      "stored": true,
      "analyzed": true,
      "multivalued": true,
      "fieldName": "label",
      "propertyChain": [
        "http://www.w3.org/2000/01/rdf-schema#label"
      ],
      "facet": true
    }
  ],
  "types": [
    "urn:MyClass"
  ],
  "stripMarkup": false,
  "analyzer": "org.apache.lucene.analysis.fr.FrenchAnalyzer"
}
''' .
}
Then manually add some sample test data from Import > Text area:
<urn:instance:test> <http://www.w3.org/2000/01/rdf-schema#label> "C'est un exemple".
<urn:instance:test> a <urn:MyClass>.
Once you commit the transaction, the Connector will update the Lucene index. Now you can run search queries like:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:labelFR ;
          :query "label:*" ;
          :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
      :snippetText ?snippetText .
}
To create a custom analyzer, follow the instructions in the documentation and extend the org.apache.lucene.analysis.Analyzer class. Put the custom analyzer JAR in the lib/plugins/lucene-connector/ directory.
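For orientation, a custom analyzer is just an org.apache.lucene.analysis.Analyzer subclass that wires a tokenizer and token filters together. A minimal sketch (the filter chain is purely illustrative, and the LowerCaseFilter package differs between Lucene versions):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Minimal custom analyzer: standard tokenization followed by lowercasing.
// Build it into a JAR and drop it into lib/plugins/lucene-connector/.
public class MyCustomAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream filtered = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, filtered);
    }
}
You would then reference the fully qualified class name in the connector's "analyzer" option, just like org.apache.lucene.analysis.fr.FrenchAnalyzer above.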

Related

ElasticSearch IO How to remove id from JSON document before writing

I have an Apache Beam streaming job which reads data from Kafka and writes to Elasticsearch using ElasticsearchIO.
The issue I'm having is that messages in Kafka already have a key field, and using ElasticsearchIO.Write.withIdFn() I'm mapping this field to the document _id field in Elasticsearch.
Given the big volume of data, I don't want the key field to also be written to Elasticsearch as part of _source.
Is there an option/workaround that would allow doing that?
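For reference, the write step currently looks roughly like this (connection settings, index/type names and the "key" field are placeholders for my actual setup):
// Sketch of the ElasticsearchIO write transform used in the Beam job.
ElasticsearchIO.Write esWrite = ElasticsearchIO.write()
    .withConnectionConfiguration(
        ElasticsearchIO.ConnectionConfiguration.create(
            new String[] { "http://localhost:9200" }, "my-index", "my-type"))
    // Map the "key" field of each JSON document to the Elasticsearch _id.
    // The field still ends up in _source as well, which is what I want to avoid.
    .withIdFn(node -> node.get("key").asText());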
Using the Ingest API and the remove processor, you'll be able to solve this pretty easily using only your Elasticsearch cluster. You can also simulate an ingest pipeline and inspect the results.
I've prepared an example which will probably cover your case:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "remove id from incoming docs",
    "processors": [
      {
        "remove": {
          "field": "id",
          "ignore_failure": true
        }
      }
    ]
  },
  "docs": [
    { "_source": { "id": "123546", "other_field": "other value" } }
  ]
}
You see, there is one test document containing a field "id". This field is no longer present in the response/result:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "other_field" : "other value"
        },
        "_ingest" : {
          "timestamp" : "2018-12-03T16:33:33.885909Z"
        }
      }
    }
  ]
}
I've created a ticket in the Apache Beam JIRA describing this issue.
For now the original issue cannot be resolved as part of the indexing process using the Apache Beam API.
The workaround that Etienne Chauchot, one of the maintainers, proposed is to have a separate task which will clear the indexed data afterwards.
See Remove a field from an Elasticsearch document for an example.
If you would also like to see such a feature in the future, you might want to follow the linked ticket.
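As for the cleanup task itself, something along these lines should work with the Elasticsearch high-level REST client (a hedged sketch assuming the 7.x API; the index name and the "key" field are placeholders):
// Strip the already-indexed "key" field from every document with one
// update-by-query call, using a small painless script.
UpdateByQueryRequest request = new UpdateByQueryRequest("my-index");
request.setScript(new Script(ScriptType.INLINE, "painless",
        "ctx._source.remove('key')", Collections.emptyMap()));
restHighLevelClient.updateByQuery(request, RequestOptions.DEFAULT);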

Specifying keyword type on String field

I started using hibernate-search-elasticsearch (5.8.2) because it seemed easy to integrate: it keeps the Elasticsearch indices up to date without writing any code. It's a cool lib, but I'm starting to think that it implements only a very small subset of the Elasticsearch functionality. I'm executing a query with a painless script filter which needs to access a String field, whose type is 'text' in the index mapping, and this is not possible without enabling fielddata. But I'm not very keen on enabling it as it consumes a lot of heap memory. Here's what the Elasticsearch team suggests to do in my case:
Fielddata documentation
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Unfortunately I can't find a way to do it with the hibernate-search annotations. Can someone tell me if this is possible, or do I have to migrate to the vanilla Elasticsearch lib and stop using any wrappers?
With the current version of Hibernate Search, you need to create a different field for that (i.e. you can't have different flavors of the same field). Note that that's what Elasticsearch is doing under the hood anyway.
@Field(analyzer = "your-text-analyzer") // your default full text search field with the default name
@Field(name = "myPropertyAggregation", index = Index.NO, normalizer = "keyword")
@SortableField(forField = "myPropertyAggregation")
private String myProperty;
It should create an unanalyzed field with doc values. You then need to refer to the myPropertyAggregation field for your aggregations.
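For instance, sorting on that companion field with the Hibernate Search 5 query DSL (the @SortableField above is what makes it sortable) would look roughly like this sketch, where MyEntity is a placeholder entity:
FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
QueryBuilder qb = ftem.getSearchFactory().buildQueryBuilder()
        .forEntity(MyEntity.class).get();
FullTextQuery query = ftem.createFullTextQuery(qb.all().createQuery(), MyEntity.class);
// Sort on the un-analyzed companion field instead of the analyzed text field.
query.setSort(qb.sort().byField("myPropertyAggregation").createSort());
List<?> results = query.getResultList();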
Note that we will expose many more Elasticsearch features in the API in the upcoming Search 6. In Search 5, the APIs were designed with Lucene in mind and we couldn't break them.

Query Elastic document field with and without characters

I have the following documents stored in my Elasticsearch index (my_index):
{
"name": "111666"
},
{
"name": "111A666"
},
{
"name": "111B666"
}
and I want to be able to query these documents using both the exact value of the name field as well as a character-trimmed version of the value.
Examples
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "111666"
      }
    }
  }
}
should return all of the (3) documents mentioned above.
On the other hand:
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "111a666"
      }
    }
  }
}
should return just one document (the one that matches exactly with the provided value of the name field).
I didn't find a way to configure the settings of my_index in order to support such functionality (custom search/index analyzers, etc.).
I should mention here that I am using ElasticSearch's Java API (QueryBuilders) in order to implement the above-mentioned queries, so I thought of doing it the Java-way.
Logic
1) Check if the provided query string contains a letter
2) If yes (e.g. 111A666), then search for 111A666 using a standard search analyzer
3) If not (e.g. 111666), then use a custom search analyzer that trims the characters of the `name` field
Questions
1) Is it possible to implement this by somehow configuring how the data are stored/indexed at Elastic Search?
2) If not, is it possible to conditionally change the analyzer of a field at Runtime? (using Java)
You can easily use any built-in analyzer or any custom analyzer to map your document in Elasticsearch. More information on analyzers is here.
The "term" query searches for an exact match. You can find more information about exact matching here (Finding Exact Values).
But you cannot change an index once it is created. If you want to change the mapping, you have to create a new index and migrate all your data to the new index.
Your question is about using different logic for the analyzer at index and query time.
The solution for your Q1 is to generate two tokens at index time (111a666 -> [111a666, 111666]) but only one token at query time (111a666 -> 111a666 and 111666 -> 111666).
IMHO you have to create a new analyzer along the lines of https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern_replace-tokenfilter.html which supports "preserve_original" like https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-capture-tokenfilter.html does.
Or you could use two fields (one with original and one without letters) and search over both.
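With the two-field approach, the query side in the Java API could be as simple as the following sketch, where "name.trimmed" is an assumed sub-field indexed with the letters stripped:
// Bool query with two optional clauses; with no "must" clause, at least one
// "should" has to match, so both the exact and the trimmed form are covered.
QueryBuilder query = QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("name", userInput))
        .should(QueryBuilders.matchQuery("name.trimmed", userInput));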

Not able to Query alphanumeric fields from ELASTIC SEARCH using TERMS QUERY

I am trying to query alphanumeric values from the index using a terms query, but it is not giving me any output.
Query:
{
  "size": 10000,
  "query": {
    "bool": {
      "must": {
        "terms": {
          "caid": [ "A100945", "A100896" ]
        }
      }
    }
  },
  "fields": [ "acco", "bOS", "aid", "TTl", "caid" ]
}
I want to get all the entries that have caid A100945 or A100896.
The same query works fine for numeric fields.
I am not planning to use QueryString/MatchQuery as I am trying to build a general query builder that can build the query for every request. Hence I am looking to get the entries using a terms query only.
Note: I am using Java API org.elasticsearch.index.query.QueryBuilders for building the Query.
e.g.: QueryBuilders.termsQuery("caid", "A10xxx", "A101xxx")
Please help.
Regards,
Mik
If you have not customized the mappings/analysis for the caid field, then your values are indexed as e.g. a100945, a100896 (note the lowercasing).
The terms-query does not do query-time text-analysis, so you'll be searching for A100945 which does not match a100945.
This is quite a common problem, and is explained a bit more in this article on Troubleshooting Elasticsearch searches, for Beginners.
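One way to make the terms query line up with what was actually indexed is to lowercase the values yourself before building the query, e.g. with the Java API mentioned in the question (a sketch):
// The terms query is not analyzed, so send the values the way the standard
// analyzer stored them (lowercased).
QueryBuilder query = QueryBuilders.termsQuery("caid",
        "A100945".toLowerCase(), "A100896".toLowerCase());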
You'd better use a match query. Match queries are analyzed (the default analyzer is applied to the query text), like:
QueryBuilders.matchQuery("caid", "A10xxx A101xxx");

MongoDb - Update collection atomically if set does not exist

I have the following document in my collection:
{
  "_id": NumberLong(106379),
  "_class": "x.y.z.SomeObject",
  "name": "Some Name",
  "information": {
    "hotelId": NumberLong(106379),
    "names": [
      {
        "localeStr": "en_US",
        "name": "some Other Name"
      }
    ],
    "address": {
      "address1": "5405 Google Avenue",
      "city": "Mountain View",
      "cityIdInCitiesCodes": "123456",
      "stateId": "CA",
      "countryId": "US",
      "zipCode": "12345"
    },
    "descriptions": [
      {
        "localeStr": "en_US",
        "description": "Some Description"
      }
    ]
  },
  "providers": [],
  "some other set": {
    "a": "bla bla bla",
    "b": "bla,bla bla"
  },
  "another Property": "fdfdfdfdfdf"
}
I need to run through all documents in the collection and, if "providers": [] is empty, create a new set based on values from the information section.
I'm far from being a MongoDB expert, so I have a few questions:
Can I do it as an atomic operation?
Can I do this using the MongoDB console? As far as I understood I can do it using the $addToSet and $each operators?
If not, is there any Java-based driver that can provide such functionality?
Can I do it as atomic operation?
Every document will be updated in an atomic fashion. There is no "atomic" in MongoDB in the sense of an RDBMS transaction, meaning that all operations either succeed or fail together, but you can prevent other writes from interleaving using the $isolated operator.
Can I do this using MongoDB console?
Sure you can. To find all documents with an empty providers array, you can issue a command like:
db.zz.find({ providers: { $size: 0 } })
To update all documents where the array is of zero length, adding a fixed string to the set, you can issue a query such as
db.zz.update({ providers: { $size: 0 } }, { $addToSet: { providers: "zz" } }, { multi: true })
If you want to add a portion to your document based on the document's own data, you can use the notorious $where query (do mind the warnings appearing in that link), or - as you mentioned - query for an empty providers array and use cursor.forEach().
If not is there any Java based driver that can provide such functionality?
Sure, you have a Java driver, as for every other major programming language. It can do practically everything described above, and basically everything you can do from the shell. I suggest you get started with the Java Language Center.
There are also several frameworks which facilitate working with MongoDB and bridge the object-document worlds. I will not give a list here as I'm pretty biased, but I'm sure a quick Google search will do.
db.so.find({ providers: { $size: 0 } }).forEach(function(doc) {
    doc.providers.push(doc.information.hotelId);
    db.so.save(doc);
});
This will push the information.hotelId of the corresponding document into an empty providers array. Replace that with whatever field you would rather insert into the providers array.
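If you would rather do the same from Java, a rough equivalent with the 3.x sync driver looks like this (database and collection names are placeholders, and mongoClient is assumed to be an existing MongoClient):
// For each document with an empty providers array, push information.hotelId
// into providers - the same logic as the shell loop above.
MongoCollection<Document> coll = mongoClient.getDatabase("mydb").getCollection("so");
for (Document doc : coll.find(Filters.size("providers", 0))) {
    Document info = (Document) doc.get("information");
    coll.updateOne(Filters.eq("_id", doc.get("_id")),
            Updates.push("providers", info.get("hotelId")));
}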
