How to index documents containing nested properties with Lucene? - java

I'll try to reduce my case to the necessary: I'm building a Webapp (with Spring) with a search interface that lets you search a corpus of annotated/tagged texts. In my DB (MongoDB) one document represents one page of a book collection (totaling ~8000 pages).
Here is an example of the Document structure in JSON (I removed a lot of meta data for brevity. Also, and this is important, the "tokens"-array contains up to 700 objects in most cases.):
{
"_id" : ObjectId("5622c29eef86d3c2f23fd62c"),
"scanId" : "592ea208b6d108ee5ae63f79",
"volume" : "Volume I",
"chapters" : [
"Some Chapter Name"
],
"languages" : [
"English",
"German"
],
"tokens" : [
{
"form" : "The",
"index" : 0,
"tags" : [
"ART"
]
},
{
"form" : "house",
"index" : 1,
"tags" : [
"NN",
"NN_P"
]
},
{
"form" : "is",
"index" : 2,
"tags" : [
"V",
"CONJ_C"
]
}
]
}
So, as you can see, I don't have plain text here. I now want to build an index with Lucene to search this DB quickly. The problem is that I want to be able to search for certain words, their tags AND the context around them, for example: "give me all documents containing the word 'House' tagged as 'NN' followed by a word tagged with 'V'". I couldn't find a way to index these sub-structures with native Lucene functionality.
What I tried, to at least be able to search for words and their tags, is the following: in my Lucene index, a document doesn't represent a whole page, but only a single word/token with its tags. So one index document looks like this (expressed in JSON syntax for readability):
{
"token" : "house",
"tag" : "NN",
"tag" : "NN_P",
"index" : 1,
"pageId" : "5622c29eef86d3c2f23fd62c"
}
... Yes, Lucene allows me to use one field multiple times. So now I can search for a word and its tags and get a reference to the page object in my DB via its ID. But this is pretty ugly for two reasons: I now have two completely different document representations (DB and Lucene index), and to process a complex query like the one I mentioned above, I'd have to query for the word and its tag and then manually check the context of the hits in the retrieved documents.
So my question is: Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?
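To make the manual post-filtering step concrete, here is a minimal plain-Java sketch of the context check that would run on each page retrieved via the index. The classes Token and ContextCheck and the method matches are hypothetical names used only for illustration; they are not part of Lucene or of the application described above.

```java
import java.util.Arrays;
import java.util.List;

/** One annotated token of a page, mirroring an entry of the "tokens" array. */
final class Token {
    final String form;
    final List<String> tags;

    Token(String form, String... tags) {
        this.form = form;
        this.tags = Arrays.asList(tags);
    }
}

/** Hypothetical helper: checks a word, its tag and the tag of the following token. */
final class ContextCheck {
    /**
     * Returns true if some token has the given form (case-insensitive) and tag,
     * and the token directly after it carries nextTag.
     */
    static boolean matches(List<Token> tokens, String form, String tag, String nextTag) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token current = tokens.get(i);
            Token next = tokens.get(i + 1);
            if (current.form.equalsIgnoreCase(form)
                    && current.tags.contains(tag)
                    && next.tags.contains(nextTag)) {
                return true;
            }
        }
        return false;
    }
}
```

For the three example tokens above ("The", "house", "is"), matches(tokens, "House", "NN", "V") returns true. This is exactly the per-hit work that a nested-aware index would ideally make unnecessary.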

Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?
Elasticsearch certainly lets you do this. I think it's possible to do all of it in pure Lucene, but it may take some effort.
Basically, you need to map the field as 'nested' and then use the 'nested' query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html
PUT /my_index
{
"mappings": {
"type1" : {
"properties" : {
"tokens" : {
"type" : "nested"
}
}
}
}
}
This tells ES to index the contents of this field as a list of separate documents, allowing you to query them individually using the 'nested' query:
GET my_index/_search
{
"query": {
"nested": {
"path": "tokens",
"query": {
"bool": {
"must": [
{ "match": { "tokens.form": "house" }},
{ "match": { "tokens.tags": "NN" }}
]
}
}
}
}
}
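The reason the 'nested' mapping matters is that, without it, an array of objects is flattened into multi-valued fields, so the link between a token's form and its own tags is lost. The following plain-Java sketch only imitates the two matching behaviours to show the difference; it is not Elasticsearch code, and TaggedToken, NestedVsFlattened and both methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** One entry of the "tokens" array. */
final class TaggedToken {
    final String form;
    final List<String> tags;

    TaggedToken(String form, String... tags) {
        this.form = form;
        this.tags = Arrays.asList(tags);
    }
}

final class NestedVsFlattened {
    /** Flattened view: the form and the tag may come from different tokens. */
    static boolean flattenedMatch(List<TaggedToken> tokens, String form, String tag) {
        boolean formSeen = false;
        boolean tagSeen = false;
        for (TaggedToken t : tokens) {
            formSeen |= t.form.equalsIgnoreCase(form);
            tagSeen |= t.tags.contains(tag);
        }
        return formSeen && tagSeen;
    }

    /** Nested view: the form and the tag must belong to the same token. */
    static boolean nestedMatch(List<TaggedToken> tokens, String form, String tag) {
        for (TaggedToken t : tokens) {
            if (t.form.equalsIgnoreCase(form) && t.tags.contains(tag)) {
                return true;
            }
        }
        return false;
    }
}
```

For a page containing "The" (ART) and "house" (NN, NN_P), searching for form "house" with tag "ART" matches in the flattened view but not in the nested view, which is the behaviour the query above relies on.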

Related

GlobalSecondary index in DynamoDB

I have to create a Global Secondary Index in DynamoDB. My primary table structure is below:
{
"primaryId" : "1234", //HashKey
"dummy1" : "kkd",
"dummy2" : "ddd",
"secondObj": [{
"secondObjId" : "1234",
"name" : "1234"
},
{
"secondObjId" : "12345",
"name" : "12345"
}]
}
Now I have to create a Global Secondary Index with "secondObjId" as the hash key. Is that possible?
I have created it using the AWS console, but it shows an item count of 0, whereas if I create the Global Secondary Index on "dummy1", it shows the proper item count.
So my question is: is it possible to create a Global Secondary Index based on an attribute inside a DynamoDBDocument?
Indexes can be built only on top-level JSON attributes. In addition, index key attributes (hash and range keys) must be scalar values in DynamoDB (one of String, Number, or Binary).
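One possible workaround, offered here only as an assumption about what might fit the use case, is to store each "secondObj" entry as its own item so that "secondObjId" becomes a top-level attribute. The sketch below shows just the reshaping in plain Java; SecondObjFlattener and flatten are hypothetical names, and writing the resulting items to a table is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns nested "secondObj" entries into flat items. */
final class SecondObjFlattener {
    /** Builds one flat item per nested entry; each item keeps the parent's primaryId. */
    static List<Map<String, String>> flatten(String primaryId,
                                             List<Map<String, String>> secondObj) {
        List<Map<String, String>> items = new ArrayList<>();
        for (Map<String, String> entry : secondObj) {
            Map<String, String> item = new LinkedHashMap<>();
            item.put("primaryId", primaryId);
            // Now a top-level attribute, so it can serve as an index key.
            item.put("secondObjId", entry.get("secondObjId"));
            item.put("name", entry.get("name"));
            items.add(item);
        }
        return items;
    }
}
```

Whether the flat items live in the same table or a separate one is a design choice that depends on the access patterns, which the question does not describe.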

Dynamodb How to perform update in a list element in an item?

I am storing data in a list attribute. Each element of that list consists of multiple fields such as year and name. Is there any way I can delete only the elements where year is 2018, as I would delete rows in a relational database?
The data is stored like this: ID is the primary key, and there are some other fields such as password, city, etc. Below is how the data is stored per quarter.
{
"q1risk" : "0",
"q1targets" : [ ]
}
UpdateItem succeeded:
{
"q2risk" : "100.0",
"q2targets" : [ {
"level" : "Basic",
"year" : "2017",
"name" : "1",
"completed" : "0",
"category" : "US"
}, {
"level" : "Basic",
"year" : "2017",
"name" : "2",
"completed" : "0",
"category" : "US"
}, {
"level" : "Basic",
"year" : "2018",
"name" : "3",
"completed" : "0",
"category" : "US"
} ]
}
According to the DynamoDB UpdateExpression documentation (as noted in an answer to a similar question), you can remove individual elements from a list using REMOVE ListAttribute[index].
However, there doesn't appear to be a way to delete list elements based on a condition.
Your best bet to minimize read/write throughput is to:
1. Query for the item with a CONTAINS filter condition and a projection expression on the list attribute.
2. If the item is found, it contains at least one element you want to remove, so find the indices of the elements you want to remove.
3. Use REMOVE with the indices to remove those elements from the item's list attribute.
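The last two steps can be sketched in plain Java as follows. This only computes the indices and builds the expression string from a list that has already been read; the class RemoveExpressionBuilder and its methods are hypothetical names, and the actual UpdateItem call with the AWS SDK is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for building a REMOVE update expression. */
final class RemoveExpressionBuilder {
    /** Returns the positions of all list elements whose "year" equals the given year. */
    static List<Integer> indicesForYear(List<Map<String, String>> targets, String year) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < targets.size(); i++) {
            if (year.equals(targets.get(i).get("year"))) {
                indices.add(i);
            }
        }
        return indices;
    }

    /** Builds e.g. "REMOVE q2targets[0], q2targets[2]"; returns "" if there is nothing to remove. */
    static String removeExpression(String listAttribute, List<Integer> indices) {
        if (indices.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder("REMOVE ");
        for (int i = 0; i < indices.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(listAttribute).append('[').append(indices.get(i)).append(']');
        }
        return sb.toString();
    }
}
```

For the q2targets list shown in the question, the only 2018 entry sits at index 2, so the resulting expression is "REMOVE q2targets[2]".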

How do I make sure null or missing fields come first when sorting on that field in Elasticsearch

I am writing a Java client for Elasticsearch. How do I make sure that documents with a null or missing field always come first when sorting on that field?
Set "missing" to "_first" in the sort clause of your search query:
{
"sort" : [
{ "price" : {"missing" : "_first"} }
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
In Java:
FieldSortBuilder sorter = new FieldSortBuilder("price");
sorter.missing("_first");

How to update embedded mongo document using spring mongo data api

I need some help updating a property of an embedded collection with the JSON structure below:
translation
{
"_id" : ObjectId("533d4c73d86b8977fda970a9"),
"_class" : "com.xxx.xxx.translation.domain.Translation",
"locales" : [
{
"_id" : "en-US",
"description" : "English (United States)",
"isActive" : true
},
{
"_id" : "pt-BR",
"description" : "Portuguese (Brazil)",
"isActive" : true
},
{
"_id" : "nl-NL",
"description" : "Dutch (Netherlands)",
"isActive" : true
}
],
"screens" : [
{
"_id" : "caseCodes",
"dictionary" : [
{
"key" : "CS_CAT1",
"parameterizedValue" : "My investigations",
"locale" : "en-US"
},
{
"key" : "MY_INVESTIGATIONS",
"parameterizedValue" : "",
"locale" : "pt-BR"
}
]
}
]
}
In above structure:
I want to update "parameterizedValue" using the spring-data-mongodb API 1.3.4, for the screen with _id="caseCodes" and key="CS_CAT1".
I tried (here 'values' is a collection name for the TranslationValue array)
mongoOperations.updateFirst(Query.query(Criteria.where("screens._id")
.is("caseCodes")), new Update().push(
"screens.dictionary.$.values", translationValue),
Translation.class);
but it said, "can't append array to string "dictionary"....
Any pointers or help here? Thanks.
-Sanjeev
There are a few problems with your logic as well as problems with your schema for this type of update.
Firstly what you have are nested arrays, and this causes a problem with updates as described in the documentation for the positional $ operator. What this means is that any condition matching an element of an array on the query side of the update statement will only match the first array index found.
Since you need a specific entry in the inner array, you would need to match that as well. But the "catch" is that only the first match will be used by the positional operator, so you cannot have both. The query form (if it worked, which it does not) would actually be something like this (native shell):
db.collection.update(
{
"screens._id": "caseCodes",
"screens.dictionary.key": "CS_CAT1"
},
{
"$set": {
"screens.$.dictionary.$.parameterizedValue": "new value"
}
}
)
Now that would "appear" to be more correct than what you are doing, but of course this fails because the positional operator cannot be used more than once. It may just, quite stupidly, work in this case, as it just so happens that the first matched index of the "screens" array (which is 0) is exactly the same as the required index of the inner element. But really that is just "dumb luck".
To illustrate better, what you need to do with this type of update is already know the indexes of the elements and place those values directly into the update statement using "dot notation". So updating your second "dictionary" element would go like this:
db.collection.update(
{
"screens._id": "caseCodes",
"screens.dictionary.key": "MY_INVESTIGATIONS"
},
{
"$set": {
"screens.0.dictionary.1.parameterizedValue": "new value"
}
}
)
Also note that the correct operator to use here is $set, as you are not appending to either of the arrays but rather changing an element.
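Since the indexes have to be known up front, one way to obtain them is to read the document first and compute the dot-notation path on the client. The following is a minimal sketch under that assumption; Screen, PositionalPath and pathFor are hypothetical names, the document is represented by simplified plain-Java classes, and the subsequent update call is not shown.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for one element of the "screens" array. */
final class Screen {
    final String id;
    final List<String> dictionaryKeys;

    Screen(String id, String... dictionaryKeys) {
        this.id = id;
        this.dictionaryKeys = Arrays.asList(dictionaryKeys);
    }
}

/** Hypothetical helper that builds the field path for a $set update. */
final class PositionalPath {
    /**
     * Returns a path such as "screens.0.dictionary.1.parameterizedValue",
     * or null if no screen/key combination matches.
     */
    static String pathFor(List<Screen> screens, String screenId, String key) {
        for (int s = 0; s < screens.size(); s++) {
            Screen screen = screens.get(s);
            if (!screenId.equals(screen.id)) {
                continue;
            }
            int d = screen.dictionaryKeys.indexOf(key);
            if (d >= 0) {
                return "screens." + s + ".dictionary." + d + ".parameterizedValue";
            }
        }
        return null;
    }
}
```

With the sample document, pathFor(screens, "caseCodes", "MY_INVESTIGATIONS") yields "screens.0.dictionary.1.parameterizedValue", the same path used in the shell example above. Keep in mind that indexes computed from an earlier read can be stale if the arrays are modified in between.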
Since that sort of exactness in updates is unlikely to suit what you need, you should look at changing the schema to accommodate your operations in a much better supported way. One possibility is that your "screens" data may not need to be an array at all, and you could change it to a document form like so:
"screens" : {
"caseCodes": [
{
"key" : "CS_CAT1",
"parameterizedValue" : "My investigations",
"locale" : "en-US"
},
{
"key" : "MY_INVESTIGATIONS",
"parameterizedValue" : "",
"locale" : "pt-BR"
}
]
}
This changes the update to:
db.collection.update(
{
"screens.caseCodes.key": "CS_CAT1"
},
{
"$set": {
"screens.caseCodes.$.parameterizedValue": "new value"
}
}
)
That may or may not work for your purposes, but you either live with the limitations of using a nested array or otherwise change your schema in some way.

Mongodb: finding lowest element in embedded array

I have a collection "students" whose documents look like this:
{
"_id" : 10,
"name" : "Christiano Ronaldo",
"scores" : [
{
"type" : "exam",
"score" : 40.58945534169687
},
{
"type" : "quiz",
"score" : 4.30461571152303
},
{
"type" : "homework",
"score" : 62.36309025722009
},
{
"type" : "homework",
"score" : 32.1707802903173
}
]
}
How do I find the lowest homework score, using the Java driver?
You don't with normal querying. You always query for full documents rather than embedded elements within that document. If you make separate documents for each score, e.g.:
{
"_id" : 10,
"name" : "Christiano Ronaldo",
"type" : "exam",
"score" : 40.58945534169687
}
You can then search for the highest/lowest exam score for Christiano Ronaldo. Note that the MongoDB Aggregation Framework can be used to answer your question, but I'm going to assume that's out of scope here.
Also note that your schema is very problematic. There is no straightforward way to select a single "homework" score with a plain query against this schema. I would restructure the data here and use one document per score.
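If the schema has to stay as it is, another option is to fetch the whole student document and pick the lowest score on the client. The sketch below shows only that client-side step in plain Java; Score, LowestScore and lowest are hypothetical names, and loading the document through the Java driver is not shown.

```java
import java.util.List;
import java.util.OptionalDouble;

/** Simplified stand-in for one element of the "scores" array. */
final class Score {
    final String type;
    final double score;

    Score(String type, double score) {
        this.type = type;
        this.score = score;
    }
}

/** Hypothetical helper that finds the lowest score of a given type. */
final class LowestScore {
    /** Returns the lowest score of the given type, or an empty result if there is none. */
    static OptionalDouble lowest(List<Score> scores, String type) {
        return scores.stream()
                .filter(s -> type.equals(s.type))
                .mapToDouble(s -> s.score)
                .min();
    }
}
```

For the sample document, lowest(scores, "homework") yields 32.1707802903173. This costs a full document read per student, so it only makes sense when the embedded arrays are small.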
