Elasticsearch: select which field to use for boost - Java

Given an Elasticsearch document like this:
{
"name": "bob",
"title": "long text",
"text": "long text bla bla...",
"val_a1": 0.3,
"val_a2": 0.7,
"val_a3": 1.1,
...
"val_az": 0.65
}
I need to run a search on Elasticsearch with a given boost value on the text field, plus a per-document boost taken from a named field val_xy.
For example, a search could be:
"long" with a boost value on text of 2.0 and a general boost from val_a6
So if "long" is found in the text field I use a boost of 2.0, combined with the boost value taken from the field val_a6.
How can I do this search with the Java Elasticsearch client? Is it possible?

What you want is a function_score query. The documentation isn't the best and can be highly confusing. But using your example above you'd do something like the following:
"function_score": {
"query": {
"term": {
"title": "long"
}
},
"functions": [
{
"filter": {
"term": {
"title": "long"
}
},
"script_score": {
"script": "_score*2.0*doc['val_a6'].value"
}
}
],
"score_mode": "max",
"boost_mode": "replace"
}
My eureka moment with function_score queries was figuring out you could do filters, including bool filters, within the "functions" part.
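And yes, it's possible from Java. As a rough sketch, the same query could be assembled with the Elasticsearch Java API's query builders (names below follow the 5.x+ QueryBuilders API; treat the exact enum imports as assumptions to verify against your client version):
import org.elasticsearch.common.lucene.search.function.CombineFunction;
import org.elasticsearch.common.lucene.search.function.FunctionScoreQuery;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.FunctionScoreQueryBuilder;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;
import org.elasticsearch.script.Script;

// Same shape as the JSON above: match "long", and for matching docs
// multiply _score by 2.0 and by the per-document value of val_a6.
FunctionScoreQueryBuilder query = QueryBuilders.functionScoreQuery(
        QueryBuilders.termQuery("title", "long"),
        new FunctionScoreQueryBuilder.FilterFunctionBuilder[] {
                new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                        QueryBuilders.termQuery("title", "long"),
                        ScoreFunctionBuilders.scriptFunction(
                                new Script("_score*2.0*doc['val_a6'].value")))
        })
        .scoreMode(FunctionScoreQuery.ScoreMode.MAX)   // "score_mode": "max"
        .boostMode(CombineFunction.REPLACE);           // "boost_mode": "replace"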

Related

How to select fields at different levels of a JSON file with JsonPath?

I want to convert JSON objects into CSV files. My (working) attempt so far is to load the JSON file as a JSONObject (from the googlecode json-simple library), then convert it with JsonPath into a string array, which is then used to build the CSV rows. However, I am facing a problem with JsonPath. From the given example JSON...
{
"issues": [
{
"key": "abc",
"fields": {
"issuetype": {
"name": "Bug",
"id": "1",
"subtask": false
},
"priority": {
"name": "Major",
"id": "3"
},
"created": "2020-5-11",
"status": {
"name": "OPEN"
}
}
},
{
"key": "def",
"fields": {
"issuetype": {
"name": "Info",
"id": "5",
"subtask": false
},
"priority": {
"name": "Minor",
"id": "2"
},
"created": "2020-5-8",
"status": {
"name": "DONE"
}
}
}
]}
I want to select the following:
[
"abc",
"Bug",
"Major",
"2020-5-11",
"OPEN",
"def",
"Info",
"Minor",
"2020-5-8",
"DONE"
]
The csv should look like that:
abc,Bug,Major,2020-5-11,OPEN
def,Info,Minor,2020-5-8,DONE
I tried $.issues.[*].[key,fields] and I get
"abc",
{
"issuetype": {
"name": "Bug",
"id": "1",
"subtask": false
},
"priority": {
"name": "Major",
"id": "3"
},
"created": "2020-5-11",
"status": {
"name": "OPEN"
}
},
"def",
{
"issuetype": {
"name": "Info",
"id": "5",
"subtask": false
},
"priority": {
"name": "Minor",
"id": "2"
},
"created": "2020-5-8",
"status": {
"name": "DONE"
}
}
]
But when I want to select e.g. only "created" with $.issues.[*].[key,fields.[created]]:
[
"2020-5-11",
"2020-5-8"
]
This is the result.
But I just do not get how to select "key" together with e.g. "name" inside the issuetype field.
How do I do that with JsonPath, or is there a better way to filter a JSON file and then convert it into CSV?
I recommend what I believe is a better way - which is to create a set of Java classes which represent the structure of your JSON data. When you read the JSON into these classes, you can manipulate the data using standard Java.
I also recommend a different JSON parser - in this case Jackson, but there are others. Why? Mainly, familiarity - see later on for more notes on that.
Starting with the end result: Assuming I have a class called Container which contains all the issues listed in the JSON file, I can then populate it with the following:
//import com.fasterxml.jackson.databind.ObjectMapper;
String jsonString = "{...}"; // your JSON data as a string, for this demo.
ObjectMapper objectMapper = new ObjectMapper();
Container container = objectMapper.readValue(jsonString, Container.class);
Now I can print out all the issues in the CSV format you want as follows:
container.getIssues().forEach((issue) -> {
printCsvRow(issue);
});
Here, the printCsvRow() method looks like this:
private void printCsvRow(Issue issue) {
String key = issue.getKey();
Fields fields = issue.getFields();
String type = fields.getIssuetype().getName();
String priority = fields.getPriority().getName();
String created = fields.getCreated();
String status = fields.getStatus().getName();
System.out.println(String.join(",", key, type, priority, created, status));
}
In reality, I would use a CSV library to ensure records are formatted correctly - the above is just for illustration, to show how the JSON data can be accessed.
The following is printed:
abc,Bug,Major,2020-5-11,OPEN
def,Info,Minor,2020-5-8,DONE
And to filter only OPEN records, I can do something like this:
container.getIssues()
.stream()
.filter(issue -> issue.getFields().getStatus().getName().equals("OPEN"))
.forEach((issue) -> {
printCsvRow(issue);
});
The following is printed:
abc,Bug,Major,2020-5-11,OPEN
To enable Jackson, I use Maven with the following dependency:
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.10.3</version>
</dependency>
In case you don't use Maven: this dependency pulls in 3 JARs: jackson-databind, jackson-annotations, and jackson-core.
To create the nested Java classes I need (to mirror the structure of the JSON), I use a tool which generates them for me using your sample JSON.
In my case, I used this tool, but there are others.
I chose "Container" as the name of the root Java class; a source type of JSON; and selected Jackson 2.x annotations. I also requested getters and setters.
I added the generated classes (Fields, Issue, Issuetype, Priority, Status, and Container) to my project.
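For reference, a minimal sketch of what the generated Container and Issue classes might look like (illustrative only; field names must match the JSON keys, or carry Jackson @JsonProperty annotations):
import java.util.List;

public class Container {
    private List<Issue> issues;
    public List<Issue> getIssues() { return issues; }
    public void setIssues(List<Issue> issues) { this.issues = issues; }
}

public class Issue {
    private String key;
    private Fields fields;
    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public Fields getFields() { return fields; }
    public void setFields(Fields fields) { this.fields = fields; }
}

// Fields, Issuetype, Priority, and Status follow the same pattern, with
// getters and setters for each key shown in the sample JSON.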
WARNING: The completeness of these Java classes is only as good as the sample JSON. But you can, of course, enhance these classes to more accurately reflect the actual JSON you need to handle.
The Jackson ObjectMapper takes care of loading the JSON into the class structure.
I chose to use Jackson instead of JsonPath, simply because of familiarity. JsonPath appears to have very similar object mapping capabilities - but I have never used those features of JsonPath.
Final note: You can use XPath-style predicates in JsonPath to access individual data items and groups of items - as you describe in your question. But (in my experience) it is almost always worth the extra effort to create Java classes if you want to process your data in more flexible ways - especially if that involves transforming the JSON input into different output structures.

Elasticsearch multi_match gets wrong result

I am sending a query to Elasticsearch to find all segments that have a field matching the query.
We are implementing a "free search" in which the user can write any text he wants, and we build a query that searches for this text across all the segment fields.
Each segment where one (or more) of its fields matches this text should be returned.
For example:
I would like to get all the segments with the name "tony lopez".
Each segment has a "first_name" field and a "last_name" field.
The query our service builds:
"multi_match" : {
"query": "tony lopez",
"type": "best_fields"
"fields": [],
"operator": "OR"
}
The result from Elasticsearch using this query includes a segment whose "first_name" field is "tony" and whose "last_name" field is "lopez", but also a segment whose "first_name" is "joe" and whose "last_name" is "tony".
For this type of query, I would like to receive only the segments whose name is "tony (first_name) lopez (last_name)".
How can I fix that issue?
Hope I'm not jumping to conclusions too soon, but if you want to get only tony and lopez as first name and last name, use this:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"first": "tony"
}
},
{
"match": {
"last": "lopez"
}
}
]
}
}
}
But if one of your indexed documents contains, for example, tony s as the first name, the query above will return it too.
Why? Because first is a text datatype:
A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
More Details
If you run this query via kibana:
POST my_index/_analyze
{
"field": "first",
"text": ["tony s"]
}
You will see that tony s is analyzed as two tokens tony and s.
passed through an analyzer to convert the string into a list of individual terms (tony as a term and s as a term).
That is why the above query returns tony s in the results: it matches the term tony.
If you want to get only tony and lopez exact match then you should use this query:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"first.keyword": {
"value": "tony"
}
}
},
{
"term": {
"last.keyword": {
"value": "lopez"
}
}
}
]
}
}
}
Read about keyword datatype
UPDATE
Try this query. It is not perfect - it has the same issue as my tony s example, and if you have a document with first name lopez and last name tony, it will find that one too.
GET my_index/_search
{
"query": {
"multi_match": {
"query": "tony lopez",
"fields": [],
"type": "cross_fields",
"operator":"AND",
"analyzer": "standard"
}
}
}
The cross_fields type is particularly useful with structured documents where multiple fields should match. For instance, when querying the first_name and last_name fields for “Will Smith”, the best match is likely to have “Will” in one field and “Smith” in the other
cross fields
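If you're building this from Java, here's a hedged sketch of the same cross_fields query with the Elasticsearch Java query builders (method names per the 6.x/7.x API; verify against your client version):
import org.elasticsearch.index.query.MultiMatchQueryBuilder;
import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;

// cross_fields treats first/last as one combined field, and AND requires
// every term ("tony" and "lopez") to match somewhere across them.
MultiMatchQueryBuilder query = QueryBuilders
        .multiMatchQuery("tony lopez", "first", "last")
        .type(MultiMatchQueryBuilder.Type.CROSS_FIELDS)
        .operator(Operator.AND)
        .analyzer("standard");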
Hope it helps

How do I create an ElasticSearch query without knowing what the field is?

I have someone putting JSON objects into Elasticsearch whose fields I do not know in advance. I would like to search all the fields for a given value using a matchQuery.
I understand that _all is deprecated, and copy_to doesn't work because I don't know which fields will be available beforehand. Is there a way to accomplish this without knowing which fields to search beforehand?
Yes, you can achieve this using a custom _all field (which I called my_all) and a dynamic template for your index. Basically, the idea is to have a generic mapping for all fields with a copy_to setting pointing to the my_all field. I've also added store: true to the my_all field, but only to show you that it works; in practice you won't need it.
So let's go and create the index:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"all_fields": {
"match": "*",
"mapping": {
"copy_to": "my_all"
}
}
}
],
"properties": {
"my_all": {
"type": "text",
"store": true
}
}
}
}
}
Then index a document:
PUT my_index/_doc/1
{
"test": "the cat drinks milk",
"age": 10,
"alive": true,
"date": "2018-03-21T10:00:00.123Z",
"val": ["data", "data2", "data3"]
}
Finally, we can search using the my_all field, and also show its content (because we stored it) in addition to the _source of the document:
GET my_index/_search?q=my_all:cat&_source=true&stored_fields=my_all
And the result is shown below:
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"test": "the cat drinks milk",
"age": 10,
"alive": true,
"date": "2018-03-21T10:00:00.123Z",
"val": [
"data",
"data2",
"data3"
]
},
"fields": {
"my_all": [
"the cat drinks milk",
"10",
"true",
"2018-03-21T10:00:00.123Z",
"data",
"data2",
"data3"
]
}
}
So given you can create the index and mapping of your index, you'll be able to search whatever people are sending to it.
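And if you're searching from Java, here's a hedged sketch of the equivalent request with the High Level REST Client (assuming a 6.x/7.x client and an existing RestHighLevelClient instance named client; method names per that API):
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Same search as the URI version above: match "cat" in my_all,
// returning _source plus the stored my_all field.
SearchRequest request = new SearchRequest("my_index");
request.source(new SearchSourceBuilder()
        .query(QueryBuilders.matchQuery("my_all", "cat"))
        .fetchSource(true)
        .storedField("my_all"));
SearchResponse response = client.search(request, RequestOptions.DEFAULT);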

ElasticSearch: Update multi fields in Script Java Plugin

I saw this snippet where update_by_query can update the source directly:
POST twitter/_update_by_query
{
"script": {
"inline": "ctx._source.likes++",
"lang": "painless"
},
"query": {
"term": {
"user": "kimchy"
}
}
}
Instead of using Painless, I wrote a native script plugin in Java because of my complex business logic.
{
"subtotal": 1000,
"markup": 2,
"total": 2000,
"items": [
{
"subtotal": 100,
"markup": 2,
"total": 200
},
{
"subtotal": 500,
"markup": 2,
"total": 1000
}
]
}
Users can set the markup value in the application. If the user changes the markup to 3, I want to update the markup and total fields, including the ones in the nested objects. (NOTE: I can't use Painless because in my case the logic is more complicated than just multiplying those fields. That's why I use Java.)
// my plugin code
public Object run() {
// change field value of "markup"
// change field value of "total"
return true;
}
My code is similar to https://github.com/imotov/elasticsearch-native-script-example/blob/master/src/main/java/org/elasticsearch/examples/nativescript/script/TFIDFScoreScript.java
I was trying with source().put("markup", 3) but I kept getting a NullPointerException.
ElasticSearch Version: 5.0.0
Thank you
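A hedged sketch of one possible fix, under the assumption (not confirmed above) that in ES 5.x update contexts the document is injected as the "ctx" script variable rather than via source(), which would explain the NullPointerException; source() is populated for search scripts:
import java.util.Map;
import org.elasticsearch.script.AbstractExecutableScript;

// Hypothetical sketch: receive the update context via setNextVar and
// mutate ctx._source directly instead of calling source().
public class MarkupScript extends AbstractExecutableScript {

    private Map<String, Object> ctx;

    @Override
    @SuppressWarnings("unchecked")
    public void setNextVar(String name, Object value) {
        if ("ctx".equals(name)) {
            ctx = (Map<String, Object>) value;
        }
    }

    @Override
    public Object run() {
        @SuppressWarnings("unchecked")
        Map<String, Object> source = (Map<String, Object>) ctx.get("_source");
        source.put("markup", 3);
        // recompute "total" here, including the nested "items", per your logic
        return true;
    }
}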

Is aggregation (count) on a dimension, but not on metrics, supported by Druid?

For example, there are two dimensions: [country, website] and one metric: [PV].
I want to know the average PV per website for each country.
It's easy to get the total PV in each country; however, it's difficult to get the count of websites in each country. The expected result is the total PV (in each country) divided by the count of websites (in each country).
What I can do is run a "groupBy" query on country & website as below, and then group the result by country in my application. That is very slow, because the query extracts lots of data from Druid, most of which is meaningless when all I need is a sum.
{
"queryType": "groupBy",
"dataSource": "--",
"dimensions": [
"country",
"website"
],
"granularity": "all",
"intervals": [
"--"
],
"aggregations": [
{
"type": "longSum",
"name": "PV",
"fieldName": "PV"
}
]
}
Can anyone help with this? I find it hard to believe that such a common query is not supported by Druid.
Thanks in advance.
To be clear, I'll describe my expected result in SQL. If you already know what I want to do, or you are not familiar with SQL, please ignore the following part.
SELECT country, sum(a.PV_all) / count(a.website) as PV_AVG FROM
(SELECT country, website, SUM(PV) as PV_all FROM DB GROUP BY country, website ) a
GROUP BY country
Have you tried using a nested groupBy query? Druid supports that.
In a nutshell, you can have something like:
{
"queryType": "groupBy",
"dataSource":{
"type": "query",
"query": {
"queryType": "groupBy",
"dataSource": "yourDataSource",
"granularity": "--",
"dimensions": ["country", "website"],
"aggregations": [
{
"type": "longSum",
"name": "PV",
"fieldName": "PV"
}
],
"intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}
},
"granularity": "all",
"dimensions": ["country"],
"aggregations": [
----
],
"intervals": [ "2012-01-01T00:00:00.000/2020-01-03T00:00:00.000" ]
}
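To fill in the placeholder aggregations ("----") above, a hedged sketch: sum the inner PV, count the grouped rows (one per website within each country), and divide with an arithmetic post-aggregator. Names like PV_all and website_count are illustrative:
"aggregations": [
{ "type": "longSum", "name": "PV_all", "fieldName": "PV" },
{ "type": "count", "name": "website_count" }
],
"postAggregations": [
{
"type": "arithmetic",
"name": "PV_AVG",
"fn": "/",
"fields": [
{ "type": "fieldAccess", "fieldName": "PV_all" },
{ "type": "fieldAccess", "fieldName": "website_count" }
]
}
]
This mirrors the SQL in the question: sum(PV_all) / count(website) per country.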
