How to remove duplicate search results in Elasticsearch when using scroll search? - java

I am developing a project that integrates Java and Elasticsearch.
I am using the scroll API because I need to search a large amount of data.
I want to see unique results (like DISTINCT in Oracle).
How can I remove duplicate search results in Elasticsearch?
I searched, but couldn't find a Java version.
My code looks like this (this is just sample code):
final Scroll scroll = new Scroll(TimeValue.timeValueMinutes(1L));
SearchRequest searchRequest = new SearchRequest("posts");
searchRequest.scroll(scroll);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(matchQuery("title", "Elasticsearch"));
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();
SearchHit[] searchHits = searchResponse.getHits().getHits();
while (searchHits != null && searchHits.length > 0) {
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(scroll);
    searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = searchResponse.getScrollId();
    searchHits = searchResponse.getHits().getHits();
}
Is there any way to search the data in Elasticsearch without getting duplicates?

Scroll returns every document that matches the query. Elasticsearch has no direct equivalent of a SQL DISTINCT for scrolled results.
There are two ways to get distinct values on the server side:
Field collapsing
It groups the hits by the value of a field and returns the top document of each group.
Terms Aggregation
{
  "aggs": {
    "t": {
      "terms": {
        "script": "doc['title.keyword'].value + ' ' + doc['description.keyword'].value"
      }
    }
  }
}
Scroll is intended for fetching a large number of documents, and the options above are not meant for bulk data. So you need to perform the distinct operation on the client side (outside of Elasticsearch).
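A minimal sketch of the client-side approach, assuming that two documents count as duplicates when their title and description are equal (the same two fields as in the aggregation example). The class name ScrollDeduplicator and its methods are hypothetical. Each scroll page is passed in as a list of source maps, for example built from each hit's getSourceAsMap().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: keeps the first document seen for each (title, description) pair. */
class ScrollDeduplicator {
    // Insertion-ordered map: dedup key -> first document seen with that key.
    private final Map<List<Object>, Map<String, Object>> unique = new LinkedHashMap<>();

    /** Call once per scroll page with the source maps of that page's hits. */
    void addPage(List<Map<String, Object>> pageOfSources) {
        for (Map<String, Object> source : pageOfSources) {
            // A List key is compared element by element, so no separator is needed.
            // The field names are an assumption taken from the aggregation example.
            List<Object> key = Arrays.asList(source.get("title"), source.get("description"));
            unique.putIfAbsent(key, source);
        }
    }

    /** Returns the unique documents in the order they were first seen. */
    List<Map<String, Object>> results() {
        return new ArrayList<>(unique.values());
    }
}
```

Inside the scroll loop you would call addPage once per batch of hits and read results() after the loop ends. This keeps every unique document in memory, so for very large result sets it may be better to store only the keys and write each new document out as it arrives.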

Related

Using SliceBuilder with Elastic RestHighLevelClient to achieve parallelism in elastic search

I need to fetch more than 1,000,000 records from Elasticsearch using the Java RestHighLevelClient.
I am using scroll for pagination and everything is working fine.
The code looks something like this:
class ScrollTest {
    final static RestHighLevelClient client = LocalhostClient.create();

    public static void main(String[] args) throws IOException {
        long st = System.currentTimeMillis();
        SearchRequest searchRequest = new SearchRequest("movies_data");
        QueryBuilder matchQueryBuilder = QueryBuilders.boolQuery().must(new MatchAllQueryBuilder());
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(matchQueryBuilder);
        searchSourceBuilder.size(10000); // max is 10000
        searchRequest.indices("movies_data");
        searchRequest.source(searchSourceBuilder);
        final Scroll scroll = new Scroll(TimeValue.timeValueSeconds(5L));
        searchRequest.scroll(scroll);
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        String scrollId = searchResponse.getScrollId();
        SearchHit[] data = new SearchHit[0];
        SearchHit[] searchHits = searchResponse.getHits().getHits();
        while (searchHits != null && searchHits.length > 0) {
            // function for concatenating results
            SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
            scrollRequest.scroll(scroll);
            searchResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
            scrollId = searchResponse.getScrollId();
            searchHits = searchResponse.getHits().getHits();
            System.out.println("###################" + searchHits.length);
        }
        ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
        clearScrollRequest.addScrollId(scrollId);
        ClearScrollResponse clearScrollResponse = client.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);
        System.out.println("Time Taken" + (System.currentTimeMillis() - st));
    }
}
However, fetching a large number of documents with scroll takes a long time: each request returns 10,000 documents, so I end up making 100 requests. Overriding the default window size did not help much either (I tried 100, 1000, 10000 and 25000).
I want to run the fetches in parallel using sliced scroll, but I can't figure out how to combine slicing with scroll.
Can someone please explain how to use SliceBuilder to achieve parallelism?
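One possible shape for the parallel part, sketched with plain JDK classes only. ParallelSliceRunner and SliceFetcher are hypothetical names, not part of the Elasticsearch client. The assumption is that each call to fetch runs its own complete scroll loop (like the one above) on a request whose source builder was configured with searchSourceBuilder.slice(new SliceBuilder(sliceId, maxSlices)), so every slice has its own scroll ID and returns a disjoint part of the results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSliceRunner {
    /** Hypothetical callback: runs one full scroll loop for a single slice and returns its hits. */
    interface SliceFetcher<T> {
        List<T> fetch(int sliceId, int maxSlices) throws Exception;
    }

    /** Runs one task per slice on its own thread and concatenates the results in slice order. */
    static <T> List<T> fetchAll(int maxSlices, SliceFetcher<T> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(maxSlices);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int i = 0; i < maxSlices; i++) {
                final int sliceId = i;
                Callable<List<T>> task = () -> fetcher.fetch(sliceId, maxSlices);
                futures.add(pool.submit(task));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                try {
                    all.addAll(future.get()); // blocks until this slice has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for a slice", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("A slice failed", e.getCause());
                }
            }
            return all;
        } finally {
            pool.shutdownNow(); // all tasks are done or failed at this point
        }
    }
}
```

With the real client, the lambda passed to fetchAll would build a SearchRequest for the slice, scroll until no hits remain, clear that slice's scroll, and return the collected SearchHit objects. How many slices actually help depends on the cluster; the number of shards in the index is a common starting point.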

How to translate Elastic Search query in Java by using the query builder?

I am trying to translate the following Elasticsearch query into Java using the query builder. Can someone give me an idea of how to do it?
GET <index-name>/_search
{
  "size": 1,
  "sort": [
    {
      "Date": {
        "order": "desc"
      }
    }
  ]
}
You need to use a SortBuilder to sort on fields. The request below sorts on the Date field in descending order:
RestHighLevelClient client = new RestHighLevelClient(restClientBuilder);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.size(1).sort(new FieldSortBuilder("Date").order(SortOrder.DESC));
SearchRequest searchRequest = new SearchRequest("my-index");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);

Java equivalent of elasticsearch CURL query?

My project has a new requirement to migrate all the data currently in PostgreSQL to Elasticsearch. I migrated all the data successfully, but I am stuck writing Java code to search the Elasticsearch index.
A sample hit from the index is shown in the image below:
I need to find the average of activity.attentionLevel from the index.
I wrote the following query to find the average:
GET proktor-activities/_search
{
  "aggs": {
    "avg_attention": {
      "avg": {
        "script": "Float.parseFloat(doc['activity.attentionLevel.keyword'].value)"
      }
    }
  }
}
Please help me find the equivalent Java code.
Thanks in advance.
With Elastic's RestHighLevelClient API it would look something like this:
// Create the client
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));
// Specify the aggregation request
// (assumes activity.attentionLevel is mapped as a numeric field;
// a string field would need a script like the one in the question)
SearchRequest searchRequest = new SearchRequest("proktor-activities");
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.aggregation(AggregationBuilders
        .avg("avg_attention")
        .field("activity.attentionLevel"));
searchRequest.source(searchSourceBuilder);
// Execute the aggregation with Elasticsearch
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
// Read the aggregation result
Aggregations aggregations = response.getAggregations();
Avg averageAttention = aggregations.get("avg_attention");
double result = averageAttention.getValue();
client.close(); // release the client's resources
More information here.

How to use search_after in Elastic High Level Rest Client for pagination

I am using the Elastic RestHighLevelClient to talk to ES, and basic queries work. Now I am trying to use the search_after API to build a paginated API for my front-end queries. While search_after is simple to use with the low-level REST client, I can't figure out how to use it with the high-level API.
It looks like the Lucene API has SearchAfterSortedDocQuery, but I can't figure out how to use it with the Elasticsearch API. For example, in the code below I initialize a SearchAfterSortedDocQuery but am not sure how to use it.
RestHighLevelClient client = ESRestClient.getClient();
SearchRequest searchRequest = new SearchRequest();
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(QueryBuilders.termQuery("field1",value1));
searchSourceBuilder.query(QueryBuilders.termQuery("field2",value2));
searchSourceBuilder.from(0);
searchSourceBuilder.size(limit);
SortField[] sortFields = new SortField[2];
sortFields[0]= new SortField("field1", SortField.Type.LONG);
sortFields[1]= new SortField("field2", SortField.Type.LONG);
FieldDoc fieldDoc = new FieldDoc(0,0); //Is this correct? how to initialize field doc?
fieldDoc.fields = new Object[2];
fieldDoc.fields[0] = new Long("-156034");
fieldDoc.fields[1] = new Long("2297416849");
SearchAfterSortedDocQuery query = new SearchAfterSortedDocQuery(new Sort(sortFields), fieldDoc);
searchRequest.source(searchSourceBuilder);
searchRequest.indices("index1");
try {
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}
catch (IOException e) {
    System.out.println(e);
}
}
Instead of using SearchAfterSortedDocQuery, set search_after directly on the SearchSourceBuilder. The request must also define a sort, and the values you pass must line up with the sort fields in order:
searchSourceBuilder.searchAfter(new Object[]{sortAfterValue});
Then run the SearchRequest through the REST client to get the response:
SearchRequest searchRequest = new SearchRequest("index");
searchRequest.types("type");
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
SearchHits hits = searchResponse.getHits();
Finally, keep the sort values of the last hit, obtained with getSortValues(), and pass them as search_after to fetch the next page:
hits.getAt(lastIndex).getSortValues()
See the Elasticsearch search_after API documentation.
Hope this helps.
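The paging loop described above can be sketched without the client library. SearchAfterPager, Hit and PageFetcher are hypothetical stand-ins: in real code, fetch would call searchSourceBuilder.searchAfter(cursor) when the cursor is not null, execute the search, and the sort values would come from getSortValues() on each hit.

```java
import java.util.ArrayList;
import java.util.List;

class SearchAfterPager {
    /** Hypothetical stand-in for a search hit; only its sort values matter here. */
    static final class Hit {
        final Object[] sortValues;

        Hit(Object[] sortValues) {
            this.sortValues = sortValues;
        }
    }

    /** Hypothetical callback: returns one page of hits after the given cursor (null for the first page). */
    interface PageFetcher {
        List<Hit> fetch(Object[] searchAfter);
    }

    /** Keeps requesting pages, feeding the last hit's sort values back in, until a page is empty. */
    static List<Hit> fetchAll(PageFetcher fetcher) {
        List<Hit> all = new ArrayList<>();
        Object[] cursor = null; // no search_after on the first request
        while (true) {
            List<Hit> page = fetcher.fetch(cursor);
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
            cursor = page.get(page.size() - 1).sortValues; // becomes search_after for the next page
        }
        return all;
    }
}
```

For a front-end API you would usually not loop like this. Instead, return the last hit's sort values to the caller as an opaque cursor and accept it back on the next request. If the sort fields are not unique across documents, adding a unique tiebreaker field to the sort avoids skipped or repeated hits between pages.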

Elasticsearch multiple condition query using java api

There are multiple documents containing around 100 fields each. I'd like to perform the following search through the Elasticsearch Java API 5.x.
There are 3 fields I'd like to use for this search:
department
job
name
I'd like the search to return documents that match all three fields, for example "department: D1" or "department: D2", "job: J1" or "job: J2", and "name: N1".
I've been trying to do it this way:
String[] departments = {"d1", "d2", "d3"};
String[] jobs = {"j1", "j2", "j3"};
String[] names = {"n1"};
MultiSearchRequestBuilder requestBuilder = client.prepareMultiSearch();
requestBuilder.add(client.prepareSearch().setQuery(QueryBuilders.termsQuery("department", departments)));
requestBuilder.add(client.prepareSearch().setQuery(QueryBuilders.termsQuery("job", jobs)));
requestBuilder.add(client.prepareSearch().setQuery(QueryBuilders.termsQuery("name", names)));
MultiSearchResponse response = requestBuilder.get();
However, the queries are executed as if each were an individual query. In this example, when j3 exists in d4, the document with d4 is matched as well.
How can I perform the search the way I described? I've tried numerous different queries and nothing seems to work. Is there something I am missing?
You don't need MultiSearchRequestBuilder. Simply combine your three constraints in a bool/filter query:
BoolQueryBuilder query = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termsQuery("department", departments))
        .filter(QueryBuilders.termsQuery("job", jobs))
        .filter(QueryBuilders.termsQuery("name", names));
SearchResponse resp = client.prepareSearch().setQuery(query).get();
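To see why this fixes the d4/j3 case, here is a small in-memory model of the matching logic, not of Elasticsearch itself. The hypothetical BoolFilterSemantics class treats a document as a map of field name to exact value, assuming the fields are not analyzed. A terms clause matches if the field holds any of the listed values, and chaining clauses with and mirrors how every filter clause of a bool query has to match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class BoolFilterSemantics {
    /** Models a terms clause: true if the document's field equals any of the allowed values. */
    static Predicate<Map<String, String>> terms(String field, String... allowed) {
        Set<String> allowedSet = new HashSet<>(Arrays.asList(allowed));
        return doc -> allowedSet.contains(doc.get(field));
    }

    /** Models the bool/filter query from the answer: all three clauses must match. */
    static Predicate<Map<String, String>> exampleQuery() {
        return terms("department", "d1", "d2", "d3")
                .and(terms("job", "j1", "j2", "j3"))
                .and(terms("name", "n1"));
    }
}
```

A document with department d4, job j3 and name n1 passes the job and name clauses but fails the department clause, so exampleQuery() rejects it. Three separate searches, as in the multi-search attempt, would each have returned it for the clauses it does match.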
For Elasticsearch 5.6.4 with the RestHighLevelClient, add as many source builders as required...
static RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));

public static void multisearch() throws IOException {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(QueryBuilders.termQuery("name", "vijay1"));
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("posts-1", "posts-2").source(sourceBuilder);
    SearchResponse searchResponse = client.search(searchRequest);
    RestStatus status = searchResponse.status();
    System.out.println(searchResponse.toString());
}
