Scroll over aggregations with Elasticsearch - Java

I am using Elasticsearch and running a terms aggregation on a specific field of my documents.
I would like to get all of the aggregation buckets, but the request only returns the first 10. I tried to use the Elasticsearch scroll, but it applies to the request and not to the aggregation: the documents returned are scrolled correctly, but the aggregations stay the same.
Has anyone had the same issue?
Here is my request in Java:
SearchResponse response = client.prepareSearch("index")
        .setTypes("type")
        .setQuery(QueryBuilders.matchAllQuery())
        .addAggregation(AggregationBuilders.terms("TermsAggr").field("aggField"))
        .execute().actionGet();
And here is how I am getting the buckets:
Terms terms = response.getAggregations().get("TermsAggr");
Collection<Bucket> buckets = terms.getBuckets();

The default return size for an aggregation result is 10 items, which is why you are only getting 10 results back. You will need to set the size on the AggregationBuilders object to a larger value in order to return more or all of the values in the bucket.
SearchResponse response = client.prepareSearch("index")
.setTypes("type")
.setQuery(QueryBuilders.matchAllQuery())
.addAggregation(AggregationBuilders.terms("TermsAggr")
.field("aggField").size(100))
.execute().actionGet();
If you always want to return all values, you can set size(0). However, you will need to upgrade to ES 1.1, as this was added in that release via Issue 4837.
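A minimal sketch of that variant (assuming ES 1.1+, where size(0) means "return all terms"):
SearchResponse response = client.prepareSearch("index")
        .setTypes("type")
        .setQuery(QueryBuilders.matchAllQuery())
        .addAggregation(AggregationBuilders.terms("TermsAggr")
                .field("aggField")
                .size(0)) // 0 = all terms (ES 1.1+)
        .execute().actionGet();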

Related

How to query multi-valued array fields in Elasticsearch using Java client?

Using the Elasticsearch High Level REST Client for Java v7.3
I have a few fields in the schema that look like this:
{
"document_type" : ["Utility", "Credit"]
}
Basically, one field could have an array of strings as its value. I not only need to query for a specific document_type but also apply a general string query.
I've tried the following code:
QueryBuilder query = QueryBuilders.boolQuery()
.must(QueryBuilders.queryStringQuery(terms))
.filter(QueryBuilders.termQuery("document_type", "Utility"));
...which does not return any results. If I remove the .filter() part, the query returns fine, but the filter appears to prevent any results from coming back. I suspect it's because document_type is a multi-valued array, though maybe I'm wrong. How would I build a query that searches all documents for specific terms but also filters by document_type?
I think the reason is the wrong query type. Consider using the terms query instead of the term query. There is also an equivalent in the Java API.
Here is a good overview of the Query DSL queries and their equivalents in the High Level REST Client: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-query-builders.html
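A minimal sketch of that suggestion, reusing the bool query from the question (note: if document_type is an analyzed text field, the filter may need to target a keyword sub-field such as document_type.keyword; that mapping detail is an assumption):
QueryBuilder query = QueryBuilders.boolQuery()
        .must(QueryBuilders.queryStringQuery(terms))
        .filter(QueryBuilders.termsQuery("document_type", "Utility", "Credit"));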

Cannot use _source with pagination in Spring Data Elasticsearch

I am facing multiple weird problems when trying to use _source in a query with pagination.
If I use the stream API, the sourceFilter is discarded entirely, so this query will not generate the _source JSON attribute in the request:
SourceFilter sourceFilter = new FetchSourceFilter(new String[]{"emails.sha256"}, null);
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(query)
.withSourceFilter(sourceFilter)
.withPageable(PageRequest.of(0, pageSize))
.build();
elasticsearchTemplate.stream(searchQuery, clazz);
On the other hand, if I replace the stream method with queryForPage
elasticsearchTemplate.queryForPage(searchQuery, clazz);
the Elasticsearch query properly generates the _source JSON attribute, but then I face issues with the pagination when the from attribute gets large. The error I get is:
{
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10002]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
}
I cannot modify max_result_window, because from will always be big (I have billions of documents).
I also tested startScroll, which should solve the pagination problem, but I got a weird NoSuchMethodError:
java.lang.NoSuchMethodError: org.springframework.data.elasticsearch.core.ElasticsearchTemplate.startScroll(JLorg/springframework/data/elasticsearch/core/query/SearchQuery;Ljava/lang/Class;)Lorg/springframework/data/elasticsearch/core/ScrolledPage;
I am using Spring Data Elasticsearch 3.2.0.BUILD-SNAPSHOT and Elasticsearch 6.5.4
Any idea how I can paginate a query while limiting the response data using _source?
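For reference, here is a minimal sketch of how that startScroll signature is meant to be used (reconstructed from the method descriptor in the error above; a NoSuchMethodError like this usually indicates an older spring-data-elasticsearch jar on the runtime classpath, so this only runs once compile-time and runtime versions match; MyDoc stands in for the actual document class):
long scrollTimeMillis = 60_000;
ScrolledPage<MyDoc> page = elasticsearchTemplate.startScroll(scrollTimeMillis, searchQuery, MyDoc.class);
while (page.hasContent()) {
    // process page.getContent() ...
    page = elasticsearchTemplate.continueScroll(page.getScrollId(), scrollTimeMillis, MyDoc.class);
}
elasticsearchTemplate.clearScroll(page.getScrollId());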

Lucidworks Fusion 4.1 transform result documents using Javascript query pipeline

How can I transform the Solr response using a JavaScript query pipeline in Lucidworks Fusion 4.1? For example, I have the following response:
[
    {
        "doc_type": "type1",
        "publicationDate": "2018/10/10",
        "sortDate": "2017/9/9"
    },
    {
        "doc_type": "type2",
        "publicationDate": "2018/5/5",
        "sortDate": "2017/12/12"
    }
]
And I need to change it according to the following conditions:
If doc_type = type1, put sortDate into publicationDate and remove sortDate; otherwise, just remove sortDate.
How can I manipulate the response? There is no documentation on the official website.
Currently, you cannot modify the Solr response; all you can do is add to it. So you could add a new block of JSON, include the "id" of the item, and then list the fields and values you want to use in your UI.
Otherwise, you need to make the change in your Index Pipeline (as long as the value doesn't need to change based on the query).

Dynamodb AWS Java scan withLimit is not working

I am trying to use a DynamoDBScanExpression with a limit of 1, using Java aws-sdk version 1.11.140.
Even if I use .withLimit(1), i.e.
List<DomainObject> result = mapper.scan(DomainObject.class, new DynamoDBScanExpression().withLimit(1));
it returns a list of all entries, i.e. 7. Am I doing something wrong?
P.S. I tried using the CLI for this query, and
aws dynamodb scan --table-name auditlog --limit 1 --endpoint-url http://localhost:8000
returns just 1 result.
DynamoDBMapper.scan returns a PaginatedScanList. Paginated results are loaded on demand when the user executes an operation that requires them. Some operations, such as size(), must fetch the entire list, but results are lazily fetched page by page when possible.
Hence, the limit parameter set on DynamoDBScanExpression is the maximum number of items to fetch per page.
So in your case, a PaginatedScanList is returned, and calling size() on it attempts to load all items from DynamoDB. Under the hood, the items were loaded one per page (each page being a fetch request to DynamoDB) until the end of the list was reached.
Since you're only interested in the first result, a good way to get that without fetching all the 7 items from Dynamo would be :
Iterator<DomainObject> it = mapper.scan(DomainObject.class, new DynamoDBScanExpression().withLimit(1)).iterator();
if (it.hasNext()) {
    DomainObject dob = it.next();
}
With the above code, only the first item will be fetched from DynamoDB.
The takeaway: the limit parameter in DynamoDBScanExpression is for pagination purposes only. It is a limit on the number of items per page, not a limit on the total number of items returned.
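If only one page is ever needed, another option is DynamoDBMapper.scanPage, which fetches exactly one page and never lazy-loads further results (a sketch assuming the same mapper and DomainObject as above):
// Fetch a single page of at most 1 item; no additional pages are requested.
ScanResultPage<DomainObject> page = mapper.scanPage(DomainObject.class,
        new DynamoDBScanExpression().withLimit(1));
List<DomainObject> results = page.getResults();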

Elasticsearch refreshing index automatically with index.refresh_interval=-1?

I am using Elasticsearch with the Java API.
I am indexing offline data with big bulk inserts, so I set index.refresh_interval=-1.
I don't refresh the index "manually" anywhere.
It seems that a refresh happens at some point, because queries do return data. The only scenario where the data wasn't returned was when I tested with just a few documents and queried immediately after insertion (using the same Client object).
I wonder if an index refresh is triggered implicitly by Elasticsearch or by the Java library at some stage, even with index.refresh_interval=-1?
Or how else could this behavior be explained?
Client generation:
Client client = TransportClient.builder().settings(settings)
.build()
.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(address),port));
Insertion:
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (MyObject object : list) {
bulkRequest.add(client.prepareIndex(index, type)
.setSource(XContentFactory.jsonBuilder()
.startObject()
// ... add object fields here ...
.endObject()
));
}
BulkResponse bulkResponse = bulkRequest.get();
Querying:
QueryBuilder query = ...;
SearchResponse resp = client.prepareSearch(index)
.setQuery(query)
.setSize(Integer.MAX_VALUE)
// adding fields here
.get();
SearchHit[] hits = resp.getHits().getHits();
The reason the documents were searchable despite the refresh interval being disabled could be either the index buffer filling up (resulting in the creation of a Lucene segment) or the translog being full (resulting in a commit of the Lucene segment); either of these makes the documents searchable.
As per the documentation:
"By default, Elasticsearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory."
Also, the index buffer settings can be tuned, as in the sketch below.
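For example, in elasticsearch.yml (a node-level setting; 10% of heap is the default, so the value here is purely illustrative):
indices.memory.index_buffer_size: 10%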
This article is a good read with regard to how data becomes searchable and durable.
You can also look at this SO thread, written by one of the Elasticsearch contributors, for more details on flush vs. refresh.
You can use indices stats to verify all this, i.e. to check whether there was a flush or refresh.
Example :
GET <index_name>/_stats/refresh
GET <index_name>/_stats/flush
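If deterministic visibility is needed right after a bulk insert, a refresh can also be triggered manually from the Java client (a sketch using the same TransportClient as in the question):
// Force a refresh so the just-indexed documents become searchable immediately.
client.admin().indices().prepareRefresh(index).get();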
