I am facing multiple weird problems when trying to use _source in a query with pagination.
If I use the stream API, the sourceFilter is discarded entirely, so this query does not produce a _source attribute in the generated Elasticsearch request:
SourceFilter sourceFilter = new FetchSourceFilter(new String[]{"emails.sha256"}, null);
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(query)
.withSourceFilter(sourceFilter)
.withPageable(PageRequest.of(0, pageSize))
.build();
elasticsearchTemplate.stream(searchQuery, clazz)
On the other hand, if I replace the stream method with queryForPage
elasticsearchTemplate.queryForPage(searchQuery, clazz)
The Elasticsearch query properly generates the _source attribute, but then I run into pagination issues once the from attribute grows large. The error I get is:
{
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10002]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
}
I cannot simply raise max_result_window, because from will keep growing past any limit (I have billions of documents).
I also tested startScroll, which should solve the pagination problem, but I got a strange NoSuchMethodError:
java.lang.NoSuchMethodError: org.springframework.data.elasticsearch.core.ElasticsearchTemplate.startScroll(JLorg/springframework/data/elasticsearch/core/query/SearchQuery;Ljava/lang/Class;)Lorg/springframework/data/elasticsearch/core/ScrolledPage;
I am using Spring Data Elasticsearch 3.2.0.BUILD-SNAPSHOT and Elasticsearch 6.5.4
Any idea how I can paginate a query while limiting the response data with _source?
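As a workaround below the Spring Data layer, the scroll API honors _source filtering directly; here is a sketch of the raw requests, with <index_name> as a placeholder and match_all standing in for the real query:

POST <index_name>/_search?scroll=1m
{
  "size": 100,
  "_source": ["emails.sha256"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id from the previous response>"
}

Each scroll response carries the next scroll_id, so the second request is repeated until no hits are returned; this avoids the from + size limit entirely.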
Related
Using the Elasticsearch High Level REST Client for Java v7.3
I have a few fields in the schema that look like this:
{
"document_type" : ["Utility", "Credit"]
}
Basically, a field can have an array of strings as its value. I need to filter on a specific document_type and also run a general string query.
I've tried the following code:
QueryBuilder query = QueryBuilders.boolQuery()
.must(QueryBuilders.queryStringQuery(terms))
.filter(QueryBuilders.termQuery("document_type", "Utility"));
...which does not return any results. If I remove the .filter() part, the query returns results, but the filter seems to prevent anything from coming back. I suspect it's because document_type is a multi-valued array, though maybe I'm wrong. How would I build a query that searches all documents for specific terms but also filters by document_type?
I think the reason is the wrong query type. Consider using the terms query instead of the term query; there is an equivalent in the Java API.
Here is a good overview of the Query DSL queries and their equivalents in the High Level REST Client: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-query-builders.html
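For reference, the raw Query DSL equivalent would look roughly like the sketch below (the index name and query string are placeholders); in the Java API it corresponds to QueryBuilders.termsQuery("document_type", "Utility"):

GET <index_name>/_search
{
  "query": {
    "bool": {
      "must": { "query_string": { "query": "<terms>" } },
      "filter": { "terms": { "document_type": ["Utility"] } }
    }
  }
}

The terms query matches if any of the supplied values occurs in the multi-valued field.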
How can I transform a Solr response using a JavaScript query pipeline in Lucidworks Fusion 4.1? For example, I have the following response:
[
  {
    "doc_type": "type1",
    "publicationDate": "2018/10/10",
    "sortDate": "2017/9/9"
  },
  {
    "doc_type": "type2",
    "publicationDate": "2018/5/5",
    "sortDate": "2017/12/12"
  }
]
And I need to change it with the following conditions:
If doc_type = type1, copy sortDate into publicationDate and remove sortDate; otherwise, only remove sortDate.
How can I manipulate the response? There is no documentation on the official website.
Currently, you cannot modify the Solr response; all you can do is add to it. So you could add a new block of JSON that includes the "id" of the item and then lists the fields and values you want to use in your UI.
Otherwise, you need to make the change in your Index Pipeline (as long as the value doesn't need to change based on the query).
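If the change is made in the Index Pipeline (or any code stage), the rule itself is only a few lines. Below is a minimal plain-Java sketch of the transformation logic, with documents modeled as maps; the transform helper is hypothetical and not part of the Fusion API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ResponseTransform {

    // Apply the stated rule to each document: if doc_type is "type1",
    // copy sortDate into publicationDate; in all cases drop sortDate.
    static List<Map<String, String>> transform(List<Map<String, String>> docs) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            Map<String, String> copy = new HashMap<>(doc);
            if ("type1".equals(copy.get("doc_type"))) {
                copy.put("publicationDate", copy.get("sortDate"));
            }
            copy.remove("sortDate");
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> d1 = new HashMap<>();
        d1.put("doc_type", "type1");
        d1.put("publicationDate", "2018/10/10");
        d1.put("sortDate", "2017/9/9");

        Map<String, String> d2 = new HashMap<>();
        d2.put("doc_type", "type2");
        d2.put("publicationDate", "2018/5/5");
        d2.put("sortDate", "2017/12/12");

        List<Map<String, String>> result = transform(List.of(d1, d2));
        System.out.println(result.get(0).get("publicationDate")); // 2017/9/9
        System.out.println(result.get(1).get("publicationDate")); // 2018/5/5
        System.out.println(result.get(0).containsKey("sortDate")); // false
    }
}
```

In a real Fusion stage the same logic would be written in JavaScript against the pipeline's document objects, but the conditional copy-then-remove shape is identical.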
I'm looking to perform a query on my Couchbase database using the Java client SDK, which will return a list of results that include the document id for each result. Currently I'm using:
Statement stat = select("*").from(i("myBucket"))
.where(x(fieldIwantToGet).eq(s(valueIwantToGet)));
N1qlQueryResult result = bucket.query(stat);
However, N1qlQueryResult seems to return only a list of JsonObjects without any of the associated metadata. Looking at the documentation, it seems I want a method that returns a list of Document objects, but I can't see any bucket methods that do the job.
Anyone know a way of doing this?
You need to use the query below to get the document id:
Statement stat = select("meta(myBucket).id").from(i("myBucket"))
.where(x(fieldIwantToGet).eq(s(valueIwantToGet)));
The above returns an array of document ids.
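In raw N1QL, the same idea looks like the sketch below (the b alias and the field/value names are placeholders); META(b).id exposes the document id alongside the document body:

SELECT META(b).id, b.*
FROM `myBucket` b
WHERE fieldIwantToGet = "valueIwantToGet";

Selecting META(b).id together with b.* gives you each document's fields plus its id in a single row, which avoids a second lookup per result.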
I am using Elasticsearch with the Java API.
I am indexing offline data with big bulk inserts, so I set index.refresh_interval=-1
I don't refresh the index "manually" anywhere.
It seems that a refresh happens at some point, because queries do return data. The only scenario where the data wasn't returned was when I tested with just a few documents and queried immediately after insertion (using the same Client object).
I wonder if an index refresh is triggered implicitly by Elasticsearch or by the Java library at some stage, even with index.refresh_interval=-1?
Or how else could the behavior be explained?
Client generation:
Client client = TransportClient.builder().settings(settings)
.build()
.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(address),port));
Insertion:
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (MyObject object : list) {
bulkRequest.add(client.prepareIndex(index, type)
.setSource(XContentFactory.jsonBuilder()
.startObject()
// ... add object fields here ...
.endObject()
));
}
BulkResponse bulkResponse = bulkRequest.get();
Querying:
QueryBuilder query = ...;
SearchResponse resp = client.prepareSearch(index)
.setQuery(query)
.setSize(Integer.MAX_VALUE)
// adding fields here
.get();
SearchHit[] hits = resp.getHits().getHits();
The reason the documents were searchable despite the refresh interval being disabled could be that either the index buffer filled up, resulting in the creation of a Lucene segment, or the translog filled up, resulting in a commit of the Lucene segment; either event makes the documents searchable.
As per the documentation
By default, Elasticsearch uses memory heuristics in order to
automatically trigger flush operations as required in order to clear
memory.
The index buffer settings can also be tuned.
This article is a good read with regard to how data is searchable and durable.
You can also look at this SO thread, written by one of the Elasticsearch contributors, for more details on flush vs. refresh.
You can use indices-stats to verify all this, i.e. check whether a flush or refresh occurred.
Example:
GET <index_name>/_stats/refresh
GET <index_name>/_stats/flush
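Once the bulk load is finished, you would typically restore the refresh interval and force an explicit refresh so that all documents become searchable deterministically rather than whenever a segment happens to be committed; a sketch, again with <index_name> as a placeholder:

PUT <index_name>/_settings
{ "index": { "refresh_interval": "1s" } }

POST <index_name>/_refresh

The settings update re-enables periodic refreshes; the explicit _refresh call makes everything indexed so far visible immediately.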
I am using Elasticsearch and running a terms aggregation on a specific field of my documents.
I would like to retrieve all the aggregation buckets, but the request only returns the first 10. I tried Elasticsearch's scroll, but it applies to the request's hits, not to the aggregation: the documents are scrolled correctly, but the aggregations stay the same.
Has anyone had the same issue?
Here is my request in Java:
SearchResponse response = client.prepareSearch("index").setTypes("type").setQuery(QueryBuilders.matchAllQuery())
.addAggregation(AggregationBuilders.terms("TermsAggr").field("aggField")).execute().actionGet();
And here is how I am getting the buckets:
Terms terms = response.getAggregations().get("TermsAggr");
Collection<Bucket> buckets = terms.getBuckets();
The default size of an aggregation result is 10 buckets, which is why you only get 10 back. You need to set the size on the terms aggregation builder to a larger value in order to return more (or all) values in the bucket.
SearchResponse response = client.prepareSearch("index")
.setTypes("type")
.setQuery(QueryBuilders.matchAllQuery())
.addAggregation(AggregationBuilders.terms("TermsAggr")
.field("aggField").size(100))
.execute().actionGet();
If you always want all values returned, you can set size(0). However, you will need to upgrade to ES 1.1, as this was added in that release via Issue 4837.
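For reference, the equivalent raw request would look like the sketch below (<index_name> is a placeholder); note that a terms size of 0, meaning "all buckets", only applies to the 1.x line discussed here and was removed in later major versions:

GET <index_name>/_search
{
  "size": 0,
  "query": { "match_all": {} },
  "aggs": {
    "TermsAggr": {
      "terms": { "field": "aggField", "size": 0 }
    }
  }
}

The top-level "size": 0 suppresses the hits themselves, so the response carries only the aggregation buckets.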