Elasticsearch refreshing index automatically with index.refresh_interval=-1? - java

I am using Elasticsearch with the Java API.
I am indexing offline data with big bulk inserts, so I have set index.refresh_interval=-1
I don't refresh the index "manually" anywhere.
It seems that refresh is done at some point, because queries do return data. The only scenario where the data wasn't returned was when I tested with just a few documents, and querying was done immediately after insertion (using the same Client object).
I wonder if index refresh is called implicitly by Elasticsearch or by the Java library at some stage, even when index.refresh_interval=-1?
Or how else could the behavior be explained?
Client generation:
Client client = TransportClient.builder().settings(settings)
.build()
.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(address),port));
Insertion:
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (MyObject object : list) {
    bulkRequest.add(client.prepareIndex(index, type)
            .setSource(XContentFactory.jsonBuilder()
                    .startObject()
                    // ... add object fields here ...
                    .endObject()));
}
BulkResponse bulkResponse = bulkRequest.get();
Querying:
QueryBuilder query = ...;
SearchResponse resp = client.prepareSearch(index)
        .setQuery(query)
        .setSize(Integer.MAX_VALUE)
        // adding fields here
        .get();
SearchHit[] hits = resp.getHits().getHits();

The reason the documents were searchable despite the refresh interval being disabled is most likely that either the index buffer filled up (creating a new Lucene segment) or the translog filled up (triggering a flush that commits a Lucene segment); either event makes the documents searchable.
As per the documentation
By default, Elasticsearch uses memory heuristics in order to
automatically trigger flush operations as required in order to clear
memory.
The index buffer can also be tuned via the indices.memory.index_buffer_size node-level setting.
This article is a good read with regard to how data is searchable and durable.
You can also look at this SO thread, written by one of the Elasticsearch contributors, for more details on flush vs. refresh.
You can use the indices stats API to verify all this, i.e. check whether a flush or refresh has taken place.
Example :
GET <index_name>/_stats/refresh
GET <index_name>/_stats/flush
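As an aside, if you deliberately disable refresh for a bulk load, the usual pattern is to restore the interval and force a single refresh once indexing is done, so visibility no longer depends on those memory heuristics. A rough sketch against the 2.x transport client used in the question (admin API and Settings builder names assumed from that API generation; adjust to your client version):

import org.elasticsearch.common.settings.Settings;

// Disable automatic refresh before the bulk load (same index variable as in the question).
client.admin().indices().prepareUpdateSettings(index)
        .setSettings(Settings.settingsBuilder()
                .put("index.refresh_interval", "-1")
                .build())
        .get();

// ... run the bulk inserts here ...

// Restore the default interval and force one refresh so the new documents become searchable immediately.
client.admin().indices().prepareUpdateSettings(index)
        .setSettings(Settings.settingsBuilder()
                .put("index.refresh_interval", "1s")
                .build())
        .get();
client.admin().indices().prepareRefresh(index).get();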

Related

Cannot use _source with pagination in Spring Data Elasticsearch

I am facing multiple weird problems when trying to use _source in a query with pagination.
If I use the stream API, the sourceFilter is discarded entirely, so this query will not generate the _source json attribute in the query:
SourceFilter sourceFilter = new FetchSourceFilter(new String[]{"emails.sha256"}, null);
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(query)
.withSourceFilter(sourceFilter)
.withPageable(PageRequest.of(0, pageSize))
.build();
elasticsearchTemplate.stream(searchQuery, clazz)
On the other hand, if I replace the stream method with queryForPage
elasticsearchTemplate.queryForPage(searchQuery, clazz)
The Elasticsearch query properly generates the _source json attribute, but then I face issues with pagination when the from attribute gets large. The error I get is:
{
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10002]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
}
I cannot modify max_result_window because it will always be big (I have billions of documents).
I also tested startScroll, which should solve the pagination problem, but I got a weird NoSuchMethodError:
java.lang.NoSuchMethodError: org.springframework.data.elasticsearch.core.ElasticsearchTemplate.startScroll(JLorg/springframework/data/elasticsearch/core/query/SearchQuery;Ljava/lang/Class;)Lorg/springframework/data/elasticsearch/core/ScrolledPage;
I am using Spring Data Elasticsearch 3.2.0.BUILD-SNAPSHOT and Elasticsearch 6.5.4
Any idea how I can paginate a query while limiting the response data using _source?

Dynamodb AWS Java scan withLimit is not working

I am trying to use DynamoDBScanExpression with a limit of 1, using Java aws-sdk version 1.11.140.
Even though I use .withLimit(1), i.e.
List<DomainObject> result = mapper.scan(new DynamoDBScanExpression().withLimit(1));
it still returns the list of all 7 entries. Am I doing something wrong?
P.S. I tried the same query with the CLI, and
aws dynamodb scan --table-name auditlog --limit 1 --endpoint-url http://localhost:8000
returns me just 1 result.
DynamoDBMapper.scan will return a PaginatedScanList - Paginated results are loaded on demand when the user executes an operation that requires them. Some operations, such as size(), must fetch the entire list, but results are lazily fetched page by page when possible.
Hence, the limit parameter set on DynamoDBScanExpression is the maximum number of items to fetch per page.
So in your case a PaginatedList is returned, and when you call size() on it, it attempts to load all items from DynamoDB; under the hood the items are loaded one per page (each page is a fetch request to DynamoDB) until it gets to the end of the PaginatedList.
Since you're only interested in the first result, a good way to get it without fetching all 7 items from DynamoDB would be:
Iterator<DomainObject> it = mapper.scan(DomainObject.class, new DynamoDBScanExpression().withLimit(1)).iterator();
if (it.hasNext()) {
    DomainObject dob = it.next();
}
With the above code, only the first page (a single item) is fetched from DynamoDB.
The takeaway is that the limit parameter in DynamoDBScanExpression (and DynamoDBQueryExpression) is used for pagination purposes only. It is a limit on the number of items per page, not a cap on the total number of items the scan will return.
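If you only ever want a single item, another option is to ask the mapper for one page explicitly via scanPage, which never loads further pages. A rough sketch, reusing DomainObject and mapper from the question (API names assumed from aws-sdk 1.11.x):

import java.util.List;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;
import com.amazonaws.services.dynamodbv2.datamodeling.ScanResultPage;

// Fetch exactly one page containing at most one item; no further pages are requested.
ScanResultPage<DomainObject> page =
        mapper.scanPage(DomainObject.class, new DynamoDBScanExpression().withLimit(1));
List<DomainObject> results = page.getResults();              // 0 or 1 items
DomainObject first = results.isEmpty() ? null : results.get(0);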

Scroll over aggregations with Elastic Search

I am using Elasticsearch with an aggregation on my documents. My aggregation is a terms aggregation on a specific field.
I would like to get all the aggregation buckets, but the request only returns the first 10. I tried to use Elasticsearch's scroll, but it applies to the request and not to the aggregation: the documents are scrolled correctly, but the aggregations stay the same.
Has anyone had the same issue?
Here is my request in Java:
SearchResponse response = client.prepareSearch("index").setTypes("type").setQuery(QueryBuilders.matchAllQuery())
.addAggregation(AggregationBuilders.terms("TermsAggr").field("aggField")).execute().actionGet();
And here is how I am getting the buckets:
Terms terms = response.getAggregations().get("TermsAggr");
Collection<Bucket> buckets = terms.getBuckets();
The default return size for a terms aggregation is 10 buckets, which is why you are only getting 10 results back. You will need to set the size on the aggregation builder to a larger value in order to return more (or all) of the buckets.
SearchResponse response = client.prepareSearch("index")
.setTypes("type")
.setQuery(QueryBuilders.matchAllQuery())
.addAggregation(AggregationBuilders.terms("TermsAggr")
.field("aggField").size(100))
.execute().actionGet();
If you always want to return all values, you can set size(0). However, you will need ES 1.1 or later, as this was added in that release via Issue 4837.
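For completeness, a sketch of the size(0) variant on the 1.x line this answer refers to; it is the same request as above, just asking the terms aggregation for all buckets (note that much newer Elasticsearch versions no longer accept a size of 0 here):

SearchResponse response = client.prepareSearch("index")
        .setTypes("type")
        .setQuery(QueryBuilders.matchAllQuery())
        .addAggregation(AggregationBuilders.terms("TermsAggr")
                .field("aggField")
                .size(0))   // 0 = return all terms (ES 1.1+)
        .execute().actionGet();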

Couchbase 2.0 Java SDK 1.1 - Synchronous Add and Views

I am trying to create a JUnit test. Scenario:
setUp: I'm adding two JSON documents to the database
Test: I'm getting those documents using the view
tearDown: I'm removing both objects
My view:
function (doc, meta) {
    if (doc.type && doc.type == "UserConnection") {
        emit([doc.providerId, doc.providerUserId], doc.userId);
    }
}
This is how I add those documents to the database and make sure that the add is synchronous:
public boolean add(String key, Object element) {
    String json = gson.toJson(element);
    OperationFuture<Boolean> result = couchbaseClient.add(key, 0, json);
    return result.get();
}
JSON Documents that I'm adding are:
{"userId":"1","providerId":"test_pId","providerUserId":"test_pUId","type":"UserConnection"}
{"userId":"2","providerId":"test_pId","providerUserId":"test_pUId","type":"UserConnection"}
This is how I call the view:
View view = couchbaseClient.getView(DESIGN_DOCUMENT_NAME, VIEW_NAME);
Query query = new Query();
query.setKey(ComplexKey.of("test_pId", "test_pUId"));
ViewResponse viewResponse = couchbaseClient.query(view, query);
Problem:
The test fails due to an unexpected number of elements being fetched from the view.
My observations:
Sometimes the tests pass
The number of elements fetched from the view is not consistent (from 0 to 2)
When I added those documents to the database beforehand instead of in setUp, the test passed every time
According to this documentation http://www.couchbase.com/docs/couchbase-sdk-java-1.1/create-update-docs.html I'm adding those JSON documents synchronously by calling get() on the returned Future object.
My question:
Is there something wrong with how I've approached fetching data from the view just after this data was inserted into the DB? Is there any good practice for solving this problem? And can someone please explain what I did wrong?
Thanks,
Dariusz
In Couchbase 2.0, documents are required to be written to disk before they will show up in a view. There are three ways you can do an operation with the Java SDK. The first is asynchronous, which means that you just send the data and at a later time check to make sure that the data was received correctly. If you do an asynchronous operation and then immediately call .get() as you did above, then you have created a synchronous operation. When an operation returns success in these two cases, you are only guaranteed that the item has been written into memory. Your test passed sometimes only because you were lucky enough that both items were written to disk before you ran your query.
The third way to do an operation is with durability requirements, and this is the one you want for your tests. Durability requirements allow you to say that you want an item to be written to disk or replicated before success is returned to the client. Take a look at the following function.
https://github.com/couchbase/couchbase-java-client/blob/1.1.0/src/main/java/com/couchbase/client/CouchbaseClient.java#L1293
You will want to use this function and set the PersistTo parameter to MASTER.
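A minimal sketch of what that could look like for the add method in the question, assuming the durability overload of add in the 1.1 SDK and the PersistTo enum it ships with (exact package and signature may differ slightly across client versions):

// PersistTo is assumed to come from the bundled spymemcached classes in this SDK generation.
import net.spy.memcached.PersistTo;
import net.spy.memcached.internal.OperationFuture;

public boolean add(String key, Object element) throws Exception {
    String json = gson.toJson(element);
    // Block until the document has been persisted to disk on the master node,
    // so a view query issued right afterwards can index it.
    OperationFuture<Boolean> result = couchbaseClient.add(key, 0, json, PersistTo.MASTER);
    return result.get();
}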

Retrieving the UUID from Apache Solr after a commit

I am using Solrj to add new documents to a Solr instance. In my document schema the id is a UUID (solr.UUIDField). Each time a document is created the id is filled with the unique id, which is exactly what I want. Sometimes it's necessary in my application to retrieve this unique id so I can add it as a field value when inserting another document. So my question is: how can I retrieve this generated UUID from Solr after adding a document?
Solrj returns an UpdateResponse object after committing, but I don't know how to get the new UUID out of it.
I am adding a document like this
CommonsHttpSolrServer server = new CommonsHttpSolrServer(MY_SERVER_URL);
SolrInputDocument doc = new SolrInputDocument();
// [...] multiple addField calls
server.add(doc);
UpdateResponse ur = server.commit();
AFAIK you aren't going to ever get a UUID from an add or a commit. When you do an add or commit, the update request handler gives you back query time and status, but not much else (assuming it is successful). You can actually see what is in the HTTP response by running a manual add/commit like so:
http://localhost:8983/solr/update?stream.body=<add><doc><field name="id">test</field><field name="title">test title</field></doc></add>
http://localhost:8983/solr/update?stream.body=<commit/>
If you run those queries in a web browser, they will submit a test document and commit it, respectively. You will then be able to see what information is available to SolrJ (not much).
You could write your own (modified) update handler in Java, but that seems like a ton of work. You could also enable the "timestamp" field in your Solr schema so you can query solr by last modified date and find the items you just committed.
Both of those methods would be major hacks, though. Your best bet is to figure out a unique ID for your documents before you submit them to Solr, then use that unique ID to retrieve them. Using a generated UUID is more of a "fire and forget about this" method. Since you don't want to forget, you will need to generate your own UUID.
Since you're using Java, it should be dead simple to do with UUID, using some code like this:
CommonsHttpSolrServer server = new CommonsHttpSolrServer(MY_SERVER_URL);
SolrInputDocument doc = new SolrInputDocument();
UUID uuid = UUID.randomUUID();
doc.addField("id", uuid.toString());
// [...] multiple addField calls
server.add(doc);
UpdateResponse ur = server.commit();
