Elasticsearch SearchContextMissingException during 'scan & scroll' query with Spring Data Elasticsearch - java

I am using elasticsearch 2.2.0 with default cluster configuration. I encounter a problem with scan and scroll query using spring data elasticsearch. When I execute the query I get error like this:
[2016-06-29 12:45:52,046][DEBUG][action.search.type ] [Vector] [155597] Failed to execute query phase
RemoteTransportException[[Vector][10.132.47.95:9300][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [155597]];
Caused by: SearchContextMissingException[No search context found for id [155597]]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:611)
at org.elasticsearch.search.SearchService.executeScan(SearchService.java:311)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:433)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:430)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My 'scan & scroll' code:
public List<T> getAllElements(SearchQuery searchQuery) {
searchQuery.setPageable(new PageRequest(0, PAGE_SIZE));
String scrollId = elasticsearchTemplate.scan(searchQuery, 1000, false);
List<T> allElements = new LinkedList<>();
boolean hasRecords = true;
while (hasRecords) {
Page<T> page = elasticsearchTemplate.scroll(scrollId, 5000, resultMapper);
if (page.hasContent()) {
allElements.addAll(page.getContent());
} else {
hasRecords = false;
}
}
elasticsearchTemplate.clearScroll(scrollId);
return allElements;
}
When my query result size is less than PAGE_SIZE parameter, then error like this occurs five times. I guess that it is one per shard. When result size is bigger than PAGE_SIZE, the error occurs few times more. I've tried to refactor my code to not call:
Page<T> page = elasticsearchTemplate.scroll(scrollId, 5000, resultMapper);
when I'm sure that the page has no content. But it works only if PAGE_SIZE is bigger than query result, so it is no the solution at all.
I have to add that it is problem only on elasticsearch side. On the client side the errors is hidden and in each case the query result is correct. Has anybody knows what causes this issue?
Thank you for help,
Simon.

I get this error if the ElasticSearch system closes the connection. Typically it's exactly what #Val said - dead connections. Things sometimes die in ES for no good reason - master node down, data node is too congested, bad performing queries, Kibana running at the same time you are in middle of querying...I've been hit by all of these at one time or another to get this error.
Suggestion: Up the initial connection time - 1000L might be too short for it to get what it needs. It won't hurt if the query ends sooner.
This also happens randomly when I try to pull too much data quickly; you might have huge documents and trying to pull PAGESIZE of 50,000 might be a little too much. We don't know what you chose for PAGESIZE.
Suggestion: Lower PAGESIZE to something like 500. Or 20. See if these smaller values slow down the errors.
I know I have less of these problems after moving to ES 2.3.3.

This usually happens if your search context is not alive anymore.
In your case, you're starting your scan with a timeout of 1 second and then each scan is alive during 5 seconds. It's probably too low. The default duration to keep the search context alive is 1 minute, so you should probably increase it to 60 seconds like this:
String scrollId = elasticsearchTemplate.scan(searchQuery, 60000, false);
...
Page<T> page = elasticsearchTemplate.scroll(scrollId, 60000, resultMapper);

I ran into a similar problem and I suspect that Spring Data Elasticsearch has some internal bug about passing the Scroll-ID. In my case I just tried to scroll through the whole index and I can rule out #Val his answer about "This usually happens if your search context is not alive anymore", because the exceptions occurred regardless of the duration. Also the exceptions started after the first Page and occurred for every other paging query.
In my case I could simply use elasticsearchTemplate.stream(). It uses Scroll & Scan internally and seems to pass the Scroll-ID correctly. Oh, and it's simpler to use:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withQuery(QueryBuilders.matchAllQuery())
.withPageable(new PageRequest(0, 10000))
.build();
Iterator<Post> postIterator = elasticsearchTemplate.stream(searchQuery, Post.class);
while(postIterator.hasNext()) {
Post post = postIterator.next();
}

Related

Cassandra, Java and MANY Async request : is this good?

I'm developping a Java application with Cassandra with my table :
id | registration | name
1 1 xxx
1 2 xxx
1 3 xxx
2 1 xxx
2 2 xxx
... ... ...
... ... ...
100,000 34 xxx
My tables have very large amount of rows (more than 50,000,000). I have a myListIds of String id to iterate over. I could use :
SELECT * FROM table WHERE id IN (1,7,18, 34,...,)
//image more than 10,000,000 numbers in 'IN'
But this is a bad pattern. So instead I'm using async request this way :
List<ResultSetFuture> futures = new ArrayList<>();
Map<String, ResultSetFuture> map = new HashMap<>();
// map : key = id & value = data from Cassandra
for (String id : myListIds)
{
ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(id));
mapFutures.put(id, resultSetFuture);
}
Then I will process my data with getUninterruptibly() method.
Here is my problems : I'm doing maybe more than 10,000,000 Casandra request (one request for each 'id'). And I'm putting all these results inside a Map.
Can this cause heap memory error ? What's the best way to deal with that ?
Thank you
Note: your question is "is this a good design pattern".
If you are having to perform 10,000,000 cassandra data requests then you have structured your data incorrectly. Ultimately you should design your database from the ground up so that you only ever have to perform 1-2 fetches.
Now, granted, if you have 5000 cassandra nodes this might not be a huge problem(it probably still is) but it still reeks of bad database design. I think the solution is to take a look at your schema.
I see the following problems with your code:
Overloaded Cassandra cluster, it won't be able to process so many async requests, and you requests will be failed with NoHostAvailableException
Overloaded cassadra driver, your client app will fails with IO exceptions, because system will not be able process so many async requests.(see details about connection tuning https://docs.datastax.com/en/developer/java-driver/3.1/manual/pooling/)
And yes, memory issues are possible. It depends on the data size
Possible solution is limit number of async requests and process data by chunks.(E.g see this answer )

Rally REST Query throwing MalformedJsonException

I'm trying to query something from the Rally database. Right now I'm just trying to make sure I can get through initially. This code:
//create rallyrest object
RallyRestApi restApi = new RallyRestApi(new URI(hostname), username, password);
restApi.setApplicationName("QueryTest");
restApi.setWsapiVersion("v2.0");
restApi.setApplicationVersion("1.1");
System.out.println("1: So far, so good -- RallyRestApi object created");
try {
//create query request
String type = "HierarchicalRequirement";
QueryRequest qreq = new QueryRequest(type);
System.out.println("2: Still going -- Query Request Created");
//set fetch, filter, and project
qreq.setFetch(new Fetch("Name","FormattedID"));
qreq.setQueryFilter(new QueryFilter("Name", "contains", "freight"));
qreq.setProject(projectNumber);
System.out.println("3: Going strong -- Fetch, Filter, and Project set");
//create response from query********Blows up
QueryResponse resp = restApi.query(qreq);
System.out.println("4: We made it!");
} catch (Exception e) {
System.out.println("Error: " + e);
}
finally {
restApi.close();
}
}
gives me this error:
Exception in thread "main" com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 22
at com.google.gson.JsonParser.parse(JsonParser.java:65)
at com.google.gson.JsonParser.parse(JsonParser.java:45)
at com.rallydev.rest.response.Response.<init>(Response.java:25)
at com.rallydev.rest.response.QueryResponse.<init>(QueryResponse.java:18)
at com.rallydev.rest.RallyRestApi.query(RallyRestApi.java:227)
at RQuery.main(RQuery.java:65)
Caused by: com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 22
at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1505)
at com.google.gson.stream.JsonReader.checkLenient(JsonReader.java:1386)
at com.google.gson.stream.JsonReader.doPeek(JsonReader.java:531)
at com.google.gson.stream.JsonReader.peek(JsonReader.java:414)
at com.google.gson.JsonParser.parse(JsonParser.java:60)
... 5 more
Java Result: 1
Could someone please explain why this is happening? Is my code wrong? If I need to do as the error suggests and set Json.lenient(true), please give me instructions on how to do that with my code.
Thank you!
Your code works for me. I don't see anything wrong with the code.
Try different query, maybe there are some extra characters.
See this post - it mentioned a case with trailing NUL (\0) characters. Perhaps you need to set lenient to true, but I don't know how to do it: there is no direct access to it when working with Rally QueryResponse.
The reason for trying a different query is that there are two local factors: your java environment and your data. MalformedJsonException indicates bad json which points to data. You are fetching only "Name" and "FormattedID", so chances are the culprit is somewhere in the name. Try a different query, e.g. (FormattedID = US123) but choose the story that does not contain "freight" in the name. Establish that at least one particular query works - it will indicate further that the issue is indeed related to data, and not the environment.
Next, try the same query (Name contains "freight") directly in WS API, which is an interactive document where queries can be tested. An equivalent of the query in your code can also be pasted in the browser:
https://rally1.rallydev.com/slm/webservice/v2.0/hierarchicalrequirement?workspace=https://rally1.rallydev.com/slm/webservice/v2.0/workspace/123&query=(Name%20contains%20%22fraight%22)&start=1&pagesize=200&fetch=Name,FormattedID
Make sure to replace 123 in /workspace/123 with the valid OID of your workspace.
Does the query return or you see the same error in the browser?
If the query returns, what is the TotalResultCount?
The total result count will help to troubleshoot further. You may run your code one page at a time, and knowing the TotalResultCount it is possible to manipulate pagesize, start and limit to narrow down your code to the page where the culprit story exists (assuming that there is a culprit story). Here is an example:
qreq.setPageSize(200);
qreq.setStart(2);
qreq.setLimit(200);
Maximum pagesize is 200. Default is 20. The actual number to use depends on TotalResultCount.
The start index for queries begins at 1. The default is 1. In this example we start with second page
My environment is Java SE 1.6 and these jars:
httpcore-4.2.4.jar
httpclient-4.2.5.jar
commons-logging-1.1.1.jar
commons-codec-1.6.jar
gson-2.2.4.jar
rally-rest-api-2.0.4.jar

elasticsearch: Cannot find indexed data (unit Node is closed)

I'm trying to start using elasticsearch (having been a long-term compass user) and I'm having some pretty serious problems with the basics, which is highly frustrating.
The current problem I'm facing is that indexed data is not showing up until after the node is closed. Here is a sample of my code
Node node = nodeBuilder().node();
Client client = node.client();
client.prepareIndex("index1", "type1", "1").setSource("{ \"name\": \"Aaron\"}").execute().actionGet();
client.prepareIndex("index1", "type1", "2").setSource("{ \"name\": \"Andrew\"}").execute().actionGet();
client.prepareIndex("index1", "type1", "3").setSource("{ \"name\": \"Alistair\"}").execute().actionGet();
QueryBuilder queryBuilder = QueryBuilders.wildcardQuery("name", "a*");
SearchRequestBuilder searchRequestBuilder = client.prepareSearch("index1");
searchRequestBuilder.setTypes("type1");
searchRequestBuilder.setSearchType(SearchType.DEFAULT);
searchRequestBuilder.setQuery(queryBuilder);
SearchResponse response = searchRequestBuilder.execute().actionGet();
System.out.println("Response contains " + response.getHits().totalHits() + " hits");
for (SearchHit currentHit : response.getHits())
{
System.out.println(currentHit.getSourceAsString());
}
client.close();
node.close();
The first time I run this, it finds no hits in the search. However, if I run it again - it does indeed find the names that all begin with the letter "A" (don't get me started on the auto-lowercasing of indexed items, but not of searches - that cost me over an hour).
If I remover the close, it doesn't matter how many times I run the above I never find results. However, if I add the close statements, it works second time (every time).
It feels like something to do with having buffered index changes that aren't flushed?
I am sure I am missing something obvious and basic. But I just cannot put my finger on it.
You want to refresh the index before you'll be able to search for the latest changes. Put this after the indexing, before executing the search:
client.admin().indices().prepareRefresh("index1").execute().actionGet();
With the default settings, Elasticsearch will call refresh periodically every 1 second.

How do I limit the number of results when using the Java driver for mongo db?

http://api.mongodb.org/java/2.1/com/mongodb/DBCollection.html#find(com.mongodb.DBObject,com.mongodb.DBObject,int,int)
Using this with Grails and the mongo db plugin.
Here's the code I'm using... not sure why but the cursor is returning the entire set of data. In this case, I'm just trying to return the first 20 matches (with is_processed = false):
def limit = {
def count = 1;
def shape_cursor = mongo.shapes.find(new BasicDBObject("is_processed", false),new BasicDBObject(),0,20);
while(shape_cursor.hasNext()){
shape_cursor.next();
render "<div>" + count + "</div"
count++;
}
}
Anyone have an idea?
limit is a method of DBCursor: DBCursor.limit(n).
So you simply need to do
def shape_cursor = mongo.shapes.find(...).limit(20);
According to the JavaDoc you linked to the second int parameter is not the maximum number to return, but
batchSize - if positive, is the # of objects per batch sent back from the db. all objects that match will be returned. if batchSize < 0, its a hard limit, and only 1 batch will either batchSize or the # that fit in a batch
Maybe a negative number (-20) would do what you want, but I find the statement above too confusing to be sure about it, so I would set the batchSize to 20 and then filter in your application code.
Maybe file this as a bug / feature request. There should be a way to specify skip/limit that works just like on the shell interface. (Update: and there is, on the cursor class, see the other answer).

Why wont this sort in Solr work?

I need to sort on a date-field type, which name is "mod_date".
It works like this in the browser adress-bar:
http://localhost:8983/solr/select/?&q=bmw&sort=mod_date+desc
But I am using a phpSolr client which sends an URL to Solr, and the url sent is this:
fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&version=1.2&wt=json&json.nl=map&q=%2A%3A%2A&start=0&rows=5&sort=mod_date+desc
// This wont work and is echoed after this in php:
$queryString = http_build_query($params, null, $this->_queryStringDelimiter);
$queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
This wont work, I dont know why!
Everything else works fine, all right fields are returned. But the sort doesn't work.
Any ideas?
Thanks
BTW: The field "mod_date" contains something like:
2010-03-04T19:37:22.5Z
EDIT:
First I use PHP to send this to a SolrPhpClient which is another php-file called service.php:
require_once('../SolrPhpClient/Apache/Solr/Service.php');
$solr = new Apache_Solr_Service('localhost', 8983, '/solr/');
$results = $solr->search($querystring, $p, $limit, $solr_params);
$solr_params is an array which contains the solr-parameters (q, fq, etc).
Now, in service.php:
$params['version'] = self::SOLR_VERSION;
// common parameters in this interface
$params['wt'] = self::SOLR_WRITER;
$params['json.nl'] = $this->_namedListTreatment;
$params['q'] = $query;
$params['sort'] = 'mod_date desc'; // HERE IS THE SORT I HAVE PROBLEM WITH
$params['start'] = $offset;
$params['rows'] = $limit;
$queryString = http_build_query($params, null, $this->_queryStringDelimiter);
$queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
if ($method == self::METHOD_GET)
{
return $this->_sendRawGet($this->_searchUrl . $this->_queryDelimiter . $queryString);
}
else if ($method == self::METHOD_POST)
{
return $this->_sendRawPost($this->_searchUrl, $queryString, FALSE, 'application/x-www-form-urlencoded');
}
The $results contain the results from Solr...
So this is the way I need to get to work (via php).
This code below (also on top of this Q) works but thats because I paste it into the adress bar manually, not via the PHPclient. But thats just for debugging, I need to get it to work via the PHPclient:
http://localhost:8983/solr/select/?&q=bmw&sort=mod_date+des // Not via phpclient, but works
UPDATE (2010-03-08):
I have tried Donovans codes (the urls) and they work fine.
Now, I have noticed that it is one of the parameters causing the 'SORT' not to work.
This parameter is the "wt" parameter. If we take the url from top of this Q, (fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&version=1.2&wt=json&json.nl=map&q=%2A%3A%2A&start=0&rows=5&sort=mod_date+desc), and just simply remove the "wt" parameter, then the sort works.
BUT the results appear differently, thus making my php code not able to recognize the results I believe. Donovan would know this I think. I am guessing in order for the PHPClient to work, the results must be in a specific structure, which gets messed up as soon as I remove the wt parameter.
Donovan, help me please...
Here is some background what I use your SolrPhpClient for:
I have a classifieds website, which uses MySql. But for the searching I am using Solr to search some indexed fields. Then Solr returns an array of ID:numbers (for all matches of the search criteria). Then I use those ID:numbers to find everything in a MySql db and fetch all other information (example is not searchable information).
So simplified: Search -> Solr returns all matches in an array of ID:nrs -> Id:numbers from Solr are the same as the Id numbers in the MySql db, so I can just make a simple match agains every record with the ID matching the ID from the Solr results array.
I don't use Faceting, no boosting, no relevancy or other fancy stuff. I only sort by the latest classified put, and give the option to users to also sort on the cheapest price. Nothing more.
Then I use the "fq" parameter to do queries on different fields in Solr depending on category chosen by users (example "cars" in this case which in my language is "Bilar").
I am really stuck with this problem here...
Thanks for all help
As pointed out in the stack overflow comments, your browser query is different than your php client based query - to remove that from the equation you should test with this corrected. To get the same results as the browser based query you're php code should have looked something like this:
$solr = new Apache_Solr_Client(...);
$searchOptions = array(
'sort' => 'mod_date desc'
);
$results = $solr->search("bmw", 0, 10, $searchOptions);
Instead, I imagine it looks more like:
$searchOptions = array(
'fq' => 'category:"Bilar" + car_action:Sälje',
'sort' => 'mod_date desc'
)
$solr->search("\*:*", 0, 10, $searchOptions);
What I expect you to see is that php client results will be the same as the browser based results, and I imagine the same would happen if you did it the opposite way - take your current parameters from the php client and applied them correctly to the browser based query.
Now onto your problem, you don't see documents sorted properly.
I would try this query, which is the equivalent of the php client based code:
http://localhost:8983/solr/select/?&q=%2A%3A%2A&fq=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&sort=mod_date+desc
versus this query, which moves the filter query into the main query:
http://localhost:8983/solr/select/?&q=+category%3A%22Bilar%22+%2B+car_action%3AS%C3%A4ljes&sort=mod_date+desc
and see if there is a difference. If there is, then it might be a bug in how results from cached filtered queries are used and sorted by solr - which wouldn't be a problem with the client, but the solr service itself.
Hope this gets you closer to an anser.
Use session's values for save sort parameters.
The quick answer in case someone is attempting to sort via solr-php-client:
$searchOptions = array('sort' => 'field_date desc');
Ditch the + sign that you would usually put on the URL. It took me a while as well to figure it out, I was encoding it and putting it all over the place...
It's possible it's related to the json.nl=map parameter. When the response is set to JSON with wt=json and json.nl=map, facets are not sorted as expected with the facet.sort or f.<field_name>.facet.sort=count|index options.
e.g. with facet.sort=count and wt=json only, I get:
"dc_coverage": [
"United States",
5,
"19th century",
1,
"20th century",
1,
"Detroit (Mich.)",
1,
"Pennsylvania",
1,
"United States--Michigan--Detroit",
1,
"United States--Washington, D.C.",
1
]
But with facet.sort=count, wt=json, and json.nl=map as an option, you can see the sorting is lost:
"dc_coverage": {
"19th century": 1,
"20th century": 1,
"Detroit (Mich.)": 1,
"Pennsylvania": 1,
"United States": 5,
"United States--Michigan--Detroit": 1,
"United States--Washington, D.C.": 1
}
There is more information here about formatting the JSON response when using json.nl=map: https://cwiki.apache.org/confluence/display/solr/Response+Writers#ResponseWriters-JSONResponseWriter

Categories