I want to get all documents (millions of them) in an Elasticsearch index based on some condition. I used the below query in Elasticsearch:
GET /<index-name>/_search
{
  "from": 99550,
  "size": 500,
  "query": {
    "term": { "CC_ENGAGEMENT_NUMBER": "1967" }
  }
}
And below is my Java implementation:
public IndexSearchResult findByStudIdAndcollageId(final String studId, final String collageId,
        Integer pageNumberStartIndex, Integer totalNoOfRecords) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    List<Map<String, Object>> searchResults = new ArrayList<>();
    IndexSearchResult indexSearchResult = new IndexSearchResult();
    try {
        QueryBuilder qurBd = new BoolQueryBuilder().minimumShouldMatch(2)
                .should(QueryBuilders.matchQuery("STUD_ID", studId).operator(Operator.AND))
                .should(QueryBuilders.matchQuery("CLG_ID", collageId).operator(Operator.AND));
        sourceBuilder.from(pageNumberStartIndex).size(totalNoOfRecords);
        sourceBuilder.query(qurBd);
        sourceBuilder.sort(new FieldSortBuilder("ROLL_NO.keyword").order(SortOrder.DESC));
        SearchRequest searchRequest = new SearchRequest();
        searchRequest.indices("clgindex");
        searchRequest.source(sourceBuilder);
        SearchResponse response = rClient.search(searchRequest, RequestOptions.DEFAULT);
        response.getHits().forEach(searchHit -> searchResults.add(searchHit.getSourceAsMap()));
        indexSearchResult.setListOfIndexes(searchResults);
        log.info("searchResultsHits {}", searchResults.size());
    } catch (Exception e) {
        log.error("search :: Search on clg flat index. {}", e.getMessage());
    }
    return indexSearchResult;
}
So if from is 99550 and size is 500, it will not fetch more than 1L (100,000) records. I get this error:
Error: "reason" : "Result window is too large, from + size must be less than or equal to: [100000] but was [100050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
}
I don't want to change [index.max_result_window]. I only want a Java-side solution to search all documents in the index based on conditions, using the Elasticsearch API.
Thanks in advance.
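For reference, the error message itself points at the scroll API. A minimal sketch of that approach with the Java High Level REST Client, reusing the rClient, index and term from the question (a sketch under those assumptions, not a drop-in implementation), might look like this:

SearchRequest searchRequest = new SearchRequest("clgindex");
searchRequest.scroll(TimeValue.timeValueMinutes(1L));
searchRequest.source(new SearchSourceBuilder()
        .size(500)
        .query(QueryBuilders.termQuery("CC_ENGAGEMENT_NUMBER", "1967")));

SearchResponse response = rClient.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();
SearchHit[] hits = response.getHits().getHits();
while (hits != null && hits.length > 0) {
    for (SearchHit hit : hits) {
        // process hit.getSourceAsMap() here
    }
    // ask for the next batch; the scroll id may change between calls
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueMinutes(1L));
    response = rClient.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
    hits = response.getHits().getHits();
}
// release server-side scroll resources when done
ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
rClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);

This streams the full result set in fixed-size batches, so from + size never exceeds index.max_result_window.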
I'm searching in multiple fields, and I want to get results if the record matches a specific value (entry.getValue()) or the String "ALL"
Here is my code, but it's not working.
SearchRequest searchRequest = new SearchRequest(MY_INDEX);
final BoolQueryBuilder booleanQuery = QueryBuilders.boolQuery();
searchRequest.source().query(booleanQuery);

for (Map.Entry<String, String> entry : params.entrySet()) {
    booleanQuery.should(QueryBuilders.termsQuery(entry.getKey(), entry.getValue(), "ALL"));
}
I'm using JDK 11 and ES 7.1
Here is some sample code, written for a country index, which searches for the data provided in a map. Customize it according to your needs.
// using a map for country
Map<String, String> map = new HashMap<>();
map.put("country", "FRANCE");
map.put("countryCode", "FR");

// list of should queries; this will go in the should clause of the bool query
List<Query> shouldQueryList = new ArrayList<>();
for (Map.Entry<String, String> entry : map.entrySet()) {
    // list of terms to match, i.e. the value from the map and "ALL"
    List<FieldValue> list = Arrays.asList(FieldValue.of(entry.getValue()), FieldValue.of("ALL"));
    // terms query
    Query query = new Query.Builder().terms(termsQueryBuilder -> termsQueryBuilder
            .field(entry.getKey())
            .terms(termQueryField -> termQueryField.value(list))).build();
    shouldQueryList.add(query);
}
try {
    // running the search with the Elasticsearch Java client 7.16.3
    SearchResponse<Country> response = elasticsearchClient.search(searchRequest -> searchRequest
            .query(qBuilder -> qBuilder
                    .bool(boolQueryBuilder -> boolQueryBuilder
                            // using the should query list here
                            .should(shouldQueryList))),
            Country.class);
    response.hits().hits().forEach(a -> {
        // print matching country names to the console
        System.out.println(a.source().getCountry());
    });
} catch (IOException e) {
    log.info(e.getMessage());
}
The above code will generate a query like this:
{"query":{"bool":{"should":[{"terms":{"country":["FRANCE","ALL"]}},{"terms":{"countryCode":["FR","ALL"]}}]}}}
This is my previous question: how to insert data in elastic search index.
The index mapping is as follows:
{
  "test" : {
    "mappings" : {
      "properties" : {
        "name" : {
          "type" : "keyword"
        },
        "info" : {
          "type" : "nested"
        },
        "joining" : {
          "type" : "date"
        }
      }
    }
  }
}
How can I check whether a field's value is already present before uploading a document to the index?
Note: I don't have an id field maintained in the index. I need to check the name in each document; if it is already present, then don't insert the document into the index.
Thanks in advance.
As you don't have an id field in your mapping, you have to search on the name field, and you can use the code below to do so.
public List<SearchResult> search(String searchTerm) throws IOException {
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // MatchQueryBuilder takes the field name first, then the search term
    MatchQueryBuilder matchQueryBuilder = new MatchQueryBuilder("firstName", searchTerm);
    searchSourceBuilder.query(matchQueryBuilder);
    searchRequest.source(searchSourceBuilder);
    SearchResponse searchResponse = esclient.search(searchRequest, RequestOptions.DEFAULT);
    return getSearchResults(searchResponse);
}
Note: as you have a keyword field, you can use a TermQueryBuilder instead of a match query.
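For illustration, a hedged sketch of that term-query check against the name keyword field from the mapping above (assuming the same esclient and a 7.x high-level client, where getTotalHits() returns a TotalHits object):

// exact match on the "name" keyword field; size(0) because only the hit count matters
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .query(QueryBuilders.termQuery("name", searchTerm))
        .size(0);
SearchRequest request = new SearchRequest(INDEX_NAME).source(sourceBuilder);
SearchResponse response = esclient.search(request, RequestOptions.DEFAULT);
boolean alreadyPresent = response.getHits().getTotalHits().value > 0;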
The search method above uses a utility method to parse the SearchResponse from ES, the code for which is below:
private List<YourPojo> getSearchResults(SearchResponse searchResponse) {
    RestStatus status = searchResponse.status();
    TimeValue took = searchResponse.getTook();
    Boolean terminatedEarly = searchResponse.isTerminatedEarly();
    boolean timedOut = searchResponse.isTimedOut();
    // Start fetching the documents matching the search results:
    // https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-search.html#java-rest-high-search-response-search-hits
    SearchHits hits = searchResponse.getHits();
    SearchHit[] searchHits = hits.getHits();
    List<YourPojo> searchResults = new ArrayList<>();
    for (SearchHit hit : searchHits) {
        // do something with the SearchHit
        String index = hit.getIndex();
        String id = hit.getId();
        float score = hit.getScore();
        //String sourceAsString = hit.getSourceAsString();
        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        String firstName = (String) sourceAsMap.get("firstName");
        searchResults.add(userSearchResultBuilder.build());
    }
    return searchResults;
}
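Tying it together, a hedged sketch of the check-then-insert flow the question asks for; the documentJson variable and the name value here are hypothetical placeholders:

// only index the document when no document with this name exists yet
if (search(name).isEmpty()) {
    IndexRequest indexRequest = new IndexRequest(INDEX_NAME)
            .source(documentJson, XContentType.JSON);
    esclient.index(indexRequest, RequestOptions.DEFAULT);
}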
Is this the most efficient way to retrieve only ids from ElasticSearch?
requestBuilder.setQuery(queryBuilder);
requestBuilder.setFrom(start);
requestBuilder.setSize(limit);
requestBuilder.setFetchSource(false);
SearchResponse response = requestBuilder.execute().actionGet();
SearchHit[] hits = response.getHits().getHits();
List<Long> refugeeIds = new ArrayList<>();
for (SearchHit hit : hits) {
    if (hit.getId() != null) {
        refugeeIds.add(Long.parseLong(hit.getId())); // getId() already returns a String
    }
}
That should be the best way. Since you don't fetch the _source, ES will only return the _type, _index, _score and _id.
The official Solr Java API has a deleteByQuery operation where we can delete documents that satisfy a query. The AWS CloudSearch SDK doesn't seem to have matching functionality. Am I just not seeing the deleteByQuery equivalent, or is this something we'll need to roll our own?
Something like this:
SearchRequest searchRequest = new SearchRequest();
searchRequest.setQuery(queryString);
searchRequest.setReturn("id,version");
SearchResult searchResult = awsCloudSearch.search(searchRequest);

JSONArray docs = new JSONArray();
for (Hit hit : searchResult.getHits().getHit()) {
    JSONObject doc = new JSONObject();
    doc.put("id", hit.getId());
    // is version necessary?
    doc.put("version", hit.getFields().get("version").get(0));
    doc.put("type", "delete");
    docs.put(doc);
}

UploadDocumentsRequest uploadDocumentsRequest = new UploadDocumentsRequest();
StringInputStream documents = new StringInputStream(docs.toString());
uploadDocumentsRequest.setDocuments(documents);
UploadDocumentsResult uploadResult = awsCloudSearch.uploadDocuments(uploadDocumentsRequest);
Is this reasonable? Is there an easier way?
You're correct that CloudSearch doesn't have an equivalent to deleteByQuery. Your approach looks like the next best thing.
And no, version is not necessary -- it was removed with the CloudSearch 01-01-2013 API (aka v2).
CloudSearch doesn't provide delete-by-query; it supports delete in a slightly different way: you build JSON objects containing only the document id (to be deleted), with the operation specified as delete. These JSON objects can be batched together, but the batch size has to be less than 5 MB.
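For reference, the batch such a delete upload expects is just a JSON array of id/type pairs, for example:

[
  { "type": "delete", "id": "doc1" },
  { "type": "delete", "id": "doc2" }
]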
The following class supports this functionality; you just pass its deleteDocs method the array of ids to be deleted:
class AWS_CS
{
    protected $client;

    function connect($domain)
    {
        try {
            $csClient = CloudSearchClient::factory(array(
                'key'    => 'YOUR_KEY',
                'secret' => 'YOUR_SECRET',
                'region' => 'us-east-1'
            ));
            $this->client = $csClient->getDomainClient(
                $domain,
                array(
                    'credentials' => $csClient->getCredentials(),
                    'scheme'      => 'HTTPS'
                )
            );
        }
        catch (Exception $ex) {
            echo "Exception: ";
            echo $ex->getMessage();
        }
        //$this->client->addSubscriber(LogPlugin::getDebugPlugin());
    }

    function search($queryStr, $domain) {
        $this->connect($domain);
        $result = $this->client->search(array(
            'query'       => $queryStr,
            'queryParser' => 'lucene',
            'size'        => 100,
            'return'      => '_score,_all_fields'
        ))->toArray();
        return json_encode($result['hits']);
        //$hitCount = $result->getPath('hits/found');
        //echo "Number of Hits: {$hitCount}\n";
    }

    function deleteDocs($idArray, $operation = 'delete') {
        $batch = array();
        foreach ($idArray as $id) {
            $batch[] = array(
                'type' => $operation,
                'id'   => $id
            );
        }
        $batch = array_filter($batch);
        $jsonObj = json_encode($batch, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);
        // capture the upload result so we can report success
        $result = $this->client->uploadDocuments(array(
            'documents'   => $jsonObj,
            'contentType' => 'application/json'
        ))->toArray();
        print_r($result);
        return $result['status'] == 'success' ? mb_strlen($jsonObj) : 0;
    }
}
Modified for C#: deleting uploaded documents in CloudSearch.
public void DeleteUploadedDocuments(string location)
{
    SearchRequest searchRequest = new SearchRequest
    {
        Query = "resourcename:'filepath'",
        QueryParser = QueryParser.Lucene,
        Size = 10000
    };
    searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        new AmazonCloudSearchDomainConfig { ServiceURL = ConfigurationManager.AppSettings["CloudSearchEndPoint"] });
    SearchResponse searchResponse = searchClient.Search(searchRequest);

    JArray docs = new JArray();
    foreach (Hit hit in searchResponse.Hits.Hit)
    {
        JObject doc = new JObject();
        doc.Add("id", hit.Id);
        doc.Add("type", "delete");
        docs.Add(doc);
    }
    UpdateIndexDocument<JArray>(docs, ConfigurationManager.AppSettings["CloudSearchEndPoint"]);
}
public void UpdateIndexDocument<T>(T document, string DocumentUrl)
{
    AmazonCloudSearchDomainConfig config = new AmazonCloudSearchDomainConfig { ServiceURL = DocumentUrl };
    AmazonCloudSearchDomainClient searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        config);
    using (Stream stream = GenerateStreamFromString(JsonConvert.SerializeObject(document)))
    {
        UploadDocumentsRequest upload = new UploadDocumentsRequest()
        {
            ContentType = "application/json",
            Documents = stream
        };
        searchClient.UploadDocuments(upload);
    }
}
I have a database in Elasticsearch and want to get all records on my web site's page. I wrote a bean which connects to the Elasticsearch node, searches for records, and returns a response. My simple Java code, which does the searching, is:
SearchResponse response = getClient().prepareSearch(indexName)
        .setTypes(typeName)
        .setQuery(queryString("*:*"))
        .setExplain(true)
        .execute().actionGet();
But Elasticsearch sets the default size to 10, so I have only 10 hits in the response, while there are more than 10 records in my database. If I set the size to Integer.MAX_VALUE, my search becomes very slow, and that is not what I want.
How can I get all the records in one action, in an acceptable amount of time, without setting the size of the response?
public List<Map<String, Object>> getAllDocs() {
    int scrollSize = 1000;
    List<Map<String, Object>> esData = new ArrayList<Map<String, Object>>();
    SearchResponse response = null;
    int i = 0;
    while (response == null || response.getHits().hits().length != 0) {
        response = client.prepareSearch(indexName)
                .setTypes(typeName)
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(scrollSize)
                .setFrom(i * scrollSize)
                .execute()
                .actionGet();
        for (SearchHit hit : response.getHits()) {
            esData.add(hit.getSource());
        }
        i++;
    }
    return esData;
}
The current highest-ranked answer works, but it requires loading the whole list of results into memory, which can cause memory issues for large result sets and is in any case unnecessary.
I created a Java class that implements a nice Iterator over SearchHits, which allows you to iterate through all results. Internally, it handles pagination by issuing queries that include the from field, and it only keeps one page of results in memory.
Usage:
// build your query here -- no need for setFrom(int)
SearchRequestBuilder requestBuilder = client.prepareSearch(indexName)
        .setTypes(typeName)
        .setQuery(QueryBuilders.matchAllQuery());

SearchHitIterator hitIterator = new SearchHitIterator(requestBuilder);
while (hitIterator.hasNext()) {
    SearchHit hit = hitIterator.next();
    // process your hit
}
Note that, when creating your SearchRequestBuilder, you don't need to call setFrom(int), as this will be done internally by the SearchHitIterator. If you want to specify the size of a page (i.e. the number of search hits per page), you can call setSize(int); otherwise ElasticSearch's default value is used.
SearchHitIterator:
import java.util.Iterator;

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

public class SearchHitIterator implements Iterator<SearchHit> {

    private final SearchRequestBuilder initialRequest;

    private int searchHitCounter;
    private SearchHit[] currentPageResults;
    private int currentResultIndex;

    public SearchHitIterator(SearchRequestBuilder initialRequest) {
        this.initialRequest = initialRequest;
        this.searchHitCounter = 0;
        this.currentResultIndex = -1;
    }

    @Override
    public boolean hasNext() {
        if (currentPageResults == null || currentResultIndex + 1 >= currentPageResults.length) {
            SearchRequestBuilder paginatedRequestBuilder = initialRequest.setFrom(searchHitCounter);
            SearchResponse response = paginatedRequestBuilder.execute().actionGet();
            currentPageResults = response.getHits().getHits();
            if (currentPageResults.length < 1) return false;
            currentResultIndex = -1;
        }
        return true;
    }

    @Override
    public SearchHit next() {
        if (!hasNext()) return null;
        currentResultIndex++;
        searchHitCounter++;
        return currentPageResults[currentResultIndex];
    }

}
In fact, realizing how convenient it is to have such a class, I wonder why ElasticSearch's Java client does not offer something similar.
You could use the scroll API. The other suggestion of using a SearchHit iterator would also work great, but only when you don't want to update those hits.
import static org.elasticsearch.index.query.QueryBuilders.*;

QueryBuilder qb = termQuery("multi", "test");

SearchResponse scrollResp = client.prepareSearch("test")
        .addSort(FieldSortBuilder.DOC_FIELD_NAME, SortOrder.ASC)
        .setScroll(new TimeValue(60000))
        .setQuery(qb)
        .setSize(100).execute().actionGet(); // max of 100 hits will be returned for each scroll
// Scroll until no hits are returned
do {
    for (SearchHit hit : scrollResp.getHits().getHits()) {
        // Handle the hit...
    }
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(60000)).execute().actionGet();
} while (scrollResp.getHits().getHits().length != 0); // Zero hits mark the end of the scroll and the while loop.
It has been a long time since you asked this question, but I would like to post my answer for future readers.
As mentioned in the answers above, it is better to load documents with a size and start point when you have thousands of documents in the index. In my project, the search loads 50 results as the default size, starting from index zero; if a user wants to load more data, the next 50 results are loaded. Here is what I have done in the code:
public List<CourseDto> searchAllCourses(int startDocument) {
    final int searchSize = 50;
    final SearchRequest searchRequest = new SearchRequest("course_index");
    final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.matchAllQuery());
    if (startDocument != 0) {
        startDocument += searchSize;
    }
    searchSourceBuilder.from(startDocument);
    searchSourceBuilder.size(searchSize);
    // sort the documents
    searchSourceBuilder.sort(new FieldSortBuilder("publishedDate").order(SortOrder.ASC));
    searchRequest.source(searchSourceBuilder);

    List<CourseDto> courseList = new ArrayList<>();
    try {
        final SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        final SearchHits hits = searchResponse.getHits();
        // Want to know how many documents (results) were returned? Here is how you get it:
        TotalHits totalHits = hits.getTotalHits();
        long numHits = totalHits.value;
        final SearchHit[] searchHits = hits.getHits();
        final ObjectMapper mapper = new ObjectMapper();
        for (SearchHit hit : searchHits) {
            // convert the JSON object to a CourseDto
            courseList.add(mapper.readValue(hit.getSourceAsString(), CourseDto.class));
        }
    } catch (IOException e) {
        logger.error("Cannot search by match all. " + e);
    }
    return courseList;
}
Info:
- Elasticsearch version 7.5.0
- The Java High Level REST Client is used as the client.
I hope this will be useful for someone.
You will have to trade off the number of returned results against the time you want the user to wait and the amount of available server memory. If you've indexed 1,000,000 documents, there isn't a realistic way to retrieve all those results in one request. I'm assuming your results are for one user. You'll also have to consider how the system will perform under load.
To query all documents, you should build a CountRequestBuilder to get the total number of records (via the CountResponse), then set that number as the size of your search request.
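A minimal sketch of that count-then-search pattern, assuming the older TransportClient API used elsewhere in this thread (prepareCount was removed in later client versions), with indexName and queryBuilder as placeholders:

// count the matching documents first, then ask for exactly that many hits
CountResponse countResponse = client.prepareCount(indexName)
        .setQuery(queryBuilder)
        .execute().actionGet();

SearchResponse response = client.prepareSearch(indexName)
        .setQuery(queryBuilder)
        .setSize((int) countResponse.getCount())
        .execute().actionGet();

Note that in recent Elasticsearch versions this still runs into index.max_result_window, so it is only practical for result sets below that limit.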
If your primary focus is on exporting all records, you might want to go for a solution which does not require any kind of sorting, as sorting is an expensive operation. You could use the scan and scroll approach with ElasticsearchCRUD, as described here.
For version 6.3.2, the following worked:
public List<Map<String, Object>> getAllDocs(String indexName, String searchType) {
    int scrollSize = 1000;
    List<Map<String, Object>> esData = new ArrayList<>();
    SearchResponse response = client.prepareSearch(indexName)
            .setScroll(new TimeValue(60000))
            .setTypes(searchType) // the document types to execute the search against; defaults to all types
            .setQuery(QueryBuilders.matchAllQuery())
            .setSize(scrollSize).get(); // a max of 1000 hits will be returned for each scroll
    // Scroll until no hits are returned
    int i = 0;
    do {
        for (SearchHit hit : response.getHits().getHits()) {
            ++i;
            esData.add(hit.getSourceAsMap()); // collect the document so the method actually returns data
            System.out.println(i + " " + hit.getId());
        }
        response = client.prepareSearchScroll(response.getScrollId())
                .setScroll(new TimeValue(60000)).execute().actionGet();
    } while (response.getHits().getHits().length != 0); // zero hits mark the end of the scroll and the while loop
    return esData;
}
SearchResponse response = restHighLevelClient.search(new SearchRequest("Index_Name"), RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
Note that without an explicit size this returns only the default 10 hits, not all documents.
1. Set the max size, e.g. MAX_INT_VALUE:
private static final int MAXSIZE = 1000000;

@Override
public List getAllSaleCityByCity(int cityId) throws Exception {
    List<EsSaleCity> list = new ArrayList<EsSaleCity>();
    Client client = EsFactory.getClient();
    SearchResponse response = client.prepareSearch(getIndex(EsSaleCity.class))
            .setTypes(getType(EsSaleCity.class))
            .setSize(MAXSIZE)
            .setQuery(QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(),
                    FilterBuilders.boolFilter().must(FilterBuilders.termFilter("cityId", cityId))))
            .execute().actionGet();
    SearchHits searchHits = response.getHits();
    SearchHit[] hits = searchHits.getHits();
    for (SearchHit hit : hits) {
        Map<String, Object> resultMap = hit.getSource();
        EsSaleCity saleCity = setEntity(resultMap, EsSaleCity.class);
        list.add(saleCity);
    }
    return list;
}
2. Count the documents before you search:
CountResponse countResponse = client.prepareCount(getIndex(EsSaleCity.class))
        .setTypes(getType(EsSaleCity.class))
        .setQuery(queryBuilder)
        .execute().actionGet();
int size = (int) countResponse.getCount(); // this is the size you want
Then you can:
SearchResponse response = client.prepareSearch(getIndex(EsSaleCity.class))
        .setTypes(getType(EsSaleCity.class))
        .setQuery(queryBuilder)
        .setSize(size)
        .execute().actionGet();