The official Solr Java API has a deleteByQuery operation that lets us delete documents matching a query. The AWS CloudSearch SDK doesn't seem to have equivalent functionality. Am I just not seeing the deleteByQuery equivalent, or is this something we'll need to roll ourselves?
Something like this:
SearchRequest searchRequest = new SearchRequest();
searchRequest.setQuery(queryString);
searchRequest.setReturn("id,version");
SearchResult searchResult = awsCloudSearch.search(searchRequest);

JSONArray docs = new JSONArray();
for (Hit hit : searchResult.getHits().getHit()) {
    JSONObject doc = new JSONObject();
    doc.put("id", hit.getId());
    // is version necessary?
    doc.put("version", hit.getFields().get("version").get(0));
    doc.put("type", "delete");
    docs.put(doc);
}

UploadDocumentsRequest uploadDocumentsRequest = new UploadDocumentsRequest();
uploadDocumentsRequest.setContentType("application/json"); // the batch upload API requires a content type
StringInputStream documents = new StringInputStream(docs.toString());
uploadDocumentsRequest.setDocuments(documents);
UploadDocumentsResult uploadResult = awsCloudSearch.uploadDocuments(uploadDocumentsRequest);
Is this reasonable? Is there an easier way?
You're correct that CloudSearch doesn't have an equivalent to deleteByQuery. Your approach looks like the next best thing.
And no, version is not necessary -- it was removed in the CloudSearch 2013-01-01 API (aka v2).
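For reference, a minimal sketch (not part of the original question) of what the loop looks like once version is dropped -- each delete operation only needs id and type, reusing the same searchResult and docs variables from the question:

JSONArray docs = new JSONArray();
for (Hit hit : searchResult.getHits().getHit()) {
    JSONObject doc = new JSONObject();
    doc.put("id", hit.getId());
    doc.put("type", "delete"); // no version field needed on the 2013-01-01 API
    docs.put(doc);
}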
CloudSearch doesn't provide delete-by-query; it supports delete in a slightly different way: you build a JSON object containing only the id of the document to be deleted, with the operation type set to delete. These JSON objects can be batched together, but a batch has to be less than 5 MB (a Java sketch of staying under that limit follows after the PHP class below).
The following class supports this functionality; you just pass its deleteDocs method the array of ids to be deleted:
class AWS_CS
{
    protected $client;

    function connect($domain)
    {
        try {
            $csClient = CloudSearchClient::factory(array(
                'key'    => 'YOUR_KEY',
                'secret' => 'YOUR_SECRET',
                'region' => 'us-east-1'
            ));

            $this->client = $csClient->getDomainClient(
                $domain,
                array(
                    'credentials' => $csClient->getCredentials(),
                    'scheme'      => 'HTTPS'
                )
            );
        }
        catch (Exception $ex) {
            echo "Exception: ";
            echo $ex->getMessage();
        }

        //$this->client->addSubscriber(LogPlugin::getDebugPlugin());
    }

    function search($queryStr, $domain)
    {
        $this->connect($domain);

        $result = $this->client->search(array(
            'query'       => $queryStr,
            'queryParser' => 'lucene',
            'size'        => 100,
            'return'      => '_score,_all_fields'
        ))->toArray();

        return json_encode($result['hits']);
        //$hitCount = $result->getPath('hits/found');
        //echo "Number of Hits: {$hitCount}\n";
    }

    // Assumes connect($domain) has already been called.
    function deleteDocs($idArray, $operation = 'delete')
    {
        // Build one batch entry per document id; only 'type' and 'id' are needed.
        $batch = array();
        foreach ($idArray as $id) {
            $batch[] = array(
                'type' => $operation,
                'id'   => $id
            );
        }
        $batch = array_filter($batch);

        $jsonObj = json_encode($batch, JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_HEX_AMP);

        $result = $this->client->uploadDocuments(array(
            'documents'   => $jsonObj,
            'contentType' => 'application/json'
        ))->toArray();
        print_r($result);

        return $result['status'] == 'success' ? mb_strlen($jsonObj) : 0;
    }
}
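As mentioned above, batches must stay under 5 MB. Here is a hedged Java sketch (an assumption, not part of the original answer) of splitting delete operations so that each upload stays under that limit; it assumes org.json, the AWS Java SDK v1 CloudSearch domain client, and the same delete-document shape used in the question:

// Sketch only: chunk delete operations so each uploadDocuments call stays below 5 MB.
void deleteByIds(AmazonCloudSearchDomain awsCloudSearch, List<String> idsToDelete) throws Exception {
    final long MAX_BATCH_BYTES = 5L * 1024 * 1024;
    JSONArray batch = new JSONArray();
    long batchBytes = 2;                                   // the surrounding "[]"
    for (String id : idsToDelete) {
        JSONObject doc = new JSONObject();
        doc.put("id", id);
        doc.put("type", "delete");
        long docBytes = doc.toString().getBytes(StandardCharsets.UTF_8).length + 1; // +1 for a comma
        if (batch.length() > 0 && batchBytes + docBytes > MAX_BATCH_BYTES) {
            uploadBatch(awsCloudSearch, batch);            // flush before this doc pushes us over the limit
            batch = new JSONArray();
            batchBytes = 2;
        }
        batch.put(doc);
        batchBytes += docBytes;
    }
    if (batch.length() > 0) {
        uploadBatch(awsCloudSearch, batch);
    }
}

void uploadBatch(AmazonCloudSearchDomain awsCloudSearch, JSONArray docs) throws Exception {
    UploadDocumentsRequest request = new UploadDocumentsRequest();
    request.setContentType("application/json");
    request.setDocuments(new StringInputStream(docs.toString()));
    awsCloudSearch.uploadDocuments(request);
}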
Modified for C# - deleting uploaded documents in CloudSearch:
public void DeleteUploadedDocuments(string location)
{
    SearchRequest searchRequest = new SearchRequest
    {
        Query = "resourcename:'filepath'",
        QueryParser = QueryParser.Lucene,
        Size = 10000
    };

    AmazonCloudSearchDomainClient searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        new AmazonCloudSearchDomainConfig { ServiceURL = ConfigurationManager.AppSettings["CloudSearchEndPoint"] });

    SearchResponse searchResponse = searchClient.Search(searchRequest);

    JArray docs = new JArray();
    foreach (Hit hit in searchResponse.Hits.Hit)
    {
        JObject doc = new JObject();
        doc.Add("id", hit.Id);
        doc.Add("type", "delete");
        docs.Add(doc);
    }

    UpdateIndexDocument<JArray>(docs, ConfigurationManager.AppSettings["CloudSearchEndPoint"]);
}
public void UpdateIndexDocument<T>(T document, string DocumentUrl)
{
    AmazonCloudSearchDomainConfig config = new AmazonCloudSearchDomainConfig { ServiceURL = DocumentUrl };
    AmazonCloudSearchDomainClient searchClient = new AmazonCloudSearchDomainClient(
        ConfigurationManager.AppSettings["awsAccessKeyId"],
        ConfigurationManager.AppSettings["awsSecretAccessKey"],
        config);

    // GenerateStreamFromString is a helper (not shown) that wraps the serialized JSON in a Stream.
    using (Stream stream = GenerateStreamFromString(JsonConvert.SerializeObject(document)))
    {
        UploadDocumentsRequest upload = new UploadDocumentsRequest()
        {
            ContentType = "application/json",
            Documents = stream
        };
        searchClient.UploadDocuments(upload);
    }
}
This is my query function:
public List<feed> search(String id) throws IOException {
    Query nestedQuery = NestedQuery.of(nq -> nq.path("comment").innerHits(InnerHits.of(ih -> ih)).query(MatchQuery
            .of(mq -> mq.field("comment.c_text").query(id))._toQuery()))._toQuery();
    Query termQueryTitle = TermQuery.of(tq -> tq.field("title").value(id))._toQuery();
    Query termQueryBody = TermQuery.of(tq -> tq.field("body").value(id))._toQuery();
    Query boolQuery = BoolQuery.of(bq -> bq.should(nestedQuery, termQueryBody, termQueryTitle))._toQuery();

    SearchRequest searchRequest = SearchRequest.of(s -> s.index(indexName).query(boolQuery));
    var response = elasticsearchClient.search(searchRequest, feed.class);

    for (var hit : response.hits().hits()) {
        System.out.println("this is inner hit response: " + hit.innerHits().get("comment").hits().hits());
    }

    List<Hit<feed>> hits = response.hits().hits();
    List<feed> feeds = new ArrayList<>();
    feed f = null;
    for (Hit object : hits) {
        f = (feed) object.source();
        feeds.add(f);
    }
    return feeds;
}
I have added this code:
for (var hit : response.hits().hits()) {
    System.out.println("this is inner hit response: " + hit.innerHits().get("comment").hits().hits());
}
If it finds 2 records, it gives me references to the 2 Hit objects but doesn't show the actual records. For example, when it finds 2 records in the inner hits, the output is as follows:
this is inner hit response [co.elastic.clients.elasticsearch.core.search.Hit#75679b1a]
this is inner hit response [co.elastic.clients.elasticsearch.core.search.Hit#1916d9c6]
Can anyone help me print the actual records?
This works properly for me in the console:
for (var hit : response.hits().hits()) {
    var innerHits = hit.innerHits().get("comment").hits().hits();
    for (var innerHit : innerHits) {
        JsonData source = innerHit.source();
        String jsonDataString = source.toString();
        System.out.println("Matched comments: " + jsonDataString);
    }
}
I created a class Comment with a property "c_text" and converted each inner-hit source to it before adding it to a list of comments.
var comments = new ArrayList<Comment>();
for (var hit : response.hits().hits()) {
    comments.addAll(hit.innerHits().get("comment").hits().hits().stream()
            .map(h -> h.source().to(Comment.class))
            .collect(Collectors.toList()));
}
System.out.println(comments);
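For completeness, a minimal sketch of what that Comment class could look like -- the original post doesn't show it, so the field name is simply taken from the comment.c_text field queried above:

public class Comment {
    private String c_text;               // matches the "comment.c_text" field used in the query

    public String getC_text() {
        return c_text;
    }

    public void setC_text(String c_text) {
        this.c_text = c_text;
    }
}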
I want to retrieve documents based on a combination of 3 field values:
canonicalForm
grammar
meaning
Here is how I do it now.
String canonicalForm = "tut";
String grammar = "verb";
String meaning = "to land";
BoolQuery bool = BoolQuery.of(q -> q
    .must(m -> m
        .match(mt -> mt
            .field("descr.canonicalForm")
            .query(canonicalForm)
        )
    )
    .must(m -> m
        .match(mt -> mt
            .field("descr.grammar")
            .query(grammar)
        )
    )
    .must(m -> m
        .match(mt -> mt
            .field("descr.meaning")
            .query(meaning)
        )
    )
);
This works as long as I provide a value for all three fields. But sometimes I want to search using only one or two of the fields.
I tried setting the "absent" field values to null, but that raises an exception.
I also tried setting the "absent" value to the empty string but that always returns 0 hits.
Another solution would be to only add a match() clause for a field if the provided value is not null, but I can't figure out how to insert this kind of conditional logic into the fluent DSL builder pattern.
I found a solution, but it's awkward as hell. If someone has a more fluent solution to suggest, please let me know.
The solution I came up with is to:
Create the query as a JSONObject
Transform that JSONObject to an InputStream
Feed that InputStream to the .query(q -> q.withJson()) method
Here is an example below:
// Say we have these input field values
String canonicalForm = "tut";
String grammar = "verb";
String meaning = null; // This means we don't want to query on field 'meaning'

// Build a JSONArray that will contain the "match" criteria for the non-null
// input field values
//
JSONArray mustArr = new JSONArray();
if (canonicalForm != null) {
    mustArr.put(new JSONObject()
        .put("match", new JSONObject()
            .put("descr.canonicalForm", new JSONObject()
                .put("query", canonicalForm)
            )
        )
    );
}
if (grammar != null) {
    mustArr.put(new JSONObject()
        .put("match", new JSONObject()
            .put("descr.grammar", new JSONObject()
                .put("query", grammar)
            )
        )
    );
}
if (meaning != null) {
    mustArr.put(new JSONObject()
        .put("match", new JSONObject()
            .put("descr.meaning", new JSONObject()
                .put("query", meaning)
            )
        )
    );
}

// Build a "bool" query object, feeding it the "must" array.
JSONObject queryJObj = new JSONObject()
    .put("bool", new JSONObject()
        .put("must", mustArr)
    );

// Convert the "query" object to an InputStream
String queryJsonStr = queryJObj.toString();
InputStream queryIS = new ByteArrayInputStream(queryJsonStr.getBytes(StandardCharsets.UTF_8));

SearchRequest sr = SearchRequest.of(s -> s
    .index("morphemes")
    .query(q -> q
        .withJson(queryIS)
    )
);
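To complete the picture, here is a small sketch of executing that request; the esClient variable and the Morpheme result class are assumptions for illustration, not part of the original answer:

// Hypothetical client and document class, shown only to illustrate running the request built above.
SearchResponse<Morpheme> response = esClient.search(sr, Morpheme.class);
response.hits().hits().forEach(hit -> System.out.println(hit.source()));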
I wrote new Java code that builds the list of queries conditionally:
var map = new HashMap<String, String>() {{
    put("descr.canonicalForm", "tut");
    put("descr.grammar", "verb");
    put("descr.meaning", "");
}};

var queries = new ArrayList<Query>();
map.forEach((field, entry) -> {
    if (StringUtils.isNotEmpty(entry)) {
        queries.add(MatchQuery.of(m -> m.field(field).query(entry))._toQuery());
    }
});

var boolQuery = BoolQuery.of(bq -> bq.must(queries));

SearchRequest sr = SearchRequest.of(s -> s
    .index("morphemes")
    .query(q -> q.bool(boolQuery))
);
I have a very strange problem. I create 3 entities with the following data:
Ccb ccb1 = new Ccb(1)
Ccb ccb2 = new Ccb(2)
Ccb ccb3 = new Ccb(3)
Where the parameter (Long) is the object id.
Then, when I try to create a list using the between criterion, the list comes back with size = 0:
ConcurrentLinkedQueue<Long> ccbIds = new ConcurrentLinkedQueue(
    Ccb.createCriteria().list {
        between("id", 1, 5)
        projections {
            id()
        }
    }
)
I've tried this alternative and it doesn't work either:
ConcurrentLinkedQueue<Long> ccbIds = new ConcurrentLinkedQueue(
    Ccb.createCriteria().list {
        between("id", "1", "5")
        projections {
            id()
        }
    }
)
The incredible thing is that if I replace the between with an eq:
ConcurrentLinkedQueue<Long> ccbIds = new ConcurrentLinkedQueue(
    Ccb.createCriteria().list {
        eq("id", 2)
        projections {
            id()
        }
    }
)
the list now returns the element with id 2!
I can't understand where the error is.
Thanks!
EDIT:
Config of DataSource.groovy:
dataSource {
    dbCreate = "create-drop"
    driverClassName = "org.h2.Driver"
    dialect = "org.hibernate.dialect.H2Dialect"
    url = "jdbc:h2:mem:devDb;MVCC=TRUE;LOCK_TIMEOUT=10000;DB_CLOSE_ON_EXIT=FALSE"
}
try this:
Ccb.createCriteria().list {
    between("id", 1l, 5l)
    projections {
        property('id')
    }
}
or:
Ccb.createCriteria().list {
    and {
        between("id", 1l, 5l)
    }
    projections {
        property('id')
    }
}
Can't you stream the list and filter by ID?
def list = foolist.stream().filter(f -> f.getId() > 0 && f.getId() < 4).collect(Collectors.toList())
After running different tests, I came to the conclusion that it is a bug in Grails when using H2 storage. With a SQL database it works fine.
I want to get all docs (millions of them) from an Elasticsearch index based on some condition. I used the query below in Elasticsearch:
GET /<index-name>/_search
{
  "from": 99550,
  "size": 500,
  "query": {
    "term": { "CC_ENGAGEMENT_NUMBER": "1967" }
  }
}
And below is my Java implementation:
public IndexSearchResult findByStudIdAndcollageId(final String studId, final String collageId,
        Integer Page_Number_Start_Index, Integer Total_No_Of_Records) {
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    List<Map<String, Object>> searchResults = new ArrayList<Map<String, Object>>();
    IndexSearchResult indexSearchResult = new IndexSearchResult();
    try {
        QueryBuilder qurBd = new BoolQueryBuilder().minimumShouldMatch(2)
                .should(QueryBuilders.matchQuery("STUD_ID", studId).operator(Operator.AND))
                .should(QueryBuilders.matchQuery("CLG_ID", collageId).operator(Operator.AND));

        sourceBuilder.from(Page_Number_Start_Index).size(Total_No_Of_Records);
        sourceBuilder.query(qurBd);
        sourceBuilder.sort(new FieldSortBuilder("ROLL_NO.keyword").order(SortOrder.DESC));

        SearchRequest searchRequest = new SearchRequest();
        searchRequest.indices("clgindex");
        searchRequest.source(sourceBuilder);

        SearchResponse response;
        response = rClient.search(searchRequest, RequestOptions.DEFAULT);
        response.getHits().forEach(searchHit -> {
            searchResults.add(searchHit.getSourceAsMap());
        });

        indexSearchResult.setListOfIndexes(searchResults);
        log.info("searchResultsHits {}", searchResults.size());
    } catch (Exception e) {
        log.error("search :: Search on clg flat index. {}", e.getMessage());
    }
    return indexSearchResult;
}
So with from = 99550 and size = 500 it fails: you cannot fetch beyond 100,000 (1 lakh) records this way.
Error: "reason" : "Result window is too large, from + size must be less than or equal to: [100000] but was [100050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
I don't want to change [index.max_result_window]. I only want a solution on the Java side that can search all docs in the index based on conditions, using the Elasticsearch API.
Thanks in advance.
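A hedged sketch of the direction the error message itself points to: the scroll API can page through every matching document without raising index.max_result_window. The snippet below reuses rClient, the qurBd query, and the searchResults list from the method above; it is an illustration, not the original author's code:

// Sketch only: page through all matching documents with the scroll API,
// so index.max_result_window never applies. Error handling omitted.
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .query(qurBd)                                       // the same bool query built above
        .size(500)
        .sort(new FieldSortBuilder("ROLL_NO.keyword").order(SortOrder.DESC));
SearchRequest searchRequest = new SearchRequest("clgindex")
        .source(sourceBuilder)
        .scroll(TimeValue.timeValueMinutes(1L));

SearchResponse response = rClient.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();
SearchHit[] hits = response.getHits().getHits();

while (hits != null && hits.length > 0) {
    for (SearchHit hit : hits) {
        searchResults.add(hit.getSourceAsMap());            // same collection as in the question
    }
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
            .scroll(TimeValue.timeValueMinutes(1L));
    response = rClient.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
    hits = response.getHits().getHits();
}

ClearScrollRequest clearScrollRequest = new ClearScrollRequest();
clearScrollRequest.addScrollId(scrollId);
rClient.clearScroll(clearScrollRequest, RequestOptions.DEFAULT);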
I have a Kinesis stream and created a Firehose delivery stream that saves all the data to S3; it was saving correctly into hourly folders. Then I wrote a Firehose transformation Lambda, and after deploying it all the messages go to the same folder. I am not sure what I am missing. I have the fields below in my response from the Lambda function:
result.put("recordId", record.getRecordId());
result.put("result", "Ok");
result.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
result.put("approximateArrivalTimestamp",record.getApproximateArrivalTimestamp());
result.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
result.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
Edit:
Here is my code in Java. I am using KinesisFirehoseEvent, and decoding was not needed for my case; I get a ByteBuffer from KinesisFirehoseEvent.
public JSONObject handler(KinesisFirehoseEvent kinesisFirehoseEvent, Context context) {
    final LambdaLogger logger = context.getLogger();
    final JSONArray resultArray = new JSONArray();

    for (final KinesisFirehoseEvent.Record record : kinesisFirehoseEvent.getRecords()) {
        final byte[] data = record.getData().array();
        final Optional<TestData> testData = deserialize(data, logger);
        if (testData.isPresent()) {
            final JSONObject jsonObj = new JSONObject();
            final String jsonData = gson.toJson(testData.get());
            jsonObj.put("recordId", record.getRecordId());
            jsonObj.put("result", "Ok");
            jsonObj.put("approximateArrivalEpoch", record.getApproximateArrivalEpoch());
            jsonObj.put("approximateArrivalTimestamp", record.getApproximateArrivalTimestamp());
            jsonObj.put("kinesisRecordMetadata", record.getKinesisRecordMetadata());
            jsonObj.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes()));
            resultArray.add(jsonObj);
        }
        else {
            logger.log("testData not deserialized");
        }
    }

    final JSONObject jsonFinalObj = new JSONObject();
    jsonFinalObj.put("records", resultArray);
    return jsonFinalObj;
}
The data your Lambda function returns is not in the correct format.
Check out the example below:
'use strict';
console.log('Loading function');

/* Stock Ticker format parser */
const parser = /^\{\"TICKER_SYMBOL\"\:\"[A-Z]+\"\,\"SECTOR\"\:"[A-Z]+\"\,\"CHANGE\"\:[-.0-9]+\,\"PRICE\"\:[-.0-9]+\}/;

exports.handler = (event, context, callback) => {
    let success = 0; // Number of valid entries found
    let failure = 0; // Number of invalid entries found
    let dropped = 0; // Number of dropped entries

    /* Process the list of records and transform them */
    const output = event.records.map((record) => {
        const entry = (new Buffer(record.data, 'base64')).toString('utf8');
        let match = parser.exec(entry);
        if (match) {
            let parsed_match = JSON.parse(match);
            var milliseconds = new Date().getTime();
            /* Add timestamp and convert to CSV */
            const result = `${milliseconds},${parsed_match.TICKER_SYMBOL},${parsed_match.SECTOR},${parsed_match.CHANGE},${parsed_match.PRICE}` + "\n";
            const payload = (new Buffer(result, 'utf8')).toString('base64');
            if (parsed_match.SECTOR != 'RETAIL') {
                /* Dropped event, notify and leave the record intact */
                dropped++;
                return {
                    recordId: record.recordId,
                    result: 'Dropped',
                    data: record.data,
                };
            }
            else {
                /* Transformed event */
                success++;
                return {
                    recordId: record.recordId,
                    result: 'Ok',
                    data: payload,
                };
            }
        }
        else {
            /* Failed event, notify the error and leave the record intact */
            console.log("Failed event : " + record.data);
            failure++;
            return {
                recordId: record.recordId,
                result: 'ProcessingFailed',
                data: record.data,
            };
        }
        /* This transformation is the "identity" transformation, the data is left intact
        return {
            recordId: record.recordId,
            result: 'Ok',
            data: record.data,
        } */
    });
    console.log(`Processing completed. Successful records ${output.length}.`);
    callback(null, { records: output });
};
The documentation below gives more details on the expected return format:
https://aws.amazon.com/blogs/compute/amazon-kinesis-firehose-data-transformation-with-aws-lambda/
Hope it helps.
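As a hedged Java sketch of the same point for the handler in the question -- assuming, per that documentation, that each output record should carry only recordId, result, and data -- the record-building step would look like this:

// Sketch: keep only the three keys a transformed Firehose record is expected to carry.
// The extra fields from the original handler (approximateArrivalEpoch,
// approximateArrivalTimestamp, kinesisRecordMetadata) are dropped.
final JSONObject jsonObj = new JSONObject();
jsonObj.put("recordId", record.getRecordId());
jsonObj.put("result", "Ok");
jsonObj.put("data", Base64.getEncoder().encodeToString(jsonData.getBytes(StandardCharsets.UTF_8)));
resultArray.add(jsonObj);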
I got this working using the above code only; it just seems the stream is slow, so data for the new hours hadn't arrived yet.