KStream windowedBy SessionWindow emitting intermediate aggregated messages - java

I am trying to merge values in a Kafka Streams application by grouping them by key, windowing them into sessions (30 seconds of inactivity), aggregating, and producing to a new topic. However, within the same window the code produces multiple aggregated records: some have a null value, and others contain only the first few aggregated values. I would expect the aggregation to produce a single output at the end of the session window. I also tried using suppress, but it didn't help. Can someone point out what I did wrong here?
Here is my code:
KStream<String, EventLog> elogStream = builder.stream("eventlognw", Consumed.with(stringSerde, eventLogSerde));

elogStream.groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofSeconds(30)))
    .aggregate(
        () -> EventLogList.newBuilder().setEventLogItems(new ArrayList<EventLog>()).build(),
        (key, value, wradAggregate) -> {
            List<EventLog> elogList = wradAggregate.getEventLogItems();
            if (null == elogList || elogList.isEmpty()) {
                elogList = new ArrayList<>();
            }
            elogList.add(value);
            wradAggregate.setEventLogItems(elogList);
            log.info("***wradAggregate={}", null != wradAggregate ? wradAggregate.toString() : null);
            return wradAggregate;
        },
        (aggKey, aggOne, aggTwo) -> {
            List<EventLog> elogList2 = aggTwo.getEventLogItems();
            elogList2.removeAll(aggOne.getEventLogItems());
            aggOne.setEventLogItems(elogList2);
            log.info("***aggOne={}", null != aggOne ? aggOne.toString() : null);
            return aggOne;
        })
    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
    //.suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30), BufferConfig.unbounded()))
    .toStream()
    .map((k, v) -> KeyValue.pair(k.key(), v))
    .peek((k, v) -> log.info("****After applying GroupBy and Aggregate on elogStream, key(reqId): {}, value(eventLog): {}", k, v))
    //.to("eventlogagg")
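For reference, if the commented-out sink is re-enabled, the aggregated value needs explicit serdes; a minimal sketch of the tail of the topology (eventLogListSerde here is a hypothetical Serde for EventLogList, not something from the original code):
.toStream()
.map((k, v) -> KeyValue.pair(k.key(), v))                           // unwrap the Windowed<String> key into a plain String key
.to("eventlogagg", Produced.with(stringSerde, eventLogListSerde));  // eventLogListSerde is hypothetical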

Related

(DataStax 4.1.0) (Cassandra) How do I collect all responses with session.executeAsync?

I want to make an async call to a Cassandra DB with executeAsync.
In the manual I found this code, but I couldn't understand how to collect all the rows into a list.
It is a really basic query, like SELECT * FROM table, and I want to store all the results.
https://docs.datastax.com/en/developer/java-driver/4.4/manual/core/async/
CompletionStage<CqlSession> sessionStage = CqlSession.builder().buildAsync();

// Chain one async operation after another:
CompletionStage<AsyncResultSet> responseStage =
    sessionStage.thenCompose(
        session -> session.executeAsync("SELECT release_version FROM system.local"));

// Apply a synchronous computation:
CompletionStage<String> resultStage =
    responseStage.thenApply(resultSet -> resultSet.one().getString("release_version"));

// Perform an action once a stage is complete:
resultStage.whenComplete(
    (version, error) -> {
      if (error != null) {
        System.out.printf("Failed to retrieve the version: %s%n", error.getMessage());
      } else {
        System.out.printf("Server version: %s%n", version);
      }
      sessionStage.thenAccept(CqlSession::closeAsync);
    });
You need to refer to the section about asynchronous paging: you need to provide a callback that collects the data into a list supplied as an external object. The documentation has the following example:
CompletionStage<AsyncResultSet> futureRs =
    session.executeAsync("SELECT * FROM myTable WHERE id = 1");
futureRs.whenComplete(this::processRows);

void processRows(AsyncResultSet rs, Throwable error) {
  if (error != null) {
    // The query failed, process the error
  } else {
    for (Row row : rs.currentPage()) {
      // Process the row...
    }
    if (rs.hasMorePages()) {
      rs.fetchNextPage().whenComplete(this::processRows);
    }
  }
}
In this case, processRows can store the data in a list that is part of the current object, something like this:
class Abc {
  List<Row> rows = new ArrayList<>();

  // call to executeAsync
  void processRows(AsyncResultSet rs, Throwable error) {
    ....
    for (Row row : rs.currentPage()) {
      rows.add(row);
    }
    ....
  }
}
But you'll need to be very careful with SELECT * FROM table, as it may return a lot of results, and it may also time out if you have too much data; in that case it's better to perform a token range scan (I have an example for driver 3.x, but not for 4.x yet).
Here is a sample for 4.x (you'll also find a sample of the reactive code available since 4.4, by the way):
https://github.com/datastax/cassandra-reactive-demo/blob/master/2_async/src/main/java/com/datastax/demo/async/repository/AsyncStockRepository.java
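Putting the snippets above together, here is a minimal, self-contained sketch for driver 4.x (the keyspace and table names are hypothetical) that collects every page into a single CompletableFuture<List<Row>>:
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.AsyncResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CollectAllRows {

    // Adds every row of the current page to the accumulator, then either chains
    // onto the next page or completes the future with the full list.
    static void collect(AsyncResultSet rs, List<Row> acc, CompletableFuture<List<Row>> done) {
        for (Row row : rs.currentPage()) {
            acc.add(row);
        }
        if (rs.hasMorePages()) {
            rs.fetchNextPage().whenComplete((next, error) -> {
                if (error != null) {
                    done.completeExceptionally(error);
                } else {
                    collect(next, acc, done);
                }
            });
        } else {
            done.complete(acc);
        }
    }

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            CompletableFuture<List<Row>> all = new CompletableFuture<>();
            session.executeAsync("SELECT * FROM my_keyspace.my_table") // hypothetical keyspace/table
                   .whenComplete((rs, error) -> {
                       if (error != null) {
                           all.completeExceptionally(error);
                       } else {
                           collect(rs, new ArrayList<>(), all);
                       }
                   });
            System.out.println("Fetched " + all.join().size() + " rows"); // blocks only for the demo
        }
    }
}
The same caveat as above applies: an unbounded SELECT * can make this list arbitrarily large.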

Rest High Level Client: Request timeout is not working

We are trying to use a request timeout in our queries, but it doesn't seem to be working for us.
Here are the things we have done as part of the setup:
- search.default_allow_partial_results: false (on the server side as well as the client side)
- Set a timeout of 10 ms on every search query that we issue (client side)
Apart from these, we have global timeouts set as shown in the code below:
RestHighLevelClient client = new RestHighLevelClient(
    RestClient.builder(httpHost)
        .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
            .setConnectTimeout(30000)
            .setConnectionRequestTimeout(90000)
            .setSocketTimeout(90000))
        .setMaxRetryTimeoutMillis(90000));
Queries take more than 8 seconds but still do not time out. We disabled partial results expecting to get a timeout error, but we don't get any error either.
Also, the isTimedOut flag is always returned as false, even though the query took longer than the specified timeout.
Here's a sample of the request that I'm running:
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
QueryBuilder infraQueryBuilder = QueryBuilders.termQuery("field1", field1);
QueryBuilder totalCountRangeQueryBuilder = QueryBuilders.rangeQuery("field2").gte(3);
BoolQueryBuilder innerBoolQueryBuilder = QueryBuilders.boolQuery();
innerBoolQueryBuilder.must(QueryBuilders.rangeQuery("nestedDocType1.field1").gt(2));
QueryBuilder filter = QueryBuilders
    .nestedQuery("nestedDocType1", innerBoolQueryBuilder, ScoreMode.Max)
    .innerHit(new InnerHitBuilder()
        .setFetchSourceContext(
            new FetchSourceContext(true, new String[]{"nestedDocType1.field1"}, null))
        .addSort(SortBuilders.fieldSort("nestedDocType1.field1").order(SortOrder.DESC))
        .setSize(1)
    );
boolQueryBuilder.must(infraQueryBuilder);
boolQueryBuilder.must(totalCountRangeQueryBuilder);
if (inputRevisions != null && (inputRevisions.size() > 0)) {
    QueryBuilder allEligibleRevisionsFilter = QueryBuilders
        .termsQuery("field3", inputRevisions);
    boolQueryBuilder.must(allEligibleRevisionsFilter);
}
boolQueryBuilder.filter(filter);
sourceBuilder.query(boolQueryBuilder)
    .fetchSource(new String[]{
        "field3",
        "field2"
    }, null);
sourceBuilder.size(batchSize);
sourceBuilder.timeout(TimeValue.timeValueMillis(10));
SearchRequest searchRequest = createSearchRequest(sourceBuilder, enterpriseId);
searchRequest.allowPartialSearchResults(false);
SearchResponse searchResponse = getSearchResponse(searchRequest);
ESCustomScroll<Set<String>> esCustomScroll = this::populateProcessedRevisionsSetWithESScroll;
getESDataByScroll(esCustomScroll, searchResponse, processedRevisions); // gets the data by scrolling over again and again until data is available.
Here's the code that we use for scrolling:
private boolean populateProcessedRevisionsSetWithESScroll(SearchResponse searchResponse, Set<String> processedRevisions) {
    if (searchResponse == null ||
            searchResponse.getHits() == null ||
            searchResponse.getHits().getHits() == null ||
            searchResponse.getHits().getHits().length == 0) {
        return false;
    }
    for (SearchHit outerHit : searchResponse.getHits().getHits()) {
        Map<String, Object> outerSourceMap = outerHit.getSourceAsMap();
        String revision = (String) outerSourceMap.get("field4");
        int totalCount = (Integer) outerSourceMap.get("field3");
        SearchHit[] innerHits = outerHit.getInnerHits().get("nestedDocType1").getHits();
        if (innerHits == null || innerHits.length == 0) {
            logger.error("No inner hits found for revision: " + revision);
            continue;
        }
        Map<String, Object> innerSourceMap = innerHits[0].getSourceAsMap();
        int simCount = (Integer) innerSourceMap.get("field1");
        if (((totalCount - simCount) == 0) || (simCount > ((totalCount - simCount) / 2))) {
            processedRevisions.add(revision);
        }
    }
    return true;
}
Even in the case of partial results, we expect the isTimedOut flag to be set, but that's not the case.
Can you please point out where we are going wrong or what we are missing?
Related question: Java High Level Rest Client is not releasing connection although timeout is set
Try setting setMaxRetryTimeoutMillis on the RestClientBuilder – it creates a listener and cuts the request off once the setMaxRetryTimeoutMillis value expires.

Reactor, how to debug an OverflowException?

I'm trying to find a way to understand/debug why I randomly get this stack trace:
reactor.core.Exceptions$OverflowException: Could not emit buffer due to lack of requests
at reactor.core.Exceptions.failWithOverflow(Exceptions.java:215)
at reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.emit(FluxBufferPredicate.java:292)
at reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.onNextNewBuffer(FluxBufferPredicate.java:251)
at reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.tryOnNext(FluxBufferPredicate.java:205)
at reactor.core.publisher.FluxBufferPredicate$BufferPredicateSubscriber.onNext(FluxBufferPredicate.java:180)
at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:201)
at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerNext(FluxConcatMap.java:271)
at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onNext(FluxConcatMap.java:803)
at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:232)
at reactor.core.publisher.FluxIterable$IterableSubscription.request(FluxIterable.java:190)
at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:1444)
at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.onSubscribe(Operators.java:1318)
at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:128)
at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:61)
at reactor.core.publisher.Flux.subscribe(Flux.java:6873)
Does it mean that the producer is faster than the consumer? My pattern is probably not standard and looks like the following (simplified here):
Flux<Pair<Person, String>> auto = getPersons() // REST GET endpoint
    .map(p -> {
        // In my real-life example, the operation done here is quite expensive.
        Person newP = new Person(p.name, p.age + 10);
        return new Pair<>(newP, "The new age of " + newP.name + " is now " + newP.age);
    })
    .publish()
    .autoConnect(2);

Flux<Person> personsToSave = auto.map(e -> e.first);
Flux<String> auditToSave = auto.map(e -> e.second);

Mono.when(
        savePersons(personsToSave), // REST POST endpoint
        saveAudit(auditToSave))     // REST POST endpoint
    .doOnError(e -> System.err.println(e.getMessage()))
    .block();
Hooks.onOperatorDebug() or log() does not help me much. I don't have the problem if I remove the publish() and only save the persons OR the audit.
Can someone tell me how to investigate this more precisely? (Or give me an idea of how to solve the issue?)
Reactor 3.1.6
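One way to narrow this down is to put named log() and checkpoint() markers on each stage, so the logs show which subscriber stops issuing request(n) and any error is tagged with the branch it crossed. Here is a self-contained sketch with stand-in data (the Person/REST pieces of the question are replaced by placeholders):
import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class DebugOverflow {
    public static void main(String[] args) {
        // Stand-in for getPersons(); in the real code this is a REST GET endpoint.
        Flux<String> source = Flux.range(1, 1_000).map(i -> "person-" + i);

        Flux<String> auto = source
                .map(String::toUpperCase)   // placeholder for the expensive mapping
                .log("mapped")              // logs request(n)/onNext/onError signals for this stage
                .publish()
                .autoConnect(2);

        // checkpoint() tags errors passing through with the branch's assembly site.
        Flux<String> personsToSave = auto.checkpoint("persons branch").log("persons");
        Flux<String> auditToSave = auto.checkpoint("audit branch").log("audit");

        Mono.when(
                personsToSave.delayElements(Duration.ofMillis(1)).then(), // deliberately slower consumer
                auditToSave.then())
            .doOnError(e -> System.err.println(e.getMessage()))
            .block();
    }
}
Comparing the request(n) values logged on the "persons" and "audit" branches with what "mapped" emits shows which side stops requesting before the overflow.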

Kafka Streams application strange behavior in docker container

I am running a Kafka Streams application in a Docker container with docker-compose. However, the streams application is behaving strangely. I have a source topic (topicSource) and multiple destination topics (topicDestination1, topicDestination2 ... topicDestination10) that I branch to based on certain predicates.
topicSource and topicDestination1 have a direct mapping, i.e. all the records simply go into the destination topic without any filtering.
Now all this works perfectly fine when I run the application locally or on a server without containers.
On the other hand, when I run the streams app in a container (using docker-compose and Kubernetes), it doesn't forward all logs from topicSource to topicDestination1. In fact, only a small number of records are forwarded, for example 3000+ records on the source topic and only 6 records in the destination topic. All of this is really strange.
This is my Dockerfile:
#FROM openjdk:8u151-jdk-alpine3.7
FROM openjdk:8-jdk
COPY /target/streams-examples-0.1.jar /streamsApp/
COPY /target/libs /streamsApp/libs
COPY log4j.properties /
CMD ["java", "-jar", "/streamsApp/streams-examples-0.1.jar"]
NOTE: I build the jar before creating the image so that I always have up-to-date code. I have made sure that both versions of the code, the one running without a container and the one with a container, are the same.
Main.java:
Creating Source Stream from Source Topic:
KStream<String, String> source_stream = builder.stream("topicSource");
Branching based on predicates:
KStream<String, String>[] branches_source_topic = source_stream.branch(
    (key, value) -> (value.contains("Operation\":\"SharingSet") && value.contains("ItemType\":\"File")), // Sharing Set by Date
    (key, value) -> (value.contains("Operation\":\"AddedToSecureLink") && value.contains("ItemType\":\"File")), // Added to secure link
    (key, value) -> (value.contains("Operation\":\"AddedToGroup")), // Added to group
    (key, value) -> (value.contains("Operation\":\"Add member to role.") || value.contains("Operation\":\"Remove member from role.")), // Role update by date
    (key, value) -> (value.contains("Operation\":\"FileUploaded") || value.contains("Operation\":\"FileDeleted")
        || value.contains("Operation\":\"FileRenamed") || value.contains("Operation\":\"FileMoved")), // Upload file by date
    (key, value) -> (value.contains("Operation\":\"UserLoggedIn")), // User logged in by date
    (key, value) -> (value.contains("Operation\":\"Delete user.") || value.contains("Operation\":\"Add user.")
        && value.contains("ResultStatus\":\"success")), // Manage user by date
    (key, value) -> (value.contains("Operation\":\"DLPRuleMatch") && value.contains("Workload\":\"OneDrive")) // MS DLP
);
Sending logs to destination topics:
This is the direct mapping topic i.e. all the records are simply going into the destination topic without any filtering.
AppUtil.pushToTopic(source_stream, Constant.USER_ACTIVITY_BY_DATE, "topicDestination1");
Sending logs from branches to destination topics:
AppUtil.pushToTopic(branches_source_topic[0], Constant.SHARING_SET_BY_DATE, "topicDestination2");
AppUtil.pushToTopic(branches_source_topic[1], Constant.ADDED_TO_SECURE_LINK_BY_DATE, "topicDestination3");
AppUtil.pushToTopic(branches_source_topic[2], Constant.ADDED_TO_GROUP_BY_DATE, "topicDestination4");
AppUtil.pushToTopic(branches_source_topic[3], Constant.ROLE_UPDATE_BY_DATE, "topicDestination5");
AppUtil.pushToTopic(branches_source_topic[4], Constant.UPLOAD_FILE_BY_DATE, "topicDestination6");
AppUtil.pushToTopic(branches_source_topic[5], Constant.USER_LOGGED_IN_BY_DATE, "topicDestination7");
AppUtil.pushToTopic(branches_source_topic[6], Constant.MANAGE_USER_BY_DATE, "topicDestination8");
AppUtil.java:
public static void pushToTopic(KStream<String, String> sourceTopic, HashMap<String, String> hmap, String destTopicName) {
    sourceTopic.flatMapValues(new ValueMapper<String, Iterable<String>>() {
        @Override
        public Iterable<String> apply(String value) {
            ArrayList<String> keywords = new ArrayList<String>();
            try {
                JSONObject send = new JSONObject();
                JSONObject received = processJSON(new JSONObject(value), destTopicName);
                boolean valid_json = true;
                for (String key : hmap.keySet()) {
                    if (received.has(hmap.get(key))) {
                        send.put(key, received.get(hmap.get(key)));
                    } else {
                        valid_json = false;
                    }
                }
                if (valid_json) {
                    keywords.add(send.toString());
                }
            } catch (Exception e) {
                System.err.println("Unable to convert to json");
                e.printStackTrace();
            }
            return keywords;
        }
    }).to(destTopicName);
}
Where are the logs coming from:
The logs come from a continuous online stream. A Python job gets the logs, which are basically URLs, and sends them to a pre-source topic. Then, in the streams app, I create a stream from that topic and hit those URLs, which return JSON logs that I push to topicSource.
I have spent a lot of time trying to resolve this. I have no idea what is going wrong or why it is not processing all the logs. Kindly help me figure this out.
After a lot of debugging I realized I had been exploring in the wrong direction: it was a simple case of the consumer being slower than the producer. The producer kept writing new records to the topic, and since the messages were only consumed after stream processing, the consumer was obviously slow. Simply increasing the topic partitions and launching multiple application instances with the same application ID did the trick.
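A rough sketch of that fix (topic, broker, and application names here are hypothetical): add partitions to the source topic with the kafka-topics CLI, then start several copies of the application with the same application.id so Kafka Streams spreads the partitions across them as one consumer group:
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

// One-off, outside the app (hypothetical names):
//   kafka-topics.sh --bootstrap-server broker:9092 --alter --topic topicSource --partitions 6

StreamsBuilder builder = new StreamsBuilder();
// ... build the same topology as in Main.java above ...

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-examples");  // identical on every instance
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);               // optionally more threads per instance

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start(); // run N instances of this process; together they consume all partitions of topicSource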

Cannot get a joined Kafka stream to run or output anything

For the below code, both stream1 and stream2 run fine individually and I can see output, but the joined stream just doesn't log anything at all. I have a feeling it has something to do with the join window, but the data from both streams comes in at almost exactly the same time.
val stream = builder.stream(stringSerde, byteArraySerde, "topic")
val stream1 = stream
.filter((key, value) => somefilter(key, value))
.through(stringSerde, byteArraySerde, "topic1")
val stream2 = stream
.filter((key, value) => someotherfilter(key, value))
.through(stringSerde, byteArraySerde, "topic2")
val joinedStream = stream1
.join(stream2, (value1: Array[Byte], value2: Array[Byte]) => {
println("wont print anything")
return somerandomdata
},
JoinWindows.of("othertopic").within(10000L),
stringSerde, byteArraySerde, byteArraySerde)
Shouldn't the keys of both topics be the same in order to join them?
I think the Javadoc explains this:
https://kafka.apache.org/0102/javadoc/org/apache/kafka/streams/kstream/JoinWindows.html
This might also be an interesting read:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
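To sanity-check the keys, here is a small sketch in the current Java DSL (newer than the 0.10-era API in the question; topic names reused from it, serdes assumed to be string keys and byte-array values) that logs the key on each side right before the join. Records only join when both sides have the same key and their timestamps fall within the join window:
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.StreamJoined;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, byte[]> stream1 = builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.ByteArray()));
KStream<String, byte[]> stream2 = builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.ByteArray()));

stream1.peek((k, v) -> System.out.println("left  key=" + k))
       .join(stream2.peek((k, v) -> System.out.println("right key=" + k)),
             (v1, v2) -> v1,                                          // the joiner only runs for matching keys
             JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(10)),
             StreamJoined.with(Serdes.String(), Serdes.ByteArray(), Serdes.ByteArray()))
       .print(Printed.toSysOut());
If the two peek() lines never print the same key, the join has nothing to match and will stay silent, exactly as described above.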
