I'm using Apache Kafka Streams, where I've added a transform to my stream:
final StreamsBuilder streamsBuilder = new StreamsBuilder();
final StoreBuilder<KeyValueStore<String, byte[]>> correlationStore =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(STORE_NAME),
Serdes.String(),
Serdes.ByteArray());
streamsBuilder.addStateStore(correlationStore);
streamsBuilder.stream(topicName, inputConsumed)
.peek(InboundPendingMessageStreamer::logEntries)
.transform(() -> new CleanerTransformer<String, byte[], KeyValue<String, byte[]>>(Duration.ofMillis(5000), STORE_NAME), STORE_NAME)
.toTable();
I'm having difficulty understanding the CleanerTransformer class that I created: in the init method, I set up a schedule with a scanFrequency and a PunctuationType.
@Override
public void init(ProcessorContext context) {
this.stateStore = context.getStateStore(purgeStoreName);
context.schedule(scanFrequency, PunctuationType.STREAM_TIME, timestamp -> {
try (final KeyValueIterator<K, byte[]> all = stateStore.all()) {
while (all.hasNext()) {
final var headers = context.headers();
final KeyValue<K, byte[]> record = all.next();
}
}
});
}
When I add an event to the stream, I get the message in the scheduled callback, but it is only executed once.
My understanding was that it should be executed at every interval configured by scanFrequency.
Any idea what I'm doing wrong here?
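For reference (this is not from the original post): the two punctuation types are registered the same way but fire under different conditions. STREAM_TIME is purely data-driven, so it only fires while new records with newer timestamps keep arriving and advance stream time; WALL_CLOCK_TIME fires on system time regardless of traffic. A minimal sketch of the same init method with both variants, reusing the field names from the snippet above (everything else is illustrative, imports omitted):
@Override
public void init(ProcessorContext context) {
    this.stateStore = context.getStateStore(purgeStoreName);
    // STREAM_TIME: fires only once stream time has advanced by at least scanFrequency,
    // which happens only while new records with newer timestamps are being processed.
    context.schedule(scanFrequency, PunctuationType.STREAM_TIME, timestamp -> {
        // scan stateStore here
    });
    // WALL_CLOCK_TIME: fires every scanFrequency of system time,
    // even when no new records arrive.
    context.schedule(scanFrequency, PunctuationType.WALL_CLOCK_TIME, timestamp -> {
        // scan stateStore here
    });
}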
I have a KStream eventsStream, which gets its data from a topic "events".
There are two types of events, distinguished by their keys:
1. {user_id = X, event_id = 1} {..value, include time_event...}
2. {user_id = X, event_id = 2} {..value, include time_event...}
I need to forward events with event_id = 1 to a topic "results" if the same user does not produce an event with event_id = 2 within 10 minutes.
For example,
1. First case: we get {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and no event {user_id = 100, event_id = 2} arrives within 10 minutes, so we write it to the results topic.
2. Second case: we get {user_id = 100, event_id = 1} {.. time_event = xxxx ...} and an event {user_id = 100, event_id = 2} {.. time_event = xxxx + 5 minutes ...} arrives within 10 minutes, so we do not write it to the results topic.
How can this behavior be implemented in Java using Kafka Streams?
My code:
public class ResultStream {
public static KafkaStreams newStream() {
Properties properties = Config.getProperties("ResultStream");
Serde<String> stringSerde = Serdes.String();
StreamsBuilder builder = new StreamsBuilder();
StoreBuilder<KeyValueStore<String, String>> store =
Stores.keyValueStoreBuilder(
Stores.inMemoryKeyValueStore("inmemory"),
stringSerde,
stringSerde
);
builder.addStateStore(store);
KStream<String, String> resourceEventStream = builder.stream(EVENTS.topicName(), Consumed.with(stringSerde, stringSerde));
resourceEventStream.print(Printed.toSysOut());
resourceEventStream.process(() -> new CashProcessor("inmemory"), "inmemory");
resourceEventStream.process(() -> new FilterProcessor("inmemory", resourceEventStream), "inmemory");
Topology topology = builder.build();
return new KafkaStreams(topology, properties);
}
}
public class FilterProcessor implements Processor {
private ProcessorContext context;
private String eventStoreName;
private KeyValueStore<String, String> eventStore;
private KStream<String, String> stream;
public FilterProcessor(String eventStoreName, KStream<String, String> stream) {
this.eventStoreName = eventStoreName;
this.stream = stream;
}
@Override
public void init(ProcessorContext processorContext) {
this.context = processorContext;
eventStore = (KeyValueStore) processorContext.getStateStore(eventStoreName);
}
@Override
public void process(Object key, Object value) {
this.context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
System.out.println("Scheduler is working");
stream.filter((k, v) -> {
JsonObject events = new Gson().fromJson(k, JsonObject.class);
if (***condition***) {
return true;
}
return false;
}).to("results");
});
}
@Override
public void close() {
}
}
CashProcessor's role is only to put events into the local store, and to delete a user's record with event_id = 1 if an event with event_id = 2 arrives for the same user.
FilterProcessor should filter events using the local store every minute, but I can't get this processing to be invoked correctly (at least not the way I'm doing it now)...
I really need help.
Why do you pass a KStream into your processor? That is not how the DSL works.
As you "connect" your processors via resourceEventStream.process() already, your FilterProcessor#process(key, value) method will be called for each record in the stream automatically -- however, a KStream#process() is a terminal operation and thus does not allow you to send any data downstream. Instead, you might want to use transform() (that is basically the same as process() plus an output KStream).
To actually forward data downstream from your punctuation, call context.forward() on the ProcessorContext that is provided via the init() method.
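For illustration only, here is a minimal sketch of such a transformer (the class name FilterTransformer and the purge condition are made up, imports are omitted as in the snippets above; it assumes <String, String> records and the "inmemory" store):
public class FilterTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private final String storeName;
    private ProcessorContext context;
    private KeyValueStore<String, String> eventStore;

    public FilterTransformer(String storeName) {
        this.storeName = storeName;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.eventStore = (KeyValueStore<String, String>) context.getStateStore(storeName);
        // Register the punctuation once, here in init(), not per record in transform()
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> all = eventStore.all()) {
                while (all.hasNext()) {
                    KeyValue<String, String> entry = all.next();
                    // apply your own check here (e.g. event_id = 1 and older than 10 minutes)
                    context.forward(entry.key, entry.value); // goes downstream, e.g. to .to("results")
                    eventStore.delete(entry.key);
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        eventStore.put(key, value); // bookkeeping only; emission happens in the punctuator
        return null;                // returning null forwards nothing for this record
    }

    @Override
    public void close() {
    }
}
It would then be wired in as resourceEventStream.transform(() -> new FilterTransformer("inmemory"), "inmemory").to("results"); instead of the FilterProcessor process() call (the CashProcessor bookkeeping could stay as a separate process() or be folded into transform()).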
I would like to turn many records into a single output record. I have tried many things, like custom reducers and aggregators, but they all still send records back out one-to-one. For example, I would like to combine many strings into just one string. If my stream has messages with the same key but different values, "the", "sky", "is", "blue", then I would like to output one concatenation of them to a new topic: "the,sky,is,blue,". What I am getting instead is 4 messages: "the,", "the,sky,", "the,sky,is,", "the,sky,is,blue,". When I send a second message to the Kafka consumer, it concatenates onto the previous aggregation and I eventually receive "the,sky,is,blue,the,sky,is,blue,".
I also tried using a custom StoreBuilder and changing a lot of the settings to see if that would do anything.
Map<String, String> changelogConfig = new HashMap<>();
changelogConfig.put("message.down.conversion.enable", "true");
changelogConfig.put("flush.messages", "0");
changelogConfig.put("flush.ms", "0");
StoreBuilder<KeyValueStore<String, String>> aggStoreSupplier = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("AggStore"),
Serdes.String(),
Serdes.String())
.withLoggingEnabled(changelogConfig);
KStream<String, String> results = source // single messages get processed and eventually I get these string results that I need to concatenate
.groupByKey() // this kgroupedstream has the N records, which was how many were sent in the message
.reduce(new Reducer<String>() {
@Override
public String apply(String aggValue, String value) {
return value + "," + aggValue;
}
}, Materialized.as("AggStore"))
.toStream();
results.to("results", Produced.with(Serdes.String(), Serdes.String()));
final Topology topology = builder.build(); // to describe topology
System.out.println(topology.describe()); // to print description
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.cleanUp();
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
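For context (this is not from the original post): reduce() produces a KTable, and every update to that table is forwarded downstream, which is why one output appears per input; and because the aggregation is not windowed, its state lives forever, which is why the second send concatenates onto the first. One commonly suggested way to emit only a single final result is to window the aggregation and add suppress(); a rough sketch, assuming Kafka Streams 2.1+ and an arbitrarily chosen 10-second window (imports omitted):
KStream<String, String> results = source
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(10)).grace(Duration.ofSeconds(2)))
        .reduce((aggValue, value) -> aggValue + "," + value)
        // forward only the final result of each window, once the window closes
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .toStream()
        // unwrap the windowed key back to the original String key
        .map((windowedKey, value) -> KeyValue.pair(windowedKey.key(), value));
results.to("results", Produced.with(Serdes.String(), Serdes.String()));
Note that suppress() is driven by stream time, so the final record is only emitted once newer records advance stream time past the window end plus the grace period.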
I'm doing a project and I'm stuck on the KTable.
I want to take records from a topic and put them into a KTable (store), so that I have one record per key.
static KafkaStreams streams;
final Serde<Long> longSerde = Serdes.Long();
final Serde<byte[]> byteSerde = Serdes.ByteArray();
static String topicName;
static String storeName;
final StreamsBuilder builder = new StreamsBuilder();
KStream<Long, byte[]> streamed = builder.stream(topicName, Consumed.with(longSerde, byteSerde));
KTable<Long, byte[]> records = streamed.groupByKey().reduce(
new Reducer<Long>() {
@Override
public Long apply(Long aggValue, Long newValue) {
return newValue;
}
},
storeName);
This is the closest I've gotten to the answer, I think.
Your approach is correct, but you need to use the correct serdes.
In the .reduce() function, the value type should be byte[].
KStream<Long, byte[]> streamed = builder.stream(topicName, Consumed.with(longSerde, byteSerde));
KTable<Long, byte[]> records = streamed.groupByKey().reduce(
new Reducer<byte[]>() {
@Override
public byte[] apply(byte[] aggValue, byte[] newValue) {
return newValue;
}
},
Materialized.<Long, byte[], KeyValueStore<Bytes, byte[]>>as(storeName).withKeySerde(longSerde).withValueSerde(byteSerde));
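As a side note (not part of the answer above): if one record per key is all that is needed, the topic can also be read directly as a KTable, which skips the groupByKey().reduce() step entirely. A sketch, assuming the same serdes and store name (imports such as org.apache.kafka.common.utils.Bytes omitted):
KTable<Long, byte[]> records = builder.table(
        topicName,
        Consumed.with(longSerde, byteSerde),
        Materialized.<Long, byte[], KeyValueStore<Bytes, byte[]>>as(storeName)
                .withKeySerde(longSerde)
                .withValueSerde(byteSerde));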
My current usage scenario is as follows:
1. use FlatFileItemReader to read the input .txt file line by line
2. use ItemProcessor to process each line by invoking a remote service over HTTP
3. use FlatFileItemWriter to write the result of each request to a file
I would like to make the remote calls in step 2 (the ItemProcessor) multi-threaded.
The main flow code is like below (with Spring Boot):
//read data
FlatFileItemReader<ItemProcessing> reader = read(batchReqRun);
//process data
ItemProcessor<ItemProcessing, ItemProcessing> processor = process(batchReqDef);
//write data
File localOutputDir = new File(localStoragePath+"/batch-results");
File localOutputFile = new File(localOutputDir, batchReqExec.getDestFile());
FlatFileItemWriter<ItemProcessing> writer = write(localOutputDir,localOutputFile);
StepExecutionListener stepExecListener = new StepExecutionListener() {
@Override
public void beforeStep(StepExecution stepExecution) {
logger.info("Job {} step start {}",stepExecution.getJobExecutionId(), stepExecution.getStepName());
}
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
logger.info("Job {} step end {}",stepExecution.getJobExecutionId(), stepExecution.getStepName());
//.......ignore some code
return finalStatus;
}
};
Tasklet resultFileTasklet = new BatchFileResultTasklet(localOutputFile, httpClientService);
TaskletStep resultFileStep = stepBuilder.get("result")
.tasklet(resultFileTasklet)
.listener(stepExecListener)
.build();
//create step
Step mainStep = stepBuilder.get("run")
.<ItemProcessing, ItemProcessing>chunk(5)
.faultTolerant()
.skip(IOException.class).skip(SocketTimeoutException.class)//skip IOException here
.skipLimit(2000)
.reader(reader)
.processor(processor)
.writer(writer)
.listener(stepExecListener)
.listener(new ItemProcessorListener()) //add process listener
.listener(skipExceptionListener) //add skip exception listener
.build();
//create job
Job job = jobBuilder.get(batchReqExec.getId())
.start(mainStep)
.next(resultFileStep)
.build();
JobParametersBuilder jobParamBuilder = new JobParametersBuilder();
//run job
JobExecution execution = jobLauncher.run(job, jobParamBuilder.toJobParameters());
Reading the data:
private FlatFileItemReader<ItemProcessing> read(BatchRequestsRun batchReqRun) throws Exception {
//prepare input file
File localInputDir = new File(localStoragePath+"/batch-requests");
if(!localInputDir.exists() || localInputDir.isFile()) {
localInputDir.mkdir();
}
File localFile = new File(localInputDir, batchReqRun.getFileRef()+"-"+batchReqRun.getFile());
if(!localFile.exists()) {
httpClientService.getFileFromStorage(batchReqRun.getFileRef(), localFile);
}
FlatFileItemReader<ItemProcessing> reader = new FlatFileItemReader<ItemProcessing>();
reader.setResource(new FileSystemResource(localFile));
reader.setLineMapper(new DefaultLineMapper<ItemProcessing>() {
{
setLineTokenizer(new DelimitedLineTokenizer());
setFieldSetMapper(new FieldSetMapper<ItemProcessing>() {
@Override
public ItemProcessing mapFieldSet(FieldSet fieldSet) throws BindException {
ItemProcessing item = new ItemProcessing();
item.setFieldSet(fieldSet);
return item;
}
});
}
});
return reader;
}
Processing the data:
private ItemProcessor<ItemProcessing, ItemProcessing> process(BatchRequestsDef batchReqDef) {
ItemProcessor<ItemProcessing, ItemProcessing> processor = (input) -> {
VelocityContext context = new VelocityContext();
//.....ignore velocity code
String responseBody = null;
//send http invoking
input.setResponseBody(httpClientService.process(batchReqDef, input));
responseBody = input.getResponseBody();
logger.info(responseBody);
// using Groovy to parse response
Binding binding = new Binding();
try {
binding.setVariable("response", responseBody);
GroovyShell shell = new GroovyShell(binding);
Object result = shell.evaluate(batchReqDef.getConfig().getResponseHandler());
input.setResult(result.toString());
} catch(Exception e) {
logger.error("parse groovy script found exception:{},{}",e.getMessage(),e);
}
return input;
};
return processor;
}
I'll omit the file-writing method here.
Can anyone help me implement the process step with multiple threads?
I believe Spring Batch reads one line of data and then processes that line (executing the ItemProcessor, which invokes the remote service directly).
As we know, reading one line of data is much faster than one HTTP invocation.
So I want to read all the data (or part of it) into memory (a List) with a single thread, and then make the remote calls with multiple threads in step 2.
(This is very easy with a plain Java thread pool, but I don't know how to implement it with Spring Batch.)
Please show me some code, thanks a lot!
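One way to do this (a sketch only, not tested against the code above) is to keep the reader single-threaded and hand each item to a thread pool via AsyncItemProcessor and AsyncItemWriter from the spring-batch-integration module: the delegate processor (the HTTP call) runs on the task executor, while the writer unwraps the resulting Futures on the step thread. The pool sizes are arbitrary; reader, processor, writer and stepBuilder are the existing objects from the code above, and imports (including java.util.concurrent.Future) are omitted:
ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setCorePoolSize(10);
taskExecutor.setMaxPoolSize(10);
taskExecutor.initialize();

AsyncItemProcessor<ItemProcessing, ItemProcessing> asyncProcessor = new AsyncItemProcessor<>();
asyncProcessor.setDelegate(processor);        // the existing ItemProcessor from process(batchReqDef)
asyncProcessor.setTaskExecutor(taskExecutor); // the remote HTTP call now runs on this pool

AsyncItemWriter<ItemProcessing> asyncWriter = new AsyncItemWriter<>();
asyncWriter.setDelegate(writer);              // the existing FlatFileItemWriter still writes on one thread

Step mainStep = stepBuilder.get("run")
        .<ItemProcessing, Future<ItemProcessing>>chunk(5)
        // fault-tolerance/skip settings and listeners from the original step omitted for brevity
        .reader(reader)
        .processor(asyncProcessor)
        .writer(asyncWriter)
        .build();
The simpler alternative of adding .taskExecutor(...) directly to the chunk step makes whole chunks run concurrently, but then FlatFileItemReader is hit from multiple threads and it is not thread-safe without extra synchronization; the AsyncItemProcessor approach keeps reading and writing single-threaded and parallelizes only the processing.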
I have a use case where I need to read messages from Kafka and, for each message, extract data and query an Elasticsearch index. The response will then be used for further processing.
I am getting the error below when invoking JavaEsSpark.esJsonRDD:
java.lang.ClassCastException: org.elasticsearch.spark.rdd.EsPartition incompatible with org.apache.spark.rdd.ParallelCollectionPartition
at org.apache.spark.rdd.ParallelCollectionRDD.compute(ParallelCollectionRDD.scala:102)
My code snippet is below
public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaIntegration <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaIntegration").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true");
//Setting when using JavaEsSpark.esJsonRDD
sparkConf.set("es.nodes",<NODE URL>);
sparkConf.set("es.nodes.wan.only","true");
context = new JavaSparkContext(sparkConf);
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
//Receive Message From kafka
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc,args[0], args[1], topicMap);
JavaDStream<String> jsons = messages
.map(new Function<Tuple2<String, String>, String>() {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public String call(Tuple2<String, String> tuple2){
JavaRDD<String> esRDD = JavaEsSpark.esJsonRDD(context, <index>,<search string> ).values() ;
return null;
}
});
jsons.print();
jssc.start();
jssc.awaitTermination();
}
I am getting this error when invoking JavaEsSpark.esJsonRDD. Is this the correct way to do it? How do I successfully invoke ES from Spark?
I am running Kafka and Spark on Windows and querying an external Elasticsearch index.
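The immediate problem suggested by the stack trace is that JavaEsSpark.esJsonRDD(context, ...) is called inside map(), which runs on the executors: the driver-side JavaSparkContext cannot be used there, and an RDD cannot be created or computed inside another transformation. A common workaround is to query Elasticsearch with a plain REST client inside mapPartitions instead; the sketch below is only illustrative (it assumes Spark 2.x, the Elasticsearch low-level REST client 6.4+, and placeholder host, index name and query body):
JavaDStream<String> enriched = messages.mapPartitions(records -> {
    // one low-level Elasticsearch REST client per partition, created on the executor
    RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
    List<String> results = new ArrayList<>();
    try {
        while (records.hasNext()) {
            Tuple2<String, String> message = records.next();
            Request request = new Request("GET", "/my-index/_search");
            request.setJsonEntity("{\"query\":{\"query_string\":{\"query\":\"" + message._2() + "\"}}}");
            Response response = client.performRequest(request);
            results.add(EntityUtils.toString(response.getEntity()));
        }
    } finally {
        client.close();
    }
    return results.iterator();
});
enriched.print();
If JavaEsSpark itself is required, it has to be used on the driver (for example from within foreachRDD, which runs on the driver), not inside a record-level function such as map().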