Conceptually
Say I have a topic called "addresses" where the key is a person's name (String) and the value is a person's address (String). An update to the person's address would be a message consisting of their name as the key and a new address as the value. So, if I only want the most recent value for any one key, I suppose I would make a KTable. When I do that, what's actually going on here? Is Kafka creating a new topic which is actually the KTable and truncating old values? Or do I have to create a new topic for the current addresses? Or is it something else entirely?
Practically
All the examples and tutorials I'm finding are using deprecated methods, so I'm hoping for something newer. My current solution has been to read the topic like this:
final Properties config = new Properties();
//leaving out all the config.put() for readability
final Consumer<String, String> consumer = new KafkaConsumer<>(config);
consumer.subscribe(Collections.singletonList("addresses"));
try {
    while (true) {
        // Duration overload replaces the deprecated poll(long)
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
        for (ConsumerRecord<String, String> record : records) {
            //do stuff
        }
    }
} finally {
    consumer.close();
}
This worked for a while, but now I would like a stateful solution. Is there a simple way to just stick the information into a KTable as it comes in? I don't want to filter it or anything, I just want every entry to update the state. Thanks in advance.
The KTable does not create a new topic. Instead, it treats messages on the source topic like an upsert in a database: if it's the first time the key is seen, it's like inserting a new record; if not, it's treated like an update to an existing record. The KTable can be used as input to other parts of Kafka Streams, and you can also materialize that state in a local store and query that K/V store from other parts of your application.
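A hedged sketch of that flow with the current Streams DSL (the store name "addresses-store", the bootstrap server, and the key "Alice" are placeholders, not from the original question): materialize the topic as a KTable backed by a local key/value store, then query the store for the latest value per key.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class AddressTableApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "addresses-table-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Every record on "addresses" upserts the latest address for that name
        KTable<String, String> addresses = builder.table(
                "addresses",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("addresses-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Once the instance reaches RUNNING state, the current value per key
        // can be read back from the local store
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("addresses-store",
                        QueryableStoreTypes.keyValueStore()));
        String latestAddress = store.get("Alice");
    }
}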
Related
I have a state store which is defined like below:
StoreBuilder<KeyValueStore<String, DataDocument>> indexStore = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("Data"), Serdes.String(), DataDocumentSerDes.dataDocumentSerDes())
.withLoggingEnabled(changelogConfig);
I have a processor defined which processes the source topic and writes to this state store.
I have a use case to get a list of keys based on a field of the DataDocument, which is the value type in the state store. Is there a way I can achieve this?
One way I can think of is to get all the keys and then filter, but that is expensive since it always iterates over every key:
ReadOnlyKeyValueStore<String, DataDocument> roStore = streams.store(
        StoreQueryParameters.fromNameAndType("Data", QueryableStoreTypes.<String, DataDocument>keyValueStore()));
// the iterator over a persistent store must be closed, hence try-with-resources
try (KeyValueIterator<String, DataDocument> kvItr = roStore.all()) {
    while (kvItr.hasNext()) {
        KeyValue<String, DataDocument> kv = kvItr.next();
        if (kv.value.isField()) {
            // store kv.key to a list
        }
    }
}
Another approach is to create a state store for every field that has to be queryable, with the field as the key and a list of record keys as the value (see the sketch below), but that is not scalable.
Is there a better way I can achieve this with a Kafka topology?
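To make that second approach concrete, here is a rough sketch only: the processor that writes to the "Data" store would also maintain a second store keyed by the field value. The store name "DataByField", the DataDocument.getField() accessor, and a Serde capable of (de)serializing List<String> are all assumptions, not from the original code.

// Fields inside the processor; the secondary store is added to the topology
// alongside the existing "Data" store.
private KeyValueStore<String, DataDocument> dataStore;      // the existing "Data" store
private KeyValueStore<String, List<String>> indexByField;   // field value -> keys holding it

void upsert(String key, DataDocument value) {
    dataStore.put(key, value);

    String field = value.getField();                        // assumed accessor
    List<String> keys = indexByField.get(field);
    if (keys == null) {
        keys = new ArrayList<>();
    }
    if (!keys.contains(key)) {
        keys.add(key);
        indexByField.put(field, keys);
    }
}

A lookup by field then becomes a single indexByField.get(field) instead of a full scan, at the cost of keeping the two stores in sync.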
Today I found a very strange thing with a Kafka state store. I googled a lot but could not find the reason for this behavior.
Consider the state store below, written in Java:
private KeyValueStore<String, GenericRecord> userIdToUserRecord;
There are two processors using this state store:
topology.addStateStore(userIdToUserRecord, ALERT_PROCESSOR_NAME, USER_SETTING_PROCESSOR_NAME)
USER_SETTING_PROCESSOR_NAME puts data into the state store:
userIdToUserRecord.put("user-12345", record);
ALERT_PROCESSOR_NAME gets data from the state store:
userIdToUserRecord.get("user-12345");
Adding source to UserSettingProcessor
userSettingTopicName = user-setting-topic;
topology.addSource(sourceName, userSettingTopicName)
.addProcessor(processorName, UserSettingProcessor::new, sourceName);
Adding source to AlertEngineProcessor
alertTopicName = alert-topic;
topology.addSource(sourceName, alertTopicName)
.addProcessor(processorName, AlertEngineProcessor::new, sourceName);
Case 1:
Produce records using the Kafka producer in Java.
First, produce a record to topic user-setting-topic using Java; this adds the user record to the state store.
Second, produce a record to topic alert-topic using Java; this reads the record from the state store by user id with userIdToUserRecord.get("user-12345");
This worked fine. I am using the Kafka Avro producer to write records to both topics.
Case 2:
First, produce a record to topic user-setting-topic using Python; this adds the user record to the state store with userIdToUserRecord.put("user-100", record);
Second, produce a record to topic alert-topic using Java; this tries to read the record from the state store by user id with userIdToUserRecord.get("user-100");
The strange thing happens here: userIdToUserRecord.get("user-100") returns null.
I also checked the scenario like this:
I produce a record to user-setting-topic using Python, and the UserSettingProcessor process method is triggered. I checked in debug mode and tried to get the user record from the state store with userIdToUserRecord.get("user-100"); it worked fine, so inside UserSettingProcessor I am able to get the data from the state store.
Then I produce a record to alert-topic using Java and try userIdToUserRecord.get("user-100") again; it returns null.
I don't understand this strange behavior. Can anyone explain it?
Python code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.load('user-setting.avsc')
value = {
    "user-id": "user-12345",
    "client_id": "5cfdd3db-b25a-4e21-a67d-462697096e20",
    "alert_type": "WORK_ORDER_VOLUME"
}
print("------------------------Kafka Producer------------------------------")
avroProducer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://localhost:8089'},
    default_value_schema=value_schema)
avroProducer.produce(topic="user-setting-topic", value=value)
print("------------------------Success Producer------------------------------")
avroProducer.flush()
Java Code:
Schema schema = new Schema.Parser().parse(schemaString);
GenericData.Record record = new GenericData.Record(schema);
record.put("alert_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
record.put("alert_created_at",123449437L);
record.put("alert_type","WORK_ORDER_VOLUME");
record.put("client_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
//record.put("property_key","property_key-"+i);
record.put("alert_data","{\"alert_trigger_info\":{\"jll_value\":1.4,\"jll_category\":\"internal\",\"name\":\"trade_Value\",\"current_value\":40,\"calculated_value\":40.1},\"work_order\":{\"locations\":{\"country_name\":\"value\",\"state_province\":\"value\",\"city\":\"value\"},\"property\":{\"name\":\"property name\"}}}");
return record;
The problem is that the Java producer and the Python producer (which is based on the C client, librdkafka) use different default hash functions for partitioning data. You will need to provide a custom partitioner to one (or both) of them to make sure they use the same partitioning strategy.
Unfortunately, the Kafka protocol does not specify what the default partitioning hash function should be, so clients can use whatever they want by default.
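As an illustration only (not part of the original answer): on the Java side a partitioning strategy can be made explicit by registering a class via partitioner.class, and the same strategy then has to be mirrored in the Python client (librdkafka exposes a partitioner setting, e.g. the Java-compatible murmur2_random). A minimal sketch of such a Java partitioner, reproducing the murmur2 hash the Java client already uses by default:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class Murmur2Partitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // murmur2 over the serialized key, like the Java default partitioner
        // (null keys would need separate handling)
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

It would be registered on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, Murmur2Partitioner.class.getName()); whichever strategy is chosen, both clients must be configured to use the same one.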
I want to write a Kafka application that consumes from topics and saves something to a database. The topics are created by Debezium Kafka Connect based on the MySQL binlog, so I have one topic per table.
This is the code I am using for consuming from one topic:
KStream<GenericRecord,mysql.company.tiers.Envelope>[] tierStream = builder.stream("mysql.alopeyk.tiers",
Consumed.with(TierSerde.getGenericKeySerde(), TierSerde.getEnvelopeSerde()));
From an architectural point of view, I should create a KStream for each table and run them in parallel, but the number of tables is very big, so having that many threads may not be the best option.
All the tables have a column called created_at (it is a Laravel app), so I am curious whether there is a way to have a generic Serde for the values that extracts this common column. It is the only column whose value I am interested in, besides the name of the table.
It all depends on how your values were serialized by the application that produced the messages (the connector).
If the Deserializer (Serde) can extract created_at from the different types of messages, it is possible.
So the answer is yes, but it depends on your message values and your Deserializer.
Assuming all your messages are serialized in a format like the following:
create_at;name:position;...
create_at;city,country;...
create_at;product_name;...
In that case, the Deserializer only needs to take the characters up to the first ;, convert them to a date, and drop the rest of the value.
Sample code:
import java.util.Date;
import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Date> {

    @Override
    public Date deserialize(String topic, byte[] data) {
        // the value starts with the epoch-millis created_at, up to the first ';'
        String strDate = new String(data);
        return new Date(Long.parseLong(strDate.substring(0, strDate.indexOf(";"))));
    }
}
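One possible way to wire this in, sketched against a plain consumer (the group id, bootstrap server, topic pattern, and the String key deserializer are placeholders; using it inside Kafka Streams would additionally need a matching Serializer to build a full Serde):

import java.time.Duration;
import java.util.Date;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CreatedAtConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "created-at-reader");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, CustomDeserializer.class.getName());

        try (KafkaConsumer<String, Date> consumer = new KafkaConsumer<>(props)) {
            // one pattern subscription covers every per-table Debezium topic (prefix is illustrative)
            consumer.subscribe(Pattern.compile("mysql\\.alopeyk\\..*"));
            while (true) {
                ConsumerRecords<String, Date> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, Date> record : records) {
                    Date createdAt = record.value();   // only created_at survives deserialization
                    // record.topic() still tells you which table the row came from
                }
            }
        }
    }
}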
I'm new to DynamoDB and I'm struggling to work out how to do this (using the Java SDK).
I currently have a table (in Mongo) for notifications. The schema is basically as follows (I've simplified it):
id: string
notifiedUsers: [123, 345, 456, 567]
message: "this is a message"
created: 12345678000 (epoch millis)
I wanted to migrate to DynamoDB, but I can't work out the best way to select all notifications that went to a particular user after a certain date.
I gather I can't have an index on a list like notifiedUsers, therefore I can't use a query in this case - is that correct?
I'd prefer not to scan and then filter, there could be a lot of records.
Is there a way to do this using a query or another approach?
EDIT
This is what I'm trying now, it's not working and I'm not sure where to take it (if anywhere).
Condition rangeKeyCondition = new Condition()
        .withComparisonOperator(ComparisonOperator.CONTAINS.toString())
        .withAttributeValueList(new AttributeValue().withS(userId));

if (startTimestamp != null) {
    rangeKeyCondition = rangeKeyCondition
            .withComparisonOperator(ComparisonOperator.GT.toString())
            .withAttributeValueList(new AttributeValue().withS(startTimestamp));
}

NotificationFeedDynamoRecord replyKey = new NotificationFeedDynamoRecord();
replyKey.setId(partitionKey);

DynamoDBQueryExpression<NotificationFeedDynamoRecord> queryExpression =
        new DynamoDBQueryExpression<NotificationFeedDynamoRecord>()
                .withHashKeyValues(replyKey)
                .withRangeKeyCondition(NOTIFICATIONS, rangeKeyCondition);
In case anyone else comes across this question: in the end we flattened the schema, so that there is now a record per userId. This has led to problems, because with DynamoDB it is not possible to atomically batch-write records. With the original schema we had one record and could write it atomically, ensuring that all users got that notification. Now we cannot be certain, and this is causing pain.
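For reference, a hedged sketch of what the query could look like against the flattened schema, assuming a table named "notifications" with partition key userId and sort key created, and an AmazonDynamoDB client called dynamoDbClient (all of these names are illustrative, not from the original post), using the same v1 SDK as the code above:

// query all notifications for one user created after a given timestamp
Map<String, AttributeValue> values = new HashMap<>();
values.put(":uid", new AttributeValue().withS(userId));
values.put(":since", new AttributeValue().withN(String.valueOf(startTimestamp)));

QueryRequest query = new QueryRequest()
        .withTableName("notifications")
        .withKeyConditionExpression("userId = :uid AND created > :since")
        .withExpressionAttributeValues(values);

QueryResult result = dynamoDbClient.query(query);
List<Map<String, AttributeValue>> items = result.getItems();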
Sample data stored in the HashMap is as follows,
e.g.
{"client1-data1":data1,"client2-data2":data2,"client3-data3":data3,"client4-data4":data4,"client1-data2":data2,"client2-data1":data1,"client3-data4":data4,"client4-data3":data3}
Every data value can be repeated across clients, but the key will be unique: a value may repeat (e.g. data2 appears for both client1 and client2), while a combination key like client1-data1 occurs only once.
The issue in handling multiple clients:
For every user, a different thread is created when making a connection, so every thread creates a PrintWriter object which gets added to an ArrayList:
List<PrintWriter> writers = new ArrayList<>();
Is there any way I can store the client_id along with the PrintWriter object in the same collection, and send data to clients by filtering, with that client id, the HashMap shown in the example above?
Please do reply/suggest.
thanks,
Praveen T
You can use a collection which stores a pair of elements. A roundabout way of doing this is to use a Map:
Map<PrintWriter, String> writeAndClientId = new HashMap<>();
You can loop over these using
for(Map.Entry<PrintWriter, String> entry: writeAndClientId.entrySet())
If you want to be able to look up the writer for one clientId at a time, you can invert the map:
Map<String, PrintWriter> clientIdToWriter = new HashMap<>();
PrintWriter pw = clientIdToWriter.get(clientId);
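A short usage sketch of that second map (the method and variable names are just illustrative): register each client's writer when its connection thread starts, then route data to the matching client by id, using the client part of the keys from the HashMap in the question.

Map<String, PrintWriter> clientIdToWriter = new ConcurrentHashMap<>();

// called from each connection thread once the client id is known
void register(String clientId, PrintWriter writer) {
    clientIdToWriter.put(clientId, writer);
}

// send one piece of data to exactly one client
void sendTo(String clientId, String data) {
    PrintWriter writer = clientIdToWriter.get(clientId);
    if (writer != null) {
        writer.println(data);
        writer.flush();
    }
}

A ConcurrentHashMap is used here because writers are registered and looked up from different client threads.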
PrintWriters and clients are unique, so use a Set (no duplicate ids) or a HashMap and manage something like this:
final HashMap<Writer, HashMap<Integer, List<String>>> writers = new HashMap<>();
Let us know if I've misunderstood something.