I want to write a Kafka application that consumes from topics and saves something in a database. The topics are created by Debezium Kafka Connect from the MySQL binlog, so I have one topic per table.
This is the code I am using for consuming from one topic:
KStream<GenericRecord, mysql.company.tiers.Envelope> tierStream = builder.stream("mysql.alopeyk.tiers",
    Consumed.with(TierSerde.getGenericKeySerde(), TierSerde.getEnvelopeSerde()));
From an architectural point of view I should create a KStream for each table and run them in parallel. But the number of tables is large, and running that many threads may not be the best option.
All the tables have a column called created_at (it is a Laravel app), so I am curious whether there is a way to have a generic Serde for values that extracts this common column. Besides the name of the table, this is the only column whose value I am interested in.
It all depends on how your value is serialized by the application that produced the messages (the connector).
If the Deserializer (Serde) can extract created_at from the different types of messages, it is possible.
So the answer is yes, but it depends on your message value and your Deserializer.
Assume that all your messages, after serialization, have a format like the following:
created_at;name:position;...
created_at;city,country;...
created_at;product_name;...
In that case the Deserializer only needs to take the characters up to the first ; and convert them to a date; the rest of the value can be dropped.
Sample code:
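import java.nio.charset.StandardCharsets;
import java.util.Date;
import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Date> {
    @Override
    public Date deserialize(String topic, byte[] data) {
        // Assumes the value starts with created_at as epoch milliseconds, followed by ';'.
        String value = new String(data, StandardCharsets.UTF_8);
        return new Date(Long.parseLong(value.substring(0, value.indexOf(';'))));
    }
}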
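Since the question also needs the table name, note that deserialize receives the topic as its first argument. Below is a minimal sketch in plain Java (no Kafka dependency) of pulling both values out. It assumes the topic name ends in the table name, as in mysql.alopeyk.tiers, and that the value starts with created_at as epoch milliseconds followed by ;. The class TableEvent and its parse method are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Date;

// Hypothetical holder for the two values the question cares about.
final class TableEvent {
    final String table;
    final Date createdAt;

    TableEvent(String table, Date createdAt) {
        this.table = table;
        this.createdAt = createdAt;
    }

    // Takes the same (topic, data) pair that Deserializer.deserialize receives.
    static TableEvent parse(String topic, byte[] data) {
        // Assumption: the table name is the last dot-separated part of the topic.
        String table = topic.substring(topic.lastIndexOf('.') + 1);
        String value = new String(data, StandardCharsets.UTF_8);
        // Assumption: created_at is the first ';'-separated field, in epoch millis.
        int sep = value.indexOf(';');
        String first = sep >= 0 ? value.substring(0, sep) : value;
        return new TableEvent(table, new Date(Long.parseLong(first.trim())));
    }

    public static void main(String[] args) {
        byte[] payload = "1546300800000;gold;10".getBytes(StandardCharsets.UTF_8);
        TableEvent event = parse("mysql.alopeyk.tiers", payload);
        System.out.println(event.table + " " + event.createdAt.getTime());
    }
}
```

The same body could live inside a Deserializer<TableEvent>, so a single deserializer would serve every table topic.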
I'm currently building a streaming pipeline using the Java SDK and trying to write to a BigQuery partitioned table using BigQueryIO write/writeTableRows. I explored a couple of patterns but none of them succeeded; a few of them are below.
Using SerializableFunction to determine TableDestination
.withSchema(TableSchemaFactory.buildLineageSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) or CREATE_NEVER
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
and then calling this function inside the .to() method
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s", project, dataset, table);
    return new TableDestination(dest, null, timePartitioning);
}
I also tried to format the partition column obtained from the input and append it to the table location string with the $ decorator, like below:
@Override
public TableDestination apply(ValueInSingleWindow<TableRow> input) {
    Object processingDate = input.getValue().get("processingDate");
    // ... convert processingDate to a string in MMddYYYY format, giving convertedDate (elided)
    TimePartitioning timePartitioning = new TimePartitioning();
    timePartitioning.setField("processingdate");
    String dest = String.format("%s.%s.%s$%s", project, dataset, table, convertedDate);
    return new TableDestination(dest, null, timePartitioning);
}
However, none of them succeeded; each attempt failed with one of the following errors:
invalid timestamp
timestamp field value out of range
You can only stream to partitions within 0 days in the past and 0 days in the future relative to the current date.
The destination table's partition is not supported for streaming. You can only stream to meta-table of date partitioned tables.
Streaming to metadata partition of column based partitioning table is disallowed.
I can't seem to get the right combination. Has anyone encountered the same issue before? Can anyone point me in the right direction or give me some pointers? What I want to achieve is to load the streaming data based on the date column I defined, not on processing time.
Thank you!
I expect most of these issues will be solved if you drop the partition decorator from dest. In most cases the BigQuery APIs for loading data will be able to figure out the right partition based on the messages themselves.
So try changing your definition of dest to:
String dest = String.format("%s.%s.%s", project, dataset, table);
Today I found a very strange thing in a Kafka Streams state store. I googled a lot but could not find the reason for this behavior.
Consider the state store below, written in Java:
private KeyValueStore<String, GenericRecord> userIdToUserRecord;
Two processors use this state store.
topology.addStateStore(userIdToUserRecord, ALERT_PROCESSOR_NAME, USER_SETTING_PROCESSOR_NAME);
USER_SETTING_PROCESSOR_NAME puts the data into the state store:
userIdToUserRecord.put("user-12345", record);
ALERT_PROCESSOR_NAME gets the data from the state store:
userIdToUserRecord.get("user-12345");
Adding source to UserSettingProcessor
String userSettingTopicName = "user-setting-topic";
topology.addSource(sourceName, userSettingTopicName)
.addProcessor(processorName, UserSettingProcessor::new, sourceName);
Adding source to AlertEngineProcessor
String alertTopicName = "alert-topic";
topology.addSource(sourceName, alertTopicName)
.addProcessor(processorName, AlertEngineProcessor::new, sourceName);
Case 1:
Produce the records with a Kafka producer in Java.
First, I produce a record to user-setting-topic from Java; this adds the user record to the state store.
Second, I produce a record to alert-topic from Java; the processor reads the record from the state store by user id with userIdToUserRecord.get("user-12345");
This works fine. I am using a Kafka Avro producer to produce records to both topics.
Case 2:
First, I produce a record to user-setting-topic from Python; this adds the user record to the state store with userIdToUserRecord.put("user-100", GenericRecord);
Second, I produce a record to alert-topic from Java; the processor reads the record from the state store by user id with userIdToUserRecord.get("user-100");
The strange part happens here: userIdToUserRecord.get("user-100") returns null.
I also checked the scenario like this:
I produced a record to user-setting-topic from Python, which triggered the process method of UserSettingProcessor. In debug mode I called userIdToUserRecord.get("user-100") there and it worked: inside UserSettingProcessor I am able to read the data from the state store.
Then I produced a record to alert-topic from Java and called userIdToUserRecord.get("user-100"), which returned null.
I don't understand this strange behavior. Can anyone explain it?
Python code:
value_schema = avro.load('user-setting.avsc')
value = {
"user-id":"user-12345",
"client_id":"5cfdd3db-b25a-4e21-a67d-462697096e20",
"alert_type":"WORK_ORDER_VOLUME"
}
print("------------------------Kafka Producer------------------------------")
avroProducer = AvroProducer(
{'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://localhost:8089'},
default_value_schema=value_schema)
avroProducer.produce(topic="user-setting-topic", value=value)
print("------------------------Sucess Producer------------------------------")
avroProducer.flush()
Java Code:
Schema schema = new Schema.Parser().parse(schemaString);
GenericData.Record record = new GenericData.Record(schema);
record.put("alert_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
record.put("alert_created_at",123449437L);
record.put("alert_type","WORK_ORDER_VOLUME");
record.put("client_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
//record.put("property_key","property_key-"+i);
record.put("alert_data","{\"alert_trigger_info\":{\"jll_value\":1.4,\"jll_category\":\"internal\",\"name\":\"trade_Value\",\"current_value\":40,\"calculated_value\":40.1},\"work_order\":{\"locations\":{\"country_name\":\"value\",\"state_province\":\"value\",\"city\":\"value\"},\"property\":{\"name\":\"property name\"}}}");
return record;
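The problem is that the Java producer and the Python producer (which is based on the C producer) use different default hash functions for data partitioning. A record written from Python can therefore land in a different partition than the one the Java producer picks for the same key, and so it is handled by a different task whose state store shard does not contain that key. You will need to provide a custom partitioner to one (or both) of them to make sure they use the same partitioning strategy.
Unfortunately, the Kafka protocol does not specify what the default partitioning hash function should be, and thus clients can use whatever they want by default.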
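To make the mismatch concrete, here is a small plain-Java sketch that assigns the same keys to partitions with two different hash functions and counts how often they disagree. It assumes the Java producer's default is murmur2-based and the C-based producer's default is CRC32-based; check this against your client versions. The murmur2 routine is my own transcription of the 32-bit algorithm, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class PartitionerMismatch {
    // 32-bit murmur2 with the seed the Java client is assumed to use.
    static int murmur2(byte[] data) {
        final int m = 0x5bd1e995;
        final int r = 24;
        int length = data.length;
        int h = 0x9747b28c ^ length;
        for (int i = 0; i + 4 <= length; i += 4) {
            int k = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
        }
        // Mix in the 1 to 3 trailing bytes, if any.
        int tail = length & ~3;
        int rem = length & 3;
        if (rem == 3) h ^= (data[tail + 2] & 0xff) << 16;
        if (rem >= 2) h ^= (data[tail + 1] & 0xff) << 8;
        if (rem >= 1) {
            h ^= data[tail] & 0xff;
            h *= m;
        }
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Java-style assignment: positive murmur2 hash modulo the partition count.
    static int murmur2Partition(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        return (murmur2(bytes) & 0x7fffffff) % numPartitions;
    }

    // Assumed C-client-style assignment: CRC32 of the key modulo the partition count.
    static int crc32Partition(String key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }

    // Counts keys "user-0" .. "user-(keys-1)" that the two functions place differently.
    static int countMismatches(int keys, int numPartitions) {
        int mismatches = 0;
        for (int i = 0; i < keys; i++) {
            String key = "user-" + i;
            if (murmur2Partition(key, numPartitions) != crc32Partition(key, numPartitions)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("keys placed differently: " + countMismatches(1000, 6));
    }
}
```

On the Python side, the underlying C client documents a partitioner setting, and its murmur2_random value is described as compatible with the Java producer's default. Setting it there is probably simpler than writing a custom partitioner, but verify it against the version you run.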
I have records that are processed with Kafka Streams (using Processor API). Let's say the record has city_id and some other fields.
In Kafka Streams app I want to add current temperature in the target city to the record.
Temperature<->City pairs are stored in eg. Postgres.
In Java application I'm able to connect to Postgres using JDBC and build new HashMap<CityId, Temperature> so I'm able to lookup temperature based on city_id. Something like tempHM.get(record.city_id).
I have several questions about how best to approach this:
Where to initialize the context data?
Originally, I did it within AbstractProcessor::init(), but that seems wrong because it runs for each thread and runs again on every rebalance.
So I moved it before the topology is built, and the processors are constructed with the resulting map. The data is fetched only once, independently of the number of processor instances.
Is that a proper and valid approach? It works, but...
HashMap<CityId, Temperature> tempHM = new HashMap<>();
// Connect to DB and initialize tempHM here
Topology topology = new Topology();
topology
    .addSource(SOURCE, stringDeserializer, protoDeserializer, "topic-in")
    .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(tempHM), SOURCE)
    .addSink(SINK, "topic-out", stringSerializer, protoSerializer, TemperatureAppender.NAME);
How to refresh the context data?
I would like to refresh the temperature data every 15 minutes, for example. I was thinking of passing a container that wraps the HashMap and handles the refresh, instead of passing the plain HashMap:
abstract class ContextContainer<T> {
T context;
Date lastRefreshAt;
ContextContainer(Date now) {
refresh(now);
}
abstract void refresh(Date now);
abstract Duration getRefreshInterval();
T get() {
return context;
}
boolean isDueToRefresh(Date now) {
return lastRefreshAt == null
|| lastRefreshAt.getTime() + getRefreshInterval().toMillis() < now.getTime();
}
}
final class CityTemperatureContextContainer extends ContextContainer<HashMap> {
CityTemperatureContextContainer(Date now) {
super(now);
}
void refresh(Date now) {
if (!isDueToRefresh(now)) {
return;
}
HashMap context = new HashMap();
// Connect to DB and get data and fill hashmap
lastRefreshAt = now;
this.context = context;
}
Duration getRefreshInterval() {
return Duration.ofMinutes(15);
}
}
This is a brief concept written in the SO textarea; it might contain some syntax errors, but I hope the point is clear.
I would then pass it into the processor like .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(cityTemperatureContextContainer), SOURCE)
And in the processor do:
public void init(final ProcessorContext context) {
context.schedule(
Duration.ofMinutes(1),
PunctuationType.STREAM_TIME,
(timestamp) -> {
cityTemperatureContextContainer.refresh(new Date(timestamp));
tempHm = cityTemperatureContextContainer.get();
}
);
super.init(context);
}
Is there a better way? The main question is about finding the proper concept; I'm able to implement it after that. There are not many resources on the topic out there, though.
In Kafka Streams app I want to add current temperature in the target city to the record. Temperature<->City pairs are stored in eg. Postgres.
In Java application I'm able to connect to Postgres using JDBC and build new HashMap<CityId, Temperature> so I'm able to lookup temperature based on city_id. Something like tempHM.get(record.city_id).
A better alternative would be to use Kafka Connect to ingest your data from Postgres into a Kafka topic, read this topic into a KTable in your application with Kafka Streams, and then join this KTable with your other stream (the stream of records "with city_id and some other fields"). That is, you will be doing a KStream-to-KTable join.
Think:
### Architecture view
DB (here: Postgres) --Kafka Connect--> Kafka --> Kafka Streams Application
### Data view
Postgres Table ----------------------> Topic --> KTable
Example connectors for your use case are https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc and https://www.confluent.io/hub/debezium/debezium-connector-postgresql.
One of the advantages of the Kafka Connect based setup above is that you no longer need to talk directly from your Java application (which uses Kafka Streams) to your Postgres DB.
Another advantage is that you don't need to do "batch refreshes" of your context data (you mentioned every 15 minutes) from your DB into your Java application, because the application would get the latest DB changes in real-time automatically via the DB->KConnect->Kafka->KStreams-app flow.
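To illustrate the semantics of that join without a running cluster, here is a plain-Java simulation. It is only a sketch: a Map stands in for the KTable, method calls stand in for records arriving on the two topics, and all names are hypothetical. It is not Kafka Streams code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StreamTableJoinSketch {
    // Stands in for the KTable: the latest temperature per city_id.
    private final Map<String, Double> temperatureByCity = new HashMap<>();
    // Stands in for the output topic.
    private final List<String> output = new ArrayList<>();

    // A change captured from the database arrives on the table topic.
    void onTableUpdate(String cityId, double temperature) {
        temperatureByCity.put(cityId, temperature);
    }

    // A record arrives on the stream topic and is enriched with the
    // table value that is current at that moment.
    void onStreamRecord(String recordId, String cityId) {
        Double temperature = temperatureByCity.get(cityId);
        if (temperature != null) { // inner join: records without a match are dropped
            output.add(recordId + "@" + cityId + "=" + temperature);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onTableUpdate("prague", 3.5);
        join.onStreamRecord("r1", "prague");
        join.onTableUpdate("prague", 4.0); // a later DB change, no batch refresh needed
        join.onStreamRecord("r2", "prague");
        join.onStreamRecord("r3", "oslo"); // no temperature known for this city
        System.out.println(join.output());
    }
}
```

Each stream record sees whatever the table holds when it is processed, which is what replaces the 15-minute refresh from the question.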
I was evaluating Kafka Streams to see whether it fits my use case: I need to aggregate sensor data over 15-minute, hourly, and daily intervals, and its windowing feature looks useful for that.
I can create windows by applying windowedBy() on a KGroupedStream, but the windows are aligned to UTC. I want my data to be grouped by its originating time zone rather than by UTC, because the UTC alignment distorts the aggregation. Can anyone help me with this?
You can "shift" the timestamps using a custom TimestampExtractor. Before you write the result back into the output topic, you can use a Transformer to "shift" the timestamps back via context.forward(key, value, To.all().withTimestamp(...)).
Feature request ticket: https://issues.apache.org/jira/browse/KAFKA-7911
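The arithmetic behind the "shift" can be sketched in plain Java. This assumes windows are aligned to the epoch (that is, to UTC midnight for daily windows) and uses the zone offset that applies at the record's own timestamp; days with a DST change would need extra care. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

final class LocalWindowSketch {
    // Start of the epoch-aligned window of the given size that contains timestampMs.
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    // Shifts a timestamp by the zone's offset so that local midnight lines up
    // with UTC midnight. This is what the custom extractor would return.
    static long shift(long timestampMs, ZoneId zone) {
        int offsetSeconds = zone.getRules()
                .getOffset(Instant.ofEpochMilli(timestampMs))
                .getTotalSeconds();
        return timestampMs + offsetSeconds * 1000L;
    }

    // Window start expressed as a real instant again: shift, align, shift back.
    static long localWindowStart(long timestampMs, long sizeMs, ZoneId zone) {
        long shifted = shift(timestampMs, zone);
        long offsetMs = shifted - timestampMs;
        return windowStart(shifted, sizeMs) - offsetMs;
    }

    public static void main(String[] args) {
        long day = Duration.ofDays(1).toMillis();
        long ts = Instant.parse("2019-01-01T20:00:00Z").toEpochMilli();
        ZoneId ist = ZoneId.of("Asia/Kolkata");
        System.out.println("UTC window start:   " + Instant.ofEpochMilli(windowStart(ts, day)));
        System.out.println("local window start: " + Instant.ofEpochMilli(localWindowStart(ts, day, ist)));
    }
}
```

For a record at 20:00 UTC on 1 January, the UTC-aligned daily window starts at 00:00 UTC that day, while the IST-aligned one starts at 18:30 UTC, which is midnight on 2 January in IST.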
To solve this issue, I created a custom TimestampExtractor and used it to base the window creation time on the record time from the payload, as shown below.
public class RecordTimeStampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // Parse the JSON payload and use its slot field as the record timestamp.
        JsonObject data = (JsonObject) new JsonParser().parse(record.value().toString());
        // Timestamp.valueOf interprets the string in the JVM's default time zone.
        Timestamp recordTimestamp = Timestamp.valueOf(data.get(Constant.SLOT).getAsString());
        return recordTimestamp.getTime();
    }
}
I have been testing it since yesterday with my local time zone, IST (UTC+05:30), and it is working fine: Kafka Streams creates the windows based on the records' timestamps. I will test with other time zones as well and update the answer.
Conceptually
Say I have a topic called "addresses" where the key is a person's name (String) and the value is a person's address (String). An update to the person's address would be a message consisting of their name as the key and a new address as the value. So, if I only want the most recent value for any one key, I suppose I would make a ktable. When I do that, what's actually going on here? Is Kafka creating a new topic which is actually the ktable and truncating old values? Or do I have to create a new topic for the current addresses? Or is it something else entirely?
Practically
All the examples and tutorials I'm finding are using deprecated methods, so I'm hoping for something newer. My current solution has been to read the topic like this:
final Properties config = new Properties();
//leaving out all the config.put() for readability
final Consumer<String, String> consumer = new KafkaConsumer<>(config);
consumer.subscribe(Collections.singletonList("addresses"));
try {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(200);
for (ConsumerRecord<String, String> record : records) {
//do stuff
}
}
} finally {
consumer.close();
}
This worked for a while, but now I would like a stateful solution. Is there a simple way to just stick the information into a ktable as it comes in? I don't want to filter it or anything, I just want any entry to update the state. Thanks in advance.
The KTable does not create a new topic. Instead, it treats messages on the source topic like an upsert in a database - if it's the first time the key is seen, it's like inserting a new record. If it's not the first time, it's treated like an update to an existing record. The KTable can be used as inputs to other parts of Kafka Streams, and you could save that state in a local store as well and query that K/V store in other parts of your application.
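The upsert behavior can be sketched in plain Java by replaying the records of the addresses topic into a map. This only illustrates the semantics, with hypothetical names and made-up sample records. In Kafka Streams itself the starting point would be StreamsBuilder#table on the addresses topic rather than a hand-written consumer loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UpsertSketch {
    // Replays {key, value} records in arrival order and keeps the latest value per key.
    static Map<String, String> replay(String[][] records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : records) {
            // First time the key is seen: insert. Otherwise: update in place.
            table.put(record[0], record[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"alice", "1 Main St"},
            {"bob", "2 Oak Ave"},
            {"alice", "9 Elm Rd"}, // alice moved: overwrites her earlier address
        };
        System.out.println(replay(records));
    }
}
```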