I was evaluating Kafka Streams to see whether it fits my use case: I need to aggregate sensor data every 15 minutes, hourly, and daily, and the windowing feature looked like a good fit.
I can create windows by applying windowedBy() on a KGroupedStream, but the problem is that the windows are created in UTC. I want my data to be grouped by its originating timezone, not by UTC, because the UTC window boundaries skew the aggregation. Can anyone help me with this?
You can "shift" the timestamps using a custom TimestampExtractor -- before you write the result back into the output topic, you can use a Transformer and "shift" the timestamps back via context.forward(key, value, To.all().withTimestamps()).
Feature request ticket: https://issues.apache.org/jira/browse/KAFKA-7911
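For illustration, here is a minimal sketch of the "shift back" step (the class name and the offset handling are assumptions, not part of the original answer): a Transformer that re-forwards every record with its timestamp shifted before the result is written to the output topic.
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;

public class ShiftTimestampBackTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private final long offsetMs; // e.g. -(5h 30min in millis) to undo an IST shift
    private ProcessorContext context;

    public ShiftTimestampBackTransformer(long offsetMs) {
        this.offsetMs = offsetMs;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        // forward the record with the shifted timestamp instead of returning it
        context.forward(key, value, To.all().withTimestamp(context.timestamp() + offsetMs));
        return null;
    }

    @Override
    public void close() { }
}
It would be attached right before the final .to(...), e.g. stream.transform(() -> new ShiftTimestampBackTransformer<>(-Duration.ofHours(5).plusMinutes(30).toMillis())).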
To solve this issue I created a custom TimestampExtractor and used it to base the stream's window assignment on the record time taken from the payload, as shown below.
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.sql.Timestamp;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class RecordTimeStampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // parse the payload and use its slot timestamp as the record time
        JsonObject data = (JsonObject) new JsonParser().parse(record.value().toString());
        Timestamp recordTimestamp = Timestamp.valueOf(data.get(Constant.SLOT).getAsString());
        return recordTimestamp.getTime();
    }
}
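To make Kafka Streams actually use this extractor, it is registered as the default timestamp extractor in the streams configuration; a hedged snippet (application id and bootstrap servers are placeholders):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-aggregation-app");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // placeholder
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        RecordTimeStampExtractor.class.getName());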
I have now tested it with my local timezone (IST, UTC+05:30) since yesterday and it is working fine; Kafka Streams creates the windows based on the record timestamps. I will test with other timezones as well and update the answer.
Today I found a very strange thing in a Kafka state store. I googled a lot but didn't find the reason for this behavior.
Consider the state store below, written in Java:
private KeyValueStore<String, GenericRecord> userIdToUserRecord;
There are two processors that use this state store.
topology.addStateStore(userIdToUserRecord, ALERT_PROCESSOR_NAME, USER_SETTING_PROCESSOR_NAME)
USER_SETTING_PROCESSOR_NAME puts the data into the state store:
userIdToUserRecord.put("user-12345", record);
ALERT_PROCESSOR_NAME gets the data from the state store:
userIdToUserRecord.get("user-12345");
Adding the source for the UserSettingProcessor:
String userSettingTopicName = "user-setting-topic";
topology.addSource(sourceName, userSettingTopicName)
        .addProcessor(processorName, UserSettingProcessor::new, sourceName);
Adding the source for the AlertEngineProcessor:
String alertTopicName = "alert-topic";
topology.addSource(sourceName, alertTopicName)
        .addProcessor(processorName, AlertEngineProcessor::new, sourceName);
Case 1:
Produce records using the Kafka producer in Java:
First, produce a record to topic user-setting-topic using Java; it adds the user record to the state store.
Second, produce a record to topic alert-topic using Java; it takes the record from the state store by user id with userIdToUserRecord.get("user-12345").
This works fine. I am using KafkaAvroProducer to produce records to both topics.
Case 2:
First, produce a record to topic user-setting-topic using Python; it adds the user record to the state store with userIdToUserRecord.put("user-100", record).
Second, produce a record to topic alert-topic using Java; it tries to take the record from the state store by user id with userIdToUserRecord.get("user-100").
The strange thing happens here: userIdToUserRecord.get("user-100") returns null.
I also checked the scenario like this:
I produce a record to user-setting-topic using Python, the UserSettingProcessor's process method is triggered, and when I check in debug mode and try to get the user record from the state store with userIdToUserRecord.get("user-100"), it works fine; inside UserSettingProcessor I am able to get the data from the state store.
Then I produce a record to alert-topic using Java and try userIdToUserRecord.get("user-100") there; it returns null.
I don't understand this strange behavior. Can anyone explain it?
Python code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

value_schema = avro.load('user-setting.avsc')
value = {
    "user-id": "user-12345",
    "client_id": "5cfdd3db-b25a-4e21-a67d-462697096e20",
    "alert_type": "WORK_ORDER_VOLUME"
}
print("------------------------Kafka Producer------------------------------")
avroProducer = AvroProducer(
    {'bootstrap.servers': 'localhost:9092', 'schema.registry.url': 'http://localhost:8089'},
    default_value_schema=value_schema)
avroProducer.produce(topic="user-setting-topic", value=value)
print("------------------------Success Producer------------------------------")
avroProducer.flush()
Java Code:
Schema schema = new Schema.Parser().parse(schemaString);
GenericData.Record record = new GenericData.Record(schema);
record.put("alert_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
record.put("alert_created_at",123449437L);
record.put("alert_type","WORK_ORDER_VOLUME");
record.put("client_id","5cfdd3db-b25a-4e21-a67d-462697096e20");
//record.put("property_key","property_key-"+i);
record.put("alert_data","{\"alert_trigger_info\":{\"jll_value\":1.4,\"jll_category\":\"internal\",\"name\":\"trade_Value\",\"current_value\":40,\"calculated_value\":40.1},\"work_order\":{\"locations\":{\"country_name\":\"value\",\"state_province\":\"value\",\"city\":\"value\"},\"property\":{\"name\":\"property name\"}}}");
return record;
The problem is that the Java producer and the Python producer (which is based on the C client, librdkafka) use different default hash functions for data partitioning. Records with the same key can therefore end up in different partitions, and each Kafka Streams task only has the local state store for its own partitions, which is why the lookup returns null. You will need to provide a customized partitioner to one (or both) producers to make sure they use the same partitioning strategy.
Unfortunately, the Kafka protocol does not specify what the default partitioning hash function should be, and thus clients can use whatever they want by default.
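For illustration, here is a hedged sketch of how a custom partitioner could be plugged into the Java producer so that both clients end up with the same strategy. The class name is made up, and the CRC32 hash is only an assumption about what the Python/librdkafka side uses by default; it would need to be verified (or the Python producer's partitioner configured instead) so the two really match.
import java.util.Map;
import java.util.zip.CRC32;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class MatchingPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // simplification; real clients spread keyless records across partitions
        }
        CRC32 crc = new CRC32(); // assumed hash function, must match the other client
        crc.update(keyBytes);
        return (int) (crc.getValue() % numPartitions);
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
The partitioner is then registered on the Java producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, MatchingPartitioner.class.getName()).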
I want to write a Kafka application that consumes from topics and saves something in a database. The topics are created by the Debezium Kafka connector based on the MySQL binlog, so I have one topic per table.
This is the code I am using for consuming from one topic:
KStream<GenericRecord,mysql.company.tiers.Envelope>[] tierStream = builder.stream("mysql.alopeyk.tiers",
Consumed.with(TierSerde.getGenericKeySerde(), TierSerde.getEnvelopeSerde()));
From an architectural point of view I should create a KStream for each table and run them in parallel, but the number of tables is so big that having that many threads may not be the best option.
All the tables have a column called created_at (it is a Laravel app), so I am curious whether there is a way to have a generic Serde for the values that extracts this common column. Its value is the only thing I am interested in, besides the name of the table.
It all depends on how your value is serialized by the application that produced the messages (the connector).
If the Deserializer (Serde) can extract created_at from the different types of messages, it is possible.
So the answer is yes, but it depends on your message values and Deserializer.
Assuming all your messages after serialization have a format like the following:
create_at;name:position;...
create_at;city,country;...
create_at;product_name;...
In that case the Deserializer only needs to take the characters up to the first ; and convert them to a date; the rest of the value can be dropped.
Sample code:
import java.util.Date;
import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Date> {

    @Override
    public Date deserialize(String topic, byte[] data) {
        // keep only the part before the first ';' and treat it as epoch millis
        String strDate = new String(data);
        return new Date(Long.parseLong(strDate.substring(0, strDate.indexOf(";"))));
    }
}
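For completeness, a hedged usage sketch showing how such a deserializer could be wrapped into a Serde and used when consuming one of the Debezium topics. It assumes a recent Kafka clients version (2.1+, where a Serializer can be given as a lambda); the serializer stub, topic name, and key serde are illustrative only.
import java.util.Date;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

// wrap the custom deserializer into a Serde (the serializer side is just a stub)
Serde<Date> createdAtSerde = Serdes.serdeFrom(
        (topic, date) -> Long.toString(date.getTime()).getBytes(),
        new CustomDeserializer());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Date> createdAtStream =
        builder.stream("mysql.alopeyk.tiers", Consumed.with(Serdes.String(), createdAtSerde));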
I have records that are processed with Kafka Streams (using Processor API). Let's say the record has city_id and some other fields.
In the Kafka Streams app I want to add the current temperature in the target city to the record.
Temperature<->City pairs are stored in, e.g., Postgres.
In a Java application I'm able to connect to Postgres using JDBC and build a HashMap<CityId, Temperature>, so I can look up the temperature based on city_id. Something like tempHM.get(record.city_id).
There are several questions about how to best approach it:
Where to initialize the context data?
Originally I was doing it within AbstractProcessor::init(), but that seems wrong, as it is initialized for each thread and also reinitialized on rebalance.
So I moved it before the streams topology builder, and the processors are built with it. The data is fetched only once, independently of the number of processor instances.
Is that a proper and valid approach? It works, but...
HashMap<CityId, Temperature> tempHM = new HashMap<>();
// Connect to DB and initialize tempHM here

Topology topology = new Topology();
topology
    .addSource(SOURCE, stringDeserializer, protoDeserializer, "topic-in")
    .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(tempHM), SOURCE)
    .addSink(SINK, "topic-out", stringSerializer, protoSerializer, TemperatureAppender.NAME);
How to refresh the context data?
I would like to refresh the temperature data every 15 minutes, for example. I was thinking of using a container around the HashMap instead of a bare HashMap; that would handle it:
abstract class ContextContainer<T> {

    T context;
    Date lastRefreshAt;

    ContextContainer(Date now) {
        refresh(now);
    }

    abstract void refresh(Date now);

    abstract Duration getRefreshInterval();

    T get() {
        return context;
    }

    boolean isDueToRefresh(Date now) {
        return lastRefreshAt == null
            || lastRefreshAt.getTime() + getRefreshInterval().toMillis() < now.getTime();
    }
}

final class CityTemperatureContextContainer extends ContextContainer<HashMap> {

    CityTemperatureContextContainer(Date now) {
        super(now);
    }

    void refresh(Date now) {
        if (!isDueToRefresh(now)) {
            return;
        }

        HashMap context = new HashMap();
        // Connect to DB and get data and fill hashmap

        lastRefreshAt = now;
        this.context = context;
    }

    Duration getRefreshInterval() {
        return Duration.ofMinutes(15);
    }
}
This is a brief concept written in the SO textarea, so it might contain some syntax errors, but I hope the point is clear.
It is then passed into the processor like .addProcessor(TemperatureAppender.NAME, () -> new TemperatureAppender(cityTemperatureContextContainer), SOURCE)
And in the processor I do:
public void init(final ProcessorContext context) {
    context.schedule(
        Duration.ofMinutes(1),
        PunctuationType.STREAM_TIME,
        timestamp -> {
            cityTemperatureContextContainer.refresh(new Date(timestamp));
            tempHm = cityTemperatureContextContainer.get();
        }
    );

    super.init(context);
}
Is there a better way? The main question is about finding the proper concept; I'm able to implement it once I have that. There aren't many resources on this topic out there, though.
In the Kafka Streams app I want to add the current temperature in the target city to the record. Temperature<->City pairs are stored in, e.g., Postgres.
In a Java application I'm able to connect to Postgres using JDBC and build a HashMap<CityId, Temperature>, so I can look up the temperature based on city_id. Something like tempHM.get(record.city_id).
A better alternative would be to use Kafka Connect to ingest your data from Postgres into a Kafka topic, read this topic into a KTable in your application with Kafka Streams, and then join this KTable with your other stream (the stream of records "with city_id and some other fields"). That is, you will be doing a KStream-to-KTable join.
Think:
### Architecture view
DB (here: Postgres) --Kafka Connect--> Kafka --> Kafka Streams Application
### Data view
Postgres Table ----------------------> Topic --> KTable
Example connectors for your use case are https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc and https://www.confluent.io/hub/debezium/debezium-connector-postgresql.
One of the advantages of the Kafka Connect based setup above is that you no longer need to talk directly from your Java application (which uses Kafka Streams) to your Postgres DB.
Another advantage is that you don't need to do "batch refreshes" of your context data (you mentioned every 15 minutes) from your DB into your Java application, because the application would get the latest DB changes in real-time automatically via the DB->KConnect->Kafka->KStreams-app flow.
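A minimal sketch of that KStream-to-KTable join, with String/Double values and made-up topic names standing in for the real types:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// the stream of records "with city_id and some other fields", keyed by city_id
KStream<String, String> records =
        builder.stream("records-topic", Consumed.with(Serdes.String(), Serdes.String()));

// city temperatures ingested from Postgres via Kafka Connect, keyed by city_id
KTable<String, Double> temperatures =
        builder.table("city-temperatures-topic", Consumed.with(Serdes.String(), Serdes.Double()));

// enrich each record with the latest known temperature for its city
KStream<String, String> enriched = records.join(
        temperatures,
        (record, temperature) -> record + ",temperature=" + temperature);

enriched.to("enriched-records-topic", Produced.with(Serdes.String(), Serdes.String()));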
I am working on providing an API, and on MongoDB I am storing the data in one database per month and one collection per date.
So I have a database db_08_2015 and then 31 collections, from date_01 to date_31,
and I have to query the data from date 1 to date 10 to get the total money spent, so I need to send 31 requests like this.
My question is: how can I get the data one request at a time and compute the sum before I return it to the client, i.e. something like a synchronous request into Mongo to get the result?
Something like: I have date_01 = 10, then date_02 = 20, ..., and I want to sum it all before returning to the client.
vertx.eventBus().send("mongodb-persistor", json, new Handler<Message<JsonObject>>() {
    @Override
    public void handle(Message<JsonObject> message) {
        logger.info(message.body());
        JsonObject result = new JsonObject(message.body().encodePrettily());
        JsonArray r = result.getArray("results");
        if (r.isArray()) {
            if (r.size() > 0) {
                String out = r.get(0).toString();
                req.response().end(out);
            } else {
                req.response().end("{}");
            }
        } else {
            req.response().end(message.body().encodePrettily());
        }
    }
});
I think that in your case you might be better off with a different approach to modeling your data.
In terms of analytics, I would recommend the lambda architecture approach, as quoted below:
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute
the batch views.
The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.
With the above in mind, why not have an aggregates collection that holds the aggregated data in the format your queries require, while at the same time keeping a raw copy in the format you described?
By having this you will have a view over the data in the required query format, plus a way to recreate the aggregated data in case something goes wrong with your system.
Reference for the quotes: Lambda Architecture
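To make the aggregates-collection idea concrete, here is a hedged sketch using the plain MongoDB Java driver (not the mod-mongo-persistor used in the question); the collection and field names are made up. Each write to the raw per-date collection also bumps a running daily total, so the sum for dates 1 to 10 becomes a single query over one collection instead of one request per date collection.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> aggregates =
        client.getDatabase("db_08_2015").getCollection("daily_totals");

// whenever a spend record is stored in its per-date collection,
// also increment that day's running total (upsert creates the document)
aggregates.updateOne(
        Filters.eq("date", "2015-08-01"),
        Updates.inc("total_spend", 10),
        new UpdateOptions().upsert(true));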
I'm making a request from a Java web app to an Oracle stored procedure which happens to have a Timestamp IN parameter.
The way the info travels is something like:
javaWebApp --> webservice client --> ws --> storedProcedure
And I send the Timestamp param as a formatted string from the web service client to the WS.
In the testing environment, it works when sending:
SimpleDateFormat dateFormat = new SimpleDateFormat("dd-MMM-yyyy hh:mm:ss a");
input.setTimestampField(dateFormat.format(new Date()));
As you can see, a formatted string is sent. But in the production environment, it raises an exception:
ORA-01830: date format picture ends before converting entire input string.
It relates to the format not being the same, possibly due to differences in configuration from one DB to the other. I know the testing environment should be a replica of the production site, but it is not in my hands to set them up properly. And I need to send the Timestamp-as-a-formatted-string field regardless of how they set up the database. Any ideas? Thanks in advance.
EDIT: I've found a way to make it work properly regardless of the particular configuration. It is as simple as setting the call instruction in the web service with the appropriate Oracle instructions. I mean, the call to the Oracle stored procedure went from
"call PACKAGE.MYPROCEDURE(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
to
"call PACKAGE.MYPROCEDURE(?,?,?,?,?,?,TO_TIMESTAMP(?, 'DD-MM-YYYY HH24:MI:SS'),?,?,?,?,?,?,?,?,?,?,?)"
while the format I set in the procedure call matches the format sent by the web app using the SimpleDateFormat stated in the original question, slightly modified:
SimpleDateFormat dateFormat = new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");
Thank you all for the help and the ideas.
The default NLS_DATE_FORMAT generally doesn't include the time and has only a two-digit year. It is probably either DD-MM-YY or MM-DD-YY.
If the WS receives a string and the database stored procedure needs a timestamp, then the two of them will need to agree on the format mask. Either the WS, when it connects to the database, should set an explicit date format, or the database should accept a string and convert it using a hard-coded format.
Unless there is some particular negotiation you have defined in the WS, nothing the JavaWebApp or web service client does will be able to influence the format that the database assumes the WS is using.
All that said, I'd have a look around any other code at your end and see if there's anything doing a similar conversion. You may find something else using a specific format.
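As a concrete example of the first option (the WS setting an explicit format right after it connects), a hedged JDBC sketch; the format strings are assumptions and have to match what the client actually sends:
// connection is the WS's existing java.sql.Connection; run once per connection
try (Statement stmt = connection.createStatement()) {
    stmt.execute("ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'DD-MM-YYYY HH24:MI:SS'");
    stmt.execute("ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MM-YYYY HH24:MI:SS'");
}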
What does your query look like in the input prepared statement? That error indicates that Oracle doesn't like the date format you passed in. Your test environment may have a different NLS_DATE_FORMAT set on the database or on the machine/driver being used.