I have the following topology definition and there are two application instances on the environment:
KStream<String, ObjectMessage> stream = kStreamBuilder.stream(inputTopic);
stream.mapValues(new ProtobufObjectConverter())
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMillis(100)))
        .aggregate(AggregatedObject::new, new ObjectAggregator(), buildStateStore(storeName))
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded().withMaxRecords(config.suppressionBufferSize())))
        .mapValues(new AggregatedObjectProtobufConverter())
        .toStream((key, value) -> key.key())
        .to(outputTopic);
private Materialized<String, AggregatedObject, WindowStore<Bytes, byte[]>> buildStateStore(String storeName) {
    return Materialized.<String, AggregatedObject, WindowStore<Bytes, byte[]>>as(storeName)
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(AggregatedObject.class));
}
This topology is created for multiple input topics in a for loop, so one application instance has multiple topologies. Every topology has a state store named after the pattern KSTREAM-AGGREGATE-%s-STATE-STORE-0000000001, for example Opening store KSTREAM-AGGREGATE-my.topic.name-STATE-STORE-0000000001.
Now, until recently we didn't have the state-dir directory configured, and since we use a K8s StatefulSet, the store wasn't persisted between restarts, so the application had to rebuild the state, as far as I understand how Kafka Streams works.
Our logs were full of entries like the one below, differing only in the time (the suffix after the last dot).
INFO 1 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore Opening store KSTREAM-AGGREGATE-my.topic.name-STATE-STORE-0000000001.1675576920000 in regular mode
However, the time in millis, 1675576920000, is one day old, and some are even from 3 days ago. Today I added the state-dir to the app, but this log is still being shown all the time. Should we simply wait some time until everything has been processed, or are we doing something wrong?
Can someone explain to me why RocksDBTimestampedStore is logging so much? Also, why is the time being logged for those stores not the 100 ms defined by the windowed operation?
I need to use Kafka Streams with a Java application that runs as a cron job and reads the whole topic each time. Unfortunately, for some reason, it commits the offset and on the next run reads from the last committed offset. I have tried various ways, but unfortunately without success. My settings are as follows:
streamsConfiguration.put(APPLICATION_ID_CONFIG, "app_id");
streamsConfiguration.put(AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(ENABLE_AUTO_COMMIT_CONFIG, "false");
And I read the topic with the following code:
Consumed<String, String> with = Consumed.with(Serdes.String(), Serdes.String())
        .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST);
final var stream = builder.stream("topic", with);
stream.foreach((key, value) -> {
    log.info("Key= {}, value= {}", key, value);
});
final var kafkaStreams = new KafkaStreams(builder.build(), kafkaStreamProperties);
kafkaStreams.cleanUp();
kafkaStreams.start();
But still, it reads from the latest offset.
Kafka Streams commits offsets regularly, so after you run the application the first time and shut it down, the next time you start it up Kafka Streams will pick up at the last committed offset. That's the standard Kafka behavior. AUTO_OFFSET_RESET_CONFIG only applies when a consumer doesn't find a committed offset; only then does it rely on that config to decide where to start.
So if you want it to read from the beginning on the next startup, you can either use the application reset tool or change the application.id. If you get the properties for the Kafka Streams application externally, you could automate generating a unique name each time.
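A minimal sketch of that idea, assuming the externally loaded properties already contain an application.id (loadExternalProperties() is a hypothetical helper):
// Hypothetical sketch: append a per-run suffix to application.id, so each cron run
// starts without committed offsets and auto.offset.reset=earliest applies again.
Properties streamsConfiguration = loadExternalProperties(); // assumed helper
String baseId = streamsConfiguration.getProperty(StreamsConfig.APPLICATION_ID_CONFIG);
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, baseId + "-" + System.currentTimeMillis());
Note that each run will then create a fresh set of internal topics and local state for the new application.id, so old ones may need periodic cleanup.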
I have push notifications being sent to Android and iOS applications through Spring Boot every day at 8am Europe/Paris.
If I run multiple instances, the notifications will be sent multiple times. I am thinking of saving the sent notifications to the database each day and checking against them, but I am worried it would still run multiple times. This is what I am doing:
@Component
public class ScheduledTasks {

    private static final Logger log = LoggerFactory.getLogger(ScheduledTasks.class);
    private static final SimpleDateFormat dateFormat = new SimpleDateFormat("HH:mm:ss");

    @Autowired
    private ExpoPushTokenRepository expoPushTokenRepository;

    @Autowired
    private ExpoPushNotificationService expoPushNotificationService;

    @Autowired
    private MessageSource messageSource;

    // TODO: if instances > 1, this will run multiple times; save the sent notifications
    // to the database and prevent multiple sending.
    @Scheduled(cron = "${cron.promotions.notification}", zone = "Europe/Paris")
    public void sendNewPromotionsNotification() {
        List<ExpoPushToken> expoPushTokenList = expoPushTokenRepository.findAll();
        ArrayList<NotifyRequest> notifyRequestList = new ArrayList<>();
        for (ExpoPushToken expoPushToken : expoPushTokenList) {
            NotifyRequest notifyRequest = new NotifyRequest(
                    expoPushToken.getToken(),
                    "This is a test title",
                    "This is a test subtitle",
                    "This is a test body"
            );
            notifyRequestList.add(notifyRequest);
        }
        expoPushNotificationService.sendPushNotificationToList(notifyRequestList);
        log.info("{} Sent push notification to {} users", dateFormat.format(new Date()), expoPushTokenList.size());
    }
}
Does anybody have an idea on how I can prevent that safely?
Quartz would be my mostly database-agnostic solution for the task at hand, but it was ruled out, so we are not going to discuss it.
The solution we are going to explore instead makes the following assumptions:
Postgres >= 9.5 is used (because we are going to use SKIP LOCKED, which was introduced in PostgreSQL 9.5).
It is okay to run a native query.
Under these conditions, we can retrieve batches of notifications from multiple running instances of the application with the following query:
SELECT * FROM expo_push_token FOR UPDATE SKIP LOCKED LIMIT 100;
This will retrieve and lock up to 100 entries from the table expo_push_token. If two instances of the application execute this query simultaneously, the results they receive will be disjoint. 100 is just a sample value; we may want to fine-tune it for our use case. The locks stay active until the current transaction ends.
After an instance has fetched a batch of notifications, it also has to delete the entries it locked from the table, or otherwise mark them as processed (if we go down this route, we have to modify the query above to filter out already processed entries), and close the current transaction to release the locks. Each instance of the application would then repeat this query until it returns zero entries.
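A minimal sketch of this first variant, assuming Spring's NamedParameterJdbcTemplate, an id column on expo_push_token, and the NotifyRequest / ExpoPushNotificationService types from the question:
@Service
public class PromotionNotificationBatchJob {

    @Autowired
    private NamedParameterJdbcTemplate jdbc;

    // Locks up to 100 rows, deletes them, and lets the transaction commit (releasing the locks).
    // Concurrent instances get disjoint batches thanks to SKIP LOCKED.
    @Transactional
    public List<NotifyRequest> claimBatch() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT id, token FROM expo_push_token FOR UPDATE SKIP LOCKED LIMIT 100",
                Collections.emptyMap());
        List<Long> ids = new ArrayList<>();
        List<NotifyRequest> requests = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            ids.add(((Number) row.get("id")).longValue());
            requests.add(new NotifyRequest((String) row.get("token"),
                    "This is a test title", "This is a test subtitle", "This is a test body"));
        }
        if (!ids.isEmpty()) {
            // Remove the claimed rows before the transaction commits, so no other instance can pick them up again.
            jdbc.update("DELETE FROM expo_push_token WHERE id IN (:ids)",
                    new MapSqlParameterSource("ids", ids));
        }
        return requests;
    }
}
The @Scheduled method on each instance would then call claimBatch() from another bean in a loop, sending each returned batch outside of the transaction, until an empty list comes back.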
There is also an alternative approach: an instance first fetches a batch of notifications to send, keeps the database transaction open (and thus continues holding the locks), sends out its notifications, and only then deletes/updates the entries and closes the transaction.
The two solutions have different strengths/weaknesses:
the first solution keeps the transaction short. But if the application crashes in the middle of sending out notifications, the part of its batch that was not sent out is lost for this run.
the second solution keeps the transaction open, possibly for a long time. If it crashes in the middle of sending out notifications, all entries will be unlocked and its batch will be re-processed, possibly resulting in some notifications being sent out twice.
For this solution to work, we also need some kind of job that fills the table expo_push_token with the data we need. This job should run beforehand, i.e. its execution should not overlap with the notification-sending process.
We are creating a POC to read database CDC and push it to external systems.
Each source table's CDC records are sent to a respective topic in Avro format (with Kafka Schema Registry and Kafka Server).
We are writing Java code to consume the messages in the Avro schema, deserialize them using AvroSerde, join them, and then send them to different topics so they can be consumed by external systems.
We have a limitation, though: we cannot produce messages to the source table topics to send/receive new contents/changes. So the only way to write the join code is to read messages from the beginning of every source topic each time we run the application (until we are confident the code is working and can start receiving live data again).
With the KafkaConsumer object we have the option to use the seekToBeginning method to force reading from the beginning in Java code, which works. However, there is no such option when we try to stream a topic using a KStream object and force it to read from the beginning. What are the alternatives here?
We tried to reset the offsets using the kafka-consumer-groups tool with --to-earliest, but that only resets to the earliest available offset. When we try to reset the offset manually to "0" with the --to-offset parameter, we get the warning below, but it does not set it to "0". My understanding is that setting it to 0 should read messages from the beginning; correct me if I am wrong.
"WARN New offset (0) is lower than earliest offset for topic partition"
Sample code below
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVER);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID);
properties.put("schema.registry.url", SCHEMA_REGISTRY_URL);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, APPLICATION_ID);
StreamsBuilder builder = new StreamsBuilder();
//nothing returned here, when some offset has already been set
KStream myStream = builder.stream("my-topic-in-avro-schema", Consumed.with(myKeySerde, myValueSerde));
KafkaStreams streams = new KafkaStreams(builder.build(),properties);
streams.start();
One way to do this would be to generate a random ConsumerGroup every time you start the stream application. Something like:
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID + currentTimestamp);
That way, the stream will start reading from "earliest" as you have set it already in auto.offset.reset.
By the way, you are setting the properties for group.id twice in your code...
For anyone facing the same issue: replace the application id and group id with some unique identifier, e.g. using UUID.randomUUID().toString(), in the configuration properties. It should then fetch the messages from the beginning.
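A minimal sketch of that, reusing the Properties object from the question (note that Kafka Streams derives the consumer group from application.id, so that is the property that matters):
// Use a fresh id per run so no committed offsets exist for this group and
// auto.offset.reset=earliest makes the topology re-read the topic from the beginning.
String runId = UUID.randomUUID().toString();
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, APPLICATION_ID + "-" + runId);
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID + "-" + runId);
Keep in mind that every run will leave behind the internal topics of the previous application.id, so they may need cleaning up eventually.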
We recently upgraded Kafka to v1.1 and Confluent to v4.0. But upon upgrading we have encountered a persistent problem regarding state stores. Our application starts a collection of streams, and we check for the state stores to be ready before killing the application after 100 tries. But after the upgrade there is at least one stream that will report Store is not ready : the state store, <your stream>, may have migrated to another instance
The stream itself is in the RUNNING state and messages flow through, but the state of the store still shows up as not ready, so I have no idea what may be happening.
Should I not check for store state?
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
Should we not do a hard restart -- currently we run it as a service on Linux?
We are running Kafka in a cluster with 3 brokers. Below is a sample stream (not the entire code):
public BaseStream createStreamInstance() {
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
    MessagePayLoadParser<Note> noteParser = new MessagePayLoadParser<Note>(Note.class);
    GenericJsonSerde<Note> noteSerde = new GenericJsonSerde<Note>(Note.class);
    StreamsBuilder builder = new StreamsBuilder();

    // The reducer below uses sets to combine values.
    // value1 in the reducer is what is already present in the store.
    // value2 is the incoming message; for notes it should have at most 1 item in its list
    // (since it is 1 attachment / 1 tag per row, but multiple rows per note).
    Reducer<Note> reducer = new Reducer<Note>() {
        @Override
        public Note apply(Note value1, Note value2) {
            value1.merge(value2);
            return value1;
        }
    };

    KTable<Long, Note> noteTable = builder
            .stream(this.subTopic, Consumed.with(jsonSerde, jsonSerde))
            .map(noteParser::parse)
            .groupByKey(Serialized.with(Serdes.Long(), noteSerde))
            .reduce(reducer);

    noteTable.toStream().to(this.pubTopic, Produced.with(Serdes.Long(), noteSerde));

    this.stream = new KafkaStreams(builder.build(), this.properties);
    return this;
}
There are some open questions here, like the ones Matthias put in the comments, but I will try to answer / give help on your actual questions:
Should I not check for store state?
Rebalancing is usually the cause here. But in that case you should not see the partition's original thread keep consuming; the processing should be "transferred" to another thread that took over. Make sure it is actually that very thread that keeps on processing that partition, and not the new one. Check the kafka-consumer-groups utility to follow the consumers (threads) there.
And since our application has a lot of streams (~15), would starting them simultaneously cause problems? No, rebalancing is automatic.
Should we not do a hard restart -- currently we run it as a service on Linux? Are you keeping your state stores in a specific, non-default directory? You should configure your state store directory properly and make sure it is accessible and resilient to application restarts. I'm not sure how you perform your hard restart, but some exception handling code should guard against it by closing your streams application.
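A minimal sketch of both points, assuming the this.properties and this.stream fields from createStreamInstance() (the /var/lib/kafka-streams path is just an example):
// Keep RocksDB state in a stable, writable directory that survives service restarts,
// instead of the default location under /tmp.
this.properties.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

// Close the streams application cleanly when the service is stopped, so a hard
// restart does not leave the local state stores behind in a bad shape.
Runtime.getRuntime().addShutdownHook(new Thread(this.stream::close));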
I am working on AWS EMR.
I want to get information about a dead task node as soon as possible. But per the default setting in Hadoop, the heartbeat is shared every 10 minutes.
This is the default key-value pair in mapred-default - mapreduce.jobtracker.expire.trackers.interval : 600000ms
I tried to modify the default value to 6000ms using this link.
After that, whenever I terminate an EC2 machine from the EMR cluster, I am not able to see the state change that fast (within 6 seconds).
Resource manager REST API - http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
Questions-
What is the command to check the mapreduce.jobtracker.expire.trackers.interval value in a running EMR cluster (Hadoop cluster)?
Is this the right key I am using to get the state change ? If it is not, please suggest any other solution.
What is the difference between DECOMMISSIONING vs DECOMMISSIONED vs LOST state of nodes in Resource manager UI ?
Update
I tried a number of times, but it shows ambiguous behaviour. Sometimes it moved to the DECOMMISSIONING/DECOMMISSIONED state, and sometimes it moved directly to the LOST state after 10 minutes.
I need a quick state change, so that I can trigger some event.
Here is my sample code -
List<Configuration> configurations = new ArrayList<Configuration>();
Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);
Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);
Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);
configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);
These are the settings that I changed in AWS EMR (internally Hadoop) to reduce the time for the state change from RUNNING to another state (DECOMMISSIONING/DECOMMISSIONED/LOST).
You can use "hdfs getconf". Please refer to this post Get a yarn configuration from commandline
These links give info about node manager health-check and the properties you have to check:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
Refer "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in the below link:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Your queries are answered in this link:
https://issues.apache.org/jira/browse/YARN-914
Refer to the "attachments" and "sub-tasks" areas.
In simple terms, if the currently running application master and task containers get shut down properly (and/or re-initiated on other nodes), then the node manager is said to be DECOMMISSIONED (gracefully); otherwise it is LOST.
Update:
"dfs.namenode.decommission.interval" is for HDFS DataNode decommissioning; it does not matter if you are only concerned about the node manager.
In exceptional cases, a data node need not be a compute node.
Try yarn.nm.liveness-monitor.expiry-interval-ms (default 600000 ms - that is why you reported that the state changed to LOST in 10 minutes; set it to a smaller value as you require) instead of mapreduce.jobtracker.expire.trackers.interval.
You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" to 5000, which means the heartbeat goes to the resource manager once every 5 seconds, whereas the default is 1000. Set it to a smaller value as you require.
hdfs getconf -confKey mapreduce.jobtracker.expire.trackers.interval
As mentioned in the other answer:
yarn.resourcemanager.nodemanagers.heartbeat-interval-ms should be set based on your network; if your network has high latency, you should set a bigger value.
3.
A node is in DECOMMISSIONING when there are running containers and it is waiting for them to complete so that the node can be decommissioned.
It is in LOST when it is stuck in this process for too long; this state is reached after the set timeout has passed and the decommissioning of the node(s) could not be completed.
DECOMMISSIONED is when the decommissioning of the node(s) completes.
Reference : Resize a Running Cluster
For YARN NodeManager decommissioning, you can manually adjust the time a node waits for decommissioning by setting yarn.resourcemanager.decommissioning.timeout inside /etc/hadoop/conf/yarn-site.xml; this setting is dynamically propagated.
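If you want to set that timeout at cluster creation rather than editing yarn-site.xml on the nodes, a hedged sketch reusing the yarnSiteConfigurationMapper from the question's code (the value of 60 is just an example, in seconds):
// Example only: add the YARN decommissioning timeout to the same yarn-site
// classification already built in the question's configuration code.
yarnSiteConfigurationMapper.put("yarn.resourcemanager.decommissioning.timeout", "60");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);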