How to decrease heartbeat time of slave nodes in Hadoop - java

I am working on AWS EMR.
I want to get information about a dead task node as soon as possible. But with the default Hadoop settings, a node is only expired after 10 minutes without a heartbeat.
This is the default key-value pair in mapred-default: mapreduce.jobtracker.expire.trackers.interval = 600000 ms.
I tried to modify the default value to 6000 ms using this link.
After that, whenever I terminate an EC2 machine in the EMR cluster, I am still not able to see the state change that fast (within 6 seconds).
Resource Manager REST API: http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
Questions:
1. What is the command to check the mapreduce.jobtracker.expire.trackers.interval value in a running EMR (Hadoop) cluster?
2. Is this the right key to use to detect the state change? If it is not, please suggest another solution.
3. What is the difference between the DECOMMISSIONING vs DECOMMISSIONED vs LOST node states in the Resource Manager UI?
Update
I have tried a number of times, but the behaviour is inconsistent. Sometimes the node moves to the DECOMMISSIONING/DECOMMISSIONED state, and sometimes it moves directly to the LOST state after 10 minutes.
I need a quick state change so that I can trigger an event.
Here is my sample code -
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.elasticmapreduce.model.Configuration;

List<Configuration> configurations = new ArrayList<Configuration>();

// mapred-site: lower the tracker expiry interval (value in ms)
Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);

// hdfs-site: lower the NameNode decommission check interval (value in seconds)
Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);

// yarn-site: change the NodeManager heartbeat interval (value in ms)
Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);

configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);
These are the settings I changed in AWS EMR (internally, Hadoop) to reduce the time for a node to move from RUNNING to another state (DECOMMISSIONING/DECOMMISSIONED/LOST).
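For completeness, here is a minimal sketch of how such a configuration list might be attached when launching the cluster with the AWS SDK for Java v1; the cluster name, release label, and roles below are placeholder assumptions, not values from the question:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("my-cluster")                 // assumption: any cluster name
        .withReleaseLabel("emr-5.30.0")         // assumption: pick your EMR release
        .withConfigurations(configurations)     // the list built above
        .withServiceRole("EMR_DefaultRole")     // assumption: default EMR roles
        .withJobFlowRole("EMR_EC2_DefaultRole");
// Instance configuration (withInstances) is omitted here for brevity.
RunJobFlowResult result = emr.runJobFlow(request);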

You can use "hdfs getconf". Please refer to this post: Get a yarn configuration from commandline.
These links give information about the NodeManager health check and the properties you have to check:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
Refer to "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in the link below:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Your queries are answered in this link:
https://issues.apache.org/jira/browse/YARN-914
Refer to the "attachments" and "sub-tasks" areas.
In simple terms, if the currently running application master and task containers get shut down properly (and/or re-initiated on other nodes), then the NodeManager is said to be DECOMMISSIONED (gracefully); otherwise it is LOST.
Update:
"dfs.namenode.decommission.interval" is for HDFS data node decommissioning, it does not matter if you are concerned only about node manager.
In exceptional cases, data node need not be a compute node.
Try yarn.nm.liveness-monitor.expiry-interval-ms (default 600000 - that is why you reported that the state changed to LOST in 10 minutes, set it to a smaller value as you require) instead of mapreduce.jobtracker.expire.trackers.interval.
You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" as 5000, which means, the heartbeat goes to resource manager once in 5 seconds, whereas the default is 1000. Set it to a smaller value as you require.
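For example, a hedged sketch that mirrors the question's EMR Configuration code; the 10000 ms expiry is an assumption to tune for your cluster:

// Assumption: add the liveness-monitor expiry to the same yarn-site classification
yarnSiteConfigurationMapper.put("yarn.nm.liveness-monitor.expiry-interval-ms", "10000");
// Assumption: restore the heartbeat interval to its 1000 ms default
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "1000");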

1. Use hdfs getconf to check the value:
hdfs getconf -confKey mapreduce.jobtracker.expire.trackers.interval
2. As mentioned in the other answer, yarn.resourcemanager.nodemanagers.heartbeat-interval-ms should be set based on your network; if your network has high latency, you should set a bigger value.
3. A node is in DECOMMISSIONING when it has running containers and is waiting for them to complete so that the node can be decommissioned.
It is in LOST when it is stuck in this process for too long. This state is reached after the set timeout passes and decommissioning of the node(s) could not be completed.
It is in DECOMMISSIONED when the decommissioning of the node(s) completes.
Reference: Resize a Running Cluster
For YARN NodeManager decommissioning, you can manually adjust the time a node waits for decommissioning by setting yarn.resourcemanager.decommissioning.timeout inside /etc/hadoop/conf/yarn-site.xml; this setting is dynamically propagated.
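Since the goal is to trigger an event on a state change, a minimal polling sketch against the Resource Manager REST API from the question may help; the poll interval, the placeholder host name, and the naive string matching (instead of real JSON parsing) are all assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NodeStatePoller {
    public static void main(String[] args) throws Exception {
        // Assumption: replace MASTER_DNS_NAME with your master's DNS name.
        URL url = new URL("http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes");
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            // Naive check; a real implementation would parse the JSON node list.
            if (body.indexOf("LOST") >= 0 || body.indexOf("DECOMMISSION") >= 0) {
                System.out.println("Node state changed: trigger the event here");
            }
            Thread.sleep(5000); // assumption: poll every 5 seconds
        }
    }
}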

Related

Kafka Streams Opening StateStore log loop

I have the following topology definition, and there are two application instances in the environment:
KStream<String, ObjectMessage> stream = kStreamBuilder.stream(inputTopic);
stream.mapValues(new ProtobufObjectConverter())
      .groupByKey()
      .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMillis(100)))
      .aggregate(AggregatedObject::new, new ObjectAggregator(), buildStateStore(storeName))
      .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded().withMaxRecords(config.suppressionBufferSize())))
      .mapValues(new AggregatedObjectProtobufConverter())
      .toStream((key, value) -> key.key())
      .to(outputTopic);

private Materialized<String, AggregatedObject, WindowStore<Bytes, byte[]>> buildStateStore(String storeName) {
    return Materialized.<String, AggregatedObject, WindowStore<Bytes, byte[]>>as(storeName)
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(AggregatedObject.class));
}
This topology is created for multiple input topics in a for loop, so one application instance has multiple topologies. Every topology has a state store named from the pattern KSTREAM-AGGREGATE-%s-STATE-STORE-0000000001, e.g. Opening store KSTREAM-AGGREGATE-my.topic.name-STATE-STORE-0000000001.
Now, until recently we did not have the state-dir directory configured, and since we use a K8S stateful set, the store was not persisted between restarts, so the application had to rebuild the state, as far as I understand how Kafka Streams works.
Our logs were full of entries like the one below, differing only in the time (the suffix after the last dot).
INFO 1 --- [-StreamThread-1] o.a.k.s.s.i.RocksDBTimestampedStore Opening store KSTREAM-AGGREGATE-my.topic.name-STATE-STORE-0000000001.1675576920000 in regular mode
However, the time in millis, 1675576920000, is one day old, and some entries are even from 3 days ago. Today I added the state-dir to the app, but this log line is still shown all the time. Should we simply wait some time until everything is processed, or are we doing something wrong?
Can someone explain to me why RocksDBTimestampedStore logs so much? Also, why is the time logged by those stores not the 100 ms defined by the windowed operation?

Kafka Streams: Store is not ready

We recently upgraded Kafka to v1.1 and Confluent to v4.0. But upon upgrading we have encountered a persistent problem regarding state stores. Our application starts a collection of streams, and we check for the state stores to be ready before killing the application after 100 tries. But after the upgrade there is at least one stream that will report: Store is not ready : the state store, <your stream>, may have migrated to another instance
The stream itself is in the RUNNING state, and the messages flow through, but the state of the store still shows up as not ready. So I have no idea what may be happening.
Should I not check for store state?
And since our application has a lot of streams (~15), would starting them simultaneously cause problems?
Should we not do a hard restart -- currently we run it as a service on Linux?
We are running Kafka in a cluster with 3 brokers. Below is a sample stream (not the entire code):
public BaseStream createStreamInstance() {
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
    MessagePayLoadParser<Note> noteParser = new MessagePayLoadParser<Note>(Note.class);
    GenericJsonSerde<Note> noteSerde = new GenericJsonSerde<Note>(Note.class);
    StreamsBuilder builder = new StreamsBuilder();
    // The reducer below uses sets to combine values.
    // value1 in the reducer is what is already present in the store.
    // value2 is the incoming message; for notes it should have at most 1 item in its list
    // (since it is 1 attachment / 1 tag per row, but multiple rows per note).
    Reducer<Note> reducer = new Reducer<Note>() {
        @Override
        public Note apply(Note value1, Note value2) {
            value1.merge(value2);
            return value1;
        }
    };
    KTable<Long, Note> noteTable = builder
            .stream(this.subTopic, Consumed.with(jsonSerde, jsonSerde))
            .map(noteParser::parse)
            .groupByKey(Serialized.with(Serdes.Long(), noteSerde))
            .reduce(reducer);
    noteTable.toStream().to(this.pubTopic, Produced.with(Serdes.Long(), noteSerde));
    this.stream = new KafkaStreams(builder.build(), this.properties);
    return this;
}
There are some open questions here, like the ones Matthias put in a comment, but I will try to answer/give help on your actual questions:
Should I not check for store state?
Rebalancing is usually the cause here. But in that case, you should not see that partition's thread keep consuming; the processing should be "transferred" to another thread that took over. Make sure it is actually that very thread that keeps on processing that partition, and not a new one. Check the kafka-consumer-groups utility to follow the consumers (threads) there.
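While a store is migrating, a common hedged pattern is to retry the lookup instead of failing fast; the helper name, retry count, and back-off below are assumptions, not part of the question's code:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.errors.InvalidStateStoreException;
import org.apache.kafka.streams.state.QueryableStoreType;

// Hypothetical helper: InvalidStateStoreException is expected during rebalancing.
public static <T> T waitForStore(KafkaStreams streams, String storeName,
                                 QueryableStoreType<T> storeType, int maxTries)
        throws InterruptedException {
    for (int i = 0; i < maxTries; i++) {
        try {
            return streams.store(storeName, storeType); // throws until the store is ready
        } catch (InvalidStateStoreException e) {
            Thread.sleep(500); // assumption: back off half a second between tries
        }
    }
    throw new IllegalStateException("Store " + storeName + " not ready after " + maxTries + " tries");
}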
And since our application has a lot of streams (~15), would starting them simultaneously cause problems? No, rebalancing is automatic.
Should we not do a hard restart -- currently we run it as a service on Linux? Are you keeping your state stores in a certain, non-default directory? You should configure your state store directory properly and make sure it is accessible and insensitive to application restarts. I am unsure how you perform your hard restart, but some exception-handling code should cover against it, closing your streams application.
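A minimal hedged sketch of pinning the state directory; the application id, bootstrap servers, and path are placeholder assumptions:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "notes-streams");     // assumption
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // assumption
// Point state.dir at a path that survives restarts (e.g. a mounted volume).
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams"); // assumption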

Hazelcast : Tuning properties for a node having temporary network glitch in a cluster

We have an embedded Hazelcast cluster with 10 AWS instances. The version of Hazelcast is 3.7.3. Right now we have the following settings for Hazelcast:
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=150
hazelcast.heartbeat.interval.seconds=1
hazelcast.operation.call.timeout.millis=5000
hazelcast.merge.first.run.delay.seconds=60
Apart from the above settings, the other property values are defaults.
Recently one of the nodes was not reachable for a few minutes or so, and some of the operations slowed down while getting things from the cache. We have a backup for each map, so if things were not available from one partition, Hazelcast should have responded from another partition, but it seems everything slowed down because of the one unreachable node.
Following is the exception that we saw in the logs for Hazelcast:
[3.7.2] PartitionIteratingOperation invocation failed to complete due
to operation-heartbeat-timeout. Current time: 2017-05-30 16:12:52.442.
Total elapsed time: 10825 ms. Last operation heartbeat: never. Last
operation heartbeat from member: 2017-05-30 16:12:42.166.
Invocation{op=com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation{serviceName='hz:impl:mapService',
identityHash=1798676695, partitionId=-1, replicaIndex=0, callId=0,
invocationTime=1496160761670 (2017-05-30 16:12:41.670),
waitTimeout=-1, callTimeout=5000,
operationFactory=com.hazelcast.map.impl.operation.MapGetAllOperationFactory@2afbcab7}, tryCount=10, tryPauseMillis=300, invokeCount=1,
callTimeoutMillis=5000, firstInvocationTimeMs=1496160761617,
firstInvocationTime='2017-05-30 16:12:41.617', lastHeartbeatMillis=0,
lastHeartbeatTime='1970-01-01 00:00:00.000',
target=[172.18.84.36]:9123, pendingResponse={VOID},
backupsAcksExpected=0, backupsAcksReceived=0,
connection=Connection[id=12, /172.18.64.219:9123->/172.18.84.36:48180,
endpoint=[172.18.84.36]:9123, alive=true, type=MEMBER]}
Can someone suggest the correct settings for Hazelcast so that one node being temporarily unreachable does not slow down the whole cluster?
The operation call timeout should not be set to a low value; it is probably best to leave it at the default. Some internal mechanisms, like the heartbeat, rely on the call timeout.
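A hedged sketch of setting these properties programmatically, leaving the call timeout at its default and tuning the heartbeat tolerance instead; the 60-second value is an assumption to adapt to your network:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

Config config = new Config();
// Deliberately not setting hazelcast.operation.call.timeout.millis (keep the default).
config.setProperty("hazelcast.heartbeat.interval.seconds", "1");
config.setProperty("hazelcast.max.no.heartbeat.seconds", "60"); // assumption: more tolerant than 30
HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);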
This is according to the reference manual, version 3.11.7.
I would recommend reading about the split-brain syndrome.
Maybe you should create another quorum to fall back on in case your node fails to communicate.
Also, from experience, I would recommend getting the reference manual specific to your version. Even if a default is supposed to be set to 5, I found that the version-specific manual recommends other values.

I am not finding evidence of NodeInitializationAction for Dataproc having run

I am specifying a NodeInitializationAction for Dataproc as follows:
ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(...);
clusterConfig.setMasterConfig(...);
clusterConfig.setWorkerConfig(...);
List<NodeInitializationAction> initActions = new ArrayList<>();
NodeInitializationAction action = new NodeInitializationAction();
action.setExecutableFile("gs://mybucket/myExecutableFile");
initActions.add(action);
clusterConfig.setInitializationActions(initActions);
Then later:
Cluster cluster = new Cluster();
cluster.setProjectId("wide-isotope-147019");
cluster.setConfig(clusterConfig);
cluster.setClusterName("cat");
Then finally, I invoke the dataproc.create operation with the cluster. I can see the cluster being created, but when I SSH into the master machine ("cat-m" in us-central1-f), I see no evidence of the script I specified having been copied over or run.
So this leads to my questions:
What should I expect in terms of evidence? (edit: I found the script itself in /etc/google-dataproc/startup-scripts/dataproc-initialization-script-0).
Where does the script get invoked from? I know it runs as the root user, but beyond that I am not sure where to find it. I did not find it in the root directory.
At what point does the Operation returned from the Create call change from "CREATING" to "RUNNING"? Does this happen before or after the script gets invoked, and does it matter whether the exit code of the script is non-zero?
Thanks in advance.
Dataproc makes a number of guarantees about init actions:
Each script is downloaded and stored locally in:
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
The output of the script is captured in a "staging bucket" (either the bucket specified via the --bucket option, or a Dataproc auto-generated bucket). Assuming your cluster is named my-cluster, if you describe the master instance via gcloud compute instances describe my-cluster-m, the exact location is in the dataproc-agent-output-directory metadata key.
The cluster may not enter the RUNNING state (and the Operation may not complete) until all init actions have executed on all nodes. If an init action exits with a non-zero code, or an init action exceeds the specified timeout, it will be reported as such.
Similarly, if you resize a cluster, we guarantee that new workers do not join the cluster until each worker is fully configured in isolation.
If you still don't believe me :) inspect the Dataproc agent log in /var/log/google-dataproc-agent-0.log and look for entries from BootstrapActionRunner.
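For example, on the master node (this command just reuses the class name and log path mentioned above):
grep BootstrapActionRunner /var/log/google-dataproc-agent-0.log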

quartz cluster mode only runs one task

I have two Quartz apps that must run in cluster mode, so I have two jars. When I run those two jars (java -jar), only one process seems to be working; the other seems to be on standby and does nothing, and it only begins to work when I kill the other process. I need the two processes to run in cluster mode.
This is my config:
private Properties getProperties() {
    final Properties quartzProperties = new Properties();
    quartzProperties.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
    quartzProperties.put("org.quartz.jobStore.isClustered", "true");
    quartzProperties.put("org.quartz.jobStore.tablePrefix", "QRTZ_");
    quartzProperties.put("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
    quartzProperties.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
    quartzProperties.put("org.quartz.threadPool.threadCount", "25");
    quartzProperties.put("org.quartz.scheduler.instanceId", "AUTO");
    quartzProperties.put("org.quartz.scheduler.instanceName", "qrtz");
    quartzProperties.put("org.quartz.threadPool.threadPriority", "5");
    quartzProperties.put("org.quartz.jobStore.clusterCheckinInterval", "10000");
    quartzProperties.put("org.quartz.jobStore.useProperties", "false");
    quartzProperties.put("org.quartz.jobStore.dataSource", "quartzDS");
    quartzProperties.put("org.quartz.dataSource.quartzDS.URL", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.URL"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.user", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.user"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.password", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.password"));
    quartzProperties.put("org.quartz.dataSource.quartzDS.maxConnections", "5");
    quartzProperties.put("org.quartz.dataSource.quartzDS.validationQuery", "select 0 from dual");
    quartzProperties.put("org.quartz.dataSource.quartzDS.driver", environment.getRequiredProperty("org.quartz.dataSource.quartzDS.driver"));
    return quartzProperties;
}
TL;DR: Your problem comes from the Quartz scheduler itself, and there is no way to change its behaviour.
To help you understand why, I have to explain how Quartz cluster mode behaves. We will take your case as an example.
You start your two apps, each running a Quartz instance that synchronizes through a database. Each job you schedule is stored in the database with processing data like "last time the job ran", "last instance that ran the job", etc. Each Quartz instance regularly scans the database for jobs to fire and fires as many jobs as it can.
The thing is, if you don't have enough load, one of your nodes will always scan the database before the other one and take all the load.
To see your other instance working, you have to shut down or stand by the first one, or increase the load on the cluster.
The only thing you can configure here is the size of the thread pool of each node; see http://www.quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigJDBCJobStoreClustering.html
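For reference, a minimal hedged sketch of feeding the properties above into a scheduler; lowering org.quartz.threadPool.threadCount is an assumption that caps how many jobs one node can grab at once:

import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;

// getProperties() is the method from the question.
Properties props = getProperties();
props.put("org.quartz.threadPool.threadCount", "5"); // assumption: smaller pool per node
SchedulerFactory factory = new StdSchedulerFactory(props);
Scheduler scheduler = factory.getScheduler();
scheduler.start(); // both nodes must share the same instanceName and database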
