Understanding Flink JobManager memory

I have a Flink job that uses an NFS filesystem folder as its source and Kafka as its sink. There are no transformations at this point.
I use ContinuousFileMonitoringFunction to continuously monitor the folder for events and
ContinuousFileReaderOperator to read the data.
ContinuousFileMonitoringFunction<String> monitoringFunction = new ContinuousFileMonitoringFunction<>(
inputFormat, FileProcessingMode.PROCESS_CONTINUOUSLY, env.getParallelism(),
MONITORING_INTERVAL);
ContinuousFileReaderOperator<String> reader = new ContinuousFileReaderOperator<>(inputFormat);
The initial size of the folder is ~40GB, with 3785468 files in it (across all subdirectories).
I have created 1 JobManager with a 25G heap and 2 TaskManagers with 4 task slots and the following memory values.
taskmanager.memory.process.size: "26g"
taskmanager.memory.flink.size: "24g"
jobmanager.heap.size: "25g"
taskmanager.memory.jvm-overhead.max: "2g"
taskmanager.memory.task.off-heap.size: "1024M"
taskmanager.memory.task.heap.size: "16g"
taskmanager.memory.managed.fraction: 0.2
taskmanager.memory.network.max: "2g"
When the job is started, the JobManager spends a long time, around 2 hours, preparing the job. Once the job starts, it transfers the files to Kafka without problems.
I am trying to fine-tune the job. Can anyone help me understand what happens during this preparation stage and which part of memory matters during it?
I have tried changing the memory parameters, but nothing seems to help, and without knowing which memory area is used for what, I am unable to proceed.
I have gone through the Flink documentation on memory, but it is not clear to me what managed memory and direct memory are used for while the job is processing.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#memory-configuration
Could someone help me understand what I should consider when fine-tuning the job?

Related

Apache Storm Kafka Spout Lag Issue

I am building a Java Spring application using Storm 1.1.2 and Kafka 0.11 to be launched in a Docker container.
Everything in my topology works as planned but under a high load from Kafka, the Kafka lag increases more and more over time.
My KafkaSpoutConfig:
KafkaSpoutConfig<String,String> spoutConf =
    KafkaSpoutConfig.builder("kafkaContainerName:9092", "myTopic")
        .setProp(ConsumerConfig.GROUP_ID_CONFIG, "myGroup")
        .setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, MyObjectDeserializer.class)
        .build();
Then my topology is as follows
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("stormKafkaSpout", new KafkaSpout<String,String>(spoutConf), 25);
builder.setBolt("routerBolt", new RouterBolt(),25).shuffleGrouping("stormKafkaSpout");
Config conf = new Config();
conf.setNumWorkers(10);
conf.put(Config.STORM_ZOOKEEPER_SERVERS, ImmutableList.of("zookeeper"));
conf.put(Config.STORM_ZOOKEEPER_PORT, 2181);
conf.put(Config.NIMBUS_SEEDS, ImmutableList.of("nimbus"));
conf.put(Config.NIMBUS_THRIFT_PORT, 6627);
System.setProperty("storm.jar", "/opt/storm.jar");
StormSubmitter.submitTopology("topologyId", conf, builder.createTopology());
The RouterBolt (which extends BaseRichBolt) does one very simple switch statement and then uses a local KafkaProducer object to send a new message to another topic. Like I said, everything compiles and the topology runs as expected but under a high load (3000 messages/s), the Kafka lag just piles up equating to low throughput for the topology.
I've tried disabling acking with
conf.setNumAckers(0);
and
conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 0);
but I guess it's not an acking issue.
I've seen in the Storm UI that the RouterBolt has an execution latency of 1.2ms and a process latency of .03ms under the high load, which leads me to believe the spout is the bottleneck. Also, the parallelism hint is 25 because "myTopic" has 25 partitions. Thanks!

You may be affected by https://issues.apache.org/jira/browse/STORM-3102, which causes the spout to do a pretty expensive call on every emit. Please try upgrading to one of the fixed versions.
Edit: The fix isn't actually released yet. You might still want to try out the fix by building the spout from source using e.g. https://github.com/apache/storm/tree/1.1.x-branch to build a 1.1.4 snapshot.

Spark: Job restart and retries

Suppose you have Spark with the Standalone cluster manager. You have opened a Spark session with some configs and want to launch SomeSparkJob 40 times in parallel with different arguments.
Questions
How do I set the number of retries on job failure?
How do I restart jobs programmatically on failure? This could be useful if jobs fail due to a lack of resources. Then I could launch, one by one, all the jobs that require extra resources.
How do I restart the Spark application on job failure? This could be useful if a job lacks resources even when it's launched simultaneously. Then, to change cores, CPU, etc. configs, I need to relaunch the application in the Standalone cluster manager.
My workarounds
1) I am pretty sure the 1st point is possible, since it is possible in Spark local mode. I just don't know how to do it in standalone mode.
2-3) It is possible to attach a listener to the Spark context, like spark.sparkContext().addSparkListener(new SparkListener() {. But SparkListener seems to lack failure callbacks.
There are also a number of methods with very poor documentation. I've never used them, but perhaps they could help solve my problem.
spark.sparkContext().dagScheduler().runJob();
spark.sparkContext().runJob()
spark.sparkContext().submitJob()
spark.sparkContext().taskScheduler().submitTasks();
spark.sparkContext().dagScheduler().handleJobCancellation();
spark.sparkContext().statusTracker()

You can use SparkLauncher and control the flow.
import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/my/app.jar")
            .setMainClass("my.spark.app.Main")
            .setMaster("local")
            .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
            .launch();
        spark.waitFor();
    }
}
See the API for more details.
Since launch() creates a Process, you can check the Process status and retry, e.g. with the following:
public boolean isAlive()
If the Process is no longer alive, start it again; see the API for more details.
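A minimal sketch of that retry idea, under stated assumptions: the class name RetryingLauncher, the method runWithRetries and the maxAttempts parameter are hypothetical and not part of Spark. The launch attempt is abstracted as an IntSupplier that returns the process exit code, so the loop itself depends only on the JDK.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not a Spark class.
final class RetryingLauncher {

    /**
     * Runs one launch attempt at a time until an attempt returns exit code 0
     * or maxAttempts is reached.
     *
     * @return the 1-based number of the attempt that succeeded, or -1 if all attempts failed
     */
    static int runWithRetries(IntSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            int exitCode = attempt.getAsInt(); // expected to block until this attempt has finished
            if (exitCode == 0) {
                return i;
            }
            System.err.println("Attempt " + i + " failed with exit code " + exitCode);
        }
        return -1;
    }
}
```

With SparkLauncher, the supplier would wrap launch() followed by waitFor() in a try/catch, since launch() throws IOException and waitFor() throws InterruptedException. Treating a non-zero exit code as "failed, retry" is an assumption of this sketch; a real application may want to distinguish failures that are worth retrying from those that are not.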
Hopefully this gives a high-level idea of how to achieve what you described in your question. There may be other ways to do the same thing, but I thought I would share this approach.
Cheers !

Check your spark.sql.broadcastTimeout and spark.broadcast.blockSize properties and try increasing them.

I am not finding evidence of NodeInitializationAction for Dataproc having run

I am specifying a NodeInitializationAction for Dataproc as follows:
ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(...);
clusterConfig.setMasterConfig(...);
clusterConfig.setWorkerConfig(...);
List<NodeInitializationAction> initActions = new ArrayList<>();
NodeInitializationAction action = new NodeInitializationAction();
action.setExecutableFile("gs://mybucket/myExecutableFile");
initActions.add(action);
clusterConfig.setInitializationActions(initActions);
Then later:
Cluster cluster = new Cluster();
cluster.setProjectId("wide-isotope-147019");
cluster.setConfig(clusterConfig);
cluster.setClusterName("cat");
Then finally, I invoke the dataproc.create operation with the cluster. I can see the cluster being created, but when I ssh into the master machine ("cat-m" in us-central1-f), I see no evidence of the script I specified having been copied over or run.
So this leads to my questions:
What should I expect in terms of evidence? (edit: I found the script itself in /etc/google-dataproc/startup-scripts/dataproc-initialization-script-0).
Where does the script get invoked from? I know it runs as the user root, but beyond that, I am not sure where to find it. I did not find it in the root directory.
At what point does the Operation returned from the Create call change from "CREATING" to "RUNNING"? Does this happen before or after the script gets invoked, and does it matter if the exit code of the script is non-zero?
Thanks in advance.

Dataproc makes a number of guarantees about init actions:
each script should be downloaded and stored locally in:
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
the output of the script will be captured in a "staging bucket" (either the bucket specified via the --bucket option, or a Dataproc auto-generated bucket). Assuming your cluster is named my-cluster, if you describe the master instance via gcloud compute instances describe my-cluster-m, the exact location is in the dataproc-agent-output-directory metadata key
Cluster may not enter RUNNING state (and Operation may not complete) until all init actions execute on all nodes. If init action exits with non-zero code, or init action exceeds specified timeout, it will be reported as such
similarly if you resize a cluster, we guarantee that new workers do not join cluster until each worker is fully configured in isolation
if you still don't believe me :) inspect the Dataproc agent log in /var/log/google-dataproc-agent-0.log and look for entries from BootstrapActionRunner

How to decrease heartbeat time of slave nodes in Hadoop

I am working on AWS EMR.
I want to find out about a dead task node as soon as possible. But with the default settings in Hadoop, the heartbeat is shared only every 10 minutes.
This is the default key-value pair in mapred-default - mapreduce.jobtracker.expire.trackers.interval : 600000ms
I tried to change the default value to 6000ms using - this link
After that, whenever I terminate an EC2 machine from the EMR cluster, I do not see the state change that quickly (within 6 seconds).
Resource manager REST API - http://MASTER_DNS_NAME:8088/ws/v1/cluster/nodes
Questions-
What is the command to check the mapreduce.jobtracker.expire.trackers.interval value in running EMR cluster(Hadoop cluster)?
Is this the right key I am using to get the state change ? If it is not, please suggest any other solution.
What is the difference between DECOMMISSIONING vs DECOMMISSIONED vs LOST state of nodes in Resource manager UI ?
Update
I tried a number of times, but the behaviour is inconsistent. Sometimes the node moved to the DECOMMISSIONING/DECOMMISSIONED state, and sometimes it moved directly to the LOST state after 10 minutes.
I need a quick state change, so that I can trigger some event.
Here is my sample code -
List<Configuration> configurations = new ArrayList<Configuration>();
Configuration mapredSiteConfiguration = new Configuration();
mapredSiteConfiguration.setClassification("mapred-site");
Map<String, String> mapredSiteConfigurationMapper = new HashMap<String, String>();
mapredSiteConfigurationMapper.put("mapreduce.jobtracker.expire.trackers.interval", "7000");
mapredSiteConfiguration.setProperties(mapredSiteConfigurationMapper);
Configuration hdfsSiteConfiguration = new Configuration();
hdfsSiteConfiguration.setClassification("hdfs-site");
Map<String, String> hdfsSiteConfigurationMapper = new HashMap<String, String>();
hdfsSiteConfigurationMapper.put("dfs.namenode.decommission.interval", "10");
hdfsSiteConfiguration.setProperties(hdfsSiteConfigurationMapper);
Configuration yarnSiteConfiguration = new Configuration();
yarnSiteConfiguration.setClassification("yarn-site");
Map<String, String> yarnSiteConfigurationMapper = new HashMap<String, String>();
yarnSiteConfigurationMapper.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", "5000");
yarnSiteConfiguration.setProperties(yarnSiteConfigurationMapper);
configurations.add(mapredSiteConfiguration);
configurations.add(hdfsSiteConfiguration);
configurations.add(yarnSiteConfiguration);
These are the settings that I changed in AWS EMR (internally Hadoop) to reduce the time taken for the state to change from RUNNING to another state (DECOMMISSIONING/DECOMMISSIONED/LOST).

You can use "hdfs getconf". Please refer to this post Get a yarn configuration from commandline
These links give info about node manager health-check and the properties you have to check:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
Refer to "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" in the link below:
https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Your queries are answered in this link:
https://issues.apache.org/jira/browse/YARN-914
Refer to the "attachments" and "sub-tasks" areas.
In simple terms, if the currently running application master and task containers are shut down properly (and/or restarted on other nodes), then the node manager is said to be DECOMMISSIONED (gracefully); otherwise it is LOST.
Update:
"dfs.namenode.decommission.interval" is for HDFS data node decommissioning, it does not matter if you are concerned only about node manager.
In exceptional cases, data node need not be a compute node.
Try yarn.nm.liveness-monitor.expiry-interval-ms (default 600000 - that is why you reported that the state changed to LOST in 10 minutes, set it to a smaller value as you require) instead of mapreduce.jobtracker.expire.trackers.interval.
You have set "yarn.resourcemanager.nodemanagers.heartbeat-interval-ms" as 5000, which means, the heartbeat goes to resource manager once in 5 seconds, whereas the default is 1000. Set it to a smaller value as you require.
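As a rough sketch of how the two yarn-site properties named above could be assembled before being handed to the yarn-site Configuration from the question: the class name YarnSiteOverrides and its build method are hypothetical, and the check that the expiry interval is larger than the heartbeat interval is an assumption of this sketch rather than something YARN is known to enforce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the two property names come from the answer above.
final class YarnSiteOverrides {

    /**
     * Builds yarn-site overrides for faster detection of lost NodeManagers.
     * Assumption of this sketch: the expiry interval must exceed the heartbeat
     * interval, otherwise a healthy node could expire between two heartbeats.
     */
    static Map<String, String> build(long heartbeatMs, long expiryMs) {
        if (heartbeatMs <= 0 || expiryMs <= heartbeatMs) {
            throw new IllegalArgumentException(
                    "expiry interval must be larger than the heartbeat interval");
        }
        Map<String, String> props = new LinkedHashMap<>();
        // How often each NodeManager sends a heartbeat to the ResourceManager.
        props.put("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms",
                Long.toString(heartbeatMs));
        // How long the ResourceManager waits without a heartbeat before expiring the node.
        props.put("yarn.nm.liveness-monitor.expiry-interval-ms",
                Long.toString(expiryMs));
        return props;
    }
}
```

In the question's code, the returned map could replace yarnSiteConfigurationMapper, for example yarnSiteConfiguration.setProperties(YarnSiteOverrides.build(1000, 10000)). The values 1000 and 10000 are illustrative only and should be chosen to suit the cluster and its network.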

1.
hdfs getconf -confKey mapreduce.jobtracker.expire.trackers.interval
2.
As mentioned in the other answer, yarn.resourcemanager.nodemanagers.heartbeat-interval-ms should be set based on your network; if your network has high latency, you should set a larger value.
3.
A node is in DECOMMISSIONING when it has running containers and is waiting for them to complete so that it can be decommissioned.
It is in LOST when it has been stuck in this process for too long. This state is reached once the configured timeout has passed and decommissioning of the node(s) could not be completed.
DECOMMISSIONED is when the decommissioning of the node(s) completes.
Reference : Resize a Running Cluster
For YARN NodeManager decommissioning, you can manually adjust the time
a node waits for decommissioning by setting
yarn.resourcemanager.decommissioning.timeout inside
/etc/hadoop/conf/yarn-site.xml; this setting is dynamically
propagated.

Improve memory usage of Vert.x server

I have a Vert.x web server which provides a WebSocket service. The Vert.x server sends some data to a client when the client registers on the server, then the client sends an ACK back to the server to confirm that the data has been delivered reliably.
I found that the Vert.x server still consumes a lot of memory after it has finished all the work.
Below are steps to reproduce the issue:
Config JVM parameter before starting our test:
Open /vert.x-2.02-final/bin/
Modify value of JVM_OPTS from "" to "-Xms128M -Xmx128M"
Save and exit
Modify serverIpAddress to your server IP address in VertXSocketClient
Client will register to 1180 websocket channel on the Vert.x server.
You can get code here:
https://www.dropbox.com/s/ptenlx78iin8dmj/VertXSocketClient.java
run testserver with command vertx run testserver.java
The memory usage of the Vert.x server will be printed to your console in the format:
total memory - free memory = used memory(MB)
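System.gc(); // request a collection before sampling
Runtime runtime = Runtime.getRuntime();
int mb = 1024 * 1024;
long totalMemory = runtime.totalMemory() / mb; // heap currently reserved by the JVM, in MB
long freeMemory = runtime.freeMemory() / mb;   // unused part of that heap, in MB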
I call System.gc() every 5 seconds to make sure memory is freed. Yes, I know System.gc() shouldn't be called frequently, because it has a negative impact on system performance. But the used memory does not decrease without this call.
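For reference, the periodic measurement described above can be sketched with plain JDK classes. This is an illustration only, not the code from the linked TestServer.java: the class name MemoryReporter and its methods are hypothetical, and the sketch uses a ScheduledExecutorService rather than any Vert.x API.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper built only on the JDK.
final class MemoryReporter {

    private static final long MB = 1024L * 1024L;

    /** Returns the currently used heap in MB (total minus free). */
    static long usedMb() {
        Runtime runtime = Runtime.getRuntime();
        return (runtime.totalMemory() - runtime.freeMemory()) / MB;
    }

    /** Formats one report line as "total - free = used(MB)". */
    static String report() {
        Runtime runtime = Runtime.getRuntime();
        long total = runtime.totalMemory() / MB;
        long free = runtime.freeMemory() / MB;
        return total + " - " + free + " = " + (total - free) + "(MB)";
    }

    /** Prints a report every periodSeconds on a daemon thread; the caller shuts the executor down. */
    static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "memory-reporter");
            t.setDaemon(true); // do not keep the JVM alive just for reporting
            return t;
        });
        executor.scheduleAtFixedRate(() -> {
            System.gc(); // only a hint to the JVM; mirrors the approach in the question
            System.out.println(report());
        }, 0, periodSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

Calling MemoryReporter.start(5) would print one line every 5 seconds. Note that totalMemory() reports heap the JVM has reserved, so a figure that stays high after the work is done does not by itself show that live objects are still being held.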
You can download the code here:
https://www.dropbox.com/sh/6oxtfhgwffed72c/AAAX-BvYdGaTBgnRagxD9Bf-a/TestServer.java
run VertXSocketClient with command vertx run VertXSocketClient.java
The client will register to websocket channel server automatically and the server instance will send data to client after registration has finished.
Here is the sample code that sends data to the client:
byte[] serverResponseData = serverResponse.getBytes();
Buffer buffer= new Buffer(serverResponseData);
ws.write(buffer);
With the above code, used memory rises to as much as 62MB after all work is done, whereas it only reaches 15MB if I comment out ws.write(buffer).
My assumption is that the Vert.x server keeps 62MB of memory set aside for its lifetime. Isn't it supposed to release memory after the work is done?

You should definitely check Hazelcast map configuration if you are using Vert.x. Maybe it's not related to your problem and your Vert.x usage, but I had similar problems with memory usage and found that my configuration (in cluster.xml) was not appropriate regarding my use cases.
Some points you should check in Hazelcast config:
backup-count - By default, Hazelcast has one sync backup copy
time-to-live-seconds - Maximum time in seconds for each map entry to stay in the map is 0 (infinite) by default
max-idle-seconds - Maximum time in seconds for each entry to stay idle in the map is 0 (infinite) by default
eviction-policy - NONE eviction policy for Maps is used by default
See Hazelcast Map documentation for more details.
