Oozie spark-submit, `--driver-cores` parameter not working - java

I am doing a spark-submit from Oozie, and the --driver-cores option is not working. For example, if I provide --driver-cores 4, YARN still creates a 1 vCore container for the driver.
Spark opts in Oozie:
<master>yarn-cluster</master>
<spark-opts>--queue testQueue --num-executors 4 --driver-cores 4
...
</spark-opts>
I have also tried other config keys like --conf spark.driver.cores=4 and --conf spark.yarn.am.cores=4, but those are not working either.
Any pointers will be helpful. Thanks.

If you have specified this, your program is using 4 cores; there is no doubt about that.
You are just reading the UI wrong.
On the ResourceManager page, with the default setting (DefaultResourceCalculator), only memory usage is calculated.
For vCore usage it will always show 1, because vCores are not taken into account.
If you change the resource calculator class to DominantResourceCalculator, it will show the actual core usage.
Just add this property to yarn-site.xml and restart YARN:
yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
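In XML form, the entry would look roughly like this (a sketch; depending on your distribution and scheduler setup, the property may instead belong in capacity-scheduler.xml):
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>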
You can also verify this by going to the Spark History Server UI.
Before changing anything, submit a Spark job and find that job in the Spark UI.
Go to the Executors section of that job and you will see all the executors used by Spark and their configurations.

Related

Configuring open telemetry for tracing service to service calls ONLY

I am experimenting with different instrumentation libraries, but spring-cloud-sleuth and OpenTelemetry (OT) are the ones I liked the most. Spring-cloud-sleuth is simple, but it will not work for a non-Spring (JAX-RS) project, so I diverted my attention to OpenTelemetry.
I am able to export the metrics using OT, but there is just too much data that I do not need. Spring Sleuth gave the perfect solution, wherein it just traces the calls across microservices and links all the spans with one traceId.
My question is: how do I configure OT to get output similar to Spring Sleuth? I tried various configurations and a few worked, but the amount of information is still huge.
My configuration
-Dotel.traces.exporter=zipkin -Dotel.instrumentation.[jdbc].enabled=false -Dotel.instrumentation.[methods].enabled=false -Dotel.instrumentation.[jdbc-datasource].enabled=false
However, this still gives me method calls and other data. Also, one big pain is that I am not able to shut down the metrics data;
I get an error like the one below:
ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export metrics. The request could not be executed. Full error message: Failed to connect to localhost/0:0:0:0:0:0:0:1:4317
Any help will be appreciated.
There are two ways to configure the OpenTelemetry (OTel) agent:
Environment variable
Java system property
To disable the metrics exporter, you can either set
export OTEL_METRICS_EXPORTER=none
or
java -Dotel.metrics.exporter=none -jar app.jar
Reference
https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/README.md
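If the goal is to keep only the service-to-service traces, a combined set of JVM flags could look roughly like this (a sketch; the agent jar path, app.jar and the exporter choice are placeholders for your setup):
java -javaagent:/path/to/opentelemetry-javaagent.jar \
  -Dotel.traces.exporter=zipkin \
  -Dotel.metrics.exporter=none \
  -Dotel.logs.exporter=none \
  -jar app.jar
With the metrics exporter set to none, the agent should also stop trying to reach the default OTLP endpoint on localhost:4317, which is what produces the "Failed to export metrics" error above.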

Flink Job Cluster vs Session Cluster - deploying and configuration

I'm researching docker/k8s deployment possibilities for Flink 1.9.1.
I've already read/watched [1][2][3][4].
Currently we think that we will try to go with the Job Cluster approach, although we would like to know what the community trend is with this. We would rather not deploy more than one job per Flink cluster.
Anyway, I was wondering about a few things:
How can I change the number of task slots per task manager for Job and
Session Cluster? In my case I'm running docker on VirtualBox where I have 4
CPUs assigned to this machine. However, each task manager is spawned with only one task slot for a Job Cluster. With a Session Cluster, however, on the same machine, each task manager is spawned with 4 task slots.
In both cases Flink's UI shows that each Task manager has 4 CPUs.
How can I resubmit a job if I'm using a Job Cluster? I'm referring to this use case [5]. You may say that I have to start the job again but with different arguments. What is the procedure for this? I'm using checkpoints, by the way. Should I kill all task manager containers and rerun them with different parameters?
How can I resubmit a job using a Session Cluster?
How can I provide a log config for a Job/Session Cluster?
I have a case where I changed the log level and log format in log4j.properties, and this works fine in the local (IDE) environment. However, when I build the fat jar and run a Job Cluster based on this jar, it seems that my log4j properties are not passed to the cluster. I see the original format and the original (INFO) level.
Thanks,
[1] https://youtu.be/w721NI-mtAA
[2] https://youtu.be/WeHuTRwicSw
[3] https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html
[4] https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md
[5] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Job-claster-scalability-td32027.html
Currently we think that we will try to go with the Job Cluster approach, although we would like to know what the community trend is with this. We would rather not deploy more than one job per Flink cluster.
This question is probably better suited on the user mailing list.
How can I change the number of task slots per task manager for Job and Session Cluster?
You can control this via the config option taskmanager.numberOfTaskSlots in conf/flink-conf.yaml.
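For example, in conf/flink-conf.yaml (the value 4 here simply matches the 4 CPUs mentioned above):
taskmanager.numberOfTaskSlots: 4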
How can I resubmit a job using a Session Cluster?
This is described here. The bottom line is that you create a savepoint and resume your job from it. It is also possible to resume a job from retained checkpoints.
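As a sketch, the CLI flow in a session cluster could look like this (the job id, savepoint path and jar name are placeholders):
bin/flink savepoint <jobId> [targetDirectory]
bin/flink cancel <jobId>
bin/flink run -s <savepointPath> my-job.jar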
How can I resubmit a job if I'm using a Job Cluster?
Conceptually, this is no different from resuming a job from a savepoint in a session cluster. You can specify the path to the savepoint as a command line argument to the cluster entrypoint. The details are described here.
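As a sketch (assuming the flink-container image from [4] and its job-cluster entrypoint; the image name, job class and savepoint path are placeholders), the savepoint path would be passed when the job cluster container is started:
docker run my-flink-job-image job-cluster \
  --job-classname com.example.MyJob \
  --fromSavepoint /savepoints/savepoint-abc123 \
  --allowNonRestoredState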
How can I provide a log config for a Job/Session Cluster?
If you are using the scripts in the bin/ directory of the Flink binary distribution to start your cluster (e.g., bin/start-cluster.sh, bin/jobmanager.sh, bin/taskmanager.sh, etc.), you can change the log4j configuration by adapting conf/log4j.properties. The logging configuration is passed to the JobManager and TaskManager JVMs as a system property (see bin/flink-daemon.sh). See also the chapter "How to use logging" in the Flink documentation.
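As a sketch, a conf/log4j.properties that changes both the level and the pattern (Flink 1.9 ships with Log4j 1.x, so the classic appender classes apply) might look like this:
log4j.rootLogger=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n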

Spring boot micro services consuming lot of memory in docker-swarm

I have some Docker Swarm containers running on an Ubuntu 16.04.4 LTS instance on Azure. The containers are running Java Spring Boot and Netflix OSS applications like Eureka, Ribbon, Gateway, etc. I observed that my containers are taking a huge amount of memory, although the services are just REST endpoints.
I tried to limit the memory consumption by passing the JVM args below, but this didn't help; the memory footprint didn't change even after that.
Please note the configuration that I am using here:
Java Version : Java 8 Alpine
Kernel Version: 4.15.0-1023-azure
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 125.9GiB
Memory footprint after docker stats
Java VM arguments:
docker service create --name xxxxxx-service --replicas 1 --network overnet 127.0.0.1:5000/xxxxxx-service --env JAVA_OPTS="-Xms16m -Xmx32m -XX:MaxMetaspaceSize=48m -XX:CompressedClassSpaceSize=8m -Xss256k -Xmn8m -XX:InitialCodeCacheSize=4m -XX:ReservedCodeCacheSize=8m -XX:MaxDirectMemorySize=16m -XX:+UseCGroupMemoryLimitForHeap -XX:-ShrinkHeapInSteps -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=70"
I've also tried to look at the application log files within each of the containers, but I can't find any memory-related errors. I also tried to limit the container's resources, but that didn't work for me either.
Limit a container's resources
Any clue how I can troubleshoot this heavy memory issue?
You can troubleshoot this by using a profiler such as VisualVM or JProfiler; they will show you where the memory is allocated (which types of objects, etc.).
You shouldn't use one on a production system though, if possible, because profiling can be very CPU-heavy.
Another way to find out more, which I have used in the past, is to use AspectJ's load-time weaving to add special code that writes memory information to your log files.
This will also slow down your system, but when your aspects are well written, not as much as using a profiler.
If possible, profiling would be preferred; if not, AspectJ load-time weaving might be helpful.
You can try enabling Actuator and comparing the memory consumption values with the values reported by docker stats.
To enable Actuator you could add the following dependency to your pom.xml file:
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
I generally use HAL-browser for monitoring the application and consuming the actuator endpoints.
You can add that using the following Maven dependency:
<dependency>
  <groupId>org.springframework.data</groupId>
  <artifactId>spring-data-rest-hal-browser</artifactId>
</dependency>
In the HAL browser you could try consuming the /metrics endpoint for your application.
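If you prefer the command line over the HAL browser, the same data can be fetched directly (assuming a Spring Boot 1.x application listening on port 8080 with the default actuator paths):
curl http://localhost:8080/metrics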
A sample output would look like this.
{
  "mem" : 193024,
  "mem.free" : 87693,
  "processors" : 4,
  "instance.uptime" : 305027,
  "uptime" : 307077,
  "systemload.average" : 0.11,
  "heap.committed" : 193024,
  "heap.init" : 124928,
  "heap.used" : 105330,
  "heap" : 1764352,
  "threads.peak" : 22,
  "threads.daemon" : 19,
  "threads" : 22,
  "classes" : 5819,
  "classes.loaded" : 5819,
  "classes.unloaded" : 0,
  "gc.ps_scavenge.count" : 7,
  "gc.ps_scavenge.time" : 54,
  "gc.ps_marksweep.count" : 1,
  "gc.ps_marksweep.time" : 44,
  "httpsessions.max" : -1,
  "httpsessions.active" : 0,
  "counter.status.200.root" : 1,
  "gauge.response.root" : 37.0
}
This way you can monitor the memory performance of your application and find out how much memory it is actually consuming. If this matches the report generated by docker stats, then the issue is with your code.
However, I must state that the use of Actuator is not production-friendly, as it has a significant resource overhead of its own.

GAE cron job shuts down unexpectedly

I have some cron jobs configured in cron.xml in an application on Google App Engine.
These jobs run once a day on a version of my application and do some work on the db.
For example, a cron job calls v1.myapp.appspot.com...
After some weeks this application instance seems to no longer work correctly: it does not execute the cron jobs as I expect.
On the GAE Dashboard I found a section with a list of cron jobs, but I can't see my cron jobs there.
Why did they disappear? What's wrong with my configuration or environment? Or why did Google stop the execution of my cron jobs?
The cron job configuration has app-wide scope; it is not a configuration of a specific service/version. Every cron deployment (which can be done without necessarily updating a service/version) overwrites the previously deployed one.
To avoid accidental errors, I personally keep a single cron config file at the app level, symlinked inside each service as needed.
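For example, if I remember the classic Java SDK tooling correctly, the cron configuration could be (re)deployed on its own, without redeploying the service code (the WAR directory path is a placeholder):
appcfg.sh update_cron /path/to/your/war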
If you want to keep the cron job for an older version running you need to add a configuration entry for it with a target matching that service/version, otherwise the cron job will stop working when that version is no longer the default one (as the cron-triggered requests will be directed towards the default service/version):
From Creating a cron job:
<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <url>/tasks/summary</url>
    <target>beta</target>
    <description>daily summary job</description>
    <schedule>every 24 hours</schedule>
  </cron>
</cronentries>
The target specification is optional and is the name of a
service/version. If present, the target is prepended to your app's
hostname, causing the job to be routed to that service/version. If no
target is specified, the job will run in the default version of the
default service.

How does the JDBCJobStore work?

So I started to tinker around with JDBCJobStore in Quartz. Firstly, I could not find a single good resource on how to configure it from scratch. After looking for it for a while and singling out a good resource for beginners, I downloaded the sample application at Job scheduling with Quartz. I have a few doubts regarding it.
How does JDBCJobStore capture jobs? I mean, in order for a job to get stored in the database, does the job have to run manually once? Or will JDBCJobStore automatically detect the jobs and their details?
How does JDBCJobStore schedule the jobs? Does it hit the database at a fixed interval, like a heartbeat, to check if there are any scheduled jobs? Or does it keep the triggers in memory while the application is running?
In order to run the jobs, will I have to manually specify the details of the job, like name and group, and fetch the trigger accordingly? Is there any alternative to this?
On each application restart, how can I tell the scheduler to start automatically? Can it be specified somehow?
If you are using a servlet/app server, you can start it during startup:
http://quartz-scheduler.org/documentation/quartz-2.2.x/cookbook/ServletInitScheduler
If you are running standalone, you have to initialize it manually, I think.
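A minimal sketch of that manual initialization (the class name is just for illustration; it assumes a quartz.properties, such as the one sketched at the end of this answer, is on the classpath):
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzBootstrap {
    public static void main(String[] args) throws SchedulerException {
        // StdSchedulerFactory reads quartz.properties from the classpath (or falls back to defaults).
        Scheduler scheduler = new StdSchedulerFactory().getScheduler();
        // With a JDBCJobStore, jobs and triggers already stored in the database are picked up here;
        // they do not need to be re-registered on every restart.
        scheduler.start();
    }
}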
You can read more about JobStores here:
http://quartz-scheduler.org/documentation/quartz-2.2.x/tutorials/tutorial-lesson-09
And about jobs and triggers:
http://quartz-scheduler.org/documentation/quartz-2.2.x/tutorials/tutorial-lesson-02
http://quartz-scheduler.org/documentation/quartz-2.2.x/tutorials/tutorial-lesson-03
http://quartz-scheduler.org/documentation/quartz-2.2.x/tutorials/tutorial-lesson-04
I guess that Quartz checks for jobs at a fixed time interval so that it can work properly in clusters and distributed systems.
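For reference, a minimal quartz.properties sketch for a JDBC job store with clustering (the data source name, JDBC driver, URL and credentials are placeholders for your environment; clusterCheckinInterval is the cluster "heartbeat" interval in milliseconds):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.threadPool.threadCount = 5
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
org.quartz.dataSource.myDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.myDS.URL = jdbc:mysql://localhost:3306/quartz
org.quartz.dataSource.myDS.user = quartz
org.quartz.dataSource.myDS.password = quartz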
