How to change logging levels for executors?

The default logging level for all classes executed by Spark executors seems to be INFO. I would like to change it to DEBUG, or WARN, etc. as needed.
I'm using Spark Streaming 1.3.1 but will switch to 1.4 soon.
I have the following line in the shell script that submits the Spark Streaming job:
export SPARK_JAVA_OPTS="-Dlog4j.configuration=file:./log4j.properties"
This allows me to change the logging level for classes running in the driver, but not in the executors.
How can I control logging for classes that are run by executors?
Note: We are not running on Yarn. We're starting our own Spark cluster on EC2.
Note: Ideally, we would like to change the logging level while the Streaming process is still running. If that's not possible, we should at least be able to change some properties file. Recompiling code & redeploying is NOT an option.

tl;dr Change conf/log4j.properties to appropriate levels and distribute the file to workers.
You may have some luck with the --files command-line option of spark-submit when you submit your Spark application; it lets you ship a per-application log4j.properties along with the job, so you can change logging levels per application. It's just a guess, as I haven't tried it out yet.
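For example, an invocation along these lines (untested; the exact -Dlog4j.configuration value, with or without a file: prefix, varies between recipes) distributes a custom log4j.properties with --files and points the executor JVMs at it through spark.executor.extraJavaOptions. The class name and jar are placeholders:
spark-submit --class com.example.MyStreamingJob \
  --files ./log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:./log4j.properties" \
  my-streaming-job.jar
Changing levels while the job keeps running is a different story; with plain log4j 1.x the usual options are PropertyConfigurator.configureAndWatch or restarting the job after editing the file.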

Related

remove logs from Production Environment

How to implement log4j such that some desired loggers are not shown in the PRODUCTION environment, but are shown in the test and acceptance environments.
Is it possible to do by using log4j only?
You can use Maven profiles and ship a different log4j.xml config file for each environment.
You can use log4j in production, but make sure logging is kept to a minimum there; restricting output to INFO and FATAL keeps the amount of logging in production low.
The less you log, the faster the process; the more you log, the more verbose the server becomes.
Use WARN precisely, and only where a database transaction is involved.
Hope this helps!
You can work with log4j's default log levels, for example using DEBUG as the minimum level in test or acceptance environments and INFO in production.
Besides that, you can implement your own custom log levels: https://logging.apache.org/log4j/2.x/manual/customloglevels.html
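For illustration, with Log4j 2 a custom level can be defined and used straight from Java code, as shown in the manual page above. A minimal sketch; the VERBOSE name and the 550 priority (between DEBUG at 500 and TRACE at 600) are just example values:
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class CustomLevelExample {
    // 550 sits between DEBUG (500) and TRACE (600) in Log4j 2's level ordering
    private static final Level VERBOSE = Level.forName("VERBOSE", 550);
    private static final Logger LOGGER = LogManager.getLogger(CustomLevelExample.class);

    public static void main(String[] args) {
        // Printed only when the configured threshold is VERBOSE or finer,
        // so a production config set to INFO will filter it out
        LOGGER.log(VERBOSE, "a verbose message hidden in production");
        LOGGER.info("an info message that is kept everywhere");
    }
}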

How to schedule/trigger spark jobs in Cloudera?

Currently our project is on MR and we use Oozie to orchestrate our MR Jobs. Now we are moving to Spark, and would like to know the recommended ways to schedule/trigger Spark Jobs on the CDH cluster. Note that CDH Oozie does not support Spark2 Jobs. So please give an alternative for this.
Last time I looked, Hue had a Spark option in the Workflow editor. If Cloudera didn't support that, I'm not sure why it'd be there...
CDH Oozie does support plain shell scripts, though, but you need to make sure all NodeManagers have the spark-submit command available locally.
If that doesn't work, it also supports Java actions for running a JAR, so you could write your Spark jobs so that each starts from a main method that loads any configuration from there.
As soon as you submit the spark job from the shell, like:
spark-submit <script_path> <arguments_list>
it gets submitted to the CDH cluster. You will immediately be able to see the Spark job and its progress in Hue. This is how we trigger our Spark jobs.
Further, to orchestrate a series of jobs, you can use a shell script wrapper around it, or you can use a cron job to trigger it on a schedule.
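If you would rather trigger the job from Java code than from a shell wrapper (which also fits the Oozie Java-action route mentioned above), Spark ships with org.apache.spark.launcher.SparkLauncher, which wraps spark-submit programmatically. A rough sketch; the paths, class name and master URL are placeholders:
import org.apache.spark.launcher.SparkLauncher;

public class SubmitSparkJob {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setSparkHome("/opt/spark")                   // placeholder Spark installation path
                .setAppResource("/opt/jobs/my-spark-job.jar") // placeholder application jar
                .setMainClass("com.example.MySparkApp")       // placeholder main class
                .setMaster("yarn")                            // or spark://host:7077, local[*], ...
                .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
                .addAppArgs("2019-01-01")                     // arguments passed to the application
                .launch();
        int exitCode = spark.waitFor();                       // block until spark-submit finishes
        System.out.println("spark-submit exited with code " + exitCode);
    }
}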

Logging from Java app to ELK without need for parsing logs

I want to send logs from a Java app to ElasticSearch, and the conventional approach seems to be to set up Logstash on the server running the app, and have logstash parse the log files (with regex...!) and load them into ElasticSearch.
Is there a reason it's done this way, rather than just setting up log4j (or logback) to log things in the desired format directly to a log collector that can then ship them to ElasticSearch asynchronously? It seems crazy to me to have to fiddle with grok filters to deal with multiline stack traces (and burn CPU cycles on log parsing) when the app itself could just log in the desired format in the first place.
On a tangentially related note, for apps running in a Docker container, is it best practice to log directly to ElasticSearch, given the need to run only one process?
If you really want to go down that path, the idea would be to use something like an Elasticsearch appender (or this one or this other one) which would ship your logs directly to your ES cluster.
However, I'd advise against it for the same reasons mentioned by @Vineeth Mohan. You'd also need to ask yourself a couple of questions, but mainly: what would happen if your ES cluster goes down for any reason (OOM, network down, ES upgrade, etc.)?
There are many reasons why asynchronicity exists, one of which is robustness of your architecture and most of the time that's much more important than burning a few more CPU cycles on log parsing.
Also note that there is an ongoing discussion about this very subject in the official ES discussion forum.
I think it's usually ill-advised to log directly to Elasticsearch from a Log4j/Logback/whatever appender, but I agree that writing Logstash filters to parse a "normal" human-readable Java log is a bad idea too. I use https://github.com/logstash/log4j-jsonevent-layout everywhere I can to have Log4j's regular file appenders produce JSON logs that don't require any further parsing by Logstash.
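As a rough idea of what that looks like with Log4j 1.x properties (a sketch based on that project's README; the appender name and file path are placeholders):
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/myapp/myapp.json
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=10
# Each event is written as a single JSON line that Logstash/Filebeat can forward without grok parsing
log4j.appender.file.layout=net.logstash.log4j.JSONEventLayoutV1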
There is also https://github.com/elastic/java-ecs-logging which provides a layout for log4j, log4j2 and Logback. It's quite efficient and the Filebeat configuration is very minimal.
Disclaimer: I'm the author of this library.
If you need a quick solution, I've written this appender: Log4J2 Elastic REST Appender, if you want to use it. It can buffer log events based on time and/or the number of events before sending them to Elastic (using the _bulk API so that everything goes in one request). It has been published to Maven Central, so it's pretty straightforward to pull in.
As the other folks have already mentioned, the best way to do it would be to save logs to a file and then ship them to ES separately. However, I think there is value in this if you need to get something running quickly, until you have the time/resources to implement the optimal way.

Is it possible to cache the application jar when running Apache Spark applications on a cluster?

I've got an Apache Spark MLlib Java application that should be run on a cluster a lot of times with different input values. Is it possible to cache the application jar on the cluster and reuse it to reduce startup time, network load and coupling of components?
Does the used cluster manager make any difference?
If the application jar is cached, is it possible to use the same RDD caches in different instances of my application?
Vanilla Spark is not able to do this (at the time of writing; Spark is evolving fast).
There's a Spark-JobServer contributed by Ooyala that exactly fulfills your needs. It keeps a registry of jars for subsequent job submissions and provides additional facilities to cache RDDs by name. Note that on a Spark cluster, the Spark-JobServer acts as the Spark driver; the workers will still need to load the jars from the driver when they execute a given task.
See docs here: https://github.com/ooyala/spark-jobserver
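The basic flow (a sketch based on the README above, assuming the default port 8090; the jar, app name and class path are placeholders, and the job class has to implement the jobserver's SparkJob interface rather than a plain main method) is to upload the jar once and then submit jobs against it by name:
curl --data-binary @target/my-mllib-app.jar localhost:8090/jars/my-mllib-app
curl -d "input.param = 42" 'localhost:8090/jobs?appName=my-mllib-app&classPath=com.example.MyMLlibJob'
Because the jar stays registered on the server, repeated runs with different input values skip the upload step entirely.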

Running MapReduce job periodically without Oozie?

I have a MapReduce job packaged as a jar that should be run daily. Also, I need to run this jar from a remote Java application. How can I schedule it, i.e. how do I run the job daily from my remote Java application?
I read about Oozie, but I don't think it is apt here.
Take a look at Quartz. It enables you to run standalone Java programs or run inside a web or application container (like JBoss or Apache Tomcat). There is good integration with Spring, and with Spring Batch in particular.
Quartz can be configured outside of the Java code, in XML, and the syntax is exactly like in crontab. So I found it very handy.
Some examples can be found here and here.
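A minimal sketch of a daily Quartz 2.x schedule (the class, the 02:00 cron expression and the body of execute are illustrative; the execute method would call your MR driver or shell out to hadoop jar):
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class DailyMapReduceJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // Kick off the MR job here, e.g. call your driver's run() method
        // or launch "hadoop jar ..." with a ProcessBuilder.
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(DailyMapReduceJob.class)
                .withIdentity("dailyMrJob")
                .build();
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("dailyMrTrigger")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?")) // every day at 02:00
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start(); // keeps running and fires the trigger daily
    }
}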
I am not clear about your requirement. You can use ssh command execution libraries in your program.
SSH library for Java
If you are running your program in a Linux environment, you can set up a crontab entry for periodic execution.
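If the remote trigger is the part you need, an SSH library such as JSch can run hadoop jar on the cluster's edge node from your Java application. A sketch with hypothetical host, credentials and paths:
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class RemoteJobTrigger {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("hadoop", "edge-node.example.com", 22); // hypothetical user/host
        session.setPassword("secret");                    // or jsch.addIdentity(...) for key-based auth
        session.setConfig("StrictHostKeyChecking", "no"); // relaxed host-key checking, for the sketch only
        session.connect();

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand("hadoop jar /opt/jobs/my-mr-job.jar com.example.MyDriver /input /output");
        channel.connect();
        while (!channel.isClosed()) {
            Thread.sleep(500);                            // wait for the remote command to finish
        }
        System.out.println("Remote job exit status: " + channel.getExitStatus());
        channel.disconnect();
        session.disconnect();
    }
}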
If your Java program is what triggers the jar, then you should schedule the Java program rather than the jar itself. And if they are separate, you can schedule the jar in an Oozie workflow, with the Java code execution as the first step and the jar execution as the second step.
In Oozie, you can pass parameters from one step to the next as well. Hope this helps.
-Dipika Harwani
