I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the Ext3 filesystem allows only 32000 subdirectories) which looks like the same problem in this question: Error in Hadoop MapReduce
My question is: does anyone know how to configure Hadoop to roll the log dir or otherwise prevent this? I'm trying to avoid just setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties because I want to actually keep the log files.
I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it writes directly to log files instead of actually using log4j. Perhaps I don't fully understand how it uses log4j.
Any suggestions or clarifications would be greatly appreciated.
Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets one directory in history/userlogs, which will hold the stdout, stderr, and syslog task log output files. The retain hours will help keep too many of those from accumulating, but you'd have to write a good log rotation tool to auto-tar them.
We had this problem too when we were writing to an NFS mount, because all nodes would share the same history/userlogs directory. This means one job with 30,000 tasks would be enough to break the FS. Logging locally is really the way to go when your cluster actually starts processing a lot of data.
If you are already logging locally and still manage to process 30,000+ tasks on one machine in less than a week, then you are probably creating too many small files, causing too many mappers to spawn for each job.
I had this same problem. Set the environment variable "HADOOP_ROOT_LOGGER=WARN,console" before starting Hadoop.
export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar
Configuring Hadoop to use log4j and setting
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
as described on this wiki page doesn't work?
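For context, a complete appender definition around those two properties might look like the sketch below; the appender name FILE_AP1 comes from the lines above, but the log path and pattern layout are assumptions:

```properties
# Hypothetical log4j.properties fragment; file path and layout are assumed
log4j.rootLogger=INFO, FILE_AP1
log4j.appender.FILE_AP1=org.apache.log4j.RollingFileAppender
log4j.appender.FILE_AP1.File=/var/log/hadoop/hadoop.log
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
log4j.appender.FILE_AP1.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_AP1.layout.ConversionPattern=%d{ISO8601} %-5p %c: %m%n
```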
Looking at the LogLevel source code, it seems that Hadoop uses Commons Logging, which will use log4j by default, or the JDK logger if log4j is not on the classpath.
Btw, it's possible to change log levels at runtime; take a look at the commands manual.
According to the documentation, Hadoop uses log4j for logging. Maybe you are looking in the wrong place ...
I also ran into the same problem... Hive produces a lot of logs, and when the node's disk is full, no more containers can be launched. In YARN, there is currently no option to disable logging. One particularly huge file is the syslog file, generating GBs of logs in a few minutes in our case.
Configuring the property yarn.nodemanager.log.retain-seconds to a small value in "yarn-site.xml" does not help. Setting "yarn.nodemanager.log-dirs" to "file:///dev/null" is not possible because a directory is needed. Removing the write permission (chmod -w /logs) did not work either.
One solution could be to use a "null blackhole" directory. Check here:
https://unix.stackexchange.com/questions/9332/how-can-i-create-a-dev-null-like-blackhole-directory
Another solution that works for us is to disable logging before running the jobs. For instance, in Hive, starting the script with the following lines works:
set yarn.app.mapreduce.am.log.level=OFF;
set mapreduce.map.log.level=OFF;
set mapreduce.reduce.log.level=OFF;
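If you prefer to set these outside of Hive, the same properties can presumably go into mapred-site.xml to apply cluster-wide; this is a sketch, assuming the property names behave the same there (verify against your Hadoop version):

```xml
<!-- Sketch: disabling task logging cluster-wide in mapred-site.xml -->
<property>
  <name>yarn.app.mapreduce.am.log.level</name>
  <value>OFF</value>
</property>
<property>
  <name>mapreduce.map.log.level</name>
  <value>OFF</value>
</property>
<property>
  <name>mapreduce.reduce.log.level</name>
  <value>OFF</value>
</property>
```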
Related
I'm currently using a very basic logging configuration and I'm using the same configuration in all environments. For development it is beneficial to see the output in the console, so I've configured log4j with the following root categories:
log4j.rootCategory=INFO, console, file
When I deploy, I am only interested in the output that is directed to file and have configured each file to have maximum file size.
Is there any performance hit from logging to the console in production, where I have no use for it? Also, where does this output go on a Linux vs. a Windows machine when no console is available? What, if anything, do I gain by having separate configurations?
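For the last question, the main gain of a separate production configuration is that you can drop the console appender entirely rather than pay for formatting output nobody reads. A minimal sketch of what the production copy might look like, assuming the same appender names as above:

```properties
# Production log4j.properties: keep only the file appender, drop console
log4j.rootCategory=INFO, file
```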
There is an issue we are facing in our production environment.
The file generated using log4j gets some special characters appended at the start of the file, before logging begins.
This results in a binary file, which makes tools like Splunk unable to read these files, since they expect text files.
Please help me understand what the issue could be.
According to Google, my best guess is that you are using GC logs (JVM Garbage Collector logs) from what I read here: https://developer.jboss.org/message/529671#529671 and here: https://developer.jboss.org/thread/148848?tstart=0&_sscc=t.
It seems that there is no real solution, except maybe using the right combination of ASCII encoding + right locale, according to the pages previously linked.
Since you said in your question that you have this problem in a production environment, I suggest simply disabling GC logs there, because you should not run with them in production (enabling GC logs has a performance/storage impact). In your JVM start options, look for something like -XX:+PrintGC or -verbose:gc.
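For reference, GC logging is usually switched on by one or more options along these lines in the JVM start script; the log path here is an assumption:

```
-verbose:gc
-XX:+PrintGC
-XX:+PrintGCDetails
-Xloggc:/var/log/app/gc.log
```

Removing whichever of these appear from the production start options disables GC logging entirely.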
I am facing a problem where my M/R job tasks fail with heap space errors, while job configuration lists 2 properties that should affect that:
mapred.child.java.opts=-Xmx200m
mapreduce.map.java.opts=-Xmx2048m
The first one is a legacy configuration that ended up in job.xml who knows how, and the second one is my job configuration. I was using Oozie to submit the job and it somehow put both configurations in...
Is there a way to check which of these two configurations was actually used for the Java opts of my map task? Is there a log, perhaps, or a way to profile the JVM in order to see it? I need to know this to rule out the possibility of a bug in my code causing the heap space error.
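One way to check which value actually won, independent of what job.xml shows, is to log the JVM's own view from inside the map task. The probe below is plain Java, not Hadoop-specific: it prints the -Xmx argument the JVM was launched with plus the effective max heap. (The class name is mine; calling it from a mapper's setup() and routing the output to a logger or stderr is the assumed usage.)

```java
import java.lang.management.ManagementFactory;

public class HeapProbe {
    /** Returns the -Xmx argument the JVM was started with, or null if none was given. */
    static String xmxArgument() {
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-Xmx")) {
                return arg;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The heap option as passed on the command line, if any
        System.out.println("JVM launched with: " + xmxArgument());
        // The effective maximum heap as the runtime sees it, in megabytes
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Runtime.maxMemory() = " + maxHeapMb + " MB");
    }
}
```

If the task logs report roughly 200 MB, the legacy mapred.child.java.opts won; around 2048 MB means mapreduce.map.java.opts took effect.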
There should be a job_xxx.conf.xml file in the MapReduce history server's directory on HDFS when the job completes. Otherwise, you can find this file in the user's HDFS directory under /user//.staging/jobid/ during runtime.
We have a few Java applications (jars) running as backend server applications on localhost. These programs run inside a virtual machine (RHEL 6.2).
After one of the jars had run for 5 days, it stopped working. No exceptions were thrown (we didn't see any output from the errors that could be caught in the catch block). To find out what caused this, we put in some println's and redirected output to a text file using the > operator on the command line in a shell script.
After about 4 or 5 days, we faced a situation where we could see that the jar was still running, but it wasn't outputting anything to the text file or to the database the application was supposed to write entries to.
Perhaps the text file became too large for the virtual machine to handle, but basically we wanted to know this:
How are such runtime problems located in Java? In C++ we have valgrind, Purify etc, but
1. are there such tools in Java?
2. How would you recommend we output println's without facing the extremely-large-textfile problem? Or is there a better way to do it?
Rather than printing to System.out, how about using a tool like log4j? Log4j allows for log file sizing, versioning, and purging.
see http://logging.apache.org/log4j/1.2/
You may also want to re-consider your server architecture.
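If pulling in log4j is not immediately an option, even the JDK's built-in java.util.logging can cap file size and keep a fixed number of rotated files. A minimal stdlib sketch (the file name pattern and limits are assumptions):

```java
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

public class RotatingLogDemo {
    public static void main(String[] args) throws Exception {
        // Keep at most 5 files of ~1 MB each: app.0.log, app.1.log, ...
        // When all are full, the oldest is overwritten, so disk use is bounded.
        FileHandler handler = new FileHandler("app.%g.log", 1024 * 1024, 5, true);
        handler.setFormatter(new SimpleFormatter());
        Logger logger = Logger.getLogger("app");
        logger.addHandler(handler);
        logger.info("rotating file logging instead of System.out redirection");
        handler.close();
    }
}
```

This avoids the unbounded-growth problem of redirecting stdout with >, since the total log footprint can never exceed limit x count.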
How are such runtime problems located in Java? In C++ we have
valgrind, Purify etc, but 1. are there such tools in Java?
There are a lot of Java profilers available, and a few are free as well. There is one called VisualVM, which comes with the Java distribution. You can attach your process to a profiler, but profilers will only help you find certain problems, such as memory leaks or CPU-intensive tasks.
How would you recommend we output println's without facing the extremely-large-textfile problem? Or is there a better way to do it?
System.out is not a good way to deal with this problem. Loggers such as log4j provide a robust and easy-to-use API. Log4j also provides an easy way to configure rollover of your log files, among other features.
Are there any known bugs with the Log4j rolling file appender? I have been using log4j happily for a number of years but was not aware of this. A colleague of mine suggests that there are known issues (and I found a Bugzilla entry on this) where, under heavy load, the rolling file appender (we use the time-based one) might not perform correctly when the rollover occurs at midnight.
Bugzilla entry - https://issues.apache.org/bugzilla/show_bug.cgi?id=44932
Appreciate inputs and pointers on how others overcome this.
Thanks,
Manglu
I have not encountered this issue myself, and from the bug report I would suspect that it is very uncommon. The Log4j RollingFileAppender has always worked in a predictable and reliable fashion for the apps I have developed and maintained.
This particular bug, if I understand correctly, would only happen if there are multiple instances of Log4j, for example multiple instances of the same app running simultaneously and writing to the same log file. Then, when it is rollover time, one instance cannot get a lock on the file in order to delete it and archive its contents, resulting in the loss of the data that was to be archived.
I cannot speak to any of the other known bugs your colleague mentioned unless you would like to cite them specifically. In general, I believe Log4j is reliable for production apps.
#kg, this happens to me too, in this exact situation: 2 instances of the same program.
I updated it to the newer rolling.RollingFileAppender instead of using DailyRollingFileAppender.
I run two instances simultaneously via crontab. The instances output as many messages as they can until 5 seconds have elapsed. They measure time with System.currentTimeMillis and accumulate it in a counter to approximate the 5-second window for the loop, so there is minimal overhead in this test. Each output log message contains an incrementing number, plus identifiers set from the command line so the two instances can be told apart.
Putting the log messages back in order shows that one of the processes succeeds in writing its sequence from start to end, while the other loses the first entries of its output (from 0 onward).
This really ought to be fixed...