Hadoop MapReduce producing log files only after certain interval? - java

I have been trying to run a Hadoop MapReduce job for a while, and now that it runs successfully (all errors sorted out), I wanted to check the log files for the stdout they capture. However, the log files aren't being generated every time: sometimes they appear, sometimes they don't.
I was using one output directory (/user/hduser/output_dir) and deleting its contents before reusing it (to avoid keeping track of many output directories), but the log files show a last-modified time that does not match the last time I ran the job.
Also, the log files in /user/hduser/output_dir do not match those in $HADOOP_HOME/logs/userlogs.
Is this a known problem, and is there a solution? I did not find an answer to it anywhere.
Thanks for the help!
EDIT - We found out that the log files are written only after a certain interval of time, so if a job is run twice within that interval, no new log files are written for the second run. Why would this be, and how can I override it with a configuration change, if that's possible?

Related

Logging from two simultaneous processes

I have written an application in Java and deployed it on a Unix server.
I have implemented logging in my app, and the logs are written to a file, say X.log.txt.
If I run multiple instances of my jar as different users, or as a single user in different sessions: is there a chance that my logs in X.log.txt get mixed, or will they be written in FCFS order?
Example: let P1 and P2 be two processes that are running the Java app and generating logs.
P1 and P2 are writing their individual logs to X.log.txt at the same time. Is this statement true? Or is it entirely based on the CPU scheduling algorithm (FCFS, SJF, etc.)?
Even if I don't use timestamping, it's working fine for me.
When I execute them, the logs are generated one after the other: for a particular instance all the logs are written to the file, and then the next instance's logs follow. My question is still open: is this all based on how our processor is set up to handle jobs, or is it something else?
If two processes write to the same log file, the data will get randomly corrupted. You will get lines cut in the middle and finishing with data from the other log. You can even end up with large chunks of binary zeroes in various places in the file, depending on the OS (and on some OSes, it will simply fail to write to the same file from two places at the same time).
Write to separate files and then join/browse them with third-party tools to get a timestamp-ordered view.
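That timestamp-ordered merge can be done in a few lines of Java rather than with a third-party tool; a minimal sketch, assuming each log line begins with a lexicographically sortable timestamp such as ISO-8601 (the file names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LogMerger {
    // Merges per-process log files into one timestamp-ordered view.
    // Assumes each line begins with a sortable timestamp, e.g.
    // "2013-04-02T12:00:01.123 ..." (ISO-8601 sorts lexicographically).
    public static List<String> merge(String... logFiles) throws IOException {
        List<String> all = new ArrayList<>();
        for (String f : logFiles) {
            all.addAll(Files.readAllLines(Paths.get(f)));
        }
        all.sort(null); // natural (lexicographic) order == chronological here
        return all;
    }
}
```

Because ISO-8601 timestamps sort lexicographically, a plain string sort restores chronological order across both files.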
If both of your processes write to the same directory and file path, you will get some odd behaviour. Depending on the implementation, both applications will write to the file at the same time, or one application will block the other from writing at all.
My suggestion would be to generate the log file's name at runtime and append something unique, like a timestamp or a PID (process ID), so there's no more conflict:
X.log.[PID].txt or X.log.[TIMESTAMP].txt
NOTE: You'll have to use a fine enough resolution in the timestamp (milliseconds or nanoseconds) to avoid a name collision.
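A minimal sketch of that naming scheme (the `X.log` base name comes from the question; `ProcessHandle` requires Java 9+):

```java
public class LogFileNamer {
    // Builds a per-process log file name so that concurrent instances
    // never share a file. Combining the PID with a millisecond timestamp
    // makes a collision on one host effectively impossible.
    public static String uniqueLogName(String base) {
        long pid = ProcessHandle.current().pid();   // Java 9+
        long ts = System.currentTimeMillis();       // millisecond resolution
        return base + "." + pid + "." + ts + ".txt";
    }
}
```

For example, `uniqueLogName("X.log")` yields something like `X.log.4242.1700000000000.txt`.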

How to poll a directory and not hit a file-transfer race condition?

I am working on an application that polls a directory for new input files at a defined interval. The general process is:
Input files FTP'd to landing strip directory by another app
Our app wakes up
List files in the input directory
Atomic-move the files to a separate staging directory
Kick off worker threads (via a work-distributing queue) to consume the files from the staging directory
Go back to sleep
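The atomic-move step in the cycle above can be sketched with `java.nio.file` (directory names are illustrative, not from the original post):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class InputPoller {
    // One poll cycle: list the landing-strip directory, atomically move
    // each file into staging, then hand it to a worker (elided here).
    static void pollOnce(Path landing, Path staging) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(landing)) {
            for (Path f : files) {
                Path target = staging.resolve(f.getFileName());
                // ATOMIC_MOVE fails (rather than silently copying) if the
                // two directories are on different file systems.
                Files.move(f, target, StandardCopyOption.ATOMIC_MOVE);
                // enqueue(target) for a worker thread here
            }
        }
    }
}
```

Note that the atomic move only protects against the *consumers* seeing a half-moved file; it does nothing about a file that is still being FTP'd into the landing strip, which is exactly the problem described here.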
I've uncovered a problem where the app will pick up an input file while it is incomplete and still in the middle of being transferred, resulting in a worker thread error, requiring manual intervention. This is a scenario we need to avoid.
I should note that the file transfer does complete successfully and the server gets a complete copy, but this happens only after the app has already given up due to an error.
I'd like to solve this in a clean way, and while I have some ideas for solutions, they all have problems I don't like.
Here's what I've considered:
Force the other apps (some of which are external to our company) to initially transfer the input files to a holding directory, then atomic-move them into the input directory once they're transferred. This is the most robust idea I've had, but I don't like this because I don't trust that it will always be implemented correctly.
Retry a finite number of times on error. I don't like this because it's a partial solution: it makes assumptions about transfer time and file size that could be violated, and it blurs the line between a genuinely bad file and one that has merely been incompletely transferred.
Watch the file sizes and only pick up the file if its size hasn't changed for a defined period of time. I don't like this because it's too complex in our environment: the poller is a non-concurrent clustered Quartz job, so I can't just persist this info in memory because the job can bounce between servers. I could store it in the jobdetail, but this solution just feels too complicated.
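For what it's worth, the size-watching idea itself is only a few lines; a minimal in-memory sketch (as the post notes, a non-concurrent clustered Quartz job would have to persist this state, e.g. in the JobDataMap, for it to survive the job bouncing between servers):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class QuiescenceChecker {
    // Accepts a file only once its size has been unchanged for quietMillis.
    // Keys are paths; values are {last observed size, when that size was first seen}.
    private final Map<Path, long[]> seen = new HashMap<>();
    private final long quietMillis;

    QuiescenceChecker(long quietMillis) { this.quietMillis = quietMillis; }

    boolean isStable(Path file, long now) throws IOException {
        long size = Files.size(file);
        long[] prev = seen.get(file);
        if (prev == null || prev[0] != size) {
            seen.put(file, new long[] { size, now }); // new file, or still growing
            return false;
        }
        return now - prev[1] >= quietMillis;
    }
}
```

The `now` parameter is injected rather than read from the clock so the logic is testable; a real poller would pass `System.currentTimeMillis()`.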
I can't be the first to have encountered this problem, so I'm sure I'll get better ideas here.
I had that situation once; we got the other guys to upload the files with a different extension, e.g. *.tmp, and then, once the copy is complete, rename the file to the extension my code polls for. I'm not sure that's as easily done when the files come in by FTP, though.
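On the poller's side, this convention only requires matching the final extension, so half-transferred *.tmp files stay invisible; a minimal sketch using a glob filter (the `.csv` extension is illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtensionFilterPoller {
    // Senders upload "name.tmp" and rename to "name.csv" when done.
    // Listing with a glob on the final extension means the poller can
    // never observe a file that is still mid-transfer.
    static DirectoryStream<Path> completedFiles(Path landing) throws IOException {
        return Files.newDirectoryStream(landing, "*.csv");
    }
}
```

This works because a rename within one directory is atomic on POSIX filesystems, so a file is either invisible to the glob or fully transferred, never in between.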

Java properties file under low disk space conditions on Linux

I unwittingly ran a Java program on Linux where the partition holding the program, its properties, and its logs was close to 100% full. It failed, but after clearing up the disk-space problem I ran it again, and it failed a second time because its properties file was now 0 bytes long.
I don't have the source of the program and I don't want to go as far as decompiling the class files, but I was wondering whether the corruption of the properties could be because the program failed while writing to the properties file.
The mystery is that I expected the properties to be read-only and don't recall any items being updated by the program. Could it be that, even if the properties are only read, the file is opened in read-write mode and could disappear if the partition is full?
N.B. this program has run without failure or incident at least 1000 times over several years.
I don't have the source of the program and I don't want to go as far as decompiling the class files, but I was wondering whether the corruption of the properties could be because the program failed while writing to the properties file.
That is the most likely explanation. There would have been an exception, but the application could have squashed it ... or maybe you didn't notice the error message. (Indeed, if the application tried to log the error to a file, that would most likely fail too.)
Could it be that, even if the properties are only read, the file is opened in read-write mode and could disappear if the partition is full?
That is unlikely to be the answer in this case. Unlike in many languages, in Java the code for reading a file and the code for writing one involve different stream classes. It is hard to imagine how or why the application's developer would open a properties file for writing (with truncation) if there was never any intention to write to it.
The most plausible explanation is that the application does update the properties file. (Try installing the program again, using it, stopping it, and looking at the property file's modification timestamp.)
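To illustrate the read-versus-write distinction: loading a properties file read-only never opens it for truncation, so a full disk cannot zero it out on the read path. A minimal sketch:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ReadOnlyProps {
    // FileInputStream opens the file strictly for reading; the file can
    // only be truncated to 0 bytes if the app instead opens a
    // FileOutputStream (e.g. via Properties.store) and the subsequent
    // write fails -- which is the disk-full scenario described above.
    static Properties load(String path) throws IOException {
        Properties p = new Properties();
        try (InputStream in = new FileInputStream(path)) {
            p.load(in);
        }
        return p;
    }
}
```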
N.B. this program has run without failure or incident at least 1000 times over several years.
And I bet this is the first time you've run it while the disk was full :-)

Configuring Hadoop logging to avoid too many log files

I'm having a problem with Hadoop producing too many log files in $HADOOP_LOG_DIR/userlogs (the ext3 filesystem allows only 32,000 subdirectories), which looks like the same problem as in this question: Error in Hadoop MapReduce
My question is: does anyone know how to configure Hadoop to roll the log dir or otherwise prevent this? I'm trying to avoid just setting the "mapred.userlog.retain.hours" and/or "mapred.userlog.limit.kb" properties because I want to actually keep the log files.
I was also hoping to configure this in log4j.properties, but looking at the Hadoop 0.20.2 source, it writes directly to logfiles instead of actually using log4j. Perhaps I don't understand how it's using log4j fully.
Any suggestions or clarifications would be greatly appreciated.
Unfortunately, there isn't a configurable way to prevent that. Every task for a job gets one directory in history/userlogs, which will hold the stdout, stderr, and syslog task log output files. The retain hours will help keep too many of those from accumulating, but you'd have to write a good log rotation tool to auto-tar them.
We had this problem too when we were writing to an NFS mount, because all nodes would share the same history/userlogs directory. This means one job with 30,000 tasks would be enough to break the FS. Logging locally is really the way to go when your cluster actually starts processing a lot of data.
If you are already logging locally and still manage to process 30,000+ tasks on one machine in less than a week, then you are probably creating too many small files, causing too many mappers to spawn for each job.
I had this same problem. Set the environment variable "HADOOP_ROOT_LOGGER=WARN,console" before starting Hadoop.
export HADOOP_ROOT_LOGGER="WARN,console"
hadoop jar start.jar
Configuring Hadoop to use log4j and setting
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
as described on this wiki page doesn't work?
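For context, a minimal `log4j.properties` sketch of the size-based rollover those two settings belong to (the `FILE_AP1` appender name is taken from the snippet above; the file path is illustrative and assumes the daemon actually routes its logging through log4j):

```properties
log4j.rootLogger=INFO, FILE_AP1

log4j.appender.FILE_AP1=org.apache.log4j.RollingFileAppender
log4j.appender.FILE_AP1.File=${hadoop.log.dir}/hadoop.log
# Roll at 100 MB, keeping at most 10 rolled files (hadoop.log.1 .. hadoop.log.10)
log4j.appender.FILE_AP1.MaxFileSize=100MB
log4j.appender.FILE_AP1.MaxBackupIndex=10
log4j.appender.FILE_AP1.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE_AP1.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

As the question notes, though, this only governs the daemon logs; the per-task stdout/stderr/syslog files under userlogs are written directly and are not controlled by these appender settings.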
Looking at the LogLevel source code, it seems Hadoop uses Commons Logging, which tries to use log4j by default, or the JDK logger if log4j is not on the classpath.
By the way, it's possible to change log levels at runtime; take a look at the commands manual.
According to the documentation, Hadoop uses log4j for logging. Maybe you are looking in the wrong place ...
I also ran into the same problem... Hive produces a lot of logs, and when the node's disk is full, no more containers can be launched. In YARN, there is currently no option to disable logging. One particularly huge file is the syslog file, which in our case generated GBs of logs in a few minutes.
Configuring the property yarn.nodemanager.log.retain-seconds in "yarn-site.xml" to a small value does not help. Setting "yarn.nodemanager.log-dirs" to "file:///dev/null" is not possible, because a directory is needed. Removing the write right (chmod -r /logs) did not work either.
One solution could be to use a "null blackhole" directory. Check here:
https://unix.stackexchange.com/questions/9332/how-can-i-create-a-dev-null-like-blackhole-directory
Another solution that works for us is to disable logging before running the jobs. For instance, in Hive, starting the script with the following lines works:
set yarn.app.mapreduce.am.log.level=OFF;
set mapreduce.map.log.level=OFF;
set mapreduce.reduce.log.level=OFF;

log4j Rolling file appender - multi-threading issues?

Are there any known bugs with the Log4j rolling file appender? I have been using log4j happily for a number of years but was not aware of this. A colleague of mine is suggesting that there are known issues (and I found a Bugzilla entry on this) where, under heavy load, the rolling file appender (we use the time-based one) might not perform correctly when the rollover occurs at midnight.
Bugzilla entry - https://issues.apache.org/bugzilla/show_bug.cgi?id=44932
I'd appreciate input and pointers on how others have overcome this.
Thanks,
Manglu
I have not encountered this issue myself, and from the bug report I would suspect that it is very uncommon. The Log4j RollingFileAppender has always worked in a predictable and reliable fashion for the apps I have developed and maintained.
This particular bug, if I understand correctly, would only happen if there are multiple instances of Log4j, for example if you had multiple instances of the same app running simultaneously and writing to the same log file. Then, when it is rollover time, one instance cannot get a lock on the file in order to delete it and archive its contents, resulting in the loss of the data that was to be archived.
I cannot speak to any of the other known bugs your colleague mentioned unless you would like to cite them specifically. In general, I believe Log4j is reliable for production apps.
@kg, this happens to me too. This exact situation: two instances of the same program.
I updated it to the newer rolling.RollingFileAppender instead of using DailyRollingFileAppender (or whatever it was called).
I run two instances simultaneously via crontab. Each instance outputs as many messages as it can until 5 seconds have elapsed. They measure the time using System.currentTimeMillis and accumulate it in a counter to approximate the 5-second loop, so there is minimal overhead in this test. Each log message contains an incrementing number, plus identifiers set from the command line so the two instances can be told apart.
Putting the log messages back in order shows that one of the processes succeeds in writing its sequence from start to end, while the other loses the first entries of its output (from 0 onward).
This really ought to be fixed...
