How can I consolidate Hadoop logs? (Java)

I currently have a program that, among other things, produces log output. The problem is that the mappers all seem to log to wherever they happen to be running. I want all of this log output to end up in a single file. I am using log4j to log information.
I was thinking it might be possible to somehow stream that data as a string from the Mapper back to the main function and log it there. Is something like this possible? Is there a better way to consolidate logs?

Each map or reduce task's log is written to the local file system of the task tracker node on which it was executed. The logs go to a 'userlog' directory, defined by HADOOP_LOG_DIR, with a subdirectory named after the task attempt ID.
These log files can be accessed through the job tracker web GUI: each task detail page has a link you can click to view the log content. But in stock Apache Hadoop, to the best of my knowledge, there is no tool or mechanism to consolidate logs from different nodes.
Some vendors have their own utilities for this purpose. I came across this one recently from MapR, which looks like a good solution, though I have not tried it myself: http://doc.mapr.com/display/MapR/Centralized+Logging
I am not sure whether the idea of picking up the logs and feeding them into a single map task will work, because you would need to know the task and attempt ID inside the task's Java class in order to pick up its own log file after it has completed.

OK, I ended up solving this problem by using MultipleOutputs. I set up one stream for the actual output and one for the log results my code produced. This allowed me to log things manually by sending output through the log output stream. I then had a script consolidate the log files into a single file. The automatically generated logs stayed where they originally were, but this solution has allowed me to send the log messages I put into my code to a single location.
Example log statement using MultipleOutputs:
mout.write("logOutput", new Text("INFO: "), new Text("reducertest"), "templogging/logfile");

Related

Is it possible to retrieve all log messages, store them in a file, or show them at the end of an ETL job?

Introduction :
I work with a tool called Jedox. This tool includes an ETL. The ETL can run jobs (a job executes multiple steps from the ETL; by step I mean a set of operations on a table, most of the time). Some jobs just launch different ETL steps in succession, but another type of job can run scripts. I am trying to run a job that uses the Groovy language, which is very similar to Java; the two languages share many similarities.
Context :
During the run, the ETL shows log messages on a console. I can also print messages myself, e.g. LOG.info("hello") will print "hello" in the Jedox ETL console. LOG is an object of the ETLLogger class, which is specific to the Jedox library.
Question :
With this logger, how can I get the log messages printed during the job's run?
With another logger (from log4j, for example), is it possible to get ALL the log messages printed during a process?
Goal :
I want to print all the warnings that happened during the job at the END of the job, because the Jedox console is very glitchy and I can't retrieve decent data from a simple copy and paste. Furthermore, if I just copy/paste all the logs, I have to select the warning messages manually. And if possible, I would like to write the logs that interest me to a file; that would be great!
Bonus :
Here's a visual of Jedox ETL Console.
The Jedox ETL Integrator comes with its own API, where the logger is an additional class.
If you are working in the Groovy job environment, you have full access to all of its classes and methods. In your case the State class would be helpful for getting all warnings or errors of your current job/subjob. You can write the state items to a separate file or collect them in a variable for further use, for example. The target can be a file, a database, or even a Jedox OLAP cube.
Starting Point of the Groovy Scripting API:
https://knowledgebase.jedox.com/integration/integrator-scripting-api.htm
Here is the API documentation of all Jedox ETL classes:
https://software.jedox.com/doc/2022_2/jedox_integrator_scripting_api_doc_2022_2/com/jedox/etl/components/scriptapi/package-summary.html
How do I store all those logs in a file?
In Jedox you can create a new File-Load, which can be triggered by the 'main' Groovy job.
A good point to start is here:
https://knowledgebase.jedox.com/integration/loads/file-load.htm

Java: Updating a .txt file as the program runs and being able to see the change

I have to run a Java program that needs to keep track of the transactions the user makes. I need to log these transactions in a .txt file.
Everything is working well with my code, except that I cannot see the .txt file (it is not created) until the program closes.
The goal for our project is to be able to see this file get updated live as the program is running. The user completes Order #1, the transactions of that order get logged in the .txt file, and one can see the changes right away, while the program is still running. The user completes Order #2 and the transactions of that order are appended to the .txt file, again while the program is running.
I am using:
PrintWriter out = new PrintWriter(new FileWriter("log.txt", true));
// ... writes the order's lines to the file ...
out.flush();
out.close();
This code is within a method that gets called every time the user finishes his or her order. As soon as the order is finished, the log.txt file should reflect the changes right away, without the program quitting. I have spent hours searching for how to do this but have not succeeded. I am also relatively new to Java and programming, so any guidance is greatly appreciated.
Thank you.
Have you looked at the standard logging framework for Java (slf4j)? It's an API that is pretty much ubiquitous, and there are many very good implementations, like logback or log4j. Let those worry about writing to files. Program to an interface (the slf4j interface, namely) and copy-paste (if you don't want to do anything fancy) some XML configuration for the logger implementation from the internet.
You would not have to open files, or flush and close them. Your code would be:
log.info("something happened");
Read up on this topic; there are practically no serious Java projects that don't have a logging element to them. Invest some time into learning this framework once, and you can use it forever.
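A minimal sketch of what that looks like, assuming slf4j-api plus a binding such as logback is on the classpath (the class and method names here are just examples, not from the question):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderLogger {
    private static final Logger log = LoggerFactory.getLogger(OrderLogger.class);

    // hypothetical method called whenever the user finishes an order
    public void orderCompleted(int orderNumber) {
        // the binding's configuration decides the file name, append mode, flushing, rolling, ...
        log.info("Order #{} completed", orderNumber);
    }
}

The file itself (name, append mode, when it is flushed) is then configured once in logback.xml or the log4j configuration rather than in the code.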
Most likely your buffer is written out to the file only when you invoke the flush method; the buffer waits to accumulate some data before performing the write operation.

Retry processing of failed files in Hadoop

I have a Hadoop Map-Reduce program which is nothing but a simple file-processing job. Each mapper has one file as its input.
My reducer part is empty. All the work is done in the map() method.
The problem that I'm facing now is this:
If the file-processing operation fails inside a mapper, I'm not able to tell my Hadoop job to wait some constant number of seconds before retrying the same failed file and moving ahead.
Is there any way/configuration to specify this?
Let me know if anyone has encountered such a use case.
I think you should try writing the bad records to a different file, based on your logic, using multiple outputs. For multiple outputs you can follow this link: Multiple output link
If you follow this approach you can separate bad records and good records based on your logic in the map method, and your job will not fail. Using multiple outputs you can write the bad records to a separate file and analyse them later. This way you can ensure your job does not fail because of bad records while your good records are processed properly.
You can also look into this link Counters to detect bad records, to figure out how many bad records you actually have. I hope this helps.
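A rough sketch of that idea inside map(); the named output "badrecords", the counter enum and the process() helper are made up for illustration, and mout is assumed to be a MultipleOutputs field opened in setup() and closed in cleanup():

// illustrative only: imports/fields omitted; mout is a MultipleOutputs<LongWritable, Text>
public enum RecordCounter { BAD_RECORDS }

protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    try {
        context.write(key, process(value));                 // normal processing
    } catch (Exception e) {
        // don't fail the task: route the record to a side file and count it
        mout.write("badrecords", key, value, "badrecords/part");
        context.getCounter(RecordCounter.BAD_RECORDS).increment(1);
    }
}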

how to monitor a file in java while it is being updated by a shell script

I have a log file that is going to be updated by a shell script. This shell script performs a number of operations and updates the file after each operation, saying the operation has finished. Now, I need to 'listen' to this file from a servlet and send responses back to the end user in the same fashion as the logging happens (i.e. operation A finished, operation B finished, and so on). If both the servlet and the shell script try to open the file at the same time, I am sure I will get some error. In Java I guess I can handle it as an IOException and keep trying to read the file, so that it works when the shell script is not updating the file. How should I handle this in the shell script? Will it help if I open the file in read-only mode in Java? Also note that the shell script only writes and doesn't read, and the servlet only reads and doesn't write.
Also, suggestions welcome on a better way of implementing this workflow.
Are you using Java 7? If so, then maybe the new Watcher service would work for you. I haven't personally used it, but the idea is that you get notifications in your code when a file/folder has changed. This might make your code cleaner than simply polling a file repeatedly.
http://docs.oracle.com/javase/7/docs/api/java/nio/file/WatchService.html
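A rough fragment of how the servlet side could use it, watching the directory the shell script writes into; the path and file name are placeholders, and the enclosing method has to declare IOException and InterruptedException:

import java.nio.file.*;

Path dir = Paths.get("/var/log/myscript");                  // placeholder directory
WatchService watcher = FileSystems.getDefault().newWatchService();
dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

while (true) {
    WatchKey key = watcher.take();                           // blocks until something in dir changes
    for (WatchEvent<?> event : key.pollEvents()) {
        if ("operations.log".equals(event.context().toString())) {
            // re-read the new lines of the file and push them to the client
        }
    }
    key.reset();
}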
Since you can't use the WatchService, you could poll the file's last modification time using file.lastModified().
If you do this periodically, you can compare the results, and if they changed, the file was modified by the shell script. It might be necessary to create a new File object every time you poll, but since the file isn't opened for reading at all, no access problems will occur.
However, even if you open the file and compare its contents, you should not experience any access problems, unless your shell opens the file with exclusive access.
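A bare-bones version of that polling loop; the path, interval and keepPolling flag are arbitrary placeholders, and the caller has to handle the InterruptedException from Thread.sleep:

long lastSeen = 0L;
while (keepPolling) {                                            // flag controlled by the servlet
    File logFile = new File("/var/log/myscript/operations.log"); // new object each poll, as described above
    long modified = logFile.lastModified();                      // 0 if the file does not exist yet
    if (modified != lastSeen) {
        lastSeen = modified;
        // the script appended something: re-read the file (read-only) and send the new lines
    }
    Thread.sleep(2000);                                          // poll every couple of seconds
}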

Best way to interact with application logs output from log4j (while also are being updated by application)

Maybe it is simpler than I think, but I am confused about the following:
I want to be able to present to a user (in a graphical interface) the logs produced by Log4j.
I could just read the files as they are and present them, but I was wondering if there is a standard way to do it, so as to also get any updates that happen at the same time from other parts of the application that log concurrently.
The log4j files could be multiple, i.e. a rolling appender.
Also, the presentation could happen while there is no logging going on, i.e. a view of the logs up to date.
UPDATE:
I am constrained to Java 6.
You can use Java 7's NIO2 libraries to get notified when one of multiple files gets modified in a directory, and re-read and display it:
http://blogs.oracle.com/thejavatutorials/entry/watching_a_directory_for_changes
Have you tried the following tools:
Chainsaw
Xpolog
Perhaps add a database appender (JDBCAppender) and present the log entries from that?
From the official documentation of log4j:
Is there a way to get log4j to automatically reload a configuration file if it changes?
Yes. Both the DOMConfigurator and the PropertyConfigurator support automatic reloading through the configureAndWatch method. See the API documentation for more details.
PropertyConfigurator#configureAndWatch
DOMConfigurator#configureAndWatch
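In code that is a single call at startup, e.g. for a properties file (the file name and delay below are placeholders):

import org.apache.log4j.PropertyConfigurator;

// re-check log4j.properties every 60 seconds and reload it if it changed
PropertyConfigurator.configureAndWatch("log4j.properties", 60000);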
For an on-demand reload of the log4j config using a GUI, I would suggest exposing it via a servlet in your J2EE application so that the whole file can be edited in a web page (a text area, maybe); once saved, you can overwrite your existing log4j file and reload the log4j config.
Maybe you could think about a more "OS-level" solution.
I don't know if you are using Windows or Linux, but on Linux there is this really nice command, "tail".
So you could use ProcessBuilder to create an OS process that runs something like "tail -f yourLogFile.txt".
Then read the standard output of the returned Process (via getInputStream()). Reading the stream will block until new output from the process is available and will unblock immediately when it is, giving you immediate feedback and the possibility to read the latest changes to the log file.
However, you might have problems shutting this process down from Java.
You should be able to send a SIGTERM signal to it if you know the process id, or you could start a different process which looks up the id of the "tail" process and kills it via the "kill" command or something similar.
Also, I am not sure whether a similar tool is available on Windows, if that is your platform.
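Roughly like this; the path is a placeholder, and as said above, stopping the tail process cleanly is the awkward part:

import java.io.BufferedReader;
import java.io.InputStreamReader;

ProcessBuilder pb = new ProcessBuilder("tail", "-f", "/path/to/yourLogFile.txt");
pb.redirectErrorStream(true);
Process tail = pb.start();

BufferedReader reader = new BufferedReader(new InputStreamReader(tail.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {   // blocks until tail prints a new line
    // append the line to the UI's log view
}
// to stop: call tail.destroy() from another thread; the read then ends and the loop exits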
If you write your own simple appender and have your application include that appender in its log4j configuration, your appender will be called whenever events are written to the other appenders, and you can choose to display the event messages, timestamps, etc. in a UI.
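A minimal log4j 1.x appender along those lines; the hand-off to the UI is just a placeholder:

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

public class UiAppender extends AppenderSkeleton {

    @Override
    protected void append(LoggingEvent event) {
        String line = event.getLevel() + " " + event.getTimeStamp() + " " + event.getRenderedMessage();
        // hand "line" to whatever component renders the log view, e.g. a list model
    }

    @Override
    public void close() { }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}

Registering it is one line in the log4j configuration, or programmatically via Logger.getRootLogger().addAppender(new UiAppender()).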
Try the XpoLog log4j/log4net connector. It parses the data automatically and has a predefined set of dashboards for it.
Follow the steps below:
Download and install XpoLog from here
Add the log4j data using the log4j data connector from here and deploy the log4j app here
