Retry processing of failed files in Hadoop - java

I have a Hadoop MapReduce program which is nothing but a simple file processing job. Each mapper takes one file as its input.
My reducer part is empty; all the work is done in the map() method.
The problem I'm facing is this:
If the file processing operation fails inside a mapper, I'm not able to tell my Hadoop job to wait for some constant number of seconds before retrying the same failed file and then moving ahead.
Is there any way or configuration to specify this?
Let me know if anyone has encountered such a use case.

I think you should try writing the bad records to a different file, based on your logic, using multiple outputs. For multiple outputs you can follow this link: Multiple output link
If you follow this approach, you can separate bad records from good records in your map() method based on your own logic, and your job will not fail. Using multiple outputs you can write the bad records to a separate file and analyse them later. This way you ensure that your job does not fail because of bad records while the good records are processed properly.
You can also look at this link about Counters to figure out how many bad records you actually have. I hope this helps.
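As a rough sketch of that idea, assuming the new org.apache.hadoop.mapreduce API and text input (the named output "badrecords" and the isValid() check are placeholders for your own logic):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FileProcessingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (isValid(value)) {
            // good record: goes to the normal job output
            context.write(NullWritable.get(), value);
        } else {
            // bad record: routed to a named output instead of failing the task
            mos.write("badrecords", NullWritable.get(), value, "bad/part");
            // counter lets you check afterwards how many bad records there were
            context.getCounter("Records", "BAD").increment(1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }

    private boolean isValid(Text value) {
        return true; // placeholder for your own validation logic
    }
}

The driver also needs MultipleOutputs.addNamedOutput(job, "badrecords", TextOutputFormat.class, NullWritable.class, Text.class); so that the named output exists.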

Related

Is it possible to retrieve all log messages, store them into a file, or show them at the end of an ETL job?

Introduction:
I work with a tool called Jedox. This tool contains an ETL, and the ETL can run jobs (a job executes multiple steps from the ETL; by "step" I mean a set of operations, most of the time on a table). Some jobs just launch different ETL steps in succession, but there is another type of job that can run a script. I am trying to run a job that uses the Groovy language. Groovy is very similar to Java; the two languages share many similarities.
Context:
During a run, the ETL shows log messages on a console. I can also print some messages myself, e.g. LOG.info("hello") will print "hello" in the Jedox ETL console. LOG is an object of the class ETLLogger, a specific class from the Jedox library.
Question:
With this logger, how can I get the log messages printed during the job's run?
With another logger (from log4j, for example), is it possible to get ALL log messages printed during a process?
Goal:
I want to print all the warnings that happened during the job at the END of the job, because the Jedox console is very glitchy and I can't retrieve decent data from a simple copy and paste. Furthermore, if I just copy/paste all the logs, I have to select the warning messages manually. And if it's possible, I would like to write the logs that interest me to a file; that would be great!
Bonus:
Here's a visual of Jedox ETL Console.
The Jedox ETL (Integrator) comes with its own API, where the logger is an additional class.
If you are working in the Groovy job environment, you have full access to all of its classes and methods. In your case the State class would be helpful to get all warnings or errors of your current job/subjob. You can write the state items to a separate file or collect them into a variable for further use, for example. The target can be a file, a database, or even a Jedox OLAP cube.
Starting Point of the Groovy Scripting API:
https://knowledgebase.jedox.com/integration/integrator-scripting-api.htm
Here is API Documentation of all Jedox ETL Classes:
https://software.jedox.com/doc/2022_2/jedox_integrator_scripting_api_doc_2022_2/com/jedox/etl/components/scriptapi/package-summary.html
How do I store all those logs in a file?
In Jedox you can create a new File-Load, which can be triggered by the 'main' groovy job.
A good point to start is here:
https://knowledgebase.jedox.com/integration/loads/file-load.htm
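As a general illustration of the log4j part of the question, here is a minimal Java sketch that captures every WARN-or-above message into a file by attaching a file appender to the root logger. This assumes the Jedox runtime actually routes its logging through log4j 1.x and that the log4j classes are reachable from the Groovy job, neither of which is confirmed; the file path is just an example.

import java.io.IOException;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class WarningFileCapture {

    public static void attachWarningFile(String path) throws IOException {
        // Appender that writes every event routed through log4j to the given file
        FileAppender appender = new FileAppender(
                new PatternLayout("%d{ISO8601} %-5p %c - %m%n"), path);
        // Keep only WARN and above, so the file contains just the interesting lines
        appender.setThreshold(Level.WARN);
        // Attaching to the root logger captures messages from every logger in the process
        Logger.getRootLogger().addAppender(appender);
    }
}

Whether the ETLLogger messages actually reach log4j depends on how Jedox wires its logging internally; if they do not, the State class approach described above is the way to go.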

Process a large file in background with Play Framework

What is the best way to process a large file with Play Framework? I need to execute some operations after a file has been uploaded. The process can be slow, so I need to return HTTP 200 to the client right away and send an email when the process ends.
I was googling and I found these approaches:
Create an Actor
Create a new thread
Create a promise (CompletionStage without a .get())
These approaches work, but I'd like to know which is the best or cleanest one.
I think creating an Actor can be an elegant solution. Since you are processing a huge file, a stream processing engine such as fs2 or Akka Streams should also help.
I am using an actor-based system for a problem similar to yours and it works very well.
For reference and to try it out, you can have a look at this:
https://developer.lightbend.com/guides/akka-quickstart-scala/create-actors.html
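A minimal sketch of the actor approach in Java, assuming Akka classic actors as bundled with Play (the processing and email steps are placeholders):

import akka.actor.AbstractActor;
import akka.actor.Props;

public class FileProcessorActor extends AbstractActor {

    // Message carrying the path of the uploaded file
    public static final class ProcessFile {
        public final String path;
        public ProcessFile(String path) { this.path = path; }
    }

    public static Props props() {
        return Props.create(FileProcessorActor.class, FileProcessorActor::new);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(ProcessFile.class, msg -> {
                    // Long-running work happens here, off the request thread
                    process(msg.path);
                    // Notify the user once processing is done
                    sendCompletionEmail(msg.path);
                })
                .build();
    }

    private void process(String path) { /* slow file processing */ }

    private void sendCompletionEmail(String path) { /* email integration */ }
}

The controller then just fires the message and returns immediately, e.g. fileProcessor.tell(new FileProcessorActor.ProcessFile(path), ActorRef.noSender()); followed by return ok();, so the client gets HTTP 200 while the work continues in the background.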

Detect file deletion while using a FileOutputStream

I have created a Java process which writes to a plain text file and another Java process which consumes this text file. The 'consumer' reads then deletes the text file. For the sake of simplicity, I do not use file locks (I know it may lead to concurrency problems).
The 'consumer' process runs every 30 minutes from crontab. The 'producer' process currently just redirects whatever it receives from the standard input to the text file. This is just for testing - in the future, the 'producer' process will write the text file by itself.
The 'producer' process opens a FileOutputStream once and keeps writing to the text file using this output stream. The problem is when the 'consumer' deletes the file. Since I'm in a UNIX environment, this situation is handled 'gracefully': the 'producer' keeps working as if nothing happened, since the inode of the file is still valid, but the file can no longer be found in the file system. This thread provides a way to handle this situation using C. Since I'm using Java, which is portable and therefore hides all platform-specific features, I'm not able to use the solution presented there.
Is there a portable way in Java to detect when the file was deleted while the FileOutputStream was still open?
This isn't a robust way for your processes to communicate, and the best I can advise is to stop doing that.
As far as I know there isn't a reliable way for a C program to detect when a file being written is unlinked, let alone a Java program. (The accepted answer you've linked to can only poll the directory entry to see if it's still there; I don't consider this sufficiently robust).
As you've noticed, UNIX doesn't consider it abnormal for an open file to be unlinked (indeed, it's an established practice to create a named tempfile, grab a filehandle, then delete it from the directory so that other processes can't get at it, before reading and writing).
If you must use files, consider having your consumer poll a directory. Have a .../pending/ directory for files in the process of being written and .../inbox/ for files that are ready for processing.
Producer creates a new unique filename (e.g. a UUID) and writes a new file to pending/.
After closing the file, Producer moves the file to inbox/ -- as long as both dirs are on the same filesystem, this will just be a relink, so the file will never be incomplete in inbox/.
Consumer looks for files in inbox/, reads them and deletes when done.
You can enhance this with more directories if there are eventually multiple consumers, but there's no immediate need.
But polling files/directories is always a bit fragile. Consider a database or a message queue.
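A minimal producer-side sketch of that pattern (the directory names pending/ and inbox/ are taken from the suggestion above; error handling is omitted):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class Producer {

    public static void publish(String data) throws IOException {
        Path pending = Paths.get("pending", UUID.randomUUID() + ".txt");
        Path inbox = Paths.get("inbox", pending.getFileName().toString());
        Files.createDirectories(pending.getParent());
        Files.createDirectories(inbox.getParent());

        // Write the complete file while it is still invisible to the consumer
        Files.write(pending, data.getBytes(StandardCharsets.UTF_8));

        // Same filesystem: the move is a rename, so the consumer never sees a partial file
        Files.move(pending, inbox, StandardCopyOption.ATOMIC_MOVE);
    }
}

The consumer then only ever lists inbox/, reads each file, and deletes it when done.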
You can check the filename itself for existence:
if (!Files.exists(Paths.get("/path/to/file"))) {
// The consumer has deleted the file.
}
but in any case, shouldn't the consumer be waiting for the producer to finish writing the file before it reads & deletes it? If it did, you wouldn't have this problem.
To solve this the way you're intending to, you might have to look at JNI, which lets you call C/C++ functions from within Java, but this might also require you to write a wrapper library around stat/fstat first (in C/C++).
However, that will cause you major headaches.
This might be a workaround which doesn't require much change to your code right now (I assume):
You can let the producer write to a new file each time it produces new data. Depending on the amount, you might want to group the data so that the directory won't be flooded with files. For example, one file per minute that contains all data produced so far.
It might also be a good idea to write the files to another directory first and then move them to your consumer's input directory. I'm a bit paranoid here, because race conditions could cause some data loss; moving the files only after everything has been written makes sure no data gets lost.
Hope this helps. Good luck :)

How can I consolidate hadoop logs?

I currently have a program that - among other things - produces log outputs. The problem is that the mappers all seem to want to log to wherever they are running. I want all of this log output to end up in a single file. I am using log4j to log information.
I was thinking that it might be possible to somehow stream that data as a string from the Mapper back to the main function and log it there. Is something like this possible? Is there a better way to consolidate logs?
Each map or reduce task's log is written to the local file system of the TaskTracker node on which it was executed. The log is written to a 'userlog' directory, which is defined by HADOOP_LOG_DIR, with a subdirectory named after the task attempt ID.
These log files can be accessed through the JobTracker web GUI; on each task detail page there is a link you can click to check the log content. But in the stock Apache Hadoop distribution, to the best of my knowledge, there is no tool or mechanism to consolidate logs from different nodes.
While some vendors might have their own utilities for this purpose, I found this one recently from MapR, which seems like a good solution, but I have not tried it myself: http://doc.mapr.com/display/MapR/Centralized+Logging
I am not sure whether the idea of picking up the logs and feeding them into one single map task will work, as you would need to know the task and attempt ID inside the task's Java class to pick up its own log file after it has completed.
OK, I ended up solving this problem by using MultipleOutputs. I set up one stream for the actual output and one for the log results my code produced. This allowed me to log things manually by sending output through the log output stream. I then had a script consolidate the log files into a single file. While the automatically generated logs stayed where they originally were, this solution has allowed me to send the log messages I put into my code to a single location.
Example log statement using MultipleOutputs:
mout.write("logOutput", new Text("INFO: "), new Text("reducertest"), "templogging/logfile");

SAXException: Unexpected end of file after null

I'm getting the error in the title occasionally from a process that parses lots of XML files.
The files themselves seem OK, and running the process again on the same files that generated the error works just fine.
The exception occurs on a call to XMLReader.parse(InputStream is)
Could this be a bug in the parser (I use piccolo)? Or is it something about how I open the file stream?
No multithreading is involved.
Piccolo seemed like a good idea at the time, but I don't really have a good excuse for using it. I will try to switch to the default SAX parser and see if that helps.
Update: It didn't help, and I found that Piccolo is considerably faster for some of the workloads, so I went back.
I should probably tell the end of this story: it was a stupid mistake. There were two processes: one that produces XML files and another that reads them. The reader just scans the directory and tries to process every new file it sees.
Every once in a while, the reader would detect a file before the producer was done writing, and so it legitimately raised an exception for "Unexpected end of file". Since we're talking about small files here, this event was pretty rare. By the time I came to check, the producer would already finish writing the file, so to me it seemed like the parser was complaining for nothing.
I wrote "No multithreading was involved". Obviously this was very misleading.
One solution would be to write the file elsewhere and move it to the monitored folder only after it was done. A better solution would be to use a proper message queue.
I'm experiencing something similar with Piccolo under XMLBeans. After a quick Google search, I came across the following post:
XMLBEANS-226 - Exception "Unexpected end of file after null"
The post states that using the Apache Commons IO (v1.4 onwards) class org.apache.commons.io.input.AutoCloseInputStream may resolve this exception (not tried it myself, apologies).
Is this a multithreaded scenario? I.e., do you parse more than one file at the same time?
Any particular reason you do not use the default XML parser in the JRE?
