I am setting a path as the input location on the conf:
FileInputFormat.setInputPaths(conf, new Path("path/to/folder"));
How can I retrieve this location back from the conf? I am trying to implement my own RecordReader.
Thanks in advance...
The property set by this call is mapred.input.dir (in newer Hadoop versions the equivalent key is mapreduce.input.fileinputformat.inputdir), so this should work for you:
conf.get("mapred.input.dir");
On a side note, your record reader should act upon the input split it is given in the initialize(InputSplit, TaskAttemptContext) method: the folder you pass to setInputPaths will actually resolve to a number of input splits, typically one for each file in the folder (and possibly multiple input splits for larger, splittable files).
FileInputFormat-based input formats pass a FileSplit to the initialize method, and you should be able to pull the actual file to be processed out of FileSplit.getPath().
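For example, a minimal sketch of a custom RecordReader that grabs the file path in initialize (the class name and stub method bodies here are illustrative, not from the question):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyRecordReader extends RecordReader<LongWritable, Text> {
    private Path filePath;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;  // FileInputFormat hands you a FileSplit
        filePath = fileSplit.getPath();           // the actual file backing this split
        // open filePath via filePath.getFileSystem(context.getConfiguration())
        // and seek to fileSplit.getStart() if the file is splittable
    }

    // the remaining methods are stubbed out; real reading logic goes in nextKeyValue()
    @Override public boolean nextKeyValue() { return false; }
    @Override public LongWritable getCurrentKey() { return null; }
    @Override public Text getCurrentValue() { return null; }
    @Override public float getProgress() { return 0f; }
    @Override public void close() { }
}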
I have a mapreduce job, where the file input path is: /basedirectory/*/*.txt
Inside the basedirectory, I have different subfolders (CaseA, CaseB etc), each of which contain hdfs text files.
In the map phase of the job, I want to find out where exactly the data shard came from (e.g. CaseA). How can I achieve that?
I've done something similar for mapreduce jobs with more than one input HBase table, where I use context.getInputSplit().getTableName() to find the actual table name, but I am not sure what to do for HDFS input files.
You can get the input split using context.getInputSplit() (where context is the mapper's Context), cast it to FileSplit, and then call getPath() on it to get the file path.
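A hedged sketch of what that looks like in a mapper (this assumes the new mapreduce API with TextInputFormat, so each split is a FileSplit; the class name and output types are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class CaseAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        Path file = split.getPath();                     // e.g. /basedirectory/CaseA/part1.txt
        String caseFolder = file.getParent().getName();  // e.g. "CaseA"
        context.write(new Text(caseFolder), value);      // illustrative use of the folder name
    }
}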
I am using java.util.logging's FileHandler class to create cyclic logs. However, why are these logs appended with a .0 extension? .1, .2, .3, etc. are fine; I just do not want .0 as the extension of the current file, since it is confusing for the customer. Is there any way to achieve this?
I am using Java version 1.8.0_144.
FileHandler fileTxt = new FileHandler(pathOfLogFile+"_%g.log",
Integer.parseInt(prop.getProperty("MAX_LOG_FILE_SIZE")),
Integer.parseInt(prop.getProperty("NO_OF_LOG_FILES")), true);
SimpleFormatter formatterTxt = new SimpleFormatter();
fileTxt.setFormatter(formatterTxt);
logger.addHandler(fileTxt);
The name of the log file is LOG_0.log. The requirement is not to have _0 on the latest file; it should simply be LOG.log.
You'll have to add more information about how you are setting up your FileHandler. Include code and/or your logging.properties file.
Most likely you are creating multiple open file handlers and simply not closing the previous one before you create the next one. This can happen due to a bug in your code, or because you are not holding a strong reference to the logger that holds your FileHandler. Another way to create this issue is by running two JVM processes at once, in which case you have no option but to choose where the unique number is placed in your file name.
Specify the %g token and %u in your file pattern. For example, %gfoo%u.log.
Per the FileHandler documentation:
If no "%g" field has been specified and the file count is greater than one, then the generation number will be added to the end of the generated filename, after a dot.
[snip]
Normally the "%u" unique field is set to 0. However, if the FileHandler tries to open the filename and finds the file is currently in use by another process it will increment the unique number field and try again. This will be repeated until FileHandler finds a file name that is not currently in use. If there is a conflict and no "%u" field has been specified, it will be added at the end of the filename after a dot. (This will be after any automatically added generation number.)
Thus if three processes were all trying to log to fred%u.%g.txt then they might end up using fred0.0.txt, fred1.0.txt, fred2.0.txt as the first file in their rotating sequences.
If you want to remove the zero from the first file only, then the best approximation of the out-of-the-box behavior would be to modify your code as follows (a sketch is given after the list):
1. If no LOG_0.log file exists, use File.renameTo to add the zero to the file. This changes LOG.log -> LOG_0.log.
2. Trigger a rotation. This results in LOG_0.log -> LOG_1.log etc., then LOG_N.log -> LOG_0.log.
3. Use File.renameTo to remove the zero from the file: LOG_0.log -> LOG.log.
4. Open your FileHandler with the number of logs set to one and no append. This wipes the oldest log file and starts your new current one.
The problem with this is that your code won't rotate based on file size during a single run.
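Here is a rough, untested sketch of that startup bookkeeping; it does the renames directly with java.nio.file instead of triggering FileHandler's own rotation, and the directory, base name, and generation count are assumptions:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.logging.FileHandler;

Path dir = Paths.get("logs");   // assumed log directory
int generations = 5;            // assumed NO_OF_LOG_FILES

// drop the oldest generation so the shift below never collides
Files.deleteIfExists(dir.resolve("LOG_" + (generations - 1) + ".log"));

// shift each generation up by one: LOG_3.log -> LOG_4.log, ..., LOG_0.log -> LOG_1.log
for (int g = generations - 2; g >= 0; g--) {
    Path from = dir.resolve("LOG_" + g + ".log");
    if (Files.exists(from)) {
        Files.move(from, dir.resolve("LOG_" + (g + 1) + ".log"),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}

// the previous run's live file becomes generation 0
Path live = dir.resolve("LOG.log");
if (Files.exists(live)) {
    Files.move(live, dir.resolve("LOG_0.log"));
}

// count = 1 and no %g in the pattern means the current file is plain LOG.log, no suffix;
// the trade-off noted above: no size-based rotation happens during this run
FileHandler fileTxt = new FileHandler(dir.resolve("LOG.log").toString(), 0, 1, false);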
Simply use logger name as filename (Don't include %g in it). The latest file will be filename.log. Also note that the rotated files will have the numbers as extension.
I know this question has been asked before, but I am unable to get a clear working answer.
result.saveAsTextFile(path);
When using Spark's saveAsTextFile, the output is saved under names like "part-00000", "part-00001", etc.
Is it possible to change this to a customized name?
Is it possible for saveAsTextFile to append to an existing file rather than overwriting it?
I am using Java 7 for coding, and the output file system would be cloud storage (Azure, AWS).
1) There is no direct support in the saveAsTextFile method to control the output file name.
You can try using saveAsHadoopDataset to control the output file basename.
e.g.: instead of part-00000 you can get yourCustomName-00000.
Keep in mind that you cannot control the suffix 00000 using this method. It is something Spark automatically assigns to each partition while writing, so that each partition writes to a unique file.
In order to control that too, as mentioned in the comments above, you have to write your own custom OutputFormat.
SparkConf conf = new SparkConf().setMaster("local").setAppName("yello");
JavaSparkContext sc = new JavaSparkContext(conf);

JobConf jobConf = new JobConf();
// part-00000 becomes customName-00000
jobConf.set("mapreduce.output.basename", "customName");
jobConf.set("mapred.output.dir", "outputPath");
jobConf.setOutputKeyClass(NullWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);

// saveAsHadoopDataset is only available on pair RDDs, so pair each line with a null key
JavaPairRDD<NullWritable, Text> input = sc.textFile("inputDir")
        .mapToPair(line -> new Tuple2<>(NullWritable.get(), new Text(line)));
input.saveAsHadoopDataset(jobConf);
2) A workaround would be to write the output as-is to your output location and then use Hadoop's FileUtil.copyMerge function to form a single merged file.
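A rough sketch of the copyMerge approach (paths are placeholders; note that FileUtil.copyMerge exists up to Hadoop 2.x and was removed in Hadoop 3):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);

// merge all part-* files under outputPath into one file named customName.txt
FileUtil.copyMerge(fs, new Path("outputPath"),
                   fs, new Path("customName.txt"),
                   false,      // keep the original part files
                   hadoopConf, null);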
I have the following code:
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[4]")
.set("spark.executor.memory", "2g")
.set("spark.driver.allowMultipleContexts", "true");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> trainingData = jssc.textFileStream("filesDirectory");
trainingData.print();
jssc.start();
jssc.awaitTermination();
Unfortunately, to stream any file that already exists in the directory I have to edit the file and rename it after starting the streaming context, otherwise it will not be processed.
Should I edit and rename each file to process it, or is there another way to process the existing files by just editing and saving them?
P.S. When I move a new file to this directory, I also need to edit and rename it before it gets streamed!
Try touching the file before moving it to the destination directory (a small sketch follows the quoted javadoc below).
Below is what the javadoc says.
Identify whether the given path is a new file for the batch of currentTime. For it to be accepted, it has to pass the following criteria.
- It must pass the user-provided file filter.
- It must be newer than the ignore threshold. It is assumed that files older than the ignore threshold have already been considered or are existing files before start (when newFilesOnly = true).
- It must not be present in the recently selected files that this class remembers.
- It must not be newer than the time of the batch (i.e. currentTime for which this file is being tested). This can occur if the driver was recovered, and the missing batches (during downtime) are being generated. In that case, a batch of time T may be generated at time T+x. Say x = 5. If that batch T contains a file of mod time T+5, then bad things can happen. Let's say the selected files are remembered for 60 seconds. At time T+61, the batch of time T is forgotten, and the ignore threshold is still T+1. The files with mod time T+5 are not remembered and cannot be ignored (since T+5 > T+1). Hence they can get selected as new files again. To prevent this, files whose mod time is more than the current batch time are not considered.
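A minimal sketch of the touch-then-move idea for a local monitored directory (the staging path and file name are assumptions; on HDFS you would use FileSystem.setTimes and FileSystem.rename instead):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// file prepared somewhere outside the monitored directory
File staged = new File("/tmp/staging/data-001.txt");

// "touch": make the modification time newer than the streaming context's start time
staged.setLastModified(System.currentTimeMillis());

// atomically move it into the directory watched by textFileStream
Path target = Paths.get("filesDirectory", staged.getName());
Files.move(staged.toPath(), target, StandardCopyOption.ATOMIC_MOVE);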
JavaStreamingContext.textFileStream is backed by a FileInputDStream, which monitors a folder for files that are regularly being added to it. You will get a notification every two seconds (your batch interval), but only when a new file appears.
If your intent is just to read files that already exist, you can use SparkContext.textFile instead.
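For example, a small sketch of reading the existing files as a plain batch job instead of a stream (the directory name is reused from the question):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("My app").setMaster("local[4]");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

// reads every file already present under the directory as one RDD of lines
JavaRDD<String> existingData = sc.textFile("filesDirectory");
existingData.foreach(line -> System.out.println(line));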
Looking at the documentation in the source code of JavaStreamingContext.textFileStream():
/**
* Create a input stream that monitors a Hadoop-compatible filesystem
* for new files and reads them as text files (using key as LongWritable, value
* as Text and input format as TextInputFormat). Files must be written to the
* monitored directory by "moving" them from another location within the same
* file system. File names starting with . are ignored.
*/
I am trying to make a small program that takes console input such as a user's name, school, and other information, and then creates a file whose name is the user's name. Each file should then be located in a folder named after the school. I am not sure how to create a file with those qualities, since Camel seems to determine the path and file name before any input is read. Is there a way of getting around this problem?
There is an example on the file component page like so:
// set the output filename using java code logic, notice that this is done by setting
// a special header property of the out exchange
exchange.getOut().setHeader(Exchange.FILE_NAME, "report.txt");
you could replace report.txt with the filename you wish to use.
As for the directory, can you not store the directory name in a header and reference it from the endpoint:
.to("file://${headers.directory}");
more info here: http://camel.apache.org/file2.html
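For instance, a rough sketch relying on the fact that the CamelFileName header may contain a relative path, so the school sub-folder is created under the endpoint's base directory (the route, header names, and base directory are assumptions, not from the question):

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class UserFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:saveUser")
            // the headers "school" and "userName" are assumed to have been
            // populated from the console input earlier in the application
            .process(exchange -> {
                String school = exchange.getIn().getHeader("school", String.class);
                String user = exchange.getIn().getHeader("userName", String.class);
                // relative path <school>/<user>.txt, resolved under the endpoint directory
                exchange.getIn().setHeader(Exchange.FILE_NAME, school + "/" + user + ".txt");
            })
            .to("file://output"); // Camel creates the school sub-folder automatically
    }
}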