I have tried to use the MultipleOutputs class, following the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
Driver Code
Configuration conf = new Configuration();
Job job = new Job(conf, "Wordcount");
job.setJarByClass(WordCount.class);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
Text.class, IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Reducer Code
public class WordCountReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
private MultipleOutputs<Text, IntWritable> mos;
public void setup(Context context){
mos = new MultipleOutputs<Text, IntWritable>(context);
}
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
//context.write(key, result);
mos.write("text", key,result);
}
public void cleanup(Context context) {
try {
mos.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
The output of the reducer does get renamed to text-r-00000.
But the issue here is that I am also getting an empty part-r-00000 file. Is this how MultipleOutputs is expected to behave, or is there some problem with my code? Please advise.
Another alternative I have tried out is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.
What is the best way?
FileSystem hdfs = FileSystem.get(configuration);
FileStatus fs[] = hdfs.listStatus(new Path(outputPath));
for (FileStatus aFile : fs) {
    if (aFile.isDir()) {
        // delete all directories and sub-directories (if any) in the output directory
        hdfs.delete(aFile.getPath(), true);
    } else {
        if (aFile.getPath().getName().contains("_")) {
            // delete all log files and the _SUCCESS file in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            hdfs.rename(aFile.getPath(), new Path(myCustomName));
        }
    }
}
Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, so it will still be initialized, creating the part-r-xxxxx files that you are seeing.
They are empty because you are not doing any context.write, since you are writing through MultipleOutputs instead. But that doesn't prevent them from being created during initialization.
To get rid of them, you need to define your OutputFormat to say you are not expecting any output. You can do it this way:
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
job.setOutputFormatClass(NullOutputFormat.class);
With that property set, this should ensure that your part files are never initialized at all, but you still get your output in the MultipleOutputs.
You could also probably use LazyOutputFormat, which would ensure that your output files are only created when/if there is some data, rather than initializing empty files. You could do it this way:
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
Note that you are using in your Reducer the prototype MultipleOutputs.write(String namedOutput, K key, V value), which just uses a default output path that will be generated based on your namedOutput to something like: {namedOutput}-(m|r)-{part-number}. If you want to have more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath) which can allow you to get filenames generated at runtime based on your keys/values.
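As an illustration only (the counts/ subdirectory and using the key as part of the file name are arbitrary choices, not something the API requires), the write call in the reduce() above could become:
// Each record goes to a base path derived at runtime from the key,
// producing files such as <output-dir>/counts/<word>-r-00000.
mos.write("text", key, result, "counts/" + key.toString());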
This is all you need to do in the Driver class to change the basename of the output file:
job.getConfiguration().set("mapreduce.output.basename", "text");
So this will result in your files being called "text-r-00000".
I have a function that saves all the errors in an errorMessages list
public class Util {
private List<String> errorMessages = new ArrayList<>();
public void outputResult(String content) {
logger.error(content);
errorMessages.add(content);
}
}
and my compare function adds all the error messages to the list,
public void compare(Config source, Config target) {
if (source.getId() != target.getId()) {
util.outputResult("id not equal");
}
// ...
}
And in my main function, I call this compare function and want to save all the error messages in a txt (or some other) file in my current directory
public class MyClass {
public void main() {
compare();
// writeToFile
}
}
This is what I'm doing right now: I convert the ByteArrayOutputStream to a string and write that. A txt file is generated, but it is empty. Also, I don't want to write a single string; I want each error message in the list to be printed. How can I do that?
ByteArrayOutputStream errorMessages = new ByteArrayOutputStream();
try (FileWriter w = new FileWriter(pathToReport)) {
w.write(errorMessages.toString());
}
File errorMessagesFile = new File(pathToReport);
errorMessagesFile.writeText(errorMessages.toString());
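For reference, a minimal sketch that writes each message in the list on its own line (assuming the Util instance exposes the list through a hypothetical getErrorMessages() accessor, and that pathToReport points at the target file):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
// Write every collected error message as a separate line in the report file.
Files.write(Paths.get(pathToReport),
        util.getErrorMessages(),      // hypothetical accessor for the List<String>
        StandardCharsets.UTF_8);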
What logging library are you using? If you use slf4j, you can couple it with log4j and, with the proper configuration, log just the error messages into the file you specify in that configuration. I've done some looking and found this Stack Overflow answer: where-does-the-slf4j-log-file-get-saved, which provides a template for you to follow for this setup.
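Purely as an illustration (the linked answer does this through a log4j configuration file instead, and the file name errors.log is just a placeholder), a rough programmatic equivalent with the log4j 1.x API, assuming log4j 1.x is the binding behind slf4j, would be:
import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
// Send everything logged at ERROR level or above to a file, in addition to the existing appenders.
FileAppender fileAppender = new FileAppender(
        new PatternLayout("%d{ISO8601} %-5p %c - %m%n"), "errors.log", true);
fileAppender.setThreshold(Level.ERROR);
Logger.getRootLogger().addAppender(fileAppender);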
I am trying to load a group of files, make some checks on them, and later save them to HDFS. I haven't found a good way to create and save these Sequence files, though. Here is my loader's main function:
SparkConf sparkConf = new SparkConf().setAppName("writingHDFS")
.setMaster("local[2]")
.set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
//JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5*1000));
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
JavaPairRDD<String, String> imageRDD = jsc.wholeTextFiles("file:///home/cloudera/Pictures/");
imageRDD.mapToPair(new PairFunction<Tuple2<String,String>, Text, Text>() {
@Override
public Tuple2<Text, Text> call(Tuple2<String, String> arg0)
throws Exception {
return new Tuple2<Text, Text>(new Text(arg0._1),new Text(arg0._2));
}
}).saveAsNewAPIHadoopFile("hdfs://localhost:8020/user/hdfs/sparkling/try.seq", Text.class, Text.class, SequenceFileOutputFormat.class);
It simply loads some images as text files, puts the name of the file as the key of the PairRDD, and uses the native saveAsNewAPIHadoopFile.
I would now like to save the files one by one inside an rdd.foreach or rdd.foreachPartition, but I cannot find a proper method:
This stackoverflow answer creates a Job for the occasion. It seems to work, but it needs the file to be passed in as a path, while I already have an RDD made of them.
A couple of solutions I found create a directory for each file (OutputStream out = fs.create(new Path(dst));), which wouldn't be much of a problem if it weren't for the fact that I get a Mkdirs didn't work exception.
EDIT: I may have found a way, but I have a Task not serializable exception:
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles("file:///home/cloudera/Pictures/cat");
imageByteRDD.foreach(new VoidFunction<Tuple2<String,PortableDataStream>>() {
@Override
public void call(Tuple2<String, PortableDataStream> fileTuple) throws Exception {
Text key = new Text(fileTuple._1());
BytesWritable value = new BytesWritable( fileTuple._2().toArray());
SequenceFile.Writer writer = SequenceFile.createWriter(serializableConfiguration.getConf(), SequenceFile.Writer.file(new Path("/user/hdfs/sparkling/" + key)),
SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
SequenceFile.Writer.keyClass(Text.class), SequenceFile.Writer.valueClass(BytesWritable.class));
key = new Text("MiaoMiao!");
writer.append(key, value);
IOUtils.closeStream(writer);
}
});
I have tried wrapping the entire function in a Serializable class, but no luck. Help?
The way I did it was as follows (pseudocode; I'll edit this answer as soon as I get to my office):
rdd.foreachPartition{
Configuration conf = ConfigurationSingletonClass.getConfiguration();
etcetera, etcetera...
}
EDIT: I got to my office; here is the complete segment of code. The configuration is created inside rdd.foreachPartition (foreach was a little too much). Inside the iterator the files themselves are written out, in sequence file format.
JavaPairRDD<String, PortableDataStream> imageByteRDD = jsc.binaryFiles(SOURCE_PATH);
if(!imageByteRDD.isEmpty())
imageByteRDD.foreachPartition(new VoidFunction<Iterator<Tuple2<String,PortableDataStream>>>() {
@Override
public void call(
Iterator<Tuple2<String, PortableDataStream>> arg0)
throws Exception {
Configuration conf = new Configuration();
conf.set("fs.defaultFS", HDFS_PATH);
while(arg0.hasNext()){
Tuple2<String,PortableDataStream>fileTuple = arg0.next();
Text key = new Text(fileTuple._1());
String fileName = key.toString().split(SEP_PATH)[key.toString().split(SEP_PATH).length-1].split(DOT_REGEX)[0];
String fileExtension = fileName.split(DOT_REGEX)[fileName.split(DOT_REGEX).length-1];
BytesWritable value = new BytesWritable( fileTuple._2().toArray());
SequenceFile.Writer writer = SequenceFile.createWriter(
conf,
SequenceFile.Writer.file(new Path(DEST_PATH + fileName + SEP_KEY + getCurrentTimeStamp()+DOT+fileExtension)),
SequenceFile.Writer.compression(SequenceFile.CompressionType.RECORD, new BZip2Codec()),
SequenceFile.Writer.keyClass(Text.class), SequenceFile.Writer.valueClass(BytesWritable.class));
key = new Text(key.toString().split(SEP_PATH)[key.toString().split(SEP_PATH).length-2] + SEP_KEY + fileName + SEP_KEY + fileExtension);
writer.append(key, value);
IOUtils.closeStream(writer);
}
}
});
Hope this will help.
I'm trying to run a Hadoop 2 MapReduce process whose output_format_class is SequenceFileOutputFormat and whose input_format_class is SequenceFileInputFormat.
I chose to have the Mapper emit both the key and the value as BytesWritable, while the Reducer emits the key as IntWritable and the value as BytesWritable.
Every time I'm getting the following error:
Error: java.io.IOException: wrong key class: org.apache.hadoop.io.BytesWritable is not class org.apache.hadoop.io.IntWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1306)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:83)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:150)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I discovered that when I don't define the OutputFormat as SequenceFileOutputFormat the problem goes away, but I need it to be SequenceFileOutputFormat.
Here is the main:
Configuration conf = new Configuration(true);
conf.set("refpath", "/out/Sample1/Local/EU/CloudBurst/BinaryFiles/ref.br");
conf.set("qrypath", "/out/Sample1/Local/EU/CloudBurst/BinaryFiles/qry.br");
conf.set("MIN_READ_LEN", Integer.toString(MIN_READ_LEN));
conf.set("MAX_READ_LEN", Integer.toString(MAX_READ_LEN));
conf.set("K", Integer.toString(K));
conf.set("SEED_LEN", Integer.toString(SEED_LEN));
conf.set("FLANK_LEN", Integer.toString(FLANK_LEN));
conf.set("ALLOW_DIFFERENCES", Integer.toString(ALLOW_DIFFERENCES));
conf.set("BLOCK_SIZE", Integer.toString(BLOCK_SIZE));
conf.set("REDUNDANCY", Integer.toString(REDUNDANCY));
conf.set("FILTER_ALIGNMENTS", (FILTER_ALIGNMENTS ? "1" : "0"));
Job job = new Job(conf,"CloudBurst");
job.setNumReduceTasks(NUM_REDUCE_TASKS); // MV2
//conf.setNumMapTasks(NUM_MAP_TASKS); TODO find solution for mv2
FileInputFormat.addInputPath(job, new Path("/out/Sample1/Local/EU/CloudBurst/BinaryFiles/ref.br"));//TODO change it fit to the params
FileInputFormat.addInputPath(job, new Path("/out/Sample1/Local/EU/CloudBurst/BinaryFiles/qry.br"));//TODO change it fit to the params
job.setJarByClass(MerReduce.class);//mv2
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
// The order of seeds is not important, but make sure the reference seeds are seen before the qry seeds
job.setPartitionerClass(MerReduce.PartitionMers.class); // mv2
job.setGroupingComparatorClass(MerReduce.GroupMersWC.class); //mv2 TODO
job.setMapperClass(MerReduce.MapClass.class);
job.setReducerClass(MerReduce.ReduceClass.class);
job.setMapOutputKeyClass(BytesWritable.class);//mv2
job.setMapOutputValueClass(BytesWritable.class);//mv2
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(BytesWritable.class);
Path oPath = new Path("/out/Sample1/Local/EU/Vectors");//TODO change it fit to the params
//conf.setOutputPath(oPath);
FileOutputFormat.setOutputPath(job, oPath);
System.err.println(" Removing old results");
FileSystem.get(conf).delete(oPath);
int code = job.waitForCompletion(true) ? 0 : 1;
System.err.println("Finished");
}
The mapper class headline:
public static class MapClass extends Mapper<IntWritable, BytesWritable, BytesWritable, BytesWritable>
public void map(IntWritable id, BytesWritable rawRecord,Context context) throws IOException, InterruptedException
The reducer class headline:
public static class ReduceClass extends Reducer<BytesWritable, BytesWritable, IntWritable, BytesWritable>
public synchronized void reduce(BytesWritable mer, Iterator<BytesWritable> values,Context context)
throws IOException, InterruptedException {
Does anybody have an idea?
job.setInputFormatClass(SequenceFileInputFormat.class);
should be
job.setInputFormatClass(IntWritable.class);
Your mapper input is int and bytes, but in the job you gave both inputs as sequence.
How do I sort files by creation date before showing them in a ListView within my own app?
I used files.lastModified(), which I do not want to use, because renaming would jumble them up. Is there any way I can instead sort them by creation date? I've also tried mapping file names to time and date, but that's too tedious.
Since you use Java 7, you can use the attributes API to obtain what you want.
Unfortunately however, getting access to the creation time requires that you read attributes which can lead to an IOException, so you have to deal with that...
Sample code:
private static FileTime getCreationTime(final Path path)
{
final BasicFileAttributes attrs;
try {
attrs = Files.readAttributes(path, BasicFileAttributes.class);
return attrs.creationTime();
} catch (IOException oops) {
throw new RuntimeException("can't read attributes from " + path, oops);
}
}
private static final Comparator<Path> CREATION_TIME_COMPARATOR
= new Comparator<Path>()
{
@Override
public int compare(final Path o1, final Path o2)
{
return getCreationTime(o1).compareTo(getCreationTime(o2));
}
};
Now, use Files.newDirectoryStream() to read the file entries into a list:
final Path baseDir = Paths.get("wherever");
final List<Path> entries = new ArrayList<>();
for (final Path entry: Files.newDirectoryStream(baseDir))
entries.add(entry);
Collections.sort(entries, CREATION_TIME_COMPARATOR);
I've successfully built a very simple Spark Streaming application in Java that is based on the HdfsCount example in Scala.
When I submit this application to my local Spark, it waits for a file to be written to a given directory, and when I create that file it successfully prints the number of words. I terminate the application by pressing Ctrl+C.
Now I've tried to create a very basic unit test for this functionality, but in the test I was not able to print the same information, that is the number of words.
What am I missing?
Below is the unit test file, and after that I've also included the code snippet that shows the countWords method:
StarterAppTest.java
import com.google.common.io.Files;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.junit.*;
import java.io.*;
public class StarterAppTest {
JavaStreamingContext ssc;
File tempDir;
@Before
public void setUp() {
ssc = new JavaStreamingContext("local", "test", new Duration(3000));
tempDir = Files.createTempDir();
tempDir.deleteOnExit();
}
@After
public void tearDown() {
ssc.stop();
ssc = null;
}
@Test
public void testInitialization() {
Assert.assertNotNull(ssc.sc());
}
@Test
public void testCountWords() {
StarterApp starterApp = new StarterApp();
try {
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
ssc.start();
File tmpFile = new File(tempDir.getAbsolutePath(), "tmp.txt");
PrintWriter writer = new PrintWriter(tmpFile, "UTF-8");
writer.println("8-Dec-2014: Emre Emre Emre Ergin Ergin Ergin");
writer.close();
System.err.println("===== Word Counts =======");
wordCounts.print();
System.err.println("===== Word Counts =======");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
Assert.assertTrue(true);
}
}
This test compiles and starts to run, and Spark Streaming prints a lot of diagnostic messages on the console, but the call to wordCounts.print() does not print anything, whereas in StarterApp.java itself it does.
I've also tried adding ssc.awaitTermination(); after ssc.start() but nothing changed in that respect. After that I've also tried to create a new file manually in the directory that this Spark Streaming application was checking but this time it gave an error.
For completeness, below is the countWords method:
public JavaPairDStream<String, Integer> countWords(JavaDStream<String> lines) {
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String x) { return Lists.newArrayList(SPACE.split(x)); }
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) { return new Tuple2<>(s, 1); }
}).reduceByKey((i1, i2) -> i1 + i2);
return wordCounts;
}
A few pointers:
Give at least 2 cores to the Spark Streaming context: one for the streaming receiver and one for the Spark processing. "local" -> "local[2]"
Your streaming interval is 3000 ms, so somewhere in your program you need to wait at least that long before expecting any output.
Spark Streaming needs some time to set up its listeners, and the file is being created immediately after ssc.start is issued. There's no guarantee that the filesystem listener is already in place, so I'd add some sleep(xx) after ssc.start (see the sketch after these pointers).
In Streaming, it's all about the right timing.
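Putting those pointers together, a rough sketch of how the test could be adjusted (the 5000 ms sleeps are arbitrary values chosen to be longer than the 3000 ms batch interval, and the output operation is registered before start() because the streaming graph cannot be changed once the context is running):
// In setUp(): one core for the receiver, one for the processing.
ssc = new JavaStreamingContext("local[2]", "test", new Duration(3000));
// In testCountWords():
JavaDStream<String> lines = ssc.textFileStream(tempDir.getAbsolutePath());
JavaPairDStream<String, Integer> wordCounts = starterApp.countWords(lines);
wordCounts.print();                  // register the output operation first
ssc.start();
Thread.sleep(5000);                  // give the file-stream listener time to come up
// ... write tmp.txt into tempDir here, as in the original test ...
Thread.sleep(5000);                  // wait past at least one batch interval before asserting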