Spark output filename and append on write - java

I know this question has been asked before, but I am unable to find a clear, working answer.
result.saveAsTextFile(path);
When using Spark's saveAsTextFile, the output files are named "part-00", "part-01", etc.
Is it possible to change this to a custom name?
Is it possible for saveAsTextFile to append to an existing file rather than overwriting it?
I am coding in Java 7, and the output file system would be in the cloud (Azure, AWS).

1) The saveAsTextFile method has no direct support for controlling the output file name.
You can try using saveAsHadoopDataset to control the output file basename,
e.g. instead of part-00000 you can get yourCustomName-00000.
Keep in mind that you cannot control the suffix 00000 with this method. Spark assigns it automatically for each partition while writing, so that each partition writes to a unique file.
To control that too, as mentioned in the comments, you have to write your own custom OutputFormat.
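SparkConf conf = new SparkConf();
conf.setMaster("local").setAppName("yello");
JavaSparkContext sc = new JavaSparkContext(conf);
JobConf jobConf = new JobConf();
jobConf.set("mapreduce.output.basename", "customName");
jobConf.set("mapred.output.dir", "outputPath");
JavaRDD<String> input = sc.textFile("inputDir");
// saveAsHadoopDataset is defined on JavaPairRDD, not JavaRDD, so key each line first.
// An anonymous class is used instead of a lambda because the question targets Java 7.
JavaPairRDD<NullWritable, Text> pairs = input.mapToPair(
    new PairFunction<String, NullWritable, Text>() {
        public Tuple2<NullWritable, Text> call(String line) {
            return new Tuple2<NullWritable, Text>(NullWritable.get(), new Text(line));
        }
    });
pairs.saveAsHadoopDataset(jobConf);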
2) A workaround is to write the output as is to your output location and then use Hadoop's FileUtil.copyMerge function to combine the part files into a single merged file.
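To illustrate the merge idea from point 2 without depending on Hadoop, here is a minimal plain-Java sketch. It assumes the part files sit on a local file system; PartFileMerger and mergeParts are hypothetical names, not part of Spark or Hadoop. Because the target is opened in append mode, running it again adds to the existing file instead of overwriting it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Spark or Hadoop class); local file system only.
class PartFileMerger {

    // Appends every part-* file in outputDir, in name order, to target.
    // The target is created if it is missing and appended to if it exists.
    static void mergeParts(Path outputDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<Path>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ...
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Path part : parts) {
                Files.copy(part, out);
            }
        }
    }
}
```

For cloud storage the same loop would have to go through the store's own client or the Hadoop FileSystem API, and many object stores do not support appending to an existing object.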

Related

How can I get the file path for data shard in the Mapper of a Mapreduce job?

I have a mapreduce job, where the file input path is: /basedirectory/*/*.txt
Inside the basedirectory, I have different subfolders (CaseA, CaseB etc), each of which contain hdfs text files.
In the map phase of the job, I want to find out where exactly the data shard came from (e.g. CaseA). How can I achieve that?
I've done something similar for mapreduce jobs with more than 1 input hbase tables where I use context.getInputSplit().getTableName() to find the actual table name but not sure what to do for HDFS input files.
You can get the input split using context.getInputSplit() (where context is the Mapper.Context), cast it to FileSplit, and then call getPath() on it to get the file path.
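Inside a Mapper, the sub-folder name (e.g. CaseA) should be available from the split's Path via getParent().getName(). If you want the same logic on a plain string, for example to unit test it outside Hadoop, a rough sketch could look like this. ShardOrigin and parentFolderName are hypothetical names, and the input is assumed to be a valid URI-style path string.

```java
import java.net.URI;

// Hypothetical helper: derives the sub-folder (e.g. CaseA) from a file path string.
class ShardOrigin {

    // Returns the name of the directory that directly contains the file,
    // or an empty string if the path has no named parent directory.
    static String parentFolderName(String filePath) {
        // Strip any scheme and authority such as hdfs://namenode:8020
        String path = URI.create(filePath).getPath();
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash <= 0) {
            return "";
        }
        int prevSlash = path.lastIndexOf('/', lastSlash - 1);
        return path.substring(prevSlash + 1, lastSlash);
    }
}
```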

Print the content of streams (Spark streaming) in Windows system

I just want to print the content of streams to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to Windows?
public static void main(String[] args) throws Exception {
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[2]")
.setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
.set("spark.executor.memory", "2g");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
dataStream.print();
jssc.start();
jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for monitoring Hadoop-compatible directories. It watches the provided directory and, as you add new files to it, reads and streams the data from the newly added files.
You cannot read a single text/CSV file by pointing textFileStream at it; or rather, you do not need streaming at all if you are just reading files.
My suggestion would be to monitor a directory (on HDFS or the local file system), then add files to it and capture the content of these new files using textFileStream.
In your code, you could replace "C://testStream//copy.csv" with "C://testStream" and, once your Spark Streaming job is up and running, add the file copy.csv to the C://testStream folder and watch the output on the Spark console.
Alternatively, you could write another command-line Scala/Java program that reads the files and sends their content over a socket (on a given port), then use socketTextStream to capture and read the data. Once you have read the data, you can apply further transformations or output operations.
You could also consider using Flume.
Refer to the API documentation for more details.
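The socket alternative could be sketched roughly as follows. This is a plain-Java feeder, not Spark code: FileSocketFeeder and serveOnce are hypothetical names, and the port passed on the command line is whatever you choose. It sends UTF-8 text with one record per newline-terminated line, which is the format socketTextStream reads.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical feeder program; class and method names are illustrative only.
class FileSocketFeeder {

    // Waits for one client on the server socket and sends it every line of the file.
    static void serveOnce(ServerSocket server, Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        try (Socket client = server.accept();
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.print(line);
                out.print('\n'); // one record per newline-terminated line
            }
            out.flush();
        }
    }

    // Usage: java FileSocketFeeder <port> <file>
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(Integer.parseInt(args[0]))) {
            serveOnce(server, Paths.get(args[1]));
        }
    }
}
```

On the Spark side, the stream would then be created with jssc.socketTextStream(host, port), using the same port the feeder listens on.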
This worked for me on Windows 7 and Spark 1.6.3 (the rest of the code is removed; the important part is how to define the folder to monitor):
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors the directory D:/tmp/data; ssc is my streaming context.
Steps:
Create a file, say 1.txt, in D:/tmp/data
Enter some text
Start the Spark application
Rename the file to data.txt (I believe any arbitrary name will do, as long as it is changed while the directory is monitored by Spark)
One other thing I noticed is that I had to change the line separators to Unix style (I used Notepad++); otherwise the file wasn't getting picked up.
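The manual steps above (create the file elsewhere, convert the line endings, then make it appear in the monitored folder in one go) could be automated with a small plain-Java helper along these lines. StreamFileDropper and drop are hypothetical names, and the sketch assumes the staging folder and the watched folder are on the same volume so that an atomic move is possible.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper: stages a file with Unix line endings, then moves it into the watched folder.
class StreamFileDropper {

    // Rewrites source with "\n" separators into stagingDir, then moves the result
    // into watchedDir so that it appears there as a complete, freshly written file.
    static Path drop(Path source, Path stagingDir, Path watchedDir) throws IOException {
        // readAllLines accepts \r\n, \n and \r as line terminators
        List<String> lines = Files.readAllLines(source, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        Path staged = stagingDir.resolve(source.getFileName());
        Files.write(staged, sb.toString().getBytes(StandardCharsets.UTF_8));
        Path target = watchedDir.resolve(source.getFileName());
        // Assumes both folders are on the same volume; otherwise ATOMIC_MOVE fails
        return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Rewriting the file in the staging folder also gives it a fresh modification time, which may matter because the file stream only considers files it regards as new.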
Try the code below; it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");

How to use a file (containing the full paths of many files) as input to a MapReduce job

I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple mapreduce program where I provide a folder as input to the MapReduce program.
However, I want to use a file (containing full paths) as input; this file lists all the other files to be processed by the mapper function.
Below is the file content,
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input path to the MapReduce program as a file, so that it processes each file listed inside? Thanks.
In your driver class, you can read in the file, and add each line as a file for input:
//Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);
//Loop through the file names and add them as input
for(String file : files) {
//This Path is org.apache.hadoop.fs.Path
FileInputFormat.addInputPath(conf, new Path(file));
}
This is assuming that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on the HDFS.
I strongly recommend that you check that each file exists on the HDFS before you add it as input.
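The existence check could be sketched as below. This version assumes the listed files are on the local file system and uses java.nio for the check; for files on HDFS the equivalent check would go through Hadoop's FileSystem.exists(Path). InputListReader and existingPaths are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: filters the list file down to paths that actually exist.
class InputListReader {

    // Reads the list file, trims each line, skips blank lines,
    // and returns only the paths that exist on the local file system.
    static List<String> existingPaths(Path listFile) throws IOException {
        List<String> result = new ArrayList<String>();
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            if (Files.exists(Paths.get(candidate))) {
                result.add(candidate);
            } else {
                System.err.println("Skipping missing input: " + candidate);
            }
        }
        return result;
    }
}
```

Each returned entry can then be passed to FileInputFormat.addInputPath in the driver loop.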
Instead of creating a file with path to other files, you could use globs.
In your example, you could have defined your inputs as -input /tmp/*/file?.txt
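To get a feel for what such a pattern matches before launching a job, you can try it against sample paths with java.nio's glob matcher. This is only an approximation: java.nio glob rules are similar to, but not guaranteed identical with, the globbing Hadoop applies to input paths. GlobPreview is a hypothetical name, and the sketch assumes a Unix-style file system.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for trying out a glob against sample paths.
class GlobPreview {

    // True if the path matches the glob under java.nio glob rules:
    // '*' matches within one directory level, '?' matches exactly one character.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }
}
```

With the pattern from the example, /tmp/aaa/file1.txt matches, while /tmp/aaa/file10.txt does not because ? stands for a single character.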

Apache Spark: loading multiple files or a directory in java

I am trying to load multiple files/directories in Spark using Java. I have found a few examples of how to do this in Scala; can someone give an example, with an explanation, of how to do this in Java?
In particular, I would like to use regex-like paths, so that I do not have to specify a fully qualified name for each file. I can already pass a comma-separated list of fully qualified file names.
I am loading from the local file system; I don't know if this makes a difference.
The following is the code I have used to load the files:
SparkConf sparkConf = new SparkConf().setAppName("TableAggregator");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile(args[0], 1);
In Spark, the textFile() method takes a URI for the file (either a local path on the machine or an hdfs://, etc. URI).
You can run this method on directories, compressed files, and wildcards:
ctx.textFile("data.txt");
ctx.textFile("/your/directory/");
ctx.textFile("/your/directory/*");
ctx.textFile("/your/directory/*.gz");
Be aware that when you use a path for your input, it has to be the same path for all the worker nodes. So you have to copy the file to all workers or use a shared network-mounted file system.
So you can use a pattern with the wildcard to do it simply.

Get Input path from job conf in hadoop

I am setting a path as input location to conf
FileInputFormat.setInputPaths(conf, new Path("path/to/folder"));
How can I retrieve this location from conf? I am trying to implement my own RecordReader.
Thanks in advance...
The property set by this call is mapred.input.dir (named mapreduce.input.fileinputformat.inputdir in the newer API), so this should work for you:
conf.get("mapred.input.dir");
On a side note, your record reader should act upon the input split it is given in the initialize(InputSplit, TaskAttemptContext) method, as the folder you pass to setInputPaths will actually resolve to a number of input splits, typically one for each file in the folder (and possibly multiple input splits for larger, splittable files).
FileInputFormat based input formats are passed a FileSplit to the initialize method, and you should be able to pull out the actual file to be processed from the FileSplit.getPath() method.
