Apache Spark: loading multiple files or a directory in Java

I am trying to load multiple files/directories in Spark using Java. I have found a few examples of how to do this in Scala; can someone give an example, with an explanation, of how to do it in Java?
In particular, I would like to use regex-like paths so that I do not have to specify a fully qualified name for each file. I can already pass a comma-separated list of fully qualified file names.
I am loading from the local file system; I don't know if that makes a difference.
The following is the code I have used to load the files:
SparkConf sparkConf = new SparkConf().setAppName("TableAggregator");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
// args[0] is the input path; the second argument is the minimum number of partitions
JavaRDD<String> lines = ctx.textFile(args[0], 1);

In Spark, the textFile() method takes a URI for the file (either a local path on the machine or an hdfs://, etc. URI).
You can run this method on directories, compressed files and wildcards:
ctx.textFile("data.txt");
ctx.textFile("/your/directory/");
ctx.textFile("/your/directory/*");
ctx.textFile("/your/directory/*.gz");
Be aware that the path you use for your input has to be valid on all the worker nodes, so you either have to copy the file to every worker or use a shared network-mounted file system.
So you can simply use a wildcard pattern, as in the sketch below.
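For example (the paths here are hypothetical), the comma-separated list you already use and wildcard patterns can even be combined in a single textFile() call:
// Hypothetical local paths: a glob pattern and a comma-separated list in one call
JavaRDD<String> lines = ctx.textFile("/data/2015/*/*.log,/data/archive/jan.log.gz");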

Related

Create File instance with URI pointing to HDFS

Is it possible to create a File instance by passing the URI of my HDFS to the File class's constructor? For example:
val conf = new Configuration()
conf.addResource(hdfsCoreSitePath)
conf.addResource(hdfsHDFSSitePath)
val uri = conf.get("fs.default.name")
val file = new File(uri + pathtothefile)
Then, with the File instance, I want to use the methods the File class provides, such as file.list(), which returns an array of strings naming the files and directories in the directory denoted by this abstract pathname. I tried the code, but file.list() returns null.
I would prefer not to use the approach below, because I am trying to write the same codebase for the normal file system and HDFS, to keep the code reusable.
val fileSystem = FileSystem.get(conf)
val status = fileSystem.listStatus(new Path(filepath))
status.map(x => ...
The regular built-in Java/Scala File APIs will not work for HDFS files; the protocol and implementation are too different. You have to use the Hadoop API to access HDFS files, as in your second example.
The good news, though, is that the Hadoop API also works for non-HDFS (regular) files, so that code is reusable. Just use a URI like file:///foo/bar for a local file.
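A minimal Java sketch of that reusable approach (the paths and class name are placeholders); FileSystem.get() picks the implementation from the URI scheme, so the same code handles both hdfs:// and file:// paths:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // e.g. "file:///foo/bar" for a local directory, "hdfs://namenode:8020/foo/bar" for HDFS
        URI uri = URI.create(args[0]);
        FileSystem fs = FileSystem.get(uri, conf);
        for (FileStatus status : fs.listStatus(new Path(uri))) {
            System.out.println(status.getPath());
        }
    }
}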
fs.default.name is deprecated. Use fs.defaultFS instead, and make sure the property is available in the core-site.xml file you are loading with the line below:
conf.addResource(hdfsCoreSitePath)
See the core-default.xml reference: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/core-default.xml
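As a quick sanity check (a Java sketch; the core-site.xml location is an assumed path, substitute your own), you can print what the configuration actually resolves:
Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")); // your hdfsCoreSitePath
System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));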

loading network path in java code

I'm loading a network path in my Java code. It does not keep the same format as in the configuration file; one slash goes missing.
Example:
String path = "//abckatte.com/abc/test";
File fileobj = new File(path);
Whenever I look at fileobj in the log messages, it is displayed as /abckatte.com/abc/test. One slash is missing.
I tried prepending two more slashes, like this:
String path = "////abckatte.com/abc/test";
but that does not work either.
You could make use of Apache Commons VFS 2, as it provides access to several file systems. Check the "Local Files" section of its documentation: a UNC network path is written as file://///somehost/someshare/afile.txt.
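A small sketch of what that could look like with Commons VFS 2, using the share name from the question (treat it as an untested sketch):
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class NetworkPathExample {
    public static void main(String[] args) throws Exception {
        FileSystemManager manager = VFS.getManager();
        // UNC-style network path written as a VFS "file" URI (note the five slashes)
        FileObject dir = manager.resolveFile("file://///abckatte.com/abc/test");
        System.out.println("exists: " + dir.exists());
        for (FileObject child : dir.getChildren()) {
            System.out.println(child.getName().getBaseName());
        }
    }
}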

Spark output filename and append on write

I know this question has been asked before, but I have not been able to find a clear, working answer.
result.saveAsTextFile(path);
When using Spark's saveAsTextFile, the output files are saved with names like "part-00000", "part-00001", etc.
Is it possible to change this to a custom name?
Is it possible for saveAsTextFile to append to an existing file rather than overwrite it?
I am coding in Java 7, and the output file system will be cloud storage (Azure, AWS).
1) There is no direct support in the saveAsTextFile method for controlling the output file name.
You can try using saveAsHadoopDataset to control the output file basename.
e.g.: instead of part-00000 you can get yourCustomName-00000.
Keep in mind that you cannot control the suffix 00000 with this method. It is something Spark assigns automatically to each partition while writing, so that each partition writes to a unique file.
To control that as well, as mentioned in the comments above, you have to write your own custom OutputFormat.
SparkConf conf = new SparkConf().setMaster("local").setAppName("yello");
JavaSparkContext sc = new JavaSparkContext(conf);

JobConf jobConf = new JobConf();
jobConf.set("mapreduce.output.basename", "customName");   // base name used instead of "part"
jobConf.set("mapred.output.dir", "outputPath");
jobConf.setOutputKeyClass(NullWritable.class);             // org.apache.hadoop.io.NullWritable
jobConf.setOutputValueClass(Text.class);                   // org.apache.hadoop.io.Text
jobConf.setOutputFormat(TextOutputFormat.class);           // org.apache.hadoop.mapred.TextOutputFormat

// saveAsHadoopDataset is defined on JavaPairRDD, so map each line to a (key, value) pair
// (Java 8 lambda shown; on Java 7 use an anonymous PairFunction)
JavaRDD<String> input = sc.textFile("inputDir");
input.mapToPair(line -> new Tuple2<>(NullWritable.get(), new Text(line)))
     .saveAsHadoopDataset(jobConf);
2) A workaround is to write the output as-is to your output location and then use Hadoop's FileUtil.copyMerge function to produce a single merged file, for example:
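A sketch of that workaround (the paths are placeholders; copyMerge exists only up to Hadoop 2.x, it was removed in 3.0):
// uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.FileSystem / FileUtil / Path
Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);

// Merge every part-* file under "outputPath" into a single output file
FileUtil.copyMerge(fs, new Path("outputPath"),
                   fs, new Path("merged/customName.txt"),
                   false,        // do not delete the source directory
                   hadoopConf, null);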

Print the content of streams (Spark streaming) in Windows system

I just want to print the content of the streams to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is there a problem related to Windows?
public static void main(String[] args) throws Exception {
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[2]")
.setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
.set("spark.executor.memory", "2g");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
dataStream.print();
jssc.start();
jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for monitoring Hadoop-compatible directories. This operation watches the provided directory, and as you add new files to it, it reads/streams the data from the newly added files.
You cannot read existing text/CSV files with textFileStream; or rather, you do not need streaming if you are just reading files.
My suggestion would be to monitor a directory (HDFS or the local file system), then add files to it and capture the content of these new files with textFileStream.
In your code you could replace "C://testStream//copy.csv" with "C://testStream", and once your Spark Streaming job is up and running, add the file copy.csv to the C://testStream folder and watch the output on the Spark console.
OR
You could write another command-line Scala/Java program that reads the files and sends their content over a socket (at a certain port), and then use socketTextStream to capture and read the data; a sketch follows below. Once you have read the data, you can apply further transformations or output operations.
You could also consider leveraging Flume.
Refer to the API documentation for more details.
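If you go the socket route, a minimal Java sketch could look like this (localhost and port 9999 are placeholder values; something like netcat or your own program has to write lines to that socket):
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
// Receive lines pushed to localhost:9999 and print the first elements of each batch
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
lines.print();
jssc.start();
jssc.awaitTermination();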
This worked for me on Windows 7 and Spark 1.6.3 (I have removed the rest of the code; the important part is how to define the folder to monitor):
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
print
...
This monitors the directory D:/tmp/data; ssc is my streaming context.
Steps:
Create a file, say 1.txt, in D:/tmp/data
Enter some text
Start the Spark application
Rename the file to data.txt (I believe any arbitrary name will do, as long as it is changed while the directory is being monitored by Spark)
One other thing I noticed: I had to change the line separator to Unix style (I used Notepad++), otherwise the file wasn't picked up.
Try the code below; it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");

how to use a file (containing many files' full paths) as input to a MapReduce job

I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple MapReduce program where I provide a folder as input.
However, I want to use a file as input: the file contains the full paths of all the other files to be processed by the mapper function.
Below is the file content:
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input to the MapReduce program as such a file, so that it starts processing each file listed inside? Thanks.
In your driver class, you can read in the file and add each line as an input path:
// Read allfiles.txt and put each line into a List (requires at least Java 7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);

// Loop through the file names and add each one as an input path
for (String file : files) {
    // This Path is org.apache.hadoop.fs.Path
    FileInputFormat.addInputPath(conf, new Path(file));
}
This is assuming that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on the HDFS.
I strongly recommend that you check that each file exists on HDFS before you add it as input, as in the sketch below.
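A sketch of how that check could be folded into the loop above, assuming conf is the old-API JobConf you already pass to FileInputFormat (with the new-API Job, use FileSystem.get(job.getConfiguration()) instead):
FileSystem fs = FileSystem.get(conf);
for (String file : files) {
    Path p = new Path(file);
    if (fs.exists(p)) {
        FileInputFormat.addInputPath(conf, p);
    } else {
        System.err.println("Skipping missing input: " + file);
    }
}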
Instead of creating a file with the paths to the other files, you could use globs.
In your example, you could have defined your input as -input /tmp/*/file?.txt
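The same glob also works if you set the input path programmatically, because FileInputFormat expands glob patterns when it lists its inputs:
// No driver-side loop needed: the pattern is expanded by FileInputFormat itself
FileInputFormat.addInputPath(conf, new Path("/tmp/*/file?.txt"));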
