I just want to print the contents of a stream to the console. I wrote the following code, but it does not print anything. Can anyone help me read a text file as a stream in Spark? Is the problem related to Windows?
public static void main(String[] args) throws Exception {
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[2]")
.setSparkHome("C:\\Spark\\spark-1.5.1-bin-hadoop2.6")
.set("spark.executor.memory", "2g");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> dataStream = jssc.textFileStream("C://testStream//copy.csv");
dataStream.print();
jssc.start();
jssc.awaitTermination();
}
UPDATE: The content of copy.csv is
0,0,12,5,0
0,0,12,5,0
0,1,2,0,42
0,0,0,0,264
0,0,12,5,0
textFileStream is for monitoring Hadoop-compatible directories. It watches the given directory and, as you add new files to it, reads and streams the data from those newly added files.
You cannot read an existing text/CSV file with textFileStream; or rather, you do not need streaming at all if you are just reading files.
My suggestion would be to monitor a directory (on HDFS or the local file system), then add files to it and capture the content of those new files using textFileStream.
In your code, you could replace "C://testStream//copy.csv" with "C://testStream" and, once your Spark Streaming job is up and running, add copy.csv to the C://testStream folder and watch the output on the Spark console.
OR
Alternatively, you could write a separate command-line Scala/Java program that reads the files and sends their content over a socket (on a certain port), and then use socketTextStream to capture and read the data. Once you have read the data, you can apply further transformations or output operations.
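A minimal sketch of such a helper program, in plain Java with no Spark dependency. The class name FileToSocket and the command-line arguments are made up for illustration. It waits for one client and sends the file line by line, which is the newline-delimited text that socketTextStream expects:

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class FileToSocket {

    // Waits for one client on the given server socket and sends it every line.
    static void serveOnce(ServerSocket server, List<String> lines) {
        try (Socket client = server.accept();
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.print(line);
                out.print('\n'); // always "\n", never the platform separator
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical arguments: the file to replay and the port to listen on.
        if (args.length < 2) {
            System.err.println("usage: FileToSocket <file> <port>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        try (ServerSocket server = new ServerSocket(Integer.parseInt(args[1]))) {
            System.out.println("Waiting for a client on port " + args[1]);
            serveOnce(server, lines);
        }
    }
}
```

On the Spark side you would then read from the same host and port, e.g. jssc.socketTextStream("localhost", 9999), assuming the helper was started with port 9999 first.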
You could also consider using Flume.
Refer to the API documentation for more details.
This worked for me on Windows 7 with Spark 1.6.3 (I have removed the rest of the code; the important part is how to define the folder to monitor):
val ssc = ...
val lines = ssc.textFileStream("file:///D:/tmp/data")
...
lines.print()
...
This monitors the directory D:/tmp/data; ssc is my streaming context.
Steps:
1) Create a file, say 1.txt, in D:/tmp/data
2) Enter some text
3) Start the Spark application
4) Rename the file to data.txt (I believe any arbitrary name will do, as long as it is changed while the directory is being monitored by Spark)
One other thing I noticed is that I had to change the line separator to Unix style (I used Notepad++); otherwise the file wasn't picked up.
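The manual steps can also be scripted. The sketch below is only an illustration, and I have not checked it against Spark 1.6.3 on Windows: it writes the text with Unix line endings into a staging folder and then moves the finished file into the monitored folder in one step. The class name StagedFileDrop and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StagedFileDrop {

    // Converts Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
    static String toUnixLineEndings(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Writes the text into stagingDir, then moves the finished file into
    // watchedDir so the monitor never sees a half-written file.
    static Path drop(String text, Path stagingDir, Path watchedDir, String fileName)
            throws IOException {
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, toUnixLineEndings(text).getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE needs both directories to be on the same file system.
        return Files.move(staged, watchedDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical arguments: staging folder, monitored folder (e.g. D:/tmp/data).
        if (args.length < 2) {
            System.err.println("usage: StagedFileDrop <stagingDir> <watchedDir>");
            return;
        }
        Path staging = Files.createDirectories(Paths.get(args[0]));
        Path watched = Files.createDirectories(Paths.get(args[1]));
        String name = "data-" + System.currentTimeMillis() + ".txt"; // unique per run
        System.out.println(drop("0,0,12,5,0\r\n0,1,2,0,42\r\n", staging, watched, name));
    }
}
```

Because the file is written just before it is moved, its modification time falls after the application started, which seems to be what the rename in step 4 achieves by hand.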
Try the code below; it works:
JavaDStream<String> dataStream = jssc.textFileStream("file:///C:/testStream/");
I have tried to transfer a file from an Azure container to a GCS bucket, but ran into the following issues:
The order of the records in the destination file differs from the source file, because the pipeline processes records in parallel.
I have to write a lot of custom code to give the GCS destination file a custom name, because the pipeline assigns a default name.
Is there any way an Apache Beam pipeline can transfer the file itself without dealing with its contents (so that the issues above won't happen)? I need to transfer multiple files from an Azure container to a GCS bucket.
Below is the code I am currently using to transfer the files:
String format = LocalDateTime.now().format(DateTimeFormatter.ofPattern("YYYY_MM_DD_HH_MM_SS3")).toString();
String connectionString = "<<AZURE_STORAGE_CONNECTION_STRING>>";
PipelineOptions options = PipelineOptionsFactory.create();
options.as(BlobstoreOptions.class).setAzureConnectionString(connectionString);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("azfs://storageaccountname/containername/CSVSample.csv"))
.apply("",FileIO.<String>write().to("azfs://storageaccountname/containername/"+format+"/").withNumShards(1).withSuffix(".csv")
.via(TextIO.sink()));
p.run().waitUntilFinish();
You should be able to use FileIO transforms for this purpose.
For example (untested pseudocode),
p.apply(FileIO.match().filepattern("azfs://storageaccountname/containername/CSVSample.csv"))
.apply(FileIO.readMatches())
.apply(ParDo.of(new MyWriteDoFn()));
Here, MyWriteDoFn would be a DoFn that reads the bytes of a single file (via AzureBlobStoreFileSystem) and writes them to GCS (via GcsFileSystem). Rather than invoking the underlying FileSystem implementations directly, you can use the static methods of the FileSystems class with the appropriate path prefix.
So after 36 hours of experimenting with this and that, I have finally managed to get a cluster up and running, but now I am confused about how to write files to it using Java. A tutorial said this program should be used, but I don't understand it at all and it doesn't work either.
public class FileWriteToHDFS {
public static void main(String[] args) throws Exception {
//Source file in the local file system
String localSrc = args[0];
//Destination file in HDFS
String dst = args[1];
//Input stream for the file in local file system to be written to HDFS
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
//Get configuration of Hadoop system
Configuration conf = new Configuration();
System.out.println("Connecting to -- "+conf.get("fs.defaultFS"));
//Destination file in HDFS
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
//Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);
System.out.println(dst + " copied to HDFS");
}
}
My confusion is: how does this piece of code identify the specifics of my cluster? How will it know where the masternode is and where the slavenodes are?
Furthermore, when I run this code with a local file as the source and leave the destination blank, or provide only a file name, the program writes the file back to my local storage and not to the location that I defined as storage space for my namenodes and datanodes. Should I be providing this path manually? How does this work? Please suggest a blog that can help me understand it better, or a minimal working example to get started with.
First off, you'll need to add some Hadoop libraries to your classpath. Without those, no, that code won't work.
How will it know where the masternode is and where the slavenodes are?
From the new Configuration(); and subsequent conf.get("fs.defaultFS").
It reads the core-site.xml from the directory given by the HADOOP_CONF_DIR environment variable and returns the address of the namenode. The client only needs to talk to the namenode to receive the locations of the datanodes, to which the file blocks are then written.
the program writes the file back to my local storage
It's not clear where you've configured the filesystem, but the default is file://, your local disk. You change this in core-site.xml. If you follow the Hadoop documentation, the pseudo-distributed cluster setup mentions this.
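To see why a bare file name ends up on the local disk, here is a simplified model of the lookup in plain Java. It is not Hadoop's code: DefaultFsDemo and effectiveScheme are names made up for illustration, and hdfs://namenode:8020 is a hypothetical fs.defaultFS value. A destination that carries its own scheme wins; one without a scheme falls back to whatever fs.defaultFS says.

```java
import java.net.URI;

public class DefaultFsDemo {

    // Simplified model: a path with its own scheme wins; otherwise the scheme
    // of the configured default file system (fs.defaultFS) is used.
    static String effectiveScheme(String path, String defaultFs) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : URI.create(defaultFs).getScheme();
    }

    public static void main(String[] args) {
        String builtInDefault = "file:///";             // in effect when core-site.xml is not picked up
        String clusterDefault = "hdfs://namenode:8020"; // hypothetical value from core-site.xml

        System.out.println(effectiveScheme("out.txt", builtInDefault)); // file
        System.out.println(effectiveScheme("out.txt", clusterDefault)); // hdfs
        System.out.println(effectiveScheme("hdfs://namenode:8020/user/me/out.txt", builtInDefault)); // hdfs
    }
}
```

So either make sure the client actually loads your core-site.xml, or pass a full hdfs:// URI as the destination.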
It's also not very clear why you need your own Java code when a simple hdfs dfs -put will do the same thing.
I know this question has been asked before, but I am unable to find a clear, working answer.
result.saveAsTextFile(path);
When using Spark's saveAsTextFile, the output is saved under the names "part-00", "part-01", etc.
Is it possible to change this to a custom name?
Is it possible for saveAsTextFile to append to an existing file rather than overwrite it?
I am coding in Java 7; the output file system will be in the cloud (Azure, AWS).
1) There is no direct support in the saveAsTextFile method for controlling the output file name.
You can try using saveAsHadoopDataset to control the output file's base name,
e.g. instead of part-00000 you can get yourCustomName-00000.
Keep in mind that you cannot control the 00000 suffix with this method. Spark assigns it automatically to each partition while writing, so that each partition writes to a unique file.
To control that too, as mentioned in the comments above, you have to write your own custom OutputFormat.
SparkConf conf=new SparkConf();
conf.setMaster("local").setAppName("yello");
JavaSparkContext sc=new JavaSparkContext(conf);
JobConf jobConf=new JobConf();
jobConf.set("mapreduce.output.basename", "customName");
jobConf.set("mapred.output.dir", "outputPath");
JavaRDD<String> input = sc.textFile("inputDir");
input.saveAsHadoopDataset(jobConf);
2) A workaround would be to write the output as-is to your output location and then use the Hadoop FileUtil.copyMerge function to produce a single merged file.
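To illustrate what the merge step amounts to, here is a local-file-system sketch in plain Java. It is not FileUtil.copyMerge itself, and the class name MergeParts is made up: it concatenates the part-* files in name order into one file whose name you choose. For Azure or AWS storage you would need the Hadoop FileSystem API instead of java.nio.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MergeParts {

    // Concatenates every part-* file in outputDir, in name order, into target.
    static void merge(Path outputDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips _SUCCESS and hidden .crc files, which do not start with "part-".
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ... keeps partition order
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // appends the raw bytes of this part
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical arguments: the saveAsTextFile output directory and the
        // custom-named file to create.
        if (args.length < 2) {
            System.err.println("usage: MergeParts <outputDir> <targetFile>");
            return;
        }
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The target should live outside the output directory, or at least not be named part-*, so that a second run does not pick it up as input.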
I have the following code:
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[4]")
.set("spark.executor.memory", "2g")
.set("spark.driver.allowMultipleContexts", "true");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> trainingData = jssc.textFileStream("filesDirectory");
trainingData.print();
jssc.start();
jssc.awaitTermination();
Unfortunately, to stream any file that already exists in the directory, I have to edit the file and rename it after starting the streaming context; otherwise it is not processed.
Do I have to edit and rename each file to get it processed, or is there another way to process existing files, for example by just editing and saving them?
P.S. When I move a new file into this directory, I also need to edit and rename it before it is streamed!
Try touching the file before moving it to the destination directory.
Below is what the Javadoc says.
Identify whether the given path is a new file for the batch of currentTime. For it to be accepted, it has to pass the following criteria.
- It must pass the user-provided file filter.
- It must be newer than the ignore threshold. It is assumed that files older than the ignore threshold have already been considered or are existing files before start (when newFileOnly = true).
- It must not be present in the recently selected files that this class remembers.
- It must not be newer than the time of the batch (i.e. currentTime for which this file is being tested). This can occur if the driver was recovered, and the missing batches (during downtime) are being generated. In that case, a batch of time T may be generated at time T+x. Say x = 5. If that batch T contains file of mod time T+5, then bad things can happen. Let's say the selected files are remembered for 60 seconds. At time T+61, the batch of time T is forgotten, and the ignore threshold is still T+1. The files with mod time T+5 are not remembered and cannot be ignored (since T+5 > T+1). Hence they can get selected as new files again. To prevent this, files whose mod time is more than current batch time are not considered.
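The criteria can be restated as a small predicate. This is a simplified sketch for illustration only, not Spark's implementation: the class name NewFileCheck, the parameter names, and the 60-second window are assumptions. It shows why an old file is skipped and why touching it helps.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Predicate;

public class NewFileCheck {

    // Simplified restatement of the four criteria; all times are in milliseconds.
    static boolean isNewFile(String path, long modTime, long ignoreThreshold, long batchTime,
                             Predicate<String> userFilter, Set<String> recentlySelected) {
        if (!userFilter.test(path)) return false;          // 1. user-provided filter
        if (modTime <= ignoreThreshold) return false;      // 2. not newer than the threshold
        if (recentlySelected.contains(path)) return false; // 3. already selected recently
        if (modTime > batchTime) return false;             // 4. newer than the batch itself
        return true;
    }

    public static void main(String[] args) {
        long batchTime = 100_000L;
        long ignoreThreshold = batchTime - 60_000L; // hypothetical 60-second window
        Predicate<String> acceptAll = p -> true;
        Set<String> none = Collections.emptySet();

        // Saved long before the stream started: treated as an existing file.
        System.out.println(isNewFile("old.csv", 10_000L, ignoreThreshold, batchTime, acceptAll, none)); // false
        // Same file after a touch moved its modification time into the window.
        System.out.println(isNewFile("old.csv", 99_000L, ignoreThreshold, batchTime, acceptAll, none)); // true
    }
}
```

From Java, a touch can be done with Files.setLastModifiedTime(path, FileTime.fromMillis(System.currentTimeMillis())) before the file is moved into the monitored directory.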
JavaStreamingContext.textFileStream returns a FileInputDStream, which is used to monitor a folder when the files in the folder are being added/updated regularly. You will get the notification after every two seconds, only when a new file is added/updated.
If your intent is just to read the file, you can use SparkContext.textFile instead.
Here is the documentation from the source code of JavaStreamingContext.textFileStream():
/**
* Create a input stream that monitors a Hadoop-compatible filesystem
* for new files and reads them as text files (using key as LongWritable, value
* as Text and input format as TextInputFormat). Files must be written to the
* monitored directory by "moving" them from another location within the same
* file system. File names starting with . are ignored.
*/
I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple mapreduce program where I provide a folder as input to the MapReduce program.
However, I want to use a single file as input instead; this file contains the full paths of all the other files to be processed by the mapper function.
Below is the file content:
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify a file as the input path to the MapReduce program, so that it processes each file listed inside? Thanks.
In your driver class, you can read in the file, and add each line as a file for input:
//Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);
//Loop through the file names and add them as input
for(String file : files) {
//This Path is org.apache.hadoop.fs.Path
FileInputFormat.addInputPath(conf, new Path(file));
}
This assumes that your allfiles.txt is local to the node on which your MR job is being run, but it's only a small change if allfiles.txt is actually on HDFS.
I strongly recommend that you check that each file exists on HDFS before you add it as input.
Instead of creating a file with path to other files, you could use globs.
In your example, you could have defined your inputs as -input /tmp/*/file?.txt
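To get a feel for what that pattern selects, here is a quick check using Java's own glob matcher. Note the assumptions: this is java.nio's glob implementation on a Unix-style default file system, not Hadoop's, though * and ? behave the same way for this pattern. GlobDemo is a made-up class name.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {

    // Returns true if the path matches the glob under java.nio rules.
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        String glob = "/tmp/*/file?.txt";
        System.out.println(matches(glob, "/tmp/aaa/file1.txt"));   // true
        System.out.println(matches(glob, "/tmp/aaa/file10.txt"));  // false: ? is exactly one character
        System.out.println(matches(glob, "/tmp/aaa/x/file1.txt")); // false: * does not cross a directory
    }
}
```

If your file names do not follow such a pattern, the allfiles.txt approach above is the more flexible one.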