Saving spark dataset to an existing csv file - java

I am trying to save the contents of a dataset to CSV using
df.coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial.csv");
My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and writes a CSV file inside of it. When I run it again, it creates another CSV file inside the trial.csv folder. But I just want it to keep appending to one CSV file, which I am unable to do.
I know we can do it with a script outside of the program, but can we achieve it from inside our code? I am using Java.

Appending to an existing file is hard to do for a distributed, multi-threaded application; it would turn something parallelised into a sequential task. The way this is usually handled is that each task (or thread) in Spark persists its own file under the specified path, so the path ends up being a folder containing all the files. To read them, you read the complete folder.
If your data is not big and you really need a single file, try the repartition (or coalesce) method with 1 partition. This makes a single task write the new data, but it will still never append to the previous files.
You have to be careful, but you can do something like this:
df.union(spark.read().csv(PATH + "/trial.csv"))
  .coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial_auxiliar.csv");
Then move it to the previous folder, either with Spark or with a Hadoop move command. Never write and read from the same folder within the same job, and keep in mind that this won't guarantee the data order.
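A minimal sketch of that final move using the Hadoop FileSystem API (reusing the PATH variable from the snippets above; deleting the old folder first is an assumption about how you want to retire it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path target = new Path(PATH + "/trial.csv");
Path auxiliar = new Path(PATH + "/trial_auxiliar.csv");

FileSystem fs = FileSystem.get(new Configuration());
// rename() fails if the destination already exists, so remove the old folder first
fs.delete(target, true);       // recursive delete of the previous output folder
fs.rename(auxiliar, target);   // move the auxiliary folder into its place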

Related

Java: Read 10,000 excel files and write to 1 master file in Java using Apache POI

I searched on Google but couldn't find a proper answer for the problem mentioned below. Pardon me if this is a duplicate question.
So, coming to the question: I have to read multiple Excel files in Java and generate one final Excel report file out of them.
There are 2 folders:
Source folder: contains multiple Excel files (probably 10,000 files).
Destination folder: will hold one final master Excel file after all the files from the Source folder have been read.
For each Excel file read from the Source folder, the master file in the Destination folder will have one row.
I am planning to use Apache POI to read and write the Excel files in Java.
I know it's easy to read and write files in Java using POI, but my question is: given this scenario, where there are almost 10,000 files to read and write into one single master file, what is the best approach, considering the time taken and the CPU used by the program? Reading one file at a time will be too time-consuming.
So, I am planning to use threads to process the files in batches of, say, 100 files at a time. Can anybody please point me to some resources or suggest how to proceed with this requirement?
Edited:
I have already written the program to read and write the file using POI. The code for the same is mentioned below:
// Loop through the directory, fetching each file.
File sourceDir = new File("SourceFolder");
System.out.println("The current directory is = " + sourceDir);
if (sourceDir.exists()) {
    if (sourceDir.isDirectory()) {
        String[] filesInsideThisDir = sourceDir.list();
        numberOfFiles = filesInsideThisDir.length;
        for (String filename : filesInsideThisDir) {
            System.out.println("(processFiles) The file name to read is = " + filename);
            // Read each file
            readExcelFile(filename);
            // Write the data
            writeMasterReport();
        }
    } else {
        System.out.println("(processFiles) Source directory specified is not a directory.");
    }
} else {
    System.out.println("(processFiles) Source directory specified does not exist.");
}
Here, the SourceFolder contains all the Excel files to read. I am looping through this folder, fetching one file at a time, reading its contents and then writing to one master Excel file.
The readExcelFile() method reads each Excel file and creates a List containing the data for the row to be written to the master Excel file.
The writeMasterReport() method writes the data read from every Excel file.
The program is running fine. My question is: is there any way I can optimize this code by using threads for reading the files? I know that there is only one file to write to, and that part cannot be done in parallel. If the SourceFolder contains 10,000 files, reading and writing this way will take a lot of time to execute.
The size of each input file will be around a few hundred KB.
So, my question is: can we use threads to read the files in batches, say 100 or 500 files per thread, and then write the data for each thread? I know the write part will need to be synchronized. This way at least the read and write time will be minimized. Please let me know your thoughts on this; a rough sketch of what I have in mind is below.
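Here is a minimal sketch of the batched idea, assuming readExcelFile() is adjusted to return the row it extracts and writeMasterReport() is adjusted to take that row as an argument (the pool size is just a placeholder, and the surrounding method would need to handle InterruptedException/ExecutionException):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

File sourceDir = new File("SourceFolder");
File[] files = sourceDir.listFiles();
ExecutorService pool = Executors.newFixedThreadPool(4); // pool size to be tuned

// Submit the reads: each task only parses one workbook and returns its row.
List<Future<List<String>>> pending = new ArrayList<>();
for (File file : files) {
    pending.add(pool.submit(() -> readExcelFile(file.getName())));
}

// Only this loop touches the master workbook, so the write stays single-threaded.
for (Future<List<String>> row : pending) {
    writeMasterReport(row.get());
}
pool.shutdown();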
With 10k files of ~100 KB each we're talking about reading roughly 1 GB of data. If the processing is not overly complex (and it seems it isn't), then your bottleneck will be IO.
So it most probably does not make sense to parallelize reading and processing the files, since IO has an upper limit.
Parallelizing would have made sense if the processing were complex and the bottleneck. That does not seem to be the case here.

Update file after FlatFileItemReader in Spring Batch

I currently have the following processing in a Spring Batch job:
FlatFileItemReader reads a CSV file
Processor does some work
FlatFileItemWriter creates a mirror of the read file, but updates the file to reflect processing
I don't want to write to a new file, but I want to update the same file that is being read during processing.
My question is, is there a typical method in Spring to use the FlatFileItemReader and then update that same file per row in the processor at runtime?
Thanks for any help.
You can always write a custom writer in Spring Batch, as in the example below, where you read data from the file into memory and then update the same file with the data you intend to write.
https://github.com/pkainulainen/spring-batch-examples/tree/master/spring/src/main/java/net/petrikainulainen/springbatch/custom/in
More than that, FlatFileItemReader is not thread-safe. There are of course hacks to achieve thread safety, but using such hacks is not good practice; it is always better to create a custom writer.
The short answer is no, Spring Batch doesn't allow you to overwrite the same file you are reading from.
A better practice is to write an intermediate file and then perform a delete/rename.
Writing a temporary file is not a 'bad thing', especially if you are working with a huge input file and an OutOfMemoryException is around the corner; a temporary file also makes your step restartable and lets you manually retrieve the translated file if the delete/rename process fails.
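For the delete/rename part, a minimal sketch of a Tasklet run as a follow-up step could look like this (the file names are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ReplaceInputFileTasklet implements Tasklet {

    private static final Path INPUT = Paths.get("data/input.csv");
    private static final Path TEMP = Paths.get("data/input.csv.tmp");

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Replace the original input with the intermediate file written by the previous step.
        Files.move(TEMP, INPUT, StandardCopyOption.REPLACE_EXISTING);
        return RepeatStatus.FINISHED;
    }
}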

Accessing resource file when running as a DataflowPipelineRunner in Google Dataflow

In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources, next to the src folder.
The src folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists())
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class called 'SensorValues' I made:
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class I made, using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice would be for using static resource files to enrich your pipeline.
You're right that the problem is that execution of the ParDo is distributed to multiple workers. They don't have the local file, and they may not have the contents of the map.
There are a few options here:
Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of this file increases (since it supports splitting it up into pieces and doing lookups), while Option 2 likely means less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
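A rough sketch of option 2, assuming the DBF file is packaged on the classpath (for example under src/main/resources), the 1.x Dataflow SDK from the question, and a SensorValue constructor that takes the message plus the map (that constructor and the parseDbf() helper are assumptions):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

public class ParseSensorValueFn extends DoFn<String, SensorValue> {

    private transient Map<String, ArrayList<Double>> sensorToCoordinates;

    @Override
    public void startBundle(Context c) throws IOException {
        // Load the packaged resource on the worker itself instead of relying on a
        // static field that was only populated on the machine submitting the job.
        try (InputStream in = getClass().getResourceAsStream("/xxx.dbf")) {
            sensorToCoordinates = parseDbf(in);
        }
    }

    @Override
    public void processElement(ProcessContext c) {
        c.output(new SensorValue(c.element(), sensorToCoordinates));
    }

    private Map<String, ArrayList<Double>> parseDbf(InputStream in) {
        // Hypothetical: parse the DBF contents into the sensor-to-coordinates map.
        return new HashMap<>();
    }
}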

Write into Hadoop File System in parallel

I'm quite new to Hadoop, and I have an issue...
I have an output file (the result of a task) and I would like to modify it. As it can be a very big file, I want to do this operation in parallel.
Note: I don't want to simply append data; I want to modify it structurally (even the size), so I have to read it completely and write it back.
Reading the file isn't a problem: I give each worker a part of the file, and they simply read it and make the changes they want.
But writing the new file back to HDFS seems more tricky.
My question is: how can I create a big file in HDFS and have my workers write into it simultaneously? (I know the size of each part, so two workers will never attempt to write at the same position.)
Thanks in advance :)
Since the job is to read the input files and write selected content from them to an output location in parallel, this is a mapper-only job.
Create a Mapper class to read the file and perform your operations on it.
Set the number of mappers in your driver class:
job.setNumMapTasks(n); // n = number of mappers
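A minimal sketch of such a map-only job using the newer mapreduce API (the paths and the transform() placeholder are assumptions); setting the reducer count to zero means each mapper writes its own part-file, so the writes happen in parallel:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RewriteJob {

    public static class RewriteMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Apply the structural modification to each record of the split.
            context.write(NullWritable.get(), new Text(transform(value.toString())));
        }

        private String transform(String line) {
            return line; // placeholder for the real modification
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rewrite");
        job.setJarByClass(RewriteJob.class);
        job.setMapperClass(RewriteMapper.class);
        job.setNumReduceTasks(0); // mapper-only: no reduce phase
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}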

How are large directory trees processed in using the Spark API?

I'm a new Spark user and I'm trying to process a large set of XML files sitting on an HDFS file system. There are about 150k files, totalling about 28 GB, on a "development" cluster of one machine (actually a VM).
The files are organised into a directory structure in HDFS such that there are about a hundred subdirectories under a single parent directory. Each "child" directory contains anything between a couple of hundred and a couple of thousand XML files.
My task is to parse each XML file, extract a few values using XPath expressions, and save the result to HBase. I'm trying to do this with Apache Spark, and I'm not having much luck. My problem appears to be a combination of the Spark API, and the way RDDs work. At this point it might be prudent to share some pseudocode to express what I'm trying to do:
RDD[String] filePaths = getAllFilePaths()
RDD[Map<String, String>] parsedFiles = filePaths.map((filePath) => {
    // Load the file denoted by filePath
    // Parse the file and apply XPath expressions
})
// After calling map() above, I should have an RDD[Map<String, String>] where
// the map is keyed by a "label" for an XPath expression, and the
// corresponding value is the result of the expression applied to the file
So, discounting the part where I write to HBase for a moment, let's focus on the above. I cannot load a file from within the RDD map() call.
I have tried this a number of different ways, and all have failed:
Using a call to SparkContext.textFile("/my/path") to load the file fails because SparkContext is not serializable
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated outside the RDD fails because FileSystem is not serializable
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated inside the RDD fails because the program runs out of file handles.
Alternative approaches have included attempting to use SparkContext.wholeTextFiles("/my/path/*") so I don't have to do the file load from within the map() call; this fails because the program runs out of memory, presumably because it loads the files eagerly.
Has anyone attempted anything similar in their own work, and if so, what approach did you use?
Try using a wildcard to read the whole directory.
val errorCount = sc.textFile("hdfs://some-directory/*")
Actually, Spark can read a whole HDFS directory. To quote from the Spark documentation:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
