I currently have the following processing in a Spring Batch job:
FlatFileItemReader reads a CSV file
Processor does some work
FlatFileItemWriter creates a mirror of the read file, but updates the file to reflect processing
I don't want to write to a new file, but I want to update the same file that is being read during processing.
My question is, is there a typical method in Spring to use the FlatFileItemReader and then update that same file per row in the processor at runtime?
Thanks for any help.
You can always write a custom writer in Spring Batch, like the example below, where you read the data from the file into memory and then update the same file with the data you intend to write.
https://github.com/pkainulainen/spring-batch-examples/tree/master/spring/src/main/java/net/petrikainulainen/springbatch/custom/in
More than that, FlatFileItemReader is not thread-safe. There are hacks to achieve thread safety, but relying on them is not good practice; it is better to create a custom writer.
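For illustration, here is a rough sketch of what such a writer could look like, assuming the Spring Batch 4.x ItemWriter signature and a hypothetical input path; it buffers every processed line in memory and rewrites the file only when the step finishes, so it is reasonable only for files that fit comfortably in memory.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamWriter;

public class InPlaceFileItemWriter implements ItemStreamWriter<String> {

    private final Path file = Paths.get("/data/input.csv"); // assumed: the same file the reader uses
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void write(List<? extends String> items) {
        buffer.addAll(items); // keep everything in memory until the step completes
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException { }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException { }

    @Override
    public void close() throws ItemStreamException {
        try {
            Files.write(file, buffer); // overwrite the original file once reading has finished
        } catch (IOException e) {
            throw new ItemStreamException("Could not rewrite " + file, e);
        }
    }
}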
Short answer: no, Spring Batch doesn't let you overwrite the same file you are reading from.
A better practice is to write an intermediate file and then perform a delete/rename.
Writing a temporary file is not a 'bad thing', especially if you are working with a huge input file and an OutOfMemoryError is around the corner; a temporary file also makes your step restartable and lets you manually retrieve the translated file if the delete/rename step fails.
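If you go the intermediate-file route, a follow-up Tasklet step can do the swap. A minimal sketch, assuming hard-coded paths that you would normally inject:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ReplaceOriginalFileTasklet implements Tasklet {

    private final Path original = Paths.get("/data/input.csv");      // assumed reader input
    private final Path temporary = Paths.get("/data/input.csv.tmp"); // assumed writer output

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Swap the processed file in place of the original once the write step has succeeded.
        Files.move(temporary, original, StandardCopyOption.REPLACE_EXISTING);
        return RepeatStatus.FINISHED;
    }
}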
I am trying to save the contents of a dataset to CSV using
df.coalesce(1)
.write()
.format("csv")
.mode("append")
.save(PATH+"/trial.csv");
My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and creates a CSV file inside of it. When I run it again, it creates another CSV file inside the trial.csv folder. But I just want it to keep appending to one CSV file, which I am unable to do.
I know we can do it with a script outside of the code (program), but can we achieve it from inside our code? I am using Java.
Appending to an existing file is hard to do for a distributed, multi-threaded application; it would turn something parallelised into a sequential task. The way this is usually handled is that each thread or task in Spark persists a single file under the specified path, so the path becomes a folder containing all the files. To read them back, you read the complete folder.
If your data is not big and you really need a single file, try repartitioning to 1: this makes a single task write the new data, but it will never append the data to previous files.
You have to be careful, but you can do something like this:
df.union(spark.read().csv(PATH + "/trial.csv"))
  .coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial_auxiliar.csv");
Then move it to the previous folder, with Spark or with a Hadoop move command. Never write and read the same folder in the same job, and keep in mind that this won't guarantee the data order.
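For the move itself, one option (a sketch; the base path is passed in and the folder names match the snippet above) is the Hadoop FileSystem API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SwapOutputFolders {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path previous = new Path(args[0] + "/trial.csv");
        Path auxiliar = new Path(args[0] + "/trial_auxiliar.csv");
        fs.delete(previous, true);     // drop the old output folder
        fs.rename(auxiliar, previous); // promote the auxiliary folder in its place
    }
}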
Is it possible to use Spark APIs to read a large CSV file containing multiple sections with different headers? The structure of the file is as follows:
BatchCode#1
Name,Surname,Address
AA1,BBB,CCC
AA2,BBB,CCC
AA3,BBB,CCC
BatchCode#2
Name,Surname,Address,Phone
XY1,BBB,CCC,DDD
XY2,BBB,CCC,DDD
XY3,BBB,CCC,DDD
While reading the records, we need to be careful with the headers, as the format can differ between sections. The BatchCode information needs to be extracted from the section header and should become part of every record within that section. For example, the first data line should be parsed as:
Name: AA1
Surname: BBB
Address: CCC
BatchCode: 1
The following options come to mind, but I am not completely sure whether they could create significant problems:
Reading the file using wholeTextFiles. This uses a single task to read each file, but it loads the entire file in memory and could cause memory issues with large files.
Forcing Spark to read the file in a single thread using coalesce(1) on sc.textFile. I am not sure the order is always guaranteed. Once we get the file as an RDD, we will cache the header rows while reading the file and merge them with their corresponding data records.
Even if the above approaches work, would they be efficient? What would be the most efficient way?
For more complicated use cases like this, where sequential processing must be guaranteed, I wrote Scala-only programs; it is too difficult otherwise. The files were first run through csvkit if they originated from xls or xlsx.
The following program works for me:
// Read each file as a single binary stream so the section order is preserved.
JavaPairRDD<String, PortableDataStream> binaryFiles = sc.binaryFiles(file);
PortableRecordReader reader = new PortableRecordReader();
// The reader turns each file's stream into one Record per data line.
JavaPairRDD<String, Record> fileAndLines = binaryFiles.flatMapValues(reader);
Here PortableRecordReader opens a DataInputStream, wraps it in an InputStreamReader, and then uses a CSV parser to convert the lines into the intended Record objects, also merging in the header.
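The answer does not show PortableRecordReader itself, but based on that description it might look roughly like this. This is a sketch assuming the Spark 2.x flatMapValues signature, a plain String.split instead of a real CSV parser, and a Record constructor taking the batch code, the header and the values; newer Spark versions may expect a FlatMapFunction returning an Iterator instead.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.input.PortableDataStream;

public class PortableRecordReader implements Function<PortableDataStream, Iterable<Record>> {

    @Override
    public Iterable<Record> call(PortableDataStream stream) throws Exception {
        List<Record> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream.open(), StandardCharsets.UTF_8))) {
            String batchCode = null;
            String[] header = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("BatchCode#")) {
                    batchCode = line.substring("BatchCode#".length());
                    header = null;                     // the next line is this section's header
                } else if (header == null) {
                    header = line.split(",");          // cache the header for this section
                } else {
                    records.add(new Record(batchCode, header, line.split(","))); // assumed constructor
                }
            }
        }
        return records;
    }
}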
I am seeking some guidance please on how to structure a spring batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear:
select the files to ingest from an external source: there can be multiple releases of some files each day so the latest release must be picked
turn each line of each file into JSON by combining the delimited fields with the column names of the first line (which is skipped)
send each line of JSON to a RESTful API
We have one step which uses a MultiResourceItemReader to process the files in sequence. The files are input streams, which time out.
Ideally I think we want to have
a step which identifies the files to ingest
a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set the token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
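A minimal sketch of that wiring (the comma delimiter and the resource path are assumptions; the built-in PassThroughFieldSetMapper simply hands each FieldSet on to the processor):
import java.util.regex.Pattern;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.core.io.FileSystemResource;

// Tokenizer that learns its column names from the skipped header line.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer implements LineCallbackHandler {

    private final String delimiter;

    public HeaderAwareTokenizer(String delimiter) {
        super(delimiter);
        this.delimiter = delimiter;
    }

    @Override
    public void handleLine(String headerLine) {
        // Called once with the skipped first line; its fields become the token names.
        setNames(headerLine.split(Pattern.quote(delimiter)));
    }

    // Wiring example:
    public static FlatFileItemReader<FieldSet> reader(String path) {
        HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer(",");
        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setResource(new FileSystemResource(path));
        reader.setLinesToSkip(1);                  // skip the column-name line...
        reader.setSkippedLinesCallback(tokenizer); // ...but hand it to the tokenizer first
        reader.setLineMapper(lineMapper);
        return reader;
    }
}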
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the call that sends the JSON to the REST service can itself act as the writer.
Anyway, you don't need a separate step just to determine the file names; make that part of the application initialization code.
Strategies to parallelize your application are listed here.
You just said a bunch of files. If the line counts in those files are similar, I would go with the partitioning approach, i.e. by implementing the Partitioner interface I would hand each file to a separate thread, and that thread would execute a full reader -> processor -> writer step. You wouldn't need MultiResourceItemReader in this case, just a simple single-file reader, since each file gets its own reader; a sketch follows below.
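A minimal sketch of that idea, assuming Java config, a StepBuilderFactory field, and an input location of file:/data/inbound/*.csv: Spring Batch's built-in MultiResourcePartitioner creates one partition per file, and the worker step's reader is @StepScope and picks its Resource up from #{stepExecutionContext['file']}.
@Bean
public Partitioner filePartitioner() throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setKeyName("file"); // each partition's ExecutionContext gets a "file" entry
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/data/inbound/*.csv")); // assumed input location
    return partitioner;
}

@Bean
public Step masterStep(Step workerStep, Partitioner filePartitioner) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("workerStep", filePartitioner)
            .step(workerStep)                            // workerStep = reader -> processor -> writer
            .taskExecutor(new SimpleAsyncTaskExecutor()) // one thread per file/partition
            .build();
}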
If the line counts vary a lot, i.e. one file takes hours while another finishes in a few minutes, you can keep using MultiResourceItemReader but use the multi-threaded step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread-safe.
The parallel-steps approach doesn't look suitable for your case since your steps are not independent.
Hope it helps !!
I'm quite new to Hadoop, and I have an issue...
I have an output file (the result of a task) and I would like to modify it. As it can be a very big file, I want to do this operation in parallel.
Note: I don't want to simply append data, I want to modify it structurally (even its size), so I have to read it completely and write it back.
Reading the file isn't a problem: I give each worker a part of the file, and they simply read it and make the changes they want.
But writing the new file back to HDFS seems trickier.
My question is: how can I create a big file in HDFS and have my workers write into it simultaneously? (I know the size of each part, so two workers will never attempt to write at the same position.)
Thanks in advance :)
Since the job is to read the input files and write selected content from them to an output location in parallel, this is a mapper-only job.
Create a Mapper class to read the file and perform your operations on the file.
Set the number of mappers in your driver class:
job.setNumMapTasks(n); // n = number of mappers
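A minimal mapper-only sketch (the class name and the transformation are placeholders); with zero reducers, each mapper writes its own part file under the output path:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RewriteMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Apply whatever structural change you need to each line here.
        String transformed = line.toString().toUpperCase(); // placeholder transformation
        context.write(NullWritable.get(), new Text(transformed));
    }
}

// In the driver: job.setNumReduceTasks(0); // mapper-only, output goes straight to HDFS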
I'm writing a web application and want the user to be able click a link and get a file download.
I have an interface in a third-party library that I can't alter:
writeFancyData(File file, Data data);
Is there an easy way to create a File object that I can pass to this method, so that writing to it will stream to the HTTP response?
Notes:
Obviously I could just write a temporary file, then read it back in and write it to the output stream of the HTTP response. However, what I'm looking for is a way to avoid the file system I/O, ideally by creating a fake File that, when written to, instead writes to the output stream of the HTTP response.
e.g.
writeFancyData(new OutputStreamBackedFile(response.getOutputStream()), data);
I need to use the writeFancyData method as it writes a file in a very specific format that I can't reproduce.
Assuming writeFancyData is a black box, it's not possible. As a thought experiment, consider an implementation of writeFancyData that did something like this:
public void writeFancyData(File file, Data data) {
    File localFile = new File(file.getPath());
    ...
    // process data from the file
    ...
}
Given that the only thing you can return from any subclass of File is the path name, you're just not going to be able to get the data you want into that method. If the signature included some sort of stream you would be in a much better position, but since all you can pass in is a File, this can't be done.
In practice the implementation is probably one of the FileInputStream or FileReader classes, which use the File object really just for its name and then call out to native methods to get a file descriptor and handle the actual I/O.
As dlawrence writes, with the API being a black box it is impossible to determine what it is doing with the File.
A non-Java approach is to create a named pipe. You could start a reader for the pipe in your program, create a File on that path, and pass it to the API; a rough sketch follows below.
Before doing anything so fancy, I would recommend analyzing performance and verifying that disk I/O is indeed a bottleneck.
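A rough sketch of the named-pipe idea (POSIX only; the temp path is assumed, error handling is simplified, and the third-party call is passed in as a Consumer so the example stays self-contained, e.g. f -> library.writeFancyData(f, data)):
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.Consumer;

public class NamedPipeStreamer {

    public static void streamViaFifo(Consumer<File> thirdPartyWriter, OutputStream httpOut)
            throws Exception {
        File fifo = new File("/tmp/fancy-" + System.nanoTime() + ".fifo");
        // Java has no portable API for FIFOs, so shell out to mkfifo.
        new ProcessBuilder("mkfifo", fifo.getAbsolutePath()).start().waitFor();

        // Pump thread: copy whatever the library writes into the pipe straight to the response.
        Thread pump = new Thread(() -> {
            try (InputStream in = new FileInputStream(fifo)) {
                in.transferTo(httpOut); // Java 9+
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        pump.start();

        thirdPartyWriter.accept(fifo); // e.g. f -> library.writeFancyData(f, data)
        pump.join();
        fifo.delete();
    }
}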
Given that API, the best you can do is give it a File that lives on a RAM-disk filesystem.
And lodge a bug / defect report against the API asking for an overload that takes a Writer or OutputStream argument.
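For completeness, a short sketch of the RAM-disk suggestion (Linux only, /dev/shm assumed to be a tmpfs mount; writeFancyData, data and response are the objects from the question), which keeps the File contract but avoids physical disk I/O:
File ramFile = new File("/dev/shm/fancy-" + System.nanoTime() + ".dat");
try {
    writeFancyData(ramFile, data);         // the third-party call still sees a real File
    try (OutputStream out = response.getOutputStream()) {
        Files.copy(ramFile.toPath(), out); // stream the result to the HTTP response
    }
} finally {
    ramFile.delete();
}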