I am seeking some guidance, please, on how to structure a Spring Batch application to ingest a number of potentially large delimited files, each with a different format.
The requirements are clear:
select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
send each line of JSON to a RESTful API
We currently have one step which uses a MultiResourceItemReader to process the files in sequence. The files are input streams, which time out.
Ideally, I think we want to have:
a step which identifies the files to ingest
a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all your token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
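For illustration, a minimal sketch of that wiring could look like the following; the class name, the @Bean-style setup and the resource variable are just examples, not a definitive implementation:
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.batch.item.file.transform.FieldSet;

// Tokenizer that also receives the skipped header line and uses it to name the tokens.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer implements LineCallbackHandler {
    @Override
    public void handleLine(String headerLine) {
        setNames(headerLine.split(","));
    }
}

// Reader wiring, e.g. inside a @Bean method: skip the header but hand it to the tokenizer first.
HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();
DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(tokenizer);
lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
reader.setResource(resource);              // the delimited file to ingest (assumed variable)
reader.setLinesToSkip(1);
reader.setSkippedLinesCallback(tokenizer); // passes the skipped header line to handleLine()
reader.setLineMapper(lineMapper);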
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the writer itself can send the JSON to the REST service, if writing is considered done only after receiving a response from the service.
Anyway, you don't need a separate step just to determine the file names. Make that part of the application initialization code.
Strategies to parallelize your application are listed here.
You just said a bunch of files. If the line counts of those files are similar, I would go with the partitioning approach (i.e. by implementing the Partitioner interface, I would hand each file over to a separate thread, and that thread would execute a step: reader -> processor -> writer). You wouldn't need a MultiResourceItemReader in this case, but a simple single-file reader, since each file will have its own reader. See Partitioning; a rough sketch follows below.
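A rough sketch of such a partitioner, assuming the resolved files are handed to it as Spring Resources (Spring Batch also ships a MultiResourcePartitioner that does essentially this):
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.core.io.Resource;

public class PerFilePartitioner implements Partitioner {

    private final Resource[] resources;

    public PerFilePartitioner(Resource[] resources) {
        this.resources = resources;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < resources.length; i++) {
            ExecutionContext context = new ExecutionContext();
            try {
                // Each worker step later binds this value, e.g. via
                // #{stepExecutionContext['fileName']}, to its own FlatFileItemReader.
                context.putString("fileName", resources[i].getURL().toExternalForm());
            } catch (java.io.IOException e) {
                throw new IllegalStateException(e);
            }
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}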
If the line counts of those files vary a lot, i.e. one file is going to take hours while another finishes in a few minutes, you can continue using the MultiResourceItemReader but use the Multi-threaded Step approach to achieve parallelism. This is chunk-level parallelism, so you might have to make the reader thread safe.
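As a small, hedged sketch of that option, the existing reader could be wrapped so it can be shared between chunk threads; the delegate and the step builder are assumed to be your existing beans:
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

SynchronizedItemStreamReader<FieldSet> threadSafeReader = new SynchronizedItemStreamReader<>();
threadSafeReader.setDelegate(multiResourceItemReader);   // your existing MultiResourceItemReader bean
// ...and on the step definition, enable concurrent chunk processing, e.g.:
// stepBuilder.reader(threadSafeReader).taskExecutor(new SimpleAsyncTaskExecutor("ingest-"));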
The Parallel Steps approach doesn't look suitable for your case, since your steps are not independent.
Hope it helps !!
Related
Is it possible to use the Spark APIs to read a large CSV file containing multiple sections with different headers? The structure of the file is as follows:
BatchCode#1
Name,Surname,Address
AA1,BBB,CCC
AA2,BBB,CCC
AA3,BBB,CCC
BatchCode#2
Name,Surname,Address,Phone
XY1,BBB,CCC,DDD
XY2,BBB,CCC,DDD
XY3,BBB,CCC,DDD
While reading the records, we need to be careful with the headers, since the format can differ between sections. The BatchCode information needs to be extracted from the section header and should be part of every record within that section. For example, the data at line 1 should be parsed as:
Name: AA1
Surname: BBB
Address: CCC
BatchCode: 1
The following options come to mind, but I am not completely sure whether they could create significant problems:
Reading the file using wholeTextFiles. This would use a single thread to read the file, but it would load the entire file into memory and could cause memory issues with large files.
Forcing Spark to read the file in a single thread using coalesce(1) on sc.textFile. I am not sure the order is always guaranteed. Once we get the file as an RDD, we would cache the header rows while reading the file and merge them with their corresponding data records.
Even if the above approaches work, would they be efficient? What would be the most efficient way?
I wrote Scala-only programs for more complicated use cases like this, where sequential processing is guaranteed. It is too difficult otherwise. The files were first run through csvkit if they originated from xls or xlsx.
The following program works for me:
JavaPairRDD<String, PortableDataStream> binaryFiles = sc.binaryFiles(file);
PortableRecordReader reader = new PortableRecordReader();
JavaPairRDD<String, Record> fileAndLines = binaryFiles.flatMapValues(reader);
Here, PortableRecordReader opens a DataInputStream, wraps it in an InputStreamReader, and then uses a CSV parser to convert the lines into the intended Record objects, merging in the header as it goes.
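For what it's worth, a hedged sketch of what such a reader could look like follows; the Record class and the header-merging details are assumptions rather than the original code, and it assumes the Spark 1.x/2.x flatMapValues(Function<V, Iterable<U>>) signature:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.input.PortableDataStream;

public class PortableRecordReader implements Function<PortableDataStream, Iterable<Record>> {
    @Override
    public Iterable<Record> call(PortableDataStream stream) throws Exception {
        List<Record> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream.open(), StandardCharsets.UTF_8))) {
            String batchCode = null;
            String[] header = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("BatchCode#")) {
                    batchCode = line.substring("BatchCode#".length());
                    header = null;                 // the next line holds this section's column names
                } else if (header == null) {
                    header = line.split(",");      // section header, e.g. Name,Surname,Address
                } else {
                    // Merge the section header and BatchCode into every data record (Record is hypothetical).
                    records.add(new Record(batchCode, header, line.split(",")));
                }
            }
        }
        return records;
    }
}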
I currently have the following processing in a Spring Batch job:
FlatFileItemReader reads a CSV file
Processor does some work
FlatFileItemWriter creates a mirror of the read file, but updates the file to reflect processing
I don't want to write to a new file, but I want to update the same file that is being read during processing.
My question is, is there a typical method in Spring to use the FlatFileItemReader and then update that same file per row in the processor at runtime?
Thanks for any help.
You can always write a custom writer in Spring Batch, like the example below, where you read the data from the file into memory and then update the same file with the data you intend to write.
https://github.com/pkainulainen/spring-batch-examples/tree/master/spring/src/main/java/net/petrikainulainen/springbatch/custom/in
More than that, FlatFileItemReader is not thread safe. Of course there are hacks to achieve thread safety, but using such hacks is not good practice; it is always better to create a custom writer.
The short answer is no: Spring Batch doesn't allow you to overwrite the same file you are reading from.
A better practice is to write an intermediate file and then perform a delete/rename.
Writing a temporary file is not a 'bad thing', especially if you are working with a huge input file and an OutOfMemoryException is around the corner; using a temporary file can also make your step restartable, and it lets you manually retrieve the translated file if the delete/rename process fails.
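A minimal sketch of that delete/rename idea, e.g. as a final tasklet step; the paths are illustrative and would normally come from job parameters:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ReplaceOriginalFileTasklet implements Tasklet {
    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        Path original = Paths.get("data/input.csv");          // file that was read
        Path translated = Paths.get("data/input.csv.tmp");    // file written by FlatFileItemWriter
        // Replace the original only after the processing step has completed successfully.
        Files.move(translated, original, StandardCopyOption.REPLACE_EXISTING);
        return RepeatStatus.FINISHED;
    }
}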
I'm new to Spark and the Hadoop ecosystem and already fell in love with it.
Right now, I'm trying to port an existing Java application over to Spark.
This Java application is structured the following way:
Read the file(s) one by one with a BufferedReader and a custom parser class that does some heavy computing on the input data. The input files are 1 to at most 2.5 GB in size each.
Store the data in memory (in a HashMap<String, TreeMap<DateTime, List<DataObjectInterface>>>)
Write out the in-memory data store as JSON. These JSON files are smaller in size.
I wrote a Scala application that processes my files on a single worker, but that is obviously not the most performance I can get out of Spark.
Now to my problem with porting this over to Spark:
The input files are line-based; I usually have one message per line. However, some messages depend on preceding lines to form an actual valid message in the parser. For example, it could happen that I get data in the following order in an input file:
{timestamp}#0x033#{data_bytes} \n
{timestamp}#0x034#{data_bytes} \n
{timestamp}#0x035#{data_bytes} \n
{timestamp}#0x0FE#{data_bytes}\n
{timestamp}#0x036#{data_bytes} \n
To form an actual message out of the "composition message" 0x036, the parser also needs the lines from messages 0x033, 0x034 and 0x035. Other messages can also appear in between this set of needed messages. Most messages can be parsed by reading a single line, though.
Now finally my question:
How do I get Spark to split my file correctly for my purposes? The files cannot be split "randomly"; they must be split in a way that ensures all my messages can be parsed and the parser will not wait for input that it will never get. This means that each composition message (a message that depends on preceding lines) needs to be in one split.
I guess there are several ways to achieve a correct output, but I'll throw in some ideas I had as well:
Define a manual split algorithm for the file input. This would check that the last few lines of a split do not contain the start of a "big" message [0x033, 0x034, 0x035].
Split the file however Spark wants, but also add a fixed number of lines (let's say 50, which will do the job for sure) from the previous split to the next split. Duplicate data will be handled correctly by the parser class and would not introduce any issues.
The second way might be easier; however, I have no clue how to implement this in Spark. Can someone point me in the right direction?
Thanks in advance!
I saw your comment on my blogpost on http://blog.ae.be/ingesting-data-spark-using-custom-hadoop-fileinputformat/ and decided to give my input here.
First of all, I'm not entirely sure what you're trying to do. Help me out here: your file contains lines with the 0x033, 0x034, 0x035 and 0x036 identifiers, so Spark will process them separately, while actually these lines need to be processed together?
If this is the case, you shouldn't interpret this as a "corrupt split". As you can read in the blog post, Spark splits files into records that it can process separately. By default it does this by splitting records on newlines. In your case, however, your "record" is actually spread over multiple lines. So yes, you can use a custom FileInputFormat. I'm not sure this will be the easiest solution, however.
You can try to solve this using a custom FileInputFormat that does the following: instead of emitting the file line by line like the default FileInputFormat does, you parse the file and keep track of the records encountered (0x033, 0x034, etc.). In the meantime you may filter out records like 0x0FE (not sure if you want to use them elsewhere). The result is that Spark gets all these physical records as one logical record.
On the other hand, it might be easier to read the file line by line and map the records using a functional key (e.g. [object 33, 0x033], [object 33, 0x034], ...). This way you can combine these lines using the key you chose.
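A rough sketch of that keyed approach in the Java API, assuming the {timestamp}#{id}#{data_bytes} layout from the question; deriveGroupKey is a hypothetical function standing for whatever rule maps the related IDs (0x033..0x036) to a common key, and sc/inputPath are your existing JavaSparkContext and input location:
JavaRDD<String> lines = sc.textFile(inputPath);
JavaPairRDD<String, Iterable<String>> grouped = lines
        .mapToPair(line -> {
            String id = line.split("#")[1];              // e.g. "0x033"
            return new Tuple2<>(deriveGroupKey(id), line);
        })
        .groupByKey();
// Each group now holds all physical lines of one logical message and can be
// handed to your existing parser.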
There are certainly other options. Whichever you choose depends on your use case.
Here is my scenario. I have a job that processes a large amount of CSV data and writes it out using Avro into files divided up by date. I have been given a small file that I want to use to update a few of these files with additional entries, via a second job I can run whenever this needs to happen instead of reprocessing the whole data set again.
Here is sort of what the idea looks like:
Job1: Processes lots of CSV data and writes it out as compressed Avro files, split into files by entry date. The source data is not divided by date, so this job will do that.
Job2 (run as needed between Job1 runs): Processes the small update file and uses it to add the entries to the appropriate Avro file. If the file doesn't exist, it creates a new one.
Job3 (always runs): Produces some metrics for reporting from the output of Job1 (and possibly Job2).
So, I have to do it this way, writing a Java job. My first job seems to work fine. So does Job3. I'm not sure how to approach Job2.
Here is what I was thinking:
Pass the update file in using the distributed cache. Parse this file to produce a list of dates in the Job class and use this to filter the files from Job1, which will be the input of this job.
In the mapper, access the distributed update file and add its entries to the collection of Avro objects I've read in. What if the file doesn't exist yet at this point? Does this work?
Use the reducer to write out the new object collection.
Is this how one would implement this? If not, what is the better way? Does a combiner make sense here? I feel like the answer is no.
Thanks in advance.
You can follow the approach below:
1) Run Job1 on all your CSV files.
2) Run Job2 on the small file and create new output.
3) For the update, you need to run one more job. In this job, load the output of Job2 in the setup() method and take the output of Job1 as the map() input. Then write the update logic and generate the final output (a rough sketch of this step follows below).
4) Then run your Job3 for processing.
In my opinion, this will work.
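A rough sketch of step 3, assuming for simplicity that the Job2 output is a small tab-separated text file distributed via the cache; your actual records are Avro, so the parsing and writing would differ:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MergeUpdatesMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> updates = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Load the Job2 output that was added to the distributed cache.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(cacheFiles[0])), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);    // assumed key<TAB>value layout
                    updates.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);    // Job1 output record
        // Prefer the updated entry for this key if one exists.
        String merged = updates.getOrDefault(parts[0], parts.length > 1 ? parts[1] : "");
        context.write(new Text(parts[0]), new Text(merged));
    }
}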
Just one crazy idea: why do you actually need to update the Job1 output?
JOB1 does its job, producing one file per date. Why not add a unique postfix, such as a random UUID, to each file?
JOB2 processes the 'update' information, maybe several times. The output file naming logic is the same: a date-based name plus a unique postfix.
JOB3 collects the JOB1 and JOB2 output, grouping the files into splits by date prefix across all postfixes, and takes them as input.
If date-based grouping is the target, you get a lot of advantages here; to me the obvious ones are:
You don't care about 'whether you have output from JOB1 for this date'.
You don't even care if you need to update one JOB1 output with several JOB2 results.
You don't fight the HDFS 'no file update' limitation, keeping the full power of straightforward 'write once' processing.
You only need some specific InputFormat for your JOB3, which doesn't look too complex.
If you need to combine data from different sources, no problem.
JOB3 itself can ignore the fact that it receives data from several sources; the InputFormat should take care of that.
Several JOB1 outputs can be combined the same way.
Limitations:
This could produce more small files than you can afford with large datasets and several passes.
You need a custom InputFormat.
To me this is a good option, if I have properly understood your case and you can (or need to) group files by date as input for JOB3.
Hope this will help you.
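As a tiny illustration of the JOB3 input side (the paths and the date value are made up), a date-prefixed glob picks up every postfixed file that JOB1 and JOB2 wrote for that date:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job3 = Job.getInstance(new Configuration(), "metrics");
// One glob per date prefix matches all UUID-postfixed files from JOB1 and JOB2.
FileInputFormat.addInputPath(job3, new Path("/data/output/2016-01-15-*"));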
For Job2, you can read the update file in the driver code to filter the input data partitions, and set those as the input paths. You can follow your current approach of reading the update file as a distributed cache file. If you want the job to fail when you are unable to read the update file, throw an exception in the setup() method itself.
If your update logic does not require aggregation on the reduce side, set Job2 as a map-only job. You might need to build logic in Job3 to identify the updated input partitions, as it will receive both the Job1 output and the Job2 output.
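A hedged sketch of that driver-side idea, assuming the update file is a small text file whose first field is the date; the paths and parsing are illustrative:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
Job job2 = Job.getInstance(conf, "update-job");

// Read the update file in the driver to find which date partitions are affected.
Set<String> dates = new HashSet<>();
FileSystem fs = FileSystem.get(conf);
Path updateFile = new Path("/input/update.txt");
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(updateFile), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        dates.add(line.split(",")[0]);                  // assumed: date is the first field
    }
}

// Only the affected Job1 partitions become the input of Job2.
for (String date : dates) {
    FileInputFormat.addInputPath(job2, new Path("/output/job1/" + date));
}

job2.addCacheFile(updateFile.toUri());                  // still readable in setup()
job2.setNumReduceTasks(0);                              // map-only update job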
Consider an application that wants to use Hadoop in order to process large amounts of proprietary binary-encoded text data in approximately the following simplified MapReduce sequence:
1) Gets a URL to a file or a directory as input
2) Reads the list of binary files found under the input URL
3) Extracts the text data from each of those files
4) Saves the text data into new, extracted plain-text files
5) Classifies the extracted files into (sub)formats with special characteristics (say, a "context")
6) Splits each of the extracted text files according to its context, if necessary
7) Processes each of the splits using the context of the original (unsplit) file
8) Submits the processing results to a proprietary data repository
The format-specific characteristics (context) identified in Step 5 are also saved in a (small) text file as key-value pairs, so that they are accessible by Step 6 and Step 7.
Splitting in Step 6 takes place using custom InputFormat classes (one per custom file format).
In order to implement this scenario in Hadoop, one could integrate Step 1 - Step 5 in a Mapper and use another Mapper for Step 7.
A problem with this approach is how to make a custom InputFormat know which extracted files to use in order to produce the splits. For example, format A may represent 2 extracted files with slightly different characteristics (e.g., different line delimiter), hence 2 different contexts, saved in 2 different files.
Based on the above, the getSplits(JobConf) method of each custom InputFormat needs to have access to the context of each file before splitting it. However, there can be (at most) 1 InputFormat class per format, so how would one correlate the appropriate set of extracted files with the correct context file?
A solution could be to use some specific naming convention for associating extracted files and contexts (or vice versa), but is there any better way?
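For what it's worth, a minimal illustration of that naming-convention idea; the ".context" suffix and the splitting placeholder are assumptions, not a recommendation over other approaches:
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class FormatAInputFormat extends TextInputFormat {
    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        for (FileStatus file : listStatus(job)) {
            // Convention: each extracted file "x" has its context saved next to it as "x.context".
            Path contextPath = file.getPath().suffix(".context");
            FileSystem fs = contextPath.getFileSystem(job);
            if (fs.exists(contextPath)) {
                // Load the key/value pairs from contextPath (e.g. the line delimiter)
                // and use them to decide how this particular file should be split.
            }
        }
        return super.getSplits(job, numSplits);   // placeholder: real splitting would use the context
    }
}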
This sounds more like a Storm (stream processing) problem, with a spout that loads the list of binary files from a URL and subsequent bolts in the topology performing each of the remaining actions.