How to process multiple different files in different ways using Spring Batch - java

Background/Context
I see almost countless examples of how to process multiple files using Spring Batch, but every single one of them has a single object that all the files are being processed into. That is, many files containing compatible data, all being processed into a single destination target, like a database table.
I want to build an import process that will take in ten different files and map them to ten different destination tables in the same database/schema. The filenames will also change slightly in a predictable/code-able fashion every day, but I think I'll be able to handle that. I thought Spring could do this (a many-to-many data mapping), but this is the last thing I haven't been able to find out HOW to do. The declarative structure of Spring is great for some things, but I'm honestly not sure how to set up the multiple mappings, and since there's really no procedural portion of the application to speak of, I can't really use any form of iteration. I could simply make separate jars for each file and script the iteration on the console, but that also complicates logging and reporting... and frankly it sounds hacky.
Question
How do I tell Spring Batch to process each of ten different files, in ten different ways, and map their data into ten different tables in the same database?
Example:
File Data_20190501_ABC_000.txt contains 4 columns of tilde-delimited data and needs to be mapped to table ABC_data with 6 columns (two are metadata)
File Data_20190501_DEF_000.txt contains 12 columns of tilde-delimited data and needs to be mapped to table DEF_data with 14 columns (two are metadata)
File Data_20190501_GHI_000.txt contains 10 columns of tilde-delimited data and needs to be mapped to table GHI_data with 12 columns (two are metadata)
etc... for ten different files and tables
I can handle the tilde delimiting, I THINK I can handle the dates in the file names programmatically, and one of the metadata fields can be handled in a DB trigger. The other metadata field should be the file name, but that can certainly be a different question.
UPDATE
Based on what I think Mahmoud Ben Hassine suggested, I made a separate reader, mapper, and writer for each file/table pair and tried to chain them with the start(step1), next(step2), build() paradigm in the format below, based on the examples in Configuring and Running a Job from Spring's docs:
@Autowired
private JobBuilderFactory jobs;

@Bean
public Job job(@Qualifier("step1") Step step1, @Qualifier("step2") Step step2) {
    return jobs.get("myJob").start(step1).next(step2).build();
}
Either step runs fine independently, but once I add one in as the "next" step, it only executes the first one and generates a "Step already complete or not restartable, so no action to execute" INFO message in the log output. Where do I go from here?

A chunk-oriented step in Spring Batch can handle only one type of item at a time. I would use a job with a different chunk-oriented step for each file. These steps can be run in parallel since there is no relation/order between the input files.
Most of the configuration would be common in your case, so you can create an abstract step definition with the common configuration properties, and multiple steps with the specific properties for each one (in your case, I see those would be the file name, the field set mapper and the target table).
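For illustration only, here is a minimal sketch of what one such step (the ABC pair) could look like; AbcRecord, the column names and the chunk size are assumptions, and the same pattern would be repeated (or pulled into a common helper) for DEF, GHI and the rest:

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// One step for the ABC file/table pair. AbcRecord is a hypothetical POJO
// with properties col1..col4 matching the tilde-delimited columns.
@Bean
public Step abcStep(StepBuilderFactory steps, DataSource dataSource) {
    BeanWrapperFieldSetMapper<AbcRecord> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(AbcRecord.class);

    FlatFileItemReader<AbcRecord> reader = new FlatFileItemReaderBuilder<AbcRecord>()
            .name("abcReader")
            .resource(new FileSystemResource("Data_20190501_ABC_000.txt"))
            .delimited().delimiter("~")
            .names("col1", "col2", "col3", "col4")   // hypothetical column names
            .fieldSetMapper(fieldSetMapper)
            .build();

    JdbcBatchItemWriter<AbcRecord> writer = new JdbcBatchItemWriterBuilder<AbcRecord>()
            .dataSource(dataSource)
            .sql("INSERT INTO ABC_data (col1, col2, col3, col4) "
               + "VALUES (:col1, :col2, :col3, :col4)")
            .beanMapped()
            .build();

    return steps.get("abcStep")
            .<AbcRecord, AbcRecord>chunk(100)
            .reader(reader)
            .writer(writer)
            .build();
}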
Hope this helps.

Related

Spring Batch Crud Database

My problem is more or less the one asked here: Spring Batch : Compare Data Between Database. However, I still cannot get my head around it. Maybe mine is a bit different.
I have datasource A and I want to write into database B.
I have full trust in datasource A, so:
If A contains a record that B does not, I have to add it to B.
If A does not contain a record that B does, I have to delete it from B.
If both A and B contain the record, I check and update the record in B accordingly.
I thought my approach would be as simple as:
Read Person from A datasource
Read Person from B datasource
(Those two Person can be having different entities)
Compare and find the ones to add, update, and delete.
Update the database.
However, since I am pretty new to Spring Batch, the implementation is ending up as spaghetti code, which I don't want; I want to learn the right way to do it.
So;
I created this job below
@Bean
public Job job() {
    return jobBuilderFactory
            .get("myNewbieJob")
            .start(populateARepository())
            .next(populateBRepository())
            .next(compareAndSubmitCountryRepositoriesTasklet())
            .build();
}
To explain:
populateARepository(): I have a Repository object that just contains a list. This step just adds records to the list.
The part that I don't like is that compareAndSubmitCountryRepositoriesTasklet() is basically comparing those repositories... and then I don't know what to do.
If I add DB access and push from that class, I won't like it, because I just wanted it to be a step where I find the differences.
If I create another class which contains 3 separate lists for toUpdate, toDelete, toInsert, and then in the next step somehow use that repository... that sounded wrong to me as well.
So, here I am. Any kind of guidance is appreciated. How would you deal in this situation?
Thank you in advance.
Before talking about Spring Batch, I would first look for an algorithm to solve this problem. If I understand correctly, you basically need to replicate the same state of records in database A into database B. What you can do is:
Read Person items from database A
Use an item processor to do the comparison with table B. Here, you would mark the item accordingly to be inserted, updated or deleted
Use an item writer that checks the type of record and do the necessary operation. Here, you can create a custom writer or use a ClassifierCompositeItemWriter (see this example)
This approach works well with small/medium datasets but not for large datasets (due to the additional query for each item, but this is inherent to the algorithm itself and not the implementation with Spring Batch).
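As a rough sketch of that writer classification (Person, its Operation flag and the three delegate writers are assumptions, not from the original question):

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;

// Person is the hypothetical item type; its Operation flag (INSERT/UPDATE/DELETE)
// is set by the item processor after comparing with database B. The three
// delegates would be JdbcBatchItemWriters with the matching SQL statements.
public ClassifierCompositeItemWriter<Person> classifierWriter(
        ItemWriter<Person> insertWriter,
        ItemWriter<Person> updateWriter,
        ItemWriter<Person> deleteWriter) {

    ClassifierCompositeItemWriter<Person> writer = new ClassifierCompositeItemWriter<>();
    writer.setClassifier(person -> {
        switch (person.getOperation()) {   // hypothetical flag set in the processor
            case INSERT: return insertWriter;
            case UPDATE: return updateWriter;
            default:     return deleteWriter;
        }
    });
    return writer;
}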

design approach for loading data from different sources (Oracle, flat files) using Java

I am looking for a design approach for loading data from different sources (Oracle, flat files, etc.) into a target relational model using Java. I already have the target data model in place; currently it has four entities a, b, c, d, where d references the ids of a, b and c, so I need to populate the first three tables.
For entity a:
I need to read a record from the source and compare it with what already exists in entity a (in the first load it will be empty, so I would insert directly). The comparison is on all the columns of that record; if there is a difference I update the target, otherwise I move on to the next record.
I am considering Spring Batch, but comparing each and every record will mean a lot of DB calls, which would impact performance.
I would appreciate help on design strategies. I don't want to consider ETL tools like Informatica, Ab Initio, etc.
The target database will always be Oracle.
Probably the fastest way to do this is to load all the records into a temporary table on the target. Then you can run a MINUS query (if your target is Oracle) between the two tables to find all records that need to be inserted; all others are updates.
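A rough sketch of that MINUS check using Spring's JdbcTemplate (the staging and target table names and columns are made up):

import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

// STAGING_A holds the freshly loaded source records, TARGET_A is the existing
// target table; both names (and the columns) are placeholders for this sketch.
public List<Map<String, Object>> findRecordsToInsert(DataSource targetDataSource) {
    JdbcTemplate jdbc = new JdbcTemplate(targetDataSource);
    // Rows present in staging but missing from the target are INSERT candidates;
    // swapping the two SELECTs gives the rows that exist only in the target.
    return jdbc.queryForList(
            "SELECT id, name, status FROM STAGING_A "
          + "MINUS "
          + "SELECT id, name, status FROM TARGET_A");
}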

Hash based partitioning

I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be inter-related, i.e. processing of such records MUST follow the order they appear in the file. Using the regular sequential approach (i.e. a single thread for the entire file) gives me poor performance, therefore I want to use the partitioning feature. Due to my processing requirement, inter-related records MUST be in the same partition (as well as in the order they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that near equally sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it up into the lists of records that need to be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel.
Load the file into a staging table - This is the easier method IMHO. Load the file into a staging table. From there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale the process out as wide as you need to go to achieve the performance you require.
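If you go with the first option, a minimal sketch of wiring a MultiResourcePartitioner could look like this (the split-file location, step names and grid size are assumptions):

import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

// "workerStep" is the chunk-oriented step that processes a single split file;
// the partitioner creates one partition per file produced by the splitting pass.
@Bean
public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/tmp/splits/part-*.csv"));   // hypothetical split location

    return steps.get("partitionedStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)
            .gridSize(4)                                      // partitions processed concurrently
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}

Each worker step's reader would then typically be step-scoped and pick its file up from the step execution context (MultiResourcePartitioner stores it under the fileName key by default).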
The flat file item reader is not thread-safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
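A minimal sketch of such a synchronizing delegator (the delegate would be your configured FlatFileItemReader):

import org.springframework.batch.item.ItemReader;

// Simple synchronizing wrapper: only read() is serialized; processing and
// writing can still run concurrently in a multi-threaded step.
public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        return delegate.read();
    }
}

Newer versions of Spring Batch also ship a SynchronizedItemStreamReader that wraps a delegate in essentially this way.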
I think your question is somehow duplicate to this: multithreaded item reader

Are batchlets the correct way of implementing ETL steps in JavaEE Batch?

I am studying the Java EE Batch API (JSR-352) in order to test the feasibility of replacing our current ETL tool with our own solution using this technology.
My goal is to build a job in which I:
get some (dummy) data from a datasource in step1,
some other data from other data-source in step2 and
merge them in step3.
I would like to process each item and not write to a file, but send it to the next step. And also store the information for further use. I could do that using batchlets and jobContext.setTransientUserData().
I think I am not getting the concepts right: as far as I understood, JSR-352 is meant for this kind of ETL task, but it has 2 types of steps: chunks and batchlets. Chunks are "3-phase steps", in which one reads, processes and writes the data. Batchlets are tasks that are not performed on each item of the data, but only once (such as calculating totals, sending emails and others).
My problem is that my solution is not correct if I consider the definition of batchlets.
How could one implement this kind of job using the Java EE Batch API?
I think you'd better use chunks rather than batchlets to implement ETLs. Typical chunk processing with a datasource looks something like the following:
ItemReader#open(): open a cursor (create the Connection, Statement and ResultSet) and save them as instance variables of the ItemReader.
ItemReader#readItem(): create and return an object that contains the data of one row from the ResultSet.
ItemReader#close(): close the JDBC resources.
ItemProcessor#processItem(): do the calculation, then create and return an object which contains the result.
ItemWriter#writeItems(): save the calculated data to the database: open a Connection and Statement, invoke executeUpdate(), and close them.
As to your situation, I think you have to choose the data source that can be considered the primary one and open a cursor for it in ItemReader#open(), then fetch the matching data from the other source in ItemProcessor#processItem() for each item.
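A skeleton of that reader side using the JSR-352 API (the query, the JNDI name and the returned row shape are placeholders):

import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.annotation.Resource;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;
import javax.sql.DataSource;

@Named
public class PrimaryDataReader extends AbstractItemReader {

    @Resource(lookup = "java:comp/DefaultDataSource")   // assumed datasource
    private DataSource dataSource;

    private Connection connection;
    private PreparedStatement statement;
    private ResultSet resultSet;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        connection = dataSource.getConnection();
        statement = connection.prepareStatement("SELECT id, amount FROM primary_source");  // placeholder query
        resultSet = statement.executeQuery();
    }

    @Override
    public Object readItem() throws Exception {
        if (!resultSet.next()) {
            return null;                                 // null tells the runtime there is no more data
        }
        // A real reader would return a domain object; an Object[] keeps the sketch self-contained.
        return new Object[] { resultSet.getLong("id"), resultSet.getBigDecimal("amount") };
    }

    @Override
    public void close() throws Exception {
        resultSet.close();
        statement.close();
        connection.close();
    }
}

The lookup against the secondary data source would then happen in an ItemProcessor#processItem() implementation, as described above.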
I also recommend reading these useful examples of chunk processing:
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-1/
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-2/
My blog entries about JBatch and chunk processing:
http://www.nailedtothex.org/roller/kyle/category/JBatch

is spring-batch for me, even though I don't have a usage for itemReader and itemWriter?

spring-batch newbie: I have a series of batches that
read all new records (since the last execution) from some sql tables
upload all the new records to hadoop
run a series of map-reduce (pig) jobs on all the data (old and new)
download all the output to local and run some other local processing on all the output
The point is, I don't have any obvious "item". I don't want to deal with the specific lines of text in my data; I work with all of it as one big chunk and don't want any commit intervals and such...
However, I do want to keep all these steps loosely coupled - as in, steps a+b+c might succeed for several days and accumulate processed data while step d keeps failing, and then when it finally succeeds it will read and process all of the output of its previous steps.
SO: is my "item" a fictive "working-item" which signifies the entire batch of new data? Do I maintain a series of queues myself and pass these fictive working-items between them?
thanks!
People always assume that the only use of Spring Batch is chunk processing. That is a huge feature, but what's overlooked is the visibility of the processing and the job control.
Give 5 people the same task with no Spring Batch and they're going to implement flow control and visibility their own way. Give 5 people the same task with Spring Batch and you may end up with custom tasklets all done differently, but getting access to the job metadata and starting and stopping jobs is going to be consistent. From my perspective it's a great tool for job management. If you already have your jobs written, you can implement them as custom tasklets if you don't want to rewrite them to conform to the 'item' paradigm. You'll still see the benefits.
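For example, wrapping an existing routine as a custom tasklet is a very small amount of code; here is a rough sketch where HadoopUploader stands in for whatever upload logic you already have:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Wraps an existing routine as a step so it gets job metadata, restartability
// and flow control for free. HadoopUploader is a stand-in for your own code.
public class HadoopUploadTasklet implements Tasklet {

    private final HadoopUploader uploader;

    public HadoopUploadTasklet(HadoopUploader uploader) {
        this.uploader = uploader;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        uploader.uploadNewRecords();      // existing logic, unchanged
        return RepeatStatus.FINISHED;     // run once and mark the step complete
    }
}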
I don't see the problem. Your scenario seems like a classic application of Spring Batch to me.
read all new records (since the last execution) from some sql tables
Here, an item is a record
upload all the new records to hadoop
Same here
run a series of map-reduce (pig) jobs on all the data (old and new)
Sounds like a StepListener or ChunkListener
download all the output to local and run some other local processing on all the output
That's the next step.
The only problem I see is if you don't have Domain Objects for your records. But even then, you can work with maps or arrays, while still using ItemReaders and ItemWriters.
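For instance, reading the new records as plain maps (no domain object) could look roughly like this; source_table, updated_at and the last-run timestamp handling are assumptions:

import java.sql.Timestamp;
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.ColumnMapRowMapper;

// Each item is one row as a Map<String, Object>; lastRun would come from job
// parameters or the previous execution's context.
public JdbcCursorItemReader<Map<String, Object>> newRecordsReader(DataSource dataSource, Timestamp lastRun) {
    JdbcCursorItemReader<Map<String, Object>> reader = new JdbcCursorItemReader<>();
    reader.setName("newRecordsReader");
    reader.setDataSource(dataSource);
    reader.setSql("SELECT * FROM source_table WHERE updated_at > ?");   // hypothetical table/column
    reader.setPreparedStatementSetter(ps -> ps.setTimestamp(1, lastRun));
    reader.setRowMapper(new ColumnMapRowMapper());
    return reader;
}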
