In a JSR-352 batch I want to use partitioning. I can define the number of partitions via configuration or implement a PartitionMapper to do that.
Then, there are the JobContext and StepContext injectables to provide context information to my processing. However, there is no PartitionContext or the like which maintains and provides details about the partition I'm running in.
Hence the question:
How do I tell each partitioned instance of a chunk which partition it is running in so that its ItemReader can read only those items which belong to that particular partition?
If I don't do that, each partition would perform the same work on the same data instead of splitting up the input data set into n distinct partitions.
I know I can store some ID in the partition plan's properties which I can then use to set another property in the step's configuration like <property name="partitionId" value="#{partitionPlan['partitionId']}" />. But this seems overly complicated and fragile because I'd have to know the name of the property from the partition plan and must remember to always set another property to this value for each step.
Isn't there another, clean, standard way to provide partition information to steps?
Or, how else should I be splitting work by partitions and assign it to different ItemReader instances in the same partitioned chunk?
Update:
It appears that jberet has the org.jberet.cdi.PartitionScoped CDI scope, but it's not part of the JSR standard.
When defining a partition with either a partition plan (XML) or a partition mapper (programmatic), include this information as partition properties, and then reference these partition properties within the item reader/processor/writer properties.
This is the standard way to tell item reader and other batch artifacts what resource to handle, where to begin, and where to end. This is not much different from non-partitioned chunk configuration, where you also need to configure the source and range of input data with batch properties.
For example, please see org.jberet.test.chunkPartitionFailComplete.xml from one of the jberet test apps.
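To illustrate the programmatic variant, here is a minimal sketch (the class names, property names, and id ranges below are made up, not from the spec or your job): the mapper sets per-partition properties, and the JSL injects them into the reader's own properties via #{partitionPlan['...']}.

import java.io.Serializable;
import java.util.Properties;

import javax.batch.api.BatchProperty;
import javax.batch.api.chunk.AbstractItemReader;
import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;
import javax.inject.Inject;
import javax.inject.Named;

// Hypothetical mapper: 3 partitions, each with its own id range as partition properties.
@Named
public class RangePartitionMapper implements PartitionMapper {
    @Override
    public PartitionPlan mapPartitions() {
        PartitionPlan plan = new PartitionPlanImpl();
        plan.setPartitions(3);
        Properties[] props = new Properties[3];
        for (int i = 0; i < 3; i++) {
            props[i] = new Properties();
            props[i].setProperty("partition.start", String.valueOf(i * 1000));
            props[i].setProperty("partition.end", String.valueOf((i + 1) * 1000 - 1));
        }
        plan.setPartitionProperties(props);
        return plan;
    }
}

// Hypothetical reader (separate file). The JSL passes the partition properties
// into the reader's own properties, e.g.:
//   <reader ref="rangeItemReader">
//     <properties>
//       <property name="start" value="#{partitionPlan['partition.start']}"/>
//       <property name="end"   value="#{partitionPlan['partition.end']}"/>
//     </properties>
//   </reader>
@Named
class RangeItemReader extends AbstractItemReader {

    @Inject
    @BatchProperty(name = "start")
    private String start;

    @Inject
    @BatchProperty(name = "end")
    private String end;

    private int current;
    private int last;

    @Override
    public void open(Serializable checkpoint) {
        current = Integer.parseInt(start);
        last = Integer.parseInt(end);
    }

    @Override
    public Object readItem() {
        // Each partition only reads its own [start, end] slice.
        return current <= last ? current++ : null;
    }
}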
Background/Context
I see almost countless examples of how to process multiple files using Spring Batch, but every single one of them has a single object that all the files are being processed into. So: many files containing compatible data, all being processed into a single destination target, like a database table, for instance.
I want to build an import process that will take in ten different files and map them to ten different destination tables in the same database/schema. The file names will also change slightly in a predictable/code-able fashion every day, but I think I'll be able to handle that. I thought Spring could do this (a many-to-many data mapping), but this is the last piece I can't find out HOW to do. The declarative structure of Spring is great for some things, but I'm honestly not sure how to set up the multiple mappings, and since there's really no procedural portion of the application to speak of, I can't really use any form of iteration. I could simply make separate jars for each file and script the iteration on the console, but that also complicates logging and reporting... and frankly, it sounds hacky.
Question
How do I tell Spring Batch to process each of ten different files, in ten different ways, and map their data into ten different tables in the same database?
Example:
File Data_20190501_ABC_000.txt contains 4 columns of tilde-delimited data and needs to be mapped to table ABC_data with 6 columns (two are metadata)
File Data_20190501_DEF_000.txt contains 12 columns of tilde-delimited data and needs to be mapped to table DEF_data with 14 columns (two are metadata)
File Data_20190501_GHI_000.txt contains 10 columns of tilde-delimited data and needs to be mapped to table GHI_data with 12 columns (two are metadata)
etc... for ten different files and tables
I can handle the tilde delimiting, I THINK I can handle the dates in the file names programmatically, and one of the metadata fields can be handled in a db trigger. The other metadata field should be the file name, but that can certainly be a different question.
UPDATE
Following what I think Mahmoud Ben Hassine suggested, I made a separate reader, mapper, and writer for each file/table pair and tried to add them with the start(step1), next(step2), build() paradigm in the format below, based on the examples at Configuring and Running a Job from Spring's docs:
@Autowired
private JobBuilderFactory jobs;

@Bean
public Job job(@Qualifier("step1") Step step1, @Qualifier("step2") Step step2) {
    return jobs.get("myJob").start(step1).next(step2).build();
}
Either step runs independently, but once I add one in as the "next" step, it only executes the first one, and generates a "Step already complete or not restartable, so no action to execute" INFO message in the log output - where do I go from here?
A chunk-oriented step in Spring Batch can handle only one type of items at a time. I would use a job with different chunk-oriented steps in it. These steps can be run in parallel as there is no relation/order between input files.
Most of the configuration would be common in your case, so you can create an abstract step definition with common configuration properties, and multiple steps with specific properties for each one of them (in your case, I see it should be the file name, field set mapper and the target table).
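For illustration only, a common step-building method plus per-file steps might look roughly like this. The mapper classes (AbcFieldSetMapper, DefFieldSetMapper), record types, SQL, and file names are made-up placeholders for your own per-file pieces; only the file, mapper, and insert statement change from step to step.

import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ImportJobConfig {

    @Autowired private JobBuilderFactory jobs;
    @Autowired private StepBuilderFactory steps;
    @Autowired private DataSource dataSource;

    // Common configuration shared by every file-to-table step.
    private <T> Step fileToTableStep(String name, String fileName,
                                     FieldSetMapper<T> mapper, String insertSql) {
        FlatFileItemReader<T> reader = new FlatFileItemReaderBuilder<T>()
                .name(name + "Reader")
                .resource(new FileSystemResource(fileName))
                .lineTokenizer(new DelimitedLineTokenizer("~"))   // tilde-delimited input
                .fieldSetMapper(mapper)
                .build();

        JdbcBatchItemWriter<T> writer = new JdbcBatchItemWriterBuilder<T>()
                .dataSource(dataSource)
                .sql(insertSql)
                .beanMapped()                                     // named parameters from bean properties
                .build();

        return steps.get(name).<T, T>chunk(100).reader(reader).writer(writer).build();
    }

    @Bean
    public Step abcStep() {
        return fileToTableStep("abcStep", "Data_20190501_ABC_000.txt",
                new AbcFieldSetMapper(),   // hypothetical mapper: 4 columns -> AbcRecord
                "INSERT INTO ABC_data (col1, col2, col3, col4) VALUES (:col1, :col2, :col3, :col4)");
    }

    @Bean
    public Step defStep() {
        return fileToTableStep("defStep", "Data_20190501_DEF_000.txt",
                new DefFieldSetMapper(),   // hypothetical mapper: 12 columns -> DefRecord
                "INSERT INTO DEF_data (col1, col2 /* ...12 columns elided... */) VALUES (:col1, :col2 /* ... */)");
    }

    @Bean
    public Job importJob(@Qualifier("abcStep") Step abcStep,
                         @Qualifier("defStep") Step defStep) {
        // Add the remaining eight steps the same way, or wire them into a
        // split/flow if you want them to run in parallel.
        return jobs.get("importJob").start(abcStep).next(defStep).build();
    }
}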
Hope this helps.
I'm trying to access all key-value pairs in the defined state store, but in the .transform() method I can only access one key (the source key):
KeyValueStore<String, String> SS = context.getStateStore("macs");
SS is not able to get all the key values in the state store:
SS.get("key1");
SS.get("key2");
SS.get("key3");
SS.get("key4");
Only 1 out of the 4 returns a value; the rest return null.
SS is not able to get all the key values in the state store
This is the expected behavior. The data in a "logical" state store in Kafka Streams is actually partitioned (sharded) across the actual instances of the state store across the running instances of your distributed Kafka Streams application (even if you run only 1 application instance, like 1 Docker container for your app). Let me explain below.
A simplified example to illustrate the nature of partitioned state stores: If your application reads from an input topic with 5 partitions, then the processing topology of this application will use 5 stream tasks, and each stream task will get one partition of the "logical" state store (see Kafka Streams Architecture). If you run only 1 application instance (like 1 Docker container) for your application, then this single instance will execute all 5 stream tasks, but these stream tasks are a shared-nothing setup -- which means that the data is still partitioned. This is also the case for KTables in Kafka Streams, which are partitioned in this manner as well.
See also: Is Kafka Stream StateStore global over all instances or just local?
Your example above would only work in the special case where the input topic has only 1 partition, because then there is only 1 stream task, and thus only 1 state store (which would have access to all available keys in the input data).
I'm trying to access all key-value pairs in the defined state store [...]
Now, if you do want to have access to all available keys in the input data, you have two options (unless you want to go down the route of the special case of an input topic with only 1 partition):
Option 1: Use global state stores (or GlobalKTable) instead of the normal, partitioned state stores. Global state stores can be defined/created via StreamsBuilder#addGlobalStore(...), but IIRC you don't need to explicitly add ("attach") global stores to Processors, which you would have to do for normal state stores. Instead, global stores can be accessed by any Processor automatically (a sketch follows at the end of this answer).
Option 2: Use the interactive queries feature (aka queryable state) in Kafka Streams.
Note that, in both options, you can access the data in the state store(s) only for reading. You cannot write directly to the state stores in these two situations. If you need to modify the data, then you must update them indirectly through the input topics that are used to populate the stores.
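For Option 1, here is a rough sketch using the older Processor API, which matches the .transform() style above. Exact signatures vary a bit between Kafka Streams versions, and the store and topic names ("macs-global", "macs-source-topic") are made up for illustration.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class GlobalStoreExample {

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        StoreBuilder<KeyValueStore<String, String>> macsStoreBuilder =
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("macs-global"),   // assumed store name
                        Serdes.String(),
                        Serdes.String());

        // Every application instance gets a full copy of "macs-source-topic"
        // in the "macs-global" store; the supplied processor just keeps it up to date.
        builder.addGlobalStore(
                macsStoreBuilder,
                "macs-source-topic",                                   // assumed topic name
                Consumed.with(Serdes.String(), Serdes.String()),
                () -> new Processor<String, String>() {
                    private KeyValueStore<String, String> store;

                    @Override
                    @SuppressWarnings("unchecked")
                    public void init(ProcessorContext context) {
                        store = (KeyValueStore<String, String>) context.getStateStore("macs-global");
                    }

                    @Override
                    public void process(String key, String value) {
                        store.put(key, value);
                    }

                    @Override
                    public void close() { }
                });

        // Inside your existing Transformer you can then read any key, e.g.:
        //   KeyValueStore<String, String> SS =
        //       (KeyValueStore<String, String>) context.getStateStore("macs-global");
        //   SS.get("key1"); SS.get("key2"); ...   // no longer limited to the local partition
        return builder;
    }
}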
Right now I have one stream application in SCDF that is pulling data from multiple tables in a database and replicating it to another database. Currently, our goal is to reduce the amount of work that a given stream is doing, so we want to split the stream out into multiple streams and continue replicating the data into the second database.
Are there any recommended design patterns for funneling the processing of these various streams into one?
If I understand this requirement correctly, you'd want to split the ingest piece by DB/Table per App and then merge them all into a single "payload type" for downstream processing.
If you really do want to split the ingest by DB/Table, you can, but you may want to weigh the pros and cons. One obvious benefit is granularity: you can update each App independently and in isolation, and maybe gain some reusability. Of course, it brings other challenges: maintenance, fixes, and releases for the individual apps, to name a few.
That said, you can fan-in data to a single consumer. Here's an example:
foo1 = jdbc | transform | hdfs
foo2 = jdbc > :foo1.jdbc
foo3 = jdbc > :foo1.jdbc
foo4 = jdbc > :foo1.jdbc
Here, foo1 is the primary pipeline reading data from a particular DB/Table combination. Likewise, foo2, foo3, and foo4 could read from other DB/Table combinations. However, these 3 streams are writing the consumed data to a named-destination, which in this case happens to be foo1.jdbc (aka: topic name). This destination is automatically created by SCDF when deploying the foo1 pipeline; specifically to connect "jdbc" and "transform" Apps with the foo1.jdbc topic.
In summary, we are routing the different table data to land in the same destination, so the downstream App, in this case, the transform processor gets the data from different tables.
If the correlation of data is important, you can partition the data at the producer by a unique key (e.g., customer-id = 1001) at each jdbc source, so that context-specific information lands at the same transform processor instance (assuming you have "n" processor instances for scaled-out processing).
I'm learning spring-batch. I'm currently working with biological data that look like this:
interface Variant {
    public String getChromosome();
    public int getPosition();
    public Set<String> getGenes();
}
(A Variant is a position on the genome which may overlap some genes.)
I've already written some ItemReaders/ItemWriters.
Now I would like to run some analysis per gene. Thus I would like to split my workflow for each gene (gene1, gene2,... geneN) to do some statistics about all the variants linked to one gene.
What is the best way to implement a Partitioner for this (and is it the correct class anyway)? All the examples I've seen use some 'indexes' or a finite gridSize. Furthermore, must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and Spring Batch will run no more than gridSize jobs in parallel? How can I join the data at the end?
thanks
EDIT: or maybe I should look at MultiResourceItemWriter?
When using Spring Batch's partitioning capabilities, there are two main classes involved, the Partitioner and the PartitionHandler.
Partitioner
The Partitioner interface is responsible for dividing up the data to be processed into partitions. It has a single method Partitioner#partition(int gridSize) that is responsible for analyzing the data that is to be partitioned and returning a Map with one entry per partition. The gridSize parameter is really just a piece of input into the overall calculation that can be used or ignored. For example, if the gridSize is 5, I may choose to return exactly 5 partitions, I may choose to overpartition and return some multiple of 5, or I may analyze the data and realize that I only need 3 partitions and completely ignore the gridSize value.
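For illustration, a gene-based Partitioner might look roughly like this, assuming you can list the distinct gene names up front (e.g., from a query or a first pass over the data). Each partition carries its gene name in its ExecutionContext so that partition's reader can select only that gene's variants; the class and key names are made up.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class GenePartitioner implements Partitioner {

    private final List<String> genes;   // distinct gene names, obtained beforehand

    public GenePartitioner(List<String> genes) {
        this.genes = genes;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // One partition per gene; gridSize is deliberately ignored here.
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String gene : genes) {
            ExecutionContext context = new ExecutionContext();
            context.putString("gene", gene);
            partitions.put("partition-" + gene, context);
        }
        return partitions;
    }
}

Each worker step's reader can then be a step-scoped bean that receives the gene via #{stepExecutionContext['gene']} and reads only the variants linked to that gene.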
PartitionHandler
The PartitionHandler is responsible for the delegation of the partitions returned by the Partitioner to workers. Within the Spring ecosystem, there are three provided PartitionHandler implementations, a TaskExecutorPartitionHandler that delegates the work to threads internal to the current JVM, a MessageChannelPartitionHandler that delegates work to remote workers listening on some form of messaging middleware, and a DeployerPartitionHandler out of the Spring Cloud Task project that launches new workers dynamically to execute the provided partitions.
With all the above laid out, to answer your specific questions:
What is the best way to implement a Partitioner for this (is it the correct class anyway)? That typically depends on the data you're partitioning and the store it's in. Without further insight into how you are storing the gene data, I can't really comment on what the best approach is.
Must the map returned by partition(gridSize) have fewer than gridSize items, or can I return a 'big' map and Spring Batch will run no more than gridSize jobs in parallel? You can return as many items in the Map as you see fit. As mentioned above, the gridSize is really meant as a guide.
How can I join the data at the end? A partitioned step is expected to have each partition processed independently of the others. If you want some form of join at the end, you'll typically do that in a step after the partition step.
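For completeness, a rough sketch of the wiring: geneWorkerStep and aggregateGeneStatsStep are assumed beans you would define yourself, the GenePartitioner is the sketch shown earlier, and the taskExecutor call gives you a TaskExecutorPartitionHandler under the hood.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class GeneJobConfig {

    @Autowired private JobBuilderFactory jobBuilderFactory;
    @Autowired private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Step partitionedGeneStep(Step geneWorkerStep, GenePartitioner genePartitioner) {
        return stepBuilderFactory.get("partitionedGeneStep")
                .partitioner("geneWorkerStep", genePartitioner)
                .step(geneWorkerStep)                        // chunk-oriented worker step (assumed bean)
                .gridSize(8)                                 // only a hint, see above
                .taskExecutor(new SimpleAsyncTaskExecutor()) // local threads via TaskExecutorPartitionHandler
                .build();
    }

    @Bean
    public Job geneAnalysisJob(Step partitionedGeneStep, Step aggregateGeneStatsStep) {
        return jobBuilderFactory.get("geneAnalysisJob")
                .start(partitionedGeneStep)
                .next(aggregateGeneStatsStep)                // "join"/aggregate after all partitions finish
                .build();
    }
}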
I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be inter related i.e. processing of such records MUST follow the order they appear in the file. Using the regular sequential approach (i.e. single thread for the entire file) yields me bad performances, therefore I want to use the partitioning feature. Due to my processing requirement, inter related records MUST be in the same partition (as well as in the order they appear in the file). I thought about the idea of using a hash based partitioning algorithm with a carefully chosen hash function (so that near equally sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such case? According to one of the Spring Batch author/developer, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple of options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it up into each of the list of records that needs to be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel.
Load the file into a staging table - This is the easier method, IMHO. Load the file into a staging table. From there, you can partition the processing based on any number of factors (see the sketch below).
In either case, the result allows you to scale the process out as wide as you need to achieve the performance you're after.
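For the staging-table option, a rough sketch of a partitioner over that table follows. The table/column names are made up; the assumption is that the load step computes a BUCKET column as abs(hash(groupKey)) % bucketCount so that inter-related records share a bucket, and each worker reader then selects "WHERE BUCKET = :bucket ORDER BY LINE_NUMBER" using a step-scoped query parameter from the ExecutionContext.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class BucketPartitioner implements Partitioner {

    private final int bucketCount;   // number of hash buckets written by the load step

    public BucketPartitioner(int bucketCount) {
        this.bucketCount = bucketCount;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // One partition per bucket; each worker reads only its own bucket, in file order.
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("bucket", bucket);
            partitions.put("bucket-" + bucket, context);
        }
        return partitions;
    }
}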
FlatFileItemReader is not thread safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
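For example, a minimal synchronizing delegator in the spirit of that quote might look like this (recent Spring Batch versions also ship a SynchronizedItemStreamReader you could use instead):

import org.springframework.batch.item.ItemReader;

// Simple synchronizing wrapper around a non-thread-safe delegate reader.
public class SynchronizingItemReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizingItemReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        // Only the read is serialized; processing and writing still run in parallel.
        return delegate.read();
    }
}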
I think your question is somewhat of a duplicate of this one: multithreaded item reader