I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be interrelated, i.e. such records MUST be processed in the order they appear in the file. The regular sequential approach (i.e. a single thread for the entire file) gives me poor performance, so I want to use the partitioning feature. Because of this requirement, interrelated records MUST be in the same partition (and in the order they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that partitions of nearly equal size are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented in such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it into the groups of records that need to be processed in sequence, one file per group. From there, you can use the MultiResourcePartitioner to process each file in parallel.
Load the file into a staging table - This is the easier method IMHO. Load the file into a staging table. From there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale out the process as wide as you need to go to obtain the performance you need.
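The first option can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the first CSV column is assumed to identify a group of related records, and the buckets are kept in memory instead of being written to files. Records with the same key always land in the same bucket and keep their original order.

```java
import java.util.ArrayList;
import java.util.List;

final class HashSplitSketch {
    // Assumption: the first CSV column identifies the group of related records.
    static String groupKey(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    // Distributes lines over bucketCount buckets by the hash of their group key.
    // Lines are appended in file order, so order within a group is preserved.
    static List<List<String>> split(List<String> lines, int bucketCount) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String line : lines) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(groupKey(line).hashCode(), bucketCount);
            buckets.get(index).add(line);
        }
        return buckets;
    }
}
```

In a real job each bucket would presumably be streamed to its own file during the single pass, and those files would then be handed to the MultiResourcePartitioner.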
FlatFileItemReader is not thread safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
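The synchronizing delegator described in that passage can be sketched as follows. SimpleReader is a hypothetical, simplified stand-in for Spring Batch's ItemReader (it declares no checked exceptions), and ListReader is only there to have something stateful to wrap.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an item reader: returns null when input is exhausted.
interface SimpleReader<T> {
    T read();
}

// Not thread safe: the iterator is shared mutable state.
final class ListReader<T> implements SimpleReader<T> {
    private final Iterator<T> iterator;

    ListReader(List<T> items) {
        this.iterator = items.iterator();
    }

    @Override
    public T read() {
        return iterator.hasNext() ? iterator.next() : null;
    }
}

// Delegator that serializes calls to read(), so several worker threads
// can safely share one underlying reader.
final class SynchronizedReader<T> implements SimpleReader<T> {
    private final SimpleReader<T> delegate;

    SynchronizedReader(SimpleReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() {
        return delegate.read();
    }
}
```

All worker threads would share a single SynchronizedReader instance. Note that this makes reading safe but gives no guarantee about the order in which items are processed. Recent Spring Batch versions also ship a SynchronizedItemStreamReader wrapper that serves the same purpose.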
I think your question is somewhat of a duplicate of this one: multithreaded item reader
Small question regarding Spark and how to read from the result of a http response please.
It is well known Spark can take as datasource some database, or CSV, etc...
sparkSession.read().format("csv").load("path/to/people.csv");
sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load()
May I ask how to read from the result of an http call directly please?
Without having to dump the data back inside another intermediate csv / intermediate database table.
For instance, the csv and database would contain millions of rows, and once read, the job needs to perform some kind of map reduce operation.
Now, the exact same data comes from the result of an http call. It is small enough for the network layer, but the information contained inside the payload is big, so I would like to apply the same map reduce.
How to read from the response of an http call please?
Thank you
You have two options for reading data in Spark:
Read directly to the driver and distribute to the executors (not scalable as everything passes through driver)
Read directly from the executors
The built in data sources like csv, parquet etc all implement reading from the executors so the job can scale with the data. They define how each partition of the data should be read - e.g. if we have 10 executors, how do you cut up the data source into 10 sections so each executor can directly read one section.
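To make the "cut up the data source into sections" idea concrete, here is a minimal sketch that is not Spark's API. The class and method names are hypothetical; it only shows how a source of known size could be divided into contiguous ranges, one per executor.

```java
final class PartitionPlanSketch {
    // Splits [0, total) into `parts` contiguous half-open ranges [start, end)
    // whose sizes differ by at most one.
    static long[][] plan(long total, int parts) {
        long[][] ranges = new long[parts][2];
        long base = total / parts;   // minimum size of every range
        long extra = total % parts;  // the first `extra` ranges get one more row
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }
}
```

For example, plan(25, 10) yields five ranges of three rows followed by five ranges of two rows. A real data source would additionally need a way for each executor to fetch only its own range.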
If you want to load from an HTTP request, you will either have to read through the driver and distribute, which may be OK if you know the data is going to be less than ~10 MB, or implement a custom data source so that each executor can read its own partition. You can read more here: https://aamargajbhiye.medium.com/speed-up-apache-spark-job-execution-using-a-custom-data-source-fd791a0fa4b0
I will finish by saying that this second option is almost definitely an anti-pattern. You will likely be much better off providing an intermediate staging environment (e.g. S3/GCS), calling the server to load the data to the intermediate store and then reading it into Spark on completion. In scenario 2, you will likely end up putting too much load on the server, amongst other issues.
In previous lifetimes, I created a custom datasource. It is not the most trivial thing to do, but this GitHub repo explains it: https://github.com/jgperrin/net.jgp.books.spark.ch09.
When it comes to reading from a network stream, make sure that only one executor does it.
I have a scenario where, for each request, I have to make a batch get of at least 1000 keys.
Currently I'm getting 2000 requests per minute and this is expected to rise.
Also, I've read that Aerospike's batch get internally makes individual requests to the server, concurrently or sequentially.
I am using Aerospike as a cluster (running on SSD). Would it be efficient to write a UDF (user-defined function) in Lua that makes the batch request and aggregates the results at the server level, instead of making multiple hits from the client?
Kindly suggest whether the default batch get of Aerospike will be efficient or whether I have to do something else.
Batch read is the right way to do it. Results are returned in the order of the keys specified in the list. Records not found will return null. The client parallelizes the keys by node, waits (there is no callback in the client, unlike Secondary Index or Scan), collects the returns from all nodes and presents them back to the caller in the original order. Make sure you have adequate memory in the client to hold all the returned batch results.
To UDF or Not to UDF?
First thing, you cannot do batch reads as a UDF, at least not in any way that's remotely efficient.
You have two kinds of UDF. The first is a record UDF, which is limited to operating on a single record. The record is locked as your UDF executes, so it can either read or modify the data, but it is sandboxed from accessing other records. The second is a stream UDF, which is read-only, and runs against either a query or a full scan of a namespace or set. Its purpose is to allow you to implement aggregations. Even if you're retrieving 1000 keys at a time, using stream UDFs to just pick a batch of keys from a much larger set or namespace is very inefficient. That aside, UDFs will always be slower than the native operations provided by Aerospike, and this is true for any database.
Batch Reads
Read the documentation for batch operations, and specifically the section on the batch-index protocol. There is a great pair of FAQs in the community forum you should read:
FAQ - Differences between getting single record versus batch
FAQ - batch-index tuning parameters
Capacity Planning
Finally, if you are getting 2000 requests per minute at your application, and each of those turns into a batch read of 1000 keys, you need to make sure that your cluster is sized properly to handle 2000 * 1000 = 2 million record reads per minute, or roughly 33,000 reads per second, and more as traffic rises. Tuning the batch-index parameters will help, but if you don't have enough aggregate SSD capacity to support those reads, your problem is one of capacity planning.
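The load estimate depends heavily on whether the request rate is per minute or per second, so it is worth redoing the arithmetic with the period made explicit. A minimal sketch, with hypothetical names:

```java
final class BatchLoadSketch {
    // Record reads per second generated by `requests` batch requests arriving
    // every `periodSeconds` seconds, each fetching `keysPerBatch` keys.
    static double recordReadsPerSecond(double requests, double periodSeconds, int keysPerBatch) {
        return requests * keysPerBatch / periodSeconds;
    }
}
```

With 1000 keys per batch, 2000 requests per second gives 2,000,000 record reads per second, while 2000 requests per minute gives about 33,333 per second.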
I'll be getting a large number of xml files (numbering in tens of thousands every few minutes) from an MQ. The xml files aren't very big. I have to extract the information and save it into a database. I cannot use third party libraries unfortunately (except the apache commons). What strategies/techniques are normally used in this scenario? Is there any xml parser in java or apache which can handle such situations well?
I might also add that I'm using jdk 1.4
Based on the comments and discussion around this topic - I would like to propose a consolidated solution.
Parsing XML files using SAX - As #markspace mentioned, you should go
with SAX which is built-in and has good performance.
Use BULK INSERTS if possible - Since you plan to insert a large
amount of data, consider what type of data you are reading and
storing into the database. Do all the XML files share the same
schema (which means they correspond to a single table in the
database) OR do they represent different objects (which means you
would end up inserting data into multiple tables)?
In case all the XML files share a schema and are inserted into
the same table in the database, consider batching these data
objects and bulk-inserting them into the database. This will
definitely perform better in terms of time as well as resources
(you would open only a single connection to persist a batch as
opposed to one connection for each object). Of course you
would need to spend some time tuning your batch size and also
deciding the error handling strategy for batch inserts (discard
all vs. discard only the erroneous records).
If the schemas of the XML files are different, then consider grouping
similar XMLs so that you can BULK INSERT these groups
later.
Finally - and this is important : Ensure that you release all the
resources such as File handles, Database connections etc once you
are done with processing or in case you encounter errors. In simple
words use try-catch-finally at the correct places.
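The SAX and try-catch-finally points above can be sketched as follows. This is a minimal illustration written in JDK 1.4 style (raw collections, StringBuffer, no try-with-resources). The element name "name" and the class names are assumptions for the example, and the input is an in-memory string instead of a message from the MQ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text of every <name> element (hypothetical element name).
class NameHandler extends DefaultHandler {
    final List names = new ArrayList(); // raw type: JDK 1.4 has no generics
    private StringBuffer text;          // non-null only while inside <name>

    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            text = new StringBuffer();
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // may be called several times per element
        }
    }

    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString());
            text = null;
        }
    }
}

class SaxSketch {
    // Parses one XML document and returns the collected <name> values.
    static List parseNames(String xml) {
        InputStream in = null;
        try {
            in = new ByteArrayInputStream(xml.getBytes("UTF-8"));
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            NameHandler handler = new NameHandler();
            parser.parse(in, handler);
            return handler.names;
        } catch (Exception e) {
            throw new RuntimeException("XML parsing failed: " + e);
        } finally {
            // Always release the stream, whether parsing succeeded or failed.
            if (in != null) {
                try { in.close(); } catch (IOException ignored) { }
            }
        }
    }
}
```

The values collected per document could then be accumulated and written with a JDBC PreparedStatement using addBatch() and executeBatch(), with the connection closed in a finally block in the same way.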
While by no means complete, I hope this answer provides you with a set of critical checkpoints to consider while writing scalable, performant code.
I am studying the Java EE Batch API (JSR-352) in order to test the feasibility of replacing our current ETL tool with our own solution using this technology.
My goal is to build a job in which I:
get some (dummy) data from a datasource in step1,
some other data from other data-source in step2 and
merge them in step3.
I would like to process each item and not write to a file, but send it to the next step. And also store the information for further use. I could do that using batchlets and jobContext.setTransientUserData().
I think I am not getting the concepts right: as far as I understood, JSR-352 is meant for this kind of ETL tasks, but it has 2 types of steps: chunk and batchlets. Chunks are "3-phase-steps", in which one reads, processes and writes the data. Batchlets are tasks that are not performed on each item on the data, but once (as calculating totals, sending email and others).
My problem is that my solution is not correct if I consider the definition of batchlets.
How could one implement this kind of job using the Java EE Batch API?
I think you had better use chunks rather than batchlets to implement ETL. Typical chunk processing with a datasource looks something like the following:
ItemReader#open(): open a cursor (create Connection, Statement and ResultSet) and save them as instance variables of ItemReader.
ItemReader#readItem(): create and return an object that contains the data of a row using the ResultSet.
ItemReader#close(): close JDBC resources.
ItemProcessor#processItem(): do the calculation, then create and return an object which contains the result.
ItemWriter#writeItems(): save the calculated data to the database. Open a Connection and Statement, invoke executeUpdate() and close them.
As for your situation, I think you have to choose one data source to treat as the primary one and open a cursor for it in ItemReader#open(). Then fetch the other one in ItemProcessor#processItem() for each item.
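The flow described here (read from a primary source, look up the other source per item, write in chunks) can be sketched without the JSR-352 API. Everything below is a hypothetical stand-in and not the javax.batch interfaces: an Iterator plays the reader's cursor, a Map plays the second data source, and itemCount plays the role of the chunk's item-count.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class ChunkMergeSketch {
    // Reads the primary source item by item, merges each item with data from
    // the secondary source, and hands the results to the writer in chunks.
    static void run(Iterator<String> primary, Map<String, String> secondary,
                    int itemCount, Consumer<List<String>> writer) {
        List<String> chunk = new ArrayList<>();
        while (primary.hasNext()) {
            String id = primary.next();                   // reader: readItem()
            String merged = id + "=" + secondary.get(id); // processor: processItem()
            chunk.add(merged);
            if (chunk.size() == itemCount) {              // writer: writeItems()
                writer.accept(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writer.accept(chunk);                         // final partial chunk
        }
    }
}
```

In a real job the container drives this loop and wraps each chunk in a transaction. The sketch only shows where the merge with the second data source would sit.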
I also recommend reading these useful examples of chunk processing:
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-1/
http://www.radcortez.com/java-ee-7-batch-processing-and-world-of-warcraft-part-2/
My blog entries about JBatch and chunk processing:
http://www.nailedtothex.org/roller/kyle/category/JBatch
spring-batch newbie: I have a series of batches that
read all new records (since the last execution) from some sql tables
upload all the new records to hadoop
run a series of map-reduce (pig) jobs on all the data (old and new)
download all the output to local and run some other local processing on all the output
point is, I don't have any obvious "item" - I don't want to relate to the specific lines of text in my data, I work with all of it as one big chunk and don't want any commit intervals and such...
however, I do want to keep all these steps loosely coupled - as in, step a+b+c might succeed for several days and accumulate processed stuff while step d keeps failing, and then when it finally succeeds it will read and process all of the output of its previous steps.
SO: is my "item" a fictive "working-item" which will signify the entire new data? do I maintain a series of queues myself and pass this fictive working-items between them?
thanks!
people always assume that the only use of spring batch is chunk processing. that is a huge feature, but what's overlooked is the visibility of the processing and job control.
give 5 people the same task with no spring batch and they're going to implement flow control and visibility their own way. give 5 people the same task and spring batch and you may end up with custom tasklets all done differently, but getting access to the job metadata and starting and stopping jobs is going to be consistent. from my perspective it's a great tool for job management. if you already have your jobs written, you can implement them as custom tasklets if you don't want to rewrite them to conform to the 'item' paradigm. you'll still see benefits.
I don't see the problem. Your scenario seems like a classic application of Spring Batch to me.
read all new records (since the last execution) from some sql tables
Here, an item is a record
upload all the new records to hadoop
Same here
run a series of map-reduce (pig) jobs on all the data (old and new)
Sounds like a StepListener or ChunkListener
download all the output to local and run some other local processing on all the output
That's the next step.
The only problem I see is if you don't have Domain Objects for your records. But even then, you can work with maps or arrays, while still using ItemReaders and ItemWriters.