How to use Spring Batch to connect to server? - java

I need help from experienced Spring Batch programmers with a particular problem I am having.
I am new to Spring Batch. It has detailed documentation for each class and interface, but next to nothing on how it all works together.
The Scenario:
I have a database hosted in the cloud, so I have to make REST calls to get some data from it and save related data to it.
- Data retrieval will return a JSON response with data I queried.
- Data save will return a JSON response on how many rows were added etc.
- All responses will have a valid HTTP status code.
- A transaction is complete when the save call succeeds with an HTTP code of 200 and data showing how many records were inserted is received.
The connection may not always be available. In that case, the program must keep retrying every 5 minutes until the whole task is complete.
What I chose not to do
I could have used some dirty Java tricks (which were surprisingly recommended by many on Stack Overflow):
Threads and sleep (too crude)
Spring's @Scheduled (the scheduler keeps running even after job completion)
What I tried
So I decided to use Spring Batch since it seemed to be a framework made for this.
I have no file tasks, so I used a Tasklet instead of Readers and Writers.
The Tasklet interface can only return a FINISHED status code; there is no code for FAILURE.
So, inside the tasklet, I set a custom value in the StepContext, retrieved that value in a StepExecutionListener, and accordingly set the ExitStatus of the Step to FAILURE (sketched below).
To handle this workaround I also had to configure a JobExecutionListener to make the Job fail accordingly.
Apart from all of these workarounds:
Spring Batch does not have any scheduling of its own, so I end up needing another scheduler anyway.
Spring Batch's retry within a step applies only to ItemReader, ItemWriter, etc., and not to tasklets.
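For reference, a minimal sketch of the tasklet/listener workaround mentioned above might look like this (the class names and the context key are my own, purely illustrative):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class RestCallTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        boolean success = callRemoteService(); // hypothetical REST call
        // stash the outcome in the step's ExecutionContext for the listener
        chunkContext.getStepContext().getStepExecution()
                .getExecutionContext().put("restCallSucceeded", success);
        return RepeatStatus.FINISHED; // a Tasklet can only report FINISHED here
    }

    private boolean callRemoteService() {
        // call the cloud REST endpoint and evaluate the JSON/HTTP response
        return true;
    }
}

class RestCallStepListener implements StepExecutionListener {

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        Boolean ok = (Boolean) stepExecution.getExecutionContext().get("restCallSucceeded");
        // translate the flag into the step's ExitStatus
        return Boolean.TRUE.equals(ok) ? ExitStatus.COMPLETED : ExitStatus.FAILED;
    }
}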
The Question
Is Spring Batch right for this situation?
Is my design correct? It seems very "hack"-ey.
I need help with the most efficient way to handle my scenario.

I was using Spring Batch for a similar case - as an execution engine to process large files, which resulted in lots of REST requests to other systems.
What Spring Batch brought for me:
An execution engine/model for large dependent operations. In other words, I could maintain my input as one single entry point and have a 'huge' transaction on top of other small operations.
Possibility to see execution results and monitor them.
Retriability of batch operations. This is one of the best things in Spring Batch: it allows you to design your operation in such a manner that if something goes wrong in the middle of execution, you can simply restart it and continue from the failing point. But you need to invest some effort to maintain this.
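As a rough illustration of that restartability (a sketch only; the jobLauncher and importJob beans are assumptions, not from the original post): re-launching a job with the same identifying JobParameters makes Spring Batch resume from the failed step rather than start over, provided the job and steps are restartable.

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class RestartExample {

    private final JobLauncher jobLauncher;
    private final Job importJob;

    public RestartExample(JobLauncher jobLauncher, Job importJob) {
        this.jobLauncher = jobLauncher;
        this.importJob = importJob;
    }

    public void runWithRestart() throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("run.date", "2019-01-01") // identifying parameter
                .toJobParameters();

        JobExecution execution = jobLauncher.run(importJob, params);
        if (execution.getStatus() == BatchStatus.FAILED) {
            // re-launching (possibly from a later scheduled run) with the SAME
            // parameters continues from the failed step instead of starting over
            jobLauncher.run(importJob, params);
        }
    }
}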
More on business cases here: https://docs.spring.io/spring-batch/trunk/reference/html/spring-batch-intro.html#springBatchUsageScenarios
So you need to check those business cases carefully and ask yourself whether you really need them.
From what you have described so far, I really don't see the benefit of Spring Batch for you.

Related

Axon commands on read model

We are currently using Axon Framework with HikariCP as the data source pooling system. We are facing pool-exhaustion issues from time to time and we have a theory:
To update our read models we use the command bus to send UpdateEntityViewCommand.
As the command bus starts a transaction using the primary transaction manager (the write one), it acquires a connection from the write pool.
In the handler, we open an inner transaction using a connection from the read pool, thus blocking the outer one.
This seems to exhaust the pools under some conditions. The question is: should we stop using the command bus to update our read models? Is it appropriate to have two buses (one for write and one for read)?
Thanks in advance
Axon's intent with the separation of Commands, Events, and Queries, is in general to support CQRS within your system. This means you would introduce dedicated Command Models and Query Models, which respectively only receive Command and Query messages.
The Query Models are typically also referred to as Projections or Read Models.
It is this combination of ideas that Axon is normally used for, which makes your question a bit vague. Or, perchance, you are using it differently than normally intended.
That Axon supports CQRS through this approach does not necessitate that you do CQRS within your setup as well, by the way.
At any rate, it would be good to (as also stated in the comments):
Include code snippets of the message handlers where these would prove useful
Give some deeper insight into your configuration
Clarify whether you aim to use the CQRS pattern in your application, yes or no
Completely separate from this, I can give some guidance on how Axon deals with transactions. Every message inside an Axon application will start a so-called "Unit of Work". It is the UnitOfWork (UoW for short) that will start a transaction. This means that no matter if you use commands, events or queries, Axon will have started that transaction for you already.
Taking this a step further, this also means that whatever you do inside a message handling function (thus the @CommandHandler, @EventHandler and @QueryHandler annotated methods) will always already have an active transaction running. That, for example, means you do not have to include your own transaction management inside a message handling function, with for example the @Transactional annotation.
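For instance, a handler for the UpdateEntityViewCommand from your question would not need its own @Transactional, because the Unit of Work Axon opens for the command already carries the transaction (a sketch; the repository type and its method are assumptions on my part):

import org.axonframework.commandhandling.CommandHandler;
import org.springframework.stereotype.Component;

@Component
public class EntityViewCommandHandler {

    private final EntityViewRepository repository; // hypothetical read-model repository

    public EntityViewCommandHandler(EntityViewRepository repository) {
        this.repository = repository;
    }

    @CommandHandler
    public void handle(UpdateEntityViewCommand command) {
        // already runs inside the transaction started by Axon's UnitOfWork,
        // so no @Transactional is declared here
        repository.updateView(command.getEntityId(), command.getNewState());
    }
}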
Concluding, I am guessing you might be mixing some concepts. This can obviously happen, so no worries there; that's what SO is for. Next to this though, you thus do not have to start your own transaction inside any message handling function, as you already have an active transaction at that point in time.

Best Practice for Kafka rollback scenario in microservices [duplicate]

We have a micro-services architecture, with Kafka used as the communication mechanism between the services. Some of the services have their own databases. Say the user makes a call to Service A, which should result in a record (or set of records) being created in that service’s database. Additionally, this event should be reported to other services, as an item on a Kafka topic. What is the best way of ensuring that the database record(s) are only written if the Kafka topic is successfully updated (essentially creating a distributed transaction around the database update and the Kafka update)?
We are thinking of using spring-kafka (in a Spring Boot WebFlux service), and I can see that it has a KafkaTransactionManager, but from what I understand this is more about Kafka transactions themselves (ensuring consistency across the Kafka producers and consumers), rather than synchronising transactions across two systems (see here: “Kafka doesn't support XA and you have to deal with the possibility that the DB tx might commit while the Kafka tx rolls back.”). Additionally, I think this class relies on Spring’s transaction framework which, at least as far as I currently understand, is thread-bound, and won’t work if using a reactive approach (e.g. WebFlux) where different parts of an operation may execute on different threads. (We are using reactive-pg-client, so are manually handling transactions, rather than using Spring’s framework.)
Some options I can think of:
Don’t write the data to the database: only write it to Kafka. Then use a consumer (in Service A) to update the database. This seems like it might not be the most efficient, and will have problems in that the service which the user called cannot immediately see the database changes it should have just created.
Don’t write directly to Kafka: write to the database only, and use something like Debezium to report the change to Kafka. The problem here is that the changes are based on individual database records, whereas the business significant event to store in Kafka might involve a combination of data from multiple tables.
Write to the database first (if that fails, do nothing and just throw the exception). Then, when writing to Kafka, assume that the write might fail. Use the built-in auto-retry functionality to get it to keep trying for a while. If that eventually completely fails, try to write to a dead letter queue and create some sort of manual mechanism for admins to sort it out. And if writing to the DLQ fails (i.e. Kafka is completely down), just log it some other way (e.g. to the database), and again create some sort of manual mechanism for admins to sort it out.
Anyone got any thoughts or advice on the above, or able to correct any mistakes in my assumptions above?
Thanks in advance!
I'd suggest using a slightly altered variant of approach 2.
Write into your database only, but in addition to the actual table writes, also write "events" into a special table within that same database; these event records would contain the aggregations you need. In the simplest case, you'd insert another entity, e.g. mapped by JPA, which contains a JSON property with the aggregate payload. Of course this could be automated by some means of a transaction listener / framework component.
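A minimal sketch of such an event entity (the names and columns are illustrative) could be as simple as:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Lob;

@Entity
public class OutboxEvent {

    @Id
    private String id;            // e.g. a UUID assigned when the event is created

    private String aggregateType; // e.g. "Order"
    private String aggregateId;
    private String eventType;     // e.g. "OrderCreated"

    @Lob
    private String payload;       // JSON representation of the aggregate

    // getters and setters omitted for brevity
}

Because it is persisted in the same transaction as the business tables, the event row and the actual data commit or roll back together.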
Then use Debezium to capture the changes just from that table and stream them into Kafka. That way you get both: an eventually consistent state in Kafka (the events in Kafka may trail behind, or you might see a few events a second time after a restart, but eventually they'll reflect the database state) without the need for distributed transactions, and the business-level event semantics you're after.
(Disclaimer: I'm the lead of Debezium; funnily enough I'm just in the process of writing a blog post discussing this approach in more detail)
Here are the posts
https://debezium.io/blog/2018/09/20/materializing-aggregate-views-with-hibernate-and-debezium/
https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/
First of all, I have to say that I'm neither a Kafka nor a Spring expert, but I think this is more of a conceptual challenge when writing to independent resources, and the solution should be adaptable to your technology stack. Furthermore, I should say that this solution tries to solve the problem without an external component like Debezium, because in my opinion each additional component brings challenges in testing, maintaining and running an application, which is often underestimated when choosing such an option. Also, not every database can be used as a Debezium source.
To make sure that we are talking about the same goals, let's clarify the situation with a simplified airline example, where customers can buy tickets. After a successful order the customer will receive a message (mail, push notification, ...) sent by an external messaging system (the system we have to talk to).
In a traditional JMS world with an XA transaction between our database (where we store orders) and the JMS provider, it would look like the following: the client sends the order to our app, where we start a transaction. The app stores the order in its database. Then the message is sent to JMS and you can commit the transaction. Both operations participate in the transaction even though they're talking to their own resources. As the XA transaction guarantees ACID, we're fine.
Let's bring Kafka (or any other resource that is not able to participate in the XA transaction) into the game. As there is no coordinator that syncs both transactions anymore, the main idea of what follows is to split processing into two parts with a persistent state.
When you store the order in your database, you can also store the message (with aggregated data) that you want to send to Kafka afterwards in the same database (e.g. as JSON in a CLOB column). Same resource, so ACID is guaranteed and everything is fine so far. Now you need a mechanism that polls your "KafkaTasks" table for new tasks that should be sent to a Kafka topic (e.g. with a timer service; maybe the @Scheduled annotation can be used in Spring). After the message has been successfully sent to Kafka, you can delete the task entry. This ensures that the message to Kafka is only sent when the order is also successfully stored in the application database. Did we achieve the same guarantees as with an XA transaction? Unfortunately no, as there is still the chance that writing to Kafka works but the deletion of the task fails. In this case the retry mechanism (you would need one, as mentioned in your question) would reprocess the task and send the message twice. If your business case is happy with this "at-least-once" guarantee, you're done here with an imho semi-complex solution that could easily be implemented as framework functionality so not everyone has to bother with the details.
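To make the polling part concrete, here is a rough sketch under the assumptions above (a "KafkaTasks" table, Spring's @Scheduled and spring-kafka; the task entity, repository and topic name are hypothetical):

import java.util.concurrent.TimeUnit;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class KafkaTaskPublisher {

    private final KafkaTaskRepository taskRepository;          // hypothetical Spring Data repository
    private final KafkaTemplate<String, String> kafkaTemplate;

    public KafkaTaskPublisher(KafkaTaskRepository taskRepository,
                              KafkaTemplate<String, String> kafkaTemplate) {
        this.taskRepository = taskRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedDelay = 5000) // poll interval is arbitrary here
    public void publishPendingTasks() {
        for (KafkaTask task : taskRepository.findAll()) {
            try {
                // block until the broker acknowledges the send
                kafkaTemplate.send("orders", task.getPayload()).get(10, TimeUnit.SECONDS);
                // only delete the task after a successful send; a crash between
                // send and delete produces a duplicate, hence "at-least-once"
                taskRepository.delete(task);
            } catch (Exception e) {
                // leave the task in place; it will be retried on the next run
            }
        }
    }
}

Note that the delete only happens after the broker has acknowledged the send, which is exactly what gives the at-least-once (but not exactly-once) behaviour described above.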
If you need “exactly-once” then you cannot store your state in the application database (in this case “deletion of a task” is the “state”) but instead you must store it in Kafka (assuming that you have ACID guarantees between two Kafka topics). An example: Let’s say you have 100 tasks in the table (IDs 1 to 100) and the task job processes the first 10. You write your Kafka messages to their topic and another message with the ID 10 to “your topic”. All in the same Kafka-transaction. In the next cycle you consume your topic (value is 10) and take this value to get the next 10 tasks (and delete the already processed tasks).
If there are easier (in-application) solutions with the same guarantees I’m looking forward to hear from you!
Sorry for the long answer but I hope it helps.
All the approaches described above are good ways to approach the problem and are well-defined patterns. You can explore these in the links provided below.
Pattern: Transactional outbox
Publish an event or message as part of a database transaction by saving it in an OUTBOX in the database.
http://microservices.io/patterns/data/transactional-outbox.html
Pattern: Polling publisher
Publish messages by polling the outbox in the database.
http://microservices.io/patterns/data/polling-publisher.html
Pattern: Transaction log tailing
Publish changes made to the database by tailing the transaction log.
http://microservices.io/patterns/data/transaction-log-tailing.html
Debezium is a valid answer but (as I've experienced) it can require the extra overhead of running an extra pod and making sure that pod doesn't fall over. This could just be me griping about a few back-to-back instances where pods OOM-errored and didn't come back up, networking rule rollouts dropped some messages, WAL access to an AWS Aurora DB started behaving oddly... It seems that everything that could have gone wrong, did. Not saying Debezium is bad; it's fantastically stable, but for devs, running it often becomes a networking skill rather than a coding skill.
As a KISS solution using normal coding techniques that will work 99.99% of the time (and inform you of the 0.01%):
Start a transaction.
Synchronously save to the DB.
-> If that fails, bail out.
Asynchronously send the message to Kafka.
Block until the topic reports that it has received the message.
-> If it times out or fails, abort the transaction.
-> If it succeeds, commit the transaction.
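A rough sketch of that flow with Spring (the service, repository, topic name and toJson() helper are assumptions): the database save and the blocking Kafka send happen inside one local transaction, so a failed or timed-out send rolls the database write back. The residual risk is a commit failure after the send has already gone out, which is the 0.01% mentioned above.

import java.util.concurrent.TimeUnit;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    private final OrderRepository orderRepository;             // hypothetical JPA repository
    private final KafkaTemplate<String, String> kafkaTemplate;

    public OrderService(OrderRepository orderRepository,
                        KafkaTemplate<String, String> kafkaTemplate) {
        this.orderRepository = orderRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    // rollbackFor is needed because the checked exceptions thrown by get()
    // would not trigger a rollback by default
    @Transactional(rollbackFor = Exception.class)
    public void createOrder(Order order) throws Exception {
        orderRepository.save(order);                           // sync save to DB

        // block until the broker acknowledges the record; on timeout or failure
        // the exception propagates and the surrounding DB transaction rolls back
        kafkaTemplate.send("orders", order.toJson()).get(10, TimeUnit.SECONDS);

        // reaching this point means both operations succeeded; the transaction
        // commits when the method returns
    }
}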
I'd suggest using a new approach: 2-phase messages. With this approach, much less code is needed, and you don't need Debezium any more.
https://betterprogramming.pub/an-alternative-to-outbox-pattern-7564562843ae
For this new approach, what you need to do is:
When writing to your database, write an event record to an auxiliary table.
Submit a 2-phase message to DTM
Write a service to query whether an event is saved in the auxiliary table.
With the help of the DTM SDK, you can accomplish the above 3 steps with 8 lines of Go, much less code than other solutions.
msg := dtmcli.NewMsg(DtmServer, gid).
    Add(busi.Busi+"/TransIn", &TransReq{Amount: 30})
err := msg.DoAndSubmitDB(busi.Busi+"/QueryPrepared", db, func(tx *sql.Tx) error {
    return AdjustBalance(tx, busi.TransOutUID, -req.Amount)
})

app.GET(BusiAPI+"/QueryPrepared", dtmutil.WrapHandler2(func(c *gin.Context) interface{} {
    return MustBarrierFromGin(c).QueryPrepared(db)
}))
Each of your original options has its disadvantages:
The user cannot immediately see the database changes they have just created.
Debezium will capture the log of the database, which may be much larger than the events you wanted. Deployment and maintenance of Debezium is also not an easy job.
The "built-in auto-retry functionality" is not cheap; it may require a lot of code or maintenance effort.

Spring Batch Multithreading stuck without any exception output

I am implementing Spring Batch multithreading for a daily process. The item reader, item processor and item writer are all beans (singletons). I am also using Hibernate and Spring Data JPA for DB access. For threading, I am using a ThreadPoolTaskExecutor.
Threads hang for no apparent reason, and I don't know the root cause. Right now, in the log, Hibernate gets stuck at either a select or an insert statement and just hangs there forever. I don't really know the reason. For transactions, I have REQUIRES_NEW and READ_COMMITTED for propagation and isolation. Everything else is just the default from Spring Boot.
I am processing 20k JSONs; the sizes differ, but some are big. I can't share the whole code because I don't know where the problem is. Basically, in the item processor, I have a few synchronized blocks to handle the business logic.
For this question, I just want to know what the possible reasons are, because Java doesn't give any information.
Without code it is hard to say what the problem might be. You need to dig down to exactly where it is getting stuck. Possible reasons are:
The select query in the reader is taking a long time (maybe due to improper indexing, or the table might be locked). You can add more logs in the reader to check exactly which record(s) are causing this.
If you are doing any DB operations in the ItemProcessor, add logs and check there as well (a logging wrapper is sketched below).
In the writer, the insert might be taking a long time to complete.
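For the logging suggested above, one option is to wrap the real processor in a thin delegate so the log shows exactly which item was in flight when the job hangs (a sketch; the delegate is whatever processor you already have):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.ItemProcessor;

public class LoggingItemProcessor<I, O> implements ItemProcessor<I, O> {

    private static final Logger log = LoggerFactory.getLogger(LoggingItemProcessor.class);

    private final ItemProcessor<I, O> delegate;

    public LoggingItemProcessor(ItemProcessor<I, O> delegate) {
        this.delegate = delegate;
    }

    @Override
    public O process(I item) throws Exception {
        long start = System.currentTimeMillis();
        log.info("Processing item: {}", item);
        O result = delegate.process(item);
        log.info("Finished item in {} ms", System.currentTimeMillis() - start);
        return result;
    }
}

A thread dump (e.g. with jstack) taken while the job is hung will also show whether the worker threads are blocked on the database or on one of the synchronized blocks.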

Parallel updates to different entity properties

I'm using JDO to access Datastore entities. I'm currently running into issues because different processes access the same entities in parallel and I'm unsure how to go around solving this.
I have entities containing values and calculated values: (key, value1, value2, value3, calculated)
The calculation happens in a separate task queue.
The user can edit the values at any time.
If the values are updated, a new task is pushed to the queue that overwrites the old calculated value.
The problem I currently have is in the following scenario:
User creates entity
Task is started
User notices an error in his initial entry and quickly updates the entity
Task finishes based on the old data (from step 1) and overwrites the entire entity, also removing the newly entered values (from step 3)
User is not happy
So my questions:
Can I make the task fail on update in step 4? Wrapping the task in a transaction does not seem to solve this issue for all cases due to eventual consistency (or, quite possibly, my understanding of datastore transactions is just wrong)
Is using the low-level setProperty method the only way to update a single field of an entity and will this solve my problem?
If none of the above, what's the best way to deal with a use case like this?
background:
At the moment, I don't mind trading performance for consistency. I will care about performance later.
This was my first AppEngine application, and because it was a learning process, it does not use some of the best practices. I'm well aware that, in hindsight, I should have thought longer and harder about my data schema. For instance, none of my entities use ancestor relationships where they would be appropriate. I come from a relational background and it shows.
I am planning a major refactoring, probably moving to Objectify, but in the meantime I have a few urgent issues that need to be solved ASAP. And I'd like to first fully understand the Datastore.
Obviously JDO comes with optimistic concurrency checking (should the user enable it) for transactions, which would prevent/reduce the chance of such things. Optimistic concurrency is equally applicable to relational datastores, so you likely know what it does.
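If you go that route, enabling it is mostly declarative; a sketch only (the class name is illustrative, and the exact strategy/extension details depend on your JDO/DataNucleus version and the datastore plugin):

import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Version;
import javax.jdo.annotations.VersionStrategy;

// Class-level version checking: a stale write then fails with an optimistic
// verification exception instead of silently overwriting newer data.
@PersistenceCapable
@Version(strategy = VersionStrategy.VERSION_NUMBER)
public class CalculatedEntity {
    // key, value1, value2, value3, calculated fields as described in the question
}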
Google's JDO plugin uses the low-level API setProperty() method obviously. The log even tells you what low level calls are made (in terms of PUT and GET). Moving to some other API will not on its own solve such problems.
Whenever you need to handle write conflicts in GAE, you almost always need transactions. However, it's not just as simple as "use a transaction":
First of all, make sure each logical unit of work can be defined in a transaction. There are limits to transactions; no queries without ancestors, only a certain number of entity groups can be accessed. You might find you need to do some extra work prior to the transaction starting (ie, lookup keys of entities that will participate in the transaction).
Make sure each unit of work is idempotent. This is critical. Some units of work are automatically idempotent, for example "set my email address to xyz". Some units of work are not automatically idempotent, for example "move $5 from account A to account B". You can make transactions idempotent by creating an entity before the transaction starts, then deleting the entity inside the transaction. Check for existence of the entity at the start of the transaction and simply return (completing the txn) if it's been deleted.
When you run a transaction, catch ConcurrentModificationException and retry the process in a loop. Now when any txn gets conflicted, it will simply retry until it succeeds.
The only bad thing about collisions here is that they slow the system down and waste effort during retries. However, you will get a throughput of at least one completed transaction per second (maybe a bit less if you have XG transactions).
Objectify4 handles the retries for you; just define your unit of work as a run() method and run it with ofy().transact(). Just make sure your work is idempotent.
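If you are not on Objectify yet, a bare-bones manual retry loop for point 3 might look like this (a sketch using the low-level Datastore API; doUnitOfWork() is a hypothetical, idempotent unit of work):

import java.util.ConcurrentModificationException;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Transaction;

public class RetryingWorker {

    private static final int MAX_RETRIES = 5;

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    public void runWithRetries() {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            Transaction txn = datastore.beginTransaction();
            try {
                doUnitOfWork(txn);   // must be idempotent, as described above
                txn.commit();
                return;              // success
            } catch (ConcurrentModificationException e) {
                // another process touched the entity group; loop and try again
            } finally {
                if (txn.isActive()) {
                    txn.rollback();
                }
            }
        }
        throw new IllegalStateException("Gave up after " + MAX_RETRIES + " attempts");
    }

    private void doUnitOfWork(Transaction txn) {
        // read, modify and put the entities that belong to this unit of work
    }
}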
The way I see it, you can either prevent the first task from updating the object because certain values have changed since the task was first launched,
or you can embed the object's values within the task request so that the second calc task will restore the object's state with consistent value and calculated members.

Batch insert using JPA/Toplink

I have a web application that receives messages through an HTTP interface, e.g.:
http://server/application?source=123&destination=234&text=hello
This request contains the ID of the sender, the ID of the recipient and the text of the message.
This message should be processed like:
finding the matching User object for both the source and the destination from the database
creating a tree of objects: a Message that contains a field for the message text and two User objects for the source and the destination
persisting this tree to a database.
The tree will be loaded by other applications that I can't touch.
I use Oracle as the backing database and JPA with Toplink for the database handling tasks. If possible, I'd stay with these.
Without much optimization I can achieve ~30 requests/sec throughput in my environment. That's not much; I'd require ~300 requests/sec. So I measured where the performance bottleneck is and found that the calls to em.persist() take most of the time. If I simply comment out that line, the throughput goes well over 1000 requests/sec.
I wrote a small test application that used simple JDBC calls to persist 1 million messages to the same database. I used batching, meaning I did 100 inserts followed by a commit, and repeated this until all the records were in the database. I measured ~500 requests/sec throughput in this scenario, which would meet my needs.
It is clear that I need to optimize insert performance here. However as I mentioned earlier I would like to keep using JPA and Toplink for this, not pure JDBC.
Do you know a way to create batch inserts with JPA and Toplink? Can you recommend any other technique for improving JPA persist performance?
ADDITIONAL INFO:
"requests/sec" means here: total number of requests / total time from beginning of test to last record written to database.
I tried to make the calls to em.persist() asynchronous by creating an in-memory queue between the servlet stuff and the persister. It helped the performance greatly. However, the queue grew really fast, and as the application will receive ~200 requests/second continuously, it is not an acceptable solution for me.
In this decoupled approach I collected requests for 100 msec and called em.persist() on all collected items before committing the transaction. The EntityManagerFactory is cached between transactions.
You should decouple from the JPA interface and use the bare TopLink API. You can probably chuck the objects you're persisting into a UnitOfWork and commit the UnitOfWork on your own schedule (sync or async). Note that one of the costs of em.persist() is the implicit clone it makes of the whole object graph. TopLink will work rather better if you uow.registerObject() your two user objects yourself, saving it the identity tests it otherwise has to do. So you'll end up with:
UnitOfWork uow = sess.acquireUnitOfWork();
for (Job job : batch) {                     // iterate over whatever your batch items are
    Thingy thingyCl = (Thingy) uow.registerObject(new Thingy());
    User user1Cl = (User) uow.registerObject(user1);
    User user2Cl = (User) uow.registerObject(user2);
    thingyCl.setUsers(user1Cl, user2Cl);
}
uow.commit();
This is very old school TopLink btw ;)
Note that batching will help a lot, because batch writing, and especially batch writing with parameter binding, will kick in, which for this simple example will probably have a very large impact on your performance.
Another thing to look at: your sequence allocation size. A lot of the time spent writing objects in TopLink is actually spent reading sequencing information from the database, especially with the small defaults (I would probably use several hundred or even more as my sequence size).
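On the sequencing point, with JPA annotations that typically means raising the allocation size on the generator, for example (entity and sequence names assumed):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class Message {

    @Id
    // with allocationSize = 500 the provider only reads the sequence once per
    // 500 inserts instead of once per row
    @SequenceGenerator(name = "msg_seq", sequenceName = "MSG_SEQ", allocationSize = 500)
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "msg_seq")
    private Long id;

    private String text;

    // source and destination User references omitted for brevity
}

Make sure the database sequence's INCREMENT BY matches the allocationSize, otherwise you can get duplicate or skipped IDs.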
What is your measure of "requests/sec"? In other words, what happens for the 31st request? What resource is being blocked? If it is the front-end/servlet/web portion, can you run em.persist() in another thread and return immediately?
Also, are you creating transactions each time? Are you creating EntityManagerFactory objects with each request?
