I am using Java's ExecutorService and want to understand the design perspective.
If something goes wrong in one of the batches, what is the best approach to handle it?
I am creating a fixed thread pool as follows:
ExecutorService pool = Executors.newFixedThreadPool(10);
I am also using invokeAll() to invoke all the Callables, which returns a list of Future objects.
Here is my scenario -
I have 1000 records coming from an XML file that I want to save into a DB.
I created 10 batches, each containing 100 records.
The batches started processing (say batch1, batch2, batch3 ... batch10), and let's say one batch (batch7) hit an error for a particular record while parsing it from the XML, so it could not be saved into the DB.
So my question is: how can I handle this situation?
How can I get/store the failed batch information (batch7 above)?
I mean, if there is an error in any batch, should I stop all the other batches?
Or where can I store information about the failed batch, and how can I pick it up for further processing once the error is corrected?
The handler that processes the records should have a variable that stores the batch number.
Ideally, the handler should have finite retry logic for a small set of database errors.
Once the retry count is exhausted, human intervention is warranted: the handler should exit by throwing an exception that carries the batch number. The executor should then normally call shutdown(). If your logic demands that processing stop immediately, call shutdownNow() instead. Ideally, your design should be resilient to such failures and let the other batches continue their work even if one fails. Hope this helps.
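A minimal sketch of the approach described above, assuming a hypothetical BatchWork callback in place of the real parse-and-save logic and a hypothetical BatchFailedException that carries the batch number. Here every exception is retried; a real handler would retry only the database errors it considers transient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchRunner {

    /** Hypothetical exception that carries the number of the batch that failed. */
    static class BatchFailedException extends Exception {
        final int batchNumber;

        BatchFailedException(int batchNumber, Throwable cause) {
            super("Batch " + batchNumber + " failed", cause);
            this.batchNumber = batchNumber;
        }
    }

    /** Hypothetical stand-in for "parse 100 records and save them to the DB". */
    interface BatchWork {
        void process(int batchNumber) throws Exception;
    }

    /** Wraps one batch: remembers its number and retries a bounded number of times. */
    static Callable<Integer> task(int batchNumber, int maxAttempts, BatchWork work) {
        return () -> {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    work.process(batchNumber);
                    return batchNumber;              // success
                } catch (Exception e) {
                    last = e;                        // remember the error and try again
                }
            }
            throw new BatchFailedException(batchNumber, last);
        };
    }

    /** Runs all batches, lets the healthy ones finish, and returns the failed batch numbers. */
    static List<Integer> runAll(int batchCount, int maxAttempts, BatchWork work) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 1; i <= batchCount; i++) {
                tasks.add(task(i, maxAttempts, work));
            }
            List<Integer> failed = new ArrayList<>();
            // invokeAll blocks until every task has either returned or thrown.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                try {
                    future.get();
                } catch (ExecutionException e) {
                    if (e.getCause() instanceof BatchFailedException) {
                        failed.add(((BatchFailedException) e.getCause()).batchNumber);
                    }
                }
            }
            return failed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batches", e);
        } finally {
            pool.shutdown();   // shutdownNow() would instead try to abort in-flight batches
        }
    }

    public static void main(String[] args) {
        List<Integer> failed = runAll(10, 3, n -> {
            if (n == 7) {
                throw new IllegalStateException("bad record in batch " + n);
            }
        });
        System.out.println("Failed batches: " + failed);
    }
}
```

The returned list of batch numbers is what you would persist (to a file or a table) so the failed batches can be re-run once the underlying problem is fixed.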
You should use CompletableFuture to do this.
Use CompletableFuture.runAsync() to start a task asynchronously; it returns a future. On this future, you can use thenAccept(..) or thenRun(..) to do something when the task completes.
There is also a method, exceptionally(..), to do something when an exception is thrown.
By default, it uses a default executor to run the task asynchronously, but you can supply your own if necessary.
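As a rough illustration of these calls, the sketch below runs each batch with runAsync on a caller-supplied pool, logs success with thenRun, and records failures with exceptionally. The saveBatch callback and the class name are assumptions made for the example, not part of the question's code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class AsyncBatches {

    /** Runs each batch asynchronously and returns the sorted numbers of the batches that failed. */
    static List<Integer> run(int batchCount, IntConsumer saveBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        ConcurrentLinkedQueue<Integer> failed = new ConcurrentLinkedQueue<>();
        try {
            CompletableFuture<?>[] futures = IntStream.rangeClosed(1, batchCount)
                .mapToObj(n -> CompletableFuture
                    .runAsync(() -> saveBatch.accept(n), pool)      // own executor instead of the default
                    .thenRun(() -> System.out.println("batch " + n + " saved"))
                    .exceptionally(ex -> {                           // runs only if the batch threw
                        failed.add(n);
                        return null;
                    }))
                .toArray(CompletableFuture[]::new);
            // Wait for every batch; failures were already absorbed by exceptionally().
            CompletableFuture.allOf(futures).join();
        } finally {
            pool.shutdown();
        }
        return failed.stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> failed = run(10, n -> {
            if (n == 7) {
                throw new IllegalStateException("bad record in batch " + n);
            }
        });
        System.out.println("Failed batches: " + failed);
    }
}
```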
So my question is: how can I handle this situation?
It all depends on your requirements.
How can I get/store the failed batch information (batch7 above)?
You can store it either in a file or in a database.
I mean, if there is an error in any batch, should I stop all the other batches?
This depends on your business use case. If you are required to stop batch processing after even a single batch failure, you have to stop the remaining batches. Otherwise you can continue with the next set of batches.
Or where can I store information about the failed batch, and how can I pick it up for further processing once the error is corrected?
This also depends on your requirements and design. You may have to inform the source about the problematic XML file so that they can correct it and send it back to you. Once you receive the new copy, you have to submit the new file for processing. This can be manual or automated, depending on your design.
Currently, I am using a Spark consumer built in Java to read records (JSON) published by a Kafka producer and store them in HDFS. Say my record has the following attributes: id, name, company, published date. At the moment, if one of the attributes is missing, the program throws a RuntimeException with a log message saying that an attribute is missing. The problem is that the exception stops the whole Spark job. I would like to handle those bad records so that, instead of stopping the whole Spark job, the program drops and logs them rather than throwing an exception.
The answer is going to be opinion-based. Here is what I would do:
Don't log rejections in a log file, because it could get big and you may need to reprocess them. Instead, create another dataset for rejected records, with the reason for rejection. Your process would produce two datasets: good records and rejected ones.
Exceptions shouldn't be used for control flow, though it is possible. I would use a predicate/filter/IF-condition that checks the data and rejects the records that do not meet it.
If you do use exceptions, scope them around the processing of an individual record, not the entire job. It is better to avoid this idea, though.
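The two-dataset idea can be sketched in plain Java as follows. This is not Spark code: the record is modelled as a Map, and the field name publishedDate and the classes Rejected and Result are assumptions made for the illustration. In a Spark job the same predicate would typically be applied as a filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class RecordSplitter {

    // Attributes every record must carry (names assumed from the question).
    static final List<String> REQUIRED = List.of("id", "name", "company", "publishedDate");

    /** A rejected record together with the reason it was rejected. */
    static class Rejected {
        final Map<String, String> record;
        final String reason;

        Rejected(Map<String, String> record, String reason) {
            this.record = record;
            this.reason = reason;
        }
    }

    /** The two output datasets: good records and rejected ones. */
    static class Result {
        final List<Map<String, String>> good = new ArrayList<>();
        final List<Rejected> rejected = new ArrayList<>();
    }

    /** Predicate-style check: empty means the record is valid, otherwise the reason it is not. */
    static Optional<String> rejectionReason(Map<String, String> record) {
        for (String field : REQUIRED) {
            String value = record.get(field);
            if (value == null || value.isBlank()) {
                return Optional.of("missing attribute: " + field);
            }
        }
        return Optional.empty();
    }

    /** Splits the input without throwing: bad records are collected, not fatal. */
    static Result split(List<Map<String, String>> records) {
        Result result = new Result();
        for (Map<String, String> record : records) {
            Optional<String> reason = rejectionReason(record);
            if (reason.isPresent()) {
                result.rejected.add(new Rejected(record, reason.get()));
            } else {
                result.good.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = split(List.of(
            Map.of("id", "1", "name", "a", "company", "c", "publishedDate", "2020-01-01"),
            Map.of("id", "2", "name", "b")));
        System.out.println("good: " + result.good.size());
        for (Rejected r : result.rejected) {
            System.out.println("rejected: " + r.reason);
        }
    }
}
```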
My program needs to add data to two lists in Redis as a transaction. The data should be consistent in both lists. If there is an exception or system failure and the program has only added data to one list, the system should be able to recover and roll back. But according to the Redis docs, it doesn't support rollback. How can I implement this? The language I use is Java.
If you need transaction rollback, I recommend using something other than Redis. Redis transactions are not the same as those of other datastores. Even MULTI/EXEC doesn't do what you want, first of all because there is no rollback. If you want rollback, you will have to pull down both lists so you can restore them, and hope that between your error condition and the "rollback" no other client has modified either list. Doing this in a sane and reliable way is neither trivial nor simple. It would also probably not be a good question for SO, as it would be very broad and not Redis-specific.
Now as to why EXEC doesn't do what one might think. In your proposed scenario MULTI/EXEC only handles the cases of:
You set up WATCHes to ensure no other changes happened
Your client dies before issuing EXEC
Redis is out of memory
It is entirely possible to get errors as a result of issuing the EXEC command. When you issue EXEC, Redis executes all commands in the queue and returns a list of their replies, including any errors. It does not handle the case of add-to-list-1 succeeding and add-to-list-2 failing; you would still have your two lists out of sync. When you issue, say, an LPUSH after issuing MULTI, you will always get back QUEUED unless you:
a) previously added a watch and something in that list changed or
b) Redis returns an OOM condition in response to a queued push command
DISCARD does not work the way some might think. DISCARD is used instead of EXEC, not as a rollback mechanism. Once you issue EXEC, your transaction is completed. Redis does not have any rollback mechanism at all; that isn't what Redis transactions are about.
The key to understanding what Redis calls transactions is to realize they are essentially a command queue at the client connection level. They are not a database state machine.
Redis transactions are different. They guarantee two things:
All or none of the commands are executed
The commands are executed sequentially and without interruption
Having said that, if you have control over your code and know when the failure happens (for example by catching the exception), you can achieve your requirement in this way:
MULTI -> Start transaction
LPUSH queue1 1 -> pushing in queue 1
LPUSH queue2 1 -> pushing in queue 2
EXEC/DISCARD
In the 4th step, issue EXEC if there was no error. If you encounter an error or exception before EXEC and want to abandon the queued commands, issue DISCARD instead.
Hope that makes sense.
My data model is based on time series (it inserts feeds from various sources into Cassandra CFs). Can anyone suggest how to do inserts with multithreading? Is executing a query with the executeAsync method similar to multithreading? Is there any property in cassandra.yaml that I need to set to achieve multithreading, or are there any other prerequisites?
The driver is safe for multi-threaded use. What you will typically do is build your Cluster and get a Session instance during application startup, and then share the Session among all threads.
How you handle multi-threading is specific to your code. I don't know SQS either, but I imagine you'd either have multiple consumers that poll from the queue and process the messages themselves, or maybe dispatch the messages to a pool of workers.
Regarding executeAsync, the returned ResultSetFuture implements Guava's ListenableFuture, so you can register a success callback with addListener. But you'll have to provide an Executor to run that callback on (I don't recommend MoreExecutors#sameThreadExecutor as mentioned in the Javadoc, because your callback would end up running on one of the driver's I/O threads).
As mentioned by Carlo, a simple approach is to use the synchronous execute, and have your worker block until it gets a response from Cassandra, and then acknowledge the message.
executeAsync() submits the statement without blocking and immediately returns control to the caller; a Future<ResultSet> will hold your result. With this approach you won't know whether an exception occurred until you inspect the Future.
In Cassandra you don't have to set anything. Just keep the number of threads in your application under control, and initialize the Java driver properly by providing a PoolingOptions object that matches your needs.
HTH, Carlo
If you are executing the query in a multithreaded environment, make sure you wait for executeAsync(statement) to complete.
session.executeAsync(statement) returns immediately; it does not guarantee that the query is valid or was submitted successfully. So if you're using a thread pool, always use:
ResultSetFuture future = session.executeAsync(statement);
future.getUninterruptibly();
This blocks until the query has completed, so pending requests do not pile up in memory.
I am working on a Java project using Spring. The database is Oracle.
We have a message listener configured in the container, attached to a remote queue. These are the steps we perform once onMessage is triggered:
Parse the message.
Insert the message in the database.
Based on the content of the message, do some additional processing involving file processing, DB inserts/updates, etc.
If a message received from the queue is good but we were unable to process it because of some issue on our side, we have no way to reprocess it after waiting for some time (assuming the issue that triggered the error gets resolved).
Following is the new design proposed.
1. Parse the message
2. Insert the message in the database with a flag, say "false". [The flag only gets changed when the message is successfully processed.]
3. Add a new process that queries the database for records flagged as "false" [one at a time], processes each one, and updates the flag to "true". If the processing fails, it retries the same record a configurable number of times. The process can die if there are no more records to process or if it has exhausted the retry count.
Please suggest a reasonable design that processes the message as early as possible after detecting a record flagged as "false":
Trigger a Java process using a database trigger? [The DBA is against it.]
Is there a way to trigger the process in the onMessage method after the database insert is done, without blocking the retrieval of the next message?
Schedule a job that polls the database at regular intervals?
This can be done in Spring with the @Async annotation, which lets you launch a task asynchronously after the insert completes.
This means the thread that made the insert does not block while the @Async operation runs; the call returns immediately.
Depending on the task executor configured, the @Async method is executed in a separate thread, which is what you need in this case. I would suggest starting with SimpleAsyncTaskExecutor; see here for the different task executors available.
Check also this Spring tutorial for further info.
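To show the shape of this non-blocking hand-off without depending on Spring, here is a plain-Java analogue that uses an ExecutorService where Spring would use @Async and a task executor. It is only a sketch: the in-memory set stands in for the rows whose flag is still "false", and the processor predicate and class name are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

class MessageHandoff {

    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    // Stands in for rows whose flag is still "false" (in-memory substitute for the table).
    private final Set<String> unprocessed = ConcurrentHashMap.newKeySet();
    private final Predicate<String> processor;   // returns true when processing succeeded
    private final int maxAttempts;

    MessageHandoff(Predicate<String> processor, int maxAttempts) {
        this.processor = processor;
        this.maxAttempts = maxAttempts;
    }

    /** Called by the listener thread; returns as soon as the work is queued. */
    void onMessage(String message) {
        unprocessed.add(message);                         // "insert with flag = false"
        worker.submit(() -> processWithRetry(message));   // hand off, do not wait
    }

    private void processWithRetry(String message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (processor.test(message)) {
                    unprocessed.remove(message);          // "update flag to true"
                    return;
                }
            } catch (RuntimeException e) {
                // fall through and retry
            }
        }
        // Retries exhausted: the message stays flagged for a later run.
    }

    /** Stops accepting work, waits for queued messages, and returns what is still unprocessed. */
    Set<String> drain(long timeoutSeconds) {
        worker.shutdown();
        try {
            if (!worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                throw new IllegalStateException("worker did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new HashSet<>(unprocessed);
    }

    public static void main(String[] args) {
        MessageHandoff handoff = new MessageHandoff(m -> !m.startsWith("bad"), 3);
        handoff.onMessage("order-1");
        handoff.onMessage("bad-order-2");
        System.out.println("Still unprocessed: " + handoff.drain(5));
    }
}
```

Messages left in the unprocessed set correspond to rows that a later scheduled run would pick up again.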
Since you are already using Spring Integration, why not just send the enhanced message to a new channel and process it there? If the channel is a QueueChannel, the processing will be asynchronous. There are retry features available as well.
I have a task thread running in each of two separate Tomcat instances.
The task threads concurrently read (using SELECT) the TASKS table with a certain WHERE condition and then do some processing.
The issue is that sometimes both threads pick the same task, so the task is executed twice.
My question is: how do I keep both threads from reading the same set of data from the TASKS table?
This happens because the DAO function that accesses the database is not synchronized. Make it synchronized, and I think your problem will be solved.
If the TASKS table you mention is a database table then I would use Transaction isolation.
As a suggestion: within a transaction, set an attribute of the TASKS table to some uniquely identifiable value if it is not already set, then commit the transaction. If all is OK, the task has been claimed by that thread.
I haven't come across this use case, so treat my suggestion with caution.
I think you should look at how an enterprise job scheduler, for example Quartz, handles this.
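The claim-by-conditional-write idea can be modelled in memory as below. This is only an illustration of the pattern: a ConcurrentHashMap plays the role of a hypothetical "claimed by" column, and putIfAbsent plays the role of the "set if not set, then commit" step. With two separate Tomcat instances the check-and-set has to happen in the database itself, typically as a single conditional UPDATE whose affected-row count tells the worker whether it won the task.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class TaskClaim {

    // Models the hypothetical "claimed by" column: task id -> worker id. Absent means unclaimed.
    private final ConcurrentMap<Integer, String> claimedBy = new ConcurrentHashMap<>();

    /** Atomically claims the task; only the first caller for a given id gets true. */
    boolean tryClaim(int taskId, String workerId) {
        return claimedBy.putIfAbsent(taskId, workerId) == null;
    }

    /** Two workers race over the same tasks; returns how many executions happened in total. */
    static int runTwoWorkers(int taskCount) {
        TaskClaim table = new TaskClaim();
        AtomicInteger executions = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (String worker : List.of("tomcat-1", "tomcat-2")) {
            pool.submit(() -> {
                for (int id = 0; id < taskCount; id++) {
                    if (table.tryClaim(id, worker)) {
                        executions.incrementAndGet();   // "process the task"
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return executions.get();
    }

    public static void main(String[] args) {
        // Each task is executed exactly once even though both workers try every task.
        System.out.println("executions: " + runTwoWorkers(1000));
    }
}
```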
For your use case there is a better tool for the job - and that's messaging. You are persisting items that need to be worked on, and then attempting to synchronise access between workers. There are a number of issues that you would need to resolve in making this work - in general updating a table and selecting from it should not be mixed (it locks), so storing state there doesn't work; neither would synchronization in your Java code, as that wouldn't survive a server restart.
Using the JMS API with a message broker like ActiveMQ, you would publish a message to a queue. This message would contain the details of the task to be executed. The message broker would persist this somewhere (either in its own message store, or a database). Worker threads would then subscribe to the queue on the message broker, and each message would only be handed off to one of them. This is quite a powerful model, as you can have hundreds of message consumers all acting on tasks so it scales nicely. You can also make this as resilient as it needs to be, so tasks can survive both Tomcat and broker restarts.
Whether the database can provide graceful management of this will depend largely on whether it is using strict two-phase locking (S2PL) or multi-version concurrency control (MVCC) techniques to manage concurrency. Under MVCC reads don't block writes, and vice versa, so it is very possible to manage this with relatively simple logic. Under S2PL you would spend too much time blocking for the database to be a good mechanism for managing this, so you would probably want to look at external mechanisms. Of course, an external mechanism can work regardless of the database, it's just not really necessary with MVCC.
Databases using MVCC are PostgreSQL, Oracle, MS SQL Server (in certain configurations), InnoDB (except at the SERIALIZABLE isolation level), and probably many others. (These are the ones I know of off-hand.)
I didn't pick up any clues in the question as to which database product you are using, but if it is PostgreSQL you might want to consider using advisory locks. http://www.postgresql.org/docs/current/interactive/explicit-locking.html#ADVISORY-LOCKS I suspect many of the other products have some similar mechanism.
I think you need a column that stores the last-modified date of each row. Your threads can then read the same set of data using the same modified-date limit.
Edit:
I did not see "not to read".
In this case you need another table, TaskExecutor (taskId, executorId). When a thread runs a task, it inserts a row into TaskExecutor; when another thread starts, it first checks whether the task is already executing (SELECT ... FROM TaskExecutor WHERE taskId = ...).
You also need to take care of the isolation level for the transactions.