In my database, I have many records in a certain table that need to be processed from time to time by my Java Spring app.
There is a boolean flag on each row of that table saying whether that record is currently being processed.
What I'm looking at is deploying my Java Spring app multiple times on different servers, all accessing the same shared DB: the same app duplicated behind a load balancer, etc.
But only one Java app instance at a time may process a given DB record of that particular table.
What are the different approaches to enforce that constraint?
I can think of a single queue that would dispatch those processing tasks to the different running Java instances, making sure that the same DB record is never processed simultaneously by two instances. But that sounds quite complicated for what it is. Maybe there is something simpler? Anything else? Thanks in advance.
You can use locking strategies to enforce exclusive access to particular records in your table. There are two approaches that meet this requirement: optimistic locking and pessimistic locking. Take a look at the Hibernate docs.
Additionally, there's another issue you should think about. With the current approach, if a server crashes while it is processing a record and never completes it, that record would stay in the "incomplete" state and would not be processed by any other instance. One possible solution that comes to mind is to store the 'node id' of the server that took responsibility for processing, instead of a state flag.
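As a rough illustration of the 'node id' idea, here is a minimal in-memory sketch. The class `RecordClaims`, its method names, and the table/column names in the comment are all hypothetical; a `ConcurrentHashMap` stands in for the database, where the same effect is usually achieved with a conditional update whose affected-row count tells you whether you won the claim.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * In-memory stand-in for an atomic "claim" on a row. Against a real database the
 * same idea is typically a conditional update such as (hypothetical names):
 *   UPDATE task SET owner_node = ? WHERE id = ? AND owner_node IS NULL
 * where an affected-row count of 1 means this node won the claim.
 */
class RecordClaims {
    private final ConcurrentMap<Long, String> ownerByRecordId = new ConcurrentHashMap<>();

    /** Returns true only for the first node that claims the record. */
    boolean tryClaim(long recordId, String nodeId) {
        return ownerByRecordId.putIfAbsent(recordId, nodeId) == null;
    }

    /** Releases the record, but only if the given node is still its owner. */
    boolean release(long recordId, String nodeId) {
        return ownerByRecordId.remove(recordId, nodeId);
    }

    /** Frees every record owned by a node that is known to have crashed. */
    int releaseAllOwnedBy(String deadNodeId) {
        int released = 0;
        for (Long id : ownerByRecordId.keySet()) {
            if (ownerByRecordId.remove(id, deadNodeId)) {
                released++;
            }
        }
        return released;
    }
}
```

Because the owner is recorded rather than a bare flag, a supervisor (or a restarted node) can tell which records were left behind by a crashed instance and release them. How a node is detected as dead is outside this sketch.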
Related
Context of my question:
I use a proprietary database (the target database) and I cannot reveal its name (you probably would not know it even if I did).
Here, I usually need to update records using Java. (The number of records varies from 20,000 to 40,000.)
Each update transaction takes one or two seconds on this DB, so you can see that the total execution time runs into hours. No batch execution functions are available in this database's API. So instead of processing all the records in a single thread, I am thinking of using Java multithreading and creating a thread for every 100 records. We know that Java can run these threads in parallel.
But I want to know how the DB handles these threads when they share the same connection. I could find out by running a trial program and comparing time intervals, but I feel that may be deceiving to some extent. I know you don't have much information about the database; you can just answer this question assuming the DB is MS SQL/MySQL.
Please suggest any other Java feature I could use to make this program run faster, if not multithreading.
It is not recommended to share a single connection among multiple threads; you can read about the pitfalls of doing so here.
If you really need to use a single connection with multiple threads, then I would suggest making sure all threads start and finish successfully within one transaction. If one of them fails, you have to make sure to roll back the changes. So, first get the record count, split it into cursor ranges, and for each range start a thread that processes the records in that range. One thing to watch for: do not close the connection after each partition finishes, but only once the whole transaction is complete and committed.
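The range-splitting step can be sketched as follows. This is only an outline under assumptions: `RangePartitioner` and its methods are hypothetical names, the per-range work is a placeholder that just counts records, and no database connection or transaction handling is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RangePartitioner {
    /** Splits [0, total) into half-open [start, end) chunks of at most chunkSize records. */
    static List<int[]> ranges(int total, int chunkSize) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < total; start += chunkSize) {
            out.add(new int[] {start, Math.min(start + chunkSize, total)});
        }
        return out;
    }

    /** Runs one task per range on a fixed pool and returns how many records were handled. */
    static int processAll(int total, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (final int[] range : ranges(total, chunkSize)) {
                Callable<Integer> task = () -> {
                    // Placeholder: a real task would update records range[0]..range[1]-1 here.
                    return range[1] - range[0];
                };
                results.add(pool.submit(task));
            }
            int done = 0;
            for (Future<Integer> f : results) {
                done += f.get(); // get() rethrows a failure from the task, so the caller can roll back
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A partition failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A bounded pool is used here instead of one raw thread per 100 records, so 40,000 records do not turn into 400 simultaneous threads; the right pool size depends on how much concurrency the database actually tolerates.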
If you have an option to use Spring Framework, check out Spring Batch.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advanced technical services and features that will enable extremely high-volume and high performance batch jobs through optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.
Hope this helps.
I want to develop a program that reads data from a database and writes it to a file.
For better performance, I want to use multithreading.
The solution I plan to implement is based on these assumptions:
It is not necessary to use multiple threads to read from the database, because the concurrency would have to be managed by the DBMS anyway (the same goes for writing to the file), given that each element read from the database will be deleted in the same transaction.
Use the producer-consumer model: one thread to read the data (the main program) and another thread to write the data to the file.
For the implementation I will use the Executor framework: a thread pool (size=1) to represent the consumer thread.
Do these assumptions make for a good solution?
Does this problem require a solution based on multithreading?
it is not necessary to put multiple threads to read from the database because there is a concurrency problem to be managed by the DBMS
Ok. So you want one thread that is reading from the database.
Do these assumptions make for a good solution? Does this problem require a solution based on multithreading?
Your solution will work but, as mentioned by others, there are questions about the performance improvement (if any). Threaded programs work because they can make use of the multiple processors (or cores) on your computer. In your case, if the threads are blocked by the database or blocked by the file system, the performance improvement may be minimal, if there is any at all. If you were doing a lot of processing of the data, then having multiple threads handle the task would work well.
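For reference, the single-reader, single-writer arrangement described in the question might look roughly like this. `SingleWriterPipeline` and `copy` are hypothetical names, the rows come from any `Iterable<String>` rather than a real database cursor, and the output goes to a generic `Writer` so a `FileWriter` could be passed in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SingleWriterPipeline {
    /** The calling thread acts as the producer; a one-thread pool acts as the consumer. */
    static void copy(Iterable<String> rows, Writer out) {
        ExecutorService writerThread = Executors.newSingleThreadExecutor();
        for (String row : rows) {
            // Tasks on a single-thread executor run one at a time, in submission order.
            writerThread.execute(() -> {
                try {
                    out.write(row);
                    out.write('\n');
                } catch (IOException e) {
                    // Simplification: a real program would report this back to the producer.
                    throw new UncheckedIOException(e);
                }
            });
        }
        writerThread.shutdown(); // accept no new tasks, but finish the queued writes
        try {
            writerThread.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the writer", e);
        }
    }
}
```

The executor's internal queue is unbounded here, so if reading is much faster than writing, rows accumulate in memory; a bounded `BlockingQueue` between the two threads would be one way to apply back-pressure.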
This is more of a comment:
For your first assumption: You should post the db part on https://dba.stackexchange.com/ .
A simple search returned :
https://dba.stackexchange.com/questions/2918/about-single-threaded-versus-multithreaded-databases-performance - so you need to check if your read action is complex enough and if multithread even serves your need for db connection.
Also, your program seems to be a sequential read and write. I don't think you even need multithreading unless you want multiple writes on the same file at the same time.
You should have a look at Spring Batch, http://projects.spring.io/spring-batch/, which relates to JSR 352 specs.
This framework comes with pretty good patterns to manage ETL related operations, including multi-threaded processing, data partitioning, etc.
Here is my requirement:
A date is inserted into a DB table with each record. Two weeks
before that particular date, a separate record should be entered into a
different table.
My initial solution was to set up a scheduled SQL job, but my client insisted on it being handled through Java.
What is the best approach for this?
What are the pros and cons of using a scheduled SQL job versus Java scheduling for this task?
Ask yourself the question: to what domain does this piece of work belong? If it's required for data integrity, then it's obviously the DBMS' problem and would probably best be handled there. If it's part of the business domain rather than the data, or might require information or processing that's not available or natural to the DBMS, it's probably best made external.
I'd say, use the best tool for the job. Having stuff handled by the database using whatever features it offers is often nice. For example, a log table that keeps "snapshots" of status updates of records in another table is something I typically like to have a trigger for, taking that responsibility out of my app's hands.
But that's something that's available in practically any DBMS. There's the possibility that other databases won't offer the job scheduling capabilities you require. If it's conceivable that some day you'll be switching to a different DBMS, you'll then be forced to do it in Java anyway. That's the advantage of the Java approach: you've got the functionality independently of the database. If you're using pure JDBC with standard SQL queries, you've got a fully portable solution.
Both approaches seem valid. Consider what induces the least work and worries. If it's done in Java you'll need to make sure that process is running or scheduled. That's some external dependency. If it's in the database, you'll be sure the job is done as long as the DB is up.
Well, first off, if you want to do it in Java, you can use the Timer for a simple basic repetitive job, or Quartz for more advanced stuff.
Personally, I also think that it would be better to have the same entity (application) deal with all related database actions. In other words, if your Java app is reading from and writing to the DB, it should, for consistency, also deal with the scheduled reads and writes. And as a plus, this way you can synchronize your actions more easily: if you want to make sure that a scheduled job is running, has started, or has finished, that is a lot easier when everything is done in Java than when a different process (like the SQL scheduler) does it.
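A minimal sketch of the `Timer` option for the two-weeks-before requirement, using only the JDK. The names `TwoWeekReminder`, `triggerDate`, `isDue`, and `startDailyCheck` are hypothetical, and the database query and insert are left to the `Runnable` the caller supplies.

```java
import java.time.LocalDate;
import java.util.Timer;
import java.util.TimerTask;

class TwoWeekReminder {
    /** The day on which the separate record should be created. */
    static LocalDate triggerDate(LocalDate target) {
        return target.minusWeeks(2);
    }

    /** True once today has reached (or passed) the trigger date. */
    static boolean isDue(LocalDate target, LocalDate today) {
        return !today.isBefore(triggerDate(target));
    }

    /** Runs the given check immediately and then once every 24 hours. */
    static Timer startDailyCheck(Runnable check) {
        Timer timer = new Timer("two-week-reminder", true); // daemon thread
        long oneDayMillis = 24L * 60 * 60 * 1000;
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Assumed to load the rows for which isDue(...) holds and insert the second record.
                check.run();
            }
        }, 0L, oneDayMillis);
        return timer; // call cancel() on it to stop the schedule
    }
}
```

Because `isDue` also returns true after the trigger date, a check that was missed while the app was down is caught on the next run; the job then needs some way to remember which rows it has already handled so the record is not inserted twice.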
I plan to implement a GAE app only for my own usage.
The application will get its data using URL Fetch service, updating it every x minutes (using Scheduled tasks). Then it will serve that information to me when I request it.
I have barely started to look into GAE, but I have a main question that I am not able to clear. Can state be maintained in GAE between different requests without using jdo/jpa and the datastore?
As I am the only user, I guess I could keep the info in a servlet subclass and so avoid having to deal with the Datastore... but my concern is that, as this app will get very few requests, if it is moved to disk or whatever (I don't know yet if this has a specific name), will it lose its state?
I am not concerned about having to restart the whole app and start collecting data from scratch from time to time, that is ok.
If this is an app for your own use, and you're double-extra sure that you won't be making it multi-user, and you're not concerned about the possibility that you might be using it from two browsers at once, you can skip using sessions and use a known key for storing information in memcache.
If your reason for avoiding the datastore is concern over performance, then I strongly recommend testing that assumption. You may be pleasantly surprised.
You could use the http session to maintain state between requests, but that will use the datastore itself (although you won't have to write any code to get this behaviour).
You might also consider using the Cache API (like memcache). It's JSR 107, I think, for which Google provides an implementation. The cache is shared between instances, but it can be emptied at any time. If you're happy with that behaviour, this may be an option. Looking at your requirements, this may be the most feasible option if you don't want to write your own persistence code.
You could store data in a static field of your class or in an application-scoped object, but then the data would be lost whenever your instance spins down or requests are routed to a different instance, since your classes would be loaded afresh in the new instance.
Or you could serialize the state to the client and send it back in with each request.
The most robust option is persistence to the datastore - the JPA code is trivial. Perhaps you should reconsider?
My requirement is this: I have a server J2EE web application and a client J2EE web application. Sometimes the client can go offline. When the client comes back online, it should be able to synchronize changes in both directions. I should also be able to control which rows/tables are synchronized, based on some filters/rules. Are there any existing Java frameworks for doing this? If I need to implement it on my own, what strategies can you suggest?
One solution I have in mind is maintaining SQL logs and executing the same statements on the other side during synchronization. Do you see any problems with this strategy?
There are a number of Java libraries for data synchronizing/replication. Two that I'm aware of are daffodil and SymmetricDS. In a previous life I foolishly implemented (in Java) my own data replication process. It seems like the sort of thing that should be fairly straightforward, but if the data can be updated in multiple places simultaneously, it's hellishly complicated. I strongly recommend you use one of the aforementioned projects to try and bypass dealing with this complexity yourself.
The biggest issue with synchronization is when the user edits something offline and it is edited online at the same time. You need to merge the two changed pieces of data, or provide UI that lets the user say which version is correct. If you eliminate the possibility of both being edited at the same time, then you don't have to solve this sticky problem.
The usual method is to add a 'modified' field to all tables and compare the client's modified date for a given row against the server's modified date. If they don't match, then you replace the server's data.
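One way to combine the 'modified' comparison with the conflict case described above is sketched below. It goes slightly beyond the text by assuming each side also remembers the time of the last successful sync; `SyncDecision`, `Action`, and `decide` are hypothetical names.

```java
import java.time.Instant;

class SyncDecision {
    enum Action { NONE, PUSH_TO_SERVER, PULL_FROM_SERVER, CONFLICT }

    /**
     * Compares the 'modified' timestamps of one row on both sides against the
     * time of the last successful sync (an assumption of this sketch).
     */
    static Action decide(Instant clientModified, Instant serverModified, Instant lastSync) {
        boolean clientChanged = clientModified.isAfter(lastSync);
        boolean serverChanged = serverModified.isAfter(lastSync);
        if (clientChanged && serverChanged) {
            return Action.CONFLICT; // edited on both sides: merge or ask the user
        }
        if (clientChanged) {
            return Action.PUSH_TO_SERVER; // replace the server's data
        }
        if (serverChanged) {
            return Action.PULL_FROM_SERVER;
        }
        return Action.NONE;
    }
}
```

This relies on the client and server clocks being reasonably comparable; a per-row version counter would be an alternative that avoids clock skew.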
Be careful with autogenerated keys - you need to make sure your data integrity is maintained when you copy from the client to the server. Strictly running the SQL statements again on the server could put you in a situation where the autogenerated key has changed, and suddenly your foreign keys are pointing to different records than you intended.
Often when importing data from another source, you keep track of the primary key from the foreign source as well as your own personal primary key. This makes determining the changes and differences between the data sets easier for difficult synchronization situations.
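The key-tracking idea can be illustrated with a small in-memory sketch. `KeyRemapper` and its methods are hypothetical, and a simple counter stands in for the server's autogenerated keys.

```java
import java.util.HashMap;
import java.util.Map;

class KeyRemapper {
    // Remembers which server key was assigned to each client-side primary key.
    private final Map<Long, Long> serverIdByClientId = new HashMap<>();
    private long nextServerId; // stand-in for the server's autogenerated key sequence

    KeyRemapper(long firstFreeServerId) {
        this.nextServerId = firstFreeServerId;
    }

    /** Imports a client row and returns its server key; re-importing returns the same key. */
    long importRow(long clientId) {
        Long existing = serverIdByClientId.get(clientId);
        if (existing != null) {
            return existing;
        }
        long serverId = nextServerId++;
        serverIdByClientId.put(clientId, serverId);
        return serverId;
    }

    /** Rewrites a client-side foreign key so it points at the matching server row. */
    long translateForeignKey(long clientParentId) {
        Long serverId = serverIdByClientId.get(clientParentId);
        if (serverId == null) {
            throw new IllegalStateException("Parent row not imported yet: " + clientParentId);
        }
        return serverId;
    }
}
```

Parent rows have to be imported before the child rows that reference them, which is why the translation fails loudly when a mapping is missing rather than guessing.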
Your synchronizer needs to identify when data can just be updated and when a human being needs to mediate a potential conflict. I have written a paper that explains how to do this using logging and algebraic laws.
What is best suited as the client-side data store in your application? You can choose from an embedded database like SQLite, a message queue, some object store, or (if none of these can be used since it is a web application) files/documents saved on the client using HTML5 storage APIs such as Web SQL Database, IndexedDB, or localStorage.
Check the paper Gold Rush: Mobile Transaction Middleware with Java-Object Replication. Microsoft's documentation of occasionally connected systems describes two approaches: service-oriented or message-oriented, and data-oriented. Gold Rush takes the former approach. The latter approach uses database merge replication.