I have a list of primary keys (e.g. empIds) and I want to fetch the employee information for each empId from the database. More precisely, I want to fetch the data from different databases, based on the type of empId, using multiple threads.
Currently I fetch the first employee's information and save it into a Java bean, then fetch the second employee and save it into a bean, and so on, finally adding all of these beans to an ArrayList. Now I want to fetch the data from the databases in parallel, i.e. fetch each employee's information concurrently and save it into a bean.
Basically, I'm looking for parallel processing rather than sequential processing, to improve performance.
I don't think you're looking for parallelism in this case. You really are looking for a single query that returns all employees whose id is in the collection of ids you have. One database connection, one thread, one query, and one result set.
If you are using Hibernate, this is easy with the Criteria API: use Restrictions.in on employeeId and pass it the collection of ids. The generated query will be something like select a, b, c, ..., n from Employee where employee_id in (1, 2, 3, ..., m)
If you are using plain JDBC, you can achieve the same with a native query; you will just need to change the ResultSet parsing, since you will now get a collection back.
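For the plain-JDBC route, one wrinkle is that the IN clause needs one ? placeholder per id. A minimal sketch of building such a query (the table and column names here are illustrative, not from the original code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class InClauseDemo {
    // Build "select ... where employee_id in (?,?,?)" with one placeholder per id
    static String buildQuery(List<Integer> empIds) {
        String placeholders = empIds.stream()
                .map(id -> "?")
                .collect(Collectors.joining(","));
        return "select emp_id, first_name, salary from Employee"
                + " where employee_id in (" + placeholders + ")";
    }

    public static void main(String[] args) {
        String sql = buildQuery(Arrays.asList(1, 2, 3));
        System.out.println(sql);
        // With a real connection you would then bind each id:
        // PreparedStatement ps = conn.prepareStatement(sql);
        // for (int i = 0; i < empIds.size(); i++) ps.setInt(i + 1, empIds.get(i));
    }
}
```

Binding each id through setInt (rather than concatenating values into the SQL) keeps the query safe from SQL injection.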
You can create a Callable task that fetches the employee information and returns an ArrayList from that Callable (thread).
You can then submit the tasks to an ExecutorService and keep the Future handles so you can loop back over the results.
// pseudocode: EmployeeInfoTask is a Callable<ArrayList<Employee>>
Future<ArrayList<Employee>> fut = executor.submit(new EmployeeInfoTask(empIds));
for (Employee result : fut.get()) { // get() blocks until the task is done
    // print result
}
See Executor, Callable
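A runnable sketch of that pattern, with the database lookup stubbed out (the Employee bean and the id batches are placeholders, not from the original code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EmployeeFetchDemo {

    // Stand-in for the real Employee bean
    static class Employee {
        final int id;
        Employee(int id) { this.id = id; }
    }

    // The Callable from the answer: one task fetches one batch of ids
    static class EmployeeInfoTask implements Callable<List<Employee>> {
        private final List<Integer> empIds;
        EmployeeInfoTask(List<Integer> empIds) { this.empIds = empIds; }
        public List<Employee> call() {
            List<Employee> result = new ArrayList<>();
            for (int id : empIds) {
                result.add(new Employee(id)); // replace with a real JDBC lookup
            }
            return result;
        }
    }

    static List<Employee> fetchAll(List<List<Integer>> idBatches) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<Employee>>> futures = new ArrayList<>();
            for (List<Integer> batch : idBatches) {
                futures.add(executor.submit(new EmployeeInfoTask(batch)));
            }
            List<Employee> all = new ArrayList<>();
            for (Future<List<Employee>> fut : futures) {
                all.addAll(fut.get()); // blocks until that task is done
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Employee> all = fetchAll(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)));
        System.out.println(all.size());
    }
}
```

Note that each task would need its own Connection (or one drawn from a pool), since a JDBC connection should not be shared between threads.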
EDIT - for Java 1.4
In this case you can still make the database calls in different threads, but you will need each thread to write to a shared employee collection. Don't forget to synchronize access to this collection.
You will also need to join() all the threads you have spawned so that you know when they are all done.
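A sketch of that pre-java.util.concurrent approach: raw threads writing into a synchronized shared list, then join(). (Written in modern syntax for brevity; on a real 1.4 JVM you would use index loops instead of the enhanced for, and the per-id DB fetch is stubbed out.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LegacyThreadDemo {

    static List<Integer> fetchAll(int[][] batches) {
        // Shared collection; synchronized so all worker threads may write to it
        final List<Integer> employees = Collections.synchronizedList(new ArrayList<Integer>());
        List<Thread> threads = new ArrayList<Thread>();
        for (final int[] batch : batches) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    for (int id : batch) {
                        employees.add(id); // stands in for a real per-id DB fetch
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // wait for every worker; after this, 'employees' is complete
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return employees;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(new int[][] { {1, 2}, {3, 4}, {5} }).size());
    }
}
```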
Related
Interview question
Say we have an Employee table with 2 million records and we need to cut each employee's salary by 10% (i.e. do some processing), then save the results back. How can you do it efficiently?
I told him we could use the executor framework to create multiple threads which fetch values from the table; then we can process them and save them to a list.
He then asked how I would check whether a record has already been processed or not, and there I was clueless.
I'm not even sure whether my approach is any good.
Please help.
One thing you could do is use a producer/consumer model, where one thread feeds the others the records to update. That way you don't have to worry as much about duplicate processing.
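A minimal sketch of that model with a BlockingQueue: a single producer enqueues each record id exactly once, so no worker can pick up the same record twice. The "processing" is stubbed to a counter here; in the real job it would be the salary update.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerDemo {

    static int process(int nRecords, int nWorkers) {
        final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        final Integer POISON = -1; // sentinel; real record ids assumed non-negative
        final AtomicInteger processed = new AtomicInteger();

        ExecutorService workers = Executors.newFixedThreadPool(nWorkers);
        for (int i = 0; i < nWorkers; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        Integer id = queue.take();
                        if (id.equals(POISON)) break; // no more work
                        processed.incrementAndGet();  // stands in for the real salary update
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            // A single producer hands out each record id exactly once
            for (int id = 0; id < nRecords; id++) queue.put(id);
            // One poison pill per worker signals the end of the stream
            for (int i = 0; i < nWorkers; i++) queue.put(POISON);
            workers.shutdown();
            workers.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(process(1000, 4));
    }
}
```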
The best approach given the question as stated is to use pure SQL, something like:
update employees set
salary = salary * .9
It is very hard to imagine needing to do something to employee data that SQL could not handle.
If by some quirk of bad design you really needed to do something to employee type data that SQL absolutely could not do, then you would open a cursor to the rowset and iterate through it, making the update synchronously so you only do one pass over the data.
In pseudo code:
cursor = forUpdate("select * from employees for update")
while (cursor.next()) {
    cursor.salary = cursor.salary * .9
}
This is the simplest and likely fastest executing approach.
---
Regarding logging
It's only 2M rows, which is a "small" quantity, so most databases could handle it in a single transaction. If not, add a where clause, e.g. where id between <start> and <end>, to chunk the process into loggable amounts if using the shell-script approach.
If using the code approach, most databases allow you to commit while holding the cursor open, so just commit every 10K rows or so.
Regarding locking
Similar aspects to logging. All rows touched by such a query are locked for the duration of the transaction. Since it could take a while to run, pick a quiet time to run it. If locking is really a big deal, chunk it up, but realise that some locking is unavoidable.
I would add a state column to the table. By default this column would be set to "Not Processed". When a thread starts processing an employee it changes the state to "Processing", and when finished it finally switches it to "Processed".
Having three states like this also lets you use the column as a lock, preventing the processing from happening twice.
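The claim step maps naturally onto an atomic compare-and-set. In SQL it would be something like UPDATE employee SET state = 'Processing' WHERE id = ? AND state = 'Not Processed', where an update count of 1 means this thread won the record. A sketch of the same idea with an in-memory map standing in for the table (the map and states are illustrative, not a real schema):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StateColumnDemo {
    // replace(key, expected, new) is an atomic compare-and-set: it succeeds
    // only for the one caller that sees the row still in "Not Processed".
    static boolean claim(ConcurrentMap<Integer, String> states, int id) {
        return states.replace(id, "Not Processed", "Processing");
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, String> states = new ConcurrentHashMap<>();
        states.put(42, "Not Processed");
        System.out.println(claim(states, 42)); // true: this thread owns record 42
        System.out.println(claim(states, 42)); // false: already being processed
    }
}
```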
I use executeBatch() with JDBC to insert multiple rows, and I want to get the ids of the inserted rows for another insert. I use this code for that purpose:
insertInternalStatement = dbConnection.prepareStatement(INSERT_RECORD, generatedColumns);
for (Foo foo : foosHashSet) {
    insertInternalStatement.setInt(1, foo.getMe());
    insertInternalStatement.setInt(2, foo.getMe2());
    // ...
    insertInternalStatement.addBatch();
}
insertInternalStatement.executeBatch();
// now get inserted ids
try (ResultSet generatedKeys = insertInternalStatement.getGeneratedKeys()) {
Iterator<Foo> fooIterator= foosHashSet.iterator();
while (generatedKeys.next() && fooIterator.hasNext()) {
fooIterator.next().setId(generatedKeys.getLong(1));
}
}
It works fine and the ids are returned. My questions are:
If I iterate over getGeneratedKeys() and foosHashSet, will the ids be returned in the same order, so that each id returned from the database belongs to the corresponding Foo instance?
What happens when I use multiple threads and the code above runs in several threads simultaneously?
Is there any other solution for this? I have two tables, foo1 and foo2, and I want to first insert the foo1 records and then use their primary keys as foreign keys in foo2.
Given that support for getGeneratedKeys with batch execution is not defined in the JDBC specification, the behavior depends on the driver used. I would expect any driver that supports generated keys for batch execution to return the ids in the order they were added to the batch.
However, the fact that you are using a Set is problematic. Iteration order for most sets is not defined and could change between iterations (usually only after modification, but in theory you can't assume anything about the order). You need something with a guaranteed order, e.g. a List or maybe a LinkedHashSet.
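To illustrate the difference: LinkedHashSet guarantees iteration in insertion order, which is exactly what pairing generated keys with the batched objects relies on (a plain HashSet gives no such guarantee).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InsertionOrderDemo {
    public static void main(String[] args) {
        List<Integer> inserted = Arrays.asList(5, 1, 9, 3);
        Set<Integer> linked = new LinkedHashSet<>(inserted);
        // LinkedHashSet iterates in the order the elements were added,
        // so element i can safely be paired with generated key i.
        System.out.println(new ArrayList<>(linked).equals(inserted)); // true
    }
}
```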
Applying multi-threading here would probably be a bad idea: you should only use a JDBC connection from a single-thread at a time. Accounting for multi-threading would either require correct locking, or requiring you to split up the workload so it can use separate connections. Whether that would improve or worsen performance is hard to say.
You should be able to iterate through the generated keys without problems; they are returned in the order the rows were inserted.
I don't think there is any problem with adding threads here. The only thing I'm fairly sure of is that you would not be able to control the order in which the ids are inserted into the two tables without some extra code complexity.
You could store all the ids from the first insert in a collection and, after all threads/iterations have finished, insert them into the second table.
The iteration order is the same as long as fooHashSet is not altered.
One could use a LinkedHashSet, which yields the items in insertion order. Especially when nothing is removed or overwritten, that would be fine.
Concurrent access, however, would be problematic.
Use a LinkedHashSet without removal, only adding new items, and additionally wrap it in Collections.synchronizedSet. For set alterations one would need a Semaphore or similar, as synchronizing such a large code block is a no-go.
An even better-performing solution might be to make a local copy:
List<Foo> list = new ArrayList<>(fooHashSet);
However, this is still a somewhat unsatisfying solution: a batch for multiple inserts, and then several other updates/inserts per insert.
Moving to JPA instead of JDBC would somewhat alleviate the situation.
After some experience, though, I would ask whether a database is still the correct tool (hammer) at this point. If the data is a graph or a hierarchical structure, then storing the entire structure as XML (with JAXB) in a single database table could be the better solution: faster, easier development, verifiable data.
You would use the database for the main data, and the XML for an edited/processed document.
Yes, as per the Javadoc of executeBatch():
Submits a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts. The int elements of the array that is returned are ordered to correspond to the commands in the batch, which are ordered according to the order in which they were added to the batch. The elements in the array returned by the method executeBatch may be one of the following:
A number greater than or equal to zero -- indicates that the command was processed successfully and is an update count giving the number of rows in the database that was affected by the command's execution
A value of SUCCESS_NO_INFO -- indicates that the command was processed successfully but that the number of rows affected is unknown
If one of the commands in a batch update fails to execute properly, this method throws a BatchUpdateException, and a JDBC driver may or may not continue to process the remaining commands in the batch. However, the driver's behavior must be consistent with a particular DBMS, either always continuing to process commands or never continuing to process commands. If the driver continues processing after a failure, the array returned by the method BatchUpdateException.getUpdateCounts will contain as many elements as there are commands in the batch, and at least one of the elements will be the following:
A value of EXECUTE_FAILED -- indicates that the command failed to execute successfully and occurs only if a driver continues to process commands after a command fails
The possible implementations and return values have been modified in the Java 2 SDK, Standard Edition, version 1.3 to accommodate the option of continuing to process commands in a batch update after a BatchUpdateException object has been thrown.
I need to insert employees into an Employee table. What I want is to avoid duplicate inserts, i.e. if two threads try to insert the same employee at the same time, the last transaction should fail. For example, if first_name and hire_date are the same for two employees (the same employee coming from two threads), the last transaction should fail.
Approach 1: Put a constraint at the column level (like a combined unique constraint on first_name and hire_date), or check in the query whether the employee exists and throw an error (I believe this is possible through PL/SQL).
Approach 2: Can it be done at the Java level too, e.g. a method which first checks whether the employee exists and then throws an error? In that case I would need to make the method synchronized (or use a synchronized block), but that will impact performance because it unnecessarily holds up other transactions as well. Is there a way to use a lock (ReentrantLock) or a synchronized block keyed on name/hire date, so that only the specific transactions with the same name and hire date are put on hold?
public void save(Employee emp){
//hibernate api to save
}
I believe Approach 1 should be preferred, as it is simpler and easier to implement. Right? Even so, I would like to know whether it can be handled efficiently at the Java level.
What I want is to avoid duplicate inserts
and
but it will impact performance because it unnecessarily holds up other transactions as well
So, you want highly concurrent inserts that guarantee no duplicates.
Whether you do this in Java or in the database, the only way to avoid duplicate inserts is to serialize (or, Java-speak, synchronize). That is, have one transaction wait for another.
The Oracle database will do this automatically for you if you create a PRIMARY KEY or UNIQUE constraint on your key values. Simultaneous inserts that are not duplicates will not interfere or wait for one another. However, if two sessions simultaneously attempt duplicate inserts, the second will wait until the first completes. If the first session completed via COMMIT, then the second transaction will fail with a duplicate key on index violation. If the first session completed via ROLLBACK, the second transaction will complete successfully.
You can do something similar in Java as well, but the problem is you need a locking mechanism that is accessible to all sessions. synchronize and similar alternatives work only if all sessions are running in the same JVM.
Also, in Java, the key to maximizing concurrency and minimizing waits is to wait only for actual duplicates. You can get close to that by hashing the incoming key values and then synchronizing only on that hash. For example, put 65,536 objects into a list; when an insert wants to happen, hash the incoming key values to a number between 0 and 65,535, get that object from the list, and synchronize on it. Of course, you can also synchronize on the actual key values, but a hash is usually just as good and can be easier to work with, especially if the incoming key values are unwieldy or sensitive.
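A sketch of that striping idea. The (first_name, hire_date) key follows the question; the stripe count, helper names, and the in-memory Set standing in for the real check-then-INSERT are all made up for illustration:

```java
import java.time.LocalDate;
import java.util.Set;

public class StripedLockDemo {
    private static final int N_STRIPES = 65536;
    private static final Object[] LOCKS = new Object[N_STRIPES];
    static {
        for (int i = 0; i < N_STRIPES; i++) LOCKS[i] = new Object();
    }

    // Map the business key to one of a fixed pool of lock objects
    static Object lockFor(String firstName, LocalDate hireDate) {
        int h = (firstName + "|" + hireDate).hashCode();
        return LOCKS[Math.floorMod(h, N_STRIPES)];
    }

    // Only inserts whose (name, hireDate) hash to the same stripe wait on
    // each other; unrelated inserts proceed in parallel.
    static boolean insertIfAbsent(Set<String> table, String firstName, LocalDate hireDate) {
        synchronized (lockFor(firstName, hireDate)) {
            String key = firstName + "|" + hireDate;
            return table.add(key); // stands in for the real check + INSERT
        }
    }
}
```

As the surrounding answer notes, this only serializes sessions within one JVM; across JVMs the database constraint remains the real guarantee.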
That all said, this should absolutely all be done in the database using a simple PRIMARY KEY constraint on your table and appropriate error handling.
One of the main reasons of using databases is that they give you consistency.
You are volunteering to put some of that responsibility back into your application. That very much sounds like the wrong approach. Instead, you should study exactly which capabilities your database offers; and try to make "as much use of them as possible".
In that sense you try to fix a problem on the wrong level.
Pseudo code:
void save(Employee emp) {
    if (!isEmployeeExist(emp)) {
        // Hibernate api to save
    }
}

boolean isEmployeeExist(Employee emp) {
    // build and run a query to find the employee
    return true; // if the employee exists, else return false
}
Good question. I would strongly suggest using MERGE (INSERT and UPDATE in a single DML statement) in this case. Let Oracle handle the transactions and locks; it's the best fit for your case.
You should create a primary key or unique constraint (approach 1) regardless of any other solution, to preserve data integrity.
-- Sample statement
MERGE INTO employees e
USING (SELECT * FROM hr_records) h
ON (e.id = h.emp_id)
WHEN MATCHED THEN
UPDATE SET e.address = h.address
WHEN NOT MATCHED THEN
INSERT (id, address)
VALUES (h.emp_id, h.address);
Since the row is not inserted yet, isolation levels such as READ_COMMITTED/REPEATABLE_READ are not applicable to it.
Best is to apply a DB constraint (unique). If that is not possible, then in a multi-node setup you can't achieve this through Java locks either, since a request can go to any node.
So in that case you need some kind of distributed-lock functionality. You can create a lock table where, for each table, only one insertion (or one collection of insertions) is possible per node at a time.
Ex:
Table_Name, Lock_Acquired
emp, 'N'
Now code can read this row under READ_COMMITTED and try to update Lock_Acquired to 'Y'; any other code, in another thread or on another node, won't be able to proceed, and the lock is granted only after the previous lock has been released.
This makes for a highly concurrent system that avoids duplication, but it will suffer from scalability issues. So decide accordingly what you want to achieve.
I have an implementation of an ItemWriter which persists all of my value objects nicely. When the first value object (for the batch job) is passed to the ItemWriter, can I perform a separate DB insert and guarantee that this insert will not occur for subsequent value objects coming into the ItemWriter?
Apologies if that sounds wordy. In simpler terms, I want to write a record to a status table to show that the batch job has started writing, without inserting it n times.
You can use JobExplorer to query the Spring Batch metadata tables and check whether the step has started.
Another way: you can use a listener such as ItemWriteListener.afterWrite() and store your flag in an audit table (and also in the execution context, to prevent multiple writes).
I am creating a Java application that needs to collect lots of data, process it into objects, and return the objects as a list.
All of the data comes from different tables in my database (some are joined, but all of them are separate SQL calls).
I was thinking of getting this data through different threads, but since multiple threads cannot use the same connection to access the database, I would have to create a new connection for each of these threads.
My question is: what is the best way to access and process multiple pieces of data from the database at the same time?
If you have enough memory, I would use a full second-level cache that syncs to the database; using a cache makes it extremely fast. If you don't have enough memory on the server/client, you can cache your query on the SQL server with a table that holds all the values from your query and gets updated every second.
Otherwise you can use a thread pool whose threads insert the query results into a shared result object.
I am using the Spring framework. Suppose there is a ModelBean class where all the constants are declared, and name is one of the fields declared in ModelBean.
public class CurriculumReportDaoImpl extends JdbcDaoSupport {
    public List<ModelBean> fetchList() {
        String query = "";
        List<ModelBean> tempList = new ArrayList<ModelBean>();
        List<Map<String, Object>> records = getJdbcTemplate().queryForList(query);
        for (Map<String, Object> result : records) {
            /* create a new instance of ModelBean */
            ModelBean model = new ModelBean();
            /* "name" is the column name fetched from the db */
            model.setName(result.get("name").toString());
            /* now add model to tempList */
            tempList.add(model);
        }
        return tempList;
    }
}
If you have many connections, then you can create many lists here and set them into ModelBean instances.
I think this will help you.
Ideally the database should be designed in such a way that you can get all the related data in a single query (maybe with JOINs). Not sure if that is achievable in your case.
There are three viable options. You have already tried one by creating multiple threads to fetch the data; I will just add an input to that approach in case it helps: create data-fetcher threads to fetch the data from the different tables, and one data-processor thread to process the data as the fetchers deliver it.
or
The second approach is to create a stored procedure which runs directly on the database and can do some of the data processing for you. This avoids creating too many threads and doing a lot of processing in your Java code.
or
Mix both the approaches to achieve the best results.
Good luck!