I am creating a Java application that needs to collect loads of data, process the data into objects, and return the objects as a list.
All of the data collected is from different tables in my database (some are joined, but all of them are separate SQL calls).
I was thinking of fetching this data with different threads, but since multiple threads cannot use the same connection to access the database, I would have to create a new connection for each of these threads.
My question is: what is the best way to access and process multiple sets of data from the database at the same time?
If you have enough memory, I would use a full second-level cache that syncs to the database; using a cache makes this dramatically faster. If you won't have enough memory on the server/client, you can cache your query on the SQL server with a table that holds all the values from your query and is updated every second.
Otherwise you can use a thread pool whose threads insert the query results into a shared result object.
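A minimal sketch of that thread-pool idea (uses java.util.concurrent); the javax.sql.DataSource, the list of SQL strings, the "name" column and the ModelBean class with its setName() method are all placeholders for your own code:

public List<ModelBean> fetchAll(DataSource dataSource, List<String> queries) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(queries.size());
    List<ModelBean> results = Collections.synchronizedList(new ArrayList<ModelBean>());
    List<Future<?>> futures = new ArrayList<Future<?>>();
    for (String sql : queries) {
        futures.add(pool.submit(() -> {          // one task, and one connection, per query
            try (Connection con = dataSource.getConnection();
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    ModelBean bean = new ModelBean();
                    bean.setName(rs.getString("name"));
                    results.add(bean);           // synchronized list is safe to share
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }));
    }
    for (Future<?> f : futures) {
        f.get();                                 // wait for completion, propagate failures
    }
    pool.shutdown();
    return results;
}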
I am using the Spring framework. Suppose there is a ModelBean class in which all constants are declared, and name is one of the fields declared in ModelBean.
public class CurriculumReportDaoImpl extends JdbcDaoSupport {

    public List<ModelBean> fetchList() {
        String query = ""; // your SQL here
        List<ModelBean> tempList = new ArrayList<ModelBean>();
        List<Map<String, Object>> records = getJdbcTemplate().queryForList(query);
        for (Map<String, Object> result : records) {
            /* create a new instance of ModelBean */
            ModelBean model = new ModelBean();
            /* "name" is the column name fetched from the db */
            model.setName(result.get("name").toString());
            /* now add the model to tempList */
            tempList.add(model);
        }
        return tempList;
    }
}
If you have many connections, then you can create many lists here and set them into the ModelBean class.
I think this will help you.
Ideally the database should be designed in such a way that you can get all the related data in a single query (maybe with JOINs). Not sure if that is possible to achieve in your case.
There are three viable options. The first you have already tried: creating multiple threads and fetching the data. One input on that approach, in case it helps optimize it: create Data Fetcher threads to fetch the data from the different tables, and one Data Processor thread to process the data as the fetchers deliver it (see the sketch after this list).
or
The second approach is to create a stored procedure, which runs directly on the database and can do some of the data processing for you. This avoids the need to create too many threads and to do a lot of processing in your Java code.
or
Mix both approaches to achieve the best results.
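A rough sketch of the fetcher/processor split from the first option, using a BlockingQueue; the query list, queue capacity, POISON marker and processRow() method are all illustrative, and the snippet assumes it runs inside a method that declares throws Exception:

BlockingQueue<Map<String, Object>> queue = new LinkedBlockingQueue<>(1000);
final Map<String, Object> POISON = new HashMap<>();    // signals "no more rows"
ExecutorService fetchers = Executors.newFixedThreadPool(queries.size());
for (String sql : queries) {
    fetchers.submit(() -> {                            // one Data Fetcher per query
        try (Connection con = dataSource.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new HashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                }
                queue.put(row);                        // blocks if the processor falls behind
            }
        } catch (SQLException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    });
}
Thread processor = new Thread(() -> {                  // the single Data Processor thread
    try {
        for (Map<String, Object> row = queue.take(); row != POISON; row = queue.take()) {
            processRow(row);                           // your per-row processing, invented here
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
processor.start();
fetchers.shutdown();
fetchers.awaitTermination(1, TimeUnit.HOURS);          // all fetchers finished
queue.put(POISON);                                     // tell the processor to stop
processor.join();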
Good luck!
I have a requirement in which I am retrieving values in one reader of a step using SQL statements, and then making the same request in the next reader.
I do not want to make another request if the data has already been fetched in the first reader, and instead want to pass that collection (possibly a HashMap) to the next step.
For this I have gone through the following link on SO:
How can we share data between the different steps of a Job in Spring Batch?
In many of the comments it is mentioned that 'data must be short'.
Also, it is mentioned in one response that these contexts are good for sharing strings or simple values, but not for sharing collections or huge amounts of data.
By passing that HashMap, I believe only a reference to the HashMap will be passed, not a copy of the data.
It would be good to know beforehand the possible consequences of passing it, and any better alternative approach.
Passing data between steps is indeed done via the execution context. However, you should be careful about the size of the data you put in the execution context, as it is persisted between steps.
I do not want to make another request if the data is already fetched in the First reader and pass that collection (possibly a HashMap) to next step
You can read the data from the database only once and put it in a cache. The second reader can then get the data from the cache. This would be faster than reading the data from the database a second time.
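For example, a Spring-managed singleton bean can serve as that cache; this is just a sketch, where the bean, its methods and the MyRow type are all invented for illustration:

@Component
public class StepDataCache {

    // shared between steps because Spring creates a single instance of this bean;
    // ConcurrentHashMap keeps access safe if steps run concurrently
    private final Map<String, List<MyRow>> cache = new ConcurrentHashMap<>();

    public void put(String key, List<MyRow> rows) {
        cache.put(key, rows);
    }

    public List<MyRow> get(String key) {
        return cache.get(key);
    }
}

The first step fills the cache, and the second reader checks it before querying the database. This also keeps the data out of the persisted execution context entirely.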
Hope this helps.
I have a set of Java classes, and they have a certain number of attributes.
These attributes are assigned values through SQL queries run against a database. In certain classes, not all the attributes are fetched by a single SQL query but by multiple queries instead, so my current implementation runs these queries one after the other and works with multiple ResultSets to initialize the Java objects. I am looking for a better way to do this. Please note that I am not the producer of the SQL database, I am just a consumer, so I don't have access to the schema of the tables.
The only thing you can do to avoid many ResultSets for one object is to refactor your queries into one. Of course, if you don't have access to the schema this will not be easy to do, but the producer of this db should be receptive to the performance you can gain by executing one query instead of multiple.
If you really cannot do anything about the queries, then you can search for, or build, a utility to merge/decorate/compose many ResultSets into one class.
Anyway, I don't see any problem with building one object from many ResultSets. The problem is more the reason why you cannot have one ResultSet.
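For instance, nothing stops you from filling a single object from several ResultSets in one place, so the rest of the code only ever sees the finished object. Employee, the queries and the column names below are invented for illustration, and a javax.sql.DataSource is assumed:

long id = 42L; // example key
Employee emp = new Employee();
try (Connection con = dataSource.getConnection()) {
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT name, dept FROM emp_core WHERE id = ?")) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                emp.setName(rs.getString("name"));
                emp.setDept(rs.getString("dept"));
            }
        }
    }
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT salary FROM emp_pay WHERE id = ?")) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                emp.setSalary(rs.getBigDecimal("salary"));
            }
        }
    }
}
// emp is now fully initialized from two ResultSets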
I am creating a job to create a complex multi-level document for mongoDB from relational data.
I read 'product' records in from Oracle.
I have a tJavaRow in which I use the mongoDB API to create a product document (BasicDBObject) from the incoming product details. I store this document in the globalMap (call this 'product_doc'), as I need to embed a sub-document in it later in the sub job.
I use a tFlowToIterate to store the product_id in the globalMap.
I then have another Oracle input which uses the product_id from the globalMap as a parameter in the SQL, fetching the many side of the relationship to products (call this 'product_orders').
I build a java List of 'product_order' documents and write the List to the globalMap, let's call this 'product_orders'.
I then insert the 'product_orders' List as a sub-document into the 'product' document in a tJava component, write 'product' to mongoDB, and then move on to the next product row from Oracle.
It is more complex than this, creating a 5-level hierarchy...but this is the basic idea. It takes 3 hours to run, though.
So, I want to set the job to run parallelized, so each product row from Oracle gets dispatched onto a new thread...round-robin style.
However, I have a heavy dependency on the globalMap to store objects for later use in the flow....and I know the threads will trample all over each other. I assume each thread maintains the same variable scope across the sub job...
I can identify the thread_id using a global variable in the globalMap "tCollector_1_THREAD_ID" I think.
So I had considered doing this when I add documents/objects into the globalMap.
globalMap.put("product_doc_" + globalMap.get("
tCollector_1_THREAD_ID"))
So that everything I put in the globalMap is thread-specific and tagged...but I don't know how tCollector_1_THREAD_ID gets populated; if it is in the globalMap then surely each thread can trample over this value also?
It didn't work...I was getting a load of null errors.
So I guess my question is about variable scope and use of globalMap when using tJavaRow components in a parallelized data flow, when you need to maintain references in each thread.
---- UPDATE ------
For clarity, if you look at this page it states you can get the thread ID from the variable tCollector_1_Thread_ID. But it gets that variable from the globalMap.
Surely the globalMap is a global variable, so how can the multiple threads not all be changing this value all the time and interfering with each other?
https://help.talend.com//pages/viewpage.action?pageId=265114338
Here are a few approaches that I am using successfully for parallel executions:
If possible, create another job that you run in parallel; that helps with understanding the tasks.
In this example I use the "Use or register a shared DB connection" feature, so I re-use my connections. tJavaFlex just contains a simple try { } catch block, so I handle/hide the errors.
GpOutput uses a connection that was created outside of the threads.
I prefer the approach where I create a separate job and use context parameters to pass information to the job.
Here you can see how the globalMap is used.
I found that the tPartition/Departitioner in Talend is very hard to use. I prefer more controlled ways to handle parallel executions, such as a loop that splits the workload across 20 parallel threads:
" WHERE mod(num,20) = " + context.i
I am trying to fetch data from one single table having 22 rows and 20 columns (let's say REFERENCE_TABLE) of the database and compare its values, as a reference, with a few elements of 16 other tables (holding the present state of data of some environment).
I am using a Vector to store the data of REFERENCE_TABLE, each row as an object of some class "X", and access individual data through "vector.get(0).getValue()".
[getValue() is a method of class "X", which has the column names as variables]
So I am fetching the reference values only once into the Vector in the initial phase of the application and then using them in different methods throughout the application, rather than fetching the data from the database every time.
So my dilemma is:
Is using the data from the Vector (by passing it to different methods) more efficient than fetching the data from the database table each time?
I want the execution time of the application to be the minimum.
Please help!
From my understanding of Java web applications, it is better to leave data in the database, as this complies with model/control separation, i.e. data/business-logic separation.
But actually, achieving this separation can result in a performance problem: each time you need data, you have to get a connection to the database. Thus many developers like to control the database themselves, which can cause transaction-consistency problems.
Hence, generally speaking, to respect the transaction-consistency principle you should use a transaction manager, or control transactions all by yourself very carefully.
Also, you need to measure the connection time and some other metrics to ensure the performance of your web application. But honestly, I think a Java EE application's performance is very satisfactory compared with Python or PHP.
I have a list of primary keys, e.g. empids, and I want to get the employee information for each empid from the database. Or rather, I want to get the data from different databases based on the different types of empids, using multiple threads.
Currently I'm fetching the first employee's information and saving it into a Java bean, then fetching the second employee and saving it into a bean, and so on, finally adding all these beans into an ArrayList. Now I want to get the data from the databases in parallel, i.e. fetch the information for each employee at the same time and save it into a bean.
Basically, I'm looking for parallel processing rather than sequential, to improve performance.
I don't think you're looking for parallelism in this case. You really are looking for a single query that will return all employees whose id is in the collection of ids that you have: one database connection, one thread, one query, and one result set.
If you are using Hibernate, this is super easy with Hibernate Criteria, where you can use Restrictions.in on the employee id and pass it the collection of ids. The query underneath will be something like select a, b, c, ..., n from Employee where employee_id in (1,2,3,4,...,m).
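A minimal sketch with the classic Criteria API, given an open Session; the Employee class and the "id" property name are assumptions about your mapping, and empIds is your collection of keys:

List<Employee> employees = session.createCriteria(Employee.class)
        .add(Restrictions.in("id", empIds))
        .list();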
If you are using straight JDBC, you can achieve the same in your native query; you will need to change the ResultSet parsing because you will now expect a collection back.
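With plain JDBC the IN clause needs one placeholder per id; a sketch, given a Connection con, where the table and column names are assumptions:

String placeholders = String.join(",", Collections.nCopies(empIds.size(), "?"));
String sql = "SELECT * FROM employee WHERE employee_id IN (" + placeholders + ")";
try (PreparedStatement ps = con.prepareStatement(sql)) {
    int i = 1;
    for (Long id : empIds) {
        ps.setLong(i++, id);       // bind each id to its placeholder
    }
    try (ResultSet rs = ps.executeQuery()) {
        List<Employee> employees = new ArrayList<>();
        while (rs.next()) {
            // map each row to an Employee bean and add it to the list, as before
        }
    }
}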
You can create a Callable task that fetches the employee information and returns the ArrayList from that Callable (thread).
You can then submit the tasks using an Executor and keep the handles of the Futures to loop over the results.
// pseudo-code: submit the task and wait for its result
ExecutorService executor = Executors.newFixedThreadPool(4);
Future<ArrayList<Employee>> fut = executor.submit(new EmployeeInfoTask(empIds));
// EmployeeInfoTask is a Callable<ArrayList<Employee>>
ArrayList<Employee> result = fut.get(); // blocks until the callable completes
for (Employee employee : result) {
    System.out.println(employee);
}
See Executor, Callable
EDIT - for Java 1.4
In this case you can still make the database calls in different threads, but you will need to make each thread write to a shared Employee collection. Don't forget to synchronize access to this collection.
Also, you will need to join() all the threads you have spawned, so that you know when they are all done, as in the sketch below.
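A sketch of that pre-java.util.concurrent style (Java 1.4, so no generics); fetchEmployee() stands in for your per-id JDBC lookup:

final List employees = Collections.synchronizedList(new ArrayList());
Thread[] workers = new Thread[empIds.size()];
for (int i = 0; i < workers.length; i++) {
    final String empId = (String) empIds.get(i);
    workers[i] = new Thread(new Runnable() {
        public void run() {
            Employee emp = fetchEmployee(empId); // one connection per thread
            employees.add(emp);                  // synchronized list guards concurrent adds
        }
    });
    workers[i].start();
}
try {
    for (int i = 0; i < workers.length; i++) {
        workers[i].join();                       // wait until every fetch is done
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}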