I am working on data visualization. I have a MySQL database, a Java backend, and a JS front end. I access the DB via Hibernate.
I need to retrieve a couple of million rows from the DB at a time in order to visualize them properly.
Initial problem: it takes about 20-30 seconds for the whole process.
This is not very user-friendly, so I need to reduce that time!
I looked into where the long waiting time comes from, and it is definitely the retrieval of the data from the DB: when I retrieve the same data in MySQL Workbench with pure SQL instead of Hibernate, it takes almost the same time.
So data processing in the backend is not the problem; it is getting the data in the first place.
I tried to tackle this problem with multithreading.
I read about the common problems of using multithreading with Hibernate and tried to avoid them, but I still have the feeling I am overlooking something important.
Here is what I tried so far:
In the method which should return all the data for further processing I have
some code...
for (int i = 0; i <= numberOfThreads; i++) {
    Runnable r = new Runnable() {
        @Override
        public void run() {
            try {
                someDataStructure = HibernateUtil.executeHqlCustom(hql, lowerLimit, stepSize);
            } catch (someException e) {
                ...
            }
        }
    };
    r.run();
    r = null;
}
return someDataStructure;
The "executeHqlCustom" creates a session, creates a Query, uses the setFirstResult to set the lower bound and uses the setMaxResult to set the amount of data each thread should be responsible of (this way the data should be received in chunks).
Session gets closed.
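For reference, here is a minimal sketch of what such a method might look like; the getSessionFactory() helper is an assumption, since the real method isn't shown in the question:

// Hypothetical sketch of executeHqlCustom, reconstructed from the description above
public static List<?> executeHqlCustom(String hql, int lowerLimit, int stepSize) {
    Session session = HibernateUtil.getSessionFactory().openSession();
    try {
        Query query = session.createQuery(hql);
        query.setFirstResult(lowerLimit); // lower bound (row offset) of this chunk
        query.setMaxResults(stepSize);    // number of rows this call is responsible for
        return query.list();
    } finally {
        session.close(); // session gets closed either way
    }
}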
I tried a lot of different combinations (from a couple of threads up to one thread for every single row in the DB, which is ridiculous, I know!), but in the end using no multithreading was always faster than all the other options I just described.
I also tried using my own class which implements Runnable / extends Thread and has a boolean flag as an instance variable, in order to set the thread to null when it is not needed anymore.
None of that was any better.
Like I said, I am 99% sure I am overlooking something very important or that I misunderstood some of the concepts of multithreading (I am very new to it).
Any help is appreciated!
Related
I'm pulling down a table full of data, and for every row I need to do a bit of formatting and then push it out to a REST API.
I use a PostgreSQL database and a Java implementation; the idea is to pull all the data down, get the number of rows, and spin up threads to handle a chunk at a time.
I've got the connection set up and I'm pulling the table into a CachedRowSet, using last(), getRow() and beforeFirst() to get the row count.
I'm trying to find a way to split out a chunk of a rowset and hand it off to be handled, but I can't seem to see anything that does this.
There's LIMIT x and the like, but I want to avoid numerous database calls with data this size.
Any ideas would be greatly appreciated.
Here's the kind of thing I'm looking at
// Pull the whole table into a cached (in-memory) rowset
RowSet rst = RowSetProvider.newFactory().createCachedRowSet();
rst.setUrl(url);
rst.setUsername(username);
rst.setPassword(password);
String cmd = "select * from event_log";
rst.setCommand(cmd);
rst.execute();
ResultSetMetaData rsmd = rst.getMetaData();
int columnsNumber = rsmd.getColumnCount();
// Jump to the last row to get the total row count, then rewind
rst.last();
int size = rst.getRow();
int maxPerThread = 1000;
rst.beforeFirst();
// Note: integer division, so a final partial chunk of size % maxPerThread rows is not covered
int threadsToCreate = size / maxPerThread;
for (int loopCount = 0; loopCount < threadsToCreate; loopCount++)
{
    //Create chunk
    //Create thread
    //Pass chunk into thread and start it
    //Once chunk is finished then thread and chunk are destroyed
}
This is the proper way to think about JDBC interactions:
All queries are like an ad-hoc view: SELECT foo, bar BETWEEN a AND b AS baz FROM foo INNER JOIN whatever; - this effectively creates a new temporary table.
A ResultSet is a live, interactive concept: a ResultSet is not a dump of the returned data. It is like the relationship between a FileInputStream and a file on disk: the ResultSet has methods that give you data, and that data is probably obtained by chatting with the database, 'live'. The ResultSet itself holds only a few handles, not the actual data, though it may do some caching; you have no idea.
As a consequence:
ResultSet is utterly non-parallelizable. If you share a ResultSet object with more than one thread, you wrote a bug, and you can't recover from there.
In many DBs, 'ask for the length' is tantamount to running the entire query soup to nuts, and is therefore quite slow. You probably don't want to do that, and there is no real reason to do that from the perspective of 'I want to concurrently process the information I received'. You've picked the wrong approach.
ResultSets can (and, for performance reasons, generally should!) be configured as 'forward only', meaning: you can advance by one row by calling .next(), and once you did that, you can't go back. This significantly reduces the load on the DB server, as it doesn't have to be prepared to properly respond to a request to hop back to the start.
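For illustration, requesting a forward-only, read-only ResultSet with a fetch-size hint looks like this in plain JDBC, assuming an open Connection conn; the fetch size and query are placeholders:

// Forward-only, read-only: the DB never has to support scrolling backwards
Statement stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(1000); // hint to stream rows in batches instead of buffering everything
ResultSet rs = stmt.executeQuery("select * from event_log");
while (rs.next()) {
    // read the current row via rs.getXxx(...) calls
}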
Here's what I suggest you do:
You have a single 'controller' thread which has the ResultSet and runs the query.
Once the query returns, you have no idea how many records you have. But you do know how much you want to parallelize - how many threads you want to be concurrently churning away at processing this data.
Thus, the answer: spin up that many threads in the form of a thread pool (an ExecutorService). Then have your controller pull rows (call resultSet.next() and pull all the data into Java types by invoking the various .getFoo(idxOrColName) methods), marshalling it all into a single Java object. I suggest you write a POJO that represents one row's worth of data and create one for each row.
Then your controller thread takes this object and considers it 'a job'.
You've now reduced the problem to a basic fork/join-style strategy: you have one thread that produces jobs, and some code that takes a single job and completes it. I've just described what ExecutorService and friends are designed to do.
It is crucial that the ResultSet object is not accessible to your processor threads. There is no point pulling rows from the DB in parallel, because the DB isn't parallel and wouldn't be able to give you this information any faster than a single thread. The only parallelising win you can score here is to do the processing of the data concurrently, which is why the above model cannot be improved upon without much more drastic changes.
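Here is a minimal sketch of that controller/worker split; the EventRow POJO, the process() method, the column names, and the pool size are all assumptions for illustration:

ExecutorService pool = Executors.newFixedThreadPool(8); // the X worker threads

// Controller thread: the only thread that ever touches the ResultSet
while (rs.next()) {
    // marshal the row into plain Java types *before* handing it off
    final EventRow row = new EventRow(rs.getLong("id"), rs.getString("payload"));
    pool.submit(new Runnable() {
        public void run() {
            process(row); // workers only ever see the POJO, never the ResultSet
        }
    });
}
pool.shutdown(); // stop accepting jobs; workers drain the queue and exit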
If you're looking for drastic redesigns, you need to 'pre-chunk'. Let's say, for example, that you already know you have a database with a million rows, and each row has a completely random ID. You also know you have X processor threads, where X is a dynamic number that depends on many factors, such as how many CPU cores the hardware you run on has.
Then:
You fire up X threads. You tell each thread its index (so, if you have 7 threads, one has 'index 0', another has 'index 1', all the way up to 'index 6'), and how many total threads there are.
Then, each thread runs the following query:
SELECT * FROM jobs WHERE unid % 7 = 5;
That's the query the 6th job thread would run.
This guarantees that each thread is running about an equal number of jobs, give or take.
Generally this is less efficient than the previous model, given that this most likely means the DB is just doing more work (running the same query 7-fold, instead of only once), and any given worker thread may start idling whilst others are still running, vs. the controller-that-pulls-and-hands-jobs-out model where you won't run into the situation that one thread is done whilst others still have lots of jobs left.
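As a sketch, each worker thread would run its own modulo-partitioned query; the table and column names are the answer's examples, the thread count of 7 matches the example above, and each thread is assumed to open its own Connection:

// Run by the worker with the given index (0..6), out of totalThreads (7) threads
PreparedStatement ps = conn.prepareStatement(
        "SELECT * FROM jobs WHERE unid % ? = ?");
ps.setInt(1, totalThreads); // 7
ps.setInt(2, index);        // e.g. 5 for the 6th thread
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    // process this thread's share of the rows
}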
NB: RowSet and ResultSet work in effectively the same way. In fact, the DB-backed version of RowSet (JdbcRowSet) is implemented as a light wrapper around ResultSet.
Possible Duplicate:
Java: TaskExecutor for Asynchronous Database Writes?
I have a Map of data objects in memory that I need to be able to read and write very quickly. I would like these objects to be persistent across process restarts, so I'd like to have them stored in a DB.
Since I'd rather not have any DB inserts or updates slow down my running time, I'd like the changes applied immediately in memory and written to the DB later, asynchronously. (It's even acceptable to me if the process crashes and a little bit of data is lost.)
Is there a Java tool (preferably open source) that has this ability "out of the box"? Can this easily be done with Hibernate?
As I have also stated in my comment, if you need async writes and do not want to use Hibernate or Ehcache:
Runnable: A simple way to achieve async processing in plain Java is via a class which implements Runnable:
public class AsyncDatabaseWriter implements Runnable {
    // Data is the caller's value type for one pending write
    private final LinkedBlockingQueue<Data> queue = new LinkedBlockingQueue<Data>();
    private volatile boolean terminate = false;

    public void run() {
        while (!terminate) {
            try {
                Data data = queue.take(); // blocks until a write is scheduled
                // write to database
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
                return;
            }
        }
    }

    public void scheduleWrite(Data data) {
        queue.add(data);
    }
}
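Usage is then just a matter of starting the writer on its own thread and scheduling writes from anywhere, without blocking on the database (a minimal sketch):

AsyncDatabaseWriter writer = new AsyncDatabaseWriter();
new Thread(writer, "async-db-writer").start();
// from any thread:
writer.scheduleWrite(data);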
Also stated here: Java: TaskExecutor for Asynchronous Database Writes?
Distributed workers: If you want to introduce more moving parts into your system, you can try a Java alternative to a distributed task queue like Celery: Hazelcast or Octobot. This will need a messaging tier in between, which acts as the queue, and the workers will do the task of writing to the DB for you. This looks like overkill, but again it depends on your use case and the scale at which you want to run your app.
I did something very similar, where I had a use case to write to the DB in an async manner, so I went with Celery (Python). A sample implementation can be found here: artemis
Consider using write-behind caching, e.g. in Ehcache:
[...] writing data into the cache, instead of writing the data into database at the same time, the write-behind cache saves the changed data into a queue and lets a backend thread to do the writing later. Therefore, the cache-write process can proceed without waiting for the database-write and, thus, be finished much faster. Any data that has been changed can be persisted into database eventually. In the mean time, any read from cache will still get the latest data.
Unfortunately I don't know how, or if, write-behind integrates with Hibernate (I don't think it does).
In Hibernate you choose when you want to persist data to the database: you work mainly with proxy objects (which you can keep and manipulate in memory), and you call the save method whenever you want to insert or update in the database.
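For illustration, a minimal sketch of that pattern; the entity and session handling are assumptions:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
// keep and manipulate the object in memory for as long as you like...
myEntity.setValue(42);
// ...and persist only when you decide to:
session.saveOrUpdate(myEntity);
tx.commit();
session.close();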
I have a Spring MVC, Hibernate (Postgres 9 DB) web app. An admin user can send in a request to process nearly 200,000 records (each record collected from various tables via joins). Such an operation is requested on a weekly or monthly basis (or whenever the data reaches a limit of around 200,000/100,000 records). On the database end, I am correctly implementing batching.
PROBLEM: Such a long-running request holds up the server thread, and that causes the normal users to suffer.
REQUIREMENT: The high response time of this request is not an issue. What's required is to not make other users suffer because of this time-consuming process.
MY SOLUTION:
Implementing a thread pool using Spring's TaskExecutor abstraction. So I can initialize my thread pool with, say, 5 or 6 threads and break the 200,000 records into smaller chunks, say of size 1000 each, and queue these chunks. To further allow the normal users faster DB access, maybe I can make every runnable thread sleep for 2 or 3 seconds between chunks.
The advantage of this approach as I see it: instead of executing a huge DB-interacting request in one go, we have an asynchronous design spanning over a larger time, thus behaving like multiple normal user requests.
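A rough sketch of that idea with Spring's TaskExecutor; the Record type, the partition() helper, and processChunk() are assumptions for illustration:

// taskExecutor is a Spring-configured ThreadPoolTaskExecutor with 5-6 threads
for (final List<Record> chunk : partition(allRecords, 1000)) {
    taskExecutor.execute(new Runnable() {
        public void run() {
            processChunk(chunk); // batched DB work for ~1000 records
            try {
                Thread.sleep(2000); // throttle so normal users get DB time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    });
}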
Can some experienced people please give their opinion on this?
I have also read about implementing the same behaviour with message-oriented middleware like JMS/AMQP, or with Quartz scheduling. But frankly speaking, I think internally they are also going to do the same thing, i.e. make a thread pool and queue the jobs. So why not go with Spring's TaskExecutor instead of adding completely new infrastructure to my web app just for this feature?
Please share your views on this and let me know if there are other, better ways to do it.
Once again: the time to completely process all the records is not a concern; what's required is that normal users accessing the web app during that time should not suffer in any way.
You can parallelize the tasks and wait for all of them to finish before returning the call. For this, you want to use ExecutorCompletionService, which has been available in the Java standard library since 5.0.
In short, you use your container's service locator to create an instance of ExecutorCompletionService
ExecutorCompletionService<List<MyResult>> queue =
        new ExecutorCompletionService<List<MyResult>>(executor);
// do this in a loop, once per chunk
queue.submit(aCallable);
// after looping: take() blocks until the *next* task finishes,
// so call take().get() once per submitted task to wait for all of them
for (int i = 0; i < submittedCount; i++) {
    List<MyResult> partial = queue.take().get();
}
If you do not want to wait, you can process the jobs in the background without blocking the current thread, but then you will need some mechanism to inform the client when the job has finished. That can be through JMS or, if you have an Ajax client, it can poll for updates.
Quartz also has a job-scheduling mechanism, but Java provides a standard way.
EDIT:
I might have misunderstood the question. If you do not want a faster response, but rather want to throttle the CPU, use this approach.
You can make an inner class like the PollingThread below, where the batches (a java.util.UUID per job) and the number of PollingThreads are defined in the outer class. It will keep going forever and can be tuned to keep your CPUs free to handle other requests.
class PollingThread implements Runnable {
    @SuppressWarnings("unchecked")
    public void run() {
        Thread.currentThread().setName("MyPollingThread");
        while (!Thread.interrupted()) {
            try {
                LinkedHashSet<UUID> list = null;
                synchronized (incomingList) {
                    if (incomingList.size() == 0) {
                        // incoming is empty, wait for some time
                    } else {
                        // copy the batch, then clear the original
                        list = (LinkedHashSet<UUID>) incomingList.clone();
                        incomingList.clear();
                    }
                }
                if (list != null && list.size() > 0) {
                    processJobs(list);
                }
                // Sleep for some time
                try {
                    Thread.sleep(seconds * 1000);
                } catch (InterruptedException e) {
                    // ignore
                }
            } catch (Throwable e) {
                // ignore
            }
        }
    }
}
Huge DB operations are usually triggered during the wee hours, when user traffic is pretty low (say, 1 AM to 2 AM). Once you find that window, you can simply schedule a job to run at that time. Quartz can come in handy here, with its time-based triggers. (Note: manually triggering a job is also possible.)
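A minimal Quartz 2.x sketch for that kind of nightly window; the HugeDbOperationJob class and the 1 AM schedule are illustrative:

Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
scheduler.start();

JobDetail job = JobBuilder.newJob(HugeDbOperationJob.class) // hypothetical Job implementation
        .withIdentity("hugeDbOperation")
        .build();
Trigger trigger = TriggerBuilder.newTrigger()
        .withSchedule(CronScheduleBuilder.dailyAtHourAndMinute(1, 0)) // 1 AM daily
        .build();
scheduler.scheduleJob(job, trigger);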
The processed results could now be stored in different table(s) (I'll refer to them as result tables). Later, when a user wants these results, the DB operations would run against the result tables, which have minimal records and hardly any joins involved.
instead of adding a completely new infrastructure in my web app just for this feature?
Quartz.jar is ~350 KB and adding this dependency shouldn't be a problem. Also note that there's no reason this needs to live in the web app: the few classes that do the ETL could be placed in a standalone module. The request from the web app then only needs to fetch from the result tables.
All that apart, if you already have a master-slave DB model (discuss that with your DBA), you could run the huge DB operations against the slave DB rather than the master, which the normal users would be pointed at.
I'm thinking of using Java's TaskExecutor to fire off asynchronous database writes. Understandably, threads don't come for free, but assuming I'm using a fixed thread pool size of, say, 5-10, how is this a bad idea?
Our application reads from a very large file using a buffer and flushes this information to a database after performing some data manipulation. Using asynchronous writes seems ideal here so that we can continue working on the file. What am I missing? Why doesn't every application use asynchronous writes?
Why doesn't every application use asynchronous writes?
It's often necessary/useful/easier to deal with a write failure in a synchronous manner.
I'm not sure a thread pool is even necessary. I would consider using a dedicated database-writer thread which does all the writing and error handling for you. Something like:
public class AsyncDatabaseWriter implements Runnable {
    // Data is the caller's value type for one pending write
    private final LinkedBlockingQueue<Data> queue = new LinkedBlockingQueue<Data>();
    private volatile boolean terminate = false;

    public void run() {
        while (!terminate) {
            try {
                Data data = queue.take(); // blocks until a write is scheduled
                // write to database
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and exit
                return;
            }
        }
    }

    public void scheduleWrite(Data data) {
        queue.add(data);
    }
}
I personally fancy the style of using a Proxy for threading out operations which might take a long time. I'm not saying this approach is better than using executors in any way; I'm just adding it as an alternative.
The idea is not bad at all. Actually, I tried it just yesterday because I needed to create a copy of an online database which has 5 different categories with about 60,000 items each.
By moving the parse/save operation of each category into parallel tasks, and partitioning each category's import into smaller batches run in parallel, I reduced the total import time from several hours (estimated) to 26 minutes. Along the way I found a good piece of code for splitting a collection: http://www.vogella.de/articles/JavaAlgorithmsPartitionCollection/article.html
I used ThreadPoolTaskExecutor to run the tasks. Your tasks are just simple implementations of the Callable interface.
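A rough sketch of that setup; the Item type, the partition() helper (in the spirit of the article above), and parseAndSave() are assumptions:

ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(5);
executor.setMaxPoolSize(10);
executor.initialize();

// one Callable per batch; each parses and saves its own slice of items
for (final List<Item> batch : partition(allItems, 1000)) {
    Future<Integer> saved = executor.submit(new Callable<Integer>() {
        public Integer call() {
            return parseAndSave(batch); // returns e.g. the number of saved items
        }
    });
}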
Why doesn't every application use asynchronous writes? - erm, because every application does a different thing.
can you believe some applications don't even use a database OMG!!!!!!!!!
Seriously though, given that you don't say what your failure strategies are - it sounds like it could be reasonable. What happens if the write fails, or the DB goes away somehow?
Some databases - like Sybase - have (or at least had) a thing where they really don't like multiple writers to a single table - all the writers end up blocking each other - so maybe it won't actually make much difference...
I'm playing around with the GPars library while working to improve the scalability of a matching system. I'd like to be able to query the database and immediately query it again while the results are being processed concurrently. The bottleneck is reading from the database, so I would like to keep the database busy full-time while processing the results asynchronously as they become available. I realise I may have some fundamental misunderstandings of how the actor framework works, and I'd be happy to be corrected!
In pseudo code I'm trying to do the following:
Define two actors: one for running selects against the database and another for processing the records.
queryActor queries the database and sends the results to processorActor
queryActor immediately queries the database again without waiting for processorActor to finish
I could probably achieve the simple use case without using actors, but my end goal is to have an actor pool that is always working on new queries, with potentially different datasources, in order to increase the throughput of the system in general.
The processing actor will always be much faster than the database query, so I would like to query multiple replicas concurrently in the future.
def processor = actor {
    loop {
        react { querySet ->
            println "processing recordset"
            if (querySet instanceof Object[]) {
                MatcherDataRowProcessor matcher =
                        new MatcherDataRowProcessor(matchedRecords, matchedRecordSet);
                matchedRecords = matcher.processRecordset(querySet);
                reply matchedRecords
            } else {
                println 'processor fed nothing, halting processor actor'
                stop()
            }
        }
    }
}
def dbqueryer = actor {
    println "dbqueryer has started"
    while (batchNum.longValue() <= loopLimiter) {
        println "hitting db"
        Object[] querySet
        def thisRuleBatch = new MatchRuleBatch(targetuidFrom, targetuidTo)
        thisRuleBatch.targetuidFrom = batchNum * perBatch - perBatch
        thisRuleBatch.targetuidTo = thisRuleBatch.targetuidFrom + perBatch
        thisRuleBatch.targetName = targetName
        thisRuleBatch.whereClause = whereClause
        querySet = dao.getRecordSet(thisRuleBatch)
        processor.send querySet
        batchNum++
    }
    react { processedRecords ->
        processor.send false
    }
}
I would suggest taking a look at Dataflow Queues in the Dataflow Concurrency section of the GPars user guide. You may find that dataflows provide a better/cleaner abstraction for your problem at hand. Dataflows can also be used in conjunction with actors.
I think either actors or dataflows would work in this situation, and I feel that the decision comes down to which one provides the abstraction that more closely matches what you are trying to accomplish. For me, the terminology of tasks, queues, and dataflows seems the closer fit.
After some more research I have found that the Dataflow concurrency support in GPars is actually built on top of the actor support. The DataflowOperatorTest in the GPars Java demo distribution (I need to do a Java implementation) seems to be a good match for what I need to do: the main thread waits for multiple stream inputs to be populated, which in my case are the parallel database queries.