How to parallelize multiple requests to MongoDB? - java

I am using a single standalone MongoDB server with no special topology such as replication or sharding. Currently I have an issue where MongoDB does not support more than 500 parallel requests. Note that I am using only one instance of MongoClient, and the remaining threads are used for inserts. I am using the Java executor framework to create the threads, and these threads are used to insert data into a collection [all inserts go into the same collection].

You should queue the requests before you issue them towards the database. There is no benefit in sending 500 requests to your database in parallel. Remember that each request comes with costs in terms of memory, locking, and so on. You are actually wasting resources by asking too much of your database at once - and I mean this request-wise, not data-wise.
So use a queue (or more than one) and pool up the requests. From that pool you feed your worker threads (say 5 or 10 are enough), and that's it.
Take a look at the Future interface in the java.util.concurrent package. Asynchronous processing looks like the approach with the highest throughput and the lowest resource impact here.
But check the MongoDB driver first. I would not be surprised if it already implements this. If that is the case, you just have to limit yourself by using a queue so that only, say, 10 or 100 requests at once are handled by the database driver. Do some performance checks, tweaking the number of requests actually sent to the database.
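As a rough illustration, here is a minimal sketch of the bounded-queue idea using the synchronous MongoDB Java driver. The connection string, database/collection names, worker count, and queue capacity are placeholders to tune:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedInserts {
        public static void main(String[] args) throws InterruptedException {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("events");

            // 10 workers and a bounded queue of 1000 pending inserts; when the
            // queue is full, CallerRunsPolicy makes the submitting thread do the
            // insert itself, throttling producers instead of piling up 500
            // parallel requests against the server.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    10, 10, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(1000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100_000; i++) {
                final int n = i;
                pool.execute(() -> coll.insertOne(new Document("n", n)));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            client.close();
        }
    }

The CallerRunsPolicy is what provides the back-pressure: once 1,000 inserts are pending, the submitting thread performs the insert itself rather than opening yet another parallel request.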

Related

How to effectively process a lot of objects in a list on the server side

I have a List which contains a lot of objects.
The problem is that I have to process these objects (processing includes cloning, deep copying, making DB calls, running business logic, etc.).
Doing this in the normal first-come, first-served fashion is really time consuming, and in a web application it generally results in transaction timeouts on the server side (this processing is async from the client's perspective).
How do I process these objects so as to take minimal time and not overload the DB?
I'm using Java 7 in a server environment.
I'm already using a messaging solution, RabbitMQ, which gets me the item and its quantity. The problem occurs when I try to deep copy items to mimic real items (per the business logic, every item should be uniquely processed) and save them to the DB.
After some discussions, the viable solution is to use an ABQ (ArrayBlockingQueue) processed by a pool of threads; a sketch of this setup follows the list below.
The expected benefits are:
1) We won't have to manage third-party queues such as RabbitMQ.
2) At any point in time the blocking queue won't hold all the items to be processed, since the consumer threads will be processing them simultaneously, so it leaves a smaller memory footprint.
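A minimal sketch of that setup; Item, process(), the queue capacity, and the pool size are all placeholders for the real ones:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AbqPipeline {
        // Bounded queue: producers block when it is full, capping memory use.
        private static final BlockingQueue<Item> QUEUE = new ArrayBlockingQueue<>(500);

        public static void main(String[] args) {
            ExecutorService consumers = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                consumers.execute(() -> {
                    try {
                        while (!Thread.currentThread().isInterrupted()) {
                            Item item = QUEUE.take();  // blocks until work arrives
                            process(item);             // deep copy, business logic, DB save
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // exit cleanly on shutdown
                    }
                });
            }
            // The producer side (e.g. the RabbitMQ listener) calls QUEUE.put(item),
            // which blocks whenever 500 items are already pending.
        }

        static class Item { /* fields omitted */ }

        static void process(Item item) { /* placeholder for the real work */ }
    }

The bounded capacity is what keeps the memory footprint small: put() blocks the producer whenever the consumers fall behind, so the queue never holds the whole list at once.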
@cody123 I'm using Spring Batch for the retry mechanism in this case.
After another round of profiling I found that the bottleneck was the DB connection pool having a low maximum number of connections.
I deduced this by running the same transaction without the DB thread pool; it went perfectly well and completed without any exception.
However, combining the previous approach, i.e. managing an ABQ and light commits, with an HA DB will be the best solution.

Publishing to KDB from multiple threads

We have an application with multiple threads which reuse one KDB connection.
From a performance perspective, would it be better to open multiple connections to a multithreaded KDB instance to speed up the process? We are also interested in whether there is any potential downside to publishing from multiple threads over a single connection: we have a Java app and use the exxeleron Java library.
Aside from the fact that a single socket connection to KDB isn't very resource hungry by itself, in the end I think you'll find that disk seeks and memory allocation are by far the largest bottlenecks, not how many connections you have to a database. That said, since you ask...
Let's go on simple assumptions:
The KDB database is a historical database. The multithreading options on that side are a negative port number and -s, which can't be set simultaneously
You have a single process, let's call it A, that accesses it
With a negative port number you get a multi-threaded input queue. So if A has the ability to issue multiple queries, they can be dispatched simultaneously and KDB+ won't block on each call. However, A would somehow need to be able to identify the incoming stream of results as the responses to particular queries. You can shape each query as (<queryId>;<actualQuery>) and parse the first element for identification, I suppose. However, in this use case it sounds like you should have multiple A's. A sketch of the id-tagging idea follows below.
With -s you get multi-threaded queries, so your q queries have to be written as such (sometimes you get it for free, though, like querying across partitions). You'll block on every call, so there is no real advantage in having multiple A's.
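For what it's worth, here is a sketch of that id-tagging idea in Java. The KdbConnection interface below is hypothetical - the exxeleron library has its own API - but the correlation pattern (a map from query id to a pending future) applies to any multiplexed connection:

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical stand-in for the real connection API.
    interface KdbConnection {
        void sendAsync(Object[] idTaggedQuery);
        void onMessage(java.util.function.Consumer<Object[]> handler);
    }

    public class KdbMultiplexer {
        private final KdbConnection conn;
        private final AtomicLong nextId = new AtomicLong();
        private final Map<Long, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

        public KdbMultiplexer(KdbConnection conn) {
            this.conn = conn;
            // Each response comes back as (queryId; result); complete the matching future.
            conn.onMessage(msg -> {
                CompletableFuture<Object> f = pending.remove((Long) msg[0]);
                if (f != null) f.complete(msg[1]);
            });
        }

        // Many threads can call this concurrently; each query is tagged with an
        // id so the single shared connection can carry interleaved responses.
        public CompletableFuture<Object> query(String q) {
            long id = nextId.incrementAndGet();
            CompletableFuture<Object> f = new CompletableFuture<>();
            pending.put(id, f);
            conn.sendAsync(new Object[] { id, q });
            return f;
        }
    }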

Concurrent multithreaded bulk data inserts/updates to MySQL

Multiple instances of my multi-threaded (approx. 10 threads) application run on different machines (approx. 10 machines), so overall about 100 threads of this application are active simultaneously.
Each of these threads produces 4 output sets, each set containing 1k-5k rows. Each set is pushed to a single MySQL machine, into the same DB and the same table for that set (insert or update operations), so there are 4 tables consuming the 4 sets produced by each thread.
I am using MyBatis as the ORM. These threads may spend more time writing output to the DB than processing the requests.
How can I optimize the database writes in this case?
1. Use MyBatis batch processing.
2. Write data to files which are picked up by a single consumer thread and written into the DB?
3. Write each data set to a different file and use 4 consumer threads, so that data from the same set is pushed to the same table and locking is minimized?
Please suggest other, better ways if possible.
Databases are made to handle concurrency. I'm not sure what exactly MyBatis brings into the picture (I'm not a huge fan of ORMs in general), but if using it makes you start thinking about hacks like intermediate files and single-threaded updates, you are probably much better off ripping it out and writing to the DB with plain JDBC, which should have no problem handling your use case, provided you batch your updates adequately.
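For example, a plain-JDBC batched upsert might look like this. The table and column names are made up; rewriteBatchedStatements is a MySQL Connector/J option that collapses the batch into multi-row statements:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchWriter {
        // Rows are queued with addBatch() and sent in one round trip per
        // executeBatch() call, inside a single transaction.
        public static void writeRows(List<Object[]> rows) throws Exception {
            String url = "jdbc:mysql://dbhost:3306/mydb?rewriteBatchedStatements=true";
            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
                conn.setAutoCommit(false);
                String sql = "INSERT INTO output_set1 (col_a, col_b) VALUES (?, ?) "
                           + "ON DUPLICATE KEY UPDATE col_b = VALUES(col_b)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    for (Object[] row : rows) {
                        ps.setObject(1, row[0]);
                        ps.setObject(2, row[1]);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
                conn.commit();
            }
        }
    }

ON DUPLICATE KEY UPDATE covers the "insert or update" case in one statement, which avoids a separate read-then-write round trip per row.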

Connection Pools Size vs. Number of concurrent requests

I have to develop a highly scalable web service, but the connection pool size (Oracle DB) is set to 50.
Does having this size mean that the number of concurrent requests served will be at most 50, since otherwise no new connections will be available?
Is it possible by configuration for the WebLogic or GlassFish server to accept more than 50 requests simultaneously?
I read that the server accepts the requests, which are 'queued' until a thread handles them.
Regarding 'scalability' I have a question mark as well, because the average DB call takes 1.2 sec; adding the SOAP overhead results in a 2-3 sec response time on each call.
Can I estimate how many concurrent users the server (WebLogic or GlassFish, 4 GB) will support?
Thank you
Having a maximum of 50 connections in the pool doesn't mean you can only handle 50 users at any one time. Each page request should generate queries that can interleave with each other: while only 50 queries can run at any one time, you should be able to handle many more page requests. This is helped by making sure you only connect to the database for short periods.
The use of connection pools is primarily to avoid the cost of setting up new connections all the time (plus prepared statements are cached, etc.), so the intention is to re-use them as frequently as possible.
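To illustrate the borrow-briefly pattern, here is a sketch using HikariCP as the pool; in WebLogic or GlassFish you would look up the container-managed DataSource via JNDI instead, but the usage pattern is the same. The URL, credentials, table, and columns are placeholders:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ShortLivedConnections {
        private static final HikariDataSource DS = createPool();

        private static HikariDataSource createPool() {
            HikariConfig cfg = new HikariConfig();
            cfg.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/ORCL"); // placeholder URL
            cfg.setUsername("app");
            cfg.setPassword("secret");
            cfg.setMaximumPoolSize(50);      // the pool limit from the question
            cfg.setConnectionTimeout(5_000); // fail fast instead of queuing forever
            return new HikariDataSource(cfg);
        }

        // Borrow a connection only for the duration of the query; try-with-resources
        // returns it to the pool immediately, so 50 connections can serve far more
        // than 50 concurrent requests.
        public static String lookupName(long id) throws Exception {
            try (Connection conn = DS.getConnection();
                 PreparedStatement ps =
                         conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }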
When you say the average DB call takes 1.2 secs: if this is a single query, I think you want to look at the query or the table indexes to reduce this time (otherwise I'm afraid you are going to get scalability problems no matter what), but if it is multiple queries then they should interleave with other requests quite happily.
As regards queuing: WebLogic will queue queries, but you can set a timeout so that a query is returned unfulfilled after a set time. You can then decide to try again or tell the user the system is busy and perhaps to try again later.
When you are talking about a web service, you need to keep an optimum balance between your connection pool and concurrent requests. For the concept you can refer to: https://dzone.com/articles/optimum-database-connection-pool-size

Querying over 1,000,000 records using the Salesforce Java API and looking for the best approach

I am developing a Java application which will query tables that may hold over 1,000,000 records. I have tried everything I could to be as efficient as possible, but I am only able to achieve, on average, about 5,000 records a minute, and a maximum of 10,000 at one point. I have tried reverse engineering the Data Loader, and my code seems very similar, but still no luck.
Is threading a viable solution here? I have tried this, but with very minimal results.
I have been reading and have applied everything possible, it seems (compressing requests/responses, threads, etc.), but I cannot achieve Data Loader-like speeds.
Of note, the queryMore method seems to be the bottleneck.
Does anyone have any code samples or experiences they can share to steer me in the right direction?
Thanks
An approach I've used in the past is to query just for the IDs that you want (which makes the queries significantly faster). You can then parallelize the retrieve() calls across several threads.
That looks something like this:
[query thread] -> BlockingQueue -> [thread pool doing retrieve()] -> BlockingQueue
The first thread does query() and queryMore() as fast as it can, writing all the ids it gets into the BlockingQueue. queryMore() isn't something you should call concurrently, as far as I know, so there's no way to parallelize that step. You may wish to package the ids into bundles of a few hundred to reduce lock contention if that becomes an issue. A thread pool can then do concurrent retrieve() calls on the ids to get all the fields for the SObjects and put them in a queue for the rest of your app to deal with.
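Here is a skeleton of that pipeline. queryIds() and retrieveByIds() are placeholders for wrappers around the real query()/queryMore() and retrieve() calls of whatever Salesforce client you use:

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class IdPipeline {
        static final BlockingQueue<List<String>> ID_BUNDLES = new ArrayBlockingQueue<>(100);
        static final BlockingQueue<Object> RESULTS = new ArrayBlockingQueue<>(1000);

        public static void main(String[] args) {
            // Single producer: pages through query()/queryMore() and emits id bundles.
            Thread producer = new Thread(() -> {
                for (List<String> bundle : queryIds("SELECT Id FROM Account", 200)) {
                    try { ID_BUNDLES.put(bundle); } catch (InterruptedException e) { return; }
                }
            });
            producer.start();

            // Consumers: concurrent retrieve() calls, fanned out across a pool.
            ExecutorService pool = Executors.newFixedThreadPool(10);
            for (int i = 0; i < 10; i++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            List<String> ids = ID_BUNDLES.take();
                            for (Object rec : retrieveByIds(ids)) RESULTS.put(rec);
                        }
                    } catch (InterruptedException e) { /* shutdown */ }
                });
            }
        }

        static Iterable<List<String>> queryIds(String soql, int bundleSize) {
            throw new UnsupportedOperationException("wrap query()/queryMore() here");
        }

        static List<Object> retrieveByIds(List<String> ids) {
            throw new UnsupportedOperationException("wrap retrieve() here");
        }
    }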
I wrote a Java library for using the SF API that may be useful. http://blog.teamlazerbeez.com/2011/03/03/a-new-java-salesforce-api-library/
With the Salesforce API, the batch size limit is what can really slow you down. When you use the query/queryMore methods, the maximum batch size is 2,000. However, even though you may specify 2,000 as the batch size in your SOAP header, Salesforce may send smaller batches in response. Its batch size decision is based on server activity as well as on the output of your original query.
I have noticed that if I submit a query that includes any "text" fields, the batch size is limited to 50.
My suggestion would be to make sure your queries pull only the data that you need. I know a lot of Salesforce tables end up with a lot of custom fields that may not be needed for every integration.
See the Salesforce documentation on this subject.
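If it helps, requesting the 2,000-record maximum via the QueryOptions header with the WSC partner API looks roughly like this (connection setup is omitted; Salesforce may still return smaller batches regardless):

    import com.sforce.soap.partner.PartnerConnection;
    import com.sforce.soap.partner.QueryResult;
    import com.sforce.soap.partner.sobject.SObject;

    public class BatchedQuery {
        public static void dumpIds(PartnerConnection connection) throws Exception {
            // Ask for the maximum batch size per query/queryMore round trip.
            connection.setQueryOptions(2000);

            QueryResult qr = connection.query("SELECT Id FROM Account");
            while (true) {
                for (SObject rec : qr.getRecords()) {
                    System.out.println(rec.getId());
                }
                if (qr.isDone()) break;
                qr = connection.queryMore(qr.getQueryLocator());
            }
        }
    }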
We have about 14,000 records in our Accounts object and it takes quite some time to get all the records. I perform a query which takes about a minute, but SF returns batches of no more than 500, even though I set the batch size to 2,000. Each queryMore operation also takes from 45 seconds to a minute. This limitation is quite frustrating when you need to get bulk data.
Make use of the Bulk API to query any number of records from Java. I'm making use of it, and it performs very effectively; you get results in seconds. The returned String is comma separated. You can even keep the batches at 10k records or less and get the records either as CSV (using opencsv) or directly as a String.
Let me know if you require help with the code.
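As a starting point, here is a rough sketch loosely based on the samples in the Bulk API developer guide, using the WSC async client. The API version, session handling, and SOQL are placeholders, so verify the details against the guide:

    import com.sforce.async.BatchInfo;
    import com.sforce.async.BatchStateEnum;
    import com.sforce.async.BulkConnection;
    import com.sforce.async.ContentType;
    import com.sforce.async.JobInfo;
    import com.sforce.async.OperationEnum;
    import com.sforce.async.QueryResultList;
    import com.sforce.ws.ConnectorConfig;

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    public class BulkQuery {
        public static void run(String sessionId, String instanceUrl) throws Exception {
            ConnectorConfig config = new ConnectorConfig();
            config.setSessionId(sessionId);
            config.setRestEndpoint(instanceUrl + "/services/async/34.0"); // version is a placeholder
            BulkConnection bulk = new BulkConnection(config);

            // Create a CSV query job and submit the SOQL as a single batch.
            JobInfo job = new JobInfo();
            job.setObject("Account");
            job.setOperation(OperationEnum.query);
            job.setContentType(ContentType.CSV);
            job = bulk.createJob(job);

            String soql = "SELECT Id, Name FROM Account";
            BatchInfo batch = bulk.createBatchFromStream(job,
                    new ByteArrayInputStream(soql.getBytes(StandardCharsets.UTF_8)));

            // Poll until the batch finishes, then stream the CSV result sets.
            while (true) {
                batch = bulk.getBatchInfo(job.getId(), batch.getId());
                if (batch.getState() == BatchStateEnum.Completed) break;
                if (batch.getState() == BatchStateEnum.Failed)
                    throw new IllegalStateException(batch.getStateMessage());
                Thread.sleep(5_000);
            }
            QueryResultList results = bulk.getQueryResultList(job.getId(), batch.getId());
            for (String resultId : results.getResult()) {
                try (InputStream csv =
                        bulk.getQueryResultStream(job.getId(), batch.getId(), resultId)) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = csv.read(buf)) != -1) {
                        System.out.write(buf, 0, n); // hand this to opencsv in real code
                    }
                }
            }
            bulk.closeJob(job.getId());
        }
    }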
Latency is going to be a killer in this type of situation, and the solution will be either multithreading or asynchronous operations (using NIO). I would start by running 10 worker threads in parallel and see what difference it makes (assuming the back end supports simultaneous gets).
I don't have any concrete code I can provide here, sorry - just painful experience with API calls going over high-latency networks.
