We have an application with multiple threads that share a single KDB connection.
From a performance perspective, would it be worthwhile to open multiple connections to a multithreaded KDB instance to speed up the process? We are also interested in whether there are any potential downsides to publishing from multiple threads over a single connection: we have a Java app and use the exxeleron Java library.
Aside from the fact that a single socket connection to KDB isn't very resource hungry by itself, in the end I think you'll find that disk seeks and memory allocation are by far the largest bottlenecks, not how many connections you have to a database. That said, since you ask...
Let's go with some simple assumptions:
The KDB database is a historical database. The multithreading options on that side are a negative port number and -s, which can't be set simultaneously.
You have a single process, let's call it A, that accesses it
With a negative port number, you get a multi-threaded input queue. So if A can issue multiple queries, they can be dispatched simultaneously and KDB+ won't block on each call. However, A would somehow need to be able to identify the incoming stream of results as the responses to particular queries. You could query it like (<queryId>;<actualQuery>) and parse the first element for identification, I suppose. However, in this use case it sounds like you should have multiple A's.
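As a rough sketch of that tagging idea (the QLink interface below is hypothetical; substitute whatever send/receive calls your q client library, e.g. exxeleron qJava, actually exposes):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the (queryId; actualQuery) tagging idea. The QLink interface
// below is hypothetical; replace it with the send/receive calls of your actual q client.
public class TaggedQueryDispatcher {

    /** Hypothetical stand-in for an async-capable q connection. */
    public interface QLink {
        void sendAsync(Object[] taggedQuery);   // sends (queryId; actualQuery)
        Object[] receive();                     // returns (queryId; result)
    }

    private final QLink link;
    private final Map<String, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

    public TaggedQueryDispatcher(QLink link) {
        this.link = link;
    }

    /** Dispatch a query tagged with a unique id; the future completes when its result arrives. */
    public CompletableFuture<Object> query(String q) {
        String id = UUID.randomUUID().toString();
        CompletableFuture<Object> future = new CompletableFuture<>();
        pending.put(id, future);
        link.sendAsync(new Object[] { id, q });   // (<queryId>; <actualQuery>)
        return future;
    }

    /** Run on a single reader thread: match each incoming response to its query id. */
    public void readLoop() {
        while (true) {
            Object[] response = link.receive();   // expected shape: (queryId; result)
            String id = (String) response[0];
            CompletableFuture<Object> future = pending.remove(id);
            if (future != null) {
                future.complete(response[1]);
            }
        }
    }
}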
With -s you get multi-threaded queries, so your q queries have to be written as such (sometimes you get it for free though, e.g. when querying across partitions). You'll block on every call, so there's no real advantage in having multiple A's.
At our company we have a server which is distributed across a few instances. The server handles user requests. Requests from different users can be processed in parallel. Requests from the same user must be executed strictly sequentially, but they can arrive at different instances due to load balancing. Currently we use Redis-based distributed locks, but this is error-prone and requires more work on concurrency than on business logic.
What I want is something like this (more like a concept):
Distinct queue for each user
Queue is named after user id
Each request is identified by a request id
Imagine two requests from the same user arriving at two different instances concurrently:
Each instance puts its request id into that user's queue.
Additionally, both instances store their request ids locally.
Then some broker takes a request id from the top of "some_user_queue" and moves it into "some_user_queue_processing".
Both instances listen to "some_user_queue_processing". They peek into it and check whether it is the request id they stored locally. If yes, they do the processing. If not, they ignore it and wait.
When the work is done, the server deletes this id from "some_user_queue_processing".
Then step 3 again.
And all of this happens concurrently for a lot of different users (thousands of them) and their queues.
Now, I know this sounds a lot like actors, but:
We need a solution requiring as few changes as possible, so we can transition quickly away from locks. Akka would force us to rewrite almost everything from scratch.
We need a production-ready solution. Quasar sounds good, but it is not production ready yet (more precisely, its Galaxy cluster isn't).
Management at my company is very conservative; they simply don't want another dependency which we'll need to support. But we already use Redis (for distributed locks), so I thought maybe it could help with this too.
Thanks
The best solution that matches the description of your problem is Redis Cluster.
Basically, the cluster solves your concurrency problem in the following way:
Two (or more) requests from the same user will always go to the same instance, assuming that you use the user id as the key and the request as the value. The value should actually be a list of requests: when you receive one, you append it to that list. In other words, that is your queue of requests (a single one for every user).
That mapping is made possible by the design of the cluster implementation: it is based on a range of hash slots spread over all the instances.
When a set command is executed, the cluster hashes the key, which yields a value (the hash slot we are going to write to) that is located on a specific instance. The cluster finds the instance that owns the right range and then performs the write.
Also, when a get is performed, the cluster does the same procedure: it finds the instance that contains the key, and then it gets the value.
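For illustration, a minimal sketch of that per-user list-as-queue idea using the Jedis client (the node address and the "queue:<userId>" key naming are assumptions, and constructor overloads vary between Jedis versions):

import java.util.Collections;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

// Minimal sketch: one Redis list per user acts as that user's request queue.
// All operations on a given key hash to the same slot, hence the same cluster node.
public class UserRequestQueue {

    private final JedisCluster cluster =
            new JedisCluster(Collections.singleton(new HostAndPort("127.0.0.1", 7000)));

    private String queueKey(String userId) {
        return "queue:" + userId;   // assumed key naming convention
    }

    /** Append a request to the tail of the user's queue (any app instance may call this). */
    public void enqueue(String userId, String requestId) {
        cluster.rpush(queueKey(userId), requestId);
    }

    /** Pop the next request from the head of the user's queue, or null if it is empty. */
    public String dequeue(String userId) {
        return cluster.lpop(queueKey(userId));
    }
}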
The transition from locks is very easy to perform because you only need to have the instances ready (with the cluster-enabled directive set to "yes") and then run the create command from the redis-trib.rb script.
I worked with the cluster in a production environment last summer and it behaved very well.
I have a List which contains a lot of objects.
The problem is that I have to process these objects (processing includes cloning, deep copying, making DB calls, running business logic, etc.).
Doing this in a normal first-come-first-served fashion is really time consuming, and in a web application this generally results in transaction timeouts on the server side (as this processing is async from the client's perspective).
How do I process those objects so as to take minimal time and not overload the DB?
I'm using Java 7 in a server environment.
I'm already using a messaging solution, RabbitMQ, which gets me the item and its quantity. The problem occurs when I try to deep copy items to mimic real items (per the business logic, every item should be processed uniquely) and save them to the DB.
After some discussions, the viable solution is using an ABQ (ArrayBlockingQueue) which is processed by a pool of threads; see the sketch after the list below.
The expected benefits are as follows:
1) We won't have to manage third-party queues, e.g. RabbitMQ.
2) At any point in time the blocking queue won't hold all the items to be processed, since the consumer threads will be processing them concurrently, so it leaves a smaller memory footprint.
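A minimal sketch of that setup, with a placeholder Item type and processing method standing in for the real business objects and logic:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the ABQ approach: a bounded queue fed by the RabbitMQ consumer and
// drained by a small pool of worker threads. Item and process() are placeholders.
public class AbqProcessor {

    static class Item { /* fields omitted */ }

    private final BlockingQueue<Item> queue = new ArrayBlockingQueue<>(1_000); // bounds the memory footprint
    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    public void start() {
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Item item = queue.take();   // blocks until an item is available
                        process(item);              // deep copy, business logic, DB save
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    /** Called from the RabbitMQ consumer; blocks when the queue is full (back-pressure). */
    public void submit(Item item) throws InterruptedException {
        queue.put(item);
    }

    private void process(Item item) {
        // clone / deep copy, run the business logic, persist to the DB
    }
}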
@cody123 I'm using Spring Batch for retry mechanisms in this case.
After another round of profiling I found that the bottleneck was the DB connection pool having a low maximum number of connections.
I deduced this by running the same transaction without the DB thread pool; it went perfectly well and completed without any exception.
However, combining the previous approach, i.e. managing an ABQ and light commits, with an HA DB will be the best solution.
MongoDB introduced Bulk() in version 2.6. I checked the APIs, and it seems great to me.
Before this API, if I needed to do a bulk insert, I had to store documents in a List, then use insert() to insert the whole List. In a multi-threaded environment, concurrency also had to be considered.
Is there a queue/buffer implemented inside the bulk API? Each time I put something into the bulk before execute(), the data is stored in the queue/buffer, is that right?
Thus, I don't need to write my own queue/buffer, just use Bulk.insert() or Bulk.find().update(), is that right?
Could someone tell me more about the queue? Do I still need to be concerned about concurrency issues?
Since a Bulk is created like db.collection.initializeUnorderedBulkOp(), if a bulk instance is not released, will it stay connected to the MongoDB server?
As for the basic idea of "do you need to store your own list?": not really, but I suppose it all depends on what you are doing.
For a basic idea of the internals of what is happening under the Bulk Operations API the best place to look is at the individual command forms for each type of operation. So the relevant manual section is here.
So you can think of the "Bulk" interface as a list or collection of all of the operations that you add to it. You can add to it pretty much as much as you wish (within certain memory and practical constraints), and consider the "drain" method for this "queue" to be the .execute() method.
As noted in the documentation there, regardless of how many operations you "queue", this will only actually send to the server in groups of at most 1000 operations at a time. The other thing to keep in mind is that there is no governance making sure those 1000 operation requests actually fit under the 16MB BSON limit. That is still a hard limit with MongoDB, and you can only effectively form one "request" at a time totalling less than that data limit when sending to the server.
So generally speaking, it is often more practical to make your own "execute/drain" requests to the server once per every 1000 or fewer entries or so. Mileage may vary on this, but there are some considerations to make here.
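For illustration, here is a sketch of that drain-every-1000 pattern using the legacy Java driver's bulk API (the driver generation that matched server 2.6); the host, database and collection names are placeholders, and newer drivers expose bulkWrite() instead:

import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Sketch: queue inserts client side and drain them in chunks of 1000 per execute().
public class BulkLoader {

    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection collection = client.getDB("test").getCollection("items");

        BulkWriteOperation bulk = collection.initializeUnorderedBulkOperation();
        int queued = 0;

        for (int i = 0; i < 10_000; i++) {
            bulk.insert(new BasicDBObject("seq", i));          // queued on the client side only
            queued++;

            if (queued == 1_000) {                             // drain in manageable chunks
                BulkWriteResult result = bulk.execute();       // one round trip to the server
                System.out.println("inserted: " + result.getInsertedCount());
                bulk = collection.initializeUnorderedBulkOperation();  // re-initialize to queue more
                queued = 0;
            }
        }
        if (queued > 0) {
            bulk.execute();                                    // flush the remainder
        }
        client.close();
    }
}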
With respect to "Ordered" versus "UnOrdered" operation requests: in the former case, all queued operations will be aborted in the event of an error being generated in the batch sent, meaning of course all operations occurring after the error is encountered.
In the latter case, for "UnOrdered" operations, no fatal error is reported; rather, the WriteResult that is returned contains a "list" of any errors that are encountered. In addition, "UnOrdered" means the operations are not necessarily "applied" in any particular order, which means you cannot "queue" operations that rely on something else in the "queue" being processed before that operation is applied.
So there is the concern of how large a WriteResult you are going to get and indeed how you handle that response in your application. As stated earlier, mileage may vary as to whether you prefer one very large response or smaller, more manageable responses.
As far as concurrency is concerned, there is really one thing to consider here. Even though you are sending many instructions to the server in a single call and not waiting for individual transfers and acknowledgements, it is still only really processing one instruction at a time. The operations are either ordered, as implied by the initialize method, or "un-ordered" where that is chosen, in which case the operations can run in "parallel", as it were, on the server until the batch is drained.
But there is no "lock" until the "batch" completes, so it is not a substitute for a "transaction"; don't make that mistake as a design point. The same MongoDB rules apply, but the benefit here is "one write to the server" and "one response back", rather than one for each operation.
Finally, as to whether there is some "server connection" held here by the API, the answer is no, there is not. As pointed out by the initial suggestion of looking at the command internals, this "queue" building is purely "client side only". There is no communication with the server in any way until the .execute() method is called. This is "by design" and actually half the point, as mainly we don't want to be sending data to the server each time you add an operation. It is done all at once.
So "Bulk Operations" are a "client side queue". Everything is stored within the client side until the .execute() "drains" the queue and sends the operations to the server all at once. A response is then given from the server containing all of the results from the operations sent that you can handle however you wish.
Also, once .execute() is called, no more operations can be "queued" to the bulk object, and neither can .execute() be called again. Depending on the implementation, you can do some further examination of the "Bulk" object and results. But in the general case, where you need to send more "bulk" operations, you re-initialize and start again, just as you would with most queue systems.
Summing up:
Yes. The object effectively "queues" operations.
You don't need your own lists. The methods are "list builders" in themselves
Operations are either "Ordered" or "Un-Ordered" as far as sequence, but all operations are individually processed by the server as per normal MongoDB rules. No transactions.
The "initialize" commands do not talk to the server directly and do not "hold connections" in themselves. The only method that actually "talks" to the server is .execute()
So it is a really good tool. You get much better write operations than you do from the legacy command implementations. But do not expect it to offer functionality outside of what MongoDB basically does.
I am using a single standalone MongoDB server with no special topology like replication or sharding. Currently I have an issue where MongoDB does not support more than 500 parallel requests. Note that I am using only one instance of MongoClient and the remaining threads are used for inserts. I am using the Java executor framework to create the threads, and these threads are used to insert data into a collection [all inserts go into the same collection].
You should queue the requests before you issue them towards the database. There is no use requesting 500 things from your database in parallel. Remember that a single request comes with some costs, memory-wise, locking-wise and so on. You are actually wasting resources by asking too much of your database at once - remember, I mean this request-wise, not data-wise.
So use a queue (or more) and pool up the requests. From that pool you feed your worker threads (let's say 5 or 10 are enough) and that's it.
Take a look at the Future interface in Java's concurrent package. Using asynchronous processing here looks like the approach with the highest throughput and the lowest resource impact.
But check the MongoDB driver first. I would not be surprised if they have already implemented it this way. If that is the case, you just have to limit yourself by using a queue so that only, let's say, 10 or 100 requests at a time are being handled by the database driver. Do some performance checks, tweaking the number of actual requests sent to the database.
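As a rough illustration of that bounded-pool idea (the pool size, host and collection names are placeholders to tune against your own measurements):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Sketch: one shared MongoClient, a small fixed pool of workers, and Futures to
// collect results, so only a handful of requests hit the database concurrently.
public class BoundedInsertPool {

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection collection = client.getDB("test").getCollection("events");

        ExecutorService pool = Executors.newFixedThreadPool(10);   // not 500 threads
        List<Future<?>> results = new ArrayList<>();

        for (int i = 0; i < 100_000; i++) {
            final int seq = i;
            results.add(pool.submit(() -> collection.insert(new BasicDBObject("seq", seq))));
        }
        for (Future<?> f : results) {
            f.get();   // propagate any insert failure
        }
        pool.shutdown();
        client.close();
    }
}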
Our company has a batch application which runs every day. It mostly does database-related jobs, for example importing data into database tables from files.
There are 20+ tasks defined in that application; each one may or may not depend on other ones.
The application executes the tasks one by one; the whole application runs in a single thread.
It takes 3~7 hours to finish all the tasks. I think that's too long, so maybe I can improve performance with multi-threading.
I think that, as there are dependencies between tasks, it is not good (or not easy) to make the tasks run in parallel, but maybe I can use multi-threading to improve performance inside a task.
For example: we have a task called "ImportBizData", which copies data into a database table from a data file (usually containing 1,000,000+ rows). I wonder whether it is worth using multi-threading for that?
As I know only a little about multi-threading, I hope someone can provide some tutorial links on this topic.
Multi-threading will improve your performance but there are a couple of things you need to know:
Each thread needs its own JDBC connection. Connections can't be shared between threads because each connection is also a transaction.
Upload the data in chunks and commit once in a while to avoid accumulating huge rollback/undo tables.
Cut tasks into several work units where each unit does one job.
To elaborate on the last point: currently, you have a task that reads a file, parses it, opens a JDBC connection, does some calculations, sends the data to the database, etc.
What you should do:
One (!) thread to read the file and create "jobs" out of it. Each job should contain a small, but not too small, "unit of work". Push those into a queue.
The next thread(s) wait(s) for jobs in the queue and do the calculations. This can happen while the thread in step #1 waits for the slow hard disk to return new lines of data. The result of this conversion step goes into the next queue.
One or more threads to upload the data via JDBC.
The first and the last threads are pretty slow because they are I/O bound (hard disks are slow and network connections are even worse). Plus, inserting data into a database is a very complex task (allocating space, updating indexes, checking foreign keys).
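A compressed sketch of that three-stage pipeline (one thread per stage here for brevity; the file name, JDBC URL, SQL and poison-pill marker are placeholders, and real code would add error handling and probably several calculation/upload threads):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: reader -> queue -> calculation worker -> queue -> JDBC uploader with chunked commits.
public class ImportPipeline {

    private static final String POISON = "__END__";   // marks the end of the stream

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> rawLines = new ArrayBlockingQueue<>(10_000);
        BlockingQueue<String> prepared = new ArrayBlockingQueue<>(10_000);

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("bizdata.csv"))) {
                String line;
                while ((line = in.readLine()) != null) rawLines.put(line);
                rawLines.put(POISON);
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        Thread worker = new Thread(() -> {
            try {
                String line;
                while (!(line = rawLines.take()).equals(POISON)) {
                    prepared.put(line.trim().toUpperCase());   // stand-in for the real calculations
                }
                prepared.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread uploader = new Thread(() -> {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test");   // placeholder URL
                 PreparedStatement ps = conn.prepareStatement("INSERT INTO biz_data(value) VALUES (?)")) {
                conn.setAutoCommit(false);
                int inChunk = 0;
                String row;
                while (!(row = prepared.take()).equals(POISON)) {
                    ps.setString(1, row);
                    ps.addBatch();
                    if (++inChunk == 1_000) {        // commit in chunks to keep rollback/undo small
                        ps.executeBatch();
                        conn.commit();
                        inChunk = 0;
                    }
                }
                ps.executeBatch();
                conn.commit();
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        reader.start(); worker.start(); uploader.start();
        reader.join(); worker.join(); uploader.join();
    }
}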
Using different worker threads gives you lots of advantages:
It's easy to test each thread separately. Since they don't share data, you need no synchronization. The queues will do that for you
You can quickly change the number of threads for each step to tweak performance
Multi-threading may be of help. If the lines are uncorrelated, you may start two processes, one reading the even lines and another the odd lines, get your DB connections from a connection pool (DBCP), and analyze the performance. But first I would investigate whether JDBC is the best approach: databases normally have optimized solutions for imports like this. These solutions may also temporarily switch off constraint checking on your table and turn it back on later, which is also great for performance. As always, it depends on your requirements.
Also, you may want to check out Spring Batch, which is designed for batch processing.
As far as I know, the JDBC-ODBC Bridge uses synchronized methods to serialize all calls to ODBC, so using multiple threads won't give you any performance boost unless it boosts your application itself.
I am not all that familiar with JDBC, but regarding the multithreading bit of your question, what you should keep in mind is that parallel processing relies on effectively dividing your problem into bits that are independent of one another and in some way putting them (their output, that is) back together. If you don't know the underlying dependencies between tasks, you might end up having really odd errors/exceptions in your code. Even worse, it might all execute without any problems, but the results might be off from the true values. Multi-threading is tricky business, in a way fun to learn (at least I think so) but a pain in the neck when things go south.
Here are a couple of links that might provide useful:
Oracle's java trail: best place to start
A good tutorial for java concurrency
an interesting article on concurrency
If you are serious about putting in the effort to get into multi-threading, I can recommend Brian Goetz's Java Concurrency in Practice, an amazing book really.
Good luck
I had a similar task. But in my case, all the tables were unrelated to each other.
STEP 1:
Use SQL*Loader (Oracle) for uploading data into the database (very fast), or any similar bulk-load tool for your database.
STEP 2:
Run each upload process in a different thread (for unrelated tasks) and in a single thread for related tasks.
P.S. You could identify the different inter-related jobs in your application, categorize them into groups, and run each group in a different thread.
Links to get you started:
JAVA Threading
Follow the last example in the above link (Example: Partitioning a large task with multiple threads).
SQL Loader can dramatically improve performance
The fastest way I've found to insert large numbers of records into Oracle is with array operations. See the "setExecuteBatch" method, which is specific to OraclePreparedStatement. It's described in one of the examples here:
http://betteratoracle.com/posts/25-array-batch-inserts-with-jdbc
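For illustration, a hedged sketch of that Oracle-specific update batching (the JDBC URL, credentials and table are assumptions; standard JDBC addBatch()/executeBatch() is the portable alternative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import oracle.jdbc.OraclePreparedStatement;

// Sketch of Oracle update batching via setExecuteBatch(); connection details and table are placeholders.
public class OracleArrayInsert {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO biz_data (id, value) VALUES (?, ?)")) {

            conn.setAutoCommit(false);
            ((OraclePreparedStatement) ps).setExecuteBatch(500);   // buffer 500 rows per round trip

            for (int i = 0; i < 100_000; i++) {
                ps.setInt(1, i);
                ps.setString(2, "row-" + i);
                ps.executeUpdate();   // buffered client side until the batch value is reached
            }
            ((OraclePreparedStatement) ps).sendBatch();            // flush whatever is still buffered
            conn.commit();
        }
    }
}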
If multi-threading would complicate your work, you could go with async messaging. I'm not fully aware of what your needs are, so the following is based on what I'm seeing currently.
Create a file reader in Java whose purpose is to read the biz file and put messages onto a JMS queue on the server. This could be plain Java with a static void main().
Consume the JMS messages in message-driven beans (you can set the limit on the number of beans to be created in the pool, 50 or 100 depending on the need). If you have multiple servers, well and good: your job is now split across multiple servers.
Each row of data is asynchronously split between 2 servers and 50 beans on each server.
You do not have to deal with threads in the whole process. JMS is ideal because your data is within a transaction: if something fails before you send an ack to the server, the message will be redelivered to a consumer. The load is split between the servers without you doing anything special like multi-threading.
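For illustration, a minimal sketch of the consumer side as a message-driven bean (the queue name and the processing call are placeholders, and activation-config property names can vary slightly by application server and JMS version); the pool sizing mentioned above is configured in the server, not in code:

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.TextMessage;

// Sketch: the container manages the bean pool and delivers one message per bean invocation.
@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/BizDataQueue")
})
public class BizDataMdb implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            String row = ((TextMessage) message).getText();
            process(row);   // business logic; if this throws, the message is redelivered
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }

    private void process(String row) {
        // deep copy, unique processing, save to the DB
    }
}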
Also, Spring provides Spring Batch, which can help you: http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html#springBatchUsageScenarios