I would like to ask for some advices concerning my problem.
I have a batch that does some computation (multi threading environement) and do some inserts in a table.
I would like to do something like batch insert, meaning that once I got a query, wait to have 1000 queries for instance, and then execute the batch insert (not doing it one by one).
I was wondering if there is any design pattern on this.
I have a solution in mind, but it's a bit complicated:
build a method that will receive the queries
add them to a list (the string and/or the statements)
do not execute until the list has 1000 items
The problem : how do I handle the end ?
What I mean is, the last 999 queries, when do I execute them since I'll never get to 1000 ?
What should I do ?
I'm thinking at a thread that wakes up every 5 minutes and check the number of items in a list. If he wakes up twice and the number is the same , execute the existing queries.
Does anyone has a better idea ?
Your database driver needs to support batch inserting. See this.
Have you established your system is choking on network traffic because there is too much communication between the service and the database? If not, I wouldn't worry about batching, until you are sure you need it.
You mention that in your plan you want to check every 5 minutes. That's an eternity. If you are going to get 1000 items in 5 minutes, you shouldn't need batching. That's ~ 3 a second.
Assuming you do want to batch, have a process wake up every 2 seconds and commit whatever is queued up. Don't wait five minutes. It might commit 0 rows, it might commit 10...who cares...With this approach, you don't need to worry that your arbitrary threshold hasn't been met.
I'm assuming that the inserts come in one at a time. If your incoming data comes in n at once, I would just commit every incoming request, no matter how many inserts happen. If your messages are coming in as some sort of messaging system, it's asynchronous anyway, so you shouldn't need to worry about batching. Under high load, the incoming messages just wait till there is capacity to handle them.
Add a commit kind of method to that API that will be called to confirm all items have been added. Also, the optimum batch size is somewhere in the range 20-50. After that the potential gain is outweighed by the bookkeeping necessary for a growing number of statements. You don't mention it explicitly, but of course you must use the dedicated batch API in JDBC.
If you need to keep track of many writers, each in its own thread, then you'll also need a begin kind of method and you can count how many times it was called, compared to how many times commit was called. Something like reference-counting. When you reach zero, you know you can flush your statement buffer.
This is most amazing concept , I have faced many time.So, according to your problem you are creating a batch and that batch has 1000 or more queries for insert . But , if you are inserting into same table with repeated manner.
To avoid this type of situation you can make the insert query like this:-
INSERT INTO table1 VALUES('4','India'),('5','Odisha'),('6','Bhubaneswar')
It can execute only once with multiple values.So, better you can keep all values inside any collections elements (arraylist,list,etc) and finally make a query like above and insert it once.
Also you can use SQL Transaction API.(Commit,rollback,setTraction() ) etc.
Hope ,it will help you.
All the best.
Related
I've a typical scenario & need to understand best possible way to handle this, so here it goes -
I'm developing a solution that will retrieve data from a remote SOAP based web service & will then push this data to an Oracle database on network.
Also, this will be a scheduled task that will execute every 15 minutes.
I've event queues on remote service that contains the INSERT/UPDATE/DELETE operations that have been done since last retrieval, & once I retrieve the events for last 15 minutes, it again add events for next retrieval.
Now, its just pushing data to Oracle so all my interactions are INSERT & UPDATE statements.
There are around 60 tables on Oracle with some of them having 100+ columns. Moreover, for every 15 minutes cycle there would be around 60-70 Inserts, 100+ Updates & 10-20 Deletes.
This will be an executable jar file that will terminate after operation & will again start on next 15 minutes cycle.
So, I need to understand how should I handle WRITE operations (best practices) to improve performance for this application as whole ?
Current Test Code (on every cycle) -
Connects to remote service to get events.
Creates a connection with DB (single connection object).
Identifies the type of operation (INSERT/UPDATE/DELETE) & table on which it is done.
After above, calls the respective method based on type of operation & table.
Uses Preparedstatement with positional parameters, & retrieves each column value from remote service & assigns that to statement parameters.
Commits the statement & returns to get event class to process next event.
Above is repeated till all the retrieved events are processed after which program closes & then starts on next cycle & everything repeats again.
Thanks for help !
If you are inserting or updating one row at a time,You can consider executing a batch Insert or a batch Update. It has been proven that if you are attempting to update or insert rows after a certain quantity, you get much better performance.
The number of DB operations you are talking about (200 every 15 minutes) is tiny and will be easy to finish in less than 15 minutes. Some concrete suggestions:
You should profile your application to understand where it is spending its time. If you don't do this, then you don't know what to optimize next and you don't know if something you did helped or hurt.
If possible, try to get all of the events in one round-trip to the remote server.
You should reuse the connection to the remote service (probably by using a library that supports connection persistence and reuse).
You should reuse the DB connections by using a connection pooling library rather than creating a new connection for each insert/update/delete. Believe it or not, creating the connection probably takes 100+ times as long as doing your DB operation once you have the connection in hand.
You should consider doing multiple (or all) of the database operations in the same transaction rather than creating a new transaction for each row that is changed. However, you should carefully consider your failure modes such that you don't lose any events (if that is an important consideration).
You should consider utilizing prepared statement caching. This may help, but maybe not if Oracle is configured properly.
You should consider trying to analyze your operations to find any that can be batched together. This can be a lot faster if you have some "hot" operations that get done often.
"I've a typical scenario"
No you haven't. You have a bespoke architecture, with a unique data model, unique data and unique business requirements. That's not a bad thing, it's the state of pretty much every computer system that's not been bought off-the-shelf (and even some of them).
So, it's an experiment and you must approach it as such. There is no "best practice". Try various things and see what works best.
"need to understand best possible way to handle this"
You will improve your chances of success enormously by hiring somebody who understands Oracle databases.
I have a long running job that updates 1000's of entity groups. I want to kick off a 2nd job afterwards that will have to assume all of those items have been updated. Since there are so many entity groups, I can't do it in a transaction, so i've just scheduled the 2nd job to run 15 minutes after the 1st completes using task queues.
Is there a better way?
Is it even safe to assume that 15 minutes gives a promise that the datastore is in sync with my previous calls?
I am using high replication.
In the google IO videos about HRD, they give a list of ways to deal with eventual consistency. One of them was to "accept it". Some updates (like twitter posts) don't need to be consistent with the next read. But they also said something like "hey, we're only talking miliseconds to a couple of seconds before they are consistent". Is that time frame documented anywhere else? Is it safe assuming that waiting 1 minute after a write before reading again will mean all my preivous writes are there in the read?
The mention of that is at the 39:30 mark in this video http://www.youtube.com/watch?feature=player_embedded&v=xO015C3R6dw
I don't think there is any built in way to determine if the updates are done. I would recommend adding a lastUpdated field to your entities and updating it with your first job, then check for the timestamp on the entity you're updating with the 2nd before running... kind of a hack but it should work.
Interested to see if anybody has a better solution. Kinda hope they do ;-)
This is automatic as long as you are getting entities without changing the consistency to Eventual. The HRD puts data to a majority of relevant datastore servers before returning. If you are calling the asynchronous version of put, you'll need to call get on all the Future objects before you can be sure it's completed.
If however you are querying for the items in the first job, there's no way to be sure that the index has been updated.
So for example...
If you are updating a property on every entity (but not creating any entities), then retrieving all entities of that kind. You can do a keys-only query followed by a batch get (which is approximately as fast/cheap as doing a normal query) and be sure that you have all updates applied.
On the other hand, if you're adding new entities or updating a property in the first process that the second process queries, there's no way to be sure.
I did find this statement:
With eventual consistency, more than 99.9% of your writes are available for queries within a few seconds.
at the bottom of this page:
http://code.google.com/appengine/docs/java/datastore/hr/overview.html
So, for my application, a 0.1% chance of it not being there on the next read is probably OK. However, I do plan to redesign my schema to make use of ancestor queries.
I am using Informix DB. This question may not be tied to one specific database. But I want to know how I can in Java, continuously probe into a Database and check if a certain row has been added to a table in the DB. Basically, the flow is:
My Java application should use JDBC to check if a certain table is populated.
If no, it should wait until a row has been inserted.
My question how can I have Java be aware of a row insertion. I am not expecting to add any triggers or anything, but in pure Java be able to check that the row is added.
Some thoughts that come to my mind are continuously call DB for the row, or periodically (every half-hour or so) call DB and check if the row is available. But what I am looking for is something like a Listener which can do this.
There is no facility in the Informix DBMS to signal when a particular row arrives in a table.
Well, I say that, but there is the DB-Cron facility which can periodically execute tasks (inside the server), and you could conceivably schedule a task to poll for the data to see if it has arrived and to send a message (somehow) to indicate that it has. It would be non-trivial, especially the part the indicates that it has arrived.
The JDBC protocol (and SQL protocols generally) are essentially synchronous; the client sends a request and waits for an answer from the DBMS.
So, pragmatically, if your delay period is half an hour, you can either create an admin task to handle the processing (you could write a Java UDR to be executed in the server by the server if that's crucial to you), or you can arrange for the Java (client-side) program to poll periodically to find out whether the information you need is there. A half-hour delay is not going to stress anything, even with a moderate number of processes polling for separate values (or even the same value). On the other hand, you normally try to avoid polling when you can. You'll need to strike a balance between responsiveness to the special data arriving and general system responsiveness. On the whole, general system responsiveness is more important, so keep the polling interval as large as you can.
If your polling interval needed to be sub-second, then the balance would be different - the job would be a lot harder.
We have a JDBC batch job. There are two tables:
BUSINESS_CONTRACT
CLASSIFY_RECORD
The table BUSINESS_CONTRACT stores information of business contracts, we classify business contracts every month and store classify result in the table CLASSIFY_RECORD.
The batch job runs once per month, query the BUSINESS_CONTRACT for those business contracts need to be classified and classify them then insert classify results into CLASSIFY_RECORD.
The batch job runs in a single thread right now, and I want to make it runs with multi-threads
How should I write the basic code structure using the dispatcher-worker pattern?
I learn java multi-threading, but found theoretical resources mostly.Now I want to use multi-threading to solve a real problem, but don't know how to write the first line code.
First, do you need the added complexity of multi-threading? How long does your current process take to run? Do you have multiple CPUs or multiple CPU cores available on the server you would be running this on, that would make the multi-threading beneficial?
I'm not going to write your code for you, but can give you a few pointers...
How would you do this work manually? Assume you had these as paper records, and had to split the task with a co-worker. How would you divide up the work? Between 2 people or 20 people? (That's how many threads you could potentially split this into.)
Once you have these details figured out, you can create multiple threads (your workers, using parent "dispatcher" code) - each configured to select only a portion of the results from your query. You should keep references to each of your threads, and call .join() on each of them once they are all started in order to wait for the entire batch to complete. If there is a large amount of data that will be difficult to split into equal units of work (1,000 records divided into 500 and 500 may require 75% and 25% of the resources for whatever reason), you may want to consider splitting the work into much smaller units (more units than threads), then have the dispatcher continue to feed the units of work to the workers until all work has been assigned.
Also consider, would these split functions of work be truly distinct? If one unit of work fails for some reason and needs to be rolled-back in the database, does this mean that all of the other units of work need to be stopped and any existing inserts rolled-back as well?
Are you using batch updates? It will probably make more of a difference than multiple threads doing single updates.
I am developing a Java application which will query tables which may hold over 1,000,000 records. I have tried everything I could to be as efficient as possible but I am only able to achieve on avg. about 5,000 records a minute and a maximum of 10,000 at one point. I have tried reverse engineering the data loader and my code seems to be very similar but still no luck.
Is threading a viable solution here? I have tried this but with very minimal results.
I have been reading and have applied every thing possible it seems (compressing requests/responses, threads etc.) but I cannot achieve data loader like speeds.
To note, it seems that the queryMore method seems to be the bottle neck.
Does anyone have any code samples or experiences they can share to steer me in the right direction?
Thanks
An approach I've used in the past is to query just for the IDs that you want (which makes the queries significantly faster). You can then parallelize the retrieves() across several threads.
That looks something like this:
[query thread] -> BlockingQueue -> [thread pool doing retrieve()] -> BlockingQueue
The first thread does query() and queryMore() as fast as it can, writing all ids it gets into the BlockingQueue. queryMore() isn't something you should call concurrently, as far as I know, so there's no way to parallelize this step. All ids are written into a BlockingQueue. You may wish to package them up into bundles of a few hundred to reduce lock contention if that becomes an issue. A thread pool can then do concurrent retrieve() calls on the ids to get all the fields for the SObjects and put them in a queue for the rest of your app to deal with.
I wrote a Java library for using the SF API that may be useful. http://blog.teamlazerbeez.com/2011/03/03/a-new-java-salesforce-api-library/
With the Salesforce API, the batch size limit is what can really slow you down. When you use the query/queryMore methods, the maximum batch size is 2000. However, even though you may specify 2000 as the batch size in your SOAP header, Salesforce may be sending smaller batches in response. Their batch size decision is based on server activity as well as the output of your original query.
I have noticed that if I submit a query that includes any "text" fields, the batch size is limited to 50.
My suggestion would be to make sure your queries are only pulling the data that you need. I know a lot of Salesforce tables end up with a lot of custom fields that may not be needed for every integration.
Salesforce documentation on this subject
We have about 14000 records in our Accounts object and it takes quite some time to get all the records. I perform a query which takes about a minute but SF only returns batches of no more than 500 even though I set batchsize to 2000. Each query more operation takes from 45 seconds to a minute also. This limitation is quite frustrating when you need to get bulk data.
Make use of Bulk-api to query any number of records from Java. I'm making use of it and performs very effectively even in seconds you get the result. The String returned is comma separated. Even you can maintain batches less than or equal to 10k to get the records either in CSV (using open csv) or directly in String.
Let me know if you require the code help.
Latency is going to be a killer for this type of situation - and the solution will be either multi-thread, or asynchronous operations (using NIO). I would start by running 10 worker threads in parallel and see what difference it makes (assuming that the back-end supports simultaneous gets).
I don't have any concrete code or anything I can provide here, sorry - just painful experience with API calls going over high latency networks.