I have a table A, about 400k rows one day, I want to write java program to back up old data to another table, there are two ways:
use jdbc to fetch data from table A, 500 rows one time for example, then concatenate sql like insert into table B values(value1, value2...),(value1, value2...),... then execute.
use insert into table B select * from A where, about 2~3 million rows.
As somebody said the second way is slower, but I am not sure, so which way is better? Must not crash database.
The 1st one is the realistic option but you dont have to do it by concatenating SQL. Java JDBC already has functions for sending inserts in batches. In a for loop you can keep performing inserts for one row at a time then when you reach your desired batch size call executeBatch() to post it in one big bulk insert.
The value that you'll have to play around with is how many to insert at at time or the batch size. That answer will depend on your hardware so just see what works. The larger the batch size the more units held in memory at anyone time. For really large databases doing this I have still crashed the program BUT NOT the database. It was only when I found the right batch size for my system did everything work out. For me 5000 was fine. But running it many many times for over a million rows the database was fine. But again this depends on the database as well. But you are on the correct path.
https://www.tutorialspoint.com/jdbc/jdbc-batch-processing.htm
Related
I am new to Cassandra, so I may be missing something. My goal is to insert 500,000 rows as quickly as possible, using Java (DataStax driver). It is currently inserting only 400 records per second, and the full 500,000 inserts is taking many minutes to fully execute. Duplicates in the ArrayList are possible, so the insert process should do an insert/update statement (in other words, the java list might contain duplicates, but the db table should contain only distinct values).
A select-query returns the 500k records in less than 1 second from cassandra, but the insert into cassandra takes a really long time. I am hoping the insert of 500k records could be less than 10 seconds. What can I do to get the inserts to be much faster?
Here is a definition for the Cassandra table:
create table mykeyspace.mytablename
(
my_id_record text primary key
);
Here is the java insert (just relevant code shown, any error handling removes for simplicity):
String insertCQL = "INSERT INTO mykeyspace.mytablename(my_id_record) VALUES (?);";
PreparedStatement insertPrepStmnt = session.prepare(insertCQL);
for( String myId: myArrayList) {
cassandraConnect.session.execute(insertPrepStmnt.bind(myId));
}
As you can see, it's inserting 500,00 records of a string value into a table with a single field (the primary key field).
Is 400 inserts per second the expected speed for Cassandra?
Any suggestions for what I can do to speed it up would be greatly appreciated.
You are using synchronous API - this means that you wait for answer before inserting next record. You can get much better throughput by using asynchronous API, but you need to control how many requests per connection is in-flight at the same time. You may need to control/tune connection pooling for that.
But if you really want to load data from files, such as CSV or JSON, the I recommend to look to DSBulk. If you want just generate test data - use NoSQLBench. Both tools are heavily optimized for maximum throughput.
I have a JAVA application that can use a SQL database from any vendor. Right now we have tested Vertica and PostgreSQL. I want to export all the data from one table in the DB and import it later on in a different instance of the application. The size of the DB is pretty big so there are many rows in there. The export and import process has to be done from inside the java code.
What we've tried so far is:
Export: we read the whole table (select * from) through JDBC and then dump it to an SQL file with all the INSERTS needed.
Import: The file containing those thousands of INSERTS is executed in the target database through JDBC.
This is not an efficient process. Firstly, the select * from part is giving us problems because of the size of it and secondly, executing a lot if inserts one after another gives us problems in Vertica (https://forum.vertica.com/discussion/235201/vjdbc-5065-error-too-many-ros-containers-exist-for-the-following-projections)
What would be a more efficient way of doing this? Are there any tools that can help with the process or there is no "elegant" solution?
Why not do the export/import in a single step with batching (for performance) and chunking (to avoid errors and provide a checkpoint where to start off after a failure).
In most cases, databases support INSERT queries with many values, e.g.:
INSERT INTO table_a (col_a, col_b, ...) VALUES
(val_a, val_b, ...),
(val_a, val_b, ...),
(val_a, val_b, ...),
...
The number of rows you generate into a single such INSERT statement is then your chunk-size, which might need tuning for the specific target database (big enough to speed things up but small enough to make the chunk not exceed some database limit and create failures).
As already proposed, each of this chunk should then be executed in a transaction and your application should remember which chunk it successfully executed last in case some error occurs so it can continue at the next run there.
For the chunks itself, you really should use LIMIT OFFSET .
This way, you can repeat any chunk at any time, each chunk by itself is atomic and it should perform much better than with single row statements.
I can only speak about PostgreSQL.
The size of the SELECT is not a problem if you use server-side cursors by calling setFetchSize with a value greater than 0 (perhaps 10000) on the statement.
The INSERTS will perform well if
you run them all in a single transaction
you use a PreparedStatement for the INSERT
Each insert into Vertica goes into WOS (memory), and periodically data from WOS gets moved to ROS (disk) into a single container. You can only have 1024 ROS containers per projection per node. Doing many thousands of INSERTs at a time is never a good idea for Vertica. The best way to do this is to copy all that data into a file and bulk load the file into Vertica using the COPY command.
This will create a single ROS container for the contents of the file. Depending on how many rows you want to copy it will be many times (sometimes even hundreds of times) faster.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/COPY/COPY.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ConnectingToVertica/ClientJDBC/UsingCOPYLOCALWithJDBC.htm
I need an advice :)
I have a database with almost 70 tables, many of them have over a dozen million big records. I want to split it into a few smaller ones. One for every big client data and one main database for the rest of the client's data(while also moving some of the data into NoSQL database). Because of many complicated relations between tables, before copying the data, I was disabling the triggers, that were checking the correctness of the foreign keys and then, just before a commit I was enabling them again.
It was all working with a small amount of data, but now, when I'm trying to copy one of the big client data I have a problem with the java heap size/GC out of memory.
I could increase the heap size, but it's not the point here.
I'm selecting data by some specific id from every table that has any relation to client data and copy it to another database. The process looks like this:
Select data from table
Insert data to another database
Copy sequence (max(id) of data being copied)
Flush/Clear
Repeat for every table containing client data
I was trying to select portions of data(something like select parts with 5000 rows instead of all 50 000 in one) but it fails in the exact same position.
And here I am asking for an advice, how to manage this problem. I think it is all because I am trying to copy all data in one big fatty commit. The reason of it is that I have to disable triggers while copying but also I must enable them before I can commit my changes.
When I'm trying to copy one of the big client data I have a problem with the java heap size/GC out of memory.
Copying data should not be using the heap, so it seems you're not using cursor-based queries.
See "Getting results based on a cursor" in the PostgreSQL JDBC documentation:
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.
[...]
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
So, add a stmt.setFetchSize(1000) (or something like that) to your code will ensure that the JDBC driver will not exhaust the heap.
If you still have trouble after that, then it's because your code is retaining all data, which means it's coded wrong for a copy operation.
I have a problem regarding the Resultset of a large database. (MySQLDB, Java 1.7)
The task is to perform a transformation of all the entries of one column into another database.
(e.g. divide every number by three and write them into another database)
As the database contains about 70 columns and a few million rows, my first approach would have been to get a SELECT * and parse the Resultset by columns.
Unfortunately I found no way to parse it this way, as the designated way intends to go through it row by row (while(rs.next()) {} etc).
I don't like this way, as it would create 70 large arrays, I would have had only one per time to reduce memory usage.
So here are my main questions:
Is there a way?
Should I either create a query for every column and parse them (one array at a time but 70 queries) or
Should I just get the whole ResultSet and parse it row by row, writing them into 70 arrays?
Greetings and thanks in advance!
Why not just page your queries ? Pull out 'n' rows at a time, perform the transformation, and then write them into the new database.
This means you don't pull everything up in one query/iteration and then write the whole lot in one go, and you don't have the inefficiencies of working row-by-row.
My other comment is perhaps this is premature optimisation. Have you tried loading the whole dataset, and seeing how much memory it would take. If it's of the order of 10's or even 100's of megs, I would expect the JVM to handle that easily.
I'm assuming your transformation needs to be done in Java. If you can possibly do it in SQL, then doing it entirely within the database is likely to be even more efficient.
Why don't you do it with mysql only.
use this query :
create table <table_name> as select <column_name_on_which_you_want_transformation>/3 from <table name>;
How can I implement several threads with multiple/same connection(s), so that a single large table data can be downloaded in quick time.
Actually in my application, I am downloading a table having 12 lacs (1 lac = 100,000) records which takes atleast 4 hrs to download in normal connection speed and more hrs with slow connection.
So there is a need to implement several threads in Java for downloading a single table data with multiple/same connection(s) object. But no idea how to do this.
How to position a record pointer in several threads then how to add all thread records into a single large file??
Thanks in Advance
First of all, is it not advisable to fetch and download such a huge data onto the client. If you need the data for display purposes then you dont need more records that fit into your screen. You can paginate the data and fetch one page at a time. If you are fetching it and processsing in your memory then you sure would run out of memory on your client.
If at all you need to do this irrespective of the suggestion, then you can spawn multiple threads with separate connections to the database where each thread will pull a fraction of data (1 to many pages). If you have say 100K records and 100 threads available then each thread can pull 1K of records. It is again not advisable to have 100 threads with 100 open connections to the DB. This is just an example. Limit the no number of threads to some optimal value and also limit the number of records each thread is pulling. You can limit the number of records pulled from the DB on the basis of rownum.
As Vikas pointed out, if you're downloading a gigabytes of data to the client-side, you're doing something really really wrong, as he had said you should never need to download more records that can fit into your screen. If however, you only need to do this occasionally for database duplication or backup purpose, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
Second way: pagination, basically every thread somehow knows what chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where column is nil limit 10 followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries or lock the table and/or rows appropriately. Something like that to avoid possible collision (postgres for instance requires special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then given you know the approximate size of the database, you can peel off a fraction of it like, if 100 and you want 100 batches "where x < 0.01 and X >= 0.02" or the like. (Idea inspired by how wikipedia is able to get a "random" page--assigns each row a random float at insert time).
The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10 from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data, not only to acquire the data in multiple threads, but also to further process that data in multiple threads. Such approaches do not always call for all data to be accumulated in memory, it can be processed in groups and/or sliding windows, and only need to either accumulate a result, or pass the data further on (other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw textual, this could be a random sizer cut somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied on your query to allow paging. This could be something like:
Driver Program: Split my data in for parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.