search resultset by column - java

I have a problem regarding the ResultSet of a large database (MySQL, Java 1.7).
The task is to transform all the entries of one column and write the results into another database
(e.g. divide every number by three and write it into the other database).
As the database contains about 70 columns and a few million rows, my first approach was to do a SELECT * and parse the ResultSet column by column.
Unfortunately I found no way to parse it that way, as the API is designed to be traversed row by row (while (rs.next()) { ... } etc.).
I don't like that, as it would mean building 70 large arrays at once; I would rather have only one at a time to reduce memory usage.
So here are my main questions:
Is there a way to read the ResultSet column by column? If not:
Should I create a query for every column and parse them (one array at a time, but 70 queries), or
should I just get the whole ResultSet and parse it row by row, writing the values into 70 arrays?
Greetings and thanks in advance!

Why not just page your queries? Pull out 'n' rows at a time, perform the transformation, and then write them into the new database.
This means you don't pull everything up in one query/iteration and then write the whole lot in one go, and you don't have the inefficiencies of working row-by-row.
My other comment is that perhaps this is premature optimisation. Have you tried loading the whole dataset and seeing how much memory it takes? If it's of the order of tens or even hundreds of megabytes, I would expect the JVM to handle that easily.
I'm assuming your transformation needs to be done in Java. If you can possibly do it in SQL, then doing it entirely within the database is likely to be even more efficient.
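For illustration, here is a minimal sketch of the paging approach, assuming MySQL-style LIMIT/OFFSET; the table and column names (source_table, target_table, id, value) and the JDBC URLs are hypothetical placeholders:

import java.sql.*;

public class PagedTransform {
    public static void main(String[] args) throws SQLException {
        final int pageSize = 10_000; // rows per page; tune to taste
        try (Connection src = DriverManager.getConnection("jdbc:mysql://localhost/srcdb");
             Connection dst = DriverManager.getConnection("jdbc:mysql://localhost/dstdb");
             PreparedStatement select = src.prepareStatement(
                     "SELECT id, value FROM source_table ORDER BY id LIMIT ? OFFSET ?");
             PreparedStatement insert = dst.prepareStatement(
                     "INSERT INTO target_table (id, value) VALUES (?, ?)")) {
            dst.setAutoCommit(false);
            for (int offset = 0; ; offset += pageSize) {
                select.setInt(1, pageSize);
                select.setInt(2, offset);
                int rows = 0;
                try (ResultSet rs = select.executeQuery()) {
                    while (rs.next()) {
                        insert.setInt(1, rs.getInt("id"));
                        insert.setDouble(2, rs.getDouble("value") / 3.0); // the transformation
                        insert.addBatch();
                        rows++;
                    }
                }
                insert.executeBatch(); // write the whole page to the target DB at once
                dst.commit();
                if (rows < pageSize) break; // a short page means we reached the end
            }
        }
    }
}

Only one page of rows is in memory at any time, regardless of how many million rows the source table holds.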

Why don't you do it with MySQL only?
Use this query:
create table <new_table> as select <column_to_transform>/3 from <source_table>;

Related

how to use java to very quickly insert records into cassandra table

I am new to Cassandra, so I may be missing something. My goal is to insert 500,000 rows as quickly as possible, using Java (DataStax driver). It is currently inserting only 400 records per second, and the full 500,000 inserts take many minutes to execute. Duplicates in the ArrayList are possible, so the insert process should behave as an insert/update (in other words, the Java list might contain duplicates, but the DB table should contain only distinct values).
A select query returns the 500k records from Cassandra in less than 1 second, but the insert takes a really long time. I am hoping the insert of 500k records could take less than 10 seconds. What can I do to make the inserts much faster?
Here is a definition for the Cassandra table:
create table mykeyspace.mytablename
(
my_id_record text primary key
);
Here is the Java insert (just the relevant code shown, any error handling removed for simplicity):
String insertCQL = "INSERT INTO mykeyspace.mytablename (my_id_record) VALUES (?);";
PreparedStatement insertPrepStmnt = session.prepare(insertCQL);
for (String myId : myArrayList) {
    session.execute(insertPrepStmnt.bind(myId));
}
As you can see, it's inserting 500,000 records of a string value into a table with a single field (the primary key field).
Is 400 inserts per second the expected speed for Cassandra?
Any suggestions for what I can do to speed it up would be greatly appreciated.
You are using the synchronous API - this means that you wait for the answer before inserting the next record. You can get much better throughput by using the asynchronous API, but you need to control how many requests per connection are in flight at the same time. You may need to control/tune connection pooling for that.
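A minimal sketch of that idea, assuming the DataStax 3.x driver (where executeAsync() returns a Guava ListenableFuture); the maxInFlight value of 512 is an illustrative assumption to be tuned against your cluster:

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.*;
import java.util.List;
import java.util.concurrent.Semaphore;

public class FastInsert {
    public static void insertAll(Session session, List<String> ids) throws InterruptedException {
        PreparedStatement stmt = session.prepare(
                "INSERT INTO mykeyspace.mytablename (my_id_record) VALUES (?)");
        final int maxInFlight = 512; // assumed cap; tune with your driver's pooling options
        final Semaphore permits = new Semaphore(maxInFlight);
        for (String id : ids) {
            permits.acquire(); // blocks once maxInFlight requests are outstanding
            ResultSetFuture future = session.executeAsync(stmt.bind(id));
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                @Override public void onSuccess(ResultSet rs) { permits.release(); }
                @Override public void onFailure(Throwable t) { permits.release(); /* log & retry here */ }
            }, MoreExecutors.directExecutor());
        }
        permits.acquire(maxInFlight); // wait for the tail of outstanding inserts to finish
    }
}

The semaphore is what keeps the client from flooding the cluster: inserts are issued without waiting for each answer, but never more than maxInFlight at once.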
But if you really want to load data from files, such as CSV or JSON, then I recommend looking at DSBulk. If you just want to generate test data, use NoSQLBench. Both tools are heavily optimized for maximum throughput.

insert into select vs bulk insert

I have a table A that grows by about 400k rows a day, and I want to write a Java program to back up old data to another table. There are two ways:
Use JDBC to fetch data from table A, 500 rows at a time for example, then concatenate SQL like insert into table B values (value1, value2...), (value1, value2...), ... and execute it.
Use insert into table B select * from A where ..., covering about 2~3 million rows.
Somebody said the second way is slower, but I am not sure, so which way is better? It must not crash the database.
The 1st one is the realistic option, but you don't have to do it by concatenating SQL. JDBC already has support for sending inserts in batches: in a loop, keep adding single-row inserts to the batch, and when you reach your desired batch size, call executeBatch() to post them in one big bulk insert.
The value you'll have to play around with is how many rows to insert at a time, i.e. the batch size. The right answer depends on your hardware, so just see what works. The larger the batch size, the more rows are held in memory at any one time. Doing this for really large databases I have still crashed the program, BUT NOT the database; it was only when I found the right batch size for my system that everything worked out. For me 5000 was fine, and running it many, many times for over a million rows, the database was fine. But again, this depends on the database as well. You are on the correct path.
https://www.tutorialspoint.com/jdbc/jdbc-batch-processing.htm
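A sketch of that batching pattern; the column names and the cutoff condition (created_at < ?) are hypothetical placeholders for whatever defines "old data" in table A:

import java.sql.*;

public class ArchiveOldRows {
    public static void archive(Connection src, Connection dst, Timestamp cutoff) throws SQLException {
        final int batchSize = 5000; // the size that worked in the answer above; tune for your hardware
        dst.setAutoCommit(false);
        try (PreparedStatement select = src.prepareStatement(
                     "SELECT col1, col2 FROM table_a WHERE created_at < ?");
             PreparedStatement insert = dst.prepareStatement(
                     "INSERT INTO table_b (col1, col2) VALUES (?, ?)")) {
            select.setTimestamp(1, cutoff);
            int count = 0;
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    insert.setString(1, rs.getString(1));
                    insert.setString(2, rs.getString(2));
                    insert.addBatch();
                    if (++count % batchSize == 0) {
                        insert.executeBatch(); // post one bulk insert per batch
                        dst.commit();
                    }
                }
            }
            insert.executeBatch(); // flush the final partial batch
            dst.commit();
        }
    }
}

Committing per batch rather than per row is also what keeps the transaction log small enough that the database is never at risk.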

How to implement several threads in Java for downloading a single table data?

How can I implement several threads with multiple connections (or a shared connection), so that a single large table's data can be downloaded quickly?
Actually in my application, I am downloading a table of 12 lacs (1 lac = 100,000) records, which takes at least 4 hrs at normal connection speed and more with a slow connection.
So I need to implement several threads in Java for downloading a single table's data over multiple/shared connection objects, but I have no idea how to do this.
How do I position a record pointer in several threads, and how do I then merge all the threads' records into a single large file?
Thanks in Advance
First of all, it is not advisable to fetch and download such a huge amount of data to the client. If you need the data for display purposes, then you don't need more records than fit on your screen; you can paginate the data and fetch one page at a time. If you are fetching it all and processing it in memory, you would surely run out of memory on your client.
If you need to do this regardless of that suggestion, then you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (one to many pages). If you have, say, 100K records and 100 threads available, each thread can pull 1K records. It is again not advisable to have 100 threads with 100 open connections to the DB - this is just an example. Limit the number of threads to some optimal value and also limit the number of records each thread pulls. You can limit the number of records pulled from the DB on the basis of rownum.
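As a rough sketch of that layout (not a recommendation for any particular thread count), each worker below opens its own connection and pulls one id range; the JDBC URL, table name, and a roughly contiguous numeric id column are all assumptions:

import java.sql.*;
import java.util.concurrent.*;

public class ParallelFetch {
    private static final String URL = "jdbc:mysql://localhost/mydb"; // hypothetical
    private static final int THREADS = 8;     // keep this modest
    private static final int CHUNK = 100_000; // rows per task

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        int totalRows = 1_200_000; // 12 lacs, as in the question
        for (int start = 0; start < totalRows; start += CHUNK) {
            final int from = start;
            pool.submit(new Runnable() {
                @Override public void run() { fetchRange(from, from + CHUNK); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(4, TimeUnit.HOURS);
    }

    private static void fetchRange(int from, int to) {
        try (Connection conn = DriverManager.getConnection(URL);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT * FROM mytable WHERE id >= ? AND id < ?")) {
            ps.setInt(1, from);
            ps.setInt(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // write rows to a per-thread file here; merge the files at the end
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

Writing each range to its own file and concatenating the files afterwards also answers the "single large file" part of the question without any cross-thread coordination.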
As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).
It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB: setting the fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" the DB itself. Pro: you're not re-running queries, so the sort order can't change on you halfway through (for instance, if your query is select * from table_name, the return order is somewhat random, but if you read everything from the same ResultSet/query, you won't get accidental duplicates or anything like that). Here's a tutorial doing it this way.
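A condensed sketch of this first way, with one reader thread feeding a bounded queue and a pool of workers draining it; the connection URL and query are placeholders, and the Integer.MIN_VALUE fetch size is a MySQL Connector/J streaming trick that other drivers handle differently:

import java.sql.*;
import java.util.concurrent.*;

public class QueuedReader {
    private static final String POISON = "__END__"; // sentinel instance, compared by reference

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000); // bounded: reader blocks if workers lag
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                @Override public void run() {
                    try {
                        String row;
                        while ((row = queue.take()) != POISON) {
                            // transform / write the row here
                        }
                        queue.put(POISON); // pass the marker on so the other workers stop too
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb");
             Statement st = conn.createStatement()) {
            st.setFetchSize(Integer.MIN_VALUE); // MySQL: stream rows instead of buffering them all
            try (ResultSet rs = st.executeQuery("SELECT id FROM mytable")) {
                while (rs.next()) {
                    queue.put(rs.getString(1));
                }
            }
        }
        queue.put(POISON); // signal end-of-stream
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
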
Second way: pagination. Basically every thread somehow knows which chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something limit 10 offset XXX". Then each thread basically processes (in this instance) 10 rows at a time [XXX is a shared variable among threads, incremented by the calling thread].
The problem is the "order by something": it means that for each query the DB may have to sort the entire table, which may or may not be possible, and can be expensive especially near the end of the table. If the column is indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10, followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not, it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries, or lock the table and/or rows appropriately - something like that to avoid possible collisions (Postgres, for instance, requires a special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the database, you can peel off fractions of it: if you want 100 batches, use ranges like "where x >= 0.01 and x < 0.02", and so on. (Idea inspired by how Wikipedia is able to get a "random" page - it assigns each row a random float at insert time.)
The thing you really want to avoid is some kind of change in sort order halfway through. For instance, if you don't specify a sort order and just query like select * from table_name limit 10 offset XXX from multiple threads, it's conceivably possible that the database will [since no sort element is specified] change the order it returns rows in halfway through [for instance, if new data is added], meaning you may skip rows or the like.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.
I just felt compelled to answer on this old posting.
Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also further processing it in multiple threads. Such approaches do not always call for all the data to be accumulated in memory; it can be processed in groups and/or sliding windows, and only needs to either accumulate a result or pass the data further on (to other permanent storage).
To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw text, this could be a cut at some arbitrary point in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied to your query to allow paging. This could be something like:
Driver Program: Split my data into 4 parts, and start 4 workers
4 x (Worker Program): Give me part n (1..4) of 4 of the data
This could translate into a (pseudo) sql like:
SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)
In the end it is all pretty conventional, nothing super advanced.

Determine if column in ResultSet contains values in all rows

In my application, I perform a costly query that takes minutes to produce a report. I am trying to make a generic class that transforms a ResultSet into an Excel spreadsheet, where a column is excluded from the spreadsheet if it only contains nulls. I can remove the columns from the Excel sheet after the fact easily, but it is difficult to "glue" worksheets back together after I have already split them when there are too many columns.
I could do a query to check if each column is null, but this would entail running the costly query all over again, perhaps multiple times, which would make the generation of the spreadsheet take too long.
Is there a way that I can query the ResultSet object that I already have (a little like ColdFusion) and remove columns from it?
EDIT
I ended up adding a pre-processing step where I added the column numbers of the used columns to a List<Integer> and then iterating through that collection rather than the set of all columns in the ResultSet. A few off-by-one errors later, and it works great.
Can you extract the data from the ResultSet and store it in memory first, before creating the worksheet, or is it too large? If so, then while you're extracting it you could remember whether a non-null value has been seen in each column. Once you're done extracting, you know exactly which columns can be omitted. Of course this doesn't work so well if the amount of data is so large that you wouldn't want to store it in memory.
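A small sketch of that single-pass idea, assuming the data fits in memory; the usedColumns list mirrors the asker's List<Integer> of used column numbers:

import java.sql.*;
import java.util.*;

public class NonNullColumns {
    // Buffers all rows and records which columns ever held a non-null value.
    public static List<Object[]> extract(ResultSet rs, List<Integer> usedColumns)
            throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        int cols = meta.getColumnCount();
        boolean[] hasValue = new boolean[cols + 1]; // indices 1..cols, like JDBC
        List<Object[]> rows = new ArrayList<>();
        while (rs.next()) {
            Object[] row = new Object[cols + 1];
            for (int c = 1; c <= cols; c++) {
                row[c] = rs.getObject(c);
                if (row[c] != null) {
                    hasValue[c] = true;
                }
            }
            rows.add(row);
        }
        for (int c = 1; c <= cols; c++) {
            if (hasValue[c]) {
                usedColumns.add(c); // only these columns go into the worksheet
            }
        }
        return rows;
    }
}
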
Another solution would be to store the results of the costly query in a "results" table in the database. Each row for a given query execution would get stamped with a "query id" taken from a database sequence. Once the data is loaded into this table, subsequent queries to check whether "all values in column X are null" should be pretty speedy.
Note: if you're going to take this second approach, don't pull all the query data up to your application before storing it back to the results table. Rewrite the original "costly" query to do the insert. "insert into query_result(columns...) select {costly query}".
I could do a query to check if each column is null
Better still, you could incorporate that check into the original query, via a COUNT etc. - COUNT(colX) counts only non-null values, so a count of zero means the column can be dropped. This will be miles quicker than writing Java code to the same effect.

Hibernate displaytag big lists

I'm using displaytag to build tables with data from my DB. This works well if the requested list isn't that big, but if the list grows over 2500 entries, fetching the result list takes very long (more than 5 min). I was wondering if this behavior is normal.
How do you handle big lists / queries which return big results?
This article links to an example app of how to go about solving the problem. Displaytag expects to be passed a full dataset to create paging links and handle sorting. This kind of breaks the idea of paging externally on the data and fetching only those rows that are asked for (as the user pages to them). The project linked in the article describes how to go about setting this type of thing up.
If you're working with a large database, you could also have a problem executing your query. I assume you have ruled this out. If not, you have the SQL as mentioned earlier - I would run it through the DB2 query analyzer to see if there are any DB bottlenecks. The next step up the chain is to run a test of the Hibernate/DAO call in a unit test without displaytag in the mix. Again, from how you've worded things, it sounds like you've already done this.
Displaytag hauls and stores everything in memory (the session), and Hibernate does that too. You don't want to have the entire DB table's contents in memory at once (although if the slowdown already begins at 2500 rows, it looks more like a matter of a badly optimized SQL query / DB table; 2500 rows should be peanuts for a decent DB, but OK, that's another story).
Rather, create the HTML table yourself with a little help from JSTL's c:forEach and a shot of EL. Keep one or two request parameters in the background in an input type="hidden": the first row to be displayed (firstrow) and possibly the number of rows to be displayed at once (rowcount).
Then, in your DAO class, just do a SELECT stuff FROM data LIMIT rowcount OFFSET firstrow, or something like that depending on the DB used. In MySQL and PostgreSQL you can use the LIMIT and/or OFFSET clauses like that. In Oracle you'll need to fire a subquery. In MSSQL and DB2 you'll need to create a stored procedure. You can do that with HQL.
Then, to page through the table, just have a bunch of buttons which instruct the server-side code to in/decrement firstrow by rowcount each time. Just do the math.
Edit: you commented that you're using DB2. I've done a bit of research and it appears that you can use DB2's UDB OLAP function ROW_NUMBER() for this:
SELECT id, colA, colB, colC
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS row, id, colA, colB, colC
    FROM data
) AS temp_data
WHERE row BETWEEN 1 AND 10;
This example should return the first 10 rows from the data table. You can parameterize this query so that you can reuse it for every page. This is more efficient than querying the entire table in Java's memory. Also ensure that the table is properly indexed.
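For completeness, a parameterized version of that query from Java might look like this sketch (row numbers are 1-based; the table and column names are taken from the example above):

import java.sql.*;

public class PagedQuery {
    // Fetches one page: rows firstRow .. firstRow + rowCount - 1.
    public static void printPage(Connection conn, int firstRow, int rowCount)
            throws SQLException {
        String sql = "SELECT id, colA, colB, colC FROM ("
                   + "  SELECT ROW_NUMBER() OVER (ORDER BY id) AS row, id, colA, colB, colC"
                   + "  FROM data"
                   + ") AS temp_data WHERE row BETWEEN ? AND ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, firstRow);
            ps.setInt(2, firstRow + rowCount - 1);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + "\t" + rs.getString("colA"));
                }
            }
        }
    }
}

The firstrow/rowcount request parameters described earlier map directly onto the two bind variables, so one prepared statement serves every page.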
