In my application, I perform a costly query that takes minutes to produce a report. I am trying to make a generic class that transforms a ResultSet to an Excel spreadsheet, where a column is excluded from the spreadsheet if it only contains nulls. I can remove the columns from the Excel sheet after the fact easily, but it is difficult to "glue" worksheets back together after I have already split them when there are too many columns.
I could do a query to check if each column is null, but this would entail running the costly query all over again, perhaps multiple times, which would make the generation of the spreadsheet take too long.
Is there a way that I can query the ResultSet object that I already have (a little like ColdFusion) and remove columns from it?
EDIT
I ended up adding a pre-processing step where I added the column numbers of the used columns to a List<Integer> and then iterated through that collection rather than the set of all columns in the ResultSet. A few off-by-one errors later, and it works great.
Can you extract the data from the ResultSet and store it in memory first, before creating the worksheet, or is it too large? If it fits, then while you're extracting it you can remember whether a non-null value has been seen in each column; once you're done extracting, you know exactly which columns can be omitted. Of course, this doesn't work so well if the amount of data is too large to hold in memory.
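A minimal sketch of that bookkeeping in plain JDBC (nothing assumed beyond the ResultSet you already have; the POI calls are left out):

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Buffer the rows once, remembering per column whether any non-null
// value has been seen; afterwards you know exactly which columns to omit.
static void writeNonNullColumns(ResultSet rs) throws SQLException {
    ResultSetMetaData meta = rs.getMetaData();
    int cols = meta.getColumnCount();
    BitSet hasValue = new BitSet(cols + 1);   // 1-based, like JDBC
    List<Object[]> rows = new ArrayList<>();
    while (rs.next()) {
        Object[] row = new Object[cols + 1];
        for (int c = 1; c <= cols; c++) {
            row[c] = rs.getObject(c);
            if (row[c] != null) {
                hasValue.set(c);
            }
        }
        rows.add(row);
    }
    for (Object[] row : rows) {
        for (int c = 1; c <= cols; c++) {
            if (hasValue.get(c)) {
                // write row[c] to the worksheet (POI calls omitted here)
            }
        }
    }
}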
Another solution would be to store the results of the costly query in a "results" table in the database. Each row for a given query execution would get stamped with a "query id" taken from a database sequence. Once the data is loaded into this table, subsequent queries to check whether "all values in column X are null" should be pretty speedy.
Note: if you're going to take this second approach, don't pull all the query data up to your application before storing it back to the results table. Rewrite the original "costly" query to do the insert. "insert into query_result(columns...) select {costly query}".
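A rough sketch of that in JDBC; the query_result table, the column names, and the query id handling are all assumptions, and the comment stands in for the real costly query:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Run the costly query once, entirely server-side, into the results table.
static void storeResults(Connection conn, long queryId) throws SQLException {
    String insert =
        "INSERT INTO query_result (query_id, col_a, col_b) "
      + "SELECT ?, col_a, col_b FROM ( /* costly query */ ) q";
    try (PreparedStatement ps = conn.prepareStatement(insert)) {
        ps.setLong(1, queryId);
        ps.executeUpdate();
    }
    // Cheap follow-up: COUNT(col) counts only non-null values, so a zero
    // means that column was all nulls for this run.
    String check = "SELECT COUNT(col_a), COUNT(col_b) "
                 + "FROM query_result WHERE query_id = ?";
    try (PreparedStatement ps = conn.prepareStatement(check)) {
        ps.setLong(1, queryId);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                boolean colAEmpty = rs.getLong(1) == 0; // omit col_a if true
                boolean colBEmpty = rs.getLong(2) == 0; // omit col_b if true
            }
        }
    }
}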
"I could do a query to check if each column is null"
Better still, you could incorporate that check into the original query, via a COUNT etc. This will be miles quicker than writing Java code to the same effect.
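For instance, assuming a database that supports window functions, the check could ride along with the data in a single pass (column names are placeholders):

// COUNT(col) OVER () appends each column's non-null count to every row,
// so the very first row you read already tells you which columns to drop.
String sql =
    "SELECT q.*, "
  + "       COUNT(col_a) OVER () AS col_a_nonnull, "
  + "       COUNT(col_b) OVER () AS col_b_nonnull "
  + "FROM ( /* costly query */ ) q";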
Related
I have a problem regarding the ResultSet of a large database (MySQL, Java 1.7).
The task is to perform a transformation of all the entries of one column into another database.
(e.g. divide every number by three and write them into another database)
As the database contains about 70 columns and a few million rows, my first approach would have been to do a SELECT * and parse the ResultSet by columns.
Unfortunately I found no way to parse it this way, as the designated way is to go through it row by row (while (rs.next()) { ... } etc.).
I don't like this approach, as it would create 70 large arrays; I would rather have only one at a time to reduce memory usage.
So here are my main questions:
Is there a way to parse the ResultSet column by column?
Should I create a query for every column and parse them (one array at a time, but 70 queries), or
should I just get the whole ResultSet and parse it row by row, writing the values into 70 arrays?
Greetings and thanks in advance!
Why not just page your queries? Pull out 'n' rows at a time, perform the transformation, and then write them into the new database.
This means you don't pull everything up in one query/iteration and then write the whole lot in one go, and you don't have the inefficiencies of working row-by-row.
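A sketch of that loop (invented table and column names, MySQL-style LIMIT/OFFSET assumed):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Read a page, transform it, batch-write it to the target DB, repeat.
static void transformInPages(Connection src, Connection dst, int pageSize)
        throws SQLException {
    String select = "SELECT id, val FROM source_table ORDER BY id LIMIT ? OFFSET ?";
    String insert = "INSERT INTO target_table (id, val) VALUES (?, ?)";
    try (PreparedStatement sel = src.prepareStatement(select);
         PreparedStatement ins = dst.prepareStatement(insert)) {
        for (int offset = 0; ; offset += pageSize) {
            sel.setInt(1, pageSize);
            sel.setInt(2, offset);
            int fetched = 0;
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    fetched++;
                    ins.setLong(1, rs.getLong("id"));
                    ins.setDouble(2, rs.getDouble("val") / 3); // the transformation
                    ins.addBatch();
                }
            }
            ins.executeBatch();
            if (fetched < pageSize) {
                break; // last page reached
            }
        }
    }
}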
My other comment is that perhaps this is premature optimisation. Have you tried loading the whole dataset and seeing how much memory it takes? If it's of the order of tens or even hundreds of megabytes, I would expect the JVM to handle that easily.
I'm assuming your transformation needs to be done in Java. If you can possibly do it in SQL, then doing it entirely within the database is likely to be even more efficient.
Why not do it with MySQL only?
Use this query:
CREATE TABLE <new_table> AS SELECT <column_to_transform> / 3 FROM <source_table>;
There's a DB that contains approximately 300-400 records. I can make a simple query for fetching 30 records like:
SELECT * FROM table
WHERE isValidated = false
LIMIT 30
Some more words about the content of the DB table. There's a column named isValidated that can (as you correctly guessed) take one of two values: true or false. After a query, some of the records should be marked validated (isValidated=true); it's approximately 5-6 records from each bunch of 30. Consequently, each subsequent query will fetch some of the same (isValidated=false) records returned by the previous one, and in fact I'll never get to the end of the table with this approach.
The validation process is made with Java + Hibernate. I'm new to Hibernate, so I use Criterion for making this simple query.
Are there any best practices for such a task? The variant of adding a flag-field (that marks records which were fetched already) is inappropriate (over-engineering for this DB).
Maybe there's an opportunity to create some virtual table where records that were already processed will be stored or something like this. BTW, after all the records are processed, it is planned to start processing them again (it is possible, that some of them need to be validated).
Thank you for your help in advance.
I can imagine several solutions:
store everything in memory. You only have 400 records, and it could be a perfectly fine solution given this small number
use an order by clause (which you should do anyway) on a unique column (the PK, for example), store the ID of the last loaded record, and make sure the next query uses where ID > :lastId (see the sketch below)
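A plain-JDBC sketch of that second option, with assumed table and column names (the same idea works in Hibernate with Restrictions.gt on the id):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Remember the id of the last record you processed and ask only for
// rows beyond it; reset lastId to 0 to start over from the top.
static long validateNextBatch(Connection conn, long lastId) throws SQLException {
    String sql = "SELECT id, payload FROM records "
               + "WHERE isValidated = false AND id > ? "
               + "ORDER BY id LIMIT 30";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, lastId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lastId = rs.getLong("id");
                // validate the record here, flipping isValidated as needed
            }
        }
    }
    return lastId; // feed back in for the next batch
}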
If we use the LIMIT clause in a query which also has an ORDER BY clause, and execute the query via JDBC, will there be any effect on performance? (using a MySQL database)
Example:
SELECT modelName FROM Cars ORDER BY manuDate DESC LIMIT 1
I read in one of the threads on this forum that, by default, a set number of rows is fetched at a time. How can I find the default fetch size?
I want only one record. Originally, I was using as follows:
SQL Query:
SELECT modelName FROM Cars ORDER BY manuDate DESC
In the JAVA code, I was extracting as follows:
if (resultSet.next()) {
    //do something here.
}
Definitely the LIMIT 1 will have a positive effect on performance. Instead of the entire (well, depending on the default fetch size) data set of matches being returned from the DB server to the Java code, only one row will be returned. This saves a lot of network bandwidth and Java memory usage.
Always delegate constraints like LIMIT, ORDER, WHERE, etc. to the SQL language as much as possible instead of doing it on the Java side. The DB will do it much better than your Java code ever can (if the table is properly indexed, of course). You should try to write the SQL query so that it returns exactly the information you need.
The only disadvantage of writing DB-specific SQL queries is that the SQL language is not entirely portable among different DB servers, which would require you to change the queries every time you switch DB servers. But in the real world it's very rare to move to a completely different DB make anyway, and externalizing the SQL strings to XML or properties files helps a lot.
There are two ways the LIMIT could speed things up:
by producing less data, which means less data gets sent over the wire and processed by the JDBC client
by potentially having MySQL itself look at fewer rows
The second one of those depends on how MySQL can produce the ordering. If you don't have an index on manuDate, MySQL will have to fetch all the rows from Cars, then order them, then give you the first one. But if there's an index on manuDate, MySQL can just look at the first entry in that index, fetch the appropriate row, and that's it. (If the index also contains modelName, MySQL doesn't even need to fetch the row after it looks at the index -- it's a covering index.)
With all that said, watch out! If manuDate isn't unique, the ordering is only partially deterministic (the order for all rows with the same manuDate is undefined), and your LIMIT 1 therefore doesn't have a single correct answer. For instance, if you switch storage engines, you might start getting different results.
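If you control the schema, a covering index plus an explicit tie-breaker addresses both points; a sketch with assumed index and column names:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// One-time DDL, so the index covers the query:
//   CREATE INDEX idx_cars_manudate ON Cars (manuDate, modelName);
static String newestModel(Connection conn) throws SQLException {
    // The extra ORDER BY column makes the LIMIT 1 answer deterministic
    // even when several cars share the same manuDate.
    String sql = "SELECT modelName FROM Cars "
               + "ORDER BY manuDate DESC, modelName LIMIT 1";
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(sql)) {
        return rs.next() ? rs.getString("modelName") : null;
    }
}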
I'm trying to retrieve some most occurring values from a SQLite table containing a few hundreds of millions of rows.
The query so far may look like this:
SELECT value, COUNT(value) AS count FROM table GROUP BY value ORDER BY count DESC LIMIT 10
There is an index on the value field.
However, with the ORDER BY clause, the query takes so much time I've never seen the end of it.
What could be done to drastically improve such queries on such a large amount of data?
I tried to add a HAVING clause (e.g: HAVING count > 100000) to lower the number of rows to be sorted, without success.
Note that I don't care much about the time required for insertion (it still needs to be reasonable, but priority is given to selection), so I'm open to solutions involving computation at insertion time ...
Thanks in advance,
1) create a new table where you'll store one row per unique "value" and the "count", put a descending index on the count column
2) add a trigger to the original table, where you maintain this new table (insert and update) as necessary to increment/decrement the count.
3) run your query off this new table, which will run fast because of the descending count index
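A sketch of steps 1 and 2 for SQLite, executed through JDBC; the table, column, and trigger names are invented, and the original table is called data here:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Side table of per-value counters, kept in sync by triggers.
static void installCountTriggers(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("CREATE TABLE IF NOT EXISTS value_counts ("
                 + " value TEXT PRIMARY KEY, cnt INTEGER NOT NULL)");
        st.execute("CREATE INDEX IF NOT EXISTS idx_counts_desc"
                 + " ON value_counts (cnt DESC)");
        st.execute("CREATE TRIGGER IF NOT EXISTS trg_count_ins"
                 + " AFTER INSERT ON data BEGIN"
                 + "  INSERT OR IGNORE INTO value_counts VALUES (NEW.value, 0);"
                 + "  UPDATE value_counts SET cnt = cnt + 1 WHERE value = NEW.value;"
                 + " END");
        st.execute("CREATE TRIGGER IF NOT EXISTS trg_count_del"
                 + " AFTER DELETE ON data BEGIN"
                 + "  UPDATE value_counts SET cnt = cnt - 1 WHERE value = OLD.value;"
                 + " END");
    }
}
// The top-10 query then becomes a simple index walk:
//   SELECT value, cnt FROM value_counts ORDER BY cnt DESC LIMIT 10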
This query forces a look at every row in the table; that is what is taking the time.
I almost never recommend this, but in this case, you could maintain the count in a denormalized fashion in an external table.
Place the value and count into another table during insert, update, and delete via triggers.
I'm using displaytag to build tables with data from my db. This works well if the requested list isn't that big, but if the list size grows over 2500 entries, fetching the result list takes very long (more than 5 minutes). I was wondering if this behavior is normal.
How do you handle big lists / queries which return big results?
This article links to an example app of how to go about solving the problem. Displaytag expects to be passed a full dataset to create paging links and handle sorting. This kind of breaks the idea of paging externally on the data and fetching only those rows that are asked for (as the user pages to them). The project linked in the article describes how to go about setting this type of thing up.
If you're working with a large database, you could also have a problem executing your query. I assume you have ruled this out. If not, you have the SQL as mentioned earlier - I would run it through the DB2 query analyzer to see if there are any DB bottlenecks. The next step up the chain is to run a test of the Hibernate/DAO call in a unit test without displaytag in the mix. Again, from how you've worded things, it sounds like you've already done this.
Displaytag hauls and stores everything in memory (the session). Hibernate also does that. You don't want to have the entire DB table contents in memory at once (although if the slowdown already begins at 2500 rows, it looks more like a matter of a badly optimized SQL query / DB table; 2500 rows should be peanuts for a decent DB, but OK, that's another story).
Rather, create the HTML table yourself with a little help from JSTL's c:forEach and a shot of EL. Keep one or two request parameters in the background in an input type="hidden": the first row to be displayed (firstrow) and optionally the number of rows to be displayed at once (rowcount).
Then, in your DAO class, just do a SELECT stuff FROM data LIMIT rowcount OFFSET firstrow or something like that, depending on the DB used. In MySQL and PostgreSQL you can use the LIMIT and OFFSET clauses like that; in Oracle you'll need a subquery; in MSSQL and DB2 you'll need to create a stored procedure. You can do that with HQL as well.
Then, to page through the table, just have a bunch of buttons which instruct the server-side code to increment/decrement firstrow by rowcount each time. Just do the math.
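The math is only a couple of lines; a sketch (1-based firstrow, matching the DB2 query below):

// Paging arithmetic behind the next/previous buttons.
static int nextPage(int firstrow, int rowcount) {
    return firstrow + rowcount;
}
static int prevPage(int firstrow, int rowcount) {
    return Math.max(1, firstrow - rowcount); // never before the first row
}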
Edit: you commented that you're using DB2. I've done a bit of research and it appears that you can use the DB2 UDB OLAP function ROW_NUMBER() for this:
SELECT id, colA, colB, colC
FROM (
SELECT
ROW_NUMBER() OVER (ORDER BY id) AS row, id, colA, colB, colC
FROM
data
) AS temp_data
WHERE
row BETWEEN 1 AND 10;
This example should return the first 10 rows from the data table. You can parameterize this query so that you can reuse it for every page. This is more efficient than querying the entire table in Java's memory. Also ensure that the table is properly indexed.
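A parameterized sketch in plain JDBC, so one PreparedStatement serves every page (the mapping to your row bean is left out):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// firstrow is 1-based, as in the BETWEEN clause above.
static List<String> fetchPage(Connection conn, int firstrow, int rowcount)
        throws SQLException {
    String sql = "SELECT id, colA, colB, colC FROM ("
               + " SELECT ROW_NUMBER() OVER (ORDER BY id) AS row,"
               + "        id, colA, colB, colC FROM data"
               + ") AS temp_data WHERE row BETWEEN ? AND ?";
    List<String> page = new ArrayList<>();
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, firstrow);
        ps.setInt(2, firstrow + rowcount - 1);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                page.add(rs.getString("colA")); // map to your row bean here
            }
        }
    }
    return page;
}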