Counting rows before proceeding to actual searching - Java

Given a web app (Java, Spring, Hibernate, and Sybase as the DB) with several different search screens, say 5, I want to first check whether the search result for the user's criteria would exceed a limit, say 1000 rows. Huge results, well past 1000 rows, can occur even when the user provides reasonable filters and criteria.
Is doing it this way recommended:
select count(*) from table -- clauses, etc. here
then, if the count is > 1000, skip the actual search and show a limit error (telling the user to refine the search);
otherwise, run the actual search and return the result set to the user.
Or is there a better solution to handle this?
If this is the way to go, my follow-up question would be: how can we avoid duplicating the SQL query? As I understand it, this approach requires declaring the same search SQL twice, except that one version's select clause contains only count(*).
UPDATES
Additionally, I want to avoid 2 things:
1. the cost of executing the actual SQL
2. the loading/mapping of the domain objects by the ORM (Hibernate in this case)
Both 1 and 2 are avoided when I detect that the count is > 1000.

I wouldn't run a COUNT(*) at all; just run the query with a LIMIT of 1001. To do the COUNT, the database likely has to generate the same result set anyway, so at best the second query hits the cache, and at worst everything is recalculated. Either way, you're doing the same work twice.
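With JPA/Hibernate, this pattern amounts to calling `query.setMaxResults(limit + 1)` and then checking how many rows came back. The check itself, sketched here against a plain list so it stands alone (class and method names are illustrative, not from the original post):

```java
import java.util.Collections;
import java.util.List;

public class OverflowCheck {
    static final int LIMIT = 1000;

    /**
     * Decide whether a result list fetched with maxResults = LIMIT + 1
     * exceeded the limit. Returns null when the caller should show a
     * "refine your search" error instead of results.
     */
    static List<String> capOrReject(List<String> fetched) {
        if (fetched.size() > LIMIT) {
            return null; // more than LIMIT rows exist: reject
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String> small = Collections.nCopies(10, "row");
        List<String> huge = Collections.nCopies(LIMIT + 1, "row");
        System.out.println(capOrReject(small).size()); // 10
        System.out.println(capOrReject(huge) == null); // true
    }
}
```

Because only limit + 1 rows are ever fetched, both the full query execution and the ORM's entity mapping for the oversized result are avoided, which addresses points 1 and 2 in the question.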

We followed the same procedure in our application, and yes, the only difference will be placing count(1) instead of * in the SQL.
However, be aware that on occasion the count query is the one that takes more time than fetching a subset of the results.

Depending on how you are retrieving the rows from the result set, you could simply filter the results at that level, i.e.:
int rowIndex = 0;
while (rs.next() && rowIndex < 1000) {
    // ... extract results
    rowIndex++;
}
You may want to warn the user that their result set has been trimmed, though ;)

Related

Creating report from 1 million + records in MySQL and display in Java JSP page

I am working on a MySQL database with 3 tables: workout_data, excercises, and sets. I'm facing issues generating reports based on these three tables.
For more context: a number of sets make up an excercise, and a number of excercises make up a workout.
I already have the metrics from which a report is to be generated out of the data in these tables. I have to generate reports for the past 42 days, including this week. The queries joining these tables run for a long time before I get the report.
For example, the sets table has more than 1 million records just for the past 42 days. The excercise_id in this table references the id of the excercises table, and the id of the excercises table is the workout_id in the workout_data table.
I'm running this query, and it takes more than 10 minutes to return the data. I have to prepare a report and show it to the user in the browser, but because of this long-running query the web page times out and the user never sees the report.
Any advice on how to achieve this?
SELECT REPORTSETS.USER_ID, REPORTSETS.WORKOUT_LOG_ID,
       REPORTSETS.SET_DATE, REPORTSETS.EXCERCISE_ID, REPORTSETS.SET_NUMBER
FROM EXCERCISES
INNER JOIN REPORTSETS ON EXCERCISES.ID = REPORTSETS.EXCERCISE_ID
WHERE user_id = (SELECT id FROM users WHERE email = 'testuser1#gmail.com')
  AND substr(set_date, 1, 10) = '2013-10-29'
GROUP BY REPORTSETS.USER_ID, REPORTSETS.WORKOUT_LOG_ID,
         REPORTSETS.SET_DATE, REPORTSETS.EXCERCISE_ID, REPORTSETS.SET_NUMBER
Two things:
First, you have the following WHERE clause item to pull out a single day's data:
AND substr(set_date,1,10)='2013-10-29'
This defeats the use of any index on the date. If your set_date column has a DATETIME datatype, what you want is
AND set_date >= '2013-10-29'
AND set_date < '2013-10-29' + INTERVAL 1 DAY
This allows a range scan on an index on set_date. It looks to me like you might want a compound index on (user_id, set_date), but you should experiment with EXPLAIN to figure out whether that's right.
Second, you're misusing GROUP BY. That clause is pointless unless you have some kind of summary function like SUM() or GROUP_CONCAT() in your query. Do you want ORDER BY?
Comments on your SQL that you might want to look into:
1) Do you have an index on USER_ID and SET_DATE?
2) Your datatype for SET_DATE looks wrong; is it a varchar? Storing it as a date means the DB can optimise your search much more efficiently. At the moment the substring function is called countless times per query, as it has to run for every row matched by the first part of your WHERE clause.
3) Is the GROUP BY really required? Unless I'm missing something, the GROUP BY part of the statement brings nothing to the table ;)
It should make a significant difference if you could store the date either as a date, or in the format you need to make the comparison. Performing a substr() call on every date must be time consuming.
The suggestions about tuning the query will certainly help improve its speed. But I think the main point here is what can be done with more than 1 million records before the session times out. What if you have 2 or 3 million records; will some performance tuning solve the problem? I don't think so. So:
1) If you want to display in the browser, use pagination and query (for example) only the first 100 records.
2) If you want to generate a report (like a PDF), then use an asynchronous method (JMS).
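For suggestion 1, the only arithmetic needed is mapping a page number to a row offset (which you would then pass to JPA's `setFirstResult`/`setMaxResults` or a LIMIT/OFFSET clause) and deriving the page count from the total. A minimal sketch, with illustrative names:

```java
public class PageWindow {
    /** First row offset (0-based) for a 1-based page number. */
    static int firstResult(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    /** Total pages needed to show totalRows rows, rounding up. */
    static int totalPages(long totalRows, int pageSize) {
        return (int) ((totalRows + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        System.out.println(firstResult(3, 100));       // 200
        System.out.println(totalPages(1_000_001, 100)); // 10001
    }
}
```

With page sizes of 100, even a million-row report only ever pulls 100 rows per request, which keeps each page load well under any browser timeout.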

ResultSet.next very slow only when query contains FIRST_ROWS or ROWNUM restriction

I execute a native query using
Query query = entityManager.createNativeQuery(sqlQuery);
query.setMaxResults(maxResults);
List<Object[]> resultList = query.getResultList();
To speed up the query, I thought of including the FIRST_ROWS(n) hint or limiting using WHERE ROWNUM <= n.
Using instrumentation, I see that OraclePreparedStatement.executeQuery is indeed faster, but much more time is spent in EJBQueryImpl.getResultList, leading to overall very poor performance. Looking at it in more detail, I see that every 10th call of ResultSet.next() takes about as long as executeQuery() itself. This strange behaviour stops when I leave out the query hint or the ROWNUM condition; then every 10th call of ResultSet.next() is still somewhat slower than the others, but only by 2 ms instead of 3 seconds.
Do you get different query plans when you include the hint? My assumption is that you do based on your description of the problem.
When you execute a query in Oracle, the database does not generally materialize the entire result set at any point in time (obviously, it may have to if you specify an ORDER BY clause that requires all the data to be materialized before the sort happens). Oracle doesn't actually start materializing data until the client starts fetching data. It runs enough of the query to generate however many rows the client has asked to fetch (which it sounds like is 10 in your case), returns those results to the client, and waits for the client to request more data before continuing to process the query.
It sounds like when the FIRST_ROWS hint is included, the query plan is changing in a way that makes it more expensive to execute. Obviously, that's not the goal of the FIRST_ROWS hint. The goal is to tell the optimizer to generate a plan that makes fetching the first N rows more efficient even if it makes fetching all the rows from the query less efficient. That tends to cause the optimizer to favor things like index scans over table scans where a table scan might be more efficient overall. It sounds like in your case, however, the optimizer's estimates are incorrect and it ends up picking a plan that is just generally less efficient. That frequently implies that some of the statistics on some of the objects your query is referencing are incomplete or incorrect.
Sounds like you made JDBC executeQuery faster but JDBC ResultSet.next slower: you made executing the query faster but fetching the data slower. This seems to be a JDBC issue, not an EclipseLink one; you would get the same result through raw JDBC if you actually fetched the data.
10 is the default fetch size, so you could try setting it to something bigger.
See,
http://www.eclipse.org/eclipselink/api/2.3/org/eclipse/persistence/config/QueryHints.html#JDBC_FETCH_SIZE
Try adding the max-rows limit to the SQL directly instead of using setMaxResults, i.e., add where rownum < maxResults to the SQL string. EclipseLink will use rownum in the query for max rows when it generates the SQL, but since you are using your own SQL, it falls back to limiting rows via the result set.
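For illustration, wrapping your own SQL string in a ROWNUM filter (this is the general pattern, not EclipseLink's exact generated text) can be as simple as:

```java
public class RownumWrapper {
    /** Wrap a SELECT so Oracle stops producing rows past maxRows. */
    static String limitWithRownum(String sql, int maxRows) {
        return "SELECT * FROM (" + sql + ") WHERE ROWNUM <= " + maxRows;
    }

    public static void main(String[] args) {
        String sql = "SELECT id, name FROM emp ORDER BY id";
        System.out.println(limitWithRownum(sql, 50));
        // SELECT * FROM (SELECT id, name FROM emp ORDER BY id) WHERE ROWNUM <= 50
    }
}
```

The inner query must carry the ORDER BY so that ROWNUM is applied after sorting; applying ROWNUM and ORDER BY at the same level would number the rows before they are sorted.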

How to deal with pagination for a large, non materializable resultset in Oracle and Java

I'm dealing with a problem here. A web application built on Java calls a stored procedure in Oracle which has some varchars as OUT parameters, plus a parameter whose type is a ref cursor returning a record type (both explicitly defined).
The content of the ref cursor is gathered by a complex query which, I'd guess, runs in O(n) in the number of records in a table.
The idea is to paginate the result on the server, because getting all the data causes a long delay (500 records take about 40-50 seconds due to the calculation and the join resolution). I've already rebuilt the query using row_number():
open out_rcvar for
  SELECT *
  FROM (SELECT a, b, c, ..., row_number() OVER (ORDER BY f, g) rn
        FROM t1, t2, ...
        WHERE some_conditions)
  WHERE rn BETWEEN initial_row AND final_row
  ORDER BY rn;
in order to avoid the offset-limit approach (and its Oracle equivalent). But here's the catch: the user wants a pagination menu like
[first || <<5previous || 1 2 3 4 5 || next5>> || last ]
and knowing the total number of rows implies counting (hence, querying) the whole result set, taking the whole 50 seconds. What approach could I use here?
Thanks in advance for your help.
EDIT: The long query cannot be set up as a materialized view, because the data in the records must be up to date as it is requested (the web app does some operations with the data and needs to know whether the selected item is "available" or "sold").
You could do something like:
SELECT *
FROM (SELECT COUNT(*) OVER () AS total_rows, a, b, c, ...,
             row_number() OVER (ORDER BY f, g) rn
      FROM t1, t2, ...
      WHERE some_conditions)
WHERE rn BETWEEN initial_row AND final_row
ORDER BY rn;
(Note the analytic COUNT(*) OVER (): a plain COUNT(*) alongside non-aggregated columns would require a GROUP BY.)
This is probably inefficient given your description, but if you find some quicker way to calculate the total number of rows, you could stick it in the inner select and return it with every row. It's not great, but it works, and it's a single select (as opposed to one query for the total row count and a second for the actual rows).
What is the performance if you do not select any columns but just a count to determine the number of rows? If that is acceptable, use it as a guide to build the pagination.
Otherwise, without knowing the count we have no way to build the set of page numbers (1, 2, 3, 4, 5).
The other option is to not show the page numbers at all, but just next and previous.
Just my thoughts.
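Once a total count is available, building the 5-page window from the menu in the question is pure arithmetic. A sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PageMenu {
    /** Pages to display: the block of up to 5 page numbers containing current. */
    static List<Integer> window(int current, int totalPages) {
        int start = ((current - 1) / 5) * 5 + 1; // blocks: 1-5, 6-10, 11-15, ...
        List<Integer> pages = new ArrayList<>();
        for (int p = start; p <= Math.min(start + 4, totalPages); p++) {
            pages.add(p);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(window(3, 12));  // [1, 2, 3, 4, 5]
        System.out.println(window(7, 12));  // [6, 7, 8, 9, 10]
        System.out.println(window(12, 12)); // [11, 12]
    }
}
```

The "<<5previous" and "next5>>" controls then just jump to start - 1 and start + 5 respectively, clamped to the valid page range.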
Perhaps you might consider creating a temporary table. You could store your results there and then run some paging mechanism over it. This way the computation is done once; after that you are only selecting the data, which will be pretty fast.
There is one catch in this approach: you have to ensure that you do not break the session, since temporary tables are private and exist only for your session.

JDO on GoogleAppEngine: How to count and group with BigTable

I need to collect some statistics on my entities in the datastore.
For example, I need to know how many objects of a kind I have, how
many objects have a property set to a particular value, etc.
In a usual relational DBMS I could use
SELECT COUNT(*) ... WHERE property=<some value>
or
SELECT MAX(...), ... GROUP BY property
etc.
But here I cannot see any of these constructs.
Moreover, I cannot load all the objects into memory (e.g. using
pm.getExtent(MyCall.class, false)), as I have too many entities (more
than 100k).
Do you know any trick to achieve my goal?
Actually, it depends on your specific requirements.
That said, a common way is to prepare this stats data in the background.
For example, you can run a few tasks, using the Queue service, that issue a query like select x where x.property == <some value>, together with a cursor and a running sum variable. At the first step the cursor is empty and the sum is zero. Then you iterate over the query result for up to 1000 items (the query limit) or 9 minutes (the task limit), incrementing the sum at every step; if you're not finished, you enqueue the next step of the task with the new cursor and sum values. The cursor is easily serializable into a string.
At the final step, you save the resulting value somewhere in a stats results table.
Take a look at:
task queues - http://code.google.com/intl/en/appengine/docs/java/taskqueue/
cursor - http://code.google.com/intl/en/appengine/docs/java/datastore/queries.html#Query_Cursors
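The resumable count loop described above, with the datastore query and its cursor stood in by a plain list and an integer offset (the real GAE task-queue and cursor plumbing is omitted; names are illustrative), might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardedCount {
    static final int BATCH = 1000; // per-task query limit

    /**
     * One "task" step: scan at most BATCH items starting at the cursor,
     * incrementing the running sum for each match. Returns the new cursor;
     * the caller re-enqueues the task until the cursor reaches the end.
     */
    static int step(List<Integer> items, int cursor, int wanted, long[] sum) {
        int end = Math.min(cursor + BATCH, items.size());
        for (int i = cursor; i < end; i++) {
            if (items.get(i) == wanted) sum[0]++;
        }
        return end;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 2500; i++) data.add(i % 3);

        long[] sum = {0}; // stands in for the sum carried between tasks
        int cursor = 0;
        while (cursor < data.size()) { // each iteration = one queued task
            cursor = step(data, cursor, 0, sum);
        }
        System.out.println(sum[0]); // 834
    }
}
```

In the real version, the cursor and sum would be serialized into the next task's payload instead of living in local variables, so the count survives the per-task limits.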
Also, this stats/aggregation stuff really depends on your actual task/requirements/project; there are a few ways to accomplish it, each optimal for different tasks. There is no standard way, as there is in SQL.
Support for aggregate functions is limited on GAE. This is primarily an artifact of the schema-less nature of BigTable. The alternative is to maintain the aggregate values as separate fields yourself so you can access them quickly.
To do a count, you could do something like this:
Query q = em.createQuery("SELECT count(p) FROM your.package.Class p");
Integer i = (Integer) q.getSingleResult();
but this will probably return at most 1000, since GAE limits the number of rows fetched to 1000.
Some helpful reading on how to work around these issues:
http://marceloverdijk.blogspot.com/2009/06/google-app-engine-datastore-doubts.html
Is there a way to do aggregate functions on Google App Engine?

How to optimize retrieval of most occurring values (hundreds of millions of rows)

I'm trying to retrieve the most frequently occurring values from a SQLite table containing a few hundred million rows.
The query so far looks like this:
SELECT value, COUNT(value) AS count FROM table GROUP BY value ORDER BY count DESC LIMIT 10
There is an index on the value field.
However, with the ORDER BY clause, the query takes so long that I've never seen it finish.
What could be done to drastically improve such queries on such a big amount of data?
I tried adding a HAVING clause (e.g. HAVING count > 100000) to lower the number of rows to be sorted, without success.
Note that I don't care much about the time required for insertion (it still needs to be reasonable, but priority is given to selection), so I'm open to solutions that do the computation at insertion time...
Thanks in advance,
1) Create a new table where you store one row per unique value along with its count, and put a descending index on the count column.
2) Add a trigger to the original table that maintains this new table (insert and update) as necessary, incrementing/decrementing the count.
3) Run your query off this new table; it will run fast because of the descending count index.
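The bookkeeping the triggers maintain can be sketched in memory: one counter per distinct value, updated on insert/delete, with the top-n query reading only the counters. This is a sketch of the logic only (the real implementation would be SQLite triggers plus the descending index); names are illustrative:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ValueCounts {
    private final Map<String, Long> counts = new HashMap<>();

    /** What the INSERT trigger does: bump the value's count. */
    void onInsert(String value) { counts.merge(value, 1L, Long::sum); }

    /** What the DELETE trigger does: decrement, dropping zeroed rows. */
    void onDelete(String value) {
        counts.computeIfPresent(value, (v, c) -> c > 1 ? c - 1 : null);
    }

    /** The now-fast query: top-n values by count, descending. */
    List<String> top(int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ValueCounts vc = new ValueCounts();
        for (int i = 0; i < 5; i++) vc.onInsert("a");
        for (int i = 0; i < 3; i++) vc.onInsert("b");
        vc.onInsert("c");
        vc.onDelete("c");
        System.out.println(vc.top(2)); // [a, b]
    }
}
```

The point of the design is that the expensive GROUP BY over hundreds of millions of rows is replaced by a scan over only the distinct values, paid for incrementally at write time.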
This query forces a look at every row in the table; that is what is taking the time.
I almost never recommend this, but in this case you could maintain the count in a denormalized fashion in an external table:
place the value and count into another table, maintained during insert, update, and delete via triggers.
