How to handle huge result sets from database - java

I'm designing a multi-tiered, database-driven web application: SQL relational database, Java for the middle service tier, web for the UI. (The exact language doesn't really matter.)
The middle service tier performs the actual querying of the database. The UI simply asks for certain data and has no concept that it's backed by a database.
The question is: how do I handle large data sets? The UI asks for data, but the results might be huge, possibly too big to fit in memory. For example, a street sign application might have a service layer of:
StreetSign getStreetSign(int identifier)
Collection<StreetSign> getStreetSigns(Street street)
Collection<StreetSign> getStreetSigns(LatLonBox box)
The UI layer asks to get all street signs meeting some criteria. Depending on the criteria, the result set might be huge. The UI layer might divide the results into separate pages (for a browser) or just present them all (when serving up to Google Earth). The potentially huge result set could be a performance and resource problem (out of memory).
One solution is not to return fully loaded objects (StreetSign objects). Rather, return some sort of result set or iterator that lazily loads each individual object.
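A minimal sketch of that lazy approach over plain JDBC (the StreetSign constructor and the column names are assumptions): the iterator materializes one row per next() call, so only one object is held in memory at a time.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch: stream StreetSign rows lazily from an open JDBC ResultSet.
public class StreetSignIterator implements Iterator<StreetSign> {
    private final ResultSet rs;
    private boolean advanced; // true if we already called rs.next()
    private boolean hasRow;

    public StreetSignIterator(ResultSet rs) {
        this.rs = rs;
    }

    @Override
    public boolean hasNext() {
        try {
            if (!advanced) {
                hasRow = rs.next(); // look at most one row ahead
                advanced = true;
            }
            return hasRow;
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public StreetSign next() {
        if (!hasNext()) throw new NoSuchElementException();
        advanced = false;
        try {
            // map only the current row into an object (columns assumed)
            return new StreetSign(rs.getInt("id"), rs.getString("name"));
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }
}

The trade-off is that the underlying Connection and Statement must stay open while the caller iterates, which is the same resource concern raised in the JDBC pagination discussion further down.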
Another solution is to change the service API to return a subset of the requested data:
Collection<StreetSign> getStreetSigns(LatLonBox box, int pageNumber, int resultsPerPage)
Of course the UI can still request a huge result set:
getStreetSigns(box, 1, 1000000000)
I'm curious what is the standard industry design pattern for this scenario?

The very first question should be:
Does the user need, and is the user even capable of managing, this amount of data?
Although the result set should be paged, if its potential size is that huge the answer will probably be "no", so the UI shouldn't try to show it.
I worked on J2EE projects for health care systems that dealt with enormous amounts of stored data, literally millions of patients, visits, forms, etc., and the general rule was not to show more than 100 or 200 rows for any user search, advising the user that that set of criteria produces more information than they can digest.
The way to implement this varies from one project to another: it is possible to force the UI to ask the service tier for the size of a query before launching it, or it is possible to throw an exception from the service tier if the result set grows too much (though this couples the service tier to the limits of one particular UI).
Be careful! This does not mean that every method on the service tier must throw an exception if its result contains more than 100 rows; the general rule applies only to result sets that are shown to the user directly, which is all the more reason to place the control in the UI rather than in the service tier.
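A sketch of the "ask for the size first" variant (all names here, SignService, SignView, and the 200-row cap, are assumptions for illustration):

// All names here (SignService, SignView, LatLonBox) are hypothetical stand-ins.
class SignController {
    private static final int MAX_DISPLAY_ROWS = 200; // project convention, not a standard

    private final SignService service;
    private final SignView view;

    SignController(SignService service, SignView view) {
        this.service = service;
        this.view = view;
    }

    void showSigns(LatLonBox box) {
        long total = service.countStreetSigns(box); // cheap COUNT(*) first
        if (total > MAX_DISPLAY_ROWS) {
            // tell the user the criteria are too broad instead of rendering the list
            view.showTooManyResultsWarning(total, MAX_DISPLAY_ROWS);
        } else {
            view.render(service.getStreetSigns(box));
        }
    }
}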

The most frequent pattern I've seen for this situation is some sort of paging, usually done server-side to reduce the amount of information sent over the wire.
Here's a SQL Server 2000 example using a table variable (generally faster than a temp table) together with your street signs example:
CREATE PROCEDURE GetPagedStreetSigns
(
    @Page int = 1,
    @PageSize int = 10
)
AS
SET NOCOUNT ON
-- This memory-variable table will control paging
DECLARE @TempTable TABLE (RowNumber int identity, StreetSignId int)
INSERT INTO @TempTable
(
    StreetSignId
)
SELECT [Id]
FROM StreetSign
ORDER BY [Id]
-- select only those rows belonging to the requested page
SELECT SS.*
FROM StreetSign SS
INNER JOIN @TempTable TT ON TT.StreetSignId = SS.[Id]
WHERE TT.RowNumber BETWEEN ((@Page - 1) * @PageSize + 1)
      AND (@Page * @PageSize)
In SQL Server 2005, you can get more clever with stuff like Common Table Expressions and the new SQL Ranking functions. But the general theme is that you use the server to return only the information belonging to the current page.
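Driven from Java, a hedged sketch of that 2005+ flavour using the ROW_NUMBER() ranking function (connection, page, and pageSize are assumed to be in scope; table and column names as in the example above):

// Page through StreetSign using ROW_NUMBER() (SQL Server 2005+).
String sql =
    "SELECT * FROM ("
  + "  SELECT SS.*, ROW_NUMBER() OVER (ORDER BY SS.[Id]) AS RowNum"
  + "  FROM StreetSign SS"
  + ") AS Numbered "
  + "WHERE RowNum BETWEEN ? AND ?";

try (PreparedStatement ps = connection.prepareStatement(sql)) {
    int first = (page - 1) * pageSize + 1;
    ps.setInt(1, first);
    ps.setInt(2, first + pageSize - 1);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the current row...
        }
    }
}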
Be aware that this approach can get messy if you're allowing the end-user to apply on-the-fly filters to the data that s/he's seeing.

I would say if the potential exists for a large set of data, then go the paging route.
You can still set a MAX that you do not want them to go over.
E.g., SO uses page sizes of 15, 30, 50...
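Enforcing such a MAX in the service tier is a small guard; a sketch with an assumed cap of 50 (queryPage is a hypothetical DAO call):

private static final int MAX_PAGE_SIZE = 50; // assumed cap, tune per project

public Collection<StreetSign> getStreetSigns(LatLonBox box, int pageNumber, int resultsPerPage) {
    int size = Math.min(Math.max(resultsPerPage, 1), MAX_PAGE_SIZE); // clamp to 1..MAX
    return queryPage(box, Math.max(pageNumber, 1), size);            // hypothetical DAO call
}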

One thing to be wary of when working with home-grown row-wrapper classes like you (apparently) have, is code that makes additional calls to the database without you (the developer) being aware of it. For example, you might call a method that returns a collection of Person objects and think that the only thing going on under the hood is a single "SELECT * FROM PERSONS" call. In actuality, the method you're calling might iterate through the returned collection of Person objects and make additional DB calls to populate each Person's Orders collection.
As you say, one of your solutions is to not return fully-loaded objects, so you're probably aware of this potential problem. One of the reasons I tend to avoid using row wrappers is that they invariably make it difficult to tune your application and minimize the size and frequency of database traffic.
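A sketch of the pitfall (all names hypothetical): the innocent-looking loop issues one extra query per person.

// Looks like one query, but may issue 1 + N queries:
List<Person> people = personDao.findAll();   // SELECT * FROM PERSONS
for (Person p : people) {
    render(p.getOrders());                   // hidden call: SELECT * FROM ORDERS WHERE PERSON_ID = ?
}

// One way out: fetch both tables in a single round trip (hypothetical JOIN-based method)
List<Person> peopleWithOrders = personDao.findAllWithOrders();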

In ASP.NET I would use server-side paging, where you only retrieve the page of data the user has requested from the data store. This is opposed to retrieving the entire result set, putting it into memory and paging through it on request.

JSF (JavaServer Faces) has widgets for chunking large result sets to the browser. They can be parameterized as you suggest. I wouldn't call it a "standard industry design pattern" by any means, but it is worth a look to see how someone else solved the problem.

When I deal with this type of issue, I usually chunk the data sent to the browser (or thin/thick client, whichever is more appropriate for your situation), because regardless of the actual total size of the data that meets the criteria, only a small portion is really usable in any UI at one time.
I live in a Microsoft world, so my primary environment is ASP.Net with SQL Server. Here are two articles about paging (which mention some techniques for paging through result sets) that may be helpful:
Paging through lots of data efficiently (and in an Ajax way) with ASP.NET 2.0
Efficient Data Paging with the ASP.NET 2.0 DataList Control and ObjectDataSource
Another mechanism that Microsoft has shipped lately is their idea of "Dynamic Data" - you might be able to check out the guts of this for some guidance as to how they're dealing with this issue.

I've done similar things on two different products. In one case the data source is optionally paginated -- for java, implements a Pageable interface similar to:
public interface Pageable
{
    public void setStartIndex( int index );
    public int getStartIndex();
    public int getRowsPerPage() throws Exception;
    public void setRowsPerPage( int rowsPerPage );
}
The data source implements another method for get() of items, and the implementation of a paginated data source just returns the current page. So you can set your start index, and grab a page in your controller.
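A sketch of that get() side against the interface above (PagedSignSource and the DAO are assumptions, not the original product code):

// Hypothetical paginated data source built on the Pageable interface above.
public class PagedSignSource implements Pageable {
    private final SignDao dao;     // hypothetical DAO
    private int startIndex;
    private int rowsPerPage = 20;

    public PagedSignSource(SignDao dao) { this.dao = dao; }

    public void setStartIndex(int index)        { this.startIndex = index; }
    public int  getStartIndex()                 { return startIndex; }
    public int  getRowsPerPage()                { return rowsPerPage; }
    public void setRowsPerPage(int rowsPerPage) { this.rowsPerPage = rowsPerPage; }

    // the "other method for get() of items": returns only the current page
    public List<StreetSign> get() {
        return dao.findSigns(startIndex, rowsPerPage);
    }
}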
One thing to consider is caching your cursors server-side. For a web app you'll have to expire them, but they can really help performance-wise.

The Fedora digital repository project returns a maximum number of results along with a result-set-id. You then get the rest of the results by asking for the next chunk, supplying the result-set-id in the subsequent query. It works OK as long as you don't want to do any searching or sorting outside of the query.

From the data retrieval layer, the standard design pattern is to have two method interfaces: one for all results and one for a block of a given size.
If you wish, you can layer components that do paging over it.
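In terms of the street-sign example from the question, that two-method shape might look like this sketch:

public interface StreetSignDao {
    // one for all results
    Collection<StreetSign> getStreetSigns(LatLonBox box);
    // one for a block: callers page by advancing startIndex
    Collection<StreetSign> getStreetSigns(LatLonBox box, int startIndex, int blockSize);
}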

Related

A hybrid of cache-based and query-based paging in Hibernate/JPA

If the result set is large, then having the entire result set in memory (in a server cache, e.g. Hazelcast) will not be feasible; with large result sets, you cannot afford to have them in memory. In that case, you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that there will be multiple calls to the database for multiple page requests.
Can anyone suggest how to implement a hybrid of the two approaches?
I haven't put any sample code here since I think the question is more about the logic than specific code. Still, if you need sample code I can post it.
Thanks in advance.
The most effective solution is to use the primary key as a paging criterion. This enables us to rely on first-class constructs like a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process: first retrieve the collection of primary keys, then generate the intervals that identify a proper subset of the data, followed by the actual queries against the data.
This approach is almost as fast as the brute-force version, and the memory consumption is about one tenth. By selecting the appropriate page size for this implementation, you may alter the ratio between execution time and memory consumption. This version is also stateless: it does not keep references to resources the way the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResults.
Effective pagination using Hibernate
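A minimal sketch of that two-step primary-key range approach with JPA (the Item entity, em, page, and pageSize are assumptions):

// Step 1: fetch only the ordered primary keys (small memory footprint).
List<Long> ids = em.createQuery(
        "select i.id from Item i order by i.id", Long.class)
    .getResultList();

// Step 2: turn the requested page into a key interval and run a BETWEEN
// range query; since ids holds every key in order, the interval selects
// exactly that page.
int from = (page - 1) * pageSize;                  // page is 1-based
int to = Math.min(from + pageSize, ids.size());
List<Item> rows = Collections.emptyList();
if (from < to) {
    rows = em.createQuery(
            "select i from Item i where i.id between :lo and :hi order by i.id",
            Item.class)
        .setParameter("lo", ids.get(from))
        .setParameter("hi", ids.get(to - 1))
        .getResultList();
}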

JDBC Pagination: vendor specific sql versus result set fetchSize

There are a lot of different tutorials across the internet about pagination with JDBC and iterating over huge result sets.
So, basically, there are a number of approaches I've found so far:
Vendor-specific SQL
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek-method paging as implemented by jOOQ
Sorry for the mess, but I need someone to clear this up for me.
I have a simple task where a service consumer asks for results with a pageNumber and pageSize. It looks like I have two options:
Use vendor-specific SQL
Hold the connection/statement/result set in memory and rely on the JDBC fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation, it holds the result set; then all you do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls and have to clean it up if the client forgets to exhaust or close it. (Note: the ResultSet here is the java.sql.ResultSet handle, not the actual data.)
So, what is the recommended way of doing pagination? Is vendor-specific SQL considered slow compared to holding the connection?
I am using Oracle; a scrollable result set is not recommended for huge result sets because it caches the whole result set's data on the client side. proof
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor-specific SQL to fetch a "page", and I would do it just like that.
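For Oracle, that vendor-specific page query is essentially a nested ROWNUM filter; a JDBC sketch (table, columns, and the conn/pageNumber/pageSize variables are assumptions):

// Classic Oracle paging: ROWNUM is assigned before ORDER BY is applied,
// so the ordering must happen in the inner query.
String sql =
    "SELECT * FROM ("
  + "  SELECT q.*, ROWNUM rn FROM ("
  + "    SELECT e.* FROM employee e ORDER BY e.id"
  + "  ) q WHERE ROWNUM <= ?"
  + ") WHERE rn > ?";

try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setInt(1, pageNumber * pageSize);        // upper row bound
    ps.setInt(2, (pageNumber - 1) * pageSize);  // lower row bound
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the current row...
        }
    }
}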
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed, every row is seen exactly once, and the results across pages are consistent? Multiple fetches from a single cursor guarantee that every row in the result is seen exactly once. Separate paginated queries might not: new data could have been added or removed between queries, your sort might not be fully deterministic, etc.
A scrollable result set caches results on the client side, which costs memory. But PostgreSQL, for example, does this by default and nobody complains; some databases simply use the client's memory to hold the whole result set. In most cases, by contrast, the database has to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution using ROWNUM, as implemented by Hibernate, does not guarantee consistent results if data is modified between executions and the default isolation level is used.
It really depends on the use case, and changing Oracle's init parameters for the maximum number of connections and open cursors requires a database restart.
So scrollable result sets and cursors can be used only when you can predict the number of (concurrent) users.

Performance Optimization in Java

In Java code I am trying to fetch 3500 rows from the DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement displaying 8 columns from a single table (no joins), using a List to hold the data from the DB and as the source for a data table. I have also looked at the hardware side (RAM capacity, storage, network speed, etc.); it exceeds the minimum requirements comfortably. Can you help me make it quicker? It shouldn't take more than 3 seconds.
Have you implemented proper indexing on your tables? I don't like to ask this, since it is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me is that, when the purpose of the query is to display results, the code can be designed so that the query displays the initial data immediately while it is still loading more. This implies a separate thread for loading and a separate thread for displaying.
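A sketch of that "display while still loading" idea with a producer/consumer pair (Employee, stmt, mapRow, dataTable, and the batch size are all assumptions):

// Producer: stream rows from the DB in batches on a background thread;
// consumer: render each batch as soon as it arrives instead of waiting 15 s.
BlockingQueue<List<Employee>> batches = new LinkedBlockingQueue<>();
ExecutorService pool = Executors.newSingleThreadExecutor();
pool.submit(() -> {
    try (ResultSet rs = stmt.executeQuery("SELECT id, name, dept FROM employee")) {
        List<Employee> batch = new ArrayList<>();
        while (rs.next()) {
            batch.add(mapRow(rs));             // hypothetical row mapper
            if (batch.size() == 200) {         // assumed batch size
                batches.put(batch);
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) batches.put(batch);
    }
    return null;
});

// Display side: append each batch as it becomes available.
List<Employee> next;
while ((next = batches.poll(5, TimeUnit.SECONDS)) != null) {
    dataTable.append(next);                    // hypothetical UI call
}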
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.

Which way (client-side or server-side) to go for pagination / sortable columns?

I have 3000 records in an employee table which I have fetched from my database with a single query. I can show 20 records per page, so there will be 150 pages, each showing 20 records. I have two questions on the pagination and sortable-columns approach:
1) If I implement simple pagination without sortable columns, should I send all 3000 records to the client and do the pagination client-side using JavaScript or jQuery? That way, if the user clicks the second page, no call goes to the server and it will be faster, though I am not sure what the impact of sending 3000 or more records to the browser/client would be. So what is the best approach: send all the records to the client in a single go and page/sort there, or on each page click send a call to the server and return just that page of results?
2) In this scenario, I need to provide pagination along with sortable columns (6 columns). Here the user can click any column, like employee name or department name, and the names should then be arranged in ascending or descending order. Again, I want to know the best approach in terms of response time and memory.
Sending data to your client is almost certainly going to be your bottleneck (especially for mobile clients), so you should always strive to send as little data as possible. With that said, it is almost definitely better to do your pagination on the server side. This is a much more scalable solution. It is likely that the amount of data will grow, so it's a safer bet for the future to just do the pagination on the server.
Also, remember that it is fairly unlikely that any user will actually bother looking through hundreds of result pages, so transferring all the data is likely wasteful as well. This may be a relevant read for you.
I assume you have a bean class representing records in this table, with instances loaded from whatever ORM you have in place.
If you haven't already, you should implement caching of these beans in your application. This can be done locally, perhaps using Guava's CacheBuilder, or remotely using calls to Memcached for example (the latter would be necessary for multiple app servers/load balancing). The cache for these beans should be keyed on a unique id, most likely the mapping to the primary key column of the corresponding table.
Getting to the pagination: simply write your queries to return only IDs of the selected records. Include LIMIT and OFFSET or your DB language's equivalent to paginate. The caller of the query can also filter/sort at will using WHERE, ORDER BY etc.
Back in the Java layer, iterate through the resulting IDs of these queries and build your List of beans by calling the cache. If the cache misses, it will call your ORM to individually query and load that bean. Once the List is built, it can be processed/serialized and sent to the UI.
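A sketch of that cache-plus-ID-query shape using Guava's LoadingCache and a Spring JdbcTemplate (Employee, employeeDao, jdbc, the SQL, and the cache bounds are assumptions):

// Bean cache keyed on primary key; misses fall through to the ORM/DAO.
LoadingCache<Long, Employee> cache = CacheBuilder.newBuilder()
    .maximumSize(10_000)                       // assumed bound
    .expireAfterWrite(10, TimeUnit.MINUTES)    // assumed freshness window
    .build(new CacheLoader<Long, Employee>() {
        @Override public Employee load(Long id) {
            return employeeDao.findById(id);   // hypothetical single-row load
        }
    });

// Pagination query returns IDs only; filtering/sorting stays in SQL.
List<Long> pageIds = jdbc.queryForList(
    "SELECT id FROM employee ORDER BY name LIMIT ? OFFSET ?",
    Long.class, pageSize, (pageNumber - 1) * pageSize);

// Build the page of beans from the cache (loads on miss).
List<Employee> page = new ArrayList<>();
for (Long id : pageIds) {
    page.add(cache.getUnchecked(id));
}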
I know this doesn't directly answer the client vs. server side pagination question, but I would recommend using DataTables.net to both display and paginate your data. It provides a very nice display, allows for sorting and pagination, a built-in search function, and a lot more. The first time I used it was for the first web project I worked on, and I, as a complete newbie, was able to get it to work. The forums also provide very good information and help, and the creator will answer your questions.
DataTables can be used both client-side and server-side, and can support thousands of rows.
As for speed, I only had a few hundred rows, but used the client-side processing and never noticed a delay.
USE SERVER PAGINATION!
Sure, you could probably get away with sending down a JSON array of 3000 elements and using JavaScript to page/sort on the client. But a good web programmer should know how to page and sort records on the server (they should really know a couple of ways), so think of it as good practice. :)
If you want a slick user interface, consider using a JavaScript grid component that uses AJAX to fetch data. Typically, these components pass back the following parameters (or some variant of them):
Start Record Index
Number of Records to Return
Sort Column
Sort Direction
Columns to Fetch (sometimes)
It is up to the developer to implement a handler or interface that returns a result set based on these input parameters.
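A hedged sketch of such a handler (PageResult, jdbc, employeeMapper, and the column whitelist are assumptions); note the whitelist, since an ORDER BY column cannot be bound as a statement parameter:

// Hypothetical server-side handler for an AJAX grid component.
private static final Set<String> ALLOWED_COLUMNS =
    Set.of("id", "name", "department"); // whitelist against ORDER BY injection

public PageResult getEmployees(int startIndex, int count,
                               String sortColumn, boolean ascending) {
    if (!ALLOWED_COLUMNS.contains(sortColumn)) sortColumn = "id";
    String sql = "SELECT id, name, department FROM employee "
               + "ORDER BY " + sortColumn + (ascending ? " ASC" : " DESC")
               + " LIMIT ? OFFSET ?";
    List<Employee> rows = jdbc.query(sql, employeeMapper, count, startIndex);
    int total = jdbc.queryForObject("SELECT COUNT(*) FROM employee", Integer.class);
    return new PageResult(rows, total); // total lets the grid render page links
}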

Strategy for locale sensitive sort with pagination

I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users with a locale different from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms the language-dependent text string based on the current LC_COLLATE setting (more or less the current language) into a transformed string that results in proper collation order in that language if sorted as a binary byte sequence (e.g. strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about this issues: Collation, ICU (which is also used by Java as far as I know).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
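In Java, the counterpart of strxfrm() is java.text.Collator with its CollationKey; a sketch of precomputing a binary sort key at insert time (the locale is an assumption):

import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Precompute a locale-aware binary sort key, analogous to strxfrm().
Collator collator = Collator.getInstance(new Locale("sv", "SE")); // assumed user locale
CollationKey key = collator.getCollationKey("Örjan");
byte[] sortKey = key.toByteArray();
// Store sortKey in a bytea column next to the text; ORDER BY that column then
// yields correct Swedish collation as a plain binary comparison.

You would need one precomputed key column (or index) per locale you want to sort by, mirroring the one-functional-index-per-language idea above.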
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google turns up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the collation the database uses for ORDER BY on a per-query basis, so one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution except showing only the number of results and asking the user to make a more precise request. Otherwise, the server side can handle it, depending on the precise conditions...
In particular, using a cache can improve things tremendously. The first request to the database (unlimited) is not much slower than a query limited in the number of results, and the subsequent requests are much faster. Paging and re-sorting often lead to several requests, so the cache works well (even with a duration of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together: sort first, then page.
The raw results can be kept in the cache.
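A sketch of that flow, sorting the cached raw results per user locale and slicing out one page (cache, dao, Row, and the key names are assumptions):

// Fetch once, cache the raw results, then sort per user locale and slice a page.
List<Row> rows = cache.get(queryKey);          // hypothetical cache lookup
if (rows == null) {
    rows = dao.search(criteria);               // the one expensive, unlimited query
    cache.put(queryKey, rows);
}

Collator collator = Collator.getInstance(userLocale);
List<Row> sorted = new ArrayList<>(rows);      // copy so the shared cached list stays untouched
sorted.sort(Comparator.comparing(Row::getName, collator)); // locale-correct in-memory sort

int from = (page - 1) * pageSize;
List<Row> pageRows = sorted.subList(from, Math.min(from + pageSize, sorted.size()));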
To reduce the performance hit, some hints:
you can run the query once just to get the result-set size, and warn the user if there are too many results (asking either for confirmation of a slow query, or for some additional selection fields)
only request the columns you need and drop all the others (often some data is not shown immediately for every result but displayed on mouse-over, for example; that data can be requested lazily, only as needed, reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the database columns or the computed values
if the same values repeat across multiple results, you can request that data/those columns separately (so you retrieve them from the database once and cache them only once), retrieving only a key (typically an id) in the main request
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work any more, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it; you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hand (as described in the README and INSTALL from the original module), but sorting still works incorrectly. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.
