Currently working on the deployment of an OFBiz-based ERP, we've run into the following problem: some of the framework's code calls resultSet.last() to find the total number of rows in the result set. With the Oracle JDBC driver v11 and v10, that call tries to cache all of the rows in client memory, crashing the JVM because it doesn't have enough heap space.
After researching, the problem seems to be that the Oracle JDBC driver implements scrollable cursors on the client side, instead of on the server, by means of a cache. Using the DataDirect driver solves that issue, but then the call to resultSet.last() takes too long to complete, so the application server aborts the transaction.
Is there any way to implement scrollable cursors via JDBC in Oracle without resorting to the DataDirect driver?
And what is the fastest way to know the length of a given ResultSet?
Thanks in advance
Ismael
"what is the fastest way to know the length of a given resultSet"
The ONLY way to really know is to count them all. You want to know how many 'SMITH's are in the phone book. You count them.
If it is a small result set, and quickly arrived at, it is not a problem. E.g. there won't be many Gandalfs in the phone book, and you probably want to get them all anyway.
If it is a large result set, you might be able to do an estimate, though that's not generally something that SQL is well-designed for.
To avoid caching the entire result set on the client, you can try
select id, count(1) over () n from junk;
Then each row will have an extra column (in this case n) with the count of rows in the result set. But it will still take the same amount of time to arrive at the count, so there's still a strong chance of a timeout.
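To make the analytic approach concrete, here is a minimal JDBC sketch (it reuses the junk table from the query above; the method name and processing are mine) that reads the count from the first row instead of scrolling to last():

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: read the analytic row count instead of calling resultSet.last().
static void readWithCount(Connection conn) throws SQLException {
    String sql = "select id, count(1) over () n from junk";
    try (PreparedStatement ps = conn.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        int total = -1;
        while (rs.next()) {
            if (total < 0) {
                total = rs.getInt("n"); // same value repeated on every row
            }
            long id = rs.getLong("id");
            // process the row here; no scrolling needed to learn the count
        }
    }
}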
A compromise is get the first hundred (or thousand) rows, and don't worry about the pagination beyond that.
Your proposed "workaround" with COUNT basically doubles the work done by the DB server: it must first walk through everything to count the rows, and then do the same again to return the results. Much better is the method Gary mentioned (count(*) over () - analytics). But even there the whole result set must be created before the first output is returned to the client, so it is potentially slow and memory-consuming for large outputs.
The best way, in my opinion, is to select only the page you want on the screen (+1 row to determine that a next page exists), e.g. rows 21 to 41. And have another button (use case) to count them all in the (rare) case someone needs it.
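As a minimal sketch of that page-plus-one idea (Oracle 12c+ row-limiting syntax; the table, columns, and method name are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: fetch one page plus one extra row to detect whether a next page exists.
static void fetchPage(Connection conn, int page, int pageSize) throws SQLException {
    String sql = "select id, name from customers order by id "
               + "offset ? rows fetch next ? rows only";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, page * pageSize);
        ps.setInt(2, pageSize + 1); // the +1 row only probes for a next page
        try (ResultSet rs = ps.executeQuery()) {
            int shown = 0;
            boolean hasNext = false;
            while (rs.next()) {
                if (++shown > pageSize) { hasNext = true; break; }
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
            System.out.println("next page exists: " + hasNext);
        }
    }
}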
There are a lot of different tutorials across the internet about pagination with JDBC and about iterating over huge result sets.
So, basically there are a number of approaches I've found so far:
Vendor-specific SQL
Scrollable result set (?)
Holding a plain result set in memory and mapping the rows only when necessary (using fetchSize)
The result set fetch size, either set explicitly, or by default equal to the statement fetch size that was passed to it, determines the number of rows that are retrieved in any subsequent trips to the database for that result set. This includes any trips that are still required to complete the original query, as well as any refetching of data into the result set. Data can be refetched, either explicitly or implicitly, to update a scroll-sensitive or scroll-insensitive/updatable result set.
Cursor (?)
Custom seek method paging implemented by jOOQ (a plain-JDBC sketch of the idea follows this list)
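The seek method mentioned in the last item can be sketched in plain JDBC as well, assuming a unique, indexed sort key (the table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of keyset ("seek") pagination: remember the last key of the previous
// page and ask only for rows strictly after it (Oracle 12c+ syntax).
static void pageAfter(Connection conn, long lastSeenId, int pageSize) throws SQLException {
    String sql = "select id, name from customers "
               + "where id > ? order by id fetch first ? rows only";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, lastSeenId);
        ps.setInt(2, pageSize);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
}

Unlike OFFSET-style paging, the cost of fetching a page this way does not grow with the page number.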
Sorry for the mess, but I need someone to clear this up for me.
I have a simple task where a service consumer asks for results with a pageNumber and pageSize. It looks like I have two options:
Use vendor-specific SQL
Hold the connection/statement/result set in memory and rely on the JDBC fetchSize
In the latter case I use rxjava-jdbc, and if you look at the producer implementation, it holds the result set; then all you do is call request(long n) and another n rows are processed. Of course everything is hidden under the Observable sugar of RxJava. What I don't like about this approach is that you have to hold the ResultSet between different service calls, and you have to clear that ResultSet if the client forgets to exhaust or close it. (Note: the ResultSet here is the Java ResultSet class, not the actual data.)
So, what is the recommended way of doing pagination? Is vendor-specific SQL considered slow compared to holding the connection?
I am using Oracle; a scrollable ResultSet is not recommended with huge result sets, as it caches the whole result set's data on the client side (proof).
Keeping resources open for an indefinite time is a bad thing in general. The database will, for example, create a cursor for you to obtain the fetched rows. That cursor and other resources will be kept open until you close the result set. The more queries you do in parallel the more resources will be occupied and at some point the database will reject further requests due to an exhausted resource pool (e.g. there is a limited number of cursors, that can be opened at a time).
Hibernate, for example, uses vendor-specific SQL to fetch a "page", and I would do it just like that.
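On Oracle versions before 12c, such a vendor-specific page query is typically a nested ROWNUM wrapper, roughly of this shape (a sketch, not Hibernate's literal output; the names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of the classic pre-12c ROWNUM pagination wrapper (page is 0-based).
static void rownumPage(Connection conn, int page, int pageSize) throws SQLException {
    String sql =
        "select * from ("
      + "  select q.*, rownum rn from ("
      + "    select id, name from customers order by id"
      + "  ) q where rownum <= ?"
      + ") where rn > ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, (page + 1) * pageSize); // upper row bound
        ps.setInt(2, page * pageSize);       // lower row bound
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getLong("id") + " " + rs.getString("name"));
            }
        }
    }
}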
There are many approaches because there are many different use cases.
Do you actually expect users to fetch every page of the result set? Or are they more likely to fetch the first page or two and try something else if the data they're interested in isn't there. If you are Google, for example, you can be pretty confident that people will look at results from the first page, a small number will look at results from the second page, and a tiny fraction of results will come from the third page. It makes perfect sense in that case to use vendor-specific code to request a page of data and only run that for the next page when the user asks for it. If you expect the user to fetch the last page of the result, on the other hand, running a separate query for each page is going to be more expensive than running a single query and doing multiple fetches.
How long do users need to keep the queries open? How many concurrent users? If you're building an internal application that dozens of users will have access to and you expect users to keep cursors open for a few minutes, that might be reasonable. If you are trying to build an application that will have thousands of users that will be paging through a result over a span of hours, keeping resources allocated is a bad idea. If your users are really machines that are going to fetch data and process it in a loop as quickly as possible, a single ResultSet with multiple fetches makes far more sense.
How important is it that no row is missed, every row is seen exactly once, and the results are consistent across pages? Multiple fetches from a single cursor guarantee that every row in the result is seen exactly once. Separate paginated queries might not: new data could have been added or removed between queries, your sort might not be fully deterministic, etc.
A scrollable ResultSet caches results on the client side, which requires memory. But PostgreSQL, for example, does this by default and nobody complains; some databases simply use the client's memory to hold the whole result set. In most cases the database has to process much more data to re-evaluate the query.
Also, you usually have many more clients than database instances.
Also note that query re-execution (using ROWNUM), as implemented by Hibernate, does not guarantee correct (consistent) results if data are modified between executions and the default isolation level is used.
It really depends on the use case. Changing Oracle's init parameters for the maximum number of connections and open cursors requires a database restart.
So scrollable ResultSets and cursors can be used only when you can predict the number of (concurrent) users.
I work with a very large, enterprise application written in Java which queries an Oracle SQL database. We use JavaScript on the front end, and are always looking for ways to improve upon the performance of the application with increased use.
The issue we're having right now is that we are sending a query, via Java, that results in 39,000 records. This is putting a significant load on the server and causing the browser to hang. I should mention that the data is relatively static (it changes only about once a year) and we could use an XML map or something similar (a flat file), since we know the exact results that will be returned each time.
The query, however, is still taking 1.5 - 2 minutes to load, which is unacceptable. I wanted to see if there were any suggestions as to how this scenario can be optimized, especially if it can be done any quicker with JavaScript (or jQuery) and using AJAX for the db connection. Or, are we going about this problem all wrong?
You want to determine if the slowness is due to:
the query executing in the database
the network being slow in returning 39k records
the JavaScript working with the 39k records after the AJAX call completes
If you can run the query in SQL*Plus or Toad, this will eliminate the web tier and network altogether. If it is slow there, then tune the query by checking indexes.
If, after adding the appropriate indexes, the query is still slow, then you could prebuild the query's results and store them in a table, or you could create a materialized view.
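If the data really changes only about once a year, a materialized view is a simple way to prebuild the result. A sketch, with a placeholder view name and query, executed via JDBC:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: precompute the slow query once instead of running it live.
// The view name and the underlying query are placeholders.
static void createSnapshot(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("create materialized view mv_report_snapshot "
                 + "refresh complete on demand as "
                 + "select id, name, category from big_report_table");
    }
}

The application then selects from mv_report_snapshot, and the view is refreshed on demand whenever the underlying data changes.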
Once you have the query performing well from SQL*Plus, add the network back into the equation. Run it from your web browser and see what overhead is being added.
If it is still slow, then you need to determine whether the problem is the act of AJAXing the data or whether the slowness occurs after the page does something with the data (i.e. populating a data grid via JavaScript).
If the slowness is because the browser is waiting for the data, then you want to make sure it's only ever fetched once. You can do this by setting the cache headers in the AJAX request to cache the result for a year, or you can store the results in localStorage.
If the slowness is due to the browser working with the 39k rows (i.e. moving the data into a data grid), then you have a few options.
find a better approach or library
use pagination
You may find performance issues in each of these areas. Most likely the query just needs to be tuned; adding indexes, or pre-querying the data and storing it, will solve the problem.
Another thing to consider is whether you really need 39k rows at one time. If you can, paginate at the DB level so you're returning 100 rows per page.
If I have a Java application that performs some inserts against a database, is there an easy way to get how many bytes were committed (i.e. the summed size of all the data in all the fields), without having to calculate it manually or fetch and check the size of the result set?
--
As lucho points out, implementing a statistics-aware statement class on top of PreparedStatement might be the way to go. Going to stick with that and see how well it works.
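A minimal sketch of what such a statistics-aware wrapper might look like (the class, its accounting rules, and the byte sizes are all assumptions; a real wrapper would delegate every PreparedStatement method):

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: count the approximate bytes bound to a PreparedStatement.
// Only two setters are shown; the accounting is an assumption, not a
// measurement of what the driver actually sends over the wire.
class CountingStatement {
    private final PreparedStatement delegate;
    private long boundBytes;

    CountingStatement(Connection conn, String sql) throws SQLException {
        this.delegate = conn.prepareStatement(sql);
    }

    void setString(int idx, String v) throws SQLException {
        delegate.setString(idx, v);
        boundBytes += v == null ? 0 : v.getBytes(StandardCharsets.UTF_8).length;
    }

    void setLong(int idx, long v) throws SQLException {
        delegate.setLong(idx, v);
        boundBytes += 8;
    }

    int executeUpdate() throws SQLException {
        return delegate.executeUpdate();
    }

    long boundBytes() {
        return boundBytes;
    }
}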
As far as I know, nope.
You'll have to ask your database that question; perhaps it's possible to do it without querying the same thing you inserted (because that sounds a bit pointless).
Interesting problem. I like lucho's solution, but I have two quicker (hackier) options:
You can try to use InnoDB's SHOW TABLE STATUS and keep a running log of the data size. That would let you know, but on my development machine calling it on one database takes 5.3s (56 tables), so unless you only want the data for one or two tables it's probably too slow (not to mention whatever locking it may incur).
You could monitor the DB process and use the OS to tell you how much it's writing. I know Windows can tell you this, and I'm pretty sure Linux can as well. But if you host 3 databases you'll only get the total, and it will be off some due to transactions and such.
Just random ideas.
I have a SELECT query with a lot of IF conditions, which I can do either in the query itself (using the DB machine's CPU) or in my Java code (using the server machine's CPU).
Is there any preferred approach here (putting the conditions in the DB vs. in the mid-tier)?
UPDATE: My query is a join of more than 2 tables, and I am using a left join to combine them; some rows will have a corresponding row in the 2nd table and some will not. I need to have some default value for those columns when there is no corresponding row in the 2nd table.
SELECT CASE WHEN t2.col1 IS NULL
            THEN 'default'
            ELSE t2.col1
       END
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
If it's really something that the DB cannot do any faster than the app server, and which actually reduces the load on the DB server if moved to the app server, then I'd move it to the app server.
The reason: if you reach the limits of your hardware, it's much easier to have multiple app servers than to have a clustered database.
However, the second condition above should be tested thoroughly: many things will not reduce (or even increase) the DB load if moved away from the DB.
Update: For the kind of thing you need, I doubt whether the first condition is satisfied - have you tested it? A simple CASE is completely insignificant, unless the condition or the branches contain some very expensive calculations.
Yes, though I would suggest another approach, one that adds no load to the app server and minimal load to the DBMS. It's a little hard to answer the question since you haven't provided a concrete example but I'll give it a shot.
My preferred solution is to get rid of the if conditions totally if you can. At a bare minimum, you can re-jig your database schema to move the cost of calculation away from the select (which happens a lot) and into the insert/update (which happens less often).
That's the normal case, I have seen databases that write more frequently than read, but they're the exception rather than the rule.
By way of example, let's say you store person information and you want to get a list of people whose first name is more than 5 characters long. Don't ask why, I'm the customer, you have to give me what I want :-)
Rather than a monstrous select statement to (possibly) split apart the name and count the characters in it, do that as an insert/update trigger when the data enters the table - that's the only time when the value can change after all.
Put that calculation in another column (indexed) and use that in your select. The cost of the calculation is amortised over all the selects, which will be blindingly fast.
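For the first-name example, the trigger-plus-indexed-column idea might look like this (all table, column, trigger, and index names are made up; installed here via JDBC):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: maintain the name length in an indexed column at write time,
// so the select only compares an integer.
static void installNameLengthColumn(Connection conn) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("alter table person add (first_name_len number)");
        st.execute("create or replace trigger trg_person_name_len "
                 + "before insert or update of first_name on person "
                 + "for each row "
                 + "begin "
                 + "  :new.first_name_len := length(:new.first_name); "
                 + "end;");
        st.execute("create index idx_person_name_len on person (first_name_len)");
    }
}

The select then becomes a plain indexed comparison: select first_name from person where first_name_len > 5.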
It will take up more storage space but, if you compare the number of database "how can I make this faster?" questions against the number of "how can I use less space?" questions, you'll find the former greatly outweigh the latter.
And, yes, it does mean you store redundant data but the triggers mitigate the possibility of losing ACID properties. It's okay to bend rules if you know the possible consequences and how best to avoid them.
Based on your update, you should put the workload on to the machine where it causes the least impact. That may be the DBMS, it may be the app server, it may even be on the client side (of the app server) itself since that would distribute the cost across a lot of machines rather than concentrating it at a single point.
You should measure, not guess! Set up realistic performance test systems along with realistic production-quality data, then try the different approaches. That's the only real way to be certain.
I am connecting to an Oracle DB through a Java program. The problem is that I am getting an OutOfMemoryError because the SQL returns 3 million records. I cannot increase the JVM heap size for some reason.
What is the best solution to solve this?
Is the only option to run the SQL with LIMIT?
If your program needs to return 3 mil records at once, you're doing something wrong. What do you need to do that requires processing 3 mil records at once?
You can either split the query into smaller ones using LIMIT, or rethink what you need to do to reduce the amount of data you need to process.
In my opinion it is pointless to have queries that return 3 million records. What would you do with them? There is no point in presenting them to the user, and if you want to do some calculations it is better to run several queries that each return considerably fewer records.
Using LIMIT is one solution, but a better solution would be to restructure your database and application so that you can have "smarter" queries that do not return everything in one go. For example, you could return records based on a date column, so that you get the most recent ones.
Application scaling is always an issue. The solution here will be to do whatever you are trying to do in Java as a stored procedure in Oracle PL/SQL. Let Oracle process the data and use its internal query planner to limit the amount of data flowing in and out, which would otherwise cause major latencies.
You can even write the stored procedure in Java.
A second solution is indeed to make a limited query and process it from several Java nodes, collating the results. Look up map-reduce.
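The several-nodes idea can be sketched as simple key-range partitioning, where each worker processes one slice and the partial results are collated afterwards (all names and the range scheme here are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: one worker node processes one key range, so no node ever holds
// the full 3 million rows at once.
static long sumRange(Connection conn, long fromId, long toId) throws SQLException {
    String sql = "select amount from orders where id >= ? and id < ?";
    long sum = 0;
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, fromId);
        ps.setLong(2, toId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                sum += rs.getLong("amount"); // per-node partial result
            }
        }
    }
    return sum; // partial results from all nodes are combined by a collator
}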
If each record is around 1 kilobyte, that means 3 GB of data; do you have that amount of memory available for your application?
It would be better if you explained the "real" problem, since OutOfMemory is not your actual problem.
Try this:
http://w3schools.com/sql/sql_where.asp
There could be three possible solutions:
1. If retrieving 3 million records at once is not necessary, use LIMIT.
2. Consider using a meaningful WHERE clause.
3. Export the database entries into txt, csv, or Excel format with the tool that Oracle provides, and use that file.
Cheers :-)
Reconsider your WHERE clause; see if you can make it more restrictive.
and/or
Use LIMIT.
Just for reference: in Oracle queries, the equivalent of LIMIT is ROWNUM.
E.g., ... WHERE ROWNUM <= 1000
If you get that large a response, take care to process the result set row by row so the full result does not need to be in memory. If you do that properly, you can process enormous data sets without problems.
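In plain JDBC that row-by-row processing looks roughly like this (the fetch size and all names are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: stream a huge result set; only about fetchSize rows are held
// in client memory at any one time.
static void streamAll(Connection conn) throws SQLException {
    String sql = "select id, payload from big_table";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setFetchSize(500); // rows per round trip, not a total limit
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                long id = rs.getLong("id");
                String payload = rs.getString("payload");
                // handle one row here; it can be garbage-collected afterwards
            }
        }
    }
}

The key point is to use a forward-only result set and never ask for the total up front; the driver then only buffers one fetch batch at a time.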